Forensic Genetics: A Machine Learning Algorithm for Mutation Modelling
DOI:
https://doi.org/10.51126/revsalus.v7isup.999
Keywords:
Y chromosome, mutation, microsatellites, Y-STRs
Abstract
Microsatellites, or short tandem repeats (STRs), are the most widely used markers in population and forensic genetics because of their high polymorphism, which is a consequence of high germinal mutation rates. Mutation modelling has been a topic of intense research, as the proper estimation of mutation rates is crucial for a wide range of problems in forensic genetics. The objective of this work is to obtain a statistical system for mutation modelling that can accommodate as predictors the parental allele length and age, both known to be correlated with the biological mechanism.
Because of their haploid mode of transmission, Y-chromosomal markers provide invaluable insights into germinal mutation modelling, as they allow inferring which parental allele gave rise to which filial one [1]. In contrast, for diploid and haplodiploid markers, hidden mutations can occur and multistep mutations can be misinterpreted as single-step ones, which biases the modelling of the phenomenon [2]. Mutation rates of STRs are known to be correlated with parental sex and age, and with the allele size and sequence of the repetitive motif [3]. Nonetheless, the corresponding estimates are generally computed simply as the marker-specific ratio between the number of Mendelian incompatibilities and the number of transmissions observed. This naïve approach hides the variation in germinal mutation rates within each marker, which depends on the allele and on the sex and age of the individual.
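The contrast between the marker-level ratio and an allele-stratified view can be illustrated with a minimal Python sketch. The function names and the father-son counts are hypothetical and chosen only for illustration; they are not data from the study.

```python
from collections import defaultdict


def naive_rate(mutations, transmissions):
    """Marker-level estimate: incompatibilities divided by transmissions."""
    if transmissions <= 0:
        raise ValueError("need at least one observed transmission")
    return mutations / transmissions


def rates_by_allele(records):
    """Apply the same ratio separately for each paternal allele length.

    records: iterable of (paternal_allele_length, mutated) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # allele -> [mutations, transmissions]
    for allele, mutated in records:
        counts[allele][0] += int(mutated)
        counts[allele][1] += 1
    return {a: naive_rate(m, n) for a, (m, n) in sorted(counts.items())}


# Made-up transmissions for one hypothetical marker (assumption, not study data):
# 500 from allele 10 with no mutation, 500 from allele 13 with 5 mutations.
records = [(10, False)] * 500 + [(13, False)] * 495 + [(13, True)] * 5

pooled = naive_rate(sum(mutated for _, mutated in records), len(records))
print("pooled rate:", pooled)
print("per-allele rates:", rates_by_allele(records))
```

In this toy data set the single pooled figure sits between the two per-allele values, which is the kind of within-marker variation the marker-level ratio cannot show.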
Under the framework of a working commission of the Spanish and Portuguese Speaking Working Group of the International Society for Forensic Genetics (GHEP-ISFG), father-son segregation data for 28 Y-STRs were analyzed and a machine-learning model was developed, in which logistic regression was used to estimate marker-specific mutation rates as a function of paternal age and/or allele length [4]. Both predictors reached statistical significance for three of the 25 markers analyzed, with allele length contributing more than age (5 to 16 times more). Larger subsets of data could be analyzed when allele length was the only predictor, which allowed statistical significance to be reached for 18 of the 28 Y-STRs analyzed. For each case, algebraic expressions were provided for estimating marker-specific mutation rates from paternal age and/or allele length.
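As a rough illustration of what such an algebraic expression looks like, the sketch below evaluates the generic logistic-regression form, rate = 1 / (1 + exp(-(b0 + b1 * allele length + b2 * paternal age))). The function name and all coefficient values are assumptions made up for this example; the fitted coefficients from the study are not reproduced here.

```python
import math


def predicted_mutation_rate(intercept, coef_allele, allele_length,
                            coef_age=0.0, paternal_age=0.0):
    """Evaluate a logistic-regression prediction for one transmission.

    With coef_age left at 0.0 this reduces to the allele-length-only model.
    """
    linear = intercept + coef_allele * allele_length + coef_age * paternal_age
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients, for illustration only (not estimates from the study).
B0, B_ALLELE, B_AGE = -12.0, 0.4, 0.02

for allele in (10, 12, 14):
    rate = predicted_mutation_rate(B0, B_ALLELE, allele, B_AGE, paternal_age=30)
    print(f"allele {allele}, paternal age 30: predicted rate {rate:.5f}")
```

With a positive allele-length coefficient, as assumed here, the predicted rate increases with the number of repeats; the sign and size of each coefficient would in practice come from fitting the model to the segregation data of each marker.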
These results support the use of machine learning algorithms to improve mutation modelling, with statistical significance depending on the data available for training and test sets. As for any other rare event, a very large amount of data is needed to estimate mutation parameters properly. Interlaboratory studies are therefore crucial for producing and gathering substantial amounts of data, alongside publication guidelines that ensure data are released with the proper level of detail. To circumvent the limitation imposed by the scarcity of available data and to increase its potential, in this work we evaluate the possibility of pooling data from different markers that share the same repetitive motif structure to model mutation rates, again considering the parental allele and/or age as predictors.
License
Copyright (c) 2025 RevSALUS - Revista Científica Internacional da Rede Académica das Ciências da Saúde da Lusofonia

This work is published under the Creative Commons Attribution 4.0 International License.
You are free to:
Share — copy and redistribute the material in any medium or format;
Adapt — remix, transform, and build upon the material for any purpose, even commercially.






