105. Machine learning for automated tumor classification based on mutation repertoire

Aly Abdelkareem

Matthew Inkman

Matthew Inkman is a Senior Scientist in the Zhang Translational Genomics Laboratory at Washington University School of Medicine. His research focuses on bioinformatics, algorithm and tool development, and the application of computational and machine learning approaches to biological data to advance the understanding of cancer biology, with a particular focus on HPV-associated cancers.

Matthew has contributed to such bioinformatics tools as HPV-EM (HPV detection and genotyping), DANSR (detection of annotated and novel small RNAs), GAiN (data augmentation for DE gene analysis using GANs) and the INTEGRATE suite (detection, visualization and neo-antigen discovery from genomic rearrangements). Current interests include discovery of biomarkers for treatment resistance in cervical cancer, the impact of HPV genomic structures on cancer biology, integration of patient imaging and genomics data into prognostic radiogenomic models, and improving outcomes from immunotherapy.

Matthew is an alumnus of Northwestern University and Caltech and has previously worked as a software developer.


Matthew Inkmanᵃ, Qiao Xuanyuanᵃ, Victoria Tomazᵇ, Rafael Lucas Muniz Guedesᵇ, Michael Watersᵃ, Jin Zhangᵃ, Paulo Campregherᵇ

ᵃWashington University School of Medicine, St. Louis, MO, United States; ᵇHospital Israelita Albert Einstein, São Paulo, Brazil

Introduction: Accurate identification of tumor type and tissue of origin (TT/TO) is a cornerstone of oncologic treatment, yet histological analysis is occasionally inconclusive. As comprehensive genomic profiling has become prevalent in cancer care, a method to identify the TT/TO from tumor mutations would be valuable for clinical practice. Our goal was to develop a machine learning model that identifies TT/TO from the repertoire of mutations restricted to genes present in the TruSight Oncology 500 assay (Illumina) (TSO500).

Material and Methods: Somatic mutation and CNV calls for 32 TCGA cancer types were downloaded from cBioPortal and filtered to retain data on the 522 genes present in TSO500, yielding 1607 input features for 7853 samples. Two classifiers were trained to predict sample origin under 5-fold cross-validation (CV): a DNN using all features, and XGBoost using 500 features selected by recursive feature elimination. 35% of samples were held out as a validation set.
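The splitting scheme described above (a 35% hold-out set plus 5-fold CV on the remaining samples) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function name `stratified_holdout_and_folds`, the per-class stratification, the random seed, and the toy labels are all assumptions, since the abstract does not state how the splits were drawn.

```python
import random
from collections import defaultdict


def stratified_holdout_and_folds(labels, holdout_frac=0.35, n_folds=5, seed=0):
    """Split sample indices into a hold-out set and n_folds CV folds.

    Stratifying by label is an assumption made for this sketch; the
    abstract does not say whether its splits were stratified.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    holdout = []
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_hold = round(len(members) * holdout_frac)
        holdout.extend(members[:n_hold])
        # Deal the remaining samples round-robin across the folds.
        for pos, idx in enumerate(members[n_hold:]):
            folds[pos % n_folds].append(idx)
    return sorted(holdout), [sorted(fold) for fold in folds]


# Toy labels only: three TCGA study codes with 20 hypothetical samples each.
labels = ["BRCA"] * 20 + ["LUAD"] * 20 + ["COAD"] * 20
holdout, folds = stratified_holdout_and_folds(labels)
```

In each CV round, one fold serves as the test set and the other four as training data; the hold-out indices are never used for training or feature selection in this sketch.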

Results: The XGBoost classifier achieved a mean F1 score of 0.92 on the CV test sets and 0.90 on the hold-out validation set, outperforming the DNN, which scored 0.67 and 0.68, respectively. Regional mutational density (RMD) features were the most important inputs. When applied to samples from a set of closely related sites of origin, the TCGA GI tract cancers, the XGBoost classifier achieved an F1 score of 0.95.
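For readers unfamiliar with the metric, a multi-class F1 score can be computed as below. The abstract does not specify how F1 was averaged, so the macro average across classes used here is an assumption, and the function name `macro_f1` and the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three GI tract study codes; not data from the study.
y_true = ["COAD", "COAD", "STAD", "STAD", "ESCA", "ESCA"]
y_pred = ["COAD", "COAD", "STAD", "ESCA", "ESCA", "ESCA"]
score = macro_f1(y_true, y_pred)
```

Macro averaging weights every tumor type equally regardless of sample count, which matters when class sizes differ widely; a sample-weighted average would be dominated by the largest cohorts.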

Conclusion: We have developed a highly accurate machine learning-based model that predicts the TT/TO of 32 tumor types from the mutation repertoire of genes included in TSO500.