47. A Symbolic Regression Approach to Hepatocellular Carcinoma Diagnosis Using Circulating Cell-Free DNA

Rushank Goyal

Rushank Goyal is a high school senior from India conducting research in the field of machine learning-based life sciences, especially cancer genomics and transcriptomics. He has presented his work at numerous fairs and conferences, the most recent being Regeneron’s International Science and Engineering Fair (ISEF) 2022, where he received a third-place award out of 1800+ projects from the American Statistical Association for excellent use of statistical and data science principles in his project. He conducts independent research as part of Betsos and has previously worked under a mentor at the All India Institute of Medical Sciences, Bhopal. His other interests lie in maternal mortality and indigenous Indian ethnomedicine, and he hopes to investigate the translation of biological and technological principles to those fields in the near future. In his free time, he enjoys writing, cycling, and playing badminton.


Rushank Goyal

Betsos, Bhopal, Madhya Pradesh, India

Purpose: Hepatocellular carcinoma is the most common primary liver cancer, accounting for 90% of cases, and is a major cause of death worldwide. Despite this, alpha-fetoprotein tests are the only blood-based diagnostic tools available, and their use is limited by low sensitivity. DNA methylation changes, which are implicated in a majority of cancers, offer an alternative route to diagnosis because they can be measured in circulating cell-free DNA present in blood plasma.

Method: A genetic programming-based symbolic regression approach was applied to gain the benefits of machine learning while avoiding the opacity of ‘black box’ models. The data comprised plasma samples from 36 patients with hepatocellular carcinoma and a control group of 55 individuals, which included patients with and without cirrhosis. The data were split 75-25 into training and test sets before training.
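The 75-25 split described above can be illustrated with a minimal sketch. The function name, the random seed, the plain (non-stratified) shuffle, and the rounding of the test set size upward are all assumptions made for illustration; the abstract does not state how the split was actually performed.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and split them into train and test index lists.

    Hypothetical helper: the seed and the ceil-based test size are assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 36 hepatocellular carcinoma samples (label 1) and 55 controls (label 0),
# as stated in the Method paragraph.
labels = [1] * 36 + [0] * 55

train_idx, test_idx = train_test_split_indices(len(labels))
print(len(train_idx), len(test_idx))
```

Under these assumptions, the 91 samples divide into 68 for training and 23 for testing.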

Results: Symbolic regression produced an equation based on the methylation levels of three biomarkers, which achieved an accuracy of 91.3%, a sensitivity of 100%, and a specificity of 87.5% on the test data. All three biomarkers are differentially methylated between cancerous and non-cancerous samples. This performance matches that of prior research, with the added benefit of transparency.
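As a reminder of how the three reported metrics relate to one another, the sketch below computes them from confusion-matrix counts. The counts are hypothetical: they are not given in the abstract, and were chosen only because they reproduce the reported percentages for an assumed test set of 23 samples (25% of 91, rounded up).

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of cancer cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts, not taken from the paper: 7 cancer samples and
# 16 controls in the test set, with 2 controls misclassified.
acc, sens, spec = diagnostic_metrics(tp=7, fn=0, tn=14, fp=2)
print(f"accuracy={acc:.1%} sensitivity={sens:.1%} specificity={spec:.1%}")
```

With these illustrative counts, a sensitivity of 100% means no cancer sample in the test set was missed, while the specificity of 87.5% corresponds to two false positives among the controls.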

Conclusion: Circulating cell-free DNA presents opportunities for minimally invasive early diagnosis of hepatocellular carcinoma, and utilizing transparent machine learning approaches like symbolic regression can allow accurate diagnosis by combining biological and mathematical principles. Future validation of the model obtained here on a larger and more diverse dataset can reveal the potential for such approaches in cancer diagnosis and open the way for further research.