Anastasia Smith^a, James Stevenson^b, Kori Kuzma^b, Wesley Goar^b, Matthew Cannon^b, Alex Wagner^b
^a The Ohio State University, Columbus, OH, United States; ^b Nationwide Children’s Hospital, Columbus, OH, United States
Gene symbols, maintained by gene naming authorities such as HGNC, are error-prone when used as identifiers for genes in databases and the biomedical literature. Symbols change over time and may conflict with community aliases for gene loci, leading to potential errors. We investigated the scale of this issue by evaluating the gene symbols and aliases in two authoritative gene sets: NCBI Gene and HGNC. We found 3,940 gene records (2.3%) containing aliases that identically matched the primary symbol of another gene record. For example, KRAS is both the primary symbol for an NCBI Entrez gene (ncbigene:3845) and an alias for the related but distinct RAS-family gene NRAS (ncbigene:4893). As this example shows, intersections between aliases and gene symbols occur in well-classified and frequently referenced genes, making disambiguation a recurring issue and a challenge to resolve. Our analysis illustrates how these findings may affect downstream gene data analyses, including natural language processing and literature curation. To raise awareness of this issue and provide policies for resolving these challenges, we have developed the Gene Normalizer. This resource harmonizes data and improves corroboration for gene records across commonly used resources. The development of the Gene Normalizer is part of a larger effort to improve clinical application workflows that depend on efficient processes and precise genetic information for patient treatment.
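The alias check described above can be illustrated with a minimal Python sketch. It is not the Gene Normalizer implementation: the record layout, the function name `find_alias_symbol_collisions`, and the `example:1` record are assumptions made for illustration, and only the KRAS/NRAS overlap is taken from the abstract. Matching is exact, following the "identically matched" criterion.

```python
from collections import defaultdict

# Toy records in a hypothetical layout. Only the KRAS/NRAS overlap is taken
# from the abstract; "example:1" is an invented placeholder record.
RECORDS = [
    {"id": "ncbigene:3845", "symbol": "KRAS", "aliases": []},
    {"id": "ncbigene:4893", "symbol": "NRAS", "aliases": ["KRAS"]},
    {"id": "example:1", "symbol": "GENEA", "aliases": ["GA1"]},
]


def find_alias_symbol_collisions(records):
    """Return (record_id, alias, other_record_id) for every alias that
    exactly matches the primary symbol of a different record."""
    # Index primary symbols so each alias lookup is a dictionary access.
    symbol_to_ids = defaultdict(set)
    for record in records:
        symbol_to_ids[record["symbol"]].add(record["id"])

    collisions = []
    for record in records:
        for alias in record["aliases"]:
            # Ignore the record's own primary symbol, if it is also listed as an alias.
            other_ids = symbol_to_ids.get(alias, set()) - {record["id"]}
            for other_id in sorted(other_ids):
                collisions.append((record["id"], alias, other_id))
    return collisions


collisions = find_alias_symbol_collisions(RECORDS)
for record_id, alias, other_id in collisions:
    print(f"{record_id}: alias {alias} is the primary symbol of {other_id}")
```

In this toy set the only reported collision is the NRAS record's alias KRAS, which matches the primary symbol of ncbigene:3845. A real analysis would load the full NCBI Gene and HGNC record sets in place of `RECORDS`.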