James Stevenson (a), Kori Kuzma (a), Matthew Cannon (a), Susanna Kiwala (b), Jason Walker (b), Jeremy Warner (c), Obi L. Griffith (b), Malachi Griffith (b), Alex Wagner (a)
(a) Nationwide Children’s Hospital, Columbus, OH, USA; (b) Washington University, St. Louis, MO, USA; (c) Vanderbilt University, Nashville, TN, USA
TheraPy is a Python software package that constructs searchable, normalized concepts for drugs and other therapeutics to assist in the harmonization of biomedical knowledge. Because a therapy may be referred to by brand or generic names, experimental terms, or database IDs, it is challenging to match concepts across different sources. TheraPy maps each of these differing referents to a stable concept identifier, enabling integration of evidence extracted from literature or from curated sources such as Clinical Interpretation of Variants in Cancer (CIViC) and Molecular Oncology Almanac (MOAlmanac). A more exhaustively linked corpus of therapy information enables improvements in areas such as drug repurposing and variation pathogenicity assessment.
TheraPy extracts terms from community-generated resources such as Wikidata and the HemOnc ontology, as well as from ChEMBL, the National Cancer Institute Thesaurus, RxNorm, ChemIDplus, Drugs@FDA, DrugBank, and the IUPHAR Guide to Pharmacology. Each source record is merged with the records it cross-references to produce normalized groups, which combine properties such as aliases and trade names under a common identifier. Users can access TheraPy through a web API (normalize.cancervariants.org/therapy/) or a command-line interface.
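The cross-reference merging described above can be illustrated with a minimal sketch. This is not TheraPy's implementation: the record layout, the source identifiers (`sourceA:1` and so on), and the function names `build_groups`, `build_index`, and `normalize` are all hypothetical, and using the lexicographically smallest identifier as the group identifier is an arbitrary choice made only to keep the example deterministic. The sketch uses a union-find structure to join records connected by cross-references, then indexes every label and alias against its group.

```python
from collections import defaultdict

# Hypothetical toy records; identifiers and field names are illustrative only.
RECORDS = [
    {"id": "sourceA:1", "label": "imatinib", "aliases": ["STI-571"], "xrefs": ["sourceB:7"]},
    {"id": "sourceB:7", "label": "Imatinib", "aliases": ["Gleevec"], "xrefs": ["sourceC:3"]},
    {"id": "sourceC:3", "label": "IMATINIB", "aliases": [], "xrefs": []},
    {"id": "sourceA:2", "label": "cisplatin", "aliases": ["CDDP"], "xrefs": []},
]


def build_groups(records):
    """Merge records that are linked, directly or transitively, by cross-references."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent pointers to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption for this sketch: the smaller identifier becomes the root.
            root, child = sorted((ra, rb))
            parent[child] = root

    for r in records:
        for xref in r["xrefs"]:
            if xref in parent:  # skip cross-references to records we do not hold
                union(r["id"], xref)

    groups = defaultdict(lambda: {"members": [], "terms": set()})
    for r in records:
        group = groups[find(r["id"])]
        group["members"].append(r["id"])
        group["terms"].update(t.lower() for t in [r["id"], r["label"], *r["aliases"]])
    return dict(groups)


def build_index(groups):
    """Map every lowercased term to the identifier of its merged group."""
    return {term: gid for gid, group in groups.items() for term in group["terms"]}


def normalize(term, index):
    """Return the group identifier for a term, or None if it is unknown."""
    return index.get(term.strip().lower())


groups = build_groups(RECORDS)
index = build_index(groups)
print(normalize("Gleevec", index))
print(normalize("cddp", index))
```

In this toy data, the brand name, the experimental term, and all three source identifiers for imatinib resolve to one group, while cisplatin remains a separate group because no cross-reference links it to the others.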
We assessed the breadth of TheraPy's normalization capabilities on a selected group of external knowledgebases. TheraPy successfully normalized associated therapeutic terms for 87.72% of MOAlmanac assertions, 97.12% of CIViC evidence items, 87.26% of PharmGKB clinical annotations, and 74.41% of Genomics of Drug Sensitivity in Cancer measurements. Remaining normalization challenges, which stem from inconsistent, incomplete, or inaccurate source annotations, are discussed.