46. Data-driven refinement of gene signatures for enrichment analysis and cell state characterization

Alexander Wenzel

Alexander Wenzel is a PhD candidate in the Biomedical Informatics department at UC San Diego. He earned his B.S. in Computer Science from Northwestern University while working as a Bioinformatician in the laboratory of Dr. Jaehyuk Choi at the Northwestern University Lurie Comprehensive Cancer Center studying the genomics of cutaneous lymphomas. He is currently studying in the laboratory of Dr. Jill Mesirov at UCSD, focusing on algorithms for analyzing standard and single-cell RNA-seq data in cancer.


Alexander Wenzel, Pablo Tamayo, Jill Mesirov

University of California, San Diego, San Diego, CA, USA

Gene expression data have been crucial for functionally characterizing changes in molecular pathway activity and for identifying targets for novel treatments. However, interpreting these data is complicated by their high dimensionality and by the difficulty of identifying biological signals within a list of differentially expressed genes. Gene Set Enrichment Analysis (GSEA) is a standard method for identifying pathway enrichment in gene expression data: it tests whether a set of genes whose expression would indicate the activity of a specific process or phenotype is coordinately up- or downregulated more than would be expected by chance. Because GSEA relies on high-quality gene sets with coordinately regulated member genes, we maintain the Molecular Signatures Database (MSigDB), which contains 9 collections of curated gene sets representing different biological pathways and processes. Over time, we have observed that some MSigDB gene sets, especially those that are manually curated or defined in a very specific biological context, may not provide a sufficiently sensitive and specific co-regulation signature. In response, we have created a data-driven, matrix-factorization-based refinement method to build more sensitive and specific gene sets. This method incorporates large-scale datasets from multiple sources, such as the Cancer Dependency Map, as well as curated protein-protein interaction networks. We will present the initial results of this refinement method and our ongoing work, which will yield a new collection of refined gene sets that will be made freely available in MSigDB for use with GSEA and many other applications.
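
To make the enrichment test described above concrete, the following is a minimal sketch of a GSEA-style weighted running-sum enrichment score, with a simple gene-set permutation p-value. It is an illustration under stated assumptions, not the GSEA software or the refinement method presented in this abstract: the function names `enrichment_score` and `permutation_pvalue`, the toy gene names, and the toy ranking scores are all hypothetical, and the sketch omits the phenotype permutation, score normalization, and multiple-testing correction that a full analysis would typically include.

```python
import numpy as np


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted running-sum enrichment score for one gene set.

    ranked_genes: gene names sorted by decreasing ranking metric.
    scores: ranking metric values in the same order as ranked_genes.
    gene_set: iterable of gene names to test.
    p: exponent applied to |score| when weighting hits.
    """
    scores = np.asarray(scores, dtype=float)
    in_set = np.isin(ranked_genes, list(gene_set))
    n_hits = int(in_set.sum())
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    # Step up at each hit in proportion to |score|**p; step down evenly at misses.
    hit_weights = np.where(in_set, np.abs(scores) ** p, 0.0)
    hit_step = hit_weights / hit_weights.sum()
    miss_step = np.where(in_set, 0.0, 1.0 / n_miss)
    running = np.cumsum(hit_step - miss_step)

    # The score is the running sum's largest deviation from zero, sign kept.
    return running[np.argmax(np.abs(running))]


def permutation_pvalue(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size.

    Assumption: gene-set permutation is used here for brevity.
    """
    rng = np.random.default_rng(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = int(np.isin(ranked_genes, list(gene_set)).sum())
    null = np.array([
        enrichment_score(
            ranked_genes, scores,
            rng.choice(ranked_genes, size=size, replace=False),
        )
        for _ in range(n_perm)
    ])
    # Compare only against null scores with the same sign as the observed one.
    same_sign = null[np.sign(null) == np.sign(observed)]
    n_extreme = int((np.abs(same_sign) >= abs(observed)).sum())
    return (n_extreme + 1) / (len(same_sign) + 1)


# Toy ranked list (hypothetical genes and scores), most upregulated first.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
metric = [5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0]

es_up = enrichment_score(genes, metric, {"g1", "g2", "g3"})
es_down = enrichment_score(genes, metric, {"g8", "g9", "g10"})
p_up = permutation_pvalue(genes, metric, {"g1", "g2", "g3"})
print(es_up, es_down, p_up)
```

In this sketch, a set whose members cluster at the top of the ranked list yields a positive score and a set clustered at the bottom yields a negative one. A set whose members are scattered across the list produces a score near zero, which is the loss of sensitivity that motivates refining poorly co-regulated gene sets.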