Our group focuses on reverse translational efforts aiming to integrate vastly different kinds of biological data by leveraging a combination of machine learning, systems biology, and omics data science techniques to enable target identification and biomarker discovery across a range of indications.
Statistical analysis and quality control of microbial multi-omics sequencing data
High-throughput multi-omics datasets have several properties that complicate their analysis, including multi-dimensionality, discrete and compositional data structure, over-dispersion, and hierarchical, spatial, and temporal dependence. To address these challenges, specialized methods and software are needed that can realistically characterize multi-modal phenomena within large human health population studies, with downstream statistical analysis workflows implemented in a way that is useful to both experimentalists and computational scientists.
We have contributed to several multi-omics projects for both quality control and downstream statistical analysis. These include MaAsLin 2 (>25K downloads), a software tool for association analysis of multivariate clinical metadata with microbial community multi-omics profiles. MaAsLin 2 combines targeted epidemiological hypothesis testing with exploratory analysis strategies. Its workflow comprises several steps, including data transformation, multivariable inference, multiple-comparison correction, and visualization, all based on a set of flexible and computationally efficient generalized linear models.
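The steps listed above (transformation, per-feature multivariable modeling, and multiple-comparison correction) can be illustrated with a minimal Python sketch. This is not the MaAsLin 2 implementation or its API: the function names `bh_adjust` and `per_feature_association`, the log transform with a pseudocount, the ordinary least-squares fit, and the large-sample normal approximation for p-values are all assumptions chosen for illustration.

```python
import math
import numpy as np


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(scaled, 1.0)
    return q


def per_feature_association(abundance, metadata, pseudocount=1e-6):
    """Fit one linear model per feature (hypothetical helper).

    abundance: samples x features array of non-negative abundances.
    metadata:  samples x covariates array; the first column is the
               covariate of interest, the rest are adjusted for.
    Returns (coefficients, q-values) for the covariate of interest.
    """
    # Step 1: transformation (log with pseudocount, an assumed choice).
    log_abundance = np.log(abundance + pseudocount)
    n = metadata.shape[0]
    design = np.column_stack([np.ones(n), metadata])
    xtx_inv = np.linalg.inv(design.T @ design)
    dof = n - design.shape[1]

    coefs, pvals = [], []
    for j in range(log_abundance.shape[1]):
        y = log_abundance[:, j]
        # Step 2: multivariable inference via ordinary least squares.
        beta = xtx_inv @ design.T @ y
        resid = y - design @ beta
        sigma2 = resid @ resid / dof
        se = math.sqrt(sigma2 * xtx_inv[1, 1])
        z = beta[1] / se
        # Two-sided p-value from a normal approximation (assumption;
        # a t-distribution would be more accurate for small samples).
        pvals.append(math.erfc(abs(z) / math.sqrt(2)))
        coefs.append(beta[1])

    # Step 3: multiple-comparison correction across features.
    return np.array(coefs), bh_adjust(pvals)
```

Calling `per_feature_association(abundance, metadata)` returns one effect size and one q-value per feature; features can then be ranked or filtered by q-value before visualization.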
We have additionally contributed to several statistical methodology projects for analyzing microbiome data, including (i) ecological network inference, (ii) longitudinal analysis, (iii) sparsity-aware biomarker discovery, (iv) synthetic microbiome generation, and (v) meta-analysis of population heterogeneity in Inflammatory Bowel Disease (IBD) patients. We have further contributed to multiple applied metagenomic data analysis projects, including the first large-scale metatranscriptome study published from the Nurses’ Health Study and Health Professionals Follow-Up Study and a retrospective association study of family history, early life environment, host genetics, and the microbiome in IBD patients. Finally, we have been an active member of the Integrative Human Microbiome Project (HMP2), where I was the primary statistician responsible for the project's major downstream statistical analyses.
- Mallick H et al. (2017). Experimental Design and Quantitative Analysis of Microbial Community Multiomics. Genome Biology 18(1):228. PMID: 29187204.
- Lloyd-Price J et al. (2019). Multi-omics Detail the Gut Microbial Ecosystem in Inflammatory Bowel Diseases. Nature 569(7758):655-662. PMID: 31142855.
- Quinn RA et al. (2020). Global Chemical Effects of the Microbiome Include New Bile-acid Conjugations. Nature 579(7797):123-129. PMID: 32103176.
- Mallick H et al. (2021). Multivariable Association Discovery in Population-scale Meta-omics Studies. PLoS Computational Biology 17(11):e1009442. PMID: 34784344.
Discovery and validation of biomarkers from integrated multi-omics, single-cell and spatial omics
Microbial community metabolomics is a fast-growing subfield within microbiome research. By measuring microbial metabolites (a molecular “interface” between host and microbes), we can achieve a molecular-level understanding of host-microbiome chemical interactions. However, these data may not be readily available at scale, whereas metagenomic functional profiles (genes) are easily measured for populations of many thousands. By taking advantage of strong cross-biome relationships between microbial gene and metabolite abundances, I developed a data-driven metabolite prediction model for the microbiome, MelonnPan (>150 citations). This provides the first method by which metabolite pools can be predicted in association with microbial communities in the absence of explicit metabolite measurements, offering a cost-effective methodology for integrating gene and metabolite information, with important implications for microbiome epidemiology and public health.
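The general idea of predicting a metabolite from gene abundances can be sketched as a regularized regression that is trained on samples with paired measurements and then applied to samples with gene profiles only. The sketch below is an illustration under stated assumptions, not the MelonnPan model or its interface: ridge regression is used here as a simple stand-in, and the names `fit_ridge`, `predict`, and `penalty` are hypothetical.

```python
import numpy as np


def fit_ridge(genes, metabolite, penalty=1.0):
    """Fit a ridge regression of one metabolite on gene features.

    genes:      samples x gene-features array (training samples with
                paired metabolite measurements).
    metabolite: vector of measured abundances for one metabolite.
    Returns (intercept, weights). Ridge is an assumed stand-in here.
    """
    gene_means = genes.mean(axis=0)
    metabolite_mean = metabolite.mean()
    centered_genes = genes - gene_means
    centered_metabolite = metabolite - metabolite_mean
    n_features = centered_genes.shape[1]
    # Closed-form ridge solution on centered data.
    weights = np.linalg.solve(
        centered_genes.T @ centered_genes + penalty * np.eye(n_features),
        centered_genes.T @ centered_metabolite,
    )
    intercept = metabolite_mean - gene_means @ weights
    return intercept, weights


def predict(genes, intercept, weights):
    """Predict metabolite abundance for samples with gene profiles only."""
    return intercept + genes @ weights
```

In practice the penalty would be tuned by cross-validation, and predictions would be kept only for metabolites whose held-out accuracy is adequate; both steps are omitted from this sketch.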
Building upon these methodological advances, we have recently developed Tweedieverse, a tool for single-cell and spatial differential expression, which has potential applications in other data types such as digital pathology, imaging, and single-cell and spatial multimodal data. Further, we have developed an integrated Bayesian machine learner for multi-omics prediction and classification, as well as for end-to-end biomarker discovery, which can be used to stratify patients for therapeutic interventions, providing a promising route to stratified medicine.
- Mallick H et al. (2019). Predictive Metabolomic Profiling of Microbial Communities Using Amplicon or Metagenomic Sequences. Nature Communications 10(1):3136-3146. PMID: 31316056.
- Mallick H et al. (2022). Differential Expression of Single-cell RNA-seq Data using Tweedie Models. Statistics in Medicine 41(18):3492-3510. PMID: 35656596.
- Mallick H et al. (2022). An Integrated Bayesian Framework for Multi-omics Prediction and Classification. bioRxiv. DOI: 10.1101/2022.11.06.514786.
- Weige C, Birtwistle M, Mallick H et al. Transcriptomes and shRNA Suppressors in A TP53 Allele–Specific Model of Early-Onset Colon Cancer in African Americans. Molecular Cancer Research, 12(7):1029-1041.
Population structure discovery in large-scale genetic association studies and clinical trials
We developed efficient machine learning algorithms for biomedical research which together led to population structure discovery in several large-scale synthetic and real genetic association studies, as well as an improved paradigm for variant discovery through well-powered statistical methodology.
These include (i) a set of haplotype block methods for genetic biomarker discovery, (ii) a hierarchical false discovery method for genetic association studies, and (iii) a method for SNP detection in the presence of zero-inflated count phenotypes. In addition, I have collaborated extensively with researchers on clinical applications of personalized medicine, which include (i) heterogeneity of treatment effects analyses in historical clinical trial datasets, (ii) Bayesian adaptive clinical trial designs to address clinical endpoint and predictive biomarker uncertainty, and (iii) a set of variable selection methods for zero-inflated count responses in healthcare and other disciplines.
- Mallick H, Tiwari H (2016). EM Adaptive LASSO - A Multilocus Modeling Strategy for Detecting SNPs Associated with Zero-inflated Count Phenotypes. Frontiers in Genetics 7:32. PMID: 27066062.
- Chatterjee S*, Chowdhury S*, Mallick H*, Banerjee P, Garai B (2018). Group Regularization for Zero-inflated Negative Binomial Regression Models with An Application to German Healthcare. Statistics in Medicine 37(20): 3012-3026. PMID: 29900575 (*indicates co-first or corresponding authorship)
- Ma S, Ren B, Mallick H, et al. (2021). A Statistical Model for Describing and Simulating Microbial Community Profiles. PLoS Computational Biology 17(9):e1008913.
- Ma S, Shungin D, Mallick H, et al. (2022). Population Structure Discovery in Meta-Analyzed Microbial Communities and Inflammatory Bowel Disease. Genome Biology 23(1):1-31. PMID: 36192803.
Scalable Bayesian and machine learning meta-analysis for high-dimensional omics data
We solved an open research question in Bayesian regularization. The most challenging aspect of this work was constructing a suitable data-augmentation technique that facilitates efficient posterior computation. To this end, we developed a flexible framework that offers richer model summaries, better performance in estimation and prediction, and more nuanced uncertainty quantification than classical methods. Building upon my dissertation research on high-dimensional regression, we contributed to the development of a Bayesian Graphical LASSO algorithm to estimate the correlation matrix from compositional data. We have further developed a statistical benchmarking framework to associate multi-omics data with clinical covariates in large epidemiological populations, along with public implementations.
- Mallick H, Yi N (2014). A New Bayesian LASSO. Statistics and Its Interface 7(4):571-582. PMID: 27570577.
- Schwager EH, Mallick H, et al. (2017). A Bayesian Method for Detecting Pairwise Associations in Compositional Data. PLoS Computational Biology 13(11):e1005852. PMID: 29140991.
- Mallick H, Yi N (2017). Bayesian Group Bridge for Bi-level Variable Selection. Computational Statistics and Data Analysis. PMID: 28943688.
- Mallick H et al. (2021). The Reciprocal Bayesian LASSO. Statistics in Medicine 40(22):4830-4849. PMID: 34126655.
Digital Pathology and Long-read Sequencing
Very recently, we have ventured into digital pathology in collaboration with the University of Florida, where we are actively working on problems such as diagonal integration in the context of end-stage renal disease. Additionally, we are developing new methodologies for long-read sequencing data in collaboration with Dr. Hagen Tilgner.
Other collaborative projects
We have contributed to several collaborative projects as lead biostatisticians and computational biologists. These include (i) a large, randomized equivalency trial, (ii) a retrospective analysis of mortality in surgical necrotizing enterocolitis, and (iii) an allele-specific mouse model of early-onset colon cancer in African Americans, among others.
- Kelleher J et al. (2013). Oronasopharyngeal Suctioning Versus Wiping the Mouth and Nose at Birth: A Randomized Controlled Equivalency Trial. The Lancet 382(9889):326-330. PMID: 23739521.
- Li M, Cleves MA, Mallick H, et al. (2014). A Genetic Association Study Detects Haplotypes Associated with Obstructive Heart Defects. Human Genetics 133(9):1127-38. PMID: 24894164.
- Abu-Ali GS, Mehta RS, Lloyd-Price J, Mallick H, et al. (2018). Metatranscriptome of Human Faecal Microbial Communities in A Cohort of Adult Men. Nature Microbiology 3(3):356-366. PMID: 29335555.
- Nguyen LH et al. (2020). Association Between Sulfur-Metabolizing Bacterial Communities in Stool and Risk of Distal Colorectal Cancer in Men. Gastroenterology 158(5):1313-1325. PMID: 31972239.