aoutools: Tools for All of Us Researcher Workbench

aoutools is a lightweight Python package designed to streamline and accelerate polygenic risk score (PRS) computations directly on the native Hail VariantDataset within the All of Us Researcher Workbench. By leveraging a batch-processing framework and ensuring accurate variant matching for multi-allelic sites, the tool significantly reduces runtime and computational resource usage compared to traditional methods. Furthermore, it lowers technical barriers for researchers by providing an automated workflow that converts PGS Catalog identifiers into final scores.
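As a rough illustration of what a PRS computation involves, the sketch below sums weighted effect-allele dosages for one sample. It is not the aoutools API and does not use Hail; the function name `prs` and the dictionary layouts are hypothetical, chosen only to show why a weight must be matched to the exact allele at a multi-allelic site rather than to the position alone.

```python
# Minimal sketch of a polygenic risk score; NOT the aoutools implementation.
# The function name and data layouts are hypothetical, for illustration only.

def prs(weights, genotypes):
    """Sum weight * effect-allele dosage over the variants found in a sample.

    weights:   {(chrom, pos, effect_allele): weight}
    genotypes: {(chrom, pos): [allele, allele]} for one diploid sample
    """
    score = 0.0
    for (chrom, pos, effect_allele), weight in weights.items():
        alleles = genotypes.get((chrom, pos))
        if alleles is None:
            continue  # variant not called for this sample
        # Count copies of the exact effect allele, so a different alternate
        # allele at a multi-allelic site contributes nothing to the score.
        dosage = sum(1 for allele in alleles if allele == effect_allele)
        score += weight * dosage
    return score


weights = {("1", 1000, "T"): 0.5, ("2", 2000, "G"): -0.2}
genotypes = {("1", 1000): ["T", "C"], ("2", 2000): ["G", "G"]}
sample_score = prs(weights, genotypes)  # one copy of T, two copies of G
```

A sample carrying a different alternate allele at the same position (for example `["A", "C"]` at position 1000) receives a dosage of zero for the `T` weight, which is the matching behavior the description refers to.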

AMIA Informatics Summit 2026 [PyPI]

https://aoutools.readthedocs.io

An Agentic System for Automated Data Curation and Analysis in Large-Scale Biobanks

This autonomous, dual-component agentic system is an end-to-end tool designed to automate biomedical research workflows from initial hypothesis to final report using large-scale datasets like the UK Biobank. It features an LLM-based preprocessing framework that translates complex clinical and lifestyle concepts into machine-readable phenotypes, paired with an Analysis Agent that autonomously executes statistical plans and synthesizes findings. Ultimately, this system eliminates manual data bottlenecks, enhancing reproducibility and democratizing complex data analysis for clinicians and researchers of all technical backgrounds.

ML4H 2025 [OpenReview]

https://github.com/ukjung21/ukb-agent

GeOKG: geometry-aware knowledge graph embedding for Gene Ontology and genes

GeOKG (Geometry-Aware Knowledge Graph Embeddings) is a deep learning tool designed to generate superior representations of Gene Ontology (GO) and Gene Ontology Annotations (GOA) by utilizing interactions among multiple geometric spaces. By exploiting these geometric interactions, the tool effectively captures the complex, nonmonotonic hierarchical structure of GO graphs that traditional single-space methods fail to fully model. Ultimately, GeOKG provides robust embeddings for heterogeneous biomedical networks, significantly outperforming existing approaches in critical downstream tasks like protein-protein interaction prediction.

Bioinformatics, 2025 [PubMed]

https://github.com/ukjung21/GeOKG

MAESTRO: Masked Encoding Set Transformer with Self-Distillation

MAESTRO is a self-supervised machine learning tool that generates comprehensive vector representations of entire immune profiles from cytometry data, moving beyond traditional individual cell phenotyping. By leveraging attention mechanisms and a self-distillation framework to reconstruct immune profiles even when 90% of the cells are hidden, it effectively models the complex immune system as a holistic network. Ultimately, MAESTRO outperforms existing approaches by delivering superior accuracy in retrieving cell-type distributions and capturing critical clinical metadata like disease diagnosis, age, and sex for downstream tasks.
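To make the masking idea concrete, the sketch below hides a fixed fraction of the cells in one sample and returns the visible and hidden subsets. It is only an illustration of random set masking, not MAESTRO's code; the function name `mask_cells`, the `seed` parameter, and the list-of-cells representation are assumptions.

```python
import random

# Minimal sketch of random masking over a set of cells; NOT MAESTRO's code.
# `mask_cells` and its parameters are hypothetical names for illustration.

def mask_cells(cells, mask_ratio=0.9, seed=0):
    """Split cells into (visible, hidden), hiding about mask_ratio of them."""
    rng = random.Random(seed)
    n_hidden = int(round(len(cells) * mask_ratio))
    hidden_idx = set(rng.sample(range(len(cells)), n_hidden))
    visible = [c for i, c in enumerate(cells) if i not in hidden_idx]
    hidden = [c for i, c in enumerate(cells) if i in hidden_idx]
    return visible, hidden


# Each "cell" would normally be a vector of marker intensities;
# integers stand in for cells here.
visible, hidden = mask_cells(list(range(100)))
```

In a masked-reconstruction setup, a model would see only `visible` and be trained to recover properties of `hidden`.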

ICLR 2025 [OpenReview]

https://github.com/matthew-lee1/MAESTRO

PGxQA: A Resource for Evaluating LLM Performance for Pharmacogenomic QA Tasks

PGxQA provides a standardized suite of automated and expert-scored tests designed to evaluate the accuracy and safety of large language models (LLMs) in answering pharmacogenetics questions for clinicians, patients, and researchers. By demonstrating that even advanced models like GPT-4o still fall short of the rigorous standards required for clinical use, the benchmark serves as a critical public resource to safely guide the future development and implementation of genetics-guided medical AI.

PSB 2025 [PubMed]

https://github.com/KarlKeat/PGxQA

Cytometry masked autoencoder: An accurate and interpretable automated immunophenotyper

The cytometry masked autoencoder (cyMAE) is a scalable machine learning tool designed to automate cellular immunophenotyping and cell type annotation in large-scale single-cell cytometry datasets. By leveraging a self-supervised training phase on unlabelled data followed by task-specific fine-tuning, cyMAE overcomes the robustness and accuracy limitations of traditional methods while upholding interpretable, user-defined cell type definitions. Ultimately, this tool provides researchers with reliable cross-study comparability and improved metadata predictions without incurring high, repeated training costs.

Cell Reports Medicine, 2024 [PubMed]

https://github.com/JaesikKim/cyMAE

NETMAGE: A human disease phenotype map generator for the network-based visualization of phenome-wide association study results

To make shared genetic components across phenotypes easier to visualize, we developed the humaN-disEase phenoType MAp GEnerator (NETMAGE), a web-based tool that produces interactive phenotype network visualizations from summarized PheWAS results. Users can search the map by a variety of attributes, and they can select nodes to view information such as related phenotypes, associated SNPs, and other network statistics. As a test case, we constructed a network using UK Biobank PheWAS summary data. By examining the associations between phenotypes in our map, we can potentially identify novel instances of pleiotropy, where loci influence multiple phenotypic traits. Thus, our tool provides researchers with a means to identify prospective genetic targets for drug design, contributing to the exploration of personalized medicine.
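The underlying network idea can be sketched in a few lines: two phenotypes are linked when at least one SNP is associated with both. The code below is an illustrative assumption about that construction, not NETMAGE's implementation; the function name `phenotype_network` and the `(snp, phenotype)` input format are hypothetical, and the example associations are made up.

```python
from collections import defaultdict
from itertools import combinations

# Minimal sketch of a shared-SNP phenotype network; NOT NETMAGE's code.
# The function name, input format, and example data are hypothetical.

def phenotype_network(associations):
    """Map each phenotype pair to the set of SNPs associated with both.

    associations: iterable of (snp, phenotype) pairs, e.g. the significant
    rows of a PheWAS summary table.
    """
    phenotypes_by_snp = defaultdict(set)
    for snp, phenotype in associations:
        phenotypes_by_snp[snp].add(phenotype)

    edges = defaultdict(set)
    for snp, phenotypes in phenotypes_by_snp.items():
        # Every pair of phenotypes sharing this SNP gets (or extends) an edge.
        for a, b in combinations(sorted(phenotypes), 2):
            edges[(a, b)].add(snp)
    return dict(edges)


associations = [
    ("rs1", "asthma"), ("rs1", "eczema"),
    ("rs2", "asthma"), ("rs2", "eczema"),
    ("rs3", "asthma"),
]
network = phenotype_network(associations)
```

The number of SNPs on an edge gives a simple edge weight, and a SNP that appears on an edge is a candidate instance of pleiotropy.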

GigaScience, 2022 [PubMed]

https://hdpm.biomedinfolab.com/netmage/

iDRW: integrative directed random walk-based pathway activity inference method

We propose a general framework for integrative pathway activity inference on a multi-omics network and investigate multiple scenarios of multi-layered gene-gene graph construction that can be applied to various datasets. To reflect the interaction effects between multi-omics data, we designed a directed gene-gene graph using pathway information by assigning interactions between genes in multiple layers of networks. The proposed method selects cooperative driver pathways and predicts overall survival (OS) or metastasis. As a proof-of-concept study, it was evaluated using three genomic profiles of urologic cancer patients. iDRW is implemented as an R software package.
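For readers unfamiliar with the random-walk component, the sketch below runs a generic random walk with restart on a directed graph until the node weights stop changing. It is a textbook version written in Python for illustration, not the iDRW R package; the function name, the edge-list input, and the restart value of 0.7 are assumptions.

```python
# Generic random walk with restart on a directed graph; NOT the iDRW package.
# Function name, input format, and the default restart value are assumptions.

def random_walk_with_restart(edges, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Return steady-state node weights.

    edges: list of (source, target) pairs of a directed graph
    seeds: {node: initial weight}, normalized internally to sum to 1
    """
    nodes = sorted({n for edge in edges for n in edge} | set(seeds))
    out_degree = {n: 0 for n in nodes}
    for source, _ in edges:
        out_degree[source] += 1

    total = sum(seeds.values())
    w0 = {n: seeds.get(n, 0.0) / total for n in nodes}
    w = dict(w0)
    for _ in range(max_iter):
        # Restart term plus weight flowing along each outgoing edge.
        # Nodes without outgoing edges pass nothing on, so their share of
        # the walking mass is simply dropped in this simplified version.
        nxt = {n: restart * w0[n] for n in nodes}
        for source, target in edges:
            nxt[target] += (1 - restart) * w[source] / out_degree[source]
        delta = sum(abs(nxt[n] - w[n]) for n in nodes)
        w = nxt
        if delta < tol:
            break
    return w


# A three-gene chain a -> b -> c, with all initial weight on gene a.
gene_weights = random_walk_with_restart([("a", "b"), ("b", "c")], {"a": 1.0})
```

Genes closer to the seed along directed edges end up with larger weights; such weights could then be combined with expression values to score a pathway.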

Bioinformatics, 2021 [PubMed]

https://github.com/sykim122/iDRW

HiG2Vec: Hierarchical representations of gene ontology and gene

Knowledge from Gene Ontology (GO) and its annotations is typically exploited through vector representations of GO terms and genes, which serve versatile applications such as deep learning. We propose hierarchical representations of gene ontology and gene (HiG2Vec), which apply Poincaré embedding, a method specialized for representing hierarchies, through a two-step procedure: GO embedding and gene embedding. Experimental results indicate that HiG2Vec is superior at capturing GO and gene semantics and at utilizing data, and that it is robust enough to be applied to various kinds of biological knowledge.
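Poincaré embeddings place points inside the unit ball and measure similarity with the hyperbolic distance d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The sketch below computes that standard distance in plain Python; it is not taken from the HiG2Vec code, and the function name is hypothetical.

```python
import math

# Standard Poincaré-ball distance; a sketch, NOT taken from the HiG2Vec code.

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    def squared_norm(x):
        return sum(c * c for c in x)

    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1 - squared_norm(u)) * (1 - squared_norm(v))
    return math.acosh(1 + 2 * diff / denom)


# The same Euclidean gap costs more hyperbolic distance near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

Because distances grow rapidly toward the boundary, general terms can sit near the origin while their many specific descendants spread out near the edge, which is what makes this geometry suited to hierarchies such as GO.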

Bioinformatics, 2021 [PubMed]

https://github.com/JaesikKim/HiG2Vec

Human-Disease Phenotype Map

This is the phenotype connectivity map from one of the largest PheWAS using electronic health record (EHR)-derived phenotypes, covering 38,682 unrelated samples from Geisinger's MyCode Community Health Initiative genotyped through the DiscovEHR project. Click on a disease node to highlight other diseases found to be associated with it via SNPs.

Human-disease phenotype map derived from PheWAS across 38,682 individuals, American Journal of Human Genetics, 2019 [PubMed]

hdpm.biomedinfolab.com

MildInt: Deep learning-based multimodal longitudinal data integration framework

The Python package MildInt (Deep learning-based Multimodal longitudinal data integration framework) provides a pre-constructed deep learning architecture for classification tasks. MildInt contains two learning phases: learning a feature representation from each modality of data, and training a classifier for the final decision. Adopting a deep architecture in the first phase leads to learning more task-relevant feature representations than a linear model. In the second phase, a linear regression classifier is used for detecting and investigating biomarkers from multimodal data. Thus, by combining the linear model and the deep learning model, higher accuracy and better interpretability can be achieved. MildInt can integrate multiple forms of numerical data, including time series and non-time series data, to extract complementary features from the multimodal dataset.

Frontiers in Genetics, 2019 [PubMed]

https://github.com/goeastagent/MildInt
