
About Me
Hi, I’m Shreya — a passionate bioinformatician who loves working with genomic data to solve real-world healthcare problems. With a background in Computer Engineering and currently pursuing my Master’s in Bioinformatics, I’m exploring how data and biology come together to create meaningful impact.
I’m especially curious about single-cell data analysis and how we can use machine learning and large language models (LLMs) to understand complex biological systems. I’ve worked on projects involving GWAS, haplotype analysis, variant annotation, RNA-seq, and ChIP-seq — and I’m always looking for ways to connect these tools to real human health outcomes.
At the core of everything I do is one goal: to contribute to solutions that genuinely help people. I want my work in bioinformatics not just to stay in code or datasets, but to reach the clinic, the community, and the people who need it.
My Projects


Population-Specific Haplotype Landscape and Its Association with Alzheimer's Disease Susceptibility
- Analyzed 5.38 million SNPs from 1,865 ADNI samples, integrating 1000 Genomes Project data to refine haplotype analysis.
- Identified 4,066 LD blocks and 388 significant haplotypes, with the top haplotype showing strong Alzheimer’s disease association (p = 3.7e-24).
- Optimized GWAS pipelines with PLINK, SHAPEIT5, and BEDTools, reducing analysis runtime by ~40% using high-performance computing.

LLM-Based Cell Type Annotation of scRNA-seq Data
- Built an automated pipeline to annotate 7 cell clusters from the PBMC 3k single-cell RNA-seq dataset using a pretrained LLM (Flan-T5), eliminating the need for manual marker lookup.
- Extracted the top 5 marker genes per cluster, formatted them into zero-shot prompts, and achieved consistent annotation of major immune cell types such as B cells and NK cells.
- Visualized over 2,600 cells using UMAP with LLM-derived labels, demonstrating feasibility of integrating LLMs into genomic interpretation workflows.
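The prompt-formatting step of the LLM-based annotation project can be illustrated with a minimal sketch. The function name, the prompt wording, and the marker lists below are all assumptions made for illustration (the genes are well-known PBMC markers, not the project's actual output), and the real pipeline may differ.

```python
def build_zero_shot_prompt(cluster_id, marker_genes, top_n=5):
    """Format a cluster's top marker genes into a zero-shot annotation prompt.

    The prompt wording is hypothetical; only the idea of listing the
    top 5 markers per cluster comes from the project description.
    """
    genes = ", ".join(marker_genes[:top_n])
    return (
        f"Cluster {cluster_id} of a human PBMC single-cell RNA-seq dataset "
        f"has these top marker genes: {genes}. "
        "Which immune cell type does this cluster most likely represent?"
    )


# Illustrative marker lists, ranked from strongest to weakest (assumed values).
example_markers = {
    0: ["MS4A1", "CD79A", "CD79B", "CD74", "HLA-DRA", "CD37"],
    1: ["NKG7", "GNLY", "GZMB", "PRF1", "KLRD1"],
}

# One prompt per cluster; only the first five genes of each list are used.
prompts = {
    cid: build_zero_shot_prompt(cid, genes)
    for cid, genes in example_markers.items()
}

for prompt in prompts.values():
    print(prompt)
```

Each prompt string could then be passed to a text-to-text model such as Flan-T5, and the returned label mapped back onto the cluster before plotting.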


Impact of TP53 Mutations on Binding Sites in Breast Cancer
- Analyzed the TCGA breast cancer dataset to investigate TP53 mutations and their impact on transcription factor binding at key loci (BAX, BBC3, BRCA1, CDKN1A) using ChIP-seq and DeepTools.
- Developed and optimized a computational pipeline for comparative analysis between wild-type and mutant TP53, identifying mutation-driven ectopic binding at the BBC3 locus and novel oncogenic regulatory disruptions.
- Validated pipeline reproducibility and accuracy, ensuring robust processing of large-scale genomic data for downstream cancer research applications. Also performed differential gene expression analysis.

My other projects
My other projects range from bioinformatics to software development.
BioNexus
A literature mining platform integrating MySQL for database management, scraping NCBI databases using Entrez E-Utilities API, and utilizing NLP for semantic similarity and article recommendation.
Sarcastic.IO
A platform that uses a fine-tuned bidirectional LSTM model to detect sarcasm in text. The system is fully Dockerized for deployment and management, and it exposes an API and a front-end interface for user interaction.
Correlation Analysis of miRNA Expression Across Cancers
Conducted an in-depth analysis of miRNA expression profiles using RPKM values across multiple cancer types. Generated a 12x12 Pearson correlation matrix to quantify relationships between cancer types and visualized the results as a heatmap to highlight similarities.
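As a rough sketch of how a 12x12 Pearson correlation matrix like this can be computed, the example below uses NumPy on synthetic data. The expression values are randomly generated placeholders, and the number of miRNAs and the log transform are assumptions, not details of the actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cancers, n_mirnas = 12, 200  # 12 cancer types; 200 miRNAs is an arbitrary placeholder

# Synthetic stand-in for RPKM values: one row per cancer type, one column per miRNA.
rpkm = rng.lognormal(mean=2.0, sigma=1.0, size=(n_cancers, n_mirnas))

# Log-transform to dampen the influence of a few highly expressed miRNAs (an assumption).
log_rpkm = np.log2(rpkm + 1.0)

# np.corrcoef treats each row as one variable, so 12 rows give a 12 x 12 matrix.
corr = np.corrcoef(log_rpkm)

print(corr.shape)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be passed directly to a heatmap function from a plotting library.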
Operon Prediction
Developed a Python-based workflow to predict operons using PTT files for the genomes of Escherichia coli K-12, Bacillus subtilis, Halobacterium, and Synechocystis.
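One common heuristic for operon prediction from a PTT file groups adjacent genes that lie on the same strand and are separated by a short intergenic gap. The sketch below illustrates that idea only. The 50 bp threshold, the function names, and the toy gene table are assumptions, and the workflow described above may use a different rule.

```python
import io


def parse_ptt(handle):
    """Yield (start, end, strand, gene) tuples from a PTT file.

    PTT files begin with three header lines (title, protein count,
    column names) followed by tab-separated rows.
    """
    for line_no, line in enumerate(handle):
        if line_no < 3 or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        start, end = (int(x) for x in fields[0].split(".."))
        yield start, end, fields[1], fields[4]


def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes whose gap is below max_gap bp (assumed cutoff)."""
    operons, current = [], []
    for gene in sorted(genes):  # sorts by start coordinate first
        same_strand = current and gene[2] == current[-1][2]
        if same_strand and gene[0] - current[-1][1] < max_gap:
            current.append(gene)
        else:
            if current:
                operons.append(current)
            current = [gene]
    if current:
        operons.append(current)
    return [[g[3] for g in operon] for operon in operons]


# Toy PTT content with hypothetical genes, used only to exercise the functions.
toy_ptt = (
    "Toy genome - 1..5000\n"
    "4 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "100..400\t+\t100\t-\tgeneA\tt0001\t-\t-\thypothetical protein\n"
    "420..900\t+\t160\t-\tgeneB\tt0002\t-\t-\thypothetical protein\n"
    "1500..2000\t+\t167\t-\tgeneC\tt0003\t-\t-\thypothetical protein\n"
    "2010..2500\t-\t163\t-\tgeneD\tt0004\t-\t-\thypothetical protein\n"
)

operons = predict_operons(parse_ptt(io.StringIO(toy_ptt)))
print(operons)
```

In this toy input, geneA and geneB are grouped because they share a strand and sit 20 bp apart, while geneC and geneD each form their own group because of a large gap and a strand change, respectively.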
DashInvoice
DashInvoice streamlines bookkeeping for B2B and B2C sectors by automating invoice processing and analysis. It uses AWS services, including Amazon Textract, for OCR-based text extraction, processes the data with regex to convert it into a semi-structured format, and stores it in PostgreSQL.
My Experience


