Research Portfolio

Two Distinct Scales: Microbial Function & Science of Science
Xuexin Li
Duke CBB Interview
Unmasking Microbial Function with Protein Language Models
Beyond Sequence Homology: A Deep Learning Approach to Metagenomics
0. Defining "Similarity": A Semantic Approach

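Before analyzing clusters, we must define what constitutes a "functional match." Standard metrics often conflate specific terms with their broad parents. We adopted a probability-based approach that measures the specificity of the terms' shared ancestor.

$$ \mathrm{Sim}(t_1, t_2) = 1 - p(\mathrm{LCA}(t_1, t_2)) $$

Here $\mathrm{LCA}(t_1, t_2)$ is the lowest common ancestor of terms $t_1$ and $t_2$ in the ontology, and $p(\cdot)$ is the probability of that ancestor term. A broad parent has a probability near 1 and scores near 0, so vague matches are penalized and only specific functional concordance is rewarded.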
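As a minimal sketch of the similarity above, the snippet below evaluates $1 - p(LCA)$ on a toy ontology. The term names, the `PARENTS` map, and the probabilities in `P` are hypothetical illustrations, not our data. Because a term can have several shared ancestors in a DAG, the sketch assumes the "LCA" is the shared ancestor with the lowest probability.

```python
from typing import Dict, Set

# Hypothetical toy ontology: term -> set of direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "kinase": {"catalysis"},
    "phosphatase": {"catalysis"},
}

# Hypothetical term probabilities; the root covers everything, so p(root) = 1.
P: Dict[str, float] = {
    "root": 1.0,
    "binding": 0.4,
    "catalysis": 0.5,
    "kinase": 0.2,
    "phosphatase": 0.1,
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus every ancestor reachable via PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def similarity(t1: str, t2: str) -> float:
    """Sim(t1, t2) = 1 - p(LCA), with LCA taken as the least probable shared ancestor."""
    common = ancestors(t1) & ancestors(t2)
    lca = min(common, key=lambda t: P[t])
    return 1.0 - P[lca]
```

With these toy values, two terms that only share the root score 0, while sibling terms under a specific parent score higher.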

Engineering Contribution: To scale this evaluation to millions of protein homologs, I engineered a modular CPU/GPU-parallel pipeline that optimizes the pairwise similarity calculation.

1. The Problem: The Confidence Trap

We benchmarked state-of-the-art (SOTA) function prediction models such as DeepGOSE. Our analysis reveals a critical flaw: accuracy falls as prediction confidence falls (Fig 1a), so the low-confidence predictions are also the least reliable.

Furthermore, for unannotated genes ("Microbial Dark Matter"), these models exhibit extremely low confidence (Fig 1b). Effectively, SOTA models fail exactly where they are needed most.
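One way to produce a confidence-versus-accuracy view of this kind is to bin predictions by confidence and average the semantic similarity within each bin. The sketch below assumes each prediction is a `(confidence, similarity)` pair in [0, 1]; the function name and the equal-width binning are illustrative choices, not necessarily the exact procedure behind Fig 1a.

```python
from statistics import mean
from typing import Dict, List, Tuple


def accuracy_by_confidence(
    records: List[Tuple[float, float]], n_bins: int = 5
) -> Dict[int, float]:
    """Mean similarity per equal-width confidence bin (bin index -> mean).

    Bin i covers confidences in [i / n_bins, (i + 1) / n_bins);
    a confidence of exactly 1.0 is placed in the last bin.
    Empty bins are omitted from the result.
    """
    bins: List[List[float]] = [[] for _ in range(n_bins)]
    for conf, sim in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(sim)
    return {i: mean(b) for i, b in enumerate(bins) if b}
```

If accuracy tracks confidence, the per-bin means should rise with the bin index.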

Critically, we observe a distinct separation (Fig 1c): clusters tend to be either fully annotated or fully unannotated. The scarcity of "mixed" clusters suggests that "Dark Matter" proteins form coherent, independent functional groups rather than being randomly scattered among known families.
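The cluster-level separation can be checked with a simple purity tally. This is a sketch under assumed inputs: `assignments` maps a protein ID to a cluster ID, and `annotated` is the set of protein IDs that carry any annotation. Both names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set


def classify_clusters(
    assignments: Dict[str, Hashable], annotated: Set[str]
) -> Dict[Hashable, str]:
    """Label each cluster 'annotated', 'unannotated', or 'mixed'."""
    members = defaultdict(list)
    for protein, cluster in assignments.items():
        members[cluster].append(protein)

    labels = {}
    for cluster, proteins in members.items():
        n_annotated = sum(1 for p in proteins if p in annotated)
        if n_annotated == len(proteins):
            labels[cluster] = "annotated"
        elif n_annotated == 0:
            labels[cluster] = "unannotated"
        else:
            labels[cluster] = "mixed"
    return labels
```

Counting how many clusters fall under each label gives the proportion of "mixed" clusters directly.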

Fig 1a (Similarity vs. Confidence): Accuracy drops as confidence drops.
Fig 1b (Confidence Distribution): Unannotated genes (red) have systematically low confidence.
Fig 1c (Annotation Correlation): Cluster Purity. Most clusters are either fully annotated or fully unannotated, consistent with Dark Matter forming distinct functional groups.

2. The Solution: Beyond Structure

Figure 2 (PLM vs. Structure): Capturing "Contextual" Meaning. Unlike 3D structure alignment, which fails on fragments or disjoint domains, PLM embeddings capture functional semantic context, allowing us to align gene fragments that belong to the same functional family.

Our pipeline uses the ESM-2 (3B-parameter) model to generate 2560-dimensional embeddings. This lets us move beyond rigid structural alignment and capture "fuzzy" semantic similarities.

Methodology:
  1. Embedding: UniRef90 sequences encoded with ESM-2.
  2. Dimensionality reduction: Incremental PCA (2560 → 500 dimensions).
  3. Graph: k-NN graph construction ($k=15$) using cosine similarity.
  4. Clustering: Leiden algorithm, optimizing modularity.
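To make the graph step concrete, here is a minimal, dependency-free sketch of k-NN construction under cosine similarity. It is a brute-force illustration on toy vectors, not the production pipeline: at UniRef90 scale an approximate nearest-neighbor index would be needed, and the PCA and Leiden steps are omitted. The function names are our own.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(
    embeddings: List[Sequence[float]], k: int
) -> Dict[int, List[Tuple[int, float]]]:
    """Map each node index to its k most similar neighbors as (index, similarity)."""
    graph = {}
    for i, u in enumerate(embeddings):
        sims = [(j, cosine(u, v)) for j, v in enumerate(embeddings) if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = sims[:k]
    return graph
```

The resulting weighted edge list is the kind of input a community detection step such as Leiden would consume.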
Pipeline

3. Next Step: Identifying Disease Clusters

We are currently applying this framework to 785 samples from curatedMetagenomicData.

Our objective is to recover "Dark Matter" clusters associated with colorectal cancer (CRC) and inflammatory bowel disease (IBD), signals that standard UniRef pipelines likely discard.
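As one possible way to test a cluster for disease association, the sketch below computes a one-sided Fisher's exact test on presence/absence counts between cases and controls. The choice of test is an assumption made for illustration (the statistical method is still being settled), and the function name and arguments are hypothetical.

```python
from math import comb


def fisher_enrichment(
    case_pos: int, case_total: int, ctrl_pos: int, ctrl_total: int
) -> float:
    """One-sided Fisher's exact p-value for enrichment in cases.

    case_pos / ctrl_pos: samples in which the cluster is detected.
    Returns P(X >= case_pos) under the hypergeometric null, i.e. the
    probability of seeing at least this many positive case samples
    if detection were independent of disease status.
    """
    total = case_total + ctrl_total
    pos = case_pos + ctrl_pos
    denom = comb(total, case_total)
    upper = min(pos, case_total)
    tail = sum(
        comb(pos, x) * comb(total - pos, case_total - x)
        for x in range(case_pos, upper + 1)
    )
    return tail / denom
```

With many clusters tested at once, the resulting p-values would still need a multiple-testing correction.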

Figure 3 (CRC Results): Preliminary Association. Investigating potential enrichment in IBD/CRC cohorts.
Deciphering Scientific Collaboration in the LLM Era
Network Analysis of 5,674 Bio-LLM Research Articles

1. The Research Question: Democratizing Access?

Large Language Models (LLMs) are reshaping biomedical research, but are they democratizing science or reinforcing elite hubs? To answer this, we needed to map the changing landscape of cross-disciplinary collaboration.

We analyzed a corpus of 5,674 Bio-LLM research articles from PubMed to track how institutions and disciplines interact in this emerging field.

Disciplinary Network
The Landscape: Medicine acts as the bridge between CS and Clinical disciplines.

2. The Engineering Bottleneck: Unifying 34k Entities

The core challenge was data heterogeneity. Affiliation strings in PubMed are messy and inconsistent. To map these networks, I had to harmonize 34,000 heterogeneous institution entities. Classical string matching failed, so I designed a Cascaded Entity Resolution Pipeline.
  Stage 1: LLM Pre-screener (Standardize & Filter)
  Stage 2: Dense Retrieval Model (Candidate Generation)
  Stage 3: Logic-Guided LLM (Final Discriminator)
Validation: Achieved >99% accuracy on 200 manually verified samples.
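The control flow of the cascade can be sketched with cheap stand-ins for each stage. Everything here is illustrative: the `CANONICAL` registry is hypothetical, regex normalization stands in for the LLM pre-screener, token Jaccard overlap stands in for the dense retrieval model, and a subset rule stands in for the logic-guided LLM. Only the three-stage structure mirrors the actual pipeline.

```python
import re
from typing import List, Optional, Set

# Hypothetical registry of canonical institution names.
CANONICAL = ["Duke University", "Stanford University", "Karolinska Institutet"]


def prescreen(raw: str) -> Optional[str]:
    """Stage 1 stand-in: lowercase, strip punctuation, drop empty strings."""
    text = re.sub(r"[^a-z0-9 ]+", " ", raw.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


def tokens(text: str) -> Set[str]:
    return set(text.split())


def retrieve(query: str, top_n: int = 2) -> List[str]:
    """Stage 2 stand-in: rank canonical names by token Jaccard overlap."""
    q = tokens(query)
    scored = []
    for name in CANONICAL:
        c = tokens(name.lower())
        scored.append((len(q & c) / len(q | c), name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]


def discriminate(query: str, candidates: List[str]) -> Optional[str]:
    """Stage 3 stand-in: accept a candidate only if all its tokens occur in the query."""
    q = tokens(query)
    for name in candidates:
        if tokens(name.lower()) <= q:
            return name
    return None


def resolve(raw: str) -> Optional[str]:
    """Run the cascade; None means filtered out or left unresolved."""
    cleaned = prescreen(raw)
    if cleaned is None:
        return None
    return discriminate(cleaned, retrieve(cleaned))
```

The design point is that each stage narrows the work for the next: the cheap filter removes junk, retrieval limits the candidate set, and the expensive discriminator only sees a handful of options per affiliation string.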

3. The Discovery: The Collaboration Bonus

This precise entity standardization enabled us to link bibliometric data with NIH funding records. We discovered a quantifiable "Collaboration Bonus": resource-constrained institutions can achieve "elite-tier" impact (citations) by establishing bridging ties with central hubs.

Conclusion: Strategic collaboration effectively allows under-resourced organizations to "borrow" impact, suggesting that LLM research has the potential to be a democratizing force.

Resource vs Impact
Funding & Impact: (Left) Correlation Scatter. (Right) Impact Stratification showing the "Bonus" effect.