Methodology

PepFold is a 5-stage computational pipeline that takes SNP variant identifiers (rsIDs) as input and produces annotated pharmacogenomic reports with ranked peptide candidates and Fmoc-SPPS synthesis protocols.

Pipeline Architecture

rsIDs → [1. ClinVar] → [2. UniProt] → [3. Evo 2 / Rational Design] → [4. ESMFold] → [5. Scoring] → Report

Stage 1: Variant Annotation

Source: NCBI ClinVar via EUtils API (esearch.fcgi + esummary.fcgi)
Input: List of rsIDs (e.g., rs429358)
Output: Clinical significance (pathogenic, benign, drug-response, etc.), associated gene symbol, review status
Filtering: Variants classified as purely "benign" are excluded from downstream analysis
Note: 12 common pharmacogenomic variants are cached locally to reduce API latency
Rate limiting: NCBI allows at most 3 requests/second without an API key
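
The lookup, filtering, and rate-limiting rules above can be sketched as follows. This is a minimal illustration, not PepFold's actual code: the helper names (`esearch_url`, `is_purely_benign`, `Throttle`) are hypothetical, and treating "purely benign" as an exact match on the label "benign" is an assumption.

```python
import time
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
MIN_INTERVAL = 1.0 / 3.0  # at most 3 requests per second without an API key


def esearch_url(rsid: str) -> str:
    """Build an esearch.fcgi URL that looks up one rsID in ClinVar."""
    query = urlencode({"db": "clinvar", "term": rsid, "retmode": "json"})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def is_purely_benign(significance: str) -> bool:
    """Assumption: 'purely benign' means the label is exactly 'benign'."""
    return significance.strip().lower() == "benign"


class Throttle:
    """Sleep as needed so consecutive calls are at least min_interval apart."""

    def __init__(self, min_interval: float = MIN_INTERVAL):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, if any

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `Throttle.wait()` before each HTTP request keeps a sequential client under the 3 requests/second limit; mixed labels such as "Benign/Likely benign" are not excluded under the exact-match assumption.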

Stage 2: Target Mapping

Source: UniProt REST API
Input: Gene symbols from Stage 1
Output: Protein sequence, annotated binding sites (if available), protein name, function
Binding region estimation: If UniProt provides binding site annotations, the first annotated site is used; otherwise, a 30-residue window centered on the sequence midpoint is used as the candidate interaction region
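
The binding region rule can be sketched as below. The function name and the `(start, end)` tuple representation are hypothetical; the sketch assumes 1-based inclusive site coordinates and clamps the 30-residue window to the sequence length for proteins shorter than 30 residues, which the description above does not specify.

```python
from typing import List, Optional, Tuple

WINDOW = 30  # residues in the fallback window


def binding_region(sequence: str,
                   sites: Optional[List[Tuple[int, int]]] = None) -> str:
    """Return the candidate interaction region of a protein sequence.

    sites: annotated binding sites as (start, end), 1-based inclusive
    (assumed representation). Only the first annotated site is used.
    """
    if sites:
        start, end = sites[0]
        return sequence[start - 1:end]
    # No annotation: take a window centered on the sequence midpoint.
    mid = len(sequence) // 2
    start = max(0, mid - WINDOW // 2)
    end = min(len(sequence), start + WINDOW)
    return sequence[start:end]
```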

Stage 3: Peptide Generation

Primary: NVIDIA BioNeMo Evo 2

40B parameter genomic foundation model
Endpoint: health.api.nvidia.com/v1/biology/arc/evo2-40b/forward
Method: Forward pass on the binding region sequence, sample top-K amino acids per position from output logits
Candidates per target: Configurable (default 3)
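
As an illustration of per-position top-K sampling, the sketch below assumes the model output has already been converted to one row of 20 amino-acid logits per position. How PepFold maps Evo 2 output to amino acids is not described here, and the uniform choice among the top K, the seed parameter, and the function name are all assumptions.

```python
import random
from typing import Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_top_k(logits: Sequence[Sequence[float]],
                 k: int = 3, seed: int = 0) -> str:
    """Sample one peptide: at each position pick among the K highest logits.

    logits: one row per position, 20 values ordered as AMINO_ACIDS
    (assumed layout). Selection among the top K is uniform (assumption).
    """
    rng = random.Random(seed)  # seeded so a given call is repeatable
    residues = []
    for row in logits:
        ranked = sorted(range(len(AMINO_ACIDS)),
                        key=lambda i: row[i], reverse=True)
        residues.append(AMINO_ACIDS[rng.choice(ranked[:k])])
    return "".join(residues)
```

Calling the function with different seeds would yield the configurable number of candidates per target.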

Fallback: Deterministic Rational Design

When the Evo 2 API is unavailable or returns no usable logits, the pipeline falls back to a deterministic, rule-based design heuristic. This fallback is NOT machine learning.

Method: Charge-complementarity mapping (acidic→basic, hydrophobic pairs, aromatic pairs)
Determinism: SHA-256 hash of position index for reproducible selection
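
A minimal sketch of such a heuristic is shown below. The residue classes and partner sets are hypothetical stand-ins for the real mapping table (only acidic→basic is stated above; basic→acidic and the specific group members are assumptions), and copying unmapped residues unchanged is also an assumption.

```python
import hashlib

# Hypothetical complement table: (residues in class, allowed partner residues).
COMPLEMENTS = [
    ("DE", "KR"),        # acidic -> basic
    ("KR", "DE"),        # basic -> acidic (assumed symmetric rule)
    ("FWY", "FWY"),      # aromatic pairs
    ("AVLIM", "AVLIM"),  # hydrophobic pairs
]


def pick(options: str, position: int) -> str:
    """Choose one option deterministically from SHA-256 of the position index."""
    digest = hashlib.sha256(str(position).encode("utf-8")).digest()
    return options[int.from_bytes(digest[:4], "big") % len(options)]


def rational_design(region: str) -> str:
    """Map each target residue to a complementary peptide residue."""
    peptide = []
    for position, residue in enumerate(region):
        for members, partners in COMPLEMENTS:
            if residue in members:
                peptide.append(pick(partners, position))
                break
        else:
            peptide.append(residue)  # assumption: unmapped residues are copied
    return "".join(peptide)
```

Because the only source of variation is a hash of the position index, the same binding region always yields the same peptide.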

Note: The generation method (evo2 or rational_design) is recorded per candidate.

Stage 4: Structure Prediction

Source: ESMFold API (Meta, api.esmatlas.com/foldSequence/v1/pdb/)
Input: Peptide amino acid sequence (plain text)
Output: PDB-format 3D structure, per-residue pLDDT confidence scores
Rendering: Interactive py3Dmol viewers in HTML reports, colored by pLDDT
Failure mode: If ESMFold is unavailable, candidate receives pLDDT=0.0 and is flagged as "structure not predicted"
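
The output handling and failure mode can be sketched as follows. The sketch assumes pLDDT is stored in the B-factor column (columns 61-66) of each ATOM record, as is conventional for ESMFold PDB output, and averages over C-alpha atoms; the function name and the choice of C-alpha averaging are assumptions. The value scale (0-1 or 0-100) should be checked against the actual API response.

```python
from typing import Optional, Tuple

NOT_PREDICTED = "structure not predicted"


def mean_plddt(pdb_text: Optional[str]) -> Tuple[float, str]:
    """Return (mean per-residue pLDDT, status) for a PDB-format string.

    Reads the B-factor field of C-alpha ATOM records. If no structure is
    available, returns pLDDT 0.0 with the 'structure not predicted' flag.
    """
    if not pdb_text:
        return 0.0, NOT_PREDICTED
    values = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name 13-16, temperature factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        return 0.0, NOT_PREDICTED
    return sum(values) / len(values), "ok"
```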

Stage 5: Heuristic Scoring

Scoring is NOT machine learning. It is deterministic and rule-based, combining 4 dimensions with fixed weights.

Dimension	Weight	Method
Binding affinity	35%	Charge complementarity + hydrophobic matching + size compatibility between peptide and target binding region
Structural confidence	30%	pLDDT score from ESMFold (0-100 normalized to 0-1) + peptide length optimality (8-25 aa = 1.0) + amino acid diversity
Clinical relevance	20%	Keyword matching on ClinVar `clinical_significance` field. Pathogenic=1.0, Drug response=0.9, Risk factor=0.7, Uncertain=0.3, Benign=0.1. Review status weighted 30%, significance 70%.
Novelty	15%	Pairwise sequence similarity against all other candidates. Score = 1 - average_similarity.

Overall = 0.35 × binding + 0.30 × structural + 0.20 × clinical + 0.15 × novelty
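
The clinical relevance rule and the weighted sum can be sketched as below. The keyword order, the 0.0 default for unmatched labels, and the `review_score` input (a 0-1 value for review status, whose mapping is not specified here) are assumptions; the function names are hypothetical.

```python
# Keywords are checked in this order; the first match wins (assumed order).
SIGNIFICANCE_SCORES = [
    ("pathogenic", 1.0),
    ("drug response", 0.9),
    ("risk factor", 0.7),
    ("uncertain", 0.3),
    ("benign", 0.1),
]


def significance_score(clinical_significance: str) -> float:
    """Keyword match on the ClinVar clinical_significance field."""
    text = clinical_significance.lower().replace("-", " ").replace("_", " ")
    for keyword, score in SIGNIFICANCE_SCORES:
        if keyword in text:
            return score
    return 0.0  # assumption: unmatched labels contribute nothing


def clinical_relevance(clinical_significance: str, review_score: float) -> float:
    """Significance weighted 70%, review status weighted 30%."""
    return 0.7 * significance_score(clinical_significance) + 0.3 * review_score


def overall_score(binding: float, structural: float,
                  clinical: float, novelty: float) -> float:
    """Fixed-weight combination of the four dimension scores (each 0-1)."""
    return 0.35 * binding + 0.30 * structural + 0.20 * clinical + 0.15 * novelty
```

For example, a pathogenic variant with a review score of 0.9 gets a clinical relevance of 0.7 × 1.0 + 0.3 × 0.9 = 0.97; combined with binding 0.8, structural 0.6, and novelty 0.5, the overall score is 0.28 + 0.18 + 0.194 + 0.075 = 0.729. Note that plain substring matching scores "Likely pathogenic" the same as "Pathogenic".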

Candidates are ranked by overall score, and the top N per target are selected.
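
Novelty scoring and ranking can be sketched as follows. The similarity metric is not specified above, so this sketch substitutes Python's `difflib.SequenceMatcher` ratio as a stand-in; the candidate dictionary keys and the novelty of 1.0 for a lone candidate are also assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def novelty_scores(sequences: List[str]) -> List[float]:
    """Novelty = 1 - average pairwise similarity to all other candidates."""
    scores = []
    for i, seq in enumerate(sequences):
        similarities = [
            SequenceMatcher(None, seq, other, autojunk=False).ratio()
            for j, other in enumerate(sequences) if j != i
        ]
        # Assumption: a lone candidate has nothing to resemble, so novelty 1.0.
        average = sum(similarities) / len(similarities) if similarities else 0.0
        scores.append(1.0 - average)
    return scores


def top_n(candidates: List[Dict], n: int) -> List[Dict]:
    """Rank candidates by their 'overall' score, highest first, and keep n."""
    return sorted(candidates, key=lambda c: c["overall"], reverse=True)[:n]
```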

Synthesis Protocol Generation

Method: Rule-based Fmoc-SPPS (solid-phase peptide synthesis) template.

Resin selection: Wang resin by default; Rink Amide when the C-terminal residue is K or R
Coupling reagents: HBTU/DIPEA standard; HATU for difficult residues (H, N, Q, R, W)
Coupling times: 30-60 min based on position and residue difficulty
Cleavage cocktail: Selected based on sequence composition:
- Cys/Met present → Reagent K (TFA/phenol/water/thioanisole/EDT)
- Trp/Arg present → TFA/TIS/water/EDT
- Otherwise → TFA/TIS/water
Purification: RP-HPLC with C18 or C4 column based on hydrophobicity
QC: ESI-MS, RP-HPLC, amino acid analysis, LAL endotoxin test
Cost estimate: $15/residue base + surcharges for difficult residues and purity
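
The selection rules above translate directly into a few lookups. The sketch below covers resin, coupling reagent, and cleavage cocktail only; function names are hypothetical, and coupling times, column choice, and cost surcharges are omitted because their exact thresholds are not given here.

```python
DIFFICULT_RESIDUES = set("HNQRW")  # residues coupled with HATU


def select_resin(sequence: str) -> str:
    """Rink Amide when the C-terminal residue is K or R, otherwise Wang."""
    return "Rink Amide" if sequence.endswith(("K", "R")) else "Wang"


def coupling_reagent(residue: str) -> str:
    """HATU for difficult residues, HBTU/DIPEA for everything else."""
    return "HATU" if residue in DIFFICULT_RESIDUES else "HBTU/DIPEA"


def cleavage_cocktail(sequence: str) -> str:
    """Pick a cocktail from sequence composition; the first matching rule wins."""
    if "C" in sequence or "M" in sequence:
        return "Reagent K"
    if "W" in sequence or "R" in sequence:
        return "TFA/TIS/water/EDT"
    return "TFA/TIS/water"
```

The rule order matters: a peptide containing both Cys and Trp is assigned Reagent K because the Cys/Met check runs first.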

IMPORTANT: These are TEMPLATES requiring laboratory optimization, not validated protocols.

Report Format

HTML: Interactive (py3Dmol viewers, Plotly charts), self-contained
PDF: Generated via Playwright headless browser
Sections: Variant annotations table, target mapping, ranked candidates with scores, 3D viewers, synthesis protocol per candidate

Data Sources & External Dependencies

Dependency	Purpose	Rate Limits / Notes	Failure Handling
NCBI EUtils	ClinVar variant annotation	Max 3 req/sec without API key	Local cache fallback for 12 common variants
UniProt API	Protein sequence mapping	Standard fair use	Processing of the target is aborted if mapping fails
BioNeMo Evo 2	Peptide generation	NVIDIA API limits	Fallback to rational design heuristic
ESMFold API	Structure prediction	Meta API limits	Structure flagged as not predicted; pLDDT=0.0

Limitations

No molecular docking or free energy calculations
Binding scores are heuristic, not physics-based
ESMFold predictions are models, not experimental structures
The pipeline may fall back from Evo 2 to rational design without interrupting the run; the method actually used is flagged per candidate in the report
Synthesis protocols require wet-lab optimization
Not experimentally validated
Clinical relevance scoring uses keyword matching, not curated pharmacogenomic databases like PharmGKB

Reproducibility

Pipeline is deterministic given the same external API responses
Rational design fallback uses SHA-256 seeded by position for reproducibility
Job IDs are 128-bit cryptographically generated identifiers used for report retrieval
All external data sources are timestamped in reports
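
As an illustration of the job ID and timestamp points, one way to produce a 128-bit identifier and a UTC timestamp with the Python standard library is sketched below. Whether PepFold uses `secrets`, hex encoding, or ISO 8601 timestamps is an assumption; the function names are hypothetical.

```python
import secrets
from datetime import datetime, timezone


def new_job_id() -> str:
    """Return a 128-bit random identifier as 32 hexadecimal characters."""
    return secrets.token_hex(16)  # 16 bytes = 128 bits


def source_timestamp() -> str:
    """Return the current UTC time in ISO 8601 form for stamping a data source."""
    return datetime.now(timezone.utc).isoformat()
```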

Citation

If you use PepFold in your research, please cite:

PepFold: Pharmacogenomic Variant-to-Synthesis Pipeline. Olam Création, 2026. https://pepfold.com