Methodology
PepFold is a 5-stage computational pipeline that takes SNP variant identifiers (rsIDs) as input and produces annotated pharmacogenomic reports with ranked peptide candidates and Fmoc-SPPS synthesis protocols.
Pipeline Architecture
Stage 1: Variant Annotation
- Source: NCBI ClinVar via EUtils API (esearch.fcgi + esummary.fcgi)
- Input: List of rsIDs (e.g., rs429358)
- Output: Clinical significance (pathogenic, benign, drug-response, etc.), associated gene symbol, review status
- Filtering: Variants classified as purely "benign" are excluded from downstream analysis
- Note: 12 common pharmacogenomic variants are cached locally to reduce API latency
- Rate limiting: NCBI allows at most 3 requests/second without an API key
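The Stage 1 lookup and its rate limit could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `esearch_url` and `RateLimiter` names are hypothetical, and searching ClinVar with the bare rsID as the `term` is an assumption about the query syntax.

```python
import time
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(rsid: str) -> str:
    """Build an esearch.fcgi URL that looks up one rsID in ClinVar.

    Using the bare rsID as the search term is an assumption.
    """
    params = {"db": "clinvar", "term": rsid, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


class RateLimiter:
    """Space successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second: float = 3.0):
        self.min_interval = 1.0 / max_per_second
        self._last = None  # monotonic time of the previous call

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `limiter.wait()` before each request keeps an unauthenticated client under 3 requests/second; variants served from the local cache would skip the call entirely.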
Stage 2: Target Mapping
- Source: UniProt REST API
- Input: Gene symbols from Stage 1
- Output: Protein sequence, annotated binding sites (if available), protein name, function
- Binding region estimation: If UniProt provides binding site annotations, the first annotated site is used; otherwise, a 30-residue window centered on the sequence midpoint is used as the candidate interaction region
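The binding region rule can be illustrated with a short function. This is a sketch under stated assumptions: the `binding_region` name and the `(start, end)` representation of an annotated site are hypothetical, positions are taken as 1-based and inclusive, and returning the whole sequence when it is shorter than the window is an assumption the text does not cover.

```python
from typing import Optional, Tuple


def binding_region(
    sequence: str,
    site: Optional[Tuple[int, int]] = None,
    window: int = 30,
) -> str:
    """Return the annotated site if given, else a window centered on the midpoint.

    `site` is a hypothetical (start, end) pair, 1-based and inclusive.
    """
    if site is not None:
        start, end = site
        return sequence[start - 1:end]
    if len(sequence) <= window:
        # Assumption: short proteins are used whole.
        return sequence
    mid = len(sequence) // 2
    start = mid - window // 2
    return sequence[start:start + window]
```

For a 100-residue protein with no annotation, this returns residues 36 to 65 (0-based slice `[35:65]`).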
Stage 3: Peptide Generation
Primary: NVIDIA BioNeMo Evo 2
- 40B parameter genomic foundation model
- Endpoint: health.api.nvidia.com/v1/biology/arc/evo2-40b/forward
- Method: Forward pass on the binding region sequence, sample top-K amino acids per position from output logits
- Candidates per target: Configurable (default 3)
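Top-K sampling from per-position logits might look like the sketch below. It assumes the logits have already been reduced to one 20-value vector per position, ordered as in `AMINO_ACIDS`; how the Evo 2 response is mapped to amino acids is not described here, so that layout, the softmax weighting, and all function names are illustrative assumptions.

```python
import math
import random
from typing import List, Optional, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the logits


def top_k_indices(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def sample_peptide(
    logits_per_position: Sequence[Sequence[float]],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one residue per position from the softmax over its top-k logits."""
    rng = rng or random.Random()
    residues = []
    for logits in logits_per_position:
        choices = top_k_indices(logits, k)
        peak = max(logits[i] for i in choices)
        # Subtract the peak before exponentiating for numerical stability.
        weights = [math.exp(logits[i] - peak) for i in choices]
        picked = rng.choices(choices, weights=weights, k=1)[0]
        residues.append(AMINO_ACIDS[picked])
    return "".join(residues)
```

Passing a seeded `random.Random` makes a run repeatable, and calling the function once per requested candidate yields the configured number of sequences per target.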
Fallback: Deterministic Rational Design
When the Evo 2 API is unavailable or returns no usable logits, the pipeline uses a deterministic rational design heuristic. This fallback is NOT machine learning; it is a rule-based heuristic.
- Method: Charge-complementarity mapping (acidic→basic, hydrophobic pairs, aromatic pairs)
- Determinism: SHA-256 hash of position index for reproducible selection
Note: The generation method (evo2 or rational_design) is recorded per candidate.
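A minimal sketch of the fallback idea follows. The `COMPLEMENT` table is a hypothetical stand-in for the pipeline's real mapping, residues without an entry are copied unchanged as an assumption, and the `variant` salt is an illustrative addition for producing more than one candidate; the text above only specifies hashing the position index.

```python
import hashlib

# Hypothetical complementarity table: target residue -> allowed partner residues.
COMPLEMENT = {
    "D": "KR", "E": "KR",                              # acidic -> basic
    "K": "DE", "R": "DE",                              # basic -> acidic
    "A": "LIV", "V": "LIA", "L": "VIA", "I": "LVA",    # hydrophobic pairs
    "F": "WY", "W": "FY", "Y": "FW",                   # aromatic pairs
}


def pick(options: str, position: int, variant: int = 0) -> str:
    """Choose one option deterministically from a SHA-256 digest of the position."""
    digest = hashlib.sha256(f"{position}:{variant}".encode()).digest()
    return options[digest[0] % len(options)]


def rational_design(region: str, variant: int = 0) -> dict:
    """Build a complementary peptide and record how it was generated."""
    sequence = "".join(
        pick(COMPLEMENT.get(residue, residue), position, variant)
        for position, residue in enumerate(region)
    )
    return {"sequence": sequence, "method": "rational_design"}
```

Because the choice depends only on the hash input, the same region always yields the same peptide, with no random state to seed.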
Stage 4: Structure Prediction
- Source: ESMFold API (Meta, api.esmatlas.com/foldSequence/v1/pdb/)
- Input: Peptide amino acid sequence (plain text)
- Output: PDB-format 3D structure, per-residue pLDDT confidence scores
- Rendering: Interactive py3Dmol viewers in HTML reports, colored by pLDDT
- Failure mode: If ESMFold is unavailable, the candidate receives pLDDT=0.0 and is flagged as "structure not predicted"
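The request and the failure mode can be sketched as below. The function names are hypothetical, and the sketch assumes the per-residue pLDDT is stored in the B-factor column of the returned PDB and read from CA atoms; the value scale (0-1 or 0-100) should be checked against the actual response.

```python
import urllib.request
from typing import List, Optional

ESMFOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"


def fold(sequence: str, timeout: float = 60.0) -> Optional[str]:
    """POST the sequence as plain text; return PDB text, or None on failure."""
    request = urllib.request.Request(
        ESMFOLD_URL, data=sequence.encode("ascii"), method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError:  # URLError, HTTPError and socket timeouts derive from OSError
        return None


def plddt_per_residue(pdb_text: str) -> List[float]:
    """Read the B-factor field (columns 61-66) of each CA atom record."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def summarize(pdb_text: Optional[str]) -> dict:
    """Mean pLDDT, or the 0.0 placeholder when no structure is available."""
    scores = plddt_per_residue(pdb_text) if pdb_text else []
    if not scores:
        return {"plddt": 0.0, "status": "structure not predicted"}
    return {"plddt": sum(scores) / len(scores), "status": "predicted"}
```

`summarize(fold(peptide))` then yields either a mean confidence or the flagged placeholder, so downstream scoring never has to handle a missing value.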
Stage 5: Heuristic Scoring
Scoring is NOT machine learning: it applies deterministic rules across 4 dimensions with fixed weights.
| Dimension | Weight | Method |
|---|---|---|
| Binding affinity | 35% | Charge complementarity + hydrophobic matching + size compatibility between peptide and target binding region |
| Structural confidence | 30% | pLDDT score from ESMFold (0-100 normalized to 0-1) + peptide length optimality (8-25 aa = 1.0) + amino acid diversity |
| Clinical relevance | 20% | Keyword matching on ClinVar clinical_significance field. Pathogenic=1.0, Drug response=0.9, Risk factor=0.7, Uncertain=0.3, Benign=0.1. Review status weighted 30%, significance 70%. |
| Novelty | 15% | Pairwise sequence similarity against all other candidates. Score = 1 - average_similarity. |
Overall = 0.35 × binding + 0.30 × structural + 0.20 × clinical + 0.15 × novelty
Candidates are ranked by overall score, and the top N per target are selected.
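The weighted sum, the clinical and novelty rules, and the ranking step can be sketched together. Several details are assumptions not specified above: `difflib.SequenceMatcher` stands in for the unspecified similarity measure, the review-status score is passed in as a number because its mapping is not given, unmatched significance text scores 0.0, and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List

WEIGHTS = {"binding": 0.35, "structural": 0.30, "clinical": 0.20, "novelty": 0.15}

# Keyword -> score, checked in this order against the ClinVar significance text.
SIGNIFICANCE = [
    ("pathogenic", 1.0), ("drug response", 0.9), ("risk factor", 0.7),
    ("uncertain", 0.3), ("benign", 0.1),
]


def clinical_score(significance: str, review: float) -> float:
    """70% keyword-derived significance, 30% review status (given as 0-1)."""
    text = significance.lower().replace("-", " ")
    sig = next((score for key, score in SIGNIFICANCE if key in text), 0.0)
    return 0.7 * sig + 0.3 * review


def novelty(sequence: str, others: List[str]) -> float:
    """1 minus the average similarity to every other candidate."""
    if not others:
        return 1.0
    sims = [SequenceMatcher(None, sequence, other).ratio() for other in others]
    return 1.0 - sum(sims) / len(sims)


def overall_score(scores: Dict[str, float]) -> float:
    """Fixed-weight sum over the four dimensions."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())


def top_n(candidates: List[dict], n: int) -> List[dict]:
    """Rank one target's candidates by overall score and keep the best n."""
    return sorted(candidates, key=lambda c: c["overall"], reverse=True)[:n]
```

As a worked example, a candidate scoring 0.8 (binding), 0.7 (structural), 0.9 (clinical), and 0.5 (novelty) gets 0.35 × 0.8 + 0.30 × 0.7 + 0.20 × 0.9 + 0.15 × 0.5 = 0.28 + 0.21 + 0.18 + 0.075 = 0.745.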
Synthesis Protocol Generation
Method: Rule-based Fmoc-SPPS (solid-phase peptide synthesis) template.
- Resin selection: Wang resin for standard C-terminal, Rink Amide for K/R C-terminal
- Coupling reagents: HBTU/DIPEA standard; HATU for difficult residues (H, N, Q, R, W)
- Coupling times: 30-60 min based on position and residue difficulty
- Cleavage cocktail: Selected based on sequence composition:
  - Cys/Met present → Reagent K (TFA/phenol/water/thioanisole/EDT)
  - Trp/Arg present → TFA/TIS/water/EDT
  - Otherwise → TFA/TIS/water
- Purification: RP-HPLC with C18 or C4 column based on hydrophobicity
- QC: ESI-MS, RP-HPLC, amino acid analysis, LAL endotoxin test
- Cost estimate: $15/residue base + surcharges for difficult residues and purity
IMPORTANT: These are TEMPLATES requiring laboratory optimization, not validated protocols.
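The selection rules in this template lend themselves to a few small functions. The sketch below is illustrative only: the function names are hypothetical, and giving the Cys/Met rule precedence over the Trp/Arg rule is an assumption based on the order in which the rules are listed.

```python
def select_resin(sequence: str) -> str:
    """Rink Amide when the C-terminal residue is K or R, otherwise Wang."""
    return "Rink Amide" if sequence[-1] in "KR" else "Wang"


def coupling_reagent(residue: str) -> str:
    """HATU for residues listed as difficult, HBTU/DIPEA otherwise."""
    return "HATU" if residue in "HNQRW" else "HBTU/DIPEA"


def cleavage_cocktail(sequence: str) -> str:
    """Pick a cocktail from composition; the first matching rule wins."""
    residues = set(sequence)
    if residues & set("CM"):
        return "Reagent K"
    if residues & set("WR"):
        return "TFA/TIS/water/EDT"
    return "TFA/TIS/water"
```

For example, a peptide containing both Cys and Trp resolves to Reagent K under this ordering, and a peptide ending in Lys is assigned Rink Amide resin.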
Report Format
- HTML: Interactive (py3Dmol viewers, Plotly charts), self-contained
- PDF: Generated via Playwright headless browser
- Sections: Variant annotations table, target mapping, ranked candidates with scores, 3D viewers, synthesis protocol per candidate
Data Sources & External Dependencies
| Dependency | Purpose | Rate Limits / Notes | Failure Handling |
|---|---|---|---|
| NCBI EUtils | ClinVar variant annotation | Max 3 req/sec without API key | Local cache fallback for 12 common variants |
| UniProt API | Protein sequence mapping | Standard fair use | Pipeline aborts target if mapping fails |
| BioNeMo Evo 2 | Peptide generation | NVIDIA API limits | Fallback to rational design heuristic |
| ESMFold API | Structure prediction | Meta API limits | Structure flagged as not predicted; pLDDT=0.0 |
Limitations
- No molecular docking or free energy calculations
- Binding scores are heuristic, not physics-based
- ESMFold predictions are models, not experimental structures
- Peptide generation may fall back from Evo 2 to rational design with no run-time warning; the method actually used is flagged in the report
- Synthesis protocols require wet-lab optimization
- Not experimentally validated
- Clinical relevance scoring uses keyword matching, not curated pharmacogenomic databases like PharmGKB
Reproducibility
- Pipeline is deterministic given the same external API responses
- Rational design fallback uses SHA-256 seeded by position for reproducibility
- Job IDs are cryptographically random 128-bit values used for report retrieval
- All external data sources are timestamped in reports
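A 128-bit job ID and a source timestamp could be produced as in the sketch below. Both function names and the record layout are hypothetical; the only detail taken from the list above is the 128-bit size.

```python
import secrets
from datetime import datetime, timezone


def new_job_id() -> str:
    """Return 16 random bytes (128 bits) as a 32-character hex string."""
    return secrets.token_hex(16)


def source_stamp(name: str) -> dict:
    """Record when an external source was queried, in UTC ISO 8601 form."""
    return {
        "source": name,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
```

Storing one stamp per external call alongside the raw response is what lets a later run be compared against the same inputs.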
Citation
If you use PepFold in your research, please cite:
PepFold: Pharmacogenomic Variant-to-Synthesis Pipeline. Olam Création, 2026. https://pepfold.com