A regression test that fails by fixture_id
How a 20-fixture A/B JSON sidecar replaced our 'looks-good-to-me' detector reviews.
Every detector change runs against a curated NMRShiftDB2 corpus before merge. CI fails by name when any single fixture drifts >50% — so reviewers see 'nmrshiftdb2_60000006_13c regressed' instead of 'tests passed (with notes).' The boring infrastructure that quietly raised our ship velocity.
What 'experimental' actually means in our promotion gate
Every new AI backend ships as opt-in. Promotion to default is a published-threshold decision, not a vibes call.
GSD-Prompt-3 shipped as `experimental: true` with a documented gate (95% solvent detect, median compound-count delta ≤2). Until both clear, the default stays legacy. Most AI startups ship and patch; we publish the corpus, the threshold, and the date a feature crosses each one.
No confidence number without an audit trail
Why we'd rather show 'pending' than a polished score with no provenance.
Every numerical claim in the UI links to its source — the spectrum file, the picked peaks, the SMILES candidate, the literature citation, the human reviewer who signed off. The implementation cost is real. The regulatory cost of doing it otherwise is higher.
From Bruker SFO1 to GSD: plumbing instrument metadata through the contract
A 500-MHz field hardcoded in the FE became a real number from the vendor metadata. Three lines of code, one cascade, no contract change.
Phase 8 traced field_mhz through the preview → process → analyze chain so the GSD endpoint receives the spectrometer frequency the instrument actually used (600.13 MHz, in our verification fixture) instead of a hardcoded 500. The same plumbing pattern works for vendor / solvent / nucleus.
Why legacy's fit χ² of 10¹⁵ is honest
Per-peak QC metrics landed on legacy peaks and immediately surfaced a units mismatch. We shipped the column anyway.
GSD reports fit residuals normalized to baseline σ; legacy reports them in raw signal-domain units. The same threshold paints 31/37 peaks 'red' on legacy spectra. The right fix is detector-side normalization — but in the meantime, the column tells the truth.
Validation against references that count the way detectors count
NMRShiftDB2 said the algorithm was failing. HMDB-style references said it was clearing the strict gate. Both were right.
Same algorithm, two corpora, two verdicts. The Phase 14 framework added expert-curated multiplet-line references so we could finally separate detector quality from corpus granularity. Strict gate cleared at multiplet-line scale; NMRShiftDB2 environment-scale stays xfailed by design.
Additive, never destructive — across 39 evidence layers
Every existing endpoint and regression test must stay green as new layers land. Here's how the typed-Pydantic contract makes that affordable.
Layer 22 (proton/carbon-13 scoring) and Layer 39 (LCMS feature grouping) speak the same API style. Stable JSON keys, additive fields, openapi-typescript regen on every contract change. The 'never overwrite a prior layer' rule is what lets us ship weekly without breaking last year's dossier.
Reading the FDA's January 2025 AI framework, in code
Stage-4 human oversight gates aren't a paragraph in a policy; they're a release queue in your audit table.
The FDA's 2025 framework formalizes risk-based credibility for AI in regulatory submissions. We mapped each stage onto concrete code: model-card registry, recipe-hash provenance, human-signoff queue, immutable raw vault. The PRs are linkable; the audit ledger is queryable.