Trace Ecosystem (v0.63.0)

Pull traces from six SaaS observability sources, mine shared system prompts, sample the most uncertain rows, run sequential A/B tests with martingale-controlled Type-I error, and watch production for distributional drift: all from one CLI, all offline, with no per-trace fees.

Five new commands (four top-level, plus `soup data active-sample`), every one of them LIVE in v0.63.0.

`soup ingest` — universal trace importer

bash
soup ingest --source langfuse --logs ~/Downloads/langfuse_export.jsonl \
  --output ./traces.jsonl

Six sources at launch:

| Source | Env var | Notes |
| --- | --- | --- |
| `langfuse` | `LANGFUSE_KEY` | LangFuse dashboard export |
| `langsmith` | `LANGSMITH_API_KEY` | LangSmith API traces |
| `helicone` | `HELICONE_API_KEY` | Helicone observability |
| `openpipe` | `OPENPIPE_API_KEY` | OpenPipe production traces |
| `otel` | `OTEL_EXPORTER_OTLP_HEADERS` | OpenTelemetry OTLP |
| `openai-stored` | `OPENAI_API_KEY` | OpenAI Stored Completions |

`soup ingest` makes no network calls: operators export from the SaaS dashboard, then `soup ingest` normalises the file. A PII warning prints once per invocation.

Output schema (a frozen `TraceRecord` with `MappingProxyType` metadata):

json
{
  "trace_id": "trace_abc123",
  "prompt": "What is the capital of France?",
  "output": "The capital of France is Paris.",
  "source": "langfuse",
  "signal": "none",
  "metadata": {"user_id": "user_789", "timestamp": "2026-05-20T..."}
}
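The "frozen record with read-only metadata" shape can be sketched in Python as follows. This is a minimal illustration that assumes only the field names visible in the JSON above; the defaults and the `__post_init__` wrapping are guesses, not the shipped `TraceRecord` definition.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class TraceRecord:
    """Illustrative stand-in for the normalised record (not the real class)."""
    trace_id: str
    prompt: str
    output: str
    source: str
    signal: str = "none"  # assumed default, taken from the sample row
    metadata: Mapping[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # A frozen dataclass blocks normal assignment, so use object.__setattr__
        # to swap the caller's dict for a read-only view over a private copy.
        object.__setattr__(self, "metadata", MappingProxyType(dict(self.metadata)))


record = TraceRecord(
    trace_id="trace_abc123",
    prompt="What is the capital of France?",
    output="The capital of France is Paris.",
    source="langfuse",
    metadata={"user_id": "user_789"},
)
```

Assigning to `record.prompt` raises `dataclasses.FrozenInstanceError`, and writing to `record.metadata` raises `TypeError`, so a record cannot be mutated after normalisation.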

Feeds directly into the v0.26 `soup data from-traces` command for preference-pair building.

`soup prune-prompt` — system-prompt mining

bash
soup prune-prompt --input ./traces.jsonl --output ./traces_pruned.jsonl --min-frequency 0.95

Detects the longest shared system-prompt prefix across rows via binary search over up to 32 candidate lengths (v0.63 fixed an O(N²) early-exit bug from the prototype). The prefix is then stripped from the training data, so the fine-tuned model internalizes the boilerplate rather than needing it repeated in every request. This is OpenPipe's signature trick.

Two-pass file read; capped at 100k rows to prevent DoS.
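A binary search over prefix lengths might look like the sketch below. It assumes that `--min-frequency` means "the fraction of rows that must share the prefix" and that the 32-candidate cap is a limit on probes; both are readings of the text above rather than confirmed behaviour, and `shared_prefix_len` is a hypothetical helper name.

```python
from collections import Counter


def shared_prefix_len(prompts, min_frequency=0.95, max_probes=32):
    """Length of the longest prefix shared by >= min_frequency of prompts."""
    if not prompts:
        return 0
    need = min_frequency * len(prompts)

    def is_shared(n):
        # Count each distinct length-n prefix; prompts shorter than n cannot match.
        counts = Counter(p[:n] for p in prompts if len(p) >= n)
        return bool(counts) and counts.most_common(1)[0][1] >= need

    # If a length-n prefix is shared, every shorter prefix is shared too,
    # so the predicate is monotone and binary search applies.
    lo, hi = 0, max(len(p) for p in prompts)
    probes = 0
    while lo < hi and probes < max_probes:
        mid = (lo + hi + 1) // 2
        if is_shared(mid):
            lo = mid
        else:
            hi = mid - 1
        probes += 1
    return lo


prompts = [
    "You are helpful. Q1",
    "You are helpful. Q2",
    "You are helpful. Q3",
    "Other",
]
prefix_len = shared_prefix_len(prompts, min_frequency=0.75)
```

Each probe scans every row once, so the total work is linear in the row count times the number of probes, which avoids the quadratic behaviour a naive pairwise comparison would have.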

`soup data active-sample` — uncertainty sampling

bash
soup data active-sample --input ./prod_traces.jsonl --budget 200

Two modes auto-detected from the JSONL schema:

  • Single score — max-entropy on rm_score (peaks at 0.5).
  • Dual scores — pairwise disagreement on rm_scores list.

Output is a drop-in eval prompt set for human judging.
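The two scoring modes can be illustrated with a short sketch. It assumes `rm_score` is a probability-like value in [0, 1] and measures disagreement as the largest gap within `rm_scores`; the actual scoring and tie-breaking rules used by `soup data active-sample` may differ, and the helper names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome; maximal (1.0) at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def uncertainty(row):
    # Dual-score mode: largest pairwise gap between reward-model scores.
    if "rm_scores" in row:
        scores = row["rm_scores"]
        return max(scores) - min(scores)
    # Single-score mode: entropy of the lone score.
    return binary_entropy(row["rm_score"])


def active_sample(rows, budget):
    """Return the `budget` rows with the highest uncertainty."""
    return sorted(rows, key=uncertainty, reverse=True)[:budget]


single = [
    {"id": 1, "rm_score": 0.9},
    {"id": 2, "rm_score": 0.5},
    {"id": 3, "rm_score": 0.1},
]
picked = active_sample(single, budget=1)
```

Here the row with `rm_score` 0.5 is selected first, because a score at the midpoint is the one the reward model is least sure about.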

`soup ab` — Wald-SPRT sequential A/B

bash
soup ab --input ./ab_results.jsonl --metric judge_score \
  --alpha 0.05 --beta 0.20 --effect-size 0.1

In Wald's SPRT (Sequential Probability Ratio Test), the likelihood ratio is a martingale under the null hypothesis, so Type-I error is controlled at every stopping time, not just at a fixed sample size. The test stops early when the log-likelihood ratio crosses A = log((1-β)/α) (reject H0) or B = log(β/(1-α)) (accept H0). With the α = 0.05 and β = 0.20 used above, A ≈ 2.77 and B ≈ -1.56.
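The stopping rule can be sketched in a few lines of Python. The thresholds follow the formulas above; the likelihood model (Gaussian paired differences with known standard deviation `sigma`, testing a mean of 0 against a mean of `effect_size`) is an assumption made for illustration and is not necessarily what `soup ab` uses internally.

```python
import math


def sprt_thresholds(alpha, beta):
    """Wald's boundaries on the log-likelihood ratio."""
    upper = math.log((1 - beta) / alpha)   # cross above: reject H0
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    return upper, lower


def sprt(diffs, effect_size, sigma=1.0, alpha=0.05, beta=0.20):
    """Run the test over paired differences; return (decision, samples_used)."""
    upper, lower = sprt_thresholds(alpha, beta)
    llr = 0.0
    for n, x in enumerate(diffs, start=1):
        # Gaussian log-likelihood ratio increment for H1: mean = effect_size
        # versus H0: mean = 0, with known sigma (illustrative assumption).
        llr += (effect_size * x - effect_size ** 2 / 2) / sigma ** 2
        if llr >= upper:
            return "reject_h0", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", len(diffs)


decision, used = sprt([1.0] * 10, effect_size=1.0)
```

In this toy run each observation adds 0.5 to the log-likelihood ratio, so the upper boundary of about 2.77 is crossed on the sixth sample and the remaining four are never examined.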

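Input rows, one per observation (these sample rows carry `latency`, not the `judge_score` metric used in the command above):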

json
{"arm": "control", "latency": 150.2}
{"arm": "treatment", "latency": 145.1}

Decisions: reject_h0 / accept_h0 / continue. v0.63 ships a CRITICAL fix for a sign error in the historical mSPRT implementation.

Three metrics at launch: latency, judge_score, retry_rate.

`soup drift-alarm` — KL drift watch with webhooks

bash
soup drift-alarm \
  --reference ./ft_output_dist.jsonl --live ./prod_output_dist.jsonl \
  --threshold 0.2 \
  --slack-url "https://hooks.slack.com/services/T.../B.../..."

Computes a rolling KL divergence between token-distribution snapshots taken at fine-tune time and in production. Surfaces both behavioral drift ("model now outputs JSON") and vocabulary drift ("same 20 phrases repeated"). Whitespace tokenization is the default; pluggable tokenizers ship in v0.63.1.
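The core comparison can be sketched as below. It assumes each snapshot row carries a `token` and a `log_prob` field (the input format for this command), that the divergence is taken as KL(live || reference), and that tokens missing from the reference are floored at a small `eps`; the direction and the smoothing are illustrative choices, not confirmed details of `soup drift-alarm`.

```python
import math


def to_dist(rows):
    """Turn {"token", "log_prob"} rows into a normalised probability dict."""
    probs = {}
    for row in rows:
        probs[row["token"]] = probs.get(row["token"], 0.0) + math.exp(row["log_prob"])
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats; tokens absent from q are floored at eps."""
    total = 0.0
    for tok, p_t in p.items():
        if p_t > 0.0:
            total += p_t * math.log(p_t / max(q.get(tok, 0.0), eps))
    return total


reference = to_dist([
    {"token": "json", "log_prob": math.log(0.1)},
    {"token": "text", "log_prob": math.log(0.9)},
])
live = to_dist([
    {"token": "json", "log_prob": math.log(0.5)},
    {"token": "text", "log_prob": math.log(0.5)},
])
drift = kl_divergence(live, reference)
drifted = drift > 0.2  # same threshold as the command above
```

In this toy case the live distribution puts far more mass on `json` than the reference does, so the divergence lands well above 0.2 and the alarm condition holds.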

Input rows (one per token):

json
{"token": "json", "log_prob": -3.2}

Optional Slack / Discord webhooks fire on drift. Webhook URLs are SSRF-validated (loopback-only, RFC1918 / link-local / cloud-metadata IPs rejected). The command exits with code 3 on drift, for cron-friendly automation.
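An address check of the kind described can be sketched with the standard `ipaddress` module. The sketch covers only the RFC 1918 and link-local ranges named above (the common cloud-metadata address 169.254.169.254 falls inside link-local); loopback handling and hostname resolution are deliberately left out, and `is_rejected_ip` is a hypothetical helper rather than the shipped validator.

```python
import ipaddress

# Ranges treated as internal in this sketch.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",       # RFC 1918
        "172.16.0.0/12",    # RFC 1918
        "192.168.0.0/16",   # RFC 1918
        "169.254.0.0/16",   # link-local, includes 169.254.169.254
    )
]


def is_rejected_ip(host):
    """True if `host` is a literal IP inside one of the blocked ranges."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP. A real validator would resolve the hostname
        # and check every resulting address; that step is omitted here.
        return False
    return any(addr in net for net in BLOCKED_NETWORKS)


checks = {h: is_rejected_ip(h) for h in ("10.1.2.3", "169.254.169.254", "8.8.8.8")}
```

Checking the literal address is only part of SSRF defence: a hostname that resolves to an internal address would pass this sketch, which is why the resolution step matters in a production validator.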

Numbers

+219 new tests in v0.63.0 (9816 → 10035). Security: 1 CRITICAL (mSPRT sign error → Wald SPRT), 2 HIGH, 3 MEDIUM, 2 LOW.

See also

  • [Soup loop](/docs/soup-loop) — the HarvestFn half of the v0.58 production flywheel is now powered by soup ingest.
  • [Trace-to-preference](/docs/trace-to-preference) — convert normalised traces into DPO / KTO pairs.
  • [Eval design](/docs/eval-design) — derive a goal-conditioned eval suite from the active-sampled prompts.