## Smart Inference Server (v0.30.0)

`soup serve` graduated from a basic OpenAI-compatible wrapper into a production-grade serving stack. Speculative decoding, prefix caching, structured output, dynamic LoRA hot-swap, a continuous-batching dashboard, and OpenTelemetry tracing all ship in v0.30.0.
### Speculative decoding

Use a smaller draft model to speed up generation 2-3×.

```bash
# Transformers backend — uses HF assisted generation
soup serve --model ./output --speculative-decoding small-draft-model --spec-tokens 5

# vLLM backend — uses vLLM native speculative decoding
soup serve --model ./output --backend vllm --speculative-decoding small-draft-model

# Auto-pair: Soup picks the draft for you based on the target family
soup serve --model meta-llama/Llama-3.1-70B-Instruct --backend vllm --auto-spec
```

`--auto-spec` handles Llama 3.1 / 3.3 / 4, Qwen 2.5 / 3, Mistral Large, Mixtral, DeepSeek V3 / R1, and Gemma 2 / 3. Models without a known draft pairing print a yellow "no draft" note and fall back to standard decoding.
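Nothing changes on the client side: requests still go through the OpenAI-compatible API, and the draft model only affects latency. A minimal sketch, assuming the standard chat-completions route and reusing the model name from the auto-pair example above (prompt and `max_tokens` are placeholders):

```bash
# Speculative decoding is transparent to clients — same request, faster generation.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct",
       "messages": [{"role": "user", "content": "Summarise speculative decoding in one sentence."}],
       "max_tokens": 64}'
```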
### Prefix caching (vLLM)

For RAG and agent workloads with a shared system prompt:

```bash
soup serve --model ./output --backend vllm --prefix-cache
```

The first request with a given prefix warms the cache; subsequent requests skip the shared prefix compute entirely.
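For illustration, two requests that share a system prompt; the second reuses the cached prefix. The system prompt, user messages, and `model` value are placeholders, and the route is the standard OpenAI-compatible one:

```bash
# First request warms the prefix cache for this system prompt.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./output",
       "messages": [{"role": "system", "content": "You are a support agent for ExampleCo."},
                    {"role": "user",   "content": "Where is my order?"}]}'

# Second request shares the same system prompt, so the shared prefix compute is skipped.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./output",
       "messages": [{"role": "system", "content": "You are a support agent for ExampleCo."},
                    {"role": "user",   "content": "Cancel order 42."}]}'
```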
### Structured output

Constrain output to a JSON schema or regex pattern.

```bash
# JSON schema (file must live under cwd)
soup serve --model ./output --structured-output json --json-schema product.json

# Regex (length-capped at 2048 chars, null bytes rejected)
soup serve --model ./output --structured-output regex --regex-pattern '\d{3}-\d{4}'
```

Schemas serialised over 64 KB are rejected, and JSON schemas must declare a top-level `type` field. Constraints are validated at startup, not per-request.
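For example, a minimal `product.json` that satisfies the top-level `type` requirement; the fields themselves are illustrative, not a required shape:

```bash
# Write a small schema under cwd; "type" at the top level is mandatory.
cat > product.json <<'EOF'
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "number"},
    "in_stock": {"type": "boolean"}
  },
  "required": ["name", "price"]
}
EOF
```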
### Dynamic LoRA hot-swap

Switch the active adapter at runtime without restarting the server.

```bash
soup serve --model base-model --adapters chat=./chat-adapter code=./code-adapter
```

```bash
curl -X POST http://localhost:8000/v1/adapters/activate/chat
# → {"active": "chat", "status": "ok"}

curl -X POST http://localhost:8000/v1/adapters/deactivate
# → {"active": null, "status": "ok"}

curl http://localhost:8000/v1/adapters
# → {"adapters": [{"name": "chat", "active": true}, ...], "active": "chat"}
```

Adapter names must match `^[a-zA-Z0-9][a-zA-Z0-9-]*$`. Activate/deactivate is thread-safe behind a lock.
### Continuous-batching dashboard + `/metrics`

```bash
soup serve --model ./output --dashboard
```

```bash
curl http://localhost:8000/metrics
# {
#   "requests_total": 1234,
#   "tokens_generated_total": 456789,
#   "active_requests": 3,
#   "latency_p50_ms": 185.2,
#   "latency_p95_ms": 720.0,
#   "latency_samples": 1000
# }
```

Latency percentiles are computed from the last 1000 requests; counters include failure paths, so the dashboard reflects true reliability.
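To keep an eye on tail latency during a load test, one option is to poll the endpoint with `jq` (assumes `jq` and `watch` are installed; the field names come from the sample payload above):

```bash
# Print p95 latency and in-flight request count every two seconds.
watch -n 2 'curl -s http://localhost:8000/metrics | jq "{p95: .latency_p95_ms, active: .active_requests}"'
```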
### OpenTelemetry tracing

Emit per-request spans to your OTLP collector.

```bash
pip install opentelemetry-sdk opentelemetry-exporter-otlp

soup serve --model ./output \
  --trace --trace-endpoint http://localhost:4317
```

OTLP endpoint hardening mirrors `HF_ENDPOINT`: scheme allowlist, plain HTTP only for loopback, and RFC 1918 / link-local / 0.0.0.0 addresses rejected via `ipaddress.ip_address`. A missing SDK is a no-op with a warning.
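If you do not already have a collector, one quick way to test locally is the stock OpenTelemetry Collector container; the image name and port below are the upstream defaults, not something Soup provides or configures:

```bash
# Run a local OTLP collector; 4317 is the default OTLP/gRPC port the --trace-endpoint above points at.
docker run --rm -p 4317:4317 otel/opentelemetry-collector:latest
```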
### DeepSpeed-MII backend

```bash
soup serve --model ./output --backend mii
```

Loopback-only CORS, `max_tokens` capped at 16384, streaming disabled (no SSE for MII v0.x). Pipeline crashes return a generic 500 with no stack-trace leak.
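Requests look the same as on the other backends, just without streaming; a sketch assuming the standard chat-completions route (prompt and `model` value are placeholders):

```bash
# MII backend: keep max_tokens at or below 16384 and leave streaming off.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./output",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 256,
       "stream": false}'
```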
### Auto-quant picker

```bash
soup serve --model ./output --auto-quant
```

The picker API is registered; if no candidate clears `min_score`, live evaluation falls back to the highest-scored candidate so the server still binds.