Production pilot guide
chukei is a transparent proxy: your drivers keep their credentials and SQL and just point at a different host. Rollback is repointing them back — chukei holds no state your queries depend on.
1. Start from the conservative profile
Use config/customer-pilot.yaml. It enables the cache with a strict
determinism gate and 5% blame sampling, persists no result data at rest,
keeps suspend in suggest-only mode, and turns the router off. Change three
values: the TLS cert/key, the account locator, and your $/credit rate.
2. TLS
Clients must trust the certificate for the hostname they'll use
(e.g. chukei.internal.example.com) — an internal-CA cert is the normal
choice. JDBC enforces OCSP against the proxy cert; add ocspFailOpen=true
to the JDBC URL for the pilot. The Python connector and snowsql soft-fail
by default (validated against Snowflake 10.20).
chukei doctor --config chukei.yaml
# ✓ config ✓ tls ✓ upstream ✓ listen ✓ savings ✓ cache
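As a rough illustration of the JDBC note above, the sketch below assembles a pilot JDBC URL with ocspFailOpen=true appended. The helper name pilot_jdbc_url, the default port 443, and the example account value are assumptions made for this sketch and are not taken from chukei; use the port your chukei listener actually serves.

```python
from urllib.parse import urlencode


def pilot_jdbc_url(proxy_host: str, account: str, port: int = 443) -> str:
    """Build a Snowflake JDBC URL that targets the proxy host.

    The port default is an assumption for this sketch; set it to whatever
    your chukei listener is configured to use.
    """
    # ocspFailOpen=true lets the JDBC driver proceed when it cannot get an
    # OCSP response for the proxy certificate (for example, an internal CA).
    params = {"account": account, "ocspFailOpen": "true"}
    return f"jdbc:snowflake://{proxy_host}:{port}/?{urlencode(params)}"


if __name__ == "__main__":
    print(pilot_jdbc_url("chukei.internal.example.com", "myaccount"))
```

Keeping the URL in one helper makes the OCSP setting easy to remove once the pilot ends.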
3. Cut over a subset first
Pilot with one team or workload via explicit host override, not an account-wide DNS change. See Examples for per-driver snippets.
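A minimal sketch of an explicit host override for the Python connector, assuming you pass the result to snowflake.connector.connect, which accepts a host argument alongside account. The helper name connect_kwargs and the placeholder values are hypothetical. Leaving proxy_host unset yields the direct-to-Snowflake arguments, which is the same switch you would flip for a rollback.

```python
from typing import Optional


def connect_kwargs(account: str, user: str, password: str,
                   proxy_host: Optional[str] = None) -> dict:
    """Return keyword arguments for snowflake.connector.connect.

    With proxy_host set, the driver is pointed at the proxy; credentials and
    account stay unchanged. With proxy_host=None, no override is applied.
    """
    kwargs = {"account": account, "user": user, "password": password}
    if proxy_host is not None:
        kwargs["host"] = proxy_host  # explicit per-workload override
    return kwargs


# Usage (requires snowflake-connector-python, not imported here):
#   import snowflake.connector
#   conn = snowflake.connector.connect(
#       **connect_kwargs("myaccount", "pilot_user", "secret",
#                        proxy_host="chukei.internal.example.com"))
```

Driving the override from one setting per workload keeps the cutover scoped to the pilot team and makes the rollback a one-line change.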
4. Watch
:9090/metrics, Prometheus format:
| Signal | Threshold | Action |
|---|---|---|
| chukei_cache_blame_mismatches_total | > 0 | Page. Set CHUKEI_PLUGINS_CACHE_ENABLED=false, restart, file a bug with the fingerprint. |
| chukei_circuit_breaker_fast_fails_total | climbing | Snowflake-side trouble; check status. |
| p99 chukei_proxy_overhead_seconds | > 0.005 (5 ms) | File a bug with /metrics output. |
| /healthz | non-200 | Restart policy handles it; if flapping, roll back. |
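If you do not yet have alert rules in place, the two counter rows of the table can be checked by hand. The sketch below parses Prometheus text-format output and compares two scrapes. The function names and alert strings are made up for illustration, and the parser is deliberately simplified: it sums every sample that shares a metric name and ignores label contents.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus text exposition."""
    totals: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE
            continue
        if "{" in line:
            # name{labels} value [timestamp]; label values may contain spaces
            name = line[: line.index("{")]
            fields = line.rpartition("}")[2].split()
        else:
            name, *fields = line.split()
        if fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def pilot_alerts(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Apply the two counter thresholds from the table to a pair of scrapes."""
    alerts = []
    if current.get("chukei_cache_blame_mismatches_total", 0.0) > 0:
        alerts.append("page: blame mismatch, disable the cache plugin")
    key = "chukei_circuit_breaker_fast_fails_total"
    if current.get(key, 0.0) > previous.get(key, 0.0):
        alerts.append("fast fails climbing: check Snowflake status")
    return alerts
```

Feed it the body of two successive GET requests to :9090/metrics. The p99 overhead and /healthz rows are better left to your monitoring stack, since a quantile needs the histogram or summary series rather than a single counter.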
5. Rollback (rehearse once)
Point the driver host back at <account>.snowflakecomputing.com. Per-plugin
kill switches: CHUKEI_PLUGINS_CACHE_ENABLED=false,
CHUKEI_PLUGINS_REWRITE_ENABLED=false, CHUKEI_PLUGINS_SUSPEND_ENABLED=false.
Measured restart behaviour: a restart faster than the driver's retry budget (~10s for the Python connector) is invisible to running clients; longer outages surface as connection errors only after each query burns its retry budget. Existing sessions resume after restart without re-login.
Known pilot limitations
- Large chunked results are never cached (they pass through unmodified — validated to 200k rows).
- Suspend stays suggest-only in week one; enforce mode needs a
  CHUKEI_SUSPENDER service account.
- One instance per deployment (per-instance cache).
- The savings ledger deliberately under-claims (×0.7) and says so in every report.
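To make the ×0.7 under-claim concrete, here is a small worked sketch. It assumes the ledger multiplies an estimate of avoided credits by your $/credit rate and then by 0.7. The exact formula chukei uses is not stated in this guide, so treat the function and its inputs as illustrative only.

```python
UNDER_CLAIM_FACTOR = 0.7  # the under-claim multiplier named in this guide


def claimed_savings(credits_avoided: float, dollars_per_credit: float) -> float:
    """Dollar savings a report would claim, under the assumed formula.

    Assumption: claimed = credits avoided * $/credit * 0.7, rounded to cents.
    """
    raw_estimate = credits_avoided * dollars_per_credit
    return round(raw_estimate * UNDER_CLAIM_FACTOR, 2)


if __name__ == "__main__":
    # 100 avoided credits at $3.00 per credit: raw estimate $300,
    # claimed figure roughly $210 after the 0.7 factor.
    print(claimed_savings(100, 3.00))
```

Under this assumption, the claimed figure is 30% below the raw estimate.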