Confidential · Hypernym API Status Report · 2026-05-12

FOR: HYPERNYM OPS / CTO TEAM

Hypernym API Status — 2026-05-12

All three Hypernym APIs tested · operational state per service

Smoke-tested Hypernym Omnifact, Compressed Repo Analyze, and the new Modulum API. All three respond to requests (no DNS / TLS / hard outages) but each has a distinct operational issue blocking production-grade use. Three concrete asks for ops at the bottom.

Omnifact
zephyr.hypernym.ai/api/omnifact
alive · worker stalled

Begin call returns 202 in 1.1s. Job queued but worker never picks it up. elapsed_seconds: 0.0 after 42s of polling.

Repo Analyze
zephyr-b-gpu.hypernym.ai/api/repo/analyze
401 · invalid key

Endpoint healthy but rejects current keychain key. Needs HYPERNYM_REPO_INGEST_API_KEY (pending rotation since 2026-05-09).

Modulum
gemma4.hypernym.ai/v1
200 OK · prefill broken

Returns valid responses. Decode normal (38-60 tok/s). Prefill regressed ~190× from yesterday: 1 tok/s today vs 193 tok/s on 2026-05-11.

01 · Detailed test results

Per-API observations

Omnifact API · zephyr.hypernym.ai

# POST /api/omnifact/begin
$ curl -X POST https://zephyr.hypernym.ai/api/omnifact/begin \
    -H "X-API-Key: $HYPERNYM_API_KEY" \
    -d '{"text": "...50-token corpus..."}'
→ 202 Accepted in 1.13s
{
  "experiment_id": "f4c50888-4493-4a17-8cab-47ba41baf150",
  "status": "PROCESSING",
  "extractor": "fc",
  "content_hash": "49808f54...",
  "input_tokens": 50
}

# GET /api/omnifact/{id}/status — polled 7× over 42s
{
  "status": "PROCESSING",
  "phase": "PSPAN",
  "phase_number": 1,
  "total_phases": 4,
  "progress_pct": 0,
  "timing": { "elapsed_seconds": 0.0 }    # ← timer never starts
}

Diagnosis: the begin endpoint queues experiments correctly and returns valid experiment IDs. The worker pool that processes them isn't dequeuing. elapsed_seconds: 0.0 after 42 seconds confirms the worker timer has never started — the job is sitting in the queue waiting for a processor. Likely a stalled worker process or zero workers running.
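
The stall signature reproduces with a short poll loop · a sketch against the status response shape shown above, using this test's experiment ID:

# Poll ~7× over 42s and watch the worker timer.
EXP="f4c50888-4493-4a17-8cab-47ba41baf150"
for i in $(seq 1 7); do
  ELAPSED=$(curl -sS "https://zephyr.hypernym.ai/api/omnifact/$EXP/status" \
      -H "X-API-Key: $HYPERNYM_API_KEY" | jq -r '.timing.elapsed_seconds')
  echo "poll $i: elapsed_seconds=$ELAPSED"
  sleep 6
done
# elapsed_seconds stuck at 0.0 on every poll ⇒ queued but never picked up by a worker.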

Compressed Repo Analyze · zephyr-b-gpu.hypernym.ai

# GET /api/repo/analyze
$ curl https://zephyr-b-gpu.hypernym.ai/api/repo/analyze \
    -H "X-API-Key: $HYPERNYM_API_KEY"
→ 401 Unauthorized in 1.14s
{ "error": "Invalid API key" }

Diagnosis: the standard HYPERNYM_API_KEY (which works for Omnifact) is rejected here. Per session memory dated 2026-05-09, Repo Analyze uses a separate key HYPERNYM_REPO_INGEST_API_KEY that has been pending rotation. The endpoint itself is healthy — it responds with a structured 401 in 1.1s, confirming reachability and TLS termination.

Modulum API · gemma4.hypernym.ai

# POST /v1/chat/completions — "reply with 'live'" (20-token prompt)
$ curl -X POST https://gemma4.hypernym.ai/v1/chat/completions \
    -H "Authorization: Bearer $MODULUM_API_KEY" \
    -d '{"messages":[{"role":"user","content":"Reply with the single word: live"}],"max_tokens":10}'
→ 200 OK in 22.0s
{
  "choices": [{"message": {"content": "live"}, "finish_reason": "stop"}],
  "model": "gemma-4-31B-it-Q4_K_M.gguf",
  "usage": {"prompt_tokens": 20, "completion_tokens": 4, "total_tokens": 24},
  "timings": {
    "prompt_ms": 20925.1,           // ← prefill broken
    "prompt_per_second": 1.0,       // ← 1 tok/s (was 193 yesterday)
    "predicted_ms": 66.7,
    "predicted_per_second": 60.0,   // decode is normal
    "draft_n": 2,
    "draft_n_accepted": 2           // speculative drafting working
  }
}

Diagnosis: the model serves correct responses and speculative drafting (draft_n=2, 100% acceptance) is working as designed. Decode throughput is normal at 60 tok/s. The regression is entirely in prefill — yesterday's 193 tokens/sec prefill is now 1 token/sec. Same Tundra backend, same model file. This is the same backend that returned 503 on 2026-05-10 and timed out my BABILong probe at 64k+ on 2026-05-11. The prefill regression is what's blocking all longer-context validation.
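
To confirm the regression is prefill-specific at every prompt size (not a long-context artifact), a probe like the sketch below can be run; it assumes the timings block shown in the response above:

# Decode speed should stay flat while prompt_per_second exposes the prefill regression.
for WORDS in 20 200 2000; do
  PROMPT=$(printf 'word %.0s' $(seq 1 $WORDS))
  BODY=$(jq -n --arg p "$PROMPT" '{messages:[{role:"user",content:$p}],max_tokens:5}')
  curl -sS https://gemma4.hypernym.ai/v1/chat/completions \
      -H "Authorization: Bearer $MODULUM_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$BODY" \
      | jq "{words: $WORDS, prefill: .timings.prompt_per_second, decode: .timings.predicted_per_second}"
done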

What 1 tok/s prefill means in practice

At 1 tok/s prefill, processing common context lengths takes (a quick arithmetic check follows the list):

32k context = 32,000 s = 8.9 hours just to read the input
64k context = 64,000 s = 17.8 hours
128k context = 128,000 s = 35.6 hours
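
A one-liner reproducing these figures · straight division, nothing assumed beyond the measured 1 tok/s:

# tokens / (tok/s) / 3600 = hours
awk 'BEGIN { for (c = 32000; c <= 128000; c *= 2)
             printf "%6d tokens / 1 tok/s = %.1f hours\n", c, c / 3600 }'
#  32000 tokens / 1 tok/s =  8.9 hours
#  64000 tokens / 1 tok/s = 17.8 hours
# 128000 tokens / 1 tok/s = 35.6 hours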

This arithmetic is consistent with yesterday's BABILong probe completing at 32k but timing out at 64k+ (even at yesterday's 193 tok/s, a 64k prefill takes ~330 s, likely past typical gateway timeouts; at today's 1 tok/s it is hopeless). The BABILong +9pp at 128k claim cannot be reproduced via the public API in its current state. Internal benchmarks presumably ran on different infrastructure with intact prefill performance.

02 · Yesterday vs today

Modulum performance regression timeline

Same endpoint, same model file, same auth key. Only variable is time.

Metric                        | 2026-05-10       | 2026-05-11             | 2026-05-12 (today)
Endpoint reachable            | 503 backend down | 200 OK                 | 200 OK
Total latency · short prompt  | n/a (503)        | 1.2s                   | 22.0s
Prefill speed (tok/s)         | n/a              | 193                    | 1.0
Decode speed (tok/s)          | n/a              | 57                     | 38-60
Speculative draft acceptance  | n/a              | 2/2 (100%)             | 2/2 short · 23/42 (55%) long
BABILong 32k retrieval        | n/a              | 5/5 = 100%             | untested today (probed yesterday)
BABILong 64k retrieval        | n/a              | 0/5 (timeouts + 500s)  | not retested · prefill regression makes it worse
BABILong 128k retrieval       | n/a              | 0/3 (timeouts + 500s)  | not retested · prefill regression makes it worse

Two-day pattern: on 2026-05-10 the backend was down (503). On 2026-05-11 it came back at full performance (193 tok/s prefill, 1.2s short-prompt latency, all BABILong 32k samples correct). On 2026-05-12 the endpoint is up but prefill is running at 0.5% of yesterday's speed. Something changed in the Tundra deployment between 2026-05-11 and 2026-05-12; worth checking deploy logs / kernel config / GPU memory pressure in that window.

03 · The "old Hypernym problem" pattern

Three distinct failure modes · same shape

Each of the three APIs responds to requests. None is hard-down. But each is operationally non-functional for production use:

Omnifact · accepts jobs that no worker ever processes
Repo Analyze · reachable but locked behind a key rotation pending since 2026-05-09
Modulum · serves correct tokens, but prefill is too slow for any long-context use

The common factor: the API code and infrastructure are functional, but the operational state (worker pools, key rotation, prefill performance) drifts between healthy and broken with no obvious trigger. This is the "old Hypernym problem": services that look up in monitoring but fail when actually exercised.

04 · Three concrete asks for ops

What needs to happen for production-grade access

Ask 1 · Restart Omnifact worker pool

Job f4c50888-4493-4a17-8cab-47ba41baf150 queued at zephyr.hypernym.ai, never dequeued. Check worker pool health; restart if stalled. If the worker process is alive but blocked, check queue config / dead-letter queue / Redis (or whatever message broker is in front of it).
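
If it helps triage, a queue-depth check could look like the sketch below. The broker, queue name, and worker process name are guesses, not confirmed:

# Hypothetical names throughout; substitute whatever the deployment actually uses.
redis-cli llen omnifact:jobs          # queue depth climbing with no consumers ⇒ stalled
pgrep -af omnifact-worker \
    || echo "no worker processes running"   # dead workers vs. alive-but-blocked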

Ask 2 · Provide rotated HYPERNYM_REPO_INGEST_API_KEY

The current key in our keychain (the one that works for Omnifact, tested against Repo Analyze) returns 401 on zephyr-b-gpu.hypernym.ai/api/repo/analyze. Session memory dated 2026-05-09 notes the rotation has been pending. Please provide the new key value so we can update the macOS keychain entry:

security add-generic-password -U -s HYPERNYM_REPO_INGEST_API_KEY -a $USER -w "<new-key>"
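
Once the rotation lands, a read-back-and-retry confirms both the keychain entry and the endpoint in one step:

# Read the rotated key back from the keychain and retry the failing call.
REPO_KEY=$(security find-generic-password -s HYPERNYM_REPO_INGEST_API_KEY -a $USER -w)
curl -sS https://zephyr-b-gpu.hypernym.ai/api/repo/analyze \
    -H "X-API-Key: $REPO_KEY" -w "%{http_code}\n"
# 401 until the new key is live · expect 200 afterwards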

Ask 3 · Diagnose Tundra prefill regression

Same endpoint, same model file, prefill dropped from 193 tok/s on 2026-05-11 to 1 tok/s on 2026-05-12. Decode is unaffected (38-60 tok/s consistent). Suspect causes: (a) GPU memory pressure forcing CPU-side prefill, (b) batch-size config regression, (c) cache-line alignment / KV cache config issue, (d) inadvertent debug-mode flag flipped. Check deploy logs between 2026-05-11 21:00 UTC and 2026-05-12 19:00 UTC — that window contains the regression event.
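
A starting point for that window check, assuming a systemd-managed Tundra service (the unit name is a guess); the nvidia-smi line covers suspect (a):

# Scan the regression window for deploys/restarts/CUDA errors. Unit name hypothetical.
journalctl -u tundra-gemma4 \
    --since "2026-05-11 21:00:00 UTC" --until "2026-05-12 19:00:00 UTC" \
    | grep -iE "deploy|restart|config|cuda|oom|offload"
# Suspect (a), GPU memory pressure forcing CPU-side prefill: check headroom directly.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv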

05 · How to verify when fixed

Smoke-test commands for each service

Run these after each ops change to confirm the fix.

# Omnifact — should complete within ~30s
EXP=$(curl -sS -X POST https://zephyr.hypernym.ai/api/omnifact/begin \
    -H "X-API-Key: $HYPERNYM_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"text":"The sky is blue. Water boils at 100C."}' \
    | jq -r .experiment_id)
sleep 30
curl -sS https://zephyr.hypernym.ai/api/omnifact/$EXP/status \
    -H "X-API-Key: $HYPERNYM_API_KEY" | jq .status
# expect "DONE" not "PROCESSING"

# Repo Analyze — should return 200, not 401
curl -sS https://zephyr-b-gpu.hypernym.ai/api/repo/analyze \
    -H "X-API-Key: $HYPERNYM_REPO_INGEST_API_KEY" -w "%{http_code}\n"
# expect 200

# Modulum — prefill should be ~190 tok/s
curl -sS https://gemma4.hypernym.ai/v1/chat/completions \
    -H "Authorization: Bearer $MODULUM_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"reply ok"}],"max_tokens":5}' \
    | jq '.timings.prompt_per_second'
# expect > 100 tok/s · 1.0 means prefill still broken

06 · Why this matters now

External validation depends on production-grade API

Hypernym's published benchmark (BABILong +9pp at 128k on Gemma 4 31B + Modulum, 2026-05-08 proof doc) is structurally valid. The architecture works. But external customers / partners attempting to reproduce the benchmark via the public API will hit the prefill regression and conclude the system doesn't work as advertised.

The R19 build proposal for Substrate Delta Masks + MTP compound assumes a working Tundra backend as the production validation surface. Phase 1 (baseline reproduction) needs the API to actually serve 64k+ contexts in finite time. Phase 5 (production hardening) was already scoped as critical-path; this latest regression confirms why.

Distribution timing: IHC benchmarking, Chris-direction R19 doc, and any Year-1 commercial wedge (Hypernym Router, Reasoning State SDK, Cursor partnership) all require the public API to be production-grade. The three asks above are the unblock-list.