Ingestion Failure Modes
Job-backed ingestion (/api/ingest, /api/ingest/upload)
- job transitions to failed; the error message is stored in job state
- websocket emits a terminal error event
Direct ingestion (/api/ingest/url)
- often returns a payload-level status: "error"; the caller must inspect the response body, not only the HTTP status
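Because /api/ingest/url can report failure inside a successful HTTP response, a client should check both layers. A minimal sketch, assuming a payload shape with a top-level status field (the exact response schema is an assumption, not documented here):

```python
def ingest_succeeded(http_status: int, payload: dict) -> bool:
    """Treat the ingestion as failed if either the HTTP layer or the
    payload-level status reports an error (hypothetical response shape)."""
    if http_status >= 400:
        return False
    # A 200 response can still carry a payload-level error.
    return payload.get("status") != "error"
```

The point is that checking `http_status == 200` alone would miss payload-level errors entirely.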
Relevance gate LLM failure
When the LLM provider (OpenAI) is unavailable during relevance assessment, the gate rejects the source (fail closed), and the response includes rejection_reason: "LLM call failed; rejected (fail closed)". This prevents low-quality content from silently entering the knowledge base during outages. Use force: true to bypass the relevance gate when the provider is known to be down.
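The fail-closed behaviour can be sketched as follows; the function and field names here are illustrative, not the actual implementation:

```python
def relevance_gate(source_text: str, assess_llm, force: bool = False) -> dict:
    """Fail-closed relevance check: any LLM provider error counts as a
    rejection unless the caller bypasses the gate with force=True."""
    if force:
        # Explicit bypass for known provider outages.
        return {"accepted": True, "rejection_reason": None}
    try:
        relevant = assess_llm(source_text)  # may raise if the provider is down
    except Exception:
        return {"accepted": False,
                "rejection_reason": "LLM call failed; rejected (fail closed)"}
    if not relevant:
        return {"accepted": False, "rejection_reason": "not relevant"}
    return {"accepted": True, "rejection_reason": None}
```

Failing closed trades availability for quality: during an outage nothing gets in without an explicit operator override.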
Partial pipeline failure (rollback)
If learnings extraction fails after chunks are already stored in ChromaDB + FTS5, those chunks are rolled back (deleted from both indexes). The job transitions to failed. This prevents orphaned chunks with no corresponding learnings metadata.
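The rollback logic amounts to compensating deletes against both indexes when the downstream step fails. A sketch under assumed store interfaces (the add/delete method names are hypothetical):

```python
def ingest_chunks(chunks, vector_store, fts_index, extract_learnings):
    """Store chunks in both indexes, then extract learnings; on extraction
    failure, delete the just-stored chunks so neither index keeps orphans."""
    ids = [vector_store.add(c) for c in chunks]
    for c in chunks:
        fts_index.add(c)
    try:
        return extract_learnings(chunks)
    except Exception:
        # Compensating deletes: no orphaned chunks without learnings metadata.
        for i in ids:
            vector_store.delete(i)
        for c in chunks:
            fts_index.delete(c)
        raise  # surfaced upstream; the job then transitions to failed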
Concurrent duplicate ingestion
Per-source mutual exclusion prevents two concurrent requests from ingesting the same source. The second request receives "Ingestion already in progress for {source_id}". Wait for the first ingestion to complete, then retry if needed.
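Per-source mutual exclusion can be sketched with a registry of non-blocking locks keyed by source id (a minimal in-process sketch; the real mechanism is not specified here):

```python
import threading

_locks: dict = {}
_registry_lock = threading.Lock()

def start_ingestion(source_id: str) -> None:
    """Acquire the per-source lock without blocking; raise if it is held."""
    with _registry_lock:
        lock = _locks.setdefault(source_id, threading.Lock())
    if not lock.acquire(blocking=False):
        raise RuntimeError(f"Ingestion already in progress for {source_id}")

def finish_ingestion(source_id: str) -> None:
    """Release the per-source lock so a retry can proceed."""
    _locks[source_id].release()
```

Acquiring non-blocking (rather than waiting) is what turns the second request into an immediate error the caller can surface and retry later.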
Retrieval Degradation Modes
| Condition | Behavior |
|---|---|
| FTS query issue | lexical branch can degrade to empty set |
| rerank provider unavailable | retry 2x with backoff (0.5s, 1.0s) for transient errors (429, timeout, connection); on exhaustion, fall back to vector similarity scores as degraded proxy; reranker_available: false threaded through response; provenance note warns user |
| rerank provider permanent error | no retry; immediate fallback to vector similarity scores |
| related-source cache issues | connection endpoint can return empty results |
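The rerank fallback rows above can be sketched as a retry loop: transient errors get two retries with the stated 0.5s/1.0s backoff, permanent errors fall back immediately, and exhaustion degrades to vector similarity scores. Exception classes and the result shape are illustrative (an HTTP 429 would be mapped to a transient error class in practice):

```python
import time

TRANSIENT = (TimeoutError, ConnectionError)  # stand-ins for 429/timeout/connection

def rerank_with_fallback(candidates, rerank, vector_scores, sleep=time.sleep):
    """Retry transient rerank errors twice (0.5s, 1.0s backoff); on
    exhaustion or any permanent error, fall back to vector similarity."""
    backoffs = [0.5, 1.0]
    for attempt in range(3):  # 1 initial try + 2 retries
        try:
            return {"scores": rerank(candidates), "reranker_available": True}
        except TRANSIENT:
            if attempt < 2:
                sleep(backoffs[attempt])
        except Exception:
            break  # permanent error: no retry, immediate fallback
    # Degraded proxy: vector scores, with reranker_available threaded through.
    return {"scores": vector_scores, "reranker_available": False}
```

Threading `reranker_available: False` into the result is what lets the response layer attach the provenance warning instead of failing the whole retrieval.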
Chat Failure Modes
| Failure | Expected Behavior |
|---|---|
| input blocked by guardrails | request rejected with validation-style error |
| query rewriter returns unexpected format | fallback to RETRIEVE with raw user message (no rewrite) |
| query rewriter LLM unavailable | fallback to RETRIEVE with raw user message |
| consecutive clarifications (rewriter loops) | hard breaker forces RETRIEVE after any non-retrieval turn; logged as clarify_breaker_forced_retrieve |
| retrieval yields weak/empty evidence | deterministic follow-up path (library/search/rephrase options) |
| provider failure in generation path | fallback provider attempt, then controlled degradation/failure |
| stream internal exception | terminal error SSE event |
| client disconnect | stream exits without synthetic terminal rewrite |
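The rewriter rows above share one invariant: any failure of the query-rewrite step degrades to plain RETRIEVE with the raw user message, and the clarification breaker caps non-retrieval turns. A sketch with an assumed rewriter output shape (`action`/`query`/`question` keys are hypothetical):

```python
def plan_turn(user_message, rewrite_llm, consecutive_clarifies=0, breaker=1):
    """Pick the next action; rewriter errors or malformed output fall back
    to RETRIEVE with the raw message, and the breaker stops clarify loops."""
    if consecutive_clarifies >= breaker:
        # clarify_breaker_forced_retrieve: never clarify twice in a row
        return ("RETRIEVE", user_message)
    try:
        out = rewrite_llm(user_message)  # may raise if the provider is down
    except Exception:
        return ("RETRIEVE", user_message)
    if not isinstance(out, dict) or out.get("action") not in ("RETRIEVE", "CLARIFY"):
        return ("RETRIEVE", user_message)  # unexpected format: no rewrite
    if out["action"] == "CLARIFY":
        return ("CLARIFY", out.get("question", ""))
    return ("RETRIEVE", out.get("query", user_message))
```

Every failure branch converges on the same safe default, so a broken rewriter degrades quality (no rewrite) rather than availability.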
Platform Safeguards
- API/chat rate limiting (429 + Retry-After)
- request timing header (X-Response-Time)
- deep health check endpoint for dependency status
- fallback-aware LLM client with circuit-breaker behavior
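Since rate-limited responses carry Retry-After, a well-behaved client should honour it rather than retrying immediately. A minimal sketch of the client-side decision (function name and default are ours; only the numeric form of Retry-After is handled here, not the HTTP-date form):

```python
def retry_after_delay(status: int, headers: dict, default: float = 1.0):
    """Return seconds to wait before retrying a 429 response, or None
    if the response is not rate-limited and no retry is needed."""
    if status != 429:
        return None
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing, or an HTTP-date we don't parse in this sketch.
        return default
```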