Ingestion Failure Modes

Job-backed ingestion (/api/ingest, /api/ingest/upload)

  • job transitions to failed
  • error message stored in job state
  • websocket emits terminal error event
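The terminal contract above can be consumed with a small state check. A minimal sketch, assuming the job payload carries `status` and `error` fields and that the terminal states are named `completed` and `failed` (these names are illustrative, not a confirmed schema):

```python
# Hedged sketch of interpreting job-backed ingestion state. The field names
# ("status", "error") and the terminal-state names are assumptions drawn
# from the failure contract above, not a confirmed schema.

TERMINAL_STATES = {"completed", "failed"}

def summarize_job(job: dict) -> str:
    """Map a job-state payload to a human-readable outcome."""
    status = job.get("status")
    if status not in TERMINAL_STATES:
        return f"pending ({status})"
    if status == "failed":
        # On failure, the error message is stored in job state.
        return f"failed: {job.get('error', '<no error recorded>')}"
    return "completed"
```

The same check applies whether the terminal state arrives by polling or via the websocket's terminal error event.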

Direct ingestion (/api/ingest/url)

  • can return HTTP 200 with a payload-level status: "error"
  • callers must inspect the response body, not only the HTTP status code
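Because success cannot be inferred from the HTTP status alone, a caller needs both checks. A minimal sketch (the `status` field name is taken from the contract above; the helper itself is hypothetical):

```python
def direct_ingest_succeeded(http_status: int, body: dict) -> bool:
    """Direct ingestion can signal failure in the payload even when the
    HTTP status looks fine, so both layers must be checked."""
    return 200 <= http_status < 300 and body.get("status") != "error"
```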

Relevance gate LLM failure

When the LLM provider (OpenAI) is unavailable during relevance assessment, the gate rejects the source (fail closed). The response includes rejection_reason: "LLM call failed; rejected (fail closed)". This prevents low-quality content from silently entering the knowledge base during outages. Use force: true to bypass the relevance gate when the provider is known to be down.
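A caller deciding whether to retry with `force: true` should distinguish the fail-closed outage rejection from a genuine relevance rejection. A hedged sketch, assuming the rejection reason string is exactly as documented above and that outage status is checked independently (the helper name is hypothetical):

```python
FAIL_CLOSED_REASON = "LLM call failed; rejected (fail closed)"

def should_force_reingest(response: dict, provider_known_down: bool) -> bool:
    """Bypass the relevance gate (force: true) only when the rejection came
    from the fail-closed outage path, not from a genuine relevance judgment,
    and the provider outage is independently confirmed."""
    return (
        response.get("rejection_reason") == FAIL_CLOSED_REASON
        and provider_known_down
    )
```

Forcing on a genuine relevance rejection would defeat the gate, so the outage check matters.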

Partial pipeline failure (rollback)

If learnings extraction fails after chunks are already stored in ChromaDB + FTS5, those chunks are rolled back (deleted from both indexes). The job transitions to failed. This prevents orphaned chunks with no corresponding learnings metadata.
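The rollback pattern is store-then-extract with compensating deletion. A minimal sketch with the storage and extraction steps injected as callables (the function shape is illustrative, not the actual pipeline API):

```python
def ingest_with_rollback(chunks, store_chunks, extract_learnings, delete_chunks):
    """Store chunks, then extract learnings; if extraction fails, delete the
    just-stored chunks from both indexes so no orphaned chunks remain."""
    chunk_ids = store_chunks(chunks)       # writes to ChromaDB + FTS5
    try:
        return extract_learnings(chunk_ids)
    except Exception:
        delete_chunks(chunk_ids)           # roll back both indexes
        raise                              # job then transitions to failed
```

Re-raising after the compensating delete is what lets the job layer record the error and mark the job failed.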

Concurrent duplicate ingestion

Per-source mutual exclusion prevents two concurrent requests from ingesting the same source. The second request receives "Ingestion already in progress for {source_id}". Wait for the first ingestion to complete, then retry if needed.
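Per-source mutual exclusion can be sketched with a non-blocking lock registry. A minimal in-process illustration (the real implementation may differ, e.g. for multi-process deployments):

```python
import threading

_locks: dict[str, threading.Lock] = {}
_registry = threading.Lock()

def begin_ingestion(source_id: str) -> None:
    """Acquire the per-source lock, or fail fast if one is already held."""
    with _registry:
        lock = _locks.setdefault(source_id, threading.Lock())
    if not lock.acquire(blocking=False):
        raise RuntimeError(f"Ingestion already in progress for {source_id}")

def end_ingestion(source_id: str) -> None:
    _locks[source_id].release()
```

The non-blocking acquire is what turns the second concurrent request into an immediate "already in progress" error rather than a silent queue.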

Retrieval Degradation Modes

Condition | Behavior
--- | ---
FTS query issue | lexical branch can degrade to an empty set
rerank provider unavailable | retry 2x with backoff (0.5s, 1.0s) for transient errors (429, timeout, connection); on exhaustion, fall back to vector similarity scores as a degraded proxy; reranker_available: false is threaded through the response; a provenance note warns the user
rerank provider permanent error | no retry; immediate fallback to vector similarity scores
related-source cache issues | connection endpoint can return empty results
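The rerank retry/fallback rows above can be sketched as one function. The backoff schedule, transient-error classes, and `reranker_available` flag are taken from the table; the error type and function shape are illustrative:

```python
import time

TRANSIENT = {"429", "timeout", "connection"}
BACKOFFS = (0.5, 1.0)

class RerankError(Exception):
    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind

def rerank_with_fallback(rerank, vector_scores, sleep=time.sleep):
    """Return (scores, reranker_available). Transient errors are retried
    twice with 0.5s/1.0s backoff; permanent errors fall back immediately.
    Either way the fallback is the vector similarity scores."""
    for attempt in range(len(BACKOFFS) + 1):
        try:
            return rerank(), True
        except RerankError as exc:
            if exc.kind not in TRANSIENT or attempt == len(BACKOFFS):
                break
            sleep(BACKOFFS[attempt])
    return vector_scores, False
```

The returned flag is what gets threaded through the response so the provenance note can warn the user.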

Chat Failure Modes

Failure | Expected Behavior
--- | ---
input blocked by guardrails | request rejected with a validation-style error
query rewriter returns unexpected format | fall back to RETRIEVE with the raw user message (no rewrite)
query rewriter LLM unavailable | fall back to RETRIEVE with the raw user message
consecutive clarifications (rewriter loops) | hard breaker forces RETRIEVE after any non-retrieval turn; logged as clarify_breaker_forced_retrieve
retrieval yields weak/empty evidence | deterministic follow-up path (library/search/rephrase options)
provider failure in generation path | fallback provider attempt, then controlled degradation/failure
stream internal exception | terminal error SSE event
client disconnect | stream exits without a synthetic terminal rewrite
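The rewriter-related rows above compose into one decision step: malformed or missing rewriter output falls back to RETRIEVE with the raw message, and the hard breaker overrides any second consecutive non-retrieval action. A hedged sketch (field names `action` and `query` are assumptions):

```python
def resolve_turn(rewriter_output, raw_message, prev_turn_non_retrieval, log):
    """Decide the next chat action from rewriter output, applying the
    unexpected-format fallback and the clarify hard breaker."""
    if not isinstance(rewriter_output, dict) or "action" not in rewriter_output:
        return "RETRIEVE", raw_message          # unexpected format / LLM down
    if prev_turn_non_retrieval and rewriter_output["action"] != "RETRIEVE":
        log("clarify_breaker_forced_retrieve")  # breaker trips, and is logged
        return "RETRIEVE", raw_message
    return rewriter_output["action"], rewriter_output.get("query", raw_message)
```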

Platform Safeguards

  • API/chat rate limiting (429 + Retry-After)
  • request timing header (X-Response-Time)
  • deep health check endpoint for dependency status
  • fallback-aware LLM client with circuit-breaker behavior
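Clients should honor the 429 + Retry-After contract rather than retrying blindly. A minimal sketch of the client-side half (the helper and its default are illustrative; the header semantics are standard seconds-valued Retry-After):

```python
def retry_after_seconds(status: int, headers: dict, default: float = 1.0):
    """On a 429, honor the Retry-After header (seconds); return None when
    no rate-limit backoff is required."""
    if status != 429:
        return None
    try:
        return float(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        return default
```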

Operational Guidance

Treat deterministic and provenance contracts as critical correctness invariants. When a failure path touches these contracts, require tests and docs updates in the same change.