Ingestion Failure Modes
Job-backed ingestion (/api/ingest, /api/ingest/upload)
- job transitions to failed; the error message is stored in job state
- websocket emits a terminal error event
Direct ingestion (/api/ingest/url)
- often returns a payload-level status: "error"; the caller must inspect the response body, not only the HTTP status
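Because /api/ingest/url can report failure inside a successful HTTP response, a client should check both layers. A minimal sketch, assuming a payload shape with a top-level status field (the exact response schema is an assumption, not documented here):

```python
def ingest_succeeded(http_status: int, payload: dict) -> bool:
    """Treat the ingestion as failed if either the HTTP layer or the
    payload-level status reports an error (hypothetical response shape)."""
    if http_status >= 400:
        return False
    # A 200 response can still carry a payload-level error.
    return payload.get("status") != "error"
```

The point is that checking `http_status == 200` alone would miss payload-level errors entirely.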
Relevance gate LLM failure
When the LLM provider (OpenAI) is unavailable during relevance assessment, the gate rejects the source (fail closed), and the response includes rejection_reason: "LLM call failed; rejected (fail closed)". This prevents low-quality content from silently entering the knowledge base during outages. Use force: true to bypass the relevance gate when the provider is known to be down.
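The fail-closed behaviour can be sketched as follows; the function and field names here are illustrative, not the actual implementation:

```python
def relevance_gate(source_text: str, assess_llm, force: bool = False) -> dict:
    """Fail-closed relevance check: any LLM provider error counts as a
    rejection unless the caller bypasses the gate with force=True."""
    if force:
        # Explicit bypass for known provider outages.
        return {"accepted": True, "rejection_reason": None}
    try:
        relevant = assess_llm(source_text)  # may raise if the provider is down
    except Exception:
        return {"accepted": False,
                "rejection_reason": "LLM call failed; rejected (fail closed)"}
    if not relevant:
        return {"accepted": False, "rejection_reason": "not relevant"}
    return {"accepted": True, "rejection_reason": None}
```

Failing closed trades availability for quality: during an outage nothing gets in without an explicit operator override.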
Partial pipeline failure (rollback)
If learnings extraction fails after chunks are already stored in ChromaDB + FTS5, those chunks are rolled back (deleted from both indexes). The job transitions to failed. This prevents orphaned chunks with no corresponding learnings metadata.
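The rollback logic amounts to compensating deletes against both indexes when the downstream step fails. A sketch under assumed store interfaces (the add/delete method names are hypothetical):

```python
def ingest_chunks(chunks, vector_store, fts_index, extract_learnings):
    """Store chunks in both indexes, then extract learnings; on extraction
    failure, delete the just-stored chunks so neither index keeps orphans."""
    ids = [vector_store.add(c) for c in chunks]
    for c in chunks:
        fts_index.add(c)
    try:
        return extract_learnings(chunks)
    except Exception:
        # Compensating deletes: no orphaned chunks without learnings metadata.
        for i in ids:
            vector_store.delete(i)
        for c in chunks:
            fts_index.delete(c)
        raise  # surfaced upstream; the job then transitions to failed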
Concurrent duplicate ingestion
Per-source mutual exclusion prevents two concurrent requests from ingesting the same source. The second request receives "Ingestion already in progress for {source_id}". Wait for the first ingestion to complete, then retry if needed.
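Per-source mutual exclusion can be sketched with a registry of non-blocking locks keyed by source id (a minimal in-process sketch; the real mechanism is not specified here):

```python
import threading

_locks: dict = {}
_registry_lock = threading.Lock()

def start_ingestion(source_id: str) -> None:
    """Acquire the per-source lock without blocking; raise if it is held."""
    with _registry_lock:
        lock = _locks.setdefault(source_id, threading.Lock())
    if not lock.acquire(blocking=False):
        raise RuntimeError(f"Ingestion already in progress for {source_id}")

def finish_ingestion(source_id: str) -> None:
    """Release the per-source lock so a retry can proceed."""
    _locks[source_id].release()
```

Acquiring non-blocking (rather than waiting) is what turns the second request into an immediate error the caller can surface and retry later.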
Retrieval Degradation Modes
| Condition | Behavior |
|---|---|
| FTS query issue | lexical branch can degrade to empty set |
| rerank provider unavailable | retry 2x with backoff (0.5s, 1.0s) for transient errors (429, timeout, connection); on exhaustion, fall back to vector similarity scores as degraded proxy; reranker_available: false threaded through response; provenance note warns user |
| rerank provider permanent error | no retry; immediate fallback to vector similarity scores |
| related-source cache issues | connection endpoint can return empty results |
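The rerank fallback rows above can be sketched as a retry loop: transient errors get two retries with the stated 0.5s/1.0s backoff, permanent errors fall back immediately, and exhaustion degrades to vector similarity scores. Exception classes and the result shape are illustrative (an HTTP 429 would be mapped to a transient error class in practice):

```python
import time

TRANSIENT = (TimeoutError, ConnectionError)  # stand-ins for 429/timeout/connection

def rerank_with_fallback(candidates, rerank, vector_scores, sleep=time.sleep):
    """Retry transient rerank errors twice (0.5s, 1.0s backoff); on
    exhaustion or any permanent error, fall back to vector similarity."""
    backoffs = [0.5, 1.0]
    for attempt in range(3):  # 1 initial try + 2 retries
        try:
            return {"scores": rerank(candidates), "reranker_available": True}
        except TRANSIENT:
            if attempt < 2:
                sleep(backoffs[attempt])
        except Exception:
            break  # permanent error: no retry, immediate fallback
    # Degraded proxy: vector scores, with reranker_available threaded through.
    return {"scores": vector_scores, "reranker_available": False}
```

Threading `reranker_available: False` into the result is what lets the response layer attach the provenance warning instead of failing the whole retrieval.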
Chat Failure Modes
| Failure | Expected Behavior |
|---|---|
| input blocked by guardrails | request rejected with validation-style error |
| query rewriter returns unexpected format | fallback to RETRIEVE with raw user message (no rewrite) |
| query rewriter LLM unavailable | fallback to RETRIEVE with raw user message |
| consecutive clarifications (rewriter loops) | hard breaker forces RETRIEVE after any non-retrieval turn; logged as clarify_breaker_forced_retrieve |
| retrieval yields weak/empty evidence | deterministic follow-up path (library/search/rephrase options) |
| provider failure in generation path | fallback provider attempt, then controlled degradation/failure |
| stream internal exception | terminal error SSE event |
| client disconnect | stream exits without synthetic terminal rewrite |
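The rewriter rows above share one invariant: any failure of the query-rewrite step degrades to plain RETRIEVE with the raw user message, and the clarification breaker caps non-retrieval turns. A sketch with an assumed rewriter output shape (`action`/`query`/`question` keys are hypothetical):

```python
def plan_turn(user_message, rewrite_llm, consecutive_clarifies=0, breaker=1):
    """Pick the next action; rewriter errors or malformed output fall back
    to RETRIEVE with the raw message, and the breaker stops clarify loops."""
    if consecutive_clarifies >= breaker:
        # clarify_breaker_forced_retrieve: never clarify twice in a row
        return ("RETRIEVE", user_message)
    try:
        out = rewrite_llm(user_message)  # may raise if the provider is down
    except Exception:
        return ("RETRIEVE", user_message)
    if not isinstance(out, dict) or out.get("action") not in ("RETRIEVE", "CLARIFY"):
        return ("RETRIEVE", user_message)  # unexpected format: no rewrite
    if out["action"] == "CLARIFY":
        return ("CLARIFY", out.get("question", ""))
    return ("RETRIEVE", out.get("query", user_message))
```

Every failure branch converges on the same safe default, so a broken rewriter degrades quality (no rewrite) rather than availability.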
Platform Safeguards
- API/chat rate limiting (429 + Retry-After)
- request timing header (X-Response-Time)
- deep health check endpoint for dependency status
- fallback-aware LLM client with circuit-breaker behavior
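Since rate-limited responses carry Retry-After, a well-behaved client should honour it rather than retrying immediately. A minimal sketch of the client-side decision (function name and default are ours; only the numeric form of Retry-After is handled here, not the HTTP-date form):

```python
def retry_after_delay(status: int, headers: dict, default: float = 1.0):
    """Return seconds to wait before retrying a 429 response, or None
    if the response is not rate-limited and no retry is needed."""
    if status != 429:
        return None
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing, or an HTTP-date we don't parse in this sketch.
        return default
```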