
Ingestion Entry Points

  • POST /api/ingest (job-backed async)
  • POST /api/ingest/upload (job-backed async file ingestion)
  • POST /api/ingest/url (direct response path)
  • POST /api/ingest/screenshots (upload screenshots for a source)
  • GET /api/ingest/page-title (lightweight title + published date fetch from URL)
  • GET /api/ingest/page-body (extract visible body text via Jina for article preview)
  • GET /api/ingest/{job_id}
  • WS /api/ingest/ws/{job_id}

Job State Model

Job-backed requests move through the following states:
  • queued
  • processing
  • completed or failed
Progress is exposed via polling and WebSocket events.
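The lifecycle above can be modeled as a small state machine. This is an illustrative sketch (names and structure are hypothetical, not the server's implementation):

```python
from enum import Enum

class JobState(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

# Legal transitions implied by the lifecycle: queued -> processing -> terminal.
TRANSITIONS = {
    JobState.QUEUED: {JobState.PROCESSING},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),   # terminal
    JobState.FAILED: set(),      # terminal
}

def can_transition(src: JobState, dst: JobState) -> bool:
    return dst in TRANSITIONS[src]
```

Polling and WebSocket consumers can treat completed/failed as terminal and stop watching the job.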

Processing Flow

Atomicity and Concurrency

  • Per-source locking - Only one ingestion per unique source_id may run at a time. Concurrent submissions for the same source return an error immediately.
  • Fail-closed relevance gate - If the LLM provider fails during relevance assessment, the source is rejected rather than accepted. Use force: true to bypass.
  • Rollback on partial failure - If learnings extraction fails after chunks are stored, chunks are rolled back from both ChromaDB and FTS5. No orphaned data remains.
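The per-source locking rule can be sketched as an in-memory registry that fails fast on a second concurrent submission. This is a minimal sketch (class name and error type are hypothetical; the real implementation is not shown here):

```python
import threading

class SourceLockRegistry:
    """Allows at most one in-flight ingestion per source_id; rejects, never queues."""

    def __init__(self) -> None:
        self._mu = threading.Lock()
        self._active: set[str] = set()

    def acquire(self, source_id: str) -> None:
        with self._mu:
            if source_id in self._active:
                # Mirrors the documented behavior: concurrent submission errors immediately.
                raise RuntimeError(f"ingestion already running for {source_id}")
            self._active.add(source_id)

    def release(self, source_id: str) -> None:
        with self._mu:
            self._active.discard(source_id)
```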

Duplicate Detection

Duplicate URLs are caught at two levels:
  1. API-level dedup (pre-job) - POST /api/ingest checks for an existing completed job with the same URL before creating a new job. Returns HTTP 409 with the existing job’s title and ID. The frontend shows an “Already in library” toast and never starts the pipeline.
  2. Pipeline-level dedup (in-pipeline) - If a URL reaches the pipeline (e.g., via force: true), ChromaDB is checked for existing chunks by source_id (articles) or video_id (YouTube). Duplicates are rejected unless force: true is passed, which deletes existing chunks first and re-ingests.
URL normalization handles cosmetic differences (trailing slash, casing). Redirect resolution (HEAD request) catches cases where two different URL paths resolve to the same content (e.g., /research/... redirecting to /engineering/...).
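One plausible shape for the cosmetic normalization step, assuming only scheme/host casing and trailing slashes are canonicalized (dropping the fragment is an added assumption here, not stated above):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize cosmetic differences: scheme/host casing, trailing slash.

    Path casing is preserved because paths can be case-sensitive; the
    fragment is dropped (assumption: it never identifies distinct content).
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))
```

Redirect resolution would run after this step, since normalization alone cannot detect two distinct paths serving the same content.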

Source Type Notes

  • YouTube/article routes are supported in URL-driven paths.
  • PDF and local docs are handled via upload/file ingestion paths.
  • The direct URL endpoint can report many domain-level problems as a payload-level status: error rather than an HTTP-level failure.
  • POST /api/ingest/url supports force: true to override duplicate checks and continue article ingestion even when the relevance gate would reject.

Published Date Extraction

Article sources extract a publication date through a two-tier fallback:
  1. Firecrawl metadata - checks publishedDate and ogArticle:published_time from the scrape response metadata. Most sites don’t set these tags, so this often returns empty.
  2. LLM extraction (Gemini 2.5 Flash Lite) - sends the first 2,000 chars of body text to the LLM and asks for a YYYY-MM-DD date. Catches dates rendered in the page body (e.g., “Published Dec 19, 2024”).
The source of the date is tracked as published_date_source in chunk metadata: "metadata", "llm", "frontend", or "". This is surfaced in the Library source detail view so you can see how the date was found. If neither method finds a date, a missing_publish_date warning event is emitted in the ingest trace.
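A sketch of the two-tier fallback, with the Gemini call abstracted as an injected llm_extract callable (the metadata keys are as documented above; the date-parsing regex and return shape are assumptions):

```python
import re

DATE_RE = r"\d{4}-\d{2}-\d{2}"

def extract_published_date(metadata: dict, body: str, llm_extract) -> tuple[str, str]:
    """Return (YYYY-MM-DD or '', published_date_source tag)."""
    # Tier 1: scrape-response metadata. Most sites leave these unset.
    for key in ("publishedDate", "ogArticle:published_time"):
        raw = metadata.get(key) or ""
        m = re.match(DATE_RE, raw)
        if m:
            return m.group(0), "metadata"
    # Tier 2: ask the LLM for a date from the first 2,000 chars of body text.
    guess = (llm_extract(body[:2000]) or "").strip()
    if re.fullmatch(DATE_RE, guess):
        return guess, "llm"
    return "", ""  # caller emits the missing_publish_date warning event
```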

Manual Caption Override (Articles)

Both POST /api/ingest and POST /api/ingest/url accept the following optional article fields:
  • manual_caption - Full post/article text. When provided, Firecrawl is skipped entirely and this text is used as source content.
  • manual_title - Override title (otherwise extracted from page <title> tag).
  • manual_description - Additional context prepended to the caption.
  • published_date - ISO date string from the frontend’s page-title pre-fetch. If missing, the pipeline runs Gemini 2.5 Flash Lite extraction on the caption text as a fallback.
This is designed for LinkedIn posts and other social content where auto-fetch fails or returns garbage. The rest of the pipeline (relevance gate, chunking, embedding, learnings) runs on the provided text identically to auto-fetched content. YouTube manual fields (transcript, transcript_title, transcript_description) remain separate and unchanged.

Screenshot Upload

POST /api/ingest/screenshots accepts source_url (form field) + files (multipart images). The source_id is derived deterministically from the URL (article_{sha256(url)[:12]}), so screenshots can be uploaded before, during, or after ingestion. Screenshots are stored under data/source_assets/{source_id}/screenshots/ and automatically copied to any projects extracted from that source.
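The deterministic ID derivation stated above (article_{sha256(url)[:12]}) can be reproduced directly. Whether the URL is normalized before hashing is not stated, so this sketch hashes the raw string:

```python
import hashlib

def screenshot_source_id(source_url: str) -> str:
    """article_ prefix + first 12 hex chars of the URL's SHA-256 digest."""
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()
    return f"article_{digest[:12]}"
```

Because the derivation is pure, the frontend and backend can compute the same source_id independently, which is what allows uploads before, during, or after ingestion.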

Learnings Confirmation Lifecycle

Learnings extracted at ingest time are not immediately active. They follow a two-phase lifecycle:
  1. Pending - store_pending() writes the LLM-extracted learnings (concepts, tools, code snippets, summary) with confirmed_at = NULL. These are invisible to the career engine and library detail views.
  2. Confirmed - confirm_learnings() replaces child rows (concepts, tools, snippets) with the user-reviewed set and sets confirmed_at to the current timestamp. Only confirmed learnings feed skill evidence aggregation and career intelligence.
The confirmation step lets the user review and edit extracted learnings before they affect career scoring. get_learnings() returns only confirmed rows.
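The two-phase lifecycle reduces to "confirmed_at is the visibility switch". A minimal sketch with an in-memory SQLite table (schema, column names beyond confirmed_at, and timestamps are illustrative, not the real migration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE learnings ("
    " id INTEGER PRIMARY KEY, source_id TEXT, summary TEXT, confirmed_at TEXT)"
)

def store_pending(source_id: str, summary: str) -> None:
    # Phase 1: written with confirmed_at = NULL, invisible to consumers.
    con.execute(
        "INSERT INTO learnings (source_id, summary, confirmed_at) VALUES (?, ?, NULL)",
        (source_id, summary),
    )

def confirm_learnings(source_id: str) -> None:
    # Phase 2: user-reviewed; set confirmed_at so the rows become visible.
    con.execute(
        "UPDATE learnings SET confirmed_at = datetime('now') WHERE source_id = ?",
        (source_id,),
    )

def get_learnings() -> list:
    # Only confirmed rows feed skill evidence and career intelligence.
    return con.execute(
        "SELECT source_id FROM learnings WHERE confirmed_at IS NOT NULL"
    ).fetchall()
```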

Impact Snapshots

At confirmation time, two snapshots are computed and persisted:
  • Pattern snapshot (pattern_snapshot_json) - Diffs pattern fit scores before vs after the new skills. Each delta shows which required/optional concepts were newly covered and by how much the pattern score changed. Computed by preview_pattern_impact().
  • Composite snapshot (composite_snapshot_json) - Diffs capability composite readiness before vs after the new skills.
These snapshots are stored as JSON columns on the learnings row and retrieved via get_impact_snapshots(). They power the “Impact Snapshot” section on library source detail pages (/library/$sourceId), showing what each ingested source contributed to career intelligence. Unconfirmed sources and sources ingested before migration 0068 have NULL snapshots.
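The pattern snapshot is essentially a before/after diff of fit scores. A sketch of the diffing step, assuming scores are plain floats keyed by pattern name (the real preview_pattern_impact() also reports which concepts were newly covered, which is omitted here):

```python
def pattern_impact(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-pattern score delta; only patterns whose score changed are kept."""
    return {
        pattern: round(after.get(pattern, 0.0) - score, 4)
        for pattern, score in before.items()
        if after.get(pattern, 0.0) != score
    }
```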

Post-Ingest Side Effects

  • retriever cache clear
  • connections cache invalidation
  • learnings artifact write
These ensure newly ingested content is discoverable in chat/library flows.

File/Vault Ingestion (CLI)

file_ingest.py is a standalone CLI script for bulk-ingesting folders of documents (PDF, Markdown, TXT) into ChromaDB:
python file_ingest.py --folder /path/to/docs --namespace global
python file_ingest.py --folder /path/to/docs --force              # re-ingest all
python file_ingest.py --folder /path/to/docs --backfill-learnings  # extract learnings for existing files
Key behaviors:
  • Dedup via SHA-256 hashes - stored in file_ingestion_hashes table. Unchanged files are skipped.
  • Learnings extraction - each ingested file runs through Gemini 2.5 Flash Lite for concepts/tools/summary extraction, auto-confirmed (no user approval gate).
  • --backfill-learnings - extracts learnings for files already indexed but missing learnings, without re-chunking.
  • PDF support - uses Mistral OCR (mistral-ocr-latest) for text extraction.
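The hash-based skip logic can be sketched as follows, assuming the dedup key is a SHA-256 of the full file contents (function names are illustrative; the real script persists hashes in the file_ingestion_hashes table rather than a set):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Streaming SHA-256 of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def needs_ingestion(path: Path, seen_hashes: set[str]) -> bool:
    """Skip files whose content hash was already recorded; --force would bypass this."""
    return file_sha256(path) not in seen_hashes
```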