## Ingestion Entry Points

- `POST /api/ingest` (job-backed async)
- `POST /api/ingest/upload` (job-backed async file ingestion)
- `POST /api/ingest/url` (direct response path)
- `POST /api/ingest/screenshots` (upload screenshots for a source)
- `GET /api/ingest/page-title` (lightweight title + published date fetch from URL)
- `GET /api/ingest/page-body` (extract visible body text via Jina for article preview)
- `GET /api/ingest/{job_id}`
- `WS /api/ingest/ws/{job_id}`
## Job State Model

Job-backed paths follow: `queued` → `processing` → `completed` or `failed`.
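A minimal client-side sketch of consuming this state machine. The `poll` helper and the simulated status sequence are illustrative assumptions; a real client would call `GET /api/ingest/{job_id}` inside `fetch_status`:

```python
TERMINAL = {"completed", "failed"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL

def poll(fetch_status, max_attempts: int = 50) -> str:
    """Call fetch_status() until the job reaches a terminal state."""
    for _ in range(max_attempts):
        status = fetch_status()
        if is_terminal(status):
            return status
    raise TimeoutError("job did not reach a terminal state")

# Simulated job walking queued -> processing -> completed.
states = iter(["queued", "processing", "processing", "completed"])
result = poll(lambda: next(states))
```

In practice the WebSocket endpoint replaces polling, but the terminal-state check is the same.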
## Processing Flow
## Atomicity and Concurrency

- **Per-source locking** - Only one ingestion per unique `source_id` may run at a time. Concurrent submissions for the same source return an error immediately.
- **Fail-closed relevance gate** - If the LLM provider fails during relevance assessment, the source is rejected rather than accepted. Use `force: true` to bypass.
- **Rollback on partial failure** - If learnings extraction fails after chunks are stored, chunks are rolled back from both ChromaDB and FTS5. No orphaned data remains.
## Duplicate Detection

Duplicate URLs are caught at two levels:

- **API-level dedup (pre-job)** - `POST /api/ingest` checks for an existing completed job with the same URL before creating a new job. Returns HTTP 409 with the existing job's title and ID. The frontend shows an "Already in library" toast and never starts the pipeline.
- **Pipeline-level dedup (in-pipeline)** - If a URL reaches the pipeline (e.g., via `force: true`), ChromaDB is checked for existing chunks by `source_id` (articles) or `video_id` (YouTube). Duplicates are rejected unless `force: true` is passed, which deletes existing chunks first and re-ingests.
(e.g., `/research/...` redirecting to `/engineering/...`).
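The pipeline-level check can be sketched as follows. The `collection` object here is a stand-in for a Chroma-style collection exposing `get()`/`delete()` with a `where` filter; the in-memory `FakeCollection` exists only to make the sketch self-contained:

```python
def check_duplicate(collection, source_id: str, force: bool) -> None:
    """Reject a duplicate source unless force=True, which deletes the
    existing chunks first so re-ingestion starts clean."""
    existing = collection.get(where={"source_id": source_id})
    if existing["ids"]:
        if not force:
            raise ValueError(f"duplicate source {source_id}; pass force to re-ingest")
        collection.delete(where={"source_id": source_id})

class FakeCollection:
    """Minimal in-memory stand-in for the ChromaDB collection (illustrative)."""
    def __init__(self):
        self.chunks = {"article_abc123": ["c1", "c2"]}
    def get(self, where):
        return {"ids": self.chunks.get(where["source_id"], [])}
    def delete(self, where):
        self.chunks.pop(where["source_id"], None)
```
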
## Source Type Notes
- YouTube/article routes are supported in URL-driven paths.
- PDF and local docs are handled via upload/file ingestion paths.
- The direct URL endpoint can return payload-level `status: error` without throwing HTTP-level failures for many domain issues.
- `POST /api/ingest/url` supports `force: true` to override duplicate checks and continue article ingestion even when the relevance gate would reject.
## Published Date Extraction

Article sources extract a publication date through a two-tier fallback:

- **Firecrawl metadata** - checks `publishedDate` and `ogArticle:published_time` from the scrape response metadata. Most sites don't set these tags, so this often returns empty.
- **LLM extraction (Gemini 2.5 Flash Lite)** - sends the first 2,000 chars of body text to the LLM and asks for a `YYYY-MM-DD` date. Catches dates rendered in the page body (e.g., "Published Dec 19, 2024").
A `published_date_source` field in chunk metadata records which tier produced the date: `"metadata"`, `"llm"`, `"frontend"`, or `""`. This is surfaced in the Library source detail view so you can see how the date was found.
If neither method finds a date, a `missing_publish_date` warning event is emitted in the ingest trace.
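The two-tier fallback can be sketched as below. The `llm_extract` callable stands in for the Gemini call, and the function/return shape is an assumption for illustration:

```python
import re

def extract_published_date(metadata: dict, body: str, llm_extract) -> tuple[str, str]:
    """Return (date, source_tag) via the two-tier fallback:
    scrape metadata first, then LLM extraction over the body text."""
    date = metadata.get("publishedDate") or metadata.get("ogArticle:published_time")
    if date:
        return date, "metadata"
    # Only the first 2,000 chars are sent to the LLM.
    date = llm_extract(body[:2000])
    if date and re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        return date, "llm"
    return "", ""  # triggers the missing_publish_date warning downstream
```
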
## Manual Caption Override (Articles)

Both `POST /api/ingest` and `POST /api/ingest/url` accept these optional article fields:

- `manual_caption` - Full post/article text. When provided, Firecrawl is skipped entirely and this text is used as source content.
- `manual_title` - Override title (otherwise extracted from the page `<title>` tag).
- `manual_description` - Additional context prepended to the caption.
- `published_date` - ISO date string from the frontend's `page-title` pre-fetch. If missing, the pipeline runs Gemini 2.5 Flash Lite extraction on the caption text as a fallback.

The transcript fields (`transcript`, `transcript_title`, `transcript_description`) remain separate and unchanged.
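The Firecrawl-skip behavior can be sketched as below. The function name and the exact way `manual_description` is prepended are assumptions; `scrape` stands in for the Firecrawl call:

```python
def resolve_article_content(payload: dict, scrape) -> str:
    """If manual_caption is present, skip scraping entirely and use it as
    source content, with manual_description (if any) prepended as context."""
    caption = payload.get("manual_caption")
    if caption:
        desc = payload.get("manual_description")
        return f"{desc}\n\n{caption}" if desc else caption
    return scrape(payload["url"])
```
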
## Screenshot Upload

`POST /api/ingest/screenshots` accepts `source_url` (form field) + `files` (multipart images). The `source_id` is derived deterministically from the URL (`article_{sha256(url)[:12]}`), so screenshots can be uploaded before, during, or after ingestion.

Screenshots are stored under `data/source_assets/{source_id}/screenshots/` and automatically copied to any projects extracted from that source.
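The derivation scheme above is small enough to show directly; this follows the `article_{sha256(url)[:12]}` formula stated in the text (the function name is illustrative):

```python
import hashlib

def source_id_for(url: str) -> str:
    """Deterministic: the same URL always maps to the same source_id,
    which is why screenshots can be uploaded before ingestion runs."""
    return "article_" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
```
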
## Learnings Confirmation Lifecycle

Learnings extracted at ingest time are not immediately active. They follow a two-phase lifecycle:

- **Pending** - `store_pending()` writes the LLM-extracted learnings (concepts, tools, code snippets, summary) with `confirmed_at = NULL`. These are invisible to the career engine and library detail views.
- **Confirmed** - `confirm_learnings()` replaces child rows (concepts, tools, snippets) with the user-reviewed set and sets `confirmed_at` to the current timestamp. Only confirmed learnings feed skill evidence aggregation and career intelligence.

`get_learnings()` returns only confirmed rows.
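The two-phase lifecycle can be sketched against an in-memory SQLite database. The table and column names here are simplified assumptions; only the `confirmed_at IS NULL` / replace-children semantics mirror the description above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE learnings (source_id TEXT PRIMARY KEY, summary TEXT, confirmed_at TEXT);
CREATE TABLE concepts  (source_id TEXT, name TEXT);
""")

def store_pending(source_id: str, summary: str, concepts: list[str]) -> None:
    # Pending: confirmed_at stays NULL, so nothing downstream sees it yet.
    db.execute("INSERT INTO learnings VALUES (?, ?, NULL)", (source_id, summary))
    db.executemany("INSERT INTO concepts VALUES (?, ?)",
                   [(source_id, c) for c in concepts])

def confirm_learnings(source_id: str, reviewed_concepts: list[str]) -> None:
    # Confirmed: replace child rows with the user-reviewed set, then stamp.
    db.execute("DELETE FROM concepts WHERE source_id = ?", (source_id,))
    db.executemany("INSERT INTO concepts VALUES (?, ?)",
                   [(source_id, c) for c in reviewed_concepts])
    db.execute("UPDATE learnings SET confirmed_at = datetime('now') "
               "WHERE source_id = ?", (source_id,))

def get_learnings() -> list[tuple]:
    # Only confirmed rows are visible to callers.
    return db.execute("SELECT source_id FROM learnings "
                      "WHERE confirmed_at IS NOT NULL").fetchall()
```
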
## Impact Snapshots

At confirmation time, two snapshots are computed and persisted:

- **Pattern snapshot** (`pattern_snapshot_json`) - Diffs pattern fit scores before vs after the new skills. Each delta shows which required/optional concepts were newly covered and by how much the pattern score changed. Computed by `preview_pattern_impact()`.
- **Composite snapshot** (`composite_snapshot_json`) - Diffs capability composite readiness before vs after the new skills.

Both snapshots are stored on the `learnings` row and retrieved via `get_impact_snapshots()`. They power the "Impact Snapshot" section on library source detail pages (`/library/$sourceId`), showing what each ingested source contributed to career intelligence.

Unconfirmed sources and sources ingested before migration 0068 have `NULL` snapshots.
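A minimal sketch of the before/after diffing idea. The snapshot shape (per-pattern `before`/`after`/`delta`) is an assumption for illustration; the real `preview_pattern_impact()` also tracks which concepts were newly covered:

```python
def pattern_snapshot(before: dict[str, float], after: dict[str, float]) -> dict:
    """Diff pattern fit scores before vs after confirming new skills."""
    return {
        pattern: {
            "before": before.get(pattern, 0.0),
            "after": score,
            "delta": round(score - before.get(pattern, 0.0), 4),
        }
        for pattern, score in after.items()
    }
```
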
## Post-Ingest Side Effects

- Retriever cache clear
- Connections cache invalidation
- Learnings artifact write
## File/Vault Ingestion (CLI)

`file_ingest.py` is a standalone CLI script for bulk-ingesting folders of documents (PDF, Markdown, TXT) into ChromaDB:

- **Dedup via SHA-256 hashes** - stored in the `file_ingestion_hashes` table. Unchanged files are skipped.
- **Learnings extraction** - each ingested file runs through Gemini 2.5 Flash Lite for concepts/tools/summary extraction, auto-confirmed (no user approval gate).
- **`--backfill-learnings`** - extracts learnings for files already indexed but missing learnings, without re-chunking.
- **PDF support** - uses Mistral OCR (`mistral-ocr-latest`) for text extraction.
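The hash-based dedup can be sketched as below. The `seen_hashes` set stands in for the `file_ingestion_hashes` table, and the function names are illustrative:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_ingest(folder: Path, seen_hashes: set[str]):
    """Yield only files whose content hash has not been seen before;
    unchanged (or duplicate-content) files are skipped."""
    for path in sorted(folder.glob("**/*")):
        if path.suffix.lower() not in {".pdf", ".md", ".txt"}:
            continue
        digest = file_digest(path)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            yield path
```

Because the key is the content hash rather than the filename, renaming a file does not cause a re-ingest, while editing it does.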