
Why This Matters

Chunk metadata powers:
  • citation rendering
  • source identity and deletion
  • namespace filtering
  • retrieval and ranking diagnostics
If metadata drifts from the contracts below, each of these features can break: chat and library behavior degrades quickly.

Shared Metadata Invariants

All chunk types include:
  • namespace
  • chunk_index
  • optional taxonomy fields (labels/path/confidence/classifier_version)
Validation behavior:
  • unknown fields are rejected by typed models
  • optional None fields are omitted from final Chroma payload
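
The two validation rules above can be sketched with a standard-library dataclass. This is a minimal illustration, not the project's actual models: the class name `ChunkMetadata`, the helpers `parse_metadata` and `to_payload`, and the field types are all assumptions, and a dataclass stands in for whatever typed-model library is really used.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical model holding the shared fields named in this section."""

    namespace: str
    chunk_index: int
    # Optional taxonomy fields; the types chosen here are assumptions.
    labels: Optional[str] = None
    path: Optional[str] = None
    confidence: Optional[float] = None
    classifier_version: Optional[str] = None


def parse_metadata(raw: dict) -> ChunkMetadata:
    """Build the model from a raw dict; unknown or missing keys raise ValueError."""
    try:
        # A dataclass constructor rejects unexpected keyword arguments.
        return ChunkMetadata(**raw)
    except TypeError as exc:
        raise ValueError(f"invalid chunk metadata: {exc}") from exc


def to_payload(meta: ChunkMetadata) -> dict:
    """Drop None-valued fields so they never reach the stored payload."""
    return {key: value for key, value in asdict(meta).items() if value is not None}
```

With this sketch, `to_payload(parse_metadata({"namespace": "docs", "chunk_index": 0}))` yields only the two populated keys, while passing an extra key such as `"colour"` raises `ValueError`.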

Per-Type Required Fields

YouTube chunks

Required keys include:
  • video_id
  • title
  • duration
  • timestamp_start
  • source_url
  • namespace
  • chunk_index

Article chunks

Required keys include:
  • source_id
  • title
  • source_url
  • namespace
  • chunk_index

File/document chunks

Required keys include:
  • source_id
  • file_path
  • file_name
  • file_hash
  • file_type
  • title
  • total_chunks
  • namespace
  • chunk_index
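
The per-type key lists in this section can be expressed as a lookup table plus a small check, which may be useful in tests or ingestion diagnostics. This is only a sketch: the type labels `"youtube"`, `"article"`, and `"file"` and the function `missing_keys` are hypothetical names, and because each list says "include", the real models may require more keys than shown here.

```python
# Keys shared by every chunk type, per the shared invariants.
SHARED_KEYS = {"namespace", "chunk_index"}

# Required keys per chunk type, copied from the lists in this section.
# The dictionary keys ("youtube", "article", "file") are illustrative labels.
REQUIRED_KEYS = {
    "youtube": SHARED_KEYS
    | {"video_id", "title", "duration", "timestamp_start", "source_url"},
    "article": SHARED_KEYS | {"source_id", "title", "source_url"},
    "file": SHARED_KEYS
    | {
        "source_id",
        "file_path",
        "file_name",
        "file_hash",
        "file_type",
        "title",
        "total_chunks",
    },
}


def missing_keys(chunk_type: str, metadata: dict) -> list:
    """Return the sorted required keys that are absent or None."""
    required = REQUIRED_KEYS[chunk_type]
    return sorted(key for key in required if metadata.get(key) is None)
```

An empty list means the metadata satisfies the listed requirements for that chunk type; an unknown `chunk_type` raises `KeyError`.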

Chunk ID Patterns

  • YouTube: <video_id>_<chunk_index>
  • Article: <article_source_id>_<chunk_index>
  • File: file_<file_hash12>_<chunk_index>
IDs are deterministic: re-ingesting the same source produces the same IDs, which supports stable upsert/delete behavior.
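
The ID patterns above can be sketched as plain string builders. The function names are hypothetical, and one detail is an assumption: `<file_hash12>` is read here as the first 12 characters of the file hash, which the source does not state explicitly.

```python
def youtube_chunk_id(video_id: str, chunk_index: int) -> str:
    """<video_id>_<chunk_index>"""
    return f"{video_id}_{chunk_index}"


def article_chunk_id(source_id: str, chunk_index: int) -> str:
    """<article_source_id>_<chunk_index>"""
    return f"{source_id}_{chunk_index}"


def file_chunk_id(file_hash: str, chunk_index: int) -> str:
    """file_<file_hash12>_<chunk_index>"""
    # Assumption: file_hash12 is the first 12 characters of the full hash.
    return f"file_{file_hash[:12]}_{chunk_index}"


def file_chunk_ids(file_hash: str, total_chunks: int) -> list:
    """Recompute every chunk ID for a file, e.g. to delete a whole source."""
    return [file_chunk_id(file_hash, index) for index in range(total_chunks)]
```

Because the IDs depend only on stored metadata (`file_hash`, `total_chunks`, and so on), the full ID set for a source can be rebuilt without querying the store first.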