Skip to main content

Identifier Layers

  • Source IDs: stable identity for an entire source
  • Chunk IDs: stable identity for individual chunk rows
Ranking feedback is source-based, not chunk-based.

Canonical Source IDs

YouTube

  • canonical form: yt_<video_id>
  • accepted aliases include raw video id and YouTube URL forms

Article

  • canonical form: article_<hash12>
  • deterministic from URL hashing

Document

  • canonical form: doc_<hash12>
  • generated from namespace-aware source material in current path

Canonicalization Rules

Normalization utilities:
  • trim
  • drop empties
  • canonicalize recognized YouTube aliases to yt_<id>
  • dedupe while preserving order
This is applied before feedback persistence to avoid alias double-counting.

Compatibility Note

Legacy document aliases are still handled in read/delete paths for backward compatibility, but canonical IDs are preferred for new writes.