Identifier Layers
- Source IDs: stable identity for an entire source
- Chunk IDs: stable identity for individual chunk rows
Canonical Source IDs
YouTube
- canonical form: yt_<video_id>
- accepted aliases include raw video id and YouTube URL forms
Article
- canonical form: article_<hash12>
- deterministic from URL hashing
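
As an illustration of a deterministic article ID, here is a minimal sketch. The hash function (SHA-256), the whitespace stripping before hashing, and the helper name article_source_id are assumptions; the text above only says the ID is a 12-character hash derived deterministically from the URL.

```python
import hashlib


def article_source_id(url: str) -> str:
    """Build an article_<hash12> ID from a URL (hypothetical helper).

    Assumption: SHA-256 over the stripped, UTF-8 encoded URL, truncated
    to the first 12 hex characters. The real hash and any URL
    normalization applied before hashing are not specified here.
    """
    digest = hashlib.sha256(url.strip().encode("utf-8")).hexdigest()
    return f"article_{digest[:12]}"


if __name__ == "__main__":
    # The same URL always maps to the same ID.
    print(article_source_id("https://example.com/post/1"))
```

Because the ID depends only on the URL string, re-ingesting the same URL yields the same source ID. Under this sketch, two URLs that differ in any character (for example a trailing slash) produce different IDs.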
Document
- canonical form: doc_<hash12>
- generated from namespace-aware source material in current path
Canonicalization Rules
Normalization utilities:
- trim
- drop empties
- canonicalize recognized YouTube aliases to yt_<id>
- dedupe while preserving order
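
The four steps above can be sketched as follows. This is a rough illustration, not the project's actual utility: the function names, the 11-character video ID pattern, and the set of recognized URL shapes (youtu.be/<id> and youtube.com/watch?v=<id>) are all assumptions.

```python
import re
from typing import Iterable, List, Optional
from urllib.parse import parse_qs, urlparse

# Assumption: a YouTube video id is 11 characters from [A-Za-z0-9_-].
_VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def canonicalize_youtube(value: str) -> Optional[str]:
    """Return yt_<id> for a recognized YouTube alias, else None."""
    if value.startswith("yt_") and _VIDEO_ID.fullmatch(value[3:]):
        return value  # already canonical
    if _VIDEO_ID.fullmatch(value):
        # Note: any bare 11-character token is treated as a raw video id.
        return f"yt_{value}"
    try:
        parsed = urlparse(value)
    except ValueError:
        return None
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    video_id = None
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/")
    elif host in ("youtube.com", "m.youtube.com") and parsed.path == "/watch":
        video_id = parse_qs(parsed.query).get("v", [None])[0]
    if video_id and _VIDEO_ID.fullmatch(video_id):
        return f"yt_{video_id}"
    return None


def normalize_source_ids(values: Iterable[str]) -> List[str]:
    """Trim, drop empties, canonicalize YouTube aliases, dedupe in order."""
    seen = set()
    result = []
    for raw in values:
        item = raw.strip()                       # trim
        if not item:                             # drop empties
            continue
        item = canonicalize_youtube(item) or item
        if item not in seen:                     # keep first occurrence only
            seen.add(item)
            result.append(item)
    return result
```

Canonicalizing before deduplication matters: a raw video id and a URL for the same video collapse into a single yt_<id> entry, at the position of whichever alias appeared first.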