Add discover, a breadth-first walk of the YouTube graph #9

Merged: tamnd merged 2 commits into main from discover-graph-bfs on Jun 15, 2026

@tamnd tamnd commented Jun 15, 2026

Each existing read answers one question about one object: a video's metadata, a channel's uploads, a playlist's items. discover chains them. From a seed video, channel, or playlist, it follows that object's links outward, hop by hop, and streams one row per node as each is reached.

The walker

The traversal lives in the library (youtube/discover.go) behind a small grapher interface that exposes exactly the subset of *Client the walker needs, so the BFS is tested hermetically over a fake in-memory graph with no network. The bounds, ordering, dedup, and degradation are covered by unit tests, not integration tests.
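
As a rough illustration of that seam, here is a minimal sketch of an interface-plus-fake arrangement. The method names (Uploads, Related), the fakeGraph type, and the twoHops helper are assumptions made up for this example; the real grapher interface in youtube/discover.go may look quite different.

```go
package main

import (
	"errors"
	"fmt"
)

// grapher is a hypothetical version of the small interface the walker
// depends on. The method names are illustrative assumptions.
type grapher interface {
	Uploads(channelID string) ([]string, error)
	Related(videoID string) ([]string, error)
}

var errNotFound = errors.New("not found")

// fakeGraph satisfies grapher from plain maps, so tests need no network.
type fakeGraph struct {
	uploads map[string][]string
	related map[string][]string
}

func (f fakeGraph) Uploads(id string) ([]string, error) { return lookup(f.uploads, id) }
func (f fakeGraph) Related(id string) ([]string, error) { return lookup(f.related, id) }

// lookup returns the neighbors stored for id, or errNotFound.
func lookup(m map[string][]string, id string) ([]string, error) {
	v, ok := m[id]
	if !ok {
		return nil, errNotFound
	}
	return v, nil
}

// twoHops lists the videos related to each upload of a channel. It sees
// only the interface, so it cannot tell a fake from a real client.
func twoHops(g grapher, channelID string) ([]string, error) {
	ups, err := g.Uploads(channelID)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, v := range ups {
		rel, err := g.Related(v)
		if err != nil {
			continue // an upload with no related entry contributes nothing
		}
		out = append(out, rel...)
	}
	return out, nil
}

func main() {
	g := fakeGraph{
		uploads: map[string][]string{"c1": {"v1", "v2"}},
		related: map[string][]string{"v1": {"v3"}},
	}
	out, err := twoHops(g, "c1")
	fmt.Println(out, err)
}
```

Because the traversal code takes the interface rather than a concrete client, the same function can be exercised against hand-built graphs that hit specific corner cases.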

Nine edges across five node kinds:

| Edge | From | To | Gated | Meaning |
| --- | --- | --- | --- | --- |
| channel | video | channel | no | the uploader |
| related | video | video | no | a related video |
| comments | video | comment | yes | a comment |
| uploads | channel | video | no | an upload |
| playlists | channel | playlist | no | an owned playlist |
| community | channel | post | no | a community post |
| items | playlist | video | no | a playlist item |
| owner | playlist | channel | no | the owner |
| commenter | comment | channel | yes | the comment's author |
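
The table above maps naturally onto a small slice of records. The sketch below encodes it that way; the edgeSpec type, the edgeTable variable, and the edgesFrom helper are hypothetical names chosen for this example, and only the rows themselves come from the table.

```go
package main

import "fmt"

// edgeSpec is a hypothetical record for one row of the edge table.
type edgeSpec struct {
	Name  string // edge name
	From  string // node kind the edge leaves
	To    string // node kind the edge reaches
	Gated bool   // true for the two comment edges
}

// edgeTable mirrors the nine rows listed in the description.
var edgeTable = []edgeSpec{
	{"channel", "video", "channel", false},
	{"related", "video", "video", false},
	{"comments", "video", "comment", true},
	{"uploads", "channel", "video", false},
	{"playlists", "channel", "playlist", false},
	{"community", "channel", "post", false},
	{"items", "playlist", "video", false},
	{"owner", "playlist", "channel", false},
	{"commenter", "comment", "channel", true},
}

// edgesFrom returns the edges that leave a node of the given kind.
func edgesFrom(kind string) []edgeSpec {
	var out []edgeSpec
	for _, e := range edgeTable {
		if e.From == kind {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	for _, e := range edgesFrom("video") {
		fmt.Printf("%s: video to %s (gated=%v)\n", e.Name, e.To, e.Gated)
	}
}
```

Keeping the edges as data rather than as branches in the walker means a lookup by source kind is all the traversal needs at each hop.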

Presets bundle the edges by intent (content, feed, comments, all), with content the default since it spans every seed kind, so plain discover does the obvious thing with no flags. Preset names are kept disjoint from edge names, so naming an edge follows just that link.

Tier-less, with graceful degradation

Unlike x-cli there are no scrape tiers to gate, so nothing is dropped up front. The only walled edges are the two that touch comments, refused by YouTube's per-IP Restricted Mode on some networks. The walk attempts them and degrades to a one-line note on failure, continuing on the rest of the graph, rather than failing the whole walk. A seed that cannot be fetched is still fatal, matching a single read; deeper failures are notes.
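
The fatal-seed versus note-on-failure split could be sketched roughly as follows. Everything here is hypothetical: the fetcher type, the errRestricted error, the note wording, and the expand and walkOne helpers are stand-ins for illustration, not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errRestricted is a hypothetical error standing in for a refused comment read.
var errRestricted = errors.New("comments unavailable")

// fetcher returns the neighbors reached over one edge from one node.
type fetcher func(id string) ([]string, error)

// expand follows each named edge out of id. A failing edge becomes a
// one-line note, and the remaining edges are still attempted.
func expand(id string, order []string, edges map[string]fetcher) (out, notes []string) {
	for _, name := range order {
		next, err := edges[name](id)
		if err != nil {
			notes = append(notes, fmt.Sprintf("note: %s on %s skipped: %v", name, id, err))
			continue
		}
		out = append(out, next...)
	}
	return out, notes
}

// walkOne fetches the seed and then expands it. Only a seed failure is
// returned as an error; edge failures surface as notes.
func walkOne(seed string, fetchSeed func(string) error, order []string, edges map[string]fetcher) ([]string, []string, error) {
	if err := fetchSeed(seed); err != nil {
		return nil, nil, fmt.Errorf("seed %s: %w", seed, err)
	}
	out, notes := expand(seed, order, edges)
	return out, notes, nil
}

func main() {
	edges := map[string]fetcher{
		"comments": func(string) ([]string, error) { return nil, errRestricted },
		"related":  func(string) ([]string, error) { return []string{"v2"}, nil },
	}
	seedOK := func(string) error { return nil }
	out, notes, err := walkOne("v1", seedOK, []string{"comments", "related"}, edges)
	fmt.Println(out, notes, err)
}
```

In this sketch the refused comments edge yields a note while the related edge still contributes its node, which is the behavior the paragraph above describes.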

Depth, fanout, and a total node budget bound the walk so it always terminates, even with an uncapped fanout. Edges are recorded eagerly so the stored graph stays complete while nodes stay de-duplicated by an alias-collapsing key.
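
To make the termination argument concrete, here is a minimal sketch of a bounded breadth-first walk over an in-memory adjacency map. The limits struct, the key function (which collapses aliases by trimming a leading "@" and lowercasing), and the choice to treat a fanout of 0 as uncapped are all assumptions for this example; the actual walker's key and options may differ.

```go
package main

import (
	"fmt"
	"strings"
)

type edge struct{ From, To string }

// limits bounds the walk. Fanout 0 means uncapped (an assumption here).
type limits struct{ Depth, Fanout, Budget int }

// key is a hypothetical alias-collapsing key: "@Chan" and "chan" match.
func key(id string) string { return strings.ToLower(strings.TrimPrefix(id, "@")) }

// walk does a BFS from seed. Each node is enqueued at most once and at
// most Budget nodes are emitted (the seed counts), so the loop terminates
// even when Fanout is uncapped.
func walk(seed string, adj map[string][]string, lim limits) (nodes []string, edges []edge) {
	type item struct {
		id    string
		depth int
	}
	seen := map[string]bool{key(seed): true}
	nodes = append(nodes, seed)
	queue := []item{{seed, 0}}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur.depth >= lim.Depth {
			continue // depth bound: do not expand further
		}
		next := adj[cur.id]
		if lim.Fanout > 0 && len(next) > lim.Fanout {
			next = next[:lim.Fanout] // fanout bound
		}
		for _, n := range next {
			k := key(n)
			if !seen[k] && len(nodes) >= lim.Budget {
				continue // budget spent: skip nodes not yet emitted
			}
			// Record the link even when the target was already seen,
			// so the edge list stays complete while nodes stay unique.
			edges = append(edges, edge{cur.id, n})
			if seen[k] {
				continue
			}
			seen[k] = true
			nodes = append(nodes, n)
			queue = append(queue, item{n, cur.depth + 1})
		}
	}
	return nodes, edges
}

func main() {
	adj := map[string][]string{
		"v1": {"c1", "v2"},
		"c1": {"v1", "v3"},
		"v2": {"@C1"},
	}
	nodes, edges := walk("v1", adj, limits{Depth: 2, Fanout: 0, Budget: 10})
	fmt.Println(nodes)
	fmt.Println(edges)
}
```

Note that in this sketch the edge from v2 to "@C1" is still recorded even though that channel was already reached as "c1", which is the point of recording edges eagerly.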

Persisting

ytb discover --store tees the walk into the typed crawl store and records each traversed link into a new edges table, so a live walk doubles as a crawl you can query with ytb db query afterwards. The existing seed/crawl/queue/jobs worklist crawler is untouched; discover is the complement that finds the worklist by walking instead of draining one.

Tests and docs

Hermetic walker tests cover edge parsing, seed classification, BFS order, presets, the comment degradation path, the budget and fanout caps, dedup, fatal-seed, and depth-zero. Docs get a graph-discovery guide, a persist-a-walk section in the store guide, and a discover entry in the CLI reference.

Verified: CGO_ENABLED=0 go build/vet/test ./... green, gofmt clean, go mod tidy no-op, live walk confirmed against a real video.

tamnd added 2 commits June 15, 2026 13:41
The reads each answer one question about one object: a video's metadata,
a channel's uploads, a playlist's items. discover chains them. From a
seed video, channel, or playlist it follows that object's links outward,
hop by hop, streaming one node per row as it is reached.

The walker lives in the library behind a small grapher interface, the
exact subset of Client it needs, so the BFS is tested hermetically over a
fake in-memory graph with no network: the bounds, ordering, dedup, and
degradation are unit tests, not integration tests.

Nine edges across five node kinds: a video to its channel and related
videos and comments, a channel to its uploads, playlists, and community
posts, a playlist to its items and owner, a comment to the channel that
wrote it. Presets bundle them by intent (content, feed, comments, all),
with content the default since it spans every seed kind, so plain
discover does the obvious thing with no flags.

Unlike X there are no scrape tiers to gate, so nothing is dropped up
front. The only walled edges are the two that touch comments, refused by
YouTube's per-IP Restricted Mode on some networks. The walk attempts
them and degrades to a one-line note on failure, continuing on the rest
of the graph, rather than failing the whole walk. A seed that cannot be
fetched is still fatal, matching a single read; deeper failures are
notes.

Depth, fanout, and a total node budget bound the walk so it always
terminates, even with an uncapped fanout. Edges are recorded eagerly so
the stored graph stays complete while nodes stay de-duplicated by an
alias-collapsing key.

discover --store tees the walk into the typed crawl store and records
each traversed link into a new edges table, so a live walk doubles as a
crawl you can query with db query afterwards. The existing seed, crawl,
queue, and jobs worklist crawler is untouched; discover is the
complement that finds the worklist by walking instead of draining one.

Docs get a graph-discovery guide, a persist-a-walk section in the store
guide, and a discover entry in the CLI reference.
@tamnd tamnd merged commit 8432603 into main Jun 15, 2026
7 checks passed
@tamnd tamnd deleted the discover-graph-bfs branch June 15, 2026 12:53