Add discover, a breadth-first walk of the YouTube graph #9
Merged
The reads each answer one question about one object: a video's metadata, a channel's uploads, a playlist's items. discover chains them. From a seed video, channel, or playlist it follows that object's links outward, hop by hop, streaming one node per row as it is reached.
The walker
The traversal lives in the library (youtube/discover.go) behind a small grapher interface, the exact subset of *Client it needs, so the BFS is tested hermetically over a fake in-memory graph with no network. The bounds, ordering, dedup, and degradation are unit tests, not integration tests.
Nine edges across five node kinds:
channel: a video to its channel
related: a video to its related videos
comments: a video to its comments
uploads: a channel to its uploads
playlists: a channel to its playlists
community: a channel to its community posts
items: a playlist to its items
owner: a playlist to its owner
commenter: a comment to the channel that wrote it
Presets bundle the edges by intent (content, feed, comments, all), with content the default since it spans every seed kind, so plain discover does the obvious thing with no flags. Preset names are kept disjoint from edge names, so naming an edge follows just that link.
Tier-less, with graceful degradation
Unlike x-cli there are no scrape tiers to gate, so nothing is dropped up front. The only walled edges are the two that touch comments, refused by YouTube's per-IP Restricted Mode on some networks. The walk attempts them and degrades to a one-line note on failure, continuing on the rest of the graph, rather than failing the whole walk. A seed that cannot be fetched is still fatal, matching a single read; deeper failures are notes.
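The split between a fatal seed and a degraded deeper edge can be sketched as below. This is only an illustration under assumptions, not the code in youtube/discover.go: fetchFunc, expand, walkSeed, fakeFetch, errRestricted, and the "self" read are all hypothetical names invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errRestricted is a hypothetical stand-in for a comment read refused by
// Restricted Mode.
var errRestricted = errors.New("comments refused by Restricted Mode")

// fetchFunc is a hypothetical edge read: the neighbours of id along edge.
type fetchFunc func(id, edge string) ([]string, error)

// expand follows each edge from id. A failed edge becomes a one-line note and
// the remaining edges are still attempted.
func expand(fetch fetchFunc, id string, edges []string) (next, notes []string) {
	for _, e := range edges {
		out, err := fetch(id, e)
		if err != nil {
			notes = append(notes, fmt.Sprintf("note: %s via %s: %v", id, e, err))
			continue
		}
		next = append(next, out...)
	}
	return next, notes
}

// walkSeed treats the seed differently: if it cannot be read, the walk fails.
func walkSeed(fetch fetchFunc, seed string, edges []string) ([]string, []string, error) {
	if _, err := fetch(seed, "self"); err != nil {
		return nil, nil, fmt.Errorf("seed %s: %w", seed, err)
	}
	next, notes := expand(fetch, seed, edges)
	return next, notes, nil
}

// fakeFetch is an in-memory graph where the comments edge is always refused
// and the id "missing" cannot be read at all.
func fakeFetch(id, edge string) ([]string, error) {
	if id == "missing" {
		return nil, errors.New("not found")
	}
	switch edge {
	case "comments":
		return nil, errRestricted
	case "channel":
		return []string{"chan1"}, nil
	case "related":
		return []string{"vid2", "vid3"}, nil
	}
	return nil, nil
}

func main() {
	// The comments edge fails, but channel and related are still followed.
	next, notes, err := walkSeed(fakeFetch, "vid1", []string{"channel", "comments", "related"})
	fmt.Println(next, notes, err)

	// An unreadable seed aborts the walk with an error.
	_, _, err = walkSeed(fakeFetch, "missing", []string{"channel"})
	fmt.Println(err)
}
```

In this sketch the refused edge costs one note and nothing else, while the unreadable seed returns an error and no nodes, mirroring the behaviour of a single read.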
Depth, fanout, and a total node budget bound the walk so it always terminates, even with an uncapped fanout. Edges are recorded eagerly so the stored graph stays complete while nodes stay de-duplicated by an alias-collapsing key.
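A minimal sketch of how the three bounds, eager edge recording, and key-based dedup might fit together over a fake in-memory graph. Everything here is assumed for illustration: the one-method grapher, fakeGraph, Limits, Edge, walk, and especially key, whose lower-casing and trimming are only placeholders for whatever aliases the real key collapses.

```go
package main

import (
	"fmt"
	"strings"
)

// grapher is a hypothetical one-method stand-in for the interface the walker
// depends on: the outgoing links of a node.
type grapher interface {
	Links(id string) []string
}

// fakeGraph is an in-memory adjacency list, so the walk needs no network.
type fakeGraph map[string][]string

func (g fakeGraph) Links(id string) []string { return g[id] }

// Edge is one traversed link.
type Edge struct{ From, To string }

// Limits bounds the walk. In this sketch Fanout <= 0 means uncapped.
type Limits struct{ Depth, Fanout, Budget int }

// key collapses aliases of one node to a single dedup key (placeholder rule).
func key(id string) string { return strings.ToLower(strings.TrimSpace(id)) }

// walk is a breadth-first traversal from seed. It returns nodes in the order
// reached and every link followed, including links to already-seen nodes.
func walk(g grapher, seed string, lim Limits) (nodes []string, edges []Edge) {
	type item struct {
		id    string
		depth int
	}
	if lim.Budget <= 0 {
		return nil, nil
	}
	seen := map[string]bool{key(seed): true}
	nodes = append(nodes, seed)
	queue := []item{{seed, 0}}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur.depth >= lim.Depth {
			continue // depth bound: do not expand this node
		}
		links := g.Links(cur.id)
		if lim.Fanout > 0 && len(links) > lim.Fanout {
			links = links[:lim.Fanout] // fanout bound
		}
		for _, to := range links {
			k := key(to)
			if !seen[k] && len(nodes) >= lim.Budget {
				return nodes, edges // node budget spent
			}
			edges = append(edges, Edge{cur.id, to}) // recorded before dedup
			if seen[k] {
				continue
			}
			seen[k] = true
			nodes = append(nodes, to)
			queue = append(queue, item{to, cur.depth + 1})
		}
	}
	return nodes, edges
}

// demo contains a cycle (v1 <-> c1) and an alias ("V2" for "v2").
var demo = fakeGraph{
	"v1": {"c1", "v2", "v3"},
	"c1": {"v1", "V2", "p1"},
	"v2": {"c1"},
	"p1": {"v4"},
}

func main() {
	nodes, edges := walk(demo, "v1", Limits{Depth: 2, Budget: 10})
	fmt.Println(nodes)
	fmt.Println(len(edges), "edges")
}
```

Because the edge is appended before the seen check, links back to known nodes still land in the edge list while the node list stays unique. The budget check is what guarantees termination when Fanout is left uncapped.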
Persisting
ytb discover --store tees the walk into the typed crawl store and records each traversed link into a new edges table, so a live walk doubles as a crawl you can query with ytb db query afterwards. The existing seed/crawl/queue/jobs worklist crawler is untouched; discover is the complement that finds the worklist by walking instead of draining one.
Tests and docs
Hermetic walker tests cover edge parsing, seed classification, BFS order, presets, the comment degradation path, the budget and fanout caps, dedup, fatal-seed, and depth-zero. Docs get a graph-discovery guide, a persist-a-walk section in the store guide, and a discover entry in the CLI reference.
Verified: CGO_ENABLED=0 go build/vet/test ./... green, gofmt clean, go mod tidy no-op, live walk confirmed against a real video.