cryptprosteel/bluesky-jetstream-scraper

Bluesky Jetstream Scraper

A real-time Bluesky data scraper that captures posts, interactions, and profiles directly from the live network stream. It helps researchers and analysts monitor trends, users, and conversations as they happen, without relying on slow or rate-limited crawling methods.

Created by Bitbash, built to showcase our approach to Scraping and Automation!

Introduction

This project collects live data from the Bluesky network using a continuous stream approach, focusing on speed, flexibility, and scale. It solves the problem of delayed or incomplete social data collection by processing content the moment it’s created. It’s built for developers, data analysts, and researchers who need timely insights from Bluesky.
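As a rough illustration of the continuous stream approach, the sketch below builds a Jetstream subscribe URL and forwards each parsed event to a callback. The host name is an assumption (one of the public Jetstream instances), and `buildJetstreamUrl` and `connect` are hypothetical helper names, not the repository's actual `jetstreamClient.js`.

```javascript
// Assumption: a public Jetstream instance; substitute whichever host you use.
const DEFAULT_JETSTREAM_HOST = "jetstream2.us-east.bsky.network";

// Build a subscribe URL limited to the given collections and, optionally, author DIDs.
function buildJetstreamUrl({
  host = DEFAULT_JETSTREAM_HOST,
  collections = ["app.bsky.feed.post"],
  dids = [],
} = {}) {
  const params = new URLSearchParams();
  for (const collection of collections) params.append("wantedCollections", collection);
  for (const did of dids) params.append("wantedDids", did);
  return `wss://${host}/subscribe?${params.toString()}`;
}

// Open the stream and hand each parsed JSON event to onEvent.
// WebSocketImpl defaults to the global WebSocket where the runtime provides one;
// pass a compatible constructor otherwise.
function connect(url, onEvent, WebSocketImpl = globalThis.WebSocket) {
  const socket = new WebSocketImpl(url);
  socket.onmessage = (message) => {
    let event;
    try {
      event = JSON.parse(message.data);
    } catch {
      return; // Skip frames that are not valid JSON.
    }
    onEvent(event);
  };
  return socket;
}
```

Calling `connect(buildJetstreamUrl(), console.log)` would print events as they arrive. Filtering by collection on the server side keeps unrelated record types off the wire.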

Why real-time Bluesky data matters

  • Captures posts and interactions the instant they’re published
  • Avoids traditional API rate limits and heavy request overhead
  • Supports flexible filtering by hashtags, users, and language
  • Designed for high-volume monitoring and trend analysis

Features

| Feature | Description |
| --- | --- |
| Real-time stream ingestion | Collects Bluesky content as it is created, not after the fact. |
| Advanced filtering | Filter by hashtags, usernames, languages, and content types. |
| Media and reply control | Toggle inclusion of images, media URLs, and reply metadata. |
| Profile enrichment | Optionally attach detailed author profile information. |
| Resilient connections | Automatic reconnection and retry handling for long runs. |
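The filtering features could be expressed as a single predicate over an extracted post. This is a minimal sketch only: `matchesFilters` and its option names (`hashtags`, `handles`, `languages`, `includeReplies`) are hypothetical, since the repository's real settings are not shown here.

```javascript
// Hypothetical filter options; the project's actual option names may differ.
// An empty list means "do not filter on this dimension".
function matchesFilters(
  post,
  { hashtags = [], handles = [], languages = [], includeReplies = true } = {}
) {
  const lower = (value) => String(value).toLowerCase();

  if (!includeReplies && post.isReply) return false;

  if (languages.length > 0 && !languages.map(lower).includes(lower(post.language ?? ""))) {
    return false;
  }

  if (handles.length > 0 && !handles.map(lower).includes(lower(post.authorHandle ?? ""))) {
    return false;
  }

  if (hashtags.length > 0) {
    const postTags = new Set((post.hashtags ?? []).map(lower));
    // Accept wanted tags written with or without a leading "#".
    const wanted = hashtags.map((tag) => lower(tag).replace(/^#/, ""));
    if (!wanted.some((tag) => postTags.has(tag))) return false;
  }

  return true;
}
```

All comparisons are case-insensitive, and a post passes the hashtag filter if it carries at least one of the wanted tags.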

What Data This Scraper Extracts

| Field Name | Field Description |
| --- | --- |
| postUri | Unique identifier of the Bluesky post. |
| text | Full text content of the post. |
| createdAt | Timestamp when the post was published. |
| language | Detected or declared language of the post. |
| hashtags | Extracted hashtags found in the text. |
| hasMedia | Indicates whether the post contains media. |
| mediaUrls | Direct URLs to attached media files. |
| isReply | Shows whether the post is a reply. |
| authorHandle | Username of the post author. |
| authorDid | Decentralized identifier of the author. |
| authorFollowersCount | Number of followers (when profile enrichment is enabled). |
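To make the field mapping concrete, here is a sketch of how a Jetstream commit event for a new post might be turned into the fields above. `toPostRecord` is a hypothetical name and not the repository's `postProcessor.js`. It takes the first declared language, reads hashtags from rich-text tag facets, and leaves out `mediaUrls`, `authorHandle`, and `authorFollowersCount`, which need extra lookups.

```javascript
// Embed types treated as "media" in this sketch.
const MEDIA_EMBED_TYPES = new Set([
  "app.bsky.embed.images",
  "app.bsky.embed.video",
  "app.bsky.embed.recordWithMedia",
]);

// Map a Jetstream event to an output record, or return null for anything
// that is not the creation of an app.bsky.feed.post record.
function toPostRecord(event) {
  const commit = event?.commit;
  if (
    event?.kind !== "commit" ||
    commit?.operation !== "create" ||
    commit?.collection !== "app.bsky.feed.post"
  ) {
    return null;
  }

  const record = commit.record ?? {};

  // Hashtags are carried as rich-text facets with a "#tag" feature.
  const hashtags = [];
  for (const facet of record.facets ?? []) {
    for (const feature of facet.features ?? []) {
      if (feature.$type === "app.bsky.richtext.facet#tag" && typeof feature.tag === "string") {
        hashtags.push(feature.tag);
      }
    }
  }

  return {
    postUri: `at://${event.did}/${commit.collection}/${commit.rkey}`,
    text: record.text ?? "",
    createdAt: record.createdAt ?? null,
    language: record.langs?.[0] ?? null, // first declared language only
    hashtags,
    hasMedia: MEDIA_EMBED_TYPES.has(record.embed?.$type),
    isReply: record.reply != null,
    authorDid: event.did,
  };
}
```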

Example Output

[
  {
    "postUri": "at://did:plc:abc123/app.bsky.feed.post/xyz789",
    "text": "Exploring real-time data on Bluesky is fascinating.",
    "createdAt": "2025-01-12T14:32:10Z",
    "language": "en",
    "hashtags": ["bluesky", "data"],
    "hasMedia": false,
    "isReply": false,
    "authorHandle": "researcher.bsky.social",
    "authorDid": "did:plc:abc123",
    "authorFollowersCount": 1840
  }
]
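The `authorHandle` and `authorFollowersCount` values in the example come from profile enrichment. One way this could work is sketched below, with a per-DID cache so each author is looked up once. `createProfileEnricher` and `fetchPublicProfile` are hypothetical names. The default fetcher assumes the public `app.bsky.actor.getProfile` endpoint on `public.api.bsky.app`, which is subject to its own rate limits.

```javascript
// Assumed default lookup: the public AppView's app.bsky.actor.getProfile endpoint.
async function fetchPublicProfile(did) {
  const url =
    "https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile?actor=" +
    encodeURIComponent(did);
  const response = await fetch(url);
  if (!response.ok) throw new Error(`getProfile failed with status ${response.status}`);
  return response.json();
}

// Returns an async function that attaches profile fields to a post record.
// Lookups are cached per DID, including failed ones, and the cache is unbounded
// in this sketch; a long-running process would want eviction.
function createProfileEnricher(fetchProfile = fetchPublicProfile) {
  const cache = new Map(); // DID -> Promise resolving to a profile or null

  return async function enrich(post) {
    if (!cache.has(post.authorDid)) {
      cache.set(post.authorDid, fetchProfile(post.authorDid).catch(() => null));
    }
    const profile = await cache.get(post.authorDid);
    if (!profile) return post; // leave the post unchanged if the lookup failed
    return {
      ...post,
      authorHandle: profile.handle,
      authorFollowersCount: profile.followersCount,
    };
  };
}
```

Injecting `fetchProfile` keeps the enricher testable without network access and lets you swap in an authenticated client.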

Directory Structure Tree

Bluesky Jetstream Scraper/
├── src/
│   ├── index.js
│   ├── stream/
│   │   ├── jetstreamClient.js
│   │   └── reconnectHandler.js
│   ├── filters/
│   │   ├── hashtags.js
│   │   ├── users.js
│   │   └── languages.js
│   ├── processors/
│   │   ├── postProcessor.js
│   │   └── profileEnricher.js
│   └── config/
│       └── settings.example.json
├── data/
│   ├── samples.json
│   └── checkpoints/
├── package.json
└── README.md

Use Cases

  • Market analysts use it to monitor emerging topics, so they can spot trends early.
  • Researchers use it to collect live social data, so they can analyze online behavior in real time.
  • Developers use it to build dashboards, so they can visualize Bluesky activity as it happens.
  • Journalists use it to track breaking discussions, so they can respond quickly to news cycles.

FAQs

Does this scraper collect historical data? No. It processes only live content from the moment it starts running. Past posts are not accessible.

Can I limit the amount of data collected? Yes. You can cap collection by post count, time limit, or both to control output size.
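A cap by post count, elapsed time, or both can be sketched as a small limiter that the stream loop consults after each collected post. `createLimiter`, `maxPosts`, and `maxSeconds` are hypothetical names chosen for illustration, not the project's actual settings.

```javascript
// Stops collection once either limit is reached; omitted limits never trigger.
// The clock is injectable so the behaviour can be checked deterministically.
function createLimiter({ maxPosts = Infinity, maxSeconds = Infinity, now = Date.now } = {}) {
  const startedAt = now();
  let count = 0;

  // True when the post cap or the time cap has been reached.
  function shouldStop() {
    const elapsedSeconds = (now() - startedAt) / 1000;
    return count >= maxPosts || elapsedSeconds >= maxSeconds;
  }

  // Count one collected post, then report whether collection should stop.
  function record() {
    count += 1;
    return shouldStop();
  }

  return { record, shouldStop };
}
```

In a stream handler this might read `if (limiter.record()) socket.close();`, where `socket` is whatever connection object the client holds.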

Is language detection reliable? Language detection works well for most posts, but accuracy depends on text length and clarity.

Can it run continuously for long periods? Yes. With automatic reconnection and retries enabled, it’s suitable for extended monitoring.
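Reconnection for long runs is commonly paired with exponential backoff, so repeated failures do not hammer the server. The sketch below shows one such delay calculation with jitter. It is an assumption about how retries could be spaced, not the repository's `reconnectHandler.js`, and `backoffDelayMs` with its defaults is hypothetical.

```javascript
// Delay before reconnect attempt number `attempt` (0-based).
// The base delay doubles per attempt up to maxMs; the result is then spread
// between half of that value and the full value ("equal jitter") so that many
// clients do not reconnect in lockstep.
function backoffDelayMs(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.round(capped / 2 + random() * (capped / 2));
}
```

A reconnect loop would wait `backoffDelayMs(attempt)` milliseconds after each failed connection and reset `attempt` to 0 once events flow again.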


Performance Benchmarks and Results

Primary Metric: Processes thousands of posts per minute during peak network activity.

Reliability Metric: Maintains stable connections with over 99% uptime in long-running sessions.

Efficiency Metric: Low CPU and memory footprint due to stream-based processing.

Quality Metric: High data completeness with consistent field extraction across supported content types.
