A real-time Bluesky data scraper that captures posts, interactions, and profiles directly from the live network stream. It helps researchers and analysts monitor trends, users, and conversations as they happen, without relying on slow or rate-limited crawling methods.
Created by Bitbash to showcase our approach to scraping and automation.
This project collects live data from the Bluesky network using a continuous stream approach, focusing on speed, flexibility, and scale. It solves the problem of delayed or incomplete social data collection by processing content the moment it’s created. It’s built for developers, data analysts, and researchers who need timely insights from Bluesky.
- Captures posts and interactions the instant they’re published
- Avoids traditional API rate limits and heavy request overhead
- Supports flexible filtering by hashtags, users, and language
- Designed for high-volume monitoring and trend analysis
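As a rough illustration of the stream-based approach described above, the sketch below builds a Jetstream subscription URL and decides whether an incoming message is a newly created post. The helper names (`buildSubscribeUrl`, `isNewPost`, `handleMessage`) are hypothetical and not taken from this repository. The endpoint shown is one of the public Jetstream instances, and whether this project uses that particular instance is an assumption.

```javascript
// Assumption: one of the public Jetstream instances; the project may use another.
const JETSTREAM_URL = "wss://jetstream2.us-east.bsky.network/subscribe";
const POST_COLLECTION = "app.bsky.feed.post";

// Build a subscription URL that asks Jetstream for specific collections only.
function buildSubscribeUrl(baseUrl, collections) {
  const url = new URL(baseUrl);
  for (const collection of collections) {
    url.searchParams.append("wantedCollections", collection);
  }
  return url.toString();
}

// True when the event is a commit that creates a new post record.
function isNewPost(event) {
  return (
    event !== null &&
    typeof event === "object" &&
    event.kind === "commit" &&
    event.commit?.operation === "create" &&
    event.commit?.collection === POST_COLLECTION
  );
}

// Parse one raw message and hand new posts to the callback.
// Returns true if the callback was invoked, false otherwise.
function handleMessage(raw, onPost) {
  let event;
  try {
    event = JSON.parse(raw);
  } catch {
    return false; // ignore malformed frames instead of crashing the stream
  }
  if (!isNewPost(event)) return false;
  onPost(event);
  return true;
}
```

With a WebSocket client (for example the global `WebSocket` in recent Node.js versions, or the `ws` package), the URL from `buildSubscribeUrl` would be opened once and each message's data passed to `handleMessage`.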
| Feature | Description |
|---|---|
| Real-time stream ingestion | Collects Bluesky content as it is created, not after the fact. |
| Advanced filtering | Filter by hashtags, usernames, languages, and content types. |
| Media and reply control | Toggle inclusion of images, media URLs, and reply metadata. |
| Profile enrichment | Optionally attach detailed author profile information. |
| Resilient connections | Automatic reconnection and retry handling for long runs. |
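The filtering feature in the table above could be sketched as a predicate over output records. This is a minimal sketch, not the repository's implementation: the option names (`hashtags`, `users`, `languages`, `includeReplies`) and the `createPostFilter` helper are assumptions chosen for illustration, and the record fields follow the output field names documented in this README.

```javascript
// Build a predicate that returns true when a post record passes every active filter.
// An empty list for a given filter means "do not filter on this dimension".
function createPostFilter({ hashtags = [], users = [], languages = [], includeReplies = true } = {}) {
  // Normalize once so per-post checks are cheap set lookups.
  const wantedTags = new Set(hashtags.map((t) => t.replace(/^#/, "").toLowerCase()));
  const wantedUsers = new Set(users.map((u) => u.replace(/^@/, "").toLowerCase()));
  const wantedLangs = new Set(languages.map((l) => l.toLowerCase()));

  return function matches(post) {
    if (!includeReplies && post.isReply) return false;

    const lang = String(post.language ?? "").toLowerCase();
    if (wantedLangs.size > 0 && !wantedLangs.has(lang)) return false;

    const handle = String(post.authorHandle ?? "").toLowerCase();
    if (wantedUsers.size > 0 && !wantedUsers.has(handle)) return false;

    const tags = (post.hashtags ?? []).map((t) => t.toLowerCase());
    if (wantedTags.size > 0 && !tags.some((t) => wantedTags.has(t))) return false;

    return true;
  };
}

// Example: keep English, non-reply posts tagged #bluesky.
const keep = createPostFilter({ hashtags: ["#Bluesky"], languages: ["en"], includeReplies: false });
```

Filters are combined with AND here, so a post must satisfy every configured dimension. Matching any one of several hashtags within a dimension is treated as OR.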
| Field Name | Field Description |
|---|---|
| postUri | Unique identifier of the Bluesky post. |
| text | Full text content of the post. |
| createdAt | Timestamp when the post was published. |
| language | Detected or declared language of the post. |
| hashtags | Extracted hashtags found in the text. |
| hasMedia | Indicates whether the post contains media. |
| mediaUrls | Direct URLs to attached media files. |
| isReply | Shows whether the post is a reply. |
| authorHandle | Username of the post author. |
| authorDid | Decentralized identifier of the author. |
| authorFollowersCount | Number of followers (when profile enrichment is enabled). |
[
  {
    "postUri": "at://did:plc:abc123/app.bsky.feed.post/xyz789",
    "text": "Exploring real-time data on Bluesky is fascinating.",
    "createdAt": "2025-01-12T14:32:10Z",
    "language": "en",
    "hashtags": ["bluesky", "data"],
    "hasMedia": false,
    "isReply": false,
    "authorHandle": "researcher.bsky.social",
    "authorDid": "did:plc:abc123",
    "authorFollowersCount": 1840
  }
]
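To show how a record like the sample above might be derived, the sketch below maps a Jetstream commit event for an `app.bsky.feed.post` record onto the documented output fields. The function name `toOutputRecord` is hypothetical. The sketch assumes hashtags are read from rich-text facets and that only image, video, and record-with-media embeds count as media. Fields that a stream event does not carry on its own (`authorHandle`, `authorFollowersCount`, `mediaUrls`) are left out, since they would need a separate profile or blob lookup.

```javascript
// Embed types treated as "media" in this sketch (an assumption, not the project's rule).
const MEDIA_EMBED_TYPES = new Set([
  "app.bsky.embed.images",
  "app.bsky.embed.video",
  "app.bsky.embed.recordWithMedia",
]);

// Map one Jetstream commit event (post creation) to the README's output shape.
function toOutputRecord(event) {
  const { did, commit } = event;
  const record = commit.record ?? {};

  // Hashtags are declared as facet features of type "app.bsky.richtext.facet#tag".
  const hashtags = (record.facets ?? [])
    .flatMap((facet) => facet.features ?? [])
    .filter((feature) => feature.$type === "app.bsky.richtext.facet#tag")
    .map((feature) => feature.tag);

  return {
    postUri: `at://${did}/${commit.collection}/${commit.rkey}`,
    text: record.text ?? "",
    createdAt: record.createdAt ?? null,
    language: record.langs?.[0] ?? null, // first declared language, if any
    hashtags,
    hasMedia: MEDIA_EMBED_TYPES.has(record.embed?.$type),
    isReply: Boolean(record.reply),
    authorDid: did,
  };
}
```

Using only the first entry of `langs` is a simplification. A post can declare several languages, and a fuller implementation might keep the whole list or fall back to detection when none is declared.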
Bluesky Jetstream Scraper/
├── src/
│   ├── index.js
│   ├── stream/
│   │   ├── jetstreamClient.js
│   │   └── reconnectHandler.js
│   ├── filters/
│   │   ├── hashtags.js
│   │   ├── users.js
│   │   └── languages.js
│   ├── processors/
│   │   ├── postProcessor.js
│   │   └── profileEnricher.js
│   └── config/
│       └── settings.example.json
├── data/
│   ├── samples.json
│   └── checkpoints/
├── package.json
└── README.md
- Market analysts use it to monitor emerging topics, so they can spot trends early.
- Researchers use it to collect live social data, so they can analyze online behavior in real time.
- Developers use it to build dashboards, so they can visualize Bluesky activity as it happens.
- Journalists use it to track breaking discussions, so they can respond quickly to news cycles.
Does this scraper collect historical data? No. It processes only live content from the moment it starts running. Posts published before the run started are not collected.
Can I limit the amount of data collected? Yes. You can cap collection by post count, time limit, or both to control output size.
Is language detection reliable? Language detection works well for most posts, but accuracy depends on text length and clarity.
Can it run continuously for long periods? Yes. With automatic reconnection and retries enabled, it’s suitable for extended monitoring.
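Two of the answers above, capping collection and reconnecting during long runs, can be sketched with small helpers. This is an illustrative sketch under assumptions: `createLimiter`, `backoffDelay`, and their option names (`maxPosts`, `maxSeconds`, `baseMs`, `maxMs`) are hypothetical, and exponential backoff with jitter is one common retry strategy rather than the one this project is confirmed to use.

```javascript
// Track how many posts were collected and how long the run has lasted.
// `now` is injectable so the limiter can be exercised with a fake clock.
function createLimiter({ maxPosts = Infinity, maxSeconds = Infinity } = {}, now = Date.now) {
  const startedAt = now();
  let count = 0;
  return {
    record() {
      count += 1;
    },
    // True once either cap has been reached.
    shouldStop() {
      const elapsedSeconds = (now() - startedAt) / 1000;
      return count >= maxPosts || elapsedSeconds >= maxSeconds;
    },
    get count() {
      return count;
    },
  };
}

// Exponential backoff with full jitter: the ceiling doubles per attempt up to maxMs,
// and the actual delay is a random value below that ceiling.
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

A reconnect loop would typically call `backoffDelay(attempt)` after each failed connection, wait that many milliseconds, and reset `attempt` to zero once a connection succeeds. The limiter's `shouldStop()` would be checked after each recorded post.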
Primary Metric: Processes thousands of posts per minute during peak network activity.
Reliability Metric: Maintains stable connections with over 99% uptime in long-running sessions.
Efficiency Metric: Low CPU and memory footprint due to stream-based processing.
Quality Metric: High data completeness with consistent field extraction across supported content types.