Coelacanth is a Ruby gem for extracting high-quality article content, metadata, and screenshots from arbitrary web pages. It is built to power content ingestion pipelines that have to withstand layout experiments, CMS redesigns, and inconsistent markup while remaining easy to extend.
It is the successor to web_stat and continues the same goal of reliable article
extraction under the slidict umbrella. Compared to web_stat, the gem has been
re-architected with a modern extractor pipeline, built-in screenshot capture, and a clearer configuration story so you can drop
it into contemporary ingestion stacks without bespoke glue code.
- Features
- Requirements
- Installation
- Quick start
- Extractor pipeline
- Configuration
- Development workflow
- Testing
- Contributing
- License
- Layout-resilient extraction – Multi-stage extractor falls back from structured metadata to heuristics and lightweight machine learning so you continue to get clean article bodies even when markup drifts.
- UTF-8 normalization – HTML responses are normalized into UTF-8 before parsing to play nicely with Japanese and other multi-byte sources.
- Screenshot capture – Fetches full-page PNGs via a configurable browser client so you can archive visual context alongside the extracted text.
- Redirect resolution – Follows HTTP redirects, including long redirect chains, so the extractor operates on the final landing page.
- Configurable HTTP headers – Inject custom headers (user agent, authorization, etc.) into the remote browser session for authenticated or geo-targeted crawling.
- Multi-stage pipeline – web_stat relied on a single-pass heuristic extractor, whereas Coelacanth layers metadata, heuristic, and optional ML probes that graduate based on confidence thresholds.
- First-class screenshots – Capture full-page PNGs alongside the extracted text without writing a separate headless browser integration.
- Environment-aware configuration – Manage remote browser credentials, HTTP headers, and client selection through config/coelacanth.yml instead of hand-tuned initializer code.
- Markdown-first output – Get both Markdown and raw DOM representations from Coelacanth.analyze so you can publish the same payload to static-site builders, CMS importers, or downstream summarizers.
- Ruby 3.4 or newer
- Bundler for dependency management
- A remote Chrome-compatible WebSocket endpoint when using the default Ferrum client (see Configuration)
Add the gem to your application:
gem "coelacanth"
Install the dependencies:
bundle install
Or install the gem directly:
gem install coelacanth

require "coelacanth"
result = Coelacanth.analyze("https://example.com/article")
result[:extraction] # => article metadata and body markdown
result[:dom] # => Oga DOM representation for downstream processing
result[:screenshot] # => PNG screenshot as a binary string
result[:response] # => HTTP status, headers, and final URL
You can run the morphological analyzer without fetching a page by passing plain text:
Coelacanth.morphological_analysis("これはテストです。 Testing morphology twice.")
# => [
# { token: "testing morphology twice", score: 1.23, count: 2 },
# { token: "テスト", score: 1.02, count: 1 },
# ...
# ]
The hash returned by Coelacanth.analyze includes:
- :extraction – output from Coelacanth::Extractor, including title, Markdown body (body_markdown, body_markdown_list, and scored morphemes in body_morphemes), the normalized plain-text body (body_text), images, listings, published date, detected site name, and the probe source and confidence score. The extractor also echoes the HTTP metadata it received via response_metadata for downstream consumers that only operate on the extraction payload.
- :dom – a parsed Oga DOM if you need to traverse the document manually.
- :screenshot – raw PNG data that you can persist or feed to other systems.
- :response – HTTP metadata captured during the initial fetch.
The :response key exposes a hash with the following keys:
- :status_code – Numeric HTTP status (e.g., 200).
- :headers – A lowercase header hash as returned by Net::HTTP#each_header.
- :final_url – The URL that was ultimately fetched after resolving redirects.
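As a rough illustration of the shape described above, the following sketch builds an equivalent hash from a standard-library response object. The helper name response_metadata is hypothetical and is not part of Coelacanth; the response is constructed locally so the example runs without network access.

```ruby
require "net/http"

# Hypothetical helper (not part of Coelacanth): builds a hash with the same
# three keys that the README documents for result[:response].
def response_metadata(http_response, final_url)
  headers = {}
  # each_header yields header names in lowercase, with values joined by ", ".
  http_response.each_header { |name, value| headers[name] = value }

  {
    status_code: http_response.code.to_i, # Net::HTTP stores the code as a string
    headers: headers,
    final_url: final_url
  }
end

# Construct a response by hand instead of performing a real request.
response = Net::HTTPOK.new("1.1", "200", "OK")
response["Content-Type"] = "text/html; charset=utf-8"

metadata = response_metadata(response, "https://example.com/article")
puts metadata[:headers].keys.inspect
```

Because the header names arrive in lowercase, lookups such as metadata[:headers]["content-type"] do not need case normalization.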
Within the extraction payload (result[:extraction]), the following additional metadata is available:
- :site_name – Site or application name inferred from Open Graph/Twitter meta tags or the document <title>.
- :body_text – Plain-text body with collapsed whitespace, suitable for search indexing or summarization.
- :response_metadata – Mirrors the top-level :response hash so downstream processing can access HTTP metadata without carrying the entire analysis result.
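A minimal sketch of persisting an analysis result to disk might look like the following. The helper archive_result and the file names are hypothetical, and the sketch assumes the result behaves like a plain Ruby hash with the symbol keys documented above.

```ruby
require "fileutils"

# Hypothetical helper (not part of Coelacanth): writes the Markdown body and
# the PNG screenshot from an analysis result into +dir+.
# ASSUMPTION: +result+ is a hash with the symbol keys the README documents.
def archive_result(result, dir)
  FileUtils.mkdir_p(dir)

  markdown_path = File.join(dir, "article.md")
  screenshot_path = File.join(dir, "screenshot.png")

  # body_markdown is part of the extraction payload; to_s guards against nil.
  File.write(markdown_path, result.dig(:extraction, :body_markdown).to_s)

  # The screenshot is a binary string, so it is written in binary mode.
  File.binwrite(screenshot_path, result[:screenshot]) if result[:screenshot]

  {
    markdown: markdown_path,
    screenshot: result[:screenshot] ? screenshot_path : nil,
    final_url: result.dig(:response, :final_url)
  }
end
```

In an application you would pass the return value of Coelacanth.analyze as result; recording final_url alongside the files keeps the archive traceable to the page that was actually fetched.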
Coelacanth ships with a multi-stage extractor that tries increasingly involved probes until one meets its confidence target:
- MetadataProbe (threshold 0.85) pulls schema.org JSON-LD, Open Graph, Twitter Cards, or semantic containers such as <main>/<article> when available.
- HeuristicProbe (threshold 0.75) scores block-level nodes using text length, link density, punctuation density, DOM depth, and sibling variance, then greedily attaches surrounding headers and media.
- WeakMlProbe (threshold 0.70) optionally boosts accuracy with a lightweight classifier that combines heuristic features with class and id tokens (e.g., article-body, post, content).
- FallbackProbe acts as a safety net by following AMP/print links or summarizing the whole document when the previous probes fail.
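The cascade described in the list above can be sketched as follows. This is not the gem's implementation: ProbeStage, run_pipeline, and the toy lambdas are hypothetical stand-ins, and only the stage order and thresholds are taken from the list.

```ruby
# A probe here is any callable that returns a hash with :body and :confidence,
# or nil when it finds nothing. All names are illustrative only.
ProbeStage = Struct.new(:name, :threshold, :probe)

# Runs the stages in order and returns the first result whose confidence meets
# that stage's threshold; otherwise the fallback callable produces the result.
def run_pipeline(html, stages, fallback)
  stages.each do |stage|
    result = stage.probe.call(html)
    next if result.nil?
    return result.merge(source: stage.name) if result[:confidence] >= stage.threshold
  end
  fallback.call(html).merge(source: :fallback)
end

# Toy probes: real probes would inspect metadata, DOM features, and so on.
stages = [
  ProbeStage.new(:metadata, 0.85, lambda { |html|
    html.include?("<article>") ? { body: "from metadata", confidence: 0.9 } : nil
  }),
  ProbeStage.new(:heuristic, 0.75, lambda { |html|
    { body: "from heuristics", confidence: html.length > 40 ? 0.8 : 0.5 }
  }),
  ProbeStage.new(:weak_ml, 0.70, lambda { |html|
    { body: "from classifier", confidence: html.include?("post") ? 0.72 : 0.4 }
  })
]
fallback = ->(html) { { body: html, confidence: 0.0 } }

puts run_pipeline("<article>Hello</article>", stages, fallback)[:source]
```

Keeping the threshold next to each stage makes it easy to see why a cheaper probe was rejected before a more involved one ran.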
Markdown-based listings are generated from the extracted body so lists such as "Latest news" blocks can be stored alongside the article without scanning the rest of the page layout.
Runtime configuration is stored in config/coelacanth.yml. Environments inherit from the development section by default.
development:
  client: "ferrum" # Options: "ferrum", "screenshot_one"
  remote_client:
    ws_url: "ws://chrome:3000/chrome"
    timeout: 10
    wait_for_idle_timeout: 5
    headers:
      <% if (auth = ENV["COELACANTH_REMOTE_CLIENT_AUTHORIZATION"]).to_s.strip != "" %>
      Authorization: "<%= auth %>"
      <% end %>
      User-Agent: "<%= ENV.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
  screenshot_one:
    key: "<%= ENV.fetch("COELACANTH_SCREENSHOT_ONE_API_KEY", "your_screenshot_one_api_key_here") %>"
  youtube:
    api_key: "<%= ENV.fetch("COELACANTH_YOUTUBE_API_KEY", "") %>"
  morphology:
    latin_joiners:
      - ","
    japanese_hiragana_suffixes:
      - "ら"
      - "の"
      - "え"
    japanese_category_breaks:
      - "katakana_to_kanji"

- Ferrum client – Requires a running Chrome instance that exposes the DevTools protocol via WebSocket. Configure the URL, timeout, the network idle timeout, and any headers to inject.
- ScreenshotOne client – Supply an API key to offload screenshot capture to ScreenshotOne.
- Eyecatch image extraction – Representative images are discovered automatically by checking Open Graph/Twitter metadata, Schema.org JSON-LD payloads, and high-signal <img> elements (hero/cover images, large dimensions, etc.). No manual XPath maintenance is required.
- YouTube Data API – Set an API key to turn YouTube watch URLs into structured articles using the video description and thumbnail for downstream processing.
- Configuration is environment-aware: set RAILS_ENV/RACK_ENV or use Rails' built-in environment handling when the gem is used inside a Rails project.
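To make the ERB-in-YAML pattern concrete, here is a small sketch of rendering and loading such a file with the standard library. It is an assumption-laden illustration rather than the gem's loader: load_config, deep_merge, the trimmed template, and the production section are hypothetical, and treating "inherit from development" as a deep merge is one plausible reading of the behaviour described above.

```ruby
require "erb"
require "yaml"

# Trimmed-down template in the style of config/coelacanth.yml. The production
# section is invented for illustration.
CONFIG_TEMPLATE = <<~'YAML'
  development:
    client: "ferrum"
    remote_client:
      timeout: 10
      headers:
  <% if (auth = env["COELACANTH_REMOTE_CLIENT_AUTHORIZATION"]).to_s.strip != "" %>
        Authorization: "<%= auth %>"
  <% end %>
        User-Agent: "<%= env.fetch("COELACANTH_REMOTE_CLIENT_USER_AGENT", "Coelacanth Chrome Extension") %>"
  production:
    remote_client:
      timeout: 30
YAML

# Recursively merges nested hashes; values from +override+ win.
def deep_merge(base, override)
  base.merge(override) do |_key, old_value, new_value|
    old_value.is_a?(Hash) && new_value.is_a?(Hash) ? deep_merge(old_value, new_value) : new_value
  end
end

# Renders the ERB tags, parses the YAML, and layers the requested environment
# on top of the development section (ASSUMPTION: inheritance is a deep merge).
def load_config(template, environment:, env: ENV)
  rendered = ERB.new(template).result_with_hash(env: env)
  sections = YAML.safe_load(rendered)
  deep_merge(sections.fetch("development", {}), sections.fetch(environment, {}))
end

config = load_config(CONFIG_TEMPLATE, environment: "production", env: {})
puts config.dig("remote_client", "headers").keys.inspect
```

Passing the environment variables in as a hash keeps the sketch testable; with an empty hash the Authorization header is simply absent from the rendered YAML.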
The terms returned in body_morphemes can be tuned per deployment by configuring the optional morphology section:
- morphology.latin_joiners – An array of characters that should be treated as connectors between Latin tokens. The default value includes a comma so numbers such as 7,000 stay intact instead of being split into separate terms.
- morphology.japanese_hiragana_suffixes – A whitelist of Hiragana tokens that are allowed to extend Kanji sequences. By default we keep common nominal suffixes such as ら, の, and the trailing え in 訴え while preventing particles like に from merging with the preceding noun. Provide your own list or set the value to null/~ to allow any Hiragana suffix.
- morphology.japanese_category_breaks – An array of transitions (e.g., katakana_to_kanji) that should stop Japanese token sequences. This is useful when you want Katakana loanwords such as タワマン to stand alone instead of being merged with the Kanji terms that follow them.
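The effect of latin_joiners can be illustrated with a toy tokenizer. The function latin_tokens below is hypothetical and far simpler than the gem's analyzer; it only shows how a joiner character keeps a number such as 7,000 in one piece.

```ruby
# Hypothetical tokenizer (not the gem's analyzer): splits text into runs of
# ASCII letters and digits, optionally glued together by joiner characters.
def latin_tokens(text, joiners: [","])
  word = "[A-Za-z0-9]+"
  pattern =
    if joiners.empty?
      /#{word}/
    else
      # Escape each joiner so characters like "-" are safe in a character class.
      joiner_class = "[#{joiners.map { |j| Regexp.escape(j) }.join}]"
      # A joiner only connects tokens when another word follows it directly.
      /#{word}(?:#{joiner_class}#{word})*/
    end
  text.scan(pattern)
end

puts latin_tokens("About 7,000 people, mostly local").inspect
puts latin_tokens("About 7,000 people, mostly local", joiners: []).inspect
```

With the comma configured as a joiner the number survives as a single token, while the comma after "people" is ignored because a space follows it.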
Representative images are downloaded into a temporary directory using the built-in HTTP client. The extractor returns both the
resolved URL and the local file path via extraction[:eyecatch_image]. Remember to move or delete the file once you have
persisted it, because temporary directories are not automatically cleaned up for long-running processes.
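One way to handle this cleanup is to move the image out of the temporary directory as soon as the extraction finishes. The sketch below is hypothetical: persist_eyecatch is not part of the gem, and the :url and :path key names are assumptions, since the text above only states that a resolved URL and a local file path are returned.

```ruby
require "fileutils"

# Hypothetical helper (not part of Coelacanth): moves a downloaded eyecatch
# image into +dest_dir+ and returns the new path, or nil if nothing was moved.
# ASSUMPTION: the payload is a hash with :url and :path keys.
def persist_eyecatch(eyecatch, dest_dir)
  return nil if eyecatch.nil? || eyecatch[:path].nil?
  return nil unless File.file?(eyecatch[:path])

  FileUtils.mkdir_p(dest_dir)
  destination = File.join(dest_dir, File.basename(eyecatch[:path]))
  # Moving (rather than copying) removes the file from the temporary directory.
  FileUtils.mv(eyecatch[:path], destination)
  destination
end
```

In an application this could be called as persist_eyecatch(result[:extraction][:eyecatch_image], "storage/eyecatch"), where the destination directory is whatever location your pipeline treats as durable.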
Configuration values that would otherwise contain credentials are loaded from environment variables. Set the following
variables in your shell (or dotenv file) before running the gem:
# Optional: only set when the remote browser requires authentication.
export COELACANTH_REMOTE_CLIENT_AUTHORIZATION="Bearer <token>"
export COELACANTH_REMOTE_CLIENT_USER_AGENT="Coelacanth Chrome Extension"
export COELACANTH_SCREENSHOT_ONE_API_KEY="your_screenshot_one_api_key_here"
export COELACANTH_YOUTUBE_API_KEY="your_youtube_data_api_key"
If COELACANTH_REMOTE_CLIENT_AUTHORIZATION is omitted or left blank, the Authorization header is not injected into the
remote browser session.
With COELACANTH_YOUTUBE_API_KEY configured (or youtube.api_key populated directly in config/coelacanth.yml),
Coelacanth::Extractor runs a preprocessor that recognizes standard YouTube watch URLs (youtube.com, youtu.be,
m.youtube.com, etc.). The preprocessor fetches the video snippet from the YouTube Data API and builds an article-like HTML
document that contains:
- The video title and publish timestamp as structured metadata (JSON-LD and Open Graph).
- The full description rendered as Markdown-friendly paragraphs.
- The highest available thumbnail, passed to the eye-catch/image collector pipeline.
If the API key is missing or the API request fails, the extractor falls back to the original HTML that was fetched from YouTube, so non-video pages continue to behave as before.
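Recognizing a watch URL before calling the API could be sketched as follows. The function youtube_video_id and the host list are hypothetical and only cover the URL shapes named above; the gem's preprocessor may accept more forms.

```ruby
require "uri"

# Hosts treated as YouTube watch pages in this sketch (illustrative list).
YOUTUBE_WATCH_HOSTS = %w[youtube.com www.youtube.com m.youtube.com].freeze

# Hypothetical helper (not part of Coelacanth): returns the video ID for a
# YouTube watch URL, or nil for anything else.
def youtube_video_id(url)
  uri = URI.parse(url)
  host = uri.host.to_s.downcase

  id =
    if host == "youtu.be"
      # Short links carry the ID as the first path segment.
      uri.path.to_s.delete_prefix("/").split("/").first
    elsif YOUTUBE_WATCH_HOSTS.include?(host) && uri.path == "/watch"
      # Standard watch pages carry the ID in the "v" query parameter.
      URI.decode_www_form(uri.query.to_s).to_h["v"]
    end

  id.nil? || id.empty? ? nil : id
rescue URI::InvalidURIError
  nil
end

puts youtube_video_id("https://youtu.be/abc123XYZ_-?t=10").inspect
```

Returning nil for non-video pages mirrors the fallback described above: the caller can skip the API request and continue with the HTML that was fetched.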
When using Docker Compose, you can create a .env file or export the variables in your environment so the app service picks
them up automatically.
If you are working inside Docker, make sure the UID environment variable matches your host user by exporting it in your shell
startup file:
export UID=${UID}
Clone the repository and install dependencies:
git clone https://github.com/slidict/coelacanth.git
cd coelacanth
bundle install
You can open an interactive console with the gem loaded via:
bin/console
Run the test suite with RSpec:
bundle exec rspec
Bug reports and pull requests are welcome on GitHub at https://github.com/slidict/coelacanth. Please follow the Conventional Commits specification so we can keep the changelog automation healthy.
By participating in this project you agree to abide by the Contributor Covenant.
Coelacanth is available as open source under the terms of the MIT License.