url-to-markdown-cli-tool

A Node.js CLI tool that converts web pages into clean, LLM-friendly markdown. It fetches pages with Puppeteer and strips noise such as ads, navigation elements, and unnecessary formatting, leaving properly formatted markdown suited to large language models, RAG systems, or AI training datasets.

No LLM or API keys required.

Key features:

🔄 Convert any webpage to properly formatted markdown with headers, links, and structure
🎯 Targeted content extraction - Include only specific HTML tags like articles, main content, or sections
🧹 Smart content cleaning - Remove navigation, footers, scripts and other non-content elements
🚫 Remove images, links, or specific HTML tags as needed
📊 Enhanced table conversion - HTML tables become clean markdown tables with pipes and headers
📱 Viewport control - Mobile, tablet, desktop viewports for responsive content extraction
⏱️ Configurable wait times for dynamic content and SPAs
👁️ Headless or visible browser modes for debugging
📦 Easy npm installation with global CLI access
🧠 Clean output optimized for LLM parsing and understanding
⚡ Fast Node.js implementation with Puppeteer browser automation

Quick Install

npm install -g url-to-markdown-cli-tool

Then immediately get started:

url-to-md https://example.com -o example.md

Installation

Via npm (When published)

# Install globally to use anywhere
npm install -g url-to-markdown-cli-tool

# Or install locally in your project
npm install url-to-markdown-cli-tool

From Source (Current)

# Clone and install from this repository
git clone https://github.com/yourusername/url-to-markdown-cli-tool.git
cd url-to-markdown-cli-tool
npm install
npm install -g .

System Requirements

Node.js 18.0.0 or higher
Google Chrome or Chromium browser installed and accessible in PATH

Chrome/Chromium will be automatically detected.

Usage

# Convert a webpage to markdown file
url-to-md https://example.com -o example.md

# Smart content cleaning - removes nav, footer, aside, script, style, header, noscript, canvas
url-to-md https://blog.example.com --clean-content -o clean-blog.md

# Output to console
url-to-md https://example.com

# Get help
url-to-md --help

Targeted Content Extraction

# Extract only article content (perfect for news sites and blogs)
url-to-md https://news-site.com --include-tags article -o article.md

# Extract main content areas (common for documentation and blogs)
url-to-md https://blog.com --include-tags main section article -o content.md

# Combine targeted extraction with cleanup (drop sidebars and navigation inside articles)
url-to-md https://news.com --include-tags article --remove-tags aside nav -o clean-article.md

# Extract specific sections from documentation
url-to-md https://docs.example.com --include-tags main section -o docs.md

How Include-Tags and Remove-Tags Work Together

The --include-tags and --remove-tags flags can be combined for powerful content filtering:

Include-tags first: Only content within specified tags is processed
Remove-tags second: Specified tags are removed from the included content
Priority rule: If a tag appears in both lists, include-tags takes precedence

# Example: Extract article content but remove ads and navigation within it
url-to-md https://news-site.com \
  --include-tags article \
  --remove-tags aside nav .advertisement \
  -o clean-article.md
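
The three rules above can be sketched on a toy document tree. This is only an illustration of the documented ordering, not the tool's actual source: the `{ tag, children, text }` node shape and the function names `applyTagFilters` and `textOf` are assumptions made for the example.

```javascript
// Sketch of the documented ordering; not the tool's actual implementation.
// A node is a plain object: { tag: string, children: node[], text?: string }.
function applyTagFilters(root, includeTags, removeTags) {
  const include = new Set(includeTags);
  // Priority rule: a tag listed in both places stays included.
  const remove = new Set(removeTags.filter((t) => !include.has(t)));

  // Step 1 (include-tags): keep the outermost subtrees whose tag is included.
  const collect = (node) =>
    include.has(node.tag) ? [node] : node.children.flatMap(collect);
  const kept = include.size > 0 ? collect(root) : [root];

  // Step 2 (remove-tags): prune removed tags from inside the kept subtrees.
  const prune = (node) => ({
    ...node,
    children: node.children.filter((c) => !remove.has(c.tag)).map(prune),
  });
  return kept.map(prune);
}

// Concatenate the text that survives filtering.
function textOf(node) {
  return [node.text || "", ...node.children.map(textOf)].join("");
}

const page = {
  tag: "body",
  children: [
    { tag: "nav", children: [], text: "Site menu" },
    {
      tag: "article",
      children: [
        { tag: "p", children: [], text: "Story." },
        { tag: "aside", children: [], text: "Ad." },
      ],
    },
  ],
};

console.log(applyTagFilters(page, ["article"], ["aside", "nav"]).map(textOf));
// prints [ 'Story.' ]
```

The top-level nav is dropped because it sits outside the included article, and the aside is dropped by the second step.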

Content Filtering

# Smart content cleaning - removes nav, footer, aside, script, style, header, noscript, canvas
url-to-md https://blog.example.com --clean-content -o clean-blog.md

# Remove all images
url-to-md https://blog.example.com --no-images -o clean-blog.md

# Remove webpage links (keep text, remove hyperlinks)
url-to-md https://article.com --no-links -o text-only.md

# Remove specific image types
url-to-md https://site.com --no-gif-images --no-svg-images

# Remove specific HTML tags manually
url-to-md https://site.com --remove-tags nav footer aside script

Viewport Configuration

# Use mobile viewport for responsive sites (375x667 - iPhone)
url-to-md https://example.com --mobile -o mobile-view.md

# Use tablet viewport (768x1024 - iPad portrait)
url-to-md https://example.com --tablet -o tablet-view.md

# Use desktop viewport (1920x1080 - standard desktop)
url-to-md https://example.com --desktop -o desktop-view.md

# Custom viewport dimensions
url-to-md https://example.com --viewport-width 1200 --viewport-height 800 -o custom-view.md
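
As a rough sketch, the viewport flags can be thought of as resolving to a single width and height. The preset sizes and the allowed custom ranges below come from the option descriptions in this README. The precedence (a preset flag winning over custom dimensions) and the names `resolveViewport` and `checkRange` are assumptions for illustration, not confirmed behavior of the tool.

```javascript
// Preset sizes as documented for --mobile, --tablet and --desktop.
const PRESETS = {
  mobile: { width: 375, height: 667 },
  tablet: { width: 768, height: 1024 },
  desktop: { width: 1920, height: 1080 },
};

// Ranges documented for --viewport-width and --viewport-height.
const LIMITS = { width: [320, 1920], height: [568, 1080] };

function checkRange(name, value) {
  const [min, max] = LIMITS[name];
  if (!Number.isInteger(value) || value < min || value > max) {
    throw new RangeError(`${name} must be an integer from ${min} to ${max}, got ${value}`);
  }
  return value;
}

function resolveViewport(opts = {}) {
  // Assumption: a preset flag takes priority over custom dimensions.
  for (const name of Object.keys(PRESETS)) {
    if (opts[name]) return { ...PRESETS[name] };
  }
  // Fall back to the documented defaults (375 and 667) when nothing is given.
  return {
    width: checkRange("width", opts.viewportWidth ?? 375),
    height: checkRange("height", opts.viewportHeight ?? 667),
  };
}

console.log(resolveViewport({ viewportWidth: 1200, viewportHeight: 800 }));
// prints { width: 1200, height: 800 }
```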

Advanced Options

# Debug with visible browser to see content loading
url-to-md https://dynamic-site.com --show-browser --wait 5.0

# Maximum cleanup with targeted extraction
url-to-md https://article.com \
  --include-tags article main \
  --clean-content \
  --no-images \
  --no-links \
  --wait 3.0 \
  -o clean-article.md

# Extract article content with mobile viewport (great for responsive sites)
url-to-md https://news-site.com \
  --include-tags article \
  --mobile \
  --clean-content \
  -o mobile-article.md

# Disable web security for difficult sites (use with caution)
url-to-md https://cors-protected-site.com --disable-web-security

Batch Processing & Automation

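# Batch processing with error handling (urls.txt holds one URL per line)
mkdir -p output
while IFS= read -r url; do
  [ -n "$url" ] || continue  # skip blank lines
  echo "Processing: $url"
  url-to-md "$url" -o "output/$(basename "$url").md" || echo "Failed: $url"
done < urls.txt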
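
Naming output files with basename can collide when two sites share a final path segment, and it breaks on query strings. The same batch job could be written as a small Node script that derives the file name from the host and path instead. This is a sketch under assumptions: `outputNameFor` and `convertAll` are hypothetical helper names, and the script relies only on the documented `-o` flag and on `url-to-md` being on PATH.

```javascript
const { execFileSync } = require("node:child_process");
const fs = require("node:fs");
const path = require("node:path");

// Build a filesystem-safe name from host + path; the query string is ignored.
function outputNameFor(rawUrl) {
  const { hostname, pathname } = new URL(rawUrl);
  const slug = `${hostname}${pathname}`
    .replace(/\/+$/, "") // drop trailing slashes
    .replace(/[^a-zA-Z0-9.-]+/g, "_"); // replace anything else unsafe with "_"
  return `${slug}.md`;
}

// Convert every URL listed in listFile (one per line); return the failures.
function convertAll(listFile, outDir) {
  fs.mkdirSync(outDir, { recursive: true });
  const urls = fs
    .readFileSync(listFile, "utf8")
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
  const failed = [];
  for (const url of urls) {
    try {
      const target = path.join(outDir, outputNameFor(url));
      execFileSync("url-to-md", [url, "-o", target], { stdio: "inherit" });
    } catch (err) {
      failed.push(url); // invalid URL or non-zero exit from the CLI
    }
  }
  return failed;
}

console.log(outputNameFor("https://blog.example.com/posts/hello-world?utm=1"));
// prints blog.example.com_posts_hello-world.md
```

Calling `convertAll("urls.txt", "output")` would mirror the shell loop above and return the list of URLs that failed.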

Full Command Line Options

Usage: url-to-md [options] <url>

Fetch URL content and output LLM-friendly markdown

Arguments:
  url                              URL to fetch

Options:
  -V, --version                    output the version number
  -o, --output <file>              Write output to file instead of stdout
  --no-links                       Remove webpage links from the output
  --no-images                      Remove images from the output
  --no-gif-images                  Remove GIF images from the output  
  --no-svg-images                  Remove SVG images from the output
  --clean-content                  Remove common non-content tags (nav, footer, aside, script, style, header, noscript, canvas)
  --include-tags <tags...>         Include only specific HTML tags and their content (e.g., --include-tags article main section)
  --remove-tags <tags...>          Remove specific HTML tags (e.g., --remove-tags div span)
  --wait <seconds>                 Seconds to wait for page to load (default: 1.5)
  --show-browser                   Show browser window (visible mode)
  --mobile                         Use mobile viewport (375x667 - iPhone)
  --tablet                         Use tablet viewport (768x1024 - iPad portrait)
  --desktop                        Use desktop viewport (1920x1080 - standard desktop)
  --viewport-width <width>         Set viewport width in pixels (320-1920, default: 375)
  --viewport-height <height>       Set viewport height in pixels (568-1080, default: 667)
  --disable-web-security           Disable web security (CORS) - use with caution for difficult sites
  -h, --help                       display help for command
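
For calling the CLI from a Node script, the options above map naturally onto an argument array. The sketch below uses only flags from the listing. The option-object keys and the names `buildArgs` and `toMarkdown` are hypothetical, chosen for this example. The URL is placed first so that a multi-value option such as --include-tags cannot absorb it as one more tag.

```javascript
const { execFileSync } = require("node:child_process");

const BOOLEAN_FLAGS = {
  noLinks: "--no-links",
  noImages: "--no-images",
  noGifImages: "--no-gif-images",
  noSvgImages: "--no-svg-images",
  cleanContent: "--clean-content",
  showBrowser: "--show-browser",
  mobile: "--mobile",
  tablet: "--tablet",
  desktop: "--desktop",
  disableWebSecurity: "--disable-web-security",
};
const VALUE_FLAGS = {
  output: "--output",
  wait: "--wait",
  viewportWidth: "--viewport-width",
  viewportHeight: "--viewport-height",
};
const LIST_FLAGS = { includeTags: "--include-tags", removeTags: "--remove-tags" };

// Translate an options object into an argv array for url-to-md.
function buildArgs(url, opts = {}) {
  const args = [url];
  for (const [key, flag] of Object.entries(BOOLEAN_FLAGS)) {
    if (opts[key]) args.push(flag);
  }
  for (const [key, flag] of Object.entries(VALUE_FLAGS)) {
    if (opts[key] !== undefined) args.push(flag, String(opts[key]));
  }
  for (const [key, flag] of Object.entries(LIST_FLAGS)) {
    if (opts[key]?.length) args.push(flag, ...opts[key]);
  }
  return args;
}

// Without --output the CLI writes markdown to stdout, which is returned here.
function toMarkdown(url, opts = {}) {
  return execFileSync("url-to-md", buildArgs(url, opts), {
    encoding: "utf8",
    maxBuffer: 16 * 1024 * 1024, // allow large pages
  });
}

console.log(buildArgs("https://article.com", { includeTags: ["article", "main"], wait: 3 }));
// prints [ 'https://article.com', '--wait', '3', '--include-tags', 'article', 'main' ]
```

Passing arguments as an array to `execFileSync` avoids shell quoting problems with URLs that contain characters such as `&` or `?`.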

Troubleshooting

Content Not Fully Extracted?

Many modern websites use JavaScript to load content dynamically after the initial page loads. If you're only seeing partial content, loading spinners, or placeholder text, the website likely renders its main content using JavaScript.

Solutions:

Increase wait time for slow-loading content:

# Wait 5 seconds for JavaScript content to load
url-to-md https://spa-app.com --wait 5.0

Debug with visible browser to see what's happening:

# Watch the page load in a visible browser window
url-to-md https://dynamic-site.com --show-browser --wait 5.0
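
One possible way to pick a --wait value, offered as a heuristic rather than anything the tool does itself, is to try a few increasing waits and stop once the output stops growing. In the sketch below, `firstStable` and `probeWaits` are hypothetical helpers, the 2% tolerance is an arbitrary assumption, and the candidate waits start at the documented default of 1.5 seconds.

```javascript
const { execFileSync } = require("node:child_process");

// Index of the first measurement that the next, longer wait barely improves on.
// Falls back to the last index when the output is still growing.
function firstStable(lengths, tolerance = 0.02) {
  for (let i = 0; i + 1 < lengths.length; i++) {
    const growth = (lengths[i + 1] - lengths[i]) / Math.max(lengths[i], 1);
    if (growth <= tolerance) return i;
  }
  return lengths.length - 1;
}

// Run the CLI once per candidate wait and compare the markdown lengths.
function probeWaits(url, waits = [1.5, 3, 5, 10]) {
  const lengths = waits.map(
    (wait) =>
      execFileSync("url-to-md", [url, "--wait", String(wait)], {
        encoding: "utf8",
        maxBuffer: 16 * 1024 * 1024,
      }).length
  );
  return waits[firstStable(lengths)];
}

// Example with made-up lengths: output jumps once, then levels off.
console.log(firstStable([1200, 5400, 5450, 5460]));
// prints 1
```

With the default candidates, an index of 1 would correspond to a wait of 3 seconds. Pages that load content only on scroll or interaction will not be helped by a longer wait alone.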

Common Issues

"Permission denied" errors

# On Linux/macOS, make sure the installed binary is executable
# (the location depends on your npm prefix, e.g. /usr/local/bin)
chmod +x "$(npm prefix -g)/bin/url-to-md"

# Or reinstall globally
npm uninstall -g url-to-markdown-cli-tool
npm install -g url-to-markdown-cli-tool

Network timeout errors

# Increase wait time for slow networks
url-to-md https://example.com --wait 10.0

Use Cases

🤖 AI & Machine Learning

Training Data: Extract clean text from web articles for LLM training
RAG Systems: Convert documentation and articles for vector databases
Content Curation: Batch process URLs for AI content pipelines
Structured Data: Extract tables and structured content in markdown format

📚 Documentation & Research

Knowledge Base: Convert external docs to consistent markdown format
Research: Extract academic papers and articles for analysis
Archival: Preserve web content in clean, readable format

🔄 Content Migration

CMS Migration: Extract content from old websites
Documentation Sites: Convert existing content to markdown-based docs
Static Site Generation: Process dynamic content for static sites

Perfect for: RAG systems, LLM training data preparation, documentation extraction, table data extraction, and any workflow that needs clean, structured text from web content.

Acknowledgments

Originally inspired by m92vyas/llm-reader, a Python library
Built with Puppeteer for reliable browser automation
Uses Cheerio for server-side HTML manipulation
Powered by Turndown for HTML to Markdown conversion

mmdclx/url-to-markdown-cli-tool
