Introduction to Web Crawling
Web crawling is the process by which automated programs, known as web crawlers or spiders, systematically browse the internet to discover and index content. It is closely related to (and often conflated with) web scraping, which focuses on extracting specific data from the pages a crawler visits.
How Web Crawling Works:
1. Starting Point: The crawler begins with a list of seed URLs (initial websites to visit).
2. Fetching Content: It downloads the webpage's content, including text, images, and
links.
3. Following Links: The crawler identifies and follows hyperlinks on the page to
discover new pages.
4. Indexing: The extracted data is stored in a database, making it searchable for search
engines like Google.
5. Revisiting Pages: Crawlers revisit pages periodically to update content changes.
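To make this loop concrete, here is a minimal sketch using only Python's standard library. The seed URL, the page limit, and the regex-based link extraction are simplifications for illustration; a production crawler would also respect robots.txt, throttle its requests, and schedule revisits.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def simple_crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # Step 1: URLs waiting to be fetched
    visited = set()               # URLs already fetched ("indexed")
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            # Step 2: download the page content
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception as exc:
            print(f"Skipping {url}: {exc}")
            continue
        visited.add(url)          # Step 4 would store the content; here we only record the URL
        # Step 3: extract hyperlinks and queue them for later visits
        for href in re.findall(r'href="([^"#]+)"', html):
            frontier.append(urljoin(url, href))
    return visited

print(simple_crawl(["https://example.com"]))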
Why is Web Crawling Important?
Search Engines (Google, Bing, etc.): Crawlers help index websites, making them
searchable.
SEO Optimization: Websites must be crawlable for good search rankings.
Data Scraping: Companies use crawlers to collect market research, news, or
competitor data.
Website Monitoring: Businesses monitor competitors or changes in specific web
pages.
Building a Simple Web Crawler in Python
You can build a simple web crawler using Python with the requests and BeautifulSoup
libraries.
Step 1: Install Dependencies
pip install requests beautifulsoup4
Step 2: Create a Simple Web Crawler
This script fetches and extracts all links from a webpage.
import requests
from bs4 import BeautifulSoup

def crawl(url):
    headers = {"User-Agent": "Mozilla/5.0"}  # Helps avoid getting blocked
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        print(f"Title: {soup.title.string}\n")  # Print page title

        print("Links found:")
        for link in soup.find_all("a", href=True):  # Find all links
            print(link["href"])
    else:
        print(f"Failed to retrieve {url}, Status Code: {response.status_code}")

# Example usage
crawl("https://example.com")
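The script above only lists the links on a single page. As a possible next step (not part of the original script), the sketch below follows the discovered links recursively while staying on one domain and stopping at a small depth, so the crawl cannot run away. The depth limit and user-agent string are arbitrary choices.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl_site(url, max_depth=2, visited=None):
    visited = visited if visited is not None else set()
    if max_depth < 0 or url in visited:
        return visited
    visited.add(url)

    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as exc:
        print(f"Failed to retrieve {url}: {exc}")
        return visited
    if response.status_code != 200:
        return visited

    soup = BeautifulSoup(response.text, "html.parser")
    print(f"Crawled: {url} (title: {soup.title.string if soup.title else 'n/a'})")

    # Only follow links that point back to the same domain
    domain = urlparse(url).netloc
    for link in soup.find_all("a", href=True):
        absolute = urljoin(url, link["href"])   # Normalize relative URLs
        if urlparse(absolute).netloc == domain:
            crawl_site(absolute, max_depth - 1, visited)
    return visited

# Example usage
crawl_site("https://example.com")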
Building an Advanced Web Crawler with Scrapy
For large-scale web scraping, Scrapy is a more powerful framework: it handles request scheduling, concurrency, retries, and data export out of the box.
Step 1: Install Scrapy
pip install scrapy
Step 2: Create a Scrapy Project
Run the following command in your terminal:
scrapy startproject mycrawler
cd mycrawler
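If the command succeeds, Scrapy generates a project skeleton roughly like the one below (the exact files can vary slightly between Scrapy versions):

mycrawler/
    scrapy.cfg            # deploy/configuration file
    mycrawler/            # the project's Python package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # your spiders live here
            __init__.py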
Step 3: Create a Spider
Inside the spiders folder, create a new Python file (my_spider.py) and add:
import scrapy

class MySpider(scrapy.Spider):
    name = "mycrawler"
    start_urls = ["https://example.com"]  # Replace with your target website

    def parse(self, response):
        # Extract page title
        title = response.xpath("//title/text()").get()
        print(f"Title: {title}")

        # Extract all links
        for link in response.xpath("//a/@href").getall():
            yield {"link": response.urljoin(link)}  # Normalize relative URLs

        # Follow links and crawl deeper
        for link in response.xpath("//a/@href").getall():
            yield response.follow(link, callback=self.parse)
Step 4: Run the Spider
Execute the spider from the terminal:
scrapy crawl mycrawler -o output.json
This will save the items yielded by the spider (here, the discovered links) into output.json.
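Because the spider follows every link it finds, an unrestricted run can grow very large. One way to bound it, assuming the default Scrapy extensions are enabled, is to override settings on the command line, for example:

scrapy crawl mycrawler -o output.json -s CLOSESPIDER_PAGECOUNT=50 -s DOWNLOAD_DELAY=1

Here CLOSESPIDER_PAGECOUNT stops the crawl after roughly 50 responses and DOWNLOAD_DELAY waits one second between requests to the same site.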
Enhancements for a More Powerful Crawler:
Follow links with restrictions (only crawl certain domains).
Store data in a database (MongoDB, SQLite, or PostgreSQL).
Use middlewares to rotate user agents and avoid getting blocked.
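As a minimal sketch of the first and third enhancements, the snippet below restricts crawling to one domain and rotates user agents; the class name, user-agent strings, and middleware priority are illustrative choices, not part of the original project.

import random
import scrapy

# In spiders/my_spider.py: restrict which domains the spider may follow.
class MySpider(scrapy.Spider):
    name = "mycrawler"
    allowed_domains = ["example.com"]     # off-site links are ignored
    start_urls = ["https://example.com"]

    def parse(self, response):
        for link in response.xpath("//a/@href").getall():
            yield response.follow(link, callback=self.parse)

# In mycrawler/middlewares.py: set a random user agent on every request.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

# In mycrawler/settings.py: enable the middleware.
# DOWNLOADER_MIDDLEWARES = {
#     "mycrawler.middlewares.RotateUserAgentMiddleware": 543,
# }

Storing data in a database is typically handled by an item pipeline, configured in the same settings file via ITEM_PIPELINES.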
With these steps, you can build a functional web crawler for scraping and indexing websites
efficiently!