Part 1: Foundations and Ethics
Task 1.1: Ethics and Legal Research
Write a 500-word report addressing:
• What is web scraping and when is it appropriate?
• Explain robots.txt files and how to check them (see the sketch after this list)
• Discuss the legal and ethical considerations
• Provide 3 real-world examples of responsible web scraping
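For the robots.txt bullet, it can help to show how a crawler would check permissions programmatically. A minimal sketch using Python's standard-library urllib.robotparser follows; the site URL, path, and user-agent string are placeholders for whatever target your report discusses.
```python
# Minimal robots.txt check using only the Python standard library.
from urllib.robotparser import RobotFileParser

# Example target -- substitute the site you actually intend to discuss or scrape.
site = "https://books.toscrape.com"
parser = RobotFileParser()
parser.set_url(f"{site}/robots.txt")
parser.read()  # downloads and parses the robots.txt file

# can_fetch() answers: may this user agent request this path?
user_agent = "MyCourseScraper/1.0"
print(parser.can_fetch(user_agent, f"{site}/catalogue/page-1.html"))
```
Note that RobotFileParser only reports what the file says; if a site publishes no robots.txt at all, can_fetch() generally treats paths as allowed, so the ethical judgement still rests with you.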
Task 1.2: Basic HTML Understanding (15 points)
Create a simple HTML page with:
• A table containing at least 10 rows of sample data (books, movies, products, etc.)
• Use proper HTML tags: <table>, <tr>, <td>, <th>
• Include attributes like class and id
• Add some basic CSS styling
• Practice using browser developer tools to inspect elements
Deliverable: HTML file and screenshot of developer tools inspection
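If you want a quick starting point for the table itself, pandas can emit valid <table>, <tr>, <th>, and <td> markup with a class attribute, which you can then hand-edit, style further, and inspect in the browser's developer tools. A rough sketch, in which the column names, file name, and styling are all placeholders:
```python
# Generate a starter HTML table with pandas, then hand-edit it for the task.
import pandas as pd

# Placeholder sample data -- replace with your own 10+ rows.
books = pd.DataFrame({
    "title": [f"Sample Book {i}" for i in range(1, 11)],
    "author": [f"Author {i}" for i in range(1, 11)],
    "year": [2010 + i for i in range(10)],
    "price": [round(9.99 + i, 2) for i in range(10)],
})

# to_html() produces <table>, <tr>, <th>, and <td> tags with a class attribute.
table_html = books.to_html(classes="book-table", index=False, border=0)

page = f"""<!DOCTYPE html>
<html>
<head>
  <style>.book-table td, .book-table th {{ padding: 4px 8px; }}</style>
</head>
<body id="catalog">{table_html}</body>
</html>"""

with open("books.html", "w", encoding="utf-8") as f:
    f.write(page)
```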
Part 2: Basic Scraping Techniques
Task 2.1: Static Page Scraping
Using Python and BeautifulSoup, scrape the HTML page you created in Task 1.2:
```python
# Required libraries: requests, beautifulsoup4, pandas
# Your code should:
# 1. Load the HTML file
# 2. Parse it with BeautifulSoup
# 3. Extract all table data
# 4. Save to CSV format
```
Requirements (a minimal example follows this list):
• Proper error handling
• Clean, commented code
• Output data to CSV file
• Print summary statistics (number of rows extracted)
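One way the pieces could fit together is sketched below; it assumes the file from Task 1.2 is saved as books.html and contains a single table, so adjust the names to your own setup and treat it as a starting point rather than a reference solution.
```python
# Parse a local HTML file, extract its table, and save the data as CSV.
import pandas as pd
from bs4 import BeautifulSoup

try:
    with open("books.html", encoding="utf-8") as f:  # file name is an assumption
        soup = BeautifulSoup(f.read(), "html.parser")
except FileNotFoundError:
    raise SystemExit("books.html not found -- create it in Task 1.2 first")

table = soup.find("table")
if table is None:
    raise SystemExit("No <table> element found in the page")

# <th> cells become the header; each remaining row's <td> cells become data.
headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers if headers else None)
df.to_csv("table_data.csv", index=False)
print(f"Extracted {len(df)} rows and {len(df.columns)} columns")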
Task 2.2: Public API Integration (15 points)
Choose one of these free APIs and create a data collection script:
• JSONPlaceholder (fake data for testing)
• OpenWeatherMap (weather data)
• REST Countries (country information)
• Cat Facts API
Requirements (an example script follows this list):
• Make at least 10 API calls
• Handle API rate limits appropriately
• Save data in both JSON and CSV formats
• Include error handling for failed requests
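As an illustration, here is a rough sketch against JSONPlaceholder; the /posts endpoint, the one-second sleep, and the output file names are assumptions, and a documented rate limit or API key should replace the crude throttle where the API you choose requires one.
```python
# Collect a handful of records from JSONPlaceholder and save them as JSON and CSV.
import json
import time

import pandas as pd
import requests

BASE_URL = "https://jsonplaceholder.typicode.com/posts"  # assumed endpoint
records = []

for post_id in range(1, 11):  # at least 10 API calls
    try:
        resp = requests.get(f"{BASE_URL}/{post_id}", timeout=10)
        resp.raise_for_status()
        records.append(resp.json())
    except requests.RequestException as exc:
        print(f"Request for post {post_id} failed: {exc}")
    time.sleep(1)  # crude throttle; real APIs document their own rate limits

with open("posts.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

pd.DataFrame(records).to_csv("posts.csv", index=False)
print(f"Saved {len(records)} records")
```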
Part 3: Intermediate Scraping
Task 3.1: Real Website Scraping
Choose ONE of these beginner-friendly websites:
• Books.toscrape.com (practice scraping site)
• Quotes.toscrape.com (quotes collection)
• Scrape.center (designed for learning)
Scraping Requirements (a request-loop sketch follows this list):
• Extract at least 50 items
• Collect minimum 4 attributes per item
• Implement respectful delays (1-2 seconds between requests)
• Handle pagination if applicable
• Check and respect robots.txt
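A sketch of the request loop for Books.toscrape.com appears below. The URL pattern and CSS selectors are assumptions based on that site's usual markup, so confirm them in the developer tools before relying on them; the robots.txt check can reuse the snippet from Task 1.1.
```python
# Paginate through Books.toscrape.com with polite delays; collect 4 attributes per book.
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE = "https://books.toscrape.com/catalogue/page-{}.html"  # assumed URL pattern
items = []
page = 1

while len(items) < 50:
    resp = requests.get(BASE.format(page), timeout=10)
    if resp.status_code != 200:  # past the last page, or a transient failure
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    for card in soup.select("article.product_pod"):  # selector is an assumption
        items.append({
            "title": card.h3.a.get("title"),
            "price": card.select_one("p.price_color").get_text(strip=True),
            "rating": card.select_one("p.star-rating")["class"][-1],
            "availability": card.select_one("p.instock.availability").get_text(strip=True),
        })
    page += 1
    time.sleep(1.5)  # respectful delay between requests

pd.DataFrame(items).to_csv("books_raw.csv", index=False)
print(f"Collected {len(items)} items across {page - 1} pages")
```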
Data Processing (a cleaning sketch follows this list):
• Clean and validate the extracted data
• Handle missing values appropriately
• Create basic visualizations using matplotlib or seaborn
• Generate a summary report of your findings
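For the processing step, a minimal cleaning-and-plotting sketch is shown below; it assumes the books_raw.csv file and column names produced by the collection sketch above, and that prices carry a leading £ sign.
```python
# Clean the scraped data and produce a simple distribution plot plus a summary.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("books_raw.csv")  # output of the collection sketch above

# Strip the currency symbol and convert price to numeric; invalid values
# become NaN so missing data can be handled explicitly.
df["price"] = pd.to_numeric(df["price"].str.replace("£", "", regex=False),
                            errors="coerce")
df = df.dropna(subset=["price"]).drop_duplicates(subset=["title"])

df["price"].plot(kind="hist", bins=20, title="Price distribution")
plt.xlabel("Price (£)")
plt.tight_layout()
plt.savefig("price_distribution.png")

print(df.describe(include="all"))
print(f"{len(df)} rows remain after cleaning")
```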
Task 3.2: Advanced Challenges
Implement TWO of the following features (a sketch combining two of them follows the list):
• User-Agent rotation: Use different user agents for requests
• Session handling: Maintain cookies across requests
• Data validation: Implement schema validation for scraped data
• Duplicate detection: Identify and handle duplicate entries
• Concurrent scraping: Use threading for faster collection (with care)
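As one possible combination, the sketch below pairs User-Agent rotation with duplicate detection; the user-agent strings are placeholders and the duplicate key (a lowercased title) is an assumption you would adapt to your own data.
```python
# User-Agent rotation plus duplicate detection on top of a simple request helper.
import random

import requests

# Placeholder user-agent strings -- rotate whatever set fits your project.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) CourseScraper/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) CourseScraper/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) CourseScraper/1.0",
]

seen_titles = set()   # duplicate detection: keys of items already stored
results = []

def fetch(url):
    """Request a page with a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

def add_item(item):
    """Store an item only if its title has not been seen before."""
    key = item["title"].strip().lower()
    if key in seen_titles:
        return False
    seen_titles.add(key)
    results.append(item)
    return True

# Example usage inside a scraping loop (placeholder values):
#   resp = fetch("https://books.toscrape.com/catalogue/page-1.html")
#   add_item({"title": "Some Title", "price": "£10.00"})
```
If you choose session handling instead, the usual approach is to replace the bare requests.get call with a requests.Session() object so cookies persist across requests.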