Python scripts for generating and categorizing web browsing tasks for benchmark datasets. Supports multiple LLM providers: OpenAI (GPT-5), Anthropic (Claude), and Google (Gemini).
# 1. Install dependencies
pip install -r requirements.txt
# 2. Set up API key (create .env file)
echo "OPENAI_API_KEY=your_key_here" > .env
# 3. Generate tasks from example websites
python task_generation.py --input example_websites.csv --output tasks.csv --model gpt-5 --tasks-per-site 5
# 4. Categorize the generated tasks
python task_categorization.py --subcategory=true --login=true --tags=true --input tasks.csv --model gpt-5
- Task Generation: Generate unique, time-independent web browsing tasks for any website
- Auto-Detection: LLM automatically analyzes websites and generates appropriate tasks
- Task Categorization: Add subcategories, login requirements, and hierarchical tags
- Multi-Provider Support: Use OpenAI, Anthropic, or Gemini models
- Resume Capability: Scripts automatically resume if interrupted
git clone https://github.com/anaishowland/dataset_creation.git
cd dataset_creation
pip install -r requirements.txt
Create a .env file with at least one API key:
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
GEMINI_API_KEY=your_key_here
Create a CSV with website domains:
Domain
amazon.com
github.com
wikipedia.org
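If the domain list comes from another program, the input file can be written with Python's standard `csv` module. This is a minimal sketch: the helper name `write_domains_csv` is hypothetical and not part of this repository, and it only assumes the single `Domain` header shown above.

```python
import csv
import io


def write_domains_csv(domains, handle):
    """Write one domain per row under a 'Domain' header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Domain"])
    for domain in domains:
        # Normalize whitespace and case so each domain appears in one form.
        writer.writerow([domain.strip().lower()])


# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
write_domains_csv(["amazon.com", "GitHub.com ", "wikipedia.org"], buffer)
print(buffer.getvalue())
```

To produce a real file, pass `open("websites.csv", "w", newline="")` as the handle instead of the buffer.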
Generate tasks:
python task_generation.py --input websites.csv --output tasks.csv --model gpt-5
The LLM will automatically analyze each website and generate appropriate tasks. No need to specify categories!
Options:
- --model: Choose gpt-5, claude-sonnet-4-5-20250929, or gemini-2.5-pro
- --tasks-per-site: Number of tasks per website (default: 15)
- --categorize: Automatically categorize after generation
Add metadata to your generated tasks:
python task_categorization.py \
--subcategory=true \
--login=true \
--tags=true \
--input tasks.csv \
--model gpt-5
This adds:
- subCategory: One of 12 predefined categories
- login: Whether login is required (yes/no)
- tags: JSON array of hierarchical tags
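Downstream code will usually want the new columns in native Python types. The sketch below reads a categorized file and decodes `login` and `tags`. The column names `subCategory`, `login`, and `tags` come from the list above. The `task` column, the sample row, and the flat list of tag strings are assumptions made for illustration, so check them against your own output.

```python
import csv
import io
import json

# Hypothetical sample of a categorized tasks.csv (the row content is made up).
SAMPLE = '''task,subCategory,login,tags
Find the top-rated laptop,Shopping & E-commerce,no,"[""Search"", ""Comparison""]"
'''


def load_tasks(handle):
    """Read categorized rows, decoding login to bool and tags to a list."""
    rows = []
    for row in csv.DictReader(handle):
        row["login"] = row["login"].strip().lower() == "yes"
        row["tags"] = json.loads(row["tags"])  # JSON array stored as text
        rows.append(row)
    return rows


tasks = load_tasks(io.StringIO(SAMPLE))
print(tasks[0]["subCategory"], tasks[0]["login"], tasks[0]["tags"])
```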
Generate and categorize in one command:
python task_generation.py --input websites.csv --output tasks.csv --model gpt-5 --categorize
Try different models:
# GPT-5
python task_generation.py --input sites.csv --output tasks.csv --model gpt-5
# Claude
python task_generation.py --input sites.csv --output tasks.csv --model claude-sonnet-4-5-20250929
# Gemini
python task_generation.py --input sites.csv --output tasks.csv --model gemini-2.5-pro
Resume if interrupted:
# Just run the same command again - it will skip already processed items
python task_generation.py --input websites.csv --output tasks.csv --model gpt-5
The categorization script uses 12 predefined categories:
- Search & Research - Information lookup and fact-finding
- Social Media & Communication - Social platforms and messaging
- Shopping & E-commerce - Product discovery and purchasing
- News & Media - Current events and articles
- Entertainment & Leisure - Streaming, gaming, entertainment
- Navigation & Maps - Routes and locations
- Utilities & Tools - Web utilities and productivity tools
- Finance & Banking - Account management and market data
- Education & Learning - Courses and tutorials
- Health & Wellness - Medical information and fitness
- Development & Tech Services - Code repositories and APIs
- Other / Miscellaneous - Everything else
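Because the set of categories is fixed, it is cheap to sanity-check a categorized file against it. This sketch assumes the `subCategory` column holds exactly the names listed above; the scripts may spell them differently, and `invalid_subcategories` is a hypothetical helper, not part of the repository.

```python
# The 12 category names as listed in this README (assumed to match the CSV values).
SUBCATEGORIES = frozenset({
    "Search & Research",
    "Social Media & Communication",
    "Shopping & E-commerce",
    "News & Media",
    "Entertainment & Leisure",
    "Navigation & Maps",
    "Utilities & Tools",
    "Finance & Banking",
    "Education & Learning",
    "Health & Wellness",
    "Development & Tech Services",
    "Other / Miscellaneous",
})


def invalid_subcategories(values):
    """Return the values that are not one of the 12 names, in input order."""
    return [value for value in values if value.strip() not in SUBCATEGORIES]


print(invalid_subcategories(["News & Media", "Sports", "Finance & Banking"]))
```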
Tags are organized hierarchically (see tags_taxonomy.csv):
Information Retrieval & Analysis
├── Search, Research, Extraction, Comparison
Navigation & Workflow Control
├── Navigation, Filtering, Sorting, Pagination
Transactional Operations
├── Shopping, Checkout, Account Management, Booking
Content Interaction
├── Reading, Watching, Listening, Commenting
... and more
You can customize tags_taxonomy.csv for your specific needs.
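As an illustration of the hierarchy, the four groups shown above can be modeled as a mapping from parent to child tags. This is only a sketch: the real taxonomy lives in tags_taxonomy.csv, whose column layout is not described here, the listing above is truncated, and `parent_of` and `group_by_parent` are hypothetical helpers.

```python
# Partial taxonomy copied from the listing above; the full file has more groups.
TAXONOMY = {
    "Information Retrieval & Analysis": ["Search", "Research", "Extraction", "Comparison"],
    "Navigation & Workflow Control": ["Navigation", "Filtering", "Sorting", "Pagination"],
    "Transactional Operations": ["Shopping", "Checkout", "Account Management", "Booking"],
    "Content Interaction": ["Reading", "Watching", "Listening", "Commenting"],
}


def parent_of(tag):
    """Return the parent group of a tag, or None if the tag is unknown."""
    for parent, children in TAXONOMY.items():
        if tag in children:
            return parent
    return None


def group_by_parent(tags):
    """Group a flat list of tags under their parent groups."""
    grouped = {}
    for tag in tags:
        grouped.setdefault(parent_of(tag), []).append(tag)
    return grouped


print(group_by_parent(["Search", "Sorting", "Comparison"]))
```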
- Start small: Test with 2-3 websites before processing hundreds
- Monitor costs: Each API call uses tokens
- Use resume: Scripts save progress every 10 items
- Custom columns: Scripts auto-detect column names, or specify with --domain-col, --task-col, etc.
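The resume behavior mentioned in these tips can be pictured as a loop that skips finished items and checkpoints periodically. The sketch below is not the scripts' actual implementation. It is a generic version of the pattern, and `process_with_resume`, `handle_item`, and `save` are hypothetical names.

```python
def process_with_resume(items, done, handle_item, save, checkpoint_every=10):
    """Process items not in `done`, calling `save` every `checkpoint_every` new results."""
    results = []
    since_save = 0
    for item in items:
        if item in done:
            continue  # already processed in an earlier run
        results.append(handle_item(item))
        done.add(item)
        since_save += 1
        if since_save == checkpoint_every:
            save(results)
            since_save = 0
    if since_save:
        save(results)  # flush the final partial batch
    return results


# Demo: 25 sites, 2 of them finished in a previous run.
items = [f"site{i}.com" for i in range(25)]
done = {"site0.com", "site1.com"}
saves = []  # records how many results existed at each checkpoint
results = process_with_resume(items, done, str.upper, lambda r: saves.append(len(r)))
print(len(results), saves)
```

In a real run, `done` would be rebuilt from the existing output file and `save` would write the results back to disk.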
API Key Not Found
- Make sure a .env file exists with the correct key for your provider
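A quick way to diagnose this error is to check whether the .env contents include the key that matches the chosen model. The sketch below assumes that model names starting with `gpt`, `claude`, and `gemini` use `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `GEMINI_API_KEY` respectively. The helpers are hypothetical, and the parser handles only simple `KEY=value` lines.

```python
def parse_env(text):
    """Parse KEY=value lines, ignoring blanks, comments, and malformed lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Assumed mapping from model-name prefix to the environment variable it needs.
KEY_FOR_PREFIX = {
    "gpt": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def required_key(model):
    """Return the API key variable name for a model name."""
    for prefix, key in KEY_FOR_PREFIX.items():
        if model.startswith(prefix):
            return key
    raise ValueError(f"Unrecognized model name: {model}")


def has_key_for(model, env_text):
    """True if the .env text defines a non-empty key for the model's provider."""
    return bool(parse_env(env_text).get(required_key(model)))


env_text = "# local secrets\nOPENAI_API_KEY=sk-example\n"
print(has_key_for("gpt-5", env_text), has_key_for("gemini-2.5-pro", env_text))
```

For a file on disk, read it first with `open(".env").read()` and pass the text in.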
Import Errors
pip install -r requirements.txt
Column Not Found
# Specify column names if auto-detection fails
python task_generation.py --input data.csv --domain-col "website"
python task_categorization.py --task-col "description" --website-col "url"
MIT License - free to use and modify.
Anais Howland (@anaishowland)
@software{howland2025dataset,
author = {Howland, Anais},
title = {Dataset Creation Tools for Web Browsing Benchmarks},
year = {2025},
url = {https://github.com/anaishowland/dataset_creation}
}