Dataset Creation Tools

Python scripts for generating and categorizing web browsing tasks for benchmark datasets. Supports multiple LLM providers: OpenAI (GPT-5), Anthropic (Claude), and Google (Gemini).

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up API key (create .env file)
echo "OPENAI_API_KEY=your_key_here" > .env

# 3. Generate tasks from example websites
python task_generation.py --input example_websites.csv --output tasks.csv --model gpt-5 --tasks-per-site 5

# 4. Categorize the generated tasks
python task_categorization.py --subcategory=true --login=true --tags=true --input tasks.csv --model gpt-5

Features

Task Generation: Generate unique, time-independent web browsing tasks for any website
Auto-Detection: LLM automatically analyzes websites and generates appropriate tasks
Task Categorization: Add subcategories, login requirements, and hierarchical tags
Multi-Provider Support: Use OpenAI, Anthropic, or Gemini models
Resume Capability: Scripts automatically resume if interrupted

Installation

git clone https://github.com/anaishowland/dataset_creation.git
cd dataset_creation
pip install -r requirements.txt

Create a .env file with at least one API key:

OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
GEMINI_API_KEY=your_key_here
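Before running the scripts, it can help to confirm which providers actually have a key configured. The following is a minimal sketch, not the scripts' own loading logic: the helper names `read_env_file` and `configured_providers` are hypothetical, and it assumes the .env file uses plain KEY=value lines as shown above.

```python
import os
from pathlib import Path

# Key names come from the .env example above; the helper names are hypothetical.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def read_env_file(path=".env"):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_providers(path=".env"):
    """Return providers whose key is set in the .env file or the environment."""
    file_values = read_env_file(path)
    found = []
    for provider, key_name in PROVIDER_KEYS.items():
        value = file_values.get(key_name) or os.environ.get(key_name)
        # Treat the placeholder from the template as "not configured".
        if value and value != "your_key_here":
            found.append(provider)
    return found
```

Calling `configured_providers()` from the repository root returns a list such as the providers whose keys are filled in, which tells you which `--model` values are usable.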

Usage

Generate Tasks

Create a CSV with website domains:

Domain
amazon.com
github.com
wikipedia.org
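If the domain list comes from another program, the input file can be written with the standard csv module. This is a small sketch: the `Domain` header matches the example above, while the function name `write_domains_csv` and the choice to lowercase and de-duplicate entries are assumptions for illustration.

```python
import csv


def write_domains_csv(path, domains):
    """Write a one-column CSV with a 'Domain' header, skipping blanks and duplicates."""
    seen = set()
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Domain"])
        for domain in domains:
            cleaned = domain.strip().lower()
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                writer.writerow([cleaned])
    # Number of distinct domains written (header not counted).
    return len(seen)
```

For example, `write_domains_csv("websites.csv", ["amazon.com", "github.com", "wikipedia.org"])` produces the file shown above.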

Generate tasks:

python task_generation.py --input websites.csv --output tasks.csv --model gpt-5

The LLM will automatically analyze each website and generate appropriate tasks. No need to specify categories!

Options:

--model: Choose gpt-5, claude-sonnet-4-5-20250929, or gemini-2.5-pro
--tasks-per-site: Number of tasks per website (default: 15)
--categorize: Automatically categorize after generation
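Each --model value belongs to one provider, and that provider's API key must be present in .env. How the scripts route model names internally is not documented here; the sketch below is one plausible approach based on name prefixes, and `MODEL_PREFIXES` and `provider_for_model` are hypothetical names.

```python
# Hypothetical prefix table; the model names come from the --model option above.
MODEL_PREFIXES = {
    "gpt-": ("openai", "OPENAI_API_KEY"),
    "claude-": ("anthropic", "ANTHROPIC_API_KEY"),
    "gemini-": ("gemini", "GEMINI_API_KEY"),
}


def provider_for_model(model):
    """Return (provider, env_key_name) for a model name, or raise ValueError."""
    lowered = model.lower()
    for prefix, info in MODEL_PREFIXES.items():
        if lowered.startswith(prefix):
            return info
    raise ValueError(f"Unrecognized model name: {model!r}")
```

With this mapping, `provider_for_model("gemini-2.5-pro")` points at `GEMINI_API_KEY`, which is the key to check if a run fails to authenticate.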

Categorize Tasks

Add metadata to your generated tasks:

python task_categorization.py \
  --subcategory=true \
  --login=true \
  --tags=true \
  --input tasks.csv \
  --model gpt-5

This adds:

subCategory: One of 12 predefined categories
login: Whether login is required (yes/no)
tags: JSON array of hierarchical tags
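Downstream code usually wants these columns as typed values rather than raw strings. The sketch below reads a categorized CSV and decodes the `login` and `tags` columns. It assumes the column names listed above and that each tag in the JSON array is a plain string; `load_categorized`, `login_required`, and `tag_list` are hypothetical names.

```python
import csv
import json


def load_categorized(path):
    """Read a categorized tasks CSV and add parsed login/tags fields to each row."""
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # 'login' is documented as yes/no; compare case-insensitively.
            row["login_required"] = (row.get("login") or "").strip().lower() == "yes"
            # 'tags' is documented as a JSON array; treat an empty cell as [].
            raw_tags = (row.get("tags") or "").strip() or "[]"
            row["tag_list"] = json.loads(raw_tags)
            rows.append(row)
    return rows
```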

Examples

Generate and categorize in one command:

python task_generation.py --input websites.csv --output tasks.csv --model gpt-5 --categorize

Try different models:

# GPT-5
python task_generation.py --input sites.csv --output tasks.csv --model gpt-5

# Claude
python task_generation.py --input sites.csv --output tasks.csv --model claude-sonnet-4-5-20250929

# Gemini
python task_generation.py --input sites.csv --output tasks.csv --model gemini-2.5-pro

Resume if interrupted:

# Just run the same command again - it will skip already processed items
python task_generation.py --input websites.csv --output tasks.csv --model gpt-5
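The resume behaviour can be pictured as "skip every domain that already appears in the output file". The following sketch illustrates that idea only; it is not the scripts' actual implementation, and it assumes the output CSV carries a `Domain` column. The function names are hypothetical.

```python
import csv
from pathlib import Path


def processed_domains(output_path, domain_col="Domain"):
    """Collect domains that already have rows in the output CSV."""
    path = Path(output_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {row[domain_col] for row in csv.DictReader(handle) if row.get(domain_col)}


def pending_domains(all_domains, output_path, domain_col="Domain"):
    """Return the input domains that still need processing, in input order."""
    done = processed_domains(output_path, domain_col)
    return [domain for domain in all_domains if domain not in done]
```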

Task Categories

The categorization script uses 12 predefined categories:

Search & Research - Information lookup and fact-finding
Social Media & Communication - Social platforms and messaging
Shopping & E-commerce - Product discovery and purchasing
News & Media - Current events and articles
Entertainment & Leisure - Streaming, gaming, entertainment
Navigation & Maps - Routes and locations
Utilities & Tools - Web utilities and productivity tools
Finance & Banking - Account management and market data
Education & Learning - Courses and tutorials
Health & Wellness - Medical information and fitness
Development & Tech Services - Code repositories and APIs
Other / Miscellaneous - Everything else
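When post-processing results, it is useful to check that a subCategory value is one of the 12 labels above. This is a minimal sketch: the labels are copied from the list, while `CATEGORIES`, `normalize_category`, and the fallback to "Other / Miscellaneous" for unknown labels are assumptions, not documented script behaviour.

```python
# The 12 category labels, copied from the list above.
CATEGORIES = (
    "Search & Research",
    "Social Media & Communication",
    "Shopping & E-commerce",
    "News & Media",
    "Entertainment & Leisure",
    "Navigation & Maps",
    "Utilities & Tools",
    "Finance & Banking",
    "Education & Learning",
    "Health & Wellness",
    "Development & Tech Services",
    "Other / Miscellaneous",
)

_LOOKUP = {name.lower(): name for name in CATEGORIES}


def normalize_category(label):
    """Map a label to its canonical spelling; unknown labels fall back to 'Other / Miscellaneous'."""
    return _LOOKUP.get(label.strip().lower(), "Other / Miscellaneous")
```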

Tag Taxonomy

Tags are organized hierarchically (see tags_taxonomy.csv):

Information Retrieval & Analysis
  ├── Search, Research, Extraction, Comparison

Navigation & Workflow Control
  ├── Navigation, Filtering, Sorting, Pagination

Transactional Operations
  ├── Shopping, Checkout, Account Management, Booking

Content Interaction
  ├── Reading, Watching, Listening, Commenting

... and more

You can customize tags_taxonomy.csv for your specific needs.
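To work with the hierarchy in code, the parent groups can be modelled as a mapping from group name to child tags. The sketch below hard-codes only the four groups shown above rather than reading tags_taxonomy.csv, because that file's column layout is not described here; `TAXONOMY`, `parent_of`, and `group_by_parent` are hypothetical names.

```python
# Only the four groups listed above; the real taxonomy file contains more.
TAXONOMY = {
    "Information Retrieval & Analysis": ["Search", "Research", "Extraction", "Comparison"],
    "Navigation & Workflow Control": ["Navigation", "Filtering", "Sorting", "Pagination"],
    "Transactional Operations": ["Shopping", "Checkout", "Account Management", "Booking"],
    "Content Interaction": ["Reading", "Watching", "Listening", "Commenting"],
}


def parent_of(tag):
    """Return the parent group of a tag (case-insensitive), or None if unknown."""
    wanted = tag.strip().lower()
    for parent, children in TAXONOMY.items():
        if wanted in (child.lower() for child in children):
            return parent
    return None


def group_by_parent(tags):
    """Group a flat list of tags under their parent groups."""
    grouped = {}
    for tag in tags:
        grouped.setdefault(parent_of(tag), []).append(tag)
    return grouped
```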

Tips

Start small: Test with 2-3 websites before processing hundreds
Monitor costs: Each API call consumes billable tokens, so cost grows with the number of websites and tasks per site
Use resume: Scripts save progress every 10 items
Custom columns: Scripts auto-detect column names, or specify with --domain-col, --task-col, etc.
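Saving progress every 10 items amounts to buffering results and appending them to the output file in batches. The sketch below shows that pattern under stated assumptions: `process_with_checkpoints` and `handle_item` are hypothetical, and the scripts' real checkpointing code may differ.

```python
import csv
import os


def process_with_checkpoints(items, handle_item, output_path, fieldnames, every=10):
    """Process items and append results to a CSV every `every` items.

    `handle_item` is a caller-supplied function returning one dict per item.
    Returns the total number of rows written.
    """
    buffer = []
    written = 0

    def flush():
        nonlocal written
        if not buffer:
            return
        # Write the header only when the file is new or empty.
        new_file = not os.path.exists(output_path) or os.path.getsize(output_path) == 0
        with open(output_path, "a", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            if new_file:
                writer.writeheader()
            writer.writerows(buffer)
        written += len(buffer)
        buffer.clear()

    for item in items:
        buffer.append(handle_item(item))
        if len(buffer) >= every:
            flush()
    flush()  # Save whatever is left after the last full batch.
    return written
```

With this design an interruption loses at most the items in the current unsaved batch.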

Troubleshooting

API Key Not Found

Make sure a .env file exists in the project directory and contains the key for the provider you are using (OPENAI_API_KEY, ANTHROPIC_API_KEY, or GEMINI_API_KEY).

Import Errors

pip install -r requirements.txt

Column Not Found

# Specify column names if auto-detection fails
python task_generation.py --input data.csv --domain-col "website"
python task_categorization.py --task-col "description" --website-col "url"
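To see why auto-detection might fail on a given file, it helps to picture it as a case-insensitive match of the CSV header against a list of likely names. This is an illustrative sketch only: `detect_column` and the candidate names are assumptions, not the scripts' actual detection rules.

```python
def detect_column(fieldnames, candidates):
    """Return the first header that matches a candidate name, ignoring case and padding."""
    normalized = {name.strip().lower(): name for name in fieldnames}
    for candidate in candidates:
        match = normalized.get(candidate.strip().lower())
        if match is not None:
            return match
    return None


# Hypothetical candidate list for the domain column.
DOMAIN_CANDIDATES = ["domain", "website", "url"]
```

If `detect_column(header, DOMAIN_CANDIDATES)` returns `None` for your file's header, pass the column name explicitly with --domain-col as shown above.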

License

MIT License - free to use and modify.

Author

Anais Howland (@anaishowland)

Citation

@software{howland2025dataset,
  author = {Howland, Anais},
  title = {Dataset Creation Tools for Web Browsing Benchmarks},
  year = {2025},
  url = {https://github.com/anaishowland/dataset_creation}
}
