Dataset Creation Tools

Python scripts for generating and categorizing web browsing tasks for benchmark datasets. Supports multiple LLM providers: OpenAI (GPT-5), Anthropic (Claude), and Google (Gemini).

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up API key (create .env file)
echo "OPENAI_API_KEY=your_key_here" > .env

# 3. Generate tasks from example websites
python task_generation.py --input example_websites.csv --output tasks.csv --model gpt-5 --tasks-per-site 5

# 4. Categorize the generated tasks
python task_categorization.py --subcategory=true --login=true --tags=true --input tasks.csv --model gpt-5

Features

  • Task Generation: Generate unique, time-independent web browsing tasks for any website
  • Auto-Detection: LLM automatically analyzes websites and generates appropriate tasks
  • Task Categorization: Add subcategories, login requirements, and hierarchical tags
  • Multi-Provider Support: Use OpenAI, Anthropic, or Gemini models
  • Resume Capability: Scripts automatically resume if interrupted

Installation

git clone https://github.com/anaishowland/dataset_creation.git
cd dataset_creation
pip install -r requirements.txt

Create a .env file with at least one API key:

OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
GEMINI_API_KEY=your_key_here

Usage

Generate Tasks

Create a CSV with website domains:

Domain
amazon.com
github.com
wikipedia.org
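
If the domain list already lives in Python, the input file can be produced programmatically. The sketch below is only an illustration: the helper name `domains_to_csv` is hypothetical, and lower-casing and de-duplicating the domains is an assumption about what makes a clean input, not something the scripts require.

```python
import csv
import io

def domains_to_csv(domains):
    """Return CSV text with a single 'Domain' column, one row per unique domain."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Domain"])
    seen = set()
    for raw in domains:
        domain = raw.strip().lower()  # assumption: normalize before writing
        if domain and domain not in seen:
            seen.add(domain)
            writer.writerow([domain])
    return buf.getvalue()

csv_text = domains_to_csv(["amazon.com", "github.com", "wikipedia.org"])
print(csv_text)

# To save it as the file passed to --input:
# with open("websites.csv", "w", encoding="utf-8", newline="") as f:
#     f.write(csv_text)
```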

Generate tasks:

python task_generation.py --input websites.csv --output tasks.csv --model gpt-5

The LLM will automatically analyze each website and generate appropriate tasks. No need to specify categories!

Options:

  • --model: Choose gpt-5, claude-sonnet-4-5-20250929, or gemini-2.5-pro
  • --tasks-per-site: Number of tasks per website (default: 15)
  • --categorize: Automatically categorize after generation
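
Each of the three model names belongs to a different provider, so each one needs a different key in .env. A minimal sketch of that relationship, assuming the provider can be inferred from the model-name prefix (the helper `required_key` is hypothetical and may not match how the scripts choose a provider):

```python
# Assumption: the provider is recognizable from the model-name prefix.
KEY_BY_PREFIX = {
    "gpt-": "OPENAI_API_KEY",
    "claude-": "ANTHROPIC_API_KEY",
    "gemini-": "GEMINI_API_KEY",
}

def required_key(model):
    """Return the environment variable a model name is expected to need."""
    for prefix, key in KEY_BY_PREFIX.items():
        if model.startswith(prefix):
            return key
    raise ValueError(f"Unrecognized model name: {model}")

for model in ("gpt-5", "claude-sonnet-4-5-20250929", "gemini-2.5-pro"):
    print(model, "->", required_key(model))
```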

Categorize Tasks

Add metadata to your generated tasks:

python task_categorization.py \
  --subcategory=true \
  --login=true \
  --tags=true \
  --input tasks.csv \
  --model gpt-5

This adds:

  • subCategory: One of 12 predefined categories
  • login: Whether login is required (yes/no)
  • tags: JSON array of hierarchical tags
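
Once these columns exist, the output can be filtered with the standard library. The following is a sketch under assumptions: the `task` column name and the sample rows are made up for illustration, and the tags are assumed to be plain strings inside the JSON array.

```python
import csv
import io
import json

# Made-up sample; the "task" column name and every value are assumptions.
SAMPLE = '''task,subCategory,login,tags
Find the top-rated laptop,Shopping & E-commerce,no,"[""Search"", ""Comparison""]"
Update the shipping address,Shopping & E-commerce,yes,"[""Account Management""]"
'''

def load_categorized(fileobj):
    """Parse categorized rows: login becomes a bool, tags a Python list."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["login"] = row["login"].strip().lower() == "yes"
        row["tags"] = json.loads(row["tags"]) if row["tags"] else []
        rows.append(row)
    return rows

rows = load_categorized(io.StringIO(SAMPLE))

# Keep only tasks that can run without an account.
login_free = [r["task"] for r in rows if not r["login"]]
print(login_free)
```

To read a real file instead of the sample, pass `open("tasks.csv", encoding="utf-8", newline="")` to `load_categorized`.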

Examples

Generate and categorize in one command:

python task_generation.py --input websites.csv --output tasks.csv --model gpt-5 --categorize

Try different models:

# GPT-5
python task_generation.py --input sites.csv --output tasks.csv --model gpt-5

# Claude
python task_generation.py --input sites.csv --output tasks.csv --model claude-sonnet-4-5-20250929

# Gemini
python task_generation.py --input sites.csv --output tasks.csv --model gemini-2.5-pro

Resume if interrupted:

# Just run the same command again - it will skip already processed items
python task_generation.py --input websites.csv --output tasks.csv --model gpt-5
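
Conceptually, resuming means comparing the input against what is already in the output file. The sketch below shows one way such a skip step could look. It is not taken from the scripts, and `pending_domains` is a hypothetical name.

```python
def pending_domains(input_domains, processed_domains):
    """Return input domains that have no output yet, keeping input order."""
    done = {d.strip().lower() for d in processed_domains}
    return [d for d in input_domains if d.strip().lower() not in done]

# Example: the first run was interrupted after two sites.
todo = pending_domains(
    ["amazon.com", "github.com", "wikipedia.org"],
    ["amazon.com", "github.com"],
)
print(todo)  # ['wikipedia.org']
```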

Task Categories

The categorization script uses 12 predefined categories:

  1. Search & Research - Information lookup and fact-finding
  2. Social Media & Communication - Social platforms and messaging
  3. Shopping & E-commerce - Product discovery and purchasing
  4. News & Media - Current events and articles
  5. Entertainment & Leisure - Streaming, gaming, entertainment
  6. Navigation & Maps - Routes and locations
  7. Utilities & Tools - Web utilities and productivity tools
  8. Finance & Banking - Account management and market data
  9. Education & Learning - Courses and tutorials
  10. Health & Wellness - Medical information and fitness
  11. Development & Tech Services - Code repositories and APIs
  12. Other / Miscellaneous - Everything else
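
When post-processing the output, it can help to check that every subCategory value is one of these 12 names. A minimal sketch, assuming a case-insensitive match and a fallback to the last category for anything unrecognized (both choices are assumptions, and `normalize_category` is a hypothetical helper):

```python
CATEGORIES = [
    "Search & Research",
    "Social Media & Communication",
    "Shopping & E-commerce",
    "News & Media",
    "Entertainment & Leisure",
    "Navigation & Maps",
    "Utilities & Tools",
    "Finance & Banking",
    "Education & Learning",
    "Health & Wellness",
    "Development & Tech Services",
    "Other / Miscellaneous",
]

# Lookup table keyed by lower-cased name for forgiving comparisons.
_BY_LOWER = {name.lower(): name for name in CATEGORIES}

def normalize_category(label):
    """Return the canonical category name, or the catch-all if unknown."""
    return _BY_LOWER.get(label.strip().lower(), "Other / Miscellaneous")

print(normalize_category(" shopping & e-commerce "))
print(normalize_category("Travel"))
```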

Tag Taxonomy

Tags are organized hierarchically (see tags_taxonomy.csv):

Information Retrieval & Analysis
  ├── Search, Research, Extraction, Comparison

Navigation & Workflow Control
  ├── Navigation, Filtering, Sorting, Pagination

Transactional Operations
  ├── Shopping, Checkout, Account Management, Booking

Content Interaction
  ├── Reading, Watching, Listening, Commenting

... and more

You can customize tags_taxonomy.csv for your specific needs.
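
To illustrate the hierarchy, the sketch below hard-codes the four groups shown above and maps each tag back to its parent group. It does not read tags_taxonomy.csv, whose column layout is not described here, and `group_tags` is a hypothetical helper.

```python
# Only the four groups listed above; the full taxonomy has more.
TAXONOMY = {
    "Information Retrieval & Analysis": ["Search", "Research", "Extraction", "Comparison"],
    "Navigation & Workflow Control": ["Navigation", "Filtering", "Sorting", "Pagination"],
    "Transactional Operations": ["Shopping", "Checkout", "Account Management", "Booking"],
    "Content Interaction": ["Reading", "Watching", "Listening", "Commenting"],
}

# Reverse index: tag -> parent group.
PARENT = {tag: group for group, tags in TAXONOMY.items() for tag in tags}

def group_tags(tags):
    """Group a flat tag list under parent groups; unknown tags go under None."""
    grouped = {}
    for tag in tags:
        grouped.setdefault(PARENT.get(tag), []).append(tag)
    return grouped

print(group_tags(["Search", "Checkout", "Filtering", "Comparison"]))
```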

Tips

  • Start small: Test with 2-3 websites before processing hundreds
  • Monitor costs: Each API call consumes billable tokens, so costs grow with the number of websites and tasks
  • Use resume: Scripts save progress every 10 items
  • Custom columns: Scripts auto-detect column names, or specify with --domain-col, --task-col, etc.
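
Column auto-detection can be pictured as matching the CSV header against a list of likely names. This is a sketch of the idea only: the candidate names and the `detect_column` helper are assumptions, not the scripts' actual rules.

```python
def detect_column(header, candidates):
    """Return the first header name matching a candidate (case-insensitive), else None."""
    by_lower = {name.strip().lower(): name for name in header}
    for candidate in candidates:
        match = by_lower.get(candidate.lower())
        if match is not None:
            return match
    return None

header = ["Website", "Task Description"]
domain_col = detect_column(header, ["domain", "website", "url"])
print(domain_col)  # Website

# When nothing matches, fall back to an explicit flag such as --domain-col.
print(detect_column(header, ["domain"]))  # None
```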

Troubleshooting

API Key Not Found

  • Make sure the .env file exists and contains the correct key for your provider (for example, OPENAI_API_KEY for gpt-5)
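
A quick way to see which providers are usable is to parse the .env file and look for the three key names. The sketch below uses only the standard library and treats the `your_key_here` placeholder as missing. The helper names are hypothetical, and the parser handles only simple `KEY=value` lines.

```python
import os

PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Gemini": "GEMINI_API_KEY",
}

def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def configured_providers(values):
    """Return providers whose key is present and not the placeholder."""
    return [
        provider
        for provider, key in PROVIDER_KEYS.items()
        if values.get(key) and values[key] != "your_key_here"
    ]

if os.path.exists(".env"):
    with open(".env", encoding="utf-8") as f:
        print(configured_providers(parse_env(f.read())))
else:
    print("No .env file in the current directory")
```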

Import Errors

pip install -r requirements.txt

Column Not Found

# Specify column names if auto-detection fails
python task_generation.py --input data.csv --domain-col "website"
python task_categorization.py --task-col "description" --website-col "url"

License

MIT License - free to use and modify.

Author

Anais Howland (@anaishowland)

Citation

@software{howland2025dataset,
  author = {Howland, Anais},
  title = {Dataset Creation Tools for Web Browsing Benchmarks},
  year = {2025},
  url = {https://github.com/anaishowland/dataset_creation}
}
