EthanL06/ut-dining-scraper

UT Dining Scraper

A high-performance web scraper for University of Texas dining hall menus, built with TypeScript and Cheerio, with optional Supabase integration. It scrapes menu data, nutrition information, and allergen details from the UT dining system.

🚀 Features

  • Fast Scraping: Uses Cheerio and fetch for lightweight, efficient scraping
  • Parallel Processing: Configurable concurrency for optimal performance
  • Nutrition Data: Extracts detailed nutrition facts for each menu item
  • Allergen Information: Identifies allergens and dietary restrictions
  • Supabase Integration: Direct database insertion capabilities
  • Performance Testing: Built-in performance testing suite
  • Error Handling: Robust retry mechanisms and error recovery
  • TypeScript: Full type safety and IntelliSense support
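The retry behaviour listed under Error Handling is not shown in this README. As a rough illustration only, a retry wrapper with exponential backoff could look like the sketch below. The name withRetry, the attempt count, and the delays are assumptions made for this example, not the repository's actual implementation.

```typescript
// Hypothetical helper: retry an async operation with exponential backoff.
// The name and defaults are illustrative and not taken from the repository.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // With the defaults this waits 200 ms, then 400 ms, between attempts.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

A nutrition request could then be wrapped as withRetry(() => fetch(url)), so that a transient failure does not immediately count as an error.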

📋 Prerequisites

  • Node.js (v16 or higher)
  • pnpm (package manager)
  • TypeScript (installed globally or via pnpm)

🛠️ Installation

  1. Clone the repository

    git clone <repository-url>
    cd ut-dining-scraper
  2. Install dependencies

    pnpm install
  3. Set up environment variables (optional, for Supabase)

    cp .env.example .env
    # Edit .env with your Supabase credentials
  4. Build the project

    pnpm build

🏃‍♂️ Quick Start

Basic Usage

  1. Run the scraper

    pnpm start
    # or
    node dist/index.js
  2. Output

    • JSON file saved to data/ directory
    • Console output with progress and statistics
    • Optional Supabase insertion (if configured)

Example Output

🚀 Starting menu scraping with Cheerio...
📍 Found 7 dining locations to process
🏢 Processing J2 Dining (3 menus found)
  📋 Lunch menu: 74 food items across 12 categories
  ✅ Lunch: 74/74 items with nutrition data
📊 Scraped 7 locations, 3 menus, 177 food items
💾 Data saved to data/ut_menus_2025-06-21T16-42-58-936Z.json
⏱️ Total scraping time: 4.32 seconds
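The file name in the sample output looks like an ISO 8601 timestamp with the colons and the dot replaced by dashes. Assuming that is how it is built, a name of the same shape can be produced as follows. The function menuOutputPath is hypothetical and only illustrates the pattern.

```typescript
// Build a path shaped like data/ut_menus_2025-06-21T16-42-58-936Z.json.
// Assumption: the timestamp is Date.toISOString() with ":" and "." replaced by "-".
function menuOutputPath(now: Date = new Date(), directory = "data"): string {
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return `${directory}/ut_menus_${stamp}.json`;
}

console.log(menuOutputPath(new Date("2025-06-21T16:42:58.936Z")));
// prints: data/ut_menus_2025-06-21T16-42-58-936Z.json
```

Replacing the colons keeps the name valid on file systems that reject ":" in file names, and the names still sort chronologically.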

⚙️ Configuration

Concurrency Settings

Located in src/index.ts:

const NUTRITION_CONCURRENCY = 50; // Parallel nutrition requests
const MENU_CONCURRENCY = 7; // Parallel menu page requests
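To make clear what a concurrency limit such as these constants controls, here is a minimal sketch of a bounded parallel map. The helper mapWithConcurrency is hypothetical; the scraper may use a different mechanism or a library.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start no more runners than there are items, and always at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With a limit of 50, at most 50 nutrition requests would be pending at any moment. Results come back in input order regardless of which request finishes first.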

Dining Locations

Modify the LINKS array in src/index.ts to add/remove locations:

const LINKS = [
  "https://hf-foodpro.austin.utexas.edu/foodpro/shortmenu.aspx?locationName=J2+Dining",
  // Add more locations here
];
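The entry above encodes the location name in the locationName query parameter, with the space written as "+". A small sketch of building and reading such links is shown below. The helpers locationLink and locationNameOf are hypothetical, and the sketch assumes locationName is the only parameter the page needs.

```typescript
// Base URL copied from the LINKS example above.
const SHORT_MENU_URL =
  "https://hf-foodpro.austin.utexas.edu/foodpro/shortmenu.aspx";

// Hypothetical helper: build a LINKS entry from a human-readable location name.
function locationLink(locationName: string): string {
  const query = new URLSearchParams({ locationName });
  return `${SHORT_MENU_URL}?${query.toString()}`;
}

// Hypothetical helper: recover the location name from a LINKS entry.
function locationNameOf(link: string): string | null {
  return new URL(link).searchParams.get("locationName");
}

console.log(locationLink("J2 Dining"));
// prints: https://hf-foodpro.austin.utexas.edu/foodpro/shortmenu.aspx?locationName=J2+Dining
```

URLSearchParams handles the encoding, so names containing spaces or other reserved characters do not need to be escaped by hand.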

Supabase Integration

  1. Enable Supabase by setting ENABLE_SUPABASE in your .env file:

    ENABLE_SUPABASE=true # Set to true to enable Supabase insertion, or false to disable
  2. Configure credentials in .env:

    SUPABASE_URL=your_supabase_url
    SUPABASE_ANON_KEY=your_anon_key
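As an illustration of how these three variables might be read and validated, the following sketch returns nothing when Supabase is disabled and fails fast when it is enabled without credentials. The function readSupabaseSettings and its error message are assumptions for this example. The actual logic in src/supabase.ts is not shown in this README.

```typescript
interface SupabaseSettings {
  url: string;
  anonKey: string;
}

// Hypothetical helper: returns null when Supabase is disabled, and throws when
// it is enabled but a credential is missing.
function readSupabaseSettings(
  env: Record<string, string | undefined>,
): SupabaseSettings | null {
  const enabled = (env["ENABLE_SUPABASE"] ?? "").trim().toLowerCase() === "true";
  if (!enabled) {
    return null;
  }
  const url = env["SUPABASE_URL"];
  const anonKey = env["SUPABASE_ANON_KEY"];
  if (!url || !anonKey) {
    throw new Error(
      "ENABLE_SUPABASE is true but SUPABASE_URL or SUPABASE_ANON_KEY is missing",
    );
  }
  return { url, anonKey };
}
```

In a Node.js program this would be called as readSupabaseSettings(process.env), and a non-null result could be passed to createClient(url, anonKey) from @supabase/supabase-js.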

📊 Performance Testing

The project includes a comprehensive performance testing suite to find optimal concurrency settings.

🧪 Running Performance Tests

Option 1: Using the Script (Recommended)

chmod +x run-performance-test.sh
./run-performance-test.sh

Option 2: Manual Commands

pnpm build
node dist/performance-test.js

📈 What the Performance Test Does

The performance test systematically evaluates different concurrency configurations:

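| Test Config  | Nutrition Concurrency | Menu Concurrency | Purpose          |
|--------------|-----------------------|------------------|------------------|
| Conservative | 20                    | 5                | Safe baseline    |
| Low          | 30                    | 6                | Light load       |
| Baseline     | 40                    | 7                | Current default  |
| Moderate     | 50                    | 7                | Recommended      |
| Aggressive   | 60-70                 | 8                | High performance |
| Maximum      | 80-100                | 9-10             | Stress test      |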

📊 Performance Metrics

Each test measures:

  • ⏱️ Duration: Total scraping time
  • 🔢 Items/Second: Processing throughput
  • ✅ Success Rate: Percentage of successful nutrition fetches
  • ❌ Error Count: Number of failed requests
  • 🏆 Composite Score: Overall performance rating
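The throughput and success-rate metrics follow directly from the raw counts. The sketch below shows the arithmetic, assuming each error corresponds to one item whose nutrition fetch failed. The function computeMetrics is hypothetical. The composite score is left out because this README does not document its formula.

```typescript
interface RunMetrics {
  itemsPerSecond: number;
  successRatePercent: number;
}

// Derive throughput and success rate from raw counts.
// Assumption: every failed nutrition fetch counts as one error against one item.
function computeMetrics(
  items: number,
  errors: number,
  durationSeconds: number,
): RunMetrics {
  if (items <= 0 || durationSeconds <= 0) {
    return { itemsPerSecond: 0, successRatePercent: 0 };
  }
  return {
    itemsPerSecond: items / durationSeconds,
    successRatePercent: ((items - errors) / items) * 100,
  };
}

const sampleRun = computeMetrics(177, 3, 4.12);
console.log(
  sampleRun.itemsPerSecond.toFixed(1),
  sampleRun.successRatePercent.toFixed(1),
);
// prints: 43.0 98.3
```

The inputs are the figures from the sample run with Nutrition=50 and Menu=7: 177 items, 3 errors, and 4.12 seconds.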

📋 Sample Performance Test Output

🧪 Testing: Nutrition=50, Menu=7
   ⏱️  Duration: 4.12s
   📊 Items: 177 (43.0/s)
   ✅ Success Rate: 98.3%
   ❌ Errors: 3

📈 PERFORMANCE TEST RESULTS
═══════════════════════════════════════════════════════════════════════════════
Nutrition | Menu | Duration | Items/s | Success% | Errors | Score
───────────────────────────────────────────────────────────────────────────────
       20 |    5 |     8.45s |    20.9 |    100.0 |      0 | 20.90
       50 |    7 |     4.12s |    43.0 |     98.3 |      3 | 41.27
       70 |    8 |     3.21s |    55.1 |     96.6 |      6 | 51.89

🏆 OPTIMAL CONFIGURATION:
   Nutrition Concurrency: 70
   Menu Concurrency: 8
   Performance: 55.1 items/second
   Success Rate: 96.6%
   Duration: 3.21 seconds

📁 Performance Test Results

Results are reported in two places:

  • Console: Real-time progress and summary
  • JSON File: performance-test-results/performance_test_[timestamp].json

🎯 Interpreting Results

Look for:

  • ✅ Success rate > 95%
  • ⚡ High items/second ratio
  • 🎯 Low error count
  • 🏆 High composite score

Warning signs:

  • ❌ Success rate < 90%
  • 🐌 Decreasing items/second despite higher concurrency
  • 💥 High error count
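These rules of thumb can be applied mechanically to a set of test runs. The sketch below uses the 95% and 90% thresholds from the lists above. The "borderline" label for values in between and both function names are assumptions made for this example.

```typescript
type Verdict = "good" | "borderline" | "warning";

// Thresholds from the lists above: above 95% is good, below 90% is a warning sign.
// The "borderline" label for the range in between is an assumption of this sketch.
function assessSuccessRate(successRatePercent: number): Verdict {
  if (successRatePercent > 95) return "good";
  if (successRatePercent < 90) return "warning";
  return "borderline";
}

// Detect throughput that drops as concurrency rises.
// `runs` must be sorted by ascending concurrency; the result lists the
// concurrency values at which items/second fell below the previous run.
function throughputRegressions(
  runs: readonly { concurrency: number; itemsPerSecond: number }[],
): number[] {
  const regressed: number[] = [];
  for (let i = 1; i < runs.length; i++) {
    if (runs[i].itemsPerSecond < runs[i - 1].itemsPerSecond) {
      regressed.push(runs[i].concurrency);
    }
  }
  return regressed;
}
```

For the three sample runs (20.9, 43.0, and 55.1 items/second at nutrition concurrency 20, 50, and 70), throughputRegressions returns an empty array, and all three success rates are assessed as "good".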

⚡ Applying Optimal Settings

After running performance tests, update your configuration in src/index.ts:

// Use the optimal values from your performance test results
const NUTRITION_CONCURRENCY = 70; // Example optimal value
const MENU_CONCURRENCY = 8; // Example optimal value

📁 Project Structure

ut-dining-scraper/
├── src/
│   ├── index.ts              # Main scraper application
│   ├── performance-test.ts   # Performance testing suite
│   ├── supabase.ts           # Supabase configuration
│   └── types/                # TypeScript type definitions
├── data/                     # Scraped menu data (JSON files)
├── performance-test-results/ # Performance test results
├── dist/                     # Compiled JavaScript (generated)
├── package.json
├── tsconfig.json
└── run-performance-test.sh   # Performance test runner script

🚨 Troubleshooting

Common Issues

Slow Performance

  • Run performance tests to find optimal concurrency
  • Check network connection
  • Verify server isn't rate-limiting

High Error Rates

  • Reduce concurrency values
  • Check if dining sites are accessible
  • Verify network stability

TypeScript Errors

# Clean build
rm -rf dist/
pnpm build

Permission Errors

chmod +x run-performance-test.sh

📜 Available Scripts

# Build the project
pnpm build

# Run the main scraper
pnpm start

# Run performance tests
pnpm run performance-test:script

# Development mode with watch
pnpm dev

# Clean build artifacts
pnpm clean

# One-command setup
pnpm quick-start

📚 Documentation

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run performance tests to ensure no regression
  5. Submit a pull request

📄 License

This project is licensed under the ISC License.

🙋‍♂️ Support

For issues and questions:

  1. Check the troubleshooting section
  2. Run performance tests to identify configuration issues
  3. Create an issue with relevant logs and configuration details

About

Web scraper for UT Dining mobile app.
