A high-performance web scraper for University of Texas dining hall menus, built with TypeScript, Cheerio, and Supabase integration. This tool scrapes menu data, nutrition information, and allergen details from the UT dining system.
- Fast Scraping: Uses Cheerio and fetch for lightweight, efficient scraping (see the sketch after this list)
- Parallel Processing: Configurable concurrency for optimal performance
- Nutrition Data: Extracts detailed nutrition facts for each menu item
- Allergen Information: Identifies allergens and dietary restrictions
- Supabase Integration: Direct database insertion capabilities
- Performance Testing: Built-in performance testing suite
- Error Handling: Robust retry mechanisms and error recovery
- TypeScript: Full type safety and IntelliSense support
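
To make the scraping and retry behavior above concrete, here is a minimal sketch of fetching a menu page with a retry loop and parsing it with Cheerio. It is illustrative only: the CSS selector and the backoff policy are assumptions, not the project's actual code.

```typescript
// Sketch: fetch a menu page with a simple retry/backoff loop, then parse it with Cheerio.
// The selector and retry policy are illustrative assumptions, not the project's actual code.
import * as cheerio from "cheerio";

async function fetchWithRetry(url: string, retries = 3): Promise<string> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url); // global fetch (Node 18+) or a fetch polyfill
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (attempt === retries) throw err;
      await new Promise((r) => setTimeout(r, 500 * attempt)); // back off before retrying
    }
  }
  throw new Error("unreachable");
}

async function scrapeMenuPage(url: string): Promise<string[]> {
  const html = await fetchWithRetry(url);
  const $ = cheerio.load(html);
  // Hypothetical selector; the real markup of the UT dining pages may differ.
  return $("div.shortmenurecipes span")
    .map((_, el) => $(el).text().trim())
    .get();
}
```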
Prerequisites:
- Node.js (v16 or higher)
- pnpm (package manager)
- TypeScript (installed globally or via pnpm)
Installation:

- Clone the repository

  ```bash
  git clone <repository-url>
  cd ut-dining-scraper
  ```

- Install dependencies

  ```bash
  pnpm install
  ```

- Set up environment variables (optional, for Supabase)

  ```bash
  cp .env.example .env
  # Edit .env with your Supabase credentials
  ```

- Build the project

  ```bash
  pnpm build
  ```

- Run the scraper

  ```bash
  pnpm start
  # or
  node dist/index.js
  ```
Output:
- JSON file saved to the data/ directory (a hypothetical shape is sketched below)
- Console output with progress and statistics
- Optional Supabase insertion (if configured)
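
The exact JSON schema is defined by the scraper itself; as a rough orientation, a shape consistent with the data described above (locations, menus, categories, items, nutrition, allergens) might look like the following. All field names here are hypothetical.

```typescript
// Hypothetical shape of a saved JSON file; all field names are illustrative,
// not the project's actual schema.
interface ScrapedOutput {
  scrapedAt: string; // ISO timestamp
  locations: {
    name: string; // e.g. "J2 Dining"
    menus: {
      name: string; // e.g. "Lunch"
      categories: {
        name: string;
        items: {
          name: string;
          nutrition?: Record<string, string>; // e.g. { Calories: "250" }
          allergens?: string[];
        }[];
      }[];
    }[];
  }[];
}
```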
Example console output:

```text
Starting menu scraping with Cheerio...
Found 7 dining locations to process
Processing J2 Dining (3 menus found)
Lunch menu: 74 food items across 12 categories
Lunch: 74/74 items with nutrition data
Scraped 7 locations, 3 menus, 177 food items
Data saved to data/ut_menus_2025-06-21T16-42-58-936Z.json
Total scraping time: 4.32 seconds
```
Concurrency settings are located in src/index.ts:

```typescript
const NUTRITION_CONCURRENCY = 50; // Parallel nutrition requests
const MENU_CONCURRENCY = 7; // Parallel menu page requests
```

Modify the LINKS array in src/index.ts to add/remove locations (a sketch of applying the concurrency limits follows this example):

```typescript
const LINKS = [
  "https://hf-foodpro.austin.utexas.edu/foodpro/shortmenu.aspx?locationName=J2+Dining",
  // Add more locations here
];
```
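
The two concurrency constants cap how many requests are in flight at once. One common way to enforce such a cap is a small worker pool; the helper below is a sketch of that pattern, not necessarily what src/index.ts does (it may use a library such as p-limit instead).

```typescript
// Sketch: run async tasks with a fixed concurrency cap.
// Illustrative pattern only; the project may rely on a library such as p-limit instead.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}

// Usage idea: fetch every menu page with at most MENU_CONCURRENCY requests in flight.
// const pages = await mapWithConcurrency(LINKS, MENU_CONCURRENCY, scrapeMenuPage);
```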
- Enable Supabase by setting ENABLE_SUPABASE in your .env file:

  ```bash
  ENABLE_SUPABASE=true # Set to true to enable Supabase insertion, or false to disable
  ```

- Configure credentials in .env (see the insertion sketch below):

  ```bash
  SUPABASE_URL=your_supabase_url
  SUPABASE_ANON_KEY=your_anon_key
  ```
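
With those variables set, the insertion step presumably builds a client from SUPABASE_URL and SUPABASE_ANON_KEY and writes rows only when ENABLE_SUPABASE is true. The sketch below shows that flow with @supabase/supabase-js; the table name and row shape are assumptions, not the project's actual schema.

```typescript
// Sketch: insert scraped rows into Supabase, gated by ENABLE_SUPABASE.
// The table name "menu_items" and the row shape are assumptions for illustration.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

async function insertMenuItems(rows: { location: string; name: string }[]) {
  if (process.env.ENABLE_SUPABASE !== "true") return; // insertion disabled
  const { error } = await supabase.from("menu_items").insert(rows);
  if (error) throw error;
}
```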
The project includes a comprehensive performance testing suite to find optimal concurrency settings.
```bash
chmod +x run-performance-test.sh
./run-performance-test.sh
```

Or run it manually:

```bash
pnpm build
node dist/performance-test.js
```

The performance test systematically evaluates different concurrency configurations (a timing sketch follows the table):
| Test Config | Nutrition Concurrency | Menu Concurrency | Purpose |
|---|---|---|---|
| Conservative | 20 | 5 | Safe baseline |
| Low | 30 | 6 | Light load |
| Baseline | 40 | 7 | Current default |
| Moderate | 50 | 7 | Recommended |
| Aggressive | 60-70 | 8 | High performance |
| Maximum | 80-100 | 9-10 | Stress test |
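
As a rough picture of how such a sweep can be driven, the sketch below times a run per configuration and reports throughput. `runScrape` is a stand-in for the scraper's entry point; it is not an actual export of this project.

```typescript
// Sketch: drive a sweep over configurations like those in the table above and time each run.
// `runScrape` is a hypothetical stand-in for the scraper's entry point.
interface TestConfig {
  label: string;
  nutrition: number; // NUTRITION_CONCURRENCY for this run
  menu: number;      // MENU_CONCURRENCY for this run
}

const CONFIGS: TestConfig[] = [
  { label: "Conservative", nutrition: 20, menu: 5 },
  { label: "Baseline", nutrition: 40, menu: 7 },
  { label: "Aggressive", nutrition: 70, menu: 8 },
];

async function sweep(
  runScrape: (config: TestConfig) => Promise<{ items: number; errors: number }>
) {
  for (const config of CONFIGS) {
    const start = Date.now();
    const { items, errors } = await runScrape(config);
    const seconds = (Date.now() - start) / 1000;
    console.log(
      `${config.label}: ${seconds.toFixed(2)}s, ` +
        `${(items / seconds).toFixed(1)} items/s, ${errors} errors`
    );
  }
}
```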
Each test measures:
- Duration: Total scraping time
- Items/Second: Processing throughput
- Success Rate: Percentage of successful nutrition fetches
- Error Count: Number of failed requests
- Composite Score: Overall performance rating (one possible calculation is sketched below)
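
The exact weighting behind the composite score is not documented here; the sketch below shows one plausible way the metrics could be derived from raw counts, with the score weighting being an illustrative assumption.

```typescript
// Sketch: derive the reported metrics from raw counts.
// The composite-score weighting is an illustrative assumption,
// not necessarily the formula the test suite uses.
function summarize(items: number, errors: number, durationSeconds: number) {
  const itemsPerSecond = items / durationSeconds;
  const successRate = (items - errors) / items;        // fraction of successful nutrition fetches
  const compositeScore = itemsPerSecond * successRate; // throughput discounted by failures
  return { itemsPerSecond, successRate, compositeScore };
}
```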
Sample output for a single test:

```text
Testing: Nutrition=50, Menu=7
  Duration: 4.12s
  Items: 177 (43.0/s)
  Success Rate: 98.3%
  Errors: 3
```
```text
PERFORMANCE TEST RESULTS
───────────────────────────────────────────────────────────────────
Nutrition | Menu | Duration | Items/s | Success% | Errors | Score
───────────────────────────────────────────────────────────────────
       20 |    5 |    8.45s |    20.9 |    100.0 |      0 | 20.90
       50 |    7 |    4.12s |    43.0 |     98.3 |      3 | 41.27
       70 |    8 |    3.21s |    55.1 |     96.6 |      6 | 51.89

OPTIMAL CONFIGURATION:
  Nutrition Concurrency: 70
  Menu Concurrency: 8
  Performance: 55.1 items/second
  Success Rate: 96.6%
  Duration: 3.21 seconds
```
Results are automatically saved to:
- Console: Real-time progress and summary
- JSON File: performance-test-results/performance_test_[timestamp].json
Look for:
- Success rate > 95%
- High items/second ratio
- Low error count
- High composite score

Warning signs:
- Success rate < 90%
- Decreasing items/second despite higher concurrency
- High error count
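
If you want to flag runs automatically, a small check that encodes the thresholds above might look like this; the result shape and the "borderline" band are assumptions for illustration.

```typescript
// Sketch: classify a test result against the thresholds listed above.
// The result shape and the "borderline" band are illustrative assumptions.
interface TestResult {
  successRate: number; // 0..1
  errors: number;
  itemsPerSecond: number;
}

function assessResult(r: TestResult): string {
  if (r.successRate < 0.9) return "warning: success rate below 90%";
  if (r.successRate > 0.95 && r.errors <= 5) return "good: high success rate, low error count";
  return "borderline: consider lowering concurrency";
}
```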
After running performance tests, update your configuration in src/index.ts:
```typescript
// Use the optimal values from your performance test results
const NUTRITION_CONCURRENCY = 70; // Example optimal value
const MENU_CONCURRENCY = 8; // Example optimal value
```
Project structure:

```text
ut-dining-scraper/
├── src/
│   ├── index.ts              # Main scraper application
│   ├── performance-test.ts   # Performance testing suite
│   ├── supabase.ts           # Supabase configuration
│   └── types/                # TypeScript type definitions
├── data/                     # Scraped menu data (JSON files)
├── performance-test-results/ # Performance test results
├── dist/                     # Compiled JavaScript (generated)
├── package.json
├── tsconfig.json
└── run-performance-test.sh   # Performance test runner script
```
Slow Performance
- Run performance tests to find optimal concurrency
- Check network connection
- Verify server isn't rate-limiting
High Error Rates
- Reduce concurrency values
- Check if dining sites are accessible
- Verify network stability
TypeScript Errors
```bash
# Clean build
rm -rf dist/
pnpm build
```

Permission Errors

```bash
chmod +x run-performance-test.sh
```
Available scripts:

```bash
# Build the project
pnpm build

# Run the main scraper
pnpm start

# Run performance tests
pnpm run performance-test:script

# Development mode with watch
pnpm dev

# Clean build artifacts
pnpm clean

# One-command setup
pnpm quick-start
```

Documentation:
- README.md - Main documentation (you are here)
- QUICK_START.md - Get running in 5 minutes
To contribute:
- Fork the repository
- Create a feature branch
- Make your changes
- Run performance tests to ensure no regression
- Submit a pull request
This project is licensed under the ISC License.
For issues and questions:
- Check the troubleshooting section
- Run performance tests to identify configuration issues
- Create an issue with relevant logs and configuration details