How One Query Nearly Killed Our Database
The performance optimization that went from 15 seconds to 29 milliseconds—and what I learned about scaling assumptions.
Everyone thinks complex database queries are just a performance problem you can solve with better hardware.
I thought that too.
Last month, our monitoring alerts started screaming. One query was consuming 85% of database runtime. Users were abandoning pages. Support tickets were flooding in about timeouts.
The query looked reasonable in development. Clean JOINs, proper indexing, standard patterns. It worked fine with test data.
Then production traffic hit it like a freight train.
Here’s what we were dealing with: a webhook delivery system that tracked failed synchronization events for student records. The query had to identify which student entities had failed webhook deliveries.
SELECT COALESCE(deliveries.entity_id, delivery_entities.entity_id) AS entity_id
FROM webhook_deliveries deliveries
LEFT JOIN webhook_delivery_entities delivery_entities
ON deliveries.id = delivery_entities.webhook_delivery_id
WHERE (deliveries.entity_id IN (:student_ids) AND deliveries.entity_type = 'Student')
OR (delivery_entities.entity_id IN (:student_ids) AND delivery_entities.entity_type = 'Student')
Looks harmless, right?
Wrong.
In production, those IN clauses contained 975,000 UUIDs. The database was processing 47MB of data per request. The LEFT JOIN with OR conditions created execution plans that made the query planner weep.
Result: 14.94 seconds per request.
The original developer (me) made a classic mistake: optimizing for the wrong constraint.
I optimized for:
Fewer database round trips
Single query complexity
“Elegant” SQL patterns
I should have optimized for:
Query execution time under load
Database memory usage
Realistic data volumes
The insight: sometimes two simple queries beat one complex query by 500x.
The Counter-Intuitive Solution
Instead of fixing the complex query, I broke it into two simple ones:
# Query 1: Get failed entity IDs from main deliveries table
failed_from_deliveries = WebhookDelivery
  .where(status: 'failed', entity_type: 'Student')
  .where(created_at: date_range)
  .pluck(:entity_id)

# Query 2: Get failed entity IDs from junction table
failed_from_entities = WebhookDeliveryEntity
  .joins(:webhook_delivery)
  .where(webhook_deliveries: { status: 'failed' })
  .where(entity_type: 'Student', created_at: date_range)
  .pluck(:entity_id)

# Combine into a Set so each membership check is O(1)
# instead of scanning a huge array per lookup
all_failed = (failed_from_deliveries + failed_from_entities).to_set
student_ids.select { |id| all_failed.include?(id) }
“But that’s two database calls!”
Yes. And it’s 508x faster.
The Scaling Reality Check
I benchmarked both approaches across realistic data volumes:
Load Level    UUIDs      Original Query   New Approach   Improvement
Small         1,000      13.26ms          1.53ms         8.7x
Medium        325,000    2,950ms          12.25ms        241x
Large         650,000    8,575ms          24.48ms        350x
Production    975,000    14,943ms         29.37ms        509x
The pattern: as data volume increased, the performance gap kept widening, from 8.7x at 1,000 UUIDs to 509x at production scale.
The original query’s execution time scaled catastrophically. The new approach scaled almost linearly.
What I Learned About Database Performance
Lesson 1: Query planners have limits Complex queries with large IN clauses and OR conditions can overwhelm the database optimizer. Sometimes you need to help it by breaking things down.
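One way to help the planner, sketched here in plain Ruby rather than as the approach this post shipped, is to split an oversized IN list into bounded chunks, so each query the database sees stays a size the optimizer handles predictably. The `lookup` lambda is a hypothetical stand-in for a per-chunk database call such as `Model.where(entity_id: chunk).pluck(:entity_id)`:

```ruby
# Hypothetical stand-in for a database call, e.g.
# WebhookDelivery.where(entity_id: chunk).pluck(:entity_id)
lookup = ->(chunk) { chunk.select { |id| id.odd? } }

ids = (1..25_000).to_a

# Issue the lookup in bounded chunks: each IN list the database
# sees is at most 10,000 values, never 975,000
failed = ids.each_slice(10_000).flat_map { |chunk| lookup.call(chunk) }
```

The trade-off is more round trips, which, as the benchmarks above suggest, is usually the cheaper side of the bargain.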
Lesson 2: Memory pressure matters more than round trips The 47MB of UUID data being processed was the real bottleneck, not the network round trip for a second query.
Lesson 3: Test with realistic data volumes Our test database had 1,000 records. Production had nearly a million. The scaling characteristics were completely different.
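A toy version of that reality check runs in plain Ruby. The synthetic IDs below stand in for real rows; the point is only that an operation that feels instant at 1,000 elements can behave very differently at hundreds of thousands:

```ruby
require "benchmark"
require "set"

# Synthetic stand-ins for production-scale ID lists
all_failed = (1..200_000).map { |i| "uuid-#{i}" }
candidates = all_failed.sample(500)

# Linear scan: every lookup walks the array
array_time = Benchmark.realtime do
  candidates.count { |id| all_failed.include?(id) }
end

# Hash-based set: every lookup is roughly constant time
failed_set = all_failed.to_set
set_time = Benchmark.realtime do
  candidates.count { |id| failed_set.include?(id) }
end

puts format("array: %.4fs, set: %.4fs", array_time, set_time)
```

Run the same comparison at 1,000 elements and the two are indistinguishable, which is exactly why a toy test database hides the problem.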
Lesson 4: Simple often beats elegant Two straightforward queries with predictable execution plans outperformed one “clever” query by orders of magnitude.
The Caching Layer That Made It Bulletproof
Even with the optimized queries, I added a caching layer to handle traffic spikes:
def failed_entity_ids_for_organization(org, date_range: nil)
  cache_key = "failed_entities/#{org.id}/#{date_range&.first&.to_date}"
  Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
    get_failed_entity_ids(org, date_range)
  end
end
The date-scoped cache keys were crucial. Instead of invalidating all cached data when any webhook delivery changed, we only invalidated data for specific days.
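The key scheme itself is just string construction. A sketch (the method name and IDs here are illustrative, not the exact production code) shows why a delivery that changes on one day leaves every other day's cache entry untouched:

```ruby
require "date"

# One cache key per (organization, day): invalidating a day
# means touching exactly one key, not the whole cache
def failed_entities_cache_key(org_id, day)
  "failed_entities/#{org_id}/#{day.iso8601}"
end

key_a = failed_entities_cache_key(42, Date.new(2024, 3, 1))
key_b = failed_entities_cache_key(42, Date.new(2024, 3, 2))
```

Because the keys for adjacent days never collide, a write on March 2nd can expire `key_b` while every cached read against `key_a` keeps hitting.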
Result: 99% cache hit rate in production.
The Production Impact
After deployment:
Database CPU utilization dropped from 95% to 15%
Page load times went from 15+ seconds to under 500ms
Support tickets about timeouts disappeared completely
We could handle 10x more concurrent users
The query that was consuming 85% of database runtime now uses less than 1%.
What This Means for Your Systems
Stop optimizing for theoretical elegance. Start optimizing for production reality.
Your database doesn’t care how elegant your SQL looks. It cares about execution plans, memory usage, and I/O patterns.
Test with realistic data volumes, not toy datasets. Performance characteristics can change dramatically at scale.
Consider the total system load, not just individual query performance. Sometimes multiple simple queries create less overall system stress than one complex query.
Always have a caching strategy for expensive operations, even optimized ones.
Your Next Database Optimization
Look at your slowest queries in production. Ask yourself:
Are you optimizing for elegance or performance?
Have you tested with realistic data volumes?
Could breaking complex queries into simple ones help?
Do you have appropriate caching for expensive operations?
The query optimization that saved our database wasn’t about better SQL—it was about challenging assumptions about what “better” actually means.
This week’s experiment: benchmarking database query patterns at scale. What assumptions are you making about your system’s performance that might not hold at production volumes?
- Rafa
P.S. The most expensive performance optimization is the one you never measure. If you’re not benchmarking with realistic data, you’re optimizing blind.
