Optimize querying for missing albums #5495

JackDanger · 2025-04-06T19:54:53Z

Database Migration

NO

Description

With ~200,000 albums in the database, the queries that fetch missing albums become very slow, even on a fast PostgreSQL database: the original query below takes about 24 seconds.

| Problem | Cause |
|---------|-------|
| High row count | Selecting rows before grouping and sorting |
| Large sort in memory | GB of memory used for low millions of records w/ quicksort |
| Expensive Anti Join | On `TrackFiles.Id IS NULL` condition |
| Grouping before LIMIT | Entire result grouped before applying LIMIT+OFFSET |

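This is one approach to optimizing these queries without significantly changing the query builder: the missing-track check moves into an `IN` subquery, so the outer query joins only `Albums` and `Artists`.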
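To make the shape of the rewrite concrete, here is a minimal sketch that runs both query shapes against a toy SQLite database and checks that they return the same albums. The reduced schema, the sample rows, and the helper names (`build_toy_db`, `missing_album_ids`) are assumptions made for illustration, not Lidarr code. Two deliberate differences from the PR: SQLite has no whole-row `"TrackFiles" IS NULL` test, so the sketch uses `"TrackFiles"."Id" IS NULL`, and LIMIT/OFFSET are omitted to keep the comparison simple.

```python
import sqlite3

# Toy schema with only the columns both queries touch (an assumption; the real
# tables are much wider). TrackFileId = 0 stands for "no file" in this toy data.
SCHEMA = """
CREATE TABLE Artists (ArtistMetadataId INTEGER, SortName TEXT, Monitored INTEGER);
CREATE TABLE Albums (Id INTEGER PRIMARY KEY, ArtistMetadataId INTEGER,
                     ReleaseDate TEXT, Monitored INTEGER);
CREATE TABLE AlbumReleases (Id INTEGER PRIMARY KEY, AlbumId INTEGER, Monitored INTEGER);
CREATE TABLE Tracks (Id INTEGER PRIMARY KEY, AlbumReleaseId INTEGER, TrackFileId INTEGER);
CREATE TABLE TrackFiles (Id INTEGER PRIMARY KEY);

INSERT INTO Artists VALUES (1, 'a', 1), (2, 'b', 0);
INSERT INTO Albums VALUES
  (1, 1, '2024-01-01', 1),  -- one track without a file: missing
  (2, 1, '2024-06-01', 1),  -- every track has a file
  (3, 2, '2024-03-01', 1),  -- artist is not monitored
  (4, 1, '2025-06-01', 1),  -- released after the cutoff date
  (5, 1, '2023-01-01', 1);  -- two tracks without files: missing, listed once
INSERT INTO AlbumReleases VALUES (10, 1, 1), (20, 2, 1), (30, 3, 1), (40, 4, 1), (50, 5, 1);
INSERT INTO TrackFiles VALUES (900), (901);
INSERT INTO Tracks VALUES
  (100, 10, 0), (101, 10, 900), (200, 20, 901),
  (300, 30, 0), (400, 40, 0), (500, 50, 0), (501, 50, 0);
"""

# Original shape: join everything down to track level, then group back to albums.
ORIGINAL = """
SELECT "Albums"."Id" FROM "Albums"
JOIN "Artists" ON "Albums"."ArtistMetadataId" = "Artists"."ArtistMetadataId"
JOIN "AlbumReleases" ON "Albums"."Id" = "AlbumReleases"."AlbumId"
JOIN "Tracks" ON "AlbumReleases"."Id" = "Tracks"."AlbumReleaseId"
LEFT JOIN "TrackFiles" ON "Tracks"."TrackFileId" = "TrackFiles"."Id"
WHERE "TrackFiles"."Id" IS NULL AND "AlbumReleases"."Monitored" = 1
  AND "Albums"."ReleaseDate" <= '2025-04-01'
  AND "Albums"."Monitored" = 1 AND "Artists"."Monitored" = 1
GROUP BY "Albums"."Id", "Artists"."SortName"
ORDER BY "Albums"."ReleaseDate" DESC, "Albums"."Id"
"""

# Rewritten shape: the track-level work happens inside the IN subquery,
# so the outer query only ever sees one row per album.
REWRITTEN = """
SELECT "Albums"."Id" FROM "Albums"
JOIN "Artists" ON "Albums"."ArtistMetadataId" = "Artists"."ArtistMetadataId"
WHERE "Albums"."Monitored" = 1 AND "Albums"."ReleaseDate" <= '2025-04-01'
  AND "Artists"."Monitored" = 1
  AND "Albums"."Id" IN (
    SELECT "AlbumReleases"."AlbumId" FROM "AlbumReleases"
    JOIN "Tracks" ON "AlbumReleases"."Id" = "Tracks"."AlbumReleaseId"
    LEFT JOIN "TrackFiles" ON "Tracks"."TrackFileId" = "TrackFiles"."Id"
    WHERE "TrackFiles"."Id" IS NULL AND "AlbumReleases"."Monitored" = 1
    GROUP BY "AlbumReleases"."AlbumId")
GROUP BY "Albums"."Id", "Artists"."SortName"
ORDER BY "Albums"."ReleaseDate" DESC, "Albums"."Id"
"""


def build_toy_db():
    """Create an in-memory database holding the toy rows above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def missing_album_ids(conn, sql):
    """Run one of the two queries and return the album ids in result order."""
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_toy_db()
    print(missing_album_ids(conn, ORIGINAL))   # [1, 5]
    print(missing_album_ids(conn, REWRITTEN))  # [1, 5]
```

Agreement on a handful of rows does not prove the two queries are equivalent on real data; it only illustrates where the grouping happens in each shape.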

EXPLAIN ANALYZE for the original query

lidarr_main=> explain analyze SELECT "Albums".* FROM "Albums" JOIN "Artists" ON ("Albums"."ArtistMetadataId" = "Artists"."ArtistMetadataId") JOIN "AlbumReleases" ON ("Albums"."Id" = "AlbumReleases"."AlbumId") JOIN "Tracks" ON ("AlbumReleases"."Id" = "Tracks"."AlbumReleaseId") LEFT JOIN "TrackFiles" ON ("Tracks"."TrackFileId" = "TrackFiles"."Id") WHERE ("TrackFiles"."Id" IS NULL) AND ("AlbumReleases"."Monitored" = 't') AND ("Albums"."ReleaseDate" <= '2025-04-01') AND (("Albums"."Monitored" = 't') AND ("Artists"."Monitored" = 't')) GROUP BY "Albums"."Id" , "Artists"."SortName" ORDER BY "Albums"."ReleaseDate" DESC LIMIT 20 OFFSET 20;

                                                                                                              QUERY PLAN                                                                                                              
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=89169.19..89169.19 rows=1 width=571) (actual time=23750.310..23752.152 rows=20 loops=1)
   ->  Sort  (cost=89169.18..89169.19 rows=1 width=571) (actual time=23750.307..23752.148 rows=40 loops=1)
         Sort Key: "Albums"."ReleaseDate" DESC
         Sort Method: top-N heapsort  Memory: 81kB
         ->  Group  (cost=89169.16..89169.17 rows=1 width=571) (actual time=23395.390..23692.280 rows=149339 loops=1)
               Group Key: "Albums"."Id", "Artists"."SortName"
               ->  Sort  (cost=89169.16..89169.17 rows=1 width=571) (actual time=23395.381..23527.284 rows=2359531 loops=1)
                     Sort Key: "Albums"."Id", "Artists"."SortName"
                     Sort Method: quicksort  Memory: 1493490kB
                     ->  Nested Loop  (cost=9761.97..89169.15 rows=1 width=571) (actual time=120.457..21173.061 rows=2359531 loops=1)
                           ->  Nested Loop  (cost=9761.68..89168.84 rows=1 width=560) (actual time=120.439..17482.768 rows=2558513 loops=1)
                                 ->  Nested Loop  (cost=9761.26..89168.04 rows=1 width=4) (actual time=113.705..11849.661 rows=2945179 loops=1)
                                       ->  Gather  (cost=9760.84..89167.59 rows=1 width=4) (actual time=111.690..1867.697 rows=5526730 loops=1)
                                             Workers Planned: 5
                                             Workers Launched: 0
                                             ->  Parallel Hash Anti Join  (cost=8760.84..88167.49 rows=1 width=4) (actual time=111.120..1544.392 rows=5526730 loops=1)
                                                   Hash Cond: ("Tracks"."TrackFileId" = "TrackFiles"."Id")
                                                   ->  Parallel Index Only Scan using idx_tracks_albumreleaseid_trackfileid on "Tracks"  (cost=0.43..75618.48 rows=1010293 width=8) (actual time=0.016..597.486 rows=5826988 loops=1)
                                                         Heap Fetches: 58839
                                                   ->  Parallel Hash  (cost=7533.59..7533.59 rows=98145 width=4) (actual time=107.524..107.524 rows=300865 loops=1)
                                                         Buckets: 524288  Batches: 1  Memory Usage: 15872kB
                                                         ->  Parallel Index Only Scan using idx_trackfiles_id on "TrackFiles"  (cost=0.42..7533.59 rows=98145 width=4) (actual time=0.033..37.543 rows=300865 loops=1)
                                                               Heap Fetches: 16938
                                       ->  Index Scan using "PK_AlbumReleases" on "AlbumReleases"  (cost=0.42..0.45 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=5526730)
                                             Index Cond: ("Id" = "Tracks"."AlbumReleaseId")
                                             Filter: "Monitored"
                                             Rows Removed by Filter: 0
                                 ->  Index Scan using "PK_Albums" on "Albums"  (cost=0.42..0.80 rows=1 width=560) (actual time=0.002..0.002 rows=1 loops=2945179)
                                       Index Cond: ("Id" = "AlbumReleases"."AlbumId")
                                       Filter: ("Monitored" AND ("ReleaseDate" <= '2025-04-01 00:00:00+00'::timestamp with time zone))
                                       Rows Removed by Filter: 0
                           ->  Index Scan using idx_artists_artistmetadataid on "Artists"  (cost=0.29..0.32 rows=1 width=15) (actual time=0.001..0.001 rows=1 loops=2558513)
                                 Index Cond: ("ArtistMetadataId" = "Albums"."ArtistMetadataId")
                                 Filter: "Monitored"
                                 Rows Removed by Filter: 0
 Planning Time: 3.742 ms
 Execution Time: 23850.137 ms
(37 rows)

EXPLAIN ANALYZE for the new query

lidarr_main=> explain analyze SELECT "Albums".* FROM "Albums" JOIN "Artists" ON ("Albums"."ArtistMetadataId" = "Artists"."ArtistMetadataId") WHERE (("Albums"."Monitored" = true) AND ("Albums"."ReleaseDate" <= '2025-04-01')) AND ("Artists"."Monitored" = true) AND "Albums"."Id" IN (SELECT "AlbumReleases"."AlbumId" FROM "AlbumReleases" JOIN "Tracks" ON ("AlbumReleases"."Id" = "Tracks"."AlbumReleaseId") LEFT JOIN "TrackFiles" ON ("Tracks"."TrackFileId" = "TrackFiles"."Id") WHERE "TrackFiles" IS NULL AND "AlbumReleases"."Monitored" = true GROUP BY "AlbumReleases"."AlbumId") AND (("Albums"."Monitored" = true) AND ("Artists"."Monitored" = true)) GROUP BY "Albums"."Id" , "Artists"."SortName" ORDER BY "Albums"."ReleaseDate" DESC LIMIT 20 OFFSET 100 ;
                                                                                                                      QUERY PLAN                                                                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=125886.44..125886.49 rows=20 width=571) (actual time=2165.526..2176.292 rows=20 loops=1)
   ->  Sort  (cost=125886.19..125911.44 rows=10103 width=571) (actual time=2147.578..2158.345 rows=120 loops=1)
         Sort Key: "Albums"."ReleaseDate" DESC
         Sort Method: top-N heapsort  Memory: 176kB
         ->  Group  (cost=125411.00..125486.77 rows=10103 width=571) (actual time=2062.566..2115.053 rows=149337 loops=1)
               Group Key: "Albums"."Id", "Artists"."SortName"
               ->  Sort  (cost=125411.00..125436.25 rows=10103 width=571) (actual time=2062.539..2085.123 rows=149337 loops=1)
                     Sort Key: "Albums"."Id", "Artists"."SortName"
                     Sort Method: quicksort  Memory: 89482kB
                     ->  Hash Join  (cost=105232.94..124739.02 rows=10103 width=571) (actual time=1853.816..1996.874 rows=149337 loops=1)
                           Hash Cond: ("Albums"."ArtistMetadataId" = "Artists"."ArtistMetadataId")
                           ->  Hash Join  (cost=104202.74..123682.23 rows=10132 width=560) (actual time=1847.397..1970.359 rows=154380 loops=1)
                                 Hash Cond: ("Albums"."Id" = "AlbumReleases"."AlbumId")
                                 ->  Seq Scan on "Albums"  (cost=0.00..19065.85 rows=157571 width=560) (actual time=0.013..72.567 rows=175607 loops=1)
                                       Filter: ("Monitored" AND "Monitored" AND ("ReleaseDate" <= '2025-04-01 00:00:00+00'::timestamp with time zone))
                                       Rows Removed by Filter: 24177
                                 ->  Hash  (cost=104041.49..104041.49 rows=12900 width=4) (actual time=1847.314..1858.076 rows=174924 loops=1)
                                       Buckets: 262144 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 8198kB
                                       ->  Group  (cost=102449.31..104041.49 rows=12900 width=4) (actual time=1589.835..1843.704 rows=174924 loops=1)
                                             Group Key: "AlbumReleases"."AlbumId"
                                             ->  Gather Merge  (cost=102449.31..104009.24 rows=12900 width=4) (actual time=1589.810..1770.925 rows=2945157 loops=1)
                                                   Workers Planned: 5
                                                   Workers Launched: 5
                                                   ->  Sort  (cost=101449.23..101455.68 rows=2580 width=4) (actual time=1569.655..1582.641 rows=490860 loops=6)
                                                         Sort Key: "AlbumReleases"."AlbumId"
                                                         Sort Method: quicksort  Memory: 12289kB
                                                         Worker 0:  Sort Method: quicksort  Memory: 24577kB
                                                         Worker 1:  Sort Method: quicksort  Memory: 12289kB
                                                         Worker 2:  Sort Method: quicksort  Memory: 12289kB
                                                         Worker 3:  Sort Method: quicksort  Memory: 12289kB
                                                         Worker 4:  Sort Method: quicksort  Memory: 24577kB
                                                         ->  Nested Loop  (cost=20746.34..101303.04 rows=2580 width=4) (actual time=83.970..1538.354 rows=490860 loops=6)
                                                               ->  Parallel Hash Left Join  (cost=20745.92..99018.92 rows=5051 width=4) (actual time=82.755..240.181 rows=921118 loops=6)
                                                                     Hash Cond: ("Tracks"."TrackFileId" = "TrackFiles"."Id")
                                                                     Filter: ("TrackFiles".* IS NULL)
                                                                     Rows Removed by Filter: 50047
                                                                     ->  Parallel Index Only Scan using idx_tracks_albumreleaseid_trackfileid on "Tracks"  (cost=0.43..75621.36 rows=1010298 width=8) (actual time=0.085..62.140 rows=971165 loops=6)
                                                                           Heap Fetches: 58975
                                                                     ->  Parallel Hash  (cost=19518.55..19518.55 rows=98155 width=543) (actual time=82.169..82.170 rows=50148 loops=6)
                                                                           Buckets: 524288  Batches: 1  Memory Usage: 158944kB
                                                                           ->  Parallel Seq Scan on "TrackFiles"  (cost=0.00..19518.55 rows=98155 width=543) (actual time=5.097..28.659 rows=50148 loops=6)
                                                               ->  Index Scan using "PK_AlbumReleases" on "AlbumReleases"  (cost=0.42..0.45 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=5526708)
                                                                     Index Cond: ("Id" = "Tracks"."AlbumReleaseId")
                                                                     Filter: "Monitored"
                                                                     Rows Removed by Filter: 0
                           ->  Hash  (cost=736.01..736.01 rows=23535 width=15) (actual time=6.399..6.399 rows=23573 loops=1)
                                 Buckets: 32768  Batches: 1  Memory Usage: 1392kB
                                 ->  Seq Scan on "Artists"  (cost=0.00..736.01 rows=23535 width=15) (actual time=0.017..4.288 rows=23573 loops=1)
                                       Filter: ("Monitored" AND "Monitored")
                                       Rows Removed by Filter: 33
 Planning Time: 1.050 ms
 JIT:
   Functions: 122
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 3.874 ms, Inlining 0.000 ms, Optimization 2.900 ms, Emission 45.741 ms, Total 52.515 ms
 Execution Time: 2194.403 ms
(56 rows)
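For a rough sense of scale, the headline figures from the two plans can be compared directly. The sketch below only divides numbers copied from the EXPLAIN ANALYZE outputs above; the dictionary keys and the `ratios` helper are names chosen for illustration. The comparison is indicative rather than exact, since the two runs used different OFFSET values (20 and 100) and the original plan launched 0 of its 5 planned workers while the new plan launched all 5.

```python
# Figures copied from the two EXPLAIN ANALYZE outputs above.
ORIGINAL_PLAN = {
    "execution_ms": 23850.137,   # Execution Time
    "rows_sorted": 2359531,      # rows entering the Sort below the Group node
    "sort_memory_kb": 1493490,   # quicksort memory for that Sort
}
NEW_PLAN = {
    "execution_ms": 2194.403,
    "rows_sorted": 149337,
    "sort_memory_kb": 89482,
}


def ratios(before, after):
    """Return before/after for every metric (larger means a bigger reduction)."""
    return {key: before[key] / after[key] for key in before}


if __name__ == "__main__":
    for metric, factor in ratios(ORIGINAL_PLAN, NEW_PLAN).items():
        print(f"{metric}: {factor:.1f}x lower in the new plan")
```

One figure moves the other way: the hash built on `TrackFiles` grows from 15872kB to 158944kB, and the plan switches from an index-only scan (width=4) to a parallel sequential scan (width=543) with `Filter: ("TrackFiles".* IS NULL)`. This suggests that the whole-row `"TrackFiles" IS NULL` test pulls full rows into the hash; testing `"TrackFiles"."Id" IS NULL`, as the original query does, might avoid that, though it was not measured here.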


JackDanger · 2025-04-06T19:55:13Z

src/NzbDrone.Core/Music/Repositories/AlbumRepository.cs

+                .Join<Album, Artist>((l, r) => l.ArtistMetadataId == r.ArtistMetadataId)
+                .Where<Album>(a => a.Monitored == true && a.ReleaseDate <= currentTime)
+                .Where<Artist>(a => a.Monitored == true)
+                .Where("\"Albums\".\"Id\" IN (SELECT \"AlbumReleases\".\"AlbumId\" FROM \"AlbumReleases\" JOIN \"Tracks\" ON (\"AlbumReleases\".\"Id\" = \"Tracks\".\"AlbumReleaseId\") LEFT JOIN \"TrackFiles\" ON (\"Tracks\".\"TrackFileId\" = \"TrackFiles\".\"Id\") WHERE \"TrackFiles\" IS NULL AND \"AlbumReleases\".\"Monitored\" = true GROUP BY \"AlbumReleases\".\"AlbumId\")")


This is gross, but I didn't know if y'all would prefer that I fix the Builder instead.

Sadly, this breaks the existing behavior.

`missingTracksSubquery` seems to be unused, and the change makes it mandatory for an album to be monitored in order to show up in Missing. Currently, unmonitored albums should show up when the user selects the Unmonitored filter.
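The concern above can be sketched as follows: the monitored condition comes from the caller's filter instead of being hard-coded to `true`, while the missing-track subquery stays unchanged. Everything here is a hypothetical simplification on a toy SQLite schema. The `monitored` parameter, the helper names, and the decision to filter on `Albums.Monitored` alone are assumptions, and the real filter in Lidarr may combine album and artist flags differently.

```python
import sqlite3

# Same missing-track check as in the PR, written with "TrackFiles"."Id" IS NULL
# because SQLite has no whole-row IS NULL test.
MISSING_SUBQUERY = """
SELECT "AlbumReleases"."AlbumId" FROM "AlbumReleases"
JOIN "Tracks" ON "AlbumReleases"."Id" = "Tracks"."AlbumReleaseId"
LEFT JOIN "TrackFiles" ON "Tracks"."TrackFileId" = "TrackFiles"."Id"
WHERE "TrackFiles"."Id" IS NULL AND "AlbumReleases"."Monitored" = 1
GROUP BY "AlbumReleases"."AlbumId"
"""


def build_toy_db():
    """Two albums with missing tracks: album 1 is monitored, album 2 is not."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Albums (Id INTEGER PRIMARY KEY, Monitored INTEGER);
        CREATE TABLE AlbumReleases (Id INTEGER PRIMARY KEY, AlbumId INTEGER, Monitored INTEGER);
        CREATE TABLE Tracks (Id INTEGER PRIMARY KEY, AlbumReleaseId INTEGER, TrackFileId INTEGER);
        CREATE TABLE TrackFiles (Id INTEGER PRIMARY KEY);
        INSERT INTO Albums VALUES (1, 1), (2, 0);
        INSERT INTO AlbumReleases VALUES (10, 1, 1), (20, 2, 1);
        INSERT INTO Tracks VALUES (100, 10, 0), (200, 20, 0);
    """)
    return conn


def missing_album_ids(conn, monitored):
    """Return missing albums whose Monitored flag matches the selected filter."""
    sql = (
        'SELECT "Albums"."Id" FROM "Albums" '
        'WHERE "Albums"."Monitored" = ? '
        f'AND "Albums"."Id" IN ({MISSING_SUBQUERY}) '
        'ORDER BY "Albums"."Id"'
    )
    # The filter value is bound as a parameter rather than baked into the SQL.
    return [row[0] for row in conn.execute(sql, (int(monitored),))]


if __name__ == "__main__":
    conn = build_toy_db()
    print(missing_album_ids(conn, monitored=True))   # [1]
    print(missing_album_ids(conn, monitored=False))  # [2]
```

With the condition hard-coded to `true`, the second call could never return album 2, which is the regression described in the comment.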

bakerboy448 marked this pull request as draft August 21, 2025 03:15

mynameisbogdan Apr 8, 2025
