Hi all,
I observed some unexpected behavior when using the vector extension with partitioned tables and HNSW indexing. I wanted to clarify why an index scan is used in one case but not in another, despite similar query patterns.
Case I
script=> \d+ vector_collection2
Partitioned table "public.vector_collection2"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-----------+-----------+-----------+----------+---------+----------+-------------+--------------+-------------
id | integer | | not null | | plain | | |
embedding | vector(5) | | | | external | | |
Partition key: HASH (id)
Indexes:
"vector_collection2_pkey" PRIMARY KEY, btree (id)
"vector_collection2_embedding_idx" hnsw (embedding vector_l2_ops) WITH (m='8', ef_construction='16')
Partitions: vector_collection2_ts00001 FOR VALUES FROM (MINVALUE) TO ('-6917529027641081856'),
vector_collection2_ts00002 FOR VALUES FROM ('-6917529027641081856') TO ('-4611686018427387904'),
vector_collection2_ts00003 FOR VALUES FROM ('-4611686018427387904') TO ('-2305843009213693952'),
vector_collection2_ts00004 FOR VALUES FROM ('-2305843009213693952') TO ('0')
script=> reset enable_seqscan ;
RESET
script=> reset enable_indexscan ;
RESET
script=> SELECT COUNT(*) FROM vector_collection2;
count
---------
4997869
(1 row)
-- Fetching both id and embedding
script=> explain analyze SELECT id, embedding FROM public.vector_collection2 ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Append (cost=262.30..875089.46 rows=4997869 width=37) (actual time=3.000..37.168 rows=160 loops=1)
Sort Key: ((vector_collection2.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
-> Index Scan using vector_collection2_ts00001_embedding_idx on vector_collection2_ts00001 vector_collection2_1 (cost=65.57..199836.08 rows=1248204 width=37) (actual time=0.651..19.815 rows=40 loops=1)
Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
-> Index Scan using vector_collection2_ts00002_embedding_idx on vector_collection2_ts00002 vector_collection2_2 (cost=65.57..200180.46 rows=1250423 width=37) (actual time=0.725..5.305 rows=40 loops=1)
Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
-> Index Scan using vector_collection2_ts00003_embedding_idx on vector_collection2_ts00003 vector_collection2_3 (cost=65.57..200230.06 rows=1250703 width=37) (actual time=0.854..8.316 rows=40 loops=1)
Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
-> Index Scan using vector_collection2_ts00004_embedding_idx on vector_collection2_ts00004 vector_collection2_4 (cost=65.56..199874.78 rows=1248539 width=37) (actual time=0.761..3.416 rows=40 loops=1)
Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
Planning Time: 0.383 ms
Execution Time: 37.503 ms
(12 rows)
-- Fetching only id
script=> explain analyze SELECT id FROM public.vector_collection2 ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=363927.94..849865.94 rows=4164892 width=12) (actual time=6999.244..13603.405 rows=4997869 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=362927.92..368134.04 rows=2082446 width=12) (actual time=6960.759..8275.637 rows=1665956 loops=3)
Sort Key: ((vector_collection2.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
Sort Method: external merge Disk: 42272kB
Worker 0: Sort Method: external merge Disk: 42536kB
Worker 1: Sort Method: external merge Disk: 42544kB
-> Parallel Append (cost=0.00..73193.80 rows=2082446 width=12) (actual time=0.888..3575.830 rows=1665956 loops=3)
-> Parallel Seq Scan on vector_collection2_ts00003 vector_collection2_3 (cost=0.00..15711.08 rows=521126 width=12) (actual time=0.923..1310.306 rows=1250703 loops=1)
-> Parallel Seq Scan on vector_collection2_ts00002 vector_collection2_2 (cost=0.00..15707.62 rows=521010 width=12) (actual time=0.667..1475.575 rows=1250423 loops=1)
-> Parallel Seq Scan on vector_collection2_ts00004 vector_collection2_4 (cost=0.00..15683.81 rows=520225 width=12) (actual time=0.657..470.241 rows=416180 loops=3)
-> Parallel Seq Scan on vector_collection2_ts00001 vector_collection2_1 (cost=0.00..15679.06 rows=520085 width=12) (actual time=0.534..684.244 rows=624102 loops=2)
Planning Time: 0.369 ms
Execution Time: 16376.657 ms
(15 rows)
Case II
vector100k_part=> \d+ vector_items
Partitioned table "public.vector_items"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-----------+-----------+-----------+----------+------------------------------------------+----------+-------------+--------------+-------------
id | integer | | not null | nextval('vector_items_id_seq'::regclass) | plain | | |
embedding | vector(5) | | | | external | | |
Partition key: HASH (id)
Indexes:
"vector_items_pkey" PRIMARY KEY, btree (id)
"vector_items_embedding_idx" hnsw (embedding vector_l2_ops) WITH (m='16', ef_construction='64')
Partitions: vector_items_p0 FOR VALUES WITH (modulus 4, remainder 0),
vector_items_p1 FOR VALUES WITH (modulus 4, remainder 1),
vector_items_p2 FOR VALUES WITH (modulus 4, remainder 2),
vector_items_p3 FOR VALUES WITH (modulus 4, remainder 3)
vector100k_part=> reset enable_seqscan;
RESET
vector100k_part=> reset enable_indexscan ;
RESET
vector100k_part=> SELECT COUNT(*) FROM vector_items;
count
--------
100000
(1 row)
vector100k_part=> explain analyze SELECT id, embedding FROM public.vector_items ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=13528.82..13778.82 rows=100000 width=37) (actual time=73.779..86.356 rows=100000 loops=1)
Sort Key: ((vector_items.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
Sort Method: external merge Disk: 4896kB
-> Append (cost=0.00..2487.00 rows=100000 width=37) (actual time=0.013..26.059 rows=100000 loops=1)
-> Seq Scan on vector_items_p0 vector_items_1 (cost=0.00..499.07 rows=25126 width=37) (actual time=0.012..4.607 rows=25126 loops=1)
-> Seq Scan on vector_items_p1 vector_items_2 (cost=0.00..496.22 rows=24978 width=37) (actual time=0.011..4.497 rows=24978 loops=1)
-> Seq Scan on vector_items_p2 vector_items_3 (cost=0.00..496.14 rows=24971 width=37) (actual time=0.009..4.542 rows=24971 loops=1)
-> Seq Scan on vector_items_p3 vector_items_4 (cost=0.00..495.56 rows=24925 width=37) (actual time=0.013..4.579 rows=24925 loops=1)
Planning Time: 0.257 ms
Execution Time: 96.241 ms
(10 rows)
vector100k_part=> explain analyze SELECT id FROM public.vector_items ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=10791.82..11041.82 rows=100000 width=12) (actual time=68.310..79.159 rows=100000 loops=1)
Sort Key: ((vector_items.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
Sort Method: external merge Disk: 2552kB
-> Append (cost=0.00..2487.00 rows=100000 width=12) (actual time=0.011..29.633 rows=100000 loops=1)
-> Seq Scan on vector_items_p0 vector_items_1 (cost=0.00..499.07 rows=25126 width=12) (actual time=0.010..4.837 rows=25126 loops=1)
-> Seq Scan on vector_items_p1 vector_items_2 (cost=0.00..496.22 rows=24978 width=12) (actual time=0.011..4.769 rows=24978 loops=1)
-> Seq Scan on vector_items_p2 vector_items_3 (cost=0.00..496.14 rows=24971 width=12) (actual time=0.009..4.742 rows=24971 loops=1)
-> Seq Scan on vector_items_p3 vector_items_4 (cost=0.00..495.56 rows=24925 width=12) (actual time=0.016..4.758 rows=24925 loops=1)
Planning Time: 0.140 ms
Execution Time: 89.018 ms
(10 rows)
Questions:
- Why does the first query (on vector_collection2) use index scans despite having no LIMIT and without index scans being explicitly forced? Could this be due to the size of the dataset or the different HNSW parameters (m and ef_construction)?
- In Case I, why does the planner fall back to sequential scans when I remove embedding from the output? What could be the reasons for this?
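One detail that may be relevant: in the Case I index plan each partition returns exactly 40 rows (160 in total, not 4997869), which matches pgvector's default hnsw.ef_search of 40, so the two Case I plans do not return the same result set. Below is a minimal diagnostic sketch I could run to narrow this down. It assumes the same table and query vector as above; the LIMIT value of 10 is an arbitrary choice for illustration, and I have not run these statements yet.

```sql
-- Diagnostic sketch only: SET LOCAL keeps the changes inside this transaction.
BEGIN;

-- Returns NULL instead of raising an error if the pgvector setting is not loaded yet.
SELECT current_setting('hnsw.ef_search', true) AS ef_search;

-- Same id-only query with an explicit LIMIT (10 is an arbitrary value),
-- to check whether the planner then picks the HNSW index scans.
EXPLAIN ANALYZE
SELECT id
FROM public.vector_collection2
ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST
LIMIT 10;

-- Same id-only query without LIMIT, with parallel plans disabled, to see
-- whether the cheaper-looking Gather Merge plan is what displaced the index scans.
SET LOCAL max_parallel_workers_per_gather = 0;
EXPLAIN ANALYZE
SELECT id
FROM public.vector_collection2
ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;

ROLLBACK;
```

If the LIMIT variant uses the index while the unlimited one does not, that would suggest the choice is driven by the planner's cost estimates for a full ordering rather than by the HNSW build parameters.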