Unexpected use of Index Scan in ANN query without LIMIT or enable_indexscan #846

@developerayuva

Hi all,
I observed unexpected planner behavior when using the vector extension with partitioned tables and HNSW indexes. I would like to understand why an index scan is chosen in one case but not in another, even though the queries follow the same pattern.

Case I

script=> \d+ vector_collection2
                                Partitioned table "public.vector_collection2"
  Column   |   Type    | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
-----------+-----------+-----------+----------+---------+----------+-------------+--------------+-------------
 id        | integer   |           | not null |         | plain    |             |              | 
 embedding | vector(5) |           |          |         | external |             |              | 
Partition key: HASH (id)
Indexes:
    "vector_collection2_pkey" PRIMARY KEY, btree (id)
    "vector_collection2_embedding_idx" hnsw (embedding vector_l2_ops) WITH (m='8', ef_construction='16')
Partitions: vector_collection2_ts00001 FOR VALUES FROM (MINVALUE) TO ('-6917529027641081856'),
            vector_collection2_ts00002 FOR VALUES FROM ('-6917529027641081856') TO ('-4611686018427387904'),
            vector_collection2_ts00003 FOR VALUES FROM ('-4611686018427387904') TO ('-2305843009213693952'),
            vector_collection2_ts00004 FOR VALUES FROM ('-2305843009213693952') TO ('0')

script=> reset enable_seqscan ;
RESET
script=> reset enable_indexscan ;
RESET
script=> SELECT COUNT(*) FROM vector_collection2;
  count  
---------
 4997869
(1 row)

-- Fetching both id and embedding
script=> explain analyze SELECT id, embedding FROM public.vector_collection2 ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
                                                                                                  QUERY PLAN                                                                                                   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Append  (cost=262.30..875089.46 rows=4997869 width=37) (actual time=3.000..37.168 rows=160 loops=1)
   Sort Key: ((vector_collection2.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
   ->  Index Scan using vector_collection2_ts00001_embedding_idx on vector_collection2_ts00001 vector_collection2_1  (cost=65.57..199836.08 rows=1248204 width=37) (actual time=0.651..19.815 rows=40 loops=1)
         Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
   ->  Index Scan using vector_collection2_ts00002_embedding_idx on vector_collection2_ts00002 vector_collection2_2  (cost=65.57..200180.46 rows=1250423 width=37) (actual time=0.725..5.305 rows=40 loops=1)
         Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
   ->  Index Scan using vector_collection2_ts00003_embedding_idx on vector_collection2_ts00003 vector_collection2_3  (cost=65.57..200230.06 rows=1250703 width=37) (actual time=0.854..8.316 rows=40 loops=1)
         Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
   ->  Index Scan using vector_collection2_ts00004_embedding_idx on vector_collection2_ts00004 vector_collection2_4  (cost=65.56..199874.78 rows=1248539 width=37) (actual time=0.761..3.416 rows=40 loops=1)
         Order By: (embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector)
 Planning Time: 0.383 ms
 Execution Time: 37.503 ms
(12 rows)

-- Fetching only id
script=> explain analyze SELECT id FROM public.vector_collection2 ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
                                                                                      QUERY PLAN                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=363927.94..849865.94 rows=4164892 width=12) (actual time=6999.244..13603.405 rows=4997869 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=362927.92..368134.04 rows=2082446 width=12) (actual time=6960.759..8275.637 rows=1665956 loops=3)
         Sort Key: ((vector_collection2.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
         Sort Method: external merge  Disk: 42272kB
         Worker 0:  Sort Method: external merge  Disk: 42536kB
         Worker 1:  Sort Method: external merge  Disk: 42544kB
         ->  Parallel Append  (cost=0.00..73193.80 rows=2082446 width=12) (actual time=0.888..3575.830 rows=1665956 loops=3)
               ->  Parallel Seq Scan on vector_collection2_ts00003 vector_collection2_3  (cost=0.00..15711.08 rows=521126 width=12) (actual time=0.923..1310.306 rows=1250703 loops=1)
               ->  Parallel Seq Scan on vector_collection2_ts00002 vector_collection2_2  (cost=0.00..15707.62 rows=521010 width=12) (actual time=0.667..1475.575 rows=1250423 loops=1)
               ->  Parallel Seq Scan on vector_collection2_ts00004 vector_collection2_4  (cost=0.00..15683.81 rows=520225 width=12) (actual time=0.657..470.241 rows=416180 loops=3)
               ->  Parallel Seq Scan on vector_collection2_ts00001 vector_collection2_1  (cost=0.00..15679.06 rows=520085 width=12) (actual time=0.534..684.244 rows=624102 loops=2)
 Planning Time: 0.369 ms
 Execution Time: 16376.657 ms
(15 rows)
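
One detail in the two Case I plans may be worth checking: the index-scan plan returns only 160 rows (40 per partition), while the sequential plan returns all 4997869. 40 is the default value of pgvector's `hnsw.ef_search`, which caps how many rows an HNSW index scan returns. A minimal sketch for confirming this, assuming the same session and otherwise default settings (the value 100 is an arbitrary choice for illustration):

```sql
-- Show the current HNSW candidate list size (pgvector's default is 40).
SHOW hnsw.ef_search;

-- Raise it for this transaction only and re-run the query from Case I.
-- If the cap is what limits the result, each Index Scan node should now
-- report up to 100 rows instead of 40.
BEGIN;
SET LOCAL hnsw.ef_search = 100;
EXPLAIN ANALYZE
SELECT id, embedding
FROM public.vector_collection2
ORDER BY embedding OPERATOR(public.<->)
         '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector;
ROLLBACK;
```

If the per-partition row counts follow the setting, the index plan is not returning the full table, so the two Case I queries are not producing equivalent results.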

Case II

vector100k_part=> \d+ vector_items
                                                    Partitioned table "public.vector_items"
  Column   |   Type    | Collation | Nullable |                 Default                  | Storage  | Compression | Stats target | Description 
-----------+-----------+-----------+----------+------------------------------------------+----------+-------------+--------------+-------------
 id        | integer   |           | not null | nextval('vector_items_id_seq'::regclass) | plain    |             |              | 
 embedding | vector(5) |           |          |                                          | external |             |              | 
Partition key: HASH (id)
Indexes:
    "vector_items_pkey" PRIMARY KEY, btree (id)
    "vector_items_embedding_idx" hnsw (embedding vector_l2_ops) WITH (m='16', ef_construction='64')
Partitions: vector_items_p0 FOR VALUES WITH (modulus 4, remainder 0),
            vector_items_p1 FOR VALUES WITH (modulus 4, remainder 1),
            vector_items_p2 FOR VALUES WITH (modulus 4, remainder 2),
            vector_items_p3 FOR VALUES WITH (modulus 4, remainder 3)

vector100k_part=> reset enable_seqscan;
RESET
vector100k_part=> reset enable_indexscan ;
RESET
vector100k_part=> SELECT COUNT(*) FROM vector_items;
 count  
--------
 100000
(1 row)

vector100k_part=> explain analyze SELECT id, embedding FROM public.vector_items ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=13528.82..13778.82 rows=100000 width=37) (actual time=73.779..86.356 rows=100000 loops=1)
   Sort Key: ((vector_items.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
   Sort Method: external merge  Disk: 4896kB
   ->  Append  (cost=0.00..2487.00 rows=100000 width=37) (actual time=0.013..26.059 rows=100000 loops=1)
         ->  Seq Scan on vector_items_p0 vector_items_1  (cost=0.00..499.07 rows=25126 width=37) (actual time=0.012..4.607 rows=25126 loops=1)
         ->  Seq Scan on vector_items_p1 vector_items_2  (cost=0.00..496.22 rows=24978 width=37) (actual time=0.011..4.497 rows=24978 loops=1)
         ->  Seq Scan on vector_items_p2 vector_items_3  (cost=0.00..496.14 rows=24971 width=37) (actual time=0.009..4.542 rows=24971 loops=1)
         ->  Seq Scan on vector_items_p3 vector_items_4  (cost=0.00..495.56 rows=24925 width=37) (actual time=0.013..4.579 rows=24925 loops=1)
 Planning Time: 0.257 ms
 Execution Time: 96.241 ms
(10 rows)

vector100k_part=> explain analyze SELECT id FROM public.vector_items ORDER BY (embedding OPERATOR(public.<->) '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector) ASC NULLS LAST;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10791.82..11041.82 rows=100000 width=12) (actual time=68.310..79.159 rows=100000 loops=1)
   Sort Key: ((vector_items.embedding <-> '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::vector))
   Sort Method: external merge  Disk: 2552kB
   ->  Append  (cost=0.00..2487.00 rows=100000 width=12) (actual time=0.011..29.633 rows=100000 loops=1)
         ->  Seq Scan on vector_items_p0 vector_items_1  (cost=0.00..499.07 rows=25126 width=12) (actual time=0.010..4.837 rows=25126 loops=1)
         ->  Seq Scan on vector_items_p1 vector_items_2  (cost=0.00..496.22 rows=24978 width=12) (actual time=0.011..4.769 rows=24978 loops=1)
         ->  Seq Scan on vector_items_p2 vector_items_3  (cost=0.00..496.14 rows=24971 width=12) (actual time=0.009..4.742 rows=24971 loops=1)
         ->  Seq Scan on vector_items_p3 vector_items_4  (cost=0.00..495.56 rows=24925 width=12) (actual time=0.016..4.758 rows=24925 loops=1)
 Planning Time: 0.140 ms
 Execution Time: 89.018 ms
(10 rows)

Questions:

  1. Why does the first query (on vector_collection2) use index scans even though there is no LIMIT and no planner setting was changed to force them? Could this be due to the size of the dataset or the different HNSW parameters (m and ef_construction)?
  2. In Case I, why does the plan fall back to sequential scans when embedding is removed from the select list? What could cause this?
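
For reference, the estimated total costs of the two Case I plans are close (875089.46 for the Merge Append over index scans, 849865.94 for the Gather Merge over parallel sequential scans), so the switch may simply be a narrow cost decision: with width=12 instead of width=37 the sort is cheaper. A sketch of how this could be narrowed down, assuming the same session and tables as above; whether the plans actually change is not guaranteed:

```sql
-- 1. Add a LIMIT, the usual form of an ANN query. The planner then only
--    needs to cost fetching the first few tuples from each index.
EXPLAIN ANALYZE
SELECT id
FROM public.vector_collection2
ORDER BY embedding OPERATOR(public.<->)
         '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector
LIMIT 10;

-- 2. Discourage sequential scans for one transaction to see the estimated
--    cost of the index plan the planner rejected for the id-only query.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN
SELECT id
FROM public.vector_collection2
ORDER BY embedding OPERATOR(public.<->)
         '[0.08761761,0.16212644,0.061548516,0.099646576,0.36062342]'::public.vector;
ROLLBACK;

-- 3. The id-only plan used parallel workers; check the setting that allows it.
SHOW max_parallel_workers_per_gather;
```

Running step 2 against vector_items in the Case II database would show the same comparison for the 100000-row table.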
