Vectorizing vector_concat for improved performance #861

binarycleric · 2025-06-17T19:23:16Z

Vectorizing vector_concat for some improved performance. On an ARM chip this should let the compiler generate SIMD instructions that copy the two incoming vectors into the new vector several floats at a time, instead of copying one float per loop iteration.

I'm a little rusty at assembly, so forgive me if I make any mistakes. I used Cursor to help document what each line was doing as a sanity check.

With the change

The interesting bits are ARM's ldp and stp instructions, which load/store pairs of 128-bit SIMD registers (q0 through q3) instead of moving one float at a time.

For the first vector copy (a->x to result->x):

LBB49_6:                                ; =>This Inner Loop Header: Depth=1
    ldp q0, q1, [x11, #-32]            ; Load 8 floats into q0, q1 (16 bytes = 4 floats each)
    ldp q2, q3, [x11], #64             ; Load 8 more floats into q2, q3, then advance source pointer by 64 bytes
    stp q0, q1, [x12, #-32]            ; Store 8 floats from q0, q1
    stp q2, q3, [x12], #64             ; Store 8 more floats from q2, q3, then advance destination pointer by 64 bytes
    subs x13, x13, #16                  ; Decrement remaining-element counter by 16
    b.ne LBB49_6                       ; Loop if not done

For the second vector copy (b->x to result->x + a->dim):

LBB49_18:                               ; =>This Inner Loop Header: Depth=1
    ldp q0, q1, [x11, #-32]            ; Load 8 floats into q0, q1 (16 bytes = 4 floats each)
    ldp q2, q3, [x11], #64             ; Load 8 more floats into q2, q3, then advance source pointer by 64 bytes
    stp q0, q1, [x12, #-32]            ; Store 8 floats from q0, q1
    stp q2, q3, [x12], #64             ; Store 8 more floats from q2, q3, then advance destination pointer by 64 bytes
    subs x13, x13, #16                  ; Decrement remaining-element counter by 16
    b.ne LBB49_18                      ; Loop if not done

Without the change

For the first vector copy (a->x to result->x):

LBB49_4:                                ; =>This Inner Loop Header: Depth=1
    ldr s0, [x9, x8, lsl #2]           ; Load single float
    str s0, [x10, x8, lsl #2]          ; Store single float
    add x8, x8, #1                     ; Increment counter
    ldrsh x11, [x19, #4]               ; Reload first vector dimension (every iteration)
    cmp x8, x11                        ; Compare counter with dimension
    b.lt LBB49_4                       ; Loop if not done

For the second vector copy (b->x to result->x + a->dim):

LBB49_7:                                ; =>This Inner Loop Header: Depth=1
    ldr s0, [x9, x8, lsl #2]           ; Load single float
    ldrsh w11, [x19, #4]               ; Load first vector dimension
    add w11, w8, w11                   ; Calculate offset
    str s0, [x10, w11, sxtw #2]        ; Store single float
    add x8, x8, #1                     ; Increment counter
    ldrsh x11, [x20, #4]               ; Reload second vector dimension (every iteration)
    cmp x8, x11                        ; Compare counter with dimension
    b.lt LBB49_7                       ; Loop if not done
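For readers who want the C-level picture: the scalar listings above reload the dimension with ldrsh on every iteration. A common way to make such a loop easier for a compiler to vectorize is to read the bounds into locals before the loop. The sketch below illustrates that general pattern only; it is not the code from this PR. `DemoVector`, `demo_vector_new`, `concat_reload`, and `concat_hoisted` are hypothetical names, and the struct layout is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vector type: header, 16-bit dimension,
 * then the float payload. Assumed layout, for illustration only. */
typedef struct
{
    int32_t header;   /* placeholder for a length header */
    int16_t dim;      /* number of floats in x */
    int16_t unused;
    float   x[];      /* flexible array member */
} DemoVector;

/* Allocate a zero-filled vector with room for dim floats. */
DemoVector *demo_vector_new(int dim)
{
    DemoVector *v = calloc(1, sizeof(DemoVector) + (size_t) dim * sizeof(float));

    assert(v != NULL);
    v->dim = (int16_t) dim;
    return v;
}

/* Loop bounds are read through the pointers. If the compiler cannot prove
 * that stores to out->x leave a->dim and b->dim unchanged, it may reload
 * them on each iteration, which tends to get in the way of vectorization. */
void concat_reload(const DemoVector *a, const DemoVector *b, DemoVector *out)
{
    for (int i = 0; i < a->dim; i++)
        out->x[i] = a->x[i];

    for (int i = 0; i < b->dim; i++)
        out->x[i + a->dim] = b->x[i];
}

/* Same result, but the bounds and destination pointers are copied into
 * locals first, so each loop has a fixed trip count and a simple
 * pointer-plus-index access pattern. */
void concat_hoisted(const DemoVector *a, const DemoVector *b, DemoVector *out)
{
    int    dim_a = a->dim;
    int    dim_b = b->dim;
    float *dst_a = out->x;
    float *dst_b = out->x + dim_a;

    for (int i = 0; i < dim_a; i++)
        dst_a[i] = a->x[i];

    for (int i = 0; i < dim_b; i++)
        dst_b[i] = b->x[i];
}
```

Both functions produce the same output; whether the second one actually compiles to ldp/stp pairs depends on the compiler, flags, and target, so checking the generated assembly (as done above) is still the way to confirm.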


jkatz · 2025-06-17T19:35:51Z

@binarycleric Do you have any before/after benchmarks on this? There are some examples of benchmarks to run and what to test here.

binarycleric · 2025-06-17T19:49:12Z

Thanks @jkatz. Working on that now.

binarycleric · 2025-06-17T20:58:08Z

The following was generated using https://github.com/binarycleric/pgvectorbench/blob/main/benchmarks/checks/vector_concat.sql and tested against Postgres 16 on macOS with an M4 Pro chip. I'm hitting EoD right now, but tomorrow I'm going to run these tests on r7i/r7g instances and post the results.

With Change

Source from the vectorize-vector-concat branch.

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |             9.775 |           10.967 |               13.113 |               14.067 |           663.996 |              6.610
 vector_concat | medium_vectors     |            12.875 |           14.067 |               16.928 |               17.166 |           116.110 |              1.647
 vector_concat | large_vectors      |           191.927 |          204.086 |              224.829 |              254.154 |           506.878 |             12.241
 vector_concat | very_large_vectors |           623.941 |          651.121 |              739.813 |              777.006 |          1362.085 |             38.261

Without Change

Latest source from the master branch.

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            10.967 |           11.921 |               14.067 |               15.974 |           658.989 |              6.564
 vector_concat | medium_vectors     |            18.835 |           20.027 |               22.888 |               25.034 |           118.017 |              1.717
 vector_concat | large_vectors      |           238.895 |          256.777 |              273.943 |              280.857 |           566.006 |             10.689
 vector_concat | very_large_vectors |           768.185 |          813.007 |              890.970 |              950.098 |          1585.007 |             42.797

binarycleric · 2025-06-18T15:53:36Z

Here are some more benchmarks, this time on EC2.

x86_64

Instance Type: r7i.2xlarge
Version: PostgreSQL 16.9

master branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            24.796 |           27.180 |               31.948 |               36.955 |           611.067 |              6.256
 vector_concat | medium_vectors     |            31.948 |           35.048 |               41.962 |               46.015 |           216.007 |              3.704
 vector_concat | large_vectors      |           617.981 |          658.989 |              724.077 |              749.111 |          1165.867 |             30.209
 vector_concat | very_large_vectors |          1962.900 |         2108.097 |             2403.021 |             2696.991 |          3768.921 |            154.491

vectorize-vector-concat branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            23.842 |           26.941 |               30.041 |               37.909 |           607.967 |              6.420
 vector_concat | medium_vectors     |            20.027 |           23.127 |               26.941 |               34.094 |           205.040 |              3.094
 vector_concat | large_vectors      |           521.898 |          581.026 |              638.008 |              656.843 |          1040.936 |             30.592
 vector_concat | very_large_vectors |          1616.001 |         1713.991 |             1950.979 |             2032.995 |          3578.186 |             93.572

ARM

Instance Type: r7g.2xlarge
Version: PostgreSQL 16.9

master branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            32.902 |           34.094 |               35.048 |               41.008 |           663.042 |              6.480
 vector_concat | medium_vectors     |            51.975 |           52.929 |               61.035 |               64.135 |           243.902 |              3.406
 vector_concat | large_vectors      |           901.937 |          937.939 |              962.973 |              970.125 |          1309.872 |             14.930
 vector_concat | very_large_vectors |          2517.939 |         2556.086 |             2580.881 |             2666.950 |          3643.990 |             24.754

vectorize-vector-concat branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            29.802 |           30.994 |               32.902 |               38.862 |           689.030 |              6.794
 vector_concat | medium_vectors     |            36.001 |           37.909 |               45.061 |               49.829 |           238.180 |              3.398
 vector_concat | large_vectors      |           780.106 |          813.007 |              838.995 |              845.909 |          1189.947 |             14.815
 vector_concat | very_large_vectors |          2048.016 |         2074.003 |             2094.030 |             2108.812 |          3163.099 |             16.784
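To make the EC2 tables easier to compare, here is a small sketch that turns the before/after median times into a percentage reduction. The median values are copied from the tables above; `MedianPair`, `percent_reduction`, `print_reductions`, and `EC2_MEDIANS` are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* One before/after pair of median times, in microseconds. */
typedef struct
{
    const char *platform;
    const char *test_name;
    double      before_us;   /* median without the change */
    double      after_us;    /* median on the vectorize-vector-concat branch */
} MedianPair;

/* Median times copied from the EC2 benchmark tables in this thread. */
const MedianPair EC2_MEDIANS[] = {
    {"x86_64 (r7i.2xlarge)", "small_vectors",        27.180,   26.941},
    {"x86_64 (r7i.2xlarge)", "medium_vectors",       35.048,   23.127},
    {"x86_64 (r7i.2xlarge)", "large_vectors",       658.989,  581.026},
    {"x86_64 (r7i.2xlarge)", "very_large_vectors", 2108.097, 1713.991},
    {"ARM (r7g.2xlarge)",    "small_vectors",        34.094,   30.994},
    {"ARM (r7g.2xlarge)",    "medium_vectors",       52.929,   37.909},
    {"ARM (r7g.2xlarge)",    "large_vectors",       937.939,  813.007},
    {"ARM (r7g.2xlarge)",    "very_large_vectors", 2556.086, 2074.003},
};

const size_t EC2_MEDIANS_LEN = sizeof(EC2_MEDIANS) / sizeof(EC2_MEDIANS[0]);

/* Relative reduction in time, as a percentage of the "before" value. */
double percent_reduction(double before_us, double after_us)
{
    return (before_us - after_us) / before_us * 100.0;
}

/* Print one line per row with the computed reduction. */
void print_reductions(const MedianPair *rows, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("%-22s %-20s %9.3f -> %9.3f us  (%.1f%% lower)\n",
               rows[i].platform, rows[i].test_name,
               rows[i].before_us, rows[i].after_us,
               percent_reduction(rows[i].before_us, rows[i].after_us));
}
```

Calling `print_reductions(EC2_MEDIANS, EC2_MEDIANS_LEN)` from a `main` prints one line per test. For example, the x86_64 medium_vectors median drops from 35.048 us to 23.127 us, a reduction of about 34%, while small_vectors barely moves on either platform.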

jkatz · 2025-06-18T16:01:23Z

@binarycleric Can you please use some of the tests mentioned in that blog post (e.g. ANN Benchmark, VectorDBBench)? It's important to capture the before/after recall measurement, to see if any of the changes impact overall result quality.

binarycleric · 2025-06-18T16:23:40Z

@jkatz Sure thing. Sorry about that.

ankane · 2025-06-18T17:26:36Z

This function isn't used for nearest neighbor search, so additional benchmarks shouldn't be needed. Will take a look at this after #860.

ankane · 2025-06-18T19:43:46Z

Looks good. Let's remove const here as well.

binarycleric · 2025-06-19T01:48:37Z

Sounds good. I'll update that in the morning.

ankane · 2025-06-19T03:06:41Z

Thanks

* Vectorizing vector_concat for improved performance

  On an ARM chip this should generate SIMD instructions to copy the two incoming vectors to the new vector as opposed to doing it all in software.

* Moving declarations to above CheckDim
* Removing const from dims
* Formatting

Vectorizing vector_concat for improved performance

cc5a34a

On an ARM chip this should generate SIMD instructions to copy the two incoming vectors to the new vector as opposed to doing it all in software.

Moving declarations to above CheckDim

58c8017

binarycleric added 2 commits June 18, 2025 22:43

Removing const from dims

c9d4569

Formatting

e4bfe32

ankane merged commit 3a49d14 into pgvector:master Jun 19, 2025
10 of 11 checks passed

binarycleric deleted the vectorize-vector-concat branch June 19, 2025 06:03
