@binarycleric
Contributor

Vectorizing vector_concat for improved performance. On an ARM chip this should let the compiler generate SIMD instructions that copy the two incoming vectors into the new vector several floats at a time, rather than copying one float per loop iteration.

I'm a little rusty with assembly, so forgive me if I make any mistakes. I used Cursor to help document what each line does as a sanity check.

With the change

The interesting bits are ARM's ldp and stp instructions, which load and store pairs of 128-bit SIMD registers (q0 to q3 below) instead of moving one float at a time.

For the first vector copy (a->x to result->x):

LBB49_6:                                ; =>This Inner Loop Header: Depth=1
    ldp q0, q1, [x11, #-32]            ; Load 8 floats (32 bytes) from x11-32 into q0, q1 (16 bytes each)
    ldp q2, q3, [x11], #64             ; Load 8 more floats from x11 into q2, q3, then x11 += 64
    stp q0, q1, [x12, #-32]            ; Store 8 floats from q0, q1 to x12-32
    stp q2, q3, [x12], #64             ; Store 8 more floats from q2, q3 to x12, then x12 += 64
    subs x13, x13, #16                  ; Decrement remaining element count by 16
    b.ne LBB49_6                       ; Loop if not done

For the second vector copy (b->x to result->x + a->dim):

LBB49_18:                               ; =>This Inner Loop Header: Depth=1
    ldp q0, q1, [x11, #-32]            ; Load 8 floats (32 bytes) from x11-32 into q0, q1
    ldp q2, q3, [x11], #64             ; Load 8 more floats from x11 into q2, q3, then x11 += 64
    stp q0, q1, [x12, #-32]            ; Store 8 floats from q0, q1 to x12-32
    stp q2, q3, [x12], #64             ; Store 8 more floats from q2, q3 to x12, then x12 += 64
    subs x13, x13, #16                  ; Decrement remaining element count by 16
    b.ne LBB49_18                      ; Loop if not done

Without the change

For the first vector copy (a->x to result->x):

LBB49_4:                                ; =>This Inner Loop Header: Depth=1
    ldr s0, [x9, x8, lsl #2]           ; Load single float
    str s0, [x10, x8, lsl #2]          ; Store single float
    add x8, x8, #1                     ; Increment counter
    ldrsh x11, [x19, #4]               ; Load dimension
    cmp x8, x11                        ; Compare counter with dimension
    b.lt LBB49_4                       ; Loop if not done

For the second vector copy (b->x to result->x + a->dim):

LBB49_7:                                ; =>This Inner Loop Header: Depth=1
    ldr s0, [x9, x8, lsl #2]           ; Load single float
    ldrsh w11, [x19, #4]               ; Load first vector dimension
    add w11, w8, w11                   ; Calculate offset
    str s0, [x10, w11, sxtw #2]        ; Store single float
    add x8, x8, #1                     ; Increment counter
    ldrsh x11, [x20, #4]               ; Load dimension
    cmp x8, x11                        ; Compare counter with dimension
    b.lt LBB49_7                       ; Loop if not done
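
As a rough illustration of the C-level difference behind the two listings above, the sketch below contrasts a loop that re-reads the dimension on every iteration with one that copies the dimensions into locals first. DemoVector, demo_vector_new, demo_concat_reload, and demo_concat_hoisted are hypothetical names made up for this example, not pgvector's actual code. Whether a compiler emits ldp/stp pairs for the second form depends on the compiler, target, and flags, so treat this as a sketch under those assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a vector type. The 16-bit dim field is an
 * assumption inferred from the ldrsh loads in the listing above.
 */
typedef struct DemoVector
{
	int16_t		dim;
	float		x[];
} DemoVector;

/* Allocate a vector with room for dim floats (contents uninitialized). */
static DemoVector *
demo_vector_new(int dim)
{
	DemoVector *v;

	assert(dim >= 0 && dim <= INT16_MAX);
	v = malloc(sizeof(DemoVector) + (size_t) dim * sizeof(float));
	assert(v != NULL);
	v->dim = (int16_t) dim;
	return v;
}

/*
 * Loop bounds and offsets read a->dim and b->dim through the pointers.
 * If the compiler cannot prove the float stores leave those fields
 * untouched, it may reload them each iteration and keep the loop scalar.
 */
static DemoVector *
demo_concat_reload(const DemoVector *a, const DemoVector *b)
{
	DemoVector *result = demo_vector_new(a->dim + b->dim);

	for (int i = 0; i < a->dim; i++)
		result->x[i] = a->x[i];
	for (int i = 0; i < b->dim; i++)
		result->x[i + a->dim] = b->x[i];
	return result;
}

/*
 * Dimensions are copied into locals before the loops, so the trip counts
 * are loop-invariant and the copies are easier to auto-vectorize.
 */
static DemoVector *
demo_concat_hoisted(const DemoVector *a, const DemoVector *b)
{
	int			adim = a->dim;
	int			bdim = b->dim;
	DemoVector *result = demo_vector_new(adim + bdim);
	float	   *dst = result->x;

	for (int i = 0; i < adim; i++)
		dst[i] = a->x[i];
	for (int i = 0; i < bdim; i++)
		dst[adim + i] = b->x[i];
	return result;
}
```

Both functions produce the same result; only the shape of the loops differs. Comparing the output of `cc -O2 -S` for the two is one way to check whether a particular toolchain vectorizes the hoisted version.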

@jkatz
Contributor

jkatz commented Jun 17, 2025

@binarycleric Do you have any before/after benchmarks on this? There are some examples of benchmarks to run and what to test here.

@binarycleric
Contributor Author

Thanks @jkatz. Working on that now.

@binarycleric
Contributor Author

The following was generated using https://github.com/binarycleric/pgvectorbench/blob/main/benchmarks/checks/vector_concat.sql and tested against Postgres 16 on macOS with an M4 Pro chip. I'm at end of day now, but tomorrow I'll run these tests on r7i/r7g instances and post the results.

With Change

Source from the vectorize-vector-concat branch.

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |             9.775 |           10.967 |               13.113 |               14.067 |           663.996 |              6.610
 vector_concat | medium_vectors     |            12.875 |           14.067 |               16.928 |               17.166 |           116.110 |              1.647
 vector_concat | large_vectors      |           191.927 |          204.086 |              224.829 |              254.154 |           506.878 |             12.241
 vector_concat | very_large_vectors |           623.941 |          651.121 |              739.813 |              777.006 |          1362.085 |             38.261

Without Change

Latest source from the master branch.

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            10.967 |           11.921 |               14.067 |               15.974 |           658.989 |              6.564
 vector_concat | medium_vectors     |            18.835 |           20.027 |               22.888 |               25.034 |           118.017 |              1.717
 vector_concat | large_vectors      |           238.895 |          256.777 |              273.943 |              280.857 |           566.006 |             10.689
 vector_concat | very_large_vectors |           768.185 |          813.007 |              890.970 |              950.098 |          1585.007 |             42.797

@binarycleric
Contributor Author

Here are some more benchmarks, this time on EC2.

x86_64

Instance Type: r7i.2xlarge
Version: PostgreSQL 16.9

master branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            24.796 |           27.180 |               31.948 |               36.955 |           611.067 |              6.256
 vector_concat | medium_vectors     |            31.948 |           35.048 |               41.962 |               46.015 |           216.007 |              3.704
 vector_concat | large_vectors      |           617.981 |          658.989 |              724.077 |              749.111 |          1165.867 |             30.209
 vector_concat | very_large_vectors |          1962.900 |         2108.097 |             2403.021 |             2696.991 |          3768.921 |            154.491

vectorize-vector-concat branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            23.842 |           26.941 |               30.041 |               37.909 |           607.967 |              6.420
 vector_concat | medium_vectors     |            20.027 |           23.127 |               26.941 |               34.094 |           205.040 |              3.094
 vector_concat | large_vectors      |           521.898 |          581.026 |              638.008 |              656.843 |          1040.936 |             30.592
 vector_concat | very_large_vectors |          1616.001 |         1713.991 |             1950.979 |             2032.995 |          3578.186 |             93.572

ARM

Instance Type: r7g.2xlarge
Version: PostgreSQL 16.9

master branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            32.902 |           34.094 |               35.048 |               41.008 |           663.042 |              6.480
 vector_concat | medium_vectors     |            51.975 |           52.929 |               61.035 |               64.135 |           243.902 |              3.406
 vector_concat | large_vectors      |           901.937 |          937.939 |              962.973 |              970.125 |          1309.872 |             14.930
 vector_concat | very_large_vectors |          2517.939 |         2556.086 |             2580.881 |             2666.950 |          3643.990 |             24.754

vectorize-vector-concat branch

 function_name |     test_name      | Minimum Time (us) | Median Time (us) | 95th Percentile (us) | 99th Percentile (us) | Maximum Time (us) | Standard Deviation
---------------+--------------------+-------------------+------------------+----------------------+----------------------+-------------------+--------------------
 vector_concat | small_vectors      |            29.802 |           30.994 |               32.902 |               38.862 |           689.030 |              6.794
 vector_concat | medium_vectors     |            36.001 |           37.909 |               45.061 |               49.829 |           238.180 |              3.398
 vector_concat | large_vectors      |           780.106 |          813.007 |              838.995 |              845.909 |          1189.947 |             14.815
 vector_concat | very_large_vectors |          2048.016 |         2074.003 |             2094.030 |             2108.812 |          3163.099 |             16.784
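
To make the before/after comparison easier to read, here is a small sketch that turns a pair of median times into a percentage reduction. The MedianPair type, the medium_medians array, and percent_reduction are hypothetical helpers written for this example; the numbers are the medium_vectors medians copied from the tables above.

```c
#include <assert.h>

/* One before/after pair of median times, in microseconds. */
typedef struct MedianPair
{
	const char *platform;
	double		before_us;
	double		after_us;
} MedianPair;

/* medium_vectors medians taken from the benchmark tables in this thread. */
static const MedianPair medium_medians[] = {
	{"M4 Pro (macOS)", 20.027, 14.067},
	{"r7i.2xlarge (x86_64)", 35.048, 23.127},
	{"r7g.2xlarge (ARM)", 52.929, 37.909},
};

/* Percentage by which the median time dropped relative to the baseline. */
static double
percent_reduction(double before_us, double after_us)
{
	assert(before_us > 0.0);
	return (before_us - after_us) / before_us * 100.0;
}
```

Applied to the three pairs above, this works out to a median-time reduction of roughly 28 to 34 percent for medium_vectors. The small_vectors rows change far less on every machine.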

@jkatz
Contributor

jkatz commented Jun 18, 2025

@binarycleric Can you please use some of the tests mentioned in that blog post (e.g. ANN Benchmark, VectorDBBench)? It's important to capture the before/after recall measurement, to see if any of the changes impact overall result quality.

@binarycleric
Contributor Author

@jkatz Sure thing. Sorry about that.

@ankane
Member

ankane commented Jun 18, 2025

This function isn't used for nearest neighbor search, so additional benchmarks shouldn't be needed. Will take a look at this after #860.

@ankane
Member

ankane commented Jun 18, 2025

Looks good. Let's remove const here as well.

@binarycleric
Contributor Author

Sounds good. I'll update that in the morning.

@ankane ankane merged commit 3a49d14 into pgvector:master Jun 19, 2025
10 of 11 checks passed
@ankane
Member

ankane commented Jun 19, 2025

Thanks

@binarycleric binarycleric deleted the vectorize-vector-concat branch June 19, 2025 06:03
klmckeig pushed a commit to klmckeig/pgvector that referenced this pull request Dec 8, 2025
* Vectorizing vector_concat for improved performance

On an ARM chip this should generate SIMD instructions to copy the two
incoming vectors to the new vector as opposed to doing it all in
software.

* Moving declarations to above CheckDim

* Removing const from dims

* Formatting