@MarchLiu MarchLiu commented Jun 4, 2024

Hello,

I previously submitted a not-so-smart issue; thank you everyone for your answers. I have come to understand the index limitations of PostgreSQL, as well as some other useful knowledge, particularly the article "pgvector: Fewer dimensions are better". Based on that article, I decided to tackle the dimensionality issue of ollama embeddings from a different angle.

Since then, I have written several dimensionality reduction algorithms. Weighing effectiveness against implementation complexity, the simplest method, partitioning the vector into fixed ranges and taking the average of each range, is a good choice.

This function is not complicated; on the contrary, among the algorithms I have tried so far, it is the simplest:

static inline float sum_float(const float *matrix, int start, int stop) {
    float result = 0.0f;

    /* sum the half-open range [start, stop) */
    for (int i = start; i < stop; i++) {
        result += matrix[i];
    }
    return result;
}

/*
 * reduce vector dims by averaging fixed ranges
 */
PGDLLEXPORT PG_FUNCTION_INFO_V1(vector_norm_reduce);

Datum
vector_norm_reduce(PG_FUNCTION_ARGS) {
    Vector *vec = PG_GETARG_VECTOR_P(0);
    int32 reduce_to = PG_GETARG_INT32(1);
    Vector *result;
    int dim = vec->dim;

    if (reduce_to < 2)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_EXCEPTION),
                        errmsg("cannot reduce vector to fewer than 2 dimensions")));

    if (reduce_to >= dim)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                        errmsg("target dimension must be less than the source dimension %d", dim)));

    CheckDim(dim);
    CheckDim(reduce_to);

    result = InitVector(reduce_to);

    int step = dim / reduce_to;
    for (int idx = 0; idx < reduce_to; idx++) {
        int start = idx * step;
        /* the last bucket absorbs the remainder when dim % reduce_to != 0 */
        int stop = (idx == reduce_to - 1) ? dim : start + step;

        result->x[idx] = sum_float(vec->x, start, stop) / (float) (stop - start);
    }

    PG_RETURN_POINTER(result);
}
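
As a hypothetical usage sketch (the patch excerpt above does not show the SQL-level registration, so this assumes the function is exposed in SQL as vector_norm_reduce(vector, int)):

-- average each fixed range of 2 adjacent dimensions: 6 dims -> 3 dims
SELECT vector_norm_reduce('[1,2,3,4,5,6]'::vector, 3);
-- expected: [1.5,3.5,5.5]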

I have collected hundreds of sentences of text, both Chinese and English, spanning technical articles and literary works, and compared the 4096-dimensional embedding vectors generated directly by ollama with the results of the different compression methods. The range-averaging ("norm") algorithm compressed to 512 dimensions achieved the best results; most of the time its search results are the closest to those of the original 4096-dimensional vectors.

The Python script test/python/generate_and_save.py is used to generate test data, for example:

python test/python/generate_and_save.py test/python/stories.txt

Another Python script test/python/ask.py is used to compare search results, executed like this:

python test/python/ask.py "They're playing our song."

There may be other users besides me who need to generate lower-dimensional compressed vectors from high-dimensional embedding vectors. I hope this function can be of help.

jkatz commented Jun 4, 2024

As a general comment, any tests would have to go into the current regression test suite, not a new one built in Python. It'd be possible to test this with the standard PostgreSQL regression tests (.sql/.out).
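
For instance, a minimal .sql/.out pair might look like the sketch below (the file name, the SQL-level registration of vector_norm_reduce, and the expected output are illustrative assumptions, not part of this patch):

-- test/sql/norm_reduce.sql (hypothetical file)
SELECT vector_norm_reduce('[1,2,3,4,5,6,7,8]'::vector, 4);

-- the matching test/expected/norm_reduce.out would then record:
--  vector_norm_reduce
-- --------------------
--  [1.5,3.5,5.5,7.5]
-- (1 row)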

ankane commented Jun 4, 2024

Hi @MarchLiu, thanks for sharing. However, I'm skeptical that averaging dimensions is a good method for dimensionality reduction. I'd want to see significant evidence for it, including how it affects recall, as well as how it compares to binary quantization.

That being said, if you're able to create a SQL version of it, users could add it to their databases without the need for it to be part of an extension.
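
For example, a minimal SQL sketch of the same fixed-range averaging could look like this (it assumes the vector <-> real[] casts and the vector_dims() function available in recent pgvector versions; the function name simply mirrors the C version and is not part of pgvector):

CREATE OR REPLACE FUNCTION vector_norm_reduce(v vector, reduce_to int)
RETURNS vector AS $$
    SELECT array_agg(bucket_avg ORDER BY bucket)::vector
    FROM (
        -- assign each element to one of reduce_to evenly sized buckets
        SELECT ((i - 1) * reduce_to) / vector_dims(v) AS bucket,
               avg(elem) AS bucket_avg
        FROM unnest(v::real[]) WITH ORDINALITY AS t(elem, i)
        GROUP BY 1
    ) s
$$ LANGUAGE sql IMMUTABLE STRICT;

Note that when the source dimension is not divisible by reduce_to, this sketch spreads the remainder across buckets, so bucket boundaries may differ slightly from the C version.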

ankane closed this Jun 4, 2024

MarchLiu commented Jun 5, 2024

@ankane, @jkatz Hello, and thanks for your replies. In fact, I tried several different ways to reduce dimensions, but averaging over fixed ranges is a better implementation than some more complex methods, such as averaging by top-k differences. Of course, it isn't a perfect answer, only a probabilistic option for now.
I'm writing a matrix algorithm library that tries to port BLAS and LAPACK, and even the GGML library and LLMs, into PostgreSQL. It includes some vector methods, among them these dimensionality reduction methods. But I didn't want to pull in those dependencies just for a few dimensionality reduction functions that aren't good enough yet, so I put only this function into this pull request.
My plan B is to develop the matrix library and supply a compile option for methods that convert between a 1-row matrix and a pgvector vector.
