@MarchLiu MarchLiu commented Jun 4, 2024

Hello,

I previously submitted a not-so-smart issue; thank you everyone for your answers. I have come to understand the index limitations of PostgreSQL, as well as some other useful knowledge, particularly the article "pgvector: Fewer dimensions are better". Based on that article, I decided to tackle the dimensionality issue of ollama embeddings from a different angle.

Since then, I have written several dimensionality reduction algorithms. Weighing effectiveness against implementation complexity, the simplest method, partitioning the vector into fixed ranges and taking the average of each range, is a good choice.

This function is not complicated; on the contrary, among the algorithms I have tried so far, it is the simplest:

static inline float sum_float(const float *matrix, int start, int stop) {
    float result = 0.0f;

    /* sum the half-open range [start, stop) */
    for (int i = start; i < stop; i++) {
        result += matrix[i];
    }
    return result;
}

/*
 * reduce vector dims by averaging fixed ranges
 */
PGDLLEXPORT PG_FUNCTION_INFO_V1(vector_norm_reduce);

Datum
vector_norm_reduce(PG_FUNCTION_ARGS) {
    Vector *vec = PG_GETARG_VECTOR_P(0);
    int32 reduce_to = PG_GETARG_INT32(1);
    Vector *result;
    int dim = vec->dim;

    if (reduce_to < 2)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_EXCEPTION),
                        errmsg("cannot reduce vector to fewer than 2 dimensions")));

    if (reduce_to >= dim)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                        errmsg("target dimension must be less than the source dimension %d", dim)));

    CheckDim(dim);
    CheckDim(reduce_to);

    result = InitVector(reduce_to);

    int step = dim / reduce_to;
    for (int idx = 0; idx < reduce_to; idx++) {
        int start = idx * step;
        /* the last bucket absorbs the remainder when dim % reduce_to != 0 */
        int stop = (idx == reduce_to - 1) ? dim : start + step;

        result->x[idx] = sum_float(vec->x, start, stop) / (float) (stop - start);
    }

    PG_RETURN_POINTER(result);
}
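
As a hypothetical usage sketch (the patch excerpt above does not show the SQL-level registration, so this assumes the function is exposed in SQL as vector_norm_reduce(vector, int)):

-- average each fixed range of 2 adjacent dimensions: 6 dims -> 3 dims
SELECT vector_norm_reduce('[1,2,3,4,5,6]'::vector, 3);
-- expected: [1.5,3.5,5.5]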

I have collected hundreds of sentences of text, both Chinese and English, spanning technical articles and literary works, and compared the 4096-dimensional embedding vectors generated directly by ollama with the results of the different compression methods. The range-averaging ("norm") algorithm compressed to 512 dimensions achieved the best results; most of the time its search results are the closest to those of the original 4096-dimensional vectors.

The Python script test/python/generate_and_save.py is used to generate test data, for example:

python test/python/generate_and_save.py test/python/stories.txt

Another Python script test/python/ask.py is used to compare search results, executed like this:

python test/python/ask.py "They're playing our song."

There may be other users besides me who need to generate lower-dimensional compressed vectors from high-dimensional embedding vectors. I hope this function can be of help.

jkatz commented Jun 4, 2024

As a general comment, any tests would have to go into the current regression test suite, not a new one built in Python. It'd be possible to test this with the standard PostgreSQL regression tests (.sql/.out).
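
For instance, a minimal .sql/.out pair might look like the sketch below (the file name, the SQL-level registration of vector_norm_reduce, and the expected output are illustrative assumptions, not part of this patch):

-- test/sql/norm_reduce.sql (hypothetical file)
SELECT vector_norm_reduce('[1,2,3,4,5,6,7,8]'::vector, 4);

-- the matching test/expected/norm_reduce.out would then record:
--  vector_norm_reduce
-- --------------------
--  [1.5,3.5,5.5,7.5]
-- (1 row)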

ankane commented Jun 4, 2024

Hi @MarchLiu, thanks for sharing. However, I'm skeptical that averaging dimensions is a good method for dimensionality reduction. I'd want to see significant evidence for it, including how it affects recall, as well as how it compares to binary quantization.

That being said, if you're able to create a SQL version of it, users could add it to their databases without the need for it to be part of an extension.
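
For example, a minimal SQL sketch of the same fixed-range averaging could look like this (it assumes the vector <-> real[] casts and the vector_dims() function available in recent pgvector versions; the function name simply mirrors the C version and is not part of pgvector):

CREATE OR REPLACE FUNCTION vector_norm_reduce(v vector, reduce_to int)
RETURNS vector AS $$
    SELECT array_agg(bucket_avg ORDER BY bucket)::vector
    FROM (
        -- assign each element to one of reduce_to evenly sized buckets
        SELECT ((i - 1) * reduce_to) / vector_dims(v) AS bucket,
               avg(elem) AS bucket_avg
        FROM unnest(v::real[]) WITH ORDINALITY AS t(elem, i)
        GROUP BY 1
    ) s
$$ LANGUAGE sql IMMUTABLE STRICT;

Note that when the source dimension is not divisible by reduce_to, this sketch spreads the remainder across buckets, so bucket boundaries may differ slightly from the C version.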

ankane closed this Jun 4, 2024

MarchLiu commented Jun 5, 2024

@ankane, @jkatz Hello, and thanks for your replies. In fact, I tried several different ways to reduce dimensions, but averaging over fixed ranges is a better implementation than some more complex methods, such as averaging by top-k differences. Of course, it isn't a perfect answer, only a probabilistic option for now.
I'm writing a matrix algorithm library that tries to port BLAS and LAPACK, and even the GGML library and LLMs, into PostgreSQL. It includes some vector methods, among them these dimensionality reduction methods. But I didn't want to pull in those dependencies just for a few dimensionality reduction functions that aren't good enough yet, so I put only this function into this pull request.
My plan B is to develop the matrix library and supply a compile option for methods that convert between a 1-row matrix and a pgvector vector.
