Conversation

@wirthual
Collaborator

@wirthual wirthual commented Dec 6, 2024

Related Issue

#476

Checklist

  • I have read the CONTRIBUTING guidelines.
  • I have added tests to cover my changes.
  • I have updated the documentation (docs folder) accordingly.

Additional Notes

WIP to add matryoshka embeddings.

Is there a CLAP model which supports matryoshka embedding for testing?
Is there a TinyCLIP model which supports matryoshka embedding for testing?

Currently missing:
[ ] Integration into client
[ ] Implementation for dummy model

Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

Here's my summary of the key changes in this PR:

Adds support for matryoshka (variable-length) embeddings across the infinity library with the following major changes:

  • Added dimensions field to OpenAI embedding input model in pymodels.py to specify desired embedding length
  • Modified BatchHandler to truncate embeddings to requested dimension after generation in batch_handler.py
  • Added matryoshka_dim parameter to embedding methods in AsyncEmbeddingEngine and AsyncEngineArray
  • Added comprehensive test coverage verifying matryoshka functionality:
    • Tests with nomic-embed-text-v1.5 and jina-clip-v2 models
    • Validates truncated embeddings maintain semantic similarity
    • Verifies correct dimensions in API responses

The implementation enables compatibility with models like OpenAI's text-embedding-3 that support variable-length embeddings while maintaining backward compatibility.
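The truncation step described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from batch_handler.py: the helper name `truncate_embedding` and the `renormalize` flag are assumptions made for this example, and whether the library re-normalizes after truncating is not stated in this conversation.

```python
import math
from typing import Optional, Sequence


def truncate_embedding(
    embedding: Sequence[float],
    matryoshka_dim: Optional[int] = None,
    renormalize: bool = False,
) -> list[float]:
    """Keep only the first `matryoshka_dim` values (hypothetical helper)."""
    if matryoshka_dim is None:
        # No dimension requested: return the full embedding unchanged.
        return list(embedding)
    truncated = list(embedding[:matryoshka_dim])
    if renormalize:
        # Optional step (an assumption, not confirmed by the PR):
        # rescale the shortened vector back to unit length.
        norm = math.sqrt(sum(x * x for x in truncated))
        if norm > 0:
            truncated = [x / norm for x in truncated]
    return truncated


full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, matryoshka_dim=2)
print(len(full), len(short))
```

Because the first components of a matryoshka-trained embedding carry most of the information, slicing is enough to shorten the vector; re-normalizing only matters when downstream code expects unit-length vectors.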

Note: PR is marked WIP and still needs:

  • Integration into client
  • Implementation for dummy model
  • Additional test coverage for edge cases

7 file(s) reviewed, 14 comment(s)


  @add_start_docstrings(AsyncEngineArray.embed.__doc__)
- def embed(self, *, model: str, sentences: list[str]):
+ def embed(self, *, model: str, sentences: list[str], matryoshka_dim=None):
Contributor

style: matryoshka_dim parameter lacks type annotation. Should be Optional[int]
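A minimal sketch of the annotation the reviewer is asking for. The class `SyncEngineSketch` is a hypothetical stand-in used only for illustration; the real method lives in the library's sync engine and its body is not shown in this conversation.

```python
from typing import Optional


class SyncEngineSketch:
    """Hypothetical stand-in for the engine class touched by the diff."""

    def embed(
        self,
        *,
        model: str,
        sentences: list[str],
        matryoshka_dim: Optional[int] = None,  # annotated as suggested in the review
    ):
        # Placeholder body: echo the arguments so the sketch is runnable.
        return model, sentences, matryoshka_dim


engine = SyncEngineSketch()
print(engine.embed(model="some-model", sentences=["hello"], matryoshka_dim=64))
```

The same annotation would apply to the image_embed and audio_embed variants flagged in the following comments.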


  @add_start_docstrings(AsyncEngineArray.image_embed.__doc__)
- def image_embed(self, *, model: str, images: list[Union[str, bytes]]):
+ def image_embed(self, *, model: str, images: list[Union[str, bytes]], matryoshka_dim=None):
Contributor

style: matryoshka_dim parameter lacks type annotation. Should be Optional[int]


  @add_start_docstrings(AsyncEngineArray.audio_embed.__doc__)
- def audio_embed(self, *, model: str, audios: list[Union[str, bytes]]):
+ def audio_embed(self, *, model: str, audios: list[Union[str, bytes]], matryoshka_dim=None):
Contributor

style: matryoshka_dim parameter lacks type annotation. Should be Optional[int]


  async def image_embed(
-     self, *, model: str, images: list[Union[str, "ImageClassType"]]
+     self, *, model: str, images: list[Union[str, "ImageClassType"]], matryoshka_dim=None
Contributor

style: matryoshka_dim parameter is missing type annotation, should be Optional[int]


  async def audio_embed(
-     self, *, model: str, audios: list[Union[str, bytes]]
+     self, *, model: str, audios: list[Union[str, bytes]], matryoshka_dim=None
Contributor

style: matryoshka_dim parameter is missing type annotation, should be Optional[int]

    )
    assert engine.capabilities == {"embed"}
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences, matryoshka_dim=matryoshka_dim)
Contributor

logic: matryoshka_dim parameter should be validated against model's supported dimensions
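One way such a check could look, as a sketch only. The function `validate_matryoshka_dim` and its `model_dim` argument are hypothetical names; this example assumes "supported" means any size from 1 up to the model's full output dimension, whereas a real model may only be trained for specific sizes.

```python
from typing import Optional


def validate_matryoshka_dim(matryoshka_dim: Optional[int], model_dim: int) -> Optional[int]:
    """Reject requested sizes outside 1..model_dim (hypothetical helper)."""
    if matryoshka_dim is None:
        # Nothing requested: the caller gets the full-size embedding.
        return None
    if not 0 < matryoshka_dim <= model_dim:
        raise ValueError(
            f"matryoshka_dim={matryoshka_dim} must be between 1 and {model_dim}"
        )
    return matryoshka_dim


print(validate_matryoshka_dim(64, model_dim=768))
```

Failing early with a clear error avoids silently returning a vector whose length differs from what the caller asked for.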

embeddings = np.array(embeddings)
assert usage == sum([len(s) for s in sentences])
assert embeddings.shape[0] == len(sentences)
assert embeddings.shape[1] >= 10
Contributor

style: redundant assertion since line 408 already checks exact dimension

@codecov-commenter

codecov-commenter commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.63%. Comparing base (be48378) to head (9c811fb).
Report is 19 commits behind head on main.

Files with missing lines                         Patch %   Lines
libs/infinity_emb/infinity_emb/engine.py         87.50%    1 Missing ⚠️
libs/infinity_emb/infinity_emb/sync_engine.py    83.33%    1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #490      +/-   ##
==========================================
+ Coverage   79.59%   79.63%   +0.04%     
==========================================
  Files          41       41              
  Lines        3430     3438       +8     
==========================================
+ Hits         2730     2738       +8     
  Misses        700      700              


@wirthual
Collaborator Author

wirthual commented Dec 6, 2024

I did a quick test like this:

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:7997",
    api_key="sk", 
)

result = client.embeddings.create(
    input=["input","input2"],
    model="nomic-ai/nomic-embed-text-v1.5",
    dimensions=64
)

assert len(result.data[0].embedding) == 64

model: str = "default/not-specified"
encoding_format: EmbeddingEncodingFormat = EmbeddingEncodingFormat.float
user: Optional[str] = None
dimensions: Optional[int] = None
Owner

@michaelfeil michaelfeil Dec 9, 2024

The int should satisfy 0 < x < 8193, using pydantic v2 conint.

@michaelfeil
Owner

michaelfeil commented Dec 9, 2024

LGTM, if you change the OpenAPI spec for the validation of input and add an end-to-end test

@michaelfeil
Owner

michaelfeil commented Dec 9, 2024

@wirthual
Potentially this could be an end-to-end test under open

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:7997",
    api_key="sk", 
)

result = client.embeddings.create(
    input=["input","input2"],
    model="nomic-ai/nomic-embed-text-v1.5",
    dimensions=64
)

assert len(result.data[0].embedding) == 64

@wirthual
Collaborator Author

wirthual commented Dec 9, 2024

@wirthual Potentially this could be an end-to-end test under open

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:7997",
    api_key="sk", 
)

result = client.embeddings.create(
    input=["input","input2"],
    model="nomic-ai/nomic-embed-text-v1.5",
    dimensions=64
)

assert len(result.data[0].embedding) == 64

Sounds good. Is there an example of how to start a FastAPI server within a pytest method without using the AsyncOpenAI client?

@michaelfeil
Owner

Just add one here:
https://github.com/michaelfeil/infinity/blob/be483785f23c3e2a738c85028cbac3a390ec2bab/libs/infinity_emb/tests/end_to_end/test_openapi_client_compat.py#L115C9-L115C21
As with the other tests, we mostly don't use pytest.mark.parametrize here, so that the server does not need to restart every time.

@wirthual
Collaborator Author

wirthual commented Dec 9, 2024

Like this?

Owner

@michaelfeil michaelfeil left a comment

Okay, nevermind.. :)

@wirthual wirthual merged commit efe6096 into main Dec 10, 2024
36 checks passed
@wirthual wirthual deleted the matryoshka_dim branch December 10, 2024 02:07