Cody Context Architecture
Context awareness is key to providing the highest quality responses to users. A metaphor we have
used is that a raw LLM is like a booksmart programmer who has read all the manuals but doesn’t know
a company’s codebase. By providing context from a company’s codebase along with the LLM prompt,
the LLM can generate an answer that is relevant to that codebase.
The key value proposition of Cody is that if it has the best, most relevant context about a
company’s codebase, it will be able to provide the best answers. Cody will be able to:
● Answer questions about that codebase
● Generate code that uses the libraries and style of that codebase
● Generate idiomatic tests and documentation
● And, in general, reduce the work developers need to do to go from the answer provided by the
LLM to delivering value in their organization
This document goes into detail on how we provide this context to an LLM.
This is where Cody comes in: when a user queries Cody, it generates a prompt that is specifically
designed to help answer questions about the user’s code.
For example, after the user clicks on the “Explain high-level” recipe, the resulting prompt might look
like this (the labels on the left are not part of the actual prompt):
Context   [Contents of sourcegraph/sourcegraph/internal/search/zoekt/query.go]
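Mechanically, this amounts to concatenating retrieved file contents with the user’s question into a single prompt string. The sketch below is a minimal illustration of that idea; the template wording and the build_prompt helper are assumptions made for this document, not Cody’s actual prompt format.

    def build_prompt(question: str, context_files: dict[str, str]) -> str:
        """Assemble a context-augmented prompt from retrieved files (illustrative only)."""
        context = "\n\n".join(
            f"File: {path}\n{contents}" for path, contents in context_files.items()
        )
        return (
            "Use the following code from the user's repository when answering.\n\n"
            f"{context}\n\n"
            f"Question: {question}\n"
        )

    # Hypothetical usage with one retrieved file:
    prompt = build_prompt(
        "Explain at a high level what this code does.",
        {"internal/search/zoekt/query.go": "...contents fetched as context..."},
    )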
A prompt like this is sent to the LLM. The information contained in the prompt is the only information
the LLM has beyond its baseline model. Because the LLM lacks information about a user’s code base,
context makes a big difference.
We can illustrate this with an example. If we send the prompt above to Claude without the context, we
get the following result:
When we send the same prompt to Cody, we include context from multiple files, including
query.go, which implements QueryToZoektQuery. The answer we get is much more specific to
Sourcegraph. It notes that Zoekt is Sourcegraph’s search backend, gives more background context,
and provides much more specific details about the code. The LLM underneath is still Claude. The
difference comes from providing the context.
The quality of the context is critical, even with increasing context window sizes[1]. For one, with
Enterprise code bases continuing to grow[2], even huge context windows are not enough to send
everything. Even in cases where sending everything is feasible, sending excess context can increase
query cost, data egress, and response time. The costs of not being selective can add up. Effective
context selection will continue to be an important part of interacting with LLMs.
Cody gets context by searching for relevant code snippets. It does this in one of two ways: Keyword
Search and Embeddings.
[1] Such as Anthropic’s recent introduction of a 100k context window.
[2] See Sourcegraph’s report on Big Code in the AI Era.
Keyword search
In a simple non-code example (sketched below), the query “Auth” would match both “OAuth” and
“Author” (assuming substrings are supported). It would not match the related terms “SAML” and
“OpenID Connect”, which are also common authentication or authorization methods.
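A minimal sketch of that substring-style matching, using made-up snippet strings:

    # Naive substring matching: "Auth" hits both OAuth and Author,
    # but misses SAML and OpenID Connect even though they are related concepts.
    snippets = [
        "OAuth token refresh",
        "Author of the commit",
        "SAML assertion parsing",
        "OpenID Connect login flow",
    ]
    query = "auth"
    matches = [s for s in snippets if query in s.lower()]
    print(matches)  # ['OAuth token refresh', 'Author of the commit']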
When using a codebase that has not had embeddings generated (see below), Cody may use keyword
search as a fallback, depending on the particular client or extension in use.
Embedding search
Keyword search matches documents based on how many words they share with the query (perhaps
using synonyms or bigrams such as “New York”). However, the most relevant document may have few
words in common. We want to look at the meaning of words in context, not just their textual form.
We use a technique developed by the NLP community: text embeddings. Embeddings encode words and
sentences as numeric vectors. These vector representations are designed to capture the linguistic
content of the text, and they can be used to measure the similarity between a query and a document
based on meaning.
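Similarity in embedding space is typically scored with cosine similarity between the query vector and each document vector. The tiny three-dimensional vectors below are stand-ins; real embeddings have hundreds or thousands of dimensions.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine of the angle between two vectors: 1.0 means same direction/meaning."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query_vec = np.array([0.9, 0.1, 0.3])   # toy embedding of the query
    doc_vec   = np.array([0.8, 0.2, 0.4])   # toy embedding of a candidate document
    print(cosine_similarity(query_vec, doc_vec))  # ~0.98, i.e. very similar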
This is especially relevant for code, which often refers to the same concept by different names, such as
abbreviations or library names. For example, with embeddings, “Auth” would match “OAuth”, “OpenID
Connect”, and “SAML”, which are all common authentication or authorization methods. It should not
match “Authorship”, since that is not commonly abbreviated as “Auth”.
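A sketch of that kind of semantic matching, assuming the open-source sentence-transformers library and the all-mpnet-base-v2 model mentioned in the evaluation below; the exact scores and ordering depend on the model used.

    from sentence_transformers import SentenceTransformer, util

    # Embed the query and candidate terms, then rank candidates by cosine similarity.
    model = SentenceTransformer("all-mpnet-base-v2")

    query = "Auth"
    candidates = ["OAuth", "OpenID Connect", "SAML", "Authorship"]

    query_emb = model.encode(query, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)

    scores = util.cos_sim(query_emb, cand_embs)[0]
    for term, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
        print(f"{score:.3f}  {term}")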
Embeddings are the preferred method for fetching context. Compared to keyword search, embeddings
search:
● uses all of the code in the specified repositories (not just code in the local workspace)
● matches code based on the meaning of the query (not the exact terms)
We ran an evaluation of embeddings against keyword search using the CodeSearchNet dataset. Looking
at Normalized Discounted Cumulative Gain (NDCG) over the first 20 returned results (a sketch of this
metric follows the list below), we compared the performance of:
● Embeddings models:
○ OpenAI
○ all-mpnet-base-v2
● Keyword search
○ ripgrep
○ Elasticsearch
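For reference, the sketch below shows one way NDCG over the top k results can be computed from relevance judgments; the judgments in the example are toy values, not our evaluation data.

    import math

    def ndcg_at_k(relevances: list[float], k: int = 20) -> float:
        """NDCG@k for a ranked list of relevance scores (result order as returned)."""
        def dcg(rels: list[float]) -> float:
            return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Toy binary judgments for the results one search method returned, best-ranked first.
    print(ndcg_at_k([1, 0, 1, 0, 0, 1]))  # ~0.87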
Using OpenAI embeddings as the baseline, we saw the following relative quality:
Although a true keyword search engine was able to nearly match the OpenAI embeddings model, the
result from all-mpnet-base-v2 shows that the right embeddings model can outperform keyword
search. Any embeddings model is much better than ripgrep[5].
That said, it doesn’t need to be an “either/or”. Different search methods can be blended, for example,
combining embeddings with keyword search when exact matches are useful.
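One common way to blend ranked lists from different retrieval methods is reciprocal rank fusion. The sketch below illustrates that general idea with made-up file paths; it is not a description of how Cody actually combines results.

    from collections import defaultdict

    def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
        """Merge several best-first ranked lists of document IDs into one ranking.

        k = 60 is the conventional smoothing constant from the RRF literature.
        """
        scores: dict[str, float] = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] += 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # Documents found by both methods (exact match plus semantic match) rise to the top.
    embedding_hits = ["auth/oauth.go", "auth/saml.go", "docs/openid.md"]
    keyword_hits   = ["auth/oauth.go", "blame/author.go"]
    print(reciprocal_rank_fusion([embedding_hits, keyword_hits]))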
Improving context
We are continually working to improve context search. Since Cody’s initial release, we have already
added support for searching over multiple repositories and seamless updating of embeddings, and
scaled the current vector database by 10x. Here’s where we’re going next.
[5] The advantage of ripgrep is that it is easy to run locally and on demand.
Using OpenAI’s embeddings model has been effective for us so far, but it presents some challenges.
Using OpenAI requires that we send a customer’s whole code base to a third party service provider.
Although we have a 0-retention policy for both LLM and embeddings providers, some customers prefer
not to send their entire code base to an entity they do not have a direct relationship with.
Another challenge is that OpenAI embeddings are large. Each embedding is represented by 1536
floating point numbers, which adds up to a lot of RAM. Using smaller embedding vectors would allow
the Embeddings Serving subsystem to scale to more code with the same amount of RAM. It would also
allow records to be scanned more quickly, which makes searches faster.
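A back-of-the-envelope calculation makes the memory argument concrete. The chunk count below is a hypothetical figure, and 768 dimensions is used as an example of a smaller model (all-mpnet-base-v2 produces 768-dimensional vectors).

    # Rough sizing: embeddings stored as float32 (4 bytes per dimension).
    chunks = 10_000_000                 # hypothetical number of indexed code chunks
    bytes_per_float = 4

    for dims in (1536, 768):            # OpenAI-sized vs. a smaller open-source model
        gib = chunks * dims * bytes_per_float / 2**30
        print(f"{dims}-dim vectors: ~{gib:.0f} GiB of RAM")
    # 1536 dims: ~57 GiB; 768 dims: ~29 GiB. Halving the dimensions halves the footprint.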
As the comparison of keyword search and embeddings above shows, another challenge with OpenAI
embeddings is that other embeddings models perform better, even though they are smaller.
To solve these problems, we are replacing OpenAI embeddings with a Sourcegraph-managed
Embeddings API. This would be a stateless API managed by Sourcegraph[6]. It would generate
embeddings directly; it would not call out to another service[7]. The managed service would have a
0-retention policy, with the generated embeddings continuing to be stored in the vector database in a
customer’s Cloud-managed or self-hosted Sourcegraph instance.
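As a rough sketch, a client of such a service might look like the following. The endpoint URL, payload shape, and field names here are assumptions made for illustration, not a documented API.

    import requests

    def embed_chunks(chunks: list[str], api_url: str, token: str) -> list[list[float]]:
        """Send code chunks to a (hypothetical) stateless embeddings endpoint."""
        resp = requests.post(
            api_url,                                   # e.g. a Sourcegraph-managed endpoint
            headers={"Authorization": f"token {token}"},
            json={"input": chunks},
            timeout=30,
        )
        resp.raise_for_status()
        # The returned vectors would be stored in the vector database of the customer's
        # own Sourcegraph instance; the managed service retains nothing.
        return resp.json()["embeddings"]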
Results so far are promising. Even off the shelf, we’re seeing smaller models perform up to 20% better
on our tests than the OpenAI embeddings. (Models we’ve evaluated include AllDistilRobertaV1,
AllMPNetBaseV2, E5, AllMiniLML6V2, and SentenceT5.)
Beyond code
Developers answer questions with more than just code. Sometimes the best answer to a user’s query
is found in a company’s knowledge base or ticket system. We plan to add support for these and other
data providers. Because embeddings encode meaning, we believe they are particularly well suited to
finding commonalities across disparate data sources such as code, documentation, bug reports, logs,
and more.
[6] We do not currently plan to provide self-hosted functionality for generating embeddings.
[7] We will temporarily retain the option to use OpenAI for generating embeddings to aid in the transition to a new model.