Traditional code search relies on exact matches and regex patterns. We explore how embedding-based approaches can understand developer intent and return contextually relevant results.
The Problem with Keyword Search
Every developer knows the frustration: you're looking for "where we handle authentication errors" but you don't know what the code calls it. Is it `AuthError`? `AuthenticationException`? `handleAuthFailure`?
Keyword search forces you to guess the vocabulary of whoever wrote the code. This is backwards—the search tool should understand what you mean, not just what you type.
Our Approach
We built a semantic search system that understands code at three levels:
Level 1: Lexical Embeddings
We use a code-trained language model to create vector representations of:
- Function signatures
- Documentation strings
- Variable and parameter names
- Comments
These embeddings capture semantic similarity: `authenticate` and `login` are recognized as related concepts even though they share no characters.
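To make that concrete, here is a minimal sketch of how embedding similarity is typically measured. The four-dimensional vectors below are invented for illustration; a real code-trained model produces vectors with hundreds of dimensions.

```typescript
// Hypothetical toy embeddings; real models emit much larger vectors.
const embeddings: Record<string, number[]> = {
  authenticate: [0.81, 0.52, 0.10, 0.05],
  login:        [0.78, 0.58, 0.12, 0.08],
  parseJson:    [0.05, 0.11, 0.90, 0.40],
};

// Cosine similarity: 1.0 means identical direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// "authenticate" sits close to "login", far from "parseJson".
console.log(cosineSimilarity(embeddings.authenticate, embeddings.login));
console.log(cosineSimilarity(embeddings.authenticate, embeddings.parseJson));
```

Keyword search sees `authenticate` and `login` as entirely different strings; in embedding space they land near each other because they appear in similar contexts in the training data.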
Level 2: Structural Context
Code structure matters. A function called `handleError` in an authentication module is different from one in a database module. We enrich our embeddings with:
- File path information
- Import/dependency graph
- Call hierarchy
- Module boundaries
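One simple way to fold structural context into an embedding is to prepend it to the text that gets embedded, so the same function name in different modules produces different vectors. The `CodeUnit` shape and field names below are illustrative, not our actual schema.

```typescript
// Hypothetical shape of an indexed code unit.
interface CodeUnit {
  name: string;
  filePath: string;
  imports: string[];
  callers: string[];
}

// Build the structurally enriched text that would be sent to the
// embedding model: module path, dependencies, and call hierarchy
// all become part of the vector.
function enrichedText(unit: CodeUnit): string {
  const module = unit.filePath.split("/").slice(0, -1).join("/");
  return [
    `module: ${module}`,
    `imports: ${unit.imports.join(", ")}`,
    `called by: ${unit.callers.join(", ")}`,
    `function: ${unit.name}`,
  ].join("\n");
}

const authHandler: CodeUnit = {
  name: "handleError",
  filePath: "src/auth/session.ts",
  imports: ["jsonwebtoken"],
  callers: ["verifyToken"],
};

console.log(enrichedText(authHandler));
```

With this enrichment, `handleError` in `src/auth/` and `handleError` in `src/db/` embed differently even though the function bodies might look similar.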
Level 3: Usage Patterns
How code is used reveals its purpose. We analyze:
- Call sites and their context
- Test files (often contain the clearest descriptions of behavior)
- Commit messages that modified the code
- Related documentation
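As a sketch of how these signals might be collected: test names and commit messages that mention a symbol tend to describe its behavior in plain language, which makes them good embedding input. The `UsageSignals` shape and the filtering heuristic here are illustrative assumptions, not our production pipeline.

```typescript
// Hypothetical bundle of usage signals gathered for a repository.
interface UsageSignals {
  testNames: string[];
  commitMessages: string[];
}

// Collect the lines that mention a symbol; these natural-language
// descriptions are concatenated into the symbol's embedding input.
function usageDocument(symbol: string, signals: UsageSignals): string {
  const mentioning = (lines: string[]) =>
    lines.filter((l) => l.toLowerCase().includes(symbol.toLowerCase()));
  return [
    ...mentioning(signals.testNames),
    ...mentioning(signals.commitMessages),
  ].join("\n");
}

const signals: UsageSignals = {
  testNames: [
    "handleAuthFailure redirects to login on expired token",
    "parseConfig rejects malformed YAML",
  ],
  commitMessages: ["Fix handleAuthFailure retry loop"],
};

console.log(usageDocument("handleAuthFailure", signals));
```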
Implementation Details
```typescript
// Query processing pipeline
async function search(query: string): Promise<SearchResult[]> {
  // 1. Generate query embedding
  const queryEmbedding = await embedQuery(query);

  // 2. Retrieve candidates via approximate nearest neighbors
  const candidates = await vectorIndex.search(queryEmbedding, {
    limit: 100,
    threshold: 0.7,
  });

  // 3. Re-rank with cross-encoder for precision
  const reranked = await crossEncoder.rerank(query, candidates);

  // 4. Apply structural filters
  return applyContextFilters(reranked, query);
}
```
The key insight is combining fast approximate search (for recall) with slower but more accurate re-ranking (for precision).
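This recall-then-precision pattern can be shown in isolation with stub scorers. The word-overlap and keyword scorers below stand in for the ANN index and cross-encoder respectively; they are toys for illustration, not what the real system uses.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Two-stage ranking: a cheap scorer trims a large candidate set to a
// small pool (recall stage), then an expensive scorer reorders the
// survivors (precision stage).
function rerank(
  query: string,
  candidates: Candidate[],
  cheapScore: (q: string, c: Candidate) => number,
  preciseScore: (q: string, c: Candidate) => number,
  poolSize = 100,
): Candidate[] {
  const pool = [...candidates]
    .sort((a, b) => cheapScore(query, b) - cheapScore(query, a))
    .slice(0, poolSize);
  return pool.sort((a, b) => preciseScore(query, b) - preciseScore(query, a));
}

const docs: Candidate[] = [
  { id: "a", text: "auth error handler" },
  { id: "b", text: "database pool" },
  { id: "c", text: "login failure handler" },
];

// Stub recall stage: count shared words with the query.
const cheap = (q: string, c: Candidate) =>
  q.split(" ").filter((w) => c.text.includes(w)).length;

// Stub precision stage: strongly prefer candidates mentioning "login".
const precise = (q: string, c: Candidate) =>
  c.text.includes("login") ? 3 : cheap(q, c);

console.log(rerank("auth login handler", docs, cheap, precise, 2));
```

The expensive scorer only ever sees `poolSize` candidates, so its cost stays bounded no matter how large the corpus grows.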
Results
We evaluated on a dataset of 50 natural language queries against 10 popular open-source repositories:
| Method | Precision@5 | Recall@10 | MRR |
|--------|-------------|-----------|-----|
| Keyword (ripgrep) | 0.23 | 0.31 | 0.34 |
| GitHub Code Search | 0.41 | 0.52 | 0.48 |
| Our System | 0.72 | 0.84 | 0.79 |
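For reference, mean reciprocal rank (MRR) averages `1 / rank` of the first relevant result across queries, with 1-based ranks; a query that surfaces nothing relevant contributes 0. A minimal implementation:

```typescript
// MRR over a set of queries: each entry is the 1-based rank of the
// first relevant result for that query, or null if none was found.
function meanReciprocalRank(firstRelevantRanks: (number | null)[]): number {
  const total = firstRelevantRanks
    .map((r) => (r === null ? 0 : 1 / r))
    .reduce((a, b) => a + b, 0);
  return total / firstRelevantRanks.length;
}

// Three queries: hits at rank 1 and rank 2, one miss.
// (1 + 0.5 + 0) / 3 = 0.5
console.log(meanReciprocalRank([1, 2, null]));
```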
Qualitative feedback from developers was even more positive; several described it as "finally, search that thinks like I do."
Limitations and Future Work
Indexing cost: Building the semantic index requires significant compute. We're exploring incremental updates to reduce this.
Novel code: Recently written code may not have enough context for accurate embedding. We're investigating few-shot learning approaches.
Cross-repository search: Connecting related concepts across different projects remains challenging.
Want to try semantic search on your codebase? We're running a private beta. Sign up here →