Traditional code search relies on exact matches and regex patterns. We explore how embedding-based approaches can understand developer intent and return contextually relevant results.
The Problem with Keyword Search
Every developer knows the frustration: you're looking for "where we handle authentication errors" but you don't know what the code calls it. Is it `AuthError`? `AuthenticationException`? `handleAuthFailure`?
Keyword search forces you to guess the vocabulary of whoever wrote the code. This is backwards—the search tool should understand what you mean, not just what you type.
Our Approach
We built a semantic search system that understands code at three levels:
Level 1: Lexical Embeddings
We use a code-trained language model to create vector representations of:
- Function signatures
- Documentation strings
- Variable and parameter names
- Comments
These embeddings capture semantic similarity: `authenticate` and `login` are recognized as related concepts even though they share no characters.
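To make that concrete, here is a minimal sketch of how embedding similarity is typically measured. The four-dimensional vectors below are invented for illustration; a real code-trained model produces vectors with hundreds of dimensions.

```typescript
// Hypothetical toy embeddings; real models emit much larger vectors.
const embeddings: Record<string, number[]> = {
  authenticate: [0.81, 0.52, 0.10, 0.05],
  login:        [0.78, 0.58, 0.12, 0.08],
  parseJson:    [0.05, 0.11, 0.90, 0.40],
};

// Cosine similarity: 1.0 means identical direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// "authenticate" sits close to "login", far from "parseJson".
console.log(cosineSimilarity(embeddings.authenticate, embeddings.login));
console.log(cosineSimilarity(embeddings.authenticate, embeddings.parseJson));
```

Keyword search sees `authenticate` and `login` as entirely different strings; in embedding space they land near each other because they appear in similar contexts in the training data.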
Level 2: Structural Context
Code structure matters. A function called `handleError` in an authentication module is different from one in a database module. We enrich our embeddings with:
- File path information
- Import/dependency graph
- Call hierarchy
- Module boundaries
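One simple way to fold structural context into an embedding is to prepend it to the text that gets embedded, so the same function name in different modules produces different vectors. The `CodeUnit` shape and field names below are illustrative, not our actual schema.

```typescript
// Hypothetical shape of an indexed code unit.
interface CodeUnit {
  name: string;
  filePath: string;
  imports: string[];
  callers: string[];
}

// Build the structurally enriched text that would be sent to the
// embedding model: module path, dependencies, and call hierarchy
// all become part of the vector.
function enrichedText(unit: CodeUnit): string {
  const module = unit.filePath.split("/").slice(0, -1).join("/");
  return [
    `module: ${module}`,
    `imports: ${unit.imports.join(", ")}`,
    `called by: ${unit.callers.join(", ")}`,
    `function: ${unit.name}`,
  ].join("\n");
}

const authHandler: CodeUnit = {
  name: "handleError",
  filePath: "src/auth/session.ts",
  imports: ["jsonwebtoken"],
  callers: ["verifyToken"],
};

console.log(enrichedText(authHandler));
```

With this enrichment, `handleError` in `src/auth/` and `handleError` in `src/db/` embed differently even though the function bodies might look similar.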
Level 3: Usage Patterns
How code is used reveals its purpose. We analyze:
- Call sites and their context
- Test files (often contain the clearest descriptions of behavior)
- Commit messages that modified the code
- Related documentation
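As a sketch of how these signals might be collected: test names and commit messages that mention a symbol tend to describe its behavior in plain language, which makes them good embedding input. The `UsageSignals` shape and the filtering heuristic here are illustrative assumptions, not our production pipeline.

```typescript
// Hypothetical bundle of usage signals gathered for a repository.
interface UsageSignals {
  testNames: string[];
  commitMessages: string[];
}

// Collect the lines that mention a symbol; these natural-language
// descriptions are concatenated into the symbol's embedding input.
function usageDocument(symbol: string, signals: UsageSignals): string {
  const mentioning = (lines: string[]) =>
    lines.filter((l) => l.toLowerCase().includes(symbol.toLowerCase()));
  return [
    ...mentioning(signals.testNames),
    ...mentioning(signals.commitMessages),
  ].join("\n");
}

const signals: UsageSignals = {
  testNames: [
    "handleAuthFailure redirects to login on expired token",
    "parseConfig rejects malformed YAML",
  ],
  commitMessages: ["Fix handleAuthFailure retry loop"],
};

console.log(usageDocument("handleAuthFailure", signals));
```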
Implementation Details
```typescript
// Query processing pipeline
async function search(query: string): Promise<SearchResult[]> {
  // 1. Generate query embedding
  const queryEmbedding = await embedQuery(query);

  // 2. Retrieve candidates via approximate nearest neighbors
  const candidates = await vectorIndex.search(queryEmbedding, {
    limit: 100,
    threshold: 0.7,
  });

  // 3. Re-rank with cross-encoder for precision
  const reranked = await crossEncoder.rerank(query, candidates);

  // 4. Apply structural filters
  return applyContextFilters(reranked, query);
}
```
The key insight is combining fast approximate search (for recall) with slower but more accurate re-ranking (for precision).
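This recall-then-precision pattern can be shown in isolation with stub scorers. The word-overlap and keyword scorers below stand in for the ANN index and cross-encoder respectively; they are toys for illustration, not what the real system uses.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Two-stage ranking: a cheap scorer trims a large candidate set to a
// small pool (recall stage), then an expensive scorer reorders the
// survivors (precision stage).
function rerank(
  query: string,
  candidates: Candidate[],
  cheapScore: (q: string, c: Candidate) => number,
  preciseScore: (q: string, c: Candidate) => number,
  poolSize = 100,
): Candidate[] {
  const pool = [...candidates]
    .sort((a, b) => cheapScore(query, b) - cheapScore(query, a))
    .slice(0, poolSize);
  return pool.sort((a, b) => preciseScore(query, b) - preciseScore(query, a));
}

const docs: Candidate[] = [
  { id: "a", text: "auth error handler" },
  { id: "b", text: "database pool" },
  { id: "c", text: "login failure handler" },
];

// Stub recall stage: count shared words with the query.
const cheap = (q: string, c: Candidate) =>
  q.split(" ").filter((w) => c.text.includes(w)).length;

// Stub precision stage: strongly prefer candidates mentioning "login".
const precise = (q: string, c: Candidate) =>
  c.text.includes("login") ? 3 : cheap(q, c);

console.log(rerank("auth login handler", docs, cheap, precise, 2));
```

The expensive scorer only ever sees `poolSize` candidates, so its cost stays bounded no matter how large the corpus grows.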
Results
We evaluated on a dataset of 50 natural language queries against 10 popular open-source repositories:
| Method | Precision@5 | Recall@10 | MRR |
|--------|-------------|-----------|-----|
| Keyword (ripgrep) | 0.23 | 0.31 | 0.34 |
| GitHub Code Search | 0.41 | 0.52 | 0.48 |
| Our System | 0.72 | 0.84 | 0.79 |
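For reference, mean reciprocal rank (MRR) averages `1 / rank` of the first relevant result across queries, with 1-based ranks; a query that surfaces nothing relevant contributes 0. A minimal implementation:

```typescript
// MRR over a set of queries: each entry is the 1-based rank of the
// first relevant result for that query, or null if none was found.
function meanReciprocalRank(firstRelevantRanks: (number | null)[]): number {
  const total = firstRelevantRanks
    .map((r) => (r === null ? 0 : 1 / r))
    .reduce((a, b) => a + b, 0);
  return total / firstRelevantRanks.length;
}

// Three queries: hits at rank 1 and rank 2, one miss.
// (1 + 0.5 + 0) / 3 = 0.5
console.log(meanReciprocalRank([1, 2, null]));
```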
Qualitative feedback from developers was even more positive; several described it as "finally, search that thinks like I do."
Limitations and Future Work
Indexing cost: Building the semantic index requires significant compute. We're exploring incremental updates to reduce this.
Novel code: Recently written code may not have enough context for accurate embedding. We're investigating few-shot learning approaches.
Cross-repository search: Connecting related concepts across different projects remains challenging.
Want to try semantic search on your codebase? We're running a private beta. Sign up here →