Nick Bradford | Commentary
I started working on coding agents in the pre-ChatGPT days, and am now working on Ellipsis, an AI software engineer that reviews PRs, fixes bugs, and more. This post expands on a talk I gave to other AI founders at the recent Y Combinator Alumni Retreat, and is a spiritual successor to my post from a year ago, covering:
Technical deep dive on how Ellipsis works across code review and search
Developer workflow and tooling, including evals and LLM-as-judge
How we built Ellipsis
System architecture
After a user installs our GitHub App into their repo, we route webhook events through Hookdeck for reliability, which forwards to our web app (FastAPI). We immediately place events onto a workflow queue managed by Hatchet.
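As a rough sketch of that entry point (assuming FastAPI; the Hatchet enqueue is stubbed behind a hypothetical `enqueue_review_workflow` helper, since the exact SDK call depends on your setup):

```python
# Rough sketch of the webhook entry point. The real handler verifies signatures,
# deduplicates deliveries, and routes many more event types.
from fastapi import FastAPI, Request

app = FastAPI()

async def enqueue_review_workflow(event_type: str, payload: dict) -> None:
    """Push the event onto the Hatchet-managed workflow queue (stubbed here)."""
    ...

@app.post("/webhooks/github")
async def github_webhook(request: Request):
    event_type = request.headers.get("X-GitHub-Event", "")
    payload = await request.json()

    # Ack immediately; all real work happens asynchronously in the workflow.
    await enqueue_review_workflow(event_type, payload)
    return {"status": "queued"}
```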
When a user opens a PR or marks a PR as ready to review, the workflow clones the repository and runs the review agent. It also responds to tags (“@ellipsis-dev review this”), which allows users to ask Ellipsis to make changes to the PR, answer a question, log a GitHub/Linear issue, etc.
A key design consideration is that because our workflows are asynchronous, latency is not nearly as big of a concern as accuracy.
The Review Agent(s)
The core principle of prompt engineering is: to increase performance, make the problem easier for the LLM to solve. So, instead of having one mega-agent with a gargantuan prompt, we have dozens of smaller agents that can be independently benchmarked and optimized.
First, several Comment Generators run in parallel to find different types of issues. For example, one Generator may look for violations of the customer’s custom rules, another may search the codebase for duplicated code, etc. This also lets us mix and match different models - why choose between GPT-4o and Sonnet-3.6 when you can have both? Generators can include Evidence (links to code snippets) with their comment, which are useful later.
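Roughly, the fan-out looks like this (the generator functions and their prompts are placeholders):

```python
# Sketch of fanning out Comment Generators across models in parallel.
# The generators here are placeholders for the real prompt-specific agents.
import asyncio

async def custom_rules_generator(diff: str, model: str) -> list[dict]:
    # Placeholder: prompt an LLM to check the diff against the customer's rules.
    return []

async def duplicate_code_generator(diff: str, model: str) -> list[dict]:
    # Placeholder: search the codebase for code the diff appears to duplicate.
    return []

async def generate_draft_comments(diff: str) -> list[dict]:
    tasks = [
        custom_rules_generator(diff, model="gpt-4o"),
        custom_rules_generator(diff, model="claude-3-5-sonnet-20241022"),
        duplicate_code_generator(diff, model="claude-3-5-sonnet-20241022"),
    ]
    # One failed Generator shouldn't sink the others (see "Plan for the failure case").
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [comment for r in results if not isinstance(r, Exception) for comment in r]
```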
Second, we run a multistage Filtering Pipeline, which significantly reduces the false positive rate (developers’ most common complaint about AI code review tools). We also use this pipeline to make small edits/tweaks to comments, such as in the line number or inline code `suggestions`.
In pseudocode, the pipeline looks roughly like this (a simplified sketch; the helper names are illustrative, and each filter is its own independently benchmarked step):
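```python
# Simplified sketch of the Filtering Pipeline. The helpers map onto the filters
# described below; filtered comments are kept, with reasoning, for the final output.
def run_filter_pipeline(draft_comments):
    comments = deduplicate(draft_comments)                      # Generators can overlap
    comments, dropped = check_logical_correctness(comments)     # uses the attached Evidence
    comments, also_dropped = apply_customer_feedback(comments)  # embedding search over past reactions
    comments = polish(comments)                                 # fix line numbers, inline suggestions
    return comments, dropped + also_dropped
```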
The first filter is to deduplicate similar comments (especially important given the Generators can sometimes overlap).
The Logical Correctness filters are especially important for catching outright hallucinations from the Generators; they leverage the Evidence attached to each draft comment.
We include filtered comments and reasoning in our final output so users can have a sense of what Ellipsis found suspicious and why something wasn’t posted.
Using Customer Feedback
Feedback (through thumbs up/down) also powers a filtering step: we run an embedding search over similar comments we’ve left in the past that users have reacted to. Users can also respond to our comments with an explanation, which makes it much easier for the LLM to understand why a comment was bad.
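A minimal sketch of that lookup, with `embed`, `feedback_index.query`, and `llm_yes_no` as hypothetical helpers:

```python
# Minimal sketch of the feedback filter: find similar past comments that were
# downvoted and ask an LLM whether the new draft repeats the same mistake.
def passes_feedback_filter(draft_comment: str, repo_id: str) -> bool:
    neighbors = feedback_index.query(
        namespace=repo_id,
        vector=embed(draft_comment),
        top_k=5,
    )
    downvoted = [n for n in neighbors if n["reaction"] == "thumbs_down"]
    if not downvoted:
        return True
    # The user's written explanations (when present) travel along with the neighbors.
    return llm_yes_no(
        "Is this draft comment materially different from these past comments the user disliked?",
        draft=draft_comment,
        past_comments=downvoted,
    )
```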
This approach has a ton of advantages over a per-customer fine-tuned model: it’s more consistent, it’s easier to maintain, and feedback is reflected in the agent’s behavior almost immediately.
The Code Search Agent
Both Comment Generation and Filtering steps leverage a Code Search subagent. This modular design is important to our overall architecture - our other workflows like Code Generation and Codebase Chat also leverage this same subagent, which can be benchmarked and improved independently.
We also index pull requests, both to answer direct questions such as “when did we do X?” and because the agents often use them to find examples of how similar code changes were made in the past.
Multi-step RAG+
The Code Search agent can take many steps, and leverages both keyword and vector search.
For indexing code, we have both a chunking-based method that uses tree-sitter to parse the AST into high-level pieces (such as functions or classes), and a per-file based method that embeds an LLM-generated summary. We find the former is better at finding specific code and functionality (“find a util function that does X”), while the latter is more helpful for higher-level questions (“where’s the database stuff”). GraphRAG/similar is on the horizon but not yet implemented.
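A sketch of the chunking half, assuming the py-tree-sitter bindings (parser setup is omitted since it varies by version and language; the node type names shown are for Python grammars):

```python
# Sketch of AST-based chunking with tree-sitter: extract top-level functions and
# classes and give each chunk a content-addressed ID (reused for incremental sync).
import hashlib

def chunk_source_file(tree, source: bytes, path: str) -> list[dict]:
    chunks = []
    for node in tree.root_node.children:
        if node.type in ("function_definition", "class_definition"):
            text = source[node.start_byte:node.end_byte]
            chunks.append({
                "id": hashlib.sha256(text).hexdigest(),
                "path": path,
                "start_line": node.start_point[0] + 1,
                "text": text.decode("utf-8", errors="replace"),
            })
    return chunks
```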
Limiting context is important because LLM performance degrades at large context sizes. The traditional RAG approach to limiting the number of search results fed to the LLM is: 1) search for results, 2) use a reranker to reorder them by relevance, 3) pick a cosine-similarity threshold and drop everything below it. For code search, however, the relative ranking matters far less than whether the retrieved code is actually useful in the first place, and cosine similarity does a very poor job of capturing that.
So, after an individual vector search, we run an LLM-based binary classifier which selects relevant pieces using additional context from the agent trajectory, which allows the top-level search agent to keep its context uncluttered as it makes additional passes.
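A sketch of that classifier; the prompt is heavily simplified and `llm_complete` stands in for whichever client you use:

```python
# Sketch of the per-result relevance classifier. The real prompt includes more
# of the agent's trajectory, and results are classified in parallel.
def keep_relevant(query: str, agent_context: str, results: list[dict], llm_complete) -> list[dict]:
    kept = []
    for result in results:
        verdict = llm_complete(
            "You are filtering code search results.\n"
            f"Search query: {query}\n"
            f"What the agent is trying to do: {agent_context}\n"
            f"Candidate snippet:\n{result['text']}\n\n"
            "Would this snippet actually help? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append(result)
    return kept
```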
Vector DBs and Efficient HEAD indexing
Despite latency not being a big concern, we don’t want the PR review to be blocked by a slow repository indexing job. We use Turbopuffer as our vector store, we don’t store any customer code, and we update our index very quickly when new commits are pushed to a repository. The key is avoiding re-embedding the entire repo every time. On new commits we:
Chunk the repo and take the SHA of each chunk as its ID
Add (obfuscated) metadata about the snippet location so it can be later retrieved without storing any code
Fetch the existing list of IDs in the vector DB namespace
(Optional) copy the namespace to reduce risk of temporary inconsistencies
Compare the new vs existing IDs
Delete IDs from the vector DB that no longer exist
Embed and upsert chunks with new IDs
In practice, because most commits touch only a small percentage of chunks, syncing on code changes takes just a couple of seconds.
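Roughly, with a generic `vector_store` client standing in for Turbopuffer (this is not its actual API), and `chunk_repo`/`embed`/`obfuscate` as the helpers described above:

```python
# Rough sketch of the diff-based index sync: only chunks whose content SHA is new
# get embedded, and chunks that disappeared at HEAD are deleted.
def sync_index(repo_path: str, namespace: str) -> None:
    chunks = chunk_repo(repo_path)                      # chunk IDs are content SHAs
    new_ids = {c["id"] for c in chunks}
    existing_ids = set(vector_store.list_ids(namespace))

    # Delete chunks that no longer exist at HEAD.
    vector_store.delete(namespace, ids=list(existing_ids - new_ids))

    # Embed and upsert only the chunks we haven't seen before.
    to_add = [c for c in chunks if c["id"] not in existing_ids]
    vector_store.upsert(
        namespace,
        ids=[c["id"] for c in to_add],
        vectors=[embed(c["text"]) for c in to_add],
        # Only obfuscated location metadata is stored, never the code itself.
        metadata=[{"loc": obfuscate(c["path"], c["start_line"])} for c in to_add],
    )
```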
Giving the agent a Language Server
When human developers explore code in an IDE, they constantly use features enabled by a language server, such as go-to-definition and find-all-references. However, implementing this across languages within the workers gets messy very quickly.
To solve this, we sidecar an Lsproxy container (from Agentic Labs) using Modal, which gives us a convenient high-level API for a variety of common language servers, and separates concerns nicely.
This allows our agents to use IDE-like tools for “clicking into” references, finding affected code paths, etc. LLMs are very bad at correctly identifying column numbers (and often off by one on line numbers even if you render code with them), so our tools let the agent specify the symbol name it’s trying to select, and then fuzzy match to the closest symbol of the same name.
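A simplified sketch of that resolution step (the real version searches across files and ranks candidates more carefully):

```python
# Simplified sketch: resolve an agent-specified symbol name plus an approximate
# line number to an exact (line, column) position the language server can use.
import re

def resolve_symbol(source: str, symbol: str, approx_line: int) -> tuple[int, int] | None:
    candidates = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in re.finditer(rf"\b{re.escape(symbol)}\b", line):
            candidates.append((lineno, match.start()))
    if not candidates:
        return None
    # Pick the occurrence closest to the (often slightly off) line the agent gave us.
    return min(candidates, key=lambda pos: abs(pos[0] - approx_line))
```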
Developer Workflow
Our blog from last year on developing LLM agents emphasized:
Placing extensive tests in CI
Using a cache for requests to both the LLM and the vector DB, for speed, cost, and determinism (we built our own with DynamoDB; a sketch follows below)
Keeping prompts in the code instead of a third-party service, to track changes
Rolling your own code instead of using open-source frameworks (e.g. LangChain)
The difficulty of automatically evaluating agent outputs, resulting in considerable reliance on snapshotting outputs and manual human reviews for quality
All except the last have held true. We’ve invested significantly more in building and automating evals and now rely less on snapshot testing. This is often tedious work, but we found it to be extremely high ROI.
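For illustration, the request cache from that list is conceptually just a content-addressed lookup. A sketch with boto3 (the table name and schema are made up):

```python
# Sketch of the LLM request cache: a content-addressed lookup in DynamoDB.
# Table name and schema are illustrative, not our production setup.
import hashlib
import json

import boto3

table = boto3.resource("dynamodb").Table("llm-request-cache")

def cached_llm_call(model: str, messages: list[dict], call_llm) -> str:
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()

    hit = table.get_item(Key={"cache_key": key}).get("Item")
    if hit:
        return hit["response"]  # deterministic (and free) replays in CI

    response = call_llm(model=model, messages=messages)
    table.put_item(Item={"cache_key": key, "response": response})
    return response
```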
Adding a new agent
Generally, our developer workflow looks something like this:
Write an initial prompt and tools (with enough experience these are often pretty good on first attempt)
Sanity check a handful of examples for vibes
Find a correctness measure
Build a mini-benchmark with ~30 examples. Can often just do so manually, or use an LLM to augment (“fuzz testing” is actually really important)
Measure accuracy. If it’s a really simple task, this might be good enough.
Diagnose failure classes, with the help of an LLM auditor
Update agent: tweak prompt, then add few shot examples, and fine-tune only if absolutely necessary
Generate more data
Repeat until plateau
Occasionally sample (public) prod data to identify new edge cases. Repeat diagnosis, data generation, update.
We haven’t found fine-tuning very applicable because 1) it slows down iteration speed, and 2) the latest models usually don’t have fine-tuning available.
Evals and Benchmarks
Where we get the eval data
Depending on the task, we might generate it manually, semi-synthetically, or by labeling (public) prod data. We’ve also had some luck with outsourcing dataset generation for some of the more general tasks (e.g. code search).
For Code Search, for example, we can generate Question-Answer pairs either with a human, or by feeding a repo to an LLM. The Answers come with attached Evidence, which can be deterministically checked against the evidence supplied by the agent.
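That check can be as simple as requiring each ground-truth evidence span to overlap one of the agent’s cited spans (a sketch):

```python
# Sketch of the deterministic Evidence check for Code Search evals: every
# ground-truth span must overlap at least one span cited by the agent.
def evidence_matches(expected: list[dict], actual: list[dict]) -> bool:
    def overlaps(a: dict, b: dict) -> bool:
        return a["path"] == b["path"] and not (
            a["end_line"] < b["start_line"] or b["end_line"] < a["start_line"]
        )

    return all(any(overlaps(exp, act) for act in actual) for exp in expected)
```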
To annotate data, we’ve built several in-house UIs. This has been necessary because most off-the-shelf solutions don’t work great if you have a lot of data and/or want to render it in a custom way (e.g. code syntax highlighting, rendering diffs), though we’re exploring some UI-builder platforms that can make this easier.
LLM as Judge
Ideally you can deterministically assess if an agent is correct - for example, when making file edits, you can simply check if the result file matches the expected file (with a bit of fuzzy handling of extra whitespace).
For fuzzier domains, using an LLM-as-judge to compare a result to the ground truth can remove the need for manual human review. This can be tricky if the result is large and complicated (such as an entire generated PR), but for many of the sub-agents, such as Code Search and Code Review, it “just works” (if you remember to use Chain of Thought). We find even GPT-4o is capable of doing this reliably on simpler tasks (such as codebase chat).
Here’s an illustrative example of such a judge prompt for Code Search (simplified):
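```
You are grading an AI agent's answer to a question about a codebase.

Question: {question}
Ground truth answer: {expected_answer}
Ground truth evidence (file paths and line ranges): {expected_evidence}

Agent's answer: {agent_answer}
Agent's cited evidence: {agent_evidence}

First, think step by step about whether the agent's answer conveys the same
information as the ground truth, and whether its evidence points to the same
code. Minor wording differences are fine; missing or incorrect code locations
are not.

Then output a verdict on the final line: CORRECT or INCORRECT.
```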
This allows us to build reliable quantitative evals pretty easily, which really speeds up dev velocity.
Agent Trajectory Auditor
Because our internal tools make it trivial to define a new LLM pipeline or agent, there are many places where you can inject an LLM to speed up smaller parts of the dev workflow.
For example, to more quickly diagnose agent failures on the benchmark, we run an Auditor on failed test cases. This receives the agent trajectory and the correct answer, and tries to identify where in the trajectory the agent went wrong.
For example, when working on the Code Search agent, it's helpful to understand whether a failure comes from poor-quality vector search, hallucination, laziness, or some other issue. One failure we caught this way: Code Search decided to be lazy and gave up after a single failed search attempt.
We’ve found this is the kind of complicated task where the new o1 model shines.
Tips for Getting Good Performance
Plan for the failure case
LLMs are inherently probabilistic, and the non-zero probability of a bad response (for now?) means you should build in graceful error handling from the beginning. It took a while for me to internalize that this matters no matter how straightforward the task is or how well it seems to perform in testing.
So, we have error handling at a variety of levels:
Simple retries and timeouts on LLM calls
If a particular model goes down, we fall back to another (e.g. Sonnet-3.6 => GPT-4o)
If an agent fails to call a tool or gets the tool arg format wrong, feed the error back in
Extensively validate the args for logical correctness, and feed those errors back in too
Failures in one tool should not break the agent; if the Vector DB goes down, the agent is still able to use keyword search.
Failures in one agent should not break the pipeline; if one Review Comment Generator fails, the others can still submit comments.
If there’s an unexpected bug, give the agent a chance to gracefully exit; Code Search handles internal errors and timeouts by telling the agent “Whoops, unexpected error! Try submitting an answer based on what you’ve learned so far.”
If all LLM providers are down, still return a descriptive message back to the user.
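A minimal sketch of the first two layers (retries plus model fallback), with `call_model` standing in for the provider clients:

```python
# Minimal sketch of retries with model fallback. Real code also distinguishes
# provider errors from our own bugs and handles rate limits per provider.
import time

FALLBACK_CHAIN = ["claude-3-5-sonnet-20241022", "gpt-4o"]

def complete_with_fallback(messages: list[dict], call_model, retries_per_model: int = 2) -> str:
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        for attempt in range(retries_per_model):
            try:
                return call_model(model=model, messages=messages, timeout=60)
            except Exception as e:        # narrow this to provider errors in real code
                last_error = e
                time.sleep(2 ** attempt)  # simple exponential backoff
    # Everything is down: surface a descriptive message to the user instead of silence.
    raise RuntimeError(f"All LLM providers failed: {last_error}")
```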
Managing Long Context
Despite the recent blessing of much longer context lengths, LLMs still suffer from performance degradation as context length increases.
For most of our coding tasks, a general rule of thumb is that we start to see noticeably more hallucinations when more than half the context is filled, though there’s a lot of variance.
Thus, we have a variety of heuristics around hard and soft token cutoffs, including simply truncating early messages, giving certain messages low priority, self-summarizing, and tool-specific summaries. For example, if a shell command returns a large number of tokens, we use GPT-4o with Predicted Outputs to summarize/extract the important pieces.
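A sketch of the simplest of these, a soft cutoff that truncates early messages (token counting via tiktoken; priorities and self-summarization are omitted):

```python
# Sketch of a soft token cutoff: drop the oldest droppable messages once the
# history exceeds the budget, keeping the system prompt and the newest messages.
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")

def trim_history(messages: list[dict], budget: int = 60_000) -> list[dict]:
    def tokens(msg: dict) -> int:
        return len(ENC.encode(msg["content"]))

    trimmed = list(messages)
    total = sum(tokens(m) for m in trimmed)
    # Index 0 is assumed to be the system prompt; drop from the front after it.
    while total > budget and len(trimmed) > 2:
        total -= tokens(trimmed.pop(1))
    return trimmed
```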
The Big Hammer
If performance is hitting a plateau, there are two big hammers you can reach for (really two sides of the same coin):
Split a complicated agent into simpler subagents, which can be independently benchmarked. Just like in normal software architecture, composability is key.
Run agent(s) a bunch of times in parallel, and use a judge to select the best candidate. Variations could come from temperature, slightly different prompts, different fewshot examples, or just a completely different model altogether. Often, the variance from just a single trivial-looking change to the prompt is enough to get a boost.
However, the tradeoff here is that the system is harder to maintain, and reaching for these too early can mask easy-to-fix issues.
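As an illustration of the second hammer (best-of-N with a judge), here’s a sketch where `run_agent` and `judge_pick_best` are placeholders for the agent under test and a judge like the one described earlier:

```python
# Sketch of best-of-N sampling: run several variations in parallel and let an
# LLM judge pick the best candidate. The variation configs are illustrative.
import asyncio

VARIANTS = [
    {"model": "claude-3-5-sonnet-20241022", "temperature": 0.2},
    {"model": "claude-3-5-sonnet-20241022", "temperature": 0.8},
    {"model": "gpt-4o", "temperature": 0.2},
    {"model": "gpt-4o", "temperature": 0.8},
]

async def best_of_n(task: str) -> str:
    results = await asyncio.gather(
        *(run_agent(task, **cfg) for cfg in VARIANTS),
        return_exceptions=True,
    )
    candidates = [r for r in results if not isinstance(r, Exception)]
    # The judge sees all candidates side by side and returns the index of the best one.
    best_index = await judge_pick_best(task, candidates)
    return candidates[best_index]
```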
Which models to use?
We use primarily Sonnet-3.6 (technically `claude-3-5-sonnet-20241022`, but it feels different enough from the previous 3.5 to get its own name) and GPT-4o, and occasionally their smaller counterparts. We find Sonnet to be slightly better than 4o at most tasks.
Claude and GPT models used to feel very different to prompt, but they now seem to be converging - we can usually take a prompt tuned for 4o and swap it to Sonnet without any changes and still see a performance benefit.
How about o1? We’ve slowly started integrating it in a couple places and see qualitatively better results for many of our harder problems (though not a step change better than Sonnet). However, in our experience o1 has to be prompted very differently from the other models to get good performance; the kinds of things it messes up are very different, so it’s not a drop-in replacement.
What’s next
Graph Search
You can think of code review as a search over a graph, where nodes include code, PRs, GitHub issues, Slack messages, tribal knowledge, etc. Building this graph is straightforward, but traversing it is hard.
Think about how difficult it would be for you, as a human, to be dropped into a totally unknown codebase to review a medium-sized PR. Even with an hour to search the codebase, it’s quite difficult to identify bugs that require context more than a couple of hops from the diff - which is why teams typically bring in a human who already has all the relevant nodes cached in memory (“the senior engineer”).
We’re exploring a number of ways to better traverse this graph to give much deeper reviews - stay tuned!
Code Generation
You can already ask Ellipsis to make small code changes (from a PR, Slack, GitHub/Linear issue), but we are beta testing user-configurable sandboxes that allow Ellipsis to build/lint/test your code and give you much higher quality results. If you’re interested in trying it out, let us know!
PS: You can install Ellipsis in <30 seconds at https://ellipsis.dev