Lessons from 15 months of building LLM agents

Lessons from 15 months of building LLM agents

Nick Bradford

|

Commentary

I’ve spent the past 15 months building LLM agents, currently Ellipsis, a virtual software engineer. Previously, I worked on structured data extraction, codebase migrations, and text-to-SQL.

As a result, startups working on LLM observability/safety/reliability reach out to me several times per week to hear about my problems, so I’m consolidating my experience and advice here.

[GPT-4V] Create me a picture: "building an AI agent" top-down, 8bit, laboratory. In the center is a robot on an operating table, dressed in a black business suit, but half complete. Scientists around are working on it. The color palette is dark, with deep purples and dramatic light. [NB: two scientists appear to be decapitated.]


What does my current setup look like?

My workflow has looked remarkably similar across the varied LLM agents I’ve worked on.

Evals in CI

Most of my day is spent running evals; everything else flows from this. Why?

  • Prompts are brittle. You add an extra space or remove a comma somewhere, and the LLM output may be completely different, i.e. degraded on an important use case.

  • Small changes accumulate: A single agent run might have dozens, hundreds, or thousands (!) of LLM calls, so the actual outputs of a small change must be verified empirically.

  • All the normal reasons you have tests: they’re great. Run the tests, tests pass, you ship. Ship ship ship.

How do evals work? There are two main kinds of evals:

  • “Unit tests”: Verify a small piece of the agent internals; for example, verify the agent calls a particular tool when it’s the obvious choice. Good for testing obscure scenarios. Easy to verify expected behavior.

  • Integration tests”: a complicated scenario end to end. Much more valuable because it tells you “does it actually work”. It’s often hard to write assertions for these because the outcome is amorphous, such as an answer to an ambiguous question.

So, the big question:

How do you tell if the agent was successful?

Here are some ideas:

  • Problem-specific heuristics: Often the agent’s objective has some other characteristics you can use to check. If it was writing a SQL query: does the query execute? If it was writing code: does the code pass tests? If it was extracting data: are all the expected fields present?

  • LLM self-evaluation: great idea. But here’s the thing: any reasonably advanced agent includes a self-critique sub-agent that it will use internally. This means your agent (probably) produced something that already passed your best LLM tests. (If your tests are finding room for improvement, they should be part of your agent!).

Sadly, this just doesn’t cut it most of the time. There are too many degenerate edge cases. There’s only one solution I’ve found:

*snapshot the outputs and read them manually*

This is a huge pain. 

And if your tests check the exact outputs, they need to be *exactly* the same, so you’re going to need:

Caching

This is one of the first things I set up on a new agent project. A cache is needed for your evals to be:

  • Deterministic: you want to stay sane. GPT-4 is non-deterministic at temperature zero (due to Sparse MoE).

  • Fast: a roundtrip to OpenAI can take seconds or minutes, so an agent might run for minutes or even hours - a cache speeds this up by 100x.

  • Cheap: at $0.06/1k input tokens, a fully maxed-out call to GPT-4-32k costs $1.92 (up to $3.84 for large outputs). If you experiment with agents enough, you’ll find there are many ways to blow through $1k in a few minutes of experiments.

There are many caching providers out there, but I’ve always rolled my own (I did try Helicone and hit various issues). Why? 

  • You do NOT want another service in between you and your LLM provider. LLM providers are unreliable enough as it is. (PS: use the Azure endpoints instead of OpenAI.)

  • A cache is trivial to set up and maintain.

So, I’ve built the exact same thing for every LLM project I’ve been on: stable serialize the request => hash it to get a key => stuff in a key/value store. You can build your own in Postgres in minutes.

Observability, at long last

OK, you ran your evals (or a customer ran something in prod), and your agent produced some hot garbage. Time for the fun part.

Getting an intuitive feeling for your agent

Over time I find you get quite a good gut feeling for what your agent is (or should) be doing, just as with acclimating to a large new codebase or getting to know a literal human being.

Mostly, this comes from reading a LOT of:

Logs

I find good “normal” logging is even more valuable than fancy observability tools. The other AI engineers I know all agree: you spend a LOT of time poring over logs.

However much data you’re logging, it’s not enough, even if you take into account this blog post. (I promise I don’t work for DataDog and am not trying to convince you to spend $65M/year on observability.)

In a complicated agent, the “root cause” of an issue is often in the agent’s tools or environment, not in the prompts itself. But sometimes the agent “just did something dumb”, and for that you’ll need:

LLM request UI

You need a nice UI to view agent conversation histories. I use PromptLayer - simple, fast, reliable. I’ve tried a few others and hit misc issues.

Some other platforms have agent-first UIs (LangSmith and W&B come to mind) - haven’t tried them because it just hasn’t been a priority to have a nice visualization, even with thousands of LLM calls in a single agent run.

From viewing the conversation history, you find the spot where the agent went off the rails, and you ask yourself, “what could I have changed about the prompt to get it to do the right thing?” For that, you’ll head into:

Prompt Playground

Most of the LLM request UIs seem to have one built-in now. If you prefer copy-pasting into a different tab, OpenAI has one, Vercel built one to compare different models, loads exist. This is super useful.

You play around in the playground a few minutes, you find a change that works (maybe changing the system prompt, maybe it’s adding a new tool), and now…

Full circle! You need to add a new eval, and then run all your old evals to make sure we didn’t hopelessly break all our other use cases (which happens far too often).

Things I don’t need

Prompt Library / Prompt Templates

Everyone wants me to put my prompts in their database so I can change it in their UI. This makes sense for some products, such as if you have some simple chatbot and you want a PM to be able to tweak the prompt without having to touch any code.

For agents, it’s a complete non-starter:

  • Prompts must be in version control for your evals to be reliable

  • Agents are composed not of a single prompt, but dozens or hundreds of smaller prompts: various tools, sub-agents, and error-handling cases.

Langchain and similar libraries

There are many cool libraries for building LLM agents, and if I only have 30 minutes to prototype something, they can be useful.

However, I’ve yet to find a real use case for them outside of demos. LLM agents are pre-paradigm with no killer apps (yet), and building abstractions for applications that don’t exist is quite difficult.

  • Customizations to your agent are always necessary.

  • The libraries have hooks in all the wrong places, and many simple behaviors are completely inexpressible.

  • Agents are just a while-loop; you can write your own in around as much time as it takes to learn the API of whatever library, and it’ll be far simpler and easier to customize.

For now, the wrong abstraction is far worse than no abstraction. 

Things I’ll pay someone to build
Here are some things I’d love:
  • Fuzz testing: prompt instability is a huge problem, and a constant cause for worry. Are all my prompts stuck in a super unstable local optima, which make changes extremely difficult? Try out hundreds of small variations and measure the variance in the results.

  • Prompt optimization: there’s now tons of literature on this. 

  • Auditor agent: a first step would be “find the point in the conversation where things went wrong”, useful for long agents.

  • [Edit 2023-12-03] Observability for Embeddings: The use case is debugging RAG workflows and evaluating different embedding models. For example, Nomic Atlas has a great visualizer for exploring embedding spaces, but it’s meant for static datasets, not for production use cases.

Conclusion

For now, I think just like “great ML engineers spend a lot of time looking at their data”, AI engineers [have to] spend a lot of time reading agent logs and manually inspecting results.

If you are building something that would improve this workflow, I would love to chat.



Edits

A few people have asked about the desirability of determinism, and the assumptions inherent therein. The reason for determinism is rooted in agent reliability, with major exceptions:

- building a product where you want high variance (like writing poems)

- building in a pass@k-style architecture, where you generate many results and choose the best one

- intrinsically unsolvable by agent (like requiring non-indexed data to answer)

I think "human in the loop" (or "agent in the loop") workflows will be around for many years yet, and the UX is the key, just as how ChatGPT was fundamentally a UX advance to make LLMs easy to interact with.

team@ellipsis.dev

215 Park Ave S, Floor 11, Suite 42

New York, NY, 10003

Ellipsis - AI code reviews & bug fixes | Product Hunt

Copyright ©2024, Ellipsis AI Inc.

team@ellipsis.dev

215 Park Ave S, Floor 11, Suite 42

New York, NY, 10003

Ellipsis - AI code reviews & bug fixes | Product Hunt

Copyright ©2024, Ellipsis AI Inc.

team@ellipsis.dev

215 Park Ave S, Floor 11, Suite 42

New York, NY, 10003

Ellipsis - AI code reviews & bug fixes | Product Hunt

Copyright ©2024, Ellipsis AI Inc.