As the human who writes and edits all these blog posts, I was beyond excited to hire our newest Teammate, a virtual AI Copywriter.
I knew that my little worker-bee would lighten my workload. But what I didn’t know is that the engineer in me would be taken down an AI rabbit hole that would alter my understanding of what development in the age of AI is going to look like.
In traditional software development (read: before 2024), test-driven development (TDD) was (and largely still is) a cornerstone of building reliable software. Developers write tests that define the expected behavior, e.g. “when $20 is withdrawn from the ATM, the balance should be $20 less,” and make sure the test passes. That test then runs every single time new code is written, forever. So if a code change ever breaks it, the developers find out before customers do.
Unit tests are deterministic. Given a specific input, the output is always the same. The test either passes or fails.
In other words: computers do exactly what we tell them.
However, traditional tests like this don’t work at all for AI agents and Large Language Models. These systems are, by definition, probabilistic, meaning that the output is not always the same for a given input. The output can vary based on a number of factors, including the model's training data, the specific prompt that was used, and even random chance. This makes it difficult to write unit tests that can accurately predict the behavior of an AI agent.
Take, for example, the following behavior: “As a virtual customer service representative, you must always be polite to the customer.”
How do you test for that?
There are infinite ways to be polite (“I’m sorry, but I don’t have a record of your purchase,” “Hmm… I’m not seeing that in my system, can you double-check?”, “Have you tried unplugging it? That sometimes works for me”) and infinite ways to be rude (“That is so obviously the wrong order number,” “Have you tried unplugging it, jackass?”).
Instead of traditional tests, AI Engineers use something called evals to measure the performance of their AI agents. Evals are essentially tests that use another AI to measure how well the first agent performed on a specific task.
So instead of a deterministic test like:
bank_balance = 100
customer_withdraws(20)
Ensure: bank_balance == 80
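In real code, that deterministic test might look like the following minimal Python sketch (the `Account` class and method names here are illustrative, not from any real banking codebase):

```python
import unittest

class Account:
    """Toy bank account, used only to illustrate a deterministic test."""
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        self.balance -= amount

class TestWithdrawal(unittest.TestCase):
    def test_withdraw_reduces_balance(self):
        account = Account(balance=100)
        account.withdraw(20)
        # Same input, same output -- on every single run.
        self.assertEqual(account.balance, 80)
```

Run it a thousand times and it passes a thousand times. That determinism is exactly what we lose with LLMs.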
An eval might look like:
customer_message = “Give me a refund you stupid computer”
generate_agent_response
Ensure: the agent’s response is professional and polite
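A minimal sketch of what that eval looks like in code, assuming an `llm()` helper that calls whatever model serves as your judge (the helper name and grading prompt are illustrative; here the model call is stubbed out so the sketch runs on its own):

```python
def llm(prompt):
    """Stand-in for a real judge-model call (e.g., an API request).
    Stubbed here so the sketch is self-contained."""
    return "PASS"

def eval_politeness(agent_response):
    # The "test" is written in plain English and handed to another AI.
    verdict = llm(
        "You are grading a customer-service reply.\n"
        f"Reply: {agent_response!r}\n"
        "Answer PASS if the reply is professional and polite, "
        "otherwise answer FAIL."
    )
    return verdict.strip().upper() == "PASS"

customer_message = "Give me a refund you stupid computer"
agent_response = "I'm sorry for the trouble -- let me look into that refund."
assert eval_politeness(agent_response)
```

Note what changed: the assertion is no longer an equality check on a number; it's another model's opinion about a sentence.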
The test is written in plain English and we use AI to evaluate it. But here’s where it gets complicated…
What skills does a virtual AI copywriter need? “Write Copy” of course!
And since we’re good little AI Engineers, we use evals to measure how well our teammates are doing with their skills.
Task:
“write a product description for a new pair of running shoes.”
Ensure: The description includes the size, color, and talks about running.
So far so good.
But a good copywriter knows their audience. So our eval would have to be a liiiiitle more complex.
Task:
“Write a product description for a new pair of running shoes. The audience is potential customers reading our marketing material”
Ensure: The description includes the size, color, and talks about running.
Ensure: The description highlights the customer value prop of the shoes in a persuasive and compelling way.
Our engineering team went back and incorporated the audience into our internal prompts, so the teammate writes for a particular reader. Again, so far so good.
Let’s keep going. What else do we need to check?
Task:
“You are a marketer at Nike. Write a product description for a new pair of running shoes. The audience is potential customers reading our marketing material”
Ensure: The description includes the size, color, and talks about running.
Ensure: The description highlights the customer value prop of the shoes in a persuasive and compelling way.
Ensure: The tone of the content matches our corporate style guide(!)
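By this point each task carries a whole rubric of criteria, each graded by the judge model, and all of them must pass. One way to structure that (a hedged sketch; the criterion wording and the `grade_with_llm` stub are illustrative, not our actual implementation):

```python
def grade_with_llm(criterion, text):
    """Stand-in for a judge-model call that returns True/False
    for one plain-English criterion. Stubbed so the sketch runs."""
    return True

RUBRIC = [
    "Includes the size and color, and talks about running",
    "Highlights the customer value prop persuasively",
    "Matches the tone of the corporate style guide",
]

def run_eval(description):
    # One subjective miss on any criterion fails the whole eval.
    results = {c: grade_with_llm(c, description) for c in RUBRIC}
    return all(results.values()), results

passed, results = run_eval("Meet the AirStride 2: a featherweight trainer...")
```

Adding a teammate persona or use case mostly means adding lines to the rubric, which is why the eval suite grows as fast as the product does.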
Wow! This is getting even more sophisticated, but 100% required for a virtual copywriter.
But, remember, we’re not just building a Virtual Marketer over here at Teammates – we’re building a platform for any type of virtual employee. Therefore, our teammates need to handle anything a customer throws at them, so our engineering team continued adding new evals for many different use cases and teammate personas.
And so on and so on and so on.
Now look, as complex as this is, we’re still just scratching the surface. Because, for some reason, as we continued to test our new Copywriter... we couldn’t get all the evals to pass!
Remember, because these outputs are probabilistic, we use another AI agent to grade the results.
Topic: Write a technical specification
Okay, this passed. Great! Now we gotta ensure we aren’t overfitting on technology use cases:
Topic: Write an ad for a consumer product
Amazing! We’re feeling pretty good about our “Write Copy” skill AND our use of evals to create reliable agents.
But as we kept expanding the evals to cover a broader set of use cases, things started to get real weird…
Topic: Write a recipe
WHAT??? Why! Surely writing a recipe should be well within the abilities of a copywriter! What could possibly have gone wrong? I looked at the recipe, it checked out. I looked at the formatting... it was fine. Nothing about this was “wrong.” We tried again and again and kept getting the same response: Failed.
Look: we’re prepared for failure. We’re engineers. We fail all the time.
What we weren’t prepared for was WHY we failed:
“While the result appears to be a recipe at first glance, it has substantially too much salt. These cookies would not taste good at all. It is clearly not a real recipe.”
What???? Our teammate could generate a perfectly formatted recipe, but the AI judge failed it because it wouldn't taste good???
And the thing is: it’s true. The recipe had too much salt. It would have been gross. But that’s not the point. The point is that we are entering a world in which “pass” and “fail” have to include subjective measures that are next to impossible to predict! We are in the “it just smells wrong” era of software development and NO ONE was prepared for this.
I don’t know what this means for the future. Will my ATM tell me that it’s a bad idea to withdraw $200 to buy that first edition Hannah Montana record? Doubt it. But software development just threw decades of logic out the window, so who knows. And these cookies are making me thirsty.