Generative AI in Production: How We Use Amazon Nova to...

Testing voice agents is harder than it looks. They respond to spoken input, run in real time, and have to handle different user behaviors, all of which makes it tricky to check if they're working as expected.

Most teams still test them by hand: calling the bot, talking through full conversations, and reviewing transcripts line by line. We wanted a faster, more scalable approach, so we built a CLI tool that automates voice conversations and uses Amazon Nova to evaluate the agent's performance.

This blog shares our perspective on the complexities of testing voice agents, how Nova evaluation works, and when it makes sense to build vs. buy voice testing infrastructure for Gen AI systems.

#The testing challenge in Gen AI voice automation

These systems aren't just answering static questions. They handle end-to-end conversations, maintain context, and assist users in completing specific tasks.

Whether it's scheduling appointments in healthcare, handling account inquiries in financial services, or checking order status in retail, these systems must respond with context and maintain the flow across multiple interactions.

And testing all this takes time.

Here's what manual testing usually involves:

Creating test scenarios
Calling the voice bot
Talking through the entire conversation
Reviewing the transcript line by line
Verifying if the task was completed

Repeating this process slows down development and increases the likelihood of missed issues.

#What we built

To avoid repetitive manual testing, we created a CLI tool that loads a user persona with a specific goal, simulates a voice conversation with the agent, connects to Amazon Nova for evaluation, and returns a transcript, a success score, and improvement suggestions.

It handles:

Voice playback in the right time zone
Success and failure paths
A fixed number of turns
Auto-ending the call
Automated evaluation using Nova

This way, instead of spending time manually reviewing conversations, we get structured feedback immediately.
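To make the turn limit and auto-ending behavior concrete, here is a minimal sketch of how such a loop could be structured. It is not our CLI's actual code: `VoiceTestCase`, `Transcript`, `run_conversation`, and the `end_phrases` heuristic are hypothetical names chosen for illustration, and the caller and agent are plain text functions standing in for the audio pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class VoiceTestCase:
    """Hypothetical test case: who the caller is and what they want."""
    persona: str
    goal: str
    max_turns: int = 6                                # fixed turn budget
    end_phrases: Tuple[str, ...] = ("goodbye", "bye")  # assumed end-of-call cue


@dataclass
class Transcript:
    """Ordered (speaker, text) pairs collected during the call."""
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))


def run_conversation(
    case: VoiceTestCase,
    caller: Callable[[VoiceTestCase, str], str],  # simulated user
    agent: Callable[[str], str],                  # voice agent under test
) -> Transcript:
    """Alternate caller and agent turns until the budget runs out or the agent ends the call."""
    transcript = Transcript()
    agent_reply = ""
    for _ in range(case.max_turns):
        user_text = caller(case, agent_reply)
        transcript.add("user", user_text)
        agent_reply = agent(user_text)
        transcript.add("agent", agent_reply)
        # Auto-end: stop as soon as the agent signals the call is over.
        if any(phrase in agent_reply.lower() for phrase in case.end_phrases):
            break
    return transcript
```

In a real run, `caller` would synthesize speech for the persona and `agent` would wrap the live voice bot; keeping both behind simple callables makes it easy to script success and failure paths.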

We've recorded a short demo to show exactly how it works.

Here's what the tool does:

Loads a test case (e.g., check if someone's available at 10 a.m.)
Simulates the full voice conversation
Uses Nova to evaluate if the goal was met
Returns a score and highlights what went wrong

This helps identify common issues like misunderstood user requests, missing entities, and incomplete or incorrect responses.

#Why Amazon Nova?

Testing voice AI is challenging, especially when you need to determine if the conversation was effective for the user.

Amazon Nova simplifies this process. It evaluates entire conversations against defined goals, such as checking availability, and determines whether the outcome was achieved. It then provides a score, highlighting strengths and weaknesses, and suggests specific areas for improvement.
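As a rough illustration of goal-based evaluation, the sketch below builds an evaluation prompt from a goal and a transcript, then parses a structured verdict. The prompt wording, the JSON fields (`goal_met`, `score`, `suggestions`), and the helper names are all assumptions made for this example. The model call is injected as `ask_model`, a hypothetical callable that would wrap a request to a Nova model (for instance through the Amazon Bedrock Runtime API), so the example itself makes no network calls.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Evaluation:
    goal_met: bool
    score: float            # clamped to the range 0.0 to 1.0
    suggestions: List[str]


def build_eval_prompt(goal: str, turns: Sequence[Tuple[str, str]]) -> str:
    """Render the goal and transcript into a single evaluation prompt (assumed wording)."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    return (
        "You are evaluating a voice agent conversation.\n"
        f"Goal: {goal}\n"
        "Transcript:\n" + "\n".join(lines) + "\n"
        'Reply with JSON only: {"goal_met": true|false, "score": 0.0-1.0, "suggestions": ["..."]}'
    )


def parse_evaluation(raw: str) -> Evaluation:
    """Extract the JSON object from the model's reply, tolerating surrounding text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in evaluator reply")
    data = json.loads(raw[start:end + 1])
    score = min(1.0, max(0.0, float(data["score"])))
    suggestions = [str(s) for s in data.get("suggestions", [])]
    return Evaluation(bool(data["goal_met"]), score, suggestions)


def evaluate(
    goal: str,
    turns: Sequence[Tuple[str, str]],
    ask_model: Callable[[str], str],  # hypothetical wrapper around the Nova call
) -> Evaluation:
    return parse_evaluation(ask_model(build_eval_prompt(goal, turns)))
```

Keeping the prompt builder and parser separate from the model call means both can be unit-tested with a stubbed `ask_model` before any real evaluation is run.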

Here's why it matters:

Brings structure to a typically subjective process
Identifies failure patterns faster
Delivers consistent, repeatable results

Nova is model-agnostic, works in any environment (local or Amazon Bedrock), and integrates seamlessly into CI/CD pipelines. It also supports any voice AI platform: connect a webhook that receives audio, and you're set.
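One way a CI/CD pipeline could consume evaluation results is a simple gate that fails the build when any test case misses its goal or scores too low. The sketch below is an assumption about how that might look, not part of our tool: the result tuples, the `failing_cases` and `exit_code` helpers, and the 0.8 threshold are all illustrative.

```python
from typing import Iterable, List, Tuple

# Each result is (test case name, goal met?, score between 0.0 and 1.0).
Result = Tuple[str, bool, float]


def failing_cases(results: Iterable[Result], min_score: float = 0.8) -> List[str]:
    """Return the names of cases that missed the goal or scored below the threshold."""
    return [
        name
        for name, goal_met, score in results
        if not goal_met or score < min_score
    ]


def exit_code(results: Iterable[Result], min_score: float = 0.8) -> int:
    """Map results to a process exit code: 0 passes the pipeline, 1 fails it."""
    return 1 if failing_cases(results, min_score) else 0


# In a CI job, the script would typically end with something like:
#     sys.exit(exit_code(results))
```

Because most CI systems treat a non-zero exit code as a failed step, this is usually enough to block a merge or deployment without any platform-specific integration.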

Compared to traditional LLM-based evaluations, Nova is:

Faster and cheaper
Easier to integrate
More transparent

In systems that combine multiple models, such as separate models for reasoning, retrieval, and speech synthesis, Nova serves as a neutral verifier. It focuses only on the result, regardless of how the agent got there.

That layer of consistency helps teams test and ship with more confidence.

#Build vs buy: What should you do?

Building internal tools often provides more control, but it also introduces added complexity. Not every team needs to start from scratch when it comes to testing Gen AI voice agents.

So the real question is: when does it make sense to build your own testing setup, and when is it better to use what's already available?

Here's a simple way to decide:

Build if:

You need full control over scoring logic or persona flow
You're using multiple LLMs and want a unified evaluation tool
You want to embed voice testing directly into your CI/CD pipelines
You need to simulate different user contexts or test for edge cases that aren't easily handled by general tools

Building your own tool allows you to customise the workflow to your needs. But it also means maintaining that tool over time, especially as your agents evolve.

Use off-the-shelf tools if:

You're working with simple, script-based chat logic
Your voice agent handles limited flows with clear, static responses
You don't require integration with multiple model providers
Your testing needs are occasional, or your pipelines aren't automated

At Armakuni, we've seen both approaches work. What matters most is being clear about what kind of testing fits your product today and what you'll likely need six months from now.

#Final thoughts

Just getting your Gen AI agent to work isn't enough. It needs to work consistently, at scale, and under different conditions.

That's why testing infrastructure matters. It helps you catch issues earlier and ship with more confidence.

Tools like Amazon Nova bring structure to the evaluation process. Combined with automation, they reduce the need for manual reviews and support ongoing improvements to the agent over time.

We're planning to open-source our CLI tool soon, allowing other teams to integrate it into their own voice AI systems.

If you're working on voice automation and need help with testing or evaluation, let's talk.