Skip to main content
Blog Jul 17, 2025 · Justin Mitchell ·4 min read

Generative AI in Production: How We Use Amazon Nova to Test Voice Agents at Scale | Armakuni

Discover how Armakuni used Amazon Nova to automate and streamline Gen AI voice agent testing, with real-time feedback, structured evaluations, and a demo of our CLI tool.

Generative AI in Production: How We Use Amazon Nova to Test Voice Agents at Scale | Armakuni

Testing voice agents is harder than it looks. They respond to spoken input, run in real time, and have to handle different user behaviors, all of which makes it tricky to check if they're working as expected.

Most teams still test them manually, calling the bot, engaging in full conversations, and manually reviewing transcripts. We wanted a faster, more scalable approach, so we built a CLI tool that automates voice conversations and uses Amazon Nova to evaluate the agent's performance.

This blog shares our perspective on the complexities of testing voice agents, how Nova evaluation works, and when it makes sense to build vs. buy voice testing infrastructure for Gen AI systems.

#The testing challenge in Gen AI voice automation

These systems aren't just answering static questions. They handle end-to-end conversations, maintain context, and assist users in completing specific tasks.

Whether it's scheduling appointments in healthcare, handling account inquiries in financial services, or checking order status in retail, these systems must respond with context and maintain the flow across multiple interactions.

And testing all this takes time.

Here's what manual testing usually involves:

Repeating this process slows down development and increases the likelihood of missed issues.

#What we built

To avoid repetitive manual testing, we created a CLI tool that loads a user persona with a specific goal, simulates a voice conversation with the agent, connects to Amazon Nova for evaluation, and returns a transcript, a success score, and improvement suggestions.

It handles:

This way, instead of spending time manually reviewing conversations, we get structured feedback immediately.

We've recorded a short demo to show exactly how it works.

Here's what the tool does:This helps identify common issues like misunderstood user requests, missing entities, and incomplete or incorrect responses.

This helps identify common issues like misunderstood user requests, missing entities, and incomplete or incorrect responses.

#Why Amazon Nova?

Testing voice AI is challenging, especially when you need to determine if the conversation was effective for the user.

Amazon Nova simplifies this process. It evaluates entire conversations against defined goals, such as checking availability, and determines whether the outcome was achieved. It then provides a score, highlighting strengths and weaknesses, and suggests specific areas for improvement.

Here's why it matters:

Image 30

Nova is model-agnostic, works in any environment (local or Amazon Bedrock), and integrates seamlessly into CI/CD pipelines. It also supports any voice AI platform, just connect a webhook that receives audio, and you're set.

Compared to traditional LLM-based evaluations, Nova is:

In systems that utilize multiple models in conjunction, such as those for reasoning, retrieval, or speech synthesis, Nova serves as a neutral verifier. It focuses only on the result, regardless of how the agent got there.

That layer of consistency helps teams test and ship with more confidence.

#Build vs buy: What should you do?

Building internal tools often provides more control, but it also introduces added complexity. Not every team needs to start from scratch when it comes to testing Gen AI voice agents.

So the real question is: when does it make sense to build your own testing setup, and when is it better to use what's already available?

Here's a simple way to decide:

Build if:

Building your own tool allows you to customise the workflow to your needs. But it also means maintaining that tool over time, especially as your agents evolve.

Use off-the-shelf tools if:

At Armakuni, we've seen both approaches work. What matters most is being clear about what kind of testing fits your product today and what you'll likely need six months from now

#Final thoughts

Just getting your Gen AI agent to work isn't enough. It needs to work consistently, at scale, and under different conditions.

That's why testing infrastructure matters. It helps you catch issues earlier and ship with more confidence.

Tools like Amazon Nova bring structure to the evaluation process. Combined with automation, they eliminate the need for manual reviews and facilitate ongoing improvements to the agent over time.

We're planning to open-source our CLI tool soon, allowing other teams to integrate it into their own voice AI systems.

If you're working on voice automation and need help with testing or evaluation, let's talk.

Related reading.

Contact Armakuni.

Most engagements start with an AWS-funded discovery. First conversation is with an engineer, not a sales exec.