by Stephen Kaufman

Agentic AI design: An architectural case study

Explore the rise of agentic AI in this detailed case study. Learn how autonomous agents are transforming workflows, from design to implementation, and driving ROI.

Stephen Kaufman, Chief Architect, Microsoft

From obscurity to ubiquity, the rise of large language models (LLMs) is a testament to rapid technological advancement. Just a few short years ago, models like GPT-1 (2018) and GPT-2 (2019) barely registered a blip on anyone’s tech radar. But with the advent of GPT-3 in 2020, LLMs exploded onto the scene, captivating the world’s attention and forever altering the landscape of artificial intelligence (AI), and in the process, becoming an essential part of our everyday computing lives.  

There are many areas of research and focus sprouting from the capabilities presented through LLMs. In 2024, a new trend called agentic AI emerged. Agentic AI is the next leap forward beyond traditional AI to systems that are capable of handling complex, multi-step activities utilizing components called agents. LLMs by themselves are not agents. They have no goal. However, they are used as a prominent component of agentic AI. Agents will play different roles as part of a complex workflow, automating tasks more efficiently. Agentic AI doesn’t just respond; it learns, adapts and takes actions of its own.

As we look to identify uses for AI agents, we will find many opportunities. Those that leverage LLMs’ strengths, such as handling natural language tasks, automating repetitive processes and executing well-defined tasks, will be the most successful.

Why has agentic AI become the latest rage?  

The analyst firm Forrester named AI agents one of its top 10 emerging technologies this year and expects them to deliver benefits in the next two to five years.

Sam Altman, OpenAI CEO, forecasts that agentic AI will be in our daily lives by 2025.  

Kevin Weil, chief product officer at OpenAI, wants to make it possible to interact with AI in all the ways that you interact with another human being. He believes these agentic systems will make that possible, and he thinks 2025 will be the year that agentic systems finally hit the mainstream.  

With all this talk, you would think it is easy to define what qualifies as agentic AI, but it isn’t always straightforward. Let’s start with the basics: What is an agent? An agent is part of an AI system designed to act autonomously, making decisions and taking action without direct human intervention or interaction. Agents can handle complex tasks, including planning, reasoning, learning from experience and automating activities to achieve their goals.

You can use these agents through a process called chaining, where you break down complex tasks into manageable tasks that agents can perform as part of an automated workflow.  
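
Chaining can be sketched in a few lines of Python. This is a minimal illustration only: the agent names (research, draft, review) and the run_chain helper are hypothetical, and each agent is reduced to a plain function where a real agent would call an LLM or a tool.

```python
from typing import Callable, List

# An "agent" here is just a function that takes the previous step's
# output and returns its own. Real agents would call an LLM or a tool.
Agent = Callable[[str], str]


def research(topic: str) -> str:
    return f"notes on {topic}"


def draft(notes: str) -> str:
    return f"draft based on {notes}"


def review(text: str) -> str:
    return f"reviewed {text}"


def run_chain(agents: List[Agent], task: str) -> str:
    """Pass the task through each agent in order, feeding each
    agent's output to the next one."""
    result = task
    for agent in agents:
        result = agent(result)
    return result


if __name__ == "__main__":
    print(run_chain([research, draft, review], "agentic AI"))
```

Each step stays small enough to test on its own, which is the point of breaking the larger task down.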

An agentic AI system will be characterized through the following capabilities:  

  • Autonomy: Able to initiate and complete tasks without continual oversight. Agentic AI operates with limited, or no, direct human supervision or interaction. This allows greater flexibility of the activities and efficiency in executing each task.
  • Reasoning: Able to use sophisticated decision-making based on context.
  • Reinforcement learning: Able to evolve dynamically by interacting with the environment and receiving feedback from those interactions.
  • Language understanding: Able to comprehend and follow complex instructions.
  • Workflow optimization: Able to efficiently execute a multi-step process.   

Now that we have covered AI agents, we can see that agentic AI refers to the concept of AI systems being capable of independent action and goal achievement, while AI agents are the individual components within this system that perform each specific task.  

It’s important to break it down this way so you can see beyond the hype and understand what is specifically being referred to, especially with companies like Microsoft, OpenAI, Meta, Salesforce and others in the news recently with announcements of agentic AI and agent creation tools and capabilities.

Microsoft recently announced the release of Copilot agents. These are preprogrammed agents that can help with certain tasks. You can utilize these agents through Copilot Studio to help your organization build and deploy AI agents. These agents are already tuned to solve or perform specific tasks. Microsoft is describing AI agents as the new applications for an AI-powered world.  

But what if you are looking to build your own agentic AI solution with custom agents that are specific to the unique tasks required by your business? 

There are many reasons to build your own. A company that adopts agentic AI will gain competitive advantages in innovation, efficiency and responsiveness and may become more agile in operations. Investments in AI agent projects are expected to yield orders-of-magnitude returns in ROI and business value if companies select high-impact use cases. But this is where we must wade in slowly. The hype is all about how agents can be used to build complex and capable workflows. While that is true, your development teams may not be ready to implement them yet. The teams that start small and build up, learning, testing and separating the realities from the hype, will be the ones to succeed. Forrester, in its Predictions 2025: Artificial Intelligence report, predicted that three-quarters of companies that try to build AI agents in-house will fail. Don’t let that scare you off. We need to start with proofs of concept and small-scale, focused learning projects. That will help us achieve short-term benefits as we continue to learn and build better solutions. Let’s review a case study and see how we can start to realize benefits now.

Agentic AI design: A case study  

When you start doing agentic AI design, you need to break down the tasks, identify the roles and map those to specific functionality that an agent will perform. It is up to you whether you create agents that map to the roles or map to the specific functionality. For instance, if you want to create a system to write blog entries, you might have a researcher agent, a writer agent and a user agent. These might be self-explanatory, but no matter what, there must always be documentation of the system. Do you know what the user agent does in this scenario? Would you know that the user agent performs sentiment/text analysis?

In our real-world case study, we needed a system that would create test data. This data would be utilized for different types of application testing. The requirements for the system stated that we need to create a test data set that introduces different types of analytic and numerical errors. Twelve different scenarios need to be tested against, and the data files need to contain or be able to contain data that will exercise those 12 tests. In addition, the system needs to create different files that mimic the data sets or files customers submit. There can be up to eight different data sets or files. Each record in each file needs to have a correlation ID or primary/foreign key value to match and link across records in the files. These correlation IDs can be kept in a text file that the system will read and assign along with the created output.  

Then, the system needs to be able to create different amounts of records per file to mimic the number of transactions in the source system. The output of the system should be able to stress the end user application by producing different-sized test files. The requirement for the output is to be able to create files of 1,000, 10,000, 100,000 and 1,000,000 records.

Lastly, the system needs to keep track of the number of records in each file, the time it takes to create the output, the time it takes to process, the number of errors created per output test file by the 12 different test types, the number of errors correctly captured by the automated tests and other business-specific metrics. Some of these data points will come from the agentic AI system and some will be generated from the automation testing system.  

The customer team got to work, created a design and started on a proof-of-concept. They decided to design agents that would map to each of the eight different output data sets. Each agent would read the text file of correlation IDs and interact with the LLM to generate the output data. Eight different prompts were created, each tailored to the specific output data its agent was charged with generating. They tested the prompts, modified them to give better examples, changed the wording of what was being asked from the LLM and kept testing. Each prompt also included a variable intended to coax the LLM into creating the number of output records needed. If the LLM didn’t create enough output, the agent would need to run again.

As you read through the requirements and the activities that they performed, what are your thoughts? Do you see any issues? Do you agree with this design and implementation?  

Well, there were several issues to overcome. The first was getting the data for that many records output to the file. The second was a question no one had asked: How many times will the agent system run? Is this a one-time activity per file size? Will they be able to get the output data to represent all 12 tests in each single file? Or will they need to run the system and create a separate file for each of the 12 tests, multiplied by each of the eight files?

Lastly, the biggest question concerned one of the most immediately impactful non-functional aspects: cost. I asked them if they had calculated the cost of running the system. We paused the activities and got to work modeling the costs. I am sure that you can imagine the costs associated with running the system for 1 million records (records, not tokens), with 1 million output records in each of the eight files. Even at fractions of a penny per token, this ends up being tens to hundreds of thousands of dollars.
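
A back-of-the-envelope model makes the scale concrete. The sketch below is illustrative only: the tokens-per-record and price-per-million-tokens values are assumptions chosen for the arithmetic, not figures from the case study, and the model ignores prompt (input) tokens and reruns for short output.

```python
def estimate_generation_cost(
    records_per_file: int,
    num_files: int,
    tokens_per_record: int,
    price_per_million_tokens_usd: float,
    runs: int = 1,
) -> float:
    """Estimate the LLM output cost of generating every record directly."""
    total_tokens = records_per_file * num_files * tokens_per_record * runs
    return total_tokens / 1_000_000 * price_per_million_tokens_usd


# Assumed values for illustration only; not figures from the case study.
TOKENS_PER_RECORD = 100            # assumption: size of one output record
PRICE_PER_MILLION_TOKENS = 30.0    # assumption: $0.00003 per output token

# One pass: 1 million records in each of the eight files.
one_pass = estimate_generation_cost(
    1_000_000, 8, TOKENS_PER_RECORD, PRICE_PER_MILLION_TOKENS
)
# One pass per test type, if each of the 12 tests needs its own files.
per_test = estimate_generation_cost(
    1_000_000, 8, TOKENS_PER_RECORD, PRICE_PER_MILLION_TOKENS, runs=12
)

print(f"One pass over eight files: ${one_pass:,.0f}")   # $24,000
print(f"One pass per test type:    ${per_test:,.0f}")   # $288,000
```

Under these assumptions, a single pass lands in the tens of thousands of dollars, and a separate run for each of the 12 tests pushes it into the hundreds of thousands.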

All work stopped and shifted to find alternative options.  

What if we shifted our thinking 180 degrees?  

What if we stopped thinking about creating the output files directly? What if we didn’t use the created prompts? This was a proof-of-concept. We were bound to throw things away and try different approaches.  

A new direction  

We shifted, threw the old design away and went in a completely different direction. As we reviewed the requirements, asked more questions, understood all the non-functional requirements and had a more in-depth discussion about how many times the data files would be generated, we settled on a new design.  

Instead of having the LLM output test records directly, we would have the LLM output Python code. That Python code could be run separately from the agentic AI system that creates it. It could be run across different teams, as many times as needed, and could use code switches to ensure the output conformed to the error output requirement. It also meant that we could create files of any size with no additional code or cost. The only cost is the creation of the Python code. Once the Python code was generated, it was delivered to the testers, who could run it repeatedly.
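
To give a feel for what such generated code might look like, here is a small sketch of a generator script with code switches. Everything in it is hypothetical: the column names, the two error switches (negative_amount and non_numeric_amount, standing in for two of the 12 test types) and the function names are assumptions for illustration, not the code produced in the case study.

```python
import csv
import io
import random
from typing import Iterable, Iterator, List, Set, TextIO


def make_records(
    correlation_ids: Iterable[str],
    count: int,
    errors: Set[str],
    seed: int = 0,
) -> Iterator[List[str]]:
    """Yield `count` rows, cycling through the correlation IDs.

    `errors` holds the switches that turn on a given kind of bad data.
    """
    ids = list(correlation_ids)
    if not ids:
        raise ValueError("at least one correlation ID is required")
    rng = random.Random(seed)  # seeded so reruns produce the same file
    for i in range(count):
        amount = round(rng.uniform(1.0, 500.0), 2)
        if "negative_amount" in errors and i % 10 == 0:
            amount = -amount  # hypothetical numerical error
        value = f"{amount:.2f}"
        if "non_numeric_amount" in errors and i % 10 == 5:
            value = "N/A"  # hypothetical analytic error
        yield [ids[i % len(ids)], value]


def write_file(out: TextIO, rows: Iterable[List[str]]) -> int:
    """Write rows as CSV and return how many records were written."""
    writer = csv.writer(out)
    writer.writerow(["correlation_id", "amount"])
    written = 0
    for row in rows:
        writer.writerow(row)
        written += 1
    return written


if __name__ == "__main__":
    buffer = io.StringIO()
    rows = make_records(["C001", "C002", "C003"], 1000, {"negative_amount"})
    print(write_file(buffer, rows), "records written")
```

File size is just the count argument here, so producing 1,000 or 1,000,000 records costs the same in LLM terms: nothing beyond the one-time generation of the script.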

For the new design, the agents consisted of code creation agents, a code analyzer agent, a test agent and the planner orchestrating the whole process. The planner decides which agents should tackle what task and in what order. The code creation agent is responsible for creating Python code. The code analyzer agent is responsible for understanding the code and outputting those results for documentation. The test agent tests the Python code and reports back if it compiles or not. The planner will respond to the test agent results and decide if the code generated needs to be regenerated.  
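
The planner's regenerate-on-failure loop can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the code creation agent is a stand-in that returns canned strings instead of calling an LLM, the compile check uses Python's built-in compile(), and the code analyzer agent is left out.

```python
from typing import Callable, Optional, Tuple


def compile_check_agent(source: str) -> Tuple[bool, str]:
    """Report whether the source compiles, using the built-in compile()."""
    try:
        compile(source, "<generated>", "exec")
        return True, "ok"
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"


def planner(
    code_creation_agent: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> str:
    """Ask for code, test it, and request regeneration on failure."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        source = code_creation_agent(task, feedback)
        passed, feedback = compile_check_agent(source)
        if passed:
            return source
    raise RuntimeError(
        f"no compiling code after {max_attempts} attempts: {feedback}"
    )


# Stand-in for the LLM-backed agent: the first reply is broken on purpose,
# the second (produced once feedback is supplied) is valid.
def fake_code_creation_agent(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "def generate(:\n    pass\n"
    return "def generate():\n    return []\n"


if __name__ == "__main__":
    print(planner(fake_code_creation_agent, "create test data generator"))
```

Capping the number of attempts keeps a stubborn failure from turning into an open-ended token bill.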

It was exciting to think that we had a system that would write code for us, and that the team could be so productive so fast and cut down much of the time that overall testing activities took. But we also had to be careful. GPT and other LLMs have been able to write or produce code for a while. They can create decent code if the prompt is clear and specific and the complexity is kept in check. They can also write complete garbage. One thing’s certain: Don’t blindly use any code provided without testing it yourself.

Yes, we have an agent that checks if the code compiles, but we all know we can still have code that compiles but doesn’t work or doesn’t do what is intended. Make sure that you are very clear in your prompt on what you want created and make sure that you provide details. GPT models are very adept at creating code modules based on explicit guidance. But they do not do well at creating a complete application. Therefore, we need to make sure that we are testing the code. We will need to augment that code and spend some time debugging. And we will certainly need to test the code modules individually and then test all the modules together, in their entirety.
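
The gap between "compiles" and "works" is easy to demonstrate. In the sketch below, both snippets compile, but only one returns the requested number of rows. The generate(n) contract and the check_generated_code helper are hypothetical, and exec() is used only for illustration; generated code should be executed in a sandboxed environment.

```python
def check_generated_code(source: str, expected_count: int) -> bool:
    """Run generated source in its own namespace and check its behavior.

    Assumes (hypothetically) that the generated code defines generate(n)
    returning a list of n rows.
    """
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    rows = namespace["generate"](expected_count)
    return len(rows) == expected_count


# Both snippets compile; only the second produces the requested row count.
BUGGY = "def generate(n):\n    return [i for i in range(n - 1)]\n"
FIXED = "def generate(n):\n    return [i for i in range(n)]\n"

if __name__ == "__main__":
    print("buggy passes:", check_generated_code(BUGGY, 10))
    print("fixed passes:", check_generated_code(FIXED, 10))
```

A compile-only test agent would accept both versions, which is why behavioral tests belong alongside it.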

We also know that once teams start using the Python code solution to create the test output, there will be activities around maintaining that code. There is a balance between rerunning the agent solution to recreate the output and making the changes directly to the Python code. Therefore, the developers/testers that use that code need to make sure they understand the code that is generated. That is why we have a code analyzer agent. Its output can be used as an aid to understand the code and to help when modifications are made. As with any system, once the users start running the code, they will always have modifications.

After focusing on the requirements given, we need to make sure that we build in two aspects in every design and implementation.  

The first is to put in place end-to-end monitoring. In our case, we need to ensure that we have monitoring of the agentic AI system. This incorporates logging and monitoring communication (inputs and outputs) from the LLM as well as all communication to and from each agent. We also need to add instructions to our prompt so that the generated Python code itself includes monitoring and logging output. If we try to add this after the code is generated, we know the odds of that happening are very low.
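
One lightweight way to capture agent inputs and outputs is a logging decorator around each call. The sketch below uses Python's standard logging module; the monitored decorator, the logger name and the call_llm placeholder are hypothetical, and a production system would more likely send these records to a central monitoring service.

```python
import functools
import logging
import time
from typing import Callable

logger = logging.getLogger("agentic")


def monitored(agent_name: str) -> Callable:
    """Decorator that logs each call's input, output and duration."""

    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        @functools.wraps(func)
        def wrapper(prompt: str) -> str:
            start = time.perf_counter()
            logger.info("%s input: %s", agent_name, prompt)
            try:
                result = func(prompt)
            except Exception:
                # Record the failure with its traceback, then re-raise.
                logger.exception("%s failed", agent_name)
                raise
            elapsed = time.perf_counter() - start
            logger.info("%s output (%.3fs): %s", agent_name, elapsed, result)
            return result

        return wrapper

    return decorator


@monitored("code_creation_agent")
def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"# code for: {prompt}"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    call_llm("generate test data")
```

Because the wrapper also records duration, the same log stream can feed the performance and chattiness reviews described later.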

The second is ensuring that we have a human-on-the-loop. Since the system will run independently and won’t require any human interaction, we need to ensure that the agentic AI system is being designed and implemented with the ability to monitor the process. This includes logging what the agents are doing and what each agent receives from and returns to the workflow, as well as being able to shut the process down or override the operation based on alerts created from the logs/monitoring system.
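
A simple form of this control is a guard that the workflow checks before every agent step. In the sketch below, the RunGuard class, its alert threshold and the run_workflow helper are all hypothetical; the idea is only that alerts from monitoring, or a person watching the run, can halt the process between steps.

```python
import threading
from typing import Callable, List


class RunGuard:
    """Lets a monitor or a human stop the workflow between agent steps."""

    def __init__(self, max_alerts: int = 3) -> None:
        self.max_alerts = max_alerts
        self.alerts: List[str] = []
        self._stop = threading.Event()

    def record_alert(self, message: str) -> None:
        """Called by the monitoring system when a log rule fires."""
        self.alerts.append(message)
        if len(self.alerts) >= self.max_alerts:
            self._stop.set()

    def stop(self) -> None:
        """Manual override for the human watching the run."""
        self._stop.set()

    def should_continue(self) -> bool:
        return not self._stop.is_set()


def run_workflow(steps: List[Callable[[RunGuard], str]], guard: RunGuard) -> List[str]:
    """Run steps in order, checking the guard before each one."""
    completed = []
    for step in steps:
        if not guard.should_continue():
            break
        completed.append(step(guard))
    return completed
```

The human stays out of the individual steps but keeps the ability to stop or override the run, which is the distinction between on-the-loop and in-the-loop.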

Lastly, as you are evaluating your design, look at the monitoring and logging output and the performance log output, evaluate cost, determine the accuracy of the output based on the prompts and review the chattiness and functionality of each agent. Did this need to be an agent? Could this have been a direct call to the LLM instead? Is there an opportunity to reuse the agent in other agentic AI systems? These questions may have you deciding to modify the design further. With every AI implementation, you will need to do multiple rounds of testing and multiple rounds of reviews and changes.

As you work with agentic AI, make sure that you allow yourself enough time to learn, explore, experiment and try different approaches. Don’t be afraid to throw out parts of the proof-of-concept. Think about cost, agent communication or overcommunication, performance and latency.  

Agentic AI: The time is now  

The possibilities and capabilities of agentic AI are limitless, whether you are looking to incorporate agents created by companies such as Microsoft, Salesforce and others, or to build custom solutions.

We will see this agentic AI revolution grow as providers release additional agents, tools and development frameworks. Look to see how you can take advantage of the wide range of benefits to your business from operational efficiency and scaling to innovating faster and improving capabilities. This technology is positioned to set early adopters as leaders in their fields, to be ready to take on future challenges and aid in creating significant return on investment.  

Just make sure that you are evaluating non-functional requirements, such as cost and cost analysis, in your activities. There is no faster way to erode ROI than through unneeded token costs and extra processing costs.   

Now is the time to explore agentic AI.  

Stephen Kaufman serves as a chief architect in the Microsoft Customer Success Unit Office of the CTO focusing on AI and cloud computing. He brings more than 30 years of experience across some of the largest enterprise customers, helping them understand and utilize AI ranging from initial concepts to specific application architectures, design, development and delivery.  

This article was made possible by our partnership with the IASA Chief Architect Forum. The CAF’s purpose is to test, challenge and support the art and science of Business Technology Architecture and its evolution over time as well as grow the influence and leadership of chief architects both inside and outside the profession. The CAF is a leadership community of the IASA, the leading non-profit professional association for business technology architects.