Agents are the trendiest topic in AI today — and with good reason. Taking gen AI out of the protected sandbox of the chat interface and allowing it to act directly on the world represents a leap forward in the power and utility of AI models. The word “agent” has been used in different ways, however, and there have been some overheated claims about what agents can do. The rhetoric, the willful obfuscation, and the rapid evolution of the field have left a lot of people confused.

To cut through the noise, I’d like to describe the core components of an agentic AI system and how they fit together: it’s really not as complicated as it may seem. Hopefully, when you’ve finished reading this post, agents won’t seem as mysterious.

Agentic ecosystem

Although definitions of the word “agent” abound, I like the British programmer Simon Willison’s minimalist take: “An LLM agent runs tools in a loop to achieve a goal.” The user prompts a large language model (LLM) with a goal: say, booking a table at a restaurant near a specific theater. The LLM then calls a tool: say, a database of restaurant locations. The tool’s response is passed back to the LLM, which calls another tool. Through repetitions of this loop, the agent moves toward accomplishing the goal.

But what kind of infrastructure does it take to realize this approach? An agentic system needs a few core components:

- A way to build the agent. Obviously, when you deploy an agent, you don’t want to have to code it from scratch. There are several agent development frameworks available.
- Somewhere to run the agent. A seasoned AI developer can download an open-weight LLM and build an agent on a desktop computer. But in practice, most agents will run in the cloud. How do we instantiate them securely and scalably?
- A mechanism for translating between the text-based LLM and tool calls.
- A short-term memory for tracking the content of agentic interactions.
- A long-term memory for tracking the user’s preferences and affinities across sessions.
- An authorization system for handling permissions.
- A way to trace the system’s execution, to evaluate the agent’s performance.

Let’s dive into more detail on each of these components.

Building an agent

Asking an LLM to explain how it plans to approach a particular task improves its performance on that task. This “chain-of-thought reasoning” is now ubiquitous in AI. The analogue in agentic systems is the ReAct (reasoning + action) model, in which the agent has a thought (“I’ll use the map function to locate nearby restaurants”), performs an action (issuing an API call to the map function), then makes an observation (“There are two pizza places and one Indian restaurant within two blocks of the movie theater”). Agents aren’t required to use the ReAct framework, but it has proven highly effective, and today agents commonly loop over the thought-action-observation sequence.

With an agent development framework, the developer defines a goal using natural language, then specifies the tools the agent can use to achieve that goal, such as databases and microservices. Each tool specification includes a natural-language explanation of the context and purpose of the tool’s use, as well as a description of the syntax of the tool’s available API calls.

The developer can also tell the agent to build its own tools on the fly. Let’s say, for instance, that one tool retrieves a table stored as comma-separated text, and to fulfill its goal, the agent needs to sort the table. Sorting a table by repeatedly sending it through an LLM and evaluating the results would be a colossal waste of resources — and it’s not even guaranteed to give the right result. Instead, the developer can simply instruct the agent to generate its own Python code to do the sorting.

Tool use thus allows a flexible division of responsibility between the LLM and the developer. Once the tools available to the agent have been specified, the developer can simply leave it to the agent to decide which tool to call when. Or the developer can specify which tool to use for which types of data, and even which data items to use as arguments during function calls. Similarly, the developer can tell the LLM to generate Python code for automating repetitive tasks when necessary or, alternatively, tell the agent which algorithms to use for which data types and even provide pseudocode. The approach can vary from agent to agent.
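To make the loop concrete, here is a minimal Python sketch of the thought-action-observation cycle. Everything in it (the `call_llm` stub, the tool registry, the prompt format) is an invented placeholder standing in for what a real agent development framework provides; it is a sketch of the pattern, not a production implementation.

```python
# A minimal, illustrative "tools in a loop" agent. The LLM call is
# stubbed out; a real framework would query a model and parse its reply.
import json

def find_restaurants(near: str) -> list:
    """Toy tool: return restaurants near a location (hard-coded data)."""
    return [
        {"name": "Luigi's", "cuisine": "pizza", "blocks_away": 1},
        {"name": "Saffron", "cuisine": "Indian", "blocks_away": 2},
    ]

# Each tool pairs a natural-language description (for the LLM)
# with a callable (for the runtime).
TOOLS = {
    "find_restaurants": {
        "description": "Find restaurants near a location. Args: near (str).",
        "fn": find_restaurants,
    },
}

def call_llm(history: str) -> dict:
    """Hypothetical stand-in for a real LLM call. It returns either a
    tool call (thought + action) or a final answer ending the loop."""
    if "Observation" in history:  # after one observation, wrap up
        return {"final": "Booked a table at Luigi's, one block from the theater."}
    return {"thought": "I'll look up nearby restaurants.",
            "tool": "find_restaurants", "args": {"near": "the theater"},
            "final": None}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):                # the loop
        decision = call_llm("\n".join(history))
        if decision.get("final"):             # goal achieved; stop
            return decision["final"]
        fn = TOOLS[decision["tool"]]["fn"]    # action: execute the tool
        observation = fn(**decision["args"])  # observation: tool's response
        history.append(f"Observation: {json.dumps(observation)}")
    return "Step limit reached without completing the goal."

print(run_agent("Book a table near the Majestic Theater"))
```

Real frameworks wrap this same skeleton in structured prompts, schema validation, and error handling, but the core cycle is no more complicated than this.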
Runtime

Historically, there were two main ways to isolate code running on shared servers: containerization, which was efficient but offered lower security, and virtual machines, which were secure but came with a lot of computational overhead. In 2018, Amazon Web Services’ (AWS’s) Lambda serverless-computing service deployed Firecracker, a new form of server isolation. Firecracker creates “microVMs” that have reduced functionality but comparably low overhead, so that each function executed on a Lambda server can have its own microVM.

Because instantiating an agent requires deploying an LLM, together with the memory resources to track the LLM’s inputs and outputs, the per-function isolation model is impractical. Instead, session-based isolation can assign every agent its own Firecracker microVM, complete with computational capacity, memory, and file system resources. When the session finishes, the LLM’s state information is copied to long-term memory, and the microVM is destroyed. This ensures secure and efficient deployment.

Tool calls

Just as there are several existing development frameworks for agent creation, there are several existing standards for communication between agents and tools, the most popular of which — currently — is the Model Context Protocol (MCP). MCP establishes a one-to-one connection between the agent’s LLM and a dedicated MCP server that executes tool calls, and it also establishes a standard format for passing different types of data back and forth between the LLM and its server.

Sometimes, however, the necessary tool doesn’t have an available API. In such cases, the only way to retrieve data or perform an action is through cursor movements and clicks on a website. There are a number of services available to perform such “computer use” and handle the translations back and forth between the computer use service and the LLM interface.
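To give a feel for what a tool specification looks like in practice, here is a sketch of a tiny MCP tool server, assuming the official MCP Python SDK’s FastMCP interface (the server name, tool, and restaurant data are invented for this example, and the SDK’s API may evolve):

```python
# A sketch of a tool exposed over MCP, assuming the official Python SDK
# (pip install mcp). Server name, tool, and data are illustrative only.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("restaurant-tools")

@mcp.tool()
def find_restaurants(near: str, radius_blocks: int = 2) -> list:
    """Find restaurants within radius_blocks of a location.

    The docstring and type hints double as the tool's natural-language
    description and call syntax, which the LLM reads when deciding
    whether and how to call the tool.
    """
    # Stubbed data; a real implementation would query a map service.
    return [
        {"name": "Luigi's", "cuisine": "pizza", "blocks_away": 1},
        {"name": "Saffron", "cuisine": "Indian", "blocks_away": 2},
    ]

if __name__ == "__main__":
    mcp.run()  # serve tool calls to the agent over MCP's transport
```

In a typical deployment, the agent’s framework connects to a server like this one, reads the tool descriptions it advertises, and routes the LLM’s tool calls to it.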
Authorizations

With agents, authorization works in two directions. First, of course, users require authorization to run the agents they’ve created. But as the agent is acting on the user’s behalf, it will usually require its own authorization to access networked resources.

There are a few different ways to approach the problem of authorization. One is with an access delegation algorithm like OAuth, which essentially plumbs the authorization process through the agentic system: the user enters login credentials into OAuth, and the agentic system uses OAuth to log into protected resources, but the agentic system never has direct access to the user’s passwords. In the other main approach, the user logs into a secure session on a server, and the server has its own login credentials on protected resources. A permission system can let the user select from a variety of authorization strategies and the algorithms that implement them.

Memory and traces

Short-term memory

LLMs are next-word prediction machines. What makes them so astoundingly fluent is that their prediction of the next word in a sequence is based on the semantic embeddings of all the words they’ve already seen. That’s the LLM’s context, which is in itself a kind of memory. But it’s not the only kind of memory an agentic system needs.

Suppose, again, that an agent is trying to book a restaurant near a movie theater, and from a map tool, it’s retrieved a couple dozen restaurants within a mile radius. It doesn’t want to dump information about all those restaurants into the LLM’s context: all that extraneous information could wreak havoc with next-word probabilities. Instead, it can store the complete list in short-term memory and — again, using embeddings — retrieve one or two records at a time, based on the user’s price and cuisine preferences and proximity to the theater. If none of those restaurants pans out, the agent can dip back into short-term memory, rather than having to execute another tool call.

Long-term memory

At the end of a session, the agent’s microVM — including the context and the contents of short-term memory — is destroyed. But we will often want to retain information about a particular user from one session to another. If you booked airline reservations today, you’ll want the agent to remember your destination and travel dates when you send it to book a hotel tomorrow.

At the end of every session, then, the context and the contents of short-term memory are extracted and distilled for storage in long-term memory. The distillation process can include summarization and embedding for later vector-based retrieval. It can also involve “chunking”, in which documents are split into sections grouped according to topic. Each topic — or the text chunks dealing with it — is then embedded, to make subsequent vector-based retrieval easy. Users can select among summarization, embedding, chunking, and other distillation strategies and algorithms.

Finally, in addition to storing context and content, agentic systems record the API calls and responses corresponding to the LLM’s inputs and outputs. These traces permit manual review to evaluate agents’ performance.

Conclusion

That’s it! Of course, it takes a lot of engineering to get all these components to work together and to run efficiently on specific hardware, but in broad strokes, that’s how agentic systems work.