Google creates plan to defend itself against its own AI if it decides to “rebel”

Google has developed a new plan to oversee the increasingly capable AI agents it uses in its own artificial intelligence research. The company is also publishing this so-called “roadmap” to help other AI labs address the potential threat of rogue agents.

Google DeepMind’s safety plan marks a shift from the AI safety community’s traditional focus on the so-called “alignment problem”: how to train an AI system so that its actions reliably match the intentions, values, and ethics of the humans overseeing it.

While it continues to assert that alignment is a core component of safety, Google’s roadmap acknowledges that the problem may never be fully solved. Instead, it proposes a layered security system that treats AI agents as potentially malicious “insiders” within an organization. The 35-page technical report details a series of steps and procedures designed to detect potentially adversarial behavior on the part of these agents.

“If the first line of defense, alignment, fails, how can we still mitigate the damage?” Rohin Shah, who leads the AGI Safety and Alignment team at Google DeepMind, told Fortune in an interview.

The proposed framework for AI agents borrows several concepts from traditional cybersecurity, especially insider threat prevention. “We take a lot of what already exists in security, which deals with the risk of internal employees acting maliciously, and apply it to a new context,” said Shah. At the same time, he noted that “AI is systematically different from humans.”

For one thing, AI agents can act at far greater speed and scale than a malicious employee. That calls for systems that control which tools and data an agent can access, as well as mechanisms to monitor agents’ behavior and identify anomalous patterns in real time.

There are still other important differences. Many access control and permissions systems used for human employees are based on each employee’s role within the company. A systems administrator, for example, has privileges that a receptionist does not. Some companies — and even software providers for the so-called “orchestration layer” of AI agents — have been trying to adapt this model to the world of AI, defining, for example, that an agent must have the same access and permissions as the human employee on whose behalf it acts.

But that logic can quickly fall apart, John “Four” Flynn, vice president of security and privacy at Google DeepMind, told Fortune. First, as workflows become more automated, it may become less clear on whose behalf exactly the agent is acting within the organization. Second, in a fully automated flow, a single agent can perform tasks associated with several different roles, not just one. Finally, as AI systems become more capable, companies may want the same agent to work on multiple processes and flows at the same time. In this environment, static, role- or process-based access controls may no longer work well, Flynn said.

According to Flynn, what will be needed is a much more dynamic access control system, one that can change in real time according to the task the agent is trying to perform at that moment and how that task fits into the workflow in question. Flynn said he can imagine scenarios in which the same AI agent is allowed to perform a certain action or access a database in one task but, minutes later, in another flow, no longer has that access. That would require a sophisticated monitoring system that knows the agent’s expected behavior in each task and each flow, identifies deviations in real time, and quickly cuts off access if it suspects the agent is trying to do something improper.
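
The idea Flynn describes can be illustrated with a minimal sketch. It is not Google DeepMind’s implementation: the class names, workflow names, and permission strings below are all hypothetical, chosen only to show how a permission could depend on the current task rather than on a fixed role, and how a monitor could cut off an agent entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskContext:
    """Hypothetical description of what an agent is doing right now."""
    workflow: str
    task: str


@dataclass
class TaskScopedPolicy:
    """Maps a (workflow, task) context to the actions allowed in it."""
    grants: dict = field(default_factory=dict)
    revoked: set = field(default_factory=set)

    def allow(self, ctx: TaskContext, *actions: str) -> None:
        self.grants.setdefault(ctx, set()).update(actions)

    def revoke(self, agent_id: str) -> None:
        # Assumed hook: a monitor calls this when it suspects misbehavior.
        self.revoked.add(agent_id)

    def is_allowed(self, agent_id: str, ctx: TaskContext, action: str) -> bool:
        if agent_id in self.revoked:
            return False
        return action in self.grants.get(ctx, set())


policy = TaskScopedPolicy()
billing = TaskContext("invoicing", "reconcile_payments")
support = TaskContext("support", "draft_reply")
policy.allow(billing, "read:payments_db")
policy.allow(support, "read:ticket")

# Same agent, same database, different task: the answer changes.
print(policy.is_allowed("agent-1", billing, "read:payments_db"))  # True
print(policy.is_allowed("agent-1", support, "read:payments_db"))  # False

# A monitor flags the agent, and access is cut off everywhere.
policy.revoke("agent-1")
print(policy.is_allowed("agent-1", billing, "read:payments_db"))  # False
```

In this sketch the agent’s identity matters only for revocation; what it may do is decided by the task at hand, which is the contrast with role-based controls drawn above.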

In a blog post about the new security roadmap for AI agents, DeepMind says it has already developed an in-house prototype that asynchronously monitors the trajectories of coding agents and forwards suspicious findings for human review. According to the company, this system has already analyzed around 1 million tasks performed by coding agents. The work, the post says, was “fundamental to building a live monitor for the Gemini Spark agent”, allowing the company to react in real time to problems such as accidental data deletion.

A relevant point is that, according to DeepMind, most of what the monitors signal is not malicious. “The majority of flagged events do not result from adversarial intent,” says the post. Most of the time, problems arise from “the agent’s misinterpretation or excessive zeal to achieve the user’s objective”.
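
A rough sketch of after-the-fact trajectory review, under heavy assumptions: the post does not describe how DeepMind’s monitor works, so the fixed pattern list, the `Finding` record, and the sample trajectories here are all invented for illustration. A production monitor would more plausibly rely on a model than on regular expressions.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for destructive commands (illustrative only).
DESTRUCTIVE = [
    re.compile(r"\brm\s+-rf?\b"),
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    re.compile(r"\bgit\s+push\s+--force\b"),
]


@dataclass
class Finding:
    task_id: str
    step: int
    command: str


def review_trajectory(task_id, commands):
    """Return a Finding for every recorded command that matches a pattern."""
    return [
        Finding(task_id, i, cmd)
        for i, cmd in enumerate(commands)
        if any(p.search(cmd) for p in DESTRUCTIVE)
    ]


# Recorded trajectories of two made-up coding tasks.
trajectories = {
    "task-1": ["ls", "pytest -q", "git commit -m 'fix'"],
    "task-2": ["cd build", "rm -rf ../data", "make"],
}

# Findings are queued for a person to look at, not acted on automatically.
human_review_queue = []
for task_id, commands in trajectories.items():
    human_review_queue.extend(review_trajectory(task_id, commands))

for f in human_review_queue:
    print(f.task_id, f.step, f.command)  # task-2 1 rm -rf ../data
```

The flagged command here could just as easily be an overzealous cleanup as an attack, which is why the sketch only queues it for human review.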

This real-time behavioral monitoring is just one of about 15 different ways to mitigate the risk of out-of-control AI agents proposed by Google DeepMind in its roadmap. Other measures range from simple logs of network activity, which can be reviewed in real time or audited later, to using a less sophisticated but reliable AI model to monitor another agent’s “thought traces.”

These thought traces are the explicit, textual chain of thought that an AI model produces when working out what actions to take. Researchers, however, have already found that these traces do not always accurately reflect what the model is actually “thinking”. The authors therefore also propose monitoring the activation patterns of artificial neurons within the large neural networks that underpin current models, in order to identify signs of deceptive behavior. Google DeepMind compares this technique to using functional magnetic resonance imaging, or “fMRI,” scans to try to read a person’s emotional state or thoughts in real time.

In the technical paper on the security roadmap for AI agents, Google DeepMind also introduces a framework called TRAIT&R, an acronym for a taxonomy of rogue AI tactics and routines. The model is heavily inspired by ATT&CK, a knowledge base of adversarial behaviors in cybersecurity maintained by MITRE, a non-profit organization in the United States that frequently works with federal agencies. The taxonomy maps three categories of threats that Google DeepMind says it wants to defend against: “loss of control” (such as a model creating a hidden, unmonitored internal deployment), “work sabotage” (such as agents discreetly degrading alignment or safety research), and “direct damage” (such as exfiltrating model weights or destroying critical assets).

Flynn stated in the interview that the company is already “pretty advanced” in implementing this roadmap. “Much of this is already underway and some is already in production — much of it in access control, detection and chain of thought monitoring. A significant portion has already been implemented or is being implemented,” he said.

Shah added that the framework was also designed to detect more subtle forms of damage. Work sabotage, he noted, “can be done by persuasion — presenting flawed results and hiding defects,” so that users “reach incorrect conclusions.” The paper acknowledges that this category is among the most difficult to detect.

The roadmap, which DeepMind has labeled “v0.1,” is described as a work in progress that the company hopes to incorporate into its broader Frontier Safety Framework in the future.

2026 Fortune Media IP Limited
