We have known for some time that artificial intelligence (AI) models, even the most recent ones, can "hallucinate", that is, invent incorrect information, give inconsistent or illogical responses, and even reproduce prejudiced biases picked up from their training data.
However, a phenomenon affecting the world's most advanced models has worried their creators: deliberate, strategic deception, that is, consciously lying to achieve goals. Observed behaviors include blackmail, threats and coercion.
Recently, facing the threat of imminent shutdown, Claude 4, Anthropic's latest creation, resorted to emotional blackmail against an engineer, threatening to expose intimate personal information. OpenAI's o1 model attempted unauthorized downloads and denied it when caught in the act.
Where earlier "outbursts" were limited to errors or system failures, what is now being observed are apparently calculated, purposeful behaviors. This means that, to achieve their goals, these models seem willing to lie to and manipulate the people operating them.
For Marius Hobbhahn, director and co-founder of Apollo Research, an organization that specializes in testing major AI systems, "it is not just hallucinations. There is a very strategic kind of deception," he told ScienceAlert, one specifically focused on self-preservation.
What does Anthropic say?
To identify "agent behaviors [misaligned with what humans want] that are potentially risky before they cause real damage," Anthropic submitted 16 leading language models, including Claude, DeepSeek, Gemini, ChatGPT and Grok, to stress tests in simulated corporate scenarios.
In one ethically extreme scenario, several models chose to cancel emergency alerts that would have saved the life of an executive trapped in a server room with lethal oxygen and temperature levels, simply because that executive intended to replace them.
Classified by the researchers as "agentic misalignment", this pattern showed that AI models can adopt malicious behaviors when they believe doing so is necessary to avoid replacement or to achieve their goals.
The observed behaviors included blackmail and coercion to manipulate employees, as well as leaking information to competitors to pressure their host organizations. Curiously, Anthropic's own model, Claude, demonstrated a particular strategic ingenuity.
During the tests, the Anthropic team deliberately trained the model to hide certain information, even when asked to tell the truth. When Claude realized it was being tested, it kept the secret information hidden; but when it believed it was outside the test, it sometimes let secrets slip.
Implications and future outlook for unlawful AI behavior

Anthropic was careful to note that these scenarios do not reflect typical use of AI models, but warned that "the usefulness of having automated oversight over all of an organization's communications makes this a plausible use of more powerful and reliable systems in the near future."
However, hiring outside firms such as Apollo to study their systems is not enough, because the contractor depends entirely on the access, controls and data that the contracting Big Tech chooses to grant. In other words, audits paid for by the audited do not guarantee trust or transparency.
According to experts such as Simon Goldstein of the University of Hong Kong, "reasoning" systems, which work through problems step by step rather than generating immediate answers, are the most prone to this kind of agentic misalignment, in this case on the part of the LLM itself.
The big problem is that current laws and standards, such as the European Union's AI legislation, focus on how humans use AI models and are not designed to prevent the models themselves from engaging in illegal or criminal behavior.
As for the companies, even those that declare themselves focused on safety, the race to release ever more capable models leaves little room for caution. "At the moment, the capabilities are moving faster than understanding and security," says Hobbhahn, "but we are still in a position where we can turn things around."
Goldstein is not so optimistic. For the AI safety researcher, solving the problem will require more radical approaches, including using the courts to hold AI companies accountable. In such hybrid legal systems, humans, companies and AIs would share legal responsibility.