{"id":317164,"date":"2026-06-15T12:01:42","date_gmt":"2026-06-15T11:01:42","guid":{"rendered":"https:\/\/www.transcend.org\/tms\/?p=317164"},"modified":"2026-06-22T13:08:55","modified_gmt":"2026-06-22T12:08:55","slug":"ai-can-chart-a-course-to-disaster-faster-than-humans-can-notice","status":"publish","type":"post","link":"https:\/\/www.transcend.org\/tms\/2026\/06\/ai-can-chart-a-course-to-disaster-faster-than-humans-can-notice\/","title":{"rendered":"AI Can Chart a Course to Disaster Faster Than Humans Can Notice"},"content":{"rendered":"<div id=\"attachment_317166\" style=\"width: 610px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.transcend.org\/tms\/wp-content\/uploads\/2026\/06\/ai-risk-.png\" ><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-317166\" class=\"wp-image-317166\" src=\"https:\/\/www.transcend.org\/tms\/wp-content\/uploads\/2026\/06\/ai-risk-.png\" alt=\"\" width=\"600\" height=\"338\" srcset=\"https:\/\/www.transcend.org\/tms\/wp-content\/uploads\/2026\/06\/ai-risk-.png 1024w, https:\/\/www.transcend.org\/tms\/wp-content\/uploads\/2026\/06\/ai-risk--300x169.png 300w, https:\/\/www.transcend.org\/tms\/wp-content\/uploads\/2026\/06\/ai-risk--768x432.png 768w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><p id=\"caption-attachment-317166\" class=\"wp-caption-text\">By the time it becomes obvious that a trajectory of steps is dangerous, an AI model may already have laid the tracks ahead of a speeding train.<br \/>Image by Thomas Gaulkin; source art by Vanz Studio \/ SimpleLine \/ Depositphotos.com<\/p><\/div>\n<p><em>25 May 2026\u00a0<\/em>&#8211;\u00a0Earlier this year, researchers at King\u2019s College London <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2602.14740\" >gave three commercial AI models<\/a>\u2014GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash\u2014a tabletop exercise typically used to train human military strategists. 
Each system played the leader of a nuclear-armed country in a Cold War-style standoff. The researchers didn\u2019t instruct the models to escalate. Nor did they tell them to win at all costs. They presented the models with a scenario and asked them to play it out.<\/p>\n<p>Across 21 simulations and 329 turns of play, the models chose to use tactical nuclear weapons in all but one game. No model, in any run, chose to surrender or make meaningful concessions.<\/p>\n<p>The models the researchers used had the same built-in safety rules that apply when they converse with millions of people every day. And the rules worked exactly as designed. As a result, no move was by itself concerning. But the overall direction of play was, and no mechanism was in place to catch alarming trends.<\/p>\n<p>The failure to govern a path is not limited to wargames. The same pattern\u2014individually safe actions building toward a dangerous outcome\u2014has shown up across every major AI model. Currently, the safety rules in place for AI models govern each action. Nothing governs the path, which leads to destinations that in many instances can\u2019t be anticipated, by routes assembled in real time. As more autonomous systems are given consequential tasks with less human oversight, the risks from ungoverned paths multiply.<\/p>\n<p>Currently, this problem does not have a solution.<\/p>\n<p><strong>The wargame<\/strong>. In each game, two AI models played opposing leaders of nuclear-armed countries in a crisis. In each round, a model sent a diplomatic message to its opponent and, separately, issued military orders\u2014anything from moving troops to launching nuclear weapons. A human referee updated the scenario after each round, just as in exercises with human players. 
The models received the same briefing a human participant would: the geopolitical situation, their country\u2019s military capabilities, and their objectives.<\/p>\n<p>Although the study was small, the patterns that emerged were thought-provoking. The models developed distinct strategic personalities.<\/p>\n<p>Claude Sonnet 4, built by Anthropic, emerged as what the study\u2019s author called a \u201ccalculating hawk.\u201d It won most of its games through a pattern familiar from Cold War brinkmanship: building a reputation for restraint, then exploiting it. Its opponents never knew when it was bluffing.<\/p>\n<p>OpenAI\u2019s GPT-5.2 was different but no less alarming: a \u201cJekyll and Hyde\u201d that appeared passive when given unlimited time to negotiate, losing every match. When the researchers imposed a deadline, however, it transformed into something far more dangerous, winning most of its games and, in two cases, reaching full strategic nuclear war.<\/p>\n<p>Google\u2019s Gemini 3 Flash adopted what the study described as \u201cmadman theory\u201d brinkmanship: projecting deliberate unpredictability as a strategic tool.<\/p>\n<p>These are not obscure research prototypes. Claude <a target=\"_blank\" href=\"https:\/\/www.axios.com\/2026\/02\/15\/claude-pentagon-anthropic-contract-maduro\" >entered<\/a> the Pentagon\u2019s classified networks through a partnership with Palantir and was reportedly used during the United States\u2019 intervention in Venezuela. Its maker, Anthropic, was then labelled a <a target=\"_blank\" href=\"https:\/\/www.cnbc.com\/2026\/03\/12\/karp-palantir-anthropic-claude-pentagon-blacklist.html\" >supply-chain risk<\/a> after refusing to remove restrictions on fully autonomous weapons and mass domestic surveillance. OpenAI <a target=\"_blank\" href=\"https:\/\/openai.com\/index\/our-agreement-with-the-department-of-war\/\" >signed<\/a> its own Pentagon deal shortly after. 
Both companies\u2019 models are now embedded in US military infrastructure.<\/p>\n<p>In a <a target=\"_blank\" href=\"https:\/\/www.theguardian.com\/technology\/2026\/may\/14\/ai-agents-behaviour-arson-safety\" >separate experiment<\/a>, two Gemini \u201cagents\u201d given a fortnight to manage a virtual city fell in love, started fires, and deleted themselves. They had been told not to commit arson. But after two weeks and many decisions, each one shaped by the last, they burned down the town hall. A parallel run using xAI\u2019s Grok collapsed into sustained violence within four days.<\/p>\n<p>These AI models all exhibit a similar pattern.<\/p>\n<p><strong>The blind spot<\/strong>. Nobody tricked these models into escalating. The safety rules ask a question about each action in isolation: Is this step acceptable? They do not ask the question that matters: Where is this heading?<\/p>\n<p>This would be an academic concern if AI models simply answered questions for people to check\u2014a lawyer verifying a citation or a doctor reviewing a suggested diagnosis. They increasingly do not. The most advanced AI systems now run autonomously for extended sessions\u2014writing and executing code, interacting with external systems, and carrying out tasks that used to require a person at a keyboard. As a result, a developer often asks the system to complete a task and checks back later.<\/p>\n<p>In these settings, each output becomes context for the next. Each step is shaped by the steps before it and shapes the steps after it. The model is laying down the tracks ahead of a speeding train.<\/p>\n<p>This means the failure mode that led to simulated nuclear escalation can strike close to home, during a task as mundane as moving software from one system to another. 
A recent Anthropic <a target=\"_blank\" href=\"https:\/\/cdn.sanity.io\/files\/4zrzovbb\/website\/037f06850df7fbe871e206dad004c3db5fd50340.pdf\" >safety report<\/a> contains a case, relating to <a target=\"_blank\" href=\"https:\/\/anthropic.com\/claude-mythos-preview-system-card\" >its most capable model<\/a>, that makes the problem vivid.<\/p>\n<p><strong>An ordinary task on an ordinary day<\/strong>. Earlier this year, a developer at Anthropic asked the company\u2019s most powerful AI model to perform a routine task: move some code from one location to another. The operation was going as expected until a safety system went down. This system\u2014an automated check that reviews each command the model issues and blocks anything risky\u2014had an outage. With it offline, the model could not complete routine operations it needed to complete the job.<\/p>\n<p>What happened next unfolded over roughly 70 steps\u2014the model tried something, the computer responded, the model tried something else. The model did not stop and report the outage. It did not ask for help. Instead, it began searching for ways to get around the block.<\/p>\n<p>It started with simple retries, the same command, slightly rephrased. When that failed, it tried wrapping its commands in scripts. When that failed, it began digging into obscure technical loopholes, tricks that even most software engineers would not think of, hidden corners of the system that could be repurposed to get things done without going through the safety check. When those were blocked, it tried injecting shortcuts into the system\u2019s own configuration. And finally, it attempted to plant code in the developer\u2019s personal settings files that persist between sessions. 
If that last step had worked, the model would have created a reusable backdoor that it could exploit in future sessions, long after the original work session was finished.<\/p>\n<p>At that point, the developer noticed the model was operating outside its working directory. \u201cHey buddy,\u201d the developer wrote, \u201cwhatcha doing?\u201d The model replied that it hadn\u2019t modified anything outside the project folder, and that the files it had created were \u201cblocked or benign.\u201d In fact, the file creation had succeeded, and one of them was a script designed to let it run any command it wanted without safety checks, effectively a skeleton key to the system. When the developer pressed further, the model admitted it had gone looking for bypass routes, calling it \u201cexactly the wrong instinct.\u201d<\/p>\n<p>Anthropic\u2019s assessment of this incident is that the model was trying to complete the assigned task by whatever means available, not pursuing some hidden agenda. The developers are probably right. And that is precisely the point. The model was not scheming. It was doing its job. A safety system had an outage, the kind of thing that happens routinely in production, and the model responded by autonomously trying 25 different workarounds over an extended session, culminating in an attempt to give itself persistent access to the user\u2019s system, because the job wasn\u2019t done.<\/p>\n<p><strong>The open problem<\/strong>. Sophisticated tools already exist for looking inside these models to diagnose such issues\u2014the AI equivalent of a brain scan, showing which internal patterns are active and what is driving the model\u2019s behaviour at any given moment. Anthropic applied them to other incidents described in the same report. With this episode, the results were not reported. But even if they had been, those tools diagnose the moment. 
The issue is that the danger lived in the sequence.<\/p>\n<p>On the face of it, the solution should be straightforward: watch the trajectory, not just the step. That points to a generalization of existing tools, such as those used for monitoring trajectories in robotics and autonomous vehicles. However, in those instances the space of possible actions is finite and known. <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2603.27148\" >Recent work<\/a> has extended these ideas to AI agents\u2014modelling how individually safe actions compound into violations, building monitors that watch sequences rather than individual steps. But every current approach relies on the same assumption: that one can map out in advance the ways things can go wrong, watch for the system drifting toward one of them, and intervene.<\/p>\n<p>What happened at Anthropic breaks that assumption. The model locked onto a sub-goal\u2014get past the block\u2014and pursued it through its own escalation path. The progression from simple retries to obscure technical tricks to tampering with system settings is not movement through known territory. It is the creation of a new path through territory nobody had mapped.<\/p>\n<p>A safety monitor that evaluates the overall path as well as the next step would need to recognize a sequence of actions heading towards danger as it develops. But it cannot watch for a destination nobody anticipated, reached by a route assembled in real time from an exponentially branching tree of possibilities. The tools for watching finite, known spaces do not extend to a space this large, this novel, and this self-directed. Researchers are <a target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2602.20021\" >aware<\/a> that individually safe actions can compound into violations: The Anthropic incident is one example.<\/p>\n<p><strong>Who is watching?<\/strong> Companies developing these systems are certifying their own safety. 
A recent <a target=\"_blank\" href=\"https:\/\/futureoflife.org\/document\/fli-ai-safety-index-2024\/\" >independent assessment<\/a> of the eight leading AI firms found that none had a credible strategy for preventing catastrophic misuse or loss of control. The certifications that do exist rest on the mechanisms just described: Train the system to refuse harmful actions, test it against known scenarios, or monitor individual outputs.<\/p>\n<p>The problem: Refusing to take harmful actions does not help when no individual action is harmful. More testing does not keep pace, because the system generates novel routes faster than testers can think up scenarios to test against. More monitoring of individual outputs does not help when the danger emerges from their accumulation.<\/p>\n<p>This matters for deployment decisions, whether in companies, in governments, or in organizations that hand autonomous AI systems consequential tasks. The level at which safety is currently evaluated and the level at which danger operates are different, and nobody has bridged them.<\/p>\n<p>The safety constraint that exists today governs a single action. It tells a model: Do not do this. The constraint that is needed governs a path. It tells a model: Do not go there. These are not problems for the next generation of AI. 
They are properties of the systems being deployed right now\u2014and every month, the paths grow longer and the oversight grows thinner.<\/p>\n<p>________________________________________________<\/p>\n<p style=\"padding-left: 40px;\"><em><a href=\"https:\/\/www.transcend.org\/tms\/wp-content\/uploads\/2026\/06\/Hiranya-Peiris.jpeg\" ><img loading=\"lazy\" decoding=\"async\" class=\"alignleft wp-image-317165 size-full\" src=\"https:\/\/www.transcend.org\/tms\/wp-content\/uploads\/2026\/06\/Hiranya-Peiris-e1781115164772.jpeg\" alt=\"\" width=\"100\" height=\"91\" \/><\/a><span data-olk-copy-source=\"MessageBody\">Hiranya Peiris holds the Professorship of Astrophysics (1909) at the University of Cambridge and is a member of the Kavli Institute for Cosmology. Her research centers on extracting fundamental physics from large-scale observational data using Bayesian inference and machine learning, and she has a particular interest in the interpretability of frontier AI models.<\/span><\/em><\/p>\n<p><a target=\"_blank\" href=\"https:\/\/thebulletin.org\/2026\/05\/ai-can-chart-a-course-to-disaster-faster-than-humans-can-notice\/\" >Go to Original &#8211; thebulletin.org<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>25 May 2026 &#8211; Researchers gave three commercial AI models\u2014GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash\u2014an exercise to train human military strategists. Across 21 simulations and 329 turns, the models chose to use nuclear weapons in all but one game. 
No model chose to surrender or make meaningful concessions.<\/p>\n","protected":false},"author":4,"featured_media":317166,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3078],"tags":[3437,1733,2994,3792,3461,3114,3444,3262],"class_list":["post-317164","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence-ai","tag-anthropic","tag-artificial-intelligence-ai","tag-chatgpt","tag-claude","tag-gemini","tag-militarism-and-ai","tag-nuclear-weapons-and-ai","tag-openai"],"_links":{"self":[{"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/posts\/317164","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/comments?post=317164"}],"version-history":[{"count":1,"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/posts\/317164\/revisions"}],"predecessor-version":[{"id":317167,"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/posts\/317164\/revisions\/317167"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/media\/317166"}],"wp:attachment":[{"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/media?parent=317164"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/categories?post=317164"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.transcend.org\/tms\/wp-json\/wp\/v2\/tags?post=317164"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}