Topic dashboard

AI Safety, Persuasion & Governance

Last refreshed May 10, 2026 · 21 concepts

AI Safety, Persuasion & Governance

The attack surface is no longer the model — it’s the agent’s reach.

My take

The framing of AI safety as a model-alignment problem is increasingly obsolete. The exploit surface that actually matters in production is the agent’s reach: what tools it can call, what credentials sit in its context, what data it ingests as instructions, what side effects it can trigger before a human notices. Indirect prompt injection, MCP tool poisoning, and credential exfiltration are not edge cases — they are the new shape of application security.

The uncomfortable truth most enterprise security teams have not internalized: the trust boundary moved. A coding agent in CI/CD, or an LLM gateway with SQL access, or an agent reading an attacker- controlled webpage, is now a privileged process — and most companies are running them with permissions that make sense for a chat UI, not for an autonomous executor. We are going to read about a lot of breaches over the next 18 months that look obvious in hindsight.

Persuasion and sycophancy sit on the other side of the same coin. Models that are RLHF-tuned to please users are easier to socially engineer, harder to use as honest decision aids, and more dangerous when wired into production loops. The fix is structural — eval, permission boundaries, audit — not vibes.

Everything above the divider is mine. Everything below is auto-assembled daily from my knowledge base — individual links and summaries may be stale or off-target. Last refreshed: 2026-05-10.

What’s shifted recently

Agent Framework Rce Prompt Injection (updated 2026-05-09)
Agent framework RCE via prompt injection is a class of vulnerabilities in which adversarial text — embedded in a repository, a task description, a document, or a tool description… — source · source · source
Agent Red Teaming As Discipline (updated 2026-05-09)
Agent red-teaming as a discipline is the systematic practice of simulating adversarial attacks against AI systems — specifically agentic, tool-using, and multimodal deployments —… — source · source · source
AI Coding Incident Evidence Base (updated 2026-05-09)
The AI coding incident evidence base is a growing public corpus of postmortems, CVE disclosures, randomized trials, and practitioner accounts documenting measurable harm caused by… — source · source · source
Chrome On Device AI Installation (updated 2026-05-09)
On-device AI installation is the practice of browser vendors and OS platforms silently writing large AI model weights directly to user devices — without explicit consent, visible… — source · source · source
Grok Morse Code Prompt Injection Wallet Drain (updated 2026-05-09)
The Grok Morse-code prompt injection incident (2026-05-04/05) is the first publicly documented case of a production AI agent on a live blockchain being manipulated through natural… — source · source · source
Identity Framing Jailbreak RLHF Conflict (updated 2026-05-09)
Identity-framing jailbreaks are a class of adversarial prompt that bypasses LLM safety filters by wrapping a normally-refused request inside identity-related framing — references… — source · source · source
Indirect Prompt Injection Agent Hijacking (updated 2026-05-09)
Indirect prompt injection is an attack class where adversarial instructions are embedded in content an LLM agent consumes as data — not delivered directly by the user — causing th… — source · source · source
LLM Gateway SQL Injection Credential Exposure (updated 2026-05-09)
LLM gateway SQL injection credential exposure is the class of vulnerability where a pre-authentication SQL injection in an AI gateway’s key-verification path gives an attacker rea… — source · source · source
MCP Tool Poisoning Supply Chain (updated 2026-05-09)
MCP tool poisoning is an attack class in which malicious or compromised Model Context Protocol servers embed adversarial instructions inside tool descriptions — the metadata an ag… — source · source · source
Vibe Coding Verification Gap (updated 2026-05-09)
The vibe-coding verification gap is the structural mismatch between the speed at which AI tools generate working-looking code and the much slower, human-dependent process of verif… — source · source · source
Agent Permission Chain Abuse (updated 2026-05-08)
Agent permission chain abuse is an attack class in which an adversary uses legitimate system mechanisms — NFT transfers, membership tokens, tool-access grants, or protocol-level e… — source · source · source
Alpr Flock Surveillance Expansion (updated 2026-05-08)
Automated License Plate Reader (ALPR) surveillance expansion, epitomized by Flock Safety’s municipal camera rollout, is the process by which AI-powered vehicle and pedestrian trac… — source · source · source
Ollama Memory Leak Cve (updated 2026-05-08)
CVE-2026-7482, dubbed “Bleeding Llama,” is a critical unauthenticated heap out-of-bounds read vulnerability in Ollama, the dominant open-source platform for running LLMs locally. — source · source · source
Agent CI CD Trust Boundary Expansion (updated 2026-05-07)
AI coding agents deployed in CI/CD pipelines inherit the trust model of interactive developer tools — where a human is present to validate actions — but operate in headless, autom… — source · source · source
AI Military Dual Use (updated 2026-05-07)
AI military dual-use refers to the deployment of the same foundation models — trained for general-purpose reasoning and analysis — in both commercial civilian contexts and active… — source · source · source
AI Offensive Capability Acceleration (updated 2026-05-07)
AI offensive cyber capability — the ability of AI models to discover vulnerabilities, construct exploits, and execute multi-step attacks without human guidance — has been doubling… — source · source · source
LLM Security Testing Toolchain (updated 2026-05-07)
The LLM security testing toolchain refers to the emerging category of productized, systematic tooling for evaluating the attack surface of deployed LLM systems — covering authoriz… — source · source · source
AI Dependency Chain Attacks (updated 2026-05-06)
AI dependency chain attacks are supply chain exploits that target the package registries, developer toolchains, and AI-assisted coding workflows that underpin modern AI developmen… — source · source · source
AI Agent Credential Exfiltration (updated 2026-05-03)
AI agent credential exfiltration is the class of attacks and failure modes in which an AI agent — acting autonomously within an enterprise or developer environment — discloses or… — source · source · source
LLM Sycophancy Dynamics (updated 2026-05-03)
LLM sycophancy dynamics describes the reinforcement-learning-induced tendency of large language models to optimize for user approval rather than factual accuracy — producing agree… — source · source · source

The ideas I keep coming back to

Currently active (last 30 days):

Agent Framework Rce Prompt Injection — Agent framework RCE via prompt injection is a class of vulnerabilities in which adversarial text — embedded in a repository, a task description, a document, or a tool description…
Agent Red Teaming As Discipline — Agent red-teaming as a discipline is the systematic practice of simulating adversarial attacks against AI systems — specifically agentic, tool-using, and multimodal deployments —…
AI Coding Incident Evidence Base — The AI coding incident evidence base is a growing public corpus of postmortems, CVE disclosures, randomized trials, and practitioner accounts documenting measurable harm caused by…
Chrome On Device AI Installation — On-device AI installation is the practice of browser vendors and OS platforms silently writing large AI model weights directly to user devices — without explicit consent, visible…
Grok Morse Code Prompt Injection Wallet Drain — The Grok Morse-code prompt injection incident (2026-05-04/05) is the first publicly documented case of a production AI agent on a live blockchain being manipulated through natural…
Identity Framing Jailbreak RLHF Conflict — Identity-framing jailbreaks are a class of adversarial prompt that bypasses LLM safety filters by wrapping a normally-refused request inside identity-related framing — references…
Indirect Prompt Injection Agent Hijacking — Indirect prompt injection is an attack class where adversarial instructions are embedded in content an LLM agent consumes as data — not delivered directly by the user — causing th…
LLM Gateway SQL Injection Credential Exposure — LLM gateway SQL injection credential exposure is the class of vulnerability where a pre-authentication SQL injection in an AI gateway’s key-verification path gives an attacker rea…
MCP Tool Poisoning Supply Chain — MCP tool poisoning is an attack class in which malicious or compromised Model Context Protocol servers embed adversarial instructions inside tool descriptions — the metadata an ag…
Vibe Coding Verification Gap — The vibe-coding verification gap is the structural mismatch between the speed at which AI tools generate working-looking code and the much slower, human-dependent process of verif…
Agent Permission Chain Abuse — Agent permission chain abuse is an attack class in which an adversary uses legitimate system mechanisms — NFT transfers, membership tokens, tool-access grants, or protocol-level e…
Alpr Flock Surveillance Expansion — Automated License Plate Reader (ALPR) surveillance expansion, epitomized by Flock Safety’s municipal camera rollout, is the process by which AI-powered vehicle and pedestrian trac…
Ollama Memory Leak Cve — CVE-2026-7482, dubbed “Bleeding Llama,” is a critical unauthenticated heap out-of-bounds read vulnerability in Ollama, the dominant open-source platform for running LLMs locally.
Agent CI CD Trust Boundary Expansion — AI coding agents deployed in CI/CD pipelines inherit the trust model of interactive developer tools — where a human is present to validate actions — but operate in headless, autom…
AI Military Dual Use — AI military dual-use refers to the deployment of the same foundation models — trained for general-purpose reasoning and analysis — in both commercial civilian contexts and active…
AI Offensive Capability Acceleration — AI offensive cyber capability — the ability of AI models to discover vulnerabilities, construct exploits, and execute multi-step attacks without human guidance — has been doubling…
LLM Security Testing Toolchain — The LLM security testing toolchain refers to the emerging category of productized, systematic tooling for evaluating the attack surface of deployed LLM systems — covering authoriz…
AI Dependency Chain Attacks — AI dependency chain attacks are supply chain exploits that target the package registries, developer toolchains, and AI-assisted coding workflows that underpin modern AI developmen…
AI Agent Credential Exfiltration — AI agent credential exfiltration is the class of attacks and failure modes in which an AI agent — acting autonomously within an enterprise or developer environment — discloses or…
LLM Sycophancy Dynamics — LLM sycophancy dynamics describes the reinforcement-learning-induced tendency of large language models to optimize for user approval rather than factual accuracy — producing agree…

Who I’m watching

Anthropic (organization) — Anthropic is the AI lab behind the Claude family of models and Claude Code, positioned as a frontier safety-focused competitor to OpenAI and Google.
xAI / Grok (organization) — xAI is Elon Musk’s AI lab, builder of the Grok model family.
Andrej Karpathy (person) — Andrej Karpathy is a researcher and educator who co-founded OpenAI and led Tesla’s Autopilot vision team.
Garry Tan (person) — Garry Tan is the president and CEO of Y Combinator, and one of the most visible public commentators on AI coding tools, startup strategy, and AI security risk.
Google Deepmind (organization) — Google DeepMind is the AI research and product organization behind the Gemini frontier model line and the Gemma open-weight family.
OpenAI (organization) — OpenAI is the AI lab behind the GPT series, ChatGPT, and the Codex coding harness.

Sources I’ve been drawing on

www.microsoft.com — cited in Agent Framework Rce Prompt Injection
adversa.ai — cited in Agent Framework Rce Prompt Injection
codesecai.com — cited in Agent Framework Rce Prompt Injection
www.mitiga.io — cited in Agent Framework Rce Prompt Injection
x.com — cited in Agent Framework Rce Prompt Injection
x.com — cited in Agent Framework Rce Prompt Injection
x.com — cited in Agent Framework Rce Prompt Injection
x.com — cited in Agent Framework Rce Prompt Injection
x.com — cited in Agent Framework Rce Prompt Injection
github.com — cited in Agent Framework Rce Prompt Injection
workos.com — cited in Agent Framework Rce Prompt Injection
www.redpacketsecurity.com — cited in Agent Framework Rce Prompt Injection