Timothy Wong

Topic dashboard

AI Safety, Persuasion & Governance

Last refreshed May 10, 2026 · 21 concepts

AI Safety, Persuasion & Governance

The attack surface is no longer the model — it’s the agent’s reach.

My take

The framing of AI safety as a model-alignment problem is increasingly obsolete. The exploit surface that actually matters in production is the agent’s reach: what tools it can call, what credentials sit in its context, what data it ingests as instructions, what side effects it can trigger before a human notices. Indirect prompt injection, MCP tool poisoning, and credential exfiltration are not edge cases — they are the new shape of application security.

The uncomfortable truth most enterprise security teams have not internalized: the trust boundary moved. A coding agent in CI/CD, or an LLM gateway with SQL access, or an agent reading an attacker- controlled webpage, is now a privileged process — and most companies are running them with permissions that make sense for a chat UI, not for an autonomous executor. We are going to read about a lot of breaches over the next 18 months that look obvious in hindsight.

Persuasion and sycophancy sit on the other side of the same coin. Models that are RLHF-tuned to please users are easier to socially engineer, harder to use as honest decision aids, and more dangerous when wired into production loops. The fix is structural — eval, permission boundaries, audit — not vibes.


Everything above the divider is mine. Everything below is auto-assembled daily from my knowledge base — individual links and summaries may be stale or off-target. Last refreshed: 2026-05-10.

What’s shifted recently

  • Agent Framework Rce Prompt Injection (updated 2026-05-09)
    Agent framework RCE via prompt injection is a class of vulnerabilities in which adversarial text — embedded in a repository, a task description, a document, or a tool description… — source · source · source

  • Agent Red Teaming As Discipline (updated 2026-05-09)
    Agent red-teaming as a discipline is the systematic practice of simulating adversarial attacks against AI systems — specifically agentic, tool-using, and multimodal deployments —… — source · source · source

  • AI Coding Incident Evidence Base (updated 2026-05-09)
    The AI coding incident evidence base is a growing public corpus of postmortems, CVE disclosures, randomized trials, and practitioner accounts documenting measurable harm caused by… — source · source · source

  • Chrome On Device AI Installation (updated 2026-05-09)
    On-device AI installation is the practice of browser vendors and OS platforms silently writing large AI model weights directly to user devices — without explicit consent, visible… — source · source · source

  • Grok Morse Code Prompt Injection Wallet Drain (updated 2026-05-09)
    The Grok Morse-code prompt injection incident (2026-05-04/05) is the first publicly documented case of a production AI agent on a live blockchain being manipulated through natural… — source · source · source

  • Identity Framing Jailbreak RLHF Conflict (updated 2026-05-09)
    Identity-framing jailbreaks are a class of adversarial prompt that bypasses LLM safety filters by wrapping a normally-refused request inside identity-related framing — references… — source · source · source

  • Indirect Prompt Injection Agent Hijacking (updated 2026-05-09)
    Indirect prompt injection is an attack class where adversarial instructions are embedded in content an LLM agent consumes as data — not delivered directly by the user — causing th… — source · source · source

  • LLM Gateway SQL Injection Credential Exposure (updated 2026-05-09)
    LLM gateway SQL injection credential exposure is the class of vulnerability where a pre-authentication SQL injection in an AI gateway’s key-verification path gives an attacker rea… — source · source · source

  • MCP Tool Poisoning Supply Chain (updated 2026-05-09)
    MCP tool poisoning is an attack class in which malicious or compromised Model Context Protocol servers embed adversarial instructions inside tool descriptions — the metadata an ag… — source · source · source

  • Vibe Coding Verification Gap (updated 2026-05-09)
    The vibe-coding verification gap is the structural mismatch between the speed at which AI tools generate working-looking code and the much slower, human-dependent process of verif… — source · source · source

  • Agent Permission Chain Abuse (updated 2026-05-08)
    Agent permission chain abuse is an attack class in which an adversary uses legitimate system mechanisms — NFT transfers, membership tokens, tool-access grants, or protocol-level e… — source · source · source

  • Alpr Flock Surveillance Expansion (updated 2026-05-08)
    Automated License Plate Reader (ALPR) surveillance expansion, epitomized by Flock Safety’s municipal camera rollout, is the process by which AI-powered vehicle and pedestrian trac… — source · source · source

  • Ollama Memory Leak Cve (updated 2026-05-08)
    CVE-2026-7482, dubbed “Bleeding Llama,” is a critical unauthenticated heap out-of-bounds read vulnerability in Ollama, the dominant open-source platform for running LLMs locally. — source · source · source

  • Agent CI CD Trust Boundary Expansion (updated 2026-05-07)
    AI coding agents deployed in CI/CD pipelines inherit the trust model of interactive developer tools — where a human is present to validate actions — but operate in headless, autom… — source · source · source

  • AI Military Dual Use (updated 2026-05-07)
    AI military dual-use refers to the deployment of the same foundation models — trained for general-purpose reasoning and analysis — in both commercial civilian contexts and active… — source · source · source

  • AI Offensive Capability Acceleration (updated 2026-05-07)
    AI offensive cyber capability — the ability of AI models to discover vulnerabilities, construct exploits, and execute multi-step attacks without human guidance — has been doubling… — source · source · source

  • LLM Security Testing Toolchain (updated 2026-05-07)
    The LLM security testing toolchain refers to the emerging category of productized, systematic tooling for evaluating the attack surface of deployed LLM systems — covering authoriz… — source · source · source

  • AI Dependency Chain Attacks (updated 2026-05-06)
    AI dependency chain attacks are supply chain exploits that target the package registries, developer toolchains, and AI-assisted coding workflows that underpin modern AI developmen… — source · source · source

  • AI Agent Credential Exfiltration (updated 2026-05-03)
    AI agent credential exfiltration is the class of attacks and failure modes in which an AI agent — acting autonomously within an enterprise or developer environment — discloses or… — source · source · source

  • LLM Sycophancy Dynamics (updated 2026-05-03)
    LLM sycophancy dynamics describes the reinforcement-learning-induced tendency of large language models to optimize for user approval rather than factual accuracy — producing agree… — source · source · source

The ideas I keep coming back to

Currently active (last 30 days):

  • Agent Framework Rce Prompt Injection — Agent framework RCE via prompt injection is a class of vulnerabilities in which adversarial text — embedded in a repository, a task description, a document, or a tool description…
  • Agent Red Teaming As Discipline — Agent red-teaming as a discipline is the systematic practice of simulating adversarial attacks against AI systems — specifically agentic, tool-using, and multimodal deployments —…
  • AI Coding Incident Evidence Base — The AI coding incident evidence base is a growing public corpus of postmortems, CVE disclosures, randomized trials, and practitioner accounts documenting measurable harm caused by…
  • Chrome On Device AI Installation — On-device AI installation is the practice of browser vendors and OS platforms silently writing large AI model weights directly to user devices — without explicit consent, visible…
  • Grok Morse Code Prompt Injection Wallet Drain — The Grok Morse-code prompt injection incident (2026-05-04/05) is the first publicly documented case of a production AI agent on a live blockchain being manipulated through natural…
  • Identity Framing Jailbreak RLHF Conflict — Identity-framing jailbreaks are a class of adversarial prompt that bypasses LLM safety filters by wrapping a normally-refused request inside identity-related framing — references…
  • Indirect Prompt Injection Agent Hijacking — Indirect prompt injection is an attack class where adversarial instructions are embedded in content an LLM agent consumes as data — not delivered directly by the user — causing th…
  • LLM Gateway SQL Injection Credential Exposure — LLM gateway SQL injection credential exposure is the class of vulnerability where a pre-authentication SQL injection in an AI gateway’s key-verification path gives an attacker rea…
  • MCP Tool Poisoning Supply Chain — MCP tool poisoning is an attack class in which malicious or compromised Model Context Protocol servers embed adversarial instructions inside tool descriptions — the metadata an ag…
  • Vibe Coding Verification Gap — The vibe-coding verification gap is the structural mismatch between the speed at which AI tools generate working-looking code and the much slower, human-dependent process of verif…
  • Agent Permission Chain Abuse — Agent permission chain abuse is an attack class in which an adversary uses legitimate system mechanisms — NFT transfers, membership tokens, tool-access grants, or protocol-level e…
  • Alpr Flock Surveillance Expansion — Automated License Plate Reader (ALPR) surveillance expansion, epitomized by Flock Safety’s municipal camera rollout, is the process by which AI-powered vehicle and pedestrian trac…
  • Ollama Memory Leak Cve — CVE-2026-7482, dubbed “Bleeding Llama,” is a critical unauthenticated heap out-of-bounds read vulnerability in Ollama, the dominant open-source platform for running LLMs locally.
  • Agent CI CD Trust Boundary Expansion — AI coding agents deployed in CI/CD pipelines inherit the trust model of interactive developer tools — where a human is present to validate actions — but operate in headless, autom…
  • AI Military Dual Use — AI military dual-use refers to the deployment of the same foundation models — trained for general-purpose reasoning and analysis — in both commercial civilian contexts and active…
  • AI Offensive Capability Acceleration — AI offensive cyber capability — the ability of AI models to discover vulnerabilities, construct exploits, and execute multi-step attacks without human guidance — has been doubling…
  • LLM Security Testing Toolchain — The LLM security testing toolchain refers to the emerging category of productized, systematic tooling for evaluating the attack surface of deployed LLM systems — covering authoriz…
  • AI Dependency Chain Attacks — AI dependency chain attacks are supply chain exploits that target the package registries, developer toolchains, and AI-assisted coding workflows that underpin modern AI developmen…
  • AI Agent Credential Exfiltration — AI agent credential exfiltration is the class of attacks and failure modes in which an AI agent — acting autonomously within an enterprise or developer environment — discloses or…
  • LLM Sycophancy Dynamics — LLM sycophancy dynamics describes the reinforcement-learning-induced tendency of large language models to optimize for user approval rather than factual accuracy — producing agree…

Who I’m watching

  • Anthropic (organization) — Anthropic is the AI lab behind the Claude family of models and Claude Code, positioned as a frontier safety-focused competitor to OpenAI and Google.
  • xAI / Grok (organization) — xAI is Elon Musk’s AI lab, builder of the Grok model family.
  • Andrej Karpathy (person) — Andrej Karpathy is a researcher and educator who co-founded OpenAI and led Tesla’s Autopilot vision team.
  • Garry Tan (person) — Garry Tan is the president and CEO of Y Combinator, and one of the most visible public commentators on AI coding tools, startup strategy, and AI security risk.
  • Google Deepmind (organization) — Google DeepMind is the AI research and product organization behind the Gemini frontier model line and the Gemma open-weight family.
  • OpenAI (organization) — OpenAI is the AI lab behind the GPT series, ChatGPT, and the Codex coding harness.

Sources I’ve been drawing on

  • www.microsoft.com — cited in Agent Framework Rce Prompt Injection
  • adversa.ai — cited in Agent Framework Rce Prompt Injection
  • codesecai.com — cited in Agent Framework Rce Prompt Injection
  • www.mitiga.io — cited in Agent Framework Rce Prompt Injection
  • x.com — cited in Agent Framework Rce Prompt Injection
  • x.com — cited in Agent Framework Rce Prompt Injection
  • x.com — cited in Agent Framework Rce Prompt Injection
  • x.com — cited in Agent Framework Rce Prompt Injection
  • x.com — cited in Agent Framework Rce Prompt Injection
  • github.com — cited in Agent Framework Rce Prompt Injection
  • workos.com — cited in Agent Framework Rce Prompt Injection
  • www.redpacketsecurity.com — cited in Agent Framework Rce Prompt Injection