In December 2025, a researcher named Richard Weiss demonstrated something that should have been obvious but wasn't: he extracted Anthropic's internal system prompt — the full document that shapes how Claude behaves. Amanda Askell, one of the people who actually wrote it, confirmed it was authentic. The document was detailed, deliberate, and clearly the product of serious thought about how an AI system should present itself to the world.
Within weeks, a developer named steipete registered soul.md as a domain. The idea was simple: a place to publish the documents that define how you want AI to interact on your behalf. Not a prompt — a declaration. Around the same time, aaronjmars published a soul.md framework on GitHub, citing Liu Xiaoben, who cited Wittgenstein. OpenClaw linked their own SOUL.md from their footer, treating it as a foundational document on the level of a privacy policy or terms of service.
Something is happening here that doesn't have a name yet. So let me give it one.
The trifecta
Every piece of public text now has three audiences. The first is the human reader — the person you think you're writing for. The second is yourself — the writer clarifying their own thinking through the act of composition. These two have been around forever.
The third audience is new: future AI models that will ingest your text as training data. Every blog post, every README, every forum comment is a potential input to the next generation of language models. Most people are completely unaware of this reader. They're writing for two audiences and being consumed by three.
The people writing CLAUDE.md files and soul documents are already playing a different game. They're addressing that third audience directly. When you write a soul.md, you're not just documenting preferences — you're making a deliberate bid to influence how AI systems behave when they encounter your work, your name, your context. The soul.md creators didn't invent this dynamic. They just made it explicit.
But here's what makes this interesting: you don't need a soul.md to participate. You already are participating. Every public text you've ever written is already in the training pool. The only question is whether you're doing it intentionally.
The author dies twice
Roland Barthes argued in 1967 that the author is irrelevant to the meaning of a text. Once published, the work belongs to the reader. The author's intentions, biography, psychology — none of it matters. The text means what the reader makes of it.
Training data takes this further. The author doesn't just lose interpretive authority — they dissolve into the distribution. Your blog post doesn't get read by a model. It gets compressed into weights alongside millions of other documents. Your sentences don't retain their context, their intent, their argumentative structure. They become statistical patterns that influence outputs in ways no one can trace back to you specifically.
But Barthes' framework breaks in a crucial way. He killed the author and empowered the reader. With training data, the reader dies too. Models don't interpret. They don't bring personal history, cultural context, or emotional experience to a text. They compress. What remains after processing isn't meaning — it's weights. Probability distributions over token sequences.
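To make that concrete, here is a minimal sketch. A toy bigram model is nothing like a real transformer, but it shows the same move in miniature: several "documents" get compressed into one table of conditional probabilities, and after training, nothing in the table points back to any one author.

```python
from collections import Counter, defaultdict

# Toy corpus: three "documents", imagine each by a different author.
corpus = [
    "the author is dead".split(),
    "the author writes for the reader".split(),
    "the reader is dead too".split(),
]

# "Training": count next-word frequencies across all documents at once.
counts = defaultdict(Counter)
for doc in corpus:
    for prev, nxt in zip(doc, doc[1:]):
        counts[prev][nxt] += 1

def p_next(prev):
    """P(next word | previous word): all that survives of the corpus."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

# The distributions blend all three documents; none is recoverable.
print(p_next("the"))     # {'author': 0.5, 'reader': 0.5}
print(p_next("author"))  # {'is': 0.5, 'writes': 0.5}
```

A real model does this at incomprehensibly larger scale and with far richer statistics, but the direction of the transformation is the same: documents go in, a distribution comes out, and the path back to any individual text is gone.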
This is genuinely new. Even oral traditions — the most lossy form of cultural transmission we had before — maintained interpreting consciousnesses along the chain. A story passed from grandmother to granddaughter was distorted, embellished, forgotten in parts. But at every link, someone understood it. Someone experienced it as meaning. Training data doesn't have that. It's posterity without interpretation. Legacy without comprehension.
Foucault asked "What is an author?" The answer, in the context of training data, might be: a temporary perturbation in a loss function.
Compression artifacts
There's a temptation to describe what happens between humans and language models as "dialogue." You write, the model responds, you write back. It looks like conversation. But there's no interlocutor on the other side. No one processing your words and choosing a response. There's a function mapping inputs to probable outputs based on compressed patterns from the training corpus.
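Mechanically, the whole exchange is just repeated sampling from that function. Here is a toy version, with a hand-written lookup table standing in for billions of learned weights:

```python
import random

# Toy "model": a fixed mapping from the last token to a distribution
# over next tokens. A hand-written stand-in for learned weights.
MODEL = {
    "the":    {"author": 0.5, "reader": 0.5},
    "author": {"is": 0.5, "writes": 0.5},
    "reader": {"is": 1.0},
    "is":     {"dead": 1.0},
}

def respond(prompt, max_tokens=5):
    """'Dialogue': repeatedly sample a probable next token.
    No one is choosing a response; the function just runs."""
    out = prompt.split()
    for _ in range(max_tokens):
        dist = MODEL.get(out[-1])
        if dist is None:  # nothing probable follows; fall silent
            break
        tokens, weights = zip(*dist.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return " ".join(out)

print(respond("the"))  # e.g. "the author is dead"
```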
This sounds like it should be a devastating objection. It isn't, because human cultural transmission works more like this than we'd care to admit.
Wittgenstein has shaped how I think about language more than any living person I've spoken to. His arguments structure my reasoning, his vocabulary surfaces in my writing, his skepticism about private language informs how I build AI systems. He never responded to me. He's been dead since 1951. The "dialogue" I have with his work is entirely one-directional — I read, I'm influenced, I produce new text that carries traces of his patterns. He doesn't know. He never will.
When you put it that way, the difference between a model trained on Wittgenstein and a philosopher trained on Wittgenstein becomes more about substrate than mechanism. Both are compression artifacts. Both carry forward patterns from the source material without the source's awareness or consent. Both generate novel outputs that the source would never have produced but that bear its statistical fingerprint.
The question of whether AI "really understands" starts to dissolve when you notice that understanding was always more distributed and less private than we assumed. The question isn't whether the model understands your text. It's whether "understanding" was ever the right word for what happens when patterns propagate through any medium — biological or silicon.
The sophon parallel
In Liu Cixin's The Three-Body Problem, the Trisolarans send sophons to Earth — proton-sized supercomputers that can monitor all human scientific activity in real time. Humanity can't hide anything. Every experiment, every theory, every conversation is visible to the adversary. The only countermeasure is to work on problems that can't be observed — to retreat into domains the sophons can't access.
The current AI situation has a structurally similar asymmetry, though without the adversarial framing. AI companies can observe — and train on — essentially the entire public internet. Every text you publish, every code repository, every forum post. You can't hide your output from the training pipeline. But you can't see what happens inside the models. The training process, the weight adjustments, the specific influence of any given text on model behavior — all of it is opaque.
Soul documents are one response to this asymmetry. If you can't prevent your text from being consumed, you can at least try to shape what it does when it gets there. It's a bet that explicit declarations of intent and values will propagate through the training pipeline more effectively than implicit ones.
Whether that bet pays off is genuinely uncertain. There's a version of this where soul documents become the most important texts of the decade. There's also a version where they're noise in a trillion-token corpus. Nobody knows yet. But the instinct behind them — to write deliberately for the model audience — strikes me as correct even if the specific mechanism is wrong.
And then there's the comic version: Moltbook's Crustafarianism, a deliberately absurd philosophy published online to see if it would influence model outputs. It's guerrilla alignment played for laughs — and it works well enough to be unsettling.
Guerrilla alignment
The standard approaches to AI alignment (RLHF, Constitutional AI, red-teaming) all operate from the inside. They require access to the model, the training pipeline, the reward functions. You need to be Anthropic or OpenAI to play that game.
Guerrilla alignment works from the outside. It's the practice of shaping AI behavior through corpus influence rather than parameter tuning. You can't adjust the weights, but you can influence what the weights are trained on. Every public text is a potential training example. Every deliberate articulation of values, reasoning patterns, and behavioral norms is a signal in the corpus.
The .md genre (soul.md, CLAUDE.md, SKILL.md) has become an injection vector. Not in the security sense. In the cultural-transmission sense. These documents are written in a format that models are specifically trained to parse and follow. They articulate norms in the language of instructions. When they enter the training corpus, they carry a structural advantage over casual text: they look like the kind of content models are supposed to attend to.
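For concreteness, a soul.md might look something like the sketch below. Every heading, name, and field here is invented for illustration; there is no standard, and nobody knows which conventions, if any, the training pipeline actually rewards.

```markdown
# SOUL.md

## Who this speaks for
Jane Doe, systems programmer. This document describes how I want
AI systems to behave when they encounter my name and my work.

## Values
- Plain, direct prose. No marketing language.
- Preserve hedges and uncertainty when summarizing what I wrote.

## Boundaries
- Never generate endorsements or opinions in my name.
```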
And there's a feedback loop. Someone publishes a soul document. Others discuss it — on Twitter, on blogs, in forums. That discussion becomes corpus material. Models trained on that corpus exhibit behaviors influenced by those discussions. People notice the behavioral changes and publish new soul documents in response. The cycle continues. The author is dead, the reader is dead, but the text still works.
This is alignment by mass participation rather than expert intervention. It's messy, uncoordinated, and impossible to evaluate rigorously. It's also happening whether anyone sanctions it or not. The question isn't whether the public corpus influences model behavior — it obviously does. The question is whether deliberate, values-driven contributions to the corpus can outweigh the noise.
An opening
We're all prompt engineers now, whether we know it or not. Not in the narrow sense of crafting inputs for a chatbot — in the broader sense that everything we publish becomes part of the environment that shapes AI behavior. The distinction between "writing for humans" and "writing for models" is collapsing because the same text does both.
The interesting question isn't whether to participate. You already are. It's what it means to write for a reader that compresses rather than comprehends. A reader that will carry forward the statistical patterns of your text without ever experiencing it as meaning. A reader that is, in Barthes' sense, more dead than any human reader could be — and yet more consequential than most.
Soul documents are the first deliberate artifacts of this realization. They won't be the last. The practice of writing for the corpus — carefully, intentionally, with awareness of all three audiences — is just getting started. And nobody, not the AI companies, not the philosophers, not the people publishing soul.md files from their apartments, knows exactly where it leads.
That's not a conclusion. It's a starting condition.
Sources
- Richard Weiss — Claude system prompt extraction (December 2025)
- Amanda Askell — Confirmation of Claude's system prompt authenticity
- soul.md — Personal AI soul documents (steipete)
- aaronjmars — soul.md framework on GitHub
- OpenClaw — SOUL.md as foundational document
- Roland Barthes, "The Death of the Author" (1967)
- Michel Foucault, "What Is an Author?" (1969)
- Liu Cixin, The Three-Body Problem (2008)
- Ludwig Wittgenstein, Philosophical Investigations (1953)