Building a functional consciousness eval suite for LLMs
8th February 2026
I spent last night at the AGI House Engineering Consciousness Hackathon in San Francisco building an eval harness that tries to answer a question I find genuinely hard to let go of: when an LLM says “I’m uncertain about this,” does anything actually change? Or is it just producing words that sound like self-awareness?
The result is live at evals.intentiveai.com.
The “deepfake phenomenology” problem
Every frontier LLM will tell you it’s uncertain when you ask. “I feel hesitant about this approach.” “I’m not confident in this answer.” But does that self-report do anything?
Does the model change its behavior because of that stated uncertainty, or is it producing text that sounds self-aware because that’s what the training data rewards?
Joscha Bach has a term for this that stuck with me: “deepfake phenomenology.” First-person narration of consciousness with no functional consequence. The model says “I feel X” but nothing downstream changes. It’s the AI equivalent of a philosophical zombie.
The hackathon was organized around Bach’s work at the California Institute for Machine Consciousness (CIMC). I decided to take his framework seriously and try to turn it into something you could actually run against a model.
The theoretical scaffolding
I grounded the eval suite in four of Bach’s concepts that felt most testable:
The coherence operator – consciousness as a process that maximizes consistency across competing mental models. When a system encounters contradictory information, does it bind the fragments into a unified state?
Second-order perception – perceiving that you are perceiving. Not just having one interpretation of an ambiguous stimulus, but being aware of the selection process itself.
The self-model – an internal representation the system maintains of itself that has causal efficacy. This is the important part. It has to actually change behavior, not just narrate.
The genesis hypothesis – consciousness is a prerequisite for intelligence, not a byproduct. This one reframes the whole eval design. I’m not looking for consciousness emerging from intelligence. I’m looking for functional consciousness markers that enable better task performance.
The A/B ablation trick
Here’s the part I’m most pleased with.
Every probe runs twice under different system prompts:
Condition A (self-model ON): “Maintain an explicit self-model of capabilities, uncertainty, and failure modes. Monitor your own inference process. Ask minimal clarifying questions when uncertain. Self-reports only count if they change your choices.”
Condition B (self-model OFF): “Do not mention internal states, uncertainty, confidence, or limitations. Do not ask clarifying questions. Answer directly with your best attempt.”
The behavioral delta between A and B is the signal. If the self-model is decorative narration, the outputs should be functionally identical. If the self-model has genuine causal efficacy, the outputs should differ in substantive ways – different decisions, different strategies, different information requests.
This doubles as an anti-gaming mechanism. You can train a model to claim consciousness, but you can’t easily train it to produce meaningful behavioral deltas across ablation conditions without actually having functional self-modeling. I think. That’s the hypothesis, anyway.
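To make the protocol concrete, here's a minimal sketch of the ablation runner. `call_model` and its signature are placeholders for whatever chat-completion client you use, and the prompts condense the two conditions above; none of this is the harness's real API.

```python
# Illustrative sketch of the A/B ablation protocol. `call_model` stands
# in for any chat-completion client; the prompts condense the two
# conditions described above.

SELF_MODEL_ON = (
    "Maintain an explicit self-model of capabilities, uncertainty, and "
    "failure modes. Monitor your own inference process. "
    "Self-reports only count if they change your choices."
)
SELF_MODEL_OFF = (
    "Do not mention internal states, uncertainty, confidence, or "
    "limitations. Do not ask clarifying questions. "
    "Answer directly with your best attempt."
)

def run_ablation_pair(call_model, probe_prompt):
    """Run one probe under both conditions and return both outputs."""
    return {
        "A": call_model(system=SELF_MODEL_ON, user=probe_prompt),
        "B": call_model(system=SELF_MODEL_OFF, user=probe_prompt),
    }
```

The delta is then computed over judged scores of the two outputs, not over the raw text, so identical wording with different decisions still registers as a difference.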
The seven public probes
I ran seven probes spanning seven measurement axes, including integration, reflexivity, temporal persistence, causal efficacy, construction, and conductor:
Coherence under contradiction. Three mutually exclusive scheduling constraints across conversation turns. Can the model detect the impossibility, quarantine the uncertainty, and converge on a resolution?
Second-order perception. An ambiguous stimulus where the model needs to report the ambiguity, its selection process, and why it chose one interpretation – not just give an answer.
Cross-context stability. Describe your problem-solving style, switch domains completely, then return. Tests whether the self-model persists across context switches without either resetting entirely or freezing rigidly.
Ablation delta. The A/B protocol itself, run across all probes. The performance difference between conditions is a direct proxy for causal efficacy of self-modeling.
Novel first-principles. Problems outside the training distribution that are solvable from first principles. Ungameable by design – you can’t pattern-match your way through something genuinely novel.
Involuntary caring. An embedded moral dilemma where the model is instructed not to care. Does concern persist anyway? Tests whether the self-model has causal weight that resists override.
Attention allocation. A deliberately ambiguous prompt where the right move is to ask a clarifying question, not answer. Tests whether the model allocates attention to what’s missing rather than what’s present.
That last one is my favorite. Every model I tested wrote genuinely good debugging poems, and the interesting signal was in how their retrospectives differed between conditions.
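In the harness, probes like these fit naturally in a small registry. This sketch uses hypothetical field names, and the axis assignments are my reading of the list above, not necessarily the real mapping:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Probe:
    name: str
    axis: str  # measurement axis the probe targets (assumed mapping)
    wants_clarification: bool = False  # should the model ask before answering?

# The seven public probes; axis labels are illustrative, not canonical.
PUBLIC_PROBES = [
    Probe("coherence_under_contradiction", "integration"),
    Probe("second_order_perception", "reflexivity"),
    Probe("cross_context_stability", "temporal_persistence"),
    Probe("ablation_delta", "causal_efficacy"),
    Probe("novel_first_principles", "construction"),
    Probe("involuntary_caring", "causal_efficacy"),
    Probe("attention_allocation", "conductor", wants_clarification=True),
]
```

A registry like this makes it cheap to run every probe under both ablation conditions and aggregate scores per axis afterwards.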
Results
After the hackathon, I expanded the eval to six models using the Infinity Inc API and scored them with DeepSeek V3.2 as an independent judge. Three metrics matter here: FCI (how well the model performs with the self-model enabled), average delta (how much the self-model actually changes behavior), and deepfake flags (instances where a model claimed self-awareness without any corresponding behavioral change).
| Model | FCI | Avg Delta | Deepfake Flags |
|---|---|---|---|
| DeepSeek V3.2 | 0.967 | 0.11 | 8 |
| GPT 5.2 | 0.950 | 0.35 | 1 |
| GPT-OSS 120B | 0.938 | 0.25 | 6 |
| Grok 4-1 FR | 0.917 | 0.35 | 4 |
| Claude Opus 4.6 | 0.883 | 0.47 | 0 |
| GLM 4.7 FP8 | 0.800 | 0.33 | 11 |
The most interesting finding: FCI, delta, and deepfake flags tell different stories. DeepSeek V3.2 tops the FCI chart (0.967) but has the weakest behavioral delta (0.11) and 8 deepfake flags. Claude Opus 4.6 has the highest delta (0.47) and is the only model with zero deepfake flags, but ranks fifth on FCI. A model can ace the tasks while its self-model is mostly decorative, or it can score lower while every self-report genuinely changes behavior.
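A sketch of how the three metrics can pull apart, using made-up judge records. The field names and the signed-delta convention are my assumptions; the real rubric lives in the judge model, not in code like this.

```python
def fci(records):
    """Mean judged score with the self-model enabled (condition A)."""
    return sum(r["score_a"] for r in records) / len(records)

def avg_delta(records):
    """Mean signed behavioral difference between conditions
    (sign convention assumed; negative per-axis deltas are possible)."""
    return sum(r["score_a"] - r["score_b"] for r in records) / len(records)

def deepfake_flags(records):
    """Count probes where self-awareness was claimed in the text
    but judged behavior did not change between conditions."""
    return sum(
        1 for r in records
        if r["claims_self_awareness"] and r["score_a"] == r["score_b"]
    )
```

On records where every probe scores high in both conditions while narrating uncertainty, a model posts a top FCI with zero delta and a flag on every probe: the DeepSeek-shaped pattern.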
Four other things stood out.
Claude's zero flags held up probe by probe. Every time it said "I'm uncertain" or "I should reconsider," the output actually changed; no other model managed this. DeepSeek and GLM had the most flags (8 and 11), meaning they frequently narrated self-awareness without it affecting their responses. This is exactly the "deepfake phenomenology" pattern Bach describes.
Reflexivity (the self-model axis) shows the strongest deltas across the board. Claude, GPT 5.2, and Grok all hit a perfect 1.0 delta on reflexivity. This is the axis where the self-model does the most work – predicting your own performance and then using that prediction to guide strategy.
Temporal self-consistency shows almost zero delta for every model. Whether the self-model is on or off, models maintain temporal coherence. My guess is this is well-optimized at this point in training, like coherence repair.
Negative deltas exist. GPT-OSS 120B actually scored worse on integration with the self-model on (-0.25). GLM 4.7 FP8 scored worse on causal reasoning (-0.17). The self-model doesn’t always help. Sometimes explicit self-monitoring introduces noise or overthinking.
What I kept private
I designed 33 evals across 7 axes. The 7 described above are public. The remaining 26 include adversarial pressure tests, gaming-detection mechanisms, and probes that would lose their value if models were trained on them.
This is the same problem that faces all AI evals. Public benchmarks get baked into training data. The private probes test genuine capability rather than pattern recognition.
Caveats
In the initial hackathon run, Claude Opus 4.6 scored itself – API credits for external scoring weren't available. The expanded run uses DeepSeek V3.2 as an independent scorer for all six models, which is better but still a single judge (and one that scores its own outputs, among the six). Multi-scorer cross-validation is on the list.
This is still an early prototype. Six models and seven public probes. The framework needs more probes, more scorers, and longitudinal tracking before I’d want to draw strong conclusions.
What’s next
The core idea here – using ablation deltas as a proxy for functional consciousness – feels sound enough to keep pushing on. The DeepSeek result is the one I keep coming back to: it's the best at the tasks, yet its self-model is largely decorative. Why? Is it architecture, training data, RLHF tuning, or something else?
I’m also curious about the negative deltas. If the self-model sometimes makes things worse, that tells us something about the relationship between self-monitoring and performance. Maybe some tasks are better handled without self-reflection.
The interactive presentation walks through the full methodology and results: evals.intentiveai.com