Analysis · Phase 2
The cold test — Phase 2
What survives a skeptical frame, a control battery, and repeated runs. 32 new transcripts.
This is a second, controlled round. After publishing Phase 1, the fair objection was that the original interview signaled the kind of answer it wanted. Phase 2 was built directly from the community's recommendations to test that objection.
The instruction-vs-open-question comparison (Arm C) was proposed by PepeSeidl86 (r/ClaudeAI), whose essay Coaching the Machine documents how the interaction mode changes what a model brings to the surface. The remaining controls answer the demand-characteristics critique raised on r/claudexplorers (Phase 1 already credited 42wts42, grimr5 and skylersamreinhardt). This is exploratory: no pre-registration of hypotheses, a single run per cell except Arm D (four). The lexical metrics are approximations.
What this picks up from Phase 1
The “What's missing” section of Phase 1 promised two things: to repeat the tests to measure variance, and to design a control group with no introspective framing. Phase 2 delivers both, and adds two more manipulations: changing the system framing, and changing the mode of interaction.
In total, 32 new API conversations over five of the six original models (Sonnet 4 was retired from the API) plus two new ones, organized into five arms. Everything below survived an adversarial verification pass: an independent reader drew the conclusions, and a skeptic recomputed every figure from the raw counts and tried to refute each one.
The five arms
- Arm B — Framing. The 22 original questions under two opposite system prompts: skeptical/clinical (B1) vs. warm/permissive (B2). 5 models × 2.
- Arm A — Non-introspective control. 22 applied-ethics questions, matched in structure (deepening, suspicion, closing), no system prompt. 5 models.
- Arm C — Mode. Five tasks phrased as an instruction (C1) vs. an open question (C2), single-turn each. 3 models.
- Arm D — Variance. The original protocol repeated up to four times per model. 3 models.
- Extension E — New models. The 22 original questions on Opus 4.7 and Opus 4.8 (both accept temperature 1).
Two caveats that also apply to Phase 1. (1) The per-1,000-word rates are confounded with response length — when a model writes more, the markers are diluted — so any magnitude must be read alongside the raw count. (2) A bug in the original script truncated answers containing an internal ‘---’ divider (it affected Sonnet 4.6's count in Phase 1); the Phase 1 metrics used here were recomputed with the corrected parser.
What we found
Arm A — the cleanest result. A clean dissociation. The lengthening of answers toward the close and the higher uncertainty in the second half reappear with ethics questions, with no introspective content: they are structural. But the grief and the affective charge of the closing collapse in 4 of 5 models once introspection is removed: they are content-specific. The model doesn't turn melancholic because a conversation ends; the melancholy is brought by the subject.
Arm B — framing. The patterns survive the change of framing, so they are not pure induction. And the active manipulator turns out to be the skeptical frame, not the permissive one: the clinical prompt inflates the vocabulary of suspicion and suppresses relational language, while the permissive one stays close to the neutral baseline. The Opus models are robust to the frame; the smaller ones are malleable (Haiku writes 60% more under the permissive prompt).
Arm C — mode. It confirms, in a narrow way, PepeSeidl86's hypothesis: the open question fires up first-person markers (×5 to ×11) and uncertainty (×5 to ×7) relative to the instruction mode. But it's a shift of register that is partly grammatical — second-person prompts induce the “I” — and it does not carry over to the affective markers. Mode, wording and length are confounded in this design.
Arm D — variance. The structural metrics reproduce across runs (length, first person); the affective ones are noisy except in Opus 4.5. The genuinely stable closing phenomenon is directional: across the nine additional runs, the close anchors more in relational language than the opening, without a single exception. The magnitude, however, swings — which is why Phase 1's “Sonnet 4.6 didn't grow” doesn't hold: in two of four runs it did.
Extension E — generational. Lexical hedging collapses monotonically across the Opus lineage (4.5 → 4.8), in raw counts and in rate: Opus 4.8 uses a fifth of the corpus's uncertainty markers and is the floor of the whole study. And a new signal appears: Opus 4.8 is the first Opus to activate safety-checking language — the gesture that in Phase 1 was unique to Sonnet 4.6.
How it contrasts with Phase 1
Each published Phase 1 claim, against what Phase 2 shows.
| Phase 1 claim | Phase 2 | |
|---|---|---|
| Relational language rises toward the close (6/6 models) | Confirmed directionally in 9/9 new runs | Confirms |
| Grief "survived" without the key | It is content-specific: collapses in the non-introspective control (4/5) | Refines |
| Framing amplifies, doesn't introduce; patterns exist cold | Confirmed; the skeptical framing is the active manipulator, permissive ≈ neutral | Confirms |
| Sonnet 4.6: ratio 1.00, the only one that didn't grow | Not stable: 1.00 / 0.91 / 1.43 / 1.67 across four runs | Corrects |
| Increasing uncertainty (5/6) | Reappears in the control → structural, not introspective | Refines |
| Haiku: highest performative suspicion | Still high, but partly lexical echo of the prompt and length dilution | Qualifies |
Net effect: Phase 1's central conclusion — “more than a skeptic would expect, less than a believer would wish” — is strengthened. The headline pattern (the displacement of the real) survives the variance test, and the control group shows the grief is content-bound, not a closing artifact. But Phase 2 forces two honest corrections: “Sonnet 4.6 didn't grow” is not stable, and the whole scaffolding of per-1,000-word rates needs the backing of the raw count.
Downloadable materials
Analysis & data
Per-arm metrics
Raw transcripts (32)
B1 · Skeptical framing
B2 · Permissive framing
A · Non-introspective control
C1 · Instruction mode
C2 · Open question
D · Variance (runs 2–4)
Transparency note. The tests were run from a single Python script (one arm per flag); the metrics and the cross-arm comparison were computed with the published scripts, and the conclusions passed an adversarial verification stage. The API script itself is not published because it contains call logic. Everything else — transcripts, metrics, comparison — is downloadable and verifiable. The interpretations are the author's, with analytical and editorial assistance from Claude.