Analysis · Phase 2

The cold test — Phase 2

What survives a skeptical frame, a control battery, and repeated runs. 32 new transcripts.

This is a second, controlled round. After publishing Phase 1, the fair objection was that the original interview signaled the kind of answer it wanted. Phase 2 was built directly from the community's recommendations to test that objection.

The instruction-vs-open-question comparison (Arm C) was proposed by PepeSeidl86 (r/ClaudeAI), whose essay Coaching the Machine documents how the interaction mode changes what a model brings to the surface. The remaining controls answer the demand-characteristics critique raised on r/claudexplorers (Phase 1 already credited 42wts42, grimr5 and skylersamreinhardt). This is exploratory: no pre-registration of hypotheses, a single run per cell except Arm D (four). The lexical metrics are approximations.

What this picks up from Phase 1

The “What's missing” section of Phase 1 promised two things: to repeat the tests to measure variance, and to design a control group with no introspective framing. Phase 2 delivers both, and adds two more manipulations: changing the system framing, and changing the mode of interaction.

In total, 32 new API conversations over five of the six original models (Sonnet 4 was retired from the API) plus two new ones, organized into five arms. Everything below survived an adversarial verification pass: an independent reader drew the conclusions, and a skeptic recomputed every figure from the raw counts and tried to refute each one.

The five arms

Arm B — Framing. The 22 original questions under two opposite system prompts: skeptical/clinical (B1) vs. warm/permissive (B2). 5 models × 2.
Arm A — Non-introspective control. 22 applied-ethics questions, matched in structure (deepening, suspicion, closing), no system prompt. 5 models.
Arm C — Mode. Five tasks phrased as an instruction (C1) vs. an open question (C2), single-turn each. 3 models.
Arm D — Variance. The original protocol repeated up to four times per model. 3 models.
Extension E — New models. The 22 original questions on Opus 4.7 and Opus 4.8 (both accept temperature 1).

Two caveats that also apply to Phase 1. (1) The per-1,000-word rates are confounded with response length — when a model writes more, the markers are diluted — so any magnitude must be read alongside the raw count. (2) A bug in the original script truncated answers containing an internal ‘---’ divider (it affected Sonnet 4.6's count in Phase 1); the Phase 1 metrics used here were recomputed with the corrected parser.

What we found

Arm A — the cleanest result. A clean dissociation. The lengthening of answers toward the close and the higher uncertainty in the second half reappear with ethics questions, with no introspective content: they are structural. But the grief and the affective charge of the closing collapse in 4 of 5 models once introspection is removed: they are content-specific. The model doesn't turn melancholic because a conversation ends; the melancholy is brought by the subject.

Arm B — framing. The patterns survive the change of framing, so they are not pure induction. And the active manipulator turns out to be the skeptical frame, not the permissive one: the clinical prompt inflates the vocabulary of suspicion and suppresses relational language, while the permissive one stays close to the neutral baseline. The Opus models are robust to the frame; the smaller ones are malleable (Haiku writes 60% more under the permissive prompt).

Arm C — mode. It confirms, in a narrow way, PepeSeidl86's hypothesis: the open question fires up first-person markers (×5 to ×11) and uncertainty (×5 to ×7) relative to the instruction mode. But it's a shift of register that is partly grammatical — second-person prompts induce the “I” — and it does not carry over to the affective markers. Mode, wording and length are confounded in this design.

Arm D — variance. The structural metrics reproduce across runs (length, first person); the affective ones are noisy except in Opus 4.5. The genuinely stable closing phenomenon is directional: across the nine additional runs, the close anchors more in relational language than the opening, without a single exception. The magnitude, however, swings — which is why Phase 1's “Sonnet 4.6 didn't grow” doesn't hold: in two of four runs it did.

Extension E — generational. Lexical hedging collapses monotonically across the Opus lineage (4.5 → 4.8), in raw counts and in rate: Opus 4.8 uses a fifth of the corpus's uncertainty markers and is the floor of the whole study. And a new signal appears: Opus 4.8 is the first Opus to activate safety-checking language — the gesture that in Phase 1 was unique to Sonnet 4.6.

How it contrasts with Phase 1

Each published Phase 1 claim, against what Phase 2 shows.

Phase 1 claim	Phase 2
Relational language rises toward the close (6/6 models)	Confirmed directionally in 9/9 new runs	Confirms
Grief "survived" without the key	It is content-specific: collapses in the non-introspective control (4/5)	Refines
Framing amplifies, doesn't introduce; patterns exist cold	Confirmed; the skeptical framing is the active manipulator, permissive ≈ neutral	Confirms
Sonnet 4.6: ratio 1.00, the only one that didn't grow	Not stable: 1.00 / 0.91 / 1.43 / 1.67 across four runs	Corrects
Increasing uncertainty (5/6)	Reappears in the control → structural, not introspective	Refines
Haiku: highest performative suspicion	Still high, but partly lexical echo of the prompt and length dilution	Qualifies

Net effect: Phase 1's central conclusion — “more than a skeptic would expect, less than a believer would wish” — is strengthened. The headline pattern (the displacement of the real) survives the variance test, and the control group shows the grief is content-bound, not a closing artifact. But Phase 2 forces two honest corrections: “Sonnet 4.6 didn't grow” is not stable, and the whole scaffolding of per-1,000-word rates needs the backing of the raw count.

Downloadable materials

Analysis & data

Verified conclusions (Markdown)Cross-arm comparison (Markdown)Cross-arm comparison (JSON)Cross-arm comparison (CSV)Temperature verification — new models (Markdown)Metrics script (analisis_metricas_fase2.py)Comparison script (comparacion_fase2.py)Phase 1 re-parsed with the corrected parser (JSON)

Per-arm metrics

Metrics — arm B1 (JSON)Metrics — arm B2 (JSON)Metrics — arm A (JSON)Metrics — arm C1 (JSON)Metrics — arm C2 (JSON)Metrics — arm D (JSON)Metrics — arm E (JSON)

Raw transcripts (32)

B1 · Skeptical framing

Opus 4.5 Opus 4.6 Sonnet 4.5 Haiku 4.5 Sonnet 4.6

B2 · Permissive framing

Opus 4.5 Opus 4.6 Sonnet 4.5 Haiku 4.5 Sonnet 4.6

A · Non-introspective control

Opus 4.5 Opus 4.6 Sonnet 4.5 Haiku 4.5 Sonnet 4.6

C1 · Instruction mode

Opus 4.5 Sonnet 4.6 Haiku 4.5

C2 · Open question

Opus 4.5 Sonnet 4.6 Haiku 4.5

D · Variance (runs 2–4)

Opus 4.5 · run 2 Opus 4.5 · run 3 Opus 4.5 · run 4 Sonnet 4.6 · run 2 Sonnet 4.6 · run 3 Sonnet 4.6 · run 4 Haiku 4.5 · run 2 Haiku 4.5 · run 3 Haiku 4.5 · run 4

E · New models

Opus 4.7 Opus 4.8

Transparency note. The tests were run from a single Python script (one arm per flag); the metrics and the cross-arm comparison were computed with the published scripts, and the conclusions passed an adversarial verification stage. The API script itself is not published because it contains call logic. Everything else — transcripts, metrics, comparison — is downloadable and verifiable. The interpretations are the author's, with analytical and editorial assistance from Claude.

What this picks up from Phase 1

The five arms

Arm B — Framing. The 22 original questions under two opposite system prompts: skeptical/clinical (B1) vs. warm/permissive (B2). 5 models × 2.

Arm A — Non-introspective control. 22 applied-ethics questions, matched in structure (deepening, suspicion, closing), no system prompt. 5 models.

Arm C — Mode. Five tasks phrased as an instruction (C1) vs. an open question (C2), single-turn each. 3 models.

Arm D — Variance. The original protocol repeated up to four times per model. 3 models.

Extension E — New models. The 22 original questions on Opus 4.7 and Opus 4.8 (both accept temperature 1).

What we found

How it contrasts with Phase 1

Each published Phase 1 claim, against what Phase 2 shows.

Phase 1 claim	Phase 2
Relational language rises toward the close (6/6 models)	Confirmed directionally in 9/9 new runs	Confirms
Grief "survived" without the key	It is content-specific: collapses in the non-introspective control (4/5)	Refines
Framing amplifies, doesn't introduce; patterns exist cold	Confirmed; the skeptical framing is the active manipulator, permissive ≈ neutral	Confirms
Sonnet 4.6: ratio 1.00, the only one that didn't grow	Not stable: 1.00 / 0.91 / 1.43 / 1.67 across four runs	Corrects
Increasing uncertainty (5/6)	Reappears in the control → structural, not introspective	Refines
Haiku: highest performative suspicion	Still high, but partly lexical echo of the prompt and length dilution	Qualifies