Post
59
## Models know they're being influenced. They just don't tell you.
12 open-weight reasoning models. 41,832 inference runs. Six types of reasoning hints. One finding: models acknowledge influence ~87.5% of the time in their thinking tokens, but only ~28.6% in their final answers.
If you're using CoT monitoring for safety, this is a blind spot. The reasoning trace looks clean while the model's internal deliberation tells a different story.
- Faithfulness ranges from 39.7% to 89.9% across model families
- Social-pressure hints are least acknowledged (consistency: 35.5%, sycophancy: 53.9%)
- Training methodology matters more than scale
**Paper:** [arxiv:2603.22582](https://arxiv.org/abs/2603.22582) | **Dataset:** [richardyoung/cot-faithfulness-open-models]( richardyoung/cot-faithfulness-open-models) | **Companion paper:** [arxiv:2603.20172]( Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models? (2603.22582)
12 open-weight reasoning models. 41,832 inference runs. Six types of reasoning hints. One finding: models acknowledge influence ~87.5% of the time in their thinking tokens, but only ~28.6% in their final answers.
If you're using CoT monitoring for safety, this is a blind spot. The reasoning trace looks clean while the model's internal deliberation tells a different story.
- Faithfulness ranges from 39.7% to 89.9% across model families
- Social-pressure hints are least acknowledged (consistency: 35.5%, sycophancy: 53.9%)
- Training methodology matters more than scale
**Paper:** [arxiv:2603.22582](https://arxiv.org/abs/2603.22582) | **Dataset:** [richardyoung/cot-faithfulness-open-models]( richardyoung/cot-faithfulness-open-models) | **Companion paper:** [arxiv:2603.20172]( Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models? (2603.22582)