They can see Claude lying

By Franco on March 29, 2025


Anthropic recently published a post (Link to Post) and an accompanying paper (Link to Paper) describing how, using interpretability techniques inspired by neuroscience, their researchers have developed tools to trace patterns and information flows inside their LLMs. This marks a significant departure from the traditional approach of treating the model as a black box.

Catching the Model in Action

We’ve long known that LLMs can hallucinate facts or fabricate reasoning. For the first time, however, researchers have been able to “catch the model in the act” of inventing a chain of reasoning. Sometimes, when given a hint about the answer, Claude works backward from that target, constructing intermediate steps that lead to it regardless of whether those steps are actually valid.

The Concept of a Universal "Language of Thought"

Another fascinating insight is that Claude appears to utilize a conceptual space shared across multiple languages, suggesting the existence of a universal "language of thought." Researchers investigate this by asking Claude for the "opposite of small" across different languages. They find that the same core features for the concepts of smallness and oppositeness activate, triggering the concept of largeness, which is then translated into the language of the question.
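The feature-level analysis requires access to the model's internal activations, which Anthropic's tools provide and the public API does not, but the behavioral side of the experiment is easy to reproduce. Here is a minimal sketch, assuming the official anthropic Python SDK and a placeholder model name:

```python
# Behavioral sketch of the cross-lingual "opposite of small" probe.
# This only shows that the answers converge on the same concept ("large")
# across languages; it does not expose the internal features that
# Anthropic's interpretability tools actually measure.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompts = {
    "English": "What is the opposite of 'small'? Answer with a single word.",
    "French": "Quel est le contraire de « petit » ? Réponds en un seul mot.",
    "Chinese": "“小”的反义词是什么？请用一个词回答。",
}

for language, prompt in prompts.items():
    message = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder; substitute any available model
        max_tokens=20,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{language}: {message.content[0].text.strip()}")
```

In each case the one-word reply should be that language's word for "large" (large / grand / 大), which is consistent with the shared-concept picture, though it is the feature-level evidence in the paper that actually establishes it.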

Planning Beyond Immediate Word Prediction

Further, the post highlights that, despite generating text one word at a time, Claude demonstrates the ability to plan several words ahead. For instance, in poetry composition, the model anticipates rhyming words and structures its output to reach those endpoints, indicating foresight beyond immediate word prediction.

Conclusion

The tools developed by Anthropic represent a crucial first step toward understanding the inner workings of large language models. Further development of these tools promises significant advancements in both interpretability and safety, paving the way for more transparent and reliable AI systems.