This is a quick 4-session course designed to introduce the field of Mechanistic Interpretability for Large Language Models (LLMs). The course includes theoretical materials, introduces Python libraries for interpretability, discusses recent papers in the field (published by organizations such as Anthropic and Google DeepMind), and provides practical exercises.

The field of Mechanistic Interpretability has rapidly gained popularity in recent years, with more than 90 papers accepted at ICML 2024. Its main goal is to understand the logic behind the decisions of machine learning models. This knowledge can be applied to improve transparency and trust in existing models, as well as to better understand how these models learn.

Shortly after the completion of this course, BAISH will organize a Mechanistic Interpretability Hackathon at FGV — details to be announced! We strongly recommend that anyone interested in participating in the Hackathon complete this course as an introduction to the topic; doing so will also improve their chances of winning an award in the competition.

Course Schedule

Session 1 - Transformers and Interpretability

Date: August 23, Friday, starting at 2:30 PM

Location: Auditorium 537

Schedule:

  • Preparation: Handbook Chapters 2 and 3
  • 45 min: Introduction to transformers and attention
  • 20 min: Coffee break
  • 45 min: Introduction to Mechanistic Interpretability
  • 30 min: Coding: PyTorch and TransformerLens
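To give a flavor of the Session 1 coding block, here is a minimal sketch of scaled dot-product attention — the core operation covered in the transformers introduction — written in plain numpy. All names, shapes, and values are illustrative; the session itself works with PyTorch and TransformerLens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)  # (seq, seq) attention scores
    pattern = softmax(scores, axis=-1)  # each row is a probability distribution
    return pattern @ V

rng = np.random.default_rng(0)
seq, d_head = 5, 8
Q, K, V = (rng.standard_normal((seq, d_head)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 8)
```

The attention pattern (each row of `softmax(scores)`) is exactly the object that much of mechanistic interpretability inspects head by head.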

Materials:

Session 2 - Circuits

Date: August 30, Friday, starting at 2:30 PM

Location: Auditorium 418

Schedule:

  • Preparation: Handbook Chapters 3 and 4
  • 40 min: Circuits and the induction circuit
  • 20 min: Exploration of the paper "A Mathematical Framework for Transformer Circuits"
  • 20 min: Coffee break
  • 60 min: Coding: circuit discovery
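As a hedged preview of the circuit-discovery exercise, the sketch below scores an attention pattern for "induction" behavior: on repeated text "A B ... A", an induction head attends from the second "A" to the token that followed the first "A". The helper names and the scoring rule are illustrative, not the session's actual code.

```python
import numpy as np

def induction_targets(tokens):
    """For each position, the position an ideal induction head attends to:
    the token right after the previous occurrence of the current token."""
    last_seen = {}
    targets = [-1] * len(tokens)
    for i, t in enumerate(tokens):
        if t in last_seen:
            targets[i] = last_seen[t] + 1
        last_seen[t] = i
    return targets

def induction_score(pattern, tokens):
    """Average attention mass a head places on its induction targets."""
    targets = induction_targets(tokens)
    hits = [pattern[i, j] for i, j in enumerate(targets) if 0 <= j <= i]
    return float(np.mean(hits)) if hits else 0.0

tokens = [5, 2, 7, 5, 2, 7]  # the sequence repeats, so induction is possible
# A perfect induction head puts all attention on the induction target:
pattern = np.zeros((len(tokens), len(tokens)))
for i, j in enumerate(induction_targets(tokens)):
    if 0 <= j <= i:
        pattern[i, j] = 1.0
print(induction_score(pattern, tokens))  # 1.0
```

Scores like this, computed over a real model's attention patterns, are one simple way to locate candidate induction heads before digging into the circuit.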

Materials:

Session 3 - Superposition

Date: September 6, Friday, starting at 2:30 PM

Location: Auditorium 537

Schedule:

  • Preparation: Handbook Chapter 5
  • 30 min: Superposition
  • 30 min: 3Blue1Brown video "How might LLMs store facts" and discussion
  • 20 min: Coffee break
  • 60 min: Coding: superposition in toy models
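The toy-model setup from the superposition literature can be sketched in a few lines: a model with more features than hidden dimensions, reconstructing inputs as ReLU(WᵀWx + b). The weights below are random and untrained — this only illustrates the architecture's shapes, not learned superposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_hidden = 5, 2  # more features than dimensions -> superposition

# Toy model architecture: x' = ReLU(W^T W x + b)
W = rng.standard_normal((d_hidden, n_features)) * 0.5
b = np.zeros(n_features)

def toy_model(x):
    h = W @ x                            # compress 5 features into 2 dims
    return np.maximum(W.T @ h + b, 0.0)  # reconstruct with ReLU

x = np.zeros(n_features)
x[3] = 1.0                   # a single sparse active feature
print(toy_model(x).round(2))

# Off-diagonal entries of W^T W measure interference between features
# forced to share the same hidden dimensions:
interference = W.T @ W
```

After training on sparse inputs, the columns of `W` arrange themselves into near-orthogonal directions, which is the phenomenon the session's exercises explore.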

Materials:

Session 4 - Sparse Autoencoders (SAE)

Date: September 13, Friday, starting at 3:00 PM

Location: Auditorium 418

Schedule:

  • Preparation: Handbook Chapter 6
  • 30 min: Sparse Autoencoders (SAE) and practical exploration
  • 30 min: Exploration of the papers "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" and "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"
  • 20 min: Coffee break
  • 40 min: Coding: Using SAEs
  • 30 min: New areas of exploration and tips for the Hackathon!
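For orientation before the session, here is a minimal sketch of a sparse autoencoder's forward pass: an overcomplete ReLU encoder producing sparse feature activations, plus a linear decoder reconstructing the model activation. Weights are random and untrained; dimensions and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # overcomplete dictionary: d_sae >> d_model

# Untrained weights, shown only to illustrate the shapes involved.
W_enc = rng.standard_normal((d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_model, d_sae)) * 0.1
b_dec = np.zeros(d_model)

def sae(x):
    """Encode an activation vector into sparse features, then reconstruct."""
    f = np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)  # ReLU -> sparse features
    x_hat = W_dec @ f + b_dec                          # linear reconstruction
    return f, x_hat

x = rng.standard_normal(d_model)
f, x_hat = sae(x)
print(f.shape, x_hat.shape)  # (64,) (16,)
```

In training, the loss combines reconstruction error with an L1 penalty on `f` to encourage sparsity — the setup behind the "Towards Monosemanticity" dictionary-learning papers discussed in this session.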

Materials:

Additional Resources

Opportunities

🎓 ARENA

Opportunity to spend four weeks in London studying practical content relevant to AI Safety research.

🔬 MATS

Program for starting research in AI Safety under the guidance of mentors with extensive experience in the field. Several authors of papers used in this course are MATS mentors.
