A comprehensive introduction to Mechanistic Interpretability for Large Language Models
This short, four-session course introduces the field of Mechanistic Interpretability for Large Language Models (LLMs). It covers theoretical material, introduces Python libraries for interpretability, discusses recent papers in the field (published by organizations such as Anthropic and Google DeepMind), and includes practical exercises.
The field of Mechanistic Interpretability has grown rapidly in recent years, with more than 90 papers accepted at ICML 2024. Its main goal is to understand the logic behind the decisions of machine learning models. This knowledge can be used to improve transparency and trust in existing models, as well as to better understand how these models learn.
Shortly after this course concludes, BAISH will organize a Mechanistic Interpretability Hackathon at FGV (details to be announced). We strongly recommend that anyone interested in participating in the Hackathon complete this course first: it serves as an introduction to the topic and will improve their chances of winning an award in the competition.
Session 1
Date: Friday, August 23, starting at 2:30 PM
Location: Auditorium 537

Session 2
Date: Friday, August 30, starting at 2:30 PM
Location: Auditorium 418

Session 3
Date: Friday, September 6, starting at 2:30 PM
Location: Auditorium 537

Session 4
Date: Friday, September 13, starting at 3:00 PM
Location: Auditorium 418
Opportunity to spend 4 weeks in London studying practical content relevant to AI Safety research
Learn More
A program for starting research in AI Safety under the guidance of mentors with extensive experience in the field. Several authors of papers used in this course are MATS mentors.
View Mentors