AIAP: An Overview of Technical AI Alignment with Rohin Shah (Part 2)

Show Notes

The space of AI alignment research is highly dynamic, and it's often difficult to get a bird's-eye view of the landscape. This podcast is the second of two parts that attempt to partially remedy this by providing an overview of technical AI alignment efforts. In particular, this episode continues the discussion from Part 1 by going into more depth on specific approaches to AI alignment. In this podcast, Lucas spoke with Rohin Shah. Rohin is a fifth-year PhD student at UC Berkeley with the Center for Human-Compatible AI, working with Anca Dragan, Pieter Abbeel, and Stuart Russell. Every week, he collects and summarizes recent progress relevant to AI alignment in the Alignment Newsletter.

Topics discussed in this episode include:

Embedded agency
The field of "getting AI systems to do what we want"
Ambitious value learning
Corrigibility, including iterated amplification, debate, and factored cognition
AI boxing and impact measures
Robustness through verification, adversarial ML, and adversarial examples
Interpretability research
Comprehensive AI Services
Rohin's relative optimism about the state of AI alignment

You can take a short (3-minute) survey to share your feedback about the podcast here.

We hope that you will continue to join in the conversations by following us or subscribing to our podcasts on YouTube, SoundCloud, iTunes, Google Play, Stitcher, or your preferred podcast site or application. You can find all the AI Alignment Podcasts here.

Recommended/mentioned reading

Value Learning sequence

Embedded Agency sequence

Iterated Amplification sequence

AI Alignment Newsletter database

Reframing Superintelligence: CAIS as General Intelligence

Guidelines for AI Containment

Penalizing side effects using stepwise relative reachability

Towards a New Impact Measure

Techniques for optimizing worst-case performance

Cooperative Inverse Reinforcement Learning

Deep reinforcement learning from human preferences