Discussion about this post

Nathan Lambert

I see a short-term future of alignment being about having models that can express a large set of different behaviors. Helpful, honest, and harmless is the default, but the spectrum is more important. It could help with policy by making it obvious that AI is not just one unipolar worldview that all politicians have to grapple with.

Talking to many in industry recently, it sounds like character and steerability actually get much easier as the models get stronger. This intermediate stage, where the default personality was so strong, is an artifact of training being hard.

Though, I would like to know more about how these changes in character relate to changes in safety on extreme risks, and not just “word policing” or norm chasing.

Completely agree and thanks for highlighting this again.

Kenny Easwaran

I’ve been one of those people who worried for a while that all the talk of “alignment” had a false presupposition that there was a thing to align to. So I definitely like this “steerability” question better.

As a philosopher, I’ve also long thought that “AI ethics” would be a better term for what people were after with “alignment,” but I think that would only be true if one presupposes (as I do) a sort of consequentialist account of ethics, where ethics is about ensuring that agents behave in ways that generally tend to make the world better rather than worse. Instead, the term “AI ethics” has come to refer to a cluster of topics that are surely relevant to this point but not, in my opinion, getting at the big picture.

6 more comments...
