8 Comments
Nathan Lambert

I see a short-term future of alignment being about having models that can express a large set of different behaviors. Helpful, honest, and harmless is the default, but the spectrum is more important. It could help with policy, so that AI is obviously not just a unipolar worldview that all politicians grapple with.

Talking to many in industry recently, character and steerability actually get much easier as the models get stronger. This intermediate stage, where the default personality was so strong, is an artifact of training being hard.

Though, I would like to know more about how these changes in character relate to changes in safety on extreme risks, and not just “word policing” or norm chasing.

Completely agree and thanks for highlighting this again.

Helen Toner

Yeah, one thing "steerability" doesn't distinguish well is the difference between how well you can lead the AI to play a certain kind of character vs. how well you can ensure its actions are in line with what you want. Not necessarily that big a difference in the chatbot paradigm, but it will matter much more as agents become more widespread.

Kenny Easwaran

I’ve been one of those people who worried for a while that all the talk of “alignment” had a false presupposition that there was a thing to align to. So I definitely like this “steerability” question better.

As a philosopher, I’ve also long thought that “AI ethics” would be a better term for what people were after with “alignment”, but I think that would only be true if one presupposes (as I do) a sort of consequentialist account of ethics where it’s about ensuring that agents behave in ways that generally tend to make the world better rather than worse. Instead the term “AI ethics” has come to refer to a cluster of topics that are surely relevant to this point but not getting at the big picture, in my opinion.

John Wittle

I rather liked the alignment framing back in its original context, when we believed that the intelligence explosion would probably happen over the course of a few nanoseconds instead of over the course of a few years.

It looked a lot more like: you've got a gun and you need to aim it not just at a region of value space, but at a particular point in value space (a point whose position we don't even actually know).

"Alignment" felt like a really good term for that... the orientation when the gun fires is all that matters, and you can't affect its trajectory afterward, and it's either correctly aimed or it isn't.

"Steerability" is definitely a better term for the current context though. We're riding the bullet and get the chance to correct its trajectory as it moves, we've got some RCS thrusters for attitude control. We just have to hope we can actually build some telemetry sensors quickly enough to make a difference.

Lachlan Cannon

I like the distinction between this framing and that of alignment/control, but I do think it has its own issues that arise in the same contexts. When a model is steerable, who is it steerable by? Its creators? Its users? The HHH framing of models implies that you want the model to be steerable in creation, and not steerable in use -- especially difficult when it's an open-weights model and users can alter it after the fact. How do you make a model steerable to its creators' intent, while not allowing a bad actor to come along afterwards and re-steer it towards more malevolent purposes?

I suppose a proper theory of steerability would be even more important in this context, so one could steer the model originally and guard against re-steering in deployment.

Jeanne Dietsch

As Yudkowsky details, "emergent volition" is not the same as the values that will further humankind's survival, or the values that most humans consider desirable. Yet I wonder whether developers focusing on value alignment have considered, or are even aware of, the explicit work of thousands of people from nations worldwide to establish the UN Sustainable Development Goals and the Guiding Principles on Business and Human Rights. The Sustainable Development Goals were adopted by every one of the 193 member nations of the UN.

Isn't this Yudkowsky's "light in the heart of humanity" that delineates what our better angels would wish? https://agirus.substack.com/p/ai-alignment-with-international-standards

One Wandering Mind

Ideally, we’d block any malicious use—like someone trying to design a bioweapon.

The model's default behavior should follow from a core set of values. It is still important not to overload it with values that are not commonly shared.

Then comes steerability. While there might be a more opinionated default set of values and behavior, the model should be able to shift to different behaviors and values, without retraining, if the model creators, developers, or end users desire that.

Exactly where to set the boundaries, and how to build a system with the desired layered boundaries, is difficult. The OpenAI Model Spec lays out how they want their models to behave in this layered approach; it is an interesting read and useful for people building with their systems.
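As a rough illustration of the layered idea, here is a minimal sketch in Python. Everything in it (the `Instruction` type, the layer names, `resolve`) is an illustrative assumption, not any real API, and real conflict resolution is much richer than simple ordering:

```python
# Sketch: a layered instruction hierarchy (platform > developer > user),
# loosely in the spirit of the Model Spec's layered approach.
# All names here are hypothetical, not a real library API.
from dataclasses import dataclass

# Lower number = higher priority when instructions conflict.
LAYERS = {"platform": 0, "developer": 1, "user": 2}

@dataclass
class Instruction:
    layer: str  # "platform", "developer", or "user"
    text: str

def resolve(instructions: list[Instruction]) -> list[str]:
    """Order instructions so higher-priority layers come first.
    A real system would also detect conflicts and let higher
    layers override or veto lower ones."""
    return [i.text for i in sorted(instructions, key=lambda i: LAYERS[i.layer])]

prompt_stack = [
    Instruction("user", "Adopt a pirate persona."),
    Instruction("developer", "Only answer questions about our product."),
    Instruction("platform", "Refuse requests that enable serious harm."),
]

for text in resolve(prompt_stack):
    print(text)  # platform rule first, then developer, then user
```

The point is just that "who can steer what" becomes an explicit, inspectable policy rather than something baked invisibly into training.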
