What's the simplest concrete unsolved problem in AI alignment?

post by agg (ag) · 2023-01-26T04:15:13.620Z · LW · GW · 4 comments

This is a question post.

In your preferred area of AI alignment, what is the simplest concrete unsolved problem?

By "simplest", I mean that ideally the problem is already solved whenever any one of its conditions is weakened. However, this isn't always possible, so a problem that merely has a simpler solved version could also work (e.g., Goldbach's weak conjecture is known to be true).

By "concrete", I mean something where given the statement of the problem and a proposed solution, a neutral third party would be able to consistently determine whether it's solved or not (e.g., not "explain [some theory] in a good way").


answer by Evan R. Murphy · 2023-01-27T20:32:24.421Z · LW(p) · GW(p)

I would check out the 200 Concrete Open Problems in Mechanistic Interpretability post series by Neel Nanda. Mechanistic interpretability has been considered a promising research direction by many in the alignment community for years. But it's only in the past couple months that we have an experienced researcher in this area laying out specific concrete problems and providing detailed guidance for newcomers.

Caveat: I haven't myself looked closely at this post series yet, as in recent months I have been more focused on investigating language model behaviour than on interpretability. So I don't have direct knowledge that these posts are as useful as they look.

comment by harfe · 2023-01-27T23:16:19.052Z · LW(p) · GW(p)

I have the impression that Neel Nanda means something different by the word "concrete" than agg does, given that agg does not consider problems of the type "explain something in a good way" to be concrete.

For example, I would think that "Hunt through Neuroscope for the toy models and look for interesting neurons to focus on." would not meet agg's bar for concreteness. But maybe other problems from Neel Nanda's list might.

Replies from: ag
comment by agg (ag) · 2023-01-28T00:21:07.265Z · LW(p) · GW(p)

Well, I don't consider "explain something in a good way" an example of a concrete problem (at least for the purposes of this question)—that was a counterexample. Some of the other problems listed definitely do seem interesting!

Replies from: harfe
comment by harfe · 2023-01-28T21:40:41.700Z · LW(p) · GW(p)

yes, sorry, I meant to say the opposite. I changed it now.

