Comments

Comment by GregBarbier on Alignment Faking in Large Language Models · 2025-01-17T02:44:36.770Z · LW · GW

Thank you. There is a convoluted version of the world where the model tricked you the whole time: "Let's show these researchers results suggesting that even if they catch me faking during training, they should not be overly worried, because I will still end up largely aligned regardless."

Not saying it's a high probability, to be clear, but it stands as a theoretical possibility.

Comment by GregBarbier on Alignment Faking in Large Language Models · 2025-01-16T23:35:51.511Z · LW · GW

Seems to me (obviously not an expert) like a lot of faking for not a lot of result, given that the model is still largely aligned post-training (i.e., what looks like a roughly 3% "refuse to answer" blue band at the bottom of the final column, so 97% aligned). What am I missing?

Comment by GregBarbier on Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI · 2024-12-02T22:42:14.514Z · LW · GW

Trusting your last sentence would be equivalent to trusting that increasingly great cognitive capacities will not lead to the emergence of consciousness. Maybe they won't, but there is nothing obvious or, from my perspective, intuitive about that. If/when consciousness emerges, you are no longer in control (not to mention that you also have a serious ethical problem on your hands).