I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory to direct it; theory needs empirics to remain grounded.
Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:
Looking at behaviour is conceptually straightforward, and valuable, and being done
I agree with Apollo Research that evals aren't really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but things like building up experience, or auditing models using different schemes and comparing the results, could also help make evals more scientific.
Similarly, a lot of work with Model Organisms of Misalignment requires careful thought to get right.
Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [king] vector, subtract the [male] vector, add the [female] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days?
Activation vectors are a thing. So it's totally happening.
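For anyone who wants to replay the classic word-vector result the quote references, here's a minimal sketch, assuming gensim and its pretrained `glove-wiki-gigaword-100` vectors are available (any pretrained embedding set would do):

```python
# Minimal sketch of the classic word-vector arithmetic, assuming gensim
# and the pretrained "glove-wiki-gigaword-100" vectors are available.
import gensim.downloader as api

# Load pretrained GloVe vectors (downloads on first run).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ queen: positive terms are added, negative terms subtracted.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # "queen" is typically the top match
```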
"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.
I guess I was thinking about this in terms of getting maximal value out of wise AI advisers. The notion that comparisons might be unfair didn't even enter my mind, even though that isn't too many reasoning steps away from where I was.
That's a fascinating perspective.
Fascinating. Sounds related to the yoga concept of kriyas.
I would suggest adopting a different method of interpretation, one more grounded in what was actually said. Anyway, I think it's probably best that we leave this thread here.
Sadly, "cause-neutral" was an even more confusing term, so this is better by comparison. I also think that the two notions of principles-first are less disconnected than you think, though through somewhat indirect effects.
We're mostly working on stuff to stay afloat rather than high-level navigation.
Why do you think that this is the case?
I recommend rereading his post. I believe his use of the term makes sense.
I don't think I agree with this post, but I thought it provided a fascinating alternative perspective.
Just wanted to mention that if anyone liked my submissions (3rd prize: An Overview of “Obvious” Approaches to Training Wise AI Advisors, and Some Preliminary Notes on the Promise of a Wisdom Explosion),
I'll be running a related project as part of AI Safety Camp. Join me if you want to help pioneer a new paradigm in AI safety.
That’s a useful analysis. Focusing so heavily on evals seems like a mistake given that the AI Safety Institutes are already focused on evals.
I guess Leopold was right[1]. AI arms race it is.
- ^ I suppose it is possible that it was a self-fulfilling prophecy, but I'm skeptical given how fast it's happened.
The problem is that accepting this argument involves ignoring how AI keeps on blitzing past supposed barrier after barrier. At some point, a rational observer needs to be willing to accept that their max likelihood model is wrong and consider other possible ways the world could be instead.
I thought that this post on strategy and this talk were well done. Obviously, I'll have to see how this translates into practice.
One thing I would love to know is how it'll work on Claude 3.5 Sonnet or GPT-4o, given that these models aren't open-weight. Is it that you only have access to some reduced level of capabilities for these?
That was an interesting conversation.
I do have some worries about the EA community.
At the same time, I'm excited to see that Zach Robinson has taken the reins at CEA, and I'm looking forward to seeing how things develop under his leadership. The early signs have been promising.
There is a world that needs to be saved. Saving the world is a team sport. All we can do is to contribute our part of the puzzle, whatever that may be and no matter how small, and trust in our companions to handle the rest. There is honor in that, no matter how things turn out in the end.
I'd strongly bet that when you break this down in more concrete detail, a flaw in your plan will emerge.
The balance of industries serving humans vs. AIs is a suspiciously high level of abstraction.
It’s an interesting thought.
I can see regularisation playing something of a role here, but it’s hard to say.
I would love to see a project here with philosophers and technical folk collaborating to make progress on this question.
I honestly feel that the only appropriate response is something along the lines of "fuck defeatism"[1].
This comment isn't targeted at you, but at a particular attractor in thought space.
Let me try to explain why I think rejecting this attractor is the right response rather than engaging with it.
I think it's mostly that I don't think that talking about things at this level of abstraction is useful. It feels much more productive to talk about specific plans. And if you have a general, high-abstraction argument that plans in general are useless, but I have a specific argument why a specific plan is useful, I know which one I'd go with :-).
Don't get me wrong, I think that if someone struggles for a certain amount of time to try to make a difference and just hits wall after wall, then at some point they have to call it. But "never start" and "don't even try" are completely different.
It's also worth noting that saving the world is a team sport. It's okay to pursue a plan that depends on a bunch of other folk stepping up and playing their part.
- ^ I would also suggest that this is the best way to respond to depression, rather than "trying to argue your way out of it".
Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!
AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other
This confuses me.
I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. And I guess this isn't two full motivational circuits, but it kind of is two motivational circuits.
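A toy illustration of what I mean (purely hypothetical names; the point is just that the utility decomposes into parts while a single planner optimises their sum):

```python
# Toy sketch (hypothetical names): a compositional utility with unitary
# optimisation. The two component utilities play the role of the "two
# motivational circuits", but a single argmax does all the optimising.

def u_curiosity(outcome: str) -> float:
    """Hypothetical first motivational component."""
    return 1.0 if "explore" in outcome else 0.0

def u_safety(outcome: str) -> float:
    """Hypothetical second motivational component."""
    return 1.0 if "safely" in outcome else 0.0

def total_utility(outcome: str) -> float:
    # Compositional: the overall utility is just a sum of the components.
    return u_curiosity(outcome) + u_safety(outcome)

# Unitary optimisation: one planner maximises the combined utility.
outcomes = ["explore the cave", "stay home safely", "explore safely"]
best = max(outcomes, key=total_utility)
print(best)  # -> "explore safely"
```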
Not really.
I think you're underestimating meditation.
Since I've started meditating, I've realised that I've become much more sensitive to vibes.
There's a lot of folk who would be scarily capable if they were strong in system 1, in addition to being strong in system 2.
Then there's all the other benefits that meditation can provide if done properly: additional motivation, and being better able to break out of narratives/notice patterns.
Then again, this is dependent on there being viable social interventions, rather than just aiming for 6 or 7 standard deviations of increase in intelligence.
A Bayesian cultivates lightness, but a warrior monk has weight. Can these two opposing and perhaps contradictory natures be united to create some kind of unstoppable Kwisatz Haderach?
There are different ways of being that are appropriate to different times and/or circumstances. There are times for doubt and times for action.
I would suggest having 50% of researchers work on a broader definition of control: including "control" proper, technical governance work, and technical outreach (scary demos, model organisms of misalignment).
I’m confused by your use of Shapley values. Shapley values assume that the “coalition” can form in any order, but that doesn’t seem like a good fit for language models where order is important.
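For reference, the standard Shapley value for player $i$ averages their marginal contribution over every possible arrival order:

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr]$$

The weight $\frac{|S|!\,(n-|S|-1)!}{n!}$ is exactly the fraction of orderings in which $i$ arrives immediately after the members of $S$, which is where the "coalitions can form in any order" assumption gets baked in.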
I don't think these articles should make up a high proportion of the content on Less Wrong, but I think it's good if things like this are occasionally discussed.
Great article.
One point of disagreement: I suspect that the difficulty of the required high-impact tasks likely relates more to what someone thinks about the offense-defense balance than the alignment difficulty per se.
Just to add to this:
Beliefs can be self-reinforcing in predictive processing theory because higher-level beliefs can shape lower-level observations. So the hypersensitisation that Delton has noted can reinforce itself.
Steven Byrnes provides an explanation here, but I think he's neglecting the potential for belief systems/systems of interpretation to be self-reinforcing.
Predictive processing claims that our expectations influence what we observe, so experiencing pain in a scenario can result in the opposite of a placebo effect where the pain sensitizes us. Some degree of sensitization is evolutionarily advantageous - if you've hurt a part of your body, then being more sensitive makes you more likely to detect if you're putting too much strain on it. However, it can also make you experience pain as the result of minor sensations that aren't actually indicative of anything wrong. In the worst case, this pain ends up being self-reinforcing.
https://www.lesswrong.com/posts/BgBJqPv5ogsX4fLka/the-mind-body-vicious-cycle-model-of-rsi-and-back-pain
Interesting work.
This post has made me realise that constitutional design is surprisingly neglected in the AI safety community.
Designing the right constitution won't save the world by itself, but it's a potentially easy win that could put us in a better strategic situation down the line.
I guess I'm worried that allowing insurance for disasters above a certain size could go pretty badly if it increases the chance of labs being reckless.
Thank you for your service!
For what it's worth, I feel that the bar for being a valuable member of the AI Safety Community is much more attainable than the bar for working in AI Safety full-time.
If the strong AI has knowledge of the benchmarks (or can make correct guesses about how they were structured), then it might be able to find heuristics that work well on them but not more generally. Some of these heuristics might seem more plausible than not to humans.
Still seems like a useful technique if the more powerful model isn't much more powerful.
I really like the way that you've approached this pragmatically, "If you do X, which may be risky or dubious, at least do Y".
I suspect that there's a lot of alpha in taking a similar approach to other issues.
The second paper looks interesting.
(Having read through it, it's actually really, really good).
My take is that you can't define term X until you know why you're trying to define term X.
For example, if someone asks what "language" is, instead of trying to jump in with an answer, it's better to step back and ask why the person is asking the question.
For example, if someone asks "How many languages do you know?", they probably aren't asking about simple schemes like "one click = yes, two clicks = no". On the other hand, it may make sense to talk about such simple schemes in an introductory course on "human languages".
Asking "Well what really is language?" independent of any context is naive.
I'd like access.
TBH, if it works great, I won't provide any significant feedback apart from "all good".
But if it annoys me in any way, I'll let you know.
For what it's worth, I have provided quite a bit of feedback about the website in the past.
I want to see if it helps me with my draft document on proposed alignment solutions:
https://docs.google.com/document/d/1Mis0ZxuS-YIgwy4clC7hKrKEcm6Pn0yn709YUNVcpx8/edit#heading=h.u9eroo3v6v28
I think the benefits are adequately described in the post.
But I don't know if any of us have explicitly called for an AI pause, in part because doing so seems useless while still carrying opportunity costs.
The FLI Pause letter didn't achieve a pause, but it dramatically shifted the Overton Window.
Helps people avoid going down pointless rabbit holes.
I highly recommend this post. Seems like a more sensible approach to philosophy than conceptual analysis:
https://www.lesswrong.com/posts/9iA87EfNKnREgdTJN/a-revolution-in-philosophy-the-rise-of-conceptual
If you can land a job in government, it becomes much easier to land other jobs in government.
Pliny's approach?
I’ll admit to recently having written a post, asked an AI to improve my writing, made a few edits myself, posted it, and then come back later and thought “omg, how did I let some of these AI edits through?”. Hopefully the post in question is better now.