Comments
Echoing others: turning Less Wrong into Manifold would be a mistake. Manifold already exists. However, maybe you should suggest to them that they add a forum independent of any particular market.
I've said this elsewhere, but I think we need to also be working on training wise AI advisers in order to help us navigate these situations.
Do you think there are any other updates you should make as well?
Well, does this improve automated ML research and kick off an intelligence explosion sooner?
"Funders of independent researchers we’ve interviewed think that there are plenty of talented applicants, but would prefer more research proposals focused on relatively few existing promising research directions" - Would be curious to hear why this is. Is it that if there is too great a profusion of research directions that there won't be enough effort behind each individual one?
I'd love to hear some more specific advice about how to communicate in these kinds of circumstances when it's much easier for folk not to listen.
Just going to put it out there: it's not actually clear that we should want to advance AI for maths.
I maintain my position that you're missing the stakes if you think that's important. Even limiting ourselves strictly to concentration of power worries, risks of totalitarianism dominate these concerns.
My take: lots of good analysis, but a few crucial mistakes and weaknesses throw the conclusions into significant doubt:
The USG will be able and willing to either provide or mandate strong infosecurity for multiple projects.
I simply don't buy that the infosec for multiple such projects will be anywhere near the infosec of a single project because the overall security ends up being that of the weakest link.
Additionally, the more projects there are with a particular capability, the more folk there are who can leak information either by talking or by being spies.
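To put rough numbers on the weakest-link point, here's a toy calculation (the 10% per-project leak probability is purely an assumption for illustration): the chance that at least one project leaks grows quickly with the number of projects.

```python
# Toy calculation: probability that at least one of n projects suffers a
# serious leak, assuming independent projects with identical leak odds.
def p_any_leak(per_project_leak_prob: float, num_projects: int) -> float:
    return 1 - (1 - per_project_leak_prob) ** num_projects

for n in [1, 3, 5, 10]:
    print(n, round(p_any_leak(0.10, n), 3))
# 1 0.1, 3 0.271, 5 0.41, 10 0.651
```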
The probability-weighted impacts of AI takeover or the proliferation of world-ending technologies might be high enough to dominate the probability-weighted impacts of power concentration.
Comment: We currently doubt this, but we haven’t modelled it out, and we have lower p(doom) from misalignment than many (<10%).
Seems entirely plausible to me that either one could dominate. Would love to see more analysis around this.
Reducing access to these services will significantly disempower the rest of the world: we’re not talking about whether people will have access to the best chatbots or not, but whether they’ll have access to extremely powerful future capabilities which enable them to shape and improve their lives on a scale that humans haven’t previously been able to.
If you're worried about this, I don't think you quite realise the stakes. Capabilities mostly proliferate anyway. People can wait a few more years.
My take: Bits of this review come off as a bit too status-oriented to me. This is ironic, because the best part of the review is towards the end when it talks about the risk of rationality becoming a Fandom.
Sharing this resource doc on AI Safety & Entrepreneurship that I created in case anyone finds this helpful:
https://docs.google.com/document/d/1m_5UUGf7do-H1yyl1uhcQ-O3EkWTwsHIxIQ1ooaxvEE/edit?usp=sharing
If it works, maybe it isn't slop?
I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory in order to direct it; theory needs empirics in order to remain grounded.
Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:
Looking at behaviour is conceptually straightforward, and valuable, and being done
I agree with Apollo Research that evals isn't really a science yet; it mostly seems to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing models under different schemes and comparing the results.
Similarly, a lot of the work on Model Organisms of Misalignment requires careful thought to get right.
Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [king] vector, subtract the [male] vector, add the [female] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days?
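For anyone who wants to replay the old demo, here's a minimal sketch using gensim's pretrained GloVe vectors (the specific model name is just one of the options gensim's downloader offers, and I'm using the classic man/woman formulation):

```python
# Minimal sketch of the classic word-vector analogy.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # any pretrained word-vector model should behave similarly

# queen ≈ king - man + woman: positive terms are added, negative terms subtracted.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```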
Activation vectors are a thing. So it's totally happening.
"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.
I guess I was thinking about this in terms of getting maximal value out of wise AI advisers. The notion that comparisons might be unfair didn't even enter my mind, even though that isn't too many reasoning steps away from where I was.
That's a fascinating perspective.
Fascinating. Sounds related to the Yoga concept of kriyas.
I would suggest adopting a different method of interpretation, one more grounded in what was actually said. Anyway, I think it's probably best that we leave this thread here.
Sadly, cause-neutral was an even more confusing term, so this is better by comparison. I also think that the two notions of principles-first are less disconnected than you think, but through somewhat indirect effects.
We're mostly working on stuff to stay afloat rather than high level navigation.
Why do you think that this is the case?
I recommend rereading his post. I believe his use of the term makes sense.
I don't think I agree with this post, but I thought it provided a fascinating alternative perspective.
Just wanted to mention that, for anyone who liked my submissions (3rd prize: An Overview of “Obvious” Approaches to Training Wise AI Advisors and Some Preliminary Notes on the Promise of a Wisdom Explosion),
I'll be running a project related to this work as part of AI Safety Camp. Join me if you want to help innovate a new paradigm in AI safety.
That’s useful analysis. Focusing so heavily on evals seems like a mistake given how focused the AI Safety Institutes already are on evals.
I guess Leopold was right[1]. AI arms race it is.
[1] I suppose it is possible that it was a self-fulfilling prophecy, but I'm skeptical given how fast it's happened.
The problem is that accepting this argument involves ignoring how AI keeps on blitzing past supposed barrier after barrier. At some point, a rational observer needs to be willing to accept that their max likelihood model is wrong and consider other possible ways the world could be instead.
I thought that this post on strategy and this talk were well done. Obviously, I'll have to see how this translates into practice.
One thing I would love to know is how it'll work on Claude 3.5 Sonnet or GPT-4o, given that these models aren't open-weight. Is it that you have access to some reduced level of capabilities for these?
That was an interesting conversation.
I do have some worries about the EA community.
At the same time, I'm excited to see that Zach Robinson has taken the reins at CEA, and I'm looking forward to seeing how things develop under his leadership. The early signs have been promising.
There is a world that needs to be saved. Saving the world is a team sport. All we can do is to contribute our part of the puzzle, whatever that may be and no matter how small, and trust in our companions to handle the rest. There is honor in that, no matter how things turn out in the end.
I'd strongly bet that when you break this down in more concrete detail, a flaw in your plan will emerge.
The balance of industries serving humans vs. AIs is a suspiciously high level of abstraction.
It’s an interesting thought.
I can see regularisation playing something of a role here, but it’s hard to say.
I would love to see a project here with philosophers and technical folk collaborating to make progress on this question.
I honestly feel that the only appropriate response is something along the lines of "fuck defeatism"[1].
This comment isn't targeted at you, but at a particular attractor in thought space.
Let me try to explain why I think rejecting this attractor is the right response rather than engaging with it.
I think it's mostly that I don't think that talking about things at this level of abstraction is useful. It feels much more productive to talk about specific plans. And if you have a general, high-abstraction argument that plans in general are useless, but I have a specific argument why a specific plan is useful, I know which one I'd go with :-).
Don't get me wrong, I think that if someone struggles for a certain amount of time to try to make a difference and just hits wall after wall, then at some point they have to call it. But "never start" and "don't even try" are completely different.
It's also worth noting that saving the world is a team sport. It's okay to pursue a plan that depends on a bunch of other folk stepping up and playing their part.
[1] I would also suggest that this is the best way to respond to depression, rather than "trying to argue your way out of it".
Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!
AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other
This confuses me.
I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. And I guess this isn't two full motivational circuits, but it kind of is two motivational circuits.
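Here's roughly what I have in mind, as a toy sketch (my framing, not AIXI itself): the utility decomposes into two components, but a single argmax optimises their sum, so there's one optimisation process even though there are arguably two "motives".

```python
# Toy illustration: compositional utility, unitary optimisation.
def u_curiosity(action: str) -> float:
    return {"explore": 1.0, "exploit": 0.2}[action]

def u_reward(action: str) -> float:
    return {"explore": 0.3, "exploit": 1.0}[action]

def choose(actions):
    # A single argmax over the summed utility: one optimiser, two "motives".
    return max(actions, key=lambda a: u_curiosity(a) + u_reward(a))

print(choose(["explore", "exploit"]))  # "explore" (1.0 + 0.3 > 0.2 + 1.0)
```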
Not really.
I think you're underestimating meditation.
Since I've started meditating, I've realised that I've become much more sensitive to vibes.
There are a lot of folk who would be scarily capable if they were strong in system 1, in addition to being strong in system 2.
Then there are all the other benefits that meditation can provide if done properly: additional motivation, and being better able to break out of narratives/notice patterns.
Then again, this is dependent on there being viable social interventions, rather than just aiming for 6 or 7 standard deviations of increase in intelligence.
A Bayesian cultivates lightness, but a warrior monk has weight. Can these two opposing and perhaps contradictory natures be united to create some kind of unstoppable Kwisatz Haderach?
There are different ways of being that are appropriate to different times and/or circumstances. There are times for doubt and times for action.
I would suggest having 50% of researchers work on a broader definition of control: including "control", technical governance work and technical outreach (scary demos, model organisms of misalignment).
I’m confused by your use of Shapley values. Shapley values assume that the “coalition” can form in any order, but that doesn’t seem like a good fit for language models where order is important.
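To make the assumption concrete, here's a brute-force Shapley computation for a toy three-player game (the value function is made up for illustration): each player's Shapley value is its marginal contribution averaged over every possible ordering, which is exactly the step that seems odd when the "players" are tokens or prompt components with a fixed order.

```python
# Brute-force Shapley values for a toy 3-player game.
from itertools import permutations

players = ["A", "B", "C"]

def value(coalition: frozenset) -> float:
    # Toy value function: A is useful alone; B and C only pay off together.
    v = 1.0 if "A" in coalition else 0.0
    if {"B", "C"} <= coalition:
        v += 2.0
    return v

shapley = {p: 0.0 for p in players}
orderings = list(permutations(players))
for order in orderings:
    built = frozenset()
    for p in order:
        # Marginal contribution of p given the coalition built so far.
        shapley[p] += value(built | {p}) - value(built)
        built = built | {p}
shapley = {p: total / len(orderings) for p, total in shapley.items()}

print(shapley)  # {'A': 1.0, 'B': 1.0, 'C': 1.0}
```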
I don't think these articles should make up a high proportion of the content on Less Wrong, but I think it's good if things like this are occasionally discussed.
Great article.
One point of disagreement: I suspect that the difficulty of the required high-impact tasks likely relates more to what someone thinks about the offense-defense balance than the alignment difficulty per se.
Just to add to this:
Beliefs can be self-reinforcing in predictive processing theory because the higher level beliefs can shape the lower level observations. So the hypersensitisation that Delton has noted can reinforce itself.
Steven Byrnes provides an explanation here, but I think he's neglecting the potential for belief systems/systems of interpretation to be self-reinforcing.
Predictive processing claims that our expectations influence what we observe, so experiencing pain in a scenario can result in the opposite of a placebo effect, where the pain sensitizes us. Some degree of sensitization is evolutionarily advantageous - if you've hurt a part of your body, then being more sensitive makes you more likely to detect if you're putting too much strain on it. However, it can also make you experience pain as the result of minor sensations that aren't actually indicative of anything wrong. In the worst case, this pain ends up being self-reinforcing (see the toy sketch below).
https://www.lesswrong.com/posts/BgBJqPv5ogsX4fLka/the-mind-body-vicious-cycle-model-of-rsi-and-back-pain
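Here's a toy numerical sketch of that vicious cycle (not a real predictive-processing model, just an illustration): if perceiving pain raises the gain on future pain signals, even a constant, harmless level of raw sensation can ratchet itself upward.

```python
# Toy vicious-cycle dynamics: perceived pain raises sensitivity, which
# raises perceived pain, even though the raw signal never changes.
def simulate(raw_signal=0.1, gain=1.0, sensitisation_rate=0.5, steps=8):
    for step in range(steps):
        perceived = gain * raw_signal
        gain += sensitisation_rate * perceived  # perceiving pain increases sensitivity
        print(step, round(perceived, 3))

simulate()  # perceived pain climbs each step despite a constant raw signal
```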
Interesting work.
This post has made me realise that constitutional design is surprisingly neglected in the AI safety community.
Designing the right constitution won't save the world by itself, but it's a potentially easy win that could put us in a better strategic situation down the line.
I guess I'm worried that allowing insurance for disasters above a certain size could go pretty badly if it increases the chance of labs being reckless.
Thank you for your service!
For what it's worth, I feel that the bar for being a valuable member of the AI Safety Community is much more attainable than the bar of working in AI Safety full-time.