Also, given the number of tags in each section, I suggest that "load more" should be "load all".
This is awesome! Three comments:
- Please make an easy-to-find Recent Changes feed (maybe a thing on the home page which only appears if you've made wiki edits). If you want an editor community, that will be their home, the thing they keep up with and use to positively reinforce each other.
- The concepts portal is now a slightly awkward mix of articles and tags, with potentially very high-use tags being quite buried because no one's written a good article for them (e.g. Rationality Quotes has 136 pages tagged, but zero karma, so it requires many clicks to reach). I'm especially thinking about the use case of wanting to know what types of articles there are to browse around. I'm not sure exactly what to do about this... maybe have the sorting not be just about karma, but a mix of karma and number of tagged posts? Like (k+10)*(t+10) or something (quick sketch after this comment)? The disadvantage is that this is opaque and moves the ordering even further from alphabetical.
- A bunch of the uncategorized ones could be categorized, but I'm not seeing a way to do this with normal permissions.
Adjusting (2) would make it much cleaner to categorize the many uncategorized pages from (3) without them clogging up the normal lists.
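To make the sorting idea concrete, here's a tiny illustrative sketch; the field names and the +10 smoothing constants are my assumptions, not the actual LW schema, and the constants would need tuning:

```python
# Illustrative only: sort concept-portal entries by a mix of karma and
# number of tagged posts rather than karma alone. Field names and the
# +10 smoothing constants are assumptions, not the real LW schema.
def sort_score(entry: dict) -> int:
    karma = entry.get("karma", 0)
    tagged = entry.get("tagged_post_count", 0)
    return (karma + 10) * (tagged + 10)

tags = [
    {"name": "Rationality Quotes", "karma": 0, "tagged_post_count": 136},
    {"name": "Well-Written Article", "karma": 75, "tagged_post_count": 12},
]
for tag in sorted(tags, key=sort_score, reverse=True):
    print(tag["name"], sort_score(tag))
```

With these made-up numbers, a zero-karma tag with 136 tagged posts still gets a respectable score instead of sinking to the bottom of the list.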
Nice! I'll watch through these then probably add a lot of them to the aisafety.video playlist.
I've heard from people I trust that:
- They can be pretty great, if you know what you want and set the prompt up right
- They won't be as skilled as a human therapist, and might throw you in at the deep end or not be tracking things a human would
Using them can be very worth it as they're always available and cheap, but they require a little intentionality. I suggest asking your human therapist for a few suggestions of kinds of work you might do with a peer or LLM assistant, and monitoring how it affects you while exploring, if you feel safe enough doing that. Maybe do it the day before a human session the first few times so you have a good safety net. Maybe ask some LWers what their system prompts are, or find some well-tested prompts elsewhere.
Looks like Tantrix:
oh yup, sorry, I meant mid 2026, like ~6 months before the primary proper starts. But could be earlier.
Yeah, this seems worth a shot. If we do this, we should do our own pre-primary in like mid 2027 to select who to run in each party, so that we don't split the vote and also so that we select the best candidate.
Someone I know was involved in a DIY pre-primary in the UK which unseated an extremely safe politician, and we'd get a bunch of extra press while doing this.
Humans without scaffolding can do a very finite number of sequential reasoning steps without mistakes. That's why thinking aids like paper, whiteboards, and other people to bounce ideas off and keep the cache fresh are so useful.
With a large enough decisive strategic advantage, a system can afford to run safety checks on any future versions of itself and anything else it's interacting with sufficient to stabilize values for extremely long periods of time.
Multipolar worlds though? Yeah, they're going to get eaten by evolution/moloch/power seeking/pythia.
More cynical take based on the Musk/Altman emails: Altman was expecting Musk to be CEO. He set up a governance structure which would effectively be able to dethrone Musk, with himself as the obvious successor, and was happy to staff the board with ideological people who might well take issue with something Musk did down the line, giving Altman a shot at the throne.
Musk walked away, and it would've been too weird to change his mind on the governance structure. Altman never thought the trap would fire with high enough probability to be worth disarming, at any point before it did.
I don't know whether the dates line up to disconfirm this, but I could see this kind of 5d chess move happening. Though maybe normal power-and-incentive psychological factors are sufficient.
Looks fun!
I could also remove Oil Seeker's protection from Pollution; they don't need it for making Black Chips to be worthwhile, but removing it would make that less of an amazing deal than it currently is.
Maybe have the pollution cost halved for Black, if removing it turns out to be too weak?
Seems accurate, though I think Thinking This Through A Bit involved the part of backchaining where you look at approximately where on the map the destination is, and that's what some pro-backchain people are trying to point at. In the non-metaphor, the destination is not well specified by people in most categories, and might be like 50 ft in the air so you need a way to go up or something.
And maybe if you are assisting someone else who has well grounded models, you might be able to subproblem solve within their plan and do good, but you're betting your impact on their direction. Much better to have your own compass or at least a gears model of theirs so you can check and orient reliably.
PS: I brought snacks!
Give me a dojo.lesswrong.com, where the people into mental self-improvement can hang out and swap techniques, maybe a meetup.lesswrong.com where I can run better meetups and find out about the best rationalist get-togethers. Let there be an ai.lesswrong.com for the people writing about artificial intelligence.
Yes! Ish! I'd be keen to have something like this for the upcoming aisafety.com/stay-informed page, where it currently looks like we'll resort to linking to https://www.lesswrong.com/tag/ai?sortedBy=magic#:~:text=Posts%20tagged%20AI as there's no simpler way to get people specifically to the AI section of the site.
I'd weakly lean towards not using a subdomain but towards using a linkable filter, but yeah, seems good.
I'd also think that making it really easy and fluid to cross-post (including selectively, maybe the posts pop up in your drafts and you just have to click post if you don't want everything cross-posted) would be a pretty big boon for LW.
I'm glad you're trying to figure out a solution. I am however going to shoot this one down a bunch.
If these assumptions were true, this would be nice. Unfortunately, I think all three are false.
LLMs will never be superintelligent when predicting a single token.
In a technical sense, definitively false. Redwood compared human to AI token prediction and even early AIs were far superhuman. Also, in a more important sense, you can apply a huge amount of optimization on selecting a token. This video gives a decent intuition, though in a slightly different setting.
LLMs will have no state.
False in three different ways. Firstly, people are totally building in explicit state in lots of ways (test time training, context retrieval, reasoning models, etc). Secondly, there's a feedback cycle of AI influences training data of next AI, which will become a tighter and tighter loop. Thirdly, the AI can use the environment as state in ways which would be nearly impossible to fully trace or mitigate.
not in a way that any emergent behaviour of the system as a whole isn't reflected in the outputs of any of the constituent LLMs
alas, systems whose parts are individually well understood regularly do, and should be expected to, have poorly understood behaviour when taken together.
a simpler LLM to detect output that looks like it's leading towards unaligned behaviour.
Robustly detecting "unaligned behaviour" is an unsolved problem, if by aligned you mean "makes the long term future good" rather than "doesn't embarrass the corporation". Solving this would be massive progress, and throwing an LLM at it naively has many pitfalls.
Stepping back, I'd encourage you to drop by AI plans, skill up at detecting failure modes, and get good at both generating and red-teaming your own ideas (the Agent Foundations course and Arbital are good places to start). Build up a long list of ideas you've shown how to break, and help both break and extract the insights from others' ideas.
To the extent human civilization is human-aligned, most of the reason for the alignment is that humans are extremely useful to various social systems like the economy, and states, or as substrate of cultural evolution. When human cognition ceases to be useful, we should expect these systems to become less aligned, leading to human disempowerment.
oh good, I've been thinking this basically word for word for a while and had it in my backlog. Glad this is written up nicely, far better than I would likely have done :)
The one thing I'm not a big fan of: I'd bet "Gradual Disempowerment" sounds like a "this might take many decades or longer" thing to most readers, whereas given capabilities curves this could be a few-months-to-single-digit-years thing.
I think I have a draft somewhere, but never finished it. tl;dr: Quantum computers let you steal private keys from public keys (so any wallet which has made a send transaction, since that exposes its public key). Upgrading can protect wallets whose owners move their coins, but it's going to be messy, slow, and won't work for lost-key wallets, which are a pretty huge fraction of the total BTC reserve. Once we get quantum computers, BTC at least is going to have a very bad time; others will have a moderately bad time depending on how early they upgrade.
Nice! I haven't read a ton of Buddhism, cool that this fits into a known framework.
I'm uncertain of how you use the word consciousness here: do you mean our blob of sensory experience, or something else?
Yeah, ~subjective experience.
Let's do most of this via the much higher bandwidth medium of voice, but quickly:
- Yes, qualia[1] is real, and is a class of mathematical structure.[2]
- (placeholder for not a question item)
- Matter is a class of math which is ~kinda like our physics.
- Our part of the multiverse probably doesn't have special "exists" tags, probably everything is real (though to get remotely sane answers you need a decreasing reality fluid/caring fluid allocation).
Math, in the sense I'm trying to point to it, is 'Structure'. By which I mean: Well defined seeds/axioms/starting points and precisely specified rules/laws/inference steps for extending those seeds. The quickest way I've seen to get the intuition for what I'm trying to point at with 'structure' is to watch these videos in succession (but it doesn't work for everyone):
[1] experience/the thing LWers tend to mean, not the most restrictive philosophical sense (#4 on SEP), which is pointlessly high complexity (edit: clarified that this is not the universal philosophical definition, but only one of several meanings, and walked back a little on rhetoric)
[2] possibly maybe even the entire class, though if true most qualia would be very very alien to us and not necessarily morally valuable
give up large chunks of the planet to an ASI to prevent that
I know this isn't your main point, but... that isn't a kind of trade that is plausible. A misaligned superintelligence disassembles the entire planet, sun, and everything else it can reach. Biological life does not survive, outside of some weird edge cases like "samples to sell to alien superintelligences that like life". Nothing in the galaxy is safe.
Re: Ayahuasca from the ACX survey having effects like:
- “Obliterated my atheism, inverted my world view no longer believe matter is base substrate believe consciousness is, no longer fear death, non duality seems obvious to me now.”
[1] There's a cluster of subcultures that consistently drift toward philosophical idealist metaphysics (consciousness, not matter or math, as fundamental to reality): McKenna-style psychonauts, Silicon Valley Buddhist circles, neo-occultist movements, certain transhumanist branches, quantum consciousness theorists, and various New Age spirituality scenes. While these communities seem superficially different, they share a striking tendency to reject materialism in favor of mind-first metaphysics.
The common factor connecting them? These are all communities where psychedelic use is notably prevalent. This isn't coincidental.
There's a plausible mechanistic explanation: psychedelics disrupt the Default Mode Network and adjust a bunch of other neural parameters. When these break down, the experience of physical reality (your predictive processing simulation) gets fuzzy and malleable while consciousness remains vivid and present. This creates a powerful intuition that consciousness must be more fundamental than matter. Conscious experience is more fundamental/stable than perception of the material world, which many people conflate with the material world itself.
The fun part? This very intuition - that consciousness is primary and matter secondary - is itself being produced by ingesting a chemical which alters physical brain mechanisms. We're watching neural circuitry create metaphysical intuitions in real-time.
This suggests something profound about metaphysics itself: Our basic intuitions about what's fundamental to reality (whether materialist OR idealist) might be more about human neural architecture than about ultimate reality. It's like a TV malfunctioning in a way that produces the message "TV isn't real, only signals are real!"
This doesn't definitively prove idealism wrong, but it should make us deeply suspicious of metaphysical intuitions that feel like direct insight - they might just be showing us the structure of our own cognitive machinery.
[1] Claude-assisted writing; ideas from me and edited by me.
We do not take a position on the likelihood of loss of control.
This seems worth taking a position on; the relevant people need to hear from the experts an unfiltered stance of "this is a real and perhaps very likely risk".
Agree that takeoff speeds are more important, and I expect FrontierMath has much less effect on takeoff speed. I still think timelines matter enough that the amount of relevant informing of people you buy from this is likely not worth the cost, especially if the org avoids talking about risks in public and leadership isn't focused on agentic takeover, so the info isn't packaged with what's needed for it to have the effects which would help.
Evaluating the final model tells you where you got to. Evaluating many small models and checkpoints helps you get further faster.
Even outside of arguing against the Control paradigm, this post (esp. The Model & The Problem & The Median Doom-Path: Slop, not Scheming) covers some really important ideas, which I think people working on many empirical alignment agendas would benefit from being aware of.
One neat thing I've explored is learning about new therapeutic techniques by dropping a whole book into context and asking for guiding phrases. Most therapy books spend a lot of time covering general principles of minds and how to work with them, with the unique aspects buried in a way which is not super efficient for someone who already has the universal ideas. Getting guiding phrases gives a good starting point for the specific shape of a technique, and means you can kinda use it pretty quickly. My project system prompt is:
Given the name of, and potentially documentation on, an introspective or therapeutic practice, generate a set of guiding phrases for facilitators. These phrases should help practitioners guide participants through deep exploration, self-reflection, and potential transformation. If you don't know much about the technique or the documentation is insufficient, feel free to ask for more information. Please explain what you know about the technique, especially the core principles and things relevant to generating guiding phrases, first.
Consider the following:
Understand the practice's core principles, goals, and methods.
Create open-ended prompts that invite reflection and avoid simple yes/no answers.
Incorporate awareness of physical sensations, emotions, and thought patterns.
Develop phrases to navigate unexpected discoveries or resistances.
Craft language that promotes non-judgmental observation of experiences.
Generate prompts that explore contradictions or conflicting beliefs.
Encourage looking beyond surface-level responses to deeper insights.
Help participants relate insights to their everyday lives and future actions.
Include questions that foster meta-reflection on the process itself.
Use metaphorical language when appropriate to conceptualize abstract experiences.
Ensure phrases align with the specific terminology and concepts of the practice.
Balance providing guidance with allowing space for unexpected insights.
Consider ethical implications and respect appropriate boundaries.
Aim for a diverse set of phrases that can be used flexibly throughout the process. The goal is to provide facilitators with versatile tools that enhance the participant's journey of self-discovery and growth.
Example (adapt based on the specific practice):
"As you consider [topic], what do you notice in your body?"
"If that feeling had a voice, what might it say?"
"How does holding this belief serve you?"
"What's alive for you in this moment?"
"How might this insight change your approach to [relevant aspect of life]?"Remember, the essence is to create inviting, open-ended phrases that align with the practice's core principles and facilitate deep, transformative exploration.
Please store your produced phrases in an artefact.
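For anyone who'd rather script this than use a Claude Project, here's a minimal sketch using the Anthropic Python SDK; the model name and token limit are placeholders, and the system prompt is truncated here (paste in the full prompt above):

```python
# Minimal sketch: send the project prompt above plus a book excerpt to Claude
# and get back guiding phrases. Model name and max_tokens are placeholders.
import anthropic

SYSTEM_PROMPT = """Given the name of, and potentially documentation on, an
introspective or therapeutic practice, generate a set of guiding phrases for
facilitators. ..."""  # paste the full prompt from above here

def guiding_phrases(technique: str, book_excerpt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use whatever you have access to
        max_tokens=2000,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Technique: {technique}\n\nDocumentation:\n{book_excerpt}",
        }],
    )
    return message.content[0].text

# e.g. print(guiding_phrases("Internal Family Systems", open("ifs_excerpt.txt").read()))
```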
I'm guessing you view having a better understanding of what's coming as very high value, enough that burning some runway is acceptable? I could see that model (though I put <15% on it), but it is at least not good integrity-wise to have put on the appearance of doing only the good-for-x-risk part and not sharing it as an optimizable benchmark, while being funded by, and giving the data to, people who will use it for capability advancements.
Evaluation on demand, which they can run as intensively as they like, lets them test small models for architecture improvements. This is where the vast majority of the capability gain is.
Getting an evaluation of each final model is going to be way less useful for the research cycle, as it only gives a final score, not a metric which is part of the feedback loop.
However, we have a verbal agreement that these materials will not be used in model training.
If by this you mean "OpenAI will not train on this data", that doesn't address the vast majority of the concern. If OpenAI is evaluating the model against the data, they will be able to more effectively optimize for capabilities advancement, and that's a betrayal of the trust of the people who worked on this with the understanding that it will be used only outside of the research loop to check for dangerous advancements. And, particularly, not to make those dangerous advancements come sooner by giving OpenAI another number to optimize for.
If you mean OpenAI will not be internally evaluating models on this to improve and test the training process, please state this clearly in writing (and maybe explain why they got privileged access to the data despite being prohibited from the obvious use of that data).
Really high-quality, high-difficulty benchmarks are much more scarce and important for advancing capabilities than mere training data. Having an apparently x-risk-focused org build a benchmark, implying it's for evaluating danger from highly capable models in a way the capabilities orgs can't use to test their models, and then having it turn out that it's secretly funded by OpenAI, with OpenAI getting access to most of the data, is very sketchy.
Some people who contributed questions likely thought they would be reducing x-risk by helping build bright line warning signs. Their work being available to OpenAI will mostly have increased x-risk by giving the capabilities people an unusually important number-goes-up to optimize for, bringing timelines to dangerous systems closer. That's a betrayal of trust, and Epoch should do some serious soul searching about taking money to do harmful things.
This is usually a good idea, but it's critically important when using skills like those described in Listening to Wisdom, in a therapeutic relationship (including many forms of coaching), or while under the influence of substances that increase your rate of cognitive change and lower barriers to information inflow (such as psychedelics).
If you're opening yourself up to receive the content of those vibes in an emotional/embodied/deep way, and those vibes are bad, this can be toxic to an extent you will not be expecting (even if you try to account for this warning).
Do not do mind-meld-like techniques/drugs/therapy with people your system is throwing unexplained warnings about. Instead, step out of the situation and investigate any such warnings at a safe distance, with the possibility of a "nope" and disengaging if the warning is still flashing (even if you don't get clarity on its source).
Maybe having exact evaluations not be trivial is not entirely a bug; it might make the game more interesting (though maybe more annoying)?
I recommend most readers skip this subsection on a first read; it’s not very central to explaining the alignment problem.
Suggest either putting this kind of aside in a footnote, or giving the reader a handy link to the next section for convenience?
Nice!
(I wrote that the bit about not having to tell people your favourite suit or what cards you have leaves things open for some sharp or clever negotiation, but looking back I think it's mostly a trap. I haven't seen anyone get things to go better for them by hiding their suit.)
To add another layer to this strategy: giving each person one specific card in their suit that they want with much higher strength might be fun, as the other players can ransom that card if they know about it (but might be happy trading it anyway). Also, having the four suits each carry a different multiplier might be fun?
On one side: humanoid robots have a much higher density of parts requiring more machine-time than cars, which probably slows things down a bunch.
On the other: you mention assuming no speed-up due to the robots building robot factories, but this seems like the dominant factor in the growth, and numbers which exclude it are going to badly underestimate things pretty quickly. I'd be interested in what those numbers look like under reasonable guesses about the robot workforce being part of a feedback cycle.
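To illustrate why I expect the feedback term to dominate, here's a toy model with entirely made-up parameters (not your numbers); the point is only the shape of the curves, not the magnitudes:

```python
# Toy model, made-up parameters: compare a fixed factory base producing robots
# at a constant rate against a loop where part of the fleet builds new factories.
ROBOTS_PER_FACTORY_PER_YEAR = 10_000   # assumed annual output of one factory
ROBOT_YEARS_PER_NEW_FACTORY = 5_000    # assumed robot-labour cost of one factory
FRACTION_BUILDING_FACTORIES = 0.2      # assumed share of fleet on factory construction

def total_robots(years: int, feedback: bool) -> float:
    factories, robots = 100.0, 0.0
    for _ in range(years):
        robots += factories * ROBOTS_PER_FACTORY_PER_YEAR
        if feedback:
            factories += robots * FRACTION_BUILDING_FACTORIES / ROBOT_YEARS_PER_NEW_FACTORY
    return robots

for year in (3, 5, 8):
    print(year, f"fixed: {total_robots(year, False):.2e}", f"feedback: {total_robots(year, True):.2e}")
```

Even with these conservative made-up numbers, the feedback run pulls away from the fixed-capacity run within a few years.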
Or, worse, if most directions are net negative and you have to try quite hard to find one which is positive, almost everyone optimizing for magnitude will end up doing harm proportional to how much they optimize magnitude.
Yeah, this seems probably a good idea. Though some of these would be best on existing resource pages, like the funders list.
This is imo one of if not the most impactful software role in the AI safety ecosystem. I think the AI safety funding ecosystem's significant challenges are most likely to be addressed by scaling up the s-process.
I've passed this on to some of the best SWEs I know, some of whom also manage a lot of other great devs.
Added something to the TL;DR footnote covering this.
It's a pretty straightforward modification of the Caplan thruster. You scoop up bits of sun with very strong magnetic fields, but rather than fusing it and using it to move a star, you cool most of it (firing some back with very high velocity to balance things momentum wise) and keep the matter you extract (or fuse some if you need quick energy). There's even a video on it! Skip to 4:20 for the relevant bit.
This feels very related to a section I didn't write for the post because it was getting too long, about how to "quote" claims about the other person's self-model in a way which defuses conflict while leaving you with a wider range of conversational motion. Basically by saying e.g.
"I have a story that you're angry with me"
rather than
"You're angry with me"
The other person can accept your statement into their conversational stack safely, even if they're not angry, because another person thinking you're angry while you're not angry is totally compatible as a model, but you being angry while you're not angry is not. So if you try to include their mental object directly, it fires a crapton of error messages from colliding predictive models.
Thanks! Yeah, I don't have that bit distilled down. I have a sense of the difference between making an ask of someone and making a demand. Thinking in Active Inference/predictive processing terms, I think it's something like:
Top level statement: I would like X (because Y).
vs
Top level statement: X. (is the case, must be the case, or with less force, is probabilistically more the case)
In the first one, you can accept the statement into your predictive models even if the outcome is that you don't do X, because the action-request is "quoted". The latter statement, if incorporated into your cognitive stack, causes dissonance unless X.
Edit: Also, if you're interested, the methodology for coming up with the distillation was learning NVC, being in situations where it did and didn't get applied, then carefully introspecting on the load-bearing parts of the difference until the principle which had been encoded in experience popped out in a crystallized form.
Completed as A Principled Cartoon Guide to NVC :)
Blooper reel
Claude can just about pull these cartoons off, but it does make mistakes. I made at least twice as many mistakes prompting though.
From the way things sure seem to look, the universe is very big, and has room for lots of computations later on. A bunch of plausible rollouts involve some small fraction of those very large resources going on simulations.
You can, if you want, abandon all epistemic hope and have a very very wide prior. Maybe we're totally wrong about everything! Maybe we're Boltzmann brains! But that's not super informative or helpful, so we look around us and extrapolate assuming that's a reasonable thing to do, because we ain't got anything else we can do.
Simulations are very compatible with that. The other examples aren't so much, if you look up close and have some model of what those things are like and do.
In a large universe, you, and everyone else, exists both in and not in simulations. That is: The pattern you identify with exists in both basement reality (in many places) and also in simulations (in many places).
There is a question of what proportion of the you-patterns exist in basement reality, but it has a slightly different flavour, I think. It seems to trigger some deep evolved patterns (around fakeness?) less than the kind of existential fear that simulations with the naive conception of identity sometimes brings up.
But to answer that question: Maybe simulators tend to prefer "flat" simulations, where the entire system is simulated evenly to avoid divergence from the physical system it's trying to gather information about. Maybe your unique characteristic is the kind of thing that makes you more likely to be simulated in higher fidelity than the average human, and simulators prefer uneven simulations. Or maybe it's unusual but not particularly relevant for tactical simulations of what emerges from the intelligence explosion (which is probably where the majority of the simulation compute goes).
But, either way, that update is probably pretty small compared to the high background rate of simulations of "humans around at the time of the singularity". Bostrom's paper covers the general argument for simulations outnumbering basement reality due to ancestor simulations: https://simulation-argument.com/simulation.pdf
However, even granting all of the background assumptions that go into this: not all observers who are you live in a simulation. You exist in both types of places. Simulations don't reduce your weight in basement reality; they can only give you more places in which you exist.
Anyone trying this in the modern day will have a much easier time thanks to LLM tools, especially https://elicit.com/ for automated literature review.
I have a principled explanation for this! Post upcoming :)
Link is broken
A defeater, in my mind, is a failure mode which, if you don't address it, means you will not succeed at aligning sufficiently powerful systems.[1] It does not mean work outside of that focused on them is useless, but at some point you have to deal with the defeaters, and if the vast majority of people working towards alignment don't get them clearly, and the people who do get them claim we're nowhere near on track to find a way to beat the defeaters, then that is a scary situation.
This is true even if some of the work being done by people unaware of the defeaters is not useless, e.g. maybe it is successfully averting earlier forms of doom than the ones that require routing around the defeaters.
[1] Not best considered as an argument against specific lines of attack, but as a problem which, if unsolved, leads inevitably to doom. People with a strong grok of a bunch of these often think that way more timelines are lost to "we didn't solve these defeaters" than to the problems even plausibly addressed by the class of work most of the field is doing. This does unfortunately make it get used as (and feel like) an argument against those approaches by people who don't (and don't claim to) understand them, but that's not the generator or the important nature of it.
I mostly agree with the diagnosis of the problem, but have some different guesses about paths to try and get alignment on track.
I think the core difficulties of alignment are explained semi-acceptably, but in a scattered form, which means that only dedicated explorers with lots of time and good taste end up finding them. Having a high-quality course which collects the best explainers we have to prepare people to try to find a toehold, plus noticing the remaining gaps and writing good material to fill them, seems necessary for any additional group of people to actually point in the right direction.
BlueDot's course seems strongly optimized to funnel people into the empirical/ML/lab alignment team pipeline; they have dropped the Agent Foundations module entirely, and their "What makes aligning AI difficult?" fast track is 3/5ths articles on RLHF/RLAIF (plus an intro to LLMs and an RA video). This is the standard recommendation, and there isn't a generally known alternative.
I tried to fix this with Agent Foundations for Superintelligent Robust-Alignment, but I think this would go a lot better if someone like @johnswentworth took it over and polished it.