Please stop publishing ideas/insights/research about AI
post by Tamsin Leake (carado-1) · 2024-05-02T14:54:15.691Z · LW · GW · 65 comments
Basically all ideas/insights/research about AI are potentially exfohazardous [LW · GW]. At the least, it's pretty hard to know when some idea/insight/research will actually make things better; especially in a world where building an aligned superintelligence (let's call this work "alignment") is considerably harder than building any superintelligence (let's call this work "capabilities"), and there are a lot more people trying to do the latter than the former, and they have a lot more material resources.
Ideas about AI, let alone insights about AI, let alone research results about AI, should be kept to private communication between trusted alignment researchers. On LessWrong, we should focus on teaching people the rationality skills that could, in principle, help them figure out how to build a superintelligence, but are more likely to first help them realize that doing so is a bad idea.
For example, OpenAI has demonstrated that they're just gonna cheerfully head towards doom. If you give OpenAI, say, interpretability insights, they'll just use them to work towards doom faster; what you need is to either give OpenAI enough rationality to slow down (even just a bit), or at least not give them anything. To be clear, I don't think people working at OpenAI know that they're working towards doom; a much more likely hypothesis is that they've memed themselves into not thinking very hard about the consequences of their work, and into erroneously feeling vaguely optimistic about those consequences due to cognitive biases such as wishful thinking.
It's very rare that any research purely helps alignment, because any alignment design is a fragile target that is just a few changes away from unaligned. There is no alignment plan which fails harmlessly if you fuck up implementing it, and people tend to fuck things up unless they try really hard not to (and often even if they do), and people don't tend to try really hard not to. This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned — a helpful AI will help anyone, and the world has more people trying to build any superintelligence (let's call those "capabilities researchers") than people trying to build aligned superintelligence (let's call those "alignment researchers").
Worse yet: if focusing on alignment is correlated with higher rationality, and thus with a better ability to figure out what one needs to solve one's problems, then alignment researchers are more likely than capabilities researchers to already have the ideas/insights/research they need, and thus publishing ideas/insights/research about AI is more likely to differentially help capabilities researchers. Note that this is another relative statement; I'm not saying "alignment researchers have everything they need", I'm saying "in general you should expect them to need fewer outside ideas/insights/research on AI than capabilities researchers".
Alignment is a differential problem. We don't need alignment researchers to succeed as fast as possible; what we really need is for alignment researchers to succeed before capabilities researchers. Don't ask yourself "does this help alignment?", ask yourself "does this help alignment more than capabilities?".
- "But superintelligence is so far away!" — even if this was true (it isn't) then it wouldn't particularly matter. There is nothing that makes differentially helping capabilities "fine if superintelligence is sufficiently far away". Differentially helping capabilities is just generally bad.
- "But I'm only bringing up something that's already out there!" — something "already being out there" isn't really a binary thing. Bringing attention to a concept that's "already out there" is an exfohazard if it's worse for people to think about that concept more often. In many contexts, the concept of AI is an exfohazard, because in some contexts it's better for the world if people to think a bit less often about AI, even though they're already familiar with the concept. The same applies often for situations where people say "this ship has sailed": often, it is the case that the ship has, in fact, less-than-maximally sailed, and every bit of sailing-it-a-bit-less helps. If a ship has 85% sailed, let's not bring that up to 87% sailed. No, not everyone is already maximally-efficiently allocating their focus to the concepts that would help them the most — in fact, barely anyone is, and controlling what already-out-there concepts people pay attention to is an important part of exfohazard policy.
- "But my ideas/insights/research is not likely to impact much!" — that's not particularly how it works? It needs to somehow be differenially helpful to alignment, which I think is almost never the case. There is nothing that makes differentially helping capabilities "fine if you're only differentially helping them a little bit". Differentially helping capabilities is just generally bad. Overall p(doom) depends, among other things, on many small impacts, so you can see this as "doing your part". But really, you should just go "is this the right direction?" and if the answer is not then "by how much is this the wrong direction" doesn't matter a whole lot. I, for example, will continue to take actions that direct capabilities researchers' attention away from concept that'd help them with capabilities, even if my impact is very small; not just because it'd have a bunch of impact if everyone in my reference class did this, but because my very small impact is still in the right direction.
- "But I'm explicitely labelling these ideas as scary and bad!" — and? Pointing at which things are powerful and thus they're scary still points people at the things which are powerful [LW · GW].
- "So where do I privately share such research?" — good question! There is currently no infrastructure for this. I suggest keeping your ideas/insights/research to yourself. If you think that's difficult for you to do, then I suggest not thinking about AI, and doing something else with your time, like getting into factorio 2 or something.
- "If nobody publishes anything, how will alignment get solved?" — sure, it's harder for alignment researchers to succeed if they don't communicate publicly with one another — but it's not impossible. That's what dignity [LW · GW] is about. And "this is bad so I'll do the alternative" isn't really a plan: the alternative might be worse. My whole point is that, as bad as that situation would be, it'd be better than the status quo where people just casually post eg interpretability and other prosaic research on lesswrong or in papers. This is the case because alignment takes more sequential work-time, because it's harder.
Any small increment towards alignment-succeeding-before-capabilities helps. Even if there's a bunch of damage every day from people posting prosaic research on LessWrong and in papers, you can help by not making it worse. This isn't even a prisoner's dilemma; not publishing ideas/insights/research about AI gets you lower p(doom) — and thus, also, more LDT-value-handshake utility in worlds where we do survive.
So favor posting things that help people be more rational and make better judgments (such as not working on capabilities). Favor posting things that help capabilities researchers realize that that's what they are: that their alignment plans won't pan out and they're really just doing capabilities research.
Or at least just shut up and post nothing, rather than posting ideas/insights/research about AI.
65 comments
Comments sorted by top scores.
comment by Zack_M_Davis · 2024-05-02T18:55:16.637Z · LW(p) · GW(p)
I think this is undignified.
I agree that it would be safer if humanity were a collective hivemind that could coordinate to not build AI until we know how to build the best AI, and that people should differentially work on things that make the situation better rather than worse, and that this potentially includes keeping quiet about information that would make things worse.
The problem is—as you say—"[i]t's very rare that any research purely helps alignment"; you can't think about aligning AI without thinking about AI. In order to navigate the machine intelligence transition in the most dignified way, you want your civilization's best people to be doing their best thinking about the problem, and your best people can't do their best thinking under the conditions of paranoid secrecy.
Concretely, I've been studying some deep learning basics lately and have written a couple [LW · GW] posts [LW · GW] about things I've learned. I think this was good, not bad. I think I and my readers have a slightly better understanding of the technology in question than if I hadn't studied and hadn't written, and that better understanding will help us make better decisions in expectation.
This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned—a helpful AI will help anyone
Sorry, what? I thought the fear was that we don't know how to make helpful AI at all. (And that people who think they're being helped by seductively helpful-sounding LLM assistants are being misled by surface appearances; the shoggoth underneath has its own desires that we won't like when it's powerful enough to pursue them autonomously.) In contrast, this almost makes it sound like you think it is plausible to align AI to its user's intent, but that this would be bad if the users aren't one of "us"—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
Replies from: habryka4, lahwran, carado-1, Chris_Leong
↑ comment by habryka (habryka4) · 2024-05-03T01:05:41.473Z · LW(p) · GW(p)
Sorry, what? I thought the fear was that we don't know how to make helpful AI at all. (And that people who think they're being helped by seductively helpful-sounding LLM assistants are being misled by surface appearances; the shoggoth underneath has its own desires that we won't like when it's powerful enough to pursue them autonomously.) In contrast, this almost makes it sound like you think it is plausible to align AI to its user's intent, but that this would be bad if the users aren't one of "us"—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
My steelman of this (though to be clear I think your comment makes good points):
There is a large difference between a system being helpful and a system being aligned. Ultimately AI existential risk is a coordination problem where I expect catastrophic consequences because a bunch of people want to build AGI without making it safe. Therefore making technologies that in a naive and short-term sense just help AGI developers build whatever they want to build will have bad consequences. If I trusted everyone to use their intelligence just for good things, we wouldn't have anthropogenic existential risk on our hands.
Some of those technologies might end up useful for also getting the AI to be more properly aligned, or maybe to help with work that reduces the risk of AI catastrophe some other way, though my current sense is that kind of work is pretty different and doesn't benefit remotely as much from generically locally-helpful AI.
In general I feel pretty sad about conflating "alignment" with "short-term intent alignment". I think the two problems are related but have really crucial differences, I don't think the latter generalizes that well to the former (for all the usual sycophancy/treacherous-turn reasons), and indeed progress on the latter IMO mostly makes the world marginally worse, because the thing it is most likely to be used for is developing existentially dangerous AI systems faster.
Edit: Another really important dimension to model here is also not just the effect of that kind of research on what individual researchers will do, but what effect this kind of research will have on what the market wants to invest in. My standard story of doom is centrally rooted in there being very strong short-term individual economic incentives to build more capable AGI, enabling people to make billions to trillions of dollars, while the downside risk is a distributed negative externality that is not at all priced into the costs of AI development. Developing applications of AI that make a lot of money without accounting for the negative extinction externalities therefore can be really quite bad for the world.
↑ comment by the gears to ascension (lahwran) · 2024-05-02T21:00:55.687Z · LW(p) · GW(p)
The problem is that "helpful" oracle AI will not stay helpful for long, if there is any incentive to produce things which are less helpful. Your beliefs are apparently out of date: we have helpful AI now, so that's an existence disproof of "helpful AI is impossible". But the threat of AI being more evolutionarily fit, and possibly an AI taking sudden and intense action to make use of its being more evolutionarily fit, is still hanging over our heads; and it only takes one hyperdesperate not-what-you-meant seeker.
Concretely, I think your posts are in fact a great (but not at all worst-case) example of things that have more cost than benefit, and I think you should keep working but only talk to people in DMs. Time is very, very short, and if you accidentally have a pivotally negative impact, you could be the one that burns the last two days before the world is destroyed.
↑ comment by Tamsin Leake (carado-1) · 2024-05-02T20:40:48.223Z · LW(p) · GW(p)
but that this would be bad if the users aren't one of "us"—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
Rather, " us" — the good alignment researchers who will be careful at all about the long term effects of our actions, unlike capabilities researchers who are happy to accelerate race dynamics and increase p(doom) if they make a quick profit out of it in the short term.
Replies from: Zack_M_Davis
↑ comment by Zack_M_Davis · 2024-05-03T00:23:58.474Z · LW(p) · GW(p)
I think these judgements would benefit from more concreteness: that rather than proposing a dichotomy of "capabilities research" (them, Bad) and "alignment research" (us, Good), you could be more specific about what kinds of work you want to see more and less of.
I agree that (say) Carmack and Sutton are doing a bad thing by declaring a goal to "build AGI" while dismissing the reasons that this is incredibly dangerous. But the thing that makes infohazard concerns so fraught is that there's a lot of work that potentially affects our civilization's trajectory into the machine intelligence transition in complicated ways, which makes it hard to draw a boundary around "trusted alignment researchers" in a principled and not self-serving way that doesn't collapse into "science and technology is bad".
We can agree that OpenAI as originally conceived was a bad idea. What about the people working on music generation? That's unambiguously "capabilities", but it's also not particularly optimized at ending the world the way "AGI for AGI's sake" projects are. If that's still bad even though music generation isn't going to end the world (because it's still directing attention and money into AI, increasing the incentive to build GPUs, &c.), where do you draw the line? Some of the researchers I cited in my most recent post are working on "build[ing] better models of primate visual cognition". Is that wrong? Should Judea Pearl not have published? Turing? Charles Babbage?
In asking these obnoxious questions, I'm not trying to make a reductio ad absurdum of caring about risk, or proposing an infinitely slippery slope where our only choices are between max accelerationism and a destroy-all-computers Butlerian Jihad. I just think it's important to notice that "Stop thinking about AI" kind of does amount to a Butlerian Jihad (and that publishing and thinking are not unrelated)?
↑ comment by Chris_Leong · 2024-05-03T17:21:33.156Z · LW(p) · GW(p)
In contrast, this almost makes it sound like you think it is plausible to align AI to its user's intent, but that this would be bad if the users aren't one of "us"—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
If I'm being honest, I don't find this framing helpful.
If you believe that things will go well if certain actors gain access to advanced AI technologies first, you should directly argue that.
Focusing on status games feels like a red herring.
comment by leogao · 2024-05-03T00:37:23.851Z · LW(p) · GW(p)
I'm very sympathetic to the idea of being careful about publishing things that could spread capabilities ideas. However, I think there are several important things missing from your world model, which cause me to believe that following your advice would substantially hurt alignment progress.
(To be clear, none of this applies to alignment people working directly on capabilities, who should, like, not. Rather, this is about alignment researchers accidentally advancing capabilities by talking to capabilities people.)
- It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don't actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
- Even if you try very hard to do so, it's still very hard to convince people that you're right if you don't have a ton of clout via a legible reputation of being right a lot. Everyone has an agenda they're convinced will solve AGI and is too busy trying to convince everyone else to work on their agenda.
- High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
- I think deeply understanding top tier capabilities researchers' views on how to achieve AGI is actually extremely valuable for thinking about alignment. Even if you disagree on object level views, understanding how very smart people come to their conclusions is very valuable.
- I think alignment discourse is greatly harmed by people being too scared to say things. When it bleeds over to being too scared to think about capabilities related topics for fear of accidentally generating something dangerous, I think this is even more harmful.
↑ comment by LawrenceC (LawChan) · 2024-05-03T03:24:25.116Z · LW(p) · GW(p)
While I've softened my position on this in the last year, I want to give a big +1 to this response, especially these two points:
- It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arxiv are garbage and don't actually make any progress, but those papers are not the ones that are pushing AGI forward anyways.
- [..]
- High level ideas are generally not that valuable in and of themselves. People generally learn to ignore ideas unless they have strong empirical evidence of correctness (or endorsement of highly respected researchers) because there are simply too many ideas. The valuable thing is not the idea itself, but the knowledge of which ideas are actually correct.
(emphasis added)
I think it's often challenging to just understand where the frontier is, because it's so far and so many things are secret. And if you're not at a scaling lab and then also don't keep up with the frontier of the literature, it's natural to overestimate the novelty of your insights. And then, if you're too scared to investigate your insights, you might continue to think that your ideas are better than they are. Meanwhile, as an AI Safety researcher, not only is there a lot less distance to the frontier of whatever subfield you're in, you'll probably spend most of your time doing work that keeps you on the frontier.
Random insights can be valuable, but the history of deep learning is full of random insights that were right but for arguably the wrong reasons (batch/layernorm, Adam, arguably the algorithm that would later be rebranded as PPO), as well as brilliant insights that turned out to be basically useless (e.g. consider a lot of the Bayesian neural network stuff, but there's really too many examples to list) if not harmful in the long run (e.g. lots of "clever" or not-so-clever ways of adding inductive bias). Part of the reason is that people don't get taught the history of the field, and so don't see all the oh-so-clever ideas that didn't work, or how a lot of the "insights" were invented post hoc. So if you're new to deep learning you might get the impression that insights were more causally responsible for the capabilities advancements than they actually are. Insofar as good alignment requires deconfusion and rationality to generate good insights, and capabilities does not, you should expect that the insights you get from improving rationality/doing deconfusion are more impactful for alignment than capabilities.
I mean, if you actually do come up with a better initialization scheme, a trick that improves GPU utilization, or some other sort of cheap algorithmic trick to improve training AND check it's correct through some small/medium-scale empirical experiments, then sure, please reconsider publishing that. But it's hard to incidentally do that -- even if you do come up with some insight while doing, say, mech interp, it feels like going out of your way to test your capability ideas should be a really obvious "you're basically doing capabilities" sign? And maybe you should be doing the safety work you claim to want to do instead?
Replies from: luciaquirke
↑ comment by luciaquirke · 2024-05-30T05:11:31.677Z · LW(p) · GW(p)
Is there anything you recommend for understanding the history of the field?
↑ comment by Nicholas / Heather Kross (NicholasKross) · 2024-05-10T21:44:14.190Z · LW(p) · GW(p)
I think deeply understanding top tier capabilities researchers' views on how to achieve AGI is actually extremely valuable for thinking about alignment. Even if you disagree on object level views, understanding how very smart people come to their conclusions is very valuable.
I think the first sentence is true (especially for alignment strategy), but the second sentence seems sort of... broad-life-advice-ish, instead of a specific tip? It's a pretty indirect help to most kinds of alignment.
Otherwise, this comment's points really do seem like empirical things that people could put odds or ratios on. Wondering if a more-specific version of those "AI Views Snapshots [LW · GW]" would be warranted, for these sorts of "research meta-knowledge" cruxes. Heck, it might be good to have lots of AI Views Snapshot DLC Mini-Charts, from for-specific-research-agendas(?) to internal-to-organizations(?!?!?!?).
Replies from: NicholasKross
comment by LawrenceC (LawChan) · 2024-05-03T05:04:31.928Z · LW(p) · GW(p)
I agree with many of the points made in this post, especially the "But my ideas/insights/research is not likely to impact much!" point. I find it plausible that in some subfields, AI x-risk people are too prone to publishing due to historical precedent and norms (maybe mech interp? though little has actually come of that). I also want to point out that there are non-zero arguments to expect alignment work to help more with capabilities, relative to existing "mainstream" capabilities work, even if I don't believe this to be the case. (For example, you might believe that the field of deep learning spends too little time actually thinking about how to improve their models, and too much time just tinkering, in which case your thinking could have a disproportionate impact even after adjusting for the fact that you're not trying to do capabilities.) And I think that some of the research labeled "alignment" is basically just capabilities work, and maybe the people doing it should stop.
I also upvoted the post because I think this attitude is pervasive in these circles, and it's good to actually hash it out in public.
But as with most of the commenters, I disagree with the conclusion of the post.
I suspect the main cruxes between us are the following:
1. How much useful alignment work is actually being done?
From paragraphs such as the following:
It's very rare that any research purely helps alignment, because any alignment design is a fragile target that is just a few changes away from unaligned. There is no alignment plan which fails harmlessly if you fuck up implementing it, and people tend to fuck things up unless they try really hard not to (and often even if they do), and people don't tend to try really hard not to. This applies doubly so to work that aims to make AI understandable or helpful, rather than aligned — a helpful AI will help anyone, and the world has more people trying to build any superintelligence (let's call those "capabilities researchers") than people trying to build aligned superintelligence (let's call those "alignment researchers").
And
"But my ideas/insights/research is not likely to impact much!" — that's not particularly how it works? It needs to somehow be differenially helpful to alignment, which I think is almost never the case.
It seems that a big part of your world model is that ~no one who thinks they're doing "alignment" work is doing real alignment work, and are really just doing capabilities work. In particular, it seems that you think interp or intent alignment are basically just capabilities work, insofar as their primary effect is helping people build unsafe ASI faster. Perhaps you think that, in the case of interp, before we can understand the AI in a way that's helpful for alignment, we'll understand it in a way that allows us to improve it. I'm somewhat sympathetic to this argument. But I think making it requires arguing that interp work doesn't really contribute to alignment at all, and is thus better thought of as capabilities work (and same for intent alignment).
Perhaps you believe that all alignment work is useless, not because it's misguided and actually capabilities work, but because we're so far from building aligned ASI that ~all alignment work is useless, and we're in the intermediate regime where additional insights non-negligibly hasten the arrival of unaligned ASI. But I think you should argue for that explicitly (as, say, Eliezer did in his death with dignity post), since I imagine most of the commenters here would disagree with this take.
My guess is this is the largest crux between us; if I thought all "alignment" work did nothing for alignment, and was perhaps just capabilities work in disguise, then I would agree that people should stop. In fact, I might even argue that we should just stop all alignment work whatsoever! Insofar as I'm correct about this being a crux, I'd like to see a post explicitly arguing for the lack of alignment relevancy of existing 'alignment work', which will probably lead to a more constructive conversation than this post.
2. How many useful capabilities insights incidentally come from "alignment" work?
I think empirically, very few (if not zero) capabilities insights have come from alignment work. And a priori, you might expect that research that aims to solve topic X produces marginally more X than a related topic Y. Insofar as you think that current "alignment" work is more than epsilon useful, I think you would not argue that most alignment work is differentially negative. So insofar as you think a lot of "alignment" work is real alignment work, you probably believe that many capabilities insights have come from past alignment work.
Perhaps you're reluctant to give examples, for fear of highlighting them. I think the math doesn't work out here -- having a few clear examples from you would probably be sufficient to significantly reduce the number of published insights from the community as a whole. But, if you have many examples of insights that help capabilities but are too dangerous to highlight, I'd appreciate if you would just say that (and maybe we can find a trusted third party to verify your claim, but not share the details?).
Perhaps you might say, well, the alignment community is very small, so there might not be many examples that come to mind! To make this carry through, you'd still have to believe that the alignment community also hasn't produced much good research. (Even though, naively, you might expect higher returns from alignment due to there being more unpicked low-hanging fruit due to its small size.) But then again, I'd prefer if you explicitly argued that ~all alignment is either useless or capabilities instead of gesturing at a generic phenomenon.
Perhaps you might say that capabilities insights are incredibly long tailed, and thus seeing no examples doesn't mean that the expected harm is low. But, I think you still need to make some sort of plausibility argument here, as well as a story for why the existing ML insights deserve a lot of Shapley for capabilities advances, even though most of the "insights" people had were useless if not actively misleading.
I also think that there's an obvious confounder, if you believe something along the lines of "focusing on alignment is correlated with higher rationality". Personally, I also think the average alignment(-interested) researcher is more competent at machine learning or research in general than the average generic capabilities researcher (this probably becomes false once you condition on being at OAI, Anthropic, or another scaling lab). If you just count "how many good ideas came from 'alignment' researchers per capita" to the number for 'capability' researchers, you may find that the former is higher because they're just more competent. This goes back again into crux 1., where you then need to argue that competency doesn't help at all in doing actual alignment work, and again, I suspect it's more productive to just argue about the relevance and quality of alignment work instead of arguing about incidental capabilities insights.
3. How important are insights to alignment/capabilities work?
From paragraphs such as the following:
Worse yet: if focusing on alignment is correlated with higher rationality, and thus with a better ability to figure out what one needs to solve one's problems, then alignment researchers are more likely than capabilities researchers to already have the ideas/insights/research they need, and thus publishing ideas/insights/research about AI is more likely to differentially help capabilities researchers. Note that this is another relative statement; I'm not saying "alignment researchers have everything they need", I'm saying "in general you should expect them to need fewer outside ideas/insights/research on AI than capabilities researchers".
it seems that you're working with a model of research output with two main components -- (intrinsic) rationality and (external) insights. But there's a huge component that's missing from this model: actual empirical experiments validating the insight, which is the ~bulk of actual capabilities work and a substantial fraction of alignment work. This matters both because ~no capabilities researchers will listen to you if you don't have empirical experiments, and because, if you believe that you can deduce more alignment research "on your own", you might also believe that you need to do more empirical experiments to do capabilities research (and thus that the contribution per insight is by default a lot smaller).
Even if true insights are differentially more helpful for capabilities, the fact that it seems empirically difficult to know which insights are true means that a lot of the work in getting a true insight will involve things that look a lot more like normal capabilities work -- e.g. training more capable models. But surely then, the argument would be reducible to: if you do capabilities work, don't share it on pain of accelerating ASI progress -- which seems like something your audience already agrees with!
That being said, I think I might disagree with your premise here. My guess is that alignment, by being less grounded than capabilities, probably requires more outside ideas/insights/research, just for sanity-checking reasons (once you control for the competence of the researcher and the fact that there's probably more low-hanging fruit in alignment). After all, you can just make a change and see if your log loss on pretraining goes down, but it's a lot harder to know if your model of deceptive alignment actually is at all sensible. If you don't improve your model's performance on standard benchmarks, then this is evidence that your capability idea doesn't work, but there aren't even really any benchmarks for many of the problems alignment researchers think about. So it's easier to go astray, and therefore more important to get feedback from other researchers.
Finally, to answer this question:
"So where do I privately share such research?" — good question!
I suspect that the way to go is to form working groups of researchers that stick together and maintain a high level of trust, e.g. a research organization. Then, do and share your research internally and think about possible externalities before publishing more broadly, perhaps doing a tiered release. (This is indeed the model used by many people in alignment orgs.)
Replies from: mesaoptimizer
↑ comment by mesaoptimizer · 2024-05-03T10:06:20.207Z · LW(p) · GW(p)
This is a really well-written response. I'm pretty impressed by it.
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-05-02T18:55:42.340Z · LW(p) · GW(p)
"If nobody publishes anything, how will alignment get solved?" — sure, it's harder for alignment researchers to succeed if they don't communicate publicly with one another — but it's not impossible. That's what dignity is about. A
Huh, I have the opposite intuition. I was about to cite that exact same "Death with dignity" post as an argument for why you are wrong; it's undignified for us to stop trying to solve the alignment problem and publicly discussing the problem with each other, out of fear that some of our ideas might accidentally percolate into OpenAI and cause them to go slightly faster, and that this increased speedup might have made the difference between victory and defeat. The dignified thing to do is think and talk about the problem.
↑ comment by the gears to ascension (lahwran) · 2024-05-02T21:05:28.771Z · LW(p) · GW(p)
Obviously keep working, but stop talking where people who are trying to destroy the world can hear. If you're Neel Nanda and work for a company trying to destroy the world, consider not publishing anything else at all, and only publishing useless versions of your work, because your work being useful for resilient moral alignment depends on a long chain of things that publishing it makes near impossible.
Replies from: faul_sname
↑ comment by faul_sname · 2024-05-02T22:26:35.585Z · LW(p) · GW(p)
I think there are approximately zero people actively trying to take actions which, according to their own world model, are likely to lead to the destruction of the world. As such, I think it's probably helpful on the margin to publish stuff of the form "model internals are surprisingly interpretable, and if you want to know if your language model is plotting to overthrow humanity there will probably be tells, here's where you might want to look". More generally "you can and should get better at figuring out what's going on inside models, rather than treating them as black boxes" is probably a good norm to have.
I could see the argument against, for example if you think "LLMs are a dead end on the path to AGI, so the only impact of improvements to their robustness is increasing their usefulness at helping to design the recursively self-improving GOFAI that will ultimately end up taking over the world" or "there exists some group of alignment researchers that is on track to solve both capabilities and alignment such that they can take over the world and prevent anyone else from ending it" or even "people who think about alignment are likely to have unusually strong insights about capabilities, relative to people who think mostly about capabilities".
I'm not aware of any arguments that alignment researchers specifically should refrain from publishing that don't have some pretty specific upstream assumptions like the above though.
↑ comment by Akram Choudhary (akram-choudhary) · 2024-05-04T10:41:29.878Z · LW(p) · GW(p)
Daniel, your interpretation is literally contradicted by Eliezer's exact words. Eliezer defines dignity as that which increases our chance of survival.
""Wait, dignity points?" you ask. "What are those? In what units are they measured, exactly?"
And to this I reply: Obviously, the measuring units of dignity are over humanity's log odds of survival - the graph on which the logistic success curve is a straight line. A project that doubles humanity's chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity."
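For concreteness, the bookkeeping Eliezer is gesturing at can be written out as a standard log-odds identity (my restatement, not part of the quote):

```latex
\text{dignity gained (bits)} \;=\; \log_2 \frac{p_{\text{after}}/(1-p_{\text{after}})}{p_{\text{before}}/(1-p_{\text{before}})}
```

Doubling tiny survival odds (say, from 0.0001% to 0.0002%) adds one bit, even though both probabilities still round to 0%.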
Replies from: daniel-kokotajlo
↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-05-04T20:50:52.738Z · LW(p) · GW(p)
I don't think our chances of survival will increase if LessWrong becomes substantially more risk-averse about publishing research and musings about AI. I think they will decrease.
comment by Seth Herd · 2024-05-02T18:49:38.339Z · LW(p) · GW(p)
At the core, this is a reminder to not publish things that will help more with capabilities than alignment. That's perfectly reasonable.
The tone of the post suggests erring on the side of "safety" by not publishing things that have an uncertain safety/capabilities balance. I hope that wasn't the intent.
Because that does not make sense. Anything that advances alignment more than capabilities in expectation should be published.
You have to make a difficult judgment call for each publication. Be mindful of your bias in wanting to publish to show off your work and ideas. Get others' insights if you can do so reasonably quickly.
But at the end of the day, you have to make that judgment call. There's no consolation prize for saying "at least I didn't make the world end faster". If you're a utilitarian, winning the future is the only goal.
(If you're not a utilitarian, you might actually want a resolution faster so you and your loved ones have higher odds of surviving into the far future.)
Replies from: carado-1
↑ comment by Tamsin Leake (carado-1) · 2024-05-02T20:30:44.183Z · LW(p) · GW(p)
I am a utilitarian and agree with your comment.
The intent of the post was
- to make people weigh whether to publish or not, because I think some people are not weighing this enough
- to give some arguments in favor of "you might be systematically overestimating the utility of publishing", because I think some people are doing that
I agree people should take the utilitarianly optimal action; I just think they're doing the utilitarian calculus wrong, or not doing the calculus at all.
comment by Erik Jenner (ejenner) · 2024-05-02T17:20:51.519Z · LW(p) · GW(p)
I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.
Some possible cruxes, or reasons the post basically didn't move my view on that:
- Most importantly, I think the research published by people working on x-risk tends to overall help safety/alignment more than capabilities.
- I suspect the main disagreement might be what kind of research is needed to make AI go well, and whether the research currently happening helps.
- Probably less importantly, I disagree a bit about how helpful that research likely is for advancing capabilities. In particular, I don't buy the argument that safety researchers have unusually good ideas/research compared to capability researchers at top labs (part of this is that my impression is capabilities aren't mainly bottlenecked by ideas, though of course sufficiently good ideas would help).
- It's getting harder to draw the boundary since people use "safety" or "alignment" for a lot of things now. So, to be clear, I'm talking about research published by people who think there are catastrophic risks from AI and care a lot about preventing those, it seems like that's your target audience.
- Secondarily, longer timelines are only helpful if useful things are happening, and I think if everyone working on x-risk stopped publishing their research, way fewer useful things would happen on the research side. Maybe the plan is to mostly use the additional time for policy interventions? I think that's also complicated though (so far, visibly advancing capabilities have been one of the main things making policy progress feasible). Overall, I think more time would help, but it's not clear how much and I'm not even totally sure about the sign (taking into account worries from hardware overhang).
- I think there are more structural downsides to not publishing anything. E.g. that makes it much harder to get academia on board (and getting academia on board has been pretty important for policy as far as I can tell, and I think getting them even more on board would be pretty good). Not sure this is an actual crux though, if I thought the research that's happening wasn't helpful enough, this point would also be weaker.
I think most of these are pretty long-standing disagreements, and I don't think the post really tries to argue its side of them, so my guess is it's not going to convince the main people it would need to convince (who are currently publishing prosaic safety/alignment research). That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues. And I agree that some of the "But ..." objections you list are pretty weak.
Replies from: carado-1, lahwran
↑ comment by Tamsin Leake (carado-1) · 2024-05-02T17:33:18.369Z · LW(p) · GW(p)
I don't buy the argument that safety researchers have unusually good ideas/research compared to capability researchers at top labs
I don't think this particularly needs to be true for my point to hold; they only need to have reasonably good ideas/research, not unusually good, for publishing less to be a positive thing.
That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues.
That's a lot of what I intend to do with this post, yes. I think a lot of people do not think about the impact of publishing very much and just blurt-out/publish things as a default action, and I would like them to think about their actions more.
Replies from: ryan_greenblatt
↑ comment by ryan_greenblatt · 2024-05-02T18:36:06.771Z · LW(p) · GW(p)
I don't think this particularly needs to be true for my point to hold; they only need to have reasonably good ideas/research, not unusually good, for publishing less to be a positive thing.
There currently seem to be >10x as many people directly trying to build AGI/improve capabilities as trying to improve safety.
Suppose that the safety people have as good ideas and research ability as the capabilities people. (As a simplifying assumption.)
Then, if all the safety people switched to working full time on maximally advancing capabilities, this would only advance capabilities by less than 10%.
If, on the other hand, they stopped publicly publishing safety work and this resulted in a 50% slowdown, all safety work would slow down by 50%.
Naively, it seems very hard for publishing less to make sense if the number of safety researchers is much smaller than the number of capabilities researchers and safety researchers aren't much better at capabilities than capabilities researchers.
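As a toy illustration of that arithmetic (a hedged sketch: the >10x ratio and the hypothetical 50% slowdown come from the argument above, while the specific head-counts are made up purely for illustration):

```python
# Illustrative numbers only.
capabilities_researchers = 1000
safety_researchers = 100  # more than 10x fewer

# Worst case: every safety researcher switches to full-time capabilities work.
capabilities_speedup = safety_researchers / capabilities_researchers  # 0.10, i.e. <10% faster

# Alternative considered above: stop publishing publicly, losing ~50% of safety productivity.
safety_slowdown = 0.50

print(f"worst-case capabilities speedup from defection: {capabilities_speedup:.0%}")
print(f"safety slowdown from not publishing: {safety_slowdown:.0%}")
# Under the simplifying equal-ability assumption, the 50% hit to safety work
# is much larger than the <10% worst-case boost to capabilities.
```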
Replies from: pi-rogers, mesaoptimizer, mesaoptimizer
↑ comment by Morphism (pi-rogers) · 2024-05-02T20:37:22.586Z · LW(p) · GW(p)
But safety research can actually disproportionately help capabilities, e.g. the development of RLHF allowed OAI to turn their weird text predictors into a very generally useful product.
Replies from: ryan_greenblatt
↑ comment by ryan_greenblatt · 2024-05-02T20:56:07.389Z · LW(p) · GW(p)
I'm skeptical of the RLHF example (see also this post by Paul on the topic [LW · GW]).
That said, I agree that if indeed safety researchers produce (highly counterfactual) research advances that are much more effective at increasing the profitability and capability of AIs than the research advances done by people directly optimizing for profitability and capability, then safety researchers could substantially speed up timelines. (In other words, if safety targeted research is better at profit and capabilities than research which is directly targeted at these aims.)
I dispute this being true.
(I do think it's plausible that safety interested people have historically substantially advanced timelines (and might continue to do so to some extent now), but not via doing research targeted at improving safety, by just directly doing capabilities research for various reasons.)
↑ comment by mesaoptimizer · 2024-05-03T10:15:51.350Z · LW(p) · GW(p)
Ryan, this is kind of a side-note but I notice that you have a very Paul-like approach to arguments and replies on LW.
Two things that come to notice:
- You have a tendency to reply to certain posts or comments with "I don't quite understand what is being said here, and I disagree with it," or "It doesn't track with my views," or equivalent replies that seem not very useful for understanding your object-level arguments. (Although I notice that in the recent comments I see, you usually postfix it with some elaboration on your model.)
- In the comment I'm replying to, you use a strategy of black-box-like abstraction modeling of a situation to try to argue for a conclusion, one that usually involves numbers such as multipliers or percentages. (I have the impression that Paul uses this a lot, and one concrete example that comes to mind is the takeoff speeds essay.) I usually consider such arguments invalid when they seem to throw away information we already have, or to use a set of abstractions that don't particularly feel appropriate to the information I believe we have.
I just found this interesting and plausible enough to highlight to you. It's a moderate investment of my time to dig up examples from your comment history to illustrate all these instances, but writing this comment still seemed valuable.
↑ comment by mesaoptimizer · 2024-05-03T10:17:20.872Z · LW(p) · GW(p)
Note that I agree with your sentiment here, although my concrete argument is basically what LawrenceC wrote as a reply to this post [LW(p) · GW(p)].
↑ comment by the gears to ascension (lahwran) · 2024-05-02T21:23:18.976Z · LW(p) · GW(p)
Most importantly, I think the research published by people working on x-risk tends to overall help safety/alignment more than capabilities
It may be producing green nodes faster, but it seems on track to produce a red node before a yellow node. [LW · GW]
comment by Chi Nguyen · 2024-05-02T18:13:19.988Z · LW(p) · GW(p)
Are there types of published alignment research that you think were (more likely to be) good to publish? If so, I'd be curious to see a list.
Replies from: pi-rogers, carado-1
↑ comment by Morphism (pi-rogers) · 2024-05-02T20:29:24.351Z · LW(p) · GW(p)
Some off the top of my head:
- Outer Alignment Research (e.g. analytic moral philosophy in an attempt to extrapolate CEV) seems to be totally useless to capabilities, so we should almost definitely publish that.
- Evals for Governance? Not sure about this since a lot of eval research helps capabilities, but if it leads to regulation that lengthens timelines, it could be net positive.
Edit: oops i didn't see tammy's comment
↑ comment by Tamsin Leake (carado-1) · 2024-05-02T20:24:59.876Z · LW(p) · GW(p)
I think research that is mostly about outer alignment (what to point the AI to) rather than inner alignment (how to point the AI to it) tends to be good — quantilizers, corrigibility, QACI, decision theory, embedded agency, indirect normativity, infra bayesianism, things like that. Though I could see some of those backfiring the way RLHF did — in the hands of a very irresponsible org, even not very capabilities-related research can be used to accelerate timelines and increase race dynamics if the org doing it thinks it can get a quick buck out of it.
Replies from: MakoYass, pi-rogers
↑ comment by mako yass (MakoYass) · 2024-05-02T21:38:24.827Z · LW(p) · GW(p)
You think that studying agency and infrabayesianism won't make small contributions to capabilities? Even just saying "agency" in the context of AI makes capabilities progress.
↑ comment by Morphism (pi-rogers) · 2024-05-02T20:33:09.372Z · LW(p) · GW(p)
I could see embedded agency being harmful though, since an actual implementation of it would be really useful for inner alignment.
comment by the gears to ascension (lahwran) · 2024-05-02T20:58:41.581Z · LW(p) · GW(p)
I would give examples of things that shouldn't have been published and are why I agree, but that would be missing the point, wouldn't it?
Let's put it this way: I think most "alignment" or "safety" research is in fact nothing of the kind, and most people responding are deluding themselves so as to avoid having to consider the possibility of needing to go back to the drawing board.
As usual, capability (ability to figure out things about AI) generalizes further than alignment (ability to aim your ability to understand AI at actually making your knowledge produce utilitarian(-prioritarian)-morally-good outcomes).
comment by mako yass (MakoYass) · 2024-05-02T22:34:23.371Z · LW(p) · GW(p)
On infrastructures for private sharing:
Feature recommendation: Marked Posts (name intentionally bland; any variant of "private" (i.e., secret, sensitive, classified) would attract attention and partially negate the point) (alternate names: permissioned posts, declared posts)
This feature prevents leaks, without sacrificing openness.
A marked post will only be seen by members in good standing. They'll be able to see the title and abstract in their feed, but before they're able to read it, they have to click "I declare that I'm going to read this", and then they'll leave a read receipt (or a "mark") visible to the post creator, admins, other members in good standing. (these would also just serve a useful social function of giving us more mutual knowledge of who knows what, while making it easier to coordinate to make sure every post gets read by people who'd understand it and be able to pass it along to interested parties.)
If a member "reads" an abnormally high number of these posts, the system detects that, and they may have their ability to read more posts frozen. Admins, and members who've read many of the same posts, are notified, and you can investigate. If other members find that this person actually is reading this many posts, that they seem to truly understand the content, they can be given an expanded reading rate. Members in good standing should be happy to help with this, if that person is a leaker, well that's serious, if they're not a leaker, what you're doing in the interrogation setting is essentially you're just getting to know a new entrant to the community who reads and understands a lot, talking about the theory with them, and that is a happy thing to do.
Members in good standing must be endorsed by another member in good standing before they will be able to see Marked posts. The endorsements are also tracked. If someone issues too many endorsements too quickly (or the people downstream of their endorsements are collectively doing so in a short time window), this sends an alert. The exact detection algorithm here is something I have funding to develop so if you want to do this, tell me and I can expedite that project.
Note: Someone who reads and understands a lot of posts within a community will, if those posts are correct, end up being aligned with the community. Verify that they're actually reading the posts and not just the capabilities ones → verify they're unlikely to leak.
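A minimal sketch of what the read-receipt and rate-flagging bookkeeping could look like (all class names, thresholds, and fields here are hypothetical illustrations, not a worked-out spec):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical parameters: a per-member reading budget over a rolling window,
# expandable once other members vouch for the reader.
READ_RATE_WINDOW = timedelta(days=7)
DEFAULT_READ_LIMIT = 20

class MarkedPostTracker:
    def __init__(self):
        self.read_receipts = defaultdict(list)                 # member -> [(post_id, timestamp), ...]
        self.read_limits = defaultdict(lambda: DEFAULT_READ_LIMIT)
        self.endorsements = {}                                 # member -> endorsing member

    def endorse(self, member, endorsed_by):
        # A member in good standing vouches for a new member; the chain is kept
        # so bursts of endorsements can be flagged for review.
        self.endorsements[member] = endorsed_by

    def declare_read(self, member, post_id, now=None):
        """Record a read receipt, or refuse (freeze) if the member exceeds their rate."""
        now = now or datetime.utcnow()
        recent = [t for _, t in self.read_receipts[member] if now - t < READ_RATE_WINDOW]
        if len(recent) >= self.read_limits[member]:
            return False  # frozen; admins and co-readers would be notified out of band
        self.read_receipts[member].append((post_id, now))
        return True

    def expand_limit(self, member, new_limit):
        # Called after co-readers verify the person genuinely reads and understands the posts.
        self.read_limits[member] = new_limit
```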
Replies from: nathan-helm-burger
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-05-09T04:32:02.248Z · LW(p) · GW(p)
Hmm. Seems... fragile. I don't think that's a reason not to do it, but I also wouldn't put much hope in the idea that leaks would be successfully prevented by this system.
Replies from: abandon
↑ comment by dirk (abandon) · 2024-05-23T04:01:07.472Z · LW(p) · GW(p)
Re the "Why?/Citation?" react: I don't know if this is what Nathan was thinking of, but trivially a would-be leaker could simply screenshot posts as they read and pass the screenshots on without this being reflected in the system.
Replies from: MakoYass
↑ comment by mako yass (MakoYass) · 2024-12-01T22:56:17.039Z · LW(p) · GW(p)
You can't see the post body without declaring intent to read.
Replies from: abandon
↑ comment by dirk (abandon) · 2024-12-01T23:30:46.679Z · LW(p) · GW(p)
But someone who declared intent to read could simply take a picture and send it to any number of people who hadn't declared intent.
Replies from: MakoYass
↑ comment by mako yass (MakoYass) · 2024-12-02T04:16:47.737Z · LW(p) · GW(p)
Indicating them as a suspect when the leak is discovered.
Generally the set of people who actually read posts worthy of being marked is, in a sense, small; people know each other. If you had a process for distributing the work, it would be possible to figure out who's probably doing it.
It would take a lot of energy, but it's energy that probably should be cultivated anyway, the work of knowing each other and staying aligned.
Replies from: abandon
↑ comment by dirk (abandon) · 2024-12-02T07:30:09.909Z · LW(p) · GW(p)
Of course this would shrink the suspect pool, but catching the leaker more easily after the fact is very different from the system making it difficult to leak things. Under the proposed system, it would be very easy to leak things.
comment by O O (o-o) · 2024-05-03T21:30:34.236Z · LW(p) · GW(p)
This style of thinking seems illogical to me. It has already clearly resulted in a sort of evaporative cooling in OpenAI. At a high level, is it possible you have the opposite of the wishful-thinking bias you claim OpenAI researchers have? I won't go into too much detail about why this post doesn't make sense to me, as others already have.
But broadly speaking:
- I doubt rationality gives you too much of an advantage in capabilities research, and believing that it does, on a site full of rationalists, seems almost a little pretentious.
- I also have no idea how any alignment research so far has helped capabilities in any way. I don't even know how RLHF has helped capabilities. If anything, it's well documented that RLHF diminishes capabilities (base models can for example play chess very well). The vast majority of alignment research, especially research before LLMs, isn't even useful to alignment (a lot of it seems far too ungrounded).
- There was never a real shot at solving alignment until LLMs came along, either. The world has changed and it seems like foom priors are wrong, but most here haven't updated. It increasingly seems like we'll get strong precursor models, so we will have ample time to engineer solutions, and it won't be like trying to build a working rocket on the first try. (The reasons being that we are rapidly approaching the limits of energy constraints and transistor density without really being close to fooming.) This mental model is still popular even as reality seems to diverge from it.
Well, I actually have a hunch as to why: many holding on to the above priors don't want to let them go, because letting go would make this problem they have dedicated a lot of mental space to seem more feasible to solve.
If it's instead a boring engineering problem, this stops being a quest to save the world or an all-consuming issue. Incremental alignment work might solve it, so in order to preserve the difficulty of the issue, it must be that it will cause extinction for some far-fetched reason. Building precursor models and then bootstrapping alignment might solve it, so this "foom" is invented and held on to (on a lot of highly speculative assumptions), because that would keep it from being a boring engineering problem that requires lots of effort, and instead make it something a lone genius will have to solve. The question of whether energy constraints will limit AI progress from here on out was met with a "maybe" response [LW(p) · GW(p)], but the number of upvotes makes me think most readers just filed it as an unconditional "no, it won't" in their minds.
There is a good reason to think like this: if boring engineering really does solve the issue, then this community is better off assuming it won't. In that scenario, the boring engineering work is being done by the tech industry anyway, so there's no need to help there. But I hope that people who adopt the mindset of assuming the worst-case scenario, in order to maximize the expected impact of their research, realize that the assumption they are making is an assumption, and don't let its mental effects consume them.
↑ comment by habryka (habryka4) · 2024-05-04T21:35:12.037Z · LW(p) · GW(p)
This style of thinking seems illogical to me. It has already clearly resulted in a sort of evaporative cooling in OpenAI.
I don't think what's happening at OpenAI is "evaporative cooling as a result of people being too risk-averse to do alignment work that's adjacent to capabilities". I would describe it more as "purging anyone who tries to provide oversight". I don't think the safety-conscious people who are leaving OpenAI are doing so because of concerns like the OP's; they are doing it because they are being marginalized and the organization is acting in a somewhat obviously reckless way.
↑ comment by Ben Pace (Benito) · 2024-05-03T21:38:44.523Z · LW(p) · GW(p)
Preface: I think this comment will be pretty unpopular here.
I think this is a very unhelpful frame for any discussion (especially so the more high-stakes it is) for the reasons that SlateStarCodex outlines in Against Bravery Debates, and I think your comment would be better with this removed.
Added: I appreciate the edit :)
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-05-09T04:26:33.395Z · LW(p) · GW(p)
I think you make some valid points. In particular, I agree that some people seem to have fallen into a trap of being unrealistically pessimistic about AI outcomes which mirrors the errors of those AI developers and cheerleaders who are being unrealistically optimistic.
On the other hand, I disagree with this critique (although I can see where you're coming from):
If it's instead a boring engineering problem, this stops being a quest to save the world or an all-consuming issue. Incremental alignment work might solve it, so in order to preserve the difficulty of the issue, it must cause extinction for some far-fetched reason. Building precursor models and then bootstrapping alignment might solve it, so this "foom" is invented and held on to (on a lot of highly speculative assumptions), because that would stop it from being a boring engineering problem that requires lots of effort and turn it into something a lone genius will have to solve.
I think that FOOM is a real risk, and I have a lot of evidence grounding my calculations about available algorithmic efficiency improvements, based on estimates of the compute of the human brain. The conclusion I draw from believing that FOOM is both possible, and indeed likely, once AI models reach a certain threshold of AI R&D capability, is that preventing/controlling FOOM is an engineering problem.
I don't think we should expect a model in training to become super-human so fast that it blows past our ability to evaluate it. I do think that in order to have the best chance of catching and controlling a rapidly accelerating take-off, we need to do pre-emptive engineering work. We need very comprehensive evals to have detailed measures of key factors like general capability, reasoning, deception, self-preservation, and agency. We need carefully designed high-security training facilities with air-gapped datacenters. We need regulation that prevents irresponsible actors from undertaking unsafe experiments. Indeed, most of the critical work of preventing uncontrolled rogue AGI due to FOOM is well described by 'boring engineering problems' or 'boring regulation and enforcement problems'.
Believing in the dangers of recursive self-improvement doesn't necessarily involve believing that the best solution is a genius theoretical answer to value and intent alignment. I wouldn't rule the chance of that out, but I certainly don't expect that slim possibility. It seems foolish to trust in that as the primary hope for humanity. Instead, let's focus on doing the necessary engineering and political work so that we can proceed with reasonable safety measures in place!
comment by Morphism (pi-rogers) · 2024-05-02T20:18:44.187Z · LW(p) · GW(p)
Idea:
Have everyone who wants to share and receive potentially exfohazardous ideas/research send out a 4096-bit RSA public key.
Then, make a clone of the alignment forum, where every time you make a post, you provide a list of the public keys of the people who you want to see the post. Then, on the client side, it encrypts the post using all of those public keys. The server only ever holds encrypted posts.
Then, users can put in their own private key to see a post. The encrypted post gets downloaded to the user's machine and is decrypted on the client side. Perhaps require users to be on open-source browsers for extra security.
Maybe also add some post-quantum thing like what Signal uses so that we don't all die when quantum computers get good enough.
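To make the client-side step concrete, here is a minimal sketch of what I have in mind, assuming the Python cryptography package. (One adjustment: RSA can't encrypt a long post directly, so the usual pattern is to encrypt the post once with a fresh symmetric key and wrap that key for each recipient's public key. The function names here are made up for illustration, not a spec.)

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA-OAEP padding used both for wrapping and unwrapping the per-post content key.
OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def encrypt_post(post: bytes, recipient_public_keys):
    """Encrypt a post so that only the listed recipients can read it."""
    content_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(content_key).encrypt(nonce, post, None)
    # One wrapped copy of the content key per recipient.
    wrapped_keys = [pk.encrypt(content_key, OAEP) for pk in recipient_public_keys]
    return {"nonce": nonce, "ciphertext": ciphertext, "wrapped_keys": wrapped_keys}

def decrypt_post(blob, my_private_key):
    """Try each wrapped key until one unwraps with our private key."""
    for wrapped in blob["wrapped_keys"]:
        try:
            content_key = my_private_key.decrypt(wrapped, OAEP)
        except ValueError:
            continue  # this copy was wrapped for someone else
        return AESGCM(content_key).decrypt(blob["nonce"], blob["ciphertext"], None)
    raise PermissionError("This post was not encrypted to your key.")

# Each user generates a 4096-bit keypair client-side and publishes the public half.
alice = rsa.generate_private_key(public_exponent=65537, key_size=4096)
bob = rsa.generate_private_key(public_exponent=65537, key_size=4096)
blob = encrypt_post(b"draft exfohazardous idea", [alice.public_key(), bob.public_key()])
assert decrypt_post(blob, alice) == b"draft exfohazardous idea"
```

In this sketch the server would only ever store the blob returned by encrypt_post, so a compromise of the server leaks ciphertext but not posts.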
Should I build this?
Is there someone else here more experienced with csec who should build this instead?
Replies from: MakoYass, MakoYass↑ comment by mako yass (MakoYass) · 2024-05-02T22:48:00.179Z · LW(p) · GW(p)
I don't think e2e encryption is warranted here for the first iteration. Generally, keypair management is too hard today; everyone I know who used encrypted Element chat has lost their keys lmao. (I endorse Element chat, but I don't endorse making every channel you use encrypted; you will lose your logs!) And keypairs alone are a terrible way of doing secure identity. Keys can be lost or stolen, and though that doesn't happen every day, the probability is always too high to build anything serious on top of them. I'm waiting for a secure identity system with key rotation and some form of account recovery process (which can be an institutional service or a "social recovery" thing) before building anything important on top of e2e encryption.
Replies from: pi-rogers↑ comment by Morphism (pi-rogers) · 2024-05-02T22:55:37.478Z · LW(p) · GW(p)
I mean, Signal messenger has worked pretty well in my experience.
Replies from: MakoYass↑ comment by mako yass (MakoYass) · 2024-05-02T23:14:23.131Z · LW(p) · GW(p)
Possibly incidental, but you wouldn't even notice whether people were successfully maintaining continuous secure access to their Signal accounts, because Signal doesn't even make an attempt to transfer encrypted data to new sessions.
↑ comment by mako yass (MakoYass) · 2024-05-02T22:47:12.937Z · LW(p) · GW(p)
Then, users can put in their own private key to see a post
This was probably a typo but just in case: you should never send a private key off your device. The public key is the part that you send.
Replies from: pi-rogers↑ comment by Morphism (pi-rogers) · 2024-05-02T22:57:13.200Z · LW(p) · GW(p)
Oh no, I mean the private key is stored on the client side and the post is decrypted there.
Ideally all of this is behind a nice UI, like Signal.
comment by Johannes C. Mayer (johannes-c-mayer) · 2024-05-02T15:12:28.074Z · LW(p) · GW(p)
I think this post would be much more effective in achieving its goal if it would provide alternatives.
What are the advantages of posting your research ideas on LessWrong? Are there other ways in which you can get these advantages? Are there maybe even alternatives that give you more of the thing you want?
I expect telling people about these alternatives (if they exist) would make them more likely to make use of them.
One of the main things I think people can get by publishing their research is feedback. But you could also search for people who are interested in what you are working on, and then send your write-ups only to those people.
Also seeing people engage with things that you write is very motivating.
These are just some rough examples as I don't think I have very good models about what you can get out of LessWrong and how to get the same benefits in different ways.
Replies from: carado-1↑ comment by Tamsin Leake (carado-1) · 2024-05-02T16:08:44.709Z · LW(p) · GW(p)
One straightforward alternative is to just not do that; I agree it's not very satisfying but it should still be the action that's pursued if it's the one that has more utility.
I wish I had better alternatives, but I don't. But the null action is an alternative.
comment by RedMan · 2024-05-05T15:27:25.755Z · LW(p) · GW(p)
In computer security, there is an ongoing debate about vulnerability disclosure, which at present seems to have settled on 'if you aren't running a bug bounty program for your software you're irresponsible, project zero gets it right, metasploit is a net good, and it's ok to make exploits for hackers ideologically aligned with you'.
The framing of the question for decades was essentially: "Do you tell the person or company with the vulnerable software, who may ignore you or sue you because they don't want to spend money? Do you tell the public, where someone might adapt your report into an attack?"
Of course, there is the (generally believed to be) unethical option chosen by many: "sell it to someone who will use it, and who will protect your identity as the author from people who might retaliate."
There was an alternative called 'antisec' (https://en.m.wikipedia.org/wiki/Antisec_Movement), which basically argued 'don't tell people about exploits; they're expensive to make, very few people develop the talent to smash the stack for fun and profit, and once they're out, they're easy to use to cause mayhem'.
The movement did not go anywhere, and the antisec viewpoint is not present in any mainstream discussion of vulnerability ethics.
Alternatively, nations have broadly worked together to not publicly disclose technical data that would make building nuclear bombs simple. It is an exercise for the reader to determine whether it has worked.
So, the ideas here have been tried in different fields, with mixed results.
Replies from: lahwran↑ comment by the gears to ascension (lahwran) · 2024-05-05T21:59:21.408Z · LW(p) · GW(p)
[edit: pinned to profile]
Useful comparison; but I'd say AI is currently better compared to biology than to computer security [LW · GW]. Making the reality of the situation more comparable to computer security would be great. There's some sort of continuum you could draw between them in terms of how possible it is to defend against risks. In general, the thing I want to advocate is being the appropriate amount of cautious for a given level of risk, and I believe that AI is currently in a situation best compared to gain-of-function research on viruses. Don't publish research that aids gain-of-function researchers unless you have the ability to defend against what they're going to come up with based on it. And right now, we're not remotely close to being able to defend current minds, human and AI, against the long tail of dangerous outcomes of gain-of-function AI research. If that were to change, it would look like the nodes getting yellower and yellower as we go, and, as a result, a fading need to worry that people are making red nodes easier to reach [LW · GW]. Once you can mostly reliably defend, and the community can come up with a reliable defense fast, it becomes a lot more reasonable to publish things that produce gain-of-function.
My issue is: right now, all the ideas for how to make defenses better help gain-of-function a lot, and people regularly write papers with justifications for their research that sound to me like the intro of a gain-of-function biology paper. "There's a bad thing, and we need to defend against it. To research this, we made it worse, in the hope that this would teach us how it works..."
comment by mako yass (MakoYass) · 2024-05-02T21:35:33.900Z · LW(p) · GW(p)
"So where do I privately share such research?" — good question! There is currently no infrastructure for this.
This is why I currently think you're completely wrong about this. There never will be an infrastructure for this. Privacy of communities isn't a solvable problem in general: as soon as your community is large enough to compete with the adversary, it's large enough and conspicuous enough that the adversary will pay attention to it, send in spies, and extract leaks. If you make it compartmented enough to prevent leaks and weed out the spies, it won't have enough intellectual liveliness to solve the alignment problem.
There is nothing that makes differentially helping capabilities "fine if you're only differentially helping them a little bit".
If your acceptable lower limit for basically anything is zero, you won't be allowed to do anything, really anything. You have to name some quantity of capabilities progress that's okay to do before you'll be allowed to talk about AI in a group setting.
Replies from: mesaoptimizer, sharmake-farah, MakoYass↑ comment by mesaoptimizer · 2024-05-03T09:52:10.835Z · LW(p) · GW(p)
If your acceptable lower limit for basically anything is zero you wont be allowed to do anything, really anything. You have to name some quantity of capabilities progress that’s okay to do before you’ll be allowed to talk about AI in a group setting.
↑ comment by Noosphere89 (sharmake-farah) · 2024-05-02T22:12:56.775Z · LW(p) · GW(p)
Privacy of communities isn't a solvable problem in general, as soon as your community is large enough to compete with the adversary, it's large enough and conspicuous enough that the adversary will pay attention to it and send in spies and extract leaks.
I disagree with this in theory as a long-term concern, but yes, in practice the methods for preserving the privacy of communities haven't been implemented or tested at all, and I agree with the general sentiment that protecting secrets isn't worth privacy's steep drawbacks, which does unfortunately make me dislike the post because of the strength of its recommendations.
So while I could in theory disagree with you, in practice, right now, I mostly have to agree with the comment that there will not be such an infrastructure for private alignment ideas.
Also to touch on something here that isn't too relevant and could be considered a tangent:
If your acceptable lower limit for basically anything is zero you wont be allowed to do anything, really anything.
This is why perfectionism is such a bad thing, and why you need to be able to accept that failure happens. You cannot have 0 failures IRL.
↑ comment by mako yass (MakoYass) · 2024-05-02T22:02:03.731Z · LW(p) · GW(p)
There never will be an infrastructure for this.
I should be less resolute about this. It would kind of be my job to look for a design that could do it.
One thing we've never seen is a system where read receipts are tracked and analyzed at the global level, and read permissions are suspended (and alerts sent to admins) if an account is doing too many unjustified reads.
This would prevent a small number of spies from extracting a large number of documents.
I suppose we could implement that today.
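A very rough sketch of the bookkeeping this would take, tying it to the declared-intent idea from the other thread; the thresholds and names are made up, and a real version would need persistence plus an admin review flow:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 7 * 24 * 3600   # only look at the last week of activity
MAX_UNJUSTIFIED_READS = 5        # reads without declared intent tolerated per window

class ReadMonitor:
    """Track reads; suspend accounts that read too much without declared intent."""

    def __init__(self, alert_admins):
        self.reads = defaultdict(list)           # account -> [(timestamp, post_id, justified)]
        self.declared_intent = defaultdict(set)  # account -> {post_id}
        self.suspended = set()
        self.alert_admins = alert_admins         # callback taking a message string

    def declare_intent(self, account, post_id):
        self.declared_intent[account].add(post_id)

    def record_read(self, account, post_id):
        """Return True if the read is allowed, False if the account is suspended."""
        if account in self.suspended:
            return False
        now = time.time()
        justified = post_id in self.declared_intent[account]
        # Keep only reads inside the sliding window, then add this one.
        history = [r for r in self.reads[account] if now - r[0] < WINDOW_SECONDS]
        history.append((now, post_id, justified))
        self.reads[account] = history
        unjustified = sum(1 for _, _, ok in history if not ok)
        if unjustified > MAX_UNJUSTIFIED_READS:
            self.suspended.add(account)
            self.alert_admins(f"{account}: {unjustified} unjustified reads this week; access suspended")
            return False
        return True

# Usage: monitor = ReadMonitor(alert_admins=print)
```

The detection logic itself is simple; the hard part is the social process of deciding what counts as a justified read.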
↑ comment by mako yass (MakoYass) · 2024-05-02T22:35:10.890Z · LW(p) · GW(p)
So I wrote a feature recommendation: https://www.lesswrong.com/posts/55rc6LJcqRmyaEr9T/please-stop-publishing-ideas-insights-research-about-ai?commentId=6fxN9KPeQgxZY235M [LW(p) · GW(p)]
comment by Gunnar_Zarncke · 2024-05-02T21:34:10.543Z · LW(p) · GW(p)
"So where do I privately share such research?" — good question! There is currently no infrastructure for this.
I'd really like to have such a place, or even a standard policy how to do this.
I feel like the aintelope project I'm working on has to secure its stuff from scratch. Yes, it's early, but it is difficult to engineer security in later; you have to start with something. I'd really like to have a standard for AI Safety projects to follow or join.
comment by Petr Andreev (petr-andreev) · 2024-05-05T21:29:30.287Z · LW(p) · GW(p)
Good day!
I fully share the views expressed in your article. Indeed, the ideal solution would be to delete many of the existing materials and to reformat the remaining ones into a format understandable to every novice programmer, transhumanist, or even an average person.
As a poker player and a lawyer assisting consumers who have suffered from the consequences of artificial intelligence, as well as someone interested in cryptocurrencies and existential risks, I first invested in Eliezer Yudkowsky's ideas many years ago. At that time, I saw how generative-predictive models easily outplayed poker players, and I wondered whether it was possible to counteract this. Since then, I have not seen a single serious security study of this conducted by anyone other than the players themselves.
And in the realm of cryptocurrencies, money continues to be stolen with the help of AI, with no recourse or refund in sight.
My prediction is that we have already lost the battle against AGI, but in the next 12 years we have a chance to make the situation a bit better: to create conditions of the game where this player, or its precursors (the AI's users), will have more aligned (lawful good) elements.
It seems that the very intelligent are also very stubborn and see no room for doubt in their positions; such high IQs are very dangerous. They think they are right about everything and have understood it all, but we are just a few perspectives in a vast, incomprehensible world where we understand nothing. We are all wrong.
Yes, you're probably a couple of sigmas smarter than the median person, but it is exactly such a person, the median, or even someone a couple of IQ sigmas lower, whom you need to convince not to launch anything. It's not just OpenAI developing AGI;
others are too, doing research and making decisions, and they might not even know who Eliezer Yudkowsky is or what the lesswrong website is. They might visit some copy of the site, see that it's clear we shouldn't let AGI emerge, think about graphics cards, and, since there are many graphics cards in decentralized mining, decide to take control of them.
If we're lucky, their underlings will just steal the cards and use them for mining, and everything will be fine then.
But various research, like changing the sign of a function and creating something dangerous, is better removed.
Another strange thing is the super-ethical laws for Europe and the US. There are a lot of jurisdictions. Even the Convention on Cybercrime is not universal, and among cybercrimes under universal jurisdiction there are no crimes concerning existential risks. So many international media laws are just declarations, without real procedures and without any real power.
Many laws aren't adhered to in practice. There are different kinds of people; for some, the criminal code is like a menu, and if you don't have to pay for that menu, it's doubly bad.
There are individualists, and among transhumanists I'm sure there are many who would choose their own life and the lives of a million people close to them over the rest of humanity. That's not good; it's unfair. The system should be for all billions of people.
But there are also those in the world who, if presented with a "shut down the server" button, will eventually press it. There are many such buttons in various fields worldwide. On a hundred-year horizon, unless something radically changes, the likelihood of a "server shutdown" approaches 1.
So it's interesting whether, through open-source AGI or any other framework or model, we could create some universal platform with a rule system that, on the one hand, does universal monitoring of all existential problems, and on the other provides clear, beneficial instructions for the median voter, as well as for the median worker and their bosses.
Culture is created by the spoon. Give us a normal, unified system that encourages correct behavior with respect to existential risks, since you've won the genetic and circumstantial lottery and were born with high IQ and social skills.
Usually, the median person is interested in: jobs, a full fridge, rituals, culture, the spread of their opinion leader's information, dopamine, political and other random and inherited values, life, continuation of life, and the like.
Provide a universal way of obtaining this and just monitor it calmly. And this touches on all the existential risks: ecology, physics, pandemics, volcanic activity, space, nanobots, the atom.
The Doomsday Clock at 23:55 is not only because of AGI risk; what selfishness to think it is.
Sometimes it seems that Yudkowsky is the Girolamo Savonarola of our days. And as for the system of procedures that the Future of Life Institute and Eliezer have already invented, it is their execution that matters!
Sadly, for humanity right now it's profitable to act first and ask for forgiveness later. So many businesses are built the same way as Binance nowadays, without responsibility ('don't FUD, just build'), and all the AI and other powerful startups work the same way. Much experimental research is not 100% sure it is safe for the planet. In the 20th and 21st centuries this became normal. But it shouldn't be.
And these are the real conditions of the problem, the real pattern of life. Meanwhile, in crypto there are many graphics cards, collected into decentralized networks, gathering into large decentralized nodes and clusters that cannot be turned off. Are they a danger?
We need systems of cheap protection, brakes, and incentives for their use! And, as with seat belts, we should teach this from childhood, with something even simpler than Khan Academy. HPMOR was great. Do we have anything for the next generations, who didn't see or don't like Harry Potter? What is it? Something to explain the problem.
Laws and rules that are just for show, unenforceable, are only harmful. Since ancient times it has been known that any rule consists of three things: hypothesis, disposition, and sanction. Without strong procedural law, all these substantive legal norms are worthless, or more precisely, a boon for the malefactor. If we don't procedurally protect people from wrongful AI, introducing soothing, non-working ethical rules will only increase volatility and the likelihood and advantage of wrongful AI, even if we are lucky enough to get some element of alignment in principle.
I apologize if there were any offensive remarks in the text or if it seemed like an unstructured rant expressing incorrect thoughts; that is how my brain works. I hope I am wrong; please point out where. Thank you for any comments and for your attention!
Replies from: nathan-helm-burger↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-05-09T05:09:26.232Z · LW(p) · GW(p)
A bit of a rant, yes, but some good thoughts here.
I agree that unenforceable regulation can be a bad thing. On the other hand, it can also work in some limited ways. For example, the international agreements against heritable human genetic engineering seem to have held up fairly well. But I think that requires supporting facts about the world to be true: it needs to not be obviously highly profitable to defectors; it needs to be relatively inaccessible to most people (requiring specialized tech and knowledge); and it needs to fit with our collective intuitions (bio-engineering humans seems kinda icky to a lot of people).
The trouble is, all of these things fail to help us with the problem of dangerous AI! As you point out, many bitcoin miners have plenty of GPUs to be dangerous if we get even a couple more orders-of-magnitude algorithmic efficiency improvements. So it's accessible. AI and AGI offer many tempting ways to acquire power and money in society. So it's immediately and incrementally profitable. People aren't as widely instinctively outraged by AI experiments as Bio-engineering experiments. So it's not intuitively repulsive.
So yes, this seems to me to be very much a situation in which we should not place any trust in unenforceable regulation.
I also agree that we probably do need some sort of organization which enforces the necessary protections (detection and destruction) against rogue AI.
And it does seem like a lot of human satisfaction could potentially be bought in the near future with a focus on making sure everyone in the world gets a reasonable minimum amount of satisfaction from their physical and social environments, as you describe here:
Usually, the median person is interested in: jobs, a full fridge, rituals, culture, the spread of their opinion leader's information, dopamine, political and other random and inherited values, life, continuation of life, and the like. Provide a universal way of obtaining this and just monitor it calmly.
As Connor Leahy has said, we should be able to build sufficiently powerful tool-AI to not need to build AGI! Stop while we still have control! Use the wealth to buy off those who would try anyway. Also, build an enforcement agency to stop runaway AI or AI misuse.
I don't know how we get there from here though.
Also, the offense-dominant weapons development landscape is looking really grim, and I don't see how to easily patch that.
On the other hand, I don't think we buy ourselves any chance of victory by trying to gag ourselves for fear of speeding up AGI development. It's coming soon regardless of what we do! The race is short now, we need to act fast!
I don't buy the arguments that our discussions here will make a significant impact in the timing of the arrival of AGI. That seems like hubris to me, to imagine we have such substantial effects, just from our discussions.
Code? Yes, code can be dangerous and shouldn't be published if so.
Sufficiently detailed technical descriptions of potential advancements? Yeah, I can see that being dangerous.
Unsubstantiated commentary about a published paper being interesting and potentially having both capabilities and alignment value? I am unconvinced that such discussions meaningfully impact the experiments being undertaken in AI labs.
comment by Nicholas / Heather Kross (NicholasKross) · 2024-05-02T16:17:16.618Z · LW(p) · GW(p)
"But my ideas are likely to fail! Can I share failed ideas?": If you share a failed idea, that saves the other person time/effort they would've spent chasing that idea. This, of course, speeds up that person's progress, so don't even share failed ideas/experiments about AI, in the status quo.
"So where do I privately share such research?" — good question! There is currently no infrastructure for this. I suggest keeping your ideas/insights/research to yourself. If you think that's difficult for you to do, then I suggest not thinking about AI, and doing something else with your time, like getting into factorio 2 or something.
"But I'm impatient about the infrastructure coming to exist!": Apply for a possibly-relevant grant and build it! Or build it in your spare time. Or be ready to help out if/when someone develops this infrastructure.
"But I have AI insights and I want to convert them into money/career-capital/personal-gain/status!": With that kind of brainpower/creativity, you can get any/all of those things pretty efficiently without publishing AI research, working at a lab, advancing a given SOTA, or doing basically (or literally) anything that differentially speeds up AI capabilities. This, of course, means "work on the object-level problem, without routing that work through AI capabilities", which is often as straightforward "do it yourself".
"But I'm wasting my time if I don't get involved in something related to AGI!": "I want to try LSD, but it's only available in another country. I could spend my time traveling to that country, or looking for mushrooms, or even just staying sober. Therefore, I'm wasting my time unless I immediately inject 999999 fentanyl."