Comments
A lot of good people are doing a lot of bad things that they don't enjoy doing all the time. That seems weird. They even say stuff like "I don't want to do this". But then they recite some very serious-sounding words or whatever and do it anyways.
Lol, okay on review that reads as privileged. Easy for rectangle-havers to say.
There is underlying violence keeping a lot of people "at work" and doing the things they don't want to do. An authoritarian violence keeping everyone in place.
The threat is to shelter, food, security, even humanity past a certain point. You don't "go along", we grind you into the ground. Or, "allow you to be ground by the environment we co-created".
Many people "do the thing they don't want" because they are under much greater threat of material scarcity or physical violence than I am at present and I want to respect that.
<3
For specific activities, I would suggest doubling down on activities that you already like to do or have interest in, but which you implicitly avoid "getting into" because they are considered low status. For example: improve your masturbation game, improve your drug game (as in plan fun companion activities or make it a social thing; not just saying do more/stronger drugs), get really into that fringe sub-genre that ~only you like, experiment with your clothes/hair style, explore your own sexual orientation/gender identity, just straight up stop doing any hobbies that you're only into for the status, etc.
I think the best way to cash in on the fun side of the fun/status tradeoff is probably mostly rooted in adopting a disposition and outlook that allows you to do so. I think most people limit themselves like crazy to promote a certain image and that if you're really trying to extract fun-bang for your status-buck, then dissolving some of that social conditioning and learning to be silly is a good way to go. Basically, I think there's a lot of fun to be had for those who are comfortable acting silly or playful or unconventional. If you can unlock some of that as your default disposition, or even just a mode you can switch into, then I think practically any given activity will be marginally more fun.
I think that most people have a capacity to be silly/playful in a way that is really fun, but that they stifle it mostly for social reasons and over time this just becomes a habitual part of how they interact with the world. I don't expect this point to be controversial.
One of the main social functions of things like alcohol and parties seems to be to give people a clear social license to act silly, playful, and even ~outrageous without being judged harshly. I think that if one is able to overcome some of the latent self-protective psychological constraints that most people develop and gain a degree of genuine emotional indifference towards status, then they can experience much more playfulness and joy than most people normally permit themselves.
I know this isn't really a self-contained "Friday night activity" in itself, but I think that general mindset shifts are probably the way to go if you're not terribly status-concerned and looking for ways to collect fun-rent on it. I think there's a lot to be said for just granting yourself the permission to be silly and have fun in general.
While I agree that there are notable differences between "vegans" and "carnists" in terms of group dynamics, I do not think that necessarily disagrees with the idea that carnists are anti-truthseeking.
"carnists" are not a coherent group, not an ideology, they do not have an agenda (unless we're talking about some very specific industry lobbyists who no doubt exist). They're just people who don't care and eat meat.
It seems untrue that because carnists are not an organized physical group that has meetings and such, they are thereby incapable of having shared norms or ideas/memes. I think in some contexts it can make sense/be useful to refer to a group of people who are not coherent in the sense of explicitly "working together" or having shared newsletters based around a subject or whatever. In some cases, it can make sense to refer to those people's ideologies/norms.
Also, I disagree with the idea that carnists are inherently neutral on the subject of animals/meat. That is, they don't "not care". In general, they actively want to eat meat and would be against things that would stop this. That's not "not caring"; it is "having an agenda", just not one that opposes the current status quo. The fact that being pro-meat and "okay with factory farming" is the more dominant stance/assumed default in our current status quo doesn't mean that it isn't a legitimate position/belief that people could be said to hold. There are many examples of other memetic environments throughout history where the assumed default may not have looked like a "stance" or an "agenda" to the people who were used to it, but nonetheless represented certain ideological claims.
I don't think something only becomes an "ideology" when it disagrees with the current dominant cultural ideas; some things that are culturally common and baked into people from birth can still absolutely be "ideology" in the way I am used to using it. If we disagree on that, then perhaps we could use a different term?
If nothing else, carnists share the ideological assumption that "eating meat is okay". In practice, they often also share ideas about the surrounding philosophical questions and attitudes. I don't think it is beyond the pale to say that they could share norms around truth-seeking as it relates to these questions and attitudes. It feels unnecessarily dismissive and perhaps implicitly status quoist to assume that, as a dominant, implicit meme of our culture, "carnism" must be "neutral" and therefore does not come with/correlate with any norms surrounding how people think about/process questions related to animals/meat.
Carnism comes with as much ideology as veganism even if people aren't as explicit in presenting it or if the typical carnist hasn't put as much thought into it.
I do not really have any experience advocating publicly for veganism and I wouldn't really know about which specific epistemic failure modes are common among carnists for these sorts of conversations, but I have seen plenty of people bend themselves out of shape preserving their own comfort and status quo, so it really doesn't seem like a stretch to imagine that epistemic maladies may tend to present among carnists when the question of veganism comes up.
For one thing, I have personally seen carnists respond in intentionally hostile ways towards vegans/vegan messaging on several occasions. Partly this is because they see it as a threat to their ideas or their way of life, and partly because veganism is a designated punching bag that you're allowed to insult in a lot of places. Oftentimes, these attacks draw on shared ideas about veganism/animals/morality that are common among "carnists".
So, while I agree that there are very different group dynamics, I don't think it makes sense to say that vegans hold ideologies and are capable of exhibiting certain epistemic behaviors, but that carnists, by virtue of not being a sufficiently coherent collection of individuals, could not have the same labels applied to them.
Thanks! I haven't watched, but I appreciated having something to give me the gist!
Hotz was allowed to drive discussion. In debate terms, he was the con side, raising challenges, while Yudkowsky was the pro side defending a fixed position.
This always seems to be the framing, which is unbelievably stupid given the stakes on each side of the argument. Still, it seems to be the default; I'm guessing this comes from status quo bias and the historical tendency of everything to stay relatively the same year by year (less so once technology really started happening). I think AI safety outreach needs to break out of this framing or it's playing a losing game. I feel like, in terms of public communication, whoever's playing defense has mostly already lost.
The idea that poking a single hole in EY's reasoning is enough to dismiss the whole case is also a really broken norm around these discussions, one that we are going to have to move past if we want effective public communication. In particular, the combination of "tell me exactly what an ASI would do" and "if anything you say sounds implausible, then AI is safe" is just ridiculous. Any conversation implicitly operating on that basis is operating in bad faith and borderline not worth having. It's not a fair framing of the situation.
9. Hotz closes with a vision of ASIs running amok
What a ridiculous thing to be okay with?! Is this representative of his actual stance? Is this stance taken seriously by anyone besides him?
not going to rely on a given argument or pathway because although it was true it would strain credulity. This is a tricky balance, on the whole we likely need more of this.
I take it this means not using certain implausible-seeming examples? I agree that we could stand to move away from the "understand the lesson behind this implausible-seeming toy example"-style argumentation and more towards an emphasis on something like "a lot of factors point to doom and even very clever people can't figure out how to make things safe".
I think it matters that most of the "technical arguments" point strongly towards doom, but I think it's a mistake for AI safety advocates to try to do all of the work of laying out and defending technical arguments when it comes to public-facing communication/debate. If you're trying to give all the complicated reasons why doom is a real possibility, then you're implicitly taking on a huge burden of proof and letting your opponent get away with doing nothing more than cause confusion and nitpick.
Like, imagine having to explain general relativity in a debate to an audience who has never heard about it. Your opponent continuously just stops you and disagrees with you, maybe misuses a term here and there, and then at the end the debate is judged by whether the audience is convinced that your theory of physics is correct. It just seems like playing a losing game for no reason.
Again, I didn't see this and I'm sure EY handled himself fine, I just think there's a lot of room for improvement in the general rhythm that these sorts of discussions tend to fall into.
I think it is okay for AI safety advocates to lay out the groundwork, maybe make a few big-picture arguments, maybe talk about expert opinion (since that alone is enough to perk up most sane people's ears and shift some of the burden of proof), and then mostly let their opponents do the work of stumbling through the briars of technical argumentation if they still want to nitpick whatever thought experiment. In general, a leaner case just argues better and is more easily understood. Thus, I think it's better to argue the general case than to attempt the standard shuffle of a dozen different analogies, especially when time/audience attention is more acutely limited.
Would the prize also go towards someone who can prove it is possible in theory? I think some flavor of "alignment" is probably possible and I would suspect it more feasible to try to prove so than to prove otherwise.
I'm not asking to try to get my hypothetical hands on this hypothetical prize money, I'm just curious if you think putting effort into positive proofs of feasibility would be equally worthwhile. I think it is meaningful to differentiate "proving possibility" from alignment research more generally and that the former would itself be worthwhile. I'm sure some alignment researchers do that sort of thing right? It seems like a reasonable place to start given an agent-theoretic approach or similar.
I appreciate the attempt, but I think the argument is going to have to be a little stronger than that if you're hoping for the 10 million lol.
Aligned ASI doesn't mean "unaligned ASI in chains that make it act nice", so the bits where you say:
any constraints we might hope to impose upon an intelligence of this caliber would, by its very nature, be surmountable by the AI
and
overconfidence to assume that we could circumscribe the liberties of a super-intelligent entity
feel kind of misplaced. The idea is less "put the super-genius in chains" and more "get a system smarter than you that wants the sort of stuff you would want a system smarter than you to want in the first place".
From what I could tell, you're also saying something like ~"Making a system that is more capable than you act only in ways that you approve of is nonsense because if it acts only in ways that you already see as correct, then it's not meaningfully smarter than you/generally intelligent." I'm sure there's more nuance, but that's the basic sort of chain of reasoning I'm getting from you.
I disagree. I don't think it is fair to say that just because something is more cognitively capable than you, it's inherently misaligned. I think this is conflating some stuff that is generally worth keeping distinct. That is, "what a system wants" and "how good it is at getting what it wants" (cf. Hume's guillotine, orthogonality thesis).
Like, sure, an ASI can identify different courses of action/consider things more astutely than you would, but that doesn't mean it's taking actions that go against your general desires. Something can see solutions that you don't see, yet pursue the same goals as you. I mean, people cooperate all the time even with asymmetric information and options and such. One way of putting it might be something like: "system is smarter than you and does stuff you don't understand, but that's okay because it leads to your preferred outcomes". I think that's the rough idea behind alignment.
For reference, I think the way you asserted your disagreement came off kind of self-assured and didn't really demonstrate much underlying understanding of the positions you're disagreeing with. I suspect that's part of why you got all the downvotes, but I don't want you to feel like you're getting shut down just for having a contrarian take. 👍
The doubling time for AI compute is ~6 months
Source?
In 5 years compute will scale 2^(5÷0.5)=1024 times
This is a nitpick, but I think you meant 2^(5*2)=1024
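(Either way the arithmetic comes out the same under the stated 6-month doubling time: 5 years / 0.5 years per doubling = 5 × 2 = 10 doublings, i.e. 2^10 = 1024×.)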
In 5 years AI will be superhuman at most tasks including designing AI
This kind of clashes with the idea that AI capabilities gains are driven mostly by compute. If "moar layers!" is the only way forward, then someone might say this is unlikely. I don't think this is a hard problem, but I think it's a bit of a snag in the argument.
An AI will design a better version of itself and recursively loop this process until it reaches some limit
I think you'll lose some people on this one. The missing step here is something like "the AI will be able to recognize and take actions that increase its reward function". There is enough of a disconnect between current systems and systems that would actually take coherent, goal-oriented actions that the point kind of needs to be justified. Otherwise, it leaves room for something like a GPT-X to just kind of spit out good AI designs when asked, but which doesn't really know how to actively maximize its reward function beyond just doing the normal sorts of things it was trained to do.
Such an AI will be superhuman at almost all tasks, including computer security, R&D, planning, and persuasion
I think this is a stronger claim than you need to make and might not actually be that well-justified. It might be worse than humans at loading the dishwasher because that's not important to it, but if it was important, then it could do a brief R&D program in which it quickly becomes superhuman at dishwasher loading. Idk, maybe the distinction I'm making is pointless, but I guess I'm also saying that there's a lot of tasks it might not need to be good at if it's good at things like engineering and strategy.
Overall, I tend to agree with you. Most of my hope for a good outcome lies in something like the "bots get stuck in a local maximum and produce useful superhuman alignment work before one of them bootstraps itself and starts 'disempowering' humanity". I guess that relates to the thing I said a couple paragraphs ago about coherent, goal-oriented actions potentially not arising even as other capabilities improve.
I am less and less optimistic about this as research specifically designed to make bots more "agentic" continues. In my eyes, this is some of the worst research there is.
Personally, I found it obvious that the title was being playful and don't mind that sort of tongue-in-cheek thing. I mean, "utterly perfect" is kind of a giveaway that they're not being serious.
Great post!
As much as I like LessWrong for what it is, I think it's often guilty of a lot of the negative aspects of conformity and coworking that you point out here, i.e. killing good ideas in their cradle. Of course, there are trade-offs to this sort of thing and I certainly appreciate brass-tacks and hard-nosed reasoning sometimes. There is also a need for ingenuity, non-conformity, and genuine creativity (in all of its deeply anti-social glory).
Thank you for sharing this! It helped me feel LessWeird about the sorts of things I do in my own creative/explorative processes and it gave me some new techniques/mindset-things to try.
I suspect there is some kind of internal asymmetry between how we process praise and rejection, especially when it comes to vulnerable things like our identities and our ideas. Back when I used to watch more "content creators" I remember they would consistently gripe that they could read 100 positive comments and still feel most affected by the one or two negative ones.
Well, cheers to not letting our thinking be crushed by the status quo! Nor by critics, internal or otherwise!
There’s a dead zone between skimming and scrutiny where you could play slow games without analyzing them and get neither the immediate benefits of cognitively-demanding analysis nor enough information to gain a passive understanding of the underlying patterns.
I think this is a good point. I think there's a lot to be said for being intentional about how/what you're consuming. It's kind of easy for me to fall into a pit of "kind of paying attention" where I'm spending mental energy but not retaining anything, yet not really skimming either. I think it is less cognitively demanding per unit time, but gives you way worse learning-bang for your mental-energy-buck.
I don't think I fully understand this dead zone or why it happens, but I am suspicious that it also plays a pretty large role in a lot of ineffective/mainstream education.
It strikes me that there is a difficult problem involved in creating a system that can automatically perform useful alignment research, which is generally pretty speculative and theoretical, without that system just being generally skilled at reasoning/problem solving. I am sure they are aware of this, but I feel like it is a fundamental issue worth highlighting.
Still, it seems like the special case of "solve the alignment problem as it relates to an automated alignment researcher" might be easier than "solve alignment problem for reasoning systems generally", so it is potentially a useful approach.
Anyone know what resources I could check out to see how they're planning on designing, aligning, and getting useful work out of their auto-alignment researcher? I mean, they mention some of the techniques, but it still seems vague to me what kind of model they're even talking about. Are they basically going to use an LLM fine-tuned on existing research and then use some kind of scalable oversight/"turbo-RLHF" training regime to try to push it towards more useful outputs or what?
I'm interested in getting involved with a mentorship program or a learning cohort for alignment work. I have found a few things poking around (mostly expired application posts), but I was wondering if anyone could point me towards a more comprehensive list. I found aisafety.community, but it still seems like it is missing things like bootcamps, SERI MATS, and such. If anyone is aware of a list of bootcamps, cohorts, or mentor programs, or could list a few off for me, I would really appreciate the direction. Thanks!
I have sometimes seen people/contests focused on writing up specific scenarios for how AI can go wrong starting with our current situation and fictionally projecting into the future. I think the idea is that this can act as an intuition pump and potentially a way to convince people.
I think that is likely net negative given the fact that state-of-the-art AIs are being trained on internet text and stories where a good agent starts behaving badly are a key component motivating the Waluigi effect.
These sorts of stories still seem worth thinking about, but perhaps greater care should be taken not to inject GPT-5's training data with examples of chatbots that go murderous. Maybe only post it as a zip file or use a simple cipher.
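If anyone actually wants to do the cipher thing, even something as trivial as ROT13 would probably keep the text out of naive scrapes while staying easy for humans to reverse. A minimal sketch (in Python, purely illustrative):

```python
import codecs

story = "toy example: the text of a 'chatbot goes rogue' scenario"
posted = codecs.encode(story, "rot13")     # what you'd actually paste into the post
readable = codecs.decode(posted, "rot13")  # readers (but hopefully not scrapers) reverse it
assert readable == story
```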
This seems to be phrased like a disagreement, but I think you're mostly saying things that are addressed in the original post. It is totally fair to say that things wouldn't go down like this if you stuck 100 actual prisoners or mathematicians or whatever into this scenario. I don't believe OP was trying to claim that it would. The point is just that sometimes bad equilibria can form from everyone following simple, seemingly innocuous rules. It is a faithful execution of certain simple strategic approaches, but it is a bad strategy in situations like this because it fails to account for things like modeling the preferences/behavior of other agents.
To address your scenario:
Alice breaks it unilaterally on round 1, then Bob notices that and joins in on round 2, neither of them end up punished and they get 98.6 from then on
Ya, sure this could happen "in real life", but the important part is that this solution assumes that Alice breaking the equilibrium on round 1 is evidence that she'll break it on round 2. This is exactly why the character Rowan asks:
"If you've just seen someone else violate the equilibrium, though, shouldn't you rationally expect that they might defect from the equilibrium in the future?"
and it yields the response that
"Well, yes. This is a limitation of Nash equilibrium as an analysis tool, if you weren't already convinced it needed revisiting based on this terribly unnecessarily horrible outcome in this situation. ..."
This is followed by discussion of how we might add mathematical elements to account for predicting the behavior of other agents.
Humans predict the behavior of other agents automatically and would not be likely to get stuck in this particular bad equilibrium. That said, I still think this is an interesting toy example because it's kind of similar to some bad equilibria which humans DO get stuck in (see these comments for example). It would be interesting to learn more about the mathematics and try to pinpoint what makes these failure modes more/less likely to occur.
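For what it's worth, here is a tiny generic toy along these lines. To be clear, the payoffs and strategies below are ones I made up for illustration, not the numbers or rules from the original post: agents who rigidly follow a "punish any deviation" rule stay stuck at a mediocre default forever, a lone deviator among them just triggers endless punishment, while agents who all treat an observed deviation as evidence of future deviation (and join it rather than punish) escape after one round.

```python
def payoffs(actions):
    """Made-up payoffs: all-'good' pays 10 each, the prescribed default pays 1 each,
    and any round containing a punishment pays 0 for everyone."""
    if "punish" in actions:
        return [0] * len(actions)
    if all(a == "good" for a in actions):
        return [10] * len(actions)
    return [1] * len(actions)

def rigid(history, i):
    """Follow the prescribed equilibrium: punish any deviation seen last round."""
    if history and any(a != "default" for a in history[-1]):
        return "punish"
    return "default"

def modeler(history, i):
    """Treat an observed deviation as evidence it will continue, and join it.
    Agent 0 probes with a deviation on round 1."""
    if history and any(a == "good" for a in history[-1]):
        return "good"
    return "good" if (i == 0 and not history) else "default"

def run(strategies, rounds=5):
    history, totals = [], [0] * len(strategies)
    for _ in range(rounds):
        actions = [s(history, i) for i, s in enumerate(strategies)]
        for i, p in enumerate(payoffs(actions)):
            totals[i] += p
        history.append(actions)
    return totals

print(run([rigid] * 3))              # [5, 5, 5]: stuck at the bad default forever
print(run([modeler, rigid, rigid]))  # [1, 1, 1]: a lone deviator triggers endless punishment
print(run([modeler] * 3))            # [41, 41, 41]: one probe, then everyone coordinates on 'good'
```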
After reading this, I tried to imagine what an ML system would have to look like if there really were an equivalent of the kind of overhang that was present in evolution. I think that if we try to make the ML analogy such that SGD = evolution, then it would have to look something like: "There are some parameters which update really really slowly (DNA) compared to other parameters (neurons). The difference is like ~1,000,000,000x. Sometimes, all the fast parameters get wiped and the slow parameters update slightly. The process starts over and the fast parameters start from scratch because it seems like there is ~0 carryover between the information in the fast parameters of last generation and the fast parameters in the new generation." In this analogy, the evolutionary-equivalent sharp left turn would be something like: "some of the information from the fast parameters is distilled down and utilized by the fast parameters of the new generation." OP touches on this and this is not what we see in practice, so I agree with OP's point here.
(I would be curious if anyone has info/link on how much certain parameters in a network change relative to other parameters. I have heard this discussed when talking about the resilience of terminal goals against SGD.)
A different analogy I thought of would be one where humans deciding on model architecture are the analogue for evolution and the training process itself is like within-lifetime learning. In these terms, if we wanted to imagine the equivalent of the sharp left turn, we could imagine that we had to keep making new models because of finite "life-spans" and each time we started over, we used a similar architecture with some tweaks based on how the last generation of models performed (inter-generational shifts in gene frequency). The models gradually improve over time due to humans selecting on the architecture. In this analogy, the equivalent of the culture-based sharp left turn would be if humans started using the models of one generation to curate really good, distilled training data for the next generation. This would let each generation outperform the previous generations by noticeably more despite only gradual tweaks in architecture occurring between generations.
This is similar to what OP pointed out in talking about "AI iteratively refining its training data". Although, in the case that the same AI is generating and using the training data, then it feels more analogous to note taking/refining your thoughts through journaling than it does to passing on knowledge between generations. I agree with OP's concern about that leading to weird runaway effects.
I actually find this second version of the analogy where humans = evolution and SGD/training = within-lifetime learning somewhat plausible. Of course, it is still missing some of the other pieces to have a sharp left turn (i.e. the part where I assumed that models had short lifespans and the fact that in real life models increase in size a lot each generation). Still, it does work as a bi-level optimization process where one of the levels has way more compute/happens way faster. In humans, we can't really use our brains without reinforcement learning, so this analogy would also mean that deployment is like taking a snapshot of a human brain in a specific state and just initializing that every time.
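To make the bi-level structure I have in mind a bit more concrete, here is a minimal toy sketch. Everything in it (the "skill" formula, the numbers, the selection rule) is made up purely to show the shape of the loop, not to resemble any real training setup:

```python
import random

def train(arch, data):
    # Inner level ("within-lifetime learning"): fast, compute-heavy, and restarted
    # ~from scratch each generation. Stand-in for SGD; "skill" is a toy capability proxy.
    return {"arch": dict(arch), "skill": arch["width"] * 0.01 + 0.05 * len(data)}

def curate(model, data):
    # The sharp-left-turn ingredient: this generation's model distills/curates training
    # data for the next one, so gains compound across generations.
    return data + [model["skill"]]

arch, data = {"width": 128}, [0.0]
for generation in range(20):
    model = train(arch, data)  # fast inner loop, fresh every generation
    # Outer level (humans playing the role of evolution): slow, coarse architecture
    # tweaks, kept only when the trained model clears an arbitrary noisy threshold.
    if model["skill"] > 1.0 + random.random():
        arch = {"width": arch["width"] + 8}
    data = curate(model, data)
print(model["skill"])
```

The point is just the shape: the inner loop restarts each generation, the outer loop only nudges the architecture, and the data-curation line is where the compounding across generations comes from.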
I am not sure where this analogy breaks/what the implications are for alignment, but I think it avoids some of the flaws of thinking in terms of evolution = SGD. By analogy, that would kind of mean that when we consciously act in ways that go against evolution, but that we think are good, we're exhibiting outer misalignment.
By analogy, that would also mean that when we voluntarily make choices that we would not consciously endorse, we are exhibiting some level of inner misalignment. I am not sure how I feel about this one; that might be a stretch. It would make a separation between some kind of "inner learning process" in our brains that is kind of the equivalent of SGD and the rest of our brains that are the equivalent of the NN. We can act in accordance with the inner learner and that connection of neurons will be strengthened or we act against it and learn not to do that. Humans don't really have a "deployment" phase. (Although, if I wanted to be somewhat unkind I might say that some people do more or less stop actually changing their inner NNs at some point in life and only act based on their context windows.)
I don't know, let me know what you think.
You said that multiple people have looked into s-risks and consider them of similar likelihood to x-risks. That is surprising to me and I would like to know more. Would you be willing to share your sources?
I am very interested in finding more posts/writing of this kind. I really appreciate attempts to "look at the game board" or otherwise summarize the current strategic situation.
I have found plenty of resources explaining why alignment is a difficult problem and I have some sense of the underlying game-theory/public goods problem that is incentivizing actors to take excessive risks in developing AI anyways. Still, I would really appreciate any resources that take a zoomed-out perspective and try to identify the current bottlenecks, key battlegrounds, local win conditions, and roadmaps in making AI go well.
The skepticism that I object to has less to do with the idea that ML systems are not robust enough to operate robots and more to do with people rationalizing based off of the intrinsic feeling that "robots are not scary enough to justify considering AGI a credible threat" (whether they voice this intuition or not).
I agree that having highly capable robots which operate off of ML would be evidence for AGI soon and thus the lack of such robots is evidence in the opposite direction.
That said, because the main threat from AGI that I am concerned about comes from reasoning and planning capabilities, I think it can be somewhat of a red herring. I'm not saying we shouldn't update on the lack of competent robots, but I am saying that we shouldn't flippantly use the intuition, "that robot can't do all sorts of human tasks, I guess machines aren't that smart and this isn't a big deal yet".
I am not trying to imply that this is the reasoning you are employing, but it is a type of reasoning I have seen in the wild. If anything, the lack of robustness in current ML systems might actually be more concerning overall, though I am uncertain about this.
My off-the-cuff best guesses at answering these questions:
1. Current-day large language models do have "goals". They are just very alien, simple-ish goals that are hard to conceptualize. GPT-3 can be thought of as having a "goal" that is hard to express in human terms, but which drives it to predict the next word in a sentence. Its neural pathways "fire" according to some form of logic that leads it to "try" to do certain things; this is a goal. As systems become more general, they will continue to have goals. Their terminal goals can remain as abstract and incomprehensible as whatever GPT-3's goal could be said to be, but they will be more capable of devising instrumental goals that are comprehensible in human terms.
2. Yes. Anything that intelligently performs tasks can be thought of as having goals. That is just a part of why input x outputs y and not z. The term "goal" is just a way of abstracting the behavior of complex, intelligent systems to make some kind of statement about what inputs correspond to what outputs. As such, it is not coherent to speak about an intelligent system that does not have "goals" (in the broad sense of the word). If you were to make a circuit board that just executes the function x = 3y, that circuit board could be said to have "goals" if you chose to consider it intelligent and use the kind of language that we usually reserve for people to describe it. These might not be goals that are familiar or easily expressible in human terms, but they are still goals in a relevant sense. If we strip the word "goal" down to pretty much just mean "the thing a system inherently tends towards doing", then systems that do things can necessarily be said to have goals.
3. "Tool" and "agent" is not a meaningful distinction past a certain point. A tool with any level of "intelligence" that carries out tasks would necessarily be an agent in a certain sense. Even a thermostat can be correctly thought of as an agent which optimizes for a goal. While some hypothetical systems might be very blatant about their own preferences and other systems might behave more like how we are used to tools behaving, they can both be said to have "goals" they are acting on. It is harder to conceptualize the vague inner goals of systems that seem more like tools and easier to imagine the explicit goals of a system that behaves more like a strategic actor, but this distinction is only superficial. In fact, the "deeper"/more terminal goals of the strategic actor system would be incomprehensible and alien in much the same way as the tool system. Human minds can be said to optimize for goals that are, in themselves, not similar humans' explicit terminal goals/values. Tool-like AI is just agentic-AI that is either incapable of or (as in the case of deception) not currently choosing to carry out goals in a way that is obviously agentic by human standards.
For personal context: I can understand why a superintelligent system having any goals that aren't my goals would be very bad for me. I can also understand some of the reasons it is difficult to actually specify my goals or train a system to share my goals. There are a few parts of the basic argument that I don't understand as well though.
For one, I think I have trouble imagining an AGI that actually has "goals" and acts like an agent; I might just be anthropomorphizing too much.
1. Would it make sense to talk about modern large language models as "having goals" or is that something that we expect to emerge later as AI systems become more general?
2. Is there a reason to believe that sufficiently advanced AGI would have goals "by default"?
3. Are "goal-directed" systems inherently more concerning than "tool-like" systems when it comes to alignment issues (or is that an incoherent distinction in this context)?
I will try to answer those questions myself to help people see where my reasoning might be going wrong or what questions I should actually be trying to ask.
Thanks!