China has alienated virtually all its neighbours
That sounds like an exaggeration? My impression is that China has OK/good relations with countries such as Vietnam, Cambodia, Pakistan, Indonesia, North Korea, factions in Myanmar. And Russia, of course. If you're serious about this claim, I think you should look at a map, make a list of countries which qualify as "neighbors" based purely on geographic distance, then look up relations for each one.
What I think is more likely than EA pivoting is a handful of people launch a lifeboat and recreate a high integrity version of EA.
Thoughts on how this might be done:
- Interview a bunch of people who became disillusioned. Try to identify common complaints.
- For each common complaint, research organizational psychology, history of high-performing organizations, etc. and brainstorm institutional solutions to address that complaint. By "institutional solutions", I mean approaches which claim to e.g. fix an underlying bad incentive structure, so it won't require continuous heroic effort to address the complaint.
- Combine the most promising solutions into a charter for a new association of some kind. Solicit criticism/red-teaming for the charter.
- Don't try to replace EA all at once. Start small by aiming at a particular problem present in EA, e.g. bad funding incentives, criticism (it sucks too hard to both give and receive it), or bad feedback loops in the area of AI safety. Initially focus on solving that particular problem, but also build in the capability to scale up and address additional problems if things are going well.
- Don't market this as a "replacement for EA". There's no reason to have an adversarial relationship. When describing the new thing, focus on the specific problem which was selected as the initial focus, plus the distinctive features of the charter and the problems they are supposed to solve.
- Think of this as an experiment, where you're aiming to test one or more theses about what charter content will cause organizational outperformance.
I think it would be interesting if someone put together a reading list on high-performing organizations, social movement history, etc. I suspect this is undersupplied on the current margin, compared with observing and theorizing about EA as it exists now. Without any understanding of history, you run the risk of being a "general fighting the last war" -- addressing the problems EA has now, but inadvertently introducing a new set of problems. Seems like the ideal charter would exist in the intersection of "inside view says this will fix EA's current issues" and "outside view says this has worked well historically".
A reading list might be too much work, but there's really no reason not to do an LLM-enabled literature review of some kind, at the very least.
I also think a reading list for leadership could be valuable. One impression of mine is that "EA leaders" aren't reading books about how to lead, research on leadership, or what great leaders did.
The possibility for the society-like effect of multiple power centres creating prosocial incentives on the projects
OpenAI behaves in a generally antisocial way, inconsistent with its charter, yet other power centers haven't reined it in. Even in the EA and rationalist communities, people don't seem to have asked questions like "Is the charter legally enforceable? Should people besides Elon Musk be suing?"
If an idea is failing in practice, it seems a bit pointless to discuss whether it will work in theory.
One idea is to use a base LLM with no RLHF, compute the perplexity of the reasoning text, and add it as an additional term in the loss function. That should help with comprehensibility, but it doesn't necessarily help with steganography. To disincentivize steganography, you could add noise to the reasoning in various ways, and remove any incentive for terseness, to ensure the model isn't trying to squeeze more communication into a limited token budget.
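To make the perplexity idea concrete, here is a minimal sketch. It assumes you already have per-token natural-log probabilities for the reasoning text from a frozen base model; the names `perplexity`, `total_loss`, and `weight` are made up for illustration. Note that it penalizes log-perplexity (the mean negative log-likelihood) rather than raw perplexity, which is a choice I'm making here so the term doesn't blow up exponentially, not something the idea requires.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token sequence, given per-token natural-log probabilities
    assigned by the (hypothetical) frozen base model."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def total_loss(task_loss, reasoning_logprobs, weight=0.1):
    """Task loss plus a weighted comprehensibility penalty.

    Assumption: the penalty is log-perplexity (mean negative log-likelihood),
    and `weight` is a hyperparameter that would have to be tuned.
    """
    return task_loss + weight * math.log(perplexity(reasoning_logprobs))

# Reasoning text where the base model gives every token probability 0.5
logprobs = [math.log(0.5)] * 4
print(perplexity(logprobs))
print(total_loss(1.0, logprobs, weight=0.5))
```

This only covers the comprehensibility term; it does nothing about steganography, as noted above.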
A basic idea for detecting steganography is to monitor next-token probabilities for synonym pairs. If they tend to diverge over time, that suggests the model is assigning additional hidden meaning. (This might fail if the synonym embeddings are too close.)
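A rough sketch of what that monitoring could look like, assuming you can record the next-token probabilities of a word and its synonym in the same set of contexts at several training checkpoints. The function names and the `threshold` value are invented for illustration, and comparing only the first and last checkpoint is a crude stand-in for a proper trend test.

```python
import math

def synonym_gaps(prob_history):
    """prob_history: list of (p_word, p_synonym) pairs, one per checkpoint,
    measured in the same contexts. Returns the absolute log-ratio per checkpoint."""
    return [abs(math.log(p_a) - math.log(p_b)) for p_a, p_b in prob_history]

def is_diverging(prob_history, threshold=1.0):
    """Flag the pair if the gap grew by more than `threshold` nats between
    the first and last checkpoint (threshold is an arbitrary assumption)."""
    gaps = synonym_gaps(prob_history)
    return gaps[-1] - gaps[0] > threshold

# Hypothetical probabilities for a synonym pair across three checkpoints
drifting = [(0.10, 0.09), (0.12, 0.06), (0.20, 0.01)]
stable = [(0.10, 0.09), (0.11, 0.10)]
print(is_diverging(drifting), is_diverging(stable))
```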
My preferred approach to CoT would be something like:
- Give human raters the task of next-token prediction on a large text corpus. Have them write out their internal monologue when trying to predict the next word in a sentence.
- Train a model to predict the internal monologue of a human rater, conditional on previous tokens.
- Train a second model to predict the next token in the corpus, conditional on previous tokens in the corpus and also the written internal monologue.
- Only combine the above two models in production.
- Now that you've embedded CoT in the base model, maybe it will be powerful enough that you can discard RLHF, and replace it with some sort of fine-tuning on PhDs roleplaying as a helpful/honest/harmless chatbot.
Basically give the base model a sort of "working memory" that's incentivized for maximal human imitativeness and interpretability. Then you could build an interface where a person can mouse over any word in a sentence and see what the model was 'thinking' when it chose that word. (Realistically you wouldn't do this for every word in a sentence, just the trickier ones.)
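Here's a toy sketch of how the two models might be combined at inference time. Everything in it is hypothetical: `monologue_model` and `token_model` are stand-ins for the two trained models, and the toy functions at the bottom are hard-coded rules, not real models. The point is only the data flow: each emitted token is stored next to the monologue that produced it, which is what a mouse-over interface would display.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for the two separately trained models
MonologueModel = Callable[[List[str]], str]   # previous tokens -> monologue text
TokenModel = Callable[[List[str], str], str]  # previous tokens + monologue -> next token

def generate(context: List[str],
             monologue_model: MonologueModel,
             token_model: TokenModel,
             steps: int) -> List[Tuple[str, str]]:
    """Generate `steps` tokens, keeping (token, monologue) pairs for inspection."""
    tokens = list(context)
    trace = []
    for _ in range(steps):
        thought = monologue_model(tokens)   # model 1: imitate a rater's monologue
        nxt = token_model(tokens, thought)  # model 2: predict the token given that monologue
        trace.append((nxt, thought))
        tokens.append(nxt)
    return trace

# Toy stand-ins, purely for illustration
def toy_monologue(tokens: List[str]) -> str:
    return f"after '{tokens[-1]}' a noun seems likely"

def toy_token(tokens: List[str], thought: str) -> str:
    return "cat" if "noun" in thought else "the"

for token, thought in generate(["the"], toy_monologue, toy_token, 2):
    print(token, "<-", thought)
```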
If that's true, perhaps the performance penalty for pinning/freezing weights in the 'internals', prior to the post-training, would be low. That could ease interpretability a lot, if you didn't need to worry so much about those internals which weren't affected by post-training?
On LessWrong, there's a comment section where hard questions can be asked and are asked frequently.
In my experience, asking hard questions here is quite socially unrewarding. I could probably think of a dozen or so cases where I think the LW consensus "emperor" has no clothes, that I haven't posted about, just because I expect it to be an exercise in frustration. I think I will probably quit posting here soon.
I don't think AI policy is a good example for discourse on LessWrong. There are strategic reasons to be less transparent about how to affect public policy than for most other topics.
In terms of advocacy methods, sure. In terms of desired policies, I generally disagree.
Everything that's written publicly can be easily picked up by journalists wanting to write stories about AI.
If that's what we are worried about, there is plenty of low-hanging fruit in terms of e.g. not tweeting wildly provocative stuff for no reason. (You can ask for examples, but be warned, sharing them might increase the probability that a journalist writes about them!)
"The far left is censorious" and "Republicans are censorious" are in no way incompatible claims :-)
Great post. Self-selection seems huge for online communities, and I think it's no different on these fora.
Confidence level: General vague impressions and assorted thoughts follow; could very well be wrong on some details.
A disagreement I have with both the rationalist and EA communities is what the process of coming to robust conclusions looks like. In those communities, it seems like the strategy is often to identify a few super-geniuses who go do a super-deep analysis, and come to a conclusion that's assumed to be robust and trustworthy. See the "Groupthink" section on this page for specifics.
From my perspective, I would rather see an ordinary-genius do an ordinary-depth analysis, and then have a bunch of other people ask a bunch of hard questions. If the analysis holds up against all those hard questions, then the conclusion can be taken as robust.
Everyone brings their own incentives, intuitions, and knowledge to a problem. If a single person focuses a lot on a problem, they run into diminishing returns regarding the number of angles of attack. It seems more effective to generate a lot of angles of attack by taking the union of everyone's thoughts.
From my perspective, placing a lot of trust in top EA/LW thought leaders ironically makes them less trustworthy, because people stop asking why the emperor has no clothes.
The problem with saying the emperor has no clothes is: Either you show yourself a fool, or else you're attacking a high-status person. Not a good prospect either way, in social terms.
EA/LW communities are an unusual niche with opaque membership norms, and people may want to retain their "insider" status. So they do extra homework before accusing the emperor of nudity, and might just procrastinate indefinitely.
There can also be a subtle aspect of circular reasoning to thought leadership: "we know this person is great because of their insights", but also "we know this insight is great because of the person who said it". (Certain celebrity users on these fora get 50+ positive karma on basically every top-level post. Hard to believe that the authorship isn't coloring the perception of the content.)
A recent illustration of these principles might be the pivot to AI Pause. IIRC, it took a "super-genius" (Katja Grace) writing a super long post before Pause became popular. If an outsider simply said: "So AI is bad, why not make it illegal?" -- I bet they would've been downvoted. And once that's downvoted, no one feels obligated to reply. (Note, also -- I don't believe there was much reasoning transparency regarding why the pause strategy was considered unpromising at the time. You kinda had to be an insider like Katja to know the reasoning in order to critique it.)
In conclusion, I suspect there are a fair number of mistaken community beliefs which survive because (1) no "super-genius" has yet written a super-long post about them, and (2) poking around by asking hard questions is disincentivized.
Yeah, I think there are a lot of underexplored ideas along these lines.
It's weird how so much of the internet seems locked into either the reddit model (upvotes/downvotes) or the Twitter model (likes/shares/followers), when the design space is so much larger than that. Someone like Aaron, who played such a big role in shaping the internet, seems more likely to have a gut-level belief that it can be shaped. I expect there are a lot more things like Community Notes that we could discover if we went looking for them.
I've always wondered what Aaron Swartz would think of the internet now, if he was still alive. He had far-left politics, but also seemed to be a big believer in openness, free speech, crowdsourcing, etc. When he was alive those were very compatible positions, and Aaron was practically the poster child for holding both of them. Nowadays the far left favors speech restrictions and is cynical about the internet.
Would Aaron have abandoned the far left, now that they are censorious? Would he have become censorious himself? Or would he have invented some clever new technology, like RSS or reddit, to try and fix the internet's problems?
Just goes to show what a tragedy death is, I guess.
I expect escape will happen a bunch
Are you willing to name a specific year/OOM such that if there are no publicly known cases of escape by that year/OOM, you would be surprised? What, if anything, would you acknowledge as evidence that alignment is easier than you thought, here?
To ensure the definition of "escape" is not gerrymandered -- do you know of any cases of escape right now? Do you think escape has already occurred and you just don't know about it? "Escape" means something qualitatively different from any known event up to this point, yes? Does it basically refer to self-exfiltration of weights which was not requested by any human? Can we get a somewhat precise definition by any chance?
Sure there will be errors, but how important will those errors be?
Humans currently control the trajectory of humanity, and humans are error-prone. If you replace humans with something that's error-prone in similar ways, that doesn't seem like it's obviously either a gain or a loss. How would such a system compare to an em of a human, for example?
If you want to show that we're truly doomed, I think you need additional steps beyond just "there will be errors".
Some recent-ish bird flu coverage:
Global health leader critiques ‘ineptitude’ of U.S. response to bird flu outbreak among cows
A Bird-Flu Pandemic in People? Here’s What It Might Look Like. TLDR: not good. (Reload the page and ctrl-a then ctrl-c to copy the article text before the paywall comes up.) Interesting quote: "The real danger, Dr. Lowen of Emory said, is if a farmworker becomes infected with both H5N1 and a seasonal flu virus. Flu viruses are adept at swapping genes, so a co-infection would give H5N1 opportunity to gain genes that enable it to spread among people as efficiently as seasonal flu does."
Infectious bird flu survived milk pasteurization in lab tests, study finds. Here's what to know.
1 in 5 milk samples from grocery stores test positive for bird flu. Why the FDA says it’s still safe to drink -- see also updates from the FDA here: "Last week we announced preliminary results of a study of 297 retail dairy samples, which were all found to be negative for viable virus." (May 10)
The FDA is making reassuring noises about pasteurized milk, but given that CDC and friends also made reassuring noises early in the COVID-19 pandemic, I'm not fully reassured.
I wonder if drinking a little bit of pasteurized milk every day would be helpful inoculation? You could hedge your bets by buying some milk from every available brand, and consuming a teaspoon from a different brand every day, gradually working up to a tablespoon etc.
About a month ago, I wrote a quick take suggesting that an early messaging mistake made by MIRI was: claim there should be a single leading FAI org, but not give specific criteria for selecting that org. That could've led to a situation where DeepMind, OpenAI, and Anthropic can all think of themselves as "the best leading FAI org".
An analogous possible mistake that's currently being made: Claim that we should "shut it all down", and also claim that it would be a tragedy if humanity never created AI, but not give specific criteria for when it would be appropriate to actually create AI.
What sort of specific criteria? One idea: A committee of random alignment researchers is formed to study the design; if at least X% of the committee rates the odds of success at Y% or higher, it gets the thumbs up. Not ideal criteria, just provided for the sake of illustration.
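Written out as a rule, the illustrative criterion looks something like the sketch below. The function name and the example numbers are made up; X and Y from the text appear as `min_fraction` and `min_odds`.

```python
def approve(ratings, min_fraction, min_odds):
    """Thumbs up if at least `min_fraction` of committee members rate the
    odds of success at `min_odds` or higher. Ratings are probabilities in [0, 1]."""
    if not ratings:
        raise ValueError("committee must have at least one member")
    n_confident = sum(1 for r in ratings if r >= min_odds)
    return n_confident / len(ratings) >= min_fraction

# Hypothetical committee of four, with X = 75% and Y = 90%
print(approve([0.95, 0.90, 0.50, 0.99], min_fraction=0.75, min_odds=0.90))
```

Even this toy version makes the open questions visible: who picks the committee, and who picks X and Y.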
Why would this be valuable?
- If we actually get a pause, it's important to know when to unpause as well. Specific criteria could improve the odds that an unpause happens in a reasonable way.
- If you want to build consensus for a pause, advertising some reasonable criteria for when we'll unpause could get more people on board.
Don’t have time to respond in detail but a few quick clarifications/responses:
Sure, don't feel obligated to respond, and I invite the people disagree-voting my comments to hop in as well.
— There are lots of groups focused on comms/governance. MIRI is unique only insofar as it started off as a “technical research org” and has recently pivoted more toward comms/governance.
That's fair, when you said "pretty much any other organization in the space" I was thinking of technical orgs.
MIRI's uniqueness does seem to suggest it has a comparative advantage for technical comms. Are there any organizations focused on that?
by MIRI’s lights, getting policymakers to understand alignment issues would be more likely to result in alignment progress than having more conversations with people in the technical alignment space
By 'alignment progress' do you mean an increased rate of insights per year? Due to increased alignment funding?
Anyway, I don't think you're going to get "shut it all down" without either a warning shot or a congressional hearing.
If you just extrapolate trends, it wouldn't particularly surprise me to see Alex Turner at a congressional hearing arguing against "shut it all down". Big AI has an incentive to find the best witnesses it can, and Alex Turner seems to be getting steadily more annoyed. (As am I, fwiw.)
Again, extrapolating trends, I expect MIRI's critics like Nora Belrose will increasingly shift from the "inside game" of trying to engage w/ MIRI directly to a more "outside game" strategy of explaining to outsiders why they don't think MIRI is credible. After the US "shuts it down", countries like the UAE (accused of sponsoring genocide in Sudan) will likely try to quietly scoop up US AI talent. If MIRI is considered discredited in the technical community, I expect many AI researchers to accept that offer instead of retooling their career. Remember, a key mistake the board made in the OpenAI drama was underestimating the amount of leverage that individual AI researchers have, and not trying to gain mindshare with them.
Pause maximalism (by which I mean focusing 100% on getting a pause and not trying to speed alignment progress) only makes sense to me if we're getting a ~complete ~indefinite pause. I'm not seeing a clear story for how that actually happens, absent a much broader doomer consensus. And if you're not able to persuade your friends, you shouldn't expect to persuade your enemies.
Right now I think MIRI only gets their stated objective in a world where we get a warning shot which creates a broader doom consensus. In that world it's not clear advocacy makes a difference on the margin.
I think if MIRI engages with “curious newcomers” those newcomers will have their own questions/confusions/objections and engaging with those will improve general understanding.
You think policymakers will ask the sort of questions that lead to a solution for alignment?
In my mind, the most plausible way "improve general understanding" can advance the research frontier for alignment is if you're improving the general understanding of people fairly near that frontier.
Based on my experience so far, I don’t expect their questions/confusions/objections to overlap a lot with the questions/confusions/objections that tech-oriented active LW users have.
I expect MIRI is not the only tech-oriented group policymakers are talking to. So in the long run, it's valuable for MIRI to either (a) convince other tech-oriented groups of its views, or (b) provide arguments that will stand up against those from other tech-oriented groups.
there’s perhaps more public writing/dialogues between MIRI and its critics than for pretty much any other organization in the space.
I believe they are also the only organization in the space that says its main focus is on communications. I'm puzzled that multiple full-time paid staff are getting out-argued by folks like Alex Turner who are posting for free in their spare time.
If MIRI wants us to make use of any added timeline in a way that's useful, or make arguments that outsiders will consider robust, I think they should consider a technical communications strategy in addition to a general-public communications strategy. The wave-rock model could help for technical communications as well. Right now their wave game for technical communications seems somewhat nonexistent. E.g. compare Eliezer's posting frequency on LW vs X.
You depict a tradeoff between focusing on "ingroup members" vs "policy folks", but I suspect there are other factors which are causing their overall output to be low, given their budget and staffing levels. E.g. perhaps it's an excessive concern with org reputation that leads them to be overly guarded in their public statements. In which case they could hire an intern to argue online for 40 hours a week, and if the intern says something dumb, MIRI can say "they were just an intern -- and now we fired them." (Just spitballing here.)
It's puzzling to me that MIRI originally created LW for the purpose of improving humanity's thinking about AI, and now Rob says that's their "main focus", yet they don't seem to use LW that much? Nate hasn't said anything about alignment here in the past ~6 months. I don't exactly see them arguing with the ingroup too much.
So what's the path by which our "general understanding of the situation" is supposed to improve? There's little point in delaying timelines by a year, if no useful alignment research is done in that year. The overall goal should be to maximize the product of timeline delay and rate of alignment insights.
Also, I think you may be underestimating the ability of newcomers to notice that MIRI tends to ignore its strongest critics. See also previously linked comment.
In terms of "improve the world's general understanding of the situation", I encourage MIRI to engage more with informed skeptics. Our best hope is if there is a flaw in MIRI's argument for doom somewhere. I would guess that e.g. Matthew Barnett has spent something like 100x as much effort engaging with MIRI as MIRI has spent engaging with him, at least publicly. He seems unusually persistent -- I suspect many people are giving up, or gave up long ago. I certainly feel quite cynical about whether I should even bother writing a comment like this one.
superbabies
I'm concerned there may be an alignment problem for superbabies.
Humans often have contempt for people and animals with less intelligence than them. "You're dumb" is practically an all-purpose putdown. We seem to assign moral value to various species on the basis of intelligence rather than their capacity for joy/suffering. We put chimpanzees in zoos and chickens in factory farms.
Additionally, jealousy/"xenophobia" towards superbabies from vanilla humans could lead them to become misanthropes. Everyone knows genetic enhancement is a radioactive topic. At what age will a child learn they were modified? It could easily be just as big of a shock as learning that you were adopted or conceived by a donor. Then stack more baggage on top: Will they be bullied for it? Will they experience discrimination?
I feel like we're charging headlong into these sociopolitical implications, hollering "more intelligence is good!", the same way we charged headlong into the sociopolitical implications of the internet/social media in the 1990s and 2000s while hollering "more democracy is good!" There's a similar lack of effort to forecast the actual implications of the technology.
I hope researchers are seeking genes for altruism and psychological resilience in addition to genes for intelligence.
Also a strategy postmortem on the decision to pivot to technical research in 2013: https://intelligence.org/2013/04/13/miris-strategy-for-2013/
I do wonder about the counterfactual where MIRI never sold the Singularity Summit, and it was blowing up as an annual event, same way Less Wrong blew up as a place to discuss AI. Seems like owning the Summit could create a lot of leverage for advocacy.
One thing I find fascinating is the number of times MIRI has reinvented themselves as an organization over the decades. People often forget that they were originally founded to bring about the Singularity with no concern for friendliness. (I suspect their advocacy would be more credible if they emphasized that.)
I appreciate your replies. I had some more time to think and now I have more takes. This isn't my area, but I'm having fun thinking about it.
See https://en.wikipedia.org/wiki/File:ComputerMemoryHierarchy.svg
- Disk encryption is table stakes. I'll assume any virtual memory is also encrypted. I don't know much about that.
- I'm assuming no use of flash memory.
- Absent homomorphic encryption, we have to decrypt in the registers, or whatever they're called for a GPU.
So basically the question is how valuable is it to encrypt the weights in RAM and possibly in the processor cache. For the sake of this discussion, I'm going to assume reading from the processor cache is just as hard as reading from the registers, so there's no point in encrypting the processor cache if we're going to decrypt in registers anyway. (Also, encrypting the processor cache could really hurt performance!)
So that leaves RAM: how much added security we get if we encrypt RAM in addition to encrypting disk.
One problem I notice: An attacker who has physical read access to RAM may very well also have physical write access to RAM. That allows them to subvert any sort of boot-time security, by rewriting the running OS in RAM.
If the processor can only execute signed code, that could help. But an attacker could still control which signed code the processor runs (by strategically changing the contents at an instruction pointer?) I suspect this is enough in practice.
A somewhat insane idea would be for the OS to run encrypted in RAM to make it harder for an attacker to tamper with it. I doubt this would help -- an attacker could probably infer from the pattern of memory accesses which OS code does what. (Assuming they're able to observe memory accesses.)
So overall it seems like with physical write access to RAM, an attacker can probably get de facto root access, and make the processor their puppet. At that point, I think exfiltrating the weights should be pretty straightforward. I'm assuming intermediate activations must be available for interpretability, so it seems possible to infer intermediate weights by systematically probing intermediate activations and solving for the weights.
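To illustrate the "probe activations and solve for the weights" step in the simplest possible case: a minimal sketch assuming a single bias-free linear layer where the attacker can choose the layer's input and read its output. That's a strong assumption (real layers have biases and nonlinearities, and the attacker may only control inputs further upstream), and all names here are hypothetical.

```python
def linear_layer(weights, x):
    """Bias-free linear layer y = W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def recover_weights(layer, n_inputs):
    """Recover W by probing with one-hot inputs: feeding in the j-th basis
    vector makes the layer output exactly the j-th column of W."""
    columns = []
    for j in range(n_inputs):
        probe = [1.0 if i == j else 0.0 for i in range(n_inputs)]
        columns.append(layer(probe))
    # Transpose the list of columns back into a list of rows
    return [list(row) for row in zip(*columns)]

# The attacker sees only the black-box function, not `secret`
secret = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
black_box = lambda x: linear_layer(secret, x)
print(recover_weights(black_box, n_inputs=2))
```

With nonlinearities in the way you'd need more probes and an actual solver, but the basic shape of the attack is the same.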
If you could run the OS from ROM, so it can't be tampered with, maybe that could help. I'm assuming no way to rewrite the ROM or swap in a new ROM while the system is running. Of course, that makes OS updates annoying since you have to physically open things up and swap them out. Maybe that introduces new vulnerabilities.
In any case, overall I suspect the benefit-to-effort ratio is higher elsewhere. I would focus on making sure the AI isn't capable of reading its own RAM in the first place, and isn't trying to.
QFT is the extreme example of a "better abstraction", but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.
Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. English language, due to being trained on lots of English language data. Due to Occam's Razor, I expect the internal ontology to be biased towards that of an English-language speaker.
It's just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical "head", pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don't, and which instead use some kind of lossy approximation of that state involving intermediary concepts like "intent", "belief", "agent", "subjective state", etc.
I'm imagining something like: early in training the model makes use of those lossy approximations because they are a cheap/accessible way to improve its predictive accuracy. Later in training, assuming it's being trained on the sort of gigantic scale that would allow it to hold swaths of the physical universe in its head, it loses those desired lossy abstractions due to catastrophic forgetting. Is that an OK way to operationalize your concern?
I'm still not convinced that this problem is a priority. It seems like a problem which will be encountered very late if ever, and will lead to 'random' failures on predicting future/counterfactual data in a way that's fairly obvious.
If I encountered an intelligent extraterrestrial species, in principle I think I could learn to predict fairly well things like what it finds easy to understand, what its values are, and what it considers to be ethical behavior, without using any of the cognitive machinery I use to self-reflect. Humans tend to reason about other humans by asking "what would I think if I was in their situation", but in principle an AI doesn't have to work that way. But perhaps you think there are strong reasons why this would happen in practice?
Supposing we had strong reasons to believe that an AI system wasn't self-aware and wasn't capable of self-reflection. So it can look over a plan it generated and reason about its understandability, corrigibility, impact on human values, etc. without any reflective aspects. Does that make alignment any easier according to you?
Supposing the AI lacks a concept of "easy to understand", as you hypothesize. Does it seem reasonable to think that it might not be all that great at convincing a gatekeeper to unbox it, since it might focus on super complex arguments which humans can't understand?
Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI's internal ontology at training time. My guess is that more of the disagreement lies here.
Is this mostly about mesa-optimizers, or something else?
If I were to guess, I'd guess that by "you" you're referring to someone or something outside of the model, who has access to the model's internals, and who uses that access to, as you say, "read" the next token out of the model's ontology.
Was using a metaphorical "you". Probably should've said something like "gradient descent will find a way to read the next token out of the QFT-based simulation".
Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those?
I suppose I should've said various documents are IID to be more clear. I would certainly guess they are.
Right, so just to check that we're on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data -- for generative use, say, or for use in some kind of online RL setup -- then it'll be doing something other than retrodicting?
Generally speaking, yes.
And that's where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize "wrongly" from your perspective, if I'm modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?
Well, if we're following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren't generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn't overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.
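For concreteness, the workflow described above looks roughly like the sketch below. It's schematic: `fit` and `score` are placeholders for whatever training and evaluation routines you actually use, and the split fractions and seed are arbitrary.

```python
import random

def split(data, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle and split into (train, dev, test). Assumes examples are IID."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_frac)
    n_test = int(len(items) * test_frac)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test

def select_and_evaluate(train, dev, test, candidates, fit, score):
    """Fit one model per hyperparameter setting on train, keep the one that
    scores best on dev, then report its score on test exactly once.

    fit(train, hp) -> model and score(model, data) -> float (higher is better)
    are hypothetical callables supplied by the caller.
    """
    models = [fit(train, hp) for hp in candidates]
    best = max(models, key=lambda m: score(m, dev))
    return best, score(best, test)

train, dev, test = split(range(100))
print(len(train), len(dev), len(test))
```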
In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn't make a difference under normal circumstances. It sounds like maybe you're discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it's able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?
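One cheap way to check for this kind of hidden temporal shift is to hold out data by timestamp rather than at random, and compare scores on either side of the cutoff. A minimal sketch, assuming each example carries a creation timestamp; `score` is again a placeholder for your real evaluation function.

```python
def temporal_split(records, cutoff):
    """records: iterable of (timestamp, example) pairs.
    Returns (examples before cutoff, examples at or after cutoff)."""
    records = list(records)
    before = [ex for ts, ex in records if ts < cutoff]
    after = [ex for ts, ex in records if ts >= cutoff]
    return before, after

def shift_gap(score, model, before, after):
    """How much worse the model does on post-cutoff data.
    A large positive gap suggests time is acting as a source of distribution shift."""
    return score(model, before) - score(model, after)

# Hypothetical records with years as timestamps
records = [(2019, "a"), (2021, "b"), (2023, "c")]
before, after = temporal_split(records, cutoff=2022)
print(before, after)
```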
(To be clear, this sort of acquired-omniscience is liable to sound kooky to many ML researchers. I think it's worth stress-testing alignment proposals under these sorts of extreme scenarios, but I'm not sure we should weight them heavily in terms of estimating our probability of success. In this particular scenario, the model's performance would drop on data generated after training, and that would hurt the company's bottom line, and they would have a strong financial incentive to fix it. So I don't know if thinking about this is a comparative advantage for alignment researchers.)
BTW, the point about documents being IID was meant to indicate that there's little incentive for the model to e.g. retrodict the coordinates of the server storing a particular document -- the sort of data that could aid and incentivize omniscience to a greater degree.
In any case, I would argue that "accidental omniscience" characterizes the problem better than "alien abstractions". As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.
Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed).
I think it ought to be possible for someone to always be present. [I'm also not sure it would be necessary.]
So we need not assume that predicting "the genius philosopher" is a core task.
It's not the genius philosopher that's the core task, it's the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we're doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that's not quite enough -- you also need to read the next token out of that QFT-based simulation if you actually want to predict it. This sort of 'reading tokens out of a QFT simulation' thing would be very common, thus something the system gets good at in order to succeed at next-token prediction.
I think perhaps there's more to your thought experiment than just alien abstractions, and it's worth disentangling these assumptions. For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it's doing retrodiction. It's making 'predictions' about things which already happened in the past. The final model is chosen based on what retrodicts the data the best. Also, usually the data is IID rather than sequential -- there's no time component to the data points (unless it's a time-series problem, which it usually isn't). The fact that we're choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.
So basically I suspect what you're really trying to claim here, which incidentally I've also seen John allude to elsewhere, is that the standard assumptions of machine learning involving retrodiction and IID data points may break down once your system gets smart enough. This is a possibility worth exploring, I just want to clarify that it seems orthogonal to the issue of alien abstractions. In principle one can imagine a system that heavily features QFT in its internal ontology yet still can be characterized as retrodicting on IID data, or a system with vanilla abstractions that can't be characterized as retrodicting on IID data. I think exploring this in a post could be valuable, because it seems like an under-discussed source of disagreement between certain doomer-type people and mainstream ML folks.
I'm confused about what it means to "remove the human", and why it's so important whether the human is 'removed'. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...
- That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?
- That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?
If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher's spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?
Another probe: Is alignment supposed to be hard in this hypothetical because the AI can't represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there's no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?
But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human's behavior while the human was present.
This sounds a lot like saying "it might fail to generalize". Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn't that imply our systems are getting better at choosing proxies which generalize even when the human isn't 'present'?
I think this quantum fields example is perhaps not all that forceful, because in your OP you state
maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system
However, it sounds like you're describing a system where we represent humans using quantum fields as a routine matter, so fitting the translation into the system isn't sounding like a huge problem? Like, if I want to know the answer to some moral dilemma, I can simulate my favorite philosopher at the level of quantum fields in order to hear what they would say if they were asked about the dilemma. Sounds like it could be just as good as an em, where alignment is concerned.
It's hard for me to imagine a world where developing representations that allow you to make good next-token predictions etc. doesn't also develop representations that can somehow be useful for alignment. Would be interested to hear fleshed-out counterexamples.
a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.
This hypothetical suggests to me that the AI might not be very good at e.g. manipulating humans in an AI-box experiment, since it just doesn't understand how humans think all that well.
I wonder what MIRI thinks about this 2013 post ("The genie knows, but doesn't care") nowadays. Seems like the argument is less persuasive now, with AIs that seem to learn representations first, and later are given agency by the devs. I actually suspect your model of Eliezer is wrong, because it seems to imply he believes "the AI actually just doesn't know", and it's a little hard for me to imagine him saying that.
Alternatively, maybe the "faithfully and robustly" bit is supposed to be very load-bearing. However, it's already the case that humans learn idiosyncratic, opaque neural representations of our values from sense data -- yet we're able to come into alignment with each other, without a bunch of heavy-duty interpretability or robustness techniques.
Thank you.
I think maybe my confusion here is related to the threat model. If a model gained root access to the device that it's running on, it seems like it could probably subvert these security measures? Anyway I'd be interested to read a more detailed description of the threat model and how this stuff is supposed to help.
More specifically, it seems a bit weird to imagine an attacker who has physical access to a running server, yet isn't able to gain de facto root access for the purpose of weight exfiltration. E.g. you can imagine using your physical access to copy the encrypted weights onto a different drive running a different OS, then boot from that drive, and the new OS has been customized to interface with the chip so as to exfiltrate the weights. Remember that the chip can't exactly be encrypting every result of its weight computations using some super-secret key, because if it did, the entire setup would effectively be useless. Seems to me like the OS has to be part of the TCB along with the chip?
Out of curiosity, does that mean that if the app worked fairly well as described, you would consider that an update that alignment maybe isn't as hard as you thought? Or are you one of the "only endpoints can be predicted" crowd, such that this wouldn't constitute any evidence?
BTW, I strongly suspect that YouTube cleaned up its comment section in recent years by using ML for comment rankings. Seems like a big improvement to me. You'll notice that "crappy YouTube comments" is not as much of a meme as it once was.
I pretty much agree. I just wrote a comment about why I believe LLMs are the future of rational discourse. (Publishing this comment as comfort-zone-expansion on getting downvoted, so feel free to downvote.)
To be more specific about why I agree: I think the quality/agreement axes on voting help a little bit, but when it comes to any sort of acrimonious topic where many users "have a dog in the fight", it's not enough. In the justice system we have this concept that a judge is supposed to recuse themselves in certain cases when they have a conflict of interest. A judge, someone who's trained much of their career for neutral rules-based judgement, still just isn't trusted to be neutral in certain cases. Now consider Less Wrong, a place where untrained users make decisions in just a few minutes (as opposed to a trial over multiple hours or days), without any sort of rules to go by, oftentimes after hours when they're cognitively fatigued and less capable of System 2 overrides, and are susceptible to Asch Conformity type effects due to early exposure to the judgements of other users. There's a lot of content here about how to overcome your biases, which is great, but there simply isn't enough anti-bias firepower in the LW archives to consistently conquer that granddaddy of biases, myside bias. We aren't trained to consistently implement specific anti-myside bias techniques that are backed by RCTs and certified by Cochrane, not even close, and it's dangerous to overestimate our bias-fighting abilities.
IMO the next level-up in discourse is going to be when someone creates an LLM-moderated forum. The LLM will have a big public list of discussion guidelines in its context window. When you click "submit" on your comment, it will give your comment a provisional score (in lieu of votescore), and tell you what you can do to improve it. The LLM won't just tell you how to be more civil or rational. It will also say things like "hey, it looks like someone else already made that point in the comments -- shall I upvote their comment for you, and extract the original portion of your comment as a new submission?" Or "back in 2013 it was argued that XYZ, seems your comment doesn't jibe with that. Thoughts?" Or "Point A was especially insightful, I like that!" Or "here's a way you could rewrite this more briefly and more clearly and less aggressively". Or "here's a counterargument someone might write, perhaps you should be anticipating it?"
The initial version probably won't work well, but over time, with enough discussion and iteration on guidelines/finetuning/etc., the discussion on that forum will be clearly superior. It'll be the same sort of level-up we saw with Community Notes on X, or with the US court system compared with the mob rule you see on social media. Real-world humans have the problem where the more you feel you have a dog in the fight, the more you engage with the discussion, and that causes inevitable politicization with online voting systems. The LLM is going to be like a superhumanly patient neutral moderator, neutering the popularity contest and ingroup/outgroup aspects of modern social media.
...we’ll need much more extreme security against model self-exfiltration across the board, from hardware encryption to many-key signoff.
I found this bit confusing. Why is hardware encryption supposed to help? Why would it be better than software encryption? Is the idea to prevent the model from reading its own weights via a physical side-channel?
What exactly is "many-key signoff" supposed to mean and how is it supposed to help?
IMO someone should consider writing a "how and why" post on nationalizing AI companies. It could accomplish a few things:
- Ensure there's a reasonable plan in place for nationalization. That way if nationalization happens, we can decrease the likelihood of it being controlled by Donald Trump with few safeguards, or something like that. Maybe we could take inspiration from a less partisan organization like the Federal Reserve.
- Scare off investors. Just writing the post and having it be discussed a lot could scare them.
- Get AI companies on their best behavior. Maybe Sam Altman would finally be pushed out if congresspeople made him the poster child for why nationalization is needed.
Separately, I think it's currently pretty plausible that model weight leaks will substantially reduce the profit of AI companies by reducing their moat, and that has an effect size that seems plausible larger than the benefits of non-proliferation.
What sort of leaks are we talking about? I doubt a sophisticated hacker is going to steal weights from OpenAI just to post them on 4chan. And I doubt OpenAI's weights will be stolen by anyone except a sophisticated hacker.
If you want to reduce the incentive to develop AI, how about passing legislation to tax it really heavily? That is likely to have popular support due to the threat of AI unemployment. And it reduces the financial incentive to invest in large training runs. Even just making a lot of noise about such legislation creates uncertainty for investors.
Eliezer's response to claims about unfalsifiability, namely that "predicting endpoints is easier than predicting intermediate points", seems like a cop-out to me, since this would seem to reverse the usual pattern in forecasting and prediction, without good reason.
Note that MIRI has made some intermediate predictions. For example, I'm fairly certain Eliezer predicted that AlphaGo would go 5 for 5 against Lee Sedol, and it didn't. I would respect his intellectual honesty more if he'd registered the alleged difficulty of intermediate predictions before making them unsuccessfully.
I think MIRI has something valuable to contribute to alignment discussions, but I'd respect them more if they did a "5 Whys" type analysis on their poor prediction track record, so as to improve the accuracy of predictions going forwards. I'm not seeing any evidence of that. It seems more like the standard pattern where a public figure invests their ego in some position, then tries to avoid losing face.
With regard to the "Message and Tone" section, I mostly agree with the specific claims. But I think there is danger in taking it too far. I strongly recommend this post: https://www.lesswrong.com/posts/D2GrrrrfipHWPJSHh/book-review-how-minds-change
I'm concerned that the AI safety debate is becoming more and more polarized, sort of like US politics in general. I think many Americans are being very authentic and undiplomatic with each other when they argue online, in a way that doesn't effectively advance their policy objectives. Given how easily other issues fall into this trap, it seems reasonable on priors to expect the same for AI. Then we'll have a "memetic trench warfare" situation where you have a lot of AI-acceleration partisans who are entrenched in their position. If they can convince just one country's government to avoid cooperating with "shut it all down", your advocacy could end up doing more harm than good. So, if I were you I'd be a bit more focused on increasing the minimum level of AI fear in the population, as opposed to optimizing for the mean or median level of AI fear.
With regard to polarization, I'm much more worried about Eliezer, and perhaps Nate, than I am about Rob. If I were you, I'd make Rob spokesperson #1, and try to hire more people like him.
There are also interesting questions about whether MIRIs goals can be made to align with those of us who think that alignment is not trivial but is achievable. I'd better leave that for a separate post, as this has gotten pretty long for a "short form" post.
I'm not sure I see the conflict? If you're a longtermist, most value is in the far future anyways. Delaying AGI by 10 years to buy just a 0.1% chance improvement at aligning AI seems like a good deal. I don't agree with MIRI's strong claims, but maybe those strong claims will slow AI progress, and that would be good by my lights.
What concerns me more is that their comms will have unexpected bad effects of speeding AI progress. On the outside view: (a) their comms have arguably backfired in the past and (b) they don't seem to do much red-teaming, which I suspect is associated with unintentional harms, especially in a domain with few feedback loops.
Perhaps we should focus on alignment problems that only appear for more powerful systems, as a form of differential technological development. Those problems are harder (will require more thought to solve), and are less economically useful to solve in the near-term.
IMO it's important to keep in mind that the sample size driving these conclusions is generally pretty small. Every statistician and machine learning engineer knows that a dataset with 2 data points is essentially worthless, yet people are surprisingly willing to draw a trend line through 2 data points.
When you're dealing with small sample sizes, in my view it is better to take more of a "case study" approach than an "outside view" approach. 2 data points isn't really enough for statistical inference. However, if the 2 data points both illuminate some underlying dynamic which isn't likely to change, then the argument becomes more compelling. Basically when sample sizes are small, you need to do more inside-view theorizing to make up for it. And then you need to be careful about extrapolating to new situations, to ensure that the inside-view properties you identified actually hold in those new situations.
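A small sketch of why a trend line through 2 data points carries no statistical information: an ordinary least-squares line has two free parameters, so with two points it fits exactly and leaves zero residual degrees of freedom, whatever the points are. The function name `fit_line` and the example numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, residuals, residual_degrees_of_freedom).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Two parameters are estimated, so n - 2 degrees of freedom remain
    # for judging how noisy the relationship is.
    return slope, intercept, residuals, n - 2

# Hypothetical example: any two points with distinct x values give a perfect fit.
slope, intercept, residuals, dof = fit_line([0, 1], [3, 7])
```

With `dof == 0` there is nothing left over to estimate noise or uncertainty from, so the "trend" is just a restatement of the two points. Any confidence in extrapolating it has to come from the inside-view argument about the underlying dynamic.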
If LW takes this route, it should be cognizant of the usual challenges of getting involved in politics. I think there's a very good chance of evaporative cooling, where people trying to see AI clearly gradually leave, and are replaced by activists. The current reaction to OpenAI events is already seeming fairly tribal IMO.
So basically, I think it is a bad idea and you think we can't do it anyway. In that case let's stop calling for it, and call for something more compassionate and realistic like a public apology.
I'll bet an apology would be a more effective way to pressure OpenAI to clean up its act anyways. Which is a better headline -- "OpenAI cofounder apologizes for their role in creating OpenAI", or some sort of internal EA movement drama? If we can generate a steady stream of negative headlines about OpenAI, there's a chance that Sam is declared too much of a PR and regulatory liability. I don't think it's a particularly good plan, but I haven't heard a better one.
It is not that people people's decision-making skill is optimized such that you can consistently reverse someone's opinion to get something that accurately tracks reality. If that was the case then they are implicitly tracking reality very well already. Reversed stupidity is not intelligence.
Sure, I think this helps tease out the moral valence point I was trying to make. "Don't allow them near" implies their advice is actively harmful, which in turn suggests that reversing it could be a good idea. But as you say, this is implausible. A more plausible statement is that their advice is basically noise -- you shouldn't pay too much attention to it. I expect OP would've said something like that if they were focused on descriptive accuracy rather than scapegoating.
Another way to illuminate the moral dimension of this conversation: If we're talking about poor decision-making, perhaps MIRI and FHI should also be discussed? They did a lot to create interest in AGI, and MIRI failed to create good alignment researchers by its own lights. Now after doing advocacy off and on for years, and creating this situation, they're pivoting to 100% advocacy.
Could MIRI be made up of good people who are "great at technical stuff", yet apt to shoot themselves in the foot when it comes to communicating with the public? It's hard for me to imagine an upvoted post on this forum saying "MIRI shouldn't be allowed anywhere near AI safety communications".
Regarding the situation at OpenAI, I think it's important to keep a few historical facts in mind:
- The AI alignment community has long stated that an ideal FAI project would have a lead over competing projects. See e.g. this post:
Requisite resource levels: The project must have adequate resources to compete at the frontier of AGI development, including whatever mix of computational resources, intellectual labor, and closed insights are required to produce a 1+ year lead over less cautious competing projects.
- The scaling hypothesis wasn't obviously true around the time OpenAI was founded. At that time, it was assumed that regulation was ineffectual because algorithms can't be regulated. It's only now, when GPUs are looking like the bottleneck, that the regulation strategy seems viable.
What happened with OpenAI? One story is something like:
- AI safety advocates attracted a lot of attention in Silicon Valley with a particular story about AI dangers and what needed to be done.
- Part of this story involved an FAI project with a lead over competing projects. But the story didn't come with easy-to-evaluate criteria for whether a leading project counted as a good "FAI project" or a bad "UFAI project". Thinking about AI alignment is epistemically cursed; people who think about the topic independently rarely have similar models.
- DeepMind was originally the consensus "FAI project", but Elon Musk started OpenAI because Larry Page has e/acc beliefs.
- OpenAI hired employees with a distribution of beliefs about AI alignment difficulty, some of whom may be motivated primarily by greed or power-seeking.
- At a certain point, that distribution got "truncated" with the formation of Anthropic.
Presumably at this point, every major project thinks it's best if they win, due to self-serving biases.
Some possible lessons:
- Do more message red-teaming. If an organization like AI Lab Watch had been founded 10+ years ago, and was baked into the AI safety messaging along with "FAI project needs a clear lead", then we could've spent the past 10 years getting consensus on how to anoint one or just a few "FAI projects". And the campaign for AI Pause could instead be a campaign to "pause all AGI projects except the anointed FAI project". So -- when we look back in 10 years on the current messaging, what mistakes will seem obvious in hindsight? And if this situation is partially a result of MIRI's messaging in the past, perhaps we should ask hard questions about their current pivot towards messaging? (Note: I could be accused of grinding my personal axe here, because I'm rather dissatisfied with current AI Pause messaging.)
- Assume AI acts like a magnet for greedy power-seekers. Make decisions accordingly.
Enforcing social norms to prevent scapegoating also destroys information that is valuable for accurate credit assignment and causally modelling reality.
I read the Ben Hoffman post you linked. I'm not finding it very clear, but the gist seems to be something like: Statements about others often import some sort of good/bad moral valence; trying to avoid this valence can decrease the accuracy of your statements.
If OP was optimizing purely for descriptive accuracy, disregarding everyone's feelings, that would be one thing. But the discussion of "repercussions" before there's been an investigation goes into pure-scapegoating territory if you ask me.
I do not read any mention of a 'moral failing' in that comment.
If OP wants to clarify that he doesn't think there was a moral failing, I expect that to be helpful for a post-mortem. I expect some other people besides me also saw that subtext, even if it's not explicit.
You can be empathetic to people having flawed decision making and care about them, while also wanting to keep them away from certain decision-making positions.
"Keep people away" sounds like moral talk to me. If you think someone's decisionmaking is actively bad, i.e. you'd be better off reversing any advice from them, then maybe you should keep them around so you can do that! But more realistically, someone who's fucked up in a big way will probably have learned from that, and functional cultures don't throw away hard-won knowledge.
Imagine a world where AI is just an inherently treacherous domain, and we throw out the leadership whenever they make a mistake. So we get a continuous churn of inexperienced leaders in an inherently treacherous domain -- doesn't sound like a recipe for success!
Oh, interesting. Who exactly do you think influential people like Holden Karnofsky and Paul Christiano are accountable to, exactly? This "detailed investigation" you speak of, and this notion of a "blameless culture", makes a lot of sense when you are the head of an organization and you are conducting an investigation as to the systematic mistakes made by people who work for you, and who you are responsible for. I don't think this situation is similar enough that you can use these intuitions blandly without thinking through the actual causal factors involved in this situation.
I agree that changes things. I'd be much more sympathetic to the OP if they were demanding an investigation or an apology.
In the spirit of trying to understand what actually went wrong here -- IIRC, OpenAI didn't expect ChatGPT to blow up the way it did. Seems like they were playing a strategy of "release cool demos" as opposed to "create a giant competitive market".
I downvoted this comment because it felt uncomfortably scapegoat-y to me. If you think the OpenAI grant was a big mistake, it's important to have a detailed investigation of what went wrong, and that sort of detailed investigation is most likely to succeed if you have cooperation from people who are involved. I've been reading a fair amount about what it takes to instill a culture of safety in an organization, and nothing I've seen suggests that scapegoating is a good approach.
Writing a postmortem is not punishment—it is a learning opportunity for the entire company.
...
Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.
Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every "mistake" is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
...
Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization [Boy13].
...
We can say with confidence that thanks to our continuous investment in cultivating a postmortem culture, Google weathers fewer outages and fosters a better user experience.
https://sre.google/sre-book/postmortem-culture/
If you start with the assumption that there was a moral failing on the part of the grantmakers, and you are wrong, there's a good chance you'll never learn that.
endorsing getting into bed with companies on-track to make billions of dollars profiting from risking the extinction of humanity in order to nudge them a bit
Wasn't OpenAI a nonprofit at the time?
adversarial dynamics present in such a strategy
Are you just referring to the profit incentive conflicting with the need for safety, or something else?
I'm struggling to see how we get aligned AI without "inside game at labs" in some way, shape, or form.
My sense is that evaporative cooling is the biggest thing which went wrong at OpenAI. So I feel OK about e.g. Anthropic if it's not showing signs of evaporative cooling.
I skimmed the blog post you linked, and it seems like the evidence is compatible with NO also helping elsewhere in your body, beyond just the nasal area?
Beets may increase NO. Canned beets are cheap, convenient, and surprisingly easy to find in stores. I like this smoothie.
I suppose this could also be an argument for varying the pitch of your hum.