Comments
I'll register a prediction here: TurnTrout is trying to say that while, counterfactually, if we had an algorithm that reasons about training, it would achieve low loss, it's not obviously true that such algorithms are actually "achievable" for SGD in some "natural" setting.
You can just have a model with the capabilities of the smartest human hacker that exfiltrates itself, hacks the 1-10% of the world's computing power with the shittiest protection, distributes itself Rosetta@home-style, and bootstraps whatever takeover plan it likes using sheer brute force. That said, I see no reason for capabilities to land exactly on the point "smartest human hacker", because there is nothing special about this point; it can be 2x, 5x, or 10x that, without any need to become 1000000x within a second.
> I'm pretty optimistic about our white box alignment methods generalizing fine.
And I still don't get why! I would like to see your theory of generalization in deep learning that allows such a level of optimism; "gradient descent is powerful" simply doesn't capture it.
> It’s striking how well these black box alignment methods work
I should note that human alignment methods work only because no human in history could suddenly start designing nanotech in their head or treating other humans as buggy, manipulable machinery. I think there are plenty of humans around who would want to become mind-controlling dictators given the possibility, or who are generally nice but would give in to temptation.
It mostly sounds like "LLMs don't scale into scary things", not "deceptive alignment is unlikely".
If that's true, why is the shutdown problem not solved? Even if it's true that any behavior can be represented as expected utility maximization, it's at least not trivial.
You can publish a hash of this question.
The trick with FDT is that FDT agents never receive the letter and never pay. The FDT payoff is p*(-1000000), where p is the probability of infestation. The EDT payoff is p*(-1000000) + (1-p)*(-1000), which seems to me to speak for itself.
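To spell out the arithmetic from the two expressions above (nothing here beyond the numbers already stated): E[FDT] − E[EDT] = p*(-1000000) − [p*(-1000000) + (1-p)*(-1000)] = (1-p)*1000 ≥ 0, so the never-pay policy does at least as well for every p, and strictly better whenever p < 1.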
I feel like... no, it is not very interesting; it seems pretty trivial? We (agents) have goals, and we have relationships between them, like "priorities"; we sometimes abandon low-priority goals in favor of higher-priority ones. We can also have meta-goals like "what should my system of goals look like", "how to abandon and adopt intermediate goals in a reasonable way", and "how to reflect on goals", and future superintelligent systems will probably have something like that. All of this seems to me to come packaged with the concept of a "goal".
"A reasonably powerful world model" would correctly predict that being FDT agent is better than being EDT and modify itself into FDT agent, because there are several problems where both CDT and EDT fail in comparison with FDT (see original FDT paper).
Like this?
What about things like fun, happiness, eudaimonia, meaning? I certainly think that, excluding brain damage or very advanced brainwashing, you are not going to eat babies or turn planets into paperclips.
In the real world, most variables correlate with each other. If you take the action that correlates most with high utility, you are going to throw away a lot of resources.
"Estimate overconfidence" implies that estimate can be zero!
I think the problem here is distinguishing between terminal and instrumental goals? Most people probably don't run an apple pie business because they have terminal goals about apple pie businesses. They probably want money and status, want to be useful, and want to provide for their families, and I expect these goals to be very persistent and self-preserving.
How about: https://arxiv.org/abs/2102.04518
"Under current economic incentives and structure" we can have only "no alignment". I was talking about rosy hypotheticals. My point was "either we are dead or we are sane enough to stop, find another way and solve problem fully". Your scenario is not inside the set of realistic outcomes.
You are conflating two definitions of alignment: "notkilleveryoneism" and "ambitious CEV-style value alignment". If you have only the first type of alignment, you don't use it to produce good art; you use it for something like "augment human intelligence so we can solve the second type of alignment". If your ASI is aligned in the second sense, it is going to deduce that humans wouldn't like being coddled without the ability to develop their own culture, so it will probably just sprinkle inspiring examples of art here and there for us and develop various mind-boggling sources of beauty like telepathy and qualia-tuning.
I should note that while your attitude is understandable, the event "Roko said his confident predictions out loud" is actually good, because we can evaluate his overconfidence and update our models accordingly.
"The board definitely isn't planning this" is not the same as "the board have zero probability of doing this". It can be "the board would do this if you apply enough psychological pressure through media".
I think these are, judging from the available info, kinda two opposite stories? The problem with SBF was that nobody inside EA was in a position to tell him "you are an asshole who steals clients' money, you are fired".
More generally, any attempt to do something more effective will blow up a lot of things, because trying to do something more effective than business-as-usual is an out-of-distribution problem, and you can't simply choose not to go outside the distribution.
This is so boring that it's begging for a response of "Yes, We Have Noticed The Skulls".
It's a problem in the sense that you need to make your systems either weaker or very expensive (in terms of alignment tax; see, for example, davidad's Open Agency Architecture) relative to unconstrained systems.
"A source close to Altman" means "Altman" and I'm pretty sure that he is not very trustworthy party at the moment.
The main problem I see here is the generality of some of the more powerful niches. For example, a nanotech-designer AI can start by thinking only about molecular structures, but eventually it stumbles upon situations like "I need to design a nanotech swarm that is aligned with the constructor's goal", or "what if I am a pile of computing matter that was created by other nanotech swarms (technically, all multicellular life is a multitude of nanotech swarms)?", or "what if my goal is not aligned with the goal of the nanotech swarm that created me?", etc.
Their Responsible AI team was in pretty bad shape after the recent layoffs. I think Facebook just decided to cut costs.
I feel sceptical about interpretability primarily because of the following: imagine that you have a neural network that does useful superintelligent things because it "cares about humans". We found the Fourier transform in the modular addition network because we already knew what a Fourier transform is. But we have a veeeery limited understanding of what "caring about humans" is from a mathematical standpoint.
I should note that it's an LM, not an LLM.
> to be as corrigible as it rationally, Bayesianly should be
I can't parse this as a meaningful statement. Corrigibility is about alignment, not about the degree to which a being is rational.
The problem is simple: we have zero chance of building a competent value learner on the first try, and failed attempts can bring you S-risks. So you shouldn't try to build a value learner on the first try; instead, build something small that can just superhumanly design nanotech and doesn't think about inconvenient topics like "other minds".
A competent value learner is not corrigible. A competent value learner will read the entire internet, build a model of human preferences, build nanotech, spread nanobot clouds all over the world to cure everyone of everything, and read everyone's mind to create an accurate picture of the future utopia. It won't be interested in anything you can say, because it will be capable of predicting you with 99.999999999% accuracy. And if you say something like "these nanobot clouds look suspicious, I should shut down the AI and check its code again", it won't let you, because every minute it doesn't spread healing nanobots is another ten dead children.
The point of corrigibility is exactly this: if you fail to build a value learner, you can at least shut it down and try again.
That's a direct reference!
I think that if you're going to mention cars and planes, you should read this.
If you can write a prompt for GPT-2000 such that the completion of this prompt results in an aligned pivotal act, you can just use the knowledge necessary for writing this prompt to Just Build an aligned ASI, with no need to use GPT-2000.
If you can't write a program that produces aligned output (under whatever definition of alignment you use) when run on an unphysically large computer, you can't deduce from the training data or the weights of a superintelligent neural network whether it produces aligned output.
Let's suppose that your model takes a bad action. Why? Either the model is aligned but incapable of deducing the good action, or the model is misaligned and incapable of deducing the deceptively good action. In both cases, the gradient update provides information about capabilities, not about alignment. The hypothetical homunculus doesn't need to be "immune"; it isn't affected in the first place.
The other way around: let's suppose that you observe the model taking a good action. Why? It can be an aligned model taking a genuinely good action, or it can be a misaligned model taking a deceptive action. In both cases you observe capabilities, not alignment.
The problem here is not the prior over aligned/deceptive models (unless you think that this prior requires less than 1 bit to specify an aligned model, at which point I'd say optimism departs from sanity); the problem is our lack of understanding of the updates which presumably should cause the model to be aligned. Maybe prosaic alignment works, maybe it doesn't; we don't know how to check.
The main problem I have with this type of reasoning is the arbitrarily drawn ontological boundaries. Why is IGF "not real" while the ML objective function is "real", when, if we really zoom in on the training process, the real training goal, verifiable in a brutally positivist way, is "whatever direction in parameter space decreases the loss on the current batch of data", which seems to me to correspond pretty well to "whatever traits are spreading in the current environment"?
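For concreteness (the notation is mine, not from the original comment), that training-level goal is just the minibatch gradient step:

$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t;\, B_t),$$

where $B_t$ is the current batch and $\eta$ the learning rate; nothing in this update refers to any higher-level target.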
Compare two pairs of statements: "evolution optimizes for IGF" vs. "evolution optimizes for near-random traits", and "we optimize for aligned models" vs. "we optimize for models which get good metrics on the training dataset".
If you are warm, any heat detectors inside your body will mostly detect you. Imagine if the blood vessels in your own eye radiated in the visible spectrum with the same intensity as the daylight environment.
The only reason this could be impossible is if the amount of compute needed to run one smart-as-the-smartest-human model is so huge that we would need to literally disassemble the Earth to run 100,000 copies. That's quite an unrealistic scenario, because the equivalent amount of compute for an actual human fits in an actual small cranium.
There is definitely enough matter on Earth to sustain an additional 100k human brains with a signal speed of 1000 m/s instead of 100 m/s? I actually can't imagine how our understanding of physics would have to be wrong for that to be impossible.
Such a possibility is explored at least here: https://arxiv.org/abs/2305.17066, but that's not the point. The point is: even in a hypothetical world where scaling laws and algorithmic progress hit a wall at smartest-human level, you can do this and get an arbitrary level of intelligence. In the real world, of course, there are better ways.
Thermal vision for warm-blooded animals has obvious problems with noise.
Even if we only have smartest-human-level models, you can spawn 100,000 copies at 10x speed and organize them along the lines of "one model checks whether the output of another model displays cognitive biases", and get maybe not "design nanotech in 10 days" level, but still something smarter than any organized group of humans.
IIRC, you can only get a post on the Alignment Forum if you are invited or the moderators crosspost it? The problem is that the Alignment Forum is deliberately for professionals of some sort, but everyone wants to write about alignment. Maybe it would be better if we had an "Alignment Forum for starters".
I think the phrase "goal misgeneralization" is the wrong framing, because it gives the impression that the system is making an error, rather than that you chose an ambiguous way to put values into your system.
This is a meta-point, but I find it weird that you ask what "caring about something" is according to CS but don't ask what "corrigibility" is, despite the existence of multiple examples of goal-oriented systems and some relatively good formalisms (we disagree about whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is purely a product of the imagination of one particular Eliezer Yudkowsky, born from an attempt to imagine a system that doesn't care about us but still behaves nicely under some vaguely restricted definition of niceness.

We don't have any examples of corrigible systems in nature, and we have a consistent record of failed attempts to formalize even relatively simple instances of corrigibility, like shutdownability. I think the likely answer to "why should I expect corrigibility to be unlikely" sounds like "there is no simple description of corrigibility to which our learning systems can easily generalize, and there is no reason to expect a simple description to exist".
- "Output next token that has maximum probability according to your posterior distribution given prompt" is a literally an optimization problem. This problem gains huge benefits if system that tries to solve it is more coherent.
- Strictly speaking, LLMs can't be "just PDF calculators". Direct calculation of the probability distribution over that amount of data is computationally intractable (or we would have had GPTs in the golden era of Bayesian models). Actual algorithms have to contain a bazillion shortcuts and approximations, and "having an agent inside the system" is as good a shortcut as anything else.
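A minimal sketch of the selection step from the first bullet, assuming a hypothetical array of per-token scores coming from some model (the names and sizes here are illustrative, not any particular library's API):

```python
import numpy as np

def greedy_next_token(logits: np.ndarray) -> int:
    """Pick the token whose probability under the model's distribution is maximal."""
    # softmax is monotone, so the argmax over logits equals the argmax over probabilities
    return int(np.argmax(logits))

# Illustrative use with random scores standing in for real model output:
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=50_257)  # roughly a GPT-2-sized vocabulary
print(greedy_next_token(fake_logits))
```

Greedy selection is only one way to solve this per-step problem, but it makes clear that each step is an argmax, i.e. an optimization.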
Okay, let's break this down.
- Inner misalignment is when we have an "objective function" (reward, loss function, etc.), select systems that produce better results according to this function (using evolutionary search, SGD, etc.), and the resulting system doesn't produce actions which optimize this objective function. The most obvious example of inner misalignment is an RL-trained agent that doesn't maximize reward.
- Your argument against the possibility of inner misalignment is, basically, "SGD is such a powerful optimizer that no matter what, it will drag the system towards the minimum of the loss function". Let's suppose this is true.
- We don't have a "good" outer objective function, defined over the training data, such that, given an observation and an action, this function scores the action higher if the action is actually better given the observation. Instead, we have outer functions that favor things like good predictions and outputs receiving a high score from a human/AI overseer.
- If you have some alignment benchmark, you can't see the difference between a superhumanly capable aligned system and a deceptively aligned one. They both give you correct answers, because they are both superhumanly capable.
- Because they give you the same correct answers, the loss function assigns minimal values to their outputs. They are both either inside a local minimum or on a flat basin of the loss landscape (a toy illustration of this is below, after this comment).
- Therefore, you don't need inner misalignment to get deceptive alignment.
Arguments about inner misalignment work as arguments for optimism only inside the "outer/inner alignment" framework, in its deep learning version. If we had a good outer loss function, such that closer to the minimum means better, then yes, our worries should be about weird inner misalignment issues. But we don't have a good outer loss function, so we kinda should hope for inner misalignment.
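A toy illustration of the bullets about benchmarks and the loss function, with made-up names and data (it is just the tautology that a loss defined over outputs cannot distinguish two systems whose outputs coincide on the evaluation set):

```python
def benchmark_loss(answers: list[str], correct: list[str]) -> float:
    """The loss sees only the outputs, not how they were produced."""
    return sum(a != c for a, c in zip(answers, correct)) / len(correct)

correct = ["a1", "a2", "a3"]

# Hypothetical stand-ins: both systems answer every benchmark question correctly.
aligned_outputs = ["a1", "a2", "a3"]    # genuinely aligned system
deceptive_outputs = ["a1", "a2", "a3"]  # deceptive system playing along

assert benchmark_loss(aligned_outputs, correct) == benchmark_loss(deceptive_outputs, correct) == 0.0
```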
Fortunately, no! But I would be really upset if I didn't read it earlier.
Spoilers can be put in markdown using syntax like this:
:::spoiler
TEXT
:::
I'm talking about observable evidence, like transhumanists claiming they will drop their biological bodies at the first opportunity.