we'll elide all of the subtle difficulties involved in actually getting RL to work in practice
I haven't properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.
The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings of fewer than n tokens, the transition is deterministic with s' = append(s, a), and reward is given to states ending in a stop token, based on some ground-truth verifier like unit tests or formal verification.
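A minimal sketch of that MDP, just to make the pieces concrete. The `STOP` token, the `StringMDP` class, and the toy verifier are all illustrative assumptions, not anything from the post being replied to; states are represented as tuples of tokens.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

STOP = "<stop>"  # hypothetical stop token


@dataclass
class StringMDP:
    verifier: Callable[[str], float]  # ground-truth check, e.g. unit tests
    max_action_tokens: int = 8        # the bound n on action length

    def step(self, state: Tuple[str, ...], action: Tuple[str, ...]):
        """Deterministic transition: s' = append(s, a)."""
        if not 0 < len(action) < self.max_action_tokens:
            raise ValueError("action must contain between 1 and n-1 tokens")
        next_state = state + action
        done = next_state[-1] == STOP
        # Reward is only given on states that end with the stop token.
        reward = self.verifier(" ".join(next_state[:-1])) if done else 0.0
        return next_state, reward, done


def toy_verifier(text: str) -> float:
    # Stand-in for unit tests: reward 1 only for the expected "program".
    return 1.0 if text == "return a + b" else 0.0


mdp = StringMDP(verifier=toy_verifier)
s, r, done = mdp.step((), ("return", "a"))
s, r, done = mdp.step(s, ("+", "b", STOP))
```

Everything hard about RL in practice lives outside this class (the policy, credit assignment, exploration), which is the point of the question.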
The third virtue of rationality, lightness, is wrong. In fact: the more you value information to change your mind on some question, the more obstinate you should be to changing your mind on that question. Lightness implies disinterest in the question.
Imagine your mind as a logarithmic market-maker which assigns some initial subsidy $b$ to any new question $X$. This subsidy parameter captures your marginal value for information on $X$. But it also measures how hard it is to change your mind: the cost of moving your probability from $p$ to $p'$ is $b \ln \frac{1-p}{1-p'}$.
What would this imply in practice? It means that each individual “trader” (both internal mental heuristics/thought patterns, and external sources of information/other people) will generally have a smaller influence on your beliefs, as they may not have enough wealth. Traders who influence your belief will carry greater risk (to their influence on you in future), though will also earn more reward if they’re right.
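A rough numerical sketch of the "more subsidy means more obstinate" point, assuming the standard binary LMSR cost function C(q) = b·log(exp(q_yes/b) + exp(q_no/b)). The function names and the particular numbers are illustrative assumptions.

```python
import math


def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """Standard LMSR cost function for a binary question."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))


def cost_to_move(p_from: float, p_to: float, b: float) -> float:
    """What a trader pays to push the YES probability from p_from up to p_to
    by buying YES shares (assumes p_to > p_from)."""
    return b * math.log((1 - p_from) / (1 - p_to))


# Moving the belief from 0.5 to 0.9 under two different subsidies.
cheap = cost_to_move(0.5, 0.9, b=10)
expensive = cost_to_move(0.5, 0.9, b=20)

# Cross-check against the cost function directly: starting from q = (0, 0),
# a price of 0.9 is reached at q_yes = b * log(0.9 / 0.1).
direct = lmsr_cost(10 * math.log(9), 0.0, 10) - lmsr_cost(0.0, 0.0, 10)
```

The cost is linear in b, so doubling the subsidy doubles what any single trader has to stake to move you the same distance.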
I don't understand. The hard problem of alignment/CEV/etc. is that it's not obvious how to scale intelligence while "maintaining" utility function/preferences, and this still applies for human intelligence amplification.
I suppose this is fine if the only improvement you can expect beyond human-level intelligence is "processing speed", but I would expect superhuman AI to be more intelligent in a variety of ways.
Something that seems like it should be well-known, but I have not seen an explicit reference for:
Goodhart’s law can, in principle, be overcome via adversarial training (or, more generally, learning multi-agent systems), aka “The enemy is smart.”
Goodhart’s law only really applies to a “static” objective, not when the objective is the outcome of a game with other agents who can adapt.
This doesn’t really require the other agents to act in a way that continuously “improves” the training objective either; it just requires them to be able to constantly throw adversarial examples at the agent, forcing it to “generalize”.
In particular, I think this is the basic reason why any reasonable Scalable Oversight protocol would be fundamentally “multi-agent” in nature (like Debate).
I think only particular reward functions, such as in multi-agent/co-operative environments (agents can include humans, like in RLHF) or in actually interactive proving environments?
There is a cliche that there are two types of mathematicians: "theory developers" and "problem solvers". Similarly, Robin Hanson divides the production of knowledge into "framing" and "filling".
It seems to me there are actually three sorts of information in the world:
- "Ideas": math/science theories and models, inventions, business ideas, solutions to open-ended problems
- "Answers": math theorems, experimental observations, results of computations
- "Proofs": math proofs, arguments, evidence, digital signatures, certifications, reputations, signalling
From a strictly Bayesian perspective, there seems to be no "fundamental" difference between these forms of information. They're all just things you condition your prior on. Yet this division seems to be natural in quite a variety of informational tasks. What gives?
adding this from replies for prominence--
Yes, I also realized that "ideas" being a thing is due to bounded rationality -- specifically they are the outputs of AI search. "Proofs" are weirder though, and I haven't seen them distinguished very often. I wonder if this is a reasonable analogy to make:
- Ideas : search
- Answers : inference
- Proofs: alignment
Just realized in logarithmic market scoring the net number of stocks is basically just log-odds, lol:
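One way to see this, assuming the standard binary LMSR with liquidity parameter $b$ and outstanding share counts $q_{\text{yes}}, q_{\text{no}}$ (this notation is an assumption, not from the original comment):

```latex
p = \frac{e^{q_{\text{yes}}/b}}{e^{q_{\text{yes}}/b} + e^{q_{\text{no}}/b}}
\quad\Longrightarrow\quad
\log\frac{p}{1-p} = \frac{q_{\text{yes}} - q_{\text{no}}}{b}
```

So the net number of shares outstanding, in units of $b$, is exactly the log-odds of the market price.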
Your claims about markets seem just wrong to me. Markets generally do what their consumers want, and their failures are largely the result of transaction costs. Some of these transaction costs have to do with information asymmetry (which needs to be solved), but many others that show up in the real world (related to standard problems like negative externalities etc.) can just be removed by construction in virtual markets.
Markets are fundamentally driven by the pursuit of defined rewards or currencies, so in such a system, how do we ensure that the currency being optimized for truly captures what we care about
By having humans be the consumers in the market. Yes, it is possible to "trick" the consumers, but the idea is that if any oversight protocol is possible at all, then the consumers will naturally buy information from there, and AIs will learn to expect this changing reward function.
MIRI has been talking about it for years; the agent foundations group has many serious open problems related to it.
Can you send me a link? The only thing on "markets in an alignment context" I've found on this from the MIRI side is the Wentworth-Soares discussion, but that seems like a very different issue.
it can be confidently known now that the design you proposed is catastrophically misaligned
Can you send me a link for where this was confidently shown? This is a very strong claim to make, nobody even makes this claim in the context of backprop.
I don't think that AI alignment people doing "enemy of enemy is friend" logic with AI luddites (i.e. people worried about Privacy/Racism/Artists/Misinformation/Jobs/Whatever) is useful.
Alignment research is a luxury good for labs, which means it would be the first thing axed (hyperbolically speaking) if you imposed generic hurdles/costs on their revenue, or if you made them spend on mitigating P/R/A/M/J/W problems.
This "crowding-out" effect is already happening to a very large extent: there are vastly more researchers and capital being devoted to P/R/A/M/J/W problems, which could have been allocated to actual alignment research! If you are forming a "coalition" with these people, you are getting a very shitty deal -- they've been much more effective at getting their priorities funded than you have been!
If you want them to care about notkilleveryoneism, you have to specifically make it expensive for them to kill everyone, not just untargetedly "oppose" them. E.g. like foom liability.
Why aren't adversarial inputs used more widely for captchas?
- Different models have different adversarial examples?
- There are only a few known adversarial examples for a given model (discovering them takes time), and they can easily just be manually enumerated?
I have no idea what to make of the random stray downvotes
The simplest way to explain "the reward function isn't the utility function" is: humans evolved to have utility functions because it was instrumentally useful for the reward function / evolution selected agents with utility functions.
(yeah I know maybe we don't even have utility functions; that's not the point)
Concretely: it was useful for humans to have feelings and desires, because that way evolution doesn't have to spoonfeed us every last detail of how we should act, instead it gives us heuristics like "food smells good, I want".
Evolution couldn't just select a perfect optimizer of the reward function, because there is no such thing as a perfect optimizer (computational costs mean that a "perfect optimizer" is actually uncomputable). So instead it selected agents that were boundedly optimal given their training environment.
The use of "Differential Progress" ("does this advance safety more or capabilities more?") by the AI safety community to evaluate the value of research is ill-motivated.
Most capabilities advancements are not very counterfactual ("some similar advancement would have happened anyway"), whereas safety research is. In other words: differential progress measures absolute rather than comparative advantage / disregards the impact of supply on value / measures value as the y-intercept of the demand curve rather than the intersection of the demand and supply curves.
Even if you looked at actual market value, just p_safety > p_capabilities isn't a principled condition.
Concretely, I think that harping on differential progress risks AI safety getting crowded out by harmless but useless work: most obviously "AI bias" and "AI disinformation", and, in my more controversial opinion, overly prosaic AI safety research which will not give us any insights that can be generalized beyond current architectures. A serious solution to AI alignment will in all likelihood involve risky things like imagining more powerful architectures and revealing some deeper insights about intelligence.
I think EY once mentioned it in the context of self-awareness or free will or something, and called it something like "complete epistemological panic".
Abstraction is like economies of scale
One thing I'm surprised by is how everyone learns the canonical way to handwrite certain math characters, despite learning most things from printed or electronic material. E.g. writing $\mathbb{R}$ as IR rather than how it's rendered.
I know I learned the canonical way because of Khan Academy, but I don't think "guy handwriting on a blackboard-like thing" is THAT disproportionately common among educational resources?
Oh right, lol, good point.
I used to have an idea for a karma/reputation system: repeatedly recalculate karma weighted by the karma of the upvoters and downvoters on a comment (then normalize to avoid hyperinflation) until a fixed point is reached.
I feel like this is vaguely somehow related to:
- AlphaGoZero
- Humans Consulting HCH
- Wealth in markets
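The karma recalculation described above can be sketched as a power iteration. This is only one possible reading: the `base` term (a small amount of karma everyone gets regardless of votes), the clamping of negative scores to zero, and the vote format are all assumptions added so that the iteration is well defined, and convergence is not guaranteed for arbitrary vote graphs.

```python
def karma_fixed_point(votes, n_users, base=0.1, max_iters=1000, tol=1e-12):
    """votes: list of (voter, author, sign) tuples, sign being +1 or -1.

    Repeatedly recompute each author's karma as the karma-weighted sum of
    the votes they received, then normalize so total karma stays at 1.
    """
    karma = [1.0 / n_users] * n_users
    for _ in range(max_iters):
        raw = [base] * n_users
        for voter, author, sign in votes:
            raw[author] += sign * karma[voter]
        raw = [max(r, 0.0) for r in raw]  # assumption: karma cannot go negative
        total = sum(raw)
        if total == 0.0:
            return [1.0 / n_users] * n_users
        new = [r / total for r in raw]  # normalize to avoid hyperinflation
        if max(abs(a - b) for a, b in zip(new, karma)) < tol:
            return new
        karma = new
    return karma


# User 0 and user 2 upvote user 1; user 1 upvotes user 2; nobody upvotes user 0.
votes = [(0, 1, +1), (2, 1, +1), (1, 2, +1)]
karma = karma_fixed_point(votes, n_users=3)
```

In this toy example user 1 ends up with the most karma, user 2 next, and user 0 (who receives no votes) the least, which is structurally the same computation as PageRank with damping.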
it's extremely high immediate value -- it solves IP rights entirely.
It's the barbed wire for IP rights
quick thoughts on LLM psychology
LLMs cannot be directly anthropomorphized. Though something like “a program that continuously calls an LLM to generate a rolling chain of thought, dumps memory into a relational database, can call from a library of functions which includes dumping to and recalling from that database, and receives inputs that are added to the LLM context” is much more agent-like.
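To make the "program wrapped around an LLM" picture a bit more concrete, here is a toy sketch. The `stub_llm` function is a hypothetical stand-in for a real model call, and the `CALL name arg` convention, the `Agent` class, and the table layout are all made up for illustration.

```python
import sqlite3
from typing import Callable, List, Optional


def stub_llm(context: List[str]) -> str:
    """Hypothetical stand-in for an LLM: emits a thought or a function call."""
    last = context[-1] if context else ""
    if last.startswith("INPUT:"):
        return "CALL remember " + last[len("INPUT:"):].strip()
    return "THOUGHT: nothing new"


class Agent:
    def __init__(self, llm: Callable[[List[str]], str], window: int = 4):
        self.llm = llm
        self.window = window           # rolling chain-of-thought length
        self.context: List[str] = []
        self.db = sqlite3.connect(":memory:")  # the relational memory store
        self.db.execute("CREATE TABLE memory (note TEXT)")
        self.functions = {"remember": self.remember, "recall": self.recall}

    def remember(self, note: str) -> str:
        self.db.execute("INSERT INTO memory VALUES (?)", (note,))
        return "stored"

    def recall(self, query: str) -> str:
        rows = self.db.execute(
            "SELECT note FROM memory WHERE note LIKE ?", (f"%{query}%",)
        ).fetchall()
        return "; ".join(row[0] for row in rows)

    def step(self, user_input: Optional[str] = None) -> str:
        if user_input is not None:
            self.context.append("INPUT: " + user_input)
        out = self.llm(self.context[-self.window:])
        self.context.append(out)
        if out.startswith("CALL "):
            _, name, arg = out.split(" ", 2)
            self.context.append("RESULT: " + self.functions[name](arg))
        return out


agent = Agent(stub_llm)
agent.step("the sky is blue")   # the stub decides to store the input
agent.step()                    # no new input: just another thought
```

The agent-like properties (persistence, memory, acting on its own outputs) come from the loop and the database, not from the model call itself.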
Humans evolved feelings as signals of cost and benefit — because we can respond to those signals in our behaviour.
These feelings add up to a “utility function”, something that is only instrumentally useful to the training process. I.e. you can think of a utility function as itself a heuristic taught by the reward function.
LLMs certainly do need cost-benefit signals about features of text. But I think their feelings/utility functions are limited to just that.
E.g. LLMs do not experience the feeling of “mental effort”. They do not find some questions harder than others, because the energy cost of cognition is not a useful signal to them during the training process (I don’t think regularization counts for this either).
LLMs also do not experience “annoyance”. They don’t have the ability to ignore or obliterate a user they’re annoyed with, so annoyance is not a useful signal to them.
Ok, but aren’t LLMs capable of simulating annoyance? E.g. if annoying questions are followed by annoyed responses in the dataset, couldn’t LLMs learn to experience some model of annoyance so as to correctly reproduce the verbal effects of annoyance in their responses?
More precisely, if you just gave an LLM the function `ignore_user()` in its function library, it would run it when “simulating annoyance” even though ignoring the user wasn’t useful during training, because it’s playing the role.
I don’t think this is the same as being annoyed, though. For people, simulating an emotion and feeling it are often similar due to mirror neurons or whatever, but there is no reason to expect this is the case for LLMs.
current LLMs vs dangerous AIs
Most current "alignment research" with LLMs seems indistinguishable from "capabilities research". Both are just "getting the AI to be better at what we want it to do", and there isn't really a critical difference between the two.
Alignment in the original sense was defined oppositionally to the AI's own nefarious objectives. Which LLMs don't have, so alignment research with LLMs is probably moot.
something related I wrote in my MATS application:
- I think the most important alignment failure modes occur when deploying an LLM as part of an agent (i.e. a program that autonomously runs a limited-context chain of thought from LLM predictions, maintains a long-term storage, calls functions such as search over storage, self-prompting and habit modification either based on LLM-generated function calls or as cron-jobs/hooks).
- These kinds of alignment failures are (1) only truly serious when the agent is somehow objective-driven or equivalently has feelings, which current LLMs have not been trained to be (I think that would need some kind of online learning, or learning to self-modify) (2) can only be solved when the agent is objective-driven.
conditionalization is not the probabilistic version of implies
| P | Q | Q \| P | P → Q |
|---|---|---|---|
| T | T | T | T |
| T | F | F | F |
| F | T | N/A | T |
| F | F | N/A | T |
Resolution logic for conditionalization: Q if P else None
Resolution logic for implies: Q if P else True
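Following the truth table above, the two resolution rules can be written out as functions. The function names are illustrative, and `None` stands in for "N/A" (the conditional market is voided rather than resolved).

```python
from typing import Optional


def resolve_conditional(p: bool, q: bool) -> Optional[bool]:
    """Q | P: resolves to Q when P holds, otherwise N/A (None)."""
    return q if p else None


def resolve_implies(p: bool, q: bool) -> bool:
    """P -> Q: resolves to Q when P holds, otherwise vacuously True."""
    return q if p else True


# Print the full table: P, Q, Q|P, P->Q
for p in (True, False):
    for q in (True, False):
        print(p, q, resolve_conditional(p, q), resolve_implies(p, q))
```

The two only differ on the rows where P is false: the implication pays out as true there, whereas the conditional says nothing at all.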
I think that the philosophical questions you're describing actually evaporate and turn out to be meaningless once you think enough about them, because they have a very anthropic flavour.
I don't think that's exactly true. But why do you think that follows from what I wrote?
That's syntax, not semantics.
It's really not, that's the point I made about semantics.
Eh that's kind-of right, my original comment there was dumb.
You overstate your case. The universe contains a finite amount of incompressible information, which is strictly less than the information contained in . That self-reference applies to the universe is obvious, because the universe contains computer programs.
The point is the universe is certainly a computer program, and that incompleteness applies to all computer programs (to all things with only finite incompressible information). In any case, I explained Godel with an explicitly empirical example, so I'm not sure what your point is.
I agree, and one could think of this in terms of markets: a market cannot capture all information about the world, because it is part of the world.
But I disagree that this is fundamentally unrelated -- here too the issue is that it would need to represent states of the world corresponding to what belief it expresses. Ultimately mathematics is supposed to represent the real world.
No, it doesn't. There is no 1/4 chance of anything once you've found yourself in Room A1.
You do acknowledge that the payout for the agent in room B (if it exists) from your actions is the same as the payout for you from your own actions, which if the coin came up tails is $3, yes?
I don't understand what you are saying. If you find yourself in Room A1, you simply eliminate the last two possibilities so the total payout of Tails becomes 6.
If you find yourself in Room A1, you do find yourself in a world where you are allowed to bet. It doesn't make sense to consider the counterfactual, because you already have gotten new information.
That's not important at all. The agents in rooms A1 and A2 themselves would do better to choose tails than to choose heads. They really are being harmed by the information.
I see, that is indeed the same principle (and also simpler/we don't need to worry about whether we "control" symmetric situations).
I don't think this is right. A superrational agent exploits the symmetry between A1 and A2, correct? So it must reason that an identical agent in A2 will reason the same way as it does, and if it bets heads, so will the other agent. That's the point of bringing up EDT.
Wait, but can't the AI also choose to adopt the strategy "build another computer with a larger largest computable number"?
I don't understand the significance of using a TM -- is this any different from just applying some probability distribution over the set of actions?
Suppose the function U(t) is increasing fast enough, e.g. if the probability of reaching t is exp(-t), then let U(t) be exp(2t), or whatever.
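To spell out why this choice is a problem (a worked step, assuming the agent picks a stopping time $t$ in advance and only collects $U(t)$ if it actually reaches $t$):

```latex
\mathbb{E}[U \mid \text{stop at } t]
  = \Pr(\text{reach } t)\, U(t)
  = e^{-t} \cdot e^{2t}
  = e^{t}
```

This is strictly increasing in $t$, so every finite stopping time is beaten by a later one and the supremum is never attained.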
I don't think the question can be dismissed that easily.
It does not require infinities. E.g. you can just reparameterize the problem to the interval (0, 1), see the edited question. You just require an infinite set.
Infinite t does not necessarily deliver infinite utility.
Perhaps it would be simpler if I instead let t be in (0, 1], and U(t) = {t if t < 1; 0 if t = 1}.
It's the same problem, with 1 replacing infinity. I have edited the question with this example instead.
(It's not a particularly weird utility function -- consider, e.g. if the agent needs to expend a resource such that the utility from expending the resource at time t is some fast-growing function f(t). But never expending the resource gives zero utility. In any case, an adversarial agent can always create this situation.)
I see. So the answer is that it is indeed true that Godel's statement is true in all models of second-order PA, but unprovable nonetheless since Godel's completeness theorem isn't true for second-order logic?
This seems to be relevant to calculations of climate change externalities, where the research is almost always based on the direct costs of climate change if no one modified their behaviour, rather than the cost of building a sea wall, or planting trees.
Disagree. Daria considers the colour of the sky an important issue because it is socially important, not because it is of actual cognitive importance. Ferris recognizes that it doesn't truly change much about his beliefs, since their society doesn't have any actual scientific theories predicting the colour of the sky (if they did, the alliances would not be on uncorrelated issues like taxes and marriage), and bothers with things he finds to be genuinely more important.
I'm not sure your interpretation of logical positivism is what the positivists actually say. They don't argue against having a mental model that is metaphysical, they point out that this mental model is simply a "gauge", and that anything physical is invariant under changes of this gauge.
Interesting. Did they promise to do so beforehand?
In any case, I'm not surprised the Soviets did something like this, but I guess the point is really "Why isn't this more widespread?" And also: "why does this not happen with goals other than staying in power?" E.g. why has no one tried to pass a bill that says "Roko condition AND we implement this-and-this policy". Because otherwise it seems that the stuff the Soviets did was motivated by something other than Roko's basilisk.
But that's not Roko's basilisk. Whether or not you individually vote for the candidate does not affect you as long as the candidate wins.