Posts
Comments
Which part of LLM? Shoggoth or simulacra? As I see it, there is a pressure on shoggoth to become very good at simulating exactly correct human in exactly correct situation, which is extremely complicated task.
Yes, I think it is fair to say that I meant the Shoggoth part, although I'm a little wary of that dichotomy utilized in a load-bearing way.
But I still don't see how this leads to strategic planning or consequentialist reasoning on shoggoth's part. It's not like shoggot even "lives" in some kind of universe with linear time or gets any reward for predicting the next token, or learns on its mistakes. It is architecturally an input-output function where input is whatever information it has about previous text and output is whatever parameters the simulation needs right now. It is incredibly "smart", but not agent kind of smart. I don't see any room for shoggoth's agency in this setup.
No room for agency at all? If this were well-reasoned, I would consider it major progress on the inner alignment problem. But I fail to follow your line of thinking. Something being architecturally an input-output function seems not that closely related to what kind of universe it "lives" in. Part of the lesson of transformer architectures, in my view at least, was that giving a next-token-predictor a long input context is more practical than trying to train RNNs. What this suggests is that given a long context window, LLMs reconstruct the information which would have been kept around in a recurrent state pretty well anyway.
This makes it not very plausible that the key dividing line between agentic and non-agentic is whether the architecture keeps state around.
The argument I sketched as to why this input-output function might learn to be agentic was that it is tackling an extremely complex task, which might benefit from some agentic strategy. I'm still not saying such an argument is correct, but perhaps it will help to sketch why this seems plausible. Modern LLMs are broadly thought of as "attention" algorithms, meaning they decide what parts of sequences to focus on. Separately, many people think it is reasonable to characterize modern LLMs as having a sort of world-model which gets consulted to recall facts. Where to focus attention is a consideration which will have lots of facets to it, of course. But in a multi-stage transformer, isn't it plausible that the world-model gets consulted in a way that feeds into how attention is allocated? In other words, couldn't attention-allocation go through a relatively consequentialist circuit at times, which essentially asks itself a question about how it expects things to go if it allocates attention in different ways?
Any specific repeated calculation of that kind could get "memorized out", replaced with a shorter circuit which simply knows how to proceed in those circumstances. But it is possible, in theory at least, that the more general-purpose reasoning, going through the world-model, would be selected for due to its broad utility in a variety of circumstances.
Since the world-model-consultation is only selected to be useful for predicting the next token, the consequentialist question which the system asks its world-model could be fairly arbitrary so long as it has a good correlation with next-token-prediction utility on the training data.
Is this planning? IE does the "query to the world-model" involve considering multiple plans and rejecting worse ones? Or is the world-model more of a memorized mess of stuff with no "moving parts" to its computation? Well, we don't really know enough to say (so far as I am aware). Input-output type signatures do not tell us much about the simplicity or complexity of calculations within. "It's just circuits" but large circuits can implement some pretty sophisticated algorithms. Big NNs do not equal big lookup tables.
I intended the three to be probability and utility and steam, but it might make more sense to categorize things in other ways. While I still think there might be something more interesting here, I nowadays mainly think of Steam as the probability distribution over future actions and action-related concepts. This makes Steam an epistemic object, like any other belief, but with more normative/instrumental content because it's beliefs about actions, and because there will be a lot of FixDT stuff going on in such beliefs. Kickstarter / "belief-in" dynamics also seem extremely relevant.
Here are some different things that come to mind.
- As you mention, the simulacra behaves in an agentic way within its simulated environment, a character in a story. So the capacity to emulate agency is there. Sometimes characters can develop awareness that they are a character in a story. If an LLM is simulating that scenario, doesn't it seem appropriate (at least on some level) to say that there is real agency being oriented toward the real world? This is "situational awareness".
- Another idea is that the LLM has to learn some strategic planning in order to direct its cognitive resources efficiently toward the task of prediction. Prediction is a very complicated task, so this meta-cognition could in principle become arbitrarily complicated. In principle we might expect this to converge toward some sort of consequentialist reasoning, because that sort of reasoning is generically useful for approaching complex domains. The goals of this consequentialist reasoning do not need to be exactly "predict accurately" however; they merely need to be adequately aligned with this in the training distribution.
- Combining #1 and #2, if the model gets some use out of developing consequentialist metacognition, and the pseudo-consequentialist model used to simulate characters in stories is "right there", the model might borrow it for metacognitive purposes.
The frame I tend to think about it with is not exactly "how does it develop agency" but rather "how is agency ruled out". Although NNs don't neatly separate into different hypotheses (eg, circuits can work together rather than just compete with each other) it is still roughly right to think of NN training as rejecting lots of hypotheses and keeping around lots of other hypotheses. Some of these hypotheses will be highly agentic; we know NNs are capable of arriving at highly agentic policies in specific cases. So there's a question of whether those hypotheses can be ruled out in other cases. And then there's the more empirical question of, if we haven't entirely ruled out those agentic hypotheses, what degree of influence do they realistically have?
Seemingly the training data cannot entirely rule out an agentic style of reasoning (such as deceptive alignment), since agents can just choose to behave like non-agents. So, the inner alignment problem becomes: what other means can we use to rule out a large agentic influence? (Eg, can we argue that simplicity prior favors "honest" predictive models over deceptively aligned agents temporarily playing along with the prediction game?) The general concern is: no one has yet articulated a convincing answer, so far as I know.
Hence, I regard the problem more as a lack of any argument ruling out agency, rather than the existence of a clear positive argument that agency will arise. Others may have different views on this.
I am thinking of this as a noise-reducing modification to the loss function, similar to using model-based rather than model-free learning (which, if done well, rewards/punishes a policy based on the average reward/punishment it would have gotten over many steps).
If science were incentivized via prediction market (and assuming scientists can make sizable bets by taking out loans), then the first person to predict a thing wins most of the money related to it. In other words, prediction markets are approximately parade-leader-incentivizing.
But if there's a race to be the first to bet, then this reward is high-variance; Newton could get priority over Leibniz by getting his ideas to the market a little faster.
You recommend dividing credit more to all the people who could have gotten information to the market, with some kind of time-discount for when they could have done it. If we conceive of "who won the race" as introducing some noise into the credit-assignment, this is a way to de-noise things.
This has the consequence of taking away a lot of credit from race-winners when the race was pretty big, which is the part you focus on; based on this idea, you want to be part of smaller races (ideally size 1). But, outside-view, you should have wanted this all along anyway; if you are racing for status, but you are part of a big race, only a small number of people can win anyway, so your outside-view probability of personally winning status should already be divided by the number of racers. To think you have a good chance of winning such a race you must have personal reasons, and (since being in the race selects, in part, for people who think they can win) they're probably overconfident.
So for the most part your advice has no benefit for calibrated people, since being a parade-leader is hard.
There are for sure cases where your metric comes apart from expected-parade-leading by a lot more, though. A few years ago I heard accusations that one of the big names behind Deep Learning earned their status by visiting lots of research groups and keeping an eye out for what big things were going to happen next, and managing to publish papers on these big things just a bit ahead of everyone else. This strategy creates the appearance of being a fountain of information, when in fact the service provided is just a small speed boost to pre-existing trends. (I do not recall who exactly was being accused, and I don't have a lot of info on the reliability of this assessment anyway, it was just a rumor.)
Typically, people say that the market is mostly efficient, and if there was financial alpha to be gained by doing hiring differently from most corporations, then there would already be companies outcompeting others by doing that. Well, here's a company doing some things differently and outcompeting other companies. Maybe there aren't enough people willing to do such things (who have the resources to) for the returns to reach an equilibrium?
Well, it could be that the practices lead to high-variance results, so that you should mostly expect companies which operate like that to fail, but you also expect a few unusually large wins.
But I'm not familiar enough with the specific case to say anything substantial.
I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research more) vs 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.
So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?
Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?
For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by "bad at philosophy".
But more generally, I have a sense that I just get BS from Claude, even when it isn't specifically trying to shoehorn its own safety measures in as the solution.
Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There's also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.
(Helping out with programming is also not the only way LLMs can help accelerate capabilities.)
So this seems like a generally dangerous overall dynamic -- LLMs are already better at accelerating capabilities progress than they are at accelerating alignment, and furthermore, it seems like the strong default is for this disparity to get worse and worse.
I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.
I'll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it's still impossible to solve the problem, but it's usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand.
I agree it isn't a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won't have this failure mode, so I'm still quite comfortable with using it as an implication that isn't 100% accurate, but more like 90-95+% accurate.
I agree the example sucks and only serves to prove that it is not a logical implication.
A better example would be, like, the Goodhart model of AI risk, where any loss function that we optimize hard enough to get into superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not make an assumption that we need to prove 100% safety, but rather, argues, for reasons, from assumptions that it will be hard to get any safety at all from loss functions which merely coincide to what we want somewhat well.
I still think the list of lethalities is a pretty good reply to your overall line of reasoning -- IE it clearly flags that the problem is not achieving perfection, but rather, achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard, IE provides arguments rather than some kind of blind assumption like you seem to be indicating.
You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but "things tend to be easy when you lift the requirement of perfection" is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.
I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it.
It mostly didn't.
- A lot of this boils down to "existing interpretability work is unimpressive". I think this is an important point, and significant sub-points were raised to argue it. However, it says little 'against almost every theory of impact of interpretability'. We can just do better work.
- A lot of the rest boils down to "enumerative safety is dumb". I agree, at least for the version of "enumerative safety" you argue against here.
My impact story (for the work I am considering doing) is most similar to the "retargeting" story which you briefly mention, but barely critique.
I do think the world would be better off if this were required reading for anyone considering going into interpretability vs other areas. (Barring weird side-effects of the counterfactual where someone has the ability to enforce required reading...) It is a good piece of work which raises many important points.
More generally, if we grant that we don't need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it's often doable to solve problems of you don't need perfect or arbitrarily good solutions.
It seems really easy to spell out worldviews where "we don't need perfection, or arbitrarily good alignment" but yet "alignment should be really easy". To give a somewhat silly example based on the OP, I could buy Enumerative Safety in principle -- so if we can check all the features for safety, we can 100% guarantee the safety of the model. It then follows that if we can check 95% of the features (sampled randomly) then we get something like a 95% safety guarantee (depending on priors).
But I might also think that properly "checking" even one feature is really, really hard.
So I don't buy the claimed implication: "we don't need perfection" does not imply "alignment should be really easy". Indeed, I think the implication quite badly fails.
Compare this to a similar argument that a hardware enthusiast could use to argue against making a software/hardware distinction. You can argue that saying "software" is misleading because it distracts from the physical reality. Software is still present physically somewhere in the computer. Software doesn't do anything hardware can't do, since software doing is just hardware doing.
But thinking in this way will not be a very good way of predicting reality. The hypothetical hardware enthusiast would not be able to predict the rise of the "programmer" profession, or the great increase in complexity of things that machines can do thanks to "programming".
I think it is more helpful to think of modern AI as a paradigm shift in the same way that the shift from "electronic" (hardware) to "digital" (software) was a paradigm shift. Sure, you can still use the old paradigm to put labels on things. Everything is "still hardware". But doing so can miss an important transition.
While I agree that wedding photos and NN weights are both data, and this helps to highlight ways they "aren't software", I think this undersells the point. NN weights are "active" in ways wedding photos aren't. The classic code/data distinction has a mostly-OK summary: code is data of type function. Code is data which can be "run" on other data.
NN weights are "of type function" too: the usual way to use them is to "run" them. Yet, it is pretty obvious that they are not code in the traditional sense.
So I think this is similar to a hardware geek insisting that code is just hardware configuration, like setting a dial or flipping a set of switches. To the hypothetical hardware geek, everything is hardware; "software" is a physical thing just as much as a wire is. An arduino is just a particularly inefficient control circuit.
So, although from a hardware perspective you basically always want to replace an arduino with a more special-purpose chip, "something magical" happens when we move to software -- new sorts of things become possible.
Similarly, looking at AI as data rather than code may be a way to say that AI "isn't software" within the paradigm of software, but it is not very helpful for understanding the large shift that is taking place. I think it is better to see this as a new layer in somewhat the same way as software was a new layer on top of hardware. The kinds of thinking you need to do in order to do something with hardware vs do something with software are quite different, but ultimately, more similar to each other than they both are to how to do something with AI.
Ah, very interesting, thanks! I wonder if there is a different way to measure relative endorsement that could achieve transitivity.
Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition.
First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.
But, if children or animals who are intuitively agents often don't fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on "smaller" perspectives is very important.
I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.
I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn't already rely on some other agent-detector piece which helps us understand how to remove the agent.
I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this.
Looking back at Flint's work, I don't agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no "compare your optimizer to this" step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the 'optimizer') is sensitive to very small perturbations, which can take the whole system out of the attractor basin.
In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent.
Overall, I expect my definition to be more useful to alignment, but I don't currently have a well-articulated argument for that conclusion. Here are some comparison points:
- Flint's definition requires a system with stable dynamics over time, so that we can define an iteration rule. My definition can handle that case, but does not require it. So, for example, Flint's definition doesn't work well for a goal like "become President in 2030" -- it works better for continual goals, like "be president".
- Flint's notion of robustness involves counterfactual perturbations which we may never see in the real world. I feel a bit suspicious about this aspect. Can counterfactual perturbations we'll never see in practice be really relevant and useful for reasoning about alignment?
- Flint's notion is based more on the physical system, whereas mine is more about how we subjectively view that system.
- I feel that "endorsement" comes closer to a concept of alignment. Because of the subjective nature of endorsement, it comes closer to formalizing when an optimizer is trusted, rather than merely good at its job.
- It seems more plausible that we can show (with plausible normative assumptions about our own reasoning) that we (should) absolutely endorse some AI, in comparison to modeling the world in sufficient detail to show that building the AI would put us into a good attractor basin.
- I suspect Flint's definition suffers more from the value change problem than mine, although I think I haven't done the work necessary to make this clear.
There are several compromises I made for the sake of getting the idea across as simply as I could.
- I think the graduate-level-textbook version of this would be much more clear about what the quotes are doing. I was tempted to not even include the quotes in the mathematical expressions, since I don't think I'm super clear about why they're there.
- I totally ignored the difference between (probability conditional on ) and (probability after learning ).
- I neglect to include quantifiers in any of my definitions; the reader is left to guess which things are implicitly universally quantified.
I think I do prefer the version I wrote, which uses rather than , but obviously the English-language descriptions ignore this distinction and make it sound like what I really want is .
It seems like the intention is that "learns" or "hears about" 's belief, and then updates (in the above Bayesian inference sense) to have a new that has the consistency condition with .
Obviously we can consider both possibilities and see where that goes, but I think maybe the conditional version makes more sense as a notion of whether you right now endorse something. A conditional probability is sort of like a plan for updating. You won't necessarily follow the plan exactly when you actually update, but the conditional probability is your best estimate.
To throw some terminology out there, let's call my thing "endorsement" and a version which uses actual updates rather than conditionals "deference" (because you'd actually defer to their opinions if you learn them).
- You can know whether you endorse something, since you can know your current conditional probabilities (to within some accuracy, anyway). It is harder to know whether you defer to something, since in the case where updates don't equal conditionals, you must not know what you are going to update to. I think it makes more sense to define the intentional stance in terms of something you can more easily know about yourself.
- Using endorsement to define agency makes it about how you reason about specific hypotheticals, whereas using deference to try and define agency would make it about what actually happens in those hypotheticals (ie, how you would actually update if you learned a thing). Since you might not ever get to learn that thing, this makes endorsement more well-defined than deference.
Bayes' theorem is the statement about , which is true from the axioms of probability theory for any and whatsoever.
I actually prefer the view of Alan Hajek (among others) who holds that P(A|B) is a primitive, not defined as in Bayes' ratio formula for conditional probability. Bayes' ratio formula can be proven in the case where P(B)>0, but if P(B)=0 it seems better to say that conditional probabilities can exist rather than necessarily being undefined. For example, we can reason about the conditional probability that a meteor hits land given that it hits the equator, even if hitting the equator is a measure zero event. Statisticians learn to compute such things in advanced stats classes, and it seems sensible to unify such notions under the formal P(A|B) rather than insisting that they are technically some other thing.
By putting in the conditional, you're saying that it's an event on , a thing with the same type as . And it feels like that's conceptually correct, but also kind of the hard part. It's as if is modelling as an agent embedded into .
Right. This is what I was gesturing at with the quotes. There has to be some kind of translation from (which is a mathematical concept 'outside' ) to an event inside . So the quotes are doing something similar to a Goedel encoding.
While trying to understand the equations, I found it easier to visualize and as two separate distributions on the same , where endorsement is simply a consistency condition. For belief consistency, you would just say that endorses on event if .
But that isn't what you wrote; instead you wrote thing this with conditioning on a quoted thing. And of course, the thing I said is symmetrical between and , whereas your concept of endorsement is not symmetrical.
The asymmetry is quite important. If we could only endorse things that have exactly our opinions, we could never improve.
The post is making the distinction between seeing preferences as a utility function of worlds (this is the regular old idea of utility functions as random variables) vs seeing preferences as an expectation function on events (the jeffrey-bolker view). Both perspectives hold that an agent can optimize things it does not have direct access to. Agency is optimization at a distance. Optimization that isn't at a distance is selection as opposed to control.
I agree that this is an important distinction, but I personally prefer to call it "transformative AI" or some such.
The interesting thing about this -- beyond showing that going probabilistic allows the handshake to work with somewhat unreliable bots -- is that proving rather than is a lot different. With , we're like "And so Peano arithmetic (or whatever) proves they cooperate! We think Peano arithmetic is accurate about such matters, so, they actually cooperate."
With the conclusion we're more like "So if the agent's probability estimates are any good, we should also expect them to cooperate" or something like that. The connection to them actually cooperating is looser.
An intriguing perspective, but I'm not sure whether I agree. Naively, it would seem that a choice between fixed points in the FixDT setting is just a choice between different probability distributions, which brings us very close to the VNM idea of a choice between gambles. So VNM-like utility theory seems like the obvious outcome.
That being said, I don't really agree with the idea that an agent should have a fixed VNM-like utility function. So I do think some generalization is needed.
Yeah, "settles on" here meant however the agent selects beliefs. The epistemic constraint implies that the agent uses exhaustive search or some other procedure guaranteed to produce a fixed point, rather than Banach-style iteration.
Moving to a Banach-like setting will often make the fixed points unique, which takes away the whole idea of FixDT.
Moving to a setting where the agent isn't guaranteed to converge would mean we have to re-write the epistemic constraint to be appropriate to that setting.
Yes, thanks for citing it here! I should have mentioned it, really.
I see the Skyrms iterative idea as quite different from the "just take a fixed point" theory I sketch here, although clearly they have something in common. FixDT makes it easier to combine both epistemic and instrumental concerns -- every fixed point obeys the epistemic requirement; and then the choice between them obeys the instrumental requirement. If we iteratively zoom in on a fixed point instead of selecting from the set, this seems harder?
If we try the Skyrms iteration thing, maybe the most sensible thing would be to move toward the beliefs of greatest expected utility -- but do so in a setting where epistemic utility emerges naturally from pragmatic concerts (such as A Pragmatists Guide to Epistemic Decision Theory by Ben Levinstein). So the agent is only ever revising its beliefs in pragmatic ways, but we assume enough about the environment that it wants to obey both the epistemic and instrumental constraints? But, possibly, this assumption would just be inconsistent with the sort of decision problem which motivates FixDT (and Greaves).
I don't understand. The fact that every single node has a Markov blanket seems unrelated. The claim that the intersection of any two blankets is a blanket doesn't seem true? For example, I can have a network:
a -> b -> c
| | |
v v v
d -> e -> f
| | |
v v v
g -> h -> i
It seems like the intersection of the blankets for 'a' and 'c' don't form a blanket.
I find your attempted clarification confusing.
Our model is going to have some variables in it, and if we don't know in advance where the agent will be at each timestep, then presumably we don't know which of those variables (or which function of those variables, etc) will be our Markov blanket.
No? A probabilistic model can just be a probability distribution over events, with no "random variables in it". It seemed like your suggestion was to define the random variables later, "on top of" the probabilistic model, not as an intrinsic part of the model, so as to avoid the objection that a physics-ish model won't have agent-ish variables in it.
So the random variables for our markov blanket can just be defined as things like skin surface temperature & surface lighting & so on; random variables which can be derived from a physics-ish event space, but not by any particularly simple means (since the location of these things keeps changing).
On the other hand, if we knew which variables or which function of the variables were the blanket, then presumably we'd already know where the agent is, so presumably we're already conditioning on something when we say "the agent's boundary is a Markov blanket".
Again, no? If I know skin surface temperature and lighting conditions and so on all add up to a Markov blanket, I don't thereby know where the skin is.
I think that is a basically-correct argument. It doesn't actually argue that agent boundaries aren't Markov boundaries; I still think agent boundaries are basically Markov boundaries. But the argument implies that the most naive setup is missing some piece having to do with "where the agent is".
It seems like you agree with Sam way more than would naively be suggested by your initial reply. I don't understand why.
When I talked with Sam about this recently, he was somewhat satisfied by your reply, but he did think there were a bunch of questions which follow. By giving up on the idea that the markov blanket can be "built up" from an underlying causal model, we potentially give up on a lot of niceness desiderata which we might have wanted. So there's a natural question of how much you want to try and recover, which you could have gotten from "structural" markov blankets, and might be able to get some other way, but don't automatically get from arbitrary markov blankets.
In particular, if I had to guess: causal properties? I don't know about you, but my OP was mainly directed at Critch, and iiuc Critch wants the Markov blanket to have some causal properties so that we can talk about input/output. I also find it appealing for "agent boundaries" to have some property like that. But if the random variables are unrelated to a causal graph (which, again, is how I understood your proposal) then it seems difficult to recover anything like that.
Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?
(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)
And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?
(Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want-like behavior in the behaviorist sense. An AI's ability to win no matter how you move is the same as its ability to reliably steer the game-board into states where you're check-mated, as though it had an internal check-mating “goal” it were trying to achieve. This is again a quantitative gap that’s being eroded.)
I don't think the following is all that relevant to the point you are making in this post, but someone cited this post of yours in relation to the question of whether LLMs are "intelligent" (summarizing the post as "Nate says LLMs aren't intelligent") and then argued against the post as goalpost-moving, so I wanted to discuss that.
It may come as a shock to some, that Abram Demski adamantly defends the following position: GPT4 is AGI. I would be goalpost-moving if I said otherwise. I think the AGI community is goalpost-moving to the extent that it says otherwise.
I think there is some tendency in the AI Risk community to equate "AGI" with "the sort of AI which kills all the humans unless it is aligned". But "AGI" stands for "artificial general intelligence", not "kills all the humans". I think it makes more sense for the definition of AGI to be up to the community of AI researchers who use the term AGI to distance their work from narrow AI, rather than for it to be up to the AI risk community. And GPT4 is definitely not narrow AI.
I'll argue an even stronger claim: if you come up with a task which can be described and completed entirely in text format (and then evaluated somehow for performance quality), for most such tasks the performance of GPT4 is at or above the performance of a random human. (We can even be nice and only randomly sample humans who speak whichever languages are appropriate to the task; I'll still stand by the claim.) Yes, GPT4 has some weaknesses compared to a random human. But most claims of weaknesses I've heard are in fact contrasting GPT4 to expert humans, not random humans. So my stronger claim is: GPT4 is human-level AGI, maybe not by all possible definitions of the term, but by a very reasonable-seeming definition which 2014 Abram Demski might have been perfectly happy with. To deny this would be goalpost-moving for me; and, I expect, for many.
So (and I don't think this is what you were saying) if GPT4 were being ruled out of "human-level AGI" because it cannot write a coherent set of novels on its own, or do a big engineering project, well, I call shenanigans. Most humans can't do that either.
Ahhh that makes sense, thanks.
This topic came up when working on a project where I try to make a set of minimal assumptions such that I know how to construct an aligned system under these assumptions. After knowing how to construct an aligned system under this set of assumptions, I then attempt to remove an assumption and adjust the system such that it is still aligned. I am trying to remove the cartesian assumption right now.
I would encourage you to consider looking at Reflective Oracles next, to describe a computationally unbounded agent which is capable of thinking about worlds which are as computationally unbounded as itself; and a next logical step after that would be to look at logical induction or infrabayesianism, to think about agents which are smaller than what they reason about.
You can compute everything that takes finite compute and memory instantly. (This implies some sense of cartesianess, as I am sort of imagining the system running faster than the world, as it can just do an entire tree search in one "clock tick" of the environment.)
This part makes me quite skeptical that the described result would constitute embedded agency at all. It's possible that you are describing a direction which would yield some kind of intellectual progress if pursued in the right way, but you are not describing a set of constraints such that I'd say a thing in this direction would definitely be progress.
My intuition is that this would still need to solve the problem of giving an agent a correct representation of itself, in the sense that it can "plan over itself" arbitrarily. This can be thought of as enabling the agent to reason over the entire environment which includes itself. Is that part a solved problem?
This part seems inconsistent with the previous quoted paragraph; if the agent is able to reason about the world only because it can run faster than the world, then it sounds like it'll have trouble reasoning about itself.
Reflective Oracles solve the problem of describing an agent with infinite computational resources which can do planning involving itself and other similar agents, including uncertainty (via reflective-oracle solomonoff induction), which sounds superior to the sort of direction you propose. However, they do not run "faster than the world", as they can reason about worlds which include things like themselves.
Amusingly, searching for articles on whether offering unlicensed investment advice is illegal (and whether disclaiming it as "not investment advice" matters) brings me to pages offering "not legal advice" ;p
Also, to be clear, nothing in this post constitutes investment advice or legal advice.
&
(Also I know enough to say up front that nothing I say here is Investment Advice, or other advice of any kind!)
&
None of what I say is financial advice, including anything that sounds like financial advice.
I usually interpret this sort of statement as an invocation to the gods of law, something along the lines of "please don't smite me", and certainly not intended literally. Indeed, it seems incongruous to interpret it literally here: the whole point of the discussion, as I'm understanding it, is to provide potentially useful ideas about investing strategies. Am I supposed to pretend that it's just, like, an interesting thought experiment? Or is there some other interpretation of your disclaimer I'm not seeing?
I'm looking at the Savage theory from your own https://plato.stanford.edu/entries/decision-theory/ and I see U(f)=∑u(f(si))P(si), so at least they have no problem with the domains (O and S) being different. Now I see the confusion is that to you Omega=S (and also O=S), but to me Omega=dom(u)=O.
(Just to be clear, I did not write that article.)
I think the interpretation of Savage is pretty subtle. The objects of preference ("outcomes") and objects of belief ("states") are treated as distinct sets. But how are we supposed to think about this?
- The interpretation Savage seems to imply is that both outcomes and states are "part of the world", but the agent has somehow segregated parts of the world into matters of belief and matters of preference. But however the agent has done this, it seems to be fundamentally beyond the Savage representation; clearly within Savage, the agent cannot represent meta-beliefs about which matters are matters of belief and which are matters of preference. So this seems pretty weird.
- We could instead think of the objects of preference as something like "happiness levels" rather than events in the world. The idea of the representation theorem then becomes that we can peg "happiness levels" to real numbers. In this case, the picture looks more like standard utility functions; S is the domain of the function that gives us our happiness level (which can be represented by a real-valued utility).
- Another approach which seems somewhat common is to take the Savage representation but require that S=O. Savage's "acts" then become maps from world to world, which fits well with other theories of counterfactuals and causal interventions.
So even within a Savage framework, it's not entirely clear that we would want the domain of the utility function to be different from the domain of the belief function.
I should also have mentioned the super-common VNM picture, where utility has to be a function of arbitrary states as well.
That's just math speak, you can define a lot of things as a lot of other things, but that doesn't mean that the agent is going to be literally iterating over infinite sets of infinite bit strings and evaluating something on each of them.
The question is, what math-speak is the best representation of the things we actually care about?
It remains totally unclear to me why you demand the world to be such a thing.
Ah, if you don't see 'worlds' as meaning any such thing, then I wonder, are we really arguing about anything at all?
I'm using 'worlds' that way in reference to the same general setup which we see in propositions-vs-models in model theory, or in vs the -algebra in the Kolmogorov axioms, or in Kripke frames, and perhaps some other places.
We can either start with a basic set of "worlds" (eg, ) and define our "propositions" or "events" as sets of worlds, where that proposition/event 'holds' or 'is true' or 'occurs'; or, equivalently, we could start with an algebra of propositions/events (like a -algebra) and derive worlds as maximally specific choices of which propositions are true and false (or which events hold/occur).
My point is that if U has two output values, then it only needs two possible inputs. Maybe you're saying that if |dom(U)|=2, then there is no point in having |dom(P)|>2, and maybe you're right, but I feel no need to make such claims.
Maybe I should just let you tell me what framework you are even using in the first place. There are two main alternatives to the Jeffrey-Bolker framework which I have in mind: the Savage axioms, and also the thing commonly seen in statistics textbooks where you have a probability distribution which obeys the Kolmogorov axioms and then you have random variables over that (random variables being defined as functions of type ). A utility function is then treated as a random variable.
It doesn't sound like your notion of utility function is any of those things, so I just don't know what kind of framework you have in mind.
My point is only that U is also reasonable, and possibly equivalent or more general. That there is no "case against" it.
I do agree that my post didn't do a very good job of delivering a case against utility functions, and actually only argues that there exists a plausibly-more-useful alternative to a specific view which includes utility functions as one of several elements.
Utility functions definitely aren't more general.
A classical probability distribution over with a utility function understood as a random variable can easily be converted to the Jeffrey-Bolker framework, by taking the JB algebra as the sigma-algebra, and V as the expected value of U. Technically the sigma-algebra needs to be atomless to fit JB exactly, but Zoltan Domotor (Axiomatization of Jeffrey Utilities) generalizes this considerably.
I've heard people say that there is a way to convert in the other direction, but that it requires ultrafilters (so in some sense it's very non-constructive). I haven't been able to find this construction yet or had anyone explain how it works.
So it seems to me, but I recognize that I haven't shown in detail, that the space of computable values is strictly broader in the JB framework; computable utility functions + computable probability gives us computable JB-values, but computable JB-values need not correspond to computable utility functions.
Thus, the space of minds which can be described by the two frameworks might be equivalent, but the space of minds which can be described by computations does not seem to be; the JB space, there, is larger.
I don't see why any "good" utility function should be uncomputable.
Well, the Jeffrey-Bolker kind of explanation is as follows: agents really only need to consider and manipulate the probabilities and expected values of events (ie, propositions in the agent's internal language). So it makes some sense to assume that these probabilities and expected values are computable. But this does not imply (as far as I know) that we can construct 'worlds' as maximal specifications of which propositions are true/false and then define a utility function on those worlds which is consistent with the computable expected values and have that utility function itself be computable. And indeed it seems rather plausible to me that this is not the case, even for values which otherwise seem relatively unremarkable, as illustrated by examples like the procrastination paradox.
I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.
I agree with the first sentence, however Omega is merely the domain of U, it does not need to be the entire ontology. In this case Omega={"button has been pressed", "button has not been pressed"} and P("button has been pressed" | "I'm pressing the button")~1. Obviously, there is also no problem with extending Omega with the perceptions, all the way up to |Omega|=4, or with adding some clocks.
I'm not sure why you say Omega can be the domain of U but not the entire ontology. This seems to mean that we don't know how to take expected values for arbitrary events. Also it means you are no longer advocating for the model I'm arguing against, where U is a random variable.
We could expand the scenario so that every "day" is represented by an n-bit string.
If you want to force the agent to remember the entire history of the world, then you'll run out of storage space before you need to worry about computability. A real agent would have to start forgetting days, or keep some compressed summary of that history. It seems to me that Jeffrey would "update" the daily utilities into total expected utility; in that case, U can do something similar.
I agree that we can put even more stringent (and realistic) requirements on the computational power of the agent, and then both JB and random-variable treatments become implausible, in so far as those treatments involve infinitely large representations.
I still think that the Jeffreyesque representational choice of using compact event-propositions, rather than fully-specified worlds, seems more plausible with respect to such bounded agents.
You defined U at the very beginning, so there is no need to send these new facts to U, it doesn't care. Instead, you are describing a problem with P, and it's a hard problem, but Jeffrey also uses P, so that doesn't solve it.
As per my earlier comment on "Omega is merely the domain of U", I think here you're abandoning elements of the random-variable approach to U, and in fact reasoning in a more JB-esque way.
> ... set our model to be a list of "events" we've observed ...
I didn't understand this part.If you "evaluate events", then events have some sort of bit representation in the agent, right? I don't clearly see the events in your "Updates Are Computable" example, so I can't say much and I may be confused, but I have a strong feeling that you could define U as a function on those bits, and get the same agent.
Yeah, it seems like we're talking past each other here and would need to do more work to unpack what's going on. All I can think to say right now is this: the usual random-variable approach to defining U requires that probabilities respect countable additivity, because the event of "the button being pressed" is just the set of individual worlds where that happens (where the button gets pressed on a particular day). This is the root of the computational difficulty in the standard approach. JB doesn't require countable additivity, since it isn't a rule which agents can enforce on their beliefs by touching only finitely many of them. This harkens back to something you said earlier:
Instead, you are describing a problem with P, and it's a hard problem, but Jeffrey also uses P, so that doesn't solve it.
Which I agree with in this case, except that JB does "solve" it by explicitly relaxing that constraint.
Again, this is a way in which JB is more general, not less; JB could follow that constraint, if you like.
In my personal practice, there seems to be a real difference -- "something magic happens" -- when you've got an actual audience you actually want to explain something to. I would recommend this over trying to simulate the experience within personal notes, if you can get it. The audience doesn't need to be 'the public internet' -- although each individual audience will have a different sort of impact on your writing, so EG writing to a friend who already understands you fairly well may not cause you to clarify your ideas in the same way as writing to strangers.
I would also mildly caution against a policy which makes your own personal notes too effortful to write. I wholeheartedly agree that you should keep your future self in mind as an audience, and write such that the notes will be useful if you look back at them. But if I imagine writing my own personal notes to the same standard as public-facing essays, I think I lose something -- it takes too long to capture ideas that way.
I agree that it makes more sense to suppose "worlds" are something closer to how the agent imagines worlds, rather than quarks. But on this view, I think it makes a lot of sense to argue that there are no maximally specific worlds -- I can always "extend" a world with an extra, new fact which I had not previously included. IE, agents never "finish" imagining worlds; more detail can always be added (even if only in separate magisteria, eg, imagining adding epiphenomenal facts). I can always conceive of the possibility of a new predicate beyond all the predicates which a specific world-model discusses.
If you buy this, then I think the Jeffrey-Bolker setup is a reasonable formalization.
If you don't buy this, my next question would be whether you really think that the sort of "world" ("world model", as you called it) which an agent attaches value to always are "closed off" (ie sperify all the facts one way or the other; do not admit further detail) -- or, perhaps, you merely want to argue that this can sometimes be the case but not always. (Because if it's sometimes the case but not always, this argues against both the traditional view where Omega is the set which the probability is a measure over & the utility function is a function of, and against the Jeffrey-Bolker picture.)
I find it implausible that the sort of "world model" which we can model humans as having-values-as-a-function-of is "closed off" -- we can appreciate ideas like atoms and quarks, adding these to our ontology, without necessarily changing other aspects of our world-model. Perhaps sometimes we can "close things off" like this -- we can consider the possibility that there "is nothing else" -- but even so, I think this is better-modeled as an additional assertion which we add to the set of propositions defining a possibility rather than modeling us as having bottomed out in an underlying set of "world" which inherently decide all propositions.
In "procrastination" example you intentionally picked a bad model, so it proves nothing (if the world only has one button we care about, then maybe |Omega|=2 and everything is perfectly computable).
You seem to be suggesting that any such example could be similarly re-written to make things nicely computable. I find this implausible. We could expand the scenario so that every "day" is represented by an n-bit string. The computable function b() looks at a "day" and tells us whether the button was pressed or not on that day. As before, we get -10 utility if the button is never pressed. But we also have some (computable) reward, r(), which is a function of a "day" and tells us how good or bad that day was. The discounted reward is such that these priorities are never more important than whether or not the button is pressed; but so long as the button is eventually pressed, we prefer to get more reward rather than less. How would you change the representation now?
More generally, do you believe that any plausible utility function on bit-strings can be re-represented as a computable function (perhaps on some other representation, rather than bit-strings)? Why would you particularly expect this to be the case?
I think in arguing that I intentionally picked a bad model, you mean that the world-model representation which I chose was totally ad-hoc and chosen specifically to make things difficult to compute, and without having the goal in mind of making things difficult to compute, someone else would have chosen something simpler like |Omega|=2. But I think there is a good reason to imagine that the agent structures its ontology around its perceptions. The agent cannot observe whether-the-button-is-ever-pressed; it can only observe, on a given day, whether the button has been pressed on that day. |Omega|=2 is too small to even represent such perceptions.
Further on, it seems to me that if we set our model to be a list of "events" we've observed, then we get the exact thing you're talking about. Although you're imprecise and inconsistent about what an event is, how it's represented, how many there are, so I'm not sure if that's supposed to make anything more tractable.
I didn't understand this part.
In general, asking questions about the domain of U (and P!) is a good idea, and something that all introductions to Utility lack. But the ease with which you abandon a perfectly good formalism is concerning. LI is cool, and it doesn't use U, but that's not an argument against U, at best you can say that U was not as useful as you'd hoped.
Jeffrey-Bolker is fairly commonly advocated amongst decision theorists in philosophy (from both sides of the CDT-EDT debate!), although as far as I'm aware it hasn't made its way into stats textbooks at any level. It can be seen as part of a broader movement in mathematics, away from set-theoretic representations and toward more algebraic representations. A related example is pointless topology -- instead of understanding a topology as a structure imposed on a set of points, the structure of "opens" (no longer "open sets") is examined in its own right. In the same way that discarding "worlds" moves the formalism closer to concepts which the agent can actually realistically manipulate, discarding "points" from topology moves the math closer to the pieces which mathematicians are actually interested in manipulating.
My own take is that the domain of U is the type of P. That is, U is evaluated on possible functions P. P certainly represents everything the agent cares about in the world, and it's also already small and efficient enough to be stored and updated in the agent, so this solution creates no new problems.
This is an interesting alternative, which I have never seen spelled out in axiomatic foundations.
I also wrote a huge amount in private idea-journals before I started writing publicly. There was also an intermediate stage where I wrote a lot on mailing lists, which felt less public than blogging although technically public.
Even if I conceded this point, which is not obvious to me, I would still insist on the point that different speakers will be using natural language differently and so resorting to natural language rather than formal language is not universally a good move when it comes to clarifying disagreements.
Well, more importantly, I want to argue that "translation" is happening even if both people are apparently using English.
For example, philosophers have settled on distinct but related meanings for the terms "probability", "credence", "chance", "frequency", "belief". (Some of these meanings are more vague/general while others are more precise; but more importantly, these different terms have many different detailed implications.) If two people are unfamiliar with all of those subtleties and they start using one of the words (say, "probability"), then it is very possible that they have two different ideas about which more-precise notion is being invoked.
When doing original research, people are often in this situation, because the several more-precise notions have not even been invented yet (so it's not possible to go look up how philosophers have clarified the possible concepts).
In my experience, this means that two people using natural language to try and discuss a topic are very often in a situation where it feels like we're "translating" back and forth between our two different ontologies, even though we're both expressing those ideas in English.
So, even if both people express their ideas in English, I think the "non-invertible translation problem" discussed in the original post can still arise.
I disagree. For tricky technical topics, two different people will be speaking sufficiently different versions of English that this isn't true. Vagueness and other such topics will not apply equally to both speakers; one person might have a precise understanding of decision-theoretic terms like "action" and "observation" while the other person may regard them as more vague, or may have different decision-theoretic understanding of those terms. Simple example, one person may regard Jeffrey-Bolker as the default framework for understanding agents, while the other may prefer Savage; these two frameworks ontologize actions in very different ways, which may be incidental to the debate or may be central. Speaking in English just obscures this underlying difference in how we think about things, rather than solving the problem.
I'm not sure where the underlying disagreement in the decision theory case was (something about actions vs mixed strategies) but I assume there again the underlying problem can be expressed in natural language statements which both parties can understand without the need of translating them.
In the case of mixed vs pure strategies, I think it is quite clear that translating to technical terminology rather than English helped clarify rather than obscure, even if it created the non-one-to-one translation problem this post is discussing.
"Weak methods" means confidence is achieved more empirically, so there's always a question of how well the results will generalize for some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc). "Strong methods" means there's a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what's happening, so there is much less doubt about what systems the method will successfully apply to.
The question seems too huge for me to properly try to answer. Instead, I want to note that academics have been making some progress on models which are trying to do something similar to, but perhaps subtly different from, Paul's reflective probability distribution you cite.
The basic idea is not new to me -- I can't recall where, but I think I've probably seen a talk observing that linear combinations of neurons, rather than individual neurons, are what you'd expect to be meaningful (under some assumptions) because that's how the next layer of neurons looks at a layer -- since linear combinations are what's important to the network, it would be weird if it turned out individual neurons were particularly meaningful. This wasn't even surprising to me at the time I first learned about it.
But it's great to see it illustrated so well!
In my view, this provides relatively little insights to the hard questions of what it even means to understand what is going on inside a network (so, for example, it doesn't provide any obvious progress on the hard version of ELK). So how useful this ultimately turns out to be for aligning superintelligence depends on how useful "weak methods" in general are. (IE methods with empirical validation but which don't come with strong theoretical arguments that they will work in general.)
That being said, I am quite glad that such good progress is being made, even if it's what I would classify as "weak methods".
Yeah. For my case, I think it should be assumed that the meta-logics are as different as the object-logics, so that things continue to be confusing.
As I mentioned here, if Alice understands your point about the power of the double-negation formulation, she would be applying a different translation of Bob's statements from the one I assumed in the post, so she would be escaping the problem. IE:
part of the beauty in the double-negation translation is that all of classical logic is valid under it.
is basically a reminder to Alice that the translation back from double-negation form is trivial in her own view (since it is classically equivalent), and all of Bob's intuitionistic moves are also classically valid, so she only has to squint at Bob's weird axioms, everything else should be understantable.
But I think things will still get weird if Bob makes meta-logical arguments (ie Bob's arguments are not just proofs in Bob's logic), which seems quite probable.
I'm interested in concrete advice about how to resolve this problem in a real argument (which I think you don't quite provide), but I'm also quite interested in the abstract question of how two people with different ontologies can communicate. Normally I think of the problem as one of constructing a third reference frame (a common language) by which they can communicate, but your proposal is also interesting, and escapes the common-language idea.
That's an interesting point, but I have a couple of replies.
- First and foremost, any argument against 'not not A' becomes an argument against A if Alice translates back into classical logic in a different way than I've assumed she is. Bob's argument might conclude 'not A' (because even in intuitionistic logic), but Alice thinks of this as a tricky intuitionistic assertion, and so she interprets it indirectly as saying something about proofs. For Alice to notice and understand your point would, I think, be Alice fixing the failure case I'm pointing out.
- Second, an argument against an assumption need not be an argument for its negation, especially for intuitionists/constructivists. (Excluded middle is something they want to argue against, but definitely not something they want to negate, for example.) The nature of Bob's argument against Alice's claim can be quite complex and can occur at meta-levels, rather than occurring directly in the logic. So I guess I'm not clear that things are as simple as you claim, when this happens.
Would you count all the people who worked on the EU AI act?