The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review 2024-03-27T19:59:27.893Z
Constructive Cauchy sequences vs. Dedekind cuts 2024-03-14T23:04:07.300Z
Simple Kelly betting in prediction markets 2024-03-06T18:59:18.243Z
A review of "Don’t forget the boundary problem..." 2024-02-08T23:19:49.786Z
2023 in AI predictions 2024-01-01T05:23:42.514Z
A case for AI alignment being difficult 2023-12-31T19:55:26.130Z
Scaling laws for dominant assurance contracts 2023-11-28T23:11:07.631Z
Moral Reality Check (a short story) 2023-11-26T05:03:18.254Z
Non-superintelligent paperclip maximizers are normal 2023-10-10T00:29:53.072Z
A Proof of Löb's Theorem using Computability Theory 2023-08-16T18:57:41.048Z
SSA rejects anthropic shadow, too 2023-07-27T17:25:17.728Z
A review of Principia Qualia 2023-07-12T18:38:52.283Z
Hell is Game Theory Folk Theorems 2023-05-01T03:16:03.247Z
A short conceptual explainer of Immanuel Kant's Critique of Pure Reason 2022-06-03T01:06:32.394Z
A method of writing content easily with little anxiety 2022-04-08T22:11:47.298Z
Occupational Infohazards 2021-12-18T20:56:47.978Z
"Infohazard" is a predominantly conflict-theoretic concept 2021-12-02T17:54:26.182Z
Selfishness, preference falsification, and AI alignment 2021-10-28T00:16:47.051Z
My experience at and around MIRI and CFAR (inspired by Zoe Curzi's writeup of experiences at Leverage) 2021-10-16T21:28:12.427Z
Many-worlds versus discrete knowledge 2020-08-13T18:35:53.442Z
Modeling naturalized decision problems in linear logic 2020-05-06T00:15:15.400Z
Topological metaphysics: relating point-set topology and locale theory 2020-05-01T03:57:11.899Z
Two Alternatives to Logical Counterfactuals 2020-04-01T09:48:29.619Z
The absurdity of un-referenceable entities 2020-03-14T17:40:37.750Z
Puzzles for Physicalists 2020-03-12T01:37:13.353Z
A conversation on theory of mind, subjectivity, and objectivity 2020-03-10T04:59:23.266Z
Subjective implication decision theory in critical agentialism 2020-03-05T23:30:42.694Z
A critical agential account of free will, causation, and physics 2020-03-05T07:57:38.193Z
On the falsifiability of hypercomputation, part 2: finite input streams 2020-02-17T03:51:57.238Z
On the falsifiability of hypercomputation 2020-02-07T08:16:07.268Z
Philosophical self-ratification 2020-02-03T22:48:46.985Z
High-precision claims may be refuted without being replaced with other high-precision claims 2020-01-30T23:08:33.792Z
On hiding the source of knowledge 2020-01-26T02:48:51.310Z
On the ontological development of consciousness 2020-01-25T05:56:43.244Z
Is requires ought 2019-10-28T02:36:43.196Z
Metaphorical extensions and conceptual figure-ground inversions 2019-07-24T06:21:54.487Z
Dialogue on Appeals to Consequences 2019-07-18T02:34:52.497Z
Why artificial optimism? 2019-07-15T21:41:24.223Z
The AI Timelines Scam 2019-07-11T02:52:58.917Z
Self-consciousness wants to make everything about itself 2019-07-03T01:44:41.204Z
Writing children's picture books 2019-06-25T21:43:45.578Z
Conditional revealed preference 2019-04-16T19:16:55.396Z
Boundaries enable positive material-informational feedback loops 2018-12-22T02:46:48.938Z
Act of Charity 2018-11-17T05:19:20.786Z
EDT solves 5 and 10 with conditional oracles 2018-09-30T07:57:35.136Z
Reducing collective rationality to individual optimization in common-payoff games using MCMC 2018-08-20T00:51:29.499Z
Buridan's ass in coordination games 2018-07-16T02:51:30.561Z
Decision theory and zero-sum game theory, NP and PSPACE 2018-05-24T08:03:18.721Z
In the presence of disinformation, collective epistemology requires local modeling 2017-12-15T09:54:09.543Z
Autopoietic systems and difficulty of AGI alignment 2017-08-20T01:05:10.000Z


Comment by jessicata (jessica.liu.taylor) on The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review · 2024-04-03T22:20:49.572Z · LW · GW

Regarding quantum, I'd missed the bottom text. It seems if I only read the main text, the obvious interpretation is that points are events and the circles restrict which other events they can interact with. He says "At the same time, conspansion gives the quantum wave function of objects a new home: inside the conspanding objects themselves" which implies the wave function is somehow located in the objects.

From the diagram text, it seems he is instead saying that each circle represents entangled wavefunctions of some subset of objects that generated the circle. I still don't see how to get quantum non-locality from this. The wave function can be represented as a complex-valued function on configuration space; how could it be factored into a number of entanglements that only involve a small number of objects? In probability theory you can represent a probability measure as a factor graph, where each factor only involves a limited subset of variables, but (a) not all distributions can be efficiently factored this way, and (b) generalizing this to quantum wave functions is additionally complicated due to how wave functions differ from probability distributions.
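
To make the factor-graph point concrete, here is a minimal sketch in Python. The two factor tables, the three binary variables, and all names are illustrative assumptions, not anything taken from the CTMU or the post. It builds a joint distribution as a normalized product of pairwise factors and compares how many numbers a chain of such factors needs against a full joint table.

```python
from itertools import product

# Hypothetical pairwise factors over binary variables (A, B) and (B, C).
# The numbers are arbitrary non-negative weights chosen for illustration.
factor_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
factor_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}


def joint_from_factors():
    """Return the normalized joint P(a, b, c) proportional to f_ab(a, b) * f_bc(b, c)."""
    weights = {
        (a, b, c): factor_ab[(a, b)] * factor_bc[(b, c)]
        for a, b, c in product((0, 1), repeat=3)
    }
    total = sum(weights.values())
    return {assignment: w / total for assignment, w in weights.items()}


def table_sizes(n):
    """Entry counts for n binary variables: full joint table vs. a chain of pairwise factors."""
    full_table = 2 ** n
    chain_factors = 4 * (n - 1)
    return full_table, chain_factors


joint = joint_from_factors()
```

The compact form only exists when the distribution really does decompose along the chosen factors, which is point (a). The factors here are non-negative real weights, so the sketch says nothing about complex amplitudes, which is point (b).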

Comment by jessicata (jessica.liu.taylor) on Is requires ought · 2024-04-02T18:49:26.782Z · LW · GW

It's an expectation that has to do with a function of the thing, an expectation that the thing will function for some purpose. I suppose you could decompose that kind of claim to a more complex claim that doesn't involve "function", but in practice this is difficult.

I guess my main point is that sometimes fulfilling one's functions is necessary for knowledge, e.g. you need to check proofs correctly to have the knowledge that the proofs you have checked are correct; the expectation that you check proofs correctly is connected with the behavior of checking them correctly.

Comment by jessicata (jessica.liu.taylor) on The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review · 2024-03-29T22:10:21.073Z · LW · GW

I paid attention to this mainly because other people wanted me to, but the high IQ thing also draws some attention. I've seen ideas like "theory of cognitive processes should be integrated into philosophy of science" elsewhere (and have advocated such ideas myself), "syndiffeonesis" seems like an original term (although some versions of it appear in type theory), "conspansion" seems pretty Deleuzian, UBT is Spinozan, "telic recursion" is maybe original but highly underspecified... I think what I found useful about it is that it had a lot of these ideas, at least some of which are good, and different takes on/explanations of them than I've found elsewhere even when the ideas themselves aren't original.

Comment by jessicata (jessica.liu.taylor) on The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review · 2024-03-28T15:55:50.228Z · LW · GW

I don't see any. He even says his approach “leaves the current picture of reality virtually intact”. In Popper's terms this would be metaphysics, not science, which is part of why I'm skeptical of the claimed applications to quantum mechanics and so on. Note that, while there's a common interpretation of Popper saying metaphysics is meaningless, he contradicts this.

Quoting Popper:

Language analysts believe that there are no genuine philosophical problems, or that the problems of philosophy, if any, are problems of linguistic usage, or of the meaning of words. I, however, believe that there is at least one philosophical problem in which all thinking men are interested. It is the problem of cosmology: the problem of understanding the world—including ourselves, and our knowledge, as part of the world. All science is cosmology, I believe, and for me the interest of philosophy, no less than of science, lies solely in the contributions which it has made to it.


I have tried to show that the most important of the traditional problems of epistemology—those connected with the growth of knowledge—transcend the two standard methods of linguistic analysis and require the analysis of scientific knowledge. But the last thing I wish to do, however, is to advocate another dogma. Even the analysis of science—the ‘philosophy of science’—is threatening to become a fashion, a specialism. Yet philosophers should not be specialists. For myself, I am interested in science and in philosophy only because I want to learn something about the riddle of the world in which we live, and the riddle of man’s knowledge of that world. And I believe that only a revival of interest in these riddles can save the sciences and philosophy from narrow specialization and from an obscurantist faith in the expert’s special skill, and in his personal knowledge and authority; a faith that so well fits our ‘post-rationalist’ and ‘post-critical’ age, proudly dedicated to the destruction of the tradition of rational philosophy, and of rational thought itself.


Positivists usually interpret the problem of demarcation in a naturalistic way; they interpret it as if it were a problem of natural science. Instead of taking it as their task to propose a suitable convention, they believe they have to discover a difference, existing in the nature of things, as it were, between empirical science on the one hand and metaphysics on the other. They are constantly trying to prove that metaphysics by its very nature is nothing but nonsensical twaddle—‘sophistry and illusion’, as Hume says, which we should ‘commit to the flames’. If by the words ‘nonsensical’ or ‘meaningless’ we wish to express no more, by definition, than ‘not belonging to empirical science’, then the characterization of metaphysics as meaningless nonsense would be trivial; for metaphysics has usually been defined as non-empirical. But of course, the positivists believe they can say much more about metaphysics than that some of its statements are non-empirical. The words ‘meaningless’ or ‘nonsensical’ convey, and are meant to convey, a derogatory evaluation; and there is no doubt that what the positivists really want to achieve is not so much a successful demarcation as the final overthrow and the annihilation of metaphysics. However this may be, we find that each time the positivists tried to say more clearly what ‘meaningful’ meant, the attempt led to the same result—to a definition of ‘meaningful sentence’ (in contradistinction to ‘meaningless pseudo-sentence’) which simply reiterated the criterion of demarcation of their inductive logic.


In contrast to these anti-metaphysical stratagems—anti-metaphysical in intention, that is—my business, as I see it, is not to bring about the overthrow of metaphysics. It is, rather, to formulate a suitable characterization of empirical science, or to define the concepts ‘empirical science’ and ‘metaphysics’ in such a way that we shall be able to say of a given system of statements whether or not its closer study is the concern of empirical science.

Comment by jessicata (jessica.liu.taylor) on UDT1.01: The Story So Far (1/10) · 2024-03-28T01:07:27.221Z · LW · GW

Ok, I misunderstood. (See also my post on the relation between local and global optimality, and another post on coordinating local decisions using MCMC)

Comment by jessicata (jessica.liu.taylor) on UDT1.01: The Story So Far (1/10) · 2024-03-27T23:31:56.974Z · LW · GW

UDT1.0, since it’s just considering modifying its own move, corresponds to a player that’s acting as if it’s independent of what everyone else is deciding, instead of teaming up with its alternate selves to play the globally optimal policy.

I thought UDT by definition pre-computes the globally optimal policy? At least, that's the impression I get from reading Wei Dai's original posts.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-03-25T05:26:29.762Z · LW · GW

Some possible AI architectures are structured as goal-function optimization and, by the assumption that the human brain contains one or more expected utility maximizers, there is a human utility function that could be a possible AI goal. I'm not saying it's likely.

Comment by jessicata (jessica.liu.taylor) on Constructive Cauchy sequences vs. Dedekind cuts · 2024-03-17T00:20:06.458Z · LW · GW

With just that you could get upper bounds for the real. You could get some lower bounds by showing all rationals in the enumeration are greater than some rational, but this isn't always possible to do, so maybe your type includes things that aren't real numbers with provable lower bounds.

If you require both then we're back at the situation where, if there's a constructive proof that the enumerations min/max to the same value, you can get a Cauchy real out of this, and perhaps these are equivalent.

Comment by jessicata (jessica.liu.taylor) on Constructive Cauchy sequences vs. Dedekind cuts · 2024-03-16T03:37:05.937Z · LW · GW

It seems that a real number defined this way will have some perhaps-infinite list of rationals it's less than and one it's greater than. You might want to add a constraint that the maximum of the list of numbers it's above gets arbitrarily close to the minimum of the list of numbers it's below (as Tailcalled suggested).

With respect to Cauchy sequences, the issue is how to specify convergence; the epsilon/N definition is one way to do this and, constructively, gives a way of computing epsilon-good approximations.
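
As a rough illustration of the epsilon/N idea, here is a minimal Python sketch. The class name, the `modulus` convention, and the choice of sqrt(2) as the example are all assumptions made for this example. A real is given as a rational sequence together with a function that, for each epsilon, returns an index N past which all terms are within epsilon of each other; that function is what makes the approximation computable.

```python
from fractions import Fraction


class CauchyReal:
    """A real given by a rational sequence plus an explicit modulus of convergence.

    modulus(eps) must return an index N such that |seq(n) - seq(m)| <= eps
    for all n, m >= N. The class trusts this property; it does not check it.
    """

    def __init__(self, seq, modulus):
        self.seq = seq
        self.modulus = modulus

    def approx(self, eps):
        """Return a rational within eps of the limit."""
        return self.seq(self.modulus(eps))


def sqrt2_lower(n):
    """Lower endpoint after n bisection steps on [1, 2]; the interval has width 2**-n."""
    lo, hi = Fraction(1), Fraction(2)
    for _ in range(n):
        mid = (lo + hi) / 2
        if mid * mid <= 2:
            lo = mid
        else:
            hi = mid
    return lo


def sqrt2_modulus(eps):
    """Smallest N with 2**-N <= eps."""
    n = 0
    while Fraction(1, 2 ** n) > eps:
        n += 1
    return n


sqrt2 = CauchyReal(sqrt2_lower, sqrt2_modulus)
```

Calling `sqrt2.approx(Fraction(1, 1000))` returns a rational lower bound within 1/1000 of sqrt(2). Without the modulus, the bare sequence would still converge classically, but there would be no way to know how far along it to look.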

Comment by jessicata (jessica.liu.taylor) on Constructive Cauchy sequences vs. Dedekind cuts · 2024-03-15T17:53:12.654Z · LW · GW

The power of this seems similar to the power of constructive Cauchy sequences because you can use the (x < y) → A u B function to approximate the value to any positive precision error.
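
A minimal sketch of how such a function yields approximations, in Python. It reads "A u B" as a sum type (a tagged "lower" or "upper" answer), and it assumes initial rational bounds on either side of the real are already known; both are assumptions of this example, as are the function names. Given rationals x < y, the locator says either that x is below the real or that y is above it, and repeated trisection shrinks the interval.

```python
from fractions import Fraction


def locate_sqrt2(x, y):
    """Hypothetical instance of the (x < y) -> A + B function for sqrt(2).

    Given rationals x < y, return "lower" if x < sqrt(2),
    otherwise "upper", meaning sqrt(2) < y.
    """
    assert x < y
    if x < 0 or x * x < 2:
        return "lower"
    # Here x >= sqrt(2), and y > x, so y is strictly above sqrt(2).
    return "upper"


def approximate(locate, lo, hi, eps):
    """Shrink [lo, hi] around the real until its width is at most eps.

    Assumes lo < r < hi initially, where r is the real described by locate.
    """
    while hi - lo > eps:
        third = (hi - lo) / 3
        x, y = lo + third, hi - third
        if locate(x, y) == "lower":
            lo = x  # x < r, so r stays inside (lo, hi)
        else:
            hi = y  # r < y, so r stays inside (lo, hi)
    return lo, hi


lo, hi = approximate(locate_sqrt2, Fraction(1), Fraction(2), Fraction(1, 10**6))
```

Each step keeps two thirds of the interval, so the number of calls to the locator grows only logarithmically in 1/eps.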

Comment by jessicata (jessica.liu.taylor) on Constructive Cauchy sequences vs. Dedekind cuts · 2024-03-15T17:50:04.466Z · LW · GW

By truth values do you mean Prop or something else?

Comment by jessicata (jessica.liu.taylor) on Constructive Cauchy sequences vs. Dedekind cuts · 2024-03-15T05:56:38.315Z · LW · GW

Here's how one might specify Dedekind cuts in type theory. Provide two types A, B with mappings f : A → ℚ, g : B → ℚ. To show these cover all the rationals, provide c : ℚ → A ⊔ B such that the value returned by c maps back to its argument, through functions f or g. But this lets us re-construct a function ℚ → Bool by seeing whether c provides an A or a B. There are other ways of doing this but I'm not sure what else is worth analyzing.

Comment by jessicata (jessica.liu.taylor) on Constructive Cauchy sequences vs. Dedekind cuts · 2024-03-15T05:51:54.969Z · LW · GW

Comment by jessicata (jessica.liu.taylor) on Simple Kelly betting in prediction markets · 2024-03-08T01:57:35.800Z · LW · GW

Well, the one thing making that difficult is that I did not know the Lagrange multiplier theorem until reading this comment.

I agree this is in practice not directly applicable because buying contracts with all your money is silly.

Comment by jessicata (jessica.liu.taylor) on Why Two Valid Answers Approach is not Enough for Sleeping Beauty · 2024-02-08T00:59:21.003Z · LW · GW

All you need is to construct an appropriate probability space and use basic probability theory instead of inventing clever reasons why it doesn’t apply in this particular case.

I don't see how to do that, but maybe your plan is to get to that at some point.

Am I missing something? How is it at all controversial?

It's not; it's just a modification of the usual halfer argument that "you don't learn anything upon waking up".

Comment by jessicata (jessica.liu.taylor) on Why Two Valid Answers Approach is not Enough for Sleeping Beauty · 2024-02-07T02:27:49.105Z · LW · GW

  • Halfers have to condition on there being at least one observer in the possible world. If the coin can come up 0, 1, or 2 at 1/3 each, and Sleeping Beauty wakes up that number of times, halfers still think the 0 outcome is 0% likely upon waking up.
  • Halfers also have to construct the reference class carefully. If there are many events of people with amnesia waking up once or twice, and SSA's reference class consists of the set of awakenings from these, then SSA and SIA will agree on a 1/3 probability. This is because in a large population, about 1/3 of awakenings are in worlds where the coin came up such that there would be one awakening.

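
The arithmetic behind both bullets can be sketched in a few lines of Python. The function names and the dictionary encoding (number of awakenings mapped to prior probability) are assumptions of this example. The first function does the per-world halfer update that rules out worlds with no awakenings; the second computes the expected share of all awakenings that fall in each kind of world, which is what a reference class pooled over a large population gives.

```python
from fractions import Fraction


def halfer_posterior(prior):
    """Per-world update: worlds with zero awakenings get probability 0 on waking,
    and the remaining worlds are renormalized without weighting by awakenings."""
    total = sum(p for k, p in prior.items() if k > 0)
    return {k: (p / total if k > 0 else Fraction(0)) for k, p in prior.items()}


def awakening_weighted(prior):
    """Share of all awakenings occurring in each kind of world, in the
    large-population limit: weight each world by its number of awakenings."""
    weights = {k: p * k for k, p in prior.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Keys are the number of awakenings; values are prior coin probabilities.
three_way = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
standard = {1: Fraction(1, 2), 2: Fraction(1, 2)}

print(halfer_posterior(three_way))
print(awakening_weighted(standard))
```

With the standard two-outcome coin, the awakening-weighted share for the one-awakening world is 1/3, matching the second bullet; with the three-outcome coin, the halfer update assigns 0 to the zero-awakening world and 1/2 to each of the others.
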
Comment by jessicata (jessica.liu.taylor) on A Shutdown Problem Proposal · 2024-01-22T02:59:01.127Z · LW · GW

I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button will be independent of the human pressing the button. So it could lose money to a different agent that thinks these are correlated, as they are.

Comment by jessicata (jessica.liu.taylor) on Scaling laws for dominant assurance contracts · 2024-01-15T02:50:39.079Z · LW · GW

Nice job with the bound! I've heard a number of people in my social sphere say very positive things about DACs so this is mainly my response to them.

Comment by jessicata (jessica.liu.taylor) on Universal Love Integration Test: Hitler · 2024-01-11T00:58:02.070Z · LW · GW

You mentioned wanting to get the game theory of love correct. Understanding a game involves understanding the situations and motives of the involved agents. So getting the game theory of love correct with respect to some agent implies understanding that agent's situation.

Comment by jessicata (jessica.liu.taylor) on Universal Love Integration Test: Hitler · 2024-01-11T00:41:16.746Z · LW · GW

This seems more like "imagining being nice to Hitler, as one could be nice to anyone" than "imagining what Hitler was in fact like and why his decisions seemed to him like the thing to do". Computing the game theoretically right strategy involves understanding different agents' situations, the kind of empathy that couldn't be confused with being a doormat, sometimes called "cognitive empathy".

I respect Sarah Constantin's attempt to understand Hitler's psychological situation.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-10T23:57:08.960Z · LW · GW

If you define "human values" as "what humans would say about their values across situations", then yes, predicting "human values" is a reasonable training objective. Those just aren't really what we "want" as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.

That's also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It's possible that the objectives of these maximizers are affected by socialization, but they'll be less affected by socialization than verbal statements about values, because they're harder to fake so less affected by preference falsification.

Children learn some sense of what they're supposed to say about values, but have some pre-built sense of "what to do / aim for" that's affected by evopsych and so on. It seems like there's a huge semantic problem with talking about "values" in a way that's ambiguous between "in-built evopsych-ish motives" and "things learned from culture about what to endorse", but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term "values" rather than "preferences".

In the section on subversion I made the case that terminal values make much more difference in subversive behavior than compliant behavior.

It seems like to get at the values of approximate utility maximizers located in the brain you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.
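
As a toy illustration of what inverse planning means here (not the model from the paper, and with every name, the one-dimensional world, and the `beta` rationality parameter being assumptions of this sketch): assume the agent picks steps that noisily reduce its distance to a hidden goal, then use Bayes' rule to get a posterior over candidate goals from the observed steps.

```python
import math


def action_likelihood(state, action, goal, beta=2.0):
    """P(action | state, goal) for a noisily rational agent on a line.

    Actions are -1 or +1; steps that end closer to the goal are more likely.
    """
    def score(a):
        return math.exp(-beta * abs(state + a - goal))

    return score(action) / (score(-1) + score(+1))


def infer_goal(start, actions, goals, beta=2.0):
    """Posterior over candidate goals given observed actions, with a uniform prior."""
    posterior = {g: 1.0 for g in goals}
    state = start
    for a in actions:
        for g in goals:
            posterior[g] *= action_likelihood(state, a, g, beta)
        state += a
    total = sum(posterior.values())
    return {g: p / total for g, p in posterior.items()}


# Three steps to the right strongly favor the goal at +5 over the goal at -5.
far_apart = infer_goal(start=0, actions=[+1, +1, +1], goals=[-5, 5])
# The same steps cannot distinguish goals at +3 and +5.
same_side = infer_goal(start=0, actions=[+1, +1, +1], goals=[3, 5])
```

The second call is the relevant one for the point above: the observed behavior is predicted equally well by both goals, so the behavior alone leaves the goal underdetermined.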

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-10T22:57:52.283Z · LW · GW

How would you design a task that incentivizes a system to output its true estimates of human values? We don't have ground truth for human values, because they're mind states not behaviors.

Seems easier to create incentives for things like "wash dishes without breaking them"; you can just tell.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-08T19:46:37.423Z · LW · GW

I'm mainly trying to communicate with people familiar with AI alignment discourse. If other people can still understand it, that's useful, but not really the main intention.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-08T19:34:05.299Z · LW · GW

I do think this part is speculative. The degree of "inner alignment" to the training objective depends on the details.

Partly the degree to which "try to model the world well" leads to real-world agency depends on the details of this objective. For example, doing a scientific experiment would result in understanding the world better, and if there's RL training towards "better understand the world", that could propagate to intending to carry out experiments that increase understanding of the world, which is a real-world objective.

If, instead, the AI's dataset is fixed and it's trying to find a good compression of it, that's less directly a real-world objective. However, depending on the training objective, the AI might get a reward from thinking certain thoughts that would result in discovering something about how to compress the dataset better. This would be "consequentialism" at least within a limited, computational domain.

An overall reason for thinking it's at least uncertain whether AIs that model the world would care about it is that an AI that did care about the world would, as an instrumental goal, compliantly solve its training problems and some test problems (before it has the capacity for a treacherous turn). So, good short-term performance doesn't by itself say much about goal-directed behavior in generalizations.

The distribution of goals with respect to generalization, therefore, depends on things like which mind-designs are easier to find by the search/optimization algorithm. It seems pretty uncertain to me whether agents with general goals might be "simpler" than agents with task-specific goals (it probably depends on the task), therefore easier to find while getting ~equivalent performance. I do think that gradient descent is relatively more likely to find inner-aligned agents (with task-specific goals), because the internal parts are gradient descended towards task performance, it's not just a black box search.

Yudkowsky mentions evolution as an argument that inner alignment can't be assumed. I think there are quite a lot of dis-analogies between evolution and ML, but the general point that some training processes result in agents whose goals aren't aligned with the training objective holds. I think, in particular, supervised learning systems like LLMs are unlikely to exhibit this, as explained in the section on myopic agents.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-04T02:32:24.152Z · LW · GW

I tested it on 3 held-out problems and it got 1/3. Significant progress, increases the chance these can be solved with prompting. So partially it's a question of whether any major LLMs incorporate better auto prompting.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-04T02:19:52.620Z · LW · GW

Nice prompt! It solved the 3 x 3 problem too.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-03T21:53:52.894Z · LW · GW

There are evolutionary priors for what to be afraid of but some of it is learned. I've heard children don't start out fearing snakes but will easily learn to if they see other people afraid of them, whereas the same is not true for flowers (sorry, can't find a ref, but this article discusses the general topic). Fear of heights might be innate but toddlers seem pretty bad at not falling down stairs. Mountain climbers have to be using mainly mechanical reasoning to figure out which heights are actually dangerous. It seems not hard to learn the way in which heights are dangerous if you understand the mechanics required to walk and traverse stairs and so on.

Instincts like curiosity are more helpful at the beginning of life; over time they can be learned as instrumental goals. If an AI learns advanced metacognitive strategies instead of innate curiosity, that's not obviously a big problem from a human values perspective, but it's unclear.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-03T07:51:05.802Z · LW · GW

Most civilizations in the past have had "bad values" by our standards. People have been in preference falsification equilibria where they feel like they have to endorse certain values or face social censure. They probably still are falsifying preferences and our civilizational values are probably still bad. E.g. high incidence of people right now saying they're traumatized. CEV probably tends more towards the values of untraumatized than traumatized humans, even from a somewhat traumatized starting point.

The idea that civilization is "oppressive" and some societies have fewer problems points to value drift that has already happened. The Roman empire was really, really bad and has influenced future societies due to Christianity and so on. Civilizations have become powerful partly through military mobilization. Civilizations can be nice to live in in various ways, but that mostly has to do with greater satisfaction of instrumental values.

Some of the value drift might not be worth undoing, e.g. value drift towards caring more about far-away people than humans naturally would.

Comment by jessicata (jessica.liu.taylor) on AI Is Not Software · 2024-01-02T21:53:42.275Z · LW · GW

Seems like an issue of code/data segmentation. Programs can contain compile time constants, and you could turn a neural network into a program that has compile time constants for the weights, perhaps "distilling" it to reduce the total size, perhaps even binarizing it.
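
A minimal sketch of the "weights as compile-time constants" idea, in Python. The single threshold unit, the weight values, and the function names are hypothetical, and sign-based binarization is just one crude choice. The weights are written into generated source text as literals, and that text is then executed to obtain an ordinary function.

```python
def emit_program(weights, bias):
    """Return Python source for a function with the weights baked in as literals."""
    terms = " + ".join(f"({w!r} * x[{i}])" for i, w in enumerate(weights))
    return (
        "def predict(x):\n"
        "    # weights and bias are literals in the code, not loaded data\n"
        f"    return 1 if {terms} + ({bias!r}) > 0 else 0\n"
    )


def binarize(weights):
    """Replace each weight by its sign (+1 or -1), a crude form of binarization."""
    return [1 if w >= 0 else -1 for w in weights]


# Hypothetical weights for a single linear threshold unit.
weights, bias = [0.8, -1.3, 0.4], 0.1

source = emit_program(binarize(weights), bias)
namespace = {}
exec(source, namespace)  # the generated text becomes an ordinary function
predict = namespace["predict"]
```

After this, `predict` is indistinguishable from hand-written code, even though everything that determines its behavior came from the weight list.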

Arguably, video games aren't entirely software by this standard, because they use image assets.

Formally segmenting "code" from "data" is famously hard because "code as data" is how compilers work and "data as code" is how interpreters work. Some AI techniques involve program synthesis.

I think the relevant issue is copyright more than the code/data distinction? Since code can be copyrighted too.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-02T21:45:31.633Z · LW · GW

I think it's hard because it requires some planning and puzzle solving in a new, somewhat complex environment. The AI results on Montezuma's Revenge seem pretty unimpressive to me because they're going to a new room, trying random stuff until they make progress, then "remembering" that for future runs. Which means they need quite a lot of training data.

For short-term RL given lots of feedback, there are already decent results, e.g. in StarCraft and Dota. So the difficulty is more figuring out how to automatically scope out narrow RL problems that can be learned without too much training time.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T20:50:17.254Z · LW · GW

From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc. Curiosity is partially innate but it's also probably partially learned. I guess that's not super different from pain avoidance. But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T20:47:05.304Z · LW · GW

I did mention LLMs as myopic agents.

If they actually simulate humans it seems like maybe legacy humans get outcompeted by simulated humans. I'm not sure that's worse than what humans expected without technological transcendence (normal death, getting replaced by children and eventually conquering civilizations, etc). Assuming the LLMs that simulate humans well are moral patients (see anti zombie arguments).

It's still not as good as could be achieved in principle. Seems like having the equivalent of "legal principles" that get used as training feedback could help. Plus direct human feedback. Maybe the system gets subverted eventually but the problem of humans getting replaced by em-like AIs is mostly a short term one of current humans being unhappy about that.

Comment by jessicata (jessica.liu.taylor) on SSA rejects anthropic shadow, too · 2024-01-02T20:43:06.923Z · LW · GW

Yeah, that's a good reference.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-02T20:42:03.963Z · LW · GW

Thanks, added.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T20:37:46.752Z · LW · GW

I think use of AI tools could have similar results to human cognitive enhancement, which I expect to basically be helpful. They'll have more problems with things that are enhanced by stuff like "bigger brain size" rather than "faster thought" and "reducing entropic error rates / wisdom of the crowds" because they're trained on humans. One can in general expect more success on this sort of thing by having an idea of what problem is even being solved. There's a lot of stuff that happens in philosophy departments that isn't best explained by "solving the problem" (which is under-defined anyway) and could be explained by motives like "building connections", "getting funding", "being on the good side of powerful political coalitions", etc. So psychology/sociology of philosophy seems like an approach to understand what is even being done when humans say they're trying to solve philosophy problems.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T20:33:06.396Z · LW · GW

I meant to say I'd be relatively good at it; I think it would be hard to find 20 people who are better than me at this sort of thing. The original ITT was about simulating "a libertarian" rather than "a particular libertarian", so emulating Yudkowsky specifically is a difficulty increase that would have to be compensated for. I think replicating writing style isn't the main issue; replicating the substance of arguments is, which is unfortunately harder to test. This post wasn't meant to do this, as I said.

I'm also not sure in particular what about the Yudkowskian AI risk models you think I don't understand. I disagree in places but that's not evidence of not understanding them.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T20:30:00.755Z · LW · GW

I'm defining "values" as what approximate expected utility optimizers in the human brain want. Maybe "wants" is a better word. People falsify their preferences and in those cases it seems more normative to go with internal optimizer preferences.

Re indexicality, this is a "the AI knows but does not care" issue; it's about specifying it, not about there being some AI module somewhere that "knows" it. If AGI were generated partially from humans understanding how to encode indexical goals, that would be a different situation.

Re treacherous turns, I agreed that myopic agents don't have this issue to nearly the extent that long-term real-world optimizing agents do. It depends how the AGI is selected. If it's selected by "getting good performance according to a human evaluator in the real world" then at some capability level AGIs that "want" that will be selected more.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T08:08:25.441Z · LW · GW

They would approximate human agency at the limit but there's both the issue of how fast they approach the limit and the degree to which they have independent agency rather than replicating human agency. There are fewer deceptive alignment problems if the long term agency they have is just an approximation of human agency.

Mostly I don't think there's much of an alignment problem for LLMs, because they basically approximate human-like agency. But they aren't approaching autopoiesis; they'll lead to some state transition that is kind of like human enhancement and kind of like invention of new tools. There are eventually capability gains from modeling things using a different, better set of concepts and agent substrate than humans have; it's just that the best current methods heavily rely on human concepts.

I don't understand what you think the pressing concerns with LLM alignment are. It seems like Paul Christiano type methods would basically work for them. They don't have a fundamentally different set of concepts and type of long-term agency from humans, so humans thinking long enough to evaluate LLMs with the help of other LLMs, in order to generate RL signals and imitation targets, seems sufficient.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-02T06:24:02.207Z · LW · GW

Ok, I added this prediction.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-02T05:06:02.705Z · LW · GW

Do you know if Andrew Ng or Yann LeCun has made a specific prediction that AGI won't arrive by some date? Couldn't find it through a quick search. Idk what others to include.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T04:00:39.234Z · LW · GW

I'm assuming the relevant values are the optimizer ones not what people say. I discussed social institutions, including those encouraging people to endorse and optimize for common values, in the section on subversion.

Alignment with a human other than yourself could be a problem because people are to some degree selfish and, to a smaller degree, have different general principles/aesthetics about how things should be. So some sort of incentive optimization / social choice theory / etc might help. But at least there's significant overlap between different humans' values. Though there's a pretty big existing problem of people dying: the default was already that current people would be replaced by other people.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-02T00:41:31.963Z · LW · GW

To the extent people now don't care about the long-term future there isn't much to do in terms of long-term alignment. People right now who care about what happens 2000 years from now probably have roughly similar preferences to people 1000 years from now who aren't significantly biologically changed or cognitively enhanced, because some component of what people care about is biological.

I'm not saying it would be random so much as not very dependent on the original history of humans used to train early AGI iterations. It would have a different data history, but part of that is because of different measurements, e.g. scientific measuring tools. Different ontology means that value-laden things people might care about, like "having good relationships with other humans", are not meaningful things to future AIs in terms of their world model and not something they would care much about by default (they aren't even modeling the world in those terms), and it would be hard to encode a utility function so they care about it despite the ontological difference.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-02T00:35:32.443Z · LW · GW

Beat Ocarina of Time with <100 hours of playing Zelda games during training or deployment (but perhaps training on other games), no reading guides/walkthroughs/playthroughs, no severe bug exploits (those that would cut down the required time by a lot), no reward-shaping/advice specific to this game generated by humans who know non-trivial things about the game (but the agent can shape its own reward). This includes an LLM coding a program to do it. I'd say probably not by 2033.

Comment by jessicata (jessica.liu.taylor) on A case for AI alignment being difficult · 2024-01-01T23:52:32.285Z · LW · GW

I think it's possible human values depend on life history too, but that seems to add additional complexity and make alignment harder. If the effects of life history very much dominate those of evolutionary history, then maybe neglecting evolutionary history would be more acceptable, making the problem easier.

But I don't think default AGI would be especially path dependent on human collective life history. Human society changes over time as humans supersede old cultures (see section on subversion). AGI would be a much bigger shift than the normal societal shifts and so would drift from human culture more rapidly, partially due to different conceptual ontology and so on. The legacy concepts of humans would be a pretty inefficient system for AGIs to keep using. It's like how scientists aren't alchemists anymore, but a bigger shift than that.

(Note, LLMs still rely a lot on human concepts rather than having independent ontology and agency, so this is more about future AI systems)

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-01T23:44:24.130Z · LW · GW

I think they will remain hard by EOY 2024, as in, of this problem and the 7 held-out ones of similar difficulty, the best LLM will probably not solve 4/8.

I think I would update some on how fast LLMs are advancing, but these are not inherently very hard problems, so I don't think it would be a huge surprise; this was meant to be one of the easiest things they fail at right now. Maybe if that happens I would think things are going 1.6x as fast short term as I would have otherwise thought?

I was surprised by GPT3/3.5 but not so much by 4. On net, I think it adds up to an update that LLMs are advancing faster than I thought, but I haven't much changed my long-term AGI timelines, because I think that will involve lots of techs, not just LLMs, although LLM progress is some update about general tech progress.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-01T19:08:33.476Z · LW · GW

I've added 6 more held-out problems for a total of 7. Agree that getting the answer without pointing out problems is the right standard.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-01T18:29:41.881Z · LW · GW


Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-01T17:58:26.297Z · LW · GW

Here's the harder problem. I've also held out a third problem without posting it online.

harder problem

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-01T17:45:38.358Z · LW · GW

Maybe those don't stick out to me because long timelines seem like the default hypothesis to me, and there are a lot of people stating specific, falsifiable short timelines predictions locally, so there's a selection effect. I added Brian Chau and Robin Hanson to the list though; not sure who else (other than me) has made specific long timelines predictions and would be good to add. Would like to add people like Yann LeCun and Andrew Ng if there are specific falsifiable predictions they made.

Comment by jessicata (jessica.liu.taylor) on 2023 in AI predictions · 2024-01-01T08:42:54.108Z · LW · GW

I've written about the anthropic question. Appreciate the update!