Comments
With just that you could get upper bounds for the real. You could get some lower bounds by showing that all rationals in the enumeration are greater than some rational, but this isn't always possible to do, so your type might include things that aren't real numbers with provable lower bounds.
If you require both then we're back to the situation where, if there's a constructive proof that the two enumerations' max and min converge to the same value, you can get a Cauchy real out of this, and perhaps the definitions are equivalent.
It seems that a real number defined this way will have some perhaps-infinite list of rationals it's less than and another it's greater than. You might want to add a constraint that the maximum of the list of numbers it's above gets arbitrarily close to the minimum of the list of numbers it's below (as Tailcalled suggested).
With respect to Cauchy sequences, the issue is how to specify convergence; the epsilon/N definition is one way to do this and, constructively, gives a way of computing epsilon-good approximations.
The power of this seems similar to the power of constructive Cauchy sequences, because you can use the (x < y) → A ∪ B function to approximate the value to any positive precision.
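To illustrate, here's a minimal sketch under one reading of that comparison function: given rationals x < y, it tells us either that the real is above x or that it is below y, which is enough to narrow an enclosing interval to any positive width. The function names and the sqrt(2) oracle are illustrative, not from the original comment.

```python
from fractions import Fraction

def approximate(located, lo, hi, eps):
    # Invariant: the real lies in (lo, hi); each step shrinks the interval to 2/3 its width.
    while hi - lo > eps:
        x = lo + (hi - lo) / 3
        y = hi - (hi - lo) / 3
        if located(x, y) == "above_x":   # the real is greater than x
            lo = x
        else:                            # the real is less than y
            hi = y
    return (lo + hi) / 2

def sqrt2_located(x, y):
    # Illustrative oracle for sqrt(2): since x < y, at least one branch applies.
    return "above_x" if x * x < 2 else "below_y"

print(approximate(sqrt2_located, Fraction(1), Fraction(2), Fraction(1, 1000)))
```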
By truth values do you mean Prop or something else?
Here's how one might specify Dedekind cuts in type theory. Provide two types A, B with mappings a : A → ℚ and b : B → ℚ. To show these cover all the rationals, provide c : ℚ → A + B such that the value returned by c maps back to its argument, through a or b. But this lets us re-construct a function ℚ → Bool by seeing whether c provides an A or a B. There are other ways of doing this but I'm not sure what else is worth analyzing.
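As a minimal Lean-style sketch of that specification (field names are mine; this assumes a `Rat` type is available, e.g. from Std/Mathlib):

```lean
structure CutSpec where
  A : Type                 -- evidence that a rational is on the lower side
  B : Type                 -- evidence that a rational is on the upper side
  a : A → Rat              -- which rational each A-witness is about
  b : B → Rat              -- which rational each B-witness is about
  c : Rat → Sum A B        -- every rational gets classified to one side
  coh : ∀ q : Rat,
    (match c q with
     | .inl x => a x
     | .inr y => b y) = q  -- the classification is about q itself

-- The coverage function c lets us decide which side any rational falls on:
def side (s : CutSpec) (q : Rat) : Bool :=
  match s.c q with
  | .inl _ => true    -- q is below the cut
  | .inr _ => false   -- q is above the cut
```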
Well, the one thing making that difficult is that I did not know the Lagrange multiplier theorem until reading this comment.
I agree this is in practice not directly applicable because buying contracts with all your money is silly.
All you need is to construct an appropriate probability space and use basic probability theory instead of inventing clever reasons why it doesn’t apply in this particular case.
I don't see how to do that, but maybe your plan is to get to that at some point.
Am I missing something? How is it at all controversial?
it's not, it's just a modification of the usual halfer argument that "you don't learn anything upon waking up"
- halfers have to condition on there being at least one observer in the possible world. if the coin can come up 0, 1, or 2 at 1/3 each, and Sleeping Beauty wakes up that number of times, halfers still think the 0 outcome is 0% likely upon waking up.
- halfers also have to construct the reference class carefully. if there are many events of people with amnesia waking up once or twice, and SSA's reference class consists of the set of awakenings from these, then SSA and SIA will agree on a 1/3 probability. this is because in a large population, about 1/3 of awakenings are in worlds where the coin came up such that there would be one awakening.
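A quick check of that last claim, assuming the standard setup (fair coin, one awakening on heads, two on tails), sketched in Python:

```python
import random

random.seed(0)
awakenings_in_one_awakening_worlds = 0
total_awakenings = 0
for _ in range(100_000):
    heads = random.random() < 0.5
    n = 1 if heads else 2          # number of awakenings in this world
    total_awakenings += n
    if n == 1:
        awakenings_in_one_awakening_worlds += n

# Fraction of awakenings that occur in one-awakening worlds: ~1/3.
print(awakenings_in_one_awakening_worlds / total_awakenings)
```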
I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button being pressed is independent of the human trying to press it. So it could lose money to a different agent that thinks these are correlated, as they in fact are.
Nice job with the bound! I've heard a number of people in my social sphere say very positive things about DACs so this is mainly my response to them.
You mentioned wanting to get the game theory of love correct. Understanding a game involves understanding the situations and motives of the involved agents. So getting the game theory of love correct with respect to some agent implies understanding that agent's situation.
This seems more like "imagining being nice to Hitler, as one could be nice to anyone" than "imagining what Hitler was in fact like and why his decisions seemed to him like the thing to do". Computing the game-theoretically right strategy involves understanding different agents' situations, the kind of empathy that couldn't be confused with being a doormat, sometimes called "cognitive empathy".
I respect Sarah Constantin's attempt to understand Hitler's psychological situation.
If you define "human values" as "what humans would say about their values across situations", then yes, predicting "human values" is a reasonable training objective. Those just aren't really what we "want" as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.
That's also not how I defined human values; my definition is based on the assumption that the human brain contains one or more expected utility maximizers. It's possible that the objectives of these maximizers are affected by socialization, but they'll be less affected by socialization than verbal statements about values are, because they're harder to fake and so less affected by preference falsification.
Children learn some sense of what they're supposed to say about values, but have some pre-built sense of "what to do / aim for" that's affected by evopsych and so on. It seems like there's a huge semantic problem with talking about "values" in a way that's ambiguous between "in-built evopsych-ish motives" and "things learned from culture about what to endorse", but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term "values" rather than "preferences".
In the section on subversion I made the case that terminal values make much more difference in subversive behavior than compliant behavior.
It seems like to get at the values of approximate utility maximizers located in the brain you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.
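As a toy illustration of the difference (my own example, not the cited paper's model): inverse planning infers which goal best explains an observed action, assuming the agent is a noisy (softmax) utility maximizer, rather than just predicting the action itself.

```python
import math

# Hypothetical goals and the utility each assigns to each action (illustrative numbers).
goals = {"security": {"save": 2.0, "spend": 0.0},
         "enjoyment": {"save": 0.0, "spend": 2.0}}

def action_prob(goal, action, beta=1.0):
    # Softmax ("noisily rational") choice: higher-utility actions are chosen more often.
    utils = goals[goal]
    z = sum(math.exp(beta * u) for u in utils.values())
    return math.exp(beta * utils[action]) / z

def goal_posterior(observed_action):
    # Bayesian inverse planning with a uniform prior over goals.
    likelihoods = {g: action_prob(g, observed_action) for g in goals}
    total = sum(likelihoods.values())
    return {g: lk / total for g, lk in likelihoods.items()}

print(goal_posterior("save"))  # most posterior mass lands on "security"
```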
How would you design a task that incentivizes a system to output its true estimates of human values? We don't have ground truth for human values, because they're mind states not behaviors.
Seems easier to create incentives for things like "wash dishes without breaking them", since you can just tell whether that happened.
I'm mainly trying to communicate with people familiar with AI alignment discourse. If other people can still understand it, that's useful, but not really the main intention.
I do think this part is speculative. The degree of "inner alignment" to the training objective depends on the details.
The degree to which "try to model the world well" leads to real-world agency depends partly on the details of this objective. For example, doing a scientific experiment would result in understanding the world better, and if there's RL training towards "better understand the world", that could propagate to intending to carry out experiments that increase understanding of the world, which is a real-world objective.
If, instead, the AI's dataset is fixed and it's trying to find a good compression of it, that's less directly a real-world objective. However, depending on the training objective, the AI might get a reward from thinking certain thoughts that would result in discovering something about how to compress the dataset better. This would be "consequentialism" at least within a limited, computational domain.
An overall reason for thinking it's at least uncertain whether AIs that model the world would care about it is that an AI that did care about the world would, as an instrumental goal, compliantly solve its training problems and some test problems (before it has the capacity for a treacherous turn). So, good short-term performance doesn't by itself say much about how goal-directed behavior generalizes.
The distribution of goals with respect to generalization, therefore, depends on things like which mind-designs are easier to find by the search/optimization algorithm. It seems pretty uncertain to me whether agents with general goals might be "simpler" than agents with task-specific goals (it probably depends on the task), and therefore easier to find while getting ~equivalent performance. I do think that gradient descent is relatively more likely to find inner-aligned agents (with task-specific goals), because the internal parts are gradient-descended towards task performance; it's not just a black-box search.
Yudkowsky mentions evolution as an argument that inner alignment can't be assumed. I think there are quite a lot of dis-analogies between evolution and ML, but the general point that some training processes result in agents whose goals aren't aligned with the training objective holds. I think, in particular, supervised learning systems like LLMs are unlikely to exhibit this, as explained in the section on myopic agents.
I tested it on 3 held-out problems and it got 1/3. Significant progress; it increases the chance these can be solved with prompting. So partially it's a question of whether any major LLMs incorporate better auto-prompting.
Nice prompt! It solved the 3 x 3 problem too.
There are evolutionary priors for what to be afraid of but some of it is learned. I've heard children don't start out fearing snakes but will easily learn to if they see other people afraid of them, whereas the same is not true for flowers (sorry, can't find a ref, but this article discusses the general topic). Fear of heights might be innate but toddlers seem pretty bad at not falling down stairs. Mountain climbers have to be using mainly mechanical reasoning to figure out which heights are actually dangerous. It seems not hard to learn the way in which heights are dangerous if you understand the mechanics required to walk and traverse stairs and so on.
Instincts like curiosity are more helpful at the beginning of life; over time they can be learned as instrumental goals. If an AI learns advanced metacognitive strategies instead of innate curiosity, that's not obviously a big problem from a human values perspective, but it's unclear.
Most civilizations in the past have had "bad values" by our standards. People have been in preference falsification equilibria where they feel like they have to endorse certain values or face social censure. They probably still are falsifying preferences and our civilizational values are probably still bad. E.g. high incidence of people right now saying they're traumatized. CEV probably tends more towards the values of untraumatized than traumatized humans, even from a somewhat traumatized starting point.
The idea that civilization is "oppressive" and some societies have fewer problems points to value drift that has already happened. The Roman empire was really, really bad and has influenced future societies due to Christianity and so on. Civilizations have become powerful partly through military mobilization. Civilizations can be nice to live in in various ways, but that mostly has to do with greater satisfaction of instrumental values.
Some of the value drift might not be worth undoing, e.g. value drift towards caring more about far-away people than humans naturally would.
Seems like an issue of code/data segmentation. Programs can contain compile-time constants, and you could turn a neural network into a program that has compile-time constants for the weights, perhaps "distilling" it to reduce the total size, perhaps even binarizing it.
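For illustration, a toy sketch of what "weights as compile-time constants" looks like (made-up numbers, not from any real network):

```python
# Trained parameters baked directly into the source as constants.
W = [[0.12, -0.53], [0.87, 0.04]]
b = [0.10, -0.20]

def layer(x):
    # One dense layer using only the baked-in parameters; no separate data file needed.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

print(layer([1.0, 2.0]))
```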
Arguably, video games aren't entirely software by this standard, because they use image assets.
Formally segmenting "code" from "data" is famously hard because "code as data" is how compilers work and "data as code" is how interpreters work. Some AI techniques involve program synthesis.
I think the relevant issue is copyright more than the code/data distinction? Since code can be copyrighted too.
I think it's hard because it requires some planning and puzzle solving in a new, somewhat complex environment. The AI results on Montezuma's Revenge seem pretty unimpressive to me because they're going to a new room, trying random stuff until they make progress, then "remembering" that for future runs, which means they need quite a lot of training data.
For short-term RL given lots of feedback, there are already decent results, e.g. in StarCraft and Dota. So the difficulty is more figuring out how to automatically scope out narrow RL problems that can be learned without too much training time.
From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc. Curiosity is partially innate but it's also probably partially learned. I guess that's not super different from pain avoidance. But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.
I did mention LLMs as myopic agents.
If they actually simulate humans, it seems like maybe legacy humans get outcompeted by simulated humans. I'm not sure that's worse than what humans expected without technological transcendence (normal death, getting replaced by children and eventually by conquering civilizations, etc.). Assuming the LLMs that simulate humans well are moral patients (see anti-zombie arguments).
It's still not as good as could be achieved in principle. Seems like having the equivalent of "legal principles" that get used as training feedback could help. Plus direct human feedback. Maybe the system gets subverted eventually but the problem of humans getting replaced by em-like AIs is mostly a short term one of current humans being unhappy about that.
Yeah, that's a good reference.
Thanks, added.
I think use of AI tools could have similar results to human cognitive enhancement, which I expect to basically be helpful. They'll have more problems with things that would be enhanced by "bigger brain size" rather than by "faster thought" or "reducing entropic error rates / wisdom of the crowds", because they're trained on humans. One can in general expect more success on this sort of thing by having an idea of what problem is even being solved. There's a lot of stuff that happens in philosophy departments that isn't best explained by "solving the problem" (which is under-defined anyway) and could be explained by motives like "building connections", "getting funding", "being on the good side of powerful political coalitions", etc. So psychology/sociology of philosophy seems like an approach to understanding what is even being done when humans say they're trying to solve philosophy problems.
I meant to say I'd be relatively good at it, I think it would be hard to find 20 people who are better than me at this sort of thing. The original ITT was about simulating "a libertarian" rather than "a particular libertarian", so emulating Yudkowsky specifically is a difficulty increase that would have to be compensated for. I think replicating writing style isn't the main issue, replicating the substance of arguments is, which is unfortunately harder to test. This post wasn't meant to do this, as I said.
I'm also not sure in particular what about the Yudkowskian AI risk models you think I don't understand. I disagree in places but that's not evidence of not understanding them.
I'm defining "values" as what approximate expected utility optimizers in the human brain want. Maybe "wants" is a better word. People falsify their preferences and in those cases it seems more normative to go with internal optimizer preferences.
Re indexicality, this is a "the AI knows but does not care" issue; it's about specifying it, not about there being some AI module somewhere that "knows" it. If AGI were generated partially from humans understanding how to encode indexical goals, that would be a different situation.
Re treacherous turns, I agreed that myopic agents don't have this issue to nearly the extent that long-term real-world optimizing agents do. It depends how the AGI is selected. If it's selected by "getting good performance according to a human evaluator in the real world" then at some capability level AGIs that "want" that will be selected more.
They would approximate human agency at the limit but there's both the issue of how fast they approach the limit and the degree to which they have independent agency rather than replicating human agency. There are fewer deceptive alignment problems if the long term agency they have is just an approximation of human agency.
Mostly I don't think there's much of an alignment problem for LLMs because they basically approximate human-like agency, but they aren't approaching autopoiesis; they'll lead to some state transition that is kind of like human enhancement and kind of like the invention of new tools. There are eventually capability gains from modeling things using a different, better set of concepts and agent substrate than humans have; it's just that the best current methods heavily rely on human concepts.
I don't understand what you think the pressing concerns with LLM alignment are. It seems like Paul Christiano type methods would basically work for them. They don't have a fundamentally different set of concepts and type of long-term agency from humans, so humans thinking long enough to evaluate LLMs with the help of other LLMs, in order to generate RL signals and imitation targets, seems sufficient.
Ok, I added this prediction.
Do you know if Andrew Ng or Yann LeCun has made a specific prediction that AGI won't arrive by some date? Couldn't find it through a quick search. Idk what others to include.
I'm assuming the relevant values are the optimizer ones not what people say. I discussed social institutions, including those encouraging people to endorse and optimize for common values, in the section on subversion.
Alignment with a human other than yourself could be a problem because people are to some degree selfish and, to a smaller degree, have different general principles/aesthetics about how things should be. So some sort of incentive optimization / social choice theory / etc. might help. But at least there's significant overlap between different humans' values. Though there's a pretty big existing problem of people dying; the default was already that current people would be replaced by other people.
To the extent people now don't care about the long-term future there isn't much to do in terms of long-term alignment. People right now who care about what happens 2000 years from now probably have roughly similar preferences to people 1000 years from now who aren't significantly biologically changed or cognitively enhanced, because some component of what people care about is biological.
I'm not saying it would be random so much as not very dependent on the original history of humans used to train early AGI iterations. It would have a different data history, but part of that is because of different measurements, e.g. scientific measuring tools. Different ontology means that value-laden things people might care about, like "having good relationships with other humans", are not meaningful things to future AIs in terms of their world models; they're not something the AIs would care much about by default (they aren't even modeling the world in those terms), and it would be hard to encode a utility function so they care about it despite the ontological difference.
Beat Ocarina of Time with <100 hours of playing Zelda games during training or deployment (but perhaps training on other games), no reading guides/walkthroughs/playthroughs, no severe bug exploits (those that would cut down the required time by a lot), no reward-shaping/advice specific to this game generated by humans who know non-trivial things about the game (but the agent can shape its own reward). This includes an LLM coding a program to do it. I'd say probably not by 2033.
I think it's possible human values depend on life history too, but that seems to add additional complexity and make alignment harder. If the effects of life history very much dominate those of evolutionary history, then maybe neglecting evolutionary history would be more acceptable, making the problem easier.
But I don't think default AGI would be especially path dependent on human collective life history. Human society changes over time as humans supersede old cultures (see section on subversion). AGI would be a much bigger shift than the normal societal shifts and so would drift from human culture more rapidly. Partially due to different conceptual ontology and so on. The legacy concepts of humans would be a pretty inefficient system for AGIs to keep using. Like how scientists aren't alchemists anymore, but a bigger shift than that.
(Note, LLMs still rely a lot on human concepts rather than having independent ontology and agency, so this is more about future AI systems)
I think they will still be hard at EOY 2024; as in, of this problem and the 7 held-out ones of similar difficulty, the best LLM will probably not solve 4/8.
I think I would update some on how fast LLMs are advancing, but these are not inherently very hard problems, so I don't think it would be a huge surprise; this was meant to be one of the easiest things they fail at right now. Maybe if that happens I would think things are going 1.6x as fast short term as I would have otherwise thought?
I was surprised by GPT-3/3.5 but not so much by 4. I think it adds up, on net, to an update that LLMs are advancing faster than I thought, but I haven't much changed my long-term AGI timelines, because I think AGI will involve lots of techs, not just LLMs, although LLM progress is some update about general tech progress.
I've added 6 more held-out problems for a total of 7. Agree that getting the answer without pointing out problems is the right standard.
Added
Here's the harder problem. I've also held out a third problem without posting it online.
Maybe those don't stick out to me because long timelines seem like the default hypothesis to me, and there are a lot of people stating specific, falsifiable short-timelines predictions locally, so there's a selection effect. I added Brian Chau and Robin Hanson to the list though; not sure who else (other than me) has made specific long-timelines predictions who would be good to add. Would like to add people like Yann LeCun and Andrew Ng if there are specific falsifiable predictions they made.
I've written about the anthropic question. Appreciate the update!
Something approximating utility function optimization over partial world configurations. What scope of world configuration space is optimized by effective systems depends on the scope of the task. For something like space exploration, the scope of the task is such that accomplishing it requires making trade-offs over a large sub-set of the world, and efficient ways of making these trade-offs are parametrized by a utility function over this sub-set.
What time-scale and spatial scope the "pick thoughts in your head" optimization is over depends on what scope is necessary for solving the problem. Some problems like space exploration have a necessarily high time and space scope. Proving hard theorems has a smaller spatial scope (perhaps ~none) but a higher temporal scope. Although, to the extent the distribution over theorems to be proven depends on the real world, having a model of the world might help prove them better.
Depending on how the problem-solving system is found, it might be that the easily-findable systems that solve the problem distribution sufficiently well will not only model the world but care about it, because the general consequentialist algorithms that do planning cognition to solve the problem would also plan about the world. This of course depends on the method for finding problem-solving systems, but one could imagine doing hill climbing over ways of wiring together a number of modules that include optimization and world-modeling modules, and easily-findable configurations that solve the problem well might solve it by deploying general-purpose consequentialist optimization on the world model (as I said, many possible long-term goals lead to short-term compliant problem solving as an instrumental strategy).
Again, this is relatively speculative, and depends on the AI paradigm and problem formulation. It's probably less of a problem for ML-based systems because the cognition of an ML system is aggressively gradient descended to be effective at solving the problem distribution.
The problem is somewhat intensified in cases where the problem relates to already-existing long-term agents such as in the case of predicting or optimizing with respect to humans, because the system at some capability level would simulate the external long-term optimizer. However, it's unclear how much this would constitute creation of an agent with different goals from humans.
Thanks, fixed. I believe Yudkowsky is the right spelling though.
Relevant skills for an AI economy would include mathematics, programming, ML, web development, etc.
It's hard to extrapolate out that far, but AI still has a lot of trouble with robotics (e.g. we don't have good dish-washing household robots). So there will probably be e.g. construction jobs for a while. AI is helpful for programming, but using AI to program relies on a lot of human support; I doubt programming will be entirely automated in 30 years. AI tends to have trouble with contextualized, embodied/embedded problems; it's better at decontextualized, schoolwork-like problems. For example, if you're doing sales you need to manage a set of relationships whose data is gathered over a lot of contexts, mostly not recorded, and AI is going to have more trouble parsing that context into something a transformer can operate on and give a good response to. Self-driving is an example of an embedded, though low-context, problem, and progress on that has been slower than expected, although with all the data from electric cars it's possible to train a transformer to imitate human drivers.
Oh, I thought this was mainly about x risk, especially due to the Yudkowsky reference. On the other points I think they're not a huge change either. If you predict the economy will have lots of AI in the future then you can give your child an advantage by training them in relevant skills. Also, many jobs like service jobs are likely to be around, there are lots of things AI has trouble with or which humans generally prefer humans to do. AI would increase material productivity and that would be expected to decrease cost of living as well. See Yudkowsky's post on AI unemployment.
Regarding international conflict, I haven't seen a convincing model laid out for how AI would make international conflict worse. Drone warfare is a possibility, but would tend to concentrate military power in technical countries such as Taiwan, UK, USA, and Israel. I don't know where OP lives but I don't see how it would make things worse for USA/UK children. Drones would be expected to have a better civilian casualty ratio than other methods like conventional explosives, nukes, or bio-weapons.
The post is phrased as "do you think it's a good idea to have kids given timelines?". I've said why I'm not convinced timelines should be relevant to having kids. I think if people are getting their views by copying Eliezer Yudkowsky and copying people who copy his views (which I'm not sure if OP is doing) then they should get better epistemology.
From 2018, the AI timelines section of mediangroup.org/research:
Modeling AI progress through insights
We assembled a list of major technical insights in the history of progress in AI and metadata on the discoverer(s) of each insight.
Based on this dataset, we developed an interactive model that calculates the time it would take to reach the culmination of all AI research, based on a guess at what percentage of AI discoveries have been made.
AI Insights dataset: data (json file), schema
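For a rough sense of the kind of calculation such a model can do, here is a simplified sketch (not the actual interactive model, and the insight years below are made up): assume a guessed fraction of all needed insights has been discovered so far, and that insights keep arriving at the historical average rate.

```python
def years_to_completion(discovery_years, assumed_fraction_done):
    n = len(discovery_years)                    # insights discovered so far
    span = max(discovery_years) - min(discovery_years)
    rate = n / span                             # historical average insights per year
    remaining = n / assumed_fraction_done - n   # implied insights still undiscovered
    return remaining / rate

years = [1956, 1960, 1965, 1970, 1975, 1980, 1986, 1990, 1997, 2006, 2012, 2014]
print(years_to_completion(years, 0.5))  # guess: half of the needed insights exist so far
```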
Feasibility of Training an AGI using Deep Reinforcement Learning: A Very Rough Estimate
Several months ago, we were presented with a scenario for how artificial general intelligence (AGI) may be achieved in the near future. We found the approach surprising, so we attempted to produce a rough model to investigate its feasibility. The document presents the model and its conclusions.
The usual cliches about the folly of trying to predict the future go without saying and this shouldn't be treated as a rigorous estimate, but hopefully it can give a loose, rough sense of some of the relevant quantities involved. The notebook and the data used for it can be found in the Median Group numbers GitHub repo if the reader is interested in using different quantities or changing the structure of the model.
[Download PDF](http://mediangroup.org/docs/Feasibility of Training an AGI using Deep Reinforcement Learning, A Very Rough Estimate.pdf)
(note: second has a hard-to-estimate "real life vs alphago" difficulty parameter that the result is somewhat dependent on, although this parameter can be adjusted in the model)
I recommend articles (not by me) Why I am not an AI doomer, Diminishing Returns in Machine Learning.
You're providing no evidence that superintelligence is likely in the next 30 years other than a Yudkowsky tweet. I expect that 30 years from now we will not have superintelligence (of the sort that can build the stack to run itself on, growing at a fast rate, taking over the solar system, etc.).