Maybe it could be FLCI to avoid collision with the existing FLI.
I also think the name is off, but for a different reason. When I hear "the west" with no other context, I assume it means this, which doesn't make sense here, because the UK and FHI are very solidly part of The West. (I have not heard the "Harvard of the west" phrase and I'm guessing it's pretty darn obscure, especially to the international audience of LW.)
Feedback on the website: it's not clear to me what the difference is between LessOnline and the summer camp right after. Is the summer camp only something you go to if you're also going to Manifest? Is it the same as LessOnline but longer?
Oh, no, I'm saying it's more like 2^8 afterwards. (Obviously it's more than that but I think closer to 8 than a million.) I think having functioning vision at all brings it down to, I dunno, 2^10000. I think you would be hard pressed to name 500 attributes of mammals that you need to pay attention to to learn a new species.
We then get around the 2^8000000 problem by having only a relatively very very small set of candidate “things” to which words might be attached.
A major way that we get around this is by having hierarchical abstractions. By the time I'm learning "dog" from 1-5 examples, I've already done enormous work in learning about objects, animals, something-like-mammals, heads, eyes, legs, etc. So when you point at five dogs and say "those form a group" I've already forged abstractions that handle almost all the information that makes them worth paying attention to, and now I'm just paying attention to a few differences from other mammals, like size, fur color, ear shape, etc.
I'm not sure how the rest of this post relates to this, but it didn't feel present; maybe it's one of the umpteenth things you left out for the sake of introductory exposition.
I've noticed you using the word "chaos" a few times across your posts. I think you're using it colloquially to mean something like "rapidly unpredictable", but it does have a technical meaning that doesn't always line up with how you use it, so it might be useful to distinguish it from a couple other things. Here's my current understanding of what some things mean. (All of these definitions and implications depend on a pile of finicky math and tend to have surprising counter-examples if you don't define things just right, and definitions vary across sources.)
Sensitive to initial conditions. A system is sensitive to initial conditions if two points in its phase space will eventually diverge exponentially (at least) over time. This is one way to say that you'll rapidly lose information about a system, but it doesn't have to look chaotic. For example, say you have a system whose phase space is just the real line, and its dynamics over time is just that points get 10x farther from the origin every time step. Then, if you know the value of a point to ten decimal places of precision, after ten time steps you only know one decimal place of precision. (Although there are regions of the real line where you're still sure it doesn't reside, for example you're sure it's not closer to the origin.)
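To make the digit-loss intuition concrete, here's a quick sketch of that toy system (the names and numbers are just illustrative):

```python
# Toy system on the real line: each time step multiplies a point's
# distance from the origin by 10. Sensitive to initial conditions,
# but not chaotic -- points just shoot off to infinity.

def step(x, steps=1):
    return x * (10 ** steps)

# Two points that agree to ten decimal places...
a, b = 0.1234567890, 0.1234567891

diff_before = abs(a - b)                      # ~1e-10
# ...disagree in the first decimal place after ten steps.
diff_after = abs(step(a, 10) - step(b, 10))   # ~1.0
```

Each step costs you one decimal digit of precision about where the point is.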
Ergodic. A system is ergodic if (almost) every point in phase space will trace out a trajectory that gets arbitrarily close to every other point. This means that each point is some kind of chaotically unpredictable, because if it's been going for a while and you're not tracking it, you'll eventually end up with maximum uncertainty about where it is. But this doesn't imply sensitivity to initial conditions; there are systems that are ergodic, but where any pair of points will stay the same distance from each other. A simple example is where phase space is a circle, and the dynamics are that on each time step, you rotate each point around the circle by an irrational angle.
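The circle-rotation example is easy to check numerically (a sketch; the target point 0.777 and step counts are arbitrary choices):

```python
import math

# Rotation of the circle [0, 1) by an irrational angle.
# Ergodic: a single orbit gets arbitrarily close to every point.
# Not sensitive: the map is an isometry, so distances are preserved.

ALPHA = math.sqrt(2) - 1  # an irrational rotation angle

def rotate(x, steps=1):
    return (x + steps * ALPHA) % 1.0

def circle_dist(x, y):
    d = abs(x - y) % 1.0
    return min(d, 1.0 - d)

x, y = 0.2, 0.3
# The distance between the two points never changes...
d0 = circle_dist(x, y)
d1000 = circle_dist(rotate(x, 1000), rotate(y, 1000))

# ...yet a single orbit fills the circle densely: within 2000 steps
# it passes within 0.01 of an arbitrary target point like 0.777.
closest = min(circle_dist(rotate(x, n), 0.777) for n in range(1, 2000))
```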
Chaos. The formal characterization that people assign to this word was an active research topic for decades, but I think it's mostly settled now. My understanding is that it essentially means this:
- Your system has at least one point whose trajectory is ergodic, that is, it will get arbitrarily close to every other point in the phase space
- For every natural number n, there is a point in the phase space whose trajectory is periodic with period n. That is, after n time steps (and not before), it will return back exactly where it started. (Further, these periodic points are "dense", that is, every point in phase space has periodic points arbitrarily close to it).
The reason these two criteria yield (colloquially) chaotic behavior is, I think, reasonably intuitively understandable. Take a random point in its phase space. Assume it isn't one with a periodic trajectory (which will be true with "probability 1"). Instead it will be ergodic. That means it will eventually get arbitrarily close to all other points. But consider what happens when it gets close to one of the periodic trajectories; it will, at least for a while, act almost as though it has that period, until it drifts sufficiently far away. (This is using an unstated assumption that the dynamics of the systems have a property where nearby points act similarly.) But it will eventually do this for every periodic trajectory. Therefore, there will be times when it's periodic very briefly, and times when it's periodic for a long time, et cetera. This makes it pretty unpredictable.
There are also connections between the above. You might have noticed that my example of a system that was sensitive to initial conditions but not ergodic or chaotic relied on having an unbounded phase space, where the two points both shot off to infinity. I think that if you have sensitivity to initial conditions and a bounded phase space, then you generally also have ergodic and chaotic behavior.
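The standard concrete example of that combination is the logistic map, which has a bounded phase space, sensitivity to initial conditions, and provably chaotic dynamics (a sketch; the starting points and step counts are arbitrary):

```python
# The logistic map x -> 4x(1-x) on the bounded interval [0, 1].

def logistic(x, steps=1):
    for _ in range(steps):
        x = 4.0 * x * (1.0 - x)
    return x

# Two starting points agreeing to ten decimal places decorrelate
# completely within a few dozen steps, yet both trajectories stay
# inside [0, 1] forever.
a, b = 0.1234567890, 0.1234567891
fa, fb = logistic(a, 50), logistic(b, 50)
max_gap = max(abs(logistic(a, n) - logistic(b, n)) for n in range(1, 60))

# It also has periodic points, e.g. x = 0.75 is a fixed point:
# 4 * 0.75 * 0.25 == 0.75 (exactly, even in floating point).
```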
Anyway, I think "chaos" is a sexy/popular term to use to describe vaguely unpredictable systems, but almost all of the time you don't actually need to rely on the full technical criteria of it. I think this could be important for not leading readers into red-herring trails of investigation. For example, all of standard statistical mechanics only needs ergodicity.
Has anyone checked out Nassim Nicholas Taleb's book Statistical Consequences of Fat Tails? I'm wondering where it lies on the spectrum from textbook to prolonged opinion piece. I'd love to read a textbook about the title.
Just noticing that every post has at least one negative vote, which feels interesting for some reason.
The e-ink tablet market has really diversified recently. I'd recommend that anyone interested look around at the options. My impression is that the Kindle Scribe is one of the least good ones (which doesn't mean it's bad).
Here's the arxiv version of the paper, with a bunch more content in appendices.
And, since I can't do everything: what popular platforms shouldn't I prioritize?
I think cross-posting between twitter, mastodon and bluesky would be pretty easy. And it would let you gather your own data on which platforms are worth continuing.
I looked at these several months ago and unfortunately recommend neither. Pearl's Causality is very dense, and not really a good introduction. The Primer is really egregiously riddled with errors; there seems to have been some problem with the publisher. And on top of that, I just found it not very well written.
I don't have a specific recommendation, but I believe that at this point there are a bunch of statistics textbooks that competently discuss the essential content of causal modelling; maybe check the reviews for some of those on amazon.
One way that the analogy with code doesn't carry over is that in math, you often can't even begin to use a theorem if you don't know a lot of detail about what the objects in the theorem mean, and often knowing what they mean is pretty close to knowing why the theorems you're building on are true. Being handed a theorem is less like being handed an API and more like being handed a sentence in a foreign language. I can't begin to make use of the information content in the sentence until I learn what every symbol means and how the grammar works, and at that point I could have written the sentence myself.
I'd recommend porting it over as a sequence instead of one big post (or maybe just port the first chunk as an intro post?). LW doesn't have a citation format, but you can use footnotes for it (and you can use the same footnote number in multiple places).
I had a side project to get better at research in 2023. I found very few resources that were actually helpful to me. But here are some that I liked.
- A few posts by Holden Karnofsky on Cold Takes, especially Useful Vices for Wicked Problems and Learning By Writing.
- Diving into deliberate practice. Most easily read is the popsci book Peak. This book emphasizes "mental representations", which I find the most useful part of the method, though I think it's also the least supported by the science.
- The popsci book Grit.
- The book Ultralearning. Extremely skimmable, large collection of heuristics that I find essential for the "lean" style of research.
- Reading a scattering of historical accounts of how researchers did their research, and how it came to be useful. (E.g. Newton, Einstein, Erdős, Shannon, Kolmogorov, and a long tail of less big names.)
(Many resources were not helpful for me for reasons that might not apply to others; I was already doing what they advised, or they were about how to succeed inside academia, or they were about emotional problems like lack of confidence or burnout. But, I think mostly I failed to find good resources because no one knows how to do good research.)
Finally, I want to note an aspect of the discussion in the report that makes me quite uncomfortable: namely, it seems plausible to me that in addition to potentially posing existential risks to humanity, the sorts of AIs discussed in the report might well be moral patients in their own right.
I strongly appreciate this paragraph for stating this concern so emphatically. I think this possibility is strongly under-represented in the AI safety discussion as a whole.
I agree there's a core principle somewhere around the idea of "controllable implies understandable". But when I think about this with respect to humans studying biology, then there's another thought that comes to my mind; the things we want to control are not necessarily the things the system itself is controlling. For example, we would like to control the obesity crisis (and weight loss in general) but it's not clear that the biological system itself is controlling that. It almost certainly was successfully controlling it in the ancestral environment (and therefore it was understandable within that environment) but perhaps the environment has changed enough that it is now uncontrollable (and potentially not understandable). Cancer manages to successfully control the system in the sense of causing itself to happen, but that doesn't mean that our goal, "reliably stopping cancer" is understandable, since it is not a way that the system is controlling itself.
This mismatch seems pretty evidently applicable to AI alignment.
And perhaps the "environment" part is critical. A system being controllable in one environment doesn't imply it being controllable in a different (or broader) environment, and thus guaranteed understandability is also lost. This feels like an expression of misgeneralization.
Looking back at Flint's work, I don't agree with this summary.
Ah, sorry, I wasn't intending for that to be a summary. I found Flint's framework very insightful, but after reading it I sort of just melded it into my own overall beliefs and understanding around optimization. I don't think he intended it to be a coherent or finished framework on its own, so I don't generally try to think "what does Flint's framework say about X?". I think its main influence on me was the whole idea of using dynamical systems and phase space as the basis for optimization. So for example:
In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent.
I would say that working in the framework of dynamical systems is what lets one get a natural baseline against which to measure optimization, by comparing a given trajectory with all possible trajectories.
I think I could have some more response/commentary about each of your bullet points, but there's a background overarching thing that may be more useful to prod at. I have a clear (-feeling-to-me) distinction between "optimization" and "agent", which doesn't seem to be how you're using the words. The dynamical systems + Yudkowsky measure perspective is a great start on capturing the optimization concept, but it is agnostic about (my version of) the agent concept (except insofar as agents are a type of optimizer). It feels to me like the idea of endorsement you're developing here is cool and useful and is... related to optimization, but isn't the basis of optimization. So I agree that e.g. "endorsement" is closer to alignment, but also I don't think that "optimization" is supposed to be all that close to alignment; I'd reserve that for "agent". I think we'll need a few levels of formalization in agent foundations, and you're working toward a different level than those, and so these ideas aren't in conflict.
Breaking that down just a bit more; let's say that "alignment" refers to aligning the intentional goals of agents. I'd say that "optimization" is a more general phenomenon where some types of systems tend to move their state up an ordering; but that doesn't mean that it's "intentional", nor that that goal is cleanly encoded somewhere inside the system. So while you could say that two optimizing systems "are more aligned" if they move up similar state orderings, it would be awkward to talk about aligning them.
(My notion of) optimization has its own version of the thing you're calling "Vingean", which is that if I believe a process optimizes along a certain state ordering, but I have no beliefs about how it works on the inside, then I can still at least predict that the state will go up the ordering. I can predict that the car will arrive at the airport even though I don't know the turns. But this has nothing to do with the (optimization) process having beliefs or doing reasoning of any kind (which I think of as agent properties). For example I believe that there exists an optimization process such that mountains get worn down, and so I will predict it to happen, even though I know very little about the chemistry of erosion or rocks. And this is kinda like "endorsement", but it's not that the mountain has probability assignments or anything.
In fact I think it's just a version of what makes something a good abstraction; an abstraction is a compact model that allows you to make accurate predictions about outcomes without having to predict all intermediate steps. And all abstractions also have the property that if you have enough compute/etc. then you can just directly calculate the outcome based on lower-level physics, and don't need the abstraction to predict the outcome accurately.
I think that was a longer-winded way to say that I don't think your concepts in this post are replacements for the Yudkowsky/Flint optimization ideas; instead it sounds like you're saying "Assume the optimization process is of the kind that has beliefs and takes actions. Then we can define 'endorsement' as follows; ..."
What's your preferred response/solution to ~"problems"(?) of events that have probability zero but occur nevertheless
My impression is that people have generally agreed that this paradox is resolved (=formally grounded) by measure theory. I know enough measure theory to know what it is but haven't gone out of my way to explore the corners of said paradoxes.
But you might be asking me about it in the framework of Yudkowsky's measure of optimization. Let's say the states are the real numbers in [0, 1] and the relevant ordering is the same as the one on the real numbers, and we're using the uniform measure over it. Then, even though the probability of getting any specific real number is zero, the probability mass we use to calculate bits of optimization power is all the probability mass below that number. In that case, all the resulting numbers would imply finite optimization power. ... except if we got the result that was exactly the number 0. But in that case, that would actually be infinitely surprising! And so the fact that the measure of optimization returns infinity bits reflects intuition.
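As a sketch of that calculation (assuming, per the example, that lower states are better, so the mass at or below the achieved state is what gets counted):

```python
import math

# Yudkowsky-style optimization power for the uniform measure on [0, 1]
# with "lower is better": the power of reaching state x is -log2 of
# the probability mass at or below x.

def optimization_bits(x):
    if x == 0.0:
        return math.inf  # landing exactly on 0 is infinitely surprising
    return -math.log2(x)

bits_half = optimization_bits(0.5)        # 1 bit: beat half the mass
bits_tiny = optimization_bits(2.0 ** -20) # 20 bits
bits_zero = optimization_bits(0.0)        # infinite optimization power
```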
It's (probably) true that our physical reality has only finite precision
I'm also not a physicist but my impression is that physicists generally believe that the world does actually have infinite precision.
I'd also guess that the description length of (a computable version of) the standard model as-is (which includes infinite precision because it uses the real number system) has lower K-complexity than whatever comparable version of physics where you further specify a finite precision.
I don't understand this part. How does probability mass constrain how "bad" the states can get? Could you rephrase this maybe?
The probability mass doesn't constrain how "bad" the states can get; I was saying that the fact that there's only 1 unit of probability mass means that the amount of probability mass on lower states is bounded (by 1).
Restricting the formalism to orderings means that there is no meaning to how bad a state is, only a meaning to whether it is better or worse than another state. (You can additionally decide on a measure of how bad, as long as it's consistent with the ordering, but we don't need that to analyze (this concept of) optimization.)
I'll also note that I think what you're calling "Vingean agency" is a notable sub-type of optimization process that you've done a good job at analyzing here. But it's definitely not the definition of optimization or agency to me. For example, in the post you say
We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity.
This doesn't feel true to me (in the carve-nature-at-its-joints sense). I think children are strongly agents, even though I do everything more competently than they do.
I have some comments on the arbitrariness of the "baseline" measure in Yudkowsky's measure of optimization.
Sometimes, I am surprised in the moment about how something looks, and I quickly update to believing there's an optimization process behind it. For example, if I climb a hill expecting to see a natural forest, and then instead see a grid of suburban houses or an industrial logging site, I'll immediately realize that there's no way this is random and instead there's an optimization process that I wasn't previously modelling. In cases like this, I think Yudkowsky's measure accurately captures the measure of optimization.
Alternatively, sometimes I'm thinking about optimization processes that I've always known are there, and I'm wondering to myself how powerful they are. For example, sometimes I'll be admiring how competent one of my friends is. To measure their competence, I can imagine what a "typical" person would do in that situation, and check the Yudkowsky measure as a diff. I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.
While it may be clear how to do this in many cases, it isn't clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an "agency detector" at some point; you have to be able to draw a circle around the agent in order to selectively forget it.
I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this. The potential optimization process will be in that average, but it will be washed out by all the other trajectories (assuming most trajectories don't go up the ordering nearly as much; if they did, then your observed process would rightly not register as an optimizer).
(Obviously this is not helpful for e.g. looking into a neural network and figuring out whether it contains something that will powerfully optimize the world around you. But that's not what this level of the framework is for; this level is for deciding what it even means for something to powerfully optimize something around you.)
Of course, to run this comparison you need a "baseline" of a measure over every possible trajectory. But I think this is just reflecting the true nature of optimization; I think it's only meaningful relative to some other expectation.
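The comparison I'm describing can be sketched as a Monte Carlo estimate (all the specifics here are made up: the null dynamics is an unbiased random walk, and "higher is better" is the ordering):

```python
import random

# Baseline: sample many trajectories under null dynamics, then ask
# what fraction do at least as well as the observed trajectory.

def random_trajectory(start, steps, rng):
    x = start
    for _ in range(steps):
        x += rng.choice([-1, 1])  # unbiased random walk
    return x

rng = random.Random(0)
baseline = [random_trajectory(0, 100, rng) for _ in range(10_000)]

observed_final = 40  # an observed trajectory that climbed to +40

# If almost no baseline trajectory gets this high, the observed
# process rightly registers as an optimizer.
frac = sum(b >= observed_final for b in baseline) / len(baseline)
```

No "agency detector" is needed; the observed process is just compared against the expectation over all trajectories.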
I feel like there's a key concept that you're aiming for that isn't quite spelled out in the math.
I remember reading somewhere that there's a typically unmentioned distinction between "Bayes' theorem" and "Bayesian inference". Bayes' theorem is the statement that P(A|B) = P(B|A)P(A)/P(B), which is true from the axioms of probability theory for any events A and B whatsoever. Notably, it has nothing to do with time, and it's still true even after you learn B. On the other hand, Bayesian inference is the premise that your beliefs should change in accordance with Bayes' theorem. Namely, that P_new(A) = P_old(A|o), where o is an observation. That is, when you observe something, you wholesale replace your probability space with a new probability space which is calculated by applying the conditional (via Bayes' theorem).
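The distinction can be made concrete with a toy two-hypothesis distribution (the numbers are arbitrary):

```python
# Bayes' theorem is a timeless identity within one distribution;
# Bayesian inference *replaces* the distribution after an observation.

prior = {"H": 0.2, "not-H": 0.8}        # prior over hypotheses
likelihood = {"H": 0.9, "not-H": 0.3}   # P(o | h) for an observation o

# Bayes' theorem: P(H | o) = P(o | H) P(H) / P(o). This identity holds
# inside the prior whether or not o has actually been observed.
p_o = sum(likelihood[h] * prior[h] for h in prior)
p_H_given_o = likelihood["H"] * prior["H"] / p_o

# Bayesian inference: upon observing o, adopt a *new* distribution
# defined by P_new(h) = P_old(h | o).
posterior = {h: likelihood[h] * prior[h] / p_o for h in prior}
```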
And I think there's a similar thing going on with your definitions of endorsement. While trying to understand the equations, I found it easier to visualize P_1 and P_2 as two separate distributions on the same sample space, where endorsement is simply a consistency condition. For belief consistency, you would just say that P_1 endorses P_2 on event X if P_1(X) = P_2(X).
But that isn't what you wrote; instead you wrote this thing with conditioning on a quoted thing. And of course, the thing I said is symmetrical between P_1 and P_2, whereas your concept of endorsement is not symmetrical. It seems like the intention is that P_1 "learns" or "hears about" P_2's belief, and then updates (in the above Bayesian inference sense) to have a new P_1 that has the consistency condition with P_2.
By putting P_2's belief in the conditional, you're saying that it's an event on the sample space, a thing with the same type as X. And it feels like that's conceptually correct, but also kind of the hard part. It's as if P_1 is modelling P_2 as an agent embedded into the sample space.
You guys could compute a kind of Page Rank for LW posts.
Yeah, So8res wrote that post after reading this one and having a lot of discussion in the comments. That said, my memory was that people eventually convinced him that the title idea in his post was wrong.
[This is a self-review because I see that no one has left a review to move it into the next phase. So8res's comment would also make a great review.]
I'm pretty proud of this post for the level of craftsmanship I was able to put into it. I think it embodies multiple rationalist virtues. It's a kind of "timeless" content, and is a central example of the kind of content people want to see on LW that isn't stuff about AI.
It would also look great printed in a book. :)
You can also add the PIBBSS Speaker Events to your calendar through this link.
FYI this link redirects to a UC Berkeley login page.
Two years later, this is still pretty much how my sleep works!
- Still aging
- Still do regular morning climbing with my friend twice a week
- Still hang out with my partner before bed most nights
- Still maintain control through time changes
I never went back into software, so I never again had a 9-5 job. Instead, I'm an independent researcher. In order to further motivate waking up for that, I schedule body-doubling with people on most days of the week, usually starting between 7 and 8:30am. I rarely use melatonin.
My current biggest sleep problem is that, if I don't have climbing, body-doubling, or something else scheduled early, then I usually stay in bed for a while, awake but unproductive. Hm, I haven't used Focusmate in a long time either. Maybe I should try that again?
Isn't the shortform feature perfect for this?
[This is a review for the whole sequence.]
I think of LessWrong as a place whose primary purpose is and always has been to develop the art of rationality. One issue is that this mission tends to attract a certain kind of person -- intelligent, systematizing, deprioritizing social harmony, etc -- and that can make it harder for other kinds of people to participate in the development of the art of rationality. But rationality is for everyone, and ideally the art would be equally accessible to all.
This sequence has many good traits, but one of the most distinguishing is that it is wholly legible and welcoming to people not of the aforementioned kind. In a world where huge efforts of cooperation will be needed to ensure a good future, I think this trait makes this sequence worthy of being further showcased!
This paper, like others from Anthropic, is exemplary science and exceptional science communication. The authors are clear, precise, and thorough. It is evident that their research motivation is to solve a problem, and not to publish a paper, and that their communication motivation is to help others understand, and not to impress.
This post expresses an important idea in AI alignment that I have essentially believed for a long time, and which I have not seen expressed elsewhere. (I think a substantially better treatment of the idea is possible, but this post is fine, and you get a lot of points for being the only place where an idea is being shared.)
Earlier this year I spent a lot of time trying to understand how to do research better. This post was one of the few resources that actually helped. It described several models that I resonated with, but which I had not read anywhere else. It essentially described a lot of the things I was already doing, and this gave me more confidence in deciding to continue doing full time AI alignment research. (It also helps that Karnofsky is an accomplished researcher, and so his advice has more weight!)
I'm curious what you would estimate the cost of producing the books to be. That is, how much would someone have to donate to pay for Lightcone to produce the books?
I'd like to gain clarity on what we think the relationship should be between AI alignment and agent foundations. To me, the relationship is 1) historical, in that the people bringing about the field of agent foundations are coming from the AI alignment community and 2) motivational, in that the reason they're investigating agent foundations is to make progress on AI alignment, but not 3) technical, in that I think agent foundations should not be about directly answering questions of how to make the development of AI beneficial to humanity. I think it makes more sense to pursue agent foundations as a quest to understand the nature of agents as a technical concept in its own right.
If you are a climate scientist, then you are very likely in the field in order to help humanity reduce the harms from climate change. But on a day-to-day basis, the thing you are doing is trying to understand the underlying patterns and behavior of the climate as a physical system. It would be unnatural to e.g. exclude papers from climate science journals on the grounds of not being clearly applicable to reducing climate change.
For agent foundations, I think some of the core questions revolve around things like, how does having goals work? How stable are goals? How retargetable are goals? Can we make systems that optimize strongly but within certain limitations? But none of those questions are directly about aligning the goals with humanity.
There's also another group of questions like, what are humans' goals? How can we tell? How complex and fragile are they? How can we get an AI system to imitate a human? Et cetera. But I think these questions come from a field that is not agent foundations.
There should certainly be constant and heavy communication between these fields. And I also think that even individual people should be thinking about the applicability questions. But they're somewhat separate loops. A climate scientist will have an outer loop that does things like, chooses a research problem because they think the answer might help reduce climate change, and they should keep checking on that belief as they perform their research. But while they're doing their research, I think they should generally be using an inner loop that just thinks, "huh, how does this funny 'climate' thing work?"
FWIW I saw "Anti-MATS" in the sidebar and totally assumed that meant that someone in the dialogue was arguing that the MATS program was bad (instead of discussing the idea of a program that was like MATS but opposite).
Agent foundations is studying a strange alternate world where agents know the source code to themselves and the universe, where perfect predictors exist and so on
I just want to flag that this is very much not a defining characteristic of agent foundations! Some work in agent foundations will make assumptions like this, some won't -- I consider it a major goal of agent foundations to come up with theories that do not rely on assumptions like this.
(Or maybe you just meant those as examples?)
Maybe, "try gaining skill somewhere with lower standards"?
Somehow I read "non-results" in the title and unthinkingly interpreted it as "we now have more data that says inositol does nothing". Maybe the title could be "still not enough data on inositol"?
I wonder if we couldn't convert this into some kind of community wiki, so that the people represented in it can provide endorsed representations of their own work, and so that the community as a whole can keep it updated as time goes on.
Obviously there's the problem where you don't want random people to be able to put illegitimate stuff on the list. But it's also hard to agree on a way to declare legitimacy.
...Maybe we could have a big post like lukeprog's old textbook post, where researchers can make top-level comments describing their own research? And then others can up- or down-vote the comments based on the perceived legitimacy of the research program?
Honestly this isn't that long, I might say to re-merge it with the main post. Normally I'm a huge proponent of breaking posts up smaller, but yours is literally trying to be an index, so breaking a piece off makes it harder to use.
Here's my guess as to how the universality hypothesis a.k.a. natural abstractions will turn out. (This is not written to be particularly understandable.)
- At the very "bottom", or perceptual level of the conceptual hierarchy, there will be a pretty straightforward, objective set of concepts. Think the first layer of CNNs in image processing, the neurons in the retina/V1, letter frequencies, how to break text strings into words. There's some parameterization here, but the functional form will be clear (like having a basis of n vectors in R^n, but it (almost) doesn't matter which vectors).
- For a few levels above that, it's much less clear to me that the concepts will be objective. Curve detectors may be universal, but the way they get combined is less obviously objective to me.
- This continues until we get to a middle level that I'd call "objects". I think it's clear that things like cats and trees are objective concepts. Sufficiently good language models will all share concepts that correspond to a bunch of words. This level is very much due to the part where we live in this universe, which tends to create objects, and on earth, which has a biosphere with a bunch of mid-level complexity going on.
- Then there will be another series of layers that are less obvious. Partly these levels are filled with whatever content is relevant to the system. If you study cats a lot then there is a bunch of objectively discernible cat behavior. But it's not necessary to know that to operate in the world competently. Rivers and waterfalls will be a level 3 concept, but the details of fluid dynamics are in this level.
- Somewhere around the top level of the conceptual hierarchy, I think there will be kind of a weird split. Some of the concepts up here will be profoundly objective; things like "and", mathematics, and the abstract concept of "object". Absolutely every competent system will have these. But then there will also be this other set of concepts that I would map onto "philosophy" or "worldview". Humans demonstrate that you can have vastly different versions of these very high-level concepts, given very similar data, each of which is in some sense a functional local optimum. If this also holds for AIs, then that seems very tricky.
- Actually my guess is that there is also a basically objective top-level of the conceptual hierarchy. Humans are capable of figuring it out but most of them get it wrong. So sufficiently advanced AIs will converge on this, but it may be hard to interact with humans about it. Also, some humans' values may be defined in terms of their incorrect worldviews, leading to ontological crises with what the AIs are trying to do.
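As an aside on the first bullet's parenthetical ("a basis of n vectors in R^n, but it (almost) doesn't matter which vectors"): here's a quick numpy sketch of that point. Almost any n randomly chosen vectors in R^n are linearly independent, hence a basis, so the same point can be represented in either of two independently sampled bases.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Two independently sampled sets of n vectors in R^n. With probability 1
# each set is linearly independent, hence a basis.
B1 = rng.normal(size=(n, n))
B2 = rng.normal(size=(n, n))

x = rng.normal(size=n)            # an arbitrary point
coords1 = np.linalg.solve(B1, x)  # coordinates of x in basis B1
coords2 = np.linalg.solve(B2, x)  # coordinates of x in basis B2

# Different coordinates, same underlying point: the choice of basis
# (almost) doesn't matter, only that you have one.
assert np.allclose(B1 @ coords1, x)
assert np.allclose(B2 @ coords2, x)
```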
Note that we are interested in people at all levels of seniority, including graduate students,
If I imagine being an undergraduate student who's interested, then this sentence leaves me unclear on whether I should fill it out.
I can imagine some ways that the universe might escape heat death, but I seriously doubt that Kurzweil is referring to anything concrete that has technical merit. Under anything resembling normal laws of physics, computers need negentropy to run calculations, and they cannot just "decide" to keep on computing.
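To put a number on the negentropy point: Landauer's principle says erasing one bit costs at least k_B·T·ln 2 of free energy, so under anything like normal physics a finite energy budget bounds the total number of irreversible bit operations, however patient the computer is. (Reversible computing complicates this story but doesn't make computation thermodynamically free.) A back-of-the-envelope sketch, using ~2.7 K as the present CMB temperature:

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K (exact in SI since 2019)

def landauer_limit(temperature_kelvin: float) -> float:
    """Minimum energy in joules to erase one bit at the given temperature."""
    return K_B * temperature_kelvin * math.log(2)

room = landauer_limit(300.0)  # ~2.9e-21 J per bit at room temperature
cmb = landauer_limit(2.7)     # ~2.6e-23 J per bit at the current CMB temperature

# A fixed free-energy budget therefore buys only finitely many bit erasures,
# no matter how slowly you compute.
budget_joules = 1.0
print(budget_joules / cmb)  # roughly 4e22 bit erasures per joule
```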
I would love to try having dialogues with people about Agent Foundations! I'm on the vaguely-pro side, and want to have a better understanding of people on the vaguely-con side; either people who think it's not useful, or people who are confused about what it is and why we're doing it, etc.
I like this post for the way it illustrates how the probability distribution over blocks of strings changes as you increase block length.
Otherwise, I think the representation of other ideas and how they relate to it is not very accurate, and might mislead readers about the consensus among academics.
As an example, strings in which the frequency of substrings converges to the uniform distribution are called "normal". The idea that this could be the definition of a random string was a big debate through the first half of the 20th century, as people tried to put probability theory on solid foundations. But you can have a fixed, deterministic program that generates normal strings! And so people generally rejected this idea as the definition of random. Algorithmic information theory uses the definition of Martin-Löf randomness, which is that an (infinite) string is random if it can't be compressed by any program (with a bunch of subtleties and distinctions in there).
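The classic concrete example here is Champernowne's constant, 0.123456789101112..., which is provably normal in base 10 despite being generated by a trivially deterministic program. A quick sketch checking that its digit frequencies are already near uniform after a modest prefix:

```python
from collections import Counter

# Champernowne's constant: concatenate 1, 2, 3, ... in base 10.
# Provably normal, yet produced by an utterly deterministic program,
# which is why "normal" fails as a definition of "random".
digits = "".join(str(n) for n in range(1, 100000))

counts = Counter(digits)
total = len(digits)
freqs = {d: counts[d] / total for d in "0123456789"}

# Each digit's frequency is already close to 0.1 (convergence is slow,
# and '0' lags a bit since numbers never start with it).
for d in sorted(freqs):
    print(d, round(freqs[d], 4))
```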
- Utility functions might already be the true name - after all, they do directly measure optimisation, while probability doesn't directly measure information.
- The true name might have nothing to do with utility functions - Alex Altair has made the case that it should be defined in terms of preference orderings instead.
My vote here is for something between "Utility functions might already be the true name" and "The true name might have nothing to do with utility functions".
It sounds to me like you're chasing an intuition that is validly reflecting one of nature's joints, and that that joint is more or less already named by the concept of "utility function" (but where further research is useful).
And separately, I think there's another natural joint that I (and Yudkowsky and others) call "optimization", and this joint has nothing to do with utility functions. Or more accurately, maximizing a utility function is an instance of optimization, but has additional structure.
FWIW I don't think it's honest to title this "breakthroughs". It's almost the opposite, a list of incremental progress.
Unrelatedly, why not make this a cross-post rather than a link-post?
I think it would help a lot to provide people with examples. For example, here
Many machine learning research agendas for safety are investigating issues identified years earlier by foundational research, and are at least partly informed by that research.
You say that, but then don't provide any examples. I imagine readers just not thinking of any, and then moving on without feeling any more convinced.
Overall, I think that it's hard for people to believe agent foundations will be useful because they're not visualizing any compelling concrete path where it makes a big difference.