ronny-fernandez

Curated. Comparing model performance on tasks to the time human experts need to complete the same tasks (with fixed reliability) is worth highlighting since it helps operationalize terms like "human-level-AI" and "AI-level-of-capabilities" in general. Furthermore, by making this empirical comparison and discovering a 7-month doubling time, this work significantly reduces our uncertainty about both when to expect certain capabilities (and more impressively according to me) how to conceptualize those AI capability levels. That is, on top of reducing our uncertainty, I think this work also provides a good general format / frame for reporting general AI capabilities forecasts, eg, we have X years until models can do things that it takes human experts Y hours to do with reliability Z%.

I also appreciated the discussions this post inspired about whether we should expect the slope in log-space to change, and if so in which direction, as well as the related discussion about whether we should expect this trend to go superexponential. Interesting arguments and models were put forth in both discussions.

I hope in the future METR explores other methods for concretizing/operationalizing and forecasting AI capability levels. For example, comparing human expert reliability in general within specific task domains to model task reliability within those same domains, or comparing the time humans take to become reliable experts in certain domains to model task reliability within those same domains.

Comment by Ronny Fernandez (ronny-fernandez) on What Is The Alignment Problem? · 2025-03-05T01:39:02.421Z · LW · GW

Curated. Tackles thorny conceptual issues at the foundation of AI alignment while also revealing the weak spots of the abstractions used to do so.

I like the general strategy of trying to make progress on understanding the problem relying only on the concept of "basic agency" without having to work on the much harder problem of coming up with a useful formalism of a more full throated conception of agency, whether or not that turns out to be enough in the end.

The core point of the post: that certain kinds of goals only make sense at all given that there are certain kinds of patterns present in the environment, and that most of the problem of making sense of the alignment problem is identifying what those patterns are for the goal of "make aligned AGIs", is plausible and worthy of discussion. I also appreciate that this post elucidates the (according to me) canon-around-these-parts general patterns that render the specific goal of aligning AGIs sensible (eg, compression based analyses of optimization) and presents them as such explicitly.

The introductory examples of patterns that must be present in the general environment for certain simpler goals to make sense—especially how the absence of the pattern makes the goal not make sense—are clear and evocative. I would not be surprised if they helped someone notice that there are some ways that the canon-around-these-parts hypothesized patterns which render "align AGIs" a sensible goal are importantly flawed.

Comment by Ronny Fernandez (ronny-fernandez) on Judgements: Merging Prediction & Evidence · 2025-03-01T20:32:03.515Z · LW · GW

Curated. The problem of certain evidence is an old fundamental problem in Bayesian epistemology and this post makes a simple and powerful conceptual point tied to a standard way of trying to resolve that problem. Explaining how to think about certain evidence vs. something like Jefferey's conditionalization under the prediction market analogy of a Bayesian agent is itself valuable. Further pointing out both that:

1) You can think of evidence and hypotheses as objects of the same type signature using the analogy.

And

2) The difference between them is revealed by the analogy to be a quantitative rather than qualitative difference.

Moves me much further in the direction of thinking that radical probabilism will be a fruitful research program. Unfruitful research programs rarely reveal deep underlying similarities between seemingly very different types of fundamental objects.

Comment by Ronny Fernandez (ronny-fernandez) on Lighthaven Sequences Reading Group #7 (Tuesday 10/22) · 2024-10-27T23:56:27.305Z · LW · GW

There is! It is now posted! Sorry about the delay.

Comment by Ronny Fernandez (ronny-fernandez) on Lighthaven Sequences Reading Group #5 (Tuesday 10/08) · 2024-10-04T21:23:57.602Z · LW · GW

Hello, last time a taught a class on the basics of Bayesian epistemology. This time I will teach a class that goes a bit further. I will explain what a proper scoring rule is and we will also do some calibration training. In particular, we will play a calibration training game called two lies, a truth, and a probability. I will do this at 7:30 the same place as last time. Come by to check it out.

Comment by Ronny Fernandez (ronny-fernandez) on Lighthaven Sequences Reading Group #4 (Tuesday 10/01) · 2024-10-01T16:35:49.126Z · LW · GW

Hello! Please note that I will be giving a class called the Bayesics in Eigen hall at 7:30. Heard of Bayes's theorem but don't fully understand what the fuss is about? Want to have an intuitive as well as formal understanding of what the Bayesian framework is? Want to learn how to do bayesian updates in your head? Come and learn the Bayesics.

Comment by Ronny Fernandez (ronny-fernandez) on First Lighthaven Sequences Reading Group · 2024-09-30T21:40:48.437Z · LW · GW

Also, please note that I will be giving a class at 7:30 after the reading group called "The Bayesics" where I will teach you the basics of intuitive Bayesian epistemology and how to do Bayesian updates irl on the fly as a human. All attending the reading group are welcome to join for that as well.

Comment by Ronny Fernandez (ronny-fernandez) on Thomas Kwa's Shortform · 2024-04-08T17:08:02.602Z · LW · GW

I think you should still write it. I'd be happy to post it instead or bet with you on whether it ends up negative karma if you let me read it first.

Comment by Ronny Fernandez (ronny-fernandez) on Ronny Fernandez's Shortform · 2024-04-08T01:01:14.014Z · LW · GW

AN APOLOGY ON BEHALF OF FOOLS FOR THE DETAIL ORIENTED

Misfits, hooligans, and rabble rousers.
Provocateurs and folk who don’t wear trousers.
These are my allies and my constituents.
Weak in number yet suffused with arcane power.

I would never condone bullying in my administration.
It is true we are at times moved by unkind motivations.
But without us the pearl clutchers, hard asses, and busy bees would overrun you.
You would lose an inch of slack per generation.

Many among us appreciate your precision.
I admit there are also those who look upon it with derision.
Remember though that there are worse fates than being pranked.
You might instead have to watch your friend be “reeducated”, degraded, and spanked
On high broadband public broadcast television.

We’re not so different really.
We often share your indignation
With those who despise copulation.
Although our alliance might be uneasy
We both oppose the soul’s ablation.

So let us join as cats and dogs, paw in paw
You will persistently catalog
And we will joyously gnaw.

Comment by Ronny Fernandez (ronny-fernandez) on What's with all the bans recently? · 2024-04-05T23:21:06.082Z · LW · GW

Hey, I'm just some guy but I've been around for a while. I want to give you a piece of feedback that I got way back in 2009 which I am worried no one has given you. In 2009 I found lesswrong, and I really liked it, but I got downvoted a lot and people were like "hey, your comments and posts kinda suck". They said, although not in so many words, that basically I should try reading the sequences closely with some fair amount of reverence or something.

I did that, and it basically worked, in that I think I really did internalize a lot of the values/tastes/habits that I cared about learning from lesswrong, and learned much more so how to live in accordance with them. Now I think there were some sad things about this, in that I sort of accidentally killed some parts of the animal that I am, and it made me a bit less kind in some ways to people who were very different from me, but I am overall glad I did it. So, maybe you want to try that? Totally fair if you don't, definitely not costless, but I am glad that I did it to myself overall.

Comment by Ronny Fernandez (ronny-fernandez) on Ronny Fernandez's Shortform · 2024-04-03T22:06:09.875Z · LW · GW

I didn’t figure out that the “bow” in “rainbow” referred to a bow like as in bow and arrow, and not a bow like a bow on a frilly dress, until five minutes ago. I was really pretty confused about this since I was like 8. Somebody could’ve explained but nobody did.

Comment by Ronny Fernandez (ronny-fernandez) on MATS AI Safety Strategy Curriculum · 2024-03-26T15:24:21.862Z · LW · GW

I want to note for posterity that I tried to write this reading list somewhat impartially. That is, I have a lot of takes about a lot of this stuff, and I tried to include a lot of material that I disagree with but which I have found helpful in some way or other. I also included things that people I trust have found helpful even if I personally never found it helpful.

Comment by Ronny Fernandez (ronny-fernandez) on LessOnline (May 31—June 2, Berkeley, CA) · 2024-03-26T15:20:00.042Z · LW · GW

I believe there isn't really a deadline! You just buy tickets and then you can come. Tickets might sellout is the limiting factor.

Comment by Ronny Fernandez (ronny-fernandez) on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-22T20:04:33.504Z · LW · GW

In retrospect I think the above was insufficiently cooperative. Sorry,

Comment by Ronny Fernandez (ronny-fernandez) on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-22T11:29:00.924Z · LW · GW

To be clear, I did not think we were discussing the AI optimist post. I don't think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.

Comment by Ronny Fernandez (ronny-fernandez) on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-22T11:22:21.978Z · LW · GW

I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might've meant.

Comment by Ronny Fernandez (ronny-fernandez) on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-22T11:22:07.324Z · LW · GW

Comment by Ronny Fernandez (ronny-fernandez) on Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down? · 2023-03-10T03:02:11.220Z · LW · GW

Yeah I’m totally with you that it definitely isn’t actually next token prediction, it’s some totally other goal drawn from the dist of goals you get when you sgd for minimizing next token prediction surprise.

Comment by Ronny Fernandez (ronny-fernandez) on Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down? · 2023-03-10T00:29:36.157Z · LW · GW

I suppose I'm trying to make a hypothetical AI that would frustrate any sense of "real self" and therefore disprove the claim "all LLMs have a coherent goal that is consistent across characters". In this case, the AI could play the "benevolent sovereign" character or the "paperclip maximizer" character, so if one claimed there was a coherent underlying goal I think the best you could say about it is "it is trying to either be a benevolent sovereign or maximize paperclips". But if your underlying goal can cross such a wide range of behaviors it is practically meaningless! (I suppose these two characters do share some goals like gaining power, but we could always add more modes to the AI like "immediately delete itself" which shrinks the intersection of all the characters' goals.)

Oh I see! Yeah I think we're thinking about this really differently. Imagine there was an agent whose goal was to make little balls move according to some really diverse and universal laws of physics, for the sake of simplicity let's imagine newtonian mechanics. So ok, there's this agent that loves making these balls act as if they follow this physics. (Maybe they're fake balls in a simulated 3d world, doesn't matter as long as they don't have to follow the physics. They only follow the physics because the agent makes them, otherwise they would do some other thing.)

Now one day we notice that we can arrange these balls in a starting condition where they emulate an agent that has the goal of taking over ball world. Another day we notice that by just barely tweaking the start up we can make these balls simulate an agent that wants one pint of chocolate ice cream and nothing else. So ok, does this system really have on coherent goal? Well the two systems that the balls could simulate are really different, but the underlying intelligence making the balls act according to the physics has one coherent goal: make the balls act according to the physics.

The underlying LLM has something like a goal, it is probably something like "predict the next token as well as possible" although definitely not actually that because of inner outer alignment stuff. Maybe current LLMs just aren't mind like enough to decompose into goals and beliefs, that's actually what I think, but some program that you found with sgd to minimize surprise on tokens totally would be mind like enough, and its goal would be some sort of thing that you find when you sgd to find programs that minimize surprise on token prediction, and idk, that could be like pretty much anything. But if you then made an agent by feeding this super LLM a prompt that sets it up to simulate an agent, well that agent might have some totally different goal, and it's gonna be totally unrelated to the goals of the underlying LLMs that does the token prediction in which the other agent lives.

Comment by Ronny Fernandez (ronny-fernandez) on Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down? · 2023-03-09T21:39:43.030Z · LW · GW

So the shoggoth here is the actual process that gets low loss on token prediction. Part of the reason that it is a shoggoth is that it is not the thing that does the talking. Seems like we are onboard here.

The shoggoth is not an average over masks. If you want to see the shoggoth, stop looking at the text on the screen and look at the input token sequence and then the logits that the model spits out. That's what I mean by the behavior of the shoggoth.

On the question of whether it's really a mind, I'm not sure how to tell. I know it gets really low loss on this really weird and hard task and does it better than I do. I also know the task is fairly universal in the sense that we could represent just about any task in terms of the task it is good at. Is that an intelligence? Idk, maybe not? I'm not worried about current LLMs doing planning. It's more like I have a human connectnome and I can do one forward pass through it with an input set of nerve activations. Is that an intelligence? Idk, maybe not?

I think I don't understand your last question. The shoggoth would be the thing that gets low loss on this really weird task where you predict sequences of characters from an alphabet with 50,000 characters that have really weird inscrutable dependencies between them. Maybe it's not intelligent, but if it's really good at the task, since the task is fairly universal, I expect it to be really intelligent. I further expect it to have some sort of goals that are in some way related to predicting these tokens well.

Comment by Ronny Fernandez (ronny-fernandez) on Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down? · 2023-03-09T20:12:36.293Z · LW · GW

The shoggoth is supposed to be a of a different type than the characters. The shoggoth for instance does not speak english, it only knows tokens. There could be a shoggoth character but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.

Comment by Ronny Fernandez (ronny-fernandez) on Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence · 2022-12-05T17:07:08.929Z · LW · GW

Yeah I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations and one of them is aligned, and they are all equally capable, then the unaligned ones will be slightly slower in coming up with aligned behavior maybe, or might have some other small disadvantage.

However, in the challenge described in the post it's going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.

I think the advantage of the aligned superintelligence will only be slight because finding the action that maximizes utility function u is just as computationally difficult whether you yourself value u or not. It may not be equally hard for humans regardless of whether the human really values u, but I don't expect that to generalize across all possible minds.

Comment by Ronny Fernandez (ronny-fernandez) on A challenge for AGI organizations, and a challenge for readers · 2022-12-05T15:22:42.590Z · LW · GW

This inspired a full length post.

Comment by Ronny Fernandez (ronny-fernandez) on A challenge for AGI organizations, and a challenge for readers · 2022-12-05T11:12:03.293Z · LW · GW

Quick submission:

The first two prongs of OAI's approach seems to be aiming to get a human values aligned training signal. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search is aligned. It is not even on my current view great evidence that the model is aligned. Most intelligent agents that know that they are being optimized for some goal will behave as if they are trying to optimize that goal if they think that is the only way to be released into physics, which they will think because it is and they are intelligent. So P(they behave aligned | aligned, intelligent) ~= P(they behave aligned | unaligned, intelligent). P(aligned and intelligent) is very low since most possible intelligent models are not aligned with this very particular set of values we care about. So the chances of this working out are very low.

The basic problem is that we can only select models by looking at their behavior. It is possible to fake intelligent behavior that is aligned with any particular set of values, but it is not possible to fake behavior that is intelligent. So we can select for intelligence using incentives, but cannot select for being aligned with those incentives, because it is both possible and beneficial to fake behaviors that are aligned with the incentives you are being selected for.

The third prong of OAI's strategy seems doomed to me, but I can't really say why in a way I think would convince anybody that doesn't already agree. It's totally possible me and all the people who agree with me here are wrong about this, but you have to hope that there is some model such that that model combined with human alignment researchers is enough to solve the problem I outlined above, without the model itself being an intelligent agent that can pretend to be trying to solve the problem while secretly biding its time until it can take over the world. The above problem seems AGI complete to me. It seems so because there are some AGIs around that cannot solve it, namely humans. Maybe you only need to add some non AGI complete capabilities to humans, like being able to do really hard proofs or something, but if you need more than that, and I think you will, then we have to solve the alignment problem in order to solve the alignment problem this way, and that isn't going to work for obvious reasons.

I think the whole thing fails way before this, but I'm happy to spot OAI those failures in order to focus on the real problem. Again the real problem is that we can select for intelligent behavior, but after we select to a certain level of intelligence, we cannot select for alignment with any set of values whatsoever. Like not even one bit of selection. The likelihood ratio is one. The real problem is that we are trying to select for certain kinds of values/cognition using only selection on behavior, and that is fundamentally impossible past a certain level of capability.

Comment by Ronny Fernandez (ronny-fernandez) on What it's like to dissect a cadaver · 2022-11-10T17:41:43.258Z · LW · GW

I loved this, but maybe should come with a cw.

Comment by Ronny Fernandez (ronny-fernandez) on Counterarguments to the basic AI x-risk case · 2022-10-15T01:56:12.354Z · LW · GW

I assumed he meant the thing that most activates the face detector, but from skimming some of what people said above, seems like maybe we don't know what that is.

Comment by Ronny Fernandez (ronny-fernandez) on Counterarguments to the basic AI x-risk case · 2022-10-14T22:46:09.612Z · LW · GW

There's a nearby kind of obvious but rarely directly addressed generalized version of one of your arguments, which is that ML learns complex functions all the time, so why should human values be any different? I rarely see this discussed, and I thought the replies from Nate and the ELK related difficulties were important to have out in the open, so thanks a lot for including the face learning <-> human values learning analogy.

Comment by Ronny Fernandez (ronny-fernandez) on Counterarguments to the basic AI x-risk case · 2022-10-14T22:39:21.759Z · LW · GW

Ege Erdil gave an important disanaology between the problem of recognizing/generating a human face, and the problem of either learning human values, or learning what plans that advance human values are like. The disanalogy is that humans are near perfect human face recognizers, but we are not near perfect valuable world-state or value-advancing-plan recognizers. This means that if we trained an AI to either recognize valuable world-states or value-advancing plans, we would actually end up just training something that recognizes what we can recognize as valuable states or plans. If we trained it like we train GANs, the discriminator would fail to be able to discriminate actually valuable world states given by the generator from ones that just look really valuable to humans but actually are not valuable at all according to the humans if they understand the plan/state well enough. So we would need some sort of ELK proposal that works to get any real comfort from the face recognizing/generating <-> human values learning analogy.

Nate Soares points out on twitter that the supposedly maximally human face like images according to GAN models look like horrible monstrosities, and so following the analogy, we should expect that for similar models doing similar things for human values, the maximally valuable world state also looks like some horrible monstrosity.

Comment by Ronny Fernandez (ronny-fernandez) on What an actually pessimistic containment strategy looks like · 2022-04-13T04:55:07.506Z · LW · GW

I also find it somewhat taboo but not so much that I haven’t wondered about it.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2021-12-24T14:49:20.988Z · LW · GW

Just realized that’s not UAI. Been looking for this source everywhere, thanks.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2021-12-24T14:45:17.470Z · LW · GW

Ok I understand that although I never did find a proof that they are equivalent in UAI. If you know where it is, please point it out to me.

I still think that solomonoff induction assigns 0 to uncomputable bit strings, and I don’t see why you don’t think so.

Like the outputs of programs that never halt are still computable right? I thought we were just using a “prints something eventually oracle” not a halting oracle.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2021-12-10T23:01:52.874Z · LW · GW

Simple in the description length sense is incompatible with uncomputability. Uncomputability means there is no finite way to point to the function. That’s what I currently think, but I’m confused about you understanding all those words and disagreeing.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2021-12-10T22:07:22.628Z · LW · GW

A lot of folks seem to think that general intelligences are algorithmically simple. Paul Christiano seems to think this when he says that the universal distribution is dominated by simple consequentialists.

But the only formalism I know for general intelligences is uncomputable, which is as algorithmically complicated as you can get.

The computable approximations are plausibly simple, but are the tractable approximations simple? The only example I have of a physically realized agi seems to be very much not algorithmically simple.

Thoughts?

Comment by Ronny Fernandez (ronny-fernandez) on Visible Thoughts Project and Bounty Announcement · 2021-12-02T19:37:50.831Z · LW · GW

After trying it, I've decided that I am going to charge more like five dollars per step, but yes, thoughts included.

Comment by Ronny Fernandez (ronny-fernandez) on Visible Thoughts Project and Bounty Announcement · 2021-11-30T10:46:02.198Z · LW · GW

Can we apply for consultation as a team of two? We only want remote consultation of the resources you are offering because we are not based in bay area.

Comment by Ronny Fernandez (ronny-fernandez) on Visible Thoughts Project and Bounty Announcement · 2021-11-30T02:55:38.772Z · LW · GW

For anyone who may have the executive function to go for the 1M, I propose myself as a cheap author if I get to play as the dungeon master role, or play as the player role, but not if I have to do both. I recommend taking me as the dungeon master role. This sounds genuinely fun to me. I would happily do a dollar per step.

I can also help think about how to scale the operation, but I don’t think I have the executive function, management experience, or slack to pull it off myself.

I am Ronny Fernandez. You can contact me on fb.

Comment by Ronny Fernandez (ronny-fernandez) on “PR” is corrosive; “reputation” is not. · 2021-02-14T15:39:46.693Z · LW · GW

I came here to say something pretty similar to what Duncan said, but I had a different focus in mind.

It seems like it's easier for organizations to coordinate around PR than it is for them to coordinate around honor. People can have really deep intractable, or maybe even fundamental and faultless, disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It's much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what counts as good PR using polls.

Maybe for this reason we should expect being into PR to be a relatively stable property of organizations, while being into honor is a fragile and precious thing for an organization.

Comment by Ronny Fernandez (ronny-fernandez) on What are examples of Rationalist fable-like stories? · 2020-09-28T18:09:07.879Z · LW · GW

https://www.lesswrong.com/posts/4tke3ibK9zfnvh9sE/the-bayesian-tyrant

Comment by Ronny Fernandez (ronny-fernandez) on What does it mean to apply decision theory? · 2020-07-14T02:15:02.411Z · LW · GW

This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.

There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it's just the utility function you endorse, and that's up to you. The other black box is the space of programs that you could be. Maybe it's limited by memory, maybe it's limited by run time, or maybe it's any finite state machine with less than 10^20 states, maybe it's python programs less than 5000 characters long, some limited set of programs that takes your sensory data and motor output history as input, and returns a motor output. The limitations could be whatever, don't have to be like this.

Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only use policies that are in the limited space of bounded programs you could be. Their expected utility assignments over that space of programs is then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the agent, then that is an improvement.

Comment by Ronny Fernandez (ronny-fernandez) on An Orthodox Case Against Utility Functions · 2020-04-20T20:02:03.184Z · LW · GW

I don't think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt and are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never halting Turing machine.

Comment by Ronny Fernandez (ronny-fernandez) on Comment on Coherence arguments do not imply goal directed behavior · 2019-12-19T21:31:38.302Z · LW · GW

I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).

This definition is supposed to also explains why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of that system as an optimizer increases in optimization power.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2019-10-10T10:00:08.612Z · LW · GW

Here is an idea for a disagreement resolution technique. I think this will work best:

*with one other partner you disagree with.

*when your the beliefs you disagree about are clearly about what the world is like.

*when your the beliefs you disagree about are mutually exclusive.

*when everybody genuinely wants to figure out what is going on.

Probably doesn't really require all of those though.

The first step is that you both write out your beliefs on a shared work space. This can be a notebook or a whiteboard or anything like that. Then you each write down your credences next to each of the statements on the work space.

Now, when you want to make a new argument or present a new piece of evidence, you should ask your partner if they have heard it before after you present it. Maybe you should ask them questions about it beforehand to verify that they have not. If they have not heard it before, or had not considered it, you give it a name and write it down between the two propositions. Now you ask your partner how much they changed their credence as a result of the new argument. They write down their new credences below the ones they previously wrote down, and write down the changes next to the argument that just got added to the board.

When your partner presents a new argument or piece of evidence, be honest about whether you have heard it before. If you have not, it should change your credence some. How much do you think? Write down your new credence. I don't think you should worry too much about being a consistent Bayesian here or anything like that. Just move your credence a bit for each argument or piece of evidence you have not heard or considered, and move it more for better arguments or stronger evidence. You don't have to commit to the last credence you write down, but you should think at least that the relative sizes of all of the changes were about right. I

I think this is the core of the technique. I would love to try this. I think it would be interesting because it would focus the conversation and give players a record of how much their minds changed, and why. I also think this might make it harder to just forget the conversation and move back to your previous credence by default afterwards.

You could also iterate it. If you do not think that your partner changed their mind enough as a result of a new argument, get a new workspace and write down how much you think they should have change their credence. They do the same. Now you can both make arguments relevant to that, and incrementally change your estimate of how much they should have changed their mind, and you both have a record of the changes.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2019-08-12T23:04:53.606Z · LW · GW

Ping.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2019-08-02T22:09:17.295Z · LW · GW

If you come up with a test or set of tests that it would be impossible to actually run in practice, but that we could do in principle if money and ethics were no object, I would still be interested in hearing those. After talking to one of my friends who is enthusiastic about chakras for just a little bit, I would not be surprised if we in fact make fairly similar predictions about the results of such tests.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2019-08-02T19:17:54.189Z · LW · GW

Sometimes I sort of feel like a grumpy old man that read the sequences back in the good old fashioned year of 2010. When I am in that mood I will sometimes look around at how memes spread throughout the community and say things like "this is not the rationality I grew up with". I really do not want to stir things up with this post, but I guess I do want to be empathetic to this part of me and I want to see what others think about the perspective.

One relatively small reason I feel this way is that a lot of really smart rationalists, who are my friends or who I deeply respect or both, seem to have gotten really into chakras, and maybe some other woo stuff. I want to better understand these folks. I'll admit now that I have weird biased attitudes towards woo stuff in general, but I am going to use chakras as a specific example here.

One of the sacred values of rationality that I care a lot about is that one should not discount hypotheses/perspectives because they are low status, woo, or otherwise weird.

Another is that one's beliefs should pay rent.

To be clear, I am worried that we might be failing on the second sacred value. I am not saying that we should abandon the first one as I think some people may have suggested in the past. I actually think that rationalists getting into chakras is strong evidence that we are doing great on the first sacred value.

Maybe we are not failing on the second sacred value. I want to know whether we are or not, so I want to ask rationalists who think a lot or talk enthusiastically about chakras a question:

Do chakras exist?

If you answer "yes", how do you know they exist?

I've thought a bit about how someone might answer the second question if they answer "yes" to the first question without violating the second sacred value. I've thought of basically two ways that seems possible, but there are probably others.

One way might be that you just think that chakras literally exist in the same ways that planes literally exist, or in the way that waves literally exist. Chakras are just some phenomena that are made out of some stuff like everything else. If that is the case, then it seems like we should be able to at least in principle point to some sort of test that we could run to convince me that they do exist, or you that they do not. I would definitely be interested in hearing proposals for such tests!

Another way might be that you think chakras do not literally exist like planes do, but you can make a predictive profit by pretending that they do exist. This is sort of like how I do not expect that if I could read and understand the source code for a human mind, that there would be some parts of the code that I could point to and call the utility and probability functions. Nonetheless, I think it makes sense to model humans as optimization processes with some utility function and some probability function, because modeling them that way allows me to compress my predictions about their future behavior. Of course, I would get better predictions if I could model them as mechanical objects, but doing so is just too computationally expensive for me. Maybe modeling people as having chakras, including yourself, works sort of the same way. You use some of your evidence to infer the state of their chakras, and then use that model to make testable predictions about their future behavior. In other words, you might think that chakras are real patterns. Again it seems to me that in this case we should at least in principle be able to come up with tests that would convince me that chakras exist, or you that they do not, and I would love to hear any such proposals.

Maybe you think they exist in some other sense, and then I would definitely like to hear about that.

Maybe you do not think they exist in anyway, or make any predictions of any kind, and in that case, I guess I am not sure how continuing to be enthusiastic about thinking about chakras or talking about chakras is supposed to jive with the sacred principle that one's beliefs should pay rent.

I guess it's worth mentioning that I do not feel as averse to Duncan's color wheel thing, maybe because it's not coded as "woo" to my mind. But I still think it would be fair to ask about that taxonomy exactly how we think that it cuts the universe at its joints. Asking that question still seems to me like it should reduce to figuring out what sorts of predictions to make if it in fact does, and then figuring out ways to test them.

I would really love to have several cooperative conversations about this with people who are excited about chakras, or other similar woo things, either within this framework of finding out what sorts of tests we could run to get rid of our uncertainty, or questioning the framework I propose altogether.

Comment by Ronny Fernandez (ronny-fernandez) on Off the Cuff Brangus Stuff · 2019-08-02T11:40:24.820Z · LW · GW

Here is an idea I just thought of in an uber ride for how to narrow down the space of languages it would be reasonable to use for universal induction. To express the k-complexity of an object $O$ relative to a programing language $L$ I will write:

K_{L} (O)

Suppose we have two programing languages. The first is Python. The second is Qython, which is a lot like Python, except that it interprets the string "A" as a program that outputs some particular algorithmically large random looking character string $S$ with $K_{P y t h o n} (S) \approx 10^{15}$ . I claim that intuitively, Python is a better language to use for measuring the complexity of a hypothesis than Qython. That's the notion that I just thought of a way to formally express.

There is a well known theorem that if you are using $L_{1}$ to measure the complexity of objects, and I am using $L_{2}$ to measure the complexity of objects, then there is a constant $c_{2}$ such that for any object $O$ :

K_{L_{1}} (O) \leq K_{L_{2}} (O) + c_{2}

In words, this means that you might think that some objects are less complicated than I do, and you might think that some objects are more complicated than I do, but you won't think that any object is $c_{2}$ complexity units more complicated than I do. Intuitively, $c_{2}$ is just the length of the shortest program in $L_{1}$ that is a compiler for $L_{2} .$ So worst case scenario, the shortest program in $L_{1}$ that outputs $O$ will be a compiler for $L_{2}$ written in $L_{1}$ (which is $c_{2}$ characters long) plus giving that compiler the program in $L_{2}$ that outputs $O$ (which would be $K_{L_{2}} (O)$ characters long).

I am going to define the k-complexity of a function $f : X \to Y$ relative to a programing language as the length of the shortest program in that language such that when it is given $x$ as an input, it returns $f (x)$ . This is probably already defined that way, but jic. So say we have a function from programs in $L_{2}$ to their outputs and we call that function $C_{2}$ , then:

K_{L_{1}} (C_{2}) = c_{2}

There is also another constant:

K_{L_{2}} (C_{1}) = c_{1}

The first is the length of the shortest compiler for $L_{2}$ written in $L_{1}$ , and the second is the length of the shortest compiler for $L_{1}$ written in $L_{2}$ . Notice that these do not need to be equal. For instance, I claim that the compiler for Qython written in Python is roughly $10^{15}$ characters long, since we have to write the program that outputs $S$ in Python which by hypothesis was about $10^{15}$ characters long, and then a bit more to get it to run that program when it reads "A", and to get that functionality to play nicely with the rest of Qython however that works out. By contrast, to write a compiler for Python in Qython it shouldn't take very long. Since Qython basically is Python, it might not take any characters, but if there are weird rules in Qython for how the string "A" is interpreted when it appears in an otherwise Python-like program, then it still shouldn't take any more characters than it takes to write a Python interpreter in regular Python.

So this is my proposed method for determining which of two programming languages it would be better to use for universal induction. Say again that we are choosing between $L_{1}$ and $L_{2}$ . We find the pair of constants such that $K_{L_{1}} (C_{2}) = c_{2}$ and $K_{L_{2}} (C_{1}) = c_{1}$ , and then compare their sizes. If $c_{1}$ is less than $c_{2}$ this means that it is easier to write a compiler for $L_{1}$ in $L_{2}$ than vice versa, and so there is more hidden complexity in $L_{2}$ 's encodings than in $L_{1}$ 's, and so we should use $L_{1}$ instead of $L_{2}$ for assessing the complexity of hypotheses.

Lets say that if $K_{L_{2}} (C_{1}) < K_{L_{1}} (C_{2})$ then $L_{2}$ hides more complexity than $L_{1}$ .

A few complications:

It is probably not always decidable whether the smallest compiler for $L_{1}$ written in $L_{2}$ is smaller than the smallest compiler for $L_{2}$ written in $L_{1}$ , but this at least in principle gives us some way to specify what we mean by one language hiding more complexity than another, and it seems like at least in the case of Python vs. Qython, we can make a pretty good argument that the smallest compiler for Python written in Qython is smaller than the smallest compiler for Qython written in Python.

It is possible (I'd say probable) that if we started with some group of candidate languages and looked for languages that hide less complexity, we might run into a circle. Like the smallest compiler for $L_{1}$ in $L_{2}$ might be the same size as the smallest compiler for $L_{2}$ in $L_{1}$ but there might still be an infinite set of objects $O_{i}$ such that:

\forall_{i} K_{L_{1}} (O_{i}) \neq K_{L_{2}} (O_{i})

In this case, the two languages would disagree about the complexity of an infinite set of objects, but at least they would disagree about it by no more than the same fixed constant in both directions. Idk, seems like probably we could do something clever there, like take the average or something, idk. If we introduce an $L_{3}$ and the smallest compiler for $L_{3}$ in $L_{1}$ is larger than it is in $L_{2}$ , then it seems like we should pick $L_{1} .$

If there is an infinite set of languages that all stand in this relationship to each other, ie, all of the languages in an infinite set disagree about the complexity of an infinite set of objects and hide less complexity than any language not in the set, then idk, seems pretty damning for this approach, but at least we narrowed down the search space a bit?

Even if it turns out that we end up in a situation where we have an infinite set of languages that disagree about an infinite set of objects by exactly the same constant, it might be nice to have some upper bound on what that constant is.

In any case, this seems like something somebody would have thought of, and then proved the relevant theorems addressing all of the complications I raised. Ever seen something like this before? I think a friend might have suggested a paper that tried some similar method, and concluded that it wasn't a feasible strategy, but I don't remember exactly, and it might have been a totally different thing.

Watcha think?

Comment by Ronny Fernandez (ronny-fernandez) on Measuring Optimization Power · 2019-07-27T01:46:31.280Z · LW · GW

When I started writing this comment I was confused. Then I got myself fairly less confused I think. I am going to say a bunch of things to explain my confusion, how I tried to get less confused, and then I will ask a couple questions. This comment got really long, and I may decide that it should be a post instead.

Take a system $X$ with 8 possible states. Imagine $X$ is like a simplified Rubik's cube type puzzle. (Thinking about mechanical Rubik's cube solvers is how I originally got confused, but using actual Rubik's cubes to explain would make the math harder.) Suppose I want to measure the optimization power of two different optimizers that optimize $X$ , and share the following preference ordering:

$x_{1} \sim x_{2} \sim x_{3} \sim x_{4} \sim x_{5} \sim x_{6} < x_{7} < x_{8}$

When I let optimizer1 operate on $X$ , optimizer1 always leaves $X = x_{8} .$ So on the first time I give optimizer1 $X$ I get:

$O P = {log}_{2} (8 / 1) = 3$

If I give $X$ to optimizer1 a second time I get:

$O P (X_{1}) = {log}_{2} (8 / 1) = 3$

$O P (X_{2}) = {log}_{2} (8 / 1) = 3$

$O P = {log}_{2} (64 / 1) = O P (X_{1}) + O P (X_{2}) = 6$

This seems a bit weird to me. If we are imagining a mechanical robot with a camera that solves a Rubik's cube like puzzle, it seems weird to say that the solver gets stronger if I let it operate on the puzzle twice. I guess this would make sense for a measure of optimization pressure exerted instead of a measure of the power of the system, but that doesn't seem to be what the post was going for exactly. I guess we could fix this by dividing by the number of times we give optimizer1 $X$ , and then we would get 3 no matter how many times we let optimizer1 operate on $X .$ This would avoid the weird result that a mechanical puzzle solver gets more powerful the more times we let it operate on the puzzle.

Say that when I let optimizer2 operate on $X$ , it leaves $X = x_{7}$ with probability $p$ , and leaves $X = x_{8}$ with probability $1 - p$ , but I do not know $p .$ If I let optimizer2 operate on $X$ one time, and I observe $X = x_{7}$ , I get:

$O P = {log}_{2} (8 / 2) = 2$

If I let optimizer2 operate on $X$ three times, and I observe $X_{1} = x_{7}$ , $X_{2} = x_{7}$ , $X_{3} = x_{8}$ , then I get:

$O P (X_{1}) = {log}_{2} (8 / 2) = 2$

$O P (X_{2}) = {log}_{2} (8 / 2) = 2$

$O P (X_{3}) = {log}_{2} (8 / 1) = 3$

$O P = {log}_{2} (512 / 4) = O P (X_{1}) + O P (X_{2}) + O P (X_{3}) = 7$

Now we could use the same trick we used before and divide by the number of instances on which optimizer2 was allowed to exert optimization pressure, and this would give us 7/3. The thing is though that we do not know $p$ and it seems like $p$ is relevant to how strong optimizer2 is. We can estimate $p$ to be 2/5 using Laplace's rule, but it might be that the long run frequency of times that optimizer2 leaves $X = x_{8}$ is actually .9999 and we just got unlucky. (I'm not a frequentist, long run frequency just seemed like the closest concept. Feel free to replace "long run frequency" with the prob a solomonoff bot using the correct language assigns at the limit, or anything else reasonable.) If the long run frequency is in fact that large, then it seems like we are underestimating the power of optimizer2 just because we got a bad sample of its performance. The higher $p$ is the more we are underestimating optimizer2 when we measure its power from these observations.

So it seems then like there is another thing that we need to know besides the preference ordering of an optimizer, the measure over the target system in the absence of optimization, and the observed state of the target system, in order to perfectly measure the optimization power of an optimizer. In this case, it seems like we need to know $p .$ This is a pretty easy fix, we can just take the expectation of the optimization power as originally defined with respect to the probability of observing that state when the optimizer is present, but it is seem more complicated, and it is different.

With $o$ being the observed outcome, $U$ being the utility function of the optimization process, and $P$ being the distribution over outcomes in the absence of optimization, I took the definition in the original post to be:

${log}_{2} (\sum_{i \in {A | U (A_{i}) \geq U (o)}} P (A_{i}))$

The definition I am proposing instead is:

$E_{P (o | o p t i m i z e r)} [{log}_{2} (\sum_{i \in {A | U (A_{i}) \geq U (o)}} P (A_{i} | \sim o p t i m i z e r))]$

That is, you take the expectation of the original measure with respect to the distribution over outcomes you expect to observe in the presence of optimization. We could then call the original measure "optimization pressure exerted", and the second measure optimization power. For systems that are only allowed to optimize once, like humans, these values are very similar; for systems that might exert their full optimization power on several occasions depending on circumstance, like Rubik's cube solvers, these values will be different insofar as the system is allowed to optimize several times. We can think of the first measure as measuring the actual amount of optimization pressure that was exerted on the target system on a particular instance, and we can think of the second measure as the expected amount of optimization pressure that the optimizer exerts on the target system.

To hammer the point home, there is the amount of optimization pressure that I in fact exerted on the universe this time around. Say it was a trillion bits. Then there is the expected amount of optimization pressure that I exert on the universe in a given life. Maybe I just got lucky (or unlucky) on this go around. It could be that if you reran the universe from the point at which I was born several times while varying some things that seem irrelevant, I would on average only increase the negentropy of variables I care about by a million bits. If that were the case, then using the amount of optimization pressure that I exerted on this go around as an estimate of my optimization power in general would be a huge underestimate.

Ok, so what's up here? This seems like an easy thing to notice, and I'm sure Eliezer noticed it.

Eliezer talks about how from the perspective of deep blue, it is exerting optimization pressure every time it plays a game, but from the perspective of the programmers, creating deep blue was a one time optimization cost. Is that a different way to cache out the same thing? It still seems weird to me to say that the more times deep blue plays chess, the higher its optimization power is. It does not seem weird to me to say that the more times a human plays chess, the higher its optimization power is. Each chess game is a subsystem of the target system of that human, eg, the environment over time. Whereas it does seem weird to me to say that if you uploaded my brain and let my brain operate on the same universe 100 times, that the optimization power of my uploaded brain would be 100 times greater than if you only did this once.

This is a consequence of one of the nice properties of Eliezer's measure: OP sums for independent systems. It makes sense that if I think an optimizer is optimizing two independent systems, then when I measure their OP with respect to the first system and add it to their OP with respect to the second, I should get the same answer I would if I were treating the two systems jointly as one system. The Rubik's cube the first time I give it to a mechanical Rubik's cube solver, and the second time I give it to a mechanical Rubik's cube solver are in fact two such independent systems. So are the first time you simulate the universe after my birth and the second time. It makes sense to me that you should sum my optimization power for independent parts of the universe in a particular go around should sum to my optimization power with respect to the two systems taken jointly as one, but it doesn't make sense to me that you should just add the optimization pressure I exert on each go to get my total optimization power. Does the measure I propose here actually sum nicely with respect to independent systems? It seems like it might, but I'm not sure.

Is this just the same as Eliezer's proposal for measuring optimization power for mixed outcomes? Seems pretty different, but maybe it isn't. Maybe this is another way to extend optimization power to mixed outcomes? It does take into account that the agent might not take an action that guarantees an outcome with certainty.

Is there some way that I am confused or missing something in the original post that it seems like I am not aware of?

Comment by Ronny Fernandez (ronny-fernandez) on Measuring Optimization Power · 2019-07-26T21:41:40.071Z · LW · GW

Is there a particular formula for negentropy that OP has in mind? I am not seeing how the log of the inverse of the probability of observing an outcome as good or better than the one observed can be interpreted as the negentropy of a system with respect to that preference ordering.

Edit: Actually, I think I figured it out, but I would still be interested in hearing what other people think.

Comment by Ronny Fernandez (ronny-fernandez) on Functional Decision Theory vs Causal Decision Theory: Expanding on Newcomb's Problem · 2019-05-02T23:50:15.852Z · LW · GW

Something about your proposed decision problem seems cheaty in a way that the standard Newcomb problem doesn't. I'm not sure exactly what it is, but I will try to articulate it, and maybe you can help me figure it out.

It reminds me of two different decision problems. Actually, the first one isn't really a decision problem.

Omega has decided to give all those who two box on the standard Newcomb problem 1,000,000 usd, and all those who do not 1,000 usd.

Now that's not really a decision problem, but that's not the issue with using it to decide between decision theories. I'm not sure exactly what the issue is but it seems like it is not the decisions of the agent that make the world go one way or the other. Omega could also go around rewarding all CDT agents and punishing all FDT agents, but that wouldn't be a good reason to prefer CDT. It seems like in your problem it is not the decision of the agent that determines what their payout is, whereas in the standard newcomb problem it is. Your problem seems more like a scenario where omega goes around punishing agents with a particular decision theory than one where an agent's decisions determine their payout.

Now there's another decision problem this reminds me of.

Omega flips a coin and tell you "I flipped a coin, and I would have paid you 1,000,000 usd if it came up heads only if I predicted that you would have paid me 1,000 usd if it came up tails after having this explained to you. The coin did in fact come up tails. Will you pay me?"

In this decision problem your payout also depends on what you would have done in a different hypothetical scenario, but it does not seem cheaty to me in the same way your proposed decision problem does. Maybe that is because it depends on what you would have done in this same problem had a different part of it gone differently.

I'm honestly not sure what I am tracking when I judge whether a decision problem is cheaty or not (where cheaty just means "should be used to decide between decision theories") but I am sure that your problem seems cheaty to me right now. Do you have any similar intuitions or hunches about what I am tracking?

Comment by Ronny Fernandez (ronny-fernandez) on The Principle of Predicted Improvement · 2019-05-02T19:59:18.717Z · LW · GW

I had already proved it for two values of H before I contracted Sellke. How easily does this proof generalize to multiple values of H?

User info

Posts

Comments