Posts

Work with me on agent foundations: independent fellowship 2024-09-21T13:59:16.706Z
Quick look: applications of chaos theory 2024-08-18T15:00:07.853Z
[Talk transcript] What “structure” is and why it matters 2024-07-25T15:49:00.844Z
A simple model of math skill 2024-07-21T18:57:33.697Z
Empirical vs. Mathematical Joints of Nature 2024-06-26T01:55:22.858Z
New intro textbook on AIXI 2024-05-11T18:18:50.945Z
Towards a formalization of the agent structure problem 2024-04-29T20:28:15.190Z
Raemon's Deliberate (“Purposeful?”) Practice Club 2023-11-14T18:24:19.335Z
My research agenda in agent foundations 2023-06-28T18:00:27.813Z
Why don't quantilizers also cut off the upper end of the distribution? 2023-05-15T01:40:50.183Z
Draft: Inferring minimizers 2023-04-01T20:20:48.676Z
Draft: Detecting optimization 2023-03-29T20:17:46.642Z
Draft: The optimization toolbox 2023-03-28T20:40:38.165Z
Draft: Introduction to optimization 2023-03-26T17:25:55.093Z
Dealing with infinite entropy 2023-03-01T15:01:40.400Z
Top YouTube channel Veritasium releases video on Sleeping Beauty Problem 2023-02-11T20:36:57.089Z
My first year in AI alignment 2023-01-02T01:28:03.470Z
How can one literally buy time (from x-risk) with money? 2022-12-13T19:24:06.225Z
Consider using reversible automata for alignment research 2022-12-11T01:00:24.223Z
A dynamical systems primer for entropy and optimization 2022-12-10T00:13:13.984Z
How do finite factored sets compare with phase space? 2022-12-06T20:05:54.061Z
Alex_Altair's Shortform 2022-11-27T18:59:05.193Z
When do you visualize (or not) while doing math? 2022-11-23T20:15:20.885Z
"Normal" is the equilibrium state of past optimization processes 2022-10-30T19:03:19.328Z
Introduction to abstract entropy 2022-10-20T21:03:02.486Z
Deliberate practice for research? 2022-10-08T03:45:21.773Z
How dangerous is human-level AI? 2022-06-10T17:38:27.643Z
I'm trying out "asteroid mindset" 2022-06-03T13:35:48.614Z
Request for small textbook recommendations 2022-05-25T22:19:56.549Z
Why hasn't deep learning generated significant economic value yet? 2022-04-30T20:27:54.554Z
Does the rationalist community have a membership funnel? 2022-04-12T18:44:48.795Z
When to use "meta" vs "self-reference", "recursive", etc. 2022-04-06T04:57:47.405Z
Giving calibrated time estimates can have social costs 2022-04-03T21:23:46.590Z
Ways to invest your wealth if you believe in a high-variance future? 2022-03-11T16:07:49.302Z
How can I see a daily count of all words I type? 2022-02-04T02:05:04.483Z
Tag for AI alignment? 2022-01-02T18:55:45.228Z
Is Omicron less severe? 2021-12-30T23:14:37.292Z
Confusion about Sequences and Review Sequences 2021-12-21T18:13:13.394Z
How I became a person who wakes up early 2021-12-18T18:41:45.732Z
What's the status of third vaccine doses? 2021-08-04T02:22:52.317Z
A new option for building lumenators 2021-07-12T23:45:34.294Z
Bay and/or Global Solstice* Job Search (2021 - 2022) 2021-03-16T00:21:10.290Z
Where does the phrase "central example" come from? 2021-03-12T05:57:49.253Z
One Year of Pomodoros 2020-12-31T04:42:31.274Z
Logistics for the 2020 online Secular Solstice* 2020-12-03T00:08:37.401Z
The Bay Area Solstice 2014-12-03T22:33:17.760Z
Mathematical Measures of Optimization Power 2012-11-24T10:55:17.145Z
Modifying Universal Intelligence Measure 2012-09-18T23:44:08.864Z
An Intuitive Explanation of Solomonoff Induction 2012-07-11T08:05:20.544Z
Should LW have a separate AI section? 2012-07-10T01:42:39.259Z

Comments

Comment by Alex_Altair on o3 · 2024-12-20T21:13:06.666Z · LW · GW

Thanks. Is "pass@1" some kind of lingo? (It seems like an ungoogleable term.)

Comment by Alex_Altair on o3 · 2024-12-20T20:50:47.704Z · LW · GW

I guess one thing I want to know is like... how exactly does the scoring work? I can imagine something like, they ran the model a zillion times on each question, and if any one of the answers was right, that got counted in the light blue bar. Something that plainly silly probably isn't what happened, but it could be something similar.

If it actually just submitted one answer to each question and got a quarter of them right, then I think it doesn't particularly matter to me how much compute it used.
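To make the difference concrete, here's a rough sketch of the two scoring schemes I'm imagining. The function name and the toy numbers are made up for illustration, and I have no idea whether either scheme matches how the benchmark was actually scored. `pass_at_k` is the usual unbiased estimate of "at least one of k sampled answers is correct", given n samples of which c were correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct), given that
    c of n total samples for this question were correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-question results: (samples drawn, samples correct).
results = [(100, 0), (100, 1), (100, 5), (100, 50)]

# Score if one answer per question is submitted.
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)

# Score if a question counts as solved when any of 100 answers is right.
pass_at_100 = sum(pass_at_k(n, c, 100) for n, c in results) / len(results)
```

With these toy numbers the one-answer score is 0.14 and the any-of-100 score is 0.75, which is the size of gap that would change how much I update.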

Comment by Alex_Altair on o3 · 2024-12-20T19:37:34.853Z · LW · GW

On the livestream, Mark Chen says the 25.2% was achieved "in aggressive test-time settings". Does that just mean more compute?

Comment by Alex_Altair on o3 · 2024-12-20T19:35:25.581Z · LW · GW

I wish they would tell us what the dark vs light blue means. Specifically, for the FrontierMath benchmark, the dark blue looks like it's around 8% (rather than the light blue at 25.2%). Which like, I dunno, maybe this is nitpicking, but 25% on FrontierMath seems like a BIG deal, and I'd like to know how much to be updating my beliefs.

Comment by Alex_Altair on The 2023 LessWrong Review: The Basic Ask · 2024-12-04T20:04:01.491Z · LW · GW

things are almost never greater than the sum of their parts Because Reductionism

Isn't it more like, the value of the sum of the things is greater than the sum of the value of each of the things? That is, $U(A + B) > U(A) + U(B)$ (where perhaps $U$ is a utility function). That seems totally normal and not-at-all at odds with Reductionism.

Comment by Alex_Altair on Dalcy's Shortform · 2024-11-26T01:39:38.614Z · LW · GW

I'd vote for removing the stage "developing some sort of polytime solution" and just calling 4 "developing a practical solution". I think listing that extra step is coming from the perspective of someone who's more heavily involved in complexity classes. We're usually interested in polynomial time algorithms because they're usually practical, but there are lots of contexts where practicality doesn't require a polynomial time algorithm, or really, where we're just not working in a context where it's natural to think in terms of algorithms with run-times.

Comment by Alex_Altair on A Straightforward Explanation of the Good Regulator Theorem · 2024-11-21T23:02:42.940Z · LW · GW

Thank you for writing this! Your description in the beginning about trying to read about the GRT and coming across a sequence of resources, each of which didn't do quite what you wanted, is a precise description of the path I also followed. I gave up at the end, wishing that someone would write an explainer, and you have written exactly the explainer that I wanted!

Comment by Alex_Altair on Habryka's Shortform Feed · 2024-10-30T17:45:31.435Z · LW · GW

Positive feedback, I am happy to see the comment karma arrows pointing up and down instead of left and right. I have some degree of left-right confusion and was always clicking and unclicking my comment votes to figure out which was up and down.

Also appreciate that the read time got put back into main posts.

(Comment font stuff looks totally fine to me, both before and after this change.)

Comment by Alex_Altair on Dalcy's Shortform · 2024-10-28T02:02:00.232Z · LW · GW

[Some thoughts that are similar but different to my previous comment;]

I suspect you can often just prove the behavioral selection theorem and structural selection theorem in separate, almost independent steps.

  1. Prove a behavioral theorem
  2. add in a structural assumption
  3. prove that behavioral result plus structural assumption implies structural result.

Behavior essentially serves as an "interface", and a given behavior can be implemented by any number of different structures. So it would make sense that you need to prove something about structure separately (and that you can prove it for multiple different types of structural assumption).

Further claims: for any given structural class,

  • there will be a natural simplicity measure
  • simpler instances will be exponentially rare.

A structural class is something like programs, or Markov chains, or structural causal models. The point of specifying structure is to in some way model how the system might actually be shaped in real life. So it seems to me that any of these will be specified with a finite string over a finite alphabet. This comes with the natural simplicity measure of the length of the specification string, and there are exponentially fewer short strings than long ones.[1]

So let's say you want to prove that your thing X which has behavior B has specific structure S. Since structure S has a fixed description length, you almost automatically know that it's exponentially less likely for X to be one of the infinitely many structures with description length longer than S. (Something similar holds for being within delta of S.) The remaining issue is whether there are any other secret structures that are shorter than S (or of similar length) that X could be instead.

  1. ^

    Technically, you could have a subset of strings that didn't grow exponentially. For example, you could, for some reason, decide to specify your Markov chains using only strings of zeros. That would grow linearly rather than exponentially. But this is clearly a less natural specification method.
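Here's a small counting sketch of the "exponentially fewer short strings" claim, plus the all-zeros case from the footnote. The helper names are just for illustration.

```python
def count_strings(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of exactly this length."""
    return alphabet_size ** length

def count_up_to(alphabet_size: int, max_length: int) -> int:
    """Number of distinct strings of length 0 through max_length."""
    return sum(count_strings(alphabet_size, n) for n in range(max_length + 1))

# Binary alphabet: what fraction of all strings of length <= 30
# have length <= 20?
short = count_up_to(2, 20)
total = count_up_to(2, 30)
fraction_short = short / total  # roughly 2 ** -10

# The footnote's case: strings of zeros only (a one-symbol alphabet).
# The count grows linearly in the length bound, not exponentially.
unary_total = count_up_to(1, 30)
```

Each extra symbol of allowed length multiplies the number of binary strings by about 2, so shortening the bound by 10 symbols leaves about one string in a thousand.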

Comment by Alex_Altair on Dalcy's Shortform · 2024-10-19T00:45:56.360Z · LW · GW

For some reason the "only if" always throws me off. It reminds me of the unless keyword in ruby, which is equivalent to if not, but somehow always made my brain segfault.

Comment by Alex_Altair on Dalcy's Shortform · 2024-10-19T00:44:26.087Z · LW · GW

It's maybe also worth saying that any other description method is a subset of programs (or is incomputable and therefore not what real-world AI systems are). So if the theoretical issues in AIT bother you, you can probably make a similar argument using a programming language with no while loop, or I dunno, finite MDPs whose probability distributions are Gaussian with finite parameter descriptions.

Comment by Alex_Altair on Dalcy's Shortform · 2024-10-19T00:39:52.713Z · LW · GW

Yeah, I think structural selection theorems matter a lot, for reasons I discussed here.

This is also one reason why I continue to be excited about Algorithmic Information Theory. Computable functions are behavioral, but programs (= algorithms) are structural! The fact that programs can be expressed in the homogeneous language of finite binary strings gives a clear way to select for structure; just limit the length of your program. We even know exactly how this mathematical parameter translates into real-world systems, because we can know exactly how many bits our ML models take up on the hard drives.

And I think you can use algorithmic information distance to well-define just how close to agent-structured your policy is. First, define the specific program A that you mean to be maximally agent-structured (which I define as a utility-maximizing program). If your policy (as a program) can be described as "Program A, but different in ways X" then we have an upper bound for how close to agent-structured it is. X will be a program that tells you how to transform A into your policy, and that gives us a "distance" of at most the length of X in bits.

For a given length, almost no programs act anything like A. So if your policy is only slightly bigger than A, and it acts like A, then it's probably of the form "A, but slightly different", which means it's agent-structured. (Unfortunately this argument needs like 200 pages of clarification.)
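Actual algorithmic information distance is uncomputable, so the following is only a crude stand-in: it swaps Kolmogorov complexity for zlib-compressed length and computes the normalized compression distance. The byte strings are hypothetical placeholders for "program A" and two candidate policies, not real programs.

```python
import random
import zlib

def clen(data: bytes) -> int:
    """Compressed length in bytes: a crude upper bound on description length."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between x and y."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

rng = random.Random(0)

# Hypothetical stand-in for the text of program A.
program_a = bytes(rng.getrandbits(8) for _ in range(2000))

# "A, but different in ways X": A with a short patch appended.
policy_near = program_a + b"small patch"

# An unrelated program of the same size.
policy_far = bytes(rng.getrandbits(8) for _ in range(2000))

d_near = ncd(program_a, policy_near)
d_far = ncd(program_a, policy_far)
```

The patched copy should come out close to 0 and the unrelated one close to 1. A compressor only sees literal repetition, so this misses most of what "A, but different in ways X" can mean for real programs.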

Comment by Alex_Altair on Seeking AI Alignment Tutor/Advisor: $100–150/hr · 2024-10-06T00:01:47.830Z · LW · GW

FWIW I think this would be a lot less like "tutoring" and a lot more like "paying people to tell you their opinions". Which is a fine thing to want to do, but I just want to make sure you don't think there's any kind of objective curriculum that comprises AI alignment.

Comment by Alex_Altair on Work with me on agent foundations: independent fellowship · 2024-09-27T22:46:16.457Z · LW · GW

Nice! Yeah I'd be happy to chat about that, and also happy to get referrals of any other researchers who might be interested in receiving this funding to work on it.

Comment by Alex_Altair on [deleted post] 2024-09-24T04:44:01.947Z

Note to readers; it is an obligatory warning on any post like this that you should not run random scripts downloaded from the internet without reading them to see what they do, because there are many harmful things they could be doing.

Comment by Alex_Altair on Work with me on agent foundations: independent fellowship · 2024-09-21T17:18:00.210Z · LW · GW

<3!

Comment by Alex_Altair on Perplexity wins my AI race · 2024-08-28T06:09:58.807Z · LW · GW

FWIW I have used Perplexity twice since you mentioned it, it was somewhat helpful both times, but also, both times the citations had errors. By that I mean it would say something and then put a citation number next to it, but what it said was not in the cited document.

Comment by Alex_Altair on My Apartment Art Commission Process · 2024-08-28T06:03:07.817Z · LW · GW

Aren’t they sick as hell???

Can confirm, these are sick as hell

Comment by Alex_Altair on Quick look: applications of chaos theory · 2024-08-22T21:31:13.209Z · LW · GW

I know that there's something called the Lyapunov exponent. Could we "diminish the chaos" if we use logarithms, like with the Richter scale for earthquakes?

This is a neat question. I think the answer is no, and here's my attempt to describe why.

The Lyapunov exponent measures the difference between the trajectories over time. If your system is the double pendulum, you need to be able to take two random states of the double pendulum and say how different they are. So it's not like you're measuring the speed, or the length, or something like that. And if you have this distance metric on the whole space of double-pendulum states, then you can't "take the log" of all the distances at the same time (I think because that would break the triangle inequality).
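A quick check of the triangle-inequality worry, using three points on a line as a hypothetical stand-in for three double-pendulum states (the real state space is higher-dimensional, but one counterexample is enough):

```python
import math

def dist(a: float, b: float) -> float:
    """Ordinary distance between two points on the real line."""
    return abs(a - b)

def log_dist(a: float, b: float) -> float:
    """Candidate 'log-scaled distance' (only defined for a != b)."""
    return math.log(dist(a, b))

a, b, c = 0.0, 1.0, 2.0

# The ordinary distance satisfies d(a, c) <= d(a, b) + d(b, c).
ordinary_ok = dist(a, c) <= dist(a, b) + dist(b, c)

# The log version does not: log(2) > log(1) + log(1) = 0.
log_ok = log_dist(a, c) <= log_dist(a, b) + log_dist(b, c)
```

Here `ordinary_ok` is true and `log_ok` is false. The log version also goes negative for points closer than 1, so it fails to be a metric in more than one way.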

Comment by Alex_Altair on Quick look: applications of chaos theory · 2024-08-22T21:16:05.539Z · LW · GW

It possesses this subjective element (what we consider to be negligible differences) that seems to undermine its standing as a legitimate mathematical discipline.

I think I see what you're getting at here, but no, "chaotic" is a mathematical property that systems (of equations) either have or don't have. The idea behind sensitive dependence on initial conditions is that any difference, no matter how small, will eventually lead to diverging trajectories. Since it will happen for arbitrarily small differences, it will definitely happen for whatever difference exists within our ability to make measurements. But the more precisely you measure, the longer it will take for the trajectories to diverge (which is what faul_sname is referring to).
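As a toy illustration of that last point, here's a sketch using the logistic map at r = 4 (a standard chaotic system, chosen by me as an example, not one discussed in the post). The starting point, the two measurement precisions, and the 0.1 divergence threshold are all arbitrary choices.

```python
def logistic(x: float) -> float:
    """One step of the logistic map at r = 4."""
    return 4.0 * x * (1.0 - x)

def steps_to_diverge(x0: float, delta: float, threshold: float = 0.1,
                     max_steps: int = 1000) -> int:
    """Number of steps until two trajectories starting delta apart
    differ by more than threshold (or max_steps if they never do)."""
    x, y = x0, x0 + delta
    for step in range(max_steps):
        if abs(x - y) > threshold:
            return step
        x, y = logistic(x), logistic(y)
    return max_steps

coarse = steps_to_diverge(0.3, 1e-6)   # initial error around 1e-6
fine = steps_to_diverge(0.3, 1e-12)    # initial error around 1e-12
```

Both pairs of trajectories eventually diverge, but the more precisely specified pair takes longer. Since the gap grows roughly exponentially, a million-fold improvement in precision only buys an additive handful of extra steps.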

Comment by Alex_Altair on Quick look: applications of chaos theory · 2024-08-22T21:01:03.227Z · LW · GW

The paper Gleick was referring to is this one, but it would be a lot of work to discern whether it was causal in getting telephone companies to do anything different. It sounds to me like the paper is saying that the particular telephone error data they were looking at could not be well-modeled as IID, nor could it be well-modeled as a standard Markov chain; instead, it was best modeled as a statistical fractal, which corresponds to a heavy-tailed distribution somehow.

Comment by Alex_Altair on Quick look: applications of chaos theory · 2024-08-22T20:26:14.475Z · LW · GW

Definitely on the order of "tens of hours", but it'd be hard to say more specifically. Also, almost all of that time (at least for me) went into learning stuff that didn't go into this post. Partly that's because the project is broader than this post, and partly because I have my own research priority of understanding systems theory pretty well.

Comment by Alex_Altair on Quick look: applications of chaos theory · 2024-08-22T20:23:35.469Z · LW · GW

For what it's worth, I think you're getting downvoted in part because what you write seems to indicate that you didn't read the post.

Comment by Alex_Altair on Quick look: applications of chaos theory · 2024-08-20T20:04:47.109Z · LW · GW

Huh, interesting! So the way I'm thinking about this is, your loss landscape determines the attractor/repellor structure of your phase space (= network parameter space). For a (reasonable) optimization algorithm to have chaotic behavior on that landscape, it seems like the landscape would either have to have 1) a positive-measure flat region, on which the dynamics were ergodic, or 2) a strange attractor, which seems more plausible.

I'm not sure how that relates to the above link; it mentions the parameters "diverging", but it's not clear to me how neural network weights can diverge; aren't they bounded?

Comment by Alex_Altair on Quick look: applications of chaos theory · 2024-08-20T19:46:57.163Z · LW · GW

I'm curious about this part;

even though the motion of the trebuchet with sling isn't chaotic during the throw, it can be made chaotic by just varying the initial conditions, which rules out a simple closed form solution for non-chaotic initial conditions

Do you know what theorems/whatever this is from? It seems to me that if you know that "throws" constitute a subset of phase space that isn't chaotic, then you should be able to have a closed-form solution for those trajectories.

Comment by Alex_Altair on Alex_Altair's Shortform · 2024-08-19T14:00:30.707Z · LW · GW

It turns out I have the ESR version of firefox on this particular computer: Firefox 115.14.0esr (64-bit). Also tried it in incognito, and with all browser extensions turned off, and checked multiple posts that used sections.

Comment by Alex_Altair on Habryka's Shortform Feed · 2024-08-18T16:55:44.653Z · LW · GW

My overall review is, seems fine, some pros and some cons, mostly looks/feels the same to me. Some details;

  • I had also started feeling like the stuff between the title and the start of the post content was cluttered.
  • I think my biggest current annoyance is the TOC on the left sidebar. This has actually disappeared for me, and I don't see it on hover-over, which I assume is maybe just a firefox bug or something. But even before this update, I didn't like the TOC. Specifically, you guys had made it so that there was spacing between the sections that was supposed to be proportional to the length of each section. This never felt like it worked for me (I could speculate on why if you're interested). I'd much prefer if the TOC was just a normal outline-type thing (which it was in a previous iteration).
  • I think I'll also miss the word count. I use it quite frequently (usually after going onto the post page itself, so the preview card wouldn't help much). Having the TOC progress bar thing never felt like it worked either. I agree with Neel that it'd be fine to have the word count in the date hover-over, if you want to have less stuff on the page.
  • The tags at the top right are now just bare words, which I think looks funny. Over the years you guys have often seemed to prefer really naked minimalist stuff. In this case I think the tags kinda look like they might be site-wide menus, or something. I think it's better to have the tiny box drawn around each tag as a visual cue.
  • The author name is now in a sans-serif font, which looks pretty off to me in between the title and the text as serif fonts. It looks like when the browser fails to load the site font and falls back onto the default font, or something. (I do see that it matches the fact that usernames in the comments are sans serif, though.)
  • I initially disliked the karma section being so suppressed, but then I read one of your comments in this thread explaining your reasoning behind that, and now I agree it's good.
  • I also use the comment count/link to jump to comments fairly often, and agree that having that in the lower left is fine.

Comment by Alex_Altair on Alex_Altair's Shortform · 2024-08-18T16:31:49.131Z · LW · GW

It does not! At least, not anywhere that I've tried hovering.

Comment by Alex_Altair on Alex_Altair's Shortform · 2024-08-18T16:09:30.679Z · LW · GW

Is it just me, or did the table of contents for posts disappear? The left sidebar just has lines and dots now.

Comment by Alex_Altair on A simple model of math skill · 2024-07-23T16:54:00.094Z · LW · GW

There is a little crackpot voice in my head that says something like, "the real numbers are dumb and bad and we don't need them!" I don't give it a lot of time, but I do let that voice exist in the back of my mind trying to work out other possible foundations. A related issue here is that it seems to me that one should be able to have a uniform probability distribution over a countable set of numbers. Perhaps one could do that by introducing infinitesimals.
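For reference, the standard obstacle (with ordinary real-valued, countably additive probability) is short. Suppose every element of a countably infinite set $\{x_1, x_2, \dots\}$ got the same probability $p$. Then

```latex
\sum_{i=1}^{\infty} P(\{x_i\}) \;=\; \sum_{i=1}^{\infty} p \;=\;
\begin{cases}
0 & \text{if } p = 0, \\
\infty & \text{if } p > 0.
\end{cases}
```

Neither case equals 1, so something has to give: either countable additivity or real-valued probabilities. Infinitesimals would be one way of giving up the latter.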

Comment by Alex_Altair on 2022 AI Alignment Course: 5→37% working on AI safety · 2024-06-21T21:20:23.641Z · LW · GW

Agreed the title is confusing. I assumed it meant that some metric was 5% for last year's course, and 37% for this year's course. I think I would just nix numbers from the title altogether.

Comment by Alex_Altair on What distinguishes "early", "mid" and "end" games? · 2024-06-21T21:14:32.965Z · LW · GW

One model I have is that when things are exponentials (or S-curves), it's pretty hard to tell when you're about to leave the "early" game, because exponentials look the same when scaled. If every year has 2x as much activity as the previous year, then every year feels like the one that was the big transition.
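A tiny numerical version of this, with a made-up activity series that doubles every year:

```python
# Hypothetical activity level per year, doubling every year.
activity = [2 ** year for year in range(10)]

# For each year after the first, compare that year's activity
# to the total of all the years before it.
exceeds_all_prior = [
    activity[n] > sum(activity[:n]) for n in range(1, len(activity))
]

# Ratio of each year to the previous one: the same from every vantage point.
ratios = [activity[n] / activity[n - 1] for n in range(1, len(activity))]
```

Every year has more activity than all previous years combined, and the year-over-year ratio is identical throughout, so no single year is distinguished as "the" transition from the inside.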

For example, it's easy to think that AI has "gone mainstream" now. Which is true according to some order of magnitude. But even though a lot of politicians are talking about AI stuff more often, it's nowhere near the top of the list for most of them. It's more like just one more special interest to sometimes give lip service to, nowhere near issues like US polarization, China, healthcare and climate change.

Of course, AI isn't necessarily well-modelled by an S-curve. Depending on what you're measuring, it could be non-monotonic (with winters and summers). It could also be a hyperbola. And if we all dropped dead in the same minute from nanobots, then there wouldn't really be a mid- or end-game at all. But I currently hold a decent amount of humility around ideas like "we're in midgame now".

Comment by Alex_Altair on [deleted post] 2024-06-11T22:10:42.762Z

(Tiny bug report, I got an email for this comment reply, but I don't see it anywhere in my notifications.)

Comment by Alex_Altair on [deleted post] 2024-06-11T22:10:09.135Z

Done

Comment by Alex_Altair on [deleted post] 2024-06-11T20:43:09.992Z

I propose that this tag be merged into the tag called Infinities In Ethics.

Comment by Alex_Altair on 0. CAST: Corrigibility as Singular Target · 2024-06-08T04:00:09.496Z · LW · GW

3.

3b.*?

Comment by Alex_Altair on Announcing ILIAD — Theoretical AI Alignment Conference · 2024-06-05T20:43:05.166Z · LW · GW

How about deconferences?

Comment by Alex_Altair on Open Problems Create Paradigms · 2024-06-05T18:39:40.610Z · LW · GW

I'm noticing what might be a miscommunication/misunderstanding between your comment and the post and Kuhn. It's not that the statement of such open problems creates the paradigm; it's that solutions to those problems create the paradigm.

The problems exist because the old paradigms (concepts, methods etc) can't solve them. If you can state some open problems such that everyone agrees that those problems matter, and whose solution could be verified by the community, then you've gotten a setup for solutions to create a new paradigm. A solution will necessarily use new concepts and methods. If accepted by the community, these concepts and methods constitute the new paradigm.

(Even this doesn't always work if the techniques can't be carried over to further problems and progress. For example, my impression is that Logical Induction nailed the solution to a legitimately important open problem, but it does not seem that the solution has been of a kind which could be used for further progress.)

Comment by Alex_Altair on Announcing ILIAD — Theoretical AI Alignment Conference · 2024-06-05T18:14:49.435Z · LW · GW

Interactively Learning the Ideal Agent Design

Comment by Alex_Altair on Thomas Kwa's Shortform · 2024-06-05T17:34:40.798Z · LW · GW

[Continuing to sound elitist,] I have a related gripe/hot take that comments give people too much karma. I feel like I often see people who are "noisy" in that they comment a lot and have a lot of karma from that,[1] but have few or no valuable posts, and who I also don't have a memory of reading valuable comments from. It makes me feel incentivized to acquire more of a habit of using LW as a social media feed, rather than just commenting when a thought I have passes my personal bar of feeling useful.

  1. ^

    Note that self-karma contributes to a comment's position within the sorting, but doesn't contribute to the karma count on your account, so you can't get a bunch of karma just by leaving a bunch of comments that no one upvotes. So these people are getting at least a consolation prize upvote from others.

Comment by Alex_Altair on "Does your paradigm beget new, good, paradigms?" · 2024-06-04T18:55:57.816Z · LW · GW

I think the guiding principle behind whether or not scientific work is good should probably look something more like “is this getting me closer to understanding what’s happening”

One model that I'm currently holding is that Kuhnian paradigms are about how groups of people collectively decide that scientific work is good, which is distinct from how individual scientists do or should decide that scientific work is good. And collective agreement is way more easily reached via external criteria.

Which is to say, problems are what establishes a paradigm. It's way easier to get a group of people to agree that "thing no go", than it is to get them to agree on the inherent nature of thing-ness and go-ness. And when someone finally makes thing go, everyone looks around and kinda has to concede that, whatever their opinion was of that person's ontology, they sure did make thing go. (And then I think the Wentworth/Greenblatt discussion above is about whether the method used to make thing go will be useful for making other things go, which is indeed required for actually establishing a new paradigm.)

That said, I think that the way that an individual scientist decides what ideas to pursue should usually route through things more like “is this getting me closer to understanding what’s happening”, but that external people are going to track "are problems getting solved", and so it's probably a good idea for most of the individual scientists to occasionally reflect on how likely their ideas are to make progress on (paradigm-setting) problems.

(It is possible for the agreed-upon problem to be "everyone is confused", and possible for a new idea to simultaneously de-confuse everyone, thus inducing a new paradigm. (You could say that this is what happened with the Church-Turing thesis.) But it's just pretty uncommon, because people's ontologies can be wildly different.)

When you say, "I think that more precisely articulating what our goals are with agent foundations/paradigmaticity/etc could be very helpful...", how compatible is that with more precisely articulating problems in agent foundations (whose solutions would be externally verifiable by most agent foundations researchers)?

Comment by Alex_Altair on MIRI 2024 Communications Strategy · 2024-05-29T20:46:55.004Z · LW · GW

stable, durable, proactive content – called “rock” content

FWIW this is conventionally called evergreen content.

Comment by Alex_Altair on One way violinists fail · 2024-05-29T17:42:54.932Z · LW · GW

"you're only funky as [the moving average of] your last [few] cut[s]"

Somehow this is in an <a> link tag with no href attribute.

Comment by Alex_Altair on When is Goodhart catastrophic? · 2024-05-28T19:42:58.370Z · LW · GW

I finally got around to reading this sequence, and I really like the ideas behind these methods. This feels like someone actually trying to figure out exactly how fragile human values are. It's especially exciting because it seems like it hooks right into an existing, normal field of academia (thus making it easier to leverage their resources toward alignment).

I do have one major issue with how the takeaway is communicated, starting with the term "catastrophic". I would only use that word when the outcome of the optimization is really bad, much worse than "average" in some sense. That's in line with the idea that the AI will "use the atoms for something else", and not just leave us alone to optimize its own thing. But the theorems in this sequence don't seem to be about that;

We call this catastrophic Goodhart because the end result, in terms of $V$, is as bad as if we hadn't conditioned at all.

Being as bad as if you hadn't optimized at all isn't very bad; it's where we started from!

I think this has almost the opposite takeaway from the intended one. I can imagine someone (say, OpenAI) reading these results and thinking something like, great! They just proved that in the worst case scenario, we do no harm. Full speed ahead!

(Of course, putting a bunch of optimization power into something and then getting no result would still be a waste of the resources put into it, which is presumably not built into $V$. But that's still not very bad.)

That said, my intuition says that these same techniques could also suss out the cases where optimizing for $U$ pessimizes for $V$, in the previously mentioned use-our-atoms sense.
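Here's a rough simulation of the "as bad as not conditioning" outcome, under assumptions that are mine and not the sequence's exact setup: the proxy is the true value plus an independent error, the true value is standard normal, and "optimizing" just means keeping the samples with the highest proxy. All names are made up for the sketch.

```python
import math
import random

def mean_value_of_top(error_sampler, n=200_000, top=200, seed=0):
    """Sample true_value ~ N(0, 1) and an independent error, rank by
    proxy = true_value + error, and return the average true_value
    among the `top` highest-proxy samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        true_value = rng.gauss(0.0, 1.0)
        samples.append((true_value + error_sampler(rng), true_value))
    samples.sort(reverse=True)  # highest proxy first
    return sum(v for _, v in samples[:top]) / top

def light_tailed(rng):
    """Normally distributed error."""
    return rng.gauss(0.0, 1.0)

def heavy_tailed(rng):
    """Standard Cauchy error, sampled via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

light = mean_value_of_top(light_tailed)
heavy = mean_value_of_top(heavy_tailed)
```

With light-tailed error, selecting on the proxy pulls the true value well above its unconditioned mean of 0. With heavy-tailed error, the selected samples are just the ones with huge error, and their average true value sits near 0. That is "no better than not optimizing", not "worse than where we started".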

Comment by Alex_Altair on Catastrophic Goodhart in RL with KL penalty · 2024-05-28T15:51:19.636Z · LW · GW

Does the notation get flipped at some point? In the abstract you say

prior policy 

and

there are arbitrarily well-performing policies 

But then later you say

This strongly penalizes  taking actions the base policy never takes

Which makes it sound like they're switched.

I also notice that you call it "prior policy", "base policy" and "reference policy" at different times; these all make sense but it'd be a bit nicer if there was one phrase used consistently.

Comment by Alex_Altair on Computational Mechanics Hackathon (June 1 & 2) · 2024-05-25T01:58:25.044Z · LW · GW

I'm curious if you knowingly scheduled this during LessOnline?

Comment by Alex_Altair on Towards a formalization of the agent structure problem · 2024-05-19T20:21:56.518Z · LW · GW

Yep, that paper has been on my list for a while, but I have thus far been unable to penetrate the formalisms that the Causal Incentive Group uses. This paper in particular also seems to have some fairly limiting assumptions in the theorem.

Comment by Alex_Altair on Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University · 2024-05-19T01:05:38.915Z · LW · GW

Hey Johannes, I don't quite know how to say this, but I think this post is a red flag about your mental health. "I work so hard that I ignore broken glass and then walk on it" is not healthy.

I've been around the community a long time and have seen several people have psychotic episodes. This is exactly the kind of thing I start seeing before they do.

I'm not saying it's 90% likely, or anything. Just that it's definitely high enough for me to need to say something. Please try to seek out some resources to get you more grounded.

Comment by Alex_Altair on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-15T19:44:22.816Z · LW · GW

I really appreciate this comment!

And yeah, that's why I said only "Note that...", and not something like "don't trust this guy". I think the content of the article is probably true, and maybe it's Metz who wrote it just because AI is his beat. But I do also hold tiny models that say "maybe he dislikes us" and also something about the "questionable understanding" etc that habryka mentions below. AFAICT I'm not internally seething or anything, I just have a yellow-flag attached to this name.

Comment by Alex_Altair on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-15T01:38:05.681Z · LW · GW

Note that the NYT article is by Cade Metz.