Approaching Human-Level Forecasting with Language Models 2024-02-29T22:36:34.012Z
Analyzing the Historical Rate of Catastrophes 2023-12-05T06:30:01.757Z
Forecasting AI (Overview) 2023-11-16T19:00:04.218Z
GPT-2030 and Catastrophic Drives: Four Vignettes 2023-11-10T07:30:06.480Z
Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI 2023-10-31T05:10:02.581Z
AI Pause Will Likely Backfire (Guest Post) 2023-10-24T04:30:02.113Z
AI Forecasting: Two Years In 2023-08-19T23:40:04.302Z
What will GPT-2030 look like? 2023-06-07T23:40:02.925Z
Complex Systems are Hard to Control 2023-04-04T00:00:13.634Z
Principles for Productive Group Meetings 2023-03-22T00:50:07.619Z
Emergent Deception and Emergent Optimization 2023-02-20T02:40:09.912Z
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small 2022-10-28T23:55:44.755Z
Forecasting ML Benchmarks in 2023 2022-07-18T02:50:16.683Z
AI Forecasting: One Year In 2022-07-04T05:10:18.470Z
How fast can we perform a forward pass? 2022-06-10T23:30:10.341Z
Early 2022 Paper Round-up (Part 2) 2022-04-21T23:40:11.933Z
Early 2022 Paper Round-up 2022-04-14T20:50:29.365Z
Appendix: More Is Different In Other Domains 2022-02-08T16:00:17.848Z
Empirical Findings Generalize Surprisingly Far 2022-02-01T22:30:19.142Z
ML Systems Will Have Weird Failure Modes 2022-01-26T01:40:13.134Z
Anchor Weights for ML 2022-01-20T16:20:20.390Z
Thought Experiments Provide a Third Anchor 2022-01-18T16:00:20.795Z
Future ML Systems Will Be Qualitatively Different 2022-01-11T19:50:11.377Z
More Is Different for AI 2022-01-04T19:30:20.352Z
From Considerations to Probabilities 2021-12-31T02:10:14.682Z
Prioritizing Information 2021-12-24T00:00:22.448Z
The "Other" Option 2021-12-16T20:20:29.611Z
Combining Forecasts 2021-12-10T02:10:14.402Z
Common Probability Distributions 2021-12-02T01:50:17.115Z
Base Rates and Reference Classes 2021-11-24T22:30:18.741Z
Forecasting: Zeroth and First Order 2021-11-18T01:30:19.127Z
Measuring and Forecasting Risks from AI 2021-11-12T02:30:20.959Z
How should we compare neural network representations? 2021-11-05T22:10:18.677Z
Measuring and forecasting risks 2021-10-29T07:27:32.836Z
Deliberate Play 2021-10-24T02:50:16.947Z
On The Risks of Emergent Behavior in Foundation Models 2021-10-18T20:00:15.896Z
How much slower is remote work? 2021-10-08T02:00:17.857Z
Unsolved ML Safety Problems 2021-09-29T16:00:19.466Z
Let Us Do Our Work As Well 2021-09-17T00:40:18.443Z
Economic AI Safety 2021-09-16T20:50:50.335Z
Film Study for Research 2021-09-14T18:53:25.831Z
Does Diverse News Decrease Polarization? 2021-09-11T02:30:16.583Z
Measurement, Optimization, and Take-off Speed 2021-09-10T19:30:57.189Z
Model Mis-specification and Inverse Reinforcement Learning 2018-11-09T15:33:02.630Z
Latent Variables and Model Mis-Specification 2018-11-07T14:48:40.434Z
[link] Essay on AI Safety 2015-06-26T07:42:11.581Z
The Power of Noise 2014-06-16T17:26:30.329Z
A Fervent Defense of Frequentist Statistics 2014-02-18T20:08:48.833Z
Another Critique of Effective Altruism 2014-01-05T09:51:12.231Z
Macro, not Micro 2013-01-06T05:29:38.689Z


Comment by jsteinhardt on GPT-2030 and Catastrophic Drives: Four Vignettes · 2023-11-11T07:35:37.804Z · LW · GW

When you only have a couple thousand copies, you probably don't want to pay for the speedup, e.g. even going an extra 4x faster decreases the number of copies by 8x.

I also think when you don't have control over your own hardware the speedup schemes become harder, since they might require custom network topologies. Not sure about that though.

Comment by jsteinhardt on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-07T04:40:53.112Z · LW · GW

While I am not close to this situation, I felt moved to write something, mostly to support junior researchers and staff such as TurnTrout, Thomas Kwa, and KurtB who are voicing difficult experiences that may be challenging for them to talk about; and partly because I can provide perspective as someone who has managed many researchers and worked in a variety of research and non-research organizations and so can more authoritatively speak to what behaviors are 'normal' and what patterns tend to lead to good or bad outcomes. Caveat that I know very little about any internal details of MIRI, but I am still reasonably confident of what I'm saying based on general patterns and experience in the world.

Based on reading Thomas Kwa's experience, as well as KurtB's experience, Nate Soares' behavior is far outside any norms of acceptable behavior that I'd endorse. Accepting or normalizing this behavior within an organization has a corrosive effect on the morale, epistemics, and spiritual well-being of its members. The morale effects are probably obvious, but regarding epistemics, leadership is significantly less likely to get useful feedback if people are afraid to cross them (psychological safety is an important concept here). Finally, regarding spirit, normalizing this behavior sends a message to people that they aren't entitled to set boundaries or be respected, which can create far-reaching damage in their other interactions and in their image of themselves. Based on this, I feel very worried for MIRI and think it should probably do a serious re-think of its organizational culture.

Since some commenters brought up academia and the idea that some professors can be negligent or difficult to work with, I will compare Nate's behavior to professors in CS academia. Looking at what Thomas Kwa described, I can think of some professors who exhibit individual traits in Thomas' description, but someone who had all of them at once would be an outlier (in a field that is already welcoming to difficult personalities), and I would strongly warn students against working with such a person. KurtB's experience goes beyond that and seems at least a standard deviation worse; if someone behaved this way, I would try to minimize their influence in any organization I was part of and refuse to collaborate with them, and I would expect even a tenured faculty to have a serious talking-to about their behavior from colleagues (though maybe some places would be too cowardly to have this conversation), and for HR complaints to stack up.

Nate, the best description I can think of for what's going on is that you have fairly severe issues with emotional regulation. Your comments indicate that you see this as a basic aspect of your emotional make-up (and maybe intimately tied to your ability to do research), but I have seen this pattern several times before and I am pretty confident this is not the case. In previous cases I've seen, the person in question expresses or exhibits an unwillingness to change up until the point that they face clear consequences for their actions, at which point (after a period of expressing outrage) they buckle down and make the changes, which usually changes their own life for the better, including being able to think more clearly. A first step would be going to therapy, which I definitely recommend. I am pretty confident that even for your own sake you should make a serious effort to make changes here. (I hope this doesn't come across as condescending, as I genuinely believe this is good advice.)

Along these lines, for people around Nate who think that they "have" to accept this behavior because Nate's work is important, even on those grounds alone setting boundaries on the behavior will lead to better outcomes.

Here is an example of how an organization could set boundaries on this behavior: If Nate yells at a staff member, that staff member no longer does ops work for Nate until he apologizes and expresses a credible commitment to communicate more courteously in the future. (This could be done in principle by making it opt-in to do continued ops work for Nate if this happens, and working hard to create a real affordance for not opting in.)

The important principle here is that Nate internalizes the costs of his decisions (by removing his ability to impose costs on others, and bearing the resulting inconvenience). Here the cost to Nate is also generally lower than the cost that would have been imposed on others (inflating your own bike tire is less costly than having your day ruined by being yelled at), though this isn't crucial. The important thing is Nate would have skin in the game---if he still doesn't change, then I believe somewhat more that he's actually incapable of doing so, but I would guess that this would actually lead to changes. And if MIRI for some reason believes that other people should be willing to bear large costs for small benefits to Nate, they should also hire a dedicated staff to do damage control for him. (Maybe some or all of this is already happening... I am not at MIRI so I don't know, but it doesn't sound this way based on the experiences that have been shared.)

In summary: based on my own personal experience across many organizations, Nate's behavior is not okay and MIRI should set boundaries on it. I do not believe Nate's claim that this is a fundamental aspect of his emotional make-up, as it matches other patterns in the past that have changed when consequences were imposed, and even if it is a fundamental aspect he should face the natural consequences of his actions. These consequences should center on removing his ability to harm others, or, if this is not feasible, creating institutions at MIRI to reliably clean up after him and maintain psychological safety.

Comment by jsteinhardt on AI Forecasting: Two Years In · 2023-08-20T21:58:29.658Z · LW · GW

I don't see it in the header in Mobile (although I do see the updated text now about it being a link post). Maybe it works on desktop but not mobile?

Comment by jsteinhardt on AI Forecasting: Two Years In · 2023-08-20T21:44:11.590Z · LW · GW

Is it clear these results don't count? I see nothing in the Metaculus question text that rules it out.

Comment by jsteinhardt on AI Forecasting: Two Years In · 2023-08-20T21:43:34.779Z · LW · GW

Mods, could you have these posts link back to my blog Bounded Regret in some form? Right now there is no indication that this is cross-posted from my blog, and no link back to the original source.

Comment by jsteinhardt on Elon Musk announces xAI · 2023-07-15T06:59:04.443Z · LW · GW

Dan spent his entire PhD working on AI safety and did some of the most influential work on OOD robustness and OOD detection, as well as writing Unsolved Problems. Even if this work is less valued by some readers on LessWrong (imo mistakenly), it seems pretty inaccurate to say that he didn't work on safety before founding CAIS.

Comment by jsteinhardt on Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell? · 2023-06-25T20:19:14.214Z · LW · GW

Melanie Mitchell and Meg Mitchell are different people. Melanie was the participant in this debate, but you seem to be ascribing Meg's opinions to her, including linking to video interviews with Meg in your comments.

Comment by jsteinhardt on What will GPT-2030 look like? · 2023-06-10T00:37:20.185Z · LW · GW

I'm leaving it to the moderators to keep the copies mirrored, or just accept that errors won't be corrected on this copy. Hopefully there's some automatic way to do that?

Comment by jsteinhardt on What will GPT-2030 look like? · 2023-06-09T03:52:50.079Z · LW · GW

Oops, thanks, updated to fix this.

Comment by jsteinhardt on What will GPT-2030 look like? · 2023-06-09T03:51:34.638Z · LW · GW

Thanks! I removed the link.

Comment by jsteinhardt on What will GPT-2030 look like? · 2023-06-09T03:50:52.550Z · LW · GW

Thanks! I removed the link.

Comment by jsteinhardt on Steering GPT-2-XL by adding an activation vector · 2023-05-27T17:46:22.107Z · LW · GW

Glad it was helpful!

Comment by jsteinhardt on Steering GPT-2-XL by adding an activation vector · 2023-05-19T19:53:54.798Z · LW · GW

Hi Alex,

Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is because it's one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon's moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).

I think the main important point is that there is a body of related work in the ML literature that explores fairly similar ideas, and LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won't realize this. I think it's good to point out Dan's initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn't been addressed. (I also think Dan overfocused on Ludwig's paper, see below for more of my take on related work.)

Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):

 * There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.

 * In language, there's a bunch of work on controllable generation, where I would be surprised if no one looked at modifying activations (at least I'd expect someone to try soft prompt tuning), but I don't know for sure.

 * On modifying activations in language models there is a bunch of stuff on patching / swapping, and on modifying stuff in the directions of probes.

I think we would probably both agree that this is the main set of related papers, and also both agree that you cited work within each of these branches (except maybe the second one). Where we differ is that I see all of this as basically variations on the same idea of modifying the activations or weights to control a model's runtime behavior:
 * You need to find a direction, which you can do either by learning a direction or by simple averaging. Simple averaging is more or less the same as one step of gradient descent, so I see these as conceptually similar.
 * You can modify the activations or weights. Usually if an idea works in one case it works in the other case, so I also see these as similar.
 * The modality can be language or vision. Most prior work has been on vision models, but some of that has also been on vision-language models, e.g. I'm pretty sure there's a paper on averaging together CLIP activations to get controllable generation.
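To make the "averaging is roughly one gradient step" point concrete, here is a minimal sketch. It rests on assumptions that are not in the comment above: a linear probe with no bias term, logistic loss, zero initial weights, and equally many positive and negative examples. All function names are hypothetical. Under those assumptions, a single gradient-descent step gives a vector that is exactly 0.25 times the learning rate times the difference of class means.

```python
import math
import random

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_difference_direction(pos, neg):
    """Direction found by simple averaging: mean(pos) - mean(neg)."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mu_pos, mu_neg)]

def one_step_logistic_direction(pos, neg, lr=1.0):
    """One gradient-descent step on mean logistic loss, starting from w = 0."""
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    dim = len(data[0][0])
    w = [0.0] * dim
    grad = [0.0] * dim
    for x, y in data:
        # At w = 0 every prediction is 0.5, so the gradient is a signed average of x.
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for j in range(dim):
            grad[j] += (p - y) * x[j] / len(data)
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Synthetic stand-ins for activations on "positive" and "negative" prompts.
random.seed(0)
pos = [[random.gauss(1.0, 1.0) for _ in range(4)] for _ in range(50)]
neg = [[random.gauss(-1.0, 1.0) for _ in range(4)] for _ in range(50)]

avg_dir = mean_difference_direction(pos, neg)
gd_dir = one_step_logistic_direction(pos, neg)
```

With unbalanced classes or a nonzero starting point the two directions would no longer be exactly parallel, which is why the claim is "more or less the same" rather than identical.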

So I think it's most accurate to say that you've adapted some well-explored ideas to a use case that you are particularly interested in. However, the post uses language like "Activation additions are a new way of interacting with LLMs", which seems to be claiming that this is entirely new and unexplored, and I think this could mislead readers, as for instance Thomas Kwa's response seems to suggest.

I also felt like Dan H brought up reasonable questions (e.g. why should we believe that weights vs. activations is a big deal? Why is fine-tuning vs. averaging important? Have you tried testing the difference empirically?) that haven't been answered that would be good to at least more clearly acknowledge. The fact that he was bringing up points that seemed good to me that were not being directly engaged with was what most bothered me about the exchange above.

This is my best attempt to explain where I'm coming from in about an hour of work (spent e.g. reading through things and trying to articulate intuitions in LW-friendly terms). I don't think it captures my full intuitions or the full reasons I bounced off the related work section, but hopefully it's helpful.

Comment by jsteinhardt on Steering GPT-2-XL by adding an activation vector · 2023-05-18T23:24:48.102Z · LW · GW

I'll just note that I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

I'm not sure what you mean about whether the post was "missing something important", but I do think that you should be pretty worried about LessWrong's collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he's being nitpicked by a moderator. It's not an accident that no one else is bringing these points up--it's because everyone else who has the expertise to do so has given up or judged it not worth their time, largely because of responses like the one Dan H is getting.

Comment by jsteinhardt on So, geez there's a lot of AI content these days · 2022-10-07T02:51:00.112Z · LW · GW

Here is my take: since there's so much AI content, it's not really feasible to read all of it, so in practice I read almost none of it (and consequently visit LW less frequently).

The main issue I run into is that for most posts, on a brief skim it seems like basically a thing I have thought about before. Unlike academic papers, most LW posts do not cite previous related work nor explain how what they are talking about relates to this past work. As a result, if I start to skim a post and I think it's talking about something I've seen before, I have no easy way of telling if they're (1) aware of this fact and have something new to say, (2) aware of this fact but trying to provide a better exposition, or (3) unaware of this fact and reinventing the wheel. Since I can't tell, I normally just bounce off.

I think a solution could be to have a stronger norm that posts about AI should say, and cite, what they are building on and how it relates / what is new. This would decrease the amount of content while improving its quality, and also make it easier to choose what to read. I view this as a win-win-win.

Comment by jsteinhardt on Hiring Programmers in Academia · 2022-07-25T01:59:18.485Z · LW · GW

I think this might be an overstatement. It's true that NSF tends not to fund developers, but in ML the NSF is only one of many funders (lots of faculty have grants from industry partnerships, for instance).

Comment by jsteinhardt on Personal forecasting retrospective: 2020-2022 · 2022-07-21T19:07:27.500Z · LW · GW

Thanks for writing this!

Regarding how surprise on current forecasts should factor into AI timelines, two takes I have:

 * Given that all the forecasts seem to be wrong in the "things happened faster than we expected" direction, we should probably expect HLAI to happen faster than expected as well.

 * It also seems like we should retreat more to outside views about general rates of technological progress, rather than forming a specific inside view (since the inside view seems to mostly end up being wrong).

I think a pure outside view would give a median of something like 35 years in my opinion (based on my very sketchy attempt of forming a dataset of when technical grand challenges were solved), and then ML progress seems to be happening quite quickly, so you should probably adjust down from that.

Actually pretty interested how you get to medians of 40 years, that seems longer than I'd predict without looking at any field-specific facts about ML, and then the field-specific facts mostly push towards shorter timelines.

Comment by jsteinhardt on How fast can we perform a forward pass? · 2022-06-12T15:38:46.528Z · LW · GW

Thanks! I just read over it and assuming I understood correctly, this bottleneck primarily happens for "small" operations like layer normalization and softmax, and not for large matrix multiplies. In addition, these small operations are still the minority of runtime (40% in their case). So I think this is still consistent with my analysis, which assumes various things will creep in to keep GPU utilization around 40%, but that they won't ever drive it to (say) 10%. Is this correct or have I misunderstood the nature of the bottleneck?

Edit: also maybe we're just miscommunicating--I definitely don't think CPU->HBM is a bottleneck, it's instead the time to load from HBM which sounds the same as what you said. Unless I misread the A100 specs, that comes out to 1.5TB/s, which is the number I use throughout.

Comment by jsteinhardt on How fast can we perform a forward pass? · 2022-06-11T23:12:41.664Z · LW · GW

Short answer: If future AI systems are doing R&D, it matters how quickly the R&D is happening.

Comment by jsteinhardt on How fast can we perform a forward pass? · 2022-06-11T15:48:54.064Z · LW · GW

Okay, thanks! The posts actually are written in markdown, at least on the backend, in case that helps you.

Comment by jsteinhardt on How fast can we perform a forward pass? · 2022-06-11T00:05:17.314Z · LW · GW

Question for mods (sorry if I asked this before): Is there a way to make the LaTeX render?

In theory MathJax should be enough, eg that's all I use at the original post:

Comment by jsteinhardt on Why I'm Optimistic About Near-Term AI Risk · 2022-05-17T14:07:07.579Z · LW · GW

I was surprised by this claim. To be concrete, what's your probability of xrisk conditional on 10-year timelines? Mine is something like 25% I think, and higher than my unconditional probability of xrisk.

Comment by jsteinhardt on Early 2022 Paper Round-up · 2022-04-16T01:28:40.083Z · LW · GW

Fortunately (?), I think the jury is still out on whether phase transitions happen in practice for large-scale systems. It could be that once a system is complex and large enough, it's hard for a single factor to dominate and you get smoother changes. But I think it could go either way.

Comment by jsteinhardt on Early 2022 Paper Round-up · 2022-04-16T00:30:31.832Z · LW · GW

Thanks! I pretty much agree with everything you said. This is also largely why I am excited about the work, and I think what you wrote captures it more crisply than I could have.

Comment by jsteinhardt on Buck's Shortform · 2022-04-09T17:12:11.243Z · LW · GW

Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

Comment by jsteinhardt on [RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm. · 2022-04-09T15:11:54.986Z · LW · GW

Note the answer changes a lot based on how the question is operationalized. This stronger operationalization has dates around a decade later.

Comment by jsteinhardt on More Is Different for AI · 2022-02-17T16:53:38.126Z · LW · GW

Here's a link to the version on my blog:

Comment by jsteinhardt on More Is Different for AI · 2022-02-17T16:52:26.446Z · LW · GW

Yup! That sounds great :)

Comment by jsteinhardt on More Is Different for AI · 2022-02-16T01:53:39.672Z · LW · GW

Thanks Ruby! Now that the other posts are out, would it be easy to forward-link them (by adding links to the italicized titles in the list at the end)?

Comment by jsteinhardt on ML Systems Will Have Weird Failure Modes · 2022-01-26T02:08:25.476Z · LW · GW

@Mods: Looks like the LaTeX isn't rendering. I'm not sure what the right way to do that is on LessWrong. On my website, I do it with code injection. You can see the result here, where the LaTeX all renders in MathJax:

Comment by jsteinhardt on What's Up With Confusingly Pervasive Goal Directedness? · 2022-01-23T20:29:28.849Z · LW · GW

I feel like you are arguing for a very strong claim here, which is that "as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time"

Hm, this isn't the claim I intended to make, both because it overemphasizes "efficient" and because it adds a lot of "for all" statements.

If I were trying to state my claim more clearly, it would be something like "generically, for the large majority of problems of the sort you would come across in ML, once you can distinguish good answers you can find good answers (modulo some amount of engineering work), because non-convex optimization generally works and there are a large number of techniques for solving the sparse rewards problem, which are also getting better over time".

Comment by jsteinhardt on What's Up With Confusingly Pervasive Goal Directedness? · 2022-01-23T17:52:55.221Z · LW · GW

Thanks for the push-back and the clear explanation. I still think my points hold and I'll try to explain why below.

In order to even get a single expected datapoint of approval, I need to sample 10^8 examples, which in our current sampling method would take 10^8 * 10 hours, e.g. approximately 100,000 years. I don't understand how you could do "Learning from Human Preferences" on something this sparse

This is true if all the other datapoints are entirely indistinguishable, and the only signal is "good" vs. "bad". But in practice you would compare / rank the datapoints, and move towards the ones that are better.

Take the backflip example from the human preferences paper: if your only signal was "is this a successful backflip?", then your argument would apply and it would be pretty hard to learn. But the signal is "is this more like a successful backflip than this other thing?" and this makes learning feasible.
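As a rough illustration of why comparisons give a usable signal even when no example is labeled an outright success, here is a toy sketch. It is not the method from the human preferences paper: it assumes a linear reward model, a Bradley-Terry style preference probability, and a made-up `hidden_quality` function standing in for the human judge. All names and numbers are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_reward_from_comparisons(pairs, dim, lr=0.1, epochs=50):
    """Fit a linear reward r(x) = w . x from (better, worse) pairs.

    Assumes P(better preferred) = sigmoid(r(better) - r(worse)) and does
    stochastic gradient ascent on the log-likelihood of the comparisons.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            for j in range(dim):
                w[j] += lr * (1.0 - p) * diff[j]
    return w

def hidden_quality(x):
    """Stand-in for the judge's notion of 'more like a backflip' (hypothetical)."""
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2]

def make_pairs(n, rng):
    """Sample n random pairs and order each one as (better, worse)."""
    pairs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in range(3)]
        b = [rng.uniform(-1, 1) for _ in range(3)]
        pairs.append((a, b) if hidden_quality(a) >= hidden_quality(b) else (b, a))
    return pairs

rng = random.Random(0)
train_pairs = make_pairs(300, rng)
test_pairs = make_pairs(200, rng)

w = fit_reward_from_comparisons(train_pairs, dim=3)

def reward(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Fraction of held-out pairs that the learned reward orders correctly.
accuracy = sum(reward(b) > reward(c) for b, c in test_pairs) / len(test_pairs)
```

Every training pair contributes gradient signal here, whereas a binary "was this a success?" label would be almost always zero when successes are rare.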

More generally, I feel that the thing I'm arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.

I should note that comparisons are only one reason why the situation isn't as bad as you say. Another is that even with only non-approved data points to label, you could do things like label "which part" of the plan is non-approved. And with very sophisticated AI systems, you could ask them to predict which plans would be approved/non-approved, even if they don't have explicit examples, simply by modeling the human approvers very accurately in general.

I feel even beyond that, this still assumes that the reason it is proposing a "good" plan is pure noise, and not the result of any underlying bias that is actually costly to replace.

When you say "costly to replace", this is with respect to what cost function? Do you have in mind the system's original training objective, or something else?

If you have an original cost function F(x) and an approval cost A(x), you can minimize F(x) + c * A(x), increasing the weight on c until it pays enough attention to A(x). For an appropriate choice of c, this is (approximately) equivalent to asking "Find the most approved policy such that F(x) is below some threshold"--more generally, varying c will trace out the Pareto boundary between F and A.
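Here is a small sketch of that sweep over c, using a made-up finite set of candidate policies where each one is just a pair (F, A). The candidate values and function names are illustrative assumptions, not anything from the discussion above.

```python
def scalarized_choice(candidates, c):
    """Return the (F, A) pair minimizing F + c * A."""
    return min(candidates, key=lambda fa: fa[0] + c * fa[1])

def is_pareto_optimal(point, candidates):
    """True if no other candidate is at least as good on both costs and better on one."""
    f, a = point
    return not any(
        (f2 <= f and a2 <= a) and (f2 < f or a2 < a) for f2, a2 in candidates
    )

# Hypothetical candidates: (original cost F, approval cost A); lower is better for both.
candidates = [(1.0, 9.0), (2.0, 5.0), (4.0, 2.0), (8.0, 1.0), (6.0, 6.0), (3.0, 8.0)]

# Increasing c puts more weight on the approval cost A.
weights = (0.0, 0.1, 0.5, 1.0, 3.0, 10.0)
sweep = [scalarized_choice(candidates, c) for c in weights]
# sweep == [(1.0, 9.0), (1.0, 9.0), (2.0, 5.0), (4.0, 2.0), (4.0, 2.0), (8.0, 1.0)]
```

Each selected point is Pareto-optimal, and A falls as c rises. One caveat: a weighted sum can only reach Pareto points on the convex part of the frontier, which is one reason the equivalence with the thresholded problem is only approximate.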

so even if we get within 33 bits (which I do think seems unlikely)

Yeah, I agree 33 bits would be way too optimistic. My 50% CI is somewhere between 1,000 and 100,000 bits needed. It just seems unlikely to me that you'd be able to generate, say, 100 bits but then run into a fundamental obstacle after that (as opposed to an engineering / cost obstacle).

Like, I feel like... this is literally a substantial part of the P vs. NP problem, and I can't just assume my algorithm just like finds efficient solution to arbitrary NP-hard problems.

I don't think the P vs. NP analogy is a good one here, for a few reasons:

 * The problems you're talking about above are statistical issues (you're saying you can't get any statistical signal), while P vs. NP is a computational question.

 * In general, I think P vs. NP is a bad fit for ML. Invoking related intuitions would have led you astray over the past decade--for instance, predicting that neural networks should not perform well because they are solving a problem (non-convex optimization) that is NP-hard in the worst case.

Comment by jsteinhardt on What's Up With Confusingly Pervasive Goal Directedness? · 2022-01-23T00:30:13.810Z · LW · GW

This would imply a fixed upper bound on the number of bits you can produce (for instance, a false negative rate of 1 in 128 implies at most 7 bits). But in practice you can produce many more than 7 bits, by double checking your answer, combining multiple sources of information, etc.
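The arithmetic behind this can be sketched as follows, under the simplifying assumption (mine, for illustration) that repeated checks make independent errors. Real checks are usually correlated, so this is an optimistic bound. The function names are hypothetical.

```python
import math

def bits_from_error_rate(error_rate):
    """Bits of selection a single check can supply: log2(1 / error_rate)."""
    return math.log2(1.0 / error_rate)

def bits_from_independent_checks(error_rate, num_checks):
    """Bits supplied by several checks, assuming their errors are independent."""
    combined_error = error_rate ** num_checks
    return math.log2(1.0 / combined_error)

single = bits_from_error_rate(1 / 128)              # 7 bits
double = bits_from_independent_checks(1 / 128, 2)   # 14 bits
```

So a check that errs 1 time in 128 caps out at 7 bits on its own, while two such independent checks would give 14.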

Comment by jsteinhardt on Anchor Weights for ML · 2022-01-22T21:00:01.800Z · LW · GW

Maybe, but I think some people would disagree strongly with this list even in the abstract (putting almost no weight on Current ML, or putting way more weight on humans, or something else). I agree that it's better to drill down into concrete disagreements, but I think right now there are implicit strong disagreements that are not always being made explicit, and this is a quick way to draw them out.

Comment by jsteinhardt on What's Up With Confusingly Pervasive Goal Directedness? · 2022-01-22T20:58:04.420Z · LW · GW

Basically the same techniques as in Deep Reinforcement Learning from Human Preferences and the follow-ups--train a neural network model to imitate your judgments, then chain it together with RL.

I think current versions of that technique could easily give you 33 bits of information--although as noted elsewhere, the actual numbers of bits you need might be much larger than that, but the techniques are getting better over time as well.

Comment by jsteinhardt on What's Up With Confusingly Pervasive Goal Directedness? · 2022-01-21T06:38:51.411Z · LW · GW

Yes, I think I understand that more powerful optimizers can find more spurious solutions. But the OP seemed to be hypothesizing that you had some way to pick out the spurious from the good solutions, but saying it won't scale because you have 10^50, not 100, bad solutions for each good one. That's the part that seems wrong to me.

Comment by jsteinhardt on What's Up With Confusingly Pervasive Goal Directedness? · 2022-01-21T03:17:41.481Z · LW · GW

I'm not sure I understand why it's important that the fraction of good plans is 1% vs .00000001%. If you have any method for distinguishing good from bad plans, you can chain it with an optimizer to find good plans even if they're rare. The main difficulty is generating enough bits--but in that light, the numbers I gave above are 7 vs 33 bits--not a clear qualitative difference. And in general I'd be kind of surprised if you could get up to say 50 bits but then ran into a fundamental obstacle in scaling up further.
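For reference, the 7 and 33 bit figures are just the base-2 logarithm of how rare good plans are. A quick check, with a hypothetical helper name:

```python
import math

def bits_to_select(fraction_good):
    """Bits of optimization needed to pick out a good plan: -log2(fraction)."""
    return -math.log2(fraction_good)

bits_one_percent = bits_to_select(0.01)   # 1% of plans are good: about 6.6 bits
bits_rare = bits_to_select(1e-10)         # 0.00000001% of plans are good: about 33.2 bits
```

Rounding up gives roughly 7 and 33 bits respectively.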

Comment by jsteinhardt on Thought Experiments Provide a Third Anchor · 2022-01-19T23:26:11.928Z · LW · GW

Thanks! Yes, this makes very similar points :) And from 4 years ago!

Comment by jsteinhardt on Thought Experiments Provide a Third Anchor · 2022-01-19T03:38:57.023Z · LW · GW

The fear of anthropomorphising AI is one of the more ridiculous traditional mental blindspots in the LW/rationalist sphere.

You're really going to love Thursday's post :).

Jokes aside, I actually am not sure LW is that against anthropomorphising. It seems like a much stronger injunction among ML researchers than it is on this forum.

I personally am not very into using humans as a reference class because it is a reference class with a single data point, whereas e.g. "complex systems" has a much larger number of data points.

In addition, it seems like intuition about how humans behave is already pretty baked in to how we think about intelligent agents, so I'd guess by default we overweight it and have to consciously get ourselves to consider other anchors.

I would agree that it's better to do this by explicitly proposing additional anchors, rather than never talking about humans.

Comment by jsteinhardt on Future ML Systems Will Be Qualitatively Different · 2022-01-14T00:00:34.651Z · LW · GW

Okay I think I get what you're saying now--more SGD steps should increase "effective model capacity", so per the double descent intuition we should expect the validation loss to first increase then decrease (as is indeed observed). Is that right?

Comment by jsteinhardt on Future ML Systems Will Be Qualitatively Different · 2022-01-13T23:56:48.902Z · LW · GW

> But if you keep training, GD should eventually find a low complexity high test scoring solution - if one exists - because those solutions have an even higher score (with some appropriate regularization term). Obviously much depends on the overparameterization and relative reg term strength - if it's too strong GD may fail or at least appear to fail as it skips the easier high complexity solution stage. I thought that explanation of grokking was pretty clear.

I think I'm still not understanding. Shouldn't the implicit regularization strength of SGD be higher, not lower, for fewer iterations? So running it longer should give you a higher-complexity, not a lower-complexity solution. (Although it's less clear how this intuition pans out once you already have very low training loss, maybe you're saying that double descent somehow kicks in there?)

Comment by jsteinhardt on Future ML Systems Will Be Qualitatively Different · 2022-01-13T20:24:55.377Z · LW · GW

I'm not sure I get what the relation would be--double descent is usually with respect to the model size (vs. amount of data), although there is some work on double descent vs. number of training iterations e.g. But I don't immediately see how to connect this to grokking.

(I agree they might be connected, I'm just saying I don't see how to show this. I'm very interested in models that can explain grokking, so if you have ideas let me know!)

Comment by jsteinhardt on Future ML Systems Will Be Qualitatively Different · 2022-01-13T20:07:08.974Z · LW · GW

I don't think it's inferior -- I think the two have contrasting strengths and limitations. I think the default view in ML would be to use 95% empiricism, 5% philosophy when making predictions, and I'd advocate for more like 50/50, depending on your overall inclinations (I'm 70-30 since I love data, and I think 30-70 is also reasonable, but I think neither 95-5 nor 5-95 would be justifiable).

I'm curious what in the post makes you think I'm claiming philosophy is superior. I wrote this:

> Confronting emergence will require adopting mindsets that are less familiar to most ML researchers and utilizing more of the Philosophy worldview (in tandem with Engineering and other worldviews).

This was intended to avoid making a claim of superiority in either direction.

Comment by jsteinhardt on San Francisco shares COVID data only when it's too late · 2021-12-27T17:10:41.793Z · LW · GW

Also my personal take is that SF, on a pure scientific/data basis, has had one of the best responses in the nation, probably benefiting from having UCSF for in-house expertise. (I'm less enthusiastic about the political response--I think we erred way too far on the "take no risks" side, and like everyone else prioritized restaurants over schools which seems like a clear mistake. But on the data front I feel like you're attacking one of the singularly most reasonable counties in the U.S.)

Comment by jsteinhardt on San Francisco shares COVID data only when it's too late · 2021-12-27T17:06:54.071Z · LW · GW

It seems like the main alternative would be to have something like Alameda County's reporting, which has a couple days fewer lag at the expense of less quality control:

It's really unclear to me that Alameda's data is more informative than SF's. (In fact I'd say it's the opposite--I tend to look at SF over Alameda even though I live in Alameda County.)

I think there is some information lost in SF's presentation, but it's generally less information lost than most alternatives on the market. SF is also backdating the data to when the tests were actually performed, thus being transparent about the fact that most test data is about what happened several days ago. Websites that claim to give you more up-to-date information are not actually doing so, they're just hiding this fact.

If you looked at the next 4 days in the time series it would probably look something like: 500, 200, 100, 100. Not because Omicron is abating but because most tests taken in the last 4 days haven't had time to be processed and recorded. I think if I was careful I could squeeze a small amount of information out of those numbers (e.g. based on whether the 500 was actually 400 or 600) but it would require a lot of work. I tried this in the past when working with some public health researchers and it's surprisingly hard to not fool yourself into thinking that cases are going down again when it's actually reporting lag.
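A toy sketch of that reporting-lag effect. All numbers are hypothetical: it assumes a flat true count of 1000 positive tests per day and made-up shares of tests recorded so far, chosen only so that the reported series matches the 500, 200, 100, 100 pattern above.

```python
# Hypothetical true positives by test date, oldest day first (flat trend).
true_cases = [1000, 1000, 1000, 1000]

# Hypothetical percent of each day's tests processed and recorded so far.
# The most recent days have had the least time to be processed.
percent_recorded = [50, 20, 10, 10]

# What a dashboard that backdates to the test date would show today.
reported = [c * p // 100 for c, p in zip(true_cases, percent_recorded)]
print(reported)  # [500, 200, 100, 100]

def lag_corrected(count: int, percent: int) -> int:
    """Naive correction: scale a reported count up by its assumed completeness."""
    return count * 100 // percent

# The corrected estimate for the oldest of the four days depends heavily on
# whether the reported 500 "was actually 400 or 600".
print([lag_corrected(c, 50) for c in (400, 500, 600)])  # [800, 1000, 1200]
```

The reported series falls steeply even though the assumed true series is flat, and the corrected estimate is only as good as the completeness guess, which is part of why it is easy to fool yourself here.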

Comment by jsteinhardt on Worst-case thinking in AI alignment · 2021-12-27T16:09:41.369Z · LW · GW

Finding the min-max solution might be easier, but what we actually care about is an acceptable solution. My point is that the min-max solution, in most cases, will be unacceptably bad.

And in fact, since min_x f(theta,x) <= E_x[f(theta,x)], any solution that is acceptable in the worst case is also acceptable in the average case.
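A small numeric check of that inequality, as a sketch under stated assumptions: `f` is treated as a score where higher is better, and the particular `f`, `theta`, sample of `x` values, and `threshold` below are all hypothetical.

```python
import random

def f(theta: float, x: float) -> float:
    """Hypothetical score of solution `theta` in situation `x` (higher is better)."""
    return -(theta - x) ** 2

def worst_case(theta: float, xs: list) -> float:
    """min_x f(theta, x) over the sampled situations."""
    return min(f(theta, x) for x in xs)

def average_case(theta: float, xs: list) -> float:
    """E_x[f(theta, x)] under the empirical distribution of the samples."""
    return sum(f(theta, x) for x in xs) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
theta = 0.3
threshold = -2.0  # hypothetical bar for "acceptable"

worst = worst_case(theta, xs)
avg = average_case(theta, xs)

print(worst <= avg)       # True: the minimum never exceeds the mean
print(worst >= threshold) # True: acceptable in the worst case...
print(avg >= threshold)   # True: ...hence acceptable on average too
```

The implication only runs one way: a solution can clear the threshold on average while still failing it in the worst case.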

Comment by jsteinhardt on Worst-case thinking in AI alignment · 2021-12-24T17:27:31.877Z · LW · GW

Thanks! I appreciated these distinctions. The worst-case argument for modularity came up in a past argument I had with Eliezer, where I argued that this was a reason for randomization (even though Bayesian decision theory implies you should never randomize). See section 2 here: The Power of Noise.

Re: 50% vs. 10% vs. 90%. I liked this illustration, although I don't think your argument actually implies 50% specifically. For instance if it turns out that everyone else is working on the 50% worlds and no one is working on the 90% worlds, you should probably work on the 90% worlds. In addition:

 * It seems pretty plausible that the problem is overall more tractable in 10% worlds than 50% worlds, so given equal neglectedness you would prefer the 10% world.

 * Many ideas will generalize across worlds, and recruitment / skill-building / organization-building also generalizes across worlds. This is an argument towards working on problems that seem tractable and relevant to any world, as long as they are neglected enough that you are building out distinct ideas and organizational capacity (vs. just picking from the same tree as ML generally). I don't think that this argument dominates considerations, but it likely explains some of our differences in approach.

In the terms laid out in your post, I think my biggest functional disagreement (in terms of how it affects what problems we work on) is that I expect most worst-case assumptions make the problem entirely impossible, and I am more optimistic that many empirically-grounded assumptions will generalize quite far, all the way to AGI. To be clear, I am not against all worst-case assumptions (for instance my entire PhD thesis is about this) but I do think they are usually a source of significant added difficulty and one has to be fairly careful about where one makes them.

For instance, as regards Redwood's project, I expect making language models fully adversarially robust is impossible with currently accessible techniques, and that even a fairly restricted adversary will be impossible to defend against while maintaining good test accuracy. On the other hand I am still pretty excited about Redwood's project because I think you will learn interesting things by trying. (I spent some time trying to solve the unrestricted adversarial example competition, totally failed, but still felt it was a good use of time for similar reasons, and the difficulties for language models seem interestingly distinct in a way that should generate additional insight.) I'm actually not sure if this differs that much from your beliefs, though.

Comment by jsteinhardt on Worst-case thinking in AI alignment · 2021-12-24T17:10:08.339Z · LW · GW

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

One keyword for this is "partial specification", e.g. here is a paper I wrote that makes a minimal set of statistical assumptions and worst-case assumptions everywhere else: (Unfortunately the statistical assumptions are not really reasonable so the method was way too brittle in practice.) This kind of idea is also common in robust statistics. But my take would not be that it is simpler--in general it is way harder than just working with the empirical distribution in front of you.

Comment by jsteinhardt on Understanding and controlling auto-induced distributional shift · 2021-12-14T04:38:05.232Z · LW · GW

Cool paper! One brief comment is this seems closely related to performative prediction and it seems worth discussing the relationship.

Edit: just realized this is a review, not a new paper, so my comment is a bit less relevant. Although it does still seem like a useful connection to make.

Comment by jsteinhardt on Base Rates and Reference Classes · 2021-11-25T05:54:42.209Z · LW · GW

Oh okay got it! It looks like the behavior is as intended, but one downside from my perspective is that the blog link is not very visually prominent as is--I would expect most readers to not notice it. I care about this mostly because I would like more people to know about my blog's existence, and I think it could be fixed if there was the option to add a small avatar next to the blog name to make it more visually prominent (I could imagine lots of other fixes too but just throwing a concrete one out there).

On a separate note it looks like the LaTeX is not rendering in the post: I used to go in and out of math mode, but I'm not sure the LW editor parses that. (My blog embeds a JavaScript header that loads MathJax but I assume that is not loaded with the automatic crossposting.)