Posts

Goal oriented cognition in "a single forward pass" 2024-04-22T05:03:18.649Z
Express interest in an "FHI of the West" 2024-04-18T03:32:58.592Z
Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information 2024-04-11T18:35:44.824Z
LessWrong's (first) album: I Have Been A Good Bing 2024-04-01T07:33:45.242Z
How useful is "AI Control" as a framing on AI X-Risk? 2024-03-14T18:06:30.459Z
Open Thread Spring 2024 2024-03-11T19:17:23.833Z
Is a random box of gas predictable after 20 seconds? 2024-01-24T23:00:53.184Z
Will quantum randomness affect the 2028 election? 2024-01-24T22:54:30.800Z
Vote in the LessWrong review! (LW 2022 Review voting phase) 2024-01-17T07:22:17.921Z
AI Impacts 2023 Expert Survey on Progress in AI 2024-01-05T19:42:17.226Z
Originality vs. Correctness 2023-12-06T18:51:49.531Z
The LessWrong 2022 Review 2023-12-05T04:00:00.000Z
Open Thread – Winter 2023/2024 2023-12-04T22:59:49.957Z
Complex systems research as a field (and its relevance to AI Alignment) 2023-12-01T22:10:25.801Z
How useful is mechanistic interpretability? 2023-12-01T02:54:53.488Z
My techno-optimism [By Vitalik Buterin] 2023-11-27T23:53:35.859Z
"Epistemic range of motion" and LessWrong moderation 2023-11-27T21:58:40.834Z
Debate helps supervise human experts [Paper] 2023-11-17T05:25:17.030Z
How much to update on recent AI governance moves? 2023-11-16T23:46:01.601Z
AI Timelines 2023-11-10T05:28:24.841Z
How to (hopefully ethically) make money off of AGI 2023-11-06T23:35:16.476Z
Integrity in AI Governance and Advocacy 2023-11-03T19:52:33.180Z
What's up with "Responsible Scaling Policies"? 2023-10-29T04:17:07.839Z
Trying to understand John Wentworth's research agenda 2023-10-20T00:05:40.929Z
Trying to deconfuse some core AI x-risk problems 2023-10-17T18:36:56.189Z
How should TurnTrout handle his DeepMind equity situation? 2023-10-16T18:25:38.895Z
The Lighthaven Campus is open for bookings 2023-09-30T01:08:12.664Z
Navigating an ecosystem that might or might not be bad for the world 2023-09-15T23:58:00.389Z
Long-Term Future Fund Ask Us Anything (September 2023) 2023-08-31T00:28:13.953Z
Open Thread - August 2023 2023-08-09T03:52:55.729Z
Long-Term Future Fund: April 2023 grant recommendations 2023-08-02T07:54:49.083Z
Final Lightspeed Grants coworking/office hours before the application deadline 2023-07-05T06:03:37.649Z
Correctly Calibrated Trust 2023-06-24T19:48:05.702Z
My tentative best guess on how EAs and Rationalists sometimes turn crazy 2023-06-21T04:11:28.518Z
Lightcone Infrastructure/LessWrong is looking for funding 2023-06-14T04:45:53.425Z
Launching Lightspeed Grants (Apply by July 6th) 2023-06-07T02:53:29.227Z
Yoshua Bengio argues for tool-AI and to ban "executive-AI" 2023-05-09T00:13:08.719Z
Open & Welcome Thread – April 2023 2023-04-10T06:36:03.545Z
Shutting Down the Lightcone Offices 2023-03-14T22:47:51.539Z
Review AI Alignment posts to help figure out how to make a proper AI Alignment review 2023-01-10T00:19:23.503Z
Kurzgesagt – The Last Human (Youtube) 2022-06-29T03:28:44.213Z
Replacing Karma with Good Heart Tokens (Worth $1!) 2022-04-01T09:31:34.332Z
Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22] 2021-11-03T18:22:58.879Z
The LessWrong Team is now Lightcone Infrastructure, come work with us! 2021-10-01T01:20:33.411Z
Welcome & FAQ! 2021-08-24T20:14:21.161Z
Berkeley, CA – ACX Meetups Everywhere 2021 2021-08-23T08:50:51.898Z
The Death of Behavioral Economics 2021-08-22T22:39:12.697Z
Open and Welcome Thread – August 2021 2021-08-15T05:59:05.270Z
Open and Welcome Thread – July 2021 2021-07-03T19:53:07.048Z
Open and Welcome Thread – June 2021 2021-06-06T02:20:22.421Z

Comments

Comment by habryka (habryka4) on Please stop publishing ideas/insights/research about AI · 2024-05-03T01:05:41.473Z · LW · GW

Sorry, what? I thought the fear was that we don't know how to make helpful AI at all. (And that people who think they're being helped by seductively helpful-sounding LLM assistants are being misled by surface appearances; the shoggoth underneath has its own desires that we won't like when it's powerful enough to pursue them autonomously.) In contrast, this almost makes it sound like you think it is plausible to align AI to its user's intent, but that this would be bad if the users aren't one of "us"—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.

My steelman of this (though to be clear I think your comment makes good points): 

There is a large difference between a system being helpful and a system being aligned. Ultimately AI existential risk is a coordination problem where I expect catastrophic consequences because a bunch of people want to build AGI without making it safe. Therefore making technologies that in a naive and short-term sense just help AGI developers build whatever they want to build will have bad consequences. If I trusted everyone to use their intelligence just for good things, we wouldn't have anthropogenic existential risk on our hands.

Some of those technologies might end up useful for also getting the AI to be more properly aligned, or maybe to help with work that reduces the risk of AI catastrophe some other way, though my current sense is that kind of work is pretty different and doesn't benefit remotely as much from generically locally-helpful AI.

In general I feel pretty sad about conflating "alignment" with "short-term intent alignment". I think the two problems are related but have crucial differences; I don't think the latter generalizes that well to the former (for all the usual sycophancy/treacherous-turn reasons), and indeed progress on the latter IMO mostly makes the world marginally worse, because the thing it is most likely to be used for is developing existentially dangerous AI systems faster.

Edit: Another really important dimension to model here is not just the effect of this kind of research on what individual researchers will do, but the effect it will have on what the market wants to invest in. My standard story of doom is centrally rooted in there being very strong short-term individual economic incentives to build more capable AGI, enabling people to make billions to trillions of dollars, while the downside risk is a distributed negative externality that is not at all priced into the costs of AI development. Developing applications of AI that make a lot of money without accounting for the negative extinction externalities can therefore be really quite bad for the world.

Comment by habryka (habryka4) on Goal oriented cognition in "a single forward pass" · 2024-05-03T00:24:56.968Z · LW · GW

Hmm, I think the first bullet point is pretty precisely what I am talking about (though to be clear, I haven't read the paper in detail). 

I was specifically saying that trying to somehow get feedback from future tokens into the next token objective would probably do some interesting things and enable a bunch of cross-token optimization that currently isn't happening, which would improve performance on some tasks. This seems like what's going on here.

Agree that another major component of the paper is accelerating inference, which I wasn't talking about. I would have to read the paper in more detail to get a sense of how much it's just doing that, in which case I wouldn't think it's a good example.

Comment by habryka (habryka4) on AI #62: Too Soon to Tell · 2024-05-02T18:33:01.803Z · LW · GW

Oh no, I wonder what happened. Re-importing it right now.

Comment by habryka (habryka4) on Goal oriented cognition in "a single forward pass" · 2024-05-01T22:45:25.073Z · LW · GW

@johnswentworth I think this paper basically does the thing I was talking about (with pretty impressive results), though I haven't read it in a ton of detail: https://news.ycombinator.com/item?id=40220851 

Comment by habryka (habryka4) on metachirality's Shortform · 2024-05-01T20:16:57.241Z · LW · GW

You can! Just go to the all-posts page, sort by year, and the highest-rated shortform posts for each year will be in the Quick Takes section: 

2024: 

2023: 

2022: 

Comment by habryka (habryka4) on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-01T18:09:01.638Z · LW · GW

Promoted to curated: Formalizing what it means for transformers to learn "the underlying world model" when engaging in next-token prediction tasks seems pretty useful, in that it's an abstraction I see used all the time when discussing risks from models where the vast majority of the compute was spent in pre-training, but where the details usually get handwaved. It seems useful to understand what exactly we mean by that in more detail.

I have not done a thorough review of this kind of work, but it seems to me that others also thought the basic ideas in the work hold up, and reading this post gave me crisper abstractions to talk about this kind of stuff in the future.

Comment by habryka4 on [deleted post] 2024-04-30T21:48:43.702Z

Don't really think this makes sense as a tag page. Too subjective.

Comment by habryka (habryka4) on Open Thread Spring 2024 · 2024-04-30T16:54:37.786Z · LW · GW

Three is a bit much. I am honestly not sure what's better. My guess is putting them all into one. (Context: I am one of the LTFF fund managers.)

Comment by habryka (habryka4) on Open Thread Spring 2024 · 2024-04-29T19:50:33.911Z · LW · GW

Yeah, I feel kind of excited about having some strong-downvote and strong-upvote UI which gives you one of a standard set of options for explaining your vote, or allows you to leave it unexplained, all anonymous.

Comment by habryka (habryka4) on D&D.Sci · 2024-04-28T21:56:48.670Z · LW · GW

I edited the top-comment to do that.

Comment by habryka (habryka4) on LessOnline (May 31—June 2, Berkeley, CA) · 2024-04-28T16:57:22.461Z · LW · GW

We'll send out location details to anyone who buys a ticket (and also feel free to ping us and we'll tell you).

I've had some experience with people trying to disrupt events, and the trivial inconvenience of having to figure out the address makes a non-negligible difference in whether people do stuff like that.

Comment by habryka (habryka4) on AI Safety Sphere · 2024-04-27T18:22:33.831Z · LW · GW

I tried setting up an account, but it just told me it had sent me an email to confirm my account that never arrived.

Comment by habryka (habryka4) on Take the wheel, Shoggoth! (Lesswrong is trying out changes to the frontpage algorithm) · 2024-04-26T21:17:13.545Z · LW · GW

GDPR is a giant mess, so it's pretty unclear what it requires us to implement. My current understanding is that it just requires us to tell you that we are collecting analytics data if you are from the EU. 

And the kind of stuff we are sending over to Recombee would count as data necessary to provide site functionality, not just analytics, so it wouldn't be covered by that. (If you want to avoid data being sent to Google Analytics in particular, you can do that by just blocking the GA script in uBlock Origin or whatever other adblocker you use, which it should do by default.)

Comment by habryka (habryka4) on Take the wheel, Shoggoth! (Lesswrong is trying out changes to the frontpage algorithm) · 2024-04-26T21:01:07.430Z · LW · GW

I am pretty excited about doing something more in-house, but it's much easier to get data about how promising this direction is by using some third-party services that already have all the infrastructure. 

If it turns out to be a core part of LW, it makes more sense to in-house it. It's also really valuable to have a relatively validated baseline to compare things to.

There are a bunch of third-party services we send user data to that we couldn't really replace: Hex.tech as our analytics dashboard service, Google Analytics for basic user behavior and patterns, and a bunch of AWS services. Implementing the functionality of all of that ourselves, or putting a bunch of effort into anonymizing the data, is not impossible, but seems pretty hard, and Recombee seems about par for the degree to which I trust these services not to do anything with that data themselves.

Comment by habryka (habryka4) on This is Water by David Foster Wallace · 2024-04-24T23:44:08.790Z · LW · GW

Mod note: I clarified the opening note a bit more, to make the start and nature of the essay more clear.

Comment by habryka (habryka4) on The Best Tacit Knowledge Videos on Every Subject · 2024-04-23T18:21:43.662Z · LW · GW

If you have recommendations, post them! I doubt the author tried to filter the subjects very much by "book subjects"; it's just the subjects people seem to have found good videos for so far.

Comment by habryka (habryka4) on Open Thread Spring 2024 · 2024-04-22T21:08:26.706Z · LW · GW

This probably should be made more transparent, but the reason these aren't in the Library is that they don't have images for the sequence item. We display all sequences that people create that have proper images in the Library (otherwise we just show them on users' profiles).

Comment by habryka (habryka4) on Goal oriented cognition in "a single forward pass" · 2024-04-22T21:07:28.275Z · LW · GW

I think this just doesn't work very well, because it incentivizes the model to output a token which makes subsequent tokens easier to predict, as long as the benefit in predictability of the subsequent token(s) outweighs the cost of the first token.

Hmm, this doesn't sound right. The ground truth data would still be the same, so if you were to predict "aaaaaa" you would get the answer wrong. In the above example, you are presumably querying the log probs of the model that was trained on 1-token prediction, which of course would think it's quite likely that conditional on the last 10 characters being "a" the next one will be "a", but I am saying "what is the probability of the full completion 'a a a a a...' given the prefix 'Once upon a time, there was a'", which doesn't seem very high.
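To illustrate the quantity I have in mind, here is a minimal sketch (my own, not from the dialogue; it assumes a Hugging-Face-style causal LM whose forward pass returns .logits, and the function name is made up) of computing the joint log-probability of a whole completion given a prefix, rather than a single next-token probability:

```python
import torch
import torch.nn.functional as F

def completion_logprob(model, prefix_ids, completion_ids):
    """Joint log-probability log P(completion | prefix) under a causal LM."""
    input_ids = torch.cat([prefix_ids, completion_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]         # (seq_len, vocab)
    log_probs = F.log_softmax(logits[:-1], dim=-1)  # position i predicts token i+1
    targets = input_ids[0, 1:]                      # shifted targets
    start = prefix_ids.shape[0] - 1                 # first completion target
    token_lps = log_probs[start:].gather(1, targets[start:, None]).squeeze(1)
    return token_lps.sum()
```

The joint score includes the (unlikely) first step of a degenerate completion like repeated "a"s, which is why it comes out low even though each later "a" is individually easy to predict.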

The only thing I am saying here is "force the model to predict more than one token at a time, conditioning on its past responses, then evaluate the model on performance of the whole set of tokens". I didn't think super hard about what the best loss function here is, and whether you would have to whip out PPO for this.  Seems plausible.

Comment by habryka (habryka4) on Goal oriented cognition in "a single forward pass" · 2024-04-22T16:53:01.772Z · LW · GW

Yeah, I was indeed confused, sorry. I edited out the relevant section of the dialogue and replaced it with the correct relevant point (the aside here didn't matter because a somewhat stronger condition is true, which is that during training we always just condition on the right answer, instead of conditioning on the model's own output for the next token in the training set).

In autoregressive transformers an order is imposed by masking, but all later tokens attend to all earlier tokens in the same way. 

Yeah, the masking is what threw me off. I was trying to think about whether any information would flow from the internal representations used to predict the second token to predicting the third token, and indeed, if you were to backpropagate the error after each specific token prediction, then there would be some information from predicting the second token available to predicting the third token (via the updated weights).

However, batch sizes also make this inapplicable (I think you would basically never do a backpropagation after each token; that would kind of get rid of the whole benefit of parallel training), and even without that, the amount of relevant information flowing this way would be minuscule, and there wouldn't be any learning going on for how this information flows.
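For concreteness, a minimal sketch (my own illustration, assuming a PyTorch-style setup) of the causal mask in question, and of why no per-token weight update happens within a batch:

```python
import torch

# Causal mask: position i may attend to positions 0..i, and every later
# token attends to the earlier tokens in the same way.
seq_len = 6
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# In standard training, the losses for all positions are computed in one
# forward pass and summed into a single scalar before one optimizer step,
# so there is no "updated weights" channel within a batch through which
# predicting token 2 could inform the prediction of token 3.
```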

Comment by habryka (habryka4) on Goal oriented cognition in "a single forward pass" · 2024-04-22T15:45:45.666Z · LW · GW

I reference this in this section:

I do think saying "the system is just predicting one token at a time" is wrong, but I guess the way the work a transformer puts into token N gets rewarded or punished when it predicts token N + M feels really weird and confusing to me and still like it can be summarized much more as "it's taking one token at a time" than "it's doing reasoning across the whole context".

IIRC, at least for a standard transformer (which may have been modified by the recent context-length extensions), the gradients only flow through a subset of the weights (for a token halfway through the context, the gradients flow through half the weights that were responsible for the first token).

Comment by habryka (habryka4) on Goal oriented cognition in "a single forward pass" · 2024-04-22T06:16:25.652Z · LW · GW

I think you are talking about a different probability distribution here.

You are right that this allows you to sample non-greedily from the learned distribution over text, but I was talking about the inductive biases on the model. 

My claim was that the way LLMs are trained, the way the inductive biases shake out is that the LLM won't be incentivized to output tokens that predictably have low probability but make it easier to predict future tokens (by, for example, in the process of trying to predict a proof, reminding itself of all of the things it knows before those things leave its context window, or, when doing an addition that it can't handle in a single forward pass, outputting a token that's optimized to give itself enough serial depth to perform the full addition of two long n-digit numbers, which would then allow it to get the next n tokens right and so overall achieve lower joint loss).

Comment by habryka (habryka4) on What's with all the bans recently? · 2024-04-21T01:22:21.413Z · LW · GW

Yeah, I am also not seeing anything. Maybe it was something temporary, but I thought we had set it up to leave a trace if any automatic rate limits got applied in the past. 

Curious what symptom Nora observed (GreaterWrong has been having some problems with rate-limit warnings that I've been confused by, so I can imagine that looking like a rate-limit from our side).

Comment by habryka (habryka4) on [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate · 2024-04-19T02:36:33.263Z · LW · GW

[Mod note: I edited out some of the meta commentary from the beginning for this curation. In general, for link posts I have a relatively low bar for editing things unilaterally, though I of course would never want to misrepresent what an author said.]

Comment by habryka (habryka4) on Express interest in an "FHI of the West" · 2024-04-19T01:05:57.015Z · LW · GW

To what extent would the organization be factoring in transformative AI timelines? It seems to me like the kinds of questions one would prioritize in a "normal period" look very different than the kinds of questions that one would prioritize if they place non-trivial probability on "AI may kill everyone in <10 years" or "AI may become better than humans on nearly all cognitive tasks in <10 years."

My guess is a lot, because the future of humanity sure depends on the details of how AI goes. But I do think I would want the primary optimization criterion of such an organization to be truth-seeking, and I would want it to have quite strong norms and guardrails against anything that would trade off communicating truths against making a short-term impact and gaining power.

One example of a thing I would do very differently from FHI (and a thing I talked with Bostrom about somewhat recently, where we seemed to agree) is that with the world moving faster and more things happening, you really want to focus on faster OODA loops in your truth-seeking institutions.

This suggests that instead of publishing books, or going through month-long academic review processes, you want to move more towards things like blogposts and comments, and maybe in the limit even towards things like live panels where you analyze things right as they happen.

I do think there are lots of failure modes around becoming too news-focused (and e.g. on LW we do a lot of things to not become too news-focused), so I think this is a dangerous balance, but it's one of the things I think I would do pretty differently, and which depends on transformative AI timelines.

To comment a bit more on the power stuff: a thing I am quite worried about is that as more stuff happens more quickly with AI, people will feel a strong temptation to trade some of the epistemic trust they have built with others for resources that they can deploy directly under their control, because as more things happen it's harder to feel in control, and by just getting more resources directly under your control (as opposed to trying to improve the decisions of others by discovering and communicating important truths) you can regain some of that feeling of control. That is one dynamic I would really like to avoid with any organization like this: I would like it to continue to have a stance towards the world that is about improving sanity, and not about getting resources for itself and its allies.

Comment by habryka (habryka4) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-18T16:55:47.241Z · LW · GW

Do you have quick links for the elliptic curve backdoor and/or any ground-breaking work in computer security that NIST has performed?

Comment by habryka (habryka4) on Express interest in an "FHI of the West" · 2024-04-18T16:30:10.981Z · LW · GW

Generally agree with most things in this comment. To be clear, I have been thinking about doing something in this space for many years, internally referring to it as creating an "FHI of the West", and while I do think the need for this is increased by FHI disappearing, I was never thinking about this as a clone of FHI, but was always expecting very substantial differences (due to differences in culture, skills, and broader circumstances in the world, some of which you characterize above).

I wrote this post mostly because, with the death of FHI, it seemed to me that there might be a spark of energy and collective attention that would be good to capture right now, since I do think what I would want to build here would be able to effectively fill some of the gap left behind.

Comment by habryka (habryka4) on Express interest in an "FHI of the West" · 2024-04-18T16:11:39.394Z · LW · GW

Totally agree, it definitely should not be branded this way if it launches.

I am thinking of "FHI of the West" here basically just as the kind of line directors use in Hollywood to get the theme of a movie across. Like "Jaws in Space" famously being the one-line summary of the movie "Alien".

It also started internally as a joke based on an old story of the University of Michigan (in Ann Arbor) branding itself as "the Harvard of the West", which was perceived to be a somewhat clear exaggeration at the time (and resulted in Kennedy giving a speech where he jokingly described Harvard as "the Michigan of the East", which popularized the phrase). Describing something as the "Harvard of the West" in a joking way seems to have popped up across the Internet in a bunch of different contexts. I'll add that context to the OP, though, like, it is a quite obscure reference.

If anything like this launches to a broader audience I expect no direct reference to FHI to remain. It just seems like a decent way to get some of the core pointers across.

Comment by habryka (habryka4) on FHI (Future of Humanity Institute) has shut down (2005–2024) · 2024-04-17T21:54:11.985Z · LW · GW

My sense is FHI was somewhat accurately modeled as "closed" for a few months. I did not know today would be the date of the official announcement.

Comment by habryka (habryka4) on FHI (Future of Humanity Institute) has shut down (2005–2024) · 2024-04-17T20:57:54.487Z · LW · GW

I knew this was going on for quite a while (my guess is around a year or two). I think ultimately it was a slow smothering by the university administration, and given the adversarial nature of that relationship, I don't really think outrage would have helped that much (though it might have; I don't really understand the university's perspective on this).

My guess is dragging this out longer would have caused more ongoing friction and would have overall destroyed more of the time and energy of the really smart and competent people at FHI than they would have benefitted from the institution continuing.

Comment by habryka (habryka4) on Prometheus's Shortform · 2024-04-17T00:04:59.540Z · LW · GW

Yeah, that makes sense. I've noticed miscommunications around the word "scheming" a few times, so am in favor of tabooing it more. "Engage in deception for instrumental reasons" seems like an obvious extension that captures a lot of what I care about.

Comment by habryka (habryka4) on Prometheus's Shortform · 2024-04-16T23:40:38.351Z · LW · GW

The best definition I would have of "scheming" would be "the model is acting deceptively about its own intentions or capabilities in order to fool a supervisor" [1]. This behavior seems to satisfy that pretty solidly: 

Of course, in this case the scheming goal was explicitly trained for (as opposed to arising naturally out of convergent instrumental power drives), but it sure seems to me like it's engaging in the relevant kind of scheming.

I agree there is more uncertainty and lack of clarity on whether deceptively-aligned systems will arise "naturally", but the above seems like a clear example of someone artificially creating a deceptively-aligned system.

  1. ^

    Joe Carlsmith uses "whether advanced AIs that perform well in training will be doing so in order to gain power later", but IDK, that feels really underspecified. Like, there are just tons of reasons why the AI might want to perform well in training that relate to power-seeking, and when I read the rest of the report it seems like Joe was more analyzing it through the deception-of-supervisors lens.

Comment by habryka (habryka4) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-16T23:22:24.829Z · LW · GW

I would call what METR does alignment research, but I'm also fine using a different term for it. Mostly I'm using it synonymously with "AI Safety Research", which I know you object to, but I do think that's how it's normally used (and the relevant aspect here is the pre-paradigmaticity of the relevant research, which I continue to think applies independently of the bucket you put it into).

I do think it's marginally good to make "AI Alignment Research" mean something narrower, so I am supportive of pushing me to use something broader like "AI Safety Research", but I don't really think that changes my argument in any relevant way.

Comment by habryka (habryka4) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-16T23:10:09.899Z · LW · GW

Huh, I guess my sense is that in order to develop the standards talked about above you need to do a lot of cutting-edge research. 

I guess my sense from the description above is that we would be talking about research pretty similar to what METR is doing, which seems pretty open-ended and pre-paradigmatic to me. But I might be misunderstanding his role.

Comment by habryka (habryka4) on Prometheus's Shortform · 2024-04-16T22:52:51.036Z · LW · GW

It seems pretty likely to me that current AGIs are already scheming. At least it seems like the simplest explanation for things like the behavior observed in this paper: https://www.alignmentforum.org/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through 

Comment by habryka (habryka4) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-16T22:51:28.308Z · LW · GW

I don't know the landscape of US government institutions that well, but some guesses: 

  • My sense is DARPA, and analogous institutions like IARPA, have often pursued substantially more open-ended research that seems more in line with what I expect AI Alignment research to look like.
  • The US government has many national laboratories that have housed a lot of great science and research. Many of those seem like decent fits: https://www.usa.gov/agencies/national-laboratories 
  • It's also not super clear to me that research like this needs to be hosted within a governmental institution. Organizations like RAND or academic institutions seem well-placed to host it, and have existing high-trust relationships with the U.S. government.
  • Something like the UK task force structure also seems reasonable to me, though I don't think I have a super deep understanding of that either. Of course, creating a whole new structure for something like this is hard (and I do see people in parallel trying to establish a new specialized institution).

The Romney, Reed, Moran, and King framework, whose summary I happened to read this morning, suggests the following options:

It lists NIST together with the Department of Commerce as one of the options, but all the other options also seem reasonable to me, and, I think, better by my lights. Though I agree none of these seem ideal (besides the creation of a specialized new agency, though of course that would justifiably encounter a bunch of friction, since creating a new agency should have a pretty high bar for happening).

Comment by habryka (habryka4) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-16T21:13:27.544Z · LW · GW

This seems quite promising to me. My primary concern is that I feel like NIST is really not an institution well-suited to house alignment research. 

My current understanding is that NIST is generally a very conservative organization, with the primary remit of establishing standards in a variety of industries and sciences. These standards are almost always about things that are extremely objective and very well-established science. 

In contrast, AI Alignment seems to me to continue to be in a highly pre-paradigmatic state, and the standards that I can imagine us developing there seem to be qualitatively very different from the other standards that NIST has historically been in charge of developing. It seems to me that NIST is not a good vehicle for the kind of cutting edge research and extremely uncertain environment in which things like alignment and evals research have to happen. 

Maybe other people have a different model of different US governmental institutions, but at least to me NIST seems like a bad fit for the kind of work that I expect Paul to be doing there. 

Comment by habryka (habryka4) on Anthropic AI made the right call · 2024-04-16T21:09:47.479Z · LW · GW

Oh, I have indeed used this to update that OpenAI deeply misunderstands alignment, and this IMO has allowed me to make many accurate predictions about what OpenAI has been doing over the last few years, so I feel good about interpreting it that way.

Comment by habryka (habryka4) on nikola's Shortform · 2024-04-16T16:03:00.899Z · LW · GW

Wouldn't the equivalent be more like burning the body of a dead person?

It's not like the AI would have a continuous stream of consciousness, and it's more that you are destroying the information necessary to run it. It seems to me that shutting off an AI is more similar to killing it.

Seems like the death analogy here is a bit spotty. I could see it going either way as a best fit.

Comment by habryka (habryka4) on Anthropic AI made the right call · 2024-04-15T20:57:05.339Z · LW · GW

I would take pretty strong bets that that isn't what happened based on having talked to more people about this. Happy to operationalize and then try to resolve it.

Comment by habryka (habryka4) on Anthropic AI made the right call · 2024-04-15T20:01:54.714Z · LW · GW

That seems concerning! Did you follow up with the leadership of your organization to understand to what degree they seem to have been making different (and plausibly contradictory) commitments to different interest groups? 

It seems like it's quite important to know what promises your organization has made to whom, if you are trying to assess whether your working there will positively or negatively affect how AI will go.

(Note, I talked with Evan about this in private some other times, so the above comment is more me bringing a private conversation into the public realm than me starting a whole conversation about this. I've already poked Evan privately asking him to please try to get better confirmation of the nature of the commitments made here, but he wasn't interested at the time, so I am making the same bid publicly.)

Comment by habryka (habryka4) on Anthropic AI made the right call · 2024-04-15T19:58:19.428Z · LW · GW

I also strongly expected them to violate this commitment, though my understanding is that various investors and early collaborators did believe they would keep this commitment. 

I think it's important to understand that Anthropic was founded before the recent post-ChatGPT hype/AI-interest explosion. Just as OpenAI's charter seemed plausible to people early on as something OpenAI could adhere to, it also seemed possible early on that commercial pressures would not cause a full-throated arms race between all the top companies, with billions to trillions of dollars for the taking for whoever got to AGI first (which, I do agree, made violating this commitment a relatively likely outcome).

Comment by habryka (habryka4) on nikola's Shortform · 2024-04-15T18:19:03.829Z · LW · GW

Not just "some robots or nanomachines" but "enough robots or nanomachines to maintain existing chip fabs, and also the supply chains (e.g. for ultra-pure water and silicon) which feed into those chip fabs, or make its own high-performance computing hardware".

My guess is software performance improvements will be enough that you don't really have to make many more chips until you are at a quite advanced tech level where making better chips is easy. But it's something one should actually think carefully about, and there is a bit of hope that it would become a blocker, but it doesn't seem that likely to me.

Comment by habryka (habryka4) on Anthropic AI made the right call · 2024-04-15T05:17:09.146Z · LW · GW

I guess I'm more willing to treat Anthropic's marketing as not-representing-Anthropic. Shrug.

I feel sympathetic to this, but when I think of the mess of trying to hold an organization accountable while literally not being able to take the public statements of the organization itself as evidence, that feels kind of doomed to me. It feels like it would allow Anthropic to weasel itself out of almost any commitment.

Comment by habryka (habryka4) on Anthropic AI made the right call · 2024-04-15T04:54:23.094Z · LW · GW

Claude 3 Opus meaningfully advanced the frontier? Or slightly advanced it but Anthropic markets it like it was a substantial advance so they're being similarly low-integrity?

I updated somewhat over the following weeks that Opus had meaningfully advanced the frontier, but I don't know how much that is true for other people. 

It seems like Anthropic's marketing is in direct contradiction with the explicit commitment they made to many people, including Dustin, which seems to have quite consistently been the "meaningfully advance the frontier" line. I think it's less clear whether their actual capabilities are in contradiction with it, as opposed to their marketing statements. I think if you want to have any chance of enforcing commitments like this, the enforcement needs to happen at the latest when the organization publicly claims to have done something in direct contradiction to it, so I think the marketing statements matter a bunch here.

Since the discussion around this happened, Anthropic has also continued to publish ads claiming that Claude 3 has meaningfully pushed the state of the art and is the smartest model on the market, so it's not just a one-time oversight by their marketing department.

Separately, multiple Anthropic staffers seem to think themselves no longer bound by their previous commitment and expect that Anthropic will likely unambiguously advance the frontier if they get the chance.

Comment by habryka (habryka4) on The Best Tacit Knowledge Videos on Every Subject · 2024-04-15T03:45:44.680Z · LW · GW

Promoted to curated: The original "The Best Textbooks on Every Subject" post was among the most valuable that LessWrong has ever featured. I really like this extension of it into the realm of tacit knowledge videos, which does feel like a very valuable set of content that I haven't seen curated anywhere else on the internet.

Thank you very much for doing this! And I hope this post will see contributions for many months and years to come.

Comment by habryka (habryka4) on Habryka's Shortform Feed · 2024-04-15T03:43:05.587Z · LW · GW

Had a very aggressive crawler basically DDoS-ing us from a few dozen IPs for the last hour. Sorry for the slower server response times. Things should be fixed now.

Comment by habryka (habryka4) on Anthropic AI made the right call · 2024-04-15T03:34:39.571Z · LW · GW

Most of us agree with you that deploying Claude 3 was reasonable,

I at least didn't interpret this poll to mean that deploying it was reasonable. I think given past Anthropic commitments it was pretty unreasonable (violating your deployment commitments seems really quite bad, and is IMO one of the most central things that Anthropic should be judged on). It's just not really clear whether it directly increased risk. I would be quite sad if that poll result were seen as something like "approval of whether Anthropic made the right call".

Comment by habryka (habryka4) on nikola's Shortform · 2024-04-15T01:56:19.092Z · LW · GW

Before then, if the AI wishes to actually survive, it needs to construct and control a robot/nanomachine population advanced enough to maintain its infrastructure.

As Gwern said, you don't really need to maintain all the infrastructure for that long, and doing it for a while seems quite doable without advanced robots or nanomachines. 

If you wanted to do a very prosaic estimate, you could do something like "how fast is AI software development progress accelerating when the AI can kill all the humans" and then see for how many calendar months the compute infrastructure would actually need to be maintained before the AI can obviously just build some robots or nanomachines.

My best guess is that the AI will have some robots from which it could bootstrap substantially before it can kill all the humans. But even if it didn't, with algorithmic progress rates likely being at their highest when the AI gets smart enough to kill everyone, it seems like you would at most need a few more doublings of compute efficiency to get that capacity, which would then be only a few weeks to months away, and over that timeframe I don't think you would really run into compute-infrastructure issues even if everyone is dead.

Of course, forecasting this kind of stuff is hard, but I do think "the AI needs to maintain infrastructure" tends to be pretty overstated. My guess is that at any point where the AI could kill everyone, it would probably also not really have a problem bootstrapping afterwards.

Comment by habryka (habryka4) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-14T01:57:09.637Z · LW · GW

What's a service that works everywhere? I would have expected YouTube to do pretty well here. Happy to upload it wherever convenient.

Comment by habryka (habryka4) on What's with all the bans recently? · 2024-04-12T03:47:57.374Z · LW · GW

By the way, the rate-limiting algorithm as I've understood it seems poor. It only takes one downvoted comment to get limited, so it doesn't matter if a user leaves one good comment and one poor comment, or if they write 99 good comments and one poor comment.

Automatic rate-limiting only uses the last 20 posts and comments, which can still be relatively harsh, but 99 good comments will definitely outweigh one poor comment.
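For illustration, a minimal, hypothetical sketch of the windowed check described above (not the actual LessWrong implementation; the field name and threshold are made up):

```python
def should_rate_limit(recent_items, window=20, karma_threshold=0):
    """Look only at the user's most recent `window` posts/comments and
    rate-limit based on their aggregate karma, so that many well-received
    comments outweigh a single downvoted one."""
    net_karma = sum(item["karma"] for item in recent_items[:window])
    return net_karma < karma_threshold
```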