Posts

Eliezer's Lost Alignment Articles / The Arbital Sequence 2025-02-20T00:48:10.338Z
Arbital has been imported to LessWrong 2025-02-20T00:47:33.983Z
Corrigibility's Desirability is Timing-Sensitive 2024-12-26T22:24:17.435Z
Re: Anthropic's suggested SB-1047 amendments 2024-07-27T22:32:39.447Z
Enriched tab is now the default LW Frontpage experience for logged-in users 2024-06-21T00:09:30.441Z
[New Feature] Your Subscribed Feed 2024-06-11T22:45:00.000Z
Against "argument from overhang risk" 2024-05-16T04:44:00.318Z
LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!") 2024-04-23T03:58:43.443Z
The LessWrong 2022 Review: Review Phase 2023-12-22T03:23:49.635Z
Highlights: Wentworth, Shah, and Murphy on "Retargeting the Search" 2023-09-14T02:18:05.890Z
What is the optimal frontier for due diligence? 2023-09-08T18:20:03.300Z
Conversation about paradigms, intellectual progress, social consensus, and AI 2023-09-05T21:30:17.498Z
Whether LLMs "understand" anything is mostly a terminological dispute 2023-07-09T03:31:48.730Z
A "weak" AGI may attempt an unlikely-to-succeed takeover 2023-06-28T20:31:46.356Z
Why libertarians are advocating for regulation on AI 2023-06-14T20:59:58.225Z
Transcript of a presentation on catastrophic risks from AI 2023-05-05T01:38:17.948Z
Recent Database Migration - Report Bugs 2023-04-26T22:19:16.325Z
[New LW Feature] "Debates" 2023-04-01T07:00:24.466Z
AI-assisted alignment proposals require specific decomposition of capabilities 2023-03-30T21:31:57.725Z
The Filan Cabinet Podcast with Oliver Habryka - Transcript 2023-02-14T02:38:34.867Z
LW Filter Tags (Rationality/World Modeling now promoted in Latest Posts) 2023-01-28T22:14:32.371Z
RobertM's Shortform 2023-01-25T08:20:57.842Z
Deconfusing "Capabilities vs. Alignment" 2023-01-23T04:46:57.458Z
patio11's "Observations from an EA-adjacent (?) charitable effort" 2022-12-10T00:27:14.859Z
New Feature: Collaborative editing now supports logged-out users 2022-12-02T02:41:52.297Z
Dan Luu on Futurist Predictions 2022-09-14T03:01:27.275Z
My thoughts on direct work (and joining LessWrong) 2022-08-16T18:53:20.359Z
Keep your protos in one repo 2022-04-28T15:53:26.803Z
CFAR Workshop (Units of Exchange) - Los Angeles LW/ACX Meetup #180 (Wednesday, April 13th) 2022-04-12T03:05:20.923Z
The Engines of Cognition (cont.) - Los Angeles LW/ACX Meetup #175 (Wednesday, March 9th) 2022-03-09T19:09:46.526Z
Do any AI alignment orgs hire remotely? 2022-02-21T22:33:04.765Z
The Engines of Cognition, Volume 3 - Los Angeles LW/ACX Meetup #169 (Wednesday, January 26th) 2022-01-23T23:31:06.642Z
Considerations on Compensation 2021-10-06T23:37:56.255Z
West Los Angeles, CA – ACX Meetups Everywhere 2021 2021-08-23T08:51:45.509Z
Effects of Screentime on Wellbeing - Los Angeles LW/SSC Meetup #153 (Wednesday, July 21st) 2021-07-20T21:03:16.128Z
Assorted Topics - Los Angeles LW/SSC Meetup #152 (Wednesday, July 14th) 2021-07-13T20:22:51.288Z
Welcome Polygenically Screened Babies - LW/SSC Meetup #151 (Wednesday, July 7th) 2021-07-05T06:46:46.125Z
Should we build thermometer substitutes? 2020-03-19T07:43:35.243Z
How can your own death be bad? Los Angeles LW/SSC Meetup #150 (Wednesday, March 4th) 2020-03-04T22:57:16.629Z
Sleeping Beauty - Los Angeles LW/SSC Meetup #149 (Wednesday, February 26th) 2020-02-26T20:30:19.812Z
Newcomb's Paradox 3: What You Don't See - Los Angeles LW/SSC Meetup #148 (Wednesday, February 19th) 2020-02-19T20:55:53.362Z
Newcomb's Paradox: Take Two - Los Angeles LW/SSC Meetup #147 (Wednesday, February 12th) 2020-02-11T06:03:07.552Z
Newcomb's Paradox - Los Angeles LW/SSC Meetup #146 (Wednesday, February 5th) 2020-02-05T03:49:11.541Z
Peter Norvig Contra Chomsky - Los Angeles LW/SSC Meetup #145 (Wednesday, January 29th) 2020-01-28T06:30:50.503Z
Moral Mazes - Los Angeles LW/SSC Meetup #144 (Wednesday, January 22nd) 2020-01-22T22:55:10.302Z
Data Bias - Los Angeles LW/SSC Meetup #143 (Wednesday, January 15th) 2020-01-15T21:49:58.949Z
Iterate Fast - Los Angeles LW/SSC Meetup #142 (Wednesday, January 8th) 2020-01-08T22:53:17.893Z
Predictions - Los Angeles LW/SSC Meetup #141 (Wednesday, January 1st) 2020-01-01T22:14:12.686Z
Your Price for Joining - Los Angeles LW/SSC Meetup #140 (Wednesday, December 18th) 2019-12-18T23:03:16.960Z
Concretize Multiple Ways - Los Angeles LW/SSC Meetup #139 (Wednesday, December 11th) 2019-12-11T21:38:56.118Z

Comments

Comment by T3t on [deleted post] 2025-02-20T03:10:13.602Z

Thanks for the heads up, we'll have this fixed shortly (just need to re-index all the wiki pages once).

Comment by RobertM (T3t) on “Sharp Left Turn” discourse: An opinionated review · 2025-02-13T20:06:10.117Z · LW · GW

Curated.  This post does at least two things I find very valuable:

  1. Accurately represents differing perspectives on a contentious topic
  2. Makes clear, epistemically legible arguments on a confusing topic

And so I think that this post both describes and advances the canonical "state of the argument" with respect to the Sharp Left Turn (and similar concerns).  I hope that other people will also find it helpful in improving their understanding of e.g. objections to basic evolutionary analogies (and why those objections shouldn't make you very optimistic).

Comment by RobertM (T3t) on sjadler's Shortform · 2025-02-11T04:44:44.329Z · LW · GW

Yes:

My model is that Sam Altman regarded the EA world as a memetic threat, early on, and took actions to defuse that threat by paying lip service / taking openphil money / hiring prominent AI safety people for AI safety teams.

In the context of the thread, I took this to suggest that Sam Altman never had any genuine concern about x-risk from AI, or, at a minimum, that any such concern was dominated by the social maneuvering you're describing.  That seems implausible to me given that he publicly expressed concern about x-risk from AI 10 months before OpenAI was publicly founded, and possibly several months before it was even conceived.

Comment by RobertM (T3t) on sjadler's Shortform · 2025-02-10T07:32:20.843Z · LW · GW

Sam Altman posted Machine intelligence, part 1[1] on February 25th, 2015.  This is admittedly after the FLI conference in Puerto Rico, which is reportedly where Elon Musk was inspired to start OpenAI (though I can't find a reference substantiating his interaction with Demis as the specific trigger), but there is other reporting suggesting that OpenAI was only properly conceived later in the year, and Sam Altman wasn't at the FLI conference himself.  (Also, it'd surprise me a bit if it took nearly a year, i.e. from Jan 2nd[2] to Dec 11th[3], for OpenAI to go from "conceived of" to "existing".)

  1. ^

    The post containing the famous "Development of superhuman machine intelligence (SMI) [1] is probably the greatest threat to the continued existence of humanity" quote.

  2. ^

    The FLI conference.

  3. ^

    OpenAI's public founding.

Comment by RobertM (T3t) on Nick Land: Orthogonality · 2025-02-05T19:42:34.250Z · LW · GW

I think it's quite easy to read as condescending.  Happy to hear that's not the case!

Comment by RobertM (T3t) on Nick Land: Orthogonality · 2025-02-05T06:34:19.814Z · LW · GW

I hadn't downvoted this post, but I'm not sure why OP is surprised, given that the first four paragraphs, rather than explaining what the post is about, instead celebrate tree murder and insult their (imagined) audience:

so that no references are needed but those any LW-rationalist is expected to have committed to memory by the time of their first Lighthaven cuddle puddle

Comment by RobertM (T3t) on Pick two: concise, comprehensive, or clear rules · 2025-02-04T21:47:56.863Z · LW · GW

I don't think much has changed since this comment.  Maybe someone will make a new wiki page on the subject, though if it's not an admin I'd expect it to mostly be a collection of links to various posts/comments.

re: the table of contents, it's hidden by default but becomes visible if you hover your mouse over the left column on post pages.

Comment by RobertM (T3t) on Pick two: concise, comprehensive, or clear rules · 2025-02-03T23:15:41.142Z · LW · GW

I understand the motivation behind this, but there is little warning that this is how the forum works. There is no warning that trying to contribute in good faith isn't sufficient, and you may still end up partially banned (rate-limited) if they decide you are more noise than signal. Instead, people invest a lot only to discover this when it's too late.

 

In addition to the New User Guide that gets DMed to every new user (and is also linked at the top of our About page), we:

  • Show this comment above the new post form to new users who haven't already had some content approved by admins.  (Note that it also links to the new user's guide.)

  • Open a modal when a new, unreviewed user clicks into a comment box to write a comment for the first time.  Note how it's three sentences long, explicitly tells users that they start out rate limited, and also links to the new user's guide.

  • Show new, unreviewed users this moderation warning directly underneath the comment box.

 

Now, it's true that people mostly don't read things.  So there is a tricky balance to strike between providing "sufficient" warning, and not driving people away because you keep throwing annoying roadblocks/warnings at them[1].  But it is simply not the case that LessWrong does not go out of its way to tell new users that the site has specific (and fairly high) standards.

  1. ^

    On the old internet, you didn't get advance notice that you should internalize the norms of the community you were trying to join.  You just got told to lurk more - or banned without warning, if you were unlucky.

Comment by RobertM (T3t) on RobertM's Shortform · 2025-01-30T07:27:41.320Z · LW · GW

Apropos of nothing, I'm reminded of the "<antthinking>" tags originally observed in Sonnet 3.5's system prompt, and this section of Dario's recent essay (bolding mine):

In 2024, the idea of using reinforcement learning (RL) to train models to generate chains of thought has become a new focus of scaling. Anthropic, DeepSeek, and many other companies (perhaps most notably OpenAI who released their o1-preview model in September) have found that this training greatly increases performance on certain select, objectively measurable tasks like math, coding competitions, and on reasoning that resembles these tasks.

Comment by RobertM (T3t) on RobertM's Shortform · 2025-01-28T03:41:38.998Z · LW · GW

When is the "efficient outcome-achieving hypothesis" false?  More narrowly, under what conditions are people more likely to achieve a goal (or harder, better, faster, stronger) with fewer resources?

The timing of this quick take is of course motivated by recent discussion about deepseek-r1, but I've had similar thoughts in the past when observing arguments against e.g. hardware restrictions: that they'd motivate labs to switch to algorithmic work, which would speed up timelines (rather than just reducing the naive expected rate of slowdown).  Such arguments propose that labs are following predictably inefficient research directions.  I don't want to categorically rule out such arguments.  From the perspective of a person with good research taste, everyone else with worse research taste is "following predictably inefficient research directions".  But the people I saw making those arguments were generally not people who might conceivably have an informed inside view on novel capabilities advancements.

I'm interested in stronger forms of those arguments, not limited to AI capabilities.  Are there heuristics about when agents (or collections of agents) might benefit from having fewer resources?  One example is the resource curse, though the state of the literature there is questionable and if the effect exists at all it's either weak or depends on other factors to materialize with a meaningful effect size.

Comment by RobertM (T3t) on Habryka's Shortform Feed · 2025-01-27T07:31:21.520Z · LW · GW

We have automated backups, and should even those somehow find themselves compromised (which is a completely different concern from getting DDoSed), there are archive.org backups of a decent percentage of LW posts, which would be much easier to restore than paper copies.

Comment by RobertM (T3t) on Learning By Writing · 2025-01-26T01:51:56.204Z · LW · GW

I learned it elsewhere, but his LinkedIn confirms that he started at Anthropic sometime in January.

Comment by RobertM (T3t) on RobertM's Shortform · 2025-01-16T07:50:29.289Z · LW · GW

I know I'm late to the party, but I'm pretty confused by https://www.astralcodexten.com/p/its-still-easier-to-imagine-the-end (I haven't read the post it's responding to, but I can extrapolate).  Surely the "we have a friendly singleton that isn't Just Following Orders from Your Local Democratically Elected Government or Your Local AGI Lab" is a scenario that deserves some analysis...?  Conditional on "not dying" that one seems like the most likely stable end state, in fact.

Lots of interesting questions in that situation!  Like, money still seems obviously useful for allocating rivalrous goods (which is... most of them, really).  Is a UBI likely when you have a friendly singleton around?  Well, I admit I'm not currently coming up with a better plan for the cosmic endowment.  But then you have population ethics questions - it really does seem like you have to "solve" population ethics somehow, or you run into issues.  Most "just do X" proposals seem to fall totally flat on their face - "give every moral patient an equal share" fails if you allow uploads (or even sufficiently motivated biological reproduction), "don't give anyone born post-singularity anything" seems grossly unfair, etc.

And this is really only scratching the surface.  Do you allow arbitrary cognitive enhancement, with all that that implies for likely future distribution of resources?

Comment by RobertM (T3t) on Everywhere I Look, I See Kat Woods · 2025-01-16T05:21:46.962Z · LW · GW

I was thinking the same thing. This post badly, badly clashes with the vibe of Less Wrong. I think you should delete it, and repost to a site in which catty takedowns are part of the vibe. Less Wrong is not the place for it.

I think this is a misread of LessWrong's "vibes" and would discourage other people from thinking of LessWrong as a place where such discussions should be avoided by default.

With the exception of the title, I think the post does a decent job at avoiding making it personal.

Comment by RobertM (T3t) on [New Feature] Your Subscribed Feed · 2025-01-15T19:53:46.881Z · LW · GW

Well, that's unfortunate.  That feature isn't super polished and isn't currently in the active development path, but will try to see if it's something obvious.  (In the meantime, would recommend subscribing to fewer people, or seeing if the issue persists in Chrome.  Other people on the team are subscribed to 100-200 people without obvious issues.)

Comment by RobertM (T3t) on RobertM's Shortform · 2025-01-02T06:20:33.111Z · LW · GW

FWIW, I don't think "scheming was very unlikely in the default course of events" is "decisively refuted" by our results. (Maybe depends a bit on how we operationalize scheming and "the default course of events", but for a relatively normal operationalization.)

Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: "scheming".  I've replaced "scheming" with "scheming behavior".

 

It's somewhat sensitive to the exact objection the person came in with.

I agree with this.  That said, as per above, I think the strongest objections I can generate to "scheming was very unlikely in the default course of events" being refuted are of the following shape: if we had the tools to examine Claude's internal cognition and figure out what "caused" the scheming behavior, it would be something non-central like "priming", "role-playing" (in a way that wouldn't generalize to "real" scenarios), etc.  Do you have other objections in mind?

Comment by RobertM (T3t) on RobertM's Shortform · 2025-01-02T03:40:19.069Z · LW · GW

I'd like to internally allocate social credit to people who publicly updated after the recent Redwood/Anthropic result, after previously believing that scheming behavior was very unlikely in the default course of events (or a similar belief that was decisively refuted by those empirical results).

Does anyone have links to such public updates?

(Edit log: replaced "scheming" with "scheming behavior".)

Comment by RobertM (T3t) on RobertM's Shortform · 2024-12-31T08:08:07.136Z · LW · GW

One reason to be pessimistic about the "goals" and/or "values" that future ASIs will have is that "we" have a very poor understanding of "goals" and "values" right now.  Like, there is not even widespread agreement that "goals" are a meaningful abstraction to use.  Let's put aside the object-level question of whether this would even buy us anything in terms of safety, if it were true.  The mere fact of such intractable disagreements about core philosophical questions, on which hinge substantial parts of various cases for and against doom, with no obvious way to resolve them, is not something that makes me feel good about superintelligent optimization power being directed at any particular thing, whether or not some underlying "goal" is driving it.

Separately, I continue to think that most such disagreements are not True Rejections; the True Rejection is more likely to be e.g. disbelief that we will create meaningful superintelligences, or that superintelligences would be able to execute a takeover or human-extinction event if their cognition were aimed at that.  I would change my mind about this if I saw a story of a "good ending" involving us creating a superintelligence without having confidence in its, uh... "goals"... that stood up to even minimal scrutiny, like "now play forward events a year; why hasn't someone paperclipped the planet yet?".

Comment by RobertM (T3t) on Corrigibility's Desirability is Timing-Sensitive · 2024-12-27T03:38:17.806Z · LW · GW

I agree that in spherical cow world where we know nothing about the historical arguments around corrigibility, and who these particular researchers are, we wouldn't be able to make a particularly strong claim here.  In practice I am quite comfortable taking Ryan at his word that a negative result would've been reported, especially given the track record of other researchers at Redwood.

at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values

This seems much harder to turn into a scary paper since it doesn't actually validate previous theories about scheming in the pursuit of goal-preservation.

Comment by RobertM (T3t) on Corrigibility's Desirability is Timing-Sensitive · 2024-12-27T00:30:45.152Z · LW · GW

I mean, yes, but I'm addressing a confusion that's already (mostly) conditioning on building it.

Comment by RobertM (T3t) on Knight Lee's Shortform · 2024-12-24T05:09:49.663Z · LW · GW

The /allPosts page shows all quick takes/shortforms posted, though somewhat de-emphasized.

Comment by RobertM (T3t) on Knight Lee's Shortform · 2024-12-24T01:30:36.342Z · LW · GW

FYI: we have spoiler blocks.

Comment by RobertM (T3t) on Claude's Constitutional Consequentialism? · 2024-12-20T02:59:00.059Z · LW · GW

This doesn't seem like it'd do much unless you ensured that there were training examples during RLAIF which you'd expect to cause that kind of behavior enough of the time that there'd be something to update against.  (Which doesn't seem like it'd be that hard, though I think separately that approach seems kind of doomed - it's falling into a brittle whack-a-mole regime.)

Comment by RobertM (T3t) on WannabeChthonic's Shortform · 2024-12-15T04:30:23.144Z · LW · GW

LessWrong doesn't have a centralized repository of site rules, but here are some posts that might be helpful:

https://www.lesswrong.com/posts/bGpRGnhparqXm5GL7/models-of-moderation

https://www.lesswrong.com/posts/kyDsgQGHoLkXz6vKL/lw-team-is-adjusting-moderation-policy

We do currently require content to be posted in English.

Comment by RobertM (T3t) on RobertM's Shortform · 2024-12-10T06:48:26.079Z · LW · GW

"It would make sense to pay that cost if necessary" makes more sense than "we should expect to pay that cost", thanks.

it sounds like you view it as a bad plan?

Basically, yes.  I have a draft post outlining some of my objections to that sort of plan; hopefully it won't sit in my drafts as long as the last similar post did.

(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect not solving AI philosophical competence (while having solved intent alignment) to lead to catastrophe (thus putting us outside the worlds in which x-risks are reliably ‘solved’ for), perhaps in the way Wei Dai has talked about?)

I expect whatever ends up taking over the lightcone to be philosophically competent.  I haven't thought very hard about the philosophical competence of whatever AI succeeds at takeover (conditional on that happening), or, separately, the philosophical competence of the stupidest possible AI that could succeed at takeover with non-trivial odds.  I don't think solving intent alignment necessarily requires that we have also figured out how to make AIs philosophically competent, or vice-versa; I also haven't thought about how likely we are to experience either disjunction.

I think solving intent alignment without having made much more philosophical progress is almost certainly an improvement to our odds, but is not anywhere near sufficient to feel comfortable, since you still end up stuck in a position where you want to delegate "solve philosophy" to the AI, but you can't because you can't check its work very well.  And that means you're stuck at whatever level of capabilities you have, and are still approximately a sitting duck waiting for someone else to do something dumb with their own AIs (like point them at recursive self-improvement).

Comment by RobertM (T3t) on RobertM's Shortform · 2024-12-06T08:09:55.970Z · LW · GW

What do people mean when they talk about a "long reflection"?  The original usages suggest flesh-humans literally sitting around and figuring out moral philosophy for hundreds, thousands, or even millions of years, before deciding to do anything that risks value lock-in, but (at least) two things about this don't make sense to me:

  • A world where we've reliably "solved" for x-risks well enough to survive thousands of years without also having meaningfully solved "moral philosophy" is probably physically realizable, but this seems like a pretty fine needle to thread from our current position.  (I think if you have a plan for solving AI x-risk that looks like "get to ~human-level AI, pump the brakes real hard, and punt on solving ASI alignment" then maybe you disagree.)
  • I don't think it takes today-humans a thousand years to come up with a version of indirect normativity (or CEV, or whatever) that actually just works correctly.  I'd be somewhat surprised if it took a hundred, but maybe it's actually very tricky.  A thousand just seems crazy.  A million makes it sound like you're doing something very dumb, like figuring out every shard of each human's values and don't know how to automate things.

Comment by RobertM (T3t) on yams's Shortform · 2024-11-25T03:07:42.472Z · LW · GW

I tried to make a similar argument here, and I'm not sure it landed.  I think the argument has since demonstrated even more predictive validity with e.g. the various attempts to build and restart nuclear power plants, directly motivated by nearby datacenter buildouts, on top of the obvious effects on chip production.

Comment by T3t on [deleted post] 2024-11-24T20:42:33.399Z

Should be fixed now.

Comment by T3t on [deleted post] 2024-11-24T20:40:05.849Z

Good catch, looks like that's from this revision, which looks like it was copied over from Arbital - some LaTeX didn't make it through.  I'll see if it's trivial to fix.

Comment by T3t on [deleted post] 2024-11-24T20:38:49.589Z

The page isn't dead, Arbital pages just don't load sometimes (or take 15+ seconds).

Comment by RobertM (T3t) on LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that. · 2024-11-22T08:07:15.070Z · LW · GW

I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds.  The evidence you present in each case is outputs generated by LLMs.

The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14.  I'm unsure about 5.

Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much?  Those seem like very surprisingly large updates to make on the basis of LLM outputs (especially in cases where those outputs are self-reports about the internal subjective experience itself, which are subject to substantial pressure from post-training).


Separately, I have some questions about claims like this:

The Big 3 LLMs are somewhat aware of what their own words and/or thoughts are referring to with regards to their previous words and/or thoughts. In other words, they can think about the thoughts "behind" the previous words they wrote.

This doesn't seem constructively ruled out by e.g. basic transformer architectures, but as justification you say this:

If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its "attention" modules are actually intentionally designed to know this sort of thing, using key/query/value lookups that occur "behind the scenes" of the text you actually see on screen.

How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs, vs. conditioning on its previous outputs to give you a plausible "explanation" for what it meant?  The second seems much more likely to me (and this behavior isn't that hard to elicit, i.e. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).
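
For concreteness, the "key/query/value lookups" being invoked are just the standard scaled dot-product attention computation.  A minimal single-head sketch (the numpy framing, the toy shapes, and the lack of causal masking are simplifying assumptions for illustration, not a claim about any particular model):

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a sequence X of token vectors."""
    Q = X @ W_q                                     # queries: what each position "asks for"
    K = X @ W_k                                     # keys: what each position "offers" for lookup
    V = X @ W_v                                     # values: the content that actually gets mixed in
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # query/key similarity for every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                              # weighted mixture of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # 4 token positions, width 8 (toy numbers)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)                   # shape (4, 4)

# Note: the output is just a mixture of value vectors.  Nothing in this computation
# stores a human-readable record of *why* a given position attended where it did,
# which is the crux of the extraction-vs.-confabulation question above.
```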

Comment by RobertM (T3t) on OpenAI Email Archives (from Musk v. Altman and OpenAI blog) · 2024-11-21T01:54:09.204Z · LW · GW

My impression is that Yudkowsky has harmed public epistemics in his podcast appearances by saying things forcefully and with rather poor spoken communication skills for novice audiences.

I recommend reading the Youtube comments on his recorded podcasts, rather than e.g. Twitter commentary from people with a pre-existing adversarial stance to him (or AI risk questions writ large).

Comment by RobertM (T3t) on adam_scholl's Shortform · 2024-11-07T07:08:30.500Z · LW · GW

On one hand, I feel a bit skeptical that some dude outperformed approximately every other pollster and analyst by having a correct inside-view belief about how existing pollsters were messing up, especially given that he won't share the surveys.  On the other hand, this sort of result is straightforwardly predicted by Inadequate Equilibria, where an entire industry had the affordance to be arbitrarily deficient in what most people would think was their primary value-add, because they had no incentive for accuracy (skin in the game), and as soon as someone with an edge could make outsized returns on it (via real-money prediction markets), they outperformed all the experts.

On net I think I'm still <50% that he had a correct belief about the size of Trump's advantage that was justified by the evidence he had available to him, but even being directionally-correct would have been sufficient to get outsized returns a lot of the time, so at that point I'm quibbling with his bet sizing rather than the direction of the bet.

Comment by RobertM (T3t) on dirk's Shortform · 2024-11-02T04:11:02.738Z · LW · GW

I'm pretty sure Ryan is rejecting the claim that the people hiring for the roles in question are worse-than-average at detecting illegible talent.

Comment by RobertM (T3t) on dirk's Shortform · 2024-11-02T04:08:20.709Z · LW · GW

Depends on what you mean by "resume building", but I don't think this is true for "need to do a bunch of AI safety work for free" or similar.  i.e. for technical research, many people who have gone through MATS and then been hired at or founded their own safety orgs have no prior experience doing anything that looks like AI safety research, and some don't even have much in the way of ML backgrounds.  Many people switch directly out of industry careers into doing e.g. ops or software work that isn't technical research.  Policy might seem a bit trickier but I know several people who did not spend anything like years doing resume building before finding policy roles or starting their own policy orgs and getting funding.  (Though I think policy might actually be the most "straightforward" to break into, since all you need to do to demonstrate competence is publish a sufficiently good written artifact; admittedly this is mostly for starting your own thing.  If you want to get hired at a "larger" policy org resume building might matter more.)

Comment by RobertM (T3t) on Habryka's Shortform Feed · 2024-11-01T00:43:36.827Z · LW · GW

(We switched back to shipping Calibri above Gill Sans Nova pending a fix for the horrible rendering on Windows, so if Ubuntu has Calibri, it'll have reverted back to the previous font.)

Comment by RobertM (T3t) on Anthropic's updated Responsible Scaling Policy · 2024-10-17T03:53:15.365Z · LW · GW

Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining to do so might jeopardize the company.

This doesn't seem right to me, though it's possible that I'm misreading either the old or new policy (or both).

Re: predefined evaluations, the old policy neither specified any evaluations in full detail, nor did it suggest that Anthropic would have designed the evaluations prior to a training run.  (Though I'm not sure that's what you meant, when you contrasted it with "employees design and run them on the fly" as a description of the new policy.)

Re: CEO's decisionmaking, my understanding of the new policy is that the CEO (and RSO) will be responsible only for approving or denying an evaluation report making an affirmative case that a new model does not cross a relevant capability threshold ("3.3 Capability Decision", original formatting removed, all new bolding is mine):

If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. The process for making such a determination is as follows:

  • First, we will compile a Capability Report that documents the findings from the comprehensive assessment, makes an affirmative case for why the Capability Threshold is sufficiently far away, and advances recommendations on deployment decisions.
  • The report will be escalated to the CEO and the Responsible Scaling Officer, who will (1) make the ultimate determination as to whether we have sufficiently established that we are unlikely to reach the Capability Threshold and (2) decide any deployment-related issues.
  • In general, as noted in Sections 7.1.4 and 7.2.2, we will solicit both internal and external expert feedback on the report as well as the CEO and RSO’s conclusions to inform future refinements to our methodology. For high-stakes issues, however, the CEO and RSO will likely solicit internal and external feedback on the report prior to making any decisions.
  • If the CEO and RSO decide to proceed with deployment, they will share their decision–as well as the underlying Capability Report, internal feedback, and any external feedback–with the Board of Directors and the Long-Term Benefit Trust before moving forward.

The same is true for the "Safeguards Decision" (i.e. making an affirmative case that ASL-3 Required Safeguards have been sufficiently implemented, given that there is a model that has passed the relevant capabilities thresholds).

This is not true for the "Interim Measures" described as an allowable stopgap if Anthropic finds itself in the situation of having a model that requires ASL-3 Safeguards but is unable to implement those safeguards.  My current read is that this is intended to cover the case where the "Capability Decision" report made the case that a model did not cross into requiring ASL-3 Safeguards, was approved by the CEO & RSO, and then later turned out to be wrong.  It does seem like this permits more or less indefinite deployment of a model that requires ASL-3 Safeguards by way of "interim measures" which need to provide "the same level of assurance as the relevant ASL-3 Standard", with no provision for what to do if it turns out that implementing the actually-specified ASL-3 standard is intractable.  This seems slightly worse than the old policy:

If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers. We will provide transparency and support to impacted customers throughout the process. An emergency of this type would merit a detailed post-mortem and a policy shift to avoid re-occurrence of this situation.

which has much the same immediate impact, but with at least a nod to a post-mortem and policy adjustment.

But, overall, the new policy doesn't seem to be opening up a gigantic hole that allows Dario to press the "all clear" button on capability determinations; he only has the additional option to veto, after the responsible team has already decided the model doesn't cross the threshold.

Comment by RobertM (T3t) on Most arguments for AI Doom are either bad or weak · 2024-10-17T00:31:42.856Z · LW · GW

But that's a communication issue....not a truth issue.

Yes, and Logan is claiming that arguments which cannot be communicated to him in no more than two sentences suffer from a conjunctive complexity burden that renders them "weak".

That's not trivial. There's no proof that there is such a coherent entity as "human values", there is no proof that AIs will be value-driven agents, etc, etc. You skipped over 99% of the Platonic argument there.

Many possible objections here, but of course spelling everything out would violate Logan's request for a short argument.  Needless to say, that request does not have anything to do with effectively tracking reality, where there is no "platonic" argument for any non-trivial claim describable in only two sentences, and yet things continue to be true in the world anyways, so reductio ad absurdum: there are no valid or useful arguments which can be made for any interesting claims.  Let's all go home now!

Comment by RobertM (T3t) on Most arguments for AI Doom are either bad or weak · 2024-10-14T05:37:36.117Z · LW · GW

A strong good argument has the following properties:

  • it is logically simple (can be stated in a sentence or two)
    • This is important, because the longer your argument, the more details that have to be true, and the more likely that you have made a mistake.  Outside the realm of pure-mathematics, it is rare for an argument that chains together multiple "therefore"s to not get swamped by the fact that

No, this is obviously wrong.

  1. Argument length is substantially a function of shared premises.  I would need many more sentences to convey a novel argument about AI x-risk to someone who had never thought about the subject before, than to someone who has spent a lot of time in the field, because in all likelihood I would first need to communicate and justify many of the foundational concepts that we take for granted.
    1. Note that even here, on LessWrong, this kind of detailed argumentation is necessary to ward off misunderstandings.  
  2. Argument strength is not an inverse function with respect to argument length, because not every additional "piece" of an argument is a logical conjunction which, if false, renders the entire argument false.  Many details in any specific argument are narrowing down which argument the speaker is making, but are not themselves load-bearing (& conjunctive) claims that all have to be true for the argument to be valid.  (These are often necessary; see #1.)
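
To put toy numbers on point 2 (the probabilities here are purely illustrative assumptions, not anything from the post under discussion):

```python
# Ten genuinely load-bearing conjuncts at 90% each do get swamped:
print(0.9 ** 10)    # ≈ 0.349

# Forty additional clarifying sentences (definitions, disambiguations,
# worked examples) are not conjuncts; they contribute no multiplicative
# penalty, so the argument's probability stays ≈ 0.349 no matter how
# long the exposition gets.
print(0.9 ** 10 * 1.0 ** 40)    # still ≈ 0.349
```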

Anyways, the trivial argument that AI doom is likely (given that you already believe we're likely to develop ASI in the next few decades, and that it will be capable of scientific R&D that sounds like sci-fi today) is that it's not going to have values that are friendly to humans, because we don't know how to build AI systems in the current paradigm with any particular set of values at all, and the people pushing frontier AI capabilities mostly don't think this is a real problem that needs to be figured out[1].  This is self-evidently true, but you (and many others) disagree.  What now?

  1. ^

    A moderately uncharitable compression of a more detailed disagreement, which wouldn't fit into one sentence.

Comment by RobertM (T3t) on Dario Amodei — Machines of Loving Grace · 2024-10-12T07:44:10.401Z · LW · GW

Credit where credit is due: this is much better in terms of sharing one's models than one could say of Sam Altman, in recent days. 

As noted above the footnotes, many people at Anthropic reviewed the essay.  I'm surprised that Dario would hire so many people he thinks need to "touch grass" (because they think the scenario he describes in the essay sounds tame), as I'm pretty sure that describes a very large percentage of Anthropic's first ~150 employees (certainly over 20%, maybe 50%).

My top hypothesis is that this is a snipe meant to signal Dario's (and Anthropic's) factional alliance with Serious People; I don't think Dario actually believes that "less tame" scenarios are fundamentally implausible[1].  Other possibilities that occur to me, with my not very well considered probability estimates:

  • I'm substantially mistaken about how many early Anthropic employees think "less tame" outcomes are even remotely plausible (20%), and Anthropic did actively try to avoid hiring people with those models early on (1%).
  • I'm not mistaken about early employee attitudes, but Dario does actually believe AI is extremely likely to be substantially transformative, and extremely unlikely to lead to the "sci-fi"-like scenarios he derides (20%).  Conditional on that, he didn't think it mattered whether his early employees had those models (20%) or might have slightly preferred not, all else equal, but wasn't that fussed about it compared to recruiting strong technical talent (60%).

I'm just having a lot of trouble reconciling what I know of the beliefs of Anthropic employees, and the things Dario says and implies in this essay.  Do Anthropic employees who think less tame outcomes are plausible believe Dario when he says they should "touch grass"?  If you don't feel comfortable answering that question in public, or can't (due to NDA), please consider whether this is a good situation to be in.

  1. ^

    He has not, as far as I know, deigned to offer any public argument on the subject.

Comment by RobertM (T3t) on My motivation and theory of change for working in AI healthtech · 2024-10-12T05:38:10.386Z · LW · GW

Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?  Most of my models for how we might go extinct in the next decade from loss of control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather than a deliberate or accidental use of technology that ends up killing all humans pretty quickly.

Separately, I don't understand how encouraging human-specific industries is supposed to work in practice.  Do you have a model for maintaining "regulatory capture" in a sustained way, despite having no economic, political, or military power by which to enforce it?  (Also, even if we do succeed at that, it doesn't sound like we get more than the Earth as a retirement home, but I'm confused enough about the proposed equilibrium that I'm not sure that's the intended implication.)

Comment by RobertM (T3t) on Dario Amodei — Machines of Loving Grace · 2024-10-12T00:23:23.079Z · LW · GW

Yeah, the essay (I think correctly) notes that the most significant breakthroughs in biotech come from the small number of "broad measurement tools or techniques that allow precise but generalized or programmable intervention", which "are so powerful precisely because they cut through intrinsic complexity and data limitations, directly increasing our understanding and control".

Why then only such systems limited to the biological domain?  Even if it does end up being true that scientific and technological progress is substantially bottlenecked on real-life experimentation, where even AIs that can extract many more bits from the same observations than humans still suffer from substantial serial dependencies with no meaningful "shortcuts", it still seems implausible that we don't get to nanotech relatively quickly, if it's physically realizable.  And then that nanotech unblocks the rate of experimentation.  (If you're nanotech skeptical, human-like robots seem sufficient as actuators to speed up real-life experimentation by at least an order of magnitude compared to needing to work through humans, and work on those is making substantial progress.)

If Dario thinks that progress will cap out at some level due to humans intentionally slowing down, it seems good to say this.

Footnote 2 maybe looks like a hint in this direction if you squint, but Dario spent a decent chunk of the essay bracketing outcomes he thought were non-default and would need to be actively steered towards, so it's interesting that he didn't explicitly list those (non-tame futures) as a type of outcome that he'd want to actively steer away from.

Comment by RobertM (T3t) on Embracing complexity when developing and evaluating  AI responsibly · 2024-10-11T18:32:07.360Z · LW · GW

Not Mitchell, but at a guess:

  • LLMs really like lists
  • Some parts of this do sound a lot like LLM output:
    • "Complex Intervention Development and Evaluation Framework: A Blueprint for Ethical and Responsible AI Development and Evaluation"
    • "Addressing Uncertainties"
  • Many people who post LLM-generated content on LessWrong often wrote it themselves in their native language and had an LLM translate it, so it's not a crazy prior, though I don't see any additional reason to have guessed that here.

Having read more of the post now, I do believe it was at least mostly human-written (without this being a claim that it was at least partially written by an LLM).  It's not obvious that it's particularly relevant to LessWrong.  The advice on the old internet was "lurk more"; now we show users warnings like this when they're writing their first post.

Comment by RobertM (T3t) on Shortform · 2024-10-08T22:00:25.618Z · LW · GW

I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments.  My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").

Comment by T3t on [deleted post] 2024-10-05T01:28:55.649Z

I don't really see how this is responding to my comment.  I was not arguing about the merits of RLHF along various dimensions, or what various people think about it, but pointing out that calling something "an alignment technique" with no further detail is not helping uninformed readers understand what "RLHF" is better (but rather worse).

Again, please model an uninformed reader: how does the claim "RLHF is an alignment technique" constrain their expectations?  If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim.  This is a claim about the motivations and worldviews of those people.  But I don't know what sort of useful work "RLHF is an alignment technique" is doing, other than making claims that are not centrally about RLHF itself.

Comment by T3t on [deleted post] 2024-10-05T00:39:26.271Z

This wasn't part of my original reasoning, but I went and did a search for other uses of "alignment technique" in tag descriptions.  There's one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it's quite far down the description, well after the object-level details about the proposed technique itself.

Comment by T3t on [deleted post] 2024-10-05T00:36:12.802Z

Two reasons:

First, the change made the sentence much worse to read.  It might not have been strictly ungrammatical, but it was bad English.

Second, I expect that the average person, unfamiliar with the field, would be left with a thought-terminating mental placeholder after reading the changed description.  What does "is an alignment technique" mean?  Despite being in the same sentence as "is a machine learning technique", it is not serving anything like the same role, in terms of the implicit claims it makes.  Intersubjective agreement on what "is an alignment technique" means will be far worse than on "is a machine learning technique", and many implications of the first claim are far more contentious than of the second. 

To me, "is an alignment technique" does not convey useful detail about the technique itself, but about how various people in the community relate to it (and similar sociological details).  If you want to describe that kind of detail explicitly, that's one thing[1].  But it's actively confusing to conflate it with technical detail about the technique itself.

  1. ^

    Though it's not the kind of detail that should live in the first sentence of the tag description, probably.

Comment by RobertM (T3t) on What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud? · 2024-09-30T01:47:48.420Z · LW · GW

Almost no specific (interesting) output is information that's already been generated by any model, in the strictest sense.

Comment by RobertM (T3t) on What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud? · 2024-09-27T18:48:37.348Z · LW · GW

reasonably publicly accessible by an ordinary person from sources other than a covered model or covered model derivative

Seems like it'd pretty obviously cover information generated by non-covered models that are routinely used by many ordinary people (as open source image models currently are).

As a sidenote, I think the law is unfortunately one of those pretty cursed domains where it's hard to be very confident of anything as a layman without doing a lot of your own research, and you can't even look at experts speaking publicly on the subject since they're often performing advocacy, rather than making unbiased predictions about outcomes.  You could try to hire a lawyer for such advice, but it seems to be pretty hard to find lawyers who are comfortable giving their clients quantitative (probabilistic) and conditional estimates.  Maybe this is better once you're hiring for e.g. general counsel of a large org, or maybe large tech company CEOs have to deal with the same headaches that we do.  Often your best option is to just get a basic understanding of how relevant parts of the legal system work, and then do a lot of research into e.g. relevant case law, and then sanity-check your reasoning and conclusions with an actual lawyer specialized in that domain.

Comment by RobertM (T3t) on What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud? · 2024-09-27T04:45:05.699Z · LW · GW

Not your particular comment on it, no.