LessWrong 2.0 Reader

How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation")
Ruby · 2024-07-19T00:31:38.332Z · comments (21)

Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes
Anna Gajdova (anna-gajdova) · 2024-10-09T12:56:24.856Z · comments (14)

AI #75: Math is Easier
Zvi · 2024-08-01T13:40:05.539Z · comments (25)

[Intuitive self-models] 5. Dissociative Identity (Multiple Personality) Disorder
Steven Byrnes (steve2152) · 2024-10-15T13:31:46.157Z · comments (6)

[link] Robin Hanson AI X-Risk Debate — Highlights and Analysis
Liron · 2024-07-12T21:31:02.222Z · comments (7)

Bounty for Evidence on Some of Palisade Research's Beliefs
benwr · 2024-09-23T20:01:20.917Z · comments (4)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)

In Defense of Parselmouths
Screwtape · 2023-11-15T23:02:19.344Z · comments (10)

[link] Contra Scott on Abolishing the FDA
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-15T14:00:17.247Z · comments (3)

[link] Bayesians Commit the Gambler's Fallacy
Kevin Dorst · 2024-01-07T12:54:59.939Z · comments (28)

AI #41: Bring in the Other Gemini
Zvi · 2023-12-07T15:10:05.552Z · comments (16)

Some costs of superposition
Linda Linsefors · 2024-03-03T16:08:20.674Z · comments (11)

[link] If Clarity Seems Like Death to Them
Zack_M_Davis · 2023-12-30T17:40:42.622Z · comments (191)

Vaniver's thoughts on Anthropic's RSP
Vaniver · 2023-10-28T21:06:07.323Z · comments (4)

Saving the world sucks
Defective Altruism (Elijah Bodden) · 2024-01-10T05:55:46.504Z · comments (29)

Thoughts on "The Offense-Defense Balance Rarely Changes"
Cullen (Cullen_OKeefe) · 2024-02-12T03:26:50.662Z · comments (4)

On the Proposed California SB 1047
Zvi · 2024-02-12T16:40:04.854Z · comments (18)

[link] The Leeroy Jenkins principle: How faulty AI could guarantee "warning shots"
titotal (lombertini) · 2024-01-14T15:03:21.087Z · comments (6)

AI Safety 101 : Capabilities - Human Level AI, What? How? and When?
markov (markovial) · 2024-03-07T17:29:53.260Z · comments (8)

D&D.Sci(-fi): Colonizing the SuperHyperSphere
abstractapplic · 2024-01-12T23:36:54.248Z · comments (23)

[link] Will releasing the weights of large language models grant widespread access to pandemic agents?
jefftk (jkaufman) · 2023-10-30T18:22:59.677Z · comments (25)

AI doing philosophy = AI generating hands?
Wei Dai (Wei_Dai) · 2024-01-15T09:04:39.659Z · comments (22)

The predictive power of dissipative adaptation
dr_s · 2023-12-17T14:01:31.568Z · comments (14)

[link] For Civilization and Against Niceness
Gabriel Alfour (gabriel-alfour-1) · 2023-11-20T10:56:20.352Z · comments (14)

Forecasting One-Shot Games
Raemon · 2024-08-31T23:10:05.475Z · comments (0)

Conflating value alignment and intent alignment is causing confusion
Seth Herd · 2024-09-05T16:39:51.967Z · comments (18)

[link] Metascience of the Vesuvius Challenge
Maxwell Tabarrok (maxwell-tabarrok) · 2024-03-30T12:02:38.978Z · comments (2)

Untrustworthy models: a frame for scheming evaluations
Olli Järviniemi (jarviniemi) · 2024-08-19T16:27:11.088Z · comments (3)

I'm open for projects (sort of)
cousin_it · 2024-04-18T18:05:01.395Z · comments (13)

[Valence series] 4. Valence & Liking / Admiring
Steven Byrnes (steve2152) · 2024-06-10T14:19:51.194Z · comments (12)

Exercise: Planmaking, Surprise Anticipation, and "Baba is You"
Raemon · 2024-02-24T20:33:49.574Z · comments (19)

[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)

On OpenAI’s Model Spec
Zvi · 2024-06-21T13:00:03.014Z · comments (3)

Rapid capability gain around supergenius level seems probable even without intelligence needing to improve intelligence
Towards_Keeperhood (Simon Skade) · 2024-05-06T17:09:10.729Z · comments (16)

Higher-effort summer solstice: What if we used AI (i.e., Angel Island)?
Rachel Shu (wearsshoes) · 2024-06-25T01:35:54.064Z · comments (9)

AI #68: Remarkably Reasonable Reactions
Zvi · 2024-06-13T16:30:02.969Z · comments (11)

Enriched tab is now the default LW Frontpage experience for logged-in users
Ruby · 2024-06-21T00:09:30.441Z · comments (27)

How to hire somebody better than yourself
lukehmiles (lcmgcd) · 2024-08-28T08:12:53.450Z · comments (5)

Decision Theory in Space
lsusr · 2024-08-18T07:02:11.847Z · comments (18)

D&D.Sci: The Mad Tyrant's Pet Turtles [Evaluation and Ruleset]
abstractapplic · 2024-04-09T14:01:34.426Z · comments (6)

Big Picture AI Safety: Introduction
EuanMcLean (euanmclean) · 2024-05-23T11:15:44.037Z · comments (7)

1. The CAST Strategy
Max Harms (max-harms) · 2024-06-07T22:29:13.005Z · comments (19)

All The Latest Human tFUS Studies
sarahconstantin · 2024-08-09T22:20:04.561Z · comments (2)

So You Created a Sociopath - New Book Announcement!
Garrett Baker (D0TheMath) · 2024-04-01T18:02:18.010Z · comments (3)

[link] MIRI's September 2024 newsletter
Harlan · 2024-09-16T18:15:40.785Z · comments (0)

A starting point for making sense of task structure (in machine learning)
Kaarel (kh) · 2024-02-24T01:51:49.227Z · comments (2)

Principled Satisficing To Avoid Goodhart
JenniferRM · 2024-08-16T19:05:27.204Z · comments (2)

AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)

Startup Roundup #2
Zvi · 2024-08-06T13:30:06.554Z · comments (0)

[link] Book review: Deep Utopia
PeterMcCluskey · 2024-04-23T19:55:50.417Z · comments (14)

Recent comments

dave-orr on Automation collapse

One way this could happen is searching for jailbreaks in the space of paraphrases and synonyms of a benign prompt.

Why would this produce fake/unlikely jailbreaks? If the paraphrases and such are natural, then isn't the nearness to a real(istic) prompt enough to suggest that the jailbreak found is also realistic? Of course you can adversarially generate super unrealistic things, but does that necessarily happen with paraphrasing-type attacks?

steve2152 on A Rocket–Interpretability Analogy

I’m not an expert and I’m not sure it matters much for your point, but: Yes there were surely important synergies between NASA activities and the military ballistic missile programs in the 1960s, but I don’t think it’s correct to suggest that most NASA activities were stuff that would have had to be done for the ballistic missile program anyway. It might actually be a pretty small fraction. For example, less than half the Apollo budget was for launch vehicles; they spent a similar amount on spacecraft, which are not particularly transferable to nukes. And even for the launch vehicles, it seems that NASA tended to start with existing military rocket designs and modify them, rather than the other way around.

I would guess that the main synergy was more indirect: helping improve the consistency of work, economies of scale, defraying overhead costs, etc., for the personnel and contractors and so on.

geoffrey-wood on Cipolla's Shortform

Oh and do at least three past exams for each class. Best to practice in the conditions you will be performing in.

ete on A Rocket–Interpretability Analogy

That you're unaware of there being any notable counterfactual capabilities boost from interpretability is some update. How sure are you that you'd know if there were training multipliers that had interpretability strongly in their causal history? Are you not counting steering vectors from Anthropic here? And I didn't find out about the Hyena from the news article, but from a friend who read the paper; the article just had a nicer quote.

I could imagine that interpretability being relatively ML flavoured makes it more appealing to scaling lab leadership, and this is the reason those projects get favoured rather than them seeing it as commercially useful, at least in many cases.

Would you expect that this continues as interpretability continues to get better? I'd be pretty surprised from general models to find that opening black boxes doesn't let you debug them better, though I could imagine we're not good enough at it yet.

dagon on Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong

[ bowing out after this - I'll read responses and perhaps update on them, but probably won't respond (until next time) ]

To see that the probability of Heads is not "either 1/2 or 1/3, depending on what reference class you choose

I disagree. Very specifically, it's 1/2 if your reference class is "fair coin flips" and 1/3 if your reference class is "temporary, to-be-erased experience of victims with adversarial memory problems".

If your reference class is "wakenings who are predicting what day it is", as in the muffin variant, then 1/3 is a bit easier to work with (though you'd need to specify payoffs to explain why she'd EVER eat the muffin, and then 1/2 becomes pretty easy too). This is roughly equivalent to the non-memory-wiping wager: I'll flip a fair coin, you predict heads or tails. If it's heads, the wager will be $1, if it's tails, the wager is $2. The probability of tails is not 2/3, but you'd pay up to $0.50 to play, right?
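The arithmetic behind that wager can be sketched as below. This is only an illustration under one assumption the comment does not spell out: the wager is symmetric, so you win the stake on a correct prediction and lose the same stake on a wrong one. The function name and parameters are hypothetical.

```python
from fractions import Fraction


def expected_value(prediction, p_heads=Fraction(1, 2), stake_heads=1, stake_tails=2):
    """Expected winnings from always making the same prediction.

    Assumption (not stated in the comment): a correct prediction wins the
    stake for the outcome that occurred, a wrong one loses that stake.
    """
    p_tails = 1 - p_heads
    if prediction == "tails":
        # Win the $2 stake on tails, lose the $1 stake on heads.
        return p_tails * stake_tails - p_heads * stake_heads
    if prediction == "heads":
        # Win the $1 stake on heads, lose the $2 stake on tails.
        return p_heads * stake_heads - p_tails * stake_tails
    raise ValueError("prediction must be 'heads' or 'tails'")


# The coin stays fair; only the stakes are uneven.
ev_tails = expected_value("tails")  # 1/2, i.e. worth paying up to $0.50
ev_heads = expected_value("heads")  # -1/2
```

Under that assumption, always predicting tails is worth $0.50 per play even though the probability of tails is still 1/2: the 2:1 stakes carry the asymmetry, not the coin.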

abstractapplic on D&D Sci Coliseum: Arena of Data

Update:

I tried fitting my ML model without access to speed variables other than sign(speed diff) and got slightly but non-negligibly worse metrics on an outsample. This suggests that sign(speed diff) tells you most of the information you need about speed but if you rely solely on it you're still missing useful and relevant information.

(. . . either that or my code has another error, I guess. Looking forward to finding out in seven days.)

leogao on A Rocket–Interpretability Analogy

I don't think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers - they are basically noise.)

I think you should take into account the fact that before there are really good concrete capabilities results, the process that different labs use to decide what to invest in is highly contingent on a bunch of high variance things. Like, what kinds of research directions appeal to research leadership, or whether there happen to be good ICs excited to work on that direction around and not tied down to any other project.

I don't think you should be that surprised by interpretability being more popular than other areas of alignment. Certainly I think incentives towards capabilities are a small fraction of why it's popular and funded etc (if anything, its non-usefulness for capabilities to date may count against it). Rather, I think it's popular because it's an area where you can actually get traction and do well-scoped projects and have a tight feedback loop. This is not true of the majority of alignment research directions that actually could help with aligning AGI/ASI, and correspondingly those directions are utterly soul grinding to work on.

One could very reasonably argue that more people should be figuring out how to work on the low traction, ill-scoped, shitty feedback loop research problems, and that the field is looking under the streetlight for the keys. I make this argument a lot. But I think you shouldn't need to postulate some kind of nefarious capabilities incentive influence to explain it.

steve2152 on [Intuitive self-models] 3. The Homunculus

Why is, according to your model, the valence of self-reflective thoughts sorta the valence our "best"/pro-social selves would ascribe?

That would be §2.5.1. The idea is that, in general, there are lots of kinds of self-reflective thoughts: thoughts that involve me, and what I’m doing, and what I’m thinking about, and how my day is going, and whether I’m following through with my New Year’s resolution, and what other people would think of me right now, and so on.

These all tend to have salient associations with each other. If I’m thinking about how my day is going, it reminds me that I had promised myself to exercise every day, which might remind me that Sally called me fat, and so on.

Whereas non-self-reflective thoughts by and large have less relation to that whole cloud of associations. If I’m engrossed in a movie and thinking about how the prince is fighting a dragon in a river, or even if I’m just thinking about how best to chop this watermelon, then I’m not thinking about any of those self-reflective things in the above paragraph, and am unlikely to for at least the next second or two.

Incidentally, I think your description is an overstatement. My claim is that “the valence our "best"/pro-social selves would ascribe” is very relevant to the valence of self-reflective thoughts, to a much greater extent than non-self-reflective thoughts. But they’re not decisive. That’s what I was suggesting by my §2.5.2 example of “Screw being ‘my best self’, I’m tired, I’m going to sleep”. The reason that they’re very relevant is those salient associations I just mentioned. If I self-reflect on what I’m thinking about, then that kinda reminds me of how what I’m thinking about reflects on myself in general; so if the latter seems really good and motivating, then some of that goodness will splash onto the former too.

Do you buy that? Sorry if I’m misunderstanding.

Why does the homunculus get modeled as wanting pro-social/best-self stuff (as opposed to just what overall valence would imply)?

Again, I think this is an overstatement, per the §2.5.2 example of “Screw being ‘my best self’, I’m tired, I’m going to sleep”. But it’s certainly directionally true, and I was talking about that in §3.5.1. I think the actual rule is that, if planning / brainstorming is happening towards some goal G, then we imagine that “the homunculus wants G”, since the planning / brainstorming process in general pattern-matches to “wanting” (i.e., we can predict what will probably wind up happening without knowing how).

So that moves us to the question: “if planning / brainstorming is happening towards some goal G, then why do we conclude that S(G) is positive valence, rather than concluding that G is positive valence?” For one thing, if G is negative-valence but S(G) is positive-valence, then we’ll still do the planning / brainstorming, we just focus our attention on S(G) rather than G during that process. That’s my example above of “I really wanted and intended to step into the ice-cold shower, but when I got there, man, I just couldn’t.” Relatedly, if the brainstorming process involves self-reflective thoughts, then that enables better brainstorming, for example involving attention-control strategies, making deals with yourself, etc. (more in Post 8). And another part of the answer is the refrigerator-light illusion, as mentioned in §3.5.1 (and see also the edge-case of “impulsive planning” in §3.5.2).

Does that help?

I'd guess that there was evolutionary pressure for a self-model/homunculus to seem more pro-social than the overall behavior (and thoughts) of the human might imply, so I guess there might be some particular programming from evolution in that direction. I don't know what exactly it might look like though. I also wouldn't be shocked if it's mostly just like all the non-myopic desires are pretty pro-social and the self-model's values get straightened out in a way the myopic desires end up dropped because that would be incoherent. Would be interested in hearing your model on my questions above.

This is a nitpick, but I think you’re using the word “pro-social” when you mean something more like “doing socially-endorsed things”. For example, if a bully is beating up a nerd, he’s impressing his (bully) friends, and he’s acting from social motivations, and he’s taking pride in his work, and he’s improving his self-image and popularity, but most people wouldn’t call bullying “pro-social behavior”, right?

Anyway, I think there’s an innate drive to impress the people who you like in turn. I’ve been calling it the drive to feel liked / admired. It is certainly there for evolutionary reasons, and I think that it’s very strong (in most people, definitely not everyone), and causes a substantial share of ego-syntonic desires, without people realizing it. It has strong self-reflective associations, in that “what the people I like would think of me” centrally involves “me” and what I’m doing, both right now and in general. It’s sufficiently strong that there tends to be a lot of overlap between “the version of myself that I would want others to see, especially those whom I respect in turn” versus “the version of myself that I like best all things considered”.

I think that’s similar to what you’re talking about, right?

dagon on If far-UV is so great, why isn't it everywhere?

Ah, OK. So the claim is that the isolated effect (one building, even an office or home with significant time-spent) is small, but the cumulative effect is nonlinear in some way (either threshold effect or higher-order-than-linear). That IS a lot harder to measure, because it's distributed long-term statistical impact, rather than individually measurable impact. I'd think that we have enough epidemiology knowledge to model the threshold or effect, but I've been disappointed on this front so many times that I'm certainly wrong.

It, unfortunately, shares this difficulty with other large-scale interventions. If it's very expensive, personally annoying (rationally or not), and impossible to show an overwhelming benefit, it's probably not going to happen. And IMO, it's probably overstated in feasibility of benefit.

ete on A Rocket–Interpretability Analogy

I agree that the effect you're pointing to is real and a large part of what's going on here, and could easily be convinced that it's the main cause (along with the one flagged by @habryka). It's definitely a more visible motivation from the perspective of an individual going through the funnel than the one this post highlights. I was focusing on making one concise point rather than covering the whole space of argument, and am glad comments have brought up other angles.