LessWrong 2.0 Reader

[link] IAPS: Mapping Technical Safety Research at AI Companies
Zach Stein-Perlman · 2024-10-24T20:30:41.159Z · comments (13)

Take SCIFs, it’s dangerous to go alone
latterframe · 2024-05-01T08:02:38.067Z · comments (1)

Paper Summary: The Effects of Communicating Uncertainty on Public Trust in Facts and Numbers
Jeffrey Heninger (jeffrey-heninger) · 2024-07-09T16:50:05.776Z · comments (2)

[link] Rowing vs steering
Saul Munn (saul-munn) · 2024-08-10T07:00:17.594Z · comments (2)

ARENA 4.0 Impact Report
Chloe Li (chloe-li-1) · 2024-11-27T20:51:54.844Z · comments (3)

When fine-tuning fails to elicit GPT-3.5's chess abilities
Theodore Chapman · 2024-06-14T18:50:52.855Z · comments (3)

Startup Success Rates Are So Low Because the Rewards Are So Large
AppliedDivinityStudies (kohaku-none) · 2024-10-10T20:22:01.557Z · comments (6)

A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed
johnswentworth · 2024-08-22T19:19:28.940Z · comments (4)

Time Efficient Resistance Training
romeostevensit · 2024-10-07T15:15:44.950Z · comments (10)

MATS AI Safety Strategy Curriculum v2
DanielFilan · 2024-10-07T22:44:06.396Z · comments (6)

Unit economics of LLM APIs
dschwarz · 2024-08-27T16:51:22.692Z · comments (0)

[link] Two interviews with the founder of DeepSeek
Cosmia_Nebula · 2024-11-29T03:18:47.246Z · comments (1)

How I internalized my achievements to better deal with negative feelings
Raymond Koopmanschap · 2024-02-27T15:10:24.149Z · comments (7)

Estimating efficiency improvements in LLM pre-training
Daan · 2024-01-19T19:32:45.124Z · comments (3)

Taking responsibility and partial derivatives
Ruby · 2023-12-31T04:33:51.419Z · comments (1)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues
aphyer · 2024-06-07T19:02:06.859Z · comments (16)

Housing Roundup #7
Zvi · 2024-03-04T15:00:08.192Z · comments (1)

Trust as a bottleneck to growing teams quickly
benkuhn · 2024-07-13T18:00:04.579Z · comments (3)

Protocol evaluations: good analogies vs control
Fabien Roger (Fabien) · 2024-02-19T18:00:09.794Z · comments (10)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

Goals selected from learned knowledge: an alternative to RL alignment
Seth Herd · 2024-01-15T21:52:06.170Z · comments (18)

Formalizing the Informal (event invite)
abramdemski · 2024-09-10T19:22:53.564Z · comments (0)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

MonoPoly Restricted Trust
ymeskhout · 2024-01-02T23:02:55.066Z · comments (37)

Australian AI Safety Forum 2024
Liam Carroll (liam-carroll) · 2024-09-27T00:40:11.451Z · comments (0)

[question] What rationality failure modes are there?
Ulisse Mini (ulisse-mini) · 2024-01-19T09:12:57.924Z · answers+comments (11)

NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts
Mikhail Samin (mikhail-samin) · 2023-12-27T18:44:33.976Z · comments (17)

US Presidential Election: Tractability, Importance, and Urgency
kuhanj · 2024-05-29T23:52:22.420Z · comments (2)

[link] Surgery Works Well Without The FDA
Maxwell Tabarrok (maxwell-tabarrok) · 2024-01-26T13:31:29.968Z · comments (28)

Causal Undertow: A Work of Seed Fiction
Daniel Murfet (dmurfet) · 2024-12-08T21:41:48.132Z · comments (0)

The Pointer Resolution Problem
Jozdien · 2024-02-16T21:25:57.374Z · comments (20)

[link] A car journey with conservative evangelicals - Understanding some British political-religious beliefs
Nathan Young · 2024-12-06T11:22:45.563Z · comments (8)

[link] Adverse Selection by Life-Saving Charities
vaishnav92 · 2024-08-14T20:46:23.662Z · comments (16)

One-shot strategy games?
Raemon · 2024-03-11T00:19:20.480Z · comments (42)

A Teacher vs. Everyone Else
ronak69 · 2024-03-21T17:45:35.714Z · comments (8)

How Emergency Medicine Solves the Alignment Problem
StrivingForLegibility · 2023-12-26T05:24:35.579Z · comments (4)

Surviving Seveneves
Yair Halberstadt (yair-halberstadt) · 2024-06-19T13:11:55.414Z · comments (4)

D&D Sci Coliseum: Arena of Data
aphyer · 2024-10-18T22:02:54.305Z · comments (23)

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment
DanielFilan · 2024-12-01T06:00:06.345Z · comments (0)

AI Risk and the US Presidential Candidates
Zane · 2024-01-06T20:18:04.945Z · comments (22)

Trying to translate when people talk past each other
Kaj_Sotala · 2024-12-17T09:40:02.640Z · comments (12)

[link] Programming Refusal with Conditional Activation Steering
Bruce W. Lee (bruce-lee) · 2024-09-11T20:57:08.714Z · comments (0)

Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)

Individually incentivized safe Pareto improvements in open-source bargaining
Nicolas Macé (NicolasMace) · 2024-07-17T18:26:43.619Z · comments (2)

Superintelligent AI is possible in the 2020s
HunterJay · 2024-08-13T06:03:26.990Z · comments (3)

[link] Point of Failure: Semiconductor-Grade Quartz
Annapurna (jorge-velez) · 2024-09-30T15:57:40.495Z · comments (8)

Reflections on the Metastrategies Workshop
gw · 2024-10-24T18:30:46.255Z · comments (5)

[link] Jailbreak steering generalization
Sarah Ball · 2024-06-20T17:25:24.110Z · comments (4)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (2)

DunCon @Lighthaven
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-09-29T04:56:27.205Z · comments (0)

Recent comments

dmitry-vaintrob on Dmitry's Koan

Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I'm planning to split it into pieces in the future.

Your 1-3 are mostly correct. I'd comment as follows:

(and also kind of 3) The advice to use the tempered local Bayesian posterior (I like the term -- let's shorten it to TLBP) is mostly aimed at non-SLT researchers (but may also apply to some SLT experiments). The suggestion is not to compute expectations, but rather to run a single experiment at a weight sampled from the TLBP. The result is analogous to tuning a precision dial on your NN to noise away all circuits for which the quotient (usefulness)/(description length) is bounded above by 1/t (where usefulness is measured in reduction of loss). At t = 0, you're adding no noise and at t = ∞ you're fully noising it.
This is interesting to do in interp experiments for two general reasons:
1. You can see whether the behavior your experiment finds is general or spurious. The higher the temperature range it persists over, the more general it is in the sense of usefulness/description length (and all else being equal, the more important your result is).
2. If you are hoping to say that a behavior you found, e.g. a circuit, is "natural from the circuit's point of view" (i.e., plausibly occurs in some kind of optimal weight- or activation-level description of your model), you need to make sure your experiment isn't just putting together bits of other circuits in an ad-hoc way and calling it a circuit. One way to check this, which works 0% of the time, is to notice that turning this circuit on or off affects the output on exactly the context/structure you care about, and has absolutely no effect at all on performance elsewhere. This never works because our interp isn't at a level where we can perform uber-precise targeted interventions, and whenever we do something to a network in an experiment, this always significantly affects loss on unrelated inputs. By having a tunable precision parameter (as given by the TLBP for example), you have more freedom to find such "clean" effects that only do what you want and don't affect loss otherwise. In general, in an imprecise sense, you expect each "true" circuit to have some "temperature of entanglement" with the rest of the model, and if this circuit is important enough to survive tempering to this temperature of entanglement, you expect to see much cleaner and nicer results in the resulting tempered model.
In the above context, you rarely want to use the Watanabe temperature or any other temperature that only depends on the number of samples n, since it's much too low in most cases. Instead, you're either looking for a characteristic temperature associated with an experiment or circuit (which in general will not depend on n much), or fishing for behaviors that you hope are "significantly general". Here the characteristic temperature associated with the level of generality that "is not literally memorizing" is the Watanabe temperature or very similar, but it is probably more interesting to consider larger scales.
(maybe more related to your question 1): Above, I explained why I think performing experiments at TLBP weight values is useful for "general interp". I also explained that you sometimes have a natural "characteristic temperature" for the TLBP that is independent of sample number (e.g. meaningful at infinite samples), which is the difference between the loss of the network you're studying and that of a SOTA NN, which you think of as the "true optimal loss". In large-sample (highly underparameterized) cases, this is probably a better characteristic temperature than the Watanabe temperature, including for notions of effective parameter count: indeed, insofar as your NN is "an imperfect approximation of an optimal NN", the noise inherent in this imperfection is on this scale (and not the Watanabe scale). Of course there are issues with this PoV, as less expressive NNs are rarely well-conceptualized as TLBP samples (insofar as they find a subset of a "perfect NN's circuits", they find the easily learnable ones rather than the maximally general ones). However, it's still reasonable to think of this as a first stab at the inherent noise scale associated to an underparameterized model, and to think of the effective parameter count at this scale (i.e., free energy / log temperature) as a better approximation of some "inherent" parameter count.
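
To make the "precision dial" picture concrete, here is a minimal sketch of drawing weights from a tempered local posterior on a toy problem. Everything in it is an assumption made for illustration rather than something taken from the comment: the density form exp(-L(w)/t - gamma/2 * ||w - w_star||^2), the quadratic toy loss, the name `sample_tlbp`, the localization strength `gamma`, and the use of unadjusted Langevin dynamics as the sampler. The high-curvature direction stands in for a circuit that buys a lot of loss reduction, the low-curvature direction for one that buys very little.

```python
import numpy as np


def sample_tlbp(grad_loss, w_star, t, gamma=1.0, n_chains=2000,
                n_steps=5000, step=1e-3, seed=0):
    """Approximately sample p(w) ∝ exp(-L(w)/t - gamma/2 * ||w - w_star||^2).

    Hypothetical helper: runs n_chains independent unadjusted Langevin chains,
    all started at w_star. For the quadratic toy below, `step` must satisfy
    step * (max curvature / t + gamma) < 2, or the chains diverge.
    """
    rng = np.random.default_rng(seed)
    w_star = np.asarray(w_star, dtype=float)
    w = np.tile(w_star, (n_chains, 1))
    for _ in range(n_steps):
        # Gradient of the negative log density: tempered loss + localizing term.
        drift = grad_loss(w) / t + gamma * (w - w_star)
        noise = rng.standard_normal(w.shape)
        w = w - 0.5 * step * drift + np.sqrt(step) * noise
    return w


# Toy loss L(w) = 0.5 * sum(curvature * w**2), minimized at w_star = 0.
curvature = np.array([100.0, 0.01])  # [high-value direction, low-value direction]


def grad_loss(w):
    return curvature * w


w_star = np.zeros(2)
cold = sample_tlbp(grad_loss, w_star, t=1.0)    # lower temperature
hot = sample_tlbp(grad_loss, w_star, t=100.0)   # higher temperature

cold_std = cold.std(axis=0)
hot_std = hot.std(axis=0)

# For this quadratic toy the target spread is 1 / sqrt(curvature / t + gamma).
print("t=1   spread per direction:", cold_std)
print("t=100 spread per direction:", hot_std)
```

In this toy, the low-curvature direction is already smeared out to roughly the width of the localizing term at the lower temperature, while the high-curvature direction stays sharp and only starts to blur once t is raised. An interp experiment would then be run at one row of `cold` or `hot` rather than on an average over rows.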

sharmake-farah on Human takeover might be worse than AI takeover

From my perspective, conditional on takeover happening, I'd say that a human takeover and an AI takeover have pretty similar distributions of outcomes, mostly because I expect the variance in human values and the variance in AI values to lead to surprisingly similar outcomes. A key factor here is that I expect a lot of the more alien values to result in extinction (though partial alignment can make things worse), and compared to the horror show that quite a few people have in their values, death can be pretty good. That's because I'm quite a bit more skeptical of the average person's values leading to automatically good outcomes, especially conditional on takeover.

exmateriae on Is Musk still net-positive for humanity?

I thought you said he was very close to the maximum he could do? English is a second language so maybe I misunderstood something. Also, only my first paragraph is really related to the quote, the rest is more of a free flow of what I think

meedstrom on CFAR Takeaways: Andrew Critch

I think some Rationalists believe everything is supposed to fit into one frame, but Frames != The Truth. [...] we should be able to pick up and drop frames as needed, at will.

Aye - see also In Praise of Fake Frameworks. It's helped me interface with a lot of people who would've otherwise befuddled me. That gives me a more fleshed-out range of possible perspectives on things, which shortcuts to new knowledge.

But perhaps it's worth thinking twice about when, or at least how, to introduce this skill, because it looks like a method of doing Salvage Epistemology and so could invite its downsides if taught poorly. I'm undecided whether that's worth worrying about.

richard_kennaway on What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?

The AI, for its own inscrutable reasons, seizes upon the sort of idea that you have to be really smart to be stupid enough to take seriously, and imposes it on everyone.

I think all the scenarios above are instances of this.

zac-hatfield-dodds on POC || GTFO culture as partial antidote to alignment wordcelism

"POC || GTFO culture" need not be literal, and generally cannot be when speculating about future technologies. I wouldn't even want a proof-of-concept misaligned superintelligence!

Nonetheless, I think the field has been improved by an increasing emphasis on empiricism and demonstrations over the last two years, in technical research, in governance research, and in advocacy. I'd still like to see more careful caveating of claims for which we have arguments but not evidence, and it's useful to have a short handle for that idea - "POC || admit you're unsure", perhaps?

johannes-c-mayer on johnswentworth's Shortform

Another one: We manage to solve alignment to a significant extent. The AI, which is much smarter than a human, thinks that it is aligned, and takes aligned actions. The AI even predicts that it will never become unaligned with humans. However, at some point in the future, as the AI naturally unrolls into a reflectively stable equilibrium, it becomes unaligned.

bronson-schoen on Human takeover might be worse than AI takeover

…agentic training data for future systems may involve completing tasks in automated environments (e.g. playing games, SWE tasks, AI R&D tasks) with automated reward signals. The reward here will pick out drives that make AIs productive, smart and successful, not just drives that make them HHH.

…

These drives/goals look less promising if AIs take over. They look more at risk of leading to AIs that would use the future to do something mostly without any value from a human perspective.

I’m interested in why this would seem unlikely in your model. These are precisely the failure modes I think about the most, ex:

I’ve based some of the above on extrapolating from today’s AI systems, where RLHF focuses predominantly on giving AIs personalities that are HHH (helpful, harmless and honest) and generally good by human (liberal western!) moral standards. To the extent these systems have goals and drives, they seem to be pretty good ones. That falls out of the fine-tuning (RLHF) data.

My understanding has always been that the fundamental limitation of RLHF (ex: https://arxiv.org/abs/2307.15217) is precisely that it fails at the limit of humans’ ability to verify (ex: https://arxiv.org/abs/2409.12822, many other examples). You then have to solve other problems (ex: w2s generalization, etc), but I would consider it falsified that we can just rely on RLHF indefinitely (in fact I don’t believe it was a common argument that RLHF ever would hold, but it’s difficult to quantify how prevalent various opinions on it were).

mikkel-wilson on Aristocracy and Hostage Capital

I agree that this description fits the paper.

sloonz on What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?

My headcanon is that there are two levels of alignment:

Technical alignment: you get an AI that does what you ask it to do, without any shenanigans (a bit more precisely: without any short-term/medium-term side-effect that, had you known about it beforehand, would have caused you to refuse to do the thing in the first place). Typical misalignment at this level: hidden complexity of wishes (or, you know, no alignment at all, like clippy).
Comprehensive alignment: you get an AI that does what the CEV-you wants. Typical misalignment: just ask a technically aligned AI for some heavily social-desirability-biased outcome, solve for equilibrium, get close to 0 value remaining in the universe.

But yeah, I don’t think that distinction has got enough discussion.

(there’s also a third level, where CEV-you’s wishes also go to essentially 0 value for current-you, but let’s not get there)