LessWrong 2.0 Reader

My Clients, The Liars
ymeskhout · 2024-03-05T21:06:36.669Z · comments (85)

Ilya Sutskever and Jan Leike resign from OpenAI [updated]
Zach Stein-Perlman · 2024-05-15T00:45:02.436Z · comments (95)

The Best Lay Argument is not a Simple English Yud Essay
J Bostock (Jemist) · 2024-09-10T17:34:28.422Z · comments (15)

Principles for the AGI Race
William_S · 2024-08-30T14:29:41.074Z · comments (13)

Truthseeking is the ground in which other principles grow
Elizabeth (pktechgirl) · 2024-05-27T01:09:20.796Z · comments (16)

Book Review: Going Infinite
Zvi · 2023-10-24T15:00:02.251Z · comments (110)

Alignment Implications of LLM Successes: a Debate in One Act
Zack_M_Davis · 2023-10-21T15:22:23.053Z · comments (50)

AI companies aren't really using external evaluators
Zach Stein-Perlman · 2024-05-24T16:01:21.184Z · comments (15)

Laziness death spirals
PatrickDFarley · 2024-09-19T15:58:30.252Z · comments (34)

the case for CoT unfaithfulness is overstated
nostalgebraist · 2024-09-29T22:07:54.053Z · comments (38)

Refusal in LLMs is mediated by a single direction
Andy Arditi (andy-arditi) · 2024-04-27T11:13:06.235Z · comments (93)

Believing In
AnnaSalamon · 2024-02-08T07:06:13.072Z · comments (51)

Announcing MIRI’s new CEO and leadership team
Gretta Duleba (gretta-duleba) · 2023-10-10T19:22:11.821Z · comments (52)

[link] Introducing AI Lab Watch
Zach Stein-Perlman · 2024-04-30T17:00:12.652Z · comments (30)

What are the results of more parental supervision and less outdoor play?
juliawise · 2023-11-25T12:52:29.986Z · comments (31)

Thoughts on responsible scaling policies and regulation
paulfchristiano · 2023-10-24T22:21:18.341Z · comments (33)

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Rohin Shah (rohinmshah) · 2024-08-20T16:22:45.888Z · comments (33)

SAE feature geometry is outside the superposition hypothesis
jake_mendel · 2024-06-24T16:07:14.604Z · comments (17)

Modern Transformers are AGI, and Human-Level
abramdemski · 2024-03-26T17:46:19.373Z · comments (88)

MIRI 2024 Mission and Strategy Update
Malo (malo) · 2024-01-05T00:20:54.169Z · comments (44)

[link] AI presidents discuss AI alignment agendas
TurnTrout · 2023-09-09T18:55:37.931Z · comments (23)

LLM Generality is a Timeline Crux
eggsyntax · 2024-06-24T12:52:07.704Z · comments (117)

Brute Force Manufactured Consensus is Hiding the Crime of the Century
Roko · 2024-02-03T20:36:59.806Z · comments (156)

CFAR Takeaways: Andrew Critch
Raemon · 2024-02-14T01:37:03.931Z · comments (62)

The ‘strong’ feature hypothesis could be wrong
lewis smith (lsgos) · 2024-08-02T14:33:58.898Z · comments (17)

Superbabies: Putting The Pieces Together
sarahconstantin · 2024-07-11T20:40:05.036Z · comments (37)

AI Control: Improving Safety Despite Intentional Subversion
Buck · 2023-12-13T15:51:35.982Z · comments (7)

[link] "How could I have thought that faster?"
mesaoptimizer · 2024-03-11T10:56:17.884Z · comments (32)

ChatGPT can learn indirect control
Raymond D · 2024-03-21T21:11:06.649Z · comments (27)

"Slow" takeoff is a terrible term for "maybe even faster takeoff, actually"
Raemon · 2024-09-28T23:38:25.512Z · comments (70)

[link] The Lighthaven Campus is open for bookings
habryka (habryka4) · 2023-09-30T01:08:12.664Z · comments (18)

Towards more cooperative AI safety strategies
Richard_Ngo (ricraz) · 2024-07-16T04:36:29.191Z · comments (132)

Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
So8res · 2023-11-24T17:37:43.020Z · comments (83)

Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-04-30T18:51:13.493Z · comments (40)

OpenAI: Fallout
Zvi · 2024-05-28T13:20:04.325Z · comments (25)

UDT shows that decision theory is more puzzling than ever
Wei Dai (Wei_Dai) · 2023-09-13T12:26:09.739Z · comments (55)

[link] Jaan Tallinn's 2023 Philanthropy Overview
jaan · 2024-05-20T12:11:39.416Z · comments (5)

We're Not Ready: thoughts on "pausing" and responsible scaling policies
HoldenKarnofsky · 2023-10-27T15:19:33.757Z · comments (33)

Pay Risk Evaluators in Cash, Not Equity
Adam Scholl (adam_scholl) · 2024-09-07T02:37:59.659Z · comments (19)

Maybe Anthropic's Long-Term Benefit Trust is powerless
Zach Stein-Perlman · 2024-05-27T13:00:47.991Z · comments (21)

[link] Sam Altman’s Chip Ambitions Undercut OpenAI’s Safety Strategy
garrison · 2024-02-10T19:52:55.191Z · comments (52)

Funny Anecdote of Eliezer From His Sister
Noah Birnbaum (daniel-birnbaum) · 2024-04-22T22:05:31.886Z · comments (6)

Labs should be explicit about why they are building AGI
peterbarnett · 2023-10-17T21:09:20.711Z · comments (17)

This might be the last AI Safety Camp
Remmelt (remmelt-ellen) · 2024-01-24T09:33:29.438Z · comments (34)

Toward A Mathematical Framework for Computation in Superposition
Dmitry Vaintrob (dmitry-vaintrob) · 2024-01-18T21:06:57.040Z · comments (18)

Thoughts on “AI is easy to control” by Pope & Belrose
Steven Byrnes (steve2152) · 2023-12-01T17:30:52.720Z · comments (56)

The impossible problem of due process
mingyuan · 2024-01-16T05:18:33.415Z · comments (64)

The Sun is big, but superintelligences will not spare Earth a little sunlight
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-09-23T03:39:16.243Z · comments (139)

[question] Examples of Highly Counterfactual Discoveries?
johnswentworth · 2024-04-23T22:19:19.399Z · answers+comments (100)

Response to Aschenbrenner's "Situational Awareness"
Rob Bensinger (RobbBB) · 2024-06-06T22:57:11.737Z · comments (27)


Recent comments

thane-ruthenis on Explore More: A Bag of Tricks to Keep Your Life on the Rails

Exploit your natural motivations

There's a relevant concept that I keep meaning to write about, which I could summarize as: create gradients towards your long-term aspirations.

Humans are general intelligences, and one of the core properties of general intelligence is not being a greedy-optimization algorithm:

- We can pursue long-term goals even when each individual step towards them is not pleasurable-in-itself (such as suffering through university to get a degree that jobs in your chosen field require).
- We can force ourselves out of local maxima (such as quitting a job you hate and changing careers, even though it'd mean a period of life filled with uncertainty and anxieties).
- We can build world-models, use them to infer the shapes of our value functions, and plot a path towards their global maximum, even if it requires passing through negative-reward regions (such as engaging in self-reflection and exploration, then figuring out which vocation would be most suitable to a person-like-you).
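
To make the contrast with a greedy optimizer concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than anything from the post: the 1-D `landscape` of values and the helper names `greedy_climb` and `plan_to_global_max` are made up. It only shows a greedy climber stalling on a local maximum while a planner that consults the whole "world-model" accepts a dip through negative values to reach the global one.

```python
# Hypothetical 1-D "value landscape": index = position, value = how good it feels there.
landscape = [0, 2, 3, 1, -1, 0, 4, 7, 5]

def greedy_climb(values, start):
    """Step to a neighbouring position only if it is strictly better than the current one."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(values)]
        best = max(neighbours, key=lambda p: values[p])
        if values[best] <= values[pos]:
            return pos  # stuck: no neighbour improves on the current spot
        pos = best

def plan_to_global_max(values, start):
    """Look at the whole landscape, pick the best position, and walk straight to it."""
    target = max(range(len(values)), key=lambda p: values[p])
    step = 1 if target >= start else -1
    path = list(range(start, target + step, step))
    return target, path

greedy_end = greedy_climb(landscape, start=0)
target, path = plan_to_global_max(landscape, start=0)

print("greedy stops at position", greedy_end, "with value", landscape[greedy_end])
print("planner reaches position", target, "with value", landscape[target])
print("values along the planned path:", [landscape[p] for p in path])
```

The planned path passes through the negative-valued position that the greedy climber never enters, which is the "negative-reward region" in the last bullet.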

However, it's hard. We're hybrid systems, combining generally-intelligent planning modules with greedy RL circuitry. The greedy RL circuitry holds a lot of sway. If you keep forcing yourself to do something it assigns negative rewards to, it's going to update your plan-making modules until they stop doing that.

It is much, much easier to keep doing something if every instance of it is pleasurable in itself. If the reward is instead sparse and infrequent, you'd need a lot of "willpower" to keep going (to counteract the negative updates), and accumulating that is a hard problem in itself.

So the natural solution is to plot, or create, a path towards the long-term aspiration such that motion along it would involve receiving immediate positive feedback from your learned and innate reward functions.

A lot of productivity advice reduces to this:

- Breaking the long-term task into yearly, monthly, and daily subgoals, such that you can feel accomplishment on a frequent basis (instead of only at the end).
- Using "cross-domain success loops": simultaneously working on several projects, such that you frequently accomplish something worthwhile along at least one of those tracks, and can then harness the momentum from success along one track into motivation for continuing the work along the other tracks.
  - I.e., sort of tricking your reward system into confusing where exactly the reward is coming from.
  - (I think there was an LW post about this, but I don't remember how to find it.)
- Eating something tasty, or going to a party, or otherwise "indulging" yourself, every time you do something that contributes to your long-term aspiration.
- Finding ways to make the process at least somewhat enjoyable, e.g. through environmental factors, such as working in a pleasant place, putting on music, using tools that feel satisfying to use, or doing small work-related rituals that you find amusing.
- Creating social rewards and punishments, such as:
  - Joining a community focused on pursuing the same aspiration as you.
  - Finding "workout buddies".
  - Having friends who'd hold you accountable if you slack off.
  - Having friends who'd cheer you on if you succeed.
- And, as in Shoshannah's post: searching for activities that are innately enjoyable and happen to move you in the direction of your aspirations.

None of the specific examples here are likely to work for you (they didn't for me). But you might be able to design or find an instance of that general trick that fits you!

(Or maybe not. Sometimes you have to grit your teeth and go through a rewardless stretch of landscape, if you're not willing to budge on your goal/aspiration.)

Other relevant posts:

- Venkatesh Rao's The Calculus of Grit. It argues for ignoring extrinsic "disciplinary boundaries" (professions, fields) when choosing your long-term aspirations, and instead following an "internal" navigation system when mapping out the shape of the kind-of-thing that someone-like-you is well-suited to doing.
  - Note that this advice goes further than Shoshannah's: in this case, you don't exert any (conscious) control even over the direction you'd like to go, much less your "goal".
  - It's likely to be easier, but the trade-off should be clear.
- John Wentworth's Plans Are Predictions, Not Optimization Targets. This connection is a bit rougher, but: that post can be generalized to note that any explicit life goals you set for yourself should often be treated as predictions about what goal you should pursue. Recognizing that, you might instead choose to "pursue your goal-in-expectation", which might be similar to Shoshannah's point about "picking a direction, not a goal".

ryan_greenblatt on Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming

No, IMO.

(I'm also skeptical of competitiveness with expert jailbreakers, but this isn't a crux.)

douglas_knight on Is the Power Grid Sustainable?

That breakdown is fiction dictated by the regulator.

annapurna on Update on the Mysterious Trump Buyers on Polymarket

Your comment would have more validity had the market not corrected the dislocation created by Theo. It took three weeks, but the market eventually corrected itself.

dagon on Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?

Age and popularity of an idea or practice have some predictive power as to how useful it has been. Old and surviving is some evidence. Popular is some evidence. Old and NOT popular is conflicting evidence - it's useful (or at least not very harmful) to some, perhaps limited by context or covariant factors that don't apply elsewhere.

Whether your interpretation of a practice will get benefits for you should probably be determined by more specific analysis than "it worked for a small set of people in a very different environment, and never caught on universally".

hastings-greer on The hostile telepaths problem

Organizations and communities can also face hostile telepaths. My pet theory that sort of crystalized while reading this is that p-hacking is academia’s response to a hostile telepath that banned publication of negative results.

This of course sucks for non-traditional researchers, and especially journalists, who don’t even subconsciously know that a p=0.05002, r=1e-7 “breakthrough in finding relationship between milk consumption and toenail fungus” is code for “We have conclusively found no effect and want to broadcast to the community that there is no effect here, yet we cannot ever consciously acknowledge that we found nothing, because our mortgages depend on fooling a hostile telepath into believing this is something.”

raemon on The Median Researcher Problem

This claim looks like it's implying that research communities can build better-than-median selection pressures, but can they? And if so, why have we hypothesized that scientific fields don't?

I'm a bit surprised this is the crux for you. Smaller communities have a lot more control over their gatekeeping because, like, they control it themselves, whereas the larger field's gatekeeping is determined via open-ended incentives in the broader world that thousands (maybe millions?) of people have influence over. (There are also things you could do in addition to gatekeeping. See Selective, Corrective, Structural: Three Ways of Making Social Systems Work.)

(This doesn't mean smaller research communities automatically have good gatekeeping or other mechanisms, but how to do better doesn't feel like a very confusing or mysterious problem.)

donatas-luciunas on Claude seems to be smarter than LessWrong community

Claude probably read that material, right? If it finds my observations unique and serious, then maybe they are unique and serious? I'll share another chat next time.

waddington on Bitter lessons about lucid dreaming

Even just writing down loose associations and your emotional state is enough; that's how you get the ball rolling. Try it for two weeks even if it feels useless. Unless you're taking antidepressants, in which case this might actually be ineffective. I know this doesn't sound worthwhile, but I know from experience (mine and others') that it usually works.

donatas-luciunas on Claude seems to be smarter than LessWrong community

How can I put in little effort but be perceived as someone worth listening to? I thought of announcing a monetary prize for anyone who could find an error in my reasoning 😅