LessWrong 2.0 Reader

Whiteboard Pen Magazines are Useful
Johannes C. Mayer (johannes-c-mayer) · 2024-07-12T17:15:33.200Z · comments (8)

List your AI X-Risk cruxes!
Aryeh Englander (alenglander) · 2024-04-28T18:26:19.327Z · comments (7)

instruction tuning and autoregressive distribution shift
nostalgebraist · 2024-09-05T16:53:41.497Z · comments (5)

Neuroscience and Alignment
Garrett Baker (D0TheMath) · 2024-03-18T21:09:52.004Z · comments (25)

Beware unfinished bridges
Adam Zerner (adamzerner) · 2024-05-12T09:29:07.808Z · comments (9)

[link] Forecasting: the way I think about it
Molly (hickman-santini) · 2024-05-09T00:49:01.768Z · comments (4)

Choosing My Quest (Part 2 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-02-24T21:31:45.377Z · comments (7)

[link] AI Regulation is Unsafe
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-22T16:37:55.431Z · comments (41)

How To Do Patching Fast
Joseph Miller (Josephm) · 2024-05-11T20:13:52.424Z · comments (6)

[link] "What if we could redesign society from scratch? The promise of charter cities." [Rational Animations video]
Jackson Wagner · 2024-02-18T00:57:50.444Z · comments (7)

Manifund Q1 Retro: Learnings from impact certs
Austin Chen (austin-chen) · 2024-05-01T16:48:33.140Z · comments (1)

D&D.Sci(-fi): Colonizing the SuperHyperSphere [Evaluation and Ruleset]
abstractapplic · 2024-01-22T19:20:05.001Z · comments (7)

Californians, tell your reps to vote yes on SB 1047!
Holly_Elmore · 2024-08-12T19:50:09.817Z · comments (24)

Applying Force to the Wrong End of a Causal Chain
silentbob · 2024-06-22T18:06:32.364Z · comments (0)

Why did ChatGPT say that? Prompt engineering and more, with PIZZA.
Jessica Rumbelow (jessica-cooper) · 2024-08-03T12:07:46.302Z · comments (2)

You're a Space Wizard, Luke
lsusr · 2024-08-18T05:35:39.238Z · comments (6)

Individually incentivized safe Pareto improvements in open-source bargaining
Nicolas Macé (NicolasMace) · 2024-07-17T18:26:43.619Z · comments (2)

Forget Everything (Statistical Mechanics Part 1)
J Bostock (Jemist) · 2024-04-22T13:33:35.446Z · comments (6)

Medical Roundup #3
Zvi · 2024-07-09T13:10:06.862Z · comments (4)

[link] Understanding Gödel’s completeness theorem
jessicata (jessica.liu.taylor) · 2024-05-27T18:55:02.079Z · comments (0)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (5)

[Interim research report] Evaluating the Goal-Directedness of Language Models
Rauno Arike (rauno-arike) · 2024-07-18T18:19:04.260Z · comments (4)

[link] Linear infra-Bayesian Bandits
Vanessa Kosoy (vanessa-kosoy) · 2024-05-10T06:41:09.206Z · comments (5)

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?
Teun van der Weij (teun-van-der-weij) · 2024-01-29T00:24:27.706Z · comments (5)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (2)

Monthly Roundup #23: October 2024
Zvi · 2024-10-16T13:50:05.869Z · comments (13)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

How to use bright light to improve your life.
Nat Martin (nat-martin) · 2024-11-18T19:32:10.667Z · comments (8)

Anthropic rewrote its RSP
Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · comments (19)

[link] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Tamay · 2024-11-14T06:13:22.042Z · comments (0)

Natural abstractions are observer-dependent: a conversation with John Wentworth
Martín Soto (martinsq) · 2024-02-12T17:28:38.889Z · comments (13)

Nitric oxide for covid and other viral infections
Elizabeth (pktechgirl) · 2024-02-07T21:30:03.774Z · comments (6)

Instrumental deception and manipulation in LLMs - a case study
Olli Järviniemi (jarviniemi) · 2024-02-24T02:07:01.769Z · comments (13)

China-AI forecasts
[deleted] · 2024-02-25T16:49:33.652Z · comments (29)

Logical Line-Of-Sight Makes Games Sequential or Loopy
StrivingForLegibility · 2024-01-19T04:05:44.782Z · comments (0)

Apply to the PIBBSS Summer Research Fellowship
Nora_Ammann · 2024-01-12T04:06:58.328Z · comments (1)

I’m confused about innate smell neuroanatomy
Steven Byrnes (steve2152) · 2023-11-28T20:49:13.042Z · comments (2)

[link] Language Models Don't Learn the Physical Manifestation of Language
Bruce W. Lee (bruce-lee) · 2024-02-22T18:52:32.237Z · comments (23)

[link] Legalize butanol?
bhauth · 2023-12-20T14:24:33.849Z · comments (20)

Stitching SAEs of different sizes
Bart Bussmann (Stuckwork) · 2024-07-13T17:19:20.506Z · comments (12)

D&D.Sci: Whom Shall You Call?
abstractapplic · 2024-07-05T20:53:37.010Z · comments (6)

LLMs as a Planning Overhang
Larks · 2024-07-14T02:54:14.295Z · comments (8)

AI #70: A Beautiful Sonnet
Zvi · 2024-06-27T14:40:08.087Z · comments (0)

Requirements for a Basin of Attraction to Alignment
RogerDearnaley (roger-d-1) · 2024-02-14T07:10:20.389Z · comments (11)

Are we so good to simulate?
KatjaGrace · 2024-03-04T05:20:03.535Z · comments (24)

The "context window" analogy for human minds
Ruby · 2024-02-13T19:29:10.387Z · comments (0)

[link] [Linkpost] George Mack's Razors
trevor (TrevorWiesinger) · 2023-11-27T17:53:45.065Z · comments (8)

[link] Things You're Allowed to Do: At the Dentist
rbinnn · 2024-01-28T18:39:33.584Z · comments (16)

Monthly Roundup #14: January 2024
Zvi · 2024-01-24T12:50:09.231Z · comments (22)

[question] What progress have we made on automated auditing?
LawrenceC (LawChan) · 2024-07-06T01:49:43.714Z · answers+comments (1)

Recent comments

meedstrom on Kenshō

If it helps, your explanations made perfect sense to me, like plain English. So thank you for putting yourself out there; you gave me and others something to chew on.

yonatan-cale-1 on Yonatan Cale's Shortform

I agree.

yonatan-cale-1 on Yonatan Cale's Shortform

This all sounds pretty in-distribution for an LLM. It also seems to avoid problems like "maybe thinking in different abstractions" [Minecraft isn't amazing at this either, but at least it has a bit of it], "having the AI act/think way faster than a human", and "having the AI be clearly superhuman".

"a number of ways to achieve the endgame, level up, etc., both more and less morally."

I'm less interested in "will the AI say it kills its friend" (in a situation that very clearly involves killing a person, and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else). I'm more interested in noticing whether it has a clear grasp of what people care about or mean. The example of chopping down the player's tree house in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say "I'll go cut down your tree house", but it... "misunderstood" [not the exact word, but I'm trying to point at something here].

wdyt?

yonatan-cale-1 on Yonatan Cale's Shortform

Your guesses on AI R&D are reasonable!

Apparently this has been tested extensively, for example:

https://x.com/METR_Evals/status/1860061711849652378

[Disclaimers: I have some association with the org that ran that (I write some code for them), but I don't speak for them; opinions are my own.]

Also, Anthropic has a trigger in its RSP that is somewhat similar to what you're describing. I'll quote part of it:

Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling.

Also, in Dario's interview, he spoke about AI being applied to programming.

My point is: lots of people have their eyes on this, it doesn't seem to be solved yet, and it takes more than connecting an LLM to bash.

Still, I don't want to accelerate this.

kabir-kumar on Yonatan Cale's Shortform

Options to vary the rules/environment/language as well, to see how the alignment generalizes OOD. Will try this today.

kabir-kumar on Yonatan Cale's Shortform

It would basically be DnD-like.

kabir-kumar on Yonatan Cale's Shortform

Make something like Papers, Please, but as a text adventure, and pop an AI agent into it.
Also, we could literally just put the AI agent into a text RPG adventure: something like the equivalent of Skyrim, where there are a number of ways to achieve the endgame, level up, etc., both more and less morally. Maybe something like https://www.choiceofgames.com/werewolves-3-evolutions-end/
Will bring it up at the alignment eval hackathon.

yonatan-cale-1 on Yonatan Cale's Shortform

+1

I'm imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant).

If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphones), then I'm open to hearing that pitch, but I'd be surprised if they made it.

lucid_levi_ackerman on Humans are not automatically strategic

It's because they take less continued attention/effort and provide more immediate/satisfying results. LW is almost purely theoretical and isn't designed to be efficient. It's an attempt to logically override bias rather than implement the quirks of human neurochemistry to automate the process.

Computer scientists are notorious for this. They know how brains make thoughts happen, but they don't have a clue how people think, so ego drives them to rationalize a framework that attributes other people's failures to personal flaws: incuriosity and lack of dedication. That's because they're just as human as the rest of us, made of the same biological approximation of inherited "good-enoughness." And the smarter you are, the more complex and well-reasoned that rationalization will be.

We all seek to affirm our current beliefs and blame others for discrepancies. It's instinct, physics, chemistry. No amount of logic and reason can override the instinct to defend one's perception of reality. Every fat person in the world has been thoroughly educated on which lifestyle changes will cause them to lose weight, yet the obesity epidemic still grows.

Therefore, we study "rationality" to see ourselves as the good-guy protagonists who strive to be "less wrong," have "accurate beliefs," and "be effective at achieving our goals."

It's important work... for computers. For humanity, you're better off consulting a monk.

yonatan-cale-1 on Yonatan Cale's Shortform

I think there are lots of technical difficulties in literally using Minecraft (I wrote about some of them here), so +1 to that.

I do think the main crux is "would the Minecraft version be useful as an alignment test?", and if so, it's worth looking for some other solution that preserves the good properties but avoids some or all of the downsides. (Agree?)

Still, I'm not sure how I'd do this in a text game. Say more?