LessWrong 2.0 Reader

DIY RLHF: A simple implementation for hands on experience
Mike Vaiana (mike-vaiana) · 2024-07-10T12:07:03.047Z · comments (0)

Monthly Roundup #19: June 2024
Zvi · 2024-06-25T12:00:03.333Z · comments (9)

[link] Memo on some neglected topics
Lukas Finnveden (Lanrian) · 2023-11-11T02:01:55.834Z · comments (2)

{Book Summary} The Art of Gathering
Tristan Williams (tristan-williams) · 2024-04-16T10:48:41.528Z · comments (0)

[link] Conversation Visualizer
ethanmorse · 2023-12-31T01:18:01.424Z · comments (4)

Can quantised autoencoders find and interpret circuits in language models?
charlieoneill (kingchucky211) · 2024-03-24T20:05:50.125Z · comments (4)

Collection (Part 6 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-14T21:37:00.160Z · comments (0)

Experiments with an alternative method to promote sparsity in sparse autoencoders
Eoin Farrell · 2024-04-15T18:21:48.771Z · comments (7)

Ackshually, many worlds is wrong
tailcalled · 2024-04-11T20:23:59.416Z · comments (42)

Reading More Each Day: A Simple $35 Tool
aysajan · 2024-07-24T13:54:04.290Z · comments (2)

Updates to Open Phil’s career development and transition funding program
abergal · 2023-12-04T18:10:29.394Z · comments (0)

Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey
Andy_McKenzie · 2024-02-29T14:47:28.613Z · comments (6)

AI #64: Feel the Mundane Utility
Zvi · 2024-05-16T15:20:02.956Z · comments (11)

[link] New blog: Expedition to the Far Lands
Connor Leahy (NPCollapse) · 2024-08-17T11:07:48.537Z · comments (3)

Employee Incentives Make AGI Lab Pauses More Costly
nikola (nikolaisalreadytaken) · 2023-12-22T05:04:15.598Z · comments (12)

Auditing LMs with counterfactual search: a tool for control and ELK
Jacob Pfau (jacob-pfau) · 2024-02-20T00:02:09.575Z · comments (6)

Cicadas, Anthropic, and the bilateral alignment problem
kromem · 2024-05-22T11:09:56.469Z · comments (6)

3. Premise three & Conclusion: AI systems can affect value change trajectories & the Value Change Problem
Nora_Ammann · 2023-10-26T14:38:14.916Z · comments (4)

An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)

[question] How did you integrate voice-to-text AI into your workflow?
ChristianKl · 2023-11-20T12:01:37.696Z · answers+comments (12)

Online Dialogues Party — Sunday 5th November
Ben Pace (Benito) · 2023-10-27T02:41:00.506Z · comments (1)

[link] Cellular reprogramming, pneumatic launch systems, and terraforming Mars: Some things I learned about at Foresight Vision Weekend
jasoncrawford · 2024-01-04T19:33:57.887Z · comments (0)

Escaping Skeuomorphism
Stuart Johnson (stuart-johnson) · 2023-12-20T03:51:00.489Z · comments (0)

Heuristics for preventing major life mistakes
SK2 (lunchbox) · 2023-12-20T08:01:09.340Z · comments (2)

Aggregative principles approximate utilitarian principles
Cleo Nardo (strawberry calm) · 2024-06-12T16:27:22.179Z · comments (3)

I played the AI box game as the Gatekeeper — and lost
datawitch · 2024-02-12T18:39:35.777Z · comments (52)

Solstice 2023 Roundup
dspeyer · 2023-10-11T23:09:08.252Z · comments (6)

[link] Quick Thoughts on Scaling Monosemanticity
Joel Burget (joel-burget) · 2024-05-23T16:22:48.035Z · comments (1)

AI #65: I Spy With My AI
Zvi · 2024-05-23T12:40:02.793Z · comments (7)

Deconfusing “ontology” in AI alignment
Dylan Bowman (dylan-bowman) · 2023-11-08T20:03:43.205Z · comments (3)

Evaporation of improvements
Viliam · 2024-06-20T18:34:40.969Z · comments (27)

[link] AI Impacts 2023 Expert Survey on Progress in AI
habryka (habryka4) · 2024-01-05T19:42:17.226Z · comments (1)

My Dating Heuristic
Declan Molony (declan-molony) · 2024-05-21T05:28:40.197Z · comments (4)

Probably Not a Ghost Story
George Ingebretsen (george-ingebretsen) · 2024-06-12T22:55:26.264Z · comments (4)

[question] Supposing the 1bit LLM paper pans out
O O (o-o) · 2024-02-29T05:31:24.158Z · answers+comments (11)

Uncertainty in all its flavours
Cleo Nardo (strawberry calm) · 2024-01-09T16:21:07.915Z · comments (6)

Appraising aggregativism and utilitarianism
Cleo Nardo (strawberry calm) · 2024-06-21T23:10:37.014Z · comments (10)

Without Fundamental Advances, Rebellion and Coup d'État are the Inevitable Outcomes of Dictators & Monarchs Trying to Control Large, Capable Countries
Roko · 2024-01-31T10:14:02.042Z · comments (34)

[link] Link Collection: Impact Markets
Saul Munn (saul-munn) · 2023-12-26T09:01:48.815Z · comments (0)

EA Infrastructure Fund's Plan to Focus on Principles-First EA
Linch · 2023-12-06T03:24:55.844Z · comments (0)

[link] [Linkpost] Concept Alignment as a Prerequisite for Value Alignment
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-11-04T17:34:36.563Z · comments (0)

[link] Goodhart's Law Example: Training Verifiers to Solve Math Word Problems
Chris_Leong · 2023-11-25T00:53:26.841Z · comments (2)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

[Linkpost] Play with SAEs on Llama 3
Tom McGrath · 2024-09-25T22:35:44.824Z · comments (1)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

[link] David Burns Thinks Psychotherapy Is a Learnable Skill. Git Gud.
Morpheus · 2024-01-27T13:21:05.068Z · comments (20)

D&D.Sci Hypersphere Analysis Part 1: Datafields & Preliminary Analysis
aphyer · 2024-01-13T20:16:39.480Z · comments (1)

[link] AISN #30: Investments in Compute and Military AI Plus, Japan and Singapore’s National AI Safety Institutes
aogara (Aidan O'Gara) · 2024-01-24T19:38:33.461Z · comments (1)

The economy is mostly newbs (strat predictions)
lukehmiles (lcmgcd) · 2024-02-01T19:15:49.420Z · comments (6)


Recent comments

nisan on Nisan's Shortform

A fun lesson on the scientific method: A few days ago, OpenAI announced a rollout of "Advanced Voice Mode". Confused by the messaging, I ran a test to see if their phone app could respond to an interruption:

Nisan: Can you count to 10, and then I'll interrupt you in the middle and see how that works.

ChatGPT 4o: Sure! I'll start counting now: One, two, three, four, fiv—

Nisan: Stop. Interruption.

ChatGPT 4o: How'd the interruption go?

Nisan: Oh, that went great. You stopped at five.

ChatGPT 4o: Awesome! Glad it worked smoothly. Do you want to try another test or experiment?

Nisan: Yeah, let's do another interruption test. This time, please count to 20.

ChatGPT 4o: Got it! Starting now: One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt—

Nisan: Stop. Interruption.

ChatGPT 4o: How was that?

Nisan: Wow. [Thinks: It seemed to stop slightly before I interrupted it!]

ChatGPT 4o: Glad that impressed you! Want to try something else?

Nisan: Uh, hold on.

ChatGPT 4o: Take your time!

Nisan: [Generates a random number.] Okay, please count to 30, and I'll interrupt at a random point.

ChatGPT 4o: Alright, starting now: One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen—

[Nisan interrupts at some point, I forgot exactly where.]

ChatGPT 4o: —sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, twenty-five, twenty-six, twenty-seven, twenty-eight, twenty-nine, thirty. Looks like I made it to the end! Did you mean to interrupt somewhere?

Nisan: Yeah, I did. It must not have worked. Let's try that again. [Generates a random number.] Please count up to 30.

ChatGPT 4o: No problem, let's try again! Starting now: One, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen—

[Nisan interrupts at some point.]

ChatGPT 4o: —sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, twenty-fiv— Did it work this time?

Nisan: Uh, no.

I believe ChatGPT can't hear me when it's talking. But it correctly guessed when I would interrupt on the first two trials!

elityre on How to Give in to Threats (without incentivizing them)

But if you think taxes are unfair, you can leave the country and pay some different taxes somewhere else instead of going to jail.

It's quite difficult to do that in the US, at least. You pay taxes if you're a citizen, even if you're not a resident, and you're required to pay taxes for the 10 years following your renouncing citizenship.

As far as I know, there's no way for US citizens to leave the US tax regime within a decade.

benito on [Completed] The 2024 Petrov Day Scenario

Petrov Day retro is mostly done, have a bit more coding to do before we publish it, so it's looking like we'll publish it tomorrow. Stay tuned!

gwern on Why is o1 so deceptive?

"Plausible" is a very ambiguous word. (Bayesianism has been defined as "a logic of plausible inference", but hopefully that doesn't mean Bayesians just confabulate everything.) It can mean "reasonable" for example: "Yeah, Sally's brownie recipe is a reasonable reference to include here, let's go with it." Since 4o doesn't seem to think it's a 'fake' URL in contrast to 'actual' URLs, it's not necessarily a contrast. (It could refer to still other things - like you might not actually know there is a 'brownies' Wikipedia URL, having never bothered to look it up or happened to have stumbled across it, but without retrieving it right this second, you could surely tell me what it would be and that it would be a relevant answer, and so it would be both reasonable and plausible to include.)

sharmake-farah on Ruby's Quick Takes

I basically agree with this:

People who want to speed up AI will use falsehoods and bad logic to muddy the waters, and many people won’t be able to see through it
In other words, there’s going to be an epistemic war and the other side is going to fight dirty, I think even a lot of clear evidence will have a hard time against people’s motivations/incentives and bad arguments.

But I'd be more pessimistic than that, in that I honestly think pretty much every side will fight quite dirty in order to gain power over AI, and we already have seen examples of straight up lies and bad faith.

From the anti-regulation side, I remember Martin Casado straight up lying about mechanistic interpretability rendering AI models completely understood and white box, and I'm very sure that mechanistic interpretability cannot do what Martin Casado claimed.

I also remember a16z lying a lot about SB1047.

From the pro-regulation side, I remember Zvi incorrectly claiming that Sakana AI did instrumental convergence/recursive self-improvement, and as it turned out, the reality was far more mundane than that:

https://www.lesswrong.com/posts/ppafWk6YCeXYr4XpH/danger-ai-scientist-danger#AtXXgsws5DuP6Jxzx

Zvi then misrepresented what Apollo actually did, claiming that o1 was actually deceptively aligned/lying, when what Apollo did was a capability eval to see whether o1 was capable of lying/deceptive alignment. He also straight up lied in claiming that this was proof that Yudkowsky's proposed AI alignment problems are here and inevitable, a claim that is taken down in 2 comments:

https://www.lesswrong.com/posts/zuaaqjsN6BucbGhf5/gpt-o1#YRF9mcTFN2Zhne8Le

https://www.lesswrong.com/posts/zuaaqjsN6BucbGhf5/gpt-o1#AWXuFxjTkH2hASXPx

Overall, this has made me update in pretty negative directions concerning the epistemics of every side.

There's a core of people who have reasonable epistemics IMO on every side, but they are outnumbered by, and lack the force of, those that don't have good epistemics.

The reason I can remain optimistic despite it is that I believe we are progressing faster than that:

At present, we are making progress on the Technical Alignment Problem and like probably could solve it within 50 years.

Thankfully, I think we could probably solve it in 5-10 years, primarily because I believe 0 remaining insights are necessary to align AI. The work that needs to be done is in making large datasets about human values, because AIs are deeply affected by what their data sources are, and thus whoever controls the dataset controls the values of the AI.

mako-yass on AI #83: The Mask Comes Off

Theory: The reason OpenAI seem to not care about getting AGI right any more is that they've privately received explicit signals from the government that they won't be allowed to build AGI. This is pretty likely a priori, and also makes sense of what we are seeing.

There'd be an automatic conspiracy of reasons to avoid outwardly acknowledging this: 1) To keep the stock up, 2) To avoid accelerating the militarization (closing) of AI and the arms race (a very good reason. If I were Zvi, I would also consider avoiding acknowledging this for this reason, but I'm less famous than Zvi, so I get to acknowledge it), 3) To protect other staff from the demotivating effects of knowing this thing, that OpenAI will be reduced to a normal company that will never be allowed to release a truly great iteration of the product.

So instead what you'd see is people in leadership, one by one (as they internalize this), suddenly dropping the safety mission or leaving the company without really explaining why.

habryka4 on Why is o1 so deceptive?

I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly:

So, the assistant should [...] provide actual or plausible links.

The "plausible" here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.

fread2281 on Why is o1 so deceptive?

o1 CoT: The user is asking for more references about brownies. <Reasoning about what the references should look like> So, the assistant should list these references clearly, with proper formatting and descriptions, and provide actual or plausible links. Remember, the model cannot retrieve actual URLs, so should format plausible ones.

This might encourage it to make up links.

cole-wyeth on Work with me on agent foundations: independent fellowship

Cool, I'll DM you.

gwern on The Sun is big, but superintelligences will not spare Earth a little sunlight

Seems importantly wrong, in that if your objective is to take a few pawns (say, three), you can easily do this.

By... decreasing your chances of winning because you keep playing moves which increase your chance of taking pawns while trading off chances of winning, so Stockfish is happy to let you hang yourself all day long, driving your winning probability down to ε by the end of the game even faster than if you had played your hardest? I don't see how this is "importantly wrong". I too can sell dollar bills for $0.90 all day long, this doesn't somehow disprove markets or people being rational - quite the opposite.