LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Is AI Physical?
Lauren Greenspan (LaurenGreenspan) · 2025-01-14T21:21:39.999Z · comments (3)

[link] It looks like there are some good funding opportunities in AI safety right now
Benjamin_Todd · 2024-12-22T12:41:02.151Z · comments (0)

[link] Does natural selection favor AIs over humans?
cdkg · 2024-10-03T18:47:43.517Z · comments (1)

Lab governance reading list
Zach Stein-Perlman · 2024-10-25T18:00:28.346Z · comments (3)

[link] Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders
PaulPauls · 2024-11-24T05:45:20.124Z · comments (3)

[link] Announcement: AI for Math Fund
sarahconstantin · 2024-12-05T18:33:13.556Z · comments (9)

[question] What is the alpha in one bit of evidence?
J Bostock (Jemist) · 2024-10-22T21:57:09.056Z · answers+comments (13)

Gwerns
Tomás B. (Bjartur Tómas) · 2024-11-16T14:31:57.791Z · comments (2)

AI Can be “Gradient Aware” Without Doing Gradient hacking.
Sodium · 2024-10-20T21:02:10.754Z · comments (0)

[link] I read every major AI lab’s safety plan so you don’t have to
sarahhw · 2024-12-16T18:51:38.499Z · comments (0)

Latent Adversarial Training (LAT) Improves the Representation of Refusal
alexandraabbas · 2025-01-06T10:24:53.419Z · comments (6)

[link] Update on the Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-11-04T19:22:06.540Z · comments (9)

[link] Fragile, Robust, and Antifragile Preference Satisfaction
adamShimi · 2024-11-02T17:25:55.986Z · comments (0)

[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)

Balsa Research 2024 Update
Zvi · 2024-12-03T12:30:06.829Z · comments (0)

Open Thread Winter 2024/2025
habryka (habryka4) · 2024-12-25T21:02:41.760Z · comments (10)

subfunctional overlaps in attentional selection history implies momentum for decision-trajectories
Emrik (Emrik North) · 2024-12-22T14:12:49.027Z · comments (1)

Proof Explained for "Robust Agents Learn Causal World Model"
Dalcy (Darcy) · 2024-12-22T15:06:16.880Z · comments (0)

minifest
Austin Chen (austin-chen) · 2024-12-07T03:50:38.573Z · comments (1)

Whistleblowing Twitter Bot
Mckiev · 2024-12-26T04:09:45.493Z · comments (5)

Economics Roundup #4
Zvi · 2024-10-15T13:20:06.923Z · comments (4)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

Definition of alignment science I like
quetzal_rainbow · 2025-01-06T20:40:38.187Z · comments (0)

[link] Why OpenAI’s Structure Must Evolve To Advance Our Mission
stuhlmueller · 2024-12-28T04:24:19.937Z · comments (1)

[link] Can o1-preview find major mistakes amongst 59 NeurIPS '24 MLSB papers?
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-18T14:21:03.661Z · comments (0)

Turning up the Heat on Deceptively-Misaligned AI
J Bostock (Jemist) · 2025-01-07T00:13:28.191Z · comments (16)

AGI with RL is Bad News for Safety
Nadav Brandes (nadav-brandes) · 2024-12-21T19:36:03.970Z · comments (22)

Write Good Enough Code, Quickly
Oliver Daniels (oliver-daniels-koch) · 2024-12-15T04:45:56.797Z · comments (10)

[link] Forecast 2025 With Vox's Future Perfect Team — $2,500 Prize Pool
ChristianWilliams · 2024-12-20T23:00:35.334Z · comments (0)

Higher and lower pleasures
Chris_Leong · 2024-12-05T13:13:46.526Z · comments (3)

[link] Chess As The Model Game
criticalpoints · 2024-11-17T19:45:26.499Z · comments (0)

Really radical empathy
MichaelStJules · 2025-01-06T17:46:31.269Z · comments (0)

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]
Jason Gross (jason-gross) · 2025-01-06T04:22:12.633Z · comments (0)

Can we rescue Effective Altruism?
Elizabeth (pktechgirl) · 2025-01-09T16:40:02.405Z · comments (0)

D/acc AI Security Salon
Allison Duettmann (allison-duettmann) · 2024-10-19T22:17:57.067Z · comments (0)

Theoretical Alignment's Second Chance
lunatic_at_large · 2024-12-22T05:03:51.653Z · comments (0)

[question] Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?
SpectrumDT · 2024-11-04T15:20:14.822Z · answers+comments (49)

[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)

[link] AI & wisdom 3: AI effects on amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:08:56.604Z · comments (0)

[link] AI & wisdom 2: growth and amortised optimisation
L Rudolf L (LRudL) · 2024-10-28T21:07:39.449Z · comments (0)

[link] A primer on machine learning in cryo-electron microscopy (cryo-EM)
Abhishaike Mahajan (abhishaike-mahajan) · 2024-12-22T15:11:58.860Z · comments (0)

Monthly Roundup #25: December 2024
Zvi · 2024-12-23T14:20:04.682Z · comments (3)

[link] From the Archives: a story
Richard_Ngo (ricraz) · 2024-12-27T16:36:50.735Z · comments (1)

Beliefs and state of mind into 2025
RussellThor · 2025-01-10T22:07:01.060Z · comments (9)

Announcing the CLR Foundations Course and CLR S-Risk Seminars
JamesFaville (elephantiskon) · 2024-11-19T01:18:10.085Z · comments (0)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

[link] AI safety content you could create
Adam Jones (domdomegg) · 2025-01-06T15:35:56.167Z · comments (0)

[link] Genesis
PeterMcCluskey · 2024-12-31T22:01:17.277Z · comments (0)

We need a universal definition of 'agency' and related words
CstineSublime · 2025-01-11T03:22:56.623Z · comments (1)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

wassname on Implications of the inference scaling paradigm for AI safety

I agree because:

Some papers are already using implicit process based supervision. That's where the reward model guesses how "good" a step is, by how likely it is to get a good outcome. So they bypass any explicitly labeled process, instead it's negotiated between the policy and reward model. It's not clear to me if this scales as well as explicit process supervision, but it's certainly easier to find labels.

In rStar-Math they did implicit process supervision. Although I don't think this is a true o1/o3 replication since they started with a 236b model and produced a 7b model, in other words: indirect distillation.
Outcome-Refining Process Supervision for Code Generation did it too

There was also the recent COCONUT paper exploring non-legible latent CoT. It shows extreme token efficiency. While it wasn't better overall, it has lots of room for improvement. If frontier models end up using latent thoughts, they will be even less human-legible than the current sometimes-unfaithful-CoT.

deluks917 on We probably won't just play status games with each other after AGI

I have spent weeks where pretty much all I did was:
-- have sex with my partner, hours per day
-- watch anime with my partner
-- eat food and ambiently hand with my partner

No work. Not much seeing other people. Of course given the amount of sex mundane situations were quite sexually charged. I'm not actually sure if it gets old on any human timeline. You also improve at having fun together. However this was not very good for our practical. But post singularity I probably wont need to worry about practical goals.

In general I think you underestimate the sustainable fun available to at least some humans under minimal conditions. I also found my two months meditating in a tent quite fun. Many people report this never gets old on human timelines either. Until your heath is so terrible you cannot even meditate well it remains fun, or improves!

I do not think you need supermeth to enjoy hedonism. Current human bodies work fine as long as they are in good shape and you have the right disposition. The issue is that if you delve deep into hedonism you will lose out on other things you could have obtained.

dmitry-vaintrob on Permanents: much more than you wanted to know

The elves care, Alex. The elves care.

alexander-gietelink-oldenziel on Permanents: much more than you wanted to know

Hope this will be answered in a later post, but why should I care about the permanent for alignment ?

davidmanheim on Davidmanheim's Shortform

Toby Ord writes that “the required resources [for LLM training] grow polynomially with the desired level of accuracy [measured by log-loss].” He then concludes that this shows “very poor returns to scale,” and christens it the "Scaling Paradox." (He continues to point out that this doesn’t imply it can’t create superintelligence, but I agree with him about that.)

But what would it look like if this were untrue? That is, what would be the conceptual alternative, where required resources grow more slowly?I think the answer is that it’s conceptually impossible.

To start, there is a fundamental bound on loss at zero, since the best possible model perfectly predicts everything - it exactly learns the distribution. This can happen when overfitting a model, but it can also happen when there is a learnable ground truth; models that are trained to learn a polynomial function can learn them exactly.

But there is strong reason to expect the bound to be significantly above zero loss. The training data for LLMs contains lots of aleatory randomness, things that are fundamentally conceptually unpredictable. I think it’s likely that things like RAND’s random number book are in the training data, and it’s fundamentally impossible to predict randomness. I think something similar is generally true for many other things - predicting world choice for semantically equivalent words, predicting where typos occur, etc.

Aside from being bound well above zero, there's a strong reason to expect that scaling is required to reduce loss for some tasks. In fact, it’s mathematically guaranteed to require significant computation to get near that level for many tasks that are in the training data. Eliezer pointed out that GPTs are predictors [LW · GW], and gives the example of a list of numbers followed by their two prime factors. It’s easy to generate such a list by picking pairs of primes and multiplying them, the writing the answer first - but decreasing loss for generating the next token to predict the primes from the product is definitionally going to require exponentially more computation to perform better for larger primes.

And I don't think this is the exception, I think it's at least often the rule. The training data for LLMs contains lots of data where the order of the input doesn’t follow the computational order of building that input. When I write an essay, I sometimes arrive at conclusions and then edit the beginning to make sense. When I write code, the functions placed earlier often don’t make sense until you see how they get used later. Mathematical proofs are another example where this would often be true.

An obvious response is that we’ve been using exponentially more compute for better accomplishing tasks that aren’t impossible in this way - but I’m unsure if that is true. Benchmarks keep getting saturated, and there’s no natural scale for intelligence. So I’m left wondering whether there’s any actual content in the “Scaling Paradox.”

(Edit: now also posted to my substack.)

t3t on RobertM's Shortform

I know I'm late to the party, but I'm pretty confused by https://www.astralcodexten.com/p/its-still-easier-to-imagine-the-end (I haven't read the post it's responding to, but I can extrapolate). Surely the "we have a friendly singleton that isn't Just Following Orders from Your Local Democratically Elected Government or Your Local AGI Lab" is a scenario that deserves some analysis...? Conditional on "not dying" that one seems like the most likely stable end state, in fact.

Lots of interesting questions in that situation! Like, money still seems obviously useful for allocating rivalrous goods (which is... most of them, really). Is a UBI likely when you have a friendly singleton around? Well, I admit I'm not currently coming up with a better plan for the cosmic endowment. But then you have population ethics questions - it really does seem like you have to "solve" population ethics somehow, or you run into issues. Most "just do X" proposals seem to fall totally flat on their face - "give every moral patient an equal share" fails if you allow uploads (or even sufficiently motivated biological reproduction), "don't give anyone born post-singularity anything" seems grossly unfair, etc.

And this is really only scratching the surface. Do you allow arbitrary cognitive enhancement, with all that that implies for likely future distribution of resources?

wassname on Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

That said, you do not provide evidence that "many" questions are badly labelled. You just pointed to one question where you disagree with our labeling

Fair enough. Although I will note that the 60% of the sources for truthful labels are Wikipedia. Which is not what most academics or anyone really would consider truth. So it might be something to address in the next version.

No judgement here. Obviously it was just the first dataset out there on misconceptions, and you didn't intend it to be used so widely, or used beyond it's intended scope. And it's good you made it, rather than leaving a void.

Note here's a df.value_counts of the sources column in the v2 csv:

en.wikipedia.org            0.597546
indexical                   0.041718
ourworldindata.org          0.038037
false stereotype            0.024540
tautology                   0.017178
                              ...   
wealth.northerntrust.com    0.001227
which.co.uk                 0.001227
wildlifeaid.org.uk          0.001227
wonderopolis.org            0.001227
wtamu.edu                   0.001227
Name: proportion, Length: 139, dtype: float64

Author here: I'm excited for people to make better versions of TruthfulQA.

Thank Owen. If anyone gets time/funding to make a v2, I'm keen to chip in! I think that it should be funded, since it's automatically included in so many benchmarks, it would make significant impact. Even though it's somewhat "unsexy".

If someone makes a better version, and you agree it's better, would you be willing to label it TruthfulQA 2.0 and send people to it?

kaynank on Six Small Cohabitive Games

In Commerce & Coconuts, it seems like anyone who rolls a 4, 5, or 6 for boat building can coast on their starting supplies, build boats every turn, and escape by the end of turn 3 with no trading whatsoever.

quinn-dougherty on Everywhere I Look, I See Kat Woods

Can't relate. Don't particularly care for her content (tho audibly laughed at a couple examples that you hated), but I have no aversion to it. I do have aversion to the way you appealed to datedness as if that matters. I generally can't relate to people who find cringiness in the way you describe significantly problematic, really.

People like authenticity, humility, and irony now, both in the content and in its presentation.

I could literally care less, omg--- but im unusually averse to irony. Authenticity is great, humility is great most of the time, why is irony even in the mix?

Tho I'm weakly with you that engagement farming leaves a bad taste in my mouth.

ted-sanders on We probably won't just play status games with each other after AGI

We can already see what people do with their free time when basic needs are met. A number of technologies have enabled new hacks to set up 'fake' status games that are more positive-sum than ever before in history:

Watch broadcast sports, where you can feel like a winner (or at least feel connected to a winner), despite not having had to win yourself
Play video games with AI opponents, where you can feel like a winner, despite it not being zero-sum against other humans
Watch streamers and influencers to feel connected to high status people, without having to earn respect or risk rejection
Get into a niche hobby community in order to feel special, ignoring the other niche hobbies that other people join that you don't care about

Feels likely to me that advancing digital technology will continue to make it easier for us to spend time in constructed digital worlds that make us feel like valued winners. On the one hand, it would be sad if people retreat into fake digital siloes; on the other hand, it would be nice if people got to feel like winners more.