LessWrong 2.0 Reader


[link] Suffering Is Not Pain
jbkjr · 2024-06-18T18:04:43.407Z · comments (45)

Difficulty classes for alignment properties
Jozdien · 2024-02-20T09:08:24.783Z · comments (5)

AXRP Episode 33 - RLHF Problems with Scott Emmons
DanielFilan · 2024-06-12T03:30:05.747Z · comments (0)

Monthly Roundup #12: November 2023
Zvi · 2023-11-14T15:20:06.926Z · comments (5)

Linear encoding of character-level information in GPT-J token embeddings
mwatkins · 2023-11-10T22:19:14.654Z · comments (4)

[link] Why Yudkowsky is wrong about "covalently bonded equivalents of biology"
titotal (lombertini) · 2023-12-06T14:09:15.402Z · comments (40)

Reflective consistency, randomized decisions, and the dangers of unrealistic thought experiments
Radford Neal · 2023-12-07T03:33:16.149Z · comments (25)

[link] GPT2, Five Years On
Joel Burget (joel-burget) · 2024-06-05T17:44:17.552Z · comments (0)

CHAI internship applications are open (due Nov 13)
Erik Jenner (ejenner) · 2023-10-26T00:53:49.640Z · comments (0)

LessWrong: After Dark, a new side of LessWrong
So8res · 2024-04-01T22:44:04.449Z · comments (5)

Intransitive Trust
Screwtape · 2024-05-27T16:55:29.294Z · comments (15)

What I Learned (Conclusion To "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-20T21:24:37.464Z · comments (0)

AI #56: Blackwell That Ends Well
Zvi · 2024-03-21T12:10:05.412Z · comments (16)

[link] math terminology as convolution
bhauth · 2023-10-30T01:05:11.823Z · comments (1)

D&D.Sci (Easy Mode): On The Construction Of Impossible Structures
abstractapplic · 2024-05-17T00:25:42.950Z · comments (12)

How to develop a photographic memory 1/3
PhilosophicalSoul (LiamLaw) · 2023-12-28T13:26:36.669Z · comments (6)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (33)

(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Sodium · 2024-10-03T19:11:58.032Z · comments (16)

[link] Book review: On the Edge
PeterMcCluskey · 2024-08-30T22:18:39.581Z · comments (0)

ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
25Hour (aaron-kaufman) · 2024-10-05T11:30:11.953Z · comments (2)

Augmenting Statistical Models with Natural Language Parameters
jsteinhardt · 2024-09-20T18:30:10.816Z · comments (0)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (4)

[link] Information dark matter
Logan Kieller (logan-kieller) · 2024-10-01T15:05:41.159Z · comments (4)

Proveably Safe Self Driving Cars [Modulo Assumptions]
Davidmanheim · 2024-09-15T13:58:19.472Z · comments (26)

The Cognitive Bootcamp Agreement
Raemon · 2024-10-16T23:24:05.509Z · comments (0)

My disagreements with "AGI ruin: A List of Lethalities"
Noosphere89 (sharmake-farah) · 2024-09-15T17:22:18.367Z · comments (44)

[link] Genocide isn't Decolonization
robotelvis · 2023-10-20T04:14:07.716Z · comments (19)

Effectively Handling Disagreements - Introducing a New Workshop
Camille Berger (Camille Berger) · 2024-04-15T16:33:50.339Z · comments (2)

AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them
Roman Leventov · 2023-12-27T14:51:37.713Z · comments (9)

An illustrative model of backfire risks from pausing AI research
Maxime Riché (maxime-riche) · 2023-11-06T14:30:58.615Z · comments (3)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Arjun Panickssery (arjun-panickssery) · 2024-01-15T21:21:03.962Z · comments (0)

Helpful examples to get a sense of modern automated manipulation
trevor (TrevorWiesinger) · 2023-11-12T20:49:57.422Z · comments (3)

Love, Reverence, and Life
Elizabeth (pktechgirl) · 2023-12-12T21:49:04.061Z · comments (7)

Disentangling four motivations for acting in accordance with UDT
Julian Stastny · 2023-11-05T21:26:22.514Z · comments (3)

One way violinists fail
Solenoid_Entity · 2024-05-29T04:08:17.675Z · comments (5)

[link] On Lies and Liars
Gabriel Alfour (gabriel-alfour-1) · 2023-11-17T17:13:03.726Z · comments (4)

We have promising alignment plans with low taxes
Seth Herd · 2023-11-10T18:51:38.604Z · comments (9)

ChatGPT 4 solved all the gotcha problems I posed that tripped ChatGPT 3.5
VipulNaik · 2023-11-29T18:11:53.252Z · comments (16)

Rational Animations offers animation production and writing services!
Writer · 2024-03-15T17:26:07.976Z · comments (0)

AI Safety Strategies Landscape
Charbel-Raphaël (charbel-raphael-segerie) · 2024-05-09T17:33:45.853Z · comments (1)

Update #2 to "Dominant Assurance Contract Platform": EnsureDone
moyamo · 2023-11-28T18:02:50.367Z · comments (2)

Boston Solstice 2023 Retrospective
jefftk (jkaufman) · 2024-01-02T03:10:05.694Z · comments (0)

Experimentation (Part 7 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-18T21:25:56.527Z · comments (0)

UDT1.01: Logical Inductors and Implicit Beliefs (5/10)
Diffractor · 2024-04-18T08:39:13.368Z · comments (2)

Monthly Roundup #16: March 2024
Zvi · 2024-03-19T13:10:05.529Z · comments (4)

More on the Apple Vision Pro
Zvi · 2024-02-13T17:40:05.388Z · comments (5)

Regrant up to $600,000 to AI safety projects with GiveWiki
Dawn Drescher (Telofy) · 2023-10-28T19:56:06.676Z · comments (1)

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”
Tony Wang (tw) · 2023-12-15T11:05:23.256Z · comments (8)


Recent comments

nathan-helm-burger on What AI companies should do: Some rough ideas

share good access with third-party evaluators

Glad to see my personal hobbyhorse got a mention!

Overall this is great, and I have just one concern about a possible what-if case that isn't covered.

Model development secrets being disproportionately important.

I know, it's a really tricky thing to deal with if this becomes the case. But I think it's worth having a 'what if' plan in place in case this suddenly happens. Imagine a lab is working on their research and some researcher stumbles across an algorithmic innovation which yields a million-fold improvement in training efficiency. The peak capability level of the model that can be trained for a few thousand dollars is now on par with their leading multi-billion dollar frontier model. Furthermore, this peak capability level is sufficient for the 3x speedup in AI R&D that you specified as being a threshold of high concern. Under such a circumstance, this secret becomes even more dangerous and valuable than the multi-billion-dollar-training-cost frontier model weights. Having a plan in place for this scenario seems wise.

Something like: don't tell your co-workers, report to so-and-so specific person whose job it is to handle evaluating and reporting up-the-chain about possible dangerous algorithmic developments. Probably you'd need a witness protection / isolation setup for the researchers who'd been exposed to the secret. You'd need to start planning around how soon others might stumble onto the same discovery, what other similar discoveries might be out there to be found, what government actions should be taken, etc.

I know this sounds like an implausible scenario to most people currently, but I think it's not something ruled out as physically impossible, and is worth having a what-if plan for.

error on What are your favorite books or blogs that are out of print, or whose domains have expired (especially if they also aren't on LibGen/Wayback/etc, or on Amazon)?

Not sure of the title. The tagline was "almost no one is evil; almost everything is broken." The address was http://blog.jaibot.com. Some specific essays originating there were "500 million, but not a single one more," "Foes Without Faces", and "The Copenhagen Interpretation of Ethics".

leogao on A Rocket–Interpretability Analogy

For the purposes of the original question of whether people are overinvesting in interp due to it being useful for capabilities and therefore being incentivized, I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.

Separately, it's also not clear to me that the diffuse intuitions from interpretability have actually helped people a lot with capabilities. Obviously this is very hard to attribute, and I can't speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What's an example of an interpretability work that you feel has affected capabilities intuitions a lot?

nathan-helm-burger on [Linkpost] Hawkish nationalism vs international AI power and benefit sharing

Thanks Naci, that's helpful clarification. An active call to change the odds, rather than a passive hoping that things will go well, does seem like a more robust plan.

leogao on A Rocket–Interpretability Analogy

SAE steering doesn't seem like it obviously beats other steering techniques in terms of usefulness. I haven't looked closely into Hyena but my prior is that subquadratic attention papers probably suck unless proven otherwise.

Interpretability is certainly vastly more appealing to lab leadership than weird philosophy, but it's vastly less appealing than RLHF. But there are many, many ML-flavored directions and only a few of them are any good, so it's not surprising that most directions don't get a lot of attention.

Probably as interp gets better it will start to be helpful for capabilities. I'm uncertain whether it will be more or less useful for capabilities than just working on capabilities directly. On the one hand, mechanistic understanding has historically underperformed as a research strategy; on the other hand, this could change once we have a sufficiently good mechanistic understanding.

noah-birnbaum on Noah Birnbaum's Shortform

Tyler Cowen often has really good takes (even some good stuff against AI as an x-risk!), but this was not one of them: https://marginalrevolution.com/marginalrevolution/2024/10/a-funny-feature-of-the-ai-doomster-argument.html

Title: A funny feature of the AI doomster argument

If you ask them whether they are short the market, many will say there is no way to short the apocalypse. But of course you can benefit from pending signs of deterioration in advance. At the very least, you can short some markets, or go long volatility, and then send those profits to Somalia to mitigate suffering for a few years before the whole world ends.

Still, in a recent informal debate at the wonderful Roots of Progress conference in Berkeley, many of the doomsters insisted to me that “the end” will come as a complete surprise, given the (supposed) deceptive abilities of AGI.

But note what they are saying. If markets will not fall at least partially in advance, they are saying the passage of time, and the events along the way, will not persuade anyone. They are saying that further contemplation of their arguments will not persuade any marginal investors, whether directly or indirectly. They are predicting that their own ideas will not spread any further.

I take those as signs of a pretty weak argument. “It will never get more persuasive than it is right now!” “There’s only so much evidence for my argument, and never any more!” Of course, by now most intelligent North Americans with an interest in these issues have heard these arguments and they are most decidedly not persuaded.

There is also a funny epistemic angle here. If the next say twenty years of evidence and argumentation are not going to persuade anyone else at the margin, why should you be holding this view right now? What is it that you know, that is so resistant to spread and persuasion over the course of the next twenty years?

I would say that to ask such questions is to answer them.

simon-lermen on Applying refusal-vector ablation to a Llama 3 70B agent

Hi Evan, I published this paper on arXiv recently, and it also got accepted at the SafeGenAI workshop at NeurIPS in December this year. Thanks for adding the link. I will probably work on the paper again and put an updated version on arXiv, as I am not quite happy with the current version.

I think that using the base model without instruction fine-tuning would prove bothersome for multiple reasons:

1. In the paper I use the new 3.1 models, which are fine-tuned for tool use; the base models were never fine-tuned to use tools through function calling.

2. Base models are highly random and hard to control; they are not really steerable. They require very careful prompting/conditioning to do anything useful.

3. I think current post-training basically improves all benchmarks.

I am also working on using such agents and directly evaluating how good they are at spear phishing humans: https://openreview.net/forum?id=VRD8Km1I4x

david-duvenaud on Sabotage Evaluations for Frontier Models

I basically agree with everything @Lukas Finnveden and @ryan_greenblatt said above. Another relevant "failure" state is that we might only be able to come up with expensive or capability-reducing mitigations, meaning everyone will have an incentive to disable or work around them.

d0themath on Alexander Gietelink Oldenziel's Shortform

Yeah, I meant medical/covid masks imply the wearer is diseased. I would have also believed the cat mask is a medical/covid mask if you hadn't given a different reason for wearing it, so it has that going against it in terms of coolness. It also has a lack of plausible deniability going against it. If you're wearing sunglasses, there's actually a utilitarian reason behind wearing them outside of just creating information asymmetry. If you're just trying to obscure half your face, there's no such plausible deniability. You're just trying to obscure your face, so it becomes far less cool.

nc-1 on There aren't enough smart people in biology doing something boring

My impression is that's a little simplistic, but I also don't have the best knowledge of the market outside WGS/WES and related tools. That particular market is a bloodbath. Maybe there's better scope in proteomics/metabolomics/stuff I know nothing about.