LessWrong 2.0 Reader

(My understanding of) What Everyone in Technical Alignment is Doing and Why
Thomas Larsen (thomas-larsen) · 2022-08-29T01:23:58.073Z · comments (89)
DeepMind alignment team opinions on AGI ruin arguments
Vika · 2022-08-12T21:06:40.582Z · comments (37)
[link] A Mechanistic Interpretability Analysis of Grokking
Neel Nanda (neel-nanda-1) · 2022-08-15T02:41:36.245Z · comments (47)
Two-year update on my personal AI timelines
Ajeya Cotra (ajeya-cotra) · 2022-08-02T23:07:48.698Z · comments (60)
Common misconceptions about OpenAI
Jacob_Hilton · 2022-08-25T14:02:26.257Z · comments (142)
What do ML researchers think about AI in 2022?
KatjaGrace · 2022-08-04T15:40:05.024Z · comments (33)
Worlds Where Iterative Design Fails
johnswentworth · 2022-08-30T20:48:29.025Z · comments (30)
Language models seem to be much better than humans at next-token prediction
Buck · 2022-08-11T17:45:41.294Z · comments (59)
How To Go From Interpretability To Alignment: Just Retarget The Search
johnswentworth · 2022-08-10T16:08:11.402Z · comments (33)
Some conceptual alignment research projects
Richard_Ngo (ricraz) · 2022-08-25T22:51:33.478Z · comments (15)
Shard Theory: An Overview
David Udell · 2022-08-11T05:44:52.852Z · comments (34)
Nate Soares' Life Advice
CatGoddess · 2022-08-23T02:46:43.369Z · comments (41)
Your posts should be on arXiv
JanB (JanBrauner) · 2022-08-25T10:35:12.087Z · comments (44)
What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?
johnswentworth · 2022-08-15T22:48:38.671Z · comments (18)
The Parable of the Boy Who Cried 5% Chance of Wolf
KatWoods (ea247) · 2022-08-15T14:33:21.649Z · comments (24)
How might we align transformative AI if it’s developed very soon?
HoldenKarnofsky · 2022-08-29T15:42:08.985Z · comments (55)
Externalized reasoning oversight: a research direction for language model alignment
tamera · 2022-08-03T12:03:16.630Z · comments (23)
Interpretability/Tool-ness/Alignment/Corrigibility are not Composable
johnswentworth · 2022-08-08T18:05:11.982Z · comments (12)
[link] Fiber arts, mysterious dodecahedrons, and waiting on “Eureka!”
eukaryote · 2022-08-04T20:37:59.388Z · comments (15)
Taking the parameters which seem to matter and rotating them until they don't
Garrett Baker (D0TheMath) · 2022-08-26T18:26:47.667Z · comments (48)
Meditation course claims 65% enlightenment rate: my review
KatWoods (ea247) · 2022-08-01T11:25:37.017Z · comments (33)
[link] The lessons of Xanadu
jasoncrawford · 2022-08-07T17:59:57.839Z · comments (20)
Beliefs and Disagreements about Automating Alignment Research
Ian McKenzie (naimenz) · 2022-08-24T18:37:00.419Z · comments (4)
The alignment problem from a deep learning perspective
Richard_Ngo (ricraz) · 2022-08-10T22:46:46.752Z · comments (15)
How likely is deceptive alignment?
evhub · 2022-08-30T19:34:25.997Z · comments (28)
Announcing Encultured AI: Building a Video Game
Andrew_Critch · 2022-08-18T02:16:26.726Z · comments (26)
[link] everything is okay
Tamsin Leake (carado-1) · 2022-08-23T09:20:33.250Z · comments (22)
Oversight Misses 100% of Thoughts The AI Does Not Think
johnswentworth · 2022-08-12T16:30:24.060Z · comments (50)
Introducing Pastcasting: A tool for forecasting practice
Sage Future (aaron-ho-1) · 2022-08-11T17:38:06.474Z · comments (10)
Survey advice
KatjaGrace · 2022-08-24T03:10:21.424Z · comments (11)
Rant on Problem Factorization for Alignment
johnswentworth · 2022-08-05T19:23:24.262Z · comments (51)
Less Threat-Dependent Bargaining Solutions?? (3/2)
Diffractor · 2022-08-20T02:19:11.405Z · comments (7)
How to do theoretical research, a personal perspective
Mark Xu (mark-xu) · 2022-08-19T19:41:21.562Z · comments (6)
[question] Seriously, what goes wrong with "reward the agent when it makes you smile"?
TurnTrout · 2022-08-11T22:22:32.198Z · answers+comments (42)
High Reliability Orgs, and AI Companies
Raemon · 2022-08-04T05:45:34.928Z · comments (7)
[link] Refining the Sharp Left Turn threat model, part 1: claims and mechanisms
Vika · 2022-08-12T15:17:38.304Z · comments (4)
I’m mildly skeptical that blindness prevents schizophrenia
Steven Byrnes (steve2152) · 2022-08-15T23:36:59.003Z · comments (9)
[link] Most Ivy-smart students aren't at Ivy-tier schools
Aaron Bergman (aaronb50) · 2022-08-07T03:18:02.298Z · comments (7)
[link] Paper is published! 100,000 lumens to treat seasonal affective disorder
Fabienne · 2022-08-20T19:48:29.687Z · comments (3)
«Boundaries», Part 2: trends in EA's handling of boundaries
Andrew_Critch · 2022-08-06T00:42:48.744Z · comments (14)
The Loire Is Not Dry
jefftk (jkaufman) · 2022-08-20T13:40:01.237Z · comments (2)
What's the Least Impressive Thing GPT-4 Won't be Able to Do
Algon · 2022-08-20T19:48:14.811Z · comments (125)
Human Mimicry Mainly Works When We’re Already Close
johnswentworth · 2022-08-17T18:41:18.140Z · comments (16)
AI strategy nearcasting
HoldenKarnofsky · 2022-08-25T17:26:28.455Z · comments (4)
Evolution is a bad analogy for AGI: inner alignment
Quintin Pope (quintin-pope) · 2022-08-13T22:15:57.223Z · comments (15)
How (not) to choose a research project
Garrett Baker (D0TheMath) · 2022-08-09T00:26:37.045Z · comments (11)
The Core of the Alignment Problem is...
Thomas Larsen (thomas-larsen) · 2022-08-17T20:07:35.157Z · comments (10)
Announcing the Introduction to ML Safety course
Dan H (dan-hendrycks) · 2022-08-06T02:46:00.295Z · comments (6)
Discovering Agents
zac_kenton (zkenton) · 2022-08-18T17:33:43.317Z · comments (11)
[question] COVID-19 Group Testing Post-mortem?
gwern · 2022-08-05T16:32:55.157Z · answers+comments (6)