LessWrong 2.0 Reader

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution
Kola Ayonrinde (kola-ayonrinde) · 2024-10-30T22:50:45.642Z · comments (0)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

Apply to the Cooperative AI PhD Fellowship by October 14th!
Lewis Hammond (lewis-hammond-1) · 2024-10-05T12:41:24.093Z · comments (0)

[link] Foundations - Why Britain has stagnated [crosspost]
Nathan Young · 2024-09-23T10:43:20.411Z · comments (1)

Would you benefit from, or object to, a page with LW users' reacts?
Raemon · 2024-08-20T16:35:47.568Z · comments (6)

AI #77: A Few Upgrades
Zvi · 2024-08-20T00:20:09.717Z · comments (3)

[link] The Offense-Defense Balance of Gene Drives
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-27T16:47:25.976Z · comments (1)

[LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals
tailcalled · 2024-08-09T19:57:56.122Z · comments (2)

[link] Tokyo AI Safety 2025: Call For Papers
Blaine (blaine-rogers) · 2024-10-21T08:43:38.467Z · comments (0)

[link] The unreasonable effectiveness of plasmid sequencing as a service
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-08T02:02:55.352Z · comments (0)

[question] Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic?
Dalcy (Darcy) · 2024-08-16T04:16:23.159Z · answers+comments (6)

AI Safety University Organizing: Early Takeaways from Thirteen Groups
agucova · 2024-10-02T15:14:00.137Z · comments (0)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)

[link] A Defense of Peer Review
Niko_McCarty (niko-2) · 2024-10-22T16:16:49.982Z · comments (1)

Rashomon - A newsbetting site
ideasthete · 2024-10-15T18:15:02.476Z · comments (8)

Why the 2024 election matters, the AI risk case for Harris, & what you can do to help
Alex Lintz (alex-lintz) · 2024-09-24T19:32:46.893Z · comments (6)

A suite of Vision Sparse Autoencoders
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-10-27T04:05:20.377Z · comments (0)

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs
Winnie Yang (winnie-yang) · 2024-08-22T07:32:07.600Z · comments (0)

[link] How to choose what to work on
jasoncrawford · 2024-09-18T20:39:12.316Z · comments (6)

Can We Predict Persuasiveness Better Than Anthropic?
Lennart Finke (l-f) · 2024-08-04T14:05:33.668Z · comments (5)

[LDSL#3] Information-orientation is in tension with magnitude-orientation
tailcalled · 2024-08-10T21:58:27.659Z · comments (2)

[link] Hyperpolation
Gunnar_Zarncke · 2024-09-15T21:37:00.002Z · comments (6)

Monthly Roundup #21: August 2024
Zvi · 2024-08-20T00:20:08.178Z · comments (6)

[link] on Science Beakers and DDT
bhauth · 2024-09-05T03:21:21.382Z · comments (13)

Launching Adjacent News
Lucas Kohorst (lucas-kohorst) · 2024-10-16T17:58:10.289Z · comments (0)

A path to human autonomy
Nathan Helm-Burger (nathan-helm-burger) · 2024-10-29T03:02:42.475Z · comments (11)

[link] The Tech Industry is the Biggest Blocker to Meaningful AI Safety Regulations
garrison · 2024-08-16T19:37:28.416Z · comments (1)

[link] Day Zero Antivirals for Future Pandemics
Niko_McCarty (niko-2) · 2024-08-26T15:18:33.858Z · comments (2)

[link] AI safety tax dynamics
owencb · 2024-10-23T12:18:32.243Z · comments (0)

Geoffrey Hinton on the Past, Present, and Future of AI
Stephen McAleese (stephen-mcaleese) · 2024-10-12T16:41:56.796Z · comments (5)

Brief analysis of OP Technical AI Safety Funding
22tom (thomas-barnes) · 2024-10-25T19:37:41.674Z · comments (2)

August 2024 Time Tracking
jefftk (jkaufman) · 2024-08-24T13:50:04.676Z · comments (0)

[LDSL#5] Comparison and magnitude/diminishment
tailcalled · 2024-08-12T18:47:20.546Z · comments (0)

[link] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-19T16:13:55.835Z · comments (1)

AXRP Episode 37 - Jaime Sevilla on Forecasting AI
DanielFilan · 2024-10-04T21:00:03.077Z · comments (3)

[question] What should OpenAI do that it hasn't already done, to stop their vacancies from being advertised on the 80k Job Board?
WitheringWeights (EZ97) · 2024-10-21T13:57:30.934Z · answers+comments (0)

Alignment by default: the simulation hypothesis
gb (ghb) · 2024-09-25T16:26:00.552Z · comments (39)

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
DanielFilan · 2024-08-24T22:30:02.039Z · comments (0)

[link] AI & wisdom 1: wisdom, amortised optimisation, and AI
L Rudolf L (LRudL) · 2024-10-28T21:02:51.215Z · comments (0)

[link] Anthropic is being sued for copying books to train Claude
Remmelt (remmelt-ellen) · 2024-08-31T02:57:27.092Z · comments (4)

Musings on Text Data Wall (Oct 2024)
Vladimir_Nesov · 2024-10-05T19:00:21.286Z · comments (2)

Gell-Mann checks
Cleo Scrolls (cleo-scrolls) · 2024-09-26T22:45:43.569Z · comments (7)

[question] What should we do about COVID in 2024?
ChristianKl · 2024-08-04T10:57:24.140Z · answers+comments (2)

[link] The Great Organism Theory of Evolution
rogersbacon · 2024-08-10T12:26:02.434Z · comments (0)

[link] Compression Moves for Prediction
adamShimi · 2024-09-14T17:51:12.004Z · comments (0)

[link] AI Model Registries: A Foundational Tool for AI Governance
Elliot Mckernon (elliot) · 2024-10-07T19:27:43.466Z · comments (1)

How Often Does Taking Away Options Help?
niplav · 2024-09-21T21:52:40.822Z · comments (6)

A necessary Membrane formalism feature
ThomasCederborg · 2024-09-10T21:33:09.508Z · comments (6)

[link] Green and golden: a meditation
Richard_Ngo (ricraz) · 2024-08-18T01:36:43.613Z · comments (0)

Recent comments

danielfilan on Habryka's Shortform Feed

Update: I have already gotten over it.

willpetillo on What if Alignment is Not Enough?

On reflection, I suspect the crux here is a differing conception of what kind of failures are important. I've written a follow-up post that comes at this topic from a different direction and I would be very interested in your feedback: https://www.lesswrong.com/posts/NFYLjoa25QJJezL9f/lenses-of-control.

donald-hobson on No, really, it predicts next tokens.

My understanding is that these are explicitly and intentionally trained (wouldn't come to exist naturally under gradient descent on normal training data)

No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.

So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won't.
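To make the arithmetic concrete, here is a minimal sketch that works in log10 space, since 10^-500 is too small to store as an ordinary float. The query count of 10^12 is a hypothetical figure chosen purely for illustration; it does not come from the discussion.

```python
def log10_expected_hits(log10_rate, log10_queries):
    """Return log10 of the expected number of adversarial inputs encountered.

    Expected hits = rate * queries, so in log10 space the two exponents add.
    """
    return log10_rate + log10_queries


# Hypothetical assumption: the deployed model sees 10^12 inputs in total.
LOG10_QUERIES = 12

# Ambient rate 10^-9: about 10^3 expected hits, so hits do occur.
common = log10_expected_hits(-9, LOG10_QUERIES)

# Ambient rate 10^-500: about 10^-488 expected hits, so effectively never.
rare = log10_expected_hits(-500, LOG10_QUERIES)

print(common, rare)
```

Under that assumed volume, the first rate yields on the order of a thousand hits, while the second stays negligible for any realistic number of queries.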

That's a much more complicated goal than the goal of correctly predicting the next token,

Is it more complicated? What ontological framework is this AI using to represent its goal anyway?

any willingness to sacrifice a few tokens now would be trained out by gradient descent.

Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea, despite the fact that it isn't a good idea when you are in training (unless there is a training environment bug and you can sneak out midway through training).

So, is the network able to tell whether or not it's in training?

zach-stein-perlman on Zach Stein-Perlman's Shortform

Anthropic: The case for targeted regulation.

I like the "Urgency" section.

Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and a vague minimum standard.

davekasten on davekasten's Shortform

Ok, so Anthropic's new policy post (explicitly NOT linkposting it properly since I assume @Zac Hatfield-Dodds or @Evan Hubinger or someone else from Anthropic will, and figure the main convo should happen there, and don't want to incentivize fragmenting of conversation) seems to have a very obvious implication.

Unrelated, I just slammed a big AGI-by-2028 order on Manifold Markets.

donald-hobson on No, really, it predicts next tokens.

Would you expect some part of the net to be left blank, because "a large neural net has a lot of spare neurons"?

If the lottery ticket hypothesis is true, yes.

The lottery ticket hypothesis is that some parts of the network start off doing something somewhat close to useful, and get trained towards usefulness. And some parts start off sufficiently un-useful that they just get trained to get out of the way.

Which fits with neural net distillation being a thing. (I.e., training a big network and then condensing it into a smaller network gives better performance than directly training a small network.)

but gradient descent doesn't care, it reaches in and adjusts every weight.

Here is an extreme example. Suppose the current parameters were implementing a computer chip, on which was running a homomorphically encrypted piece of code.

Homomorphic encryption itself is unlikely to form, but it serves at least as an existence proof for computational structures that can't be adjusted with local optimization.

Basically the problem with gradient descent is that it's local. And when the same neurons are doing things that the neural net does want, and things that the neural net doesn't want (but doesn't dis-want either), then it's possible for the network to be trapped in a local optimum. Any small change to get rid of the bad behavior would also get rid of the good behavior.

Also, any bad behavior that only very rarely affects the output will produce very small gradients. Neural nets are trained for finite time. It's possible that gradient descent just hasn't got around to removing the bad behavior even if it would do so eventually.

Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?

You can make any algorithm that does better than chance into a local optimum on a sufficiently large neural net. Homomorphically encrypt that algorithm. Any small change and the whole thing collapses into nonsense. Well actually, this involves discrete bits. But suppose the neurons have strong regularization to stop the values getting too large (past + or - 1), and they also have uniform [0,1] noise added to them, so each neuron can store 1 bit and any attempt to adjust parameters immediately risks errors.

Looking at the article you linked, one simplification is that neural networks tend towards the max-entropy way to solve the problem. If there are multiple solutions, the solutions with more free parameters are more likely.

And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.

ryankidd44 on Ryan Kidd's Shortform

MATS lowered the stipend from $50/h to $40/h ahead of the Summer 2023 Program to support more scholars. We then lowered it again to $30/h ahead of the Winter 2023-24 Program after surveying alumni and determining that 85% would accept $30/h.

felix-j-binder on Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence

This is a great post; I'm excited about this line of research, and it's great to see a proposal of what that might look like.

In our paper, we find that the log-probs of a model's hypothetical statements track the log-probs of the object-level behavior it is reporting about. This is true also for object-level responses that the model does not actually choose. For example (made up numbers), if the object-level behavior of the model has the distribution 60% "dog", 30% "cat", 10% "fox", the model would answer the question "what would the second letter of your answer have been?" with 70% "o", 30% "a". Note that the model only saw the winning answer during training, yet it is calibrated (to some degree) to the distribution of object-level answers.
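The made-up numbers above can be reproduced with a minimal sketch that marginalizes the object-level answer distribution onto the second letter. The function name is hypothetical and this is only an illustration of the arithmetic, not the evaluation code from the paper.

```python
from collections import defaultdict


def second_letter_distribution(answer_probs):
    """Sum the probability of every answer that shares a second letter."""
    dist = defaultdict(float)
    for answer, prob in answer_probs.items():
        dist[answer[1]] += prob
    return dict(dist)


# Made-up object-level distribution from the example above.
object_level = {"dog": 0.6, "cat": 0.3, "fox": 0.1}

# "dog" and "fox" both contribute to "o"; "cat" contributes to "a".
second_letters = second_letter_distribution(object_level)

for letter, prob in second_letters.items():
    print(f"{letter}: {prob:.0%}")
```

A perfectly calibrated hypothetical answer would put roughly 70% on "o" and 30% on "a", even though "fox" is never the winning object-level answer.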

I'm curious what you make of this result. To me, the fact that the log-probs of the hypothetical answer are calibrated with respect to the object-level behavior suggests that there is an internal process that takes calibration into account when arriving at an answer, even though we don't ask it to verbalize the calibration. (Early on in the project, we actually included experiments where models were asked about e.g. their second-most likely answer, but we stopped them early enough that I have no data on how well they can explicitly report on this.)

lorec on The hostile telepaths problem

I once thought "slack mattered more than any outcome". But whose slack? It's wonderful for all humans to have more slack. But there's a huge game-theoretic difference between the species being wealthier, and thus wealthier per capita, and being wealthy/high-status/dominant/powerful relative to other people. The first is what I was getting at by "things orthogonal to the lineup"; the second is "the lineup". Trying to improve your position relative to copies of yourself in a way that is zero-sum is "the rat race", or "the Red Queen's race", where running will ~only ever keep you in the same place, and cause you and your mirror-selves to expend a lot of effort that is useless if you don't enjoy it.

[I think I enjoy any amount of "the rat race", which is part of why I find myself doing any of it, even though I can easily imagine tweaking my mind such that I stop doing it and thus exit an LDT negotiation equilibrium where I need to do it all the time. But I only like it so much, and only certain kinds.]

raemon on On Shifgrethor

(In addition to this not-seeming-true-across-the-board... also, literally nobody has ever made this claim to me. The entire reason I'm hypothesizing it is because this post suggested it, and it made sense given my model of how my/friends' cognition seems to work. So, IMO the slightly-aggro comment here is just basically wrong? Unless you've specifically seen people claim this?)