LessWrong 2.0 Reader

[link] Hardshipification
Jonathan Moregård (JonathanMoregard) · 2024-05-28T20:02:29.709Z · comments (17)

[link] Nietzsche's Morality in Plain English
Arjun Panickssery (arjun-panickssery) · 2023-12-04T00:57:42.839Z · comments (13)

Some for-profit AI alignment org ideas
Eric Ho (eh42) · 2023-12-14T14:23:20.654Z · comments (19)

MATS Winter 2023-24 Retrospective
utilistrutil · 2024-05-11T00:09:17.059Z · comments (28)

5 homegrown EA projects, seeking small donors
Austin Chen (austin-chen) · 2024-10-28T23:24:25.745Z · comments (0)

[link] [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij (teun-van-der-weij) · 2024-06-13T10:04:49.556Z · comments (10)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (19)

[link] A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien (alexandre-variengien) · 2023-12-19T11:52:27.354Z · comments (3)

[link] [Paper] Stress-testing capability elicitation with password-locked models
Fabien Roger (Fabien) · 2024-06-04T14:52:50.204Z · comments (10)

Newsom Vetoes SB 1047
Zvi · 2024-10-01T12:20:06.127Z · comments (6)

[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (10)

Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (57)

[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (14)

OpenAI o1, Llama 4, and AlphaZero of LLMs
Vladimir_Nesov · 2024-09-14T21:27:41.241Z · comments (24)

AI #51: Altman’s Ambition
Zvi · 2024-02-20T19:50:07.439Z · comments (5)

[link] What Depression Is Like
Sable · 2024-08-27T17:43:22.549Z · comments (23)

Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane (ckkissane) · 2024-01-16T00:26:14.767Z · comments (9)

Actually, Power Plants May Be an AI Training Bottleneck.
Lao Mein (derpherpize) · 2024-06-20T04:41:33.567Z · comments (13)

Untrusted smart models and trusted dumb models
Buck · 2023-11-04T03:06:38.001Z · comments (12)

A Crisper Explanation of Simulacrum Levels
Thane Ruthenis · 2023-12-23T22:13:52.286Z · comments (13)

Retirement Accounts and Short Timelines
jefftk (jkaufman) · 2024-02-19T18:50:05.231Z · comments (35)

[link] Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes
owencb · 2024-04-16T10:10:13.338Z · comments (12)

[link] The Compendium, A full argument about extinction risk from AGI
adamShimi · 2024-10-31T12:01:51.714Z · comments (1)

Saying the quiet part out loud: trading off x-risk for personal immortality
disturbance · 2023-11-02T17:43:34.155Z · comments (89)

AI #83: The Mask Comes Off
Zvi · 2024-09-26T12:00:08.689Z · comments (19)

Some Vacation Photos
johnswentworth · 2024-01-04T17:15:01.187Z · comments (0)

Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (21)

Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.]
abramdemski · 2023-11-20T18:23:40.443Z · comments (11)

Coup probes: Catching catastrophes with probes trained off-policy
Fabien Roger (Fabien) · 2023-11-17T17:58:28.687Z · comments (7)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
keith_wynroe · 2024-07-02T13:17:16.352Z · comments (7)

AISafety.com – Resources for AI Safety
Søren Elverlin (soren-elverlin-1) · 2024-05-17T15:57:11.712Z · comments (3)

Self-prediction acts as an emergent regularizer
Cameron Berg (cameron-berg) · 2024-10-23T22:27:03.664Z · comments (4)

Constructability: Plainly-coded AGIs may be feasible in the near future
Épiphanie Gédéon (joy_void_joy) · 2024-04-27T16:04:45.894Z · comments (13)

Bostrom Goes Unheard
Zvi · 2023-11-13T14:11:07.586Z · comments (9)

An Introduction To The Mandelbrot Set That Doesn't Mention Complex Numbers
Yitz (yitz) · 2024-01-17T09:48:07.930Z · comments (11)

[link] New voluntary commitments (AI Seoul Summit)
Zach Stein-Perlman · 2024-05-21T11:00:41.794Z · comments (17)

My Criticism of Singular Learning Theory
Joar Skalse (Logical_Lunatic) · 2023-11-19T15:19:16.874Z · comments (56)

Refusal mechanisms: initial experiments with Llama-2-7b-chat
Andy Arditi (andy-arditi) · 2023-12-08T17:08:01.250Z · comments (7)

[link] Palworld development blog post
bhauth · 2024-01-28T05:56:19.984Z · comments (12)

Information vs Assurance
johnswentworth · 2024-10-20T23:16:25.762Z · comments (5)

The Gemini Incident
Zvi · 2024-02-22T21:00:04.594Z · comments (19)

Self-Referential Probabilistic Logic Admits the Payor's Lemma
Yudhister Kumar (randomwalks) · 2023-11-28T10:27:29.029Z · comments (14)

Announcing Athena - Women in AI Alignment Research
Claire Short (claire-short) · 2023-11-07T21:46:41.741Z · comments (2)

3C's: A Recipe For Mathing Concepts
johnswentworth · 2024-07-03T01:06:11.944Z · comments (5)

Studying The Alien Mind
Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-12-05T17:27:28.049Z · comments (10)

[link] Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more
Michael Cohn (michael-cohn) · 2024-09-15T05:27:36.691Z · comments (39)

Values Are Real Like Harry Potter
johnswentworth · 2024-10-09T23:42:24.724Z · comments (17)

Survey of 2,778 AI authors: six parts in pictures
KatjaGrace · 2024-01-06T04:43:34.590Z · comments (1)

[link] My thesis (Algorithmic Bayesian Epistemology) explained in more depth
Eric Neyman (UnexpectedValues) · 2024-05-09T19:43:16.543Z · comments (4)

New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Joe Carlsmith (joekc) · 2023-11-15T17:16:42.088Z · comments (26)

Recent comments

danielfilan on Habryka's Shortform Feed

Update: I have already gotten over it.

willpetillo on What if Alignment is Not Enough?

On reflection, I suspect the crux here is a differing conception of what kind of failures are important. I've written a follow-up post that comes at this topic from a different direction and I would be very interested in your feedback: https://www.lesswrong.com/posts/NFYLjoa25QJJezL9f/lenses-of-control.

donald-hobson on No, really, it predicts next tokens.

My understanding is that these are explicitly and intentionally trained (wouldn't come to exist naturally under gradient descent on normal training data)

No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.

So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won't.
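To make the contrast between the two rates concrete, here is a minimal sketch of the arithmetic. It assumes each input independently hits an adversarial example at the stated rate, and the figure of 10^12 inputs is a hypothetical lifetime workload chosen for illustration, not something from the comment. Everything is done in log10 space because 10^-500 is too small for a float.

```python
import math


def log10_expected_hits(log10_rate: float, log10_inputs: float) -> float:
    """log10 of the expected number of adversarial inputs encountered.

    Expected hits = rate * inputs, so the logs simply add.
    """
    return log10_rate + log10_inputs


def prob_at_least_one(log10_rate: float, log10_inputs: float) -> float:
    """Approximate P(at least one hit) as 1 - exp(-expected hits).

    This is the Poisson approximation, which assumes independent inputs.
    """
    log10_hits = log10_expected_hits(log10_rate, log10_inputs)
    if log10_hits < -300:
        # Far below float resolution: treat the probability as zero.
        return 0.0
    expected_hits = 10.0 ** log10_hits
    return -math.expm1(-expected_hits)


LOG10_INPUTS = 12.0  # hypothetical: 10^12 inputs seen over deployment

for log10_rate in (-9.0, -500.0):
    hits = log10_expected_hits(log10_rate, LOG10_INPUTS)
    prob = prob_at_least_one(log10_rate, LOG10_INPUTS)
    print(f"rate 10^{log10_rate:.0f}: 10^{hits:.0f} expected hits, P(hit) ~ {prob}")
```

Under these assumptions the first rate gives on the order of a thousand hits, while the second gives an expected count of about 10^-488, which is effectively never.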

That's a much more complicated goal than the goal of correctly predicting the next token,

Is it more complicated? What ontological framework is this AI using to represent its goal anyway?

any willingness to sacrifice a few tokens now would be trained out by gradient descent.

Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea, despite the fact that it isn't a good idea when you are in training (unless there is a training environment bug and you can sneak out midway through training).

So, is the network able to tell whether or not it's in training?

zach-stein-perlman on Zach Stein-Perlman's Shortform

Anthropic: The case for targeted regulation.

I like the "Urgency" section.

Anthropic says the US government should "require companies to have and publish RSP-like policies" and "incentivize companies to develop RSPs that are effective" and do related things. I worry that RSPs will be ineffective for most companies, even given some transparency/accountability and vague minimum standard.

davekasten on davekasten's Shortform

Ok, so Anthropic's new policy post (explicitly NOT linkposting it properly since I assume @Zac Hatfield-Dodds or @Evan Hubinger or someone else from Anthropic will, and figure the main convo should happen there, and don't want to incentivize fragmenting of conversation) seems to have a very obvious implication.

Unrelated, I just slammed a big AGI-by-2028 order on Manifold Markets.

donald-hobson on No, really, it predicts next tokens.

Would you expect some part of the net to be left blank, because "a large neural net has a lot of spare neurons"?

If the lottery ticket hypothesis is true, yes.

The lottery ticket hypothesis is that some parts of the network start off doing something somewhat close to useful, and get trained towards usefulness. And some parts start off sufficiently un-useful that they just get trained to get out of the way.

Which fits with neural net distillation being a thing. (I.e., training a big network and then condensing it into a smaller network gives better performance than directly training a small network.)

but gradient descent doesn't care, it reaches in and adjusts every weight.

Here is an extreme example. Suppose the current parameters were implementing a computer chip, on which was running a homomorphically encrypted piece of code.

Homomorphic encryption itself is unlikely to form, but it serves at least as an existence proof for computational structures that can't be adjusted with local optimization.

Basically the problem with gradient descent is that it's local. And when the same neurons are doing things that the neural net does want, and things that the neural net doesn't want (but doesn't dis-want either), then it's possible for the network to be trapped in a local optimum. Any small change to get rid of the bad behavior would also get rid of the good behavior.

Also, any bad behavior that only very rarely affects the output will produce very small gradients. Neural nets are trained for finite time. It's possible that gradient descent just hasn't got around to removing the bad behavior even if it would do so eventually.
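The point about rare behavior and small gradients can be illustrated with a toy calculation. This is a sketch under strong assumptions that are not in the comment: a single hypothetical weight w drives the bad behavior, it fires on a fraction p of inputs, and each firing costs w^2 in loss. The expected loss is then p * w^2, its gradient is 2 * p * w, and each gradient step multiplies w by (1 - 2 * lr * p).

```python
import math


def steps_to_shrink(trigger_rate: float, learning_rate: float,
                    shrink_factor: float = 0.5) -> float:
    """Steps of expected-gradient descent until w shrinks by shrink_factor.

    Toy model (an assumption for illustration): expected loss is
    trigger_rate * w**2, so each step scales w by
    (1 - 2 * learning_rate * trigger_rate).
    """
    # log1p keeps precision when the per-step change is tiny.
    log_per_step = math.log1p(-2.0 * learning_rate * trigger_rate)
    return math.log(shrink_factor) / log_per_step


LEARNING_RATE = 0.01  # hypothetical value

for rate in (1e-1, 1e-9):
    steps = steps_to_shrink(rate, LEARNING_RATE)
    print(f"trigger rate {rate:g}: about {steps:.3g} steps to halve w")
```

In this toy setup the number of steps needed scales roughly as 1/p, so a behavior that fires once in a billion inputs takes on the order of tens of billions of steps to halve, which is consistent with the claim that finite training may simply not get around to removing it.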

Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?

You can make any algorithm that does better than chance into a local optimum on a sufficiently large neural net. Homomorphically encrypt that algorithm; any small change and the whole thing collapses into nonsense. Well, actually, this involves discrete bits. But suppose the neurons have strong regularization to stop the values getting too large (past +1 or -1), and they also have uniform [0,1] noise added to them, so each neuron can store 1 bit and any attempt to adjust parameters immediately risks errors.

Looking at the article you linked, one simplification is that neural networks tend towards the max-entropy way to solve the problem. If there are multiple solutions, the solutions with more free parameters are more likely.

And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want.

ryankidd44 on Ryan Kidd's Shortform

MATS lowered the stipend from $50/h to $40/h ahead of the Summer 2023 Program to support more scholars. We then lowered it again to $30/h ahead of the Winter 2023-24 Program after surveying alumni and determining that 85% would accept $30/h.

felix-j-binder on Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence

This is a great post. I'm excited about this line of research, and it's great to see a proposal of what that might look like.

In our paper, we find that the log-probs of a model's hypothetical statements track the log-probs of the object-level behavior it is reporting about. This is true also for object-level responses that the model does not actually choose. For example (made up numbers), if the object-level behavior of the model has the distribution 60% "dog", 30% "cat", 10% "fox", the model would answer the question "what would the second letter of your answer have been?" with 70% "o", 30% "a". Note that the model only saw the winning answer during training, yet it is calibrated (to some degree) to the distribution of object-level answers.
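The made-up numbers in the example can be checked by pushing the object-level distribution through the "second letter" property. The sketch below is only that arithmetic; the function name is hypothetical and it says nothing about how a model arrives at such answers internally. Percentages are kept as integers to avoid floating-point rounding.

```python
from collections import defaultdict


def second_letter_distribution(answer_percent: dict[str, int]) -> dict[str, int]:
    """Sum the probability mass of answers that share a second letter."""
    dist: defaultdict[str, int] = defaultdict(int)
    for answer, percent in answer_percent.items():
        dist[answer[1]] += percent
    return dict(dist)


# Made-up object-level distribution from the example above.
object_level = {"dog": 60, "cat": 30, "fox": 10}

print(second_letter_distribution(object_level))  # {'o': 70, 'a': 30}
```

"dog" and "fox" both contribute to "o", which is why a calibrated hypothetical answer puts 70% on "o" rather than the 60% belonging to the single winning answer.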

I'm curious what you make of this result. To me, the fact that the log-probs of the hypothetical answer are calibrated with respect to the object-level behavior suggests that there is an internal process that takes into account calibration when arriving at an answer, even though we don't ask it to verbalize the calibration. (Early on in the project, we actually included experiments where models were asked about e.g. their second-most likely answer, but we stopped them early enough that I have no data on how well they can explicitly report on this.)

lorec on The hostile telepaths problem

I once thought "slack mattered more than any outcome". But whose slack? It's wonderful for all humans to have more slack. But there's a huge game-theoretic difference between the species being wealthier, and thus wealthier per capita, and being wealthy/high-status/dominant/powerful relative to other people. The first is what I was getting at by "things orthogonal to the lineup"; the second is "the lineup". Trying to improve your position relative to copies of yourself in a way that is zero-sum is "the rat race", or "the Red Queen's race", where running will ~only ever keep you in the same place, and cause you and your mirror-selves to expend a lot of effort that is useless if you don't enjoy it.

[I think I enjoy any amount of "the rat race", which is part of why I find myself doing any of it, even though I can easily imagine tweaking my mind such that I stop doing it and thus exit an LDT negotiation equilibrium where I need to do it all the time. But I only like it so much, and only certain kinds.]

raemon on On Shifgrethor

(In addition to this not-seeming-true-across-the-board... also, literally nobody has ever made this claim to me. The entire reason I'm hypothesizing it is because this post suggested it, and it made sense given my model of how my/friends' cognition seems to work. So, IMO the slightly-aggro comment here is just basically wrong? Unless you've specifically seen people claim this?)