LessWrong 2.0 Reader

Thoughts on Evo-Bio Math and Mesa-Optimization: Maybe We Need To Think Harder About "Relative" Fitness?
Lorec · 2024-09-28T14:07:42.412Z · comments (6)

[link] Consciousness As Recursive Reflections
Gunnar_Zarncke · 2024-10-05T20:00:53.053Z · comments (3)

[link] SCP Foundation - Anti memetic Division Hub
landscape_kiwi · 2024-09-15T13:40:52.691Z · comments (1)

[link] Optimising under arbitrarily many constraint equations
dkl9 · 2024-09-12T14:59:28.475Z · comments (0)

Retrieval Augmented Genesis
João Ribeiro Medeiros (joao-ribeiro-medeiros) · 2024-10-01T20:18:01.836Z · comments (0)

Thirty random thoughts about AI alignment
Lysandre Terrisse · 2024-09-15T16:24:10.572Z · comments (1)

[link] Contra Yudkowsky on 2-4-6 Game Difficulty Explanations
Josh Hickman (josh-hickman) · 2024-09-08T16:13:33.187Z · comments (1)

Using LLM's for AI Foundation research and the Simple Solution assumption
Donald Hobson (donald-hobson) · 2024-09-24T11:00:53.658Z · comments (0)

[question] AMA: International School Student in China
Novice · 2024-10-01T06:00:16.282Z · answers+comments (0)

Avoiding jailbreaks by discouraging their representation in activation space
Guido Bergman · 2024-09-27T17:49:20.785Z · comments (2)

[link] An "Observatory" For a Shy Super AI?
Sherrinford · 2024-09-27T21:22:40.296Z · comments (0)

[link] In-Context Learning: An Alignment Survey
alamerton · 2024-09-30T18:44:28.589Z · comments (0)

[link] Join the $10K AutoHack 2024 Tournament
Paul Bricman (paulbricman) · 2024-09-25T11:54:20.112Z · comments (0)

Biasing LLM Response with Visual Stimuli
Jaehyuk Lim (jason-l) · 2024-10-03T18:04:31.474Z · comments (0)

[link] Should we abstain from voting? (In nondeterministic elections)
B Jacobs (Bob Jacobs) · 2024-10-02T10:07:43.167Z · comments (5)

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets
Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-09-16T01:04:32.953Z · comments (1)

Longevity and the Mind
George3d6 · 2024-09-16T09:43:09.700Z · comments (2)

[link] AI Safety Newsletter #41: The Next Generation of Compute Scale Plus, Ranking Models by Susceptibility to Jailbreaking, and Machine Ethics
Corin Katzke (corin-katzke) · 2024-09-11T19:14:08.274Z · comments (1)

[link] Linkpost: Hypocrisy standoff
Chris_Leong · 2024-09-29T14:27:19.175Z · comments (1)

Seeking mentorship
Kevin Afachao (kevin-afachao) · 2024-09-21T16:54:58.353Z · comments (0)

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents
Alejandro Aristizabal (alejandro-aristizabal) · 2024-09-29T00:32:42.161Z · comments (0)

[question] How do you follow AI (safety) news?
PeterH · 2024-09-24T13:58:48.916Z · answers+comments (2)

Toy Models of Superposition: Simplified by Hand
Axel Sorensen (axel-sorensen) · 2024-09-29T21:19:52.475Z · comments (0)

New Capabilities, New Risks? - Evaluating Agentic General Assistants using Elements of GAIA & METR Frameworks
Tej Lander (tej-lander) · 2024-09-29T18:58:56.253Z · comments (0)

Likelihood calculation with duobels
Martin Gerdes (martin-gerdes) · 2024-10-01T16:21:01.268Z · comments (0)

Increasing the Span of the Set of Ideas
Jeffrey Heninger (jeffrey-heninger) · 2024-09-13T15:52:39.132Z · comments (1)

Developmental Stages in Multi-Problem Grokking
James Sullivan · 2024-09-29T18:58:22.954Z · comments (0)

[question] Calibration training for 'percentile rankings'?
david reinstein (david-reinstein) · 2024-09-14T21:51:55.705Z · answers+comments (0)

Endogenous Growth and Human Intelligence
Nicholas D. (nicholas-d) · 2024-09-18T14:05:54.567Z · comments (0)

For Limited Superintelligences, Epistemic Exclusion is Harder than Robustness to Logical Exploitation
Lorec · 2024-09-15T20:49:06.370Z · comments (9)

[link] Climate Change And Global Warming
Zero Contradictions · 2024-09-25T19:13:09.508Z · comments (0)

[link] Models of life
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-29T19:24:40.060Z · comments (0)

[link] 2024 Election Forecasting Contest
mike20731 · 2024-10-05T20:43:16.203Z · comments (0)

Collapsing “Collapsing the Belief/Knowledge Distinction”
Jeremias (jeremias-sur) · 2024-09-20T16:11:33.558Z · comments (0)

Apply to the Cooperative AI PhD Fellowship by October 14th!
Lewis Hammond (lewis-hammond-1) · 2024-10-05T12:41:24.093Z · comments (0)

On Measuring Intellectual Performance - personal experience and several thoughts
Alexander Gufan (alexander-gufan) · 2024-09-20T17:21:19.747Z · comments (2)

Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation
Antonio Clarke (antonio-clarke) · 2024-09-29T18:48:23.308Z · comments (0)

MIT FutureTech are hiring for a Technical Associate role
peterslattery · 2024-09-09T20:16:49.299Z · comments (0)

San Francisco ACX Meetup “First Saturday”
Nate Sternberg (nate-sternberg) · 2024-09-29T03:13:34.615Z · comments (0)

Can AI Quantity beat AI Quality?
Gianluca Calcagni (gianluca-calcagni) · 2024-10-02T15:21:45.711Z · comments (0)

Survey - Psychological Impact of Long-Term AI Engagement
Manuela García (manuela-garcia) · 2024-09-17T17:31:38.397Z · comments (0)

[question] Searching for Impossibility Results or No-Go Theorems for provable safety.
Maelstrom · 2024-09-27T20:12:25.515Z · answers+comments (1)

A Psychoanalytic Explanation of Sam Altman's Irrational Actions
Gabe · 2024-09-29T18:58:13.511Z · comments (3)

[question] Most capable publicly available agents?
Gabe · 2024-09-30T00:04:24.480Z · answers+comments (0)

What bootstraps intelligence?
invertedpassion · 2024-09-10T07:11:21.819Z · comments (2)

AIS Hungary Operations Officer role, Deadline: 2024 October 6th
gergogaspar (gergo-gaspar) · 2024-09-25T13:54:25.077Z · comments (0)

Exploring Decomposability of SAE Features
Vikram_N (viknat) · 2024-09-30T18:28:09.348Z · comments (0)

Emergent Authorship: Creativity à la Communing
gswonk · 2024-09-14T19:02:07.635Z · comments (0)

Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility
OwenChen · 2024-09-25T20:38:22.928Z · comments (0)

AGI Farm
Rahul Chand (rahul-chand) · 2024-10-01T04:29:58.606Z · comments (0)

Recent comments

ben-millwood on DanielFilan's Shortform Feed

You can't always use liability to internalise all the externality because, e.g., you can't effectively sue companies for more than they have, and for some companies that stay afloat by regular fundraising rounds, that may not even be that much? Like, if they're considering an action that is a coinflip between "we cause some huge liability" and "we triple the value of our company", then it's usually going to be correct from a shareholder perspective to take it, no matter the size of the liability, right?
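A minimal sketch of the coinflip arithmetic, under assumptions that are purely illustrative: the company's equity is worth 100, the odds are 50/50, and shareholders cannot lose more than their equity (limited liability). The function names and numbers are hypothetical, not taken from any real case.

```python
def shareholder_payoff(equity: float, liability: float) -> float:
    """Equity left after paying a liability, floored at zero (limited liability)."""
    return max(equity - liability, 0.0)


def expected_values(equity: float, liability: float,
                    p_bad: float = 0.5, upside_multiple: float = 3.0):
    """Return (expected shareholder value, expected total value) of taking the gamble.

    The total value counts the whole liability, including the part the
    company cannot pay; the shareholder value does not.
    """
    good_outcome = upside_multiple * equity
    shareholder = (1 - p_bad) * good_outcome + p_bad * shareholder_payoff(equity, liability)
    total = (1 - p_bad) * good_outcome + p_bad * (equity - liability)
    return shareholder, total


if __name__ == "__main__":
    equity = 100.0  # assumed current value of the company
    for liability in (50.0, 500.0, 5000.0):
        shareholder, total = expected_values(equity, liability)
        print(f"liability={liability:>7}: shareholders expect {shareholder}, "
              f"total expected value {total}")
```

Under these assumptions the shareholders' expected value never drops below 1.5 times the current equity, whatever the liability, while the total expected value falls without bound as the liability grows.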

Criminal law has the ability to increase the deterrent somewhat – probably many people will not accept any amount of money for a significant enough chance of prison – though obviously it's not perfect either

dan-valentine on Open Thread Fall 2024

Declarative and procedural knowledge are two different memory systems. Spaced repetition is good for declarative knowledge, but for procedural (like playing music) you need lots of practice. Other examples include math and programming - you can learn lots of declarative knowledge about the concepts involved, but you still need to practice solving problems or writing code.

Edit: as for why practice every day - the procedural system requires a lot more practice than the declarative system does.

max-lee on [AN #140]: Theoretical models that predict scaling laws

I think this little mistake doesn't affect the gist of your summary post, so I wouldn't worry about it too much.

The mistake

The mistaken argument was an attempt to explain why $α \geq 2 / d$. Believe it or not, the paper never actually argued $α \geq 2 / d$; it argued $α \geq 1 / d$. That's because the paper used the L2 loss, and $α \geq 2 / d (L2) ⟺ α \geq 1 / d (L1)$.
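One way to see how the two exponents relate, assuming (as an illustration, not something stated in the paper) that the typical pointwise error $\epsilon = |f - F|$ follows a power law in the dataset size $D$ with some exponent $a$:

```latex
\epsilon \propto D^{-a}
\quad\Longrightarrow\quad
L_{1} \propto \epsilon \propto D^{-a},
\qquad
L_{2} \propto \epsilon^{2} \propto D^{-2a}
\quad\Longrightarrow\quad
\alpha_{L2} = 2\,\alpha_{L1}.
```

Under that assumption, a bound of $2 / d$ on the L2 exponent is the same statement as a bound of $1 / d$ on the L1 exponent.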

I feel that was the main misunderstanding.

If you read closely, it only says $α \geq 4 / d (L2)$ for piecewise linear approximations:

If we assume that $F$ and $f$ are analytic functions on $M_{d}$ and that the loss function $L (f, F)$ is analytic in $f - F$ and minimized at $f = F$, then the loss at a given test input, $x_{test}$, can be expanded around the nearest training point, $\hat{x}_{train}$, $L (x_{test}) = \sum_{m = n \geq 2}^{\infty} a_{m} (\hat{x}_{train}) (x_{test} - \hat{x}_{train})^{m}$, where the first term is of finite order $n \geq 2$ because the loss vanishes at the training point. As the typical distance between nearest neighbor points scales as $D^{- 1 / d}$ on a $d$-dimensional manifold (an observation also made in [18]), the loss will be dominated by the leading term, $L \propto D^{- n / d}$, at large $D$. Note that if the model provides an accurate piecewise linear approximation, we will generically find $n \geq 4$.

Figure 2b in the paper shows $α = 4 / d (L2)$ for most models but $α = 2 / d (L2)$ for some models.

Some people like the L2 loss (variance) because you can use simple addition to combine independent sources of variance. The variance increase from each source doesn't depend on the variance increase from the other sources.
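The additivity mentioned here is the standard identity for the variance of a sum, written out for two sources $X$ and $Y$ (the symbols are illustrative):

```latex
\operatorname{Var}(X + Y)
  = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\,\operatorname{Cov}(X, Y),
\qquad
X \text{ and } Y \text{ independent}
  \;\Longrightarrow\; \operatorname{Cov}(X, Y) = 0 .
```

With independent sources the cross term vanishes, so each source contributes its own variance regardless of the others.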

The paper never mentioned L2 loss, just that the loss function is "analytic" and minimized at $f = F$. Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it's cut in half by a graph.

Nearest neighbors regression/classification has $α = 1 / d (L1)$

If you use the nearest neighbors regression to fit the function $f (x) = \sin (x / π)$, it creates a piecewise constant function that looks like a staircase trying to approximate a curve. The L1 loss is proportional to D^(-1) because adding data makes the staircase smaller. A better attempt to connect the dots of the training data (e.g. a piecewise linear function) would have L1 loss proportional to D^(-2) or better because adding data makes the "staircase" both smaller and smoother. Both the distances and the loss gradient decrease.
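The staircase argument can be checked numerically. The sketch below is an illustration under assumptions that are not in the comment: a one-dimensional input ($d = 1$), evenly spaced training points on the interval $[0, 10]$, and an exponent estimated from just two dataset sizes. All function names are hypothetical.

```python
import math
from bisect import bisect_left


def f(x: float) -> float:
    """Target function from the comment: sin(x / pi)."""
    return math.sin(x / math.pi)


def grid(n: int, lo: float = 0.0, hi: float = 10.0) -> list:
    """n evenly spaced points on [lo, hi] (the interval is an assumption)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def nearest_neighbor(xs, ys, x):
    """Piecewise constant prediction: value at the closest training point."""
    i = bisect_left(xs, x)
    if i == 0:
        return ys[0]
    if i == len(xs):
        return ys[-1]
    return ys[i] if xs[i] - x < x - xs[i - 1] else ys[i - 1]


def linear_interp(xs, ys, x):
    """Piecewise linear prediction between the two bracketing training points."""
    i = bisect_left(xs, x)
    if i == 0:
        return ys[0]
    if i == len(xs):
        return ys[-1]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def l1_loss(predict, n_train: int, n_test: int = 20001) -> float:
    """Mean absolute error of `predict` on a dense test grid."""
    xs = grid(n_train)
    ys = [f(x) for x in xs]
    test = grid(n_test)
    return sum(abs(predict(xs, ys, x) - f(x)) for x in test) / n_test


def scaling_exponent(predict, d_small: int = 100, d_large: int = 1000) -> float:
    """Estimate alpha in loss ~ D^(-alpha) from two dataset sizes."""
    ratio = l1_loss(predict, d_small) / l1_loss(predict, d_large)
    return math.log(ratio) / math.log(d_large / d_small)


if __name__ == "__main__":
    print("nearest neighbor alpha:", scaling_exponent(nearest_neighbor))
    print("piecewise linear alpha:", scaling_exponent(linear_interp))
```

If the argument holds, the nearest-neighbor estimate should land near 1 and the piecewise linear estimate near 2 in this one-dimensional setting.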

My guess is that nearest neighbors classification also has L1 loss proportional to D^(-1/d), because the volume that is misclassified is roughly proportional to the distance between the decision boundary and the nearest point. Again, I think it creates a decision boundary that is rough (analogous to the staircase), and the roughness gets smaller with additional data but never gets any smoother.

What a rough decision boundary looks like:
https://upload.wikimedia.org/wikipedia/commons/5/52/Map1NN.png

Suggestions

If you want a quick fix (still using the L1 loss), you might change the following:

Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.

becomes

Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, the loss is zero at the training data point. With a constant term of zero, we use the linear term (gradient $\cdot$ displacement), which is proportional to the distance. So, we expect that our scaling law will look like kD^(-1/d), that is, α = 1/d.
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network (as the distance decreases the linear term also decreases). In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 1/d and for L2 loss α >= 2/d.
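The revised first paragraph corresponds to the following expansion, a sketch that follows the comment's argument and assumes the loss is differentiable at the training point with a non-zero gradient there. Here $r$ is shorthand (introduced for illustration) for the displacement from the nearest training point.

```latex
r = x_{test} - x_{train}, \qquad \lVert r \rVert \propto D^{-1/d}

L(x_{test}) = \underbrace{L(x_{train})}_{=\,0}
            + \nabla L(x_{train}) \cdot r
            + O\!\left(\lVert r \rVert^{2}\right)

\nabla L(x_{train}) \neq 0
\;\Longrightarrow\;
L(x_{test}) \propto \lVert r \rVert \propto D^{-1/d}
\;\Longrightarrow\;
\alpha = 1/d
```

If the gradient did vanish at the training point, as the original argument assumed, the quadratic term would lead instead and give $D^{-2/d}$.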

Also change

α >= 2/d (and α >= 4/d for the case of L2 loss)

checking whether α >= 4/d

they find that α is significantly larger than 4/d

to

α >= 1/d (and α >= 2/d for the case of L2 loss)

checking whether α >= 2/d

they find that α is significantly larger than 2/d

Another option is to switch to the L2 loss.

Conclusion

Once AI scientists make a false conclusion like $α \geq 2 / d (L1)$ or $\nabla L (x_{train}) = 0$ , they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.

I'm so sorry I'm so immature but I can't help it.

Overall, it's not a big mistake because it doesn't invalidate the gist of the summary.

reed on the case for CoT unfaithfulness is overstated

Some excellent points (and I enjoyed the neat self-referentialism).

Headline take is I agree with you that CoT unfaithfulness - as Turpin and Lanham have operationalised it - is unlikely to pose a problem for the alignment of LLM-based systems.

I think this for the same reasons you state:

1. Unfaithfulness is primarily a function of the training distribution, only appears in particular contexts, and might potentially be avoided by simply asking the LLM to explain its rationale;

and

2. The o1 paradigm of RLing the CoT seems likely to remove unfaithfulness from o1's CoTs.
The way that these CoTs work seems to make the model systematically hyper-attentive to different facets of its context window in a way that is verbalised in detail (and is typically enabled by a literal "Hmmmm..." that draws its attention to an unconsidered detail). It strikes me that this would totally rule out the chance that the model still pays attention to bias without saying it: "But wait... The user implied that option A was probably correct". This is partially an empirical question - since we can't see the o1 CoTs, I'd pipedream love to see OpenAI do and publish research on whether this is true.

This suggests to me that o1's training might already have succeeded at giving us what we'd want: an LLM that does, in fact, just say how it made its decision. (It remains an open question whether simply prompting normal LLMs to explain their rationale would also work).

The only part of the Turpin paper that remains potentially worrying to me is the (actually unsurprising) demonstrated capacity of an LLM to fabricate spurious reasoning ("shooting outside the eighteen is not a common phrase in soccer") in order to support a particular decision.

You can imagine all sorts of innocuous contexts that might incentivise an LLM to do this kind of thing. This might present some difficulties for oversight regimes - this is why I'd be interested in seeing whether something like o1 is capable of front-to-back undertaking an entire complex and malicious action whilst managing to obscure the actual nature of that malicious action (such that an automated LLM judge fails to detect that the action is in fact malicious).

nicholaskees on the case for CoT unfaithfulness is overstated

then we can all go home, right?

Doesn't this just shift what we worry about? If control of roughly human level and slightly superhuman systems is easy, that still leaves:

Human institutions using AI to centralize power
Conflict between human-controlled AI systems
Going out with a whimper scenarios (or other multi-agent problems)
Not understanding the reasoning of vastly superhuman AI (even with CoT)

What feels underexplored to me is: If we can control roughly human-level AI systems, what do we DO with them?

hector-perez-arenas on My hopes for YouCongress.com

Being able to vote for 'liking' or 'disliking' the quality of a Digital Twin response. Enough dislikes might trigger a model update and regeneration of the response.

@Nathan Helm-Burger As a first step, moderators can now regenerate AI responses. I have also added you as a moderator, so feel free to use this feature.

Once we have more users (and have more time/resources to do it properly), we may introduce a rating system as you suggest. For now, any user can report a comment, and a moderator can regenerate it if deemed reasonable. Does it make sense?

Thanks!

will_pearson on Will_Pearson's Shortform

I've been thinking about non AI catastrophic risks.

One that I've not seen talked about is the idea of cancerous ideas. That is, ideas that spread throughout a population and crowd out other ideas for attention and resources.

This could lead to civilisational collapse due to basic functions not being performed.

Safeguards for this are partitioning the idea space and some form of immune system that targets ideas that spread uncontrollably.

taras-kutsyk on Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Thanks! We'll take a closer look at these when we decide to extend our results for more models.

taras-kutsyk on Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Let me make sure I understand your idea correctly:

We use a separate single-layer model (analogous to the SAE encoder) to predict the SAE feature activations
We train this model on the SAE activations of the finetuned model (assuming that the SAE wasn't finetuned on the finetuned model activations?)
We then use this model to determine "what direction most closely maps to the activation pattern across input sequences, and how well it maps".

I'm most unsure about the 2nd step - how we train this feature-activation model. If we train it on the base SAE activations in the finetuned model, I'm afraid we'll just train it on extremely noisy data, because feature activations essentially do not mean the same thing, unless your SAE has been finetuned to appropriately reconstruct the finetuned model activations. (And if we finetune it, we might just as well use the SAE and feature-universality techniques I outlined without needing a separate model).

bohaska on Open Thread Fall 2024

If spaced repetition is the most efficient way of remembering information, why do people who learn a musical instrument practice every day instead of adhering to a spaced repetition schedule?