Comments

Comment by neverix on 200 COP in MI: Exploring Polysemanticity and Superposition · 2023-11-23T22:34:42.504Z · LW · GW

To clarify, I thought it was about superposition happening inside the projection afterwards.

Comment by neverix on 200 COP in MI: Exploring Polysemanticity and Superposition · 2023-11-21T22:30:07.622Z · LW · GW

This happens in transformer MLP layers. Note that the hidden dimension...

Is the point that transformer MLPs blow up the hidden dimension in the middle?
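For context on the question, here is a minimal numpy sketch of the standard transformer MLP shape (the sizes are hypothetical; real models typically use d_hidden = 4 * d_model):

```python
import numpy as np

# Hypothetical sizes; in GPT-style transformers d_hidden is typically 4 * d_model.
d_model, d_hidden = 64, 256

W_in = np.random.randn(d_model, d_hidden) * 0.02   # up-projection
W_out = np.random.randn(d_hidden, d_model) * 0.02  # down-projection

def mlp(x):
    # x: (seq_len, d_model); the nonlinearity acts in the larger d_hidden space
    h = np.maximum(x @ W_in, 0.0)  # ReLU in the expanded dimension
    return h @ W_out

x = np.random.randn(8, d_model)
y = mlp(x)
```

The "blow-up" in question is exactly the intermediate `(seq_len, d_hidden)` activation, which is wider than the residual stream on both sides of it.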

Comment by neverix on Steering GPT-2-XL by adding an activation vector · 2023-08-31T19:41:45.791Z · LW · GW

Activation additions in generative models

 

Also related is https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the UNet bottleneck in diffusion to edit images using CLIP.
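For readers unfamiliar with the mechanic: an activation addition simply adds a vector to an intermediate activation during the forward pass. A minimal numpy sketch (the toy `block`, the sizes, and `alpha` are hypothetical stand-ins, not taken from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation width

def block(x, steering=None, alpha=1.0):
    # stand-in for an intermediate layer (e.g. a UNet bottleneck)
    h = np.tanh(x)
    if steering is not None:
        h = h + alpha * steering  # activation addition: shift the activation by a fixed vector
    return h

x = rng.normal(size=d)
v = rng.normal(size=d)  # steering vector; in the linked paper a small network produces it from a CLIP objective
steered = block(x, steering=v)
plain = block(x)
```

The design point is that the steering vector is injected at inference time, leaving the model's weights untouched.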

Comment by neverix on The Low-Hanging Fruit Prior and sloped valleys in the loss landscape · 2023-08-24T18:18:40.530Z · LW · GW

From a conversation on Discord:

Do you have in mind a way to weigh sequential learning into the actual prior?

Dmitry:

good question! We haven't thought about an explicit complexity measure that would give this prior, but a very loose approximation that we've been keeping in the back of our minds could be a Turing machine/Boolean circuit version of the "BIMT" weight penalty from this paper https://arxiv.org/abs/2305.08746 (which they show encourages modularity at least in toy models)

Response:

Hmm, BIMT seems to only be about intra-layer locality. It would certainly encourage learning an ensemble of features, but I'm not sure if it would capture the interesting bit, which I think is the fact that features are built up sequentially from earlier to later layers and changes are only accepted if they improve local loss.

I'm thinking about something like the existence of a relatively smooth scaling law (?) as the criterion.

So, just some smoothness constraint that would basically integrate over paths SGD could take.

Comment by neverix on Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program · 2023-08-02T13:26:52.083Z · LW · GW

We can idealize the outer alignment solution as a logical inductor.

Why outer?

Comment by neverix on The "spelling miracle": GPT-3 spelling abilities and glitch tokens revisited · 2023-08-01T08:20:39.110Z · LW · GW

You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.
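The search described here amounts to gradient-based attribution: score each corpus sample by the cosine similarity of its loss gradient with the gradient from the spelling task. A toy numpy sketch on a linear model (everything here is a hypothetical stand-in for the real LLM computation):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, x, y):
    # gradient of squared error for a toy linear model y_hat = w @ x
    return 2 * (w @ x - y) * x

w = rng.normal(size=4)
task_x, task_y = rng.normal(size=4), 1.0  # stands in for a spelling-task batch
grad_task = loss_grad(w, task_x, task_y)

def score(x, y):
    # cosine similarity between a sample's gradient and the task gradient
    g = loss_grad(w, x, y)
    return g @ grad_task / (np.linalg.norm(g) * np.linalg.norm(grad_task) + 1e-12)

# rank "corpus samples" by gradient similarity to the task gradient
corpus = [(rng.normal(size=4), rng.normal()) for _ in range(100)]
ranked = sorted(corpus, key=lambda s: score(*s), reverse=True)
```

The top of `ranked` plays the role of the corpus samples most likely to have taught the spelling behavior.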

Comment by neverix on Hessian and Basin volume · 2023-06-06T15:56:52.579Z · LW · GW

There are also somewhat principled reasons for using a "fuzzy ellipsoid", which I won't explain here.

If you view  as 2x the learning rate, the ellipsoid contains the parameters that will jump straight into the basin under the quadratic approximation, and we assume that for points outside the basin the approximation breaks down entirely. If you account for gradient noise in the form of a Gaussian with sigma equal to the gradient, the probability density of the resulting point landing in the basin equals the density of a Gaussian parametrized by the ellipsoid, evaluated at the preceding point. This is wrong, but there is an interpretation of the noise as a Gaussian whose variance increases away from the basin's origin.
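One way to make the jump-into-the-basin condition concrete (my reconstruction of the standard quadratic-approximation argument, not a claim from the post): near a minimum $\theta^*$,

```latex
\mathcal{L}(\theta) \approx \tfrac{1}{2}\,(\theta - \theta^*)^\top H\,(\theta - \theta^*),
\qquad
\theta' = \theta - \eta\, H\,(\theta - \theta^*).
```

Along an eigendirection of $H$ with eigenvalue $\lambda$, the step gives $\theta' - \theta^* = (1 - \eta\lambda)(\theta - \theta^*)$: the point lands exactly on the minimum when $\eta = 1/\lambda$ and still contracts whenever $\lambda < 2/\eta$. So twice the inverse learning rate is the curvature cutoff bounding the ellipsoid of parameters that fall into the basin.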

Comment by neverix on Power-seeking can be probable and predictive for trained agents · 2023-04-20T15:50:14.875Z · LW · GW

Seems like quoting doesn't work for LaTeX; it was definitions 2/3. Reading it again, I saw that D2 was indeed applicable to sets.

Comment by neverix on Power-seeking can be probable and predictive for trained agents · 2023-04-19T20:42:10.366Z · LW · GW

A_0 > A_1

How is orbit comparison for sets defined?

Comment by neverix on The Credit Assignment Problem · 2023-02-23T13:39:31.171Z · LW · GW

course

coarse?

Comment by neverix on Is there a ML agent that abandons it's utility function out-of-distribution without losing capabilities? · 2023-02-22T17:04:19.353Z · LW · GW

This is the whole point of goal misgeneralization. They have experiments (albeit on toy environments that can be explained by the network finding the wrong algorithm), so I'd say quite plausible.

Comment by neverix on A note about differential technological development · 2023-02-20T10:26:44.776Z · LW · GW

sidereal

typo?

Comment by neverix on Is InstructGPT Following Instructions in Other Languages Surprising? · 2023-02-14T18:46:40.758Z · LW · GW

Is RLHF updating abstract circuits an established fact? Why would it suffer from mode collapse in that case?

Comment by neverix on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T15:57:13.892Z · LW · GW

It is based on this. I changed it to optimize using softmax instead of straight-through estimation and added regularization for the embedded tokens.

Notebook link - this is a version that mimics this post instead of optimizing a single neuron as in the original.

EDIT: github link
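To spell out the softmax trick for anyone curious: instead of straight-through gradients on discrete tokens, keep logits over the vocabulary, softmax them, and feed the resulting convex combination of token embeddings forward, so the whole pipeline is differentiable. A toy numpy sketch with a trivial objective (the tiny "model", sizes, and learning rate are hypothetical, not the notebook's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 8
E = rng.normal(size=(vocab, d))   # token embedding table
target = rng.normal(size=d)       # toy objective: produce an embedding close to this vector

logits = np.zeros(vocab)          # relaxed "prompt": logits over the vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):
    p = softmax(logits)
    emb = p @ E                    # soft mixture of token embeddings (differentiable)
    grad_emb = 2 * (emb - target)  # gradient of ||emb - target||^2 w.r.t. emb
    s = E @ grad_emb               # per-token alignment with the embedding gradient
    grad_logits = p * (s - p @ s)  # backprop through the softmax Jacobian
    logits -= 0.5 * grad_logits

token = int(np.argmax(logits))     # discretize only at the end
final_loss = float(np.sum((softmax(logits) @ E - target) ** 2))
```

The regularization mentioned above would be an extra penalty term on `emb` keeping it near actual token embeddings; it is omitted here for brevity.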

Comment by neverix on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T19:31:43.296Z · LW · GW

This project tried this.

Comment by neverix on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T19:26:08.411Z · LW · GW

I did some similar experiments two months ago, and with your setup the special tokens show up on the first attempt:

Comment by neverix on Thoughts on the impact of RLHF research · 2023-01-26T07:51:25.617Z · LW · GW

techn ical

typo?

Comment by neverix on A History of Bayes' Theorem · 2022-10-30T12:50:03.073Z · LW · GW

Our of curiosity

typo

Comment by neverix on What's the difference between newer Atari-playing AI and the older Deepmind one (from 2014)? · 2021-11-04T12:15:33.581Z · LW · GW

s/EfficientNet/EfficientZero?