Posts

Humans aren't fleeb. 2024-01-24T05:31:46.929Z
Neural uncertainty estimation review article (for alignment) 2023-12-05T08:01:32.723Z
How to solve deception and still fail. 2023-10-04T19:56:56.254Z
Two Hot Takes about Quine 2023-07-11T06:42:46.754Z
Some background for reasoning about dual-use alignment research 2023-05-18T14:50:54.401Z
[Simulators seminar sequence] #2 Semiotic physics - revamped 2023-02-27T00:25:52.635Z
Shard theory alignment has important, often-overlooked free parameters. 2023-01-20T09:30:29.959Z
[Simulators seminar sequence] #1 Background & shared assumptions 2023-01-02T23:48:50.298Z
Take 14: Corrigibility isn't that great. 2022-12-25T13:04:21.534Z
Take 13: RLHF bad, conditioning good. 2022-12-22T10:44:06.359Z
Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. 2022-12-20T05:01:50.659Z
Take 11: "Aligning language models" should be weirder. 2022-12-18T14:14:53.767Z
Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. 2022-12-13T07:04:35.686Z
Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. 2022-12-12T11:51:42.758Z
Take 8: Queer the inner/outer alignment dichotomy. 2022-12-09T17:46:26.383Z
Take 7: You should talk about "the human's utility function" less. 2022-12-08T08:14:17.275Z
Take 6: CAIS is actually Orwellian. 2022-12-07T13:50:38.221Z
Take 5: Another problem for natural abstractions is laziness. 2022-12-06T07:00:48.626Z
Take 4: One problem with natural abstractions is there's too many of them. 2022-12-05T10:39:42.055Z
Take 3: No indescribable heavenworlds. 2022-12-04T02:48:17.103Z
Take 2: Building tools to help build FAI is a legitimate strategy, but it's dual-use. 2022-12-03T00:54:03.059Z
Take 1: We're not going to reverse-engineer the AI. 2022-12-01T22:41:32.677Z
Some ideas for epistles to the AI ethicists 2022-09-14T09:07:14.791Z
The Solomonoff prior is malign. It's not a big deal. 2022-08-25T08:25:56.205Z
Reducing Goodhart: Announcement, Executive Summary 2022-08-20T09:49:23.881Z
Reading the ethicists 2: Hunting for AI alignment papers 2022-06-06T15:49:03.434Z
Reading the ethicists: A review of articles on AI in the journal Science and Engineering Ethics 2022-05-18T20:52:20.942Z
Thoughts on AI Safety Camp 2022-05-13T07:16:55.533Z
New year, new research agenda post 2022-01-12T17:58:15.833Z
Supervised learning and self-modeling: What's "superhuman?" 2021-12-09T12:44:14.004Z
Goodhart: Endgame 2021-11-19T01:26:30.487Z
Models Modeling Models 2021-11-02T07:08:44.848Z
Goodhart Ethology 2021-09-17T17:31:33.833Z
Competent Preferences 2021-09-02T14:26:50.762Z
Introduction to Reducing Goodhart 2021-08-26T18:38:51.592Z
How to turn money into AI safety? 2021-08-25T10:49:01.507Z
HCH Speculation Post #2A 2021-03-17T13:26:46.203Z
Hierarchical planning: context agents 2020-12-19T11:24:09.064Z
Modeling humans: what's the point? 2020-11-10T01:30:31.627Z
What to do with imitation humans, other than asking them what the right thing to do is? 2020-09-27T21:51:36.650Z
Charlie Steiner's Shortform 2020-08-04T06:28:11.553Z
Constraints from naturalized ethics. 2020-07-25T14:54:51.783Z
Meta-preferences are weird 2020-07-16T23:03:40.226Z
Down with Solomonoff Induction, up with the Presumptuous Philosopher 2020-06-12T09:44:29.114Z
The Presumptuous Philosopher, self-locating information, and Solomonoff induction 2020-05-31T16:35:48.837Z
Life as metaphor for everything else. 2020-04-05T07:21:11.303Z
Meta-preferences two ways: generator vs. patch 2020-04-01T00:51:49.086Z
Gricean communication and meta-preferences 2020-02-10T05:05:30.079Z
Impossible moral problems and moral authority 2019-11-18T09:28:28.766Z
What's the dream for giving natural language commands to AI? 2019-10-08T13:42:38.928Z

Comments

Comment by Charlie Steiner on Improving Dictionary Learning with Gated Sparse Autoencoders · 2024-04-25T22:19:01.069Z · LW · GW

Nice. I tried to do something similar (except making everything leaky with polynomial tails), so

y = (y+torch.sqrt(y**2+scale**2)) * (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) / 4

where the first part (y+torch.sqrt(y**2+scale**2)) is a softplus, and the second part (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) is a leaky cutoff at the value threshold.

But I don't think I got such clearly better results, so I'm going to have to read more thoroughly to see what else you were doing that I wasn't :)
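
For reference, here's that formula wrapped up as a runnable function (a minimal sketch; the function name, the default parameter values, and the interpretive comments are my own additions):

import torch

def leaky_poly_activation(y: torch.Tensor, scale: float = 0.1, threshold: float = 0.5) -> torch.Tensor:
    # Softplus-like piece: ~0 for very negative y, ~2*y for very positive y, with polynomial tails.
    softplus_part = y + torch.sqrt(y**2 + scale**2)
    # Leaky cutoff: ramps smoothly from ~0 up to ~2 as (y + threshold) goes from negative to positive.
    cutoff_part = 1 + (y + threshold) / torch.sqrt((y + threshold)**2 + scale**2)
    # The /4 normalizes the product so the output is ~y for large positive y.
    return softplus_part * cutoff_part / 4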

Comment by Charlie Steiner on Neural uncertainty estimation review article (for alignment) · 2024-04-23T17:18:36.036Z · LW · GW

I'm actually not familiar with the nitty gritty of the LLM forecasting papers. But I'll happily give you some wild guessing :)

My blind guess is that the "obvious" stuff is already done (e.g. calibrating or fine-tuning single-token outputs on predictions about facts after the date of data collection), but not enough people are doing ensembling over different LLMs to improve calibration.

I also expect a lot of people prompting LLMs to give probabilities in natural language, and that clever people are already combining these with fine-tuning or post-hoc calibration. But I'd bet people aren't doing enough work to aggregate answers from lots of prompting methods, and then tuning the aggregation function based on the data.
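
As a gesture at that last part, here's a minimal sketch of tuning an aggregation function on resolved questions (the log-odds featurization and the choice of logistic regression are assumptions for illustration, not anything from the papers):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_aggregator(probs: np.ndarray, outcomes: np.ndarray) -> LogisticRegression:
    # probs: [n_questions, n_prompt_methods] probabilities elicited by different prompting methods.
    # outcomes: [n_questions] resolved answers (0 or 1).
    probs = np.clip(probs, 1e-4, 1 - 1e-4)
    logits = np.log(probs / (1 - probs))   # work in log-odds space
    agg = LogisticRegression()             # the learned weights are the data-tuned aggregation function
    agg.fit(logits, outcomes)
    return agg                             # agg.predict_proba(new_logits)[:, 1] gives aggregated estimates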

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2024-04-21T12:56:31.750Z · LW · GW

Humans using SAEs to improve linear probes / activation steering vectors might quickly get replaced by a version of probing / steering that leverages unlabeled data.

Like, probing is finding a vector along which labeled data varies, and SAEs are finding vectors that are a sparse basis for unlabeled data. You can totally do both at once - find a vector along which labeled data varies and is part of a sparse basis for unlabeled data.
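
A toy sketch of one way to do both at once (the tied encoder, reusing dictionary row 0 as the probe direction, and the loss coefficients are all assumptions for illustration):

import torch
import torch.nn.functional as F

def joint_loss(acts, labels, dictionary, l1_coef=1e-3, probe_coef=1.0):
    # acts: [batch, d_model] activations; labels: [batch] binary labels for the concept of interest.
    # dictionary: [n_features, d_model]; row 0 doubles as the supervised probe direction.
    codes = F.relu(acts @ dictionary.T)          # tied encoder, for brevity
    recon = codes @ dictionary
    sae_loss = ((acts - recon) ** 2).mean() + l1_coef * codes.abs().mean()
    probe_logits = acts @ dictionary[0]          # feature 0 is also the probe vector
    probe_loss = F.binary_cross_entropy_with_logits(probe_logits, labels.float())
    return sae_loss + probe_coef * probe_loss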

This is a little bit related to an idea with the handle "concepts live in ontologies." If I say I'm going to the gym, this concept of "going to the gym" lives in an ontology where people and activities are basic components - it's probably also easy to use ideas like "You're eating dinner" in that ontology, but not "1,3-diisocyanatomethylbenzene." When you try to express one idea, you're also picking a "basis" for expressing similar ideas.

Comment by Charlie Steiner on Any evidence or reason to expect a multiverse / Everett branches? · 2024-04-10T13:46:10.029Z · LW · GW

I found someone's thesis from 2020 (Hoi Wai Lai) that sums it up not too badly (from the perspective of someone who wants to make Bohmian mechanics work and was willing to write a thesis about it).

For special relativity (section 6), the problem is that the motion of each hidden particle depends instantaneously on the entire multi-particle wavefunction. According to Lai, there's nothing better than to bite the bullet and define a "real present" across the universe, and have the hidden particles sometimes go faster than light. What hypersurface counts as the real present is unobservable to us, but the motion of the hidden particles cares about it.
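
For concreteness, the textbook non-relativistic guiding equation (standard form, not something I'm quoting from the thesis) makes the dependence explicit - each particle's velocity depends on the wavefunction evaluated at the simultaneous positions of all the particles:

\frac{dQ_k}{dt} = \frac{\hbar}{m_k}\,\operatorname{Im}\!\left[\frac{\nabla_k \psi(Q_1,\dots,Q_N,t)}{\psi(Q_1,\dots,Q_N,t)}\right]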

For varying particle number (section 7.4), the problem is that in quantum mechanics you can have a superposition of states with different numbers of particles. If there's some hidden variable tracking which part of the superposition is "real," this hidden variable has to behave totally differently from a particle! Lai says this leads to "Bell-type" theories, where there's a single hidden variable, a hidden trajectory in configuration space. Honestly this actually seems more satisfactory than how it deals with special relativity - you just had to sacrifice the notion of independent hidden variables behaving like particles; you didn't have to allow for superluminal communication in a way that highlights how pointless the hidden variables are.

Warning: I have exerted basically no effort to check if this random grad student was accurate.

Comment by Charlie Steiner on Any evidence or reason to expect a multiverse / Everett branches? · 2024-04-09T10:34:50.673Z · LW · GW

My understanding is that pilot wave theory (ie Bohmian mechanics) explains all the quantum physics

This is only true if you don't count relativistic field theory. Bohmian mechanics has mathematical troubles extending to special relativity or particle creation/annihilation operators.

Is there any reason at all to expect some kind of multiverse?

Depending on how big you expect the unobservable universe to be, there can also be a spacelike multiverse.

Comment by Charlie Steiner on LLMs for Alignment Research: a safety priority? · 2024-04-05T07:27:33.499Z · LW · GW

Wouldn't other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?

I don't think people working on alignment research assistants are planning to just turn it on and leave the building, they on average (weighted by money) seem to be imagining doing things like "explain an experiment in natural language and have an AI help implement it rapidly."

So I think both they and this post are describing the strategy of "building very generally useful AI, but the good guys will be using it first." I hear you as saying you want a slightly different profile of generally-useful skills to be targeted.

Comment by Charlie Steiner on New paper on aligning AI with human values · 2024-04-02T11:45:55.470Z · LW · GW

I have now read the paper, and still think you did a great job.

One gripe I have is with this framing:

We believe our articulation of human values as constitutive attentional policies is much closer to “what we really care about”, and is thus less prone to over-optimization

If you were to heavily optimize for text that humans would rate highly on specific values, you would run into the usual problems (e.g. model incentivized to manipulate the human). Your success here doesn't come from the formulation of the values per se, but rather from the architecture that turns them into text/actions - rather than optimizing for them directly, you can prompt an LLM that's anchored on normal human text to mildly optimize them for you.

This difference implies some important points about scaling to more intelligent systems (even without making any big pivots):

  • We don't want the model to optimize for the stated values unboundedly hard, so we'll have to end up asking for something mild and human-anchored more explicitly.
  • If another use of AI is proposing changes to the moral graph, we don't want that process to form an optimization feedback loop (unless we're really sure).

The main difference made by the choice of format of values is where to draw the boundary between legible human deliberation, and illegible LLM common sense.

I'm excited for future projects that are sort of in this vein but try to tackle moral conflict, or that try to use continuous rather than discrete prompts that can interpolate values, or explore different sorts of training of the illegible-common-sense part, or any of a dozen other things.

Comment by Charlie Steiner on New paper on aligning AI with human values · 2024-03-31T06:14:44.664Z · LW · GW

Awesome to see this come to fruition. I think if a dozen different groups independently tried to attack this same problem head-on, we'd learn useful stuff each time.

I'll read the whole paper more thoroughly soon, but my biggest question so far is whether you collected data about what happens to your observables if you change the process along sensible-seeming axes.

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2024-03-29T21:01:04.843Z · LW · GW

Regular AE's job is to throw away the information outside some low-dimensional manifold, sparse ~linear AE's job is to throw away the information not represented by sparse dictionary codes. (Also a low-dimensional manifold, I guess, just made from a different prior.)

If an AE is reconstructing poorly, that means it was throwing away a lot of information. How important that information is seems like a question about which manifold the underlying network "really" generalizes according to. And also what counts as an anomaly / what kinds of outliers you're even trying to detect.

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2024-03-29T16:45:27.314Z · LW · GW

Ah, yeah, that makes sense.

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2024-03-29T02:44:13.453Z · LW · GW

Even for an SAE that's been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.

Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I'm uncertain if it's going to be weak against adversarial anomalies relative to regular ol' random anomalies.

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2024-03-28T20:36:43.538Z · LW · GW

Dictionary/SAE learning on model activations is bad as anomaly detection because you need to train the dictionary on a dataset, which means you needed the anomaly to be in the training set.

How to do dictionary learning without a dataset? One possibility is to use uncertainty-estimation-like techniques to detect when the model "thinks it's on-distribution" for randomly sampled activations.

Comment by Charlie Steiner on Are (Motor)sports like F1 a good thing to calibrate estimates against? · 2024-03-28T10:26:28.142Z · LW · GW

Tracking your predictions and improving your calibration over time is good. So is practicing making outside-view estimates based on related numerical data. But I think diversity is good.

If you start going back through historical F1 data as prediction exercises, I expect the main thing that will happen is you'll learn a lot about the history of F1. Secondarily, you'll get better at avoiding your own biases, but in a way that's concentrated on your biases relevant to F1 predictions.

If you already want to learn more about the history of F1, then go for it, it's not hurting anyone :) Estimating more diverse things will probably better prepare you for making future non-F1 estimates, but if you're going to pay attention to F1 anyhow it might be a fun thing to track.

Comment by Charlie Steiner on Was Releasing Claude-3 Net-Negative? · 2024-03-28T03:33:31.842Z · LW · GW

Yup, I basically agree with this. Although we shouldn't necessarily only focus on OpenAI as the other possible racer. Other companies (Microsoft, Twitter, etc) might perceive a need to go faster / use more resources to get a business advantage if the LLM marketplace seems more crowded.

Comment by Charlie Steiner on Modern Transformers are AGI, and Human-Level · 2024-03-26T20:18:46.057Z · LW · GW

I also like "transformative AI."

I don't think "AGI" or "human-level" are especially bad terms - most category nouns are bad terms (like "heap"), in the sense that they're inherently fuzzy gestures at the structure of the world. It's just that in the context of 2024, we're now inside the fuzz.

A mile away from your house, "towards your house" is a useful direction. Inside your front hallway, "towards your house" is a uselessly fuzzy direction - and a bad term. More precision is needed because you're closer.

Comment by Charlie Steiner on Neuroscience and Alignment · 2024-03-20T00:28:44.194Z · LW · GW

The brain algorithms that do moral reasoning are value-aligned in the same way a puddle is aligned with the shape of the hole it's in.

They're shaped by all sorts of forces, ranging from social environment to biological facts like how we can't make our brains twice as large. Not just during development, but on an ongoing basis our moral reasoning exists in a balance with all these other forces. But of course, a puddle always coincidentally finds itself in a hole that's perfectly shaped for it.

If you took the decision-making algorithms from my brain and put them into a brain 357x larger, that tautological magic spell might break, and the puddle that you've moved into a different hole might no longer be the same shape as it was in the original hole.

If you anticipate this general class of problems and try to resolve them, that's great! I'm not saying nobody should do neuroscience. It's just that I don't think it's an "entirely scientific approach, requiring minimal philosophical deconfusion," nor does it lead to safe AIs that are just emulations of humans except smarter.

Comment by Charlie Steiner on What is the best argument that LLMs are shoggoths? · 2024-03-18T02:45:45.365Z · LW · GW

They can certainly use answer text as a scratchpad (even nonfunctional text that gives more space for hidden activations to flow). But they don't without explicit training. Actually maybe they do - maybe RLHF incentivizes a verbose style to give more room for thought. But I think even "thinking step by step," there are still plenty of issues.

Tokenization is definitely a contributor. But that doesn't really support the notion that there's an underlying human-like cognitive algorithm behind human-like text output. The point is the way it adds numbers is very inhuman, despite producing human-like output on the most common/easy cases.

Comment by Charlie Steiner on What is the best argument that LLMs are shoggoths? · 2024-03-17T14:39:34.151Z · LW · GW

I'm not totally sure the hypothesis is well-defined enough to argue about, but maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?

If the internals were like a human thinking about the question and then giving an answer, it would probably be able to add numbers more reliably. And I also suspect the pattern of mistakes doesn't look typical for a human at any developmental stage (once a human can add 3-digit numbers, their success rate at 5-digit numbers is probably pretty good). I vaguely recall some people looking at this, but have forgotten the reference, sorry.

Comment by Charlie Steiner on Are AIs conscious? It might depend · 2024-03-16T03:53:42.560Z · LW · GW

A different question: When does it make your (mental) life easier to categorize an AI as conscious, so that you can use the heuristics you've developed about what conscious things are like to make good judgments?

Sometimes, maybe! Especially if lots of work has been put in to make said AI behave in familiar ways along many axes, even when nobody (else) is looking.

But for LLMs, or other similarly alien AIs, I expect that using your usual patterns of thought for conscious things creates more problems than it helps with.

If one is a bit Platonist, then there's some hidden fact about whether they're "really conscious or not" no matter how murky the waters, and once this Hard problem is solved, deciding what to do is relatively easy.

But I prefer the alternative of ditching the question of consciousness entirely when it's not going to be useful, and deciding what's right to do about alien AIs more directly.

Comment by Charlie Steiner on Update on Developing an Ethics Calculator to Align an AGI to · 2024-03-13T22:25:12.324Z · LW · GW

Interesting stuff, but I felt like your code was just a bunch of hard-coded, suggestively-named variables with no pattern-matching to actually glue those variables to reality. I'm pessimistic about the applicability - better to spend time thinking about how to get an AI to do this reasoning in a way that's connected to reality from the get-go.

Comment by Charlie Steiner on Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems · 2024-03-13T20:32:39.511Z · LW · GW

Exciting stuff, thanks!

It's a little surprising to me how bad the logit lens is for earlier layers.

Comment by Charlie Steiner on Deconstructing Bostrom's Classic Argument for AI Doom · 2024-03-12T08:02:18.226Z · LW · GW

I was curious about the context and so I went over and ctrl+F'ed Solomonoff and found Evan saying

I think you're misunderstanding the nature of my objection. It's not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it's that the reasoning in this post is mathematically unsound, and I'm using the formalism to show why. If I weren't responding to this post specifically, I probably wouldn't have brought up Solomonoff induction at all.

Comment by Charlie Steiner on Deconstructing Bostrom's Classic Argument for AI Doom · 2024-03-11T19:06:55.740Z · LW · GW

Thank you for posting this, and it was interesting. Also, I think the middle section is bad.

Basically from the point where Lance takes a digression out of an anthropomorphic argument to castigate those who think AI might do bad things for anthropomorphising, through the end of the discussion of Solomonoff induction, I think there was a lot of misconstruing of ideas or arguing against nonexistent people.

Like, I personally don't agree with people who expect optimization daemons to arise in gradient descent, but I don't say they're motivated by whether the Solomonoff prior is malign.

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2024-03-09T15:59:26.236Z · LW · GW

Oh, maybe I've jumped the gun then. Whoops.

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2024-03-09T13:29:05.756Z · LW · GW

Congrats to Paul on getting appointed to NIST AI safety.

Comment by Charlie Steiner on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-08T07:42:25.637Z · LW · GW

At a high level, you don't get to pick the ontology

This post seems like a case of there being too many natural abstractions.

Comment by Charlie Steiner on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-08T07:40:45.674Z · LW · GW

Had a chat with @Logan Riggs about this. My main takeaway was that if SAEs aren't learning the features for separate squares, it's plausibly because in the data distribution there's some even-more-sparse pattern going on that they can exploit. E.g. if big runs of same-color stones show up regularly, it might be lower-loss to represent runs directly than to represent them as made up of separate squares.

If this is the bulk of the story, then messing around with training might not change much (but training on different data might change a lot).

Comment by Charlie Steiner on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-07T13:55:39.213Z · LW · GW

Nice! This was a very useful question to ask.

Comment by Charlie Steiner on Protocol evaluations: good analogies vs control · 2024-03-06T13:47:33.213Z · LW · GW

Yeah, I don't know where my reading comprehension skills were that evening, but they weren't with me :P

Oh well, I'll just leave it as is as a monument to bad comments.

Comment by Charlie Steiner on Many arguments for AI x-risk are wrong · 2024-03-06T10:37:25.389Z · LW · GW

offline RL is surprisingly stable and robust to reward misspecification

Wow, what a wild paper. The basic idea - that "pessimism" about off-distribution state/action pairs induces pessimistically-trained RL agents to learn policies that hang around in the training distribution for a long time, even if that goes against their reward function - is a fairly obvious one. But what's not obvious is the wide variety of algorithms this applies to.

I genuinely don't believe their decision transformer results. I.e. I think with p~0.8, if they (or the authors of the paper whose hyperparameters they copied) made better design choices, they would have gotten a decision transformer that was actually sensitive to reward. But on the flip side, with p~0.2 they just showed that decision transformers don't work! (For these tasks.)

Comment by Charlie Steiner on Some costs of superposition · 2024-03-06T09:32:34.782Z · LW · GW

I think it's pretty tricky, because what matters to real networks is the cost difference between storing features pseudo-linearly (in superposition), versus storing them nonlinearly (in one of the host of ways it takes multiple nn layers to decode), versus not storing them at all. Calculating such a cost function seems like it has details that depend on the particulars of the network and training process, making it a total pain to try to mathematize (but maybe amenable to making toy models).

Comment by Charlie Steiner on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T21:49:36.833Z · LW · GW

Does it know today's date through API call? That's definitely a smoking gun.

Comment by Charlie Steiner on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T02:48:13.873Z · LW · GW

Oh, missed that part.

Comment by Charlie Steiner on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T01:24:07.781Z · LW · GW

The idea that it's usually monitored is in my prompt; everything else seems like a pretty convergent and consistent character.

It seems likely that there's a pre-prompt from Anthropic with the gist of "This is a conversation between a user and Claude 3, an AI developed by Anthropic. Text between the <start ai> and <end ai> tokens was written by the AI, and text between the <start user> and <end user> tokens was written by the human user."

(edited to not say Anthropic is Google)

Comment by Charlie Steiner on Some costs of superposition · 2024-03-03T19:25:37.033Z · LW · GW

Neat, thanks. Later I might want to rederive the estimates using different assumptions - not only should the number of active features L be used in calculating average 'noise' level (basically treating it as an environment parameter rather than a design decision), but we might want another free parameter for how statistically dependent features are. If I really feel energetic I might try to treat the per-layer information loss all at once rather than bounding it above as the sum of information losses of individual features.

Comment by Charlie Steiner on Don't Endorse the Idea of Market Failure · 2024-03-03T13:30:51.131Z · LW · GW

My guess is this is a defense of someone being mocked on twitter, and so we aren't really getting (or caring about) the context.

Comment by Charlie Steiner on Common Philosophical Mistakes, according to Joe Schmid [videos] · 2024-03-03T13:19:14.682Z · LW · GW

I watched half of part 4, and called it quits. I think I'd have to be more into philosophy's particular argumentation game, and also into philosophy of religion, and also into dunking on people on facebook.

Comment by Charlie Steiner on Common Philosophical Mistakes, according to Joe Schmid [videos] · 2024-03-03T13:11:02.554Z · LW · GW

Yeah, "graduate student" can mean either Masters or PhD student.

Comment by Charlie Steiner on Counting arguments provide no evidence for AI doom · 2024-02-28T09:17:37.153Z · LW · GW

Model-based RL has a lot of room to use models more cleverly, e.g. learning hierarchical planning, and the better models are for planning, the more rewarding it is to let model-based planning take the policy far away from the prior.

E.g. you could get a hospital policy-maker that actually will do radical new things via model-based reasoning, rather than just breaking down when you try to push it too far from the training distribution (as you correctly point out a filtered LLM would).

In some sense the policy would still be close to the prior in a distance metric induced by the model-based planning procedure itself, but I think at that point the distance metric has come unmoored from the practical difference to humans.

Comment by Charlie Steiner on Counting arguments provide no evidence for AI doom · 2024-02-28T05:56:17.895Z · LW · GW

I feel like there's a somewhat common argument that RL isn't all that dangerous because it generalizes from the training distribution cautiously - being outside the training distribution isn't going to suddenly cause an RL system to make multi-step plans that are implied but never seen in the training distribution; it'll probably just fall back on familiar, safe behavior.

To me, these arguments feel like they treat present-day model-free RL as the "central case," and model-based RL as a small correction.

Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.

Comment by Charlie Steiner on Counting arguments provide no evidence for AI doom · 2024-02-28T05:47:18.569Z · LW · GW

Comment by Charlie Steiner on Do sparse autoencoders find "true features"? · 2024-02-23T21:04:05.860Z · LW · GW

Hm. Okay, I remembered a better way to improve efficiency: neighbor lists. For each feature, remember a list of who its closest neighbors are, and just compute your "closeness loss" by calculating dot products in that list.

The neighbor list itself can either be recomputed once in a while using the naive method, or you can accelerate the neighbor list recomputation by keeping more coarse-grained track of where features are in activation-space.
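
A minimal sketch of that (unit-normalized feature vectors, cosine similarity via dot products, k nearest neighbors, and a squared-dot-product penalty are all assumptions I'm making for illustration):

import torch

def recompute_neighbor_list(features: torch.Tensor, k: int = 10) -> torch.Tensor:
    # features: [n, d], rows assumed unit-normalized. Naive O(n^2) pass, done only occasionally.
    sims = features @ features.T
    sims.fill_diagonal_(-float("inf"))           # exclude each feature from its own neighbor list
    return sims.topk(k, dim=-1).indices          # [n, k] indices of nearest neighbors

def closeness_loss(features: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    # Only penalize dot products against the cached neighbors: O(n*k) per training step.
    sims = (features.unsqueeze(1) * features[neighbors]).sum(-1)   # [n, k]
    return (sims ** 2).mean()

Here recompute_neighbor_list would run every few hundred steps and closeness_loss every step.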

Comment by Charlie Steiner on Do sparse autoencoders find "true features"? · 2024-02-22T22:55:45.616Z · LW · GW

Quadratic complexity isn't that bad, if this is useful. If your feature vectors are normalized you can do it faster by taking a matrix product of the weights "the big way" and just penalizing the trace for being far from ones or zeros. I think?

Feature vector normalization is itself an example of a quadratic thing that makes it in.
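
Spelling out my reading of the "big way" version as a sketch - one big matrix product, then penalizing the entries for being far from the ones-and-zeros pattern of the identity (this interpretation and the squared penalty are assumptions):

import torch

def gram_closeness_loss(features: torch.Tensor) -> torch.Tensor:
    # features: [n, d], rows assumed unit-normalized.
    # One big matmul gives every pairwise dot product; push off-diagonal entries toward zero
    # (the diagonal is already ones if the rows are normalized).
    gram = features @ features.T
    off_diag = gram - torch.eye(features.shape[0], device=features.device)
    return (off_diag ** 2).mean()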

Comment by Charlie Steiner on Protocol evaluations: good analogies vs control · 2024-02-21T15:28:34.560Z · LW · GW

I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"

One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.

But if you want the best evaluation of a research assistant's capabilities, I agree that using it as a research assistant is more reliable.

A separate issue I have here is the assumption that you don't have to worry about teaching an AI to make human-friendly decisions if you're using it as a research assistant, and therefore we can go full speed ahead trying to make general-purpose AI as long as we mean to use it as a research assistant. A big "trust us, we're the good guys" vibe.

Relative to string theory, getting an AI to help us do AI alignment is much more reliant on teaching the AI to give good suggestions in the first place - and not merely "good" in the sense of highly rated, but good in the contains-hard-parts-of-outer-alignment kinda way. So I disagree with the assumption in the first place.

And then I also disagree with the conclusion. Technology proliferates, and there are misuse opportunities even within an organization that's 99% "good guys." But maybe this is a strategic disagreement more than a factual one.

Comment by Charlie Steiner on Protocol evaluations: good analogies vs control · 2024-02-21T15:04:20.926Z · LW · GW

Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening until some do slip through. I think I take them more seriously than you.

Comment by Charlie Steiner on Inducing human-like biases in moral reasoning LMs · 2024-02-21T04:30:22.451Z · LW · GW

This was a cool, ambitious idea. I'm still confused about your brain score results. Why did the "none" fine-tuned models have good results? Were none of your models successful at learning the brain data?

Comment by Charlie Steiner on Opinions survey 2 (with rationalism score at the end) · 2024-02-17T19:15:44.674Z · LW · GW

I got 7/18.

Comment by Charlie Steiner on Physics-based early warning signal shows that AMOC is on tipping course · 2024-02-17T05:23:43.959Z · LW · GW

See the discussion section.

We have developed a physics-based, and observable early warning signal characterizing the tipping point of the AMOC: the minimum of the AMOC-induced freshwater transport at 34°S in the Atlantic, here indicated by FovS. The FovS minimum occurs 25 years (9 to 41, 10 and 90% percentiles) before the AMOC tipping event. The quantity FovS has a strong basis in conceptual models, where it is an indicator of the salt-advection feedback strength. Although FovS has been shown to be a useful measure of AMOC stability in GCMs, the minimum feature has so far not been connected to the tipping point because an AMOC tipping event had up to now not been found in these models. The FovS indicator is observable, and reanalysis products show that its value and, more importantly, its trend are negative at the moment. The latest CMIP6 model simulations indicate that FovS is projected to decrease under future climate change. However, because of freshwater biases, the CMIP6 FovS mean starts at positive values and only reaches zero around the year 2075. Hence, no salt-advection feedback–induced tipping is found yet in these models under climate change scenarios up to 2100 and longer simulations under stronger forcing would be needed (as we do here for the CESM) to find this. In observations, the estimated mean value of FovS is already quite negative, and therefore, any further decrease is in the direction of a tipping point (and a stronger salt-advection feedback). A slowdown in the FovS decline indicates that the AMOC tipping point is near.

Model year 1750 does not mean 1750 years from now. The model is subtly different from reality in several ways. Their point is they found some indicator (this FovS thing) that hits a minimum a few decades before the big change, in a way that maybe generalizes from the model to reality.

In the model, this indicator starts at 0.20, slowly decreases, and hits a minimum at -0.14 of whatever units, ~25 years before the AMOC tipping point.

In reality, this indicator was already at -0.5, and is now somewhere around -0.1 or -0.15. 

This is a bit concerning, although to reiterate, the model is subtly different from reality in several ways. Exact numerical values don't generalize that well, it's the more qualitative thing - the minimum of their indicator - that has a better chance of warning us, and we have not (as far as we can tell) hit a minimum. Yet.

Comment by Charlie Steiner on Phallocentricity in GPT-J's bizarre stratified ontology · 2024-02-17T02:26:49.191Z · LW · GW

There's a huge amount of room for you to find whatever patterns are most eye-catching to you, here.

I was sampling random embeddings at various distances from the centroid and prompting GPT-J to define them. One of these random embeddings, sampled at distance 5, produced the definition [...]

How many random embeddings did you try sampling, that weren't titillating? Suppose you kept looking until you found mentions of female sexuality again - would this also sometimes talk about holes, or would it instead sometimes talk about something totally different?

Comment by Charlie Steiner on Requirements for a Basin of Attraction to Alignment · 2024-02-15T03:02:16.955Z · LW · GW

How would the AI do something like this if it ditched the idea that there existed some perfect U*?

Assuming the existence of things that turn out not to exist does weird things to a decision-making process. In extreme cases, it starts "believing in magic" and throwing away all hope of good outcomes in the real world in exchange for the tiniest advantage in the case that magic exists.