Sparse Autoencoders Find Highly Interpretable Directions in Language Models 2023-09-21T15:30:24.432Z
AutoInterpretation Finds Sparse Coding Beats Alternatives 2023-07-17T01:41:17.397Z
[Replication] Conjecture's Sparse Coding in Small Transformers 2023-06-16T18:02:34.874Z
[Replication] Conjecture's Sparse Coding in Toy Models 2023-06-02T17:34:24.928Z
Universality and Hidden Information in Concept Bottleneck Models 2023-04-05T14:00:35.529Z
Nokens: A potential method of investigating glitch tokens 2023-03-15T16:23:38.079Z
Automating Consistency 2023-02-17T13:24:22.684Z
Distilled Representations Research Agenda 2022-10-18T20:59:20.055Z
Remaking EfficientZero (as best I can) 2022-07-04T11:03:53.281Z
Note-Taking without Hidden Messages 2022-04-30T11:15:00.234Z
ELK Sub - Note-taking in internal rollouts 2022-03-09T17:23:19.353Z
Automated Fact Checking: A Look at the Field 2021-10-06T23:52:53.577Z
Hoagy's Shortform 2020-09-21T22:00:43.682Z
Safe Scrambling? 2020-08-29T14:31:27.169Z
When do utility functions constrain? 2019-08-23T17:19:06.414Z


Comment by Hoagy on Sparse Autoencoders Find Highly Interpretable Directions in Language Models · 2023-09-22T15:01:36.183Z · LW · GW

Hi Charlie, yep it's in the paper - but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used are somewhat limited - still on the order of a million examples but 10-50x less than went into the autoencoders.

It's not clear whether the extra data would provide much signal since it can't learn an overcomplete basis and so has no way of learning rare features but it might be able to outperform our ICA baseline presented here, so if you wanted to give someone a project of making that available, I'd be interested to see it!

Comment by Hoagy on Influence functions - why, what and how · 2023-09-17T00:01:42.540Z · LW · GW

How do you know?

Comment by Hoagy on AI Probability Trees - Katja Grace · 2023-08-24T23:23:39.117Z · LW · GW

seems like it'd be better formatted as a nested list given the volume of text

Comment by Hoagy on Which possible AI systems are relatively safe? · 2023-08-23T20:20:10.255Z · LW · GW

Why would we expect the expected level of danger from a model of a certain size to rise as the set of potential solutions grows?

Comment by Hoagy on Why Is No One Trying To Align Profit Incentives With Alignment Research? · 2023-08-23T19:11:11.346Z · LW · GW

I think both Leap Labs and Apollo Research (both fairly new orgs) are trying to position themselves as offering model auditing services in the way you suggest.

Comment by Hoagy on The "public debate" about AI is confusing for the general public and for policymakers because it is a three-sided debate · 2023-08-01T01:57:52.603Z · LW · GW

A useful model for why it's both appealing and difficult to say 'Doomers and Realists are both against dangerous AI and for safety - let's work together!'.

Comment by Hoagy on Open problems in activation engineering · 2023-07-28T03:08:58.011Z · LW · GW

Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?

It's not PCA but we've been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).

We've found that they're on average more interpretable than neurons and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation patching, e.g. using a "this direction activates on curse words" direction to make text more aggressive. If people are interested in exploring this further let me know, say hi at our EleutherAI channel or check out the repo :)
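For anyone wanting to try the PCA variant suggested in the quoted open problem, here's a minimal numpy sketch on stand-in activations (the batch size, `d_model`, and steering coefficient are all made up for illustration, not taken from our actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))   # hypothetical residual-stream activations (batch, d_model)

centered = acts - acts.mean(axis=0)  # PCA requires mean-centred data
_, _, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]                    # top principal direction, unit norm

steered = acts + 5.0 * direction     # add the direction (scaled) to every activation
```

Sparse coding replaces the orthogonal SVD basis with an overcomplete learned dictionary, which is what lets it pick up rare features that PCA's fixed rank cannot.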

Comment by Hoagy on Neuronpedia - AI Safety Game · 2023-07-27T17:25:31.442Z · LW · GW

Hi, nice work! You mentioned the possibility of neurons being the wrong unit. I think that this is the case and that our current best guess for the right unit is directions in the output space, ie linear combinations of neurons.

We've done some work using dictionary learning to find these directions (see original post, recent results) and find that with sparse coding we can find dictionaries of features that are more interpretable than the neuron basis (though they don't explain 100% of the variance).

We'd be really interested to see how this compares to neurons in a test like this and could get a sparse-coded breakdown of gpt2-small layer 6 if you're interested.

Comment by Hoagy on 10-day Critique-a-Thon · 2023-07-27T16:56:11.943Z · LW · GW

Link at the top doesn't work for me

Comment by Hoagy on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2023-07-24T21:12:05.518Z · LW · GW

I still don't quite see the connection - if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra?

Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate between?) different kinds of behaviour?

Comment by Hoagy on Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs · 2023-07-22T21:42:47.043Z · LW · GW

For the avoidance of doubt, this accounting should recursively aggregate transitive inputs.

What does this mean?

Comment by Hoagy on Hedonic Loops and Taming RL · 2023-07-19T19:06:46.208Z · LW · GW

Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can't change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.

I don't understand why standard RL algorithms in the basal ganglia wouldn't work. Like, most RL problems have elements that can be viewed as homeostatic - if you're playing boxcart then you need to go left/right depending on position. Why can't that generalise to seeking food iff stomach is empty? Optimizing for a specific reward function doesn't seem to preclude that function itself being a function of other things (which just makes it a more complex function).

What am I missing?

Comment by Hoagy on Aligning AI by optimizing for "wisdom" · 2023-06-28T14:32:27.243Z · LW · GW

On first glance I thought this was too abstract to be a useful plan but coming back to it I think this is promising as a form of automated training for an aligned agent, given that you have an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You have training loops using synthetic data which can train for all of these forms of consistency, probably implementable in an MVP with current systems.

The main unknown would be detecting when you feel confident enough in the alignment of its stated values to human values to start moving down the causal chain towards fitting actions to values, as this is clearly a strongly capabilities-enhancing process.

Perhaps you could at least get a measure by looking at comparisons which require multiple steps, of human value -> value -> belief etc, and then asking which is the bottleneck to coming to the conclusion that humans would want. Positing that the agent is capable of this might be assuming away a lot of the problem though.

Comment by Hoagy on Steering GPT-2-XL by adding an activation vector · 2023-05-22T18:54:27.327Z · LW · GW

Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?

In particular, I'm surprised by the method of adding the activations that was chosen, because the tokens of the different prompts don't line up with each other in a way that I would have thought would be necessary for this approach to work. It's super interesting to me that it does.

If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:

  • Take multiple pairs of prompts that differ primarily in the property we're trying to capture.
  • Take the difference in the residual stream at the next token.
  • Take the average difference vector, and add that to every position in the new generated text.
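The three steps above could be sketched like this (illustrative numpy on random data; the prompt-pair activations and dimensions are stand-ins, not results from the actual model):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_pairs = 16, 8

# hypothetical residual-stream states at the final token of each prompt in a pair,
# e.g. "I love ..." vs "I hate ..." completions
pos = rng.normal(size=(n_pairs, d_model))
neg = rng.normal(size=(n_pairs, d_model))

steer = (pos - neg).mean(axis=0)       # average difference vector

seq = rng.normal(size=(20, d_model))   # activations for newly generated text
steered = seq + steer                  # broadcast: added at every position
```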

I'd love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.

Comment by Hoagy on Monthly Roundup #6: May 2023 · 2023-05-03T15:38:00.869Z · LW · GW

eedly -> feedly

Comment by Hoagy on Contra Yudkowsky on Doom from Foom #2 · 2023-04-28T12:36:16.278Z · LW · GW

Yeah I agree it's not in human brains, not really disagreeing with the bulk of the argument re brains but just about whether it does much to reduce foom %. Maybe it constrains the ultra fast scenarios a bit but not much more imo.

"Small" (ie << 6 OOM) jump in underlying brain function from current paradigm AI -> Gigantic shift in tech frontier rate of change -> Exotic tech becomes quickly reachable -> YudFoom

Comment by Hoagy on Contra Yudkowsky on Doom from Foom #2 · 2023-04-27T10:22:37.307Z · LW · GW

The key thing I disagree with is:

In some sense the Foom already occurred - it was us. But it wasn't the result of any new feature in the brain - our brains are just standard primate brains, scaled up a bit[14] and trained for longer. Human intelligence is the result of a complex one time meta-systems transition: brains networking together and organizing into families, tribes, nations, and civilizations through language. ... That transition only happens once - there are not ever more and more levels of universality or linguistic programmability. AGI does not FOOM again in the same way.

Although I agree that the 'meta-systems transition' is a super important shift, which can lead us to overestimate the level of difference between us and previous apes, it also doesn't seem like it was just a one time shift. We had fire, stone tools and probably language for literally millions of years before the Neolithic revolution. For the industrial revolution it seems that a few bits of cognitive technology (not even genes, just memes!) in renaissance Europe sent the world suddenly off on a whole new exponential.

The lesson, for me, is that the capability level of the meta-system/technology frontier is a very sensitive function of the kind of intelligences which are operating, and we therefore shouldn't feel at all confident generalising out of distribution. Then, once we start to incorporate feedback loops from the technology frontier back into the underlying intelligences which are developing that technology, all modelling goes out the window.

From a technical modelling perspective, I understand that the Roodman model that you reference below (hard singularity at median 2047) has both hyperbolic growth and random shocks, and so even within that model, we shouldn't be too surprised to see a sudden shift in gears and a much sooner singularity, even without accounting for RSI taking us somehow off-script.

Comment by Hoagy on The Brain is Not Close to Thermodynamic Limits on Computation · 2023-04-26T16:45:55.681Z · LW · GW

disingenuous is probably the intended word

Comment by Hoagy on An open letter to SERI MATS program organisers · 2023-04-20T17:35:07.950Z · LW · GW

I think strategically, only automated and black-box approaches to interpretability make practical sense to develop now.

Just on this, I (not part of SERI MATS but working from their office) had a go at a basic 'make ChatGPT interpret this neuron' system for the interpretability hackathon over the weekend. (GitHub)

While it's fun, and managed to find meaningful correlations for 1-2 neurons / 50, the strongest takeaway for me was the inadequacy of the paradigm 'what concept does neuron X correspond to'. It's clear (no surprise, but I'd never had it shoved in my face) that we need a lot of improved theory before we can automate. Maybe AI will automate that theoretical progress but it feels harder, and further from automation, than learning how to handoff solidly paradigmatic interpretability approaches to AI. ManualMechInterp combined with mathematical theory and toy examples seems like the right mix of strategies to me, tho ManualMechInterp shouldn't be the largest component imo.

FWIW, I agree with learning history/philosophy of science as a good source of models and healthy experimental thought patterns. I was recommended Hasok Chang's books (Inventing Temperature, Is Water H20) by folks at Conjecture and I'd heartily recommend them in turn. 

I know the SERI MATS technical lead @Joe_Collman spends a lot of his time thinking about how they can improve feedback loops, he might be interested in a chat.

You also might be interested in Mike Webb's project to set up programs to pass quality decision-making from top researchers to students, being tested on SERI MATS people at the moment.

Comment by Hoagy on Alignment of AutoGPT agents · 2023-04-12T16:22:43.183Z · LW · GW

Agree that it's super important, would be better if these things didn't exist but since they do and are probably here to stay, working out how to leverage their own capability to stay aligned rather than failing to even try seems better (and if anyone will attempt a pivotal act I imagine it will be with systems such as these).

Only downside I suppose is that these things seem quite likely to cause an impactful but not fatal warning shot which could be net positive, v unsure how to evaluate this consideration.

Comment by Hoagy on Steelman / Ideological Turing Test of Yann LeCun's AI X-Risk argument? · 2023-04-05T12:42:39.454Z · LW · GW

I've not noticed this but it'd be interesting if true as it seems that the tuning/RLHF has managed to remove most of the behaviour where it talks down to the level of the person writing as evidenced by e.g. spelling mistakes. Should be easily testable too.

Comment by Hoagy on The 0.2 OOMs/year target · 2023-03-31T13:23:51.520Z · LW · GW

Moore's law is a doubling every 2 years, while this proposes doubling every 18 months, so pretty much what you suggest (not sure if you were disagreeing tbh but seemed like you might be?)

Comment by Hoagy on The 0.2 OOMs/year target · 2023-03-30T19:03:43.642Z · LW · GW

0.2 OOMs/year is equivalent to a doubling time of 8 months.

I think this is wrong: a doubling time of 8 months would mean nearly 8 doublings in 5 years, i.e. well over two OOMs. 0.2 OOMs/year is one OOM every 5 years, so the doubling time should be 5 / log2(10) ≈ 1.5 years.
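Spelled out numerically (a pure-Python check of the arithmetic):

```python
import math

ooms_per_year = 0.2
years_per_oom = 1 / ooms_per_year           # 5 years per 10x
doublings_per_oom = math.log2(10)           # ≈ 3.32 doublings per OOM
doubling_time = years_per_oom / doublings_per_oom   # ≈ 1.5 years, not 8 months

doublings_if_8_months = 5 * 12 / 8          # 7.5 doublings in 5 years, i.e. >2 OOMs
```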

I think pushing GPT-4 out to 2029 would be a good level of slowdown from 2022, but assuming that we could achieve that level of impact, what's the case for having a fixed exponential increase? Is it to let of some level of 'steam' in the AI industry? So that we can still get AGI in our lifetimes? To make it seem more reasonable to policymakers?

Personally, I would still rather have a moratorium until some measure of progress in understanding. We don't have a fixed temperature increase per decade built into our climate targets.

Comment by Hoagy on Imitation Learning from Language Feedback · 2023-03-30T15:36:45.243Z · LW · GW


  • Seems like useful work.
  • With RLHF I understand that when you push super hard for high reward you end up with nonsense results so you have to settle for quantilization or some such relaxation of maximization. Do you find similar things for 'best incorporates the feedback'?
  • Have we really pushed the boundaries of what language models giving themselves feedback are capable of? I'd expect SotA systems are sufficiently good at giving feedback that they'd be capable of performing all steps in these algorithms, including the human feedback, especially for the easier categories of feedback, opening up the possibility of unlimited cheap finetuning. Nonetheless I don't think we've reached the point of reflexive endorsement that I'd expect to result from this process (GPT-4 still produces harmful/hallucinated completions that I expect it would be able to recognise). I expect it must be one of:
    • It in fact is at reflexive equilibrium / it wouldn't actually recognise these failures
    • OAI haven't tried pushing it to the limit
    • This process doesn't actually result in reflexive endorsement, probably because it only reaches RE within a narrow distribution in which this training is occurring.
    • OAI stop before this point for other reasons, most likely degradation of performance.
    • Not sure which of these is true though?
  • Though I expect the core algorithm to be helpful, because we're stuck with RLHF-type work at the moment, having a paper focused on accurate code generation seems to push the dangerous side of a dual-use capability to the fore.

Comment by Hoagy on Nobody’s on the ball on AGI alignment · 2023-03-30T13:17:13.844Z · LW · GW

OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.

This may well be true - but it's hard to be a researcher focusing on this problem directly unless you have access to the ability to train near-cutting edge models. Otherwise you're going to have to work on toy models, theory, or a totally different angle.

I've personally applied for the DeepMind scalable alignment team - they had a fixed, small available headcount which they filled with other people who I'm sure were better choices - but becoming a better fit for those roles is tricky, unless by just doing mostly unrelated research.

Do you have a list of ideas for research that you think is promising and possible without already being inside an org with big models?

Comment by Hoagy on Please help me sense-check my assumptions about the needs of the AI Safety community and related career plans · 2023-03-27T10:14:40.157Z · LW · GW

Your first link is broken :)

My feeling with the posts is that given the diversity of situations for people who are currently AI safety researchers, there's not likely to be a particular key set of understandings such that a person could walk into the community as a whole and know where they can be helpful. This would be great but being seriously helpful as a new person without much experience or context is just super hard. It's going to be more like here are the groups and organizations which are doing good work, what roles or other things do they need now, and what would help them scale up their ability to produce useful work.

Not sure this is really a disagreement though! I guess I don't really know what role 'the movement' is playing, outside of specific orgs, other than that it focusses on people who are fairly unattached, because I expect most useful things, especially at the meta level, to be done by groups of some size. I don't have time right now to engage with the post series more fully, so this is just a quick response, sorry!

there is uncertainty -> we need shared understanding -> we need shared language vs there is uncertainty -> what are organizations doing to bring people from individuals with potential together into productive groups making progress -> what are their bottlenecks to scaling up?

Comment by Hoagy on The algorithm isn't doing X, it's just doing Y. · 2023-03-17T18:03:42.999Z · LW · GW

Hmm, yeah there's clearly two major points:

  1. The philosophical leap from voltages to matrices, i.e. allowing that a physical system could ever be 'doing' high level description X. This is a bit weird at first but also clearly true as soon you start treating X as having a specific meaning in the world as opposed to just being a thing that occurs in human mind space.
  2. The empirical claim that this high level description X fits what the computer is doing.

I think the pushback to the post is best framed in terms of which frame is best for talking to people who deny that it's 'really doing X'. In terms of rhetorical strategy and good quality debate, I think the correct tactic is to try and have the first point mutually acknowledged in the most sympathetic case, and try to have a more productive conversation about the extent of the correlation, while I think aggressive statements of 'it's always actually doing X if it looks like its doing X' are probably unhelpful and become a bit of a scissor. (memetics over usefulness har har!)

Comment by Hoagy on The algorithm isn't doing X, it's just doing Y. · 2023-03-17T17:27:16.153Z · LW · GW

Maybe worth thinking about this in terms of different examples:

  • NN detecting the presence of tanks just by the brightness of the image (possibly apocryphal - Gwern)
  • NN recognising dogs vs cats as part of an image net classifier that would class a piece of paper with 'dog' written on as a dog
  • GPT-4 able to describe an image of a dog/cat in great detail
  • Computer doing matrix multiplication.

The range of cases in which the equivalence between what the computer is doing and our high-level description holds increases as we go down this list, and depending on which cases are salient, it becomes more or less explanatory to say that the algorithm is doing task X.

Comment by Hoagy on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-12T13:50:21.732Z · LW · GW

Put an oak tree in a box with a lever that dispenses water, and it won't pull the lever when it's thirsty

I actually thought this was a super interesting question, just for general world modelling. The tree won't pull a lever because it barely has the capability to do so and no prior that it might work, but it could, like, control a water dispenser via sap distribution to a particular branch. In that case will the tree learn to use it?

Ended up finding an article on attempts to show learned behavioural responses to stimuli in plants at On the Conditioning of Plants: A Review of Experimental Evidence - turns out there have been some positive results but they seem not to have replicated, as well as lots of negative results, so my guess is that no, even if they are given direct control, the tree won't control its own water supply. More generally this would agree that plants lack the information processing systems to coherently use their tools.

Experiments are mostly done with M. pudica because it shows (fairly) rapid movement to close up its leaves when shaken.

Comment by Hoagy on [Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy · 2023-03-07T12:01:34.865Z · LW · GW

Could you explain why you think "The game is skewed in our favour."?

Comment by Hoagy on Scoring forecasts from the 2016 “Expert Survey on Progress in AI” · 2023-03-01T17:55:54.081Z · LW · GW

High marks for a high school essay

Is this not true? Seems Bing has been getting mid-level grades for some undergraduate courses, and anecdotally high school teachers have been seeing too-good-to-be-true work from some of their students using ChatGPT

Comment by Hoagy on AGI in sight: our look at the game board · 2023-02-22T14:27:55.579Z · LW · GW

Agree that the cited links don't represent a strong criticism of RLHF but I think there's an interesting implied criticism, between the mode-collapse post and janus' other writings on cyborgism etc that I haven't seen spelled out, though it may well be somewhere.

I see janus as saying that if you know how to properly use the raw models, then you can actually get much more useful work out of the raw models than the RLHF'd ones. If true, we're paying a significant alignment tax with RLHF that will only become clear with the improvement and take-up of wrappers around base models in the vein of Loom.

I guess the test (best done without too much fanfare) would be to get a few people well acquainted with Loom or whichever wrapper tool and identify a few complex tasks and see whether the base model or the RLHF model performs better.

Even if true though, I don't think it's really a mark against RLHF since it's still likely that RLHF makes outputs safer for the vast majority of users, just that if we think we're in an ideas arms-race with people trying to advance capabilities, we can't expect everyone to be using RLHF'd models.

Comment by Hoagy on Basic facts about language models during training · 2023-02-21T13:38:47.968Z · LW · GW

I commented on the last post but the comment disappeared.

I understand that these are working with public checkpoints but I'd be interested if you have internal models to see similar statistics for the size of weight updates, both across the training run, and within short periods, to see if there are correlations between which weights are updated. Do you get quite consistent, smooth updates, or can you find little clusters where connected weights all change substantially in just a few steps?

If there are moments of large updates it'd be interesting if you could look for what has changed (find sequences by maximising product of difference in likelihood between the two models and likelihood of the sequence as determined by final model?? anyway..)

Also I think the axes in the first graphs of 'power law weight spectra..' are mislabelled, should be rank/singular value?

Comment by Hoagy on [deleted post] 2023-02-21T12:22:06.664Z

Nice, seems very healthy to have this info even if nothing crazy comes out of it.

Do you also have data on the distribution of the gradients? It'd be interesting from a mechanistic interpretability perspective if weight changes tended to be smooth or if clusters of weights changed a lot together at certain moments. Do we see a number of potential mini-grokking events and if so, can we zoom in on them, and what changes the model undergoes?

Also, I think the axes in 'Power law weight spectra..' are mislabelled, should it be y=singular value, x=rank, as in the previous post?

Comment by Hoagy on Anomalous tokens reveal the original identities of Instruct models · 2023-02-09T13:42:54.146Z · LW · GW

Interesting! I'm struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake 'you suddenly have huge power' situations which are quite common suggestions but v curious what you have in mind.

Also, I think it's worth saying that the strength of the result connecting babbage to text-davinci-001 is stronger than that connecting ada to text-ada-001 (by logprob), so it feels like one shouldn't count the first as a solid success.

I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?

Comment by Hoagy on SolidGoldMagikarp (plus, prompt generation) · 2023-02-07T18:44:19.557Z · LW · GW

I wanted to test out the prompt generation part of this so I made a version where you pick a particular input sequence and then only allow a certain fraction of the input tokens to change. I've been initialising it with a paragraph about COVID and testing how few tokens it needs to be able to change before it reliably outputs a particular output token.

Turns out it only needs a few tokens to fairly reliably force a single output, even within the context of a whole paragraph, eg "typical people infected Majesty the virus will experience mild to moderate 74 illness and recover without requiring special treatment. However, some will become seriously ill and require medical attention. Older people and those with underlying medical conditions like cardiovascular disease" has a >99.5% chance of ' 74' as the next token. Penalising repetition makes the task much harder.

It can even pretty reliably cause GPT-2 to output SolidGoldMagikarp with >99% probability by only changing 10% of the tokens, though it does this by just inserting SolidGoldMagikarp wherever possible. As far as I've seen playing around with it for an hour or so, if you penalise repeating the initial token then it never succeeds.

I don't think these attacks are at all new (see Universal Adversarial Triggers from 2019 and others) but it's certainly fun to test out.
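The real experiments run against GPT-2, but the core loop is just constrained greedy token substitution. A toy sketch (the vocabulary and scoring function are stand-ins for the model's tokenizer and target-token log-probability; only the search structure is meant to carry over):

```python
import random

VOCAB = list(range(100))  # toy token ids; real runs use the model's vocab

def target_logprob(tokens):
    # stand-in for the model's log-probability of the target next token;
    # here it just rewards occurrences of a "trigger" token (id 42)
    return sum(t == 42 for t in tokens)

def greedy_attack(tokens, budget, steps=50, seed=0):
    """Greedily swap at most `budget` positions to maximise the score."""
    rng = random.Random(seed)
    tokens = list(tokens)
    changed = set()
    for _ in range(steps):
        pos = rng.randrange(len(tokens))
        if pos not in changed and len(changed) >= budget:
            continue  # keep the fraction of edited tokens capped
        best = max(VOCAB, key=lambda v: target_logprob(tokens[:pos] + [v] + tokens[pos + 1:]))
        if target_logprob(tokens[:pos] + [best] + tokens[pos + 1:]) > target_logprob(tokens):
            tokens[pos] = best
            changed.add(pos)
    return tokens

adv = greedy_attack([0] * 20, budget=2)
```

With the real model the scoring step is a batched forward pass over candidate substitutions rather than this brute-force loop, and penalising repetition of the target token is what makes the task hard.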

This raises a number of questions:

  • How does this change when we scale up to GPT-3, and to ChatGPT - is this still possible after the mode collapse that comes with lots of fine tuning?
  • Can this be extended to getting whole new behaviours, as well as just next tokens? What about discouraged behaviour?
  • Since this is a way to mechanically generate non-robustness of outputs, can this be fed back in to training to make robust models - would sprinkling noise into the data prevent adversarial examples?

code here

Comment by Hoagy on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T19:29:48.140Z · LW · GW

Good find! Just spelling out the actual source of the dataset contamination for others since the other comments weren't clear to me:

r/counting is a subreddit in which people 'count to infinity by 1s', and the leaderboard for this shows the number of times they've 'counted' in this subreddit. These users have made 10s to 100s of thousands of reddit comments of just a number. See threads like this:

They'd be perfect candidates for exclusion from training data. I wonder how they'd feel to know they posted enough inane comments to cause bugs in LLMs.

Comment by Hoagy on Alexander and Yudkowsky on AGI goals · 2023-02-01T10:44:03.557Z · LW · GW

Ah interesting, - I'd not heard of ENCODE and wasn't trying to say that there's no such thing as DNA without function.

The way I remembered it was that 10% of DNA was coding, and then a sizeable proportion of the rest was promoters and introns and such, lots of which had fairly recently been reclaimed from 'junk' status. From that wiki, though, it seems that only 1-2% is actually coding.

In any case I'd overlooked the fact that even within genes there's not going to be sensitivity to every base pair.

I'd be super interested if there were any estimates of how many bits in the genome it would take to encode a bit of a neural wiring algorithm as expressed in minified code. I'd guess the DNA would be wildly inefficient and the size of neural wiring algos expressed in code would actually be much smaller than 7.5MB but then it's had a lot of time and pressure to maximise the information content so unsure.

Comment by Hoagy on New Hackathon: Robustness to distribution changes and ambiguity · 2023-01-31T17:53:58.513Z · LW · GW

A question about the rules:

  • Participants are not allowed to label images by hand.
  • Participants are not allowed to use other datasets. They are only allowed to use the datasets provided.
  • Participants are not allowed to use arbitrary pre-trained models. Only ImageNet pre-trained models are allowed.

What are the boundaries of classifying by hand? Say you have a pre-trained ImageNet model, and you go through the output classes or layer activations, manually selecting the activations you expect to be relevant for faces but not text, and then train a classifier based on these labels. Is this manual labelling?

Comment by Hoagy on New Hackathon: Robustness to distribution changes and ambiguity · 2023-01-31T16:10:20.055Z · LW · GW

Your challenge link has an extra "." at the end.

Comment by Hoagy on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2023-01-30T13:35:36.936Z · LW · GW

Not sure why you're being downvoted on an intro thread, though it would help if you were more concise.

S-risks in general have obviously been looked at as a possible worst-case outcome by theoretical alignment researchers going back to at least Bostrom, as I expect you've been reading and I would guess that most people here are aware of the possibility.

The scenarios you described I don't think are 'overlooked', because they fall into the general pattern of AI having huge power combined with moral systems we would find abhorrent, and most alignment work is ultimately intended to prevent this scenario. Lots of Eliezer's writing on why alignment is hard talks about somewhat similar cases where superficially reasonable rules lead to catastrophes.

I don't know if they're addressed specifically anywhere, as most alignment work is about how we might implement any ethics or robust ontologies rather than addressing specific potential failures. You could see this kind of work as implicit in RLHF though, where outputs like 'we should punish people in perfect retribution for intent, or literal interpretation of their words' would hopefully be trained out as incompatible with harmlessness.

Comment by Hoagy on Thoughts on the impact of RLHF research · 2023-01-27T16:59:21.691Z · LW · GW

I'd be interested to know how you estimate the numbers here, they seem quite inflated to me.

If 4 big tech companies were to invest $50B each in 2023 then, assuming an average salary of $300k and a 2:1 ratio of capital to salary, that investment would mean hiring about $50B/$900k ≈ 55,000 people to work on this stuff. For reference, the total headcount at these orgs is roughly 100-200K.
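The back-of-envelope arithmetic above can be written out explicitly. A minimal sketch; the $300k salary and 2:1 capital-to-salary ratio are my assumptions in the comment, not reported figures:

```python
# Back-of-envelope: how many people would $50B/yr per company actually fund?
# Assumptions (mine, not reported figures): $300k average salary, plus
# capital (compute etc.) at 2x salary, so ~$900k fully loaded per head.
investment_per_company = 50e9                       # hypothetical $50B/yr
salary = 300e3
capital_to_salary = 2
cost_per_head = salary * (1 + capital_to_salary)    # $900k per person

implied_headcount = investment_per_company / cost_per_head
print(f"Implied hires per company: {implied_headcount:,.0f}")  # ~55,556
```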

$50B/yr is also around 25-50% of total revenue for these companies, and greater than profits for most, which again seems high.

Perhaps my capital ratio is way too low but I would find it hard to believe that these companies can meaningfully put that level of capital into action so quickly. I would guess more on the order of $50B between the major companies in 2023.

Agree with paul's comment above that timeline shifts are the most important variable.

Comment by Hoagy on Alexander and Yudkowsky on AGI goals · 2023-01-27T16:26:27.958Z · LW · GW

Yeah agreed, this doesn't make sense to me.

There are probably just a few MB (wouldn't be surprised if it could be compressed into much less) of information which sets up the brain wiring. Somewhere within that information are the structures/biases that, when exposed to the training data of being a human in our world, give us our altruism (and much else). It's a hard problem to understand these altruism-forming structures (which are not likely to be distinct things), replicate them in silico and make them robust even to large power differentials.

On the other hand, the human brain presumably has lots of wiring that pushes it towards selfishness and agenthood which we can hopefully just not replicate.

Either way, it seems that they could in theory be instantiated by the right process of trial and error - the question being whether the error (or misuse) gets us first.

Eliezer expects selfishness not to require any wiring once you select for a certain level of capability, meaning there's no permanent buffer to be gained by not implementing selfishness. The margin for error in this model is thus small, and very hard for us to find without perfect understanding or some huge process.

I agree with this argument for some unknown threshold of capability but it seems strange to phrase it as impossibility unless you're certain that the threshold is low, and even then it's a big smuggled assumption.

EDIT: Looking back on this comment, I guess it comes down to the crux that systems powerful enough to be relevant to alignment, by virtue of their power or research capability, must be doing strong enough optimisation on some function that we should model them as agents acting to further that goal.

Comment by Hoagy on Alexander and Yudkowsky on AGI goals · 2023-01-26T11:38:47.518Z · LW · GW

10% of what's left, ie of the 75MB of non-junk DNA, so 7.5MB.

fwiw, 90% junk DNA seems unlikely; I thought much of it was found to influence gene expression. But then 10% being neural wiring seems high, so the two may roughly cancel out to about my own guess.

Comment by Hoagy on [RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision". · 2023-01-26T00:14:58.832Z · LW · GW

An LLM will presumably have some internal representation of the characteristics of the voice that it is speaking with, beyond its truth value. Perhaps you could test for such a representation in an unsupervised manner by asking it to complete a sentence with and without prompting for a particular disposition ('angry', 'understanding', ...). Once you learn to understand the effects that the prompting has, you could test how well this allows you to modify disposition via changing the activations.
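The idea above can be sketched as a mean-activation-difference ("steering direction") estimate. This is a toy illustration only: random vectors stand in for real residual-stream activations, and in practice `base_acts`/`prompted_acts` would come from forward-pass hooks on an actual LLM, with and without the disposition prefix in the prompt.

```python
import numpy as np

# Toy sketch: estimate a "disposition direction" as the mean difference
# between activations with and without a disposition prompt, then add it
# back to modify behaviour. Random vectors stand in for LLM activations.
rng = np.random.default_rng(0)
d = 64                               # hypothetical hidden size
true_direction = rng.normal(size=d)  # the (unknown) 'angry' direction

base_acts = rng.normal(size=(200, d))          # completions, neutral prompt
prompted_acts = base_acts + true_direction     # same prompts, 'angry' prefix

# The estimated direction is just the mean activation difference.
direction = (prompted_acts - base_acts).mean(axis=0)

# Steering: add the estimated direction to a neutral activation, which in a
# real model you would then decode and inspect for the target disposition.
steered = base_acts[0] + direction
```

In this toy setup the estimate recovers the true direction exactly; with a real model the prompt's effect is noisy and distributed across layers, so the interesting empirical question is how much of it a single linear direction captures.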

This line of thought came from imagining what the combination of CCS and Constitutional AI might look like.

Comment by Hoagy on Thoughts on the impact of RLHF research · 2023-01-25T22:56:57.302Z · LW · GW

Huh, I'd not heard that, would be interested in hearing more about the thought process behind its development.

I think they could well turn out to be correct, in that having systems with such a strong understanding of human concepts gives us levers we might not otherwise have had, though code-writing proficiency is a very unfortunate development.

Comment by Hoagy on Thoughts on the impact of RLHF research · 2023-01-25T22:44:33.237Z · LW · GW

GPT [was] developed as alignment strategy.

Really? How so?

Comment by Hoagy on AGI safety field building projects I’d like to see · 2023-01-21T13:09:58.236Z · LW · GW

Cheers Severin, yeah that's useful; I've not seen it (almost certainly my fault, I don't do enough to find out what's going on).

That Slack link doesn't work for me though, it just asks me to sign into one of my existing workspaces.

Comment by Hoagy on AGI safety field building projects I’d like to see · 2023-01-20T13:25:18.844Z · LW · GW

Proposal: If other people are doing independent research in London I'd be really interested in co-working and doing some regular feedback and updates. (Could be elsewhere but I find being in person important for me personally). If anyone would be interested reply here or message me and we'll see if we can set something up :)

General comment: This feels accurate to me. I've been working as an independent researcher for the last few months, after 9 months of pure skill building and have got close but not succeeded in getting jobs at the local research orgs in London (DeepMind, Conjecture).

It's a great way to build some skills, having to build your own stack, but it's also hard to build research skills without people with more experience giving feedback, and because iteration of ideas is slow, it's difficult to know whether to stick with something or try something else.

In particular it forces you to be super proactive if you want to get any feedback.

Comment by Hoagy on We don’t trade with ants · 2023-01-11T13:50:17.534Z · LW · GW

Putting the entire failure to trade on the ability to communicate seems to understate the issue. Most if not all of the things listed that they 'could' do are things they could theoretically do with their physical capacities, but not with their cognitive abilities or their ability to coordinate among themselves to accomplish a task.

In general, they aren't able to act with the level of intentionality required to be helpful to us except in cases where those things we want are almost exactly the things they have evolved to do (like bees making honey, as mentioned in another comment).

The 'failure to communicate' is therefore in fact a failure to be able to think and act at the required level of flexibility and abstraction, and that seems more likely to carry over to our relations with some theoretical, super advanced AI or civilisation.