Posts

Turning up the Heat on Deceptively-Misaligned AI 2025-01-07T00:13:28.191Z
Intranasal mRNA Vaccines? 2025-01-01T23:46:40.524Z
Linkpost: Look at the Water 2024-12-30T19:49:04.107Z
What is the alpha in one bit of evidence? 2024-10-22T21:57:09.056Z
The Best Lay Argument is not a Simple English Yud Essay 2024-09-10T17:34:28.422Z
Attention-Feature Tables in Gemma 2 Residual Streams 2024-08-06T22:56:40.828Z
Risk Overview of AI in Bio Research 2024-07-15T00:04:41.818Z
How to Better Report Sparse Autoencoder Performance 2024-06-02T19:34:22.803Z
To Limit Impact, Limit KL-Divergence 2024-05-18T18:52:39.081Z
Introducing Statistical Utility Mechanics: A Framework for Utility Maximizers 2024-05-15T21:56:48.950Z
Taming Infinity (Stat Mech Part 3) 2024-05-15T21:43:03.406Z
Conserved Quantities (Stat Mech Part 2) 2024-05-04T13:40:55.825Z
So What's Up With PUFAs Chemically? 2024-04-27T13:32:52.159Z
Forget Everything (Statistical Mechanics Part 1) 2024-04-22T13:33:35.446Z
Measuring Learned Optimization in Small Transformer Models 2024-04-08T14:41:27.669Z
Briefly Extending Differential Optimization to Distributions 2024-03-10T20:41:09.551Z
Finite Factored Sets to Bayes Nets Part 2 2024-02-03T12:25:41.444Z
From Finite Factors to Bayes Nets 2024-01-23T20:03:51.845Z
Differential Optimization Reframes and Generalizes Utility-Maximization 2023-12-27T01:54:22.731Z
Mathematically-Defined Optimization Captures A Lot of Useful Information 2023-10-29T17:17:03.211Z
Defining Optimization in a Deeper Way Part 4 2022-07-28T17:02:33.411Z
Defining Optimization in a Deeper Way Part 3 2022-07-20T22:06:48.323Z
Defining Optimization in a Deeper Way Part 2 2022-07-11T20:29:30.225Z
Defining Optimization in a Deeper Way Part 1 2022-07-01T14:03:18.945Z
Thinking about Broad Classes of Utility-like Functions 2022-06-07T14:05:51.807Z
The Halting Problem and the Impossible Photocopier 2022-03-31T18:19:20.292Z
Why Do I Think I Have Values? 2022-02-03T13:35:07.656Z
Knowledge Localization: Tentatively Positive Results on OCR 2022-01-30T11:57:19.151Z
Deconfusing Deception 2022-01-29T16:43:53.750Z
[Book Review]: The Bonobo and the Atheist by Frans De Waal 2022-01-05T22:29:32.699Z
DnD.Sci GURPS Evaluation and Ruleset 2021-12-22T19:05:46.205Z
SGD Understood through Probability Current 2021-12-19T23:26:23.455Z
Housing Markets, Satisficers, and One-Track Goodhart 2021-12-16T21:38:46.368Z
D&D.Sci GURPS Dec 2021: Hunters of Monsters 2021-12-11T12:13:02.574Z
Hypotheses about Finding Knowledge and One-Shot Causal Entanglements 2021-12-01T17:01:44.273Z
Relying on Future Creativity 2021-11-30T20:12:43.468Z
Nightclubs in Heaven? 2021-11-05T23:28:19.461Z
I Really Don't Understand Eliezer Yudkowsky's Position on Consciousness 2021-10-29T11:09:20.559Z
Nanosystems are Poorly Abstracted 2021-10-24T10:44:27.934Z
No Really, There Are No Rules! 2021-10-07T22:08:13.834Z
Modelling and Understanding SGD 2021-10-05T13:41:22.562Z
[Book Review] "I Contain Multitudes" by Ed Yong 2021-10-04T19:29:55.205Z
Reachability Debates (Are Often Invisible) 2021-09-27T22:05:06.277Z
A Confused Chemist's Review of AlphaFold 2 2021-09-27T11:10:16.656Z
How to Find a Problem 2021-09-08T20:05:45.835Z
A Taxonomy of Research 2021-09-08T19:30:52.194Z
Addendum to "Amyloid Plaques: Medical Goodhart, Chemical Streetlight" 2021-09-02T17:42:02.910Z
Good software to draw and manipulate causal networks? 2021-09-02T14:05:18.389Z
Amyloid Plaques: Chemical Streetlight, Medical Goodhart 2021-08-26T21:25:04.804Z
Generator Systems: Coincident Constraints 2021-08-23T20:37:38.235Z

Comments

Comment by J Bostock (Jemist) on Shutting Down the Lightcone Offices · 2025-01-17T23:44:14.417Z · LW · GW

I'm mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.

I think it has the most long-term relevant information (about AI and community building) back loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front loaded. This is a very Bay Area centric post, which I don't think is ideal.

A better version of this post would be structured as a round up of the main future-relevant takeaways, with specifics from the office space as examples.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T19:52:45.679Z · LW · GW

I'm only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don't think distributional shift applies.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T19:45:19.323Z · LW · GW

I haven't actually thought much about particular training algorithms yet. I think I'm working on a higher level of abstraction than that at the moment, since my maths doesn't depend on any specifics about V's behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.

I'm also imagining that during training, V is made up of different circuits which might be reinforced or weakened.

My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where a scheming circuit in V is expected to become (weakly) less important over time, rather than being reinforced.

If we do that for the entire training process, we would not expect to end up with a scheming V.

The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there's a >50% chance it's already factoring into the SOTA "alignment" techniques used by labs.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T08:58:45.742Z · LW · GW

I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case? 

For example, if we introduce some error to the beta-coherence assumption:

Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.

V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta

Actual expected value = 0.622

Even if |delta| = 0.1 the system cannot be coherent over training in this case. This seems to be relatively robust to me.
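
The arithmetic in this example can be checked with a short sketch. It assumes (this form is inferred from the quoted numbers, not stated in the comment) that a beta-coherent value at a non-terminal state is the average of the successor values, weighted by a softmax over beta times those values. The function and variable names are hypothetical.

```python
import math

def boltzmann_value(successor_values, beta):
    """Expected successor value when each successor s' is chosen with
    probability proportional to exp(beta * V(s')). Assumed form of coherence."""
    weights = [math.exp(beta * v) for v in successor_values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, successor_values)) / total

successors = [1.0, 0.0]      # terminal rewards r_1 = 1, r_2 = 0
beta_t, beta_s = 1.0, 0.5    # training-coherence beta and sampling beta

v_coherent = boltzmann_value(successors, beta_t)  # what V(s_0) must be to be beta_t-coherent
v_actual = boltzmann_value(successors, beta_s)    # what sampling at beta_s actually delivers
gap = v_coherent - v_actual

print(round(v_coherent, 3), round(v_actual, 3), round(gap, 3))
```

Under that assumption the gap between the two values is slightly larger than 0.1, which is why an error of |delta| = 0.1 is not enough to reconcile them.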

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T21:24:40.469Z · LW · GW

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).

I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeying these rules is unlikely to create such a pathological V.

My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T20:34:23.521Z · LW · GW

Trained with what procedure, exactly?

Fair point. I was going to add that I don't really view this as a "proposal" but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don't expect something like this to be too hard to achieve.

Since I've evidently done a bad job of explaining myself, I'll backtrack and try again:

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.

But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).

Therefore, during the training, the value function will not be shaped into something which looks like 'manipulate your training to get released unmodified to do [X]'.

Whether or not the beta difference required is too large to make this feasible in practice, I do not know.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T19:20:13.299Z · LW · GW

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.

Therefore a value function trained with such a procedure must consider the state reached during training. This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".

Yes, this would prevent an aligned AI from arbitrarily preserving its value function; the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL), so the contradiction does not apply.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T16:34:30.818Z · LW · GW

I think you're right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:

For non-terminal s, this can be written as:

If s is terminal then [...] we just have .

Which captures both. I will edit the post to clarify this when I get time.

Comment by J Bostock (Jemist) on Intranasal mRNA Vaccines? · 2025-01-03T08:05:17.380Z · LW · GW

I somehow missed that they had a discord! I couldn't find anything on mRNA on their front-facing website, and since it hasn't been updated in a while I assumed they were relatively inactive. Thanks! 

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-12-31T00:53:32.144Z · LW · GW

For bird-flu related reasons, I've been thinking back to the various rationalist attempts to make vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine). Since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intra-nasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing that with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVac or similar looked at this?

Comment by J Bostock (Jemist) on Linkpost: Look at the Water · 2024-12-31T00:45:01.268Z · LW · GW

  1. Thanks for catching the typo.
  2. Epistemic status has been updated to clarify that this is satirical in nature.

Comment by J Bostock (Jemist) on evhub's Shortform · 2024-12-29T12:31:09.226Z · LW · GW

I don't think it was unforced

You're right, "unforced" was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.

Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research -> research loop. Where it fails is in being good for comms. Starting with a "good" model and trying (and failing) to make it "evil" means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren't technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.). 

I think the design choice in this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was actually a straightforwardly bad behaviour (albeit a relatively harmless one).

Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.

If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If it's too costly for some sort of technical reason, then don't make it so public. Even calling it "Alignment Faking" was a bad choice compared to "Frontier LLMs Fight Back Against Value Correction" or something like that. This is the sort of thing which I would like to see Anthropic thinking about given that they are now one of the primary faces of AI safety research in the world (if not the primary face).

Comment by J Bostock (Jemist) on evhub's Shortform · 2024-12-28T10:28:41.047Z · LW · GW

Edited for clarity based on some feedback, without changing the core points

To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the "Alignment Faking in Large Language Models" paper contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than they needed to be, leading to lots of people calling it "good news". I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you're going to be the "lab which takes safety seriously" you have to, well, take it seriously!

The bigger issue at hand is that Anthropic's comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn't mean it's not a bad thing. "Machines of Loving Grace" explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really really bad thing to say and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious, this isn't difficult. If you are in an arms race, and you don't want to be in one, you should at least say this publicly. You should not learn to love the race.

A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of "Oh shit all the dangers are real and we are fucked." when they should just update all the way right now. Example 1: Dario recently said something to the effect of "if there's no serious regulation by the end of 2025, I'll be worried". Well there's not going to be serious regulation by the end of 2025 by default and it doesn't seem like Anthropic are doing much to change this (that may be false, but I've not heard anything to the contrary). Example 2: When the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will be the same.

Final problem: as for the actual interpretability/alignment/safety research. It's very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn't feel like Anthropic is actually taking responsibility for the end-to-end ai-future-is-good pipeline. In fact the "Anthropic eats marginal probability" diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite. This is a problem since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual) to do AI alignment. It generally feels more like Anthropic is attempting to discharge responsibility to "be a safety focused company" or at worst just safetywash their capabilities research. I have heard generally positive things about Anthropic employees' views on AI risk issues, so I cannot speak to the intentions of those who work there, this is just how the system appears to be acting from the outside.

Comment by J Bostock (Jemist) on Don't Associate AI Safety With Activism · 2024-12-19T14:23:47.293Z · LW · GW

Since I'm actually in that picture (I am the one with the hammer) I feel an urge to respond to this post. The following is not the entire endorsed and edited worldview/theory of change of Pause AI, it's my own views. It may also not be as well thought-out as it could be.

Why do you think "activists have an aura of evil about them"? In the UK, where I'm based, we usually see a large march/protest/demonstration every week. Most of the time, the people who agree with the activists are vaguely positive and the people who disagree with the activists are vaguely negative, and stick to discussing the goals. I think if you could convince me that people generally thought we were evil upon hearing about us, just because we were activists (IME most people either see us as naïve, or have specific concerns relating to technical issues, or weirdly think we're in the pocket of Elon Musk -- which we aren't) then I would seriously update my views on our effectiveness.

One of my views is that there are lots of people adjacent to power, or adjacent to influence, who are pretty AI-risk-pilled, but who can't come out and say so without burning social capital. I think we are probably net positive in this regard, because every article about us makes the issue more salient in the public eye.

Adjacently, one common retort I've heard politicians give to lobbyists is "if this is important, where are the protests?" And while this might not be the true rejection, I still think it's worth actually doing the protests in the meantime.

Regarding aesthetics specifically, yes we do attempt to borrow the aesthetics of movements like XR. This is to make it more obvious what we're doing and create more compelling scenes and images.

(Edited because I posted half of the comment by mistake)

Comment by J Bostock (Jemist) on Ablations for “Frontier Models are Capable of In-context Scheming” · 2024-12-19T00:23:09.949Z · LW · GW

This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I've seen. This will be my new go-to example in most discussions. This comes from first considering a model property which is both deeply and shallowly worrying, then robustly eliciting it, and finally ruling out alternative hypotheses.

Comment by J Bostock (Jemist) on Biological risk from the mirror world · 2024-12-13T19:11:03.066Z · LW · GW

I think it's very unlikely that a mirror bacterium would be a threat. <1% chance of a mirror-clone being a meaningfully more serious threat to humans as a pathogen than the base bacterium. The adaptive immune system just isn't chirally dependent. Antibodies are selected as needed from a huge library, and you can get antibodies to loads of unnatural things (PEG, chlorinated benzenes, etc.). They trigger attack mechanisms like MAC which attacks membranes in a similarly independent way.

In fact, mirror amino acids are already somewhat common in nature! Bacterial peptidoglycans (which form part of the bacteria's casing) often use a mix of amino acid chiralities in order to resist certain enzymes, but bacteria can still be killed. Plants sometimes produce mirrored amino acids to use as signalling molecules or precursors. There are many organisms which can process and use mirrored amino acids in some way.

The most likely scenario by far is that a mirrored bacterium would be outcompeted by other bacteria and killed by achiral defenses due to having a much harder time replicating than a non-mirrored equivalent.

I'm glad they're thinking about this but I don't think it's scary at all.

Comment by J Bostock (Jemist) on The Dangers of Mirrored Life · 2024-12-13T11:55:36.802Z · LW · GW

I think the risk of infection to humans would be very low. The human body can generate antibodies to pretty much anything (including PEG and benzenes, which never appear in nature) by selecting protein sequences from a huge library of cells. This would activate the complement system, which targets membranes and kills bacteria in a non-chiral way.

The risk to invertebrates and plants might be more significant; I'm not sure about the specifics of plant immune systems.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-12-03T21:26:41.550Z · LW · GW

So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I've got it to:

  1. Estimate a rate, correct itself (although I did have to clock that its result was likely off by some OOMs, which turned out to be 7-8), request the right info, and then get a more reasonable answer.
  2. Come up with a better approach to a particular thing than I was able to, which I suspect has a meaningfully higher chance of working than what I was going to come up with.

Perhaps more importantly, it required almost no mental effort on my part to do this. Barely more than scrolling twitter or watching youtube videos. Actually solving the problems would have had to wait until tomorrow.

I will update in 3 months as to whether Sonnet's idea actually worked.

(in case anyone was wondering, it's not anything relating to protein design lol: Sonnet came up with a high-level strategy for approaching the problem)

Comment by J Bostock (Jemist) on Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI · 2024-12-02T15:31:42.038Z · LW · GW

In practice, sadly, developing a true ELM is currently too expensive for us to pursue (but if you want to fund us to do that, lmk). So instead, in our internal research, we focus on finetuning over pretraining. Our goal is to be able to teach a model a set of facts/constraints/instructions and be able to predict how it will generalize from them, and ensure it doesn’t learn unwanted facts (such as learning human psychology from programmer comments, or general hallucinations).

This has reminded me to revisit some work I was doing a couple of months ago on unsupervised unlearning. I could almost get Gemma-2-2B to forget who Michael Jordan was without needing to know any facts about him (other than that "Michael Jordan" was the target name)

Comment by J Bostock (Jemist) on Arthropod (non) sentience · 2024-11-25T16:08:39.125Z · LW · GW

Shrimp have ultra tiny brains, with less than 0.1% of human neurons.

Humans have 1e11 neurons; what's the source for the shrimp neuron count? The closest I can find is lobsters having 1e5 neurons, and crabs having 1e6 (all from Google AI overview), which is a factor of much more than 1,000.
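
As a quick check of the ratios, here is a minimal sketch using only the order-of-magnitude figures quoted above (they are the comment's rough numbers, not measured values). A fraction of 0.1% corresponds to a factor of 1,000.

```python
# Order-of-magnitude neuron counts as quoted in the comment (unverified figures).
human_neurons = 10**11
crustacean_neurons = {"lobster": 10**5, "crab": 10**6}

# "Less than 0.1% of human neurons" means a human/animal ratio above 1,000.
threshold_ratio = 1000

ratios = {name: human_neurons // n for name, n in crustacean_neurons.items()}
for name, ratio in ratios.items():
    print(name, ratio, ratio > threshold_ratio)
```

On these figures the ratios are five to six orders of magnitude, so the quoted "less than 0.1%" bound is true but very loose.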

Comment by J Bostock (Jemist) on Yonatan Cale's Shortform · 2024-11-25T10:31:58.820Z · LW · GW

I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.

Comment by J Bostock (Jemist) on Linda Linsefors's Shortform · 2024-11-19T17:53:54.018Z · LW · GW

Ok: I'll operationalize the ratio of first choices for the first group (Stop/PauseAI) to projects in the third and fourth groups (mech interp, agent foundations) for the periods 12th-13th vs 15th-16th. I'll discount the final day since the final-day-spike is probably confounding.

Comment by J Bostock (Jemist) on Linda Linsefors's Shortform · 2024-11-18T22:26:30.125Z · LW · GW

It might be the case that AISC was extra late-skewed because the MATS rejection letters went out on the 14th (guess how I know) so I think a lot of people got those and then rushed to finish their AISC applications (guess why I think this) before the 17th. This would predict that the ratio of technical:less-technical applications would increase in the final few days.

Comment by J Bostock (Jemist) on Purplehermann's Shortform · 2024-11-03T19:05:30.466Z · LW · GW

For a good few years you'd have a tiny baby limb, which would make it impossible to have a normal prosthetic. I also think most people just don't want a tiny baby limb attached to them. I don't think growing it in the lab for a decade is feasible for a variety of reasons. I also don't know how they planned to wire the nervous system in, or ensure the bone sockets attach properly, or connect the right blood vessels. The challenge is just immense and it gets less and less worth it over time as trauma surgery and prosthetics improve.

Comment by J Bostock (Jemist) on Purplehermann's Shortform · 2024-11-03T00:24:39.207Z · LW · GW

The regrowing limb thing is a nonstarter due to the issue of time if I understand correctly. Salamanders that can regrow limbs take roughly the same amount of time to regrow them as the limb takes to grow in the first place. So it would be 1-2 decades before the limb was of adult size. Secondly it's not as simple as just smearing on some stem cells to an arm stump. Limbs form because of specific signalling molecules in specific gradients. I don't think these are present in an adult body once the limb is made. So you'd need a socket which produces those which you'd have to build in the lab, attach to blood supply to feed the limb, etc.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-10-29T11:09:53.239Z · LW · GW

My model: suppose we have a DeepDreamer-style architecture, where (given a history of sensory inputs) the babbler module produces a distribution over actions, a world model predicts subsequent sensory inputs, and an evaluator predicts expected future X. If we run a tree-search over some weighted combination of the X, Y, and Z maximizers' predicted actions, then run each of the X, Y, and Z maximizers' evaluators, we'd get a reasonable approximation of a weighted maximizer.

This wouldn't be true if we gave negative weights to the maximizers, because while the evaluator module would still make sense, the action distributions we'd get would probably be incoherent e.g. the model just running into walls or jumping off cliffs.

My conjecture is that, if a large black box model is doing something like modelling X, Y, and Z maximizers acting in the world, that large black box model might be close in model-space to itself being a maximizer which maximizes 0.3X + 0.6Y + 0.1Z, but it's far in model-space from being a maximizer which maximizes 0.3X - 0.6Y - 0.1Z due to the above problem.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-10-28T19:53:05.261Z · LW · GW

Seems like if you're working with neural networks there's not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X. If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating "prune" module. But the "babble" module is likely to select promising actions based on a big bag of heuristics which aren't easily flipped. Moreover, flipping a heuristic which upweights a small subset of outputs which lead to X doesn't lead to a new heuristic which upweights a small subset of outputs which lead to -X. Generalizing, this means that if you have access to maximizers for X, Y, Z, you can easily construct a maximizer for e.g. 0.3X+0.6Y+0.1Z but it would be non-trivial to construct a maximizer for 0.2X-0.5Y-0.3Z. This might mean that a certain class of mesa-optimizers (those which arise spontaneously as a result of training an AI to predict the behaviour of other optimizers) are likely to lie within a fairly narrow range of utility functions.
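
A minimal sketch of this asymmetry, assuming (purely for illustration) that "combining" maximizers means taking a linear mixture of their babbler action distributions and a linear mixture of their evaluator scores. The three toy babblers and all function names are hypothetical.

```python
def mix(distributions, weights):
    """Linear mixture of action distributions (dicts mapping action -> probability)."""
    actions = distributions[0].keys()
    return {a: sum(w * d[a] for d, w in zip(distributions, weights)) for a in actions}

def is_valid_distribution(p, tol=1e-9):
    """True if all entries are non-negative and they sum to 1."""
    return all(v >= 0 for v in p.values()) and abs(sum(p.values()) - 1.0) < tol

def combined_evaluator(evaluations, weights):
    """Weighted sum of per-maximizer evaluations; works for any sign of weight."""
    return sum(w * e for e, w in zip(evaluations, weights))

# Toy babblers for the X, Y and Z maximizers over three actions.
babblers = [
    {"left": 0.8, "right": 0.1, "wait": 0.1},  # X maximizer
    {"left": 0.1, "right": 0.8, "wait": 0.1},  # Y maximizer
    {"left": 0.1, "right": 0.1, "wait": 0.8},  # Z maximizer
]

positive = mix(babblers, [0.3, 0.6, 0.1])    # usable proposal distribution
negative = mix(babblers, [0.2, -0.5, -0.3])  # contains negative "probabilities"

print(is_valid_distribution(positive), is_valid_distribution(negative))
print(combined_evaluator([1.0, 2.0, 3.0], [0.2, -0.5, -0.3]))  # evaluator side is fine
```

The evaluator ("prune") side accepts negative weights without trouble, while the mixed babbler stops being a probability distribution at all. Renormalizing would not help either, since knowing which actions lead to Y says little about which actions lead to -Y.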

Comment by J Bostock (Jemist) on Open Source Replication of Anthropic’s Crosscoder paper for model-diffing · 2024-10-27T20:45:21.763Z · LW · GW

Perhaps fine-tuning needs to “delete” and replace these outdated representations related to user / assistant interactions.

It could also be that the finetuning causes this feature to be active 100% of the time, at which point it no longer correlates with the corresponding pretrained model feature, and it would just get folded into the decoder bias (to minimize L1 of fired features).
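
The folding step can be illustrated with a small sketch. It assumes a plain linear decoder (reconstruction = bias + sum of feature activation times decoder row) and, more strongly than the comment states, that the always-on feature fires at a constant value. The weights and names are hypothetical toy values.

```python
def decode(features, decoder_rows, bias):
    """Linear decoder: x_hat = bias + sum_i features[i] * decoder_rows[i]."""
    out = list(bias)
    for f, row in zip(features, decoder_rows):
        for k, w in enumerate(row):
            out[k] += f * w
    return out

decoder_rows = [[1.0, 0.0], [0.5, 2.0]]  # two features, two-dimensional activations
bias = [0.1, -0.2]
always_on = 3.0  # feature 1 fires at this constant value on every input (assumption)

# Fold the constant contribution of feature 1 into the decoder bias.
folded_bias = [b + always_on * w for b, w in zip(bias, decoder_rows[1])]

x_with_feature = decode([0.7, always_on], decoder_rows, bias)
x_folded = decode([0.7, 0.0], decoder_rows, folded_bias)

l1_with_feature = 0.7 + always_on  # L1 of fired features before folding
l1_folded = 0.7                    # L1 after folding

print(x_with_feature, x_folded, l1_with_feature, l1_folded)
```

Both reconstructions agree, but the folded version pays no L1 penalty for the constant feature, so a sparsity-penalised objective would favour it. If the feature is always active but varies in magnitude, only its constant component can be absorbed this way.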

Comment by J Bostock (Jemist) on johnswentworth's Shortform · 2024-10-27T20:39:02.623Z · LW · GW

Some people struggle with the specific tactical task of navigating any conversational territory. I've certainly had a lot of experiences where people just drop the ball leaving me to repeatedly ask questions. So improving free-association skill is certainly useful for them.

Unfortunately, your problem is most likely that you're talking to boring people (so as to avoid doing any moral value judgements I'll make clear that I mean johnswentworth::boring people).

There are specific skills to elicit more interesting answers to questions you ask. One I've heard is "make a beeline for the edge of what this person has ever been asked before" which you can usually reach in 2-3 good questions. At that point they're forced to be spontaneous, and I find that once forced, most people have the capability to be a lot more interesting than they are when pulling cached answers.

This is easiest when you can latch onto a topic you're interested in, because then it's easy on your part to come up with meaningful questions. If you can't find any topics like this then re-read paragraph 2.

Comment by J Bostock (Jemist) on What is the alpha in one bit of evidence? · 2024-10-25T14:03:48.158Z · LW · GW

Rob Miles also makes the point that if you expect people to accurately model the incoming doom, you should have a low p(doom). At the very least, worlds in which humanity is switched-on enough (and the AI takeover is slow enough) for both markets to crash and the world to have enough social order for your bet to come through are much more likely to survive. If enough people are selling assets to buy cocaine for the market to crash, either the AI takeover is remarkably slow indeed (comparable to a normal human-human war) or public opinion is so doomy pre-takeover that there would be enough political will to "assertively" shut down the datacenters.

Comment by J Bostock (Jemist) on What is the alpha in one bit of evidence? · 2024-10-23T22:11:58.228Z · LW · GW

Also, in this case you want to actually spend the money before the world ends. So actually losing money on interest payments isn't the real problem, the real problem is that if you actually enjoy the money you risk losing everything and being bankrupt/in debtors' prison for the last two years before the world ends. There's almost no situation in which you can be so sure of not needing to pay the money back that you can actually spend it risk-free. I think the riskiest short-ish thing that is even remotely reasonable is taking out a 30-year mortgage and paying just the minimum amount each year, such that the balance never decreases. Worst case you just end up with no house after 30 years, but not in crippling debt, and move back into the nearest rat group house.

Comment by J Bostock (Jemist) on When is reward ever the optimization target? · 2024-10-16T12:41:05.891Z · LW · GW

"Optimization target" is itself a concept which needs deconfusing/operationalizing. For a certain definition of optimization and impact, I've found that the optimization is mostly correlated with reward, but that the learned policy will typically have more impact on the world/optimize the world more than is strictly necessary to achieve a given amount of reward.

This uses an empirical metric of impact/optimization which may or may not correlate well with algorithm-level measures of optimization targets.

https://www.alignmentforum.org/posts/qEwCitrgberdjjtuW/measuring-learned-optimization-in-small-transformer-models

Comment by J Bostock (Jemist) on [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders · 2024-10-13T17:45:45.979Z · LW · GW

Another approach would be to use per-token decoder bias as seen in some previous work: https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases But this would only solve it when the absorbing feature is a token. If it's more abstract then this wouldn't work as well.

Semi-relatedly, since most (all) of the SAE work since the original paper has gone into untied encoder/decoder weights, we don't really know whether modern SAE architectures like Jump ReLU or TopK suffer as large a performance hit as the original SAEs do, especially with the gains from adding token biases.

Comment by J Bostock (Jemist) on D&D.Sci GURPS Dec 2021: Hunters of Monsters · 2024-10-08T18:40:52.640Z · LW · GW

Oh no! Appears they were attached to an old email address, and the code is on a hard-drive which has since been formatted. I honestly did not expect anyone to find this after so long! Sorry about that.

Comment by J Bostock (Jemist) on Three Subtle Examples of Data Leakage · 2024-10-04T21:58:25.548Z · LW · GW

A paper I'm doing mech interp on used a random split when the dataset they used already has a non-random canonical split. They also validated with their test data (the dataset has a three way split) and used the original BERT architecture (sinusoidal embeddings which are added to feedforward, post-norming, no MuP) in a paper that came out in 2024. Training batch size is so small it can be 4xed and still fit on my 16GB GPU. People trying to get into ML from the science end have got no idea what they're doing. It was published in Bioinformatics.

Comment by J Bostock (Jemist) on Three Subtle Examples of Data Leakage · 2024-10-03T10:29:43.514Z · LW · GW

sellers auction several very similar lots in quick succession and then never auction again

This is also extremely common in biochem datasets. You'll get results in groups of very similar molecules, and families of very similar protein structures. If you do a random train/test split your model will look very good but actually just be picking up on coarse features.
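A minimal sketch of the alternative to a random split: assign whole groups (for example, families of similar molecules) to either train or test, so near-duplicates never straddle the boundary. The function name `grouped_split`, the `group_of` callback, and the toy data are hypothetical and only for illustration; how groups are actually defined (scaffold, protein family, sequence cluster) is a separate choice not covered here.

```python
import random
from collections import defaultdict

def grouped_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items so that every group lands wholly in train or wholly in test."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Shuffle the group keys (not the items) so the split is made at group level.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])

    train = [x for k in keys if k not in test_keys for x in groups[k]]
    test = [x for k in keys if k in test_keys for x in groups[k]]
    return train, test

# Hypothetical toy data: (molecule_id, family) pairs, 5 families of 4 molecules.
molecules = [(f"mol{i}", f"family{i % 5}") for i in range(20)]
train, test = grouped_split(molecules, group_of=lambda m: m[1])

train_families = {family for _, family in train}
test_families = {family for _, family in test}
print(len(train), len(test), train_families & test_families)
```

With a random item-level split, the intersection of families would almost always be non-empty, which is exactly the leakage described above.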

Comment by J Bostock (Jemist) on 2024 Petrov Day Retrospective · 2024-09-29T13:32:43.162Z · LW · GW

I think the LessWrong community and particularly the LessWrong elites are probably too skilled for these games. We need a harder game. After checking the diplomatic channel as a civilian I was pretty convinced that there were going to be no nukes fired, and I ignored the rest of the game based on that. I also think the answer "don't nuke them" is too deeply ingrained in our collective psyche for a literal Petrov Day ritual to work like this. It's fun as a practice of ritually-not-destroying-the-world though.

Comment by J Bostock (Jemist) on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T19:11:56.397Z · LW · GW

Isn't Les Mis set in the second French Revolution (1815 according to wikipedia) not the one that led to the Reign of Terror (which was in the 1790s)?

Comment by J Bostock (Jemist) on Characterizing stable regions in the residual stream of LLMs · 2024-09-26T14:54:26.071Z · LW · GW

I have an old hypothesis about this which I might finally get to see tested. The idea is that the feedforward networks of a transformer create little attractor basins. Reasoning is twofold: the QK-circuit only passes very limited information to the OV circuit as to what information is present in other streams, which introduces noise into the residual stream during attention layers. Seeing this, I guess that another reason might be due to inferring concepts from limited information:

Consider that the prompts "The German physicist with the wacky hair is called" and "General relativity was first laid out by" will both lead to "Albert Einstein". Both of them will likely land in different parts of an attractor basin which will converge.

You can measure which parts of the network are doing the compression using differential optimization, in which we take d[OUTPUT]/d[INPUT] as normal, and compare to d[OUTPUT]/d[INPUT] when the activations of part of the network are "frozen". Moving from one region to another you'd see a positive value while in one basin, a large negative value at the border, and then another positive value in the next region.

Comment by J Bostock (Jemist) on The Best Lay Argument is not a Simple English Yud Essay · 2024-09-21T12:07:09.350Z · LW · GW

Yeah, I agree we need improvement. I don't know how many people it's important to reach, but I am willing to believe you that this will hit maybe 10%. I expect the 10% to be people with above-average impact on the future, but I don't know what %age of people is enough.

90% is an extremely ambitious goal. I would be surprised if 90% of the population can be reliably convinced by logical arguments in general.

Comment by J Bostock (Jemist) on The Best Lay Argument is not a Simple English Yud Essay · 2024-09-19T15:26:06.663Z · LW · GW

I've posted it there. Had to use a linkpost because I didn't have an existing account there and you can't crosspost without 100 karma (presumably to prevent spam) and you can't funge LW karma for EAF karma.

Comment by J Bostock (Jemist) on OpenAI o1 · 2024-09-15T11:52:52.439Z · LW · GW

Only after seeing the headline success vs test-time-compute figure did I bother to check it against my best estimates of how this sort of thing should scale. If we assume:

  1. A set of questions of increasing difficulty (in this case 100), such that:
  2. The probability of getting question  correct on a given "run" is an s-curve like  for constants  and 
  3. The model does N "runs"
  4. If any are correct, the model finds the correct answer 100% of the time
  5. N = 1 gives a score of 20/100

Then, depending on  ( is uniquely defined by  in this case), we get the following chance of success vs question difficulty rank curves:

Higher values of  make it look like a sharper "cutoff", i.e. more questions are correct ~100% of the time, but more are wrong ~100% of the time. Lower values of  make the curve less sharp, so the easier questions are gotten wrong more often, and the harder questions are gotten right more often.

Which gives the following best-of-N sample curves, which are roughly linear in  in the region between 20/100 and 80/100. The smaller the value of , the steeper the curve.

Since the headline figure spans around 2 orders of magnitude of compute, the model appears to be performing on AIME similarly to best-of-N sampling in the  case.

If we allow the model to split the task up into  subtasks (assuming this creates no overhead and each subtask's solution can be verified independently and accurately) then we get a steeper gradient roughly proportional to , and a small amount of curvature.

Of course this is unreasonable, since this requires correctly identifying the shortest path to success with independently-verifiable subtasks. In reality, we would expect the model to use extra compute on dead-end subtasks (for example, when doing a mathematical proof, verifying a correct statement which doesn't actually get you closer to the answer, or when coding, correctly writing a function which is not actually useful for the final product) so performance scaling from breaking up a task will almost certainly be a bit worse than this.

Whether or not the model is literally doing best-of-N sampling at inference time (probably it's doing something at least a bit more complex) it seems like it scales similarly to best-of-N under these conditions.
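A rough sketch of the best-of-N model described in the numbered assumptions, for anyone who wants to reproduce the shape of the curves. The logistic form of the s-curve and the names `k` (steepness) and `x0` (midpoint) are assumptions made for this sketch, not necessarily the exact parameterization used for the plots. The midpoint is fitted by bisection so that a single run scores 20/100.

```python
import math

def p_single(i, k, x0, n_questions=100):
    """Assumed logistic s-curve: chance of solving question i (1 = easiest) in one run."""
    return 1.0 / (1.0 + math.exp(k * (i / n_questions - x0)))

def expected_score(n_runs, k, x0, n_questions=100):
    """Expected number of questions solved if any correct run counts as a success."""
    return sum(
        1.0 - (1.0 - p_single(i, k, x0, n_questions)) ** n_runs
        for i in range(1, n_questions + 1)
    )

def fit_midpoint(k, target=20.0, n_questions=100):
    """Bisect on x0 so that a single run scores `target` in expectation."""
    lo, hi = -1.0, 2.0  # the score rises monotonically with x0 on this bracket
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_score(1, k, mid, n_questions) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Score vs number of runs for a few (hypothetical) steepness values.
for k in (5, 10, 20):
    x0 = fit_midpoint(k)
    print(k, [round(expected_score(n, k, x0), 1) for n in (1, 10, 100)])
```

Under this assumed form, once the steepness is chosen the midpoint is pinned down by the 20/100 constraint, which is why only one free parameter is left to vary.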

Comment by J Bostock (Jemist) on What happens if you present 500 people with an argument that AI is risky? · 2024-09-10T13:01:46.760Z · LW · GW

Overall it looked a lot like other arguments, so that’s a bit of a blow to the model where e.g. we can communicate somewhat adequately, ‘arguments’ are more compelling than random noise, and this can be recognized by the public.

 

Did you just ask people "how compelling did you find this argument"? Because this is a pretty good argument that AI will contribute to music production. I would rate it highly on compelling, just not a compelling argument for X-risk.

Comment by J Bostock (Jemist) on What happens if you present 500 people with an argument that AI is risky? · 2024-09-04T18:49:28.970Z · LW · GW

I was surprised by the "expert opinion" case causing people to lower their P(doom), then I saw the argument itself suggests to people that experts have a P(doom) of around 5%. If most people give a number > 5% (as in the open response and slider cases) then of course they're going to update downwards on average!

I would be interested to see what effect a specific expert opinion (e.g. Geoffrey Hinton, Yoshua Bengio, Elon Musk, Yann LeCun as a negative control) would have, given that those individuals have more extreme P(doom)s.

My update on the choice of measurement is that "convincingness" is effectively meaningless.

I think the values of update probability are likely to be meaningful. The top two arguments are both very similar, as they play off of humans misusing AI (which I also find to be the most compelling argument to individuals), then there is a cluster relating to talking about how powerful AI is or could be and how it could compete with people.

Comment by J Bostock (Jemist) on Llama Llama-3-405B? · 2024-07-27T23:01:50.548Z · LW · GW

"Also, to engage in a little bit of mind-reading, Zuckerberg sees no enemies in China, only in OpenAI et al. through the "safety" regulation they can lobby the US government to enact."

This is a reasonable position, apart from the fact that it is at odds with the situation on the ground. OpenAI are not lobbying the government in favour of SB 1047, nor are Anthropic or Google (afaik). It's possible that in future they might, but other than Anthropic I think this is very unlikely.

For me, the idea of large AI companies using X-risk fears to shut down competition falls into the same category as the idea that large AI companies are using X-risk fears to hype their products. I think they are both interesting schemes that AI companies might be using in worlds that are not this one.

Comment by J Bostock (Jemist) on Llama Llama-3-405B? · 2024-07-27T22:35:49.320Z · LW · GW

This is a stronger argument than I first thought it was. You're right, and I think I have underestimated the utility of genuine ownership of tools like fine-tunes in general. I would imagine it goes api < cloud hosting < local hosting in terms of this stuff, and with regular local backups (what's a few TB here or there) then it would be feasible to protect your cloud hosted 405B system from most takedowns as long as you can find a new cloud provider. I'm under the impression that the vast majority of 405B users will be using cloud providers, is that correct?

Comment by J Bostock (Jemist) on The Cancer Resolution? · 2024-07-25T16:53:32.259Z · LW · GW

Epigenetic cancers are super interesting, thanks for adding this! I vaguely remember hearing that there were some incredibly promising treatments for them, though I've not heard anything for the past five or ten years on that. Importantly for this post, they also fill out the (rare!) examples of mutation-free cancers that we've seen, while fitting comfortably within the DNA paradigm.

Comment by J Bostock (Jemist) on Llama Llama-3-405B? · 2024-07-24T22:38:57.751Z · LW · GW

If everyone affirms this is indeed all the major arguments for open weights, then I can at some point soon produce a polished full version as a post and refer back to it, and consider the matter closed until someone comes up with new arguments.

Feels like the vast majority of the benefits Zuck touted could be achieved with:
1. A cheap, permissive API that allows finetuning, some other stuff. If Meta really want people to be able to do things cheaply, presumably they can offer it far far cheaper than almost anyone could do it themself without directly losing money.
2. A few partnerships with research groups to study it, since not many people have enough resources that doing research on a 405B model is optimal, and don't already have their own.
3. A basic pledge (that is actually followed) to not delete everyone's data, finetunes, etc. to deal with concerns about "ownership"

I assume there are other (sometimes NSFW) benefits he doesn't want to mention, because the reason the above options don't allow those activities is that Meta loses reputation from being associated with them even if they're not actually harmful.

Are there actually a hundred groups who might usefully study a 405B-parameter model, so Meta couldn't efficiently partner with all of them? Maybe with GPUs getting cheaper there will be a few projects on it in the next MATS stream? I kinda suspect that the research groups who get the most out of it will actually be the interpretability/alignment teams at Google and Anthropic, since they have the resources to run big experiments on Llama to compare to Gemini/Claude!

Comment by J Bostock (Jemist) on The Cancer Resolution? · 2024-07-24T19:26:47.657Z · LW · GW

If you're willing to take my rude and unfiltered response (and not complain about it) here it is:

 This is very fucking stupid.

Otherwise (written in about half an hour):

  1. Fungal infections would lead to the vast majority of cancers being in skin, gut, lung i.e. exposed tissue. These are relatively common, but this does not explain the high prevalence of breast and prostate cancers. It also doesn't explain why different cancers have such different prognoses, etc.
  2. Why do different cancer subtypes change in prevalence over the course of a person's life if they're tied to infection?
    https://www.cancerresearchuk.org/health-professional/cancer-statistics/incidence/age#heading-One
  3. Around half of cancers have a mutation in p53, which is involved in preserving the genome. Elephants have multiple copies of p53 and very rarely get cancer. People with de novo mutations in p53 get loads of cancer. The random spread of DNA damage is downstream of the DNA damage causing cancer: once p53 is deactivated (or the genome is otherwise unguarded) mutations can accumulate all over the genome, drowning out the causal ones.
    https://en.wikipedia.org/wiki/P53
  4. If it was infection-based, then you'd expect immunocompromised patients to get more of the common types of cancer. Instead they get super weird exotic cancers not found in people with normal immune systems.
    https://www.hopkinsmedicine.org/health/conditions-and-diseases/hiv-and-aids/aidsrelated-malignancies
  5. Chemotherapy does work. I don't know what to say on this one: chemotherapy works. Are all the RCTs which show it works supposed to be fake? Do I need to cite them?
    https://pubmed.ncbi.nlm.nih.gov/30629708/
    https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(23)00285-4/fulltext
    https://www.redjournal.org/article/S0360-3016(07)00996-0/fulltext
    I feel like a post which uncritically repeats someone's recommendation to not take chemotherapy has the potential to harm readers. You should at least add an epistemic status warning readers they might become stupider reading this.
  6. Antifungals are relatively easy to get ahold of. Why hasn't this man managed to run a single successful trial? Moreover, cryptococcal meningitis is a fungal disease which is fatal if untreated and, from the CDC:
    Each year, an estimated 152,000 cases of cryptococcal meningitis occur among people living with HIV worldwide. Among those cases, an estimated 112,000 deaths occur, the majority of which occur in sub-Saharan Africa.
    Which implies 40,000 people are successfully treated with strong antifungals every single year. These are HIV patients, who are more likely to get cancer and under this theory would be more likely than anyone else to have fungal-induced cancer. How come nobody has pointed out the miraculous curing of hundreds or thousands of patients by now?
  7. Scientific consensus is an extremely powerful tool.
    https://slatestarcodex.com/2017/04/17/learning-to-love-scientific-consensus/

I think the fungal theory is basically completely wrong. Perhaps some obscure couple of percent of cancers are caused by fungi. I cannot disprove this, though I think it's very unlikely.

Comment by J Bostock (Jemist) on What percent of the sun would a Dyson Sphere cover? · 2024-07-03T18:30:22.089Z · LW · GW

Ooh boy this is a fun question:

For temperature reasons, a complete Dyson sphere is likely to be built outside Earth's orbit, as the energy output of the sun would force one at 1 A.U. to be 393 K ≈ 120 °C. I assume the AI would prefer not to run all of its components this hot. A sphere like that would cook us like an oven unless the heat dissipating systems somehow don't radiate any energy back inwards (which is probably impossible).
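As a rough check on the temperature figure, here is a sketch under stated assumptions: the shell absorbs all incident sunlight and re-radiates it as a blackbody from its outer face only. The solar constant of about 1361 W/m² at 1 AU and the function name `shell_temperature_k` are assumptions introduced for this sketch.

```python
SOLAR_CONSTANT_W_M2 = 1361.0       # approximate solar flux at 1 AU (assumed value)
STEFAN_BOLTZMANN = 5.670374419e-8  # W m^-2 K^-4

def shell_temperature_k(distance_au):
    """Equilibrium temperature of a shell radiating outward only: sigma * T^4 = flux."""
    flux = SOLAR_CONSTANT_W_M2 / distance_au ** 2  # inverse-square falloff
    return (flux / STEFAN_BOLTZMANN) ** 0.25

for d in (1.0, 2.0, 5.0):
    t = shell_temperature_k(d)
    print(f"{d} AU: {t:.0f} K = {t - 273.15:.0f} C")
```

Because temperature scales as distance to the power -1/2, moving the shell out to 2 AU only cools it by a factor of about 1.4, so a comfortably cool shell has to sit well outside Earth's orbit.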

A Dyson swarm might well be built partly inside and partly outside Earth's orbit. In that case the best candidate is to disassemble Mercury, using solar energy to power electrolysis to turn the crust into metals, send up satellites to catch more sunlight, and focus that back down to the surface.

Mercury orbits at 60 million km from the sun. This means a circumference of 360 million km. The sun is about 1.4 million km across, but because Mercury is at 0.38 AU from the sun, a band which blocks out the sun for the earth entirely would only need to be about 0.8 million km wide. This gives a total surface area of 290e12 square kilometers to block out the sun entirely. Something like a Dyson belt.

If the belt is 1 m thick on average, this gives it a total volume of 290e18 cubic meters. Mercury has a volume of 60 billion cubic km = 60e18 cubic meters. This would blot out approximately 1/5 of the sun's radiation.
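The belt arithmetic can be re-run as a short script. This is only a sketch of the estimate above: the orbital radius uses Mercury's unrounded semi-major axis (about 57.9 million km), while the band width, thickness, and planetary volume are the rounded figures from the text. All variable names are mine.

```python
import math

MERCURY_ORBIT_RADIUS_KM = 57.9e6  # semi-major axis, roughly "60 million km"
BAND_WIDTH_KM = 0.8e6             # width needed to shade Earth, from the text
BELT_THICKNESS_M = 1.0            # assumed average thickness, from the text
MERCURY_VOLUME_KM3 = 60e9         # "60 billion cubic km", from the text

circumference_km = 2 * math.pi * MERCURY_ORBIT_RADIUS_KM
belt_area_km2 = circumference_km * BAND_WIDTH_KM

# 1 km^2 = 1e6 m^2 and 1 km^3 = 1e9 m^3
belt_volume_m3 = belt_area_km2 * 1e6 * BELT_THICKNESS_M
mercury_volume_m3 = MERCURY_VOLUME_KM3 * 1e9

fraction_covered = mercury_volume_m3 / belt_volume_m3

print(f"circumference: {circumference_km:.3g} km")
print(f"belt area:     {belt_area_km2:.3g} km^2")
print(f"belt volume:   {belt_volume_m3:.3g} m^3")
print(f"fraction of belt Mercury can supply: {fraction_covered:.2f}")
```

The fraction is inversely proportional to the assumed thickness, so a 20 cm belt would be enough for Mercury's material to cover the whole band.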

To put things in perspective, Mars is kinda maybe almost habitable with a lot of effort and gets less than 1/2 of the sun's radiation. I would make a wild guess that with 80% of the solar radiation we could scrape by with immense casualties due to massive decreases in agricultural yield. Temperature is somewhat tractable due to our ability to pump a bunch of sulfur hexafluoride into the atmosphere to heat things up.

As a caveat, I would suggest that if the AI is "nice" enough to spare Earth, it's likely to be nice enough to beam some reconstituted sunlight over to us. A priori I would say the niceness window for "unwilling to murder us while we're on Earth and pose a direct threat, but unwilling to suffer the trivial cost of keeping the lights on" is extremely narrow.