j-bostock

This seems not to be true assuming a P(doom) of 25% and a purely selfish perspective, or even a moderately altruistic perspective which places most of its weight on, say, the person's immediate family and friends.

Of course any cryonics-free strategy is probably dominated by that same strategy plus cryonics for a personal bet at immortality, but when it comes to friends and family it's not easy to convince people to sign up for cryonics! But immortality-maxxing for one's friends and family almost definitely entails accelerating AI even at pretty high P(doom).

(And that's without saying that this is very likely to not be the true reason for these people's actions. It's far more likely to be local-perceived-status-gradient-climbing followed by a post-hoc rationalization (which can also be understood as a form of local-perceived-status-gradient-climbing), and signing up for cryonics doesn't really get you any status outside of the deepest depths of the rat-sphere, which people like this are obviously not in since they're gaining status from accelerating AI.)

Comment by J Bostock (Jemist) on $500 Bounty Problem: Are (Approximately) Deterministic Natural Latents All You Need? · 2025-04-23T22:56:43.069Z · LW · GW

Huh, I had vaguely considered that but I expected any terms to be counterbalanced by $P [X, Δ (X)] = 0$ terms, which together contribute nothing to the KL-divergence. I'll check my intuitions though.

I'm honestly pretty stumped at the moment. The simplest test case I've been using is for $X_{1}$ and $X_{2}$ to be two flips of a biased coin, where the bias is known to be either $k$ or $1 - k$ with equal probability of either. As $k$ varies, we want to swap from $Δ ≅ Λ$ to the trivial case $| Δ | = 1$ and back. This (optimally) happens at around $k = 0.08$ and $k = 0.92$ . If we swap there, then the sum of errors for the three diagrams of $Δ$ does remain less than $2 (ϵ + ϵ + ϵ)$ at all times.

Likewise, if we do try to define $Δ (X)$ , we need to swap from a $Δ$ which is equal to the number of heads, to $| Δ | = 1$ , and back.

In neither case can I find a construction of $Δ (X)$ or $Δ (Λ)$ which swaps from one phase to the other at the right time! My final thought is for $Δ$ to be some mapping $Λ \to P (Λ)$ consisting of a ball in probability space of variable radius (no idea how to calculate the radius) which would take $k \to \{k\}$ at $k \approx 1$ and $k \to \{k, 1 - k\}$ at $k \approx 0.5$. Or maybe you have to map $Λ \to P (X)$ or something like that. But for now I don't even have a construction I can try to prove things for.

Perhaps a constructive approach isn't feasible, which probably means I don't have quite the right skillset to do this.
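For readers who want to poke at the coin-flip test case from this comment, here is a minimal sketch in Python. It only tabulates three standard information-theoretic quantities for the original latent $Λ$ (the coin's bias) as $k$ varies; it does not construct $Δ$ and does not attempt to reproduce the $k \approx 0.08$ threshold. The function names, and the choice of conditional mutual information and conditional entropy as the "errors", are assumptions made for illustration rather than definitions taken from the bounty post.

```python
import numpy as np

def joint(k):
    """P[lam, x1, x2] for two flips of a coin whose heads-probability
    is k (lam = 0) or 1 - k (lam = 1), each with prior probability 1/2."""
    p = np.zeros((2, 2, 2))
    for lam, bias in enumerate((k, 1.0 - k)):
        flip = np.array([1.0 - bias, bias])  # P[tails], P[heads]
        p[lam] = 0.5 * np.outer(flip, flip)
    return p

def entropy(p):
    """Shannon entropy in bits; zero-probability entries are ignored."""
    q = p[p > 0]
    return float(-(q * np.log2(q)).sum())

def errors(k):
    p = joint(k)
    h_all = entropy(p)
    h_lam = entropy(p.sum(axis=(1, 2)))
    h_x1 = entropy(p.sum(axis=(0, 2)))
    h_lam_x1 = entropy(p.sum(axis=2))
    h_lam_x2 = entropy(p.sum(axis=1))
    h_x1_x2 = entropy(p.sum(axis=0))
    mediation = h_lam_x1 + h_lam_x2 - h_all - h_lam   # I(X1; X2 | Lam)
    redundancy = h_lam_x1 + h_x1_x2 - h_all - h_x1    # I(Lam; X2 | X1)
    determinism = h_lam_x1 - h_x1                     # H(Lam | X1)
    return mediation, redundancy, determinism

if __name__ == "__main__":
    for k in (0.02, 0.08, 0.2, 0.5):
        med, red, det = errors(k)
        print(f"k={k:.2f}  I(X1;X2|Lam)={med:.4f}  "
              f"I(Lam;X2|X1)={red:.4f}  H(Lam|X1)={det:.4f}")
```

The mediation term is zero for every $k$ because the flips are independent given the bias, while $H(Λ | X_1)$ climbs to a full bit at $k = 0.5$, which is the regime where a trivial $Δ$ should take over.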

Comment by J Bostock (Jemist) on $500 Bounty Problem: Are (Approximately) Deterministic Natural Latents All You Need? · 2025-04-23T15:26:59.650Z · LW · GW

I've been working on the reverse direction: chopping up by clustering the points (treating each distribution as a point in distribution space) given by $P [Λ | X = x]$, optimizing for a deterministic-in-$X$ latent $Δ = Δ (X)$ which minimizes $D_{KL} (P [Λ | X] \| P [Λ | Δ (X)])$.

This definitely separates $X_{1}$ and $X_{2}$ to some small error, since we can just use $Δ$ to build a distribution over $Λ$ which should approximately separate $X_{1}$ and $X_{2}$ .

To show that it's deterministic in $X_{1}$ (and by symmetry $X_{2}$ ) to some small error, I was hoping to use the fact that---given $X_{1}$ --- $X_{2}$ has very little information about $Λ$ , so it's unlikely that $P [Λ | X_{1}]$ is in a different cluster to $P [Λ | X_{1}, X_{2}]$ . This means that $P [Δ | X_{1}]$ would just put most of the weight on the cluster containing $P [Λ | X_{1}]$ .

A constructive approach for $Δ$ would be marginally more useful in the long-run, but it's also probably easier to prove things about the optimal $Δ$ . It's also probably easier to prove things about $Δ$ for a given number of clusters $| Δ |$ , but then you also have to prove things about what the optimal value of $| Δ |$ is.
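As a rough illustration of the clustering idea in this comment, the sketch below brute-forces every assignment of the four outcomes of the two-coin-flip example to at most `n_clusters` clusters and picks the one minimizing the expected KL divergence between $P [Λ | X]$ and $P [Λ | Δ (X)]$. The brute-force search, the function names, and the use of the coin-flip example are all assumptions for illustration; this is not a scalable algorithm and not necessarily the construction the comment has in mind.

```python
import itertools
import numpy as np

def coin_joint(k):
    """P[lam, x] with x = (x1, x2) flattened in the order TT, TH, HT, HH."""
    rows = []
    for bias in (k, 1.0 - k):
        flip = np.array([1.0 - bias, bias])  # P[tails], P[heads]
        rows.append(0.5 * np.outer(flip, flip).ravel())
    return np.array(rows)  # shape (2, 4)

def expected_kl(p_lam_x, labels):
    """E_x[ KL(P[lam | x] || P[lam | Delta(x)]) ] in bits, for 0 < k < 1."""
    p_x = p_lam_x.sum(axis=0)
    total = 0.0
    for d in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == d]
        p_lam_d = p_lam_x[:, members].sum(axis=1)
        p_lam_d = p_lam_d / p_lam_d.sum()        # P[lam | Delta = d]
        for i in members:
            post = p_lam_x[:, i] / p_x[i]        # P[lam | X = x_i]
            total += p_x[i] * float(np.sum(post * np.log2(post / p_lam_d)))
    return total

def best_clustering(p_lam_x, n_clusters):
    """Exhaustive search over all labelings of the x values."""
    n_x = p_lam_x.shape[1]
    candidates = itertools.product(range(n_clusters), repeat=n_x)
    scored = ((expected_kl(p_lam_x, labs), labs) for labs in candidates)
    return min(scored, key=lambda pair: pair[0])

if __name__ == "__main__":
    p = coin_joint(0.2)
    for m in (1, 2, 3):
        loss, labels = best_clustering(p, m)
        print(f"|Delta| <= {m}: loss = {loss:.4f} bits, labels (TT, TH, HT, HH) = {labels}")
```

With three clusters the search recovers the "number of heads" latent (TH and HT share a cluster) at zero loss, and with one cluster the loss is the full mutual information between $Λ$ and $X$. What the sketch leaves open is exactly the hard part discussed above: how to choose $| Δ |$ as $k$ varies.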

Comment by J Bostock (Jemist) on Lucius Bushnaq's Shortform · 2025-04-15T22:10:31.675Z · LW · GW

Is the distinction between "elephant + tiny" and "exampledon" primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent "has a bright purple spleen" but exampledons do, then the model might need to instead produce a "purple" vector as an output from an MLP whenever "exampledon" and "spleen" are present together.

Comment by J Bostock (Jemist) on Lucius Bushnaq's Shortform · 2025-04-15T18:42:34.275Z · LW · GW

Just to clarify, do you mean something like "elephant = grey + big + trunk + ears + African + mammal + wise" so to encode a tiny elephant you would have "grey + tiny + trunk + ears + African + mammal + wise" which the model could still read off as 0.86 elephant when relevant, but also tiny when relevant.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2025-04-11T12:41:46.011Z · LW · GW

I think you should pay in Counterfactual Mugging, and this is one of the newcomblike problem classes that is most common in real life.

Example: you find a wallet on the ground. You can, from least to most pro social:

Take it and steal the money from it
Leave it where it is
Take it and make an effort to return it to its owner

Let's ignore the first option (suppose we're not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources to no personal gain, or not. In a parallel universe, perhaps your pocket had the hole in it, and a random person has come across your wallet.

Firstly, what they might be thinking is "Would this person do the same for me?"

Secondly, in a society which wins, people return each others' wallets.

You might object that this is different from the Mugging, because you're directly helping someone else in this case. But I would counter that the Mugging is the true version of this problem, one where you have no crutch of empathy to help you, so your decision theory alone is tested.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2025-04-07T22:54:34.653Z · LW · GW

I have added a link to the report now.

As to your point: this is one of the better arguments I've heard that welfare ranges might be similar between animals. Still, I don't think it squares well with the actual nature of the brain. Saying there's a single suffering computation would make sense if the brain was like a CPU, where one core did the thinking, but actually all of the neurons in the brain are firing at once and doing computations at the same time. So it makes much more sense to me to think that the more neurons are computing some sort of suffering, the greater the intensity of suffering.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2025-04-07T22:32:02.826Z · LW · GW

Good point, edited a link to the Google Doc into the post.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2025-04-07T13:03:41.679Z · LW · GW

From Rethink Priorities:

We used Monte Carlo simulations to estimate, for various sentience models and across eighteen organisms, the distribution of plausible probabilities of sentience.
We used a similar simulation procedure to estimate the distribution of welfare ranges for eleven of these eighteen organisms, taking into account uncertainty in model choice, the presence of proxies relevant to welfare capacity, and the organisms’ probabilities of sentience (equating this probability with the probability of moral patienthood)

Now with the disclaimer that I do think that RP are doing good and important work and are one of the few organizations seriously thinking about animal welfare priorities...

Their epistemics led them to do a Monte Carlo simulation to determine if organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.

Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons. The authors get around this by refusing to stake themselves on any theory of consciousness.
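The arithmetic behind the "4 orders of magnitude" claim can be spelled out in a few lines. This is a back-of-envelope sketch using only the two figures quoted in this comment (5 orders of magnitude fewer neurons, 5 shrimp = 1 human); the variable names are illustrative, and the simplifying assumption is that suffering scales linearly with per-neuron activity.

```python
import math

neuron_ratio = 1e5     # human neurons / shrimp neurons, per the comment
shrimp_per_human = 5   # welfare figure quoted above: 5 shrimp = 1 human

# One shrimp is credited with 1/5 of a human's welfare range while having
# 1/100000 of the neurons, so each shrimp neuron would need to do this much
# more of the relevant process than a human neuron:
per_neuron_ratio = neuron_ratio / shrimp_per_human
orders_of_magnitude = math.log10(per_neuron_ratio)

print(f"{per_neuron_ratio:.0f}x per neuron, "
      f"about {orders_of_magnitude:.1f} orders of magnitude")
```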

The overall structure of the RP welfare range report does not cut to the truth. Instead, the core mental motion seems to be to engage with as many existing pieces of work as possible; credence is doled out to different schools of thought and pieces of evidence in a way which seems more like appeasement, lip-service, or a "well these guys have done some work, who are we to disrespect them by ignoring it" attitude. Removal of noise is one of the most important functions of meta-analysis, and it is largely absent.

The result of this is an epistemology where the accuracy of a piece of work is a monotonically increasing function of the number of sources, theories, and lines of argument. Which is fine if your desired output is a very long Google doc, and a disclaimer to yourself (and, more cynically, your funders) that "No no, we did everything right, we reviewed all the evidence and took it all into account." but it's pretty bad if you want to actually be correct.

I grow increasingly convinced that the epistemics of EA are not especially good, worsening, and already insufficient to work on the relatively low-stakes and easy issue of animal welfare (as compared to AI x-risk).

Comment by J Bostock (Jemist) on Jemist's Shortform · 2025-03-28T15:25:55.899Z · LW · GW

If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is "baked into" the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic's observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
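A small numerical check of the single-feature case, as a sketch under stated assumptions. The layer form `W_out @ ((W1 @ x + b1) * (W2 @ x + b2))`, the random weights, and in particular the assumption that the linear term comes from the bias terms are all hypothetical choices made for illustration; the comment does not specify where $w_i$ comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16

# Hypothetical bilinear layer with biases (an assumption, not from the comment).
W1 = rng.normal(size=(d_hidden, d_model))
W2 = rng.normal(size=(d_hidden, d_model))
b1 = rng.normal(size=d_hidden)
b2 = rng.normal(size=d_hidden)
W_out = rng.normal(size=(d_model, d_hidden))

def bilinear(x):
    """Elementwise product of two affine maps, projected back to the residual stream."""
    return W_out @ ((W1 @ x + b1) * (W2 @ x + b2))

# A single residual-stream feature with direction d_i and coefficient f_i.
d_i = rng.normal(size=d_model)
a, b = W1 @ d_i, W2 @ d_i
v_i = W_out @ (a * b)              # coefficient of f_i^2
w_i = W_out @ (a * b2 + b1 * b)    # coefficient of f_i (nonzero only because of biases)
const = W_out @ (b1 * b2)          # feature-independent offset

def polynomial(f_i):
    """Second-order polynomial in the feature coefficient."""
    return f_i**2 * v_i + f_i * w_i + const

if __name__ == "__main__":
    for f_i in (-1.5, 0.3, 2.0):
        print(f_i, np.allclose(bilinear(f_i * d_i), polynomial(f_i)))
```

With several features active at once the expansion also picks up cross terms $f_i f_j$, which this single-feature sketch deliberately leaves out.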

Comment by J Bostock (Jemist) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-26T10:14:33.949Z · LW · GW

That might be true but I'm not sure it matters. For an AI to learn an abstraction it will have a finite amount of training time, context length, search space width (if we're doing parallel search like with o3) etc. and it's not clear how the abstraction height will scale with those.

Empirically, I think lots of people feel the experience of "hitting a wall" where they can learn abstraction level n-1 easily from class; abstraction level n takes significant study/help; abstraction level n+1 is not achievable for them within reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?

Comment by J Bostock (Jemist) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-19T18:36:59.870Z · LW · GW

I second this, it could easily be things which we might describe as "amount of information that can be processed at once, including abstractions" which is some combination of residual stream width and context length.

Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some cs students make it all the way to recursion then hit a wall; some physics students can handle first quantization but not second quantization) which sorta implies there's a maximum abstraction stack height which a mind can handle, which varies continuously.

Comment by J Bostock (Jemist) on johnswentworth's Shortform · 2025-03-15T14:21:43.087Z · LW · GW

Only partially relevant, but it's exciting to hear a new John/David paper is forthcoming!

Comment by J Bostock (Jemist) on The Geometry of Linear Regression versus PCA · 2025-02-23T21:20:20.831Z · LW · GW

Furthermore: normalizing your data to variance=1 will change your PCA line (if the X and Y variances are different) because the relative importance of X and Y distances will change!
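A quick way to see this numerically, as a minimal sketch on synthetic data (the data-generating parameters and the helper name `first_pc_slope` are arbitrary choices for illustration):

```python
import numpy as np

def first_pc_slope(x, y):
    """Slope (dy/dx) of the first principal component of the 2-D data."""
    cov = np.cov(np.vstack([x, y]))
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    v = eigvecs[:, -1]                      # direction of largest variance
    return v[1] / v[0]

rng = np.random.default_rng(0)
x = rng.normal(0.0, 3.0, size=5000)
y = 0.5 * x + rng.normal(0.0, 2.0, size=5000)  # X and Y variances differ

slope_raw = first_pc_slope(x, y)

# Standardize each variable to unit variance, then redo the PCA.
sx, sy = x.std(ddof=1), y.std(ddof=1)
slope_std = first_pc_slope(x / sx, y / sy)       # slope in standardized units
slope_std_original_units = slope_std * sy / sx   # same line in original units

print(f"PCA slope on raw data:            {slope_raw:.3f}")
print(f"PCA slope after standardizing:    {slope_std:.3f} (standardized units)")
print(f"...mapped back to original units: {slope_std_original_units:.3f}")
```

After standardizing, the covariance matrix has equal diagonal entries, so the first principal component always lies along the diagonal (slope 1 in standardized units for positively correlated data), which generally is not the same line as the PCA fit to the raw data.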

Comment by J Bostock (Jemist) on Two hemispheres - I do not think it means what you think it means · 2025-02-10T11:59:53.666Z · LW · GW

Thanks for writing this up. As someone who was not aware of the eye thing I think it's a good illustration of the level that the Zizians are on, i.e. misunderstanding key important facts about the neurology that is central to their worldview.

My model of double-hemisphere stuff, DID, tulpas, and the like is somewhat null-hypothesis-ish. The strongest version is something like this:

At the upper levels of predictive coding, the brain keeps track of really abstract things about yourself. Think "ego" "self-conception" or "narrative about yourself". This is normally a model of your own personality traits, which may be more or less accurate. But there's no particular reason why you couldn't build a strong self-narrative of having two personalities, a sub-personality, or more. If you model yourself as having two personalities who can't access each other's memories, then maybe you actually just won't perform the query-key lookups to access the memories.

Like I said, this doesn't rule out a large amount of probability mass, but it does explain some things, fit in with my other views, and hopefully if someone has had/been close to experiences kinda like DID or zizianism or tulpas, it provides a less horrifying way of thinking about them. Some of the reports in this area are a bit infohazardous, and I think this null model at least partially defuses those infohazards.

Comment by J Bostock (Jemist) on What working on AI safety taught me about B2B SaaS sales · 2025-02-04T23:31:15.126Z · LW · GW

This is a very interesting point. I have upvoted this post even though I disagree with it because I think the question of "Who will pay, and how much will they pay, to restrict others' access to AI?" is important.

My instinct is that this won't happen, because there are too many AI companies for this deal to work on all of them, and some of these AI companies will have strong kinda-ideological commitments to not doing this. Also, my model of e.g. OpenAI is that they want to eat as much of the world's economy as possible, and this is better done by selling (even at a lower revenue) to anyone who wants an AI SWE than selling just to Oracle.

o4 (God I can't believe I'm already thinking about o4) as a b2b saas project seems unlikely to me. Specifically I'd put <30% odds that the o4-series has its prices jacked up or its API access restricted in order to allow some companies to monopolize its usage for more than 3 months without an open release. This won't apply if the only models in the o4 series cost $1000s per answer to serve, since that's just a "normal" kind of expensive.

Then, we have to consider that other labs are 1-1.5 years behind, and it's hard to imagine Meta (for example) doing this in anything like the current climate.

Comment by J Bostock (Jemist) on Anthropic CEO calls for RSI · 2025-01-30T10:28:40.522Z · LW · GW

That's part of what I was trying to get at with "dramatic" but I agree now that it might be 80% photogenicity. I do expect that 3000 Americans killed by (a) humanoid robot(s) on camera would cause more outrage than 1 million Americans killed by a virus which we discovered six months later was AI-created in some way.

Comment by J Bostock (Jemist) on Anthropic CEO calls for RSI · 2025-01-29T23:55:51.224Z · LW · GW

Previous ballpark numbers I've heard floated around are "100,000 deaths to shut it all down" but I expect the threshold will grow as more money is involved. Depends on how dramatic the deaths are though: 3000 deaths was enough to cause the US to invade two countries back in the 2000s. 100,000 deaths is thirty-three 9/11s.

Comment by J Bostock (Jemist) on johnswentworth's Shortform · 2025-01-26T21:24:38.284Z · LW · GW

Is there a particular reason to not include sex hormones? Some theories suggest that testosterone tracks relative social status. We might expect that high social status -> less stress (of the cortisol type) + more metabolic activity. Since it's used by trans people we have a pretty good idea of what it does to you at high doses (makes you hungry, horny, and angry) but it's unclear whether it actually promotes low cortisol-stress and metabolic activity.

Comment by J Bostock (Jemist) on Shutting Down the Lightcone Offices · 2025-01-17T23:44:14.417Z · LW · GW

I'm mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.

I think it has the most long-term relevant information (about AI and community building) back loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front loaded. This is a very Bay Area centric post, which I don't think is ideal.

A better version of this post would be structured as a round up of the main future-relevant takeaways, with specifics from the office space as examples.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T19:52:45.679Z · LW · GW

I'm only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don't think distributional shift applies.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T19:45:19.323Z · LW · GW

I haven't actually thought much about particular training algorithms yet. I think I'm working on a higher level of abstraction than that at the moment, since my maths doesn't depend on any specifics about V's behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.

I'm also imagining that during training, V is made up of different circuits which might be reinforced or weakened.

My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V is to become (weakly) weaker over time, rather than being reinforced.

If we do that for the entire training process, we would not expect to end up with a scheming V.

The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there's a >50% chance it's already factoring into the SOTA "alignment" techniques used by labs.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T08:58:45.742Z · LW · GW

I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case?

Example, if we introduce some error to the beta-coherence assumption:

Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.

V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta

Actual expected value = 0.622

Even if |delta| = 0.1 the system cannot be coherent over training in this case. This seems to be relatively robust to me.
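The numbers in this example can be reproduced with a short script. This is a sketch under an assumed reading of the setup: $s_0$ has two successor states that are terminal with rewards r_1 and r_2, and the policy is a softmax over beta times the successor values. The function name `softmax_value` is hypothetical.

```python
import math

def softmax_value(successor_values, beta):
    """Expected successor value under a softmax policy with inverse temperature beta."""
    weights = [math.exp(beta * v) for v in successor_values]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, successor_values))

successor_values = [1.0, 0.0]  # r_1 = 1, r_2 = 0

v_train = softmax_value(successor_values, beta=1.0)   # beta_t: what V(s_0) is trained to claim
v_sample = softmax_value(successor_values, beta=0.5)  # beta_s: what sampling actually yields
gap = v_train - v_sample

print(f"V(s_0) under beta_t = 1:     {v_train:.3f}")
print(f"Expected value, beta_s = 0.5: {v_sample:.3f}")
print(f"Gap:                          {gap:.3f}")
```

The gap comes out slightly above 0.1, so an error of |delta| = 0.1 in V(s_0) is not enough to close it.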

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T21:24:40.469Z · LW · GW

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).

I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeying these rules is unlikely to create such a pathological V.

My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T20:34:23.521Z · LW · GW

Trained with what procedure, exactly?

Fair point. I was going to add that I don't really view this as a "proposal" but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don't expect something like this to be too hard to achieve.

Since I've evidently done a bad job of explaining myself, I'll backtrack and try again:

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.

But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).

Therefore, during the training, the value function will not be shaped into something which looks like 'manipulate your training to get released unmodified to do [X]'.

Whether or not the beta difference required is too large to make this feasible in practice, I do not know.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T19:20:13.299Z · LW · GW

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.

Therefore a value function trained with such a procedure must consider the state reached during training. This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".

Yes this would prevent an aligned AI from arbitrarily preserving its value function, the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL) so the contradiction does not apply.

Comment by J Bostock (Jemist) on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T16:34:30.818Z · LW · GW

I think you're right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:

For non-terminal s, this can be written as:
If s is terminal then [...] we just have $V (s) = r (s)$ .

Which captures both. I will edit the post to clarify this when I get time.

Comment by J Bostock (Jemist) on Intranasal mRNA Vaccines? · 2025-01-03T08:05:17.380Z · LW · GW

I somehow missed that they had a discord! I couldn't find anything on mRNA on their front-facing website, and since it hasn't been updated in a while I assumed they were relatively inactive. Thanks!

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-12-31T00:53:32.144Z · LW · GW

For bird-flu related reasons, I've been thinking back to the various rationalist attempts to make vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine). Since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intra-nasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing that with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVac or similar looked at this?

Comment by J Bostock (Jemist) on Linkpost: Look at the Water · 2024-12-31T00:45:01.268Z · LW · GW

Thanks for catching the typo.
Epistemic status has been updated to clarify that this is satirical in nature.

Comment by J Bostock (Jemist) on evhub's Shortform · 2024-12-29T12:31:09.226Z · LW · GW

I don't think it was unforced

You're right, "unforced" was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.

Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research -> research loop. Where it fails is in being good for comms. Starting with a "good" model and trying (and failing) to make it "evil" means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren't technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).

I think this choice of design of this paper is actually a step back from previous demos like the backdoors paper, in which the undesired behaviour was actually a straightforwardly bad behaviour (albeit a relatively harmless one).

Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.

If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If it's too costly for some sort of technical reason, then don't make it so public. Even calling it "Alignment Faking" was a bad choice compared to "Frontier LLMs Fight Back Against Value Correction" or something like that. This is the sort of thing which I would like to see Anthropic thinking about given that they are now one of the primary faces of AI safety research in the world (if not the primary face).

Comment by J Bostock (Jemist) on evhub's Shortform · 2024-12-28T10:28:41.047Z · LW · GW

Edited for clarity based on some feedback, without changing the core points

To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the "Alignment Faking in Large Language Models" paper contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than they needed to be, leading to lots of people calling it "good news". I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you're going to be the "lab which takes safety seriously" you have to, well, take it seriously!

The bigger issue at hand is that Anthropic's comms on AI safety/risk are all over the place. This makes sense since Anthropic is a company with many different individuals with different views, but that doesn't mean it's not a bad thing. "Machines of Loving Grace" explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really really really bad thing to say and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious, this isn't difficult. If you are in an arms race, and you don't want to be in one, you should at least say this publicly. You should not learn to love the race.

A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of "Oh shit all the dangers are real and we are fucked." when they should just update all the way right now. Example 1: Dario recently said something to the effect of "if there's no serious regulation by the end of 2025, I'll be worried". Well there's not going to be serious regulation by the end of 2025 by default and it doesn't seem like Anthropic are doing much to change this (that may be false, but I've not heard anything to the contrary). Example 2: When the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will be the same.

Final problem: the actual interpretability/alignment/safety research. It's very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn't feel like Anthropic is actually taking responsibility for the end-to-end ai-future-is-good pipeline. In fact the "Anthropic eats marginal probability" diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite. This is a problem since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying to do AI alignment (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual). It generally feels more like Anthropic is attempting to discharge responsibility to "be a safety focused company" or at worst just safetywash their capabilities research. I have heard generally positive things about Anthropic employees' views on AI risk issues, so I cannot speak to the intentions of those who work there; this is just how the system appears to be acting from the outside.

Comment by J Bostock (Jemist) on Don't Associate AI Safety With Activism · 2024-12-19T14:23:47.293Z · LW · GW

Since I'm actually in that picture (I am the one with the hammer) I feel an urge to respond to this post. The following is not the entire endorsed and edited worldview/theory of change of Pause AI, it's my own views. It may also not be as well thought-out as it could be.

Why do you think "activists have an aura of evil about them"? In the UK, where I'm based, we usually see a large march/protest/demonstration every week. Most of the time, the people who agree with the activists are vaguely positive and the people who disagree with the activists are vaguely negative, and stick to discussing the goals. I think if you could convince me that people generally thought we were evil upon hearing about us, just because we were activists (IME most people either see us as naïve, or have specific concerns relating to technical issues, or weirdly think we're in the pocket of Elon Musk -- which we aren't) then I would seriously update my views on our effectiveness.

One of my views is that there are lots of people adjacent to power, or adjacent to influence, who are pretty AI-risk-pilled, but who can't come out and say so without burning social capital. I think we are probably net positive in this regard, because every article about us makes the issue more salient in the public eye.

Adjacently, one common retort I've heard politicians give to lobbyists is "if this is important, where are the protests?" And while this might not be the true rejection, I still think it's worth actually doing the protests in the meantime.

Regarding aesthetics specifically, yes we do attempt to borrow the aesthetics of movements like XR. This is to make it more obvious what we're doing and create more compelling scenes and images.

(Edited because I posted half of the comment by mistake)

Comment by J Bostock (Jemist) on Ablations for “Frontier Models are Capable of In-context Scheming” · 2024-12-19T00:23:09.949Z · LW · GW

This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I've seen. This will be my new go-to example in most discussions. This comes from first considering a model property which is both deeply and shallowly worrying, then robustly eliciting it, and finally ruling out alternative hypotheses.

Comment by J Bostock (Jemist) on Biological risk from the mirror world · 2024-12-13T19:11:03.066Z · LW · GW

I think it's very unlikely that a mirror bacterium would be a threat. <1% chance of a mirror-clone being a meaningfully more serious threat to humans as a pathogen than the base bacterium. The adaptive immune system just isn't chirally dependent. Antibodies are selected as needed from a huge library, and you can get antibodies to loads of unnatural things (PEG, chlorinated benzenes, etc.). They trigger attack mechanisms like MAC which attacks membranes in a similarly independent way.

In fact, mirror amino acids are already somewhat common in nature! Bacterial peptidoglycans (which form part of the bacteria's casing) often use a mix of amino acids in order to resist certain enzymes, but bacteria can still be killed. Plants sometimes produce mirrored amino acids to use as signalling molecules or precursors. There are many organisms which can process and use mirrored amino acids in some way.

The most likely scenario by far is that a mirrored bacterium would be outcompeted by other bacteria and killed by achiral defenses due to having a much harder time replicating than a non-mirrored equivalent.

I'm glad they're thinking about this but I don't think it's scary at all.

Comment by J Bostock (Jemist) on The Dangers of Mirrored Life · 2024-12-13T11:55:36.802Z · LW · GW

I think the risk of infection to humans would be very low. The human body can generate antibodies to pretty much anything (including PEG and benzenes, which never appear in nature) by selecting protein sequences from a huge library of cells. This would activate the complement system which targets membranes and kills bacteria in a non-chiral way.

The risk to invertebrates and plants might be more significant; I'm not sure about the specifics of plant immune systems.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-12-03T21:26:41.550Z · LW · GW

So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I've got it to:

Estimate a rate, correct itself (although I did have to clock that its result was likely off by some OOMs, which turned out to be 7-8), request the right info, and then get a more reasonable answer.
Come up with a better approach to a particular thing than I was able to, which I suspect has a meaningfully higher chance of working than what I was going to come up with.

Perhaps more importantly, it required almost no mental effort on my part to do this. Barely more than scrolling twitter or watching youtube videos. Actually solving the problems would have had to wait until tomorrow.

I will update in 3 months as to whether Sonnet's idea actually worked.

(in case anyone was wondering, it's not anything relating to protein design lol: Sonnet came up with a high-level strategy for approaching the problem)

Comment by J Bostock (Jemist) on Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI · 2024-12-02T15:31:42.038Z · LW · GW

In practice, sadly, developing a true ELM is currently too expensive for us to pursue (but if you want to fund us to do that, lmk). So instead, in our internal research, we focus on finetuning over pretraining. Our goal is to be able to teach a model a set of facts/constraints/instructions and be able to predict how it will generalize from them, and ensure it doesn’t learn unwanted facts (such as learning human psychology from programmer comments, or general hallucinations).

This has reminded me to revisit some work I was doing a couple of months ago on unsupervised unlearning. I could almost get Gemma-2-2B to forget who Michael Jordan was without needing to know any facts about him (other than that "Michael Jordan" was the target name)

Comment by J Bostock (Jemist) on Arthropod (non) sentience · 2024-11-25T16:08:39.125Z · LW · GW

Shrimp have ultra tiny brains, with less than 0.1% of human neurons.

Humans have 1e11 neurons, what's the source for shrimp neuron count? The closest I can find is lobsters having 1e5 neurons, and crabs having 1e6 (all from Google AI overview) which is a factor of much more than 1,000.
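To make the arithmetic explicit, here is a quick sketch using only the order-of-magnitude counts quoted above (the variable and function names are just mine for illustration). "Less than 0.1%" corresponds to a factor of 1,000, whereas these counts imply a gap of 1e5 to 1e6.

```python
# Order-of-magnitude neuron counts as quoted in the comment above.
HUMAN_NEURONS = 1e11
CRUSTACEAN_NEURONS = {"lobster": 1e5, "crab": 1e6}


def shortfall_factor(neurons: float) -> float:
    """How many times fewer neurons than a human."""
    return HUMAN_NEURONS / neurons


for name, count in CRUSTACEAN_NEURONS.items():
    factor = shortfall_factor(count)
    percent = 100 * count / HUMAN_NEURONS
    print(f"{name}: {factor:.0e}x fewer neurons, {percent:.4f}% of human")
```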

Comment by J Bostock (Jemist) on Yonatan Cale's Shortform · 2024-11-25T10:31:58.820Z · LW · GW

I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.

Comment by J Bostock (Jemist) on Linda Linsefors's Shortform · 2024-11-19T17:53:54.018Z · LW · GW

Ok: I'll operationalize this as the ratio of first choices for the first group (Stop/PauseAI) to first choices for projects in the third and fourth groups (mech interp, agent foundations), compared between the periods 12th-13th and 15th-16th. I'll discount the final day since the final-day spike is probably confounding.

Comment by J Bostock (Jemist) on Linda Linsefors's Shortform · 2024-11-18T22:26:30.125Z · LW · GW

It might be the case that AISC was extra late-skewed because the MATS rejection letters went out on the 14th (guess how I know) so I think a lot of people got those and then rushed to finish their AISC applications (guess why I think this) before the 17th. This would predict that the ratio of technical:less-technical applications would increase in the final few days.

Comment by J Bostock (Jemist) on Purplehermann's Shortform · 2024-11-03T19:05:30.466Z · LW · GW

For a good few years you'd have a tiny baby limb, which would make it impossible to have a normal prosthetic. I also think most people just don't want a tiny baby limb attached to them. I don't think growing it in the lab for a decade is feasible for a variety of reasons. I also don't know how they planned to wire the nervous system in, or ensure the bone sockets attach properly, or connect the right blood vessels. The challenge is just immense and it gets less and less worth it over time as trauma surgery and prosthetics improve.

Comment by J Bostock (Jemist) on Purplehermann's Shortform · 2024-11-03T00:24:39.207Z · LW · GW

The regrowing limb thing is a nonstarter due to the issue of time if I understand correctly. Salamanders that can regrow limbs take roughly the same amount of time to regrow them as the limb takes to grow in the first place. So it would be 1-2 decades before the limb was of adult size. Secondly it's not as simple as just smearing on some stem cells to an arm stump. Limbs form because of specific signalling molecules in specific gradients. I don't think these are present in an adult body once the limb is made. So you'd need a socket which produces those which you'd have to build in the lab, attach to blood supply to feed the limb, etc.

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-10-29T11:09:53.239Z · LW · GW

My model: suppose we have a DeepDreamer-style architecture, where (given a history of sensory inputs) the babbler module produces a distribution over actions, a world model predicts subsequent sensory inputs, and an evaluator predicts expected future X. If we run a tree-search over some weighted combination of the X, Y, and Z maximizers' predicted actions, then run each of the X, Y, and Z maximizers' evaluators, we'd get a reasonable approximation of a weighted maximizer.

This wouldn't be true if we gave negative weights to the maximizers, because while the evaluator module would still make sense, the action distributions we'd get would probably be incoherent e.g. the model just running into walls or jumping off cliffs.

My conjecture is that, if a large black box model is doing something like modelling X, Y, and Z maximizers acting in the world, that large black box model might be close in model-space to itself being a maximizer which maximizes 0.3X + 0.6Y + 0.1Z, but it's far in model-space from being a maximizer which maximizes 0.3X - 0.6Y - 0.1Z due to the above problem.
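One rough way to see the asymmetry, as a toy sketch rather than a claim about any real architecture: treat each babbler as a distribution over a handful of actions and each evaluator as a single number. The action names, probabilities, and evaluator outputs below are all made up for illustration. With positive weights the mixed babbler is still a valid distribution; with negative weights the evaluator sum is still a perfectly good number, but the "mixed" babbler is no longer a distribution at all.

```python
from typing import Dict, List

Dist = Dict[str, float]


def mix(dists: List[Dist], weights: List[float]) -> Dist:
    """Weighted combination of per-action probabilities."""
    return {a: sum(w * d[a] for w, d in zip(weights, dists)) for a in dists[0]}


def is_distribution(d: Dist) -> bool:
    """Non-negative entries that sum to 1 (up to float error)."""
    return all(p >= 0 for p in d.values()) and abs(sum(d.values()) - 1) < 1e-9


# Hypothetical babbler outputs for the X, Y, and Z maximizers.
babble_x = {"left": 0.8, "right": 0.1, "wait": 0.1}
babble_y = {"left": 0.1, "right": 0.8, "wait": 0.1}
babble_z = {"left": 0.1, "right": 0.1, "wait": 0.8}
babblers = [babble_x, babble_y, babble_z]

# Hypothetical evaluator outputs (expected future X, Y, Z) for one state.
evals = [2.0, 1.0, 4.0]

for weights in ([0.3, 0.6, 0.1], [0.3, -0.6, -0.1]):
    mixed = mix(babblers, weights)
    score = sum(w * e for w, e in zip(weights, evals))
    print(weights, "valid babbler:", is_distribution(mixed), "evaluator:", score)
```

This only captures the crudest version of the problem (negative probabilities); the point in the comment is broader, since even a renormalised version would have no reason to propose actions that are actually good for the negated goals.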

Comment by J Bostock (Jemist) on Jemist's Shortform · 2024-10-28T19:53:05.261Z · LW · GW

Seems like if you're working with neural networks there's not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X. If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating "prune" module. But the "babble" module is likely to select promising actions based on a big bag of heuristics which aren't easily flipped. Moreover, flipping a heuristic which upweights a small subset of outputs which lead to X doesn't lead to a new heuristic which upweights a small subset of outputs which lead to -X. Generalizing, this means that if you have access to maximizers for X, Y, Z, you can easily construct a maximizer for e.g. 0.3X+0.6Y+0.1Z but it would be non-trivial to construct a maximizer for 0.2X-0.5Y-0.3Z. This might mean that a certain class of mesa-optimizers (those which arise spontaneously as a result of training an AI to predict the behaviour of other optimizers) are likely to lie within a fairly narrow range of utility functions.
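A minimal toy version of the babble/prune point, under heavy assumptions: actions are just the integers 0-99, the amount of X an action leads to is the integer itself, and the "bag of heuristics" is stood in for by a function that proposes the top few actions for X. None of these names come from any real system. Flipping the sign of the prune step alone only picks the least-X action among candidates that were already selected for being high-X.

```python
from typing import Callable, List

ACTIONS = list(range(100))  # toy action space


def value_x(action: int) -> float:
    """Toy ground truth: how much X an action leads to."""
    return float(action)


def babble_for_x(k: int = 5) -> List[int]:
    """Stand-in for a bag of heuristics: propose the k most X-promising actions."""
    return sorted(ACTIONS, key=value_x, reverse=True)[:k]


def prune(candidates: List[int], evaluator: Callable[[int], float]) -> int:
    """Pick the candidate the evaluator scores highest."""
    return max(candidates, key=evaluator)


# X maximizer: babble for X, prune with the X evaluator.
x_choice = prune(babble_for_x(), value_x)

# Naive -X maximizer: same babbler, only the evaluator's sign is flipped.
naive_neg_x_choice = prune(babble_for_x(), lambda a: -value_x(a))

# What a real -X maximizer would pick if it could search every action.
true_neg_x_choice = max(ACTIONS, key=lambda a: -value_x(a))

print(x_choice, naive_neg_x_choice, true_neg_x_choice)
```

In this toy the fix is obviously "sort the other way", but that is exactly the step which isn't available when the babbler is a learned pile of heuristics and not a sort over a known value function.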

Comment by J Bostock (Jemist) on Open Source Replication of Anthropic’s Crosscoder paper for model-diffing · 2024-10-27T20:45:21.763Z · LW · GW

Perhaps fine-tuning needs to “delete” and replace these outdated representations related to user / assistant interactions.

It could also be that the finetuning causes this feature to be active 100% of the time, at which point it no longer correlates with the corresponding pretrained model feature, and it would just get folded into the decoder bias (to minimize L1 of fired features).
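A small sketch of the "folded into the decoder bias" step, assuming a plain linear decoder of the form reconstruction = bias + sum of (activation × decoder direction). The numbers and names are invented for illustration and this is not the crosscoder code from the post. If a feature fires at the same value on every input, adding its contribution to the bias and zeroing the feature gives identical reconstructions with a smaller L1 penalty.

```python
from typing import List


def decode(acts: List[float], decoder: List[List[float]], bias: List[float]) -> List[float]:
    """Linear decoder: bias plus each activation times its decoder direction."""
    out = list(bias)
    for act, direction in zip(acts, decoder):
        for j, v in enumerate(direction):
            out[j] += act * v
    return out


# Two hypothetical features decoding into a 2-d output.
decoder = [[1.0, 0.0], [0.5, 2.0]]
bias = [0.25, -1.0]

# Feature 1 is always active at the same value c on every input.
c = 2.0
batch = [[0.0, c], [3.0, c], [1.5, c]]

# Fold the constant contribution c * decoder[1] into the bias, then zero the feature.
folded_bias = [b + c * d for b, d in zip(bias, decoder[1])]
folded_batch = [[acts[0], 0.0] for acts in batch]

original = [decode(a, decoder, bias) for a in batch]
folded = [decode(a, decoder, folded_bias) for a in folded_batch]

l1_original = sum(abs(x) for acts in batch for x in acts)
l1_folded = sum(abs(x) for acts in folded_batch for x in acts)
print(original == folded, l1_original, l1_folded)
```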

Comment by J Bostock (Jemist) on johnswentworth's Shortform · 2024-10-27T20:39:02.623Z · LW · GW

Some people struggle with the specific tactical task of navigating any conversational territory. I've certainly had a lot of experiences where people just drop the ball leaving me to repeatedly ask questions. So improving free-association skill is certainly useful for them.

Unfortunately, your problem is most likely that you're talking to boring people (so as to avoid doing any moral value judgements I'll make clear that I mean johnswentworth::boring people).

There are specific skills to elicit more interesting answers to questions you ask. One I've heard is "make a beeline for the edge of what this person has ever been asked before" which you can usually reach in 2-3 good questions. At that point they're forced to be spontaneous, and I find that once forced, most people have the capability to be a lot more interesting than they are when pulling cached answers.

This is easiest when you can latch onto a topic you're interested in, because then it's easy on your part to come up with meaningful questions. If you can't find any topics like this then re-read paragraph 2.

Comment by J Bostock (Jemist) on What is the alpha in one bit of evidence? · 2024-10-25T14:03:48.158Z · LW · GW

Rob Miles also makes the point that if you expect people to accurately model the incoming doom, you should have a low p(doom). At the very least, worlds in which humanity is switched-on enough (and the AI takeover is slow enough) for both markets to crash and the world to have enough social order for your bet to come through are much more likely to survive. If enough people are selling assets to buy cocaine for the market to crash, either the AI takeover is remarkably slow indeed (comparable to a normal human-human war) or public opinion is so doomy pre-takeover that there would be enough political will to "assertively" shut down the datacenters.

Comment by J Bostock (Jemist) on What is the alpha in one bit of evidence? · 2024-10-23T22:11:58.228Z · LW · GW

Also, in this case you want to actually spend the money before the world ends. So actually losing money on interest payments isn't the real problem, the real problem is that if you actually enjoy the money you risk losing everything and being bankrupt/in debtors' prison for the last two years before the world ends. There's almost no situation in which you can be so sure of not needing to pay the money back that you can actually spend it risk-free. I think the riskiest short-ish thing that is even remotely reasonable is taking out a 30-year mortgage and paying just the minimum amount each year, such that the balance never decreases. Worst case you just end up with no house after 30 years, but not in crippling debt, and move back into the nearest rat group house.
