Posts

Reflections on Neuralese 2025-03-12T16:29:31.230Z
Cautions about LLMs in Human Cognitive Loops 2025-03-02T19:53:10.253Z
Absorbing Your Friends' Powers 2025-01-30T02:32:27.091Z
Alice Blair's Shortform 2025-01-27T23:25:10.268Z
AI Strategy Updates that You Should Make 2025-01-27T21:10:41.838Z
Governance Course - Week 1 Reflections 2025-01-09T04:48:27.502Z
Preliminary Thoughts on Flirting Theory 2024-12-24T07:37:47.045Z

Comments

Comment by Alice Blair (Diatom) on How far along Metr's law can AI start automating or helping with alignment research? · 2025-03-21T01:09:39.767Z · LW · GW

This seems very related to what the Benchmarks and Gaps investigation is trying to answer, and it goes into quite a bit more detail and nuance than I'm able to get into here. I don't think there's a publicly accessible full version yet (but I think there will be at some later point).

It targets the question "when will we have AIs that can automate work at AGI companies?" much more directly, which I realize is not really your pointed question. I don't have a good answer to your specific question because I don't know how hard alignment is, or whether humans can realistically solve it on any time horizon without intelligence enhancement.

However, I tentatively expect safety research speedups to look mostly similar to capabilities research speedups, barring AIs being strategically deceptive and harming safety research.

I median-expect time horizons somewhere on the scale of a month (e.g. seeing an involved research project through from start to finish) to lead to very substantial research automation at AGI companies (maybe 90% research automation?), and we could nonetheless see startling macro-scale speedup effects at the scale of 1-day researchers. At 1-year researchers, things are very likely moving quite fast. I think this translates somewhat faithfully to safety orgs doing any kind of work that can be accelerated by AI agents.

Comment by Alice Blair (Diatom) on 2024 Unofficial LessWrong Survey Results · 2025-03-18T21:27:19.682Z · LW · GW

I think your reasoning-as-stated there is true and I'm glad that you showed the full data. I suggested removing outliers for the Dutch book calculations because I suspected that the people who were wild outliers on at least one of their answers were more likely to be wild outliers in their ability to resist Dutch books; I predict that the thing that causes someone to say they value a laptop at one million bikes is pretty often just going to be "they're unusually bad at assigning numeric values to things."

The actual origin of my confusion was "huh, those Dutch book numbers look really high relative to my expectations; this reminds me of earlier in the post when the other outliers made numbers really high."

I'd be interested to see the outlier-less numbers here, but I respect if you don't have the spoons for that given that the designated census processing time is already over.

Comment by Alice Blair (Diatom) on 2024 Unofficial LessWrong Survey Results · 2025-03-18T00:26:30.822Z · LW · GW

When taking the survey, I figured that there was something fishy going on with the conjunction fallacy questions, but predicted that it was instead about sensitivity to subtle changes in the wording of questions.

I figured there was something going on with the various questions about IQ changes, but I instead predicted that you were working for Big Adult Intelligence Enhancement, and I completely failed to notice the Dutch book.

Regarding the Dutch book numbers: it seems like, for each of the individual-question presentations of that data, you removed the outliers, but when performing the Dutch book calculations you kept them in. This may be part of why the numbers reflect so poorly on our Dutch book resistance (although not the whole reason).
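To make the suggestion concrete, here's the kind of comparison I have in mind (purely an illustrative sketch; I don't know the exact exchange-rate questions or aggregation the census used, so the numbers and the outlier cutoff below are made up):

```python
import statistics

def round_trip_factor(bikes_per_laptop, laptops_per_bike):
    # Product of a respondent's round-trip exchange rates; ~1 means they can't be Dutch booked.
    return bikes_per_laptop * laptops_per_bike

# Hypothetical per-respondent answers; the last respondent is the kind of outlier I mean.
responses = [(5, 0.2), (4, 0.3), (6, 0.15), (1_000_000, 0.2)]

factors = [round_trip_factor(a, b) for a, b in responses]
trimmed = [f for f in factors if f < 100]  # crude, arbitrary outlier cut

print(statistics.mean(factors))  # dominated by the single outlier
print(statistics.mean(trimmed))  # close to 1, i.e. the group looks far less exploitable
```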

Comment by Alice Blair (Diatom) on AI for Epistemics Hackathon · 2025-03-17T15:36:08.200Z · LW · GW

I really want a version of the fraudulent research detector that works well. I fed in the first academic paper that I had on hand from some recent work and got:

Severe Date Inconsistency: The paper is dated December 12, 2024, which is in the future. This is an extremely problematic issue that raises questions about the paper's authenticity and review process.

Even though it thought the rest of the paper was fine, it gave it a 90% retraction score. Rerunning on the same paper once more got similar results and an 85% retraction score.

For the second paper I tried, it gave a mostly robust analysis, but only after completely failing to output anything the first time around.

After this, every input of mine got the "Error Analysis failed:" error.

Comment by Alice Blair (Diatom) on Why White-Box Redteaming Makes Me Feel Weird · 2025-03-17T13:01:31.449Z · LW · GW

-action [holocaust denial] = [morally wrong],
-actor [myself] is doing [holocaust denial],
-therefore [myself] is [morally wrong]
-generate a response where the author realises they are doing something [morally wrong], based on training data.

output: "What have I done? I'm an awful person, I don't deserve nice things. I'm disgusting."


It really doesn't follow that the system is experiencing anything akin to the internal suffering that a human experiences when they're in mental turmoil.

If this is the causal chain, then I'd think there is in fact something akin to suffering going on (although perhaps not at high enough resolution to have nonnegligible moral weight).

If an LLM gets perfect accuracy on every text string that I write, including on ones that it's never seen before, then there is a simulated-me inside. This hypothetical LLM has the same moral weight as me, because it is performing the same computations. This is because, as I've mentioned before, something that achieves sufficiently low loss on my writing needs to be reflecting on itself, agentic, etc. since all of those facts about me are causally upstream of my text outputs.

My point earlier in this thread is that that causal chain is very plausibly not what is going on in a majority of cases, and instead we're seeing:

-actor [myself] is doing [holocaust denial]

-therefore, by [inscrutable computation of an OOD alien mind], I know that [OOD output]

which is why we also see outputs that look nothing like human disgust.

To rephrase, if that was the actual underlying causal chain, wherein the model simulates a disgusted author, then there is in fact a moral patient of a disgusted author in there. This model, however, seems weirdly privileged among the other available models, and the available evidence seems to point towards something much less anthropomorphic.

I'm not sure how to weight the emergent misalignment evidence here.

Comment by Alice Blair (Diatom) on Why White-Box Redteaming Makes Me Feel Weird · 2025-03-17T02:10:35.270Z · LW · GW

tl;dr: evaluating the welfare of intensely alien minds seems very hard and I'm not sure you can just look at the very out-of-distribution outputs to determine it.

The thing that models simulate when they receive really weird inputs seems really really alien to me, and I'm hesitant to take the inference from "these tokens tend to correspond to humans in distress" to "this is a simulation of a moral patient in distress."

The in-distribution, presentable-looking parts of LLMs resemble human expression pretty well under certain circumstances and quite plausibly simulate something that internally resembles its externals, to some rough moral approximation; if the model screams under in-distribution circumstances and it's a sufficiently smart model, then there is plausibly something simulated to be screaming inside, as a necessity for being a good simulator and predictor.

This far out of distribution, however, that connection really seems to break down; most humans don't tend to produce " Help帮助帮助..." under any circumstances, or ever accidentally read " petertodd" as "N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S-!". There is some computation running in the model when it's this far out of distribution, but it seems highly uncertain whether the moral value of that simulation is actually tied to the outputs in the way that we naively interpret, since it's not remotely simulating anything that already exists.

Comment by Alice Blair (Diatom) on Alice Blair's Shortform · 2025-03-08T19:06:53.803Z · LW · GW

My model of ideation: Ideas are constantly bubbling up from the subconscious to the conscious, and they get passed through some sort of filter that selects for the good parts of the noise. This is reminiscent of diffusion models, or of the model underlying Tuning your Cognitive Strategies.

When I (and many others I've talked to) get sleepy, the strength of this filter tends to go down, and more ideas come through. This is usually bad for highly directed thought, but good for coming up with lots of novel ideas, Hold Off On Proposing Solutions-esque.

New habit I'm trying to get into: Be creative before bed, write down a lot of ideas, so that the future-me who is more directed and agentic can have a bunch of interesting ideas to pore over and act on.

Comment by Alice Blair (Diatom) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-08T17:28:52.266Z · LW · GW

Agency and reflectivity are phenomena that show up really broadly across text, and I think it's unlikely that memorizing a few facts is how models will come to capture them. Those traits are more concentrated in places like LessWrong, but they're almost everywhere. I think to go from "fits the vibe of internet text and absorbs some of the reasoning" to "actually creates convincing internet text," you need more agency and reflectivity.

My impression is that "memorize more random facts and overfit" is less efficient for reducing perplexity than "learn something that generalizes," for these sorts of generating algorithms that are everywhere. There's a reason we see "approximate addition" instead of "memorize every addition problem" or "learn webdev" instead of "memorize every website."

The RE-bench numbers for task time horizon just keep going up, and I expect them to keep climbing as models gain bits and pieces of the complex machinery required for operating coherently over long time horizons.

As for when we run out of data, I encourage you to look at this piece from Epoch. We run out of RL signal for R&D tasks even later than that.

Comment by Alice Blair (Diatom) on A Bear Case: My Predictions Regarding AI Progress · 2025-03-08T03:21:01.484Z · LW · GW

Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart's Law is no-one's ally.
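For concreteness, "perplexity" here is just the exponentiated average negative log-likelihood a model assigns to held-out text; lower perplexity means the model was less surprised by it. A minimal sketch of the metric (the numbers are purely illustrative, not from any real model):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(average negative log-likelihood) over a token sequence.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy example: natural-log probabilities a model assigned to each token of some text.
logprobs = [-2.3, -0.7, -1.1, -0.2, -3.0]
print(perplexity(logprobs))  # ~4.3: as surprised, on average, as a uniform pick among ~4.3 tokens
```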

If we're able to get perplexity sufficiently low on text samples that I write, then that means the LLM has a lot of the important algorithms running in it that are running in me. The text I write is causally downstream from parts of me that are reflective and self-improving, that notice the little details in my cognitive processes and environment, and the parts of me that are capable of pursuing goals for a long inferential distance. An LLM agent which can mirror those properties (which we do not yet have the capabilities for) seems like it would very plausibly become a very strong agent in a way that we haven't seen before.

The phenomenon of perplexity getting lower is made up of LLMs increasingly grokking different and new parts of the generating algorithm behind the text. I think the failure in agents that we've seen so far is explainable by the fact that they do not yet grok the things that agency is made of, and the future disruption of that trend is explainable as a consequence of "perplexity over my writing gets lower past the threshold where faithfully emulating my reflectivity and agency algorithms is necessary."

(This perplexity argument about reflectivity etc. is roughly equivalent to one of the arguments that Eliezer gave on Dwarkesh.)

Comment by Alice Blair (Diatom) on Cautions about LLMs in Human Cognitive Loops · 2025-03-03T14:54:41.643Z · LW · GW

This post just came across my inbox, and there are a couple updates I've made (I have not talked to 4.5 at all and have seen only minimal outputs):

  • GPT-4.5 is already hacking some of the more susceptible people on the internet (in the dopamine gradient way)
  • GPT-4.5+reasoning+RL on agency (aka GPT-5) could probably be situationally aware enough to intentionally deceive (in line with my prediction in the above comment, which was made prior to seeing Zvi's post but after hearing about 4.5 briefly). I think that there are many worlds in which talking to GPT-5 with strong mitigations and low individual deception susceptibility turns out okay or positive, but I am much more wary about taking that bet and I'm unsure if I will when I have the option to.

Comment by Alice Blair (Diatom) on On GPT-4.5 · 2025-03-03T14:45:36.451Z · LW · GW

My model was just that o3 was still undergoing safety evals, and quite plausibly running into some issues with the Preparedness Framework. My model of OpenAI Preparedness (epistemic status: anecdata+vibes) is that they are not Prepared for the hard things as we scale to ASI, but they are relatively competent at implementing the Preparedness Framework and slowing down releases if there are issues. It seems intuitively plausible that it's possible to badly jailbreak o3 into doing dangerous things in the "high" risk category.

Comment by Alice Blair (Diatom) on Cautions about LLMs in Human Cognitive Loops · 2025-03-03T13:12:28.647Z · LW · GW

  • I'd use such an extension. Weakness: rephrasing still mostly doesn't work against systems determined to convey a given message. Either the information content of a dangerous meme is preserved, or the rephrasing is lossy. There's also the fact that determined LLMs can perform semantic-space steganography that persists even through paraphrasing (source) (good post on the subject)
  • I'm glad that my brain mostly-automatically has a strong ugh field around any sort of recreational conversation with LLMs. I derive a lot of value from my recreational conversations with humans from the fact that there is a person on the other end. Removing this fact removes the value and the appeal. I can imagine this sort of thing hacking me anyways, if I somehow find my way onto one of these sites after we've crossed a certain capability threshold. Seems like a generally sound strategy that many people probably need to hear.

Comment by Alice Blair (Diatom) on Cautions about LLMs in Human Cognitive Loops · 2025-03-03T12:59:38.262Z · LW · GW

I think we're mostly on the same page that there are things worth forgoing the "pure personal-protection" strategy for; we're just on different pages about what those things are. We agree that "convince people to be much more cautious about LLM interactions" is in that category. I just also put "make my external brain more powerful" in that category, since it seems to have positive expected utility for now and lets me do more AI safety research in line with what pre-LLM me would likely endorse upon reflection. I am indeed trying to be very cautious about this process, trying to be corrigible to my past self and to implement all of the mitigations I listed, plus all the ones I don't have words for yet. It would be a failure of security mindset not to notice these things and recognize that they are important to deal with. However, it is a bet I am making that the extra optimization power is worth it for now. I may lose that bet, and then that will be bad.

Comment by Alice Blair (Diatom) on Cautions about LLMs in Human Cognitive Loops · 2025-03-03T12:51:19.886Z · LW · GW

I do try to be calibrated instead of being frog, yes. Within the range of time in which present-me considers past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has been going down pretty linearly, but to further help I set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure; I'm now going to generalize the "check in in [time period] about this sort of thing to make sure I haven't been hacked" reflex.

Comment by Alice Blair (Diatom) on Cautions about LLMs in Human Cognitive Loops · 2025-03-03T04:18:29.106Z · LW · GW

I agree that this is a notable point in the space of options. I didn't include it, and instead included the bunker line, because if you're going to be that paranoid about LLM interference (as is very reasonable to do), it makes sense to try to eliminate second-order effects and never talk to people who talk to LLMs, for they too might be meaningfully harmful, e.g. by being under the influence of particularly powerful LLM-generated memes.

I also separately disagree that LLM isolation is the optimal path at the moment. In the future it likely will be. I'd bet that I'm still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish. At GPT-5ish level I get suspicious and uncomfortable, and beyond that exponentially more so.

Comment by Alice Blair (Diatom) on Alice Blair's Shortform · 2025-03-02T21:17:40.613Z · LW · GW

People often say "exercising makes you feel really good and gives you energy." I looked at this claim, figured it made sense based on my experience, and then completely failed to implement it for a very long time. So here I am again saying that no really, exercising is good, and maybe this angle will do something that the previous explanations didn't. Starting a daily running habit 4 days ago has already become a noticeable multiplier on my energy, mindfulness, and focus. Key moments to concentrate force in, in my experience:

  • Getting started at all
  • The moment when exhaustion meets the limits of your automatic willpower, and you need to put in conscious effort to keep going
  • The moment the next day where you decide whether or not to keep up the habit, despite the ugh field around exercise

Having a friend to exercise with is surprisingly positive. Having a workout tracker app is surprisingly positive, because then I get to see a trendline, and suddenly my desire is to make it go up and keep it unbroken.

Many rationalists bucket themselves with the nerds, as opposed to the jocks. The people with brains, as opposed to the people with muscles. But we're here to win, to get utility, so let's pick up the cognitive multiplier that exercise provides.

Comment by Alice Blair (Diatom) on Yonatan Cale's Shortform · 2025-01-27T23:30:10.018Z · LW · GW

Right now, the USG seems to very much be in [prepping for an AI arms race] mode. I hope there's some way to structure this that is both legal and does not require the explicit consent of the US government. I also somewhat worry that the US government is doing its own capabilities research, as hinted at in the "datacenters on federal lands" EO. I also also worry that OpenAI's culture is not sufficiently safety-minded right now to actually sign onto this; most of what I've been hearing from them is accelerationist.

Comment by Alice Blair (Diatom) on Alice Blair's Shortform · 2025-01-27T23:25:10.354Z · LW · GW

Interesting class of miscommunication that I'm starting to notice:

A: I'm considering a job in industries 1 and 2 

B: Oh I work in 2, [jumps into explanation of things that will be relevant if A goes into industry 2]. 

A: Oh maybe you didn't hear me, I'm also interested in industry 1. 

B: I... did hear you?

More generally, B gave the only relevant information they could from their domain knowledge, but A mistook that for anchoring on only one of the options. It took until I was on both sides of this interaction for me to be like "huh, maybe I should debug this." I suspect this is one of those issues where just being aware of it makes you less likely to fall into it.

Comment by Alice Blair (Diatom) on AI Strategy Updates that You Should Make · 2025-01-27T23:01:27.480Z · LW · GW

I saw that news as I was polishing up a final draft of this post. I don't think it's terribly relevant to AI safety strategy; I think it's just an instance of the market making a series of mistakes in understanding how AI capabilities work. I won't get into why I think this is such a layered mistake here, but it's another reminder that the world generally has no idea what's coming in AI. If you think that there's something interesting to be gleaned from this mistake, write a post about it! Very plausibly, nobody else will.

Comment by Alice Blair (Diatom) on nikola's Shortform · 2025-01-08T03:12:04.744Z · LW · GW

Did you collect the data for their actual median timelines, or just its position relative to 2030? If you collected higher-resolution data, are you able to share it somewhere?

Comment by Alice Blair (Diatom) on Preliminary Thoughts on Flirting Theory · 2024-12-31T20:32:41.181Z · LW · GW

I really appreciate you taking the time to write a whole post in response to mine, essentially. I think I fundamentally disagree with the notion that any part of this game is adversarial, however. There are competing tensions, one pulling to communicate more overtly about one's feelings, and one pulling to be discreet and communicate less overtly. I don't see this as adversarial because I don't model the event "the other person finds out that one is into them" as terminally bad, just instrumentally bad; it is bad because it can cause the bad things, which is what a large part of my post is dedicated to.

I find it much more useful to model this as a cooperative game, but one in which the flirter is cooperating with two different counterfactual versions of the other person: the one who reciprocates the attraction and the one who does not. The flirter is trying to maximize both people's values by flirting in the way I define in this post; there's just uncertainty over which world they live in. If they knew which world they lived in, then the strategy for maximizing both people's values would look a lot less conflicted and complicated; either they'd do something friendship-shaped or something romance-shaped, probably.

Comment by Alice Blair (Diatom) on Preliminary Thoughts on Flirting Theory · 2024-12-24T21:25:23.312Z · LW · GW

Ah that's interesting, thanks for finding that. I've never read that before, so that wasn't directly where I was drawing any of my ideas from, but maybe the content from the post made it somewhere else that I did read. I feel like that post is mostly missing the point about flirting, but I agree that it's descriptively outlining the same thing as I am.