Posts

Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
AI Craftsmanship 2024-11-11T22:17:01.112Z
o1 is a bad idea 2024-11-11T21:20:24.892Z
Seeking Collaborators 2024-11-01T17:13:36.162Z
Complete Feedback 2024-11-01T16:58:50.183Z
Why is o1 so deceptive? 2024-09-27T17:27:35.439Z
Formalizing the Informal (event invite) 2024-09-10T19:22:53.564Z
In Defense of Open-Minded UDT 2024-08-12T18:27:36.220Z
Leaving MIRI, Seeking Funding 2024-08-08T18:32:20.387Z
Circular Reasoning 2024-08-05T18:10:32.736Z
LLMs for Alignment Research: a safety priority? 2024-04-04T20:03:22.484Z
Modern Transformers are AGI, and Human-Level 2024-03-26T17:46:19.373Z
Technologies and Terminology: AI isn't Software, it's... Deepware? 2024-02-13T13:37:10.364Z
Meaning & Agency 2023-12-19T22:27:32.123Z
FixDT 2023-11-30T21:57:11.950Z
Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.] 2023-11-20T18:23:40.443Z
Translations Should Invert 2023-10-05T17:44:23.262Z
Where might I direct promising-to-me researchers to apply for alignment jobs/grants? 2023-09-18T16:20:03.452Z
One Minute Every Moment 2023-09-01T20:23:56.391Z
Probabilistic Payor Lemma? 2023-03-19T17:57:04.237Z
Teleosemantics! 2023-02-23T23:26:15.894Z
Some Thoughts on AI Art 2023-01-25T14:18:14.507Z
Contra Common Knowledge 2023-01-04T22:50:38.493Z
Talking to God 2023-01-03T20:14:20.955Z
Knottiness 2023-01-02T22:13:12.752Z
Prettified AI Safety Game Cards 2022-10-11T19:35:18.991Z
Builder/Breaker for Deconfusion 2022-09-29T17:36:37.725Z
Vingean Agency 2022-08-24T20:08:53.237Z
Steam 2022-06-20T17:38:58.548Z
Brass Puppet 2022-05-26T17:42:04.876Z
ELK Computational Complexity: Three Levels of Difficulty 2022-03-30T20:56:37.239Z
[Closed] Job Offering: Help Communicate Infrabayesianism 2022-03-23T18:35:16.790Z
ELK Thought Dump 2022-02-28T18:46:08.611Z
Contest for outlining rules for this contest. 2022-02-21T18:44:43.990Z
There is essentially one best-validated theory of cognition. 2021-12-10T15:51:06.423Z
Worst Commonsense Concepts? 2021-11-15T18:22:31.465Z
How can one train philosophical skill? 2021-09-30T14:56:35.313Z
Power vs Precision 2021-08-16T18:34:42.287Z
Implicature Conflation 2021-08-09T19:48:51.097Z
Refactoring Alignment (attempt #2) 2021-07-26T20:12:15.196Z
Re-Define Intent Alignment? 2021-07-22T19:00:31.629Z
Progress, Stagnation, & Collapse 2021-07-22T16:51:04.595Z
The Homunculus Problem 2021-05-27T20:25:58.312Z
The Argument For Spoilers 2021-05-21T12:23:49.127Z
Time & Memory 2021-05-20T15:16:49.042Z
Formal Inner Alignment, Prospectus 2021-05-12T19:57:37.162Z
Fractal Conversations vs Holistic Response 2021-05-05T15:04:40.314Z
Death by Red Tape 2021-05-01T18:03:34.780Z
Gradations of Inner Alignment Obstacles 2021-04-20T22:18:18.394Z
Superrational Agents Kelly Bet Influence! 2021-04-16T22:08:18.201Z

Comments

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-09T15:19:16.897Z · LW · GW

Ah yep, that's a good clarification.

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T17:25:59.489Z · LW · GW

If s is terminal then [...] we just have .

If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That's because we can determine the value of V exactly from the reward function and the oracle, via backward induction. (I haven't revisited RL convergence theorems in a while, so I suspect I am not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (IE, r is inherently consistent with the provided definition of "deceptively misaligned"). However, it would be inconsistent for any r that is not like that.

In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn't able to check (eg, due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
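
To make the backward-induction point concrete, here is a rough sketch in Python. The particular form of beta-coherence used here (V(s) equals the softmax-beta expectation of reward plus successor value, with deterministic transitions and zero value at terminal states) is my reading of the setup, and every name is a placeholder:

```python
import numpy as np

def beta_coherent_values(states, actions, reward, next_state, is_terminal, beta):
    """
    Backward-induction sketch: on a finite acyclic MDP, the beta-coherent value
    function is pinned down by the reward function and the transition oracle.
    `states` must be topologically ordered (predecessors before successors).
    """
    V = {}
    for s in reversed(states):                 # successors get values first
        if is_terminal(s):
            V[s] = 0.0                         # assumed convention: no value past a terminal state
            continue
        # Q-value of each action: immediate reward plus value of the (assumed unique) successor.
        q = np.array([reward(s, a) + V[next_state(s, a)] for a in actions])
        # Action distribution implied by coherence: softmax with inverse temperature beta.
        p = np.exp(beta * q - np.max(beta * q))
        p /= p.sum()
        V[s] = float(p @ q)                    # coherence: V(s) is the expected Q under that distribution
    return V
```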

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T17:06:46.114Z · LW · GW

Yeah, of course the notion of "approximation error" matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with the coherence beta until V is approximately coherent. That's the pre-training. And then you switch to training with the other beta.[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled at the coherence beta's temperature. This reflects the fact that it'll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at that temperature, but easy for frequently-sampled states.

My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability under the two betas. A scheming V can utilize this to self-preserve. This violates the assumption of beta-coherence, but in a very plausible-seeming way.

  1. ^

    My earlier comment about this mistakenly used my own notation for the two betas rather than yours, which may have been confusing. I'll go fix that to be consistent with your notation.
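
Here is the sort of measurable approximation error I have in mind, as a rough sketch (again, the function names and the precise coherence condition are my own assumptions):

```python
import numpy as np

def coherence_error(V, Q, sampled_states, beta):
    """
    The measurable notion of approximation error described above: the average
    gap between V(s) and the softmax-beta expectation of Q(s, .), computed on
    states sampled by rolling out at the pre-training temperature.
    `V` maps a state to a float; `Q` maps a state to a vector of action values.
    """
    errors = []
    for s in sampled_states:
        q = np.asarray(Q(s))
        p = np.exp(beta * q - np.max(beta * q))
        p /= p.sum()
        errors.append(abs(V(s) - float(p @ q)))
    return float(np.mean(errors))
```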

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T01:22:02.913Z · LW · GW

I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeying these rules is unlikely to create such a pathological V.

To be clear, that's not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T21:11:09.500Z · LW · GW

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.

With you so far.

But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).

OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead.

This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The "hypothetical process to generate agents which are coherent under one beta" would be the first training phase, and then the "apply a different beta during training" would be the second training phase.

Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy "appear coherent (wrt the first-phase beta) & aligned in the first phase of training; appear coherent (wrt the second-phase beta) & aligned in the second phase of training; then, do some other thing when deployed". This relies on detecting the distributional shift between the two training phases (it can look for evidence of which beta is in play by examining the history), and also detecting the distributional shift to deployment.

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).

So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training needs to explore the trajectories that will actually be seen during deployment in order to correct V there, but some of these will have very low probability in the high-temperature training. 

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T19:40:47.362Z · LW · GW

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

Is the idea to train with high beta and then use lower beta post-training? 

  • If so, how does this relate to reward hacking and value preservation? IE, where do  and  come from, if they aren't the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between  and  must be made in production (since it is made with low beta), but then it seems like .
  • If not, then when does the proposal suggest to use high beta vs low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?

Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the planning failure postulated can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. Any capabilities which similarly rely on long strings of near-optimal actions will similarly be hurt. (However, this concern is secondary to my main confusion.)

Therefore a value function trained with such a procedure must consider the state reached during training. 

Trained with what procedure, exactly?

This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".

Yes this would prevent an aligned AI from arbitrarily preserving its value function, the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL) so the contradiction does not apply.

(These parts made sense to me modulo my other questions/concerns/confusions.)

Comment by abramdemski on What's so special about likelihoods? · 2025-01-07T19:12:36.268Z · LW · GW

One sort of answer is that we often want the posterior, and we often have the likelihood. Slightly more refined: we often find the likelihood easier to estimate than the posterior, so Bayes' Rule is useful.

Why so?

I think one reason is that we make it the "responsibility" of hypotheses to give their likelihood functions. After all, what is a hypothesis? It's just a probability distribution (not a probability distribution that we necessarily endorse, but, one which we are considering as a candidate). As a probability distribution, its job is to make predictions; that is, give us probabilities for possible observations. These are the likelihoods.

We want the posterior because it tells us how much faith to place in the various hypotheses -- that is, it tells us whether (and to what degree) we should trust the various probability distributions we were considering.

So, in some sense, we use Bayes' Rule because we aren't sure how to assign probabilities, but we can come up with several candidate options.

One weak counterexample to this story is regression, IE, curve-fitting. We can interpret regression in a Bayesian way easily enough. However, the curves don't come with likelihoods baked in. They only tell us how to interpolate/extrapolate with point-estimates; they don't give a full probability distribution. We've got to "soften" these predictions, layering probabilities on top, in order to apply the Bayesian way of thinking. 
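
To illustrate what I mean by "softening": here is a toy sketch where each candidate curve's point predictions get a Gaussian noise model layered on top, which is what turns them into likelihoods we can feed into Bayes' Rule. The noise scale sigma is an assumption we have to supply; none of this comes from the curves themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + 0.1 * rng.standard_normal(20)            # toy data

# Candidate "hypotheses": polynomial fits of different degrees. On their own,
# they only give point predictions for interpolation/extrapolation.
candidates = [np.poly1d(np.polyfit(x, y, deg)) for deg in (1, 2, 3)]

sigma = 0.1                                          # assumed noise scale: the "softening" step
prior = np.full(len(candidates), 1.0 / len(candidates))

# Layering a Gaussian likelihood on top of each curve's point predictions.
log_lik = np.array([-0.5 * np.sum(((y - f(x)) / sigma) ** 2) for f in candidates])
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()
print(posterior)                                     # how much faith to place in each curve
```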

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T16:34:40.140Z · LW · GW

I think I don't quite understand what the argument is supposed to be. Would the same argument also prevent an aligned AI from acting to preserve its values? How is it supposed to select for aligned behavior over misaligned behavior?

In order to achieve beta-coherence, it seems like the training needs to use the coherence beta to choose actions, so that V is trained on those action frequencies. However, the proposal appears to be to instead use a different beta during training, so that the misaligned V is misled about how to do reward-hacking. This seems like an impossible combination: V will become beta-coherent wrt the beta actually used during training, rather than the intended one.

We could imagine two training steps; first we cohere V wrt one beta, and then re-train wrt the other. Perhaps this is your intention. More generally, we could gradually increase the temperature during training.

Oh, I guess we could also modify the training procedure so that the gradient associated with some actions gets under-weighted and others get over-weighted. Maybe this is your intended proposal.

Still, I'm not sure this helps achieve the postulated effect. Let's say that, during training, we choose actions entirely randomly (max temperature), but the gradient from suboptimal actions gets entirely ignored (so V becomes coherent wrt minimum temperature). This would seem to be almost equivalent to just training with minimum temperature (except the exploration behavior is very different).

Similarly for less extreme temperatures: if we make V beta-coherent by under-weighting gradients from over-sampled actions, then we are also more-or-less correcting the 'mistaken' expectations of the reward-hacking attempts.
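
For concreteness, here is a minimal sketch of the kind of under-weighting I have in mind: sample actions at one temperature, but importance-weight each update by the probability ratio under the target beta versus the sampling beta. This is my guess at the mechanism, not necessarily what you intend, and all names are placeholders.

```python
import numpy as np

def softmax(q, beta):
    z = np.exp(beta * (q - np.max(q)))
    return z / z.sum()

def update_weight(q_values, action, beta_sample, beta_target):
    """
    Importance weight for one sampled action: actions are drawn at beta_sample,
    but we want V's updates to reflect the action frequencies of beta_target.
    Over-sampled actions get weight < 1, under-sampled actions get weight > 1.
    """
    q = np.asarray(q_values, dtype=float)
    return softmax(q, beta_target)[action] / softmax(q, beta_sample)[action]
```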

Am I misunderstanding the proposal, or missing something in the argument?

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T15:31:56.781Z · LW · GW

Correctness: V correctly predicts future reward in RL scenarios.

The meaning of correctness is very unclear to me on first reading. It later becomes apparent that "reward" is not intended to refer to the human-designed reward signal at all, due to the later assumption "deceptive misalignment". So what does "correctness" say? "Correctness" indicates conforming to some standard; but here, the standard being conformed to is only the subjective standard of the misaligned AI itself.

This suggests the interpretation of "correctness" as "reflect the EV of the current state in terms of the misaligned desires". However, this contradicts beta-coherence, since the assumed softmax-beta distribution incorrectly predicts the action probabilities.

I think it would be better to remove the "correctness" assumption, since it doesn't really do anything.

Comment by abramdemski on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T20:01:10.784Z · LW · GW

(And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

Mathematics?

Comment by abramdemski on Don't want Goodhart? — Specify the damn variables · 2024-12-16T16:00:08.430Z · LW · GW

If you indeed were solving a narrower task — that is, only creating the most sense of pleasure-inducing picture with maximization of other parameters — and then looked back, puzzled as to why the hungry weren't fed by this procedure, bringing Goodhart's law into the discussion is madness; it stresses me out. The variable 'people are hungry' wasn't important for this task at all. Oh, or was it important to you? Then why didn't you specify it? You think it’s 'obvious'?

The point of Goodhart's Law is that you can only select for what you can measure. The burger is a good analogy because Instagram can't measure taste or nutrition, so when Instagram is what optimizes burgers, you get burgers with a very appealing appearance but non-optimized taste and nutrition. If you have the ability to measure taste, then you can create good taste, but you run into subtler examples of Goodhart (EG, Starbucks coffee is optimized to taste good to their professional tasters, which is slightly different from tasting good to a general audience).

Just specifying the variable you're interested in doesn't solve this problem; you also have to figure out how to measure it. The problem is that measurements are usually at least slightly statistically distinct from the actual target variable, so that the statistical connection can fall apart under optimization.
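
As a toy illustration of that last point (all numbers here are made up): generate a target variable and a merely-correlated measurement, then optimize hard on the measurement and watch the target come apart from it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
target = rng.standard_normal(n)                        # the variable we actually care about
proxy = 0.7 * target + 0.7 * rng.standard_normal(n)    # a measurement correlated with the target

selected = np.argsort(proxy)[-100:]                    # optimize hard on the measurement
print("mean proxy among selected: ", proxy[selected].mean())   # very high
print("mean target among selected:", target[selected].mean())  # clearly lower: the tails come apart
```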

I also take issue with describing optimizing the appearance of the burger as "narrower" than optimizing the burger quality. In general it is a different task, which may be narrower or broader.

Comment by abramdemski on Don't want Goodhart? — Specify the damn variables · 2024-12-16T15:50:07.688Z · LW · GW

I expect that the main problem with Goodhart's law is that if you strive for an indicator to accurately reflect the state of the world, once the indicator becomes decoupled from the state of the world, it stops reflecting the changes in the world. This is how I interpret the term 'good,' which I dislike. People want a thermometer to accurately reflect the patterns they called temperature to better predict the future — if the thermometer doesn't reflect the temperature, future predictions suffer.

A problem I have with this reinterpretation is that "state of the world" is too broad. In looking at a thermometer, I am not trying to understand the entire world-state (and the thermometer also couldn't be decoupled from the entire world-state, since it is a part of the world).

A more accurate way to remove "good" would be as follows:

In everyday life, if a human is asked to make a (common, everyday) judgement based on appearances, then the judgement is probably accurate. But if we start optimizing really hard based on their judgement, Goodhart's Law kicks in.

Comment by abramdemski on Complete Class: Consequentialist Foundations · 2024-12-05T16:34:09.441Z · LW · GW

Thanks!

Comment by abramdemski on Ayn Rand’s model of “living money”; and an upside of burnout · 2024-12-02T21:13:37.094Z · LW · GW

Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?

Comment by abramdemski on Ayn Rand’s model of “living money”; and an upside of burnout · 2024-12-02T17:37:24.572Z · LW · GW

In machine-learning terms, this is the difference between model-free learning (reputation based on success/failure record alone) and model-based learning (reputation can be gained for worthy failed attempts, or lost for foolish lucky wins).

Comment by abramdemski on Examples of Highly Counterfactual Discoveries? · 2024-12-02T16:01:17.567Z · LW · GW

There's unpublished work about a slightly weaker logical induction criterion which doesn't have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn't count as raking in the cash. The regular LIC (we can call it "strong LIC" or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.

The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.

Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.

  1. ^

    Roughly speaking. This is not quite an adequate description of the theorem.

Comment by abramdemski on o1 is a bad idea · 2024-12-02T00:12:22.458Z · LW · GW

Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.

One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into. 

Comment by abramdemski on Are most people deeply confused about "love", or am I missing a human universal? · 2024-11-28T17:49:05.540Z · LW · GW

I agree that:

  • People commonly talk about love as black-and-white when it isn't.
  • It is used as a semantic stopsign or applause light.
  • Love is a cluster concept made up of many aspects, but the way people talk about love obscures this. The concept of "true love" serves to keep people in the dark by suggesting that the apparent grab-bag nature of love is just the way people talk, & there's an underlying truth to be discovered. Because experiences of love are fairly rare, people can only accumulate personal evidence about this slowly, so it remains plausible that what they've experienced is "just infatuation" or something else, and the "true love" remains to be discovered. People will fight about what love is in sideways ways, EG "you don't know what love is" can be used to undercut the other person's authority on love (when a more accurate representation of the situation might be "I didn't enjoy participating in the social dynamic you're presently calling love"). Different people will sometimes express different detailed views about what love is, but there seems to be little drive to reach an agreement about this, at least between people who aren't romantic partners. 

I once had the experience of a roommate asking if I believe in love. I said yes, absolutely, explaining that it is a real human emotion (I said something about chemicals in the brain). He responded: it sounds like you don't believe in love!

(I think this has to do with ambiguity about believe-in as much as about love, to be fair.)

So, a question: is 'love' worth saving as a concept? The way some people use the word might be pretty terrible, but, should we try to rescue it? Is there a good way to use it which recovers many aspects of the common use, while restoring coherence?

I do personally feel that there is some emotional core to love, so I'm sympathetic to the "it's a specific emotion" definition. This accords with people not being able to give a specific definition. Emotions are feelings, so they're tough to define. They just feel a specific way. You can try to give cognitive-behavioral definitions, and that's pretty useful; but, for example, you could show behavioral signs of being afraid without actually experiencing fear.

I'm also sympathetic to the view that "love" is a kind of ritual, with saying "I love you" being the centerpiece of the ritual, and a cluster of loving behaviors being the rest of it. "I love you" can then be interpreted as a desire or intention or commitment to participate in the rest of it. 

Comment by abramdemski on Hell is wasted on the evil · 2024-11-28T17:13:50.569Z · LW · GW

I reject the principled distinction. To me, it's more of a spectrum.

Comment by abramdemski on Which things were you surprised to learn are not metaphors? · 2024-11-22T09:16:16.366Z · LW · GW

I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they're experiencing it.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T19:13:07.175Z · LW · GW

I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu

https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi

I continue not to get what you're saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn't know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).

As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.

Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T18:44:08.688Z · LW · GW

Yep, thanks, this makes sense. I still don't have an intuitive picture of what differences to expect in the trained result, though.

Clarifying Concrete Algorithms

In the most basic possible shoggoth+face training setup, you sample CoT+summary, you score the summary somehow, and then you do this lots of times and throw away a worse-scoring fraction of these samples. Then you fine-tune the shoggoth on the CoT parts of these, and fine-tune the face on the summary parts. I'll call this 'basic training'.

The most basic implementation of your thing is similar, except it discards the worst samples in the more complex way you've defined, instead of just by global score. I'll call this 'your proposal'. (Although, it ignores the contrastive part of your proposal.)

(For both basic training and your proposal, I'm going to pretend there is only one question/prompt; really we have across-prompt issues similar to the across-summary issues you're trying to solve, IE, by discarding the globally worst-scoring CoT+summary we discard answers to hard questions, meaning we would never train to do better on those questions. So really we need to at least take multiple samples for each question and discard samples based on performance relative to a question. Easier to focus on one question to avoid describing too many algorithmic details.)

Your proposal splits into the CoT part (which you call the n part), and the summary part (which you call the k part).

Analyzing CoT Gradient

For the CoT part, basic training keeps the CoTs which resulted in the best summaries. Your proposal instead averages the top 3 summaries for each CoT. This de-noises somewhat, at the cost of taking more summary samples per CoT. I'm not sure how to think about the trade-off there. More summary samples per CoT means fewer CoT samples overall, so we're doing a better job scoring the CoTs, but we get fewer scores overall, so we learn less. Maybe the gradient step-size can be increased a little because we're more confident in the gradient steps being taken, but we still explore less of the CoT space.

Another variation would be to average all the summary samples for that CoT to score a CoT sample. This de-noises more, for the same price in number of samples. This way, we're simply scoring CoTs based on how well they make the summaries perform, with several samples to de-noise the gradient.

A possible argument for your proposal over this alternative is that we're also training the summary to improve it, so the top 3 summaries are actually a better estimator of the kind of performance we care about than the total average.

This argument, if valid, similarly recommends your proposal over basic training, because basic training is also looking at the average summary score per CoT. (It's also looking at the top summary, but it only samples one summary per CoT, so these are the same.)
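
To make the comparison concrete, here is a small sketch of the three CoT-scoring rules under discussion; the function name and the "keep half" selection are placeholders for whatever the real training loop does.

```python
import numpy as np

def score_cot(summary_scores, rule="top3"):
    """
    Score one CoT from the scores of its k sampled summaries.
    'best' ~ the single best summary (closest to basic training),
    'top3' ~ average of the top 3 summaries (your proposal),
    'mean' ~ average over all summaries (the lower-variance alternative above).
    """
    s = np.sort(np.asarray(summary_scores, dtype=float))[::-1]
    if rule == "best":
        return float(s[0])
    if rule == "top3":
        return float(s[:3].mean())
    return float(s.mean())

# For one prompt: score each CoT, then keep the better-scoring half for fine-tuning the shoggoth.
# cot_scores = [score_cot(scores, "top3") for scores in summary_scores_per_cot]
# keep = np.argsort(cot_scores)[len(cot_scores) // 2:]
```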

Analyzing Summary Gradient

For the summary-gradient part, you take the highest-scoring summary from the best CoT (as defined previously), rather than the highest-scoring summary overall. This could help train the summary to be good for the type of CoT we're steering towards, rather than training it to be good for high-variance CoTs which produce one outlier great summary. As with the previous section, this helps to steer towards policies which are going to be better after learning, rather than just what's best currently.

Conclusion

I'm curious to hear whether you think I'm missing any big factors here. I don't have a lot of practical experience with this type of machine learning, so my guesses about what's important vs unimportant here are not very informed.

My uninformed guess is that the trade-off between denoising the gradient vs getting more samples favors getting more samples, so the denoising argument for your proposal isn't a great one. The main argument in favor, then, is the way your proposal trains the face to respond well to how the shoggoth is expected to behave after training, and trains the shoggoth to respond well to how the face is expected to behave after training -- whereas the basic proposal instead trains the face to perform well for the current shoggoth and vice versa.

I'm not sure how big of a deal this is. If the gradient steps are small and the training is iterated a lot of times, maybe it's not a very big deal at all? 

IE, for the CoT gradient, the mistake made by the basic proposal is that it will sometimes discard a good CoT (and gradient towards a worse one) based on random poor performance from face. However, maybe face's performance will improve fast enough that this won't matter too much in the end?

Similarly, for the summary gradient, the problem with the basic proposal is that it'll sometimes update based on an anomalously good summary of a poor CoT, meaning it is learning to cope with bad CoT behavior which it won't have to deal with very often post-training. But maybe it won't matter too much, if the bad CoT behavior is eliminated during training anyway?

Overall, it seems plausible to me that these problems with the basic proposal would work against what Daniel wants, and your proposal would help push towards what Daniel wants. However, I'm very uncertain, especially about the size of the effect.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T19:14:38.629Z · LW · GW

I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between its capability to retrieve links and its ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximately linking.)

What do you mean? I don't get what you are saying is convincing.

Perhaps I should clarify my belief. 

The o1 report says the following (emphasis mine):

0.8% of o1-preview’s responses got flagged as being ‘deceptive’ [...] roughly two thirds of which appear to be intentional (0.38%), meaning that there was some evidence in the chain of thought that o1-preview was aware that the answer was incorrect [...] Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead. [...] It is encouraging that, in the analysis presented below, while our monitor did find a few forms of the model knowingly presenting incorrect information to the user or omitting important information, it did not find any instances of o1-preview purposely trying to deceive the user for reasons other than satisfying the user request.

  • Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
    • No, I don't think so.
  • Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
    • This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
  • Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
    • OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
    • However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
    • More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T18:39:56.007Z · LW · GW

I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.

So, how do you extract the desired credit-assignment info out of this?

(Also, how do you then use that info?)

If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings. 

If I have just Alice and Bob, pairing them up multiple times doesn't help so much.

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T18:12:24.255Z · LW · GW

The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you'll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.

So, I imagine that prompting a vanilla LLM would not be a great approach.

I think first you want to build a fairly good "semantic similarity score" (this shouldn't be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph's idea here.

I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.
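For reference, here is a rough sketch of the reward shape I have in mind for the paraphraser-training idea quoted above; the similarity scorer, the entropy proxy, and all the constants are placeholders.

```python
def paraphrase_reward(similarity, token_logprobs, min_similarity=0.8, entropy_weight=0.1):
    """
    Sketch of the reward shape: encourage diverse (high-entropy) paraphrases
    while heavily punishing any loss of meaning. `similarity` is the output of
    some semantic-similarity scorer in [0, 1]; `token_logprobs` are the
    paraphraser's log-probs for the tokens it actually emitted (their negative
    mean serves as a crude entropy proxy). All constants are placeholders.
    """
    if similarity < min_similarity:
        return -10.0                                   # heavy penalty for changing the meaning
    entropy_proxy = -sum(token_logprobs) / max(len(token_logprobs), 1)
    return entropy_weight * entropy_proxy              # otherwise, reward surprising wording
```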

Comment by abramdemski on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T17:56:32.988Z · LW · GW

In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing. 

So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention.

One reason is that we already see this empirically in the o1 report.

However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected. 

There could be other cases where this doesn't happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.

If you give a kid a diary and try to avoid giving them any reason to think you're reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn't previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it's better than nothing.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-19T20:56:18.096Z · LW · GW

See here for an updated version of my thinking based on discussion with Daniel:

https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser 

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-19T16:27:44.427Z · LW · GW

Yeah, I agree that's at least somewhat true. So, given that, is it a good move or not? I don't see much of an upside, since there's a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI's example of avoiding safety training for the CoT, I think it's right to discourage them from doing this rather than encourage them... although, I say this with very low confidence.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-18T17:33:28.775Z · LW · GW

I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).

I'm reacting to this whole thing as a possible default configuration going forward, rather than as it exists today. All of my concerns are about what happens when you scale it up. For example, I don't find o1's current level of deception hugely concerning in its own right; rather, I see it as a bad sign for the technology as it continues to develop.

I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:33:43.483Z · LW · GW

Ah, yep, this makes sense to me.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:32:35.036Z · LW · GW

Yeah, this is a good point, which doesn't seem addressed by any idea so far.

Comment by abramdemski on o1 is a bad idea · 2024-11-16T16:31:37.929Z · LW · GW

Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).

LLMs touch on the real world far more than that, such that MCTS-like skill at navigating "the LLM world" in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:25:45.598Z · LW · GW

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.

Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction.

This is especially likely to hold under the following conditions:

We can also combine this with other proposals, such as paraphrasing.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:19:18.349Z · LW · GW

Yeah, you're right, I no longer think it's an interesting proposal.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T16:05:41.704Z · LW · GW

I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.

And, these should clearly be separate models such that the training of one doesn't generalize to the other.

So, making the Face more deceptive doesn't kill the canary?

Ah, yes, I had neglected this important distinction.

So what do you do in the end, throw away the face?

It seems worth pointing out that although the CoT isn't trained to be deceptive, it is trained to think thoughts which help out a deceptive face. So it seems plausible that a multi-step training process like:

GPT -> basic chat training -> shoggoth+face training -> discard face & align shoggoth

Would result in deceptive capabilities: the shoggoth CoT learns to explicitly reason about what helps deceive the alignment training in the shoggoth+face step, and then in the final step, some of this is destroyed but some is driven underground. IE, learning to explicitly reason about deception creates latent structure which is useful for that task, and this latent structure will somewhat transfer to the case where the deception has to be hidden from the CoT.

The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).

So it sounds like we have some disagreement about what degree of deception is considered problematic. Again, I'm worried about basin-of-attraction type dynamics. We can think of a whole frontier lab as collectively being a sort of self-improving AGI; we want that recursive self-improvement to be going down a good path rather than a bad one. The LLMs are already helping to improve themselves in little ways (EG as coding assistants) and big ways (EG constitutional AI). So it's an environment where seeds of deceptiveness can grow and take some control of the overall trajectory.

Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?

I'm not sure what you're getting at here. It seems like the idea is that Face + CoT is, together, a clever schemer -- in the o1 report, we already see cases of explicit scheming in the CoT. So then, using the Face+CoT together (EG trying to train it further, using it in supervision steps for later iterations of the system, etc) is dangerous and difficult for the same reason that a single deceptive LLM would be. If we use Face+CoT together, we have to reason about them together in that way. Or, if we are using them as separate pieces, then we have to account for the part of the overall deceptive-schemer strategy they've learned.

I guess you could use Face+CoT as a supervisor (EG in constitutional AI) by looking at the CoT (both by hand and using some automation) and throwing out any deceptive examples, so as to not use them for supervision. 

But the overall goal should be to produce a highly non-deceptive system in the end, I think.

Another idea would be to do the whole Face+CoT thing, and then filter for honesty on a bunch of examples, and then re-train a whole new system (well, you can start from the same GPT) only on those honest examples. 
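Concretely, that last idea would look something like the following sketch; every name here is a placeholder, and "is_honest" stands in for whatever combination of human review and automated checking gets used.

```python
def filter_honest_and_retrain(base_model, prompts, rollout, is_honest, finetune):
    """
    Sketch of the filter-then-retrain idea: generate CoT+answer pairs from the
    shoggoth+face system, keep only the examples that pass an honesty check,
    and fine-tune a fresh model (starting from the same base GPT) on those.
    """
    kept = []
    for prompt in prompts:
        cot, answer = rollout(prompt)
        if is_honest(prompt, cot, answer):   # discard anything showing deception in CoT or answer
            kept.append((prompt, cot, answer))
    return finetune(base_model, kept)
```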

Comment by abramdemski on o1 is a bad idea · 2024-11-16T03:07:13.504Z · LW · GW

Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.

Ah, I wasn't thinking "sparse" here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).

I still think o1 is moving towards chess on this spectrum.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T03:01:34.983Z · LW · GW

Async is fine with me. Please feel encouraged to start it.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-16T02:55:13.651Z · LW · GW

My current epistemic state is pretty "I'm not sure at all, it's a complicated subject" fwiw. I'll try to generate some responses anyway.

make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)

I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)

Sure. Therefore, you can't check for deceptive behavior using the CoT when you're trying to train against deceptive behavior. Therefore, Hubinger et al's conclusions about how deceptive behavior makes the system harder to train still hold even if you sorta get a transparency tool out of this.

You can still inspect the underlying CoT to know if you have big issues.

If other companies go in this direction like OpenAI has, then they'll probably hide the CoT from the end-users for similar reasons, so you can't say that the upside is that you, the end user, get to see whether the model is deliberately deceiving you with respect to specific questions. 

So you probably mean the upside is that the companies can see this info in the CoT, to check whether models are generally behaving in a deceptive way (similar to what we saw in the o1 report).

Sure, but you're making a commitment not to use this in training, so what are you doing with this information? If the model is deceptive, you just throw it away (or, more realistically, make excuses about how the deception you observed isn't so bad). But your version of the proposal deliberately trains the model to be deceptive. So yeah, you can look, but what's the point? You're checking whether a coin you rigged to land on heads in fact landed on heads.

(My guess is that you have this concern with the baseline protocol and thus don't like moving further in this direction.)

Yeah, you're right about that. But there is a difference. I can think of the original proposal as a "canary in a coal mine": you hope that the model doesn't learn deceptive behavior (you plan to chuck it out if you detect any), but you still follow Daniel's division of labor & refuse to do any alignment training on the CoT so that you have a chance of detecting deception if it occurs.

On the other hand, your version of Daniel's protocol loses this property. Metaphorically, it's like killing the canary before you go into the coal mine. If the canary was supposed to be an early warning to get out of there, you've completely defeated the point, and for what? (This is the part I'm unclear about, and why I asked my original question to you, rather than Daniel.)

Deceptive behavior is hard to remove once it creeps in. This is nasty stuff. A technique for getting deceptive behavior to reveal itself in the CoT would be wonderfully useful if you could then use that in the training signal to remove the deceptive behavior, but the proposal here relies on not doing that. 

Basically, if there is a basin of attraction for corrigibility, it seems to me that deceptive behavior is the opposite of that. It puts you outside the basin of attraction. It means you're on course for Evan Hubinger's story about how malign inner optimizers develop. It means you're on course for the Bad Ending rather than the Good Ending. And for what?

My take (very much siding with Evan Hubinger here) is that the target should be keeping well away from deceptive behavior throughout training, so that there's never that kernel of skill that could grow into a malign inner optimizer.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T23:30:34.354Z · LW · GW

Would you be interested in having a (perhaps brief) LW Dialogue about it where you start with a basic intro to your shoggoth/mask division-of-labor, and I then present my critique?

Comment by abramdemski on o1 is a bad idea · 2024-11-15T23:28:37.395Z · LW · GW

My point here is that at the capability level of GPT4, this distinction isn't very important. There's no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn't cleverly scheming. It is merely human-level at deception, and doesn't pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn't even adequately coherent to make a crisp distinction between whether it's honestly trying to answer the question vs deceptively trying to make an answer look good; at its level of capability, it's mostly the same thing one way or the other. The exceptions to this "mostly" aren't strategic enough that we expect them to route around obstacles cleverly.

It isn't much, but it is more than I naively expected.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T23:07:57.153Z · LW · GW

I think the crux is that the important parts of LLMs re safety aren't their safety properties specifically, but rather the evidence they give as to what alignment-relevant properties future AIs have.

[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]

But what lesson do you think you can generalize, and why do you think you can generalize that?

I think this is a crux, in that I don't buy o1 as progressing to a regime where we lose so much dense feedback that it's alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.

So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.

Similarly, playing text-based video games, with the sparse feedback given for winning.

Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.

Etc.

You think these sorts of things just won't work well enough to be relevant?

Comment by abramdemski on AI Craftsmanship · 2024-11-15T21:59:00.439Z · LW · GW

I think you're mostly right about the world, but I'm going to continue to articulate disagreements based on my sense of dissatisfaction. You should probably mostly read me as speaking from a should-world rather than being very confused about the is-world.

The bitter lesson is insufficient to explain the lack of structure we're seeing. I gave the example of Whisper. I haven't actually used Whisper, so correct me if I'm wrong -- maybe there is a way to get more nuanced probabilistic information out of it? But the bitter lesson is about methods of achieving capabilities, not about the capabilities themselves. Producing a plain text output rather than a more richly annotated text that describes some of the uncertainty is a design choice.

To give another example, LLMs could learn a conditional model that's annotated with metadata like author, date and time, etc. Google LaMDA reportedly had something like author-vectors, so that different "characters" could be easily summoned. I would love to play with that, EG, averaging the author-vectors of specific authors I'm interested in to see what it's like to mix their voices. In theory, LLMs could also learn to predict the metadata. You could use partially-labeled-data approaches such as the EM algorithm to train LLMs to predict the metadata labels while also training them to use those labels in text generation. This would give a rich set of useful capabilities. This would be scientifically useful, too: predictions of date, author, and other metadata would be interesting for historians.

In this way, we could actually get closer to a "truth machine". There's a world of difference between the kinds of inferences you can make from ChatGPT's opinion about a text vs inferences you could make by treating these as legit latent variables and seeing what latent-variable-inference algorithms think. 
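
A very rough sketch of the EM-style loop I have in mind, just to show the shape of the E-step/M-step split; the model interface here is entirely made up:

```python
def em_metadata_training(model, labeled_docs, unlabeled_docs, n_rounds=3, threshold=0.5):
    """
    EM-flavored loop for partially-labeled metadata. `model` is assumed to
    expose predict_metadata(text) -> (metadata, confidence),
    train_generation(text, metadata), and train_metadata_prediction(text, metadata);
    this interface is a placeholder, not a real API.
    """
    for _ in range(n_rounds):
        # E-step (roughly): impute metadata for unlabeled text with the current model.
        imputed = [(text,) + tuple(model.predict_metadata(text)) for text in unlabeled_docs]

        # M-step (roughly): train on real labels, plus confident imputed labels.
        for text, metadata in labeled_docs:
            model.train_generation(text, metadata)
            model.train_metadata_prediction(text, metadata)
        for text, metadata, confidence in imputed:
            if confidence > threshold:
                model.train_generation(text, metadata)
    return model
```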

To give a third example, there's Drexler's Quasilinguistic Neural Representations.

You say "it's a matter of demand". So why does it feel so much like big tech is eager to push the chatbot model on everyone? Everyone is scared to fall behind in the AI race, ever since ChatGPT made it feel plausible that AI would drive significant market share. But did big tech make the correct guess about what there was demand for? Or did everyone over-update on the specific form of ChatGPT, and now there's just not that much else out there to reveal what the demand really is? Little sparkle-emoji buttons decorate everything; press to talk to the modern equivalent of Clippy.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T20:23:42.373Z · LW · GW

Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.

One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not. 

A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won't help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn't avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.

Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can't be vulnerable to steering by adversarial inputs.
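As a concrete (and heavily simplified) sketch of the kind of paraphrase-robustness signal I have in mind, where `paraphrase`, `continue_reasoning`, and `answers_agree` are placeholders for whatever models and checks would actually be used:

```python
from typing import Callable, List

def paraphrase_robustness_penalty(
    cot_steps: List[str],
    original_answer: str,
    paraphrase: Callable[[str], str],                # placeholder paraphraser model
    continue_reasoning: Callable[[List[str]], str],  # placeholder: paraphrased CoT -> answer
    answers_agree: Callable[[str, str], bool],       # placeholder semantic-equivalence check
    n_samples: int = 4,
) -> float:
    """Fraction of paraphrased rollouts whose final answer no longer matches.

    If the conclusion depends on exact wording quirks rather than the
    human-readable content of the CoT, paraphrasing should change the answer;
    this penalty punishes that kind of sensitive dependence."""
    disagreements = 0
    for _ in range(n_samples):
        paraphrased_cot = [paraphrase(step) for step in cot_steps]
        new_answer = continue_reasoning(paraphrased_cot)
        if not answers_agree(original_answer, new_answer):
            disagreements += 1
    return disagreements / n_samples
```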

Comment by abramdemski on o1 is a bad idea · 2024-11-15T20:04:44.476Z · LW · GW

I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).

What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I've updated on the modern progress of AI in a different way than they have.

For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)

One thing I can say is that I myself often argued using "naive misinterpretation"-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme "the AI will understand what the humans mean, it just won't care". I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does.

This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.

Getting more specific to what I wrote in the post:

My claim is that modern LLMs are "doing roughly what they seem like they are doing" and "internalize human intuitive concepts". This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they "roughly are"). 

The reason I don't think this contradicts with Eliezer's bolded claim ("Getting a shape into the AI's preferences is different from getting it into the AI's predictive model") is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT4, there is not a clear difference between getting the LLM to really care about something vs merely getting those values into its predictive model. It may be deceptive or honest; or, it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T19:34:55.104Z · LW · GW

I don't get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).

To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away from it. It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime. Maybe I agree that we can derive some abstract lessons like "RL with really dense feedback can avoid instrumental convergence", but we're moving away from the regime where such dense feedback is available, so I don't see what lessons transfer.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T19:16:13.196Z · LW · GW

I don't get why you think actively training the model to be deceptive is "worth the cost". The deceptive behavior can generalize to other cases, and make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate). It seems like a choice between two bad options: actively deceptive CoT or actively deceptive Face.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T19:10:35.446Z · LW · GW

--...but the reward process (RLHF, RLAIF, whatever) gets to see the outputs of the Face but not the Shoggoth.
--So, the hope is that skills and propensities towards deception and dishonesty will accumulate in the Face rather than the Shoggoth.

It seems pretty important to put some incentives on the CoT to avoid steganography, though. The division of labor here seems aimed at not actively incentivizing the hiding of instrumentally useful thoughts. However, the pre-trained LLM will already be responsive to some steganographic patterns, in addition to the human-readable patterns. So, there is still an incentive to optimize the non-human-readable patterns in the CoT to be as useful as possible, in addition to the human-readable patterns.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T19:03:39.979Z · LW · GW

However, with this naive approach, the paraphrasings will probably still consistently contain certain kinds of noise from the initial thought. For instance, maybe the paraphraser never replaces an exclamation point with a period, even though this end-of-sentence punctuation doesn't really change the meaning. By modulating whether the step ends with a period or an exclamation point, the model can encode one bit of information per step, despite the paraphraser.

One way to solve this problem is to map every piece of text with a given meaning onto the same "canonical paraphrasing." For instance, we'd want a paraphraser that, given many texts with approximately-identical meanings (like "I own a large home!" "My house is spacious!" and "I possess a sizable dwelling."), maps them all onto a single "canonical" text (like "I have a big house.").

I don't think this proposal improves the thing you claim it improves.

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either. 

I think what canonical-paraphrasing does is help with cases where the paraphraser stochastically removes specific information, but in an information-theoretically leaky way. For example, if the paraphraser changes between '!' and '.' with only 10% probability, then steganography can use error-correcting codes to still get through. 
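Here's a toy simulation of that point, with made-up numbers matching the 10% example:

```python
# Toy demo: a 1-bit hidden message encoded in end-of-step punctuation,
# sent through a paraphraser that randomizes the punctuation only 10% of
# the time. A simple repetition code recovers the bit almost always.
import random

def send_bit(bit: int, n_steps: int, flip_prob: float = 0.10) -> int:
    # Encode the same bit in every step's punctuation ('.' = 0, '!' = 1),
    # flip each copy independently with probability flip_prob (the lossy
    # paraphraser), then decode by majority vote.
    received = [bit ^ (random.random() < flip_prob) for _ in range(n_steps)]
    return int(sum(received) > n_steps / 2)

trials = 10_000
for n_steps in (1, 5, 21):
    success = sum(send_bit(1, n_steps) == 1 for _ in range(trials)) / trials
    print(f"{n_steps:>2} steps: bit recovered {success:.1%} of the time")
```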

Turning a paraphraser into a canonical-paraphraser probably does something like force details into "always removed" vs "never removed" based on a threshold; if '!' and '.' are swapped with 10% probability in the initial paraphraser, then they're always swapped in the final paraphraser; if they're swapped with only 1% probability in the initial paraphraser, then they are never swapped in the final paraphraser.

(Of course this oversimplifies things a lot, and depends completely on how the training to turn paraphrasers into canonical-paraphrasers works.)

I think the only solution to the '.' vs '!' problem, fundamentally, is to train the paraphraser to modify those patterns. If the initial paraphraser doesn't ever change specific details, then those details can be used for steganography. If the paraphraser does change those details, then it is already hard to use them for steganography. So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

Comment by abramdemski on 5 ways to improve CoT faithfulness · 2024-11-15T18:40:37.014Z · LW · GW

Note that the optimized planner is only rewarded for producing steps that help the frozen planner. Since the frozen planner doesn't understand steganography, there is no reason for the optimized planner to learn to use it.

I think this conclusion is overstated. The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples. The optimized planner is very much like an adversarial-example generator, in the way it is trained to produce outputs which nudge a frozen NN in a specific direction.

Imagine reversing the roles of the optimized vs frozen planner during training: the frozen planner is used for a random number of steps, and then we switch to the optimized planner. Then, it really is the case that the optimized planner has no incentive to use steganography to nudge the frozen planner, since the frozen planner does not get the chance to react at all to the optimized planner.

The downside of this training method is that we can't use your suggestion to only reinforce the very last step rather than all the optimized-planner steps. 

Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT. 
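To make the two variants concrete, here is a sketch where the `frozen` and `optimized` callables stand in for the respective planners (purely illustrative; nothing here is tied to the actual Incremental Steering setup):

```python
import random
from typing import Callable, List

Planner = Callable[[str, List[str]], str]  # (problem, steps so far) -> next step

def rollout_frozen_then_optimized(problem: str, frozen: Planner, optimized: Planner,
                                  n_steps: int) -> List[str]:
    """Frozen planner writes a random-length prefix; the optimized planner finishes.
    No frozen-planner step ever reads optimized-planner output, so there is nothing
    for subtle adversarial-example-style patterns to nudge."""
    switch_at = random.randint(0, n_steps)
    steps: List[str] = []
    for i in range(n_steps):
        planner = frozen if i < switch_at else optimized
        steps.append(planner(problem, steps))
    return steps

def rollout_single_optimized_step(problem: str, frozen: Planner, optimized: Planner,
                                  n_steps: int) -> List[str]:
    """Alternative: the optimized planner writes exactly one randomly chosen step,
    and the frozen planner writes all the others."""
    chosen = random.randrange(n_steps)
    steps: List[str] = []
    for i in range(n_steps):
        planner = optimized if i == chosen else frozen
        steps.append(planner(problem, steps))
    return steps
```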

All of these ideas still leave some incentive to use steganography in the adversarial-example sense. At the very least, the final answer (after the CoT) might be influenced by some subtle non-human-readable patterns in the CoT, so the optimized planner will be incentivized to optimize those patterns. In my view, what Incremental Steering does is two things:

  1. It prevents the model from inventing new steganography; it can only leverage subtle patterns which already influence the LLM before this step in the training.
  2. It makes it very difficult for the model to learn steganography which requires coordination over multiple steps. It is not being rewarded for such coordination.

Comment by abramdemski on o1 is a bad idea · 2024-11-15T17:32:38.207Z · LW · GW

Thanks for this response! I agree with the argument. I'm not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).