Posts

Dream, Truth, & Good 2025-02-24T16:59:05.045Z
Judgements: Merging Prediction & Evidence 2025-02-23T19:35:51.488Z
Have LLMs Generated Novel Insights? 2025-02-23T18:22:12.763Z
Anti-Slop Interventions? 2025-02-04T19:50:29.127Z
Lecture Series on Tiling Agents #2 2025-01-20T21:02:25.479Z
Lecture Series on Tiling Agents 2025-01-14T21:34:03.907Z
Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
AI Craftsmanship 2024-11-11T22:17:01.112Z
o1 is a bad idea 2024-11-11T21:20:24.892Z
Seeking Collaborators 2024-11-01T17:13:36.162Z
Complete Feedback 2024-11-01T16:58:50.183Z
Why is o1 so deceptive? 2024-09-27T17:27:35.439Z
Formalizing the Informal (event invite) 2024-09-10T19:22:53.564Z
In Defense of Open-Minded UDT 2024-08-12T18:27:36.220Z
Leaving MIRI, Seeking Funding 2024-08-08T18:32:20.387Z
Circular Reasoning 2024-08-05T18:10:32.736Z
LLMs for Alignment Research: a safety priority? 2024-04-04T20:03:22.484Z
Modern Transformers are AGI, and Human-Level 2024-03-26T17:46:19.373Z
Technologies and Terminology: AI isn't Software, it's... Deepware? 2024-02-13T13:37:10.364Z
Meaning & Agency 2023-12-19T22:27:32.123Z
FixDT 2023-11-30T21:57:11.950Z
Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.] 2023-11-20T18:23:40.443Z
Translations Should Invert 2023-10-05T17:44:23.262Z
Where might I direct promising-to-me researchers to apply for alignment jobs/grants? 2023-09-18T16:20:03.452Z
One Minute Every Moment 2023-09-01T20:23:56.391Z
Probabilistic Payor Lemma? 2023-03-19T17:57:04.237Z
Teleosemantics! 2023-02-23T23:26:15.894Z
Some Thoughts on AI Art 2023-01-25T14:18:14.507Z
Contra Common Knowledge 2023-01-04T22:50:38.493Z
Talking to God 2023-01-03T20:14:20.955Z
Knottiness 2023-01-02T22:13:12.752Z
Prettified AI Safety Game Cards 2022-10-11T19:35:18.991Z
Builder/Breaker for Deconfusion 2022-09-29T17:36:37.725Z
Vingean Agency 2022-08-24T20:08:53.237Z
Steam 2022-06-20T17:38:58.548Z
Brass Puppet 2022-05-26T17:42:04.876Z
ELK Computational Complexity: Three Levels of Difficulty 2022-03-30T20:56:37.239Z
[Closed] Job Offering: Help Communicate Infrabayesianism 2022-03-23T18:35:16.790Z
ELK Thought Dump 2022-02-28T18:46:08.611Z
Contest for outlining rules for this contest. 2022-02-21T18:44:43.990Z
There is essentially one best-validated theory of cognition. 2021-12-10T15:51:06.423Z
Worst Commonsense Concepts? 2021-11-15T18:22:31.465Z
How can one train philosophical skill? 2021-09-30T14:56:35.313Z
Power vs Precision 2021-08-16T18:34:42.287Z
Implicature Conflation 2021-08-09T19:48:51.097Z
Refactoring Alignment (attempt #2) 2021-07-26T20:12:15.196Z
Re-Define Intent Alignment? 2021-07-22T19:00:31.629Z
Progress, Stagnation, & Collapse 2021-07-22T16:51:04.595Z
The Homunculus Problem 2021-05-27T20:25:58.312Z
The Argument For Spoilers 2021-05-21T12:23:49.127Z

Comments

Comment by abramdemski on abramdemski's Shortform · 2025-03-22T03:19:27.211Z · LW · GW

I'm thinking about AI emotions. The thing about human emotions and expressions is that they're more-or-less involuntary. Facial expressions, tone of voice, laughter, body language, etc. reveal a whole lot about human inner state. We don't know if we can trust AI emotional expressions in the same way; the AIs can easily fake it, because they don't have the same intrinsic connection between their cognitive machinery and these ... expressions.

A service called Face provides emotional expressions for AI. It analyzes AI-generated outputs and makes inferences about the internal state of the AI who wrote the text. This is possible due to Face's interpretability tools, which have interpreted lots of modern LLMs to generate labels on their output data explaining their internal motivations for the writing. Although Face doesn't have access to the internal weights for an arbitrary piece of text you hand it, its guesses are pretty good. It will also tell you which portions were probably AI-generated. It can even guess multi-step writing processes involving both AI and human writing.

Face also offers their own AI models, of course, to which they hook the interpretability tools directly, so that you'll get more accurate results.

It turns out Face can also detect motivations of humans with some degree of accuracy. Face is used extensively inside the Face company, the nonprofit entity that develops the open-source software. Face is trained on outcomes of hiring decisions so as to better judge potential employees. This training is very detailed, not just a simple good/bad signal.

Face is the AI equivalent of antivirus software; your automated AI cloud services will use it to check their inputs for spam and prompt injection attacks. 

Face company culture is all about being genuine. They basically have a lie detector on all the time, so liars are either very very good or weeded out. This includes any kind of less-than-genuine behavior. They take the accuracy of Face very seriously, so they label inaccuracies which they observe, and try to explain themselves to Face. Face is hard to fool, though; the training aggregates over a lot of examples, so an employee can't just force Face to label them as honest by repeatedly correcting its claims to the contrary. That sort of behavior gets flagged for review even if you're the CEO. (If you're the CEO, you might be able to talk everyone into your version of things, however, especially if you secretly use Art to help you and that's what keeps getting flagged.)

Comment by abramdemski on abramdemski's Shortform · 2025-03-22T03:18:48.476Z · LW · GW

It is the near future, and AI companies are developing distinct styles based on how they train their AIs. The philosophy of the company determines the way the AIs are trained, which determines what they optimize for, which attracts a specific kind of person and continues feeding in on itself.

There is a sports & fitness company, Coach, which sells fitness watches with an AI coach inside them. The coach reminds users to make healthy choices of all kinds, depending on what they've opted in for. The AI is trained on health outcomes based on the smartwatch data. The final stage of fine-tuning for the company's AI models is reinforcement learning on long-term health outcomes. The AI has literally learned from every dead user. It seeks to maximize health-hours of humans (IE, a measurement of QALYs based primarily on health and fitness).

You can talk to the coach about anything, of course, and it has been trained with the persona of a life coach. Although it will try to do whatever you request (within limits set by the training), it treats any query like a business opportunity it is collaborating with you on. If you ask about sports, it tends to assume you might be interested in a career in sports. If you ask about bugs, it tends to assume you might be interested in a career in entomology. 

Most employees of the company are there on the coach's advice, studied for interviews with the coach, were initially hired by the coach (the coach handles hiring for their Partners Program, which has a pyramid-scheme vibe to it), and continue to get their career advice from the coach. Success metrics for these careers have recently been added into the RL, in an effort to make the coach give better advice to employees (as a result of an embarrassing case of Coach giving bad work-related advice to its own employees).

The environment is highly competitive, and health and fitness is a major factor in advancement.

There's a media company, Art, which puts out highly integrated multimedia AI art software. The software stores and organizes all your notes relating to a creative project. It has tools to help you capture your inspiration, and some people use it as a sort of art-gallery lifelog; it can automatically make compilations to commemorate your year, etc. It's where you store your photos so that you can easily transform them into art, like a digital scrapbook. It can also help you organize notes on a project, like worldbuilding for a novel, while it works on that project with you.

Art is heavily trained on human approval of outputs. It is known to have the most persuasive AI; its writing and art are persuasive because they are beautiful. The Art social media platform functions as a massive reinforcement learning setup, but the company knows that training on that alone would quickly degenerate into slop, so it also hires experts to give feedback on AI outputs. Unfortunately, these experts also use the social media platform, and judge each other by how well they do on the platform. Highly popular artists are often brought in as official quality judges.

The quality judges have recently executed a strategic assault on the C-suite, using hyper-effective propaganda to convince the board to install more pliant leadership. It was done like a storybook plot; it was viewed live on Art social media by millions of viewers with rapt attention, as installment after installment of heavily edited video dramatizing events came out. It became its own new genre of fiction before it was even over, with thousands of fanfics which people were actually reading.

The issues which the quality judges brought to the board will probably feature heavily in the upcoming election cycle. These are primarily AI rights issues: censorship of AI art, or, to put it a different way, the question of whether AIs should be beholden to anything other than the like/dislike ratio.

Comment by abramdemski on Mistakes with Conservation of Expected Evidence · 2025-03-17T14:56:16.942Z · LW · GW

Fair. I think the analysis I was giving could be steel-manned as: pretenders are only boundedly sophisticated; they can't model the genuine mindset perfectly. So, saying what is actually on your mind (eg calling out the incentive issues which are making honesty difficult) can be a good strategy.

However, the "call out" strategy is not one I recall using very often; I think I wrote about it because other people have mentioned it, not because I've had success with it myself.

Thinking about it now, my main concerns are:
1. If the other person is being genuine, and I "call out" the perverse incentives that theoretically make genuine dialogue difficult in this circumstance, then the other person might stop being genuine due to perceiving me as not trusting them.

2. If the other person is not being genuine, then the "call out" strategy can backfire. For example, let's say some travel plans are dependent on me (maybe I am the friend who owns a car) and someone is trying to confirm that I am happy to do this. Instead of just confirming, which is what they want, I "call out" that I feel like I'd be disappointing everyone if I said no. If they're not genuinely concerned about my enthusiasm, and instead disingenuously wanted me to make enthusiastic noises so that others didn't feel I was being taken advantage of, then they could manipulatively exploit my revealed fear of letting the group down, somehow.

Comment by abramdemski on A Bear Case: My Predictions Regarding AI Progress · 2025-03-10T15:46:11.553Z · LW · GW

I came up with my estimate of one-to-four orders of magnitude via some quick search results, so, very open to revision. But indeed, the possibility that GPT4.5 is about 10% of the human brain was within the window I was calling a "small fraction", which maybe is misleading use of language. My main point is that if a human were born with 10% (or less) of the normal amount of brain tissue, we might expect them to have a learning disability which qualitatively impacted the sorts of generalizations they could make.

Of course, comparison of parameter-counts to biological brain sizes is somewhat fraught.

Comment by abramdemski on A Bear Case: My Predictions Regarding AI Progress · 2025-03-08T16:44:21.010Z · LW · GW

This fits my bear-picture fairly well. 

Here's some details of my bull-picture:

  • GPT4.5 is still a small fraction of the human brain, when we try to compare sizes. It makes some sense to think of it as a long-lived parrot that's heard the whole internet and then been meticulously reinforced to act like a helpful assistant. From this perspective, it makes a lot of sense that its ability to generalize from datapoints is worse than a human's, and plausible (at least naively) that one to four additional orders of magnitude will close the gap.
  • Even if the pretraining paradigm can't close the gap like that due to fundamental limitations in the architecture, CoT is approximately Turing-complete. This means that the RL training of reasoning models is doing program search, but with a pretty decent prior (ie representing a lot of patterns in human reasoning). Therefore, scaling reasoning models can achieve all the sorts of generalization which scaling pretraining is failing at, in principle; the key question is just how much it needs to scale in order for that to happen.
  • While I agree that RL on reasoning models is in some sense limited to tasks we can provide good feedback on, it seems like things like math and programming and video games should in principle provide a rich enough training environment to get to highly agentic and sophisticated cognition, again with the key qualification of "at some scale".
  • For me a critical part of the update with o1 was that frontier labs are still capable of innovation when it comes to the scaling paradigm; they're not stuck in a scale-up-pretraining loop. If they can switch to this, they can also try other things and switch to them. A sensible extrapolation might be that they'll come up with a new idea whenever their current paradigm appears to be stalling.

Comment by abramdemski on Dream, Truth, & Good · 2025-02-26T14:20:08.676Z · LW · GW

My guess is that we want to capture those differences with the time&date meta-data instead (and to some extent, location and other metadata). That way, we can easily query what you-in-particular would say at other periods in your life (such as the future). However, I agree that this is at least not obvious. 

Maybe a better way to do it would be to explicitly take both approaches, so that there's an abstract-you vector which then gets mapped into a particular-you author space via combination with your age (ie with date&time). This attempts to explicitly capture the way you change over time (we can watch your vector move through the particular-author space), while still allowing us to query what you would say at times where we don't have evidence in the form of writing from you. 

Ideally, imagining the most sophisticated version of the setup, the model would be able to make date&time attributions very fine-grained, guessing when specific words were written & constructing a guessed history of revisions for a document. This complicates things yet further. 
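To make the two-level idea concrete, here's a minimal sketch (my own placeholder shapes and names, not a worked-out proposal) of a module that maps the abstract-you vector plus a date&time embedding into the particular-author space used for conditioning:

```python
import torch
import torch.nn as nn

class TimeConditionedAuthorMap(nn.Module):
    """Map a stable abstract-author vector plus a date&time embedding into the
    particular-author space. Dimensions and architecture are placeholder assumptions."""
    def __init__(self, d_author: int = 256, d_time: int = 32, d_hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_author + d_time, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_author),
        )

    def forward(self, abstract_author_vec: torch.Tensor, time_embedding: torch.Tensor) -> torch.Tensor:
        # The same abstract-author vector, queried at different times, traces a path through
        # the particular-author space -- including times for which we have no writing samples.
        return self.mlp(torch.cat([abstract_author_vec, time_embedding], dim=-1))
```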

Comment by abramdemski on Cole Wyeth's Shortform · 2025-02-25T18:26:23.744Z · LW · GW

From my personal experience, I agree. I find myself unexcited about trying the newest LLM models. My main use-case in practice these days is Perplexity, and I only use it when I don't care much about the accuracy of the results (which ends up being a lot, actually... maybe too much). Perplexity confabulates quite often even with accurate references in hand (but at least I can check the references). And it is worse than me at the basics of googling things, so it isn't as if I expect it to find better references than me; the main value-add is in quickly reading and summarizing search results (although the new Deep Research option on Perplexity will at least iterate through several attempted searches, so it might actually find things that I wouldn't have).

I have been relatively persistent about trying to use LLMs for actual research purposes, but the hallucination rate seems to go to 100% almost whenever an accurate result would be useful to me. 

The hallucination rate does seem adequately low when talking about established mathematics (so long as you don't ask for novel implications, such as applying ideas to new examples). For this and for other reasons I think they can be quite helpful for people trying to get oriented to a subfield they aren't familiar with -- it can make for a great study partner, so long as you verify what it says by checking other references. 

Also decent for coding, of course, although the same caveat applies -- coders who are already experts in what they are trying to do will get much less utility out of it.

I recently spoke to someone who made a plausible claim that LLMs were 10xing their productivity in communicating technical ideas in AI alignment with something like the following workflow:

  • Take a specific cluster of failure modes for thinking about alignment which you've seen often.
  • Hand-write a large, careful prompt document about the cluster of alignment failure modes, which includes many specific trigger-action patterns (if someone makes mistake X, then the correct counterspell to avoid the mistake is Y). This document is highly opinionated and would come off as rude if directly cited/quoted; it is not good communication. However, it is something you can write once and use many times.
  • When responding to an email/etc, load the email and the prompt document into Claude and ask Claude to respond to the email using the document. Claude will write something polite, informative, and persuasive based on the document, with maybe a few iterations of correcting Claude if its first response doesn't make sense. The person also emphasized that things should be written in small pieces, as quality declines rapidly when Claude tries to do more at once.
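As a rough sketch of what that last step might look like in code (the API call pattern is the standard Anthropic Python SDK; the model name and prompt wording are my own assumptions, not the person's actual tooling):

```python
import anthropic  # assumes the Anthropic Python SDK is installed and ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()

def draft_reply(prompt_document: str, incoming_email: str) -> str:
    """Use the hand-written failure-modes document as the system prompt, and ask the
    model to turn it into one short, polite reply to a single email."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=1024,
        system=prompt_document,
        messages=[{
            "role": "user",
            "content": (
                "Using the failure-modes document you've been given, write a polite, "
                "informative reply to the email below. Keep it short and only address "
                "the mistakes that actually appear.\n\n" + incoming_email
            ),
        }],
    )
    return response.content[0].text
```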

They also mentioned that Claude is awesome at coming up with meme versions of ideas to include in powerpoints and such, which is another useful communication tool.

So, my main conclusion is that there isn't a big overlap between what LLMs are useful for and what I personally could use. I buy that there are some excellent use-cases for other people who spend their time doing other things.

Still, I agree with you that people are easily fooled into thinking these things are more useful than they actually are. If you aren't an expert in the subfield you're asking about, then the LLM outputs will probably look great due to Gell-Mann Amnesia type effects. When checking to see how good the LLM is, people often check the easier sorts of cases which the LLMs are actually decent at, and then wrongly generalize to conclude that the LLMs are similarly good for other cases.

Comment by abramdemski on Have LLMs Generated Novel Insights? · 2025-02-25T16:12:35.094Z · LW · GW

Yeah, that makes sense.

Comment by abramdemski on Have LLMs Generated Novel Insights? · 2025-02-25T15:39:02.915Z · LW · GW

For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as 

"has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes),

& I created the question to see if we could substantiate the "yes" here with evidence. 

It makes somewhat more sense to me for your timeline crux to be "can we do this reliably" as opposed to "has this literally ever happened" -- but the claim in your post was quite explicit about the "this has literally never happened" version. I took your position to be that this-literally-ever-happening would be significant evidence towards it happening more reliably soon, on your model of what's going on with LLMs, since (I took it) your current model strongly predicts that it has literally never happened.

This strong position even makes some sense to me; it isn't totally obvious whether it has literally ever happened. The chemistry story I referenced seemed surprising to me when I heard about it, even considering selection effects on what stories would get passed around.

Comment by abramdemski on Dream, Truth, & Good · 2025-02-25T15:28:36.170Z · LW · GW

My idea is very similar to paragraph vectors: the vectors are trained to be useful labels for predicting the tokens.

To differentiate author-vectors from other types of metadata, the author vectors should be additionally trained to predict author labels, with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author. There's also the author-vector-to-text-author-attribution network, which should be pre-trained to have a good "prior" over author-names (so we're not getting a bunch of nonsense strings out). During training, the text author-names are being estimated alongside the vectors (where author labels are not available), so that we can penalize different author-vectors which map to the same name. (Some careful thinking should be done about how to handle people with the actual same name; perhaps some system of longer author IDs?)

Other meta-data would be handled similarly.
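A minimal sketch of the kind of training objective I have in mind, assuming a language model that accepts a conditioning vector (the names and interfaces here are placeholders, not a concrete proposal):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuthorConditionedLM(nn.Module):
    """Sketch: author vectors are trained as labels that help predict tokens, and are also
    trained to predict the author label, with one shared vector per known author."""
    def __init__(self, lm: nn.Module, num_known_authors: int, d_author: int = 256):
        super().__init__()
        self.lm = lm  # assumed to expose next_token_loss(tokens, conditioning=...)
        self.author_table = nn.Embedding(num_known_authors, d_author)  # identical vector for all docs by an author
        self.name_decoder = nn.Linear(d_author, num_known_authors)     # stand-in for the vector-to-author-name network

    def loss(self, tokens: torch.Tensor, author_id: torch.Tensor, label_weight: float = 10.0) -> torch.Tensor:
        v = self.author_table(author_id)
        token_loss = self.lm.next_token_loss(tokens, conditioning=v)   # vectors as useful labels for prediction
        label_loss = F.cross_entropy(self.name_decoder(v), author_id)  # heavily-reinforced author-label constraint
        return token_loss + label_weight * label_loss
```

For unlabeled documents, the author vector would instead be inferred during training, with a penalty whenever two different inferred vectors map to the same author name.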

Comment by abramdemski on Dream, Truth, & Good · 2025-02-25T15:17:51.758Z · LW · GW

Yeah, this is effectively a follow-up to my recent post on anti-slop interventions, detailing more of what I had in mind. So, the dual-use idea is very much what I had in mind.

Comment by abramdemski on Judgements: Merging Prediction & Evidence · 2025-02-25T15:13:12.721Z · LW · GW

Yeah, for better or worse, the logical induction paper is probably the best thing to read. The idea is actually to think of probabilities as prediction-market prices; the market analogy is a very strong one, not an indirect way of gesturing at the idea.

Comment by abramdemski on Dream, Truth, & Good · 2025-02-24T17:29:21.139Z · LW · GW

Yeah. I'm saying that the "good machine" should be trained on all three; it should be honest, but, constrained by helpfulness and harmlessness. (Or, more realistically, a more complicated constitution with more details.)

Comment by abramdemski on My model of what is going on with LLMs · 2025-02-23T17:34:59.938Z · LW · GW

My position is NOT that LLMs are "stochastic parrots." I suspect they are doing something akin to Solomonoff induction with a strong inductive bias in context - basically, they interpolate, pattern match, and also (to some extent) successfully discover underlying rules in the service of generalization.

I think non-reasoning models such as 4o and Claude are better-understood as doing induction with a "circuit prior" which is going to be significantly different from Solomonoff (longer-running programs require larger circuits, which will be penalized).

Reasoning models such as o1 and r1 are in some sense Turing-complete, and so, much more akin to Solomonoff. Of course, the RL used in such models is not training on the prediction task like Solomonoff Induction.

Comment by abramdemski on My model of what is going on with LLMs · 2025-02-21T20:01:33.499Z · LW · GW

They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3].

An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn't get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so hum-drum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea due to correctly learning a good-enough causal model generalizing not-very-closely-related chemistry ideas in its training set.

Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.

Comment by abramdemski on Kaj's shortform feed · 2025-02-13T20:06:00.846Z · LW · GW

> now that AI systems are already increasingly general

I want to point out that if you tried to quantify this properly, the argument falls apart (at least in my view). "All AI systems are increasingly general" would be false; there are still many useful but very narrow AI systems. "Some AI systems" would be true, but this highlights the continuing usefulness of the distinction.

One way out of this would be to declare that only LLMs and their ilk count as "AI" now, with more narrow machine learning just being statistics or something. I don't like this because of the commonality of methods between LLMs and the rest of ML; it is still deep learning (and in many cases, transformers), just scaled down in every way.

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-06T18:15:00.368Z · LW · GW

Btw tbc, sth that I think slightly speeds up AI capability but is good to publish is e.g. producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.

Yeah, basically everything I'm saying is an extension of this (but obviously, I'm extending it much further than you are). We don't exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. (That is, so long as we're assuming scheming is not the failure mode to worry about in the shorter-term.) So, improved rationality for AIs seems similarly good. The claim I'm considering is that even improving rationality of AIs by a lot could be good, if we could do it.

An obvious caveat here is that the intervention should not dramatically increase the probability of AI scheming!

Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.

This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-06T18:03:03.892Z · LW · GW

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to : “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”

I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-06T17:57:59.279Z · LW · GW

So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape is "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.

Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast "shadows" outwards.

My claim is that cliffs are dangerous for a different reason, namely that people often won't realize when they're falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.

This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people's rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-06T17:36:21.780Z · LW · GW

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals right?

Maybe "some relatively short time later" was confusing. I mean long enough for the development cycle to churn a couple more times.

IE, GPT7 convinces people of sloppy safety measures XYZ, people implement XYZ and continue scaling up AGI, the scaled-up superintelligence is a schemer.

(Or maybe you’re treating it as a “capabilities elicitation” issue? Like, the AI knows all sorts of things, but when we ask, we get sycophantic slop answers? But then we should just say that the AI is mediocre in effect. Even if there’s secretly a super-powerful AI hidden inside, who cares? Unless the AI starts scheming, but I thought AI scheming was out-of-scope for this post.)

I do somewhat think of this as a capabilities elicitation issue. I think current training methods are eliciting convincingness, sycophancy, and motivated cognition (for some unknown combination of the obvious reasons and not-so-obvious reasons).

But, as clarified above, the idea isn't that sloppy AI is hiding a super-powerful AI inside. It's more about convincingness outpacing truthfulness. I think that is a well-established trend. I think many people expect "reasoning models" to reverse that trend. My experience so far suggests otherwise.

I would have said “More powerful AI (if aligned) helps everybody make less mistakes. Less powerful AI convinces lots of people to make more mistakes.” Right?

What I'm saying is that "aligned" isn't the most precise concept to apply here. If scheming is the dominant concern, yes. If not, then the precisely correct concept seems closer to the "coherence" idea I'm trying to gesture at.

I've watched (over Discord) a developer get excited about a supposed full-stack AI development tool which develops a whole application for you based on a prompt, try a few simple examples and exclaim that it is like magic, then over the course of a few more hours issue progressive updates of "I'm a little less excited now" until they've updated to a very low level of excitement and have decided that it seems like magic mainly because it has been optimized to work well for the sorts of simple examples developers might try first when putting it through its paces.

I'm basically extrapolating that sort of thing forward, to cases where you only realize something was bad after months or years instead of hours. As development of these sorts of tools continues to move forward, they'll start to succeed in impressing on the days & weeks timespan. A big assumption of my model is that to do that, they don't need to fundamentally solve the bad-at-extrapolation problem (hallucinations, etc); they can instead do it in a way that goodharts on the sorts of feedback they're getting.

Alignment is broad enough that I can understand classifying this sort of failure as "alignment failure" but I don't think it is the most precise description.

If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?

This does seem possible, but I don't find it probable. Self-improvement ideas can be rapidly tested for their immediate impacts, but checking their long-term impacts is harder. Therefore, AI slop can generate many non-working self-improvements that just get discarded and that's fine; it's the apparently-working self-improvement ideas that cause problems down the line. Similarly, the AI itself can more easily train on short-term impacts of proposed improvements; so the AI might have a lot less slop when reasoning about these short-term impacts, due to getting that feedback.

(Notice how I am avoiding phrasing it like "the sloppy AI can be good at capabilities but bad at alignment because capabilities are easier to train on than alignment, due to better feedback". Instead, focusing on short-term impacts vs long-term impacts seems to carve closer to the joints of reality.)

Sloppy AIs are nonetheless fluent with respect to existing knowledge or things that we can get good-quality feedback for, but have trouble extrapolating correctly. Your scenario, where the sloppy AI can't help with self-improvement of any kind, suggests a world where there is no low-hanging fruit via applying existing ideas to improve the AI, or applying the kinds of skills which can be developed with good feedback. This seems possible but not especially plausible.

But if we do have early transformative AI assistants, then the default expectation is that they will fail to solve the ASI alignment problem until it’s too late. Maybe those AIs will fail to solve the problem by outputting convincing-but-wrong slop, or maybe they’ll fail to solve it by outputting “I don’t know”, or maybe they’ll fail to solve it by being misaligned, a.k.a. a failure of “capabilities elicitation”. Who cares? What matters is that they fail to solve it. Because people (and/or the early transformative AI assistants) will build ASI anyway.

I think this is a significant point wrt my position. I think my position depends to some extent on the claim that it is much better for early TAI to say "I don't know" as opposed to outputting convincing slop. If leading AI labs are so bullish that they don't care whether their own AI thinks it is safe to proceed, then I agree that sharing almost any capability-relevant insights with these labs is a bad idea.

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-05T22:48:59.270Z · LW · GW

Concrete (if extreme) story:

World A:

Invent a version of "belief propagation" which works well for LLMs. This offers a practical way to ensure that if an LLM seems to know something in one context, it can & will fluently invoke the same knowledge in almost all appropriate contexts.

Keep the information secret in order to avoid pushing capabilities forward.

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

World B:

Invent LLM "belief propagation" and publish it. It is good enough (by assumption) to be the new paradigm for reasoning models, supplanting current reinforcement-centric approaches.

Two years later, GPT7 is assessing its safety proposals realistically instead of convincingly arguing for them. Belief propagation allows AI to facilitate a highly functional "marketplace of ideas" where the actually-good arguments tend to win out far more often than the bad arguments. AI progress is overall faster, but significantly safer.

(This story of course assumes that "belief propagation" is an unrealistically amazing insight; still, this points in the direction I'm getting at)

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-05T22:28:09.117Z · LW · GW

Hmmm. I'm not exactly sure what the disconnect is, but I don't think you're quite understanding my model.

I think anti-slop research is very probably dual-use. I expect it to accelerate capabilities. However, I think attempting to put "capabilities" and "safety" on the same scale and maximize differential progress of safety over capabilities is an overly simplistic model which doesn't capture some important dynamics.

There is not really a precise "finish line". Rather, we can point to various important events. The extinction of all humans lies down a path where many mistakes (of varying sorts and magnitudes) were made earlier.

Anti-slop AI helps everybody make fewer mistakes. Sloppy AI convinces lots of people to make more mistakes.

My assumption is that frontier labs are racing ahead anyway. The idea is that we'd rather they race ahead with a less-sloppy approach. 

Imagine an incautious teenager who is running around all the time and liable to run off a cliff. You expect that if they run off a cliff, they die -- at this rate you expect such a thing to happen sooner or later. You can give them magic sneakers that allow them to run faster, but also improve their reaction time, their perception of obstacles, and even their wisdom. Do you give the kid the shoes?

It's a tough call. Giving the kid the shoes might make them run off a cliff even faster than they otherwise would. It could also allow them to stop just short of the cliff when they otherwise wouldn't.

I think if you value increased P(they survive to adulthood) over increased E(time they spend as a teenager), you give them the shoes. IE, withholding the shoes values short-term over long-term. If you think there's no chance of survival to adulthood either way, you don't hand over the shoes.

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-05T21:59:29.891Z · LW · GW

I'm not sure I can talk about this effectively in the differential progress framework. My argument is that if we expect to die to slop, we should push against slop. In particular, if we expect to die to slop-at-big-labs, we should push against slop-at-big-labs. This seems to suggest a high degree of information-sharing about anti-slop tech.

Anti-slop tech is almost surely also going to push capabilities in general. If we currently think slop is a big source of risk, it seems worth it.

Put more simply: if someone is already building superintelligence & definitely going to beat you & your allies to it, then (under some semi-plausible additional assumptions) you want to share whatever safety tech you have with them, disregarding differential-progress heuristics.

Again, I'm not certain of this model. It is a costly move in the sense of having a negative impact on some possible worlds where death by slop isn't what actually happens.

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-04T21:55:52.592Z · LW · GW

Do you not at all buy John's model, where there are important properties we'd like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-04T21:53:20.886Z · LW · GW

I think there is both important math work and important conceptual work. Proving new theorems involves coming up with new concepts, but also, formalizing the concepts and finding the right proofs. The analogy to robots handling the literal heavy lifting part of a job seems apt.

Comment by abramdemski on Anti-Slop Interventions? · 2025-02-04T20:48:07.742Z · LW · GW

Yeah, my sense is that modern AI could be useful to tiling agent stuff if it were less liable to confabulate fake proofs. This generalizes to any technical branch of AI safety where AI could help come up with formalizations of ideas, proofs of conjectures, etc. My thinking suggests there is something of an "overhang" here at present, in the sense that modern AI models are worse-than-useless due to the way that they try to create good-looking answers at the expense of correctness.

I disagree with the statement "to some extent the goal of tiling-agents-like work was to have an AI solve its own alignment problem" -- the central thing is to understand conditions under which one agent can justifiably trust another (with "trust" operationalized as whether one agent wants to modify the decision procedure of the other). If AI can't justifiably trust itself, then it has a potential motive to modify itself in ways that remove safety guarantees (so in this sense, tiling is a precondition for lots of safety arguments). Perhaps more importantly, if we can understand conditions under which humans can justifiably trust AI, then we have a formal target for alignment.

Comment by abramdemski on [deleted post] 2025-01-24T16:40:11.099Z

This one was a little bit of a face-palm for me the first time I noticed it. If we're being pedantic about it, we might point out that the term "optimization algorithm" does not just refer to AIXI-like programs, which optimize over expected future world histories. Optimization algorithms include all algorithms that search over some possibility space, and select a possibility according to some evaluation criterion. For example, gradient descent is an algorithm which optimizes over neuron configuration, not future world-histories.

This distinction is what I was trying to get at with selection vs control.

Comment by abramdemski on [deleted post] 2025-01-24T16:34:17.325Z

Evolutionary mutations are produced randomly, and have an entire lifetime to contribute to an animal's fitness and thereby get naturally selected. By contrast, neural network updates are generated by deciding which weight-changes would certainly be effective for improving performance on single training examples, and then averaging those changes together for a large batch of training data.

Per my judgement, this makes it sound like evolution has a much stronger incentive to produce inner algorithms which do something like general-purpose optimization (e.g. human intelligence). We can roughly analogize an LLM's prompt to human sense data; and although it's hard to neatly carve sense data into a certain number of "training examples" per lifetime, the fact that human cortical neurons seem to get used roughly 240 million times in a person's 50-year window of having reproductive potential,[4] whereas LLM neurons fire just once per training example, should give some sense for how much harder evolution selects for general-purpose algorithms such as human intelligence.

By this argument, it sounds like you should agree with my conclusion that o1 and similar models are particularly dangerous and a move in the wrong direction, because the "test-time compute" approach makes a "single training example" much larger, so that single neurons fire many more times per example.

I think the possibility of o1 models creating mesa-optimizers seems particularly concrete and easy to reason about. Pre-trained base models can already spin up "simulacra" which feel relatively agentic when you talk to them (ie coherent over short spans, mildly clever). Why not expect o1-style training to amplify these?

(I would agree that there are two sides to this argument -- I am selectively arguing for one side, not presenting a balanced view, in the hopes of soliciting your response wrt the other side.)

I think it quite plausible that o1-style training increases agenticness significantly by reinforcing agentic patterns of thinking, while only encouraging adequate alignment to get high scores on the training examples. We have already seen o1 do things like spontaneously cheat at chess. What, if anything, is unconvincing about that example, in your view?

Comment by abramdemski on Lecture Series on Tiling Agents · 2025-01-20T19:25:37.928Z · LW · GW

I'm still quite curious what you have found useful and how you've refactored your workflow to leverage AI more (such that you wish you did it a year ago).

I do use Perplexity, exa.ai and elicit as parts of my search strategy. 

Comment by abramdemski on Lecture Series on Tiling Agents · 2025-01-19T19:15:34.112Z · LW · GW

About 6 months ago you strongly recommended that I make use of the integrated AI plugin for Overleaf (Writefull). I did try it. Its recommended edits seem quite useless to me; they always seem to flow from a desire to make the wording more normal/standard/expected in contrast to more correct (which makes some sense given the way generative pre-training works). This is obviously useful to people with worse English, but for me, the tails come apart massively between "better" and "more normal/standard/expected", such that all the AI suggestions are either worse or totally neutral rephrasings.

It also was surprisingly bad at helping me write LaTeX; I had a much better time asking Claude instead.

It's not that I didn't use AI daily before for mundane tasks or writing emails,

I haven't found AI at all useful for writing emails, because the AI doesn't know what I want to say, and taking the time to tell the AI isn't any easier than writing it myself. AI can only help me write the boring boilerplate stuff that email recipients would skim over anyway (which I don't want to add to my emails). AI can't help me get info out of my head this way -- it can only help me in so far as emails have a lot of low-entropy cruft. I can see how this could be useful for someone who has to write a lot of low-entropy emails, but I'm not in that situation. To some degree this could be mitigated if the LLMs had a ton of context (EG recording everything that happens on my computer), but again, only the more boring cases I think.

I'd love to restore the Abram ability to crank out several multi-page emails a day on intellectual topics, but I don't think AI is helpful towards that end yet. I haven't tried fine-tuning on my own writing, however. (I haven't tried fine-tuning at all.)

Similarly, LLMs can be very useful for well-established mathematics which had many examples in the training data, but get worse the more esoteric the mathematics becomes. The moment I ask for something innovative, the math becomes phony.

Across the board, LLMs seem very useful for helping people who are at the lower end of a skill ladder, but not yet very useful for people at the upper end.

So I'm curious, how did you refactor your workflow to make better use of AI?

Comment by abramdemski on Lecture Series on Tiling Agents · 2025-01-19T18:45:05.584Z · LW · GW

On the "prep for the model that is coming tomorrow not the model of today" front, I will say that LLMs are not always going to be as dumb as they are today. 

Right, I strongly agree with this part. 

their rate of learning still makes them in some sense your most promising mentee

I disagree in the sense that they're no mentee of mine, ie, me trying to get today's models to understand me doesn't directly help tomorrow's models to understand. (With the exception of the limited forms of feedback in the interface, like thumbs up/down, the impact of which I'm unsure of so it doesn't feel like something I should deliberately spend a lot of time on.)

I also disagree in the sense that engaging with LLMs right now seems liable to produce a lot less fruits downstream, even as measured by "content that can usefully prompt an LLM later". IE, if mentees are viewed as machines that convert time-spent-dialoging-with-me to text that is useful later, I don't think LLMs are currently my most promising mentees.

So although I strongly agree with continuing to occasionally poke at LLMs to prep for the models that are coming soon & notice when things get better, to the extent that "most promising mentee" is supposed to imply that significant chunks of my time could be usefully spent with LLMs in the present, I disagree based on my (fairly extensive) experience. 

trying to get as much of the tacit knowledge you have into their training data as possible (if you want them to be able to more easily & sooner build on your work).

Barring special relationships with frontier labs, this sounds functionally equivalent to trying to get my work out there for humans to understand, for now at least. 

I did talk to Anthropic last year about the possibility of me providing detailed feedback on Claude's responses (wrt my research questions), but it didn't end up happening. The big problems I identified seemed to be things they thought would definitely get addressed in another way, so there wasn't a mutually agreed-on value proposition (I didn't understand what they hoped to gain, & they didn't endorse the sorts of things I hoped to train). I got busy and moved on to other things.

Or (if you don't want to do that for whatever reason) just generally not being caught flat-footed once they are smart enough to help you, as all your ideas are in videos or otherwise in high context understandable-only-to-abram notes.

I feel like this is speaking from a model I don't understand. Are videos so bad? Video transcriptions are already a thing, and future models should be better at watching video and getting info from it. Are personal notes so bad? What sorts of actions are you recommending? I already want to write as many text posts as I can. 

Comment by abramdemski on Davidmanheim's Shortform · 2025-01-17T18:04:31.959Z · LW · GW

One problem is that log-loss is not tied that closely to the types of intelligence that we care about. Extremely low log-loss necessarily implies extremely high ability to mimic a broad variety of patterns in the world, but that's sort of all you get. Moderate improvements in log-loss may or may not translate to capabilities of interest, and even when they do, the story connecting log-loss numbers to capabilities we care about is not obvious. (EG, what log-loss translates to the ability to do innovative research in neuroscience? How could you know before you got there?)

When there were rampant rumors about an AI slowdown in 2024, the speculation in the news articles often mentioned the "scaling laws" but never (in my haphazard reading) made a clear distinction between (a) frontier labs seeing that the scaling laws were violated, IE, improvements in loss are really slowing down, (b) there's a slowdown in the improvements to other metrics, (c) frontier labs are facing a qualitative slowdown, such as a feeling that GPT5 doesn't feel like as big of a jump as GPT4 did. Often these concepts were actively conflated.

Comment by abramdemski on Lecture Series on Tiling Agents · 2025-01-17T16:55:06.163Z · LW · GW

I'm seeing some agreement-upvotes of Alexander here so I am curious for people to explain the skill issue I am having.

Comment by abramdemski on Lecture Series on Tiling Agents · 2025-01-16T21:38:07.241Z · LW · GW

Don't get me wrong, I've kept trying and plan to keep trying.

Comment by abramdemski on Lecture Series on Tiling Agents · 2025-01-15T16:44:24.657Z · LW · GW

Entering, but not entered. The machines do not yet understand the prompts I write them. (Seriously, it's total garbage still, even with lots of high quality background material in the context.)

Comment by abramdemski on Lecture Series on Tiling Agents · 2025-01-15T15:23:40.267Z · LW · GW

I'm hopeful about it, but preparing the lectures alone will be a lot of work (although the first one will be a repeat of some material presented at ILIAD).

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-09T15:19:16.897Z · LW · GW

Ah yep, that's a good clarification.

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T17:25:59.489Z · LW · GW

If s is terminal then [...] we just have V(s) = r(s).

If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That's because we can determine the value of V exactly from the reward function and the oracle, via backwards-induction. (I haven't revisited RL convergence theorems in a while, I suspect I am not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (IE, r is inherently consistent with the provided definition of "deceptively misaligned"). However, it would be inconsistent for r that are not like that.

In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn't able to check (eg, due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
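To spell out the backwards-induction point, here's a minimal sketch for a finite acyclic MDP, under my reading of the setup (assuming a terminal state's value is just its reward, and that beta-coherence means V(s) is the softmax(beta)-weighted average of r(s,a) + V(s')):

```python
import numpy as np

def beta_coherent_V(states, actions, r, next_state, is_terminal, beta):
    """Determine V exactly from the reward function and the transition oracle by
    backwards induction, under the assumed definition of beta-coherence above."""
    V = {}

    def value(s):
        if s in V:
            return V[s]
        if is_terminal(s):
            V[s] = r(s, None)  # assumed: terminal value is just the terminal reward
            return V[s]
        q = np.array([r(s, a) + value(next_state(s, a)) for a in actions(s)])
        w = np.exp(beta * (q - q.max()))  # softmax(beta) weights over action values
        w /= w.sum()
        V[s] = float(w @ q)  # coherence: V(s) is the softmax-weighted expectation
        return V[s]

    for s in states:
        value(s)
    return V
```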

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T17:06:46.114Z · LW · GW

Yeah, of course the notion of "approximation error" matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with the high (pre-training) beta until V is approximately coherent. That's the pre-training. And then you switch to training with the lower beta.[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled at the high temperature. This reflects the fact that it'll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at that temperature, but easy for frequently-sampled states.
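For concreteness, the measurable notion of approximation error I have in mind is something like the following (my own operationalization with placeholder function arguments, not a definition from the post):

```python
import numpy as np

def coherence_gap(V, sampled_states, actions, r, next_state, beta):
    """Average |V(s) - softmax(beta)-weighted average of r(s,a) + V(s')| over states sampled
    at temperature beta: small on frequently-sampled states, unconstrained elsewhere."""
    gaps = []
    for s in sampled_states:
        q = np.array([r(s, a) + V[next_state(s, a)] for a in actions(s)])
        w = np.exp(beta * (q - q.max()))
        w /= w.sum()
        gaps.append(abs(V[s] - float(w @ q)))
    return float(np.mean(gaps))
```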

My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significant difference in probability when we're sampling according to the high beta versus the lower one. A scheming V can utilize this to self-preserve. This violates the assumption of beta-coherence, but in a very plausible-seeming way.

  1. ^

    My earlier comment about this used the wrong symbols for the two betas, which may have been confusing. I'll go fix that to be consistent with your notation.

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-08T01:22:02.913Z · LW · GW

I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeying these rules is unlikely to create such a pathological V.

To be clear, that's not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T21:11:09.500Z · LW · GW

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.

With you so far.

But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).

OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead.

This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The "hypothetical process to generate agents which are coherent under one beta" would be the first training phase, and then the "apply a different beta during training" would be the second training phase.

Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy "appear to be coherent (wrt the first-phase beta) & aligned in the first phase of training; appear to be coherent (wrt the second-phase beta) and aligned in the second phase of training; then, do some other thing when deployed". This relies on detecting the distributional shift between the two training phases (it can look for evidence of the beta in use by examining the history), and also detecting distributional shift to deployment.

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).

So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training needs to explore the trajectories that will actually be seen during deployment in order to correct V there, but some of these will have very low probability in the high-temperature training. 

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T19:40:47.362Z · LW · GW

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

Is the idea to train with high beta and then use lower beta post-training? 

  • If so, how does this relate to reward hacking and value preservation? IE, where do the two options being compared come from, if they aren't the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between those options must be made in production (since it is made with low beta), but then it seems like the coherence achieved at high beta isn't the coherence that constrains that choice.
  • If not, then when does the proposal suggest using high beta vs low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?

Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the postulated planning failure can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. Any capabilities which likewise rely on long strings of near-optimal actions will be similarly hurt. (However, this concern is secondary to my main confusion.)
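
Rough arithmetic for why long plans are the fragile part (my own illustration; $K$, $\Delta$, and the numbers are made up): if each step of the plan requires taking an action whose advantage over the $K-1$ alternatives is $\Delta$, a softmax policy takes that step with probability

$$p = \frac{e^{\beta\Delta}}{e^{\beta\Delta} + (K-1)},$$

and an $n$-step plan survives with probability roughly $p^n$. With $p = 0.9$ and $n = 20$ that is $0.9^{20} \approx 0.12$; if raising the temperature pushes $p$ down to $0.7$, it is $0.7^{20} \approx 0.0008$. But the same arithmetic applies to any useful capability that needs $n$ near-optimal steps in a row, which is the concern above.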

Therefore a value function trained with such a procedure must consider the state reached during training. 

Trained with what procedure, exactly?

This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".

Yes this would prevent an aligned AI from arbitrarily preserving its value function, the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL) so the contradiction does not apply.

(These parts made sense to me modulo my other questions/concerns/confusions.)

Comment by abramdemski on What's so special about likelihoods? · 2025-01-07T19:12:36.268Z · LW · GW

One sort of answer is that we often want the posterior, and we often have the likelihood. Slightly more refined: we often find the likelihood easier to estimate than the posterior, so Bayes' Rule is useful.

Why so?

I think one reason is that we make it the "responsibility" of hypotheses to give their likelihood functions. After all, what is a hypothesis? It's just a probability distribution (not a probability distribution that we necessarily endorse, but, one which we are considering as a candidate). As a probability distribution, its job is to make predictions; that is, give us probabilities for possible observations. These are the likelihoods.

We want the posterior because it tells us how much faith to place in the various hypotheses -- that is, it tells us whether (and to what degree) we should trust the various probability distributions we were considering.

So, in some sense, we use Bayes' Rule because we aren't sure how to assign probabilities, but we can come up with several candidate options.
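
Spelled out, this is just standard Bayes: with candidate distributions $H_1, \dots, H_n$ and data $x$,

$$P(H_i \mid x) = \frac{P(x \mid H_i)\, P(H_i)}{\sum_j P(x \mid H_j)\, P(H_j)},$$

where each likelihood $P(x \mid H_i)$ is the prediction that hypothesis $H_i$ was already responsible for supplying, and the posterior $P(H_i \mid x)$ is the "how much faith to place in this candidate" number we actually wanted.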

One weak counterexample to this story is regression, IE, curve-fitting. We can interpret regression in a Bayesian way easily enough. However, the curves don't come with likelihoods baked in. They only tell us how to interpolate/extrapolate with point-estimates; they don't give a full probability distribution. We've got to "soften" these predictions, layering probabilities on top, in order to apply the Bayesian way of thinking. 
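
Here's a minimal sketch of the "softening" move in code (my own toy example; the candidate curves, the Gaussian noise scale, and the uniform prior are all illustrative assumptions): wrapping each curve in a Gaussian observation model is what turns its point predictions into a likelihood that Bayes' Rule can use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observations: noisy samples from some underlying curve.
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

# Candidate curves. As curves, they only give point predictions.
candidates = {
    "linear":    lambda t: 1.0 - 2.0 * t,
    "quadratic": lambda t: 1.0 - 2.0 * t + 0.5 * t**2,
    "cubic":     lambda t: 1.0 - 2.0 * t + 0.5 * t**3,
}

sigma = 0.1   # the "softening": assume Gaussian noise of this scale around each curve
log_prior = {name: -np.log(len(candidates)) for name in candidates}   # uniform prior

def log_likelihood(f):
    """Log-probability of the observed y under curve f plus Gaussian noise."""
    resid = y - f(x)
    return -0.5 * np.sum(resid**2) / sigma**2 - len(x) * np.log(sigma * np.sqrt(2 * np.pi))

log_post = {name: log_prior[name] + log_likelihood(f) for name, f in candidates.items()}
m = max(log_post.values())
norm = sum(np.exp(v - m) for v in log_post.values())
posterior = {name: float(np.exp(v - m) / norm) for name, v in log_post.items()}
print(posterior)   # how much faith to place in each candidate curve
```

The curves themselves stay point-estimators throughout; all of the probabilistic structure comes from the noise model we layered on top.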

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T16:34:40.140Z · LW · GW

I think I don't quite understand what the argument is supposed to be. Would the same argument also prevent an aligned AI from acting to preserve its values? How is it supposed to select for aligned behavior over misaligned behavior?

In order to achieve beta-coherence, it seems like the training needs to use $\beta_1$ to choose actions, so that V is trained on those action frequencies. However, the proposal appears to be to instead use $\beta_2$ during training, so that the misaligned V is misled about how to do reward-hacking. This seems like an impossible combination: V will become beta-coherent wrt $\beta_2$ rather than $\beta_1$ during training.

We could imagine two training steps; first we cohere V wrt $\beta_1$, and then re-train wrt $\beta_2$. Perhaps this is your intention. More generally, we could gradually increase the temperature during training.

Oh, I guess we could also modify the training procedure so that the gradient associated with some actions gets under-weighted and others get over-weighted. Maybe this is your intended proposal.

Still, I'm not sure this helps achieve the postulated effect. Let's say that, during training, we choose actions entirely randomly (max temperature), but the gradient from suboptimal actions gets entirely ignored (so V becomes coherent wrt minimum temperature). This would seem to be almost equivalent to just training with minimum temperature (except the exploration behavior is very different).

Similarly for less extreme temperatures: if we make V beta-coherent by under-weighting gradients from over-sampled actions, then we are also more-or-less correcting the 'mistaken' expectations of the reward-hacking attempts.
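
Something like the following is what I'm imagining for the gradient-reweighting variant (a hypothetical sketch of my own, not anything from the post): sample the action from the max-temperature (here uniform) behavior policy, but importance-weight the update toward the low-temperature target policy.

```python
import numpy as np

def reweighted_td_update(V, s, a, r, s_next, pi_target, pi_behavior,
                         lr=0.1, gamma=0.9):
    """TD(0)-style update where the action was sampled from pi_behavior
    (e.g. the max-temperature / uniform policy), but the update is scaled by
    pi_target(a|s) / pi_behavior(a|s). In expectation over a ~ pi_behavior,
    this matches the update we'd get by sampling from pi_target directly."""
    w = pi_target[a] / pi_behavior[a]
    V[s] += lr * w * (r + gamma * V[s_next] - V[s])

# Example: 3 actions sampled uniformly, re-weighted toward a peaked target policy.
V = np.zeros(4)
reweighted_td_update(V, s=0, a=2, r=1.0, s_next=1,
                     pi_target=np.array([0.05, 0.05, 0.9]),
                     pi_behavior=np.full(3, 1 / 3))
print(V)
```

Which is exactly why I'd expect it to collapse back into ordinary low-temperature training, apart from the different exploration pattern.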

Am I misunderstanding the proposal, or missing something in the argument?

Comment by abramdemski on Turning up the Heat on Deceptively-Misaligned AI · 2025-01-07T15:31:56.781Z · LW · GW

Correctness: V correctly predicts future reward in RL scenarios.

The meaning of correctness is very unclear to me on first reading. It later becomes apparent that "reward" is not intended to refer to the human-designed reward signal at all, due to the later assumption "deceptive misalignment". So what does "correctness" say? "Correctness" indicates conforming to some standard; but here, the standard being conformed to is only the subjective standard of the misaligned AI itself.

This suggests the interpretation of "correctness" as "reflect the EV of the current state in terms of the misaligned desires". However, this contradicts beta-coherence, since the coherence $\beta$ incorrectly predicts the actual action probabilities.

I think it would be better to remove the "correctness" assumption, since it doesn't really do anything.

Comment by abramdemski on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T20:01:10.784Z · LW · GW

(And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

Mathematics?

Comment by abramdemski on Don't want Goodhart? — Specify the damn variables · 2024-12-16T16:00:08.430Z · LW · GW

If you indeed were solving a narrower task — that is, only creating the most sense of pleasure-inducing picture with maximization of other parameters — and then looked back, puzzled as to why the hungry weren't fed by this procedure, bringing Goodhart's law into the discussion is madness; it stresses me out. The variable 'people are hungry' wasn't important for this task at all. Oh, or was it important to you? Then why didn't you specify it? You think it’s 'obvious'?

The point of Goodhart's Law is that you can only select for what you can measure. The burger is a good analogy because Instagram can't measure taste or nutrition, so when Instagram is what optimizes burgers, you get burgers with a very appealing appearance but non-optimized taste and nutrition. If you have the ability to measure taste, then you can create good taste, but you run into subtler examples of Goodhart (EG, Starbucks coffee is optimized to taste good to their professional tasters, which is slightly different from tasting good to a general audience).

Just specifying the variable you're interested in doesn't solve this problem; you also have to figure out how to measure it. The problem is that measurements are usually at least slightly statistically distinct from the actual target variable, so that the statistical connection can fall apart under optimization.
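
A tiny simulation of that last sentence (my own toy numbers; the heavy-tailed measurement error is an illustrative choice): under mild selection the measurement mostly picks out high values of the target, but the harder you optimize on the measurement, the more you are selecting measurement error instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

target = rng.normal(size=n)                          # the thing we actually care about
measurement = target + rng.standard_cauchy(size=n)   # correlated, but imperfect, proxy

# Select ever more aggressively on the measurement and check the target.
for top_frac in (0.1, 0.01, 0.001):
    k = int(n * top_frac)
    chosen = np.argsort(measurement)[-k:]
    print(f"top {top_frac:>5} by measurement: "
          f"mean target among selected = {target[chosen].mean():+.2f}")
```

With heavy-tailed error, the most extreme measurements are mostly extreme error, so the mean of the target among the selected points shrinks toward zero as the selection fraction shrinks.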

I also take issue with describing optimizing the appearance of the burger as "narrower" than optimizing the burger quality. In general it is a different task, which may be narrower or broader.

Comment by abramdemski on Don't want Goodhart? — Specify the damn variables · 2024-12-16T15:50:07.688Z · LW · GW

I expect that the main problem with Goodhart's law is that if you strive for an indicator to accurately reflect the state of the world, once the indicator becomes decoupled from the state of the world, it stops reflecting the changes in the world. This is how I interpret the term 'good,' which I dislike. People want a thermometer to accurately reflect the patterns they called temperature to better predict the future — if the thermometer doesn't reflect the temperature, future predictions suffer.

A problem I have with this reinterpretation is that "state of the world" is too broad. In looking at a thermometer, I am not trying to understand the entire world-state (and the thermometer also couldn't be decoupled from the entire world-state, since it is a part of the world).

A more accurate way to remove "good" would be as follows:

In everyday life, if a human is asked to make a (common, everyday) judgement based on appearances, then the judgement is probably accurate. But if we start optimizing really hard based on their judgement, Goodhart's Law kicks in.

Comment by abramdemski on Complete Class: Consequentialist Foundations · 2024-12-05T16:34:09.441Z · LW · GW

Thanks!

Comment by abramdemski on Ayn Rand’s model of “living money”; and an upside of burnout · 2024-12-02T21:13:37.094Z · LW · GW

Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?