Posts

LLMs for Alignment Research: a safety priority? 2024-04-04T20:03:22.484Z
Modern Transformers are AGI, and Human-Level 2024-03-26T17:46:19.373Z
Technologies and Terminology: AI isn't Software, it's... Deepware? 2024-02-13T13:37:10.364Z
Meaning & Agency 2023-12-19T22:27:32.123Z
FixDT 2023-11-30T21:57:11.950Z
Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.] 2023-11-20T18:23:40.443Z
Translations Should Invert 2023-10-05T17:44:23.262Z
Where might I direct promising-to-me researchers to apply for alignment jobs/grants? 2023-09-18T16:20:03.452Z
One Minute Every Moment 2023-09-01T20:23:56.391Z
Probabilistic Payor Lemma? 2023-03-19T17:57:04.237Z
Teleosemantics! 2023-02-23T23:26:15.894Z
Some Thoughts on AI Art 2023-01-25T14:18:14.507Z
Contra Common Knowledge 2023-01-04T22:50:38.493Z
Talking to God 2023-01-03T20:14:20.955Z
Knottiness 2023-01-02T22:13:12.752Z
Prettified AI Safety Game Cards 2022-10-11T19:35:18.991Z
Builder/Breaker for Deconfusion 2022-09-29T17:36:37.725Z
Vingean Agency 2022-08-24T20:08:53.237Z
Steam 2022-06-20T17:38:58.548Z
Brass Puppet 2022-05-26T17:42:04.876Z
ELK Computational Complexity: Three Levels of Difficulty 2022-03-30T20:56:37.239Z
[Closed] Job Offering: Help Communicate Infrabayesianism 2022-03-23T18:35:16.790Z
ELK Thought Dump 2022-02-28T18:46:08.611Z
Contest for outlining rules for this contest. 2022-02-21T18:44:43.990Z
There is essentially one best-validated theory of cognition. 2021-12-10T15:51:06.423Z
Worst Commonsense Concepts? 2021-11-15T18:22:31.465Z
How can one train philosophical skill? 2021-09-30T14:56:35.313Z
Power vs Precision 2021-08-16T18:34:42.287Z
Implicature Conflation 2021-08-09T19:48:51.097Z
Refactoring Alignment (attempt #2) 2021-07-26T20:12:15.196Z
Re-Define Intent Alignment? 2021-07-22T19:00:31.629Z
Progress, Stagnation, & Collapse 2021-07-22T16:51:04.595Z
The Homunculus Problem 2021-05-27T20:25:58.312Z
The Argument For Spoilers 2021-05-21T12:23:49.127Z
Time & Memory 2021-05-20T15:16:49.042Z
Formal Inner Alignment, Prospectus 2021-05-12T19:57:37.162Z
Fractal Conversations vs Holistic Response 2021-05-05T15:04:40.314Z
Death by Red Tape 2021-05-01T18:03:34.780Z
Gradations of Inner Alignment Obstacles 2021-04-20T22:18:18.394Z
Superrational Agents Kelly Bet Influence! 2021-04-16T22:08:18.201Z
A New Center? [Politics] [Wishful Thinking] 2021-04-12T15:19:35.430Z
My Current Take on Counterfactuals 2021-04-09T17:51:06.528Z
Reflective Bayesianism 2021-04-06T19:48:43.917Z
Affordances 2021-04-02T20:53:35.639Z
Voting-like mechanisms which address size of preferences? 2021-03-18T23:23:55.393Z
MetaPrompt: a tool for telling yourself what to do. 2021-03-16T20:49:19.693Z
Rigorous political science? 2021-03-12T15:30:53.837Z
Four Motivations for Learning Normativity 2021-03-11T20:13:40.175Z
Kelly *is* (just) about logarithmic utility 2021-03-01T20:02:08.300Z
"If You're Not a Holy Madman, You're Not Trying" 2021-02-28T18:56:19.560Z

Comments

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-17T14:53:51.295Z · LW · GW

I don't really interact with Twitter these days, but maybe you could translate my complaints there and let me know if you get any solid gold?

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-17T14:49:56.095Z · LW · GW

I don't have a good system prompt that I like, although I am trying to work on one. It seems to me like the sort of thing that should be built in to a tool like this (perhaps with options, as different system prompts will be useful for different use-cases, like learning vs trying to push the boundaries of knowledge). 

I would be pretty excited to try this out with Claude 3 behind it. Very much the sort of thing I was trying to advocate for in the essay!

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-10T14:57:52.737Z · LW · GW

But not intentionally. It was an unintentional consequence of training.

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-10T14:56:43.179Z · LW · GW

I am not much of a prompt engineer, I think. My "prompts" generally consist of many pages of conversation where I babble about some topic I am interested in, occasionally hitting enter to get Claude's responses, and then skim/ignore Claude's responses because they are bad, and then keep babbling. Sometimes I make an explicit request to Claude such as "Please try and organize these ideas into a coherent outline" or "Please try and turn this into math" but the responses are still mostly boring and bad.

I am trying ;p

But yes, it would be good for me to try and make a more concrete "Claude cannot do X" to get feedback on.

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-10T13:35:30.831Z · LW · GW

I've tried writing the beginning of a paper that I want to read the rest of, but the LLM did not complete it well enough to be interesting.

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-10T13:34:15.003Z · LW · GW

I agree with this worry. I am overall advocating for capabilitarian systems with a specific emphasis on helping accelerate safety research.

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-05T17:39:43.723Z · LW · GW

Sounds pretty cool! What LLM powers it?

Comment by abramdemski on LLMs for Alignment Research: a safety priority? · 2024-04-05T15:26:20.401Z · LW · GW

I don't think the plan is "turn it on and leave the building" either, but I still think the stated goal should not be automation. 

I don't quite agree with the framing "building very generally useful AI, but the good guys will be using it first" -- the approach I am advocating is not to push general capabilities forward and then specifically apply those capabilities to safety research. That is more like the automation-centric approach I am arguing against.

Hmm, how do I put this...

I am mainly proposing more focused training of modern LLMs with feedback from safety researchers themselves, toward the goal of safety researchers getting utility out of these systems; this boosts capabilities for helping-with-safety-research specifically, in a targeted way, because that is what you are getting more+better training feedback on. (Furthermore, checking and maintaining this property would be an explicit goal of the project.)

I am secondarily proposing better tools to aid in that feedback process; these can be applied to advance capabilities in any area, I agree, but I think it only somewhat exacerbates the existing "LLM moderation" problem; the general solution of "train LLMs to do good things and not bad things" does not seem to get significantly more problematic in the presence of better training tools (perhaps the general situation even gets better). If the project was successful for safety research, it could also be extended to other fields. The question of how to avoid LLMs being helpful for dangerous research would be similar to the LLM moderation question currently faced by Claude, ChatGPT, Bing, etc: when do you want the system to provide helpful answers, and when do you want it to instead refuse to help?

I am thirdly also mentioning approaches such as training LLMs to interact with proof assistants and intelligently decide when to translate user arguments into formal languages. This does seem like a more concerning general-capability thing, to which the remark "building very generally useful AI, but the good guys will be using it first" applies.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-28T18:42:03.157Z · LW · GW

No, I was talking about the results. lsusr seems to use the term in a different sense than Scott Alexander or Yann LeCun. In their sense it's not an alternative to backpropagation, but a way of constantly predicting future experience and to constantly update a world model depending on how far off those predictions are. Somewhat analogous to conditionalization in Bayesian probability theory.

I haven't watched the LeCun interview you reference (it is several hours long, so relevant time-stamps to look at would be appreciated), but this still does not make sense to me -- backprop already seems like a way to constantly predict future experience and update, particularly as it is employed in LLMs. Generating predictions first and then updating based on error is how backprop works. Some form of closeness measure is required, just like you emphasize.
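
To make the "predict first, then update on the error" reading concrete, here is a minimal sketch of that loop -- a one-parameter model and made-up data, with squared error standing in for the closeness measure:

```python
import numpy as np

# Toy "predict, then update on the error" loop: a single-parameter model
# predicts the next observation, measures how far off it was (squared error
# as the closeness measure), and nudges the parameter down the gradient.
# All numbers here are made up for illustration.

rng = np.random.default_rng(0)
stream = rng.normal(loc=3.0, scale=1.0, size=1000)  # incoming "experience"

w = 0.0     # the model's single parameter (its prediction)
lr = 0.05   # learning rate

for x in stream:
    prediction = w
    error = (x - prediction) ** 2      # closeness measure
    grad = -2.0 * (x - prediction)     # d(error)/dw
    w -= lr * grad                     # gradient step = update on the error

print(f"learned prediction: {w:.3f} (true mean of the stream is 3.0)")
```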

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-28T18:19:11.982Z · LW · GW

Yeah, I didn't do a very good job in this respect. I am not intending to talk about a transformer by itself. I am intending to talk about transformers with the sorts of bells and whistles that they are currently being wrapped with. So not just transformers, but also not some totally speculative wrapper.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-28T17:56:36.349Z · LW · GW

And you end up with "well for most of human history, a human with those disabilities would be a net drain on their tribe. Sometimes they were abandoned to die as a consequence. "

And it implies something like "can perform robot manipulation and wash dishes", or the "make a cup of coffee in a stranger's house" test. And reliably enough to be paid minimum wage or at least some money under the table to do a task like this.

The replace-human-labor test gets quite interesting and complex when we start to time-index it. Specifically, two time-indexes are needed: a 'baseline' time (when humans are doing all the relevant work) and a comparison time (where we check how much of the baseline economy has been automated).

Without looking anything up, I guess we could say that machines have already automated 90% of the economy, if we choose our baseline from somewhere before industrial farming equipment, and our comparison time somewhere after. But this is obviously not AGI.

A human who can do exactly what GPT4 can do is not economically viable in 2024, but might have been economically viable in 2020.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-28T17:30:13.681Z · LW · GW

I don't think it is sensible to model humans as "just the equivalent of a sort of huge content window" because this is not a particularly good computational model of how human learning and memory work; but I do think that the technology behind the increasing context size of modern AIs contributes to them having a small but nonzero amount of the thing Steven is pointing at, due to the spontaneous emergence of learning algorithms.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-27T02:11:18.761Z · LW · GW

Yep, I agree that Transformative AI is about impact on the world rather than capabilities of the system. I think that is the right thing to talk about for things like "AI timelines" if the discussion is mainly about the future of humanity. But, yeah, definitely not always what you want to talk about.

I am having difficulty coming up with a term which points at what you want to point at, so yeah, I see the problem.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-26T22:59:50.817Z · LW · GW

I'm not sure how you intend your predictive-coding point to be understood, but from my perspective, it seems like a complaint about the underlying tech rather than the results, which seems out of place. If backprop can do the job, then who cares? I would be interested to know if you can name something which predictive coding has currently accomplished, and which you believe to be fundamentally unobtainable for backprop. lsusr thinks the two have been unified into one theory.

I don't buy that animals somehow plug into "base reality" by predicting sensory experiences, while transformers somehow miss out on it by predicting text and images and video. Reality has lots of parts. Animals and transformers both plug into some limited subset of it.

I would guess raw transformers could handle some real-time robotics tasks if scaled up sufficiently, but I do agree that raw transformers would be missing something important architecture-wise. However, I also think it is plausible that only a little bit more architecture is needed (and, that the 'little bit more' corresponds to things people have already been thinking about) -- things such as the features added in the generative agents paper. (I realize, of course, that this paper is far from realtime robotics.)

Anyway, high uncertainty on all of this.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-26T22:00:40.220Z · LW · GW

With respect to METR, yeah, this feels like it falls under my argument against comparing performance against human experts when assessing whether AI is "human-level". This is not to deny the claim that these tasks may shine a light on fundamentally missing capabilities; as I said, I am not claiming that modern AI is within human range on all human capabilities, only enough that I think "human level" is a sensible label to apply.

However, the point about autonomously making money feels more hard-hitting, and has been repeated by a few other commenters. I can at least concede that this is a very sensible definition of AGI, which pretty clearly has not yet been satisfied. Possibly I should reconsider my position further.

The point about forming societies seems less clear. Productive labor in the current economy is in some ways much more complex and harder to navigate than it would be in a new society built from scratch. The Generative Agents paper gives some evidence in favor of LLM-based agents coordinating social events.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-26T21:31:30.517Z · LW · GW

Yeah, I think nixing the terms 'AGI' and 'human-level' is a very reasonable response to my argument. I don't claim that "we are at human-level AGI now, everyone!" has important policy implications (I am not sure one way or the other, but it is certainly not my point).

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-26T21:22:19.358Z · LW · GW

And maybe I am misremembering history or confused about what you are referring to, but in my mind, the promise of the "AGI community" has always been (implicitly or explicitly) that if you call something "human-level AGI", it should be able to get you to (a), or at least have a bigger economic and societal impact than currently-deployed AI systems have actually had so far.

Yeah, I don't disagree with this -- there's a question here about which stories about AGI should be thought of as defining vs extrapolating consequences of that definition based on a broader set of assumptions. The situation we're in right now, as I see it, is one where some of the broader assumptions turn out to be false, so definitions which seemed relatively clear become more ambiguous.

I'm privileging notions about the capabilities over notions about societal consequences, partly because I see "AGI" as more of a technology-oriented term and less of a social-consequences-oriented term. So while I would agree that talk about AGI from within the AGI community historically often went along with utopian visions, I pretty strongly think of this as speculation about impact, rather than definitional.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-26T21:02:20.758Z · LW · GW

I think Steven's response hits the mark, but from my own perspective, I would say that a not-totally-irrelevant way to measure something related would be: many-shot learning, particularly in cases where few-shot learning does not do the trick.

Comment by abramdemski on Modern Transformers are AGI, and Human-Level · 2024-03-26T20:24:38.488Z · LW · GW

Thanks for your perspective! I think explicitly moving the goal-posts is a reasonable thing to do here, although I would prefer to do this in a way that doesn't harm the meaning of existing terms. 

I mean: I think a lot of people did have some kind of internal "human-level AGI" goalpost which they imagined in a specific way, and modern AI development has resulted in a thing which fits part of that image while not fitting other parts, and it makes a lot of sense to reassess things. Goalpost-moving is usually maligned as an error, but sometimes it actually makes sense.

I prefer 'transformative AI' for the scary thing that isn't here yet. I see where you're coming from with respect to not wanting to have to explain a new term, but I think 'AGI' is probably still more obscure for a general audience than you think it is (see, eg, the snarky complaint here). Of course it depends on your target audience. But 'transformative AI' seems relatively self-explanatory as these things go. I see that you have even used that term at times.

I disagree with that—as in “why I want to move the goalposts on ‘AGI’”, I think there’s an especially important category of capability that entails spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time. Mathematicians do this with abstruse mathematical objects, but also trainee accountants do this with spreadsheets, and trainee car mechanics do this with car engines and pliers, and kids do this with toys, and gymnasts do this with their own bodies, etc. I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim. (See Section 3.1 here.)

I do think this is gesturing at something important. This feels very similar to the sort of pushback I've gotten from other people. Something like: "the fact that AIs can perform well on most easily-measured tasks doesn't tell us that AIs are on the same level as humans; it tells us that easily-measured tasks are less informative about intelligence than we thought".

Currently I think LLMs have a small amount of this thing, rather than zero. But my picture of it remains fuzzy.

Comment by abramdemski on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T19:45:32.946Z · LW · GW

The "one of those" phrasing makes me think there was prior conversational context about this before the start of the interview. From my own prior knowledge of Zack, my guess is that it is a tragedy of the green rationalist type sentiment. But it doesn't exactly fit.

Comment by abramdemski on "Deep Learning" Is Function Approximation · 2024-03-22T16:01:37.544Z · LW · GW

The issue seems more complex and subtle to me.

It is fair to say that the loss function (when combined with the data) is a stochastic environment (stochastic due to sampling the data), and the effect of gradient descent is to select a policy (a function out of the function space) which performs very well in this stochastic environment (achieves low average loss).

If we assume the function-approximation achieves the minimum possible loss, then it must be the case that the function chosen is an optimal control policy where the loss function (understood as including the data) is the utility function which the policy is optimal with respect to.

In this framing, both Zack and Eliezer would be wrong:

  • Zack would be wrong because there is nothing nonsensical about asking whether the function-approximation "internalizes" the loss. Utility functions are usually understood behaviorally; a linear regression might not "represent" (ie denote) squared-error anywhere, but might still be utility-theoretically optimal with respect to mean-squared error, which is enough for "representation theorems" (the decision-theory thingy) to apply (see the sketch just after this list). 
  • Eliezer would be wrong because his statement that there is no guarantee about representing the loss function would be factually incorrect. At best Eliezer's point could be interpreted as saying that the representation theorems break down when loss is merely very low rather than perfectly minimal.
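
Here is the sketch referenced in the first bullet: a minimal, made-up linear-regression example of the behavioral sense of "representation". The fitted object is just two numbers, with no squared-error formula stored anywhere, yet it is optimal with respect to mean squared error:

```python
import numpy as np

# A fitted linear regression does not "represent" (denote) squared error
# anywhere: the learned object is just two numbers (slope, intercept).
# Yet it is behaviorally optimal with respect to mean squared error, which
# we check by perturbing it. The data is made up for illustration.

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)

X = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(X, y, rcond=None)[0]  # the whole learned "policy"

def mse(s, b):
    return np.mean((y - (s * x + b)) ** 2)

base = mse(slope, intercept)
perturbed = [mse(slope + ds, intercept + db)
             for ds in (-0.1, 0.1) for db in (-0.1, 0.1)]

print(f"fitted parameters: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"MSE at the fit: {base:.4f}; every perturbation does worse: {min(perturbed) > base}")
```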

But Eliezer (at least in the quote Zack selects) is clearly saying "explicit internal representation" rather than the decision-theoretic "representation theorem" thingy. I think this is because Eliezer is thinking about inner optimization, as Zack also says. When we are trying to apply function-approximation ("deep learning") to solve difficult problems for us -- in particular, difficult problems never seen in the data-set used for training -- it makes some sense to suppose that the internal representation will involve nontrivial computations, even "search algorithms" (and importantly, we know of no way to rule this out without crippling the generalization ability of the function-approximation). 

So based on this, we could refine the interpretation of Eliezer's point to be: even if we achieve the minimum loss on the data-set given (and therefore obey decision-theoretic representation-theorems in the stochastic environment created by the loss function combined with the data), there is no particular guarantee that the search procedure learned by the function-approximation is explicitly searching to minimize said loss. 

This is significant because of generalization. We actually want to run the approximated-function on new data, with hopes that it does "something appropriate". (This is what Eliezer means when he says "distribution-shifted environments" in the quote.) This important point is not captured in your proposed reconciliation of Zack and Eliezer's views.

But then why emphasize (as Eliezer does) that the function approximation does not necessarily internalize the loss function it is trained on? Internalizing said loss function would probably prevent it from doing anything truly catastrophic (because it is not planning for a world any different than the actual training data it has seen). But it does not especially guarantee that it does what we would want it to do. (Because the-loss-function-on-the-given-data is not what we really want; really we want some appropriate generalization to happen!)

I think this is a rhetorical simplification, which is fair game for Zack to try and correct to something more accurate. Whether Eliezer truly had the misunderstanding when writing, I am not sure. But I agree that the statement is, at least, uncareful.

Has Zack succeeded in correcting the issue by providing a more accurate picture? Arguably TurnTrout made the same objection in more detail. He summarizes the whole thing into two points:

  1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent's optimization target.
  2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.

(Granted, TurnTrout is talking about reward signals rather than loss functions, and this is an important distinction; however, my understanding is that he would say something very similar about loss functions.)

Point #1 appears to strongly agree with at least a major part of Eliezer's point. To re-quote the List of Lethalities portion Zack quotes in the OP:

Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. [...] This is sufficient on its own [...] to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

However, I think point #2 is similar in spirit to Zack's objection in the OP. (TurnTrout does not respond to the same exact passage, but has his own post taking issues with List of Lethalities.)

I will call the objection I see in common between Zack and TurnTrout the type error objection. Zack says that of course a line does not "represent" the loss function of a linear regression; why would you even want it to? TurnTrout says that "reward is not the optimization target" -- we should think of a reward function as a "chisel" which shapes a policy, rather than thinking of it as the goal we are trying to instill in the policy. In both cases, I understand them as saying that the loss function used for training is an entirely different sort of thing from the goals an intelligent system pursues after training. (The "wheels made of little cars" thing also resembles a type-error objection.)

While I strongly agree that we should not naively assume a reinforcement-learning agent internalizes the reward as its utility function, I think the type-error objection is over-stated, as may be clear from my point about decision-theoretic representation theorems at the beginning.

Reward functions do have the wrong type signature, but neural networks are not actually trained on reward gradients; rather, a loss is defined from the reward in some way. The type signature of the loss function is not wrong; indeed, if training were perfect, then we could conclude that the resulting neural networks would be decision-theoretically perfect at minimizing loss on the training distribution.
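
As a minimal illustration of that reward-to-loss step -- a made-up two-armed bandit trained with a REINFORCE-style surrogate loss, not a claim about how any particular system is actually trained:

```python
import numpy as np

# Reward is a scalar signal from the environment; the thing with the right
# type signature for gradient training is a loss built from it. Here the
# surrogate loss is -log pi(action) * reward, whose gradient is the standard
# REINFORCE estimator. The bandit and all numbers are made up.

rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.8])   # expected reward of each arm
theta = np.zeros(2)                  # policy parameters (softmax logits)
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = float(rng.random() < true_reward[a])   # sampled scalar reward
    grad_log_pi = np.eye(2)[a] - pi            # gradient of log pi(a) wrt theta
    theta += lr * r * grad_log_pi              # ascend log pi(a) * r, i.e. descend the surrogate loss

print("final policy:", softmax(theta))         # should come to strongly favor arm 1
```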

What we would not be able to make confident predictions about is what such systems would do outside of the training distribution, where the training procedure has not exercised selection pressure on the behavior of the system. Here, we must instead rely on the generalization power of function-approximation, which (seen through a somewhat bayesian lens) means trusting the system to have the inductive biases which we would want.

Comment by abramdemski on "Deep Learning" Is Function Approximation · 2024-03-22T13:55:31.468Z · LW · GW

To me, the lengthy phrases do in fact get closer to "zack saying what zack meant" than the common terms like 'deep learning' -- but, like you, I didn't really get anything new out of the longer phrases. I believe that people who don't already think of deep learning as function approximation may get something out of it tho. So in consequence I didn't downvote or upvote.

Comment by abramdemski on Technologies and Terminology: AI isn't Software, it's... Deepware? · 2024-03-21T17:08:16.752Z · LW · GW

Yeah, this is a pretty interesting twist in the progression, and one which I failed to see coming as a teenager learning about AI. I looked at the trend from concrete to abstract -- from machine-code to structured programming to ever-more-abstract high-level programming languages -- and I thought AI would look like the highest-level programming language one could imagine.

In some sense this is not wrong. Telling the machine what to do in plain natural language is the highest-level programming language one could imagine.

However, naive extrapolation of ever-more-sophisticated programming languages might lead one to anticipate convergence between compilers and computational linguistics, such that computers would be understanding natural language with sophisticated but well-understood parsing algorithms, converting natural-language statements to formal representations resembling logic, and then executing the commands via similarly sophisticated planning algorithms.

The reality is that computational linguistics itself has largely abandoned the idea that we can make a formal grammar which captures natural language; the best way to parse a bunch of English is, instead, to let machine learning "get the idea" from a large number of hand-parsed examples! Rather than bridging the formal-informal divide by fully formalizing English grammar, it turns out to be easier to formalize informality itself (ie, mathematically specify a model of messy neural network learning) and then throw the formalized informality at the problem!

Weird stuff.

However, at some point I did get the idea and make the update. I think it was at the 2012 AGI conference, where someone was presenting a version of neural networks which was supposed to learn interpretable models, due to the individual neurons implementing interpretable functions of their inputs, rather than big weighted sums with a nonlinear transform thrown in. It seemed obvious that the approach would be hopeless, because as the models got larger and larger, it would be no more interpretable than any other form of neural network. I had the startling realization that this same argument seems to apply to anything, no matter how logic-like the underlying representation: it will become an opaque mess as it learns the high complexity of the real world.

Comment by abramdemski on Policy Selection Solves Most Problems · 2024-03-20T15:15:19.148Z · LW · GW

It seems better in principle to find a way to respect human intuitions about which things to be updateless about. Getting something wrong in the too-updateless direction can give up control of the AI to entities which we don't think of as existing; getting something wrong in the too-updateful direction can miss out on multiverse-wide coordination via superrationality.

Comment by abramdemski on And All the Shoggoths Merely Players · 2024-02-21T19:53:54.783Z · LW · GW

Which part of LLM? Shoggoth or simulacra? As I see it, there is a pressure on shoggoth to become very good at simulating exactly correct human in exactly correct situation, which is extremely complicated task.

Yes, I think it is fair to say that I meant the Shoggoth part, although I'm a little wary of that dichotomy utilized in a load-bearing way.

But I still don't see how this leads to strategic planning or consequentialist reasoning on shoggoth's part. It's not like shoggot even "lives" in some kind of universe with linear time or gets any reward for predicting the next token, or learns on its mistakes. It is architecturally an input-output function where input is whatever information it has about previous text and output is whatever parameters the simulation needs right now. It is incredibly "smart", but not agent kind of smart. I don't see any room for shoggoth's agency in this setup.

No room for agency at all? If this were well-reasoned, I would consider it major progress on the inner alignment problem. But I fail to follow your line of thinking. Something being architecturally an input-output function seems not that closely related to what kind of universe it "lives" in. Part of the lesson of transformer architectures, in my view at least, was that giving a next-token-predictor a long input context is more practical than trying to train RNNs. What this suggests is that given a long context window, LLMs reconstruct the information which would have been kept around in a recurrent state pretty well anyway.

This makes it not very plausible that the key dividing line between agentic and non-agentic is whether the architecture keeps state around. 
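
A toy way to see the point about state -- the two interfaces below (both made up) implement the same predictor, one carrying recurrent state forward and one recomputing everything from the context window:

```python
from typing import List, Tuple

# A recurrent predictor that carries state forward, and a stateless one that
# re-reads the whole context each time. Given the full history, the stateless
# version reconstructs whatever the recurrent state would have held.
# The task (predict "!" after three consecutive "a" tokens) is made up.

def rnn_step(state: int, token: str) -> Tuple[int, str]:
    state = state + 1 if token == "a" else 0
    return state, ("!" if state >= 3 else "a")

def contextual_predict(history: List[str]) -> str:
    run = 0                      # recompute the "state" from the context window
    for token in history:
        run = run + 1 if token == "a" else 0
    return "!" if run >= 3 else "a"

tokens = ["a", "b", "a", "a", "a", "b", "a"]

state, recurrent_preds = 0, []
for t in tokens:
    state, pred = rnn_step(state, t)
    recurrent_preds.append(pred)

contextual_preds = [contextual_predict(tokens[: i + 1]) for i in range(len(tokens))]
print(recurrent_preds == contextual_preds)   # True: same behavior, no carried state
```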

The argument I sketched as to why this input-output function might learn to be agentic was that it is tackling an extremely complex task, which might benefit from some agentic strategy. I'm still not saying such an argument is correct, but perhaps it will help to sketch why this seems plausible. Modern LLMs are broadly thought of as "attention" algorithms, meaning they decide what parts of sequences to focus on. Separately, many people think it is reasonable to characterize modern LLMs as having a sort of world-model which gets consulted to recall facts. Where to focus attention is a consideration which will have lots of facets to it, of course. But in a multi-stage transformer, isn't it plausible that the world-model gets consulted in a way that feeds into how attention is allocated? In other words, couldn't attention-allocation go through a relatively consequentialist circuit at times, which essentially asks itself a question about how it expects things to go if it allocates attention in different ways?

Any specific repeated calculation of that kind could get "memorized out", replaced with a shorter circuit which simply knows how to proceed in those circumstances. But it is possible, in theory at least, that the more general-purpose reasoning, going through the world-model, would be selected for due to its broad utility in a variety of circumstances.

Since the world-model-consultation is only selected to be useful for predicting the next token, the consequentialist question which the system asks its world-model could be fairly arbitrary so long as it has a good correlation with next-token-prediction utility on the training data.

Is this planning? IE does the "query to the world-model" involve considering multiple plans and rejecting worse ones? Or is the world-model more of a memorized mess of stuff with no "moving parts" to its computation? Well, we don't really know enough to say (so far as I am aware). Input-output type signatures do not tell us much about the simplicity or complexity of calculations within. "It's just circuits" but large circuits can implement some pretty sophisticated algorithms. Big NNs do not equal big lookup tables.

Comment by abramdemski on Steam · 2024-02-21T19:22:57.219Z · LW · GW

I intended the three to be probability and utility and steam, but it might make more sense to categorize things in other ways. While I still think there might be something more interesting here, I nowadays mainly think of Steam as the probability distribution over future actions and action-related concepts. This makes Steam an epistemic object, like any other belief, but with more normative/instrumental content because it's beliefs about actions, and because there will be a lot of FixDT stuff going on in such beliefs. Kickstarter / "belief-in" dynamics also seem extremely relevant.

Comment by abramdemski on And All the Shoggoths Merely Players · 2024-02-19T19:34:57.514Z · LW · GW

Here are some different things that come to mind.

  1. As you mention, the simulacra behaves in an agentic way within its simulated environment, a character in a story. So the capacity to emulate agency is there. Sometimes characters can develop awareness that they are a character in a story. If an LLM is simulating that scenario, doesn't it seem appropriate (at least on some level) to say that there is real agency being oriented toward the real world? This is "situational awareness".
  2. Another idea is that the LLM has to learn some strategic planning in order to direct its cognitive resources efficiently toward the task of prediction. Prediction is a very complicated task, so this meta-cognition could in principle become arbitrarily complicated. In principle we might expect this to converge toward some sort of consequentialist reasoning, because that sort of reasoning is generically useful for approaching complex domains. The goals of this consequentialist reasoning do not need to be exactly "predict accurately" however; they merely need to be adequately aligned with this in the training distribution.
  3. Combining #1 and #2, if the model gets some use out of developing consequentialist metacognition, and the pseudo-consequentialist model used to simulate characters in stories is "right there", the model might borrow it for metacognitive purposes. 

The frame I tend to think about it with is not exactly "how does it develop agency" but rather "how is agency ruled out". Although NNs don't neatly separate into different hypotheses (eg, circuits can work together rather than just compete with each other) it is still roughly right to think of NN training as rejecting lots of hypotheses and keeping around lots of other hypotheses. Some of these hypotheses will be highly agentic; we know NNs are capable of arriving at highly agentic policies in specific cases. So there's a question of whether those hypotheses can be ruled out in other cases. And then there's the more empirical question of, if we haven't entirely ruled out those agentic hypotheses, what degree of influence do they realistically have?

Seemingly the training data cannot entirely rule out an agentic style of reasoning (such as deceptive alignment), since agents can just choose to behave like non-agents. So, the inner alignment problem becomes: what other means can we use to rule out a large agentic influence? (Eg, can we argue that simplicity prior favors "honest" predictive models over deceptively aligned agents temporarily playing along with the prediction game?) The general concern is: no one has yet articulated a convincing answer, so far as I know.

Hence, I regard the problem more as a lack of any argument ruling out agency, rather than the existence of a clear positive argument that agency will arise. Others may have different views on this.

Comment by abramdemski on Leading The Parade · 2024-02-01T18:03:30.053Z · LW · GW

I am thinking of this as a noise-reducing modification to the loss function, similar to using model-based rather than model-free learning (which, if done well, rewards/punishes a policy based on the average reward/punishment it would have gotten over many steps).

If science were incentivized via prediction market (and assuming scientists can make sizable bets by taking out loans), then the first person to predict a thing wins most of the money related to it. In other words, prediction markets are approximately parade-leader-incentivizing. 

But if there's a race to be the first to bet, then this reward is high-variance; Newton could get priority over Leibniz by getting his ideas to the market a little faster. 

You recommend dividing credit more to all the people who could have gotten information to the market, with some kind of time-discount for when they could have done it. If we conceive of "who won the race" as introducing some noise into the credit-assignment, this is a way to de-noise things.
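
A toy version of that de-noised split, with a made-up discount rate and made-up arrival times:

```python
# Everyone who could have gotten the result to the "market" shares credit,
# discounted by how much later than the actual winner they would have
# arrived. Winner-take-all is the gamma -> 0 limit. All numbers made up.

def split_credit(arrival_times, gamma=0.5):
    """Credit_i proportional to gamma ** (t_i - t_min)."""
    t_min = min(arrival_times)
    raw = [gamma ** (t - t_min) for t in arrival_times]
    total = sum(raw)
    return [r / total for r in raw]

# A Newton/Leibniz-style near-tie: winner-take-all gives 100% to whoever was
# first; the time-discounted split gives them only a modest edge.
print(split_credit([0.0, 0.5]))              # roughly [0.59, 0.41]

# A big race with many near-simultaneous would-be discoverers dilutes everyone.
print(split_credit([0.0, 0.2, 0.4, 0.6, 0.8]))
```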

This has the consequence of taking away a lot of credit from race-winners when the race was pretty big, which is the part you focus on; based on this idea, you want to be part of smaller races (ideally size 1). But, outside-view, you should have wanted this all along anyway; if you are racing for status, but you are part of a big race, only a small number of people can win anyway, so your outside-view probability of personally winning status should already be divided by the number of racers. To think you have a good chance of winning such a race you must have personal reasons, and (since being in the race selects, in part, for people who think they can win) they're probably overconfident.

So for the most part your advice has no benefit for calibrated people, since being a parade-leader is hard.

There are for sure cases where your metric comes apart from expected-parade-leading by a lot more, though. A few years ago I heard accusations that one of the big names behind Deep Learning earned their status by visiting lots of research groups and keeping an eye out for what big things were going to happen next, and managing to publish papers on these big things just a bit ahead of everyone else. This strategy creates the appearance of being a fountain of information, when in fact the service provided is just a small speed boost to pre-existing trends. (I do not recall who exactly was being accused, and I don't have a lot of info on the reliability of this assessment anyway, it was just a rumor.)

Comment by abramdemski on Palworld development blog post · 2024-01-30T16:30:47.343Z · LW · GW

Typically, people say that the market is mostly efficient, and if there was financial alpha to be gained by doing hiring differently from most corporations, then there would already be companies outcompeting others by doing that. Well, here's a company doing some things differently and outcompeting other companies. Maybe there aren't enough people willing to do such things (who have the resources to) for the returns to reach an equilibrium?

Well, it could be that the practices lead to high-variance results, so that you should mostly expect companies which operate like that to fail, but you also expect a few unusually large wins. 

But I'm not familiar enough with the specific case to say anything substantial.

Comment by abramdemski on Introducing Alignment Stress-Testing at Anthropic · 2024-01-20T17:49:29.455Z · LW · GW

I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research more) vs 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.

Comment by abramdemski on Introducing Alignment Stress-Testing at Anthropic · 2024-01-19T19:22:10.664Z · LW · GW

So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?

Comment by abramdemski on Introducing Alignment Stress-Testing at Anthropic · 2024-01-19T18:55:26.712Z · LW · GW

Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?

For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by "bad at philosophy". 

But more generally, I have a sense that I just get BS from Claude, even when it isn't specifically trying to shoehorn its own safety measures in as the solution.

Comment by abramdemski on Introducing Alignment Stress-Testing at Anthropic · 2024-01-19T17:38:55.017Z · LW · GW

Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There's also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.

(Helping out with programming is also not the only way LLMs can help accelerate capabilities.)

So this seems like a generally dangerous overall dynamic -- LLMs are already better at accelerating capabilities progress than they are at accelerating alignment, and furthermore, it seems like the strong default is for this disparity to get worse and worse. 

I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.

Comment by abramdemski on Meaning & Agency · 2024-01-11T19:53:38.432Z · LW · GW

Thanks! 

Comment by abramdemski on Against Almost Every Theory of Impact of Interpretability · 2024-01-10T21:34:25.886Z · LW · GW

I'll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it's still impossible to solve the problem, but it's usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.

I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand. 

I agree it isn't a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won't have this failure mode, so I'm still quite comfortable with using it as an implication that isn't 100% accurate, but more like 90-95+% accurate.

I agree the example sucks and only serves to prove that it is not a logical implication.

A better example would be, like, the Goodhart model of AI risk, where any loss function that we optimize hard enough to get into superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not make an assumption that we need to prove 100% safety, but rather, argues, for reasons, from assumptions that it will be hard to get any safety at all from loss functions which merely coincide with what we want somewhat well.
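
A minimal "regressional Goodhart" sketch of the optimization-amplifies-divergence point, with a made-up Gaussian proxy: the harder we select on the proxy, the larger the gap between how good the pick looks and how good it actually is.

```python
import numpy as np

# A proxy that coincides with what we actually want somewhat well
# (proxy = true value + independent noise), optimized harder and harder by
# picking the best-looking of N candidates. All distributions are made up.

rng = np.random.default_rng(0)

def select_best(n, trials=200):
    true = rng.normal(size=(trials, n))
    proxy = true + rng.normal(size=(trials, n))
    idx = proxy.argmax(axis=1)
    rows = np.arange(trials)
    return proxy[rows, idx].mean(), true[rows, idx].mean()

for n in (10, 100, 10000):
    p, t = select_best(n)
    print(f"N={n:>5}: picked candidate looks like {p:.2f}, is actually worth {t:.2f}, gap {p - t:.2f}")
```

The gap keeps growing with N even though the proxy never stops being well correlated with the true value.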

I still think the list of lethalities is a pretty good reply to your overall line of reasoning -- IE it clearly flags that the problem is not achieving perfection, but rather, achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard, IE provides arguments rather than some kind of blind assumption like you seem to be indicating. 

You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but "things tend to be easy when you lift the requirement of perfection" is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.

Comment by abramdemski on Against Almost Every Theory of Impact of Interpretability · 2024-01-04T05:16:50.282Z · LW · GW

I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it. 

It mostly didn't.

  • A lot of this boils down to "existing interpretability work is unimpressive". I think this is an important point, and significant sub-points were raised to argue it. However, it says little 'against almost every theory of impact of interpretability'. We can just do better work.
  • A lot of the rest boils down to "enumerative safety is dumb". I agree, at least for the version of "enumerative safety" you argue against here. 

My impact story (for the work I am considering doing) is most similar to the "retargeting" story which you briefly mention, but barely critique.

I do think the world would be better off if this were required reading for anyone considering going into interpretability vs other areas. (Barring weird side-effects of the counterfactual where someone has the ability to enforce required reading...) It is a good piece of work which raises many important points.

Comment by abramdemski on Against Almost Every Theory of Impact of Interpretability · 2024-01-04T04:34:46.951Z · LW · GW

More generally, if we grant that we don't need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it's often doable to solve problems if you don't need perfect or arbitrarily good solutions.

It seems really easy to spell out worldviews where "we don't need perfection, or arbitrarily good alignment" holds but "alignment should be really easy" does not. To give a somewhat silly example based on the OP, I could buy Enumerative Safety in principle -- so if we can check all the features for safety, we can 100% guarantee the safety of the model. It then follows that if we can check 95% of the features (sampled randomly) then we get something like a 95% safety guarantee (depending on priors). 

But I might also think that properly "checking" even one feature is really, really hard.

So I don't buy the claimed implication: "we don't need perfection" does not imply "alignment should be really easy". Indeed, I think the implication quite badly fails.
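
For what it's worth, here is a back-of-the-envelope version of the 95%-of-features example above, under the loud (and silly) assumption that an unsafe model has exactly one bad feature placed uniformly at random, and that checking a feature detects badness perfectly:

```python
# q is the prior probability that the model is unsafe at all.
# If the model is unsafe, the single bad feature evades detection only if it
# falls in the unchecked fraction of features.

def posterior_unsafe(q, fraction_checked):
    missed = 1.0 - fraction_checked
    return q * missed / (q * missed + (1.0 - q))

for q in (0.1, 0.5, 0.9):
    post = posterior_unsafe(q, 0.95)
    print(f"prior P(unsafe) = {q:.1f}  ->  P(unsafe | 95% of features check out) = {post:.3f}")
```

With an even prior this does come out to roughly a 95% guarantee, but with a pessimistic prior it is much weaker -- which is the "(depending on priors)" caveat, and it leaves untouched the separate problem that checking even one feature may be really hard.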

Comment by abramdemski on AI Is Not Software · 2024-01-02T23:51:33.276Z · LW · GW

Compare this to a similar argument that a hardware enthusiast could use to argue against making a software/hardware distinction. You can argue that saying "software" is misleading because it distracts from the physical reality. Software is still present physically somewhere in the computer. Software doesn't do anything hardware can't do, since software doing is just hardware doing

But thinking in this way will not be a very good way of predicting reality. The hypothetical hardware enthusiast would not be able to predict the rise of the "programmer" profession, or the great increase in complexity of things that machines can do thanks to "programming". 

I think it is more helpful to think of modern AI as a paradigm shift in the same way that the shift from "electronic" (hardware) to "digital" (software) was a paradigm shift. Sure, you can still use the old paradigm to put labels on things. Everything is "still hardware". But doing so can miss an important transition.

Comment by abramdemski on AI Is Not Software · 2024-01-02T23:38:54.001Z · LW · GW

While I agree that wedding photos and NN weights are both data, and this helps to highlight ways they "aren't software", I think this undersells the point. NN weights are "active" in ways wedding photos aren't. The classic code/data distinction has a mostly-OK summary: code is data of type function. Code is data which can be "run" on other data.

NN weights are "of type function" too: the usual way to use them is to "run" them. Yet, it is pretty obvious that they are not code in the traditional sense. 

So I think this is similar to a hardware geek insisting that code is just hardware configuration, like setting a dial or flipping a set of switches. To the hypothetical hardware geek, everything is hardware; "software" is a physical thing just as much as a wire is. An arduino is just a particularly inefficient control circuit.

So, although from a hardware perspective you basically always want to replace an arduino with a more special-purpose chip, "something magical" happens when we move to software -- new sorts of things become possible.

Similarly, looking at AI as data rather than code may be a way to say that AI "isn't software" within the paradigm of software, but it is not very helpful for understanding the large shift that is taking place. I think it is better to see this as a new layer in somewhat the same way as software was a new layer on top of hardware. The kinds of thinking you need to do in order to do something with hardware vs do something with software are quite different, but ultimately, more similar to each other than they both are to how to do something with AI.

Comment by abramdemski on Meaning & Agency · 2023-12-21T18:55:10.715Z · LW · GW

Ah, very interesting, thanks! I wonder if there is a different way to measure relative endorsement that could achieve transitivity.

Comment by abramdemski on Meaning & Agency · 2023-12-21T18:48:21.199Z · LW · GW

Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition. 

First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.

But, if children or animals who are intuitively agents often don't fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on "smaller" perspectives is very important.

Comment by abramdemski on Meaning & Agency · 2023-12-21T18:44:56.332Z · LW · GW

I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then "deleting" it, but this just doesn't feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.

I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn't already rely on some other agent-detector piece which helps us understand how to remove the agent.

I think this is where Flint's framework was insightful. Instead of "detecting" and "deleting" the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this.

Looking back at Flint's work, I don't agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no "compare your optimizer to this" step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the 'optimizer') is sensitive to very small perturbations, which can take the whole system out of the attractor basin. 

In any case, I agree that Flint's work also eliminates the need for an unnatural baseline in which we have to remove the agent. 

Overall, I expect my definition to be more useful to alignment, but I don't currently have a well-articulated argument for that conclusion. Here are some comparison points:

  • Flint's definition requires a system with stable dynamics over time, so that we can define an iteration rule. My definition can handle that case, but does not require it. So, for example, Flint's definition doesn't work well for a goal like "become President in 2030" -- it works better for continual goals, like "be president".
  • Flint's notion of robustness involves counterfactual perturbations which we may never see in the real world. I feel a bit suspicious about this aspect. Can counterfactual perturbations we'll never see in practice be really relevant and useful for reasoning about alignment?
  • Flint's notion is based more on the physical system, whereas mine is more about how we subjectively view that system. 
  • I feel that "endorsement" comes closer to a concept of alignment. Because of the subjective nature of endorsement, it comes closer to formalizing when an optimizer is trusted, rather than merely good at its job. 
  • It seems more plausible that we can show (with plausible normative assumptions about our own reasoning) that we (should) absolutely endorse some AI, in comparison to modeling the world in sufficient detail to show that building the AI would put us into a good attractor basin.
  • I suspect Flint's definition suffers more from the value change problem than mine, although I think I haven't done the work necessary to make this clear.

Comment by abramdemski on Meaning & Agency · 2023-12-21T17:43:19.300Z · LW · GW

There are several compromises I made for the sake of getting the idea across as simply as I could. 

  • I think the graduate-level-textbook version of this would be much more clear about what the quotes are doing. I was tempted to not even include the quotes in the mathematical expressions, since I don't think I'm super clear about why they're there.
  • I totally ignored the difference between P(A|B) (probability conditional on B) and P_B(A) (probability after learning B).
  • I neglect to include quantifiers in any of my definitions; the reader is left to guess which things are implicitly universally quantified.

I think I do prefer the version I wrote, which uses P(A|B) rather than P_B(A), but obviously the English-language descriptions ignore this distinction and make it sound like what I really want is P_B(A).

It seems like the intention is that P_1 "learns" or "hears about" P_2's belief, and then P_1 updates (in the above Bayesian inference sense) to have a new P_1 that has the consistency condition with P_2.

Obviously we can consider both possibilities and see where that goes, but I think maybe the conditional version makes more sense as a notion of whether you right now endorse something. A conditional probability is sort of like a plan for updating. You won't necessarily follow the plan exactly when you actually update, but the conditional probability is your best estimate.
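
As a minimal numerical illustration of the conditional (endorsement) version -- a made-up joint distribution over an event X and the other predictor's announced probability, checked against the condition that my probability of X, conditional on "they say p", equals p:

```python
import numpy as np

# Rows: the other predictor announces P(X) = 0.2 or 0.8.
# Columns: X is false / true. Entries are my joint probabilities (made up).
joint = np.array([[0.40, 0.10],    # they announce 0.2
                  [0.10, 0.40]])   # they announce 0.8
announced = [0.2, 0.8]

for row, p in enumerate(announced):
    my_conditional = joint[row, 1] / joint[row].sum()   # P(X | "they say p")
    print(f'P(X | "they say {p}") = {my_conditional:.2f}  (endorsed on this event iff it equals {p})')
```

This particular table happens to satisfy the condition for both announcements.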

To throw some terminology out there, let's call my thing "endorsement" and a version which uses actual updates rather than conditionals "deference" (because you'd actually defer to their opinions if you learn them). 

  • You can know whether you endorse something, since you can know your current conditional probabilities (to within some accuracy, anyway). It is harder to know whether you defer to something, since in the case where updates don't equal conditionals, you must not know what you are going to update to. I think it makes more sense to define the intentional stance in terms of something you can more easily know about yourself. 
  • Using endorsement to define agency makes it about how you reason about specific hypotheticals, whereas using deference to try and define agency would make it about what actually happens in those hypotheticals (ie, how you would actually update if you learned a thing). Since you might not ever get to learn that thing, this makes endorsement more well-defined than deference. 

Bayes' theorem is the statement about P(A|B), which is true from the axioms of probability theory for any A and B whatsoever.

I actually prefer the view of Alan Hajek (among others) who holds that P(A|B) is a primitive, not defined as in Bayes' ratio formula for conditional probability. Bayes' ratio formula can be proven in the case where P(B)>0, but if P(B)=0 it seems better to say that conditional probabilities can exist rather than necessarily being undefined. For example, we can reason about the conditional probability that a meteor hits land given that it hits the equator, even if hitting the equator is a measure zero event. Statisticians learn to compute such things in advanced stats classes, and it seems sensible to unify such notions under the formal P(A|B) rather than insisting that they are technically some other thing.

By putting "P_2(X) = p" in the conditional, you're saying that it's an event on Ω, a thing with the same type as X. And it feels like that's conceptually correct, but also kind of the hard part. It's as if P_1 is modelling P_2 as an agent embedded into Ω.

Right. This is what I was gesturing at with the quotes. There has to be some kind of translation from P_2 (which is a mathematical concept 'outside' Ω) to an event inside Ω. So the quotes are doing something similar to a Goedel encoding.

While trying to understand the equations, I found it easier to visualize P_1 and P_2 as two separate distributions on the same Ω, where endorsement is simply a consistency condition. For belief consistency, you would just say that P_1 endorses P_2 on event X if P_1(X) = P_2(X).

But that isn't what you wrote; instead you wrote this thing with conditioning on a quoted thing. And of course, the thing I said is symmetrical between P_1 and P_2, whereas your concept of endorsement is not symmetrical.

The asymmetry is quite important. If we could only endorse things that have exactly our opinions, we could never improve.

Comment by abramdemski on An Orthodox Case Against Utility Functions · 2023-12-18T18:01:00.739Z · LW · GW

The post is making the distinction between seeing preferences as a utility function of worlds (this is the regular old idea of utility functions as random variables) vs seeing preferences as an expectation function on events (the jeffrey-bolker view). Both perspectives hold that an agent can optimize things it does not have direct access to. Agency is optimization at a distance. Optimization that isn't at a distance is selection as opposed to control.
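
To make the contrast concrete (standard formulations, not notation from the post):

```latex
% Utility as a random variable on worlds:
U : \Omega \to \mathbb{R}, \qquad V(A) = \mathbb{E}[U \mid A].

% Jeffrey-Bolker: V is primitive on events; for disjoint A, B with P(A \vee B) > 0:
V(A \vee B) = \frac{P(A)\,V(A) + P(B)\,V(B)}{P(A) + P(B)}.
```

In the first view the value of an event is derived from a utility over fully specified worlds; in the second, value attaches directly to events, with the averaging condition as the only glue.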

Comment by abramdemski on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-12-15T17:59:15.197Z · LW · GW

I agree that this is an important distinction, but I personally prefer to call it "transformative AI" or some such.

Comment by abramdemski on Self-Referential Probabilistic Logic Admits the Payor's Lemma · 2023-12-10T22:22:29.865Z · LW · GW

The interesting thing about this -- beyond showing that going probabilistic allows the handshake to work with somewhat unreliable bots -- is that proving P(C) ≥ p rather than C is a lot different. With C, we're like "And so Peano arithmetic (or whatever) proves they cooperate! We think Peano arithmetic is accurate about such matters, so, they actually cooperate." 

With the conclusion P(C) ≥ p we're more like "So if the agent's probability estimates are any good, we should also expect them to cooperate" or something like that. The connection to them actually cooperating is looser.

Comment by abramdemski on FixDT · 2023-12-05T19:03:04.177Z · LW · GW

An intriguing perspective, but I'm not sure whether I agree. Naively, it would seem that a choice between fixed points in the FixDT setting is just a choice between different probability distributions, which brings us very close to the VNM idea of a choice between gambles. So VNM-like utility theory seems like the obvious outcome.

That being said, I don't really agree with the idea that an agent should have a fixed VNM-like utility function. So I do think some generalization is needed.

Comment by abramdemski on FixDT · 2023-12-05T18:40:46.275Z · LW · GW

Yeah, "settles on" here meant however the agent selects beliefs. The epistemic constraint implies that the agent uses exhaustive search or some other procedure guaranteed to produce a fixed point, rather than Banach-style iteration. 

Moving to a Banach-like setting will often make the fixed points unique, which takes away the whole idea of FixDT.

Moving to a setting where the agent isn't guaranteed to converge would mean we have to re-write the epistemic constraint to be appropriate to that setting.

Comment by abramdemski on FixDT · 2023-12-03T17:01:18.346Z · LW · GW

Yes, thanks for citing it here! I should have mentioned it, really.

I see the Skyrms iterative idea as quite different from the "just take a fixed point" theory I sketch here, although clearly they have something in common. FixDT makes it easier to combine both epistemic and instrumental concerns -- every fixed point obeys the epistemic requirement; and then the choice between them obeys the instrumental requirement. If we iteratively zoom in on a fixed point instead of selecting from the set, this seems harder?
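
Here is a minimal sketch of the "just take a fixed point" version, with a made-up self-fulfilling map: every fixed point satisfies the epistemic constraint, and the instrumental constraint then selects among them.

```python
import numpy as np

# Toy setting: believing the event with probability q makes it happen with
# probability f(q). The map f and the utilities are made up for illustration.

def f(q):
    return q**2 / (q**2 + (1 - q)**2)   # fixed points at 0, 0.5, and 1

utility_if_event, utility_if_not = 1.0, 0.0

# Epistemic constraint: beliefs must be (approximate) fixed points of f.
grid = np.linspace(0.0, 1.0, 1001)
fixed_points = [float(q) for q in grid if abs(f(q) - q) < 1e-9]

# Instrumental constraint: among the fixed points, pick the best one.
def expected_utility(q):
    return q * utility_if_event + (1 - q) * utility_if_not

best = max(fixed_points, key=expected_utility)
print("fixed points:", fixed_points)   # roughly [0.0, 0.5, 1.0]
print("chosen belief:", best)          # 1.0
```

In this toy map, iterating f from a starting guess just slides into whichever basin the guess began in, rather than letting the instrumental preference choose between the fixed points.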

If we try the Skyrms iteration thing, maybe the most sensible thing would be to move toward the beliefs of greatest expected utility -- but do so in a setting where epistemic utility emerges naturally from pragmatic concerns (such as A Pragmatist's Guide to Epistemic Decision Theory by Ben Levinstein). So the agent is only ever revising its beliefs in pragmatic ways, but we assume enough about the environment that it wants to obey both the epistemic and instrumental constraints? But, possibly, this assumption would just be inconsistent with the sort of decision problem which motivates FixDT (and Greaves).

Comment by abramdemski on FixDT · 2023-12-01T16:42:49.663Z · LW · GW

That would be really cool.