Posts

If you weren't such an idiot... 2024-03-02T00:01:37.314Z
ARC is hiring theoretical researchers 2023-06-12T18:50:08.232Z
How to do theoretical research, a personal perspective 2022-08-19T19:41:21.562Z
ELK prize results 2022-03-09T00:01:02.085Z
ELK First Round Contest Winners 2022-01-26T02:56:56.089Z
ARC's first technical report: Eliciting Latent Knowledge 2021-12-14T20:09:50.209Z
ARC is hiring! 2021-12-14T20:09:33.977Z
Your Time Might Be More Valuable Than You Think 2021-10-18T00:55:03.380Z
The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument 2021-10-01T22:23:23.488Z
Fractional progress estimates for AI timelines and implied resource requirements 2021-07-15T18:43:10.163Z
Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress. 2021-07-08T22:14:23.374Z
Anthropic Effects in Estimating Evolution Difficulty 2021-07-05T04:02:18.242Z
An Intuitive Guide to Garrabrant Induction 2021-06-03T22:21:41.877Z
Rogue AGI Embodies Valuable Intellectual Property 2021-06-03T20:37:30.805Z
Intermittent Distillations #3 2021-05-15T07:13:24.438Z
Pre-Training + Fine-Tuning Favors Deception 2021-05-08T18:36:06.236Z
Less Realistic Tales of Doom 2021-05-06T23:01:59.910Z
Agents Over Cartesian World Models 2021-04-27T02:06:57.386Z
[Linkpost] Treacherous turns in the wild 2021-04-26T22:51:44.362Z
Intermittent Distillations #2 2021-04-14T06:47:16.356Z
Transparency Trichotomy 2021-03-28T20:26:34.817Z
Intermittent Distillations #1 2021-03-17T05:15:27.117Z
Strong Evidence is Common 2021-03-13T22:04:40.538Z
Open Problems with Myopia 2021-03-10T18:38:09.459Z
Towards a Mechanistic Understanding of Goal-Directedness 2021-03-09T20:17:25.948Z
Coincidences are Improbable 2021-02-24T09:14:11.918Z
Chain Breaking 2020-12-29T01:06:04.122Z
Defusing AGI Danger 2020-12-24T22:58:18.802Z
TAPs for Tutoring 2020-12-24T20:46:50.034Z
The First Sample Gives the Most Information 2020-12-24T20:39:04.936Z
Does SGD Produce Deceptive Alignment? 2020-11-06T23:48:09.667Z
What posts do you want written? 2020-10-19T03:00:26.341Z
The Solomonoff Prior is Malign 2020-10-14T01:33:58.440Z
What are objects that have made your life better? 2020-05-21T20:59:27.653Z
What are your greatest one-shot life improvements? 2020-05-16T16:53:40.608Z
Training Regime Day 25: Recursive Self-Improvement 2020-04-29T18:22:03.677Z
Training Regime Day 24: Resolve Cycles 2 2020-04-28T19:00:09.060Z
Training Regime Day 23: TAPs 2 2020-04-27T17:37:15.439Z
Training Regime Day 22: Murphyjitsu 2 2020-04-26T20:18:50.505Z
Training Regime Day 21: Executing Intentions 2020-04-25T22:16:04.761Z
Training Regime Day 20: OODA Loop 2020-04-24T18:11:30.506Z
Training Regime Day 19: Hamming Questions for Potted Plants 2020-04-23T16:00:10.354Z
Training Regime Day 18: Negative Visualization 2020-04-22T16:06:46.138Z
Training Regime Day 17: Deflinching and Lines of Retreat 2020-04-21T17:45:34.766Z
Training Regime Day 16: Hamming Questions 2020-04-20T14:51:31.310Z
Mark Xu's Shortform 2020-03-10T08:11:23.586Z
Training Regime Day 16: Hamming Questions 2020-03-01T18:46:32.335Z
Training Regime Day 15: CoZE 2020-02-29T17:13:42.685Z
Training Regime Day 14: Traffic Jams 2020-02-28T17:52:28.354Z
Training Regime Day 13: Resolve Cycles 2020-02-27T17:45:07.845Z

Comments

Comment by Mark Xu (mark-xu) on The strategy-stealing assumption · 2024-01-29T18:32:07.420Z · LW · GW

It's important to distinguish between:

  • the strategy of "copy P2's strategy" is a good strategy
  • because P2 had a good strategy, there exists a good strategy for P1

The strategy-stealing assumption isn't saying that copying strategies is a good strategy; it's saying the possibility of copying means that there exists a strategy P1 can take that is just as good as P2's.

Comment by Mark Xu (mark-xu) on Is a random box of gas predictable after 20 seconds? · 2024-01-25T00:57:07.751Z · LW · GW

You could instead ask whether or not the observer could predict the location of a single particle p0, perhaps stipulating that p0 isn't the particle that's randomly perturbed.

My guess is that a random 1 angstrom perturbation is enough so that p0's location after 20s is ~uniform. This question seems easier to answer, and I wouldn't really be surprised if the answer is no?

Here's a really rough estimate: This says the collision rate is ~10^{10} s^{-1}, so 3s after the start ~everything will have hit the randomly perturbed particle, and then there are 17 * 10^{10} more collisions, each of which adds ~1 angstrom of uncertainty to p0. 1 angstrom is 10^{-10}m, so the total uncertainty is on the order of 10m, which means it's probably uniform? This actually came out closer than I thought it would, so now I'm less certain that it's uniform.
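
Here's the same back-of-envelope arithmetic written out, with the collision rate, settling time, and per-collision kick as assumed inputs rather than measured values:

```python
# Same rough numbers as above; all of these are assumptions, not measurements.
collision_rate = 1e10     # collisions per second for a given particle
settle_time = 3           # seconds until ~everything has interacted with the perturbed particle
total_time = 20           # seconds of evolution
kick = 1e-10              # ~1 angstrom of positional uncertainty added per collision

collisions = (total_time - settle_time) * collision_rate  # ~1.7e11 collisions
uncertainty_m = collisions * kick                         # ~17 m
print(f"~{collisions:.1e} collisions, ~{uncertainty_m:.0f} m of positional uncertainty")
```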

This is a slightly different question from predicting the total # of particles on each side, but it intuitively becomes much harder to predict the # of particles if you have to make your prediction via higher-order effects, which will probably be smaller.

Comment by Mark Xu (mark-xu) on Prizes for matrix completion problems · 2023-08-08T20:10:24.839Z · LW · GW

The bounty is still active. (I work at ARC)

Comment by Mark Xu (mark-xu) on All AGI Safety questions welcome (especially basic ones) [July 2023] · 2023-07-22T01:23:57.642Z · LW · GW

Humans going about their business without regard for plants and animals has historically not been that great for a lot of them.

Comment by Mark Xu (mark-xu) on Ban development of unpredictable powerful models? · 2023-07-11T05:22:10.100Z · LW · GW

Here are some things I think you can do:

  • Train a model to be really dumb unless I prepend a random secret string. The government doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

  • I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.

  • I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.

Comment by Mark Xu (mark-xu) on Mechanistic anomaly detection and ELK · 2023-07-10T23:16:37.358Z · LW · GW

You have to specify your backdoor defense before the attacker picks which input to backdoor.

Comment by Mark Xu (mark-xu) on Getting Your Eyes On · 2023-05-08T18:11:51.249Z · LW · GW

I think Luke told your mushroom story to me. Defs not a coincidence.

Comment by Mark Xu (mark-xu) on What can we learn from Bayes about reasoning? · 2023-05-05T18:37:40.488Z · LW · GW

If you observe 2 pieces of evidence, you have to condition the 2nd on seeing the 1st to avoid double-counting the evidence.
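
Written out (standard conditioning; the symbols H, E_1, E_2 are just my notation for the hypothesis and the two pieces of evidence):

$$P(H \mid E_1, E_2) \propto P(H)\,P(E_1 \mid H)\,P(E_2 \mid H, E_1)$$

Using P(E_2 | H) in place of P(E_2 | H, E_1) double-counts whatever part of E_2 was already implied by E_1.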

Comment by Mark Xu (mark-xu) on LLMs and computation complexity · 2023-04-30T19:01:03.920Z · LW · GW

A human given finite time to think also only performs O(1) computation, and thus cannot "solve computationally hard problems".

Comment by Mark Xu (mark-xu) on Should we publish mechanistic interpretability research? · 2023-04-21T21:47:00.308Z · LW · GW

I don't really want to argue about language. I'll defend "almost no individual has a pretty substantial effect on capabilities." I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that's bad-on-net for x-risk.

Chris Olah's interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering

I think this is false, and that most ML classes are not about making people good at ML engineering. I think Olah's stuff is disproportionately represented because it's interesting and is presented well, and also that classes really love seeming "rigorous" in ways that are somewhat arbitrary. Similarly, proofs of the correctness of backprop are probably common in ML classes, but not that relevant to being a good ML engineer?

I also bet that if we were to run a survey on what blogposts and papers top ML people would recommend that others should read to become better ML engineers, you would find a decent number of Chris Olah's publications in the top 10 and top 100.

I would be surprised if lots of ML engineers thought that Olah's work was in the top 10 best things to read to become a better ML engineer. I have weaker beliefs about the top 100. I would take even odds (and believe something closer to 4:1 or whatever) that if you surveyed good ML engineers and asked for top-10 lists, not a single Olah interpretability piece would be among the top 10 most-mentioned things. I think most of the list would be things about e.g. debugging workflow, how to deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.

I don't understand why we should have a prior that interpretability research is inherently safer than other types of ML research?

Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It's just like, if you're not trying to eke out more oomph from SGD, then probably the stuff you're doing isn't going to let you eke out more oomph from SGD, because it's kinda hard to do that and people are trying many things.

Comment by Mark Xu (mark-xu) on Should we publish mechanistic interpretability research? · 2023-04-21T20:41:12.388Z · LW · GW

Similarly, if you thought that you should publish capabilities research to accelerate to AGI, and you found out how to build AGI, then whether you should publish is not really relevant anymore.

Comment by Mark Xu (mark-xu) on Should we publish mechanistic interpretability research? · 2023-04-21T20:40:19.682Z · LW · GW

I think it's probably reasonable to hold off on publishing interpretability work if you strongly suspect that it also advances capabilities. But then that's just an instance of the general principle of "maybe don't advance capabilities", and the interpretability part was irrelevant. I don't really buy that interpretability is so likely to increase capabilities that you should have a sense of general caution around it. If you have a specific sense that e.g. working on nuclear fission could produce a bomb, then maybe you shouldn't publish (as historically happened with e.g. research on graphite as a neutron moderator, I think), but generically not publishing physics stuff because "it might be used to build a bomb, vaguely" seems like it basically won't matter.

I think Gwern is an interesting case, but also idk what Gwern was trying to do. I would also be surprised if Gwern's effect was "pretty substantial" by my lights (e.g. I don't think Gwern explained > 1%, or probably even > 0.1%, of the variance in capabilities, and by the time you're calling 1000 things "pretty substantial effects on capabilities" idk what "pretty substantial" means).

Comment by Mark Xu (mark-xu) on Should we publish mechanistic interpretability research? · 2023-04-21T20:29:50.569Z · LW · GW

I think this case is unclear, but also not central, because I'm imagining the primary benefit of publishing interp research as making interp research go faster, and in this case it seems like you've basically "solved interp", so the benefits no longer really apply?

Comment by Mark Xu (mark-xu) on Should we publish mechanistic interpretability research? · 2023-04-21T17:52:49.531Z · LW · GW

Naively, there are so few people working on interp, and so many people working on capabilities, that publishing is clearly good for relative progress. So you need a pretty strong argument that interp in particular is good for capabilities, which isn't borne out empirically and also doesn't seem that strong.

In general, this post feels like it's listing a bunch of considerations that are pretty small, and the 1st order consideration is just like "do you want people to know about this interpretability work", which seems like a relatively straightforward "yes".

I also separately think that LW tends to reward people for being "capabilities cautious" more than is reasonable, and that once you've made the decision to not specifically work towards advancing capabilities, the capabilities externalities of your research probably don't matter ex ante.

Comment by Mark Xu (mark-xu) on New blog: Planned Obsolescence · 2023-03-29T19:44:09.061Z · LW · GW

"if you've built a powerful enough optimizer to automate scientific progress, your AI has to understand your conception of goodness to avoid having catastrophic consequences, and this requires making deep advances such that you're already 90% of the way to 'build an actual benevolent sovereign."

I think this is just not true? Consider an average human, who understands goodness well enough to do science without catastrophic consequences, but is not a benevolent sovereign. One reason why they're not a sovereign is that they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other "common sense morality." AIs could just act similarly? Current AIs already seem like they basically know what types of things humans would think are bad or good, at least enough to know that when humans ask for coffee, they don't mean "steal the coffee" or "do some complicated scheme that results in coffee".

Separately, it seems like in order for your AI to act competently in the world it does have to have a pretty good understanding of "goodness", e.g. to be able to understand why Google doesn't do more spying on competitors, or more insider trading, or other unethical but profitable things, etc. (Separately, the AI will also be able to write philosophy books that are better than current ethical philosophy books, etc.)

My general claim is that if the AI takes creative catastrophic actions to disempower humans, it's going to know that the humans don't like this, are going to resist in the ways that they can, etc. This is a fairly large part of "understanding goodness", and enough (it seems to me) to avoid catastrophic outcomes, as long as the AI tries to do [its best guess at what the humans wanted it to do] and not [just optimize for the thing the humans said to do, which it knows is not what the humans wanted it to do].

Comment by Mark Xu (mark-xu) on New blog: Planned Obsolescence · 2023-03-28T01:32:53.811Z · LW · GW

But from an outer alignment perspective, it's nontrivial to specify this such that, say, it doesn't convert all the earth to computronium running instances of google ad servers, and bots that navigate google clicking on ads all day.

But Google didn't want their AIs to do that, so if the AIs do that then the AIs weren't aligned. Same with the mind-hacking.

In general, your AI has some best guess at what you want it to do, and if it's aligned it'll do that thing. If it doesn't know what you meant, then maybe it'll make some mistakes. But the point is that aligned AIs don't take creative actions to disempower humans in ways that humans didn't intend, which is separate from humans intending good things.

Comment by Mark Xu (mark-xu) on What happens with logical induction when... · 2023-03-28T01:19:41.699Z · LW · GW

My shitty guess is that you're basically right that giving a finite set of programs infinite money can sort of be substituted for the theorem prover. One issue is that logical inductor traders have to be continuous, so you have to give an infinite family of programs "infinite money" (or just an increasing unbounded amount as eps -> 0)

I think if these axioms were inconsistent, then there wouldn't be a price at which no trades happen so the market would fail. Alternatively, if you wanted the infinities to cancel, then the market prices could just be whatever they wanted (b/c you would get infinite buys and sells for any price in (0, 1)).

Comment by Mark Xu (mark-xu) on How To Go From Interpretability To Alignment: Just Retarget The Search · 2022-08-28T16:22:14.850Z · LW · GW

I think competitiveness matters a lot even if there's only moderate amounts of competitive pressure. The gaps in efficiency I'm imagining are less "10x worse" and more like "I only had support vector machines and you had SGD"

Comment by Mark Xu (mark-xu) on How To Go From Interpretability To Alignment: Just Retarget The Search · 2022-08-26T16:40:32.663Z · LW · GW

Humans, despite being fully general, have vastly varying ability to do various tasks, e.g. they're much better at climbing mountains than at playing Go, it seems. Humans also routinely construct entire technology bases to enable them to do tasks that they cannot do themselves. This is, in some sense, a core human economic activity: the construction of artifacts that can do tasks better/faster/more efficiently than humans can do themselves. It seems like by default you should expect a similar dynamic with "fully general" AIs. That is, AIs trained to do semiconductor manufacturing will create their own technology bases, specialized predictive artifacts, etc., and not just "think really hard" and "optimize within their own heads." This also suggests a recursive form of the alignment problem, where an AI that wants to optimize human values is in a similar situation to us: it's easy to construct powerful artifacts with SGD that optimize measurable rewards, but it doesn't know how to do that for human values/things that can't be measured.

Even if you're selecting reasonably hard for "ability to generalize", by default the range of tasks you're selecting for isn't all going to be "equally difficult", and you're going to get an AI that is much better at some tasks than others, has heuristics that enable it to accurately predict key intermediates across many tasks, heuristics that enable it to rapidly determine which portions of the action space are even feasible, etc. Asking that your AI can also generalize to "optimize human values" as well as the best available combination of skills that it has otherwise seems like a huge ask. Humans, despite being fully general, find it much harder to optimize for some things than others, e.g. constructing large cubes of iron versus status seeking, despite being able in theory to optimize for constructing large cubes of iron.

Comment by Mark Xu (mark-xu) on How To Go From Interpretability To Alignment: Just Retarget The Search · 2022-08-25T16:34:58.540Z · LW · GW

Not literally the best, but retargetable algorithms are on the far end of the spectrum from "fully specialized" to "fully general", and I expect most tasks we train AIs to do to have heuristics that enable solving them much faster than "fully general" algorithms can, so there's decently strong pressure towards the "specialized" side.

I also think that heuristics are going to be closer to multiplicative speedups than additive, so it's going to be closer to "general algorithms just can't compete" than "it's just a little worse". E.g. random search is terrible compared to anything exploiting non-trivial structure (random sorting vs quicksort is, I think, a representative example, where you can go from exp -> pseudolinear if you are specialized to your domain).
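
As a toy illustration of that kind of gap (my own made-up comparison, not anything from the post): the expected number of orderings a fully random sort tries versus the comparisons a structured sort needs.

```python
import math

# Random-permutation ("bogo") sort tries ~n! orderings in expectation,
# while a comparison sort like quicksort/mergesort needs ~n*log2(n) comparisons.
for n in (5, 10, 15, 20):
    random_orderings = math.factorial(n)
    structured_comparisons = n * math.log2(n)
    print(f"n={n:2d}: ~{random_orderings:.1e} orderings vs ~{structured_comparisons:.0f} comparisons")
```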

Comment by Mark Xu (mark-xu) on How to do theoretical research, a personal perspective · 2022-08-20T14:48:52.060Z · LW · GW

oops thanks

yeah, should be x AND y.

Comment by Mark Xu (mark-xu) on How To Go From Interpretability To Alignment: Just Retarget The Search · 2022-08-14T03:38:02.296Z · LW · GW

One of the main reasons I expect this to not work is that optimization algorithms that are the best at optimizing some objective given a fixed compute budget seem like they basically can't be generally retargetable. E.g. if you consider something like Stockfish, it's a combination of search (which is retargetable), sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to "maximize the max number of pawns you ever have", you would not be able to use its [specialized for telling whether a move is likely going to win the game] heuristics to speed up your search for moves. A more extreme example: the entire endgame tablebase is useless to you, and you probably have to recompute the whole thing.

Something like [the strategy stealing assumption](https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334) is needed to even obtain the existence of a set of heuristics for speeding up the search for moves that "maximize the max number of pawns you ever have" that is just as good as the [telling whether a move will win the game] heuristics. Actually finding that set of heuristics is probably going to require an entirely parallel learning process.

This also implies that even if your AI has the concept of "human values" in its ontology, you still have to do a bunch of work to get an AI that can actually estimate the long-run consequences of any action on "human values", or else it won't be competitive with AIs that have more specialized optimization algorithms.

Comment by Mark Xu (mark-xu) on On how various plans miss the hard bits of the alignment challenge · 2022-07-12T16:53:39.376Z · LW · GW

Flagging that I don't think your description of what ELK is trying to do is that accurate, e.g. we explicitly don't think that you can rely on using ELK to ask your AI if it's being deceptive, because it might just not know. In general, we're currently quite comfortable with not understanding a lot of what our AI is "thinking", as long as we can get answers to a particular set of "narrow" questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.

Separately, I think that ELK isn't intended to address the problem you refer to as a "sharp-left turn" as I understand it. Vaguely, ELK is intended to be an ingredient in an outer-alignment solution, while it seems like the problem you describe falls roughly into the "inner alignment" camp. More specifically, but still at a high-level of gloss, the way I currently see things is:

  • If you want to train a powerful AI, currently the set of tasks you can train your AI on will, by default, result in your AI murdering you.
  • Because we currently cannot teach our AIs to be powerful by doing anything except rewarding them for doing things that straightforwardly imply that they should disempower humans, you don't need a "sharp left turn" in order for humanity to end up disempowered.
  • Given this, it seems like there's still a substantial part of the difficulty of alignment that remains to be solved even if we knew how to cope with the "sharp left turn." That is, even if capabilities were continuous in SGD steps, training powerful AIs would still result in catastrophe.
  • ELK is intended to be an ingredient in tackling this difficulty, which has been traditionally referred to as "outer alignment."

Even more separately, it currently seems to me like it's very hard to work on the problem you describe while treating other components [like your loss function] as a black box, because my guess is that "outer alignment" solutions need to do non-trivial amounts of "reaching inside the model's head" to be plausible, and a lot of how to ensure capabilities and alignment generalize together is going to depend on the details of how you would have prevented the model from murdering you in the [capabilities continuous with SGD] world.

ELK for learned optimizers has some more details.

Comment by Mark Xu (mark-xu) on Conversation with Eliezer: What do you want the system to do? · 2022-06-25T22:16:25.724Z · LW · GW

If powerful AIs are deployed in worlds mostly shaped by slightly less powerful AIs, you basically need competitiveness to be able to take any "pivotal action" because all the free energy will have been eaten by less powerful AIs.

Comment by Mark Xu (mark-xu) on AI-Written Critiques Help Humans Notice Flaws · 2022-06-25T22:05:30.733Z · LW · GW

The humans presumably have access to the documents being summarized.

Comment by Mark Xu (mark-xu) on Conversation with Eliezer: What do you want the system to do? · 2022-06-25T20:15:12.621Z · LW · GW

Here's a conversation that I think is vaguely analogous:

Alice: Suppose we had a one-way function, then we could make passwords better by...

Bob: What do you want your system to do?

Alice: Well, I want passwords to be more robust to...

Bob: Don't tell me about the mechanics of the system. Tell me what you want the system to do.

Alice: I want people to be able to authenticate their identity more securely?

Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?

Alice: IDK, I just think the world is likely to be generically a better place if we can better authenticate users.

Bob: Oh OK, we're just going to create this user authentication technology and hope people use it for good?

Alice: Yes? And that seems totally reasonable?

It seems to me like you don't actually have to have a specific story about what you want your AI to do in order for alignment work to be helpful. People in general do not want to die, so probably generic work on being able to more precisely specify what you want out of your AIs, e.g. for them not to be mesa-optimizers, is likely to be helpful.

This is related to complaints I have with [pivotal-act based] framings, but probably that's a longer post.

Comment by Mark Xu (mark-xu) on Dath Ilan vs. Sid Meier's Alpha Centauri: Pareto Improvements · 2022-04-29T11:52:08.207Z · LW · GW

Isn't there an equilibrium where people assume other people's militaries are as strong as they can demonstrate, and people just fully disclose their military strength?

Comment by Mark Xu (mark-xu) on Open Problems with Myopia · 2022-04-21T14:15:02.156Z · LW · GW

Yep, thanks. Fixed.

Comment by Mark Xu (mark-xu) on Emotionally Confronting a Probably-Doomed World: Against Motivation Via Dignity Points · 2022-04-19T12:47:27.130Z · LW · GW

See https://www.nickbostrom.com/aievolution.pdf for a discussion about why such arguments probably don't end up pushing timelines forward that much.

Comment by Mark Xu (mark-xu) on ELK prize results · 2022-03-12T07:17:15.166Z · LW · GW

From my perspective, ELK is currently very much "A problem we don't know how to solve, where we think rapid progress is being made (as we're still building out the example-counterexample graph, and are optimistic that we'll find an example without counterexamples)" There's some question of what "rapid" means, but I think we're on track for what we wrote in the ELK doc: "we're optimistic that within a year we will have made significant progress either towards a solution or towards a clear sense of why the problem is hard."

We've spent ~9 months on the problem so far, so it feels like we've mostly ruled out it being an easy problem that can be solved with a "simple trick", but it very much doesn't feel like we've hit on anything like a core obstruction. I think we still have multiple threads that are still live and that we're still learning things about the problem as we try to pull on those threads.

I'm still pretty interested in aiming for a solution to the entire problem (in the worst case), which I currently think is still plausible (maybe 1/3rd chance?). I don't think we're likely to relax the problem until we find a counterexample that seems like a fundamental reason why the original problem wasn't possible. Another way of saying this is that we're working on ELK because of a set of core intuitions about why it ought to be possible and we'll probably keep working on it until those core intuitions have been shown to be flawed (or we've been chugging away for a long time without any tangible progress).

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-02-14T18:41:23.980Z · LW · GW

The official deadline for submissions is "before I check my email on the 16th", which I tend to do around 10 am PST.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-02-14T18:38:27.504Z · LW · GW

Before I check my email on Feb 16th, which I will do around 10am PST.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-29T02:23:50.754Z · LW · GW

The high-level reason is that the 1e12N model is not that much better at prediction than the 2N model. You can correct for most of the correlation even with only a vague guess at how different the AI and human probabilities are, and most AI and human probabilities aren't going to be that different in a way that produces a correlation the human finds suspicious. I think that the largest correlations are going to be produced by the places the AI and the human have the biggest differences in probabilities, which are likely also going to be the places where the 2N model has the biggest differences in probabilities, so they should be not that hard to correct.

I'm curious whether you think this is the main obstacle. If we had a version of the correlation-consistency approach that always gave the direct translator minimal expected consistency loss, do we as-of-yet lack a counterexample for it?

I think it wouldn't be clear that extending the counterexample would be possible, although I suspect it would be. It might require exhibiting more concrete details about how the consistency check would be defeated, which would be interesting. In some sense, maintaining consistency across many inputs is something that you expect to be pretty hard for the human simulator to do because it doesn't know what set of inputs it's being checked for. I would be excited about a consistency check that gave the direct translator minimal expected consistency loss. Note that I would also be interested in basically any concrete proposal for a consistency check that seemed like it was actually workable.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-28T22:28:50.775Z · LW · GW

I agree that i does slightly worse than t on consistency checks, but i also does better on other regularizers you're (maybe implicitly) using like speed/simplicity, so as long as i doesn't do too much worse it'll still beat out the direct translator.

One possible thing you might try is some sort of lexicographic ordering of regularization losses. I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i is potentially more consistent than t.
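
For concreteness, here's a toy sketch of what a lexicographic ordering over regularization losses could look like (the candidate names and numbers are made up for illustration):

```python
# Compare reporters by consistency loss first, breaking ties by a simplicity penalty;
# Python's tuple comparison is lexicographic, so min() does the right thing.
candidates = [
    {"name": "direct translator", "consistency_loss": 0.010, "simplicity_loss": 5.0},
    {"name": "human simulator",   "consistency_loss": 0.012, "simplicity_loss": 1.0},
]
best = min(candidates, key=lambda r: (r["consistency_loss"], r["simplicity_loss"]))
print(best["name"])  # the direct translator only wins if it really is more consistent
```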

Comment by Mark Xu (mark-xu) on ML Systems Will Have Weird Failure Modes · 2022-01-26T04:49:51.996Z · LW · GW

I think latex renders if you're using the markdown editor, but if you're using the other editor then it only works if you use the equation editor.

Comment by Mark Xu (mark-xu) on Alex Ray's Shortform · 2022-01-26T01:10:30.007Z · LW · GW

I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for.

Like the thing that seems weird to me here is that you can't simultaneously require that the elicited knowledge be 'relevant' and 'comprehensible' and also cover these sorts of obfuscated debate like scenarios.

I don't know what you mean by "relevant" or "comprehensible" here.

Does it seem right to you that ELK is about eliciting latent knowledge that causes an update in the correct direction, regardless of whether that knowledge is actually relevant?

This doesn't seem right to me.

Comment by Mark Xu (mark-xu) on Alex Ray's Shortform · 2022-01-25T20:58:53.446Z · LW · GW

I don't think I understand your distinction between obfuscated and non-obfuscated knowledge. I generally think of non-obfuscated knowledge as NP or PSPACE. The human judgement of a situation might only theoretically require a poly-sized fragment of an exp-sized computation, but there's no poly-sized proof that this poly-sized fragment is the correct fragment, and there are different poly-sized fragments for which the human will evaluate differently, so I think of ELK as trying to elicit obfuscated knowledge.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-25T17:19:11.804Z · LW · GW

We would prefer submissions be private until February 15th.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-25T03:01:21.112Z · LW · GW

We generally assume that we can construct questions sufficiently well that there's only one unambiguous interpretation. We also generally assume that the predictor "knows" which world it's in, because it can predict how humans would respond to hypothetical questions about various situations involving diamonds and sensors, and that humans would say that in theory Q1 and Q2 could be different.

More concretely, our standard for judging proposals is exhibiting an unambiguous failure. If it was plausible you asked the wrong question, or the AI didn't know what you meant by the question, then the failure exhibited would be ambiguous. If humans are unable to clarify between two possible interpretations of their question, then the failure would be ambiguous.

Comment by Mark Xu (mark-xu) on Alex Ray's Shortform · 2022-01-25T02:51:45.622Z · LW · GW

I think we would be trying to elicit obfuscated knowledge in ELK. In our examples, you can imagine that the predictor's Bayes net works "just because", so an argument that is convincing to a human for why the diamond is in the room has to argue that the Bayes net is a good explanation of reality + that it implies the diamond is in the room, which is the sort of "obfuscated" knowledge that debate can't really handle.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-23T23:31:18.866Z · LW · GW

Note that this has changed to February 15th.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-20T02:08:28.753Z · LW · GW

The dataset is generated with the human Bayes net, so it's sufficient to map to the human Bayes net. There is, of course, an infinite set of "human" simulators that use slightly different Bayes nets that give the same answers on the training set.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-20T02:05:42.473Z · LW · GW

Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially the same architecture as the original?

Yes, approximately. If you can do it for only e.g. transformers, but not other things, that would be interesting.

Does this mean that it must be able to deal with a broad variety of questions, so that we cannot simply sit down and think about how to optimize the model for getting a single question (e.g. "Where is the diamond?") right?

Yes, approximately. Thinking about how to get one question right might be a productive way to do research. However, if you have a strategy for answering 1 question right, it should also work for other questions.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-14T20:18:43.241Z · LW · GW

We generally imagine that it's impossible to map the predictor's net directly to an answer because the predictor is thinking in terms of different concepts, so it has to map to the human's nodes first in order to answer human questions about diamonds and such.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-13T17:28:56.209Z · LW · GW

The SmartFabricator seems basically the same. In the robber example, you might imagine the SmartVault is the one that puts up the screen to conceal the fact that it let the diamond get stolen.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-13T05:05:52.530Z · LW · GW

Looks good to me.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-07T19:27:26.184Z · LW · GW

Yes. The section "Strategy: have a human operate the SmartVault and ask them what happened" describes what I think you're asking about.

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-07T04:06:11.697Z · LW · GW

A different way of phrasing Ajeya's response, which I think is roughly accurate, is that if you have a reporter that gives consistent answers to questions, you've learned a fact about the predictor, namely "the predictor was such that when it was paired with this reporter it gave consistent answers to questions." If there were 8 predictors for which this fact was true, then "it's the [7th] predictor such that when it was paired with this reporter it gave consistent answers to questions" is enough information to uniquely determine the predictor, e.g. the previous fact + 3 additional bits was enough. If the predictor was 1000 bits, the fact that it was consistent with a reporter "saved" you 997 bits, compressing the predictor into 3 bits.

The hope is that maybe the honest reporter "depends" on larger parts of the predictor's reasoning, so fewer predictors are consistent with it, so the fact that a predictor is consistent with the honest reporter allows you to compress the predictor more. As such, searching for reporters that most compress the predictor would prefer the honest reporter. However, the best way for a reporter to compress a predictor is to simply memorize the entire thing, so if the predictor is simple enough and the gap between the complexity of the human-imitator and the direct translator is large enough, then the human-imitator + memorized predictor is the simplest thing that maximally compresses the predictor.
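
A toy version of the bit-counting, using the illustrative numbers from above (nothing here is canonical):

```python
import math

predictor_bits = 1000        # description length of the predictor on its own
consistent_predictors = 8    # predictors consistent with the given reporter (assumed)

# Given "this predictor is consistent with the reporter", only an index remains to specify.
index_bits = math.log2(consistent_predictors)   # 3 bits
bits_saved = predictor_bits - index_bits        # 997 bits
print(f"index: {index_bits:.0f} bits, saved: {bits_saved:.0f} bits")
```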

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-07T03:57:27.742Z · LW · GW

[deleted]

Comment by Mark Xu (mark-xu) on Prizes for ELK proposals · 2022-01-07T03:54:27.092Z · LW · GW

There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way as the "unaligned benchmark" we're trying to compare to, and the reporter is the thing that we add onto that to "align" it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something).

In this frame, doing anything that changes the way the predictor is trained results in a big competitiveness hit, e.g. forcing the predictor to use the same ontology as a human is potentially going to prevent it from using concepts that make reasoning much more efficient. However, training the reporter in a different way, e.g. doubling the cost of training the reporter, only takes you from 10% of the predictor's cost to 20%, which is not that bad of a competitiveness hit (assuming that the human imitator takes 10% of the cost of the original predictor to train).

In summary, competitiveness for ELK proposals primarily means that you can't change the way the predictor was trained. We are already assuming/hoping the reporter is much cheaper to train than the predictor, so making the reporter harder to train results in a much smaller competitiveness hit.