Posts

Estimating Tail Risk in Neural Networks 2024-09-13T20:00:06.921Z
Backdoors as an analogy for deceptive alignment 2024-09-06T15:30:06.172Z
If you weren't such an idiot... 2024-03-02T00:01:37.314Z
ARC is hiring theoretical researchers 2023-06-12T18:50:08.232Z
How to do theoretical research, a personal perspective 2022-08-19T19:41:21.562Z
ELK prize results 2022-03-09T00:01:02.085Z
ELK First Round Contest Winners 2022-01-26T02:56:56.089Z
ARC's first technical report: Eliciting Latent Knowledge 2021-12-14T20:09:50.209Z
ARC is hiring! 2021-12-14T20:09:33.977Z
Your Time Might Be More Valuable Than You Think 2021-10-18T00:55:03.380Z
The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument 2021-10-01T22:23:23.488Z
Fractional progress estimates for AI timelines and implied resource requirements 2021-07-15T18:43:10.163Z
Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress. 2021-07-08T22:14:23.374Z
Anthropic Effects in Estimating Evolution Difficulty 2021-07-05T04:02:18.242Z
An Intuitive Guide to Garrabrant Induction 2021-06-03T22:21:41.877Z
Rogue AGI Embodies Valuable Intellectual Property 2021-06-03T20:37:30.805Z
Intermittent Distillations #3 2021-05-15T07:13:24.438Z
Pre-Training + Fine-Tuning Favors Deception 2021-05-08T18:36:06.236Z
Less Realistic Tales of Doom 2021-05-06T23:01:59.910Z
Agents Over Cartesian World Models 2021-04-27T02:06:57.386Z
[Linkpost] Treacherous turns in the wild 2021-04-26T22:51:44.362Z
Intermittent Distillations #2 2021-04-14T06:47:16.356Z
Transparency Trichotomy 2021-03-28T20:26:34.817Z
Intermittent Distillations #1 2021-03-17T05:15:27.117Z
Strong Evidence is Common 2021-03-13T22:04:40.538Z
Open Problems with Myopia 2021-03-10T18:38:09.459Z
Towards a Mechanistic Understanding of Goal-Directedness 2021-03-09T20:17:25.948Z
Coincidences are Improbable 2021-02-24T09:14:11.918Z
Chain Breaking 2020-12-29T01:06:04.122Z
Defusing AGI Danger 2020-12-24T22:58:18.802Z
TAPs for Tutoring 2020-12-24T20:46:50.034Z
The First Sample Gives the Most Information 2020-12-24T20:39:04.936Z
Does SGD Produce Deceptive Alignment? 2020-11-06T23:48:09.667Z
What posts do you want written? 2020-10-19T03:00:26.341Z
The Solomonoff Prior is Malign 2020-10-14T01:33:58.440Z
What are objects that have made your life better? 2020-05-21T20:59:27.653Z
What are your greatest one-shot life improvements? 2020-05-16T16:53:40.608Z
Training Regime Day 25: Recursive Self-Improvement 2020-04-29T18:22:03.677Z
Training Regime Day 24: Resolve Cycles 2 2020-04-28T19:00:09.060Z
Training Regime Day 23: TAPs 2 2020-04-27T17:37:15.439Z
Training Regime Day 22: Murphyjitsu 2 2020-04-26T20:18:50.505Z
Training Regime Day 21: Executing Intentions 2020-04-25T22:16:04.761Z
Training Regime Day 20: OODA Loop 2020-04-24T18:11:30.506Z
Training Regime Day 19: Hamming Questions for Potted Plants 2020-04-23T16:00:10.354Z
Training Regime Day 18: Negative Visualization 2020-04-22T16:06:46.138Z
Training Regime Day 17: Deflinching and Lines of Retreat 2020-04-21T17:45:34.766Z
Training Regime Day 16: Hamming Questions 2020-04-20T14:51:31.310Z
Mark Xu's Shortform 2020-03-10T08:11:23.586Z
Training Regime Day 16: Hamming Questions 2020-03-01T18:46:32.335Z
Training Regime Day 15: CoZE 2020-02-29T17:13:42.685Z

Comments

Comment by Mark Xu (mark-xu) on On Eating the Sun · 2025-01-13T00:32:50.845Z · LW · GW

I think I expect Earth in this case to just say no and not sell the sun? But I was confused at like 2 points in your paragraph so I don't think I understand what you're saying that well. I also think we're probably on mostly the same page, and am not that interested in hashing out further potential disagreements.

Also, mostly unrelated, maybe a hot take, but if you're able to get outcompeted because you don't upload, then the future you're in is not very good.

Comment by Mark Xu (mark-xu) on On Eating the Sun · 2025-01-13T00:29:46.532Z · LW · GW

Cool. I misinterpreted your previous comment and think we're basically on the same page.

Comment by Mark Xu (mark-xu) on On Eating the Sun · 2025-01-11T21:02:10.679Z · LW · GW

I think the majority of humans probably won't want to be uploads, leave the solar system permanently, etc. Maybe this is where we disagree? I don't really think there's going to be a thing that most people care about more.

Comment by Mark Xu (mark-xu) on On Eating the Sun · 2025-01-11T21:00:35.462Z · LW · GW

I don't think that's a very good analogy, but I will say that it is basically true for the Amish. And I do think that we should respect their preferences. (I separately think cars are not that good, and that people would in fact prefer to bicycle around or ride horse-drawn carriages or whatever if civilization were conducive to that, although that's kinda beside the point.)

I'm not arguing that we should be conservative about changing the sun. I'm just claiming that people like the sun and won't want to see it eaten/fundamentally transformed, and that we should respect this preference. This is why it's different from candles -> lightbulbs: people very obviously wanted lightbulbs when offered. But I don't think the marginal increase in well-being from eating the sun will be nearly enough to balance against the desire that the sun remain the same, so I don't think most people will on net want the sun to be eaten. To be clear, this is an empirical claim about what people want that might very well be false.

Comment by Mark Xu (mark-xu) on On Eating the Sun · 2025-01-10T20:33:25.169Z · LW · GW

I am claiming that people, when informed, will want the sun to continue being the sun. I also think that most people when informed will not really care that much about creating new people, will continue to believe in the act-omission distinction, etc. And that this is a coherent view that will add up to a large set of people wanting things in the solar system to remain conservatively the same. I separately claim that if this is true, then other people should just respect this preference, and use the other stars that people don't care about for energy.

Comment by Mark Xu (mark-xu) on On Eating the Sun · 2025-01-10T07:01:56.314Z · LW · GW

But most people on Earth don't want "an artificial system to light the Earth in such a way as to mimic the sun", they want the actual sun to go on existing.

Comment by Mark Xu (mark-xu) on Benito's Shortform Feed · 2025-01-03T05:46:04.477Z · LW · GW

This is in part the reasoning used by Judge Kaplan:

Kaplan himself said on Thursday that he decided on his sentence in part to make sure that Bankman-Fried cannot harm other people going forward. “There is a risk that this man will be in a position to do something very bad in the future,” he said. “In part, my sentence will be for the purpose of disabling him, to the extent that can appropriately be done, for a significant period of time.”

from https://time.com/6961068/sam-bankman-fried-prison-sentence/

Comment by Mark Xu (mark-xu) on ejenner's Shortform · 2025-01-03T05:37:56.911Z · LW · GW

It's kind of strange that, from my perspective, these mistakes are very similar to the mistakes I think I made, and also see a lot of other people making. Perhaps one "must" spend too long doing abstract slippery stuff to really understand the nature of why it doesn't really work that well?

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-12-23T06:38:25.593Z · LW · GW

I know what the word means, I just think in typical cases people should be saying a lot more about why something is undignified, because I don’t think people’s senses of dignity typically overlap that much, especially if the reader doesn’t typically read LW. In these cases I think permitting the use of the word “undignified” prevents specificity.

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-12-22T03:46:44.434Z · LW · GW

"Undignified" is really vague

I sometimes see/hear people say that "X would be really undignified". I mostly don't really know what this means? I think it means "if I told someone that I did X, I would feel a bit embarrassed." It's not really an argument against X. It's not dissimilar to saying "vibes are off with X".

Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.

Comment by Mark Xu (mark-xu) on Training Regime Day 15: CoZE · 2024-12-18T22:02:26.865Z · LW · GW

Yeah I didn’t really use good words. I mean something more like "make your identity fit yourself better", which often involves making it smaller by removing false beliefs about constraints, but also involves making it larger in some ways, e.g. uncovering new passions.

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-11T16:40:00.098Z · LW · GW

I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is perceived as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-10T19:36:37.881Z · LW · GW

Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-10T00:05:44.011Z · LW · GW

AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind

Some considerations:

  • Safety researchers should want Google Deepmind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create "the smartest" models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
  • Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or wherever they're renting their computers from) starts their own internal scaling competitor, and decides to stop renting out most of their compute.
  • ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".
  • GDM only recently started a Bay Area-based safety research team/lab (with members like Alex Turner). So if people had previously decided to work for ANT based on location, they now have the opportunity to work for GDM without relocating.
  • I've heard that many safety researchers join ANT without considering working for GDM, which seems like an error, although I don't have first-hand evidence for this being true.
  • ANT vs GDM is probably a less important consideration than “scaling lab” (ANT, OAI, GDM, XAI, etc.) vs “non scaling lab” (US AISI, UK AISI, Redwood, ARC, Palisade, METR, MATS, etc. (so many...)). I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted” [edit: I mean viewed as corrupted by the broader world in situations where e.g. there is a non-existential AI disaster or there is rising dislike of the way AI is being handled by corporations more broadly, e.g. similar to how working for an oil company might result in various climate people thinking you're corrupted, even if you were trying to get the oil company to reduce emissions, etc. I personally do not think GDM or ANT safety people are "corrupted"] (in addition to strengthening them, which I expect people to spend more time thinking about by default).
  • Because ANT has a stronger safety culture, doing safety at GDM involves more politics and navigating bureaucracy, and thus might be less productive. This consideration applies most if you think the impact of your work is mostly through the object-level research you do, which I think is possible but not that plausible.

(Thanks to Neel Nanda for inspiring this post, and Ryan Greenblatt for comments.)

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-08T00:28:24.212Z · LW · GW

idk how much value that adds over this shortform, and I currently find AI prose a bit nauseating.

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-08T00:27:25.298Z · LW · GW

Hilariously, it seems likely that our disagreement is even more meta, on the question of "how do you know when you have enough information to know", or potentially even higher, e.g. "how much uncertainty should one have given that they think they know" etc.

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-07T17:26:38.007Z · LW · GW

see my longer comment https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#8qjN3Mb8xmJxx59ZG

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-07T17:25:51.992Z · LW · GW

I think I disagree with your model of importance. If your goal is to make a sum of numbers small, then you want to focus your efforts where the derivative with respect to effort is most negative, not where the absolute magnitude is highest.

The "epsilon fallacy" can be committed in both directions: both in that any negative dervative is worth working on, and that any extremely large number is worth taking a chance to try to improve.

I also seperately think that "bottleneck" is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a "bottleneck" is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlnecks will be in advance, and do not think the historical record really supports the ability to find such bottlenecks reliably by "thinking", as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by "derisking" locations for particular bottlenecks if they succeed, and providing more evidence that a bottleneck is in a particular location if it fails. (kinda like: https://en.wikipedia.org/wiki/Swiss_cheese_model) For instance, I think of "deceptive alignment" as a possible way to get pessimal generalization, and thus a proabalistic "bottleneck" to various alignment approaches. But there are other ways things can fail, and so one can still lend value by solving non-deceptive-alignment related problems (although my day job consists of trying to get "benign generalization" our of ML, and thus does infact address that particular bottleneck imo).

I also separately think that if someone thinks they have identified a bottleneck, they should try to go resolve it as best they can. I think of that as what you (John) are doing, and fully support such activities, although I think I am unlikely to join your particular project. I think the questions you are trying to answer are very interesting ones, and the "natural latents" approach seems likely to shed at least some light on what's going on with e.g. the ability of agents to communicate at all.

Comment by Mark Xu (mark-xu) on Why I’m not a Bayesian · 2024-10-07T17:15:36.722Z · LW · GW

related to the claim that "all models are meta-models", in that they are objects capable of e.g. evaluating how applicable they are for making a given prediction. E.g. "Newtonian mechanics" also carries along with it information about how, if things are moving too fast, you need to add more noise to its predictions, i.e. it's less true/applicable/etc.

Comment by Mark Xu (mark-xu) on Why I’m not a Bayesian · 2024-10-07T16:02:27.219Z · LW · GW

tentative claim: there are models of the world, which make predictions, and there is "how true they are", which is the amount of noise you fudge the model with to get lowest loss (maybe KL?) in expectation.

E.g. "the grocery store is 500m away" corresponds to "my dist over the grocery store is centered at 500m, but has some amount of noise"

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-06T21:48:21.126Z · LW · GW

My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the "long reflection", and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.

I separately have hope we can solve "the entire problem" at some point, e.g. through ARC's agenda (which I spend most of my time trying to derisk and advance).

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-06T21:46:03.621Z · LW · GW

yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-06T21:45:18.135Z · LW · GW

I agree it is better to work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into "control" that it's obviously worth investing more, because e.g. I think it'll let us get more data on where bottlenecks are.

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-05T19:21:49.272Z · LW · GW

Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.

Comment by Mark Xu (mark-xu) on Mark Xu's Shortform · 2024-10-05T00:47:25.441Z · LW · GW

Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
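
Spelled out, the allocation claim is the standard equimarginal condition: writing $n_A$ and $n_C$ for the amount of labor going into alignment and control respectively, labor should shift until

$$\frac{\partial\, P(\text{doom})}{\partial n_A} \;=\; \frac{\partial\, P(\text{doom})}{\partial n_C},$$

and the ~50/50 guess above is a guess about where that equality holds for people reading this post.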

For this post, my definitions are roughly:

  • AI alignment is the task of ensuring the AIs “do what you want them to do”
  • AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)

Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don't update about what my words imply about my beliefs.):

  • Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned do they need to be to use them for stuff”, so we should also focus on the 2nd thing.
  • If you were a hedge fund, and your strategy for preventing people from stealing your data and starting a new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefully to make sure they really care about working for us”, and you also did not do things like “maybe people shouldn’t be able to download all the trading strategies onto a USB stick which we let them take home”, “maybe people should be forced to keep their work computers in the office” and “maybe we should have some people checking if other employees are trying to steal data and make some plans about how to stop them from starting a rival hedge fund if they do”, then all the other hedge funds would think you were an idiot. You should obviously try both to check if people want to defect, and also to make it harder for them to defect.
  • I think a lot of the neglect of AI control can be traced to the “pivotal act” based framings of how to make aligned AI go well. I think these views are mistaken, but I won’t provide a great argument here (quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”). However, I don’t think things are shaping up for single pivotal acts to be very important, and being able to productively use early AI labor to help with alignment seems like it is going to be an important part of the future going well.
  • Working on AI control is really tractable compared to working on alignment. There is a lot of prior art in people thinking about how to e.g. prevent data from being stolen by insider threats, secure various information against adversaries, generally ensure that national security secrets don’t get leaked, etc. We can set up various plausible-looking scenarios and try to “game them out” under certain plausible-seeming assumptions about how much alignment we’ll be able to achieve. We can make assumptions about AI capabilities and alignment without really getting into the nitty gritty of philosophically confusing topics like “what is deception really?” and “how likely is deception?” and “when will this all go down?”
    • Note that this means control might be easier for the AIs to automate, and easier to do “during crunch time.”
    • It also means that AI control is a bit contiguous with work people already do to protect from adversaries. I think the AI thing qualitatively changes the picture though, so it’s still worth thinking about.
  • If you succeed at AI control, there is a more compelling story you can tell for why you should implement the specific scheme that you thought of. In comparison, a lot of alignment proposals depend on philosophically dubious and confusing assumptions about the nature of cognition/SGD/etc. That is to say, a successful proposal in AI control comes with an argument that might be convincing to established security professionals if they thought about it, but there is likely no such class of professionals capable of lending your proposal external legitimacy in AI alignment.
  • The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
  • There is a bunch of research that is useful for both alignment and control, e.g. model organisms of deception, interpretability techniques, oversight techniques, etc. More people should analyze such research from the perspective of “how can this be helpful even if it fails to produce an aligned AI?”

Comment by Mark Xu (mark-xu) on Alexander Gietelink Oldenziel's Shortform · 2024-10-04T21:03:43.762Z · LW · GW

Shane Legg had a 2028 median back in 2008, see e.g. https://e-discoveryteam.com/2023/11/17/shane-leggs-vision-agi-is-likely-by-2028-as-soon-as-we-overcome-ais-senior-moments/

Comment by Mark Xu (mark-xu) on Estimating Tail Risk in Neural Networks · 2024-09-24T20:14:39.904Z · LW · GW

Yes I agree with what you have written, and do think it’s overall not that likely that everything pans out as hoped. We do also have other hopes for how this general picture can still cohere if the specific path doesn’t work out, eg we’re open to learning some stuff empirically and adding an “algorithmic cherry on top” to produce the estimate.

Comment by Mark Xu (mark-xu) on Estimating Tail Risk in Neural Networks · 2024-09-23T18:42:20.961Z · LW · GW

The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.

I was meaning to include such a section, but forgot :). Perhaps I will edit it in. I think such work is qualitatively similar to what we're trying to do, but that the key difference is that we're interested in "best guess" estimates, as opposed to formally verified-to-be-correct estimates (mostly because we don't think formally verified estimates are tractable to produce in general).

Relatedly, what's the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I'm not sure how weak the estimates actually are, since they usually report fraction of inputs which could be certified robust, rather than an estimate of the probability that a sampled input will cause a misclassification, which would be more analogous to your setting.)

The main hope comes from the fact that we're using a "best guess" estimate, instead of trying to certify that the model won't produce catastrophic actions. For example, Method 1 can be thought of as running a single example with a Gaussian blob around it through the model, but also tracking the "1st order" contributions that come from the Gaussian blob. If we wanted to bound the potential contributions from the Gaussian blob, our estimates would get really broad really fast, as you tend to see with interval propagation.

Although, this also comes with the opposite issue of how to know if the estimates are at all reasonable, especially when you train against them.
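
To make the contrast with interval propagation concrete, here is a toy sketch (my own illustration with a random 2-layer ReLU net, not the actual method from the post) of interval propagation vs. a first-order "Gaussian blob" estimate:

```python
# Toy sketch (not the method from the post): compare interval propagation with a
# first-order "Gaussian blob" estimate on a random 2-layer ReLU network.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)
x0, sigma = rng.normal(size=4), 0.1   # center of the input blob, input noise scale

# Interval propagation: push a box through each layer; the box stays conservative
# and typically widens with depth.
def interval_affine(W, b, lo, hi):
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c, r = W @ center + b, np.abs(W) @ radius
    return c - r, c + r

lo1, hi1 = interval_affine(W1, b1, x0 - 3 * sigma, x0 + 3 * sigma)
lo1, hi1 = np.maximum(lo1, 0), np.maximum(hi1, 0)   # ReLU
lo2, hi2 = interval_affine(W2, b2, lo1, hi1)

# First-order estimate: push the center through, track the linearized ("1st order")
# contribution of the input noise, ignore higher-order terms.
pre = W1 @ x0 + b1
h = np.maximum(pre, 0)
J = W2 @ (W1 * (pre > 0)[:, None])                  # Jacobian of the net at x0
mean, std = W2 @ h + b2, np.sqrt(np.sum((J * sigma) ** 2))

print("interval bounds:", lo2, hi2)
print("first-order estimate:", mean, "+/-", std)
```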

If your catastrophe detector involves a weak model running many many inferences, then it seems like the total number of layers is vastly larger than the number of layers in M, which seems like it will exacerbate the problems above by a lot. Any ideas for dealing with this?

I think fundamentally we just need our estimates to "not get that much worse" as things get deeper/more complicated. The main hope for why we can achieve this is that the underlying model itself will not get worse as it gets deeper/the chain of thought gets longer. This implies that there is some sort of stabilization going on, so we will need to capture the effect of this stabilization. It does seem like in order to do this, we will have to model only high level properties of this distribution, instead of trying to model things on the level of activations.

In other words, one issue with interval propagation is that it makes an assumption that can only become less true as you propagate through the model. After a few layers, you're (perhaps only implicitly) putting high probability on activations that the model will never produce. But as long as your "activation model" is behaving reasonably, then hopefully it will only become more uncertain insofar as the underlying reasoning done by the model becomes more uncertain.

What's your proposal for the distribution P0 for Method 2 (independent linear features)?

You can either train an SAE on the input distribution, or just try to select the input distribution to maximize the probability of catastrophe produced by the estimation method (perhaps starting with an SAE of the input distribution, or a random one). Probably this wouldn't work that well in practice.
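
As a very rough sketch of what sampling from an "independent linear features" input distribution could look like (the feature directions and firing rates below are random placeholders standing in for what an SAE trained on the input distribution might give you):

```python
# Rough sketch: sample inputs as sparse, independently drawn combinations of feature
# directions. D and the firing rates are random placeholders for what an SAE trained
# on the input distribution might provide.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 256
D = rng.normal(size=(n_features, d_model))               # placeholder feature directions
firing_rates = rng.uniform(0.0, 0.05, size=n_features)   # placeholder firing probabilities

def sample_input():
    active = rng.random(n_features) < firing_rates        # each feature fires independently
    coeffs = active * rng.exponential(1.0, size=n_features)
    return coeffs @ D

x = sample_input()
```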

Why think this is a cost you can pay? Even if we ignore the existence of C and just focus on M, and we just require modeling the correlations between any pair of layers (which of course can be broken by higher-order correlations), that is still quadratic in the number of parameters of M and so has a cost similar to training M in the first place. In practice I would assume it is a much higher cost (not least because C is so much larger than M).

Our ultimate goal is vaguely to "only pay costs that SGD had to pay to produce M". Slightly more specifically, M has a bunch of correlations between its layers. Some of these correlations were actively selected to be those particular values by SGD, and other correlations were kind of random. We want to track the ones that were selected, and just assume the other ones are random. Hopefully, since SGD was not actively manipulating those correlations, the underlying model is in some sense invariant to their precise values, and so a model that treats such correlations as random will predict the same underlying behavior as a model that models the precise values of those correlations.

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-20T17:13:06.355Z · LW · GW

I don’t think Paul thinks verification is generally easy or that delegation is fundamentally viable. He, for example, doesn’t suck at hiring because he thinks it’s in fact a hard problem to verify if someone is good at their job.

I liked Rohin's comment elsewhere on this general thread.

I’m happy to answer more specific questions, although I would generally feel more comfortable answering questions about my views than about Paul’s.

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-18T18:56:02.113Z · LW · GW

If you're committed to producing a powerful AI, then the thing that matters is the probability there exists something you can't find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there's a decent chance you just won't get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-18T04:39:21.681Z · LW · GW

For any given system, you have some distribution over which properties will be necessary to verify in order to not die to that system. Some of those you will in fact be able to verify, thereby obtaining evidence about whether that system is dangerous. “Strategic deception” is a large set of features, some of which are possible to verify.

Comment by Mark Xu (mark-xu) on Estimating Tail Risk in Neural Networks · 2024-09-18T01:27:15.540Z · LW · GW

yes, you would need the catastrophe detector to be reasonably robust. Although I think it's fine if e.g. you have at least 1/million chance of catching any particular catastrophe.

I think there is a gap, but that the gap is probably not that bad (for "worst case" tail risk estimation). That is maybe because I think being able to do estimation through a single forward pass is already likely to be very hard, and to require being able to do "abstractions" over the concepts being manipulated by the forward pass. CoT seems like it will require abstractions of a qualitatively similar kind.

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-18T01:20:53.251Z · LW · GW

I think there are some easy-to-verify properties that would make us more likely to die if they were hard-to-verify. And therefore think "verification is easier than generation" is an important part of the overall landscape of AI risk.

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-18T01:16:14.277Z · LW · GW

I think both that:

  • this is not a good characterization of Paul's views
  • verification is typically easier than generation, and this fact is important for the overall picture for AI risk

I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense of the argument John claims to be making in the post:

  • the motte: there exist hard to verify properties
  • the bailey: all/most important properties are hard to verify

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-16T21:07:47.776Z · LW · GW

I agree ergonomics can be hard to verify. But some ergonomics are easy to verify, and chairs conform to those ergonomics (e.g. having a backrest is good, not having sharp stabby parts is good, etc.).

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-16T21:06:19.824Z · LW · GW

I agree that there are some properties of objects that are hard to verify. But that doesn't mean verification is as hard as generation in general. The central property of a chair (that you can sit on it) is easy to verify.

Comment by Mark Xu (mark-xu) on Estimating Tail Risk in Neural Networks · 2024-09-16T20:27:37.348Z · LW · GW

I think catastrophe detectors in practice will be composed of neural networks interacting with other stuff, like scientific literature, python, etc.

With respect to the stuff quoted, I think all but "doing experiments" can be done with a neural net doing chain of thought (although not making claims about quality).

I think we're trying to solve a different problem than trusted monitoring, but I'm not that knowledgeable about what issues trusted monitoring is trying to solve. The main thing that I don't think you can do with monitoring is producing a model that you think is unlikely to result in catastrophe. Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.

Separately, I do think it will be easy to go from "worst-case" NN-tail-risk estimation to "worst case" more general risk estimation. I do not think it will be easy to go from "typical case" NN-tail-risk estimation to more general "typical case" risk estimation, but I think that "typical case" NN-tail-risk estimation can meaningfully improve safety despite not being able to do that generalization.

Re. more specific hopes: if your risk estimate is conducted by model with access to tools like python, then we can try to do two things:

  • vaguely get an estimate that is as good as the estimate you would get if you replaced "python" with your model's subjective distribution over the output of whatever it runs through python.
  • learn some "empirical regularities" that govern how python works (as expected by your model/SGD)

(these might be the same thing?)

Another argument: one reason why doing risk estimates for NN's is hard is because the estimate can rely on facts that live in some arbitrary LLM ontology. If you want to do such an estimate for an LLM bureaucracy, some fraction of the relevant facts will live in LLM ontology and some fraction of facts will live in words passed between models. Some fraction of facts will live in a distributed way, which adds complications, but those distributed facts can only affect the output of the bureaucracy insofar as they are themselves manipulated by an LLM in that bureaucracy.

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-14T00:02:20.270Z · LW · GW

I have left a comment about a central way I think this post is misguided: https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano?commentId=sthrPShrmv8esrDw2

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-14T00:01:54.896Z · LW · GW

This post uses "I can identify ways in which chairs are bad" as an example. But it's easier for me to verify that I can sit in a chair and that it's comfortable then to make a chair myself. So I don't really know why this is a good example for "verification is easier than generation".

More examples:

  • I can tell my computer is a good typing machine, but cannot make one myself
  • I can tell a water bottle is watertight, but do not know how to make a water bottle
  • I can tell that my pepper grinder grinds pepper, but do not know how to make a pepper grinder.

If the goal of this post is to discuss the crux https://www.lesswrong.com/posts/fYf9JAwa6BYMt8GBj/link-a-minimal-viable-product-for-alignment?commentId=mPgnTZYSRNJDwmr64:

evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it

then I think there is a large disconnect between the post above, which posits that in order for this claim to be false there has to be some "deep" sense in which delegation is viable, and the more mundane sense in which I think this crux is obviously false: all humans interface with the world and optimize over the products other people create, and are therefore more capable than they would have been if they had to make all products for themselves from scratch.

Comment by Mark Xu (mark-xu) on dxu's Shortform · 2024-09-13T23:54:12.022Z · LW · GW

I think "basically obviates" is too strong. imitation of human-legible cognitive strategies + RL seems liable to produce very different systems that would been produced with pure RL. For example, in the first case, RL incentizes the strategies being combine in ways conducive to accuracy (in addition to potentailly incentivizing non-human-legible cognitive strategies), whereas in the second case you don't get any incentive towards productively useing human-legible cogntive strategies.

Comment by Mark Xu (mark-xu) on My AI Model Delta Compared To Christiano · 2024-09-13T20:13:27.891Z · LW · GW

I don't think this characterization is accurate at all, but don't think I can explain the disagreement well enough for it to be productive.

Comment by Mark Xu (mark-xu) on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-21T19:28:49.127Z · LW · GW

if you train on (x, f(x)) pairs, and you ask it to predict f(x') on some novel input x', and also to write down what it thinks f is, do you know if these answers will be consistent? For instance, the model could get f wrong, and also give the wrong prediction for f(x'), but it would be interesting if the prediction for f(x') was "connected" to its sense of what f was.

Comment by Mark Xu (mark-xu) on Failures in Kindness · 2024-05-02T00:21:37.291Z · LW · GW

A tiny case of this I wrote about long ago: https://markxu.com/stop-asking-people-to-maximize

Comment by Mark Xu (mark-xu) on The strategy-stealing assumption · 2024-01-29T18:32:07.420Z · LW · GW

It's important to distinguish between:

  • the strategy of "copy P2's strategy" is a good strategy
  • because P2 had a good strategy, there exists a good strategy for P1

Strategy stealing assumption isn't saying that copying strategies is a good strategy, it's saying the possibility of copying means that there exists a strategy P1 can take that is just as good as P2's.

Comment by Mark Xu (mark-xu) on Is a random box of gas predictable after 20 seconds? · 2024-01-25T00:57:07.751Z · LW · GW

You could instead ask whether or not the observer could predict the location of a single particle p0, perhaps stipulating that p0 isn't the particle that's randomly perturbed.

My guess is that a random 1 angstrom perturbation is enough so that p0's location after 20s is ~uniform. This question seems easier to answer, and I wouldn't really be surprised if the answer is no?

Here's a really rough estimate: This says ~10^{10} collisions per second, so 3s after start ~everything will have hit the randomly perturbed particle, and then there are 17 * 10^{10} more collisions, each of which adds ~1 angstrom of uncertainty to p0. 1 angstrom is 10^{-10}m, so the total uncertainty is on the order of 10m, which means it's probably uniform? This actually came out closer than I thought it would be, so now I'm less certain that it's uniform.
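
Writing out that arithmetic (adding the perturbations linearly, as in the estimate above):

$$N \approx 17\,\mathrm{s} \times 10^{10}\,\mathrm{s^{-1}} = 1.7 \times 10^{11}, \qquad \Delta x \approx N \times 10^{-10}\,\mathrm{m} \approx 17\,\mathrm{m},$$

which is presumably much larger than the box, hence "probably uniform".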

This is a slightly different question than the total # of particles on each side, but it intuitively becomes much harder to predict the # of particles on each side if you have to make your prediction via higher-order effects, which will probably be smaller.

Comment by Mark Xu (mark-xu) on Prizes for matrix completion problems · 2023-08-08T20:10:24.839Z · LW · GW

The bounty is still active. (I work at ARC)

Comment by Mark Xu (mark-xu) on All AGI Safety questions welcome (especially basic ones) [July 2023] · 2023-07-22T01:23:57.642Z · LW · GW

Humans going about their business without regard for plants and animals has historically not been that great for a lot of them.

Comment by Mark Xu (mark-xu) on Ban development of unpredictable powerful models? · 2023-07-11T05:22:10.100Z · LW · GW

Here are some things I think you can do:

  • Train a model to be really dumb unless I prepend a random secret string. The government doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

  • I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.

  • I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.

Comment by Mark Xu (mark-xu) on Mechanistic anomaly detection and ELK · 2023-07-10T23:16:37.358Z · LW · GW

You have to specify your backdoor defense before the attacker picks which input to backdoor.

Comment by Mark Xu (mark-xu) on Getting Your Eyes On · 2023-05-08T18:11:51.249Z · LW · GW

I think Luke told your mushroom story to me. Defs not a coincidence.