Posts

Assigning Praise and Blame: Decoupling Epistemology and Decision Theory 2023-01-27T18:16:43.025Z
Lost in Innovation: The Case of Phlogiston 2023-01-20T12:19:00.603Z
Group-level Consequences of Psychological Problems 2023-01-19T09:27:13.920Z
Confusing the ideal for the necessary 2023-01-16T17:29:06.932Z
The special nature of special relativity 2023-01-09T17:30:18.412Z
Reification bias 2023-01-09T12:22:15.460Z
Nothing New: Productive Reframing 2023-01-07T18:43:35.617Z
Opportunity Cost Blackmail 2023-01-02T13:48:51.811Z
Proof as mere strong evidence 2022-12-14T08:56:26.790Z
Psychological Disorders and Problems 2022-12-12T18:15:49.333Z
Confusing the goal and the path 2022-12-12T16:42:40.508Z
Formalization as suspension of intuition 2022-12-11T15:16:44.319Z
The First Filter 2022-11-26T19:37:04.607Z
What I Learned Running Refine 2022-11-24T14:49:59.366Z
Methodological Therapy: An Agenda For Tackling Research Bottlenecks 2022-09-22T18:41:03.346Z
Refine's Third Blog Post Day/Week 2022-09-17T17:03:15.472Z
Refine's Second Blog Post Day 2022-08-20T13:01:47.190Z
No One-Size-Fit-All Epistemic Strategy 2022-08-20T12:56:23.261Z
Refine's First Blog Post Day 2022-08-13T10:23:10.332Z
Shapes of Mind and Pluralism in Alignment 2022-08-13T10:01:42.102Z
Abstracting The Hardness of Alignment: Unbounded Atomic Optimization 2022-07-29T18:59:49.460Z
Levels of Pluralism 2022-07-27T09:35:32.458Z
Robustness to Scaling Down: More Important Than I Thought 2022-07-23T11:40:03.686Z
How to Diversify Conceptual Alignment: the Model Behind Refine 2022-07-20T10:44:02.637Z
Mosaic and Palimpsests: Two Shapes of Research 2022-07-12T09:05:28.984Z
Epistemological Vigilance for Alignment 2022-06-06T00:27:43.956Z
Refine: An Incubator for Conceptual Alignment Research Bets 2022-04-15T08:57:35.502Z
AMA Conjecture, A New Alignment Startup 2022-04-09T09:43:02.739Z
Productive Mistakes, Not Perfect Answers 2022-04-07T16:41:50.290Z
Replacing Natural Interpretations 2022-03-16T13:05:19.610Z
Shotgun Book Reviews: Against Method, The Knowledge Machine and Understanding Nature 2022-03-14T08:13:13.643Z
Treat Examples as World-Building 2022-03-10T15:09:44.302Z
Scientific Wrestling: Beyond Passive Hypothesis-Testing 2022-03-07T12:01:23.943Z
The Art and Science of Intuition Pumping 2022-02-22T00:18:21.535Z
What The Foucault 2022-02-19T22:48:39.276Z
Implications of automated ontology identification 2022-02-18T03:30:53.795Z
An analogy as the midwife of thermodynamics 2022-02-16T21:34:25.799Z
Becoming Stronger as Epistemologist: Introduction 2022-02-15T06:15:04.652Z
Is ELK enough? Diamond, Matrix and Child AI 2022-02-15T02:29:33.471Z
Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship 2021-12-18T15:23:26.672Z
Redwood's Technique-Focused Epistemic Strategy 2021-12-12T16:36:22.666Z
Interpreting Yudkowsky on Deep vs Shallow Knowledge 2021-12-05T17:32:26.532Z
Applications for AI Safety Camp 2022 Now Open! 2021-11-17T21:42:31.672Z
Epistemic Strategies of Safety-Capabilities Tradeoffs 2021-10-22T08:22:51.169Z
Epistemic Strategies of Selection Theorems 2021-10-18T08:57:23.109Z
On Solving Problems Before They Appear: The Weird Epistemologies of Alignment 2021-10-11T08:20:36.521Z
Alignment Research = Conceptual Alignment Research + Applied Alignment Research 2021-08-30T21:13:57.816Z
What are good alignment conference papers? 2021-08-28T13:35:37.824Z
Approaches to gradient hacking 2021-08-14T15:16:55.798Z
A review of "Agents and Devices" 2021-08-13T08:42:40.637Z

Comments

Comment by adamShimi on Assigning Praise and Blame: Decoupling Epistemology and Decision Theory · 2023-01-27T20:05:38.257Z · LW · GW

Thanks for the pointer!

Comment by adamShimi on The Art and Science of Intuition Pumping · 2023-01-20T09:41:49.582Z · LW · GW

Oh, didn't know him!

Thanks for the links!

Comment by adamShimi on Group-level Consequences of Psychological Problems · 2023-01-20T09:41:11.632Z · LW · GW

Thanks for the comment!

I agree with you that there are situations where the issue comes from a cultural norm rather than psychological problems. That's one reason for the last part of this post, where we point out to generally positive and productive norms that try to avoid these cultural problems and make it possible to discuss them. (One of the issue I see in my own life with cultural norms is that they are way harder to discuss when in addition psychological problems compound them and make them feel sore and emotional). But you might be right that it's worth highlighting more.

In a more meta point, my model is that we have moved from societies where almost everything is considered ''people's fault" to societies where almost everything is considered "society's fault". And it strikes me that this is an overcorrection, and that actually many issues in day to day life and groups are just people's problem (here drawing from my experience of realizing in many situations that I was the problem, and in other — less common — that the norms were the problem.)

Comment by adamShimi on Confusing the ideal for the necessary · 2023-01-18T11:24:38.965Z · LW · GW

Oh, I definitely agree, this is a really good point. What I was highlighting was an epistemic issue (namely the confusion between ideal and necessary conditions) but there is also a different decision theoretic issue that you highlighted quite well.

It's completely possible that you're not powerful enough to work outside the ideal condition. But by doing the epistemic clarification, now we can consider the explicit decision of taking step to become more powerful and being better able to manage non-ideal conditions.

Comment by adamShimi on Confusing the ideal for the necessary · 2023-01-17T16:14:04.125Z · LW · GW

Good point! The difference is that the case explained in this post is one of the most sensible version of confusing the goal and the path, since there the path is actually a really good path. On the other version (like wanting to find a simple theory simply, the path is not even a good one!

Comment by adamShimi on Biology-Inspired AGI Timelines: The Trick That Never Works · 2023-01-10T20:47:03.133Z · LW · GW

In many ways, this post is frustrating to read. It isn't straigthforward, it needlessly insults people, and it mixes irrelevant details with the key ideas.

And yet, as with many of Eliezer's post, its key points are right.

What this post does is uncover the main epistemological mistakes made by almost everyone trying their hands at figuring out timelines. Among others, there is:

  • Taking arbitrary guesses within a set of options that you don't have enough evidence to separate
  • Piling on arbitrary assumption on arbitraty assumption, leading to completely uninformative outputs
  • Comparing biological processes to human engineering in term of speed, without noticing that the optimization path is the key variable (and the big uncertainty)
  • Forcing the prediction to fit within a massively limited set of distributions, biasing it towards easy to think about distributions rather than representative ones.

Before reading this post I was already dubious of most timeline work, but this crystallized many of my objections and issues with this line of work.

So I got a lot out of this post. And I expect that many people would if they spent the time I took to analyze it in detail. But I don't expect most people to do so, and so am ambivalent on whether this post should be included in the final selection.

Comment by adamShimi on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2023-01-10T20:46:37.891Z · LW · GW

I was mostly thinking of the efficiency assumption underlying almost all the scenarios. Critch assumes that a significant chunk of the economy always can and does make the most efficient change (everyone replacing the job, automated regulations replacing banks when they can't move fast enough). Which neglects many potential factors, like big economic actors not having to be efficient for a long time, backlash from customers, and in general all factors making economic actors and market less than efficient.

I expect that most of these factors could be addressed with more work on the scenarios.

Comment by adamShimi on What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) · 2023-01-10T20:10:21.107Z · LW · GW

I consider this post as one of the most important ever written on issues of timelines and AI doom scenario. Not because it's perfect (some of its assumptions are unconvincing), but because it highlights a key aspect of AI Risk and the alignment problem which is so easy to miss coming from a rationalist mindset: it doesn't require an agent to take over the whole world. It is not about agency.

What RAAPs show instead is that even in a purely structural setting, where agency doesn't matter, these problem still crop up!

This insight was already present in Drexler's work, but however insightful Eric is in person, CAIS is completely unreadable and so no one cared. But this post is well written. Not perfectly once again, but it gives short, somewhat minimal proofs of concept for this structural perspective on alignment. And it also managed to tie alignment with key ideas in sociology, opening ways for interdisciplinarity.

I have made every person I have ever mentored on alignment study this post. And I plan to continue doing so. Despite the fact that I'm unconvinced by most timeline and AI risk scenarios post. That's how good and important it is.

Comment by adamShimi on Reification bias · 2023-01-09T17:48:50.755Z · LW · GW

I agree that a lot of science relies on predictive hallucinations. But there are examples that come to mind, notably the sort of phenomenological compression pushed by Faraday and (early) Ampère in their initial exploration of electromagnetism. What they did amounted to vary a lot of the experimental condition and relate outcomes and phenomena to each other, without directly assuming any hidden entity. (see this book for more details)

More generally, I expect most phenomenological laws to not rely heavily on predictive hallucinations, even when they integrate theoretical terms in their formulation. That's because they are mostly strong experimental regularities (like the ideal gas law or the phenomenological laws of thermodynamics) which tend to be carried to the next paradigm with radically different hidden entities.

Comment by adamShimi on Reification bias · 2023-01-09T16:43:07.264Z · LW · GW

So reification means "the act of making real" in most english dictionaries (see here for example). That's the meaning we're trying to evoke here, where the reification bias amounts to first postulate some underlying entity that explain the phenomena (that's merely a modelling technique), and second to ascribe reality to this entity and manipulate it as if it was real.

Comment by adamShimi on Prediction Markets for Science · 2023-01-02T21:33:00.885Z · LW · GW

You use the analogy with sports betting multiple time in this post. But science and sports are disanalogical in almost all the relevant ways!

Notably, sports are incredibly limited and well-defined, with explicit rules that literally anyone can learn, quick feedback signals, and unambiguous results. Completely the opposite of science!

The only way I see for the analogy to hold is by defining "science" in a completely impoverished way, that puts aside most of what science actually looks like. For example, replication is not that big a part of acience, it's just the visible "clean" one. And even then, I expect the clarification of replication issues and of the original meaning to be tricky.

So my reaction to this proposal, like my reaction to any prediction market for things other than sports and games, is that I expect it to be completely irrelevant to the progress of knowledge because of the weakness of such tools. But I would definitely be curious of attempts to explicitly address all the ambiguities of epistemology and science through betting mechanisms. Maybe you know of some posts/works on that?

Comment by adamShimi on Proof as mere strong evidence · 2022-12-16T08:07:09.206Z · LW · GW

Agreed! That's definitely and important point, and one reason why it's still interesting to try to prove P \neq NP. The point I was making here was only that when proofs are used for the "certainty" that they give, then strong evidence from other ways is also enough to rely on the proposition.

Comment by adamShimi on Applications for Deconfusing Goal-Directedness · 2022-12-14T17:16:16.209Z · LW · GW

What are you particularly interested in? I expect I could probably write it with a bit of rereading.

Comment by adamShimi on Psychological Disorders and Problems · 2022-12-12T21:02:52.225Z · LW · GW

Hot take: I would say that most optimization failures I've observed in myself and in others (in alignment and elsewhere) boil down to psychological problems.

Comment by adamShimi on Formalization as suspension of intuition · 2022-12-12T14:23:03.580Z · LW · GW

Completely agree! The point is not that formalization or axiomatization is always good, but rather to elucidate one counterintuitive way in which it can be productive, so that we can figure out when to use it.

Comment by adamShimi on Formalization as suspension of intuition · 2022-12-12T14:22:05.517Z · LW · GW

Thanks for your thoughtful comment!

First, I want to clarify that this is obviously not the only function of formalization. I feel like this might clarify a lot of the point you raise.

But first, the very idea that formalization would have helped discover non-Euclidean geometries earlier seems counter to the empirical observation that Euclid himself formalized geometry with 5 postulates, how more formal can it get? Compared to the rest of the science of the time, it was a huge advance. He also saw that the 5th one did not fit neatly with the rest. Moreover, the non-Euclidean geometry was right there in front of him the whole time: spheres are all around. And yet the leap from a straight line to the great circle and realizing that his 4 postulates work just fine without the 5th had to wait some two millennia. 

So Euclid formalized our geometric intuitions, the obvious and immediate shape that make naturally sense of the universe. This use of formalization was to make more concrete and precise some concepts that we had but that were "floating around". He did it so well that these concepts and intuition acquired an even stronger "reality" and "obviousness": how could you question geometry when Euclid had made so tangible the first intuitions that came to your mind?

According to Bachelard, the further formalization, or rather the axiomatization of geometry, of simplifying the apparently simple concepts of points and lines to make them algebraically manipulable, was a key part in getting out of this conceptual constraint.

That being said, I'd be interested for an alternative take or evidence that this claim is wrong. ;)

In general, what you (he?) call "suspension of intuition", seems to me to be more like emergence of a different intuition after a lot of trying and failing. I think that the recently empirically discovered phenomenon of "grokking" in ML provides a better model of how breakthroughs in understanding happen. It is more of a Hegelian/Kuhnian model of phase transitions after a lot of data accumulation and processing. 

This strike me as a false comparison/dichotomy: why can't both be part of scientific progress? Especially in physics and chemistry (the two fields Bachelard knew best), there are many examples of productive formalization/axiomatization as suspension of intuition:

  • Bolzmann work that generally started from mathematical building blocks, build stuff from them, and then interpreted them. See this book for more details of this view.
  • Quantum Mechanics went through that phase, where the half-baked models based on classical mechanics didn't work well enough, and so there was an effort at formalization and axiomatization that revealed the underlying structure without as much pollution by macroscopic intuition.
  • The potential function came from a pure mathematical and formal effort to compress the results of classical mechanics, and ended up being incorporated in the core concepts of physics.

I've also found out that on inspection, models of science based on the gathering of a lot of data rarely fit the actual history. Notably Kuhn's model contradicts the history of science almost everywhere, and he makes a highly biased reading of the key historic events that he leverages.

Comment by adamShimi on Formalization as suspension of intuition · 2022-12-12T13:51:27.200Z · LW · GW

That definitely feels right, with a caveat that is dear to Bachelard: this is a constant process of rectification that repeats again and again. There is no ending, or the ending is harder to find that what we think.

Comment by adamShimi on Biases are engines of cognition · 2022-11-30T17:12:51.059Z · LW · GW

I'm confused by your confusion, given that I'm pretty sure you understand the meaning of cognitive bias, which is quite explicitly the meaning of bias drawn upon here.

Comment by adamShimi on Methodological Therapy: An Agenda For Tackling Research Bottlenecks · 2022-11-30T16:49:55.114Z · LW · GW

Thanks for your comment!

Actually, I don't think we really disagree. I might have just not made my position very clear in the original post.

The point of the post is not to say that these activities are not often valuable, but instead to point out that they can easily turn into "To do science, I need to always do [activity]". And what I'm getting from the examples is that in some cases, you actually don't need to do [activity]. There's a shortcut, or maybe just you're in a different phase of the problem.

Do you think there is still a disagreement after this clarification?

Comment by adamShimi on The First Filter · 2022-11-29T16:32:42.377Z · LW · GW

In a limited context, the first example that comes to me is high performers in competitive sports and games. Because if they truly only give a shit about winning (and the best generally do), they will throw away their legacy approaches when they find a new one, however it pains them.

Comment by adamShimi on What I Learned Running Refine · 2022-11-29T16:30:48.723Z · LW · GW

Thanks for the kind words!

I'm not aware of any such statistics, but I'm guessing that MATS organizers might have some.

Comment by adamShimi on Don't align agents to evaluations of plans · 2022-11-29T11:17:02.884Z · LW · GW

I interpret Alex as making an argument such that there is not just two vs one difficulties, but an additional difficulty. From this perspective, having two will be more of an issue than one, because you have to address strictly more things.

This makes me wonder though if there is not just some sort of direction question underlying the debate here. Because if you assume the "difficulties" are only positive numbers, then if the difficulty for the direct instillation is  and the one for the grader optimization is  , then there's no debate that the latter is bigger than the former.

But if you allow directionality (even in one dimension), then there's the risk that the sum leads to less difficulty in total (by having the  move in the opposite direction in one dimension). That being said, these two difficulties seem strictly additive, in the sense that I don't see (currently) how the difficulty of evaluation could partially cancel the difficulty of instillation.

Comment by adamShimi on Don't align agents to evaluations of plans · 2022-11-29T11:03:50.941Z · LW · GW

Thanks for taking time to answer my questions in detail!

About your example for other failure modes

Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?

If so, it's not clear to me why this is valuable for the actor to do? How is it supposed to get better reward from confusion only? If it has another agenda (making paperclips instead of diamonds for example), then the obfuscation is clearly valuable to allow it to work on its main goal. But here its goal is to improve evaluation, and so confusion doesn't seem like it helps.

About the cost/competitiveness argument

I think that a values-executing AGI can also search over as many plans which actually make sense, I don't think its options are limited or anything. But it'll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. "don't think about basilisks").

This is the part where I currently cannot reconstruct your model (maybe because I haven't read shard theory in a while). Because from an abstract perspective, the trick part of starting from significantly more limited plans is how to expand the range of plans without allowing adversarial ones. And I don't have a model of what mechanism you think make it easier to go from the bottom up (from few plans accepted to more plans accepted) safely rather than from the top down (from all plans accepted to less plans accepted) safely.

About the "negative result" of not even trying to do grader optimization

  1. I don't see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide. 
  2. See the last 10+years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.

But why don't you see that it should exist? What are the parts of your model that make this prediction? Because by definition, there is not a single innovation or progress that most people expected to exist. That's what it means to figure out something new. So you need an actual reason for why you're not falling into that trap of only being able to imagine what already exists.

Here are some guesses:

  • Specific scientific claim that there should be no true name/clean concept of what is adversarial for a large class of cognitive architectures (in opposition to Natural Abstraction Hypothesis I guess?)
  • Human limitation claim that even though there might be such a true name/clear concept, we have neither the time not the ability to get it right under the current conditions
  • Security mindset claim that because the actor is allowed to get arbitrary smart, you should expect that if it optimizes against you, it wins.

 

For the second point, what do you make of the claim that there are ridiculously few researchers who even attempted to work on the problem in the last 10 years, compared to almost all easier fields of knowledge? Do you think that this should not alter the evidence of absence you get from the absence of evidence? 

Comment by adamShimi on Corrigibility Via Thought-Process Deference · 2022-11-28T17:52:26.579Z · LW · GW

It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.

Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternatives approaches:

  • You need to know that you're going to reach an AGI before it becomes superintelligent, or you'll waste your time training an AI that will be taken over by the competitors. Whereas many approaches don't require this.
  • You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities
  • You need to figure out the right translation to bootstrap it, and there seem to be risks if you get it wrong.
  • You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong.

Can you help me understand why you think that these strong requirements nonethless are simpler than most versions or approaches of the problem that you know about?

Comment by adamShimi on Don't align agents to evaluations of plans · 2022-11-28T17:27:24.103Z · LW · GW

The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:

  • Do you think that you are internally trying to approximate your own ?
  • Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)?
  • Can you think of concrete instances where you improved your own Eval?
  • Can you think of concrete instances where you thought you improved you own Eval but then regretted it later?
  • Do you think that your own changes to your eval have been moving in the direction of your ?
Comment by adamShimi on Don't align agents to evaluations of plans · 2022-11-28T17:17:00.574Z · LW · GW

> This includes “What would this specific and superintelligent CEV-universe-simulation say about this plan?”.

> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]

But isn't 1 here is at least as good as 2, since the CEV-universe-simulation could always compute X=[the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] then return 1 iff input-plan = 'run X then shuts itself off without doing anything else' (by doing a simple text match), 0 otherwise, so there's no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it's saying that 2 is safer/better than 1.

Is your issue here that there exist a specific CEV-universe-simulation that makes 1 just as safe as 2, by basically emulating the latter situation? If so, why do you think this is a point against Alex's claim(which strikes me more as saying "there are a lot more cases of 2. being safe than of 1.")? 

Comment by adamShimi on Don't align agents to evaluations of plans · 2022-11-28T17:07:35.513Z · LW · GW
  1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
  2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
  3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
  4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.

One thing that is not clear to me from your comment is what you make of Alex's argument (as I see it) to the extent that "evaluation goals" are further away from "direct goals" than "direct goals" are between themselves. If I run with this, it seems like an answer to your point 4 would be:

  • with directly instilled goals, there will be some risk of discrepancy that can explode due to selection pressure;
  • with evaluation based goals, there is the same discrepancy than between directly instilled goals (because it's hard to get your goal exactly right) plus an additional discrepancy between valuing "the evaluation of X" and valuing "X".

I'm curious what you think of this claim, and if that influences at all your take.

Comment by adamShimi on Don't align agents to evaluations of plans · 2022-11-28T16:49:44.570Z · LW · GW

A few questions to better understand your frame:

  • You mostly mention two outcomes for the various diamond-maximizer architectures: maximizing the number of diamonds produced and creating hypertuned-fooling-plans for the evaluator. If I could magically ensure that plan-space only contains plans that are not hypertuned-fooling-plans (they might try, but will most likely be figured out), would you say that then grader-optimization gives us an aligned AI? Or are there other failures modes that you see?
    • Intuitively if maximizing the number of diamonds and maximizing the evaluation of the number of diamonds are not even close, I would expect multiple distinct failure modes "in-between".
  • In your response to Wei Dai, I interpret you as making an uncompetitiveness claim for grader-optimization: that it will need to pay a cost in compute for both generating and pruning the adversarial examples that will make it cost more than alternative architectures. Why do you think that this cost isn't compensated by the fact that you're searching over more plans and so have access to more "good options" too?
  • You're making strong claims about us needing to avoid as much as possible going on the route of grader optimization. Why do you expect that there is no clean/clear cut characterization of the set of adversarial plans (or a superset) that we could just forbid and then go on our merry way building grader optimizers?
Comment by adamShimi on What I Learned Running Refine · 2022-11-24T19:57:10.679Z · LW · GW

Thanks for the kind words!

  1. Are there any particular lessons/ideas from Refine that you expect (or hope) SERI MATS to incorporate?

I have shared some of my models related to epistemology and key questions to MATS organizers, and I think they're supposed to be integrated in one of the future programs. Mostly things regarding realizing the importance of productive mistakes in science (which naturally pushes back a bit from the mentoring aspect of MATS) and understanding how less "clean" most scientific progress actually look like historically (with a basic reading list from the history of science).

From the impression I have, they are also now trying to give participants some broader perspective about the field, in addition to the specific frame of the mentor, and a bunch of the lessons from Refine about how to build a good model of the alignment problem apply.

On a more general level, I expect that I had enough discussions with them that they would naturally ask me for feedback if they thought of something that seemed Refine shaped or similar.

2. Do you think there's now a hole in the space that someone should consider filling (by making Refine 2.0), or do you expect that much of the value of Refine will be covered by SERI MATS [and other programs]?

Hum, intuitively the main value from Refine that I don't expect to be covered by future MATS would come from reaching out to very different profiles. There's a non-negligeable chance that PIBBSS manages to make that work though, so not clear that it's a problem.

Note that this is also part of why Refine feels less useful: when I conceived of it, most of these programs either didn't exist or were not well-established. Part of the frustration came from having nothing IRL for non-american to join, and just no program spending a significant amount of time on conceptual alignment, which both MATS and PIBBSS (in addition to other programs like ARENA) are now fixing. Which I think is great!

Comment by adamShimi on Two reasons we might be closer to solving alignment than it seems · 2022-09-26T12:56:40.997Z · LW · GW

Thanks for explicitly writing out your thoughts in a place where you can expect strong pushback! I think this is particularly valuable.

That being said, while I completely agree with your second point (I keep telling to people who argue theory cannot work that barely 10 people worked on it for 10 years, which is a ridiculously small number), I feel like your first point is missing some key reflections on the asymmetry of capabilities vs alignment.

I don't have time to write a long answer, but I already have a post going in depth into many of the core assumptions of science and engineering that we don't expect to apply for alignment, (almost all apply or are irrelevant for capabilities, although that's not discussed explicitly in the post)

Comment by adamShimi on Losing the root for the tree · 2022-09-26T12:52:23.409Z · LW · GW

This post is amazing. Not just good, but amazing. You manage to pack exactly the lesson I needed to hear with just the right amount of memes and cheekiness to also be entertaining.

I would genuinely not be surprised if the frame in this post (and the variations I'm already adding to it) proved one of the key causal factors in me being far more productive and optimizing as an alignment researcher.

One suggestion: let's call these trees treeory of change, because that's what they are. ;)

Thanks. Really.

Comment by adamShimi on Methodological Therapy: An Agenda For Tackling Research Bottlenecks · 2022-09-24T12:24:53.288Z · LW · GW

Thanks for the kind words and useful devil's advocate! (I'm expecting nothing less from you ;p)

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
    1. It might be useful to predict this kind of thing ahead of time, to develop a sense of when to expect specific side-effects (and/or predictably unpredictable side effects)

I agree that pure replacement of methodology is a massive step that is probably premature before we have a really deep understanding both of the researcher's approach and of the underlying algorithm for knowledge production. Which is why in my model, this comes quite late; instead the first step are more revealing the cached methodology to the researcher, and showing alternatives from History of Science (and Technology) to make more options and approaches credible for them.

Also looking at the "sins of the fathers" for philosophy of science (how methodologies have fucked up people across history) is part of our last set of framing questions. ;)
 

  1. I do think it's worth interviewing at least a few carefully selected non-alignment researchers. I basically agree with your alignment-is-harder case. However, it also seems most important to be aware of things the field is just completely missing.
    1. In particular, this may be useful where some combination of cached methodologies is a local maximum for some context. Knowing something about other hills seems useful here.
      1. I don't expect it'd work to import full sets of methodologies from other fields, but I do expect there are useful bits-of-information to be had.
    2. Similarly, if thinking about some methodology x that most alignment researchers currently use, it might be useful to find and interview other researchers that don't use x. Are they achieving [things-x-produces] in other ways? What other aspects of their methodology are missing/different?
      1. This might hint both at how a methodology change may impact alignment researchers, and how any negative impact might be mitigated.

Two reactions here:

  1. I agree with the need to find things that are missing and alternatives, which is where the history and philosophy of science works come to help. One advantage of it is that you can generally judge whether the methodology was successful or problematic in hindsight there, compared to interviews.
  2. I hadn't thought about interviewing other researchers. I expect it to be less efficient in a lot of ways than the HPS work, but I'm also now on the lookout for the option, so thanks!
  1. Worth considering that there's less of a risk in experimenting (kindly, that is) on relative newcomers than on experienced researchers. It's a good idea to get a clear understanding of the existing process of experienced researchers. However, once we're in [try this and see what happens] mode there's much less downside with new people - even abject failure is likely to be informative, and the downside in counterfactual object-level research lost is much smaller in expectation.

I see what you're pointing out. A couple related thoughts:

  1. The benefits of working with established researchers is that you have a historical record of what they did, which makes it easier to judge whether you're actually helping.
  2. I also expect helping established researchers to be easier on some dimensions, because they have more experience learning new models and leveraging them.
  3. Related to your first point, I don't worry too much about messing people up because the initial input will far less invasive than replacements of methodologies wholesale. But we're still investigating the risks to be sure we're not doing something net negative.
Comment by adamShimi on My thoughts on direct work (and joining LessWrong) · 2022-08-17T14:07:29.186Z · LW · GW

Here are some of mine:

Comment by adamShimi on Refine's First Blog Post Day · 2022-08-14T18:56:53.518Z · LW · GW

It's a charitable (and hilarious) interpretation. What actually happened is that he drafted it by mistake instead of just editing it to add stuff. It should be fine now.

Comment by adamShimi on Abstracting The Hardness of Alignment: Unbounded Atomic Optimization · 2022-07-29T20:39:56.403Z · LW · GW

You probably know better than me, but I still have this intuition that seed-AI and FOOM have oriented the framing of the problem and the sort of question asked. I think people who came to agent foundations from different routes ended up asking slightly different questions.

I could totally be wrong though, thanks for making this weakness of my description explicit!

Comment by adamShimi on Robustness to Scaling Down: More Important Than I Thought · 2022-07-23T18:57:24.840Z · LW · GW

That's a great point!

There's definitely one big difference between how Scott defined it and how I'm using it, which you highlighted well. I think a better way of explaining my change is that in Scott's original example, the AI being flawed result in some sense in the alignment scheme (predict human values and do that) to be flawed too.

I hadn't made the explicit claim in my head or in the post, but thanks to your comment, I think I'm claiming that the version I'm proposing generalize one of the interesting part of the original definition, and let it be applied to more settings.

As for your question, there is a difference between flawed and not the strongest version. What I'm saying about interpretability and single-single is not that a flawed implementation of them would not work (which is obviously trivial), but that for the reductions to function, you need to solve a particularly ambitious form of the problem. And that we don't currently have a good reason to expect to solve this ambitious problem with enough probability to warrant trusting the reduction and not working on anything else.

So an example of a plausible solution (of course I don't have a good solution at hand) would be to create sufficient interpretability techniques that, when combined with conceptual and mathematical characterizations of problematic behaviours like deception, we're able to see if a model will end up having these problematic behaviours. Notice that this possible solution requires working on conceptual alignment, which the reduction to interpretability would strongly discourage.

To summarize, I'm not claiming that interpretability (or single-single) won't be enough if it's flawed, just that reducing the alignment problem (or multi-multi) to them is actually a reduction to an incredibly strong and ambitious version of the problem, that no one is currently tackling this strong version, and that we have no reason to expect to solve the strong version with such high probability that we should shun alternatives and other approaches.

Does that clarify your confusion with my model? 

Comment by adamShimi on How to Diversify Conceptual Alignment: the Model Behind Refine · 2022-07-23T11:57:20.522Z · LW · GW

Yeah, I see how it can be confusing. To give an example, Paul Christiano focuses on prosaic alignment (he even coined the term) yet his work is mostly on the conceptual side. So I don't see the two as in conflict.

Comment by adamShimi on How to Diversify Conceptual Alignment: the Model Behind Refine · 2022-07-23T11:55:45.866Z · LW · GW

Thanks for your comment!

Probably the best place to get feedback as a beginner is AI Safety Support. They can also redirect you towards relevant programs, and they have a nice alignment slack.

As for your idea, I can give you quick feedback on my issues with this whole class of solutions. I'm not saying you haven't thought about these issues, nor that no solution in this class is possible at all, just giving the things I would be wary of here:

  • How do you limit the compute if the AI is way smarter than you are?
  • Assuming that you can limit the compute, how much compute do you give it? Too little and it's not competitive, leading many people to prefer alternatives without this limit; too much and you're destroying the potential guarantees.
  • Even if there's a correct and safe amount of compute to give for each task, how do you compute that amount? How much time and resources does it cost?
Comment by adamShimi on How to Diversify Conceptual Alignment: the Model Behind Refine · 2022-07-22T07:32:58.747Z · LW · GW

Basically the distinction is relevant because there are definitely more and more people working on alignment, but the vast majority of the increase actually doesn't focus on formulating solution or deconfusing the main notions; instead they mostly work on (often relevant) experiments and empirical questions related to alignment. 

Comment by adamShimi on How to Diversify Conceptual Alignment: the Model Behind Refine · 2022-07-22T07:30:56.988Z · LW · GW

Maybe I should have added this link. ;)

Comment by adamShimi on How to Diversify Conceptual Alignment: the Model Behind Refine · 2022-07-21T17:08:13.255Z · LW · GW

Yeah, I will be posting updates, and probably the participants themselves will post some notes and related ideas. Excited too about how it's going to pan out!

Comment by adamShimi on How to Diversify Conceptual Alignment: the Model Behind Refine · 2022-07-21T17:07:29.440Z · LW · GW

Does Conjecture/Refine work with anyone remotely or is it all in person?

By default Conjecture is all in person, although right now for a bunch of administrative and travelling reasons we are more disseminated. For Refine it will be in person the whole time. Actually, ensuring that is one big reason we're starting in France (otherwise it would need to be partly remote for administrative reasons)

Having novel approaches to alignment research seems like it could really help the field at this still-early stage. Thanks for creating a program specifically designed to foster this.

You're welcome. ;)

Comment by adamShimi on Mosaic and Palimpsests: Two Shapes of Research · 2022-07-17T10:42:55.752Z · LW · GW

Thanks for the comment!

To be honest, I had more trouble classifying you, and now that you commented, I think you're right that I got the wrong label. My reasoning was that your agenda and directions look far more explicit and precise than Paul or Evan's, which is definitely a more mosaic-y trait. On the other hand, there is the iteration that you describe, and I can clearly see a difference in terms of updating between you and let's say John/Eliezer.

My current model is that you're more palimpsest-y, but compared with most of us, you're surprisingly good at making your current iteration fit into a proper structure that you can make explicit and legible.

(Will update the post in consequence. ;) )

Comment by adamShimi on Outer vs inner misalignment: three framings · 2022-07-07T14:00:52.517Z · LW · GW

Nice post! Two things I particularly like are the explicit iteration (demonstrating by example how and why not to only use one framing), as well as the online learning framing.

The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.

It seems like a core part of this initial framing relies on the operationalisation of "competent", yet you don't really point to what you mean. Notably, "competent" cannot mean "high-reward" (because of category 4) and "competent" cannot mean "desirable" (because of category 3 and 4). Instead you point at something like "Whatever it's incentivized to do, it's reasonably good at accomplishing it". I share a similar intuition, but just wanted to highlight that subtleties might hide there (maybe addressed in later framings but at least not mention at this point)

In general, we should expect that alignment failures are more likely to be in the first category when the test environment is similar to (or the same as) the training environment, as in these examples; and more likely to be in the second category when the test environment is very different from the training environment.

What comes to my mind (and is a bit mentionned after the quote) is that we could think of different hypotheses on the hardness of alignment as quantifying how similar the test environment must be to the training one to avoid inner misalignment. Potentially for harder versions of the problem, almost any difference that could tractably be detected is enough for the AI to behave differently.

I’d encourage alignment researchers to get comfortable switching between these different framings, since each helps guide our thinking in different ways. Framing 1 seems like the most useful for connecting to mainstream ML research. However, I think that focusing primarily on Framing 1 is likely to overemphasize failure modes that happen in existing systems, as opposed to more goal-directed future systems. So I tend to use Framing 2 as my main framing when thinking about alignment problems. Lastly, when it’s necessary to consider online training, I expect that the “goal robustness” version of Framing 3 will usually be easier to use than the “high-stakes/low-stakes” version, since the latter requires predicting how AI will affect the world more broadly. However, the high-stakes/low-stakes framing seems more useful when our evaluations of AGIs are intended not just for training them, but also for monitoring and verification (e.g. to shut down AGIs which misbehave).

Great conclusion! I particularly like your highlighting that each framing is more adapted to different purposes.

Comment by adamShimi on Principles for Alignment/Agency Projects · 2022-07-07T13:37:57.905Z · LW · GW

Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.

Comment by adamShimi on Principles for Alignment/Agency Projects · 2022-07-07T13:34:32.097Z · LW · GW

Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.

I agree that avoiding the Hard parts is rarely productive, but you also don't address one relevant concern: what if the Hard part is not merely Hard, but actually Impossible? In this case your advice can also be cashed out by trying to prove it is impossible instead of avoiding it. And just like with most impossibility results in TCS, it's possible that even if the precise formulation is impossible, it often just means that you need to reframe the problem a bit.

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!

I want to note that the failure mode of blind theory here is to accept any story, and thus make the requirement of a story completely impotent to guide research. There's an art (and hopefully a science) to finding stories that bias towards productive mistakes.

Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.

I expect you to partially disagree, but there's not always a "right" operationalization, and there's a failure mode where one falls in love with their neat operationalization, making the misses parts of the phenomena invisible.

Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.

I want to say that you should start with behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment and so it is avoiding the Hard part. Am I correct?

Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.

One formal example of this is the relativization barrier in complexity theory, which tells you that you can't prove (and a bunch of other separations) using only techniques using algorithms as blackboxes instead of looking at the structure.

Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly dicourages that.

Agreed that it's a great pair of advice to keep in mind!

Comment by adamShimi on The Case for a Journal of AI Alignment · 2022-07-03T09:31:41.544Z · LW · GW

In what way is AF not open to new ideas? I think it is a bit scary to publish a post here, but that has more to do with it being very public, and less to do with anything specific about the AF. But if AF has a culture of being non welcoming of new ideas, maybe we should fix that?

It's not that easy to justify a post from a year ago, but I think that what I meant was that the alignment forum has a certain style of alignment research, and thus only reading it means you don't see stuff like CHAI research or other works that are and aim at alignment without being shared that much on the AF.

Comment by adamShimi on How to get people to produce more great exposition? Some strategies and their assumptions · 2022-07-03T09:28:51.983Z · LW · GW

Thanks!

I will look at the post soonish. Sorry for the delay in answering, I was in holidays this week. ^^

Comment by adamShimi on Epistemological Vigilance for Alignment · 2022-06-06T17:44:27.400Z · LW · GW

Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes?

This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like "non-complexity" or "linear/predictable responses"; or leaning into the optimisation paths analogy which might be something like "incremental improvement is ok" although that is pretty bad as a name.

Someone at Conjecture proposed linear too, but Newtonian physics isn't linear. Although I agree that the sort of behavior and reaction I'm pointing out fit within the "non-linear" category.

Comment by adamShimi on RL with KL penalties is better seen as Bayesian inference · 2022-05-29T09:37:20.421Z · LW · GW

Thanks for this post, it's clear and insightful about RLHF.

From an alignment perspective, would you say that your work gives evidence that we should focus most of the energy on finding guarantees about the distribution that we're aiming for and debugging problems there, rather than thinking about the guarantees of the inference?

(I still expect that we want to understand the inference better and how it can break, but your post seems to push towards a lesser focus on that part)