Posts

Using Finite Factored Sets for Causal Representation Learning? 2023-01-11T22:06:43.831Z

Comments

Comment by David Reber (derber) on Some Rules for an Algebra of Bayes Nets · 2023-11-20T15:08:09.179Z · LW · GW

Ah, that's right. Thanks, that example is quite clarifying!

Comment by David Reber (derber) on Some Rules for an Algebra of Bayes Nets · 2023-11-19T16:56:19.857Z · LW · GW

Also, it appears that the two diagrams in the Frankenstein Rule section differ in their d-separation of (x_1 \indep x_4 | x_5) (which doesn't hold in the left), so these are not actually equivalent (we can't have an underlying distribution satisfy both of these diagrams).
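For readers who want to verify this kind of claim mechanically, d-separation checks are easy to run in networkx. A minimal sketch on a made-up DAG (the edges below are placeholders I invented for illustration, not the diagrams from the post; in networkx >= 3.3 the function is nx.is_d_separator, while older releases expose it as nx.d_separated):

```python
import networkx as nx

# Hypothetical DAG, NOT the actual Frankenstein Rule diagrams -- just to show
# how a claimed (in)dependence like (x1 indep x4 | x5) can be checked in code.
G = nx.DiGraph([("x1", "x2"), ("x2", "x5"), ("x3", "x5"), ("x3", "x4")])

print(nx.is_d_separator(G, {"x1"}, {"x4"}, {"x5"}))  # False: conditioning on x5 opens the collider x2 -> x5 <- x3
print(nx.is_d_separator(G, {"x1"}, {"x4"}, set()))   # True: marginally, the collider blocks the only path
```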

Comment by David Reber (derber) on Some Rules for an Algebra of Bayes Nets · 2023-11-19T16:52:02.204Z · LW · GW

The theorems in this post all say something like "if the distribution (approximately) factors according to <some DAGs>, then it also (approximately) factors according to <some other DAGs>"

So one motivating research question might be phrased as "Probability distributions have an equivalence class of Bayes nets / causal diagrams which are all compatible. But what is the structure within a given equivalence class? In particular, if we have a representative Bayes net of an equivalence class, how might we algorithmically generate other Bayes nets in that equivalence class?"
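One classical handle on "structure within an equivalence class" is the Verma-Pearl characterization: two DAGs are Markov equivalent iff they have the same skeleton and the same v-structures (colliders whose parents are non-adjacent). Here's a minimal sketch of that check, using small hypothetical three-node graphs rather than anything from the post:

```python
import networkx as nx
from itertools import combinations

def skeleton(G: nx.DiGraph) -> set:
    """Undirected edge set of the DAG."""
    return {frozenset(e) for e in G.edges()}

def v_structures(G: nx.DiGraph) -> set:
    """Colliders a -> c <- b where a and b are not adjacent."""
    vs = set()
    for c in G.nodes:
        for a, b in combinations(G.predecessors(c), 2):
            if not (G.has_edge(a, b) or G.has_edge(b, a)):
                vs.add((frozenset({a, b}), c))
    return vs

def markov_equivalent(G1: nx.DiGraph, G2: nx.DiGraph) -> bool:
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return skeleton(G1) == skeleton(G2) and v_structures(G1) == v_structures(G2)

# Chains X -> Y -> Z and X <- Y <- Z encode the same independences...
G1 = nx.DiGraph([("X", "Y"), ("Y", "Z")])
G2 = nx.DiGraph([("Z", "Y"), ("Y", "X")])
# ...while the collider X -> Y <- Z does not.
G3 = nx.DiGraph([("X", "Y"), ("Z", "Y")])

print(markov_equivalent(G1, G2))  # True
print(markov_equivalent(G1, G3))  # False
```

Enumerating other members of the class then roughly amounts to re-orienting edges while preserving the skeleton, the v-structures, and acyclicity.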

Comment by David Reber (derber) on Some Rules for an Algebra of Bayes Nets · 2023-11-17T21:00:45.066Z · LW · GW

Could you clarify how this relates to e.g. the PC (Peter-Clark) or FCI (Fast Causal Inference) algorithms for causal structure learning? 

Like, are you making different assumptions (than e.g. minimality, faithfulness, etc.)?

Comment by David Reber (derber) on Introduction to Towards Causal Foundations of Safe AGI · 2023-06-14T20:24:05.112Z · LW · GW

So the contributions of vnm theory are shrunken down into "intention"?

(Background: I consider myself fairly well-read w.r.t. causal incentives, not very familiar with VNM theory, and well-versed in Pearlian causality. I have gotten a sneak peek at this sequence, so I have a good sense of what's coming.)

I'm not sure I understand VNM theory, but I would suspect the relationship is more like "VNM theory and <this agenda> are two takes on how to reason about the behavior of agents, and they both refer to utilities and Bayesian networks, but have important differences in their problem statements (and hence, in their motivations, methodologies, exact assumptions they make, etc)".

I'm not terribly confident in that appraisal at the moment, but perhaps it helps explain my guess for the next question:

Will you recapitulate that sort of framing (such as involving the interplay between total orders and real numbers)

Based on my (decent?) level of familiarity with the causal incentives research, I don't think there will be anything like this. Just because two research agendas use a few of the same tools doesn't mean they're answering the same research questions, let alone sharing methodologies.

...or are you feeling more like it's totally wrong and should be thrown out?

When two different research agendas are distinct enough (as I suspect VNM and this causal-framing-of-AGI-safety are), their respective successes and failures are quite independent. In particular, I don't think the authors' choice to pursue this research direction over the last few years should be taken by itself as a strong commentary on VNM.

But maybe I didn't fully understand your comment, since I haven't read up on VNM.

Comment by David Reber (derber) on Shutdown-Seeking AI · 2023-06-06T16:35:17.852Z · LW · GW

Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other hand, AGIs that desire permanent shutdown may be less incentivized to entrench.

 

It seems like an AGI built to desire permanent shutdown may have an incentive to permanently disempower humanity, then shut down. Otherwise, there's a small chance that humanity may revive the AGI, right?

Comment by David Reber (derber) on Steering GPT-2-XL by adding an activation vector · 2023-05-25T17:10:36.811Z · LW · GW

Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes made in this comment are my own). We haven't prioritized a blog post about the paper, so it makes sense that this community isn't familiar with it.

The concept algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces of the score embedding space on which you can do the same sort of concept editing/control as with word2vec.

Importantly, the paper comes with some theoretical investigation into why this might be the case, including articulating necessary assumptions/conditions (which this purely-empirical post does not). 

I conjecture that the reason <some activation additions in this post fail to have the desired effect> may be that they violate conditions analogous to those in Concept Algebra: it feels a bit like déjà vu to look at section E.1 of the appendix, which shows empirical results that fail to behave as expected when the conditions of completeness and causal separability don't hold.
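To make the "linear subspace editing" idea concrete, here's a minimal numpy sketch of the kind of projection-based edit involved. The embedding, the concept subspace, and the target below are random stand-ins I made up, not anything extracted from the paper or from Stable Diffusion:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
z = rng.normal(size=d)                  # stand-in for a prompt's (score) embedding
concept_dirs = rng.normal(size=(d, 2))  # stand-in basis for a learned concept subspace (e.g. "style")
Q, _ = np.linalg.qr(concept_dirs)       # orthonormalize; columns of Q span the concept subspace
P = Q @ Q.T                             # orthogonal projector onto that subspace

def edit_concept(z: np.ndarray, z_target: np.ndarray) -> np.ndarray:
    """Swap z's component inside the concept subspace for z_target's component,
    leaving everything outside the subspace untouched."""
    return z - P @ z + P @ z_target

z_target = rng.normal(size=d)           # stand-in embedding expressing the desired concept value
z_edited = edit_concept(z, z_target)

# Sanity check: outside the concept subspace, nothing changed.
print(np.allclose((np.eye(d) - P) @ z_edited, (np.eye(d) - P) @ z))  # True
```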

Comment by David Reber (derber) on EIS V: Blind Spots In AI Safety Interpretability Research · 2023-05-25T16:49:36.276Z · LW · GW

Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn't you agree that asking "how do we do causality when we don't even know what level of abstraction to define causal variables at?" is beyond the "usual Pearl causality story" as usually summarized in FFS posts? It certainly goes beyond Pearl's well-known works.

Comment by David Reber (derber) on EIS V: Blind Spots In AI Safety Interpretability Research · 2023-05-25T16:45:25.969Z · LW · GW

I don't think my claim is that "FFS is already subsumed by work in academia": as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning). 

It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question "what does it mean to study causality if we don't have pre-defined variables?"

It may be that FFS ends up contributing novel insights above and beyond <Pearl-based causal rep. learning>, but a priori I expect this to occur only if FFS researchers are familiar with the existing literature, which I haven't seen mentioned in any FFS posts. 

My line of thinking is: It's hard to improve on a field you aren't familiar with. If you're ignorant of the work of hundreds of other researchers who are trying to answer the same underlying question you are, odds are against your insights being novel / neglected. 

Comment by David Reber (derber) on Should AutoGPT update us towards researching IDA? · 2023-04-12T19:17:04.250Z · LW · GW

Though as a counterpoint, maybe Auto-GPT presents some opportunities to empirically test the IDA proposal? To have a decent experiment, you would need a good metric for alignment (does that exist?) and demonstrate that as you implement IDA using Auto-GPT, your metric is at least maintained, even as capabilities improve on the newer models.

I'm overall skeptical of my particular proposal however, because 1. I'm not aware of any well-rounded "alignment" metrics, and 2. you'd need to be confident that you can scale it up without losing control (because if the experiment fails, then by definition you've developed a more powerful AI which is less aligned).

But it's plausible to me that someone could find some good use for Auto-GPT for alignment research, now that it has been developed. It's just not clear to me how you would do so in a net-positive way.

Comment by David Reber (derber) on Should AutoGPT update us towards researching IDA? · 2023-04-12T19:04:52.316Z · LW · GW

To clarify, here I'm not taking a stance on whether IDA should be central to alignment or not; I'm simply claiming that unless "whether or not recursive improvement is easy to do" is a crux for whether IDA is a good alignment strategy, your assessment of IDA should probably stay largely unchanged.

Comment by David Reber (derber) on Should AutoGPT update us towards researching IDA? · 2023-04-12T18:54:55.544Z · LW · GW

My understanding of Auto-GPT is that it strings together many GPT-4 requests, while notably also giving the model access to memory and the internet. Empirically, this allocation of resources and looping seems promising for solving complex tasks, such as debugging the code of Auto-GPT itself. (For those interested, this paper discusses how looped transformers can serve as general-purpose computers.)
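For concreteness, here's a heavily simplified caricature of that kind of loop. This is not Auto-GPT's actual code; the `llm` callable, the tool names, and the prompt format are all placeholders:

```python
from typing import Callable

def agent_loop(llm: Callable[[str], str], tools: dict, goal: str, max_steps: int = 10) -> list:
    """Repeatedly ask the model for a next action, run it with a tool
    (e.g. web search or memory lookup), and feed the result back in."""
    memory: list = []
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nRecent memory: {memory[-5:]}\nNext action as 'tool: argument'?"
        action = llm(prompt)                            # e.g. "search: looped transformers"
        tool_name, _, arg = action.partition(":")
        tool_name, arg = tool_name.strip(), arg.strip()
        if tool_name == "finish":
            break
        result = tools.get(tool_name, lambda a: "unknown tool")(arg)
        memory.append(f"{action} -> {result}")
    return memory
```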

But to my ears, that just sounds like an update of the form “GPT can do many tasks well”, not in the form of “Aligned oversight is tractable”. Put another way, Auto-GPT sounds like evidence for capabilities, not evidence for the ease of scalable oversight. The question of whether human values can be propagated up through increasingly amplified models seems separate from the ability to improve self-recursively, in the same way that capabilities-progress is distinct from alignment-progress.

Comment by David Reber (derber) on EIS V: Blind Spots In AI Safety Interpretability Research · 2023-02-17T16:08:47.211Z · LW · GW

Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn't engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas

Ditto. I've recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.

I see similar things happening with causality generally, where it seems to me that (as a first-order heuristic) much of the Alignment Forum's reference point for causality is frozen at Pearl's 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field.

  • Example: Finite Factored Sets seems to be reinventing causal representation learning [for a good intro, see Schölkopf 2021], where it seems to me that the broader field is outpacing FFS on its own goals. FFS promises some theoretical gains (apparently the ability to infer causality where Pearl-esque frameworks can't), but I'm no longer as sure about the validity of this.
  • Counterexample(s): the Causal Incentives Working Group, and David Krueger's lab, for instance. Notably these are embedded in academia, where there's more culture (incentive) to thoroughly relate to previous work. (These aren't the only ones, just 2 that came to mind.)

Comment by David Reber (derber) on A multi-disciplinary view on AI safety research · 2023-02-08T18:44:13.233Z · LW · GW

A few thoughts:

  • This seems like a good angle for bridging AI safety with a number of disciplines
  • I appreciated the effort to cite peer-reviewed sources and provide search terms that can be looked into further
  • While I'm still parsing the full validity/relevance of the concrete agendas suggested, they do seem to fit the form of "what relevance is there from established fields" without diluting the original AI safety motivations too much
  • Overall, it's quite long, and I would very much like to see a distilled version (say, 1/5 the length). 
    • (but that's just a moderate signal from someone who was already interested, yet still nearly bounced off)

Comment by David Reber (derber) on Models Don't "Get Reward" · 2023-01-18T15:36:54.927Z · LW · GW

Under the "reward as selection" framing, I find the behaviour much less confusing:

  • We use reward to select for actions that led to the agent reaching the coin.
  • This selects for models implementing the algorithm "move towards the coin".
  • However, it also selects for models implementing the algorithm "always move to the right".
  • It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.

 

I've been reconsidering the coin run example as well recently from a causal perspective, and your articulation helped me crystallize my thoughts. Building on the points above, it seems clear that the core issue is one of causal confusion: the true causal model M is "move right" -> "get the coin" -> "get reward". However, if the variable "did you get the coin" is effectively latent (because the model selection doesn't discriminate on this variable), then M is indistinguishable from M', the model "move right" -> "get reward" (which, though not the true causal model governing the system, generates the same observational distribution).

In fact, the incorrect model M' actually has shorter description length, so there may be a bias here against learning the true causal model. If so, I believe we have a compelling explanation for the coin run phenomenon which does not require the existence of a mesa-optimizer, and which does indicate we should be more concerned about causal confusion.
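A toy simulation of that indistinguishability, with made-up binary variables (not CoinRun itself): during training the coin is always at the right edge, so M and M' induce the same observational distribution; off-distribution they come apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Training distribution: the coin always sits at the right edge, so
# "got the coin" is a deterministic function of "moved right".
move_right = rng.integers(0, 2, size=n)
got_coin = move_right                       # M: move right -> get coin
reward = got_coin                           # M: get coin -> reward

# Under M' ("move right -> reward", no coin variable), the observed joint
# P(move_right, reward) is identical, so the two models can't be told apart:
print(np.array_equal(reward, move_right))   # True

# Test distribution: the coin is moved away from the right edge half the time.
coin_at_right = rng.integers(0, 2, size=n)
reward_true = move_right * coin_at_right    # what the true model M produces now
reward_confused = move_right                # what the causally confused M' predicts
print((reward_true == reward_confused).mean())  # ~0.75: the models now disagree
```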

Comment by David Reber (derber) on Using Finite Factored Sets for Causal Representation Learning? · 2023-01-12T20:03:02.818Z · LW · GW

I'm also working on extending the framework to the infinite setting and am almost finished except for conditional orthogonality for uncountable sets.

 

Hmm, what would be the intuition/application behind the uncountable setting? Like, when would one want that (I don't mind if it's niche, I'm just struggling to come up with anything)?

Comment by David Reber (derber) on The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable · 2023-01-11T21:21:51.736Z · LW · GW

I'd be interested in seeing other matrix factorizations explored as well. Specifically, I would recommend trying non-negative matrix factorization (NMF); to quote the Wikipedia article:

This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered.

The added constraint may help eliminate spurious patterns: for instance, I suspect the positive/negative singular value distinction might be a red herring (based on past projects I've worked on).
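A minimal sketch of what I have in mind, using scikit-learn's NMF on a random stand-in for a weight matrix (not an actual transformer matrix). Since NMF requires non-negative input, one common workaround is to factor the stacked positive and negative parts of the matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 64))      # stand-in for a transformer weight matrix

# NMF needs non-negative input: write W = W_pos - W_neg and stack the parts.
W_pos = np.clip(W, 0, None)
W_neg = np.clip(-W, 0, None)
X = np.concatenate([W_pos, W_neg], axis=0)   # shape (512, 64), all entries >= 0

rank = 16
nmf = NMF(n_components=rank, init="nndsvd", max_iter=500, random_state=0)
A = nmf.fit_transform(X)            # (512, rank) non-negative "parts"
B = nmf.components_                 # (rank, 64) non-negative coefficients

rel_err = np.linalg.norm(X - A @ B) / np.linalg.norm(X)
print(f"rank-{rank} NMF relative reconstruction error: {rel_err:.3f}")
```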

Comment by David Reber (derber) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-31T18:57:53.994Z · LW · GW

I second this, that it's difficult to summarize AI-safety-relevant academic work for LW audiences. I want to highlight the symmetric difficulty of trying to summarize the mountain of blog-post-style work on the AF for academics.

In short, both groups have steep reading/learning curves that are under-appreciated when you're already familiar with it all.

Comment by David Reber (derber) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-31T18:50:28.885Z · LW · GW

Anecdotally, I've found the same said of Less Wrong / Alignment Forum posts among AI safety / EA academics: that it amounts to an echo chamber that no one else reads.

I suspect both communities are taking their collective lack of familiarity with the other as evidence that the other community isn't doing their part to disseminate their ideas properly. Of course, neither community seems particularly interested in taking the time to read up on the other, and seems to think that the other community should simply mimic their example (LWers want more LW synopses of academic papers, academics want AF work to be published in journals).

Personally I think this is symptomatic of a larger camp-ish divide between the two, which is worth trying to bridge.

Comment by David Reber (derber) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-08-31T18:41:44.802Z · LW · GW

The causal incentives working group should get mentioned; it's directly on AI safety. Though it's a bit older, I gained a lot of clarity about AI safety concepts via "Modeling AGI Safety Frameworks with Causal Influence Diagrams", which is quite accessible even if you don't have a ton of training in causality.

Comment by David Reber (derber) on Testing The Natural Abstraction Hypothesis: Project Update · 2022-08-14T15:28:57.922Z · LW · GW

[Warning: "cyclic" overload. I think in this post it's referring to the dynamical systems definition, i.e. variables reattain the same state later in time. I'm referring to Pearl's causality definition: variable X is functionally dependent on variable Y, which is itself functionally dependent on variable X.]

Turns out Chaos is not Linear...

I think the bigger point (which is unaddressed here) is that chaos can't arise in acyclic causal models (SCMs). Chaos can only arise when there is feedback between the variables, right? Hence the characterization of chaos as having orbits of all periods present in the system: you can't have an orbit at all without functional feedback. The linear approximations post is working on an acyclic Bayes net.

I believe this sort of phenomenon [chaos] plays a central role in abstraction in practice: the "natural abstraction" is a summary of exactly the information which isn't wiped out. So, my methods definitely needed to handle chaos.

Not all useful systems in the world are chaotic. And the Telephone Theorem doesn't rely on chaos as the mechanism for information loss. So it seems too strong to say "my methods definitely needed to handle chaos". Surely there are useful footholds between the extremes of "acyclic + linear" and "cyclic + chaos": for instance, "cyclic + linear".

At any rate, Foundations of Structural Causal Models with Cycles and Latent Variables could provide a good starting point for cyclic causal models (also called structural equation models). There are other formalisms as well but I'm preferential towards this because of how closely it matches Pearl.
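As a tiny illustration of the "cyclic + linear" foothold, here's the equilibrium-style semantics typically used for linear cyclic SCMs. The coefficient matrix below is made up for illustration:

```python
import numpy as np

# Linear cyclic SCM: x = B x + e, with genuine feedback between x1 and x2.
# Unlike an acyclic SCM, solutions are fixed points x = (I - B)^{-1} e,
# which exist whenever I - B is invertible (e.g. spectral radius of B < 1).
B = np.array([[0.0, 0.5],
              [0.3, 0.0]])
rng = np.random.default_rng(0)
E = rng.normal(size=(10_000, 2))        # exogenous noise, one row per sample
X = E @ np.linalg.inv(np.eye(2) - B).T  # equilibrium values of the cyclic system

print(np.cov(X, rowvar=False))          # observational covariance implied by the cyclic model
```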

Comment by David Reber (derber) on The Telephone Theorem: Information At A Distance Is Mediated By Deterministic Constraints · 2022-08-14T13:42:05.912Z · LW · GW

As I understand it, the proof in the appendix only assumes we're working with Bayes nets (so just factorizations of probability distributions). That is, no assumption is made that the graphs are causal in nature (they're not necessarily assumed to be the causal diagrams of SCMs), although of course the arguments still port over if we make that stronger assumption.

Is that correct?