AI presidents discuss AI alignment agendas 2023-09-09T18:55:37.931Z
Activation additions in a small residual network 2023-05-22T20:28:41.264Z
Collective Identity 2023-05-18T09:00:24.410Z
Activation additions in a simple MNIST network 2023-05-18T02:49:44.734Z
Value drift threat models 2023-05-12T23:03:22.295Z
What constraints does deep learning place on alignment plans? 2023-05-03T20:40:16.007Z
Pessimistic Shard Theory 2023-01-25T00:59:33.863Z
Performing an SVD on a time-series matrix of gradient updates on an MNIST network produces 92.5 singular values 2022-12-21T00:44:55.373Z
Don't design agents which exploit adversarial inputs 2022-11-18T01:48:38.372Z
A framework and open questions for game theoretic shard modeling 2022-10-21T21:40:49.887Z
Taking the parameters which seem to matter and rotating them until they don't 2022-08-26T18:26:47.667Z
How (not) to choose a research project 2022-08-09T00:26:37.045Z
Information theoretic model analysis may not lend much insight, but we may have been doing them wrong! 2022-07-24T00:42:14.076Z
Modelling Deception 2022-07-18T21:21:32.246Z
Another argument that you will let the AI out of the box 2022-04-19T21:54:38.810Z
[cross-post with EA Forum] The EA Forum Podcast is up and running 2021-07-05T21:52:18.787Z
Information on time-complexity prior? 2021-01-08T06:09:03.462Z
D0TheMath's Shortform 2020-10-09T02:47:30.056Z
Why does "deep abstraction" lose it's usefulness in the far past and future? 2020-07-09T07:12:44.523Z


Comment by Garrett Baker (D0TheMath) on Interpretability Externalities Case Study - Hungry Hungry Hippos · 2023-09-21T01:03:39.421Z · LW · GW

So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going the wrong direction in terms of achieving ambitious understanding? Or do you think that it'd be not useful even if achieved?

Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way, which skews heavily towards capabilities. Mostly because capabilities advances are easier and there are more people working on those.

At the same time I suspect that many of those dual use concerns can be mitigated by making your MI research targeted. Not necessarily made such that you can do off-the-shelf interventions based on your findings, but made such that if it ever has any use, that use is going to be for alignment, and you can predict broadly what that use will look like.

This also doesn't mean your MI research can't be ambitious. I don't want to criticize people for being ambitious or too theoretical! I want to criticize people for producing knowledge on something which, while powerful, seems powerful in too many directions to be useful if done publicly.

I agree that if your theory of change for interp goes through, "interp solves a concrete problem like deception or sensor tampering or adversarial robustness", then you better just try to solve those concrete problems instead of improving interp in general. But I think the case for ambitious mech interp isn't terrible, and so it's worth exploring and investing in anyways.

I don't entirely know what you mean by this. How would we solve alignment by not going through a concrete problem? Maybe you think MI will be secondary to that process, and will give us useful information about what problems are necessary to solve? In such a case I still don't see why you need ambitious MI. You can just test the different problem classes directly. Maybe you think the different problem classes are too large to test directly. Even in that case, I still think that a more targeted approach would be better, where you generate as much info about those target classes as possible, while minimizing info that can be used to make your models better. And you selectively report only the results of your investigation which bear on the problem class. Even if the research is exploratory, the result & verification demonstration can still be targeted.

But again, most mech interp people aren't aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn't lead to novel alignment methods.

Maybe I misspoke. I dislike current MI because I expect large capability improvements before and at the same time as the alignment improvements, but I don't dispute future alignment improvements. Just whether they'll be worth it. The reason I brought that up was as some motivation for why I think targeted is better, and why I don't like some people's criticism of worries about MI externalities by appealing to the lack of capabilities advances caused by MI. There've certainly been more attempts at capabilities improvements motivated by MI than there have been attempts at alignment improvements. Regardless of what you think about the future of the field, it's interesting when people make MI discoveries which don't lead to too many capabilities advances.

I personally like activation additions because they give me evidence about how models mechanistically behave in a way which directly tells me about which threat models are more or less likely, and they have the potential to make auditing and iteration a lot easier. These are accomplishments which ambitious MI is nowhere close to, and for which I expect its methods would have to pay a lot in terms of capability advances in order to get to. I mention this as evidence for why I expect targeted approaches are faster and cheaper than ambitious ones. At least if done publicly.

Comment by Garrett Baker (D0TheMath) on Interpretability Externalities Case Study - Hungry Hungry Hippos · 2023-09-20T21:05:34.816Z · LW · GW

Interpretability seems pretty useful for alignment, but it also seems pretty dangerous for capabilities. Overall the field seems net-bad. Using an oversimplified model, my general reason for thinking this is because for any given interpretability advance, it can either be used for the purposes of capabilities or the purposes of alignment. Alignment is both harder, and has fewer people working on it than improving model capabilities. Even if the marginal interpretability advance would be net good for alignment if alignment and capabilities were similar in size and difficulty, we should still expect that it will get used for the purposes of capabilities.

Lots of people like pointing to how better interpretability almost never makes long-term improvements to model capabilities, but it leads to just as few improvements to model alignment! And the number & quality of papers or posts using interpretability methods for capabilities vastly exceeds the number & quality using interpretability methods for alignment.

The only example of interpretability leading to novel alignment methods I know of is shard theory's recent activation additions work (notably work that is not so useful if Nate & Eliezer are right about AGI coherence). In contrast, it seems like all the papers using interpretability to advance capabilities rely on Anthropic's transformer circuits work.

These are two interesting case-studies, and more work should probably be done comparing their relative merits. But in lieu of that, my explanation for the difference in outcomes is this:

Anthropic's work was highly explorational, while Team Shard's was highly targeted. Anthropic tried to understand the transformer architecture and training process in general, while shard theory tried to understand values and only values. If capabilities is easier than alignment, it should not be too surprising if an unfocused approach makes capabilities relatively easier, while a focused-on-values approach makes alignment relatively easier. The unfocused approach will gather a wide range of low-hanging fruit, but little low-hanging fruit is alignment related, so most fruit gathered will be capabilities related.

This is why I'm pessimistic about most interpretability work. It just isn't focused enough! And it's why I'm optimistic about interpretability (and interpretability adjacent) work focused on understanding explicitly the value systems of our ML systems, and how those can be influenced.

So a recommendation for those hoping to work on interpretability and have it be net-positive: Focus on understanding the values of models! Or at least other directly alignment relevant parts of models.

For example, I mostly expect a solution to superposition to be net-negative, in the same way that transformer circuits is net-negative. Though at the same time I also expect a superposition solution to have lots of alignment benefits in the short-term. If AGI is further off, a superposition solution ends up being net-negative; the closer AGI is to now, the more positive a superposition solution becomes.

Another sort of interpretability advance I'm worried about: locating the optimization algorithms operating inside neural networks. I admit these have large alignment boosts, but that seems inconsequential compared to their potential for large boosts to capabilities. Such advances may be necessary for alignment though, so I'm happier in a world where these are not so widely publicized, and given only to the superintelligence alignment wings of AGI labs [EDIT: and a group of researchers outside the labs, all in a way such that nobody shares it with people who may use the knowledge to advance capabilities].

Comment by Garrett Baker (D0TheMath) on D0TheMath's Shortform · 2023-09-19T06:30:01.106Z · LW · GW

Indeed, this is what I mean.

Comment by Garrett Baker (D0TheMath) on D0TheMath's Shortform · 2023-09-14T01:10:26.164Z · LW · GW

Good point! Overall I don't anticipate these layers will give you much control over what the network ends up optimizing for, but I don't fully understand them yet either, so maybe you're right.

Do you have specific reason to think modifying the layers will easily let you control the high-level behavior, or is it just a justified hunch?

Comment by Garrett Baker (D0TheMath) on D0TheMath's Shortform · 2023-09-13T23:20:56.116Z · LW · GW

@TurnTrout @cfoster0 you two were skeptical. What do you make of this? They explicitly build upon the copying heads work Anthropic's interp team has been doing.

Comment by Garrett Baker (D0TheMath) on D0TheMath's Shortform · 2023-09-13T23:16:09.160Z · LW · GW

Evan Hubinger: In my paper, I theorized about the mesa optimizer as a cautionary tale

Capabilities researchers: At long last, we have created the Mesa Layer from classic alignment paper Risks From Learned Optimization (Hubinger, 2019).

Comment by Garrett Baker (D0TheMath) on D0TheMath's Shortform · 2023-09-13T23:13:47.132Z · LW · GW

Look at that! People have used interpretability to make a mesa layer!

Comment by Garrett Baker (D0TheMath) on TurnTrout's shortform feed · 2023-09-11T21:35:46.910Z · LW · GW

I don't think the conclusion follows from the premises. People often learn new concepts after studying stuff, and it seems likely (to me) that when studying human cognition, we'd first be confused because our previous concepts weren't sufficient to understand it, and then slowly stop being confused as we built & understood concepts related to the subject. If an AI's thoughts are like human thoughts, given a lot of time to understand them, what you describe doesn't rule out that the AI's thoughts would be comprehensible.

The mere existence of concepts we don't know about in a subject doesn't mean that we can't learn those concepts. Most subjects have new concepts.

Comment by Garrett Baker (D0TheMath) on Sharing Information About Nonlinear · 2023-09-08T23:01:34.789Z · LW · GW

Yes. This was mats 2.0 in the summer of 2022.

Comment by Garrett Baker (D0TheMath) on Sharing Information About Nonlinear · 2023-09-08T17:24:27.252Z · LW · GW

Last year SERI MATS was pretty late on many people’s stipends, though my understanding is they were just going through some growing pains during that time, and they’re on the ball nowadays.

Comment by Garrett Baker (D0TheMath) on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong · 2023-08-27T04:23:46.693Z · LW · GW

I don't know if Eliezer is irrational about animal consciousness. There's a bunch of reasons you can still be deeply skeptical of animal consciousness even if animals have nociceptors (RL agents have nociceptors! They aren't conscious!), or integrated information theory & global workspace theory probably say animals are 'conscious'. For example, maybe you think consciousness is a verbal phenomenon, having to do with the ability to construct novel recursive grammars. Or maybe you think it's something to do with the human capacity to self-reflect, maybe defined as making new mental or physical tools via methods other than brute force or local search.

I don't think you can show he's irrational here, because he hasn't made any arguments whose rationality or irrationality could be shown. You can maybe say he should be less confident in his claims, or criticize him for not providing his arguments. The former is well known, the latter less useful to me.

I find Eliezer impressive, because he founded the rationality community which IMO is the social movement with by far the best impact-to-community health ratio ever & has been highly influential to other social movements with similar ratios, knew AI would be a big & dangerous deal before virtually anyone, worked on & popularized that idea, and wrote two books (one nonfiction, and the other fanfiction) which changed many people's lives & society for the better. This is impressive no matter how you slice it. His effect on the world will clearly be felt for long to come, if we don't all die (possibly because we don't all die, if alignment goes well and turns out to have been a serious worry, which I am prior to believe). And that effect will be positive almost for sure.

Comment by Garrett Baker (D0TheMath) on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong · 2023-08-27T03:55:10.149Z · LW · GW

I don't think your arguments support your conclusion. I think the zombies section mostly shows that Eliezer is not good at telling what his interlocutors are trying to communicate. The animal consciousness bit shows that he's overconfident, but I don't think you've shown animals are conscious, so it doesn't show he's frequently, confidently, egregiously wrong. And your arguments against FDT seem lacking to me, and I'd tentatively say Eliezer is right about that stuff. Or at least, FDT is closer to the best decision theory than CDT or EDT.

I think Eliezer is often wrong, and often overconfident. It would be interesting to see someone try to compile a good-faith track record of his predictions, perhaps separated by domain of subject.

This seems like one among a line of similar posts I've seen recently, of which you've linked to many in your own which try to compile a list of bad things Eliezer thinks and has said which the poster thinks is really terrible, but which seem benign to me. This is my theory of why they are all low-quality, and yet still posted:

Many have an inflated opinion of Eliezer, and when they realize he's just as epistemically mortal as the rest of us, they feel betrayed, and so overupdate towards thinking he's less epistemically impressive than he actually is, so some of those people compile lists of grievances they have against him, and post them on LessWrong, and claim this shows Eliezer is confidently egregiously wrong most of the time he talks about anything. In fact, it just shows that the OP has different opinions in some domains than Eliezer does, or that Eliezer's track-record is not spotless, or that Eliezer is overconfident. All claims that I, and other cynics & the already disillusioned already knew or could have strongly inferred.

Eliezer is actually pretty impressive both in his accomplishments in epistemic rationality, and especially instrumental rationality. But pretty impressive does not mean godlike or perfect. Eliezer does not provide ground-truth information, and often thinking for yourself about his claims will lead you away from his position, not towards it. Maybe this is something he should have stressed more in his Sequences.

Comment by Garrett Baker (D0TheMath) on D0TheMath's Shortform · 2023-08-21T18:55:43.029Z · LW · GW

I have re-upvoted my past comment. After looking more into things, I'm not so impressed with complex systems theory, though I don't fully support that past comment. Also, past me was right to have confusions about what complex systems theory is, but still judge it, as it seems complex systems theorists don't even know what a complex system is.

Comment by Garrett Baker (D0TheMath) on What is the most effective anti-tyranny charity? · 2023-08-16T02:21:54.769Z · LW · GW

The first question I would ask is what situation currently has the most powerful people most harming the meekest people that the fewest people are paying attention to. It seems possible this is the Uyghur genocide. But I learned about this passively and with not that much fidelity, so possibly something else fits the bill. This also wouldn’t take into account future preventions of not yet extant tyrannies.

Comment by Garrett Baker (D0TheMath) on Read More Books but Pretend to Read Even More · 2023-08-06T17:18:33.743Z · LW · GW

This seems like the kind of thing Arbital attempted to implement, based on my experience reading the Bayes theorem stuff.

Comment by Garrett Baker (D0TheMath) on Read More Books but Pretend to Read Even More · 2023-08-05T00:58:15.628Z · LW · GW

There are two reasons I like books:

  1. Many books I read are pretty good. They aren't dense with new information, but they are dense with justifications for the information they're giving. Much of the time I am not super surprised about the evidence presented after hearing the claim, but it makes later analysis not computationally intractable. This is important when trying to change your mind in light of contradictory arguments & necessary when approaching the claims with a critical eye[1].
  2. As I said after the Hanania post:

I would add textbooks to the list of books that are worth reading. Not always, but often it's the best way to learn a complex new field. Open to suggestions of alternative formats, like reading papers--though if you want an intro & problems, textbooks are still great.

Of course, if you have access to the person making the argument, or aren't mainly trying to learn a technical subject, books are probably inefficient.

  1. This is why I think books have the faults which you mention, compared to blogs. At least the good ones need to make an argument robust to many different criticisms, since the feedback loop between publication, and public commenting is far longer than that for blogs. Publishers notice that good books tend to be long, since there are usually many criticisms you can make, and so regardless of how narrow the author's assertion is, they make their book 300 pages to increase the appeal. ↩︎

Comment by Garrett Baker (D0TheMath) on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-28T20:27:27.858Z · LW · GW

My reading of the post says that two algorithms, with different generalization and structural properties, can lie in the same basin, and it uses evidence from our knowledge of the mechanisms behind grokking on synthetic data to make this point. But the above papers show that in more realistic settings empirically, two models lie in the same basin (up to permutation symmetries) if and only if they have similar generalization and structural properties.
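One common way to make "lie in the same basin" concrete is the loss barrier along the straight line between two parameter vectors: if the loss never rises above the interpolated endpoint losses, the two solutions are linearly connected. Below is a minimal sketch of that check. The toy losses `double_well` and `bowl`, and the helper names, are assumptions chosen for illustration and are not taken from the papers mentioned; the sketch also skips the permutation-alignment step that "up to permutation symmetries" would require for real networks.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line from theta_a (alpha=0) to theta_b (alpha=1)."""
    if len(theta_a) != len(theta_b):
        raise ValueError("parameter vectors must have the same length")
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest excess of the loss over the linearly interpolated endpoint losses.

    A value near zero suggests the two parameter vectors are linearly
    connected; a clearly positive value suggests a barrier between them.
    """
    if steps < 2:
        raise ValueError("need at least two evaluation points")
    alphas = [i / (steps - 1) for i in range(steps)]
    losses = [loss_fn(interpolate(theta_a, theta_b, t)) for t in alphas]
    loss_a, loss_b = losses[0], losses[-1]
    return max(
        loss - ((1 - t) * loss_a + t * loss_b)
        for t, loss in zip(alphas, losses)
    )


# Hypothetical toy losses over a single parameter.
def double_well(theta):
    # Two separate minima at theta = -1 and theta = +1, with a bump between.
    return (theta[0] ** 2 - 1) ** 2


def bowl(theta):
    # A single convex basin.
    return theta[0] ** 2


barrier_between_wells = loss_barrier(double_well, [-1.0], [1.0])
barrier_inside_bowl = loss_barrier(bowl, [-1.0], [1.0])
```

In this toy setup the two minima of `double_well` are separated by a positive barrier, while any two points of `bowl` are not. For actual networks the same function would take flattened weights and an evaluation loss, and the grid of `steps` would need to be fine enough not to miss a narrow barrier.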

Comment by Garrett Baker (D0TheMath) on Visible loss landscape basins don't correspond to distinct algorithms · 2023-07-28T18:11:10.733Z · LW · GW

What do you make of the mechanistic mode connectivity, and linear connectivity papers then?

Comment by Garrett Baker (D0TheMath) on Russian parliamentarian: let's ban personal computers and the Internet · 2023-07-26T06:41:20.101Z · LW · GW

My manifold comment.

Betting a bit on yes, since they just need to ban it, not enforce it. Ban computers, take them away from your political rivals & lock them in jail for having them. Otherwise, don’t enforce the law.

Seems pretty unlikely, but <10% is unlikely

Comment by Garrett Baker (D0TheMath) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-20T13:57:53.816Z · LW · GW

I don’t see how the quote you mentioned is an argument rather than a statement. Does the post cited provide a calculation to support that number given current funding constraints?

Edit: Reading some of the post, it definitely assumes we are in a funding overhang, which, if you take John's (and my own, and others’) observations at face value, we are not.

Comment by Garrett Baker (D0TheMath) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T21:36:09.136Z · LW · GW

Counterintuitively, it may be easier for an organization (e.g. Redwood Research) to get a $1 million grant from Open Phil than it is for an individual to get a $10k grant from LTFF. The reason why is that both grants probably require a similar amount of administrative effort and a well-known organization is probably more likely to be trusted to use the money well than an individual so the decision is easier to make. This example illustrates how decision-making and grant-making processes are probably just as important as the total amount of money available.

A priori, and talking with some grant-makers, I'd think the split would be around people & orgs who are well-known by the grant-makers, and those who are not well-known by the grant-makers. Why do you think the split is around people vs orgs?

Comment by Garrett Baker (D0TheMath) on Simulators · 2023-07-19T04:34:31.319Z · LW · GW

Some academics seem to have (possibly independently? Or maybe it's just in the water nowadays) discovered the Simulators theory, and have some quantitative measures to back it up.

Large Language Models (LLMs) are often misleadingly recognized as having a personality or a set of values. We argue that an LLM can be seen as a superposition of perspectives with different values and personality traits. LLMs exhibit context-dependent values and personality traits that change based on the induced perspective (as opposed to humans, who tend to have more coherent values and personality traits across contexts). We introduce the concept of perspective controllability, which refers to a model's affordance to adopt various perspectives with differing values and personality traits. In our experiments, we use questionnaires from psychology (PVQ, VSM, IPIP) to study how exhibited values and personality traits change based on different perspectives. Through qualitative experiments, we show that LLMs express different values when those are (implicitly or explicitly) implied in the prompt, and that LLMs express different values even when those are not obviously implied (demonstrating their context-dependent nature). We then conduct quantitative experiments to study the controllability of different models (GPT-4, GPT-3.5, OpenAssistant, StableVicuna, StableLM), the effectiveness of various methods for inducing perspectives, and the smoothness of the models' drivability. We conclude by examining the broader implications of our work and outline a variety of associated scientific questions. The project website is available at this https URL .

Comment by Garrett Baker (D0TheMath) on How can I get help becoming a better rationalist? · 2023-07-14T00:04:52.785Z · LW · GW

To my knowledge there’s no such transcript. The podcast is small and this one was made before Whisper, so at the time a transcript would have been super expensive (even if you did use Whisper, you’d need to pay someone to label who’s talking, which likely isn’t cheap). You can find info about their workshops on their workshops page. Probably more informative than hearing me describe a podcast I last heard over a year ago.

Comment by Garrett Baker (D0TheMath) on How can I get help becoming a better rationalist? · 2023-07-13T23:26:42.512Z · LW · GW

They run interesting seeming workshops on a variety of subjects, most salient to me are using decision theory practically via cost-benefit analyses, teaching people how to develop their own clothing style, team community projects like making a street cleaning robot (likely misremembering specifics here, this may have been an aspirational goal of theirs) for participants’ local community, and using LLMs to automate tasks. Much of my knowledge comes from this podcast episode.

Edit: Re-listening to some parts of the podcast episode, it seems like they start talking about the guild at about 00:26:47.

Comment by Garrett Baker (D0TheMath) on How can I get help becoming a better rationalist? · 2023-07-13T17:35:53.154Z · LW · GW

I’ve heard good things about the Guild of the ROSE, a virtual community made by rationalists to help each other level up in practical success in everyday life. You may want to look into joining them.

Comment by Garrett Baker (D0TheMath) on Are there any good, easy-to-understand examples of cases where statistical causal network discovery worked well in practice? · 2023-07-12T23:13:07.316Z · LW · GW

I don’t know too much about this space, but Uber’s Causal ML python library & its uses may be a good place to look. That or Pyro, also made by Uber. Presumably Uber’s uses for these tools are success cases, but I don’t know the details. John has talked about Pyro being cool in previous posts of his, so he could have in mind the tools it provides when he talks about this.

Comment by Garrett Baker (D0TheMath) on Residual stream norms grow exponentially over the forward pass · 2023-07-12T00:28:52.922Z · LW · GW

I read TurnTrout's summary of this plan, so this may be entirely unrelated, but the recent paper Generalizing Backpropagation for Gradient-Based Interpretability (video) seems like a good tool for this brand of interpretability work. May want to reach out to the authors to prove the viability of your paradigm and their methods, or just use their methods directly.

Comment by Garrett Baker (D0TheMath) on Introducing Fatebook: the fastest way to make and track predictions · 2023-07-11T17:18:38.685Z · LW · GW

I really like this, and am glad there's active development on something similar to predictionbook!

Comment by Garrett Baker (D0TheMath) on [Linkpost] Introducing Superalignment · 2023-07-06T01:17:08.063Z · LW · GW

Yeah, I think I remember hearing about ARC doing this a while ago too, and disliked it then, and similarly dislike it now. Suppose they make a misaligned model, and their control systems fail so that the model can spread very far. I expect their unconstrained misaligned model can do far more damage than their constrained possibly aligned ones if able to spread freely on the internet. Probably being an existential risk itself.

Edit: Man, I kinda dislike my past comment. Like listening to or watching a recording of yourself. But I stick with the lab-leak concern.

Comment by Garrett Baker (D0TheMath) on [Linkpost] Introducing Superalignment · 2023-07-06T01:05:39.292Z · LW · GW

Oh you're right. I misread a part of the text to think they were working on making superintelligence and also aligning the superintelligence in 4 years, and commented about alignment in 4 years being very ambitious of them. Oops

Comment by Garrett Baker (D0TheMath) on [Linkpost] Introducing Superalignment · 2023-07-05T22:01:36.712Z · LW · GW

Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).

It's a quote. I recommend reading the article. It's very short.

Comment by Garrett Baker (D0TheMath) on [Linkpost] Introducing Superalignment · 2023-07-05T21:47:21.198Z · LW · GW

I just read this, thinking I was going to see a big long plan of action, and got like 5 new facts:

  1. The post exists
  2. 20% of OpenAI's currently secured compute will be given to the teams they label alignment
  3. They're planning on deliberately training misaligned models!!!! This seems bad if they mean it.
  4. Sutskever is going to be on the team too, but I don't have a good feel if his name being on the team actually means anything
  5. And they're planning on aligning the AI in 4 years or presumably dying trying. Seems like a big & bad change from their previous promise to pause if they can't get alignment down. [EDIT: Misreading on my part, thanks Zach for the correction.]

Comment by Garrett Baker (D0TheMath) on Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)? · 2023-07-03T19:59:43.504Z · LW · GW

He explained a bunch of his position on this in Gödel, Escher, Bach. If I remember correctly, it describes the limits of primitive recursive and general recursive functions in chapter XIII. The basic idea (again, if I remember) is that a proof system can only reason about itself if it's general recursive, and will always be able to reason about itself if it's general recursive. Lots of what we see that makes humanity special compared to computers has to do with people having feelings and emotions and self-concepts, and reflection about past situations & thoughts. All things that really seem to require deep levels of recursion (this is a far shallower statement than what's actually written in the book). It's strange to us then that ChatGPT can mimic those same outputs with the only recursive element of its thought being that it can pass 16 bits to its next run.

Comment by Garrett Baker (D0TheMath) on When do "brains beat brawn" in Chess? An experiment · 2023-06-28T14:18:47.071Z · LW · GW

Although realistically, the real odds would be less about the Elo and more about whether he was drunk while playing me.


Comment by Garrett Baker (D0TheMath) on Residual stream norms grow exponentially over the forward pass · 2023-06-27T04:17:58.166Z · LW · GW

You're right, that's not an exponential. I was wrong. I don't trust my toy model enough to be convinced my overall point is wrong. Unfortunately I don't have the time this week to run something more in-depth.

Comment by Garrett Baker (D0TheMath) on Residual stream norms grow exponentially over the forward pass · 2023-06-27T04:16:39.539Z · LW · GW


Comment by Garrett Baker (D0TheMath) on Residual stream norms grow exponentially over the forward pass · 2023-06-24T19:07:55.709Z · LW · GW

A mundane explanation of what's happening: We know from the NTK literature that to a (very) first approximation, SGD only affects the weights in the final layer of fully connected networks. So we should expect the final layer to have a larger norm than preceding layers. It would not be too surprising if this was distributed exponentially, since running a simple simulation, where

[equation not preserved]

for [index range not preserved] and where [symbol not preserved] is the number of gradient steps, we get the graph

[Figure: graph produced by the simulation, not preserved]

and looking at the weight distribution at a given time-step, this seems distributed exponentially.
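A claim that a sequence "seems distributed exponentially" can be checked numerically: exponential growth is a straight line on a log scale, so one can fit a line to the logs and look at the fit quality. The sketch below is not the simulation described above. The function name `log_linear_fit` and the two sample sequences are hypothetical, made up purely to illustrate the check.

```python
import math


def log_linear_fit(values):
    """Least-squares fit of log(values[i]) = intercept + slope * i.

    Returns (slope, intercept, r_squared). An r_squared close to 1 is
    consistent with exponential growth at rate exp(slope) per step.
    """
    if len(values) < 3 or any(v <= 0 for v in values):
        raise ValueError("need at least 3 strictly positive values")
    n = len(values)
    xs = list(range(n))
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # A constant sequence has ss_tot == 0; treat it as a perfect (flat) fit.
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical per-layer norms, for illustration only.
exponential_norms = [2.0 * 1.5 ** layer for layer in range(8)]
linear_norms = [2.0 + 1.5 * layer for layer in range(8)]

exp_slope, _, exp_r2 = log_linear_fit(exponential_norms)
_, _, lin_r2 = log_linear_fit(linear_norms)
```

A high `r_squared` on its own is weak evidence, since smooth non-exponential growth can also fit a log-line fairly well over a short range. It helps to compare against a plain linear fit on the raw values and to look at whether the residuals curve systematically.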

Comment by Garrett Baker (D0TheMath) on Critiques of prominent AI safety labs: Conjecture · 2023-06-16T17:46:40.596Z · LW · GW

I responded to a very similar comment of yours on the EA Forum.

To respond to the new content, I don't know if changing the board of Conjecture once a certain valuation threshold is crossed would make the organization more robust (now that I think of it, I don't even really know what you mean by strong or robust here. Depending on what you mean, I can see myself disagreeing about whether that even tracks positive qualities about a corporation). You should justify claims like those, and at least include them in the original post. Is it sketchy they don't have this?

Comment by Garrett Baker (D0TheMath) on D0TheMath's Shortform · 2023-06-13T20:27:23.336Z · LW · GW

I have downvoted my comment here, because I disagree with past me. Complex systems theory seems pretty cool from where I stand now, and I think past me has a few confusions about what complex systems theory even is.

Comment by Garrett Baker (D0TheMath) on Critiques of prominent AI safety labs: Conjecture · 2023-06-12T22:36:32.982Z · LW · GW

I'm pretty skeptical they can achieve that right now using CoEm, given the limited progress I expect them to have made on CoEm. In my opinion, security culture is likely more important than being "slightly behind state of the art", and in the startup world it is commonly found that too-fast scaling degrades the founding culture. So a fear would be that fast scaling would lead to worse info-sec.

However, I don't know to what extent this is an issue. I can certainly imagine a world where because of EA and LessWrong, many very mission-aligned hires are lining up in front of their door. I can also imagine a lot of other things, which is why I'm confused.

Comment by Garrett Baker (D0TheMath) on Critiques of prominent AI safety labs: Conjecture · 2023-06-12T20:35:45.003Z · LW · GW

I agree with Conjecture's reply that this reads more like a hitpiece than an even-handed evaluation.

I don't think your recommendations follow from your observations, and such strong claims surely don't follow from the actual evidence you provide. I feel like your criticisms can be summarized as the following:

  1. Conjecture was publishing unfinished research directions for a while.

  2. Conjecture does not publicly share details of their current CoEm research direction, and that research direction seems hard.

  3. Conjecture told the government they were AI safety experts.

  4. Some people (who?) say Conjecture's governance outreach may be net-negative and upsetting to politicians.

  5. Conjecture's CEO Connor used to work on capabilities.

  6. One time during college Connor said that he replicated GPT-2, then found out he had a bug in his code.

  7. Connor has said at some times that open source models were good for alignment, then changed his mind.

  8. Conjecture's infohazard policy can be overturned by Connor or their owners.

  9. They're trying to scale when it is common wisdom for startups to try to stay small.

  10. It is unclear how they will balance profit and altruistic motives.

  11. Sometimes you talk with people (who?) and they say they've had bad interactions with Conjecture staff or leadership when trying to tell them what they're doing wrong.

  12. Conjecture seems like they don't talk with ML people.

I'm actually curious about why they're doing 9, and would like further discussion on 10 and 8. But I don't think any of the other points matter, at least to the depth you've covered them here, and I don't know why you're spending so much time on stuff that doesn't matter or that you can't support. This could have been so much better if you had taken the research time spent on everything that wasn't 8, 9, or 10, used it to do analyses of 8, 9, and 10, and then actually had a conversation with Conjecture about your disagreements with them.

I especially don't think your arguments support your suggestions that

  1. Don't work at Conjecture.

  2. Conjecture should be more cautious when talking to media, because Connor seems unilateralist.

  3. Conjecture should not receive more funding until they get similar levels of organizational competence to OpenAI or Anthropic.

  4. Rethink whether or not you want to support Conjecture's work non-monetarily. For example, maybe think about not inviting them to table at EAG career fairs, inviting Conjecture employees to events or workspaces, and taking money from them if doing field-building.

(1) seems like a pretty strong claim, which is left unsupported. I know of many people who would be excited to work at Conjecture, and I don't think your points support the claim they would be doing net-negative research given they do alignment at Conjecture.

For (2), I don't know why you're saying Connor is unilateralist. Are you saying this because he used to work on capabilities?

(3) is just absurd! OpenAI will perhaps be the most destructive organization to-date. I do not think your above arguments make the case they are less organizationally responsible than OpenAI. Even having an info-hazard document puts them leagues above both OpenAI and Anthropic in my book. And add onto that their primary way of getting funded isn't building extremely large models... In what way do Anthropic or OpenAI have better corporate governance structures than Conjecture?

(4) is just... what? Ok, I've thought about it, and come to the conclusion this makes no sense given your previous arguments. Maybe there's a case to be made here. If they are less organizationally competent than OpenAI, then yeah, you probably don't want to support their work. This seems pretty unlikely to me though! And you definitely don't provide anything close to the level of analysis needed to elevate such hypotheses.

Edit: I will add to my note on (2): In most news articles in which I see Connor or Conjecture mentioned, I feel glad he talked to the relevant reporter, and think he/Conjecture made that article better. It is quite an achievement in my book to have sane conversations with reporters about this type of stuff! So mostly I think they should continue doing what they're doing.

I'm not myself an expert on PR (I'm skeptical that anyone is), so maybe my impressions of the articles are naive and backwards in some way. If you think this is important, it would likely be good to mention somewhere why you think their media outreach is net-negative, ideally pointing to particular things you think they did wrong rather than vague & menacing criticisms of unilateralism.

Comment by Garrett Baker (D0TheMath) on The Dictatorship Problem · 2023-06-12T14:50:31.543Z · LW · GW

Oh yeah, good point about Germany. I’m still pretty skeptical about the claim. Even if the claim ended up being true, I’d be worried it’s just because democracy is a pretty new concept, so we just don’t have as much data as we’d like. But I’d be far less worried that it’d be non-predictive than I am now.

The particular argument why democracies are so stable does not seem robust to the population wrongly believing a dictatorship would be better in their interests than the current situation. Voters can be arbitrarily wrong when they aren’t able to see the effects of their actions and then re-vote.

Comment by Garrett Baker (D0TheMath) on The Dictatorship Problem · 2023-06-11T23:28:39.700Z · LW · GW

Oh also, I'd expect this analysis breaks down once the size of the essentials becomes large enough that people start advocating policies for fashion rather than because the policy will actually have a positive effect on their life if implemented. See The Myth of the Rational Voter. I expect that for a bunch of pro-Trump people, this is exactly why they are pro-Trump (similarly with a bunch of the pro-Biden people).

Comment by Garrett Baker (D0TheMath) on The Dictatorship Problem · 2023-06-11T23:26:04.701Z · LW · GW

The relevant section of The Dictator's Handbook is the following:

Given the complexity of the trade-off between declining private rewards and increased societal rewards, it is useful to look at a simple graphical illustration, which, although based on specific numbers, reinforces the relationships highlighted throughout this book. Imagine a country of 100 people that initially has a government with two people in the winning coalition. With so few essentials and so many interchangeables, taxes will be high, people won’t work very hard, productivity will be low, and therefore the country’s total income will be small. Let’s suppose the country’s income is $100,000 and that half of it goes to the coalition and the other half is left to the people to feed, clothe, shelter themselves and to pay for everything else they can purchase. Ignoring the leader’s take, we assume the two coalition members get to split the $50,000 of government revenue, earning $25,000 a piece from the government plus their own untaxed income. We’ll assume they earn neither more nor less than anyone else based on whatever work they do outside the coalition.

Now we illustrate the consequences of enlarging the coalition. Figure 10.1 shows how the rewards directed towards those in the coalition (that is, private and public benefits) compare to the public rewards received by everyone as more people enter the coalition. Suppose that for each additional essential member of the winning coalition taxes decrease by half of 1 percent (so with three members the tax rate drops from 50 percent to 49.5 percent), and national income improves by 1 percent for each extra coalition member. Suppose also that spending on public goods increases by 2 percent for each added coalition member. As coalition size grows, tax rates drop, productivity increases, and the proportion of government revenue spent on public goods increases at the expense of private rewards. That is exactly the general pattern of change we explained in the previous chapters.
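To make the arithmetic in that passage concrete, here is a rough Python sketch of the toy country. It is not the book's Figure 10.1. The function name `coalition_rewards` and its return keys are made up for illustration, and three things are my assumptions rather than anything the quote states: income grows linearly (not compounding) with each added member, the public-goods share of revenue starts at 0 and rises by 2 percentage points per added member, and the leader's take is ignored as in the quote.

```python
def coalition_rewards(n, population=100, base_income=100_000.0, base_tax=0.50):
    """Toy rewards for a winning coalition of n members (n >= 2).

    Hypothetical helper: the numbers come from the quoted passage, but the
    functional forms below are assumptions, flagged where they occur.
    """
    extra = n - 2  # members added beyond the initial two-person coalition

    # Quote: taxes drop by half a percentage point per extra member.
    tax_rate = base_tax - 0.005 * extra

    # Quote: national income improves 1% per extra member.
    # Assumption: linear growth rather than compounding.
    income = base_income * (1 + 0.01 * extra)

    revenue = tax_rate * income

    # Quote: public-goods spending rises 2% per extra member.
    # Assumption: read as a share of revenue starting at 0, capped at 100%.
    public_share = min(1.0, 0.02 * extra)

    # Private rewards are split among coalition members only;
    # public goods are spread over the whole population.
    private_per_member = revenue * (1 - public_share) / n
    public_per_person = revenue * public_share / population

    return {
        "tax_rate": tax_rate,
        "income": income,
        "revenue": revenue,
        "private_per_member": private_per_member,
        "public_per_person": public_per_person,
    }


if __name__ == "__main__":
    for size in (2, 3, 10, 25, 50):
        r = coalition_rewards(size)
        print(size, round(r["private_per_member"], 2), round(r["public_per_person"], 2))
```

With `n=2` this reproduces the $25,000 apiece from the quote. Under these assumptions the private reward per member falls quickly as the coalition grows, which is the qualitative pattern the passage describes. Different readings of the "2 percent" would change the exact curve.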

I can't find a section of the book talking about how often democracies empirically backslide into dictatorships, but the claim that literally zero of them have seems false. I don't know much about the history of democracy or dictatorships, but Germany was a democracy before it became Nazi Germany.

Comment by Garrett Baker (D0TheMath) on [Mostly solved] I get distracted while reading, but can easily comprehend audio text for 8+ hours per day. What are the best AI text-to-speech readers? Alternatively, do you have other ideas for what I could do? · 2023-06-11T21:41:31.928Z · LW · GW

I like Voice Dream Reader. I don't know how the voice compares to Natural Reader, but it does emphasize words and pronounce things differently based on context-cues. But those context cues are like periods and commas and stuff.

I find I stay approximately as engaged when listening to Voice Dream Reader as when listening to an audiobook or someone reading stuff, but this could be an effect of having listened to several days' worth of content via it.

Comment by Garrett Baker (D0TheMath) on A plea for solutionism on AI safety · 2023-06-10T22:43:16.045Z · LW · GW

I’m pretty confident the primary labs keep track of the number of flops used to train their models. I also don’t know how such a tool would prevent us all from dying.

Comment by Garrett Baker (D0TheMath) on A plea for solutionism on AI safety · 2023-06-10T18:39:05.309Z · LW · GW

Do you disagree with Apollo's or ARC Evals' approaches to the voluntary compliance solutions?

Comment by Garrett Baker (D0TheMath) on EY in the New York Times · 2023-06-10T18:36:03.065Z · LW · GW

This did not seem to me to be sneering or dismissive of the risks. I think it was just showing a bit of Cade's ignorance in the area.

Otherwise it was pretty positive, and I expect people to come away thinking "wow! A bunch of respectable people have been saying catastrophic AI is a real risk!".

Comment by Garrett Baker (D0TheMath) on The Base Rate Times, news through prediction markets · 2023-06-09T03:05:26.980Z · LW · GW

For prediction markets, I'm fine if they're consistently inaccurate because I, knowing their inaccuracies, would gain a bunch of money. But because there are smarter people than me who value money more than me, I expect those people will eat up the relevant money (unless the prediction market has an upper limit on how much a single person can bet, like PredictIt). This is probably more of a problem for things like GJ Open or Metaculus, since their forecasts rely a bunch on crowd aggregations, so either they'd need to change the algorithms which report their publicly accessible forecasts, or in fact be less accurate.

In general, I think if NYT starts reporting on (say) Manifold Markets markets, I expect those markets to get a shit ton more accurate, even if NYT readers are tremendously biased.

Comment by Garrett Baker (D0TheMath) on The Base Rate Times, news through prediction markets · 2023-06-09T03:00:33.918Z · LW · GW

Nitpick: None of the listed sources except for Polymarket, Insight, Kalshi, and maybe Manifold are technically prediction markets (I may have missed an exception), since they don't include a betting aspect, only a forecast accuracy metric they apply to the forecasters.