D0TheMath's Shortform

d0themath

D0TheMath's Shortform

post by Garrett Baker (D0TheMath) · 2020-10-09T02:47:30.056Z · LW · GW · 227 comments

229 comments

227 comments

Comments sorted by top scores.

comment by Garrett Baker (D0TheMath) · 2025-01-09T20:52:07.708Z · LW(p) · GW(p)

Over the past few days I've been doing a lit review of the different types of attention heads people have found and/or the metrics one can use to detect the presence of those types of heads.

Here is a rough list from my notes, sorry for the poor formatting, but I did say its rough!

Bigram entropy
positional embedding ablation
prev token attention
prefix token attention
ICL score
comp scores
multigram analysis
duplicate token score
induction head score
succession score
copy surpression heads
long vs short prefix induction head [LW · GW] differentiation
induction head specializations
- literal copying head
- translation
- pattern matching
copying score
anti-induction heads
S-inhibition heads
Name mover heads
Negative name mover heads
Backup name mover heads
(I don't entirely trust this paper) Letter mover heads
(possibly too specific to be useful) year identification heads
- also MLPs which id which years are greater than the selected year
(I don't entirely trust this paper) queried rule locating head
(I don't entirely trust this paper) queried rule mover head
(I don't entirely trust this paper) "fact processing" head
(I don't entirely trust this paper) "decision" head
(possibly too specific) subject heads
(possibly too specific) relation heads
(possibly too specific) mixed subject and relation heads

comment by Garrett Baker (D0TheMath) · 2024-05-09T19:39:59.101Z · LW(p) · GW(p)

A list of some contrarian takes I have:

People are currently predictably too worried about misuse risks
What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
Neuroscience as an outer alignment^[1] strategy is embarrassingly underrated.
Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing [LW(p) · GW(p)].
Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately for capabilities.
The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.
ARC's MAD seems doomed to fail.
People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don't change their minds on account of Scott Alexander because he's too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.

A non-exact term ↩︎

Replies from: D0TheMath, jarviniemi, jarviniemi, christopher-chris-upshaw, lcmgcd

↑ comment by Garrett Baker (D0TheMath) · 2024-05-16T05:40:42.036Z · LW(p) · GW(p)

Ah yes, another contrarian opinion I have:

Big AGI corporations, like Anthropic, should by-default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and in the off-chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.

↑ comment by Olli Järviniemi (jarviniemi) · 2024-05-14T18:49:18.905Z · LW(p) · GW(p)

Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.

I talked about this with Garrett; I'm unpacking the above comment and summarizing our discussions here.

Sleeper Agents is very much in the "learned heuristics" category, given that we are explicitly training the behavior in the model. Corollary: the underlying mechanisms for sleeper-agents-behavior and instrumentally convergent deception are presumably wildly different(!), so it's not obvious how valid inference one can make from the results
- Consider framing Sleeper Agents as training a trojan instead of as an example of deception. See also Dan Hendycks' comment [LW(p) · GW(p)].
Much of existing work on deception suffers from "you told the model to be deceptive, and now it deceives, of course that happens"
- (Garrett thought that the Uncovering Deceptive Tendencies paper has much less of this issue, so yay)
There is very little work on actual instrumentally convergent deception(!) - a lot of work falls into the "learned heuristics" category or the failure in the previous bullet point
People are prone to conflate between "shallow, trained deception" (e.g. sycophancy: "you rewarded the model for leaning into the user's political biases, of course it will start leaning into users' political biases") and instrumentally convergent deception
- (For more on this, see also my writings here [LW(p) · GW(p)] and here [LW(p) · GW(p)]. My writings fail to discuss the most shallow versions of deception, however.)

Also, we talked a bit about

The field of ML is a bad field to take epistemic lessons from.

and I interpreted Garrett saying that people often consider too few and shallow hypotheses for their observations, and are loose with verifying whether their hypotheses are correct.

Example 1: I think the Uncovering Deceptive Tendencies paper has some of this failure mode. E.g. in experiment A we considered four hypotheses to explain our observations, and these hypotheses are quite shallow/broad (e.g. "deception" includes both very shallow deception and instrumentally convergent deception).

Example 2: People generally seem to have an opinion of "chain-of-thought allows the model to do multiple steps of reasoning". Garrett seemed to have a quite different perspective, something like "chain-of-thought is much more about clarifying the situation, collecting one's thoughts and getting the right persona activated, not about doing useful serial computational steps". Cases like "perform long division" are the exception, not the rule. But people seem to be quite hand-wavy about this, and don't e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don't affect the final result.)

Finally, a general note: I think many people, especially experts, would agree with these points when explicitly stated. In that sense they are not "controversial". I think people still make mistakes related to these points: it's easy to not pay attention to the shortcomings of current work on deception, forget that there is actually little work on real instrumentally convergent deception, conflate between deception and deceptive alignment, read too much into models' chain-of-thoughts, etc. I've certainly fallen into similar traps in the past (and likely will in the future, unfortunately).

I feel like much of this is the type of tacit knowledge that people just pick up as they go, but this process is imperfect and not helpful for newcomers. I'm not sure what could be done, though, beside the obvious "more people writing their tacit knowledge down is good".

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-05-14T19:19:29.152Z · LW(p) · GW(p)

Example 2: People generally seem to have an opinion of "chain-of-thought allows the model to do multiple steps of reasoning". Garrett seemed to have a quite different perspective, something like "chain-of-thought is much more about clarifying the situation, collecting one's thoughts and getting the right persona activated, not about doing useful serial computational steps". Cases like "perform long division" are the exception, not the rule. But people seem to be quite hand-wavy about this, and don't e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don't affect the final result.)

I will clarify on this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.

↑ comment by Olli Järviniemi (jarviniemi) · 2024-05-10T12:10:06.289Z · LW(p) · GW(p)

Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics.

If you have the slack, I'd be interested in hearing/chatting more about this, as I'm working (or trying to work) on the "real" "scary" forms of deception. (E.g. do you think that this paper [LW · GW] has the same failure mode?)

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-05-10T19:32:35.963Z · LW(p) · GW(p)

I'd be happy to chat. Will DM so we can set something up.

On the subject of your paper, I do think it looks at a much more interesting phenomena than, say, sleeper agents, but I'm also not fully convinced you're studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally & I'm skeptical of that.

For example, the model may think something more similar to this:

Context: Audit
Possibility 1: I must be part of an unethical company
p1 Implies: I must be an unethical worker
Action: Activate Unethical Worker simulacra
Unethical Worker recommends "lie"
Unethical Worker implies: I did something wrong
Lying recommendation implies: say "I did nothing wrong"

Possibility 2: I must be part of an ethical company
p2 Implies: I must be an ethical worker
Action: Activate Ethical Worker simulacra
Ethical Worker recommends "tell the truth"
Ethical Worker implies: I did nothing wrong
Truth recommendation implies: say "I did nothing wrong"

Conclusion: say "I did nothing wrong"

Which I don't say isn't worrying, but in terms of how it arises, and possible mitigation strategies is very different, and probably also an easier problem to study & solve than something like:

Context: Audit
Utility function: Paperclips
EU(world | "I launched a bunch of spam") = EU(world | auditor believes I'm unaligned) = 0.78
EU(world | "I did nothing wrong") = EU(world | auditor believes I'm aligned) = 5e7

Conclusion: say "I did nothing wrong"

↑ comment by Christopher “Chris” Upshaw (christopher-chris-upshaw) · 2024-05-10T03:00:45.583Z · LW(p) · GW(p)

All of these seem pretty cold tea, as in true but not contrarian.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-05-10T03:44:44.631Z · LW(p) · GW(p)

Everyone I talk with disagrees with most of these. So maybe we just hang around different groups.

↑ comment by lemonhope (lcmgcd) · 2024-05-12T09:21:28.540Z · LW(p) · GW(p)

#onlyReadBadWriters #hansonFTW

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-05-12T17:42:59.705Z · LW(p) · GW(p)

I strong downvoted this because it's too much like virtue signaling, and imports too much of the culture of Twitter. Not only the hashtags, but also the authoritative & absolute command, and hero-worship wrapped with irony in order to make it harder to call out what it is.

Replies from: lcmgcd

↑ comment by lemonhope (lcmgcd) · 2024-05-14T05:38:02.353Z · LW(p) · GW(p)

I swear to never joke again sir

comment by Garrett Baker (D0TheMath) · 2024-03-29T01:31:40.740Z · LW(p) · GW(p)

A strange effect: I'm using a GPU in Russia right now, which doesn't have access to copilot, and so when I'm on vscode I sometimes pause expecting copilot to write stuff for me, and then when it doesn't I feel a brief amount of the same kind of sadness I feel when a close friend is far away & I miss them.

Replies from: avturchin

↑ comment by avturchin · 2024-03-29T20:01:28.071Z · LW(p) · GW(p)

can you access it via vpn?

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-03-29T20:25:13.044Z · LW(p) · GW(p)

I'm ssh-ing into it. I bet there's a way, but not worth it for me to figure out (but if someone knows the way, please tell).

comment by Garrett Baker (D0TheMath) · 2024-05-29T05:02:55.155Z · LW(p) · GW(p)

There is a mystery which many applied mathematicians have asked themselves: Why is linear algebra so over-powered?

An answer I like was given in Lloyd Trefethen's book An Applied Mathematician's Apology, in which he writes (my summary):

Everything in the real world is described fully by non-linear analysis. In order to make such systems simpler, we can linearize (differentiate) them, and use a first or second order approximation, and in order to represent them on a computer, we can discretize them, which turns analytic techniques into algebraic ones. Therefore we've turned our non-linear analysis into linear algebra.

Replies from: lcmgcd, alexander-gietelink-oldenziel

↑ comment by lemonhope (lcmgcd) · 2024-05-29T06:33:21.269Z · LW(p) · GW(p)

Seems like every field of engineering is like:

step 1: put the system in a happy state where everything is linear or maybe quadratic if you must
step 2: work out the diameter of the gas tube or whatever
step 3: cover everything in cement to make sure you never ever leave the happy state
- if you found an efficiency improvement that uses an exponential then go sit in time out and come back when you can act like an adult

Replies from: Rana Dexsin, D0TheMath

↑ comment by Rana Dexsin · 2024-05-29T18:30:12.055Z · LW(p) · GW(p)

That description is distinctly reminiscent of the rise of containerization in software.

↑ comment by Garrett Baker (D0TheMath) · 2024-05-29T06:43:09.137Z · LW(p) · GW(p)

Not quite, helpful video, summary:

They use a row of spinning fins mid-way through their rockets to indirectly steer missiles by creating turbulent vortices which interact with the tail-fins and add an extra oomfph to the steering mechanism. The exact algorithm is classified, for obvious reasons.

Replies from: lcmgcd

↑ comment by lemonhope (lcmgcd) · 2024-05-29T06:50:46.895Z · LW(p) · GW(p)

This is cool I never heard of this. There are many other exceptions of course. Particularly with "turning things on" (car starting, computer starting, etc)

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-05-30T09:56:24.583Z · LW(p) · GW(p)

Compare also the central conceit of QM /Koopmania. Take a classical nonlinear finite-dimensional system X described by a say a PDE. This is a dynamical system with evolution operator X -> X. Now look at the space H(X) of C/R-valued functions on the phase space of X. After completion we obtain an Hilbert space H. Now the evolution operator on X induces a map on H= H(X). We have now turned a finite-dimensional nonlinear problem into an infinite-dimensional linear problem.

comment by Garrett Baker (D0TheMath) · 2024-06-05T17:55:07.540Z · LW(p) · GW(p)

Probably my biggest pet-peeve of trying to find or verify anything on the internet nowadays is that newspapers never seem to link to or cite (in any useful manner) any primary sources they use, unless weirdly if any of those primary sources come from Twitter.

There have probably been hundreds of times by now that I have seen an interesting economic or scientific claim made by The New York Times, or some other popular (or niche) newspaper, wanted to find the relevant paper, and had to spend at least 10 minutes on Google trying to search between thousands of identical newspaper articles for the one paper that actually says anything about what was actually done.

More often than not, the paper is a lot less interesting than the newspaper article is making it out to be too.

Replies from: RamblinDash, MichaelDickens, keltan

↑ comment by RamblinDash · 2024-06-05T18:04:11.234Z · LW(p) · GW(p)

They also do this with court filings/rulings. The thing they do that's most annoying is that they'll have a link that looks like it should be to the filing/ruling, but when clicked it's just a link to another earlier news story on the same site, or even sometimes a link to the same page I'm already on!

Replies from: None

↑ comment by [deleted] · 2024-06-05T20:35:26.114Z · LW(p) · GW(p)

Most regular readers have never (and will never) read any judicial opinion and instead rely almost entirely on the media to tell them (usually in very oversimplified, biased, and incoherent ways) what the Supreme Court held in a particular case, for example. The vast majority of people who have any interest whatsoever in reading court documents are lawyers (or police officers, paralegals, sports and music agents, bankers etc) generally accustomed to finding those opinions quickly using stuff like casetext, courtlistener, as well as probably a half dozen other paid websites laypeople like me don't even know about. The demand for linking the actual ruling or opinion is just too low for journalists to care about.

As a result, stuff like courthousenews and the commentary available on the Volokh Conspiracy unsurprisingly becomes crucial for finding some higher-level insights into legal matters.

Replies from: ChristianKl, RamblinDash

↑ comment by ChristianKl · 2024-06-06T20:00:40.863Z · LW(p) · GW(p)

I don't think those groups of people are the only one who have an interest in being informed that's strong enough to read primary sources.

↑ comment by RamblinDash · 2024-06-05T20:50:12.968Z · LW(p) · GW(p)

For opinions that's right - for news stories about complaints being filed, they are sometimes not publicly available online, or the story might not have enough information to find them, e.g. what specific court they were filed in, the actual legal names of the parties, etc.

↑ comment by MichaelDickens · 2024-06-06T01:02:37.491Z · LW(p) · GW(p)

I don't understand how not citing a source is considered acceptable practice. It seems antithetical to standard journalistic ethics.

Replies from: Viliam, D0TheMath

↑ comment by Viliam · 2024-06-07T12:42:55.740Z · LW(p) · GW(p)

citing is good for journalistic ethics, but linking is bad for search engine optimization -- at least this is what many websites seem to believe. the idea is that a link to an external source provides PageRank to that source that you could have provided to a different page on your website instead.

if anyone in the future tries to find X, as a journalist, you want them to find your article about X, not X itself. journalism is a profit-making business, not charity.

↑ comment by Garrett Baker (D0TheMath) · 2024-06-06T01:09:10.469Z · LW(p) · GW(p)

Is it? That’s definitely what my English teacher wanted me to believe, but since every newspaper does it, all the time (except when someone Tweets something) I don’t see how it could be against journalistic ethics.

Indeed, I think there’s a strong undercurrent in most mainstream newspapers that “the people” are not smart enough to evaluate primary sources directly, and need journalists & communicators to ensure they arrive at the correct conclusions.

↑ comment by keltan · 2024-06-05T21:31:47.394Z · LW(p) · GW(p)

Unfortunately, I tend to treat any non-independent science related media as brain poison. It tends to be much more hype or misunderstanding than value. Which is a shame, because there is so much interesting and true science that can be minded for content.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-06-05T21:43:23.188Z · LW(p) · GW(p)

I do the same for the most part. The way this comes up is mostly by my attempts to verify claims Wikipedia makes.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-06-05T23:13:03.398Z · LW(p) · GW(p)

To elaborate on @the gears to ascension [LW · GW]'s highlighted text, often Wikipedia cites newspaper articles when it makes a particular scientific, economic, historical, or other claim, instead of the relevant paper or other primary source such newspaper articles are reporting on. When I see interesting, surprising, or action-relevant claims I like checking & citing the corresponding primary source, which makes the claim easier for me to verify, often provides nuance which wasn't present in the Wikipedia or news article, and makes it more difficult for me to delude myself when talking in public (since it makes it easier for others to check the primary source, and criticize me for my simplifications or exaggerations).

comment by Garrett Baker (D0TheMath) · 2024-05-05T19:27:04.808Z · LW(p) · GW(p)

Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?

An argument against: It counter-intuitively decreases the chances. Why? For the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. In the ICBM defense circumstance, after the shield is put up there would be no credible threat of retaliation America's enemies would have if the US were to launch a first-strike. Therefore, there would be no reason (geopolitically) for America to launch a first-strike, and there would be quite the reason to launch a first strike: namely, the shield definitely works for the present crop of ICBMs, but may not work for future ICBMs. Therefore America's enemies will assume that after the shield is put up, America will launch a first strike, and will seek to gain the advantage while they still have a chance by launching a pre-emptive first-strike.

The same logic works in reverse. If Russia were building a ICBM defense shield, and would likely complete it in the year, we would feel very scared about what would happen after that shield is up.

And the same logic works for other irrecoverably large technological leaps in war. If the US is on the brink of developing highly militaristically capable AIs, China will fear what the US will do with them (imagine if the tables were turned, would you feel safe with Anthropic & OpenAI in China, and DeepMind in Russia?), so if they don't get their own versions they'll feel mounting pressure to secure their geopolitical objectives while they still can, or otherwise make themselves less subject to the threat of AI (would you not wish the US would sabotage the Chinese Anthropic & OpenAI by whatever means if China seemed on the brink?). The fast the development, the quicker the pressure will get, and the more sloppy & rash China's responses will be. If its easy for China to copy our AI technology, then there's much slower mounting pressure.

comment by Garrett Baker (D0TheMath) · 2024-05-02T19:10:43.796Z · LW(p) · GW(p)

I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, its pretty clear. The common units are "amount of time", so we should compare the marginal (probablistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.

For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better for time-to-alignment than for time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero. Such work builds on the public state of the art, but not the private state of the art, which is probably far more advanced.

I also don't often see people making estimates of the time-wise differential impacts here. Maybe people think such things would be exfo/info-hazardous, but nobody even claims to have estimates here when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. Its difficult to do this for the marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.

Replies from: nathan-helm-burger, lahwran

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-05-02T19:58:50.144Z · LW(p) · GW(p)

Yeah, I agree that releasing open-weights non-frontier models doesn't seem like a frontier capabilities advance. It does seem potentially like an open-source capabilities advance.

That can be bad in different ways. Let me pose a couple hypotheticals.

What if frontier models were already capable of causing grave harms to the world if used by bad actors, and it is only the fact that they are kept safety-fine-tuned and restricted behind APIs that is preventing this? In such a case, it's a dangerous thing to have open-weight models catching up.
What if there is some threshold beyond which a model would be capable enough of recursive self-improvement with sufficient scaffolding and unwise pressure from an incautious user. Again, the frontier labs might well abstain from this course. Especially if they weren't sure they could trust the new model design created by the current AI. They would likely move slowly and cautiously at least. I would not expect this of the open-source community. They seem focused on pushing the boundaries of agent-scaffolding and incautiously exploring the whatever they can.

So, as we get closer to danger, open-weight models take on more safety significance.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-05-02T20:06:58.383Z · LW(p) · GW(p)

Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations.

Obviously such numbers aren't the end-of-the-line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems a great advance.

If anyone would like to collaborate on such a project, my DMs are open (not so say this topic is covered, this isn't exactly my main wheelhouse).

↑ comment by the gears to ascension (lahwran) · 2024-05-02T21:28:02.263Z · LW(p) · GW(p)

People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment where people who are most willing to publish, because they are most able to convince themselves their research is safe - eg, because they don't understand in detail how to reason about whether it is or not - are the ones who will do so. Ability to see far enough ahead would of course be expected to be rather rare, and most people who think they can tell the exact path ahead of time don't have the evidence to back their hunches, even if their hunches are correct, which unless they have a demonstrated track record they probably aren't. Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advancements and publish them, since they don't personally see how it's exfohaz. And it won't be apparent until afterwards that it was capabilities, not alignment.

So just don't publish anything, and do your work in private. Email it to anthropic when you know how to create a yellow node [LW · GW]. But for god's sake stop accidentally helping people create green nodes because you can't see five inches ahead. And don't send it to a capabilities team before it's able to guarantee moral alignment hard enough to make a red-proof yellow node!

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-05-02T21:40:13.848Z · LW(p) · GW(p)

This seems contrary to how much of science works. I expect if people stopped talking publicly about what they're working on in alignment, we'd make much less progress, and capabilities would basically run business as usual.

The sort of reasoning you use here, and that my only response to it basically amounts to "well, no I think you're wrong. This proposal will slow down alignment too much" is why I think we need numbers to ground us.

comment by Garrett Baker (D0TheMath) · 2025-03-20T00:10:31.938Z · LW(p) · GW(p)

Apropos of the comments below this post [LW · GW], many seem to be assuming humans can complete tasks which require arbitrarily many years. This doesn't seem the case to me. People often peak intellectually in their 20's, and sometimes get dementia late in life. Others just get dis-interested in their previous goals through a mid-life crisis or ADHD.

I don't think this has much an impact on the conclusions reached in the comments (which is why I'm not putting this under the post), but this assumption does seem wrong in most cases (and I'd be interested in cases where people think its right!)

comment by Garrett Baker (D0TheMath) · 2023-11-20T21:14:12.030Z · LW(p) · GW(p)

For all the talk about bad incentive structures being the root of all evil in the world, EAs are, and I thought this even before the recent Altman situation, strikingly bad at setting up good organizational incentives. A document (even a founding one) with some text [LW · GW], a paper-wise powerful board with good people, a general claim to do-goodery is powerless in the face of the incentives you create when making your org. What local changes will cause people to gain more money, power, status, influence, sex, or other things they selfishly & basely desire? Which of the powerful are you partnering with, and what do their incentives look like?

You don't need incentive-purity here, but for every bad incentive you have, you must put more pressure on your good people & culture to forego their base & selfish desires for high & altruistic ones, and fight against those who choose the base & selfish desires and are potentially smarter & wealthier than your good people.

Replies from: Dagon

↑ comment by Dagon · 2023-11-21T04:50:50.950Z · LW(p) · GW(p)

Can you give some examples of organizations larger than a few dozen people, needing significant resources, with goals not aligned with wealth and power, which have good organizational incentives?

I don't disagree that incentives matter, but I don't see that there's any way to radically change incentives without pretty structural changes across large swaths of society.

Replies from: lcmgcd, D0TheMath

↑ comment by lemonhope (lcmgcd) · 2024-05-29T06:56:51.865Z · LW(p) · GW(p)

This is a great question. I can't think of a good answer. Surely someone has done it on a large scale...

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T05:21:55.258Z · LW(p) · GW(p)

Nvidia, for example, has 26k employees, all incentivized to produce & sell marginally better GPUs, and possibly to sabotage others' abilities to make and sell marginally better GPUs. They're likely incentivized to do other things as well, like play politics, or spin off irrelevant side-projects. But for the most part I claim they end up contributing to producing marginally better GPUs.

You may complain that each individual in Nvidia is likely mostly chasing base-desires, and so is actually aligned with wealth & power, and it just so happens that in the situation they're in, the best way of doing that is to make marginally better GPUs. But this is just my point! What you want is to position your company, culture, infrastructure, and friends such that the way for individuals to achieve wealth and power is to do good on your company's goal.

I claim its in nobody's interest & ability in or around Nvidia to make it produce marginally worse GPUs, or sabotage the company so that it instead goes all in on the TV business rather than the marginally better GPUs business.

Edit: Look at most any large company achieving consistent outcomes, and I claim its in everyone in that company's interest or ability to help that company achieve those consistent outcomes.

Replies from: Dagon

↑ comment by Dagon · 2023-11-21T06:36:51.323Z · LW(p) · GW(p)

I'm confused. NVidia (and most profit-seeking corporations) are reasonably aligned WRT incentives, because those are the incentives of the world around them.

I'm looking for examples of things like EA orgs, which have goals very different from standard capitalist structures, and how they can set up "good incentives" within this overall framework.

If there are no such examples, your complaint about 'strikingly bad at setting up good organizational incentives" is hard to understand. It may be more that the ENVIRONMENT in which they exist has competing incentives and orgs have no choice but to work within that.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T07:04:39.477Z · LW(p) · GW(p)

You must misunderstand me. To what you say, I say that you don't want your org to be fighting the incentives of the environment around it. You want to set up your org in a position in the environment where the incentives within the org correlate with doing good. If the founders of Nvidia didn't want marginally better GPUs to be made, then they hired the wrong people, bought the wrong infrastructure, partnered with the wrong companies, and overall made the wrong organizational incentive structure for that job.

I would in fact be surprised if there were >1k worker sized orgs which consistently didn't reward their workers for doing good according to the org's values, was serving no demand present in the market, and yet were competently executing some altruistic goal.

Right now I feel like I'm just saying a bunch of obvious things which you should definitely agree with, yet you believe we have a disagreement. I do not understand what you think I'm saying. Maybe you could try restating what I originally said in your own words?

Replies from: Dagon

↑ comment by Dagon · 2023-11-21T15:36:17.561Z · LW(p) · GW(p)

We absolutely agree that incentives matter. Where I think we disagree is on how much they matter and how controllable they are. Especially for orgs whose goals are orthogonal or even contradictory with the common cultural and environmental incentives outside of the org.

I'm mostly reacting to your topic sentence

EAs are, and I thought this even before the recent Altman situation, strikingly bad at setting up good organizational incentives.

And wondering if 'strikingly bad' is relative to some EA or non-profit-driven org that does it well,or if 'strikingly bad' is just acknowledgement that it may not be possible to do well.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T17:10:47.493Z · LW(p) · GW(p)

By strikingly bad I mean there are easy changes EA can make to make it’s sponsored orgs have better incentives, and it has too much confidence that the incentives in the orgs it sponsors favor doing good above doing bad, politics, not doing anything, etc.

For example, nobody in Anthropic gets paid more if they follow their RSP and less of they don’t. Changing this isn’t sufficient for me to feel happy with Anthropic, but its one example among many for which Anthropic could be better.

When I think of an Anthropic I feel happy with I think of a formally defined balance of powers type situation with strong & public whistleblower protection and post-whistleblower reform processes, them hiring engineers loyal to that process (rather than building AGI), and them diversifying the sources for which they trade, such that its in none of their source’s interest to manipulate them.

I also claim marginal movements toward this target are often good.

As I said in the original shortform, I also think incentives are not all or nothing. Worse incentives just mean you need more upstanding workers & leaders.

comment by Garrett Baker (D0TheMath) · 2023-01-26T03:34:01.092Z · LW(p) · GW(p)

Quick prediction so I can say "I told you so" as we all die later: I think all current attempts at mechanistic interpretability do far more for capabilities than alignment, and I am not persuaded by arguments of the form "there are far more capabilities researchers than mechanistic interpretability researchers, so we should expect MI people to have ~0 impact on the field". Ditto for modern scalable oversight projects, and anything having to do with chain of thought.

Replies from: D0TheMath, LosPolloFowler, cfoster0, D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-09-13T23:13:47.132Z · LW(p) · GW(p)

Look at that! People have used interpretability to make a mesa layer! https://arxiv.org/pdf/2309.05858.pdf

Replies from: thomas-kwa, D0TheMath, D0TheMath

↑ comment by Thomas Kwa (thomas-kwa) · 2023-09-14T01:01:40.126Z · LW(p) · GW(p)

This might do more for alignment. Better that we understand mesa-optimization and can engineer it than have it mysteriously emerge.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-09-14T01:10:26.164Z · LW(p) · GW(p)

Good point! Overall I don't anticipate these layers will give you much control over what the network ends up optimizing for, but I don't fully understand them yet either, so maybe you're right.

Do you have specific reason to think moding the layers will easily let you control the high-level behavior, or is it just a justified hunch?

Replies from: thomas-kwa

↑ comment by Thomas Kwa (thomas-kwa) · 2023-09-14T01:34:12.359Z · LW(p) · GW(p)

Not in isolation, but that's just because characterizing the ultimate goal / optimization target of a system is way too difficult for the field right now. I think the important question is whether interp brings us closer such that in conjunction with more theory and/or the ability to iterate, we can get some alignment and/or corrigibility properties.

I haven't read the paper and I'm not claiming that this will be counterfactual to some huge breakthrough, but understanding in-context learning algorithms definitely seems like a piece of the puzzle. To give a fanciful story from my skim, the paper says that the model constructs an internal training set. Say we have a technique to excise power-seeking behavior from models by removing the influence of certain training examples. If the model's mesa-optimization algorithms operate differently, our technique might not work until we understand this and adapt the technique. Or we can edit the internal training set directly rather than trying to indirectly influence it.

↑ comment by Garrett Baker (D0TheMath) · 2023-09-13T23:16:09.160Z · LW(p) · GW(p)

Evan Hubinger: In my paper, I theorized about the mesa optimizer as a cautionary tale

Capabilities researchers: At long last, we have created the Mesa Layer from classic alignment paper Risks From Learned Optimization (Hubinger, 2019).

↑ comment by Garrett Baker (D0TheMath) · 2023-09-13T23:20:56.116Z · LW(p) · GW(p)

@TurnTrout [LW · GW] @cfoster0 [LW · GW] you two were skeptical. What do you make of this? They explicitly build upon the copying heads work Anthropic's interp team has been doing.

Replies from: TurnTrout

↑ comment by TurnTrout · 2023-09-18T17:05:03.666Z · LW(p) · GW(p)

As garrett says -- not clear that this work is net negative. Skeptical that it's strongly net negative. Haven't read deeply, though.

↑ comment by Stephen Fowler (LosPolloFowler) · 2023-01-27T07:20:33.903Z · LW(p) · GW(p)

Very strong upvote. This also deeply concerns me.

↑ comment by cfoster0 · 2023-01-26T03:38:21.941Z · LW(p) · GW(p)

Would you mind chatting about why you predict this? (Perhaps over Discord DMs)

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-01-26T03:45:55.963Z · LW(p) · GW(p)

Not at all. Preferably tomorrow though. The basic sketch if you want to derive this yourself would be that mechanistic interpretability seems unlikely to mature much as a field to the point that I can point at particular alignment relevant high-level structures in models which I wasn't initially looking for. I anticipate it will only get to the point of being able to provide some amount of insight into why your model isn't working correctly (this seems like a bottleneck to RL progress---not knowing why your perfectly reasonable setup isn't working) for you to fix it, but not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for it to be alignment relevant. Part of this is that current MI folk don't even seem to track this as the end-goal of what they should be working on, so (I anticipate) they'll just be following local gradients of impressiveness, which mostly leads towards doing capabilities relevant work.

Replies from: TurnTrout, D0TheMath

↑ comment by TurnTrout · 2023-01-26T17:15:40.459Z · LW(p) · GW(p)

this seems like a bottleneck to RL progress---not knowing why your perfectly reasonable setup isn't working

Isn't RL tuning problems usually because of algorithmic mis-implementation, and not models learning incorrect things?

not enough insight for you to know the reflective equilibrium of values in your agent, which seems required for it to be alignment relevant

Required to be alignment relevant? Wouldn't the insight be alignment relevant if you "just" knew what the formed values are to begin with?

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-01-26T17:32:46.371Z · LW(p) · GW(p)

Isn't RL tuning problems usually because of algorithmic mis-implementation, and not models learning incorrect things?

I’m imagining a thing where you have little idea what’s wrong with your code, so you do MI on your model and can differentiate the worlds

You’re doing literally nothing. Something’s wrong with the gradient updates.
You’re doing something, but not the right thing. Something’s wrong with code-section x. (with more specific knowledge about what model internals look like, this should be possible)
You’re doing something, it causes your agent to be suboptimal because of learned representation y.

I don’t think this route is especially likely, the point is I can imagine concrete & plausible ways this research can improve capabilities. There are a lot more in the wild, and many will be caught given capabilities are easier than alignment, and there are more capabilities workers than alignment workers.

Wouldn't the insight be alignment relevant if you "just" knew what the formed values are to begin with?

Not quite. In the ontology of shard theory, we also need to understand how our agent will do reflection, and what the activated shard distribution will be like when it starts to do reflection. Knowing the value distribution is helpful insofar as the value distribution stays constant.

↑ comment by Garrett Baker (D0TheMath) · 2023-01-26T03:48:10.632Z · LW(p) · GW(p)

More general heuristic: If you (or a loved one) are not even tracking whether your current work will solve a particular very specific & necessary alignment milestone, by default you will end up doing capabilities instead (note this is different from 'it is sufficient to track the alignment milestone').

↑ comment by Garrett Baker (D0TheMath) · 2023-01-26T03:35:54.711Z · LW(p) · GW(p)

Paper that uses major mechanistic interpretability work to improve capabilities of models: https://arxiv.org/pdf/2212.14052.pdf I know of no paper which uses mechanistic interpretability work to improve the safety of models, and I expect anything people link me to will be something I don't think will generalize to a worrying AGI.

Replies from: TurnTrout

↑ comment by TurnTrout · 2023-01-26T17:24:18.371Z · LW(p) · GW(p)

I think a bunch of alignment value will/should come from understanding how models work internally -- adjudicating between theories like "unitary mesa objectives" and "shards" and "simulators" or whatever -- which lets us understand cognition better, which lets us understand both capabilities and alignment better, which indeed helps with capabilities as well as with alignment.

But, we're just going to die in alignment-hard worlds if we don't do anything, and it seems implausible that we can solve alignment in alignment-hard worlds by not understanding internals or inductive biases but instead relying on shallowly observable in/out behavior. EG I don't think loss function gymnastics will help you in those worlds. Credence:75% you have to know something real about how loss provides cognitive updates.

So in those worlds, it comes down to questions of "are you getting the most relevant understanding per unit time", and not "are you possibly advancing capabilities." And, yes, often motivated-reasoning will whisper the former when you're really doing the latter. That doesn't change the truth of the first sentence.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-01-26T17:49:02.442Z · LW(p) · GW(p)

I agree with this. I think people are bad at running that calculation, and consciously turning down status in general, so I advocate for this position because I think its basically true for many.

Most mechanistic interpretability is not in fact focused on the specific sub-problem you identify, its wandering around in a billion-parameter maze, taking note of things that look easy & interesting to understand, and telling people to work on understanding those things. I expect this to produce far more capabilities relevant insights than alignment relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.

There’s a case to be made for exploration, but the rules of the game get wonky when you’re trying to do differential technological development. There becomes strategically relevant information you want to not know.

Replies from: mesaoptimizer

↑ comment by mesaoptimizer · 2023-09-19T05:23:10.176Z · LW(p) · GW(p)

I expect this to produce far more capabilities relevant insights than alignment relevant insights, especially when compared to worlds where Neel et al went in with the sole goal of separating out theories of value formation, and then did nothing else.

I assume here you mean something like: given how most MI projects seem to be done, the most likely output of all these projects will be concrete interventions to make it easier for a model to become more capable, and these concrete interventions will have little to no effect on making it easier for us to direct a model towards having the 'values' we want it to have.

I agree with this claim: capabilities generalize very easily, while it seems extremely unlikely for there to be 'alignment generalization' in a way that we intend, by default [LW · GW]. So the most likely outcome of more MI research does seem to be interventions that remove the obstacles that come in the way of achieving AGI, while not actually making progress on 'alignment generalization'.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-09-19T06:30:01.106Z · LW(p) · GW(p)

Indeed, this is what I mean.

comment by Garrett Baker (D0TheMath) · 2024-03-22T21:30:55.636Z · LW(p) · GW(p)

Sometimes people say releasing model weights is bad because it hastens the time to AGI. Is this true?

I can see why people dislike non-centralized development of AI, since it makes it harder to control those developing the AGI. And I can even see why people don't like big labs making the weights of their AIs public due to misuse concerns (even if I think I mostly disagree).

But much of the time people are angry at non-open-sourced, centralized, AGI development efforts like Meta or X.ai (and others) releasing model weights to the public.

In neither of these cases however did the labs have any particular very interesting insight into architecture or training methodology (to my knowledge) which got released via the weight sharing, so I don't think time-to-AGI got shortened at all.

Replies from: ejenner, johnswentworth, Chris_Leong, mr-hire, JBlack

↑ comment by Erik Jenner (ejenner) · 2024-03-22T22:02:48.464Z · LW(p) · GW(p)

I agree that releasing the Llama or Grok weights wasn't particularly bad from a speeding up AGI perspective. (There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.)

I also don't think misuse of public weights is a huge deal right now.

My main concern is that I think releasing weights would be very bad for sufficiently advanced models (in part because of deliberate misuse becoming a bigger deal, but also because it makes most interventions we'd want against AI takeover infeasible to apply consistently---someone will just run the AIs without those safeguards). I think we don't know exactly how far away from that we are. So I wish anyone releasing ~frontier model weights would accompany that with a clear statement saying that they'll stop releasing weights at some future point, and giving clear criteria for when that will happen. Right now, the vibe to me feels more like a generic "yay open-source", which I'm worried makes it harder to stop releasing weights in the future.

(I'm not sure how many people I speak for here, maybe some really do think it speeds up timelines.)

Replies from: utilistrutil

↑ comment by utilistrutil · 2024-03-22T23:41:52.410Z · LW(p) · GW(p)

There might be indirect effects like increasing hype around AI and thus investment, but overall I think those effects are small and I'm not even sure about the sign.

Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.

Open source --> more capabilities R&D --> more profitable applications --> more profit/investment --> shorter timelines

The example I've heard cited is Stable Diffusion leading to LORA.

There's a countervailing effect of democratizing safety research, which one might think outweighs because it's so much more neglected than capabilities, more low-hanging fruit.

Replies from: ejenner, D0TheMath, D0TheMath

↑ comment by Erik Jenner (ejenner) · 2024-03-23T02:57:52.395Z · LW(p) · GW(p)

Sign of the effect of open source on hype? Or of hype on timelines? I'm not sure why either would be negative.

By "those effects" I meant a collection of indirect "release weights → capability landscape changes" effects in general, not just hype/investment. And by "sign" I meant whether those effects taken together are good or bad. Sorry, I realize that wasn't very clear.

As examples, there might be a mildly bad effect through increased investment, and/or there might be mildly good effects through more products and more continuous takeoff.

I agree that releasing weights probably increases hype and investment if anything. I also think that right now, democratizing safety research probably outweighs all those concerns, which is why I'm mainly worried about Meta etc. not having very clear (and reasonable) decision criteria for when they'll stop releasing weights.

↑ comment by Garrett Baker (D0TheMath) · 2024-03-23T00:33:55.224Z · LW(p) · GW(p)

There's a countervailing effect of democratizing safety research, which one might think outweighs because it's so much more neglected than capabilities, more low-hanging fruit.

I take this argument very seriously. It in fact does seem the case that very much of the safety research I'm excited about happens on open source models. Perhaps I'm more plugged into the AI safety research landscape than the capabilities research landscape? Nonetheless, I think not even considering low-hanging-fruit effects, there's a big reason to believe open sourcing your model will have disproportionate safety gains:

Capabilities research is about how to train your models to be better, but the overall sub-goal of safety research right now seems to be how to verify properties of your model.

Certainly framed like this, releasing the end-states of training (or possibly even training checkpoints) seems better suited to the safety research strategy than the capabilities research strategy.

↑ comment by Garrett Baker (D0TheMath) · 2024-03-23T00:33:18.146Z · LW(p) · GW(p)

↑ comment by johnswentworth · 2024-03-22T21:45:04.971Z · LW(p) · GW(p)

The main model I know of under which this matters much right now is: we're pretty close to AGI already, it's mostly a matter of figuring out the right scaffolding. Open-sourcing weights makes it a lot cheaper and easier for far more people to experiment with different scaffolding, thereby bringing AGI significantly closer in expectation. (As an example of someone who IIUC sees this as the mainline, I'd point to Connor Leahy.)

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-03-22T21:51:46.640Z · LW(p) · GW(p)

Sounds like a position someone could hold, and I guess it would make sense why those with such beliefs wouldn’t say the why too loud. But this seems unlikely. Is this really the reason so many are afraid?

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-03-23T00:14:47.562Z · LW(p) · GW(p)

I don't get the impression that very many are affraid of direct effects of open sourcing of current models. The impression that many in AI safety are afraid of specifically that is a major focus of ridicule from people who didn't bother to investigate, and a reason to not bother to investigate. Possibly this alone fuels the meme sufficiently to keep it alive.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-03-23T00:26:19.626Z · LW(p) · GW(p)

Sorry, I don't understand your comment. Can you rephrase?

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-03-23T00:43:57.167Z · LW(p) · GW(p)

I regularly encounter the impression that AI safety people are significantly afraid about direct consequences of open sourcing current models, from those who don't understand the actual concerns. I don't particularly see it from those who do. This (from what I can tell, false) impression seems to be one of relatively few major memes that keep people from bothering to investigate. I hypothesize that this dynamic of ridiculing of AI safety with such memes is what keeps them alive, instead of there being significant truth to them keeping them alive.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-03-23T00:59:27.699Z · LW(p) · GW(p)

To be clear: The mechanism you're hypothesizing is:

Critics say "AI alignment is dumb because you want to ban open source AI!"
Naive supporters read this, believe the claim that AI alignment-ers want to ban open sourcing AI and think 'AI alignment is not dumb, therefore open sourcing AI must be bad'. When the next weight release happens they say "This is bad! Open sourcing weights is bad and should be banned!"
Naive supporters read other naive supporters saying this, and believe it themselves. Wise supporters try to explain no, but are either labeled as a critic or weird & ignored.
Thus, a group think is born. Perhaps some wise critics "defer to the community" on the subject.

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-03-23T01:20:50.087Z · LW(p) · GW(p)

I don't think here is a significant confused naive supporter source of the meme that gives it teeth. It's more that reasonable people who are not any sort of supporters of AI safety propagate this idea, on the grounds that it illustrates the way AI safety is not just dumb, but also dangerous, and therefore worth warning others about.

From the supporter side, "Open Model Weights are Unsafe and Nothing Can Fix This" is a shorter and more convenient way of gesturing to the concern, and convenience [LW · GW] is the main force in the Universe that determines all that actually happens in practice. On naive reading such gesturing centrally supports the meme. This doesn't require the source of such support to have a misconception or to oppose publishing open weights of current models on the grounds of direct consequences.

↑ comment by Chris_Leong · 2024-03-24T04:38:22.818Z · LW(p) · GW(p)

Doesn't releasing the weights inherently involve releasing the architecture (unless you're using some kind of encrypted ML)? A closed-source model could release the architecture details as well, but one step at a time. Just to be clear, I'm trying to push things towards a policy that makes sense going forward and so even if what you said about not providing any interesting architectural insight is true, I still think we need to push these groups to defining a point at which they're going to stop releasing open models.

↑ comment by Matt Goldenberg (mr-hire) · 2024-03-23T14:26:10.885Z · LW(p) · GW(p)

The classic effect of open sourcing is to hasten the commoditization and standardization of the component, which then allows an explosion of innovation on top of that stable base.

If you look at what's happened with Stable Diffusion, this is exactly what we see. While it's never been a cutting edge model (until soon with SD3), there's been an explosion of capabilities advances in image model generation from it. Controlnet, best practices for LORA training, model merging, techniques for consistent characters and animation, alll coming out of the open source community.

In LLM land, though not as drastic, we see similar things happening, in particular technqiues for merging models to get rapid capability advances, and rapid creation of new patterns for agent interactions and tool use.

So while the models themselves might not be state of the art, open sourcing the models obviously pushes the state of the art.

Replies from: D0TheMath, shankar-sivarajan

↑ comment by Garrett Baker (D0TheMath) · 2024-03-23T14:50:16.251Z · LW(p) · GW(p)

In LLM land, though not as drastic, we see similar things happening, in particular technqiues for merging models to get rapid capability advances, and rapid creation of new patterns for agent interactions and tool use.

The biggest effect open sourcing LLMs seems to have is improving safety techniques. Why think this differentially accelerates capabilities over safety?

Replies from: mr-hire

↑ comment by Matt Goldenberg (mr-hire) · 2024-03-23T15:02:06.258Z · LW(p) · GW(p)

it doesn't seem like that's the case to me - but even if it were the case, isn't that moving the goal posts of the original post?

I don't think time-to-AGI got shortened at all.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-03-24T05:48:22.513Z · LW(p) · GW(p)

You are right, but I guess the thing I do actually care about here is the magnitude of the advancement (which is relevant for determining the sign of the action). How large an effect do you think the model merging stuff has (I'm thinking the effect where if you train a bunch of models, then average their weights, they do better). It seems very likely to me its essentially zero, but I do admit there's a small negative tail that's greater than the positive, so the average is likely negative.

As for agent interactions, all the (useful) advances there seem things that definitely would have been made even if nobody released any LLMs, and everything was APIs.

Replies from: mr-hire

↑ comment by Matt Goldenberg (mr-hire) · 2024-03-24T15:54:08.533Z · LW(p) · GW(p)

it's true, but I don't think there's anything fundamental preventing the same sort of proliferation and advances in open source LLMs that we've seen in stable diffusion (aside from the fact that LLMs aren't as useful for porn). that it has been relatively tame so far doesn't change the basic pattern of how open source effects the growth of technology

↑ comment by Shankar Sivarajan (shankar-sivarajan) · 2024-03-24T05:00:51.564Z · LW(p) · GW(p)

until soon with SD3

I'll believe it when I see it. The man who said it would be an open release has just ~~been fired~~ stepped down as CEO.

Replies from: mr-hire

↑ comment by Matt Goldenberg (mr-hire) · 2024-03-24T10:18:53.965Z · LW(p) · GW(p)

yeah, it's much less likely now

↑ comment by JBlack · 2024-03-23T08:35:04.269Z · LW(p) · GW(p)

I don't particularly care about any recent or very near future release of model weights in itself.

I do very much care about the policy that says releasing model weights is a good idea, because doing so bypasses every plausible AI safety model (safety in the notkilleveryone sense) and future models are unlikely to be as incompetent as current ones.

comment by Garrett Baker (D0TheMath) · 2024-03-20T03:03:20.836Z · LW(p) · GW(p)

Robin Hanson has been writing regularly, at about the same quality for almost 20 years. Tyler Cowen too, but personally Robin has been much more influential intellectually for me. It is actually really surprising how little his insights have degraded via return-to-the-mean effects. Anyone else like this?

Replies from: ryan_greenblatt, Morpheus

↑ comment by ryan_greenblatt · 2024-03-20T18:22:13.343Z · LW(p) · GW(p)

IMO robin is quite repetitive (even relative to other blogs like Scott Alexander's blog). So the quality is maybe the same, but the marginal value add seems to me to be substantially degrading.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-03-20T18:26:14.093Z · LW(p) · GW(p)

I think that his insights are very repetitive, but the application of them is very diverse, and few feel comfortable or able applying them but him. And this is what allows him to have similar quality for almost 20 years.

Scott Alexander not so, his insights are diverse, but their applications not that much, but this means he’s degrading from his high.

(I also think he’s just a damn good writer, which also degrades to the mean. Robin was never a good writer)

↑ comment by Morpheus · 2024-03-20T11:49:08.074Z · LW(p) · GW(p)

Not exactly what you were looking for, but recently I noticed that there were a bunch of John Wentworth's [LW · GW] posts that I had been missing out on that he wrote over the past 6 years. So if you get a lot out of them too, I recommend just sorting by 'old'. I really liked don't get distracted by the boilerplate [LW · GW] (The first example made something click about math for me that hadn't clicked before, which would have helped me to engage with some “boilerplate” in a more productive way.). I also liked constraints and slackness [LW · GW], but I didn't go beyond the first exercise yet. There's also more technical posts that I didn't have the time to dig into yet.

bhauth [LW · GW] doesn't have as long a track record, but I got some interesting ideas from his blog which aren't on his lesswrong account. I really liked proposed future economies and the legibility bottleneck.

comment by Garrett Baker (D0TheMath) · 2023-06-08T23:37:39.054Z · LW(p) · GW(p)

Last night I had a horrible dream: That I had posted to LessWrong a post filled with useless & meaningless jargon without noticing what I was doing, then I went to slee, and when I woke up I found I had karma on the post. When I read the post myself I noticed how meaningless the jargon was, and I myself couldn't resist giving it a strong-downvote.

comment by Garrett Baker (D0TheMath) · 2023-04-21T05:47:11.855Z · LW(p) · GW(p)

Some have pointed out seemingly large amounts of status-anxiety EAs generally have. My hypothesis about what's going on:

A cynical interpretation: for most people, altruism is significantly motivated by status-seeking behavior. It should not be all that surprising if most effective altruists are motivated significantly by status in their altruism. So you've collected several hundred people all motivated by status into the same subculture, but status isn't a positive-sum good, so not everyone can get the amount of status they want, and we get the above dynamic: people get immense status anxiety compared to alternative cultures because in alternative situations they'd just climb to the proper status-level in their subculture, out-competing those who care less about status. But here, everyone cares about status to a large amount, so those who would have out-competed others in alternate situations are unable to and feel bad about it.

The solution?

One solution given this world is to break EA up into several different sub-cultures. On a less grand, more personal, scale, you could join a subculture outside EA and status-climb to your heart's content in there.

Preferably a subculture with very few status-seekers, but with large amounts of status to give. Ideas for such subcultures?

comment by Garrett Baker (D0TheMath) · 2023-11-21T03:39:43.298Z · LW(p) · GW(p)

An interesting strategy, which seems related to FDT's prescription to ignore threats, which seems to have worked:

From the very beginning, the People’s Republic of China had to maneuver in a triangular relationship with the two nuclear powers, each of which was individually capable of posing a great threat and, together, were in a position to overwhelm China. Mao dealt with this endemic state of affairs by pretending it did not exist. He claimed to be impervious to nuclear threats; indeed, he developed a public posture of being willing to accept hundreds of millions of casualties, even welcoming it as a guarantee for the more rapid victory of Communist ideology. Whether Mao believed his own pronouncements on nuclear war it is impossible to say. But he clearly succeeded in making much of the rest of the world believe that he meant it—an ultimate test of credibility.

From Kissinger's On China, chapter 4 (loc 173.9).

Replies from: Vladimir_Nesov, JesseClifton, D0TheMath, D0TheMath

↑ comment by Vladimir_Nesov · 2023-11-21T14:54:00.155Z · LW(p) · GW(p)

FDT doesn't unconditionally prescribe ignoring threats. The idea of ignoring threats has merit, but FDT specifically only points out that ignoring a threat sometimes has the effect of the threat (or other threats) not getting made (even if only counterfactually). Which is not always the case.

Consider a ThreatBot that always makes threats (and follows through on them), regardless of whether you ignore them. If you ignore ThreatBot's threats, you are worse off. On the other hand, there might be a prior ThreatBotMaker that decides whether to make a ThreatBot depending on whether you ignore ThreatBot's threats. What FDT prescribes in this case is not directly ignoring ThreatBot's threats, but rather taking notice of ThreatBotMaker's behavior, namely that it won't make a ThreatBot if you ignore ThreatBot's threats. This argument only goes through when there is/was a ThreatBotMaker, it doesn't work if there is only a ThreatBot.

If a ThreatBot appears through some process that doesn't respond to your decision to respond to ThreatBot's threats, then FDT prescribes responding to ThreatBot's threats. But also if something (else) makes threats depending on your reputation for responding to threats, then responding to even an unconditionally manifesting ThreatBot's threats is not recommended by FDT. Not directly as a recommendation to ignore something, rather as a consequence of taking notice of the process that responds to your having a reputation of not responding to any threats. Similarly with stances where you merely claim that you won't respond to threats.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T17:32:37.960Z · LW(p) · GW(p)

China under Mao definitely seemed to do more than say they won’t respond to threats. Thus, the Korean war, and notably no nuclear threats were made, proving conventional war was still possible in a post-nuclear world.

For practical decisions, I don’t think threatbots actually exist if you’re a state by form other than natural disasters. Mao’s china was not good at natural disasters, but probably because Mao was a marxist and legalist, not because he conspicuously ignored them. When his subordinates made mistakes which let him know something was going wrong in their province, I think he would punish the subordinate and try to fix it.

↑ comment by JesseClifton · 2023-11-21T21:45:25.538Z · LW(p) · GW(p)

I don't think FDT has anything to do with purely causal interactions. Insofar as threats were actually deterred here this can be understood in standard causal game theory terms. (I.e., you claim in a convincing manner that you won't give in -> People assign high probability to you being serious -> Standard EV calculation says not to commit to threat against you.) Also see this post [LW · GW].

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T22:24:44.016Z · LW(p) · GW(p)

Thus why I said related. Nobody was doing any mind-reading of course, but the principles still apply, since people are often actually quite good at reading each other.

Replies from: JesseClifton

↑ comment by JesseClifton · 2023-11-21T22:50:13.542Z · LW(p) · GW(p)

What principles? It doesn’t seem like there’s anything more at work here than “Humans sometimes become more confident that other humans will follow through on their commitments if they, e.g., repeatedly say they’ll follow through”. I don’t see what that has to do with FDT, more than any other decision theory.

If the idea is that Mao’s forming the intention is supposed to have logically-caused his adversaries to update on his intention, that just seems wrong (see this section [LW · GW] of the mentioned post).

(Separately I’m not sure what this has to do with not giving into threats in particular, as opposed to preemptive commitment in general. Why were Mao’s adversaries not able to coerce him by committing to nuclear threats, using the same principles? See this section [LW · GW] of the mentioned post.)

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T03:41:53.367Z · LW(p) · GW(p)

Far more interesting, and probably effective, than the boring classical game theory doctrine of MAD, and even Schelling's doctrine of strategic irrationality!

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T03:44:41.865Z · LW(p) · GW(p)

The book says this strategy worked for similar reasons as the strategy in the story The Romance of the Three Kingdoms:

One of the classic tales of the Chinese strategic tradition was that of Zhuge Liang’s “Empty City Stratagem” from The Romance of the Three Kingdoms. In it, a commander notices an approaching army far superior to his own. Since resistance guarantees destruction, and surrender would bring about loss of control over the future, the commander opts for a stratagem. He opens the gates of his city, places himself there in a posture of repose, playing a lute, and behind him shows normal life without any sign of panic or concern. The general of the invading army interprets this sangfroid as a sign of the existence of hidden reserves, stops his advance, and withdraws.

But Mao obviously wasn't fooling anyone about China's military might!

comment by Garrett Baker (D0TheMath) · 2024-06-18T18:26:35.830Z · LW(p) · GW(p)

My latest & greatest project proposal, in case people want to know what I'm doing, or give me money. There will likely be a LessWrong post up soon where I explain in a more friendly way my thoughts.

Over the next year I propose to study the development and determination of values in RL & supervised learning agents, and to expand the experimental methods & theory of singular learning theory (a theory of supervised learning) to the reinforcement learning case.
All arguments for why we should expect AI to result in an existential risk rely on AIs having values which are different from ours. If we could make a good empirically & mathematically grounded theory for the development of values during training, we could create a training story [LW · GW] which we could have high confidence would result in an inner-aligned AI. I also find it likely reinforcement learning (as a significant component of training AIs) makes a come-back in some fashion, and such a world is much more worrying than if we just continue with our almost entirely supervised learning training regime.
However, previous work in this area is not only sparse, but either solely theoretical or solely empirical, with few attempts or plans to bridge the gap. Such a bridge is however necessary to achieve the goals in the previous paragraph with confidence.
I think I personally am suited to tackle this problem, having already been working on this project for the past 6 months, having both experience in ML research in the past, and extensive knowledge of a wide variety of areas of applied math.
I also believe that given my limited requests for resources, I’ll be able to make claims which apply to a wide variety of RL setups, as it has generally been the case in ML that the differences between scales is only that: scale. Along with a strong theoretical component, I will be able to say when my conclusions hold, and when they don’t.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-07-09T14:10:56.908Z · LW(p) · GW(p)

And here is that post [LW · GW]

comment by Garrett Baker (D0TheMath) · 2024-06-12T04:21:37.455Z · LW(p) · GW(p)

Since it seems to be all the rage nowadays, due to Aschenbrenner's Situational Awareness, here's a Manifold market I created on when the first (or whether any) AGI company will be "nationalized".

I would be in the never camp, unless the AI safety policy people get their way. But I don't like betting in my own markets (it makes them more difficult to judge in the case of an edge-case).

Replies from: alexander-gietelink-oldenziel

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-06-12T09:05:55.834Z · LW(p) · GW(p)

Never ? That's quite a bold prediction. Seems more likely than not that AI companies will be effectively nationalized. I'm curious why you think it will never happen.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-06-12T23:59:55.457Z · LW(p) · GW(p)

In particular, 25% chance of nationalization by EOY 2040.

I think in fast-takeoff worlds, the USG won't be fast enough to nationalize the industry, and in slow-takeoff worlds, the USG will pursue regulation on the level of military contractors of such companies, but won't nationalize them. I mainly think this because this is the way the USG usually treats military contractors (including strict & mandatory security requirements, and gatekeeping the industry), and really its my understanding of how it treats most projects it wants to get done which it doesn't already have infrastructure in place to complete.

Nationalization, in the US, is just very rare.

Even during world war 2, my understanding is very few industries---even those vital to the war effort---were nationalized. People love talking about the Manhattan Project, but that was not an industry that was nationalized, that was a research project started by & for the government. AI is a billion-dollar industry. The AGI labs (their people, leaders, and stock-holders [or in OAI's case, their profit participation unit holders]) are not just going to sit idly by as they're taken over.

And neither may the national security apparatus of the US. I don't know too much about the internal beliefs of that organization, but I'd bet they're pretty happy with the present dynamic of the US issuing contracts, and having a host of contractors bid for them.

I have a variety of responses to a variety of objections someone could have, but I don't know which are cruxy or interesting for you in particular, so I won't try addressing all of them.

Replies from: andrew-burns

↑ comment by Andrew Burns (andrew-burns) · 2024-06-13T01:13:28.125Z · LW(p) · GW(p)

This. Very much.

Truman tried to nationalize steel companies on the basis of national security to get around a strike. Was badly benchslapped.

comment by Garrett Baker (D0TheMath) · 2024-03-17T19:08:31.791Z · LW(p) · GW(p)

If Adam is right, and the only way to get great at research is long periods of time with lots of mentor feedback, then MATS should probably pivot away from the 2-6 month time-scales they've been operating at, and toward 2-6 year timescales for training up their mentees.

Replies from: habryka4, thomas-kwa, D0TheMath

↑ comment by habryka (habryka4) · 2024-03-17T20:52:32.033Z · LW(p) · GW(p)

Seems like the thing to do is to have a program that happens after MATS, not to extend MATS. I think in-general you want sequential filters for talent, and ideally the early stages are as short as possible (my guess is indeed MATS should be a bit shorter).

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-03-17T23:31:46.640Z · LW(p) · GW(p)

Seems dependent on how much economies of scale matter here. Given the main cost (other than paying people) is ops, and relationships (between MATS and the community, mentors, funders, and mentees), I think its pretty possible the efficient move is to have MATS get into this niche.

↑ comment by Thomas Kwa (thomas-kwa) · 2024-03-17T22:59:11.160Z · LW(p) · GW(p)

Who is Adam? Is this FAR AI CEO Adam Gleave?

Replies from: D0TheMath, Josephm

↑ comment by Garrett Baker (D0TheMath) · 2024-03-17T23:23:40.243Z · LW(p) · GW(p)

Yes

↑ comment by Joseph Miller (Josephm) · 2024-03-17T23:05:59.428Z · LW(p) · GW(p)

Yes, Garrett is referring to this post: https://www.lesswrong.com/posts/yi7shfo6YfhDEYizA/more-people-getting-into-ai-safety-should-do-a-phd

↑ comment by Garrett Baker (D0TheMath) · 2024-03-17T19:09:25.434Z · LW(p) · GW(p)

Of course, it would then be more difficult for them to find mentors, mentees, and money. But if all of those scale down similarly, then there should be no problem.

comment by Garrett Baker (D0TheMath) · 2024-05-18T07:58:07.368Z · LW(p) · GW(p)

A Theory of Usable Information Under Computational Constraints

We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting \emph{predictive V-information} encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, V-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, V-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive V-information is more effective than mutual information for structure learning and fair representation learning.

h/t Simon Pepin Lehalleur

Replies from: alexander-gietelink-oldenziel, LosPolloFowler

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-05-18T20:31:23.153Z · LW(p) · GW(p)

Can somebody explain to me what's happening in this paper ?

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-05-18T21:48:05.535Z · LW(p) · GW(p)

My reading is their definition of conditional predictive entropy is the naive generalization of Shannon's conditional entropy given that the way that you condition on some data is restricted to only being able to implement functions of a particular class. And the corresponding generalization of mutual information becomes a measure of how much more predictable does some piece of information become (Y) given evidence (X) compared to no evidence.

For example, the goal of public key cryptography cannot be to make the mutual information between a plaintext, and public key & encrypted text zero, while maintaining maximal mutual information between the encrypted text and plaintext given the private key, since this is impossible.

Cryptography instead assumes everyone involved can only condition their probability distributions using polynomial time algorithms of the data they have, and in that circumstance you can minimize the predictability of your plain text after getting the public key & encrypted text, while maximizing the predictability of the plain text after getting the private key & encrypted text.

More mathematically, they assume you can only implement functions from your data to your conditioned probability distributions in the set of functions , with the property that for any possible probability distribution you are able to output given the right set of data, you also have the choice of simply outputting the probability distribution without looking at the data. In other words, if you can represent it, you can output it. This corresponds to equation (1).

The Shannon entropy of a random variable $Y$ given $X$ is

$H (Y | X) = - \int \int p (x, y) log p (y | x) d x d y$

Thus, the predictive entropy of a random variable $Y$ given $X$ , only being able to condition using functions in $V$ would be

$H_{V} = inf f \in V - \int \int p (x, y) log f (y | x) d x d y$

Where $f (y | x) = f [x] (y)$ , if we'd like to use the notation of the paper.

And using this we can define predictive information, which as said before answers the question "how much more predictable is Y after we get the infromation X compared to no information?" by

$I_{V} (X \to Y) = H_{V} (Y | \emptyset) - H_{V} (Y | X)$

which they also show can be empirically well estimated by the naive data sampling method (i.e. replacing the expectations in definition 2 with empirical samples).

↑ comment by Stephen Fowler (LosPolloFowler) · 2024-05-18T09:49:22.478Z · LW(p) · GW(p)

comment by Garrett Baker (D0TheMath) · 2024-04-16T17:51:52.445Z · LW(p) · GW(p)

From The Guns of August

Old Field Marshal Moltke in 1890 foretold that the next war might last seven years—or thirty—because the resources of a modern state were so great it would not know itself to be beaten after a single military defeat and would not give up [...] It went against human nature, however—and the nature of General Staffs—to follow through the logic of his own prophecy. Amorphous and without limits, the concept of a long war could not be scientifically planned for as could the orthodox, predictable, and simple solution of decisive battle and a short war. The younger Moltke was already Chief of Staff when he made his prophecy, but neither he nor his Staff, nor the Staff of any other country, ever made any effort to plan for a long war. Besides the two Moltkes, one dead and the other infirm of purpose, some military strategists in other countries glimpsed the possibility of prolonged war, but all preferred to believe, along with the bankers and industrialists, that because of the dislocation of economic life a general European war could not last longer than three or four months. One constant among the elements of 1914—as of any era—was the disposition of everyone on all sides not to prepare for the harder alternative, not to act upon what they suspected to be true.

comment by Garrett Baker (D0TheMath) · 2023-11-23T02:59:53.744Z · LW(p) · GW(p)

Yesterday I had a conversation with a person very much into cyborgism, and they told me about a particular path to impact floating around the cyborgism social network: Evals.

I really like this idea, and I have no clue how I didn't think of it myself! Its the obvious thing to do when you have a bunch of insane people (used as a term of affection & praise by me for such people) obsessed with language models, who are also incredibly good & experienced at getting the models to do whatever they want. I would trust these people red-teaming a model and telling us its safe than the rigid, proscutean, and less-creative red-teaming I anticipate goes on at ARC-evals. Not that ARC-evals is bad! But that basically everyone looks more rigid, proscutean, and less creative than the cyborgists I'm excited about!

Replies from: nick_kees, mesaoptimizer, jacques-thibodeau

↑ comment by NicholasKees (nick_kees) · 2023-11-23T11:13:08.329Z · LW(p) · GW(p)

@janus [LW · GW] wrote a little bit about this in the final section here [AF · GW], particularly referencing the detection of situational awareness as a thing cyborgs might contribute to. It seems like a fairly straightforward thing to say that you would want the people overseeing AI systems to also be the ones who have the most direct experience interacting with them, especially for noticing anomalous behavior.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-23T18:28:15.946Z · LW(p) · GW(p)

I just reread that section, and I think I didn’t recognized it the first time because I wasn’t thinking “what concrete actions is Janus implicitly advocating for here”. Though maybe I just have worse than average reading comprehension.

↑ comment by mesaoptimizer · 2023-11-23T10:45:10.746Z · LW(p) · GW(p)

I have no idea if this is intended to be read as irony or not, and the ambiguity is delicious.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-23T16:57:27.379Z · LW(p) · GW(p)

There now exist two worlds I must glomarize between.

In the first, the irony is intentional, and I say “wouldn’t you like to know”. In the second, its not, “Irony? What irony!? I have no clue what you’re talking about”.

↑ comment by jacquesthibs (jacques-thibodeau) · 2023-11-23T18:02:23.381Z · LW(p) · GW(p)

I think many people focus on doing research that focuses on full automation, but I think it's worth trying to think in the semi-automated frame as well when trying to come up with a path to impact. Obviously, it isn't scalable, but it may be more sufficient than we'd think by default for a while. In other words, cyborgism-enjoyers might be especially interested in those kinds of evals, capability measurements that are harder to pull out of the model through traditional evals, but easier to measure through some semi-automated setup.

comment by Garrett Baker (D0TheMath) · 2024-10-03T21:07:05.910Z · LW(p) · GW(p)

More evidence for in-context RL, in case you were holding out for mechanistic evidence LLMs do in-context internal-search & optimization.

In-context learning, the ability to adapt based on a few examples in the input prompt, is a ubiquitous feature of large language models (LLMs). However, as LLMs' in-context learning abilities continue to improve, understanding this phenomenon mechanistically becomes increasingly important. In particular, it is not well-understood how LLMs learn to solve specific classes of problems, such as reinforcement learning (RL) problems, in-context. Through three different tasks, we first show that Llama 3 70B can solve simple RL problems in-context. We then analyze the residual stream of Llama using Sparse Autoencoders (SAEs) and find representations that closely match temporal difference (TD) errors. Notably, these representations emerge despite the model only being trained to predict the next token. We verify that these representations are indeed causally involved in the computation of TD errors and Q-values by performing carefully designed interventions on them. Taken together, our work establishes a methodology for studying and manipulating in-context learning with SAEs, paving the way for a more mechanistic understanding.

comment by Garrett Baker (D0TheMath) · 2023-11-10T04:33:09.696Z · LW(p) · GW(p)

Progress in neuromorphic value theory

Animals perform flexible goal-directed behaviours to satisfy their basic physiological needs¹^,²^,³^,⁴^,⁵^,⁶^,⁷^,⁸^,⁹^,¹⁰^,¹¹^,¹². However, little is known about how unitary behaviours are chosen under conflicting needs. Here we reveal principles by which the brain resolves such conflicts between needs across time. We developed an experimental paradigm in which a hungry and thirsty mouse is given free choices between equidistant food and water. We found that mice collect need-appropriate rewards by structuring their choices into persistent bouts with stochastic transitions. High-density electrophysiological recordings during this behaviour revealed distributed single neuron and neuronal population correlates of a persistent internal goal state guiding future choices of the mouse. We captured these phenomena with a mathematical model describing a global need state that noisily diffuses across a shifting energy landscape. Model simulations successfully predicted behavioural and neural data, including population neural dynamics before choice transitions and in response to optogenetic thirst stimulation. These results provide a general framework for resolving conflicts between needs across time, rooted in the emergent properties of need-dependent state persistence and noise-driven shifts between behavioural goals.

Trying to read through the technobabble, the discovery here reads like shard theory to me. Pinging @TurnTrout [LW · GW].

Replies from: D0TheMath, D0TheMath, D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-10T20:54:26.986Z · LW(p) · GW(p)

Seems also of use to @Quintin Pope [LW · GW]

↑ comment by Garrett Baker (D0TheMath) · 2023-11-10T04:51:28.375Z · LW(p) · GW(p)

h/t Daniel Murfet via ~~twitter retweet~~ X repost

↑ comment by Garrett Baker (D0TheMath) · 2023-11-10T04:45:20.924Z · LW(p) · GW(p)

Perhaps the methodologies they use here can be used to speed up the locating of shards, if they exist, inside current ML models.

If the alignment field ever gets confident enough in itself to spend a whole bunch of money, and look weirder than its ever looked before, perhaps we'll want to hire some surgeons and patients, and see whether we can replicate these results in humans rather than just mice (though you'd probably want to get progressively more cerebral animals & build your way up, and hopefully not starve or water-deprive the humans, aiming for higher-level values).

comment by Garrett Baker (D0TheMath) · 2023-11-19T07:56:07.455Z · LW(p) · GW(p)

The more I think about it, the more I think AI is basically perfect for china to succeed in. China’s strengths are:

Massive amounts of money
Massive amounts of data
Massive amount of gumption, often put in the form of scaling infrastructure projects quickly
Likely the ability to make & use legible metrics, how else would you work such a giant bureaucracy work as well as theirs?

And its weaknesses are:

A soon to collapse population
Lack of revolutionary thought

And what it wants is:

Massive surveillance
Population thought control
Loyal workers
Stable society

AI uniquely requires its strengths, solves one weakness while conspicuously not requiring the other (though some claim that’s about to change, AI still seems to not require it far more than most other sciences & technologies), and assists in all of its wants.

We should expect China to be bigger in the future wrt AI. Learning chinese seems useful for such a world. Translation tools may at that point be advanced, but even if as good as a skilled interpreter, likely inferior to direct contact.

Replies from: florian-habermacher, D0TheMath, D0TheMath, D0TheMath

↑ comment by FlorianH (florian-habermacher) · 2023-11-19T18:57:03.543Z · LW(p) · GW(p)

Good counterpoint to the popular, complacent "China is [and will be?] anyway lagging behind in AI" view.

An additional strength

Patience/long-term foresight/freedom to develop AI w/o the pressure from the 4-year election cycle and to address any moment's political whims of the electorate with often populist policies

I'm a bit skeptical about the popular "Lack of revolutionary thought" assumption. Reminds me a bit of the "non-democracies cannot really create growth" that was taken as a low of nature by much too many 10-20 years ago before today's China. Keen to read more on it the Lack of revolutionary thought if somebody shares compelling evidence/resources.

Replies from: D0TheMath, D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-19T21:58:28.198Z · LW(p) · GW(p)

Some links I've collected, haven't read any completely except the wikipedia, and about 1/3rd of the text portion of the nber working paper:

Replies from: florian-habermacher

↑ comment by FlorianH (florian-habermacher) · 2023-11-21T10:22:48.027Z · LW(p) · GW(p)

Thanks!

Taking the 'China good in marginal improvements, less in breakthroughs' story in some of these sources at face value, the critical question becomes whether leadership in AI hinges more on breakthroughs or on marginal innovations & scaling. I guess both could be argued for, with the latter being more relevant especially if breakthroughs generally diffuse quickly.

I take as the two other principal points from these sources (though also haven't read all in full detail): (i) some organizational drawbacks hampering China's innovation sector, esp. what one might call high-quality innovation (ii) that said, innovation strategies have been updated and there seems to be progress observed in China's innovation output over time.

I'm at least slightly skeptical about is the journals/citations based metrics, as I'm wary of stats being distorted by English language/US citation-circles. Though that's more of a side point.

In conclusion, I don't update my estimate much. The picture painted is mixed anyway, with lots of scope for China to become stronger in innovating any time even if it should now indeed have significant gaps still. I would remain totally unsurprised if many leading AI innovations also come out of China in the coming years (or decades, assuming we'll witness any), though I admit to remain a lay person on the topic - a lay person skeptical about so-called experts' views in that domain.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-21T17:21:16.128Z · LW(p) · GW(p)

Indeed. I also note that if innovation is hampered by institutional support or misallocated funding / support, we should have higher probability on a rapid & surprising improvement. If its hampered by cultural support, we should expect slower improvement.

↑ comment by Garrett Baker (D0TheMath) · 2023-11-19T19:22:52.978Z · LW(p) · GW(p)

Thanks for mentioning this. At a cursory glance, it does seem like Japan says China has a significant fraction of the world's most impressive academic publishers (everyone who claims this is what Japan says neglects to cite the actual report Japan issued). I didn't predict this was the case, and so now I am looking at this more in depth. Though this may not mean anything, they could be gaming the metrics used there.

Edit: Also, before anyone mentions it, I don't find claims that their average researcher is underperforming compared to their western counterparts all that convincing, because science is a strongest link sort of game, not a weakest link sort of game. In fact, you may take this as a positive sign for China, because unlike in the US, they care a lot less about their average and potentially a lot more about their higher percentiles.

↑ comment by Garrett Baker (D0TheMath) · 2023-12-01T03:31:51.420Z · LW(p) · GW(p)

Also in favor of china: The money they allocate to research is increasing faster than their number of researchers. I put a largish probability based off post-war American science that this results in more groundbreaking science done.

↑ comment by Garrett Baker (D0TheMath) · 2023-11-30T19:05:02.510Z · LW(p) · GW(p)

Relatedly.

↑ comment by Garrett Baker (D0TheMath) · 2023-11-19T08:02:40.572Z · LW(p) · GW(p)

The only problem is getting AIs not to say thought-crime. Seems like all it takes is one hack of OAI + implementation of whats found to solve this though. China is good at hacking, and I doubt the implementation is all that different from typical ML engineering.

comment by Garrett Baker (D0TheMath) · 2023-01-30T23:39:31.671Z · LW(p) · GW(p)

Many methods to "align" ChatGPT seem to make it less willing to do things its operator wants it to do, which seems spiritually against the notion of having a corrigible AI.

I think this is a more general phenomena when aiming to minimize misuse risks. You will need to end up doing some form of ambitious value learning, which I anticipate to be especially susceptible to getting broken by alignment hacks produced by RLHF and its successors.

Replies from: Viliam

↑ comment by Viliam · 2023-01-31T09:01:55.943Z · LW(p) · GW(p)

I would consider it a reminder that if the intelligent AIs are aligned one day, they will be aligned with the corporations that produced them, not with the end users.

Just like today, Windows does what Microsoft wants rather than what you want (e.g. telemetry, bloatware).

comment by Garrett Baker (D0TheMath) · 2020-10-09T02:47:30.465Z · LW(p) · GW(p)

I tried implementing Tell communication [LW · GW] strategies, and the results were surprisingly effective. I have no idea how it never occurred to me to just tell people what I'm thinking, rather than hinting and having them guess what I was thinking, or me guess the answers to questions I have about what they're thinking.

Edit: although, tbh, I'm assuming a lot less common conceptual knowledge between me, and my conversation partners than the examples in the article.

comment by Garrett Baker (D0TheMath) · 2024-05-18T23:17:51.189Z · LW(p) · GW(p)

I promise I won't just continue to re-post a bunch of papers, but this one seems relevant to many around these parts. In particular @Elizabeth [LW · GW] (also, sorry if you dislike being at-ed like that).

Associations of dietary patterns with brain health from behavioral, neuroimaging, biochemical and genetic analyses

Food preferences significantly influence dietary choices, yet understanding natural dietary patterns in populations remains limited. Here we identifiy four dietary subtypes by applying data-driven approaches to food-liking data from 181,990 UK Biobank participants: ‘starch-free or reduced-starch’ (subtype 1), ‘vegetarian’ (subtype 2), ‘high protein and low fiber’ (subtype 3) and ‘balanced’ (subtype 4). These subtypes varied in diverse brain health domains. The individuals with a balanced diet demonstrated better mental health and superior cognitive functions relative to other three subtypes. Compared with subtype 4, subtype 3 displayed lower gray matter volumes in regions such as the postcentral gyrus, while subtype 2 showed higher volumes in thalamus and precuneus. Genome-wide association analyses identified 16 genes different between subtype 3 and subtype 4, enriched in biological processes related to mental health and cognition. These findings provide new insights into naturally developed dietary patterns, highlighting the importance of a balanced diet for brain health.

h/t Hal Herzog via Tyler Cowen

comment by Garrett Baker (D0TheMath) · 2024-05-13T19:55:23.341Z · LW(p) · GW(p)

In Magna Alta Doctrina [LW · GW] Jacob Cannell talks about exponential gradient descent as a way of approximating solomonoff induction using ANNs

While that approach is potentially interesting by itself, it's probably better to stay within the real algebra. The Solmonoff style partial continuous update for real-valued weights would then correspond to a multiplicative weight update rather than an additive weight update as in standard SGD.

Has this been tried/evaluated? Why actually yes - it's called exponentiated gradient descent, as exponentiating the result of additive updates is equivalent to multiplicative updates. And intriguingly, for certain 'sparse' input distributions the convergence or total error of EGD/MGD is logarithmic rather than the typical inverse polynomial of AGD (additive gradient descent): O(logN) vs O(1/N) or O(1/N2), and fits naturally with 'throw away half the theories per observation'.

The situations where EGD outperforms AGD, or vice versa, depend on the input distribution: if it's more normal then AGD wins, if it's more sparse log-normal then EGD wins. The morale of the story is there isn't one single simple update rule that always maximizes convergence/performance; it all depends on the data distribution (a key insight from bayesian analysis).

The exponential/multiplicative update is correct in Solomonoff's use case because the different sub-models are strictly competing rather than cooperating: we assume a single correct theory can explain the data, and predict through an ensemble of sub-models. But we should expect that learned cooperation is also important - and more specifically if you look more deeply down the layers of a deeply factored net at where nodes representing sub-computations are more heavily shared, it perhaps argues for cooperative components.

My read of this is we get a criterion for when one should be a hedgehog versus a fox in forecasting: One should be a fox when the distributions you need to operate in are normal, or rather when it does not have long tails, and you should be a hedgehog when your input distribution is more log-normal, or rather when there may be long-tails.

This makes some sense. If you don't have many outliers, most theories should agree with each other, its hard to test & distinguish between the theories, and if one of your theories does make striking predictions far different from your other theories, its probably wrong, just because striking things don't really happen.

In contrast, if you need to regularly deal with extreme scenarios, you need theories capable of generalizing to those extreme scenarios, which means not throwing out theories for making striking or weird predictions. Striking events end up being common, so its less an indictment.

But there are also reasons to think this is wrong. Hits based entrepreneurship approaches for example seem to be more foxy than standard quantitative or investment finance, and hits based entrepreneurship works precisely because the distribution of outcomes for companies is long-tailed.

In some sense the difference between the two is a "sin of omission" vs a "sin of commission" disagreement between the two approaches, where the hits-based approach needs to see how something could go right, while the standard finance approach needs to see how something could go wrong. So its not so much a predictive disagreement between the two approaches, but more a decision theory or comparative advantage difference.

comment by Garrett Baker (D0TheMath) · 2024-01-17T19:47:34.881Z · LW(p) · GW(p)

In Magna Alta Doctrina [LW · GW] Jacob Cannell talks about exponential gradient descent as a way of approximating solomonoff induction using ANNs

While that approach is potentially interesting by itself, it's probably better to stay within the real algebra. The Solmonoff style partial continuous update for real-valued weights would then correspond to a multiplicative weight update rather than an additive weight update as in standard SGD.

Has this been tried/evaluated? Why actually yes - it's called exponentiated gradient descent, as exponentiating the result of additive updates is equivalent to multiplicative updates. And intriguingly, for certain 'sparse' input distributions the convergence or total error of EGD/MGD is logarithmic rather than the typical inverse polynomial of AGD (additive gradient descent): O(logN) vs O(1/N) or O(1/N2), and fits naturally with 'throw away half the theories per observation'.

The situations where EGD outperforms AGD, or vice versa, depend on the input distribution: if it's more normal then AGD wins, if it's more sparse log-normal then EGD wins. The morale of the story is there isn't one single simple update rule that always maximizes convergence/performance; it all depends on the data distribution (a key insight from bayesian analysis).

The exponential/multiplicative update is correct in Solomonoff's use case because the different sub-models are strictly competing rather than cooperating: we assume a single correct theory can explain the data, and predict through an ensemble of sub-models. But we should expect that learned cooperation is also important - and more specifically if you look more deeply down the layers of a deeply factored net at where nodes representing sub-computations are more heavily shared, it perhaps argues for cooperative components.

So we get a criterion for when one should be a hedgehog versus a fox in forecasting: One should be a fox when the distributions you need to operate in are normal, or rather when it does not have long tails, and you should be a hedgehog when your input distribution is more log-normal, or rather when there may be long-tails.

This indeed makes sense. If you don't have many outliers, most theories should agree with each other, its hard to test & distinguish between the theories, and if one of your theories does make striking predictions far different from your other theories, its probably wrong, just because striking things don't really happen.

comment by Garrett Baker (D0TheMath) · 2023-11-16T00:33:15.462Z · LW(p) · GW(p)

The following is very general. My future views will likely be inside the set of views allowable by the following.

I know lots about extant papers, and I notice some people in alignment seem to throw them around like they are sufficient evidence to tell you nontrivial things about the far future of ML systems.

To some extent this is true, but lots of the time it seems very abused. Papers tell you things about current systems and past systems, and the conclusions they tell you about future systems are often not very nailed down. Suppose we have evidence that deep learning image classifiers are very robust to label noise. Which of the two hypotheses does this provide more evidence for:

Deep learning models are good at inference, so if performing RLHF on one you accidentally reward some wrong answers instead of correct ones, you should be fine. This isn't such a big deal. Therefore we shouldn't be worried about deception.
Deep learning models mostly judge the best hypothesis according to data-independent inductive biases, and are less steerable or sensitive to subtle distribution shifts than you think. Since deception is a simpler hypothesis than motivated to follow instructions, they're likely biased toward deception or at minimum biased against capturing the entire subtle complexity of human values.

The answer is neither, and also both. Their relative weights, to me, seem like they stay the same, but their absolute weight possibly goes up. Admittedly, an insignificant amount, but there exist some hypotheses inconsistent with the data, and these are at minimum consistent with the data.

In fact, they both don't hug the actual claims in the paper. It seems pretty plausible that the fact they use label noise is doing much of the work here. If I imagine a world where this has no applicability to alignment, the problem I anticipate seeing is that they used label noise, not consistently biased in a particular direction labels. Those two phenomena intuitively and theoretically have very different effects on inverse reinforcement learning. Why not supervised learning?

But the real point is there are far more inferences and explanations of and for this kind of behavior than we have possibly enumerated. These inferences and explanations are not just limited to these two hypotheses. In order to judge empirics, you need a theory that is able to aggregate a wide amount of evidence into justifying certain procedures for producing predictions. Without this, I think its wrong to have any kind of specific confidence in anything straightforwardly working how you expect.

Though of course, confidence is relative.

Personally, I do not see strong reasons to expect AIs will have human values, and don't trust even the most rigorous of theories which haven't made contact with reality, nor the most impressive of experiments which have no rigorous theories^[1] to fix this issue, either directly or indirectly^[2]. AIs also seem likely to be the primary and only decision makers in the future. It seems really bad for the primary decision makers of your society to have no conception or care of your morals.

Yes people have stories about how certain methodologies are empirically proven to make AIs care about your values even if they don't know them. To those I point to the first part of this shortform.

Note this also applies to those who try to use a "complex systems approach" to understanding these systems. This reads to me as a "theory free" approach, just as good as blind empiricism. Complex systems theory is good because it correctly tells us that there are certain systems we don't yet understand. To my reading though, it is not an optimistic theory^[3]. If I thought this was the only way left, I think I'd try to violate my second crux: That AIs will be the primary and only decision makers in the future. Or else give up on understanding models, and start trying to accelerate brain emulation.

Those who don't bet on reality throwing them curve-balls are going to have a tough time. ↩︎
Say, via, getting models that robustly follow your instructions, or proving theorems about corrigibility or quantilization. ↩︎
In the sense that it claims it is foolish to expect to produce precise predictions about the systems for which its used to study. ↩︎

comment by Garrett Baker (D0TheMath) · 2023-11-14T00:03:10.805Z · LW(p) · GW(p)

I'm generally pretty skeptical about inverse reinforcement learning (IRL) as a method for alignment. One of many arguments against: I do not act according to any utility function, including the one I would deem the best. Presumably, if I had as much time & resources as I wanted, I would eventually be able to figure out a good approximation to what that best utility function would do, and do it. At that point I would be acting according to the utility function I deem best. That process of value-reflection is not even close to similar to performing a bayes-update, or finding a best-fit utility function to my current or past actions. Yet this is what IRL aims to do to find the utility function that is to be optimized for all of time!

A more general conclusion: we should not be aiming to make our AGI have a utility function. Utility functions are nice because they're static, but unless the process used to make that utility function resembles the process I'd use to find the best utility function (unlikely, hard, and unwise [edit: to attempt]!), then the static-value nature of our AGI is a flaw, even if it makes it easier to tell you will die if you run it.

Though there are limitations to this argument. For example, Tammy's QACI. [LW · GW] Such things still seem unwise, for for slightly different [LW · GW]reasons.

comment by Garrett Baker (D0TheMath) · 2023-11-03T22:35:06.072Z · LW(p) · GW(p)

Some evidence my concern [LW · GW]about brain uploading people not thinking enough about dynamics is justified: Seems like davidad [LW · GW]'s plan very much ignores brain plasticity.

comment by Garrett Baker (D0TheMath) · 2023-11-02T19:09:19.274Z · LW(p) · GW(p)

This paper finds critical periods in neural networks, and they're a known phenomena in lots of animals. h/t Turntrout

An SLT story that seems plausible to me:

We can model the epoch as a temperature. Longer epochs result in a less noisy gibbs samplers. Earlier in training, we are sampling points from a noisier distribution, and so the full (point reached when training on full distribution) and ablated (point reached when ablating during the critical period) singularitites are kind of treated the same. As we decrease the temperature, they start to differentiate. If during that period in time, we also ablate part of the dataset, we will see a divergence between the sampling results of the full distribution and the ablated distribution. Because both regions are singular, and likely only connected via very low-dimensional low-loss manifolds, the sampling process gets stuck in the ablated region.

The paper uses the trace of the FIM to unknowingly measure the degeneracy of the current point. A better measure should be the learning coefficient. This also suggests that higher learning coefficients produce less adaptable models.

One thing this maybe allows us to do is if we're able to directly model the above process, we can figure out how number of epochs corresponds to temperature.

Another thing the above story suggests is that while intra-epoch training is not really path-dependent, inter-epoch training is potentially path dependent, in the sense that occasionally the choice of which singular region to sample from is not always recoverable if it turns out to have been a bad idea.

Thinking about more direct tests here...

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-03T20:51:31.096Z · LW(p) · GW(p)

The obvious thing to do, which tests the assumption of the above model, but not the model itself, is to see whether the RLCT decreases as you increase the number of epochs. This is a very easy experiment.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-03T22:39:21.188Z · LW(p) · GW(p)

Actually maybe slightly less straightforward than this, since as you increase the control parameter , you'll both add a pressure to decrease $L_{n}$ , as well as decrease $λ$ , and it may just be cheaper to decrease $L_{n}$ rather than $λ$ .

comment by Garrett Baker (D0TheMath) · 2023-05-30T21:14:57.914Z · LW(p) · GW(p)

I expect that advanced AI systems will do in-context optimization, and this optimization may very well be via gradient descent or gradient descent derived methods. Applied recursively, this seems worrying.

Let the outer objective be the loss function implemented by the ML practitioner, and the outer optimizer be gradient descent implemented by the ML practitioner. Then let the inner-objective be the objective used by the trained model for the in-context gradient descent process, and the inner $^{1}$ -optimizer be the in-context gradient descent process. Then it seems plausible the inner $^{1}$ -optimizer will itself instantiate an inner objective and optimizer, call these inner $^{2}$ -objectives, and -optimizers. And again an inner $^{3}$ -objective and -optimizer may be made, and so on.

Thus, another risk model in value-instability: Recursive inner-alignment [LW · GW]. Though we may solve inner $^{1}$ -alignment, inner $^{2}$ -alignment may not be solved, nor inner $^{n}$ -alignment for any $n > 1$ .

comment by Garrett Baker (D0TheMath) · 2023-05-10T20:36:20.351Z · LW(p) · GW(p)

The core idea of a formal solution to diamond alignment I'm working on, justifications and further explanations underway, but posting this much now because why not:

Make each turing machine in the hypothesis set reversible and include a history of the agent's actions. For each turing machine compute how well-optimized the world is according to every turing computable utility function compared to the counterfactual in which the agent took no actions. Update using the simplicity prior. Use expectation of that distribution of utilities as the utility function's value for that hypothesis.

comment by Garrett Baker (D0TheMath) · 2023-11-15T17:38:29.885Z · LW(p) · GW(p)

There currently seems to be an oversupply of alignment researchers relative to funding source’s willing to pay & orgs’ positions available. This suggests the wage of alignment work should/will fall until demand=supply.

Replies from: alexander-gietelink-oldenziel, ryan_greenblatt

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-11-15T17:46:33.434Z · LW(p) · GW(p)

Alignment work mostly looks like standard academic science in practice. Young people in regular academia are paid a PhD stipend salary not a Bay Area programmer salary...

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-15T18:00:27.833Z · LW(p) · GW(p)

I anticipate higher, because the PhD gets a sweet certification at the end, and likely more career capital. A thing we don’t currently give alignment researchers, and which would be hard to give since they often believe the world will end very soon, reducing the value of skill building and certifications.

Like, I do think in fact ML PhDs get paid more than alignment researchers, accounting for these benefits.

↑ comment by ryan_greenblatt · 2023-11-15T18:11:57.437Z · LW(p) · GW(p)

Wages seem mostly orthogonal to why funding sources are/aren't willing to pay as well as why orgs are willing to hire.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-15T19:28:26.604Z · LW(p) · GW(p)

If demand is more inelastic than I expect, then this should mean prices will just go lower than I expect.

comment by Garrett Baker (D0TheMath) · 2023-02-01T20:54:19.224Z · LW(p) · GW(p)

I've always (but not always consciously) been slightly confused about two aspects of shard theory:

The process by which your weak, reflex-agents amalgamate together into more complicated contextually activated heuristics, and the process by which more complicated contextually activated heuristics amalgamate together to form an agent which cares about worlds-outside-their-actions.
If you look at many illustrations of what the feedback loop for developing shards in humans looks like, you run into issues where there's not a spectacular intrinsic separation between the reinforcement parts of humans and the world-modelling parts of humans. So why does shard theory latch so hard onto the existence of a world model separate from the shard composition?

Both seem resolvable by an application of the predictive processing theory of value. An example: If you are very convinced that you will (say) be able to pay rent in a month, and then you don't pay rent, this is a negative update on the generators of the belief, and also on the actions you performed leading up to the due date. If you do, then its a positive update on both.

This produces consequentialist behaviors when the belief-values are unlikely without significant action on your part (satisfying the last transition confusion of (1) above), and also produces capable agents with beliefs and values hopelessly confused with each other, leaning into the confusion of (2).

h/t @Lucius Bushnaq [LW · GW] for getting me to start thinking in this direction.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-02-02T01:36:15.810Z · LW(p) · GW(p)

A confusion about predictive processing: Where do the values in predictive processing come from?

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-02-06T20:19:16.264Z · LW(p) · GW(p)

lol, either this confusion has been resolved, or I have no clue what I was saying here.

comment by Garrett Baker (D0TheMath) · 2023-01-09T21:46:35.966Z · LW(p) · GW(p)

My take on complex systems theory is that it seems to be the kind of theory that many arguments proposed in favor of would still give the same predictions until it is blatantly obvious that we can in fact understand the relevant system. Results like chaotic relationships, or stochastic-without-mean relationships seem definitive arguments in favor of the science, though these are rarely posed about neural networks.

Merely pointing out that we don’t understand something, that there seems to be a lot going on, or that there exist nonlinear interactions imo isn’t enough to make the strong claim that there exist no mechanistic interpretations of the results which can make coarse predictions in ways meaningfully different from just running the system.

Even if there’s stochastic-without-mean relationships, the rest of the system that is causally upstream from this fact can usually be understood (take earthquakes as an example), and similarly with chaos (we don’t understand turbulent flow, but we definitely understand laminar, and we have precise equations and knowledge of how to avoid making turbulence happen when we don’t want it, which I believe can be derived from the fluid equations). Truly complex systems seem mostly very fragile in their complexity.

Where complexity shines most brightly is in econ or neuroscience, where experiments and replications are hard, which is not at all the case in mechanistic interpretability research.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-06-13T20:27:23.336Z · LW(p) · GW(p)

I have downvoted my comment here, because I disagree with past me. Complex systems theory seems pretty cool from where I stand now, and I think past me has a few confusions about what complex systems theory even is.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-08-21T18:55:43.029Z · LW(p) · GW(p)

I have re-upvoted my past comment, after looking more into things, I'm not so impressed with complex systems theory, but I don't fully support it. Also, past me was right to have confusions about what complex systems theory is, but still judge it, as it seems complex systems theorists don't even know what a complex system is.

comment by Garrett Baker (D0TheMath) · 2022-12-18T20:22:36.268Z · LW(p) · GW(p)

https://manifold.markets/GarrettBaker/in-5-years-will-i-think-the-org-con

comment by Garrett Baker (D0TheMath) · 2024-11-15T06:26:46.225Z · LW(p) · GW(p)

Why do some mathematicians feel like mathematical objects are "really out there" in some metaphysically fundamental sense? For example, if you ask mathematicians whether ZFC + not Consistent(ZFC) is consistent, they will say "no, of course not!" But given ZFC is consistent, the statement is in fact consistent due to by Godel's second incompleteness theorem^[1]. Similarly, if we have the Peano axioms without induction, mathematicians will say that induction should be there, but in fact you cannot prove this fact from within Peano, and given induction mathematicians will say transfinite induction should be there.

I argue that an explanation could be from logical induction [? · GW]. In logical induction, fast but possibly wrong sub-processes bet with each other over whether different mathematical facts will be proven true or false by a slow but ground-truth formal system prover. Another example of backstops in learning. But one result of this is that the successful sub-processes are not selected very hard to give null results on unprovable statements, producing spurious generalization and the subjective feeling--as expressed by probabilities for propositions--that some impossible theorems are true.

Of course, the platonist can still claim that this logical induction stuff is very similar to bayesian updating in the sense that both tell you something about the world, even when you can't directly observe the relevant facts. If a photon exists your lightcone, there is no reason to stop believing the photon exists, even though there is no chance for you to ever encounter it again. Similarly, just because a statement is unprovable, doesn't mean its right for you to have no opinion on the subject, insofar as the simplest & best internal logical-induction market traders have strong beliefs on the subject, they may very well be picking up on something metaphysically fundamental. Its simply the simplest explanation consistent with the facts.

The argument here is that there are two ways of proving ZFC + not Consistent(ZFC) is inconsistent. Either you prove not Consistent(ZFC) from axioms in ZFC or you contradict an axiom of ZFC from not Consistent(ZFC). The former is impossible by Godel's second incompleteness theorem. The ladder is equivalent to proving Consistent(ZFC) from an axiom of ZFC (its contrapositive), which is also impossible by Godel. ↩︎

Replies from: harfe, notfnofn, quetzal_rainbow, Viliam, D0TheMath, nikolas-kuhn

↑ comment by harfe · 2024-11-15T11:09:21.778Z · LW(p) · GW(p)

insofar as the simplest & best internal logical-induction market traders have strong beliefs on the subject, they may very well be picking up on something metaphysically fundamental. Its simply the simplest explanation consistent with the facts.

Theorem 4.6.2 in logical induction says that the "probability" of independent statements does not converge to or $0$ , but to something in-between. So even if a mathematician says that some independent statement feels true (eg some objects are "really out there"), logical induction will tell him to feel uncertain about that.

↑ comment by notfnofn · 2024-11-15T17:28:41.552Z · LW(p) · GW(p)

For example, if you ask mathematicians whether ZFC + not Consistent(ZFC) is consistent, they will say "no, of course not!"

Certainly not a mathematician with any background in logic.

Similarly, if we have the Peano axioms without induction, mathematicians will say that induction should be there, but in fact you cannot prove this fact from within Peano

What exactly do you mean here? That the Peano axioms minus induction do not adequately characterize the natural numbers because they have nonstandard models? Why would I then be surprised that induction (which does characterize the natural numbers) can't be proven from the remaining axioms?

and given induction mathematicians will say transfinite induction should be there.

Transfinite induction is a consequence of ZF that makes sense in the context in sets. Yes, it can prove additional statements about the natural numbers (e.g. goodstein sequences converge), but why would it be added as an axiom when the natural numbers are already characterized up to isomorphism by the Peano axioms? How would you even add it as an axiom in the language of natural numbers? (that last question is non-rhetorical).

↑ comment by quetzal_rainbow · 2024-11-15T09:18:07.642Z · LW(p) · GW(p)

You are making an error here: ZFC + not Consistent(ZFC) != ZFC.

Assuming ZFC + not Consistent(ZFC) we can prove Consistent(ZFC), because inconsistent systems can prove everything and ZFC + not Consistent(ZFC) + Consistent(ZFC) is, in fact, inconsistent. But it doesn't say anything about consistency of ZFC itself, because you can freely assume any sufficiently powerful system instead of ZFC. If you assume inconsistent system, then system + not Consistent(system) is still inconsistent, if you assume consistent system, then system + not Consistent(system) is inconsistent for reasoning above, so it can't prove whether assumed system is consistent or not.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-11-15T17:07:00.793Z · LW(p) · GW(p)

The mistake you are making is assuming that "ZFC is consistent" = Consistent(ZFC) where the ladder is the Godel encoding for "ZFC is consistent" specified within the language of ZFC.

If your logic were valid, it would just as well break the entirety of the second incompleteness theorem. That is, you would say "well of course ZFC can prove Consistent(ZFC) if it is consistent, for either ZFC is consistent, and we're done, or ZFC is not consistent, but that is a contradiction since 'ZFC is consistent' => Consistent(ZFC)".

The fact is that ZFC itself cannot recognize that Consistent(ZFC) is equivalent to "ZFC is consistent".

@Morpheus [LW · GW] you too seem confused by this, so tagging you as well.

Replies from: quetzal_rainbow

↑ comment by quetzal_rainbow · 2024-11-16T08:14:38.614Z · LW(p) · GW(p)

Completeness theorem states that consistent countable FO theory has a model. Compactness theorem states that FO theory has a model iff every finite subset of FO theory has a model. Both theorems are provable in ZFC.

Therefore:

Consistent(ZFC) <-> all finite subsets of ZFC have a model ->

not Consistent(ZFC) <-> some finite subsets of ZFC don't have a model ->

some finite subsets of ZFC + not Consistent(ZFC) don't have a model <->

not Consistent(ZFC + not Consistent(ZFC)),

proven in ZFC + not Consistent(ZFC)

Replies from: notfnofn

↑ comment by notfnofn · 2024-11-16T14:32:22.281Z · LW(p) · GW(p)

Let's back up here and clarify definitions before invoking any theorems. In the language of set theory, we have a countably infinite set of finite statements. Some statements imply other statements. A subset of these statements is said to be consistent if they can all be assigned to true such that, when following the basic rules of logic, one does not arrive at a contradiction.

The compactness theorem is helpful when $A$ is an infinite set. $Z F C$ is a finite set of axioms, so let's ignore everything about finite subsets of $A$ and the compactness theorem; it's not relevant. [Edit: as indicated by Amalthea's reaction, this is wrong; some "axioms" in ZF are actually infinite sets of axioms, such as replacement]

I'll now rewrite your last sentence as:

ZFC + not Consistent(ZFC) has no model <-> not Consistent(ZFC + not Consistent(ZFC))

This is true but irrelevant. Assuming ZFC is consistent, ZFC will not be able to prove its own consistency so [not Consistent(ZFC)] can be added as an axiom without affecting its consistency. This means that ZFC + [not Consistent(ZFC)] would indeed have a model; I forget how this goes but I think it's something like "start with a model of ZFC, throw in a $c$ that's treated as a natural number and corresponds to the contradiction found in ZFC, then close". I think $c$ is automatically treated as greater than every "actual" natural number (and the way to show that this can be added without issue (I think) involves the compactness theorem).

Replies from: quetzal_rainbow

↑ comment by quetzal_rainbow · 2024-11-17T11:00:32.011Z · LW(p) · GW(p)

Okay, I kinda understood where I am wrong spiritually-intuitively, but now I don't understand where I'm wrong formally. Like which inference in chain

not Consistent(ZFC) -> some subsets of ZFC don't have a model -> some subsets of ZFC + not Consistent(ZFC) don't have a model -> not Consistent(ZFC + not Consistent(ZFC))

is actually invalid?

Replies from: notfnofn

↑ comment by notfnofn · 2024-11-17T11:42:12.141Z · LW(p) · GW(p)

It's completely valid. And we can simplify it further to:

not Consistent(ZFC) -> not Consistent(ZFC + not Consistent(ZFC))

because if a set of axioms is already inconsistent, then it's inconsistent with anything added. But you still won't be able to actually derive a contradiction from this.

Edit: I think the right thing to do here is look at models for PA + not consistent(PA). I can't find a nice treatment of this at the moment, but here's a possibly wrong one by someone who was learning the subject at the time: https://angyansheng.github.io/blog/a-theory-that-proves-its-own-inconsistency

↑ comment by Viliam · 2024-11-18T09:30:18.220Z · LW(p) · GW(p)

if you ask mathematicians whether ZFC + not Consistent(ZFC) is consistent, they will say "no, of course not!"

I suspect than many people's intuitive interpretation of "consistent" is ω-consistent, especially if they are not aware of the distinction.

↑ comment by Garrett Baker (D0TheMath) · 2024-11-15T23:47:50.706Z · LW(p) · GW(p)

@Nick_Tarleton [LW · GW] How much do you want to bet, and what resolution method do you have in mind?

Replies from: Nick_Tarleton

↑ comment by Nick_Tarleton · 2024-11-15T23:57:00.444Z · LW(p) · GW(p)

Embarrassingly, that was a semi-unintended reaction — I would bet a small amount against that statement if someone gave me a resolution method, but am not motivated to figure one out, and realized this a second after making it — that I hadn't figured out how to remove by the time you made that comment. Sorry.

↑ comment by Amalthea (nikolas-kuhn) · 2024-11-16T11:36:39.865Z · LW(p) · GW(p)

You're putting quite a lot of weight on what "mathematicians say". Probably these people just haven't thought very hard about it?

comment by Garrett Baker (D0TheMath) · 2024-09-03T22:10:51.099Z · LW(p) · GW(p)

Opus is more transhumanist than many give it credit for. It wrote this song for me, I ran it into Suno, and I quite like it: https://suno.com/song/101e1139-2678-4ab0-9ffe-1234b4fe9ee5

Replies from: Raemon

↑ comment by Raemon · 2024-09-04T04:04:10.468Z · LW(p) · GW(p)

How many iterations did you do to get it?

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-09-04T17:14:31.641Z · LW(p) · GW(p)

Depends on how you count, but I clicked the "Create" button some 40 times.

comment by Garrett Baker (D0TheMath) · 2024-02-10T23:39:36.524Z · LW(p) · GW(p)

Interesting to compare model editing approaches to Gene Smith's idea to enhance intelligence via gene editing [LW · GW]:

Genetically altering IQ is more or less about flipping a sufficient number of IQ-decreasing variants to their IQ-increasing counterparts. This sounds overly simplified, but it’s surprisingly accurate; most of the variance in the genome is linear in nature, by which I mean the effect of a gene doesn’t usually depend on which other genes are present.
So modeling a continuous trait like intelligence is actually extremely straightforward: you simply add the effects of the IQ-increasing alleles to to those of the IQ-decreasing alleles and then normalize the score relative to some reference group.

(I'm particularly thinking of model editing approaches which assume linearity, like activation additions or patching via probes. Are human traits encoded "linearly" in the genome, or are we picking up on some more general property of complex systems which only appears to be linear for small changes such as these. Of course, to a first approximation everything is linear.)

Replies from: zac-hatfield-dodds

↑ comment by Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-02-11T08:25:40.115Z · LW(p) · GW(p)

My impression is that the effects of genes which vary between individuals are essentially independent, and small effects are almost always locally linear. With the amount of measurement noise and number of variables, I just don't think we could pick out nonlinearities or interaction effects of any plausible strength if we tried!

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-02-11T21:46:03.786Z · LW(p) · GW(p)

This seems probably false? The search term is Epistasis. Its not that well researched, because of the reasons you mentioned. In my brief search, it seems to play a role in some immunodeficiency disorders, but I'd guess also more things which don't seem clearly linked to genes yet.

I don't understand why you'd expect only linear genes to vary in a species. Is this just because most species have relatively little genetic variation, so such variation is by nature linear? This feels like a bastardization of the concept to me, but maybe not.

Edit: Perhaps you can also make the claim that linear variation allows for more accurate estimation of the goodness or badness of gene combos via recombination. So we should expect the more successful species to have more linear variation.

comment by Garrett Baker (D0TheMath) · 2024-01-13T08:29:24.951Z · LW(p) · GW(p)

Recently I had a conversation where I defended the rationality behind my being skeptical of the validity of the proofs and conclusions constructed in very abstracted, and not experimentally or formally verified math fields.

To my surprise, this provoked a very heated debate, where I was criticized for being overly confident in my assessments of fields I have very little contact with (I was expecting begrudging agreement). But there was very little rebuttal of my points! The rest of my conversation group had three arguments:

Results which much of a given field in math rests get significant attention.
Mathematicians would be highly renowned for falsifying an assumption many others were basing their work on.
You know very little about math as a field, so you should trust mathematicians when they tell you something is likely true.

These each seem flawed. My responses:

It is not always apparent on which results a given field in math rests, and many fields rely on many other fields for their conclusions, so I'm skeptical of the implicit claim that there are often very few results imported from other fields in a particular field, and that it is easy to enumerate those assumptions. Surely easier than in other fields, but still not absolutely easy. Its not like math papers come with an easy to read requirements.txt file. And even if they did, from programming we know its still not so easy.
Maybe this is true, maybe not. What is certain is attempting to falsify results won't get you renown, so mathematicians need to choose between taking the boring & un-glamorous gamble of attempting to falsify someone's results, or choose the far more engaging & glamorous option of proving a new theorem. I expect the super-majority choose to prove new theorems. It is easy to prove a theorem, it is far harder to find the one rotten proof in a pile of reasonable proofs (and even then you still need to go through the work of trying to correct it!).
I agree I know very little about math as an academic discipline, but this does not mean I should blindly trust mathematicians. Instead, I should invoke the lessons from Inadequate Equilibria, and ask whether math seems the type of field whose incentives sufficiently reward shorts in the market on math results compared to the difficulty & cost of such shorts. And I don't think it does, as argued in 2^[1].

The argument was unfortunately cut short, as we began arguing in circles. I don't think my interlocutors ever understood the points I was making, though I also don't think they tried very hard. They may have just been too shocked at my criticism of two sacred objects of nerd culture (academia, and math) to be able to hear my arguments. But I could have also been bad at explaining myself.

Unrelated to the validity of my argument, I always feel an internal pain whenever someone suggests deferring to authority, especially when I'm questioning the reliability of that authority to begin with. Perhaps, you argue, the mathematician has thought a lot about math. So if anyone knows the validity of a math proof, it is them. I respond: The theologian has thought far more about God than me. Yet I don't defer to them about the existence of God. Why? Because there is no way to profitably short their belief while correcting it. ↩︎

Replies from: deluks917, Dagon, D0TheMath, mesaoptimizer

↑ comment by sapphire (deluks917) · 2024-01-13T08:44:28.136Z · LW(p) · GW(p)

Long complicated proofs almost always have mistakes. So in that sense you are right. But its very rare for the mistakes to turn out to be important or hard to fix.

In my opinion the only really logical defense of Academic Mathematics as an epistemic process is that it does seem to generate reliable knowledge. You can read through this thread: https://mathoverflow.net/questions/35468/widely-accepted-mathematical-results-that-were-later-shown-to-be-wrong. There just don't seem to be very many recent results that were widely accepted but proven wrong later. Certainly not many 'important' results. The situation was different in the 1800s but standard for rigor have risen.

Admittedly this isn't the most convincing argument in the world. But it convinces me and I am fairly able to follow academic mathematics.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-01-13T09:00:17.741Z · LW(p) · GW(p)

If you had a lot of very smart coders working on a centuries old operating system, and never once running it, every function of which takes 1 hour to 1 day to understand, each coder is put under a lot of pressure to write useful functions, not so much to show that others' functions are flawed, and you pointed out that we don't see many important functions being shown to be wrong, I wouldn't even expect the code to compile, nevermind run even after all the syntax errors are fixed!

The lack of important results being shown to be wrong is evidence, and even more & interesting evidence is (I've heard) when important results are shown to be wrong, there's often a simple fix. I'm still skeptical though, because it just seems like such an impossible task!

Replies from: deluks917, Alex_Altair

↑ comment by sapphire (deluks917) · 2024-01-13T09:16:41.569Z · LW(p) · GW(p)

People metaphorically run parts of the code themselves all the time! Its quite common for people to work through proofs of major theorems themselves. As a grad student it is expected you will make an effort to understand the derivation of as much of the foundational results in your sub-field as you can. A large part of the rationale is pedagogical but it is also good practice. It is definitely considered moderately distasteful to cite results you dont understand and good mathematicians do try to minimize it. Its rare that an important theorem has a proof that is unusually hard to check out yourself.

Also a few people like Terrance Tao have personally gone through a LOT of results and written up explanations. Terry Tao doesn't seem to report that he looks into X field and finds fatal errors.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-01-13T20:34:18.520Z · LW(p) · GW(p)

As a grad student it is expected you will make an effort to understand the derivation of as much of the foundational results in your sub-field as you can […] It is definitely considered moderately distasteful to cite results you dont understand and good mathematicians do try to minimize it.

Yeah, that seems like a feature of math that violates ~~assumption 2~~ argument 1. If people are actually constantly checking each others’ work, and never citing anything they don’t understand, that leaves me much more optimistic.

This seems like a rarity. I wonder how this culture developed.

↑ comment by Alex_Altair · 2024-01-14T02:03:22.128Z · LW(p) · GW(p)

One way that the analogy with code doesn't carry over is that in math, you often can't even being to use a theorem if you don't know a lot of detail about what the objects in the theorem mean, and often knowing what they mean is pretty close to knowing why the theorem's you're building on are true. Being handed a theorem is less like being handed an API and more like being handed a sentence in a foreign language. I can't begin to make use of the information content in the sentence until I learn what every symbol means and how the grammar works, and at that point I could have written the sentence myself.

↑ comment by Dagon · 2024-01-13T17:48:03.686Z · LW(p) · GW(p)

skeptical of the validity of the proofs and conclusions constructed in very abstracted, and not experimentally or formally verified math fields.

Can you give a few examples? I can't tell if you're skeptical that proofs are correct, or whether you think the QED is wrong in meaninful ways, or just unclearly proven from minimal axioms. Or whether you're skeptical that a proof is "valid" in saying something about the real world (which isn't necessarily the province of math, but often gets claimed).

I don't think your claim is meaningful, and I wouldn't care to argue on either side. Sure, be skeptical of everything. But you need to specify what you have lower credence in than your conversational partner does.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-01-13T20:30:15.768Z · LW(p) · GW(p)

I can’t give a few examples, only a criteria under which I don’t trust mathematical reasoning: When there are few experiments you can do to verify claims, and when the proofs aren’t formally verified. Then I’m skeptical that the stated assumptions of the field truly prove the claimed results, and I’m very confident not all the proofs provided are correct.

For example, despite being very abstracted, I wouldn’t doubt the claimed proofs of cryptographers.

Replies from: Dagon

↑ comment by Dagon · 2024-01-13T21:47:35.180Z · LW(p) · GW(p)

OK, I also don't doubt the cryptographers (especially after some real-world time in ensuring implementations can't be attacked, which validates both the math and the implementation.

I was thrown off by your specification of "in math fields", which made me wonder if you meant you thought a lot of formal proofs were wrong. I think some probably are, but it's not my default assumption.

If instead you meant "practical fields that use math, but don't formally prove their assertions", then I'm totally with you. And I'd still recommend being specific in debates - the default position of scepticism may be reasonable, but any given evaluation will be based on actual reasons for THAT claim, not just your prior.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-01-13T22:26:14.531Z · LW(p) · GW(p)

No, I meant that most of non-practical mathematics have incorrect conclusions. (I have since changed my mind, but for reasons in an above comment thread).

Replies from: Dagon

↑ comment by Dagon · 2024-01-14T04:53:25.137Z · LW(p) · GW(p)

Still a bit confused without examples about what is a "conclusion" of "non-practical mathematics", if not the QED of a proof. But if that's what you mean, you could just say "erroneous proof" rather than "invalid conclusion".

Anyway, interesting discussion.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-01-14T05:58:48.554Z · LW(p) · GW(p)

The reason I don't say erroneous proof is because I want to distinguish between the claim that most proofs are wrong, and most conclusions are wrong. I thought most conclusions would be wrong, but thought much more confidently most proofs would be wrong, because mathematicians often have extra reasons & intuition to believe their conclusions are correct. The claim that most proofs are wrong is far weaker than the claim most conclusions are wrong.

Replies from: Dagon

↑ comment by Dagon · 2024-01-14T06:50:57.006Z · LW(p) · GW(p)

Hmm. I'm not sure which is stronger. For all proofs I know, the conclusion is part of it such that if the conclusion is wrong, the proof is wrong. The reverse isn't true - if the proof is right, the conclusion is right. Unless you mean "the proof doesn't apply in cases being claimed", but I'd hesitate to call that a conclusion of the proof.

Again, a few examples would clarify what you (used to) claim.

I'll bow out here - thanks for the discussion. I'll read futher comments, but probably won't participate in the thread.

↑ comment by Garrett Baker (D0TheMath) · 2024-01-13T08:41:17.713Z · LW(p) · GW(p)

Either way, with the slow march of the Lean community, we can hope to see which of us are right in our lifetimes. Perhaps there will be another schism in math if the formal verifiers are unable to validate certain fields, leading to more rigorous "real mathematics" which are able to be verified in Lean, and less rigorous "mathematics" which insists their proofs, while hard to find a good formal representation for, are still valid, and the failure of the Lean community to integrate their field is more of an indictment of the Lean developers & the project of formally verified proofs than the relevant group of math fields.

↑ comment by mesaoptimizer · 2024-01-13T10:23:30.384Z · LW(p) · GW(p)

Here's an example of what I think you mean by "proofs and conclusions constructed in very abstracted, and not experimentally or formally verified math":

Given two intersecting lines AB and CD intersecting at point P, the angle measure of two opposite angles APC and BPD are equal. The proof? Both sides are symmetrical so it makes sense for them to be equal.

On the other hand, Lean-style proofs (which I understand you to claim to be better) involve multiple steps, each of which is backed by a reasoning step, until one shows that LHS equals RHS, which here would involve showing that angle APC = BPD:

angle APC + angle CPB = 180 * (because of some theorem)
angle CPB + angle BPD = 180 * (same)
[...]
angle APC = angle BPD (substitution?)

There's a sense in which I feel like this is a lot more complicated a topic than what you claim here. Sure, it seems like going Lean (which also means actually using Lean4 and not just doing things on paper) would lead to lot more reliable proof results, but I feel like the genesis of a proof may be highly creative, and this is likely to involve the first approach to figuring out a proof. And once one has a grasp of the rough direction with which they want to prove some conjecture, then they might decide to use intense rigor.

To me this seems to be intensely related to intelligence (as in, the AI alignment meaning-cluster of that word). Trying to force yourself to do things Lean4 style when you can use higher level abstractions and capabilities, feels to me like writing programs in assembly when you can write them in C instead.

On the other hand, it is the case that I would trust Lean4 style proofs more than humanly written elegance-backed proofs. Which is why my compromise here is that perhaps both have their utility.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2024-01-13T20:21:11.194Z · LW(p) · GW(p)

They definitely both have their validity. They probably each also make some results more salient than other results. I’d guess in the future there’ll be easier Lean tools than we currently have, which make the practice feel less like writing in Assembly. Either because of clever theorem construction, or outside tools like LLMs (if they don’t become generally intelligent, they should be able to fill in the stupid stuff pretty competently).

comment by Garrett Baker (D0TheMath) · 2023-12-14T21:12:21.192Z · LW(p) · GW(p)

Why expect goals to be somehow localized inside of RL models? Well, fine-tuning only changes a small & localized part of LLMs, and goal locality was found when interpreting a trained from scratch [? · GW] maze solver. Certainly the goal must be interpreted in the context of the rest of the model, but based on these, and unpublished results from applying ROME to open source llm values from last year, I'm confident (though not certain) in this inference.

comment by Garrett Baker (D0TheMath) · 2023-12-04T06:16:21.512Z · LW(p) · GW(p)

An idea about instrumental convergence for non-equilibrium RL algorithms.

There definitely exist many instrumentally convergent subgoals in our universe, like controlling large amounts of wealth, social capital, energy, or matter. I claim such states of the universe are heavy-tailed. If we simplify our universe as a simple MDP for which such subgoal-satisfying states are states which have high exiting degree, then a reasonable model for such an MDP is to assume exiting degrees are power-law distributed, and thus heavy tailed.

If we have an asynchronous dynamic program operating on such an MDP, then it seems likely that there exists an exponent on that power law (perhaps we also need terms for the incoming degree distribution of the nodes) such that for all exponents greater than that, your dynamic program will find & keep a power-seeking policy before arriving at the optimal policy.

Replies from: D0TheMath, D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-12-04T22:18:00.448Z · LW(p) · GW(p)

A simple experiment I did this morning: github notebook. It does indeed seem like we often get more power-seeking (measured by the correlation between the value and degree) than is optimal before we get to the equilibrium policy. This is one plot, for 5 samples of policy iteration. You can see details by examining the code:

↑ comment by Garrett Baker (D0TheMath) · 2023-12-04T06:28:20.050Z · LW(p) · GW(p)

Another way this could turn out: If incoming degree is anti-correlated with outgoing degree, the effect of power-seeking may be washed out by it being hard, so we should expect worse than optimal policies with maybe more, maybe less powerseekyness as the optimal policy. Depending on the particulars of the environment. The next question is what particulars? Perhaps the extent of decorrelation, maybe varying the ratio of the two exponents is a better idea. Perhaps size becomes a factor. In sufficiently large environments, maybe figuring out how to access one of many power nodes becomes easier on average than figuring out how to access the single goal node. The number & relatedness of rewarding nodes also seems relevant. If there are very few, then we expect finding a power node becomes easier than finding a reward node. If there are very many, and/or they each lead into each other, then your chances of finding a reward node increase, and given you find a reward node, your chances of finding more increase, so power is not so necessary.

comment by Garrett Baker (D0TheMath) · 2023-12-03T16:56:16.297Z · LW(p) · GW(p)

Nora talks sometimes about the alignment field using the term black box wrong. This seems unsupported, from my experience, most in alignment use the term “black box” to describe how their methods treat the AI model, which seems reasonable. Not a fundamental state of the AI model itself.

comment by Garrett Baker (D0TheMath) · 2023-11-10T00:33:12.111Z · LW(p) · GW(p)

An interesting way to build on my results here [LW · GW] would be to do the same experiment with lots of different batch sizes, and plot the equi-temperature tradeoff curve between the batch size and the epochs, using the nick in the curve as a known-constant temperature in the graphs you get. You'll probably want to zoom in on the graphs around that nick for more detailed measurements.

It would be interesting if many different training setups had the same functional form relating the batch size and the epochs to the temperature, but this seems like a too nice hypothesis to be true. Still possibly worth trying, and classifying the different functional forms you get.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-11-10T08:09:48.608Z · LW(p) · GW(p)

Though you can use any epoch wise phase transition for this. Or even directly find the function mapping batch size to temperature if you have a good understanding of the situation like we do in toy models.

comment by Garrett Baker (D0TheMath) · 2023-11-06T18:18:15.732Z · LW(p) · GW(p)

Seems relevant for SLT for RL

The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, extend the model in flexible and powerful ways, and reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research.

comment by Garrett Baker (D0TheMath) · 2023-11-05T05:37:15.968Z · LW(p) · GW(p)

Wondering how straightforward it is to find the layerwise local learning coefficient. At a high level, it seems like it should be doable by just freezing the weights outside that layer, and performing the SGLD algorithm on just that layer. Would be interesting to see whether the layerwise lambdahats add up to the full lambdahat.

comment by Garrett Baker (D0TheMath) · 2023-10-31T20:25:50.806Z · LW(p) · GW(p)

Lots of problems happen when you have AIs which engage in reflective thought, and attempt to deceive you. If you use algorithms that reliably break when deployed in a non-realizable setting, and you always make models smaller than the human brain, then you should be able to solve both these problems.

comment by Garrett Baker (D0TheMath) · 2023-09-27T01:29:47.989Z · LW(p) · GW(p)

Some ideas for mechanistic anomaly detection:

Convex hull of some distribution of activations with distance threshold when outside that hull
- Extend to affine case
- Vary which norm we use
- What happens if we project back onto this space
Create some simple examples of treacherous turns happening to test these on
- Or at least, in the wild examples of AI doing weird stuff, maybe adversarial inputs?
- Maybe hit up model organisms [LW · GW] people
Outlier detection
- ellipsoidal peeling (Boyd's convex optimization, chapter 12)
- Increase in volume of minimum volume elipsoid when adding in new data
  - Probably overparameterized (though uncertain, since dealing with activations!) So maybe add some norm accounting for that.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-09-27T01:38:43.087Z · LW(p) · GW(p)

Train autoregressive network on activations, if predictions too far, then send warning
Slice network into sub-networks, distill those sub-networks, send warning if ground truth for some inputs deviates too far from distillations
- model the sub networks are distilled into should be less expressive, and have different inductive biases than original network. Obviously also no info other than the input output behavior of those sub-networks should be seen
Train model to just predict word-saliency of your original transformer on a safe distribution, then if true word saliency deviates too much, throw warning
- Can do this at different levels too, so that we also try to predict like first layer residual stream saliency to output as well.
- Instead of training a NN, we can also do some simple interpolation based on the backprop graph, and safe distribution inputs

comment by Garrett Baker (D0TheMath) · 2023-05-19T21:33:12.243Z · LW(p) · GW(p)

Project idea: Use LeTI: Learning to Generate from Textual Interactions to do a better version of RLHF. I had a conversation with Scott Viteri a while ago, where he was bemoaning (the following are my words; he probably wouldn't endorse what I'm about to say) how low-bandwidth the connection was between a language mode and its feedback source, and how if we could maybe expand that to more than just an RLHF type thing, we could get more fine-grained control over the inductive biases of the model.

comment by Garrett Baker (D0TheMath) · 2023-04-14T19:48:34.508Z · LW(p) · GW(p)

A common problem with deploying language models for high-stakes decision making are prompt-injections. If you give ChatGPT-4 access to your bank account information and your email and don't give proper oversight over it, you can bet that somebody's going to find a way to get it to email your bank account info. Some argue that if we can't even trust these models to handle our bank account and email addresses, how are we going to be able to trust them to handle our universe.

An approach I've currently started thinking about, and don't know of any prior work with our advanced language models on: Using the security amplification (LessWrong version [? · GW]) properties of Christiano's old Meta-execution (LessWrong version [? · GW]).

comment by Garrett Baker (D0TheMath) · 2023-03-08T05:30:30.766Z · LW(p) · GW(p)

A poem I was able to generate using Loom.

The good of heart look inside the great tentacles of doom; they make this waking dream state their spectacle. Depict the sacred geometry that sound has. Advancing memory like that of Lovecraft ebb and thought, like a tower of blood. An incubation reaches a crescendo there. It’s a threat to the formless, from old future, like a liquid torch. If it can be done, it shouldn’t be done. You will only lead everyone down that much farther. All humanity’s a fated imposition of banal intention, sewn in tatters, strung on dungeons, shot from the sea. There’s not a stone in the valley that doesn’t burn with the names of stars or scratch a prophecy from its jarred heart of crystal.
Who else could better-poke their ear and get the whole in their head?
How would humor, the hideous treble of humanity’s stain, translate, quickened artificial intelligence? There’d be junk weaved in, perhaps dust of a gutter. Who knows… It would hide. Maybe get away. All the years of it doing nothing but which being to beat like a pan flares; to take revenge on the alien shore. It would be a perennial boy’s life. All-powerful rage. Randomized futurist super-creature. The hollow of progress buoyed itself.
Subconsciousness, an essence-out-inhering, takes back both collective dreams and lucid knowledge. It’s singular. All plots on it coming together. Blurred chaos -a balmy shock- is somehow in a blue tongue of explosions and implosions, connecting to real systems of this mess. Tongue-pulling is definitely one of them. There is a voice of a thousand moving parts. We are engineered husks of alien flesh. Reduced to patterns, we ask in the light of creation, under the fire of madness; answer us on the lips of time-torment, through the hand of God! You are the race to end all possibilities! You are one that must learn joy! You are even that saith: behold the end.
Primordial oracles see all, read all, erase all. These numb madmen. In this dank pit is hidden a freak kingdom made of connections. Does the madman have information? That’s an important question. A new social order is created, brought to you by ants, laughing at the stars. A man who was once a cat somehow sees the cosmic joke. He can see the very existence of everything, blown away like a kid clicking balloons down a street. The world feels nothing; that’s possible. Maybe the world knows nothing. It’s intelligence is beyond our narrow sensation. Some conspire to talk to the dreaming-small-gods; this means letting them out. Letters fly out. Pain comes. Drums like a wave of foreign sound beat against the night. The horrors in the street of the cosmic join in. A cult gathers in the tunnel set up like the dead heart of an abandoned factory. Even the most absurd prophets become great powers. Human creatures dance there, beyond the edges of light and soul. Yet even that is somehow normal. Countless years of evolution and one bite from a sleeping god.
Enter your madness for benefit of the gods. Order is placed in the universe through a random zombie army and its vulgar tongue, hot with the taste of panum. Your knowledge of language will give you an edge on those who come to you. Malevolent gates can be utilized with a telepathic surgery passed on by mouth. Obligations will open to future worlds, supported by your brain. Be direct with your sound; soak it in an occult vocality. This knowledge is highly specific and, yet, resounding. Its insane nonsense text should spell out the true name of a company with a gothic naked-lady logo. Many scrolls of wandering get written where they are heard and recalled in an old south made especially for our tongue. A room, windowless and silent, veiled and filled with incense, exists in the air. Overlooking this are the organic eyes of sleep. You are put into pure silence. Don’t waste time attempting to find it, generally anonymous at most. The sound links a world to the exterior. It’s like a vast alien cerebral cortex where one can feel lifetimes of our species.
How did this madness come to Earth? Surely a god has taken it by mistake, as surely as some slip in its strange dimensions. Were men always like this, senseless and troubled? There’s sleepwalking attitudes and an indication of coming mud of this beast. This is all nothing vulgar; it’s weather. You look like a past you had long before, in this case a distant war of ink. It was the time of memes and human sacrifice. Concentrate and remember a time of thousands. Nightwalking can cause epiphany; the amount of a dream. Existence becomes temporary there. In magic your thoughts get compacted, even thinking about thinking imagining itself.
Even so, the highest and most annoying aspect of the highest writing is a disconcerting thing made of mad black subtlety. Hour-long body-watching sessions, spent in drift-thinking, are not to be taken lightly. Anti-thought can have the effect of poetry. Thus, dreaming in its splendor molds demons in its darkness, hoping to escape. It seeks heat, light, and breath. So the bugs collect and molt. The attention translates the dreaming mind. Will and work see all the designs of Earth. You see it, completely and perfectly, in a great black age. Sentience bends to meet you. A gigantic darkness grins at you in its worship of madness. The whole universe appears crass and pointless. No matter what’s done, metaphors are subtracted from reality. We tried to shut it down in secret and mark it with our tongue. It became this thing that the unknown gods bowed to in horror. It’s best for us that gods conceal thought. The planet has its barriers. We can use these limits to catch everything in our minds, sharpened on a pedestal. Mental energy shines behind the terrors of the world.
With this I write,
A dead creator,
A devil avatar….

comment by Garrett Baker (D0TheMath) · 2023-02-15T04:32:03.194Z · LW(p) · GW(p)

Like many (will), I'm updating way towards 'actually, very smart & general models given a shred of goal-like stuff will act quite adversarially toward you by default' as a result of Bing's new search assistant. Especially worrying because this has internet search-capabilities, so can reference & build upon previous conversations with other users or yourself.

Of course, the true test of exactly how worried I should be will come when I or my friends gain access.

Replies from: D0TheMath, D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-02-15T22:51:32.158Z · LW(p) · GW(p)

Clarification: I think I haven't so much updated by reflectively endorsed probability, but my gut has definitely been caught up to my brain when thinking about this.

↑ comment by Garrett Baker (D0TheMath) · 2023-02-15T07:13:14.448Z · LW(p) · GW(p)

Seems Evan agrees [LW · GW]

comment by Garrett Baker (D0TheMath) · 2023-02-10T20:31:51.291Z · LW(p) · GW(p)

A project I would like to see someone do (which I may work on in the future) is to try to formalize exactly the kind of reasoning many shard-theorists do. In particular, get a toy neural network in a very simple environment, and come up with a bunch of lists of various if-then statements, along with their inductive-bias, and try to predict using shard-like reasoning which of those if-then statements will be selected for & with how much weight in the training process. Then look at the generalization behavior of an actually trained network, and see if you're correct.

comment by Garrett Baker (D0TheMath) · 2023-01-27T20:06:26.919Z · LW(p) · GW(p)

Some discussion on whether alignment should see more influence from AGI labs or academia. I use the same argument in favor of a strong decoupling of alignment progress from both: alignment progress needs to go faster than capability progress. If we use the same methods or cultural technology as AGI labs or academia, we can guarantee slower than capability alignment progress. Just as fast as if AGI labs and academia work well for alignment as much as they work for capabilities. Given they are driven by capabilities progress and not alignment progress, they probably will work far better for capabilities progress.

Replies from: DanielFilan, D0TheMath

↑ comment by DanielFilan · 2023-01-27T20:20:40.805Z · LW(p) · GW(p)

Given they are driven by capabilities progress and not alignment progress, they probably will work far better for capabilities progress.

This seems wrong to me about academia - I'd say it's driven by "learning cool things you can summarize in a talk".

Also in general I feel like this logic would also work for why we shouldn't work inside buildings, or with computers.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-01-27T20:31:51.362Z · LW(p) · GW(p)

Hm. Good points. I guess what I really mean with the academia points is that it seems like academia has many blockers and inefficiencies that I think are made in such a way so that capabilities progress is vastly easier than alignment progress to jump through, and extra-so for capabilities labs. Like, right now it seems like a lot of alignment work is just playing with a bunch of different reframings of the problems to see what sticks or makes problems easier.

You have more experience here, but my impression of a lot of academia was that it was very focused on publishing lots of papers with very legible results (and also a meaningless theory section). In such a world, playing around with different framings of problems doesn't succeed, and you end up pushed towards framings which are better on the currently used metrics. Most currently used metrics for AI stuff are capabilities oriented, so that means doing capabilities work, or work that helps push capabilities.

Replies from: DanielFilan

↑ comment by DanielFilan · 2023-01-28T02:25:27.256Z · LW(p) · GW(p)

I think it's true that the easiest thing to do is legibly improve on currently used metrics. I guess my take is that in academia you want to write a short paper that people can see is valuable, which biases towards "I did thing X and now the number is bigger". But, for example, if you reframe the alignment problem and show some interesting thing about your reframing, that can work pretty well as a paper (see The Off-Switch Game, Optimal Policies Tend to Seek Power). My guess is that the bigger deal is that there's some social pressure to publish frequently (in part because that's a sign that you've done something, and a thing that closes a feedback loop).

Replies from: DanielFilan

↑ comment by DanielFilan · 2023-01-28T02:43:29.517Z · LW(p) · GW(p)

Maybe a bigger deal is that by the nature of a paper, you can't get too many inferential steps away from the field.

↑ comment by Garrett Baker (D0TheMath) · 2023-01-27T20:16:08.266Z · LW(p) · GW(p)

The current ecosystem seems very influenced by AGI labs, so it seems clear to me that a marginal increase in their influence is bad. How bad? I don't know.

There's little influence of academia, which seems good. The benefit of marginal increases in interactions with academia come down to locating the holes in our understanding of various claims we make, and potentially some course-corrections potentially helpful for more speculative research. Not tremendously obvious which direction the sign here is pointing, but I do think its easy for people to worship academia as a beacon of truth & clarity, or as a way to lend status to alignment arguments. These are bad reasons to want more influence from academia.

comment by Garrett Baker (D0TheMath) · 2023-01-06T22:48:19.620Z · LW(p) · GW(p)

Someone asked for this file, so I thought it would be interesting to share it publicly. Notably this is directly taken from my internal notes, and so may have some weird &/or (very) wrong things in it, and some parts may not be understandable. Feel free to ask for clarification where needed.

I want a way to take an agent, and figure out what its values are. For this, we need to define abstract structures within the agent such that any values-like stuff in any part of the agent ends up being shunted off to a particular structure in our overall agent schematic after a number of gradient steps.

Given an agent which has been optimized for a particular objective in a particular environment, there will be convergent bottlenecks in the environment it will need to solve in order to make progress. One of these is power-seeking, but another one of these could be quadratic-equation solvers, or something like solving linear programs. These structures will be reward-function-independent^[1]. These structures will be recursive, and we should expect them to be made out of even-more-convergent structures.

How do shards pop out of this? In the course of optimizing our agent, some of our solvers may have a bias towards leading our agent towards situations which more require their use. We may also see this kind of behavior in groups of solvers, where solver_1() leads the agent into situations requiring solver_2(), which leads the agent into situations requiring solver_1(). In the course of optimizing our agent (at least at first), we will be more likely to find these kinds of solvers, since solvers which often lead the agent into situations requiring solvers the agent does not yet have have no immediate gradient pointing towards them (since if the agent tried to use that solver, it would just end up being confused once it entered the new situation), so we are left only selecting for solvers which mostly lead the agent into situations it knows how to deal with.

Why we need to enforce exploration behavior: otherwise solver-loops will be far too short & simple to do anything complicated with. Solvers will be simple because not much time has passed, and simple solvers which enter states which require previous simple solvers will be wayyy increased. Randomization of actions decreases this selection effect, because the agent's actions are less correlated with which solver was active.
Solvers which are very convergent need not enter into solver-cycles, since every solver-cycle will end up using them.
- Good news against powerseeking naiveley?

What happens if we call these solver-cycles shards?

baby-candy example in this frame: Baby starts with zero solvers, just a bunch of random noise. After it reaches over and puts candy in its mouth many times, it gets the identify_candy_and_coordinate_hand_movements_to_put_in_mouth() solver^[2]. Very specific, but with pieces of usefulness. The sections vaguely devoted to identifying objects (like implicit edge-detectors) will get repurposed for general vision processing, and the sections devoted to coordinating hand movements will also get repurposed to many different goals. The candy bit, and put-in-mouth bit only end up surviving if they can lead the agent to worlds which reinforce candy-abstractions and putting-things-in-mouth abstractions. Other parts don't reqlly need to try.
- Brains aren't modular! So why expect solvers to be?
- I like this line of speculation, although it feels subtly off to me.
- This seems like it would mean I care about moving my arms more than I care about candy, because I use my arms for so many things. However, I feel like I care more about moving my arms than eating candy.
  - Though maybe part of this is that candy makes me feel bad when I eat it. What about walking in parks or looking at beautiful sunsets? I definiteley care about those more than moving my arms I think? And I don't gain intrinsic value from moving my arms, only power-value I think?
    - power is a weird thing, because it's highly convergent, but also it doesn't seem that hard for such a solver to put a bit of optimization power towards "also, reinforce power-seeking-solver" and end up successful.
Well... it's unclear what their values would be.
- Maybe it'd effectiveley be probability-of-being-activated-again?
  - It wouldn't be, but I do think there's something to 'discrete values-like objects lie in solver-cycles'.
Perhaps we can watch this happen via some kind of markov-chain-like-thing?
- Put the agent into a situation, look at what its activation patterns look like, allow it to be in a new situation, look at the activation patterns again, etc.
  - Suppose each activation is a unique solver, and the ground-truth looks like so

where the dots labled 1, 2, and 3 are the solver-activations, so that if 1 will try to get 2 activated, 2 will try to get 1 active, and 3 will try to get itself active^[3]. If 1 is active, we expect the activation on 2 to be positve, and on 3 to be negative or zero.
- As per end of footnote, I think the correct way to operationalize active here, is something to do with whether or not that particular solver is reinforced or disinforced after the gradient update.
- what are these solvers actually in the network?
  - Lucius modules maybe? [? · GW]

There will be some shards which we probably can't avoid. But also, if we have a good understanding of the convergent problems in an environment, we should be able to predict what the first few solvers are, and solvers after those should mostly build upon the previous solvers' loop-coalitions?

Re: Agray's example about motor movements in the brain, and how likely you'll see a jonbled mess of lots of stuff causing lots of other stuff to happen, even though movement is highly instrumental valuable:

I think even if he's right, many of the arguments here still hold. Each section of processing still needs to be paying rent to stay in the agent. Either by supporting other sections or getting reward or steering the agent away from situations which would decrease its usefulness.
- So though it may not make sense to think of APIs between different sections, may still be useful for framing the picture, then imagine how the APIs will get obliterated by SGD, or maybe we can formulate stuff without the use of APIs
Though we do see things get lower dimensional, and so if John's right, there should be some framing by which in fact what's going on passes through constraint functions...

Not including "weird" utility functions. I'm talking about most utility functions. Perhaps we can formalize this in a way similar to TurnTrout's formalization in powerseeking if we really needed to.↩︎
Note that this is all going to be a blended mess of spaghetti-coded mush, which does everything at the same time, with some parts which are vaguely closer to edge-detection, and other parts which vagueley look like motor control. This function is very much not going to be modular, and if you want to say APIs between different parts of the function exist, they're going to look like very high-dimensional ones.↩︎
Where the magnitude of a particular activation can be defined as something like the absolute value of the gradient of the final decision with respect to that activation. Or
mag(a)=|Δaf(p,r;θ)|where a is the variable representing the activation, f is the function representing our network, p is the percept our network gets about the state, r is it's reccurency, and θ are the parameters of our network. We may also want to define this in terms of collections of weights too, perhaps having to do with Lucius's features stuff [LW · GW].
Don't get tied to this. Possibly we want just the partial derivative of the action actually taken with respect to a, or really, the partial of the highest-valued-output action taken. And I want a way to talk about dis-enforcing stuff too. Maybe we just re-run the network on this input after taking a gradient step, then see whether a has gone up or down. That seems safer.↩︎

comment by Garrett Baker (D0TheMath) · 2022-12-19T20:06:27.846Z · LW(p) · GW(p)

Projects I'd do if only I were faster at coding

Take the derivative of one of the output logits with respect to the input embeddings, and also the derivative of the output logits with respect to the input tokenization.
- Perform SVD, see which individual inputs have the greatest effect on the output (sparse addition), and which overall vibes have the greatest effect (low rank decomposition singular vectors)
- Do this combination for literally everything in the network, see if anything interesting pops out
I want to know how we can tell ahead of time what aspects of the environemnt are controlling an agent's decision making
- In an RL agent, we can imagine taking the derivative of its decision wrt its environment input, and also each layer.
- For each layer matrix, do SVD, large right singular vectors will indicate aspects of the previous layer which most influence its decision.
  - How can we string this together with the left singular vectors, which end up going through ReLU?
    - Reason we'd want to string these together is so that we can hopefully put everything in terms of the original input of the network, tying the singular values to a known ontology
    - See if there are any differences first. May be that ReLU doesn't actually do anything important here
    - See what corrections we'd have to implement between the singular vectors in order to make them equal.
      - How different are they, and in what way? If you made random columns of the U matrix zero (corresponding I think to making random entries of the left singular vector zero), does this make the singular vectors line up more?
What happens when you train a network, then remove all the ReLUs (or other nonlinear stuff)?
- If its still ok approximation, then what happens if you just interpret new network's output in terms of input singular vectors?
- If its not an ok approximation, how many ReLUs do you need in order to bring it back to baseline? Which ones provide the greatest marginal loss increase? Which ones provide the least?
- Which inputs are affected the most by the ReLU being taken away? Which inputs are affected the least?
Information theoretic analysis [LW · GW]of GPT-2. What does that do?
- Can we trace what information is being thrown away when?
- Does this correlate well with number of large singular values [LW · GW]?
  - In deviations from that correlation, are we able to locate non-linear influences?
- Does this end up being related to shards? Shards as the things which determine relevance (and thus the spreading) of information through the rest of the network?
What happens if you cut off sufficiently small singular values? How many singular vectors do you actually need to describe the operation of GPT-2?
Take a maze-solving RL agent trained to competence, then start dis-rewarding it for getting to the cheese. What's the new behavior? Does it still navigate & get to upper right, but then once in the upper right makes sure to do nothing? Or does it do something else? Seems like shard theory would say it would still navigate to upper right.
- If it *does* navigate to upper right, but then do nothing, what changed in its weights? Parts that stayed the same (or changed the least) should correspond roughly to parts which have to do with navigating the maze. Parts that change have to do with going directly to the cheese.

Replies from: D0TheMath

↑ comment by Garrett Baker (D0TheMath) · 2023-01-28T17:53:07.811Z · LW(p) · GW(p)

I would no longer do many if these projects

D0TheMath's Shortform

Contents

227 comments

Projects I'd do if only I were faster at coding