Comments
Part 3: not my type
should probably be "Part 2" instead
McDonald et al. (2016) had yeast evolve with and without sex for 1000 generations
... wait, yeast can reproduce sexually or asexually?
looks at paper
OK, they figured out how to get yeast to have sex? Seems wild.
Some talks are visible on YouTube here
Did this ever get written up? I'm still interested in it.
Redwood Research?
It gets weirder. For some reason, squirting cold water into the left ear canal wakes up the revolutionary.
This link gets me "page not found", both here and on the oldest saved copy on the Internet Archive. That said, some papers are available here, here, and here if you're at a university that pays for this sort of stuff, and they're generally linked to from this page. I'll be adding these links to the Wayback Machine; unfortunately, when I go to archive.is I get caught in some sort of weird loop of captchas and am unable to actually get to the site.
Most of your comment seems to be an appeal to modest epistemology. We can in fact do better than total agnosticism about whether some arguments are productive or not, and worth having more or less of on the margin.
Note that the more you believe that your commenters can tell whether some arguments are productive or not, and worth having more or less of on the margin, the less you should worry as mods about preventing or promoting such arguments (altho you still might want to put them near the top or bottom of pages for attention-management reasons).
Site admins, would it be possible to see the edit history of posts, perhaps in diff format (or at least make that a default that authors can opt out of)? Seems like something I want in a few cases:
- controversial posts like these
- sometimes mods edit my posts and I'd like to know what they edited
Is your point that "being asked to not hang out with low value people" is inherently abusive in a way worse than everything else going on in that list?
Yes
Spencer responded to a similar request in the EA forum. Copy-pasting the response here in quotes, but for further replies etc. I encourage readers to follow the link:
Yes, here two examples, sorry I can’t provide more detail:
-there were claims in the post made about Emerson that were not actually about Emerson at all (they were about his former company years after he left). I pointed this out to Ben hours before publication and he rushed to correct it (in my view it’s a pretty serious mistake to make false accusations about a person, I see this as pretty significant)!
-there was also a very disparaging claim made in the piece (I unfortunately can’t share the details for privacy reasons; but I assume nonlinear will later) that was quite strongly contradicted by a text message exchange I have
Sorry, I was using "normal" to mean "not abusive". Even in weird and atypical environments, I find it hard to think of situations where "don't hang out with your family" is an acceptable ask (with the one exception listed in my comment).
Sure, but wasn't there some previous occasion where Lightcone made a grant to people after they shared negative stories about a former employer (maybe to Zoe Curzi? but I can't find that atm)? If so, then presumably at some point you get a reputation for doing so.
I can guarantee you from my perspective as a coach that a good number of the items mentioned here are abjectly false.
What's an example of something that's false?
Being asked to... not hang out with low value people... is just one more thing that is consistent with the office environment.
Maybe I'm naive, but I don't think there's approximately any normal relationship in which it's considered acceptable to ask someone to not associate with ~anyone other than current employees. The closest example I can think of is monasticism, but in that context (a) that expectation is clear and (b) at least in the Catholic church there's a higher internal authority who can adjudicate abuse claims.
The nearly final draft of this post that I was given yesterday had factual inaccuracies that (in my opinion and based on my understanding of the facts) are very serious
Could you share examples of these inaccuracies?
Any reversible effect might be reversed. The question asks about the final effects of the mind
This talk of "reversible" and "final" effects of a mind strikes me as suspicious: for one, in a block / timeless universe, there's no such thing as "reversible" effects, and for another, in the end, it may wash out in an entropic mess! But it does suggest a rephrasing of "a first-order approximation of the (direction of the) effects, understood both spatially and temporally".
Is the idea that the set of "states" is the codomain of gamma?
assigns the set of states that remain possible once a node is reached.
What's bold S here?
I was at a party recently, and happened to meet a senior person at a well-known AI startup in the Bay Area. They volunteered that they thought "humanity had about a 50% chance of extinction" caused by artificial intelligence. I asked why they were working at an AI startup if they believed that to be true. They told me that while they thought it was true, "in the meantime I get to have a nice house and car".
This strikes me as the sort of thing one would say without quite meaning it. Like, I'm sure this person could get other jobs that also support a nice house and car. And if they thought about it, they could probably also figure this out. I'm tempted to chalk the true decision up to conformity / lack of confidence in one's ability to originate and execute consequentialist plans, but that's just a guess and I'm not particularly well-informed about this person.
The Manhattan Project brought us nuclear weapons, whose existence affects the world to this day, 79 years after its founding - I would call that a long timeline. And we might not have seen all the relevant effects!
But yeah, I think we have enough info to make tentative judgements of at least Klaus Fuchs' espionage, and maybe Joseph Rotblat's quitting.
I appreciate the multiple levels of summarization!
Would you go on a first date if there were a 20% chance that instead of an actual date someone would yell at you? It's obviously not a pleasant possibility, but IMO still worth it.
Research project idea: formalize a set-up with two reinforcement learners, each training the other. I think this is what's going on in baby care. Specifically: a baby is learning in part by reinforcement learning: they have various rewards they like getting (food, comfort, control over environment, being around people). Some of those rewards are dispensed by you: food, and whether you're around them, smiling and/or mimicking them. Also, you are learning via RL: you want the baby to be happy, nourished, rested, and not cry (among other things). And the baby is dispensing some of those rewards. (A toy sketch of this set-up follows the questions below.)
Questions:
- What even happens? (I think in many setups you won't get mutual wireheading)
- Do you get a nice equilibrium?
- Is there some good alignment property you can get?
- Maybe a terrible alignment property?
This could also be a model of humans and advanced algorithmic timelines or some such thing.
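Here's a minimal toy formalization (my own sketch, with a made-up reward structure, not a claim about the right model): two tabular Q-learners where each agent's action is to dispense or withhold the thing the other is rewarded by, each agent's state is what the other did last step, and dispensing carries a small effort cost.

```python
import numpy as np

# Toy version of "two reinforcement learners training each other" (my own sketch).
# Actions: 0 = withhold, 1 = dispense (feed / smile / attend).
# Each agent's reward is the other's action minus a small effort cost for dispensing,
# and each agent's state is the other's previous action, so each shapes the other's policy.

rng = np.random.default_rng(0)
n_actions, effort_cost = 2, 0.2
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = [np.zeros((n_actions, n_actions)) for _ in range(2)]  # Q[i][state, action]
state = [0, 0]

for step in range(50_000):
    acts = [
        int(Q[i][state[i]].argmax()) if rng.random() > eps else int(rng.integers(n_actions))
        for i in range(2)
    ]
    rewards = [acts[1 - i] - effort_cost * acts[i] for i in range(2)]
    next_state = [acts[1 - i] for i in range(2)]
    for i in range(2):
        td_target = rewards[i] + gamma * Q[i][next_state[i]].max()
        Q[i][state[i], acts[i]] += alpha * (td_target - Q[i][state[i], acts[i]])
    state = next_state

print(Q)  # do they settle into mutual dispensing, mutual withholding, or neither?
```

With the effort cost this is basically an iterated prisoner's dilemma played by Q-learners, which is one crude way in to the "what even happens?" question; richer versions would need more state and non-symmetric rewards.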
Update: I currently think that Nguyen (2019) proves the claim, but it actually requires a layer to have two hidden neurons per training example.
Mechanistically dissimilar algorithms can be "mode connected" - that is, local minima-ish that are connected by a path of local minima (the paper proves this for their definition of "mechanistically similar")
Mea culpa: AFAICT, the 'proof' in Mechanistic Mode Connectivity fails. It basically goes:
- Prior work has shown that under overparametrization, all global loss minimizers are mode connected.
- Therefore, mechanistically distinct global loss minimizers are also mode connected.
The problem is that prior work made the assumption that for a net of the right size, there's only one loss minimizer up to permutation - aka there are no mechanistically distinct loss minimizers.
[EDIT: the proof also cites Nguyen (2019) in support of its arguments. I haven't checked the proof in Nguyen (2019), but if it holds up, it does substantiate the claim in Mechanistic Mode Connectivity - altho if I'm reading it correctly you need so much overparameterization that the neural net has a layer with as many hidden neurons as there are training data points.]
the above papers show that in more realistic settings empirically, two models lie in the same basin (up to permutation symmetries) if and only if they have similar generalization and structural properties.
I think they only check if they lie in linearly-connected bits of the same basin if they have similar generalization properties? E.g. Figure 4 of Mechanistic Mode Connectivity is titled "Non-Linear Mode Connectivity of Mechanistically Dissimilar Models" and the subtitle states that "quadratic paths can be easily identified to mode connect mechanistically dissimilar models[, and] linear paths cannot be identified, even after permutation". Linear Connectivity Reveals Generalization Strategies seems to be focussed on linear mode connectivity, rather than more general mode connectivity.
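For readers who want to poke at this themselves, here's a rough sketch of the underlying check, i.e. evaluating the loss along a linear vs quadratic (Bezier) path between two trained models' parameters. This is my own sketch, not code from either paper; `loss_on_data` and the Bezier control point `theta_c` are placeholders you'd have to supply (in the literature the control point is itself trained to keep the path in low-loss regions).

```python
import torch

def flatten(model):
    # concatenate all parameters into one flat vector
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def load_flat(model, theta):
    # write a flat parameter vector back into the model
    i = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(theta[i:i + n].reshape(p.shape))
        i += n

def loss_along_path(model, theta_a, theta_b, loss_on_data, theta_c=None, steps=21):
    """Loss along a linear path (theta_c=None) or a quadratic Bezier path between two solutions."""
    losses = []
    for t in torch.linspace(0, 1, steps):
        if theta_c is None:
            theta = (1 - t) * theta_a + t * theta_b
        else:
            theta = (1 - t) ** 2 * theta_a + 2 * t * (1 - t) * theta_c + t ** 2 * theta_b
        load_flat(model, theta)
        losses.append(loss_on_data(model))
    return losses  # a big bump relative to the endpoints = a loss barrier on this path
```

The claim under discussion is then roughly: for mechanistically dissimilar models the linear path (even after permuting one model's neurons) has a barrier, while some quadratic path does not.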
We argue that a sustained commitment to transparency (e.g. to auditors) would make the RLHF research environment more robust from a safety standpoint.
Do you think this is more true of RLHF than other safety techniques or frameworks? At first blush, I would have thought "no", and the reasoning you provide in this post doesn't seem to distinguish RLHF from other things.
I think I probably didn't quite word that question right, and that's what's explaining the confusion - I meant something like "Once you've created the AAR, what alignment problems are left to be solved? Please answer in terms of the gap between the AAR and superintelligence."
The second paper is just about linear connectivity, and does seem to suggest that linearly connected models run similar algorithms. But I guess I don't expect neural net training to go in straight lines? (Altho I suppose momentum helps with this?)
I didn't read super carefully, but it seems like the former paper is saying that, for some definition of "mechanistically similar":
- Mechanistically dissimilar algorithms can be "mode connected" - that is, local minima-ish that are connected by a path of local minima (the paper proves this for their definition of "mechanistically similar").
- If two models aren't linearly mode connected, then that means that they're dissimilar (note that this is a conjecture, but I guess they probably find evidence for it).
I don't think this is in much tension with the post?
I guess I just don't have a strong sense of where the practitioners' numbers are coming from, or why they believe what they believe. Which is fine if you want to build a pipeline that turns some intuitions into decisions, but not obviously incredibly useful for the rest of us (beyond just telling us those intuitions).
Finally, it’s worth mentioning that the cost-effectiveness of these programs relative to one another do not rely very heavily on conversions.
The thing you link shows that if you change the conversion ratio of both programs the same amount, the relative cost-effectiveness doesn't change, which makes sense. But if workshops produced 100x more conversions than socials, or vice versa, presumably this must make a difference. If you say that the treatment effects only differ by a factor of 2, then fair enough, but that's just not super a priori clear (and the fact that you claim that (a) you can measure the TDC better and (b) the TDC has a different treatment effect makes me skeptical).
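To spell out the arithmetic I have in mind (all numbers made up): scaling both programs' conversions by the same factor cancels out of the cost-effectiveness ratio, but scaling only one of them scales the ratio by that factor.

```python
# Made-up costs and conversion counts, just to illustrate the point above.
cost_workshop, cost_social = 100_000, 10_000

def relative_ce(conv_workshop, conv_social):
    # cost-effectiveness of workshops relative to socials
    return (conv_workshop / cost_workshop) / (conv_social / cost_social)

print(relative_ce(10, 2))          # baseline ratio
print(relative_ce(10 * 3, 2 * 3))  # both scaled by 3x: ratio unchanged
print(relative_ce(10 * 100, 2))    # workshops convert 100x more: ratio 100x larger
```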
(For the record, I couldn't really make heads or tails of the spreadsheet you linked or what the calculations in the script were supposed to be, but I didn't try super hard to understand them - perhaps I'd write something different if I really understood them)
Note that this paper in section 2.6 claims to demonstrate that some 175 billion parameter RLHF'ed LLM (presumably Claude or Claude 2?) does not use steganography in its CoT. It does this by:
- Get a bunch of answers from the model using CoT prompting.
- Take prefixes of CoTs, asking another language model to paraphrase them, regenerate the rest of the CoT, and see how accurate the models are now.
- Check if the model loses accuracy for paraphrased CoTs.
The basic result is that paraphrasing parts of the CoT doesn't appear to reduce accuracy.
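In code, the procedure (as I understand section 2.6; this is my own rendering, not the paper's code, and `generate_cot`, `continue_cot`, `paraphrase`, and `extract_answer` are hypothetical stand-ins for the relevant model calls) looks roughly like:

```python
# Sketch of the paraphrase-prefix test for CoT steganography, as I understand it.
# generate_cot, continue_cot, paraphrase, and extract_answer are hypothetical wrappers
# around the models being tested.

def paraphrase_prefix_accuracy(questions, answers, prefix_frac=0.5):
    correct_original, correct_paraphrased = 0, 0
    for q, a in zip(questions, answers):
        cot = generate_cot(q)                                    # full chain of thought
        correct_original += extract_answer(q, cot) == a

        prefix = paraphrase(cot[: int(len(cot) * prefix_frac)])  # reword a prefix with another LM
        new_cot = prefix + continue_cot(q, prefix)               # regenerate the rest from it
        correct_paraphrased += extract_answer(q, new_cot) == a
    # If accuracy drops substantially after paraphrasing, the exact wording of the CoT was
    # carrying information (possible steganography); if not, probably no steganography.
    return correct_original / len(questions), correct_paraphrased / len(questions)
```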
Closest thing I'm aware of is that at the time of the AlphaGo matches he bet people at like 3:2 odds, favourable to him, that Lee Sedol would win. Link here
I read RobertM as apophatically saying that Benquo could be confessing to something with Benquo's comment, and Benquo as asking what Benquo is allegedly confessing to.
wait, how does Benquo's text imply that Benquo is confessing to committing assault?
Similar to my comment on the other post: it seems like this critically relies on guesses about the 'treatment effects' of these programs, as detailed in the Pipeline probabilities and scientist-equivalence section. How did you come up with these guesses?
It seems like the estimates for the cost-effectiveness of the NeurIPS social and workshop rely heavily on estimates of the number of "conversions" those produced, but I couldn't find an explanation of how these estimates were produced in the post. No chance you can walk us thru the envelope math there?
Thanks for your detailed comments!
I feel a bit reticent about pause advocacy, altho I have to admit I'm not familiar with the details (and I'm not feeling so negative about it that I want to spend a bunch of time trashing it). My attempt to flesh out why:
- I'm pretty influenced by the type of libertarian political philosophy that says that hastily-assembled policy proposals can have big negative unintended side effects, especially when such policy proposals involve giving discretionary control over something to a government.
- I'm pessimistic about our odds of surviving really powerful AI, but not so pessimistic that I think p(doom) couldn't be 10 percentage points higher.
- Pause advocacy seems to seek compromise with normal people in order to get their policy proposals passed - an obviously good strategy on some level, but I kind of hate policy proposals that normal people like! This is doubly true for polities where it's easiest to start passing serious tech regulation (California, the EU).
- Relatedly, I have the impression that pause policy advocacy tends to look like taking popular policies and promoting those that slow down AI the most, rather than starting from something like mandatory AI liability insurance (which seems close to optimal) and then adjusting it to be popular with lots of people.
- I worry that "give more discretionary control over AI to such-and-such political body" just produces worse decisions.
Anyway that's why I have some sort of instinctive negative reaction, but again I'm not very familiar with the details and I'm sure different people are doing different things etc.
Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.
Yeah, I've been having difficulty getting Google Podcasts to find the new episode, unfortunately. In the meantime, consider listening on YouTube or Spotify, if those work for you?
Maybe "subagents"?
The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are "internally represented" isn't necessary for those results. You're right that the post uses the phrase "internal representation" once at the start, and some very weak form of "representation" is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn't mean internal representations are central to the post.
The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends... Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push.
I think this is probably a fallacy of composition (maybe in the reverse direction from how people usually use that term)? Like, the hypothesis is that the mind as a whole makes goals achievable and doesn't push towards goals, but I don't think this implies that any given subset of the mind does that.
Like, the only reason we're calling it a "Fourier basis" is that we're looking at a few different speeds of rotation, in order to scramble the second-place answers that almost get you a cos of 1 at the end, while preserving the actual answer.
I agree a rotation matrix story would fit better, but I do think it's a fair analogy: the numbers stored are just coses and sines, aka the x and y coordinates of the hour hand.
My submission: when we teach modular arithmetic to people, we do it using the metaphor of clock arithmetic. Well, if you ignore the multiple frequencies and argmax weirdness, clock arithmetic is exactly what this network is doing! Find the coordinates of rotating the hour hand (on a 113-hour clock) x hours, then y hours, use trig identities to work out what it would be if you rotated x+y hours, then count how many steps back you have to rotate to get to 0 to tell where you ended up. In fairness, the final step is a little bit different than the usual imagined rule of "look at the hour mark where the hand ends up", but not so different that clock arithmetic counts as a bad prediction IMO.
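For concreteness, here's a tiny numpy version of that clock-arithmetic story (my own illustration of the algorithm described above, not code extracted from the network; the particular rotation speeds are made up):

```python
import numpy as np

p = 113
ks = np.array([1, 7, 23])  # a few rotation speeds ("frequencies"); arbitrary choices here

def add_mod_p(x, y):
    wx, wy = 2 * np.pi * ks * x / p, 2 * np.pi * ks * y / p
    # trig identities give the coordinates of the hand after rotating x hours, then y hours
    cos_xy = np.cos(wx) * np.cos(wy) - np.sin(wx) * np.sin(wy)
    sin_xy = np.sin(wx) * np.cos(wy) + np.cos(wx) * np.sin(wy)
    # score each candidate z by how close the hand is to 0 after rotating back z hours,
    # summed over the rotation speeds (the "logits", maximized at z = x + y mod p)
    wz = 2 * np.pi * np.outer(ks, np.arange(p)) / p
    logits = (cos_xy[:, None] * np.cos(wz) + sin_xy[:, None] * np.sin(wz)).sum(axis=0)
    return int(np.argmax(logits))

assert add_mod_p(50, 100) == (50 + 100) % p
```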
An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:
Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable measure on agents - those theorems don't tell us that we should expect a uniform distribution on utility/reward functions, or even a nice distribution! They would if agents were born with utility functions in their heads represented as tables or something, where you could swap entries in different rows, but that's not what the theorems say!
Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. "make things of XYZ type think things are going OK", "try to promote physical configurations like such-and-such", "trying to stimulate the reinforcer I observe in my environment". Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and not respecting what the informed preferences of humans would be. Then this causes doom because you can better achieve most goals/preferences you could have by having more power and disempowering the humans.
I think you're making a mistake: policies can be reward-optimal even if there's no obvious box labelled "reward" whose outputs they're optimal with respect to. Similarly, the formalism of "reward" can be useful even if this box doesn't exist, or even if the policy isn't behaving the way you would expect if you identified that box with the reward function. To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.
The main thing I want to talk about
I can talk about utility functions instead (which would be equivalent to value functions in this case)
I disagree that these are equivalent, and expect the policy and value function to come apart in practice. Indeed, that was observed in the original goal misgeneralization paper (3.3, actor-critic inconsistency).
I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function. Analogously, momentum is conserved in classical mechanics, even if objects have labels on them that inaccurately say "my momentum is 23 kg m/s".
Anyways, we can talk about utility functions, but then we're going to lose claim to predictiveness, no? Why should we assume that the network will internally represent a scalar function over observations, consistent with a historical training signal's scalar values (and let's not get into nonstationary reward), such that the network will maximize discounted sum return of this internally represented function? That seems highly improbable to me, and I don't think reality will be "basically that" either.
The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.
Another thing I don't get
My point is rather that these results are not predictive because the assumption won't hold. The assumptions are already known to not be good approximations of trained policies, in at least some prototypical RL situations.
What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.