Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision 2023-12-16T05:39:10.558Z
Shapley Value Attribution in Chain of Thought 2023-04-14T05:56:18.208Z
[ASoT] Some thoughts on human abstractions 2023-03-16T05:42:12.595Z
Clarifying wireheading terminology 2022-11-24T04:53:23.925Z
Scaling Laws for Reward Model Overoptimization 2022-10-20T00:20:06.920Z
How many GPUs does NVIDIA make? 2022-10-08T17:54:35.466Z
Towards deconfusing wireheading and reward maximization 2022-09-21T00:36:43.244Z
Humans Reflecting on HRH 2022-07-29T21:56:53.561Z
leogao's Shortform 2022-05-24T20:08:32.928Z
[ASoT] Consequentialist models as a superset of mesaoptimizers 2022-04-23T17:57:40.130Z
[ASoT] Some thoughts about imperfect world modeling 2022-04-07T15:42:09.589Z
[ASoT] Some thoughts about LM monologue limitations and ELK 2022-03-30T14:26:15.381Z
[ASoT] Some thoughts about deceptive mesaoptimization 2022-03-28T21:14:27.217Z
[ASoT] Searching for consequentialist structure 2022-03-27T19:09:13.370Z
[ASoT] Some ways ELK could still be solvable in practice 2022-03-27T01:15:16.607Z
[ASoT] Observations about ELK 2022-03-26T00:42:20.540Z
What do paradigm shifts look like? 2022-03-16T19:17:37.586Z
EleutherAI's GPT-NeoX-20B release 2022-02-10T06:56:41.155Z
NFTs, Coin Collecting, and Expensive Paintings 2022-01-24T01:01:48.117Z
Retail Investor Advantages 2021-12-07T02:08:20.694Z
Behavior Cloning is Miscalibrated 2021-12-05T01:36:01.802Z
Quadratic Voting and Collusion 2021-11-17T00:19:15.737Z
In Defence of Optimizing Routine Tasks 2021-11-09T05:09:41.595Z
Towards Deconfusing Gradient Hacking 2021-10-24T00:43:32.916Z
Dissolving the Experience Machine Objection 2021-10-03T16:56:28.312Z
Gradient descent is not just more efficient genetic algorithms 2021-09-08T16:23:46.996Z
Obstacles to gradient hacking 2021-09-05T22:42:22.876Z
Thoughts on the Alignment Implications of Scaling Language Models 2021-06-02T21:32:08.555Z
Building AGI Using Language Models 2020-11-09T16:33:25.864Z
GPT-3: A Summary 2020-06-02T18:14:54.380Z


Comment by leogao on leogao's Shortform · 2024-03-27T20:37:39.281Z · LW · GW

philosophy: while the claims "good things are good" and "bad things are bad" at first appear to be compatible with each other, actually we can construct a weird hypothetical involving exact clones that demonstrates that they are fundamentally inconsistent with each other

law: could there be ambiguity in "don't do things that are bad as determined by a reasonable person, unless the thing is actually good?" well, unfortunately, there is no way to know until it actually happens

Comment by leogao on Modern Transformers are AGI, and Human-Level · 2024-03-27T01:42:26.391Z · LW · GW

I believe that the important part of generality is the ability to handle new tasks. In particular, I disagree that transformers are actually as good at handling new tasks as humans are. My mental model is that modern transformers are not general tools, but rather an enormous Swiss army knife with billions of specific tools that compose together to only a limited extent. (I think human intelligence is also a Swiss army knife and not the One True Tool, but it has many fewer tools that are each more general and more compositional with the other tools.)

I think this is heavily confounded because the internet is so huge that it's actually quite hard to come up with things that are not already on the internet. Back when GPT-3 first came out, I used to believe that widening the distribution to cover every task ever was a legitimate way to solve the generality problem, but I no longer believe this. (In particular, I think that belief would have led to overestimating the trajectory of AI over the past 4 years.)

One way to see this is that the most interesting tasks are ones that nobody has ever done before. You can't just widen the distribution to include discovering the cure for cancer, or solving alignment. To do those things, you actually have to develop general cognitive tools that compose in interesting ways.

We spend a lot of time thinking about how human cognitive tools are flawed, which they certainly are compared to the true galaxy brain superintelligence. But while humans certainly don't generalize perfectly and there isn't a sharp line between "real reasoning" and "mere memorization", it's worth keeping in mind that we're literally pretrained on surviving in the wilderness and those cognitive tools can still adapt to pushing buttons on a keyboard to write code.

I think this effect is also visible on a day to day basis. When I learn something new - say, some unfamiliar new piece of math - I generally don't immediately fully internalize it. I can recall some words to describe it and maybe apply it in some very straightforward cases where it obviously pattern matches, but I don't really fully grok its implications and connections to other knowledge. Then, after simmering on it for a while, and using it to bump into reality a bunch, I slowly begin to actually fully internalize the core intuition, at which point I can start generating new connections and apply it in unusual ways.

(From the inside, the latter feels like fully understanding the concept. I think this is at least partly the underlying reason why lots of ML skeptics say that models "don't really understand" - the models do a lot of pattern matching things straightforwardly.)

To be clear, I agree with your argument that there is substantial overlap between the language models that understand the most and the humans who understand the least. But I think this is mostly not the question that matters for thinking about AI that can kill everyone (or prevent that).

Comment by leogao on All About Concave and Convex Agents · 2024-03-25T04:41:12.207Z · LW · GW

Well, if you make a convex misaligned AI, it will play the (metaphorical) lottery over and over again until 99.9999%+ of the time it has no power and resources left whatsoever. The smarter it is, the faster and more efficient it will be at achieving this outcome.

So unless the RNG gods are truly out to get you, in the long run you are exceedingly unlikely to actually encounter a convex misaligned AI that has accumulated any real amount of power.
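A toy simulation (my own construction, purely to illustrate the claim): an agent with convex utility u(w) = w² strictly prefers any fair double-or-nothing bet, so it keeps taking them, and almost every trajectory ends with zero resources.

```python
import numpy as np

# Toy model (not from the thread): u(w) = w**2 is convex, so a fair
# double-or-nothing bet has expected utility 0.5 * (2w)**2 = 2 * w**2,
# which beats the sure w**2 -- the agent always takes the bet.
rng = np.random.default_rng(0)

n_agents, n_rounds = 100_000, 20
wealth = np.ones(n_agents)
for _ in range(n_rounds):
    wins = rng.random(n_agents) < 0.5
    wealth = np.where(wins, wealth * 2, 0.0)

# each agent survives all 20 rounds with probability 2**-20,
# so out of 100k agents we expect roughly 0.1 survivors
survivors = int((wealth > 0).sum())
print(survivors)
```

The handful of (metaphorical) lottery winners hold enormous wealth, but you will almost never meet one.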

Comment by leogao on All About Concave and Convex Agents · 2024-03-24T21:58:45.030Z · LW · GW

Thankfully, almost all of the time the convex agents end up destroying themselves by taking insane risks to concentrate their resources into infinitesimally likely worlds, so you will almost never have to barter with a powerful one.

(why not just call them risk seeking / risk averse agents instead of convex/concave?)

Comment by leogao on More people getting into AI safety should do a PhD · 2024-03-17T01:46:36.399Z · LW · GW

My personal anecdote as one of the no-undergrad people: I got into ML research on my own and published papers without much research mentorship, and then joined OpenAI. My background is definitely more in engineering than research, but I've spent a substantial amount of time exploring my own research directions. I get direct mentorship from my manager, but I also seek out advice from many other researchers in the organization, which I've found to be valuable.

My case is quite unusual, so I would caution about drawing generalized conclusions about what to do based on my experience.

Comment by leogao on leogao's Shortform · 2024-03-10T18:05:42.018Z · LW · GW

it's often stated that believing that you'll succeed actually causes you to be more likely to succeed. there are immediately obvious explanations for this - survivorship bias. obviously most people who win the lottery will have believed that buying lottery tickets is a good idea, but that doesn't mean we should take that advice. so we should consider the plausible mechanisms of action.

first, it is very common for people with latent ability to underestimate their latent ability. in situations where the cost of failure is low, it seems net positive to at least take seriously the hypothesis that you can do more than you think you can. (also keeping in mind that we often overestimate the cost of failure). there are also deleterious mental health effects to believing in a high probability of failure, and then bad mental health does actually cause failure - it's really hard to give something your all if you don't really believe in it.

belief in success also plays an important role in signalling. if you're trying to make some joint venture happen, you need to make people believe that the joint venture will actually succeed (opportunity costs exist). when assessing the likelihood of success of the joint venture, people will take many pieces of information into account: your track record, the opinions of other people with a track record, object level opinions on the proposal, etc.

being confident in your own venture is an important way of putting your "skin in the game" to vouch that it will succeed. specifically, the way this is supposed to work is that you get punished socially for being overconfident, so you have an incentive to only vouch for things that really will work. in practice, in large parts of the modern world, overconfidence is penalized less than we're hardwired to expect. sometimes this is because of regional cultures that accept and even embrace risky bets (SV), or because the atomization of modern society blunts the effects of social punishment.

this has both good and bad effects. it's what enables innovation, because that fundamentally requires a lot of people to play the research lottery. if you're not willing to work on something that will probably fail but also will pay out big if it succeeds, it's very hard to innovate. research consists mostly of people who are extremely invested in some research bet, to the point where it's extremely hard to convince them to pivot if it's not working out. ditto for startups, which are probably the archetypal example of both innovation and also of catastrophic overconfidence.

this also creates problems - for instance, it enables grifting, because you don't actually need to be correct if you just claim that your idea will work, and then when it inevitably fails you can just say that this is par for the course. also, being systematically overconfident can cause suboptimal decision making where calibration actually is important.

because many talented people are underequipped with confidence (there is probably some causal mechanism here - technical excellence often requires having a very mechanistic mental model of the thing you're doing, rather than just yoloing it and hoping it works), it also creates a niche for middlemen to supply confidence as a service, aka leadership. in the ideal case, this confidence is supplied by people who are calibratedly confident because of experience, but the market is inefficient enough that even people who are not calibrated can supply it. another way to view this is that leaders deliver the important service of providing certainty in the face of an uncertain world.

(I'm using the term middleman here in a sense that doesn't necessarily imply that they deliver no value - in fact, causing things to happen can create lots of value, and depending on the specifics this role can be very difficult to fill. but they aren't the people who do the actual technical work. it is of course also valuable for the leader to e.g be able in theory to fill any of the technical roles if needed, because it makes them more able to spend their risk budget on the important technical questions, it creates more slack and thereby increases the probability of success, and the common knowledge of the existence of this slack itself also increases the perceived inevitability of success)

a similar story also applies at the suprahuman level, of tribes or ideologies. if you are an ideology, your job is unfortunately slightly more complicated. on the one hand, you need to project the vibe of inevitable success so that people in other tribes feel the need to get in early on your tribe, but on the other hand you need to make your tribe members feel like every decision they make is very consequential for whether the tribe succeeds. if you're merely calibrated, then only one of the two can be true. different social technologies are used by religions, nations, political movements, companies, etc to maintain this paradox.

Comment by leogao on leogao's Shortform · 2024-03-10T17:02:06.153Z · LW · GW

I make no claim to fungibility or lack of value created by middlemen.

Comment by leogao on leogao's Shortform · 2024-03-09T16:34:57.032Z · LW · GW

an example: open source software produces lots of value. this value is partly captured by consumers who get better software for free, and partly by businesses that make more money than they would otherwise.

the most clear cut case is that some businesses exist purely by wrapping other people's open source software, doing advertising and selling it for a handsome profit; this makes the analysis simpler, though to be clear the vast majority of cases are not this egregious.

in this situation, the middleman company is in fact creating value (if software is created in a forest with no one around to use it, does it create any value?) by using advertising to cause people to get value from software. in markets where there are consumers clueless enough to not know about the software otherwise (e.g legacy companies), this probably does actually create a lot of counterfactual value. however, most people would agree that the middleman getting 90% of the created value doesn't satisfy our intuitive notion of fairness. (open source developers are more often trying to have the end consumers benefit from better software, not for random middlemen to get rich off their efforts)

and if advertising is commoditized, then this problem stops existing (you can't extract that much value as an advertising middleman if there is an efficient market with 10 other competing middlemen), and so most of the value does actually accrue to the end user.

Comment by leogao on Vote on Anthropic Topics to Discuss · 2024-03-09T13:25:54.698Z · LW · GW

[meta comment] maybe comments that are also poll options should be excluded from popular comments, displayed differently on profile pages, etc, to remove the need to say things like "[This comment is present for voting purposes, it does not represent my opinions, see the OP for context.]"

Comment by leogao on leogao's Shortform · 2024-03-09T12:51:36.974Z · LW · GW

of course, this is more a question about equilibria than literal transactions. suppose you capture most of the value and then pay it back out to users as a dividend: the users now have more money with which they could pay a middleman, and a middleman that could have extracted some amount of value originally can still extract that amount of value in this new situation.

we can model this as a game of ultimatum between the original value creator and the middlemen. if the participation of the OVC and middleman are both necessary, the OVC can bargain for half the value in an iterated game / as FDT agents. however, we usually think of the key differentiating factor between the OVC and middlemen as the middlemen being more replaceable, so the OVC should be able to bargain for a lot more. (see also: commoditizing your complement)

so to ensure that the end users get most of the value, you need to either ensure that all middleman roles are commoditized, or precommit to only provide value in situations where the end user can actually capture most of the value

Comment by leogao on leogao's Shortform · 2024-03-09T12:31:58.539Z · LW · GW

any time someone creates a lot of value without capturing it, a bunch of other people will end up capturing the value instead. this could be end consumers, but it could also be various middlemen. it happens not infrequently that someone decides not to capture the value they produce in the hopes that the end consumers get the benefit, but in fact the middlemen capture the value instead

Comment by leogao on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-07T00:20:50.036Z · LW · GW

In my experiments log L0 vs log unexplained variance should be a nice straight line. I think your autoencoders might be substantially undertrained (especially given that training longer moves off the frontier a lot). Scaling up the data by 10x or 100x wouldn't be crazy. 

(Also, I think L0 is more meaningful than L0 / d_hidden for comparing across different d_hidden (I assume that's what "percent active features" is))
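A minimal sketch of the two metrics behind the suggested plot (function names are mine, not from the post; numpy stand-in): L0 is the mean number of active latents per example, and unexplained variance is the residual variance as a fraction of the total variance of the inputs.

```python
import numpy as np

def l0(latents):
    # mean count of nonzero latents per example
    return (latents != 0).sum(axis=-1).mean()

def unexplained_variance(x, x_hat):
    # residual variance as a fraction of total variance of x
    return ((x - x_hat) ** 2).sum() / ((x - x.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64))
latents = np.maximum(x - 1.0, 0.0)  # stand-in for sparse SAE latents
x_hat = 0.9 * x                     # stand-in for an SAE reconstruction
print(l0(latents), unexplained_variance(x, x_hat))
```

Sweeping the sparsity penalty and plotting log l0 against log unexplained variance then traces out the frontier; points far off a straight line are candidates for undertraining, per the comment above.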

Comment by leogao on Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT · 2024-03-05T23:51:43.187Z · LW · GW

Fwiw, I find it's much more useful to have (log) active features on the x axis, and (log) unexplained variance on the y axis. (if you want you can then also plot the L1 coefficient above the points, but that seems less important)

Comment by leogao on Can we get an AI to do our alignment homework for us? · 2024-02-27T00:04:53.883Z · LW · GW

My mental model is that there is an entire space of possible AIs, each with some capability level and alignability level. Given the state of the alignment field, there is some alignability ceiling, below which we can reliably align AIs. Right now, this ceiling is very low, but we can push it higher over time.

At some capability level, the AI is powerful enough to solve alignment of a more capable AI, which can then solve alignment for even more capable AI, etc all the way up. However, even the most alignable AI capable of this is still potentially very hard to align. There will of course be more alignable and less capable AIs too, but they will not be capable enough to actually kick off this bucket chain.

Then the key question is whether there will exist an AI that is both alignable and capable enough to start the bucket chain. This is a function of both (a) the shape of the space of AIs (how quickly do models become unalignable as they become more capable?) and (b) how good we become at solving alignment. Opinions differ on this - my personal opinion is that probably this first AI is pretty hard to align, so we're pretty screwed, though it's still worth a try.

Comment by leogao on Do sparse autoencoders find "true features"? · 2024-02-23T04:47:19.033Z · LW · GW

In the limit of infinite SAE width and infinite (iid) training data, you can get perfect reconstruction and perfect sparsity (both L0 and L1). We can think of this as maximal feature splitting. Obviously, this is undesirable, because you've discarded all of the structure present in your data.

Therefore, reconstruction and sparsity aren't exactly the thing we most fundamentally care about. It just happens to do something reasonable at practical scales. However, that doesn't mean we have to throw it out - we might hope that it gives us enough of a foothold in practice.

In particular, the maximal feature splitting case requires exponentially many latents. We might believe that in practice, on the spectrum from splitting too little (polysemanticity) to splitting too much, erring on the side of splitting too much is preferable, because we can still do circuit finding and so on if we artificially cut some existing features into smaller pieces.
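A finite caricature of the degenerate limit described above (my own construction): give the "SAE" one latent per training example, with that example as the corresponding decoder row. Reconstruction is exact and every example activates exactly one latent, so both L0 and L1 are minimal, yet no structure has been learned.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 32))   # 500 training examples
decoder = x.copy()               # one dictionary element per example
latents = np.eye(500)            # one-hot code: example i -> latent i

x_hat = latents @ decoder
print(np.abs(x - x_hat).max())           # exact reconstruction
print((latents != 0).sum(axis=1).mean())  # L0 = 1 per example
```

Held-out data would of course reconstruct terribly, which is one way of seeing that the objective only behaves reasonably at practical widths and data scales.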

Comment by leogao on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-14T22:23:00.201Z · LW · GW

For the dashboards, did you filter out the features that fire less frequently? I looked through a few and didn't notice any super low density ones.

Comment by leogao on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-14T10:17:41.764Z · LW · GW

For your dashboards, how many tokens are you retrieving the top examples from?

Comment by leogao on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-10T03:07:45.498Z · LW · GW

Why do you scale your MSE by 1/(x_centred**2).sum(dim=-1, keepdim=True).sqrt() ? In particular, I'm confused about why you have the square root. Shouldn't it just be 1/(x_centred**2).sum(dim=-1, keepdim=True)?
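For concreteness, a numpy stand-in for the two candidate scalings (variable names mine): dividing by the summed square gives a dimensionless fraction-of-variance-style quantity, while dividing by its square root leaves a metric that still grows with the norm of the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 512))
x_centred = x - x.mean(axis=0)
x_hat = 0.5 * x  # stand-in reconstruction

mse = ((x - x_hat) ** 2).sum(axis=-1, keepdims=True)
norm_sq = (x_centred ** 2).sum(axis=-1, keepdims=True)

fvu_like = mse / norm_sq              # no sqrt: dimensionless
sqrt_scaled = mse / np.sqrt(norm_sq)  # the sqrt version asked about
print(fvu_like.mean(), sqrt_scaled.mean())
```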

Comment by leogao on More Hyphenation · 2024-02-08T07:02:42.758Z · LW · GW

hot take: if you find that your sentences can't be parsed reliably without brackets, that's a sign you should probably refactor your writing to be clearer

Comment by leogao on leogao's Shortform · 2024-02-03T11:50:14.888Z · LW · GW

a tentative model of ambitious research projects

when you do a big research project, you have some amount of risk you can work with - maybe you're trying to do something incremental, so you can only tolerate a 10% chance of failure, or maybe you're trying to shoot for the moon and so you can accept a 90% chance of failure.

budgeting for risk is non negotiable because there are a lot of places where risk can creep in - and if there isn't, then you're not really doing research. most obviously, your direction might just be a dead end. but there are also other things that might go wrong: the code might end up too difficult to implement, or it might run too slowly, or you might fail to fix a solvable-in-principle problem that comes up.

I claim that one of the principal components of being a good researcher is being able to eliminate as much unnecessary risk as possible, so you can spend your entire risk budget on the important bets.

for example, if you're an extremely competent engineer, when brainstorming experiments you don't have to think much about the risk that you fail to implement it. you know that even if you don't think through all the contingencies that might pop up, you can figure it out, because you have a track record of figuring it out. you can say the words "and if that happens we'll just scale it up" without spending much risk because you know full well that you can actually execute on it. a less competent engineer would have to pay a much greater risk cost, and correspondingly have to reduce the ambitiousness of the research bets (or else, take on way more risk than intended).

not all research bets are created equal, either. the space of possible research bets is vast, and most of them are wrong. but if you have very good research taste, you can much more reliably tell whether a bet is likely to work out. even the best researchers can't just look at a direction and know for sure if it will work, but if you get a good direction 10% of the time, you can do a lot more than if your directions are only good 0.1% of the time.

finally, if you know and trust someone to be reliable at executing in their area of expertise, you can delegate things that fall in their domain to them. in practice, this can be quite tough and introduce risk unless they have a very legible track record, or you are sufficiently competent in their domain yourself to tell if they're likely to succeed. and if you're sufficiently competent to do the job of any of your reports (even if less efficiently), then you can budget less risk here, knowing that even if someone drops the ball you could always pick it up yourself.

Comment by leogao on leogao's Shortform · 2024-01-28T23:52:31.596Z · LW · GW

Yeah, this seems like a good idea for reading - lets you get best of both worlds. Though it works for reading mostly because it doesn't take that much longer to do so. This doesn't translate as directly to e.g what to do when debugging code or running experiments.

Comment by leogao on jacquesthibs's Shortform · 2024-01-28T04:51:26.361Z · LW · GW

"larger models exploit the RM more" is in contradiction with what i observed in the RM overoptimization paper. i'd be interested in more analysis of this

Comment by leogao on leogao's Shortform · 2024-01-28T04:46:54.866Z · LW · GW

i've noticed a life hyperparameter that affects learning quite substantially. i'd summarize it as "willingness to gloss over things that you're confused about when learning something". as an example, suppose you're modifying some code and it seems to work but also you see a warning from an unrelated part of the code that you didn't expect. you could either try to understand exactly why it happened, or just sort of ignore it.

reasons to set it low:

  • each time your world model is confused, that's an opportunity to get a little bit of signal to improve your world model. if you ignore these signals you increase the length of your feedback loop, and make it take longer to recover from incorrect models of the world.
  • in some domains, it's very common for unexpected results to actually be a hint at a much bigger problem. for example, many bugs in ML experiments cause results that are only slightly weird, but if you tug on the thread of understanding why your results are slightly weird, this can cause lots of your experiments to unravel. and doing so earlier rather than later can save a huge amount of time
  • understanding things at least one level of abstraction down often lets you do things more effectively. otherwise, you have to constantly maintain a bunch of uncertainty about what will happen when you do any particular thing, and have a harder time thinking of creative solutions

reasons to set it high:

  • it's easy to waste a lot of time trying to understand relatively minor things, instead of understanding the big picture. often, it's more important to 80-20 by understanding the big picture, and you can fill in the details when it becomes important to do so (which often is only necessary in rare cases).
  • in some domains, we have no fucking idea why anything happens, so you have to be able to accept that we don't know why things happen to be able to make progress
  • often, if e.g you don't quite get a claim that a paper is making, you could resolve your confusion just by reading a bit ahead. if you always try to fully understand everything before digging into it, you'll find it very easy to get stuck before actually making it to the main point the paper is making

there are very different optimal configurations for different kinds of domains. maybe the right approach is to be aware that this is an important hyperparameter and occasionally try going down some rabbit holes and seeing how much value it provides

Comment by leogao on leogao's Shortform · 2024-01-23T10:00:06.160Z · LW · GW

more importantly, both i and the other person get more out of the conversation. almost always, there are subtle misunderstandings and the rest of the conversation would otherwise involve a lot of talking past each other. you can only really make progress when you're actually engaging with the other person's true beliefs, rather than a misunderstanding of their beliefs.

Comment by leogao on leogao's Shortform · 2024-01-23T06:13:06.565Z · LW · GW

saying "sorry, just to make sure I understand what you're saying, do you mean [...]" more often has been very valuable

Comment by leogao on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-13T21:03:51.675Z · LW · GW

I think this paper is empirical evidence for a nontrivial part of the deceptive alignment argument (RLHF/adversarial training being insufficient to remove it), and I also think most empirical papers don't make any sense when applied to AGI.

I think I have an intellectually consistent stance - I don't think this is because I have a double standard for pessimistic results.

First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training failing to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they're harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.

Second, I think this paper is actually genuinely better evidence for deceptive alignment than many of the "deception" papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason why I think showing RLHF making models harmless provides approximately no evidence against deceptive alignment. So I don't think it's true that I like empirical papers as long as they purport to support the deceptive alignment argument.

The reasons I think this paper is actually better than the other deception papers (beyond just quality of execution) are that the deceptive alignment in this setup happens for reasons more similar to why it might happen in AGI than in previous work, and that the secret scratchpad setting seems more analogous to AGI than single shot or visible scratchpads.

Comment by leogao on Suggestions for net positive LLM research · 2024-01-08T10:02:42.083Z · LW · GW

I'm not familiar enough with agent foundations to provide very detailed object level advice, but I think it would be hugely valuable to empirically test agent foundations ideas in real models, with the understanding that AGI doesn't necessarily have to look like LMs but any theory for intelligence has to at least fit both LMs and AGI. As an example, we might believe that LMs might not have goals in the same sense as AGI eventually, but then we can ask why LMs can still seem to achieve any goals at all, and perhaps through empirical investigation of LMs we can get a better understanding of the nature of goal seeking. I think this would be much, much more valuable than generic LM alignment work.

Comment by leogao on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T11:07:16.064Z · LW · GW

The training set is a random 100k subsample of this dataset:

I'm prepending Alice/Bob and doing the xor of the label in exactly the same way you do.

Comment by leogao on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T08:59:01.665Z · LW · GW

I'm having some trouble replicating this result in a not exactly comparable setting (internal model, looking at is_alice xor amazon_sentiment). I get 90%+ on the constituent datasets, but only up to 75% on the xor depending on which layer I look at.

(low confidence, will update as I get more results)

Comment by leogao on 2023 in AI predictions · 2024-01-01T10:43:02.843Z · LW · GW

(since this list is pretty heavily weighted to the <5 year timelines, I'd like to register that my timelines are more like 10 years median)

Comment by leogao on The Plan - 2023 Version · 2023-12-30T18:07:37.049Z · LW · GW

Short answer: The core focus of the "yet to be worked out techniques" is to figure out the "how do we get it to generalize properly" part, not the "how do we be super careful with the labels" part.

Longer answer: We can consider weak to strong generalization as actually two different subproblems:

  • generalizing from correct labels on some easy subset of the distribution (the 10,000 super careful definitely 100% correct labels)
  • generalizing from labels which can be wrong and are more correct on easy problems than hard problems, but we don't exactly know when the labels are wrong (literally just normal human labels)

The setting in the paper doesn't quite distinguish between the two but I personally think the former problem is more interesting and contains the bulk of the difficulty. Namely, most of the difficulty is in understanding when generalization happens/fails and what kinds of generalizations are more natural.

Comment by leogao on Anki setup best practices? · 2023-12-25T22:55:38.820Z · LW · GW

I never use easy/hard, only correct/incorrect. To see cards more often, you can tune the interval factor. You can also change the new card steps to show new cards multiple times on the first day.

Comment by leogao on Ronny and Nate discuss what sorts of minds humanity is likely to find by Machine Learning · 2023-12-21T08:19:48.840Z · LW · GW

I think I could pass the ITTs of Quintin/Nora sufficiently to have a productive conversation while also having interesting points of disagreement. If that's the bottleneck, I'd be interested in participating in some dialogues, if it's a "people genuinely trying to understand each other's views" vibe rather than a "tribalistically duking it out for the One True Belief" vibe.

Comment by leogao on Mapping the semantic void: Strange goings-on in GPT embedding spaces · 2023-12-19T10:11:12.923Z · LW · GW

the following 3-d mockup might convey some useful spatial intuitions


This mockup conveys actively incorrect spatial intuitions. The observed radii are exactly what you'd expect if there's only a single gaussian.

Let's say we look at a 1000d gaussian centered around some point away from the origin:

import torch, seaborn as sns, matplotlib.pyplot as plt

x = torch.randn(10000, 1000)
x += torch.ones_like(x) * 3

# plot the distribution of distances to the origin
sns.displot(x.norm(dim=-1))
plt.xlim(0, None)

We get what appears to be a shell at radius 100 (the expected norm: each coordinate has second moment 3² + 1 = 10, so √(1000 · 10) = 100).

Then, we plot the distribution of distances to the centroid:

sns.displot((x - x.mean(dim=0)).norm(dim=-1))
plt.xlim(0, None)

Suddenly it looks like a much smaller shell, at radius √1000 ≈ 31.6! But really there is only one gaussian (high dimensional gaussians have their mass concentrated almost entirely in a thin shell), centered around some point away from the origin. There is no weird torus.
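The thin-shell claim is easy to check numerically; here's a quick stdlib-only sketch (same idea as the torch snippets, smaller sample for speed):

```python
import math
import random

random.seed(0)

# sample standard 1000-d gaussians and look at their norms
norms = []
for _ in range(1000):
    v = [random.gauss(0.0, 1.0) for _ in range(1000)]
    norms.append(math.sqrt(sum(x * x for x in v)))

mean = sum(norms) / len(norms)
std = math.sqrt(sum((n - mean) ** 2 for n in norms) / len(norms))
print(mean)  # ~ sqrt(1000) ~= 31.6
print(std)   # ~ 1/sqrt(2) ~= 0.7, i.e. almost all mass in a thin shell
```

The norm concentrates tightly: its standard deviation stays O(1) no matter how large the dimension gets.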

Comment by leogao on leogao's Shortform · 2023-12-18T08:34:43.438Z · LW · GW

sure, but seems orthogonal to the thing i'm describing - the claim is that a lot of alignment work on current models has ~no bearing on progress towards aligning AGI.

Comment by leogao on TurnTrout's shortform feed · 2023-12-18T06:09:25.092Z · LW · GW

I think deceptive alignment is still reasonably likely despite evidence from LLMs.

I agree with:

  • LLMs are not deceptively aligned and don't really have inner goals in the sense that is scary
  • LLMs memorize a bunch of stuff
  • the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
  • Adam on transformers does not have a super strong simplicity bias
  • without deceptive alignment, AI risk is a lot lower
  • LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)

I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.

however, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g. novel research, or long horizon tasks. deceptive alignment conditions on models already being better at generalization and reasoning than current models.

my current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI.

Comment by leogao on leogao's Shortform · 2023-12-18T05:47:16.125Z · LW · GW

new galaxy brain hypothesis of how research advances: progress happens when people feel unhappy about a bad but popular paper and want to prove it wrong (or when they feel like they can do even better than someone else)

this explains:

  • why it's often necessary to have bad incremental papers that don't introduce any generalizable techniques (nobody will care about the followup until it's refuting the bad paper)
  • why so much of academia exists to argue that other academics are wrong and bad
  • why academics sometimes act like things don't exist unless there's a paper about them, even though the thing is really obvious

Comment by leogao on leogao's Shortform · 2023-12-18T05:37:13.846Z · LW · GW

hypothesis: the kind of reasoning that causes ML people to say "we have made no progress towards AGI whatsoever" is closely analogous to the kind of reasoning that makes alignment people say "we have made no progress towards hard alignment whatsoever"

ML people see stuff like GPT4 and correctly notice that it's in fact kind of dumb and bad at generalization in the same ways that ML always has been. they make an incorrect extrapolation, which is that AGI must therefore be 100 years away, rather than 10 years away

high p(doom) alignment people see current model alignment techniques and correctly notice that they fail to tackle the AGI alignment problem in the same way that alignment techniques always have. they make an incorrect extrapolation and conclude that p(doom) = 0.99, rather than 0.5

(there is an asymmetry which is that overconfidence that alignment will be solved is much more dangerous than overconfidence that AGI will be solved)

Comment by leogao on leogao's Shortform · 2023-12-18T02:49:34.580Z · LW · GW

Is it a very universal experience to find it easier to write up your views if it's in response to someone else's writeup? Seems like the kind of thing that could explain a lot about how research tends to happen if it were a pretty universal experience.

Comment by leogao on leogao's Shortform · 2023-12-17T20:19:30.989Z · LW · GW

learning rate

Comment by leogao on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-17T09:14:05.610Z · LW · GW

I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over-updating on single forward pass behavior of current LLMs. However, I think it is still possible to get evidence using current models with careful experiment design and being careful with what kinds of conclusions to draw.

Comment by leogao on leogao's Shortform · 2023-12-17T07:17:55.965Z · LW · GW

current understanding of optimization

  • high curvature directions (hessian eigenvectors with high eigenvalue) want small lrs. low curvature directions want big lrs
  • if the lr in a direction is too small, it takes forever to converge. if the lr is too big, it diverges by oscillating with increasing amplitude
  • momentum helps because if your lr is too small, it makes you move a bit faster. if your lr is too big, it causes the oscillations to cancel out with themselves. this makes high curvature directions more ok with larger lrs and low curvature directions more ok with smaller lrs, improving conditioning
  • high curvature directions also have bigger gradients. this is the opposite of what we want because in a perfect world higher curvature directions would have smaller gradients (natural gradient does this but it's usually too expensive). adam second moment / rmsprop helps because it normalizes out the gradient magnitude, so steps stay the same size even as gradients in a direction get bigger, which is sorta halfway right
    • applied per param rather than per eigenvector
  • in real NNs edge of stability means it's actually even more fine to have a too-high lr: the max curvature increases throughout training until it gets to the critical point where it would diverge, but then instead of diverging all the way the oscillations along the top eigenvector somehow cause the model to move into a slightly lower curvature region again, so that it stabilizes right at the edge of stability. 
    • for Adam, these oscillations also cause second moment increases, which decreases preconditioned max curvature without affecting the original curvature. so this means the original max curvature can just keep increasing for Adam whereas it doesn't for SGD (though apparently there's some region where it jumps into a region with low original max curvature too)
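
A minimal sketch of the first two bullets (my own toy example, not from any paper): gradient descent on a 2-d quadratic, where the direction with curvature h converges only when lr < 2/h.

```python
# toy loss f(x) = 0.5 * (h1*x1^2 + h2*x2^2), with curvatures h1 >> h2.
# gradient descent multiplies coordinate i by (1 - lr*h_i) each step, so it
# diverges (oscillating with growing amplitude) whenever lr > 2/h_i,
# and crawls when lr*h_i is tiny.

def gd(lr, h=(100.0, 1.0), steps=50):
    x = [1.0, 1.0]
    for _ in range(steps):
        x = [xi - lr * hi * xi for xi, hi in zip(x, h)]
    return x

x = gd(lr=0.015)             # below 2/100 = 0.02: both directions converge,
print(abs(x[0]), abs(x[1]))  # but the low-curvature one barely moves (~0.5)

x = gd(lr=0.03)              # above 2/100: high-curvature direction blows up
print(abs(x[0]))             # astronomically large
```

This is the tension the bullets describe: no single lr serves both a curvature-100 and a curvature-1 direction well, which is what momentum and second-moment preconditioning partially fix.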


Comment by leogao on leogao's Shortform · 2023-12-17T05:46:30.294Z · LW · GW

interesting paper with improvement on straight through estimator

Comment by leogao on leogao's Shortform · 2023-12-17T05:45:18.098Z · LW · GW

sharpness doesn't seem to correlate with generalization

Comment by leogao on leogao's Shortform · 2023-12-07T05:25:45.244Z · LW · GW

adhd is a mechanism for seeking domains with tight feedback loops

Comment by leogao on leogao's Shortform · 2023-12-04T04:18:20.856Z · LW · GW

One of the greatest tragedies of truth-seeking as a human is that the things we instinctively do when someone else is wrong are often the exact opposite of the thing that would actually convince the other person.

Comment by leogao on Quick takes on "AI is easy to control" · 2023-12-04T04:14:34.587Z · LW · GW

I think there are at least two definitions of optimistic/pessimistic that are often conflated:

  • Epistemic: an optimist is someone who thinks doom is unlikely, a pessimist someone who thinks doom is likely
  • Dispositional: an optimist is someone who is hopeful and glass-half-full, a pessimist is someone who is despondent and fatalistic

Certainly these are correlated to some extent: if you believe there's a high chance of everyone dying, probably this is not great for your mental health. Also probably people who are depressed are more likely to have negatively distorted epistemics. This would explain why it's tempting to use the same term to refer to both.

However, I think using the same term to refer to both leads to some problems:

  • Being cheerful and hopeful is generally a good trait to have. However, this often bleeds into also believing it is desirable to have epistemic beliefs that doom is unlikely, rather than trying to figure out whether doom is actually likely.
  • Because "optimism" feels morally superior to "pessimism" (due to the dispositional definition), it's inevitable that using the terms for tribal affiliation even for the epistemic definition causes tension.

I personally strive to be someone with an optimistic disposition and also to try my best to have my beliefs track the truth. I also try my best to notice and avoid the tribal pressures.

Comment by leogao on Shallow review of live agendas in alignment & safety · 2023-11-29T10:13:37.471Z · LW · GW

Fwiw I think "deep" reviews serve a very different purpose from shallow reviews so I don't think you should let the existence of shallow reviews prevent you from doing a deep review

Comment by leogao on Shallow review of live agendas in alignment & safety · 2023-11-28T23:00:41.116Z · LW · GW

There's also a much harder and less impartial option, which is to have an extremely opinionated survey that basically picks one lens to view the entire field and then describes all agendas with respect to that lens in terms of which particular cruxes/assumptions each agenda runs with. This would necessarily require the authors of the survey to deeply understand all the agendas they're covering, and inevitably some agendas will receive much more coverage than other agendas. 

This makes it much harder than just stapling together a bunch of people's descriptions of their own research agendas, and will never be "the" alignment survey because of the opinionatedness. I still think this would have a lot of value though: it would make it much easier to translate ideas between different lenses/notice commonalities, and help with figuring out which cruxes need to be resolved for people to agree. 

Relatedly, I don't think alignment currently has a lack of different lenses (which is not to say that the different lenses are meaningfully decorrelated). I think alignment has a lack of convergence between people with different lenses. Some of this is because many cruxes are very hard to resolve experimentally today. However, I think even despite that it should be possible to do much better than we currently are: often, it's not even clear what the cruxes are between different views, or whether two people are thinking about the same thing when they make claims in different language.

Comment by leogao on leogao's Shortform · 2023-11-27T10:57:30.563Z · LW · GW

funnily enough, my experience has been almost entirely from the other direction - almost everything I know is from working directly on things I care about, and very little is from study. one of the reasons behind this shortform was trying to untangle why people spend lots of time studying stuff and whether/when it makes sense for me to study vs simply to learn by doing