PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" 2023-05-30T18:33:40.765Z
Some thoughts on automating alignment research 2023-05-26T01:50:20.099Z
Before smart AI, there will be many mediocre or specialized AIs 2023-05-26T01:38:41.562Z
PaLM in "Extrapolating GPT-N performance" 2022-04-06T13:05:12.803Z
Truthful AI: Developing and governing AI that does not lie 2021-10-18T18:37:38.325Z
OpenAI: "Scaling Laws for Transfer", Hernandez et al. 2021-02-04T12:49:25.704Z
Prediction can be Outer Aligned at Optimum 2021-01-10T18:48:21.153Z
Extrapolating GPT-N performance 2020-12-18T21:41:51.647Z
Formalising decision theory is hard 2019-08-23T03:27:24.757Z
Quantifying anthropic effects on the Fermi paradox 2019-02-15T10:51:04.298Z


Comment by Lukas Finnveden (Lanrian) on Let’s use AI to harden human defenses against AI manipulation · 2023-06-03T03:58:59.594Z · LW · GW

doesn't it seem to you that the topic is super neglected (even compared to AI alignment) given that the risks/consequences of failing to correctly solve this problem seem comparable to the risk of AI takeover?

Yes, I'm sympathetic. Among all the issues that will come with AI, I think alignment is relatively tractable (at least it is now) and that it has an unusually clear story for why we shouldn't count on being able to defer it to smarter AIs (though that might work). So I think it's probably correct for it to get relatively more attention. But even taking that into account, the non-alignment singularity issues do seem too neglected.

I'm currently trying to figure out what non-alignment stuff seems high-priority and whether I should be tackling any of it.

Comment by Lukas Finnveden (Lanrian) on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-02T22:44:09.859Z · LW · GW

This was also my impression.

Curious if OP or anyone else has a source for the <1% claim? (Partially interested in order to tell exactly what kind of "doom" this is anti-predicting.)

Comment by Lukas Finnveden (Lanrian) on PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" · 2023-06-01T21:11:43.525Z · LW · GW

I assume that's from looking at the GPT-4 graph. I think the main graph I'd look at for a judgment like this is probably the first graph in the post, without PaLM-2 and GPT-4. Because PaLM-2 is 1-shot and GPT-4 is just 4 instead of 20+ benchmarks.

That suggests 90% is ~1 OOM away and 95% is ~3 OOMs away.

(And since PaLM-2 and GPT-4 seemed roughly on trend in the places where I could check them, probably they wouldn't change that too much.)

Comment by Lukas Finnveden (Lanrian) on PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" · 2023-05-31T19:33:38.047Z · LW · GW

Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:

  • Instead of having linear performance on the y-axis, switch to something like log(max_performance - actual_performance). (So that we get a log-log plot.)
  • Then for each series of data points, look for the largest n such that the last n data points are roughly on a line. (I.e. identify the last power law segment.)
  • Then to extrapolate into the future, project that line forward. (I.e. fit a power law to the last power law segment and project it forward.)
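A minimal sketch of the three steps above, to make the procedure concrete. This is not BNSL fitting itself, only the rough approximation described in the bullets. The function names, the residual tolerance `tol`, and the choice of base-10 logs are assumptions made for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def last_power_law_segment(log_compute, performance, max_performance=1.0, tol=0.05):
    """Find the largest n such that the last n points are roughly on a line
    in (log10 compute, log10(max_performance - performance)) space.
    Returns (n, slope, intercept). `tol` is a hypothetical residual threshold."""
    ys = [math.log10(max_performance - p) for p in performance]
    best = None
    for n in range(2, len(ys) + 1):
        xs_tail, ys_tail = log_compute[-n:], ys[-n:]
        slope, intercept = fit_line(xs_tail, ys_tail)
        worst = max(abs(y - (slope * x + intercept))
                    for x, y in zip(xs_tail, ys_tail))
        if worst <= tol:
            best = (n, slope, intercept)  # keep the largest n that passes
    return best

def extrapolate(log_compute_future, slope, intercept, max_performance=1.0):
    """Project the fitted line forward and convert back to performance."""
    return max_performance - 10 ** (slope * log_compute_future + intercept)
```

With noisy benchmark data, the result is likely to be sensitive to `tol`, since two points always lie exactly on a line.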

That description misses out on effects where BNSL-fitting would predict that there's a slow, smooth shift from one power-law to another, and that this gradual shift will continue into the future. I don't know how important that is. Curious for your intuition about whether or not that's important, and/or other reasons for why my above description is or isn't reasonable.

When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few data points. Maybe the practical thing to do would be to assume that the 2nd half of the "sigmoid" forms a distinct power law segment, and fit a power law to the points with >~50% performance (or less than that if there are too few points with >50% performance). Which maybe suggests that the claim "BNSL does better" corresponds to a claim that the speed at which the language models improve on ~random performance (bottom part of the "sigmoid") isn't informative for how fast they converge to ~maximum performance (top part of the "sigmoid")? That seems plausible.

Comment by Lukas Finnveden (Lanrian) on Before smart AI, there will be many mediocre or specialized AIs · 2023-05-29T16:35:09.383Z · LW · GW

Thanks, fixed.

Comment by Lukas Finnveden (Lanrian) on Let’s use AI to harden human defenses against AI manipulation · 2023-05-22T19:02:03.651Z · LW · GW

I'm also concerned about how we'll teach AIs to think about philosophical topics (and indeed, how we're supposed to think about them ourselves). But my intuition is that proposals like this look great from that perspective.

For areas where we don't have empirical feedback-loops (like many philosophical topics), I imagine that the "baseline solution" for getting help from AIs is to teach them to imitate our reasoning. Either just by having them literally write the words that they predict we would write (but faster), or by having them generate arguments that we would think look good. (Potentially recursively, c.f. amplification, debate, etc.)

(A different direction is to predict what we would think after thinking about it more. That has some advantages, but it doesn't get around the issue where we're at-best speeding things up.)

One of the few plausible-seeming ways to outperform that baseline is to identify epistemic practices that work well on questions where we do have empirical feedback loops, and then transfer those practices to questions where we lack such feedback loops. (C.f. imitative generalization.) The above proposal is doing that for a specific sub-category of epistemic practices (recognising ways in which you can be misled by an argument).

Worth noting: The broad category of "transfer epistemic practices from feedback-rich questions to questions with little feedback" contains a ton of stuff, and is arguably the root of all our ability to reason about these topics:

  • Evolution selected human genes for ability to accomplish stuff in the real world. That made us much better at reasoning about philosophy than our chimp ancestors are.
  • Cultural evolution seems to have at least partly promoted reasoning practices that do better at deliberation. (C.f. possible benefits from coupling competition and deliberation.)
  • If someone is optimistic that humans will be better at dealing with philosophy after intelligence-enhancement, I think they're mostly appealing to stuff like this, since intelligence would typically be measured in areas where you can recognise excellent performance.

Comment by Lukas Finnveden (Lanrian) on Matthew Barnett's Shortform · 2023-05-17T20:56:30.103Z · LW · GW

It seems like the list mostly explains away the evidence that "humans can't currently prevent value drift", since the points apply much less to AIs. (I don't know if you agree.)

  • As you mention, (1) probably applies less to AIs (for better or worse).
  • (2) applies to AIs in the sense that many features of AIs' environments will be determined by what tasks they need to accomplish, rather than what will lead to minimal value drift. But the reason to focus on the environment in the human case is that it's the ~only way to affect our values. By contrast, we have much more flexibility in designing AIs, and it's plausible that we can design them so that their values aren't very sensitive to their environments. Also, if we know that particular types of inputs are dangerous, the AIs' environment could be controllable in the sense that less-susceptible AIs could monitor for such inputs, and filter out the dangerous ones.
  • (3): "can't change the trajectory of general value drift by much" seems less likely to apply to AIs (or so I'm arguing). "Most people are selfish and don't care about value drift except to the extent that it harms them directly" means that human value drift is pretty safe (since people usually maintain some basic sense of self-preservation) but that AI value drift is scary (since it could lead your AI to totally disempower you).
  • (4) As you noted in the OP, AI could change really fast, so you might need to control value-drift just to survive a few years. (And once you have those controls in place, it might be easy to increase the robustness further, though this isn't super obvious.)
  • (5) For better or worse, people will probably care less about this in the AI case. (If the threat-model is "random drift away from the starting point", it seems like it would be for the better.)

Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.

I don't understand this point. We (or AIs that are aligned with us) get to pick from that space, and so we can pick the AIs that have least trouble with value drift. (Subject to other constraints, like competitiveness.)

(Imagine if AGI is built out of transformers. You could then argue "since the space of possible non-transformers is much larger than the space of transformers, there are more degrees of freedom along which non-transformer values can change". And humans are non-transformers, so we should be expected to have more trouble with value drift. Obviously this argument doesn't work, but I don't see the relevant disanalogy to your argument.)

Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular)

Why are the costs mostly borne by civilization in general? If I entrust some of my property to an AI system, and it changes values, that seems bad for me in particular?

Maybe the argument is something like: As long as law-and-order is preserved, things are not so bad for me even if my AI's values start drifting. But if there's a critical mass of misaligned AIs, they can launch a violent coup against the humans and the aligned AIs. And my contribution to the coup-probability is small?

Comment by Lukas Finnveden (Lanrian) on Matthew Barnett's Shortform · 2023-05-17T18:22:56.122Z · LW · GW

It's possible that there's a trade-off between monitoring for motivation changes and competitiveness. I.e., I think that monitoring would be cheap enough that a super-rich AI society could happily afford it if everyone coordinated on doing it, but if there's intense competition, then it wouldn't be crazy if there was a race-to-the-bottom on caring less about things. (Though there's also practical utility in reducing principal-agent problems and having lots of agents working towards the same goal without incentive problems. So competitiveness considerations could also push towards such monitoring / stabilization of AI values.)

Comment by Lukas Finnveden (Lanrian) on Matthew Barnett's Shortform · 2023-05-17T18:04:20.894Z · LW · GW

5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it's better if they stopped listening to the humans and followed different rules instead.

How does this happen at a time when the AIs are still aligned with humans, and therefore very concerned that their future selves/successors are aligned with humans? (Since the humans are presumably very concerned about this.)

This question is related to "we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects", but sort of posed on a different level. That quote seemingly presumes that there will be a systemic push away from human alignment, and seemingly suggests that we'll need some clever coordinated solution. (Do tell me if I'm reading you wrong!) But I'm asking why there is a systemic push away from human alignment if all the AIs are concerned about maintaining it?

Maybe the answer is: "If everyone starts out aligned with humans, then any random perturbations will move us away from that. The systemic push is entropy." I agree this is concerning if AIs are aligned in the sense of "their terminal values are similar to my terminal values", because it seems like there's lots of room for subtle and gradual changes, there. But if they're aligned in the sense of "at each point in time I take the action that [group of humans] would have preferred I take after lots of deliberation" then there's less room for subtle and gradual changes:

  • If they get subtly worse at predicting what humans would want in some cases, then they can probably still predict "[group of humans] would want me to take actions that ensure that my predictions of human deliberation are accurate" and so take actions to occasionally fix those misconceptions. (You'd have to be really bad at predicting humans to not realise that the humans wanted that^.)
  • Maybe they sometimes randomly stop caring about what the [group of humans] want. But that seems like it'd be abrupt enough that you could set up monitoring for it, and then you're back in a more classic alignment regime of detecting deception, etc. (Though a bit different in that the monitoring would probably be done by other AIs, and so you'd have to watch out for e.g. inputs that systematically and rapidly changed the values of any AIs that looked at them.)
  • Maybe they randomly acquire some other small motivation alongside "do what humans would have wanted". But if it's predictably the case that such small motivations will eventually undermine their alignment to humans, then the part of their goals that's shaped like "do what humans would have wanted" will vote strongly to monitor for such motivation changes and get rid of them ASAP. And if the new motivation is still tiny, probably it can't provide enough of a counteracting motivation to defend itself.

(Maybe you think that this type of alignment is implausible / maybe the action is in your "there's slight misalignment".)

Comment by Lukas Finnveden (Lanrian) on My views on “doom” · 2023-04-28T01:59:34.847Z · LW · GW

Maybe x-risk driven by explosive (technological) growth?

Edit: though some people think AI point of no return might happen before the growth explosion. 

Comment by Lukas Finnveden (Lanrian) on rohinmshah's Shortform · 2023-04-25T02:47:49.712Z · LW · GW

This is true if "the standard setting" refers to one where you have equally robust evidence of all options. But if you have more robust evidence about some options (which is common), the optimizer's curse will especially distort estimates of options with less robust evidence. A correct bayesian treatment would then systematically push you towards picking options with more robust evidence.

(Where I'm using "more robust evidence" to mean something like: evidence that has an overall greater likelihood ratio, and that therefore pushes you further from the prior. Where the error driving the optimizer's curse error is to look at the peak of the likelihood function while neglecting the prior and how much the likelihood ratio pushes you away from it.)
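A toy illustration of the point about robustness, assuming a normal prior over each option's true value and normally distributed estimation noise. The noise standard deviation stands in for "less robust evidence" here; that proxy, the numbers, and the option names are all assumptions made for illustration, not something taken from the discussion.

```python
def posterior_mean(estimate, noise_sd, prior_mean=0.0, prior_sd=3.0):
    """Posterior mean for a normal prior and a normal likelihood.
    The noisier the estimate, the more it is pulled back towards the prior."""
    shrink = prior_sd ** 2 / (prior_sd ** 2 + noise_sd ** 2)
    return prior_mean + shrink * (estimate - prior_mean)

# Hypothetical options: (point estimate, noise standard deviation).
options = {
    "robust": (10.0, 1.0),  # slightly lower estimate, well supported
    "noisy": (12.0, 5.0),   # higher estimate, weakly supported
}

# Picking by the raw estimate (the peak of the likelihood) favours the noisy option.
naive_pick = max(options, key=lambda name: options[name][0])

# Picking by posterior mean favours the option with more robust evidence.
bayes_pick = max(options, key=lambda name: posterior_mean(*options[name]))
```

Under these assumed numbers the raw estimates rank the noisy option first, while the posterior means rank the robust option first, which is the systematic push described above.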

Comment by Lukas Finnveden (Lanrian) on GPT-4 · 2023-04-04T01:55:34.842Z · LW · GW

Where do you get the 3-4 months max training time from? GPT-3.5 was made available March 15th, so if they made that available immediately after it finished training, that would still have left 5 months for training GPT-4. And more realistically, they finished training GPT-3.5 quite a bit earlier, leaving 6+ months for GPT-4's training.

Comment by Lukas Finnveden (Lanrian) on GPT-4 · 2023-03-23T16:53:30.613Z · LW · GW

Are you saying that you would have expected GPT-4 to be stronger if it was 500B+10T? Is that based on benchmarks/extrapolations or vibes?

Comment by Lukas Finnveden (Lanrian) on AGI in sight: our look at the game board · 2023-02-20T06:35:40.646Z · LW · GW

LW discussion

Comment by Lukas Finnveden (Lanrian) on AGI in sight: our look at the game board · 2023-02-20T06:33:05.567Z · LW · GW

This one?

Comment by Lukas Finnveden (Lanrian) on Literature review of TAI timelines · 2023-01-29T06:39:15.975Z · LW · GW

The numbers you use from Holden say that he thinks AGI by 2036 is more than 10%. But when fitting the curves you put that at exactly 10%, which will predictably be an underestimate. It seems better to fit the curves without that number and just check that the result is higher than 10%.

Comment by Lukas Finnveden (Lanrian) on Vegan Nutrition Testing Project: Interim Report · 2023-01-21T17:38:24.795Z · LW · GW

One what later? Year, month?

Comment by Lukas Finnveden (Lanrian) on My Current Take on Counterfactuals · 2023-01-12T18:31:15.758Z · LW · GW

I'm curious if anyone made a serious attempt at the shovel-ready math here and/or whether this approach to counterfactuals still looks promising to Abram? (Or anyone else with takes.)

Comment by Lukas Finnveden (Lanrian) on What's the Least Impressive Thing GPT-4 Won't be Able to Do · 2023-01-09T03:43:52.552Z · LW · GW

Even when using chain of thought?

Comment by Lukas Finnveden (Lanrian) on Language models are nearly AGIs but we don't notice it because we keep shifting the bar · 2023-01-01T17:13:43.103Z · LW · GW

GPT-3- a text-generating language model.

PaLM-540B- a stunningly powerful question-answering language model.

Great Palm- A hypothetical language model that combines the powers of GPT-3 and PaLM-540B.

I would've thought that PaLM was better at text generation than GPT-3 by default. They're both pretrained on internet next-word prediction and PaLM is bigger with more data. What makes you think GPT-3 is better at text generation?

Comment by Lukas Finnveden (Lanrian) on Revisiting algorithmic progress · 2022-12-21T16:11:20.062Z · LW · GW

Interesting, thanks! To check my understanding:

  • In general, as time passes, all the researchers increase their compute usage at a similar rate. This makes it hard to distinguish between improvements caused by compute and algorithmic progress.
  • If the correlation between year and compute was perfect, we wouldn't be able to do this at all.
  • But there is some variance in how much compute is used in different papers, each year. This variance is large enough that we can estimate the first-order effects of algorithmic progress and compute usage.
  • But complementarity is a second-order effect, and the data doesn't contain enough variation/data-points to give a good estimate of second-order effects.

Comment by Lukas Finnveden (Lanrian) on Revisiting algorithmic progress · 2022-12-19T15:16:12.858Z · LW · GW

Thanks for this!

Question: Do you have a sense of how strongly compute and algorithms are complements vs substitutes in this dataset?

(E.g. if you compare compute X in 2022, compute (k^2)X in 2020, and kX in 2021: if there's a k such that the last one is better than both the former two, that would suggest complementarity)

Comment by Lukas Finnveden (Lanrian) on ($1000 bounty) How effective are marginal vaccine doses against the covid delta variant? · 2022-11-02T17:12:24.114Z · LW · GW

I'm curious how much of a concern you think this is, now, 1 year later. I haven't heard the "total number of mRNA shots (for any disease)"-concern from other places, and I'm wondering if that's for good reasons.

Comment by Lukas Finnveden (Lanrian) on Counterarguments to the basic AI x-risk case · 2022-10-17T17:04:02.265Z · LW · GW

Competence does not seem to aggressively overwhelm other advantages in humans: 


g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at  the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here).

The usage of capabilities/competence is inconsistent here. In points a-f, you argue that general intelligence doesn't aggressively overwhelm other advantages in humans. But in point g, the Elo difference between the best and worst players is less determined by general intelligence than by how much practice people have had.

If we instead consistently talk about domain-relevant skills: In the real world, we do see huge advantages from having domain-specific skills. E.g. I expect elected representatives to be vastly better at politics than medium humans.

If we instead consistently talk about general intelligence: The chess data doesn't falsify the hypothesis that human-level variation in general intelligence is small. To gather data about that, we'd want to analyse the Elo difference between humans who have practiced similarly much but who have very different g.

(There are some papers on the correlation between intelligence and chess performance, so maybe you could get the relevant data from there. E.g. this paper says that (not controlling for anything) most measurements of cognitive ability correlate with chess performance at about 0.24 (including IQ iff you exclude a weird outlier where the correlation was -0.51).)

Comment by Lukas Finnveden (Lanrian) on Common misconceptions about OpenAI · 2022-08-29T23:39:17.767Z · LW · GW

Another fairly common argument and motivation at OpenAI in the early days was the risk of "hardware overhang," that slower development of AI would result in building AI with less hardware at a time when they can be more explosively scaled up with massively disruptive consequences. I think that in hindsight this effect seems like it was real, and I would guess that it is larger than the entire positive impact of the additional direct work that would be done by the AI safety community if AI progress had been slower 5 years ago.

Could you clarify this bit? It sounds like you're saying that OpenAI's capabilities work around 2017 was net-positive for reducing misalignment risk, even if the only positive we count is this effect. (Unless you think that there's substantial reason that acceleration is bad other than giving the AI safety community less time.) But then in the next paragraph you say that this argument was wrong (even before GPT-3 was released, which vaguely gestures at the "around 2017"-time). I don't see how those are compatible.

Comment by Lukas Finnveden (Lanrian) on chinchilla's wild implications · 2022-08-14T22:13:11.741Z · LW · GW

(If 1 firing = 1 bit, that should be 34 megabit ~= 4 megabyte.)

This random article (which I haven't fact-checked in the least) claims a bandwidth of 8.75 megabit ~= 1 megabyte. So that's like 2.5 OOMs higher than the number I claimed for chinchilla. So yeah, it does seem like humans get more raw data.

(But I still suspect that chinchilla gets more data if you adjust for (un)interestingness. Where totally random data and easily predictable/compressible data are uninteresting, and data that is hard-but-possible to predict/compress is interesting.)
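The unit conversions in this comment can be checked with a few lines. This is only a sketch of the arithmetic: the megabit figures and the ~2 kilobyte per second figure for chinchilla come from the discussion, and the variable names are made up for illustration.

```python
import math

def megabit_to_megabyte(megabit):
    """8 bits per byte."""
    return megabit / 8

firing_rate_megabyte = megabit_to_megabyte(34)  # about 4 megabyte
article_megabyte = megabit_to_megabyte(8.75)    # about 1 megabyte

# Chinchilla-equivalent rate claimed in the earlier comment: ~2 kilobyte per second.
chinchilla_bytes_per_second = 2_000

# Orders of magnitude between the article's figure and the chinchilla figure.
ooms = math.log10(article_megabyte * 1e6 / chinchilla_bytes_per_second)
```

Here `ooms` comes out between 2.5 and 3, consistent with the rough "2.5 OOMs" estimate.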

Comment by Lukas Finnveden (Lanrian) on chinchilla's wild implications · 2022-08-14T18:18:52.764Z · LW · GW

There's a billion seconds in 30 years. Chinchilla was trained on 1.4 trillion tokens. So for a human adult to have as much data as chinchilla would require us to process the equivalent of ~1400 tokens per second. I think that's something like 2 kilobyte per second.

Inputs to the human brain are probably dominated by vision. I'm not sure how many bytes per second we see, but I don't think it's many orders of magnitude higher than 2 kB.
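For reference, the tokens-per-second arithmetic can be written out as below. The bytes-per-token figure is not stated in the comment; the value used here is a hypothetical placeholder chosen only to show how a figure of roughly 2 kilobyte per second could arise.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
seconds_in_30_years = 30 * SECONDS_PER_YEAR  # a bit under a billion

CHINCHILLA_TOKENS = 1.4e12
tokens_per_second = CHINCHILLA_TOKENS / seconds_in_30_years

# Assumption for illustration only: information content per token, in bytes.
BYTES_PER_TOKEN = 1.5
bytes_per_second = tokens_per_second * BYTES_PER_TOKEN
```

A different bytes-per-token assumption (for example, counting raw text bytes instead) would shift the result by a small constant factor but not by orders of magnitude.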

Comment by Lukas Finnveden (Lanrian) on Two-year update on my personal AI timelines · 2022-08-03T21:50:41.601Z · LW · GW

The acronym is definitely used for reinforcement learning. ["RLHF" "reinforcement learning from human feedback"] gets 564 hits on google, ["RLHF" "reward learning from human feedback"] gets 0.

Comment by Lukas Finnveden (Lanrian) on Two-year update on my personal AI timelines · 2022-08-03T18:59:37.975Z · LW · GW

Reinforcement* learning from human feedback

Comment by Lukas Finnveden (Lanrian) on Unifying Bargaining Notions (2/2) · 2022-07-28T20:50:41.991Z · LW · GW

Ah, I see, it was today. Nope, wasn't trying to join. I first interpreted "next" Thursday as Thursday next week, and then "June 28" was >1 month off, which confused me. In retrospect, I could have deduced that it was meant to say July 28.

Comment by Lukas Finnveden (Lanrian) on Unifying Bargaining Notions (2/2) · 2022-07-28T19:56:22.506Z · LW · GW

Also, next Thursday (June 28) at noon Pacific time is the Schelling time to meet in the Walled Garden and discuss the practical applications of this.

Is the date wrong here?

Comment by Lukas Finnveden (Lanrian) on Daniel Kokotajlo's Shortform · 2022-07-27T23:53:07.999Z · LW · GW

Some previous LW discussion on this:

(author favors weak arguments; plenty of discussion and some disagreements in comments; not obviously worth reading)

Comment by Lukas Finnveden (Lanrian) on Reward is not the optimization target · 2022-07-26T15:22:01.305Z · LW · GW

This seems plausible if the environment is a mix of (i) situations where task completion correlates (almost) perfectly with reward, and (ii) situations where reward is very high while task completion is very low. Such as if we found a perfect outer alignment objective, and the only situation in which reward could deviate from the overseer's preferences would be if the AI entirely seized control of the reward.

But it seems less plausible if there are always (small) deviations between reward and any reasonable optimization target that isn't reward (or close enough so as to carry all relevant arguments). E.g. if an AI is trained on RL from human feedback, and it can almost always do slightly better by reasoning about which action will cause the human to give it the highest reward.

Comment by Lukas Finnveden (Lanrian) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T19:12:08.589Z · LW · GW

This is about the agreement karma, though, which starts at 0.

Comment by Lukas Finnveden (Lanrian) on Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover · 2022-07-19T18:58:25.946Z · LW · GW

There's a 1000 year old vampire stalking lesswrong!? 16 is supposed to be three levels above Eliezer.

Comment by Lukas Finnveden (Lanrian) on On how various plans miss the hard bits of the alignment challenge · 2022-07-13T16:31:14.119Z · LW · GW

As the main author of the "Alignment"-appendix of the truthful AI paper, it seems worth clarifying: I totally don't think that "train your AI to be truthful" in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:

While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scaleable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.

In other words: I don't think we had a novel proposal for how to make truthful AI systems, which tackled the hard bits of alignment. I just meant to say that the hard bits of making truthful A(G)I are similar to the hard bits of making aligned A(G)I.

At least from my own perspective, the truthful AI paper was partly about AI truthfulness maybe being a neat thing to aim for governance-wise (quite apart from the alignment problem), and partly about the idea that research on AI truthfulness could be helpful for alignment, and so it's good if people (at least/especially people who wouldn't otherwise work on alignment) work on that problem. (As one example of this: Interpretability seems useful for both truthfulness and alignment, so if people work on interpretability intended to help with truthfulness, then this might also be helpful for alignment.)

I don't think you're into this theory of change, because I suspect that you think that anyone who isn't directly aiming at the alignment problem has negligible chance of contributing any useful progress.

I just wanted to clarify that the truthful AI paper isn't evidence that people who try to hit the hard bits of alignment always miss — it's just a paper doing a different thing.

(And although I can't speak as confidently about others' views, I feel like that last sentence also applies to some of the other sections. E.g. Evan's statement, which seems to be about how you get an alignment solution implemented once you have it, and maybe about trying to find desiderata for alignment solutions, and not at all trying to tackle alignment itself. If you want to critique Evan's proposals for how to build aligned AGI, maybe you should look at this list of proposals or this positive case for how we might succeed.)

Comment by Lukas Finnveden (Lanrian) on Comment on "Propositions Concerning Digital Minds and Society" · 2022-07-11T11:15:59.505Z · LW · GW

An un-aligned AI has the decision of acting to maximize its goals in training and getting a higher short-term reward, or deceptively pretending to be aligned in training, and getting a lower short-term reward.

If there is a conflict between these, that must be because the AI's conception of reward isn't identical to the reward that we intended. So even if we dole out higher intended reward during deployment, it's not clear that that increases the reward that the AI expects after deployment. (But it might.)

Comment by Lukas Finnveden (Lanrian) on Decision theory and dynamic inconsistency · 2022-07-04T00:01:46.079Z · LW · GW

  • The Tuesday-creature might believe that its decision is correlated with the Monday-creature. [...] If the correlation is strong enough and stopping values change is expensive, then the Tuesday-creature is best served by being kind to its Wednesday-self, and helping to put it in a good position to realize whatever its goals may be.
  • The Tuesday-creature might believe that its decision is correlated with the Monday-creature’s predictions about what the Tuesday-creature would do. [...] If the Monday-creature is a good enough predictor of the Tuesday-creature, then the Tuesday-creature is best served by at least “paying back” the Monday-creature for all of the preparation the Monday-creature did

These both seem like very UDT-style arguments, that wouldn't apply to a naive EDT:er once they'd learned how helpful the Monday creature was?

So based on the rest of this post, I would have expected these motivations to only apply if either (i) the Tuesday-creature was uncertain about whether the Monday-creature had been helpful or not, or (ii) the Tuesday creature cared about not-apparently-real-worlds to a sufficient extent (including because they might think they're in a simulation). Curious if you disagree with that.

Comment by Lukas Finnveden (Lanrian) on Limerence Messes Up Your Rationality Real Bad, Yo · 2022-07-01T23:50:17.964Z · LW · GW

"suitors severely underestimate probability of being liked back"

Is this supposed to say 'overestimate'? Regardless, what info from the paper is the claim based on? Since they're only sampling stories where people were rejected, the stories will have disproportionately large numbers of cases where the suitors are over-optimistic, so that seems like it'd make it hard to draw general conclusions.

(For the other two bullet points: I'd expect those effects, directionally, just from the normal illusion of transparency playing out in a context where there are social barriers to clear communication. But haven't looked at the paper to see whether the effect is way stronger than I'd normally expect.)

Comment by Lukas Finnveden (Lanrian) on Slow motion videos as AI risk intuition pumps · 2022-06-19T00:31:04.038Z · LW · GW

2 seems more worrying than reassuring. If you have to rely on human action, you'll be slowed down. So AIs that can route around humans, or humans who can delegate more decision-making to AI systems, will have a competitive advantage over those that don't. If we're talking about AGI + decent robotics, there's in principle nothing that AIs need humans for.

3: "useless without full information" is presumably hyperbole, but I also object to weaker claims like "being 100x faster is less than half as useful as you think, if you haven't considered that spying is non-trivial". Random analogy: Consider a conflict (e.g. a war or competition between two firms) except that one side (i) gets only 4 days per year, and (ii) gets a very well-secured room to discuss decisions in. Benefit (ii) doesn't really seem to help much against the disadvantage from (i)!

Comment by Lukas Finnveden (Lanrian) on Slow motion videos as AI risk intuition pumps · 2022-06-19T00:17:20.090Z · LW · GW

(Small exception to Critch's video looking like a still frame: There's a dude with a moving hand at 0:45.)

Comment by Lukas Finnveden (Lanrian) on Scott Aaronson is joining OpenAI to work on AI safety · 2022-06-19T00:09:45.647Z · LW · GW

Here's a 1-year-old answer from Christiano to the question "Do you still think that people interested in alignment research should apply to work at OpenAI?". Generally pretty positive about people going there to "apply best practices to align state of the art models". That's not exactly what Aaronson will be doing, but it seems like alignment theory should have even less probability of differentially accelerating capabilities.

Comment by Lukas Finnveden (Lanrian) on Why all the fuss about recursive self-improvement? · 2022-06-14T14:03:28.575Z · LW · GW

Agree it's not clear. Some reasons why they might:

  • If training environments' inductive biases point firmly towards some specific (non-human) values, then maybe the misaligned AIs can just train bigger and better AI systems using similar environments that they were trained in, and hope that those AIs will end up with similar values.
    • Maybe values can differ a bit, and cosmopolitanism or decision theory can carry the rest of the way. Just like Paul says he'd be pretty happy with intelligent life that came from a similar distribution that our civ came from.
  • Humans might need to use a bunch of human labor to oversee all their human-level AIs. The HLAIs can skip this, insofar as they can trust copies of themselves. And when training even smarter AI, it's a nice benefit to have cheap copyable trustworthy human-level overseers.
  • Maybe you can somehow gradually increase the capabilities of your HLAIs in a way that preserves their values.
    • (You have a lot of high-quality labor at this point, which really helps for interpretability and making improvements through other ways than gradient descent.)

Comment by Lukas Finnveden (Lanrian) on Why all the fuss about recursive self-improvement? · 2022-06-13T23:42:25.779Z · LW · GW

Hm, maybe there are two reasons why human-level AIs are safe:

1. A bunch of our alignment techniques work better when the overseer can understand what the AIs are doing (given enough time). This means that human-level AIs are actually aligned.
2. Even if the human-level AIs misbehave, they're just human-level, so they can't take over the world.

Under model (1), it's totally ok that self-improvement is an option, because we'll be able to train our AIs to not do that.

Under model (2), there are definitely some concerning scenarios here where the AIs e.g. escape onto the internet, then use their code to get resources, duplicate themselves a bunch of times, and set up a competing AI development project. Which might have an advantage if it can care less about paying alignment taxes, in some ways.

Comment by Lukas Finnveden (Lanrian) on Why all the fuss about recursive self-improvement? · 2022-06-13T23:17:17.477Z · LW · GW

Which crazy stuff happens first seems pretty important to me, in adjudicating between hypotheses. So far, the type of crazy that we've been seeing undermines my understanding of Robin's hypotheses. I'm open to the argument that I simply don't understand what his hypotheses predict.

FWIW, I think everyone agrees strongly with "which crazy stuff happens first seems pretty important". Paul was saying that Robin never disagreed with eventual RSI, but just argued that other crazy stuff would happen first. So Robin shouldn't be criticized on the grounds of disagreeing about the importance of RSI, unless you want to claim that RSI is the first crazy thing that happens (which you don't seem to believe particularly strongly). But it's totally fair game to e.g. criticize the prediction that ems will happen before de-novo AI (if you think that now looks very unlikely).

Comment by Lukas Finnveden (Lanrian) on What is causality to an evidential decision theorist? · 2022-06-11T17:07:10.112Z · LW · GW

In a causal diagram, there is an easy graphical condition (d-connectedness) to see whether (and how) X and Y are related given Z:

We need to have a path from X to Y that satisfies certain properties:

That path can start out moving upstream (i.e. against the causal arrows); it may switch from moving upstream to downstream at any time (including at the start); it must switch direction whenever it hits a node in Z; and it may only switch from moving downstream to upstream when it hits a node in Z.

I think this is wrong, because it would imply that X and Y are d-connected in [X <- Z -> Y]. It should say:

That path can start out moving upstream (i.e. against the causal arrows); it may switch from moving upstream to downstream at any time (including at the start); it can only connect to a node in Z if it's currently moving downstream; and it must (and can only) switch from moving downstream to upstream when it hits a node in Z.
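
To make the corrected rule concrete, here is a minimal sketch in Python that searches for such a path. The function name `d_connected` and the edge-list representation are my own choices for illustration, not something from the post. It implements only the rule as worded above, so (unlike the full textbook definition of d-separation) it does not treat a collider as open when only a descendant of it is in Z. It also assumes that X and Y are not themselves in Z.

```python
from collections import deque


def d_connected(edges, x, y, z):
    """Return True if a path from x to y satisfies the rule above, given z.

    edges: iterable of (parent, child) pairs, i.e. causal arrows parent -> child.
    z: the set of conditioned-on nodes (assumed not to contain x or y).
    """
    z = set(z)
    if x in z or y in z:
        raise ValueError("this sketch assumes x and y are not in z")

    parents, children = {}, {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)

    # A search state is (node, direction of the move that reached it).
    # Starting at x behaves like having arrived while moving upstream:
    # the path may continue upstream or switch to downstream.
    start = (x, "up")
    queue = deque([start])
    seen = {start}

    while queue:
        node, direction = queue.popleft()
        if node == y:
            return True

        ups = [(p, "up") for p in parents.get(node, ())]
        downs = [(c, "down") for c in children.get(node, ())]
        if direction == "up":
            # Moving upstream: keep going up, or switch to downstream.
            moves = ups + downs
        elif node in z:
            # Arrived downstream at a node in Z: must switch to upstream.
            moves = ups
        else:
            # Arrived downstream at a node outside Z: may only continue down.
            moves = downs

        for state in moves:
            nxt, d = state
            # A node in Z may only be entered while moving downstream.
            if d == "up" and nxt in z:
                continue
            if state not in seen:
                seen.add(state)
                queue.append(state)

    return False


fork = [("Z", "X"), ("Z", "Y")]      # X <- Z -> Y
collider = [("X", "Z"), ("Y", "Z")]  # X -> Z <- Y
print(d_connected(fork, "X", "Y", {"Z"}))      # conditioning on Z blocks the fork
print(d_connected(collider, "X", "Y", {"Z"}))  # conditioning on Z opens the collider
```

On the fork [X <- Z -> Y], the only way out of X is upstream into Z, which the corrected rule forbids when Z is conditioned on, so the search finds no path. Under the originally quoted wording that move would have been allowed.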

Comment by Lukas Finnveden (Lanrian) on PaLM in "Extrapolating GPT-N performance" · 2022-06-05T21:45:43.326Z · LW · GW

Here's what the curves look like if you fit them to the PaLM data-points as well as the GPT-3 data-points.

Keep in mind that this is still based on Kaplan scaling laws. The Chinchilla scaling laws would predict faster progress.



Comment by Lukas Finnveden (Lanrian) on Is AI Progress Impossible To Predict? · 2022-05-18T12:22:22.834Z · LW · GW

Here's the corresponding graph for the non-logged difference, which also displays a large correlation.

Comment by Lukas Finnveden (Lanrian) on Stuff I might do if I had covid · 2022-05-15T18:57:12.186Z · LW · GW

FWIW, friend-of-a-friend who had covid took 100mg of fluvoxamine together with melatonin and a range of cold & flu medication, and got something that was probably serotonin syndrome. Also strong drowsiness (slept for 10h) which is normally anti-correlated with serotonin syndrome, but which was plausibly caused by the melatonin+fluvoxamine interaction.

Serotonin syndrome is pretty bad. It's unclear what caused it in this case. (It's normally super rare if the only serotonin-boosting thing you're taking is a single SSRI.) But we know that fluvoxamine+melatonin interacts to increase melatonin levels a ton. And I think melatonin-levels have non-0 interaction with serotonin-levels. So personally, I would not simultaneously take melatonin and fluvoxamine as a response to covid.

Comment by Lukas Finnveden (Lanrian) on Are smart people's personal experiences biased against general intelligence? · 2022-04-22T11:28:28.096Z · LW · GW

There's also some evidence that different cognitive skills correlate less at high g than low g: Spearman's law of diminishing returns.

So if you mostly interact with very intelligent people, it might be relatively less useful to think about unidimensional intelligence.