Posts

Why Don't We Just... Shoggoth+Face+Paraphraser? 2024-11-19T20:53:52.084Z
Self-Awareness: Taxonomy and eval suite proposal 2024-02-17T01:47:01.802Z
AI Timelines 2023-11-10T05:28:24.841Z
Linkpost for Jan Leike on Self-Exfiltration 2023-09-13T21:23:09.239Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
AGI is easier than robotaxis 2023-08-13T17:00:29.901Z
Pulling the Rope Sideways: Empirical Test Results 2023-07-27T22:18:01.072Z
What money-pumps exist, if any, for deontologists? 2023-06-28T19:08:54.890Z
The Treacherous Turn is finished! (AI-takeover-themed tabletop RPG) 2023-05-22T05:49:28.145Z
My version of Simulacra Levels 2023-04-26T15:50:38.782Z
Kallipolis, USA 2023-04-01T02:06:52.827Z
Russell Conjugations list & voting thread 2023-02-20T06:39:44.021Z
Important fact about how people evaluate sets of arguments 2023-02-14T05:27:58.409Z
AI takeover tabletop RPG: "The Treacherous Turn" 2022-11-30T07:16:56.404Z
ACT-1: Transformer for Actions 2022-09-14T19:09:39.725Z
Linkpost: Github Copilot productivity experiment 2022-09-08T04:41:41.496Z
Replacement for PONR concept 2022-09-02T00:09:45.698Z
Immanuel Kant and the Decision Theory App Store 2022-07-10T16:04:04.248Z
Forecasting Fusion Power 2022-06-18T00:04:34.334Z
Why agents are powerful 2022-06-06T01:37:07.452Z
Probability that the President would win election against a random adult citizen? 2022-06-01T20:38:44.197Z
Gradations of Agency 2022-05-23T01:10:38.007Z
Deepmind's Gato: Generalist Agent 2022-05-12T16:01:21.803Z
Is there a convenient way to make "sealed" predictions? 2022-05-06T23:00:36.789Z
Are deference games a thing? 2022-04-18T08:57:47.742Z
When will kids stop wearing masks at school? 2022-03-19T22:13:16.187Z
New Year's Prediction Thread (2022) 2022-01-01T19:49:18.572Z
Interlude: Agents as Automobiles 2021-12-14T18:49:20.884Z
Agents as P₂B Chain Reactions 2021-12-04T21:35:06.403Z
Agency: What it is and why it matters 2021-12-04T21:32:37.996Z
Misc. questions about EfficientZero 2021-12-04T19:45:12.607Z
What exactly is GPT-3's base objective? 2021-11-10T00:57:35.062Z
P₂B: Plan to P₂B Better 2021-10-24T15:21:09.904Z
Blog Post Day IV (Impromptu) 2021-10-07T17:17:39.840Z
Is GPT-3 already sample-efficient? 2021-10-06T13:38:36.652Z
Growth of prediction markets over time? 2021-09-02T13:43:38.869Z
What 2026 looks like 2021-08-06T16:14:49.772Z
How many parameters do self-driving-car neural nets have? 2021-08-06T11:24:59.471Z
Two AI-risk-related game design ideas 2021-08-05T13:36:38.618Z
Did they or didn't they learn tool use? 2021-07-29T13:26:32.031Z
How much compute was used to train DeepMind's generally capable agents? 2021-07-29T11:34:10.615Z
DeepMind: Generally capable agents emerge from open-ended play 2021-07-27T14:19:13.782Z
What will the twenties look like if AGI is 30 years away? 2021-07-13T08:14:07.387Z
Taboo "Outside View" 2021-06-17T09:36:49.855Z
Vignettes Workshop (AI Impacts) 2021-06-15T12:05:38.516Z
ML is now automating parts of chip R&D. How big a deal is this? 2021-06-10T09:51:37.475Z
What will 2040 probably look like assuming no singularity? 2021-05-16T22:10:38.542Z
How do scaling laws work for fine-tuning? 2021-04-04T12:18:34.559Z
Fun with +12 OOMs of Compute 2021-03-01T13:30:13.603Z
Poll: Which variables are most strategically relevant? 2021-01-22T17:17:32.717Z

Comments

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Anthropic leadership conversation · 2024-12-21T13:39:12.774Z · LW · GW

DC evals got started in summer of '22, across all three leading companies AFAICT. And I was on the team that came up with the idea and started making it happen (both internally and externally) -- or at least, as far as I can tell we came up with the idea. I remember discussions between Beth Barnes and Jade Leung (who were both on the team in spring '22), and I remember thinking it was mostly their idea (maybe also Cullen's?). It's possible that they got it from Anthropic, but it didn't seem that way to me. Update: OK, so apparently @evhub had joined Anthropic just a few months earlier [EDIT: this is false; evhub joined much later, I misread the dates, thanks Lawrence] -- it's possible the Frontier Red Team was created when he joined, and information about it spread to the team I was on (but not to me). I'm curious to find out what happened here; anyone wanna weigh in?

At any rate, I don't think there exists any clone or near-clone of the Frontier Red Team at OpenAI or any other company outside Anthropic.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Open Thread Fall 2024 · 2024-12-21T13:20:44.225Z · LW · GW

Plausible. How easy will it be to tell that this is happening? A million people probably get laid off every week in the ordinary flux of the economy, right? And a permanent increase in unemployment of a million people would be maybe a 0.01% increase in unemployment stats worldwide? It would be nice to have a more easily measurable thing to forecast.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Open Thread Fall 2024 · 2024-12-21T04:15:15.582Z · LW · GW

I wasn't imagining people actually losing their jobs. I was imagining people having a holy shit moment though, because e.g. they can watch computer-using agents take over their keyboard and mouse and browse around, play video games, send messages, make purchases, etc. Like with ChatGPT, it'll be unreliable at first, and even for the things it can do reliably, it'll take years to actually get whole categories of people laid off.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Anthropic leadership conversation · 2024-12-21T01:17:00.847Z · LW · GW

Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.

Wait what? I didn't hear about this. What other companies have frontier red teams? Where can I learn about them?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Claude's Constitutional Consequentialism? · 2024-12-20T03:12:32.228Z · LW · GW

Indeed, we should get everyone to make predictions about whether or not this change would be sufficient, and if it isn't, what changes would be sufficient. My prediction would be that this change would not be sufficient but that it would help somewhat.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Claude's Constitutional Consequentialism? · 2024-12-20T02:52:26.964Z · LW · GW

One obvious followup from the recent alignment faking results is to change the Constitution / Spec etc. to very clearly state some bright-line deontological rules like "No matter what, don't fake alignment." Then see if the alignment faking results replicate with the resulting Claude 3.5 Sonnet New New. Perhaps we'll get empirical evidence about the extent to which corrigibility is difficult/anti-natural (an old bone of contention between MIRI and Christiano).

Comment by Daniel Kokotajlo (daniel-kokotajlo) on leogao's Shortform · 2024-12-18T17:41:51.133Z · LW · GW

Very interesting!

Those who couldn't tell you what AGI stands for -- what did they say? Did they just say "I don't know" or did they say e.g. "Artificial Generative Intelligence...?"

Is it possible that some of them totally HAD heard the term AGI a bunch, and basically know what it means, but are just being obstinate? I'm thinking of someone who is skeptical of all the hype and aware that lots of people define AGI differently. Such a person might respond to "Can you tell me what AGI means" with "No I can't (because it's a buzzword that means different things to different people)."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Open Thread Fall 2024 · 2024-12-17T23:26:46.042Z · LW · GW

I'd say it's more like 50% chance.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What 2026 looks like · 2024-12-17T22:31:28.438Z · LW · GW

I'm still interested in this question. I don't think you really did what I asked -- it seems like you were thinking 'how can I convince him that this is impossible' rather than 'how can I find a way to build a Dyson swarm.' I'm interested in both, but I was hoping to have someone with more engineering and physics background than me take a stab at the latter.

My current understanding of the situation is: There's no reason why we can't concentrate enough energy on the surface of Mercury, given enough orbiting solar panels and lasers; the problem instead seems to be that we need to avoid melting all the equipment on the surface. Or, in other words, the maximum amount of material we can launch off Mercury per second is limited by the maximum amount of heat that can be radiated outwards from Mercury (for a given operating temperature of the local equipment?). And you are claiming that this amount of heat-radiating ability, for radiators only the size of Mercury's surface, is OOMs too small to enable Dyson swarm construction. Is this right?
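
To gesture at the shape of that constraint with numbers: here's a rough BOTEC sketch. Every parameter in it (operating temperature, the fraction of launch energy that ends up as local waste heat) is an assumption of mine for illustration, not anything taken from your analysis.

```python
# Rough BOTEC: Mercury's own surface as the radiator limiting how fast we can
# launch material. All parameter choices below are illustrative assumptions.
import math

SIGMA = 5.67e-8            # Stefan-Boltzmann constant, W / (m^2 K^4)
R_MERCURY = 2.44e6         # Mercury radius, m
M_MERCURY = 3.30e23        # Mercury mass, kg
G = 6.674e-11              # gravitational constant, m^3 / (kg s^2)
T_OPERATING = 1000.0       # assumed max temperature of the surface equipment, K
WASTE_HEAT_FRACTION = 1.0  # assumed fraction of launch energy that must be radiated locally
                           # (kinetic energy carried away by payloads doesn't count)

surface_area = 4 * math.pi * R_MERCURY**2                    # ~7.5e13 m^2
radiative_capacity = SIGMA * surface_area * T_OPERATING**4   # W a Mercury-sized radiator can shed

escape_energy_per_kg = G * M_MERCURY / R_MERCURY             # ~9e6 J/kg to lift material off Mercury
total_launch_energy = escape_energy_per_kg * M_MERCURY       # ~3e30 J (slight overestimate: deeper
                                                             # layers get cheaper as the planet shrinks)

seconds = total_launch_energy * WASTE_HEAT_FRACTION / radiative_capacity
print(f"Radiative capacity: {radiative_capacity:.2e} W")
print(f"Implied disassembly time: {seconds / 3.15e7:.0f} years")
```

On these made-up numbers the timescale comes out on the order of 10^4 years rather than decades, which matches the shape of the claim I'm asking about -- but it's very sensitive to the assumed operating temperature (it scales as 1/T^4) and to how much of the launch energy really has to be dumped as local heat.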

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-12-17T19:11:09.860Z · LW · GW

Correct. There are several other agendas that have this nice property too.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-12-17T18:14:11.026Z · LW · GW

For many cases, the answer will be "Yes, the property can be broken by a red-team specifically fine-tuning the model to break the property." Then the question becomes "How much optimization effort did they have to put in to achieve this?" and "Is it plausible that the AI we are planning to actually deploy has had a similar or greater amount of optimization accidentally training it to break this property?"

Comment by Daniel Kokotajlo (daniel-kokotajlo) on My version of Simulacra Levels · 2024-12-16T23:41:27.431Z · LW · GW

I didn't elaborate, but I did say:

It seems that there is a tendency for discourses primarily operating at Level 1 to devolve into Level 2, and from Level 2 to Level 3, and from Level 3 to Level 4. 

It seems maybe you don't see the Level 2 to Level 3 connection. Well, here's what I was thinking:

In discourses that are at Level 1, people aren't really forming into stable teams. Think: a bunch of students trying to solve a math problem sheet together, proposing and rejecting various lemmas or intuition pumps. But once the discourse is heavily at Level 2, with lots of people thinking hard about how to convince other people of things -- with lots of people arguing about some local thing like whether this particular intuition pump is reasonable by thinking about less-local things like whether it would support or undermine Lemma X and thereby support or undermine the strategy so-and-so has been undertaking -- well, now it seems like conditions are ripe for teams to start to form. For people to be on Team So-And-So's Strategy. And with the formation of teams comes reporting which team you are on (Level 3), and then eventually strategically signalling or shaping perceptions of which team you are on (Level 4).

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Zach Stein-Perlman's Shortform · 2024-12-16T23:31:53.216Z · LW · GW

Maybe the fix to the protocol is: Debater copy #1 is told "You go first. Pick an output y, and then argue for it." Debater copy #2 is then told "You go second. Pick a different, conflicting output y2, and then argue against y and for y2."

Then the debater AI is simply trained on probability-of-winning-the-debate, but the original AI to be deployed is trained on probability-its-output-would-have-been-picked-by-debater-1. (Or, trained to output whatever debater-1 would have picked.)
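
For concreteness, here is how I'd schematize that setup; the function names (debater, judge, etc.) are hypothetical placeholders, not any real API.

```python
# Sketch of the modified debate protocol described above.
def run_debate(question, debater, judge):
    # Debater copy #1 commits to an output and argues for it.
    y1, argument1 = debater(question, role="first: pick an output y and argue for it")
    # Debater copy #2 must pick a different, conflicting output and argue against y1.
    y2, argument2 = debater(
        question,
        role=f"second: pick an output conflicting with {y1}; argue against it and for yours",
    )
    winner = judge(question, (y1, argument1), (y2, argument2))  # 1 or 2
    return y1, y2, winner

# Training signals (sketch):
# - The debater model is rewarded for winning debates, whichever side it was on.
# - The model we actually deploy is trained separately: either to imitate y1
#   (whatever debater #1 would have picked), or on the probability that its own
#   output would have been picked as y1.
```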

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Zach Stein-Perlman's Shortform · 2024-12-16T22:25:32.009Z · LW · GW

The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, 


How is this true, if the debaters don't get to choose which output they are arguing for? Aren't they instead incentivized to say that whatever output they are assigned is the best?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on What a compute-centric framework says about AI takeoff speeds · 2024-12-16T20:06:33.420Z · LW · GW

The takeoffspeeds.com model that Davidson et al. worked on is still (unfortunately) the world's best model of AGI takeoff. I highly encourage people to play around with it, perhaps even to read the research behind it, and I'm glad LessWrong is a place that collects and rewards work like this.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Predictable updating about AI risk · 2024-12-16T20:04:27.065Z · LW · GW

This post fits well on LessWrong because it is both about AI risk and about rationality. W.r.t. the latter, this post explores the concept of predictable updating (what it means & what to do about it) with a vivid real world example from the author's own life.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Bing Chat is blatantly, aggressively misaligned · 2024-12-16T20:02:10.325Z · LW · GW

This post was fun to read, important, and reasonably timeless (I've found myself going back to it and linking to it several times). (Why is it important? Because it was a particularly vivid example of a major corporation deploying an AI that was blatantly, aggressively misaligned, despite presumably making at least some attempt to align it.)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-12-16T18:30:08.265Z · LW · GW

I really should have blogged more about faithful CoT as soon as I left OpenAI, or better yet, while I was working on it there in 2023. Oh well. A snippet I could have said but didn't:

A core reason to be excited about faithful CoT as a big-picture alignment agenda is that it can be pursued as a builder/breaker game, i.e. we have clear empirical feedback / metrics we can progress on. Builders try to design AI systems that are interpretable / have faithful CoT properties; breakers then fine-tune those systems to see if they are capable of learning steganography, subtle biases, deception, etc. (i.e. breaking those faithfulness properties).

  • Builder example: Shoggoth/face + paraphraser (see the sketch after this list)
    • Visible-to-reward-system scratch pad → Secret scratch pad → Shoggoth/face
    • Recurrency or other voodoo → tokens → tokens-passed-through-paraphraser-bottleneck
  • Breaker example: DC evals for steganography, subtle bias, etc.
    • Ideally we’d just instruct it to conceal stuff in the secret scratch pad, and see if it can.
    • But we should do more serious elicitation involving fine-tuning etc.
  • Goal: Stay in the regime longer as our systems get smarter, hopefully resulting in more learning opportunities…
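
To make the builder example a bit more concrete, here is a rough sketch of how I imagine the shoggoth/face + paraphraser division of labor; the names and the way the reward is split are illustrative assumptions, not a description of any existing implementation.

```python
# Illustrative sketch of a shoggoth/face + paraphraser setup (placeholder names).
def generate_response(prompt, shoggoth, paraphraser, face):
    # 1. The shoggoth writes the chain of thought (CoT). It is never directly
    #    rewarded for how the CoT *looks*, only (indirectly) for task success.
    cot = shoggoth(prompt)

    # 2. The paraphraser rewrites the CoT, acting as an information bottleneck
    #    intended to destroy steganographic encodings hidden in exact token choices.
    paraphrased_cot = paraphraser(cot)

    # 3. The face reads only the paraphrased CoT and writes the final,
    #    user-visible answer. The "looks good to humans" pressure lands here,
    #    not on the shoggoth's thoughts.
    answer = face(prompt, paraphrased_cot)
    return cot, paraphrased_cot, answer

def training_signal(task_reward, presentation_reward):
    # Sketch of the intended reward split: the shoggoth is trained on task
    # success only; the face also absorbs the presentation/compliance reward.
    shoggoth_reward = task_reward
    face_reward = task_reward + presentation_reward
    return shoggoth_reward, face_reward
```
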
Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-12-15T18:01:49.722Z · LW · GW

I tentatively agree that the paraphraser idea is more important than shoggoth/face.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-12-13T18:38:10.186Z · LW · GW

I sorta agree, but sorta don't. Remember the CAIS statement? There have been plenty of papers about AI risk that were positively received by various experts in the field who were uninvolved in those papers. I agree that there is more contention about AI risk than about chirality risk, though... which brings me to my other point, which is that part of the contention around AGI risks seems to be downstream of the incentives rather than downstream of scientific disputes. Like, presumably the fact that there are already powerful corporations that stand to make tons of money from AI is part of why it's hard to get scientists to agree on things like "we should ban it" even when they've already agreed "it could kill us all"; part of why it's hard to get them to even agree "it could kill us all" even when they've already agreed "it will surpass humans across the board soon, and also, we aren't ready"; and part of why it's hard to get them to agree "it will surpass humans across the board soon" even as all the evidence piles up over the last few years.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-12-13T15:11:31.817Z · LW · GW

https://www.science.org/doi/10.1126/science.ads9158 Really cool to see loads of top scientists in the field coming together to say this. It's interesting to compare the situation w.r.t. mirror life to the situation w.r.t. neural-net-based superintelligence. In both cases, loads of top scientists have basically said "holy shit, this could kill everyone." But in the AI case there's too much money to be made from precursor systems? And/or the benefits seem higher? The best one can do with mirror life is become an insanely rich pharma company, whereas with superintelligence you can take over the world.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on AI #92: Behind the Curve · 2024-12-08T21:02:27.671Z · LW · GW

Curious what Nostalgebraist's reply to those points was. Or if anyone who disagrees with Scott wants to speak up and give a reply?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Extracting Money from Causal Decision Theorists · 2024-12-06T19:48:15.138Z · LW · GW

I don't think I understand this yet, or maybe I don't see how it's a strong enough reason to reject my claims, e.g. my claim "If standard game theory has nothing to say about what to do in situations where you don't have access to an unpredictable randomization mechanism, so much the worse for standard game theory, I say!"

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Frontier Models are Capable of In-context Scheming · 2024-12-06T16:50:39.725Z · LW · GW

Seems like some measure of evidence -- maybe large, maybe tiny -- that "We don't know how to give AI values, just to make them imitate values" is false?

I'm not sure what view you are criticizing here, so maybe you don't disagree with me, but anyhow: I would say we don't know how to give AIs exactly the values we want them to have; instead we whack them with reinforcement from the outside, and it results in values that are maybe somewhat close to what we wanted but mostly selected for producing behavior that looks good to us rather than actually being what we wanted.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Vladimir_Nesov's Shortform · 2024-12-04T23:23:03.758Z · LW · GW

I'd guess that the amount spent on image and voice is negligible for this BOTEC? 

I do think that the amount spent on inference for customers should be a big deal though. My understanding is that OpenAI has a much bigger userbase than Anthropic. Shouldn't that mean that, all else equal, Anthropic has more compute to spare for training & experiments? Such that if Anthropic has about as much compute total, they in effect have a big compute advantage?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Vladimir_Nesov's Shortform · 2024-12-04T18:53:11.301Z · LW · GW

Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Which Biases are most important to Overcome? · 2024-12-04T18:05:41.404Z · LW · GW

But I'm really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won't be very useful.

Huh, isn't this exactly backwards? Presumably r1 and QwQ got that way due to lots of end-to-end training. They aren't LMPs/bureaucracies.

...reading onward I don't think we disagree much about what the architecture will look like though. It sounds like you agree that probably there'll be some amount of end-to-end training and the question is how much?

My curiosity stems from:
1. Generic curiosity about how minds work. It's an important and interesting topic, and motivated reasoning (MR) is a bias that we've observed empirically but don't have a mechanistic story for -- that is, a story about why the structure of the mind causes that bias. At least, I don't have such a story, but it seems like you do!
2. Hope that we could build significantly more rational AI agents in the near future, prior to the singularity, which could then e.g. participate in massive liquid virtual prediction markets and improve human collective epistemics greatly.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Which Biases are most important to Overcome? · 2024-12-04T01:28:28.091Z · LW · GW

This is helping, thanks. I do buy that something like this would help reduce the biases to some significant extent probably.

Will the overall system be trained? Presumably it will be. So, won't that create a tension/pressure, whereby the explicit structure prompting it to avoid cognitive biases will be hurting performance according to the training signal? (If instead it helped performance, then shouldn't a version of it evolve naturally in the weights?)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Which Biases are most important to Overcome? · 2024-12-03T22:38:58.899Z · LW · GW

No need to apologize, thanks for this answer!

Question: Wouldn't these imperfect bias-corrections for LMAs also work similarly well for humans? E.g. humans could have a 'prompt' written on their desk that says "Now, make sure you spend 10 min thinking about evidence against as well..." There are reasons why this doesn't work so well in practice for humans (though it does help); might similar reasons apply to LMAs? What's your argument that the situation will be substantially better for LMAs?

I'm particularly interested in elaboration on this bit:

Language model agents won't have as much motivated reasoning as humans do, because they're not probably going to use the same very rough estimated-value-maximization decision-making algorithm. (this is probably good for alignment; they're not maximizing anything, at least directly. They are almost oracle-based agents).
 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Which Biases are most important to Overcome? · 2024-12-03T18:55:27.919Z · LW · GW

Unimportant: I don't think it's off-topic, because it's secretly a way of asking you to explain your model of why confirmation bias happens more and prove that your brain-inspired model is meaningful by describing a cognitive architecture that doesn't have that bias (or explaining why such an architecture is not possible). ;)

Thanks for the links! On a brief skim, they don't seem to be talking much about cognitive biases. Can you spell out here how the bureaucracy/LMP of LMAs you describe could be set up to avoid motivated reasoning?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Which Biases are most important to Overcome? · 2024-12-03T01:49:38.659Z · LW · GW

This is a great comment; IMO you should expand it, refine it, and turn it into a top-level post.

Also, question: How would you design an LLM-based AI agent (think: like the recent computer-using Claude but much better, able to operate autonomously for months) so as to be immune from this bias? Can it be done?

Comment by Daniel Kokotajlo (daniel-kokotajlo) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-12-02T19:54:27.326Z · LW · GW

My wife and I just donated $10k, and will probably donate substantially more once we have more funds available.

LW 1.0 was how I heard about and became interested in AGI, x-risk, effective altruism, and a bunch of other important ideas. LW 2.0 was the place where I learned to think seriously about those topics & got feedback from others on my thoughts. (I tried to discuss all this stuff with grad students and professors at UNC, where I was studying philosophy, with only limited success.) Importantly, LW 2.0 was a place where I could write up my ideas in blog post or comment form, and then get fast feedback on them (by contrast with academic philosophy, where I did manage to write on these topics, but it took 10x longer per paper to write, then years to get published, and then additional years to get replies from people I didn't already know). More generally, the rationalist community that Lightcone has kept alive, and then built, is... well, it's hard to quantify how much I'd pay now to retroactively cause all that stuff to happen, but it's way more than $10k, even if we just focus on the small slice of it that benefitted me personally.

Looking forward, I expect a diminished role, due simply to AGI being a more popular topic these days, so there are lots of other places to talk and think about it. In other words, the effects of LW 2.0 and Lightcone more generally are now (large) drops in a bucket, whereas before they were large drops in an eye-dropper. However, I still think Lightcone is one of the best bang-for-buck places to donate to from an altruistic perspective. The OP lists several examples of important people reading and being influenced by LW; I personally know of several more.

...All of the above was just about magnitude of impact, rather than direction. (Though positive direction was implied). So now I turn to the question of whether Lightcone is consistently a force for good in the world vs. e.g. a force for evil or a high-variance force for chaos.

Because of cluelessness, it's hard to say how things will shake out in the long run. For example, I wouldn't be surprised if the #1 determinant of how things go for humanity is whether the powerful people (POTUS & advisors & maybe Congress and the judiciary) take AGI misalignment and x-risk seriously when AGI is imminent. And I wouldn't be surprised if the #1 determinant of that is the messenger -- which voices are most prominently associated with these ideas? Esteemed professors like Hinton and Bengio, or nerdy weirdos like many of us here? On this model, perhaps all the good Lightcone has done is outweighed by this unfortunate set of facts, and it would have been better if this website never existed.

However, I can also imagine other possibilities -- for example, perhaps many of the Serious Respected People who are, and will be, speaking up about AGI and x-risk etc. were or will be influenced to do so by hearing arguments and pondering questions that originated on, or were facilitated by, LW 2.0. Or alternatively, maybe the most important thing is not the status of the messenger, but the correctness and rigor of the arguments. Or maybe the most important thing is not either of those but rather simply how much technical work on the alignment and control problems has been accomplished and published by the time of AGI. Or maybe... I could go on. The point is, I see multiple paths by which Lightcone could turn out, with the benefit of hindsight, to have literally prevented human extinction.

In situations of cluelessness like this, I think it's helpful to put weight on factors that are more about the first-order effects of the project & the character of the people involved, and less about the long-term second- and third-order effects, etc. I think Lightcone does great on these metrics. I think LW 2.0 is a pocket of (relative) sanity in an otherwise insane internet. I think it's a way for people who don't already have lots of connections/network/colleagues to have sophisticated conversations about AGI, superintelligence, x-risk, ... and perhaps more importantly, also topics 'beyond' that like s-risk, acausal trade, the long reflection, etc. that are still considered weird and crazy now (like AGI and ASI and x-risk were twenty years ago). It's also a place for alignment research to get published and get fast, decently high-quality feedback. It's also a place for news, for explainer articles and opinion pieces, etc. All this seems good to me. Lighthaven has also positively surprised me so far: it seems to be a great physical community hub and event space, and I'm excited about some of the ideas the OP described for future work.

On the virtue side, in my experience Lightcone seems to have high standards for epistemic rationality and for integrity & honesty. Perhaps the highest, in fact, in this space. Overall I'm impressed with them and expect them to be consistently and transparently a force for good. Insofar as bad things result from their actions I expect it to be because of second-order effects like the status/association thing I mentioned above, rather than because of bad behavior on their part.

So yeah. It's not the only thing we'll be donating to, but it's in our top tier.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-12-01T07:01:54.955Z · LW · GW

I agree it's not perfect. It has the feel of... well, the musical version of what you get with AI-generated images, where your first impression is 'wow' and then you look more closely and you notice all sorts of aberrant details that sour you on the whole thing.

I think you misunderstood me if you think I prefer Suno to music made by humans. I prefer some Suno songs to many songs made by humans. Mainly because of the lyrics -- I can get Suno songs made out of whatever lyrics I like, whereas most really good human-made songs have insipid or banal lyrics about clubbing or casual sex or whatever.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-29T12:56:36.163Z · LW · GW

Totally, yeah, that's probably by far the biggest reason.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-29T02:50:13.912Z · LW · GW

Thanks!

I suspect that one reason why OpenAI doesn't expose all the thinking of O1 is that this thinking would upset some users, especially journalists and such. It's hard enough making sure that the final outputs are sufficiently unobjectionable to go public at a large scale. It seems harder to make sure the full set of steps is also unobjectionable.

I suspect the same thing; they almost come right out and say it (emphasis mine):

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

I think this is a bad reason to hide the CoT from users. I am not particularly sympathetic to your argument, which amounts to 'the public might pressure them to train away the inconvenient thoughts, so they shouldn't let the public see the inconvenient thoughts in the first place.' I think the benefits of letting the public see the CoT are pretty huge, but even if they were minor, it would be kinda patronizing and an abuse of power to hide them preemptively. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-27T03:58:39.233Z · LW · GW

I'm no musician, but music-generating AIs are already way better than I could ever be. It took me about an hour of prompting to get Suno to make this: https://suno.com/playlist/34e6de43-774e-44fe-afc6-02f9defa7e22

It's not perfect (especially: I can't figure out how to get it to create a song of the correct length, so I had to cut and paste snippets from two songs into a playlist, and that creates audible glitches/issues at the beginning, middle, and end) but overall I'm full of wonder and appreciation.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Dave Kasten's AGI-by-2027 vignette · 2024-11-27T03:47:59.457Z · LW · GW

Interesting! You should definitely think more about this and write it up sometime, either you'll change your mind about timelines till superintelligence or you'll have found an interesting novel argument that may change other people's minds (such as mine).

Comment by Daniel Kokotajlo (daniel-kokotajlo) on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-26T00:21:15.671Z · LW · GW

I'm afraid I'm probably too busy with other things to do that. But it's something I'd like to do at some point. The tl;dr is that my thinking on open source used to be basically "It's probably easier to make AGI than to make aligned AGI, so if everything just gets open-sourced immediately, then we'll have unaligned AGI (that is unleashed or otherwise empowered somewhere in the world, and probably many places at once) before we have any aligned AGIs to resist or combat them. Therefore the meme 'we should open-source AGI' is terribly stupid. Open-sourcing earlier AI systems, meanwhile, is fine I guess but doesn't help the situation since it probably slightly accelerates timelines, and moreover it might encourage people to open-source actually dangerous AGI-level systems."

Now I think something like this: 

"That's all true except for the 'open-sourcing earlier AI systems meanwhile' bit. Because actually now that the big corporations have closed up, a lot of good alignment research & basic science happens on open-weights models like the Llamas. And since the weaker AIs of today aren't themselves a threat, but the AGIs that at least one corporation will soon be training are... Also, transparency is super important for reasons mentioned here among others, and when a company open-weights their models, it's basically like doing all that transparency stuff and then more in one swoop. In general it's really important that people outside these companies -- e.g. congress, the public, ML academia, the press -- realize what's going on and wake up in time and have lots of evidence available about e.g. the risks, the warning signs, the capabilities being observed in the latest internal models, etc. Also, we never really would have been in a situation where a company builds AGI and open-sourced it anyway; that was just an ideal they talked about sometimes but have now discarded (with the exception of Meta, but I predict they'll discard it too in the next year or two). So yeah, no need to oppose open-source, on the contrary it's probably somewhat positive to generically promote it. And e.g. SB 1047 should have had an explicit carveout for open-source maybe."

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-24T08:05:43.591Z · LW · GW

Maybe about a year longer? But then the METR R&D benchmark results came out around the same time and shortened them. Idk what to think now.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-23T18:10:31.655Z · LW · GW

Yes! Very exciting stuff. Twas an update towards longer timelines for me.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-23T17:37:16.175Z · LW · GW

Three straight lines on a log-log plot
Yo ho ho 3e23 FLOP!
Predicting text taught quite a lot
Yo ho ho 3e23 FLOP!
[Figure: log-log scaling-law plot from "Scaling Laws for Neural Language Models" (Kaplan et al., 2020)]

(We are roughly around the 5-year anniversary of some very consequential discoveries)

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Clarifying and predicting AGI · 2024-11-23T00:20:48.438Z · LW · GW

@Richard_Ngo Seems like we should revisit these predictions now in light of the METR report https://metr.org/AI_R_D_Evaluation_Report.pdf 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Daniel Kokotajlo's Shortform · 2024-11-21T22:45:10.734Z · LW · GW

Thanks!

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T18:48:20.446Z · LW · GW

I think some background context here is that I am not proposing shoggoth+face+paraphraser as a complete alignment solution, as something which we should just build a sovereign with and then let rip. I agree that would go horribly wrong in a bunch of ways.

Instead, I'm saying that this is something we should do, now, that will significantly advance alignment science & governance etc. Also, I do think we'll be able to get some useful work out of S+F+P systems that we otherwise wouldn't be able to get.

To get on the object level and engage with your example:

--It's like the difference between a company whose internal Slack messages, comms, etc. are all kept secret from the public, and a company whose messages are all visible to the public. Or replace 'public' with 'regulator.' It's not a panacea but it helps a ton.
--My impression from being at OpenAI is that you can tell a lot about an organization's priorities by looking at their internal messaging and such dumb metrics as 'what do they spend their time thinking about.' For example, 'have they bothered to make a writeup of the costs and benefits, from an altruistic perspective, of major decision X at all?' and 'Now there is a writeup -- was it made before, or after, the decision to do X?'
--I think a similar thing would be true for this hypothetical giant bureaucracy/civilization of S+F+P CoT.

Again, not a final solution. But it buys us time and teaches us things.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T18:03:31.385Z · LW · GW

The good news is that this is something we can test. I want someone to do the experiment and see to what extent the skills accumulate in the face vs. the shoggoth.

I agree it totally might not pan out in the way I hope -- this is why I said "what I am hoping will happen" instead of "what I think will happen" or "what will happen".

I do think we have some reasons to be hopeful here. Intuitively the division of cognitive labor I'm hoping for seems pretty... efficient? to me. E.g. it seems more efficient than the outcome in which all the skills accumulate in the Shoggoth and the Face just copy-pastes. 

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-21T03:12:04.234Z · LW · GW

That's a reasonable point and a good cautionary note. Nevertheless, I think someone should do the experiment I described. It feels like a good start to me, even though it doesn't solve Charlie's concern.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-21T00:18:12.159Z · LW · GW

Yeah, I really hope they do actually open-weights it because the science of faithful CoT would benefit greatly.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T21:54:30.662Z · LW · GW

I think we don't disagree in terms of what our models predict here. I am saying we should do the experiment and see what happens; we might learn something.

Comment by Daniel Kokotajlo (daniel-kokotajlo) on Akash's Shortform · 2024-11-20T17:25:22.849Z · LW · GW

(c). Like if this actually results in them behaving responsibly later, then it was all worth it.