Posts

AXRP Episode 38.3 - Erik Jenner on Learned Look-Ahead 2024-12-12T05:40:06.835Z
AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment 2024-12-01T06:00:06.345Z
AXRP Episode 38.2 - Jesse Hoogland on Singular Learning Theory 2024-11-27T06:30:03.821Z
AXRP Episode 38.1 - Alan Chan on Agent Infrastructure 2024-11-16T23:30:09.098Z
AXRP Episode 38.0 - Zhijing Jin on LLMs, Causality, and Multi-Agent Systems 2024-11-14T07:00:06.977Z
MATS AI Safety Strategy Curriculum v2 2024-10-07T22:44:06.396Z
AXRP Episode 37 - Jaime Sevilla on Forecasting AI 2024-10-04T21:00:03.077Z
AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics 2024-09-29T05:50:02.531Z
AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization 2024-08-24T22:30:02.039Z
AXRP Episode 34 - AI Evaluations with Beth Barnes 2024-07-28T03:30:07.192Z
Why keep a diary, and why wish for large language models 2024-06-14T16:10:07.658Z
AXRP Episode 33 - RLHF Problems with Scott Emmons 2024-06-12T03:30:05.747Z
AXRP Episode 32 - Understanding Agency with Jan Kulveit 2024-05-30T03:50:05.289Z
AXRP Episode 31 - Singular Learning Theory with Daniel Murfet 2024-05-07T03:50:05.001Z
AXRP Episode 30 - AI Security with Jeffrey Ladish 2024-05-01T02:50:04.621Z
AXRP Episode 29 - Science of Deep Learning with Vikrant Varma 2024-04-25T19:10:06.063Z
Bayesian inference without priors 2024-04-24T23:50:08.312Z
AXRP Episode 28 - Suing Labs for AI Risk with Gabriel Weil 2024-04-17T21:42:46.992Z
AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt 2024-04-11T21:30:04.244Z
Daniel Kahneman has died 2024-03-27T15:59:14.517Z
Superforecasting the Origins of the Covid-19 Pandemic 2024-03-12T19:01:15.914Z
Common Philosophical Mistakes, according to Joe Schmid [videos] 2024-03-03T00:15:47.899Z
11 diceware words is enough 2024-02-15T00:13:43.420Z
Most experts believe COVID-19 was probably not a lab leak 2024-02-02T19:28:00.319Z
n of m ring signatures 2023-12-04T20:00:06.580Z
AXRP Episode 26 - AI Governance with Elizabeth Seger 2023-11-26T23:00:04.916Z
How to type Aleksander Mądry's last name in LaTeX 2023-11-21T00:50:07.189Z
Aaron Silverbook on anti-cavity bacteria 2023-11-20T03:06:19.524Z
If a little is good, is more better? 2023-11-04T07:10:05.943Z
On Frequentism and Bayesian Dogma 2023-10-15T22:23:10.747Z
AXRP Episode 25 - Cooperative AI with Caspar Oesterheld 2023-10-03T21:50:07.552Z
Watermarking considered overrated? 2023-07-31T21:36:05.268Z
AXRP Episode 24 - Superalignment with Jan Leike 2023-07-27T04:00:02.106Z
AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu 2023-07-27T01:50:02.808Z
AXRP announcement: Survey, Store Closing, Patreon 2023-06-28T23:40:02.537Z
AXRP Episode 22 - Shard Theory with Quintin Pope 2023-06-15T19:00:01.340Z
[Linkpost] Interpretability Dreams 2023-05-24T21:08:17.254Z
Difficulties in making powerful aligned AI 2023-05-14T20:50:05.304Z
AXRP Episode 21 - Interpretability for Engineers with Stephen Casper 2023-05-02T00:50:07.045Z
Podcast with Divia Eden and Ronny Fernandez on the strong orthogonality thesis 2023-04-28T01:30:45.681Z
AXRP Episode 20 - ‘Reform’ AI Alignment with Scott Aaronson 2023-04-12T21:30:06.929Z
[Link] A community alert about Ziz 2023-02-24T00:06:00.027Z
Video/animation: Neel Nanda explains what mechanistic interpretability is 2023-02-22T22:42:45.054Z
[linkpost] Better Without AI 2023-02-14T17:30:53.043Z
AXRP: Store, Patreon, Video 2023-02-07T04:50:05.409Z
Podcast with Oli Habryka on LessWrong / Lightcone Infrastructure 2023-02-05T02:52:06.632Z
AXRP Episode 19 - Mechanistic Interpretability with Neel Nanda 2023-02-04T03:00:11.144Z
First Three Episodes of The Filan Cabinet 2023-01-18T19:20:06.588Z
Podcast with Divia Eden on operant conditioning 2023-01-15T02:44:29.706Z
On Blogging and Podcasting 2023-01-09T00:40:00.908Z

Comments

Comment by DanielFilan on Alternatives to Masks for Infectious Aerosols · 2024-12-08T17:52:40.594Z · LW · GW

Would being in a room with people who are vaping have the same benefits as the fog machine? Obviously it has downsides of smell and other additives, but still - I think this should predict that people maybe don't get airborne illnesses at vaping conventions.

Comment by DanielFilan on Detection of Asymptomatically Spreading Pathogens · 2024-12-05T21:51:12.979Z · LW · GW

Typo:

We sequencing a typical sample to between one and two billion reads.

Should maybe be "We will be sequencing..."?

Comment by DanielFilan on The 2023 LessWrong Review: The Basic Ask · 2024-12-05T00:42:20.070Z · LW · GW

Wait I'm a moron and the thing I checked was actually whether it was an exponential function, sorry.

Comment by DanielFilan on The 2023 LessWrong Review: The Basic Ask · 2024-12-05T00:35:00.177Z · LW · GW

Votes cost quadratic points – a vote strength of "1" costs 1 point. A vote of strength 4 costs 10 points. A vote of strength 9 costs 45.

FYI this is not a quadratic function.
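
For reference, the three quoted costs match the triangular numbers n(n+1)/2, which is a quadratic polynomial in n. A minimal check, assuming that formula is the intended cost rule (the function name `vote_cost` is made up for illustration):

```python
def vote_cost(strength: int) -> int:
    """Assumed cost rule: 1 + 2 + ... + strength = strength * (strength + 1) / 2."""
    return strength * (strength + 1) // 2


# Costs quoted in the post: strength -> points.
quoted = {1: 1, 4: 10, 9: 45}
matches = all(vote_cost(s) == c for s, c in quoted.items())

# A sequence is quadratic exactly when its second differences are constant.
costs = [vote_cost(n) for n in range(1, 8)]
first_diffs = [b - a for a, b in zip(costs, costs[1:])]
second_diffs = [b - a for a, b in zip(first_diffs, first_diffs[1:])]

print(matches, second_diffs)
```

An exponential function would instead have a constant ratio between successive values, which these costs do not.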

Comment by DanielFilan on 2024 Unofficial LessWrong Census/Survey · 2024-12-03T01:51:52.169Z · LW · GW

Dojo Organizations What organizations are you aware of that are providing some kind of rationality dojo format (courses focused on improving the skill of rationality)?

Seems like the stuff after "Dojo Organizations" should be on a new line.

Comment by DanielFilan on 2024 Unofficial LessWrong Census/Survey · 2024-12-03T01:49:34.390Z · LW · GW

About how often do you use LLMs like ChatGPT while active?

What does "while active" mean in this question?

Comment by DanielFilan on The Big Nonprofits Post · 2024-11-29T18:04:26.700Z · LW · GW

If one wants to investigate [the Alignment of Complex Systems research group] further, he has an AXRP podcast episode, which I haven’t listened to.

Note that if you want to investigate further but would rather read a transcript than watch a video, AXRP has you covered.

Comment by DanielFilan on Seven lessons I didn't learn from election day · 2024-11-17T22:19:58.763Z · LW · GW

Yeah but a bunch of people might actually answer how their neighbours will vote, given that that's what the pollster asked - and if the question is phrased as the post assumes, that's going to be a massive issue.

Comment by DanielFilan on Seven lessons I didn't learn from election day · 2024-11-14T19:11:28.078Z · LW · GW

So I guess 1.5% of Americans have worse judgment than I expected (by my lights, as someone who thinks that Trump is really bad). Those 1.5% were incredibly important for the outcome of the election and for the future of the country, but they are only 1.5% of the population.

Nitpick: they are 1.5% of the voting population, making them around 0.7% of the US population.
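
A rough version of that arithmetic, as a sketch only: the round figures below (about 155 million votes cast and about 340 million US residents) are my assumptions, not numbers from the post.

```python
# Assumed round figures, not taken from the post.
votes_cast = 155e6        # approximate 2024 presidential votes
us_population = 340e6     # approximate US resident population

share_of_voters = 0.015   # the 1.5% from the quoted passage

people = share_of_voters * votes_cast
share_of_population = people / us_population

print(f"{share_of_population:.1%}")
```

With any turnout figure a bit under half the population, the result lands near 0.7%.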

Comment by DanielFilan on Seven lessons I didn't learn from election day · 2024-11-14T19:09:02.721Z · LW · GW

If you ask people who they're voting for, 50% will say they're voting for Harris. But if you ask them who most of their neighbors are voting for, only 25% will say Harris and 75% will say Trump!

Note this issue could be fixed if you instead ask people who the neighbour immediately to the right of their house/apartment will vote for, which I think is compatible with what we know about this poll. That said, the critique of "do people actually know" stands.

Comment by DanielFilan on Seven lessons I didn't learn from election day · 2024-11-14T19:02:40.265Z · LW · GW

she should have picked Josh Shapiro as her running mate

Note that this news story makes allegations that, if true, make it sound like the decision was partly Shapiro's:

Following Harris's interview with Pennsylvania Governor Josh Shapiro, there was a sense among Shapiro's team that the meeting did not go as well as it could have, sources familiar with the matter tell ABC News.

Later Sunday, after the interview, Shapiro placed a phone call to Harris' team, indicating he had reservations about leaving his job as governor, sources said.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-11-14T18:04:05.774Z · LW · GW

Oh except: I did not necessarily mean to claim that any of the things I mentioned were missing from the alignment research scene, or that they were present.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-11-14T17:40:44.124Z · LW · GW

When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:

basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-11-14T06:56:53.422Z · LW · GW

Yeah, that seems right to me.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-11-14T04:07:04.103Z · LW · GW

A theory of how alignment research should work

(cross-posted from danielfilan.com)

Epistemic status:

  • I listened to the Dwarkesh episode with Gwern and started attempting to think about life, the universe, and everything
  • less than an hour of thought has gone into this post
  • that said, it comes from a background of me thinking for a while about how the field of AI alignment should relate to agent foundations research

Maybe obvious to everyone but me, or totally wrong (this doesn't really grapple with the challenges of working in a domain where an intelligent being might be working against you), but:

  • we currently don't know how to make super-smart computers that do our will
    • this is not just a problem of having a design that is not feasible to implement: we do not even have a sense of what the design would be
    • I'm trying to somewhat abstract over intent alignment vs control approaches, but am mostly thinking about intent alignment
    • I have not thought very much about societal/systemic risks, and this post doesn't really address them.
  • ideally we would figure out how to do this
  • the closest traction that we have: deep learning seems to work well in practice, altho our theoretical knowledge of why it works so well or how capabilities are implemented is lagging
  • how should we proceed? Well:
    • thinking about theory alone has not been practical
    • probably we need to look at things that exhibit alignment-related phenomena and understand them, and that will help us develop the requisite theory
      • said things are probably neural networks
    • there are two ways we can look at neural networks: their behaviour, and their implementation.
    • looking at behaviour is conceptually straightforward, and valuable, and being done
    • looking at their implementation is less obvious
    • what we need is tooling that lets us see relevant things about how neural networks are working
    • such tools (e.g. SAEs) are not impossible to create, but it is not obvious that their outputs tell us quantities that are actually of interest
    • in order to discipline the creation of such tools, we should demand that they help us understand models in ways that matter
    • once we get such tools, we should be trying to use them to understand alignment-relevant phenomena, to build up our theory of what we want out of alignment and how it might be implemented
      • this is also a thing that looking at the external behaviour of models in alignment-relevant contexts should be doing
  • so should we be just doing totally empirical things? No.
    • firstly, we need to be disciplined along the way by making sure that we are looking at settings that are in fact relevant to the alignment problem, when we do our behavioural analysis and benchmark our interpretability tools. This involves having a model of what situations are in fact alignment-relevant, what problems we will face as models get smarter, etc
    • secondly, once we have the building blocks for theory, ideally we will put them together and make some actual theorems like "in such-and-such situations models will never become deceptive" (where 'deceptive' has been satisfactorily operationalized in a way that suffices to derive good outcomes from no deception and relatively benign humans)
  • I'm imagining the above as being analogous to an imagined history of statistical mechanics (people who know this history or who have read "inventing temperature" should let me know if I'm totally wrong about it):
    • first we have steam engines etc
    • then we figure out that 'temperature' and 'entropy' are relevant things to track for making the engines run
    • then we relate temperature, entropy, and pressure
    • then we get a good theory of thermodynamics
    • then we develop statistical mechanics
  • exceptions to "theory without empiricism doesn't work":
  • lesson of above: theory does seem to help us analyze some issues and raise possibilities

Comment by DanielFilan on Some Rules for an Algebra of Bayes Nets · 2024-11-08T00:12:54.610Z · LW · GW

A way I'd phrase John's sibling comment, at least for the exact case: adding arrows to a DAG increases the set of probability distributions it can represent. This is because the fundamental rule of a Bayes net is that d-separation has to imply conditional independence - but you can have conditional independences in a distribution that aren't represented by a network. When you add arrows, you can remove instances of d-separation, but you can't add any (because nodes are d-separated when all paths between them satisfy some property, and (a) adding arrows can only increase the number of paths you have to worry about and (b) if you look at the definition of d-separation the relevant properties for paths get harder to satisfy when you have more arrows). Therefore, the more arrows a graph G has, the fewer constraints distribution P has to satisfy for P to be represented by G.
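
The smallest case can be sketched in code, assuming two binary variables A and B. With no arrow, A and B are d-separated, so the distribution has to satisfy P(a, b) = P(a)P(b). With the arrow A -> B there is no d-separation, so any joint distribution is representable. The helper names here are hypothetical, made up for illustration.

```python
from itertools import product

VALUES = (0, 1)


def marginals(joint):
    """Return the marginal distributions of A and B from a joint table."""
    p_a = {a: sum(joint[(a, b)] for b in VALUES) for a in VALUES}
    p_b = {b: sum(joint[(a, b)] for a in VALUES) for b in VALUES}
    return p_a, p_b


def representable(joint, has_arrow, tol=1e-9):
    """Can a two-node graph (with or without the arrow A -> B) represent joint?"""
    if has_arrow:
        # No d-separation, hence no constraint: P(a, b) = P(a) P(b | a) always holds.
        return True
    # No arrow: A and B are d-separated, so independence is required.
    p_a, p_b = marginals(joint)
    return all(
        abs(joint[(a, b)] - p_a[a] * p_b[b]) <= tol
        for a, b in product(VALUES, repeat=2)
    )


independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
correlated = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

for name, joint in [("independent", independent), ("correlated", correlated)]:
    print(name, representable(joint, has_arrow=False), representable(joint, has_arrow=True))
```

The independent distribution is representable by both graphs, while the correlated one is only representable once the arrow is added: more arrows, fewer constraints.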

Comment by DanielFilan on JargonBot Beta Test · 2024-11-02T20:01:22.299Z · LW · GW

I enjoyed reading Nicholas Carlini and Jeff Kaufman write about how they use them, if you're looking for inspiration.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-11-02T03:16:46.940Z · LW · GW

Another way of maintaining Sola Scriptura and Perspicuity in the face of Protestant disagreement about essential doctrines is the possibility that all of this is cleared up in the deuterocanonical books that Catholics believe are scripture but Protestants do not. That said, this will still rule out Protestantism, and it's not clear that the deuterocanon in fact clears everything up.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-11-02T03:12:14.387Z · LW · GW

A failure of an argument against sola scriptura (cross-posted from Superstimulus)

Recently, Catholic apologist Joe Heschmeyer has produced a couple of videos arguing against the Protestant view of the Bible - specifically, the claims of Sola Scriptura and Perspicuity (capitalized because I'll want to refer to them as premises later). "Sola Scriptura" has been operationalized a few different ways, but one way that most Protestants would agree on is (taken from the Westminster confession):

The whole counsel of God, concerning all things necessary for [...] man’s salvation [...] is either expressly set down in Scripture, or by good and necessary consequence may be deduced from Scripture

"Perspicuity" means clarity, and is propounded in the Westminster confession like this:

[T]hose things which are necessary to be known, believed, and observed, for salvation, are so clearly propounded and opened in some place of Scripture or other, that not only the learned, but the unlearned, in a due use of the ordinary means, may attain unto a sufficient understanding of them.

So, in other words, Protestants think that everything you need to know to be saved is in the Bible, and is expressed so obviously that anyone who reads it and thinks about it in a reasonable way will understand it.

I take Heschmeyer's argument to be that if Sola Scriptura and Perspicuity were true, then all reasonable people who have read the Bible and believe it would agree on which doctrines were necessary for salvation - in other words, you wouldn't have a situation where one person thinks P and P is necessary for salvation, while another thinks not-P, or a third thinks that P is not necessary for salvation. But in fact this situation happens a lot, even among seemingly sincere followers of the Bible who believe in Sola Scriptura and Perspicuity. Therefore Sola Scriptura and Perspicuity are false. (For the rest of this post, I'll write Nec(P) for the claim "P is necessary for salvation" to save space.)

I think this argument doesn't quite work. Here's why:

It can be the case that the Bible clearly explains everything that you need to believe, but it doesn't clearly explain which things you need to believe. In other words, Sola Scriptura and Perspicuity say that for all P such that Nec(P), the Bible teaches P clearly - but they don't say that for such P, the Bible teaches P clearly, and also clearly teaches Nec(P). Nor do they say that the only things that are taught clearly in the Bible are things you need to believe (otherwise you could figure out which doctrines you had to believe by just noticing what things the Bible clearly teaches).

For example, suppose that the Bible clearly teaches that Jesus died for at least some people, and that followers of Jesus should get baptized, and in fact, the only thing you need to believe to be saved is that Jesus died for at least some people. In that world, people of good faith could disagree about whether you need to believe that Jesus died for at least some people, and this would be totally consistent with Sola Scriptura and Perspicuity.

Furthermore, suppose that it's not clear to people of good faith whether or not something is clear to people of good faith. Perhaps something could seem clear to you but not be clear to others of good faith, or also something could be clear but others could fail to understand it because they're not actually of good faith (you need this part otherwise you can tell if something's clear by noticing if anyone disagrees with you). Then, you can have one person who believes P and Nec(P), and another who believes not-P and Nec(not-P), and that be consistent with Sola Scriptura and Perspicuity.

For example, take the example above, and suppose that some people read the Bible as clearly saying that Jesus died for everyone (aka Unlimited Atonement), and others read the Bible as clearly saying that Jesus only died for his followers (aka Limited Atonement). You could have that disagreement, and if the two groups think the others are being disingenuous, they could both think that you have to agree with them to be saved, while still having Sola Scriptura and Perspicuity being true.

That said, Heschmeyer's argument is still going to limit the kinds of Protestantism you can adopt. In the above example, if we suppose that you can tell that neither group is in fact being disingenuous, then his argument rules out the combination of Sola Scriptura, Perspicuity, and Nec(Limited Atonement) (as well as Sola Scriptura, Perspicuity, and Nec(Unlimited Atonement)). In this way, applied to the real world, it's going to rule out versions of Protestantism that claim that you have to believe a bunch of things that sincere Christians who are knowledgeable about the Bible don't agree on. That said, it won't rule out Protestantisms that are liberal about what you can believe while being saved.

Comment by DanielFilan on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-02T00:45:57.705Z · LW · GW

Oh I misread it as "eighty percent of the effort" oops.

Comment by DanielFilan on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T22:11:16.302Z · LW · GW

You say "higher numbers for polyamorous relationships" which is contrary to "If you're polyamorous, but happen to have one partner, you would also put 1 for this question."

Comment by DanielFilan on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T21:43:42.918Z · LW · GW

If you've been waiting for an excuse to be done, this is probably the point where twenty percent of the effort has gotten eighty percent of the effect.

Should be "eighty percent of the benefit" or similar.

Comment by DanielFilan on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T21:42:02.243Z · LW · GW

I'd be interested in a Q about whether people voted in the last national election for their country (maybe with an option for "my country does not hold national elections") and if so how they voted (if you can find a schema that works for most countries, which I guess is hard).

Comment by DanielFilan on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T21:40:48.159Z · LW · GW

In the highest degree question, one option is "Ph D.". This should be "PhD", no spaces, no periods.

Comment by DanielFilan on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T21:39:52.935Z · LW · GW

Are you planning on having more children? Answer yes if you don't have children but want some, or if you do have children but want more.

Whether I want to have children and whether I plan to have children are different questions. There are lots of things I want but don't have plans to get, and one sometimes finds oneself with plans to achieve things that one doesn't actually want.

Comment by DanielFilan on Habryka's Shortform Feed · 2024-11-01T21:28:22.293Z · LW · GW

Sure, I'm just surprised it could work without me having Calibri installed.

Comment by DanielFilan on JargonBot Beta Test · 2024-11-01T20:28:58.693Z · LW · GW

Could be a thing where people can opt into getting the vibes or the vibes and the definitions.

Comment by DanielFilan on JargonBot Beta Test · 2024-11-01T20:22:12.087Z · LW · GW

Also, my feedback is that some of the definitions seem kind of vague. Like, apparently an ultracontribution is "a mathematical object representing uncertainty over probability" - this tells me what it's supposed to be, but doesn't actually tell me what it is. The ones that actually show up in the text don't seem too vague, partially because they're not terms that are super precise.

Comment by DanielFilan on JargonBot Beta Test · 2024-11-01T20:19:20.262Z · LW · GW

How are you currently determining which words to highlight? You say "terms that readers might not know" but this varies a lot based on the reader (as you mention in the long-term vision section).

Comment by DanielFilan on JargonBot Beta Test · 2024-11-01T20:14:57.435Z · LW · GW

FWIW I think it's not uncommon for people to not use LLMs daily (e.g. I don't).

Comment by DanielFilan on JargonBot Beta Test · 2024-11-01T20:13:39.770Z · LW · GW

FWIW I think the actual person with responsibility is the author if the author approves it, and you if the author doesn't.

Comment by DanielFilan on Habryka's Shortform Feed · 2024-11-01T20:07:34.400Z · LW · GW

I believe I'm seeing Gill Sans? But when I google "Calibri" I see text that looks like it's in Calibri, so that's confusing.

Comment by DanielFilan on New report: Safety Cases for AI · 2024-10-31T21:52:19.728Z · LW · GW

Since people have reported not being able to see the tweet thread, I will reproduce it in this comment (with pictures replaced by my descriptions of them):

If developers had to prove to regulators that powerful AI systems are safe to deploy, what are the best arguments they could use?

Our new report tackles the (very big!) question of how to make a ‘safety case’ for AI.

[image of the start of the paper]

We define a safety case as a rationale developers provide to regulators to show that their AI systems are unlikely to cause a catastrophe.

The term ‘safety case’ is not new. In many industries (e.g. aviation), products are ‘put on trial’ before they are released.

[cartoon of a trial: regulator is a judge, one side is a developer advancing a safety case, other side is a red team advancing a risk case]

We simplify the process of making a safety case by breaking it into six steps.

  1. Specify the macrosystem (all AI systems) and the deployment setting.
  2. Concretize 'AI systems cause a catastrophe' into specific unacceptable outcomes (e.g. the AI systems build a bioweapon)
  3. Justify claims about the deployment setting.
  4. Carve up the collection of AI systems into smaller groups (subsystems) that can be analyzed in isolation.
  5. Assess risk from subsystems acting unilaterally.
  6. Assess risk from subsystems cooperating together.

[The above, but in picture format]

We first focus on step 5: how would developers argue that individual subsystems are safe? These arguments can be grouped into four categories.

Arguments could assert AI systems are

  • too weak to be dangerous even if open-sourced (inability)
  • unable to cause disaster due to restrictions (control)
  • safe despite ability to cause harm (trustworthiness)
  • judged to be safe by credible AI advisors (deference)

[The above, but in picture format and slightly reworded]

Consider the analogy of a robber and a diamond. How might one justify that the robber won’t steal the diamond?

[Image with heading: "Why won't the robber steal the diamond?".

Row 1 says "Inability The robber is an infant. Analogy: e.g. GPT-3 is clearly too weak to be dangerous", next to an image of a baby near a diamond on a pedestal.

Row 2 says "Control The robber is in jail. Analogy: e.g. AI systems are sandboxed and can't escape.", next to an image of a prisoner in a box with bars, near a diamond on a pedestal.

Row 3 says "Trustworthiness The robber does not steal. Analogy: e.g. AI systems follow developer instructions.", next to an image of a person sitting down cross-legged and wearing an orange robe, next to a diamond on a pedestal.

Row 4 says "Deference The security service CEO says the diamond is secure. Analogy: AI advisors say that AI systems are safe.", next to an image of a figure with grey hair saying "All good" next to a diamond on a pedestal with lines coming out of it, connecting it to small grey boxes (sensors?).]

The ordering of these categories is intentional. As AI systems become more powerful, developers will likely rely mostly on inability, then control, then trustworthiness, and finally, deference to AI advisors.

[Image of graph where the horizontal axis is "Increasingly powerful AI" and the vertical axis is "Primary safety argument". Inability, Control, Trustworthiness, and Deference are shown in order from bottom-left to top-right. An arrow connects the words "We are here" to Inability.]

Next, we give examples of arguments in each category. Arguments are ranked on three axes:

  1. Practicality
  2. Strength
  3. Scalability

No argument received full marks! Research will be needed to justify the safety of advanced AI systems.

[A complicated diagram showing a variety of arguments under the Inability, Control, Trustworthiness, and Deference categories, together with ratings for their Practicality, Maximum Strength, and Scalability.]

The arguments in the previous step pertain to small groups of AI systems. It would be difficult to directly apply them to large groups. We also explain how to justify that the actions of many AI systems won’t cause a catastrophe (step 6 in our framework).

[Image titled "Large-scale AI misbehavior". Below are 3 rows, with 2 columns. The left column is labelled "Causes" and the right is labelled "Strategies".

Row 1: Cause: Widespread alignment faking. Strategy: Blitzkrieg: overwhelm controls

Row 2: Cause: Infectious jailbreaks. Strategy: Strike: disable infrastructure.

Row 3: Cause: Rapid memetic value drift. Strategy: Hivemind: combine intelligence.

Dots are shown below, likely to indicate that there are more causes and strategies not shown.]

We are hoping this report will:

  1. Motivate research that further clarifies the assumptions behind safety arguments.
  2. Inform the design of hard safety standards.

More in the paper: https://bit.ly/3IJ5N95 Many thanks to my coauthors! @NickGabs01, @DavidSKrueger, and @thlarsen.

Might be of interest to @bshlgrs, @RogerGrosse, @DavidDuvenaud, @EvanHub, @aleks_madry, @ancadianadragan, @rohinmshah, @jackclarkSF, @Manderljung, @RichardMCNgo

Comment by DanielFilan on Habryka's Shortform Feed · 2024-10-31T21:35:47.550Z · LW · GW

Update: I have already gotten over it.

Comment by DanielFilan on Habryka's Shortform Feed · 2024-10-30T00:02:22.326Z · LW · GW

It looks kinda small to me, someone who uses Firefox on Ubuntu.

Comment by DanielFilan on MATS AI Safety Strategy Curriculum v2 · 2024-10-15T21:07:01.872Z · LW · GW

A thing you are maybe missing is that the discussion groups are now in the past.

You should be sure to point out that many of the readings are dumb and wrong

The hope is that the scholars notice this on their own.

Week 3 title should maybe say “How could we safely train AIs…”? I think there are other training options if you don’t care about safety.

Lol nice catch.

Comment by DanielFilan on MATS AI Safety Strategy Curriculum v2 · 2024-10-15T21:04:27.485Z · LW · GW

We included a summary of Situational Awareness as an optional reading! I guess I thought the full thing was a bit too long to ask people to read. Thanks for the other recs!

Comment by DanielFilan on Research update: Towards a Law of Iterated Expectations for Heuristic Estimators · 2024-10-07T22:36:07.474Z · LW · GW

to simplify, we ask that for every expression and set of arguments

Here and in the next dot point, should the inner heuristic estimate be conditioning on a larger set of arguments (perhaps chosen by an unknown method)? Otherwise it seems like you're just expressing some sort of self-knowledge.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-07T16:59:30.529Z · LW · GW

OP doesn't emphasize liability insurance enough but part of the hope is that you can mandate that companies be insured up to $X00 billion, which costs them less than $X00 billion assuming that they're not likely to be held liable for that much. Then the hope is the insurance company can say "please don't do extremely risky stuff or your premium goes up".

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-05T23:11:25.838Z · LW · GW

On the other hand, there's not a clear criteria for when we would pause again after, say, a six month pause in scaling.

Realized that I didn't respond to this - PauseAI's proposal is for a pause until safety can be guaranteed, rather than just for 6 months.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-05T23:10:03.408Z · LW · GW

I believe AI pauses by governments would absolutely be more serious and longer, preventing overhangs from building up too much.

Are you saying that overhangs wouldn't build up too much under pauses because the government wouldn't let it happen, or that RSPs would have less overhang because they'd pause for less long so less overhang would build up? I can't quite tell.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-05T23:05:06.431Z · LW · GW

I'm not saying there's no reason to think that RSPs are better or worse than pause, just that if overhang is a relevant consideration for pause, it's also a relevant consideration for RSPs.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-05T23:04:18.643Z · LW · GW

I'd imagine that RSP proponents think that if we execute them properly, we will simply not build dangerous models beyond our control, period.

I think pause proponents think similarly!

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-05T23:03:21.283Z · LW · GW

Do you see the same people invoking overhang as an argument against pauses and also talking about RSPs as though they are not also impacted?

I guess I'm not tracking this closely enough. I'm not really that focussed on any one arguer's individual priorities, but more on the discourse in general. Basically, I think that overhang is a consideration for unconditional pauses if and only if it's a consideration for RSPs, so it's a bad thing if overhang is brought up as an argument against unconditional pauses and not against RSPs, because this will distort the world's ability to figure out the costs and benefits of each kind of policy.

Also, to be clear, it's not impossible that RSPs are all things considered better than unconditional pauses, and better than nothing, despite overhang. But if so, I'd hope someone somewhere would have written a piece saying "RSPs have the cost of causing overhang, but on net are worth it".

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-04T18:01:24.955Z · LW · GW

I'm not saying that RSPs are or aren't better than a pause. But I would think that if overhang is a relevant consideration for pauses, it's also a relevant consideration for RSPs.

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-03T20:49:52.571Z · LW · GW

A complaint about AI pause: if we pause AI and then unpause, progress will then be really quick, because there's a backlog of improvements in compute and algorithmic efficiency that can be immediately applied.

One definition of what an RSP is: if a lab makes observation O, then they pause scaling until they implement protection P.

Doesn't this sort of RSP have the same problem with fast progress after pausing? Why have I never heard anyone make this complaint about RSPs? Possibilities:

  • They do and I just haven't seen it
  • People expect "AI pause" to produce longer / more serious pauses than RSPs (but this seems incidental to the core structure of RSPs)

Comment by DanielFilan on DanielFilan's Shortform Feed · 2024-10-03T20:46:59.069Z · LW · GW

I continue to think that agent foundations research is kind of underrated. Like, we're supposed to do mechinterp to understand the algorithm models implement - but how do we know what algorithms are good?

Comment by DanielFilan on 2024 Petrov Day Retrospective · 2024-10-01T22:37:40.964Z · LW · GW

Ah, thanks for clarifying that.

Comment by DanielFilan on 2024 Petrov Day Retrospective · 2024-09-29T08:21:51.729Z · LW · GW

I'm kind of confused which unilateralist got to design the game. You say:

The first person to click the Unilateral Virtue Link was a proponent of "Avoiding actions that noticeably increase the chance that the world will end." But, this virtue was actually in the majority. The first unilateralist of a Virtue Minority was a proponent of "Accurately reporting your epistemic state."

A year later, as we decided what to do for Petrov Day, we decided to lure the first unilateralist into a surprise meeting, where I then said "Here's a reminder of what happened in Petrov Day last year. You now have one hour to design this year's Petrov game. Go."

So it sounds like the unilateralist who wanted to avoid actions that noticeably increase the chance the world will end got picked. But then it sounds like the winner made a game that was supposed to be about accurately reporting epistemic state:

The designer had (I think?) initially noticed the "focus on accurately reporting epistemic state" aspect, but said that during the stressful hour of designing the game had eventually forgotten that. The version they handed off wasn't particularly optimized for that, but the framework of a social deception game seemed to me to be a good substrate for "accurate epistemic reporting." [...] It seemed important to me that Petrov's payoff specifically be about reporting his beliefs

Comment by DanielFilan on Defining alignment research · 2024-08-29T19:04:26.315Z · LW · GW