Posts

I get pretty intense visceral outrage at overreaches in immigration enforcement; it just seems the height of depravity. I've looked for a lot of different routes to mental coolness over the last decade (since Trump started his speeches), and they mostly amount to staying busy and distracted. It just seems like a really cost-ineffective kind of activism to get involved in. Bankrolling lawyers for random people isn't really in my action space, and if it were I'd have opportunity cost to consider.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2025-03-19T18:02:24.890Z · LW · GW

seems like there's more prior literature than I thought https://en.wikipedia.org/wiki/Role-based_access_control

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2025-03-19T17:49:03.695Z · LW · GW

are SOTA configuration languages sufficient for AI proliferation?

My main aim is to work on "hardening the box", i.e. eliminating software bugs so containment schemes don't fail for preventable reasons. But in the famous 4o system card example, the one that looks a little like docker exfiltration, the situation arose from user error; my wild guess is that it was in compose.yaml or the shell script invoking docker run.

In a linux machine

Here's an example nix file

users.users =
    let
      authorized-key-files = [
        "${keyspath}/id_server_ed25519.pub"
        "${keyspath}/id_qd_ed25519.pub"
      ];
    in
    {
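      unpermissioneduser = {
        # NixOS requires exactly one of isNormalUser / isSystemUser to be set.
        isSystemUser = true;
        # System users need an explicit primary group; this assumes
        # users.groups.unpermissioneduser = { }; is declared elsewhere.
        group = "unpermissioneduser";
        extraGroups = [ "docker" ];
        description = "AgentID=claude-0x0000";
      };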
      coreuser = {
        isNormalUser = true;
        extraGroups = [
          "wheel"
          "networkmanager"
          "docker"
          "video"
        ];
        home = "/home/coreuser";
        description = "Core User (delegator of unpermissioneduser)";
        shell = pkgs.fish;
        openssh.authorizedKeys.keyFiles = authorized-key-files;
      };
      root = {
        openssh.authorizedKeys.keyFiles = authorized-key-files;
        shell = pkgs.fish;
      };
    };

You can see that unpermissioneduser has fewer abilities than coreuser. So you can imagine I just declare that unpermissioneduser is an agent and coreuser is the human delegator.

Nix is simply a fully declarative way to do standard linux permissioning (a feature not in the snippet is allocating chmod/chown information for particular users to particular parts of the filesystem). There are no conceptual leaps from the status quo.

agents and delegation

is linux all that great for when you want to keep track of who's a delegatee and who's a delegator? do we need a more graph flavored version of linux userspace/permissions? I'm talking about once we're reasoning about proliferating agents and their permissions on various machines. Linux groups do not support inheritance, but a user can be a member of many groups. So you could in principle MVP a graph based permissions DSL (perhaps in Nix) on top of the existing Linux user/group ontology, 80% confident, but it could be hairier than making a new ontology. idk.
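To make the "graph on top of users/groups" idea concrete, here is a minimal sketch in Python (not Nix). Everything in it is hypothetical: `Principal`, `delegate`, and `delegation_chain` are illustrative names, and the only rule it encodes is the assumption that a delegatee's groups must be a subset of its delegator's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """A Linux-style user: a name plus a flat set of group memberships."""
    name: str
    groups: frozenset
    delegator: "Principal | None" = None  # graph edge back to whoever delegated


def delegate(parent: Principal, name: str, groups: set) -> Principal:
    """Create a delegatee; assumed rule: its groups are a subset of the parent's."""
    extra = set(groups) - parent.groups
    if extra:
        raise PermissionError(f"{name} would exceed {parent.name}: {sorted(extra)}")
    return Principal(name=name, groups=frozenset(groups), delegator=parent)


def delegation_chain(p: Principal) -> list:
    """Walk the delegation edges back to the root delegator."""
    chain = []
    node = p
    while node is not None:
        chain.append(node.name)
        node = node.delegator
    return chain


# Mirrors the users in the Nix snippet, plus one hypothetical sub-agent.
coreuser = Principal("coreuser", frozenset({"wheel", "networkmanager", "docker", "video"}))
agent = delegate(coreuser, "unpermissioneduser", {"docker"})
subagent = delegate(agent, "subagent-0", {"docker"})
print(delegation_chain(subagent))
```

The flat group sets are what Linux already gives you; the `delegator` edge is the extra piece of ontology, and it is the part that would have to be tracked somewhere outside `/etc/group`.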

Comment by Quinn (quinn-dougherty) on AI Tools for Existential Security · 2025-03-17T17:28:36.778Z · LW · GW

Examples of promising risk-targeted applications

This section reeks of the guaranteed safe AI agendas, a lot of agreement. For example, using formal methods to harden any box we try to put the AI in is a kind of defensive acceleration that doesn't work (too expensive) until certain pre-ASI stages of development. I'm working on formal verification agents along these lines right now.

Comment by Quinn (quinn-dougherty) on Plausibly Factoring Conjectures · 2025-03-01T20:34:44.339Z · LW · GW

@Tyra Burgess and I wrote down a royalty-aware payout function yesterday:

For a type $B$, let $L(B)$ be the "left closure under implication" or the admissible antecedents. I.e., the set of all the antecedents $A$ in the public ledger such that $A \to B$. $p : \mathrm{Type} \to \mathrm{Money}$ is the price that a proposition was listed for (admitting summing over duplicates). Suppose players $1, \ldots, k$ have previously proven $B_{1}, \ldots, B_{k}$ and $L(A)$ is none other than the set of all $B_{i}$ for $i$ from $1$ to $k$.

We would like to fix an $\epsilon$ (could be fairly big, like $\frac{1}{5}$) and say that the royalty-aware payout given $\epsilon$ of $A$ upon an introduction of $\alpha : A$ to the database is $p(A) \times (1 - \epsilon)$, and that, where $k = |L(A)|$, $\frac{p(A) \times \epsilon}{k}$ is paid out to each player $i \in \{1, \ldots, k\}$.

This seems vaguely like it has some desirable properties, like the decay of a royalty with length in implications separating it from the currently outpaying type. You might even be able to reconcile it with cartesian-closedness / currying, where $A \times B \to C$ behaves equivalently to $A \to B \to C$ under the payout function.

I think to be more theoretically classy, royalties would arise from recursive structure, but it may work well enough without recursion. It'd be fun to advance all the way to coherence and incentive-compatible proofs, but I certainly don't see myself doing that.
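The non-recursive payout rule above is small enough to write down directly. This is only a sketch: the function name, the use of `Fraction`, and the handling of the $k = 0$ case (pay the full price when there are no admissible antecedents) are assumptions, not something we settled.

```python
from fractions import Fraction


def royalty_payout(price, antecedent_provers, epsilon=Fraction(1, 5)):
    """Split p(A) between the new prover and the provers of L(A).

    price: p(A), the listed price of proposition A.
    antecedent_provers: distinct players who proved B_1..B_k, i.e. L(A).
    epsilon: the royalty fraction.
    Returns (payout to the prover of A, {player: royalty}).
    """
    price = Fraction(price)
    k = len(antecedent_provers)
    if k == 0:
        # Assumption: with no admissible antecedents nothing is withheld.
        return price, {}
    share = price * epsilon / k
    return price * (1 - epsilon), {player: share for player in antecedent_provers}


prover_payout, royalties = royalty_payout(100, ["player1", "player2"])
print(prover_payout, royalties)
```

With $p(A) = 100$, $\epsilon = \frac{1}{5}$ and $k = 2$, the prover keeps 80 and each antecedent prover gets 10, so the total paid out is still $p(A)$.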

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2025-02-16T21:18:51.509Z · LW · GW

I want a name for the following principle:

the world-spec gap hurts you more than the spec-component gap

I wrote it out much like this a couple years ago and Zac recently said the same thing.

I'd love to be able to just say "the <one to three syllables> principle", yaknow?

Comment by Quinn (quinn-dougherty) on davekasten's Shortform · 2025-02-16T00:17:42.370Z · LW · GW

I'm working on making sure we get high quality critical systems software out of early AGI. Hardened infrastructure buys us a lot in the slightly crazy story of "self-exfiltrated model attacks the power grid", but buys us even more in less crazy stories about all the software modules adjacent to AGI having vulnerabilities rapidly patched at crunchtime.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2025-02-16T00:06:03.033Z · LW · GW

`<standup_comedian>` What's the deal with evals `</standup_comedian>`

epistemic status: tell me I'm wrong.

Funders seem particularly enchanted with evals, which seem to be defined as "benchmark but probably for scaffolded systems and scoring that is harder than scoring most of what we call benchmarks".

I can conjure a theory of change. It's like, 1. if measurement is bad then we're working with vibes, so we'd like to make measurement good. 2. if measurement is good then we can demonstrate to audiences (especially policymakers) that warning shots are substantial signals and not base it on vibes. (question: what am I missing?)

This is an at least coherent reason why dangerous capability evals pay into governance strats in such a way that maybe philanthropic pressure is correct. It relies on cruxes that I don't share, like that a principled science of measurement would outperform vibes in a meme war in the first place, but it at least has a crux that works as a fulcrum.

Everything worth doing is at least a little dual use, I'm not attacking anybody. But it's a faustian game where, like benchmarks, evals pump up races cuz everyone loves it when number go up. The primal urge to see number go up infects every chart with an x and y axis. In other words, evals come with steep capabilities externalities because they spray the labs with more charts that number hasn't gone up on yet, daring and challenging the lab to step up their game. So the theory of change in which, in spite of this dynamic, an eval is differentially defensive just has to meet a really high standard.

A further problem: the theory of change where we can have really high quality / inarguable signals as warning shots instead of vibes as warning shots doesn't even apply to most of the evals I'm hearing about from the nonprofit and independent sector. I'm hearing about evals that make me go, "huh, I wonder what's differentially defensive about that?" and I don't get good answers. Moreover, an ancient wisdom says "never ask a philanthropist for something capitalism gives you for free". The case for an individual eval's unlikeliness to be created by default lab incentives needs to be especially strong, cuz when it isn't strong one is literally doing the lab's work for them.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2025-02-15T23:32:10.670Z · LW · GW

The more I learn about measurement, the less seriously I take it

I'm impressed with models that accomplish tasks in zero or one shot with minimal prompting skill. I'm not sure what galaxy brained scaffolds and galaxy brained prompts demonstrate. There's so much optimization in the measurement space.

I shipped a benchmark recently, but it's secretly a synthetic data play so regardless of how hard people try in order to score on it, we get synthetic data out of it which leads to finetune jobs which leads to domain specific models that can do such tasks hopefully with minimal prompting effort and no scaffolding.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2025-02-15T23:22:14.369Z · LW · GW

$PERSON at $LAB once showed me an internal document saying that there are bad benchmarks - dangerous capability benchmarks - that are used negatively, so unlike positive benchmarks where the model isn't shipped to prod if it performs under a certain amount, these benchmarks could block a model from going to prod that performs over a certain amount. I asked, "you create this benchmark like it's a bad thing, and it's a bad thing at your shop, but how do you know it won't be used in a sign-flipped way at another shop?" and he said "well we just call it EvilBench and no one will want to score high on EvilBench".

It sounded like a ridiculous answer, but is maybe actually true in the case of labs. It is extremely not true in the open weight case, obviously huggingface user Yolo4206969 would love to score high on EvilBench.

Comment by Quinn (quinn-dougherty) on In response to critiques of Guaranteed Safe AI · 2025-02-01T18:04:30.906Z · LW · GW

I'm surprised to hear you say that, since you write

Upfront, I want to clarify: I don’t believe or wish to claim that GSAI is a full or general panacea to AI risk.

I kinda think anything which is not a panacea is swiss cheese, that those are the only two options.

It's a matter of what sort of portfolio can lay down slices of swiss cheese at what rate and with what uncorrelation. And I think in this way GSAI is antifragile to next year's language models, which is why I can agree mostly with Zac's talk and still work on GSAI (I don't think he talks about my cruxes).

Specifically, I think the guarantees of each module and the guarantees of each pipe (connecting the modules) isolate/restrict the error to the world-model gap or the world-spec gap, and I think the engineering problems of getting those guarantees are straightforward / not conceptual problems. Furthermore, I think the conceptual problems with reducing the world-spec gap below some threshold presented by Safeguarded's TA1 are easier than the conceptual problems in alignment/safety/control.

Comment by Quinn (quinn-dougherty) on In response to critiques of Guaranteed Safe AI · 2025-01-31T05:09:55.684Z · LW · GW

I gave a lightning talk with my particular characterization, and included "swiss cheese" i.e. that gsai sources some layers of swiss cheese without trying to be one magic bullet. But if people agree with this, then really guaranteed-safe ai is a misnomer, cuz guarantee doesn't evoke swiss cheese at all

Comment by Quinn (quinn-dougherty) on Fertility Will Never Recover · 2025-01-30T07:07:26.316Z · LW · GW

For anecdata: I'd be really jazzed about 3 or 4, 5 might be a little crazy but somewhat open to that or more.

Ladies

Comment by Quinn (quinn-dougherty) on Benito's Shortform Feed · 2025-01-27T20:03:35.846Z · LW · GW

yeah last week was grim for a lot of people with r1's implications for proliferation and the stargate fanfare after inauguration. Had a palpable sensation of it pivoting from midgame to endgame, but I would doubt that sensation is reliable or calibrated.

Comment by Quinn (quinn-dougherty) on Tips and Code for Empirical Research Workflows · 2025-01-27T04:48:48.363Z · LW · GW

Tmux allows you to set up multiple panes in your terminal that keep running in the background. Therefore, if you disconnect from a remote machine, scripts running in tmux will not be killed. We tend to run experiments across many tmux panes (especially overnight).

Does no one use suffix & disown which sends a command to a background process that doesn't depend on the ssh process, or prefix nohup which does the same thing? You have to make sure any logging that goes to stdout goes to a log file instead (and in this way tmux or screen are better)

Your remark about uv: forgot to mention that it's effectively a poetry replacement, too

Like htop: btm and btop are a little newer, nicer to look at

Also for json: jq. cat file.json | jq for pretty printing json to terminal

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2025-01-23T22:26:59.787Z · LW · GW

Feels like a MATS-like program in India is a big opportunity. When I went to EAG in Singapore a while ago there were so many people underserved by the existing community building and mentorship organizations cuz of visa issues.

Comment by Quinn (quinn-dougherty) on Some lessons from the OpenAI-FrontierMath debacle · 2025-01-20T18:52:53.819Z · LW · GW

the story i roughly understand is that this was within Epoch's mandate in the first place because they wanted to forecast on benchmarks but didn't think existing benchmarks were compelling or good enough so had to take matters into their own hands. Is that roughly consensus, or true? Why is frontiermath a safety project? i haven't seen adequate discussion on this.

Comment by Quinn (quinn-dougherty) on Everywhere I Look, I See Kat Woods · 2025-01-16T06:37:28.966Z · LW · GW

Can't relate. Don't particularly care for her content (tho audibly laughed at a couple examples that you hated), but I have no aversion to it. I do have aversion to the way you appealed to datedness as if that matters. I generally can't relate to people who find cringiness in the way you describe significantly problematic, really.

People like authenticity, humility, and irony now, both in the content and in its presentation.

I could literally care less, omg--- but im unusually averse to irony. Authenticity is great, humility is great most of the time, why is irony even in the mix?

Tho I'm weakly with you that engagement farming leaves a bad taste in my mouth.

Comment by Quinn (quinn-dougherty) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2025-01-16T00:53:53.699Z · LW · GW

Update: new funding call from ARIA calls out the Safeguarded/Gatekeeper stack in a video game directly

Creating (largely) self-contained prototypes/minimal-viable-products of a Safeguarded AI workflow, similar to this example but pushing for incrementally more advanced environments (e.g. Atari games).

Comment by Quinn (quinn-dougherty) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-04T17:57:53.053Z · LW · GW

I tried a little myself too. Hope I'm not misremembering.

Comment by Quinn (quinn-dougherty) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-04T16:27:46.243Z · LW · GW

Very anecdotally, I've talked to some extremely smart people who I would guess are very good at making progress on hard problems, but just didn't think too hard about what solutions help.

A few of the dopest people I know, who I'd love to have on the team, fall roughly into the category of "engaged a little with lesswrong, grok the core pset better than most 'highly involved' people, but are working on something irrelevant and not even trying cuz they think it seems too hard". They have some thoughtful p(doom), but assume they're powerless.

Comment by Quinn (quinn-dougherty) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-04T16:24:23.949Z · LW · GW

Richard Ngo tweeted recently that it was a mistake to design the AGI safety fundamentals curriculum to be broadly accessible, that if he could do it over again there'd be punishing problem sets that alienate most people

Comment by Quinn (quinn-dougherty) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-04T16:14:59.828Z · LW · GW

The upvotes and agree votes on this comment updated my perception of the rough consensus about MATS and streetlighting. I previously would have expected fewer people to evaluate MATS that way

Comment by Quinn (quinn-dougherty) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-04T16:01:11.233Z · LW · GW

As someone who, isolated and unfunded, went on months-long excursions into the hard version of the pset multiple times and burned out each time, I felt extremely validated when you verbally told me a fragment of this post around a fire pit at ILIAD. The incentives section of this post is very grim, but very true. I know naive patches to the funding ecosystem would also be bad (easy for grifters, etc), but I feel very much like I and we were failed by funders. I could've been stronger etc, I could've been in Berkeley during my attempts instead of Philly, but "why not just be heroically poor and work between shifts at a waged job" is... idk man, maybe fine with the right infrastructure, but I didn't have that infrastructure. (Again, I don't know a good way to have fixed the funding ecosystem, so funders reading this shouldn't feel too attacked).

(Epistemic status: have given up on the hard pset, but i think I have a path to adding some layers of Swiss cheese)

Comment by Quinn (quinn-dougherty) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2025-01-03T22:23:00.518Z · LW · GW

(i'm guessing) super mario might refer to a simulation of the Safeguarded AI / Gatekeeper stack in a videogame. It looks like they're skipping videogames and going straight to cyberphysical systems (1, 2).

Comment by Quinn (quinn-dougherty) on Dress Up For Secular Solstice · 2024-12-21T01:09:16.345Z · LW · GW

Ok. I'll wear a solid black shirt instead of my bright blue shirt.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-12-16T00:55:42.351Z · LW · GW

talk to friends as a half measure

When it comes to your internal track record, it is often said that finding what you wrote at time t-k beats trying to remember what you thought at t-k. However, the activation energy to keep such a journal is kinda a hurdle (which is why products like https://fatebook.io are so good!).

I find that a nice midpoint between the full and correct internal track record practices (rigorous journaling) and completely winging it (leaving yourself open to mistakes and self delusion) is talking to friends, because I think my memory of conversations that are had out loud with other people is more detailed and honest than my memory of things I've thought / used to think, especially when it's a stressful and treacherous topic.^[1]

[1] I may be more socially attuned than average around here(?) so this may not work for people less socially attuned than me

Comment by Quinn (quinn-dougherty) on Alexander Gietelink Oldenziel's Shortform · 2024-12-07T05:20:23.615Z · LW · GW

I was at an ARIA meeting with a bunch of category theorists working on safeguarded AI and many of them didn't know what the work had to do with AI.

epistemic status: short version of post because I never got around to doing the proper effort post I wanted to make.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-12-03T00:13:48.623Z · LW · GW

A sketch I'm thinking of: asking people to consume information (a question, in this case) is asking them to do you a favor, so you should do your best to ease this burden. However, don't be paralyzed either: budget some leeway to be less than maximally considerate in this way when you really need to.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-12-03T00:11:59.202Z · LW · GW

what's the best essay on asking for advice?

Going over etiquette and the social contract, perhaps if it's software specific it talks about minimal reproducers, whatever else the author thinks is involved.

Comment by Quinn (quinn-dougherty) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-12-02T23:14:27.125Z · LW · GW

Rumors are that 2025 lighthaven is jam packed. If this is the case, and you need money, rudimentary economics suggests only the obvious: raise prices. I know many clients are mission aligned, and there's a reasonable ideological reason to run the joint at or below cost, but I think it's aligned with that spirit if profits from the campus fund the website.

I also want to say in print what I said in person a year ago: you can ask me to do chores on campus to save money, it'd be within my hufflepuff budget. There are good reasons to not go totally "by and for the community" DIY like, say, many community libraries or soup kitchens, but nudging a little in that direction seems right.

EDIT: I did a mostly symbolic $200 right now, may or may not do more as I do some more calculations and find out my salary at my new job

Comment by Quinn (quinn-dougherty) on What are the good rationality films? · 2024-11-20T16:52:54.223Z · LW · GW

ThingOfThings said that Story of Louis Pasteur is a very EA movie, but I think it also counts for rationality. Huge fan.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-11-18T17:36:30.834Z · LW · GW

Guaranteed Safe AI paper club meets again this thursday

Event for the paper club: https://calendar.app.google/2a11YNXUFwzHbT3TA

blurb about the paper in last month's newsletter:

... If you’re wondering why you just read all that, here’s the juice: often in GSAI position papers there’ll be some reference to expectations that capture “harm” or “safety”. Preexpectations and postexpectations with respect to particular pairs of programs could be a great way to cash this out, cuz we could look at programs as interventions and simulate RCTs (labeling one program control and one treatment) in our world modeling stack. When it comes to harm and safety, Prop and bool are definitely not rich enough.

Comment by Quinn (quinn-dougherty) on Alexander Gietelink Oldenziel's Shortform · 2024-11-17T17:43:35.490Z · LW · GW

my dude, top level post- this does not read like a shortform

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-11-13T18:35:28.649Z · LW · GW

Yoshua Bengio is giving a talk online tomorrow https://lu.ma/4ylbvs75

Comment by Quinn (quinn-dougherty) on Science advances one funeral at a time · 2024-11-04T19:34:25.319Z · LW · GW

by virtue of their technical chops, also care about their career capital.

I didn't understand this-- "their technical chops impose opportunity cost as they're able to build very safe successful careers if they toe the line" would make sense, or they care about career capital independent of their technical chops would make sense. But here, the relation between technical chops and caring about career capital doesn't come through clear.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-10-16T20:03:46.998Z · LW · GW

did anyone draw up an estimate of how much the proportion of code written by LLMs will increase? or even what the proportion is today

Comment by Quinn (quinn-dougherty) on Yoav Ravid's Shortform · 2024-09-26T01:42:19.982Z · LW · GW

I was thinking the same thing this morning! My main thought was, "this is a trap. ain't no way I'm pressing a big red button especially not so near to petrov day"

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-09-19T00:15:21.153Z · LW · GW

GSAI paper club is tomorrow (gcal ticket), summary (by me) and discussion of this paper

Comment by Quinn (quinn-dougherty) on First Lighthaven Sequences Reading Group · 2024-09-05T22:18:57.378Z · LW · GW

Alas, belief is easier than disbelief; we believe instinctively, but disbelief requires a conscious effort.

Yes, but this is one thing that I have felt being mutated as I read the sequences and continued to hang out with you lot (roughly 8 years ago, with some off and on)

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-08-27T05:47:43.667Z · LW · GW

By all means. Happy for that

Comment by Quinn (quinn-dougherty) on Provably Safe AI: Worldview and Projects · 2024-08-27T03:28:09.616Z · LW · GW

discussion of the bet in Aug 2024 Progress in GSAI newsletter

Comment by Quinn (quinn-dougherty) on Limitations on Formal Verification for AI Safety · 2024-08-27T03:20:47.549Z · LW · GW

Note in August 2024 GSAI newsletter

See Limitations on Formal Verification for AI Safety over on LessWrong. I have a lot of agreements, and my disagreements are more a matter of what deserves emphasis than the fundamentals. Overall, I think the Tegmark/Omohundro paper failed to convey a swisscheesey worldview, and sounded too much like “why not just capture alignment properties in ‘specs’ and prove the software ‘correct’?” (i.e. the vibe I was responding to in my very pithy post). However, I think my main reason I’m not using Dickson’s post as a reason to just pivot all my worldview and resulting research is captured in one of Steve’s comments:

I'm focused on making sure our infrastructure is safe against AI attacks.

Like, a very strong version I almost endorse is “GSAI isn’t about AI at all, it’s about systems coded by extremely powerful developers (which happen to be AIs)”, and ensuring safety, security, and reliability capabilities scale at similar speeds with other kinds of capabilities.

It looks like one can satisfy Dickson just by assuring him that GSAI is a part of a swiss cheese stack, and that no one is messianically promoting One Weird Trick To Solve Alignment. Of course, I do hope that no one is messianically promoting One Weird Trick…

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-08-27T02:49:44.219Z · LW · GW

august 2024 guaranteed safe ai newsletter

in case i forgot last month, here's a link to july

A wager you say

One proof of concept for the GSAI stack would be a well-understood mechanical engineering domain automated to the next level and certified to boot. How about locks? Needs a model of basic physics, terms in some logic for all the parts and how they compose, and some test harnesses that simulate an adversary. Can you design and manufacture a provably unpickable lock?

Zac Hatfield-Dodds (of hypothesis/pytest and Anthropic, was offered and declined authorship on the GSAI position paper) challenged Ben Goldhaber to a bet after Ben coauthored a post with Steve Omohundro. It seems to resolve in 2026 or 2027, the comment thread should get cleared up once Ben gets back from Burning Man. The arbiter is Raemon from LessWrong.

Zac says you can’t get a provably unpickable lock on this timeline. Zac gave (up to) 10:1 odds, so recall that the bet can be a positive expected value for Ben even if he thinks the event is most likely not going to happen.
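To spell out the arithmetic, here is a worked version under assumed terms: Ben stakes $s$, Zac pays $10s$ if the lock materializes, and $q$ is Ben's probability that it does (the exact stakes are not given above, so $s$ and $q$ are placeholders).

$$\mathbb{E}[\text{Ben's winnings}] = q \cdot 10s - (1 - q) \cdot s = s\,(11q - 1) > 0 \iff q > \tfrac{1}{11} \approx 0.09$$

So at the full 10:1 odds, Ben only needs to put the event above roughly 9% for the bet to be worth taking.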

For funsies, let’s map out one path of what has to happen for Zac to pay Ben $10k. This is not the canonical path, but it is a path:

1. Physics to the relevant granularity (question: can human lockpicks leverage sub-Newtonian issues?) is conceptually placed into type theory or some calculus. I tried a riemann integral in coq once (way once), so it occurs to me that you need to decide if you want just the functional models (perhaps without computation / with proof irrelevance) in your proof stack or if you want the actual numerical analysis support in there as well.
2. Good tooling, library support, etc. around that conceptual work (call it mechlib) to provide mechanical engineering primitives
3. A lock designing toolkit, depending on mechlib, is developed
4. Someone (e.g. a large language model) is really good at programming in the lock designing toolkit. They come up with a spec L.
5. You state the problem “forall t : trajectories through our physics simulation, if L(t) == open(L) then t == key(L)”
6. Then you get to write a nasty gazillion line Lean proof
7. Manufacture a lock (did I mention that the design toolkit has links to actual manufacturing stacks?)
8. Bring a bunch to DefCon 2027 and send another to the lockpicking lawyer
9. Everyone fails. Except Ben and the army of postdocs that $9,999 can buy.
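For flavor, the “forall t : trajectories…” statement in the list above might look something like this in Lean 4. This is a sketch only: every name here is a placeholder axiom, standing in for what the imagined mechlib and lock toolkit would actually have to define.

```lean
-- Placeholder vocabulary; none of this exists in any real library.
axiom Trajectory : Type          -- a path through the physics simulation
axiom Lock : Type                -- a lock design, i.e. the spec L
axiom LockState : Type
axiom opened : LockState
axiom run : Lock → Trajectory → LockState   -- outcome of applying t to L
axiom keyTrajectory : Lock → Trajectory     -- what the intended key does

/-- Unpickability: the only trajectory that opens the lock is the key's. -/
def Unpickable (L : Lock) : Prop :=
  ∀ t : Trajectory, run L t = opened → t = keyTrajectory L
```

Stating it is the cheap part. All of the difficulty lives in replacing those axioms with honest definitions, and a real version would presumably want an equivalence class of key-like trajectories rather than literal equality.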

Looks like after the magnificent research engineering in steps 1 and 2, the rest is just showing off and justifying those two steps. Of course, in a world where we have steps 1 and 2 we have a great deal of transformative applications of formal modeling and verification just in reach, and we’ll need a PoC like locks to practice and concretize the workflow.

Cryptography applications tend to have a curse of requiring a lot of work after the security context, permission set, and other requirements are frozen in stone, which means that when the requirements change you have to start over and throw out a bunch of work (epistemic status: why do you think so many defi projects have more whitepapers than users?). The provably unpickable lock has 2 to 10x that problem: get the granularity wrong in step one, and most of your mechlib implementation won’t be salvageable. As the language model iterates on the spec L in step 5, the other language model has to iterate on the proof in step 6, because the new spec will break most of the proof.

Sorry I don’t know any mechanical engineering, Ben, otherwise I’d take some cracks at it. The idea of a logic such that its denotation is a bunch of mechanical engineering primitives seems interesting enough that my “if it was easy to do in less than a year someone would’ve, therefore there must be a moat” heuristic is tingling. Perhaps oddly, the quantum semantics folks (or with HoTT!) seem to have been productive, but I don’t know how much of that is translatable to mechanical engineering.

Reinforcement learning from proof assistant feedback, and yet more monte carlo tree search

DeepSeek’s paper

The steps are pretraining, supervised finetuning, RLPAF (reinforcement learning from proof assistant feedback), and MCTS (monte carlo tree search). RLPAF is not very rich: it’s a zero reward for any bug at all and a one for a happy typechecker. Glad they got that far with just that.
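To show just how sparse that signal is, here is a minimal sketch of the binary reward as described. The names are made up, and `toy_checker` is a stand-in for a real proof assistant call, not anything taken from the DeepSeek paper.

```python
from typing import Callable


def proof_assistant_reward(typechecks: Callable[[str], bool], candidate: str) -> float:
    """Binary reward: 1.0 iff the checker accepts the candidate proof."""
    try:
        return 1.0 if typechecks(candidate) else 0.0
    except Exception:
        # Assumption: a crashing or timed-out check counts as a failure.
        return 0.0


def toy_checker(candidate: str) -> bool:
    """Stand-in for invoking a real typechecker; purely illustrative."""
    return "sorry" not in candidate and candidate.strip().endswith("rfl")


print(proof_assistant_reward(toy_checker, "theorem t : 1 + 1 = 2 := rfl"))
print(proof_assistant_reward(toy_checker, "theorem t : 1 + 1 = 2 := sorry"))
```

There is no partial credit anywhere in that function: a proof that is one tactic away from closing scores the same as garbage.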

You can use the model at deepseek.com.

Harmonic ships their migration of miniF2F to Lean 4, gets 90% on it, is hiring

From their “one month in” newsletter. “Aristotle”, which has a mysterious methodology since I’ve only seen their marketing copy rather than an arxiv paper, gets 90% on miniF2F 4 when prompted with natural language proofs. It doesn’t look to me like the deepseek or LEGO papers do that? I could be wrong. It’s impressive just to autoformalize natural language proofs, I guess I’m still wrapping my head around how much harder it is (for an LLM) to implement coming up with the proof as well.

Jobs: research engineer and software engineer

Atlas ships their big google doc alluded to in the last newsletter

Worth a read! The GSAI stack is large and varied, and this maps out the different sub-sub-disciplines. From the executive summary:

You could start whole organizations for every row in this table, and I wouldn’t be a part of any org that targets more than a few at once for fear of being unfocused. See the doc for more navigation (see what I did there? Navigating like with an atlas, perhaps? Get it?) of the field’s opportunities.^[1]

Efficient shield synthesis via state-space transformation

Shielding is an area of reactive systems and reinforcement learning that marks states as unsafe and synthesizes a kind of guarding layer between the agent and the environment that prevents unsafe actions from being executed in the environment. So in the rejection sampling flavored version, it literally intercepts the unsafe action and tells the agent “we’re not running that, try another action”. One of the limitations in this literature is computational cost: shields are, like environments, state machines plus some frills, and there may simply be too many states. This is the limitation that this paper focuses on.

We consider the problem of synthesizing safety strategies for control systems, also known as shields. Since the state space is infinite, shields are typically computed over a finite-state abstraction, with the most common abstraction being a rectangular grid. However, for many systems, such a grid does not align well with the safety property or the system dynamics. That is why a coarse grid is rarely sufficient, but a fine grid is typically computationally infeasible to obtain. In this paper, we show that appropriate state-space transformations can still allow to use a coarse grid at almost no computational overhead. We demonstrate in three case studies that our transformation-based synthesis outperforms a standard synthesis by several orders of magnitude. In the first two case studies, we use domain knowledge to select a suitable transformation. In the third case study, we instead report on results in engineering a transformation without domain knowledge.

Besides cost, demanding a lot of domain knowledge is another limitation of shields, so this is an especially welcome development.
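Here is a toy illustration of why a rectangular grid can align badly with a safety property, and how a change of coordinates helps. This is my own example, not one of the paper's case studies: the property is "stay within radius 1 of the origin", and the assumed transformation is Cartesian to polar. Every function name below is hypothetical.

```python
import math
from collections import Counter

SAFE_RADIUS = 1.0


def classify(r_min: float, r_max: float) -> str:
    """Label a cell by the range of radii it covers (boundary-touching cells by their interior)."""
    if r_max <= SAFE_RADIUS:
        return "safe"
    if r_min >= SAFE_RADIUS:
        return "unsafe"
    return "mixed"  # straddles the boundary; a conservative shield must treat it as unsafe


def nearest(a: float, b: float) -> float:
    """Smallest |x| over the interval [a, b]."""
    return 0.0 if a <= 0.0 <= b else min(abs(a), abs(b))


def farthest(a: float, b: float) -> float:
    """Largest |x| over the interval [a, b]."""
    return max(abs(a), abs(b))


def cartesian_grid(n: int, lo: float = -2.0, hi: float = 2.0) -> Counter:
    """Classify the n*n cells of a rectangular grid over [lo, hi]^2."""
    w = (hi - lo) / n
    counts = Counter()
    for i in range(n):
        for j in range(n):
            x0, x1 = lo + i * w, lo + (i + 1) * w
            y0, y1 = lo + j * w, lo + (j + 1) * w
            r_min = math.hypot(nearest(x0, x1), nearest(y0, y1))
            r_max = math.hypot(farthest(x0, x1), farthest(y0, y1))
            counts[classify(r_min, r_max)] += 1
    return counts


def polar_grid(n: int, r_hi: float = 2.0) -> Counter:
    """Classify the n*n cells of a grid over (r, theta); safety depends only on r."""
    counts = Counter()
    for i in range(n):
        r0, r1 = r_hi * i / n, r_hi * (i + 1) / n
        counts[classify(r0, r1)] += n  # one cell per theta bin, all with the same label
    return counts


cart = cartesian_grid(4)
polar = polar_grid(4)
```

With the same coarse 4 by 4 resolution, the Cartesian grid has no fully safe cell at all (the four inner cells all straddle the circle), while the polar grid has eight safe cells and no mixed ones. The two grids cover slightly different regions (a square versus a disc of radius 2), so treat the counts as illustrative only.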

Funding opportunities

ARIA jumped right to technical area three (TA3), prototyping the gatekeeper. Deadline October 2nd. Seems geared toward cyber-physical systems folks. In the document:

Note that verified software systems is an area which is highly suitable for a simplified gatekeeper workflow, in which the world-model is implicit in the specification logic. However, in the context of ARIA’s mission to “change the perception of what’s possible or valuable,” we consider that this application pathway is already perceived to be possible and valuable by the AI community. As such, this programme focuses on building capabilities to construct guaranteed-safe AI systems in cyber-physical domains. That being said, if you are an organisation which specialises in verified software, we would love to hear from you outside of this solicitation about the cyber-physical challenges that are just at the edge of the possible for your current techniques.

This is really cool stuff; I hope they find brave and adventurous teams. I had thought gatekeeper prototypes would be in Minecraft or MuJoCo (and asked a funder if they’d support me in doing that), so it’s wild to see them going for actual cyber-physical systems so quickly.

Paper club

Add to your calendar. On September 19th we will read a paper about assume-guarantee contracts with learned components. I’m liable to have made a summary slide deck to kick us off, but if I don’t, we’ll quietly read together for the first 20-30 minutes then discuss. The Google Meet room is in the gcal event by default.

Andrew Dickson’s excellent post

I'm focused on making sure our infrastructure is safe against AI attacks.

^[1] One problem off the top of my head regarding the InterFramework section: Coq and Lean seem the most conceptually straightforward since they have the same underlying calculus, but even there just a little impredicativity or coinduction could lead to extreme headaches. Now you can have a model at some point in the future that steamrolls over these headaches, but then you have a social problem of the broader Lean community not wanting to upstream those changes. Various forks diverging fundamentally seems problematic to me; it would lead to a lot of duplicated work and missed opportunities for collaboration. I plan to prompt Opus 3.5 with “replicate flocq in lean4” as soon as I get access to the model, but how much more prompting effort will it be to ensure compliance with preexisting abstractions and design patterns, so that it can not only serve my purposes but be accepted by the community? At least there’s no coinduction in flocq, though some of the proofs may rely on set impredicativity for all I know (I haven’t looked at it in a while).

Comment by Quinn (quinn-dougherty) on Provably Safe AI: Worldview and Projects · 2024-08-26T17:53:25.244Z · LW · GW

I do think it's important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification - whatever actually works in practice.

for what it's worth, I do see GSAI as largely a swiss cheesey worldview. Though I can see how you might read some of the authors involved to be implying otherwise! I should knock out a post on this.

Comment by Quinn (quinn-dougherty) on Please do not use AI to write for you · 2024-08-22T17:46:40.206Z · LW · GW

not just get it to sycophantically agree with you

i struggle with this, and need to attend a prompting bootcamp

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-07-31T18:22:13.918Z · LW · GW

i'm getting back into composing and arranging. send me rat poems to set to music!

Comment by Quinn (quinn-dougherty) on Eric Neyman's Shortform · 2024-04-26T14:42:51.433Z · LW · GW

sure -- i agree, that's why i said "something adjacent to": it had enough overlap in properties. I think my comment completely stands with a different word choice, I'm just not sure what word choice would do a better job.

Comment by Quinn (quinn-dougherty) on Eric Neyman's Shortform · 2024-04-25T19:53:51.209Z · LW · GW

I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I'd prefer to avoid lock-in to my or anyone's values at t=2024, but such a lock-in might be "good enough" if I'm threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I'd accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different timesteps).

In other words, I can't beat the old fragility of value stuff (but I haven't tried in a while).

I wrote down my full thoughts on good successor criteria in 2021 https://www.lesswrong.com/posts/c4B45PGxCgY7CEMXr/what-am-i-fighting-for

AI welfare: matters, but when I started reading lesswrong I literally thought that disenfranchising them from the definition of prosperity was equivalent to subjecting them to suffering, and I don't think this anymore.

Comment by Quinn (quinn-dougherty) on Quinn's Shortform · 2024-04-20T01:01:07.934Z · LW · GW

Thinking about a top-level post on FOMO and research taste

- Fear of missing out defined as inability to execute on a project cuz there's a cooler project if you pivot
- but it also gestures at more of a strict negative, where you think your project sucks before you finish it, so you never execute
- was discussing this with a friend: "yeah I mean lesswrong is pretty egregious cuz it sorta promotes this idea of research taste as the ability to tear things down, which can be done armchair"
- I've developed strategies to beat this FOMO and gain more depth and detail with projects (too recent to see returns yet, but getting there) but I also suspect it was nutritious of me to develop discernment about what projects are valuable or not valuable for various threat models and theories of change (in such a way that being a phd student off of lesswrong wouldn't have been as good in crucial ways, tho way better in other ways).
  - but I think the point is you have to turn off this discernment sometimes, unless you want to specialize in telling people why their plans won't work, which I'm more dubious on the value of than I used to be

Idk maybe this shortform is most of the value of the top level post

User info