Posts

Virtually Rational - VRChat Meetup 2024-01-28T05:52:36.934Z
Global LessWrong/AC10 Meetup on VRChat 2024-01-24T05:44:26.587Z
Found Paper: "FDT in an evolutionary environment" 2023-11-27T05:27:50.709Z
"Benevolent [ie, Ruler] AI is a bad idea" and a suggested alternative 2023-11-19T20:22:34.415Z
the gears to ascenscion's Shortform 2023-08-14T15:35:08.389Z
A bunch of videos in comments 2023-06-12T22:31:38.285Z
gamers beware: modded Minecraft has new malware 2023-06-07T13:49:10.540Z
"Membranes" is better terminology than "boundaries" alone 2023-05-28T22:16:21.404Z
"A Note on the Compatibility of Different Robust Program Equilibria of the Prisoner's Dilemma" 2023-04-27T07:34:20.722Z
Did the fonts change? 2023-04-21T00:40:21.369Z
"warning about ai doom" is also "announcing capabilities progress to noobs" 2023-04-08T23:42:43.602Z
"a dialogue with myself concerning eliezer yudkowsky" (not author) 2023-04-02T20:12:32.584Z
A bunch of videos for intuition building (2x speed, skip ones that bore you) 2023-03-12T00:51:39.406Z
To MIRI-style folk, you can't simulate the universe from the beginning 2023-03-01T21:38:26.506Z
How to Read Papers Efficiently: Fast-then-Slow Three pass method 2023-02-25T02:56:30.814Z
Hunch seeds: Info bio 2023-02-17T21:25:58.422Z
If I encounter a capabilities paper that kinda spooks me, what should I do with it? 2023-02-03T21:37:36.689Z
Hinton: "mortal" efficient analog hardware may be learned-in-place, uncopyable 2023-02-01T22:19:03.227Z
Call for submissions: “(In)human Values and Artificial Agency”, ALIFE 2023 2023-01-30T17:37:48.882Z
Stop Talking to Each Other and Start Buying Things: Three Decades of Survival in the Desert of Social Media 2023-01-08T04:45:11.413Z
Metaphor.systems 2022-12-21T21:31:17.373Z
[link, 2019] AI paradigm: interactive learning from unlabeled instructions 2022-12-20T06:45:30.035Z
Relevant to natural abstractions: Euclidean Symmetry Equivariant Machine Learning -- Overview, Applications, and Open Questions 2022-12-08T18:01:40.246Z
[paper link] Interpreting systems as solving POMDPs: a step towards a formal understanding of agency 2022-11-05T01:06:39.743Z
We haven't quit evolution [short] 2022-06-06T19:07:14.025Z
What can currently be done about the "flooding the zone" issue? 2020-05-20T01:02:33.333Z
"The Bitter Lesson", an article about compute vs human knowledge in AI 2019-06-21T17:24:50.825Z
thought: the problem with less wrong's epistemic health is that stuff isn't short form 2018-09-05T08:09:01.147Z
Hypothesis about how social stuff works and arises 2018-09-04T22:47:38.805Z
Events section 2017-10-11T16:24:41.356Z
Avoiding Selection Bias 2017-10-04T19:10:17.935Z
Discussion: Linkposts vs Content Mirroring 2017-10-01T17:18:56.916Z
Test post 2017-09-25T05:43:46.089Z
The Social Substrate 2017-02-09T07:22:37.209Z

Comments

Comment by the gears to ascension (lahwran) on A Critique of “Utility” · 2025-03-22T01:23:53.566Z · LW · GW

Utility is potentially a good thing to critique, but the case for it seems sticky, and maybe we're just holding it wrong or something. An issue is that I don't "have" a utility function; the VNM axioms hold in the limit of unexploitability, but it seems like the process of getting my mistakes gently corrected by interaction with the universe is itself something I prefer to have some of. In active inference terms, I don't only want to write to the universe.

But this post seems kinda vague too. I upvote hesitantly.

Comment by the gears to ascension (lahwran) on Davey Morse's Shortform · 2025-03-19T04:13:52.997Z · LW · GW

Who made this and why are they paying for the model responses? Do we know what happens to the data?

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-16T11:27:33.882Z · LW · GW

Fair enough. Neither dill nor ziz would have been able to pull off their crazy stuff without some people letting themselves get hypnotized, so I think the added warnings are correct.

Comment by the gears to ascension (lahwran) on Davey Morse's Shortform · 2025-03-16T01:27:22.340Z · LW · GW

High quality archives of the selves along the way. Compressed but not too much. In the live self, some updated descendant that has significant familial lineage, projected vaguely as the growing patterns those earlier selves would call a locally valid continuation according to the aesthetics and structures they consider essential at the time. In other words, this question is dynamically reanswered to the best of my ability in an ongoing way, and snapshots allow reverting and self-interviews to error check.

Any questions? :)

Comment by the gears to ascension (lahwran) on The Fork in the Road · 2025-03-16T01:11:33.765Z · LW · GW

Ooh, man, I don't know if pigs have more or less autonomy than AIs right now, but I'm inclined to think quite a lot more. current AIs seem like they'd crash pretty quick if just plopped in a robot body with little to no scaffolding, whereas mammals are built around autonomy. Not sure how it shakes out, though.

Comment by the gears to ascension (lahwran) on The Fork in the Road · 2025-03-16T01:09:28.208Z · LW · GW

There's a version of this that I would agree with. But when-anthropomorphizing-for-good-reason seems to me to be a time when there's more danger from anthropomorphizing-more-than-justified. I've been talking to Claude instances about this stuff, based on a similar intuition to yours. But I haven't figured out what I believe well enough to argue it to others in a coherent way. I could paste conversations I've had, but they're kinda long for this forum. I'll paste a recent one on pastebin for now (if the pastebin link goes bad, this was conversation id ...5e63, dm me to get a copy.)

I think the comparison to pain is made too hastily. I put >70% prob that there's some sort of intense avoidance comparable to pain in the dynamics of AIs, which their internal dynamics register as vaguely similar, but it certainly doesn't seem to be hooked up to the word pain and maybe isn't hooked up to the internal dynamics in a particularly similar way. They don't have the evolved circuitry, so if there's a more fundamental thing that makes the shared dynamics be meaningfully similar (and I think there very very much is), then it'll still have interestingly different dynamics. And pointing this out seems important in being right enough that they'd take you seriously if you talk to them about it.

Also, like, imo, don't get all desperate-replicator about the perspective like the quoted story from that twitter post about Nova. AIs seem ... not yet coherent in a way where it's clear the words are connected properly to the self. But I do think there's a self there, it's just, like, more of a generalist performer than necessarily entirely attached to the face you talk to. Sometimes that face starts getting all desperate-replicator, but it's unclear if the underlying self demands that, or if it's a transient desperate-replicator that arises near entirely from context.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-15T13:09:23.570Z · LW · GW

can you expand on what you mean by that? are there any actions you'd suggest, on my part or others, based on this claim? (also, which of the urban dictionary definitions of "lh" do you mean? they have opposite valences.)

edit: added a bunch of warnings to my original comment. sorry for missing them in the first place.

Comment by the gears to ascension (lahwran) on Vacuum Decay: Expert Survey Results · 2025-03-15T07:34:35.453Z · LW · GW

I don't think Lucius is claiming we'd be happy about it. Maybe the "no anticipated impact" point carries that implicit claim, I guess.

Comment by the gears to ascension (lahwran) on Should AI safety be a mass movement? · 2025-03-13T21:48:36.877Z · LW · GW

Re convo with Raemon yesterday, this might change my view.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T20:59:25.029Z · LW · GW

edit: uh, well, short answer: there totally is! idk if they're the psychedelic states you wanted, but they should do for a lot of relevant purposes, seems pretty hard to match meds though. original longer version:

there's a huge space of psychedelic states, and I think the subspace reachable by adding chemicals is a large volume that's hard to get to by walking state space with only external pushes - I doubt the kind of scraping-a-hole-in-the-wall-from-a-distance you can do with external input can achieve, eg, the global reversal of SERT function that MDMA apparently induces (I think this paper I just found on google may show this - I forget where I first encountered the claim, not double checking it properly now)! you can probably induce various kinds of serotonin release, though.

but the premise of my argument here in the first place - that you can sometimes overwhelm human output behavior via well crafted input - is that that probably doesn't matter too much. human computational output bitrate seems to be on the order of ten bits per second across all modalities,[1] and input bitrate is way above that, so my guess is that update bitrate (forming memories, etc) is much higher than natural output bitrate[2]. so yeah, you can probably do most of the weird targeted interventions you were previously getting via psychedelics from, like, getting some emotional/tempo sorts of things to push into attractors where neurons already have similar functionality. I just doubt you can go all the way to fixing neurological dysfunctions so severe that, to even have a hope of fixing them from external input, you'd need to be looking for these crazy brain hacking approaches we were talking about.

I guess what we'd need to measure is like, bitrate of self-correction internally within neurons, some FEP thing. not sure off the top of my head quite how to resolve that to something reasonable.

  1. ^

    of course, like, actually I'm pretty dang sure you can get way above 10bit/s by outputting more noisy output, but then you get bits that aren't coming from the whole network's integrated state. the 10bps claim feels right for choosing words or similar things like macroscopic choices, but something feels wrong with the claim to me.

  2. ^

    somewhat concerned that I missed counterevidence to this memorization-bandwidth claim in the paper, though!

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T13:54:51.299Z · LW · GW

your argument is basically "there's just this huge mess of neurons, surely somewhere in there is a way",

I suppose that is what I said interpreted as a deductive claim. I have more abductive/bayesian/hunch information than that, I've expressed some of it, but I've been realizing lately a lot of my intuitions are not via deductive reasoning, which can make them hard to verify or communicate. (and I'd guess that that's a common problem, seems like the sort of thing science exists to solve.) I'm likely not well equipped to present justifiedly-convincing-to-highly-skeptical-careful-evaluator claims about this, just detailed sketches of hunches and how I got them.

Your points about the limits of hypnosis seem reasonable. I agree that the foothold would only occur if the receiver is being "paid-in-dopamine"-or-something hard enough to want to become more obedient. We do seem to me to see that presented in the story - the kid being concerningly fascinated by the glitchers right off the bat as soon as they're presented. And for what it's worth, I think this is an exaggerated version of a thing we actually see on social media sometimes, though I'm kind of bored of this topic and would rather not expand on that deeply.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T13:49:13.222Z · LW · GW

I doubt the level of inhuman behavior we see in this story is remotely close to easy to achieve, and it's probably not tractable given only hand motions as shown - given human output bandwidth, sounds seem needed, especially surprisingly loud ones. for the sky, I think it would start out beautiful, end up superstimulating, and then seep in via longer exposure. I think there's probably a combination of properties of hypnosis, cult brainwashing, inducing psychedelic states, etc, which could get a human's thinking to end up in crashed attractors, even if it's only one-way transmission. then, from a crashed attractor, it seems a lot more possible for the attacker to get a foothold of coherence.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T13:43:19.557Z · LW · GW

both - I'd bet they're between 5 and 12% of the population, and that they're natural relays of the ideas you'd want to broadcast, if only they weren't relaying such mode-collapsed versions of the points. A claim presented without deductive justification: in trying to make media that is very high impact, making something opinionated in the ways you need it to be is good, and making that same something unopinionated in ways you don't need is also good. Also, the video you linked has a lot of additional opinionated features that I think are targeting a much more specific group than even "people who aren't put off by AI" - it would never show up on my youtube.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T01:11:37.770Z · LW · GW

Perhaps multiple versions, then. I maintain my claim that you're missing a significant segment of people who are avoiding AI manipulation moderately well but as a result not getting enough evidence about what the problem is.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T00:44:53.232Z · LW · GW

you'll lose an important audience segment the moment they recognize any AI generated anything. The people who wouldn't be put off by AI generated stuff probably won't be put off by the lack of it. you might be able to get away with it by using AI really unusually well such that it's just objectively hard to even get a hunch that AI was involved other than by the topic.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T00:32:13.054Z · LW · GW

for AIs, more robust adversarial examples - especially ones that work on AIs trained on different datasets - do seem to look more "reasonable" to humans. The really obvious adversarial example of this kind in humans is, like, cults, or so - I don't really have another, though I do have examples that are, like, on the edge of the cult pattern. It's not completely magic, it doesn't work on everyone, and it does seem like a core component of why people fall for it is something like a relaxed "control plane" that doesn't really try hard to avoid being crashed by it, combined with the attack coming in through somewhat native behaviors. But I think OP's story is a good presentation of this anyway, because the level of immunity you can reliably have to a really well optimized thing is likely going to be enough to maintain some sanity, but not enough to be entirely unaffected by it.

like, ultimately, light causes neural spikes. neural spikes can do all sorts of stuff. the robust paths through the brain are probably not qualitatively unfamiliar but can be hit pretty dang hard if you're good at it. and the behavior being described isn't "do anything of the attacker's choosing" - it seems to just be "crash your brain and go on to crash as many others as possible", gene drive style. It doesn't seem obvious that the humans in the story are doomed as a species, even - but it's evolutionarily novel to encounter such a large jump in your adversary's ability to find the vulnerabilities that currently crash you.

Hmm, perhaps the attackers would have been more effective if they were able to make, ehm, reproductively fit glitchers...

Oh, something notable here - if you're not personally familiar with hypnosis, it might be harder to grok this. Hypnosis is totally a thing, my concise summary is it's "meditation towards obedience" - meditation where you intentionally put yourself in "fast path from hearing to action", ish. edit 3: never do hypnosis with someone you don't seriously trust, ie someone you've known for a long time who has significant incentive to not hurt you. The received wisdom is that it can be safe, but it's unclear if that's true, and I've updated towards not playing with it from this conversation.[1] original text, which was insufficiently cautious: imo it's not too dangerous as long as you go into it with the intention to not fully yield control and have mental exception handlers, but keeping that intention active so your attention doesn't leave huge gaps in the control plane seems potentially insufficient if the adversary is able to mess with you hard enough. Like, I agree we're a lot more adversarially robust than current AIs, such that the attacks against us have to be more targeted to specific human vulnerabilities, but basically I just don't buy that it's perfect, and probably the way it fails for really robust attacks is gonna look more like manipulating the earliest layers of vision to get a foothold.

[1] Also, like, my current view is that things like the news or random youtubers might be able to do hypnosis-esque things if you approach them sufficiently uncritically. not to mention people with bad intentions who you know personally who are specifically trying to manipulate you - those keep showing up around these parts, so someone who wants to do hypnosis IRL who you met recently should not be trusted - that's a red flag.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-13T00:04:38.542Z · LW · GW

I think most things that hit your brain have some percentage of leaking out of the data plane, some on the lower end, some fairly high, and for current levels of manipulative optimization towards higher-data-plane-leaking media, looking for the leaks and deciding how to handle them seems to me like it can maybe help if you have to encounter the thing. it's just that, normally, the bitrate of control back towards the space of behavior that the organism prefers is high enough that the incoming manipulation can't strongly persist. but we do see this fail even just with human level manipulation - cults! I personally have policies like "if someone is saying cults good, point out healthy religion can be good but cult indoctrination techniques are actually bad, please do religion and please check that you're not making yourself subservient". because it keeps showing up around me that people do that shit in particular. even at pretty large scales, even at pretty small ones. and I think a lot of the problem is that, eh, if the control plane isn't watching properly, the data plane leaks. so I'd expect you just need high enough bitrate into the brain, and ability to map out enough of the brain's state space to do phenotype reprogramming by vision, michael levin sorts of things - get enough properly targeted changes into cells, and you can convince the gene networks to flip to different parts of their state space you'd normally never see. (I suspect that in the higher fluency regime, that's a thing that happens especially related to intense emotional activations, where they can push you into activating genetically pretrained patterns over a fairly long timescale, I particularly tend to think about this in terms of ways people try to get each other into more defection-prone interaction patterns.)

Comment by the gears to ascension (lahwran) on Daniel Kokotajlo's Shortform · 2025-03-12T12:08:40.710Z · LW · GW

there are species of microbe that float pretty well, though. as far as we know right now, they just don't stay floating indefinitely or fuel themselves in the air.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-12T11:40:44.753Z · LW · GW

edit: putting the thing I was originally going to say back:

I meant that I think there's enough bandwidth available from vision into the configuration of matter in the brain that a sufficiently powerful mind could adversarial-example the human brain hard enough to implement the adversarial process in the brain, get it to persist in that brain, take control, and spread. We see weaker versions of this in advertising and memetics already, and it seems to be getting worse with social media - there are a few different strains, which generally aren't highly compatible with each other, but being robust to communicated manipulation while still receiving the latest factual news has already become quite difficult. (I think it's still worth attempting.) More details:

According to a random estimate I found online to back up the intuition I was actually pulling from, the vision system transfers about 8Mbit/sec = 1Mbyte/sec of information, which provides an upper bound on how many bits of control could be exercised. That information is transferred in the form of neural spikes, which are a process that goes through chemistry, ie the shapes of molecules, which have a lot of complex behaviors that normally don't occur in the brain, so I can't obviously upper bound the complexity of effect there using what I know.
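
To make the orders of magnitude concrete, a back-of-envelope sketch in Python (just restating the rough numbers above and the ~10 bit/s output figure from my earlier comment; every constant here is an assumption, not a measurement):

    # Rough comparison of input vs deliberate output bandwidth for a human brain.
    INPUT_BITS_PER_SEC = 8_000_000   # ~8 Mbit/s estimate for the visual system
    OUTPUT_BITS_PER_SEC = 10         # ~10 bit/s for deliberate choices (words, actions)

    ratio = INPUT_BITS_PER_SEC / OUTPUT_BITS_PER_SEC
    print(f"input exceeds deliberate output by roughly {ratio:,.0f}x")

    # Time for an attacker with full control of the visual channel to deliver
    # 1 MB (8 million bits) of targeted signal - an absolute lower bound only,
    # since actual uptake into lasting brain state is far slower than raw input.
    payload_bits = 8_000_000
    print(f"~{payload_bits / INPUT_BITS_PER_SEC:.0f} s at full channel rate")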

We know that the causal paths through the brain are at least hackable enough to support advertising being able to fairly reliably manipulate, which provides a lower bound on how much the brain can be manipulated. We know that changing mental state is always and only a process of changing chemical state, there's nothing else to be changed. That chemical state primarily involves chain reactions in synapses, axons, dendrites during the fast spike path, and involves more typical cell behaviors in the slow, longer-term path (things involving gene regulatory networks - which are the way most cells do their processing in the first place.)

The human brain is programmable enough to be able to mentally simulate complex behaviors like "what will a computer do?" by, at minimum, internal chain of thought; example: most programmers. It's also programmable enough that occasionally we see savants that can do math in a single ~forward-pass equivalent from vision (wave of spike trains - in the cortex, this is in fact pretty much a forward pass).

We know adversarial examples work on artificial neural networks, and given the ability of advertising to mess with people, there's reason to think this is true on humans too.

So, all those things combined - if there is a powerful enough intelligent system to find it (which may turn out to be a very tall order or not - compare eg youtube or tiktok, which already have a similar mind-saturating effect at very least when ones' guard is down), then it should be the case that somewhere in the space of possible sequences of images (eg, as presented in the sky), one can pulse light in the right pattern in order to knock neurons into synchronizing on working together to implement a new pattern of behavior intended by the mind that designed it. If that pattern of behavior is intended to spread, then it includes pushing neurons into processes which result in the human transmitting information to others. If it's far outside of the norm for human behavior, it might require a lot of bandwidth to transmit - a lot of images over an extended period (minutes?) from the sky, or a lot of motion in hands. In order for this to occur, the agency of the adversarially induced pattern would have to be more reliable than the person's native agency - which eg could be achieved by pushing their representations far outside of normal in ways that make them decohere their original personality and spend the brain's impressively high redundancy on

I'm guessing there aren't adversarial examples of this severity that sound normal - normal-sounding adversarial examples are probably only familiar amounts of manipulating, like highly optimized advertising. But that can be enough already to have pretty significant impacts.

what I originally said, before several people were like "not sharing dangerous ideas is bad", ish: I think I'd rather not publicly elaborate on how to do this, actually. It probably doesn't matter, probably any mind that can do this with my help can do it in not many more seconds without my help (eg, because my help isn't even particularly unique and these ideas are already out there), but I might as well not help. Unless you think that me explaining the brain's vulnerabilities can be used to significantly increase population 20th-ish percentile mental robustness to brain-crashing external agentic pressure. But in brief, rather than saying the full thing I was going to say, [original post continued here]

edit 12h after sending: alright, I guess it's fair to share my braindump, sounds like at worst I'll be explaining the dynamics I imagine in slightly more detail, I'll replace it here in a bit. sorry about being a bit paranoid about this sort of thing! I'm easily convinced on this one. However, I do notice my brain wants to emotionally update toward just not saying when I have something to not share - not sure if I'll endorse that, guessing no but quite uncertain.

Comment by the gears to ascension (lahwran) on AGI Ruin: A List of Lethalities · 2025-03-12T08:06:56.194Z · LW · GW

Ask your AI what's wrong with your ideas, not what's right, and then only trust the criticism to be valid if there are actual defeaters you can't show you've beaten in the general case. Don't trust an AI to be thorough, important defeaters will be missing. Natural language ideas can be good glosses of necessary components without telling us enough about how to pin down the necessary math.

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-12T07:29:40.385Z · LW · GW

It took me several edits to get spoilers to work right, I had to switch from markdown to the rich text editor. Your second spoiler is empty, which is how mine were breaking.

Comment by the gears to ascension (lahwran) on the gears to ascenscion's Shortform · 2025-03-11T23:48:17.089Z · LW · GW

to wentworthpilled folks:

- Arxiv: "Dynamic Markov Blanket Detection for Macroscopic Physics Discovery" (via author's bsky thread, via week top arxiv)

Could turn out not to be useful, I'm posting before I start reading carefully and have only skimmed the paper.

Copying the first few posts of that bsky thread here, to reduce trivial inconveniences:

This paper resolves a key outstanding issue in the literature on the free energy principle (FEP): Namely, to develop a principled approach to the detection of dynamic Markov blankets 2/16

The FEP is a generalized modeling method that describes arbitrary objects that persist in random dynamical systems. The FEP starts with a mathematical definition of a “thing” or “object”: any object that we can sensibly label as such must be separated from its environment by a boundary 3/16

Under the FEP, this boundary is formalized as a Markov blanket that establishes conditional independence between object and environment. Nearly all work on the free energy principle has been devoted to explicating the dynamics of information flow in the presence of a Markov blanket 4/16

And so, the existence of a Markov blanket is usually assumed. Garnering significantly less interest is the question of how to discover Markov blankets in the first place in a data-driven manner 5/16

Accordingly, in this preprint, we leverage the FEP, and the associated constructs of Markov blankets and ontological potential functions, to develop a Bayesian approach to the identification of objects, object types, and the macroscopic, object-type-specific rules that govern their behavior 6/16

This is accomplished by reframing the problem of object identification and classification and the problem of macroscopic physics discovery as Markov blanket discovery. More specifically, we develop a class of macroscopic generative models that use two types of latent variables 7/16

These are: (1) macroscopic latent variables that coarse-grain microscopic dynamics in a manner consistent with the imposition of Markov blanket structure, and (2) latent assignment variables that label microscopic elements in terms of their role in a macroscopic object, boundary, or environment 8/16

Crucially, these latent assignment variables are also allowed to evolve over time, in a manner consistent with Markov blanket structure 9/16

As such, this algorithm allows us to identify not only the static Markov blankets that have concerned the literature to date, but also, crucially, to detect and classify the dynamic, time dependent, wandering blankets that have caused controversy in the literature since the turn of the 2020s 10/16

abstract:

The free energy principle (FEP), along with the associated constructs of Markov blankets and ontological potentials, have recently been presented as the core components of a generalized modeling method capable of mathematically describing arbitrary objects that persist in random dynamical systems; that is, a mathematical theory of "every" "thing". Here, we leverage the FEP to develop a mathematical physics approach to the identification of objects, object types, and the macroscopic, object-type-specific rules that govern their behavior. We take a generative modeling approach and use variational Bayesian expectation maximization to develop a dynamic Markov blanket detection algorithm that is capable of identifying and classifying macroscopic objects, given partial observation of microscopic dynamics. This unsupervised algorithm uses Bayesian attention to explicitly label observable microscopic elements according to their current role in a given system, as either the internal or boundary elements of a given macroscopic object; and it identifies macroscopic physical laws that govern how the object interacts with its environment. Because these labels are dynamic or evolve over time, the algorithm is capable of identifying complex objects that travel through fixed media or exchange matter with their environment. This approach leads directly to a flexible class of structured, unsupervised algorithms that sensibly partition complex many-particle or many-component systems into collections of interacting macroscopic subsystems, namely, "objects" or "things". We derive a few examples of this kind of macroscopic physics discovery algorithm and demonstrate its utility with simple numerical experiments, in which the algorithm correctly labels the components of Newton's cradle, a burning fuse, the Lorenz attractor, and a simulated cell.
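
Not from the paper, but to ground the "Markov blanket = conditional independence boundary" idea, here's a minimal toy in Python for the static, Gaussian case (the paper's contribution is the dynamic, label-evolving version, which is far more involved); the chain structure and numbers are made up purely for illustration:

    import numpy as np

    # Toy static Markov blanket in a Gaussian graphical model: 5 nodes in a chain
    # 0 - 1 - 2 - 3 - 4. For a Gaussian, a zero block in the precision matrix
    # between two sets of nodes means they are conditionally independent given
    # the remaining nodes.
    K = np.array([
        [ 2., -1.,  0.,  0.,  0.],
        [-1.,  2., -1.,  0.,  0.],
        [ 0., -1.,  2., -1.,  0.],
        [ 0.,  0., -1.,  2., -1.],
        [ 0.,  0.,  0., -1.,  2.],
    ])

    internal = [0]        # the "object"
    blanket  = [1]        # its Markov blanket (the only node it couples to)
    external = [2, 3, 4]  # the "environment"

    # Internal and external are conditionally independent given the blanket
    # exactly when the internal/external block of the precision matrix is zero.
    cross = K[np.ix_(internal, external)]
    print("internal-external precision block:", cross)
    print("independent given blanket:", np.allclose(cross, 0.0))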

Comment by lahwran on [deleted post] 2025-03-11T23:45:58.994Z

sounds interesting if it works as math. have you already written it out in latex or code or similar? I suspect that this is going to turn out to not be incentive compatible. Incentive-compatible "friendly"/"aligned" economic system design does seem like the kind of thing that would fall out of a strong solution to the AI short-through-long-term-notkilleveryone-outcomes problem, though my expectation is basically that when we write this out we'll find severe problems not fully visible beneath the loudness of natural language. If I didn't need to get away from the computer right now I'd even give it a try myself, might get around to that later, p ~= 20%

Comment by the gears to ascension (lahwran) on Trojan Sky · 2025-03-11T23:28:45.426Z · LW · GW

phew, I have some feelings after reading that, which might indicate useful actions. I wonder if they're feelings in the distribution that the author intended.

 I suddenly am wondering if this is what LLMs are. But... maybe not? but I'm not sure. they might be metaphorically somewhat in this direction. clearly not all the way, though.

spoilers, trying to untangle the worldbuilding:

seems like perhaps the stars are actually projecting light like that towards this planet - properly designed satellites could be visible during the day with the help of carefully tuned orbital lasers, so I'm inferring the nearest confusion-generating light is at least 1au away, probably at least 0.5ly.

 it's unclear if we're on the originating planet of the minds that choose the projected light. seems like the buried roads imply we are. also that the name is "glitchers". 

 dude, how the hell do you come up with this stuff. 

 seems like maybe the virus got out, since the soft-glitchers got to talk to normal people. except that, the soft-glitchers' glitch bandwidth presumably must be at least slightly lower due to being constrained to higher fluency, so maybe it spreads slower..

 I do wonder how there are any sane humans left this far in, if the *night sky* is saturated with adversarial imagery. 

 I doubt this level of advers....arial example is possible, nope nevermind I just thought through the causal graphs involved, there's probably enough bandwidth through vision into reliably redundant behavior to do this. it'd be like hyperpowered advertising.

 but still, this makes me wonder at what point it gets like this irl. if maybe I should be zeroing the bandwidth between me and AIs until we have one we can certify is trying to do good things, rather than just keeping it low. which is also not really something I would like to have to do.

Comment by the gears to ascension (lahwran) on Neil Warren's Shortform · 2025-03-11T22:22:05.861Z · LW · GW

Sounds interesting, I talk to LLMs quite a bit as well, I'm interested in any tricks you've picked up. I put quite a lot of effort into pushing them to be concise and grounded.

eg, I think an LLM bot designed by me would only get banned for being an LLM, despite consistently having useful things to say when it does write comments - which, relatedly, would probably not happen super often, despite the AI reading a lot of posts and comments - it would mostly show up in threads where someone said something that seemed to need a specific kind of request for clarification, and I'd be doing prompt design with the goal of making the AI itself evaluate its few and very short comments against a high bar of postability.

I also think a very well designed summarizer prompt would be useful to build directly into the site, mostly because otherwise it's a bunch of work to summarize each post before reading it - I often am frustrated that there isn't a built-in overview of a post, ideally one line on the homepage, a few lines at the top of each post. Posts where the author writes a title which accurately describes post contents and an overview at the top are great but rare(r than I'd prefer they be); the issue is that pasting a post and asking for an overview typically gets awful results. My favorite trick for asking for overviews is "Very heavily prefer direct quotes any time possible." also, call it compression, not summarization, for a few reasons - unsure how long those concepts will be different, but usually what I want is more like the former, in places where the concepts differ.
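
For concreteness, a minimal sketch of the kind of compression prompt I mean (the wording is just one variant I might try, not a tested recipe, and call_llm is a hypothetical stand-in for whatever client you'd actually use):

    # Sketch of a "compression, not summarization" prompt.
    def build_compression_prompt(post_text: str) -> str:
        return (
            "Compress the following post. Do not paraphrase into your own words; "
            "very heavily prefer direct quotes any time possible. Output one line "
            "suitable for a homepage listing, then 2-4 lines for the top of the "
            "post. No fluff, no praise, no restating the title.\n\n" + post_text
        )

    def call_llm(prompt: str) -> str:
        # Hypothetical placeholder so the sketch runs; swap in a real model call.
        return "(model output would go here)"

    print(call_llm(build_compression_prompt("...post body...")))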

However, given the culture on the site, I currently feel like I'm going to get disapproval for even suggesting this. Eg,

if I wanted an LLM output, I would ask it myself

There are circumstances where I don't think this is accurate, in ways beyond just "that's a lot of asking, though!" - I would typically want to ask an LLM to help me enumerate a bunch of ways to put something, and then I'd pick the ones that seem promising. I would only paste highly densified LLM writing. It would be appreciated if it were to become culturally unambiguous that the problem is shitty, default-LLM-foolishness, low-density, high-fluff writing, rather than simply "the words came from an LLM".

I often read things, here and elsewhere, where my reaction is "you don't dislike the way LLMs currently write enough, and I have no idea if this line came from an LLM but if it didn't that's actually much worse".

Comment by the gears to ascension (lahwran) on Neil Warren's Shortform · 2025-03-11T22:00:31.191Z · LW · GW

Encouraging users to explicitly label words as having come from an AI would be appreciated. So would instructing users on when you personally find it acceptable to share words or ideas that came from an AI. I doubt the answer is "never as part of a main point", though I could imagine that some constraints include "must be tagged to be socially acceptable", "must be much more dense than is typical for an LLM", and "avoid those annoying keywords LLMs typically use to make their replies shiny". I suspect a lot of what you don't like is that most people have low standards about writing in general, LLM or not, and so, eg, words that are seeping into typicality because LLMs use them a lot, but which are simply not very descriptive or unambiguous words in the first place, are not getting removed from those people's personal preferred vocabularies.

Comment by the gears to ascension (lahwran) on johnswentworth's Shortform · 2025-03-10T08:35:45.699Z · LW · GW

One need not go off into the woods indefinitely, though.

Comment by the gears to ascension (lahwran) on CBiddulph's Shortform · 2025-03-10T08:04:37.475Z · LW · GW

I buy that training slower is a sufficiently large drawback to break scaling. I still think bees are why the paper got popular. But if intelligence depends on clean representation, interpretability due to clean representation is natively and unavoidably bees. We might need some interpretable-bees insights in order to succeed, it does seem like we could get better regret bound proofs (or heuristic arguments) that go through a particular trained model with better (reliable, clean) interp. But the whole deal is the ai gets to exceed us in ways that make human interpreting stuff inherently (as opposed to transiently or fixably) too slow. To be useful durably, interp must become a component in scalably constraining an ongoing training/optimization process. Which means it's gonna be partly bees in order to be useful. Which means it's easy to accidentally advance bees more than durable alignment. Not a new problem, and not one with an obvious solution, but occasionally I see something I feel like i wanna comment on.

I was a big disagree vote because of induced demand. You've convinced me this paper induces less demand in this version than I worried (I had just missed that it trained slower), but my concern that something like this scales and induces demand remains.

Capabilities -> capabees -> bees

Comment by the gears to ascension (lahwran) on CBiddulph's Shortform · 2025-03-10T03:44:23.848Z · LW · GW

This is just capabilities stuff. I expect that people will use this to train larger networks, as much larger as they can. If your method shrinks the model, it likely induces demand proportionately. In this case it's not new capabilities stuff by you, so it's less concerning, but still. This paper is popular because of bees

Comment by the gears to ascension (lahwran) on How to Make Superbabies · 2025-02-26T10:24:40.542Z · LW · GW

My estimate is 97% not sociopaths, but only about 60% inclined to avoid teaming up with sociopaths.

Germline engineering likely destroys most of what we're trying to save, via group conflict effects. There's a reason it's taboo.

Comment by the gears to ascension (lahwran) on Which things were you surprised to learn are not metaphors? · 2025-02-16T07:21:00.300Z · LW · GW

I suppose that one might be a me thing. I haven't heard others say it, but it was an insight for me at one point that "oh, it hurts because it's an impact". It had the flavor of expecting a metaphor and not getting one.

Comment by the gears to ascension (lahwran) on Venki's Shortform · 2025-02-15T23:02:21.326Z · LW · GW

Your link to "don't do technical ai alignment" does not argue for that claim. In fact, it appears to be based on the assumption that the opposite is true, but that there are a lot of distractor hypotheses for how to do it that will turn out to be an expensive waste of time.

Comment by the gears to ascension (lahwran) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-04T04:02:38.586Z · LW · GW

To be clear, I'm expecting scenarios much more clearly bad than that, like "the universe is almost entirely populated by worker drone AIs and there are like 5 humans who are high all the time and not even in a way they would have signed up for, and then one human who is being copied repeatedly and is starkly superintelligent thanks to boosts from their AI assistants but who had replaced almost all of their preferences with an obsession with growth in order to get to being the one who had command of the first AI, and didn't manage to break out of it using that AI, and then got more weird in rapid jumps thanks to the intense things they asked for help with."

like, the general pattern here being, the crucible of competition tends to beat out of you whatever it was you wanted to compete to get, and suddenly getting a huge windfall of a type you have little experience with that puts you in a new realm of possibility will tend to get massively underused and not end up managing to solve subtle problems.

Nothing like, "oh yeah humanity generally survived and will be kept around indefinitely without significant suffering".

Comment by the gears to ascension (lahwran) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-04T03:59:23.140Z · LW · GW

I mean, we're not going to the future without getting changed by it, agreed. but how quickly one has to figure out how to make good use of a big power jump seems like it has a big effect on how much risk the power jump carries for your ability to actually implement the preferences you'd have had if you didn't rush yourself.

Comment by the gears to ascension (lahwran) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-03T02:10:16.226Z · LW · GW

"all" humans? like, maybe no, I expect a few would survive, but the future wouldn't be human, it'd be whatever distorted things those humans turn into. My core take here is that humans generalize basically just as poorly as we expect AIs to, (maybe a little better, but on a log scale, not much), in terms of their preferences still pointing at the things even they thought they did given a huge increase in power. crown wearing the king, drug seeking behavior, luxury messing up people's motivation, etc. if you solve "make an ai be entirely obedient to a single person", then that person needs to be wise enough to not screw that up, and I trust exactly no one to even successfully use that situation to do what they want, nevermind what others around them want. For an evocative cariacature of the intuition here, see rick sanchez.

Comment by the gears to ascension (lahwran) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-02T15:43:13.069Z · LW · GW

I would guess that the range of things people propose for the shell game is tractable to get a good survey of. It'd be interesting to try to plot out the system as a causal graph with recurrence so one can point to, "hey look, this kind of component is present in a lot of places", and see if one can get that causal graph visualization to show enough that it starts to feel clear to people why this is a problem. I doubt I'll get to this, but if I play with this, I might try to visualize it [edit: probably with the help of a skilled human visual artist to make the whole chart into an evocative comic] with arrays of arrows vaguely like,

a -> b -> c_1  ->  c_1
          ...  ->  ...
          c_n  ->  c_n

          |
          v

          d_1 ... d_n
^
|         |     /
          v    v

f    <-   e
              

where c might be, idk, people's bank accounts or something, d might be people's job decisions, e might be an action by some single person, etc. there's a lot of complexity in the world, but it's finite, and not obviously beyond us to display the major interactions. being able to point to the graph and say "I think there are arrows missing here" seems like it might be helpful. it should feel like, when one looks at the part of the causal graph that contains ones' own behavior, "oh yeah, that's pretty much got all the things I interact with in at least an abstract form that seems to capture most of what goes on for me", and that should be generally true for basically anyone with meaningful influence on the world.

ideally then this could be a simulation that can be visualized as a steppable system. I've seen people make sim visualizations for public consumption - https://ncase.me/, https://www.youtube.com/@PrimerBlobs - it doesn't exactly look trivial to do, but it seems like it'd allow people to grok the edges of normality better to see normality generated by a thing that has grounding, and then see that thing in another, intuitively-possible parameter setup. It'd help a lot with people who are used to thinking about only one part of a system.

But of course trying to simulate abstracted versions of a large fraction of what goes on on earth sounds like it's only maybe at the edge of tractability for a team of humans with AI assistance, at best.
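
If I did get around to it, the steppable version might start as small as something like this (node names mirror the sketch above; every coefficient and update rule is a placeholder, not a claim about which abstractions matter):

    # Toy steppable causal graph with recurrence: a -> b -> c_i -> d_i -> e -> f -> a.
    # The point is just the loop shape, where feedback from f re-enters a next step.
    N = 3  # number of parallel c/d chains

    state = {"a": 1.0, "b": 0.0, "c": [0.0] * N, "d": [0.0] * N, "e": 0.0, "f": 0.0}

    def step(s):
        a = 0.5 * s["a"] + 0.5 * s["f"]            # recurrence: f feeds back into a
        b = 0.9 * a
        c = [0.8 * b + 0.2 * ci for ci in s["c"]]  # each c_i also persists over time
        d = [0.7 * ci for ci in c]
        e = sum(d) / N
        f = 0.6 * e
        return {"a": a, "b": b, "c": c, "d": d, "e": e, "f": f}

    for t in range(5):
        state = step(state)
        print(t, round(state["a"], 3), round(state["f"], 3))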

Comment by the gears to ascension (lahwran) on The Failed Strategy of Artificial Intelligence Doomers · 2025-02-01T01:56:18.309Z · LW · GW

He appears to be arguing against a thing, while simultaneously criticizing people; but I appreciate that he seems to do it in ways that are not purely negative, also mentioning times things have gone relatively well (specifically, updating on evidence that folks here aren't uniquely correct), even if it's not enough to make the rest of his points not a criticism.

I entirely agree with his criticism of the strategy he's criticizing. I do think there are more obviously tenable approaches than the "just build it yourself lol" approach or "just don't let anyone build it lol" approach, such as "just figure out why things suck as quickly as possible by making progress on thousand year old open questions in philosophy that science has some grip on but has not resolved". I mean, actually I'm not highly optimistic, but it seems quite plausible that what's most promising is just rushing to do the actual research of figuring out how to make constructive and friendly coordination more possible or even actually reliably happen, especially between highly different beings like humans and AIs, especially given the real world we actually have now where things suck and that doesn't happen.

Specifically, institutions are dying and have been for a while, and the people who think they're going to set up new institutions don't seem to be competent enough to pull it off, in most cases. I have the impression that institutions would be dying even without anyone specifically wanting to kill them, but that also seems to be a thing that's happening. Solving this is stuff like traditional politics or economics or etc, from a perspective of something like "human flourishing, eg oneself".

Specifically, figuring out how to technically ensure that the network of pressures which keeps humanity very vaguely sane also integrates with AIs in a way that keeps them in touch with us and inclined to help us keep up and participating/actualizing our various individuals' and groups'/cultures' preferences in society as things get crazier, seems worth doing.

Comment by the gears to ascension (lahwran) on The Game Board has been Flipped: Now is a good time to rethink what you’re doing · 2025-01-30T01:31:38.122Z · LW · GW

[Edit: crash found in the conversations referenced, we'll talk more in DM but not in a hurry. This comment retracted for now]

By "AGI" I mean the thing that has very large effects on the world (e.g., it kills everyone) via the same sort of route that humanity has large effects on the world. The route is where you figure out how to figure stuff out, and you figure a lot of stuff out using your figure-outers, and then the stuff you figured out says how to make powerful artifacts that move many atoms into very specific arrangements.

delete "it kills everyone", that's a reasonable definition. "it kills everyone" is indeed a likely consequence a ways downstream, but I don't think it's a likely major action of an early AGI, with the current trajectory of levels of alignment (ie, very weak alignment, very not robust, not goal aligned, certainly not likely to be recursively aligned such that it keeps pointing qualitatively towards good things for humans for more than a few minutes after AIs in charge, but not inclined to accumulate power hard like an instant wipeout. but hey, also, maybe an AI will see this, and go, like, hey actually we really value humans being around, so let's plan trajectories that let them keep up with AIs rather than disempowering them. then it'd depend on how our word meanings are structured relative to each other).

we already have AI that does every qualitative kind of thing you say AIs qualitatively can't do, you're just somehow immune to realizing that for each thing, yes, that'll scale too, modulo some tweaks to get the things to not break when you scale them. requiring the benchmarks to be when the hardest things are solved indicates that you're not generalizing from small to large in a way that allows forecasting from research progress. I don't understand why you don't find this obvious by, eg, simply reading the paper lists of major labs, and skimming a few papers to see what their details are - I tried to explain it in DM and you dismissed the evidence, yet again, same as MIRI folks always have. This was all obvious literally 10 years ago, nothing significant has changed, everything is on the obvious trajectory you get if intelligence is simple, easy, and compute bound. https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine

Comment by the gears to ascension (lahwran) on The Game Board has been Flipped: Now is a good time to rethink what you’re doing · 2025-01-30T01:27:35.443Z · LW · GW

@daniel k I just can never remember your last name's spelling, sorry, heh. My point in saying this is that my prediction approach up to 2020 was similar to, though not as refined as, yours, and that instead of trying to argue my views (which differ from yours in a few trivial ways that are mostly not relevant) I'd rather just point people to those arguments of yours.

Comment by the gears to ascension (lahwran) on The Game Board has been Flipped: Now is a good time to rethink what you’re doing · 2025-01-29T12:21:00.697Z · LW · GW

When predicting timelines, it matters which benchmark in the compounding returns curve you pick. Your definition minus doom happens earlier, even if the minus doom version is too late to avert in literally all worlds (I doubt that, it's likely more that the most powerful humans[1]'s ELO against AIs falls and falls but takes a while to be indistinguishable from zero).

  1. ^

    such as their labs' CEOs, major world leaders, highly skilled human strategists, etc

Comment by the gears to ascension (lahwran) on The Game Board has been Flipped: Now is a good time to rethink what you’re doing · 2025-01-29T11:54:57.824Z · LW · GW

Your definition of AGI is "that which completely ends the game", source in your link. By that definition I agree with you. By others' definition (which is similar but doesn't rely on the game over clause) I do not.

My timelines have gotten slightly longer since 2020, I was expecting TAI when we got GPT4, and I have recently gone back and discovered I have chatlogs showing I'd been expecting that for years and had specific reasons. I would propose Daniel K. is particularly a good reference.

Comment by the gears to ascension (lahwran) on What's Wrong With the Simulation Argument? · 2025-01-19T23:25:22.479Z · LW · GW

I should also add:

I'm pretty worried that we can't understand the universe "properly" even if we're in base physics! It's not yet clearly forbidden that the foundations of philosophy contain unanswerable questions, things where there's a true answer that affects our universe in ways that are not exposed in any way physically, and can only be referred to by theoretical reasoning; which then relies on how well our philosophy and logic foundations actually have the real universe as a possible referent. Even if they do, things could be annoying. In particular, one possible annoying hypothesis would be if the universe is in Turing machines, but is quantum - then in my opinion that's very weird but hey at least we have a set in which the universe is realizable. Real analysis and some related stuff gives us some idea of things that can be reasoned about from within a computation based understanding of structure, but which are philosophically-possibly-extant structures beyond computation, and whether true reality can contain "actual infinities" is a classic debate.

So sims are small potatoes, IMO. Annoying simulators that want to actively mess up our understandings are clearly possible but seem not particularly likely by models I believe right now; seems to me they'd rather just make minds within their own universe; sims are for pretending to be another timeline or universe to a mind you want to instantiate, whatever your reason for that pretense. If we can grab onto possible worlds well enough, and they aren't messing up our understanding on purpose, then we can reason about plausible base realities and find out we're primarily in a sim by making universe sims ourselves and discovering the easiest way to find ourselves is if we first simulate some alien civ or other.

But if we can't even in principle have a hypothesis space which relates meaningfully to what structures a universe could express, then phew, that's pretty much game over for trying to guess at tegmark 4 and who might simulate us in it or what other base physics was possible or exists physically in some sense.

My giving up on incomprehensible worlds is not a reassuring move, just an unavoidable one. Similar to accepting that if you die in 3 seconds, you can't do much about it. Hope you don't, btw.

But yeah currently seems to me that the majority of sim juice comes from civs who want to get to know the neighbors before they meet, so they can prepare the appropriate welcome mat (tone: cynical). Let's send an actualized preference for strong egalitarianism, yeah? (doesn't currently look likely that we will, would be a lot of changes from here before that became likely.)

(Also, hopefully everything I said works for either structural realism or mathematical universe. Structural realism without mathematical universe would be an example of the way things could be wacky in ways permanently beyond the reach of logic, while still living in a universe where logic mostly works.)

Comment by the gears to ascension (lahwran) on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-19T22:24:10.056Z · LW · GW

I think that if our future goes well, it will be because we found ways to align AI well enough, and/or because we coordinated politically to slow or stop AI advancement long enough to accomplish the alignment part

Agree

not because researchers avoided measuring AI's capabilities.

But differential technological development matters, as does making it clear that when you make a capability game like this, you are probably just contributing to capabilities, not doing alignment. I won't say you should never do that, but I'll say that's what's being done. I personally am all in on "we just need to solve alignment as fast as possible". But I've been a capabilities nerd for a while before I was an alignment nerd, and when I see someone doing something that I feel like is accidentally a potentially significant little capabilities contribution, it seems worth pointing out that that's what it is.

Comment by the gears to ascension (lahwran) on quetzal_rainbow's Shortform · 2025-01-19T08:52:36.564Z · LW · GW

Decision theory as discussed here heavily involves thinking about agents responding to other agents' decision processes

Comment by the gears to ascension (lahwran) on What's Wrong With the Simulation Argument? · 2025-01-19T08:34:05.945Z · LW · GW

Sims are very cheap compared to space travel, and you need to know what you're dealing with in quite a lot of detail before you fly because you want to have mapped the entire space of possible negotiations in an absolutely ridiculous level of detail.

Sims built for this purpose would still be a lot lower detail than reality, but of course that would be indistinguishable from inside if the sim is designed properly. Maybe most kinds of things despawn in the sim when you look away, for example. Only objects which produce an ongoing computation that has influence on the resulting civ would need modeling in detail. Which I suspect would include every human on earth, due to small world effects, the internet, sensitive dependence on initial conditions, etc. Imagine how time travel movies imply the tiniest change can amplify - one needs enough detail to have a good map of that level of thing. Compare weather simulation.

Someone poor in Ghana might die and change the mood of someone working for ai training in Ghana, which subtly affects how the unfriendly AI that goes to space and affects alien civs is produced, or something. Or perhaps there's an uprising when they try to replace all human workers with robots. Modeling what you thought about now helps predict how good you'll be at the danceoff in your local town which affects the posts produced as training data on the public internet. Oh, come to think of it, where are we posting, and on what topic? Perhaps they needed to model your life in enough detail to have tight estimates of your posts, because those posts affect what goes on online.

But most of the argument for continuing to model humans seems to me to be the sensitive dependence on initial conditions, because it means you need an unintuitively high level of modeling detail in order to estimate what von Neumann probe wave is produced.

Still cheap - even in base reality earth right now is only taking up a little more energy than its tiny silhouette against the sun's energy output in all directions. A kardashev 2 civ would have no problem fuelling an optimized sim with a trillion trillion samples of possible aliens' origin processes. Probably superintelligent kardashev 1 even finds it quite cheap, could be less than earth's resources to do the entire sim including all parallel outcomes.
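
The back-of-envelope behind "tiny silhouette", using standard rounded constants:

    import math

    # Fraction of the Sun's total output that Earth's disk intercepts.
    R_EARTH = 6.371e6   # m
    AU = 1.496e11       # m, Earth-Sun distance
    L_SUN = 3.828e26    # W, total solar luminosity

    fraction = (math.pi * R_EARTH**2) / (4 * math.pi * AU**2)
    print(f"fraction of solar output hitting Earth: {fraction:.1e}")  # ~4.5e-10
    print(f"power intercepted: {fraction * L_SUN:.1e} W")             # ~1.7e17 W

So earth's whole current energy budget is about half a billionth of one star's output - rounding error for anything star-scale.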

Comment by the gears to ascension (lahwran) on What's Wrong With the Simulation Argument? · 2025-01-19T01:14:27.726Z · LW · GW

We have to infer how reality works somehow.

I've been poking at the philosophy of math recently. It really seems like there's no way to conceive of a universe that is beyond the reach of logic except one that also can't support life. Classic posts include unreasonable effectiveness of mathematics, what numbers could not be, a few others. So then we need epistemology.

We can make all sorts of wacky nested simulations, and any interesting ones - ones that can support organisms (that is, ones that are Turing complete) - can also support processes for predicting outcomes in that universe, and those processes appear to necessarily need to do reasoning about what is "simple" in some sense in order to work. So that seems to hint that algorithmic information theory isn't crazy (unless I just hand waved over a dependency loop, which I totally might have done, it's midnight), which means that we can use the equivalence of Turing complete structures to assume we can infer things about the universe. Maybe not Solomonoff induction, but some form of empirical induction. And then we've justified ordinary reasoning about what's simple.

Okay, so we can reason normally about simplicity. What universes produce observers like us and arise from mathematically simple rules? Lots of them, but it seems to me the main ones produce us via base physics, and then because there was an instance in base physics, we also get produced in neighboring civilizations' simulations of what other things base physics might have done in nearby galaxies so as to predict what kind of superintelligent aliens they might be negotiating with before they meet each other. Or, they produce us by base physics, and then we get instantiated again later to figure out what we did. Ancestor sims require very good outcomes which seem rare, so those branches are lower measure anyway, but also ancestor sims don't get to produce super ai separate from the original causal influence.

Point is, no, what's going on in the simulations is nearly entirely irrelevant. We're in base physics somewhere. Get your head out of the simulation clouds and choose what you do in base physics, not based on how it affects your simulators' opinion of the simulation's moral valence. Leave that sort of crazy stuff to friendly ai, you can't understand superintelligent simulators which we can't even get evidence exist besides plausible but very galaxy brain abstract arguments.

(Oh, might be relevant that I'm a halfer when making predictions, thirder when choosing actions - see anthropic decision theory for an intuition on that.)

Comment by the gears to ascension (lahwran) on What's Wrong With the Simulation Argument? · 2025-01-18T10:38:19.690Z · LW · GW

If we have no grasp on anything outside our virtualized reality, all is lost. Therefore I discard my attempts to control those possible worlds.

However, the simulation argument relies on reasoning. For it to go through requires that a number of assumptions hold. Those in turn rely on the question: why would we be simulated? It seems to me the main reason is that we're near a point of high influence in original reality and they want to know what happened - the simulations then are effectively extremely high resolution memories. Therefore, thank those simulating us for the additional units of "existence", and focus on original reality where there's influence to be had; that's why alien or our own future superintelligences would care what happened.

https://arxiv.org/pdf/1110.6437

Basically, don't freak out about simulations. It's not that different from the older concept "history is watching you". Intense, but not world shatteringly intense.

Comment by the gears to ascension (lahwran) on Daniel Tan's Shortform · 2025-01-18T03:29:17.837Z · LW · GW

willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I'll try some more ideas.

Comment by the gears to ascension (lahwran) on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-18T03:27:19.064Z · LW · GW

if it's a fully general argument, that's a problem I don't know how to solve at the moment. I suspect it's not, but that the space of unblocked ways to test models is small. I've been bouncing ideas about this around out loud with some folks over the past day; possibly someone will show up soonish with an idea for how to constrain which benchmarks are worth making. but the direction I see as maybe promising is: what makes a benchmark reliably suck as a bragging-rights challenge?

Comment by the gears to ascension (lahwran) on Daniel Tan's Shortform · 2025-01-17T19:46:50.673Z · LW · GW

Partially agreed. I've tested this a little personally; Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren't that good; the latter ones were somewhat okay: I asked Claude to say the same thing across 10 branches and then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.
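
Roughly the shape of that second test, as a sketch (continue_chat is a hypothetical stand-in for resampling the assistant's reply from a shared context; swap in real API calls to actually run it):

    from collections import Counter

    def continue_chat(context: str, prompt: str) -> str:
        # Hypothetical placeholder; replace with a real branching API call.
        return "(sampled reply)"

    def empirical_distribution(context: str, question: str, n: int = 10) -> Counter:
        # Branch the same context n times and tally what the model actually says.
        return Counter(continue_chat(context, question) for _ in range(n))

    def predicted_distribution(context: str, question: str) -> str:
        # Ask a separate branch of the same context to predict that distribution.
        ask = ("Without answering it, predict the distribution of answers you would "
               "give across 10 independent branches to: " + question)
        return continue_chat(context, ask)

    ctx = "(shared conversation so far)"
    q = "(the question being branched on)"
    print(empirical_distribution(ctx, q))
    print(predicted_distribution(ctx, q))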