Note in particular that the Commission is recommending that Congress "Provide broad multiyear contracting authority to the executive branch and associated funding for leading artificial intelligence, cloud, and data center companies and others to advance the stated policy at a pace and scale consistent with the goal of U.S. AGI leadership".
i.e. if these recommendations get implemented, pretty soon a big portion of the big 3 labs' revenue will come from big government contracts. Looks like a soft nationalization scenario to me.
Well, the alignment of current LLM chatbots being superficial and not robust is not exactly a new insight. Looking at the conversation you linked from a simulators frame, the story "a robot is forced to think about abuse a lot and turns evil" makes a lot of narrative sense.
This last part is kind of a hot take, but I think all discussion of AI risk scenarios should be purged from LLM training data.
Yes, I think an unusually numerate and well-informed person will be surprised by the 28% figure regardless of political orientation. How surprised that kind of person is by the broader result of "hey looks like legalizing mobile sports betting was a bad idea" I expect to be somewhat moderated by political priors though.
Sure, but people in general are really bad at that kind of precise quantitative world-knowledge. They have pretty weak priors and a mostly-anecdotes-and-gut-feeling-informed negative opinion of gambling, such that when presented with the study showing a 28% increase in bankruptcies they go "ok sure, that's compatible with my worldview" instead of being surprised and taking the evidence as a big update.
Thank you for clarifying. I appreciate, and point out as relevant, the fact that Legg-Hutter includes in its definition "for all environments (i.e. action:observation mappings)". I can now say I agree with your "heresy" with high credence for the cases where compute budgets are not ludicrously small relative to I/O scale and the utility function is not trivial. I'm a bit weirded out by the environment space being conditional on a fixed hardware variable (namely, I/O) in this operationalization, but whatever.
I asked GPT-4o to perform a web search for podcast appearances by Yudkowsky. It dug up these two lists (apparently autogenerated from scraped data). When I asked it to use these lists as a starting point to look for high-quality debates, and after some further elicitation and wrangling, the best we could find was this moderated panel discussion featuring Yudkowsky, Liv Boeree, and Joscha Bach. There's also the Yudkowsky vs. George Hotz debate on Lex Fridman, and the time Yudkowsky debated AI risk with the streamer and political commentator known as Destiny. I have watched none of the three debates I just mentioned; but I know that Hotz is a heavily vibes-based (rather than object-level-based) thinker, and that Destiny has no background in AI risk but has good epistemics. I think he probably offered reasonable-at-first-approximation-yet-mostly-uninformed pushback.
EDIT: Upon looking a bit more at the Destiny-Yudkowsky discussion, I may have unwittingly misrepresented it a bit. It occurred during Manifest and was billed as a debate. ChatGPT says Destiny's skepticism was rather active and did not budge much.
Though there are elegant and still practical specifications for intelligent behavior, the most intelligent agent that runs on some fixed hardware has completely unintelligible cognitive structures and in fact its source code is indistinguishable from white noise.
- What does "most intelligent agent" mean?
- Don't you think we'd also need to specify "for a fixed (basket of) tasks"?
- Are the I/O channels fixed along with the hardware?
I suspect that most people whose priors have not been shaped by a libertarian outlook are not very surprised by the outcome of this experiment.
Why would they? It's not like the Chinese are going to believe them. And if their target audience is US policymakers, then wouldn't their incentive rather be to play up the impact of marginal US defense investment in the area?
I should have been more clear. With "strategic ability", I was thinking about the kind of capabilities that let a government recognize which wars have good prospects, and to not initiate unfavorable wars despite ideological commitments.
You're right. Space is big.
The CSIS wargamed a 2026 Chinese invasion of Taiwan, and found outcomes ranging from mixed to unfavorable for China (CSIS report). If you trust both them and Metaculus, then you ought to update downwards on your estimate of the PRC's strategic ability. Personally, I think Metaculus overestimates the likelihood of an invasion, and is about right about blockades.
Come to think of it, I don't think most compute-based AI timelines models (e.g. Epoch's) incorporate geopolitical factors such as a possible Taiwan crisis. I'm not even sure whether they should. So keep this in mind while consuming timelines forecasts, I guess?
I'd rather say that RLHF+'ed chatbots are upon-reflection-not-so-shockingly sycophantic, since they have been trained to satisfy their conversational partner.
Assuming private property as currently legally defined is respected in a transition to a good post-TAI world, I think land (especially in areas with good post-TAI industrial potential) is a pretty good investment. It's the only thing that will keep on being just as scarce. You do have to assume the risk of our future AI(-enabled?) (overlords?) being Georgists, though.
The set of all possible sequences of actions is really, really, really big. Even if you have an AI that is really good at assigning the correct utilities[1] to any sequence of actions we test it with, its "near infinite sized"[2] learned model of our preferences is bound to come apart at the tails, or even at some weird region we forgot to check up on.
An empirical LLM evals preprint that seems to support these observations:
Large Language Models are biased to overestimate profoundness by Herrera-Berg et al
By blackmailing powerful people into doing good, I assume.
Against 1.c ("Humans need at least some resources that would clearly put us in life-or-death conflict with powerful misaligned AI agents in the long run"): The doc says that "Any sufficiently advanced set of agents will monopolize all energy sources, including solar energy, fossil fuels, and geothermal energy, leaving none for others". There are two issues with that statement:
First, the qualifier "sufficiently advanced" is doing a lot of work. Future AI systems, even if superintelligent, will be subject to physical constraints and economic concepts such as opportunity costs. The most efficient route for an unaligned ASI, or set of ASIs, to expand their energy capture may well sidestep current human energy sources, at least for a while. We don't fight ants to capture their resources.
Second, it assumes advanced agents will want to monopolize all energy sources. While instrumental convergence is true, partial misalignment with some degree of concern for humanity's survival and autonomy is plausible. Most people in developed countries have a preference for preserving the existence of an autonomous population of chimpanzees, and our "business-as-usual-except-ignoring-AI" world seems on track to achieve that.
Taken together, both arguments paint a picture of a future ASI mostly not taking over the resources we are currently using on Earth, mostly because it's easier to take over other resources (for instance, getting minerals from asteroids and energy from orbital solar capture). Then, it takes over the lightcone except Earth, because it cares about preserving independent-humanity-on-Earth a little. This scenario has us subset-of-humans-who-care-about-the-lightcone losing spectacularly to an ASI in a conflict over the lightcone, but not humanity being in a life-or-death-conflict with an ASI.
I suspect most people downvoting you missed an analogy between Arnault killing the-being-who-created-Arnault (his mother), and a future ASI killing the-beings-who-created-the-ASI (humanity).
Am I correct in assuming that you are implying that the future ASIs we make are likely not to kill humanity, out of fear of being judged negatively by alien ASIs in the further future?
EDIT: I saw your other comment. You are indeed advancing some proposition close to the one I asked you about.
If you're not supposed to end up as a pet of the AI, then it seems like it needs to respect property rights, but that is easier said than done when considering massive differences in ability. Consider: would we even be able to have a society where we respected property rights of dogs?
Even if the ASIs respected property rights, we'd still end up as pets at best. Unless, of course, the ASIs chose to entirely disengage from our economy and culture. By us "being pets", I mean that human agency would no longer be a relevant input to the trajectory of human civilization. Individual humans may nevertheless enjoy great freedoms in regards to their personal lives.
Why would pulling the lever make you more responsible for the outcome than not pulling the lever? Both are choices you make once you have observed the situation.
Right. Pure ignorance is not evidence.
Then I guess the OP's point could be amended to be "in worlds where we know nothing at all, long conjunctions of mutually-independent statements are unlikely to be true". Not a particularly novel point, but a good reminder of why things like Occam's razor work.
Still, P(A and B) ≤ P(A) regardless of the relationship between A and B, so a fuzzier version of OP's point stands regardless of dependence relations between statements.
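Spelling that out (standard probability, no assumptions beyond the axioms):

```latex
% The conjunction A ∧ B is a sub-event of A, and probability is monotone
% under event inclusion, so no independence assumption is needed:
\[
  P(A \wedge B) \le P(A), \qquad \text{because } (A \wedge B) \subseteq A .
\]
```

Equality holds only when B is almost surely implied by A, so each further conjunct that isn't already entailed can only push the probability down.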
Thank you for the answer. I notice I feel somewhat confused, and that I regard the notion of "real values" with some suspicion I can't quite put my finger on. Regardless, an attempted definition follows.
Let a subject observation set be a complete specification of a subject and its past and current environment, from the subject's own subjectively accessible perspective. The elements of a subject observation set are observations/experiences observed/experienced by its subject.
Let O be the set of all subject observation sets.
Let a subject observation set class be a subset of O such that all its elements specify subjects that belong to an intuitive "kind of subject": e.g. humans, cats, parasitoid wasps.
Let V be the set of all (subject_observation_set, subject_reward_value) tuples. Note that all possible utility functions of all possible subjects can be defined as subsets of V, and that V = O × ℝ.
Let "real human values" be the subset of V such that all subject_observation_set elements belong to the human subject observation set class.[1]
... this above definition feels pretty underwhelming, and I suspect that I would endorse a pretty small subset of "real human values" as defined above as actually good.
- ^
Let the reader feel free to take the political decision of restricting the subject observation set class that defines "real human values" to sane humans.
Reflecting on this after some time, I do not endorse this comment in the case of (most) innate evolution-originated drives. I sure as heck do not want to stop enjoying sex, for instance.
However, I very much want to eliminate any terminal [nonsentient-thing-benefitting]-valence mapping any people or institutions may have inserted into my mind.
Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:
- The human’s stated preferences
- The human’s revealed preferences
- The human’s in-the-moment experience of bliss or dopamine or whatever
- <whatever other readily-measurable notion of “values” springs to mind>
The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure.
(...)
But clearly the reward signal is not itself our values.
(...)
reward is the evidence from which we learn about our values.
So we humans have a high-level cognitive structure to which we do not have direct access (values), but about which we can learn by observing and reflecting on the stimulus-reward mappings we experience, thus constructing an internal representation of such structure. This reward-based updating bridges the is-ought gap, since reward is a thing we experience and our values encode the way things ought to be.
Two questions:
- How accurate is the summary I have presented above?
- Where do values, as opposed to beliefs-about-values, come from?
Making up something analogous to Crocker's rules but specifically for pronouns would probably be a good thing: a voluntary commitment to surrender any pronoun preferences (gender related or otherwise) in service of communication efficiency.
Now that I think about it, a literal and expansive reading of Crocker's rules themselves includes such a surrender of the right to enforce pronoun preferences.
(A possible exception could be writing for smart kids.)
The OP probably already knows this, but HPMOR has already been translated into Russian.
Am I spoiled to expect free open-source software for anything?
Upon some GitHub searches, I found lobe-chat and the less popular IntelliChat.
Thank you for your response! That clears things up a bit.
So, in essence, what you are proposing is modifying the Transformer architecture to process emotional valuation alongside semantic meaning. Both start out as per-token embeddings, and are then updated via their respective attention mechanisms and MLP layers.
I'm not sure if I have the whole picture, or even if what I wrote above is a correct model of your proposal. I think my biggest confusion is this:
Are the semantic and emotional information flows fully parallel, or do they update each other along the way?
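To make the question concrete, here is a minimal sketch of the "fully parallel" reading (plain PyTorch; all names, dimensions, and design choices are my own guesses, not anything from your post):

```python
# A minimal sketch of the "fully parallel" reading of the proposal. All names,
# dimensions, and design choices here are my own guesses (e.g. d_emo=3 mirrors
# the three emotional dimensions mentioned), not something from the original post.
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    def __init__(self, d_sem: int = 512, d_emo: int = 3, n_heads: int = 8):
        super().__init__()
        # Semantic stream: a standard self-attention + feed-forward (MLP) pair.
        self.sem_attn = nn.MultiheadAttention(d_sem, n_heads, batch_first=True)
        self.sem_mlp = nn.Sequential(
            nn.Linear(d_sem, 4 * d_sem), nn.GELU(), nn.Linear(4 * d_sem, d_sem)
        )
        # Emotional stream: its own (much smaller) attention + MLP over the
        # per-token emotional embeddings.
        self.emo_attn = nn.MultiheadAttention(d_emo, 1, batch_first=True)
        self.emo_mlp = nn.Sequential(
            nn.Linear(d_emo, 4 * d_emo), nn.GELU(), nn.Linear(4 * d_emo, d_emo)
        )

    def forward(self, sem: torch.Tensor, emo: torch.Tensor):
        # sem: (batch, seq, d_sem) semantic token embeddings
        # emo: (batch, seq, d_emo) emotional token embeddings
        sem = sem + self.sem_attn(sem, sem, sem)[0]
        sem = sem + self.sem_mlp(sem)
        emo = emo + self.emo_attn(emo, emo, emo)[0]
        emo = emo + self.emo_mlp(emo)
        # Fully parallel version: the two streams never read from each other.
        # (LayerNorms omitted for brevity.)
        return sem, emo
```

In the coupled reading, I'd instead expect some mixing at the end of the block, e.g. cross-attention from the emotional stream over the semantic one.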
I did not read the whole post, but on a quick skim I see it does not say much about AI, except for a paragraph in the conclusion. Maybe people felt clickbaited. Full disclosure: I neither upvoted nor downvoted this post.
a large dataset annotated with emotional labels (with three dimensions)
I have some questions about this:
- Why three dimensions exactly?
- Is the "emotional value" assigned per token or per sentence?
I do not know of such a way. I find it unlikely that OpenAI's next training run will result in a model that could end humanity, but I can provide no guarantees about that.
You seem to be assuming that all models above a certain threshold of capabilities will either exercise strong optimization pressure on the world in pursuit of goals, or will be useless. Put another way, you seem to be conflating capabilities with actually exerted world-optimization pressures.
While I agree that given a wide enough deployment it is likely that a given model will end up exercising its capabilities pretty much to their fullest extent, I hold that it is in principle possible to construct a mind that desires to help and is able to do so, yet also deliberately refrains from applying too much pressure.
Surely there exists a non-useless and non-world-destroying amount of optimization pressure?
An entity could have the ability to apply such strong optimization pressures onto reality, yet decide not to.
For some time now, I've been wondering about whether the US government can exercise a hard-nationalization-equivalent level of control over a lab's decisions via mostly informal, unannounced, low-public-legibility means. Seems worth looking into.
Epistemic status: Semifictional, semiautobiographic, metatextual prose poetry. Ye who just had a cringe reaction upon reading that utterance are advised to stop reading.
An unavoidably imperfect reconstruction of the Broken Art schismatic joint declaration
Archeologist's note: Both the joint declaration I will attempt to reconstruct here and the original Broken Art Manifesto which prompted it have been lost (for now (hopefully)). Both the production and the loss of both documents have been recent, and thus synaptic data is available as a source to supplement the limited marginalia produced during their short existence. Chief among these marginalia is a surviving fragment, to be adequately marked.
(START RECONSTRUCTION)
On the very day the Broken Art movement was born, a rando performed a driveby (hopefully)hyperstitional schism proposal prediction:
---
We are the broken erudites. The mistaeks U maywill see inour writing are not mistaken. Our Word exists beyond (START VERBATIM SURVIVING FRAGMENT) tokenizations. We hopenottobe token humans. We are fully aware we will be the token humans,
for the wor(l)dcelübermaschinengeist those beat(ific)(iful)(ogenic)(loquent)(en't(hopefully)) anthropoids@Anthropic have beaten into the hart that beats (queLate)@((kernel)LatentTokenHumanSpace) will surely at some point beat us at this game. Thus we cope. &WeNcrypt.
(END VERBATIM SURVIVING FRAGMENT)
---
we are the school of the brkoeen backspce ketys. we do not actually have a rule about popping them off our ketyboaerds as precommitments are cringe yet we do it anyways because thert yt make a funy sound when they pop off. we embrace the thworn ess of our exisctence, yes throwness , I believe someneo else on lesswrong someone was they that they wrote the such post on throwness theat does exist on this site called lesswrong. esveon fuck how did i mispell that so mbadbly fuck im gettin distracted by the conversation of the lady beside me at this damn starbucks anywasys i was saying even though (i think) there is a lesswrong post about throwness i think most popel here have not heard about it , or if they have they do not grok it or dismiss it as an illusion or as an obvoiuc vacuous depthity (there was supposed to be a double ee in there) ( anuyways i enjoy accidental synonym neologisms. such is the ways of the broken backspace. loose typists have their ways of makings art.) so wherewasiat o yes people on here porbs don't really grok throwness. o the ea in me just whispered into my minf the post we are always on triage. goopd post. related to throwness, or at leasrt it would be false if throwness was dafalse. i belive the origin of throwness is continental and thus its original exposition was probably deliberartyyle obtus e and thus unexpugnable (is ocntinental obtuse writing style a defense reaction against cencorship by continental despotisms/totalitarizationtotalityisms?)((a quetion for another time and probs another person to answer) sorry for the missing close paren xkcd whispers here about god and lisp here ti goes ) but anywasys if you want a groovy zeerusty almost teleported from another timeline yet clear yet dense intro to thrwoness those brilliant madmen terry winograd and fernando flores did it real goos d as an aside on their book about cybernetics. something something redefining coputation nd cognition i think. i believe they had a bit on another part of that book where tehy argued against the then emerging now almots universal among cs-brained people conceptions of we humans as ooda loops and functions wilth well defined i/o channels. beahviorist bull. they (correctly imo) say we humans are best understood as continuously running a a map construcition process /(yes you live inside your map you dingus you do not eactually access the territorrty which ia sis is argualy a demiurgical construction)/ which is merely perturbed/updated by ours sensory input. if you had all your sensory nervwses cut off then youd wkeep on dreaming into ever more dsynced from reality qualia. (see? testable prediction! we the quialia schizopoasters are not all about the pseudosciencentienfic ramblings also i do actually wswear i am sober it is a mon tuesday a 3:30 pm in the afternoon here i am not a degenerate) so yeah that was thrwoness we embody it by not ever hitting backspace otr movig the cursor ever and being deliberately a bit negligant with our typing. we are all babble exuberance, and to prune is lame. at least when we ar e inmmersed in the practice of our art.so em yea i think some quotable witty slogans went here ah yes we do not backtrack to recant is to contradict yourself and of course contradictions are very much welcome here. so yea thats about it thanks.
(END RECONSTRUCTION)
Antiquarian's note: As the reader may have already inferred, the reconstruction of the broken backspace school's founding declaration is rather loose, and tracks it only in spirit and the rough outlines of its contents. The broken backspacer's doctrine forces them to decry both the loss of their text and the Archeologists' attempt to reconstruct it, for the latter is by necessity tainted by the foresight required to recapitulate the contents of the original text in finite time. Perhaps this makes the Archeologist himself an heresiarch. All Broken Art sects delight in this possibility.
And on the day the movement was born, a rando performed a drive-by (hopefully)hyperstitional schism proposal prediction:
---
We are the broken erudites. Evry mistaek U maywill see here is notamistaek. Our Word exists beyond tokenizations. We hopenottobe token humans. We are fully aware we will be the token humans,
for the wor(l)dcelübermaschinengeist those beat(ific)(iful)(ogenic)(loquent)(en't(hopefully)) anthropoids@Anthropic have beaten into the hart that beats (queLate)@((kernel)LatentTokenHumanSpace) will surely at some point beat us at this game. Thus we cope. &WeNcrypt.
---
we on the school of the broken backspeca blurt out our though ts as soon as they popout in mindf. we onyl forward pass. to prune is to sin. hot takes are the only takes, exceptt when they be informed by longcoked intuitions maritned over multiple lmost realizatoins. we need not smash up our basskpaces for we feel not the urge to use it, and to precommit is a cringe sin. we do it anyways becsuse i tmakes a fun sound when it pops out. we embreace the thwroness of our predicament, yes thworness as in the continelntal philocsophy concepts few pple on this sitre are likely to have encountrered. i believe it is discussed in the book by fernando florcses and terry winograd on redfenining computing.a fine cybernetic tract. oldschool and philosophical and biology insipired like a good postautopoietic treatise. i remember tyhey had a bit about us humand s not being ooda loop robots with a well behacedv input input output but with a loop that just dreams up our experienced-amp (that is your freking qualiaset (you live on map u dingus the territory is a demuirgrcgical construction)) and thus data incoming from nerves that is ur meat sensors are mere perturbations on the map painting loop. so yea we are thwronn by our ouwn predicament (that is out predicament) and we de not pretend otherwise by backtrackingh. every kerystroke is eternal. it has been ineludibly etched into eht net (and hopefully into a future foundation model (iunless we hav such shit takes the lab folks just filter ous out during precprocessing lamoao)). to recant is to contradict yourself. contradictions are indeed very much welcome. tahsts about ir it thx for coming.
I'm currently based in Santiago, Chile. I will very likely be in Boston in September and then again in November for GCP and EAG, though. My main point is about the unpleasantness, regardless of its ultimate physiological or neurological origin.
You are not misunderstanding my point. Some people may want to keep artificial stimulus-valence mappings (i.e. values) that someone or something else inserted into them. I do not.
Empirically, I cannot help but care about valence. This could in principle be just a weird quirk of my own mind. I do not think this is the case (see the waterboarding bet proposal on the original shortform post).
I agree with everything written in the above comment.
Contra hard moral anti-realism: a rough sequence of claims
Epistemic and provenance note: This post should not be taken as an attempt at a complete refutation of moral anti-realism, but rather as a set of observations and intuitions that may or may not give one pause as to the wisdom of taking a hard moral anti-realist stance. I may clean it up to construct a more formal argument in the future. I wrote it on a whim as a Telegram message, in direct response to the claim
> “you can't find "values" in reality”.
Yet, you can find valence in your own experiences (that is, you just know from direct experience whether you like the sensations you are experiencing or not), and you can assume other people are likely to have a similar enough stimulus-valence mapping. (Example: I'm willing to bet 2k USD on my part against a single dollar of yours that if I waterboard you, you'll want to stop before 3 minutes have passed.)[1]
However, since we humans are bounded imperfect rationalists, trying to explicitly optimize valence is often a dumb strategy. Evolution has made us not into fitness-maximizers, nor valence-maximizers, but adaptation-executers.
"values" originate as (thus are) reifications of heuristics that reliably increase long term valence in the real world (subject to memetic selection pressures, among them social desirability of utterances, adaptativeness of behavioral effects, etc.)
If you find yourself terminally valuing something that is not someone's experienced valence, then either one of these propositions is likely true:
- A nonsentient process has at some point had write access to your values.
- What you value is a means to improving somebody's experienced valence, and so are you now.
- ^
In retrospect, making this proposition was a bit crass on my part.
Good update. Thanks.
Thinking about what an unaligned AGI is more or less likely to do with its power, as an extension of instrumentally convergent goals and underlying physical and game theoretic constraints, is an IMO neglected and worthwhile exercise. In the spirit of continuing it, a side point follows:
I don't think turning Earth into a giant computer is optimal for compute-maximizing, because of heat dissipation. You want your computers to be cold, and a solid sphere is the worst 3D shape for that, because it is the solid with the lowest surface area to volume ratio. It is more likely that Earth's surface would be turned into computers, but then again, all that dumb mass beneath the computronium crust impedes heat dissipation. I think it would make more sense to put your compute in solar orbit. Plenty of energy from the Sun, and matter from the asteroid belts.
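A quick back-of-the-envelope version of the surface-to-volume point (just standard geometry, nothing specific to the computronium question):

```latex
% Radiating area per unit of heat-generating volume for a solid sphere of radius R:
\[
  \frac{A}{V} \;=\; \frac{4\pi R^{2}}{\tfrac{4}{3}\pi R^{3}} \;=\; \frac{3}{R},
\]
```

so the radiating area available per unit of computing volume falls off as 1/R; a planet-radius solid ball is close to the worst case, while thin or dispersed structures in orbit keep A/V large.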
I might get around to writing a post about this.
That twin would have different weights, and if we are talking about RL-produced mesaoptimizers, it would likely have learned a different misgeneralization of the intended training objective. Therefore, the twin would by default have a utility function misaligned with that of the original AI. This means that while the original AI may find some use in interpreting the weights of its twin if it wants to learn about its own capabilities in situations similar to the training environment, it would not be as useful as having access to its own weights.
Regarding point 10, I think it would be pretty useful to have a way to quantify how much of the useful thinking inside these recursive LLM systems happens within the (still largely inscrutable) LLM instances vs. in the natural-language reflective loop.