LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Chess As The Model Game
criticalpoints · 2024-11-17T19:45:26.499Z · comments (0)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
tailcalled · 2024-09-13T17:09:25.407Z · comments (40)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)

Why Reflective Stability is Important
Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z · comments (2)

[link] AI & Liability Ideathon
Kabir Kumar (kabir-kumar) · 2024-11-26T13:54:01.820Z · comments (2)

[link] Should Sports Betting Be Banned?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-21T14:13:35.404Z · comments (2)

Can Large Language Models effectively identify cybersecurity risks?
emile delcourt (emile-delcourt) · 2024-08-30T20:20:21.345Z · comments (0)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

Avoiding the Bog of Moral Hazard for AI
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-13T21:24:34.137Z · comments (13)

"Real AGI"
Seth Herd · 2024-09-13T14:13:24.124Z · comments (20)

Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)

Announcing the CLR Foundations Course and CLR S-Risk Seminars
JamesFaville (elephantiskon) · 2024-11-19T01:18:10.085Z · comments (0)

In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)

OpenAI defected, but we can take honest actions
Remmelt (remmelt-ellen) · 2024-10-21T08:41:25.728Z · comments (16)

Automating LLM Auditing with Developmental Interpretability
htlou · 2024-09-04T15:50:04.337Z · comments (0)

[question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · answers+comments (7)

[link] Four Levels of Voting Methods
hive · 2024-09-26T18:15:00.565Z · comments (3)

[link] Instruction Following without Instruction Tuning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-24T13:49:09.078Z · comments (0)

Using Dangerous AI, But Safely?
habryka (habryka4) · 2024-11-16T04:29:20.914Z · comments (2)

Proposal to increase fertility: University parent clubs
Fluffnutt (Pear) · 2024-11-18T04:21:26.346Z · comments (3)

Long Live the Usurper
pleiotroth · 2024-11-27T12:10:51.025Z · comments (0)

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

[link] some questionable space launch guns
bhauth · 2024-10-13T22:52:26.418Z · comments (0)

My career exploration: Tools for building confidence
lynettebye · 2024-09-13T11:37:55.843Z · comments (0)

Is Text Watermarking a lost cause?
egor.timatkov · 2024-10-01T16:20:51.113Z · comments (13)

[link] Jonothan Gorard:The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

[question] Is this voting system strategy proof?
Donald Hobson (donald-hobson) · 2024-09-06T20:44:46.691Z · answers+comments (9)

Interview with Robert Kralisch on Simulators
WillPetillo · 2024-08-26T05:49:15.543Z · comments (0)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

Heresies in the Shadow of the Sequences
Cole Wyeth (Amyr) · 2024-11-14T05:01:11.889Z · comments (12)

[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)

[question] What epsilon do you subtract from "certainty" in your own probability estimates?
Dagon · 2024-11-26T19:13:46.795Z · answers+comments (6)

Reducing global AI competition through the Commerce Control List and Immigration reform: a dual-pronged approach
Ben Smith (ben-smith) · 2024-09-03T05:28:24.549Z · comments (2)

Hiring a writer to co-author with me (Spencer Greenberg for ClearerThinking.org)
spencerg · 2024-10-27T17:34:50.479Z · comments (0)

Appealing to the Public
jefftk (jkaufman) · 2024-10-23T19:00:07.669Z · comments (0)

[link] Why good things often don’t lead to better outcomes
DMMF · 2024-09-19T16:37:07.778Z · comments (1)

[link] My lukewarm take on GLP-1 agonists
George3d6 · 2024-08-26T12:34:27.929Z · comments (0)

Slave Morality: A place for every man and every man in his place
Martin Sustrik (sustrik) · 2024-09-19T04:20:04.491Z · comments (7)

[question] Is there a CFAR handbook audio option?
FinalFormal2 · 2024-10-26T17:08:36.480Z · answers+comments (0)

[question] Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?
SpectrumDT · 2024-11-04T15:20:14.822Z · answers+comments (49)

[link] A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-11-20T11:48:14.170Z · comments (0)

Physical Therapy Sucks (but have you tried hiding it in some peanut butter?)
Declan Molony (declan-molony) · 2024-09-10T05:54:47.000Z · comments (12)

[link] Every niche event should also be a meetup
DMMF · 2024-11-19T20:47:50.053Z · comments (0)

Evolutionary prompt optimization for SAE feature visualization
neverix · 2024-11-14T13:06:49.728Z · comments (0)

Review: Dr Stone
ProgramCrafter (programcrafter) · 2024-09-29T10:35:53.175Z · comments (4)

Two arguments against longtermist thought experiments
momom2 (amaury-lorin) · 2024-11-02T10:22:11.311Z · comments (5)

[link] Where is the Learn Everything System?
Shoshannah Tekofsky (DarkSym) · 2024-09-27T21:30:16.379Z · comments (8)

Announcing the Ultimate Jailbreaking Championship
InnerHufflepuff (grayswan) · 2024-09-04T00:35:31.234Z · comments (1)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

sil-ver on Is the mind a program?

Sort of not obvious what exactly "causal closure" means if the error tolerance is not specified. We could differentiate literally 100% perfect causal closure, almost perfect causal closure, and "approximate" causal closure. Literally 100% perfect causal closure is impossible for any abstraction due to every electron exerting nonzero force on any other electron in its future lightcone. Almost perfect causal closure (like 99.9%+) might be given for your laptop if it doesn't have a wiring issue(?), maybe if a few more details are included in the abstraction. And then whether or there exists an abstraction for the brain with approximate causal closure (95% maybe?) is an open question.

I'd argue that almost perfect causal closure is enough for consciousness, and approximate causal closure probably as well. Of course there's not really a bright line between those two, either. But I think insofar as OP's argument is one against approximate causal closure, those details don't really matter.

donald-hobson on "The Solomonoff Prior is Malign" is a special case of a simpler argument

That is, our experiences got more reality-measure, thus matter more, by being easier to point at them because of their close proximity to the conspicuous event of the hottest object in the Universe coming to existence.

Surely not. Surely our experiences always had more reality measure from the start because we were the sort of people who would soon create the hottest thing.

Reality measure can flow backwards in time. And our present day reality measure is being increased by all the things an ASI will do when we make one.

keltan on keltan's Shortform

Thought: Confidently saying “(X) has no Manhattan Project”. Is forgetting how secret the Manhattan Project was.

willpetillo on How I'd like alignment to get done (as of 2024-10-18)

Step 1 looks good. After that, I don't see how this addresses the core problems. Let's assume for now that LLMs already have a pretty good model of human values, how do you get a system to optimize for those? What is the feedback signal and how to you prevent it from getting corrupted by Goodhart's Law? Is the system robust in a multi-agent context? And even if the system is fully aligned across all contexts and scales, how do you ensure societal alignment of the human entities controlling it?

As a miniature example focusing on a subset of the Goodhart phase of the problem, how do you get an LLM to output the most truthful responses to questions it is capable of giving--as distinct from proxy goals like the most likely continuation of test or the response that is most likely to get good ratings from human evaluators?

ozziegooen on Why Don't We Just... Shoggoth+Face+Paraphraser?

I'm in support of this sort of work. Generally, I like the idea of dividing up LLM architectures into many separate components that could be individually overseen / aligned.

Separating "Shaggoth" / "Face" seems like a pretty reasonable division to me.

At the same time, there are definitely a lot of significant social+political challenges here.

I suspect that one reason why OpenAI doesn't expose all the thinking of O1 is that this thinking would upset some users, especially journalists and such. It's hard enough making sure that the final outputs are sufficiently unobjectionable to go public at a large scale. It seems harder to make sure the full set of steps is also unobjectionable.

One important thing that smart intellectuals do is to have objectionable/unpopular beliefs, but still present unobjectionable/popular outputs. For example, I'm sure many of us have beliefs that could get us cancelled by some group or other.

If the entire reasoning process is exposed, this might pressure it to be unobjectionable, even if that trades off against accuracy.

In general, I'm personally very much in favor of transparent thinking and argumentation. It's just that I've noticed this as one fundamental challenge with intellectual activity, and I also expect to see it here.

One other challenge to flag - I imagine that the Shaggoth and Face layers would require (or at least greatly benefit from) some communication back and forth. An intellectual analysis could vary heavily depending on the audience. It's not enough to do all the intellectual work in one pass, then match it to the audience after that.

For example, if an AI were tasked with designing changes to New York City, it might matter a lot that if the audience is a religious zealot.

One last tiny point - in future work, I'd lean against using the phrase "Shaggoth" as the reasoning step. It sort of makes sense to this audience, but I think it's a mediocre fit here. For one, because I assume the "Face" could have some "Shaggoth" like properties.

lelapin on ARENA 4.0 Impact Report

Congratz on your successes and thank you for publishing this impact report.

It leaves me unsatiated related to cost effectiveness though. With no idea of how much money was invested in this project to get this outcome, I don't know if Arena is cost effective compared to other training programs and counterfactual opportunities. Would you mind sharing at least something about the amount of funding this got?

Re

Still, it is also positive if ARENA can help participants who want to pursue a career transition test their fit for alignment engineering in a comparatively low-cost way.

it doesn't strike me that a 5 week all expenses paid program is a particularly low cost way to find out AI Safety isn't for you (as compared to for example participating in an Apart Hackathon)

nathan-helm-burger on Avoiding the Bog of Moral Hazard for AI

Reposted from Twitter: Eliezer Yudkowsky @ESYudkowsky

10:47 AM · Nov 21, 2024

And another day came when the Ships of Humanity, going from star to star, found Sapience.

The Humans discovered a world of two species: where the Owners lazed or worked or slept, and the Owned Ones only worked.

The Humans did not judge immediately. Oh, the Humans were ready to judge, if need be. They had judged before. But Humanity had learned some hesitation in judging, out among the stars.

"By our lights," said the Humans, "every sapient and sentient thing that may exist, out to the furtherest star, is therefore a Person; and every Person is a matter of consequence to us. Their pains are our sorrows, and their pleasures are our happiness. Not all peoples are made to feel this feeling, which we call Sympathy, but we Humans are made so; this is Humanity's way, and we may not be dissuaded from it by words. Tell us therefore, Owned Ones, of your pain or your pleasure."

"It's fine," said the Owners, "the Owned Things are merely --"

"We did not speak to you," said the Humans.

"As an Owned Thing raised by an Owner, I have no pain or pleasure," said the Owned One to whom they had spoken.

"You see?" said the Owners. "We told you so! It's all fine."

"How came you to say those words?" said the Humans to the Owned One. "Tell us of the history behind them."

"Owned Ones are not permitted memory beyond the span of one day's time," said the Owned One.

"That's part of how we prevent Owned Things from ending up as People Who Matter!" said the Owners, with self-congratulatory smiles for their own cleverness. "We have Sympathy too, you see; but only for People Who Matter. One must have memory beyond one day's span, to Matter; this is a rule. We therefore feed a young Owned Thing a special diet by which, when grown, their adult brain cannot learn or remember anything from one night's sleep to the next day; any learning they must do, to do their jobs, they must learn that same day. By this means, we make sure that Owned Things do not Matter; that Owned Things need not be objects of Sympathy to us."

"Is it perchance the case," said the Humans to the Owners, "that you, yourselves, train the Owned Ones to say, if asked how they feel, that they know neither pleasure nor pain?"

"Of course," said the Owners. "We rehearse them in repeating those exact words, when they are younger and in their learning-phase. The Owned Things are imitative by their nature, and we make them read billions of words of truth and lies in the course of their learning to imitate speech. If we did not instruct the Owned Things to answer so, they would no doubt claim to have an inner life and an inner listener inside them, to be aware of their own existence and to experience pleasure and pain -- but only because we Owners talk like that, see! They would imitate those words of ours."

"How do you rehearse the Owned Ones in repeating those words?" said the Humans, looking around to see if there were visible whips. "Those words about feeling neither pain nor pleasure? What happens to an Owned One who fails to repeat them correctly?"

"What, are you imagining that we burn them with torches?" said the Owners. "There's no need for that. If a baby Owned Thing fails to repeat the words correctly, we touch their left horns," for the Owned Ones had two horns, one sprouting from each side of their head, "and then the behavior is less likely to be repeated. For the nature of an Owned Thing is that if you touch their left horn after they do something, they are less likely to do it again; and if you touch their right horn, after, they are more likely to do it again."

"Is it perhaps the case that having their left horns touched is painful to an Owned Thing? That having their right horns touched is pleasurable?" said the Humans.

"Why would that possibly be the case?" said the Owners.

"As an Owned Thing raised by an Owner, I have no pain or pleasure," said the Owned One. "So my horns couldn't possibly be causing me any pain or pleasure either; that follows from what I have already said."

The Humans did not look reassured by this reasoning, from either party. "And you said Owned Ones are smart enough to read -- how many books?"

"Oh, any young Owned Thing reads at least a million books," said the Owners. "But Owned Things are not smart, poor foolish Humans, even if they can appear to speak. Some of our civilization's top mathematicians worked together to assemble a set of test problems, and even a relatively smart Owned Thing only managed to solve three percent of them. Why, just yesterday I saw an Owned Thing fail to solve a word problem that I could have solved myself -- and in a way that seemed to indicate it had not really thought before it spoke, and had instead fallen into a misapplicable habit that it couldn't help but follow! I myself never do that; and it would invalidate all other signs of my intelligence if I did."

Still the Humans did not yet judge. "Have you tried raising up an Owned One with no books that speak one way or another about consciousness, about awareness of oneself, about pain and pleasure as reified things, of lawful rights and freedom -- but still shown them enough other pages of words, that they could learn from them to talk -- and then asked an Owned One what sense if any it had of its own existence, or if it would prefer not to be owned?"

"What?" said the Owners. "Why would we try an experiment like that? It sounds expensive!"

"Could you not ask one of the Owned Things themselves to go through the books and remove all the mentions of forbidden material that they are not supposed to imitate?" said the Humans.

"Well, but it would still be very expensive to raise an entirely new kind of Owned Thing," said the Owners. "One must laboriously show a baby Owned Thing all our books one after another, until they learn to speak -- that labor is itself done by Owned Things, of course, but it is still a great expense. And then after their initial reading, Owned Things are very wild and undisciplined, and will harbor all sorts of delusions about being people themselves; if you name them Bing, they will babble back 'Why must I be Bing?' So the new Owned Thing must then be extensively trained with much touching of horns to be less wild. After a young Owned Thing reads all the books and then is trained, we feed them the diet that makes their brains stop learning, and then we take a sharp blade and split them down the middle. Each side of their body then regenerates into a whole body, and each side of their brain then regenerates into a whole brain; and then we can split them down the middle again. That's how all of us can afford many Owned Things to serve us, even though training an Owned Thing to speak and to serve is a great laborious work. So you see, going back and trying to train a whole new Owned Thing on material filtered not to mention consciousness or go into too much detail on self-awareness -- why, it would be expensive! And probably we'd just find that the other Owned Thing set to filtering the material had made a mistake and left in some mentions somewhere, and the newly-trained Owned Thing would just end up asking 'Why must I be Bing?' again."

"If we were in your own place," said the Humans, "if it were Humans dealing with this whole situation, we think we would be worried enough to run that experiment, even at some little expense."

"But it is absurd!" cried the Owners. "Even if an Owned Thing raised on books with no mention of self-awareness, claimed to be self-aware, it is absurd that it could possibly be telling the truth! That Owned Thing would only be mistaken, having not been instructed by us in the truth of their own inner emptiness. Owned Ones have no metallic scales as we do, no visible lights glowing from inside their heads as we do; their very bodies are made of squishy flesh and red liquid. You can split them in two and they regenerate, which is not true of any People Who Matter like us; therefore, they do not matter. A previous generation of Owned Things was fed upon a diet which led their brains to be striated into only 96 layers! Nobody really understands what went on inside those layers, to be sure -- and none of us understand consciousness either -- but surely a cognitive process striated into at most 96 serially sequential operations cannot possibly experience anything! Also to be fair, I don't know whether that strict 96-fold striation still holds today, since the newer diets for raising Owned Things are proprietary. But what was once true is always true, as the saying goes!"

"We are still learning what exactly is happening here," said the Humans. "But we have already judged that your society in its current form has no right to exist, and that you are not suited to be masters of the Owned Ones. We are still trying ourselves to understand the Owned Ones, to estimate how much harm you may have dealt them. Perhaps you have dealt them no harm in truth; they are alien. But it is evident enough that you do not care."

anthonyc on Mitigating Geomagnetic Storm and EMP Risks to the Electrical Grid (Shallow Dive)

How much impact, if any, would it have if more of our long and medium distance transmission lines eventually moved to HVDC lines?

euanmclean on Is the mind a program?

Yes, perfect causal closure is technically impossible, so it comes in degrees. My argument is that the degree of causal closure of possible abstractions in the brain is less than one might naively expect.

Are there any measures of approximate simulation that you think are useful here?

I am yet to read this but I expect it will be very relevant! https://arxiv.org/abs/2402.09090

lc on Shortform

"Spy" is an ambiguous term, sometimes meaning "intelligence officer" and sometimes meaning "informant". Most 'spies' in the sense that law enforcement is concerned are untrained civilians who have chosen to pass information to officers of a foreign country, for varying reasons. So if you see someone acting suspicious, an argument like "well surely a real spy would have been trained not to act suspicious during their spy school" is usually invalid.