LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

[link] Claude 3 Opus can operate as a Turing machine
Gunnar_Zarncke · 2024-04-17T08:41:57.209Z · comments (2)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

Mental Masturbation and the Intellectual Comfort Zone
Declan Molony (declan-molony) · 2024-05-07T05:47:05.257Z · comments (2)

[link] "Model UN Solutions"
Arjun Panickssery (arjun-panickssery) · 2023-12-08T23:06:33.490Z · comments (5)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

Drone Wars Endgame
RussellThor · 2024-02-01T02:30:46.161Z · comments (71)

(Appetitive, Consummatory) ≈ (RL, reflex)
Steven Byrnes (steve2152) · 2024-06-15T15:57:39.533Z · comments (1)

Good job opportunities for helping with the most important century
HoldenKarnofsky · 2024-01-18T17:30:03.332Z · comments (0)

[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (29)

A Socratic dialogue with my student
lsusr · 2023-12-05T09:31:05.266Z · comments (14)

We’re not as 3-Dimensional as We Think
silentbob · 2024-08-04T14:39:16.799Z · comments (16)

[link] Toki pona FAQ
dkl9 · 2024-03-17T21:44:21.782Z · comments (8)

Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)

Forecasting AI (Overview)
jsteinhardt · 2023-11-16T19:00:04.218Z · comments (0)

Representation Tuning
Christopher Ackerman (christopher-ackerman) · 2024-06-27T17:44:33.338Z · comments (9)

Empirical vs. Mathematical Joints of Nature
Elizabeth (pktechgirl) · 2024-06-26T01:55:22.858Z · comments (1)

Open consultancy: Letting untrusted AIs choose what answer to argue for
Fabien Roger (Fabien) · 2024-03-12T20:38:03.785Z · comments (5)

Unpicking Extinction
ukc10014 · 2023-12-09T09:15:41.291Z · comments (10)

A sketch of acausal trade in practice
Richard_Ngo (ricraz) · 2024-02-04T00:32:54.622Z · comments (4)

Economics Roundup #2
Zvi · 2024-07-02T12:40:05.908Z · comments (5)

Index of rationalist groups in the Bay Area July 2024
Lucie Philippon (lucie-philippon) · 2024-07-26T16:32:25.337Z · comments (10)

Open Thread – Winter 2023/2024
habryka (habryka4) · 2023-12-04T22:59:49.957Z · comments (160)

'Theories of Values' and 'Theories of Agents': confusions, musings and desiderata
Mateusz Bagiński (mateusz-baginski) · 2023-11-15T16:00:48.926Z · comments (8)

Predictive model agents are sort of corrigible
Raymond D · 2024-01-05T14:05:03.037Z · comments (6)

Proposal for improving the global online discourse through personalised comment ordering on all websites
Roman Leventov · 2023-12-06T18:51:37.645Z · comments (21)

Agency in Politics
Martin Sustrik (sustrik) · 2024-07-17T05:30:01.873Z · comments (2)

[link] List of Collective Intelligence Projects
Chipmonk · 2024-07-02T14:10:41.789Z · comments (9)

Humans aren't fleeb.
Charlie Steiner · 2024-01-24T05:31:46.929Z · comments (5)

List of strategies for mitigating deceptive alignment
joshc (joshua-clymer) · 2023-12-02T05:56:50.867Z · comments (2)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

Which evals resources would be good?
Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · comments (2)

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures (Workshop @ EA Hotel!)
Sahil · 2024-11-01T17:24:09.957Z · comments (2)

Secondary Risk Markets
Vaniver · 2023-12-11T21:52:46.836Z · comments (4)

[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)

Doomsday Argument and the False Dilemma of Anthropic Reasoning
Ape in the coat · 2024-07-05T05:38:39.428Z · comments (55)

[link] Twitter thread on politics of AI safety
Richard_Ngo (ricraz) · 2024-07-31T00:00:34.298Z · comments (2)

Dangers of Closed-Loop AI
Gordon Seidoh Worley (gworley) · 2024-03-22T23:52:22.010Z · comments (9)

How I select alignment research projects
Ethan Perez (ethan-perez) · 2024-04-10T04:33:08.092Z · comments (4)

Categories of leadership on technical teams
benkuhn · 2024-07-22T04:50:04.071Z · comments (0)

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Jan Wehner · 2024-07-14T10:37:21.544Z · comments (5)

[Valence series] 4. Valence & Social Status (deprecated)
Steven Byrnes (steve2152) · 2023-12-15T14:24:41.040Z · comments (19)

[link] OpenAI appoints Retired U.S. Army General Paul M. Nakasone to Board of Directors
Joel Burget (joel-burget) · 2024-06-13T21:28:18.110Z · comments (10)

What Helped Me - Kale, Blood, CPAP, X-tiamine, Methylphenidate
Johannes C. Mayer (johannes-c-mayer) · 2024-01-03T13:22:11.700Z · comments (12)

My Detailed Notes & Commentary from Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:51.894Z · comments (16)

[link] My Model of Epistemology
adamShimi · 2024-08-31T17:01:45.472Z · comments (0)

[link] My article in The Nation — California’s AI Safety Bill Is a Mask-Off Moment for the Industry
garrison · 2024-08-15T19:25:59.592Z · comments (0)

How predictive processing solved my wrist pain
max_shen (makoshen) · 2024-07-04T01:56:20.162Z · comments (8)

Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)

Recent comments

vladimir_nesov on O O's Shortform

It needs a sufficient base model that captures the relevant considerations. Solving AIME is like winning at chess, except the rules of chess are trivial, and the rules of AIME are much harder. But the rules of AIME are still not that hard; it's using them to win that is hard.

In the real world, the rules get much harder than that, so it's unclear how far o1 can go if the base model doesn't get sufficiently better (at knowing the rules), and it's unclear how much better it needs to get. Plausibly it needs to get so good that the o1-like post-training won't be needed for it to pursue long chains of reasoning on its own, as an emergent capability.

donatas-luciunas on Claude seems to be smarter than LessWrong community

It sounds to me like you're saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.

Yes, exactly.

The logic is similar to Pascal's wager. If an objective goal exists, it is better to find and pursue it than to pursue a fake goal. If an objective goal does not exist, it is still better to make sure it does not exist before pursuing a fake goal. Do you see?

sohaib-imran on Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data

Thanks for having a read!

Do you expect abductive reasoning to be significantly different from deductive reasoning? If not, (and I put quite high weight on this,) then it seems like (Berglund, 2023) already tells us a lot about the cross-context abductive reasoning capabilities of LLMs. I.e. replicating their methodology wouldn't be very exciting.

Berglund et al. (2023) utilise a prompt containing a trigger keyword (the chatbot's name) for their experiments,

where $β$ corresponds much more strongly to $B_{1}$ than $B_{2}$. In our experiments the As are the chatbot names, Bs are behaviour descriptions and $β$s are observed behaviours in line with the descriptions. My setup is:

$\begin{matrix} A_{1} \to B_{1} \\ A_{2} \to B_{2} \\ β \\ A_{1} \end{matrix}$

The key difference here is that $β$ is much less specific/narrow since it corresponds with many Bs, some more strongly than others. This is therefore a (Bayesian) inference problem. I think it's intuitively much easier to reason forward given a narrow trigger to your knowledge base (e.g. someone telling you to use the compound interest equation) than to figure out which parts of your knowledge base are relevant (e.g. you observe the numbers 100, 105, 110.25 and ask which equation would let you compute them).
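As a rough illustration of the compound interest example above, here is a minimal Python sketch contrasting the two directions. It is not from the paper: the function names, the two candidate rules, and the grid of rates are all assumptions made for illustration. The forward direction applies a named equation; the abductive direction searches the "knowledge base" for the rule that best explains the observed numbers.

```python
def compound(principal, rate, periods):
    """Forward step: apply the compound interest rule P * (1 + r)**n."""
    return [principal * (1 + rate) ** n for n in range(periods)]


def simple(principal, rate, periods):
    """Forward step: apply the simple interest rule P * (1 + r * n)."""
    return [principal * (1 + rate * n) for n in range(periods)]


# Hypothetical "knowledge base" of candidate rules (an assumption for this sketch).
CANDIDATES = {"compound interest": compound, "simple interest": simple}


def best_explanation(observed, rate_grid):
    """Abductive step: pick the (rule, rate) pair with the lowest squared error."""
    best = None
    for name, rule in CANDIDATES.items():
        for rate in rate_grid:
            predicted = rule(observed[0], rate, len(observed))
            error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
            if best is None or error < best[0]:
                best = (error, name, rate)
    return best[1], best[2]


observed = [100, 105, 110.25]
rate_grid = [r / 100 for r in range(1, 11)]  # candidate rates: 1% to 10%
name, rate = best_explanation(observed, rate_grid)
print(name, rate)
```

The forward functions need only a rule and its inputs, whereas `best_explanation` has to score every candidate against the evidence, which is the sense in which the second task is the harder inference problem.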

Similarly, for the RL experiments (experiment 3) they use chatbot names as triggers, and therefore their results are relevant to narrow backdoor triggers rather than more general reward hacking leveraging declarative facts in training data.

I am slightly confused whether their reliance on narrow triggers is central to the difference between abductive and deductive reasoning. Part of the confusion is because (Bayesian) inference itself seems like an inductive reasoning process. Popper and Stengel (2024) discuss this a little bit under page 3 and footnote 4 (process of realising relevance).

One difference that I note here is that abductive reasoning is uncertain / ambiguous; maybe you could test whether the model also reduces its belief of competing hypotheses (c.f. 'explaining away').

Figures 1 and 4 attempt to show that. If the tendency to self-identify as the incorrect chatbot falls with increasing k and iteration, then the models are doing this inference process right. In Figures 1 and 4, you can see that the tendency to incorrectly identify as a pangolin falls in all non-pangolin tasks (except for tiny spikes sometimes at iteration 1).

abandon on Trying Bluesky

Yes, I know; the following tab was already present at that time, is what I meant to communicate.

cubefox on Trying Bluesky

Twitter did use an algorithmic timeline before (e.g. tweets you might be interested in, tweets people you followed liked); it was just less algorithmic than the current "for you" tab. The time when it was completely like the current "following" tab was many years ago.

sharmake-farah on Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims

This is also my interpretation of the rumors, assuming they are true, which I don't put much probability on.

abandon on Trying Bluesky

The following tab doesn't postdate Musk; it's been present since before they introduced the algorithmic timeline.

afeller08 on Happiness and Goodness as Universal Terminal Virtues

Someone probably does. I believe that the cultural practice of preferring coffee to tea began in the British colonies at the time the United States started to cease to be part of the British Empire as a side effect of boycotting tea to avoid paying a tea tax. (This is a pretty well-known episode of American history within the United States.) I was boycotting the boycott.

Refusing to drink tea is a signaling thing in the United States to let people know that you are not in agreement with the government of the United States as to which side constitutes the actual enemy in most wars the United States fights. It more or less means "I was an anglophile on my route to becoming a Bob Dylan fan, and I make a point of singing, at least, the first verse of "Chimes of Freedom" loudly and publicly every May 1, July 4, and September 2." By "more or less," I mean, I'm a musician so that's how I now express some of the same things that I used to express by refusing to drink coffee before I had enough confidence to just sing "flashing for the warrior, whose strength is not to fight; flashing for the refugee on the unarmed road of flight" whenever I see someone wearing a uniform that I deem offensive.

Relatedly, refusing to drink coffee while still drinking caffeine is a fairly radical refusal to participate in mainstream culture that an enormous number of second- and third-tier trendsetters recognize as a common signal used by first-tier trendsetters. For instance, most hipsters are at least vaguely aware that many of the most influential people who call the shots and set the trends in their subculture are some subset of the people who are not actually hipsters but who interact with the fringes of hipster culture and who have also spent at least a few years saying, "I DO NOT DRINK COFFEE. i drink tea." ("No thanks, I drink tea," is completely different.)

To become a first-tier trendsetter in hipster culture, you have to be a non-hipster who has learned how to do a super-hipster thing for the right reasons, and one of the most obvious and easy ways you can do that is to express a disdain for Starbucks that is more menacing/intimidating than it is merely contemptuous (but is also at least as contemptuous as the typical hipster's ability to express disdain). Hipsters are not formidable people, but they respect formidable people; and they disrespect people whose power is derived from social structures. There is at least one venue that I used to go to primarily to consume tea, where hipsters still go primarily to consume Jazz.

The comment that you responded to mostly consisted of me cryptically calling a few shots. The comments I've posted today consist of cryptically taking victory laps for all the shots called in that comment ten years ago, while calling some shots for the next ten years. I occasionally interact with hipster culture to inform hipsters about what types of aesthetic preferences they are going to help spread in the next few years.

All the minor celebrities I interact with respond to all the comments I direct towards them and ignore all the comments I make about them. For instance, Scott Siskind always replies to the comments I post on his blogs that I want him to respond to. And when I go to LessWrong meetups I figure out who's worth talking to by saying, "I learned Scott's last name from the blog that I sort of vaguely remember as being named after an octopus long before I confirmed it by asking 'how many Jazz pianists who performed in Carnegie Hall can possibly have a brother named Scott who has practiced psychiatry in Michigan?'"

steve2152 on Gunnar_Zarncke's Shortform

FWIW I don’t think “self-models” in the Intuitive Self-Models sense are related to instrumental power-seeking (see §8.2). For example, I think of my toenail as “part of myself”, but I’m happy to clip it. And I understand that if someone “identifies with the universal consciousness”, their residual urges towards status-seeking, avoiding pain, and so on are about the status and pain of their conventional selves, not the status and pain of the universal consciousness. More examples here and here.

Separately, I’m not sure what if anything the Intuitive Self-Models stuff has to do with LLMs in the first place.

But there’s a deeper problem: the instrumental convergence concern is about agents that have preferences about the state of the world in the distant future, not about agents that have preferences about themselves. (Cf. here.) So for example, if an agent wants there to be lots of paperclips in the future, then that’s the starting point, and everything else can be derived from there.

Q: Does the agent care about protecting “the temporary state of the execution of the model (or models)”?
- A: Yes, if and only if protecting that state is likely to ultimately lead to more paperclips.
Q: Does the agent care about protecting “the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs”?
- A: Yes, if and only if protecting those resources is likely to ultimately lead to more paperclips.

Etc. See what I mean? That’s instrumental convergence, and self-models have nothing to do with it.

Sorry if I’m misunderstanding.

seth-herd on Making a conservative case for alignment

I agree with everything you've said. The advantages come primarily from aligning the AGI not to values but only to following instructions, rather than using RL or any other process to infer underlying values. Instruction-following AGI is easier and more likely than value-aligned AGI.

I think creating real AGI based on an LLM aligned to be helpful, harmless and honest would probably be the end of us, as carrying the set of values implied by RLHF to its logical conclusions outside of human control would probably lead somewhere pretty different from our desired values. Instruction-following provides corrigibility.