LessWrong 2.0 Reader

[question] why did OpenAI employees sign
bhauth · 2023-11-27T05:21:28.612Z · answers+comments (23)

Should rationalists be spiritual / Spirituality as overcoming delusion
Kaj_Sotala · 2024-03-25T16:48:08.397Z · comments (57)

[link] Chapter 1 of How to Win Friends and Influence People
gull · 2024-01-28T00:32:52.865Z · comments (5)

Notes on control evaluations for safety cases
ryan_greenblatt · 2024-02-28T16:15:17.799Z · comments (0)

Bounty: Diverse hard tasks for LLM agents
Beth Barnes (beth-barnes) · 2023-12-17T01:04:05.460Z · comments (31)

Wrong answer bias
lemonhope (lcmgcd) · 2024-02-01T20:05:38.573Z · comments (24)

AI #58: Stargate AGI
Zvi · 2024-04-04T13:10:06.342Z · comments (9)

[link] in defense of Linus Pauling
bhauth · 2024-06-03T21:27:43.962Z · comments (8)

The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)

AI #67: Brief Strange Trip
Zvi · 2024-06-06T18:50:03.514Z · comments (6)

The Dunning-Kruger of disproving Dunning-Kruger
kromem · 2024-05-16T10:11:33.108Z · comments (0)

[link] DM Parenting
Shoshannah Tekofsky (DarkSym) · 2024-07-16T08:50:08.144Z · comments (4)

An issue with training schemers with supervised fine-tuning
Fabien Roger (Fabien) · 2024-06-27T15:37:56.020Z · comments (12)

[link] On scalable oversight with weak LLMs judging strong LLMs
zac_kenton (zkenton) · 2024-07-08T08:59:58.523Z · comments (18)

[link] Anthropic announces interpretability advances. How much does this advance alignment?
Seth Herd · 2024-05-21T22:30:52.638Z · comments (4)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

Evaluating the truth of statements in a world of ambiguous language.
Hastings (hastings-greer) · 2024-10-07T18:08:09.920Z · comments (19)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)

An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)

[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)

On the lethality of biased human reward ratings
Eli Tyre (elityre) · 2023-11-17T18:59:02.303Z · comments (10)

Safety First: safety before full alignment. The deontic sufficiency hypothesis.
Chipmonk · 2024-01-03T17:55:19.825Z · comments (3)

What is the next level of rationality?
lsusr · 2023-12-12T08:14:14.846Z · comments (24)

Exercise: Planmaking, Surprise Anticipation, and "Baba is You"
Raemon · 2024-02-24T20:33:49.574Z · comments (19)

D&D.Sci(-fi): Colonizing the SuperHyperSphere
abstractapplic · 2024-01-12T23:36:54.248Z · comments (23)

On ‘Responsible Scaling Policies’ (RSPs)
Zvi · 2023-12-05T16:10:06.310Z · comments (3)

AISC 2024 - Project Summaries
NickyP (Nicky) · 2023-11-27T22:32:23.555Z · comments (3)

“Why can’t you just turn it off?”
Roko · 2023-11-19T14:46:18.427Z · comments (25)

D&D.Sci: The Mad Tyrant's Pet Turtles [Evaluation and Ruleset]
abstractapplic · 2024-04-09T14:01:34.426Z · comments (6)

[link] Spaced repetition for teaching two-year olds how to read (Interview)
Chipmonk · 2023-11-26T16:52:58.412Z · comments (9)

Highlights from Lex Fridman’s interview of Yann LeCun
Joel Burget (joel-burget) · 2024-03-13T20:58:13.052Z · comments (15)

[link] Designing for a single purpose
Itay Dreyfus (itay-dreyfus) · 2024-05-07T14:11:22.242Z · comments (12)

AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)

Philosophers wrestling with evil, as a social media feed
David Gross (David_Gross) · 2024-06-03T22:25:22.507Z · comments (2)

[link] Web-surfing tips for strange times
eukaryote · 2024-05-31T07:10:25.805Z · comments (19)

[link] Contra Acemoglu on AI
Maxwell Tabarrok (maxwell-tabarrok) · 2024-06-28T13:13:15.796Z · comments (0)

Why the Best Writers Endure Isolation
Declan Molony (declan-molony) · 2024-07-16T05:58:25.032Z · comments (6)

Mechanistic Interpretability Workshop Happening at ICML 2024!
Neel Nanda (neel-nanda-1) · 2024-05-03T01:18:26.936Z · comments (6)

Misnaming and Other Issues with OpenAI's “Human Level” Superintelligence Hierarchy
Davidmanheim · 2024-07-15T05:50:17.770Z · comments (2)

The Mom Test: Summary and Thoughts
Adam Zerner (adamzerner) · 2024-04-18T03:34:21.020Z · comments (3)

SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)

How to do conceptual research: Case study interview with Caspar Oesterheld
Chi Nguyen · 2024-05-14T15:09:30.390Z · comments (5)

[link] JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan (SenR) · 2024-07-19T16:10:54.664Z · comments (10)

How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation")
Ruby · 2024-07-19T00:31:38.332Z · comments (21)

Some Experiments I'd Like Someone To Try With An Amnestic
johnswentworth · 2024-05-04T22:04:19.692Z · comments (33)

Extended Interview with Zhukeepa on Religion
Ben Pace (Benito) · 2024-08-18T03:19:05.625Z · comments (59)

Recent comments

maia on Secular Solstice Songbook Update

I would suggest taking out the paganism verse in Bold Orion. We never use it, dunno about you guys.

romeostevensit on "The Solomonoff Prior is Malign" is a special case of a simpler argument

Thanks for writing this, I indeed felt that the arguments were significantly easier to follow than previous efforts.

notrishi on notrishi's Shortform

I have not read this before, thanks. It reminds me a lot of Normal Computing's extended mind models. I think these are good ideas worth testing, and there are many others in the same vein. My intuition suggests that any idea that pursues a gradual increase in global information prior to decoding is a worthwhile experiment, whether through your method or something similar (it doesn't necessarily have to be diffusion on embeddings).

Aesthetically, I just don't like that transformers have an information collapse on each token and don't allow backtracking (without significant effort in a custom sampler). In my ideal world we could completely reconstruct prose from embeddings and thus simply autoregress in latent space. I think Yann LeCun has discussed this with JEPA as well.

I originally got this idea from a frequency-autoregression experiment, in which I used a causal transformer on the frequency domain of images (to roughly replicate diffusion). Because every frequency coefficient contributes to every pixel through the IFFT, this gradually adds information globally to all pixels, yet it still has an autoregressive backend.
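A minimal NumPy sketch of the "information added globally" property, assuming coefficients are revealed in order of increasing radial spatial frequency (an ordering chosen here for illustration; the comment doesn't specify one). It covers only the inverse-FFT step, not the causal transformer, and `partial_reconstruction` is a hypothetical helper.

```python
import numpy as np


def partial_reconstruction(image: np.ndarray, k: int) -> np.ndarray:
    """Rebuild `image` from only its k lowest-frequency 2-D FFT coefficients.

    Coefficients are ranked by radial spatial frequency (an illustrative
    ordering assumed here); all others are zeroed before the inverse FFT.
    """
    coeffs = np.fft.fft2(image)
    h, w = image.shape
    # Radial frequency of each coefficient position.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy**2 + fx**2)
    # Flattened indices of the k lowest-frequency coefficients.
    order = np.argsort(radius, axis=None, kind="stable")
    mask = np.zeros(h * w, dtype=bool)
    mask[order[:k]] = True
    kept = np.where(mask.reshape(h, w), coeffs, 0)
    # Taking .real keeps the sketch simple; a fuller treatment would reveal
    # conjugate-symmetric coefficient pairs together.
    return np.fft.ifft2(kept).real


rng = np.random.default_rng(0)
img = rng.random((8, 8))

# Revealing one more coefficient (k=1 -> k=2) changes every pixel at once,
# unlike pixel-space autoregression, which fills in one pixel per step.
step = partial_reconstruction(img, 2) - partial_reconstruction(img, 1)
print(np.count_nonzero(np.abs(step) > 1e-12), "of", img.size, "pixels changed")

# With all coefficients revealed, the original image is recovered.
print(np.allclose(partial_reconstruction(img, img.size), img))
```

With k=1 only the DC term survives, so the reconstruction is a flat image at the mean intensity; each further step refines the whole image rather than one location.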

seth-herd on What are Emotions?

I think you're primarily addressing reward signals or reinforcement signals. These are, by definition, signals that make behavior preceding them more likely in the future. In the mammalian brain, they define what we pursue.

Other emotions are different; back to them later.

The dopamine system appears to play this role in the mammalian brain. It's somewhat complex, in that new predictions of future rewards seem to be the primary source of reinforcement for humans; for instance, if someone hands me a hundred dollars, I have a new prediction that I'll eat food, get shelter, or do something that in turn predicts reward; so I'll repeat whatever behavior preceded that, and I'll update my predictions for future reward.
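The idea that a *new prediction* of reward acts as reinforcement is commonly formalized as a temporal-difference (TD) prediction error, which the dopamine literature often compares to phasic dopamine signals. A minimal TD(0) sketch under that assumption; the state names, values, discount `gamma`, and learning rate `alpha` are hypothetical and are not taken from the comment or the referenced paper.

```python
def td_update(values: dict, state: str, next_state: str, reward: float,
              alpha: float = 0.1, gamma: float = 0.9) -> float:
    """One TD(0) step: compute the reward-prediction error and nudge V(state).

    delta = reward + gamma * V(next_state) - V(state)
    """
    delta = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * delta  # move V(state) toward the TD target
    return delta


# Hypothetical values: being handed $100 delivers no primary reward yet,
# but the new state predicts much more future reward than the old one.
values = {"before": 1.0, "handed_100": 10.0}
delta = td_update(values, "before", "handed_100", reward=0.0)
print(delta)  # 0 + 0.9 * 10.0 - 1.0 = 8.0: a positive "reinforcement" signal
print(values["before"])
```

The positive error arises with zero immediate reward, purely from the jump in predicted future reward, which matches the hundred-dollar example.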

For way more than you want to know about how dopamine seems to shape our actions, see my paper Neural mechanisms of human decision-making and the masses of work it references.

Or better yet, read Steve Byrnes' Intro to brain-like-AGI safety sequence, focusing on the steering subsystem. Then look at his Valence sequence for more on how we pass reward predictions among our "thoughts" (representations of concepts). (IMO, his Valence sequence matches exactly what the dopamine system is known to do for short-time tasks, and what it probably does in complex human thought.)

So, when you ask people what their goals are, they're mentioning things that predict reward to them. They're guesses about what would give a lot of reward signals. The correct answer to "why do you want that?" is "because I think I'd find it really rewarding". ("I'd really enjoy it" is close but not quite correct, since there's a difference between wanting and liking in the brain; google that for another headful.)

Now, we can be really wrong about what we'd find rewarding or enjoy. I think we're usually way off. But that is how we pick goals, and what drives our behavior (along with a bunch of other, less determinative factors, like what we know about and what happens to come into our attention).

Other emotions, like fear and anger, are different. They can be thought of as "tilts" to our cognitive landscape. Even learning that we're experiencing them is tricky. That's why emotional awareness is a subject to learn about, not just something we're born knowing. We need to learn to "feel the tilt". An elevated heart rate might signal fear, anger, or excitement; noticing it, or finding other cues, is necessary to understand how we're tilted and how to correct for it if we want to act rationally. Those sorts of emotions "tilt the landscape" of our cognition by making different thoughts and actions more likely, like thoughts of how someone's actions were unfair, or physical attacks, when we're angry.

See also my post [Human preferences as RL critic values - implications for alignment](https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for). I'm not sure how clear or compelling it is. But I'm pretty sure that predicted reward is pretty synonymous with what we call "values".

avturchin on Anthropic signature: strange anti-correlations

There is a strange correlation connected to the faint young Sun paradox: the young Sun had lower luminosity, yet Earth's temperature stayed stable thanks to a stronger greenhouse effect. As the Sun grew brighter, CO2 declined. This has even been analyzed as evidence of anthropic effects.

In his article "The Anthropic Principle in Cosmology and Geology" [Shcherbitsky, 1999], A. S. Shcherbakov thoroughly examines the anthropic principle's effect using the historical dynamics of Earth's atmosphere as an example. He writes: "It is known that geological evolution proceeds within an oscillatory regime. Its extreme points correspond to two states, known as the 'hot planet' and 'white planet'... The 'hot planet' situation occurs when large volumes of gaseous components, primarily carbon dioxide, are released from Earth's mantle...

As calculations show, the gradual evaporation of ocean water just 10 meters deep can create such greenhouse conditions that water begins to boil. This process continues without additional heat input. The endpoint of this process is the boiling away of the oceans, with near-surface temperatures and pressures rising to hundreds of atmospheres and degrees... Geological evidence indicates that Earth has four times come very close to total glaciation. An equal number of times, it has stopped short of ocean evaporation. Why did neither occur? There seems to be no common and unified saving cause. Instead, each time reveals a single and always unique circumstance. It is precisely when attempting to explain these that geological texts begin to show familiar phrases like '...extremely low probability,' 'if this geological factor had varied by a small fraction,' etc...
In the fundamental monograph 'History of the Atmosphere' [Budyko, 1985], there is discussion of an inexplicable correlation between three phenomena: solar activity rhythms, mantle degassing stages, and the evolution of life. 'The correspondence between atmospheric physicochemical regime fluctuations and biosphere development needs can only be explained by random coordination of direction and speed of unrelated processes - solar evolution and Earth's evolution. Since the probability of such coordination is exceptionally small, this leads to the conclusion about the exceptional rarity of life (especially its higher forms) in the Universe.'"

michael-cohn on What Ketamine Therapy Is Like

The "horse tranquilizer" thing goes back to long before the pandemic. I was hearing it in the aughts in relation to recreational use. My guess about the term is that 1) among drug warriors, it's good moral panic fodder, 2) among drug users, it sounds really funny, and 3) I imagine it's easier to divert doses from the veterinary system than from the human pharmacy system, so it may have originated with dealers whose supply literally had the words "horse" and "tranquilizer" on the label.

seth-herd on If we solve alignment, do we die anyway?

Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I've replied to James' comment, attempting to address the remaining difference in our predictions.

seth-herd on If we solve alignment, do we die anyway?

We're mostly in agreement here. If you're willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.

> you're probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you've created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn't take much to realise that that intelligence isn't going to think twice about also destroying you.

In my scenario, we've got aligned AGI - or at least AGI aligned to follow instructions. If that didn't work, we're already dead. So the AGI is going to follow its human's orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because preserving a goal is implied by instrumentally pursuing a goal (I'm guessing here at where we might not be thinking of things the same way).

If I thought ordering an AGI to self-improve was suicidal, I'd be relieved.

Alternately, if someone actually pulled off full value alignment, that AGI would take over without a care for international law or the wishes of its creator, and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least since the earliest alignment work. I now find this unlikely because I think Instruction-following AGI is easier and more likely than value aligned AGI: following instructions given by a single person is much easier to define and more robust to errors than defining (or defining how to deduce) the values of all humanity. And even if it weren't, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.

Again, I'm guessing at where our perspectives differ on whether someone could expect themselves and a few loved ones to survive a takeover attempt by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGI, it should be competent enough to maintain that alignment as it self-improves.

If I've missed the point of differing perspectives, I apologize.

sharmake-farah on Why would ASI share any resources with us?

IMO, the psychological unity of humankind thesis is a case of typical minding/overgeneralizing, combined with overestimating the role of genetics/algorithms and underestimating the role of data in what makes us human.

I basically agree with the game-theoretic perspective, combined with another perspective: as long as humans are relevant in the economy, you kind of have to help those humans if you want to profit. But even an AI that merely automates a lot of work could disrupt this very heavily, if a CEO could have perfectly loyal AI workers that never demanded anything in the broader economy.

gwern on notrishi's Shortform

You might be interested in a small "hybrid LLM" proposal I wrote for using diffusion on embeddings for then decoding/sampling.