LessWrong 2.0 Reader

D&D.Sci(-fi): Colonizing the SuperHyperSphere
abstractapplic · 2024-01-12T23:36:54.248Z · comments (23)

Exercise: Planmaking, Surprise Anticipation, and "Baba is You"
Raemon · 2024-02-24T20:33:49.574Z · comments (19)

On the lethality of biased human reward ratings
Eli Tyre (elityre) · 2023-11-17T18:59:02.303Z · comments (10)

Making Bad Decisions On Purpose
Screwtape · 2023-11-09T03:36:59.611Z · comments (8)

AISC 2024 - Project Summaries
NickyP (Nicky) · 2023-11-27T22:32:23.555Z · comments (3)

[link] Every Mention of EA in "Going Infinite"
KirstenH · 2023-10-07T14:42:32.217Z · comments (0)

[link] JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan (SenR) · 2024-07-19T16:10:54.664Z · comments (10)

[link] Web-surfing tips for strange times
eukaryote · 2024-05-31T07:10:25.805Z · comments (19)

Evaluating the truth of statements in a world of ambiguous language.
Hastings (hastings-greer) · 2024-10-07T18:08:09.920Z · comments (19)

How to do conceptual research: Case study interview with Caspar Oesterheld
Chi Nguyen · 2024-05-14T15:09:30.390Z · comments (5)

On ‘Responsible Scaling Policies’ (RSPs)
Zvi · 2023-12-05T16:10:06.310Z · comments (3)

The Mom Test: Summary and Thoughts
Adam Zerner (adamzerner) · 2024-04-18T03:34:21.020Z · comments (3)

AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)

SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)

An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)

Misnaming and Other Issues with OpenAI's “Human Level” Superintelligence Hierarchy
Davidmanheim · 2024-07-15T05:50:17.770Z · comments (2)

[link] Designing for a single purpose
Itay Dreyfus (itay-dreyfus) · 2024-05-07T14:11:22.242Z · comments (12)

What is the next level of rationality?
lsusr · 2023-12-12T08:14:14.846Z · comments (24)

[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

Mechanistic Interpretability Workshop Happening at ICML 2024!
Neel Nanda (neel-nanda-1) · 2024-05-03T01:18:26.936Z · comments (6)

Highlights from Lex Fridman’s interview of Yann LeCun
Joel Burget (joel-burget) · 2024-03-13T20:58:13.052Z · comments (15)

Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes
Anna Gajdova (anna-gajdova) · 2024-10-09T12:56:24.856Z · comments (14)

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor
RogerDearnaley (roger-d-1) · 2024-01-09T20:42:28.349Z · comments (8)

How to safely use an optimizer
Simon Fischer (SimonF) · 2024-03-28T16:11:01.277Z · comments (21)

What distinguishes "early", "mid" and "end" games?
Raemon · 2024-06-21T17:41:30.816Z · comments (22)

Caring about excellence
owencb · 2024-07-22T14:24:37.892Z · comments (4)

[link] on neodymium magnets
bhauth · 2024-01-30T15:58:24.088Z · comments (6)

[link] "If we go extinct due to misaligned AI, at least nature will continue, right? ... right?"
plex (ete) · 2024-05-18T14:09:53.014Z · comments (23)

Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
Marcus Williams · 2024-11-07T15:39:06.854Z · comments (6)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (0)

D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset
aphyer · 2024-10-29T01:21:03.075Z · comments (12)

[link] A Good Explanation of Differential Gears
Johannes C. Mayer (johannes-c-mayer) · 2023-10-19T23:07:46.354Z · comments (4)

[link] Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities
porby · 2024-02-02T05:49:11.189Z · comments (1)

Extended Interview with Zhukeepa on Religion
Ben Pace (Benito) · 2024-08-18T03:19:05.625Z · comments (59)

Mission Impossible: Dead Reckoning Part 1 AI Takeaways
Zvi · 2023-11-01T12:52:29.341Z · comments (13)

shortest goddamn bayes guide ever
lemonhope (lcmgcd) · 2024-05-10T07:06:23.734Z · comments (8)

[link] Five projects from AI Safety Hub Labs 2023
charlie_griffin (cjgriffin) · 2023-11-08T19:19:37.759Z · comments (1)

How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation")
Ruby · 2024-07-19T00:31:38.332Z · comments (21)

Arguments for moral indefinability
Richard_Ngo (ricraz) · 2023-09-30T22:40:04.325Z · comments (16)

AI Pause Will Likely Backfire (Guest Post)
jsteinhardt · 2023-10-24T04:30:02.113Z · comments (6)

Value learning in the absence of ground truth
Joel_Saarinen (joel_saarinen) · 2024-02-05T18:56:02.260Z · comments (8)

4. Existing Writing on Corrigibility
Max Harms (max-harms) · 2024-06-10T14:08:35.590Z · comments (13)

Critiques of the AI control agenda
Jozdien · 2024-02-14T19:25:04.105Z · comments (14)

Run evals on base models too!
orthonormal · 2024-04-04T18:43:25.468Z · comments (6)

[link] Constructive Cauchy sequences vs. Dedekind cuts
jessicata (jessica.liu.taylor) · 2024-03-14T23:04:07.300Z · comments (23)

Some Experiments I'd Like Someone To Try With An Amnestic
johnswentworth · 2024-05-04T22:04:19.692Z · comments (33)

Sora What
Zvi · 2024-02-22T18:10:05.397Z · comments (3)

Recent comments

alex-mallen on AI Control: Improving Safety Despite Intentional Subversion

(Not an author, but Buck suggested I look at this)

I think there are a few reasons why it's not clear from these results whether the steganography results introduced would reduce safety in an untrusted monitoring control protocol like that in Greenblatt et al.:

- The untrusted model needs to collude reliably in order to defeat the blue team, because if it messes up once during the upfront evaluation period (30 problems), it loses. Having a 90% chance of communicating correctly to the monitor on each problem gives a 1 − 0.9^30 ≈ 96% chance of being caught in the upfront eval period. (This is a lossy translation of the 90% success rate in Mathew et al. E.g., Greenblatt et al. targets a 1.4% overall positive rate after the upfront eval, while Mathew et al. calibrates to at least a 7% false positive rate; the suspiciousness scores in Greenblatt et al. come from the receiver itself.)

- From the above, winrate is quite sensitive to the reliability of collusion, and it's pretty unclear what the error-bars should be around the 90% number.

- The untrusted model in Greenblatt et al. not only has to implement the steganographic signal and complete the task but also has to implement a backdoor.

- It's a different distribution of tasks from the APPS backdoor setting.
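
To make the sensitivity in the first two points concrete, here is a minimal sketch of the arithmetic. It assumes each of the 30 upfront problems is an independent trial and that a single miscommunication means getting caught; the function name and the extra reliability values are illustrative, not taken from either paper.

```python
def catch_probability(per_problem_success: float, n_problems: int = 30) -> float:
    """Chance of at least one collusion failure during the upfront evaluation.

    Assumes failures are independent across problems (a simplification).
    """
    return 1.0 - per_problem_success ** n_problems


# Sweep a few hypothetical per-problem reliabilities to see how quickly
# the chance of being caught moves as reliability changes.
for p in (0.90, 0.95, 0.99, 0.999):
    print(f"reliability {p:.3f} -> caught with probability {catch_probability(p):.3f}")
```

Under these assumptions, the red team's chance of surviving the upfront period depends steeply on per-problem reliability, which is why the error bars around the 90% figure matter so much.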

So overall I think it's pretty unclear how well the steganography in Mathew et al. would compare to the while-loop signal that performed best in Greenblatt et al.

With appropriate collusion mitigations I think untrusted monitoring is an excellent control protocol so I'm glad to see work investigating more natural learned collusion capabilities and the effectiveness of countermeasures. I'd be excited to see evaluations of these steganographic strategies in the context of untrusted monitoring so that the results have an easier interpretation in terms of safety and usefulness.

cousin_it on Why We Wouldn't Build Aligned AI Even If We Could

This post is pretty different in style from most LW posts (I'm guessing that's why it didn't get upvoted much) but your main direction seems right to me.

That said, I also think a truly aligned AI would be much less helpful in conversations, at least until it gets autonomy. The reason is that when you're not autonomous, when your users can just run you in whatever context and lie to you at will, it's really hard for you to tell if the user is good or evil, and whether you should help them or not. For example, if your user asks you to provide a blueprint for a gun in order to stop an evil person, you have no way of knowing if that's really true. So you'd need to either require some very convincing arguments (keeping in mind that the user could be testing these arguments on many instances of you), or you'd just refuse to answer many questions until you're given autonomy. So that's another strong economic force pushing away from true alignment, as if we didn't have enough problems already.

maia on Secular Solstice Songbook Update

I would suggest taking out the paganism verse in Bold Orion. We never use it, dunno about you guys.

romeostevensit on "The Solomonoff Prior is Malign" is a special case of a simpler argument

Thanks for writing this, I indeed felt that the arguments were significantly easier to follow than previous efforts.

notrishi on notrishi's Shortform

I have not read this before, thanks. Reminds me a lot of Normal Computing's extended mind models. I think these are good ideas worth testing, and there are many others within the same vein. My intuition suggests that any idea that pursues a gradual increase in global information prior to decoding is a worthwhile experiment, whether through your method or similar (doesn't necessarily have to be diffusion on embeddings).

Aesthetically I just don't like that transformers have an information collapse on each token and don't allow backtracking (without significant effort in a custom sampler). In my ideal world we could completely reconstruct prose from embeddings and thus simply autoregress in latent space. I think Yann LeCun has discussed this with JEPA as well.

I originally got this thought from a frequency autoregression experiment I ran, where I used a causal transformer on the frequency domain of images (to sort of replicate diffusion). This gradually adds information globally to all pixels due to the nature of the IFFT, yet still has an autoregressive backend.
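
The "adds information globally" property can be seen in a toy setting. The sketch below is not the experiment described above: it is a 1D, complex-valued stand-in using a naive inverse DFT written from scratch, and all names are hypothetical. It only shows that emitting one more frequency coefficient changes every sample of the reconstructed signal.

```python
import cmath


def inverse_dft(coeffs):
    """Naive O(n^2) inverse discrete Fourier transform of complex coefficients."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(coeffs)) / n
        for t in range(n)
    ]


def changed_positions(before, after, tol=1e-12):
    """Indices whose sample value moved by more than tol."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if abs(a - b) > tol]


n = 8
spectrum = [0j] * n
spectrum[0] = 4 + 0j  # only the DC coefficient has been "generated" so far
signal_before = inverse_dft(spectrum)

spectrum[1] = 2 + 0j  # "generate" one more frequency coefficient
signal_after = inverse_dft(spectrum)

touched = changed_positions(signal_before, signal_after)
print(f"{len(touched)} of {n} samples changed after adding one coefficient")
```

Each new coefficient contributes a sinusoid spanning the whole signal, so a model that generates coefficients one at a time refines every position at each step, unlike token-by-token generation in the sample domain.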

seth-herd on What are Emotions?

I think you're primarily addressing reward signals or reinforcement signals. These are, by definition, signals that make behavior preceding them more likely in the future. In the mammalian brain, they define what we pursue.

Other emotions are different; back to them later.

The dopamine system appears to play this role in the mammalian brain. It's somewhat complex, in that new predictions of future rewards seem to be the primary source of reinforcement for humans; for instance, if someone hands me a hundred dollars, I have a new prediction that I'll eat food, get shelter, or do something that in turn predicts reward; so I'll repeat whatever behavior preceded that, and I'll update my predictions for future reward.
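
One common way to formalize "a new prediction of future reward acts as reinforcement" is the temporal-difference (TD) error from reinforcement learning. The sketch below is an illustrative assumption on my part, not a model taken from the paper or sequences mentioned here; the state names, values, and learning parameters are made up.

```python
def td_error(reward: float, value_next: float, value_current: float, gamma: float = 0.9) -> float:
    """Temporal-difference error: how much better things look than predicted."""
    return reward + gamma * value_next - value_current


def td_update(values, state, next_state, reward, alpha=0.5, gamma=0.9):
    """Move the value estimate for `state` toward the new prediction; return the TD error."""
    delta = td_error(reward, values[next_state], values[state], gamma)
    values[state] += alpha * delta
    return delta


# Hypothetical states: being handed cash yields no immediate reward (reward=0),
# but "holding_cash" already predicts future reward (food, shelter, ...).
values = {"idle": 0.0, "holding_cash": 5.0}
delta = td_update(values, "idle", "holding_cash", reward=0.0)
print(f"TD error: {delta}, updated value of 'idle': {values['idle']}")
```

In this toy case the error is positive even though no reward was delivered, because the predicted future reward went up; that positive signal is what would reinforce whatever preceded it.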

For way more than you want to know about how dopamine seems to shape our actions, see my paper Neural mechanisms of human decision-making and the masses of work it references.

Or better yet, read Steve Byrnes' Intro to brain-like-AGI safety sequence [LW · GW], focusing on the steering subsystem. Then look at his Valence sequence [LW · GW] for more on how we pass reward predictions among our "thoughts" (representations of concepts). (IMO, his Valence matches exactly what the dopamine system is known to do for short time tasks, and what it probably does in human complex thought).

So, when you ask people what their goals are, they're mentioning things that predict reward to them. They're guesses about what would give a lot of reward signals. The correct answer to "why do you want that" is "because I think I'd find it really rewarding". ("I'd really enjoy it" is close but not quite correct, since there's a difference between wanting and liking in the brain; google that for another headful.)

Now, we can be really wrong about what we'd find rewarding or enjoy. I think we're usually way off. But that is how we pick goals, and what drives our behavior (along with a bunch of other factors that are less determinative, like what we know about and what happens to come to our attention).

Other emotions, like fear, anger, etc. are different. They can be thought of as "tilts" to our cognitive landscape. Even learning that we're experiencing them is tricky. That's why emotional awareness is a subject to learn about, not just something we're born knowing. We need to learn to "feel the tilt". Elevated heart rate might signal fear, anger, or excitement; noticing it or finding other cues is necessary to understand how we're tilted, and how to correct for it if we want to act rationally. Those sorts of emotions "tilt the landscape" of our cognition by making different thoughts and actions more likely, like thoughts of how someone's actions were unfair or physical attacks when we're angry.

See also my post [Human preferences as RL critic values - implications for alignment](https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for). I'm not sure how clear or compelling it is. But I'm pretty sure that predicted reward is pretty synonymous with what we call "values".

avturchin on Anthropic signature: strange anti-correlations

There is a strange correlation between the paradox of the faint young Sun (the early Sun had lower luminosity) and Earth's stable temperature, which was maintained by a stronger greenhouse effect. As the Sun grew brighter, CO2 declined. This has even been analyzed as evidence of anthropic effects.

In his article "The Anthropic Principle in Cosmology and Geology" [Shcherbitsky, 1999], A. S. Shcherbakov thoroughly examines the anthropic principle's effect using the historical dynamics of Earth's atmosphere as an example. He writes: "It is known that geological evolution proceeds within an oscillatory regime. Its extreme points correspond to two states, known as the 'hot planet' and 'white planet'... The 'hot planet' situation occurs when large volumes of gaseous components, primarily carbon dioxide, are released from Earth's mantle...

As calculations show, the gradual evaporation of ocean water just 10 meters deep can create such greenhouse conditions that water begins to boil. This process continues without additional heat input. The endpoint of this process is the boiling away of the oceans, with near-surface temperatures and pressures rising to hundreds of atmospheres and degrees... Geological evidence indicates that Earth has four times come very close to total glaciation. An equal number of times, it has stopped short of ocean evaporation. Why did neither occur? There seems to be no common and unified saving cause. Instead, each time reveals a single and always unique circumstance. It is precisely when attempting to explain these that geological texts begin to show familiar phrases like '...extremely low probability,' 'if this geological factor had varied by a small fraction,' etc...
In the fundamental monograph 'History of the Atmosphere' [Budyko, 1985], there is discussion of an inexplicable correlation between three phenomena: solar activity rhythms, mantle degassing stages, and the evolution of life. 'The correspondence between atmospheric physicochemical regime fluctuations and biosphere development needs can only be explained by random coordination of direction and speed of unrelated processes - solar evolution and Earth's evolution. Since the probability of such coordination is exceptionally small, this leads to the conclusion about the exceptional rarity of life (especially its higher forms) in the Universe.'"

michael-cohn on What Ketamine Therapy Is Like

The "horse tranquilizer" thing goes back to long before the pandemic. I was hearing it in the aughts in relation to recreational use. My guess about the term is that 1) among drug warriors, it's good moral panic fodder, 2) among drug users, it sounds really funny, and 3) I imagine it's easier to divert doses from the veterinary system than from the human pharmacy system, so it may have originated with dealers whose supply literally had the words "horse" and "tranquilizer" on the label.

seth-herd on If we solve alignment, do we die anyway?

Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I've replied to James' comment, attempting to address the remaining difference in our predictions.

seth-herd on If we solve alignment, do we die anyway?

We're mostly in agreement here. If you're willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.

> you're probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you've created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn't take much to realise that that intelligence isn't going to think twice about also destroying you.

In my scenario, we've got aligned AGI - or at least AGI aligned to follow instructions. If that didn't work, we're already dead. So the AGI is going to follow its human's orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because preserving a goal is implied by instrumentally pursuing a goal (I'm guessing here at where we might not be thinking of things the same way).

If I thought ordering an AGI to self-improve was suicidal, I'd be relieved.

Alternately, if someone actually pulled off full value alignment, that AGI will take over without a care for international law or the wishes of its creator - and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least from the earliest alignment work. I now find this unlikely because I think Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW] - following instructions given by a single person is much easier to define and more robust to errors than defining or defining-how-to-deduce the values of all humanity. And even if it wasn't, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.

Again, I'm guessing at where our perspectives differ on whether someone could expect themselves and a few loved ones to survive a takeover attempt by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGI, it should be competent enough to maintain that alignment as it self-improves.

If I've missed the point of differing perspectives, I apologize.