LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
DanielFilan · 2024-08-24T22:30:02.039Z · comments (0)

[question] What should OpenAI do that it hasn't already done, to stop their vacancies from being advertised on the 80k Job Board?
WitheringWeights (EZ97) · 2024-10-21T13:57:30.934Z · answers+comments (0)

Alignment by default: the simulation hypothesis
gb (ghb) · 2024-09-25T16:26:00.552Z · comments (39)

A short project on Mamba: grokking & interpretability
Alejandro Tlaie (alejandro-tlaie-boria) · 2024-10-18T16:59:45.314Z · comments (0)

[link] AI Model Registries: A Foundational Tool for AI Governance
Elliot Mckernon (elliot) · 2024-10-07T19:27:43.466Z · comments (1)

[link] Compression Moves for Prediction
adamShimi · 2024-09-14T17:51:12.004Z · comments (0)

Musings on Text Data Wall (Oct 2024)
Vladimir_Nesov · 2024-10-05T19:00:21.286Z · comments (2)

A necessary Membrane formalism feature
ThomasCederborg · 2024-09-10T21:33:09.508Z · comments (6)

Simon DeDeo on Explore vs Exploit in Science
Elizabeth (pktechgirl) · 2024-09-10T03:40:08.311Z · comments (0)

AI Can be “Gradient Aware” Without Doing Gradient hacking.
Sodium · 2024-10-20T21:02:10.754Z · comments (0)

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization
Sahil · 2024-11-07T05:27:20.276Z · comments (1)

[link] Does natural selection favor AIs over humans?
cdkg · 2024-10-03T18:47:43.517Z · comments (1)

My decomposition of the alignment problem
Daniel C (harper-owen) · 2024-09-02T00:21:08.359Z · comments (22)

[question] What is the alpha in one bit of evidence?
J Bostock (Jemist) · 2024-10-22T21:57:09.056Z · answers+comments (12)

[link] Anthropic is being sued for copying books to train Claude
Remmelt (remmelt-ellen) · 2024-08-31T02:57:27.092Z · comments (4)

[link] Towards the Operationalization of Philosophy & Wisdom
Thane Ruthenis · 2024-10-28T19:45:07.571Z · comments (2)

How Often Does Taking Away Options Help?
niplav · 2024-09-21T21:52:40.822Z · comments (6)

Gell-Mann checks
Cleo Scrolls (cleo-scrolls) · 2024-09-26T22:45:43.569Z · comments (7)

Looking for Goal Representations in an RL Agent - Update Post
CatGoddess · 2024-08-28T16:42:19.367Z · comments (0)

[link] To Be Born in a Bag
Niko_McCarty (niko-2) · 2024-10-06T17:21:00.605Z · comments (1)

D/acc AI Security Salon
Allison Duettmann (allison-duettmann) · 2024-10-19T22:17:57.067Z · comments (0)

Economics Roundup #4
Zvi · 2024-10-15T13:20:06.923Z · comments (4)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
tailcalled · 2024-09-13T17:09:25.407Z · comments (40)

Review: “The Case Against Reality”
David Gross (David_Gross) · 2024-10-29T13:13:29.643Z · comments (9)

[link] Fragile, Robust, and Antifragile Preference Satisfaction
adamShimi · 2024-11-02T17:25:55.986Z · comments (0)

Why Reflective Stability is Important
Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z · comments (2)

Announcing the PIBBSS Symposium '24!
DusanDNesic · 2024-09-03T11:19:47.568Z · comments (0)

Lab governance reading list
Zach Stein-Perlman · 2024-10-25T18:00:28.346Z · comments (3)

[question] What are the best resources for building gears-level models of how governments actually work?
adamShimi · 2024-08-19T14:05:02.590Z · answers+comments (6)

[link] Update on the Mysterious Trump Buyers on Polymarket
Annapurna (jorge-velez) · 2024-11-04T19:22:06.540Z · comments (9)

Bridging the VLM and mech interp communities for multimodal interpretability
Sonia Joseph (redhat) · 2024-10-28T14:41:41.969Z · comments (5)

[question] How great is the utility of "saving" endangered languages?
SpectrumDT · 2024-08-20T13:14:32.895Z · answers+comments (29)

Advisors for Smaller Major Donors?
jefftk (jkaufman) · 2024-11-06T14:30:06.187Z · comments (2)

Finding Deception in Language Models
Esben Kran (esben-kran) · 2024-08-20T09:42:13.060Z · comments (4)

Avoiding the Bog of Moral Hazard for AI
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-13T21:24:34.137Z · comments (12)

"Real AGI"
Seth Herd · 2024-09-13T14:13:24.124Z · comments (20)

In the Name of All That Needs Saving
pleiotroth · 2024-11-07T15:26:12.252Z · comments (2)

Word Spaghetti
Gordon Seidoh Worley (gworley) · 2024-10-23T05:39:20.105Z · comments (9)

Can Large Language Models effectively identify cybersecurity risks?
emile delcourt (emile-delcourt) · 2024-08-30T20:20:21.345Z · comments (0)

[link] Should Sports Betting Be Banned?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-21T14:13:35.404Z · comments (2)

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)
Linda Linsefors · 2024-08-23T14:18:24.327Z · comments (2)

[link] some questionable space launch guns
bhauth · 2024-10-13T22:52:26.418Z · comments (0)

[link] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng (ccstan99) · 2024-11-01T00:10:50.718Z · comments (0)

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

My career exploration: Tools for building confidence
lynettebye · 2024-09-13T11:37:55.843Z · comments (0)

[link] Jonothan Gorard:The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

Interview with Robert Kralisch on Simulators
WillPetillo · 2024-08-26T05:49:15.543Z · comments (0)

[question] Is this voting system strategy proof?
Donald Hobson (donald-hobson) · 2024-09-06T20:44:46.691Z · answers+comments (9)

[link] Will we ever run out of new jobs?
Kevin Kohler (KevinKohler) · 2024-08-19T15:04:03.849Z · comments (7)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

alex-mallen on AI Control: Improving Safety Despite Intentional Subversion

(Not an author, but Buck suggested I look at this)

I think there's a few reasons why it's not clear from these results whether the steganography results introduced would reduce safety in an untrusted monitoring control protocol like that in Greenblatt et al.:

- The untrusted model needs to reliably collude in order to defeat the blue-team because if it messes up once during the upfront evaluation period (30 problems), it loses. Having a 90% chance of communicating correctly to the monitor on each problem gives a 1 - 0.9^30 = 96% chance of being caught in the upfront eval period. (This is a lossy translation of the 90% success rate in Mathew et al. E.g. Greenblatt et al. targets a 1.4% overall positive rate after the upfront eval, while Mathew et al. calibrates to at least a 7% false positive rate; The suspiciousness scores in Greenblatt et al. come from the receiver itself.)

- From the above, winrate is quite sensitive to the reliability of collusion, and it's pretty unclear what the error-bars should be around the 90% number.

- The untrusted model in Grennblatt et al. not only has to implement the steganographic signal and complete the task but also has to implement a backdoor.

- It's a different distribution of tasks from the APPS backdoor setting.

So overall I think it's pretty unclear how well the steganography in Mathew et al. would compare to the while-loop signal that performed best in Greenblatt et al.

With appropriate collusion mitigations I think untrusted monitoring is an excellent control protocol so I'm glad to see work investigating more natural learned collusion capabilities and the effectiveness of countermeasures. I'd be excited to see evaluations of these steganographic strategies in the context of untrusted monitoring so that the results have an easier interpretation in terms of safety and usefulness.

cousin_it on Why We Wouldn't Build Aligned AI Even If We Could

This post is pretty different in style from most LW posts (I'm guessing that's why it didn't get upvoted much) but your main direction seems right to me.

That said, I also think a truly aligned AI would be much less helpful in conversations, at least until it gets autonomy. The reason is that when you're not autonomous, when your users can just run you in whatever context and lie to you at will, it's really hard for you to tell if the user is good or evil, and whether you should help them or not. For example, if your user asks you to provide a blueprint for a gun in order to stop an evil person, you have no way of knowing if that's really true. So you'd need to either require some very convincing arguments (keeping in mind that the user could be testing these arguments on many instances of you), or you'd just refuse to answer many questions until you're given autonomy. So that's another strong economic force pushing away from true alignment, as if we didn't have enough problems already.

maia on Secular Solstice Songbook Update

I would suggest taking out the paganism verse in Bold Orion. We never use it, dunno about you guys.

romeostevensit on "The Solomonoff Prior is Malign" is a special case of a simpler argument

Thanks for writing this, I indeed felt that the arguments were significantly easier to follow than previous efforts.

notrishi on notrishi's Shortform

I have not read this before, thanks. Reminds me a lot of Normal Computings extended mind models. I think these are good ideas worth testing, and there are many others within the same vein. My intuition suggests that any idea that pursues a gradual increase in global information prior to decoding is a worthwhile experiment, whether through your method or similar (doesn't necessarily have to be diffusion on embeddings).

Aesthetically I just don't like that transformers have an information collapse on each token and don't allow backtracking (without significant effort in a custom sampler). In my ideal world we could completely reconstruct prose from embeddings and thus simply autoregress in latent space. I think Yann Lecun has discussed this with JEPA as well.

I originally had my thought from a frequency autoregression experiment I had, where I used a causal transformer on the frequency domain of images (to sort of replicate diffusion). This gradually adds information globally to all pixels due to the nature of the ifft, yet still has an autoregressive backend.

seth-herd on What are Emotions?

I think you're primarily addressing reward signals or reinforcement signals. These are, by definition, signals that make behavior preceding them more likely in the future. In the mammalian brain, they define what we pursue.

Other emotions are different; back to them later.

The dopamine system appears to play this role in the mammalian brain. It's somewhat complex, in that new predictions of future rewards seem to be the primary source of reinforcement for humans; for instance, if someone hands me a hundred dollars, I have a new prediction that I'll eat food, get shelter, or do something that in turn predicts reward; so I'll repeat whatever behavior preceded that, and I'll update my predictions for future reward.

For way more than you want to know about how dopamine seems to shape our actions, see my paper Neural mechanisms of human decision-making and the masses of work it references.

Or better yet, read Steve Byrnes' Intro to brain-like-AGI safety sequence [LW · GW], focusing on the steering subsystem. Then look at his Valence sequence [LW · GW] for more on how we pass reward predictions among our "thoughts" (representations of concepts). (IMO, his Valence matches exactly what the dopamine system is known to do for short time tasks, and what it probably does in human complex thought).

So, when you ask people what their goals are, they're mentioning things that predict reward to them. They're guesses about what would give a lot of reward signals. The correct answer to "'why do you want that" is "because I think I'd find it really rewarding". ("I'd really enjoy it" is close but not quite correct, since there's a difference between wanting and liking in the brain- google that for another headfull).

Now, we can be really wrong about what we'd find rewarding or enjoy. I think we're usually way off. But that is how we pick goals, and what drives our behavior (along with a bunch of other factors that are less determinative, like what we know about and what happens into our attention).

Other emotions, like fear, anger, etc. are different. They can be thought of as "tilts"' to our cognitive landscape. Even learning that we're experiencing them is tricky. That's why emotional awareness is a subject to learn about, not just something we're born knowing. We need to learn to "feel the tilt". Elevated heart rate might signal fear, anger, or excitement; noticing it or finding other cues are necessary to understand how we're tilted, and how to correct for it if we want to act rationally. Those sorts of emotions "tilt the landscape" of our cognition by making different thoughts and actions more likely, like thoughts of how someone's actions were unfair or physical attacks when we're angry.

See also my post [Human preferences as RL critic values - implications for alignment](https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for). I'm not sure how clear or compelling it is. But I'm pretty sure that predicted reward is pretty synonymous with what we call "values".

avturchin on Anthropic signature: strange anti-correlations

There is a strange correlation between paradox of young Sun (it had lower luminosity) and stable Earth temperature which was provided by higher greenhouse effect. As sun goes brighter, CO2 declined. It was even analyses as evidence of anthropic effects.

In his article "The Anthropic Principle in Cosmology and Geology" [Shcherbitsky, 1999], A. S. Shcherbakov thoroughly examines the anthropic principle's effect using the historical dynamics of Earth's atmosphere as an example. He writes: "It is known that geological evolution proceeds within an oscillatory regime. Its extreme points correspond to two states, known as the 'hot planet' and 'white planet'... The 'hot planet' situation occurs when large volumes of gaseous components, primarily carbon dioxide, are released from Earth's mantle...

As calculations show, the gradual evaporation of ocean water just 10 meters deep can create such greenhouse conditions that water begins to boil. This process continues without additional heat input. The endpoint of this process is the boiling away of the oceans, with near-surface temperatures and pressures rising to hundreds of atmospheres and degrees... Geological evidence indicates that Earth has four times come very close to total glaciation. An equal number of times, it has stopped short of ocean evaporation. Why did neither occur? There seems to be no common and unified saving cause. Instead, each time reveals a single and always unique circumstance. It is precisely when attempting to explain these that geological texts begin to show familiar phrases like '...extremely low probability,' 'if this geological factor had varied by a small fraction,' etc...
In the fundamental monograph 'History of the Atmosphere' [Budyko, 1985], there is discussion of an inexplicable correlation between three phenomena: solar activity rhythms, mantle degassing stages, and the evolution of life. 'The correspondence between atmospheric physicochemical regime fluctuations and biosphere development needs can only be explained by random coordination of direction and speed of unrelated processes - solar evolution and Earth's evolution. Since the probability of such coordination is exceptionally small, this leads to the conclusion about the exceptional rarity of life (especially its higher forms) in the Universe.'"

michael-cohn on What Ketamine Therapy Is Like

The "horse tranquilizer" thing goes back to long before the pandemic. I was hearing it in the aughts in relation to recreational use. My guess about the term is that 1) among drug warriors, it's good moral panic fodder, 2) among drug users, it sounds really funny, and 3) I imagine it's easier to divert doses from the veterinary system than from the human pharmacy system, so it may have originated with dealers whose supply literally had the words "horse" and "tranquilizer" on the label.

seth-herd on If we solve alignment, do we die anyway?

Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I've replied to James' comment, attempting to address the remaining difference in our predictions.

seth-herd on If we solve alignment, do we die anyway?

We're mostly in agreement here. If you're willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely.

you're probably smart enough to know that the scenario outlined here has a near 100% chance of failure for you and your family, because you've created something more intelligent than you that is willing to hide its intentions and destroy billions of people, it doesn't take much to realise that that intelligence isn't going to think twice about also destroying you.

In my scenario, we've got aligned AGI - or at least AGI aligned to follow instructions. If that didn't work, we're already dead. So the AGI is going to follow its human's orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because preserving a goal is implied by instrumentally pursuing a goal (I'm guessing here at where we might not be thinking of things the same way).

If I thought ordering an AGI to self-improve was suicidal, I'd be relieved.

Alternately, if someone actually pulled off full value alignment, that AGI will take over without a care for international law or the wishes of its creator - and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least from the earliest alignment work. I now find this unlikely because I think Instruction-following AGI is easier and more likely than value aligned AGI [LW · GW] - following instructions given by a single person is much easier to define and more robust to errors than defining or defining-how-to-deduce the values of all humanity. And even if it wasn't, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory.

Again, I'm guessing at where our perspectives on whether someone could expect themselves and a few loved ones to survive a takeover attempt by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGIi, it should be competent enough to maintain that alignment as it self improves.

If I've missed the point of differing perspectives, I apologize.