LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)

[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (14)

Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (60)

Based Beff Jezos and the Accelerationists
Zvi · 2023-12-06T16:00:08.380Z · comments (29)

Public Call for Interest in Mathematical Alignment
Davidmanheim · 2023-11-22T13:22:09.558Z · comments (9)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)

story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)

[link] Large Language Models can Strategically Deceive their Users when Put Under Pressure.
ReaderM · 2023-11-15T16:36:04.446Z · comments (8)

Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)

[link] Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan (akbir-khan) · 2024-02-07T21:28:10.694Z · comments (14)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (5)

Three Notions of "Power"
johnswentworth · 2024-10-30T06:10:08.326Z · comments (43)

Teaching CS During Take-Off
andrew carle (andrew-carle) · 2024-05-14T22:45:39.447Z · comments (13)

On the abolition of man
Joe Carlsmith (joekc) · 2024-01-18T18:17:06.201Z · comments (18)

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

I'm a bit skeptical of AlphaFold 3
Oleg Trott (oleg-trott) · 2024-06-25T00:04:41.274Z · comments (14)

How well do truth probes generalise?
mishajw · 2024-02-24T14:12:19.729Z · comments (11)

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)

Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (18)

Research update: Towards a Law of Iterated Expectations for Heuristic Estimators
Eric Neyman (UnexpectedValues) · 2024-10-07T19:29:29.033Z · comments (2)

[link] More Hyphenation
Arjun Panickssery (arjun-panickssery) · 2024-02-07T19:43:29.086Z · comments (19)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

[link] Self-Help Corner: Loop Detection
adamShimi · 2024-10-02T08:33:23.487Z · comments (6)

[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)

[link] Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk (jkaufman) · 2024-06-27T14:01:34.868Z · comments (10)

A Crisper Explanation of Simulacrum Levels
Thane Ruthenis · 2023-12-23T22:13:52.286Z · comments (13)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

Apply to be a Safety Engineer at Lockheed Martin!
yanni kyriacos (yanni) · 2024-03-31T21:02:08.499Z · comments (3)

OpenAI: Helen Toner Speaks
Zvi · 2024-05-30T21:10:02.938Z · comments (8)

The Aspiring Rationalist Congregation
maia · 2024-01-10T22:52:54.298Z · comments (23)

LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (40)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (20)

Neutrality
sarahconstantin · 2024-11-13T23:10:05.469Z · comments (0)

Rejecting Television
Declan Molony (declan-molony) · 2024-04-23T04:59:50.253Z · comments (10)

[Valence series] 2. Valence & Normativity
Steven Byrnes (steve2152) · 2023-12-07T16:43:49.919Z · comments (5)

5 homegrown EA projects, seeking small donors
Austin Chen (austin-chen) · 2024-10-28T23:24:25.745Z · comments (4)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (14)

[link] Anxiety vs. Depression
Sable · 2024-03-17T00:15:08.255Z · comments (35)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (51)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

[link] Environmentalism in the United States Is Unusually Partisan
Jeffrey Heninger (jeffrey-heninger) · 2024-05-13T21:23:10.755Z · comments (26)

Addressing Feature Suppression in SAEs
Benjamin Wright (Benw8888) · 2024-02-16T18:32:51.927Z · comments (3)

Some for-profit AI alignment org ideas
Eric Ho (eh42) · 2023-12-14T14:23:20.654Z · comments (19)

[link] A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien (alexandre-variengien) · 2023-12-19T11:52:27.354Z · comments (3)

[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (10)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

satron on Why would ASI share any resources with us?

That makes sense. Are there any promising developments in the field of AI safety that make you think that we will be able to answer that question by the time we need to?

quetzal_rainbow on D0TheMath's Shortform

You are making an error here: ZFC + not Consistent(ZFC) != ZFC.

Assuming ZFC + not Consistent(ZFC) we can prove Consistent(ZFC), because inconsistent systems can prove everything and ZFC + not Consistent(ZFC) + Consistent(ZFC) is, in fact, inconsistent. But it doesn't say anything about consistency of ZFC itself, because you can freely assume any sufficiently powerful system instead of ZFC. If you assume inconsistent system, then system + not Consistent(system) is still inconsistent, if you assume consistent system, then system + not Consistent(system) is inconsistent for reasoning above, so it can't prove whether assumed system is consistent or not.

satron on Buck's Shortform

I am asking this, because I am planning on entering the field, but this is a pretty niche topic that I haven't yet researched.

startattheend on Anvil Problems

I meant that they were functionally booleans, as a single condition is fulfilled "is rich", "has anvil", "AGI achieved". In the anvil example, any number past 1 corresponds to true. In programming, casting positive integers to booleans results in "true" for all positive numbers, and "false" in the case of zero, just like in the anvil example. The intuition carries over too well for me to ignore.

The first example which came to mind for me when reading the post was confidence, which is often treated as a boolean "Does he have confidence? yes/no". So you don't need any countable objects, only a condition/threshold which is either reached or not, with anything past "yes" still being "yes".

A function where everything past a threshold maps to true, and anything before it maps to false, is similar to the anvil example, and to a function like "is positive" (since a more positive number is still positive). But for the threshold to be exactly 1 unit, you need to choose a unit which is large enough. 1$ is not rich, and having one water droplet on you is not "wet", but with the appropriate unit (exactly the size of the threshold/condition) these should be functionally similar.

I'm hoping there is simple and intuitive mathematics for generalizing this class of problems. And now that I think about it, most of these things (the ones which can be used for making more of themselves) are catalysts (something used but not consumed in the process of making something). Using money to make more money, anvils to make more anvils, breeding more of a species before it goes extinct.

kajus on Complex Systems for AI Safety [Pragmatic AI Safety #3]

Great post!

In the past, broad interventions would clearly have been more effective: for instance, there would have been little use in studying empirical alignment prior to deep learning. Even more recently than the advent of deep learning, many approaches to empirical alignment were highly deemphasized when large, pretrained language models arrived on the scene (refer to our discussion of creative destruction in the last post).

As discussed in the last post, a leading motivation for researchers is the interestingness or “coolness” of a problem. Getting more people to research relevant problems is highly dependent on finding interesting and well-defined subproblems for them to work on. This relies on concretizing problems and providing funding for solving them.

This seems be a conflicting advice to me. If you try to follow both you might end up having hard time finding direction for research.

daemonicsigil on The Foraging (Ex-)Bandit [Ruleset & Reflections]

Thanks for making the game! I also played it, just didn't leave a comment on the original post. Scored 2751. I played each location for an entire day after building an initial food stockpile, and so figured out the timing of Tiger Forest and Dog Valley. But I also did some fairly dumb stuff, like assuming a time dependence for other biomes. And I underestimated Horse Hills, since when I foraged it for a full day, I got unlucky and only rolled a single large number. For what it's worth, I find these applet things more accessible than a full-on D&D.Sci (though those are also great), which I often end up not playing because it feels too much like work. With applets you can play on medium-low effort (which I did) and make lots of mistakes (which I did) and learn Valuable Lessons about How Not To Science (which one might hope I did).

jan-betley on Seven lessons I didn't learn from election day

The second reason that I don't trust the neighbor method is that people just... aren't good at knowing who a majority of their neighbors are voting for. In many cases it's obvious (if over 70% of your neighbors support one candidate or the other, you'll probably know). But if it's 55-45, you probably don't know which direction it's 55-45 in.

My guess is that there's some postprocessing here. E.g. if you assume that the "neighbor" estimate is wrong but without the refusal problem, and you have the same data from the previous election, then you could estimate the shift of opinions and apply that to other pools that ask about your vote. Or you could ask some additional question like "who did your neighbours vote for in the previous election" and compare that to the real data (ideally per county or so). I would be very surprised if they based the bets just on the raw results.

unexpectedvalues on Seven lessons I didn't learn from election day

If you ask people who their neighbors are voting for, they will make their best guess about who their neighbors are voting for. Occasionally their best guess will be to assume that their neighbors will vote the same way that they're voting, but usually not. Trump voters in blue areas will mostly answer "Harris" to this question, and Harris voters in red areas will mostly answer "Trump".

clone-of-saturn on Seven lessons I didn't learn from election day

The second reason that I don’t trust the neighbor method is that people just… aren’t good at knowing who a majority of their neighbors are voting for.

This seems like a point in favor of the neighbor method, not against it. You would want people to find "who are my neighbors voting for?" too difficult to readily answer and so mentally replace it with the simpler question "who am I voting for?" thus giving them a plausibly deniable way to admit to voting for Trump.

jimmy on [Intuitive self-models] 4. Trance

I'm the person JenniferRM mentioned. I'm also a physics guy, and got into studying/practicing hypnosis in ~2010/2011. I kinda moved on from "hypnosis" and drifted up the abstraction ladder, but still working on similar things and working on tying them together.

Anyway, here are my thoughts.

Suppose I really want her to be spinning clockwise in my mind. What might I do?

What worked for me is to focus on the foot alone and ignore the broader context so that I had a "clean slate" without "confirmatory experience" blocking my desired conclusion. When looking at the foot alone I experience it as oscillating rather than rotating (which I guess it technically is), and from there I can "release" it into whichever spin I intend by just kinda imagining that this is what's going on.

On the one hand, shifting intuitive models is surprisingly hard! You can’t necessarily just want to have a particular intuitive model, and voluntarily make that happen.

I actually disagree with this. It certainly seems hard, but the difficulty is largely illusory and pretty much disappears once you stop trying to walk through the wall and notice the front door.

The problem is that "wanting to have a particular model" isn't the thing that matters. You can want to have a particular model all you want, and you can even think the model is true all you want, but you're still talking about the statement itself not about the reality to which the statement refers. Even if you convince someone that their fear is irrational and they'd be better off not being scared, you've still only convinced them that their fear is irrational and they'd be better off not being scared. If you want to convince them that they are safe -- and therefore change their fear response itself -- then you need to convince them that they're safe. It's the difference between looking at yourself from the third person and judging whether your beliefs are correct or not, vs looking at the world from the first person and seeing what is there. If you want to change the third person perspective, then you can look at which models are desirable and why. If you want to change the first person models themselves, you have to look to the world and see what's there.

This doesn't really work with the spinning dancer because "Which way is the dancer spinning?" doesn't have an answer, but this is an artificial issue which doesn't exist in the real world. You still have to figure out "Is this safe enough to be worth doing?" and that's not always trivial, but the problem of "How do I change this irrational fear?" (for example) is. The answer is "By attending to the question of whether it is actually safe".

I don't deny that there's "skill" to it, but most of the skill IME is a meta skill of knowing what to even aim for rather than aiming well. Once you start attending to "Is it safe enough?", then when the answer is actually obvious the intuitive models just change. I can give a whole bunch of examples of this if you want, where people were stuck unable to change their responses and the problem just melts away with this redirection. Even stuff that you'd think would be resistant to change like physical pain can change essentially instantly. I've had it take as little as a single word.

Again we see that the subject is made to feel that his body is out of control, and becomes subject to a high-status person. Some hypnotists sit you down, ask you to stare upwards into their eyes and suggest that your eyelids are wanting to close—which works because looking upwards is tiring, and because staring up into a high-status person’s eyes makes you feel inferior.

This isn't exactly wrong, but I want to push back on the implication that this is the central or most important thing here.

The central thing, IMO, is a willingness to try on another person's worldview even though it clashes with your own. It doesn't require "inferiority"/"high status"/"control" except in the extremely minimal sense that they might know something important that you don't, and that seeing it for yourself might change your behavior. That alone will get you inhibition of all the normal stuff and an automatic (albeit tentative) acceptance of worldview-dissonant perspectives (e.g. name amnesia). It helps if the person has reason to respect and trust you which is kinda like "high status", but not really because it can just as easily happen with people on equal social standing in neutral contexts.

Similarly, hypnosis has very little to do with sleep and eye fatigue/closure is not the important part of eye contact. The important part of eye contact is that it's incredibly communicative. You can convey with eye contact things which you can't convey with words. "I see you". "Seeing you doesn't cause conflict in me". "I see you seeing me see you" and so on, to name a few. All the things you need to communicate to show someone that your perspective is safe and worthy of experiencing are best communicated with the eyes. And perhaps equally important it is a bid for attention, by holding your own.

So far, this isn’t a trance; I’m just describing a common social dynamic. Specifically, if I’m not in a hypnotic trance, the sequence of thoughts in the above might look like a three-step process:
[...]
i.e., in my intuitive model, first, the hypnotist exercises his free will with the intention of me standing; second, I (my homunculus) exercise my own free will with the intention of standing; and third, I actually stand. In this conceptualization, it’s my own free will / vitalistic force / wanting (§3.3.4) that causes me to stand. So this is not a trance.

It's important to note that while this self reflective narrative is indeed different in the way you describe, the underlying truth often is not. In the hypnosis literature this is known as "cold control theory", because it's the same control without the usual Higher Order Thoughts (HOT).

In "common social dynamics" we explain it as "I chose to", but what is actually happening a lot of the time is the speaker is exercising their free will through your body, and you're not objecting because it matches your narrative. The steps aren't actually in series, and you didn't choose to do it so much as you chose to not decline to do it.

These "higher order thoughts" do change some things, but turn out to be relatively unimportant and the better hypnotists usually don't bother too much with them and instead just address the object level. This is also why you get hypnotists writing books subtitled "there's no such thing as hypnosis" and stuff like that.

The short version is: If I have a tune in my head, then I’m very unlikely to simultaneously recall a memory of a different tune. Likewise, if I’m angry right now, then I’m less likely to recall past memories where I felt happy and forgiving, and vice-versa.

As far as I can tell, there are several different things going on with amnesia. I agree that this is one of them, and I'm not sure if I've seen anyone else notice this, so it's cool to see someone point it out.

The "null hypothesis", though, any time it comes to hypnosis is that it's all just response to suggestion. You "know" that being hypnotized involves amnesia, and you believe you're hypnotized, so you experience what you expect. There's an academic hypnosis researcher I talk to sometimes who doesn't even believe "hypnotic trance" is real in any fundamental sense and thinks that all the signs of trance are the result of suggestion.

I don't believe suggestion is all that's going on, but it really is sufficient for amnesia. The answer to Yudkowsky's old question of "Do we believe everything we're told?" [LW · GW] is indeed "Yes" -- if we don't preemptively push it away or actively remember to unbelieve later. Back when I was working this stuff out I did a fun experiment where I'd come up with an excuse to get people to not pre-emptively reject what I was about to say, then I'd suggest amnesia for this conversation and that they'd laugh when I scratch my nose, and then I'd distract them so that the suggestion could take effect before they had a chance to unbelieve it. The excuse was something like "I know this is ridiculous so I don't expect you to believe it, but hear me out and let me know if you understand" -- which is tricky because they think the fact that we "agreed" that they won't believe it means they actually aren't believing it when they say "I understand", even though the full statement is "I understand [that I will laugh when you scratch your nose and have no idea why"]. They still had awareness that this belief is wrong and would therefore act to stop themselves from acting on it, which is why the unexpected distraction was necessary in order to get their mind off of it long enough for it to work.