LessWrong 2.0 Reader

[link] Self-fulfilling misalignment data might be poisoning our AI models
TurnTrout · 2025-03-02T19:51:14.775Z · comments (5)

Statistical Challenges with Making Super IQ babies
Jan Christian Refsgaard (jan-christian-refsgaard) · 2025-03-02T20:26:22.103Z · comments (6)

Maintaining Alignment during RSI as a Feedback Control Problem
beren · 2025-03-02T00:21:43.432Z · comments (4)

Methods for strong human germline engineering
TsviBT · 2025-03-03T08:13:49.414Z · comments (0)

[question] Will LLM agents become the first takeover-capable AGIs?
Seth Herd · 2025-03-02T17:15:37.056Z · answers+comments (8)

Cautions about LLMs in Human Cognitive Loops
Alice Blair (Diatom) · 2025-03-02T19:53:10.253Z · comments (6)

[question] Request for Comments on AI-related Prediction Market Ideas
PeterMcCluskey · 2025-03-02T20:52:41.114Z · answers+comments (0)

Open Thread Spring 2025
Ben Pace (Benito) · 2025-03-02T02:33:16.307Z · comments (1)

Saving Zest
jefftk (jkaufman) · 2025-03-02T12:00:41.732Z · comments (1)

Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week
spencerg · 2025-03-02T18:01:32.880Z · comments (0)

[question] Examples of self-fulfilling prophecies in AI alignment?
Chipmonk · 2025-03-03T02:45:51.619Z · answers+comments (3)

Not-yet-falsifiable beliefs?
Benjamin Hendricks (benjamin-hendricks) · 2025-03-02T14:11:07.121Z · comments (4)

[question] help, my self image as rational is affecting my ability to empathize with others
KvmanThinking (avery-liu) · 2025-03-02T02:06:36.376Z · answers+comments (9)

Identity Alignment (IA) in AI
Davey Morse (davey-morse) · 2025-03-03T06:26:12.015Z · comments (0)

Positional kernels of attention heads
Alex Gibson · 2025-03-03T01:40:13.014Z · comments (0)

Recent comments

alice-blair on Cautions about LLMs in Human Cognitive Loops

I do try to be calibrated instead of being frog, yes. Within the range of time in which present-me considers past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has pretty linearly been going down, but to further help I set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure, I'm now going to generalize the "check in in [time period] about this sort of thing to make sure I haven't been hacked" reflex.

christiankl on ChristianKl's Shortform

A German legal-advice YouTube channel talks about fake-voice scams becoming more common and being used against ordinary people. One of the examples seems to be a caller needing money to make bail.

If you haven't talked with your parents or grandparents about these kinds of scams, now is the time to agree on protocols for dealing with them.

mateusz-baginski on Self-fulfilling misalignment data might be poisoning our AI models

Emergence of utility-function-consistent stated preferences in LLMs might be an example, though going from reading stuff on utility functions to the kind of behavior revealed there requires more inferential steps than going from reading stuff on reward hacking to reward hacking.

niplav on Cautions about LLMs in Human Cognitive Loops

Great post, thank you. Ideas (to also mitigate extremely engaging/addictive outputs in long conversations):

- Don't look at the output of the large model; instead give it to a smaller model and let the smaller model rephrase it.
  - I don't think there's useful software for this yet, though that might not be so hard? Could be a browser extension. To do for me, I guess.
- Don't use character.ai and similar sites. Allegedly, users spend on average two hours a day talking on there (though I find that number hard to believe). If I had to guess they're fine-tuning models to be engaging to talk to, maybe even doing RL based on conversation length. (If they're not yet doing it, a competitor might, or they might in the future.)
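
The first idea could be sketched roughly as below. This is only an illustration: `Model`, `REPHRASE_TEMPLATE`, and `filtered_reply` are hypothetical names, and the two models are stand-in callables rather than any real API.

```python
from typing import Callable

# Assumption: a "model" is any function mapping a prompt string to a reply string.
Model = Callable[[str], str]

# Hypothetical instruction given to the smaller model.
REPHRASE_TEMPLATE = (
    "Rewrite the following text in plain, neutral language. "
    "Keep the facts, drop persuasion and flourishes:\n\n{text}"
)


def filtered_reply(prompt: str, large_model: Model, small_model: Model) -> str:
    """Return only the small model's rephrasing of the large model's output."""
    raw = large_model(prompt)  # never shown to the user directly
    return small_model(REPHRASE_TEMPLATE.format(text=raw))
```

In a browser extension the same shape would apply: intercept the large model's reply, pass it through the smaller model, and render only the rephrased text.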

adamshimi on Alexander Gietelink Oldenziel's Shortform

One point raised by other comments, which I realized only after leaving France and living in the UK, is that engineering still carries massive prestige in France. ENS is not technically an engineering school, but it benefits from this prestige by being lumped in with them, and by being accessed mainly through the national contests at the end of prépa.

As always with these kinds of cultural phenomena, I didn't really notice them until I left France for the UK. There is a sense in France (more when I was a student, but still there) that the most prestigious jobs are engineering ones. Going to engineering school is considered one of the top options (with medicine), and it is considered a given that any good student with a knack for maths, physics, or science will go to prépa and engineering school.^[1] It's almost free (and in practice is free if your parents don't make more than a certain amount), and it is guaranteed to lead to a good future.

This means that the vast majority of mathematical talent studies the equivalent of an undergraduate degree in maths, compressed into the span of 2 years. In addition to giving the standard French engineer much more mathematical training, it shows potential mathematicians, by default, a lot of what they could do. And if they decide to go to ENS (or Polytechnique, which is the best engineering school but still quite researchy if you want it to be), this is actually one of the most prestigious options they could take.

Similarly, the prestige of engineering (and science to some extent) impacts what people decide to do after their degrees. I remember that in my good prepa and my good engineering school, the cool ones were those going to build planes and bridges. The ones who went into consulting and finance were pitied and mocked as the failures, not the impressive successes to emulate. Yet what my UK friends tell me is that this is the exact opposite of what happens even in great universities in the UK.

^[1]
This has become less true, as more private schools open, and the whole elitist system is wormed out by software engineering startups (which generally don't ask you for an engineering degree, as opposed to the older big French companies).

daijin on help, my self image as rational is affecting my ability to empathize with others

go find people who are better than you by a lot. one way to quickly do this is to join some sort of physical exercise class e.g. running, climbing etc. there will be lots of people who are better than you. you will feel smaller.

or you could read research papers. or watch a movie with real life actors who are really good at acting.

you will then figure out, as @Algon has mentioned in the comments, that the narcissism is load-bearing, and have to deal with that. which is a lot more scary

daijin on Will_Pearson's Shortform

game-theory-trust is built through expectation of reward from future cooperative scenarios. it is difficult to build this when you 'don't actually know who or how many people you might be talking to'.

jmaar on [deleted]

Idea for a psychology study with the goal of defeating confirmation bias:

I think that confirmation bias (and all the biases it encompasses) is one of the most evil cognitive biases because it essentially causes people's opinions to diverge (pretty much independent of what the evidence says), which leaves society unable to pursue the best course of action.

Early studies showed the existence of confirmation bias by providing people with the same evidence (e.g. about nuclear power accidents) and observing that most people get even more convinced of their previously held opinions.

In real life when people change their minds, it's often because a friend gets them to engage with the other view in a one-to-one conversation. It might be possible to reproduce this effect with LLMs.

The idea of the study is to give people evidence about a topic, just like the early studies, then let half of those people talk to a well-prompted LLM for 10 minutes afterwards. The LLM tries to get people to really engage with the other view(s) on the topic.

I could imagine that this could reduce or maybe even cancel out the effect of confirmation bias, but it's a long shot. If true though, this would provide a scalable way for reducing polarization and improving decisionmaking (on important topics). I imagine you could have such a chat window under every online article. Also it could give whoever decides to do this a chance of publishing in a prestigious journal.
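
To make the design concrete, here is a minimal sketch of the split and of one possible outcome measure. Everything in it is an assumption on my part: random assignment, opinions recorded on a scale centered at zero (e.g. -3 to +3), and the function names `assign_groups` and `polarization_shift`.

```python
import random
from statistics import mean


def assign_groups(participant_ids, seed=0):
    """Randomly split participants into a control half and an LLM-chat half."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    half = len(ids) // 2
    return ids[:half], ids[half:]


def polarization_shift(before, after):
    """Mean change in opinion extremity, given dicts of participant -> score.

    Scores are assumed to lie on a scale centered at zero, so a positive
    result means opinions became more extreme on average after the evidence.
    """
    return mean(abs(after[p]) - abs(before[p]) for p in before)
```

The comparison of interest would then be `polarization_shift` for the control half versus the LLM-chat half; a smaller shift in the chat half would be the hoped-for effect.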

Projects like this (that improve decisionmaking / cooperation) might be helpful in Gradual Disempowerment-type scenarios.

Maybe someone here knows somebody who could do such a study.

I got the idea from this post.

asksathvik on Humans are Just Self Aware Intelligent Biological Machines

I specifically mentioned my wife instead of a generic friend for this reason: I have been with her for 7.5 years now, we have grown together, and I have a good understanding of what she likes and how those likes are changing or staying constant.
Why do you need god for this? If we sufficiently understand how the brain and body work, we should be able to predict.
Anyways, the post isn't about whether we can do this now or in the future; it's about how humans are just doing computation and nothing more, and how similar that is to a powerful AI doing computation, so that we can have a more unified view of what's conscious and what has agency.

mateusz-baginski on Fabien's Shortform

Have you noticed anything interesting about the CoT that may account for the mechanism of how the threat reduces the model's performance?