LessWrong 2.0 Reader

Biasing VLM Response with Visual Stimuli
Jaehyuk Lim (jason-l) · 2024-10-03T18:04:31.474Z · comments (0)

[link] AI Safety Newsletter #41: The Next Generation of Compute Scale Plus, Ranking Models by Susceptibility to Jailbreaking, and Machine Ethics
Corin Katzke (corin-katzke) · 2024-09-11T19:14:08.274Z · comments (1)

[question] If the DoJ goes through with the Google breakup, where does Deepmind end up?
O O (o-o) · 2024-10-12T05:06:50.996Z · answers+comments (1)

[link] Should we abstain from voting? (In nondeterministic elections)
B Jacobs (Bob Jacobs) · 2024-10-02T10:07:43.167Z · comments (6)

[link] Linkpost: Hypocrisy standoff
Chris_Leong · 2024-09-29T14:27:19.175Z · comments (1)

[question] How do we know dreams aren't real?
Logan Zoellner (logan-zoellner) · 2024-08-22T12:41:57.380Z · answers+comments (31)

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets
Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-09-16T01:04:32.953Z · comments (1)

Scattered thoughts on what it means for an LLM to believe
TheManxLoiner · 2024-11-06T22:10:29.429Z · comments (4)

Some reasons to start a project to stop harmful AI
Remmelt (remmelt-ellen) · 2024-08-22T16:23:34.132Z · comments (0)

[question] AMA: International School Student in China
Novice · 2024-10-01T06:00:16.282Z · answers+comments (0)

[link] An "Observatory" For a Shy Super AI?
Sherrinford · 2024-09-27T21:22:40.296Z · comments (0)

Longevity and the Mind
George3d6 · 2024-09-16T09:43:09.700Z · comments (2)

Using Narrative Prompting to Extract Policy Forecasts from LLMs
Max Ghenis (MaxGhenis) · 2024-11-05T04:37:52.004Z · comments (0)

[link] How long should political (and other) terms be?
ohmurphy · 2024-10-14T21:38:43.050Z · comments (0)

Meta: On viewing the latest LW posts
quiet_NaN · 2024-08-25T19:31:39.008Z · comments (2)

[link] Join the $10K AutoHack 2024 Tournament
Paul Bricman (paulbricman) · 2024-09-25T11:54:20.112Z · comments (0)

Effects of Non-Uniform Sparsity on Superposition in Toy Models
Shreyans Jain (shreyans-jain) · 2024-11-14T16:59:43.234Z · comments (3)

Interest poll: A time-waster blocker for desktop Linux programs
nahoj · 2024-08-22T20:44:04.479Z · comments (5)

Some Comments on Recent AI Safety Developments
testingthewaters · 2024-11-09T16:44:58.936Z · comments (0)

Ways to think about alignment
Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-10-27T01:40:50.762Z · comments (0)

Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation
Antonio Clarke (antonio-clarke) · 2024-09-29T18:48:23.308Z · comments (0)

[link] Predictions as Public Works Project — What Metaculus Is Building Next
ChristianWilliams · 2024-10-22T16:35:13.999Z · comments (0)

It is time to start war gaming for AGI
yanni kyriacos (yanni) · 2024-10-17T05:14:17.932Z · comments (1)

[question] Is OpenAI net negative for AI Safety?
Lysandre Terrisse · 2024-11-02T16:18:02.859Z · answers+comments (0)

[question] How do you finish your tasks faster?
Cipolla · 2024-08-21T20:01:41.306Z · answers+comments (2)

Reducing x-risk might be actively harmful
MountainPath · 2024-11-18T14:25:07.127Z · comments (0)

Bellevue-Redmond USA - ACX Meetups Everywhere Fall 2024
Cedar (xida-ren) · 2024-08-29T18:43:57.014Z · comments (8)

[link] A Logical Proof for the Emergence and Substrate Independence of Sentience
rife (edgar-muniz) · 2024-10-24T21:08:09.398Z · comments (31)

What are Emotions?
Myles H (zarsou9) · 2024-11-15T04:20:27.388Z · comments (12)

Likelihood calculation with duobels
Martin Gerdes (martin-gerdes) · 2024-10-01T16:21:01.268Z · comments (0)

Tbilisi Georgia - ACX Meetups Everywhere Fall 2024
Dmitrii (dmitrii) · 2024-08-29T18:36:43.223Z · comments (4)

[link] 2024 Election Forecasting Contest
mike20731 · 2024-10-05T20:43:16.203Z · comments (0)

Jailbreaking ChatGPT and Claude using Web API Context Injection
Jaehyuk Lim (jason-l) · 2024-10-21T21:34:37.579Z · comments (0)

[question] Noticing the World
EvolutionByDesign (bioluminescent-darkness) · 2024-11-04T16:41:44.696Z · answers+comments (1)

[question] What are the primary drivers that caused selection pressure for intelligence in humans?
Towards_Keeperhood (Simon Skade) · 2024-11-07T09:40:20.275Z · answers+comments (15)

[question] How do you follow AI (safety) news?
PeterH · 2024-09-24T13:58:48.916Z · answers+comments (2)

[question] is there a big dictionary somewhere with all your jargon and acronyms and whatnot?
KvmanThinking (avery-liu) · 2024-10-17T11:30:50.937Z · answers+comments (7)

Developmental Stages in Multi-Problem Grokking
James Sullivan · 2024-09-29T18:58:22.954Z · comments (0)

[link] The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind
Roko · 2024-10-16T01:24:51.102Z · comments (18)

Tokyo (日本語) Japan - ACX Meetups Everywhere Fall 2024
Emi (emi-2) · 2024-08-29T18:35:28.013Z · comments (0)

Towards a Clever Hans Test: Unmasking Sentience Biases in Chatbot Interactions
glykokalyx · 2024-11-10T22:34:58.956Z · comments (0)

Can Current LLMs be Trusted To Produce Paperclips Safely?
Rohit Chatterjee (rohit-c) · 2024-08-19T17:17:07.530Z · comments (0)

[question] Is there a known method to find others who came across the same potential infohazard without spoiling it to the public?
hive · 2024-10-17T10:47:05.099Z · answers+comments (6)

AI Compute governance: Verifying AI chip location
Farhan · 2024-10-12T17:36:45.942Z · comments (0)

[link] Podcast discussing Hanson's Cultural Drift Argument
vaishnav92 · 2024-10-20T17:58:41.416Z · comments (0)

Germany-wide ACX Meetup
Fernand0 · 2024-11-17T10:08:54.584Z · comments (0)

Project Adequate: Seeking Cofounders/Funders
Lorec · 2024-11-17T03:12:12.995Z · comments (4)

Dallas USA - ACX Meetups Everywhere Fall 2024
ethanmorse · 2024-08-29T18:43:37.972Z · comments (0)

[question] How might language influence how an AI "thinks"?
bodry (plosique) · 2024-10-30T17:41:04.460Z · answers+comments (0)

On the Practical Applications of Interpretability
Nick Jiang (nick-jiang) · 2024-10-15T17:18:25.280Z · comments (0)

Recent comments

buck on Why imperfect adversarial robustness doesn't doom AI control

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R?

Yep.

If it can tell that, how come it can't guess it specifically would have produced A?

I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").

buck on Why imperfect adversarial robustness doesn't doom AI control

For the main body, I gave it a first draft, it wrote a much improved second draft, then I edited from there. It wrote the appendix basically itself (with some feedback from me).

viliam on Making a conservative case for alignment

Going by today's standards, we should have banned Gwern in 2012.

And I think that would have been a mistake.

I wonder how many other mistakes we made. The problem is, we won't get good feedback on this.

joao-ribeiro-medeiros on Neutrality

Love this!!! Inspecting the concept of neutrality as a software-like, process-driven institution is a very illuminating approach.

I would add that there is a very relevant discussion in the AI sphere which is profoundly connected to the points you are making: the existence of bias in AI models.

To identify and measure bias is in a sense to identify and measure lack of neutrality, so it follows that, to define bias, one must first be very rigorous about the definition of neutrality.

This can seem simple for some of the more pedestrian AI tasks, but can become increasingly sophisticated as we introduce AI as an essential piece in workflows and institutions.

AI algorithms can be heavily biased, datasets can be biased and even data structures can be biased.

I feel this is a topic which you can further explore in the future. Thank you for this!

kave on Why imperfect adversarial robustness doesn't doom AI control

I'm curious about Claude's coauthorship of this post. How did it help?

kave on Why imperfect adversarial robustness doesn't doom AI control

The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced.

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R? If it can tell that, how come it can't guess it specifically would have produced A?

mr-hire on Ayn Rand’s model of “living money”; and an upside of burnout

also, this 'subconscious parts going on strike' theory makes slightly different predictions than the 'is it good for the whole system/live' theory

for instance, i predict that you can have 'dead parts' that e.g. give people social anxiety based on past trauma, even though it's no longer actually relevant to their current situation.

and that if you override this social anxiety using 'live willpower' for a while, you can get burnout, even though the willpower is in some sense 'correct' about what would be good for the overall flourishing of the system given the current reality.

egor-timatkov on "It's a 10% chance which I did 10 times, so it should be 100%"

I think what Justin is saying is that finding a single monogamous partner is not significantly different from finding two, three, etc. For some things you only care about succeeding once. So a 63% chance of success (any number of times) means a .63 expected value (because all successes after the first have a value of 0).

Meanwhile, for other things, such as polyamorous partners, 2 partners is meaningfully better than one, so the expected value truly is 1, because you will get one partner on average. (Though this assumes 2 partners is twice as good as one; we can complicate this even more if we assume that 2 partners is better, but not twice as good.)
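To make the arithmetic concrete, here is a minimal sketch of the two valuations, assuming independent attempts that each succeed with probability p. The function names are made up for illustration. Note that with exactly 10 tries at 10% the "at least once" chance comes out near 65%; the 63% figure is the large-n limit 1 - 1/e when p = 1/n.

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Chance of one or more successes in n independent tries."""
    return 1.0 - (1.0 - p) ** n


def expected_successes(p: float, n: int) -> float:
    """Average number of successes in n independent tries."""
    return p * n


p, n = 0.1, 10

# "Only the first success counts": value 1 if any try succeeds, else 0,
# so the expected value equals the chance of at least one success.
value_if_only_first_counts = p_at_least_one(p, n)

# "Every success counts equally": expected value is the expected count.
value_if_every_success_counts = expected_successes(p, n)

# Limit of 1 - (1 - 1/n)**n as n grows: 1 - 1/e.
limit = 1.0 - math.exp(-1.0)

print(f"at least one success: {value_if_only_first_counts:.4f}")
print(f"expected successes:   {value_if_every_success_counts:.4f}")
print(f"large-n limit:        {limit:.4f}")
```

The gap between the first and second numbers is the value of the second, third, and later successes; valuing a second partner at something between 0 and 1 would put the expected value somewhere between the two.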

aprilsr on "The Solomonoff Prior is Malign" is a special case of a simpler argument

In case it's a helpful data point: lines of reasoning sorta similar to the ones around the infohazard warning seemed to have interesting and intense psychological effects on me one time. It's hard to separate out from other factors, though, and I think it had something to do with the fact that lately I've been spending a lot of time learning to take ideas seriously on an emotional level instead of only an abstract one.

jonas-hallgren on Leon Lang's Shortform

If you look at the Active Inference community there's a lot of work going into PPL-based languages to do more efficient world modelling but that shit ain't easy and as you say it is a lot more compute heavy.

I think there'll be a scaling break due to this, but when it is algorithmically figured out again we will be back, and back with a vengeance, as I think most safety challenges have a self vs environment model as a necessary condition to be properly engaged (which currently isn't engaged with in LLMs' world modelling).