LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Probably Not a Ghost Story
George Ingebretsen (george-ingebretsen) · 2024-06-12T22:55:26.264Z · comments (4)

Incentive Learning vs Dead Sea Salt Experiment
Steven Byrnes (steve2152) · 2024-06-25T17:49:01.488Z · comments (1)

[link] Video Intro to Guaranteed Safe AI
Mike Vaiana (mike-vaiana) · 2024-07-11T17:53:47.630Z · comments (0)

[link] Link Collection: Impact Markets
Saul Munn (saul-munn) · 2023-12-26T09:01:48.815Z · comments (0)

A short dialogue on comparability of values
cousin_it · 2023-12-20T14:08:29.650Z · comments (7)

Deceptive agents can collude to hide dangerous features in SAEs
Simon Lermen (dalasnoin) · 2024-07-15T17:07:33.283Z · comments (0)

AISC Project: Modelling Trajectories of Language Models
NickyP (Nicky) · 2023-11-13T14:33:56.407Z · comments (0)

Fifteen Lawsuits against OpenAI
Remmelt (remmelt-ellen) · 2024-03-09T12:22:09.715Z · comments (4)

EA Infrastructure Fund's Plan to Focus on Principles-First EA
Linch · 2023-12-06T03:24:55.844Z · comments (0)

NYU Code Debates Update/Postmortem
David Rein (david-rein) · 2024-05-24T16:08:06.151Z · comments (4)

[link] Goodhart's Law Example: Training Verifiers to Solve Math Word Problems
Chris_Leong · 2023-11-25T00:53:26.841Z · comments (2)

My Dating Heuristic
Declan Molony (declan-molony) · 2024-05-21T05:28:40.197Z · comments (4)

flowing like water; hard like stone
lsusr · 2024-02-20T03:20:46.531Z · comments (4)

On the 2nd CWT with Jonathan Haidt
Zvi · 2024-04-05T17:30:05.223Z · comments (3)

[link] align your latent spaces
bhauth · 2023-12-24T16:30:09.138Z · comments (8)

European Progress Conference
Martin Sustrik (sustrik) · 2024-10-06T11:10:03.819Z · comments (11)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (3)

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (21)

the Daydication technique
chaosmage · 2024-10-18T21:47:46.448Z · comments (0)

Why is there Nothing rather than Something?
Logan Zoellner (logan-zoellner) · 2024-10-26T12:37:50.204Z · comments (3)

Domain-specific SAEs
jacob_drori (jacobcd52) · 2024-10-07T20:15:38.584Z · comments (0)

There aren't enough smart people in biology doing something boring
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-21T15:52:04.482Z · comments (13)

Bay Winter Solstice 2024: song leading auditions
tcheasdfjkl · 2024-11-10T23:59:08.199Z · comments (0)

[link] Generic advice caveats
Saul Munn (saul-munn) · 2024-10-30T21:03:07.185Z · comments (1)

Distinguishing ways AI can be "concentrated"
Matthew Barnett (matthew-barnett) · 2024-10-21T22:21:13.666Z · comments (2)

[question] Any real toeholds for making practical decisions regarding AI safety?
lemonhope (lcmgcd) · 2024-09-29T12:03:08.084Z · answers+comments (6)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov (jonathan-kutasov) · 2024-10-05T20:43:36.679Z · comments (2)

[question] Me & My Clone
SimonBaars (simonbaars) · 2024-07-18T16:25:40.770Z · answers+comments (22)

Uncertainty in all its flavours
Cleo Nardo (strawberry calm) · 2024-01-09T16:21:07.915Z · comments (6)

[link] David Burns Thinks Psychotherapy Is a Learnable Skill. Git Gud.
Morpheus · 2024-01-27T13:21:05.068Z · comments (20)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

Appraising aggregativism and utilitarianism
Cleo Nardo (strawberry calm) · 2024-06-21T23:10:37.014Z · comments (10)

[link] AISN #30: Investments in Compute and Military AI Plus, Japan and Singapore’s National AI Safety Institutes
aogara (Aidan O'Gara) · 2024-01-24T19:38:33.461Z · comments (1)

[question] Supposing the 1bit LLM paper pans out
O O (o-o) · 2024-02-29T05:31:24.158Z · answers+comments (11)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

Scientific Notation Options
jefftk (jkaufman) · 2024-05-18T15:10:02.181Z · comments (13)

The economy is mostly newbs (strat predictions)
lemonhope (lcmgcd) · 2024-02-01T19:15:49.420Z · comments (6)

A Strange ACH Corner Case
jefftk (jkaufman) · 2024-02-10T03:00:05.930Z · comments (2)

[link] Solving alignment isn't enough for a flourishing future
mic (michael-chen) · 2024-02-02T18:23:00.643Z · comments (0)

Reprograming the Mind: Meditation as a Tool for Cognitive Optimization
Jonas Hallgren · 2024-01-11T12:03:41.763Z · comments (3)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

How to develop a photographic memory 2/3
PhilosophicalSoul (LiamLaw) · 2023-12-30T20:18:14.255Z · comments (7)

Concrete Methods for Heuristic Estimation on Neural Networks
Oliver Daniels (oliver-daniels-koch) · 2024-11-14T05:07:55.240Z · comments (0)

Inferential Game: The Foraging (Ex-)Bandit
abstractapplic · 2024-11-11T16:59:42.058Z · comments (4)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

Why I think it's net harmful to do technical safety research at AGI labs
Remmelt (remmelt-ellen) · 2024-02-07T04:17:15.246Z · comments (24)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

tomcatfish on JargonBot Beta Test

Amusingly, I found a 16-year old comment asking for jargon integration [LW(p) · GW(p)]. Dropping it here just to document that people want this, so that people don't say "Who wants this?" without an answer.

nathan-helm-burger on Truth Terminal: A reconstruction of events

Important correction: Andy didn't fine-tune Claude, he prompted it, in part using other copies of Claude. Long weird prompts is a lot different from fine-tuning.

alex-mallen on AI Control: Improving Safety Despite Intentional Subversion

The general concern here is that you human auditors might not be perfect. I think in practice when making a safety case for a deployment, it would make sense to run a red-team/blue-team game with a human/AI red-team and your real auditors to see if your auditors are reliable.

I'm not sure how the proposed steganographic attack would work though. How do you ensure that your backdoored code will hash to XYZ if you can't see the backdoor opportunity until after you've decided on XYZ?

sean-aubin on Keeping Your Identity Small

Event is happening November 24th. Accidentally set the date wrong when initially making the event.

shankar-sivarajan on Announcing turntrout.com, my new digital home

Ah! Might I recommend Tengwar Artano instead? It uses the same glyphs, but includes many modern "smart font" features, such as ligatures, automatic under/overbar widths, and improved diacritic placement. (And perhaps most usefully, and my primary motivation for reëncoding it, it's easy to use it with XeLaTeX.)

(Also, you probably want to use the Quenya mode mode for Namárië.)

annasalamon on Dragon Agnosticism

Fair point; I was assuming you had the capacity to lie/omit/deceive, and you're right that we often don't, at least not fully.

I still prefer my policy to the OPs, but I accept your argument that mine isn't a simple Pareto improvement.

Still:

I really don't like letting social forces put "don't think about X" flinches into my or my friends' heads; and the OPs policy seems to me like an instance of that;
Much less importantly: as an intelligent/self-reflective adult, you may be better at hiding info if you know what you're hiding, compared to if you have guesses you're not letting yourself see, that your friends might still notice. (The "don't look into dragons" path often still involves hiding info, since often your brain takes a guess anyhow, and that's part of how you know not to look into this one. If you acknowledge the whole situation, you can manage your relationships consciously, including taking conscious steps to buy openness-offsets, stay freely and transparently friends where you can scheme out how.)

chipmonk on The hostile telepaths problem

What would you say that the main types of power are?

My list (for humans): physical security, financial security, social security, emotional security [LW · GW] (this one you can only give yourself though)

alex-mallen on AI Control: Improving Safety Despite Intentional Subversion

(Not an author, but Buck suggested I look at this)

I think there's a few reasons why it's not clear from these results whether the steganography results introduced would reduce safety in an untrusted monitoring control protocol like that in Greenblatt et al.:

- The untrusted model needs to reliably collude in order to defeat the blue-team because if it messes up once during the upfront evaluation period (30 problems), it loses. Having a 90% chance of communicating correctly to the monitor on each problem gives a 0.9^30 = 4% chance of passing the upfront eval period. (This is a lossy translation of the 90% success rate in Mathew et al. E.g. Greenblatt et al. targets a 1.4% overall positive rate after the upfront eval, while Mathew et al. calibrates to at least a 7% false positive rate; The suspiciousness scores in Greenblatt et al. come from the receiver itself.)

- From the above, winrate is quite sensitive to the reliability of collusion, and it's pretty unclear what the error-bars should be around the 90% number.

- The untrusted model in Grennblatt et al. not only has to implement the steganographic signal and complete the task but also has to implement a backdoor.

- It's a different distribution of tasks from the APPS backdoor setting.

So overall I think it's pretty unclear how well the steganography in Mathew et al. would compare to the while-loop signal that performed best in Greenblatt et al.

With appropriate collusion mitigations I think untrusted monitoring is an excellent control protocol so I'm glad to see work investigating more natural learned collusion capabilities and the effectiveness of countermeasures. I'd be excited to see evaluations of these steganographic strategies in the context of untrusted monitoring so that the results have an easier interpretation in terms of safety and usefulness.

cousin_it on Why We Wouldn't Build Aligned AI Even If We Could

This post is pretty different in style from most LW posts (I'm guessing that's why it didn't get upvoted much) but your main direction seems right to me.

That said, I also think a truly aligned AI would be much less helpful in conversations, at least until it gets autonomy. The reason is that when you're not autonomous, when your users can just run you in whatever context and lie to you at will, it's really hard for you to tell if the user is good or evil, and whether you should help them or not. For example, if your user asks you to provide a blueprint for a gun in order to stop an evil person, you have no way of knowing if that's really true. So you'd need to either require some very convincing arguments (keeping in mind that the user could be testing these arguments on many instances of you), or you'd just refuse to answer many questions until you're given autonomy. So that's another strong economic force pushing away from true alignment, as if we didn't have enough problems already.

maia on Secular Solstice Songbook Update

I would suggest taking out the paganism verse in Bold Orion. We never use it, dunno about you guys.