LessWrong 2.0 Reader

Some comments on intelligence
Viliam · 2024-08-01T15:17:07.215Z · comments (5)

Fun With CellxGene
sarahconstantin · 2024-09-06T22:00:03.461Z · comments (2)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

[link] Concrete benefits of making predictions
Jonny Spicer (jonnyspicer) · 2024-10-17T14:23:17.613Z · comments (4)

Inference-Only Debate Experiments Using Math Problems
Arjun Panickssery (arjun-panickssery) · 2024-08-06T17:44:27.293Z · comments (0)

Examples of How I Use LLMs
jefftk (jkaufman) · 2024-10-14T17:10:04.597Z · comments (2)

AI labs can boost external safety research
Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · comments (1)

[link] My Methodological Turn
adamShimi · 2024-09-29T15:01:45.986Z · comments (0)

[LDSL#4] Root cause analysis versus effect size estimation
tailcalled · 2024-08-11T16:12:14.604Z · comments (0)

[question] Where to find reliable reviews of AI products?
Elizabeth (pktechgirl) · 2024-09-17T23:48:25.899Z · answers+comments (6)

[link] AI forecasting bots incoming
Dan H (dan-hendrycks) · 2024-09-09T19:14:31.050Z · comments (44)

[link] New blog: Expedition to the Far Lands
Connor Leahy (NPCollapse) · 2024-08-17T11:07:48.537Z · comments (3)

Towards Quantitative AI Risk Management
Henry Papadatos (henry) · 2024-10-16T19:26:48.817Z · comments (1)

[link] A new process for mapping discussions
Nathan Young · 2024-09-30T08:57:20.029Z · comments (6)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

[link] NAO Updates, Fall 2024
jefftk (jkaufman) · 2024-10-18T00:00:04.142Z · comments (2)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

[question] Any real toeholds for making practical decisions regarding AI safety?
lukehmiles (lcmgcd) · 2024-09-29T12:03:08.084Z · answers+comments (6)

Domain-specific SAEs
jacob_drori (jacobcd52) · 2024-10-07T20:15:38.584Z · comments (0)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

European Progress Conference
Martin Sustrik (sustrik) · 2024-10-06T11:10:03.819Z · comments (11)

Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov (jonathan-kutasov) · 2024-10-05T20:43:36.679Z · comments (2)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (21)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (1)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

Recent comments

matthew-barnett on The Hidden Complexity of Wishes

The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.

I think it's important to be able to make narrow points without needing to defend a broader thesis that wasn't necessarily the subject of my comment. To the extent my argument is "outer alignment seems easier than you portrayed it to be in this post, and elsewhere", then your reply here that inner alignment is still hard doesn't seem like it particularly engages with my narrow point.

This post definitely seems to relevantly touch on the question of outer alignment, given the premise that we are explicitly specifying the conditions that the outcome pump needs to satisfy in order for the outcome pump to produce a safe outcome. Explicitly specifying a function that delineates safe from unsafe outcomes is essentially the prototypical case of an outer alignment problem.

(It's possible that you'll reply to me by saying "I never intended people to interpret me as saying anything about outer alignment in this post" despite the clear portrayal of an outer alignment problem in the post. Even so, I don't think what you intended really matters that much here. I'm responding to what was clearly and explicitly written, rather than what was in your head at the time, which is unknowable to me.)

One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" controls the button, will find that administering drugs to the human, or building a robot to seize control of the reward button, is "really" (from the AI's perspective) the true meaning of the reward button after all! Perhaps you do not have this exact scenario in mind.

It seems you're assuming here that something like iterated amplification and distillation will simply fail, because the supervisor function that provides rewards to the model can be hacked or deceived. I think my response to this is that I just tend to be more optimistic than you are that we can end up doing safe supervision where the supervisor ~always remains in control, and they can evaluate the AI's outputs accurately, more-or-less sidestepping the issues you mention here.

I think my reasons for believing this are pretty mundane: I'd point to the fact that evaluation tends to be easier than generation, and the fact that we can employ non-agentic tools to help evaluate, monitor, and control our models to provide them accurate rewards without getting hacked. I think your general pessimism about these things is fairly unwarranted, and my guess is that if you had made specific predictions about this question in the past, about what will happen prior to world-ending AI, these predictions would largely have fared worse than predictions from someone like Paul Christiano.

viliam on leogao's Shortform

“A prophet is not without honor except in his own town, among his relatives and in his own home.” (Mark 6:4)

tsvibt on How to have Polygenically Screened Children

That's a reasonable point... But I don't think we can just count number of divisions either? For one thing, there are several populations of stem cells in an adult. For another, people who are 50% bigger than other people don't live 2/3 as long (right? though maybe that's not the prediction?). I think maybe embryonic stem cells protect their telomeres--not sure.

ali-merali on The Mysterious Trump Buyers on Polymarket

Just adding some additional context that might be useful. PredictIt is a similar election betting platform but has a cap on the maximum amount traders are able to bet (I think <$1k, so relatively low). This means that if Polymarket is a money-weighted information aggregation mechanism, PredictIt is a person-weighted information aggregation mechanism. As noted in the post, from 6th October to just now Trump has gone from 50.8 to 60.1 on Polymarket, meaning his lead over Kamala went from 1.6 cents to 20.2 cents (an 18.6 cent swing). On PredictIt he's gone from 51:53 to 55:48 over the same interval (a 9 cent swing). Of course it's possible that people on PredictIt are changing their bets due to Polymarket prices (e.g. by arbitraging), but there's some evidence that at least half the change isn't due to these large bettors (whether those large bettors are trading based on private information, for manipulation purposes, or other reasons).
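
The swing arithmetic in this comment can be checked with a short sketch. It assumes, as the quoted Polymarket figures imply, that the two Polymarket prices sum to 100 cents; the function names are illustrative and not part of either platform.

```python
def margin_from_single_price(price_cents: float) -> float:
    """Lead over the opponent, ASSUMING the two prices sum to 100 cents."""
    return price_cents - (100.0 - price_cents)


def margin_from_pair(candidate_cents: float, opponent_cents: float) -> float:
    """Lead over the opponent when both prices are quoted separately."""
    return candidate_cents - opponent_cents


# Polymarket: Trump at 50.8, then 60.1 (figures from the comment).
polymarket_swing = margin_from_single_price(60.1) - margin_from_single_price(50.8)

# PredictIt: 51:53, then 55:48 (figures from the comment).
predictit_swing = margin_from_pair(55, 48) - margin_from_pair(51, 53)

# Share of the Polymarket swing that also shows up on the capped platform.
ratio = predictit_swing / polymarket_swing

print(round(polymarket_swing, 1), predictit_swing, round(ratio, 2))
```

On these numbers the PredictIt swing is roughly half the Polymarket swing (slightly under), which is the comparison the comment relies on.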

anonymous-1 on How to have Polygenically Screened Children

Thanks! Point 2. is especially what I am interested to learn about. If there is any place where I can read more about the presence and effect of telomerase on embryos please let me know!

Also, if you have any info on these other potential issues (pathways of aging through cell division) that ChatGPT generated, I am all ears:

a. DNA Damage Accumulation: Every time a cell divides, there is a risk of errors during DNA replication. (Cells do have repair mechanisms that fix most DNA damage, and fortunately they are more capable in young cells.)

b. Mitochondrial Dysfunction: Mitochondria have their own DNA (mtDNA). Mitochondria divide independently of the cell’s nuclear DNA. However, mtDNA is more prone to damage during cell division and replication due to its proximity to reactive oxygen species (ROS) produced during energy generation.

c. Stem Cell Exhaustion: Each stem cell has a limited number of divisions it can undergo before it enters programmed cell death.

genesmith on How to have Polygenically Screened Children

You’re ignoring several facts:

1. A significant fraction of cells turn over frequently in adults, so the number of divisions for those cell types is far, far higher than 45 divisions. Those cell divisions CAN cause cancer, but a single extra cell division is going to have negligible impact on risk.
2. There’s an enzyme called telomerase which can extend telomeres. It’s active in embryos. So this isn’t really a concern.

sodium on (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need

Thanks for reading my post! Here's how I think this hypothesis is helpful:

It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to understand that many heuristics simultaneously (as we would in your logic gate example for modern computers). At the minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.

If the hypothesis is true, then it at least suggests that interpretability researchers should put more effort into finding and studying individual heuristics/circuits, as opposed to the current more "feature-centric" framework. I don't know how this would manifest itself exactly, but it felt worth saying. I believe that some of the empirical work I cited suggests that we might make more incremental progress if we focused on heuristics more right now.

anonymous-1 on How to have Polygenically Screened Children

I think you shouldn’t count aging in number of days, but rather in the number of cell divisions. (I accept there are other forms of cell damage than just division, but e.g. for telomere length division is quite an important component of aging.)

A newborn baby has fewer than 2^35 cells and an adult has fewer than 2^44 cells. Existing cells are often only replaced after 7 years, so quite a large portion of cell division happens during growth from single cell to adult.

For the sake of argument, assume that a biopsy takes 4 cells out of an 8-cell embryo; then all the remaining cells need to divide once more. So that’s 36 (or 45) divisions instead of 35 (or 44).
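
The division count above can be written out as a small sketch. It assumes an idealized model that is not stated in the comment: every cell divides in lockstep, with no cell death or turnover, and the cell counts are taken at their upper bounds (2^35 and 2^44). The function name is hypothetical.

```python
import math

EMBRYO_CELLS = 8   # embryo size at biopsy (from the comment)
BIOPSY_CELLS = 4   # cells removed by the biopsy (from the comment)


def total_divisions(target_cells: int, biopsy: bool) -> int:
    """Rounds of synchronized doubling from one cell to target_cells.

    ASSUMPTION: all cells divide in lockstep and none are lost except
    those removed by the biopsy.
    """
    rounds_to_embryo = int(math.log2(EMBRYO_CELLS))  # 1 -> 8 cells takes 3 rounds
    remaining = EMBRYO_CELLS - BIOPSY_CELLS if biopsy else EMBRYO_CELLS
    rounds_after = math.ceil(math.log2(target_cells / remaining))
    return rounds_to_embryo + rounds_after


newborn = (total_divisions(2**35, biopsy=False), total_divisions(2**35, biopsy=True))
adult = (total_divisions(2**44, biopsy=False), total_divisions(2**44, biopsy=True))
print(newborn, adult)
```

Under these assumptions the biopsy adds exactly one round of division, matching the 35 vs 36 and 44 vs 45 figures in the comment.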

owain_evans on LLMs can learn about themselves by introspection

We have a section on the motivation to study introspection (with the specific definition we use in the paper). https://arxiv.org/html/2410.13787v1#S7

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

One principled way to do it would be simulated war on narrow issues.

So if actor A spends resources R_c on computation C, any other actor B can surrender resources equal to R_c to prevent computation C from happening. The surrendered resources and the original resources are then physically destroyed (e.g. spent on Bitcoin mining or something).

This then at least means that to a first approximation, no actor has an incentive to destroy ELYSIUM itself in order to stop some computation inside it from happening, because they could just use their resources to stop the computation in the simulation instead. And many actors benefit from ELYSIUM, so there's a large incentive to protect it.

And since the interaction is negative sum (both parties lose resources from their personal utopias) there would be strong reasons to negotiate.
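
The blocking rule described above can be sketched as a toy model. Everything here (the `Actor` class, the `contest` function, the starting balances) is hypothetical and only illustrates the accounting: if B objects and can match the cost R_c, both stakes are destroyed and the computation does not run.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    name: str
    resources: float


def contest(proposer: Actor, blocker: Actor, cost: float, blocker_objects: bool) -> bool:
    """Return True if the computation runs, False if it is blocked."""
    if cost > proposer.resources:
        raise ValueError("proposer cannot afford the computation")
    # The proposer's stake is spent on the computation, or destroyed if blocked.
    proposer.resources -= cost
    if blocker_objects and blocker.resources >= cost:
        # The blocker surrenders a matching amount, which is also destroyed.
        blocker.resources -= cost
        return False
    return True


a = Actor("A", 10.0)
b = Actor("B", 10.0)
total_before = a.resources + b.resources
ran = contest(a, b, cost=3.0, blocker_objects=True)
total_after = a.resources + b.resources
print(ran, a.resources, b.resources, total_before - total_after)
```

In this toy run a blocked computation costs the two actors twice R_c in total and neither gains anything, which is the negative-sum property that gives both sides a reason to negotiate.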

In addition to this there could be rule-based and AI-based protections to prevent unauthorized funny tricks with simulations. One rule could be a sort of "cosmic block" where you can just block some or all other Utopias from knowing about you outside of a specified set of tests ("is torture happening here", etc).