LessWrong 2.0 Reader

“Artificial General Intelligence”: an extremely brief FAQ
Steven Byrnes (steve2152) · 2024-03-11T17:49:02.496Z · comments (6)

[link] [Repost] The Copenhagen Interpretation of Ethics
mesaoptimizer · 2024-01-25T15:20:08.162Z · comments (4)

Dumbing down
Martin Sustrik (sustrik) · 2024-06-09T06:50:47.469Z · comments (0)

[link] Yoshua Bengio: Reasoning through arguments against taking AI safety seriously
Judd Rosenblatt (judd) · 2024-07-11T23:53:17.187Z · comments (3)

[link] OpenAI: Preparedness framework
Zach Stein-Perlman · 2023-12-18T18:30:10.153Z · comments (23)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)

Automation collapse
Geoffrey Irving · 2024-10-21T14:50:54.500Z · comments (10)

[link] The True Story of How GPT-2 Became Maximally Lewd
Writer · 2024-01-18T21:03:08.167Z · comments (7)

Transcoders enable fine-grained interpretable circuit analysis for language models
Jacob Dunefsky (jacob-dunefsky) · 2024-04-30T17:58:09.982Z · comments (14)

Update on Chinese IQ-related gene panels
Lao Mein (derpherpize) · 2023-12-14T10:12:21.212Z · comments (7)

[link] [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models"
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-06-06T18:55:09.151Z · comments (2)

[link] InterLab – a toolkit for experiments with multi-agent interactions
Tomáš Gavenčiak (tomas-gavenciak) · 2024-01-22T18:23:35.661Z · comments (0)

Text Posts from the Kids Group: 2020
jefftk (jkaufman) · 2024-04-13T22:30:05.326Z · comments (3)

[New Feature] Your Subscribed Feed
Ruby · 2024-06-11T22:45:00.000Z · comments (8)

[link] Investigating an insurance-for-AI startup
L Rudolf L (LRudL) · 2024-09-21T15:29:10.083Z · comments (0)

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark (hojmax) · 2024-07-22T16:17:07.665Z · comments (0)

The King and the Golem - The Animation
Writer · 2024-11-08T18:23:10.935Z · comments (0)

Multiplex Gene Editing: Where Are We Now?
sarahconstantin · 2024-07-16T20:50:04.590Z · comments (6)

Flagging Potentially Unfair Parenting
jefftk (jkaufman) · 2023-12-26T12:40:05.099Z · comments (1)

Finding Sparse Linear Connections between Features in LLMs
Logan Riggs (elriggs) · 2023-12-09T02:27:42.456Z · comments (5)

[link] Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis
simeon_c (WayZ) · 2024-02-01T21:30:44.090Z · comments (17)

How We Picture Bayesian Agents
johnswentworth · 2024-04-08T18:12:48.595Z · comments (14)

[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)

[link] We're all in this together
Tamsin Leake (carado-1) · 2023-12-05T13:57:46.270Z · comments (65)

[link] Motivation gaps: Why so much EA criticism is hostile and lazy
titotal (lombertini) · 2024-04-22T11:49:59.389Z · comments (5)

[link] Former OpenAI Superalignment Researcher: Superintelligence by 2030
Julian Bradshaw · 2024-06-05T03:35:19.251Z · comments (30)

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt
DanielFilan · 2024-04-11T21:30:04.244Z · comments (10)

[Intuitive self-models] 3. The Homunculus
Steven Byrnes (steve2152) · 2024-10-02T15:20:18.394Z · comments (36)

[link] [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
chanind · 2024-09-25T09:31:03.296Z · comments (15)

Duct Tape security
Isaac King (KingSupernova) · 2024-04-26T18:57:05.659Z · comments (11)

Estimating Tail Risk in Neural Networks
Mark Xu (mark-xu) · 2024-09-13T20:00:06.921Z · comments (9)

Personal AI Planning
jefftk (jkaufman) · 2024-11-10T14:00:06.837Z · comments (10)

What is it to solve the alignment problem?
Joe Carlsmith (joekc) · 2024-08-24T21:19:34.280Z · comments (17)

Generalized Stat Mech: The Boltzmann Approach
David Lorell · 2024-04-12T17:47:31.880Z · comments (7)

Meetup Tip: Heartbeat Messages
Screwtape · 2023-12-07T17:18:33.582Z · comments (4)

When Are Circular Definitions A Problem?
johnswentworth · 2024-05-28T20:00:23.408Z · comments (15)

Shard Theory - is it true for humans?
Rishika (rishika-bose) · 2024-06-14T19:21:47.997Z · comments (7)

Brief notes on the Wikipedia game
Olli Järviniemi (jarviniemi) · 2024-07-14T02:28:22.473Z · comments (9)

MATS AI Safety Strategy Curriculum
Ronny Fernandez (ronny-fernandez) · 2024-03-07T19:59:37.434Z · comments (2)

How useful is "AI Control" as a framing on AI X-Risk?
habryka (habryka4) · 2024-03-14T18:06:30.459Z · comments (4)

Different senses in which two AIs can be “the same”
Vivek Hebbar (Vivek) · 2024-06-24T03:16:43.400Z · comments (1)

Best in Class Life Improvement
sapphire (deluks917) · 2024-04-04T01:51:02.556Z · comments (20)

[link] The 2nd Demographic Transition
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-06T14:10:13.095Z · comments (17)

[link] GPT-4o System Card
Zach Stein-Perlman · 2024-08-08T20:30:52.633Z · comments (11)

[link] Cost, Not Sacrifice
Joe Rogero · 2024-11-20T21:32:26.281Z · comments (13)

AI #79: Ready for Some Football
Zvi · 2024-08-29T13:30:10.902Z · comments (16)

Why Large Bureaucratic Organizations?
johnswentworth · 2024-08-27T18:30:07.422Z · comments (52)

[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:17.755Z · comments (0)

[link] Peak Human Capital
PeterMcCluskey · 2024-09-30T21:13:30.421Z · comments (3)

The Hessian rank bounds the learning coefficient
Lucius Bushnaq (Lblack) · 2024-08-08T20:55:36.960Z · comments (9)

Recent comments

screwtape on Boston Secular Solstice 2024

We're looking for speakers for the Boston Solstice. This year Solstice is on December 28th at 7pm. Being a speaker at Solstice is pretty straightforward: public speaking skill is useful, but you can read off a script, so don't feel like you need to memorize anything.

If you're at all interested, reach out. We have speeches ranging from very short and silly to a couple of pages and somber.

Additionally, if you feel like you have an original speech on the themes of persistence or camaraderie, especially if you feel you have a good speech about not giving up even when it's hard, then please feel free to send a draft! The overall arc is set at this point but you might have something better for a given slot.

daystareld on You are not too "irrational" to know your preferences.

I agree that those are the surface-level thoughts Bryce has in those situations, and they are not the same as "it's wrong/stupid to enjoy eating ice cream."

But I think they often do imply "and you are stupid/irrational if knowing these things does not spoil your enjoyment or shift your hedonic attractor." And even if Bryce genuinely doesn't feel that way, I hope they would still be very careful with their wording to avoid that implication.

screwtape on Raemon's Shortform

Tentative support for only auto-importing the first few paragraphs; failing that, start by auto-importing the whole post and wait until anybody complains. My guess (~65%?) is that somebody will. Against having an LLM extract some important highlights: if doing highlights is the way to go, I think whoever nominated the piece for the Review can find the highlights.

I'd love it if I could use LessWrong as a central place to read rationalsphere content, and since more and more rationalist sphere writers are writing elsewhere this seems like it's worth trying.

henry-sleight on You should consider applying to PhDs (soon!)

Great post! I especially agree that, for most independent researchers, applying to PhDs before you necessarily want one is a helpful option to have as a backstop in case your near-term career plans don't work out, and that people should apply early because there's such a long lag between applying and starting.

I think it's also worth emphasising that if you have a non-standard work history (or are a bit junior) but might want to work in the United States, pursuing higher education in the US is one of the easiest ways to secure long-term work authorisation (and, if someone funds your PhD, it is radically cheaper than almost every alternative).

green_leaf on Is the mind a program?

I understand the causal closure of the physical as the principle that nothing non-physical influences physical stuff, so that would be the causal closure of the bottom level of description (since there is no level below the physical) rather than of the upper level.

So if you mean that it's enough to simulate neurons rather than individual atoms, that isn't "causal closure" as Wikipedia uses the term.

raemon on gwern's Shortform

Yeah the LW team has been doing this sort of thing internally, still in the experimental phase. I don't know if we've used all the tricks listed here yet.

raemon on Raemon's Shortform

For this year's LessWrong Review, we're building UI to make it much easier to import linkposts from other blogs and backdate them so they're eligible for the Review, since a lot of important rationalsphere and AI Safety content lives elsewhere.

It's actually pretty easy to automatically import all the text from a URL in most cases (we're looking into auto-importing PDFs of papers, which I suspect is doable but haven't checked), and in many cases I think this would basically be preferred. But it's also kinda exploitable in ways I'm not sure I'd endorse (e.g. some authors are probably happy to have people crosspost their stuff while nominating it for Best of LessWrong, while other authors might feel violated).

Three options:

1. Only auto-import the first few paragraphs, ending with a "load more" link.
2. Have an LLM extract some important highlights. (I'm ignoring "have an LLM summarize it" because they suck at that, but I think they're decent at identifying key paragraphs.)
3. Start off by auto-importing the whole post, and then wait until anybody complains.

I'd probably limit this to users who are otherwise eligible to nominate (i.e. their account is at least two years old, and maybe they have 100 karma or so), so randos can't go crazy with it. Admins will see all posts imported this way, so we can sanity-check things.
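A minimal sketch of that eligibility gate, assuming the thresholds are exactly "account at least two years old" (approximated here as 730 days) and 100 karma. The function and constant names are hypothetical, not taken from the LessWrong codebase.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds from the comment; two years is approximated as 730 days.
MIN_ACCOUNT_AGE = timedelta(days=730)
MIN_KARMA = 100

def can_auto_import(account_created: datetime, karma: int, now: datetime) -> bool:
    """Return True if the account is old enough AND has enough karma to use auto-import."""
    return now - account_created >= MIN_ACCOUNT_AGE and karma >= MIN_KARMA

now = datetime(2024, 12, 1, tzinfo=timezone.utc)
print(can_auto_import(datetime(2020, 1, 1, tzinfo=timezone.utc), 150, now))  # old account, enough karma
print(can_auto_import(datetime(2024, 1, 1, tzinfo=timezone.utc), 150, now))  # account too new
```

Keeping both conditions in one predicate makes it easy to drop the karma clause if account age alone turns out to be enough.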

Curious what people think.

jmh on Information vs Assurance

I think one clear aspect of the stories here, yours and John's, relates to what I'll call asymmetric information flows: basically, the times at which information that no one is trying to keep secret becomes known to the relevant parties.

Of course, working out what a good update frequency should be for various situations is itself a tricky thing.

seth-herd on Seth Herd's Shortform

Agreed that scaling rather than addition is usually the better way to think about probabilities. In this case, though, we've done so little work on alignment that I think it might actually be closer to additive: from 1% to 26%, or from 50% to 75%, with ten extra years, relative to the real current odds if we press ahead (which nobody knows).
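To make the additive-versus-scaling distinction concrete, here is a small sketch. It assumes "scaling" means multiplying the odds p/(1-p) by a constant factor (one could instead scale the probability itself). The factor of 3 is a hypothetical choice, picked because it maps 50% to 75% and so matches the additive model at that point.

```python
def additive_update(p: float, delta: float) -> float:
    """Shift a probability by a fixed number of percentage points, clamped to [0, 1]."""
    return min(1.0, max(0.0, p + delta))

def odds_scaling_update(p: float, factor: float) -> float:
    """Multiply the odds p / (1 - p) by a constant factor and convert back to a probability."""
    odds = p / (1.0 - p)
    new_odds = odds * factor
    return new_odds / (1.0 + new_odds)

# Hypothetical: +25 points vs. odds x3; both models agree at a 50% starting point.
for p in (0.01, 0.50):
    print(f"{p:.0%}: additive -> {additive_update(p, 0.25):.1%}, "
          f"odds x3 -> {odds_scaling_update(p, 3.0):.1%}")
```

Starting from 50%, both models give 75%. Starting from 1%, the additive model gives 26% while odds scaling gives only about 2.9%. That gap is why the choice of model matters most when the current odds are low.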

I'm pretty sure it would be an error to trust anyone's estimate at this time, because people with roughly equal expertise and wisdom (e.g., Yudkowsky and Christiano) give such wildly different odds. And the discussions between those viewpoints always trail off into differing intuitions.

I also give very poor odds to alignment by default, and to prosaic alignment as it's usually discussed. But there are some pretty obvious techniques that are so low-tax that I think they'll be implemented even by orgs that don't take safety very seriously.

I'm curious if you've read my "Instruction-following AGI is easier and more likely than value aligned AGI" and/or "Internal independent review for language model agent alignment" posts. Instruction-following is human-in-the-loop, so that may already be what you're referring to. But some of the techniques in the independent review post (which is also a review of multiple methods) go beyond prosaic alignment to apply specifically to foundation model agents. And wisely used instruction-following gives corrigibility with a flexible level of oversight.

I'm curious what you think about those techniques if you've got time to look.

I think public acceptance of a pause is only part of the issue. The Chinese might actually not pursue AGI if they didn't have to race the US. But Russia and North Korea will most certainly pursue it. They have very limited resources and technical chops for making much progress on new foundation models, but they still might get to real AGI by turning next-gen foundation models (which there isn't time to pause) into scaffolded cognitive architectures.

But yes, I do think there's a chance we could get the US and European public to support a pause using some of the framings you suggest. But we'd better be sure that's a good idea. Lots of people, notably Russians and North Koreans, are genuinely far less cautious than even Americans, and absolutely will not honor agreements to pause.

Those are some specifics; in general I think it's only useful to talk about what "we" "should" do in the context of what particular actors actually are likely to do in different scenarios. Humanity is far from aligned, and that's a problem.

gwern on gwern's Shortform

LLM support for writing LessWrong posts: virtual comments.

Back in August, I discussed with Ruby & Oliver how to integrate LLMs into LW2 in ways which aren't awful, particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to simulate various LW users with diverse perspectives, and have them write useful feedback on drafts for authors.

(Prompt caching is the Transformer version of the old RNN hidden-state caching trick: you run an input through the (deterministic) NN, save the intermediate state, and reuse it for arbitrarily many future inputs, instead of naively recomputing the first input each time. You can think of it as a lightweight finetuning. This is particularly useful if you are thinking about having large generic prompts, such as an entire corpus. A context window of millions of tokens might currently take up to a minute & $1 to compute, so you definitely want to avoid computing it more than once.)
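A toy illustration of the RNN-style caching trick, assuming a made-up deterministic recurrence in place of a real network. Transformer prompt caching stores per-layer key/value tensors rather than a single hidden state, but the reuse pattern is the same. `ToyRNN` and its update rule are hypothetical.

```python
from dataclasses import dataclass

MODULUS = 2**61 - 1  # keeps the toy state bounded

@dataclass
class ToyRNN:
    """A deterministic stand-in for a network: the state is folded over tokens one at a time."""
    steps: int = 0  # counts token-processing steps so the savings are visible

    def step(self, state: int, token: int) -> int:
        self.steps += 1
        return (state * 31 + token) % MODULUS

    def run(self, tokens: list[int], state: int = 0) -> int:
        for t in tokens:
            state = self.step(state, t)
        return state

prefix = list(range(1000))           # stands in for a large shared prompt, e.g. a corpus
suffixes = [[1, 2, 3], [4, 5], [6]]  # stands in for several different drafts

naive = ToyRNN()
naive_outputs = [naive.run(prefix + s) for s in suffixes]  # reprocesses the prefix every time

cached = ToyRNN()
prefix_state = cached.run(prefix)                          # prefix processed once...
cached_outputs = [cached.run(s, state=prefix_state) for s in suffixes]  # ...then reused

print(naive_outputs == cached_outputs, naive.steps, cached.steps)
```

Because the computation is deterministic, both approaches give identical outputs. The naive approach takes 3,006 steps and the cached one takes 1,006, and the gap grows with the size of the prefix and the number of reuses.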

One idea would be to try to use LLMs to offer feedback on drafts or articles. Given that tuned LLM feedback from Claude or ChatGPT is still not that great, tending towards sycophancy or obviousness or ChatGPTese, it is hardly worthwhile running a post through a generic "criticize this essay" prompt. (If anyone on LW2 wanted to do such a thing, they are surely capable of doing it themselves, and integrating it into LW2 isn't that useful. Removing the friction might be helpful, but it doesn't seem like it would move any needles.)

So, one way to get more interesting feedback would be to force LLMs out of the chatbot-assistant mode collapse and into more interesting simulations for feedback. There has been some success with just suggestively named personas or characters in dialogues (you could imagine here we'd have "Skeptic" or "Optimist" characters), but we can do better. Since this is for LW2, we have an obvious solution: simulate LW users! We know that LW is in the training corpus of almost all LLMs and that writers on it (like myself) are well known to LLMs (e.g. 'truesight'). So we can ask for feedback from simulated LWers: e.g. Eliezer Yudkowsky or myself or Paul Christiano or the author or...

This could be done nicely by finetuning a "LW LLM" on all the articles & comments, with associated metadata like karma, and then feeding any new draft or article into it and sampling a comment from each persona. If there is some obvious criticism or comment Eliezer Yudkowsky would make on a post, which even an LLM can predict, why not deal with it upfront instead of waiting for the real Eliezer to comment (which is also unlikely to ever happen these days)? And one can of course sample an entire comment tree of responses to a 'virtual comment', with the LLM predicting the logical respondents.
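Here is a minimal sketch of the per-persona fan-out. It assumes a generic `sample(prompt) -> str` function supplied by whatever finetuned model or API is used; this is hypothetical, and no specific provider's API is implied. The "Comment by USER:" framing is also an assumed prompt format.

```python
from typing import Callable

# Hypothetical: any function mapping a prompt string to sampled text.
SampleFn = Callable[[str], str]

def persona_prompt(draft: str, username: str) -> str:
    """Frame the draft as a post and open a comment attributed to `username`."""
    return f"{draft}\n\nComment by {username}:\n"

def virtual_comments(draft: str, personas: list[str], sample: SampleFn) -> dict[str, str]:
    """Sample one virtual comment per simulated user."""
    return {user: sample(persona_prompt(draft, user)) for user in personas}

def fake_sample(prompt: str) -> str:
    # Stand-in for a real model: echoes the last line of the prompt.
    return "reply after: " + prompt.splitlines()[-1]

print(virtual_comments("<draft text>", ["Eliezer Yudkowsky", "Paul Christiano"], fake_sample))
```

A comment tree could be grown the same way: append a sampled virtual comment to the draft and call the function again to sample replies.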

This can further incorporate the draft's author's full history, which will usually fit into a multi-million token context window. So their previous comments and discussions, full of relevant material, will get included. This prompt can be cached, and used to sample a bunch of comment-trees. (And if finetuning is infeasible, one can try instead to put the LW corpus into the context and prompt-cache that before adding in the author's corpus.)

The default would be to prompt for high-karma responses. This might not work, because it might be too hard to generate high-quality responses blindly in a feedforward fashion, without any kind of search or filtering. So the data might be formatted with the metadata after each comment, for ranking purposes: the LLM generates a response and only then a karma score, and when we sample, we simply throw out predicted-low-score comments rather than waste the author's time looking at them. (When it comes to these sorts of assistants, I strongly believe 'quality > quantity', and 'silence is golden'. Better to waste some API bills than author time.)
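A minimal sketch of that generate-then-filter step. It assumes each completion ends with a trailing `Karma: N` line. That format, the threshold, and the zero-argument `sample` callable are all hypothetical.

```python
import re
from typing import Callable, Optional

KARMA_LINE = re.compile(r"\nKarma: (-?\d+)\s*$")  # hypothetical trailing-metadata format

def split_comment_and_karma(completion: str) -> Optional[tuple[str, int]]:
    """Split a completion into (comment, predicted karma); None if no score was emitted."""
    match = KARMA_LINE.search(completion)
    if match is None:
        return None
    return completion[: match.start()], int(match.group(1))

def filtered_comments(sample: Callable[[], str], n: int, min_karma: int) -> list[str]:
    """Draw n completions and keep only those whose self-predicted karma clears the bar."""
    kept = []
    for _ in range(n):
        parsed = split_comment_and_karma(sample())
        if parsed is not None and parsed[1] >= min_karma:
            kept.append(parsed[0])
    return kept
```

For example, with a stub sampler cycling through `"Typo in para 2.\nKarma: 12"`, `"Great post!\nKarma: 1"`, and `"No score here"`, a call with `n=3, min_karma=10` keeps only the first comment. Completions with no score are also dropped, on the principle that silence is better than unranked noise.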

One can also target comments to specific kinds of feedback, to structure it better than a grab-bag of whatever the LLM happens to sample. It would be good to have (in descending order of likely usefulness to the author) a 'typo' tree, a 'copyediting'/'style'/'tone' tree, 'confusing part', 'terminology', 'related work', 'criticism', 'implications and extrapolations', 'abstract/summary' (I know people hate writing those)... What else? (These are not natural LW comments, but you can easily see how to prompt for them with prompts like "$USER $KARMA $DATE | Typo: ", etc.)
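A sketch of building one prompt per feedback category in the `$USER $KARMA $DATE | Kind: ` shape mentioned above. The username, the karma value (set high on the assumption that we want high-karma completions), and the date are hypothetical placeholders.

```python
from datetime import date

# Feedback kinds from the comment, in its rough order of likely usefulness.
FEEDBACK_KINDS = ["Typo", "Copyediting/style/tone", "Confusing part", "Terminology",
                  "Related work", "Criticism", "Implications and extrapolations",
                  "Abstract/summary"]

def category_prompt(draft: str, user: str, karma: int, when: date, kind: str) -> str:
    """Append a '$USER $KARMA $DATE | Kind: ' header so the model continues with that kind of comment."""
    return f"{draft}\n\n{user} {karma} {when.isoformat()} | {kind}: "

prompts = {kind: category_prompt("<draft text>", "simulated_user", 100, date(2024, 11, 20), kind)
           for kind in FEEDBACK_KINDS}
print(prompts["Typo"].splitlines()[-1])
```

Each prompt can then be sampled separately, so each category gets its own tree instead of one mixed thread.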

As they are just standard LW comments, they can be attached to the post or draft like regular comments (is this possible? I'd think so, just transclude the comment-tree into the corresponding draft page) and responded to or voted on etc. (Downvoted comments can be fed back into the finetuning with low karma to discourage feedback like that.) Presumably at this point, it would not be hard to make it interactive, and allow the author to respond & argue with feedback. I don't know how worthwhile this would be, and the more interaction there is, the harder it would be to hide the virtual comments after completion.
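A minimal sketch of the feedback loop in the parenthetical above. Each virtual comment the author voted on becomes a finetuning example whose trailing karma is the author's net vote, so downvoted feedback is trained on with a low score. The JSONL `prompt`/`completion` field names and the `Karma: N` line are assumptions, not any particular finetuning API's required schema.

```python
import json
from dataclasses import dataclass

@dataclass
class VirtualComment:
    persona: str    # simulated user the comment was sampled as
    text: str       # the sampled comment body
    net_votes: int  # author's votes on it; negative if downvoted (hypothetical field)

def to_training_record(draft: str, comment: VirtualComment) -> str:
    """Serialize one voted-on virtual comment as a JSONL line.

    The net vote is written where karma would go, so a downvoted comment
    becomes an example of low-karma output for the model to learn from.
    """
    return json.dumps({
        "prompt": f"{draft}\n\nComment by {comment.persona}:\n",
        "completion": f"{comment.text}\nKarma: {comment.net_votes}",
    })

line = to_training_record("<draft text>", VirtualComment("simulated_user", "Obvious point.", -3))
print(line)
```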

And when the author finishes writing & posts a draft, the virtual comments disappear (possibly entirely unread), having served their purpose as scaffolding to help improve the draft. (If the author really likes one, they can just copy it in or quote it, I'd think. But otherwise, I don't see any real reason to make them visible to readers of the final post. If included at all, they should be prominently flagged: maybe the usernames are always prefixed by AI_$USER to ensure no one, including future LLMs, is confused. And they should always sort to the bottom & be collapsed by default.)
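A sketch of the display rules suggested for any virtual comments that remain visible: prefix usernames with `AI_`, collapse them by default, and sort them below every human comment. The `DisplayComment` type and its fields are hypothetical and do not reflect LW2's actual data model.

```python
from dataclasses import dataclass, replace

@dataclass
class DisplayComment:
    username: str
    text: str
    is_virtual: bool
    score: int
    collapsed: bool = False

def prepare_for_display(comments: list[DisplayComment]) -> list[DisplayComment]:
    """Prefix virtual usernames with AI_, collapse them, and sort them below human comments."""
    flagged = []
    for c in comments:
        if c.is_virtual:
            name = c.username if c.username.startswith("AI_") else f"AI_{c.username}"
            c = replace(c, username=name, collapsed=True)
        flagged.append(c)
    # False < True, so human comments come first; within each group, higher score first.
    return sorted(flagged, key=lambda c: (c.is_virtual, -c.score))

thread = [DisplayComment("gwern", "virtual feedback", True, 40),
          DisplayComment("raemon", "real feedback", False, 5)]
for c in prepare_for_display(thread):
    print(c.username, c.collapsed)
```

Sorting on `is_virtual` first means that even a high-scoring virtual comment never outranks a human one.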