LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

[link] Video Intro to Guaranteed Safe AI
Mike Vaiana (mike-vaiana) · 2024-07-11T17:53:47.630Z · comments (0)

[question] Me & My Clone
SimonBaars (simonbaars) · 2024-07-18T16:25:40.770Z · answers+comments (22)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (21)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

[link] Robert Caro And Mechanistic Models In Biography
adamShimi · 2024-07-14T10:56:42.763Z · comments (5)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

Distillation of 'Do language models plan for future tokens'
TheManxLoiner · 2024-06-27T20:57:34.351Z · comments (2)

[question] What percent of the sun would a Dyson Sphere cover?
Raemon · 2024-07-03T17:27:50.826Z · answers+comments (26)

[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (3)

How Congressional Offices Process Constituent Communication
Tristan Williams (tristan-williams) · 2024-07-02T12:38:41.472Z · comments (0)

[LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals
tailcalled · 2024-08-09T19:57:56.122Z · comments (0)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.
sevdeawesome · 2024-07-01T05:50:49.498Z · comments (0)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (3)

AI #77: A Few Upgrades
Zvi · 2024-08-20T00:20:09.717Z · comments (3)

AXRP Episode 34 - AI Evaluations with Beth Barnes
DanielFilan · 2024-07-28T03:30:07.192Z · comments (0)

[link] Libs vs Frameworks, Middle-Level Regularities vs Theories
adamShimi · 2024-07-04T19:01:59.440Z · comments (0)

[link] [Talk transcript] What “structure” is and why it matters
Alex_Altair · 2024-07-25T15:49:00.844Z · comments (0)

[link] Managing Emotional Potential Energy
adamShimi · 2024-07-10T18:20:45.640Z · comments (4)

The Garden of Eden
Alexander Turok · 2024-07-22T16:07:42.509Z · comments (2)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

Recent comments

sharmake-farah on Another argument against utility-centric alignment paradigms

It seems to me all the more basic desires ("values"), e.g. the lower layers of Maslow's hierarchy of needs, are mainly determined by heritable factors. Because they are relatively stable across cultures. So presumably you talk about "higher" values being a function of "data sources in the world"? I.e. of nurture rather than nature?

I agree there probably are some heritable values, though my big difference here is that I think that the set of primitive values is quite a bit smaller than you might think.

Though be warned, heritability doesn't actually answer our question, because the way it's interpreted by laymen is pretty wrong:

https://www.lesswrong.com/posts/YpsGjsfT93aCkRHPh/what-does-knowing-the-heritability-of-a-trait-tell-me-in

I probably should have more clearly separated the formal ethical theories that people describe, which you call morals, from what their values actually are.

I was always referring to values when I was talking about morals.

You are correct that someone describing a moral theory doesn't mean that they actually agree with or implement the theory.

I still think that if you had the amount of control over a human that an ML person has over an AI today, you could brainwash them to hold ~arbitrary values, and it would be the central technology of political and social situations, which is a lot.

nathan-helm-burger on Nathan Helm-Burger's Shortform

Personal AI Assistant Ideas

When I imagine having a personal AI assistant with approximately current levels of capability I have a variety of ideas of what I'd like it to do for me.

Auto-background-research

I like to record myself rambling about my current ideas while walking my dog. I use an app that automatically saves a mediocre transcription of the recording. Ideally, my AI assistant would respond to a new transcription by doing background research to find academic literature related to the ideas mentioned within the transcript. That way, when I went to look at my old transcript, it would already be annotated with links to prior work done on the topic, and analysis of the overlap between my ideas and the linked literature.

Also, ideally, the transcription would be better quality. Context clues about the topic of the conversation should be taken into account when making guesses about transcribing unclear words. Where things are quite unclear, there should be some sort of indicator in the text of that.
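The "indicator in the text" idea could look something like the minimal sketch below. It assumes the transcription step exposes a per-word confidence score, which is a hypothetical interface; the threshold value and the `[?word?]` marker format are likewise illustrative choices rather than features of any particular app.

```python
from typing import List, Tuple

def mark_unclear(words: List[Tuple[str, float]], threshold: float = 0.6) -> str:
    """Join words into text, wrapping low-confidence ones in [?...?]."""
    out = []
    for word, confidence in words:
        # Flag any word whose (hypothetical) confidence falls below the threshold.
        out.append(f"[?{word}?]" if confidence < threshold else word)
    return " ".join(out)

# Example input: (word, confidence) pairs as an assumed transcription output.
transcript = [("sparse", 0.95), ("auto", 0.4), ("encoders", 0.9)]
print(mark_unclear(transcript))
```

Keeping the original guess inside the marker, rather than replacing it with a blank, leaves the text readable while still showing where the transcript should not be trusted.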

Various voice modes, and ability to issue verbal commands to switch between them

Receptive Listening Voice-mode

Also for when I want to ramble about my ideas, I'd like having my AI assistant act as a receptive listener, just saying things like 'that makes sense', or 'could you explain more about what you mean about <idea>?'. Ideally, this would feel relatively natural, and not too intrusive. Asking for clarification just where the logic didn't quite follow, or I used a jargon term in some non-standard seeming way. The conversation in this mode would be one-sided, with the AI assistant just helping to draw out my ideas, not contributing much. Occasionally it might say, 'Do you mean <idea> in the same sense that <famous thinker> means <similar sounding idea>?' And I could explain similarity or differences.

Question Answering Voice-mode

Basically a straightforward version of the sort of thing Perplexity does, which is try to find academic sources to answer a technical question. I'd want to be able to ask, and get a verbal answer of summaries of the sources, but also have a record saved of the conversation and the sources (ideally, downloaded and saved to a folder). This would be mostly short questions from me, and long responses from the model.

Discussion Voice-mode

More emphasis on analysis of my ideas, extrapolating, pointing out strengths and weaknesses. Something like what I get when discussing ideas with Claude Sonnet 3.5 after having told it to act as a philosopher or scientist who is critically examining my ideas. This would be a balanced back-and-forth, with roughly equal length conversational turns between me and the model.

Coding Project Collaboration

I would want to be able to describe a coding project, and as I do so have the AI assistant attempt to implement the pieces. If I were using a computer, then an ongoing live output of the effects of the code could be displayed. If I were using voice-mode, then the feedback would be occasional updates about errors or successful runs and the corresponding outputs. I could ask for certain metrics to be reported against certain datasets or benchmarks, also general metrics like runtime (or iterations per second where appropriate), memory usage, and runtime complexity estimates.
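As a rough illustration of the "general metrics" part, here is a minimal sketch of how an assistant might report runtime and memory usage for a single call, using only the Python standard library. The `measure` helper and the shape of its report are assumptions made for this example, not part of any existing tool.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict

def measure(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Dict[str, Any]:
    """Run fn once and report wall-clock time and peak traced memory."""
    # tracemalloc only tracks allocations made through Python's allocator,
    # so memory used inside some native extensions may not be counted.
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak_bytes}

# Example: time a sort of 100,000 integers in reverse order.
report = measure(sorted, list(range(100_000, 0, -1)))
print(f"{report['seconds']:.4f} s, peak {report['peak_bytes'] / 1e6:.2f} MB")
```

A single run is noisy; for numbers worth comparing across versions of the code, the same call would need to be repeated and averaged.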

Anthropic Wishlist

Anthropic is currently my favorite model supplier, in addition to being my most trusted and approved of lab in regards to AI safety and also user data privacy. So, when I fantasize about what I'd like future models to be like, my focus tends to be on ways I wish I could improve Anthropic's offering.

Most of these could be implemented as features in a wrapper app. But I'd prefer them to be implemented directly by Anthropic, so that I didn't have to trust an additional company with my data.

In order from most desired to less desired:

Convenience feature: Ability to specify a system prompt which gets inserted at the beginning of every conversation. Mine would say something like, "Act like a rational scientist who gives logical critiques. Keep praise to a minimum, no sycophancy, no apologizing. Avoid filler statements like, 'let me know if you have further questions.'"
Ability to check up-to-date documents and code for publicly available code libraries. This doesn't need to be a web search! You could have a weekly scraper check for changes to public libraries, like python libraries. So many of the issues I run into with LLMs generating code that doesn't work are caused by outdated calls to libraries which have since been updated.
Voice Mode, with appropriate interruptions, tone-of-voice detection, and prosody matching. Basically, like what OpenAI is working on.
Academic citations. This doesn't need to be a web search! This could just be from searching an internal archive of publicly available open access scientific literature, updated once a week or so.
Convenience feature: A button to enable 'summarize this conversation, and start a new conversation with that summary as a linked doc'.
Ability to test out generated code to make sure it at least compiles/runs before showing it to me. This would include the ability to have a code file which we were collaborating on, which got edited as the conversation went on. That would replace the current pattern of responses intended to modify just a specific function within a code file, where I need to copy/paste it in, and then rerun the code to check if it works.
Ability to have some kind of personalization of the model which went deeper than a simple system prompt. Some way for me to give feedback. Some way for me to select some of my conversations to be 'integrated', such that the mode / tone / info from that conversation were incorporated more into future fresh conversations. Sometimes I feel like I've 'made progress' in a conversation, gotten into a better pattern, and it's frustrating to have to start over from scratch every time.
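The "at least compiles/runs" check from the list above can be sketched in a few lines. This is a minimal illustration for Python code only, and `check_generated_code` is a hypothetical helper, not something Anthropic or any wrapper app is known to provide. It runs the code in a plain subprocess with no isolation, so a real system would need a proper sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple

def check_generated_code(source: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Return (ok, message) after a syntax check and a trial run in a subprocess."""
    try:
        # Syntax check only: compile() builds a code object without executing it.
        compile(source, "<generated>", "exec")
    except SyntaxError as err:
        return False, f"syntax error: {err}"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "generated.py"
        script.write_text(source, encoding="utf-8")
        try:
            # WARNING: no sandboxing here; this is an assumption-laden sketch.
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()

print(check_generated_code("print(1 + 1)"))
```

The two stages are kept separate on purpose: a syntax failure can be reported immediately, while a runtime failure returns the traceback text that the model would need in order to attempt a fix.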

habryka4 on The Sun is big, but superintelligences will not spare Earth a little sunlight

I updated the title with one Eliezer seemed fine with (after poking Robby). Not my top choice, but better than the previous one.

raemon on Struggling like a Shadowmoth

I found this a particularly helpful lens.

I had had individual thoughts like "of course in this case the alien woman is, you know, bad for kidnapping him and torturing him", but the particular ecosystem frame feels probably-useful for generating followup questions. (It's also thematically resonant with the themes in the book!)

This thread did motivate me to add an additional disclaimer to the post.

cubefox on Another argument against utility-centric alignment paradigms

It seems to me all the more basic desires ("values"), e.g. the lower layers of Maslow's hierarchy of needs, are mainly determined by heritable factors. Because they are relatively stable across cultures. So presumably you talk about "higher" values being a function of "data sources in the world"? I.e. of nurture rather than nature?

Another point I'd like to raise is that values (in the sense of desires/goals) are arguably quite different from morals. First, morals are more general than desires. Extraterrestrials could also come up with a familiar theory of, say, preference utilitarianism, while not sharing several of our desires, e.g. for eating chocolate or for having social contacts. Indeed, theories of ethics like utilitarianism or Kantian deontology "abstract away" from specific desires by coming up with more general principles which are independent of concrete things individuals may want. Second, it is clearly possible and consistent for someone (e.g. a psychopath) to want X without believing that X is morally right. Conversely, philosophers arguing for some theory of ethics don't necessarily adhere perfectly to the principles of this system, in the same way in which a philosopher arguing for a theory of rationality isn't necessarily perfectly rational himself.

raemon on The Sun is big, but superintelligences will not spare Earth a little sunlight

I just edited this into the OP.

sharmake-farah on My disagreements with "AGI ruin: A List of Lethalities"

We'd need to get more quantitative here about how much AI labor we can use for alignment before it's too dangerous, and my answer is that we could get about 1-2 OOMs smarter than humans at inference where we could be confident in using them safely, and IMO close to an arbitrary number of copies of that AI, conditional on good control techniques being used.

To address 2 comments:

and I have high confidence that control measures will not be used consistently and correctly in practice.

Yeah, this seems pretty load-bearing for the plan, and a lot of the reason I don't have probabilities of extinction below 0.1-1% is because I am actually worried about labs not doing control measures consistently.

I assign more moderate probabilities than you do, in that the scenarios of labs not doing control properly and doing control properly are both somewhat plausible to me now, but yeah it would really be high-value for labs to prepare themselves to do control work properly.

To address this:

I just don't think that extends to ASI

Maybe initially, but critically, I think the evidence we will get for pre-AGI levels will heavily constrain our expectations of what an ASI will do re its alignment, and that we will learn a lot more about both alignment and control techniques when we get human-level models, and I think we can trust a lot of the evidence to generalize at least 2 OOMs up.

So I think a lot of the uncertainty will start becoming removed as AI scales up.

I agree with this:

I think almost anything that counts as AGI is very nearly ASI by default (not because of RSI, just because of hardware scaling ability)

Even without recursive self-improvement, it's pretty easy to scale by several OOMs, and while there are enough bottlenecks to prevent FOOM, they are not enough to slow it down by 1 decade except in tail scenarios.

boris-kashirin on The Sun is big, but superintelligences will not spare Earth a little sunlight

It is defecting against cooperate-bot.

raemon on The Sun is big, but superintelligences will not spare Earth a little sunlight

It was me. I initially suggested "Bernard Arnault won't give you $77" as the title, Eliezer said "don't bury the lead, just say 'ASI will not leave just a little sunlight for Earth'". After reading this thread I was thinking about alternate titles, looking for ones that would both convey the right thing and feel reasonably succinct/aesthetic/etc.

tsvibt on Struggling like a Shadowmoth

There's also the question of the non-helper as an instance of a class, and you as an instance of a class, and the resulting implied ecology. Or to say it a different way: apply TDT to the shadowmoth / meal question. To say it a third way: if people like me react to situations like this--involving some relationship with someone or something--in such-and-such a way, then what trophic niche are we opening up, i.e. what sort of food are we making available for what sort of predator?