LessWrong 2.0 Reader

Agent membranes/boundaries and formalizing “safety”
Chipmonk · 2024-01-03T17:55:21.018Z · comments (46)

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features
Logan Riggs (elriggs) · 2024-03-15T16:30:00.744Z · comments (5)

Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

Distillation of 'Do language models plan for future tokens'
TheManxLoiner · 2024-06-27T20:57:34.351Z · comments (2)

Smartphone Etiquette: Suggestions for Social Interactions
Declan Molony (declan-molony) · 2024-06-04T06:01:03.336Z · comments (4)

Ideas for Next-Generation Writing Platforms, using LLMs
ozziegooen · 2024-06-04T18:40:24.636Z · comments (4)

[link] Manifold Markets
PeterMcCluskey · 2024-02-02T17:48:36.630Z · comments (9)

Evaluating Solar
jefftk (jkaufman) · 2024-02-17T21:50:04.783Z · comments (5)

The Sequences on YouTube
Neil (neil-warren) · 2024-01-07T01:44:39.663Z · comments (9)

AI #57: All the AI News That’s Fit to Print
Zvi · 2024-03-28T11:40:05.435Z · comments (14)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots
Owain_Evans · 2024-03-28T02:34:21.799Z · comments (0)

Evidential Correlations are Subjective, and it might be a problem
Martín Soto (martinsq) · 2024-03-07T18:37:54.105Z · comments (6)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

[link] my favourite Scott Sumner blog posts
DMMF · 2024-06-11T14:40:43.093Z · comments (0)

D&D.Sci Hypersphere Analysis Part 3: Beat it with Linear Algebra
aphyer · 2024-01-16T22:44:52.424Z · comments (1)

[link] Emotional issues often have an immediate payoff
Chipmonk · 2024-06-10T23:39:40.697Z · comments (2)

[link] Let's Design A School, Part 2.1 School as Education - Structure
Sable · 2024-05-02T22:04:30.435Z · comments (3)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

Three Types of Constraints in the Space of Agents
Nora_Ammann · 2024-01-15T17:27:27.560Z · comments (3)

Vote in the LessWrong review! (LW 2022 Review voting phase)
habryka (habryka4) · 2024-01-17T07:22:17.921Z · comments (9)

What is the best argument that LLMs are shoggoths?
JoshuaFox · 2024-03-17T11:36:23.636Z · comments (22)

[link] Introducing the Anthropic Fellows Program
Miranda Zhang (miranda-zhang) · 2024-11-30T23:47:29.259Z · comments (0)

Intranasal mRNA Vaccines?
J Bostock (Jemist) · 2025-01-01T23:46:40.524Z · comments (2)

[link] overengineered air filter shelving
bhauth · 2024-11-08T22:04:39.987Z · comments (2)

Living with Rats in College
lsusr · 2024-12-25T10:44:13.085Z · comments (0)

AI #93: Happy Tuesday
Zvi · 2024-12-04T00:30:06.891Z · comments (2)

Learning Multi-Level Features with Matryoshka SAEs
Bart Bussmann (Stuckwork) · 2024-12-19T15:59:00.036Z · comments (4)

Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
claudia.biancotti · 2024-11-18T09:38:35.723Z · comments (4)

Mask and Respirator Intelligibility Comparison
jefftk (jkaufman) · 2024-12-07T03:20:01.585Z · comments (5)

Lecture Series on Tiling Agents
abramdemski · 2025-01-14T21:34:03.907Z · comments (2)

Action derivatives: You’re not doing what you think you’re doing
PatrickDFarley · 2024-11-21T16:24:04.044Z · comments (0)

Preface
Allison Duettmann (allison-duettmann) · 2025-01-02T18:59:46.290Z · comments (1)

Trying Bluesky
jefftk (jkaufman) · 2024-11-17T02:50:04.093Z · comments (17)

$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)

[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)

Talk: AI safety fieldbuilding at MATS
Ryan Kidd (ryankidd44) · 2024-06-23T23:06:37.623Z · comments (2)

[question] How are you preparing for the possibility of an AI bust?
Nate Showell · 2024-06-23T19:13:45.247Z · answers+comments (16)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

Causality is Everywhere
silentbob · 2024-02-13T13:44:49.952Z · comments (12)

[link] UK AISI: Early lessons from evaluating frontier AI systems
Zach Stein-Perlman · 2024-10-25T19:00:21.689Z · comments (0)

[link] Arrogance and People Pleasing
Jonathan Moregård (JonathanMoregard) · 2024-02-06T18:43:09.120Z · comments (7)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

AI Safety University Organizing: Early Takeaways from Thirteen Groups
agucova · 2024-10-02T15:14:00.137Z · comments (0)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

Essaying Other Plans
Screwtape · 2024-03-06T22:59:06.240Z · comments (4)

Bayesian inference without priors
DanielFilan · 2024-04-24T23:50:08.312Z · comments (8)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

Meetup In a Box: Year In Review
Czynski (JacobKopczynski) · 2024-02-14T01:18:28.259Z · comments (1)

Recent comments

milan-w on Where should one post to get into the training data?

GitHub repos. There, your text won't be forced into people's feeds yet will probably be scraped.

Also: I recommend writing in Markdown, because LLMs tend to write in Markdown.

milan-w on Legionnaire's Shortform

creating and posting have no filter

False. There is a filter for content submitted by new accounts.

dmitry-vaintrob on Dmitry Vaintrob's Shortform

Why I'm in AI sequence: 2020 Journal entry about gpt3

I moved from math academia to full-time AI safety a year ago -- in this I'm in the same boat as Adam Shai, whose reflection post on the topic I recommend you read instead of this.

In making the decision, I went through a lot of thinking and (attempts at) learning about AI before that. A lot of my thinking had been about whether a pure math academic can make a positive difference in AI, and examples that I thought counterindicated this -- I finally decided this might be a good idea after talking to my sister Lizka extensively and doing MATS in Summer of 2023. I'm thinking of doing a more detailed post about my decision and thinking later, in case there are other academics thinking about making this transition (and feel free to reach out in pm's in this case!).

But one thing I have started to forget is how scary and visceral AI risk felt when I was making the decision. I'm both glad and a little sad that the urgency is less visceral and more theoretical now. AI is "a part of the world", not an alien feature: part of the "setting" in the Venkat Rao post that was part of my internal lexicon at the time.

For now, in order to fill a gap in my constantly flagging daily writing schedule, I'll share a meandering entry from 2020 about how I thought about positive AI futures. I don't endorse a lot of it; much is simplistic and low-context, or alternatively commonplace in these circles, though some of it holds up. It's interesting reading back that the thing I thought was most interesting as a first attempt at orienting my thinking was fleshing out "positive futures" and what they might entail. Two big directional updates I've had since are thinking harder about "human alignment" and "human takeover", and trying to temper the predictions that assume singularitarian "first-past-the-post" AGI for a messier "AI-is-kinda-AGI" world that we will likely end up in.

journal entry

7/19/2020 [...] I'm also being paranoid about GPT-3.

Let's think. Will the world end, and if so, when? No one knows, obviously. GPT-3 is a good text generation bot. It can figure out a lot about semantics, mood, style, even a little about humor. It's probably not going to take over the world yet. But how far away are we from AGI?

GPT-3 makes me think, "less than a decade". There's a possibility it will be soon (within the year). I'd assign that probability 10%. It felt like 20% when I first saw its text, but seeing Sam Altman's remark and thinking a little harder, I don't think it's quite realistic for it to go AGI without a significant extra step or two. I think that I'd give it on the order of 50% within the decade. So it's a little like living with a potentially fatal disease, with a prognosis of 10 years. Now we have no idea what AGI will be like. It will most likely either be very weird and deadly or revolutionary and good, though disappointing in some ways. I think there's not much we can do about the weird and deadly scenarios. Humans have lived in sociopathic times (see Venkat's notes on his 14th Century Europe book). It would probably be shorter and deadlier than the plague; various "human zoo" scenarios may be pleasant to experience (after all zoo animals are happier in general than in the wild, at least from the point of view of basic needs), but harrowing to imagine. In any case, it's not worth speculating on this.

What would a good outcome look like? Obviously, no one knows. It's very hard to predict our interaction with a super-human intelligence. But here are some pretty standard "decent" scenarios: (1) After a brief period of a pro-social AI piloted by a team of decent people, we end up with a world much like ours but with AI capabilities curbed for a long period of time [...]. If it were up to me I would design this world with certain "guard rail"-like changes: to me this would be a "Foundation"-style society somewhere in New Zealand (or on the bottom of the ocean perhaps? the moon?) consisting of people screened for decency, intelligence, etc. (but with serious diversity and variance built in), and with control of the world's nukes, with the responsibility of imposing very basic non-interference and freedom of immigration criteria on the world's societies (i.e., making the "archipelago" dream a reality, basically). So enforcing no torture, disincentivizing violent conflict, imposing various controls to make sure people can move from country to country and are exposed to the basic existence of a variety of experiences in the world, but allowing for culturally alien or disgusting practices in any given country: such as Russian homophobia, strict Islamic law, unpleasant-seeming (for Western Europeans) traditions in certain tribal cultures, etc. This combined with some sort of non-interventionist altruistic push. In this sci-fi scenario the Foundation-like culture would have de facto monopoly of the digital world (but use it sparingly) and also a system of safe nuclear power plants sufficient to provide the world's power (but turned on carefully and slowly, to prevent economic jolts), but to carefully and "incontrovertibly" turn most of the proceeds into a universal basic income for the entire world population. Obviously this would have to be carefully thought out first by a community of intelligent and altruistic people with clear rules of debate/decision. 
--- The above was written extremely sleepy. [...]

(2) (Unlikely) AI becomes integrated with (at first, decent and intelligent; later, all interested) humans via some kind of mind-machine interface, or alternatively a faithful human modeling in silico. Via a very careful and considered transition (in some sense "adiabatic", i.e. designed so as not to lose any of our human ethos and meaning that can possibly be recovered safely) we become machines, with a good and meaningful (not wireheaded, other than by considered choice) world left for the hold-outs who chose to remain purely human.

(3) The "Her" scenario: AI takes off on its own, because of human carelessness or desperation. It develops in a way that cherishes and almost venerates humans, and puts effort into making a good, meaningful existence for humans (meaningful and good in sort of the above adiabatic sense, i.e. meaningful via a set of clearly desirable stages of progress from step to step, without hidden agendas, and carefully and thoughtfully avoiding creating or simulating, in an appropriate sense, anything that would be considered a moral horror by locally reasonable intelligences at any point in the journey). AI continues its own existence, either self-organized to facilitate this meaningful existence of humans or doing its own thing, in a clearly separated and "transcendent" world, genuinely giving humans a meaningful amount of self-determination, while also setting up guardrails to prevent horrors and also perhaps eliminating or mitigating some of the more mundane woes of existence (something like cancer, etc.) without turning us into wireheads.

(4) [A little less than ideal by my book, but probably more likely than the others]: The "garden of plenty" scenario. AI takes care of all human needs and jobs, and leaves all humans free to live a nevertheless potentially fulfilling existence, like aristocrats or Victorians but less classist, socializing, learning, reading, etc., with the realization that all they are doing is a hobby: perhaps "human-generated knowledge" would be a sort of sport, or analog of organic produce (homeopathically better, but via a game that makes everyone who plays it genuinely better in certain subtle ways). Perhaps AI will make certain "safe" types of art, craft and knowledge (maybe math! Here I'm obviously being very biased about my work's meaning not becoming fully automated) purely the domain of humans, to give us a sense of self-determination. Perhaps humans are guided through a sort of accelerated development over a few generations to get to the no.2 scenario.

(5) There is something between numbers 3 and 4 above, less ideal than all of the above but likely, where AI quickly becomes an equal player to humans in the domain of meaning-generation, and sort of fills up space with itself while leaving a vaguely better (maybe number 4-like) Earth to humans. Perhaps imposes a time limit on humans (enforced via a fertility cap, hopefully with the understanding that humans can raise AI babies with a genuine sense of filial consciousness and complete with bizarre scenes of trying to explain the crazy world of AI to their parents), after which the human project becomes the AI project, probably essentially incomprehensible to us.

There's a sense that I have that while I'm partial to scenarios 1 and 2: I want humans to retain the monopoly on meaning-generation and to be able to feel empowered and important, it will be seen to be old-fashioned and almost dangerous by certain of my peers because of the lack of emphasis on harm-prevention, stable future, etc. I think this is part of the very serious debate, so far abstract and fun, but, as AI gets better, perhaps turning heated and loud, between whether comfort or meaning are more important goals of the human project (and both sides will get weird). I am firmly on the side of meaning, with a strict underpinning of retaining bodily and psychological integrity in all the object-level and meta-level senses (except I guess I'm ok with moving to the cloud eventually? Adiabatic is the word for me). Perhaps my point of view is on the side I think it is just in the weird group of futurists and rationalists that I mostly read when reading about AI: probably the generic human who thinks about AI is horrified by all of the above scenarios and just desperately hoping it will go away on its own, or has some really idiosyncratic mix of the above or other ideas which seem obviously preferable to them.

francescag on Extending control evaluations to non-scheming threats

Thanks for putting this out there. I’m on the Catalyze program at LISA in London and have been thinking a lot about AI control (for framing, my background is risk management, so I’m coming at it from that angle).

'The high-level change I propose is to define a control evaluation as a method of bounding risk from worst-case deployment behavior generally – rather than scheming models in particular.'

I’m on board with this. If our goal is to bound the risk from worst-case deployment behaviour, I think the broader the better, and enumerating different ‘untrusted model categories’ is going to better inform our understanding of harmful behaviours. If we have post-deployment monitoring and identify new patterns of harmful behaviour, we may even identify new untrusted model categories in this way.

For post-deployment monitoring (let’s say the control eval highlighted some behaviour for which we wanted ongoing control), I think the reason (i.e. the untrusted model category causing the behaviour) is less important than the observed behaviour itself. Multiple untrusted model categories could even apply at once, e.g. scheming and jailbroken, so you could have a single monitor to catch all of these. What are your thoughts on this?

‘control is a broad approach to mitigating and evaluating safety given “pessimistic assumptions” about model deployment behavior.’

Rather than ‘pessimistic assumptions’, I would describe these as ‘unknowns’, the probability of which would range somewhere between 0 and 1 based on our confidence in the representativeness of the evaluations.

I think to make control evaluations practically useful, we need to add more nuance to how we define trusted and untrusted. Coming from risk management (in Fintech and FS), I’m approaching it from a view that a model can only ever be P(Trusted)=1 (i.e. fully trusted) when bounded by certain deployment constraints. So a model is trusted to do something rather than generally trusted, e.g. it may be trusted to not write backdoored code, but may not be trusted to keep sensitive data safe if it’s in a setting where users can jailbreak it.

This could also help justify claims around a ‘rarely misbehaving monitor being safe because its failures will likely be uncorrelated with attacks executed by the policy.’ Let’s say you assign a probability to this of 0.001, which is fine for a use case of coding a website but not ok for coding a banking platform. You could layer more monitors to reduce this probability for this use case, effectively getting the risk within appetite for this deployment setting e.g. to 0.000001.
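The layering arithmetic above can be sketched in a few lines of Python. This is only a rough illustration under a strong assumption that is not stated in the comment: each monitor misses independently of the others, so the joint miss probability is the product of the individual ones. The function names `joint_miss_probability` and `monitors_needed` are hypothetical, and the numbers are the ones used in the paragraph above.

```python
import math


def joint_miss_probability(miss_probs):
    """Probability that every monitor misses at once.

    ASSUMPTION: monitor failures are independent, so probabilities multiply.
    Correlated failures would make the real figure higher.
    """
    return math.prod(miss_probs)


def monitors_needed(per_monitor_miss, risk_appetite):
    """Smallest number of identical, independent monitors whose joint miss
    probability is at or below the risk appetite for a deployment setting."""
    if not (0 < per_monitor_miss < 1 and 0 < risk_appetite < 1):
        raise ValueError("both probabilities must lie strictly between 0 and 1")
    n = 1
    joint = per_monitor_miss
    # Values within floating-point rounding of the appetite count as meeting it.
    while joint > risk_appetite and not math.isclose(joint, risk_appetite):
        n += 1
        joint *= per_monitor_miss
    return n


# Website use case: one monitor at 0.001 is already within an appetite of 0.01.
website_monitors = monitors_needed(0.001, 0.01)

# Banking use case: an appetite of 0.000001 needs a second independent monitor.
banking_monitors = monitors_needed(0.001, 0.000001)
```

If the monitors share failure modes (for example, the same jailbreak fools all of them), the product understates the risk, so the independence assumption would need its own justification per deployment setting.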

taleuntum on We probably won't just play status games with each other after AGI

As AIs don't have the same origin as humans, it is basically inconceivable to me that they will ever share the internal processes underlying their "emotions", no matter how good they get at surface "emoting". In my opinion this makes it impossible to have a true connection/meaningful relationship with them as humans, even if many people in the future will fail to see this. Otherwise I largely agree with your post: a brain implemented in silicon could be a meaningful friend to a human (e.g. ems), people will probably fully integrate AIs into the social scene (this is a mistake in my view), and people will modify their brains in (what now seem like) radical ways.

bilalchughtai on How do you deal w/ Super Stimuli?

As a general rule, I try and minimise my phone screen time and maximise my laptop screen time. I can do every "productive" task faster on a laptop than on my phone. Here are some things I do that I find helpful that I haven't yet seen discussed.

Use a very minimalist app launcher on my phone, that makes searching for apps a conscious decision.
Use a greyscale filter on my phone (which is hard to turn off), as this makes doing most things on my phone harder.

viliam on Shortform

There are many possible maps that describe the same territory. Trying to switch people to use a different map could be a good thing, or it could be a bad thing. (A person who likes the new map might describe it as "giving them fresh insights", a person who dislikes it might describe it as "manipulating them".)

Is the scientific map always better? Well, sometimes it is not available. And sometimes it is too complex. In situations where science provides a clear and simple answer, I guess following it is very likely to be the right answer. But that is not always the case, and then... what are the alternatives? Either paralysis ("I am going to ignore this topic until science finally comes up with a simple answer") or some kind of greedy reductionism / focusing on what is legible ("I am going to ignore the illegible parts and focus on the part that is certain: everything, including my wife and kids, is ultimately built from atoms and anything else about them is mere superstition").

viliam on Daniel Tan's Shortform

Off topic, but your words helped me realize something. It seems like for some people it is physical attraction first, for others it is emotional connection first.

The former may perceive the latter as dishonest: if their model of the world is that for everyone it is physical attraction first (it is only natural to generalize from one example), then what you describe as "take my time getting to know someone organically", they interpret as "actually I was attracted to the person from first sight, but I was afraid of rejection, so I strategically pretended to be a friend first, so that I could later blackmail them into having sex by threatening to withdraw the friendship they spent a lot of time building".

Basically, from the "for everyone it is attraction first" perspective, the honest behavior is either going for the sex immediately ("hey, you're hot, let's fuck" or a more diplomatic version thereof), or deciding that you are not interested sexually, and then the alternatives are either walking away, or developing a friendship that will remain safely sexless forever.

And from the other side, complaining about the "friend zone" is basically complaining that too many people you are attracted to happen to be "physical attraction first" (and they don't find you attractive), but it takes you too long to find out.

abandon on White Lies

You have misunderstood a standard figure of speech. Here is the definition he was using: https://www.ldoceonline.com/dictionary/to-be-fair (see also https://dictionary.cambridge.org/us/dictionary/english/to-be-fair, which doesn't explicitly mention that it's typically used to offset criticisms but otherwise defines it more thoroughly).

adam-b on Introducing Fatebook: the fastest way to make and track predictions

Thanks Raemon – I'm a fan of ~all these ideas! I'm not spending much time on Fatebook specifically these days (busy with various other projects), but your feedback is and has been super useful.