LessWrong 2.0 Reader


[link] Fluent dreaming for language models (AI interpretability method)
tbenthompson (ben-thompson) · 2024-02-06T06:02:59.296Z · comments (4)

[link] From Conceptual Spaces to Quantum Concepts: Formalising and Learning Structured Conceptual Models
Roman Leventov · 2024-02-06T10:18:40.420Z · comments (1)

Are most personality disorders really trust disorders?
chaosmage · 2024-02-06T12:37:56.070Z · comments (4)

Why Two Valid Answers Approach is not Enough for Sleeping Beauty
Ape in the coat · 2024-02-06T14:21:58.912Z · comments (12)

On the Debate Between Jezos and Leahy
Zvi · 2024-02-06T14:40:05.487Z · comments (6)

[question] Why do we need an understanding of the real world to predict the next tokens in a body of text?
Valentin Baltadzhiev (valentin-baltadzhiev) · 2024-02-06T14:43:50.559Z · answers+comments (12)

Evolution is an observation, not a process
Neil (neil-warren) · 2024-02-06T14:49:31.021Z · comments (11)

Preventing model exfiltration with upload limits
ryan_greenblatt · 2024-02-06T16:29:33.999Z · comments (16)

[question] How can I efficiently read all the Dath Ilan worldbuilding?
mike_hawke · 2024-02-06T16:52:32.558Z · answers+comments (1)

What does davidad want from «boundaries»?
Chipmonk · 2024-02-06T17:45:42.348Z · comments (1)

[link] Arrogance and People Pleasing
Jonathan Moregård (JonathanMoregard) · 2024-02-06T18:43:09.120Z · comments (7)

My guess at Conjecture's vision: triggering a narrative bifurcation
Alexandre Variengien (alexandre-variengien) · 2024-02-06T19:10:42.690Z · comments (12)

How to train your own "Sleeper Agents"
evhub · 2024-02-07T00:31:42.653Z · comments (10)

Full Driving Engagement Optional
jefftk (jkaufman) · 2024-02-07T02:30:04.776Z · comments (0)

story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)

Why I think it's net harmful to do technical safety research at AGI labs
Remmelt (remmelt-ellen) · 2024-02-07T04:17:15.246Z · comments (24)

[link] Benchmark Study #5: Social Intelligence QA (Task, MCQ)
Bruce W. Lee (bruce-lee) · 2024-02-07T04:41:00.847Z · comments (0)

Quantum Darwinism, social constructs, and the scientific method
pchvykov · 2024-02-07T07:04:48.042Z · comments (12)

[question] How to deal with the sense of demotivation that comes from thinking about determinism?
SpectrumDT · 2024-02-07T10:53:54.794Z · answers+comments (71)

The Math of Suspicious Coincidences
Roko · 2024-02-07T13:32:35.513Z · comments (3)

Training of superintelligence is secretly adversarial
quetzal_rainbow · 2024-02-07T13:38:13.749Z · comments (2)

[question] What's this 3rd secret directive of evolution called? (survive & spread & ___)
lukehmiles (lcmgcd) · 2024-02-07T14:11:58.143Z · answers+comments (11)

[link] Reading writing advice doesn't make writing easier
Henry Sleight (ResentHighly) · 2024-02-07T19:14:39.099Z · comments (0)

[link] More Hyphenation
Arjun Panickssery (arjun-panickssery) · 2024-02-07T19:43:29.086Z · comments (19)

[question] Choosing a book on causality
martinkunev · 2024-02-07T21:16:08.885Z · answers+comments (3)

[link] Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan (akbir-khan) · 2024-02-07T21:28:10.694Z · comments (14)

Nitric oxide for covid and other viral infections
Elizabeth (pktechgirl) · 2024-02-07T21:30:03.774Z · comments (6)

A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is
Roko · 2024-02-07T21:49:46.331Z · comments (36)

Conditional prediction markets are evidential, not causal
philh · 2024-02-07T21:52:47.476Z · comments (10)

Domestic Production vs International Wealth Creation
100YearPants · 2024-02-08T04:25:03.334Z · comments (0)

[link] A Chess-GPT Linear Emergent World Representation
karvonenadam · 2024-02-08T04:25:15.222Z · comments (14)

Measuring pre-peer-review epistemic status
Jakub Smékal (jakub-smekal) · 2024-02-08T05:09:01.418Z · comments (0)

Believing In
AnnaSalamon · 2024-02-08T07:06:13.072Z · comments (49)

How to develop a photographic memory 3/3
PhilosophicalSoul (LiamLaw) · 2024-02-08T09:22:07.918Z · comments (2)

AI #50: The Most Dangerous Thing
Zvi · 2024-02-08T14:30:13.168Z · comments (4)

Predicting Alignment Award Winners Using ChatGPT 4
Shoshannah Tekofsky (DarkSym) · 2024-02-08T14:38:37.925Z · comments (2)

Updatelessness doesn't solve most problems
Martín Soto (martinsq) · 2024-02-08T17:30:11.266Z · comments (43)

aintelope project update
Gunnar_Zarncke · 2024-02-08T18:32:00.000Z · comments (2)

[link] A review of "Don’t forget the boundary problem..."
jessicata (jessica.liu.taylor) · 2024-02-08T23:19:49.786Z · comments (1)

Twin Cities ACX Meetup - February 2024
Timothy M. (timothy-bond) · 2024-02-08T23:26:51.837Z · comments (2)

[question] How do health systems work in adequate worlds?
mukashi (adrian-arellano-davin) · 2024-02-09T00:54:38.443Z · answers+comments (2)

[question] How do high-trust societies form?
Shankar Sivarajan (shankar-sivarajan) · 2024-02-09T01:11:24.201Z · answers+comments (17)

[link] Core systems of number
Bruce W. Lee (bruce-lee) · 2024-02-09T02:19:03.207Z · comments (0)

Running the Numbers on a Heat Pump
jefftk (jkaufman) · 2024-02-09T03:00:04.920Z · comments (12)

[link] Shared system for ordering small and large numbers in monkeys and humans
Bruce W. Lee (bruce-lee) · 2024-02-09T04:45:52.957Z · comments (0)

[link] Number Trumps Area for 7-Month-Old Infants
Bruce W. Lee (bruce-lee) · 2024-02-09T04:58:51.344Z · comments (0)

[link] Biden-Harris Administration Announces First-Ever Consortium Dedicated to AI Safety
Ben Smith (ben-smith) · 2024-02-09T06:40:44.427Z · comments (0)

Transfer learning and generalization-qua-capability in Babbage and Davinci (or, why division is better than Spanish)
RP (Complex Bubble Tea) · 2024-02-09T07:00:45.825Z · comments (6)

Skills I'd like my collaborators to have
Raemon · 2024-02-09T08:20:37.686Z · comments (9)

[question] Do you want to make an AI Alignment song?
Kabir Kumar (kabir-kumar-1) · 2024-02-09T08:22:05.164Z · answers+comments (0)


Recent comments

joe_collman on Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

This seems interesting, but I've seen no plausible case that there's a version of (1) that's both sufficient and achievable. I've seen Davidad mention e.g. approaches using boundaries formalization. This seems achievable, but clearly not sufficient. (boundaries don't help with e.g. [allow the mental influences that are desirable, but not those that are undesirable])

The [act sufficiently conservatively for safety, relative to some distribution of safety specifications] constraint seems likely to lead to paralysis (either of the form [AI system does nothing], or [AI system keeps the world locked into some least-harmful path], depending on the setup - and here of course "least harmful" isn't a utopia, since it's a distribution of safety specifications, not desirability specifications).
Am I mistaken about this?

I'm very pleased that people are thinking about this, but I fail to understand the optimism - hopefully I'm confused somewhere!
Is anyone working on toy examples as proof of concept?

I worry that there's so much deeply technical work here that not enough time is being spent to check that the concept is workable (is anyone focusing on this?). I'd suggest focusing on mental influences: what kind of specification would allow me to radically change my ideas, but not to be driven insane? What's the basis to think we can find such a specification?

It seems to me that finding a fit-for-purpose safety/acceptability specification won't be significantly easier than finding a specification for ambitious value alignment.

tenthkrige on Forecasting: the way I think about it

Good points well made. I'm not sure what you mean by "my expected log score is maximized" (and would like to know), but in any case it's probably your average world rather than your median world that does it?

zach-stein-perlman on Anthropic: Reflections on our Responsible Scaling Policy

Thanks.

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)
What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing it before release, chance that someone seizes on a misinterpretation and leaps to conclusions, and other costs.

Not sure. I can generally imagine a company publishing what Anthropic has published but having a weak/fake system in reality. Policy details do seem less important for non-compliance reporting than some other policies: Anthropic says it has an infohazard review policy, and I expect it's good, but I'm not confident, and for other companies I wouldn't necessarily expect that their policy is good (even if they say a formal policy exists), and seeing details (with sensitive bits redacted) would help.

I mostly take back my "secret policy is strong evidence of bad policy" insinuation: that's ~true on my home planet, but on Earth you don't get sufficient credit for sharing good policies and there's substantial negative EV from misunderstandings and adversarial interpretations, so I guess it's often correct to not share :(

Edit: as an 80/20 of publishing, maybe you could share a policy with an external auditor who would then publish whether they think it's good or have concerns. I would feel better if that happened all the time.

marius-adrian-nicoara on Cluj-Napoca, Romania – ACX Meetups Everywhere 2022

Hi,

How did the event go?

Any plans to organize a meetup this year?

I'm planning to host a meetup in Sibiu this summer, because I haven't seen an event scheduled here. Any advice? I'm also planning to host a meetup in Cluj-Napoca this year, if it's not announced by someone else.

Kind regards, Marius Nicoară

stephen-fowler on Stephen Fowler's Shortform

This does not feel super cruxy as the power incentive still remains.

zac-hatfield-dodds on Anthropic: Reflections on our Responsible Scaling Policy

"red line" vs "yellow line"

Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if you fail them you would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line. For example, leaving out the "register a typo'd domain" step from an ARA eval, because there are only so many good typos for our domain.

assurance mechanisms

Our White House commitments mean that we're already reporting safety evals to the US Government, for example. I think the natural reading of "validated" is some combination of those, though obviously it's very hard to validate that whatever you're doing is 'sufficient' security against serious cyberattacks or safety interventions on future AI systems. We do our best.

I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)

What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn't mean much compared to whether it's actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don't have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing it before release, chance that someone seizes on a misinterpretation and leaps to conclusions, and other costs.

zac-hatfield-dodds on Anthropic: Reflections on our Responsible Scaling Policy

I believe that meeting our ASL-2 deployment commitments - e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models - with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights... I think that would be pretty cool.

(also note that e.g. Llama is not open source - I think you're talking about releasing weights; the license doesn't affect safety but as an open-source maintainer the distinction matters to me)

chris_leong on Anthropic: Reflections on our Responsible Scaling Policy

That's the exact thing I'm worried about: that people will equate deploying a model via API with releasing open weights, when the latter carries significantly more risk due to the potential for future modification and the inability to withdraw it.

chris_leong on Anthropic: Reflections on our Responsible Scaling Policy

Frontier Red Team, Alignment Science, Finetuning, and Alignment Stress Testing

What's the difference between a frontier red team and alignment stress-testing? Is the red team focused on the current models you're releasing and the alignment stress testing focused on the future?

zach-stein-perlman on Anthropic: Reflections on our Responsible Scaling Policy

I think this is implicit — the RSP discusses deployment mitigations, which can't be enforced if the weights are shared.