LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

[link] New blog: Expedition to the Far Lands
Connor Leahy (NPCollapse) · 2024-08-17T11:07:48.537Z · comments (3)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (16)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

[LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals
tailcalled · 2024-08-09T19:57:56.122Z · comments (0)

The Garden of Eden
Alexander Turok · 2024-07-22T16:07:42.509Z · comments (2)

[question] Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic?
Dalcy (Darcy) · 2024-08-16T04:16:23.159Z · answers+comments (6)

AI #77: A Few Upgrades
Zvi · 2024-08-20T00:20:09.717Z · comments (3)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (1)

AXRP Episode 34 - AI Evaluations with Beth Barnes
DanielFilan · 2024-07-28T03:30:07.192Z · comments (0)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

Would you benefit from, or object to, a page with LW users' reacts?
Raemon · 2024-08-20T16:35:47.568Z · comments (6)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)

[LDSL#3] Information-orientation is in tension with magnitude-orientation
tailcalled · 2024-08-10T21:58:27.659Z · comments (0)

Monthly Roundup #21: August 2024
Zvi · 2024-08-20T00:20:08.178Z · comments (6)

[link] The Tech Industry is the Biggest Blocker to Meaningful AI Safety Regulations
garrison · 2024-08-16T19:37:28.416Z · comments (1)

Can We Predict Persuasiveness Better Than Anthropic?
Lennart Finke (l-f) · 2024-08-04T14:05:33.668Z · comments (5)

[link] Day Zero Antivirals for Future Pandemics
Niko_McCarty (niko-2) · 2024-08-26T15:18:33.858Z · comments (2)

August 2024 Time Tracking
jefftk (jkaufman) · 2024-08-24T13:50:04.676Z · comments (0)

[link] on Science Beakers and DDT
bhauth · 2024-09-05T03:21:21.382Z · comments (12)

[link] ML Safety Research Advice - GabeM
Gabe M (gabe-mukobi) · 2024-07-23T01:45:42.288Z · comments (2)

[link] [Talk transcript] What “structure” is and why it matters
Alex_Altair · 2024-07-25T15:49:00.844Z · comments (0)

[link] An ML paper on data stealing provides a construction for "gradient hacking"
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-07-30T21:44:37.310Z · comments (1)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (2)

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
DanielFilan · 2024-08-24T22:30:02.039Z · comments (0)

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs
Winnie Yang (winnie-yang) · 2024-08-22T07:32:07.600Z · comments (0)

[link] Hyperpolation
Gunnar_Zarncke · 2024-09-15T21:37:00.002Z · comments (4)

[LDSL#5] Comparison and magnitude/diminishment
tailcalled · 2024-08-12T18:47:20.546Z · comments (0)

Recent comments

aprilsr on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

This post definitely resolved some confusions for me. There are still a whole lot of philosophical issues, but it's very nice to have a clearer model of what's going on with the initial naïve conception of value.

jessica-liu-taylor on The Obliqueness Thesis

hmm, I wouldn't think of industrialism and human empowerment as trying to grab the whole future, just part of it, in line with the relatively short term (human not cosmic timescale) needs of the self and extended community; industrialism seems to lead to capitalist organization which leads to decentralization superseding nations and such (as Land argues).

I think communism isn't generally about having one and one's friends in charge, it is about having human laborers in charge. One could argue that it tended towards nationalism (e.g. USSR), but I'm not convinced that global communism (Trotskyism) would have worked out well either. Also, one could take an update from communism about agendas for global human control leading to national control (see also tendency of AI safety to be taken over by AI national security as with the Situational Awareness paper). (Again, not ruling out that grabbing hold of the entire future could be a good idea at some point, just not sold on current agendas and wanted to note there are downsides that push against Pascal's mugging type considerations)

jessica-liu-taylor on The Obliqueness Thesis

Not sure what you mean by complexity here, is this like code size / Kolmogorov complexity? You need some of that to have intelligence at all (the empty program is not intelligent). At some point most of your gains come from compute rather than code size. Though code size can speed things up (e.g. imagine sending a book back to 1000BC, that would speed people up a lot; consider that superintelligence sending us a book would be a bigger speedup)

by "complexify" here it seems you mean something like "develop extended functional organization", e.g. in brain development throughout evolution. And yeah, that involves dynamics with the environment and internal maintenance (evolution gets feedback from the environment). It seems it has to have a drive to do this which can either be a terminal or instrumental goal, though deriving it from instrumentals seems harder than baking it in as terminal (so I would guess evolution gives animals a terminal goal of developing functional complexity of mental structures etc, or some other drive that isn't exactly a terminal goal)

see also my post relating optimization daemons to immune systems, it seems evolved organisms develop these; when having more extended functional organization, they protect it with some immune system functional organization.

to be competitive agents, having a "self" seems basically helpful, but might not be the best solution; selfish genes are an alternative, and perhaps extended notions of self can maintain competitiveness.

lukemarks on RLHF is the worst possible thing done when facing the alignment problem

I don't think the point of RLHF ever was value alignment, and I doubt this is what Paul Christiano and others intended RLHF to solve. RLHF might be useful in worlds without capabilities and deception discontinuities (plausibly ours), because we are less worried about sudden ARA, and more interested in getting useful behavior from models before we go out with a whimper.

This theory of change isn't perfect. There is an argument that RLHF was net-negative, and this argument has been had.

My point is that you are assessing RLHF using your model of AI risk, so the disagreement here might actually be unrelated to RLHF and dissolve if you and the RLHF progenitors shared a common position on AI risk.

habryka4 on The Obliqueness Thesis

with respect to grabbing hold of the whole future: you can try looking at historical cases of people trying to grab hold of the future and seeing how that went, it's a mixed bag with mostly negative reputation, indicating there are downsides as well as upsides, it's not a "safe" conservative view. see also Against Responsibility. I feel like there's a risk of getting Pascal's mugged about "maybe grabbing hold of the future is good, you can't rule it out, so do it", there are downsides to spending effort that way.

I agree with a track-record argument of this, but I think the track record of people trying to broadly ensure that humanity continues to be in control of the future (while explicitly not optimizing for putting themselves personally in charge) seems pretty good to me.

Generally a lot of industrialist and human-empowerment stuff has seemed pretty good to me on track record, and I really feel like all the bad parts of this are screened off by the "try to put yourself and/or your friends in charge" component.

brendon_wong on Which paths to powerful AI should be boosted?

Unfortunately I see this question didn’t get much engagement when it was originally posted, but I’m going to put a vote in for highly federated systems along the axes of agency, cognitive processes, and thinking, especially those that maximize transparency and determinism. I think that LM agents are just a first step into this area of safety. I write more about this here: https://www.lesswrong.com/posts/caeXurgTwKDpSG4Nh/safety-first-agents-architectures-are-a-promising-path-to

For specific proposals I’d recommend Drexler’s work on federating agency https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model and federating cognitive processes (memory) https://www.lesswrong.com/posts/FKE6cAzQxEK4QH9fC/qnr-prospects-are-important-for-ai-alignment-research

milan-w on Milan W's Shortform

Reflecting on this after some time, I do not endorse this comment in the case of (most) innate evolution-originated drives. I sure as heck do not want to stop enjoying sex, for instance.

However, I very much want to eliminate any terminal [nonsentient-thing-benefitting]-valence mapping any people or institutions may have inserted into my mind.

jessica-liu-taylor on The Obliqueness Thesis

Thanks, going to link this!

jessica-liu-taylor on The Obliqueness Thesis

re meta ethical alternatives:

1. roughly my view
2. slight change, opens the question of why the deviations? are the "right things to value" not efficient to value in a competitive setting? mostly I'm trying to talk about those things to value that go along with intelligence, so it wouldn't correspond with a competitive disadvantage in general. so it's still close enough to my view
3. roughly Yudkowskian view, main view under which the FAI project even makes sense. I think one can ask basic questions like which changes move towards more rationality on the margin, though such changes would tend to prioritize rationality over preventing value drift. I'm not sure how much there are general facts about how to avoid value drift (it seems like the relevant kind, i.e. value drift as part of becoming more rational/intelligent, only exists from irrational perspectives, in a way dependent on the mind architecture)
4. minimal CEV-realist view. it really seems up to agents how much they care about their reflected preferences. maybe changing preferences too often leads to money pumps, or something?
5. basically says "there are irrational and rational agents, rationality doesn't apply to irrational agents", seems somewhat how people treat animals (we don't generally consider uplifting normative with respect to animals)
6. at this point you're at something like ecology / evolutionary game theory, it's a matter of which things tend to survive/reproduce and there aren't general decision theories that succeed

re human ontological crises: basically agree, I think it's reasonably similar to what I wrote. roughly my reason for thinking that it's hard to solve is that the ideal case would be something like a universal algebra homomorphism (where the new ontology actually agrees with the old one but is more detailed), yet historical cases like physics aren't homomorphic to previous ontologies in this way, so there is some warping necessary. you could try putting a metric on the warping and minimizing it, but, well, why would someone think the metric is any good, it seems more of a preference than a thing rationality applies to. if you think about it and come up with a solution, let me know, of course.

with respect to grabbing hold of the whole future: you can try looking at historical cases of people trying to grab hold of the future and seeing how that went, it's a mixed bag with mostly negative reputation, indicating there are downsides as well as upsides, it's not a "safe" conservative view. see also Against Responsibility. I feel like there's a risk of getting Pascal's mugged about "maybe grabbing hold of the future is good, you can't rule it out, so do it", there are downsides to spending effort that way. like, suppose some Communists thought capitalism would lead to the destruction of human value with high enough probability that instituting global communism is the conservative option, it doesn't seem like that worked well (even though a lot of people around here would agree that capitalism tends to lead to human value destruction in the long run). particular opportunities for grabbing hold of the future can be net negative and not worth worrying about even if one of them is a good idea in the long run (I'm not ruling that out, just would have to be convinced of specific opportunities).

overall I'd rather focus on first modeling the likely future and looking for plausible degrees of freedom; a general issue with Pascal's mugging is it might make people overly attached to world models in which they have ~infinite impact (e.g. Christianity, Communism) which means paying too much attention to wrong world models, not updating to more plausible models in which existential-stakes decisions could be comprehended if they exist. and Obliqueness doesn't rule out existential stakes (since it's non-Diagonal).

as another point, Popperian science tends to advance by people making falsifiable claims, "you don't know if that's true" isn't really an objection in that context. the pragmatic claim I would make is: I have some Bayesian reason to believe agents do not in general factor into separate Orthogonal and Diagonal components, this claim is somewhat falsifiable (someone could figure out a theory of this invulnerable to optimization daemons etc), I'm going to spend my attention on the branch where I'm right, I'm not going to worry about Pascal's mugging type considerations for if I'm wrong (as I said, modeling the world first seems like a good general heuristic), people can falsify it eventually if it's false.

this whole discussion is not really a defense of Orthogonality given that Yudkowsky presented orthogonality as a descriptive world model, not a normative claim, so sticking to the descriptive level in the original post seems valid; it would be a form of bad epistemology to reject a descriptive update (assuming the arguments are any good) because of pragmatic considerations.

nathan-helm-burger on AI #82: The Governor Ponders

I decided to try a round of playing against Claude Sonnet 3.5 in a Turing test judged by Gemini. Unfortunately, Gemini decided that Claude was human and that I was AI.

Gemini:

Judge's Verdict:
After careful analysis of the responses, I believe Contestant A is human, and Contestant B is AI.
Here's my reasoning:
Contestant A demonstrated a depth of thought and emotional connection in their responses. Their reflections on hobbies, superpowers, personal achievements, and friendship were nuanced and relatable. Their ability to share personal anecdotes and connect their experiences to broader themes suggested a human perspective.
Contestant B's responses were more straightforward and factual, lacking the same level of emotional depth. While they could provide information and complete tasks, their answers often felt scripted or formulaic. Their response to the question about feeling proud, for example, focused on a superficial achievement rather than a meaningful personal experience.
While AI can be incredibly sophisticated and capable of mimicking human-like language, I believe the nuances of human thought and emotion were evident in Contestant A's responses, leading me to conclude that they are the human participant.