LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[question] Are UV-C Air purifiers so useful?
JohnBuridan · 2024-09-04T14:16:01.310Z · answers+comments (0)

Keeping it (less than) real: Against ℶ₂ possible people or worlds
quiet_NaN · 2024-09-13T17:29:44.915Z · comments (0)

[link] Universal dimensions of visual representation
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-28T10:38:58.396Z · comments (0)

On epistemic autonomy
sanyer (santeri-koivula) · 2024-08-31T18:50:43.377Z · comments (0)

[link] Can AI agents learn to be good?
Ram Rachum (ram@rachum.com) · 2024-08-29T14:20:04.336Z · comments (0)

Measuring Visual Sycophancy in Multimodal Models
Jaehyuk Lim (jason-l) · 2024-08-27T22:02:47.917Z · comments (0)

Just How Good Are Modern Chess Computers?
nem · 2024-09-19T18:57:21.254Z · comments (1)

[question] What do you expect AI capabilities may look like in 2028?
nonzerosum · 2024-08-23T16:59:53.007Z · answers+comments (5)

[link] Approval-Seeking ⇒ Playful Evaluation
Jonathan Moregård (JonathanMoregard) · 2024-08-28T21:03:51.244Z · comments (0)

[question] Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics?
Noosphere89 (sharmake-farah) · 2024-08-30T15:12:28.823Z · answers+comments (11)

[link] Is Redistributive Taxation Justifiable? Part 1: Do the Rich Deserve their Wealth?
Alexander de Vries (alexander-de-vries) · 2024-09-05T10:23:08.958Z · comments (20)

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Winnie Yang (winnie-yang) · 2024-08-28T08:41:38.967Z · comments (2)

One person's worth of mental energy for AI doom aversion jobs. What should I do?
Lorec · 2024-08-26T01:29:01.700Z · comments (16)

Fake Blog Posts as a Problem Solving Device
silentbob · 2024-08-31T09:22:54.513Z · comments (0)

[question] If I ask an LLM to think step by step, how big are the steps?
ryan_b · 2024-09-13T20:30:50.558Z · answers+comments (1)

[link] Checking public figures on whether they "answered the question" quick analysis from Harris/Trump debate, and a proposal
david reinstein (david-reinstein) · 2024-09-11T20:25:27.845Z · comments (4)

Denver USA - ACX Meetups Everywhere Fall 2024
Eneasz · 2024-08-29T18:40:53.332Z · comments (0)

[link] Physics of Language models (part 2.1)
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-19T16:48:32.301Z · comments (1)

Moral Trade, Impact Distributions and Large Worlds
Larks · 2024-09-20T03:45:56.273Z · comments (0)

[link] SCP Foundation - Anti memetic Division Hub
landscape_kiwi · 2024-09-15T13:40:52.691Z · comments (1)

Thirty random thoughts about AI alignment
Lysandre Terrisse · 2024-09-15T16:24:10.572Z · comments (1)

[link] Could Things Be Very Different?—How Historical Inertia Might Blind Us To Optimal Solutions
James Stephen Brown (james-brown) · 2024-09-11T09:53:07.474Z · comments (0)

[link] A Nonconstructive Existence Proof of Aligned Superintelligence
Roko · 2024-09-12T03:20:09.531Z · comments (65)

[link] Optimising under arbitrarily many constraint equations
dkl9 · 2024-09-12T14:59:28.475Z · comments (0)

Understanding Hidden Computations in Chain-of-Thought Reasoning
rokosbasilisk · 2024-08-24T16:35:03.907Z · comments (0)

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-09-17T03:52:43.269Z · comments (2)

[link] AISafety.info: What are Inductive Biases?
Algon · 2024-09-19T17:26:24.581Z · comments (0)

Halifax Canada - ACX Meetups Everywhere Fall 2024
interstice · 2024-08-29T18:39:12.490Z · comments (0)

[link] [Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)
Fernando Avalos (fernando-avalos) · 2024-09-09T03:33:53.548Z · comments (1)

[link] Metaculus's 'Minitaculus' Experiments — Collaborate With Us
ChristianWilliams · 2024-08-26T20:44:32.125Z · comments (0)

[link] Redundant Attention Heads in Large Language Models For In Context Learning
skunnavakkam · 2024-09-01T20:08:48.963Z · comments (0)

[link] Contra Yudkowsky on 2-4-6 Game Difficulty Explanations
Josh Hickman (josh-hickman) · 2024-09-08T16:13:33.187Z · comments (1)

Inquisitive vs. adversarial rationality
gb (ghb) · 2024-09-18T13:50:09.198Z · comments (5)

Forever Leaders
Justice Howard (justice-howard) · 2024-09-14T20:55:39.095Z · comments (9)

Budapest Hungary - ACX Meetups Everywhere Fall 2024
Timothy Underwood (timothy-underwood-1) · 2024-08-29T18:37:41.313Z · comments (0)

[question] Can subjunctive dependence emerge from a simplicity prior?
Daniel C (harper-owen) · 2024-09-16T12:39:35.543Z · answers+comments (0)

A gentle introduction to sparse autoencoders
Nick Jiang (nick-jiang) · 2024-09-02T18:11:47.086Z · comments (0)

[link] AI Safety Newsletter #41: The Next Generation of Compute Scale Plus, Ranking Models by Susceptibility to Jailbreaking, and Machine Ethics
Corin Katzke (corin-katzke) · 2024-09-11T19:14:08.274Z · comments (1)

Grass Valley USA - ACX Meetups Everywhere Fall 2024
Raelifin · 2024-08-29T18:39:57.229Z · comments (0)

Piling bounded arguments
momom2 (amaury-lorin) · 2024-09-19T22:27:41.534Z · comments (0)

[question] How do we know dreams aren't real?
Logan Zoellner (logan-zoellner) · 2024-08-22T12:41:57.380Z · answers+comments (31)

Democracy beyond majoritarianism
Arturo Macias (arturo-macias) · 2024-09-03T15:10:56.284Z · comments (2)

[link] Universal basic income isn’t always AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T15:39:18.389Z · comments (3)

Longevity and the Mind
George3d6 · 2024-09-16T09:43:09.700Z · comments (2)

Meta: On viewing the latest LW posts
quiet_NaN · 2024-08-25T19:31:39.008Z · comments (2)

Tbilisi Georgia - ACX Meetups Everywhere Fall 2024
Dmitrii (dmitrii) · 2024-08-29T18:36:43.223Z · comments (4)

[question] How do you finish your tasks faster?
Cipolla · 2024-08-21T20:01:41.306Z · answers+comments (2)

Crafting Polysemantic Transformer Benchmarks with Known Circuits
Evan Anders (evan-anders) · 2024-08-23T22:03:15.288Z · comments (0)

Interest poll: A time-waster blocker for desktop Linux programs
nahoj · 2024-08-22T20:44:04.479Z · comments (5)

Tokyo (日本語) Japan - ACX Meetups Everywhere Fall 2024
Emi (emi-2) · 2024-08-29T18:35:28.013Z · comments (0)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

gb on What you know when you know nothing

We can only make that inference about conjunctions if we know that the statements are independent. Since (by assumption) we don’t know anything about said world, we don’t know that either, so the conclusion does not follow.

johannes-c-mayer on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

reward is the evidence from which we learn about our values

A sadist might feel good each time they hurt somebody. I am pretty sure it is possible for a sadist to exist who does not endorse hurting people, meaning they feel good if they hurt people, but they avoid it nonetheless.

So to what extent is hurting people a value? It's like the sadist's brain tries to tell them that they ought to want to hurt people, but they don't want to. Intuitively the "they don't want to" seems to be the value.

mateusz-baginski on Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.

It's better but still not quite. When you play on two levels, sometimes the best strategy involves a pair of (level 1 and 2) substrategies that are seemingly opposites of each other. I don't think there's anything hypocritical about that.

Similarly, hedging is not hypocrisy.

cousin_it on Slave Morality: A place for every man and every man in his place

I think Nietzsche would agree that "slave morality" originated with Jesus. The main new idea that Jesus brought as a moral philosopher was compassion, feeling for the other person. It's pretty find to hard in earlier sources, for example the heroes of the Iliad hurt weaker people without a second thought.

To me it feels obvious that the idea of compassion needs to exist, and needs to have force. Because otherwise we'd have a human society operating by the laws of the natural world, and if you look at what animals do to each other, there's no limit to how bad things can get.

Can compassion also become a tool of power and abuse? Sure. But let's not go back to a world without compassion, please.

xpym on A Nonconstructive Existence Proof of Aligned Superintelligence

This isn’t really a problem with alignment

I'd rather put it that resolving that problem is a prerequisite for the notion of "alignment problem" to be meaningful in the first place. It's not technically a contradiction to have an "aligned" superintelligence that does nothing, but clearly nobody would in practice be satisfied with that.

quetzal_rainbow on What's the Deal with Logical Uncertainty?

The reason why logical uncertainty was brought up in the first place is decision theory, to make crisp formal expression for intuitive "I cooperate with you conditional on you cooperating with me", where "you cooperating with me" is result of analysis of probability distribution over possible algorithms which control actions of your opponent and you can't actually run these algorithms due to computational constraints, and you want to do all this reasoning in non-arbitrary ways.

devrandom on Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety

There seem to be substantial problems with low probability events, coherent predictions over time, short term events, probabilities adding up to more than 100%, etc

An probabilistic oracle being inconsistent is completely besides the point. If I have a probabilistic oracle that has high accuracy but is sometimes inconsistent, I can just post-process the predictions to force them into a consistent format. For example, I can normalize the probabilities to 100%.

The economic value is in the overall accuracy. Being consistent is a cosmetic consideration.

cata on AI Safety is Dropping the Ball on Clown Attacks

This post was difficult to take seriously when I read it but the "clown attack" idea very much stuck with me.

tailcalled on The case for a negative alignment tax

With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)

I would like to see the case for catastrophic misuse being an xrisk, since it mostly seems like business-as-usual for technological development (you get more capabilities which means more good stuff but also more bad stuff).

thomascederborg on The case for more Alignment Target Analysis (ATA)

The proposed research project would indeed be focused on a certain type of alignment target. For example proposals along the lines of PCEV. But not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) would also be a possible notation. I will adopt this notation for the rest of this comment.

The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:

There is a big difference between proposing an alignment target on the one hand. And pointing out problems with alignment targets on the other hand. For example: it is entirely possible to reduce risks from a dangerous alignment target, without having any idea how one might find a good alignment target. One can actually reduce risks without having any idea, what it even means for an alignment target to be a good alignment target.

The feature of PCEV mentioned in the post is an example of this. The threat posed by PCEV has presumably been mostly removed. This did not require anything along the lines of an answer. The analysis of Condorcet AI (CAI) is similar. The analysis simply describes a feature shared by all CAI proposals (the feature that a barely caring solid majority can do whatever they want with everyone else). Pointing this out presumably reduces the probability that a CAI will be launched by designers that never considered this feature. All claims made in the post about a VATA research project being tractable is referring to this type of risk mitigation being tractable. There is definitely no claim that a VATA research project can (i): find a good alignment target, (ii): somehow verify that this alignment target does not have any hidden flaws, and (iii): convince whoever is in charge to launch this target.

One can also go a bit beyond analysis of individual proposals, even if one does not have any idea how to find an answer. One can mitigate risk by describing necessary features (for example along the lines of this necessary Membrane formalism feature [LW · GW]). This reduces risks from all proposal that clearly does not have such a necessary feature.

(and just to be extra clear: the post is not arguing that launching a Sovereign AI is a good idea. The post is assuming an audience that agree that it is possible that a Sovereign AI might be launched. And then the post is arguing that if this does happen, then there is a risk that such a Sovereign AI project will be aiming at a bad value alignment target. The post then further argues that this particular risk can be reduced by doing VATA)

Regarding people being skeptical of Value Alignment Target proposals:

If someone ends up with the capability to launch a Sovereign AI, then I certainly hope that they will be skeptical of proposed Value Alignment Targets. Such skepticism can avert catastrophe even if the proposed alignment target has a flaw that no one has noticed.

The issue is that a situation might arise where (i): someone has the ability to launch a Sovereign AI, (ii): there exists a Sovereign AI proposal that no one can find any flaws with, and (iii): there is a time crunch.

Regarding the possibility that there exists people trying to find an answer without telling anyone:

I'm not sure how to estimate the probability of this. From a risk mitigation standpoint, this is certainly not the optimal way of doing things (if a proposed alignment target has a flaw, then it will be a lot easier to notice that flaw, if the proposal is not kept secret). I really don't think that this is a reasonable way of doing things. But I think that you have a point. If Bob is about to launch an AI Sovereign with some critical flaw that would lead to some horrific outcome. Then secretly working Steve might be able to notice this flaw. And if Bob is just about to launch his AI, and speaking up is the only way for Steve to prevent Bob from causing a catastrophe, then Steve will presumably speak up. In other words: the existence of people like secretly working Steve would indeed offer some level of protection. It would mean that the lack of people with relevant intuitions is not as bad as it appears (and when allocating resources, this possibility would indeed point to less resources for VATA). But I think that what is really needed is at least some people doing VATA with a clear risk mitigation focus. And discussing their finding with each other. This does not appear to exist.

Regarding other risks, and the issue that findings might be ignored:

A VATA research project would not help with misalignment. In other words: even if the field of VATA was somehow completely solved tomorrow, AI could still lead to extinction. So the proposed research project is definitely not dealing with all risks. The point of the post is that the field of VATA is basically empty. I don't know of anyone that is doing VATA full time with a clear risk mitigation focus. And I don't know if you personally should switch to focusing on VATA. It would not surprise me at all if some other project is a better use of your time. It just seems like there should exist some form of VATA research project with a clear risk mitigation focus.

It is also possible that a VATA finding will be completely ignored (by leading labs, or by governments, or by someone else). It is possible that a Sovereign AI will be launched, leading to catastrophe, even though it has a known flaw (because the people launching it is just refusing to listen). But finding a flaw at least means that it is possible to avert catastrophe.

PS:

Thanks for the links! I will look into this. (I think that there are many fields of research that are relevant to VATA. It's just that one has to be careful. A concept can behave very differently when it is transferred to the AI context)