LessWrong 2.0 Reader

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (21)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (1)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (6)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

[LDSL#2] Latent variable models, network models, and linear diffusion of sparse lognormals
tailcalled · 2024-08-09T19:57:56.122Z · comments (2)

[link] Foundations - Why Britain has stagnated [crosspost]
Nathan Young · 2024-09-23T10:43:20.411Z · comments (1)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)

[link] [Talk transcript] What “structure” is and why it matters
Alex_Altair · 2024-07-25T15:49:00.844Z · comments (0)

Rashomon - A newsbetting site
ideasthete · 2024-10-15T18:15:02.476Z · comments (8)

Would you benefit from, or object to, a page with LW users' reacts?
Raemon · 2024-08-20T16:35:47.568Z · comments (6)

AI #77: A Few Upgrades
Zvi · 2024-08-20T00:20:09.717Z · comments (3)

The Garden of Eden
Alexander Turok · 2024-07-22T16:07:42.509Z · comments (2)

[link] The Offense-Defense Balance of Gene Drives
Maxwell Tabarrok (maxwell-tabarrok) · 2024-09-27T16:47:25.976Z · comments (1)

AXRP Episode 34 - AI Evaluations with Beth Barnes
DanielFilan · 2024-07-28T03:30:07.192Z · comments (0)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (54)

[question] Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic?
Dalcy (Darcy) · 2024-08-16T04:16:23.159Z · answers+comments (6)

Recent comments

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Steve will never become aware of what Bob is doing to OldSteve

But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?

I suppose perhaps Bob could create OldEve. Eve is in a similar but not identical point in personality space to Steve and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob's Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn't trigger your torture detection test.

This kind of 'marginal case of consensual torture' has popped up in other similar discussions. E.g. In Yvain's (Scott Alexander's) article on Archipelago there's this section:

"""A child who is abused may be too young to know that escape is an option, or may be brainwashed into thinking they are evil, or guilted into believing they are betraying their families to opt out. And although there is no perfect, elegant solution here, the practical solution is that UniGov enforces some pretty strict laws on child-rearing, and every child, no matter what other education they receive, also has to receive a class taught by a UniGov representative in which they learn about the other communities in the Archipelago, receive a basic non-brainwashed view of the world, and are given directions to their nearest UniGov representative who they can give their opt-out request to"""

So Scott Alexander's solution to OldSteve is that OldSteve must get a non-brainwashed education about how ELYSIUM/Archipelago works and be given the option to opt out.

I think the issue here is that "people who unwisely consent to torture even after being told about it" and "people who are willing and consenting submissives" is not actually a hard boundary.

radford-neal-1 on Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong

But the whole point of using probability to express uncertainty about the world is that the probabilities do not depend on the purpose.

If there are N possible observations, and M binary choices that you need to make, then a direct strategy for how to respond to an observation requires a table of size NxM, giving the actions to take for each possible observation. And you somehow have to learn this table.

In contrast, if the M choices all depend on one binary state of the world, you just need to have a table of probabilities of that state for each of the N observations, and a table of the utilities for the four action/state combinations for the M decisions - which have size proportional to N+M, much smaller than NxM for large N and M. You only need to learn the N probabilities (perhaps the utilities are givens).

And in reality, trying to make decisions without probabilities is even worse than it seems from this, since the set of decisions you may need to make is indefinitely large, and the number of possible observations is enormous. But avoiding having to make decisions by a direct observation->action table requires that probabilities have meaning independent of what decision you're considering at the moment. You can't just say that it could be 1/2, or could be 1/3...
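A minimal sketch of the counting argument above, assuming a binary state and two actions per decision. The function names, probabilities, and utilities are made up for illustration and are not from the original comment. It compares the size of a direct observation->action table with the factored form (one probability per observation plus four utilities per decision), and shows that the direct table can be derived from the factored form by maximizing expected utility.

```python
from typing import List, Tuple

# utility[action][state], with action and state each in {0, 1}
Utility = Tuple[Tuple[float, float], Tuple[float, float]]


def direct_table_size(n_obs: int, n_decisions: int) -> int:
    """Entries in a direct observation->action table: one per (observation, decision)."""
    return n_obs * n_decisions


def factored_table_size(n_obs: int, n_decisions: int) -> int:
    """One probability per observation, plus 4 utilities per binary decision."""
    return n_obs + 4 * n_decisions


def choose_action(p_state1: float, utility: Utility) -> int:
    """Pick the action with higher expected utility (ties go to action 0)."""
    expected = [
        (1 - p_state1) * utility[a][0] + p_state1 * utility[a][1]
        for a in (0, 1)
    ]
    return 1 if expected[1] > expected[0] else 0


def build_direct_table(probs: List[float], utilities: List[Utility]) -> List[List[int]]:
    """Recover the full N x M action table from N probabilities and M utility tables."""
    return [[choose_action(p, u) for u in utilities] for p in probs]


# Hypothetical numbers: 3 observations, 2 decisions.
probs = [0.1, 0.5, 0.9]  # P(state = 1 | observation)
utilities = [
    ((0.0, 0.0), (-1.0, 1.0)),  # decision 0: act only if state 1 is more likely than not
    ((0.0, 0.0), (-1.0, 3.0)),  # decision 1: larger payoff, so act at lower probability
]

table = build_direct_table(probs, utilities)
print(table)
print(direct_table_size(1000, 1000), factored_table_size(1000, 1000))
```

The same probability (0.5 here) leads to different actions for the two decisions, which is only possible to express compactly if the probability means the same thing regardless of which decision is being made.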

austin-chen on Start an Upper-Room UV Installation Company?

Another similar company I want someone to start is one that produces inexpensive, self-installable far UV lamps. My understanding is that far UV is safe to shine directly on humans (as opposed to standard UV), meaning that you don't need high ceilings or special technicians to install the lamp. However, it's a much newer technology with not very much adoption or testing, I think because of a combination of principal/agent problems and price; see this post on blockers to Far UV adoption.

Beacon does produce these $800 lamps, which are consumer friendly-ish. I bought one for the Manifold office, but due to a variety of trivial inconveniences (figuring out where to mount it; the mobile app not syncing with my phone) it's still not active. I think a competent operator in this space could make a device that's somewhat cheaper & easier to use, and hit a tipping point for widespread/viral adoption.

(If you or someone you know is interested in doing this and is looking for funding, reach out to me at austin@manifund.org!)

deepthoughtlife on LLMs can learn about themselves by introspection

I obviously tend to go on at length about things when I analyze them. I'm glad when that's useful.

I had heard that OpenAI models aren't deterministic even at the lowest randomness, which I believe is probably due to optimizations for speed like how in image generation models (which I am more familiar with) the use of optimizers like xformers throws away a little correctness and determinism for significant improvements in resource usage. I don't know what OpenAI uses to run these models (I assume they have their own custom hardware?), but I'm pretty sure that it is the same reason. I definitely agree that randomness causes a cap on how well it could possibly do. On that point, could you determine the amount of indeterminacy in the system and put the maximum possible on your graphs for their models?

One thing I don't know if I got across in my comment, based on the response, is that I think if a model truly had introspective abilities to a high degree, it would notice that the basis of the result for such a question should be the same as what its own process for the non-hypothetical comes up with. If it had introspection, it would probably use introspection as its default guess for both its own hypothetical behavior and that of any model (in people, introspection is constantly used as a minor or sometimes major component of problem solving). Thus it would notice when its introspection got perfect scores and become very heavily dependent on it for this type of task, which is why I would expect its results to really just be 'run the query' for the hypothetical too.

An important point I perhaps should have mentioned originally: I think that the 'single forward pass' thing is in fact a huge problem for the idea of real introspection, since I believe introspection is a recursive task. You can perhaps do a single 'moment' of introspection on a single forward pass, but I'm not sure I'd even call that real introspection. Real introspection involves the ability to introspect about your introspection. Much like consciousness, it is very meta. Of course, the actual recursive depth of introspection at high fidelity usually isn't very deep, but we tell ourselves stories about our stories in an almost infinitely deep manner (for instance, people have a 'life story' they think about and alter throughout their lives, and use their current story as an input basically always).

There are, of course, other architectures where that isn't a limitation, but we hardly use them at all (talking for the world at large, I'm sure there are still AI researchers working on such architectures). Honestly, I don't understand why they don't just use transformers in a loop with either its own ability to say when it has reached the end or with counters like 'pass 1 of 7'. (If computation is the concern, they could obviously just make it smaller.) They obviously used to have such recursive architectures and everyone in the field would be familiar with them (as are many laymen who are just kind of interested in how things work). I assume that means that people have tried and didn't find it useful enough to focus on, but I think it could help a lot with this kind of thing. (Image generation models actually kind of do this with diffusion, but they have a little extra unnecessary code in between, which are the actual diffusion parts.) I don't actually know why these architectures were abandoned besides there being a new shiny (transformers) though so there may be an obvious reason.

I would agree with you that these results do make it a more interesting research direction than other results would have, and it certainly seems worth "someone's" time to find out how it goes. I think a lot of what you are hoping to get out of it will fail, hopefully in ways that will be obvious to people, but it might fail in interesting ways.

I would agree that it is possible that introspection training is simply eliciting a latent capability that wasn't used in the initial training (though it would perhaps be interesting to train it on introspection earlier in its training and then simply continue with the normal training and see how that goes). I just think that finding a way to elicit it without retraining would be much better proof of its existence as a capability rather than as an artifact of goodharting things. I am often pretty skeptical about results across most/all fields where you can't do logical proofs due to this. Of course, even just finding the right prompt is vulnerable to this issue.

I think I don't agree that there being a cap on how much training is helpful necessarily indicates it is elicitation, but I don't really have a coherent argument on the matter. It just doesn't sound right to me.

The point you said you didn't understand was meant to point out (apparently unsuccessfully) that you use a different prompt for training than checking and it might also be worthwhile to train it on that style of prompting but with unrelated content. (Not that I know how you'd fit that style of prompting with a different style of content mind you.)

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority.

If there is an agent that controls 55% of the resources in the universe and is prepared to use 90% of that 55% to kill/destroy everyone else, then assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built.

And since they control 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone who was trying to actually create ELYSIUM, they would likely succeed and ELYSIUM wouldn't happen.

Re:threats, see my other comment.

roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind

Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefer death to compliance. Doing this costs them basically nothing.

This assumes that threats are allowed. If you allow threats within your system you are losing out on most of the value of trying to create an artificial utopia because you will recreate most of the bad dynamics of real history which ultimately revolve around threats of force in order to acquire resources. So, the ability to prevent entities from issuing threats that they then do not follow through on is crucial.

Improving the equilibria of a game is often about removing strategic options; in this case the goal is to remove the option of running what is essentially organized crime.

In the real world there are various mechanisms that prevent organized crime and protection rackets. If you threaten to use force on someone in exchange for resources, the mere threat of force is itself illegal at least within most countries and is punished by a loss of resources far greater than the threat could win.

People can still engage in various forms of protest that are mutually destructive of resources (AKA civil disobedience).

The ability to have civil disobedience without protection rackets does seem kind of crucial.

zach-stein-perlman on LLMs can learn about themselves by introspection

I'm confused/skeptical about this being relevant, I thought honesty is orthogonal to whether the model has access to its mental states.

eliezer_yudkowsky on The Hidden Complexity of Wishes

Your distinction between "outer alignment" and "inner alignment" is both ahistorical and unYudkowskian. It was invented years after this post was written, by someone who wasn't me; and though I've sometimes used the terms in occasions where they seem to fit unambiguously, it's not something I see as a clear ontological division, especially if you're talking about questions like "If we own the following kind of blackbox, would alignment get any easier?" which on my view breaks that ontology. So I strongly reject your frame that this post was "clearly portraying an outer alignment problem" and can be criticized on those grounds by you; that is anachronistic.

You are now dragging in a very large number of further inferences about "what I meant", and other implications that you think this post has, which are about Christiano-style proposals that were developed years after this post. I have disagreements with those, many disagreements. But it is definitely not what this post is about, one way or another, because this post predates Christiano being on the scene.

What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want. This point is true! If you then want to take in a bunch of anachronistic ideas developed later, and claim (wrongly imo) that this renders irrelevant the simple truth of what this post actually literally says, that would be a separate conversation. But if you're doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.

michael-roe on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

https://www.bbc.co.uk/news/technology-67012224

tristantrim on How I'd like alignment to get done (as of 2024-10-18)

Thanks : )