LessWrong 2.0 Reader

My AI Predictions 2023 - 2026
HunterJay · 2023-10-16T00:50:52.968Z · comments (28)

"Epistemic range of motion" and LessWrong moderation
habryka (habryka4) · 2023-11-27T21:58:40.834Z · comments (3)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

[link] Results from an Adversarial Collaboration on AI Risk (FRI)
Josh Rosenberg (josh-rosenberg) · 2024-03-11T20:00:24.642Z · comments (3)

The Sense Of Physical Necessity: A Naturalism Demo (Introduction)
LoganStrohl (BrienneYudkowsky) · 2024-02-24T02:56:31.458Z · comments (1)

[link] The case for aftermarket blind spot mirrors
Brendan Long (korin43) · 2023-10-09T19:30:22.843Z · comments (14)

[link] Understanding strategic deception and deceptive alignment
Marius Hobbhahn (marius-hobbhahn) · 2023-09-25T16:27:47.357Z · comments (16)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (7)

LessOnline Festival Updates Thread
Ben Pace (Benito) · 2024-04-18T21:55:08.003Z · comments (26)

[link] Are There Examples of Overhang for Other Technologies?
Jeffrey Heninger (jeffrey-heninger) · 2023-12-13T21:48:08.954Z · comments (50)

New paper shows truthfulness & instruction-following don't generalize by default
joshc (joshua-clymer) · 2023-11-19T19:27:30.735Z · comments (0)

AI #48: Exponentials in Geometry
Zvi · 2024-01-18T14:20:07.869Z · comments (9)

Measuring Coherence of Policies in Toy Environments
dx26 (dylan-xu) · 2024-03-18T17:59:08.118Z · comments (9)

What's next for the field of Agent Foundations?
Nora_Ammann · 2023-11-30T17:55:13.982Z · comments (23)

Approaching Human-Level Forecasting with Language Models
Fred Zhang (fred-zhang) · 2024-02-29T22:36:34.012Z · comments (6)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Lidor Banuel Dabbah · 2024-07-19T20:32:15.095Z · comments (6)

All About Concave and Convex Agents
mako yass (MakoYass) · 2024-03-24T21:37:17.922Z · comments (23)

[link] Linkpost: Surely you can be serious
kave · 2024-07-18T22:18:09.271Z · comments (7)

Thoughts on SB-1047
ryan_greenblatt · 2024-05-29T23:26:14.392Z · comments (1)

On Frequentism and Bayesian Dogma
DanielFilan · 2023-10-15T22:23:10.747Z · comments (27)

Understanding SAE Features with the Logit Lens
Joseph Bloom (Jbloom) · 2024-03-11T00:16:57.429Z · comments (0)

What is it to solve the alignment problem?
Joe Carlsmith (joekc) · 2024-08-24T21:19:34.280Z · comments (16)

[link] More people getting into AI safety should do a PhD
AdamGleave · 2024-03-14T22:14:48.855Z · comments (24)

Does AI risk “other” the AIs?
Joe Carlsmith (joekc) · 2024-01-09T17:51:47.020Z · comments (3)

0th Person and 1st Person Logic
Adele Lopez (adele-lopez-1) · 2024-03-10T00:56:14.446Z · comments (28)

What is "True Love"?
johnswentworth · 2024-08-18T16:05:47.358Z · comments (9)

[link] shoes with springs
bhauth · 2023-12-30T21:46:55.319Z · comments (6)

Bids To Defer On Value Judgements
johnswentworth · 2023-09-29T17:07:25.834Z · comments (6)

The Problem With the Word ‘Alignment’
peligrietzer · 2024-05-21T03:48:26.983Z · comments (8)

Memorizing weak examples can elicit strong behavior out of password-locked models
Fabien Roger (Fabien) · 2024-06-06T23:54:25.167Z · comments (5)

[link] Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Adrià Garriga-alonso (rhaps0dy) · 2024-07-25T22:00:55.398Z · comments (8)

[link] Sam Altman, Greg Brockman and others from OpenAI join Microsoft
Ozyrus · 2023-11-20T08:23:00.791Z · comments (15)

[link] Announcing the $200k EA Community Choice
Austin Chen (austin-chen) · 2024-08-14T00:39:37.350Z · comments (8)

A hermeneutic net for agency
TsviBT · 2024-01-01T08:06:30.289Z · comments (4)

Apply to ESPR & PAIR, Rationality and AI Camps for Ages 16-21
Anna Gajdova (anna-gajdova) · 2024-05-03T12:36:37.610Z · comments (5)

[link] "Why I Write" by George Orwell (1946)
Arjun Panickssery (arjun-panickssery) · 2024-04-25T16:02:28.668Z · comments (2)

The LessWrong 2022 Review: Review Phase
RobertM (T3t) · 2023-12-22T03:23:49.635Z · comments (7)

Paper out now on creatine and cognitive performance
Fabienne · 2023-11-26T10:58:29.745Z · comments (2)

[link] microwave drilling is impractical
bhauth · 2024-06-12T22:16:00.199Z · comments (14)

How you can help pass important AI legislation with 10 minutes of effort
ThomasW · 2024-09-14T22:10:50.386Z · comments (2)

Managing catastrophic misuse without robust AIs
ryan_greenblatt · 2024-01-16T17:27:31.112Z · comments (17)

[link] Against Nonlinear (Thing Of Things)
tailcalled · 2024-01-18T21:40:00.369Z · comments (18)

Woods’ new preprint on object permanence
Steven Byrnes (steve2152) · 2024-03-07T21:29:57.738Z · comments (1)

D&D.Sci: The Mad Tyrant's Pet Turtles
abstractapplic · 2024-03-29T16:22:13.732Z · comments (18)

[link] Talk: "AI Would Be A Lot Less Alarming If We Understood Agents"
johnswentworth · 2023-12-17T23:46:32.814Z · comments (3)

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
robertzk (Technoguyrob) · 2024-03-06T05:03:09.639Z · comments (0)

On the Latest TikTok Bill
Zvi · 2024-03-13T18:50:05.398Z · comments (7)

On Lex Fridman’s Second Podcast with Altman
Zvi · 2024-03-25T12:20:08.780Z · comments (10)

[link] Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Scott Emmons · 2023-09-20T15:23:48.898Z · comments (9)

[link] Congressional Insider Trading
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-30T13:32:57.264Z · comments (6)

Recent comments

richard_kennaway on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

That is just replacing the idea of fixed values with a fixed utility function. But it is just as changeable whatever you call it.

Show me your utility function before you were born.

tailcalled on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

In Bayesian decision theory, there's the distinction between expected utility, which changes as one learns about the environment, and actual utility, which does not. Under this frame, I'd be inclined to round you off to using the words "values"/"liking"/etc. to refer to expected utility. Would you agree with that? If not, why not?

It might be tempting to round the OP off to use the word "values"/"ought" to refer to actual utility, but the details of that are kind of awkward at the edges so I would hold off on that.

mnephisto on Vilnius – ACX Meetups Everywhere Fall 2024

The discord links seem to time out. Here's a new one: https://discord.gg/D5ECfZ47

I'll be wearing a purple 'Roboflow' hat and holding an ACX sign. If you can't find us, I'll be routinely checking the discord and my email on the day (Linaskondrackis@gmail.com); feel free to ping me if you're around.

keltan on The Best Lay Argument is not a Simple English Yud Essay

I like what you’re trying to do here. I think this is important work.

I’m a bit confused about what you mean by Layperson though? These are good for the ‘everyday’ above average intelligence ‘switched on’ type of individual.

But that is not what I imagine a Layperson as. I interact regularly with ~100 people. (For context, I am a Drama Teacher and Trivia Host)

I thought about how many I predict could understand these examples, given 20 seconds of their attention. I thought of 10 people. The other 90% would fall into a few other categories that all end with them not being more knowledgeable after coming across the text.

But am I confused? Was that 90% not the target audience?

richard_kennaway on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

The core claim in this post is that our brains model the world as though there's a thing called "our values", and tries to learn about those values in the usual epistemic way.

I find that a very strange idea, as strange as Plato’s Socrates’ parallel idea that learning is not the acquisition of something new, but recollection of what one had forgotten.

If I try X, anticipating that it will be an excellent experience, and find it disappointing, I have not learned something about my values, but about X.

I have never eaten escamoles. If I try them, what I will discover is what they are like to eat. If I like them, did I always like them? That is an unheard-falling-trees question.

If I value a thing at one period of life and turn away from it later, I have not discovered something about my values. My values have changed. In the case of the teenager we call this process “maturing”. Wine maturing in a barrel is not becoming what it always was, but simply becoming, according to how the winemaker conducts the process.

But people have this persistent illusion that how they are today is how they always were and always will be, and their mood of the moment is their fundamental nature, despite the evidence of their own memory.

tailcalled on tailcalled's Shortform

At a human level, the counts for each type of atom are basically always conserved too, so it's not just a question of why not momentum but also a question of why not moles of hydrogen, moles of carbon, moles of oxygen, moles of nitrogen, moles of silicon, moles of iron, etc.

I guess for momentum in particular, it seems reasonable that it wouldn't be useful in a thermodynamics-style model, because things would whoosh away too much (unless you're dealing with some sort of flow? Idk). A formalization or refutation of this intuition would be somewhat neat, but what I'd wonder about more is whether one could replace the energy-first formulations of quantum mechanics with momentum-first formulations.

directedevolution on AllAmericanBreakfast's Shortform

Why don't more people seek out and use talent scouts/headhunters? If the ghost jobs phenomenon is substantial, that's a perfect use case. Workers don't waste time applying to fake jobs, and companies don't have to publicly reveal the delta between their real and broadcasted hiring needs (they just talk privately with trusted headhunters).

Are there not enough headhunters? Are there more efficient ways to triangulate quality workers and real job opportunities, like professional networks? Are ghost jobs not that big of a deal? Do people in fact use headhunters quite a lot?

ape-in-the-coat on What's the Deal with Logical Uncertainty?

I think the simple solution is to not talk about logical tautologies and contradictions when expressing the Kolmogorov axioms for a theory of subjective Bayesianism. Instead talk about what we actually know a priori, not about tautologies which we merely could know a priori (if we were logically omniscient).

Yes, absolutely. When I apply probability theory it should represent my state of knowledge, not the state of knowledge of some logically omniscient being. To me this seems such an obvious thing that I struggle to understand why it's still not a standard approach.

So are there some hidden paradoxes in such an approach that I just do not see yet? Or maybe some issues with the formalization of the axioms?

tamsin-leake on Shortform

(oops, this ended up being fairly long-winded! hope you don't mind. feel free to ask for further clarifications.)

There's a bunch of things wrong with your description, so I'll first try to rewrite it in my own words, but still as close to the way you wrote it (so as to try to bridge the gap to your ontology) as possible. Note that I might post QACI 2 somewhat soon, which simplifies a bunch of QACI by locating the user as {whatever is interacting with the computer the AI is running on} rather than by using a beacon.

A first pass is to correct your description to the following:

1. We find a competent honourable human at a particular point in time , like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 1GB secret key, large enough that in counterfactuals it could replace with an entire snapshot of . We also give them the ability to express a 1GB output, e.g. by writing a 1GB key somewhere which is somehow "signed" as the only . This is part of $H$: $H$ is not just the human being queried at a particular point in time, it's also the human producing an answer in some way. So $H$ is a function from 1GB bitstring to 1GB bitstring. We define $H^{+}$ as $H$, followed by whichever new process $H$ describes in its output, typically another instance of $H$ except with a different 1GB payload.
2. We want a model $M$ of the agent $H^{+}$. In QACI, we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^{+}$ after feeding them a bunch of data about the world and the secret key.
3. We then ask $M$ the question $q$, "What's the best utility-function-over-policies to maximise?" to get a utility function $U : (O \times A)^{*} \to R$. We then ask our Solomonoff-like ideal reasoner for their best guess about which action $A$ maximizes $U$.

Indeed, as you ask in question 3, in this description there's not really a reason to make step 3 an extra thing. The important thing to notice here is that model $M$ might get pretty good, but it'll still have uncertainty.

When you say "we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^{+}$ ", you're implying that — positing U(M,A) to be the function that says how much utility the utility function returned by model M attributes to action A (in the current history-so-far) — we do something like:

  let M ← oracle(argmax { for model M } 𝔼 { over uncertainty } P(M))
  let A ← oracle(argmax { for action A } U(M, A))
  perform(A)

Indeed, in this scenario, the second line is fairly redundant.

The reason we ask for a utility function is because we want to get a utility function within the counterfactual — we don't want to collapse the uncertainty with an argmax before extracting a utility function, but after. That way, we can do expected-given-uncertainty utility maximization over the full distribution of model-hypotheses, rather than over our best guess about $M$ . We do:

  let A ← oracle(argmax { for A } 𝔼 { for M, over uncertainty } P(M) · U(M, A))
  perform(A)

That is, we ask our ideal reasoner (oracle) for the action with the best utility given uncertainty — not just logical uncertainty, but also uncertainty about which $M$ . This contrasts with what you describe, in which we first pick the most probable $M$ and then calculate the action with the best utility according only to that most-probable pick.
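The contrast between the two procedures above can be sketched with a toy example. This is only an illustration under made-up assumptions: the three model hypotheses, their probabilities `P`, the utility table `U`, and both function names are hypothetical, and a plain `max` over a finite list stands in for the ideal reasoner (oracle).

```python
# Toy illustration only: every number and name here is invented for the sketch.
# P[M] is the probability assigned to model hypothesis M.
P = {"M1": 0.5, "M2": 0.3, "M3": 0.2}

# U[M][A] is the utility that the utility function returned by M gives action A.
U = {
    "M1": {"a": 1.0, "b": 0.9},
    "M2": {"a": -5.0, "b": 0.8},
    "M3": {"a": -5.0, "b": 0.7},
}
ACTIONS = ["a", "b"]


def act_on_best_guess(P, U, actions):
    """Collapse uncertainty first: pick the most probable M, then argmax over A."""
    best_model = max(P, key=P.get)
    return max(actions, key=lambda a: U[best_model][a])


def act_on_expected_utility(P, U, actions):
    """Keep uncertainty: argmax over A of the sum over M of P(M) * U(M, A)."""
    def expected(a):
        return sum(P[m] * U[m][a] for m in P)
    return max(actions, key=expected)


print(act_on_best_guess(P, U, ACTIONS))        # uses only M1's utility function
print(act_on_expected_utility(P, U, ACTIONS))  # weighs all three hypotheses
```

In this toy table the most probable hypothesis slightly prefers action `a`, while the other two hypotheses rate `a` as very bad, so the two procedures disagree: the best-guess version picks `a` and the expected-utility version picks `b`.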

To answer the rest of your questions:

Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?

Unclear! I'm not familiar enough with IDA, and I've bounced off explanations for it I've seen in the past. QACI doesn't feel to me like it particularly involves the concepts of distillation or amplification, but I guess it does involve the concept of iteration, sure. But I don't get the thing called IDA.

Why not replace Step 1 with Strong HCH or some other amplification scheme?

It's unclear to me how one would design an amplification scheme; see concerns of the general shape expressed here [LW · GW]. The thing I like about my step 1 is that the setup of the QACI loop (well, really, graph (well, really, arbitrary computation, but most of the time the user will probably just call themself in sequence)) doesn't involve any AI at all: you could go back in time before the industrial revolution and explain the core QACI idea and it would make sense assuming time-travelling-messages magic, and the magic wouldn't have to do any extrapolating. Just tell someone the idea is that they could send a message to {their past self at a particular fixed point in time}. If there's any amplification scheme, it'll be one designed by the user, inside QACI, with arbitrarily long to figure it out.

What does "bajillion" actually mean in Step 1?

As described above, we don't actually pre-determine the length of the sequence, or in fact the shape of the graph at all. Each iteration decides whether to spawn one or several next iterations, or indeed to spawn an arbitrarily different long-reflection process.

Why are we doing Step 3? Wouldn't it be better to just use M directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.

Why not ask M for the policy π directly? Or some instruction for constructing π? The instruction could be "Build the policy using our super-duper RL algo with the following reward function..." but it could be anything.

Hopefully my correction above answers these.

What if there's no reward function that should be maximised? Presumably the reward function would need to be "small", i.e. less than a Exabyte, which imposes a maybe-unsatisfiable constraint.

(Again, intractable-to-naively-compute utility function, not easily-trained-on reward function. If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do intractable-to-naively-compute utility functions?)

I guess this is kinda philosophical? I have some short thoughts on this here. If an exabyte is enough to describe {a communication channel with a human-on-earth} to an AI-on-earth, which I think seems likely, then it's enough to build "just have a nice corrigible assistant ask the humans what they want"-type channels.

Put another way: if there are actions which are preferable to other actions, then it seems to me like utility functions are a fully lossless way for counterfactual QACI users to express which kinds of actions they want the AI to perform, which is all we need. If there's something wrong with utility functions over worlds, then counterfactual QACI users can output a utility function which favors actions which lead to something other than utility maximization over worlds, for example actions which lead to the construction of a superintelligent corrigible assistant which will help the humans come up with a better scheme.

Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign $H$ as $H$ with oracle access to $M$ .

Again, I don't get IDA. Iteration doesn't seem particularly needed? Note that inside QACI, the user does have access to an oracle and to all relevant pieces of hypothesis about which hypothesis it is inhabiting; this is what, in the QACI math [LW · GW], this line does:

${QACI}_{0}$'s distribution over answers demands that the answer payload $π_{r}$, when interpreted as math and with all required contextual variables passed as input ($q, μ_{1}, μ_{2}, α, γ_{q}, ξ$).

Notably, $α$ is the hypothesis for which world the user is being considered in, and $γ_{q}, ξ$ for their location within that world. Those are sufficient to fully characterize the hypothesis-for-$H$ that describes them. And because the user doesn't really return just a string but a math function which takes $q, μ_{1}, μ_{2}, α, γ_{q}, ξ$ as input and returns a string, they can have that math function do arbitrary work, including rederiving $H$. In fact, rederiving $H$ is how they call a next iteration: they say (except in math) "call $H$ again (rederived using $q, μ_{1}, μ_{2}, α, γ_{q}, ξ$), but with this string, and return the result of that." See also this illustration [LW(p) · GW(p)], which is kinda wrong in places but gets the recursion call graph thing right.

Another reason to do "iteration" like this inside the counterfactual rather than in the actual factual world (if that's what IDA does, which I'm only guessing here) is that we don't have as many iteration steps as we want in the factual world — eventually OpenAI or someone else kills everyone, whereas in the counterfactual, the QACI users are the only ones who can make progress, so the QACI users essentially have as long as they want, so long as they don't take too long in each individual counterfactual step or other somewhat easily avoided actions like that.

Why isn't Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from $π$ and ask $M$ to use those trajectories to improve the reward function.

Unclear if this still means anything given the rest of this post. Ask me again if it does.

james-oofou on Darklight's Shortform

I quickly tested this. It's true that GPT-4o and GPT-4o-mini hallucinate answers.

And GPT-4o-latest does not provide hallucinated answers.

So, it seems that OpenAI may indeed have made some progress in reducing hallucination in smaller models.

Also, GPT-4 and GPT-4-turbo do not hallucinate answers.