LessWrong 2.0 Reader

On Llama-3 and Dwarkesh Patel’s Podcast with Zuckerberg
Zvi · 2024-04-22T13:10:02.645Z · comments (4)

[link] Twitter thread on AI safety evals
Richard_Ngo (ricraz) · 2024-07-31T00:18:14.076Z · comments (3)

[link] Superforecasting the Origins of the Covid-19 Pandemic
DanielFilan · 2024-03-12T19:01:15.914Z · comments (0)

Black Box Biology
GeneSmith · 2023-11-29T02:27:29.794Z · comments (30)

Do not delete your misaligned AGI.
mako yass (MakoYass) · 2024-03-24T21:37:07.724Z · comments (13)

What is a Tool?
johnswentworth · 2024-06-25T23:40:07.483Z · comments (4)

On coincidences and Bayesian reasoning, as applied to the origins of COVID-19
viking_math · 2024-02-19T01:14:06.772Z · comments (28)

Catastrophic Goodhart in RL with KL penalty
Thomas Kwa (thomas-kwa) · 2024-05-15T00:58:20.763Z · comments (10)

A framework for thinking about AI power-seeking
Joe Carlsmith (joekc) · 2024-07-24T22:41:01.685Z · comments (15)

Taxonomy of AI-risk counterarguments
Odd anon · 2023-10-16T00:12:51.021Z · comments (13)

[link] Ice: The Penultimate Frontier
Roko · 2024-07-13T23:44:56.827Z · comments (56)

RTFB: California’s AB 3211
Zvi · 2024-07-30T13:10:03.853Z · comments (2)

Interpreting and Steering Features in Images
Gytis Daujotas (gytis-daujotas) · 2024-06-20T18:33:59.512Z · comments (6)

[link] Outrage Bonding
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:46:59.818Z · comments (12)

AI #55: Keep Clauding Along
Zvi · 2024-03-14T15:40:09.335Z · comments (16)

Thoughts on open source AI
Sam Marks (samuel-marks) · 2023-11-03T15:35:42.067Z · comments (17)

Never Drop A Ball
Screwtape · 2023-11-23T04:15:35.834Z · comments (1)

Acting Wholesomely
owencb · 2024-02-26T21:49:16.526Z · comments (64)

Don't sleep on Coordination Takeoffs
trevor (TrevorWiesinger) · 2024-01-27T19:55:26.831Z · comments (24)

Managing risks while trying to do good
Wei Dai (Wei_Dai) · 2024-02-01T18:08:46.506Z · comments (26)

[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex (Stefan42) · 2024-07-05T17:05:25.631Z · comments (2)

AI Safety Chatbot
markov (markovial) · 2023-12-21T14:06:48.981Z · comments (11)

Vote on worthwhile OpenAI topics to discuss
Ben Pace (Benito) · 2023-11-21T00:03:03.898Z · comments (55)

Social status part 2/2: everything else
Steven Byrnes (steve2152) · 2024-03-05T16:29:19.072Z · comments (2)

My hopes for alignment: Singular learning theory and whole brain emulation
Garrett Baker (D0TheMath) · 2023-10-25T18:31:14.407Z · comments (5)

[link] Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Robert_AIZI · 2024-03-05T13:55:33.483Z · comments (24)

[question] We might be dropping the ball on Autonomous Replication and Adaptation.
Charbel-Raphaël (charbel-raphael-segerie) · 2024-05-31T13:49:11.327Z · answers+comments (30)

The proper response to mistakes that have harmed others?
Ruby · 2023-12-31T04:06:31.505Z · comments (12)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

A civilization ran by amateurs
Olli Järviniemi (jarviniemi) · 2024-05-30T17:57:32.601Z · comments (7)

E.T. Jaynes Probability Theory: The logic of Science I
Jan Christian Refsgaard (jan-christian-refsgaard) · 2023-12-27T23:47:52.579Z · comments (20)

Inspired by: Failures in Kindness
X4vier · 2024-07-27T01:21:42.848Z · comments (2)

Raemon's Deliberate (“Purposeful?”) Practice Club
Raemon · 2023-11-14T18:24:19.335Z · comments (11)

Offering AI safety support calls for ML professionals
Vael Gates · 2024-02-15T23:48:12.797Z · comments (1)

How should TurnTrout handle his DeepMind equity situation?
habryka (habryka4) · 2023-10-16T18:25:38.895Z · comments (30)

Balsa Update and General Thank You
Zvi · 2023-12-12T20:30:03.980Z · comments (8)

[link] DeepMind: Evaluating Frontier Models for Dangerous Capabilities
Zach Stein-Perlman · 2024-03-21T03:00:31.599Z · comments (8)

Natural Latents Are Not Robust To Tiny Mixtures
johnswentworth · 2024-06-07T18:53:36.643Z · comments (8)

What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)

OpenAI’s new Preparedness team is hiring
leopold · 2023-10-26T20:42:35.966Z · comments (2)

There Should Be More Alignment-Driven Startups
Vaniver · 2024-05-31T02:05:06.799Z · comments (14)

An Actually Intuitive Explanation of the Oberth Effect
Isaac King (KingSupernova) · 2024-01-10T20:23:17.216Z · comments (33)

"Epistemic range of motion" and LessWrong moderation
habryka (habryka4) · 2023-11-27T21:58:40.834Z · comments (3)

[question] What do we know about the AI knowledge and views, especially about existential risk, of the new OpenAI board members?
Zvi · 2024-03-11T14:55:05.128Z · answers+comments (2)

Originality vs. Correctness
alkjash · 2023-12-06T18:51:49.531Z · comments (16)

On OpenAI Dev Day
Zvi · 2023-11-09T16:10:06.646Z · comments (0)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

My AI Predictions 2023 - 2026
HunterJay · 2023-10-16T00:50:52.968Z · comments (28)

Interdictor Ship
lsusr · 2024-08-19T04:59:18.487Z · comments (9)

Recent comments

richard_kennaway on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

The core claim in this post is that our brains model the world as though there's a thing called "our values", and tries to learn about those values in the usual epistemic way.

I find that a very strange idea, as strange as Plato’s Socrates’ parallel idea that learning is not the acquisition of something new, but recollection of what one had forgotten.

If I try X, anticipating that it will be an excellent experience, and find it disappointing, I have not learned something about my values, but about X.

I have never eaten escamoles. If I try them, what I will discover is what they are like to eat. If I like them, did I always like them? That is an unheard-falling-trees question.

If I value a thing at one period of life and turn away from it later, I have not discovered something about my values. My values have changed. In the case of the teenager we call this process “maturing”. Wine maturing in a barrel is not becoming what it always was, but what it will be.

But people have this persistent illusion that how they are today is how they always were and always will be, and their mood of the moment is their fundamental nature, despite the evidence of their own memory.

tailcalled on tailcalled's Shortform

At a human level, the count of each type of atom is basically always conserved too, so it's not just a question of why not momentum but also a question of why not moles of hydrogen, moles of carbon, moles of oxygen, moles of nitrogen, moles of silicon, moles of iron, etc.

I guess for momentum in particular, it seems reasonable why it wouldn't be useful in a thermodynamics-style model because things would woosh away too much (unless you're dealing with some sort of flow? Idk). A formalization or refutation of this intuition would be somewhat neat, but I would actually more wonder, could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?

directedevolution on AllAmericanBreakfast's Shortform

Why don't more people seek out and use talent scouts/headhunters? If the ghost jobs phenomenon is substantial, that's a perfect use case. Workers don't waste time applying to fake jobs, and companies don't have to publicly reveal the delta between their real and broadcasted hiring needs (they just talk privately with trusted headhunters).

Are there not enough headhunters? Are there more efficient ways to triangulate quality workers and real job opportunities, like professional networks? Are ghost jobs not that big of a deal? Do people in fact use headhunters quite a lot?

ape-in-the-coat on What's the Deal with Logical Uncertainty?

I think the simple solution is to not talk about logical tautologies and contradictions when expressing the Kolmogorov axioms for a theory of subjective Bayesianism. Instead talk about what we actually know a priori, not about tautologies which we merely could know a priori (if we were logically omniscient).

Yes, absolutely. When I apply probability theory it should represent my state of knowledge, not state of knowledge of some logically omniscient being. For me it seems such an obvious thing that I struggle to understand why it's still not a standard approach.

So are there some hidden paradoxes in such an approach that I just do not see yet? Or maybe some issues with formalizing the axioms?

tamsin-leake on Shortform

(oops, this ended up being fairly long-winded! hope you don't mind. feel free to ask for further clarifications.)

There's a bunch of things wrong with your description, so I'll first try to rewrite it in my own words, but still as close to the way you wrote it (so as to try to bridge the gap to your ontology) as possible. Note that I might post QACI 2 somewhat soon, which simplifies a bunch of QACI by locating the user as {whatever is interacting with the computer the AI is running on} rather than by using a beacon.

A first pass is to correct your description to the following:

1. We find a competent honourable human at a particular point in time , like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 1GB secret key, large enough that in counterfactuals it could replace with an entire snapshot of . We also give them the ability to express a 1GB output, eg by writing a 1GB key somewhere which is somehow "signed" as the only . This is part of $H$ — $H$ is not just the human being queried at a particular point in time, it's also the human producing an answer in some way. So $H$ is a function from 1GB bitstring to 1GB bitstring. We define $H^{+}$ as $H$, followed by whichever new process $H$ describes in its output — typically another instance of $H$ except with a different 1GB payload.
2. We want a model $M$ of the agent $H^{+}$. In QACI, we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^{+}$ after feeding them a bunch of data about the world and the secret key.
3. We then ask $M$ the question $q$, "What's the best utility-function-over-policies to maximise?" to get a utility function $U : (O \times A)^{*} \to R$. We then ask our Solomonoff-like ideal reasoner for their best guess about which action $A$ maximizes $U$.

Indeed, as you ask in question 3, in this description there's not really a reason to make step 3 an extra thing. The important thing to notice here is that model $M$ might get pretty good, but it'll still have uncertainty.

When you say "we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^{+}$ ", you're implying that — positing U(M,A) to be the function that says how much utility the utility function returned by model M attributes to action A (in the current history-so-far) — we do something like:

  let M ← oracle(argmax { for model M } 𝔼 { over uncertainty } P(M))
  let A ← oracle(argmax { for action A } U(M, A))
  perform(A)

Indeed, in this scenario, the second line is fairly redundant.

The reason we ask for a utility function is because we want to get a utility function within the counterfactual — we don't want to collapse the uncertainty with an argmax before extracting a utility function, but after. That way, we can do expected-given-uncertainty utility maximization over the full distribution of model-hypotheses, rather than over our best guess about $M$ . We do:

  let A ← oracle(argmax { for A } 𝔼 { for M, over uncertainty } P(M) · U(M, A))
  perform(A)

That is, we ask our ideal reasoner (oracle) for the action with the best utility given uncertainty — not just logical uncertainty, but also uncertainty about which $M$ . This contrasts with what you describe, in which we first pick the most probable $M$ and then calculate the action with the best utility according only to that most-probable pick.
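
The difference between the two procedures can be seen in a toy example. This is only a sketch under made-up assumptions: the three model hypotheses, their probabilities, the two actions, and the utility table are all hypothetical numbers chosen for illustration, and ordinary `max` stands in for the ideal reasoner (oracle).

```python
# Hypothetical toy setup: three model hypotheses with probabilities P(M),
# and the utility U(M, A) that each hypothesis assigns to two actions.
model_probs = {"M1": 0.4, "M2": 0.35, "M3": 0.25}
utility = {
    "M1": {"a": 1.0, "b": 0.0},
    "M2": {"a": -2.0, "b": 1.0},
    "M3": {"a": -2.0, "b": 1.0},
}
actions = ["a", "b"]


def act_on_best_guess(model_probs, utility, actions):
    """First collapse uncertainty to the most probable M, then maximize U(M, A)."""
    best_model = max(model_probs, key=model_probs.get)
    return max(actions, key=lambda a: utility[best_model][a])


def act_on_expected_utility(model_probs, utility, actions):
    """Maximize the P(M)-weighted sum of U(M, A) over all model hypotheses."""
    def expected(a):
        return sum(p * utility[m][a] for m, p in model_probs.items())
    return max(actions, key=expected)


print(act_on_best_guess(model_probs, utility, actions))
print(act_on_expected_utility(model_probs, utility, actions))
```

With these numbers the most probable hypothesis (M1) favors action `a`, but M2 and M3 together hold most of the probability mass and strongly disfavor it, so the expected-utility procedure picks `b` instead.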

To answer the rest of your questions:

Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?

Unclear! I'm not familiar enough with IDA, and I've bounced off explanations for it I've seen in the past. QACI doesn't feel to me like it particularly involves the concepts of distillation or amplification, but I guess it does involve the concept of iteration, sure. But I don't get the thing called IDA.

Why not replace Step 1 with Strong HCH or some other amplification scheme?

It's unclear to me how one would design an amplification scheme — see concerns of the general shape expressed here [LW · GW]. The thing I like about my step 1 is that the setup of the QACI loop (well, really, graph (well, really, arbitrary computation, but most of the time the user will probably just call themself in sequence)) doesn't involve any AI at all — you could go back in time before the industrial revolution and explain the core QACI idea and it would make sense assuming time-travelling-messages magic, and the magic wouldn't have to do any extrapolating. Just tell someone the idea is that they could send a message to {their past self at a particular fixed point in time}. If there's any amplification scheme, it'll be one designed by the user, inside QACI, with arbitrarily much time to figure it out.

What does "bajillion" actually mean in Step 1?

As described above, we don't actually pre-determine the length of the sequence, or in fact the shape of the graph at all. Each iteration decides whether to spawn one or several next iterations, or indeed to spawn an arbitrarily different long-reflection process.

Why are we doing Step 3? Wouldn't it be better to just use M directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.

Why not ask M for the policy π directly? Or some instruction for constructing π? The instruction could be "Build the policy using our super-duper RL algo with the following reward function..." but it could be anything.

Hopefully my correction above answers these.

What if there's no reward function that should be maximised? Presumably the reward function would need to be "small", i.e. less than an exabyte, which imposes a maybe-unsatisfiable constraint.

(Again, intractable-to-naively-compute utility function, not easily-trained-on reward function. If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do intractable-to-naively-compute utility functions?)

I guess this is kinda philosophical? I have some short thoughts on this here. If an exabyte is enough to describe {a communication channel with a human-on-earth} to an AI-on-earth, which I think seems likely, then it's enough to build "just have a nice corrigible assistant ask the humans what they want"-type channels.

Put another way: if there are actions which are preferable to other actions, then it seems to me like utility functions are a fully lossless way for counterfactual QACI users to express which kinds of actions they want the AI to perform, which is all we need. If there's something wrong with utility functions over worlds, then counterfactual QACI users can output a utility function which favors actions which lead to something other than utility maximization over worlds, for example actions which lead to the construction of a superintelligent corrigible assistant which will help the humans come up with a better scheme.

Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign $H$ as $H$ with oracle access to $M$ .

Again, I don't get IDA. Iteration doesn't seem particularly needed? Note that inside QACI, the user does have access to an oracle and to all relevant pieces of hypothesis about which hypothesis they are inhabiting — this is what, in the QACI math [LW · GW], this line does:

${QACI}_{0}$'s distribution over answers demands that the answer payload $π_{r}$, when interpreted as math and with all required contextual variables passed as input ($q, μ_{1}, μ_{2}, α, γ_{q}, ξ$).

Notably, $α$ is the hypothesis for which world the user is being considered in, and $γ_{q}, ξ$ for their location within that world. Those are sufficient to fully characterize the hypothesis-for-$H$ that describes them. And because the user doesn't really return just a string but a math function which takes $q, μ_{1}, μ_{2}, α, γ_{q}, ξ$ as input and returns a string, they can have that math function do arbitrary work — including rederiving $H$. In fact, rederiving $H$ is how they call a next iteration: they say (except in math) "call $H$ again (rederived using $q, μ_{1}, μ_{2}, α, γ_{q}, ξ$), but with this string, and return the result of that." See also this illustration [LW(p) · GW(p)], which is kinda wrong in places but gets the recursion call graph thing right.

Another reason to do "iteration" like this inside the counterfactual rather than in the actual factual world (if that's what IDA does, which I'm only guessing here) is that we don't have as many iteration steps as we want in the factual world — eventually OpenAI or someone else kills everyone, whereas in the counterfactual, the QACI users are the only ones who can make progress, so the QACI users essentially have as long as they want, so long as they don't take too long in each individual counterfactual step or other somewhat easily avoided actions like that.

Why isn't Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from $π$ and ask $M$ to use those trajectories to improve the reward function.

Unclear if this still means anything given the rest of this post. Ask me again if it does.

james-oofou on Darklight's Shortform

I quickly tested this. GPT-4 and GPT-4-turbo do not seem to provide hallucinated answers. GPT-4o and GPT-4o-mini do.

But GPT-4o-latest does not provide hallucinated answers.

So, it seems that OpenAI may indeed have made some progress in reducing hallucination in smaller models.

kqr on Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety

Thanks for taking the time to dive into this. I've spent the past few evenings iterating on a forecasting bot while doing embarrassingly little research myself[1], and it seems like I have stumbled into the same approach as Five Thirty Nine, and my bot has the exact same sort of problems. I'll write more later about why I think some of those problems are not as big as they may seem.

But your article also gave me some ideas that might lead to improvements. Thanks!

[1]: In this case, I prioritise the two weeks in the lab over the hour in the library. I'm doing it not to make a good forecasting bot but to learn the APIs involved.

directedevolution on AllAmericanBreakfast's Shortform

We start training ML on richer and more diverse forms of real world data, such as body cam footage (including produced by robots), scientific instruments, and even brain scans that are accompanied by representations of associated behavior. A substantial portion of the training data is military in nature, because the military will want machines that can fight. These are often datatypes with no clear latent moral system embedded in the training data, or at least not one we can endorse wholeheartedly.

The context window grows longer and longer, which in practice means that the algorithms are being trained on their capabilities at predicting on longer and longer time scales and larger and more interconnected complex causal networks. Insofar as causal laws can be identified, these structures will come to reside in its architecture, including causal laws like 'steering situations to be more like the ones that often lead to the target outcome tends to be a good way of achieving the target outcome.'

Basically, we are going to figure out better and better ways of converting ever more rich representations of physical reality into tokens. We're going to spend vast resources doing ML on those rich datasets. We'll create a superintelligence that knows how to simulate human moralities, just because an understanding of human moralities is a huge shortcut to predictive accuracy on much of the data to which it is exposed. But it won't be governed by those moralities. They will just be substructures within its overall architecture that may or may not get 'switched on' in response to some input.

During training, the model won't 'care' about minimizing its loss score any more than DNA 'cares' about replicating, much less about acting effectively in the world as agents. Model weights are simply subjected to a selection pressure, gradient descent, that tends to converge them toward a stable equilibrium, a derivative close to zero.

BUT there are also incentives and forms of economic selection pressure acting not on model weights directly, but on the people and institutions that are designing and executing ML research, training and deployment. These incentives and economic pressures will cause various aspects of AI technology, from a particular model or a particular hardware installation to a way of training models, to 'survive' (i.e. be deployed) or 'replicate' (i.e. inspire the design of the next model).

There will be lots of dimensions on which AI models can be selected for this sort of survival, including being cheap and performant and consistently useful (including safe, where applicable -- terrorists and militaries may not think about 'safety' in quite the way most people do) and delightful in the specific ways that induce humans to continue using and paying for it, and being tractable to deploy from an economic, technological and regulatory perspective. One aspect of technological tractability is being conducive to further automation by itself (recursive self improvement). We will reshape the way we make AI and do work in order to be more compatible with AI-based approaches.

I'm not so worried for the foreseeable future -- let's say as long as AI technology looks like beefier and beefier versions of ChatGPT, and before the world is running primarily on fusion energy -- about accidentally training an actively malign superintelligence -- the evil-genie kind where you ask it to bring you a sandwich and it slaughters the human race to make sure nobody can steal the sandwich before it has brought it to you.

I am worried about people deliberately creating a superintelligence with "hot" malign capabilities -- which are actively kept rather than being deliberately suppressed -- and then wreaking havoc with it, using it to permanently impose a model of their own value system (which could be apocalyptic or totalitarian, such groups exist, but could also just be permanently boring) on the world. Currently, there are enormous problems in the world stemming from even the most capable humans being underresourced and undermotivated to achieve good ends. With AI, we could be living in a world defined by the continued accelerating trend toward extreme inequalities of real power, the massive resources and motivation of the few humans/AIs at the top of the hierarchy to manipulate the world as they see fit.

We have never lived in a world like that before. Many things come to pass. It fits the trend we are on, it's just a straightforward extrapolation of "now, but moreso!"

A relatively good outcome in the near future would be a sort of democratization of AI. I don't mean open source AT ALL. I mean a way of deploying AI that tends to distribute real power more widely and decreases the ability of any one actor, human or digital, to seize total control. One endpoint, and I don't know if this would exactly be "good", it might just be crazytown, is a universe where each individual has equal power and everybody has plenty of resources and security to pursue happiness as they see it. Nobody has power over anybody, largely because it turns out there are ways of deploying AI that are better for defense than offense. From that standpoint, the only option individuals have is looking for mutual surplus. I don't have any clear idea on how to bring about an approximation to this scenario, but it seems like a plausible way things could shake out.

sodium on Sodium's Shortform

Pre-registering a71c97bb02e7082ca62503d8e3ac78dc9f554f524a72ad6a1392cf2d34f398d7

seth-herd on The Other Existential Crisis

Most of humanity has always known they couldn't do anything useful - except provide a better life for their children than they had.

Only a few elites have ever felt that what they do mattered, and looked forward to doing it as a challenge. Most of humanity has done what they must to ensure their children won't suffer.

Your first answer to your daughter would make most parents weep with joy: whatever you want is what you'll do.

Don't worry that she won't find something she likes to do unless she's forced to. People care about people, and there will be plenty to do with and for other people.

If you want concrete ideas of what people do when they're allowed to, see art and other collaborative projects that aren't just for money.