LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

[link] AISN #29: Progress on the EU AI Act Plus, the NY Times sues OpenAI for Copyright Infringement, and Congressional Questions about Research Standards in AI Safety
aogara (Aidan O'Gara) · 2024-01-04T16:09:31.336Z · comments (0)
Some Vacation Photos
johnswentworth · 2024-01-04T17:15:01.187Z · comments (0)
Deep atheism and AI risk
Joe Carlsmith (joekc) · 2024-01-04T18:58:47.745Z · comments (22)
[link] Cellular reprogramming, pneumatic launch systems, and terraforming Mars: Some things I learned about at Foresight Vision Weekend
jasoncrawford · 2024-01-04T19:33:57.887Z · comments (0)
The Gears of Argmax
StrivingForLegibility · 2024-01-04T23:30:30.339Z · comments (0)
Safety Data Sheets for Optimization Processes
StrivingForLegibility · 2024-01-04T23:30:36.510Z · comments (1)
Best-Responding Is Not Always the Best Response
StrivingForLegibility · 2024-01-04T23:30:48.400Z · comments (0)
Using Threats to Achieve Socially Optimal Outcomes
StrivingForLegibility · 2024-01-04T23:30:54.615Z · comments (0)
Hello
S Benfield (steven-benfield) · 2024-01-04T23:35:05.621Z · comments (0)
[link] Project ideas: Governance during explosive technological growth
Lukas Finnveden (Lanrian) · 2024-01-04T23:51:56.407Z · comments (0)
MIRI 2024 Mission and Strategy Update
Malo (malo) · 2024-01-05T00:20:54.169Z · comments (44)
Does AI care about reality or just its own perception?
RedFishBlueFish (RedStateBlueState) · 2024-01-05T04:05:11.167Z · comments (8)
If I ran the zoo
Optimization Process · 2024-01-05T05:14:57.631Z · comments (0)
Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley (roger-d-1) · 2024-01-05T08:46:58.915Z · comments (4)
Predictive model agents are sort of corrigible
Raymond D · 2024-01-05T14:05:03.037Z · comments (6)
[link] Forecast your 2024 with Fatebook
Sage Future (aaron-ho-1) · 2024-01-05T14:07:55.743Z · comments (0)
AI Impacts Survey: December 2023 Edition
Zvi · 2024-01-05T14:40:06.156Z · comments (6)
Catching AIs red-handed
ryan_greenblatt · 2024-01-05T17:43:10.948Z · comments (18)
[question] What technical topics could help with boundaries/membranes?
Chipmonk · 2024-01-05T18:14:58.795Z · answers+comments (25)
[link] The Hippie Rabbit Hole - Nuggets of Gold in Rivers of Bullshit
Jonathan Moregård (JonathanMoregard) · 2024-01-05T18:27:01.769Z · comments (20)
Technology path dependence and evaluating expertise
bhauth · 2024-01-05T19:21:23.302Z · comments (2)
[link] AI Impacts 2023 Expert Survey on Progress in AI
habryka (habryka4) · 2024-01-05T19:42:17.226Z · comments (1)
The Next ChatGPT Moment: AI Avatars
kolmplex (luke-man) · 2024-01-05T20:14:10.074Z · comments (10)
[link] Almost everyone I’ve met would be well-served thinking more about what to focus on
Henrik Karlsson (henrik-karlsson) · 2024-01-05T21:01:27.861Z · comments (8)
[link] Benchmark Study #1: MMLU (Pile, MCQ)
Bruce W. Lee (bruce-lee) · 2024-01-05T21:35:37.999Z · comments (0)
[link] Project ideas: Epistemics
Lukas Finnveden (Lanrian) · 2024-01-05T23:41:23.721Z · comments (4)
[link] Benchmark Study #2: TruthfulQA (Task, MCQ)
Bruce W. Lee (bruce-lee) · 2024-01-06T02:39:39.895Z · comments (2)
Survey of 2,778 AI authors: six parts in pictures
KatjaGrace · 2024-01-06T04:43:34.590Z · comments (1)
Are we inside a black hole?
Jay · 2024-01-06T13:30:51.451Z · comments (5)
Book review: Trick or treatment (2008)
Fleece Minutia · 2024-01-06T15:40:49.953Z · comments (0)
A Land Tax For Britain
A.H. (AlfredHarwood) · 2024-01-06T15:52:14.942Z · comments (9)
Lack of Spider-Man is evidence against the simulation hypothesis
RamblinDash · 2024-01-06T18:17:20.641Z · comments (22)
A Challenge to Effective Altruism's Premises
False Name (False Name, Esq.) · 2024-01-06T18:46:23.715Z · comments (3)
AI Risk and the US Presidential Candidates
Zane · 2024-01-06T20:18:04.945Z · comments (22)
The Sequences on YouTube
Neil (neil-warren) · 2024-01-07T01:44:39.663Z · comments (9)
[link] Defending against hypothetical moon life during Apollo 11
eukaryote · 2024-01-07T04:49:42.628Z · comments (9)
[link] Benchmark Study #3: HellaSwag (Task, MCQ)
Bruce W. Lee (bruce-lee) · 2024-01-07T04:59:21.347Z · comments (4)
[link] Towards AI Safety Infrastructure: Talk & Outline
Paul Bricman (paulbricman) · 2024-01-07T09:31:12.217Z · comments (0)
[link] Bayesians Commit the Gambler's Fallacy
Kevin Dorst · 2024-01-07T12:54:59.939Z · comments (28)
Deceptive AI ≠ Deceptively-aligned AI
Steven Byrnes (steve2152) · 2024-01-07T16:55:13.761Z · comments (19)
Benchmark Study #4: AI2 Reasoning Challenge (Task(s), MCQ)
Bruce W. Lee (bruce-lee) · 2024-01-07T17:13:00.209Z · comments (0)
[link] Project ideas: Sentience and rights of digital minds
Lukas Finnveden (Lanrian) · 2024-01-07T17:34:58.942Z · comments (0)
(Partial) failure in replicating deceptive alignment experiment
claudia.biancotti · 2024-01-07T17:56:36.748Z · comments (0)
We shouldn't fear superintelligence because it already exists
Spencer Chubb (spencer-chubb) · 2024-01-07T17:59:55.297Z · comments (14)
[link] A model of research skill
L Rudolf L (LRudL) · 2024-01-08T00:13:12.755Z · comments (6)
Utility is relative
CrimsonChin · 2024-01-08T02:31:44.000Z · comments (4)
Sledding Among Hazards
jefftk (jkaufman) · 2024-01-08T03:30:08.463Z · comments (5)
Why There Is Hope For An Alignment Solution
Darklight · 2024-01-08T06:58:32.820Z · comments (0)
Reflections on my first year of AI safety research
Jay Bailey · 2024-01-08T07:49:08.147Z · comments (3)
There is no sharp boundary between deontology and consequentialism
quetzal_rainbow · 2024-01-08T11:01:47.828Z · comments (2)