LessWrong 2.0 Reader

Exploring OpenAI's Latent Directions: Tests, Observations, and Poking Around
Johnny Lin (hijohnnylin) · 2024-01-31T06:01:27.969Z · comments (4)

Reprograming the Mind: Meditation as a Tool for Cognitive Optimization
Jonas Hallgren · 2024-01-11T12:03:41.763Z · comments (3)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

Smartphone Etiquette: Suggestions for Social Interactions
Declan Molony (declan-molony) · 2024-06-04T06:01:03.336Z · comments (4)

Meetup In a Box: Year In Review
Czynski (JacobKopczynski) · 2024-02-14T01:18:28.259Z · comments (0)

Why I think it's net harmful to do technical safety research at AGI labs
Remmelt (remmelt-ellen) · 2024-02-07T04:17:15.246Z · comments (24)

[link] Manifold Markets
PeterMcCluskey · 2024-02-02T17:48:36.630Z · comments (9)

AI debate: test yourself against chess 'AIs'
Richard Willis · 2023-11-22T14:58:10.847Z · comments (35)

Agent membranes/boundaries and formalizing “safety”
Chipmonk · 2024-01-03T17:55:21.018Z · comments (46)

Evaluating Solar
jefftk (jkaufman) · 2024-02-17T21:50:04.783Z · comments (5)

[link] Attention on AI X-Risk Likely Hasn't Distracted from Current Harms from AI
Erich_Grunewald · 2023-12-21T17:24:16.713Z · comments (2)

A list of all the deadlines in Biden's Executive Order on AI
Valentin Baltadzhiev (valentin-baltadzhiev) · 2023-11-01T17:14:31.074Z · comments (2)

The Sequences on YouTube
Neil (neil-warren) · 2024-01-07T01:44:39.663Z · comments (9)

Talk: AI safety fieldbuilding at MATS
Ryan Kidd (ryankidd44) · 2024-06-23T23:06:37.623Z · comments (2)

Singular learning theory and bridging from ML to brain emulations
kave · 2023-11-01T21:31:54.789Z · comments (16)

[question] How are you preparing for the possibility of an AI bust?
Nate Showell · 2024-06-23T19:13:45.247Z · answers+comments (16)

AI #57: All the AI News That’s Fit to Print
Zvi · 2024-03-28T11:40:05.435Z · comments (14)

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots
Owain_Evans · 2024-03-28T02:34:21.799Z · comments (0)

[link] How to Upload a Mind (In Three Not-So-Easy Steps)
aggliu · 2023-11-13T18:13:32.893Z · comments (0)

Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)

Geometric Utilitarianism (And Why It Matters)
StrivingForLegibility · 2024-05-12T03:41:21.342Z · comments (2)

Steering subsystems: capabilities, agency, and alignment
Seth Herd · 2023-09-29T13:45:00.739Z · comments (0)

[question] Thoughts on Francois Chollet's belief that LLMs are far away from AGI?
O O (o-o) · 2024-06-14T06:32:48.170Z · answers+comments (17)

Can the House Legislate?
jefftk (jkaufman) · 2023-10-05T13:40:06.649Z · comments (6)

Evidential Correlations are Subjective, and it might be a problem
Martín Soto (martinsq) · 2024-03-07T18:37:54.105Z · comments (6)

[link] Forecasting future gains due to post-training enhancements
elifland · 2024-03-08T02:11:57.228Z · comments (2)

My Mid-Career Transition into Biosecurity
jefftk (jkaufman) · 2023-10-02T21:20:06.768Z · comments (4)

Ideas for Next-Generation Writing Platforms, using LLMs
ozziegooen · 2024-06-04T18:40:24.636Z · comments (4)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (21)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

Why I got the smallpox vaccine in 2023
joec · 2023-10-02T05:11:41.249Z · comments (6)

[link] Structured Transparency: a framework for addressing use/mis-use trade-offs when sharing information
habryka (habryka4) · 2024-04-11T18:35:44.824Z · comments (0)

To Boldly Code
StrivingForLegibility · 2024-01-26T18:25:59.525Z · comments (4)

Weighing Animal Worth
jefftk (jkaufman) · 2023-09-28T13:50:06.752Z · comments (11)

$250K in Prizes: SafeBench Competition Announcement
ozhang (oliver-zhang) · 2024-04-03T22:07:41.171Z · comments (0)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

If a little is good, is more better?
DanielFilan · 2023-11-04T07:10:05.943Z · comments (16)

Useful starting code for interpretability
eggsyntax · 2024-02-13T23:13:47.940Z · comments (2)

Virtually Rational - VRChat Meetup
Tomás B. (Bjartur Tómas) · 2024-01-28T05:52:36.934Z · comments (3)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-11-24T23:35:29.560Z · comments (2)

[link] Arrogance and People Pleasing
Jonathan Moregård (JonathanMoregard) · 2024-02-06T18:43:09.120Z · comments (7)

The Drowning Child
Tomás B. (Bjartur Tómas) · 2023-10-22T16:39:53.016Z · comments (8)

Recent comments

lawrencec on Base LLMs refuse too

After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.

It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction-following datasets), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.

jacob-dunefsky on Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Computing the description length using the entropy of a feature activation's probability distribution is flexible enough to distinguish different types of distributions. For example, a binary distribution would have an entropy of one bit or less, and distributions spread out over more values would have larger entropies.

Yep, that's completely true. Thanks for the reminder!
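
The entropy claim quoted above can be checked with a few lines of Python. This is only a minimal sketch: it assumes the feature activations have already been discretized into a small set of values, and the sample lists and the helper name `entropy_bits` are made up for illustration. It is not the method from the post itself.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Shannon entropy, in bits, of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical, already-discretized feature activations (illustrative data only).
binary_feature = [0, 1, 0, 1, 1, 0, 0, 1]  # two equally likely values
skewed_binary = [0, 0, 0, 0, 0, 0, 0, 1]   # two values, one of them rare
spread_feature = [0, 1, 2, 3, 4, 5, 6, 7]  # eight equally likely values

h_binary = entropy_bits(binary_feature)
h_skewed = entropy_bits(skewed_binary)
h_spread = entropy_bits(spread_feature)

print(f"binary: {h_binary:.3f} bits")
print(f"skewed binary: {h_skewed:.3f} bits")
print(f"spread over 8 values: {h_spread:.3f} bits")
```

A two-valued feature tops out at one bit when both values are equally likely and drops below that when one value dominates, while a feature spread evenly over eight values needs three bits.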

quetzal_rainbow on the case for CoT unfaithfulness is overstated

I don't think you need much incentive to develop steganography, only the fact that "human language is not a platonic example of efficiency in conveying thoughts".

There are multiple examples of o1 producing gibberish in its COT summary (I won't insert them right now because linking stuff from mobile is painful, will edit this comment later). Gibberish seems reasonable given the description of the training process, which optimizes the model based only on the final answer. I imagine a process like this:

1. Sometimes the model produces gibberish in its COT simply because of a lack of robustness.
2. This gibberish gets reinforced.
3. The model learns to utilize it, just as it learns to utilize "..." tokens.

Continue this process for long enough and you are going to get a lot of <untranslatable_14637> in the COT, even if the model doesn't "try to hide" its reasoning. Also, I guess "gibberish resulting from non-robustness" is in itself a more native format of thought for LLMs than imitating human thinking out loud.

The other factor in developing steganography is economic: if you optimize a model to spend a minimal number of tokens, you get a model that "skips" or compresses some thoughts.

amalthea on Ruby's Quick Takes

He's clearly not completely discounting that there's progress, but overall it doesn't feel like he's "updating all the way":

This is a recent post about the DeepMind math olympiad results: https://mathstodon.xyz/@tao/112850716240504978

"1. This is great work, shifting once again our expectations of which benchmark challenges are within reach of either #AI-assisted or fully autonomous methods"

thomas-kwa on You can, in fact, bamboozle an unaligned AI into sparing your life

I appreciate the clear statement of the argument, though it is not obviously watertight to me, and wish people like Nate would engage.

lawrencec on Base LLMs refuse too

It's worth noting that there are reasons to expect the "base models" of both Gemma2 and Qwen2 to demonstrate refusals -- neither is trained on unfiltered webtext.

Qwen2's pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:

Quality Enhancement: The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.

I think this had a huge effect on Qwen2: Qwen2 is able to reliably follow both the Qwen1.5 chat template (as you note) as well as the "User: {Prompt}\n\nAssistant: " template. This is also reflected in their high standardized benchmark scores -- the "base" models do comparably to the instruction finetuned ones! In other words, Qwen2 "base" models are pretty far from traditional base models a la GPT-2 or Pythia as a result of explicit choices made when generating their pretraining data and this explains its propensity for refusals. I wouldn't be surprised if the same were true of the 1.5 models.

I think the Gemma 2 base models were not trained on synthetic data from larger models but its pretraining dataset was also filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:

We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)

My guess is that this filtering explains why the model refuses, more so than (and in addition to?) ChatGPT contamination. Once you remove all the "unsafe completions"

I don't know what's going on with LLaMA 1, though.

norimori1992 on Fake Journal Club proposal

Ah, gotcha. Thank you.

research_prime_space on LLMs are likely not conscious

It can't represent a subjective sense of yellow, because if so, consciousness would be a linear function. That's somewhat ridiculous because I would experience a story about a "dog" differently based on the context.

Furthermore, LLMs scale "features" by how strongly they appear (e.g. the positive sentiment vector is scaled up if the text is very positive). So the LLM's conscious processing of a positive sentiment would be linearly proportional to how positive the text is. Which also seems ridiculous.

I don't expect consciousness to have any useful properties. Let's say you have a deterministic function y = f(x). You can encode just y = f(x), or y = f(x) where f includes conscious representations in the intermediate layers. The latter does not help you achieve increased training accuracy in the slightest. Neural networks also have a strong simplicity bias towards low frequency functions (this has been mathematically proven), and f(x) without consciousness is much simpler/lower frequency to encode than f(x) with consciousness.

mitchell_porter on Alexander Gietelink Oldenziel's Shortform

Thomas Kwa just provided a good reason: "measure drops off exponentially with program length". So embeddings of programs within other programs - which seems to be what a simulation is, in the Solomonoff framework - are considered exponentially unlikely.
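
As a rough worked version of the point above, using the standard length-based weighting behind Solomonoff-style priors (the symbols p, q, and c are labels introduced here for illustration, not notation from the thread):

```latex
% Prior weight of a program p whose length is \ell(p) bits:
w(p) = 2^{-\ell(p)}

% A program q that embeds p at a cost of c extra bits of overhead:
\ell(q) = \ell(p) + c
\quad\Longrightarrow\quad
w(q) = 2^{-(\ell(p) + c)} = 2^{-c}\, w(p)
```

So, under this weighting, every additional bit needed to wrap one program inside another halves the weight relative to the bare program.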

benito on You can, in fact, bamboozle an unaligned AI into sparing your life

I have known non-zero adults to make such commitments to themselves. (But I agree it is not the typical outcome, and I wouldn't believe most people if they told me they would follow-through.)