LessWrong 2.0 Reader


Machine Unlearning Evaluations as Interpretability Benchmarks
NickyP (Nicky) · 2023-10-23T16:33:04.878Z · comments (2)
[link] FTX expects to return all customer money; clawbacks may go away
Mikhail Samin (mikhail-samin) · 2024-02-14T03:43:13.218Z · comments (1)
[link] On Lies and Liars
Gabriel Alfour (gabriel-alfour-1) · 2023-11-17T17:13:03.726Z · comments (4)
[question] Do websites and apps actually generally get worse after updates, or is it just an effect of the fear of change?
lillybaeum · 2023-12-10T17:26:34.206Z · answers+comments (34)
Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Arjun Panickssery (arjun-panickssery) · 2024-01-15T21:21:03.962Z · comments (0)
[question] Is AlphaGo actually a consequentialist utility maximizer?
faul_sname · 2023-12-07T12:41:05.132Z · answers+comments (8)
5. Moral Value for Sentient Animals? Alas, Not Yet
RogerDearnaley (roger-d-1) · 2023-12-27T06:42:09.130Z · comments (41)
AI #63: Introducing Alpha Fold 3
Zvi · 2024-05-09T14:20:03.176Z · comments (2)
AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them
Roman Leventov · 2023-12-27T14:51:37.713Z · comments (9)
Regrant up to $600,000 to AI safety projects with GiveWiki
Dawn Drescher (Telofy) · 2023-10-28T19:56:06.676Z · comments (1)
An illustrative model of backfire risks from pausing AI research
Maxime Riché (maxime-riche) · 2023-11-06T14:30:58.615Z · comments (3)
LLMs can strategically deceive while doing gain-of-function research
Igor Ivanov (igor-ivanov) · 2024-01-24T15:45:08.795Z · comments (4)
The Consciousness Box
GradualImprovement · 2023-12-11T16:45:08.172Z · comments (22)
Love, Reverence, and Life
Elizabeth (pktechgirl) · 2023-12-12T21:49:04.061Z · comments (7)
[link] Vacuum: Theory and Technologies
ethanmorse · 2024-01-21T17:23:49.257Z · comments (0)
We have promising alignment plans with low taxes
Seth Herd · 2023-11-10T18:51:38.604Z · comments (9)
[link] Fake Deeply
Zack_M_Davis · 2023-10-26T19:55:22.340Z · comments (7)
Disentangling four motivations for acting in accordance with UDT
Julian Stastny · 2023-11-05T21:26:22.514Z · comments (3)
Helpful examples to get a sense of modern automated manipulation
trevor (TrevorWiesinger) · 2023-11-12T20:49:57.422Z · comments (3)
Update #2 to "Dominant Assurance Contract Platform": EnsureDone
moyamo · 2023-11-28T18:02:50.367Z · comments (2)
Some of my predictable updates on AI
Aaron_Scher · 2023-10-23T17:24:34.720Z · comments (8)
Preface to the Sequence on LLM Psychology
Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-07T16:12:07.742Z · comments (0)
Being good at the basics
dominicq · 2023-11-04T14:18:50.976Z · comments (1)
Video and transcript of presentation on Scheming AIs
Joe Carlsmith (joekc) · 2024-03-22T15:52:03.311Z · comments (1)
0. The Value Change Problem: introduction, overview and motivations
Nora_Ammann · 2023-10-26T14:36:15.466Z · comments (0)
Monthly Roundup #13: December 2023
Zvi · 2023-12-19T15:10:08.293Z · comments (5)
Computational Approaches to Pathogen Detection
jefftk (jkaufman) · 2023-11-01T00:30:13.012Z · comments (5)
5 Reasons Why Governments/Militaries Already Want AI for Information Warfare
trevor (TrevorWiesinger) · 2023-10-30T16:30:38.020Z · comments (0)
A quick experiment on LMs’ inductive biases in performing search
Alex Mallen (alex-mallen) · 2024-04-14T03:41:08.671Z · comments (2)
[link] Talking With People Who Speak to Congressional Staffers about AI risk
Eneasz · 2023-12-14T17:55:50.606Z · comments (0)
[link] How "Pause AI" advocacy could be net harmful
Tamsin Leake (carado-1) · 2023-12-26T16:19:20.724Z · comments (9)
[link] OpenAI, DeepMind, Anthropic, etc. should shut down.
Tamsin Leake (carado-1) · 2023-12-17T20:01:22.332Z · comments (48)
[link] Why you, personally, should want a larger human population
jasoncrawford · 2024-02-23T19:48:10.526Z · comments (32)
How I build and run behavioral interviews
benkuhn · 2024-02-26T05:50:05.328Z · comments (6)
[question] How unusual is the fact that there is no AI monopoly?
Viliam · 2024-08-16T20:21:51.012Z · answers+comments (15)
[link] End Single Family Zoning by Overturning Euclid V Ambler
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-26T14:08:45.046Z · comments (1)
[link] NAO Updates, Fall 2024
jefftk (jkaufman) · 2024-10-18T00:00:04.142Z · comments (2)
Investigating the Ability of LLMs to Recognize Their Own Writing
Christopher Ackerman (christopher-ackerman) · 2024-07-30T15:41:44.017Z · comments (0)
Is suffering like shit?
KatjaGrace · 2024-05-31T01:20:03.855Z · comments (5)
[link] the subreddit size threshold
bhauth · 2024-01-23T00:38:13.747Z · comments (3)
Comparing Quantized Performance in Llama Models
NickyP (Nicky) · 2024-07-15T16:01:24.960Z · comments (2)
In Defense of Lawyers Playing Their Part
Isaac King (KingSupernova) · 2024-07-01T01:32:58.695Z · comments (9)
Learning Math in Time for Alignment
Nicholas / Heather Kross (NicholasKross) · 2024-01-09T01:02:37.446Z · comments (3)
[link] A computational complexity argument for many worlds
jessicata (jessica.liu.taylor) · 2024-08-13T19:35:10.116Z · comments (15)
Being against involuntary death and being open to change are compatible
Andy_McKenzie · 2024-05-27T06:37:27.644Z · comments (5)
An argument that consequentialism is incomplete
cousin_it · 2024-10-07T09:45:12.754Z · comments (27)
[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)
If you are also the worst at politics
lukehmiles (lcmgcd) · 2024-05-26T20:07:49.201Z · comments (8)
[link] Manifund: 2023 in Review
Austin Chen (austin-chen) · 2024-01-18T23:50:13.557Z · comments (0)
An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs
Jan Wehner · 2024-07-14T10:37:21.544Z · comments (4)