LessWrong 2.0 Reader

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs
Winnie Yang (winnie-yang) · 2024-08-22T07:32:07.600Z · comments (0)

My decomposition of the alignment problem
Daniel C (harper-owen) · 2024-09-02T00:21:08.359Z · comments (22)

[link] The Great Organism Theory of Evolution
rogersbacon · 2024-08-10T12:26:02.434Z · comments (0)

[link] [Linkpost] 'The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery'
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-15T21:32:59.979Z · comments (1)

A necessary Membrane formalism feature
ThomasCederborg · 2024-09-10T21:33:09.508Z · comments (6)

Simon DeDeo on Explore vs Exploit in Science
Elizabeth (pktechgirl) · 2024-09-10T03:40:08.311Z · comments (0)

Why Reflective Stability is Important
Johannes C. Mayer (johannes-c-mayer) · 2024-09-05T15:28:19.913Z · comments (2)

Announcing the PIBBSS Symposium '24!
DusanDNesic · 2024-09-03T11:19:47.568Z · comments (0)

Looking for Goal Representations in an RL Agent - Update Post
CatGoddess · 2024-08-28T16:42:19.367Z · comments (0)

[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (7)

[question] What are the best resources for building gears-level models of how governments actually work?
adamShimi · 2024-08-19T14:05:02.590Z · answers+comments (6)

Ten counter-arguments that AI is (not) an existential risk (for now)
Ariel Kwiatkowski (ariel-kwiatkowski) · 2024-08-13T22:35:15.341Z · comments (5)

Scaling Laws and Likely Limits to AI
Davidmanheim · 2024-08-18T17:19:46.597Z · comments (0)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

[link] Anthropic is being sued for copying books to train Claude
Remmelt (remmelt-ellen) · 2024-08-31T02:57:27.092Z · comments (4)

[link] Compression Moves for Prediction
adamShimi · 2024-09-14T17:51:12.004Z · comments (0)

[link] Green and golden: a meditation
Richard_Ngo (ricraz) · 2024-08-18T01:36:43.613Z · comments (0)

The Bar for Contributing to AI Safety is Lower than You Think
Chris_Leong · 2024-08-16T15:20:19.055Z · comments (1)

Finding Deception in Language Models
Esben Kran (esben-kran) · 2024-08-20T09:42:13.060Z · comments (4)

[link] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-19T16:13:55.835Z · comments (0)

[question] Is this voting system strategy proof?
Donald Hobson (donald-hobson) · 2024-09-06T20:44:46.691Z · answers+comments (9)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

RLHF is the worst possible thing done when facing the alignment problem
tailcalled · 2024-09-19T18:56:27.676Z · comments (4)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

"Which Future Mind is Me?" Is a Question of Values
dadadarren · 2024-08-09T18:17:09.884Z · comments (12)

What program structures enable efficient induction?
Daniel C (harper-owen) · 2024-09-05T10:12:14.058Z · comments (4)

"Real AGI"
Seth Herd · 2024-09-13T14:13:24.124Z · comments (18)

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-08-24T07:39:00.057Z · comments (0)

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)
Linda Linsefors · 2024-08-23T14:18:24.327Z · comments (2)

[link] How to choose what to work on
jasoncrawford · 2024-09-18T20:39:12.316Z · comments (2)

Why I'm bearish on mechanistic interpretability: the shards are not in the network
tailcalled · 2024-09-13T17:09:25.407Z · comments (33)

Reducing global AI competition through the Commerce Control List and Immigration reform: a dual-pronged approach
Ben Smith (ben-smith) · 2024-09-03T05:28:24.549Z · comments (2)

[link] Will we ever run out of new jobs?
Kevin Kohler (KevinKohler) · 2024-08-19T15:04:03.849Z · comments (7)

[link] My lukewarm take on GLP-1 agonists
George3d6 · 2024-08-26T12:34:27.929Z · comments (0)

Interview with Robert Kralisch on Simulators
WillPetillo · 2024-08-26T05:49:15.543Z · comments (0)

[question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · answers+comments (6)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (10)

Slave Morality: A place for every man and every man in his place
Martin Sustrik (sustrik) · 2024-09-19T04:20:04.491Z · comments (5)

[link] Jonathan Gorard: The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

[link] Non-Transactional Compliments
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:42:16.471Z · comments (0)

[link] CultFrisbee
Gauraventh (aryangauravyadav) · 2024-08-11T21:36:36.550Z · comments (3)

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

Announcing the Ultimate Jailbreaking Championship
InnerHufflepuff (grayswan) · 2024-09-04T00:35:31.234Z · comments (1)

My career exploration: Tools for building confidence
lynettebye · 2024-09-13T11:37:55.843Z · comments (0)

[link] Pronouns are Annoying
ymeskhout · 2024-09-18T13:30:04.620Z · comments (16)

[question] If AI is in a bubble and the bubble bursts, what would you do?
Remmelt (remmelt-ellen) · 2024-08-19T10:56:03.948Z · answers+comments (5)

Physical Therapy Sucks (but have you tried hiding it in some peanut butter?)
Declan Molony (declan-molony) · 2024-09-10T05:54:47.000Z · comments (12)

Simulation-aware causal decision theory: A case for one-boxing in CDT
kongus_bongus · 2024-08-09T18:09:20.013Z · comments (11)

Automating LLM Auditing with Developmental Interpretability
htlou · 2024-09-04T15:50:04.337Z · comments (0)

Funding for work that builds capacity to address risks from transformative AI
abergal · 2024-08-14T23:52:09.922Z · comments (0)

Recent comments

t3t on Ozyrus's Shortform

In general, Intercom is the best place to send us feedback like this, though we're moderately likely to notice a top-level shortform comment. Will look into it; sounds like it could very well be a bug. Thanks for flagging it.

johnswentworth on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

How accurate is the summary I have presented above?

Basically accurate.

Where do values, as opposed to beliefs-about-values, come from?

That is the right next question to ask. Humans have a map of their values, and can update that map in response to rewards in order to "learn about values", but that still leaves the question of when/whether there are any "real values" which the map represents, and what kind-of-things those "real values" are.

A few parts of an answer:

"human values" are not one monolithic thing; we value lots of different stuff, and different parts of our value-estimates can separately represent "a real thing" or fail to represent "a real thing".
we don't yet understand what it means for part of our value-estimates to represent "a real thing", but it probably works pretty similarly to epistemic representation more generally - e.g. my belief about the position of the dog in my apartment represents a real thing (even if the position itself is wrong) exactly when there is in fact a dog in my apartment at all.

adam_scholl on Why I funded PIBBSS

Given both my personal experience with LLMs and my reading of the role that empirical engagement has historically played in non-paradigmatic research, I tend to advocate for a methodology which incorporates immediate feedback loops with present day deep learning systems over the classical "philosophy -> math -> engineering" deconfusion/agent foundations paradigm.

I'm curious what your read of the history is, here? My impression is that most important paradigm-forming work so far has involved empirical feedback somehow, but often in ways exceedingly dissimilar from/illegible to prevailing scientific and engineering practice.

I have a hard time imagining scientists like e.g. Darwin, Carnot, or Shannon describing their work as depending much on "immediate feedback loops with present day" systems. So I'm curious whether you think PIBBSS would admit researchers like these into your program, were they around and pursuing similar strategies today?

t3t on AI #82: The Governor Ponders

If you include Facebook & Google (i.e. the entire orgs) as "frontier AI companies", then six figures. If you only include DeepMind and FAIR (and OpenAI and Anthropic), maybe on the order of 10-15k, though who knows what turnover's been like. Rough current headcount estimates:

DeepMind: 2600 (as of May 2024, includes post-Brain-merge employees)

Meta AI (formerly FAIR): ~1200 (unreliable sources; seems plausible, but is probably an implicit undercount since they almost certainly rely a lot on various internal infrastructure used by all of Facebook's engineering departments that they'd otherwise need to build/manage themselves.)

OpenAI: >1700

Anthropic: >500 (as of May 2024)

So that's a floor of ~6k current employees.

jessica-liu-taylor on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

I discussed something similar in the "Human brains don't seem to neatly factorize" section of the Obliqueness post. I think this implies that, even assuming the Orthogonality Thesis, humans don't have values that are orthogonal to human intelligence (they'd need to not respond to learning/reflection to be orthogonal in this fashion), so there's not a straightforward way to align ASI with human values by plugging in human values to more intelligence.

raemon on AI #82: The Governor Ponders

Over 125 current & former employees of frontier AI companies have called on @CAGovernor to #SignSB1047.

I know this is a political statement that isn't optimizing for such things, but I am pretty interested in knowing what the denominator actually is: how many people meaningfully count as "employees of frontier AI companies"? If the answer is tens of thousands then, well, that is indeed a tiny number. But I think the number might be something more like 1000-3000?

bruce-schechter on The Lens That Sees Its Flaws

Well, yes. But scientists need to have optimism that their experiments will lead somewhere, and entrepreneurs have to be optimistic about their projects (and I'm optimistic that this remark will not get me kicked off this site). Without optimism, great projects would not be undertaken.

ozyrus on Ozyrus's Shortform

I don’t know if this is the place for it, but at some point it became impossible to open an article in a new tab from Chrome on iPhone - clicking on an article title from “all posts” just opens the article. Really ruins my LW reading experience. Couldn’t quickly find a way to send this feedback to the right place either, so I guess this is a quick take now.

review-bot on [New LW Feature] "Debates"

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

bruce-schechter on The Lens That Sees Its Flaws

Isn't the 20th century's apparent low death toll from homicide and war just a matter of percentages? The absolute number of deaths from these things is much greater in the 20th century. I think the absolute number matters too.