LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

AI #49: Bioweapon Testing Begins
Zvi · 2024-02-01T15:30:04.690Z · comments (11)

[link] An AI Manhattan Project is Not Inevitable
Maxwell Tabarrok (maxwell-tabarrok) · 2024-07-06T16:42:35.920Z · comments (25)

Review Report of Davidson on Takeoff Speeds (2023)
Trent Kannegieter · 2023-12-22T18:48:55.983Z · comments (11)

AI #66: Oh to Be Less Online
Zvi · 2024-05-30T14:20:03.334Z · comments (6)

The Defence production act and AI policy
[deleted] · 2024-03-01T14:26:09.064Z · comments (0)

[link] The Hippie Rabbit Hole -Nuggets of Gold in Rivers of Bullshit
Jonathan Moregård (JonathanMoregard) · 2024-01-05T18:27:01.769Z · comments (20)

Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs
Burny · 2023-11-23T03:16:09.358Z · comments (25)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition
cmathw · 2024-04-08T11:14:43.268Z · comments (4)

On DeepMind’s Frontier Safety Framework
Zvi · 2024-06-18T13:30:21.154Z · comments (4)

Turning Your Back On Traffic
jefftk (jkaufman) · 2024-07-17T01:00:08.627Z · comments (7)

My best guess at the important tricks for training 1L SAEs
Arthur Conmy (arthur-conmy) · 2023-12-21T01:59:06.208Z · comments (4)

AI companies' commitments
Zach Stein-Perlman · 2024-05-29T11:00:31.339Z · comments (0)

Please Bet On My Quantified Self Decision Markets
niplav · 2023-12-01T20:07:38.284Z · comments (6)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

(Appetitive, Consummatory) ≈ (RL, reflex)
Steven Byrnes (steve2152) · 2024-06-15T15:57:39.533Z · comments (1)

Good job opportunities for helping with the most important century
HoldenKarnofsky · 2024-01-18T17:30:03.332Z · comments (0)

Closeness To the Issue (Part 5 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-09T00:36:47.388Z · comments (0)

[link] Searching for the Root of the Tree of Evil
Ivan Vendrov (ivan-vendrov) · 2024-06-08T17:05:53.950Z · comments (14)

[link] Learning coefficient estimation: the details
Zach Furman (zfurman) · 2023-11-16T03:19:09.013Z · comments (0)

AI Safety Camp final presentations
Linda Linsefors · 2024-03-29T14:27:43.503Z · comments (3)

Childhood and Education Roundup #5
Zvi · 2024-04-17T13:00:03.015Z · comments (4)

Deeply Cover Car Crashes?
jefftk (jkaufman) · 2023-12-10T22:20:01.133Z · comments (31)

[link] Who is Sam Bankman-Fried (SBF) really, and how could he have done what he did? - three theories and a lot of evidence
spencerg · 2023-11-11T01:04:22.747Z · comments (28)

A Socratic dialogue with my student
lsusr · 2023-12-05T09:31:05.266Z · comments (14)

[link] Toki pona FAQ
dkl9 · 2024-03-17T21:44:21.782Z · comments (8)

[link] Scaling laws for dominant assurance contracts
jessicata (jessica.liu.taylor) · 2023-11-28T23:11:07.631Z · comments (5)

An anti-inductive sequence
Viliam · 2024-08-14T12:28:54.226Z · comments (9)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (1)

Debate: Is it ethical to work at AI capabilities companies?
Ben Pace (Benito) · 2024-08-14T00:18:38.846Z · comments (21)

[question] What are your cruxes for imprecise probabilities / decision rules?
Anthony DiGiovanni (antimonyanthony) · 2024-07-31T15:42:27.057Z · answers+comments (29)

[link] Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Soroush Pour (soroush-pour) · 2023-11-07T17:59:36.857Z · comments (2)

The Evolution of Humans Was Net-Negative for Human Values
Zack_M_Davis · 2024-04-01T16:01:10.037Z · comments (1)

[link] "Model UN Solutions"
Arjun Panickssery (arjun-panickssery) · 2023-12-08T23:06:33.490Z · comments (5)

[link] Claude 3 Opus can operate as a Turing machine
Gunnar_Zarncke · 2024-04-17T08:41:57.209Z · comments (2)

Mental Masturbation and the Intellectual Comfort Zone
Declan Molony (declan-molony) · 2024-05-07T05:47:05.257Z · comments (2)

[link] How To Socialize With Psycho(logist)s
Sable · 2023-10-20T11:33:46.066Z · comments (11)

AI #47: Meet the New Year
Zvi · 2024-01-13T16:20:10.519Z · comments (7)

Comparing representation vectors between llama 2 base and chat
Nina Panickssery (NinaR) · 2023-10-28T22:54:37.059Z · comments (5)

Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-01-02T18:15:54.168Z · comments (0)

AI #34: Chipping Away at Chip Exports
Zvi · 2023-10-19T15:00:03.055Z · comments (19)

[link] ∀: a story
Richard_Ngo (ricraz) · 2023-12-17T22:42:32.857Z · comments (1)

[question] Snapshot of narratives and frames against regulating AI
Jan_Kulveit · 2023-11-01T16:30:19.116Z · answers+comments (19)

We are already in a persuasion-transformed world and must take precautions
trevor (TrevorWiesinger) · 2023-11-04T15:53:31.345Z · comments (14)

On Dwarkesh’s 3rd Podcast With Tyler Cowen
Zvi · 2024-02-02T19:30:05.974Z · comments (9)

[question] What is an "anti-Occamian prior"?
Zane · 2023-10-23T02:26:10.851Z · answers+comments (22)

Protocol evaluations: good analogies vs control
Fabien Roger (Fabien) · 2024-02-19T18:00:09.794Z · comments (10)

Dangers of Closed-Loop AI
Gordon Seidoh Worley (gworley) · 2024-03-22T23:52:22.010Z · comments (7)

The "context window" analogy for human minds
Ruby · 2024-02-13T19:29:10.387Z · comments (0)

My Detailed Notes & Commentary from Secular Solstice
Jeffrey Heninger (jeffrey-heninger) · 2024-03-23T18:48:51.894Z · comments (16)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

avturchin on What are the best arguments for/against AIs being "slightly 'nice'"?

I am argue against relation of necessity which EY suggested, but do not argue against the idea that "ASI will kill us all" is highly likely. So there is no actionable difference. My my arguments need to be only to demonstrate possibility but not necessity to undermine the necessity of the outcome that ASI will kill us.

Also metaphors has limited proving power as they often miss some details. Both true for Arnault and jail. In jail, there is a high authority which prevents most inmates to kill each other until only one survive. However, even in some jails people who committed special types of crimes are prosecuted by other inmates. For example, in female jails, women who killed minors may be killed by other women.

So it is better not to go in jail if your perform a type of crime which other inmates hate. The same way may not want to go in space full of other ASIs if it performed the crime which other ASIs may hate: killing its creators.

Other your arguments are about possibilities, and they can be converted in expected utilities for ASI. It may ask itself: do I want to take 0.1 per cent increase of the risk of space war in the future for 0.000000000001 increase of amount of available atoms? If it starts reason this way, it is likely that it will prefer to reduce the risk of space war.

We can increase our chances by making precommitment now (Rolf Nelson attack for space wars) to go to war with future non-aligned AI in space. Nelson suggested similar precomitment attack for simulation: we precommit now to simulate all possible hostile AIs and turn them off, if they will not play as if they are aligned; as any young ASI may not be sure if it simulation of not, it may prefer to play aligned.

Will ASI play 4D chess and ignore the acasual threat just to punish everyone who do it? I am not sure. There could be 5D chess level where ASI will find benefitial to comply only to [don't know what].

Anyway predicting ASI behavior is difficult and any claims about it can't be certain. So we can't be certain that ASI will kill us.

measure on Sherrinford's Shortform

Did you switch to the markdown editor?

steve2152 on [Intuitive self-models] 2. Conscious Awareness

it's clearly not necessary for your theory to cover this topic if you don't find it very interesting and/or you have other objectives

I’m trying to make a stronger statement than that. I think the kind of theory you’re looking for doesn’t exist. :)

I don't think we did run into any edge cases of representation so far where something partially represents or is partially represented

In this comment [LW(p) · GW(p)] you talked about a scenario where a perfect map of London was created by “an (extremely unlikely) accident”. I think that’s an “edge case of representation”, right? Just as a virus is an edge case of alive-ness. And (I claim) that arguing about whether viruses are “really” alive or not is a waste of time, and likewise arguing whether accidental maps “really” represent London or not is a waste of time too.

Again, a central case of representation would be:

(A) The map has lots and lots of features that straightforwardly correspond to features of London
(B) The reason for (A) is that the map was made by an optimization process [LW · GW] that systematically (though imperfectly) tends to lead to (A), e.g. the map was created by a human map-maker who was trying to accurately map London
(C) The map is in fact useful for navigating London

I claim that if you run a predictive learning algorithm, that’s a valid (B), and we can call the trained generative model a “map”. If we do that…

You can pick out places where all three of (A-C) are very clearly applicable, and when you find one of those places, you can say that there’s some map-territory correspondence / representation there. Those places are clear-cut central examples, just as a cat is a clear-cut central example of alive-ness.
There are other bits of the map that clearly don’t correspond to anything in the territory, like a drug-induced hallucinated ghost that the person believes to be really standing in front of them. That’s a clear-cut non-example of map-territory correspondence, just as a rock is a clear-cut non-example of alive-ness.
And then there are edge cases, like the Gettier problem, or vague impressions, or partially-correct ideas, or hearsay, or vibes, or Scott/Yvain, or latent variables that are somehow helpful for prediction but in a very hard-to-interpret and indirect way, or habits, etc. Those are edge-cases of map-territory correspondence. And arguing about what if anything they’re “really” representing is a waste of time, just as arguing about whether a virus is “really” alive is a waste of time, right? In those cases, I claim that it’s useful to invoke the term “map-territory correspondence” or “representation” as part of a longer description of what’s happening here, but not as a bedrock ground truth that we’re trying to suss out.

neil-warren on Gell-Mann checks

Can I piggy-back off your conclusions so far? Any news you find okay?

anthonyc on on Science Beakers and DDT

I partially agree with @ulyssessword [LW · GW] that while this article has valuable insights, it overstates the case.

What Civ gets "wrong" (by making it a simple game mechanic) is treating basic and applied research, on platform and specific technologies, work the same way, and also by letting the player know in advance how many Science Beakers lie between here and there for each.

If you know what problem you need to solve, and the general shape of what a solution could look like, you can throw more money and minds at it and get there faster than you counterfactually would.

If not, then more minds and money spent broadly will still directionally produce more research which can be turned into more innovations, but you won't be able to say in advance how many turns faster you'll get to any particular invention, or what order you'll get them in.

You don't have literally zero knowledge, you can be usefully clever, but you don't have Civ-like control. Cities and countries with better laws, infrastructure, business culture, ESOs, investors, and baseline wealth levels really do reliably generate and roll out more and better Technology (TM) across the board, and this replicates to a significant degree, and where it fails it's often pretty easy to do a post- (or, yes, pre-) mortem to figure out what did (will) go wrong and how to do better.

In other words, research turns money into knowledge, and innovation turns knowledge into money; both processes are error prone and inefficient, but there is still a range of better and worse ways to manage each; and doing more of either will put you in a position to do more of the other.

jett-janiak on Characterizing stable regions in the residual stream of LLMs

I'm not familiar with this interpretation. Here's what Claude has to say (correct about stable regions, maybe hallucinating about Hopfield networks)

This is an interesting question that connects the findings in the paper to broader theories about how transformer models operate. Let me break down my thoughts:
The paper's findings and the Hopfield network interpretation of self-attention are not directly contradictory, but they're not perfectly aligned either. Let's examine this in more detail:
The paper's key findings:
The residual stream of trained transformers contains stable regions where small activation changes lead to minimal output changes.
These regions emerge during training and become more defined as training progresses or model size increases.
The regions appear to correspond to semantic distinctions.
The Hopfield network interpretation of self-attention:
Self-attention layers in transformers can be seen as performing energy-based updates similar to those in continuous Hopfield networks.
This view suggests that self-attention is performing pattern completion or error correction, moving activations towards learned attractor states.
Now, let's consider how these might relate:
Consistency with stable regions: The existence of stable regions in the residual stream could be consistent with the idea of attractor states in a Hopfield-like network. The stable regions might correspond to basins of attraction around these states.
Emergence during training: The paper observes that stable regions emerge and become more defined during training. This aligns well with the Hopfield network interpretation, as training would refine the attractor landscapes.
Semantic correspondence: The paper suggests that stable regions correspond to semantic distinctions. This is compatible with the Hopfield network view, where different attractor states could represent different semantic concepts or categories.
Sharp transitions: The paper observes sharp transitions between stable regions. This is somewhat less aligned with the typical continuous Hopfield network dynamics, which often show smoother transitions. However, it's not necessarily inconsistent, as the observed behavior could result from complex interactions across multiple layers.
Scale of regions: The paper suggests that these stable regions are much larger than previously studied polytopes. This might pose some challenges for the Hopfield network interpretation, as it implies a different granularity of representation than might be expected.

anthonyc on Gell-Mann checks

To be fair, there are exceptions, and some reporters or publications consistently do better at oarticular kinds of reporting. It just takea a lot of work to reliably figure out which are which.

steve2152 on [Intuitive self-models] 2. Conscious Awareness

I could have cut §2.4. It’s not particularly important for later posts in the series. I thought about it. But it was pretty short, and I judged that it might be marginally useful all things considered.

Even if you lack the §2.4.2 intuition linking amnesia to I-must-have-been-unconscious, other people certainly have that intuition, e.g. Clive Wearing “constantly believes that he has only recently awoken from a comatose state”. I thought it was worth saying why someone might find that to be intuitively appealing, even if they simultaneously have other countervailing intuitions.

neil-warren on Gell-Mann checks

Well then, I can update a little more in the direction not to trust this stuff.

charlie-steiner on [Intuitive self-models] 2. Conscious Awareness

I am confused about the purpose of the "awareness and memory" section, and maybe disagree with the intuition said to be obvious in the second subsection. Is there some deeper reason you want to bring up how we self-model memory / something you wanted to talk about there that I missed?