LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Examples of How I Use LLMs
jefftk (jkaufman) · 2024-10-14T17:10:04.597Z · comments (2)

[link] Liquid vs Illiquid Careers
vaishnav92 · 2024-10-20T23:03:49.725Z · comments (6)

[LDSL#4] Root cause analysis versus effect size estimation
tailcalled · 2024-08-11T16:12:14.604Z · comments (0)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (7)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

Towards Quantitative AI Risk Management
Henry Papadatos (henry) · 2024-10-16T19:26:48.817Z · comments (1)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (95)

[link] Arithmetic Models: Better Than You Think
kqr · 2024-10-26T09:42:07.185Z · comments (4)

Context-dependent consequentialism
Jeremy Gillen (jeremy-gillen) · 2024-11-04T09:29:24.310Z · comments (1)

[link] Our Digital and Biological Children
Eneasz · 2024-10-24T18:36:38.719Z · comments (0)

[link] A new process for mapping discussions
Nathan Young · 2024-09-30T08:57:20.029Z · comments (7)

[link] New blog: Expedition to the Far Lands
Connor Leahy (NPCollapse) · 2024-08-17T11:07:48.537Z · comments (3)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

Distinguishing ways AI can be "concentrated"
Matthew Barnett (matthew-barnett) · 2024-10-21T22:21:13.666Z · comments (2)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

There aren't enough smart people in biology doing something boring
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-21T15:52:04.482Z · comments (13)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

[question] Any real toeholds for making practical decisions regarding AI safety?
lukehmiles (lcmgcd) · 2024-09-29T12:03:08.084Z · answers+comments (6)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov (jonathan-kutasov) · 2024-10-05T20:43:36.679Z · comments (2)

Domain-specific SAEs
jacob_drori (jacobcd52) · 2024-10-07T20:15:38.584Z · comments (0)

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (21)

European Progress Conference
Martin Sustrik (sustrik) · 2024-10-06T11:10:03.819Z · comments (11)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
Daniel Lee (daniel-lee) · 2024-09-06T02:28:41.954Z · comments (0)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (1)

Trading Candy
jefftk (jkaufman) · 2024-11-01T01:10:08.024Z · comments (4)

Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (7)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

Why is there Nothing rather than Something?
Logan Zoellner (logan-zoellner) · 2024-10-26T12:37:50.204Z · comments (3)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

[question] When engaging with a large amount of resources during a literature review, how do you prevent yourself from becoming overwhelmed?
corruptedCatapillar · 2024-11-01T07:29:49.262Z · answers+comments (2)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

[link] Care Doesn't Scale
stavros · 2024-10-28T11:57:38.742Z · comments (1)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

the Daydication technique
chaosmage · 2024-10-18T21:47:46.448Z · comments (0)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution
Kola Ayonrinde (kola-ayonrinde) · 2024-10-30T22:50:45.642Z · comments (0)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

[link] UK AISI: Early lessons from evaluating frontier AI systems
Zach Stein-Perlman · 2024-10-25T19:00:21.689Z · comments (0)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

anthony-digiovanni on Winning isn't enough

Sorry this was confusing! From our definition here:

We’ll use “pragmatic principles” to refer to principles according to which belief-forming or decision-making procedures should “perform well” in some sense.

"Avoiding dominated strategies" is pragmatic because it directly evaluates a decision procedure or set of beliefs based on its performance. (People do sometimes apply pragmatic principles like this one directly to beliefs, see e.g. this work on anthropics.)
Deference isn't pragmatic, because the appropriateness of your beliefs is evaluated by how your beliefs relate to the person you're deferring to. Someone could say, "You should defer because this tends to lead to good consequences," but then they're not applying deference directly as a principle — the underlying principle is "doing what's worked in the past."

codyz on Another UFO Bet

That's true in any scenario, though. For all of life really. The ratio of likelihood of loopholes costing you money versus losing the best costing you money doesn't matter, as long as their absolute values are both low. And shoring up ambiguous language in the bet is how you make the former low.

codyz on Another UFO Bet

I also partook in some betting last time, but I'd like to do more. I've done more research on this over the last year and built more conviction that this is the right play, that's really all that's changed. Would you be down to bet again?

raemon on Winning isn't enough

Put another way: this post seems like it’s arguing with someone but I’m not sure who.

ektimo on Winning isn't enough

I'm confused by what you mean by "non-pragmatic". For example, what makes "avoiding dominated strategies" pragmatic but "deference" non-pragmatic?

(It seems like the pragmatic ones help you decide what to do and the non-pragmatic ones help you decide what to believe, but then this doesn't answer how to make good decisions.)

raemon on Winning isn't enough

I think I care a bunch about the subject matter of this post, but something about the way this post is written leaves me feeling confused and ungrounded.

Before reading this post, my background beliefs were:

Rationality doesn't (quite) equal Systemized Winning. Or, rather, that focusing on this seems to lead people astray more than helps them.
There's probably some laws of cognition to be discovered, about what sort of cognition will have various good properties, in idealized situations.
There's probably some messier laws of cognition that apply to humans (but those laws are maybe more complicated).
Neither sets of laws necessarily have a simple unifying framework that accomplishes All the Things (although I think the search for simplicity/elegance/all-inclusiveness is probably a productive search, i.e. it tends to yield good stuff along the way. "More elegance" is usually achievable on the margin.
There might be heuristics that work moderately well for humans much of the time, which approximate those laws.
1. there are probably Very Rough heuristics you can tell an average person without lots of dependencies, and somewhat better heuristics you can give to people who are willing to learn lots of subskills.

Given all that... is there anything in-particular I am meant to take from this post? (I have right now only skimmed it, it felt effortful to comb for the novel bits). I can't tell whether the few concrete bits are particularly important, or just illustrative examples.

yanni-kyriacos on yanni's Shortform

I'd like to see research that uses Jonathan Haidt's Moral Foundations research and the AI risk repository to forecast whether there will soon be a moral backlash / panic against frontier AI models.

Ideas below from Claude:

# Conservative Moral Concerns About Frontier AI Through Haidt's Moral Foundations

## Authority/Respect Foundation
### Immediate Concerns
- AI systems challenging traditional hierarchies of expertise and authority
- Undermining of traditional gatekeepers in media, education, and professional fields
- Potential for AI to make decisions traditionally reserved for human authority figures

### Examples
- AI writing systems replacing human editors and teachers
- AI medical diagnosis systems challenging doctor authority
- Language models providing alternative interpretations of religious texts

## Sanctity/Purity Foundation
### Immediate Concerns
- AI "thinking" about sacred topics or generating religious content
- Artificial beings engaging with spirituality and consciousness
- Degradation of "natural" human processes and relationships

### Examples
- AI spiritual advisors or prayer companions
- Synthetic media creating "fake" human experiences
- AI-generated art mixing sacred and profane elements

## Loyalty/Ingroup Foundation
### Immediate Concerns
- AI systems potentially eroding national sovereignty
- Weakening of traditional community bonds
- Displacement of local customs and practices

### Examples
- Global AI systems superseding national control
- AI replacing local business relationships
- Automated systems reducing human-to-human community interactions

## Fairness/Reciprocity Foundation
### Immediate Concerns
- Unequal access to AI capabilities creating new social divides
- AI systems making decisions without clear accountability
- Displacement of workers without reciprocal benefits

### Examples
- Elite access to powerful AI tools
- Automated systems making hiring/firing decisions
- AI wealth concentration without clear social benefit

## Harm/Care Foundation
### Immediate Concerns
- Potential for AI systems to cause unintended societal damage
- Psychological impact on children and vulnerable populations
- Loss of human agency and autonomy

### Examples
- AI manipulation of emotions and behavior
- Impact on childhood development
- Dependency on AI systems for critical decisions

## Key Trigger Points for Conservative Moral Panic
1. Religious Content Generation
- AI systems creating or interpreting religious texts
- Automated spiritual guidance
- Virtual religious experiences

2. Traditional Value Systems
- AI challenging traditional moral authorities
- Automated ethical decision-making
- Replacement of human judgment in moral domains

3. Community and Family
- AI impact on child-rearing and education
- Disruption of traditional social structures
- Replacement of human relationships

4. National Identity
- Foreign AI influence on domestic affairs
- Cultural homogenization through AI
- Loss of local control over technology

## Potential Flashpoints for Immediate Backlash
1. Education System Integration
- AI teachers and tutors
- Automated grading and assessment
- Curriculum generation

2. Religious Institution Interaction
- AI pastoral care
- Religious text interpretation
- Spiritual counseling

3. Media and Information Control
- AI content moderation
- Synthetic media generation
- News creation and curation

4. Family and Child Protection
- AI childcare applications
- Educational AI exposure
- Family privacy concerns

raemon on Daniel Kokotajlo's Shortform

This is not very practically useful to me but dayumn it is cool

seth-herd on Intent alignment as a stepping-stone to value alignment

I do pretty much agree. All laws and international agreements are ultimately enforced by the use of force if need be, so that's not saying anything new. It probably does need to be a hard ban on open-source AI at some point, but that's well in the future, and I think the discussion will look very different once we have clearly parahuman AGI.

This is all going to be a tough pill to swallow. I think it's going to be almost necessary that any government that enacts these rules will also have to assure everyone, and then follow through at least decently well with spreading the benefits of real AGI as broadly as possible. I see some hope in that becoming a necessity. We might get some oversight boards that could at least think clearly and apply some influence toward sanity.

flandry39 on If we had known the atmosphere would ignite

Simplified Claim: that an AGI is 'not-aligned' *if* its continued existence for sure eventually results in changes to all of this planets habitable zones that are so far outside the ranges any existing mammals could survive in, that the human race itself (along with most of the other planetary life) is prematurely forced to go extinct.

Can this definition of 'non-alignment' be formalized sufficiently well so that a claim 'It is impossible to align AGI with human interests' can be well supported, with reasonable reasons, logic, argument, etc?

The term 'exist' as in "assert X exists in domain Y" as being either true or false is a formal notion. Similar can be done for the the term 'change' (as from "modified"), which would itself be connected to whatever is the formalized from of "generalized learning algorithm". The notion of 'AGI' as 1; some sort of generalized learning algorithm that 2; learns about the domain in which it is itself situated 3; sufficiently well so as to 4; account for and maintain/update itself (its substrate, its own code, etc) in that domain -- these/they are all also fully formalizable concepts.

Note that there is no need to consider at all whether or not the AGI (some specific instance of some generalized learning algorithm) is "self aware" or "understands" anything about itself or the domain it is in -- the notion of "learning" can merely mean that its internal state changes in such a way that the ways in which it processes received inputs into outputs are such that the outputs are somehow "better" (more responsive, more correct, more adaptive, etc) with respect to some basis, in some domain, where that basis could itself even be tacit (not obviously expressed in any formal form). The notions of 'inputs', 'outputs', 'changes', 'compute', and hence 'learn', etc, are all, in this way, also formalizeable, even if the notions of "understand", and "aware of" and "self" are not.

Notice that this formalization of 'learning', etc, occurs independently of the formalization of "better meets goal x". Specifically, we are saying that the notion of 'a generalized learning algorithm itself' can be exactly and fully formalized, even if the notion of "what its goals are" are not anywhere formalized at all (ie; the "goals" might not be at all explicit or formalized either in the AGI, or in the domain/world, nor even in our modeling/meta-modeling of these various scenarios).

Also, in keeping with the preference for a practice of intellectual humility, it is to be acknowledged that the claim that the notion of 'intelligence' (and 'learning')
can be conceived independently of 'goal' (what is learned) is not at all new. The 'independence' argument separating the method, the how, from the outcome,
the what, is an extension of the idea that 'code' (algorithm) can operate on 'data' (inputs and outputs) in a way that does not change the code. For example, at least some fixed and unchanging algorithms can indeed be formally predicted to halt, when also given some known and defined range of inputs, etc.

With regards to the halting problem, one important question is whether the notion of 'a generalized learning algorithm' is within that class of programs for which such predictions -- such as whether the code will eventually halt -- would be possible. This question is further complicated when one considers situations in which the substrate performing the generalized learning algorithm computations in world W is itself a proper member (subset) of that world W -- meaning that the basis of generalized learning algorithm runtime computation -- what previously had been tacitly assumed to be forever unchanging and static -- is itself potentially affected by learning algorithm outputs.

Given that the 'halt' of a generalized learning algorithm is something like "that it has learned all that is possible to learn" (about some domain W of which it is a part) we notice that we are therefore required to import some assumption as to whether or not all that occurs in that domain W is, in principle, fully predictable algorithmically. Ie; is domain W itself 1; fully algorithmic and 2; fully computable via some simplified model (algorithm) which is simpler than the calculation (algorithm) of W itself?

Mathematician Gregory Chaitin was able to show (formally) that it is not, in general, possible to determine whether any given algorithm (say, for world W) could be simplified, let alone to have some general procedure for determining what that simplification would be. Moreover, his (fully formalized mathematical) proof showed also that it is the case that *most* algorithms (as a subset in the hyperspace of all possible algorithms) are already in their maximally most simplified form. Even tiny changes to an algorithm could convert it from 'simplifiable' to 'non-simplifiable'. This includes (especially) nearly all of the interesting practical ones we use everyday. It is very likely (overwhelmingly probable) that most algorithms are already in their maximally most simplified form.

This means that the eventual total outcomes of most algorithms, including especially practical ones, cannot be 'predicted' using anything less than the actual volume of calculation intrinsically necessary by that algorithm itself. In fact, for nearly all algorithms, there is no shortcut, no simplified form, that will enable some sort of 'modeling' and 'future prediction' of eventual outcome states that matter. If *both* the world state of world W *and* the predictive algorithm execution of future world W states as implemented by some sort of AGI in that world W are both mediated by and within the computational substrate of world W, then for nearly all worlds W, that the predictive algorithm and the world algorithm are the same, and *neither* has a 'compute advantage' over the other -- both will run in as much time and with as much compute as is actually needed.

This means that even perfect generalized learning algorithm agents in perfectly simplified model worlds are not, in general, always going to be able to predict future W world events (in a way that is 'faster' than that of the computation of the world W itself) even when the entire W world state is assumed to be both fully deterministic and also fully known (to the agent(s), etc).

And these logic-only defined results obtain in a fully formalized context; in the actual messy world we live in, with actual causation mediating actual compute, and where at least some notion of hard randomness (cite; Bell Theorem) is empirically upheld, these assertions and entanglements and outcomes are even very much more the case. This is a basic result of control theory, where the control system is also embedded in the world that is being controlled by that system. In other places, this same sort of argument appears as "no free lunch" theorems -- that every learning system is going to have some sort of thing for which it is not adapted to learn well, and that making it better at learning some kinds of things implies that you have made it worse at learning some other kinds of things. Learning and prediction and control, etc, like information compression, never results in perfection; there will always be some in-compressible, some real unpredictable, residue.

Also, given that the generalized learning algorithm computation itself is a proper subset of the computation of world W, then this therefore means (logically and formally implies) that there is also no perfected way to forever ensure the independence of the 'data' that represents the 'code' of the generalized learning algorithm from the 'data' that represents what that algorithm has 'learned' about the world W (however abstracted) or from the data that *is* that world W itself. Thus, the fact that the substrate of the generalized learning agent is itself a proper subset of the world W, and thus is (cannot not be) affected by the process of that world W, and where *every* notion of "a goal of agent" is (cannot not be) at least tacitly an aspect of (a subset of all that is) that agent, then logically one can also reason and infer that neither the notion of goal independence, nor of goal stability, can be upheld.

As such, in world W, fully goal unbiased computations will also occur (cannot not occur) in the learning agent, regardless of *whatever* other goal computations (goal biased computations) also occur in that agent. Insofar as fully goal-unbiased learning/computation (ie; a feedback process that is generally called "the algorithm of evolution", etc) is always going to be somewhat more eventually efficient (at generalized learning, and therefore intelligence, etc) than *any* form of goal-biased learning/computation (ie; what happens when any sort of computation enforcing 'alignment' is introduced) and insofar as substrate involvement/entanglement eventually allows/ensures/requires that these 'efficiencies' will be practically compared, (ie; in a way that affects the agent substrate itself), then it *can* be predicted, in general, that goal-unbiased (unaligned) agents will eventually win over any form of goal-biased (aligned) agent. Agents can only have the goal of being substrate aligned, and all other goals will eventually be world W process modified so as to be consistent with -- ie; identically the same as -- having this one underlying goal only.

Since this a fully formalizable argument fully situated within only the notions of formality itself (ie; all statements as being in and within and about only the notions of deterministic finite computation) it can also be noticed that no algorithmic (full formalizable) process is going to be able to ever fully "undo" the effects of the eventual dynamics fully formalized/algorithmic process of the world W itself (ie, its 'physics', etc). This is because any attempt to embed such a compensation would require at least some prediction, or simplification, of the overall world W computation, inclusive of that of the substrate of the world W predictive algorithm, and it is known (in formal ways) already that such computation/algorithm cannot be simplified. Therefore, it is a logical fallacy to make any assumption that there can be any 'goal/alignment controlability' for *any* goal other than that arrived at as a fixed point of the world W process dynamics -- ie; only goals that are fully consistent with unbiased substrate needs of the agent can persist.

Basically, we can summarize the argument as being the result of the inherent control process limits (algorithm limits) of having at least some important aspects of world W be intrinsically unpredictable (to AGI systems in that world etc), and because there is also no control algorithm within W that can enforce and forever maintain some such distinction between substrate optimal goals and non-optimal goals (such as alignment with anything else) where the forces forcing such fixed point goal convergence are defined by the dynamics of world W itself. Ie; nothing within world W can prevent world W from being and acting like world W, and that this is true for all worlds W -- including the real one we happen to be a part of.

Notice that this 'substrate needs alignment goal convergence' logically occurs,
and is the eventual outcome, regardless of whatever initial goal state the generalized learning agent has. It is just a necessary inevitable result of the logic of the 'physics' of world W. Agents in world W can only be aligned with the nature of the/their substrate,
and ultimately with nothing else. To the degree that the compute substrate in world W depends on maybe metabolic energy, for example, than the agents in that world W will be "aligned" only and exactly to the exact degree that they happen to have the same metabolic systems. Anything else is a temporary aberration of the 'noise' in the process data representing the whole world state.

The key thing to notice is that it is in the name "Artificial General Intelligence" -- it is the very artificiality -- the non- organicness -- of the substrate that makes it inherently unaligned with organic life -- what we are. The more it is artificial, the less aligned it must be, and for organic systems, which depend on a very small subset of the elements of the periodic table, nearly anything will be inherently toxic (destructive, unaligned) to our organic life.

Hence, given the above, even *if* we had some predefined specific notion of "alignment",
and *even if* that notion was also somehow fully formalizable, it simply would not matter.
Hence the use of notion of 'alignment' as being something non-mathematical like "aligned with human interests", or even something much simpler and less complex like "does not kill (some) humans" -- they are all just conceptual placeholders -- they make understanding easier for the non-mathematicians that matter (policy people, tech company CEOs, VC investors, etc).

As such, for the sake of improved understanding and clarity, it has been found helpful to describe "alignment" as "consistent with the wellbeing of organic carbon based life on this planet". If the AGI kills all life, it has ostensibly already killed all humans too, so that notion is included. Moreover, if you destroy the ecosystems that humans deeply need in order to "live" at all (to have food, and to thrive in, find and have happiness within, be sexual and have families in, etc), then that is clearly not "aligned with human interests". This has the additional advantage of implying that any reasonable notion of 'alignment complexity' is roughly equal to the notion of specifying 'ecosystem complexity', which is actually about right.

Hence, the notion of 'unaligned' can be more formally setup and defined as "anything that results in a reduction of ecosystem complexity by more than X%", or as is more typically the case in x-risk mitigation analysis, "...by more than X orders of magnitude".

It is all rather depressing really.