LessWrong 2.0 Reader


Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

[question] How are you preparing for the possibility of an AI bust?
Nate Showell · 2024-06-23T19:13:45.247Z · answers+comments (16)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

RLHF is the worst possible thing done when facing the alignment problem
tailcalled · 2024-09-19T18:56:27.676Z · comments (9)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

Talk: AI safety fieldbuilding at MATS
Ryan Kidd (ryankidd44) · 2024-06-23T23:06:37.623Z · comments (2)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke (l-f) · 2024-07-26T17:51:28.202Z · comments (4)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

[link] An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs
Adam Karvonen (karvonenadam) · 2024-06-25T15:57:16.872Z · comments (0)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (6)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (20)

[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)

How Congressional Offices Process Constituent Communication
Tristan Williams (tristan-williams) · 2024-07-02T12:38:41.472Z · comments (0)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

[link] The Living Planet Index: A Case Study in Statistical Pitfalls
Jan_Kulveit · 2024-06-24T10:05:55.101Z · comments (0)

[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (3)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

Population ethics and the value of variety
cousin_it · 2024-06-23T10:42:21.402Z · comments (11)

Distillation of 'Do language models plan for future tokens'
TheManxLoiner · 2024-06-27T20:57:34.351Z · comments (2)

A Basic Economics-Style Model of AI Existential Risk
Rubi J. Hudson (Rubi) · 2024-06-24T20:26:09.744Z · comments (3)

[link] Robert Caro And Mechanistic Models In Biography
adamShimi · 2024-07-14T10:56:42.763Z · comments (5)

[question] What percent of the sun would a Dyson Sphere cover?
Raemon · 2024-07-03T17:27:50.826Z · answers+comments (26)

Would you benefit from, or object to, a page with LW users' reacts?
Raemon · 2024-08-20T16:35:47.568Z · comments (6)

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.
sevdeawesome · 2024-07-01T05:50:49.498Z · comments (0)

AXRP Episode 34 - AI Evaluations with Beth Barnes
DanielFilan · 2024-07-28T03:30:07.192Z · comments (0)

[link] Managing Emotional Potential Energy
adamShimi · 2024-07-10T18:20:45.640Z · comments (4)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (1)

[question] Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic?
Dalcy (Darcy) · 2024-08-16T04:16:23.159Z · answers+comments (6)

Incentive Learning vs Dead Sea Salt Experiment
Steven Byrnes (steve2152) · 2024-06-25T17:49:01.488Z · comments (1)

The Garden of Eden
Alexander Turok · 2024-07-22T16:07:42.509Z · comments (2)

AI #77: A Few Upgrades
Zvi · 2024-08-20T00:20:09.717Z · comments (3)

GPT-3.5 judges can supervise GPT-4o debaters in capability asymmetric debates
Charlie George (charlie-george) · 2024-08-27T20:44:08.683Z · comments (7)


Recent comments

ricraz on The case for more Alignment Target Analysis (ATA)

It seems very plausible to me that alignment targets in practice will evolve out of things like the OpenAI Model Spec. If anyone has suggestions for how to improve that, please DM me.

cubefox on What's the Deal with Logical Uncertainty?

In theories of Bayesianism, the axioms of probability theory are conventionally assumed to say that all logical truths have probability one, and that the probability of a disjunction of logically inconsistent statements is the sum of their probabilities. These correspond to the second and third Kolmogorov axioms.

If one then e.g. regards the Peano axioms as certain, then all theorems of Peano arithmetic must also be certain, because those are just logical consequences. And all statements which can be disproved in Peano arithmetic then must have probability zero. So the above version of the Kolmogorov axioms is assuming we are logically omniscient. So this form of Bayesianism doesn't allow us to assign anything like 0.5 probability to the googolth digit of pi being odd: We must assign 1 if it's odd, or 0 if it's even.

I think the simple solution is to not talk about logical tautologies and contradictions when expressing the Kolmogorov axioms for a theory of subjective Bayesianism. Instead talk about what we actually know a priori, not about tautologies which we merely could know a priori (if we were logically omniscient). Then the second Kolmogorov axiom says that statements we actually know a priori have to be assigned probability 1, and disjunctions of statements actually known a priori to be inconsistent have to be assigned the sum of their probabilities.

Then we are allowed to assign probabilities less than 1 to statements where we don't actually know that they are tautologies, e.g. 0.5 to "the googolth digit of pi is odd" even if this happens to be, unbeknownst to us, a theorem of Peano arithmetic.
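As a rough sketch of the two readings discussed above, the axioms might be written as follows. The notation is an assumption made for illustration and does not come from the comment: $\models$ stands for logical truth, and $K$ for the set of statements the agent actually knows a priori.

```latex
\begin{align*}
% Conventional version (presupposes logical omniscience)
\models A \;&\Rightarrow\; P(A) = 1 \\
\models \neg (A \land B) \;&\Rightarrow\; P(A \lor B) = P(A) + P(B) \\[6pt]
% Proposed version: K is an assumed symbol for what is actually known a priori
A \in K \;&\Rightarrow\; P(A) = 1 \\
\neg (A \land B) \in K \;&\Rightarrow\; P(A \lor B) = P(A) + P(B)
\end{align*}
```

Under the second pair, a statement that is a theorem of Peano arithmetic but not in $K$ is not forced to have probability 1, which is what leaves room for assigning 0.5 to "the googolth digit of pi is odd".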

rotatingpaguro on On Measuring Intellectual Performance - personal experience and several thoughts

A very interesting problem is measuring something like general intelligence. I’m not going to delve deeply into this topic but simply want to draw attention to an idea that is often implied, though rarely expressed, in the framing of such a problem: the assumption that an "intelligence level," whatever it may be, corresponds to some inherent properties of a person and can be measured through their manifestations. Moreover, we often talk about measurements with a precision of a few percentage points, which suggests that, in theory, the measurement should be based on very stable indicators.
What’s fascinating is that this assumption receives very little scrutiny, while in cases where we talk about "mechanical" parameters of the human body (such as physical performance), we know that such parameters, aside from a person’s potential, heavily depend on numerous external factors and what that person has been doing over the past couple of weeks.

Do you know about the g-factor?

faul_sname on RLHF is the worst possible thing done when facing the alignment problem

Unfortunately, until the world has acted to patch up some terrible security holes in society, we are all in a very fragile state.

Agreed.

I have been working on AI Biorisk Evals with SecureBio for nearly a year now.

I appreciate that. I also really like the NAO project, which is also a SecureBio thing. Good work y'all!

As models increase in general capabilities, so too do they incidentally get more competent at assisting with the creation of bioweapons. It is my professional opinion that they are currently providing non-zero uplift over a baseline of 'bad actor with web search, including open-access scientific papers'.

Yeah, if your threat model is "AI can help people do more things, including bad things" that is a valid threat model and seems correct to me. That said, my world model has a giant gaping hole where one would expect an explanation for why "people can do lots of things" hasn't already led to a catastrophe (it's not like the bio risk threat model needs AGI assistance, a couple undergrad courses and some lab experience reproducing papers should be quite sufficient).

In any case, I don't think RLHF makes this problem actively worse, and it could plausibly help a bit though obviously the help is of the form "adds a trivial inconvenience to destroying the world".

Some people don't believe that we will get to Powerful AI Agents before we've already arrived at other world states that make it unlikely we will continue to proceed on a trajectory towards Powerful AI Agents.

If you replace "a trajectory towards powerful AI agents" with "a trajectory towards powerful AI agents that was foreseen in 2024 and could be meaningfully changed in predictable ways by people in 2024 using information that exists in 2024" that's basically my position.

sharmake-farah on RLHF is the worst possible thing done when facing the alignment problem

As someone who does disagree with this:

End-state-focused Value-aligned Highly-Effective-Optimizer AI Agents (hereafter Powerful AI Agents) with the primary value of win-at-all-costs are extremely dangerous.

I think the disagreement point is this:

Some say that intelligent agents being dangerously powerful is good actually, and won't be a risky situation

I have several cruxes/reasons for why I disagree:

1. I think I have quite a lot less probability on unrestrained AI conflict than @tailcalled does.
2. I disagree with the assumption of this:

But nobody has come up with an acceptable end goal for the world, because any goal we can come up with tends to want to consume everything, which destroys humanity.

Because I think that a lot of the reason the search came up empty is because people were attempting to solve the problem too far ahead of the actual risk.

Also, I think it matters a lot which specific utility function is in play here, analogously to how picking a specific number or function from a variable N tells you a lot more about what will happen than just reasoning about the variable in the abstract.

3. I think the world isn't as offense-dominant as LWers/EAs tend to think, and that while some level of offense outpacing defense is quite plausible, I don't think it's as extreme as "defense is essentially pointless."

christiankl on Inquisitive vs. adversarial rationality

You define your terms when you say:

An inquisitive (aka inquisitorial) legal system is one in which the judge acts as both judge and prosecutor, personally digging into the facts before ruling.

In the German system, digging into the facts before the ruling is part of the job of the judge. They are doing it from a neutral perspective, but digging into facts is part of what they are supposed to do. In Anglo-Saxon common law, on the other hand, it's the job of both parties to a legal case to lay out all the facts that they think support their side, and it's not the judge's job to dig into facts that neither of the sides presented.

raytaylor on Mental Health and the Alignment Problem: A Compilation of Resources (updated April 2023)

It was nice to see C S Lewis as a reminder we've kinda been here before.

One of the things which helped groups during the fight for the Nuclear Test Ban Treaty in the US was Joanna Macy's "despair work", which was developed from individual grief work.

Joanna started in intelligence and has been facing X-risks and slow actions of governments with others since the 1970s, and built a network of people doing that, and she still does. She did a lot in Chernobyl, and did some of the earliest longtermism and deep time work on nuclear waste storage.

Her despair work has been adapted for climate change and rainforest protection, so I'm sure it could be adapted for AI and other Xrisks/Srisks too, and even tougher goals like achieving universal veganism, instituting rational policymaking or "dealing with parents" ;-)

Trainers in despair work:
https://workthatreconnects.org/find-a-facilitator/, or ask me, or Dr Chris Johnstone for recommendations.

Trainer's Manual for groups (recommended):
- Coming Back to Life, Joanna Macy

Books:
- Despair and Empowerment in the Nuclear Age, Joanna Macy
- Active Hope, Chris Johnstone and Joanna Macy

More recent video:
(be ready to filter some of the 1970s vocabulary; they're both confident with intense emotion)

oklogic on My Critique of Effective Altruism

As a full-throated defender of pulling the lever (given traditional assumptions such as a lack of an audience, complete knowledge of each outcome, productivity of the people on the tracks), there are numerous issues with your proposals:

1.) Vague alternative: You seem to be pushing towards some form of virtue ethics/basic intuitionism, but there are numerous problems with this approach. Besides determining whose basic intuitions count and whose don't, or which virtues are important, there are very real problems when these virtues conflict. For instance, imagine you are walking at night, and are trying to cross a street. The sign says red, but no cars are around. Do you jaywalk? In this circumstance, one is forced to make a decision which pits two virtues/intuitions against each other. The beauty of utilitarianism is that it allows us to choose in these circumstances.

2.) Subjective Morality: Yes, utilitarianism may not be "objective" in the sense that there is no intrinsic reason to value human flourishing, but I believe utilitarianism to be the viewpoint which most closely conforms to what most people value. To illustrate why this matters, imagine you need to decide what color to paint a room. Nobody has very strong opinions, but most people in your household prefer the color blue. Yes, blue might not be "objectively" the best, but if most of the people in your household like the color blue the most, there is little reason not to choose it. We are all individually going to seek what we value, so we might as well collectively agree to a system which reflects the preferences of most people.

3.) Altruism in Disguise:

Another thing to notice is that virtue ethics can be a form of effective altruism when practiced in specific ways. In general, bettering yourself as a person by becoming more rational, less biased, etc., will in fact make the world a better place, and giving time to form meaningful relationships, engage in leisure, etc. can actually increase productivity in the long run.

You also seem to advocate for fundamental changes in society, changes I am not sure I would agree with, but if your proposed changes are indeed the best way to increase the general happiness of the population, it would be, by definition, the goal of the EA movement. I think a lot of people look at the recent stuff with SBF and AI research and come to think the EA movement is only concerned with lofty existential risk scenarios, but there is a lot more to it than that.

raemon on We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

I felt a bit confused by this bit:

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as being the result of two decoupled things: either a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth so it's less likely that the negative reward is evidence of my values in this case; the signal is explained away in the usual epistemic sense by some cause other than my values. So, I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.
That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?

I feel like this sort of makes sense but don't quite parse why this counted as "explaining away." How do I know my hardcoded reactions aren't values?

cubefox on What you know when you know nothing

Ludwig Wittgenstein tried to solve this problem in an a priori fashion, that is, without a refined model of the world, with a theory of "logical atomism". In the Tractatus he postulated that there must be "atomic" propositions. For example, the proposition that Bob is a bachelor is clearly not atomic (but complex) because it can be decomposed into the proposition that Bob is a man and that Bob is unmarried. And those are arguably themselves sort of complex statements, since the concepts of a man and of marriage themselves allow for definitions from simpler terms. But at some point, we hit undefinable, primitive terms, presumably those which refer directly to particular sense data or the like. Wittgenstein then argued that these atomic propositions have to be regarded as being independent and having probability 1/2.

Or more precisely, he came up with the concept of truth tables, and counted the fraction of the rows in which the conditions of the truth tables are satisfied. Each row he assumed to be a priori equally likely when only atomic propositions are involved. So for atomic propositions P and Q, the complex proposition "P and Q" has probability 1/4 (only one out of four rows makes this proposition true, namely "true and true"), and the complex proposition "P or Q" has probability 3/4 (three out of four rows in the truth table make a disjunction true: all except "false or false").

This turns out to be equivalent to assuming that all atomic propositions have a) probability 1/2 and are b) independent of each other.
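The row-counting rule described above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `truth_table_probability` and the choice to represent propositions as Python functions of Booleans are assumptions made here, not anything from the Tractatus or the comment.

```python
from fractions import Fraction
from itertools import product


def truth_table_probability(formula, num_atoms):
    """Return the fraction of truth-table rows on which `formula` is true.

    `formula` takes `num_atoms` Booleans, one per atomic proposition.
    Every row is treated as equally likely a priori.
    """
    rows = list(product([True, False], repeat=num_atoms))
    satisfied = sum(1 for row in rows if formula(*row))
    return Fraction(satisfied, len(rows))


# Two atomic propositions P and Q.
p_alone = truth_table_probability(lambda p, q: p, 2)
q_alone = truth_table_probability(lambda p, q: q, 2)
p_and_q = truth_table_probability(lambda p, q: p and q, 2)
p_or_q = truth_table_probability(lambda p, q: p or q, 2)

print(p_alone, q_alone)  # 1/2 1/2
print(p_and_q)           # 1/4
print(p_or_q)            # 3/4

# Independence: the probability of the conjunction equals the product.
print(p_and_q == p_alone * q_alone)  # True
```

Counting rows this way gives each atom probability 1/2 and makes the conjunction's probability the product of the atoms' probabilities, which is the equivalence stated above.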

However, logical atomism was later broadly abandoned for various reasons. One is that it is hard to define what an atomic proposition is. For example, I can't assume that "this particular spot in my visual field is blue" is atomic, because it is incompatible with the statement "this particular spot in my visual field is yellow". The same spot can't be both blue and yellow, even though that wouldn't be a logical contradiction. The two statements are therefore not independent, so they can't be atomic. But it is hard to think of anything more primitive than qualia predicates like colors.