Posts

Overview of strong human intelligence amplification methods 2024-10-08T08:37:18.896Z
TsviBT's Shortform 2024-06-16T23:22:54.134Z
Koan: divining alien datastructures from RAM activations 2024-04-05T18:04:57.280Z
What could a policy banning AGI look like? 2024-03-13T14:19:07.783Z
A hermeneutic net for agency 2024-01-01T08:06:30.289Z
What is wisdom? 2023-11-14T02:13:49.681Z
Human wanting 2023-10-24T01:05:39.374Z
Hints about where values come from 2023-10-18T00:07:58.051Z
Time is homogeneous sequentially-composable determination 2023-10-08T14:58:15.913Z
Telopheme, telophore, and telotect 2023-09-17T16:24:03.365Z
Sum-threshold attacks 2023-09-08T17:13:37.044Z
Fundamental question: What determines a mind's effects? 2023-09-03T17:15:41.814Z
Views on when AGI comes and on strategy to reduce existential risk 2023-07-08T09:00:19.735Z
The fraught voyage of aligned novelty 2023-06-26T19:10:42.195Z
Provisionality 2023-06-19T11:49:06.680Z
Explicitness 2023-06-12T15:05:04.962Z
Wildfire of strategicness 2023-06-05T13:59:17.316Z
The possible shared Craft of deliberate Lexicogenesis 2023-05-20T05:56:41.829Z
A strong mind continues its trajectory of creativity 2023-05-14T17:24:00.337Z
Better debates 2023-05-10T19:34:29.148Z
An anthropomorphic AI dilemma 2023-05-07T12:44:48.449Z
The voyage of novelty 2023-04-30T12:52:16.817Z
Endo-, Dia-, Para-, and Ecto-systemic novelty 2023-04-23T12:25:12.782Z
Possibilizing vs. actualizing 2023-04-16T15:55:40.330Z
Expanding the domain of discourse reveals structure already there but hidden 2023-04-09T13:36:28.566Z
Ultimate ends may be easily hidable behind convergent subgoals 2023-04-02T14:51:23.245Z
New Alignment Research Agenda: Massive Multiplayer Organism Oversight 2023-04-01T08:02:13.474Z
Descriptive vs. specifiable values 2023-03-26T09:10:56.334Z
Shell games 2023-03-19T10:43:44.184Z
Are there cognitive realms? 2023-03-12T19:28:52.935Z
Do humans derive values from fictitious imputed coherence? 2023-03-05T15:23:04.065Z
Counting-down vs. counting-up coherence 2023-02-27T14:59:39.041Z
Does novel understanding imply novel agency / values? 2023-02-19T14:41:40.115Z
Please don't throw your mind away 2023-02-15T21:41:05.988Z
The conceptual Doppelgänger problem 2023-02-12T17:23:56.278Z
Control 2023-02-05T16:16:41.015Z
Structure, creativity, and novelty 2023-01-29T14:30:19.459Z
Gemini modeling 2023-01-22T14:28:20.671Z
Non-directed conceptual founding 2023-01-15T14:56:36.940Z
Dangers of deference 2023-01-08T14:36:33.454Z
The Thingness of Things 2023-01-01T22:19:08.026Z
[link] The Lion and the Worm 2022-05-16T20:40:22.659Z
Harms and possibilities of schooling 2022-02-22T07:48:09.542Z
Rituals and symbolism 2022-02-10T16:00:14.635Z
Index of some decision theory posts 2017-03-08T22:30:05.000Z
Open problem: thin logical priors 2017-01-11T20:00:08.000Z
Training Garrabrant inductors to predict counterfactuals 2016-10-27T02:41:49.000Z
Desiderata for decision theory 2016-10-27T02:10:48.000Z
Failures of throttling logical information 2016-02-24T22:05:51.000Z
Speculations on information under logical uncertainty 2016-02-24T21:58:57.000Z

Comments

Comment by TsviBT on TsviBT's Shortform · 2025-01-21T06:32:10.171Z · LW · GW

Say a "deathist" is someone who says "death is net good (gives meaning to life, is natural and therefore good, allows change in society, etc.)" and a "lifeist" ("anti-deathist") is someone who says "death is net bad (life is good, people should not have to involuntarily die, I want me and my loved ones to live)". There are clearly people who go deathist -> lifeist, as that's most lifeists (if nothing else, as an older kid they would have uttered deathism, as the predominant ideology). One might also argue that young kids are naturally lifeist, and therefore most people have gone lifeist -> deathist once. Are there people who have gone deathist -> lifeist -> deathist? Are there people who were raised lifeist and then went deathist?

Comment by TsviBT on Thane Ruthenis's Shortform · 2025-01-18T21:10:34.313Z · LW · GW

(Still impressive and interesting of course, just not literally SOTA.)

Comment by TsviBT on Thane Ruthenis's Shortform · 2025-01-18T20:21:24.720Z · LW · GW

According to the article, SOTA was <1% of cells converted into iPSCs

I don't think that's right, see https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(23)00402-2

Comment by TsviBT on What Is The Alignment Problem? · 2025-01-17T02:14:16.364Z · LW · GW

metapreferences are important, but their salience is way out of proportion to their importance.

You mean the salience is too high? On the contrary, it's too low.

one of the most immediate natural answers is "metapreferences!".

Of course, this is not an answer, but a question-blob.

as evidenced by experiences like "I thought I wanted X, but in hindsight I didn't"

Yeah I think this is often, maybe almost always, more like "I hadn't computed / decided to not want [whatever Thing-like thing X gestured at], and then I did compute that".

a last-line fallback for extreme cases

It's really not! Our most central values are all of the proleptic (pre-received; foreshadowed) type: friendship, love, experience, relating, becoming. They all can only be expressed in an either vague or incomplete way: "There's something about this person / myself / this collectivity / this mental activity that draws me in to keep walking that way.". Part of this is resolvable confusion, but probably not all of it. Part of the fun of relating with other people is that there's a true open-endedness; you get to cocreate something non-pre-delimited, find out what another [entity that is your size / as complex/surprising/anti-inductive as you] is like, etc. "Metapreferences" isn't an answer of course, but there's definitely a question that has to be asked here, and the answer will fall under "metapreferences" broadly construed, in that it will involve stuff that is ongoingly actively determining [all that stuff we would call legible values/preferences].

"What does it even mean to be wrong about our own values? What's the ground truth?"

Ok we can agree that this should point the way to the right questions and answers, but it's an extremely broad question-blob.

Comment by TsviBT on TsviBT's Shortform · 2025-01-14T09:31:21.600Z · LW · GW

"The Future Loves You: How and Why We Should Abolish Death" by Dr Ariel Zeleznikow-Johnston is now available to buy. I haven't read it, but I expect it to be a definitive anti-deathist monograph. https://www.amazon.com/Future-Loves-You-Should-Abolish-ebook/dp/B0CW9KTX76

The description (copied from Amazon):


A brilliant young neuroscientist explains how to preserve our minds indefinitely, enabling future generations to choose to revive us

Just as surgeons once believed pain was good for their patients, some argue today that death brings meaning to life. But given humans rarely live beyond a century – even while certain whales can thrive for over two hundred years – it’s hard not to see our biological limits as profoundly unfair. No wonder then that most people nearing death wish they still had more time.

Yet, with ever-advancing science, will the ends of our lives always loom so close? For from ventilators to brain implants, modern medicine has been blurring what it means to die. In a lucid synthesis of current neuroscientific thinking, Zeleznikow-Johnston explains that death is no longer the loss of heartbeat or breath, but of personal identity – that the core of our identities is our minds, and that our minds are encoded in the structure of our brains. On this basis, he explores how recently invented brain preservation techniques now offer us all the chance of preserving our minds to enable our future revival.

Whether they fought for justice or cured diseases, we are grateful to those of our ancestors who helped craft a kinder world – yet they cannot enjoy the fruits of the civilization they helped build. But if we work together to create a better future for our own descendants, we may even have the chance to live in it. Because, should we succeed, then just maybe, the future will love us enough to bring us back and share their world with us.

Comment by TsviBT on Views on when AGI comes and on strategy to reduce existential risk · 2025-01-11T23:39:31.158Z · LW · GW

But like, I wouldn't be surprised if, say, someone trained something that performed comparably to LLMs on a wide variety of benchmarks, using much less "data"... and then when you look into it, you find that what they were doing was taking activations of the LLMs and training the smaller guy on the activations. And I'll be like, come on, that's not the point; you could just as well have "trained" the smaller guy by copy-pasting the weights from the LLM and claimed "trained with 0 data!!". And you'll be like "but we met your criterion!" and I'll just be like "well whatever, it's obviously not relevant to the point I was making, and if you can't see that then why are we even having this conversation". (Or maybe you wouldn't do that, IDK, but this sort of thing--followed by being accused of "moving the goal posts"--is why this question feels frustrating to answer.)
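(For concreteness, a minimal sketch of the kind of setup I mean, reading "training the smaller guy on the activations" as activation-matching distillation. This assumes PyTorch; the model shapes and the projection layer are made up for illustration. The point is that the supervision signal comes from the teacher's activations, not from new data.)

    import torch
    import torch.nn as nn

    teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 512))  # stand-in "LLM"
    student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))  # smaller model
    proj = nn.Linear(256, 512)  # map student activations into the teacher's activation space

    opt = torch.optim.Adam(list(student.parameters()) + list(proj.parameters()), lr=1e-3)

    for _ in range(100):
        x = torch.randn(32, 128)        # stand-in inputs; no new text data is involved
        with torch.no_grad():
            target = teacher(x)         # the "training data" is just the teacher's activations
        loss = nn.functional.mse_loss(proj(student(x)), target)
        opt.zero_grad()
        loss.backward()
        opt.step()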

Comment by TsviBT on Views on when AGI comes and on strategy to reduce existential risk · 2025-01-11T22:38:08.699Z · LW · GW

But ok:

  • Come up, on its own, with many math concepts that mathematicians consider interesting + mathematically relevant on a similar level to concepts that human mathematicians come up with.
  • Do insightful science on its own.
  • Perform at the level of current LLMs, but with 300x less training data.

Comment by TsviBT on Views on when AGI comes and on strategy to reduce existential risk · 2025-01-11T22:33:16.268Z · LW · GW

I did give a response in that comment thread. Separately, I think that's not a great standard, e.g. as described in the post and in this comment https://www.lesswrong.com/posts/i7JSL5awGFcSRhyGF/shortform-2?commentId=zATQE3Lhq66XbzaWm :

Second, 2024 AI is specifically trained on short, clear, measurable tasks. Those tasks also overlap with legible stuff--stuff that's easy for humans to check. In other words, they are, in a sense, specifically trained to trick your sense of how impressive they are--they're trained on legible stuff, with not much constraint on the less-legible stuff (and in particular, on the stuff that becomes legible but only in total failure on more difficult / longer time-horizon stuff).

In fact, all the time in real life we make judgements about things that we couldn't describe in terms that would be considered well-operationalized by betting standards, and we rely on these judgements, and we largely endorse relying on these judgements. E.g. inferring intent in criminal cases, deciding whether something is interesting or worth doing, etc. I should be able to just say "but you can tell that these AIs don't understand stuff", and then we can have a conversation about that, without me having to predict a minimal example of something which is operationalized enough for you to be forced to recognize it as judgeable and also won't happen to be surprisingly well-represented in the data, or surprisingly easy to do without creativity, etc.

Comment by TsviBT on Views on when AGI comes and on strategy to reduce existential risk · 2025-01-11T19:43:40.816Z · LW · GW

My p(AGI by 2045) is higher because there's been more time for algorithmic progress, maybe in the ballpark of 20%. I don't have strong opinions about how much people will do huge training runs, though maybe I'd be kinda skeptical that people would be spending $10^11 or $10^12 on runs, if their $10^10 runs produced results not qualitatively very different from their $10^9 runs. But IDK, that's both a sociological question and a question of which lesser capabilities happen to get unlocked at which exact training run sizes given the model architectures in a decade, which of course IDK. So yeah, if it's 10^30 but not much algorithmic progress, I doubt that gets AGI.

Comment by TsviBT on Views on when AGI comes and on strategy to reduce existential risk · 2025-01-11T10:00:19.828Z · LW · GW

I still basically think all of this, and still think this space doesn't understand it, and thus has an out-of-whack X-derisking portfolio.

If I were writing it today, I'd add this example about search engines from this comment https://www.lesswrong.com/posts/oC4wv4nTrs2yrP5hz/what-are-the-strongest-arguments-for-very-short-timelines?commentId=2XHxebauMi9C4QfG4 , about induction on vague categories like "has capabilities":

Would you say the same thing about the invention of search engines? That was a huge jump in the capability of our computers. And it looks even more impressive if you blur out your vision--pretend you don't know that the text that comes up on your screen is written by a human, and pretend you don't know that search is a specific kind of task distinct from a lot of other activity that would be involved in "True Understanding, woooo"--and just say "wow! previously our computers couldn't write a poem, but now with just a few keystrokes my computer can literally produce Billy Collins level poetry!".

I might also try to explain more how training procedures with poor sample complexity tend to not be on an unbounded trajectory.

Comment by TsviBT on Views on when AGI comes and on strategy to reduce existential risk · 2025-01-11T09:59:13.554Z · LW · GW

Comment by TsviBT on Views on when AGI comes and on strategy to reduce existential risk · 2025-01-10T20:30:07.682Z · LW · GW

What I mainline expect is that yes, a few OOMs more of compute and efficiency will unlock a bunch of new things to try, and yes some of those things will make some capabilities go up a bunch, in the theme of o3. I just also expect that to level off. I would describe myself as "confident but not extremely confident" of that; like, I give 1 or 2% p(doom) in the next 10ish years, coming from this possibility (and some more p(doom) from other sources). Why expect it to level off? Because I don't see good evidence of "a thing that wouldn't level off"; the jump made by LLMs of "now we can leverage huge amounts of data and huge amounts of compute at all rather than not at all" is certainly a jump, but I don't see why to think it's a jump to an unbounded trajectory.

Comment by TsviBT on TsviBT's Shortform · 2025-01-09T05:48:25.634Z · LW · GW

The standard way to measure compute is FLOPS. Besides other problems, this measure has two major flaws: First, no one cares exactly how many FLOPS you have; we want to know the order of magnitude without having to incant "ten high". Second, it sounds cute, even though it's going to kill us.

I propose an alternative: Digital Orders Of Magnitude (per Second), or DOOM(S).
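(A minimal sketch of the conversion, assuming the input is a raw FLOP/s figure; the function name is just illustrative.)

    import math

    def dooms(flops_per_second: float) -> int:
        """Digital Orders Of Magnitude per Second: the base-10 order of magnitude."""
        return math.floor(math.log10(flops_per_second))

    print(dooms(1.5e18))  # -> 18, i.e. "18 DOOMS" instead of "1.5 * 10^18 FLOPS"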

Comment by TsviBT on Overview of strong human intelligence amplification methods · 2025-01-09T02:37:56.297Z · LW · GW

Yes, it is: tsvibt

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-04T01:31:00.203Z · LW · GW

I'm not sure what "concrete" is supposed to mean; for the one or two senses I immediately imagine, no, I would say the feedback is indeed concrete. In terms of consensus/outcome, no, I think the feedback is actually concrete. There is a difficulty, which is that there's a much smaller set of people to whom the outcomes are visible.

As an analogy/example: feedback in higher math. It's "nonconcrete" in that it's "just verbal arguments" (and translating those into something much more objective, like a computer proof, is a big separate long undertaking). And there's a much smaller set of people who can tell what statements are true in the domain. There might even be a bunch more people who have opinions, and can say vaguely related things that other non-experts can't distinguish from expert statements, and who therefore form an apparent consensus that's wrong + ungrounded. But one shouldn't conclude from those facts that math is less real, or less truthtracking, or less available for communities to learn about directly.

Comment by TsviBT on Is "VNM-agent" one of several options, for what minds can grow up into? · 2024-12-30T19:42:55.355Z · LW · GW

It may be that some of the good reasons to not be VNM right now, will continue to be such. In that case, there's no point at which you want to be VNM, and in some senses you don't even limit to VNM. (E.g. you might limit to VNM in the sense that, for any local ontology thing, as long as it isn't revised, you tend toward VNMness; but the same mind might fail to limit to VNM in that, on any given day, the stuff it is most concerned/involved with makes it look quite non-VNM.)

Comment by TsviBT on Is "VNM-agent" one of several options, for what minds can grow up into? · 2024-12-30T18:16:51.489Z · LW · GW

Cf. https://www.lesswrong.com/posts/NvwjExA7FcPDoo3L7/are-there-cognitive-realms

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T22:06:07.862Z · LW · GW

I very much agree that the "structurally what" matters a lot, but that seems like half the battle to me.

But somehow this topic is not afforded much care or interest. Some people will pay lip service to caring, others will deny that mental states exist, but either way the field of alignment doesn't put much force (money, smart young/new people, social support) toward these questions. This is understandable, as we have much less legible traction on this topic, but that's... undignified, I guess is the expression.

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T22:02:40.570Z · LW · GW

a sufficiently intelligent AI system that does understand that relationship will be able to exploit the extra degrees of freedom in the lower level ontology to our disadvantage, and we won’t be able to see it coming.

Even if you do understand the lower level, you couldn't stop such an adversarial AI from exploiting it, or exploiting something else, and taking control. If you understand the mental states (yeah, the structure), then maybe you can figure out how to make an AI that wants to not do that. In other words, it's not sufficient, and probably not necessary / not a priority.

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T20:05:07.603Z · LW · GW

I'm unsure whether it's a good analogy. Let me make a remark, and then you could reask or rephrase.

The discovery that the phenome is largely a result of the genome, is of course super important for understanding and also useful. The discovery of mechanically how (transcribe, splice, translate, enhance/promote/silence, trans-regulation, ...) the phenome is a result of the genome is separately important, and still ongoing. The understanding of "structurally how" characters are made, both in ontogeny and phylogeny, is a blob of open problems (evodevo, niches, ...). Likewise, more simply, "structurally what"--how to even think of characters. Cf. Günter Wagner, Rupert Riedl.

I would say the "structurally how" and "structurally what" is most analogous. The questions we want to answer about minds aren't like "what is a sufficient set of physical conditions to determine--however opaquely--a mind's effects", but rather "what smallish, accessible-ish, designable-ish structures in a mind can [understandably to us, after learning how] determine a mind's effects, specifically as we think of those effects". That is more like organology and developmental biology and telic/partial-niche evodevo (<-made up term but hopefully you see what I mean).

https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T15:31:44.093Z · LW · GW

why you are so confident in these "defeaters"

More than any one defeater, I'm confident that most people in the alignment field don't understand the defeaters. Why? I mean, from talking to many of them, and from their choices of research.

People in these fields understand very well the problem you are pointing towards.

I don't believe you.

if the alignment community would outlaw mechinterp/slt/ neuroscience

This is an insane strawman. Why are you strawmanning what I'm saying?

I dont think progress on this question will be made by blanket dismissals

Progress could only be made by understanding the problems, which can only be done by stating the problems, which you're calling "blanket dismissals".

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T15:10:53.971Z · LW · GW

I.e. a training technique? Design principles? A piece of math ? Etc

All of those, sure? First you understand, then you know what to do. This is a bad way to do peacetime science, but seems more hopeful given:

  1. a cruel deadline, and
  2. the need to understand as-yet-unconceived aspects of Mind.

I think I am asking a very fair question.

No, you're derailing from the topic, which is the fact that the field of alignment keeps failing to even try to avoid / address major partial-consensus defeaters to alignment.

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T15:02:08.559Z · LW · GW

This is a derail. I can know that something won't work without knowing what would work. I don't claim to know something that would work. If you want my partial thoughts, some of them are here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html

In general, there's more feedback available at the level of "philosophy of mind" than is appreciated.

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T13:38:13.608Z · LW · GW

They aren't close to the right kind of abstraction. You can tell because they use a low-level ontology, such that mental content, to be represented there, would have to be homogenized, stripped of mental meaning, and encoded. Compare trying to learn about arithmetic, and doing so by explaining a calculator in terms of transistors vs. in terms of arithmetic. The latter is the right level of abstraction; the former is wrong (it would be right if you were trying to understand transistors or trying to understand some further implementational aspects of arithmetic beyond the core structure of arithmetic).

What I'm proposing instead, is theory.

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T13:20:03.201Z · LW · GW

From the linked post:

The first moral that I'd draw is simple but crucial: If you're trying to understand some phenomenon by interpreting some data, the kind of data you're interpreting is key. It's not enough for the data to be tightly related to the phenomenon——or to be downstream of the phenomenon, or enough to pin it down in the eyes of Solomonoff induction, or only predictable by understanding it. If you want to understand how a computer operating system works by interacting with one, it's far far better to interact with the operating system at or near the conceptual/structural regime at which the operating system is constituted.

What's operating-system-y about an operating system is that it manages memory and caching, it manages CPU sharing between processes, it manages access to hardware devices, and so on. If you can read and interact with the code that talks about those things, that's much better than trying to understand operating systems by watching capacitors in RAM flickering, even if the sum of RAM+CPU+buses+storage gives you a reflection, an image, a projection of the operating system, which in some sense "doesn't leave anything out". What's mind-ish about a human mind is reflected in neural firing and rewiring, in that a difference in mental state implies a difference in neurons. But if you want to come to understand minds, you should look at the operations of the mind in descriptive and manipulative terms that center around, and fan out from, the distinctions that the mind makes internally for its own benefit. In trying to interpret a mind, you're trying to get the theory of the program.

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T12:35:06.330Z · LW · GW

A thing that makes alignment hard / would defeat various alignment plans or alignment research plans.

E.g.s: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#Section_B_

E.g. the things you're studying aren't stable under reflection.

E.g. the things you're studying are at the wrong level of abstraction (SLT, interp, neuro) https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations

E.g. https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html

This just in: Alignment researchers fail to notice skulls from famous blog post "Yes, we have noticed the skulls".

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T04:43:17.844Z · LW · GW

Alternative: "AI x-derisking"

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T02:35:20.927Z · LW · GW

"AI x-safety" seems ok. The "x-" is a bit opaque, and "safety" is vague, but I'll try this as my default.

(Including "technical" to me would exclude things like public advocacy.)

Comment by TsviBT on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T02:09:17.621Z · LW · GW

Scary demos isn't exciting as Deep Science but its influence on policy

There maybe should be a standardly used name for the field of generally reducing AI x-risk, which would include governance, policy, evals, lobbying, control, alignment, etc., so that "AI alignment" can be a more narrow thing. I feel (coarsely speaking) grateful toward people working on governance, policy, evals_policy, lobbying; I think control is pointless or possibly bad (makes things look safer than they are, doesn't address real problem); and frustrated with alignment.

What's concerning is watching a certain strain of dismissiveness towards mainstream ideas calcify within parts of the rationalist ecosystem. As Vanessa notes in her comment, this attitude of isolation and attendant self-satisfied sense of superiority certainly isn't new. It has existed for a while around MIRI & the rationalist community. Yet it appears to be intensifying as AI safety becomes more mainstream and the rationalist community's relative influence decreases

What should one do, who:

  1. thinks that there's various specific major defeaters to the narrow project of understanding how to align AGI;
  2. finds partial consensus with some other researchers about those defeaters;
  3. upon explaining these defeaters to tens or hundreds of newcomers, finds that, one way or another, they apparently-permanently fail to avoid being defeated by those defeaters?

It sounds like in this paragraph your main implied recommendation is "be less snooty". Is that right?

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T01:54:29.392Z · LW · GW

the claim which gets clamped to True is not "this research direction will/can solve alignment" but instead "my research is high value".

This agrees with something like half of my experience.

that their research is maybe a useful part of a bigger solution which involves many other parts, or that their research is maybe useful step toward something better.

Right, I think of this response as arguing that streetlighting is a good way to do large-scale pre-paradigm science projects in general. And I have to somewhat agree with that.

Then I argue that AGI alignment is somewhat exceptional: 1. cruel deadline, 2. requires understanding as-yet-unconceived aspects of Mind. Point 2 of exceptionality goes through things like alienness of creativity, RSI, reflective instability, the fact that we don't understand how values sit in a mind, etc., and that's the part that gets warped away.

I do genuinely think that the 2024 field of AI alignment would eventually solve the real problems via collective iterative streetlighting. (I even think it would eventually solve it in a hypothetical world where all our computers disappeared, if it kept trying.) I just think it'll take a really long time.

being a useful part of a bigger solution (which they don't know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy

Right, exactly. (I wrote about this in my opaque gibberish philosophically precise style here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html#1-summary)

Comment by TsviBT on evhub's Shortform · 2024-12-28T20:25:24.845Z · LW · GW

I'd add:

  • Support explicit protections for whistleblowers.

Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI

I'll echo this and strengthen it to:

... call for policymakers to stop the development of AGI.

Comment by TsviBT on If all trade is voluntary, then what is "exploitation?" · 2024-12-28T16:52:32.520Z · LW · GW

I don't think McDonald's example quite makes sense; if they were doing credit card fraud, that would probably destroy the relationship, so failing to do that fraud doesn't absolve them of being an exploiter. But anyway, you're probably right that "maximal" is too strong.

Comment by TsviBT on If all trade is voluntary, then what is "exploitation?" · 2024-12-28T08:23:45.234Z · LW · GW

does include the parenting example I gave

A normal and/or healthy parent-child relationship doesn't have the parent extracting as much value as possible from the child regardless of ethics or harm!

is strictly broader than my proposed definition.

Therefore not.

Comment by TsviBT on If all trade is voluntary, then what is "exploitation?" · 2024-12-28T02:05:54.056Z · LW · GW

I think the simple throughline is something like:

The exploiter extracts about as much value from the exploited as they can while still retaining the relationship (employee, romantic, customer, con mark), regardless of harm (e.g. including by lying, or by making it harder for the exploited to leave the relationship).

Comment by TsviBT on If all trade is voluntary, then what is "exploitation?" · 2024-12-28T01:56:39.455Z · LW · GW

Exploitation is using a superior negotiating position to inflict great costs on someone else, at small cost to yourself.

Actually, it isn't strictly too broad; it also excludes things that should be included. E.g. it doesn't have to be a negotiation. I would say that something that tries to trap your attention in short, high-intensity, unfulfilling activity is exploitative. E.g. casinos, social media. Or, simple fraud would be exploitative.

A way it's too broad is that it doesn't mention the motive, or the benefit to the exploiter. (Well actually I'm not exactly sure what you meant by "at small cost to yourself".) Some examples you gave are like "why didn't the employer do this thing that would have been nice for the employee, that wouldn't have cost too much". But like, they might just suck, or they might rationally not view it as worthwhile.

Your example

A parent sits down for tea, but their kid is running around. “Absolutely no noise while I’m having tea, or no Nintendo for the next month.” Every time the parent pulls this card, the kid accepts.

doesn't seem to me like exploitation.

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T01:26:16.983Z · LW · GW

The flinches aren't structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.

As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it's impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise--that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping "can an AI be very superhumanly capable" to "no". That clamping causes them to also not see the flaws in the plan "we'll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly", because they don't think RSI is feasible, they don't think extreme persuasion is feasible, etc.

A more real example is, say, people thinking of "structures for decision making", e.g. constitutions. You explain that these things, they are not reflectively stable. And now this person can't understand reflective stability in general, so they don't understand why steering vectors won't work, or why lesioning won't work, etc.

Another real but perhaps more controversial example: {detecting deception, retargeting the search, CoT monitoring, lesioning bad thoughts, basically anything using RL} all fail because creativity starts with illegible concomitants to legible reasoning.

(This post seems to be somewhat illegible, but if anyone wants to see more real examples of aspects of mind that people fail to remember, see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html)

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T16:14:51.450Z · LW · GW

Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P

Yes it would! It would eat up motivation and energy and hope that they could have put towards actual research. And it would put them in a social context where they are pressured to orient themselves toward streetlighty research--not just during the program, but also afterward. Unless they have some special ability to have it not do that.

Without MATS: not currently doing anything directly useful (though maybe indirectly useful, e.g. gaining problem-solving skill). Could, if given $30k/year, start doing real AGI alignment thinking from scratch not from scratch, thereby scratching their "will you think in a way that unlocks understanding of strong minds" lottery ticket that each person gets.

With MATS: gotta apply to extension, write my LTFF grant. Which org should I apply to? Should I do linear probes software engineering? Or evals? Red teaming? CoT? Constitution? Hyperparameter gippity? Honeypot? Scaling supervision? Superalign, better than regular align? Detecting deception?

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T15:07:31.585Z · LW · GW

Do you think that funders are aware that >90% [citation needed!] of the money they give to people, to do work described as helping with "how to make world-as-we-know-it ending AGI without it killing everyone", is going to people who don't even themselves seriously claim to be doing research that would plausibly help with that goal? If they are aware of that, why would they do that? If they aren't aware of it, don't you think that it should at least be among your very top hypotheses, that those researchers are behaving materially deceptively, one way or another, call it what you will?

Comment by TsviBT on If all trade is voluntary, then what is "exploitation?" · 2024-12-27T12:29:26.790Z · LW · GW

I think exploitation is an important thing and should be understood better. (At least by us; maybe it's well understood academically.)

Exploitation is using a superior negotiating position to inflict great costs on someone else, at small cost to yourself.

I think this is way too broad. Elements of a more narrow definition:

  • The exploited is in a satisficing-hole; they'd need more slack to get out (e.g. to find / train for another job).
  • The exploited is in a reference class that can't easily cohere for negotiation purposes, the exploiter isn't. (E.g. workers vs. large corporations; workers unions are supposed to address this.)
  • The exploiter might specifically harm the exploited, in order to keep zer in the satisficing-hole. (E.g. abusive partner insults the abused to keep zer pessimistic about prospects outside the relationship. E.g. union busting.)
  • The exploiter sets trade conditions to be near the minimum satisficing amount for the exploited.
  • The exploiter is much further from zer minimum satisficing amount, compared to the exploited, in absolute terms. (E.g. an employer can eat deficit long enough to train a new worker; the worker can't eat deficit long enough to find a new job.)

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T12:00:45.712Z · LW · GW

it's tractable to achieve progress through mindfully shaping the funding landscape

This isn't clear to me, where the crux (though maybe it shouldn't be) is "is it feasible for any substantial funders to distinguish actually-trying research from other".

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T11:23:40.884Z · LW · GW

Does this actually happen?

Yes, absolutely. Five years ago, people were more honest about it, saying ~explicitly and out loud "ah, the real problems are too difficult; and I must eat and have friends; so I will work on something else, and see if I can get funding on the basis that it's vaguely related to AI and safety".

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T11:21:02.604Z · LW · GW

MATS will push you to streetlight much more unless you have some special ability to have it not do that.

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T11:19:45.171Z · LW · GW

Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.

Actually, I now suspect this is to a significant extent disinformation. You can tell when ideas make sense if you think hard about them. There's plenty of feedback, that's not already being taken advantage of, at the level of "abstract, high-level, philosophy of mind", about the questions of alignment.

Comment by TsviBT on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-26T20:27:26.481Z · LW · GW

Cf. https://www.lesswrong.com/posts/QzQQvGJYDeaDE4Cfg/talent-needs-of-technical-ai-safety-teams?commentId=BNkpTqwcgMjLhiC8L

https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations?commentId=apD6dek5zmjaqeoGD

https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=uMaQvtXErEqc67yLj

Comment by TsviBT on What are the strongest arguments for very short timelines? · 2024-12-25T05:16:48.407Z · LW · GW

The burden is on you because you're saying "we have gone from not having the core algorithms for intelligence in our computers, to yes having them".

https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#The__no_blockers__intuition

And I think you're admitting that your argument is "if we mush all capabilities together into one dimension, AI is moving up on that one dimension, so things will keep going up".

Would you say the same thing about the invention of search engines? That was a huge jump in the capability of our computers. And it looks even more impressive if you blur out your vision--pretend you don't know that the text that comes up on your screen is written by a human, and pretend you don't know that search is a specific kind of task distinct from a lot of other activity that would be involved in "True Understanding, woooo"--and just say "wow! previously our computers couldn't write a poem, but now with just a few keystrokes my computer can literally produce Billy Collins level poetry!".

Blurring things together at that level works for, like, macroeconomic trends. But if you look at macroeconomic trends it doesn't say singularity in 2 years! Going to 2 or 10 years is an inside-view thing to conclude! You're making some inference like "there's an engine that is very likely operating here, that takes us to AGI in xyz years".

Comment by TsviBT on Shortform · 2024-12-24T16:38:04.783Z · LW · GW

I don't know a good description of what in general 2024 AI should be good at and not good at. But two remarks, from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce.

First, reasoning at a vague level about "impressiveness" just doesn't and shouldn't be expected to work. Because 2024 AIs don't do things the way humans do, they'll generalize differently, so you can't make inferences from "it can do X" to "it can do Y" like you can with humans:

There is a broken inference. When talking to a human, if the human emits certain sentences about (say) category theory, that strongly implies that they have "intuitive physics" about the underlying mathematical objects. They can recognize the presence of the mathematical structure in new contexts, they can modify the idea of the object by adding or subtracting properties and have some sense of what facts hold of the new object, and so on. This inference——emitting certain sentences implies intuitive physics——doesn't work for LLMs.

Second, 2024 AI is specifically trained on short, clear, measurable tasks. Those tasks also overlap with legible stuff--stuff that's easy for humans to check. In other words, they are, in a sense, specifically trained to trick your sense of how impressive they are--they're trained on legible stuff, with not much constraint on the less-legible stuff (and in particular, on the stuff that becomes legible but only in total failure on more difficult / longer time-horizon stuff).

The broken inference is broken because these systems are optimized for being able to perform all the tasks that don't take a long time, are clearly scorable, and have lots of data showing performance. There's a bunch of stuff that's really important——and is a key indicator of having underlying generators of understanding——but takes a long time, isn't clearly scorable, and doesn't have a lot of demonstration data. But that stuff is harder to talk about and isn't as intuitively salient as the short, clear, demonstrated stuff.

Comment by TsviBT on Shortform · 2024-12-24T13:11:24.544Z · LW · GW

Pulling a quote from the tweet replies (https://x.com/littmath/status/1870560016543138191):

Not a genius. The point isn't that I can do the problems, it's that I can see how to get the solution instantly, without thinking, at least in these examples. It's basically a test of "have you read and understood X." Still immensely impressive that the AI can do it!

Comment by TsviBT on What are the strongest arguments for very short timelines? · 2024-12-23T18:07:06.662Z · LW · GW

You probably won't find good arguments because there don't seem to be any. Unless, of course, there's some big lab somewhere that, unlike the major labs we're aware of, has made massive amounts of progress and kept it secret, and you're talking to one of those people.

https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce

Comment by TsviBT on Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible · 2024-12-16T04:24:30.984Z · LW · GW

FWIW I agree that personality traits are important. A clear case is that you'd want to avoid combining very low conscientiousness with very high disagreeableness, because that's something like antisocial personality disorder. But you don't want to just select against those traits, because weaker forms might be associated with creative achievement. However, IQ, and more broadly cognitive capacity / problem-solving ability, will not become much less valuable soon.

Comment by TsviBT on avturchin's Shortform · 2024-12-15T01:49:02.364Z · LW · GW

You can publish it, including the output of a standard hash function applied to the secret password. "Any real note will contain a preimage of this hash."
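(A minimal sketch of the scheme, assuming SHA-256 as the "standard hash function"; the password and names are just illustrative.)

    import hashlib

    secret_password = "correct horse battery staple"   # kept private
    commitment = hashlib.sha256(secret_password.encode()).hexdigest()
    print("Publish this alongside the statement:", commitment)
    # Statement: "Any real note will contain a preimage of this hash."

    def note_is_authentic(claimed_password: str) -> bool:
        # A later note proves itself by revealing a preimage of the published hash.
        return hashlib.sha256(claimed_password.encode()).hexdigest() == commitment

    assert note_is_authentic("correct horse battery staple")
    assert not note_is_authentic("some guess")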