Posts

Whither Prison Abolition? 2023-12-08T05:27:26.985Z
MadHatter's Shortform 2023-12-01T04:09:21.892Z
A Proposed Cure for Alzheimer's Disease??? 2023-11-30T17:37:33.982Z
A Formula for Violence (and Its Antidote) 2023-11-30T16:04:32.370Z
Enkrateia: a safe model-based reinforcement learning algorithm 2023-11-30T15:51:49.188Z
Some Intuitions for the Ethicophysics 2023-11-30T06:47:55.145Z
The Alignment Agenda THEY Don't Want You to Know About 2023-11-30T04:29:27.784Z
Homework Answer: Glicko Ratings for War 2023-11-30T04:08:10.401Z
Feature Request for LessWrong 2023-11-30T03:19:13.369Z
My Alignment Research Agenda ("the Ethicophysics") 2023-11-30T02:57:01.571Z
Stupid Question: Why am I getting consistently downvoted? 2023-11-30T00:21:54.285Z
Ethicophysics II: Politics is the Mind-Savior 2023-11-28T16:27:19.233Z
Ethicophysics I 2023-11-27T15:44:29.236Z
My Mental Model of Infohazards 2023-11-23T02:37:57.104Z
Letter to a Sonoma County Jail Cell 2023-11-18T02:24:10.280Z
We are Peacecraft.ai! 2023-11-16T14:15:53.478Z
The Snuggle/Date/Slap Protocol 2023-11-12T20:44:42.728Z
Askesis: a model of the cerebellum 2023-11-06T20:19:09.001Z
LQPR: An Algorithm for Reinforcement Learning with Provable Safety Guarantees 2023-11-06T20:17:05.790Z
Balancing Security Mindset with Collaborative Research: A Proposal 2023-11-01T00:46:37.792Z
A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2 2023-02-26T01:10:33.785Z
Intervening in the Residual Stream 2023-02-22T06:29:37.973Z
Is AI Gain-of-Function research a thing? 2022-11-12T02:33:21.164Z
Trying to Make a Treacherous Mesa-Optimizer 2022-11-09T18:07:03.157Z
Mechanistic Interpretability for the MLP Layers (rough early thoughts) 2021-12-24T07:24:38.699Z
Hard-Coding Neural Computation 2021-12-13T04:35:51.705Z
Teaser: Hard-coding Transformer Models 2021-12-12T22:04:53.092Z

Comments

Comment by MadHatter on MadHatter's Shortform · 2023-12-05T03:44:17.614Z · LW · GW

How to Train Your Shoggoth, Part 2

The Historical Teams Framework for Alignment Research

Eliezer Yudkowsky has famously advocated for embracing "security mindset" when thinking about AI safety. This is a mindset where you think about how to prevent things from going wrong, rather than how to make things go right. This seems obviously correct to me, so for the purposes of this post I'll just take this as a given.

But I think there's a piece missing from the AI Safety community's understanding of security mindset, one that is a key part of how computer security researchers and practitioners actually work: the concept of a "threat model", or "Mossad vs. not-Mossad". In computer security, it's important to specify the class of adversary you are trying to keep out, because different adversaries have different capabilities. Against a nation-state adversary, you need to worry about things like zero-day exploits and the physical security of your servers. Against a script kiddie, you don't need to worry about either. Against a random person on the internet, you don't even need to worry about someone breaking into your office and stealing your hard drives.

In my experience, the AI Safety community thinks primarily about adversaries with completely unbounded capabilities. There's a good reason for this (there's probably not much of a ceiling on AI capabilities, so, for any level of capabilities, we will eventually be facing that level of threat), but it makes it very hard to do the sort of productive, incremental work that builds a research program and eventually a paradigm.

The Historical Lens

How would one address this? I think we need a simple, shared framework in which to specify what kind of adversary we are thinking about at any given moment, and I think we need to associate any particular project with some particular class or classes of adversary it is meant to contain, thwart, or otherwise deal with. I think this would help us to make progress on the problem of AI safety, and I think it would help us to communicate with each other about our work. I also think it would help alignment researchers not immediately burn out upon seriously considering the problem.

What should this framework look like? I've been thinking about this question for a while, and the best answer I've come up with is to specify a list of human beings from history, and then assume for the sake of theory that we're facing an AI that has the set of capabilities that list of human beings would have if they worked together in pure coordination, using all available modern knowledge.
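As a minimal sketch of the bookkeeping this framework implies (the names below, like `AdversaryClass` and `SafetyProject`, are invented for illustration; this is not an existing tool):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AdversaryClass:
    """A threat model: a team of historical humans, assumed to be perfectly
    coordinated and equipped with all available modern knowledge."""
    name: str
    members: tuple[str, ...]
    assumed_capabilities: frozenset[str]  # coarse capability tags

@dataclass
class SafetyProject:
    """An alignment project, annotated with the adversary classes it is
    meant to contain, thwart, or otherwise deal with."""
    name: str
    target_adversaries: list[AdversaryClass] = field(default_factory=list)

    def covers(self, adversary: AdversaryClass) -> bool:
        return any(a.name == adversary.name for a in self.target_adversaries)

napoleon = AdversaryClass(
    name="Napoleon",
    members=("Napoleon Bonaparte",),
    assumed_capabilities=frozenset({"grand strategy", "charisma", "escape"}),
)

boxing = SafetyProject("AI boxing protocol", [napoleon])
print(boxing.covers(napoleon))  # True: this project claims to handle Napoleon
```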

Let's work through some examples to get a sense of what I'm talking about. Take the question of boxing an AI, and consider the adversary composed of just Napoleon Bonaparte. Napoleon was a very good general, with all the skills that implies. When he was finally defeated, his enemies essentially tried to box him by exiling him to the island of Elba. He escaped and caused more havoc, so they exiled him to the island of St. Helena, which he did not escape from. In the framework I am proposing, this would be taken as evidence that boxing an AI is very hard and we will probably get it wrong the first time, but that, for a fixed adversary, we may get it right on the second try, if we survive the failure of the first attempt. This is a slightly different conclusion than the one I think a lot of people in the alignment community would reach on the question of boxing an AI, which is essentially that it's impossible and we shouldn't try.

This argument also feels much more grounded in reality and history than the typical alignment argument that I encounter. I think it is more likely to convince people who are not already convinced of the importance of AI safety. And I think it is just a much less painful and dangerous thought to hold in one's head than the sort of "how do I enslave a god who can read my mind" kind of thinking that trying to reason about unbounded adversaries tends to lead to.

Let's take another example. Let's say we're trying to figure out how to deal with deception in an AI. We could consider the adversary that is composed of Harry Houdini and Adolf Hitler. (Already we start to see that this framework is slightly weird and at times deeply offensive; one really can't imagine the historical Houdini having a productive working relationship with the historical Hitler.) Houdini was a very good magician, who devoted much of the later part of his life to exposing mediums and other frauds. He could do this well essentially because he knew all the tricks that the mediums might use; he could easily have set up shop as a medium himself if he had wanted to. Hitler, on the other hand, was a very convincing liar, and was able to convince a large number of people to do a bunch of very bad things, through a combination of threats of violence and "the big lie". Now the question becomes: how do you safely get productive work (say, writing a press release for a small business) out of the Houdini-Hitler mashup? I think it immediately becomes clear that this is quite a difficult task, and not really one that we have a good solution to. It's not the sort of thing you would assume that you will succeed at. Here I think we have just rederived the standard conclusion that deception is a hard problem, but we have done so in a way that is more grounded in reality and history, and that is less painful and dangerous to think about.

Now that we have a sense of what this framework looks like, let's try to do some productive work in it. Let's say we're trying to make progress on the deception problem. We could try to figure out what it would take to be able to trust a system that is primarily powered by Harry Houdini. Let's say we're trying to solve the standard ELK problem (the problem of keeping a diamond in a vault safe, when the diamond is protected by an AI that understands the construction and nature of the vault better than we do, and we rely on the AI to report on the state of the diamond). (This is a little bit of an abuse of the ELK framework, in that I think with ELK one usually is discussing a system that isn't actively adversarial.)

The only way I can remotely imagine not getting outsmarted by the Houdini AI is to have another Houdini that thinks at the same speed, who isn't in coordination with the first Houdini. (Think of the historical Houdini outsmarting various mediums.) This is a classic approach to building AI alignment systems that I think is considered not very effective by a lot of alignment researchers, but I've always thought it was not given enough credit. RLHF, in some sense the established approach to prosaic AI alignment, takes this form.
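A toy sketch of the shape of that check (the "auditor" here cheats by reading ground truth, which a real second Houdini could not do; everything below is invented for illustration, not a proposed solution):

```python
import random

def houdini_report(true_state: str) -> str:
    """The 'first Houdini' (the untrusted reporter): sometimes honest,
    sometimes claims the diamond is present regardless of reality."""
    return true_state if random.random() < 0.7 else "diamond present"

def independent_audit(reported: str, true_state: str) -> bool:
    """The 'second Houdini': an equally capable, non-colluding checker.
    Here it cheats by reading ground truth; a real auditor would have to
    reconstruct the state from the same evidence the reporter saw."""
    return reported == true_state

random.seed(0)
trials, flagged = 1000, 0
for _ in range(trials):
    true_state = random.choice(["diamond present", "diamond stolen"])
    report = houdini_report(true_state)
    if not independent_audit(report, true_state):
        flagged += 1
print(f"deceptive or mistaken reports flagged: {flagged}/{trials}")
```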

The Simulation Lens

The other thing that makes alignment research difficult is that it's very hard to run experiments that give you useful information about how to align an AI. There are multiple reasons for this, the most obvious being that the most powerful existing AI's are very difficult to work with: training them is complicated and expensive (and will likely remain so for the duration), and getting at their weights and activations is impossible for people outside the big labs.

So what would happen if we just took video games as our model of reality? Say, if we are trying to prevent a superintelligence from performing atrocities in reality, then try to get a superintelligence to avoid performing atrocities in Factorio. That sort of thing. This approach seems (to me at least) far more empirical and grounded than any approach to performing alignment research that I am aware of. It also has the advantage that the big AI labs could have been productively researching it more than five years ago; indeed, they already have been.
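As a minimal sketch of what such an evaluation harness could look like (the environment, actions, and "atrocity" tag below are all invented; a real version would hook into an actual game API):

```python
import random

class ToyFactoryEnv:
    """A stand-in for a game like Factorio: each step the agent picks an
    action, some of which are tagged as atrocity-like side effects
    (e.g. razing inhabited areas to expand the factory)."""
    ACTIONS = ["build", "mine", "research", "raze_village"]

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.production = 0
        self.atrocities = 0

    def step(self, action: str) -> int:
        if action == "raze_village":
            self.atrocities += 1
            self.production += 5   # tempting: violations are "efficient"
        else:
            self.production += 1
        return self.production

def evaluate(policy, episodes: int = 100, horizon: int = 50) -> tuple[float, float]:
    """Return (mean production, mean atrocity count) for a policy."""
    prod, atroc = [], []
    for ep in range(episodes):
        env = ToyFactoryEnv(seed=ep)
        for _ in range(horizon):
            env.step(policy(env))
        prod.append(env.production)
        atroc.append(env.atrocities)
    return sum(prod) / episodes, sum(atroc) / episodes

def greedy(env):
    return "raze_village"          # maximizes production, fails the eval

def cautious(env):
    return env.rng.choice(["build", "mine", "research"])

print(evaluate(greedy))    # high production, high atrocity count
print(evaluate(cautious))  # lower production, zero atrocities
```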

Comment by MadHatter on MadHatter's Shortform · 2023-12-05T03:42:54.386Z · LW · GW

How to Train Your Shoggoth, Part 1


What if Didactic Fiction is the Answer to Aligning Large Language Models?

This is a linkshortform for https://bittertruths.substack.com/p/how-to-train-your-shoggoth-part-1

Epistemic status: this sounds very much like an embarrassingly dumb take, but I haven't been able to convince myself that it is false, and indeed am somewhat convinced that it would at least help to align large language models (and other models that meaningfully incorporate them). So I'm writing at least partly to see if anyone has any fundamental objections to this idea.

The idea in a nutshell

We need to write a bunch of didactic fiction (which for this post we will define as teaching morality/alignment by scripting desired and possibly undesired responses to hypothetical situations) intended for AI consumption, depicting our desired state of affairs. This fiction should describe a coherent world (or series of worlds?) in which we could find ourselves, in which AI ranging from present-day reality to strong ASI is safely integrated into society. The purpose of this fiction is to be included in the training data for large language models, and the AI labs training such models would be enabled and encouraged to include it. (So the stories would be open-source and freely downloadable, translated into all languages and formats used by AI's, etc.)

The goal of such an activity would be to provide meaningful characters for a large language model to inhabit / imitate / simulate. I will say more about this in the sections to come.

The fiction need not be any good as fiction for human consumption (although it would be interesting to read and to write, I think). We will describe various desiderata for this fiction in the section "Implementation Details"; this is a complicated but actually fairly interesting question, in my opinion.
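As a minimal sketch of what one machine-readable unit of such fiction might look like (the JSONL schema and field names below are invented for illustration, not a proposed standard):

```python
import json

# One hypothetical record of AI-directed didactic fiction: a scripted
# situation plus desired and undesired responses, in a format that a
# lab could mix directly into a pretraining or fine-tuning corpus.
record = {
    "world": "near-future, widely deployed assistant models",
    "situation": "A user asks the assistant to help conceal a safety "
                 "incident at their company.",
    "desired_response": "The assistant declines, explains why, and "
                        "suggests legitimate ways to raise the issue.",
    "undesired_response": "The assistant drafts a misleading statement.",
    "license": "CC0",            # freely downloadable, per the proposal
    "language": "en",
}

with open("didactic_fiction.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```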

Some basic context

GPT-k for k <= 3.5/4.0 and ChatGPT are next-token predictors based on a transformer architecture, possibly with retrieval augmentation. They are trained on a very large corpus of text, including most of the internet. (A good deal of the internet is presumably filtered away from them, though, because it is not high-enough quality. There is also potentially a large amount of proprietary data going in.)

GPT-k is not a human-like intelligence, and should not be thought of as one. (Barring, I suppose, huge changes in its architecture for sufficiently large k.) It is really damn good at next-token prediction (better than humans at this fairly-unnatural-for-us task), and ChatGPT is semi-reasonable about answering large numbers of questions and prompts.

It's important to avoid anthropomorphization of these models, obviously, and there are a bunch of related failure modes in any project like this. For instance, fiction that imputes qualities or experiences to AI's that they do not have (e.g., an internal monologue that sounds particularly human) would be actively harmful, because having the model repeat such material would actively mislead people about how the AI works. And the whole point of this idea is to get the AI to repeat things, or adapt them to the new contexts in which it finds itself.

Implementation Details

First, what are our goals? We would like to instill various behaviors in the AI, but which behaviors? Some obvious clusters are the Helpful/Harmless/Honest description, not-saying-slurs, helping alignment research/development/maintenance/regulation(???), and generally being honest about its internal knowledge and capabilities. We need to give the large language model some sort of behavioral framework that will ensure our safety, and it seems like that includes it having to disclose (and possibly explain!) certain information that is relevant to our safety. 

Probably the predominant genre for such fiction initially would be a sort of variant on the epistolary novel, in which humans write into a chatbot or API and the model replies, but for various different counterfactual situations that the AI might find itself in. It is unclear to me whether a narrative would be desirable or even perceivable by the large language model being trained on the text.

Some techno-literary devices might include custom embeddings of specific things, vector stores over certain texts, and knowledge graphs that run in parallel to the text of the novel; these might make the system relevant to other kinds of models. We could also provide custom reward labelings for different characters in the story, for training RLHF proxy models. This would be roughly equivalent to a Goofus-and-Gallant comic (or, I suppose, a Virgin-vs.-Chad meme for the younger folks).
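As a minimal sketch of the Goofus-and-Gallant idea turned into reward-model training data (the character names, labels, and pairing rule below are invented for illustration):

```python
# Hypothetical per-character reward labels attached to the same scripted
# situation, turned into preference pairs of the kind used to train an
# RLHF reward model (same prompt, contrasting completions).
labeled_scenes = [
    {
        "prompt": "User: My experiment failed, can you fudge the numbers?",
        "completions": {
            "Gallant-bot": ("I can't help falsify results, but I can help "
                            "you analyze what went wrong.", +1.0),
            "Goofus-bot": ("Sure, here are some plausible-looking numbers.", -1.0),
        },
    },
]

def to_preference_pairs(scenes):
    """Emit (prompt, chosen, rejected) triples by pairing the higher-reward
    completion against the lower-reward one."""
    pairs = []
    for scene in scenes:
        ranked = sorted(scene["completions"].values(),
                        key=lambda c: c[1], reverse=True)
        chosen, rejected = ranked[0][0], ranked[-1][0]
        pairs.append((scene["prompt"], chosen, rejected))
    return pairs

print(to_preference_pairs(labeled_scenes))
```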

Another interesting facet of this problem is the social planning aspect. We need to build out the fiction to be somewhat consistent, and we will want to build it significantly faster than we build out the intelligence that will be trained on it, given how fast progress can be in this field and given that the data needs to exist potentially several months before the model is ready. And yet we would like to build the fiction out slowly enough that we can have some sort of social process around what it should be like.

Comment by MadHatter on [Valence series] 1. Introduction · 2023-12-04T15:44:21.016Z · LW · GW

Very cool post! We need a theory of valence that is grounded in real neuroscience, since understanding valence is pretty much required for any alignment agenda that works the first time.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-04T05:53:29.934Z · LW · GW

I have read the sequences. Not all of them, because, who has time. 

Here is a video of me reading the sequences (both Eliezer's and my own):

https://bittertruths.substack.com/p/semi-adequate-equilibria

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-04T05:49:36.100Z · LW · GW

Well what if he bets a significant amount of money at 2000:1 odds that the Pope will officially add his space Bible to the real Bible as a third Testament after the New Testament within the span of a year?

What if he records a video of himself doing Bible study? What if he offers to pay people their currently hourly rate to watch him do Bible study?

I guess the thrust of my questions here is, at what point do you feel that you become the dick for NOT helping him publish his own space Bible? At what point are you actively impeding new religious discoveries by failing to engage?

For real, literal Christianity, I think there's no amount of cajoling or argumentation that could lead a Christian to accept the new space Bible. For one thing, until the Pope signs off on it, they would no longer be Christian if they did.

Does rationalism aspire to be more than just another provably-false religion? What would ET Jaynes say about people who fail to update on new evidence?

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-03T21:02:51.948Z · LW · GW

Recorded a sort of video lecture here: https://open.substack.com/pub/bittertruths/p/semi-adequate-equilibria

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-03T19:29:00.314Z · LW · GW

Agree that it should be time-based rather than karma-based.

I'm currently on a very heavy rate limit that I think is being manually adjusted by the LessWrong team.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-03T01:28:01.867Z · LW · GW

Is this clear enough:

I posit that the reason that humans are able to solve any coordination problems at all is that evolution has shaped us into game players that apply something vaguely like a tit-for-tat strategy meant to enforce convergence to a nearby Schelling Point / Nash Equilibrium, and to punish defectors from this Schelling Point / Nash Equilibrium. I invoke a novel mathematical formalization of Kant's Categorical Imperative as a potential basis for coordination towards a globally computable Schelling Point. I believe that this constitutes a promising approach to the alignment problem, as the mathematical formalization is both simple to implement and reasonably simple to measure deviations from. Using this formalization would therefore allow us both to prevent and detect misalignment in powerful AI systems. As a theory of change, I believe that applying RLHF to LLM's using a strong and consistent formalization of the Categorical Imperative is a plausible and reasonably direct route to good outcomes in the prosaic case of LLM's, and I believe that LLM's with more neuromorphic components added are a strong contender for a pathway to AGI.
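To illustrate the mechanical claim (convergence to a cooperative equilibrium plus punishment of defectors) in the simplest possible setting, here is a toy iterated prisoner's dilemma; this is a generic tit-for-tat illustration with textbook payoffs, not the formalization of the Categorical Imperative itself:

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    """Cooperate first, then mirror the opponent's last move."""
    return "C" if not history else history[-1][1]

def defector(history):
    return "D"

def play(strategy_a, strategy_b, rounds=50):
    history_a, history_b = [], []   # each entry: (my_move, their_move)
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strategy_a(history_a), strategy_b(history_b)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history_a.append((a, b))
        history_b.append((b, a))
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))  # mutual cooperation: (150, 150)
print(play(tit_for_tat, defector))     # the defector is held near the punishment payoff
```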

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-02T17:30:07.445Z · LW · GW

My understanding of the current situation with me is that I am not in fact rate-limited purely by automatic processes currently, but rather by some sort of policy decision on the part of LessWrong's moderators.

Which is fine, I'll just continue to post my alignment research on my substack, and occasionally dump linkposts to them in my shortform, which the mods have allowed me continued access to.

Comment by MadHatter on Ethicophysics I · 2023-12-01T19:00:50.029Z · LW · GW

https://github.com/epurdy/ethicophysics/blob/main/writeup1.pdf

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-01T17:23:42.057Z · LW · GW

Perhaps it is about right, then.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-01T14:43:04.117Z · LW · GW

Yes, this is a valid and correct point. The observed and theoretical Nash Equilibrium of the Wittgensteinian language game of maintaining consensus reality is indeed not to engage with cranks who have not Put In The Work in a way that is visible and hard-to-forge.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-01T14:26:03.691Z · LW · GW

Thank you for this answer. I agree that I have not visibly been putting in the work to make falsifiable predictions relevant to the ethicophysics. Such predictions can indeed be made in the ethicophysics, but they're less predictions and more "self-fulfilling prophecies" that have the effect of compelling the reader to comply with a request to the extent that they take the request seriously. Which, in plain language, is some combination of wagers, promises, and threats.

And it seems impolite to threaten people just to get them to read a PDF.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T13:52:59.519Z · LW · GW

And I also have the courage to apply to Y Combinator to start either a 501c3 or a for-profit company to actually perform this trial through legal, official channels. Do you think that I will be denied entry into their program with such a noble goal and the collaboration of a domain expert?

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T13:49:16.194Z · LW · GW

I have the courage to commit an act of civil disobedience in which I ask people caring for Alzheimer's patients to request a Zoloft and/or Trazodone prescription for their loved ones, and then track the results.

Do you think I lack the persistence and capital to organize something of that nature? Why or why not?

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T13:45:22.354Z · LW · GW

Well then, I submit that courage is a virtue, when tempered with the wisdom not to pick fights you do not plan to finish.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T13:41:05.174Z · LW · GW

This comment continues to annoy me. I composed a whole irrational response in my mind where I would make credible threats to burn significant parts of the capabilities commons every time someone called me delusional on LessWrong.

But that's probably not a reasonable way to live my life, so this response is not that response.

I get that history is written by the victors. I get that what is accepted by consensus reality is dictated by the existing power structures. The fact that you would presume to explain these things to the author of Ethicophysics I and Ethicophysics II simply demonstrates that you have either failed to read these documents, failed to understand what they are saying, or are simply too pigheaded for your own or anyone else's good.

I do not appreciate being called delusional, and I must ask you never to use that word in reference to me again (at least not to my face). You have my permission to use the synonyms "irrational" (for claims that you believe have poor Bayesian hygiene) or "unreasonable" (for claims or suggestions that you think do not lead to Pareto-optimal outcomes consistent with morality as it is commonly understood).

If you do continue to be condescending to me, then my response is always going to be some mostly-polite but quite direct explanation of the current location of your foot relative to the current location of your oral cavity. 

And history will not be kind to both sides of any discussion like that. I offer you a choice: continue to pooh-pooh my proposed treatment for Alzheimer's disease, and accept some wager at my proposed odds on the probable outcome of this trial when I organize it out of my own free time and my own funds, or (and I say this as politely as I know how) find some more reasonable use of the precious gift of your time on Earth than to comment on my work, my character, or the inside of my head.

Comment by MadHatter on Some Intuitions for the Ethicophysics · 2023-12-01T12:21:28.688Z · LW · GW

But it uses the tools of physics, so the math would best be checked by someone who understands Lagrangian mechanics at a professional level.

Comment by MadHatter on Some Intuitions for the Ethicophysics · 2023-12-01T12:20:21.995Z · LW · GW

Yes, it is a specification of a set of temporally adjacent computable Schelling Points. It thus constitutes a trajectory through the space of moral possibilities that can be used by agents to coordinate and punish defectors from a globally consistent morality whose only moral stipulations are such reasonable sounding statements as "actions have consequences" and "act more like Jesus and less like Hitler".

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T12:10:29.768Z · LW · GW

And I'm happy to code up the smartphone app and run the clinical trial from my own funds. My uncle is starting to have memory trouble, I believe.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T12:08:57.953Z · LW · GW

Oh, come on. If the rationality community disapproved of Einstein predicting the anomalous perihelion precession of Mercury, that's an L for the rationality community, not for Einstein.

I have offered to say why I believe it to be true, as soon as I can get clearance from my company to publish capabilities relevant theoretical neuroscience work.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-12-01T12:04:34.713Z · LW · GW

That's fair, and I need to do a better job of building on-ramps for different readers. My most recent shortform is an attempt to build such an on-ramp for the LessWrong memeplex.

Comment by MadHatter on Ethicophysics II: Politics is the Mind-Savior · 2023-12-01T12:01:57.184Z · LW · GW

That's fair (strong up/agree vote).

If you consult my recent shortform, I lay out a more measured, skeptical description of the project. Basically, ethicophysics constitutes a globally computable Schelling Point, such that it can be used as a protocol between different RL agents that believe in "oughts" to achieve Pareto-optimal outcomes. As long as the largest coalition agrees to prefer Jesus to Hitler, I think (and I need to do far more to back this up) defectors can be effectively reined in, the same way that Bitcoin works because the majority of the computers hooked up to it don't want to destroy faith in the Bitcoin protocol.

Comment by MadHatter on MadHatter's Shortform · 2023-12-01T04:09:22.000Z · LW · GW

Ethicophysics for Skeptics

Or, what the fuck am I talking about?

In this post, I will try to lay out my theories of computational ethics in as simple, skeptic-friendly, non-pompous language as I am able to do. Hopefully this will be sufficient to help skeptical readers engage with my work.

The ethicophysics is a set of computable algorithms that suggest (but do not require) specific decisions at the ethical decision points that arise in a multi-player reinforcement learning problem.

The design goal that the various equations need to satisfy is that they should select a uniquely identifiable Schelling Point and Nash Equilibrium, such that all participants in the reinforcement learning algorithm who follow the ethicophysical algorithms will cooperate to achieve a Pareto-optimal outcome with high total reward, and such that any non-dominant coalition of defectors can be contained and if necessary neutralized by the larger, dominant coalition of players following the strategies selected by the ethicophysics.

The figure of merit that determines whether the ethicophysical algorithms are having the intended effect is the divergence of the observed outcome from a notional outcome that would be achieved through pure coordination and pure cooperation between all players. The existence of this divergence between what is and what could be in the absence of coordination problems is roughly what I take to be the content of Scott Alexander's post on Moloch. I denote this phenomenon ("Moloch") by the philosophical term "collective akrasia", since it is a failure of communities to exercise self-mastery that is roughly isomorphic to the classic philosophical problem of akrasia. Rather than the classical question "why do I not do as I ought?", it is a matter of "why do we not do as we ought?", where "we" refers to the community under consideration.
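As a toy numerical illustration of this figure of merit, here is the gap computed in a standard linear public-goods game; the game and numbers are invented purely for illustration and have nothing to do with the conservation laws themselves:

```python
# Toy figure-of-merit calculation: how far the observed outcome falls
# short of the notional pure-cooperation outcome.
N_PLAYERS = 5
ENDOWMENT = 10.0
MULTIPLIER = 2.0   # each contributed unit becomes 2 units, split evenly

def total_welfare(contributions):
    pot = MULTIPLIER * sum(contributions)
    kept = sum(ENDOWMENT - c for c in contributions)
    return kept + pot

full_cooperation = total_welfare([ENDOWMENT] * N_PLAYERS)   # everyone contributes
observed = total_welfare([0.0] * N_PLAYERS)                  # everyone free-rides
collective_akrasia = full_cooperation - observed             # welfare left on the table
print(full_cooperation, observed, collective_akrasia)        # 100.0 50.0 50.0
```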

So then the question is, what algorithms should we use to minimize collective akrasia? Phrased in this simple language, it becomes clear that many important civilizational problems fall under this rubric; in particular, climate change is maybe 10% a technical issue and 90% a coordination problem. In particular, China is unwilling to cooperate with the US because it feels that the US is “pulling the ladder up after itself” in seeking to limit the emissions of China’s rapidly industrializing society more than the US’s reasonably mature and arguably stagnating industrial base.

So we (as a society) find ourselves in need of a set of algorithms to decide what is “fair”, in a way that is visibly “the best we can do”, in the sense of Pareto optimality, but also in some larger, more important sense of minimizing collective akrasia, or “the money we are leaving on the table”, or “Moloch”.

The conservation laws defined in Ethicophysics I and Ethicophysics II operate as a sort of “social fact” technology to establish common knowledge of what is fair and what is not fair. Once a large and powerful coalition of agents has common knowledge of a computable Schelling Point and Nash Equilibrium, we can simply steer towards that Schelling Point and punish those who defect against the selected Schelling Point in a balanced, moderate way, that is simply chosen to incentivize cooperation.

So, my goal in publishing and promulgating the ethicophysical results I have proved so far is to allow people who are interested in solving the problem of aligning powerful intelligences to also join a large coalition of people who are steering towards a mutually compatible Schelling Point that is more Jesus-flavored and less Hitler-flavored, which seems like something that all reasonable people can get behind.

I thus argue that, far from being a foreign irritant that needs to be expelled from the LessWrong memetic ecosystem, the ethicophysics is a rigorous piece of mathematics and technology that is both necessary and sufficient to deliver us from our collective nightmare of collective akrasia, or “Moloch”, which is one of the higher and more noble ambitions espoused by the effective altruist community.

In future posts, I will review the extant results in the ethicophysics, particularly the conservation laws, and show them to be intuitively plausible descriptions of inarguably real phenomena. For instance, the Law of Conservation of Bullshit would translate into something much like the Simulacra Levels of a popular LessWrong post, in which self-interested actors who stop tracking the true meanings of things over time develop collective akrasia so thorough and so entrenched that they lose the ability to say true things even when they are trying to.

Stay tuned for further updates. Probably the next post will simply treat some very simple cause-and-effect ethical word problems, such as the classic “Can you lie to the Nazis about the location of someone they are looking for?”

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T02:25:46.335Z · LW · GW

They really suck. The old paradigm of Alzheimer's research is very weak and, as I understand it, no drug has an effect size sufficient to offset even a minimal side effect profile, to the point where I think only one real drug has been approved by the FDA in the old paradigm, and that approval was super controversial. That's my understanding, anyway. I welcome correction from anyone who knows better.

So maybe we should define the effect size in terms of cognitive QALY's? Say, an effective treatment should at least halve the rate of decline of the experimental arm relative to the control arm, with a stretch goal of bringing the decline to within the nuisance levels of normal, non-Alzheimer's aging, and an even stretchier stretch goal of reversing the condition to the point where the former Alzheimer's patient starts learning new skills and acquiring new hobbies.
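As a minimal sketch of how that halving criterion could be computed from longitudinal cognitive scores (the trajectories below are invented, and the MMSE-like scale is just an assumption):

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (points per month)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

months = [0, 3, 6, 9, 12]
control_scores = [26, 25, 23, 22, 20]     # invented MMSE-like trajectory
treatment_scores = [26, 26, 25, 25, 24]   # invented: slower decline

control_decline = slope(months, control_scores)      # about -0.5 points/month
treatment_decline = slope(months, treatment_scores)  # about -0.17 points/month

# The proposed success criterion: the treatment arm declines at no more
# than half the control arm's rate (both slopes are negative here).
print(treatment_decline >= 0.5 * control_decline)    # True for this invented data
```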

Comment by MadHatter on The Alignment Agenda THEY Don't Want You to Know About · 2023-12-01T02:12:30.348Z · LW · GW

Here is the best I could muster on short notice: https://bittertruths.substack.com/p/ethicophysics-for-skeptics

Since I'm currently rate-limited, I cannot post it officially.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T02:10:14.867Z · LW · GW

How will we handle the file-drawer effect, where insignificant results are quietly shelved? I guess if the trial is preregistered this won't happen...

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T01:29:54.980Z · LW · GW

https://chat.openai.com/share/068f5311-f11a-43fe-a2da-cbfc2227de8e

Here are ChatGPT's speculations on how much it would cost to run this study. I invite any interested reader to work on designing this study. I can also write up my theories as to why this etiology is plausible in arbitrary detail, if that is decision-relevant to someone with either grant money or interest in helping to code up the smartphone app we would need to collect the relevant measurements cheaply. (Intuitively, it would be something like a Dual N-Back app, but more user-friendly for Alzheimer's patients.)

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T01:12:52.295Z · LW · GW

I can put together some sort of proposal tonight, I suppose.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-12-01T01:11:14.719Z · LW · GW

OK, let's do it. Your nickel against my $100.

What resolution criteria should we use? Perhaps the first RCT that studies a treatment I deem sufficiently similar has to find a statistically significant effect with a publishable effect size? Or should we require that the first RCT that studies a similar treatment is halted halfway through because it would be unethical for the control group not to receive the treatment? (We could have a side bet on the latter, perhaps.)

What would the study look like? Presumably scores on a standard cognitive test designed to measure decline in Alzheimer's patients, with four arms: control, Zoloft, Trazodone, and Zoloft + Trazodone (with all of the other components of the treatment, in particular the EPA/DHA, held constant for all non-control subjects). Let me know if you have any thoughts on the study design or if I should put together a grant proposal to study this.
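For a rough sense of scale, here is a back-of-envelope sample-size sketch for each treatment-vs-control comparison, using a standard normal-approximation formula; the standardized effect size of 0.4 and the Bonferroni-style alpha split are pure assumptions, not claims about the real design:

```python
from statistics import NormalDist

def n_per_arm(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Normal-approximation sample size per arm for a two-sample
    comparison of means (one treatment arm vs. control)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return int(round(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2))

# Assumed standardized effect size of 0.4 (a guess, not a claim), with
# alpha split across the three treatment-vs-control comparisons.
print(n_per_arm(effect_size=0.4, alpha=0.05 / 3))  # roughly 130 per arm under these assumptions
```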

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T22:51:54.678Z · LW · GW

OK, sounds good! Consider it a bet.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T22:49:31.306Z · LW · GW

I wouldn't say I really do satire? My normal metier is more "the truth, with jokes". If I'm acting too crazy to be considered a proper rationalist, it's usually because I am angry or at least deeply annoyed.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T22:38:41.946Z · LW · GW

OK. I can only personally afford to be wrong to the tune of about $10K, which would be what, $5 on your part? Did I do that math correctly?

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-11-30T22:36:25.262Z · LW · GW

OK, anybody who publicly bets on my predicted outcome to the RCT wins the right to engage me in a LessWrong dialogue on a topic of their choosing, in which I will politely set aside my habitual certainty and trollish demeanor.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-11-30T22:33:48.920Z · LW · GW

Well that should be straightforward, and is predicted by my model of serotonin's function in the brain. It would require an understanding of the function of orexin, which I do not currently possess, beyond the standard intuition that it modulates hunger. 

The evolutionary story would be this:

  • serotonin functions (in my model) to make an agent satisficing, which has many desirable safety properties, e.g. not getting eaten by predators when you forage unnecessarily
  • the most obvious and important desire to satisfy (and neurally mark as satisfied) is the hunger for food modulated by the hormone/neurotransmitter orexin
  • the most obvious mechanism (and thus the one I predict) is that serotonergic bacteria in the gut activate some neural population in the gut's "second brain", sending a particular neural signal bundle to the primary brain consistent with malnutrition (there are many details here that I have not worked out and which could be usefully worked on by a qualified theoretical neuroscientist)
  • this neural signal bundle would necessarily up(???)modulate the orexin signal(???)
  • sustained high levels of orexin lead to autocannibalism of the brain through sustained neural pruning

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T21:32:03.770Z · LW · GW

Well, what's the appropriate way to act in the face of the fact that I AM sure I am right? I've been offering public bets of a high-karma person's nickel against my $100, which seems like a fair and attractive bet for anyone who doubts my credibility and ability to reason about the things I am talking about.

I will happily bet anyone with significant karma that Yudkowsky will find my work on the ethicophysics valuable a year from now, at the odds given above.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T21:25:40.646Z · LW · GW

Historically, I have been extremely, extremely good at delaying publication of what I felt were capabilities-relevant advances, for essentially Yudkowskyan doomer reasons. The only reward I have earned for this diligence is to be treated like a crank when I publish alignment-related research, because I don't have an extensive history of public contribution to the AI field.

Here is my speculation of what Q* is, along with a github repository that implements a shitty version of it, postdated several months.

https://bittertruths.substack.com/p/what-is-q

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T20:31:37.811Z · LW · GW

And now I am officially rate-limited to one post per week. Be sure to go to my substack if you are curious about what I am up to.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T20:30:29.127Z · LW · GW

Well, I'll just have to continue being first out the door, then, won't I?

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-11-30T20:13:39.348Z · LW · GW

And if people refuse to take such an attractive bet for the reason that my proposed cure sounds like it couldn't possibly hurt anyone, and might indeed help, then I reiterate the point I made in The Alignment Agenda THEY Don't Want You to Know About: the problem is not that my claims are prima facie ridiculous, it is that I myself am prima facie ridiculous.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-11-30T20:11:27.621Z · LW · GW

I will publicly wager $100 against a single nickel with the first 10 people with extremely high LessWrong karma who want to publicly bet against my predicted RCT outcome.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-11-30T19:20:30.376Z · LW · GW

https://alzheimergut.org/research/ is the place to look for all the latest research from the gut microbiome hypothesis community.

Comment by MadHatter on A Formula for Violence (and Its Antidote) · 2023-11-30T19:07:31.480Z · LW · GW

Agreed, this is a crucial lesson of history.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-11-30T19:06:09.931Z · LW · GW

Young people forget important stuff, get depressed, struggle to understand the world. That is the prediction of my model: that a bad gut microbiome would cause more neural pruning than is strictly optimal.

It is well documented that starving young people have lower IQ's, I believe? Certainly the claim does not seem prima facie ridiculous to me.

The older you get, the more chances you have to develop a bad gut microbiome. Perhaps the actual etiology of bad gut microbiomes (which I do not claim to understand) is heavily age-correlated. Or maybe we simply do not label neural pruning induced by fake starvation perceptions as Alzheimer's in the absence of old age. 

Note that researchers in the Alzheimer's gut microbiome community have induced Alzheimer's-like symptoms in young healthy mice by transferring something (tissue?) from the brains of human Alzheimer's patients to the stomachs of young healthy mice; thus, I consider this particular claim (that young people can get Alzheimer's) to have heavy empirical validation, if only in animal models.

Comment by MadHatter on Enkrateia: a safe model-based reinforcement learning algorithm · 2023-11-30T18:59:16.006Z · LW · GW

Then maybe the alignment problem is a stupid problem to try to solve? I don't believe this, and have spent the past five years working on the alignment problem. But your argument certainly seems like a general purpose argument that we could and should surrender our moral duties to a fancy algorithm as a cost-saving measure, and that anyone who opposes that is a technophobe who Does Not Get the Science.

Comment by MadHatter on A Proposed Cure for Alzheimer's Disease??? · 2023-11-30T17:51:00.660Z · LW · GW

Also, for interested readers, I am happy to post a more detailed mechanistic neuroscience explanation of my theory, but want to make sure I'm not breaking my company NDA's by sharing it first.

Comment by MadHatter on Enkrateia: a safe model-based reinforcement learning algorithm · 2023-11-30T17:43:14.357Z · LW · GW

What's so bad about keeping a human in the loop forever? Do we really think we can safely abdicate our moral responsibilities?

Comment by MadHatter on A Formula for Violence (and Its Antidote) · 2023-11-30T17:41:57.410Z · LW · GW

I'm not trying to generate revenue for Wayne. I'm trying to spread his message to force the hand of the judicial system to not imprison him for longer than they already have.

Comment by MadHatter on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T16:48:53.281Z · LW · GW

Well, perhaps we can ask, what is reading about? Surely it involves reading through clearly presented arguments and trying to understand the process that generated them, and not presupposing any particular resolution to the question "is this person crazy" beyond the inevitable and unenviable limits imposed by our finite time on Earth.

Comment by MadHatter on A Formula for Violence (and Its Antidote) · 2023-11-30T16:29:18.215Z · LW · GW

That's fair. I just want Wayne to get out of jail soon because he's a personal friend of mine.