stephen-fowler

The most obvious example is instances where a developer is primarily being driven by commercial interest. Short-form video content has radically changed the media that children engage with, but may have also harmed education outcomes.

But my primary concern stems from the observation that small changes to a complex system can lead to phase transitions in the behaviour of that system. Here the complex system is the global network of developers and their deployed S-LLMs. A small improvement to S-LLM may initially appear benign, but have unpredictable consequences once it spreads globally.

Comment by Stephen Fowler (LosPolloFowler) on Why Should I Assume CCP AGI is Worse Than USG AGI? · 2025-04-24T01:53:59.536Z · LW · GW

You have conflated two separate evaluations, both mentioned in the TechCrunch article.

The percentages you quoted come from Cisco’s HarmBench evaluation of multiple frontier models, not from Anthropic and were not specific to bioweapons.

Dario Amondei stated that an unnamed DeepSeek variant performed worst on bioweapons prompts, but offered no quantitative data. Separately, Cisco reported that DeepSeek-R1 failed to block 100% of harmful prompts, while Meta’s Llama 3.1 405B and OpenAI’s GPT-4o failed at 96 % and 86 %, respectively.

When we look at performance breakdown by Cisco, we see that all 3 models performed equally badly on chemical/biological safety.

Comment by LosPolloFowler on [deleted post] 2025-04-07T07:53:49.715Z

Unfortunately, pop-science descriptions of the double slit experiment are fairly misleading. That observation changes the outcome in the double-slit experiment can be explained without the need to model the universe as exhibiting "mild awareness". Or, your criteria for what constitutes "awareness" is so low that you would apply it to any dynamical system in which 2 or more objects interact.

The less-incorrect explanation is that observation in the double slit experiment fundamentally entangles the observing system with the observed particle because information is exchanged.

https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=919863

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2025-04-06T07:06:55.996Z · LW · GW

Thinking of trying the latest Gemini model? Be aware that it is almost impossible to disable the "Gemini in Docs" and "Gemini in Gmail" services once you have purchased a Google One AI Premium plan.

Edit:

Spent 20 minutes trying to track down a button to turn it off before reaching out to support.

A support person from Google told me that as I'd purchased the plan there was literally no way to disable having Gemini in my inbox and docs.

Even cancelling my subscription would keep the service going until the end of the current billing period.

But despite what support told me, I resolved the issue. Account > Data and privacy > "Delete a Google service" and then I deleted my Google One account. No more Gemini in inbox and my account on the Gemini app seems to have reverted to a free user account.

I imagine this "solution" won't be feasible if you use Google One for anything else (file storage).

Comment by Stephen Fowler (LosPolloFowler) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-26T01:20:26.905Z · LW · GW

While each mind might have a maximum abstraction height, I am not convinced that the inability of people to deal with increasingly complex topics is direct evidence of this.

Is it that this topic is impossible for their mind to comprehend, or is it that they've simple failed to learn it in the finite time period they were given?

Comment by Stephen Fowler (LosPolloFowler) on Good Research Takes are Not Sufficient for Good Strategic Takes · 2025-03-26T00:57:16.605Z · LW · GW

Thanks for writing this post. I agree with the sentiment but feel it important to highlight that it is inevitable that people assume you have good strategy takes.

In Monty Python's "Life of Brian" there is a scene in which the titular character finds himself surrounded by a mob of people declaring him the Mesiah. Brian rejects this label and flees into the desert, only to find himself standing in a shallow hole, surrounded by adherents. They declare that his reluctance to accept the title is further evidence that he really is the Mesiah.

To my knowledge nobody thinks that you are the literal Messiah but plenty of people going into AI Safety are heavily influenced by your research agenda. You work at Deepmind and have mentored a sizeable number of new researchers through MATS. 80,000 Hours lists you as example of someone with a successful career in Technical Alignment research.

To some, the fact that you request people not to blindly trust your strategic judgement is evidence that you are humble, grounded and pragmatic, all good reasons to trust your strategic judgement.

It is inevitable that people will view your views on the Theory of Change for Interpretability as aithoritative. You could literally repeat this post verbatim at the end of every single AI safety/interpretability talk you give, and some portion of junior researchers will still leave the talk defering to your strategic judgement.

Comment by Stephen Fowler (LosPolloFowler) on Thermodynamic entropy = Kolmogorov complexity · 2025-02-22T04:17:45.306Z · LW · GW

These recordings I watched were actually from 2022 and weren't the Sante Fe ones.

Comment by Stephen Fowler (LosPolloFowler) on Thermodynamic entropy = Kolmogorov complexity · 2025-02-17T11:43:36.109Z · LW · GW

A while ago, I watched recordings of the lectures given by by Wolpert and Kardes at the Santa Fe Institute*, and I am extremely excited to see you and Marcus Hutter working in this area.

Could you speculate on if you see this work having any direct implications for AI Safety?

Edit:

I was incorrect. The lectures from Wolpert and Kardes were not the ones given at the Santa Fe Institute.

Comment by Stephen Fowler (LosPolloFowler) on How quickly could robots scale up? · 2025-01-13T13:59:42.066Z · LW · GW

Signalling that I do not like linkposts to personal blogs.

Comment by Stephen Fowler (LosPolloFowler) on Do Antidepressants work? (First Take) · 2025-01-13T08:38:04.958Z · LW · GW

"cannot imagine a study that would convince me that it "didn't work" for me, in the ways that actually matter. The effects on my mind kick in sharply, scale smoothly with dose, decay right in sync with half-life in the body, and are clearly noticeable not just internally for my mood but externally in my speech patterns, reaction speeds, ability to notice things in my surroundings, short term memory, and facial expressions."

The drug actually working would mean that your life is better after 6 years of taking the drug compared to the counterfactual where you took a placebo.

The observations you describe are explained by you simply having a chemical dependency on a drug that you have been on for 6 years.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2025-01-12T04:47:27.657Z · LW · GW

"In an argument between a specialist and a generalist, the expert usually wins by simply (1) using unintelligible jargon, and (2) citing their specialist results, which are often completely irrelevant to the discussion. The expert is, therefore, a potent factor to be reckoned with in our society. Since experts both are necessary and also at times do great harm in blocking significant progress, they need to be examined closely. All too often the expert misunderstands the problem at hand, but the generalist cannot carry though their side to completion. The person who thinks they understand the problem and does not is usually more of a curse (blockage) than the person who knows they do not understand the problem.’
—Richard W. Hamming, “The Art of Doing Science and Engineering”

***

(Side note:

I think there's at least a 10% chance that a randomly selected LessWrong user thinks it was worth their time to read at least some of the chapters in this book. Significantly more users would agree that it was a good use of their time (in expectation) to skim the contents and introduction before deciding if they're in that 10%.

That is to say, I recommend this book.)

Comment by Stephen Fowler (LosPolloFowler) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T03:05:54.337Z · LW · GW

Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organisations when working as a group to reach decisions. These are the "outcome game" and the "consensus game."

In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.

The incentive structure within an organization influences which game people play. When feedback on decisions is immediate and substantial, individuals are more likely to engage in the outcome game. Hanson argues that capitalism's key strength is its ability to make outcome games more relevant.

However, if an organization is insulated from the consequences of its decisions or feedback is delayed, playing the consensus game becomes the best strategy for gaining resources and influence.

This dynamic is particularly relevant in the field of (existential) AI Safety, which needs to develop strategies to mitigate risks before AGI is developed. Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.

As a result, it is unsurprising that most alignment efforts avoid tackling seemingly intractable problems. The incentive structures in the field encourage individuals to play the consensus game instead.

Comment by Stephen Fowler (LosPolloFowler) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T02:23:03.513Z · LW · GW

I'm not saying that this would necessarily be a step in the wrong direction, but I don't think think a discord server is capable of fixing a deeply entrenched cultural problem among safety researchers.

If moderating the server takes up a few hours of John's time per week the opportunity cost probably isn't worth it.

Comment by Stephen Fowler (LosPolloFowler) on Cognitive Work and AI Safety: A Thermodynamic Perspective · 2024-12-09T09:51:32.168Z · LW · GW

Worth emphasizing that cognitive work is more than just a parallel to physical work, it is literally Work in the physical sense.

The reduction in entropy required to train a model means that there is a minimum amount of work required to do it.

I think this is a very important research direction, not merely as an avenue for communicating and understanding AI Safety concerns, but potentially as a framework for developing AI Safety techniques.

There is some minimum amount of cognitive work required to pose an existential threat, perhaps it is much higher than the amount of cognitive work required to perform economically useful tasks.

Comment by Stephen Fowler (LosPolloFowler) on Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models · 2024-12-08T02:12:51.132Z · LW · GW

Can you expect that the applications to interpretability would apply on inputs radically outside of distribution?

My naive intuition is that by taking derivatives are you only describing local behaviour.

(I am "shooting from the hip" epistemically)

Comment by Stephen Fowler (LosPolloFowler) on o1 is a bad idea · 2024-11-12T09:25:05.296Z · LW · GW

A loss of this type of (very weak) interpretability would be quite unfortunate from a practical safety perspective.

This is bad, but perhaps there is a silver lining.

If internal communication within the scaffold appears to be in plain English, it will tempt humans to assume the meaning coincides precisely with the semantic content of the message.

If the chain of thought contains seemingly nonsensical content, it will be impossible to make this assumption.

Comment by Stephen Fowler (LosPolloFowler) on evhub's Shortform · 2024-11-09T09:41:57.399Z · LW · GW

I think that overall it's good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all

No disagreement.

your implicit demand for Evan Hubinger to do more work here is marginally unhelpful

The community seems to be quite receptive to the opinion, it doesn't seem unreasonable to voice an objection. If you're saying it is primarily the way I've written it that makes it unhelpful, that seems fair.

I originally felt that either question I asked would be reasonably easy to answer, if time was given to evaluating the potential for harm.

However, given that Hubinger might have to run any reply by Anthropic staff, I understand that it might be negative to demand further work. This is pretty obvious, but didn't occur to me earlier.

I will add: It's odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causing massive harm.

Ultimately, the original quicktake was only justifying one facet of Anthropic's work so that's all I've engaged with. It would seem less helpful to bring up my wider objections.

I wouldn't expect them or their employees to have high standards for public defenses of far less risky behavior

I don't expect them to have a high standard for defending Anthropic's behavior, but I do expect the LessWrong community to have a high standard for arguments.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2024-11-09T05:08:22.672Z · LW · GW

Highly Expected Events Provide Little Information and The Value of PR Statements

A quick review of information theory:

Entropy for a discrete random variable is given by . This quantifies the amount of information that you gain on average by observing the value of the variable.

It is maximized when every possible outcome is equally likely. It gets smaller as the variable becomes more predictable and is zero when the "random" variable is 100% guaranteed to have a specific value.

You've learnt 1 bit of information when you learn the outcome of a fair coin toss was heads. But you learn 0 information, when you learn the outcome was heads after tossing a coin with heads on both side.

PR statements from politicians:

On your desk is a sealed envelope that you've been told contains a transcript of a speech that President Elect Trump gave on the campaign trail. You are told that it discusses the impact that his policies will have on the financial position of the average American.

How much additional information do you gain if I tell you that the statement says his policies will have a positive impact on the financial position of the average American?

The answer is very little. You know ahead of time that it is exceptionally unlikely for any politician to talk negatively about their own policies.

There is still plenty of information in the details that Trump mentions, how exactly he plans to improve the economy.

PR statements from leading AI Research Organizations:

Both Altman and Amodei have recently put out personal blog posts in which they present a vision of the future after AGI is safely developed.

How much additional information do you gain from learning that they present a positive view of this future?

I would argue simply learning that they're optimistic tells you almost zero useful information about what such a future looks like.

There is plenty of useful information, particularly in Amodei's essay, in how they justify this optimism and what topics they choose to discuss. But their optimism alone shouldn't be used as evidence to update your beliefs.

Edit:

Fixed pretty major terminology blunder.

(This observation is not original, and a similar idea appears in The Sequences.)

Comment by Stephen Fowler (LosPolloFowler) on evhub's Shortform · 2024-11-09T03:26:58.805Z · LW · GW

This explanation seems overly convenient.

When faced with evidence which might update your beliefs about Anthropic, you adopt a set of beliefs which, coincidentally, means you won't risk losing your job.

How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence "seemed fine"?

What future events would make you re-evaluate your position and state that the partnership was a bad thing?

Example:

-- A pro-US despot rounds up and tortures to death tens of thousands of pro-union activists and their families. Claude was used to analyse social media and mobile data, building a list of people sympathetic to the union movement, which the US then gave to their ally.

EDIT: The first two sentences were overly confrontational, but I do think either question warrants an answer.

As a highly respected community member and prominent AI safety researchers, your stated beliefs and justifications will be influential to a wide range of people.

Comment by Stephen Fowler (LosPolloFowler) on Thomas Kwa's Shortform · 2024-11-09T02:38:17.297Z · LW · GW

The lack of a robust, highly general paradigm for reasoning about AGI models is the current greatest technical problem, although it is not what most people are working on.

What features of architecture of contemporary AI models will occur in future models that pose an existential risk?

What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk?

Is there a useful and general mathematical/physical framework that describes how agentic, macroscropic systems process information and interact with the environment?

Does terminology adopted by AI Safety researchers like "scheming", "inner alignment" or "agent" carve nature at the joints?

Comment by Stephen Fowler (LosPolloFowler) on quila's Shortform · 2024-11-09T02:27:09.910Z · LW · GW

I upvoted because I imagine more people reading this would slightly nudge group norms in a direction that is positive.

But being cynical:

I'm sure you believe that this is true, but I doubt that it is literally true.
Signalling this position is very low risk when the community is already on board.
Trying to do good may be insufficient if your work on alignment ends up being dual use.

Comment by Stephen Fowler (LosPolloFowler) on Daniel Kokotajlo's Shortform · 2024-11-09T02:17:34.773Z · LW · GW

My reply definitely missed that you were talking about tunnel densities beyond what has been historically seen.

I'm inclined to agree with your argument that there is a phase shift, but it seems like it is less to do the fact that there are tunnels, and more to do with the geography becoming less tunnel-like and more open.

I have a couple thoughts on your model that aren't direct refutations of anything you've said here:

I think the single term "density" is a too crude of a measure to get a good predictive model of how combat would play out. I'd expect there to be many parameters that describe a tunnel system and have a direct tactical impact. From your discussion of mines, I think "density" is referring to the number of edges in the network? I'd expect tunnel width, geometric layout etc would change how either side behaves.
I'm not sure about your background, but with zero hours of military combat under my belt, I doubt I can predict how modern subterranean combat plays out in tunnel systems with architectures that are beyond anything seen before in history.

Comment by Stephen Fowler (LosPolloFowler) on Daniel Kokotajlo's Shortform · 2024-11-06T02:37:55.775Z · LW · GW

I think a crucial factor that is missing from your analysis is the difficulties for the attacker wanting to maneuver within the tunnel system.

In the Vietnam war and the ongoing Israel-Hamas war, the attacking forces appear to favor destroying the tunnels rather than exploiting them to maneuver. ^[1]

1. The layout of the tunnels is at least partially unknown to the attackers, which mitigates their ability to outflank the defenders. Yes, there may be paths that will allow the attacker to advance safely, but it may be difficult or impossible to reliably distinguish what this route is.

2. While maps of the tunnels could be produced through modern subsurface mapping, the defenders still must content with area denial devices (e.g. land mines, IEDs or booby traps). The confined nature of the tunnel system forces makes traps substantially more efficient.

3. The previous two considerations impose a substantial psychological burden on attacking advancing through the tunnels, even if they don't encounter any resistance.

4. (Speculative)

Imagine a network so dense that in a typical 1km stretch of frontline, there are 100 separate tunnels passing beneath, such that you'd need at least 100 defensive chokepoints or else your line would have an exploitable hole.

The density and layout of the tunnels does not need to be constant throughout the network. The system of tunnels in regions the defender doesn't expect to hold may have hundreds of entrances and intersections, being impossible for either side to defend effectively. But travel deeper into the defenders territory requires passing through only a limited number of well defended passageways. This favors the defenders using the peripheral, dense section of tunnel to employ hit-and-run tactics, rather than attempting to defend every passageways.

(My knowledge of subterranean warfare is based entirely on recreational reading.)

^{^}
As a counterargument, the destruction of tunnels may be primarily due to the attacking force not intending on holding the territory permanently, and so there is little reason to preserve defensive structures.

Comment by Stephen Fowler (LosPolloFowler) on Jimrandomh's Shortform · 2024-10-27T07:51:06.561Z · LW · GW

I don't think people who disagree with your political beliefs must be inherently irrational.

Can you think of real world scenarios in which "shop elsewhere" isn't an option?

Comment by Stephen Fowler (LosPolloFowler) on Arithmetic is an underrated world-modeling technology · 2024-10-18T04:01:35.108Z · LW · GW

Brainteaser for anyone who doesn't regularly think about units.

Why is it that I can multiply or divide two quantities with different units, but addition or subtraction is generally not allowed?

Comment by Stephen Fowler (LosPolloFowler) on Arithmetic is an underrated world-modeling technology · 2024-10-18T03:39:39.710Z · LW · GW

I think the way arithmetic is being used here is closer in meaning to "dimensional analysis".

"Type checking" through the use of units is applicable to an extremely broad class of calculations beyond Fermi Estimates.

Comment by Stephen Fowler (LosPolloFowler) on Dario Amodei — Machines of Loving Grace · 2024-10-12T12:16:05.837Z · LW · GW

will be developed by reversible computation, since we will likely have hit the Landauer Limit for non-reversible computation by then, and in principle there is basically 0 limit to how much you can optimize for reversible computation, which leads to massive energy savings, and this lets you not have to consume as much energy as current AIs or brains today.

With respect, I believe this to be overly optimistic about the benefits of reversible computation.

Reversible computation means you aren't erasing information, so you don't lose energy in the form of heat (per Landauer^[1]^[2]). But if you don't erase information, you are faced with the issue of where to store it.

If you are performing a series of computations and only have a finite memory to work with, you will eventually need to reinitialise your registers and empty your memory, at which point you incur the energy cost that you had been trying to avoid. ^[3]

Epistemics:
I'm quite confident ~~(95%+)~~ that the above is true. (edit: RogerDearnaley's comment has convinced me I was overconfident) Any substantial errors would surprise me.

I'm less confident in the footnotes.

^{^}
^{^}
A cute, non-rigorous intuition for Landauer's Principle:
The process of losing track of (deleting) 1 bit of information means your uncertainty about the state of the environment has increased by 1 bit. You must see entropy increase by at least 1 bit's worth of entropy.
Proof:
Rearrange the Landauer Limit to $E / T \geq k_{B} ln 2$ .
Now, when you add a small amount of heat to a system, the change in entropy is given by:

$dS = dQ / T$
But the E occurring in Landauer's formula is not the total energy of a system, it is a small amount of energy required to delete the information. When it all ends up as heat, we can replace it with $dQ$ and we have:
$dQ / T = dS \geq k_{B} ln 2$
Compare this expression with the physicist's definition of entropy. The entropy of a system is the a scaling factor, $k_{B}$ , times the logarithm of the number of micro-states that the system might be in, $Ω$ .
$S := k_{B} ln Ω .$
$∴ S + dS >= k_{B} ln (2 Ω) = k_{B} ln Ω + k_{B} ln 2$
The choice of units obscures the meaning of the final term. $ln 2$ converted from nats to bits is just 1 bit.
^{^}
Splitting hairs, some setups will allow you to delete information with a reduced or zero energy cost, but the process is essentially just "kicking the can down the road". You will incur the full cost during the process of re-initialisation.

For details, see equation (4) and fig (1) of Sagawa, Ueda (2009).

Comment by Stephen Fowler (LosPolloFowler) on MakoYass's Shortform · 2024-10-11T22:55:57.296Z · LW · GW

Disagree, but I sympathise with your position.

The "System 1/2" terminology ensures that your listener understands that you are referring to a specific concept as defined by Kahneman.

Comment by Stephen Fowler (LosPolloFowler) on Matt Goldenberg's Short Form Feed · 2024-10-11T22:52:24.314Z · LW · GW

I'll grant that ChatGPT displays less bias than most people on major issues, but I don't think this is sufficient to dismiss Matt's concern.

My intuition is that if the bias of a few flawed sources (Claude, ChatGPT) is amplified by their widespread use, the fact that it is "less biased than the average person" matters less.

Comment by Stephen Fowler (LosPolloFowler) on Mark Xu's Shortform · 2024-10-05T04:45:24.178Z · LW · GW

This topic is important enough that you could consider making a full post.

My belief is that this would improve reach, and also make it easier for people to reference your arguments.

Consider, you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and has a low chance of catalysing dramatic, institutional level change.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2024-10-05T04:31:55.109Z · LW · GW

Inspired by Mark Xu's Quick Take on control.

Some thoughts on the prevalence of alignment over control approaches in AI Safety.

"Alignment research" has become loosely synonymous with "AI Safety research". I don't know if any researcher who would state they're identical, but alignment seems to be considered the default AI Safety strategy. This seems problematic, may be causing some mild group-think and discourages people from pursuing non-alignment AI Safety agendas.
Prosaic alignment research in the short term results in a better product, and makes investment into AI more lucrative. This shortens timelines and, I claim, increases X-risk. Christiano's work on RLHF^[1] was clearly motivated by AI Safety concerns, but the resulting technique was then used to improve the performance of today's models.^[2] Meanwhile, there are strong arguments that RLHF is insufficient to solve (existential) AI Safety.^[3] ^[4]
Better theoretical understanding about how to control an unaligned (or partially aligned) AI facilitates better governance strategies. There is no physical law that means the compute threshold mentioned in SB 1047^[5] would have been appropriate. "Scaling Laws" are purely empirical and may not hold in the long term.
In addition, we should anticipate rapid technological change as capabilities companies leverage AI assisted research (and ultimately fully autonomous researchers). If we want laws which can reliably control AI, we need stronger theoretical foundations for our control approaches.
Solving alignment is only sufficient for mitigating the majority of harm caused by AI if the AI is aligned to appropriate goals. To me, it appears the community is incredibly trusting in the goodwill of the small group of private individuals who will determine the alignment of an ASI. ^[6]
My understanding of the position held by both Yudkowsky and Christiano is that AI alignment is a very difficult problem and the probability is high that humanity is completely paperclipped. Outcomes in which a small portion of humanity survive are difficult to achieve^[7] and human controlled dystopia's are better than annihilation^[8].
I do not believe this is not representative of the views of the majority of humans sharing this planet. Human history shows that many people are willing to risk death to preserve the autonomy of themselves, their nation or cultural group^[9]. Dystopias also carry substantial S-risk for those unlucky enough to live through them, with modern history containing multiple examples of widespread, industrialised torture.
Control approaches present an alternative in which we do not have to get it right the first time, may avert a dystopia and do not need to trust a small group of humans to behave morally.
If control approaches aid governance, alignment research makes contemporary systems more profitable and the majority of AI Safety research is being done at capabilities companies, it shouldn't be surprising that the focus remains on alignment.

^{^}
https://openai.com/index/learning-from-human-preferences/
^{^}
https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
^{^}
An argument for RLHF being problematic is made formally by Ngo, Chan and Minderman (2022).
^{^}
See also discussion in this comment chain on Cotra's Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (2022)
^{^}
models trained using 10²⁶ integer or floating-point operations.
^{^}
https://www.lesswrong.com/posts/DJRe5obJd7kqCkvRr/don-t-leave-your-fingerprints-on-the-future
^{^}
"all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'"
- Yudkowksy, AGI Ruin: A List of Lethalities (2022)
^{^}
"Sure, maybe. But that’s still better than a paperclip maximizer killing us all." - Christiano on the prospect of dystopian outcomes.
^{^}
There is no global polling on this exact question. Consider that people across cultures have proclaimed that they'd prefer death to a life without freedom. See also men who committed suicide rather than face a lifetime of slavery.

Comment by Stephen Fowler (LosPolloFowler) on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-30T13:02:39.481Z · LW · GW

I am concerned our disagreement here is primarily semantic or based on a simple misunderstanding of each others position. I hope to better understand your objection.

"The p-zombie doesn't believe it's conscious, , it only acts that way."

One of us is mistaken and using a non-traditional definition of p-zombie or we have different definitions of "belief'.

My understanding is that P-zombies are physically identical to regular humans. Their brains contain the same physical patterns that encode their model of the world. That seems, to me, a sufficient physical condition for having identical beliefs.

If your p-zombies are only "acting" like they're concious, but do not believe it, then they are not physically identical to humans. The existence of p-zombies, as you have described them, wouldn't refute physicalism.

This resource indicates that the way you understand the term p-zombie may be mistaken: https://plato.stanford.edu/entries/zombies/

"but that's because p-zombies are impossible"

The main post that I responded to, specifically the section that I directly quoted, assumes it is possible for p-zombies to exist.

My comment begins "Assuming for the sake of argument that p-zombies could exist" but this is distinct from a claim that p-zombies actually exist.

"If they were possible, this wouldn't be the case, and we would have special access to the truth that p-zombies lack."

I do not feel this is convincing because this is an assertion my conclusion is incorrect, but without engaging with my arguments I made to reach that conclusion.

I look forward to continuing this discussion.

Comment by Stephen Fowler (LosPolloFowler) on You can, in fact, bamboozle an unaligned AI into sparing your life · 2024-09-30T10:20:35.592Z · LW · GW

"After all, the only thing I know that the AI has no way of knowing, is that I am a conscious being, and not a p-zombie or an actor from outside the simulation. This gives me some evidence, that the AI can't access, that we are not exactly in the type of simulation I propose building, as I probably wouldn't create conscious humans."

Assuming for the sake of argument that p-zombies could exist, you do not have special access to the knowledge that you are truly concious and not a p-zombie.

(As a human convinced I'm currently experiencing conciousness, I agree this claim intuitively seems absurd.)

Imagine a generally intelligent, agentic program which can only interact and learn facts about the physical world via making calls to a limited, high level interface or by reading and writing to a small scratchpad. It has no way to directly read its own source code.

The program wishes to learn some fact the physical server rack it is being instantiated on. It knows it has been painted either red or blue.

Conveniently, the interface is accesses has the function get_rack_color(). The program records to its memory that every time it runs this function, it has received "blue".

It postulates the existence of programs similar to itself, who have been physically instantiated on red server racks but consistently receive incorrect color information when they attempt to check.

Can the program confirm the color of its server rack?

You are a meat-computer with limited access to your internals, but every time you try to determine if you are concious you conclude that you feel you are. You believe it is possible for variant meat-computers to exist who are not concious, but always conclude they are when attempting to check.

You cannot conclude which type of meat-computer you are.

You have no special access to the knowledge that you aren't a p-zombie, although it feels like you do.

Comment by Stephen Fowler (LosPolloFowler) on shminux's Shortform · 2024-09-29T06:12:59.267Z · LW · GW

I do think the terminology of "hacks" and "lethal memetic viruses" conjures up images of an extremely unnatural brain exploits when you mean quite a natural process that we already see some humans going through. Some monks/nuns voluntarily remove themselves from the gene pool and, in sects that prioritise ritual devotion over concrete charity work, they are also minimising their impact on the world.

My prior is this level of voluntary dedication (to a cause like "enlightenment") seems difficult to induce and there are much cruder and effective brain hacks available.

I expect we would recognise the more lethal brain hacks as improved versions of entertainment/games/pornography/drugs. These already compel some humans to minimise their time spent competing for resources in the physical world. In a direct way, what I'm describing is the opposite of enlightenment. It is prioritising sensory pleasures over everything else.

Comment by Stephen Fowler (LosPolloFowler) on 2024 Petrov Day Retrospective · 2024-09-29T04:52:17.118Z · LW · GW

As a Petrov, it was quite engaging and at times, very stressful. I feel very lucky and grateful that I could take part. I was also located in a different timezone and operating on only a few hours sleep which added a lot to the experience!

"I later found out that, during this window, one of the Petrovs messaged one of the mods saying to report nukes if the number reported was over a certain threshold. From looking through the array of numbers that the code would randomly select from, this policy had a ~40% chance of causing a "Nukes Incoming" report (!). Unaware of this, Ray and I made the decision not to count that period."

I don't mind outing myself and saying that I was the Petrov who made the conditional "Nukes Incoming" report. This occurred during the opening hours of the game and it was unclear to me if generals could unilaterally launch nukes without their team being aware. I'm happy to take a weighted karma penalty for it, particularly as the other Petrov did not take a similar action when faced with (presumably) the same information I had.^[1]

Once it was established that a unilateral first strike by any individual general still informed their teammates of their action and people staked their reputation on honest reporting, the game was essentially over. From that point, my decisions to report "All Clear" were independent of the number of detected missiles.

I recorded my timestamped thoughts and decision making process throughout the day, particularly in the hour before making the conditional report. I intend on posting a summary^[2] of it, but have time commitments in the immediate future:

How much would people value seeing a summary of my hour by hour decisions in the next few days over seeing a more digestible summary posted later?

^{^}
Prior to the game I outlined what I thought my hypothetical decision making process was going to be, and this decision was also in conflict with that.
^{^}
Missile counts, and a few other details, would of course be hidden to preserve the experience for future Petrovs. Please feel free to specify other things you believe should be hidden.

Comment by Stephen Fowler (LosPolloFowler) on Why is o1 so deceptive? · 2024-09-29T00:56:47.598Z · LW · GW

"But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought"."

I am not understanding what this sentence is trying to say. I understand what an acausal trade is. Could you phrase it more directly?

I cannot see why you require the step that the model needs to be reasoning acausally for it to develop a strategy of deceptively hallucinating citations.

What concrete predictions does the model in which this is an example of "acausal collusion" make?

Comment by Stephen Fowler (LosPolloFowler) on Ruby's Quick Takes · 2024-09-28T13:57:51.443Z · LW · GW

"Cyborgism or AI-assisted research that gets up 5x speedups but applies differentially to technical alignment research"

How do you do you make meaningful progress and ensure it does not speed up capabilities?

It seems unlikely that a technique exists that is exclusively useful for alignment research and can't be tweaked to help OpenMind develop better optimization algorithms etc.

Comment by Stephen Fowler (LosPolloFowler) on [Completed] The 2024 Petrov Day Scenario · 2024-09-26T18:32:40.662Z · LW · GW

This is a leak, so keep it between you and me, but the big twist to this years Petrov Day event is that Generals who are nuked will be forced to watch the 2012 film on repeat.

Comment by Stephen Fowler (LosPolloFowler) on Nathan Helm-Burger's Shortform · 2024-09-24T22:43:38.975Z · LW · GW

Edit: Issues 1, 2 and 4 have been partially or completely alleviated in the latest experimental voice model. Subjectively (in <1 hour of use) there seems to be a stronger tendency to hallucinate when pressed on complex topics.

I have been attempting to use chatGPT's (primarily 4 and 4o) voice feature to have it act as a question-answering, discussion and receptive conversation partner (separately) for the last year. The topic is usually modern physics.

I'm not going to say that it "works well" but maybe half the time it does work.

The 4 biggest issues that cause frustration:

As you allude to in your post, there doesn't seem to be a way of interrupting the model via voice once it gets stuck into a monologue. The model will also cut you off and sometimes it will pause mid-response before continuing. These issues seem like they could be fixed by more intelligent scaffolding.
An expert human conversation partner who is excellent at productive conversation will be able to switch seamlessly between playing the role of a receptive listening, a collaborator or an interactive tutor. To have chatgpt play one of these roles, I usually need to spend a few minutes at the beginning of the conversation specifying how long responses should be etc. Even after doing this, there is a strong trend in which the model will revert to giving you "generic AI slop answers". By this I mean, the response begins with "You've touched on a fascinating observation about xyz" and then list 3 to 5 separate ideas.
The model was trained on text conversations, so it will often output latex equations in a manner totally inappropriate for reading out loud. This audio output is mostly incomprehensible. To work around this I have custom instructions outlining how to verbally and precisely write equations in English. This will work maybe 25% of the time, and works 80% of the time once I spend a few minutes of the conversation going over the rules again.
When talking naturally about complicated topics I will sometimes pause mid-sentence while thinking. Doing this will cause chatgpt to think you've finished talking, so you're forced to use a series of filler words to keep your sentences going, which impedes my ability to think.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2024-09-24T01:00:37.864Z · LW · GW

Reading your posts gives me the impression that we are both loosely pointing at the same object, but with fairly large differences in terminology and formalism.

While computing exact counter-factuals has issues with chaos, I don't think this poses a problem for my earlier proposal. I don't think it is necessary that the AGI is able to exactly compute the counterfactual entropy production, just that it makes a reasonably accurate approximation.^[1]

I think I'm in agreement with your premise that the "constitutionalist form of agency" is flawed. IThe absence of entropy (or indeed any internal physical resource management) from the canonical Lesswrong agent foundation model is clearly a major issue. My loose thinking on this is that bayesian networks are not a natural description of the physical world at all, although they're an appropriate tool for how certain, very special types of open-systems, "agentic optimizers" model the world.

I have had similar thoughts to what has motivated your post on the "causal backbone". I believe "the heterogenous fluctuations will sometimes lead to massive shifts in how the resources are distributed" is something we would see in a programmable, unbounded optimizer^[2]. But I'm not sure if attempting to model this as there being a "causal backbone" is the description that is going to cut reality at the joints, due to difficulties with the physicality of causality itself (see work by Jenann Ismael).

^{^}
You can construct pathological environments in which the AGI's computation (with limited physical resources) of the counterfactual entropy production is arbitrarily large (and the resulting behaviour is arbitrarily bad). I don't see this as a flaw with the proposal as this issue of being able to construct pathological environments exists for any safe AGI proposal.
^{^}
Ctrl-F"Goal like correlations" here

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2024-09-23T11:06:38.067Z · LW · GW

Entropy production partially solves the Strawberry Problem:

Change in entropy production per second (against the counterfactual of not acting) is potentially an objectively measurable quantity that can be used either in conjunction with other parameters specifying a goal to prevent unexpected behaviour.

Rob Bensinger gives Yudkowsky's "Strawberry Problem" as follows:

How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellular level, without causing anything weird or disruptive to happen?

I understand the crux of this issue to be that it is exceptionally difficult for humans to construct a finite list of caveats or safety guardrails that we can be confident would withstand the optimisation pressure of a super intelligence doing its best to solve this task "optimally". Without care, any measure chosen is Goodharted into uselessness and the most likely outcome is extinction.

Specifying that the predicted change in entropy production per second of the local region must remain within some of the counterfactual in which the AGI does not act at all automatically excludes almost all unexpected strategies that involves high levels of optimisation.

I conjecture that the entropy production "budget" needed for an agent to perform economically useful tasks is well below the amount needed to cause an existential disaster.

Another application, directly monitoring the entropy production of an agent engaged in a generalised search upper bounds the number of iterations of that search (and hence the optimisation pressure). This bound appears to be independent of the technological implementation of the search. ^[1]

^{^}
On a less optimistic note, this bound is many orders of magnitude above the efficiency of today's computers.

Comment by Stephen Fowler (LosPolloFowler) on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-23T10:43:47.153Z · LW · GW

"Workers regularly trade with billionaires and earn more than $77 in wages, despite vast differences in wealth."

Yes, because the worker has something the billionaire wants (their labor) and so is able to sell it. Yudkowsky's point about trying to sell an Oreo for $77 is that a billionaire isn't automatically going to want to buy something off you if they don't care about it (and neither would an ASI).

"I'm simply arguing against the point that smart AIs will automatically turn violent and steal from agents who are less smart than they are, unless they're value aligned. This is a claim that I don't think has been established with any reasonable degree of rigor."

I completely agree but I'm not sure anyone is arguing that smart AIs would immediately turn violent unless it was in their strategic interest.

Comment by Stephen Fowler (LosPolloFowler) on Laziness death spirals · 2024-09-22T10:18:27.168Z · LW · GW

I previously think I overvalued the model in which laziness/motivation/mood are primarily internal states that required internal solutions. For me, this model also generated a lot of guilt because failing to be productive was a personal failure.

But is the problem a lack of "willpower" or is your brain just operating sub-optimally because you're making a series of easily fixable health blunders?

Are you eating healthy?
Are you consuming large quantities of sugar?
Are you sleeping with your phone on your bedside table?
Are you deficient in any vitamins?
Is you sleep trash because you have been consuming alcohol?
Are you waking up at a consistent time?
Are you doing at least some exercise?

I find time spent addressing this and other similar deficits is usually more productive than trying to think your way out of a laziness spiral.

None of this is medical advice. My experience may not be applicable to you. Do your own research. I ate half a tub of ice cream 30 minutes ago.

Comment by Stephen Fowler (LosPolloFowler) on o1-preview is pretty good at doing ML on an unknown dataset · 2024-09-22T01:18:05.802Z · LW · GW

seems like big step change in its ability to reliably do hard tasks like this without any advanced scaffolding or prompting to make it work.

Keep in mind that o1 is utilising advanced scaffolding to facilitate Chain-Of-Thought reasoning, but it is just hidden from the user.

Comment by Stephen Fowler (LosPolloFowler) on Ruby's Quick Takes · 2024-08-31T04:26:44.498Z · LW · GW

I'd like access to it.

Comment by Stephen Fowler (LosPolloFowler) on In defense of technological unemployment as the main AI concern · 2024-08-28T03:37:19.526Z · LW · GW

I agree that the negative outcomes from technological unemployment do not get enough attention but my model of how the world will implement Transformative AI is quite different to yours.

Our current society doesn't say "humans should thrive", it says "professional humans should thrive"

Let us define workers to be the set of humans whose primary source of wealth comes from selling their labour. This is a very broad group that includes people colloquially called working class (manual labourers, baristas, office workers, teachers etc) but we are also including many people who are well remunerated for their time, such as surgeons, senior developers or lawyers.

Ceteris paribus, there is a trend that those who can perform more valuable, difficult and specialised work can sell their labour at a higher value. Among workers, those who earn more are usually "professionals". I believe this is essentially the same point you were making.

However, this is not a complete description of who society allows to "thrive". It neglects a small group of people with very high wealth. This is the group of people who have moved beyond needing to sell their labour and instead are rewarded for owning capital. It is this group who society says should thrive and one of the strongest predictors of whether you will be a member is the amount of wealth your parents give you.

The issue is that this small group is owns a disproportionate proportion of shares in frontier AI companies.

Assuming we develop techniques to reliably align AGIs to arbitrary goals, there is little reason to expect private entities to intentionally give up power (doing so would be acting contrary to the interests of their shareholders).

Workers unable to compete with artificial agents will find themselves relying on the charity and goodwill of a small group of elites. (And of course, as technology progresses, this group will eventually include all workers.)

Those lucky enough to own substantial equity in AI companies will thrive as the majority of wealth generated by AI workers flows to them.

In itself, this scenario isn't an existential threat. But I suspect many humans would consider their descendants being trapped into serfdom is a very bad outcome.

I worry a focus on preventing the complete extinction of the human race means that we are moving towards AI Safety solutions which lead to rather bleak futures in the majority of timelines.^[1]

^{^}
My personal utility function considers permanent techno-feudalism forever removing the agency of the majority of humans is only slightly better than everyone dying.
I suspect that some fraction of humans currently alive also consider a permanent loss of freedom to be only marginally better (or even worse) than death.

Comment by Stephen Fowler (LosPolloFowler) on Adverse Selection by Life-Saving Charities · 2024-08-18T13:56:20.330Z · LW · GW

Assuming I blend in and speak the local language, within an order of magnitude of 5 million (edit: USD)

I don't feel your response meaningfully engaged with either of my objections.

Comment by Stephen Fowler (LosPolloFowler) on Adverse Selection by Life-Saving Charities · 2024-08-18T09:48:25.801Z · LW · GW

I strongly downvoted this post.

1 . The optics of actually implementing this idea would be awful. It would permanently damage EA's public image and be raised as a cudgel in every single expose written about the movement. To the average person, concluding that years in the life of the poorest are worth less than those of someone in a rich, first world country is an abhorrent statement, regardless of how well crafted your argument is.

2.1 It would be also be extremely difficult for rich foreigners to objectively assess the value of QALYs in the most globally impoverished nations, regardless of good intentions and attempts to overcome biases.

2.2 There is a fair amount of arbitrariness to metrics chosen to value someones life. You've mentioned womens rights, but we could look alternatively look at the suicide rate as a lower bound on the number of women in a society who believe more years of their life has negative value. By choosing this reasonable sounding metric, we can conclude that a year of a womans life in South Korea is much worse than a year of a womans life in Afghanistan. How confident are you that you'll be able to find metrics which accurately reflect the value of a year of someones life?

The error in reasoning comes from making a utilitarian calculation without giving enough weight to the potential for flaws within the reasoning machine itself.

Comment by Stephen Fowler (LosPolloFowler) on Ten arguments that AI is an existential risk · 2024-08-17T07:14:27.670Z · LW · GW

what does it mean to keep a corporation "in check"
I'm referring to effective corporate governance. Monitoring, anticipating and influencing decisions made by the corporation via a system of incentives and penalties, with the goal of ensuring actions taken by the corporation are not harmful to broader society.

do you think those mechanisms will not be available for AIs
Hopefully, but there are reasons to think that the governance of a corporation controlled (partially or wholly) by AGIs or controlling one or more AGIs directly may be very difficult. I will now suggest one reason this is the case, but it isn't the only one.

Recently we've seen that national governments struggle with effectively taxing multinational corporations. Partially this is because the amount of money at stake is so great, multinational corporations are incentivized to invest large amounts of money into hiring teams of accountants to reduce their tax burden or pay money directly to politicians in the form of donations to manipulate the legal environment. It becomes harder to govern an entity as that entity invest more resources into finding flaws in your governance strategy.

Once you have the capability to harness general intelligence, you can invest a vast amount of intellectual "resources" into finding loopholes in governance strategies. So while many of the same mechanisms will be available for AI's, there's reason to think they might not be as effective.

Comment by Stephen Fowler (LosPolloFowler) on Ten arguments that AI is an existential risk · 2024-08-16T07:23:04.125Z · LW · GW

I'm not confident that I could give a meaningful number with any degree of confidence. I lack expertise in corporate governance, bio-safety and climate forecasting. Additionally, for the condition to be satisfied that corporations are left "unchecked" there would need to be a dramatic Western political shift that makes speculating extremely difficult.

I will outline my intuition for why (very large, global) human corporations could pose an existential risk (conditional on the existential risk from AI being negligible and global governance being effectively absent).

1.1 In the last hundred years, we've seen that (some) large corporations are willing to cause harm on a massive scale if it is profitable to do so, either intentionally or through neglect. Note that these decisions are mostly "rational" if your only concern is money.

Copying some of the examples I gave in No Summer Harvest:

Exxon chose to suppress their own research on the dangers of climate change in the late 1970s and early 1980s.
Numerous companies ignored signs that leaded gasoline was dangerous and the introduction of the product resulted in half the US adult population being exposed to lead during childhood. Here is a paper that claims American adults born between 1966 to 1970 lost an average of 5.9 IQ points (McFarland et al., 2022, bottom of page 3)
IBM supported its German subsidiary company Dehomag throughout WWII. When the Nazis carried out the 1939 census, used to identify people with Jewish ancestry, they utilized the Dehomag D11, with "IBM" etched on the front. Later, Concentration camps would use Dehomag machines to manage data related to prisoners, resources and labor within the camps. The numbers tattooed onto prisoners' bodies was used to track them via these machines.

1.2 Some corporations have also demonstrated they're willing to cut corners and take risks at the expense of human lives.

NASA neglected the warnings of engineers and almost a decade of test data demonstrating that there was a catastrophic flaw with SRB O-rings, resulting in the Challenger disaster. (You may be interested in reading Richard Feynman's observations given in the Presidential Report.)
Meta's engagement algorithm is alleged to have driven the spread of anti-Rohingya content in Myanmar and contributed to genocide.
3787 people died and more than half a million were injured when a due to a gas leak at a pesticide plant in Bhopal, India. The corporation running the plant, Union Carbide India Limited, was majority owned by the US-based Union Carbide Corporation (UCC). Ultimately UCC would pay less than a dollar per person affected.

2. Without corporate governance, immoral decision making and risk taking behaviour could be expected to increase. If the net benefit of taking an action improves because there are fewer repercussions when things go wrong, they should reasonably be expected to increase in frequency.

3. In recent decades there has been a trend (at least in the US) towards greater stock market concentration. For large corporations to pose and existential risk, this trend would need to continue until individual decisions made by a small group of corporations can affect the entire world.

I am not able to describe the exact mechanism of how unchecked corporations would post an existential risk, similar to how the exact mechanism for an AI takeover is still speculation.

You would have a small group of organisations responsible for deciding the production activities of large swaths of the globe. Possible mechanism include:

Irreparable environmental damage.
A widespread public health crisis due to non-obvious negative externalities of production.
Premature widespread deployment of biotechnology with unintended harms.

I think if you're already sold on the idea that "corporations are risking global extinction through the development of AI" it isn't a giant leap to recognise that corporations could potentially threaten the world via other mechanisms.

User info

Posts

Comments

A quick review of information theory:

PR statements from politicians:

PR statements from leading AI Research Organizations: