Posts

When is reward ever the optimization target? 2024-10-15T15:09:20.912Z
What does it mean for an event or observation to have probability 0 or 1 in Bayesian terms? 2024-09-17T17:28:52.731Z
My disagreements with "AGI ruin: A List of Lethalities" 2024-09-15T17:22:18.367Z
Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics? 2024-08-30T15:12:28.823Z
Francois Chollet inadvertently limits his claim on ARC-AGI 2024-07-16T17:32:00.219Z
The problems with the concept of an infohazard as used by the LW community [Linkpost] 2023-12-22T16:13:54.822Z
What's the minimal additive constant for Kolmogorov Complexity that a programming language can achieve? 2023-12-20T15:36:50.968Z
Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.) 2023-10-15T14:51:24.594Z
Hilbert's Triumph, Church and Turing's failure, and what it means (Post #2) 2023-07-30T14:33:25.180Z
Does decidability of a theory imply completeness of the theory? 2023-07-29T23:53:08.166Z
Why you can't treat decidability and complexity as a constant (Post #1) 2023-07-26T17:54:33.294Z
An Opinionated Guide to Computability and Complexity (Post #0) 2023-07-24T17:53:18.551Z
Conditional on living in a AI safety/alignment by default universe, what are the implications of this assumption being true? 2023-07-17T14:44:02.083Z
A potentially high impact differential technological development area 2023-06-08T14:33:43.047Z
Are computationally complex algorithms expensive to have, expensive to operate, or both? 2023-06-02T17:50:09.432Z
Does reversible computation let you compute the complexity class PSPACE as efficiently as normal computers compute the complexity class P? 2023-05-09T13:18:09.025Z
Are there AI policies that are robustly net-positive even when considering different AI scenarios? 2023-04-23T21:46:40.952Z
Can we get around Godel's Incompleteness theorems and Turing undecidable problems via infinite computers? 2023-04-17T15:14:40.631Z
Best arguments against the outside view that AGI won't be a huge deal, thus we survive. 2023-03-27T20:49:24.728Z
A case for capabilities work on AI as net positive 2023-02-27T21:12:44.173Z
Some thoughts on the cults LW had 2023-02-26T15:46:58.535Z
How seriously should we take the hypothesis that LW is just wrong on how AI will impact the 21st century? 2023-02-16T15:25:42.299Z
I've updated towards AI boxing being surprisingly easy 2022-12-25T15:40:48.104Z
A first success story for Outer Alignment: InstructGPT 2022-11-08T22:52:54.177Z
Is the Orthogonality Thesis true for humans? 2022-10-27T14:41:28.778Z
Logical Decision Theories: Our final failsafe? 2022-10-25T12:51:23.799Z
How easy is it to supervise processes vs outcomes? 2022-10-18T17:48:24.295Z
When should you defer to expertise? A useful heuristic (Crosspost from EA forum) 2022-10-13T14:14:56.277Z
Does biology reliably find the global maximum, or at least get close? 2022-10-10T20:55:35.175Z
Is the game design/art maxim more generalizable to criticism/praise itself? 2022-09-22T13:19:00.438Z
In a lack of data, how should you weigh credences in theoretical physics's Theories of Everything, or TOEs? 2022-09-07T18:25:52.750Z
Can You Upload Your Mind & Live Forever? From Kurzgesagt - In a Nutshell 2022-08-19T19:32:12.434Z
Complexity No Bar to AI (Or, why Computational Complexity matters less than you think for real life problems) 2022-08-07T19:55:19.939Z
Which singularity schools plus the no singularity school was right? 2022-07-23T15:16:19.339Z
Why AGI Timeline Research/Discourse Might Be Overrated 2022-07-20T20:26:39.430Z
How humanity would respond to slow takeoff, with takeaways from the entire COVID-19 pandemic 2022-07-06T17:52:16.840Z
How easy/fast is it for a AGI to hack computers/a human brain? 2022-06-21T00:34:34.590Z
Noosphere89's Shortform 2022-06-17T21:57:43.803Z

Comments

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-06T18:38:16.046Z · LW · GW

I broadly agree with this, though I'll state 2 things:

  1. Limited steering ability doesn't equal 0 steering ability, and while there's an argument to be made that people overestimate how much you can do with pure social engineering, I do still think there can be multiple equilibrium points, even if a lot of what happens is ultimately controlled by incentives.

  2. AIs probably have a much easier time coordinating on what to do, and importantly can route around a lot of the bottlenecks that exist in human societies, thanks to copying, merging, and scaling. So, assuming alignment is achieved, it's very possible for single humans to effect large-scale social change by controlling the economy and military and working their way up from there.

Comment by Noosphere89 (sharmake-farah) on C'mon guys, Deliberate Practice is Real · 2025-02-06T13:52:31.894Z · LW · GW

I basically agree with something like this:

FWIW it's not TOTALLY obvious to me that the literature supports the notion that deliberate practice applies to meta-cognitive skills at the highest level like this.

Evidence for this type of universal transfer learning is scant.

Not because you'll goodhart, but because people think it's plausible that the mind just isn't plastic on this level of basic meta-cognition. There's lots of evidence AGAINST that, and many times that people THINK they've found this sort of universal transfer, it often ends up being more domain-specific than they thought.

Indeed, one of the main differences between current AI and humans is precisely that transfer learning is more powerful in AIs (I remember gwern saying this to me).

Comment by Noosphere89 (sharmake-farah) on Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker · 2025-02-06T13:43:20.578Z · LW · GW

Wait, how does the atomless property ensure that if the probability of an event is 0, then the event can never happen at all, as a matter of logic?

Comment by Noosphere89 (sharmake-farah) on nikola's Shortform · 2025-02-06T04:12:45.288Z · LW · GW

Some Wait But Why links on this topic:

https://waitbutwhy.com/2017/04/neuralink.html

https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html

And some books by Kurzweil:

https://www.amazon.com/gp/aw/d/0670033847?ref_=dbs_m_mng_wam_calw_thcv_0&storeType=ebooks

https://www.amazon.com/dp/B08ZJRMWVS/?bestFormat=true&k=the singularity is nearer by ray kurzweil&ref_=nb_sb_ss_w_scx-ent-pd-bk-m-si_de_k0_1_8&crid=X3GZ8HDDAEPI&sprefix=the sing

Comment by Noosphere89 (sharmake-farah) on The Risk of Gradual Disempowerment from AI · 2025-02-06T02:46:56.960Z · LW · GW

My own take is summarized by Ryan Greenblatt, Fabien Roger, and myself in the comments below: it is a problem, but not really an existential threat by itself (though only because humanity's potential still technically counts as fulfilled if a random billionaire took control over Earth and killed almost everyone except people ideologically aligned with him, while AIs still take orders and he personally gets uploaded and lives a very rich life):

https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from#GChLyapXkhuHaBewq

https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from#GJSdxkc7YfgdzcLRb

https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from#QCjBC7Ym6Bt9pHHew

https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from#8yCL9TdDW5KfXkvzh

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-02-06T02:46:02.328Z · LW · GW

Eh, I'd argue that people do not in fact agree on most of the issues related to AI, and there are lots of disagreements about what the problem is, how to solve it, and what to do after AI is aligned.

Comment by Noosphere89 (sharmake-farah) on evhub's Shortform · 2025-02-05T16:46:19.011Z · LW · GW

I would go further: in practice we will be picking from this set of outcomes with a lot of arbitrariness, and that arbitrariness is not removable.

Comment by Noosphere89 (sharmake-farah) on Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker · 2025-02-05T16:41:26.051Z · LW · GW
  1. Austerity: The model should only include events the agent thinks are genuinely possible. If she is certain something cannot happen, the theory shouldn’t force her to rank or measure preferences for that scenario. (Formally, we can regard zero-probability events as excluded from the relevant algebra.)

 

I'd want to mention that in infinite contexts, probability-0 events are still possible.

(An example is possibly the constants of our universe, which are currently modeled as real numbers, where any specific real number has probability 0 of being picked.)

It's very important to recognize when you are in a domain where probability 0 does not mean impossible.

The dual case holds as well: probability-1 events are not the same as certainty in the general case.
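A standard textbook example of both directions, stated as a sketch (it makes no claim about which distribution actually governs physical constants):

```latex
% A uniform draw from [0,1] is atomless: every individual point has probability 0.
X \sim \mathrm{Unif}[0,1], \qquad \Pr(X = x) = 0 \quad \text{for every } x \in [0,1]
% Yet the draw always realizes some particular value x_0, so \{X = x_0\} is possible
% despite having probability 0; dually, \Pr(X \neq x_0) = 1, yet \{X \neq x_0\} is not certain.
```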

Comment by Noosphere89 (sharmake-farah) on evhub's Shortform · 2025-02-05T02:32:41.612Z · LW · GW

I'm generally of the opinion that CEV was always a bad goal and that we shouldn't attempt it. A big reason is that I don't believe a procedure exists that doesn't incentivize humans to fight over the initial dynamic; put another way, who implements the CEV procedure will always matter. I don't believe humans will naturally converge, in the limit of more intelligence, to a fixed moral value system; instead I predict divergence as constraints are removed.

I roughly agree with Steven Byrnes here, but would state it more strongly (and I think it holds beyond humans too):

https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_3_Possible_implications_for_AI_alignment_discourse

Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is "all the currently living humans," but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn't matter who implements it—see Eliezer's discussion under "Avoid creating a motive for modern-day humans to fight over the initial dynamic." I think this is a great principle, but imo it doesn't go far enough. In particular:
 

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-02-04T17:44:34.957Z · LW · GW

BTW, this is a crux for me as well: I believe that absent technical misalignment, some humans, rather than AIs, will have most of the power by default, because I expect AI rights to be limited by default.
 

I think this is the disagreement: I expect that selfish/individual powerseeking without any coordination will still result in (some) humans having most power in the absence of technical misalignment problems. Presumably your view is that the marginal amount of power anyone gets via powerseeking is negligible (in the absence of coordination). But, I don't see why this would be the case. Like all shareholders/board members/etc want to retain their power and thus will vote accordingly which naively will retain their power unless they make a huge error from their own powerseeking perspective. Wasting some resources on negative sum dynamics isn't a crux for this argument unless you can argue this will waste a substantial fraction of all human resources?


 

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-02-04T17:39:02.182Z · LW · GW

Generally speaking, my chances of doom at this point are probably 5-20%.

Comment by Noosphere89 (sharmake-farah) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-04T16:42:06.734Z · LW · GW

Are there any examples where the starting resource allocation has more than one solution/equilibrium, assuming demand independence is violated?

Comment by Noosphere89 (sharmake-farah) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-04T15:03:42.536Z · LW · GW

And this is why some goods, like public safety, can't be provided by markets: the assumption of demand independence is violated.

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-02-04T14:48:19.978Z · LW · GW

I do generally agree more with continuous views than discrete views, but I don't think that alone gets us a need for humans in the loop for many decades/indefinitely, because continuous progress in alignment can still be very fast, such that it takes only a few months or years for AIs to become aligned with a single person's preferences for almost arbitrarily long.

(The link is in the context of AI capabilities, but I think the general point about how continuous progress can still be fast holds):

https://www.planned-obsolescence.org/continuous-doesnt-mean-slow/

My own take on Steven's "Law of Conservation of Wisdom" is that it's mostly true for human brains. A fair amount of the issues described in the comment are values conflicts, and I think value conflicts, except in special cases, will be insoluble by default; I also don't think CEV works, for this reason.

That said, I don't think you have to break too many norms in order to prevent existential catastrophe, mostly because actually destroying humanity is quite hard, and will be even harder during an AI takeoff.

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-04T14:23:38.781Z · LW · GW

My main crux here is that strong AI rights likely won't be granted before near-full alignment to one person is achieved, and maybe not even then. A lot of the failure modes of giving AIs power in the gradual disempowerment scenario fundamentally route through giving AIs very strong rights, but thankfully this is disincentivized by default, because otherwise AIs would be more expensive.

The main way this changes the scenario is that the 6 humans remain broadly in control, aren't just high all the time, and the first one probably doesn't replace their preferences with pure growth, because at the billionaire level status dominates, so they are likely living very rich lives with their own servants.

No guarantees about anyone else surviving though:

  • No strong AI rights before full alignment: There won't be a powerful society that gives extremely productive AIs "human-like rights" (and in particular strong property rights) prior to being relatively confident that AIs are aligned to human values.
    • I think it's plausible that fully AI-run entities are given the same status as companies - but I expect that the surplus they generate will remain owned by some humans throughout the relevant transition period.
    • I also think it's plausible that some weak entities will give AIs these rights, but that this won't matter because most "AI power" will be controlled by humans that care about it remaining the case as long as we don't have full alignment.
       
Comment by Noosphere89 (sharmake-farah) on What are the "no free lunch" theorems? · 2025-02-04T02:52:40.206Z · LW · GW

Another interpretation of the no free lunch theorems, from @davidad, is that learning/optimization under worst-case conditions is trivial but also impractical, so you need to add more constraints to get an interesting solution:

https://www.lesswrong.com/posts/yTvBSFrXhZfL8vr5a/worst-case-thinking-in-ai-alignment#N3avtTM3ESH4KHmfN

Comment by Noosphere89 (sharmake-farah) on Alexander Gietelink Oldenziel's Shortform · 2025-02-03T16:45:03.355Z · LW · GW

I agree with Leo Gao here:

https://x.com/nabla_theta/status/1885846403785912769

always good to get skeptical takes on SAEs, though imo this result is because of problems with SAE evaluation methodology. I would strongly bet that well trained SAE features on random nets are qualitatively much worse than ones on real LMs.

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-02T21:54:10.013Z · LW · GW

This is why I was saying the scenario in the paper can't really lead to existential catastrophe, at least not without other assumptions.

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-02T21:52:51.049Z · LW · GW

@the gears to ascension is there a plausible scenario in your mind where gradual disempowerment leads to death/very bad fates for all humans?

Because I'm currently struggling to understand the perspective where alignment is solved, but all humans still die/irreversibly lose control due to gradually being disempowered.

A key part of the challenge is that you must construct the scenario in a world where the single-single alignment problem/classic alignment problem as envisioned by LW is basically solved for all intents and purposes.

Comment by Noosphere89 (sharmake-farah) on Catastrophe through Chaos · 2025-02-02T18:24:11.965Z · LW · GW

Technical note: I'm focusing on existential catastrophes, not ordinary catastrophes. The difference is that in an existential catastrophe no humans have power anymore, as opposed to only a few humans having power, so this mostly excludes scenarios like these:

https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from

https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher

Comment by Noosphere89 (sharmake-farah) on “Sharp Left Turn” discourse: An opinionated review · 2025-02-02T18:20:06.085Z · LW · GW

My second concern is that “AIs solving specific technical problems that the human wants them to solve” is insufficient to avoid extinction and get to a good future—even if the AI is solving those problems at superhuman level and with the help of (1-3) superpowers.[3]

I won’t go much into details of why I have this concern, since I want to keep this post focused on technical alignment. But I’ll spend just a few paragraphs to give a hint. For more, see my What does it take to defend the world against out-of-control AGIs? and John Wentworth’s The Case Against AI Control.

Here’s an illustrative example. People right now are researching dangerous viruses, in a way that poses a risk of catastrophic lab leaks wildly out of proportion to any benefits. Why is that happening? Not because everyone is trying to stop it, but they lack the technical competence to execute. Quite the contrary! Powerful institutions like governments are not only allowing it, but are often funding it!

I think when people imagine future AI helping with this specific problem, they’re usually imagining the AI convincing people—say, government officials, or perhaps the researchers themselves—that this activity is unwise. See the problem? If AI is convincing people of things, well, powerful AIs will be able to convince people of bad and false things as well as good and true things. So if the AI is exercising its own best judgment in what to convince people of, then we’d better hope that the AI has good judgment! And it’s evidently not deriving that good judgment from human preferences. Human preferences are what got us into this mess! Right?

Remember, the “alignment generalizes farther” argument (§4 above) says that we shouldn’t worry because AI understands human preferences, and those preferences will guide its actions (via RLHF or whatever). But if we want an AI that questions human preferences rather than slavishly following them, then that argument would not be applicable! So how are we hoping to ground the AI’s motivations? It has to be something more like “ambitious value learning” or Coherent Extrapolated Volition—things that are philosophically fraught, and rather different from what people today are doing with foundation models.

 

I agree with the claim that existential catastrophe isn't automatically averted by aligned/controlled AI, and in particular that biological threats remain a danger to human survival.

My general view is that in practice, AI help is more likely to route through first figuring out technical ways to solve the biological threat, and then using instruction-following AIs to implement a persuasion campaign that uses the most effective ways to change people's minds.

You are correct that an AI can convince people of true things as well as false things, and this is why you will need to make assumptions about the AI's corrigibility/instruction following, though you helpfully make such assumptions here.

On the first-person problem, I believe the general solution involves recapitulating human social instincts via lots of data on human values, and I'm perhaps more optimistic than you that a lot of human social instincts don't have to be innately specified by a prior.

In many ways I have a weird reaction to the post: I centrally agree with the claim that corrigible/instruction-following AIs aren't automatically sufficient to ensure safety, and yet I am much more optimistic than you that mere corrigibility/instruction following goes a long way toward AI being safe, probably because I think you can do a lot more work to secure civilization in ways that semi-respect existing norms (semi-respect being key here).

Comment by Noosphere89 (sharmake-farah) on AI X-risk is a possible solution to the Fermi Paradox · 2025-02-02T16:33:08.742Z · LW · GW

I think this is an interesting answer, and it has some use even outside of this scenario, but I think the more likely answer to the problem rests on the rareness of life. In particular, the eukaryote transition is probably the most likely Great Filter, because natural selection had to solve a coordination problem, and this step happened only once in Earth's history, unlike the other hard steps.

That said, I will say some more on this topic, if only to share my models:

  1. The universe might be too large for exponential growth to fill it up. It doesn't seem plausible for self-replication to be faster than exponential in the long-run, and if the universe is sufficiently large (like, bigger than 10^10^30 or so?) then it's impossible - even with FTL - to kill everything, and again the scenario doesn't work. I suppose an exception would be if there were some act that literally ends the entire universe immediately (thus killing everything without a need to replicate). Also, an extremely-large universe would require an implausibly-strong Great Filter for us to actually be the first this late.

The big out here is time travel, and in these scenarios, assuming logical inconsistencies are prevented by the time travel mechanism, there's a non-trivial chance that trying to cause a logical inconsistency destroys the universe immediately:

https://www.lesswrong.com/posts/EScmxJAHeJY5cjzAj/ssa-rejects-anthropic-shadow-too#Probability_pumping_and_time_travel

  1. AI Doom might not happen. If humanity is asserted to be not AI-doomed then this argument turns on its head and our existence (to at least the extent that we might not be the first) argues that either light-cone-breaking FTL is impossible or AI doom is a highly-unusual thing to happen to civilisations. This is sort of a weird point to mention since the whole scenario is an Outside View argument that AI Doom is likely, but how seriously to condition on these sorts of arguments is a matter of some dispute.

In general, I'm more confident in light-cone-breaking FTL being impossible than in AI doom being highly unusual, but conditional on light-cone-breaking FTL being possible, I'd assert that AI doom is quite unusual for civilizations (excluding institutional failures, because those cases don't affect the absolute difficulty of the technical problem).

My big reason for this is that I think instruction following is actually reasonably easy and is enough to prevent existential risk on its own, without requiring much value alignment for that purpose.

In essence, I'm claiming that value alignment isn't very necessary for large AI civilizations to emerge that aren't purely grabby; some value alignment is needed, but it can be surprisingly small and bounded by a constant.

+1 for at least trying to do something, and for being surprisingly useful outside of the Fermi paradox.

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-02T15:26:35.969Z · LW · GW

I think a big part of the issue is not just the assumptions people use, but also that your scenario doesn't really lead to existential catastrophe in most worlds, if only because a few very augmented humans determine a lot of what the future holds, at least under single-single alignment scenarios. A lot of AI thought has been directed towards worlds where AI does pose existential risk, and much of that is because of the values of the first thinkers on the topic.

More below:

https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from#GChLyapXkhuHaBewq

Comment by Noosphere89 (sharmake-farah) on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-02-02T14:49:52.967Z · LW · GW

I broadly agree with the view that something like this is a big risk under a lot of current human value sets.

One important caveat, for some value sets, is that I don't think this results in an existential catastrophe. The broad reason is that in single-single alignment scenarios some humans will remain in control and potentially become immortal, and scenarios where this is achieved are automatically excluded from existential catastrophes, solely because human potential is realized; it's just that most humans are locked out of it.

It has similarities to this:

https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher

But more fleshed out.

Comment by Noosphere89 (sharmake-farah) on In response to critiques of Guaranteed Safe AI · 2025-02-02T03:31:36.003Z · LW · GW

Depends on your assumptions. If you assume that a pretty-well-intent-aligned pretty-well-value-aligned AI (e.g. Claude) scales to a sufficiently powerful tool to gain sufficient leverage on the near-term future to allow you to pause/slow global progress towards ASI (which would kill us all)...

We can drop the assumption that ASI inevitably kills us all/that we should pause, and the above argument still works; or, as I like to put it, practical AI alignment/safety is helped a lot by computer security, especially against state adversaries.

I think Zach Stein-Perlman is overstating the case, but here it is:

https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform#ckNQKZf8RxeuZRrGH

Comment by Noosphere89 (sharmake-farah) on In response to critiques of Guaranteed Safe AI · 2025-02-02T02:20:21.908Z · LW · GW

I agree that it isn't a direct AI safety agenda, though I will say that software security would be helpful for control agendas, and AI's increasing capabilities at mathematics could, in principle, help with AI alignment agendas that are mostly mathematical, like Vanessa Kosoy's.

It's also useful for AI control purposes.

More below:

https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#BSv46tpbkcXCtpXrk

Comment by Noosphere89 (sharmake-farah) on The Failed Strategy of Artificial Intelligence Doomers · 2025-02-01T23:33:42.084Z · LW · GW

I definitely agree with this, but in their defense, this is to be expected, especially in fast growing fields.

Model building is hard, and specialization generally beats trying to understand things deeply in general, so it's not that surprising that many people won't understand why; this will be the case regardless of the truth value of the timelines claims.

Comment by Noosphere89 (sharmake-farah) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-01T20:25:57.722Z · LW · GW

One of the scenarios I'm imagining is one where network effects exist, such that you don't want something if you're the only person who has it, but you do want it if others have it.

Arguably, a lot of public goods/social media are like this: there's zero demand at a small size, but lots of demand once the size increases beyond a threshold.

In essence, I'm asking if we can remove the demand independence assumption and still get an isomorphism between optimal allocations of scarce resources and a price system.
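To make the kind of scenario I have in mind concrete, here is a toy sketch (my own illustrative model with a made-up threshold parameter, not anything from the post): when each consumer only wants the good if enough others are expected to have it, the self-consistent adoption levels need not be unique.

```python
# Toy sketch (my own illustrative model, not from the post): each consumer
# buys the good only if they expect at least `threshold` of everyone else to
# also have it, so demand independence fails.
def adoption_fixed_points(threshold: float = 0.3, grid: int = 100):
    """Return expected adoption levels x in [0, 1] that are self-fulfilling."""
    fixed = []
    for i in range(grid + 1):
        x = i / grid                                 # expected fraction of adopters
        realized = 1.0 if x >= threshold else 0.0    # fraction who actually buy
        if abs(realized - x) < 1e-9:                 # expectation confirms itself
            fixed.append(x)
    return fixed

# Two equilibria from the same starting allocation: nobody adopts, or everybody does.
print(adoption_fixed_points())  # -> [0.0, 1.0]
```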

Comment by Noosphere89 (sharmake-farah) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-01T19:44:10.417Z · LW · GW

Can we remove one of the assumptions, or are both assumptions necessary to get the result stated?

Comment by Noosphere89 (sharmake-farah) on The Failed Strategy of Artificial Intelligence Doomers · 2025-02-01T18:59:03.203Z · LW · GW

I think the arguments for short timelines are definitely weaker than their proponents usually assume, but they aren't totally vibes-based, and while not so probable as to dominate the probability mass, they are probable enough to be action-guiding:

https://www.lesswrong.com/posts/LCNdGLGpq89oRQBih/bayesianism-for-humans-probable-enough

I do predict that we will probably have at least 1 more paradigm shift before the endgame, but I'm not so confident in it as to dismiss simple scaling.

Comment by Noosphere89 (sharmake-farah) on In response to critiques of Guaranteed Safe AI · 2025-02-01T16:38:23.984Z · LW · GW

This seems like a crux here, one that might be useful to uncover further:

2. Claiming that non-vacuous sound (over)approximations are feasible, or that we'll be able to specify and verify non-trivial safety properties, is risible. Planning for runtime monitoring and anomaly detection is IMO an excellent idea, but would be entirely pointless if you believed that we had a guarantee!

I broadly agree with you that most of the stuff proposed is either in its infancy or is essentially vaporware that doesn't really work unless AIs are so good that the plan would be wholly irrelevant, and is thus not very useful for short-timelines work. But I do believe enough of the plan is salvageable to make it not completely useless, in particular the part where it's very possible for AIs to help in real ways (at least given some evidence):

https://www.lesswrong.com/posts/DZuBHHKao6jsDDreH/in-response-to-critiques-of-guaranteed-safe-ai#Securing_cyberspace

Comment by Noosphere89 (sharmake-farah) on plex's Shortform · 2025-01-31T19:54:45.645Z · LW · GW

Alright, I'll try to answer the questions:

  1. I think qualia are rescuable, in a sense, and my specific view is that they exist as a high-level model.

As far as what qualia are, I think they're basically an application of modeling the world in order to control something, and thus qualia, broadly speaking, are your self-model.

As far as my exact views on qualia, the links below are helpful:

https://www.lesswrong.com/posts/FQhtpHFiPacG3KrvD/seth-explains-consciousness#7ncCBPLcCwpRYdXuG

https://www.lesswrong.com/posts/NMwGKTBZ9sTM4Morx/linkpost-a-conceptual-framework-for-consciousness

  2. My general answer to these questions is probably computation/programs/mathematics, with the caveat that these notions are very general, and thus don't explain anything specific about our world.

I personally agree with this on what counts as real:

If you believe math is fundamental, what distinguishes this particular mathematical universe from other ones; what qualifies this world as "real", if anything; what 'breathes fire into the equations and creates a world for them to describe'?

(Commentary: one self-consistent position answers "nothing" - that this world is just one of the infinitely many possible mathematical functions / programs. That 'real' secretly means 'the program(s?) we are part of'. Though I observe this position to be rare; most have a strong intuition that there is something which "makes reality real".)

What breathes fire into the equations of our specific world is either an infinity of computational resources, or a very large amount of computational resources.

As far as what mathematics is, I like the game analogy, where we agree to play a game according to specified rules. Another way to describe mathematics is as a way to generalize across all of the situations you encounter, abstracting away from specific detail; it is also used to define what something is.

Comment by Noosphere89 (sharmake-farah) on [Linkpost] A conceptual framework for consciousness · 2025-01-31T19:45:18.571Z · LW · GW

In retrospect, something like this theory was probably one I was drawing on implicitly. I like that from two well-validated principles you can constrain the space of possible theories of consciousness enough to throw out a lot of philosophically interesting but wrong theories. AST is probably my go-to mental model of how consciousness actually works, and the consciousness problem is by now mostly resolved in my own mind, such that I can focus on bigger problems.

Comment by Noosphere89 (sharmake-farah) on Meta Questions about Metaphilosophy · 2025-01-31T19:27:33.659Z · LW · GW

Mostly, the concept of "metaphilosophy" is so hopelessly broad that you kinda reach it by definition by thinking about any problem hard enough. This isn't a good thing: when you have a category so large it contains everything (not saying this applies to you, but it applies to many other people I have met who talked about metaphilosophy), it usually means you are confused.

In retrospect, this is why debates over the simulation hypothesis/Mathematical Universe Hypothesis/computationalism go nowhere: the people advocating these hypotheses run a motte and bailey. The motte is that their category validly encompasses everything, though they don't realize this doesn't constrain their expectations at all and thus can't be used in basically any debate. The bailey is that this is something important that should change your behavior, which only follows from a narrower version of the simulation hypothesis/Mathematical Universe Hypothesis/computationalism that doesn't encompass everything. Meanwhile, the people arguing against these hypotheses don't realize there's no way to falsify the hypothesis once it's made general enough.

Comment by Noosphere89 (sharmake-farah) on Gears-Level Models are Capital Investments · 2025-01-31T18:40:48.957Z · LW · GW

I wonder why the gears-level model is so important, then, because basically any model of the world that depends on assumptions and isn't a look-up table could be considered a gears-level model, since it no longer applies to everything. Thus, in most applications you want to talk about how gearsy a model is, rather than debating a binary question.

Comment by Noosphere89 (sharmake-farah) on Catastrophe through Chaos · 2025-01-31T18:33:09.214Z · LW · GW

A big one has to do with DeepSeek's R1 maybe breaking moats, essentially killing industry profits if it happens:

https://www.lesswrong.com/posts/ynsjJWTAMhTogLHm6/?commentId=a2y2dta4x38LqKLDX

The other issue has to do with o1/o3 being potentially more supervised than advertised:

https://www.lesswrong.com/posts/HiTjDZyWdLEGCDzqu/?commentId=gfEFSWENkmqjzim3n#gfEFSWENkmqjzim3n

Finally, Vladimir Nesov has an interesting comment on how Stargate is actually evidence for longer timelines:

https://www.lesswrong.com/posts/fdCaCDfstHxyPmB9h/vladimir_nesov-s-shortform#W5twe6SPqe5Y7oGQf

Comment by Noosphere89 (sharmake-farah) on Catastrophe through Chaos · 2025-01-31T17:22:07.071Z · LW · GW

I think the catastrophe through chaos story is the most likely outcome, conditional on catastrophe happening.

The big disagreement might ultimately be about timelines: I've updated towards longer timelines, such that world-shakingly powerful AI probably arrives in the 2030s or 2040s, not this decade. I put about 35-40% credence on the timeline in the post being correct, and more credence on at least one new paradigm shift happening before world-shaking AI.

The other one is probably that I'm more optimistic about turning aligned TAI into aligned ASI, because I am reasonably confident that the alignment problem is easy overall, combined with being much more optimistic about automating alignment than a lot of other people.

Comment by Noosphere89 (sharmake-farah) on Rational Unilateralists Aren't So Cursed · 2025-01-30T23:45:36.579Z · LW · GW

This is an interesting post that, while not very relevant on its own, might become relevant in the future.

More importantly, it's a scenario where rational agents can outperform irrational agents.

+1 for this; while minor, it still matters.

Comment by Noosphere89 (sharmake-farah) on AI #101: The Shallow End · 2025-01-30T18:22:02.164Z · LW · GW

What do you think of DeepSeek's announcement of R1 being evidence that AI capabilities will slow down, because they got a cheaper but not actually better model, and this kills the industry's growth?

https://www.lesswrong.com/posts/ynsjJWTAMhTogLHm6/?commentId=a2y2dta4x38LqKLDX

Comment by Noosphere89 (sharmake-farah) on Should you publish solutions to corrigibility? · 2025-01-30T16:52:13.598Z · LW · GW

The answer depends on your values, so there isn't really a single answer to give here.

Comment by Noosphere89 (sharmake-farah) on Anthropic CEO calls for RSI · 2025-01-29T22:08:14.753Z · LW · GW

Not surprised, given China's announcement of a $145 billion AI fund this week, and a potential $1 trillion in further funding by February or March.

If the $1 trillion commitment is real, I have to give a lot of Bayes points to those who predicted an AI race, because this is not a thing you do unless you either want the technology for yourself or are racing against someone:

https://x.com/AndrewCurran_/status/1883721802280841245

Because of the success of R1 it looks like about $145 billion USD in Chinese AI funds were created this week. The rumor is that the Chinese government will also officially announce a trillion dollar AI fund in early February.

https://x.com/angelusm0rt1s/status/1883796706107736126

Not a guarantee that it will be in Feb It will be in Q1(Feb or March)

Comment by Noosphere89 (sharmake-farah) on Should you go with your best guess?: Against precise Bayesianism and related views · 2025-01-29T22:01:39.325Z · LW · GW

I'm not sure, given the "Indeterminate priors" section. But assuming that's true, what implication are you drawing from that? (The indeterminacy for us doesn't go away just because we think logically omniscient agents wouldn't have this indeterminacy.)

In one sense, the implication is that for an ideal reasoner, you can always give a probability to every event.

You are correct that the indeterminacy for us wouldn't go away.

The arbitrariness of a precise prior is a fact of life. This doesn't imply we shouldn't reduce this arbitrariness by having indeterminate priors.

Perhaps.

I'd expect that we can still extend a no free lunch style argument such that the choice of indeterminate prior is arbitrary if we want to learn in the maximally general case, but I admit no such theorem is known, and maybe imprecise priors do avoid such a theorem.

I'm not saying indeterminate priors are bad, but rather that they probably aren't magical.

Comment by Noosphere89 (sharmake-farah) on Six Thoughts on AI Safety · 2025-01-29T21:56:01.687Z · LW · GW

This sounds a lot like what @Seth Herd's work on instruction-following AIs is all about:

https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than

Comment by Noosphere89 (sharmake-farah) on My Mental Model of AI Optimist Opinions · 2025-01-29T21:35:39.497Z · LW · GW

As far as my own AI optimism goes, I think this is a rather caricatured view of my opinions, but not an utterly deranged one.

The biggest reasons I've personally become more optimistic about AI alignment are the following:

  1. I've become convinced that a lot of the complexity of human values, to the extent it is there, is surprisingly unnecessary for alignment, and that a lot of this is broadly downstream of inductive biases mattering much less for values than people thought.

  2. I think that AI control is surprisingly useful, and a lot of the criticism around it is pretty misguided; in particular, I think slop is a real problem, but also one where it's surprisingly easy to make iteration work, compared to other adversarial-AI problems.

Some other reasons are:

  1. I think the argument that the Solomonoff prior is malign doesn't actually work, because in the general case it's equally costly to simulate solipsist universes as non-solipsist universes relative to the simulators' resource budget, and because a lot of value systems want to simulate things due to instrumental convergence, meaning you can get little if any evidence about what the values of the multiverse are:

https://www.lesswrong.com/posts/tDkYdyJSqe3DddtK4/alexander-gietelink-oldenziel-s-shortform#w2M3rjm6NdNY9WDez

  2. I believe a lot of people on LW overestimate how much Goodharting the market does, because they don't realize the constraints that real humans and markets work under, which include not only physical constraints but also economic ones. As an example, the entire discussion about the 1-hose air conditioner being a market failure seems to have been based on a false premise, since 1-hose air conditioners are acceptable enough to consumers at the price points they are sold at:

https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#maJBX3zAEtx5gFcBG

https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#3TFECJ3urX6wLre5n

Comment by Noosphere89 (sharmake-farah) on eggsyntax's Shortform · 2025-01-29T21:33:28.854Z · LW · GW

To be fair here, from an omniscient perspective, believing P and believing things that imply P are genuinely the same thing in terms of results, but from a non-omniscient perspective, the difference matters.

Comment by Noosphere89 (sharmake-farah) on Should you go with your best guess?: Against precise Bayesianism and related views · 2025-01-29T00:15:23.027Z · LW · GW

IMO, most of the problems with precise Bayesianism for humans are really problems with logical omniscience not being satisfied.

Also, on the arbitrariness of the prior: this is an essential feature of a very general learner, due to the no free lunch theorems.

The no free lunch theorems prohibit any one prior from being universally accurate or inaccurate, so the arbitrariness of the prior is just a fact of life.
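As a minimal sketch of that claim (a toy setup I'm assuming, averaging uniformly over every possible binary labeling of a few unseen points): any fixed predictor, whatever prior it encodes, averages out to 50% accuracy across all possible environments.

```python
# Toy sketch of the no-free-lunch intuition (my own setup, assuming a uniform
# average over every possible binary labeling of a few unseen points): any fixed
# predictor averages out to 50% accuracy, so no single prior wins everywhere.
from itertools import product

def avg_accuracy(predictor, n_unseen: int = 4) -> float:
    total = 0.0
    for labeling in product([0, 1], repeat=n_unseen):   # every possible "world"
        correct = sum(predictor(i) == labeling[i] for i in range(n_unseen))
        total += correct / n_unseen
    return total / 2 ** n_unseen

def always_zero(i):
    return 0

def alternating(i):
    return i % 2

print(avg_accuracy(always_zero))   # 0.5
print(avg_accuracy(alternating))   # 0.5
```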

Comment by Noosphere89 (sharmake-farah) on Assume Bad Faith · 2025-01-29T00:05:35.526Z · LW · GW

One reason to assume good faith is to stop yourself from justifying your own hidden motives in a conflict:

https://www.lesswrong.com/posts/e4GBj6jxRZcsHFSvP/assume-bad-faith#Sc6RqbDpurX6hJ8pY

Comment by Noosphere89 (sharmake-farah) on Six Plausible Meta-Ethical Alternatives · 2025-01-28T14:56:11.595Z · LW · GW

This post gives a pretty short proof, and my main takeaway is that intelligence and consciousness converge to look-up tables that are infinitely complicated, so as to deal with every possible situation:

https://www.lesswrong.com/posts/2LvMxknC8g9Aq3S5j/ldt-and-everything-else-can-be-irrational

I agree with this implication for optimization:

https://www.lesswrong.com/posts/yTvBSFrXhZfL8vr5a/worst-case-thinking-in-ai-alignment#N3avtTM3ESH4KHmfN

Comment by Noosphere89 (sharmake-farah) on Decision theory does not imply that we get to have nice things · 2025-01-28T14:17:26.357Z · LW · GW

To be honest, on the local validity point I probably agree more with Ryan Greenblatt that AIs would likely spend at least a small amount of resources on humans, assuming preferences that are indifferent to humans, though I personally think the cost of niceness is surprisingly high, such that small shards of niceness can be ignored:

https://www.lesswrong.com/posts/xvBZPEccSfM8Fsobt/what-are-the-best-arguments-for-against-ais-being-slightly#wy9cSASwJCu7bjM6H

That said, I think the most valuable part here is actually its short version, which points out that an LDT agent tries to maximize utility and will only cooperate if it gets more expected utility out of the interaction than it would otherwise; thus LDT doesn't solve value conflicts except in special cases. I think this point will get a lot of use in the future, because I expect many people to propose that some decision theory like LDT will solve some burning value conflict by getting the parties to cooperate, and I'll have to tell them that decision theory cannot do this:

A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.), is that people think LDT agents are genial and friendly for each other.[1]

One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)

I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.

To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, and puts them both in front of an LDT agent saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.

"What?" cries the CDT agent. "I thought LDT agents one-box!"

LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.

A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because they get more paperclips that way. Not because it has a primitive property of cooperativeness-with-similar-beings. It needs to get the more paperclips.

If a bunch of monkeys want to build a paperclipper and have it give them nice things, the paperclipper needs to somehow expect to wind up with more paperclips than it otherwise would have gotten, as a result of trading with them.

If the monkeys instead create a paperclipper haplessly, then the paperclipper does not look upon them with the spirit of cooperation and toss them a few nice things anyway, on account of how we're all good LDT-using friends here.

It turns them into paperclips.

Because you get more paperclips that way.

That's the short version. Now, I’ll give the longer version.[2]

Comment by Noosphere89 (sharmake-farah) on The present perfect tense is ruining your life · 2025-01-27T22:11:14.318Z · LW · GW

This was the faked evidence here:

I’m not sure where Scott is going with this series, but I seem to have a different reaction to the excerpts from Henrich than most (but not all) of the commenters before me: rather than coming across as persuasive, I wouldn’t trust him as far as I could throw him.

For simplicity let’s concentrate on the seal hunting description. I don’t know enough about Inuit techniques to critique the details, but instead of aiming for a fair description, it’s clear that Henrich’s goal is to make the process sound as difficult to achieve as possible. But this is just slight of hand: the goal of the stranded explorer isn’t to reproduce the exact technique of the Inuit, but to kill seals and eat them. The explorer isn’t going to use caribou antler probes or polar bear harpoon tips — they are going to use some modern wood or metal that they stripped from their ice bound ship.

Then we hit “Now you have a seal, but you have to cook it.” What? The Inuit didn’t cook their seal meat using a soapstone lamp fueled with whale oil, they ate it raw! At this point, Henrich is not just being misleading, he’s making it up as he goes along. At this point I start to wonder if part about the antler probe and bone harpoon head are equally fictional. I might be wrong, but beyond this my instinct is to doubt everything that Henrich argues for, even if (especially if) it’s not an area where I have familiarity

Going back to the previous post on “Epistemic Learned Helplessness”, I’m surprised that many people seem to have the instinct to continue to trust the parts of a story that they cannot confirm even after they discover that some parts are false. I’m at the opposite extreme. As soon as I can confirm a flaw, I have trouble trusting anything else the author has to say. I don’t care about the baby, this bathwater has to go! And if the “flaw” is that the author is being intentionally misleading, I’m unlikely to ever again trust them (or anyone else who recommends them). .

Probably I accidentally misrepresented a lot in the parts that were my own summary. But this is from a direct quote, and so not my fault.

roystgnr adds:

Wikipedia seems to suggest that they ate freshly killed meat raw, but cooked some of the meat brought back to camp using a Kudlik, a soapstone lamp fueled with seal oil or whale blubber. Is that not correct? That would still flatly contradict “but you have to cook it”, but it’s close enough that the mistake doesn’t reach “making it up as he goes along” levels of falsehood. You’re correct that even the true bits seem to be used for argument in a misleading fashion, though.

This seems within the level of simplifying-to-make-a-point that I have sometimes been guilty of myself, so I’ll let it pass.