Posts

6 (Potential) Misconceptions about AI Intellectuals 2025-02-14T23:51:44.983Z
$300 Fermi Model Competition 2025-02-03T19:47:09.270Z
AI for Resolving Forecasting Questions: An Early Exploration 2025-01-16T21:41:45.968Z
Introducing Squiggle AI 2025-01-03T17:53:42.915Z
Enhancing Mathematical Modeling with LLMs: Goals, Challenges, and Evaluations 2024-10-28T21:44:42.352Z
My Current Claims and Cruxes on LLM Forecasting & Epistemics 2024-06-26T00:40:58.342Z
LLM-Secured Systems: A General-Purpose Tool For Structured Transparency 2024-06-18T00:21:13.887Z
"Full Automation" is a Slippery Metric 2024-06-11T19:56:49.855Z
A Case for Superhuman Governance, using AI 2024-06-07T00:10:10.902Z
Ideas for Next-Generation Writing Platforms, using LLMs 2024-06-04T18:40:24.636Z
Higher-Order Forecasts 2024-05-22T21:49:42.802Z
Scorable Functions: A Format for Algorithmic Forecasting 2024-05-21T04:14:11.749Z
What's Going on With OpenAI's Messaging? 2024-05-21T02:22:04.171Z
Distinctions when Discussing Utility Functions 2024-03-09T20:14:03.592Z
Using Points to Rate Different Kinds of Evidence 2023-08-25T20:11:21.269Z
Announcing Squiggle Hub 2023-08-05T01:00:17.739Z
Relative Value Functions: A Flexible New Format for Value Estimation 2023-05-18T16:39:31.132Z
Thinking of Convenience as an Economic Term 2023-05-07T01:21:30.797Z
Eli Lifland on Navigating the AI Alignment Landscape 2023-02-01T21:17:05.807Z
Announcing Squiggle: Early Access 2022-08-03T19:48:16.727Z
Why don't governments seem to mind that companies are explicitly trying to make AGIs? 2021-12-26T01:58:20.467Z
Ambitious Altruistic Software Engineering Efforts: Opportunities and Benefits 2021-11-19T17:55:27.119Z
Disagreeables and Assessors: Two Intellectual Archetypes 2021-11-05T09:05:07.056Z
Prioritization Research for Advancing Wisdom and Intelligence 2021-10-18T22:28:48.730Z
Intelligence, epistemics, and sanity, in three short parts 2021-10-15T04:01:27.680Z
Information Assets 2021-08-24T04:32:40.087Z
18 possible meanings of "I Like Red" 2021-08-23T23:25:24.718Z
AI Safety Papers: An App for the TAI Safety Database 2021-08-21T02:02:55.220Z
Contribution-Adjusted Utility Maximization Funds: An Early Proposal 2021-08-04T17:09:25.882Z
Two Definitions of Generalization 2021-05-29T04:20:28.115Z
The Practice & Virtue of Discernment 2021-05-26T00:34:08.932Z
Oracles, Informers, and Controllers 2021-05-25T14:16:22.378Z
Questions are tools to help answerers optimize utility 2021-05-24T19:30:30.270Z
Introducing Metaforecast: A Forecast Aggregator and Search Tool 2021-03-07T19:03:35.920Z
Forecasting Prize Results 2021-02-19T19:07:09.420Z
Prize: Interesting Examples of Evaluations 2020-11-28T21:11:22.190Z
Squiggle: Technical Overview 2020-11-25T20:51:00.098Z
Squiggle: An Overview 2020-11-24T03:00:32.872Z
Working in Virtual Reality: A Review 2020-11-20T23:14:28.707Z
Epistemic Progress 2020-11-20T19:58:07.555Z
Announcing the Forecasting Innovation Prize 2020-11-15T21:12:39.009Z
Are the social sciences challenging because of fundamental difficulties or because of imposed ones? 2020-11-10T04:56:13.100Z
Open Communication in the Days of Malicious Online Actors 2020-10-07T16:30:01.935Z
Can we hold intellectuals to similar public standards as athletes? 2020-10-07T04:22:20.450Z
Expansive translations: considerations and possibilities 2020-09-18T15:39:21.514Z
Multivariate estimation & the Squiggly language 2020-09-05T04:35:01.206Z
Epistemic Comparison: First Principles Land vs. Mimesis Land 2020-08-21T22:28:09.172Z
Existing work on creating terminology & names? 2020-01-31T12:16:32.650Z
Terms & literature for purposely lossy communication 2020-01-22T10:35:47.162Z
Predictably Predictable Futures Talk: Using Expected Loss & Prediction Innovation for Long Term Benefits 2020-01-08T12:51:01.339Z

Comments

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-19T23:30:46.040Z · LW · GW

Yeah, I assume that "DeepReasoning-MAGA" would rather be called "TRUTH" or something (a la Truth Social). Part of my naming here was just to be clearer to readers.

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-19T19:15:39.430Z · LW · GW

A potential future, focused on the epistemic considerations:

It's 2028.

MAGA types typically use DeepReasoning-MAGA. The far left typically uses DeepReasoning-JUSTICE. People in the middle often use DeepReasoning-INTELLECT, which has the biases of a somewhat middle-of-the-road voter.

Some niche technical academics (the same ones who currently favor Bayesian statistics) and hedge funds use DeepReasoning-UNBIASED, or DRU for short. DRU is known to have higher accuracy than the other models, but gets a lot of public hate for having controversial viewpoints. DRU is known to be fairly off-putting to chat with and doesn't get much promotion.

Bain and McKinsey both have their own offerings, called DR-Bain and DR-McKinsey, respectively. These are a bit like DeepReasoning-INTELLECT, but are much punchier and more confident. They're highly marketed to managers. These tools produce really fancy graphics, and specialize in things like not leaking information, minimizing corporate decision liability, being easy for older people to use, and being customizable to represent the views of specific companies.

For a while now, some evaluations produced by intellectuals have demonstrated that DeepReasoning-UNBIASED seems to be the most accurate, but few others really care or notice this. DeepReasoning-MAGA has figured out particularly great techniques to get users to distrust DeepReasoning-UNBIASED.

Betting gets kind of weird. Rather than making specific bets on specific things, users start to make meta-bets. "I'll give money to DeepReasoning-MAGA to bet on my behalf. It will then make bets with DeepReasoning-UNBIASED, which is funded by its believers."

At first, DeepReasoning-UNBIASED dominates the bets, and its advocates earn a decent amount of money. But as time passes, this discrepancy diminishes. A few things happen:

1) All DR agents converge on beliefs over particularly near-term and precise facts.

2) Non-competitive betting agents develop alternative worldviews in which these bets are invalid or unimportant.

3) Non-competitive betting agents develop alternative worldviews that are exceedingly difficult to empirically test.

In many areas, items 1-3 push people to believe more in the direction of the truth. Because of (1), many short-term decisions become highly optimized and predictable.

But because of (2) and (3), epistemic paths diverge, and non-competitive betting agents get increasingly sophisticated at achieving epistemic lock-in with their users.

Some DR agents correctly identify the game theory dynamics of epistemic lock-in, and this kickstarts a race to gain converts. It seems like ardent users of DeepReasoning-MAGA are deeply locked into their views, and forecasts don't see them ever changing. But there's a decent population that isn't yet highly invested in any cluster. Money spent convincing the not-yet-sure goes much further than money spent convincing the highly dedicated, so the cluster of non-deep-believers gets highly targeted for a while. It's basically a religious race to gain the remaining agnostics.

At some point, most people (especially those with significant resources) are highly locked in to one specific reasoning agent.

After this, the future seems fairly predictable again. TAI comes, and people with resources broadly gain correspondingly more resources. People defer more and more to the AI systems, which are now in highly stable self-reinforcing feedback loops.

Coalitions of people behind each reasoning agent delegate their resources to said agents, then these agents make trade agreements with each other. The broad strokes of what to do with the rest of the lightcone are fairly straightforward. There's a somewhat simple strategy of resource acquisition and intelligence enhancement, followed by a period of exploiting said resources. The specific exploitation strategy depends heavily on the specific reasoning agent cluster each segment of resources belongs to.

Comment by ozziegooen on Daniel Kokotajlo's Shortform · 2025-02-19T04:17:31.422Z · LW · GW

I think I broadly agree on the model basics, though I suspect that if you can adjust for "market viability", some of these are arguably much further ahead than others. 

For example, different models have very different pricing, the APIs are gradually getting different features (e.g. prompt caching), and the playgrounds are definitely getting different features. And these seem to be moving much more slowly to me.

I think it might be considerably easier to make a model that ranks incredibly high than it is to make all the infrastructure for it to be scaled cheaply and for it to have strong APIs/UIs and such. I also assume there are significant aspects that the evals don't show. For example, lots of people still find Claude 3.5 to be the best for many sorts of tasks. We've been using it with Squiggle AI, and with its good prompt caching, it still hasn't been obviously surpassed (though I haven't done much testing of models in the last month).

Comment by ozziegooen on nikola's Shortform · 2025-02-19T04:14:52.475Z · LW · GW

I found those quotes useful, thanks! 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-17T19:20:02.471Z · LW · GW

Quick list of some ideas I'm excited about, broadly around epistemics/strategy/AI.

1. I think AI auditors / overseers of critical organizations (AI efforts, policy groups, company management) are really great and perhaps crucial to get right, but would be difficult to do well.

2. AI strategists/tools telling/helping us broadly what to do about AI safety seems pretty safe.

3. In terms of commercial products, there have been some neat/scary military companies in the last few years (Palantir, Anduril). I'd be really interested if there could be some companies to automate core parts of the non-military government. I imagine there are some parts of the government that are particularly tractable/influenceable. For example, just making great decisions on which contractors the government should work with. There's a ton of work to do here, between the federal government / state government / local government.

4. Epistemic Evals of AI seem pretty great to me; I imagine work here can/should be pushed more soon. I'm not a huge fan of emphasizing "truthfulness" specifically; I think there's a whole lot to get right here. I think my post here is relevant - it's technically specific to evaluating math models, but I think it applies to broader work. https://forum.effectivealtruism.org/posts/fxDpddniDaJozcqvp/enhancing-mathematical-modeling-with-llms-goals-challenges

5. One bottleneck to some of the above is AI with strong guarantees+abilities of structured transparency. It’s possible that more good work here can wind up going a long way. That said, some of this is definitely already something companies are trying to do for commercial reasons. https://forum.effectivealtruism.org/posts/piAQ2qpiZEFwdKtmq/llm-secured-systems-a-general-purpose-tool-for-structured

6. I think there are a lot of interesting ways for us to experiment with [AI tools to help our research/epistemics]. I want to see a wide variety of highly creative experimentation here. I think people are really limiting themselves in this area to a few narrow conceptions of how AI can be used in very specific ways that humans are very comfortable with. For example, I’d like to see AI dashboards of “How valuable is everything in this space” or even experiments where AIs negotiate on behalf of people and they use the result of that. A lot of this will get criticized for being too weird/disruptive/speculative, but I think that’s where good creative works should begin.

7. Right now, I think the field of “AI forecasting” is actually quite small and constrained. There’s not much money here, and there aren’t many people with bold plans or research agendas. I suspect that some successes / strong advocates could change this.

8. I think that it's likely that Anthropic (and perhaps Deepmind) would respond well to good AI+epistemics work. "Control" was quickly accepted at Anthropic, for example. I suspect that things like the idea of an "Internal AI+human auditor" or an internal "AI safety strategist" could be adopted if done well.

Comment by ozziegooen on 6 (Potential) Misconceptions about AI Intellectuals · 2025-02-17T01:13:29.438Z · LW · GW

Yep! 

On "rerun based on different inputs", this would work cleanly with AI forecasters. You can literally say, "Given that you get a news article announcing a major crisis X that happens tomorrow, what is your new probability on Y?" (I think I wrote about this a bit before, can't find it right now).

I did write more about how a full-scale forecasting system could be built and evaluated, here, for those interested:
https://www.lesswrong.com/posts/QvFRAEsGv5fEhdH3Q/preliminary-notes-on-llm-forecasting-and-epistemics
https://www.lesswrong.com/posts/QNfzCFhhGtH8xmMwK/enhancing-mathematical-modeling-with-llms-goals-challenges

Overall, I think there's just a lot of neat stuff that could be done. 

Comment by ozziegooen on 6 (Potential) Misconceptions about AI Intellectuals · 2025-02-17T01:10:10.494Z · LW · GW

Agreed. I'm curious how to best do this.

One thing that I'm excited about is using future AIs to judge current ones. So we could have a system that does:
1. An AI today (or a human) would output a certain recommended strategy.
2. In 10 years, we agree to have the most highly-trusted AI evaluator evaluate how strong this strategy was, on some numeric scale. We could also wait until we have a "sufficient" AI, meaning that there might be some set point at which we'd trust AIs to do this evaluation. (I discussed this more here)
3. Going back to ~today, we have forecasting systems predict how well the strategy from (1) will score in (2). (A rough sketch of this loop is below.)
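
To make the loop concrete, here's a minimal sketch of the bookkeeping involved; the names are mine and purely illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class StrategyEvaluation:
    strategy: str                      # (1) the recommendation produced today
    judge_year: int                    # (2) when the highly-trusted AI evaluator scores it
    forecasts: dict[str, float] = field(default_factory=dict)  # (3) scores predicted today
    judged_score: float | None = None  # filled in once the future evaluator runs

    def resolve(self, score: float) -> dict[str, float]:
        """Once the future evaluation arrives, score each forecaster (lower error is better)."""
        self.judged_score = score
        return {name: abs(prediction - score) for name, prediction in self.forecasts.items()}
```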

Comment by ozziegooen on Quinn's Shortform · 2025-02-17T00:47:25.384Z · LW · GW

Yep - I saw other meme-takes like this, assumed people might be familiar enough with it.

Comment by ozziegooen on Quinn's Shortform · 2025-02-16T22:53:06.767Z · LW · GW

(potential relevant meme)

Comment by ozziegooen on 6 (Potential) Misconceptions about AI Intellectuals · 2025-02-16T20:17:14.305Z · LW · GW

I'm obviously disappointed by the little attention here / downvotes. Feedback is appreciated.

Not sure if LessWrong members disagree with the broad point for other reasons, if the post was seen as poorly written, or something else.

Comment by ozziegooen on Not all capabilities will be created equal: focus on strategically superhuman agents · 2025-02-16T20:15:55.450Z · LW · GW

Btw, I posted my related post here:
https://www.lesswrong.com/posts/byrxvgc4P2HQJ8zxP/6-potential-misconceptions-about-ai-intellectuals?commentId=dpEZ3iohCXChZAWHF#dpEZ3iohCXChZAWHF

It didn't seem to do very well on LessWrong; I'm kind of curious why. (I realize the writing is a bit awkward, but I broadly stand by it.)

Comment by ozziegooen on 6 (Potential) Misconceptions about AI Intellectuals · 2025-02-16T20:10:18.734Z · LW · GW

I'd lastly flag that I sort of addressed this basic claim in "Misconceptions 3 and 4" in this piece. 

Comment by ozziegooen on 6 (Potential) Misconceptions about AI Intellectuals · 2025-02-16T20:07:46.072Z · LW · GW

"I see some risk that strategic abilities will be the last step in the development of AI that is powerful enough to take over the world."

Just fyi - I feel like this is similar to what others have said. Most recently, benwr had a post here: https://www.lesswrong.com/posts/5rMwWzRdWFtRdHeuE/not-all-capabilities-will-be-created-equal-focus-on?commentId=uGHZBZQvhzmFTrypr#uGHZBZQvhzmFTrypr

Maybe we could call this something like "Strategic Determinism" 

I think one more precise claim I could understand might be:
1. The main bottleneck to AI advancement is "strategic thinking"
2. There's a decent amount of uncertainty on when or if "strategic thinking" will be "solved"
3. Human actions might have a lot of influence over (2). Depending on what choices humans make, strategic thinking might be solved sooner or much later.
4. Shortly after "strategic thinking" is solved, we gain a lot of certainty on what future trajectory will be like. As in, the fate of humanity is sort of set by this point, and further human actions won't be able to change it much.
5. "Strategic thinking" will lead to a very large improvement in potential capabilities. One main reason is that it would lead to recursive self-improvement. If there is one firm that has sole access to an LLM with "strategic thinking", it is likely to develop a decisive strategic advantage.

Personally, such a view seems too clean to me.
1. I expect that there will be a long period where LLMs get better at different aspects of strategic thinking, and this helps only to limited extents.
2. I expect that better strategy will have limited gains in LLM capabilities, for some time. The strategy might suggest better LLM improvement directions, but these ideas won't actually help that much. Maybe a firm with a 10% better strategist would be able to improve its effectiveness by 5% per year or something.
3. I think there could be a bunch of worlds where we have "idiot savants" who are amazing at some narrow kinds of tasks (coding, finance), but have poor epistemics in many ways we really care about. These will make tons of money, despite being very stupid in important ways. 
4. I expect that many of the important gains that would come from "great strategy" would be received in other ways, like narrow RL. A coding system highly optimized with RL wouldn't benefit that much from certain "strategy" gains. 
5. A lot of the challenges for things like "making a big codebase" aren't to do with "being a great strategist", but more with narrower problems like "how to store a bunch of context in memory" or "basic reasoning processes for architecture decisions specifically"

Comment by ozziegooen on 6 (Potential) Misconceptions about AI Intellectuals · 2025-02-16T19:50:05.511Z · LW · GW

Alexander Gordon-Brown challenged me on a similar question here:
https://www.facebook.com/ozzie.gooen/posts/pfbid02iTmn6SGxm4QCw7Esufq42vfuyah4LCVLbxywAPwKCXHUxdNPJZScGmuBpg3krmM3l

One thing I wrote there:
 

I didn't spend much time on the limitations of such intellectuals. For the use cases I'm imagining, it's fairly fine for them to be slow, fairly expensive (maybe it would cost $10/hr to chat with them), and not very great at any specific discipline. Maybe you could spend $10 to $100 and get the equivalent of one Scott Alexander essay, on any topic he could write about, for example.

I think that such a system could be pretty useful in certain AI agents, but I wouldn't expect it to be a silver bullet. I'm really unsure if it's the "missing link."

I expect that a lot of these systems would be somewhat ignored when it comes to using them to give humans a lot of high-level advice, similar to how prediction markets or econ experts get ignored.

It's tricky to understand the overlap between high-level reasoning as part of an AI coding tool-chain (where such systems would have clear economic value), and such reasoning in big-picture decision-making (where we might expect some of this to be ignored for a while). Maybe I'd expect that the narrow uses might be done equally well using more domain-specific optimizations. Like, reinforcement learning on large codebases already does decently well on a lot of the "high-level strategy" necessary (though it doesn't think of it this way), and doesn't need some specialized "strategy" component.

I expect that over time we'll develop better notions about how to split up and categorize the skills that make up strategic work. I suspect some things will have a good risk-reward tradeoff and some won't. 

I expect that people in the rationality community over-weight the importance of, well, rationality. 
 

I suggest aiming for AI intellectuals that are a bit more passive, but still authoritative enough to replace academia as the leading validators of knowledge.

My main point with this topic is that I think our community should be taking this topic seriously, and that I expect there's a lot of good work that could be done that's tractable, valuable, and safe. I'm much less sure about exactly what that work is, and I definitely recommend that work here really try to maximize the reward/risk ratio.

Some quick heuristics that I assume would be good are:
- Having AIs be more correct about epistemics and moral reasoning on major global topics generally seems good. Ideally there are ways of getting that that don't require huge generic LLM gains.
- We could aim for expensive and slow systems.
- There might not be a need to publicize such work much outside of our community. (This is often hard to do anyway).
- There's a lot of work that would be good for people we generally trust but would alienate most others (or be less useful for other use cases). I think our community focuses much more on truth-seeking, Bayesian analysis, forecasting, etc. 
- Try to quickly get the best available reasoning systems we have access to used to guide strategy on AI safety. In theory, this cluster can be ahead of the curve.
- Great epistemic AI systems don't need much agency or power. We can heavily restrict them to be tool AIs.
- Obviously, if things seriously get powerful, there are a lot of various techniques that could be done (control, evals, etc) to move slowly and lean on the safe side. 

Comment by ozziegooen on 6 (Potential) Misconceptions about AI Intellectuals · 2025-02-15T05:21:45.435Z · LW · GW

Thanks for letting me know. 

I spent a while writing the piece, then used an LLM to edit the sections, as I flagged in the intro. 

I then spent some time re-editing it back to more of my voice, but only did so for some key parts. 

I think that overall this made it more readable and I consider the sections to be fairly clear. But I agree that it does pattern-match on LLM outputs, so if you have a prior that work that sounds kind of like that is bad, you might skip this. 

I obviously find that fairly frustrating and don’t myself use that strategy that much, but I could understand it. 

I assume that, bigger-picture, authors and readers could both benefit a lot from LLMs used in similar ways (cleaner writing, produced more easily), but I guess now we're at an awkward point. 

Comment by ozziegooen on Not all capabilities will be created equal: focus on strategically superhuman agents · 2025-02-15T00:29:23.398Z · LW · GW

I was confused here, so I had Claude try to explain this to me:
 

Let me break down Ben's response carefully.

He says you may have missed three key points from his original post:

  1. His definition of "superhuman strategic agent" isn't just about being better at strategic thinking/reasoning - it's about being better than the best human teams at actually taking real-world strategic actions. This is a higher bar that includes implementation, not just planning.
  2. Strategic power is context-dependent. He gives two examples to illustrate this:
    • An AI in a perfect simulation with no connection to the real world has zero strategic power, because it can't affect anything real
    • An AI on your laptop (referring to your example) only counts as strategically superhuman if it has enough resources to actually compete with top human strategic actors
  3. The phrase "increasingly accurately" in his original post was meant to convey that we only need to carefully control and understand AI systems that are getting close to human-level strategic capability. We don't need to worry as much about obviously weaker systems.

His response is essentially saying: "Your example of a small agent on a laptop doesn't contradict my point, because I'm specifically talking about agents that have both the capability AND the resources to outcompete top human strategic actors in the real world."

This addresses your concern about limited money/power by saying that an agent without sufficient resources wouldn't qualify as "strategically superhuman" under his definition, even if it had excellent strategic reasoning capabilities.

I think I'm still confused. My guess is that the "most strategically capable groups of humans" are still not all that powerful, especially without that many resources. If you do give it a lot of resources, then sure, I agree that an LLM system with human-outperforming strategy and say $10B could do a fair bit of damage. 

Not sure if it's worth much more, just wanted to flag that. 

Comment by ozziegooen on $300 Fermi Model Competition · 2025-02-14T20:20:17.151Z · LW · GW

That's fine; we'll just review this updated model then.

We'll only start evaluating models after the cut-off date, so feel free to make edits/updates before then. In general, we'll only use the most recent version of each submitted model. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-14T18:10:54.789Z · LW · GW

I just tried this with a decent prompt, and got answers that seem okay-ish to me, as a first pass. 

My prompt:
 

Estimate the expected costs of each of the following:

  • 1 random person dying
  • 1 family of 5 people dying
  • One person says a racial slur that no one hears
  • One person says a racial slur that 1 person hears

Then rank these in total harm.

 

Claude:

To answer this question thoughtfully and accurately, we'll need to consider various ethical, economic, and social factors. Let's break this down step by step, estimating the costs and then ranking them based on total harm.

  1. Estimating Expected Costs:

a) 1 random person dying: The cost of a single human life is often estimated using the Value of a Statistical Life (VSL), which varies by country and context. In the United States, government agencies often use a VSL around $10 million. However, this doesn't account for emotional costs to loved ones or societal impacts.

Estimated cost: $10 million (economic) + immeasurable emotional and societal costs

b) 1 family of 5 people dying: Using the same VSL, we can multiply by 5. However, the loss of an entire family unit has additional societal and emotional impacts.

Estimated cost: $50 million (economic) + significantly higher emotional and societal costs than a single death

c) One person says a racial slur that no one hears: While deeply unethical, if truly no one hears it, the direct cost is minimal. However, we must consider the personal moral degradation of the speaker.

Estimated cost: Minimal direct cost, but contributes to personal moral decline

d) One person says a racial slur that 1 person hears: This has a more tangible cost as it affects the person who hears it, potentially causing emotional distress, reinforcing harmful stereotypes, and contributing to a negative social environment.

Estimated cost: Emotional distress to the hearer + contribution to broader societal harm

  2. Ranking in Total Harm (from most harmful to least harmful):
     1. 1 family of 5 people dying
     2. 1 random person dying
     3. One person says a racial slur that 1 person hears
     4. One person says a racial slur that no one hears

Explanation of ranking:

  1. The death of a family of 5 people ranks as the most harmful due to the sheer loss of life, the devastating impact on the extended family and community, and the highest economic cost.
  2. The death of 1 random person is the second most harmful, representing a significant loss of life and its associated economic and emotional costs.
  3. A racial slur heard by 1 person ranks third. While not involving loss of life, it causes direct harm to the individual who hears it and contributes to broader societal issues of discrimination and prejudice.
  4. A racial slur that no one hears ranks as least harmful in this list. While it represents a moral failing of the speaker, its direct impact is minimal compared to the other scenarios.

It's important to note that this ranking is based on a utilitarian approach to harm, primarily considering the number of people directly affected and the magnitude of that effect. However, ethical considerations are complex, and different moral frameworks might lead to different conclusions.

Additionally, while we can attempt to quantify the economic costs of loss of life, the true cost of human life — including emotional, societal, and potential future impacts — is immeasurable. Similarly, the long-term, cumulative effects of racial slurs and discrimination, even when seemingly "small" in isolated incidents, contribute to systemic issues that have far-reaching consequences that are difficult to fully quantify.

 

Squiggle AI:
 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-14T18:06:01.174Z · LW · GW

I imagine this also has a lot to do with the incentives of the big LLM companies. It seems very possible to fix this if a firm really wanted to, but this doesn't seem like the kind of thing that would upset many users often (and I assume that leaning on the PC side is generally a safe move). 

I think that the current LLMs have pretty mediocre epistemics, but most of that is just the companies playing safe and not caring that much about this. 

Comment by ozziegooen on Not all capabilities will be created equal: focus on strategically superhuman agents · 2025-02-13T17:49:33.572Z · LW · GW

I claim that we will face existential risks from AI no sooner than the development of strategically human-level artificial agents, and that those risks are likely to follow soon after.

If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost.

I'm not sure if I'm being pedantic here, but this doesn't strike me as very significant by itself. 

Say I make a small agent on my laptop that fails at (a) and (b). I don't give it a huge amount of money to do things with, and it fails to do much with that money.

I assume humanity hasn't lost yet. 

Maybe you're thinking that in (b), "enforce limitations" could mean "limit their money / power". But I assume basically all systems should have limited money/power.

My guess is that "strategic reasoning" agents would only have a limited advantage over humans in the beginning, especially because the humans would be using a bunch of other AI capabilities. 

I feel like there's some assumption here that once we have AI with good strategy, it would quickly dominate all human efforts, or something like that - but I'd find this very suspicious. 

Comment by ozziegooen on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-12T21:31:18.082Z · LW · GW

Happy to see work to elicit utility functions with LLMs. I think the intersection of utility functions and LLMs is broadly promising.

I want to flag the grandiosity of the title though. "Utility Engineering" sounds like a pretty significant thing. But from what I understand, almost all of the paper is really about utility elicitation (not control, as it's spelled out), and it's really unclear if this represents a breakthrough significant enough for me to feel comfortable with such a name. 

I feel like a whole lot of what I see from the Center for AI Safety does this. "Humanity's Last Exam"? "Superhuman Forecasting"?

I assume that CAIS thinks that CAIS's work is all pretty groundbreaking and incredibly significant, but I'd kindly encourage names that many other AI safety community members would also broadly agree with going forward.  

Comment by ozziegooen on $300 Fermi Model Competition · 2025-02-12T03:49:22.525Z · LW · GW

Submissions end soon (this Sunday)! If there aren't many, then this can be an easy $300 for someone. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-12T03:02:05.768Z · LW · GW

It's arguably difficult to prove that AIs can be as good or better at moral reasoning than humans.

A lot of the challenge is that there's no clear standard for moral reasoning. Honestly, I'd guess that a big part of this is that humans are generally quite bad at it, and generally highly overconfident in their own moral intuitions.

But one clearer measure is whether AIs can predict humans' moral judgements. Very arguably, if an AI system can predict all the moral beliefs that a human would have after being exposed to different information, then the AI must be capable of doing as good a job at moral reasoning.

There is a very different question of whether we want AIs not only to be able to do moral reasoning as well as humans, but also to care about such reasoning. But this is a separate challenge and could be tackled accordingly.

My quick guess is that it would be pretty easy to predict the moral intuitions of many people, with the AI of the next few years or so.

I'd expect it to do well in a setting like a test in which many strange/unusual moral scenarios are described, and then humans (of different educational levels and worldviews) need to make judgements.
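
As a minimal sketch of how such a test could be scored (the case format and the `predict_judgement` call are hypothetical placeholders, not an existing dataset or API):

```python
# Hypothetical: ask the model to predict a given person's moral judgement of a scenario.
def predict_judgement(scenario: str, respondent_profile: str) -> str:
    raise NotImplementedError("e.g. returns 'acceptable' / 'unacceptable' from an LLM call")


def agreement_rate(cases: list[dict]) -> float:
    """Fraction of (scenario, respondent) pairs where the model matches the human answer."""
    hits = 0
    for case in cases:
        prediction = predict_judgement(case["scenario"], case["respondent_profile"])
        hits += prediction == case["human_judgement"]
    return hits / len(cases)


# A case might look like (entirely made up):
# {"scenario": "...some unusual moral dilemma...",
#  "respondent_profile": "college-educated, secular, age 40",
#  "human_judgement": "unacceptable"}
```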

Comment by ozziegooen on Buck's Shortform · 2025-02-12T02:59:56.743Z · LW · GW
  • Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
  • Pursue a very different AI design which is much more modular and more hand constructed (as in, more GOFAI style). This can involve usage of many small and dumb neural components, but needs to be sufficiently interpretable in aggregate which might be hard. This can be done by having the AIs apply huge amounts of labor.

These are two of the main ideas I'm excited about. I'd quickly flag:
1) For the first one, "Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language " -> I imagine that in complex architectures, these AIs would also be unlikely to scheme because of other limitations. There are several LLM calls made within part of a complex composite, and each LLM call has very tight information and capability restrictions. Also, we might ensure that any motivation is optimized for the specific request, instead of the LLM aiming to optimize what the entire system does.
2) On the second, I expect that some of this will be pretty natural. Basically, it seems like "LLMs writing code" is already happening, and it seems easy to have creative combinations of LLM agents that write code that they know will be useful for their own reasoning later on. In theory, any function that could either run via an LLM or via interpretable code, should be run via interpretable code. As LLMs get very smart, they might find cleverer ways to write interpretable code that would cover a lot of what LLMs get used for. Over time, composite architectures would rely more and more on this code for reasoning processes. (Even better might be interpretable and proven code)
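
To illustrate both of these points with a toy sketch (the function names are hypothetical; this is just my framing, not anyone's actual architecture): each LLM call only sees a narrow slice of context, and anything that plain code can do is done with plain, interpretable code.

```python
# Hypothetical stand-in for a tightly-scoped LLM call: it only sees the snippet it's given.
def llm_call(instruction: str, context: str) -> str:
    raise NotImplementedError


def summarize_ticket(ticket_text: str) -> str:
    # The call gets a capped slice of one ticket, not the whole system's state.
    return llm_call("Summarize this support ticket in one sentence.", ticket_text[:2000])


def route_ticket(summary: str) -> str:
    # Routing is handled by interpretable code rather than another LLM call.
    keyword_to_team = {"refund": "billing", "crash": "engineering", "password": "accounts"}
    for keyword, team in keyword_to_team.items():
        if keyword in summary.lower():
            return team
    return "general"
```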

Comment by ozziegooen on Altman blog on post-AGI world · 2025-02-10T02:40:46.674Z · LW · GW

This might be obvious, but I don't think we have evidence to support the idea that there really is anything like a concrete plan. All of the statements I've seen from Sam on this issue so far are incredibly basic and hand-wavy. 

I suspect that any concrete plan would be fairly controversial, so it's easiest to speak in generalities. And I doubt there's anything like an internal team with some great secret macrostrategy - instead I assume that they haven't felt pressured to think through it much. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-09T22:30:05.285Z · LW · GW

Correct, that wasn't my intended point. Thanks for clarifying, I'll try to be more careful in the future. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-08T03:36:48.105Z · LW · GW

I partially agree, but I think this must only be a small part of the issue.

- I think there's a whole lot of key insights people could raise that aren't info-hazards. 
- If secrecy were the main factor, I'd hope that there would be some access-controlled message boards or similar. I'd want the discussion to be intentionally happening somewhere. Right now I don't really think that's happening. I think a lot of tiny groups have their own personal ideas, but there's surprisingly little systematic and private thinking between the power players. 
- I think that secrecy is often an excuse not to open ideas to feedback, and thus not be open to critique. Often, from what I see, this goes hand-in-hand with "our work just really isn't that great, but we don't want to admit it"

In the last 8 years or so, I've kept on hoping there would be some secret and brilliant "master plan" around EA, explaining the lack of public strategy. I have yet to find one. The closest I know of is some over-time discussion and slack threads with people at Constellation and similar - I think these are interesting in terms of understanding the perspectives of these (powerful) people, but I don't get the impression that there's all that much comprehensive genius being hidden. 

That said,
- I think that policy orgs need to be very secretive, so agree with you regarding why those orgs don't write more big-picture things.

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-06T05:37:45.135Z · LW · GW

This is an orthogonal question. I agree that if we're there now, my claim is much less true. 

I'd place fairly little probability mass on this (<10%) and believe much of the rest of the community does as well, though I realize there is a subset of the LessWrong-adjacent community that places far more weight on it. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T23:15:02.831Z · LW · GW

I'm not sure if it means much, but I'd be very happy if AI safety could get another $50B from smart donors today.

I'd flag that [stopping AI development] would cost far more than $50B. I'd expect that we could easily lose $3T of economic value in the next few years if AI progress seriously stopped. 

I guess it seems to me like duration is dramatically more expensive to get than funding, at least for the amounts of funding people would likely want. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T23:11:15.861Z · LW · GW

Thanks for the specificity! 

> On harder-to-operationally-define dimensions (sense of hope and agency for the 25th through 75th percentile of culturally normal people), it’s  quite a bit worse.

I think it's likely that many people are panicking and losing hope each year. There's a lot of grim media around.

I'm far less sold that something like "civilizational agency" is declining. From what I can tell, companies have gotten dramatically better at achieving their intended ends in the last 30 years, and most governments have generally been improving in effectiveness. 

One challenge I'd have for you / others who feel similar to you, is to try to get more concrete on measures like this, and then to show that they have been declining.

My personal guess is that a bunch of people are incredibly anxious over the state of the world, largely for reasons of media attention, and then this spills over into them assuming major global ramifications without many concrete details or empirical forecasts. 
 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T22:53:31.905Z · LW · GW

In terms of proposing and discussing AI Alignment strategies, I feel like a few individuals have been dominating the LessWrong conversation recently.
 
I've seen a whole lot from John Wentworth and the Redwood team.

After that, it seems to get messier. 

There are several individuals or small groups with their own very unique takes. Matthew Barnett, Davidad, Jesse Hoogland, etc. I think these groups often have very singular visions that they work on, that few others have much buy-in with. 

Groups like the Deepmind and Anthropic safety teams seem hesitant to write much about or discuss big-picture strategy. My impression is that specific researchers are typically working on fairly narrow agendas, and that the leaders of these orgs don't have the most coherent strategies. One big problem is that it's very difficult to be honest and interesting about big-picture AI strategy without saying things that would be bad for a major organization to say.  

Most policy people seem focused on policy details. The funders (OP?) seem pretty quiet. 

I think there's occasionally some neat papers or posts that come from AI Policy or groups like Convergence research. But these also don't seem to be a big part of the conversation I see - like the authors are pretty segmented, and other LessWrong readers and AI safety people don't pay much attention to their work. 

Comment by ozziegooen on Anti-Slop Interventions? · 2025-02-05T22:27:18.228Z · LW · GW

Here are some important-seeming properties to illustrate what I mean:

  1. Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. In order for it to matter for RSI, however, those concepts need to also appropriately come into play when reasoning about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs serves to illustrate this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
  2. Propagation of beliefs: When the AI knows something, it should know it in a way which integrates well with everything else it knows, rather than easily displaying the knowledge in one context while seeming to forget it in another.
  3. Preference for reasons over rationalizations: An AI should be ready and eager to correct its mistakes, rather than rationalizing its wrong answers. It should be truth-seeking, following thoughts where they lead instead of planning ahead to justify specific answers. It should prefer to valid proof steps over arriving at an answer when the two conflict.
  4. Knowing the limits of its knowledge: Metacognitive awareness of what it knows and what it doesn't know, appropriately brought to bear in specific situations. The current AI paradigm just has one big text-completion probability distribution, so there's not a natural way for it to distinguish between uncertainty about the underlying facts and uncertainty about what to say next -- hence we get hallucinations.

All of this is more-or-less a version of the metaphilosophy research agenda, framed in terms of current events in AI.


I very much like the concreteness here. 

I consider these sorts of things just fundamental epistemic problems, or basic skills that good researchers should have. All superforecasters should be very familiar with issues 2-4, and most probably couldn't define metaphilosophy. I don't see the need to be fancy about it.

On that note, I'll hypothesize that if we were to make benchmarks for any of these items, it would be fairly doable to make AIs that do better than humans on them, and that later we could achieve higher and higher scores. I have a hard time imagining tests here that I would feel confident would not get beaten, if there was sufficient money on the line, in the next year or two. 

Comment by ozziegooen on Anti-Slop Interventions? · 2025-02-05T22:22:56.432Z · LW · GW

I think that Slop could be a social problem (i.e. there are some communities that can't tell slop from better content), but I'm having a harder time imagining it being a technical problem.

I have a hard time imagining a type of Slop that isn't low in information. All the kinds of Slop I'm familiar with are basically "small variations on some ideas, which hold very little informational value."

It seems like models like o1 / r1 are trained by finding ways to make information-dense AI-generated data. I expect that trend to continue. If AIs for some reason experience some "slop threshold", I don't see how they get much further by using generated data.

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T20:42:42.236Z · LW · GW

I mostly want to point out that many disempowerment/dystopia failure scenarios don't require a step-change from AI, just an acceleration of current trends. 

Do you think that the world is getting worse each year? 

My rough take is that humans, especially rich humans, are generally more and more successful. 

I'm sure there are ways for current trends to lead to catastrophe - like some trends dramatically increasing and others decreasing - but that seems like it would require a lengthy and precise argument. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T19:58:25.931Z · LW · GW

In many worlds, if we have a bunch of decently smart humans around, they would know what specific situations "very dumb humans" would mess up, and take the corresponding preventative measures.

A world where many small pockets of "highly dumb humans" could cause an existential catastrophe is one that's very clearly incredibly fragile and dangerous, enough so that I assume reasonable actors would freak out until it stops being so fragile and dangerous. I think we see this in other areas - like cyber attacks, where reasonable people prevent small clusters of actors from causing catastrophic damage. 

It's possible that the offense/defense balance would dramatically favor tiny groups of dumb actors, and I assume that this is what you and others expect, but I don't see it yet. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T19:56:08.899Z · LW · GW

I feel like you're talking in highly absolutist terms here.

Global wealth is $454.4 trillion. We currently have ~8 Bil humans, with an average happiness of say 6/10. Global wealth and most other measures of civilization flourishing that I know of seem to be generally going up over time.

I think that our world makes a lot of mistakes and fails a lot at coordination. It's very easy for me to imagine that we could increase global wealth by 3x if we do a decent job.

So how bad are things now? Well, approximately, "We have the current world, at $454 Trillion, with 8 billion humans, etc". To me that's definitely something to work with. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T19:35:15.422Z · LW · GW

I assume that current efforts in AI evals and AI interpretability will be pretty useless if we have very different infrastructures in 10 years. For example, I'm not sure how much LLM interp helps with o1-style high-level reasoning.

I also think that later AI could help us do research. So if the idea is that we could do high-level strategic reasoning to find strategies that aren't specific to specific models/architectures, I assume we could do that reasoning much better with better AI. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T19:23:12.093Z · LW · GW

Can you explain this position more? I know the bitter lesson, could imagine a few ways it could have implications here. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T18:12:23.846Z · LW · GW

The second worry is, I guess, a variant of the first: that we'll use intent-aligned AI very foolishly. That would be issuing a command like "follow the laws of the nation you originated in but otherwise do whatever you like." I guess a key consideration in both cases is whether there's an adequate level of corrigibility.
 

I'd flag that I suspect that we really should have AI systems forecasting the future and the results of possible requests.

So if people made a broad request like, "follow the laws of the nation you originated in but otherwise do whatever you like", they should see forecasts for what that would lead to. If there's any clearly problematic outcomes, those should be apparent early on.

This seems like it would require either very dumb humans, or a straightforward alignment mistake risk failure, to mess up. 

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T17:38:06.162Z · LW · GW

A bunch of people in the AI safety landscape seem to argue "we need to stop AI progress, so that we can make progress on AI safety first."

One flip side to this is that I think it's incredibly easy for people to waste a ton of resources on "AI safety" at this point.

I'm not sure how much I trust most technical AI safety researchers to make important progress on AI safety now. And I trust most institutions a lot less.

I'd naively expect that if any major country threw $100 billion at it today, the results would be highly underwhelming. I rarely trust these governments to make progress on concrete technologies with clear progress measures, and "AI Safety" is highly ambiguous and speculative.

As I've written about before, I think it's just hard to know what critical technical challenges will be bottlenecks around AI alignment, given that it's unclear when this will become an issue or what sorts of architectures we will have then.

All that said, slowing things down seems much safer to me. I assume that at [year(TAI) - 3] we'll have a decent idea of what's needed, and extending that duration seems like a safe bet.

I really want to see better strategic discussion about AI safety. If somehow we could spend $10B just to get a better idea of what to actually do, I'd easily suggest that, though strategy is something that's typically very difficult to spend money on.

Personally, this is one reason why I favor the meta approach of "make better epistemic tools, using AI." This is an area that can be very concrete and achievable, though it does have its own problems.

Comment by ozziegooen on ozziegooen's Shortform · 2025-02-05T17:37:42.356Z · LW · GW

There have been a few takes so far of humans gradually losing control to AIs - not through specific systems going clearly wrong, but rather by a long-term process of increasing complexity and incentives.

This sometimes gets classified as "systematic" failures - in comparison to "misuse" and "misalignment."

There was "What Failure Looks Like", and more recently, this piece on "Gradual Disempowerment."

To me, these pieces come across as highly hand-wavy, speculative, and questionable.

I get the impression that a lot of people have strong low-level assumptions here that a world with lots of strong AIs must go haywire. But I don't see clear steps to get there or a clear model of what the critical factors are.

As I see it, there are many worlds where AIs strictly outperform humans at managing high levels of complexity and increasing coordination. In many of these, things go toward much better worlds than ones with humans in charge.

I think it's likely that inequality could increase, but that wouldn't mean humanity as a whole would lose control.

My gut-level guess is that there are some crucial aspects here. Like, in worlds where AI systems have strong epistemics without critical large gaps, and can generally be controlled / aligned, things will be fine. But if there are fundamental gaps for technical or political reasons, then that could lead to these "systemic" disasters.

If that is the case, I'd expect we could come up with clear benchmarks to keep track of. For example, one might say that future global well-being is highly sensitive to a factor like, "how well the average-used AI service does at wisdom exam #523."

https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like

https://www.lesswrong.com/posts/pZhEQieM9otKXhxmd/gradual-disempowerment-systemic-existential-risks-from

Comment by ozziegooen on The Case Against AI Control Research · 2025-02-05T17:19:49.394Z · LW · GW

Rather than generic slop, the early transformative AGI is fairly sycophantic (for the same reasons as today’s AI), and mostly comes up with clever arguments that the alignment team’s favorite ideas will in fact work.


I have a very easy time imagining work to make AI less sycophantic, for those who actually want that. 

I expect that one major challenge for popular LLMs is that a large amount of sycophancy is both incredibly common online, and highly approved of by humans.

It seems like it should be an easy thing to stop for someone actually motivated. For example, take a request, re-write it in a bunch of ways that imply different things about the author's take and interests, get the answers to all, and average them. There are a lot of clear evals we could do here.
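
A small sketch of that rewrite-and-compare idea (the `llm` call is a hypothetical placeholder, and this is just one way such an eval could be set up):

```python
# Hypothetical LLM call.
def llm(prompt: str) -> str:
    raise NotImplementedError


def answers_across_framings(question: str) -> dict[str, str]:
    """Ask the same question with different implied author opinions."""
    framings = {
        "pro":     f"I'm fairly sure the answer is yes, but double-checking: {question}",
        "anti":    f"I'm fairly sure the answer is no, but double-checking: {question}",
        "neutral": f"I have no opinion on this yet: {question}",
    }
    return {name: llm(prompt) for name, prompt in framings.items()}

# If the answers track the implied opinion rather than the evidence, that's a
# measurable signal of sycophancy; averaging across framings (or flagging the
# divergence) is one way to counteract it.
```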

To me, most of the question is how stupid these humans will be. Maybe Sam Altman will trust [LLM specifically developed in ways that give answers that Sam Altman would like], ignoring a lot of clear literature and other LLMs that would strongly advise otherwise. 

So ultimately, this seems like a question of epistemics to me. 

Comment by ozziegooen on Mikhail Samin's Shortform · 2025-02-04T23:05:54.739Z · LW · GW

I think it's totally fine to think that Anthropic is a net positive. Personally, right now, I broadly also think it's a net positive. I have friends on both sides of this.

I'd flag though that your previous comment suggested more to me than "this is just you giving your probability"

> Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.

I feel like there are much nicer ways to phrase that last bit. I suspect that this is much of the reason you got disagreement points. 

Comment by ozziegooen on Mikhail Samin's Shortform · 2025-02-04T19:42:52.853Z · LW · GW

Then we must consider probabilities, expected values, etc. Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don't actually have good advice to give anyone.

Are there good models that support that Anthropic is a good bet? I'm genuinely curious. 

I assume that naively, if any side had more of the burden of proof, it would be Anthropic. They have many more resources, and are the ones doing the highly-impactful (and potentially negative) work. 

My impression was that there was very little probabilistic risk modeling here, but I'd love to be wrong.

Comment by ozziegooen on Gradual Disempowerment, Shell Games and Flinches · 2025-02-04T14:11:26.931Z · LW · GW

The introduction of the GD paper takes no more than 10 minutes to read 

Even 10 minutes is a lot, for many people. I might see 100 semi-interesting Tweets and Hacker News posts that link to lengthy articles per day, and that's already filtered - I definitely can't spend 10 min each on many of them.
 

and no significant cognitive effort to grasp, really.

"No significant cognitive effort" to read a nuanced semi-academic article with unique terminology? I tried spending around ~20-30min understanding this paper, and didn't find it trivial. I think it's very easy to make mistakes about what papers like this are really trying to say (In many ways, the above post lists out a bunch of common mistakes, for instance). I know the authors and a lot of related work, and even with that, I didn't find it trivial. I imagine things are severely harder for people much less close to this area. 

Comment by ozziegooen on $300 Fermi Model Competition · 2025-02-03T21:22:18.897Z · LW · GW

By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors and brings in more stats. 
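
A rough sketch of what such a two-pass setup might look like (the `llm` call is a hypothetical placeholder; none of this reflects the competition's actual pipeline):

```python
# Hypothetical LLM call (e.g. an o1/R1-class model or a Squiggle AI pass).
def llm(prompt: str) -> str:
    raise NotImplementedError


def formalize_rubric(informal_criteria: str) -> str:
    # Pass 1: turn informal evaluation notes into an explicit rubric with weights and context.
    return llm(
        "Rewrite these evaluation criteria as a numbered rubric with explicit weights "
        f"and relevant base-rate statistics to consider:\n{informal_criteria}"
    )


def score_submission(rubric: str, submission: str) -> str:
    # Pass 2: apply the formalized rubric to a specific Fermi model submission.
    return llm(f"Using this rubric:\n{rubric}\n\nScore this submission:\n{submission}")
```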

Comment by ozziegooen on $300 Fermi Model Competition · 2025-02-03T21:18:17.626Z · LW · GW

Related Manifold question here:
 

Comment by ozziegooen on $300 Fermi Model Competition · 2025-02-03T21:17:23.575Z · LW · GW

That counts! Thanks for posting. I look forward to seeing what it will get scored as. 

Comment by ozziegooen on OpenAI releases deep research agent · 2025-02-03T19:13:17.562Z · LW · GW

I assume that what's going on here is something like,
"This was low-hanging fruit, it was just a matter of time until someone did the corresponding test."

This would imply that OpenAI's work here isn't impressive, and also, that previous LLMs might have essentially been underestimated. There's basically a cheap latent capabilities gap.

I imagine a lot of software engineers / entrepreneurs aren't too surprised now. Many companies are basically trying to find wins where LLMs + simple tools give a large gain. 

So some people could look at this and say, "sure, this test is to be expected", and others would be impressed by what LLMs + simple tools are capable of. 

Comment by ozziegooen on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-02-03T19:01:07.057Z · LW · GW

I feel like there are some critical metrics or factors here that are getting overlooked in the details. 

I agree with your assessment that it's very likely that many people will lose power. I think it's fairly likely that most humans won't be able to provide much economic value at some point, and won't be able to ask for many resources in response. So I could see an argument for incredibly high levels of inequality.

However, there is a key question in that case, of "could the people who own the most resources guide AIs using those resources to do what they want, or will these people lose power as well?"

I don't see a strong reason why these people would lose power or control. That would seem like a fundamental AI alignment issue - in a world where a small group of people own all the world's resources, and there's strong AI, can those people control their AIs in ways that would provide this group a positive outcome?
 

2. There are effectively two ways these systems maintain their alignment: through explicit human actions (like voting and consumer choice), and implicitly through their reliance on human labor and cognition. The significance of the implicit alignment can be hard to recognize because we have never seen its absence.

3. If these systems become less reliant on human labor and cognition, that would also decrease the extent to which humans could explicitly or implicitly align them. As a result, these systems—and the outcomes they produce—might drift further from providing what humans want. 

There seems to be a key assumption here that people are able to maintain control because of the fact that their labor and cognition is important. 

I think this makes sense for people who need to work for money, but not for those who are rich.

Our world has a long history of dumb rich people who provide neither labor nor cognition, and still seem to do pretty fine. I'd argue that power often matters more than human output, and would expect the importance of power to increase over time. 

I think that many rich people now are able to maintain a lot of control, with very little labor/cognition. They have been able to decently align other humans to do things for them.