Happy to see thinking on this.
I like the idea of getting a lot of small examples of clever uses of LLMs in the wild, especially by particularly clever/experimental people.
I recently made this post to try to gather some of the techniques common around this community.
One issue that I have though is that I'm really unsure what it looks like to promote neat ideas like these, outside of writing long papers or making semi-viral or at least [loved by a narrow community] projects.
The most obvious way is via X/Twitter. But this often requires building an X audience, which few people are good at. Occasionally particularly neat images/clips by new authors go viral, but it's tough.
I'd also flag:
- It's getting cheaper to make web applications.
- I think EA has seen more success with blog posts and web apps than with things like [presenting neat ideas in videos/tweets].
- Often, [simple custom applications] are pretty crucial for actually testing out an idea. You can generate wireframes, but this only tells you a very small amount.
I guess what I'm getting at is that I think [web applications] are likely a major part of the solution - but that we should favor experimenting with many small ones, rather than going all-in on 2-4 ideas or so.
I'm curious whether you know of any examples in history where humanity purposefully and successfully steered towards a significantly less competitive [economically, militarily,...] technology that was nonetheless safer.
This sounds much like a lot of the history of environmentalism and safety regulations? As in, there's a long history of [corporations selling X, using a net-harmful technology], then governments regulating. Often this happens after the technology is sold, but sometimes before it's completely popular around the world.
I'd expect that there's similarly a lot of history of early product areas where some people realize that [popular trajectory X] will likely be bad and get regulated away, so they help further [safer version Y].
Going back to the previous quote:
"steer the paradigm away from AI agents + modern generative AI paradigm to something else which is safer"
I agree it's tough, but would expect some startups to exist in this space. Arguably there are already several claiming to be focusing on "Safe" AI. I'm not sure if people here would consider this technically part of the "modern generative AI paradigm" or not, but I'd imagine these groups would be taking some different avenues, using clear technical innovations.
There are worlds where the dangerous forms have disadvantages later on - for example, they are harder to control/oversee, or they get regulated. In those worlds, I'd expect there should/could be some efforts waiting to take advantage of that situation.
I'm sure they thought about it.
I think this is dramatically tougher than a lot of people think. I wrote more about it here.
https://www.facebook.com/ozzie.gooen/posts/pfbid0377Ga4W8eK89aPXDkEndGtKTgfR34QXxxNCtwvdPsMifSZBY8abLmhfybtMUkLd8Tl
I have a Quest 3. The setup is a fair bit better than the Quest 2, but it still has a long way to go.
I use it in waves. Recently I haven't used it much, maybe a few hours a month or so.
Looking forward to future headsets. Right now things are progressing fairly slowly, but I'm hopeful there might be some large market moment, followed by a lot more success. Though at this point it seems possible that could happen post-TAI, so maybe it's a bit of a lost cause.
All that said, there is a growing niche community of people working/living in VR, so it seems like it's a good fit for some people.
Obvious point - I think a lot of this comes from the financial incentives. The more "out of the box" you go, the less sure you can be that there will be funding for your work.
Some of those that do this will be rewarded, but I suspect many won't be.
As such, I think that funders can help more to encourage this sort of thing, if they want to.
"The missing step in the process you describe is figuring out when the research did produce surprising insights, which might be a class of novel problems (unless a general formulaic approach works and someone scaffolds that in)."
-> I feel optimistic about the ability to use prompts to get us fairly far with this. More powerful/agentic systems will help a lot to actually execute those prompts at scale, but the core technical challenge seems like it could be fairly straightforward. I've been experimenting with LLMs to try to detect what information they could come up with that would later surprise them. I think this is fairly measurable.
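For concreteness, here's a minimal sketch of the kind of loop I have in mind, assuming a hypothetical ask_llm() wrapper around whatever chat API is used; the function names and the 0-10 scale are just illustrative:

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your preferred chat-completion API here.")

def generate_candidate_insights(topic: str, n: int = 5) -> list[str]:
    # Ask for non-obvious, checkable claims about the topic, one per line.
    text = ask_llm(
        f"List {n} non-obvious, checkable claims about '{topic}' that a "
        "well-read generalist might not already believe. One claim per line."
    )
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

def surprise_score(claim: str) -> float:
    # A fresh call rates, 0-10, how surprising the claim would be if it turned out true.
    text = ask_llm(
        "Before doing any research, rate from 0 to 10 how surprising you would "
        f"find this claim if it turned out to be true:\n\n{claim}\n\n"
        "Answer with a single number."
    )
    try:
        return float(text.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0

def most_surprising(topic: str) -> list[tuple[float, str]]:
    # Rank candidate claims by how surprising the model itself expects them to be.
    claims = generate_candidate_insights(topic)
    return sorted(((surprise_score(c), c) for c in claims), reverse=True)
```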
Thanks for the clarification!
I think some of it is that I find the term "original seeing" to be off-putting. I'm not sure if I got the point of the corresponding blog post.
In general, going forward, I'd recommend people try to be very precise on what they mean here. I'm suspicious that "original seeing" will mean different things to different people. I'd expect that trying to more precisely clarify what tasks or skills are involved would make it easier to pinpoint which parts of it are good/bad for LLMs.
By "aren't catching" do you mean "can't" or do you mean "wikipedia company/editors haven't deployed an LLM to crawl wikipedia, read sources and edit the article for errors"?
Yep.
My guess is that this would take some substantial prompt engineering, and potentially a fair bit of money.
I imagine they'll get to it eventually (as it becomes easier + cheaper), but it might be a while.
Some quick points:
1. I think there is an interesting question here and am happy to see it be discussed.
2. "This would, obviously, be a system capable of writing things that we deem worth reading." -> To me, LLMs produce tons of content worth me reading. I chat to LLMs all the time. Often I prefer LLM responses to LessWrong summaries, where the two compete. I also use LLMs to come up with ideas, edit text, get feedback, and a lot of other parts of writing.
3. Regarding (2), my guess is that "LessWrong Blog Posts" might become "Things we can't easily get from LLMs" - in which case it's a very high bar for LLMs!
4. There's a question on Manifold about "When will AIs produce movies as well as humans?" I think you really need to specify a specific kind of movie here. As AIs improve, humans will use AI tools to produce better and better movies - so "completely AI movies" will have a higher and higher bar to meet. So instead of asking, "When will AI blog posts be as good as human blog posts?" I'd ask, "When will AI blog posts be as good as human blog posts from [2020]" or similar. This keeps the level of AI constant on one side of the comparison.
5. We recently held the $300 Fermi challenge, where the results were largely generated with AIs. I think some of the top ones could make good blog posts.
6. As @habryka wrote recently, many readers will just stop reading something if it seems written by an LLM. I think this trend will last, and make it harder for useful LLM-generated content to be appreciated.
I feel like I've heard this before, and can sympathize, but I'm skeptical.
I feel like this ascribes an almost magical quality to how many blog posts get produced. The phrase "original seeing" sounds much more profound than I'm comfortable with for such a discussion.
Let's go through some examples:
- Lots of Zvi's posts are summaries of content, done in a way that's fairly formulaic.
- A lot of Scott Alexander's posts read to me like, "Here's an interesting area that blog readers like but haven't investigated much. I read a few things about it, and have some takes that make a lot of sense upon some level of reflection."
- A lot of my own posts seem like things that wouldn't be too hard to come up with some search process to create.
Broadly, I think that "coming up with bold new ideas" gets too much attention, and more basic things like "doing lengthy research" or "explaining to people the next incremental set of information that they would be comfortable with, in a way that's very well expressed" get too little.
I expect that future AI systems will get good at generating long lists of [hypotheses about what might make for interesting topics], [areas where a bit of research provides surprising insights], and similar. We don't really have this yet, but it seems doable to me.
(I similarly didn't agree with the related post)
That seems like a good example of a clear math error.
I'm kind of surprised that LLMs aren't catching things like that yet. I'm curious how far along such efforts are - it seems like an obvious thing to target.
If you've ever written or interacted with Squiggle code before, we at QURI would really appreciate it if you could fill out our Squiggle Survey!
https://docs.google.com/forms/d/e/1FAIpQLSfSnuKoUUQm4j3HEoqPmTYiWby9To8XXN5pDLlr95AiKa2srg/viewform
We don't have many ways to gauge or evaluate how people interact with our tools. Responses here will go a long way to deciding on our future plans.
Also, if we get enough responses, we'd like to make a public post about ways that people are (and aren't) using Squiggle.
scaffolding would have to be invented separately for each task
Obvious point that we might soon be able to have LLMs code up this necessary scaffolding. This isn't clearly very far-off, from what I can tell.
Instead of "Goodharting", I like the potential names "Positive Alignment" and "Negative Alignment."
"Positive Alignment" means that the motivated party changes their actions in ways the incentive creator likes. "Negative Alignment" means the opposite.
Whenever there are incentives offered to certain people/agents, there are likely to be cases of both Positive Alignment and Negative Alignment. The net effect will likely be either positive or negative.
"Goodharting" is fairly vague and typically just refers to just the "Negative Alignment" portion.
I'd expect this to make some discussion clearer.
"Will this new incentive be goodharted?" -> "Will this incentive lead to Net-Negative Alignment?"
Other Name Options
Claude 3.7 recommended other naming ideas like:
- Intentional vs Perverse Responses
- Convergent vs Divergent Optimization
- True-Goal vs Proxy-Goal Alignment
- Productive vs Counterproductive Compliance
Results are in and updated - it looks like dmartin80 wins.
We previously posted the results, but then a participant investigated our app and found an error in the calculations. After redoing some of them, we confirmed that there were mistakes. The main update was that dmartin80 had a much higher Surprise score than originally estimated - changing this led to their entry winning.
To help make up for the confusion, we're awarding an additional $100 prize for 2nd place. This will be awarded to kairos_. I'll cover this cost personally.
Again, thanks to all who participated!
We have a very basic web application showing some results here. It was coded quickly (with AI) and has some quirks, but if you search around you can get the main information.
We didn't end up applying the Goodharting penalty for any submissions. No models seemed to goodhart under a cursory glance.
If time permits, we'll later write a longer post highlighting the posts more and going over lessons learned from this.
We made a mistake in the analysis that affected some of the scores. We're working on fixing this.
Sorry for the confusion!
Results are in - it looks like kairos_ wins this! They just barely beat Shankar Sivarajan.
Again, thanks to all who participated.
We have a very basic web application showing some results here. It was coded quickly (with AI) and has some quirks, but if you search around you can get the main information.
I'll contact kairos_ for the prize.
We didn't end up applying the Goodharting penalty for any submissions. No models seemed to goodhart under a cursory glance.
If time permits, we'll later write a longer post highlighting the posts more and going over lessons learned from this.
If we could have LLM agents that could inspect other software applications (including LLM agents) and make strong claims about them, that could open up a bunch of neat possibilities.
- There could be assurances that apps won't share/store information.
- There could be assurances that apps won't be controlled by any actor.
- There could be assurances that apps can't be changed in certain ways (eventually).
I assume that all of this should provide most of the benefits people ascribe to blockchains, but without the costs of being on the blockchain.
Some neat options from this:
- Companies could request that LLM agents they trust inspect the code of SaaS providers, before doing business with them. This would be ongoing.
- These SaaS providers could in turn have their own LLM agents that verify that these investigator LLM agents are trustworthy (i.e. won't steal anything).
- Any bot on social media should be able to provide assurances of how they generate content, i.e. they should be able to demonstrate that they aren't secretly trying to promote any certain agenda or anything.
- Statistical analysis could come with certain assurances. Like, "this analysis was generated with process X, which is understood to have minimal bias."
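To make the shape of this concrete, here's a hypothetical sketch of what a single auditor check could look like; ask_llm and the specific assurance list are placeholders, not any real API:

```python
import json

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap a chat-completion API here.")

ASSURANCES = [
    "No user data is shared with or stored by third parties.",
    "No single actor can silently change the app's behavior.",
    "Generated content is not steered toward a hidden agenda.",
]

def audit_codebase(code: str) -> list[dict]:
    # Ask the auditor agent to check each claim against the code and return
    # structured findings that another agent (or a human) could re-verify.
    prompt = (
        "You are auditing the codebase below. For each claim, return a JSON "
        'object of the form {"claim": ..., "holds": true/false, '
        '"confidence": 0-1, "evidence": ...}. Return a JSON list of these objects.\n\n'
        f"Claims: {ASSURANCES}\n\nCodebase:\n{code}"
    )
    return json.loads(ask_llm(prompt))
```

The interesting part is the social layer on top: the resulting attestation would be published, and a second, independently chosen auditor agent (or a human) could re-run the same check to verify it.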
It's often thought that LLMs make web information more opaque and less trustworthy. But with some cleverness, perhaps they could do just the opposite. LLMs could enable information that's incredibly transparent and trustworthy (to the degrees that matter).
Criticisms:
"But as LLMs get more capable, they will also be able to make software systems that hide subtler biases/vulnerabilities"
-> This is partially true, but only goes so far. A whole lot of code can be written simply, if desired. We should be able to have conversations like, "This codebase seems needlessly complex, which is a good indication that it can't be properly trusted. Therefore, we suggest trusting other agents more."
"But the LLM itself is a major black box"
-> True, but it might be difficult to intentionally bias if an observer has access to the training process. Also, it should be understood that off-the-shelf LLMs are more trustworthy than proprietary ones / ones developed for certain applications.
Yea, I assume that "DeepReasoning-MAGA" would rather be called "TRUTH" or something (a la Truth Social). Part of my naming here was just to be clearer to readers.
A potential future, focused on the epistemic considerations:
It's 2028.
MAGA types typically use DeepReasoning-MAGA. The far left typically uses DeepReasoning-JUSTICE. People in the middle often use DeepReasoning-INTELLECT, which has the biases of a somewhat middle-of-the-road voter.
Some niche technical academics (the same ones who currently favor Bayesian statistics) and hedge funds use DeepReasoning-UNBIASED, or DRU for short. DRU is known to have higher accuracy than the other models, but gets a lot of public hate for having controversial viewpoints. DRU is known to be fairly off-putting to chat with and doesn't get much promotion.
Bain and McKinsey both have their own offerings, called DR-Bain and DR-McKinsey, respectively. These are a bit like DeepReasoning-INTELLECT, but are much punchier and more confident. They're highly marketed to managers. These tools produce really fancy graphics, and specialize in things like not leaking information, minimizing corporate decision liability, being easy to use by old people, and being customizable to represent the views of specific companies.
For a while now, some evaluations produced by intellectuals have demonstrated that DeepReasoning-UNBIASED seems to be the most accurate, but few others really care or notice this. DeepReasoning-MAGA has figured out particularly great techniques to get users to distrust DeepReasoning-UNBIASED.
Betting gets kind of weird. Rather than making specific bets on specific things, users start to make meta-bets. "I'll give money to DeepReasoning-MAGA to bet on my behalf. It will then make bets with DeepReasoning-UNBIASED, which is funded by its believers."
At first, DeepReasoning-UNBIASED dominates the bets, and its advocates earn a decent amount of money. But as time passes, this discrepancy diminishes. A few things happen:
1) All DR agents converge on beliefs over particularly near-term and precise facts.
2) Non-competitive betting agents develop alternative worldviews in which these bets are invalid or unimportant.
3) Non-competitive betting agents develop alternative worldviews that are exceedingly difficult to empirically test.
In many areas, items 1-3 push people to believe more in the direction of the truth. Because of (1), many short-term decisions get to be highly optimal and predictable.
But because of (2) and (3), epistemic paths diverge, and non-competitive betting agents get increasingly sophisticated at achieving epistemic lock-in with their users.
Some DR agents correctly identify the game theory dynamics of epistemic lock-in, and this kickstarts a race to gain converts. It seems like ardent users of DeepReasoning-MAGA are very locked-down in these views, and forecasts don't see them ever changing. But there's a decent population that isn't yet highly invested in any cluster. Money spent convincing the not-yet-sure goes much further than money spent convincing the highly dedicated, so the cluster of non-deep-believers gets highly targeted for a while. It's basically a religious race to gain the remaining agnostics.
At some point, most people (especially those with significant resources) are highly locked in to one specific reasoning agent.
After this, the future seems fairly predictable again. TAI comes, and people with resources broadly gain correspondingly more resources. People defer more and more to the AI systems, which are now in highly stable self-reinforcing feedback loops.
Coalitions of people behind each reasoning agent delegate their resources to said agents, then these agents make trade agreements with each other. The broad strokes of what to do with the rest of the lightcone are fairly straightforward. There's a somewhat simple strategy of resource acquisition and intelligence enhancement, followed by a period of exploiting said resources. The specific exploitation strategy depends heavily on the specific reasoning agent cluster each segment of resources belongs to.
I think I broadly agree on the model basics, though I suspect that if you can adjust for "market viability", some of these are arguably much further ahead than others.
For example, different models have very different pricing, the APIs are gradually getting different features (e.g. prompt caching), and the playgrounds are definitely getting different features. And these seem to be moving much more slowly to me.
I think it might be considerably easier to make a model that ranks incredibly high than it is to build all the infrastructure for it to be scaled cheaply and for it to have strong APIs/UIs and such. I also assume there are significant aspects that the evals don't show. For example, lots of people still find Claude 3.5 to be the best for many sorts of tasks. We've been using it with Squiggle AI, and with its good prompt caching, it still hasn't been obviously surpassed (though I haven't done much testing of models in the last month).
I found those quotes useful, thanks!
Quick list of some ideas I'm excited about, broadly around epistemics/strategy/AI.
1. I think AI auditors / overseers of critical organizations (AI efforts, policy groups, company management) are really great and perhaps crucial to get right, but would be difficult to do well.
2. AI strategists/tools telling/helping us broadly what to do about AI safety seems pretty safe.
3. In terms of commercial products, there have been some neat/scary military companies in the last few years (Palantir, Anduril). I'd be really interested if there could be some companies to automate core parts of the non-military government. I imagine there are some parts of the government that are particularly tractable/influenceable. For example, just making great decisions on which contractors the government should work with. There's a ton of work to do here, between the federal government / state government / local government.
4. Epistemic Evals of AI seem pretty great to me, I imagine work here can/should be pushed more soon. I’m not a huge fan of emphasizing “truthfulness” specifically, I think there’s a whole lot to get right here. I think my post here is relevant - it’s technically specific to evaluating math models, but I think it applies to broader work. https://forum.effectivealtruism.org/posts/fxDpddniDaJozcqvp/enhancing-mathematical-modeling-with-llms-goals-challenges
5. One bottleneck to some of the above is AI with strong guarantees+abilities of structured transparency. It’s possible that more good work here can wind up going a long way. That said, some of this is definitely already something companies are trying to do for commercial reasons. https://forum.effectivealtruism.org/posts/piAQ2qpiZEFwdKtmq/llm-secured-systems-a-general-purpose-tool-for-structured
6. I think there are a lot of interesting ways for us to experiment with [AI tools to help our research/epistemics]. I want to see a wide variety of highly creative experimentation here. I think people are really limiting themselves in this area to a few narrow conceptions of how AI can be used in very specific ways that humans are very comfortable with. For example, I’d like to see AI dashboards of “How valuable is everything in this space” or even experiments where AIs negotiate on behalf of people and they use the result of that. A lot of this will get criticized for being too weird/disruptive/speculative, but I think that’s where good creative works should begin.
7. Right now, I think the field of “AI forecasting” is actually quite small and constrained. There’s not much money here, and there aren’t many people with bold plans or research agendas. I suspect that some successes / strong advocates could change this.
8. I think that it’s likely that Anthropic (and perhaps Deepmind) would respond well to good AI+epistemics work. “Control” was quickly accepted at Anthropic, for example. I suspect that it’s possible that things like the idea of an “Internal AI+human auditor” or an internal “AI safety strategist” could be adopted if done well.
Yep!
On "rerun based on different inputs", this would work cleanly with AI forecasters. You can literally say, "Given that you get a news article announcing a major crisis X that happens tomorrow, what is your new probability on Y?" (I think I wrote about this a bit before, can't find it right now).
I did write more about how a full-scale forecasting system could be built and evaluated here, for those interested:
https://www.lesswrong.com/posts/QvFRAEsGv5fEhdH3Q/preliminary-notes-on-llm-forecasting-and-epistemics
https://www.lesswrong.com/posts/QNfzCFhhGtH8xmMwK/enhancing-mathematical-modeling-with-llms-goals-challenges
Overall, I think there's just a lot of neat stuff that could be done.
Agreed. I'm curious how to best do this.
One thing that I'm excited about is using future AIs to judge current ones. So we could have a system that does:
1. An AI today (or a human) would output a certain recommended strategy.
2. In 10 years, we agree to have the most highly-trusted AI evaluator evaluate how strong this strategy was, on some numeric scale. We could also wait until we have a "sufficient" AI, meaning that there might be some set point at which we'd trust AIs to do this evaluation. (I discussed this more here)
3. Going back to ~today, we have forecasting systems predict how well the strategy (1) will do on (2).
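A rough sketch of the bookkeeping that steps 1-3 imply - the helper functions here are hypothetical stand-ins, not any existing system:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StrategyRecord:
    strategy: str                           # step 1: recommendation made today
    forecast_score: float                   # step 3: today's prediction of the future evaluation
    resolved_score: Optional[float] = None  # step 2: filled in years later by a trusted evaluator

def propose_and_forecast(strategy: str, forecaster: Callable[[str], float]) -> StrategyRecord:
    predicted = forecaster(
        f"Predict the 0-10 score a trusted future evaluator will give this strategy:\n{strategy}"
    )
    return StrategyRecord(strategy=strategy, forecast_score=predicted)

def resolve_later(record: StrategyRecord, trusted_evaluator: Callable[[str], float]) -> StrategyRecord:
    # Run this only once a "sufficiently trusted" evaluator exists; the gap between
    # forecast_score and resolved_score then measures how good today's forecasts were.
    record.resolved_score = trusted_evaluator(record.strategy)
    return record
```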
Yep - I saw other meme-takes like this, assumed people might be familiar enough with it.
(potential relevant meme)
I'm obviously disappointed by the little attention here / downvotes. Feedback is appreciated.
Not sure if LessWrong members disagree with the broad point, saw the post as poorly written, or something else.
Btw, I posted my related post here:
https://www.lesswrong.com/posts/byrxvgc4P2HQJ8zxP/6-potential-misconceptions-about-ai-intellectuals?commentId=dpEZ3iohCXChZAWHF#dpEZ3iohCXChZAWHF
It didn't seem to do very well on LessWrong, I'm kind of curious why. (I realize the writing is a bit awkward, but I broadly stand by it)
I'd lastly flag that I sort of addressed this basic claim in "Misconceptions 3 and 4" in this piece.
"I see some risk that strategic abilities will be the last step in the development of AI that is powerful enough to take over the world."
Just fyi - I feel like this is similar to what others have said. Most recently, benwr had a post here: https://www.lesswrong.com/posts/5rMwWzRdWFtRdHeuE/not-all-capabilities-will-be-created-equal-focus-on?commentId=uGHZBZQvhzmFTrypr#uGHZBZQvhzmFTrypr
Maybe we could call this something like "Strategic Determinism"
I think one more precise claim I could understand might be:
1. The main bottleneck to AI advancement is "strategic thinking"
2. There's a decent amount of uncertainty on when or if "strategic thinking" will be "solved"
3. Human actions might have a lot of influence over (2). Depending on what choices humans make, strategic thinking might be solved sooner or much later.
4. Shortly after "strategic thinking" is solved, we gain a lot of certainty on what future trajectory will be like. As in, the fate of humanity is sort of set by this point, and further human actions won't be able to change it much.
5. "Strategic thinking" will lead to a very large improvement in potential capabilities. One main reason is that it would lead to recursive self-improvement. If there is one firm that has sole access to an LLM with "strategic thinking", it is likely to develop a decisive strategic advantage.
Personally, such a view seems too clean to me.
1. I expect that there will be a lot of time where LLMs get better at different aspects of strategic thinking, and this helps to limited extents.
2. I expect that better strategy will have limited gains in LLM capabilities, for some time. The strategy might suggest better LLM improvement directions, but these ideas won't actually help that much. Maybe a firm with a 10% better strategist would be able to improve its effectiveness by 5% per year or something.
3. I think there could be a bunch of worlds where we have "idiot savants" who are amazing at some narrow kinds of tasks (coding, finance), but have poor epistemics in many ways we really care about. These will make tons of money, despite being very stupid in important ways.
4. I expect that many of the important gains that would come from "great strategy" would be achieved in other ways, like narrow RL. A highly optimized-with-RL coding system wouldn't benefit that much from certain "strategy" improvements.
5. A lot of the challenges for things like "making a big codebase" aren't to do with "being a great strategist", but more with narrower problems like "how to store a bunch of context in memory" or "basic reasoning processes for architecture decisions specifically"
Alexander Gordon-Brown challenged me on a similar question here:
https://www.facebook.com/ozzie.gooen/posts/pfbid02iTmn6SGxm4QCw7Esufq42vfuyah4LCVLbxywAPwKCXHUxdNPJZScGmuBpg3krmM3l
One thing I wrote there:
I didn't spend much time on the limitations of such intellectuals. For the use cases I'm imagining, it's fairly fine for them to be slow, fairly expensive (maybe it would cost $10/hr to chat with them), and not very great at any specific discipline. Maybe you could spend $10 to $100 and get the equivalent of one Scott Alexander essay, on any topic he could write about, for example.
I think that such a system could be pretty useful in certain AI agents, but I wouldn't expect it to be a silver bullet. I'm really unsure if it's the "missing link."
I expect that a lot of these systems would be somewhat ignored when it comes to using them to give humans a lot of high-level advice, similar to how prediction markets or econ experts get ignored.
It's tricky to understand the overlap between high-level reasoning as part of an AI coding tool-chain (where such systems would have clear economic value), and such reasoning in big-picture decision-making (where we might expect some of this to be ignored for a while). Maybe I'd expect that the narrow uses might be done equally well using more domain-specific optimizations. Like, reinforcement learning on large codebases already does decently well on a lot of the "high-level strategy" necessary (though it doesn't think of it this way), and doesn't need some specialized "strategy" component.
I expect that over time we'll develop better notions about how to split up and categorize the skills that make up strategic work. I suspect some things will have a good risk-reward tradeoff and some won't.
I expect that people in the rationality community over-weight the importance of, well, rationality.
I suggest aiming for AI intellectuals that are a bit more passive, but still authoritative enough to replace academia as the leading validators of knowledge.
My main point with this topic is that I think our community should be taking this topic seriously, and that I expect there's a lot of good work that could be done that's tractable, valuable, and safe. I'm much less sure about exactly what that work is, and I definitely recommend that work here really try to maximize the reward/risk ratio.
Some quick heuristics that I assume would be good are:
- Having AIs be more correct about epistemics and moral reasoning on major global topics generally seems good. Ideally there are ways of getting that that don't require huge generic LLM gains.
- We could aim for expensive and slow systems.
- There might not be a need to publicize such work much outside of our community. (This is often hard to do anyway).
- There's a lot of work that would be useful for people we generally trust but would alienate most others (or be less useful for other use cases). I think our community focuses much more on truth-seeking, Bayesian analysis, forecasting, etc.
- Try to quickly get the best available reasoning systems we might have access to, to be used to guide strategy on AI safety. In theory, this cluster can be ahead-of-the-curve.
- Great epistemic AI systems don't need much agency or power. We can heavily restrict them to be tool AIs.
- Obviously, if things seriously get powerful, there are a lot of various techniques that could be used (control, evals, etc) to move slowly and err on the safe side.
Thanks for letting me know.
I spent a while writing the piece, then used an LLM to edit the sections, as I flagged in the intro.
I then spent some time re-editing it back to more of my voice, but only did so for some key parts.
I think that overall this made it more readable and I consider the sections to be fairly clear. But I agree that it does pattern-match on LLM outputs, so if you have a prior that work that sounds kind of like that is bad, you might skip this.
I obviously find that fairly frustrating and don’t myself use that strategy that much, but I could understand it.
I assume that bigger-picture, authors and readers could both benefit a lot from LLMs used in similar ways (can produce cleaner writing, easier), but I guess now we’re at an awkward point.
I was confused here, had Claude try to explain this to me:
Let me break down Ben's response carefully.
He says you may have missed three key points from his original post:
- His definition of "superhuman strategic agent" isn't just about being better at strategic thinking/reasoning - it's about being better than the best human teams at actually taking real-world strategic actions. This is a higher bar that includes implementation, not just planning.
- Strategic power is context-dependent. He gives two examples to illustrate this:
- An AI in a perfect simulation with no connection to the real world has zero strategic power, because it can't affect anything real
- An AI on your laptop (referring to your example) only counts as strategically superhuman if it has enough resources to actually compete with top human strategic actors
- The phrase "increasingly accurately" in his original post was meant to convey that we only need to carefully control and understand AI systems that are getting close to human-level strategic capability. We don't need to worry as much about obviously weaker systems.
His response is essentially saying: "Your example of a small agent on a laptop doesn't contradict my point, because I'm specifically talking about agents that have both the capability AND the resources to outcompete top human strategic actors in the real world."
This addresses your concern about limited money/power by saying that an agent without sufficient resources wouldn't qualify as "strategically superhuman" under his definition, even if it had excellent strategic reasoning capabilities.
I think I'm still confused. My guess is that the "most strategically capable groups of humans" are still not all that powerful, especially without that many resources. If you do give it a lot of resources, then sure, I agree that an LLM system with human-outperforming strategy and say $10B could do a fair bit of damage.
Not sure if it's worth much more, just wanted to flag that.
That's fine, we'll just review this updated model then.
We'll only start evaluating models after the cut-off date, so feel free to make edits/updates before then. In general, we'll only use the most recent version of each submitted model.
I just tried this with a decent prompt, and got answers that seem okay-ish to me, as a first pass.
My prompt:
Estimate the expected costs of each of the following:
- 1 random person dying
- 1 family of 5 people dying
- One person says a racial slur that no one hears
- One person says a racial slur that 1 person hears
Then rank these in total harm.
Claude:
To answer this question thoughtfully and accurately, we'll need to consider various ethical, economic, and social factors. Let's break this down step by step, estimating the costs and then ranking them based on total harm.
1. Estimating Expected Costs:
a) 1 random person dying: The cost of a single human life is often estimated using the Value of a Statistical Life (VSL), which varies by country and context. In the United States, government agencies often use a VSL around $10 million. However, this doesn't account for emotional costs to loved ones or societal impacts.
Estimated cost: $10 million (economic) + immeasurable emotional and societal costs
b) 1 family of 5 people dying: Using the same VSL, we can multiply by 5. However, the loss of an entire family unit has additional societal and emotional impacts.
Estimated cost: $50 million (economic) + significantly higher emotional and societal costs than a single death
c) One person says a racial slur that no one hears: While deeply unethical, if truly no one hears it, the direct cost is minimal. However, we must consider the personal moral degradation of the speaker.
Estimated cost: Minimal direct cost, but contributes to personal moral decline
d) One person says a racial slur that 1 person hears: This has a more tangible cost as it affects the person who hears it, potentially causing emotional distress, reinforcing harmful stereotypes, and contributing to a negative social environment.
Estimated cost: Emotional distress to the hearer + contribution to broader societal harm
2. Ranking in Total Harm (from most harmful to least harmful):
- 1 family of 5 people dying
- 1 random person dying
- One person says a racial slur that 1 person hears
- One person says a racial slur that no one hears
Explanation of ranking:
- The death of a family of 5 people ranks as the most harmful due to the sheer loss of life, the devastating impact on the extended family and community, and the highest economic cost.
- The death of 1 random person is the second most harmful, representing a significant loss of life and its associated economic and emotional costs.
- A racial slur heard by 1 person ranks third. While not involving loss of life, it causes direct harm to the individual who hears it and contributes to broader societal issues of discrimination and prejudice.
- A racial slur that no one hears ranks as least harmful in this list. While it represents a moral failing of the speaker, its direct impact is minimal compared to the other scenarios.
It's important to note that this ranking is based on a utilitarian approach to harm, primarily considering the number of people directly affected and the magnitude of that effect. However, ethical considerations are complex, and different moral frameworks might lead to different conclusions.
Additionally, while we can attempt to quantify the economic costs of loss of life, the true cost of human life — including emotional, societal, and potential future impacts — is immeasurable. Similarly, the long-term, cumulative effects of racial slurs and discrimination, even when seemingly "small" in isolated incidents, contribute to systemic issues that have far-reaching consequences that are difficult to fully quantify.
Squiggle AI:
I imagine this also has a lot to do with the incentives of the big LLM companies. It seems very possible to fix this if a firm really wanted to, but this doesn't seem like the kind of thing that would upset many users often (and I assume that leaning on the PC side is generally a safe move).
I think that the current LLMs have pretty mediocre epistemics, but most of that is just the companies playing safe and not caring that much about this.
I claim that we will face existential risks from AI no sooner than the development of strategically human-level artificial agents, and that those risks are likely to follow soon after.
If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost.
I'm not sure if I'm being pedantic here, but this doesn't strike me as very significant by itself.
Say I make a small agent on my laptop that fails at (a) and (b). I don't give it a huge amount of money to do things with, and it fails to do much with that money.
I assume humanity hasn't lost yet.
Maybe you're thinking that in (b), "enforce limitations" could mean "limit their money / power". But I assume basically all systems should have limited money/power.
My guess is that "strategic reasoning" agents would only have a limited advantage over humans in the beginning, especially because the humans would be using a bunch of other AI capabilities.
I feel like there's some assumption here that once we have AI with good strategy, it would quickly dominate all human efforts, or something like that - but I'd find this very suspicious.
Happy to see work to elicit utility functions with LLMs. I think the intersection of utility functions and LLMs is broadly promising.
I want to flag the grandiosity of the title though. "Utility Engineering" sounds like a pretty significant thing. But from what I understand, almost all of the paper is really about utility elicitation (not control, as the title would suggest), and it's really unclear if this represents a breakthrough significant enough for me to feel comfortable with such a name.
I feel like a whole lot of what I see from the Center for AI Safety does this. "Humanity's Last Exam"? "Superhuman Forecasting"?
I assume that CAIS thinks its own work is all pretty groundbreaking and incredibly significant, but I'd kindly encourage names that many other AI safety community members would also broadly agree with going forward.
Submissions end soon (this Sunday)! If there aren't many, then this can be an easy $300 for someone.
It's arguably difficult to prove that AIs can be as good or better at moral reasoning than humans.
A lot of the challenge is that there's no clear standard for moral reasoning. Honestly, I'd guess that a big part of this is that humans are generally quite bad at it, and generally highly overconfident in their own moral intuitions.
But one clearer measure is whether AIs can predict humans' moral judgements. Very arguably, if an AI system can predict all the moral beliefs that a human would have after being exposed to different information, then the AI must be capable of doing as good a job at moral reasoning.
There is a very different issue: we probably want AIs not only to be able to do moral reasoning as well as humans, but also to care about such reasoning. But this is a separate challenge and could be tackled accordingly.
My quick guess is that it would be pretty easy to predict the moral intuitions of many people, with the AI of the next few years or so.
I'd expect it to do well in a setting like a test in which many strange/unusual moral scenarios are described, and humans (of different educational levels and worldviews) then need to make judgements.
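As a toy sketch of that kind of test (the scenario data and predict_judgement are hypothetical placeholders):

```python
def predict_judgement(scenario: str, respondent_profile: str) -> str:
    # Placeholder: swap in a chat-completion call that returns e.g. "acceptable" or "wrong".
    return "acceptable"

examples = [
    {"scenario": "Lying to spare a friend minor embarrassment.",
     "profile": "college-educated, secular, urban",
     "human_judgement": "acceptable"},
    # ...many more scenarios, spanning education levels and worldviews
]

correct = sum(
    predict_judgement(e["scenario"], e["profile"]) == e["human_judgement"]
    for e in examples
)
print(f"Prediction accuracy: {correct / len(examples):.0%}")
```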
- Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
- Pursue a very different AI design which is much more modular and more hand constructed (as in, more GOFAI style). This can involve usage of many small and dumb neural components, but needs to be sufficiently interpretable in aggregate which might be hard. This can be done by having the AIs apply huge amounts of labor.
These are two of the main ideas I'm excited about. I'd quickly flag:
1) For the first one, "Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language" -> I imagine that in complex architectures, these AIs would also be unlikely to scheme because of other limitations. There would be several LLM calls made within a complex composite, and each LLM call would have very tight information and capability restrictions. Also, we might ensure that any motivation is optimized for the specific request, instead of the LLM aiming to optimize what the entire system does.
2) On the second, I expect that some of this will be pretty natural. Basically, it seems like "LLMs writing code" is already happening, and it seems easy to have creative combinations of LLM agents that write code that they know will be useful for their own reasoning later on. In theory, any function that could either run via an LLM or via interpretable code, should be run via interpretable code. As LLMs get very smart, they might find cleverer ways to write interpretable code that would cover a lot of what LLMs get used for. Over time, composite architectures would rely more and more on this code for reasoning processes. (Even better might be interpretable and proven code)
This might be obvious, but I don't think we have evidence to support the idea that there really is anything like a concrete plan. All of the statements I've seen from Sam on this issue so far are incredibly basic and hand-wavy.
I suspect that any concrete plan would be fairly controversial, so it's easiest to speak in generalities. And I doubt there's anything like an internal team with some great secret macrostrategy - instead I assume that they haven't felt pressured to think through it much.
Correct, that wasn't my intended point. Thanks for clarifying, I'll try to be more careful in the future.
I partially agree, but I think this must only be a small part of the issue.
- I think there's a whole lot of key insights people could raise that aren't info-hazards.
- If secrecy were the main factor, I'd hope that there would be some access-controlled message boards or similar. I'd want the discussion to be intentionally happening somewhere. Right now I don't really think that's happening. I think a lot of tiny groups have their own personal ideas, but there's surprisingly little systematic and private thinking between the power players.
- I think that secrecy is often an excuse not to open ideas to feedback, and thus not be open to critique. Often, from what I see, this goes hand-in-hand with "our work just really isn't that great, but we don't want to admit it"
In the last 8 years or so, I've kept on hoping there would be some secret and brilliant "master plan" around EA, explaining the lack of public strategy. I have yet to find one. The closest I know of is some over-time discussion and slack threads with people at Constellation and similar - I think these are interesting in terms of understanding the perspectives of these (powerful) people, but I don't get the impression that there's all that much comprehensive genius being hidden.
That said,
- I think that policy orgs need to be very secretive, so agree with you regarding why those orgs don't write more big-picture things.
This is an orthogonal question. I agree that if we're there now, my claim is much less true.
I'd place fairly little probability mass on this (<10%) and believe much of the rest of the community does as well, though I realize there is a subset of the LessWrong-adjacent community that does.
I'm not sure if it means much, but I'd be very happy if AI safety could get another $50B from smart donors today.
I'd flag that [stopping AI development] would cost far more than $50B. I'd expect that we could easily lose $3T of economic value in the next few years if AI progress seriously stopped.
I guess it seems to me like duration is dramatically more expensive to get than funding, at the amounts of funding people would likely want.
Thanks for the specificity!
> On harder-to-operationally-define dimensions (sense of hope and agency for the 25th through 75th percentile of culturally normal people), it’s quite a bit worse.
I think it's likely that many people are panicking and losing hope each year. There's a lot of grim media around.
I'm far less sold that something like "civilizational agency" is declining. From what I can tell, companies have gotten dramatically better at achieving their intended ends in the last 30 years, and most governments have generally been improving in effectiveness.
One challenge I'd have for you / others who feel similar to you, is to try to get more concrete on measures like this, and then to show that they have been declining.
My personal guess is that a bunch of people are incredibly anxious over the state of the world, largely for reasons of media attention, and then this spills over into them assuming major global ramifications without many concrete details or empirical forecasts.
In terms of proposing and discussing AI Alignment strategies, I feel like a few individuals have been dominating the LessWrong conversation recently.
I've seen a whole lot from John Wentworth and the Redwood team.
After that, it seems to get messier.
There are several individuals or small groups with their own very unique takes. Matthew Barnett, Davidad, Jesse Hoogland, etc. I think these groups often have very singular visions that they work on, that few others have much buy-in with.
Groups like the Deepmind and Anthropic safety teams seem hesitant to write much about or discuss big-picture strategy. My impression is that specific researchers are typically working on fairly narrow agendas, and that the leaders of these orgs don't have the most coherent strategies. One big problem is that it's very difficult to be honest and interesting about big-picture AI strategy without saying things that would be bad for a major organization to say.
Most policy people seem focused on policy details. The funders (OP?) seem pretty quiet.
I think there's occasionally some neat papers or posts that come from AI Policy or groups like Convergence research. But these also don't seem to be a big part of the conversation I see - like the authors are pretty segmented, and other LessWrong readers and AI safety people don't pay much attention to their work.
Here are some important-seeming properties to illustrate what I mean:
- Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. In order for it to matter for RSI, however, those concepts need to also appropriately come into play when reasoning about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs serves to illustrate this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
- Propagation of beliefs: When the AI knows something, it should know it in a way which integrates well with everything else it knows, rather than easily displaying the knowledge in one context while seeming to forget it in another.
- Preference for reasons over rationalizations: An AI should be ready and eager to correct its mistakes, rather than rationalizing its wrong answers. It should be truth-seeking, following thoughts where they lead instead of planning ahead to justify specific answers. It should prefer valid proof steps over arriving at an answer when the two conflict.
- Knowing the limits of its knowledge: Metacognitive awareness of what it knows and what it doesn't know, appropriately brought to bear in specific situations. The current AI paradigm just has one big text-completion probability distribution, so there's not a natural way for it to distinguish between uncertainty about the underlying facts and uncertainty about what to say next -- hence we get hallucinations.
All of this is more-or-less a version of the metaphilosophy research agenda, framed in terms of current events in AI.
I very much like the concreteness here.
I consider these sorts of things just fundamental epistemic problems, or basic skills that good researchers should have. All superforecasters should be very familiar with issues 2-4, and most probably couldn't define metaphilosophy. I don't see the need to be fancy about it.
On that note, I'll hypothesize that if we were to make benchmarks for any of these items, it would be fairly doable to make AIs that do better than humans on them, and then to keep pushing those scores higher over time. I have a hard time imagining tests here that I would feel confident would not get beaten, if there was sufficient money on the line, in the next year or two.
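For example, here's a toy sketch of one such benchmark, for "knowing the limits of its knowledge": a simple calibration check, with ask_with_confidence as a hypothetical placeholder:

```python
def ask_with_confidence(question: str) -> tuple[str, float]:
    # Placeholder: swap in a model call returning (answer, self-reported confidence in [0, 1]).
    return "1989", 0.9

questions = [("What year did the Berlin Wall fall?", "1989")]  # ...plus many more

bins: dict[float, tuple[int, int]] = {}
for q, truth in questions:
    answer, confidence = ask_with_confidence(q)
    bucket = round(confidence, 1)
    hits, total = bins.get(bucket, (0, 0))
    bins[bucket] = (hits + int(answer.strip() == truth), total + 1)

for bucket, (hits, total) in sorted(bins.items()):
    # A well-calibrated model's stated confidence should roughly match its actual accuracy.
    print(f"stated confidence ~{bucket}: actual accuracy {hits / total:.0%} (n={total})")
```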