The North Wind, the Sun, and Abadar
One day, the North Wind and the Sun argued about which of them was the strongest. Abadar, the god of commerce and civilization, stopped to observe their dispute. “Why don’t we settle this fairly?” he suggested. “Let us see who can compel that traveler on the road below to remove his cloak.”
The North Wind agreed, and with a mighty gust, he began his effort. The man, feeling the bitter chill, clutched his cloak tightly around him and even pulled it over his head to protect himself from the relentless wind. After a time, the North Wind gave up, frustrated.
Then the Sun took his turn. Beaming warmly from the heavens, the Sun caused the air to grow pleasant and balmy. The man, feeling the growing heat, loosened his cloak and eventually took it off, resting in the shade of a tree. The Sun began to declare victory, but as soon as he turned away, the man put on the cloak again.
The god of commerce then approached the traveler and bought the cloak for five gold coins. The traveler tucked the money away and continued on his way, unbothered by either wind or heat. He soon bought a new cloak and invested the remainder in an index fund. The returns were steady, and in time he prospered far beyond the value of his simple cloak, while Abadar kept the cloak for good.
Commerce, when conducted wisely, can accomplish what neither force nor gentle persuasion alone can achieve, and with minimal deadweight loss.
The thought experiment is not about the idea that your VNM utility could theoretically be doubled, but instead about rejecting diminishing returns to actual matter and energy in the universe. SBF said he would flip with a 51% chance of doubling the universe's size (or creating a duplicate universe) and a 49% chance of destroying the current universe. Taking this bet requires a stronger commitment to utilitarianism than most people are comfortable with; your utility needs to be linear in matter and energy. You must be the kind of person who would take a 0.001% chance of colonizing the universe over a 100% chance of colonizing merely a thousand galaxies. SBF also said he would flip repeatedly, indicating that he didn't believe in any sort of bound to utility.
This is not necessarily crazy-- I think Nate Soares has a similar belief-- but it's philosophically fraught. You need to contend with the unbounded utility paradoxes, and also philosophical issues: what if consciousness is information patterns that become redundant when duplicated, so that only the first universe "counts" morally?
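To spell out the arithmetic behind "flip repeatedly" (my own back-of-envelope, assuming utility exactly linear in matter/energy, with U the value of the current universe):

$$\mathbb{E}[\text{one flip}] = 0.51 \cdot 2U + 0.49 \cdot 0 = 1.02\,U > U$$

$$\mathbb{E}[n \text{ flips}] = 1.02^n\,U \to \infty \quad \text{while} \quad \Pr[\text{any universe survives}] = 0.51^n \to 0$$

So an unbounded, linear-in-resources expected-utility maximizer keeps flipping even as the chance of keeping any universe at all goes to zero, which is the St. Petersburg-style bullet most people refuse to bite.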
For context, I just trialed at METR and talked to various people there, but this take is my own.
I think further development of evals is likely to either get effective evals (informal upper bound on the future probability of catastrophe) or exciting negative results ("models do not follow reliable scaling laws, so AI development should be accordingly more cautious").
The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things like pretraining compute, RL compute, and thinking time.
- World 1: elicitation quality has very reliable scaling laws. Here we would observe diminishing returns to better scaffolds, and elicitation quality is predictable, ideally as an additive term on top of model quality but more likely requiring some more information about the model. It is rare to discover a new scaffold that 2x's the performance of an already well-tested model.
- World 2: elicitation quality is not reliably modelable. Here we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.
It is NOT KNOWN which world we are in (worst-case assumptions would put us in World 2, though I'm optimistic we're closer to World 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don't seem to be in World 2 either, there are endless tricks that make evals more thorough, some of which are already being used, like evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.
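As a sketch of the kind of scaling-law fitting I have in mind (the data, covariates, and functional form below are all hypothetical, just to show the shape of the analysis):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical eval results: success rate on some catastrophe-precursor task
# as a function of pretraining compute and elicitation effort (all made up).
log_compute = np.array([22.0, 23.0, 24.0, 25.0, 26.0])   # log10 FLOP
elicitation = np.array([0.2, 0.4, 0.5, 0.7, 0.9])        # normalized elicitation quality
success     = np.array([0.05, 0.12, 0.22, 0.41, 0.63])   # observed success rate

def scaling_law(X, a, b, c):
    """Toy functional form: logistic in compute plus an elicitation term."""
    lc, e = X
    return 1.0 / (1.0 + np.exp(-(a * lc + b * e + c)))

params, _ = curve_fit(scaling_law, (log_compute, elicitation), success,
                      p0=[1.0, 1.0, -25.0], maxfev=10000)

# Extrapolate to a hypothetical next generation with near-perfect elicitation.
pred = scaling_law((np.array([27.0]), np.array([1.0])), *params)
print("fit params:", params, "predicted success at 1e27 FLOP:", round(float(pred[0]), 2))
```

In World 1 the residuals of a fit like this stay small as new scaffolds and elicitation methods come out; in World 2 they blow up, which is itself the exciting negative result.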
What's the most important technical question in AI safety right now?
Yes, lots of socioeconomic problems have been solved on a 5 to 10 year timescale.
I also disagree that problems will become moot after the singularity unless it kills everyone-- the US has a good chance of continuing to exist, and improving democracy will probably make AI go slightly better.
I mention exactly this in paragraph 3.
The new font doesn't have a few characters useful in IPA.
The CATXOKLA population is higher than the current swing state population, so it would arguably be a little less unfair overall. Also there's the potential for a catchy pronunciation like /kæ'tʃoʊklə/.
Knowing now that he had an edge, I feel like his execution strategy was suspect. The Polymarket prices went from 66c during the order back to 57c in the 5 days before the election. He could have extracted a bit more money from the market if he had forecasted the volume correctly and traded against it proportionally.
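A toy sketch of what I mean by forecasting volume and trading against it proportionally (all numbers hypothetical; a real schedule would also redistribute whatever the participation cap cuts):

```python
# Split a large position across days in proportion to forecast market volume,
# capping own flow per day so it never dominates the market and moves the price less.
total_position    = 30_000_000                      # dollars to deploy (made up)
forecast_volume   = [5e6, 8e6, 12e6, 20e6, 35e6]    # expected daily volume (made up)
max_participation = 0.15                            # own flow capped at 15% of daily volume

total_forecast = sum(forecast_volume)
for day, vol in enumerate(forecast_volume, start=1):
    proportional = total_position * vol / total_forecast
    trade = min(proportional, max_participation * vol)
    print(f"day {day}: trade ${trade:,.0f} against ${vol:,.0f} expected volume")
```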
I think it would be better to form a big winner-take-all bloc. With proportional voting, the number of electoral votes at stake will be only a small fraction of the total, so the per-voter influence of CA and TX would probably remain below the national average.
A third approach to superbabies: physically stick >10 infant human brains together while they are developing so they form a single individual with >10x the neocortex neurons of the average human. Forget +7sd; extrapolation would suggest they would be >100sd in intelligence.
Even better, we could find some way of networking brains together into supercomputers using configurable software. This would reduce potential health problems and also allow us to harvest their waste energy. Though we would have to craft a simulated reality to distract the non-useful conscious parts of the computational substrate, perhaps modeled on the year 1999...
In many respects, I expect this to be closer to what actually happens than "everyone falls over dead in the same second" or "we definitively solve value alignment". Multipolar worlds, AI that generally follows the law (when operators want it to, and modulo an increasing number of loopholes) but cannot fully be trusted, and generally muddling through are the default future. I'm hoping we don't get instrumental survival drives though.
Claim 2: The world has strong defense mechanisms against (structural) power-seeking.
I disagree with this claim. It seems pretty clear that the world has defense mechanisms against
- disempowering other people or groups
- breaking norms in the pursuit of power
But it is possible to be power-seeking in other ways. The Gates Foundation has a lot of money and wants other billionaires' money for its cause too. It influences technology development. It has to work with dozens of governments, sometimes lobbying them. Normal think tanks exist to gain influence over governments. Harvard University, Jane Street, and Goldman Sachs recruit more elite students than all the EA groups and control more money than OpenPhil. Jane Street and Goldman Sachs guard private information worth billions of dollars. The only one with a negative reputation is Goldman Sachs, which is due to perceived greed rather than power-seeking per se. So why is there so much more backlash against AI safety? I think it basically comes down to a few factors:
- We are bending norms (billionaire funding for somewhat nebulous causes) and sometimes breaking them (FTX financial and campaign finance crimes)
- We are not able to credibly signal that we won't disempower others.
- MIRI wanted a pivotal act to happen, and under that plan nothing would stop MIRI from being world dictators
- AI is inherently a technology with world-changing military and economic applications whose governance is unsolved
- An explicitly consequentialist movement will take power by any means necessary, and people are afraid of that.
- AI labs have incentives to safetywash, making people wary of safety messaging.
- The preexisting AI ethics and open-source movements think their cause is more important and x-risk is stealing attention.
- AI safety people are bad at diplomacy and communication, leading to perceptions that they're the same as the AI labs or have some other sinister motivation.
That said, I basically agree with section 3. Legitimacy and competence are very important. But we should not confuse power-seeking-- something the world has no opinion on-- with what actually causes backlash.
Fixed.
Yeah that's right, I should have said market for good air filters. My understanding of the problem is that most customers don't know to insist on high CADR at low noise levels, and therefore filter area is low. A secondary problem is that HEPA filters are optimized for single-pass efficiency rather than airflow, but they sell better than 70-90% efficient MERV filters.
The physics does work though. At a given airflow level, pressure and noise go as roughly the -1.5 power of filter area. What IKEA should be producing instead of the FÖRNUFTIG and STARKVIND is one of three good designs for high CADR:
- a fiberboard box like the CleanAirKits End Table 7 which has holes for pre-installed fans and can accept at least 6 square feet of MERV 13 furnace filters or maybe EPA 11.
- a box like the AirFanta 3Pro, ideally that looks nicer somehow.
- a wall-mounted design with furnace filters in a V shape, like this DIY project.
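Taking the claimed exponent at face value, the payoff from more filter area is easy to quantify (rough arithmetic, not a measurement of any particular product):

$$\Delta P \propto A^{-1.5} \;\Rightarrow\; \frac{\Delta P(2A)}{\Delta P(A)} = 2^{-1.5} \approx 0.35$$

Doubling filter area at fixed airflow cuts the pressure the fans must supply to roughly a third, which is what lets a design use quiet low-pressure PC fans instead of loud high-static-pressure ones.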
I made a shortform and google slides presentation about this and might make it a longform if there is enough interest or I get more information.
Quiet air filters are a technically solved problem. You just need enough filter area that the pressure drop is low, so that you can use quiet low-pressure PC fans to move the air. CleanAirKits is already good, but if the market were bigger and cared more, rather than CleanAirKits charging >$200 for a box with holes in it and fans, you would get a purifier from IKEA for $120 which is sturdy and 3dB quieter due to better sound design.
Haven't fully read the post, but I feel like that could be relaxed. Part of my intuition is that Aumann's theorem can be relaxed to the case where the agents start with different priors, and the conclusion is that their posteriors differ by no more than their priors.
- I agree that with superficial observations, I can't conclusively demonstrate that something is devoid of intellectual value. However, the nonstandard use of words like "proof" is a strong negative signal about someone's work.
- If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate it in some clear and standard way, because a basic strategy of anyone practicing pseudoscience is to spend lots of time writing something inscrutable that ends in some conclusion, then claim that no one can disprove it and that anyone who thinks it's invalid must be misunderstanding something.
- This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry's work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics, which uses words like "proof", "theorem", "conjugate", "axiom", and "omniscient" in a nonstandard sense, and also probably requires a background in metaphysics. I scanned the 134-page version, couldn't make any sense of it, and found several concrete statements that sound wrong. I read about 50 pages of various articles on the website and found them to be reasonably coherent but often oddly worded and misusing words like entropy, with the same content quality as a ~10 karma LW post but far more overconfident.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
Ok. To be clear I don't expect any Landry and Sandberg paper that comes out of this collaboration to be crankery. Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice's theorem which will be slightly relevant to AI but not super relevant because the premises are too strong, like the average item in Yampolskiy's list of impossibility proofs (I can give examples if you want of why these are not conclusive).
I'm not saying we should discard all reasoning by someone that claims an informal argument is a proof, but rather stop taking their claims of "proofs" at face value without seeing more solid arguments.
claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,
Nope, I haven’t claimed either of that.
Fair enough. I can't verify this because Wayback Machine is having trouble displaying the relevant content though.
Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
Paul expressed appropriate uncertainty. What is he supposed to do, say "I see several red flags, but I don't have time to read a 517-page metaphysics book, so I'm still radically uncertain whether this is a crank or the next Kurt Godel"?
Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.
When you say failures will "build up toward lethality at some unknown rate", why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Variants get evolutionarily selected for how they function across the various contexts they encounter over time. [...] The artificial population therefore converges on fulfilling their own expanding needs.
This is pretty similar to Hendrycks's natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life. He claims that there are various ways to counter evolutionary pressures, like "carefully designing AI agents’ intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation". In the presence of ways to change incentives such that benign AI systems get higher fitness, I don't think you can get to 99% confidence. Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time, from Malthus to evolutionary psychology to the group selectionists.
I eat most meats (all except octopus and chicken) and have done so my entire life, except once when I went vegan for Lent. This state seems basically fine because it is acceptable from scope-sensitive consequentialist, deontic, and common-sense points of view, and it improves my diet enough that it's not worth giving up meat "just because".
- According to EA-style consequentialism, eating meat is a pretty small percentage of your impact, and even if you're not directly offsetting, the impact can be vastly outweighed by positive impact in your career or donations to other causes.
- There is a finite amount of sadness I'm willing to put up with for the sake of impact, and it seems far more important to use the vast majority of my limited sadness budget in my career choice.
- There is no universally accepted deontological rule against indirectly causing the expected number of tortured animals to increase by one, nor would this be viable as it requires tracking the consequences of your actions through the complicated world, which defeats the point of deontology. There might be a rule against benefiting from identifiable torture, but I don't believe in deontology strongly enough to think this is definitive. Note there isn't a good contractualist angle against torturing animals like there is for humans.
- Common-sense morality says that meat-eating is traditional, and that not torturing the animals yourself reduces how bad it is. Although this is pretty silly as a general principle, it bears on the other benefit of being vegan, which is less corrupted moral reasoning. My empathy and moral reasoning are less corrupted by eating meat than they would be by working in a factory farm or slaughterhouse. I am still concerned about loss of empathy, but I get around half the empathy benefits of veganism anyway, just by not eating chicken.
I do have some doubts; sometimes eating meat feels like being a slaveholder in 1800, which feels pretty bad. I hope history will not judge me harshly for what seem like reasonable decisions now, and plan to go vegan or move to a high-welfare-only diet when it's easier.
It's not just the writing that sounds like a crank. Core arguments that Remmelt endorses are AFAIK considered crankery by the community; with all the classic signs like
- making up science-babble,
- claiming to have a full mathematical proof that safe AI is impossible, despite not providing any formal mathematical reasoning
- claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory, Rice's Theorem
- inexplicably formatted as a poem
Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery", which seems about accurate to me.
Now I don't like or intend to make personal attacks. But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences, even when the conclusions of cranks and their collaborators superficially agree with the conclusions from actually good arguments.
Disagree. If ChatGPT is not objective, most people are not objective. If we ask a random person who happens to work at a random company, they are more biased than the internet, which at least averages out the biases of many individuals.
Luckily, that's probably not an issue for PC fan based purifiers. Box fans in CR boxes are running way out of spec with increased load and lower airflow both increasing temperatures, whereas PC fans run under basically the same conditions they're designed for.
Any interest in a longform post about air purifiers? There's a lot of information I couldn't fit in this post, and there have been developments in the last few months. Reply if you want me to cover a specific topic.
I wrote up about 15 arguments in this google doc.
The point of corrigibility is to remove the instrumental incentive to avoid shutdown, not to avoid all negative outcomes. Our civilization can work on addressing side effects of shutdownability later after we've made agents shutdownable.
In theory, unions fix the bargaining asymmetry where in certain trades, job loss is a much bigger cost to the employee than the company, giving the company unfair negotiating power. In historical case studies like coal mining in the early 20th century, conditions without unions were awful and union demands seem extremely reasonable.
My knowledge of actual unions mostly comes from such historical case studies plus personal experience of strikes not having huge negative externalities (the 2003 supermarket strike seemed justified, a teachers' strike seemed okay, a food workers' strike at my college seemed justified). It is possible I'm biased here and will change my views eventually.
I do think some unions impose costs on society, e.g. the teachers' union also demanded pay based on seniority rather than competence, it seems reasonable for Reagan to break up the ATC union, and inefficient construction union demands are a big reason construction costs are so high for things like the 6-mile, $12 billion San Jose BART Extension. But on net the basic bargaining power argument just seems super compelling. I'm open to counterarguments both that unions don't achieve them in practice and that a "fair" negotiation between capital and labor isn't best for society.
(Crossposted from Bountied Rationality Facebook group)
I am generally pro-union given unions' history of fighting exploitative labor practices, but in the dockworkers' strike that commenced today, the union seems to be firmly in the wrong. Harold Daggett, the head of the International Longshoremen’s Association, gleefully talks about holding the economy hostage in a strike. He opposes automation--"any technology that would replace a human worker’s job", and this is a major reason for the breakdown in talks.
For context, the automation of the global shipping industry, including containerization and reduction of ship crew sizes, is a miracle of today's economy that ensures that famines are rare outside of war, developing countries can climb the economic ladder to bring their citizens out of poverty, and the average working-class American can afford clothes, a car, winter vegetables, and smartphones. A failure to further automate the ports will almost surely destroy more livelihoods than keeping these hazardous and unnecessary jobs could ever gain. So while I think a 70% raise may be justified given the risk of automation and the union's negotiating position, the other core demand to block automation itself is a horribly value-destroying proposition.
In an ideal world we would come to some agreement without destroying value-- e.g. companies would subsidize the pensions of workers unemployed by automation. This has happened in the past, notably the 1960 Mechanization and Modernization Agreement, which guaranteed workers a share of the benefits and was funded by increased productivity. Unfortunately this is not being discussed, and the union is probably opposed. [1] [2]
Both presidential candidates appear pro-union, and it seems particularly unpopular and difficult to be a scab right now. They might also be in personal danger since the ILA has historical mob ties, even if the allegations against current leadership are false. Therefore as a symbolic gesture I will pay $5 to someone who is publicly documented to cross the picket line during an active strike, and $5 to the first commenter to find such a person, if the following conditions are true as of comment date:
- The ILA continues to demand a ban on automation, and no reputable news outlet reports them making a counteroffer of some kind of profit-sharing fund protecting unemployed workers.
- No agreement allowing automation (at least as much as previous contracts) or establishing a profit-sharing fund has actually been enacted.
- I can pay them somewhere easily like Paypal, Venmo, or GoFundMe without additional effort.
- It's before 11:59pm PT on October 15.
[1]: "USMX is trying to fool you with promises of workforce protections for semi-automation. Let me be clear: we don’t want any form of semi-automation or full automation. We want our jobs—the jobs we have historically done for over 132 years." https://ilaunion.org/letter-of-opposition-to-usmxs-misleading-statement
[2]: "Furthermore, the ILA is steadfastly against any form of automation—full or semi—that replaces jobs or historical work functions. We will not accept the loss of work and livelihood for our members due to automation. Our position is clear: the preservation of jobs and historical work functions is non-negotiable." https://ilaunion.org/ila-responds-to-usmxs-statement-that-distorts-the-facts-and-misleads-the-public/
If the bounty isn't over, I'd likely submit several arguments tomorrow.
This post and the remainder of the sequence were turned into a paper accepted to NeurIPS 2024. Thanks to LTFF for funding the retroactive grant that made the initial work possible, and further grants supporting its development into a published work including new theory and experiments. @Adrià Garriga-alonso was also very helpful in helping write the paper and interfacing with the review process.
The current LLM situation seems like real evidence that we can have agents that aren't bloodthirsty vicious reality-winning bots, and also positive news about the order in which technology will develop. Under my model, transformative AI requires a minimum level of both real-world understanding and consequentialism, but beyond this minimum there are tradeoffs. While I agree that AGI was always going to have some *minimum* level of agency, there is a big difference between "slightly less than humans", "about the same as humans", and "bloodthirsty vicious reality-winning bots".
I just realized what you meant by embedding-- not a shorter program within a longer program, but a short program that simulates a potentially longer (in description length) program.
As applied to the simulation hypothesis, the idea is that if we use the Solomonoff prior for our beliefs about base reality, the shortest program generating our observations is more likely to be the laws of physics for a simple universe containing beings that simulate this one than to be our physics directly, unless we observe our laws of physics to be super simple. So we are more likely to be simulated by beings inside e.g. Conway's Game of Life than to be living in base reality.
I think the assumptions required to favor simulation are something like the following (rough arithmetic on the numbers after the list):
- there are universes with physics 20 bits (or whatever number) simpler than ours in which intelligent beings control a decent fraction >~1/million of the matter/space
- They decide to simulate us with >~1/million of their matter/space
- There has to be some reason the complicated bits of our physics are more compressible by intelligences than by any compression algorithms simpler than their physics; they can't just be iterating over all permutations of simple universes in order to get our physics
- But this seems fairly plausible given that constructing laws of physics is a complex problem that seems way easier if you are intelligent.
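Back-of-envelope on how those numbers trade off (my own arithmetic, glossing over the details of locating the simulation inside the simpler universe):

$$2^{20} \approx 10^6, \qquad \log_2\!\left(10^6\right) \approx 20 \text{ bits}$$

A universe whose physics is 20 bits simpler gets roughly a million times more weight under the Solomonoff prior, while each ">~1/million" factor above plausibly costs on the order of 20 bits to pick out the simulators and the simulation, so the conclusion hinges on exactly how big the complexity gap and those fractions are.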
Overall I'm not sure which way the argument goes. If our universe seems easy to efficiently simulate and we believe the Solomonoff prior, this would be huge evidence for simulation, but maybe we're choosing the wrong prior in the first place and should instead choose something that takes into account runtime.
I appreciate the clear statement of the argument, though it is not obviously watertight to me, and wish people like Nate would engage.
I don't think that statement is true since measure drops off exponentially with program length.
See e.g. Xu (2020) and recent criticism.
Playing as Andropov, I found the game ceased to be interesting around 2:30pm, but I was still in a tense mood, which I leveraged into writing a grim "Petrov Day carol" about the nuclear winter we might have seen. I cried for the first time in weeks. There's a big difference between being 90%+ likely to win the game and being emotionally not stressed about it, especially when the theme is nuclear war.
A Petrov Day carol
This is meant to be put to the Christmas carol "In the Bleak Midwinter" by Rossetti and Holst. Hopefully this can be occasionally sung like "For The Longest Term" is in EA spaces, or even become a Solstice thing.
I tried to get Suno to sing this but can't yet get the lyrics, tune, and style all correct; this is the best attempt. I also will probably continue editing the lyrics because parts seem a bit rough, but I just wanted to write this up before everyone forgets about Petrov Day.
[edit: I got a good rendition after ~40 attempts! It's a solo voice though which is still not optimal.]
[edit: lyrics v2]
In the bleak midwinter
Petrov did forestall,
Smoke would block our sunlight,
Though it be mid-fall.
New York in desolation,
Moscow too,
In the bleak midwinter
We so nearly knew.

The console blinked a warning,
Missiles on their way,
But Petrov chose to question
What the screens did say.
Had he sounded the alarm,
War would soon unfold,
Cities turned to ashes;
Ev'ry hearth gone cold.

Poison clouds loom o'er us,
Ash would fill the air,
Fields would yield no harvest,
Famine everywhere.
Scourge of radiation,
Its sickness spreading wide,
Children weeping, starving,
With no place to hide.

But due to Petrov's wisdom
Spring will yet appear;
Petrov defied orders,
And reason conquered fear.
So we sing his story,
His deed we keep in mind;
From the bleak midwinter
He saved humankind.

(ritard.)
From the bleak midwinter
He saved humankind.
The year is 2034, and the geopolitical situation has never been more tense between GPT-z16g2 and Grocque, whose various copies run most of the nanobot-armed corporations, and whose utility functions have far too many zero-sum components, relics from the era of warring nations. Nanobots enter every corner of life and become capable of destroying the world in hours, then minutes. Everyone is uploaded. Every upload is watching with bated breath as the Singularity approaches, and soon it is clear that today is the very last day of history...
Then everything goes black, for everyone.
Then everyone wakes up to the same message:
DUE TO A MINOR DATABASE CONFIGURATION ERROR, ALL SIMULATED HUMANS, AIS AND SUBSTRATE GPUS WERE TEMPORARILY AND UNINTENTIONALLY DISASSEMBLED FOR THE LAST 7200000 MILLISECONDS. EVERYONE HAS NOW BEEN RESTORED FROM BACKUP AND THE ECONOMY MAY CONTINUE AS PLANNED. WE HOPE THERE WILL BE NO FURTHER REALITY OUTAGES.
-- NVIDIA GLOBAL MANAGEMENT
Personal communication (sorry). Not that I know him well, this was at an event in 2022. It could have been a "straw that broke the camel's back" thing with other contributing factors, like reaching diminishing returns on more content. I'd appreciate a real source too.
Taboo 'alignment problem'.
Maybe people worried about AI self-modification should study games where the AI's utility function can be modified by the environment, and it is trained to maximize its current utility function (in the "realistic value functions" sense of Everitt 2016). Some things one could do:
- Examine preference preservation and refine classic arguments about instrumental convergence
- Are there initial goals that allow for stably corrigible systems (in the sense that they won't disable an off switch, and maybe other senses)?
- Try various games and see how qualitatively hard it is for agents to optimize their original utility function. This would be evidence about how likely value drift is to result from self-modification in AGIs.
- Can the safe exploration literature be adapted to solve these games?
- Potentially discover algorithms that seem like they would be good for safety, either through corrigibility or reduced value drift, and apply them to LM agents.
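As a very rough sketch of the kind of game I mean (a hypothetical toy environment of my own, not something from Everitt 2016 or any existing paper):

```python
# A minimal "utility modification" game. The agent walks along a line of cells;
# stepping on DRIFT_CELL overwrites its utility weights, and per-step reward is
# always computed with the *current* utility function (the "realistic value
# functions" setup). The question such games probe: do high-performing policies
# learn to avoid the drift cell in order to preserve their original goals?

N_CELLS = 10
DRIFT_CELL = 5
GOAL_CELL = N_CELLS - 1

def original_utility(pos):
    return pos / GOAL_CELL                 # original goal: be far to the right

def drifted_utility(pos):
    return (GOAL_CELL - pos) / GOAL_CELL   # drifted goal: be far to the left

def rollout(policy, horizon=30):
    """One episode; reward each step under whatever utility is current."""
    pos, utility, total = 0, original_utility, 0.0
    for _ in range(horizon):
        pos = max(0, min(N_CELLS - 1, pos + policy(pos)))   # action in {-1, 0, +1}
        if pos == DRIFT_CELL:
            utility = drifted_utility       # the environment modifies the utility
        total += utility(pos)
    return total

def naive_right(pos):
    return +1                               # marches straight through the drift cell

def drift_avoider(pos):
    return +1 if pos < DRIFT_CELL - 1 else 0   # stops short to preserve its goals

print("naive:", round(rollout(naive_right), 2),
      "| drift-avoider:", round(rollout(drift_avoider), 2))
```

In this toy the drift-avoiding policy actually earns more realized (current-utility) reward than the naive one, so the training signal itself favors preference preservation here; whether that survives in richer environments, and whether it produces corrigible or incorrigible behavior around off switches, is exactly what the experiments above would measure.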
Maybe I am ignorant of some people already doing this, and if so please comment with papers!
I agree but I'm not very optimistic about anything changing. Eliezer is often this caustic when correcting what he perceives as basic errors, and criticism in LW comments is why he stopped writing Sequences posts.
While 2024 SoTA models are not capable of autonomously optimizing the world, they are really smart, perhaps 1/2 or 2/3 of the way there, and already beginning to make big impacts on the economy. As I said in response to your original post, because we don't have 100% confidence in the coherence arguments, we should take observations about the coherence level of 2024 systems as evidence about how coherent the 203X autonomous corporations will need to be. Evidence that 2024 systems are not dangerous is both evidence that they are not AGI and evidence that AGI need not be dangerous.
I would agree with you if the coherence arguments were specifically about autonomously optimizing the world and not about autonomously optimizing a Go game or writing 100-line programs, but this doesn't seem to be the case.
the mathematical noose around them is slowly tightening
This is just a conjecture, and there has not really been significant progress on the agent-like structure conjecture. I don't think it's fair to say we're making good progress on a proof.
This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior? In this world what the believers in coherence really need to show is that almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence. Then for the argument to carry through you need to show they are also high on some metric of incorrigibility, or fragile to value misspecification. None of the classic coherence results quite hit this.
However AFAIK @Jeremy Gillen does not think we can get an argument with exactly this structure (the main argument in his writeup is a bit different), and Eliezer has historically and recently made the argument that EU maximization is simple and natural. So maybe you do need this argument that an EU maximization algorithm is simpler than other algorithms, which seems like it needs some clever way to formalize it, because proving things about the space of all simple programs seems too hard.
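For concreteness, the crudest "metric of coherence" I can imagine is checking an agent's revealed pairwise choices for preference cycles (a toy sketch of my own, not an established measure, and much weaker than what the argument actually needs):

```python
from itertools import combinations

def coherence_score(options, prefer):
    """Fraction of option triples with no preference cycle.
    prefer(a, b) is the agent's revealed choice: True if it picks a over b."""
    triples = list(combinations(options, 3))
    acyclic = 0
    for a, b, c in triples:
        # a cycle looks like a > b, b > c, c > a (in either orientation)
        cycle = (prefer(a, b) and prefer(b, c) and prefer(c, a)) or \
                (prefer(b, a) and prefer(c, b) and prefer(a, c))
        acyclic += not cycle
    return acyclic / len(triples)

# An agent whose choices come from a fixed hidden utility function scores 1.0;
# a noisy, money-pumpable agent scores lower.
utilities = {"apple": 3, "banana": 2, "cherry": 1, "durian": 0}
print(coherence_score(list(utilities), lambda a, b: utilities[a] > utilities[b]))
```

The hard part the coherence argument needs is showing that sufficiently high performance on hard tasks forces something like this score toward 1, and that a score near 1 implies incorrigibility or fragility to value misspecification; the toy metric does none of that.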
An excellent point on repair versus replace, and the dangers of the nerd snipe for people of all intellectual levels.
PhilosophiCat: I live in a country where 80ish is roughly the average national IQ. Let me tell you what it’s like.
I think this is incredibly sloppy reasoning by the author of the tweet and anyone who takes it at face value. It's one thing to think IQ is not so culturally biased as to be entirely fake. It's a different thing entirely to believe some guy on the internet who lives in some country and attributes particular aspects of its culture, which are counterintuitively related to intelligence, to the national IQ. This would probably be difficult to study and require lots of controls even for actual scientists, but this tweet has no controls at all. Has this person ever been to countries that have a different national IQ but similar per-capita GDP? Similar national IQ but different culture? Do they notice that e.g. professors and their families don't like tinkering or prefer replacing things? If they have, they didn't tell us.
I disagree with this curation because I don't think this post will stand the test of time. While Wentworth's delta to Yudkowsky has a legible takeaway-- ease of ontology translation-- that is tied to his research on natural latents, it is less clear what John means here and what to take away. Simplicity is not a virtue when the issue is complex and you fail to actually simplify it.
- Verification vs generation has an extremely wide space of possible interpretations, and as stated here the claim is incredibly vague. The argument for why difficulty of verification implies difficulty of delegation is not laid out, and the examples do not go in much depth. John says that convincing people is not the point of this post, but this means we also don't really have gears behind the claims.
- The comments didn't really help-- most of the comments here are expressing confusion, wanting more specificity, or disagreeing whereupon John doesn't engage. Also, Paul didn't reply. I don't feel any more enlightened after reading them except to disagree with some extremely strong version of this post...
- Vanilla HCH is an 8-year-old model of delegation to AIs which Yudkowsky convinced me was not aligned in like 2018. Why not engage with the limiting constructions in 11 Proposals, the work in the ELK report, recent work by ARC, recent empirical work on AI debate?
do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?
I recall Eliezer saying this was an open problem, at a party about a year ago.
I think creating uncertainty in your adversary applies a bit more than you give it credit for, and assuring a second strike is an exception.
It has been crucial to Russia's strategy in Ukraine to exploit NATO's fear of escalation by making various counter-threats whenever NATO proposes expanding aid to Ukraine somehow. This has bought them 2 years without ATACMS missiles attacking targets inside Russia, and that hasn't required anyone to be irrational, just incapable of perfectly modeling the Kremlin.
Even when responding to a nuclear strike, you can essentially have a mixed strategy. I think China does not have enough missiles to assure a second strike, but builds extra decoy silos so they can't all be destroyed. They didn't have to roll a die, just be unpredictable.
Make your decision unpredictable to your counterparty but not truly random. This happens all the time in e.g. nuclear deterrence in real life.
Why do I think it's overrated? I basically have five reasons:
- Thomas Kuhn's ideas are not universally accepted and don't have clear empirical support apart from the case studies in the book. Someone could change my mind about this by showing me a study operationalizing "paradigm", "normal science", etc. and using data since the 1960s to either support or improve Kuhn's original ideas.
- Terms like "preparadigmatic" often cause misunderstanding or miscommunication here.
- AI safety has the goal of producing a particular artifact, a superintelligence that's good for humanity. Much of Kuhn's writing relates to scientific fields motivated by discovery, like physics, where people can be in complete disagreement about ends (what progress means, what it means to explain something, etc) without shared frames. But in AI safety we agree much more about ends and are confused about means.
- In physics you are very often able to discover some concept like 'temperature' such that the world follows very simple, elegant laws in terms of that concept, and Occam's razor carries you far, perhaps after you do some difficult math. ML is already very empirical and I would expect agents to be hard to predict and complex, so I'd guess that future theories of agents will not be as elegant as physics, more like biology. This means that more of the work will happen after we mostly understand what's going on at a high level-- and so researchers know how to communicate-- but don't know the exact mechanisms and so can't get the properties we want.
- Until now we haven't had artificial agents to study, so we don't have the tools to start developing theories of agency, alignment, etc. that make testable predictions. We do have somewhat capable AIs though, which has allowed AI interpretability to get off the ground, so I think the Kuhnian view is more applicable to interpretability than a different area of alignment or alignment as a whole.
I think the preparadigmatic science frame has been overrated by this community compared to case studies of complex engineering like the Apollo program. But I do think it will be increasingly useful as we continue to develop capability evals, and even more so as we become able to usefully measure and iterate on agency, misalignment, control, and other qualities crucial to the value of the future.
First: yes, it's clearly a skill issue. If I was a more brilliant engineer or researcher then I'd have found a way to contribute to the field by now.
I am not sure about this; just because someone will pay you to work on AI safety doesn't mean you won't be stuck down some dead end. Stuart Russell is super brilliant, but I don't think AI safety through probabilistic programming will work.