This seems reasonable, though efficacy of the learning method seems unclear to me.
But:
with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author
This seems wrong. To pick on myself, my peer reviewed papers, my substack, my lesswrong posts, my 1990s blog posts, and my twitter feed are all substantively different in ways that I think the author vector should capture.
There's a critical (and interesting) question about how you generate the latent space of authors, and/or how it is inferred from the text. Did you have thoughts on how this would be done?
That is completely fair, and I was being uncharitable (which is evidently what happens when I post before I have my coffee, apologies.)
I do worry that we're not being clear enough that we don't have solutions for this worryingly near-term problem, and think that there's far too little public recognition that this is a hard or even unsolvable problem.
it could be just as easily used that way once there's a reason to worry about actual alignment of goal-directed agents
This seems to assume that we solve various Goodhart's law and deception problems
Assuming that timelines are exogenous, I would completely agree - but they are not.
The load-bearing assumption here seems to be that we won't make unaligned superintelligent systems with current methods soon enough for it to matter.
This seems false, and at the very least should be argued explicitly.
My original claim, "the ability to 'map' Turing machine states to integers," was an assertion over all possible Turing machines and their maps.
I have certainly seen that type of frustrating unwillingness to update occur on his part at times as well, but I haven't seen indications of bad faith. (I suspect this could be because your interpretation of the phrase "bad faith" is different and far broader than mine.)
A few examples of him being reasonable, which I found looking through quickly: https://twitter.com/GaryMarcus/status/1835396298142625991 / https://x.com/GaryMarcus/status/1802039925027881390 / https://twitter.com/GaryMarcus/status/1739276513541820428 / https://x.com/GaryMarcus/status/1688210549665075201
@Veedrac - if you want concrete examples, search for both of our usernames on twitter, or more recently, on bluesky.
I think that basically everyone at MIRI, Yampolskiy, and a dozen other people all have related and strong views on this. You're posting on Lesswrong, and I don't want to be rude, but I don't know why I'd need to explain this instead of asking you to read the relevant work.
Transformers work for many other tasks, and it seems incredibly likely to me that the expressiveness includes not only game playing, vision, and language, but also other things the brain does. And to bolster this point, the human brain doesn't use two completely different architectures!
So I'll reverse the question; why do you think the thought assessor is fundamentally different from other neural functions that we know transformers can do?
As I said in my top-level comment, I don't see a reason to think that once the issue is identified as the key barrier, work on addressing it would be so slow.
I don't really understand the implicit model where AI companies recognize that having a good thought assessor is the critical barrier to AGI, put their best minds on solving it, and then, as you seem to think, just fail because it's the single incomparably hard human capability.
It seems plausible that the diagnosis of what is missing is correct, but strongly implausible that it's fundamentally harder than other parts of the puzzle, much less hard in ways that AI companies would need a decade to tackle. In my modal case, once they start, I expect progress to follow curves similar to every other capability they develop.
I could imagine the capability occurring but not playing out that way, because the SWEs won't necessarily be fired even after becoming useless - so it won't be completely obvious from the outside. But this is a sociological point about when companies fire people, not a prediction about AI capabilities.
Yes, doing those things in ways that a capable alignment researcher can't find obvious failure modes for. (Which may not be enough, given that they aren't superintelligences - but it is still a bar which no proposed plan comes close to passing.)
Sorry if this was unclear, but there's a difference between plans which work conditioning on an impossibility, and trying to do the impossible. For example, building a proof that works only if P=NP is true is silly in ways that trying to prove P=NP is not. The second is trying to do the impossible, the first is what I was dismissive of.
My own best hope at this point is that someone will actually solve the "civilizational superalignment" problem of CEV, i.e. learning how to imbue autonomous AI with the full set of values (whatever they are) required to "govern" a transhuman civilization in a way that follows from the best in humanity, etc. - and that this solution will be taken into account by whoever actually wins the race to superintelligence.
Sounds like post-hoc justification for not even trying to stop something bad by picking a plan with zero percent chance of success, instead of further thought and actually trying to do the impossible.
Another problem seems important to flag.
They said they first train a very powerful model, then "align" it - so they better hope it can't do anything bad until after they make it safe.
Then, as you point out, they are implicitly trusting that the unsafe system won't have any views on the topic and will let itself be aligned. (Imagine an engineer in any other field saying "we build it and it's definitely unsafe, then we tack on safety at the end by using the unsafe thing we built.")
I do know what you mean, and still think the soldier mindset both here and in the post are counterproductive to the actual conversation.
In my experience, when I point out a mistake to Gary without attacking him, he is willing to admit he was wrong, and often happy to update. So this type of attacking non-engagement seems very bad - especially since him changing his mind is more useful for informing his audience than attacking him is.
He says things that are advantageous, and sometimes they are even true. The benefit of not being known to be a liar usually keeps the correlation between claims and truth positive, but in his case it seems that ship has sailed.
(Checkably false claims are still pretty rare, and this may be one of those.)
I'd vainly hope that everyone would know about the zero-sum nature of racing to the apocalypse from nuclear weapons, but the parallel isn't great, and no one seems to have learned the lesson anyway, given the failure to hold SALT III or even follow through on START II.
Seems bad to posit that there must be a sane version.
I think it's hard to explain in the narrative, and there is plenty to point to that explains it - but on reflection I admit that it's not sufficiently clear for those who are skeptical.
There are mountains of posts laying out the arguments about optimization pressure, and trying to include that and explain here seems like adding an unhelpful digression.
Also: the picket signs that read "AI for who?" should obviously say "AI for whom?" ;)
We used to explain the original false claim to insurance regulators and similar groups, in the context of "100-year events," by having them roll 6 dice at once a few times. It's surprisingly useful for non-specialist intuitions.
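For anyone who wants to reproduce the intuition without a handful of dice, here's a minimal simulation sketch (my own illustration of the same point, not the exact dice exercise described above) of how often a nominal 1-in-100-per-year event actually turns up:

```python
import random

def chance_of_at_least_one(n_years, p=0.01, trials=100_000):
    """Monte Carlo estimate of the probability that a 1%-per-year
    ("100-year") event happens at least once within n_years."""
    hits = sum(
        any(random.random() < p for _ in range(n_years))
        for _ in range(trials)
    )
    return hits / trials

for horizon in (10, 30, 100):
    print(f"{horizon} years: ~{chance_of_at_least_one(horizon):.2f}")
# Prints roughly 0.10, 0.26, and 0.63 - a "100-year event" is far from
# guaranteed in a century, and far from impossible in a decade.
```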
Clarifying question:
How, specifically? Do you mean Perplexity using the new model, or comparing the new model to Perplexity?
The sequence description is: "Short stories about (implausible) AI dooms. Any resemblance to actual AI takeover plans is purely coincidental."
Sure, space colonies happen faster - but AI-enabled and AI-dependent space colonies don't do anything to make me think disempowerment risk gets uncorrelated.
Aside from the fact that I disagree that it helps, given that an AI takeover that's hostile to humans isn't a local problem, we're optimistically decades away from such colonies being viable independent of earth, so it seems pretty irrelevant.
I admitted that it's possible the problem is practically unsolvable, or worse; you could have put the entire world on Russell and Whitehead's goal of systematizing math, and you might have gotten to Gödel faster, but you'd probably just waste more time.
And on Scott's contributions, I think they are solving, or contributing towards solving, parts of the problems that were posited initially as critical to alignment, and I haven't seen anyone do more. (With the possible exception of Paul Christiano, who hasn't been focusing on research for solving alignment as much recently.) I agree that the work doesn't do much other than establish better foundations, but that's kind-of the point. (And it's not just Logical Induction - there's his collaboration on Embedded Agency, and his work on finite factored sets.) But pointing out that the work done to establish that base is more philosophical and doesn't itself align AGI seems like moving the goalposts, even if I agree it's true.
I don't think I disagree with you on the whole - as I said to start, I think this is correct. (I only skimmed the full paper, but I read the post; on looking at it, the full paper does discuss this more, and I was referring to the response here, not claiming the full paper ignores the topic.)
That said, in the paper you state that the final steps require something more than human disempowerment due to other types of systems. But, per my original point, you seem to elide how the process up to that point is identical, by saying that these systems have largely been aligned with humans until now - while I think that's untrue; humans have benefitted despite the systems being poorly aligned. (Misalignment due to overoptimization failures would look like this, and it is what has been happening when economic systems optimize for GDP and ignore wealth disparity, for example; the wealth goes up, but as it becomes more extreme, the tails diverge, and at that point, maximizing GDP looks very different from what a democracy is supposed to do.)
Back to the point, to the extent that the unique part is due to cutting the last humans out of the decision loop, it does differ - but it seems like the last step definitionally required the initially posited misalignment with human goals, so that it's an alignment or corrigibility failure of the traditional type, happening at the end of this other process that, again, I think is not distinct.
Again, that's not to say I disagree, just that it seems to ignore the broader trend by saying this is really different.
But since I'm responding, as a last complaint, you do all of this without clearly spelling out why solving technical alignment would solve this problem, which seems unfortunate. Instead, the proposed solutions try to patch the problems of disempowerment by saying you need to empower humans to stay in the decision loop - which in the posited scenario doesn't help when increasingly powerful but fundamentally misaligned AI systems are otherwise in charge. But this is making a very different argument, and one I'm going to be exploring when thinking about oversight versus control in a different piece I'm writing.
I don't think that covers it fully. Corporations "need... those bureaucracies," but haven't done what would be expected otherwise.
I think we need to add that corporations are limited to doing things they can convince humans to do, are aligned with at-least-somewhat-human directors / controllers, and have a system of checks and balances: people are able to whistleblow, and the company is constrained by law to an extent that people need to worry about breaking it blatantly.
But I think that breaking these constraints is going to be much closer to the traditional loss-of-control scenario than what you seem to describe.
Apologies - when I said genius, I had a very high bar in mind: no more than a half dozen people alive today, who have each single-handedly created or materially advanced an entire field. And I certainly hold Scott in very high esteem, and while I don't know Sam or Jessica personally, I expect they are within throwing distance - but I don't think any of them meet this insanely high bar. And Scott's views on this, at least from ca. 2015, were a large part of what informed my thinking about this; I can't tell the difference between him and Terry Tao when speaking with them, but he can, and he said there is clearly a qualitative difference there. Similarly for other people clearly above my league, including a friend who worked with Thurston at Cornell back in 2003-5. (It's very plausible that Scott Aaronson is in this bucket as well, albeit in a different area, though I can't tell personally, and have not heard people say this directly - but he's not actually working on the key problems, and per him, he hasn't really tried to work on agent foundations. Unfortunately.)
So to be clear, I think Scott is a genius, but not one of the level that is needed to single-handedly advance the field to the point where the problem might be solved this decade, if it is solvable. Yes, he's brilliant, and yes, he has unarguably done a large amount of the most valuable work in the area in the past decade, albeit mostly more foundational than what is needed to solve the problem. So if we had another dozen people of his caliber at each of a dozen universities working on this, that would be at least similar in magnitude to what we have seen in fields that have made significant progress in a decade - though even then, not all fields like that see progress.
But the Tao / Thurston level of genius, usually in addition to the above-mentioned 100+ top people working on the problem, is what has given us rapid progress in the past in fields where such progress was possible. This may not be one of those areas - but I certainly don't expect that we can do much better than other areas with much less intellectual firepower, hence my above claim that humanity as a whole hasn't managed even what I'd consider a half-assed semi-serious attempt at solving a problem that deserves an entire field of research working feverishly to try our best to actually not die - and not just a few lone brilliant researchers.
One thing though I kept thinking: Why doesn’t the article mention AI Safety research much?
Because, as everyone who has looked at the problem seems to agree, almost none of current AI safety research can make a future agentic ASI safe if it isn't already aligned with human values. And the Doomers certainly have been clear about this, even as most of the funding goes to prosaic alignment.
I hate to be insulting to a group of people I like and respect, but "the best agent foundations work that's happened over ~10 years of work" was done by a very small group of people who, despite being very smart, certainly smarter than myself, aren't academic superstars or geniuses (Edit to add: on a level that is arguably sufficient, as I laid out in my response below.) And you agree about this. The fact that they managed to make significant progress is fantastic, but substantial progress on deep technical problems is typically due to (ETA: only-few-in-a-generation level) geniuses, large groups of researchers tackling the problem, or usually both. And yes, most work on the topic won't actually address the key problem, just like most work in academia does little or nothing to advance the field. But progress happens anyways, because intentionally or accidentally, progress on problems is often cumulative, and as long as a few people understand the problem that matters, someone usually actually notices when a serious advance occurs.
I am not saying that more people working on the progress and more attention would definitely crack the problems in the field this decade, but I certainly am saying that humanity as a whole hasn't managed even what I'd consider a half-assed semi-serious attempt.
I think this is correct, but doesn't seem to note the broader trend towards human disempowerment in favor of bureaucratic and corporate systems, which this gradual disempowerment would continue, and hence elides or ignores why AI risk is distinct.
"when, if ever, our credences ought to capture indeterminacy in how we weigh up considerations/evidence"
The obvious answer is only when there is enough indeterminacy to matter; I'm not sure if anyone would disagree. Because the question isn't whether there is indeterminacy, it's how much, and whether it's worth the costs of using a more complex model instead of doing it the Bayesian way.
I'd be surprised if many/most infra-Bayesians would endorse suspending judgment in the motivating example in this post
You also didn't quite endorse suspending judgement in that case - "If someone forced you to give a best guess one way or the other, you suppose you’d say “decrease”. Yet, this feels so arbitrary that you can’t help but wonder whether you really need to give a best guess at all…" So, yes, if it's not directly decision relevant, sure, don't pick, say you're uncertain. Which is best practice even if you use precise probability - you can have a preference for robust decisions, or a rule for withholding judgement when your confidence is low. But if it is decision relevant, and there is only a binary choice available, your best guess matters. And this is exactly why Eliezer says that when there is a decision, you need to focus your indeterminacy, and why he was dismissive of DS and similar approaches.
I’m not merely saying that agents shouldn’t have precise credences when modeling environments more complex than themselves
You seem to be underestimating how pervasive / universal this critique is - essentially every environment is more complex than we are, at the very least when we're embedded agents, or when other humans are involved. So I'm not sure how your criticism (which I agree with) does more than the basic argument does in its very strong form - it just seems to state it more clearly.
The problem is that Kolmogorov complexity depends on the language in which algorithms are described. Whatever you want to say about invariances with respect to the description language, this has the following unfortunate consequence for agents making decisions on the basis of finite amounts of data: For any finite sequence of observations, we can always find a silly-looking language in which the length of the shortest program outputting those observations is much lower than that in a natural-looking language (but which makes wildly different predictions of future data).
Far less confident here, but I think this isn't correct as a matter of practice. Conceptually, Solomonoff doesn't say "pick an arbitrary language once you've seen the data and then do the math"; it says "pick an arbitrary language before you've seen any data and then do the math." And if we need to implement the silly-looking language, there is a complexity penalty to doing that, one that's going to be similarly large regardless of what baseline we choose, and we can determine how large it is by reducing the language to some other language. (And I may be wrong, but picking a language cleverly should not mean that Kolmogorov complexity will change something requiring NP programs to encode into something that P programs can encode, so this criticism seems weak anyways outside of toy examples.)
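To spell out the complexity penalty I'm gesturing at, this is just the usual invariance theorem (stated loosely from memory, so treat the exact form as a sketch):

$$\left|K_{L_1}(x) - K_{L_2}(x)\right| \le c(L_1, L_2) \quad \text{for all strings } x,$$

where $c(L_1, L_2)$ is roughly the length of an interpreter for one universal language written in the other, and does not depend on $x$. A silly-looking language can shave description length off one particular finite dataset, but only by hiding that fixed interpreter cost in the language choice - and the cost is the same no matter what data shows up later.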
Strongly agree. I was making a narrower point, but the metric is clearly different than the goal - if anything it's more surprising that we see so much correlation as we do, given how much it has been optimized.
Toby Ord writes that "the required resources [for LLM training] grow polynomially with the desired level of accuracy [measured by log-loss]." He then concludes that this shows "very poor returns to scale," and christens it the "Scaling Paradox." (He does go on to point out that this doesn't imply it can't create superintelligence - and I agree with him about that.)
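To make the "polynomial" claim concrete, here is the standard empirical power-law form that scaling laws are usually fit to (my own sketch of the usual parameterization, not Ord's notation):

$$L(C) \approx L_{\infty} + \frac{a}{C^{\,b}} \quad\Longrightarrow\quad C \approx \left(\frac{a}{L - L_{\infty}}\right)^{1/b},$$

for some irreducible loss $L_{\infty} \ge 0$ and fitted constants $a, b > 0$. Cutting the reducible loss $L - L_{\infty}$ by a factor of $k$ costs roughly a factor of $k^{1/b}$ more compute - which is exactly "resources grow polynomially with accuracy."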
But what would it look like if this were untrue? That is, what would be the conceptual alternative, where required resources grow more slowly? I think the answer is that it's conceptually impossible.
To start, there is a fundamental bound on loss at zero, since the best possible model perfectly predicts everything - it exactly learns the distribution. This can happen when overfitting a model, but it can also happen when there is a learnable ground truth; a model trained to learn a polynomial function can learn it exactly.
But there is strong reason to expect the bound to be significantly above zero loss. The training data for LLMs contains lots of aleatory randomness - things that are fundamentally, conceptually unpredictable. I think it's likely that things like RAND's random number book are in the training data, and it's fundamentally impossible to predict randomness. I think something similar is generally true for many other things - predicting word choice for semantically equivalent words, predicting where typos occur, etc.
Aside from being bounded well above zero, there's a strong reason to expect that scaling is required to reduce loss for some tasks. In fact, it's mathematically guaranteed to require significant computation to get near that level for many tasks that are in the training data. Eliezer pointed out that GPTs are predictors, and gave the example of a list of numbers followed by their two prime factors. It's easy to generate such a list by picking pairs of primes and multiplying them, then writing the answer first - but decreasing loss for generating the next token to predict the primes from the product will definitionally require exponentially more computation to perform better for larger primes.
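As a concrete sketch of that example (my own toy illustration with made-up formatting, not Eliezer's): generating the data in the forward direction is trivial, while predicting the continuation requires factoring.

```python
import random

def is_prime(n):
    """Trial division - plenty fast for the small primes used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def random_prime(lo=1000, hi=10000):
    """Rejection-sample a prime in [lo, hi)."""
    while True:
        candidate = random.randrange(lo, hi)
        if is_prime(candidate):
            return candidate

def training_line():
    # Product first, factors second: cheap to write, since we already
    # know p and q - but a next-token predictor that only sees the
    # product is being asked to factor it.
    p, q = random_prime(), random_prime()
    return f"{p * q} = {p} * {q}"

for _ in range(3):
    print(training_line())
```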
And I don't think this is the exception, I think it's at least often the rule. The training data for LLMs contains lots of data where the order of the input doesn’t follow the computational order of building that input. When I write an essay, I sometimes arrive at conclusions and then edit the beginning to make sense. When I write code, the functions placed earlier often don’t make sense until you see how they get used later. Mathematical proofs are another example where this would often be true.
An obvious response is that we’ve been using exponentially more compute for better accomplishing tasks that aren’t impossible in this way - but I’m unsure if that is true. Benchmarks keep getting saturated, and there’s no natural scale for intelligence. So I’m left wondering whether there’s any actual content in the “Scaling Paradox.”
(Edit: now also posted to my substack.)
True, and even more, if optimizing for impact or magnitude has Goodhart effects of various types, then even otherwise-good directions are likely to be ruined by pushing on them too hard. (In large part because it seems likely that the space we care about will not have linear divisions into good and bad; there will be much more complex regions, and even when pointed in a direction that is locally better, pushing too far is possible, and very hard to predict from local features even if people try, which they mostly don't.)
I think the point wasn't having a unit norm, it was that impact wasn't defined as directional, so we'd need to remove the dimensionality from a multidimensionally defined direction.
So to continue the nitpicking, I'd argue impact = || Magnitude * Direction ||, or better, ||Impact|| = Magnitude * Direction, so that we can talk about size of impact. And that makes my point in a different comment even clearer - because almost by assumption, the vast majority of those with large impact are pointed in net-negative directions, unless you think either a significant proportion of directions are positive, or that people are selecting for it very strongly, which seems not to be the case.
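To make the notation in my nitpick precise (my own formalization, and only one possible convention - per my other comment, the direction need not be unit-norm): if the direction is a unit vector $\hat{d}$ and the magnitude is a scalar $m \ge 0$, then

$$\vec{I} = m\,\hat{d}, \qquad \|\vec{I}\| = m\,\|\hat{d}\| = m,$$

so the size of the impact is just the magnitude, and whether that impact is net-positive or net-negative is carried entirely by the direction.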
I think some of this is on target, but I also think there's insufficient attention to a couple of factors.
First, in the short and intermediate term, I think you're overestimating how much most people will actually update their personal feelings around AI systems. I agree that there is a fundamental reason that fairly near-term AI will be able to function as a better companion and assistant than humans - but as a useful parallel, we know that nuclear power is fundamentally better than most other power sources that were available in the 1960s, yet people's semi-irrational yuck reaction to "dirty" or "unclean" radiation - far more than the actual risks - made it publicly unacceptable. Similarly, I think the public perception of artificial minds will be generally pretty negative, especially looking at current public views of AI. (Regardless of how appropriate or good this is in relation to loss-of-control and misalignment, it seems pretty clearly maladaptive for generally friendly near-AGI and AGI systems.)
Second, I think there is a paperclip-maximizer aspect to status competition, in the sense Eliezer uses the concept. Specifically, given massively increased wealth, abilities, and capacity, even if an implausibly large 99% of humans find great ways to enhance their lives in ways that don't devolve into status competition, there are few other domains where an indefinite amount of wealth and optimization power can be applied usefully. Obviously, this is at best zero-sum, but I think there aren't lots of obvious alternative places for positive-sum indefinite investments. And even where such positive-sum options exist, they often are harder to arrive at as equilibria. (We see a similar dynamic with education, housing, and healthcare, where increasing wealth leads to competition over often artificially-constrained resources rather than expansion of useful capacity.)
Finally and more specifically, your idea that we'd see intelligence enhancement as a new (instrumental) goal in the intermediate term seems possible and even likely, but not a strong competitor for, nor inhibitor of, status competition. (Even ignoring the fact that intelligence itself is often an instrumental goal for status competition!) Even aside from the instrumental nature of the goal, I will posit that strongly diminishing returns to investment will set in at some point - regardless of the fact that it's unlikely on priors that these limits are near current levels. Once that point is reached, the indefinite investment of resources will trade off between more direct status competition and further intelligence increases, and as the latter shows diminishing returns, as noted above, the former becomes the metaphorical paperclip which individuals can invest in indefinitely.
my uninformed intuition is that the people with the biggest positive impact on the world have prioritized the Magnitude
That's probably true, but it's selecting on the outcome variable. And I'll bet that the people with the biggest negative impact are even more overwhelmingly also those who prioritized magnitude.
"If you already know that an adverse event is highly likely for your specific circumstances, then it is likely that the insurer will refuse to pay out for not disclosing "material information" - a breach of contract."
Having worked in insurance, that's not what the companies usually do. Denying explicitly for clear but legally hard to defend reasons, especially those which a jury would likely rule against, isn't a good way to reduce costs and losses. (They usually will just say no and wait to see if you bother following up. Anyone determined enough to push to get a reasonable claim is gonna be cheaper to pay out for than to fight.)
Yes - the word 'global' is a minimum necessary qualification for referring to catastrophes of the type we plausibly care about - and even then, it is not always clear that something like COVID-19 was too small an event to qualify.
I definitely appreciate that confusion. I think it's a good reason to read the sequence and think through the questions clearly; https://www.lesswrong.com/s/p3TndjYbdYaiWwm9x - I think this resolves the vast majority of the confusion people have, even if it doesn't "answer" the questions.
The math is good, the point is useful, the explanations are fine, but embracing the straw-Vulcan version of rationality and dismissing any notion of people legitimately wanting things other than money seems really quite bad, which leaves me wishing this wasn't being highlighted for visitors to the site.