Sometimes the less well-justified method even wins. TRPO is very principled if you want to "not update too far" from a known good policy, since it enforces a KL-divergence trust region (approximated in practice via a Taylor expansion of the constraint). PPO is less principled but works better. It's not clear to me that in ML capabilities one should try to be more like Bengio in having better models, rather than just getting really fast at running experiments and iterating.
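To make that contrast concrete, here is a minimal sketch (toy numbers and function names are mine, not from any particular codebase) of PPO's clipped surrogate next to a KL-penalized objective in the TRPO spirit:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # PPO: just clip the probability ratio instead of enforcing
    # an explicit KL trust region the way TRPO does.
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

def kl_penalized_objective(ratio, advantage, kl, beta=1.0):
    # The more "principled" flavor: penalize divergence from the old
    # policy directly (TRPO instead constrains it via a Taylor expansion).
    return ratio * advantage - beta * kl

# Toy comparison on a single sample
ratio, advantage, kl = 1.5, 2.0, 0.08
print(ppo_clipped_objective(ratio, advantage))      # 2.4 (ratio clipped to 1.2)
print(kl_penalized_objective(ratio, advantage, kl)) # 2.92
```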
This seems to also have happened in alignment, and I especially count RLHF here, and all the efforts to make AI nice, which I think show a pretty important point: less justified/principled methods can and arguably do win over more principled methods like the embedded agency research, or a lot of decision theory research from MIRI, or the modern OAA plan from Davidad, or arguably ~all of the research that LessWrong did before roughly 2014-2016.
If you were less charitable than I am, this would explain a lot about why AI safety advocates want to regulate AI companies so much: the companies are offering at least a partial solution, if not a full solution, to the alignment and safety problem that doesn't require much slowdown in AI progress, doesn't require donations to MIRI or classic AI safety organizations, and doesn't require much coordination. That threatens AI safety funding sources and stokes fears that their preferred solution, slowing down AI, won't be implemented.
Cf this tweet and the text below:
https://twitter.com/Rocketeer_99/status/1706057953524977740
It's like degrowth or dieting or veganism; people come up with a solution that makes things better but requires personal sacrifice and then make that solution a cornerstone of personal moral virtue. Once that's your identity, any other solutions to the original problem are evil.
Not in these cases, mostly because they are independent movements, so we aren't dealing with any potential conflict points. Very critically, no claim was made about the relative importance of each cause, which also reduces many of the frictions.
Even assuming AI safety people were right to imply, especially in public, that their cause was far more important than others, this would probably make other people in the AI space rankle at it and call it a distraction, because it means their projects are, or at least appear to be, less important than ours.
There are also more general issues where sometimes one movement can break other movements with bad decisions, though.
A point to remember is that, to a large extent, this dynamic is driven by a winner-take-all effect: on the internet, if your issue isn't getting attention, that can essentially be the death knell of your movement. To be a little blunt, AI safety on LessWrong was extraordinarily successful at getting attention, and because of implicit/explicit claims that AI safety was far more important than any other issue, other issues like AI ethics and AI bias, and the movements around them, lost a lot of their potency and oxygen. So AI safety is predictably getting criticism for distracting from their issues.
Cf habryka's observation that strong action/shouting is basically the only way to get heard, otherwise the system of interest neutralizes the concern and continues on much like it did before.
The existence of a valid state, and a conceivable path to reach that state, is not enough to justify a claim that that state will be observed with non-negligible probability.
This is also why I'm not a fan of the common argument that working on AI risk is worth it even if you have massive uncertainty, since there's a vast gap between logically possible and actionable probability.
Also, prospect theory tells us that we systematically overestimate small probabilities, so we should treat small-probability arguments as essentially exploits/adversarial attacks on our reasoning by default.
Hm, what does this mean for your argument that set theory is uncomputable by a Turing machine, for example?
My focus was on the more philosophical/impractical side. For the computers we can actually build in principle, assuming the laws of physics are unchangeable and our current understanding is basically correct, we can't even build Universal Turing Machines/Turing-complete systems, only linear bounded automata, due to the holographic principle.
Also, the entire hierarchy can be solved simply by allowing non-uniform computational models, which is yet another benefit of non-uniformity.
There are 2 things to be said here:
- I didn't say that it had to return an answer. The halting problem for Turing machines is essentially that even though a program either halts or doesn't halt, which in this case can be mapped to true or false, there is no algorithm running on a Turing machine that can always tell us which of the two cases holds.
- Gödel's incompleteness theorems are important, but in this context we can philosophically enhance the definition of a computer to solve essentially arbitrary problems. The validity problem of first-order logic becomes solvable by adding an oracle tape or a closed timelike curve to a Turing machine, and at that point we can decide the validity and satisfiability problems of first-order logic.
You also mentioned that oracles like oracle tapes can provide the necessary interpretation for set theory statements to be encoded, which was my goal: a Turing machine or other computer with an oracle tape gets around the first incompleteness theorem by violating an assumption necessary to get the result.
So we can, in a manner of speaking, encode them as programs to be run on a Turing machine with an oracle tape. That's not too hard to do once we use stronger models of computation, and thus we can still encode set theory statements in more powerful computers.
So I'm still basically right, philosophically speaking, in that we can always encode set theory/mathematics statements in a program, using the trick I described of converting set theory into quantified statements and then looking at the first quantifier to determine whether halting or not-halting corresponds to true or false.
A neat little result, that set theory is RE-hard, and damn this is a very large set, so large that it's larger than every other cardinality.
This might be one of the few set theories that can't be completely solved even with non-uniformity, since with some non-uniform models of computation, if we could build them, we could solve every language.
An example is provided on the 14th page of this paper:
https://arxiv.org/pdf/0808.2669.pdf
And this seems like a great challenge for the Universal Hypercomputer defined here, in that it could compute the entire universe of sets V using very weird resources.
Basically, it has to do with the fundamental issue of the von Neumann bottleneck: there is a massive imbalance between memory and computation. While LLMs and human brains differ a lot in their algorithms, another, non-algorithmic difference is that the human brain has far more memory than pretty much any GPT, and indeed than basically all AI that exists.
Besides, more memory is good anyways.
And that causes issues when you try simulating an entire brain at high speed; in particular it becomes a large problem when you constantly have to wait as data shuffles between memory and compute.
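A back-of-the-envelope sketch of the bottleneck (the numbers are illustrative round figures I'm assuming, not measurements):

```python
# Why memory traffic, not raw FLOPs, tends to dominate large-model inference.
params = 70e9            # parameters in a hypothetical 70B-parameter model
bytes_per_param = 2      # fp16 weights
hbm_bandwidth = 3.35e12  # bytes/s, roughly an H100's HBM bandwidth

weight_bytes = params * bytes_per_param
time_per_token = weight_bytes / hbm_bandwidth  # every weight streamed once per token
print(f"~{time_per_token * 1e3:.0f} ms/token just to move weights from memory")
# ~42 ms/token: the arithmetic units mostly sit idle waiting on memory.
```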
Now my question is, how complicated is the domain of discourse of the sets, exactly?
This is false. It only works if you don't use too complex a sequence of quantifiers. It is also limited to arithmetic and doesn't work with e.g. set theory.
Hm, how is this procedure not going to work for certain classes of statements like set theory? Where does the encoding process fail?
Because I do remember a Stack Exchange post which claimed we can always take a logical statement and express it as a program such that the statement is true if the program halts and false if it doesn't.
With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today's AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There's an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he's only partially facing up to the fact that his work will considerably hasten ASI, including extremely dangerous systems.
Over the past decade I've met many AI safety people who speak as though "AI capabilities" and "AI safety/alignment" work is a dichotomy. They talk in terms of wanting to "move" capabilities researchers into alignment. But most concrete alignment work is capabilities work. It's a false dichotomy, and another example of how a conceptual error can lead a field astray. Fortunately, many safety people now understand this, but I still sometimes see the false dichotomy misleading people, sometimes even causing systematic effects through bad funding decisions.
"Does this mean you oppose such practical work on alignment?" No! Not exactly. Rather, I'm pointing out an alignment dilemma: do you participate in practical, concrete alignment work, on the grounds that it's only by doing such work that humanity has a chance to build safe systems? Or do you avoid participating in such work, viewing it as accelerating an almost certainly bad outcome, for a very small (or non-existent) improvement in chances the outcome will be good? Note that this dilemma isn't the same as the by-now common assertion that alignment work is intrinsically accelerationist. Rather, it's making a different-albeit-related point, which is that if you take ASI xrisk seriously, then alignment work is a damned-if-you-do-damned-if-you-don't proposition.
I think this is sort of a flipside to the following point: Alignment work is incentivized as a side effect of capabilities, and there is reason to believe that alignment and capabilities can live together without either of them being destroyed. The best example really comes down to the jailbreak example, where the jailbreaker has aligned it to them, and controls the AI. The AI doesn't jailbreak itself and is unaligned, instead the alignment/control is transferred to the jailbreaker. We truly do live in a regime where alignment is pretty easy, at least for LLMs. And that's good news compared to AI pessimist views.
The tweet is below:
https://twitter.com/QuintinPope5/status/1702554175526084767
This is also important, in the sense that alignment progress will naturally raise misuse risk: solutions to the control problem look very different from solutions to the misuse problems of AI, and one implication is that accelerating is far less bad, and can actually look very positive, if misuse is the main concern.
This is a point Simeon raised in this link, where he states a tradeoff between misuse and misalignment concerns here:
So this means that it is very plausible that as the control problem/misalignment is solved, misuse risk can be increased, which is a different tradeoff than what is pictured here.
So the reason why the Law of the Excluded Middle is used comes down to two reasons:
(Side note, the Law of the Excluded Middle maps False to 0, and True to 1.)
- It's a tautology, in the sense that for any input it always outputs true, which is a huge benefit since it is always right no matter what input you give it in the boolean formulation. This means that for our purposes we don't have to care about the non-boolean cases, since we can simply map their truth values to two values like true/false or 0/1.
- For a mathematical statement to be proven true or false, at a very high level, we can basically always reformulate the statement as a program that either halts (stops after finite time) or doesn't halt (runs forever), where the truth value of the statement depends on whether it starts with an existential or a universal quantifier.
Critically, a program either halts at a finite time or runs forever, and there is no middle ground, so the law of the excluded middle applies. You may not know whether it halts or runs forever, but these are the only two outcomes that a program can have.
If it starts with an existential quantifier, then if the program halts, the statement is true, and if the program doesn't halt, it's false.
If it starts with an universal quantifier, then if the program halts, the statement is false, and if the program doesn't halt, the statement is true.
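As a concrete illustration of that procedure (a standard textbook-style example, not anything specific to this thread), a universally quantified arithmetic statement like Goldbach's conjecture becomes a counterexample search that halts iff the statement is false:

```python
def goldbach_counterexample_search():
    # The conjecture starts with a universal quantifier ("every even n >= 4 ..."),
    # so by the rule above: halting => a counterexample exists => statement false;
    # running forever => statement true.
    def is_prime(k):
        return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))

    n = 4
    while True:
        if not any(is_prime(p) and is_prime(n - p) for p in range(2, n)):
            return n  # halts: n is a counterexample
        n += 2  # next even number
```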
Yep, that's the source I was looking for to find the original source of the claim.
Reber (2010) was my original source for the claim that the human brain has 2.5 petabytes of memory, but it's definitely something that got reported a lot by secondary sources like Scientific American.
I do want to ask: why exactly don't you think the 2.5 petabyte figure is right?
I think the strongest takeaway from this post, now that I think about it, is that alignment is not equal to safety, and that even if an AI is controllable, it may not be safe for everyone else.
In your Fusion Power Generator scenario, what happened is that they asked for a fusion power generator and the AI made one; it was inner- and outer-aligned enough to the principal that it didn't take over the world to build fusion power plants everywhere, and in particular didn't Goodhart the specification in a harmful way.
In essence, this seems like a standard misuse of AI (though I don't exactly like the connotations), and thus if I were to make this post, I'd focus on how aligning AI isn't enough to assure safety, or putting it another way, there are more problems in AI safety than just alignment/control problems.
I got that from googling around about the capacity of the human brain, and I found it via many sources. While the number is surprisingly high, I do think it makes some sense, especially since I remember that one big issue with AI is essentially that it has way less memory than the human brain, even when computation is at a similar level.
A lot of the reason why we usually do selection has to do with the fact that, for most purposes, once a person is ready to do economically valuable things, their traits and attributes are basically fixed by genetics, and improvement is mostly not possible.
This is an important thing to remember about humans in general, but especially for this.
A lot of the reason humans managed to conquer the natural world is that we are essentially super-cooperative, in the sense that humans can form groups of 10-1000 or more without the social system totally breaking down. We aren't much more intelligent than other animals; we are just way more cooperative in groups.
I think the human brain has around 2.5 petabytes of memory storage, which is insane compared to only 80 gigabytes of VRAM in an H100, and it does all this on 20 watts. I think this gives a lot of credence to the belief that the near future of AI will be a lot more brain-like than people think.
If the brain is basically at the limits of efficient algorithms, and we don't get new paradigms for computing, then Jacob Cannell's scenario for AI takeover would be quite right.
If algorithmic progress does have a larger effect on things, then Steven Byrnes's take on AI takeover will likely be correct.
I do want to note that probabilities 0 and 1 only correspond to no fuzziness if we assume a finite set. If we don't assume a finite set, then it's easy to cook up examples where probabilities are 0 or 1, but they aren't equivalent to either nothing or everything, and thus probabilities 0 or 1 can still introduce fuzziness.
The good news is I've strongly upvoted this back to positive territory.
However, if "aligning AI" is actually easier than "aligning the CCP" or "aligning Trump" (or whoever has a bunch of power in the next 2-20 years (depending on your timelines and how you read the political forecasts))... then maybe mass proliferation would be good?
Something like this would definitely be my reasoning. In general, a big disagreement that animates me relative to a lot of doomers like Zvi or Eliezer Yudkowsky is that, to a large extent, I think the AI accident safety problem will be solved by default, either by making AIs shutdownable, by aligning them, or by making them corrigible, and I see pretty huge progress that others don't see. I also see the lack of good predictions (beyond doom) from a lot of doomy sources, especially MIRI, that have panned out as another red flag, because it implies there isn't much reason to trust that their world-models, especially the most doomy ones, have any relation to reality.
Thus, I'm much more concerned about outcomes where we successfully align AI, but something like The Benevolence of the Butcher scenario happens, where the state and/or capitalists mostly control AI, and very bad things happen because the assumptions that held up industrial society crumble away. Very critically, one key difference between this scenario and common AI risk scenarios is that it makes a lot of anti-open source AI movements look quite worrisome, and AI governance interventions can and arguably will backfire.
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
I want to mention that regardless of who is right in this discussion, I decided to strongly upvote and agree with this comment, for the following reason: the part of your comment quoted below is very important, since this context really matters for people who don't grok sexual assault, and in particular we have good reason to believe the biases are self-serving.
It's really common for people with misogynist beliefs to declare a rape allegation "manifestly false" even when it's true, because they have a lot of false beliefs around sexual assault or around the psychology of trauma.
Gary Marcus: yes. I was told this by an inside source, around the time of ChatGPT release, others have noted it in this very thread, it is within the budget, fits empirical experience of multiple people, is in their commercial interest, and the underlying architecture has not been disclosed.
This seems like a very important claim to verify, because this essentially amounts to a claim that OpenAI is intentionally using/abusing data leakage to overfit their GPT models.
I'm already somewhat skeptical that LLMs will lead to much progress in AI by, say, 2030, even if this is false; but if it's true, it seems like a very big red flag that common LW beliefs about LLMs are shockingly wrong, which would reduce the capability trajectory of OpenAI/Google LLMs immensely, and unfortunately this is not good news from a safety perspective.
I definitely want someone to verify this claim soon.
This seems like a You Are Not Measuring What You Think You Are Measuring moment. Link below:
and IMO took nothing morally seriously
Are there any good examples of this? Because this would be pretty important for us to know.
Consider frameworks like the Bayesian probability theory or various decision theories, which (strive to) establish the formally correct algorithms for how systems embedded in a universe larger than themselves must act, even under various uncertainties. How to update on observations, what decisions to make given what information, etc. They still take on "first-person" perspective, they assume that you're operating on models of reality rather than the reality directly — but they strive to be formally correct given this setup.
Admittedly, this does have one big problem, which I'll describe below:
- Formalization is a pain in the butt, and we have good reasons to believe that formalizing things will be so hard as to be essentially impossible in practice, except in very restricted circumstances. In particular, this is one way in which rationalism and Bayesian reasoning fail to scale down: they assume either infinite computation or, in the bounded-rationality regime, the ability to solve very difficult problems like NP-complete/co-NP-complete/#P-complete/PSPACE-complete problems or worse. This is generally the reason formal frameworks don't work out very well in the real world: absent oddball assumptions about physics, we probably won't be able to solve things formally in a lot of cases, ever.
So the question is, why do you believe formalization is tractable at all for the AI safety problem?
This is exactly the situation where your question unfortunately doesn't have an answer, at least right now.
This seems like a potentially downstream issue of rationalist/EA organizations ignoring a few Chesterton's Fences that are really important, and one of those fences is not having dating/romantic relationships in an employment context where there are power asymmetries. These can easily lead to abuse or worse.
In general, one impression I get from a lot of rationalist/EA organizations is that there are very few boundaries between work, romantic/dating, and potentially living situations, depending on the organization; and the boundaries they do have are either much too illegible and high-context (especially social context) and/or way too porous, in that they can be easily violated.
Yes, there are no preformed Cartesian boundaries we can use, but that doesn't stop us from at least forming approximate boundaries and enforcing them. While legible norms are never fun and have their costs, I think the benefits of legible norms, especially epistemically legible norms around dating and romance in an employment context, are very, very high value, so much so that the downsides aren't enough to make enforcing them bad overall. I'd say somewhat similar things about legible norms on living situations, pay, etc.
There is a questionable trend to equate ML skills with the ability to do alignment work.
I'd arguably say this is good, primarily because I think EA was already in danger of its AI safety wing becoming unmoored from reality by ignoring key constraints, similar to how early LessWrong, before the deep learning era of roughly 2012-2018, turned out to be mostly useless: so much was stated in an abstract mathematical way, without realizing how many constraints and conjectured constraints applied to things like formal provability, for example.
I think this is more so a longtermist/non-longtermist divide than a selfish/altruistic divide.
But yeah, whether you buy long-term ethics or not, and how much you discount is going to make some surprising differences about how much you support AI progress. Indeed, I'd argue that a big part of the reason why LW/EA has flirted with extreme slowdowns/extreme policies on AI has to do with the overrepresentation of very, very longtermist outlooks.
One practical point is that for most purposes you should focus far less on long-term impacts, even as a longtermist, since people are in general very bad at predicting anything beyond, say, 20 years, and the most important implication is that trying to plan over a longer horizon leads essentially nowhere.
This means that for our purposes, we can cut out all the potential future generations but one, and we can probably do more than that, and radically cut the expected value of AI risk and general existential risk.
While I definitely should have been more polite in expressing those ideas, I do think they're important to convey, especially the first one, as I really, really don't want people to burn themselves out or get anxiety/depression from doing something they don't want to do, or don't even like doing.
I definitely will be nicer about expressing those ideas, but they're so important that I do think something like the insights need to be told to a lot of people, especially those in the alignment community.
This is very, very dependent on what assumptions you fundamentally make about the nature of physical reality, and what assumptions you make about how much future civilizations can alter physics.
I genuinely think that if you want to focus on the long term, unfortunately we'd need to solve very, very difficult problems in physics to reliably give answers.
For the short-term limitations that are relevant to AI progress, I'd argue the biggest one is probably thermodynamics, and in particular the Landauer limit is a good approximation of why you can't make radically better nanotechnology than life without getting into extremely weird territory, like reversible computation.
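For reference, a quick calculation of the Landauer bound (standard physics, nothing specific to this discussion; the temperature is a round room-temperature figure):

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K
T = 300             # roughly room/body temperature, K

landauer_joules_per_bit = k_B * T * math.log(2)
print(f"{landauer_joules_per_bit:.2e} J per irreversible bit erasure")
# ~2.9e-21 J/bit: a hard floor at this temperature for any computer
# that erases information, biological or engineered, barring reversible computing.
```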
Yep, the latency and performance are real killers for embodied type cognition. I remember a tweet that suggested the entire Internet was not enough to train the model.
For a lot of people, especially people that aren't psychologically stable, this is very, very good advice in general around existential risk.
To be clear, I think he has an overly pessimistic worldview on existential risk, but I genuinely respect your friend for recognizing that his capabilities weren't enough to tackle it productively, and for backing away from the field once he realized his own limitations.
Unless we answer these questions, then even if we believe that universes with other laws of physics exist, it does not allow us to make specific predictions. A theory that doesn't make predictions is useless.
I tend to think that this is the biggest reason why despite thinking the Tegmark Multiverse theory is right, it ultimately sort of adds up to normality for most purposes, and why in practice it's not so useful for very much.
But as Gödel's incompleteness theorem and the halting problem demonstrate, this is an incoherent assumption, and attempting to take it as a premise will lead to contradictions.
Not really. What Gödel's incompleteness theorem basically says is that no consistent, effectively axiomatized system capable of arithmetic can prove all truths about the natural numbers. However, in certain other models of computation we can decide the truth of all arithmetic statements, and really any statement in first-order logic can be decided to be valid or not. You are overstating the results.
Similarly, the halting problem is only unsolvable for Turing machines. There's a further issue, in that even oracles for the halting problem cannot solve their own halting problems, but it turns out the basic assumption at fault is that computers had to be uniform, that is, a single machine/circuit had to handle all input sizes of a problem.
If we allow non-uniform circuits or advice strings, then we can indeed compose different circuits for different input sizes to make a logically omniscient machine. Somewhat similarly, if we can get an advice string or precompute what we need to test before testing it, we can again make a logically omniscient machine.
Note that I'm only arguing that it makes logical sense to assume logical omniscience; I'm not arguing it's useful or that it makes physical sense to do so. (I generally think it's not too useful, because if we assume logical omniscience we can solve every problem, and we've thus assumed away the constraints that are taut in real life.)
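To illustrate what non-uniformity buys here, a purely illustrative sketch (the advice table is hypothetical and hand-written; the whole point is that no single algorithm generates every table):

```python
def make_halting_decider(advice_by_length):
    # Non-uniform "decider": for each input length n, the advice is a finite
    # lookup table recording which of the finitely many programs of length n halt.
    # Each table is trivial to use once given; the non-uniformity is that the
    # family of tables is not produced by any single Turing machine.
    def decides(program: str) -> bool:
        table = advice_by_length[len(program)]  # supplied as advice, never computed
        return table[program]
    return decides

# Toy usage with a hand-written advice table for length-3 "programs"
advice = {3: {"abc": True, "xyz": False}}
halts = make_halting_decider(advice)
print(halts("abc"))  # True, read straight off the advice string
```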
The proof that a probabilistic machine with probabilistic advice and a CTC can solve every language is in this paper, albeit it's several pages down.
https://arxiv.org/abs/0808.2669
Similarly, the proof that a Turing Machine + CTC can solve the halting problem is in this paper below:
Yep, I was thinking about NP problems, though #P problems for the counting version would count as well.
To be somewhat more fair, there are probably thousands of problems with the property that they are much easier to check than to solve, and while alignment research is maybe not one of them, I do think there's a general gap between verifying a solution and actually finding it.
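A toy illustration of that verify-vs-solve gap (subset sum with made-up numbers; the names are mine):

```python
from itertools import combinations

def verify_subset_sum(nums, subset, target):
    # Checking a proposed solution takes linear time...
    return all(0 <= i < len(nums) for i in subset) and sum(nums[i] for i in subset) == target

def solve_subset_sum(nums, target):
    # ...while finding one may take exponentially many attempts.
    for r in range(len(nums) + 1):
        for subset in combinations(range(len(nums)), r):
            if sum(nums[i] for i in subset) == target:
                return subset
    return None

nums = [3, 34, 4, 12, 5, 2]
print(solve_subset_sum(nums, 9))           # (2, 4): nums[2] + nums[4] = 4 + 5 = 9
print(verify_subset_sum(nums, (2, 4), 9))  # True
```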
Introducing the current computer security norms into biology without adjustment for the different circumstances means we, very likely, all die.
Only because of the issue that a mere catastrophe could potentially leave our population unable to ever recover. If you discount that effect, we're not even sure it's possible at all for a biological infection to kill us all, and even if it is, I expect it to require way more implementation effort than people think.
I feel like this is either misinformation or very close to it.
https://www.lesswrong.com/posts/8NPFtzPhkeYZXRoh3/perpetually-declining-population
While I definitely agree that a fight between humanity and AGI will never look like "humanity vs AGI", due to the problems with the abstraction of "humanity", one key disagreement I have with this comment is that I don't think there is no fire alarm for AGI. In general, my model is that, if anything, a lot of people will support very severe restrictions on AI and AI progress for safety. I think this already happened several months ago, when people got freaked out about AI, and that was merely GPT-4. We will get a lot of fire alarms, especially via safety incidents. A lot of people are already primed for apocalyptic narratives, and if AI progresses in a big way, this will fan the flames into a potential AI-killer, supported by politicians. It's not impossible for tech companies to defuse this, but damn is it hard.
I worry about the opposite problem, in that if existential risk concerns look less and less likely, AI regulation may nonetheless become quite severe, and the AI organizations built by LessWrongers have systematic biases that will prevent them from updating to this position.
For myself, I've come back to believing that AI doom is probably worth worrying about a little, and I no longer view AI doom as basically a non-problem, due to new studies.
RE viewing this as a conflict, I agree with this mindset, but with one caveat: There are also vast prior and empirical disagreements too, and while there is a large conflict of values, it's magnified even larger by uncertainty.
I definitely agree with this take, because I generally hate the society/we abstractions, but one caveat is that AI resistance by a lot of people could be very strong, especially because the public is probably primed for apocalyptic narratives.
These tweets are at least somewhat of an argument that AI will be resisted pretty heavily.
https://twitter.com/daniel_271828/status/1696794764136562943
Daniel Eth: Woah, the public is for even more restrictive AI regulations than even I am.
https://twitter.com/daniel_271828/status/1696770364549087310
Daniel Eth: I honestly think everyone in this debate - from accelerationists to safetyists - is underestimating this factor.
My remark was just a reminder that 1 and 0 are not probabilities, in the same sense that infinity is not a real number.
This is admittedly a minor internet crusade/pet peeve of mine, but the claim that 1 and 0 aren't probabilities is exactly wrong, and the analogy is pretty strained here. In fact, probability theory needs to have 0 and 1 as legitimate probabilities, or things fall apart into incoherency.
The post you linked is one of the most egregiously wrong things Eliezer has ever said that is purely mathematical.
And we don't just imagine measure/probability 0 or 1 sets; we have proved that certain sets of this kind exist.
There are fundamental disanalogies that make infinity not a number (except in the extended real line and the projective real line), compared to 0 and 1 being probabilities.
Re getting to a probability-1 outcome, that's actually surprisingly easy to do sometimes, especially with infinite sets like the real numbers. It's not trivial, but you can get such outcomes sometimes.
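A standard worked example (textbook measure theory, not specific to this thread): take a uniform random variable on the unit interval.

```latex
X \sim \mathrm{Uniform}[0,1]:\qquad
\Pr[X = 0.5] = 0 \ \text{(yet $X = 0.5$ is a possible outcome)},\qquad
\Pr[X \neq 0.5] = 1 \ \text{(yet the event is not the whole sample space)}.
```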
Something like this is argued to be why humans are frankly exceptionally well aligned to basic homeostatic drives; the only real failure modes are basically obesity, drugs, and maybe alcohol as things that misaligned us with basic needs. Hedonic treadmills/loops essentially tame the RL part of us and make sure that reward isn't the optimization target in practice, as in TurnTrout's post below:
https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target
Similarly, the two beren posts below explain how PID-style control loops and homeostatic rewards may be helpful for alignment:
https://www.lesswrong.com/posts/3mwfyLpnYqhqvprbb/hedonic-loops-and-taming-rl
https://www.beren.io/2022-11-29-Preventing-Goodheart-with-homeostatic-rewards/
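In that spirit, here is a minimal sketch of a homeostatic reward (my own toy formulation, assuming a quadratic penalty around setpoints, not code from either post): reward peaks when internal variables sit at their setpoints, so pushing any single variable arbitrarily high makes things worse rather than better, which limits Goodharting.

```python
import numpy as np

def homeostatic_reward(state, setpoints, weights):
    # Reward is maximal (zero) at the setpoints and falls off quadratically
    # with deviation, so there is no incentive to maximize any single variable.
    deviation = np.asarray(state, dtype=float) - np.asarray(setpoints, dtype=float)
    return -float(np.dot(weights, deviation ** 2))

setpoints, weights = [5.0, 37.0], [1.0, 1.0]
print(homeostatic_reward([5.2, 36.9], setpoints, weights))   # ~ -0.05: near setpoint, high reward
print(homeostatic_reward([12.0, 37.0], setpoints, weights))  # -49.0: "more" is strictly worse
```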
I think the big implication for now is that the scaling hypothesis for LLMs, at least if we require them to be bounded scaling, is probably false for far more scaling effort than we realized, and this extends AI timelines by quite a bit.