Wow.
Mark whether to make your responses private; ie exclude them when this data is made public. Keep in mind that although it should in theory be difficult to identify you from your survey results, it may be possible if you have an unusual answer to certain questions, for example your Less Wrong karma. Please also be aware that even if this box is checked, the person collecting the surveys (Screwtape) will be able to see your results (but will keep them confidential)
This sounds like there should be a checkbox here, but I see two "spherical" response options instead.
I didn't fill it out yet, but I just want to say that I appreciate all of the survey being on one page rather than requiring you to fill out all the answers on the first page and then click "next page" to see more questions. That would have been particularly annoying given the "do you want your answers to be made private" question - I want to see what I'm asked before I can tell whether I want to keep my answers private! Kudos.
I mostly agree with what you say, just registering my disagreement/thoughts on some specific points. (Note that I haven't yet read the page you're responding to.)
Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.
Maybe? Depends on what exactly you mean by the word "might", but it doesn't seem obvious to me that this would need to be the case. My intuition, from seeing the kinds of interpretability results we've seen so far, is that within less than a decade we'd already have a pretty rigorous theory and toolkit for answering these kinds of questions. At least assuming that we don't keep switching to LLM architectures that work based on entirely different mechanisms and make all of the previous interpretability work irrelevant.
If by "might" you mean something like "there's at least a 10% probability that this could take decades to answer", then sure, I'd agree with that. Now, I haven't actually thought about this specific question very much before seeing it pop up in your post, so I might radically revise my intuition if I thought about it more, but at least it doesn't seem immediately obvious to me that I should assign a very high probability to "it would take decades of work to answer this".
Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?
I would assume the intuition to be something like "if they're simple, then given the ability to experiment on minds and access AI internals, it will be relatively easy to figure out how to make the same drives manifest in an AI; the amount of (theory + trial and error) required for that will not be as high as it would be if the drives were intrinsically complex".
We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.
That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.
There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.
Of course, if there was something very unexpected and surprising in the newspaper, that might cause a bigger update, but I expect that they would also have reasonably good models of the kinds of things that are likely to trigger major updates or significant emotional shifts in me. If they were at all competent, that's specifically the kind of thing that I'd expect them to work on trying to find out!
And even if there was a major shift, I think it's basically unheard of that literally everything about my thoughts and behavior would change. When I first understood the potentially transformative impact of AGI, it didn't change the motor programs that determine how I walk or brush my teeth, nor did it significantly change what kinds of people I feel safe around (aside from some increase in trust toward other people who I felt "get it"). I think that human brains quite strongly preserve their behavior and prediction structures, just adjusting them somewhat when faced with new information. Most of the models and predictions you've made about an adult will tend to stay valid, though of course with children and younger people there's much greater change.
Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.
In some sense yes, but it does also seem to me that prediction and desire do get conflated in humans in various ways, such that it would sometimes be misleading to say that the people in question want the behavior. For example, I think about this post by @romeostevensit often:
Fascinating concept that I came across in military/police psychology dealing with the unique challenges people face in situations of extreme stress/danger: scenario completion. Take the normal pattern completion that people do and put fear blinders on them so they only perceive one possible outcome and they mechanically go through the motions *even when the outcome is terrible* and there were obvious alternatives. This leads to things like officers shooting *after* a suspect has already surrendered, having overly focused on the possibility of needing to shoot them. It seems similar to target fixation where people under duress will steer a vehicle directly into an obstacle that they are clearly perceiving (looking directly at) and can't seem to tear their gaze away from. Or like a self fulfilling prophecy where the details of the imagined bad scenario are so overwhelming, with so little mental space for anything else that the person behaves in accordance with that mental picture even though it is clearly the mental picture of the *un*desired outcome.
I often try to share the related concept of stress induced myopia. I think that even people not in life or death situations can get shades of this sort of blindness to alternatives. It is unsurprising when people make sleep a priority and take internet/screen fasts that they suddenly see that the things they were regarding as obviously necessary are optional. In discussion of trauma with people this often seems to be an element of relationships sadly enough. They perceive no alternative and so they resign themselves to slogging it out for a lifetime with a person they are very unexcited about. This is horrific for both people involved.
It's, of course, true that for an LLM, prediction is the only thing it can do, and that humans have a system of desires on top of that. But it looks to me like a lot of human behavior is just having LLM-ish predictive models of how someone like them would behave in a particular situation, which is also the reason why conceptual reframings like the ones you can get in therapy can be so powerful ("I wasn't lazy after all, I just didn't have the right tools for being productive" can drastically reorient many predictions you're making of yourself and thus your behavior). (See also my post on human LLMs, which has more examples.)
While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)
To elaborate on that, Shear is presumably saying exactly as much as he is allowed to say in public. This implies that if the removal had nothing to do with safety, then he would say "The board did not remove Sam over anything to do with safety". His inclusion of that qualifier implies that he couldn't make a statement that broad, and therefore that safety considerations were involved in the removal.
I expect safety of that to be at zero
At least it refuses to give you instructions for making cocaine.
For site libraries, there is indeed no alternative since you have to use some libraries to get anything done, so there you do have to do it on a case-by-case basis. In the case of exposing user data, there is an alternative - limiting yourself to only public data. (See also my reply to jacobjacob.)
we're a small team and the world is on fire, and I don't think we should really be prioritising making Dialogue Matching robust to this kind of adversarial cyber threat for information of comparable scope and sensitivity!
I agree that it wouldn't be a very good use of your resources. But there's a simple solution for that - only use data that's already public and that users have consented to you using. (Or offer an explicit opt-in where that isn't the case.)
I do agree that in this specific instance, there's probably little harm in the information being revealed. But I generally also don't think that that's the site admin's call to make, even if I happen to agree with that call in some particular instances. A user may have all kinds of reasons to want to keep some information about themselves private, some of those reasons/kinds of information being very idiosyncratic and hard to know in advance. The only way to respect every user's preferences for privacy, even the unusual ones, is to let them control what information is used, rather than making any of those calls on their behalf.
My point is less about the individual example than the overall decision algorithm. Even if you're correct that in this specific instance, you can verify the whole trail of implications and be certain that nothing bad happens, a general policy of "figure it out on a case-by-case basis and only do it when it feels safe" means that you're probably going to make a mistake eventually, given how easy it is to make a mistake in this domain.
I've wondered the same thing; I've suggested before merging them, so that posts in shortform would automatically be posted into that month's open thread and vice versa. As it is, every now and then I can't decide which one to post in, so I post in neither.
We tentatively postulated it would be fine to do this as long as seeing a name on your match page gave no more than like a 5:1 update about those people having checked you.
I would strongly advocate against this kind of reasoning; any such decision-making procedure relies on the assumption that you have correctly figured out all the ways such information can be used, and that there isn't a clever way for an adversary to extract more information than you had thought. This is bound to fail - people come up with clever ways to extract more private information than anticipated all the time (and as the sketch after the examples below illustrates, even individually "capped" leaks can compound). For example:
- Timing Attacks on Web Privacy
- We describe a class of attacks that can compromise the privacy of users’ Web-browsing histories. The attacks allow a malicious Web site to determine whether or not the user has recently visited some other, unrelated Web page. The malicious page can determine this information by measuring the time the user’s browser requires to perform certain operations. Since browsers perform various forms of caching, the time required for operations depends on the user’s browsing history; this paper shows that the resulting time variations convey enough information to compromise users’ privacy.
- Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)
- We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
- De-anonymizing Social Networks
- We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate. Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.
- On the Anonymity of Home/Work Location Pairs
- Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual’s home and workplace can both be deduced from a location trace, then the median size of the individual’s anonymity set in the U.S. working population is 1, 21 and 34,980, for locations known at the granularity of a census block, census tract and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual’s home and work locations can both be deduced from the data.
- Bubble Trouble: Off-Line De-Anonymization of Bubble Forms
- Fill-in-the-bubble forms are widely used for surveys, election ballots, and standardized tests. In these and other scenarios, use of the forms comes with an implicit assumption that individuals’ bubble markings themselves are not identifying. This work challenges this assumption, demonstrating that fill-in-the-bubble forms could convey a respondent’s identity even in the absence of explicit identifying information. We develop methods to capture the unique features of a marked bubble and use machine learning to isolate characteristics indicative of its creator. Using surveys from more than ninety individuals, we apply these techniques and successfully reidentify individuals from markings alone with over 50% accuracy.
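To make the compounding concern concrete, here's a minimal sketch in Python (toy numbers of my own, not anything from the LW team's actual threat model) of how several signals that are each individually "capped" at a 5:1 update combine in odds form:

```python
# Toy Bayes calculation (hypothetical numbers): even if each individual signal gives
# "no more than a 5:1 update" about whether a particular person checked you, several
# roughly independent such signals compound multiplicatively in odds form.

def posterior_probability(prior_prob: float, likelihood_ratios: list[float]) -> float:
    """Posterior odds = prior odds * product of likelihood ratios; convert back to probability."""
    odds = prior_prob / (1 - prior_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Suppose a 5% prior that a given person checked you, and three observations
# (e.g. from visiting the match page on different days), each capped at 5:1:
print(posterior_probability(0.05, [5, 5, 5]))  # ~0.87 -- no longer a small leak
```

Whether the individual signals really are independent, and whether the 5:1 cap actually holds against someone combining them cleverly, are exactly the kinds of assumptions that are easy to get wrong.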
Hmm, I would actually expect neurotypicals to find this advice more useful, since they're more likely to have thoughts like "I can't do that, that'd be weird" while the stereotypical autist would be blissfully unaware of there being anything weird about it.
No worries! Yeah, I agree with that. These paragraphs were actually trying to explicitly say that things may very well not work out in the end, but maybe that wasn't clear enough:
Love doesn’t always win. There are situations where loyalty, cooperation, and love win, and there are situations where disloyalty, selfishness, and hatred win. If that wasn’t the case, humans wouldn’t be so clearly capable of both.
It’s possible for people and cultures to settle into stable equilibria where trust and happiness dominate and become increasingly beneficial for everyone, but also for them to settle into stable equilibria where mistrust and misery dominate, or anything in between.
I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals
I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding to were ones like the following:
The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)
15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. (AGI Ruin: A List of Lethalities)
Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.
To respond to your argument itself:
I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?
I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on - modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.
At the same time... I think we can still basically recognize and understand the values of that tribal warrior, even if we don't share them. We do still understand what's attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they're doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.
It's true that humans couldn't eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so... then the people of that time would not understand all of the technical details, and maybe they'd wonder why we'd bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don't think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they'd lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.
Similarly, humans have always gone on voyages of exploration - e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages - so they'd probably have no difficulty relating to a story about sorcerers going to explore the moon, or of two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions on how to animate golems (arguably AGI-like) in circulation by the late 12th/early 13th century.
So I agree that the same fundamental values and drives can create very different behavior in different contexts... but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.
This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives - being a successful warrior was the way you got what you wanted back in that day - and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying "deep values" are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person's innate temperament that may be more hardcoded.)
I think that the tribal warrior would be disgusted by our society because he would rightly recognize that we have the kinds of behavior patterns that wouldn't bring glory in his society and that his tribesmen would find it shameful to associate with, and also that trying to make it in our society would require him to unlearn a lot of stuff that he was deeply invested in. But if he was capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get... then he might come to see our society as not that horrible after all.
I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment.
I don't think the actual victory states look substantially different? They're all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.
I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.
Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).
I think I agree with this; do you mean it as disagreement with something I said, or just as an observation?
Thanks, edited:
I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.
Does this comment help clarify the point?
So I think the issue is that when we discuss what I'd call the "standard argument from evolution", you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn't managed to explicitly distinguish them.
The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:
- The original evolutionary "intent" of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals are executed even though their consequences (in terms of their impact on fitness) are very different. This tells us that the intent of the process that created a behavior/goal does not matter. Once the behavior/goal has been created, it will just do what it does even if the consequences of that doing deviate from their original purpose. Thus, even if we train an AI so that it carries out goal X in a particular context, we have no particular reason to expect that it would continue to automatically carry out the same (intended) goal if the context changes enough.
I agree with this form of the argument and have no objections to it. I don't think that the points in my post are particularly relevant to that claim. (I've even discussed a form of inner optimization in humans that causes value drift that I don't recall anyone else discussing in those terms before.)
However, I think that many formulations are actually implying, if not outright stating, a stronger claim:
- In the case of evolution, humans were originally selected for IGF but are now doing things that are completely divorced from that objective. Thus, even if we train an AI so that it carries out goal X in a particular context, we have a strong reason to expect that its behavior would deviate so much from the goal as to become practically unrecognizable.
So the difference is something like the implied sharpness of the left turn. In the weak version, the claim is just that the behavior might go some unknown amount to the left. We should figure out how to deal with this, but we don't yet have much empirical data to estimate exactly how much it might be expected to go left. In the strong version, the claim is that the empirical record shows that the AI will by default swerve a catastrophic amount to the left.
(Possibly you don't feel that anyone is actually implying the stronger version. If you don't, and you would already disagree with the stronger version, then great! We are in agreement. I don't think it matters whether the implication "really is there" in some objective sense, or even whether the original authors intended it or not. I think the relevant thing is that I got that implication from the posts I read, and I expect that if I got it, some other people got it too. So this post is then primarily aimed at the people who did take the strong version to be there and thought it made sense.)
You wrote:
I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically.
If we are talking about the weak version of the argument, then yes, I agree with everything here. But I think the strong version - where our behavior is implied to be completely at odds with our original behavior - has to implicitly assume that things like an art-creation drive are something novel.
Now I don't think that anyone who endorses the strong version (if anyone does) would explicitly endorse the claim that our art-creation drive just appeared out of nowhere. But to me, the strong version becomes pretty hard to maintain if you take the stance that we are mostly still executing all of the behaviors that we used to, and it's just that their exact forms and relative weightings are somewhat out of distribution. (Yes, right now our behavior seems to lead to falling birthrates and lots of populations at below replacement rates, which you could argue was a bigger shift than being "somewhat out of distribution", but... to me that intuitively feels like it's less relevant than the fact that most individual humans still want to have children and are very explicitly optimizing for that, especially since we've only been in the time of falling birthrates for a relatively short time and it's not clear whether it'll continue for very long.)
I think the strong version also requires one to hold that evolution does, in fact, consistently and predominantly optimize for a single coherent thing. Otherwise, it would mean that our current-day behaviors could be explained by "evolution doesn't consistently optimize for any single thing" just as well as they could be explained by "we've experienced a left turn from what evolution originally optimized for".
However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded.
I agree that there are contexts where it would be analogous to that. But in that example, AlphaGo is still being rewarded for winning games of Go, and it's just that the exact strategies it needs to use differ. That seems different than e.g. the bacteria example, where bacteria are selected for exactly the opposite traits - either selected for producing a toxin and an antidote, or selected for not producing a toxin and an antidote. That seems to me more analogous to a situation where AlphaGo is initially being rewarded for winning at Go, then once it starts consistently winning it starts getting rewarded for losing instead, and then once it starts consistently losing it starts getting rewarded for winning again.
And I don't think that that kind of a situation is even particularly rare - anything that consumes energy (be it a physical process such as producing venom or growing fur, or a behavior such as enjoying exercise) is subject to that kind of "either/or" choice.
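To illustrate the contrast, here's a minimal sketch (a toy model with made-up numbers, not taken from anywhere) of the difference between a fixed objective and frequency-dependent selection, where the "reward" for the very same trait flips sign depending on the rest of the population:

```python
# Toy contrast (illustrative numbers only): a fixed objective vs. frequency-dependent
# selection, where what gets rewarded flips depending on the rest of the population.

def go_reward(won_game: bool) -> float:
    # The target never changes: end the game with more territory than the opponent.
    return 1.0 if won_game else -1.0

def toxin_fitness(produces_toxin: bool, producer_frequency: float) -> float:
    # Producing the toxin (plus its antidote) pays off while producers are rare,
    # but the constant metabolic cost turns it into a liability once they dominate.
    if not produces_toxin:
        return 0.0
    benefit = 1.0 - producer_frequency  # advantage of poisoning rivals shrinks as producers spread
    cost = 0.4                          # cost of making both the toxin and the antidote
    return benefit - cost

print(toxin_fitness(True, 0.1))  # +0.5: the trait is selected for
print(toxin_fitness(True, 0.9))  # -0.3: the very same trait is now selected against
```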
Now you could say that "just like AlphaGo is still rewarded for winning games of Go and it's just the strategies that differ, the organism is still rewarded for reproducing and it's just the strategies that differ". But I think the difference is that for AlphaGo, the rewards are consistently shaping its "mind" towards having a particular optimization goal - one where the board is in a winning state for it.
And one key premise on which the "standard argument from evolution" rests is that evolution has not consistently shaped the human mind in such a direct manner. It's not that we have been created with "I want to have surviving offspring" as our only explicit cognitive goal, with all of the evolutionary training going into learning better strategies to get there by explicit (or implicit) reasoning. Rather, we have been given various motivations that exhibit varying degrees of directness in how useful they are for that goal - from "I want to be in a state where I produce great art" (quite indirect) to "I want to have surviving offspring" (direct), with the direct goal competing with all the indirect ones for priority. This is unlike AlphaGo, which does have the cognitive capacity for direct optimization toward its goal, and for which that goal has been the sole reward criterion all along.
This is also a bit hard to put a finger on, but I feel like there's some kind of implicit bait-and-switch happening with the strong version of the standard argument. It correctly points out that we have not had IGF as our sole explicit optimization goal because we didn't start by having enough intelligence for that to work. Then it suggests that because of this, AIs are likely to also be misaligned... even though, unlike with human evolution, we could just optimize them for one explicit goal from the beginning, so we should expect our AIs to be much more reliably aligned with that goal!
Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch on the core of some important considerations.
I notice I'm having difficulty responding, in that I disagree with some of what you said, but then have difficulty figuring out my reasons for that disagreement. I have the sense there's a subtle confusion going on, but trying to answer you makes me uncertain whether others are the ones with the subtle confusion or whether I am.
I'll think about it some more and get back to you.
infanticide is not a substitute for contraception
I did not mean to say that they would be exactly equivalent nor that infanticide would be without significant downsides.
How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)?
"Inner optimizers diverging from the optimization process's reward function" sounds to me like humans were already donating to sperm banks in the EEA, only for an inner optimizer to wreak havoc and sidetrack us from that. I assume you mean something different, since under that interpretation of what you mean the answer would be obvious - that we don't need to invoke inner optimizers because there were no sperm banks in the EEA, so "that's not the kind of behavior that evolution selected for" is a sufficient explanation.
That's a fair point.
I suspect that both of those may be running off the same basic algorithm, with there just being other components dictating what that algorithm gets applied to, and by default preventing it from getting applied too broadly.
But I could be wrong about that. And even if it was the same basic algorithm, running it in "limited vs. universal" mode does cause some significant qualitative differences, even if the difference was arguably just quantitative. So I do think that a more precise view would be to consider these as different-but-related forces in the same pantheon: one force for just banding together with your ingroup, and one for a more universal love.
Or you could view it the way it was viewed in The Goddess of Everything Else: going from a purely solitary existence, to banding together, to using that to exploit outgroups, to then expanding the moral circle to outgroups as well, represents steps in the dance of the force for harmony and the force for conflict. (Of course, in reality, these steps are not separated in time, but rather are constantly intertwined with each other.) The banding together within the same species bears the signature of the force for cooperation and self-sacrifice, but also that of the force for conflict and destruction... and then again that of the force for cooperation, as it can be turned into more universal caring.
To begin with, the latter doesn't seem like something one grows into with increased "social intelligence", but rather "quantum jumps" that are taken at unpredictable moments that one cannot engineer. Secondly, it is not clear to me at all that people who integrate a greater understanding and dexterity with cooperation in their personal lives have any higher chances of reaching these altered states of consciousness.
I agree, and I don't mean to suggest that increased "social intelligence" would automatically take one to altered states, nor that increasing one's ability for cooperation would necessarily lead one to them. The connection is more the other way around. The states are ones in which something like the algorithm for cooperation is activated unusually strongly, so if one can get to those states and get an experience of what that mindstate is like, then it can be easier to try to achieve that kind of a mindstate in "real life" and increase one's ability for cooperation there.
That being said, it's not a completely one-way connection, since once certain prerequisites have been unlocked then practicing the mindstate in "real life" also makes it easier to get into the altered states, since the mind is already more inclined in that direction. (I'm simplifying things quite a bit here since a proper elaboration of this caveat would require another essay.) Also having the mind strongly inclined in that direction already can make it easier to unlock the prerequisites and reach the altered states - for example, some people go quite easily from loving-kindness meditation to bliss states, in part because they've already practiced (maybe without being consciously aware of it) habits of mind that make it easy to incline themselves towards loving-kindness.
Except that chess really does have an objectively correct value systemization, which is "win the game."
Your phrasing sounds like you might be saying this as an objection to what I wrote, but I'm not sure how it would contradict my comment.
The same mechanisms can still apply even if the correct systematization is subjective in one case and objective in the second case. Ultimately what matters is that the cognitive system feels that one alternative is better than the other and takes that feeling as feedback for shaping future behavior, and I think that the mechanism which updates on feedback doesn't really see whether the source of the feedback is something we'd call objective (win or loss at chess) or subjective (whether the resulting outcome was good in terms of the person's pre-existing values).
"Sitting with paradox" just means, don't get too attached to partial systemizations.
Yeah, I think that's a reasonable description of what it means in the context of morality too.
Similarly, suppose you have two deontological values which trade off against each other. Before systematization, the question of "what's the right way to handle cases where they conflict" is not really well-defined; you have no procedure for doing so. After systematization, you do. (And you also have answers to questions like "what counts as lying?" or "is X racist?", which without systematization are often underdefined.) [...]
You can conserve your values (i.e. continue to care terminally about lower-level representations) but the price you pay is that they make less sense, and they're underdefined in a lot of cases. [...] And that's why the "mind itself wants to do this" does make sense, because it's reasonable to assume that highly capable cognitive architectures will have ways of identifying aspects of their thinking that "don't make sense" and correcting them.
I think we should be careful to distinguish explicit and implicit systematization. Some of what you are saying (e.g. getting answers to question like "what counts as lying") sounds like you are talking about explicit, consciously done systematization; but some of what you are saying (e.g. minds identifying aspects of thinking that "don't make sense" and correcting them) also sounds like it'd apply more generally to developing implicit decision-making procedures.
I could see the deontologist solving their problem either way - by developing some explicit procedure and reasoning for solving the conflict between their values, or just going by a gut feel for which value seems to make more sense to apply in that situation and the mind then incorporating this decision into its underlying definition of the two values.
I don't know how exactly deontological rules work, but I'm guessing that you could solve a conflict between them by basically just putting in a special case for "in this situation, rule X wins over rule Y" - and if you view the rules as regions in state space where the region for rule X corresponds to the situations where rule X is applied, then adding data points about which rule is meant to cover which situation ends up modifying the rule itself. It would also be similar to the way that rules work in skill learning in general, in that experts find the rules getting increasingly fine-grained, implicit and full of exceptions. Here's how Josh Waitzkin describes the development of chess expertise:
Let’s say that I spend fifteen years studying chess. [...] We will start with day one. The first thing I have to do is to internalize how the pieces move. I have to learn their values. I have to learn how to coordinate them with one another. [...]
Soon enough, the movements and values of the chess pieces are natural to me. I don’t have to think about them consciously, but see their potential simultaneously with the figurine itself. Chess pieces stop being hunks of wood or plastic, and begin to take on an energetic dimension. Where the piece currently sits on a chessboard pales in comparison to the countless vectors of potential flying off in the mind. I see how each piece affects those around it. Because the basic movements are natural to me, I can take in more information and have a broader perspective of the board. Now when I look at a chess position, I can see all the pieces at once. The network is coming together.
Next I have to learn the principles of coordinating the pieces. I learn how to place my arsenal most efficiently on the chessboard and I learn to read the road signs that determine how to maximize a given soldier’s effectiveness in a particular setting. These road signs are principles. Just as I initially had to think about each chess piece individually, now I have to plod through the principles in my brain to figure out which apply to the current position and how. Over time, that process becomes increasingly natural to me, until I eventually see the pieces and the appropriate principles in a blink. While an intermediate player will learn how a bishop’s strength in the middlegame depends on the central pawn structure, a slightly more advanced player will just flash his or her mind across the board and take in the bishop and the critical structural components. The structure and the bishop are one. Neither has any intrinsic value outside of its relation to the other, and they are chunked together in the mind.
This new integration of knowledge has a peculiar effect, because I begin to realize that the initial maxims of piece value are far from ironclad. The pieces gradually lose absolute identity. I learn that rooks and bishops work more efficiently together than rooks and knights, but queens and knights tend to have an edge over queens and bishops. Each piece’s power is purely relational, depending upon such variables as pawn structure and surrounding forces. So now when you look at a knight, you see its potential in the context of the bishop a few squares away. Over time each chess principle loses rigidity, and you get better and better at reading the subtle signs of qualitative relativity. Soon enough, learning becomes unlearning. The stronger chess player is often the one who is less attached to a dogmatic interpretation of the principles. This leads to a whole new layer of principles—those that consist of the exceptions to the initial principles. Of course the next step is for those counterintuitive signs to become internalized just as the initial movements of the pieces were. The network of my chess knowledge now involves principles, patterns, and chunks of information, accessed through a whole new set of navigational principles, patterns, and chunks of information, which are soon followed by another set of principles and chunks designed to assist in the interpretation of the last. Learning chess at this level becomes sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity.
"Sitting with paradox, being at peace with and navigating the tension of competing truths, letting go of any notion of solidity" also sounds to me like some of the models for higher stages of moral development, where one moves past the stage of trying to explicitly systematize morality and can treat entire systems of morality as things that all co-exist in one's mind and are applicable in different situations. Which would make sense, if moral reasoning is a skill in the same sense that playing chess is a skill, and moral preferences are analogous to a chess expert's preferences for which piece to play where.
Morality seems like the domain where humans have the strongest instinct to systematize our preferences
At least, the domain where modern educated Western humans have an instinct to systematize their preferences. Interestingly, it seems the kind of extensive value systematization done in moral philosophy may itself be an example of belief systematization. Scientific thinking taught people the mental habit of systematizing things, and then those habits led them to start systematizing values too, as a special case of "things that can be systematized".
Phil Goetz had this anecdote:
I'm also reminded of a talk I attended by one of the Dalai Lama's assistants. This was not slick, Westernized Buddhism; this was saffron-robed fresh-off-the-plane-from-Tibet Buddhism. He spoke about his beliefs, and then took questions. People began asking him about some of the implications of his belief that life, love, feelings, and the universe as a whole are inherently bad and undesirable. He had great difficulty comprehending the questions - not because of his English, I think; but because the notion of taking a belief expressed in one context, and applying it in another, seemed completely new to him. To him, knowledge came in units; each unit of knowledge was a story with a conclusion and a specific application. (No wonder they think understanding Buddhism takes decades.) He seemed not to have the idea that these units could interact; that you could take an idea from one setting, and explore its implications in completely different settings.
David Chapman has a page talking about how fundamentalist forms of religion are a relatively recent development, a consequence of secular people first starting to systematize values, after which religion had to start doing the same in order to adapt:
Fundamentalism describes itself as traditional and anti-modern. This is inaccurate. Early fundamentalism was anti-modernist, in the special sense of “modernist theology,” but it was itself modernist in a broad sense. Systems of justifications are the defining feature of “modernity,” as I (and many historians) use the term.
The defining feature of actual tradition—“the choiceless mode”—is the absence of a system of justifications: chains of “therefore” and “because” that explain why you have to do what you have to do. In a traditional culture, you just do it, and there is no abstract “because.” How-things-are-done is immanent in concrete customs, not theorized in transcendent explanations.
Genuine traditions have no defense against modernity. Modernity asks “Why should anyone believe this? Why should anyone do that?” and tradition has no answer. (Beyond, perhaps, “we always have.”) Modernity says “If you believe and act differently, you can have 200 channels of cable TV, and you can eat fajitas and pad thai and sushi instead of boiled taro every day”; and every genuinely traditional person says “hell yeah!” Because why not? Choice is great! (And sushi is better than boiled taro.)
Fundamentalisms try to defend traditions by building a system of justification that supplies the missing “becauses.” You can’t eat sushi because God hates shrimp. How do we know? Because it says so here in Leviticus 11:10-11.
Secular modernism tries to answer every “why” question with a chain of “becauses” that eventually ends in “rationality,” which magically reveals Ultimate Truth. Fundamentalist modernism tries to answer every “why” with a chain that eventually ends in “God said so right here in this magic book which contains the Ultimate Truth.”
The attempt to defend tradition can be noble; tradition is often profoundly good in ways modernity can never be. Unfortunately, fundamentalism, by taking up modernity’s weapons, transforms a traditional culture into a modern one. “Modern,” that is, in having a system of justification, founded on a transcendent eternal ordering principle. And once you have that, much of what is good about tradition is lost.
This is currently easier to see in Islamic than in Christian fundamentalism. Islamism is widely viewed as “the modern Islam” by young people. That is one of its main attractions: it can explain itself, where traditional Islam cannot. Sophisticated urban Muslims reject their grandparents’ traditional religion as a jumble of pointless, outmoded village customs with no basis in the Koran. Many consider fundamentalism the forward-looking, global, intellectually coherent religion that makes sense of everyday life and of world politics.
Jonathan Haidt also talked about the way that even among Westerners, requiring justification and trying to ground everything in harm/care is most prominent in educated people (who had been socialized to think about morality in this way) as opposed to working-class people. Excerpts from The Righteous Mind where he talks about reading people stories about victimless moral violations (e.g. having sex with a dead chicken before eating it) to see how they thought about them:
I got my Ph.D. at McDonald’s. Part of it, anyway, given the hours I spent standing outside of a McDonald’s restaurant in West Philadelphia trying to recruit working-class adults to talk with me for my dissertation research. When someone agreed, we’d sit down together at the restaurant’s outdoor seating area, and I’d ask them what they thought about the family that ate its dog, the woman who used her flag as a rag, and all the rest. I got some odd looks as the interviews progressed, and also plenty of laughter—particularly when I told people about the guy and the chicken. I was expecting that, because I had written the stories to surprise and even shock people.
But what I didn’t expect was that these working-class subjects would sometimes find my request for justifications so perplexing. Each time someone said that the people in a story had done something wrong, I asked, “Can you tell me why that was wrong?” When I had interviewed college students on the Penn campus a month earlier, this question brought forth their moral justifications quite smoothly. But a few blocks west, this same question often led to long pauses and disbelieving stares. Those pauses and stares seemed to say, You mean you don’t know why it’s wrong to do that to a chicken? I have to explain this to you? What planet are you from?
These subjects were right to wonder about me because I really was weird. I came from a strange and different moral world—the University of Pennsylvania. Penn students were the most unusual of all twelve groups in my study. They were unique in their unwavering devotion to the “harm principle,” which John Stuart Mill had put forth in 1859: “The only purpose for which power can be rightfully exercised over any member of a civilized community, against his will, is to prevent harm to others.” As one Penn student said: “It’s his chicken, he’s eating it, nobody is getting hurt.”
The Penn students were just as likely as people in the other eleven groups to say that it would bother them to witness the taboo violations, but they were the only group that frequently ignored their own feelings of disgust and said that an action that bothered them was nonetheless morally permissible. And they were the only group in which a majority (73 percent) were able to tolerate the chicken story. As one Penn student said, “It’s perverted, but if it’s done in private, it’s his right.” [...]
Haidt also talks about this kind of value systematization being uniquely related to Western mental habits:
I and my fellow Penn students were weird in a second way too. In 2010, the cultural psychologists Joe Henrich, Steve Heine, and Ara Norenzayan published a profoundly important article titled “The Weirdest People in the World?” The authors pointed out that nearly all research in psychology is conducted on a very small subset of the human population: people from cultures that are Western, educated, industrialized, rich, and democratic (forming the acronym WEIRD). They then reviewed dozens of studies showing that WEIRD people are statistical outliers; they are the least typical, least representative people you could study if you want to make generalizations about human nature. Even within the West, Americans are more extreme outliers than Europeans, and within the United States, the educated upper middle class (like my Penn sample) is the most unusual of all.
Several of the peculiarities of WEIRD culture can be captured in this simple generalization: The WEIRDer you are, the more you see a world full of separate objects, rather than relationships. It has long been reported that Westerners have a more independent and autonomous concept of the self than do East Asians. For example, when asked to write twenty statements beginning with the words “I am …,” Americans are likely to list their own internal psychological characteristics (happy, outgoing, interested in jazz), whereas East Asians are more likely to list their roles and relationships (a son, a husband, an employee of Fujitsu).
The differences run deep; even visual perception is affected. In what’s known as the framed-line task, you are shown a square with a line drawn inside it. You then turn the page and see an empty square that is larger or smaller than the original square. Your task is to draw a line that is the same as the line you saw on the previous page, either in absolute terms (same number of centimeters; ignore the new frame) or in relative terms (same proportion relative to the frame). Westerners, and particularly Americans, excel at the absolute task, because they saw the line as an independent object in the first place and stored it separately in memory. East Asians, in contrast, outperform Americans at the relative task, because they automatically perceived and remembered the relationship among the parts.
Related to this difference in perception is a difference in thinking style. Most people think holistically (seeing the whole context and the relationships among parts), but WEIRD people think more analytically (detaching the focal object from its context, assigning it to a category, and then assuming that what’s true about the category is true about the object). Putting this all together, it makes sense that WEIRD philosophers since Kant and Mill have mostly generated moral systems that are individualistic, rule-based, and universalist. That’s the morality you need to govern a society of autonomous individuals.
But when holistic thinkers in a non-WEIRD culture write about morality, we get something more like the Analects of Confucius, a collection of aphorisms and anecdotes that can’t be reduced to a single rule. Confucius talks about a variety of relationship-specific duties and virtues (such as filial piety and the proper treatment of one’s subordinates). If WEIRD and non-WEIRD people think differently and see the world differently, then it stands to reason that they’d have different moral concerns. If you see a world full of individuals, then you’ll want the morality of Kohlberg and Turiel—a morality that protects those individuals and their individual rights. You’ll emphasize concerns about harm and fairness.
But if you live in a non-WEIRD society in which people are more likely to see relationships, contexts, groups, and institutions, then you won’t be so focused on protecting individuals. You’ll have a more sociocentric morality, which means (as Shweder described it back in chapter 1) that you place the needs of groups and institutions first, often ahead of the needs of individuals. If you do that, then a morality based on concerns about harm and fairness won’t be sufficient. You’ll have additional concerns, and you’ll need additional virtues to bind people together.
this speech was still unusually strong against AI safety.
I think that's a reasonable read if you're operating in a conceptual framework where acceleration and safety must be mutually exclusive, but the sense I got was that that's not the framework he's operating under. My read of the speech is that it's pro-acceleration and pro-safety: invest a lot in AI development, and also invest a lot in ensuring its safety.
Thanks!
I was around stage 6, touching 7 on a really good day, if I recall correctly. At some point after writing the post, concentration practices stopped working for me (I have some guesses about the reason but no clear answer), and since then I've been at stage 2-3 whenever I've given it another try, sometimes very briefly touching 4.
The bit that came immediately after those lines also felt pretty important:
And in any case, how can we write laws that make sense for something we don’t yet fully understand?
So, instead, we’re building world-leading capability to understand and evaluate the safety of AI models within government.
To do that, we’ve already invested £100m in a new taskforce…
…more funding for AI safety than any other country in the world.
And we’ve recruited some of the most respected and knowledgeable figures in the world of AI.
So, I’m completely confident in telling you the UK is doing far more than other countries to keep you safe.
And because of this – because of the unique steps we’ve already taken – we’re able to go even further today.
I can announce that we will establish the world’s first AI Safety Institute – right here in the UK.
It will advance the world’s knowledge of AI safety.
And it will carefully examine, evaluate, and test new types of AI…
…so that we understand what each new model is capable of…
…exploring all the risks, from social harms like bias and misinformation, through to the most extreme risks of all.
The British people should have peace of mind that we’re developing the most advanced protections for AI of any country in the world.
To me this seemed like good news - "don't rush to regulate, actually take the time for experts to figure out what makes sense" sounds like the kind of approach that might actually produce sensible regulation, rather than something quickly put together that sounds good but doesn't actually make sense.
It's not just alignment and capability research you need to watch out for - anything connected to AI could conceivably advance timelines and therefore is inadvisable.
We could take this logic further. Any participation in the global economy could conceivably be connected to AI. Therefore people concerned about AI x-risk should quit their jobs, stop using any services, learn to live off nature, and go live as hermits.
Of course, it's possible that the local ecosystem may have a connection to the local economy and thus the global economy, so even living off nature may have some influence on AI. Learning to survive on pure sunlight may thus be a crucial alignment research priority.
For me it's not that fiction would be less compelling, but my standards have crept up, and it would take more time and effort to find something that would really grab me.
Is there anyone else who finds Dialogues vaguely annoying to read and would appreciate posts that distilled them to their final conclusions? (not offering to write them, but just making it common knowledge if there is such a demand)
Consider a situation where a post strongly offends a small number of LW regulars, but is generally approved of by the median reader. A small number of regulars hard downvote the post, resulting in a suppression of the undesirable idea.
I believe that this is actually part of the design intent of strongvotes - to help make sure that LW rewards the kind of content that long-time regulars appreciate, avoiding an "Eternal September" scenario where an influx of new users starts upvoting the kind of content you might find anywhere else on the Internet and driving the old regulars out, until the thing that originally made LW unique is lost.
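As a toy illustration of how that can play out mechanically (the vote weights below are made-up assumptions for illustration, not LW's actual values), a handful of strong downvotes from regulars can outweigh a larger number of ordinary upvotes:

```python
# Toy sketch of weighted voting; the weights are assumptions for illustration,
# not the site's real values.

def post_score(votes):
    """votes: list of (direction, weight) pairs, e.g. (+1, 1) for an ordinary upvote."""
    return sum(direction * weight for direction, weight in votes)

ordinary_upvotes = [(+1, 1)] * 12   # twelve newer users upvote normally
strong_downvotes = [(-1, 7)] * 3    # three regulars strong-downvote (weight 7 assumed)

print(post_score(ordinary_upvotes + strong_downvotes))  # 12 - 21 = -9: net negative
```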
This post might qualify (about how to get AIs to feel something like "love" toward humans).
Nice find!
GPT-4 really seems to have changed the minds of a lot of researchers. Pearl, Hinton, and I think I saw a few others too, though I can't remember who they were now.
Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem.
I read this as saying "GPT-4 has successfully learned to predict human preferences, but it has not learned to actually fulfill human preferences, and that's a far harder goal". But in the case of GPT-4, it seems to me like this distinction is not very clear-cut - it's useful to us because, in its architecture, there's a sense in which "predicting" and "fulfilling" are basically the same thing.
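To make that point concrete, here's a minimal sketch of why, for a plain autoregressive LM, producing output just is repeatedly sampling from its own predictive distribution (the model.next_token_distribution interface is hypothetical, not any particular library's API):

```python
import random

def generate(model, prompt_tokens, max_new_tokens=50):
    """Generate a completion by repeatedly sampling the model's own prediction."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # The model's "prediction": a probability distribution over possible next tokens.
        dist = model.next_token_distribution(tokens)  # hypothetical method: {token: prob}
        # The model's "behaviour": a sample from that same distribution.
        next_token = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
        tokens.append(next_token)
    return tokens
```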
It also seems to me that this distinction is not very clear-cut in humans, either - that a significant part of e.g. how humans internalize moral values while growing up has to do with building up predictive models of how other people would react to one's actions, and then having one's decision-making be guided by those predictive models. So given that systems like GPT-4 seem to have a relatively easy time doing something similar, that feels like an update toward alignment being easier than expected.
Of course, there's a high chance that a superintelligent AI will generalize from that training data differently than most humans would. But that seems to me more like a risk of superintelligence than a risk from AI as such; a superintelligent human would likely also arrive at different moral conclusions than non-superintelligent humans would.
Bessel van der Kolk claimed the following in The Body Keeps the Score:
There have in fact been hundreds of scientific publications spanning well over a century documenting how the memory of trauma can be repressed, only to resurface years or decades later. Memory loss has been reported in people who have experienced natural disasters, accidents, war trauma, kidnapping, torture, concentration camps, and physical and sexual abuse. Total memory loss is most common in childhood sexual abuse, with incidence ranging from 19 percent to 38 percent. This issue is not particularly controversial: As early as 1980 the DSM-III recognized the existence of memory loss for traumatic events in the diagnostic criteria for dissociative amnesia: “an inability to recall important personal information, usually of a traumatic or stressful nature, that is too extensive to be explained by normal forgetfulness.” Memory loss has been part of the criteria for PTSD since that diagnosis was first introduced.
One of the most interesting studies of repressed memory was conducted by Dr. Linda Meyer Williams, which began when she was a graduate student in sociology at the University of Pennsylvania in the early 1970s. Williams interviewed 206 girls between the ages of ten and twelve who had been admitted to a hospital emergency room following sexual abuse. Their laboratory tests, as well as the interviews with the children and their parents, were kept in the hospital’s medical records. Seventeen years later Williams was able to track down 136 of the children, now adults, with whom she conducted extensive follow-up interviews. More than a third of the women (38 percent) did not recall the abuse that was documented in their medical records, while only fifteen women (12 percent) said that they had never been abused as children. More than two-thirds (68 percent) reported other incidents of childhood sexual abuse. Women who were younger at the time of the incident and those who were molested by someone they knew were more likely to have forgotten their abuse.
This study also examined the reliability of recovered memories. One in ten women (16 percent of those who recalled the abuse) reported that they had forgotten it at some time in the past but later remembered that it had happened. In comparison with the women who had always remembered their molestation, those with a prior period of forgetting were younger at the time of their abuse and were less likely to have received support from their mothers. Williams also determined that the recovered memories were approximately as accurate as those that had never been lost: All the women’s memories were accurate for the central facts of the incident, but none of their stories precisely matched every detail documented in their charts. [...]
Given the wealth of evidence that trauma can be forgotten and resurface years later, why did nearly one hundred reputable memory scientists from several different countries throw the weight of their reputations behind the appeal to overturn Father Shanley’s conviction, claiming that “repressed memories” were based on “junk science”? Because memory loss and delayed recall of traumatic experiences had never been documented in the laboratory, some cognitive scientists adamantly denied that these phenomena existed or that retrieved traumatic memories could be accurate. However, what doctors encounter in emergency rooms, on psychiatric wards, and on the battlefield is necessarily quite different from what scientists observe in their safe and well-organized laboratories.
Consider what is known as the “lost in the mall” experiment, for example. Academic researchers have shown that it is relatively easy to implant memories of events that never took place, such as having been lost in a shopping mall as a child. About 25 percent of subjects in these studies later “recall” that they were frightened and even fill in missing details. But such recollections involve none of the visceral terror that a lost child would actually experience.
Another line of research documented the unreliability of eyewitness testimony. Subjects might be shown a video of a car driving down a street and asked afterward if they saw a stop sign or a traffic light; children might be asked to recall what a male visitor to their classroom had been wearing. Other eyewitness experiments demonstrated that the questions witnesses were asked could alter what they claimed to remember. These studies were valuable in bringing many police and courtroom practices into question, but they have little relevance to traumatic memory.
The fundamental problem is this: Events that take place in the laboratory cannot be considered equivalent to the conditions under which traumatic memories are created. The terror and helplessness associated with PTSD simply can’t be induced de novo in such a setting. We can study the effects of existing traumas in the lab, as in our script-driven imaging studies of flashbacks, but the original imprint of trauma cannot be laid down there. Dr. Roger Pitman conducted a study at Harvard in which he showed college students a film called Faces of Death, which contained newsreel footage of violent deaths and executions. This movie, now widely banned, is as extreme as any institutional review board would allow, but it did not cause Pitman’s normal volunteers to develop symptoms of PTSD. If you want to study traumatic memory, you have to study the memories of people who have actually been traumatized.
At some point I tried to read some papers on the topic to see what the state of the debate is; here's what I wrote about it in another post:
This post discusses suppressing traumatic memories, drawing on the theories of clinical practitioners, who have disagreements with clinical researchers about whether memory suppression is a thing (Patihis, Ho, Tingen, Lilienfeld, & Loftus, 2014).
Much of the criticism about repressed memories is aimed at a specific concept from Freudian theory, and/or at the question of how reliable therapeutically recovered memories are. Several of the critics (e.g. Rofé, 2008) acknowledge that people may suppress or intentionally forget painful memories, but argue that this is distinct from the Freudian concept of repression. However, memory suppression in the sense discussed in this post is not related to the Freudian concept, and also includes intentional attempts to forget or avoid thinking about something, as the examples will hopefully demonstrate.
In fact, the memories being hard to forget is exactly the problem, something that many critics of the standard Freudian paradigm are keen to point out - traumatic memories are often particularly powerful and long-lasting.
I do make the assumption that conscious attempts to forget something may eventually become sufficiently automated that the person themselves can no longer notice them; but this seems like a straightforward inference from the observation that skills and habits in general can become automated enough to happen without the person realizing what they are doing. A recent experiment (unreplicated, but I have a reasonably high prior for cognitive psychology experiments replicating) also showed that once people are trained to intentionally forget words that are associated with a particular cue, the cue reduces recall of those words even when it is paired with them too briefly to consciously register (Salvador et al. 2018).
I make no strong claims about the reliability of memories recovered in therapy. It has been clearly demonstrated that it is possible for therapists to accidentally or intentionally implant false memories, but there have also been cases of people recovering memories which have then been confirmed from other sources. Probably some recovered memories are genuine (though possibly distorted) and some are not.
I'm confused by what is meant by this comment. Are you suggesting that Firinn has brain damage?
That makes sense. I have some amount of those kinds of spirals too, especially around "I'm well-focused on my tasks and productive starting from early morning" vs. "I keep procrastinating on my tasks and getting distracted from the ones that I did manage to start". Focus seems to feed additional focus and distraction seems to feed additional distraction, but distraction is often stronger, so if the focus starts slipping, it's an easy slide into complete distraction.
In that context I'd think of the percentage thing on the level of spirals rather than individual tasks. E.g. getting into a positive spiral on 1% of days is better than getting into a positive spiral on 0% of days, and if I get into a negative spiral on one day, I can take comfort in the fact that tomorrow may be a more positive one. (If your spirals are longer than one day, adjust appropriately.)
So this seems to me like it's the crux. I agree with you that GPT-4 is "pretty good", but I think the standard necessary for things to go well is substantially higher than "pretty good", and that's where the difficulty arises once we start applying higher and higher levels of capability and influence on the environment.
This makes sense to me. On the other hand - it feels like there's some motte and bailey going on here, if one claim is "if the AIs get really superhumanly capable then we need a much higher standard than pretty good", but then it's illustrated using examples like "think of how your AI might not understand what you meant if you asked it to get your mother out of a burning building".
Yeah, the intended context is for people who feel like they managed to solve their problem for good, only to have it unexpectedly come back again. If that's not the context of your problems, then it's not useful advice for you.
I don't know how literally you meant the "visibly", but people are often good at covering up how they feel inside. My mood is more stable these days so it's a bit hard to recall the details anymore, but I would find it very plausible that a well-slept night could at some point have made the difference between "feeling basically good and happy" and "feeling like life is not worth living" for me.
But you probably wouldn't have been able to tell that from the outside. Probably to an outside observer, it would have looked more like "Kaj looks like he's in a good mood today" vs. "Kaj looks a little reserved today".
People responding to studies assume that the questions make sense; if a question seems not to make sense, they suspect it's a trick question or that they've misunderstood the instructions, and try to do something that makes more sense.
A question that says "the right answer is X, please write it down here" doesn't make sense in terms of how most people understand tests. Why would they be asked a question whose answer is already given?
I'm guessing that most of the people who got this wrong thought "wtf, there has to be some trick here", then tried to figure out what the answer should be, got it wrong, and then felt satisfied that they had noticed and avoided the "trick to see if they were paying attention".
I have definitely had periods where my mood on the day has basically been determined by whether I've slept well or not (assuming no other major factors influencing it in either direction).
GPT-4 was not designed to be commercially deployed at scale.
What makes you say that?
For anyone else who was wondering: it's physically located in Berkeley (mentioned on the website but not in the post).
I picked "resisting social pressure" and then when I got the second message, I thought "Aha, I was asked if I value resisting social pressure, and now I'm offered the chance of applying social pressure to make things go my way, to see if I will defect against the very virtue I claimed to be in favor of! I'm guessing that there's a different message tailored for each of the virtues, where everyone is offered some action that is actually the opposite of the virtue they claimed to endorse, to see how many people are consistent. Clever! Can't wait to see what the opposite choice for the other virtues is."
Now I'm slightly disappointed that this wasn't the case.
If we had 2.5 petabytes of storage, there'd be no reason for the brain to bother!
I recall reading an anecdote (though I don't remember the source, ironically enough) from someone who said they had an exceptional memory, and that such a perfect memory gets nightmarish. Everything they saw constantly reminded them of some other thing associated with it. And when they recalled a memory, they didn't just recall the memory itself, but also each time in their life when they had recalled that memory, and every time they had recalled recalling those memories, and so on.
I also have a friend whose memory isn't quite that good, but she says that unpleasant events have an extra impact on her because the memory of them never fades or weakens. She can recall embarrassments and humiliations from decades back with the same force and vividness as if they had happened yesterday.
Those kinds of anecdotes suggest to me that the issue is not that the brain would in principle have insufficient capacity for storing everything, but that recalling everything would create too much interference and that the median human is more functional if most things are forgotten.
EDIT: Here is one case study reporting this kind of thing:
We know of no other reported case of someone who recalls personal memories over and over again, who is both the warden and the prisoner of her memories, as AJ reports. We took seriously what she told us about her memory. She is dominated by her constant, uncontrollable remembering, finds her remembering both soothing and burdensome, thinks about the past “all the time,” lives as if she has in her mind “a running movie that never stops” [...]
One way to conceptualize this phenomenon is to see AJ as someone who spends a great deal of time remembering her past and who cannot help but be stimulated by retrieval cues. Normally people do not dwell on their past but they are oriented to the present, the here and now. Yet AJ is bound by recollections of her past. As we have described, recollection of one event from her past links to another and another, with one memory cueing the retrieval of another in a seemingly “unstoppable” manner. [...]
Like us all, AJ has a rich storehouse of memories latent, awaiting the right cues to invigorate them. The memories are there, seemingly dormant, until the right cue brings them to life. But unlike AJ, most of us would not be able to retrieve what we were doing five years ago from this date. Given a date, AJ somehow goes to the day, then what she was doing, then what she was doing next, and left to her own style of recalling, what she was doing next. Give her an opportunity to recall one event and there is a spreading activation of recollection from one island of memory to the next. Her retrieval mode is open, and her recollections are vast and specific.
I don't immediately see the connection in your comment to what I was saying, which implies that I didn't express my point clearly enough.
To rephrase: I interpreted FeepingCreature's comment as suggesting that 2.5 petabytes feels implausibly large, and that it feels implausible because, based on introspection, it doesn't seem like one's memory contains that much information. My comment was meant to suggest that since we never seem to run out of memory storage, we should expect our memories to contain far less information than the brain's maximum capacity - there always seems to be more capacity to spare for new information.
To me any big number seems plausible, given that AFAIK people don't seem to run into upper limits on how much information the human brain can contain - while you do forget some things that don't get rehearsed, and learning does slow down in old age, there are plenty of people who continue learning new things and keep a reasonably sharp memory well into old age. If there's any point at which the brain "runs out of hard drive space" and becomes unable to store new information, I'm at least not aware of any study suggesting it.
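For a rough sense of scale (my own back-of-envelope illustration; the 80-year lifespan is an assumption, the 2.5 PB figure is the one quoted upthread), dividing 2.5 petabytes across an 80-year life works out to roughly a megabyte of storage per second lived:

```python
# Back-of-envelope only; all numbers are illustrative assumptions.

bytes_total = 2.5e15                         # 2.5 petabytes
lifetime_seconds = 80 * 365.25 * 24 * 3600   # ~2.5e9 seconds in 80 years

print(bytes_total / lifetime_seconds)        # ~1e6 bytes, i.e. roughly 1 MB per second of life
```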
I think that sharing the reasoning in private with a small number of people might somewhat help with the "Alignment people specifically making bad strategic decisions that end up having major costs" cost, but not with the others, and even then it would only help a small number of the people working in alignment rather than the field in general.