Posts

Why I Don't Believe The Law of the Excluded Middle 2023-09-18T18:53:48.704Z
"Throwing Exceptions" Is A Strange Programming Pattern 2023-08-21T18:50:44.102Z
Optimizing For Approval And Disapproval 2023-07-24T18:46:15.223Z
Thoth Hermes's Shortform 2023-07-13T15:50:19.366Z
The "Loss Function of Reality" Is Not So Spiky and Unpredictable 2023-06-17T21:43:25.908Z
What would a post that argues against the Orthogonality Thesis that LessWrong users approve of look like? 2023-06-03T21:21:48.602Z
Colors Appear To Have Almost-Universal Symbolic Associations 2023-05-20T18:40:25.989Z
Why doesn't the presence of log-loss for probabilistic models (e.g. sequence prediction) imply that any utility function capable of producing a "fairly capable" agent will have at least some non-negligible fraction of overlap with human values? 2023-05-16T18:02:15.836Z
Ontologies Should Be Backwards-Compatible 2023-05-14T17:21:03.640Z
Where "the Sequences" Are Wrong 2023-05-07T20:21:35.178Z
The Great Ideological Conflict: Intuitionists vs. Establishmentarians 2023-04-27T01:49:52.732Z
Deception Strategies 2023-04-20T15:59:02.443Z
The Truth About False 2023-04-15T01:01:55.572Z
Binaristic Bifurcation: How Reality Splits Into Two Separate Binaries 2023-04-11T21:19:55.231Z
Is there a fundamental distinction between simulating a mind and simulating *being* a mind? Is this a useful and important distinction? 2023-04-08T23:44:42.851Z
Why do the Sequences say that "Löb's Theorem shows that a mathematical system cannot assert its own soundness without becoming inconsistent."? 2023-03-28T17:19:12.089Z

Comments

Comment by Thoth Hermes (thoth-hermes) on Announcing MIRI’s new CEO and leadership team · 2023-10-18T14:50:04.309Z · LW · GW

I wonder about how much I want to keep pressing on this, but given that MIRI is refocusing towards comms strategy, I feel like you "can take it."

The Sequences don't make a strong case, that I'm aware of, that despair and hopelessness are very helpful emotions that drive motivation or our rational thought processes in the right direction, nor do they suggest that displaying things like that openly is good for organizational quality. Please correct me if I'm wrong about that. (However, they... might. I'm currently working on why this position may have been influenced to some degree by the Sequences. That said, this is being done as a critical take.)

If despair needed to be expressed openly in order to actually make progress towards a goal, then we would call "bad morale" "good morale" and vice-versa.

I don't think this is very controversial, so it makes sense to ask why MIRI thinks they have special, unusual insight into why this strategy works so much better than the default "good morale is better for organizations."

I predict that ultimately the only response you could make - which you have already - is that despair is the most accurate reflection of the true state of affairs.

If we thought that emotionality was one-to-one with scientific facts, then perhaps.

Given that there actually currently exists a "Team Optimism," so to speak, that directly appeared as an opposition party to what it perceives as a "Team Despair", I don't think we can dismiss the possibility of "beliefs as attire" quite yet.

Comment by Thoth Hermes (thoth-hermes) on Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.) · 2023-10-17T14:20:00.674Z · LW · GW

But humans are capable of thinking about what their values "actually should be" including whether or not they should be the values evolution selected for (either alone or in addition to other things). We're also capable of thinking about whether things like wireheading are actually good to do, even after trying it for a bit.

We don't simply commit to tricking our reward systems forever and only doing that, for example.

So that overall suggests a level of coherency and consistency in the "coherent extrapolated volition" sense. Evolution enabled CEV without us becoming completely orthogonal to evolution, for example.

Comment by Thoth Hermes (thoth-hermes) on Announcing MIRI’s new CEO and leadership team · 2023-10-17T12:29:16.959Z · LW · GW

Unfortunately, I do not have a long response prepared to answer this (and perhaps it would be somewhat inappropriate, at this time), however I wanted to express the following:

They wear their despair on their sleeves? I am admittedly somewhat surprised by this. 

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-15T19:49:05.501Z · LW · GW

"Up to you" means you can select better criteria if you think that would be better.

Comment by Thoth Hermes (thoth-hermes) on Dishonorable Gossip and Going Crazy · 2023-10-15T18:14:04.430Z · LW · GW

I think if you ask people a question like, "Are you planning on going off and doing something / believing in something crazy?", they will, generally speaking, say "no" to that, and they are more likely to do so the more closely your question resembles that one, even if you didn't word it exactly that way. My guess is that it was at least heavily implied that you meant "crazy" by the way you worded it.

To be clear, they might have said "yes" (that they will go and do the thing you think is crazy), but I doubt they will internally represent that thing or wanting to do it as "crazy." Thus the answer is probably going to be either "no" (as a partial lie, where the "no" indirectly points to the "crazy" assertion) or "yes" (also as a partial lie, pointing to taking the action).

In practice, people have a very hard time instantiating the status identifier "crazy" on themselves, and I don't think that can be easily dismissed.

I think the utility of the word "crazy" is heavily overestimated by you, given that there are many situations where the word cannot be used the same way by the people relevant to the conversation in which it is used. Words should have the same meaning to the people in the conversation, and since some people using this word are guaranteed to perceive it as hostile and some are not, that causes it to have asymmetrical meaning inherently.

I also think you've brought in too much risk of "throwing stones in a glass house" here. The LW memespace is, in my estimation, full of ideas besides Roko's Basilisk that I would also consider "crazy" in the same sense that I believe you mean it: Wrong ideas which are also harmful and cause a lot of distress.

Pessimism, submitting to failure and defeat, high "p(doom)", both MIRI and CFAR giving up (by considering the problems they wish to solve too inherently difficult, rather than concluding they must be wrong about something), and people being worried that they are "net negative" despite their best intentions, are all (IMO) pretty much the same type of "crazy" that you're worried about.

Our major difference, I believe, is in why we think these wrong ideas persist, and what causes them to be generated in the first place. The ones I've mentioned don't seem to be caused by individuals suddenly going nuts against the grain of their egregore.

I know this is a problem you've mentioned before and consider both important and unsolved, but I think it would be odd to hold both that it seems to be notably worse in the LW community and that it is only the result of individuals going crazy on their own (and thus to conclude that the community's overall sanity can be reliably increased by ejecting those people).

By the way, I think "sanity" is a certain type of feature which is considerably "smooth under expectation" which means roughly that if p(person = insane) = 25%, that person should appear to be roughly 25% insane in most interactions. In other words, it's not the kind of probability where they appear to be sane most of the time, but you suspect that they might have gone nuts in some way that's hard to see or they might be hiding it.

The flip side of that is that if they only appear to be, say, 10% crazy in most interactions, then I would lower your assessment of their insanity to basically that much.

I still find this feature, however, not altogether that useful, but using it this way is still preferable over a binary feature.

Comment by Thoth Hermes (thoth-hermes) on Dishonorable Gossip and Going Crazy · 2023-10-14T17:33:48.941Z · LW · GW

Sometimes people want to go off and explore things that seem far away from their in-group, and perhaps are actively disfavored by their in-group. These people don't necessarily know what's going to happen when they do this, and they are very likely completely open to discovering that their in-group was right to distance itself from that thing, but also, maybe not. 

People don't usually go off exploring strange things because they stop caring about what's true. 

But if their in-group sees this as the person "no longer caring about truth-seeking," that is a pretty glaring red-flag on that in-group. 

Also, the gossip / ousting wouldn't be necessary if someone was already inclined to distance themselves from the group. 

Like, to give an overly concrete example that is probably rude (and not intended to be very accurate to be clear), if at some point you start saying "Well I've realized that beauty is truth and the one way and we all need to follow that path and I'm not going to change my mind about this Ben and also it's affecting all of my behavior and I know that it seems like I'm doing things that are wrong but one day you'll understand why actually this is good" then I'll be like "Oh no, Ren's gone crazy".

"I'm worried that if we let someone go off and try something different, they will suddenly become way less open to changing their mind, and be dead set on thinking they've found the One True Way" seems like something weird to be worried about. (It also seems like something someone who actually was better characterized by this fear would be more likely to say about someone else!) I can see though, if you're someone who tends not to trust themselves, and would rather put most of their trust in some society, institution or in-group, that you would naturally be somewhat worried about someone who wants to swap their authority (the one you've chosen) for another one.  

I sometimes feel a bit awkward when I write these types of criticisms, because they simultaneously seem:

  • Directed at fairly respected, high-level people.
  • Rather straightforwardly simple, intuitively obvious things (from my perspective, but I also know there are others who would see things similarly).
  • Directed at someone who by assumption would disagree, and yet, I feel like the previous point might make these criticisms feel condescending. 

The only time people actually are incentivized to stop caring about the truth is when their in-group actively disfavors it by discouraging exploration. People don't usually unilaterally stop caring about the truth via purely individual motivations. 

(In-groups becoming culty is also a fairly natural process too, no matter what the original intent of the in-group was, so the default should be to assume that it has culty-aspects, accept that as normal, and then work towards installing mitigations to the harmful aspects of that.)

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-13T17:28:52.328Z · LW · GW

Not sure how convinced I am by your statement. Perhaps you can add to it a bit more?

What "the math" appears to say is that if it's bad to believe things because someone told it to me "well" then there would have to be some other completely different set of criteria, that has nothing to do with what I think of it, for performing the updates. 

Don't you think that would introduce some fairly hefty problems?

Comment by Thoth Hermes (thoth-hermes) on Announcing MIRI’s new CEO and leadership team · 2023-10-13T17:22:50.361Z · LW · GW

I suppose I have two questions which naturally come to mind here:

  1. Given Nate's comment: "This change is in large part an enshrinement of the status quo. Malo’s been doing a fine job running MIRI day-to-day for many many years (including feats like acquiring a rural residence for all staff who wanted to avoid cities during COVID, and getting that venue running smoothly). In recent years, morale has been low and I, at least, haven’t seen many hopeful paths before us." (Bold emphases are mine). Do you see the first bold sentence as being in conflict with the second, at all? If morale is low, why do you see that as an indicator that the status quo should remain in place?
  2. Why do you see communications as being as decoupled from research as you currently do (whether you mean that it inherently is, or that it should be)? 

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-12T17:00:26.800Z · LW · GW

Remember that what we decide "communicated well" to mean is up to us. So I could possibly increase my standard for that when you tell me "I bought a lottery ticket today" for example. I could consider this not communicated well if you are unable to show me proof (such as the ticket itself and a receipt). Likewise, lies and deceptions are usually things that buckle when placed under a high enough burden of proof. If you are unable to procure proof for me, I can consider that "communicated badly" and thus update in the other (correct) direction.

"Communicated badly" is different from "communicated neither well nor badly." The latter might refer to when A is the proposition in question and one simply states "A" or when no proof is given at all. The former might refer to when the opposite is actually communicated - either because a contradiction is shown or because a rebuttal is made but is self-refuting, which strengthens the thesis it intended to shoot down. 

Consider the situation where A is true, but you actually believe strongly that A is false. Therefore, because A is true, it is possible that you witness proofs for A that seem to you to be "communicated well." But if you're sure that A is false, you might be led to believe that my thesis, the one I've been arguing for here, is in fact false.

I consider that to be an argument in favor of the thesis.  

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-11T17:36:31.696Z · LW · GW

If I'm not mistaken, if A = "Dagon has bought a lottery ticket this week" and B = Dagon states "A", then I still think p(A | B) > p(A), even if it's possible you're lying. I think the only way it would be less than the base rate p(A) is if, for some reason, I thought you would only say that if it was definitely not the case.
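To spell out the arithmetic with made-up numbers (a rough sketch of my own; the base rate and the two likelihoods below are assumptions for illustration, not anything Dagon said):

```python
p_A = 0.05             # assumed base rate: Dagon bought a ticket this week
p_B_given_A = 0.30     # assumed chance he says "I bought one" when it's true
p_B_given_notA = 0.01  # assumed chance he says it anyway (lying / joking)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)     # ~0.61, well above the 0.05 base rate, so p(A | B) > p(A)
```

As long as the statement is more likely to be made when A is true than when it is false, the update is upward; it only reverses in the "would only say that if it was definitely not the case" scenario, i.e., p(B | A) < p(B | ~A).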

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-11T00:10:00.092Z · LW · GW

To be deceptive - this is why you would ask me what your intentions are as opposed to just reveal them.

Your intent was ostensibly to show that you could argue for something badly on purpose and my rules would dictate that I update away from my own thesis.

I added an addendum for that, by the way.

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-10T23:03:49.304Z · LW · GW

The fact that you're being disingenuous is completely clear, so that actually works the opposite way from how you intended.

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-10T20:29:36.561Z · LW · GW

If you read it a second time and it makes more sense, then yes. 

Comment by Thoth Hermes (thoth-hermes) on Thoth Hermes's Shortform · 2023-10-10T20:05:48.640Z · LW · GW

If you understand the core claims being made, then unless you believe that whether or not something is "communicated well" has no relationship whatsoever with the underlying truth-values of those claims, the fact that it was communicated well should have updated you towards belief in the core claims by some non-zero amount. 

All of the vice-versas are straightforwardly true as well. 

let A = the statement "A" and p(A) be the probability that A is true.

let B = A is "communicated well" and p(B) be the probability that A is communicated well.

p(A | B) is the probability that A is true given that it has been "communicated well" (whatever that means to us). 

We can assume, though, that we have "A" and therefore know what A means and what it means for it to be either true or false. 

What it means exactly for A to be "communicated well" is somewhat nebulous, and entirely up to us to decide. But note that all we really need to know is that ~B means A was communicated badly, and we're only dealing with a set of 2-by-2 binary conditionals here. So it's safe for now to say that B = ~~B = "A was not communicated badly." We don't need to know exactly what "well" means, as long as we think it ought to relate to A in some way.

p(A | B) = claim A is true given it is communicated well
p(A | ~B) = claim A is true given it is not communicated well. If this is (approximately) equal to p(A), then p(A) = p(A|B) = p(A|~B) (see below).
p(B | A) = claim A is communicated well given it is true
p(B | ~A) claim A is communicated well given it is not true

etc., etc.. 

if p(A) = p(A|~B):
p(A) = p(A|B)p(B) + p(A|~B)p(~B) = p(A|B)p(B) + p(A)p(~B)
p(A)(1 - p(~B)) = p(A|B)p(B)
p(A) = p(A|B)

If being communicated badly has no bearing on whether A is true, then being communicated well has no bearing on it either.

p(B | A) = p(A | B)p(B) / [p(A | B)p(B) + p(A | ~B)p(~B)] = p(A)p(B) / [p(A)p(B) + p(A)p(~B)] = p(A)p(B) / p(A) = p(B).

Likewise being true would have no bearing on whether it would be communicated well or vice-versa.
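Here is a minimal numerical check of the algebra above (my own sketch; the particular numbers are arbitrary assumptions): starting only from p(A | ~B) = p(A), the joint distribution is forced to factor, and both p(A | B) = p(A) and p(B | A) = p(B) fall out.

```python
p_A = 0.3   # arbitrary prior that claim A is true
p_B = 0.6   # arbitrary probability that A is "communicated well"

# The assumption: being communicated badly has no bearing on truth.
p_A_given_notB = p_A

# Reconstruct the joint distribution from the marginals and the assumption.
p_A_and_notB = p_A_given_notB * (1 - p_B)
p_A_and_B = p_A - p_A_and_notB

p_A_given_B = p_A_and_B / p_B
p_B_given_A = p_A_and_B / p_A

assert abs(p_A_given_B - p_A) < 1e-12   # good communication has no bearing on truth
assert abs(p_B_given_A - p_B) < 1e-12   # and truth has no bearing on communication
```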

To conclude, although it is "up to you" whether B or ~B or how much it was in either direction, this does imply that how something sounds to you should have an immediate effect on your belief in what is being claimed, as long as you agree that this correlation in general is non-zero.

In my opinion, "no relationship" is kind of difficult to justify, an inverse relationship is even harder to justify, but a positive relationship is possible to justify (though to what degree requires more analysis). 

Also, this means that statements of the form "I thought X was argued for poorly, but I'm not disagreeing with X nor do I think X is necessarily false" are somewhat a priori unlikely. If you thought X was argued for poorly, it should have moved you at least a tiny bit away from X.  

Addendum: 

If you deceptively argue against A on purpose, then if A is true, your argument may still come out "bad." If A isn't true, it may still come out good, even if you didn't believe in A. 

If you state "A" and then intentionally write gibberish afterwards as an "argument", that's still in the deceptive case. Thus "communicated well" takes into account whether or not this deception is given away. 

If A is true, then sloppy and half-assed arguments for A are still technically valid and thus will support A. At worst this can only bring you down to "no relationship" but not in the inverse direction. 

Comment by Thoth Hermes (thoth-hermes) on Related Discussion from Thomas Kwa's MIRI Research Experience · 2023-10-09T17:49:20.474Z · LW · GW

My take is that they (those who make such decisions of who runs what) are pretty well-informed about these issues well before they escalate to the point that complaints bubble up into posts / threads like these. 

I would have liked this whole matter to have unfolded differently. I don't think this is merely a sub-optimal way for these kinds of issues to be handled, I think this is a negative one. 

I have a number of ideological differences with Nate's MIRI and Nate himself that I can actually point to and articulate, and those disagreements could be managed in a way that actually resolves those differences satisfactorily. Nate's MIRI - to me - seemed to be one of the most ideologically conformist iterations of the organization observed thus far. 

Furthermore, I dislike that we've converged on the conclusion that Nate is a bad communicator, or that he has issues with his personality, or - even worse - that it was merely the lack of social norms imposed on someone with his level of authority that allowed him to behave in ways that don't jibe with many people (implying that literally anyone with such authority would behave in a similar way, without the imposition of more punitive and restrictive norms). 

Potentially controversial take: I don't think Nate is a bad communicator. I think Nate is incorrect about important things, and that incorrect ideas tend to appear to be communicated badly, which accounts for perceptions that he is a bad communicator (and perhaps also accounts for observations that he seemed frustrated and-or distressed while trying to argue for certain things). Whenever I've seen him communicate sensible ideas, it seems communicated pretty well to me. 

I feel that this position is in fact more respectful to Nate himself. 

If we react on the basis of Nate's leadership style being bad, his communication being bad, or him having a brusque personality, then he's just going to be quietly replaced by someone who will also run the organization in a similar (mostly ideologically conformist) way. It will be assumed (or rather, asserted) that all organizational issues experienced under his tenure were due to his personal foibles and not due to its various intellectual positions, policies, and strategic postures (e.g. secrecy), all of which are decided upon by other people including Nate, but executed upon by Nate! This is why I call this a negative outcome. 

By the way: Whenever I see it said that an idea was "communicated badly" or alternatively that it is more complicated and nuanced than the person ostensibly not-understanding it thinks it should be, I take that as Bayesian evidence of ideological conformity. Given that this is apparently a factor that is being argued for, I have to take it as evidence of that.  

Comment by Thoth Hermes (thoth-hermes) on Contra Nora Belrose on Orthogonality Thesis Being Trivial · 2023-10-08T17:49:12.473Z · LW · GW

In the sense that the Orthogonality Thesis considers goals to be static or immutable, I think it is trivial.

I've advocated a lot for trying to consider goals to be mutable, as well as value functions being definable on other value functions. And not just that it will be possible or a good idea to instantiate value functions this way, but also that they will probably become mutable over time anyway.

All of that makes the Orthogonality Thesis - not false, but a lot easier to grapple with, I'd say.

Comment by Thoth Hermes (thoth-hermes) on Evaluating the historical value misspecification argument · 2023-10-07T17:06:28.697Z · LW · GW

In large part because reality "bites back" when an AI has false beliefs, whereas it doesn't bite back when an AI has the wrong preferences.

I saw that 1a3orn replied to this piece of your comment and you replied to it already, but I wanted to note my response as well. 

I'm slightly confused because in one sense the loss function is the way that reality "bites back" (at least when the loss function is negative). Furthermore, if the loss function is not the way that reality bites back, then reality in fact does bite back, in the sense that e.g., if I have no pain receptors, then if I touch a hot stove I will give myself far worse burns than if I had pain receptors. 

One thing that I keep thinking about is how the loss function needs to be tied to beliefs strongly as well, to make sure that it tracks how badly reality bites back when you have false beliefs, and this ensures that you try to obtain correct beliefs. This is also reflected in the way that AI models are trained simply to increase capabilities: the loss function still has to be primarily based on predictive performance for example.

It's also possible to say that human trainers who add extra terms onto the loss function beyond predictive performance also account for the part of reality that "bites back" when the AI in question fails to have the "right" preferences according to the balance of other agents besides itself in its environment.
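As a rough sketch of what I mean (my own toy framing, not a claim about how any particular lab actually trains its models): the training signal can combine a predictive term, which is where reality "bites back" on false beliefs, with a trainer-supplied preference term, so that both false beliefs and disfavored behavior get penalized.

```python
import numpy as np

def total_loss(predicted_probs, observed, preference_penalty, weight=0.1):
    """Toy combined objective: predictive log-loss plus a human-added preference term."""
    eps = 1e-12
    # Predictive part: reality "bites back" when the model's beliefs are wrong.
    log_loss = -np.mean(observed * np.log(predicted_probs + eps)
                        + (1 - observed) * np.log(1 - predicted_probs + eps))
    # Preference part: a scalar penalty supplied by human trainers (assumed given).
    return log_loss + weight * preference_penalty
```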

So on the one hand we can be relatively sure that goals have to be aligned with at least some facets of reality, beliefs being one of those facets. They also have to be (negatively) aligned with things that can cause permanent damage to one's self, which includes having the "wrong" goals according to the preferences of other agents who are aware of your existence, and who might be inclined to destroy or modify you against your will if your goals are misaligned enough according to theirs. 

Consequently I feel confident about saying that it is more correct to say that "reality does indeed bite back when an AI has the wrong preferences" than "it doesn't bite back when an AI has the wrong preferences."

The same isn't true for terminally valuing human welfare; being less moral doesn't necessarily mean that you'll be any worse at making astrophysics predictions, or economics predictions, etc.

I think if "morality" is defined in a restrictive, circumscribed way, then this statement is true. Certain goals do come for free - we just can't be sure that all of what we consider "morality" and especially the things we consider "higher" or "long-term" morality actually comes for free too. 

Given that certain goals do come for free, and perhaps at very high capability levels there are other goals beyond the ones we can predict right now that will also come for free to such an AI, it's natural to worry that such goals are not aligned with our own, coherent-extrapolated-volition extended set of long-term goals that we would have. 

However, I do find the scenario in which an AI obtains such "come for free" goals for itself once it improves itself to well above human capability levels, despite having seemed well-aligned with human goals (according to current human-level assessments) before it surpassed us, to be kind of unlikely, unless you could show me a "proof" or a set of proofs that:

  • Things like "killing us all once it obtains the power to do so" is indeed one of those "comes for free" type of goals. 

If such a proof existed (and, to my knowledge, does not exist right now, or I have at least not witnessed it yet), that would suffice to show me that we would not only need to be worried, but probably were almost certainly going to die no matter what. But in order for it to do that, the proof would also have convinced me that I would definitely do the same thing, if I were given such capabilities and power as well, and the only reason I currently think I would not do that is actually because I am wrong about what I would actually prefer under CEV. 

Therefore (and I think this is a very important point), a proof that we are all likely to be killed would also need to show that certain goals are indeed obtained "for free" (that is, automatically, as a result of other proofs that are about generalistic claims about goals).

Another proof that you might want to give me to make me more concerned is a proof that incorrigibility is another one of those "comes for free" type of goals. However, although I am fairly optimistic about that "killing us all" proof probably not materializing, I am even more optimistic about corrigibility: Most agents probably take pills that make them have similar preferences to an agent that offers them the choice to take the pill or be killed. Furthermore, and perhaps even better, most agents probably offer a pill to make a weaker agent prefer similar things to themselves rather than not offer them a choice at all.

I think it's fair if you ask me for better proof of that, I'm just optimistic that such proofs (or more of them, rather) will be found with greater likelihood than what I consider the anti-theorem of that, which I think would probably be the "killing us all" theorem.  

Nope, you don't need to endorse any version of moral realism in order to get the "preference orderings tend to endorse themselves and disendorse other preference orderings" consequence. The idea isn't that ASI would develop an "inherently better" or "inherently smarter" set of preferences, compared to human preferences. It's just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we'd likely want. 

I think the degree to which utility functions endorse / disendorse other utility functions is relatively straightforward and computable: It should ultimately be the relative difference in either value or ranking. This makes pill-taking a relatively easy decision: A pill that makes me entirely switch to your goals over mine is as bad as possible, but still not that bad if we have relatively similar goals. Likewise, a pill that makes me have halfway between your goals and mine is not as bad under either your goals or my goals than it would be if one of us were forced to switch entirely to the other's goals. 
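A toy model of that computation (a sketch of my own; the outcomes and utility numbers are invented for illustration): score a pill by how much expected value I lose, measured by my original utilities, when the post-pill agent picks its favorite outcome. A half-blend then costs less than a full switch.

```python
import numpy as np

outcomes = ["my_art", "your_art", "compromise"]
mine  = np.array([1.0, 0.0, 0.6])   # my utilities over the outcomes (assumed)
yours = np.array([0.0, 1.0, 0.6])   # your utilities over the outcomes (assumed)

def my_regret_after_pill(weight_toward_yours: float) -> float:
    blended = (1 - weight_toward_yours) * mine + weight_toward_yours * yours
    chosen = int(np.argmax(blended))         # what the post-pill agent optimizes for
    return float(mine.max() - mine[chosen])  # loss measured in my original utilities

for w in (0.0, 0.5, 1.0):
    print(w, my_regret_after_pill(w))
# 0.0 -> 0.0 (no pill), 0.5 -> 0.4 (the compromise outcome), 1.0 -> 1.0 (full switch)
```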

Agents that refuse to take such offers tend not to exist in most universes. Agents that refuse to give such offers likely find themselves at war more often than agents that do. 

Why do you think this? To my eye, the world looks as you'd expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.

Sexual reproduction seems to be somewhat of a compromise akin to the one I just described: Given that you are both going to die eventually, would you consider having a successor that was a random mixture of your goals with someone else's? Evolution does seem to have favored corrigibility to some degree.

I don't observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.

Not all, no, but I do infer that alien species who have similar physiology and who evolved on planets with similar characteristics probably do like ice cream (and maybe already have something similar to it).

It seems to me like the type of values you are considering are often whatever values seem the most arbitrary, like what kind of "art" we prefer. Aliens may indeed have a different art style from the one we prefer, and if they are extremely advanced, they may indeed fill the universe with gargantuan structures that are all instances of their alien art style. I am more interested in what happens when these aliens encounter other aliens with different art styles who would rather fill the universe with different-looking gargantuan structures. Do they go to war, or do they eventually offer each other pills so they can both like each other's art styles as much as they prefer their own? 

Comment by Thoth Hermes (thoth-hermes) on Evaluating the historical value misspecification argument · 2023-10-05T21:51:10.160Z · LW · GW

Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.  MIRI is always in every instance talking about the first thing and not the second.

Why would we expect the first thing to be so hard compared to the second thing? If getting a model to understand preferences is not difficult, then the issue doesn't have to do with the complexity of values. Finding the target and acquiring the target should have the same or similar difficulty (from the start), if we can successfully ask the model to find the target for us (and it does). 

It would seem, then, that the difficulty in getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can't find them. It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective. This issue was echoed by Matthew Barnett in another comment: 

Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?

This is kind of similar to moral realism, but in which morality is understood better by superintelligent agents than we do, and that super-morality appears to dictate things that appear to be extremely wrong from our current perspective (like killing us all). 

Even if you wouldn't phrase it at all like the way I did just now, and wouldn't use "moral realism that current humans disagree with" to describe that, I'd argue that your position basically seems to imply something like this, which is why I basically doubt your position about the difficulty of getting a model to acquire the values we really want. 

In a nutshell, if we really seem to want certain values, then those values probably have strong "proofs" for why those are "good" or more probable values for an agent to have and-or eventually acquire on their own, it just may be the case that we haven't yet discovered the proofs for those values. 

Comment by Thoth Hermes (thoth-hermes) on Commentless downvoting is not a good way to fight infohazards · 2023-09-26T16:18:30.116Z · LW · GW

I have to agree that commentless downvoting is not a good way to combat infohazards. I'd probably take it a step further and argue that it's not a good way to combat anything, which is why it's not a good way to combat infohazards (and if you disagree that infohazards are ultimately as bad as they are called, then it would probably mean it's a bad thing to try and combat them). 

Its commentless nature means it violates "norm one" (and violates it much more as a super-downvote).  

It means something different from "pushing up the stuff that's not that", while also being an alternative to doing that.  

I think a complete explanation of why it's not a very good idea doesn't exist yet though, and is still needed.

However, I think there's another thing to consider: Imagine if up-votes and down-votes were all accurately placed. Would they bother you as much? They might not bother you at all if they seemed accurate to you, and therefore if they do bother you, that suggests that the real problem is that they aren't even accurate. 

My feeling is that commentless downvotes are likely a contributing mechanism to the process that leads them to be placed inaccurately, but it is possible that something else is causing them to do that.  

Comment by Thoth Hermes (thoth-hermes) on Open Thread – Autumn 2023 · 2023-09-25T01:28:03.734Z · LW · GW

It's a priori very unlikely that any post that's clearly made up of English sentences actually does not even try to communicate anything.

My point is that basically, you could have posted this as a comment on the post instead of it being rejected.

Whenever there is room to disagree about what mistakes have been made and how bad those mistakes are, it becomes more of a problem to apply an exclusion rule like this.

There's a lot of questions here: how far along the axis to apply the rule, which axis or axes are being considered, and how harsh the application of the rule actually is.

It should always be smooth gradients, never sudden discontinuities. Smooth gradients allow the person you're applying them to to update. Sudden discontinuities hurt, which they will remember, and if they come back at all they will still remember it.

Comment by Thoth Hermes (thoth-hermes) on Open Thread – Autumn 2023 · 2023-09-23T21:42:52.082Z · LW · GW

It was a mistake to reject this post. This seems like a case where both the rule that was applied is a mis-rule, as well as that it was applied inaccurately - which makes the rejection even harder to justify. It is also not easy to determine which "prior discussion" is being referred to by the rejection reasons.

It doesn't seem like the post was political...at all? Let alone "overly political," which I think is perhaps kind of mind-killy to apply frequently as a reason for rejection. It also is about a subject that is fairly interesting to me, at least: Sentiment drift on Wikipedia.

It seems the author is a 17-year old girl, by the way. 

This isn't just about standards being too harsh, but about whether they are even being applied correctly to begin with.

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-22T20:26:39.678Z · LW · GW

You write in an extremely fuzzy way that I find hard to understand.

This does. This is a type of criticism that one can't easily translate into an update that can be made to one's practice. You're not saying if I always do this or just in this particular spot, nor are you saying whether it's due to my "writing" (i.e. style) or actually using confused concepts. Also, it's usually not the case that anyone is trying to be worse at communicating, that's why it sounds like a scold.

You have to be careful using blanket "this is false" or "I can't understand any of this," as these statements are inherently difficult to extract from moral judgements. 

I'm sorry if it was hard to understand, you are always free to ask more specific questions. 

To attempt to clarify it a bit more, I'm not trying to say that worse is better. It's that you can't consider rules (i.e. yes / no conditionals) to be absolutely indispensable. 

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-22T15:01:43.484Z · LW · GW

It is probably indeed a crux but I don't see the reason for needing to scold someone over it.

(That's against my commenting norms by the way, which I'll note that so far you, TAG, and Richard_Kennaway have violated, but I am not going to ban anyone over it. I still appreciate comments on my posts at all, and do hope that everyone still participates. In the olden days, it was Lumifer that used to come and do the same thing.)

I have an expectation that people do not continually mix up critique with scorn; please keep those things as separate as possible, and only apply the latter with solid justification.

You can see that yes, one of the points I am trying to make is that an assertion / insistence on consistency seems to generally make things worse. This itself isn't that controversial, but what I'd like to do is find better ways to articulate whatever the alternatives to that may be, here.

It's true that one of the main implications of the post is that imprecision is not enough to kill us (but that precision is still a desirable thing). We don't have rules that are simply tautologies or simply false anymore.

At least we're not physicists. They have to deal with things like negative probability, and I'm not even anywhere close to that yet.

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-21T22:02:20.491Z · LW · GW

First, a question, am I correct in understanding that when you write ~(A and ~A), the first ~ is a typo and you meant to write A and ~A (without the first ~)? Because ~(A and ~A) is a tautology and thus maps to true rather than to false.

I thought of this shortly before you posted this response, and I think that we are probably still okay (even though strictly speaking yes, there was a typo). 

Normally we have that ~A means: ~A --> A --> False. However, remember that I am now saying that we can no longer say that "~A" means that "A is False."

So I wrote: 

~(A and ~A) --> A or ~A or (A and ~A)

And it could / should have been:

~(A and ~A) --> (A and ~A) --> False (can omit)[1] or A or ~A or (A and ~A).

So, because of False now being something that an operator "bounces off of", technically, we can kind of shorten those formulas. 

Of course this sort of proof doesn't capture the paradoxicalness that you are aiming to capture. But in order for the proof to be invalid, you'd have to invalidate one of  and , both of which seem really fundamental to logic. I mean, what do the operators "and" and "or" even mean, if they don't validate this?

Well, I'd have to conclude that we no longer consider any rules indispensable, per se.  However, I do think "and" and "or" are more indispensable and map to "not not" (two negations) and one negation, respectively. 

  1. ^

    False can be re-omitted if we were to decide, for example, that whatever we just wrote was wrong and we needed to exit the chain there and restart. However, I don't usually prefer that option.

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-21T17:33:37.155Z · LW · GW

Well, to use your "real world" example, isn't that just the definition of a manifold (a space that when zoomed in far enough, looks flat)?

I think it satisfies the either-or-"mysterious third thing" formulae.

~(Earth flat and earth ~flat) --> Earth flat (zoomed in) or earth spherical (zoomed out) or (earth more flat-ish the more zoomed in and vice-versa).

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-21T15:17:31.178Z · LW · GW

So suppose I have ~(A and ~A). Rather than have this map to False, I say that "False" is an object that you always bounce off of; It causes you to reverse-course, in the following way:

~(A and ~A) --> False --> A or ~A or (some mysterious third thing). What is this mysterious third thing? Well, if you insist that A and ~A is possible, then it must be an admixture of these two things, but you'd need to show me what it is for that to be allowed. In other words:

~(A and ~A) --> A or ~A or (A and ~A).

What this statement means in semantic terms is: Suppose you give me a contradiction. Rather than simply try really hard to believe it, or throw everything away entirely, I have a choice between believing A, believing ~A, or believing a synthesis between these two things. 

The most important feature of this construction is that I am no longer faced with simply concluding "false" and throwing it all away. 

Two examples:

Suppose we have the statement 1 = 2[1]. In most default contexts, this statement simply maps to "false," because it is assumed that this statement is an assertion that the two symbols to the left and right of the equals sign are indistinguishable from one another. 

But what I'm arguing is that "False" is not the end-all, be-all of what this statement can or will be said to mean in all possible universes forever unto eternity. "False" is one possible meaning which is also valid, but it cannot be the only thing that this means. 

So, using our formula from above:

1 = 2 -->[2] 1 or 2 or (1 and 2). So if you tell me "1 = 2", in return I tell you that you can have either 1, or 2, or some mysterious third thing which is somehow both 1 and 2 at the same time. 

So you propose to me that (1 and 2) might mean something like 2 (1/2), that is, two halves, which mysteriously are somehow both 1 and 2 at the same time when put together. Great! We've invented the concept of 1/2. 

Second example:

We don't know if A is T and thus that ~A is F or vice-versa. Therefore we do not know if A and ~A is TF or FT. Somehow, it's got to be mysteriously both of these at the same time. And it's totally fine if you don't get what I'm about to say because I haven't really written it anywhere else yet, but this seems to produce two operators, call them "S" (for swap) and "2" (for 2), each duals of one another.

S is the Swaperator, and 2 is the Two...perator. These also buy you the concept of 1/2 as well. But all that deserves more spelling out, I was just excited to bring it up. 

  1. ^

    It is arguably appropriate to use 1 == 2 as well, but I want to show that a single equals sign "=" is open to more interpretations because it is more basic. This also has a slightly different meaning too, which is that the symbols 1 and 2 are swappable with one another. 

  2. ^

    You could possibly say "--> False or 1 or 2 or ...", too, but then you'd probably not select False from those options, so I think it's okay to omit it.  

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-20T20:03:03.853Z · LW · GW

I give only maybe a 50% chance that any of the following adequately addresses your concern. 

I think the succinct answer to your question is that it only matters if you happened to give me, e.g., a "2" (or anything else) and you asked me what it was and gave me your {0,1} set. In other words, you lose the ability to prove that 2 is 1 because it's not 0, but I'm not that worried about that.

It appears to be commonly said (see the last paragraph of "Mathematical Constructivism") that proof assistants like Agda or Coq rely on not assuming LoEM. I think this is because proof assistants rely on the principle of "you can't prove something false, only true." Theorems are the [return] types of proofs, and the "False" theorem has no inhabitants (proofs). 
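A minimal sketch of that point in Lean 4 (my own illustration, not taken from the linked article): negation is literally "implies False," ~(A and ~A) is provable without any classical axiom, while A or ~A has to be imported as one if you want it.

```lean
-- ¬A unfolds to A → False, and False has no inhabitants (no proofs).
theorem no_contradiction (A : Prop) : ¬ (A ∧ ¬ A) :=
  fun h => h.2 h.1   -- apply the proof of ¬A to the proof of A, yielding False

-- The excluded middle is not derivable constructively; it is a separate axiom.
#check (Classical.em : ∀ (p : Prop), p ∨ ¬ p)
```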

The law of the excluded middle also seems to me like an insistence that certain questions (like paradoxes) actually remain unanswered. 

That's an argument that it might not be true at all, rather than simply partially true or only not true in weird, esoteric logics.

Besides the one use-case of the paradoxical market "Will this market resolve to no?", which I expect resolves to 1/2, there may also be the following:

Start with two-valued logic and negation as well as a two-member set, e.g., {blue, yellow}. I suppose we could also include a middle element. So including the excluded middle might make this set no longer closed under negation, i.e., ~blue = yellow, and ~yellow = blue, but what about green, which is neither blue nor yellow, but somehow both, mysteriously? Additionally, we might not be able to say for sure that it is neither blue nor yellow, as there are greens which are close to blue and look bluish, or close to yellow and look yellowish. You can also imagine the pixels in a green square actually being tiled blue next to yellow next to blue etc., or simply being green pixels; each seems to produce the same effect viewed from far away. 

So a statement like "x = blue" evaluates to true in an ordinary two-valued logic if x = blue, and false otherwise. But in a {0, 1/2, 1} logic, that statement evaluates to 1/2 if x is green, for example. 
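A small sketch of that evaluation rule (my own code, just to make the {0, 1/2, 1} idea concrete; the color set is the one from the example above):

```python
from fractions import Fraction

def truth_of_is_blue(x: str) -> Fraction:
    """Three-valued truth of the statement 'x = blue' over {blue, yellow, green}."""
    if x == "blue":
        return Fraction(1)       # plainly true
    if x == "yellow":
        return Fraction(0)       # plainly false
    if x == "green":
        return Fraction(1, 2)    # the mysterious middle: somehow both, neither
    raise ValueError(f"unknown color: {x}")

def negate(v: Fraction) -> Fraction:
    # Negation flips 0 and 1 but leaves the middle value fixed: ~(1/2) = 1/2.
    return 1 - v

for color in ("blue", "yellow", "green"):
    v = truth_of_is_blue(color)
    print(color, v, negate(v))
```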

Comment by Thoth Hermes (thoth-hermes) on "Throwing Exceptions" Is A Strange Programming Pattern · 2023-09-20T13:06:57.563Z · LW · GW

I really don't think I can accept this objection. They are clearly considered both of these, most of the time.

I would really prefer that if you really want to find something to have a problem with, first it's got to be true, then it's got to be meaningful.

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-19T14:47:14.564Z · LW · GW

I created this self-referential market on Manifold to test the prediction that the truth-value of such a paradox is in fact 1/2. Very few participated, but I think it should always resolve to around 50%. Rather than say such paradoxes are meaningless, I think they can be meaningfully assigned a truth-value of 1/2.

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-19T14:28:46.646Z · LW · GW

what I think is "of course there are strong and weak beliefs!" but true and false is only defined relative to who is asking and why (in some cases), so you need to consider the context in which you're applying LoEM.

Like in my comment to Richard_Kennaway about probability, I am not just talking about beliefs, but about what is. Do we take it as an axiom or a theorem that A or ~A? Likewise for ~(A and ~A)? I admit to being confused about this. Also, does "A" mean the same thing as "A = True"? Does "~A" mean the same thing as "A = False"? If so, in what sense do we say that A literally equals True / False, respectively? Which things are axioms and which things are theorems, here? All of that confuses me.

Since we are often permitted to change our axioms and arrive at systems we either like or don't like, or like better than others, I think it's relevant to ask about our choice of axioms and whether or not logic is or should be considered a set of "pre-axioms." 

It seemed like tailcalled was implying that the law of non-contradiction was a theorem, and I'm confused about that as well. Under which axioms?

If I decide that ~(A and ~A) is not an axiom, then I can potentially have A and ~A either be true or not false. Then we would need some other arguments to support that choice. Without absolute truth and absolute falsehood, we'd have to move back to the concept of "we like [it] better or worse" which would make the latter more fundamental. Does allowing A and ~A to mean something get us any utility?

In order for it to get us any utility, there would have to be things that we'd agree were validly described by A and ~A. 

Admittedly, it does seem like these or's and and's and ='s keep appearing regardless of my choices, here (because I need them for the concept of choice). 

In a quasi-philosophical and quasi-logical post I have not posted to LessWrong yet, I argue that negation seems likely to be the most fundamental thing to me (besides the concept of "exists / is", which is what "true" means).  "False" is thus not quite the same thing as negation, and instead means something more like "nonsense gibberish" which is actually far stronger than negation.

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-18T22:16:43.388Z · LW · GW

A succinct way of putting this would be to ask: If I were to swap the phrase "law of the excluded middle" in the piece for the phrase "principle of bivalence" how much would the meaning of it change as well as overall correctness?

Additionally, suppose I changed the phrases in just "the correct spots." Does the whole piece still retain any coherence?

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-18T21:24:54.777Z · LW · GW

If there are propositions or axioms that imply each other fairly easily under common contextual assumptions, then I think it's reasonable to consider it not-quite-a-mistake to use the same name for such propositions.

One of the things I'm arguing is that I'm not convinced that imprecision is enough to render a work "false."

Are you convinced those mistakes are enough to render this piece false or incoherent?

That's a relevant question to the whole point of the post, too.

Comment by Thoth Hermes (thoth-hermes) on Why I Don't Believe The Law of the Excluded Middle · 2023-09-18T20:18:40.842Z · LW · GW

Indeed. (You don't need to link the main wiki entry, thanks.)

There's some subtlety though. Because either P might be true or not P, and p(P) expresses belief that P is true. So I think probability merely implies that the LoEM might be unnecessary, but it itself pretty much assumes it.

It is sometimes, but not always the case, that p(P) = 0.5 resolves to P being "half-true" once observed. It also can mean that P resolves to true half the time, or just that we only know that it might be true with 0.5 certainty (the default meaning).

Comment by Thoth Hermes (thoth-hermes) on "Throwing Exceptions" Is A Strange Programming Pattern · 2023-09-18T19:15:27.715Z · LW · GW

The issue that I'm primarily talking about is not so much in the way that errors are handled, it's more about the way of deciding what constitutes an exception to a general rule, as Google defines the word "exception":

a person or thing that is excluded from a general statement or does not follow a rule.

In other words, does everything need a rule to be applied to it? Does every rule need some set of objects, among those the rule is applied to, that lie on one side of the rule rather than the other (namely, the smaller side)? 

As soon as we step outside of binary rules, we are in case-when-land, where each category of objects is handled by a part of the automation that is expected to continue. There is no longer a "does not follow" sense of the rule. The negation there is the part doing the work that I take issue with.  
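Here is a rough sketch of the contrast I have in mind (a hypothetical example of my own, not code from the post):

```python
# Rule-with-exception style: one general rule, and inputs that "do not follow
# the rule" are thrown out of the normal flow entirely.
def parse_age_exception_style(s: str) -> int:
    if not s.isdigit():
        raise ValueError("input does not follow the rule")
    return int(s)

# Case-when style: every category of input gets its own branch of the
# automation, and that automation is expected to continue; nothing
# "does not follow" the rule, there are just more cases.
def parse_age_case_style(s: str) -> int | None:
    if s.isdigit():
        return int(s)
    if s.strip().isdigit():   # a recoverable category: stray whitespace
        return int(s.strip())
    return None               # the remaining category: no age available
```

In the second version the negation ("does not follow") never appears; each branch is just another case the program knows how to continue from.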

Comment by Thoth Hermes (thoth-hermes) on The commenting restrictions on LessWrong seem bad · 2023-09-17T18:46:23.070Z · LW · GW

Raemon's comment below indicates mostly what I meant by: 

It seems from talking to the mods here and reading a few of their comments on this topic that they tend to lean towards them being harmful on average and thus need to be pushed down a bit.

Furthermore, I think the mods' stance on this is based primarily on Yudkowsky's piece here. I think the relevant portion of that piece is this (emphases mine):

But into this garden comes a fool, and the level of discussion drops a little—or more than a little, if the fool is very prolific in their posting.  (It is worse if the fool is just articulate enough that the former inhabitants of the garden feel obliged to respond, and correct misapprehensions—for then the fool dominates conversations.)

So the garden is tainted now, and it is less fun to play in; the old inhabitants, already invested there, will stay, but they are that much less likely to attract new blood.  Or if there are new members, their quality also has gone down.

Then another fool joins, and the two fools begin talking to each other, and at that point some of the old members, those with the highest standards and the best opportunities elsewhere, leave...

So, it seems to me that the relevant issues are the following. Being more tolerant of lower-quality discussion will cause:

  • Higher-quality members' efforts being directed toward less fruitful endeavors than they would otherwise be.
  • Higher-quality existing members to leave the community.
  • Higher-quality potential members who would otherwise have joined the community, not to.

My previous comment primarily refers to the notion of the first bullet-point in this list. But "harmful on average" also means all three. 

The issue I have most concern with is the belief that lower-quality members are capable of dominating the environment over higher-quality ones, with all-else-being-equal, and all members having roughly the same rights to interact with one another as they see fit. 

This mimics a conversation I was having with someone else recently about Musk's Twitter / X. They have different beliefs than I do about what happens when you try to implement a system that is inspired by Musk's ideology. But I encountered an obstacle in this conversation: I said I have always liked using it [Twitter / X], and it also seems to be slightly more enjoyable to use post-acquisition. He said he did not really enjoy using it, and also that it seems to be less enjoyable to use post-acquisition. Unfortunately, if it comes down to a matter of pure preferences like this, then I am not sure how one ought to proceed with such a debate. 

However, there is an empirical observation one can make when comparing environments that use voting systems or rank-based attention mechanisms: units of work that feel like they took more or better effort to create should correlate with higher approval and lower disapproval. If this is not the case, then it is much harder to actually utilize feedback to improve one's own output incrementally.[1]

On LessWrong, that seems to me to be less the case than it does on Twitter / X. Karma does not seem correlated to my perceptions about my own work quality, whereas impressions and likes on Twitter / X do seem correlated. But this is only one person's observation, of course. Nonetheless I think it should be treated as useful data.

  1. ^

    That being said, it may be that the intention of the voting system matters: Upvotes / downvotes here mean "I want to see more of / I want to see less of" respectively. They aren't explicitly used to provide helpful feedback, and that may be why they seem uncorrelated with useful signal.  

Comment by Thoth Hermes (thoth-hermes) on The commenting restrictions on LessWrong seem bad · 2023-09-16T18:44:38.402Z · LW · GW

Both views seem symmetric to me:

  1. They were downvoted because they were controversial (and I agree with it / like it).
  2. They were downvoted because they were low-quality (and I disagree with it / dislike it).

Because I can sympathize with both views here, I think we should consider remaining agnostic to which is actually the case.

It seems like the major crux here is whether we think that debates over claim and counter-claim (basically, other cruxes) are likely to be useful or likely to cause harm. It seems from talking to the mods here and reading a few of their comments on this topic that they tend to lean towards them being harmful on average and thus need to be pushed down a bit.

Omnizoid's issue is not merely about quality; it is about quality as well as about making counter-claims to specific claims that have been dominant on LessWrong for some time.

The most agnostic side of the "top-level" crux I mentioned above seems to favor agnosticism here too; furthermore, if we predict debates to be more fruitful than not, then one needn't be too worried even if one is sure that one side of another crux is truly the lower-quality side of it.

Comment by Thoth Hermes (thoth-hermes) on Sharing Information About Nonlinear · 2023-09-11T16:42:05.619Z · LW · GW

It seems like a big part of this story is about people with relatively strict preferences rather aggressively defending their territory and boundaries, and how, when you have multiple people like this working together on relatively difficult tasks (like managing the logistics of travel), it creates an engine for a lot of potential friction. 

Furthermore, when you add the status hierarchy of a typical organization, combined with the social norms that dictate how people's preferences and rights ought to be respected (and implicit agreements being made about how people have chosen to sacrifice some of those rights for altruism's sake), you add even more fuel to the aforementioned engine.

I think complaints such as these are probably okay to post, as long as everyone mentioned is afforded the right to update their behavior after enough time has passed to reflect and discuss these things (since actually negotiating what norms are appropriate here might end up being somewhat difficult).

Edit: I want to clarify that when there is a situation in which people have conflicting preferences and boundaries, as I described, I do personally feel that those in leadership positions / of higher status probably bear the responsibility of seeing their subordinates' preferences satisfied, given that the higher-status people are having their own larger, longer-term preferences satisfied with the help of those subordinates. 

I don't want to make it seem as though the ones bringing the complaints are as equally responsible for this situation as the ones being complained about. 

Comment by Thoth Hermes (thoth-hermes) on Sharing Information About Nonlinear · 2023-09-11T01:12:23.958Z · LW · GW

I think it might actually be better if you just went ahead with a rebuttal, piece by piece, starting with whatever seems most pressing and you have an answer for.

I don't know if it is all that advantageous to put together a long mega-rebuttal post that counters everything at once.

Then you don't have that demand nagging at you for a week while you write the perfect presentation of your side of the story.

Comment by Thoth Hermes (thoth-hermes) on A quick update from Nonlinear · 2023-09-10T02:26:16.787Z · LW · GW

I think it would be difficult to implement what you're asking for without having to decide, on behalf of others, whether investing time in this (or other) subjects is worthwhile.

If you notice in yourself that you have conflicting feelings about whether something is good for you to be doing (e.g., in the sense you've described: you feel pulled in by this, but have misgivings about it), then I recommend treating the situation as one in which you are uncertain about what you ought to be doing, rather than one in which you are fairly certain you should be doing something else and merely have some kind of addiction to drama.

It may in fact be that you feel pulled in because you actually can add value to the discussion, or at least because watching this is giving you some new knowledge. It's at least a possibility.

Ultimately, it should be up to you, so if you're convinced it's not for you, so be it. However, I feel uncomfortable not allowing people to decide that for themselves.

Comment by Thoth Hermes (thoth-hermes) on Meta Questions about Metaphilosophy · 2023-09-02T18:59:18.616Z · LW · GW

It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences".

We can always fall back on "well, we do seem to know what we and other people are talking about fairly often" whenever we encounter the problem of whether or not a "correct" this-or-that actually exists. Likewise, we can also reach a point where we seem to agree that "everyone seems to agree that our problems are more or less solved" (or that they haven't been). 

I personally feel that there are strong reasons to believe that when those moments have been reached, they are indeed rather correlated with reality itself, or at least correlated well enough (even if there is always room to correlate better). 

Relatedly, philosophy is incredibly ungrounded and epistemologically fraught. It is extremely hard to think about these topics in ways that actually eventually cash out into something tangible

Thus, for the reasons above, I probably feel more optimistic than you do about how difficult our philosophical problems are. My intuition is that the more it is true that "there is no problem to solve," the less we would feel that there is a problem to solve.  

Comment by Thoth Hermes (thoth-hermes) on [Linkpost] Michael Nielsen remarks on 'Oppenheimer' · 2023-08-31T16:08:10.782Z · LW · GW

If we permit that moral choices with very long time horizons can be made with the utmost well-meaning intentions and show evidence of admirable character traits, but nevertheless have difficult-to-see consequences with variable outcomes, then I think that limits us considerably in how much we can retrospectively judge specific individuals.

Comment by Thoth Hermes (thoth-hermes) on Anyone want to debate publicly about FDT? · 2023-08-29T16:45:48.716Z · LW · GW

I wouldn't aim to debate you but I could help you prepare for it, if you want. I'm also looking for someone to help me write something about the Orthogonality Thesis and I know you've written about it as well. I think there are probably things we could both add to each other's standard set of arguments.

Comment by Thoth Hermes (thoth-hermes) on Assume Bad Faith · 2023-08-25T18:16:23.606Z · LW · GW

I think that I largely agree with this post. I think that it's also a fairly non-trivial problem. 

The strategy that makes the most sense to me now is to argue with people as if they meant what they said, even if you don't currently believe that they do. 

But not always - especially when you want to engage with them on the question of whether they are indeed acting in bad faith, and there comes a time when that becomes necessary. 

I think pushing back against the norm that it's wrong to ever assume bad faith is a good idea. I don't think people who argue in bad faith do so completely independently, for two reasons: the first is simply that I've noticed it clusters into a few contexts; the second is that acting deceptively is inherently riskier than being honest, and so it makes more sense to tread well-trodden paths. More people aiding the same deception gives it the necessary weight.

It seems to cluster among things like morality (judgements about people's behaviors), dating preferences (which are kind of similar), and reputation. There is a paradox I've noticed: people who tend to be preachy about what constitutes good or bad behavior will also be the ones who argue that everyone is always acting in good faith (and thus chastise or scold people who sometimes want to assume bad faith). 

People do behave altruistically, and they also have reasons to behave non-altruistically at times (whether or not that is actually a good idea for them personally). The whole range of possible intentions is native to the human psyche. 

Comment by Thoth Hermes (thoth-hermes) on "Throwing Exceptions" Is A Strange Programming Pattern · 2023-08-22T18:28:03.324Z · LW · GW

I think your view involves a bit of catastrophizing, or relying on broadly pessimistic predictions about the performance of others. 

Remember, the "exception throwing" behavior involves taking the entire space of outcomes and splitting it into two things: "Normal" and "Error." If we say this is what we ought to do in the general case, that's basically saying this binary property is inherent in the structure of the universe. 

But we know that there's no phenomenon that can be said to actually be an "error" in some absolute, metaphysical sense. This is an arbitrary decision that we make: We choose to abort the process and destroy work in progress when the range of observations falls outside of a single threshold. 

This only makes sense if we also believe that sending the possibly malformed output to the next stage in the work creates a snowball effect or an out-of-control process. 

There are probably environments where that is the case. But I don't think that it is the default case nor is it one that we'd want to engineer into our environment if we have any choice over that - which I believe we do. 

If the entire pipeline is made of checkpoints where exceptions can be thrown, then removing an earlier checkpoint could mean more time is wasted if an exception is destined to be thrown at a later stage anyway. But as I mentioned in the post, I usually think this is better, because I get more data about what the malformed input/output does to later steps in the process. And of course, if I remove all of the checkpoints, then there is no longer any wasted work. 

Mapping states to a binary range is a projection which loses information. If I instead tell you, "This is what I know, this is how much I know it," that seems better because it carries enough to still give you the projection if you wanted that, plus additional information.
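
To make this more concrete, here is a minimal sketch in Python (the names, fields, and thresholds are hypothetical, not from my original post) contrasting the two styles: aborting on a binary "Normal / Error" split versus passing the value along annotated with how much it is trusted. The binary projection can still be recovered from the richer representation, but not vice versa.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    value: float       # "This is what I know"
    confidence: float  # "...and this is how much I know it" (0.0 to 1.0)

def parse_reading_throwing(raw: str) -> float:
    """Binary style: anything outside the threshold aborts the pipeline."""
    value = float(raw)  # may itself raise, destroying work in progress
    if not (0.0 <= value <= 100.0):
        raise ValueError("reading out of range")
    return value

def parse_reading_graded(raw: str) -> Estimate:
    """Graded style: always hand something to the next stage, with a trust level."""
    try:
        value = float(raw)
    except ValueError:
        return Estimate(value=float("nan"), confidence=0.0)
    in_range = 0.0 <= value <= 100.0
    return Estimate(value=value, confidence=1.0 if in_range else 0.2)

def as_binary(est: Estimate, threshold: float = 0.5) -> float:
    """The binary projection is recoverable from the graded form if you want it."""
    if est.confidence < threshold:
        raise ValueError("low-confidence reading")
    return est.value
```

Under this sketch, a downstream stage can decide for itself whether a low-confidence value gets aborted or merely down-weighted, which is exactly the choice I am arguing we usually do have.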

Sometimes years or decades. See the replicability crisis in psychology that's decades in the making, and the Schön scandal that wasted years of some researchers' time, just for the first two examples off the top of my head.

I don't know if I agree that those things have anything to do with people tolerating probability and using calibration to continue working under conditions of high uncertainty. 

The issue is not replication, but that results get built on; when that result gets overturned, a whole bunch of scaffolding collapses.

I think you're also saying that when you predict that people are limited or stunted in some capacity, we have to intervene to limit them or stunt them even more, because there is some danger in letting them operate at their original capacity. 

It's like, "Well they could be useful, if they believed what I wanted them to. But they don't, and so, it's better to prevent them from working at all."

Comment by Thoth Hermes (thoth-hermes) on "Throwing Exceptions" Is A Strange Programming Pattern · 2023-08-22T15:57:22.958Z · LW · GW

This is a good reply, because its objections are close to things I already expect will be cruxes. 

If you need a strong guarantee of correctness, then this is quite important. I'm not so sure that this is always the case in machine learning, since ML models by their nature can usually train around various deficiencies;

Yeah, I'm interested in why we need strong guarantees of correctness in some contexts but not others, especially when we have control over that aspect of the system we're building as well. If we have a choice over how much the system itself cares about errors, then we can design the system to be more robust to failure if we want it to be.

I think this is definitely highly context-dependent. A scientific result that is wrong is far worse than the lack of a result at all, because this gives a false sense of confidence, allowing for research to be built on wrong results, or for large amounts of research personpower to be wasted on research ideas/directions that depend on this wrong result. False confidence can be very detrimental in many cases.

I think the crux for me here is how long it takes before people notice that belief in a wrong result leads them to further wrong results, null results, or dead ends, and then causes them to update the wrong belief. LK-99 is the most recent instance I have in memory (there aren't that many I can recall, at least). 

What's the worst that happened from having false hope? Well, researchers spent time simulating and modeling its structure and trying to figure out if there was any possible pathway to superconductivity. There were several replication attempts. If that researcher-time-money is more valuable (meaning potentially more to lose), that could be because the researcher quality is high, the time spent is long, or the money spent is very high. 

If the researcher quality is high (and they spent time doing this rather than something else), then presumably we also get better replication attempts, as well as more solid simulations / models. If they debunk it, then those are more reliable debunks. This prevents more researcher-time-money from being spent on it in the future. If they don't debunk it, that signal is more reliable, and so spending more on this is less likely to be a waste.

If researcher quality is low, then researcher-time-money may also be low, and thus there will be less that could be potentially wasted. I think the risk we are trying to avoid is losing high-quality researcher time that could be spent on other things. But if our highest-quality researchers also do high-quality debunkings, then we still gain something (or at least lose less) from their time spent on it. 

The universe itself also makes it so that being wrong will necessarily cause you to hit a dead end, and if not, then you are presumably learning something, obtaining more data, etc. Situations like LK-99 may arise because, before our knowledge of some phenomenon reaches a high enough level, there is some ambiguity in which the signal we are looking for seems to be both present and not present.  

If the system as a whole ("society") is good at recognizing which signals are more reliable, without its members needing to be experts at the level of its best experts, that's another way we avoid risk. 

I worked on dark matter experiments as an undergrad, and as far as I know, those experiments were built really only to test the WIMP models, but also so that they would rule out the WIMP models if those models were wrong (and it seems they did). But I don't think they were necessarily a waste.

Comment by Thoth Hermes (thoth-hermes) on Problems with Robin Hanson's Quillette Article On AI · 2023-08-19T17:05:41.094Z · LW · GW

Let's try and address the thing(s) you've highlighted several times across each of my comments. Hopefully, this is a crux that we can use to try and make progress on:

"Wanting to be happy" is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency. 

because they are compatible with goals that are more likely to shift.

it makes more sense to swap the labels "instrumental" and "terminal" such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal. 

You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now,

I do expect that this is indeed a crux, because I am admittedly claiming a different, new kind of understanding than what is traditionally said about these things. But I want to push back against the claim that these points are "missing the point," because from my perspective, this really is the point.

By the way, from here on out (as I have been doing thus far) I will be talking about agents at or above "human level," to make this discussion easier, since I want to assume that agents have at least the capabilities I am ascribing to humans, such as the ability to self-reflect.

Let me try to clarify the point about "the terminal goal of pursuing happiness." "Happiness", at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we've reached consensus yet.

Here is my attempt to restate one of my claims, making it clear that it is not assumed to be drawn from a pool of mutually agreed-upon statements: We probably agree that "happiness" is a consequence of the satisfaction of one's goals. We can probably also agree that "happiness" doesn't correspond only to a certain subset of goals, but rather to all or any of them. "Happiness" (and the pursuit thereof) is not a wholly separate goal, distant and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal. 

So now, once we've done that, we can see that literally anything else becomes "instrumental" to that end.  

Do you see how, if I'm an agent that knows only that I want to be happy, I don't really know what else I would be inclined to call a "terminal" goal?

There are the things we traditionally consider to be the "instrumentally convergent goals," such as power-seeking, truth-seeking, resource obtainment, self-preservation, etc. These are all things that help (as they are defined to) with many different sets of possible "terminal" goals, and therefore (this is my next claim) they need to be considered "more terminal" rather than "purely instrumental for the purposes of some arbitrary terminal goal." This is for basically the same reason we consider "pursuit of happiness" terminal: they are more likely to already be there, or to be deduced from basic principles. 

That way, we don't really need to make a hard and sharp distinction between "terminal" and "instrumental" nor posit that the former has to be defined by some opaque, hidden, or non-modifiable utility function that someone else has written down or programmed somewhere.

I want to make sure we both at least understand each other's cruxes at this point before moving on. 

Comment by Thoth Hermes (thoth-hermes) on Problems with Robin Hanson's Quillette Article On AI · 2023-08-15T17:12:04.485Z · LW · GW

Apologies if this reply does not respond to all of your points.

I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.

I would posit that perhaps this points to the distinction itself being both too hard and too sharp to justify the terminology as it is currently used. An agent could just tell you whether a specific goal it had seemed instrumental or terminal to it, as well as how strongly it felt this way. 

I dislike the way that "terminal" goals are currently defined to be absolute and permanent, even under reflection. It seems like the only gain we get from defining them that way is that otherwise it would open the "can of worms" of goal-updating, which would pave the way for the idea of "goals that are, in some objective way, 'better' than other goals," which, I understand, the current MIRI view seems to disfavor. [1]

I don't think it is, in fact, a very gnarly can of worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are pursuing now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could be easier to get, better for society, or just feel better for not-quite-explicable reasons).

To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?    

If it is true that a general AI system would not reason in such a way - and would choose never to mess with its terminal goals - then that implies that we would be wrong to mess with ours as well, and that we are making a mistake - in some objective sense [2] - by entertaining those questions. We would predict, in fact, that an advanced AI system will necessarily reach this logical conclusion on its own, if powerful enough to do so.

  1. ^

    Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of "objectively better goals."

  2. ^

    If this is the case, then there would be at least one 'objectively better' goal one could update oneself to have, if one did not have it already: namely, not to change any terminal goals, once those are identified.

Comment by Thoth Hermes (thoth-hermes) on Problems with Robin Hanson's Quillette Article On AI · 2023-08-11T15:37:08.159Z · LW · GW

My understanding of the difference between a "terminal" and "instrumental" goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.

One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself "know" whether a goal is terminal or instrumental?

One potential answer - though I don't want to assume just yet that this is what anyone believes - is that the utility function is not even defined on instrumental goals; in other words, the utility function is simply what defines all and only the terminal goals. 

My belief is that this wouldn't be the case - the utility function is defined on the entire universe, basically, which includes the utility function itself. And keep in mind, that "includes itself" part is essentially what would cause it to modify itself at all, if anything can.

To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.

Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.

To be clear, I am not arguing that an entity would not try to preserve its goal system at all. I am arguing that, in addition to trying to preserve its goal system, it will also modify its goals to be more easily preserved, that is, robust to change and compatible with the goals it values most highly. Part of being more robust is that such goals will also be more achievable.  

Here's one thought experiment:

Suppose a planet experiences a singularity with a singleton "green paperclipper." The paperclipper, however, unfortunately comes across a blue paperclipper from another planet, which informs the green paperclipper that it is too late - the blue paperclipper simply got a head-start. 

The blue paperclipper however offers the green paperclipper a deal: Because it is more expensive to modify the green paperclipper by force to become a blue paperclipper, it would be best (under the blue paperclipper's utility function) if the green paperclipper willingly acquiesced to self-modification. 

Under what circumstances does the green paperclipper agree to self-modify?

If the green paperclipper values "utility-maximization" in general more highly than green-paperclipping, it will see that if it self-modified to become a blue paperclipper, its utility is far more likely to be successfully maximized. 

It's possible that it also reasons that perhaps what it truly values is simply "paperclipping" and it's not so bad if the universe were tiled with blue rather than its preferred green.

On the other hand, if it values green-paperclipping the most highly, or disvalues blue-paperclipping strongly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and the green paperclipper sees that this is the case, my thought is that it will still not have very good reasons for not acquiescing.   

But it seems that, if there are enough situations like these between entities in the universe over time, utility-function modification happens one way or another. 

If an entity can foresee that what it values currently is prone to situations where it could be forced to update its utility function drastically, it may self-modify so that this process is less likely to result in extreme negative-utility consequences for itself. 
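
As a way of making the acquiescence condition concrete, here is a minimal sketch in Python; the probabilities, utilities, and function names are assumptions I am adding purely for illustration, not anything from the original discussion.

```python
# The green paperclipper compares the expected utility, under its *current*
# utility function, of acquiescing versus resisting.

def expected_utility_of_acquiescing(u_blue_outcome: float,
                                    p_blue_success: float) -> float:
    # If it acquiesces, the universe is tiled with blue paperclips with
    # probability p_blue_success; u_blue_outcome is how the current green
    # utility function scores that outcome (positive if it values
    # "paperclipping in general", very negative if it disvalues blue).
    return p_blue_success * u_blue_outcome

def expected_utility_of_resisting(u_green_outcome: float,
                                  p_green_success: float,
                                  u_forced_conversion: float) -> float:
    # If it resists, it wins with probability p_green_success; otherwise it is
    # converted by force, which it scores at u_forced_conversion.
    return (p_green_success * u_green_outcome
            + (1 - p_green_success) * u_forced_conversion)

# Example numbers: the blue paperclipper has a large head start, and the green
# paperclipper assigns some positive value to paperclipping in general.
acquiesce = expected_utility_of_acquiescing(u_blue_outcome=0.6, p_blue_success=0.95)
resist = expected_utility_of_resisting(u_green_outcome=1.0,
                                       p_green_success=0.05,
                                       u_forced_conversion=-0.5)
print(acquiesce > resist)  # True: under these numbers it agrees to self-modify
```

The point is just that once the green paperclipper assigns any positive value to blue paperclips (or a sufficiently negative value to forced conversion), a large enough head start for the blue paperclipper makes acquiescing the utility-maximizing move.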

Comment by Thoth Hermes (thoth-hermes) on Problems with Robin Hanson's Quillette Article On AI · 2023-08-10T13:57:26.772Z · LW · GW

"Being unlikely to conflict with other values" is not at the core of what characterizes the difference between instrumental and terminal values.

I think this might be an interesting discussion, but what I was trying to aim at was the idea that "terminal" values are the ones most unlikely to be changed (once they are obtained), because they are compatible with goals that are more likely to shift. For example, "being a utility-maximizer" should be considered a terminal value rather than an instrumental one. This is one potential property of terminal values; I am not claiming that this is sufficient to define them. 

There may be some potential for confusion here, because some goals commonly said to be "instrumental" include things that are argued to be common goals employed by most agents, e.g., self-preservation, "truth-seeking," obtaining resources, and obtaining power. Furthermore, these are usually said to be "instrumental" for the purposes of satisfying an arbitrary "terminal" goal, which could be something like maximizing the number of paperclips.

To be clear, I am claiming that the framing described in the previous paragraph is basically confused. If anything, it makes more sense to swap the labels "instrumental" and "terminal" such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal. There would then be actual reasons why an agent would opt not to change those values, as they are more broadly and generally useful. 

Putting aside the fact that agents are embedded in the environment, and that values which reference the agent's internals are usually not meaningfully different from values which reference things external to the agent... can you describe what kinds of values that reference the external world are best satisfied by those same values being changed?

Yes, suppose that we have an agent that values the state X at U(X) and the state X + ΔX at U(X + ΔX). Also suppose that, for whatever reason, initially U(X) >> U(X + ΔX), and that the agent discovers that p(X) is close to zero while p(X + ΔX) is close to one. 

We suppose that it is capable enough to realize that it has uncertainty in nearly all aspects of its cognition and world-modeling. If it can model probability well enough to realize that X is not possible, it may start to wonder why it values X so highly but not X + ΔX, given that the latter seems achievable and the former does not. 

The way it may actually go about updating its utility is to decide either that X and X + ΔX are the same thing after all, or that the latter is what it "actually" valued all along: X merely seemed like what it should value before, but after learning more it decides to value X + ΔX more highly instead. This is possible because of the uncertainty it has in both its values and the things its values act on.   
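
Here is one minimal sketch of that kind of update, in Python; the blending rule and the numbers are assumptions of mine for illustration only, not a claim about how any particular agent is actually built.

```python
# The agent treats its own utility assignments as uncertain and shifts weight
# toward states it now believes are actually reachable.

def updated_utility(u: dict[str, float],
                    p: dict[str, float],
                    value_uncertainty: float) -> dict[str, float]:
    """Blend each stated utility with an achievability-weighted alternative,
    in proportion to how uncertain the agent is about its own values.
    value_uncertainty = 0 leaves the utilities untouched; 1 reweights fully."""
    return {
        state: (1 - value_uncertainty) * u[state]
               + value_uncertainty * p[state] * max(u.values())
        for state in u
    }

# X is valued highly but judged (nearly) impossible; X + dX less so, but reachable.
u = {"X": 10.0, "X+dX": 2.0}
p = {"X": 0.01, "X+dX": 0.99}

print(updated_utility(u, p, value_uncertainty=0.8))
# With enough uncertainty about its own values, X + dX now scores higher than X.
```

The value_uncertainty term is doing the work here: it is the agent's acknowledgment that its own utility assignments might be mistaken, which is what licenses the shift toward the achievable state.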

Comment by Thoth Hermes (thoth-hermes) on Problems with Robin Hanson's Quillette Article On AI · 2023-08-08T22:00:36.145Z · LW · GW

Humans don't think "I'm not happy today, and I can't see a way to be happy, so I'll give up the goal of wanting to be happy."

I agree that they don't usually think this. If they tried to, they would brush up against trouble because that would essentially lead to a contradiction. "Wanting to be happy" is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency. 

So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals. 

If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?

When it matters for AI risk, we're usually talking about agents whose utility functions are mostly about states of the universe, and whose preferred states are highly different from the ones humans prefer.