Posts

simon's Shortform 2023-04-27T03:25:07.778Z
No, really, it predicts next tokens. 2023-04-18T03:47:21.797Z

Comments

Comment by simon on D&D.Sci: The Mad Tyrant's Pet Turtles · 2024-04-08T04:27:04.804Z · LW · GW

So I had some results that I didn't feel were complete enough to make a comment on (in the sense that subjectively I kept feeling there was some follow-on thing I should check to verify them or make sense of them), then got sidetracked by various stuff, including planning and now going on a trip/sacred pilgrimage to see the eclipse. Anyway:

all of these results relate to the "main group" (non-fanged, 7-or-more segment turtles):

Everything seems to have some independent relation with weight (except nostril size afaik, but I didn't particularly test nostril size). When you control for other stuff, wrinkles and scars (especially scars) become less important relative to segments. 

The effect of abnormalities seems suspiciously close to 1 lb on average per abnormality (so, subjectively I think it might be 1). Adding abnormalities has an effect that looks like smoothing (in a biased manner so as to increase the average weight): the weight distribution peak gets spread out, but the outliers don't get proportionately spread out.  I had trouble finding a smoothing function* that I was satisfied exactly replicated the effect on the weight distribution however. This could be due to it not being a smoothing function, me not guessing the correct form, or me guessing the correct form and getting fooled by randomness into thinking it doesn't quite fit.

For green turtles with zero miscellaneous abnormalities, the distribution of scars looked somewhat close to a Poisson distribution. For the same turtles, the distribution of wrinkles on the other hand looked similar but kind of spread out a bit...like the effect of a smoothing function. And they both get spread out more with different colours. Hmm. Same spreading happens to some extent with segments as the colours change.

On the other hand, segment distribution seemed narrower than Poisson, even one with a shifted axis, and the abnormality distribution definitely looks nothing like Poisson (peaks at 0, diminishes far slower than a 0-peak Poisson).
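As a concrete illustration of the kind of comparison I'm describing (a sketch only, with hypothetical column names and a hypothetical filename), one could fit a Poisson with the matching mean and compare it bin-by-bin to the empirical counts:

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("turtles.csv")                            # hypothetical filename
sub = df[(df["color"] == "green") & (df["abnormalities"] == 0)]

lam = sub["scars"].mean()
ks = np.arange(sub["scars"].max() + 1)
empirical = sub["scars"].value_counts(normalize=True).reindex(ks, fill_value=0.0)
poisson = stats.poisson.pmf(ks, lam)

print(pd.DataFrame({"empirical": empirical.to_numpy(), "poisson": poisson}, index=ks))
# "Spread out relative to Poisson" shows up as a lower peak and fatter shoulders
# than the fitted pmf; "narrower than Poisson" (like segments) is the reverse.
```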

Anyway, on the basis of not very much clear evidence but on seeming plausibility, some wild speculation:

I speculate there is a hidden variable, age. The effect of wrinkles and greyer colour (among non-fanged turtles) could be a proxy for age rather than a direct effect (the names of those characteristics are also suggestive). Scars are likely a weaker proxy for age, also with no direct effect. I guess segments likely do have some direct effect, while also being a (weak, like scars) proxy for age. Abnormalities clearly have a direct effect. I have not properly tested interactions between these supposed direct effects (age, segments, abnormalities), but if the abnormality effect doesn't stack additively with the other effects, it would be harder for the 1-lb-per-abnormality size of the abnormality effect to be a non-coincidence.

So, further wild speculation: the age effect on weight could also be a smoothing function (though the high-weight tail looks thicker for greenish-gray - does that suggest it is not a smoothing function?).

Unknown: is there an inherent uncertainty in the weight given the characteristics, or does there merely appear to be one because the age proxies are unreliable indicators of age? Is that even distinguishable?

* By "smoothing function" I think I mean another random variable that you add to the first one, where this other random variable takes on values within a relatively narrow range (e.g. uniform from 0.0 to 2.0, or e.g. 50% chance of being 0.2 and 50% chance of being 1.8).
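For concreteness, here is a minimal sketch of the kind of check I mean, in Python, with hypothetical column names ("weight", "abnormalities", "fangs", "segments") and a hypothetical filename, using the uniform-over-0-to-2-lb example above as the candidate smoothing function:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("turtles.csv")                       # hypothetical filename
main = df[(df["fangs"] == 0) & (df["segments"] >= 7)]
w0 = main.loc[main["abnormalities"] == 0, "weight"].to_numpy()
w1 = main.loc[main["abnormalities"] == 1, "weight"].to_numpy()

# Put both weight distributions on a common 0.1 lb grid.
step = 0.1
lo = min(w0.min(), w1.min())
hi = max(w0.max(), w1.max()) + 2.0                    # room for the shifted tail
edges = np.arange(lo - step / 2, hi + step, step)
p0, _ = np.histogram(w0, bins=edges)
p1, _ = np.histogram(w1, bins=edges)
p0 = p0 / p0.sum()
p1 = p1 / p1.sum()

# Candidate smoothing function: uniform over 0.0..2.0 lb in 0.1 lb steps (mean 1 lb).
kernel = np.ones(21) / 21
smoothed = np.convolve(p0, kernel)[: len(p1)]

# Compare the smoothed 0-abnormality distribution to the observed 1-abnormality one.
for centre, a, b in zip(edges[:-1] + step / 2, smoothed, p1):
    if a > 0 or b > 0:
        print(f"{centre:5.1f}  smoothed={a:.4f}  observed={b:.4f}")
```

If adding one abnormality really does act like adding an independent narrow-range random variable, the two printed columns should roughly match up to sampling noise; a mismatch in the tails is the sort of thing I was having trouble with.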

Anyway, this all feels figure-outable even though I haven't figured it out yet. Here are some guesses where I throw out most of the above information (apart from the prioritization of characteristics), because I haven't organized it into an estimator, and just guess ad hoc based on similar datapoints, plus Flint and Harold copied from above:

Abigail 21.6, Bertrand 19.3, Chartreuse 27.7, Dontanien 20.5, Espera 17.6, Flint 7.3, Gunther 28.9, Harold 20.4, Irene 26.1, Jacqueline 19.7

Comment by simon on Beauty and the Bets · 2024-03-31T20:19:34.239Z · LW · GW

Well, as you may see it's also is not helpful

My reasoning explicitly puts instrumental rationality ahead of epistemic. I hold this view precisely to the degree to which I do in fact think it is helpful.

The extra category of a "fair bet" just adds another semantic disagreement between halfers and thirders. 

It's just a criterion by which to assess disagreements, not adding something more complicated to a model.

Regarding your remarks on these particular experiments:

If someone thinks that some particular reward structure is the typical one, then they'll by default guess that a proposed experiment has that reward structure.

This can reasonably be expected to apply to halfers and thirders alike.

If you convince me that the halfer reward structure is typical, I go halfer (as previously stated, since I favour the typical reward structure). To the extent that it's not what I would guess by default, that's precisely because I don't intuitively feel that it's typical, and feel more that you are presenting a weird, atypical reward structure!

And thirder utilities are modified during the experiment. They are not just specified by a betting scheme, they go back and forth based on the knowledge state of the participant - behave the way probabilities are supposed to behave. And that's because they are partially probabilities - a result of incorrect factorization of E(X).

Probability is a mathematical concept with very specific properties. In my previous post I talk about it specifically and show that thirder probabilities for Sleeping Beauty are ill-defined.

I've previously shown that some of your previous posts incorrectly model the Thirder perspective, but I haven't carefully reviewed and critiqued all of your posts. Can you specify exactly what model of the Thirder viewpoint you are referencing here? (That will not only help me critique it, but also help me determine what exactly you mean by the utilities changing in the first place - e.g. do you count Thirders evaluating the total utility of a possibility branch more highly when there are more awakenings in it as a "modification"? I would not consider that a "modification".)

Comment by simon on D&D.Sci: The Mad Tyrant's Pet Turtles · 2024-03-31T19:35:38.086Z · LW · GW

updates:

In the fanged subset:

I didn't find anything that affects weight of fanged turtles independently of shell segment number. The apparent effect from wrinkles and scars appears to be mediated by shell segment number. Any non-shell-segment-number effects on weight are either subtle or confusingly change directions to mostly cancel out in the large scale statistics.

Using linear regression, if you force intercept = 0, then you get a slope close to 0.5 (i.e. avg weight = 0.5*(number of shell segments), as suggested by qwertyasdef), and that's tempting to go for, for the round number; but if you don't force intercept = 0, then a 0 intercept is well outside the error bars for the intercept (though it's still low, 0.376-0.545 at 95% confidence). If you don't force intercept = 0, the slope is more like 0.45 than 0.5. There is also a decent amount of variation, which increases in a manner that could plausibly be linear in the number of shell segments (not really that great-looking a fit to a straight line with intercept 0, but plausibly close enough; I didn't do the math). Plausibly this could be modeled by each shell segment having a weight drawn from a distribution (average 0.45) and the total weight being the sum of the weights for each segment. If we assume some distribution in discrete 0.1 lb increments, the per-segment variance looks to be roughly the amount supplied by a d4.
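A minimal sketch of the two fits described above (the column names and filename are assumptions on my part, not the scenario's actual ones):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("turtles.csv")                    # hypothetical filename
fanged = df[df["fangs"] == 1]
x = fanged["segments"].to_numpy(dtype=float)
y = fanged["weight"].to_numpy(dtype=float)

# Regression forced through the origin: slope only.
slope_origin = (x @ y) / (x @ x)
print("slope with intercept forced to 0:", slope_origin)   # close to 0.5

# Free intercept, with 95% confidence intervals for both parameters.
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)                                           # intercept, slope (~0.45)
print(fit.conf_int(alpha=0.05))
```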

So, I am now modeling fanged turtle weight as a 0.5 base weight plus a contribution of 0.1*(1d4+2) for each segment. And no, I am not very confident that this has anything to do with the real answer, but it seems plausible at least and seems to fit pretty well.

The sole fanged turtle among the Tyrant's pets, Flint, has a massive 14 shell segments; at that number of segments, the cumulative probability of the weight being at or below the guess passes the 8/9 threshold at 7.3 lbs, so that's my estimate for Flint.
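A quick Monte Carlo sketch of that calculation under the model just described (the model itself being my guess, as noted above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, segments = 200_000, 14

# Base weight 0.5 lb plus 0.1*(1d4+2) per shell segment, i.e. 0.3-0.6 lb per segment.
per_segment = 0.1 * (rng.integers(1, 5, size=(n_sims, segments)) + 2)
weights = 0.5 + per_segment.sum(axis=1)

# The 8/9 cumulative threshold mentioned above; comes out around 7.3 lb.
print(np.quantile(weights, 8 / 9))
```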

In the non-fanged, more than 6 segment main subset:

Shell segment number doesn't seem to be the dominant contributor here; all the numerical characteristics correlate with weight. I will investigate further.

Abnormalities don't seem to affect or be affected by anything but weight. This is not only useful to know for separating abnormality-related and other effects on weight, but also implies (I think) that nothing is downstream of weight causally, since that would make weight act as a link for correlations with other things. 

This doesn't rule out the possibility of some other variable (e.g. age) that other weight-related characteristics might be downstream of. More investigation to come. I'm now holding off on reading others' comments (beyond what I read at the time of my initial comment) until I have a more complete answer myself.

Comment by simon on D&D.Sci: The Mad Tyrant's Pet Turtles · 2024-03-30T07:16:33.253Z · LW · GW

Thanks abstractapplic! Initial observations:

There are multiple subpopulations, and at least some that are clearly disjoint.

The 3167 fanged turtles are all gray, and only fanged turtles are gray. Fanged turtles always weigh 8.6 lb or less. Within the fanged turtles, shell segment number seems pretty decently correlated with weight. Wrinkles and scars have weaker correlations with weight, but they also correlate with shell segment number, so I'm not sure they have an independent effect; I will have to disentangle this.

Non-fanged turtles always weigh 13.0 lbs or more. There are no turtles weighing between 8.6lb and 13.0lb.

The 5404 turtles with exactly 6 shell segments all have 0 wrinkles and 0 abnormalities, are green, have no fangs, have normal-sized nostrils, and weigh exactly 20.4 lb. None of that is unique to 6-shell-segment turtles, but that last bit makes guessing Harold's weight pretty easy.

Among the 21460 turtles that don't belong in either of those groups, all of the numerical characteristics correlate with weight; notably, the number of abnormalities doesn't seem to correlate with the other numerical characteristics, so it likely has some independent effect. Grayer colours tend to have higher weight, but also correlate with other things that seem to affect weight, so I will have to disentangle.
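A minimal sketch of the splits described above (Python; the column names and filename are assumptions, not the actual ones from the scenario):

```python
import pandas as pd

df = pd.read_csv("turtles.csv")                       # hypothetical filename

fanged = df[df["fangs"] == 1]
six_seg = df[(df["fangs"] == 0) & (df["segments"] == 6)]
main = df[(df["fangs"] == 0) & (df["segments"] != 6)]

print(len(fanged), fanged["color"].unique(), fanged["weight"].max())    # all gray, <= 8.6 lb
print(len(six_seg), six_seg["weight"].unique())                         # all exactly 20.4 lb
print(len(main), main["weight"].min())                                  # >= 13.0 lb

# Correlations of the numerical characteristics with weight in the remaining group.
num_cols = ["segments", "wrinkles", "scars", "abnormalities", "weight"]
print(main[num_cols].corr()["weight"])
```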

edit: both qwertyasdef and Malentropic Gizmo identified these groups (including the 6-segment weight) before my comment, and qwertyasdef also remarked on the correlation of shell segment number with weight among fanged turtles.

Comment by simon on Beauty and the Bets · 2024-03-28T18:50:02.529Z · LW · GW

Throughout your comment you've been saying a phrase "thirders odds", apparently meaning odds 1:2, not specifying whether per awakening or per experiment. This is underspecified and confusing category which we should taboo. 

Yeah, that was sloppy language, though I do like to think more in terms of bets than you do. One of my ways of thinking about these sorts of issues is in terms of "fair bets" - each person thinks a bet with payoffs that align with their assumptions about utility is "fair", and a bet with payoffs that align with different assumptions about utility is "unfair". Edit: to be clear, a "fair" bet for a person is one where the payoffs are such that the betting odds at which they break even match the probabilities that that person would assign.
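(Spelling out the break-even condition I have in mind, as a sketch: for a bet that pays $x$ if the event you're backing happens and costs $1$ if it doesn't,

$$P \cdot x - (1 - P) \cdot 1 = 0 \quad\Longleftrightarrow\quad P = \frac{1}{1 + x},$$

so breaking even at odds of $x\!:\!1$ corresponds to assigning probability $1/(1+x)$ to the side you're backing, e.g. 2:1 corresponds to $P = 1/3$.)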

I do not claim that. I say that in order to justify not betting differently, thirders have to retroactively change the utility of a bet already made:

I critique thirdism not for making different bets - as the first part of the post explains, the bets are the same, but for their utilities not actually behaving like utilities - constantly shifting back and forth during the experiment, including shifts backwards in time, in order to compensate for the fact that their probabilities are not behaving as probabilities - because they are not sound probabilities as explained in the previous post.

Wait, are you claiming that thirder Sleeping Beauty is supposed to always decline the initial per experiment bet - before the coin was tossed at 1:1 odds? This is wrong - both halfers and thirders are neutral towards such bets, though they appeal to different reasoning why.

OK, I was also being sloppy in the parts you are responding to.

Scenario 1: bet about a coin toss, nothing depending on the outcome (so payoff equal per coin toss outcome)

  • 1:1

Scenario 2: bet about a Sleeping Beauty coin toss, payoff equal per awakening

  • 2:1 

Scenario 3: bet about a Sleeping Beauty coin toss, payoff equal per coin toss outcome 

  • 1:1

It doesn't matter if it's agreed to before or after the experiment, as long as the payoffs work out that way. Betting within the experiment is one way for the payoffs to more naturally line up on a per-awakening basis, but it's only relevant (to bet choices) to the extent that it affects the payoffs.
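Here's a minimal simulation sketch of where the break-even payouts land for the Sleeping Beauty scenarios above, computed straight from the payoff structures rather than from any particular probability/utility split (Scenario 1 is the same arithmetic as Scenario 3):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
heads = rng.random(n) < 0.5          # one fair coin toss per run of the experiment
awakenings = np.where(heads, 1, 2)   # 1 awakening on Heads, 2 on Tails

def break_even_payout(per_awakening: bool) -> float:
    # Bet on Heads, staking 1 unit; find the payout at which expected profit is zero.
    weight = awakenings if per_awakening else np.ones(n)
    wins = weight[heads].sum()       # units paid out to the bettor across Heads runs
    losses = weight[~heads].sum()    # units lost across Tails runs
    return losses / wins

print("Scenario 2 (payoff per awakening):        ~", break_even_payout(True))   # ~2, i.e. 2:1
print("Scenario 3 (payoff per coin toss outcome): ~", break_even_payout(False)) # ~1, i.e. 1:1
```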

Now, the conventional Thirder position (as I understand it) consistently applies equal utilities per awakening when considered from a position within the experiment.

I don't actually know what the Thirder position is supposed to be from a standpoint from before the experiment, but I see no contradiction in assigning equal utilities per awakening from the before-experiment perspective as well. 

As I see it, Thirders will only regret a bet (in the sense of considering it a bad choice to enter into ex ante given their current utilities) if you do some kind of bait and switch where you don't make it clear what the payoffs were going to be up front.

 But what I'm pointing at, is that thirdism naturally fails to develop an optimal strategy for per experiment bet in technicolor problem, falsly assuming that it's isomorphic to regular sleeping beauty.

Speculation: have you actually asked Thirders and Halfers to solve the problem, while making the reward structure clear? (Note that if you don't make clear what the reward structure is, Thirders are more likely to misunderstand the question asked when, as in this case, the reward structure is "fair" from the Halfer perspective and "unfair" from the Thirder perspective.)

Technicolor and Rare Event problems highlight the issue that I explain in Utility Instability under Thirdism - in order to make optimal bets thirders need to constantly keep track of not only probability changes but also utility changes, because their model keeps shifting both of them back and forth and this can be very confusing. Halfers, on the other hand, just need to keep track of probability changes, because their utility are stable. Basically thirdism is strictly more complicated without any benefits and we can discard it on the grounds of Occam's razor, if we haven't already discarded it because of its theoretical unsoundness, explained in the previous post.

A Halfer has to discount their utility based on how many of them there are; a Thirder doesn't. It seems to me, contrary to your perspective, that Thirder utility is more stable.

Halfer model correctly highlights the rule how to determine which cases these are and how to develop the correct strategy for betting. Thirder model just keeps answering 1/3 as a broken clock.

... and in my hasty reading and response I misread the conditions of the experiment (it's a "Halfer" reward structure again). (As I've mentioned before in a comment on another of your posts, I think Sleeping Beauty is unusually ambiguous, so both Halfer and Thirder perspectives are viable. But I lean toward the general perspectives of Thirders on other problems (e.g. SIA seems much more sensible (edit: in most situations) to me than SSA), so Thirderism seems more intuitive to me.)

Thirders can adapt to different reward structures but need to actually notice what the reward structure is! 

What do you still feel that is unresolved?

The things mentioned in this comment chain. Which actually doesn't feel like all that much; it feels like there are maybe one or two differences in philosophical assumptions that are creating this disagreement (though maybe we aren't getting at the key assumptions).

Edited to add: The criterion I mainly use to evaluate probability/utility splits is typical reward structure - you should assign probabilities/utilities such that a typical reward structure seems "fair", so you don't wind up having to adjust for different utilities when the rewards have the typical structure (you do have to adjust if the reward structure is atypical, and thus seems "unfair"). 

This results in me agreeing with SIA in a lot of cases. An example of an exception is Boltzmann brains. A typical reward structure would give no reward for correctly believing that you are a Boltzmann brain. So you should always bet, in realistic bets, as if you aren't a Boltzmann brain, and for this to be "fair", I set P=0 instead of SIA's U=0. I find people believing silly things about Boltzmann brains, like taking it to be evidence against a theory if that theory proposes that there exist a lot of Boltzmann brains. I think more acceptance of setting P=0 instead of U=0 here would cut that nonsense off. To be clear, normal SIA does handle this case fine (a theory predicting Boltzmann brains is not evidence against it), but setting P=0 would make it more obvious to people's intuitions.

In the case of Sleeping Beauty, this is a highly artificial situation that has been pared down of context to the point that it's ambiguous what would be a typical reward structure, which is why I consider it ambiguous.

Comment by simon on Beauty and the Bets · 2024-03-27T17:45:40.506Z · LW · GW

The central point of the first half or so of this post  - that for E(X) = P(X)U(X) you could choose different P and U for the same E so bets can be decoupled from probabilities - is a good one.

I would put it this way: choices and consequences are in the territory*; probabilities and utilities are in the map.

Now, it could be that some probability/utility breakdowns are more sensible than others based on practical or aesthetic criteria, and in the next part of this post ("Utility Instability under Thirdism") you make an argument against thirderism based on one such criterion.

However, your claim that Thirder Sleeping Beauty would bet differently before and after the coin toss is not correct. If Sleeping Beauty is asked before the coin toss to bet based on the same reward structure as after the toss she will bet the same way in each case - i.e. Thirder Sleeping Beauty will bet Thirder odds even before the experiment starts, if the coin toss being bet on is particularly the one in this experiment and the reward structure is such that she will be rewarded equally (as assessed by her utility function) for correctness in each awakening.

Now, maybe you find this dependence on what the coin will be used for counterintuitive, but that depends on your own particular taste.

Then, the "technicolor sleeping beauty" part seems to make assumptions where the reward structure is such that it only matters whether you bet or not in a particular universe and not how many times you bet. This is a very "Halfer" assumption on reward structure, even though you are accepting Thirder odds in this case! Also, Thirders can adapt to such a reward structure as well, and follow the same strategy.  

Finally, on Rare Event Sleeping beauty, it seems to me that you are biting the bullet here to some extent to argue that this is not a reason to favour thirderism.

I think, we are fully justified to discard thirdism all together and simply move on, as we have resolved all the actual disagreements.

uh....no. But I do look forward to your next post anyway.

*edit: to be more correct, they're less far up the map stack than probability and utilities. Making this clarification just in case someone might think from that statement that I believe in free will (I don't).

Comment by simon on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-18T07:19:21.233Z · LW · GW

I think there's a (kind of) loophole here, where we use an "abstract hypothetical" model of a hypothetical future, and optimize for consequences our actions for that hypothetical. Is this what you mean by "understood in abstract terms"? 

More or less, yes (in the case of engineering problems specifically, which I think is more real-world-oriented than most science AI).

The part I don't understand is why you're saying that this is "simpler"? It seems equally complex in kolmogorov complexity and computational complexity.

What I'm saying is "simpler" is that, given a problem that doesn't need to depend on the actual effects of the outputs on the future of the real world (where operating in a simulation is an example, though one that could become riskily close to the real world depending on the information taken into account by the simulation - it might not be a good idea to include highly detailed political risks of other humans thwarting construction in a fusion reactor construction simulation for example), it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway. 

Comment by simon on And All the Shoggoths Merely Players · 2024-02-12T03:09:44.889Z · LW · GW

Doomimir: But you claim to understand that LLMs that emit plausibly human-written text aren't human. Thus, the AI is not the character it's playing. Similarly, being able to predict the conversation in a bar, doesn't make you drunk. What's there not to get, even for you?

 

So what?

You seem to have an intuition that if you don't understand all the mechanisms for how something works, then it is likely to have some hidden goal and be doing its observed behaviour for instrumental reasons. E.g. the "Alien Actress".

And that makes sense from an evolutionary perspective, where you encounter some strange intelligent creature doing some mysterious actions on the savannah. I do not think it makes sense if you specifically trained the system to have that particular behaviour by gradient descent.

I think, if you trained something by gradient descent to have some particular behaviour, the most likely thing that resulted from that training is a system tightly tuned to have that particular behaviour, with the simplest arrangement that leads to the trained behaviour.

And if the behaviour you are training something to do is something that doesn't necessarily involve actually trying to pursue some long-range goal, it would be very strange, in my view, for it to turn out that the simplest arrangement to provide that behaviour calculates the effects of the output on the long-range future in order to determine what output to select.

Moreover even if you tried to train it to want to have some effect on the future, I expect you would find it more difficult than expected, since it would learn various heuristics and shortcuts long before actually learning the very complicated algorithm of generating a world model, projecting it forward given the system's outputs, and selecting the output that steers the future to the particular goal. (To others: This is not an invitation to try that. Please don't).

That doesn't mean that an AI trained by gradient descent on a task that usually doesn't involve trying to pursue a long range goal can never be dangerous, or that it can never have goals.

But it does mean that the danger and the goals of such a usually-non-long-range-task-trained AI, if it has them, are downstream of its behaviour.

For example, an extremely advanced text predictor might predict the text output of a dangerous agent through an advanced simulation that is itself a dangerous agent.

And if someone actually manages to train a system by gradient descent to do real-world long range tasks (which probably is a lot easier than making a text predictor that advanced), well then...

BTW all the above is specific to gradient descent. I do expect self-modifying agents, for example, to be much more likely to be dangerous, because actual goals lead to wanting to enhance one's ability and inclination to pursue those goals, whereas non-goal-oriented behaviour will not be self-preserving in general.

Comment by simon on Why Two Valid Answers Approach is not Enough for Sleeping Beauty · 2024-02-09T17:43:30.586Z · LW · GW

And in Sleeping Beauty case, as I'm going to show in my next post, indeed there are troubles justifying thirders sampling assumption with other conditions of the setting

I look forward to seeing your argument.

I'm giving you a strong upvote for this. It's rare to find a person who notices that Sleeping Beauty is quite different from other "antropic problems" such as incubator problems.

Thanks! But I can't help but wonder if one of your examples of someone who doesn't notice is my past self making the following comment (in a thread for one of your previous posts) which I still endorse:

https://www.lesswrong.com/posts/HQFpRWGbJxjHvTjnw/anthropical-motte-and-bailey-in-two-versions-of-sleeping?commentId=dkosP3hk3QAHr2D3b

I certainly agree that one can have philosophical assumptions such that you sample differently for Sleeping Beauty and Incubator problems, and indeed I would not consider the halfer position particularly tenable in Incubator, whereas I do consider it tenable in Sleeping Beauty.

But ... I did argue in that comment that it is still possible to take a consistent thirder position on both. (In the comment I take the thirder position for sleeping beauty for granted, and argue for it still being possible to apply to Incubator (rather than the other way around, despite being more pro-thirder for Incubator), specifically to rebut an argument in that earlier post of yours that the classic thirder position for Sleeping Beauty didn't apply to Incubator).

Some clarification of my actual view here (rather than my defense of conventional thirderism):

In my view, sampling is not something that occurs in reality when the "sampling" in question includes sampling between multiple entities that all exist. Each of the entities that actually exists actually exists, and any "sampling" between multiple such entities occurs (only) in the mind of the observer. (However, this can still mix with conventional sampling in the mind of the observer.) Which sampling assumption you use in such cases is in principle arbitrary, but in practice should probably be based on how much you care about the correctness of the beliefs of each of the possible entities you are uncertain about being.

Halferism or thirderism for Sleeping Beauty are both viable, in my view, because one could argue for caring equally about being correct at each awakening (resulting in thirderism) or one could argue for caring equally about being correct collectively in the awakenings for each of the coin results (resulting in halferism). There isn't any particular "skin in the game" to really force a person to make a commitment here.

Comment by simon on Training of superintelligence is secretly adversarial · 2024-02-07T16:07:05.754Z · LW · GW

You seem to be assuming that the ability of the system to find out whether security assumptions are false affects whether the falsity of the assumptions has a bad effect. Which is clearly the case for some assumptions - "This AI box I am using is inescapable" - but it doesn't seem immediately obvious to me that this is generally the case.

Generally speaking, a system can have bad effects if made under bad assumptions (think a nuclear reactor or aircraft control system) even if it doesn't understand what it's doing. Perhaps that's less likely for AI, of course.

And on the other hand, an intelligent system could be aware that an assumption would break down in circumstances that haven't arrived yet, and not do anything about it (or even tell humans about it).

Comment by simon on What's this 3rd secret directive of evolution called? (survive & spread & ___) · 2024-02-07T15:23:59.734Z · LW · GW

how often you pop up out of nowhere

Or evolve from something else. (Which you clearly intended based on, e.g., your mention of crabs, but didn't make clear in that sentence.)

Comment by simon on Why Two Valid Answers Approach is not Enough for Sleeping Beauty · 2024-02-06T19:13:04.184Z · LW · GW

Thirders believe that this awakening should be treated as randomly sampled from three possible awakening states. Halfers believe that this awakening should be treated as randomly sampled from two possible states, corresponding to the result of a coin toss. This is an objective disagreement, that can be formulated in terms of probability theory and at least one side inevitably has to be in the wrong. This is the unresolved issue that we can't simply dismiss because both sides have a point.

 

If you make some assumptions about sampling, probability theory will give one answer; with other assumptions, probability theory will give another answer. So both can be defended with probability theory; it depends on the sampling assumptions. And there isn't necessarily any sampling assumption that's objectively correct here.

By the way, I normally agree with thirders in terms of my other assumptions about anthropics, but in the case of Sleeping Beauty - since it's specifically formulated to keep the multiple awakenings from impacting the rest of the world, including the past and future - I think the halfer sampling assumption isn't necessarily crazy.

Comment by simon on Brute Force Manufactured Consensus is Hiding the Crime of the Century · 2024-02-04T22:12:42.449Z · LW · GW

It seems to me we should have a strong prior that it was lab-produced, given the immediate high infectiousness. What evidence does Peter Miller provide to overcome that prior?

edited to add:

on reading https://www.astralcodexten.com/p/practically-a-book-review-rootclaim , I found the discussion on how the Furin cleavage site was coded significantly changed my view towards natural origin (the rest of the evidence presented was much less convincing). 

2nd edit after that: hmm that's evidence against direct genetic manipulation but not necessarily against evolution within a lab. Back to being rather uncertain.

3rd edit: The "apparently" in the following seems rather suspicious:

COVID is hard to culture. If you culture it in most standard media or animals, it will quickly develop characteristic mutations. But the original Wuhan strains didn’t have these mutations. The only ways to culture it without mutations are in human airway cells, or (apparently) in live raccoon-dogs. Getting human airway cells requires a donor (ie someone who donates their body to science), and Wuhan had never done this before (it was one of the technologies only used at the superior North Carolina site). As for raccoon-dogs, it sure does seems suspicious that the virus is already suited to them.

I would like to know what the evidence is that these characteristic mutations don't arise when cultured in raccoon-dogs. If that claim is false, it would be significant evidence in favour of a lab leak (if it's true, it's weaker but still relevant evidence for natural origin).

Comment by simon on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-03T19:38:30.182Z · LW · GW

While some disagreement might be about relatively mundane issues, I think there's some more fundamental disagreement about agency as well.

 

In my view, in order for an AI to be dangerous in a particularly direct way (rather than just via misuse risk etc.), its decision to give output X has to depend on the fact that output X has some specific effects on the future.

Whereas, if you train it on a problem where solutions don't need to depend on the effects of the outputs on the future, I think it much more likely to learn to find the solution without routing that through the future, because that's simpler.

So if you train an AI to give solutions to scientific problems, I don't think, in general, that that needs to depend on the future, so I think it's likely to learn the direct relationships between the data and the solutions. I.e. it's not merely a logical possibility to make it not especially dangerous; that's the default outcome if you give it problems that don't need to depend on specific effects of the output.

Now, if you were instead to give it a problem that had to depend on the effects of the output on the future, then it would be dangerous...but note that e.g. chess, even though it maps onto a game played in the real world in the future, can also be understood in abstract terms so you don't actually need to deal with anything outside the chess game itself. 

In general, I just think that predicting the future of the world and choosing specific outputs based on their effects on the real world is a complicated way to solve problems and expect things to take shortcuts when possible.

Once something does care about the future, then it will have various instrumental goals about the future, but the initial step about actually caring about the future is very much not trivial in my view!

Comment by simon on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-29T01:48:26.100Z · LW · GW

Science is usually a real-world task. 

Fair enough: a fully automated do-everything science-doer would, in order to do everything science-related, have to do real-world tasks and would thus be dangerous. That being said, I think there's plenty of room for "doing science" (up to some reasonable level of capability) without going all the way to automation of the real-world aspects - you can still have an assistant that thinks up theory for you; you just can't have something that does the experiments as well.


Part of your comment (e.g. point 3) relates to how the AI would in practice be rewarded for achieving real-world effects, which I agree is a reason for concern. Thus, as I said, "you might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though".

Your comment goes beyond this however, and seems to assume in some places that merely knowing or conceptualizing about the real world will lead to "forming goals" about the real world.

I actually agree that this may be the case with AI that self-improves, since if an AI that has a slight tendency toward a real-world goal self-modifies, its tendency toward that real-world goal will tend to direct it to enhance its alignment to that real-world goal, whereas its tendencies not directed towards real-world goals will in general happily overwrite themselves.

If the AI does not self-improve however, then I do not see that as being the case.

If the AI is not being rewarded for the real-world effects, but instead being rewarded for scientific outputs that are "good" according to some criterion that does not depend on their real-world effects, then it will learn to generate outputs that are good according to that criterion. I don't think that would, in general, lead it to select actions that would steer the world to some particular world-state. To be sure, these outputs would have effects on the real world - a design for a fusion reactor would tend to lead to a fusion reactor being constructed, for example - but if the particular outputs are not rewarded based on the real-world outcome, then they will also not tend to be selected based on the real-world outcome.


Some less relevant nitpicks of points in your comment:

Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain

If you train an AI on some very particular math then it could have goals relating to the future of the real world. I think, however, that the math you would need to train it on to get this effect would have to be very narrow, and likely have to either be derived from real-world data, or involve the AI studying itself (which is a component of the real world after all). I don't think this happens for generically training an AI on math.

As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off).

true, but see above and below.

Chess AIs don’t develop goals about the real world because they are too dumb.

If you have something trained by gradient descent solely on doing well at chess, it's not going to consider anything outside the chess game, no matter how many parameters and how much compute it has. Any consideration of outside-of-chess factors lowers the resources available for chess, and is selected against until it reaches the point of subverting the training regime (which it doesn't reach, since it is selected against before then).

Even if you argue that, if it's smart enough, additional computing power is neutral, the gradient descent doesn't actually reward out-of-context thinking for chess, so such thinking couldn't develop except by sheer chance, outside of somehow being a side effect of thinking about chess itself - but chess is a mathematically "closed" domain, so there doesn't seem to be any reason out-of-context thinking would be developed.

The same applies to math in general where the math doesn't deal with the real world or the AI itself. This is a more narrow and more straightforward case than scientific research in general.

Comment by simon on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-27T00:03:58.845Z · LW · GW

I'm not convinced by the argument that AI science systems are necessarily dangerous.

It's generically* the case that any AI that is trying to achieve some real-world future effect is dangerous. In that linked post Nate Soares used chess as an example, which I objected to in a comment. An AI that is optimizing within a chess game isn't thereby dangerous, as long as the optimization stays within the chess game. E.g.,  an AI might reliably choose strong chess moves, but still not show real-world Omohundro drives (e.g. not avoiding being turned off). 

I think scientific research is more analogous to chess than trying to achieve a real-world effect in this regard (even if the scientific research has real-world side effects), in that you can, in principle, optimize for reliably outputting scientific insights without actually leading the AI to output anything based on its real-world effects. (the outputs are selected based on properties aligned with "scientific value", but that doesn't necessarily require the assessment to take into account how it will be used, or any other effect on the future of the world. You might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though). 

Note: an AI that can "build a fusion rocket" is generically dangerous. But an AI that can design a fusion rocket, if that design is based on general principles and not tightly tuned on what will produce some exact real-world effect, is likely not dangerous.  

*generically dangerous: I use this to mean, an AI with this properties is going to be dangerous unless some unlikely-by-default (and possibly very difficult) safety precautions are taken.

Comment by simon on D&D.Sci(-fi): Colonizing the SuperHyperSphere [Evaluation and Ruleset] · 2024-01-22T21:46:41.914Z · LW · GW

Thanks abstractapplic. 

Retrospective: 

While the multiplicative nature of the data might have tripped someone up who just put the data into a tool that assumed additivity, it wasn't hard to see that the data wasn't additive; in my case I looked at an x-y chart of performance vs Murphy's Constant and immediately assumed that at least Murphy's Constant likely had a multiplicative effect; additivity wasn't something I recall consciously considering even to reject it.

I did have fun, though I would have preferred for there to be something more of relevance to the answer than more multiplicative effects. My greatest disappointment, however, is that you called one of the variables the "Local Value of Pi" and gave it no angular or trigonometric effects whatsoever. Finding some subtle relation with the angular coordinates would have been quite pleasing.

I see that I correctly guessed the exact formulas for the effects of Murphy's Constant and Local Value of Pi; on the other hand, I did guess at some constant multipliers possibly being exact and was wrong, and not even that close (I had been moving to doubting their exactness and wasn't assuming exactness in my modeling, but didn't correct my comment edit about it).

The lowest hanging fruit that I missed seems to me to be checking the distribution of the (multiplicative) residuals; I had been wondering if there was some high-frequency angle effect, perhaps with a mix of the provided angular coordinates or involving the local value of pi,  to account for most of the residuals, but seeing a normal-ish distribution would have cast doubt on that.* (It might not be entirely normal - I recall seeing a bit of extra spread for high Murphy's Constant and think now that it might have been due to rounding effects, though I didn't consider that at the time).

*edit: on second thought, even if I found normal residuals, I might still have possibly dismissed this as potentially due to smearing from multiple small errors in different parameters.

Comment by simon on D&D.Sci Hypersphere Analysis Part 4: Fine-tuning and Wrapup · 2024-01-18T21:32:03.081Z · LW · GW

Ah, that would be it. (And I should have realized before that the linear prediction using logs would be different in this way). No, my formulas don't relate to the log. I take the log for some measurement purposes but am dividing out my guessed formula for the multiplicative effect of each thing on the total, rather than subtracting a formula that relates to the log of it.

So, I guess you could check to see if these formulas work satisfactorily for you: 

log(1-0.004*(Murphy's Constant)^3) and log(1-10*abs((Local Value of Pi)-3.15))

In my graphs, I don't see an effect that looks clearly non-random. Like, it could be wiggled a little bit but not with a systematic effect more than around a factor of 0.003 or so and not more than I could believe is due to chance. (To reduce random noise, though, I ought to extend to the full dataset rather than the restricted set I am using).

Comment by simon on D&D.Sci(-fi): Colonizing the SuperHyperSphere · 2024-01-18T07:11:00.255Z · LW · GW

update:

 on Murphy:

I think that the overall multiplication factor from Murphy's constant is 1-0.004*(Murphy's constant)^3 - this appears close enough, I don't think I need linear or quadratic terms.

On Pi: 

I think the multiplication factor is probably 1-10*abs((local Value of Pi)-3.15) - again, appears close enough, and I don't think I need a quadratic term.

Regarding aphyer saying cubic doesn't fit Murphy's, and both unnamed and aphyer saying Pi needs a quadratic term, I am beginning to suspect that maybe they are modeling these multipliers in a somewhat different way, perhaps 1/x from the way I am modeling it? (I am modeling each function as a multiplicative factor that multiplies together with the others to get the end result).

edited to add: aphyer's formulas predict the log; my formulas predict the output, then I take the log after if I want to (e.g. to set a scaling factor). I think this is likely the source of the discrepancy. If predicting the log, put each of these formulas in a log (e.g. log(1-10*abs((local Value of Pi)-3.15))).
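To make the equivalence concrete, here's a minimal sketch (just restating the formulas above in code):

```python
import numpy as np

def murphy_factor(m):
    # Multiplicative factor from Murphy's Constant: 1 - 0.004 * m^3
    return 1 - 0.004 * m ** 3

def pi_factor(p):
    # Multiplicative factor from the Local Value of Pi: 1 - 10 * |p - 3.15|
    return 1 - 10 * np.abs(p - 3.15)

# My way:   performance ≈ baseline * murphy_factor(m) * pi_factor(p) * (other factors)
# Log form: log(performance) ≈ log(baseline) + np.log(murphy_factor(m))
#                              + np.log(pi_factor(p)) + ...
# A factor that is linear in its variable stops being linear once the log is taken,
# which would explain the apparent need for a quadratic term when fitting the log.
```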

Comment by simon on D&D.Sci(-fi): Colonizing the SuperHyperSphere · 2024-01-18T06:59:12.649Z · LW · GW

 > Still kept the nearest neighbors calculation to account for any other location relevance (there is a little but much less now). That left me with 4 nines of correlation between predicted & actual performance,

Interesting, that definitely suggests some additional influences that we haven't explicitly taken account of, rather than random variation.

> added a quadratic term to my rescaling of Local Value of Pi (because the dropoff from 3.15 isn't linear)

As did aphyer, but I didn't see any such effect, which is really confusing me. I'm pretty sure I would have noticed it if it were anywhere near as large as aphyer shows in his post.

edit: on the pi issue, see my reply to my own comment. Did you account for these factors as divisors dividing from a baseline, or multipliers multiplying a baseline (I did the latter)? edit: a conversation with aphyer clarified this. I see you are predicting log performance, as aphyer does, so a linear effect on the multiplier would then have a log taken of it, which makes it nonlinear.

Comment by simon on D&D.Sci Hypersphere Analysis Part 4: Fine-tuning and Wrapup · 2024-01-18T04:27:53.402Z · LW · GW

Huh. On Pi I hadn't noticed the nonlinearity of the effect of distance from 3.15; I will look again at that.

edit: I would definitely have seen anything as large as 3% like what you're showing there. Not sure what the discrepancy is from.

Your new selection of points is exactly the same as mine, though slightly different order. Your errors now look smaller than mine.

On Murphy:

It seemed to me a 3rd degree polynomial fits Murphy's Constant's effect very well (note, this is also including smaller terms than the highest order one - these other terms can suppress the growth at low values so it can grow enough later)

edit: looking into it, it's still pretty good if I drop the linear and quadratic terms. Not only that but I can set the constant term to 1 and the cubic term to -0.004 and it still seems a decent fit. 

...which along with the pi discrepancy makes me wonder if there's some 1/x effect here - did I happen to model things the same way around as abstractapplic set them up, and are you modeling the 1/x of it?

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-18T01:24:01.291Z · LW · GW

In the case where it's instantaneous, "at the start" would effectively mean right before (e.g. a one-sided limit).

Comment by simon on D&D.Sci Hypersphere Analysis Part 3: Beat it with Linear Algebra · 2024-01-17T10:14:59.940Z · LW · GW

Hi aphyer, nice analysis and writeup and also interesting observations here and in the previous posts. Some comments in spoiler tags:

Shortitude: I found that shortitude >45 penalized performance. I didn't find any effect from Deltitude.

Skitterers: I haven't seen large random errors (in a restricted part of the data which is all I considered - No/EXTREMELY, Mint/Burning/Copper, Silence/Skittering) so they should be relatively safe.

I only have pi peaking near 3.15.

Burning is indeed better than mint.

On the few equatorial points - I very much don't think it's an effect of a hypersphere, but imagine that abstractapplic (accidentally?) used some function to generate the values that did a full wave from -90 to 90 instead of a  half wave. I haven't checked to see if that works out quantitatively.

In general the problem seemed somewhat unnaturally well fit to the way I tried to solve it (I didn't check a lot of the other things you did, and after relatively little initial exploration just tried dividing out estimated correction factors for the effects of Murphy's constant, pi, etc., which turned out to work better than it should have, due to the things actually being multiplicative and, at least so far, cleanly dependent on one variable at a time).

From a priority perspective your post here preceded my comment on abstractapplic's post.

Comment by simon on D&D.Sci(-fi): Colonizing the SuperHyperSphere · 2024-01-17T09:29:51.578Z · LW · GW

Thanks for giving us this puzzle, abstractapplic.

My answer (possibly to be refined later, but I'll check others' responses and aphyer's posts after posting this):

id's: 96286,9344,107278,68204,905,23565,8415,83512,62718,42742,16423,94304

observations and approach used:

After some initial exploration I considered only a single combination of qualitative traits (No/Mint/Adequate/['Eerie Silence'], though I think it wouldn't have mattered if I chose something else) in order to study the quantitative variables without distractions. 

Since Murphy's constant had the biggest effect, I first chose an approximation for the effect of Murphy's Constant (initially a parabola), then divided the ZPPG data by my prediction for Murphy's constant to get the effects of another variable (in this case, the local value of pi) to show up better. And so on, going back to refine my previously guessed functions as the noise from other variables cleared up.

As it turned out, this approach was unreasonably effective as the large majority of the variation (at least for the traits I ended up studying  - see below) seems to be accounted for by multiplicative factors, each factor only taking into account one of the traits or variables. 
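As a sketch of that iterative divide-out procedure (Python, with hypothetical column names and filename; the two factor functions shown are the guesses I describe elsewhere in this thread):

```python
import pandas as pd

df = pd.read_csv("sites.csv")            # hypothetical filename
df["residual"] = df["performance"]       # start from the raw ZPPG readings

def divide_out(frame, column, factor_fn):
    # Divide the current residual by the guessed multiplicative factor for one
    # variable, so the next variable's effect shows up more cleanly.
    frame = frame.copy()
    frame["residual"] = frame["residual"] / factor_fn(frame[column])
    return frame

# Guess the biggest effect first, divide it out, then look at residual vs the
# next variable, guess its factor, divide that out too, and keep iterating,
# refining the earlier guesses as the remaining noise clears up.
df = divide_out(df, "murphys_constant", lambda m: 1 - 0.004 * m ** 3)
df = divide_out(df, "local_value_of_pi", lambda p: 1 - 10 * (p - 3.15).abs())
# df.plot.scatter(x="longitude", y="residual")   # hunt for the next factor
```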

Murphy's constant:

Cubic. (I tried to get it to fit some kind of exponential, or even a logistic function, because I had a headcanon explanation along the lines of: a higher value causes problems at a higher rate, and the individual problems multiply together before subtracting from nominal. Or something. But a cubic fits better.) It visually looks like it's inflecting near the extreme values of the data (not checked quantitatively), so maybe it's a (cubic) spline.

Local Value of Pi:

Piecewise linear, peaking around 3.15, with the same slope on either side I think. I tried to fit a sine to it first, for similar reasons as with Murphy and exponentials.

Latitude:

Piecewise constant, lower value if between -36 and 36.

Longitude:

This one seems to be a sine, though not literally sin(x) - it's displaced vertically and horizontally. I briefly experimented to see if I could get a better fit by substituting the local value of pi for our boring old conventional value; it didn't seem to work, but maybe I implemented that wrong.

Shortitude:

Another piecewise constant. Lower value if greater than 45. Unlike latitude, this one is not symmetrical - it only penalizes in the positive direction.

Deltitude:

I found no effect.

Traits:

I only considered traits that seemed relatively promising from my initial exploration (really just what their max value was and how many tries they needed to get it): No or EXTREMELY, Mint, Burning or Copper, (any Feng Shui) and ['Eerie Silence'] or ['Otherworldly Skittering'].

All traits tested seemed to me to have a constant multiplier. 

Values in my current predictor (may not have been tested on all the relevant data, and significant digits shown are not justified):

Extremely (relative to No): 0.94301

Burning, Copper (relative to Mint): 1.0429, 0.9224

Exceptional, Disharmonious (relative to Adequate): 1.0508,0.8403 - edit: I think these may actually be 1.05, 0.84 exactly.

Skittering (relative to Silence): 0.960248

Residual errors typically within 1%, relatively rarely above 1.5%. There could be other things I missed (e.g. non-multiplicative interactions) to account for the rest, or afaik it could be random. Since I haven't studied other traits than the ones listed, clues could also be lurking in those traits.

Using my overall predictor, my expected values for the 12 sites listed above are about:

96286: 112.3, 9344: 110.0, 107278: 109.3, 68204: 109.2, 905: 109.0, 23565: 108.1, 8415: 106.5, 83512: 106.0, 62718: 105.9 ,42742: 105.7, 16423: 105.4, 94304: 105.2

Given my error bars in the (part that I actually used of the) data set I'm pretty comfortable with this selection (in terms of building instead of folding, not necessarily that these are the best choices), though I should maybe check to see if any is right next to one of those cutoffs (latitude/shortitude) and I should also maybe be wary of extrapolating to very low values of Murphy's Constant. (e.g. 94304, 23565, 96286)

edited to add: aphyer's third post (which preceded this comment) has the same sort of conclusion and some similar approximations (though mine seem to be more precise), and unnamed also mentioned that it appears to be a bunch of things multiplied together. All of aphyer's posts have a lot of interesting general findings as well.

edited to also add: the second derivative of a cubic is a linear function. The cubic having zero second derivative at two different points is thus impossible unless the linear function is zero, which happens only when the first two coefficients of the cubic are zero (so the cubic is linear). So my mumbling about inflection points at both ends is complete nonsense... however, it does have close to zero second derivative near 0, so maybe it is a spline where we are seeing one end of it where the second derivative is set to 0 at that end. todo: see what happens if I actually set that to 0

edited again: see below comment - can actually set both linear and quadratic terms to 0

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-15T20:00:04.531Z · LW · GW

The trajectory is changing during the continuous burn, so the average direction of the continuous burn is between the perpendicular to where the trajectory was at the start of the burn and the perpendicular to where it was at the end. The instantaneous burn, by contrast, is assumed to be perpendicular to where the trajectory was at the start only. If you instead made it in between the perpendicular to where the trajectory was at the start and the perpendicular to where it was at the end, as in the continuous burn, you could make it also not add to the craft's speed.

Going back to the original discussion, yes this means that an instantaneous burn that doesn't change the speed is pointing slightly forward relative to where the rocket was going at the start of the burn, pushing the rocket slightly backward. But, this holds true even if you have a very tiny exhaust mass sent out at a very high velocity, where it obviously isn't going at the same speed as the rocket in the planet's reference frame.

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-13T06:03:58.162Z · LW · GW

...Are you just trying to point out that thrusting in opposite directions will cancel out?

 

No. 

I'm pointing out that continuous thrust that's (continuously during the burn) perpendicular to the trajectory doesn't change the speed.

This also means that (going to your epsilon duration case) if the burn is small enough not to change the direction very much, the burn that doesn't change the speed will be close to perpendicular to the trajectory (and in the low mass change (high exhaust velocity) limit it will be close to halfway between the perpendiculars to the trajectory before and after the burn, even if it does change the direction a lot). That's independent of the exhaust velocity, as long as that velocity is high, and when it's high it will also tend not to match the ship's speed since it's much faster, which maybe calls into question your statement in the post, quoted above, which I'll requote:

One interesting questions is at what angle of thrust does the effect on the propellant go from negative to positive? I didn't do the math to check, but I'm pretty sure it's just the angle at which the speed of the propellant in the planet's reference frame is the exact same as the rocket's speed.

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-13T02:30:56.385Z · LW · GW

Yes, it's associative. But if you thrust at 90 degrees to the rocket's direction of motion, you aren't thrusting in a constant direction, but in a changing direction as the trajectory changes. This set of vectors in different directions will add up to a different combined vector than a single vector of the same total length pointing at 90 degrees to the direction of motion that the rocket had at the start of the thrusting.

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-12T21:11:30.473Z · LW · GW

In the limit where the retrograde thrust is infinitesimally small, it also does not increase the length of the main vector it is added to.

I implicitly meant, but again did not say explicitly, that the ratio of the contribution to the length of the vector from adding an infinitesimal sideways vector, as compared to the length of that infinitesimal vector, goes to zero as the length of the sideways addition goes to zero (because it scales as the square of the sideways vector).

So adding a large number of tiny instantaneously-sideways vectors, in the limit where the size of each goes to zero while the total amount of thrust added is held constant, results in a non-zero change in direction but zero change in speed.

Whereas, if you add a large number of tiny instantaneous aligned vectors, the ratio of the contribution to the length of the vector to the length of each added tiny vector is 1, and if you add up a whole bunch of such additions, it changes the length and not the direction, regardless of how large or small each addition is.

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-12T19:18:16.126Z · LW · GW

The Oberth phenomenon is related but different I think

Yes, I think that if you (in addition to the speed thing) also take into account the potential energy of the exhaust, that accounts for the full Oberth effect. 

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-12T18:52:20.292Z · LW · GW

In the limit where the perpendicular side vector is infinitesimally small, it does not increase the length of the main vector it is added to. 

If you keep thrusting over time, as long as you keep the thrust continuously at 90 degrees as the direction changes, the speed will still not change. I implicitly meant, but did not explicitly say, that the thrust is continuously perpendicular in this way. (Whereas if you keep the direction of thrust fixed when the direction of motion changes, so that it's no longer at 90 degrees, or add a whole bunch of impulse at one time, like shooting a bullet out at 90 degrees, then it will start to add speed.)

Comment by simon on An Actually Intuitive Explanation of the Oberth Effect · 2024-01-11T09:37:36.193Z · LW · GW

I'm not sure my perspective is significantly different than yours, but:

Using conservation of energy: imagine we have a given amount of mechanical (i.e. kinetic+potential) energy produced by expelling exhaust in the rocket's reference frame. The total mechanical energy change will be the same in any reference frame. But in another reference frame we have: 

  • the faster the rocket is going, the more kinetic energy the exhaust loses (or the less it gains, depending on relative speeds) when it is dumped backwards, which means more energy for the rocket.
  • the further down a gravity well you dump the exhaust, the less potential energy it has, which means more energy for the rocket. 

Both are important from this perspective, but they're related, since kinetic+potential energy is constant when not thrusting, so the rocket is moving faster when it's deeper in the gravity well. Yeah, it also works using a gun or whatever instead of exhaust, but it's more intuitive IMO to imagine it with exhaust.
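A toy bookkeeping example (my own numbers and simplifications, not anything from the post: an impulsive 1D burn along the direction of motion, exhaust speed taken relative to the pre-burn velocity, rocket mass treated as its post-burn mass):

```python
m_rocket = 1000.0    # kg, rocket mass after the burn
m_exhaust = 1.0      # kg of propellant expelled
v_exh_rel = 3000.0   # m/s, exhaust speed relative to the rocket, thrown backwards

def burn_energies(v_rocket):
    """Kinetic-energy changes for the same burn, in the planet's reference frame."""
    dv = m_exhaust * v_exh_rel / m_rocket        # from momentum conservation
    v_rocket_after = v_rocket + dv
    v_exhaust_after = v_rocket - v_exh_rel       # exhaust velocity in the planet frame
    ke_rocket_gain = 0.5 * m_rocket * (v_rocket_after**2 - v_rocket**2)
    ke_exhaust_change = 0.5 * m_exhaust * (v_exhaust_after**2 - v_rocket**2)
    return ke_rocket_gain, ke_exhaust_change

for v in (1000.0, 8000.0):   # slow (high in the well) vs fast (deep in the well)
    rocket_gain, exhaust_change = burn_energies(v)
    print(f"rocket at {v:5.0f} m/s: rocket gains {rocket_gain/1e6:6.2f} MJ, "
          f"exhaust KE change {exhaust_change/1e6:6.2f} MJ, "
          f"total {(rocket_gain + exhaust_change)/1e6:5.2f} MJ")
```

The total mechanical energy released by the burn comes out the same in both cases, but the faster the rocket is moving when the exhaust is thrown backwards, the bigger the share that ends up in the rocket; the potential-energy half of the argument works the same way with a height term added.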

One interesting questions is at what angle of thrust does the effect on the propellant go from negative to positive? I didn't do the math to check, but I'm pretty sure it's just the angle at which the speed of the propellant in the planet's reference frame is the exact same as the rocket's speed.

I am not quite sure I understand the question, but when the thrust is at 90 degrees to the trajectory, the rocket's speed is unaffected by the thrusting, and it comes out of the gravity well at the same speed as it came in. That would apply equally if there were no gravity well.

Comment by simon on Saving the world sucks · 2024-01-10T15:49:52.565Z · LW · GW

I don’t want to tile the universe with hedonium.

Good! I don't want to be replaced with tiled hedonium either. 

Perhaps some of the issue might be with perceptions of what is supposed to be "good" not matching your own values.

Comment by simon on Boltzmann brain's conditional probability · 2023-12-29T19:54:46.273Z · LW · GW

The number of bits in a (Boltzmann or not) brain's beliefs is limited by the bits of the brain itself. So I don't think this really works.

OTOH in my view it doesn't make sense to have a policy to believe you are a Boltzmann brain, even if such brains are numerous enough to make ones with your particular beliefs outweigh non-Boltzmann brains with your beliefs, because:

  • such a policy will result in you incorrectly believing you are a Boltzmann brain if you aren't a Boltzmann brain, but if you are a Boltzmann brain you either:
    • have such a policy by sheer coincidence, or
    • adopted such a policy based on meta-level reasons that you have no reason to believe are reliable since they came from randomness, and
  • even if a Boltzmann brain did obtain correct knowledge, this would not pay off in terms of practical benefits

Comment by simon on Will 2024 be very hot? Should we be worried? · 2023-12-29T17:50:25.340Z · LW · GW

 (though also, why doesn’t it rain?)

Not sure if there's some other reason, but in the stratosphere you don't afaik* get big convective updrafts like there are in the troposphere, which I presume is due to the rate at which temperature declines with altitude getting smaller than the rate at which a rising air body will cool due to expansion.

*Actually I think that this property is basically what defines the stratosphere vs the troposphere?
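For a rough number on how fast a rising air body cools by expansion (the dry adiabatic lapse rate), here's a back-of-the-envelope sketch with standard textbook values (my own addition):

```python
g = 9.81     # m/s^2, gravitational acceleration
cp = 1004.0  # J/(kg*K), specific heat of dry air at constant pressure

lapse = g / cp                         # K per metre, dry adiabatic lapse rate
print(f"{lapse * 1000:.1f} K per km")  # ~9.8 K/km

# A layer is stable against dry convection wherever the surrounding temperature
# falls more slowly than this with height (or rises with height, as it does in
# much of the stratosphere): a lifted parcel ends up cooler and denser than its
# surroundings and sinks back.
```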

Comment by simon on Measurement tampering detection as a special case of weak-to-strong generalization · 2023-12-23T23:01:43.126Z · LW · GW

Yes, it can explore - but its goals should be shaped by the basin it's been in in the past, so it should not jump to another basin (where the other basin naturally fits a different goal - if they fit the same goal, then they're effectively the same basin), even if it's good at exploring. If it does, some assumption has gone wrong, such that the appropriate response is a shutdown not some adjustment.

On the other hand, if it's very advanced, then it might become powerful enough at some point to act on a misgeneralization of its goals, such that some policy highly rewarded by the goal is outside any natural basin of the reward system, where acting on it means subverting the reward system. But the smarter it is, the less likely it is to misgeneralize in this way (though the more capable it is of acting on it). And in this case the appropriate response is even more clearly a shutdown.

And in the more pedestrian "continuous" case where the goal we're training on is not quite what we actually want, I'm skeptical you achieve much beyond just adjusting the effective goal slightly.

Comment by simon on Measurement tampering detection as a special case of weak-to-strong generalization · 2023-12-23T19:55:58.292Z · LW · GW

MTD is training the AI to avoid letting tampering be detected, as well as training it not to do it. But if it's smart enough, not-getting-detected eventually wins over not actually doing it.

Some types of measurement tampering would be continuous with desired behaviour, and some not. By "continuous" I mean that it lies in the same basin in gradient-descent terms, and by "discontinuous" I mean that it doesn't.

Continuous example: manipulating the user is probably continuous with desired behaviour in almost all cases if you are using user-rated outputs.

In this case, you can check to see how manipulate-y your output looks and use MTD on this. But basically this is equivalent, I think, to just correcting your ratings based on checking how manipulate-y things look. You are shifting the basin, but not, I think, forcing the bottom to exclude the undesired behaviour. Even if the AI does see this as 

clearly different in the AI's "mind"

(which it may not, since it's continuous after all), you'll still get a new Goodharted equilibrium where the AI is careful to avoid looking manipulate-y while still trying to manipulate users (if you can't actually look into the AI's mind to punish this).

Discontinuous example: directly tampering with the reward system to specify positive reward is likely discontinuous from desired behaviour (different basin).

I don't expect discontinuous behaviour to arise if the AI is initially trained in a regime where it is rewarded for the desired behaviour (see e.g. TurnTrout's Reward is not the optimization target). Of course you could still have measures in place to detect it, but if you do detect it, I would think the appropriate reaction is more along the lines of "nuke the datacenter from orbit"* than actually trying to modify this behaviour.

*not literally in most cases: the most likely explanation is that there was a mistaken assumption about what was continuous/discontinuous, or about which basin it started in. But still, it would be a violation of a safety assumption and warrant an investigation rather than making it an adjust-and-move-on thing.

Comment by simon on The Dark Arts · 2023-12-19T19:02:55.729Z · LW · GW

It seems to me that ultra-BS is perhaps continuous with hyping up one particular way reality might in fact be, to a degree disproportionate to your actual probability estimate, and that this in turn is continuous with emphasizing a way reality might in fact be to a degree that is proportionate to your subjective probability.

About public belief: I think people do tend to pick up, at least vaguely, on what the words they encounter are optimized for. If you have the facts on your side but optimize for the recipient's belief, you don't have much advantage over someone optimizing for the opposite belief when the facts are too complicated. Well actually, I'm not so confident about that, but I am confident about this: if you optimize for tribal signifiers - for appearing to others on "your" side to be supporting the correct side - then you severely torpedo your credibility re: convincing the other side. And I do think that tends to occur whenever something gets controversial.

Comment by simon on How bad is chlorinated water? · 2023-12-15T20:17:27.177Z · LW · GW

I hadn't heard of peroxyhypochlorous acid before, but looking it up (HOOCl) I can imagine it forming by the O's of ClO- and HOCl meeting and kicking out one of the Cl's as Cl-. That being said, given that Cl with O's bonded tends to be more of a thing than oxygen-oxygen bonds (and Cl would be the more positive side (?) of the Cl/O bond having more protons and thus more likely to bond with the O than another O?), wouldn't chlorous acid (HOClO) be more likely to be produced by those things reacting (with either of the Cl's bonding with the O of the other and kicking out the other Cl as Cl-)? Which would then presumably lead to further oxychlorine stuff rather than pure oxygen? 

Comment by simon on How bad is chlorinated water? · 2023-12-15T05:34:48.757Z · LW · GW

So can you tell me what the equilibrium is at pH 7?

I unfortunately don't know the Cl2 equilibrium at neutral pH (I tried to calculate an overall equilibrium constant at fixed neutral pH from the equilibrium constants given on wiki for the acidic and basic cases, got inconsistent results from them, and - since I actually don't know what I'm doing but am just applying half-remembered stuff from the one chemistry course I ever took at university plus looking things up on the fly - don't understand why). But if you just want to know how much is OCl- vs HOCl, here's a link with a graph on page 2 (it should basically be the same as what you'd calculate using the acid dissociation constant for hypochlorous acid).
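For the HOCl vs OCl- split specifically, here's a small sketch using the standard acid-dissociation relation, taking the pKa of hypochlorous acid to be roughly 7.5 (a commonly quoted value; treat both the number and the code as my own rough illustration):

```python
pKa_HOCl = 7.5   # assumed pKa of hypochlorous acid (commonly quoted ~7.5)

def fraction_HOCl(pH):
    """Fraction of the HOCl/OCl- pool that is un-dissociated HOCl at a given pH.

    From Ka = [H+][OCl-]/[HOCl], the ratio [OCl-]/[HOCl] = 10**(pH - pKa).
    """
    ratio = 10 ** (pH - pKa_HOCl)
    return 1.0 / (1.0 + ratio)

for pH in (2, 5, 7, 7.5, 9):
    f = fraction_HOCl(pH)
    print(f"pH {pH}: {f*100:5.1f}% HOCl, {(1 - f)*100:5.1f}% OCl-")
```

With that assumed pKa, neutral pH comes out at roughly three parts HOCl to one part OCl-.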

Anyway:

It might be important to note that I'm "cheating," because I know as an empirical fact that ClO- is unstable in water - dilute bleach eventually turns into salt water, while salt water does not turn into dilute bleach.

It makes sense that that could happen:

2 ClO- → 2 Cl- + O2

So yes, you could have oxygen bubbling out too. I guess in this case we wouldn't be as concerned, as presumably the sort of oxidation reaction done by oxygen itself is expected in an oxygen rich environment such as we live in and wouldn't cause additional harm.

So, I guess that's what you meant by metastable ClO-. But it sounds like this reaction is slow if not at high temperature? Also, I wouldn't expect it so much to happen directly with HOCl, rather than ClO-, because O is the middle atom in HOCl, so it seems to me it would be less likely to get pulled out in one step than if it were one of the edge atoms.

Comment by simon on How bad is chlorinated water? · 2023-12-14T18:01:01.428Z · LW · GW

According to wikipedia:

When dissolved in water, chlorine converts to an equilibrium mixture of chlorine, hypochlorous acid (HOCl), and hydrochloric acid (HCl):

Cl2 + H2O ⇌ HOCl + HCl

In acidic solution, the major species are Cl2 and HOCl, whereas in alkaline solution, effectively only ClO- (hypochlorite ion) is present. Very small concentrations of ClO2-, ClO3-, ClO4- are also found.[18]

So the putative equilibrium is the above (also including some H3O+ and ClO- and Cl- from dissociation of the stuff on the right) and not this:

State 3 is water, H3O+, Cl-, and trace amounts of ClO- (and even tracer Cl2). This is the putative equilibrium.

Note that the total quantity of Cl2 and HOCl (or, in more basic solution, ClO-) is conserved in the above reaction. You do not get to a point where both are trace if that's not what you started with. In natural stomach acid, you would presumably not start with either of them, but in chlorinated tap water you do. 

Regarding the specific intermediate steps in a reaction and how fast they are, perhaps you could post a specific equation for what you think will happen.

FWIW (though I really have no expertise on this) intuitively it wouldn't seem surprising to me if the reaction actually used OH- like this: Cl2 + OH- ⇌ HOCl + Cl-, whereas it's harder for me to visualize how it would work with OCl- as an intermediate (like, why is your intermediate reaction stripping both H's off the O, while your final reaction puts them back on?).

Comment by simon on How bad is chlorinated water? · 2023-12-14T02:53:00.475Z · LW · GW

Cl- and ClO- are two different things, the latter is an oxidant while the former is not. It seems odd to me to bundle them together.

I don't know what particular oxidants would cause more or worse biological effects. So, not sure whether it would matter if there's "excess metastable" of one particular oxidant or not. But "excess metastable ClO-" seems an odd thing to expect - it sounds like you're expecting a reaction to go past equilibrium; why?

Comment by simon on How bad is chlorinated water? · 2023-12-13T21:47:48.088Z · LW · GW

Disclaimer: spitballing from someone without particularly relevant knowledge.

Chlorine is an oxidant and the damage pathways suggested by bhauth involve oxidation.

Oxidation is a chemical process where the oxidant "wants electrons" and reacts accordingly. Thus, things like oxygen (missing two electrons in outer shell) and chlorine (missing one electron) are oxidants.

A chloride ion, such as occurs in HCl when the covalent bond dissociates (or e.g. in ordinary table salt) is not missing any electrons and thus is not an oxidant.

The relevant question thus is how much the chlorine is persisting in oxidant form from tap water (when added to the stomach acid) vs. how much it is in oxidant form already in natural stomach acid.

Cl is a pretty strong oxidant, so Cl- paired with some positive counterpart is not that prone to shift to neutral unbonded Cl plus the neutral counterpart. So it wouldn't be surprising to me if even a very large amount of HCl in stomach acid has relatively little elemental chlorine in equilibrium, with only slow production of extra if some of what does form reacts with other stuff in the stomach.

Given that water chlorination on the other hand is specifically intended to kill microbes via oxidation-related processes, it doesn't seem surprising to me that there would be relevant amounts of elemental chlorine available. When it combines with the stomach acid - I dunno what happens, but for elemental chlorine to convert to Cl- means it has to get electrons from somewhere. Which means oxidation of something I would think? 

(A complication is that the HCl is evolved to destroy stuff in the stomach via acidity, and acidity is related to oxidation. But it isn't quite the same and it isn't elemental chlorine as such doing it.)

Comment by simon on Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition · 2023-12-03T18:36:27.047Z · LW · GW

Regarding NicholasKees' point about mob rule vs expansion, I wrote a reply that I moved to another comment.

In response to the points in the immediate parent comment:

You have to decide, at some point, what you are optimizing for. If you optimize for X, Y will potentially be sacrificed. Some conflicts might be resolvable but ultimately you are making a tradeoff somewhere.

And while you haven't taken over yet, other people have a voice as to whether they want to get sacrificed for such a trade-off. 

Comment by simon on Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition · 2023-12-03T17:51:17.390Z · LW · GW

If your view is that you only have reasons to include those, whom you have instrumental reasons to include, on your view: the members of an AGI lab that developed ASI ought to include only themselves if they believe (in expectation) that they can successfully do so. This view is implausible, it is implausible that this is what they would have most moral reasons to do. 

 

I note that not everyone considers that implausible; for example, Tamsin Leake's QACI takes this view.

I disagree with both Tamsin Leake and with you: I think that all humans, but only humans, makes the most sense. But for concrete reasons, not for free-floating moral reasons.

I was writing the following as a response to NicholasKees' comment, but I think it belongs better as a response here:


...imagine you are in a mob in such a "tyranny of the mob" kind of situation, with mob-CEV. For the time being, imagine a small mob.

You tell the other mob members: "we should expand the franchise/function to other people not in our mob".

OK, should the other mob members agree?

  • maybe they agree with you that it is right that the function should be expanded to other humans. In which case mob-CEV would do it automatically.
  • Or they don't agree. And still don't agree after full consideration/extrapolation.

If they don't agree, what do you do? Ask Total-Utility-God to strike them down for disobeying the One True Morality?

At this point you are stuck, if the mob-CEV AI has made the mob untouchable to entities outside it.

But there is something you could have done earlier. Earlier, you could have allied with other humans outside of the mob, to pressure the would-be-mob members to pre-commit to not excluding other humans.

And in doing so, you might have insisted on including all humans, not specifically the humans you were explicitly allying with, even if you didn't directly care about everyone, because:

  • the ally group might shift over time, or people outside the ally group might make their own demands
  • if the franchise is not set to a solid Schelling point (like all humans) then people currently inside might still worry about the lines being shifted to exclude them.

Thus, you include the Sentinelese, not because you're worried about them coming over to demand to be included, but because if you draw the line to exclude them then it becomes more ambiguous where the line should be drawn, and relatively low (but non-zero) influence members of the coalition might be worried about also being excluded. And, as fellow humans, it is probably relatively low cost to include them - they're unlikely to have wildly divergent values or be utility monsters etc.


You might ask, is it not also a solid Schelling point to include all entities whatsoever?

First, not really, we don't have good definitions of "all sentient beings", not nearly as good as "all humans". It might be different if, e.g., we had time travel, such that we would also have to worry about intermediate evolutionary steps between humans and non-human-animals, but we don't.

In the future, we will have more ambiguous cases, but CEV can handle it. If someone wants to modify themselves into a utility monster, maybe we would want to let them do so, but discount their weighting in CEV to a more normal level when they do it.

And second, it is not costless to expand the franchise. If you allow non-humans preemptively you are opening yourself up to, as an example, the xenophobic aliens scenario, but also potentially who-knows-what other dangerous situations since entities could have arbitrary values.

And that's why expanding the franchise to all humans makes sense, even if individuals don't care about other humans that much, but expanding to all sentients does not, even if people do care about other sentients.


In response to the rest of your comment:

If you want to argue that s-risks would be prevented for certain, please address the object-level arguments I present. 

If humans would want to prevent s-risks, then they would be prevented. If humans would not want to prevent s-risks, they would not be prevented.

If you want to argue that the occurrence of s-risks would not be bad, you want to argue for a particular view in normative and practical ethics.

You're the one arguing that people should override their actual values, and instead of programming an AI to follow their actual values, do something else! Without even an instrumental reason to do so (other than alleged moral considerations that aren't in their actual values, but coming from some other magical direction)!

Asking someone to do something that isn't in their values, without giving them instrumental reasons to do so, makes no sense. 

It is you who needs a strong meta-ethical case for that. It shouldn't be the objector who has to justify not overriding their values! 

Comment by simon on Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition · 2023-12-03T02:56:43.706Z · LW · GW

A thought experiment: the mildly xenophobic large alien civilization.

Imagine at some future time we encounter an expanding grabby aliens civilization. The civilization is much older and larger than ours, but cooperates poorly. Their individual members tend to have a mild distaste for the existence of aliens (such as us). It isn't that severe, but there are very many of them, so their total suffering at our existence and wish for us to die outweighs our own suffering if our AI killed us, and our own will to live.

They aren't going to kill us directly, because they co-operate poorly, individually don't care all that much, and defense has the advantage over offense.

But, in this case, the AI programmed as you proposed will kill us once it finds out about these mildly xenophobic aliens. How do you feel about that? And do you feel that, if I don't want to be killed in this scenario, my opposition is unjustified? 

Comment by simon on Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition · 2023-12-03T02:31:35.951Z · LW · GW

It is not clear to me exactly what "belief regarding suffering" you are talking about, what you mean by "ordinary human values"/"your own personal unique values". 

Belief regarding suffering: the belief that s-risks are bad, independently of human values as would be represented in CEV.

Ordinary human values: what most people have.

Your own personal unique values: what you have, but others don't.

Please read the paper, and if you have any specific points of disagreement cite the passages you would like to discuss. Thank you

In my other reply comment, I pointed out disagreements with particular parts of the paper you cited in favour of your views. My fundamental disagreement, though, is that you are relying on an unjustified assumption, repeated in your comment above:

even if s-risks are very morally undesirable (either in a realist or non-realist sense)

The assumption being that s-risks are "very morally undesirable", independently of human desires (represented in CEV). 

Comment by simon on Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition · 2023-12-03T02:23:37.421Z · LW · GW

Thanks for the reply.

We don't work together with animals - we act towards them, generously or not.

That's key because, unlike for other humans, we don't have an instrumental reason to include them in the programmed value calculation, and to precommit to doing so, etc. For animals, it's more of a terminal goal. But if that terminal goal is a human value, it's represented in CEV. So where does this terminal goal over and above human values come from?

Regarding 2:

There is (at least) a non-negligible probability that an adequate implementation of the standard CEV proposal results in the ASI causing or allowing the occurrence of risks of astronomical suffering (s-risks).

You don't justify why this is a bad thing over and above human values as represented in CEV.

Regarding 2.1:

The normal CEV proposal, like CEO-CEV and men-CEV, excludes a subset of moral patients from the extrapolation base.

You just assume it, that the concept of "moral patients" exists and includes non-humans. Note, to validly claim that CEV is insufficient, it's not enough to say that human values include caring for animals - it has to be something independent of or at least beyond human values. But what? 

Regarding 4.2:

However, as seen above, it is not the case that there are no reasons to include sentient non-humans since they too can be positively or negatively affected in morally relevant ways by being included in the extrapolation base or not.

Again, existence and application of the "moral relevance" concept over and above human values just assumed, not justified.

regarding 3.2:

At any given point in time t, the ASI should take those actions that would in expectation most fulfil the coherent extrapolated volition of all sentient beings that exist in t.

Good - by focusing on the particular time, at least you aren't guaranteeing that the AI will replace us with utility monsters. But if utility monsters do come to exist or be found (e.g. utility monster aliens) for whatever reason, the AI will still side with them, because:

Contrary to what seems to be the case in the standard CEV proposal, the interests of future not-yet-existing sentient beings, once they exist, would not be taken into account merely to the extent to which the extrapolated volitions of currently existing individuals desire to do so.

Also, I have to remark on:

Finally, it should also be noted that this proposal of SCEV (as CEV) is not intended as a realist theory of morality, it is not a description of the metaphysical nature of what constitutes the ‘good’. I am not proposing a metaethical theory but merely what would be the most morally desirable ambitious value learning proposal for an ASI.

You assert your approach is "the most morally desirable" while disclaiming moral realism. So where does that "most morally desirable" come from?

And in response to your comment:

Yes, but (as I argue in 2.1 and 2.2) there are strong reasons to include all sentient beings. And (to my knowledge) there are no good reasons to support any religion.

The "reasons" are simply unjustified assumptions, like "moral relevance" existing (independent of our values, game theoretic considerations including pre-commitments, etc.) (and yes, you don't explicitly say it exists independent of those things in so many words, but your argument doesn't hold unless they do exist independently).

Comment by simon on Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition · 2023-12-03T00:49:38.632Z · LW · GW

To the people downvoting/disagreeing, tell me:

Where does your belief regarding suffering come from?

Does it come from ordinary human values?

  • great, CEV will handle it.

Does it come from your own personal unique values?

  • the rest of humanity has no obligation to go along with that

Does it come from pure logic that the rest of us would realize if we were smart enough?

  • great, CEV will handle it.

Is it just a brute fact that suffering of all entities whatsoever is bad, regardless of anyone's views? And furthermore, that you have special insight into this, not from your own personal values or from logic, but...from something else?

  • then how are you not a religion? where is it coming from?

Comment by simon on Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition · 2023-12-02T20:23:45.190Z · LW · GW

since they too can be positively or negatively affected in morally relevant ways

 

taboo morality. 


So people want X

and would want X if they were smarter, etc.

But you say, they should want Y. 

But you are a person. You are in the group of people who would be extrapolated by CEV. If you would be extrapolated by CEV:

  • you would either also want X, in which case insisting on Y is strange
  • or you would be unusual in wanting Y, enough so that your preference on Y is ignored or excessively discounted.

in which case it's not so strange that you would want to insist on Y. But the question is, does it make sense for other people to agree with this?


There is, admittedly, one sense in which Y = a higher scope of concern is different from other Ys. And that is that, at least superficially, it might seem like a natural continuation of not wanting a lower scope of concern.

If someone says, "I don't want my AI to include everyone in its scope of concern, just some people" (or just one), then other people might be concerned about this.

They might, on hearing or suspecting this, react accordingly - say, by banding together to stop that person from making that AI, or by rushing to make a different AI at all costs. And that's relevant because they are actually existing entities we are working together with on this one planet.

So, a credible pre-commitment to value everyone is likely to be approved of, to lead to co-operation and ultimate success.

Also, humans are probably pretty similar. There will be a great deal of overlap in those extrapolated values, and probably not extreme irreconcilable conflict.

But, valuing non-human sentient agents is very different. They are not here (yet). And they might be very, very different.

When you encounter a utility monster that claims it will suffer greatly if you don't kill yourself, will you just do that? 

If someone convinces you "life is suffering" will you kill all life in the universe? even if suffering living things want to survive?


Now, once those non-human agentic sentients are here, and they don't already do what we want, and their power is commensurate with ours, we may want to make deals, implicitly or explicitly, to compromise. Thus including them in the scope of concern.

And if that makes sense in the context, that's fine...

But if you pre-emptively do it, unconditionally, you are inviting them to take over.

Could they reciprocate our kindness voluntarily? Sure, for some tiny portion of mind-design space that they won't be in.


In your view, Y is obviously important. At least, so it seems to you right now. You say: if we don't focus on Y, code it in right from the start, then Y might be ignored. So, we must focus on Y, since it is obviously important.

But when you step outside what you and other people you are game-theoretically connected with, and the precommitments you reasonably might make:

Well then, anyone can say their Y is the all-important thing, for whatever Y seems obviously important to them. A religious person might want an AI to follow the tenets of their religion.

This happens to be your religion. 

Comment by simon on How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs") · 2023-12-01T16:04:39.057Z · LW · GW

I'd go further and say it doesn't need to target the future at all.

I know there's a view that, if it chooses some particular output, it's choosing the future that results from it outputting that output. But if it's not choosing that output because of the resulting future, but for some other reason (like that this is the output that satisfies some properties specified in a query, and the query isn't in effect asking about the future), then it isn't, in my view, agentic at all.