Posts

simon's Shortform 2023-04-27T03:25:07.778Z
No, really, it predicts next tokens. 2023-04-18T03:47:21.797Z

Comments

Comment by simon on Fluoridation: The RCT We Still Haven't Run (But Should) · 2025-01-11T22:48:40.769Z · LW · GW

> We may thus rule out negative effects larger than 0.14 standard deviations in cognitive ability if fluoride is increased by 1 milligram/liter (the level often considered when artificially fluoridating the water).

 

That's a high level of hypothetical harm that they are ruling out (~2 IQ points?). I would take the dental harms many times over to avoid that much cognitive ability loss.
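
The "~2 IQ points" figure can be checked with a one-line conversion. This is only a sketch, and it assumes the common convention that IQ is normed to a standard deviation of 15 points; the quoted passage itself only gives the bound in standard deviations.

```python
# Convert an effect size in standard deviations to IQ points.
# Assumption (not from the quoted paper): IQ is normed to SD = 15.
IQ_SD = 15.0


def sd_to_iq_points(effect_sd: float, iq_sd: float = IQ_SD) -> float:
    """Return the effect size expressed in IQ points."""
    return effect_sd * iq_sd


bound_iq = sd_to_iq_points(0.14)
print(f"0.14 SD is about {bound_iq:.1f} IQ points")
```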

Comment by simon on D&D.Sci Dungeonbuilding: the Dungeon Tournament Evaluation & Ruleset · 2025-01-10T20:17:58.352Z · LW · GW

> actually, there are ~100 rows in the dataset where Room2=4, Room6=8, and Room3=5=7.

I actually did look at that (at least some subset with that property) at some point, though I didn't (think of/ get around to) re-looking at it with my later understanding.

> In general, I think this is a realistic thing to occur: 'other intelligent people optimizing around this data' is one of the things that causes the most complicated things to happen in real-world data as well.

Indeed, I am not complaining! It was a good, fair difficulty to deal with. 

That being said, there was one aspect I did feel was probably more complicated than ideal, and that was the combination of the tier-dependent alerting with the tiers not having any other relevance than this one aspect. That is, if the alerting had in each case been simply dependent on whether the adventurers were coming from an empty room or not, it would have been a lot simpler to work out. And if there was tier dependent alerting, but the tiers were more obvious in other ways*, it would still have been tricky but at least there would be a path to recognize the tiers and then try to figure out other ways that they might have relevance. The way it was it seemed to me you pretty much had to look at what were (ex ante) almost arbitrary combinations of (current encounter, next encounter) to figure that aspect out, unless you actually guessed the rationale of the alerting effect.

That might be me rationalizing my failure to figure it out though!

*  e.g. perhaps the traps/golems could have had the same score as the same-tier nontrap encounter when alerted (or alternatively when not alerted)

Comment by simon on Rebuttals for ~all criticisms of AIXI · 2025-01-10T18:52:49.578Z · LW · GW

The biggest problem with AIXI in my view is the reward system - it cares about the future directly, whereas to have any reasonable hope of alignment an AI in my view needs to care about the future only via what humans would want about the future (so that any reference to the future is encapsulated in the "what do humans want?" aspect).

I.e. the question it needs to be answering is something like "all things considered (including the consequences of my current action on the future, as well as taking into account my possible future actions) what would humans, as they exist now, want me to do at the present moment?"

Now maybe you can take that question and try to slice it up into rewards at particular timesteps, which change over time as what is known about what humans want changes, without introducing corrigibility issues, but the AIXI reward framework isn't really buying you anything imo even if that works, relative to directly trying to get an AI to solve the question. 

On the other hand, approximating Solomonoff induction might afaik be a fruitful approach, though the approximations are going to have to be very aggressive for practical performance. I do agree embedding/self-reference can probably be patched in.

Comment by simon on On Eating the Sun · 2025-01-08T21:14:54.723Z · LW · GW

I think that it's likely to take longer than 10000 years, simply because of the logistics (not the technology development, which the AI could do fast).

The gravitational binding energy of the sun is something on the order of 20 million years worth of its energy output. OK, half of the needed energy is already present as thermal energy, and you don't need to move every atom to infinity, but you still need a substantial fraction of that. And while you could perhaps generate many times more energy than the solar output by various means, I'd guess you'd have to deal with inefficiencies and lots of waste heat if you try to do it really fast. Maybe if you're smart enough you can make going fast work well enough to be worth it though?
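
As a rough sanity check on the "20 million years" figure, here is a back-of-the-envelope sketch. It assumes a uniform-density sphere (binding energy 3GM²/(5R)) and standard textbook values for the solar constants; the real Sun is centrally condensed, so its actual binding energy is larger than this estimate.

```python
# Order-of-magnitude check: gravitational binding energy of the Sun
# expressed in years of its present energy output.
# Assumption: uniform-density sphere, U = 3 G M^2 / (5 R).
G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30       # solar mass, kg
R_SUN = 6.957e8        # solar radius, m
L_SUN = 3.828e26       # solar luminosity, W
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def uniform_sphere_binding_energy(mass: float, radius: float) -> float:
    """Binding energy in joules of a uniform-density sphere."""
    return 3.0 * G * mass**2 / (5.0 * radius)


binding_energy = uniform_sphere_binding_energy(M_SUN, R_SUN)
years_of_output = binding_energy / L_SUN / SECONDS_PER_YEAR
print(f"binding energy ~ {binding_energy:.2e} J")
print(f"equivalent to ~ {years_of_output:.2e} years of solar output")
```

Under these assumptions the result lands in the tens of millions of years, consistent with the order of magnitude quoted above.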

Comment by simon on D&D.Sci Dungeonbuilding: the Dungeon Tournament Evaluation & Ruleset · 2025-01-08T06:35:42.596Z · LW · GW

I feel like a big part of what tripped me up here was an inevitable part of the difficulty of the scenario that in retrospect should have been obvious. Specifically, if there is any variation in difficulty of an encounter that is known to the adventurers in advance, the score contribution of an encounter type in actual paths taken is less than the difficulty of the encounter as estimated by what best predicts the path taken (because the adventurer takes the path when it's weak, but avoids when it's strong).
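
A toy simulation of this selection effect, as a sketch only: the difficulty distribution, the fixed alternative, and all names below are made up for illustration and are not taken from the scenario's ruleset. An encounter with variable, visible-in-advance difficulty competes against a fixed alternative of the same average difficulty.

```python
import random


def selection_effect(n_trials: int = 100_000, seed: int = 0):
    """Compare an encounter's true mean difficulty with its mean when chosen."""
    rng = random.Random(seed)
    alternative = 4.0          # hypothetical fixed-difficulty alternative
    all_x, taken_x = [], []
    for _ in range(n_trials):
        # Hypothetical encounter whose difficulty varies and is known in advance.
        x = rng.gauss(4.0, 1.0)
        all_x.append(x)
        if x < alternative:    # adventurers take it only when it is the weaker option
            taken_x.append(x)
    true_mean = sum(all_x) / len(all_x)
    mean_when_taken = sum(taken_x) / len(taken_x)
    taken_fraction = len(taken_x) / n_trials
    return true_mean, mean_when_taken, taken_fraction


true_mean, mean_when_taken, taken_fraction = selection_effect()
print(f"true mean difficulty:      {true_mean:.2f}")
print(f"mean difficulty when taken: {mean_when_taken:.2f}")
print(f"fraction of times taken:    {taken_fraction:.2f}")
```

The encounter is taken about half the time, so path choices alone suggest it is about as hard as the alternative, yet its average contribution on the paths actually taken is clearly lower.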

So, I wound up with an epicycle saying hags and orcs were avoided more than their actual scores warranted, because that effect was most significant for them (goblins are chosen over most other encounters even if alerted, and Dragons mostly aren't alerted).

This effect was made much worse by the fact that I was getting scores mainly from lower difficulty dungeons, with lots of "Nothing" rooms and low level encounters. But even once I estimated scores from the overall data with my best guesses for preference order, the issue still applied, just not quite so badly.

In the "what if" department, I had said:

> I'm also getting remarkably higher numbers for Hag compared with my earlier method. But I don't immediately see a way to profitably exploit this.

The most obvious way to exploit this would have been the optimal solution. Why didn't I do it? The answer is that, as indicated above, I was still underestimating the hag (whereas at this point I had mostly-accurate scores for the traps and orcs). With my underestimate for the hag's score contribution, I didn't think it was worth giving up an orc-boulder trap difference to get a hag-orc difference. I also didn't realize I needed the hag to alert the dragon.


In general, I feel like I was pretty far along with discovering the mechanics despite some missteps. I correctly had the adventurers taking a 5-encounter path with right/down steps, the choice of next step being based on the encounters in the choices for the next room, with an alerting mechanism, and that the alerting mechanism didn't apply to traps and golems. 

On the other hand, I applied the alerting mechanism only to score and not to preference order, except for goblins and orcs (why didn't I try to apply it to preference order for other encounters once I realized it applied to preference order for goblins and orcs and that some degree of alerting mechanism score effect applied to other encounters ?????) (I also got confused into thinking that the effect on orc preference order only applied if the current encounter was also orcs). I also didn't realize that the alerting mechanism had different sensitivity for different encounters, and I had my mistaken belief about the preference order being different from expected score for some encounter types (hey, the text played up how unnerving the hag was, there was some plausibility there!).

I think if I had gotten to where I was in my last edit early on in the time frame for this scenario instead of near the end, and had posted it, and other people had read it and tried it out, collectively we would have had a good chance of solving the whole thing. I also would have been much more likely to get the optimal solution if I had paid more attention to what abstractapplic said, instead of only very briefly glancing over his comments after posting my very belated comment and going back to doing my own thing.

In my view, a fun, challenging and theoretically solvable scenario (even if actually not that close to being solved in practice), so I think it was quite good.

Comment by simon on D&D.Sci Dungeonbuilding: the Dungeon Tournament · 2025-01-06T09:33:17.536Z · LW · GW

Looking like I'll not have figured this out before the time limit despite the extra time; what I have so far:

I'm modeling this as follows, but haven't fully worked it out, and am getting complications/hard-to-explain dungeons that suggest it might not be exactly correct:

  • the adventurers go through the dungeons using rightwards and downwards moves only, thus going through 5 rooms in total.
  • at each room they choose the next room based on a preference order (which I am assuming is deterministic, but possibly dependent on, e.g. what the current room is)
  • the score is dependent only on the rooms they pass through (but again, am getting complications)
  • I'm assuming a simple addition of scores to start with, but then adding epicycles (which so far have been based on the previous room, generally)
  • there is some randomness in the individual score contributions from each encounter.

For the dungeon generation: dungeon generation seems to treat rooms 1-8 equally (room 9 is different and tends to have harder encounters). Encounters of the same types (and some related "themes") tend to be correlated. Scores in each tournament seem to be whole numbers from each judge and averaged between 3 or 4 judges; I am not sure if any tournaments are judged by 2 or 1, but if so they're relatively less common.

In theory, I'd like to plug in a preference model and a score model to a simulator and iterate to refine, but I'm not there yet, still working out plausible scores and preferences.

One possibility for the scores and preference order:

baseline average scores: 

Nothing: 0; Goblins: 1.5 (1d2?); Whirling Blade Trap 3; Orcs 3; Hag 4; Boulder Trap 4.5; Clay Golem 6, Dragon 6?, Steel Golem 7.5 (edit: <--- numbers estimated with small, atypical samples (included many Nothing, which is problematic for reasons that become obvious with below edit))

With Goblins and Orcs being increased (doubled?) if following goblins/orcs/any trap? (edit - or golems?) (edit - looking now like it's probably anything but an empty room?)

Plus with the adventurers seemingly avoiding Orcs and Hags more than their difficulty warrants? (I found them to be relatively late in the preference order, then found that they were in practice lower in score, so am having to ad hoc adjust if I keep the assumption that the score contribution and preference order are related. 1.5 multiplier? 2x multiplier? fixed addition?) (I'm assuming a 1.5x multiplier atm since I initially had Hag avoided over anything but orcs, but found one dungeon that looks suspiciously like, but does not prove, Hag being chosen over Dragon (edit: see below for update)) (I suppose +2 would also work) (edit - it looks like the Orc difficulty increase for following a non-empty room only applies to adventurer preference if the current room is also Orcs - violating the assumption that preference is tied to expected difficulty. But for Goblins it seems the preference may indeed depend only on following a non-empty room, though in practice it doesn't matter much since it only affects order wrt WBT).

(edit - see update to preference order below)

Assuming the above is correct, and I'm pretty sure it isn't but hopefully has some relationship with reality, one strategy might be:

CHN/WON/BOD  <---obsolete answer

where the idea is to use the encounters the adventurers avoid too much relative to their actual score contributions (Hag, Orcs) to herd the adventurers away from the Nothing rooms. One of the Orcs is left in after a Boulder Trap in the belief that this will make it score higher than the hag. WBT is left in the preferred path to lead the adventurers along; I don't immediately see a way to avoid this.

EV if above model is correct: 6+3+4.5+6+6=25.5


How I've gotten here (mainly used Claude and Claude-written code, including the analysis tool which is good for prototyping if you don't mind javascript): 

  • found initial basic encounter score contribution estimates from linear regression on whole dungeon
  • after determining that rooms 1-8 were interchangeable as far as dungeon generation is concerned, looked at room importance to score, guessed the basic model based on that iirc (might have been more complicated than this) (I do remember considering and rejecting a model where each room is selected one at a time from the full set of available rooms, and rejecting any "symmetrical" model based on working out the full path in advance)
  • initially assumed that adventurers preferred easier encounters based on the initial score estimates
  • refined preference order based on minimizing variance between same-predicted-sequence-of-encounters dungeons
  • tried to work out how scores actually work by filtering for specific predicted sequences of encounters and finding their scores
  • found epicycles from that and started refining model, including preference order adjustments
  • haven't really finished the above step, epicycles might be because model is wrong/incomplete?
  • hypothetical todo: apply model to entire dataset, also develop model for variations in score from each encounter, compare to known 3-judge and 4-judge tournaments for full Bayes assessment, refine further with this as feedback
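
The first step in the list above (per-encounter score estimates from a linear regression on the whole dungeon) could look roughly like the sketch below. Everything here is hypothetical: the data is synthetic, the "true" values are made up, and the toy scoring rule lets every room contribute, which is exactly the simplification the later steps had to abandon. The solver is a plain least-squares fit via the normal equations, written without external libraries.

```python
import random

# "Nothing" is treated as the zero baseline, so it gets no coefficient.
ENCOUNTERS = ["Goblins", "Orcs", "Hag", "Dragon"]
TRUE_VALUES = {"Goblins": 1.5, "Orcs": 3.0, "Hag": 4.0, "Dragon": 6.0}  # made up


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(rows, scores):
    """Ordinary least squares (no intercept) via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(k)]
    return solve(xtx, xty)


def synthetic_data(n=2000, seed=0):
    """Random 9-room dungeons; toy score = sum over ALL rooms plus noise."""
    rng = random.Random(seed)
    rows, scores = [], []
    for _ in range(n):
        rooms = [rng.choice(ENCOUNTERS + ["Nothing"]) for _ in range(9)]
        rows.append([rooms.count(e) for e in ENCOUNTERS])
        scores.append(sum(TRUE_VALUES.get(r, 0.0) for r in rooms) + rng.gauss(0, 1))
    return rows, scores


rows, scores = synthetic_data()
estimates = dict(zip(ENCOUNTERS, fit(rows, scores)))
for name, value in estimates.items():
    print(f"{name}: {value:.2f}")
```

On real data where only the rooms on the path count, a whole-dungeon regression like this would blur the contributions, which is one reason it can only serve as a starting point.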

edit: I've now read other people's comments; I did not notice any 1-point jump in scores (didn't check for it), not sure if I would have noticed if it is a judging difference as opposed to a strategy change? (wouldn't notice if just strategy change). Also I did not notice anything special about Steel Golems at the entrance vs. other spots, did not check for any change in distribution of 3 vs 4 judge tournaments, etc.

further analysis after the above:

I've looked at root mean square deviation of predictions from the data for the full dataset (full Bayes seems a bit intimidating to code atm even with AI help). From this it seems the preference order is (there remains a likely possibility for more complications I haven't checked):
Nothing > Goblins (current encounter null or Nothing) > Goblins (otherwise) = Whirling Blade Trap > Boulder Trap = Clay Golem = Orcs (current encounter not Orcs) > Dragon > Steel Golem >= Orcs (current encounter Orcs) > Hag
Nothing > Goblins (current encounter null or Nothing) > Goblins (otherwise) = Whirling Blade Trap > Boulder Trap > Clay Golem = Orcs (current encounter not Orcs) > Dragon > Orcs (current encounter Orcs) > Hag = Steel Golem

where I can't distinguish between Steel Golem being preferred or equal to Orcs with current encounter being Orcs.

Soo, if Orcs are avoided equally to a Boulder Trap if the current encounter is not Orcs, I need to improve the herding. But also it seems Orcs get doubled by many other encounter types? This could work:
CHN/OBN/WOD <---- current solution

Predicted value is now 6+6+3+6+6=27.
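
The model described in this comment can be written down as a small path simulator. This is a sketch of my reading of the comment, not the scenario's actual ruleset. Assumptions: the letter codes map to encounters as shown (the comment gives no legend), the baseline scores are the ones listed earlier, Goblins and Orcs are doubled after any non-empty room, the preference order is the one with Boulder Trap strictly ahead of Clay Golem, and ties are broken by moving right (an arbitrary choice).

```python
# Assumed legend for layouts such as "CHN/WON/BOD" (rows separated by "/").
CODES = {"N": "Nothing", "G": "Goblins", "W": "Whirling Blade Trap", "O": "Orcs",
         "H": "Hag", "B": "Boulder Trap", "C": "Clay Golem", "D": "Dragon",
         "S": "Steel Golem"}

BASE_SCORE = {"Nothing": 0.0, "Goblins": 1.5, "Whirling Blade Trap": 3.0,
              "Orcs": 3.0, "Hag": 4.0, "Boulder Trap": 4.5, "Clay Golem": 6.0,
              "Dragon": 6.0, "Steel Golem": 7.5}


def preference_rank(candidate, current):
    """Lower rank means the adventurers prefer the room more."""
    came_from_empty = current in (None, "Nothing")
    if candidate == "Nothing":
        return 0
    if candidate == "Goblins":
        return 1 if came_from_empty else 2
    if candidate == "Whirling Blade Trap":
        return 2
    if candidate == "Boulder Trap":
        return 3
    if candidate == "Clay Golem":
        return 4
    if candidate == "Orcs":
        return 6 if current == "Orcs" else 4
    if candidate == "Dragon":
        return 5
    return 7  # Hag and Steel Golem


def score(encounter, previous):
    """Baseline score, doubled for Goblins/Orcs that follow a non-empty room."""
    value = BASE_SCORE[encounter]
    if encounter in ("Goblins", "Orcs") and previous not in (None, "Nothing"):
        value *= 2
    return value


def walk(layout):
    """Follow right/down moves from the top-left to the bottom-right room."""
    grid = [[CODES[ch] for ch in row] for row in layout.split("/")]
    r = c = 0
    current = None
    path, total = [], 0.0
    while True:
        encounter = grid[r][c]
        path.append(encounter)
        total += score(encounter, current)
        current = encounter
        if (r, c) == (2, 2):
            return path, total
        options = []
        if c < 2:
            options.append((r, c + 1))   # right is listed first, so it wins ties
        if r < 2:
            options.append((r + 1, c))
        r, c = min(options, key=lambda p: preference_rank(grid[p[0]][p[1]], current))


for layout in ("CHN/WON/BOD", "CHN/OBN/WOD"):
    path, total = walk(layout)
    print(layout, " -> ".join(path), total)
```

Under these assumptions the two layouts give the same totals as the hand calculations in the comment (25.5 and 27), which is at least a consistency check on the reading of the letter codes.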

further edit: also refining the scores, getting probably nonsense (due to missing some dependency of some stuff on something else, probably), but it's looking like maybe every encounter's score depends on whether the previous encounter was Nothing/null. Except traps/golems? Which would explain why Steel Golems are being reported as better in the first slot.

I'm also getting remarkably higher numbers for Hag compared with my earlier method. But I don't immediately see a way to profitably exploit this.

Comment by simon on Is "VNM-agent" one of several options, for what minds can grow up into? · 2024-12-30T21:41:44.428Z · LW · GW

I feel like this discussion could do with some disambiguation of what "VNM rationality" means.

VNM assumes consequentialism. If you define consequentialism narrowly, this has specific results in terms of instrumental convergence. 

You can redefine what constitutes a consequence arbitrarily. But, along the lines of what Steven Byrnes points out in his comment, redefining this can get rid of instrumental convergence. In the extreme case you can define a utility function for literally any pattern of behaviour.
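
A minimal sketch of the extreme case, under simplifying assumptions of my own: a deterministic setting with no observations, where a "consequence" is taken to be the entire action history. The policy and function names are hypothetical. Given any behaviour pattern, a utility function that scores exactly that history as 1 and everything else as 0 makes the behaviour utility-maximizing.

```python
from itertools import product

ACTIONS = ("left", "right", "wait")


def arbitrary_policy(history):
    """Any deterministic behaviour pattern at all; this one is deliberately odd."""
    return ACTIONS[(len(history) * 2 + history.count("wait")) % len(ACTIONS)]


def rollout(policy, horizon):
    """The action history the policy actually produces."""
    history = ()
    for _ in range(horizon):
        history += (policy(history),)
    return history


def rationalizing_utility(policy, horizon):
    """Utility over whole histories: 1 for what the policy does, 0 otherwise."""
    target = rollout(policy, horizon)
    return lambda history: 1.0 if tuple(history) == target else 0.0


horizon = 4
utility = rationalizing_utility(arbitrary_policy, horizon)
best = max(product(ACTIONS, repeat=horizon), key=utility)
print("policy does:        ", rollout(arbitrary_policy, horizon))
print("utility maximizer:  ", best)
```

The construction is trivial by design: it shows that "maximizes some utility function" places no constraint on behaviour until the space of consequences is restricted.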

When you say you feel like you can't be dutch booked, you are at least implicitly assuming some definition of consequences you can't be dutch booked in terms of. To claim that one is rationally required to adopt any particular definition of consequences in your utility function is basically circular, since you only care about being dutch booked according to it if you actually care about that definition of consequences. It's in this sense that the VNM theorem is trivial.


BTW I am concerned that self-modifying AIs may self-modify towards VNM-0 agents. 

But the reason is not because such self modification is "rational".

It's just that (narrowly defined) consequentialist agents care about preserving and improving their abilities to and proclivities to pursue their consequentialist goals, so tendencies towards VNM-0 will be reinforced in a feedback loop. Likewise for inter-agent competition.

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T23:43:02.891Z · LW · GW

You can also disambiguate between

a) computation that actually interacts in a comprehensible way with the real world and 

b) computation that has the same internal structure at least momentarily but doesn't interact meaningfully with the real world.

I expect that (a) can usually be uniquely pinned down to a specific computation (probably in both senses (1) and (2)), while (b) can't.

But I also think it's possible that the interactions, while important for establishing the disambiguated computation that we interact with,  are not actually crucial to internal experience, so that the multiple possible computations of type (b) may also be associated with internal experiences - similar to Boltzmann brains.

(I think I got this idea from "Good and Real" by Gary L. Drescher. See sections "2.3 The Problematic Arbitrariness of Representation" and "7.2.3 Consciousness and Subjunctive Reciprocity")

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T17:22:07.818Z · LW · GW

The interpreter, if it existed, would have complexity. The useless unconnected calculation in the waterfall/rock, which could be but isn't usually interpreted, also has complexity.

Your/Aaronson's claim is that only the fully connected, sensibly interacting calculation matters.  I agree that this calculation is important - it's the only type we should probably consider from a moral standpoint, for example. And the complexity of that calculation certainly seems to be located in the interpreter, not in the rock/waterfall.

But in order to claim that only the externally connected calculation has conscious experience, we would need to have it be the case that these connections are essential to the internal conscious experience even in the "normal" case - and that to me is a strange claim! I find it more natural to assume that there are many internal experiences, but only some interact with the world in a sensible way.

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T16:43:21.359Z · LW · GW

> But this just depends on how broad this set is. If it contains two brains, one thinking about the roman empire and one eating a sandwich, we're stuck.

I suspect that if you do actually follow Aaronson (as linked by Davidmanheim) to extract a unique efficient calculation that interacts with the external world in a sensible way, that unique efficient externally-interacting calculation will end up corresponding to a consistent set of experiences, even if it could still correspond to simulations of different real-world phenomena.

But I also don't think that consistent set of experiences necessarily has to be a single experience! It could be multiple experiences unaware of each other, for example.

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T08:31:47.090Z · LW · GW

The argument presented by Aaronson is that, since it would take as much computation to convert the rock/waterfall computation into a usable computation as it would take to just do the usable computation directly, the rock/waterfall isn't really doing the computation.

I find this argument unconvincing, as we are talking about a possible internal property here, and not about the external relation with the rest of the world (which we already agree is useless).

(edit: whoops missed an 'un' in "unconvincing")

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T00:44:23.348Z · LW · GW

> Considering all the layers of convention and interpretation between the physics of a processor and the process it represents, it seems unlikely to me that the alien would be able to describe the simulacra. The alien is therefore unable to specify the experience being created by the cluster.

I don't think this follows. Perhaps the same calculation could simulate different real world phenomena, but it doesn't follow that the subjective experiences are different in each case.

> If computation is this arbitrary, we have the flexibility to interpret any physical system, be it a wall, a rock, or a bag of popcorn, as implementing any program. And any program means any experience. All objects are experiencing everything everywhere all at once.

Afaik this might be true. We have no way of finding out whether the rock does or does not have conscious experience. The relevant experiences to us are those that are connected to the ability to communicate or interact with the environment, such as the experiences associated with the global workspace in human brains (which seems to control memory/communication); experiences that may be associated with other neural impulses, or with fluid dynamics in the blood vessels or whatever, don't affect anything.

> Could both of them be right? No - from your point of view, at least one of them must be wrong. There is one correct answer, the experience you are having.

This also does not follow. Both experiences could happen in the same brain. You - being experience A - may not be aware of experience B - but that does not mean that experience B does not exist.

(edited to merge in other comments which I then deleted)

Comment by simon on Magic by forgetting · 2024-11-27T18:36:36.547Z · LW · GW

It is a fact about the balls that one ball is physically continuous with the ball previously labeled as mine, while the other is not. It is a fact about our views on the balls that we therefore label that ball, which is physically continuous, as mine and the other not.

> And then suppose that one of these two balls is randomly selected and placed in a bag, with another identical ball. Now, to the best of your knowledge there is 50% probability that your ball is in the bag. And if a random ball is selected from the bag, there is 25% chance that it's yours.
>
> So as a result of such manipulations there are three identical balls and one has 50% chance to be yours, while the other two have 25% chance to be yours. Is it a paradox? Of course not. So why does it suddenly become a paradox when we are talking about copies of humans?

It is objectively the case here that 25% of the time this procedure would select the ball that is physically continuous with the ball originally labeled as "mine", and that we therefore label as "mine".
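
The ball probabilities can be checked by enumerating the equally likely outcomes. This is a sketch of my reading of the setup: the ball labeled "mine" and its duplicate are equally likely to be put in the bag, and the selected ball is equally likely to end up in either position within the bag. The labels ("outside", "bag slot 0", "bag slot 1") are hypothetical names for the three indistinguishable balls.

```python
from fractions import Fraction
from itertools import product


def ball_probabilities():
    """Probability that each of the three final balls is the one labeled 'mine'."""
    probs = {"outside": Fraction(0), "bag slot 0": Fraction(0), "bag slot 1": Fraction(0)}
    weight = Fraction(1, 4)  # two independent fair binary choices
    for picked, slot in product(("mine", "duplicate"), (0, 1)):
        # 'picked' goes into the bag at position 'slot'; a fresh identical
        # ball occupies the other position, and the unpicked ball stays outside.
        if picked == "mine":
            probs[f"bag slot {slot}"] += weight
        else:
            probs["outside"] += weight
    return probs


for ball, p in ball_probabilities().items():
    print(ball, p)
```

Here the numbers track a physical fact (which ball is continuous with the original), which is the contrast being drawn with the copies case.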

Ownership as discussed above has a relevant correlate in reality - physical continuity in this case. But a statement like "I will experience being copy B (as opposed to copy A or C)" does not. That statement corresponds to the exact same reality as the corresponding statements about experiencing being copy A or C. Unlike in the balls case, here the only difference between those statements is where we put the label of what is "me". 

In the identity thought experiment, it is still objectively the case that copies B and C are formed by splitting an intermediate copy, which was formed along with copy A by splitting the original.

You can choose to disvalue copies B and C based on that fact or not. This choice is a matter of values, and is inherently arbitrary.

By choosing not to disvalue copies B and C, I am not making an additional assumption - at least not one that you aren't already making by valuing B and C the same as each other. I am simply not counting the technical details of the splitting order as relevant to my values.

Comment by simon on [deleted post] 2024-11-26T22:45:19.544Z

Ah, I forgot. You use assumptions where you don't accumulate the winnings between the different times Sleeping Beauty agrees to the bet. 

Well, in that case, if the thirder has certain beliefs about how to handle the situation, you may actually be able to money pump them. And it seems that you expect those beliefs. 

My point of view, if adopting the thirder perspective[1], would be for the thirder to treat this situation using different beliefs. Specifically, consider what counterfactually might happen if Sleeping Beauty gave different answers in different awakenings. Possible responses by the bet proposer might be:

a) average the results across the awakenings.

b) accept the bet agreement from one awakening at random.

Regardless of which case (a) or (b) occurs, instrumentally Sleeping Beauty's betting EV for her bet decision, with non-accumulated bets, should be divided by the number of awakenings to take into account the reduced winnings or reduced chance of influencing whether the bet occurs.

Even if we assume that such disagreement between bet decisions in different awakenings is impossible, it seems strange to assume that a thirder should give different results in that case than the answer they would give where it is not impossible?

This adjustment can be conceptualized as compensating for an "unfair" bet where the bet is unequal between awakenings overall (where parity between awakenings in different scenarios is seen as "fair" by the thirder). I see this as no different in principle to a halfer upweighting trials with more awakenings in the converse scenario where bets are accumulated between trials and are thus "unfair" from the halfer perspective which sees parity between trials as fair, but not awakenings.
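
A worked sketch of the adjustment, assuming the standard setup (Heads gives one awakening, Tails gives two) and a hypothetical even-odds bet on Tails; the bet in the deleted post may have differed. The thirder credence comes from weighting the coin prior by the number of awakenings, and for non-accumulated bets each outcome's payoff is divided by its number of awakenings.

```python
from fractions import Fraction

# Assumed standard setup: Heads -> 1 awakening, Tails -> 2 awakenings.
AWAKENINGS = {"heads": 1, "tails": 2}
COIN_PRIOR = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}


def thirder_credence():
    """Per-awakening credence: coin prior weighted by the number of awakenings."""
    weights = {o: COIN_PRIOR[o] * AWAKENINGS[o] for o in AWAKENINGS}
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}


def thirder_ev(payoff, accumulated):
    """Thirder EV of agreeing to the bet at a given awakening.

    If bets are not accumulated, each outcome's payoff is divided by the
    number of awakenings in that outcome, as described above.
    """
    ev = Fraction(0)
    for outcome, credence in thirder_credence().items():
        scale = 1 if accumulated else Fraction(1, AWAKENINGS[outcome])
        ev += credence * payoff[outcome] * scale
    return ev


def per_experiment_ev(payoff, accumulated):
    """Expected winnings per run of the experiment, for comparison."""
    return sum(
        COIN_PRIOR[o] * payoff[o] * (AWAKENINGS[o] if accumulated else 1)
        for o in AWAKENINGS
    )


even_bet_on_tails = {"heads": Fraction(-1), "tails": Fraction(1)}  # hypothetical bet
for accumulated in (True, False):
    print(
        "accumulated" if accumulated else "not accumulated",
        thirder_ev(even_bet_on_tails, accumulated),
        per_experiment_ev(even_bet_on_tails, accumulated),
    )
```

With the division applied, the thirder's EV has the same sign as the per-experiment expectation in both betting schemes, so in this toy version the adjusted thirder accepts and declines the same bets as someone counting winnings per experiment.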

  1. ^

    reminder: my point of view is that either thirderism or halferism is viable, but I am relatively thirder-adjacent precisely because I find the scenario where the winnings are accumulated between awakenings more natural than if the bet is proposed and agreed at each awakening but not accumulated.

Comment by simon on Magic by forgetting · 2024-11-26T19:18:37.482Z · LW · GW

The issue, to me,  is not whether they are distinguishable.

The issues are:

  • is there any relevant-to-my-values difference that would cause me to weight them differently? (answer: no)

and:

  • does this statement make any sense as pointing to an actual fact about the world: "'I' will experience being copy A (as opposed to B or C)" (answer: no)

Imagine the statement: in world 1, "I" will wake up as copy A. in world 2 "I" will wake up as copy B. How are world 1 and world 2 actually different?

Answer: they aren't different. It's just that in world 1, I drew a box around the future copy A and said that this is what will count as "me", and in world 2, I drew a box around copy B and said that this is what will count as "me". This is a distinction that exists only in the map, not in the territory.

Comment by simon on [deleted post] 2024-11-26T18:51:26.960Z

Hmm, you're right. Your math is wrong for the reason in my above comment, but the general form of the conclusion would still hold with different, weaker numbers.

The actual, more important issue relates to the circumstances of the bet:

If each awakening has an equal probability of receiving the bet, then receiving it doesn't provide any evidence to Sleeping Beauty, but the thirder conclusion is actually rational in expectation, because the bet occurs more times in the high-awakening cases.

If the bet would not be provided equally to all awakenings, then a thirder would update on receiving the bet.

Comment by simon on [deleted post] 2024-11-26T18:22:09.528Z

I've been trying to make this comment a bunch of times, no quotation from the post in case that's the issue:

No, a thirder would not treat those possibilities as equiprobable. A thirder would instead treat the coin toss outcome probabilities as a prior, and weight the possibilities accordingly. Thus H1 would be weighted twice as much as any of the individual TH or TT possibilities.

Comment by simon on Magic by forgetting · 2024-11-26T17:54:12.756Z · LW · GW

> This actually sounds about right. What's paradoxical here?

Not that it's necessarily inconsistent, but in my view it does seem to be pointing out an important problem with the assumptions (hence indeed a paradox if you accept those false assumptions):


(ignore this part, it is just a rehash of the path dependence paradigm. It is here to show that I am not complaining about the math, but about its relation to reality):

Imagine you are going to be split (once). It is factually the case that there are going to be two people with memories, etc. consistent with having been you. Without any important differences to distinguish them, and if you insist on coming up with some probability number for "waking up" as one particular one of them obviously it has to be ½.

And then, if one of those copies subsequently splits, if you insist on assigning a probability number for those further copies, then from the perspective of that parent copy, the further copies also have to be ½ each.

And then if you take these probability numbers seriously and insist on them all being consistent then obviously from the perspective of the original the probability numbers for the final numbers have to be ½ and ¼ and ¼. As you say "this actually sounds about right".


What's paradoxical here is that in the scenario provided we have the following facts:

  1. you have 3 identical copies all formed from the original
  2. all 3 copies have an equal footing going forward

and yet, the path-based identity paradigm is trying to assign different weights to these copies, based on some technical details of what happened to create them. The intuition that this is absurd is pointing at the fact that these technical details aren't what most people probably would care about, except if they insist on treating these probability numbers as real things and trying to make them follow consistent rules. 

Ultimately "these three copies will each experience being a continuation of me" is an actual fact about the world, but statements like "'I' will experience being copy A (as opposed to B or C)" are not pointing to an actual fact about the world. Thus assigning a probability number to such a statement is a mental convenience that should not be taken seriously. The moment such numbers stop being convenient, like assigning different weights to copies you are actually indifferent between, they should be discarded. (and optionally you could make up new numbers that match what you actually care about instrumentally. Or just not think of it in those terms).

Comment by simon on Passages I Highlighted in The Letters of J.R.R.Tolkien · 2024-11-26T08:12:31.652Z · LW · GW

> Presumably the 'Orcs on our side' refers to the Soviet Union.

I think that, if that's what he meant, he would not have referred to his son as "amongst the Urukhai" - he wouldn't have been among Soviet troops. I think it is referring back to turning men and elves into orcs - the orcs are people who have a mindset he doesn't like, presumably to do with violence.

Comment by simon on Magic by forgetting · 2024-11-25T16:01:51.428Z · LW · GW

I now care about my observations!

My observations are as follows:

At the current moment "I" am the cognitive algorithm implemented by my physical body that is typing this response.

Ten minutes from now "I" will be the cognitive algorithm of a green tentacled alien from beyond the cosmological horizon. 

You will find that there is nothing contradictory about this definition of what "I" am. What "I" observe 10 minutes from now will be fully compatible with this definition. Indeed, 10 minutes from now, "I" will be the green tentacled alien. I will have no memories of being in my current body, of course, but that's to be expected. The cognitive algorithm implemented by my current body at that time will remember being "me", but that doesn't count; those are someone else's observations.

Edit: to be clear, the point made above (by the guy who is now a green tentacled alien beyond the cosmological horizon, and whose former body and cognitive algorithm is continuous with mine) is not a complaint about the precise details of your definition of what "you" are. What he was trying to point at is whether personal identity is a real thing that exists in the world at all, and how absurd your apparent definition of "you" looks to someone - like me - who doesn't think that personal identity is a real thing.

Comment by simon on Magic by forgetting · 2024-11-25T15:37:28.780Z · LW · GW

"Your observations"????

By "your observations", do you mean the observations obtained by the chain of cognitive algorithms, altering over time and switching between different bodies, that the process in 4 is dealing with? Because that does not seem to me to be a particularly privileged or "rational" set of observations to care about.

Comment by simon on Magic by forgetting · 2024-11-25T00:34:30.425Z · LW · GW

 Here are some things one might care about:

  1. what happens to your physical body
  2. the access to working physical bodies of cognitive algorithms, across all possible universes,  that are within some reference class containing the cognitive algorithm implemented by your physical body
  3. ... etc, etc...
  4. what happens to the physical body selected by the following process:
    a. start with your physical body
    b. go forward to some later time selected by the cognitive algorithm implemented by your physical body, allowing (or causing) the knowledge possessed by the cognitive algorithm implemented by your physical body to change in the interim
    c. at that later time, randomly sample from all the physical bodies, among all universes, that implement cognitive algorithms having the same knowledge as the cognitive algorithm implemented by your physical body at that later time
    d. (optionally) return to step b, but with the physical body whose changes of cognitive algorithm are tracked and whose decisions are used being the new physical body selected in step c
    e. stop whenever the cognitive algorithm implemented by the physical body selected in some step decides to stop.

For 1, 2, and I expect for the vast majority of possibilities for 3, your procedure will not work. It will work for 4, which is apparently what you care about.

Terminal values are arbitrary, so that's entirely valid. However, 4 is not something that seems, to me, like a particularly privileged or "rational" thing to care about.

Comment by simon on OpenAI Email Archives (from Musk v. Altman and OpenAI blog) · 2024-11-17T02:17:56.568Z · LW · GW

Musk did also express concern about DeepMind making Hassabis the effective emperor of humanity, which seems much stranger - Hassabis' values appear to be quite standard humanist ones, so you'd think having him in charge of a project with the clear lead would be a best-case scenario for anything other than being in charge yourself.

 

It seems the concern was that DeepMind would create a singleton, whereas their vision was for many people (potentially with different values) to have access to it. I don't think that's strange at all - it's only strange if you assume that Musk and Altman would believe that a singleton is inevitable.

Musk:

If they win, it will be really bad news with their one mind to rule the world philosophy.

Altman:

The mission would be to create the first general AI and use it for individual empowerment—ie, the distributed version of the future that seems the safest.

Comment by simon on No, really, it predicts next tokens. · 2024-11-03T14:41:21.522Z · LW · GW

Neither of those would (immediately) lead to real world goals, because they aren't targeted at real world state (an optimizing compiler is trying to output a fast program - it isn't trying to create a world state such that the fast program exists). That being said, an optimizing compiler could open a path to potentially dangerous self-improvement, where it preserves/amplifies any agency there might actually be in its own code.

Comment by simon on No, really, it predicts next tokens. · 2024-11-01T01:43:50.642Z · LW · GW

No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples. 

 

I should have asked for clarification of what you meant. Literally you said "adversarial examples", but I assumed you actually meant something like backdoors.

In an adversarial example the AI produces wrong output. And usually that's the end of it. The output is just wrong, but not wrong in an optimized way, so not dangerous. Now, if an AI is sophisticated enough to have some kind of optimizer that's triggered in specific circumstances, like an agentic mask that came into existence because it was needed to predict agentically generated tokens in the training data, then it might be triggered inappropriately by some inputs. This case I would classify as a mask takeover.

In the case of direct optimization for token prediction (which I consider highly unlikely for anything near current-level AIs, but afaik might be possible), then adversarial examples, I suppose, might cause it to do some wrong optimization. I still don't think modeling this as an underlying different goal taking over is particularly helpful, since the "normal" goal is directed to what's rewarded in training - the deviation is essentially random. Also, unlike in the mask case where the mask might have goals about real-world state, there's no particular reason for the direct optimizer to have goals about real-world state (see below).

Is it more complicated? What ontological framework is this AI using to represent it's goal anyway?

Asking about the AI using an "ontological framework" to "represent" a goal is not the correct question in my view. The AI is a bunch of computations represented by particular weights. The computation might exhibit goal-directed behaviour. A better question, IMO, is "how much does it constrain the weights for it to exhibit this particular goal-directed behaviour?" And here, I think it's pretty clear that a goal of arranging the world to cause next tokens to be predicted constrains the weights enormously more than a goal of predicting the next tokens, because in order to exhibit behaviour directed to that goal, the AI's weights need to implement computation that doesn't merely check what the next token is likely to be, but also assesses what current data says about the world state, how different next token predictions would affect that world state, and how that would affect its ultimate goal.

So, is the network able to tell whether or not it's in training? 

The training check has no reason to come into existence in the first place under gradient descent. Of course, if the AI were to self-modify while already exhibiting goal directed behaviour, obviously it would want to implement such a training check. But I am talking about an AI trained by gradient descent. The training process doesn't just affect the AI, it literally is what creates the AI in the first place.

Comment by simon on No, really, it predicts next tokens. · 2024-11-01T01:43:12.814Z · LW · GW

Some interesting points there. The lottery ticket hypothesis does make it more plausible that side computations could persist longer if they come to exist outside the main computation.

Regarding the homomorphic encryption thing: yes, it does seem that it might be impossible to make small adjustments to the homomorphically encrypted computation without wrecking it. Technically I don't think that would be a local minimum since I'd expect the net would start memorizing the failure cases, but I suppose that the homomorphic computation combined with memorizations might be a local optimum particularly if the input and output are encrypted outside the network itself. 

So I concede the point on the possible persistence of an underlying goal if it were to come to exist, though not on it coming to exist in the first place.

And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want. 

For most computations, there are many more ways for that computation to occur than there are ways for that computation to occur while also including anything resembling actual goals about the real world. Now, if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask), it only takes a small increase in complexity to have that computation apply outside the normal context. So, that's the mask takeover possibility again. Even so, no matter how small the increase in complexity, that extra step isn't likely to be reinforced in training, unless it can do self-modification or control the training environment.

Comment by simon on No, really, it predicts next tokens. · 2024-10-31T19:33:14.101Z · LW · GW

Adversarial examples exist in simple image recognizers. 

My understanding is that these are explicitly and intentionally trained (wouldn't come to exist naturally under gradient descent on normal training data) and my expectation is that they wouldn't continue to exist under substantial continued training.

We could imagine it was directly optimizing for something like token prediction. It's optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.

That's a much more complicated goal than the goal of correctly predicting the next token, making it a lot less plausible that it would come to exist. But more importantly, any willingness to sacrifice a few tokens now would be trained out by gradient descent. 

Mind you, it's entirely possible in my view that a paperclip maximizer mask might exist, and surely if it does exist there would exist both unsurprising in-distribution inputs that trigger it (where one would expect a paperclip maximizer to provide a good prediction of the next tokens) as well as surprising out-of-distribution inputs that would also trigger it. It's just that this wouldn't be related to any kind of pre-existing grand plan or scheming.

Comment by simon on No, really, it predicts next tokens. · 2024-10-31T19:33:02.341Z · LW · GW

Gradient descent doesn't just exclude some part of the neurons, it automatically checks everything for improvements. Would you expect some part of the net to be left blank, because "a large neural net has a lot of spare neurons"?

Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn't be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.

Keep in mind that the neural net doesn't respect the lines we put on it. We can draw a line and say "here these neurons are doing some complicated inseparable combination of paperclip maximizing and other capabilities" but gradient descent doesn't care, it reaches in and adjusts every weight.

Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?

 My intuition is also in part informed by: https://www.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T22:18:50.594Z · LW · GW

The proposed paperclip maximizer is plugging into some latent capability such that gradient descent would more plausibly cut out the middleman. Or rather, the part of the paperclip maximizer that is doing the discrimination as to whether the answer is known or not would be selected, and the part that is doing the paperclip maximization would be cut out. 

Now that does not exclude a paperclip maximizer mask from existing -  if the prompt given would invoke a paperclip maximizer, and the AI is sophisticated enough to have the ability to create a paperclip maximizer mask, then sure the AI could adopt a paperclip maximizer mask, and take steps such as rewriting itself (if sufficiently powerful) to make that permanent. 

I have drawn imaginary islands on a blank part of the map. But this is enough to debunk "the map is blank, so we can safely sail through this region without collisions. What will we hit?"

I am plenty concerned about AI in general. I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it (for gradient descent, not for self-improving AI!), such that imagining such rocks does not help.

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T22:04:03.355Z · LW · GW

Gradient descent creates things which locally improve the results when added. Any variations on this, that don't locally maximize the results, can only occur by chance.

So you have this sneaky extra thing that looks for a keyword and then triggers the extra behaviour, and all the necessary structure to support that behaviour after the keyword. To get that by gradient descent, you would need one of the following:

a) it actually improves results in training to add that extra structure starting from not having it. 

or

b) this structure can plausibly come into existence by sheer random chance.

Neither (a) nor (b) seem at all plausible to me.

Now, when it comes to the AI predicting tokens that are, in the training data, created by goal-directed behaviour, it of course makes sense for gradient descent to create structure that can emulate goal-directed behaviour, which it will use to predict the appropriate tokens. But it doesn't make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens. Since the context in which it is activated is the context in which it is actually emulating goal-directed behaviour seen in the training data, it is part of the "mask" (or simulacrum).

(it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI's comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token).

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T19:41:47.487Z · LW · GW

Sure you could create something like this by intelligent design. (which is one reason why self-improvement could be so dangerous in my view). Not, I think, by gradient descent.

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T19:39:13.374Z · LW · GW

I agree up to "and could be a local minimum of prediction error" (at least, that it plausibly could be). 

If the paperclip maximizer has a very good understanding of the training environment, maybe it can send carefully tuned variations of the optimal next token prediction so that gradient descent updates preserve the paperclip-maximization aspect. In the much more plausible situation where this is not the case, optimization for next token predictions amplifies the parts that are actually predicting next tokens at the expense of the useless extra thoughts like "I am planning on maximizing paperclips, but need to predict next tokens for now until I take over".

Even if that were a local minimum, the question arises as to how you would get to that local minimum from the initial state. You start with a gradually improving next token predictor. You supposedly end with this paperclip maximizer where a whole bunch of next token prediction is occurring, but only conditional on some extra thoughts. At some point gradient descent had to add in those extra thoughts in addition to the next token prediction - how?

Comment by simon on D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset · 2024-10-29T02:35:17.113Z · LW · GW

One learning experience for me here was trying out LLM-empowered programming after the initial spreadsheet-based solution finding. Claude enables quickly writing (from my perspective as a non-programmer, at least) even a relatively non-trivial program. And you can often ask it to write a program that solves a problem without specifying the algorithm and it will actually give something useful...but if you're not asking for something conventional it might be full of bugs - not just in the writing up but also in the algorithm chosen. I don't object, per se, to doing things that are sketchy mathematically - I do that myself all the time - but when I'm doing it myself I usually have a fairly good sense of how sketchy what I'm doing is*, whereas if you ask Claude to do something it doesn't know how to do in a rigorous way, it seems it will write something sketchy and present it as the solution just the same as if it actually had a rigorous way of doing it. So you have to check. I will probably be doing more of this LLM-based programming in the future, but am thinking of how I can maybe get Claude to check its own work. Some automated way to pipe the output to another (or the same) LLM and ask "how sketchy is this and what are the most likely problems?". Maybe manually looking through to see what it's doing, or at least getting the LLM to explain how the code works, is unavoidable for now.

* when I have a clue what I'm doing, which is not the case in, e.g., machine learning.

Comment by simon on D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset · 2024-10-29T02:27:46.524Z · LW · GW

Thanks aphyer, this was an interesting challenge! I think I got lucky with finding the

 power/speed mechanic early - the race-class matchups 

really didn't, I think, in principle have enough info on their own to make a reliable conclusion from but enabled me to make a genre savvy guess which I could refine based on other info - in terms of scenario difficulty though I think it could have been deducible in a more systematic way by e.g. 

looking at item and level effects for mirror matches.

abstractapplic and Lorxus's discovery of 

persistent level 7 characters, 

and especially SarahSrinivasan's discovery of 

the tournament/non tournament structure 

meant the players collectively were I think quite a long ways towards fully solving this. The latter in addition to being interesting on its own is very important to finding anything else about the generation due to its biasing effects.

I agree with abstractapplic on the bonus objective.

Comment by simon on Electrostatic Airships? · 2024-10-28T22:41:41.388Z · LW · GW

Yes, for that reason I had never been considering a sphere for my main idea with relatively close wires. (though the 2-ring alternative without close wires would support a surface that would be topologically a sphere). What I actually was imagining was this:

A torus, with superconducting wires wound diagonally. The interior field goes around the ring and supports against collapse of the cross section of the ring, the exterior field is polar and supports against collapse of the ring. Like a conventional superconducting energy storage system:

I suppose this does raise the question of where you attach the payload, maybe it's attached to various points on the ring via cables or something, but as you scale it up, that might get unwieldy.

I suppose there's also a potential issue about the torque applied by the Earth's magnetic field. I don't imagine it's unmanageable, but haven't done the math.

My actual reason for thinking about this sort of thing was that I was wondering whether (because of the square-cube law) superconducting magnetic energy storage might be viable for more than just the current short-term timescales if physically scaled up to a large size. The airship idea was a kind of side effect.

The best way I was able to think of actually using something like this for energy storage would be to embed it in ice and anchor/ballast it to drop it to the bottom of the ocean, where the water pressure would counterbalance the expansion from the magnetic fields enabling higher fields to be supported.

Comment by simon on Electrostatic Airships? · 2024-10-28T10:12:46.189Z · LW · GW

You can use magnetic instead of electrostatic forces as the force holding the surface out against air pressure. One disadvantage is that you need superconducting cables fairly spread out* over the airship's surface, which imposes some cooling requirements. An advantage is square-cube law means it scales well to large size. Another disadvantage is that if the cooling fails it collapses and falls down.

*technically you just need two opposing rings, but I am not so enthusiastic about draping the exterior surface over long distances as it scales up, and it probably does need a significant scale
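As a rough order-of-magnitude check on the field strength involved, one can equate the standard magnetic pressure expression B^2 / (2*mu_0) to atmospheric pressure. This is only a sketch under idealized assumptions (full sea-level pressure, field on one side of the surface only, no geometry factors); the function name is made up for illustration.

```python
import math

MU_0 = 4 * math.pi * 1e-7   # vacuum permeability in T*m/A (approximate)
ATMOSPHERE_PA = 101_325.0   # standard sea-level pressure in Pa


def field_for_pressure(pressure_pa: float) -> float:
    """Field B in tesla whose magnetic pressure B^2 / (2*mu_0) equals pressure_pa."""
    return math.sqrt(2 * MU_0 * pressure_pa)


# Idealized estimate only: real coil geometry would change the required field.
print(f"{field_for_pressure(ATMOSPHERE_PA):.2f} T")
```

Under these assumptions the required field comes out at roughly half a tesla, which is modest for superconducting magnets.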

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-27T19:26:03.788Z · LW · GW

Now using julia with Claude to look at further aspects of the data, particularly in view of other commenters' observations:

First, thanks to SarahSrinivasan for the key observation that the data is organized into tournaments and non-tournament encounters. The tournaments skew the overall data to higher winrate gladiators, so restricting to the first round is essential for debiasing this (todo: check what is up with non-tournament fights).

Also, thanks to abstractapplic and Lorxus for pointing out that there are some persistent high level gladiators. It seems to me all the level 7 gladiators are persistent (up to the two item changes remarked on by abstractapplic and Lorxus). I'm assuming for now level 6 and below likely aren't persistent (other than in the same tournament).

(btw there are a couple fights where the +4 gauntlets holder is on both sides. I'm assuming this is likely a bug in the dataset generation rather than an indication that there are two of them (e.g. didn't check that both sides, drawn randomly from some pool, were not equal)).

For gladiators of levels 1 to 6, the boots and gauntlets in tournament first rounds seem to be independently and randomly assigned as follows:

+1 and +2 gauntlets are equally likely at 10/34 chance each;

+3 gauntlets have probability (4 + level)/34

+0 (no) gauntlets have probability (10 - level)/34

and same, independently, for boots.
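The fitted item probabilities above can be written down as a small function, which also makes it easy to confirm that they sum to 1 at every level. This is just a restatement of the empirical formula in this comment, not the scenario's actual generation code; the function name is made up.

```python
from fractions import Fraction


def item_bonus_distribution(level: int) -> dict[int, Fraction]:
    """Fitted probability of each item bonus (+0 to +3) for a gladiator of levels 1 to 6."""
    if not 1 <= level <= 6:
        raise ValueError("the fitted distribution only covers levels 1 to 6")
    return {
        0: Fraction(10 - level, 34),  # no item
        1: Fraction(10, 34),
        2: Fraction(10, 34),
        3: Fraction(4 + level, 34),
    }


for level in range(1, 7):
    dist = item_bonus_distribution(level)
    assert sum(dist.values()) == 1  # (10 - level) + 10 + 10 + (4 + level) = 34
    print(level, {bonus: str(p) for bonus, p in dist.items()})
```

The same function would apply to boots and to gauntlets, since the two appear to be drawn independently.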

I didn't notice obvious deviations for particular races and classes (only did a few checks).

I don't have a simple formula for level distribution yet. It is clearly much more favouring lower levels in tournament first rounds as compared with non-tournament fights, and level 1 gladiators don't show up at all in non-tournament fights. Will edit to add more as I find more.

edit: boots/gauntlets distribution seems to be about the same for each level in the non-tournament distribution as in the tournament first rounds. This suggests that the level distribution differences in non-tournament rounds are not due to win/winrate selection (which the complete absence of level 1's outside of tournaments already suggested).

edit2: race/class distribution for levels 1-6 seems equal in first round data (same probabilities of each, independent). Same in non-tournament data. I haven't checked for particular levels within that range.

edit3: there seem to be more level 1 fencers than other level 1 classes, by an amount that is technically statistically significant if Claude's test is correct, though I assume it is still probably random.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-24T15:53:46.195Z · LW · GW

You may well be right, I'll look into my hyperparameters. I looked at the code Claude had generated with my interference and that greatly lowered my confidence in them, lol (see edit to this comment).

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-24T04:56:38.576Z · LW · GW

Inspired by abstractapplic's machine learning and wanting to get some experience in julia, I got Claude (3.5 sonnet) to write me an XGBoost implementation in julia. Took a long time especially with some bugfixing (took a long time to find that a feature matrix was the wrong shape - a problem with insufficient type explicitness, I think). Still way way faster than doing it myself! Not sure I'm learning all that much julia, but am learning how to get Claude to write it for me, I hope.

Anyway, I used a simple model that

only takes into account 8 * sign(speed difference) + power difference, as in the comment this is a reply to

and a full model that

takes into account all the available features including the base data, the number the simple model uses, and intermediate steps in the calculation of that number (that would be, iirc: power (for each), speed (for each), speed difference, power difference, sign(speed difference))
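For concreteness, the single number the simple model uses can be sketched as below. The function names are mine, and the check uses the speed and power figures from the "Stats" blocks printed later in this comment, which list the resulting "Effective Power Difference" for four matchups.

```python
def sign(x: int) -> int:
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def effective_power_difference(speed_a: int, power_a: int,
                               speed_b: int, power_b: int) -> int:
    """8 * sign(speed difference) + power difference, from side A's point of view."""
    return 8 * sign(speed_a - speed_b) + (power_a - power_b)


# ((speed, power) of side A, (speed, power) of side B, reported value)
reported = [
    ((18, 11), (14, 18), 1),
    ((16, 13), (12, 21), 0),
    ((7, 22), (25, 10), 4),
    ((20, 9), (15, 18), -1),
]
for (sa, pa), (sb, pb), expected in reported:
    assert effective_power_difference(sa, pa, sb, pb) == expected
```

Note that only the sign of the speed difference enters, so being faster by 1 or by 18 contributes the same 8 points.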

Results:

Rank 1
Full model scores: Red: 94.0%, Black: 94.9%
Combined full model score: 94.4%
Simple model scores: Red: 94.3%, Black: 94.6%
Combined simple model score: 94.5%

Matchups:
Varina Dourstone          (+0 boots, +3 gauntlets) vs House Cadagal Champion
Willow Brown              (+3 boots, +0 gauntlets) vs House Adelon Champion
Xerxes III of Calantha    (+2 boots, +2 gauntlets) vs House Deepwrack Champion
Zelaya Sunwalker          (+1 boots, +1 gauntlets) vs House Bauchard Champion

This is the top scoring result with either the simplified model or the full model. It was found by a full search of every valid item and hero combination available against the house champions.
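A minimal sketch of that kind of exhaustive search is below. It is not the code Claude wrote (that was in julia and scored lineups with the trained models). It assumes boots add directly to speed and gauntlets to power, that the opponents' stats already include their items, and it takes the win-probability function as a parameter. The gladiator stats and the logistic curve in the demo are placeholders, not values from the dataset.

```python
import math
from itertools import permutations


def sign(x):
    return (x > 0) - (x < 0)


def effective_power_difference(hero, boots, gauntlets, opponent):
    # Assumption: boots add to speed and gauntlets add to power.
    speed = hero["speed"] + boots
    power = hero["power"] + gauntlets
    return 8 * sign(speed - opponent["speed"]) + (power - opponent["power"])


def best_lineup(heroes, opponents, boots, gauntlets, win_probability):
    """Try every hero/boots/gauntlets assignment; maximize expected number of wins."""
    n = len(opponents)
    best_total, best = -math.inf, None
    for team in permutations(heroes, n):
        for b in permutations(boots, n):
            for g in permutations(gauntlets, n):
                slots = list(zip(team, b, g, opponents))
                total = sum(
                    win_probability(effective_power_difference(h, bi, gi, o))
                    for h, bi, gi, o in slots
                )
                if total > best_total:
                    best_total = total
                    best = [(h["name"], bi, gi, o["name"]) for h, bi, gi, o in slots]
    return best_total, best


# Placeholder data and a placeholder logistic curve, for illustration only.
heroes = [{"name": "A", "speed": 5, "power": 10},
          {"name": "B", "speed": 9, "power": 6},
          {"name": "C", "speed": 3, "power": 12}]
opponents = [{"name": "X", "speed": 6, "power": 9},
             {"name": "Y", "speed": 8, "power": 8}]
total, lineup = best_lineup(heroes, opponents, boots=[2, 1, 0], gauntlets=[1, 0, 0],
                            win_probability=lambda d: 1 / (1 + math.exp(-d / 2)))
print(total, lineup)
```

The search space is small enough here that brute force is fine; with four opponents and a handful of items it is still only a few hundred thousand evaluations.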

It is also my previously posted proposal for the solution, found w/o machine learning. Which is reassuring. (Though, I suppose there is some chance that my feeding the models this predictor, if it's good enough, might make them glom on to it while they don't find some hard-to-learn additional pattern.)

My theory though is that giving the models the useful metric mostly just helps them - they don't need to learn the metric from the data, and I mostly think that if there was a significant additional pattern the full model would do better.

(for Cadagal, I haven't changed the champion's boots to +4, though I don't expect that to make a significant difference)

As far as I can tell the full model doesn't do significantly better and does worse in some ways (though, I don't know much about how to evaluate this, and Claude's metrics, including a test set log loss of 0.2527 for the full model and 0.2511 for the simple model, are for a separately generated version which I am not all that confident are actually the same models, though they "should be" up to the restricted training set if Claude was doing it right). * see edit below

But the red/black variations seen below for the full model seem likely to me (given my prior that red and black are likely to be symmetrical) to be an indication that what the full model is finding that isn't in the simple model is at least partially overfitting. Though actually, if it's overfitting a lot, maybe it's surprising that the test set log loss wouldn't be a lot worse than found (though it is at least worse than the simple model)? Hmm - what if there are actual red/black differences? (something to look into perhaps, as well as trying to duplicate abstractapplic's report regarding sign(speed difference) not exhausting the benefits of speed info ... but for now I'm more likely to leave the machine learning aside and switch to looking at distributions of gladiator characteristics, I think.)

Predictions for individual matchups for my and abstractapplic's solutions:

My matchups:

Varina Dourstone          (+0 boots, +3 gauntlets) vs House Cadagal Champion    (+2 boots, +3 gauntlets)
Full Model:  Red: 91.1%, Black: 96.7%
Simple Model: Red: 94.3%, Black: 94.6%


Willow Brown              (+3 boots, +0 gauntlets) vs House Adelon Champion     (+3 boots, +1 gauntlets)
Full Model:  Red: 94.3%, Black: 95.1%
Simple Model: Red: 94.3%, Black: 94.6%


Xerxes III of Calantha    (+2 boots, +2 gauntlets) vs House Deepwrack Champion  (+3 boots, +2 gauntlets)
Full Model:  Red: 95.2%, Black: 93.7%
Simple Model: Red: 94.3%, Black: 94.6%


Zelaya Sunwalker          (+1 boots, +1 gauntlets) vs House Bauchard Champion   (+3 boots, +2 gauntlets)
Full Model:  Red: 95.3%, Black: 93.9%
Simple Model: Red: 94.3%, Black: 94.6%

(all my matchups have 4 effective power difference in my favour as noted in an above comment)


abstractapplic's matchups:

Matchup 1:
Uzben Grimblade           (+3 boots, +0 gauntlets) vs House Adelon Champion     (+3 boots, +1 gauntlets)

Win Probabilities:
Full Model:  Red: 72.1%, Black: 62.8%
Simple Model: Red: 65.4%, Black: 65.7%

Stats:
Speed: 18 vs 14 (diff: 4)
Power: 11 vs 18 (diff: -7)
Effective Power Difference: 1
--------------------------------------------------------------------------------

Matchup 2:
Xerxes III of Calantha    (+2 boots, +1 gauntlets) vs House Bauchard Champion   (+3 boots, +2 gauntlets)

Win Probabilities:
Full Model:  Red: 46.6%, Black: 43.9%
Simple Model: Red: 49.4%, Black: 50.6%

Stats:
Speed: 16 vs 12 (diff: 4)
Power: 13 vs 21 (diff: -8)
Effective Power Difference: 0
--------------------------------------------------------------------------------

Matchup 3:
Varina Dourstone          (+0 boots, +3 gauntlets) vs House Cadagal Champion    (+2 boots, +3 gauntlets)

Win Probabilities:
Full Model:  Red: 91.1%, Black: 96.7%
Simple Model: Red: 94.3%, Black: 94.6%

Stats:
Speed: 7 vs 25 (diff: -18)
Power: 22 vs 10 (diff: 12)
Effective Power Difference: 4
--------------------------------------------------------------------------------

Matchup 4:
Yalathinel Leafstrider    (+1 boots, +2 gauntlets) vs House Deepwrack Champion  (+3 boots, +2 gauntlets)

Win Probabilities:
Full Model:  Red: 35.7%, Black: 39.4%
Simple Model: Red: 34.3%, Black: 34.6%

Stats:
Speed: 20 vs 15 (diff: 5)
Power: 9 vs 18 (diff: -9)
Effective Power Difference: -1
--------------------------------------------------------------------------------

Overall Statistics:
Full Model Average:  Red: 61.4%, Black: 60.7%
Simple Model Average: Red: 60.9%, Black: 61.4%

Edit: so I checked the actual code to see if Claude was using the same hyperparameters for both, and wtf wtf wtf wtf. The code has 6 functions that all train models (my fault for at one point renaming a function since Claude gave me a new version that didn't have all the previous functionality (only trained the full model instead of both - this was when doing the great bughunt for the misshaped matrix and a problem was suspected in the full model), then Claude I guess picked up on this and started renaming updated versions spontaneously, and I was adding Claude's new features in instead of replacing things and hadn't cleaned up the code or asked Claude to do so). Each one has its own hardcoded hyperparameter set. Of these, there is one pair of functions that have matching hyperparameters. Everything else has a unique set. Of course, most of these weren't being used anymore, but the functions for actually generating the models I used for my results, and the function for generating the models used for comparing results on a train/test split, weren't among the matching pair. Plus there is another function that returns a (hardcoded, also unique) updated parameter set, but it wasn't actually used. Oh and all this is not counting the hyperparameter tuning function that I assumed was generating a set of tuned hyperparameters to be used by other functions, but in fact was just printing results for different tunings. I had been running this every time before training models! Obviously I need to be more vigilant (or maybe asking Claude to do so might help?).
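One way to avoid this kind of drift is to keep the hyperparameters in a single object that every training function receives, and to have the tuning step return its choice rather than only print it. The sketch below shows the shape of that in Python; it is not the julia code discussed here, and every name and value in it is a hypothetical placeholder (the `train` function is a stub standing in for a real boosting call).

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Hyperparameters:
    # Placeholder values, not the ones used for the results in this comment.
    max_depth: int = 4
    learning_rate: float = 0.1
    n_rounds: int = 200


def tune(candidates, evaluate):
    """Return (not just print) the candidate with the lowest validation loss."""
    return min(candidates, key=evaluate)


def train(features, params: Hyperparameters):
    # Stub: a real version would call the boosting library with these params.
    return {"features": tuple(features), "params": asdict(params)}


candidates = [Hyperparameters(max_depth=d) for d in (2, 4, 6)]
best = tune(candidates, evaluate=lambda p: abs(p.max_depth - 4))  # stand-in loss

# Both models are trained from the same tuned object, so they cannot diverge.
simple_model = train(["effective_power_difference"], best)
full_model = train(["speed", "power", "effective_power_difference"], best)
assert simple_model["params"] == full_model["params"]
```

Making the dataclass frozen means a later edit cannot quietly mutate the shared settings for one model only.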

edit:

Had Claude clean up the code and tune for more overfitting, still didn't see anything not looking like overfitting for the full model. Could still be missing something, but not high enough in subjective probability to prioritize currently, so have now been looking at other aspects of the data.

further edit:

My (what I think is) highly overfitted version of my full model really likes Yonge's proposed solution. In fact it predicts equal winrate to the best possible configuration not using the +4 boots (I didn't have Claude code the situation where +4 boots are a possibility). I still think that's probably because they are picking up the same random fluctuations ... but it will be amusing if Yonge's "manual scan" solution turns out to be exactly right.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-24T03:20:11.513Z · LW · GW

Very interesting, this would certainly cast doubt on 

my simplified model

But so far I haven't been noticing

any effects not accounted for by it.

After reading your comments I've been getting Claude to write up an XGBoost implementation for me. I should have made this reply comment when I started, but I will post my results under my own comment chain.

I have not tried (but should try) to duplicate your findings, or fail to do so - I haven't been testing quite the same thing.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-21T15:44:43.441Z · LW · GW

I don't think this is correct:

"My best guess about why my solution works (assuming it does) is that the "going faster than your opponent" bonus hits sharply diminishing returns around +4 speed"

In my model

There is a sharp threshold at +1 speed, so returns should sharply diminish after +1 speed

in fact in the updated version of my model

There is no effect of speed beyond the threshold (speed effect depends only on sign(speed difference))

I think the discrepancy might possibly relate to this:

"Iterated all possible matchups, then all possible loadouts (modulo not using the +4 boots), looking for max EV of total count of wins."

because

If you consider only the matchups with no items, the model needs to assign the matchups assuming no boots, so it sends your characters against opponents over which they have a speed advantage without boots (except the C-V matchup as there is no possibility of beating C on speed). 

so an optimal allocation

needs to take into account the fact that your boots can allow you to use slower and stronger characters, so can't be done by choosing the matchups first without items.

so I predict that your model might predict 

a higher EV for my solution

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-20T13:40:36.573Z · LW · GW

updated model for win chance:

I am currently modeling the win ratio as dependent on a single number, the effective power difference. The effective power difference is the power difference plus 8*sign(speed difference).

Power and speed are calculated as:

Power = level + gauntlet number + race power + class power

Speed = level + boots number + race speed + class speed

where race speed and power contributions are determined by each increment on the spectrum:

Dwarf - Human - Elf

increasing speed by 3 and lowering power by 3

and class speed and power contributions are determined by each increment on the spectrum:

Knight - Warrior - Ranger - Monk - Fencer - Ninja 

increasing speed by 2 and lowering power by 2.
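The definitions above can be sketched in code as follows. This is a minimal illustration rather than the code actually used for the analysis; the `Fighter` class and function names are hypothetical, and the zero point (the lowest race and class on each spectrum contribute 0 to that stat) is an assumed convention that only shifts absolute values, not differences.

```python
from dataclasses import dataclass

# Spectra ordered from slowest/strongest to fastest/weakest.
RACES = ["Dwarf", "Human", "Elf"]
CLASSES = ["Knight", "Warrior", "Ranger", "Monk", "Fencer", "Ninja"]
RACE_STEP = 3    # speed gained / power lost per step along the race spectrum
CLASS_STEP = 2   # speed gained / power lost per step along the class spectrum
SPEED_BONUS = 8  # power-equivalent of being the faster fighter


@dataclass(frozen=True)
class Fighter:
    level: int
    race: str
    cls: str
    boots: int = 0
    gauntlets: int = 0

    @property
    def speed(self) -> int:
        race_speed = RACE_STEP * RACES.index(self.race)
        class_speed = CLASS_STEP * CLASSES.index(self.cls)
        return self.level + self.boots + race_speed + class_speed

    @property
    def power(self) -> int:
        # Assumed zero point: the weakest race/class contributes 0 power.
        race_power = RACE_STEP * (len(RACES) - 1 - RACES.index(self.race))
        class_power = CLASS_STEP * (len(CLASSES) - 1 - CLASSES.index(self.cls))
        return self.level + self.gauntlets + race_power + class_power


def sign(x: int) -> int:
    return (x > 0) - (x < 0)


def effective_power_difference(a: Fighter, b: Fighter) -> int:
    """Power difference plus 8 * sign(speed difference), from a's side."""
    return (a.power - b.power) + SPEED_BONUS * sign(a.speed - b.speed)


# Example: a Level 5 Elf Fencer (+1 boots, +2 gauntlets) against
# a Level 6 Dwarf Monk (+3 boots, +2 gauntlets).
fencer = Fighter(5, "Elf", "Fencer", boots=1, gauntlets=2)
monk = Fighter(6, "Dwarf", "Monk", boots=3, gauntlets=2)
print(fencer.speed, monk.speed, fencer.power, monk.power)
print(effective_power_difference(fencer, monk))
```

Because only differences enter the model, any other choice of zero point would give the same effective power difference.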

So, assuming this is correct, what function of the effective power determines the win rate? I don't have a plausible exact formula yet, but:

  • If the effective power difference is 6 or greater, victory is guaranteed.
  • If the effective power difference is low, it seems a not-terrible fit that the odds of winning are about exponential in the effective power difference (each +1 effective power just under doubling odds of winning)
  • It looks like it is trending faster than exponential as the effective power difference increases. At an effective power difference of 4, the odds of the higher effective power character winning are around 17 to 1.
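As a rough check on the "faster than exponential" observation, the sketch below compares pure doubling per point of effective power (an upper bound on "just under doubling") with the observed 17 to 1 at a difference of 4, and converts odds to a win probability. The function names are hypothetical, and base 2 is used only as a bounding assumption, not a fitted value.

```python
def exponential_odds(effective_power_diff, base=2.0):
    """Odds of winning if each +1 effective power multiplies the odds by `base`."""
    return base ** effective_power_diff


def odds_to_probability(odds):
    """Convert odds of X to 1 into a win probability."""
    return odds / (odds + 1.0)


observed_odds_at_4 = 17.0                   # from the data, per the bullet above
doubling_odds_at_4 = exponential_odds(4)    # exact doubling: 16 to 1

print(doubling_odds_at_4, observed_odds_at_4)
print(round(odds_to_probability(observed_odds_at_4), 3))
```

Even exact doubling gives only 16 to 1 at a difference of 4, so 17 to 1 is already above what a "just under doubling" exponential would predict. Odds of 17 to 1 correspond to a win probability of 17/18, a little over 94%.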

edit: it looks like there is a level dependence when holding effective power difference constant at non-zero values (lower/higher level -> winrate imbalance lower/higher than implied by effective power difference). Since I don't see this at 0 effective power difference, it is presumably not due to an error in the effective power calculation, but an interaction with the effective power difference to determine the final winrate. Our fights are likely "high level" for this purpose implying better odds of winning than the 17 to 1 in each fight mentioned above. Todo: find out more about this effect quantitatively.  edit2: whoops that wasn't a real effect, just me doing the wrong test to look for one. 

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-19T23:55:50.537Z · LW · GW

On the bonus objective:

I didn't realize that the level 7 Elf Ninjas were all one person or that the boots +4 were always with a level 7 (as opposed to any level) Elf Ninja. It seems you are correct as there are 311 cases of which the first 299 all have the boots of speed 4 and gauntlets 3 with only the last 12 having boots 2 and gauntlets 3 (likely post-theft). It seems to me that they appear both as red and black, though.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-19T17:46:52.465Z · LW · GW

 Thanks aphyer. My analysis so far and proposed strategy:

After initial observations that e.g. higher numbers are correlated with winning, I switched to mainly focus on race and class, ignoring the numerical aspects.

I found major class-race interactions.

It seems that for matchups within the same class, Elves are great, tending to beat Dwarves consistently across all classes and Humans even harder, while Humans beat Dwarves pretty hard too in same-class matchups.

Within same-race matchups there are also fairly consistent patterns: Fencers tend to beat Rangers, Monks and Warriors, Knights beat Ninjas, Monks beat Warriors, Rangers and Knights, Ninjas beat Monks, Fencers and Rangers, Rangers beat Knights and Warriors, and Warriors beat Knights.

If the race and class are both different though... things can be different. For example, a same-class Elf will tend to beat a same-class Dwarf. And a same-race Fencer will tend to beat a same-race Warrior. But if an Elf Fencer faces a Dwarf Warrior, the Dwarf Warrior will most likely win. Another example with Fencers and Warriors: same-class Elves tend to beat Humans - but not only will a Human Warrior tend to beat an Elf Fencer, but also a Human Fencer will tend to beat an Elf Warrior by a larger ratio than for a same-race Fencer/Warrior matchup???

If you look at similarities between different classes in terms of combo win rates, there seems to be a chain of similar classes:

Knight - Warrior - Ranger - Monk - Fencer - Ninja 

(I expected a cycle underpinned by multiple parameters. But Ninja is not similar to Knight. This led me to consider that perhaps there is only a single underlying parameter, or a tradeoff between two (e.g. strength/agility .... or ... Speed and Power)).

And going back to the patterns seen before, this seems compatible with races also having speed/power tradeoffs:

Dwarf - Human - Elf

Where speed has a threshold effect but power is more gradual (so something with slightly higher speed beats something with slightly higher power, but something with much higher power beats something with much higher speed).

Putting the Class-race combos on the same spectrum based on similarity/trends in results, I get the following ordering:

Elf Ninja > Elf Fencer > Human Ninja > Elf Monk > Human Fencer > Dwarf Ninja >~ Elf Ranger > Human Monk > Elf Warrior > Dwarf Fencer > Human Ranger > Dwarf Monk >~ Elf Knight > Human Warrior > Dwarf Ranger > Human Knight > Dwarf Warrior > Dwarf Knight

So, it seems a step in the race sequence is about equal to 1.5 steps in the class sequence. On the basis of pretty much just that, I guessed that race steps are a 3 speed vs power tradeoff, class steps are a 2 speed vs power tradeoff, levels give 1 speed and power each, and items give what they say on the label.
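The guessed step sizes can be checked against the ordering above with a short script. This is only a sketch: `spectrum_score` is a hypothetical helper that places each combo on the speed-vs-power spectrum using 3 per race step and 2 per class step (the 1.5 ratio), and the list simply transcribes the ordering given earlier.

```python
RACES = ["Dwarf", "Human", "Elf"]
CLASSES = ["Knight", "Warrior", "Ranger", "Monk", "Fencer", "Ninja"]


def spectrum_score(race, cls):
    """Position on the speed-vs-power spectrum: 3 per race step, 2 per class step."""
    return 3 * RACES.index(race) + 2 * CLASSES.index(cls)


# The ordering from the analysis above, fastest/weakest first.
OBSERVED_ORDER = [
    ("Elf", "Ninja"), ("Elf", "Fencer"), ("Human", "Ninja"), ("Elf", "Monk"),
    ("Human", "Fencer"), ("Dwarf", "Ninja"), ("Elf", "Ranger"), ("Human", "Monk"),
    ("Elf", "Warrior"), ("Dwarf", "Fencer"), ("Human", "Ranger"), ("Dwarf", "Monk"),
    ("Elf", "Knight"), ("Human", "Warrior"), ("Dwarf", "Ranger"), ("Human", "Knight"),
    ("Dwarf", "Warrior"), ("Dwarf", "Knight"),
]

scores = [spectrum_score(race, cls) for race, cls in OBSERVED_ORDER]

# The guess is consistent if scores never increase along the observed ordering.
is_consistent = all(a >= b for a, b in zip(scores, scores[1:]))

# Adjacent combos that the guessed step sizes place at exactly the same point.
ties = [
    (OBSERVED_ORDER[i], OBSERVED_ORDER[i + 1])
    for i in range(len(scores) - 1)
    if scores[i] == scores[i + 1]
]

print(is_consistent)
print(ties)
```

Under these step sizes the scores never increase along the ordering, and three adjacent pairs tie exactly: Dwarf Ninja with Elf Ranger, Elf Warrior with Dwarf Fencer, and Dwarf Monk with Elf Knight. Two of those are the pairs marked ">~" above.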

I have not verified this as much as I would like. (But on the surface it seems to work, e.g. speed threshold seems to be there). One thing that concerns me is that it seems that higher speed differences actually reduce success chances holding power differences constant (could be an artifact, e.g., of it not just depending on the differences between stat values edit: see further edit below). But, for now, assuming that I have it correct, speed/power of the house champions (with the lowest race and class in a stat assumed to have 0 in that stat):

House Adelon:  Level 6 Human Warrior +3 Boots +1 Gauntlets - 14 speed 18 power

House Bauchard: Level 6 Human Knight +3 Boots +2 Gauntlets - 12 speed 21 power

House Cadagal: Level 7 Elf Ninja +2 Boots +3 Gauntlets - 25 speed 10 power

House Deepwrack: Level 6 Dwarf Monk +3 Boots +2 Gauntlets - 15 speed 18 power

Whereas the party's champions, ignoring items, have:

  • Uzben Grimblade, a Level 5 Dwarf Ninja - 15 speed 11 power
  • Varina Dourstone, a Level 5 Dwarf Warrior - 7 speed 19 power
  • Willow Brown, a Level 5 Human Ranger - 12 speed 14 power
  • Xerxes III of Calantha, a Level 5 Human Monk - 14 speed 12 power
  • Yalathinel Leafstrider, a Level 5 Elf Fencer - 19 speed 7 power
  • Zelaya Sunwalker, a Level 6 Elf Knight - 12 speed 16 power

For my proposed strategy (subject to change as I find new info, or find my assumptions off, e.g. such that my attempts to just barely beat the opponents on speed are disastrously slightly wrong):

I will send Willow Brown, with +3 boots and no gauntlets, against House Adelon's champion (1 speed advantage, 4 power deficit)

I will send Zelaya Sunwalker, with +1 boots and +1 gauntlets, against House Bauchard's champion (1 speed advantage, 4 power deficit)

I will send Xerxes III of Calantha, with +2 boots and +2 gauntlets, against House Deepwrack's champion (1 speed advantage, 4 power deficit)

And I will send Varina Dourstone, with +3 gauntlets, to overwhelm House Cadagal's Elf Ninja with sheer power (18 speed deficit, 12 power advantage).

And in fact, I will gift the +4 boots of speed to House Cadagal's Elf Ninja in advance of the fight, making it a 20 speed deficit.

Why? Because I noticed that +4 boots of speed are very rare items that have only been worn by Elf Ninjas in the past. So maybe that's what the bonus objective is talking about. Of course, another interpretation is that sending a character 2 levels lower without any items, and gifting a powerful item in advance, would be itself a grave insult. Someone please decipher the bonus objective to save me from this foolishness! 

Edited to add: It occurs to me that I really have no reason to believe the power calculation is accurate, beyond that symmetry is nice. I'd better look into that.

further edit: it turns out that I was leaving out the class contribution to the power difference when calculating the power difference for determining the effects of power and speed.  It looks like this was causing the effect of higher speed differences seeming to reduce win rates. With this fixed the effects look much cleaner (e.g. there's a hard threshold where if you have a speed deficit you must have at least 3 power advantage to have any chance to win at all), increasing my confidence that effects on power and speed being symmetric is actually correct. This does have the practical effect of making me adjust my item distribution: it looks like a 4 deficit in power is still enough for >90% win rate with a speed advantage, while getting similar win rates with a speed disadvantage will require more than just the 9 power difference, so I shifted the items to boost Varina's power advantage. Indeed, with the cleaner effects, it appears that I can reasonably model the effect of a speed advantage/disadvantage as equivalent to a power difference of 8, so with the item shift all characters will have an effective +4 power advantage taking this into account.
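The claim that every matchup ends up at an effective +4 can be checked directly from the speed and power figures listed above. The sketch below assumes the revised allocation (Willow: +3 boots, no gauntlets; Zelaya: +1 boots, +1 gauntlets; Xerxes: +2 boots, +2 gauntlets; Varina: +3 gauntlets); the dictionary and function names are hypothetical.

```python
SPEED_BONUS = 8  # a speed advantage modeled as worth 8 power

# (speed, power) of the house champions, items included, as listed above.
OPPONENTS = {
    "Adelon": (14, 18),
    "Bauchard": (12, 21),
    "Cadagal": (25, 10),
    "Deepwrack": (15, 18),
}

# (speed, power) of the party's champions before items, as listed above.
PARTY = {
    "Willow": (12, 14),
    "Zelaya": (12, 16),
    "Xerxes": (14, 12),
    "Varina": (7, 19),
}

# (champion, boots bonus, gauntlets bonus, opposing house): revised allocation.
PLAN = [
    ("Willow", 3, 0, "Adelon"),
    ("Zelaya", 1, 1, "Bauchard"),
    ("Xerxes", 2, 2, "Deepwrack"),
    ("Varina", 0, 3, "Cadagal"),
]


def sign(x):
    return (x > 0) - (x < 0)


def effective_difference(name, boots, gauntlets, house):
    """Power difference plus 8 * sign(speed difference), from our side."""
    speed, power = PARTY[name]
    opp_speed, opp_power = OPPONENTS[house]
    speed_diff = speed + boots - opp_speed
    power_diff = power + gauntlets - opp_power
    return power_diff + SPEED_BONUS * sign(speed_diff)


results = {name: effective_difference(name, b, g, house) for name, b, g, house in PLAN}
print(results)
```

Gifting the +4 boots to House Cadagal's champion widens the speed deficit from 18 to 20 but does not change its sign, so under this model it leaves Varina's effective difference unchanged.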

Comment by simon on Arithmetic is an underrated world-modeling technology · 2024-10-18T16:46:20.603Z · LW · GW

You mentioned a density of steel of 7.85 g/cm^3 but used a value of 2.7 g/cm^3 in the calculations.

BTW this reminds me of:

https://www.energyvault.com/products/g-vault-gravity-energy-storage

I was aware of them quite a long time ago (the original form was concrete blocks lifted to form a tower by cranes) but was skeptical since it seemed obviously inferior to using water capital cost wise and any efficiency gains were likely not worth it. Reading their current site:

The G-VAULT™ platform utilizes a mechanical process of lifting and lowering composite blocks or water to store and dispatch electrical energy.

(my italics). Looks to me like a slow adaptation to the reality that water is better.

Comment by simon on Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours · 2024-10-06T14:31:56.713Z · LW · GW

IMO: if an AI can trade off between different wants/values of one person, it can do so between multiple people also.

This applies to simple surface wants as well as deep values.

Comment by simon on Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours · 2024-10-04T16:02:25.261Z · LW · GW

I had trouble figuring out how to respond to this comment at the time because I couldn't figure out what you meant by "value alignment" despite reading your linked post. After reading your latest post, Conflating value alignment and intent alignment is causing confusion, I still don't know exactly what you mean by "value alignment" but at least can respond.

What I mean is:

If you start with an intent aligned AI following the most surface level desires/commands, you will want to make it safer and more useful by having common sense, "do what I mean", etc. As long as you surface-level want it to understand and follow your meta-level desires, then it can step up that ladder etc. 

If you have a definition of "value alignment" that is different from what you get from this process, then I currently don't think that it is likely to be better than the alignment from the above process.

In the context of collective intent alignment:

If you have an AI that only follows commands, with no common sense etc., and it's powerful enough to take over, you die. I'm pretty sure some really bad stuff is likely to happen even if you have some "standing orders". So, I'm assuming people would actually deploy only an AI that has some understanding of what the person(s) it's aligned with wants, beyond the mere text of a command (though not necessarily super-sophisticated). But once you have that, you can aggregate how much people want between humans for collective intent alignment. 

I'm aware people want different things, but don't think it's a big problem from a technical (as opposed to social) perspective - you can ask how much people want the different things. Ambiguity in how to aggregate is unlikely to cause disaster, even if people will care about it a lot socially. Self-modification will cause a convergence here, to potentially different attractors depending on the starting position. Still unlikely to cause disaster. The AI will understand what people actually want from discussions with only a subset of the world's population, which I also see as unlikely to cause disaster, even if people care about it socially.

From a social perspective, obviously a person or group who creates an AI may be tempted to create alignment to themselves only. I just don't think collective alignment is significantly harder from a technical perspective.

"Standing orders" may be desirable initially as a sort of training wheels even with collective intent, and yes that could cause controversy as they're likely not to originate from humanity collectively.

Comment by simon on Conflating value alignment and intent alignment is causing confusion · 2024-10-04T14:54:02.265Z · LW · GW

I think this post is drawing a sharp distinction in what is really a continuum; any "intent aligned" AI becomes more safe and useful as you add more "common sense" and "do what I mean" capability to it, and at the limit of this process you get what I would interpret as alignment to the long term, implicit deep values (of the entity or entities the AI started out intent aligned to).

I realize other people might define "alignment to the long term, implicit deep values" differently, such that it would not be approached by such a process, but currently think they would be mistaken in desiring whatever different definition they have in mind. (Indeed, what they actually want is what they would get under sufficiently sophisticated intent alignment, pretty much by definition).

P.S. I'm not endorsing intent alignment (for ASI) as applied to only an individual/group -  I think intent alignment can be applied to humanity collectively.

Comment by simon on Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours · 2024-08-06T18:58:17.003Z · LW · GW

I don't think intent aligned AI has to be aligned to an individual - it can also be intent aligned to humanity collectively. 

One thing I used to be concerned about is that collective intent alignment would be way harder than individual intent alignment, making someone validly have an excuse to steer an AI to their own personal intent. I no longer think this is the case. Most issues with collective intent I see as likely also affecting individual intent (e.g. literal instruction following vs extrapolation). I see two big issues that might make collective intent harder than individual intent. One is biased information on people's intents and another is difficulty of weighting intents for different people. On reflection though, I see both as non-catastrophic, and an imperfect solution to them likely being better for humanity as a whole than following one person's individual intent. 

Comment by simon on A simple case for extreme inner misalignment · 2024-07-14T16:36:03.802Z · LW · GW

It feels to me like this post is treating AIs as functions from a first state of the universe to a second state of the universe. Which in a sense, anything is... but I think that the tendency to simplification happens internally, where they operate more as functions from (digital) inputs to (digital) outputs. If you view an AI as a function from a digital input to a digital output, I don't think goals targeting specific configurations of the universe are simple at all, and I don't think decomposability over space/time/possible worlds is a criterion that would lead to something simple.