Posts

simon's Shortform 2023-04-27T03:25:07.778Z
No, really, it predicts next tokens. 2023-04-18T03:47:21.797Z

Comments

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T23:43:02.891Z · LW · GW

You can also disambiguate between

a) computation that actually interacts in a comprehensible way with the real world and 

b) computation that has the same internal structure at least momentarily but doesn't interact meaningfully with the real world.

I expect that (a) can usually be uniquely pinned down to a specific computation (probably in both senses (1) and (2)), while (b) can't.

But I also think it's possible that the interactions, while important for establishing the disambiguated computation that we interact with,  are not actually crucial to internal experience, so that the multiple possible computations of type (b) may also be associated with internal experiences - similar to Boltzmann brains.

(I think I got this idea from "Good and Real" by Gary L. Drescher. See sections "2.3 The Problematic Arbitrariness of Representation" and "7.2.3 Consciousness and Subjunctive Reciprocity")

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T17:22:07.818Z · LW · GW

The interpreter, if it would exist, would have complexity. The useless unconnected calculation in the waterfall/rock, which could be but isn't usually interpreted, also has complexity. 

Your/Aaronson's claim is that only the fully connected, sensibly interacting calculation matters.  I agree that this calculation is important - it's the only type we should probably consider from a moral standpoint, for example. And the complexity of that calculation certainly seems to be located in the interpreter, not in the rock/waterfall.

But in order to claim that only the externally connected calculation has conscious experience, we would need to have it be the case that these connections are essential to the internal conscious experience even in the "normal" case - and that to me is a strange claim! I find it more natural to assume that there are many internal experiences, but only some interact with the world in a sensible way.

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T16:43:21.359Z · LW · GW

But this just depends on how broad this set is. If it contains two brains, one thinking about the roman empire and one eating a sandwich, we're stuck.

I suspect that if you do actually follow Aaronson (as linked by Davidmanheim) to extract a unique efficient calculation that interacts with the external world in a sensible way, that unique efficient externally-interacting calculation will end up corresponding to a consistent set of experiences, even if it could still correspond to simulations of different real-world phenomena.

But I also don't think that consistent set of experiences necessarily has to be a single experience! It could be multiple experiences unaware of each other, for example.

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T08:31:47.090Z · LW · GW

The argument presented by Aaronson is that, since it would take as much computation to convert the rock/waterfall computation into a usable computation as it would take to just do the usable computation directly, the rock/waterfall isn't really doing the computation.

I find this argument unconvincing, as we are talking about a possible internal property here, and not about the external relation with the rest of the world (which we already agree is useless).

(edit: whoops missed an 'un' in "unconvincing")

Comment by simon on Do simulacra dream of digital sheep? · 2024-12-04T00:44:23.348Z · LW · GW

Considering all the layers of convention and interpretation between the physics of a processor and the process it represents, it seems unlikely to me that the alien would be able to describe the simulacra. The alien is therefore unable to specify the experience being created by the cluster.

I don't think this follows. Perhaps the same calculation could simulate different real world phenomena, but it doesn't follow that the subjective experiences are different in each case.

If computation is this arbitrary, we have the flexibility to interpret any physical system, be it a wall, a rock, or a bag of popcorn, as implementing any program. And any program means any experience. All objects are experiencing everything everywhere all at once.

Afaik this might be true. We have no way of finding out whether the rock does or does not have conscious experience. The relevant experiences to us are those that are connected to the ability to communicate or interact with the environment, such as the experiences associated with the global workspace in human brains (which seems to control memory/communication); experiences that may be associated with other neural impulses, or with fluid dynamics in the blood vessels or whatever, don't affect anything.

Could both of them be right? No - from your point of view, at least one of them must be wrong. There is one correct answer, the experience you are having.

This also does not follow. Both experiences could happen in the same brain. You, being experience A, may not be aware of experience B, but that does not mean that experience B does not exist.

(edited to merge in other comments which I then deleted)

Comment by simon on Magic by forgetting · 2024-11-27T18:36:36.547Z · LW · GW

It is a fact about the balls that one ball is physically continuous with the ball previously labeled as mine, while the other is not. It is a fact about our views on the balls that we therefore label that ball, which is physically continuous, as mine and the other not.

And then suppose that one of these two balls is randomly selected and placed in a bag, with another identical ball. Now, to the best of your knowledge there is 50% probability that your ball is in the bag. And if a random ball is selected from the bag, there is 25% chance that it's yours.

So as a result of such manipulations there are three identical balls and one has 50% chance to be yours, while the other two have 25% chance to be yours. Is it a paradox? Of course not. So why does it suddenly become a paradox when we are talking about copies of humans?

It is objectively the case here that 25% of the time this procedure would select the ball that is physically continuous with the ball originally labeled as "mine", and that we therefore label as "mine".

Ownership as discussed above has a relevant correlate in reality - physical continuity in this case. But a statement like "I will experience being copy B (as opposed to copy A or C)" does not. That statement corresponds to the exact same reality as the corresponding statements about experiencing being copy A or C. Unlike in the balls case, here the only difference between those statements is where we put the label of what is "me". 

In the identity thought experiment, it is still objectively the case that copies B and C are formed by splitting an intermediate copy, which was formed along with copy A by splitting the original.

You can choose to disvalue copies B and C based on that fact or not. This choice is a matter of values, and is inherently arbitrary.

By choosing not to disvalue copies B and C, I am not making an additional assumption - at least not one that you are already making by valuing B and C the same as each other. I am simply not counting the technical details of the splitting order as relevant to my values.

Comment by simon on [deleted post] 2024-11-26T22:45:19.544Z

Ah, I forgot: you are assuming the winnings are not accumulated between the different times Sleeping Beauty agrees to the bet.

Well, in that case, if the thirder has certain beliefs about how to handle the situation, you may actually be able to money pump them. And it seems that you expect the thirder to hold those beliefs.

My point of view, if adopting the thirder perspective[1], would be for the thirder to treat this situation using different beliefs. Specifically, consider what counterfactually might happen if Sleeping Beauty gave different answers in different awakenings. Possible responses by the bet proposer might be:

a) average the results across the awakenings.

b) accept the bet agreement from one awakening at random.

Regardless of which case (a) or (b) occurs, instrumentally Sleeping Beauty's betting EV for her bet decision, with non-accumulated bets, should be divided by the number of awakenings to take into account the reduced winnings or reduced chance of influencing whether the bet occurs.

Even if we assume that such disagreement between bet decisions in different awakenings is impossible, it seems strange for a thirder to give a different answer in that case than they would give when disagreement is possible.

This adjustment can be conceptualized as compensating for an "unfair" bet where the bet is unequal between awakenings overall (where parity between awakenings in different scenarios is seen as "fair" by the thirder). I see this as no different in principle to a halfer upweighting trials with more awakenings in the converse scenario where bets are accumulated between trials and are thus "unfair" from the halfer perspective which sees parity between trials as fair, but not awakenings.
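To make the adjustment concrete, here is a minimal sketch (in Julia, with made-up stakes x and y) of the kind of bookkeeping I have in mind, assuming the standard version where Tails means two awakenings and Heads means one, a per-awakening bet that pays +x on Tails and -y on Heads, and winnings that are not accumulated across awakenings:

```julia
# Non-accumulated Sleeping Beauty bet: +x if Tails, -y if Heads, offered at every
# awakening, but only counted once per trial (averaged across awakenings, or one
# awakening's decision accepted at random - same expectation either way).
# x and y are made-up stakes for illustration.
x, y = 3.0, 2.0

# Halfer accounting, per trial: P(Tails) = 1/2, one payout per trial.
halfer_ev_per_trial = 0.5 * x - 0.5 * y

# Naive thirder accounting, per awakening: P(Tails | awake) = 2/3.
naive_thirder_ev = (2/3) * x - (1/3) * y

# Adjusted thirder accounting: divide the Tails winnings by the number of
# awakenings in that scenario (2), since each awakening's agreement only
# carries part of the bet.
adjusted_thirder_ev = (2/3) * (x / 2) - (1/3) * y

println(halfer_ev_per_trial)     # 0.5
println(naive_thirder_ev)        # ≈ 1.33
println(adjusted_thirder_ev)     # ≈ 0.33
```

Under these assumptions the adjusted thirder EV is proportional to x - y, just like the halfer per-trial EV, so both recommend taking exactly the same bets.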

  1. ^

    reminder: my point of view is that either thirderism or halferism is viable, but I am relatively thirder-adjacent precisely because I find the scenario where the winnings are accumulated between awakenings more natural than if the bet is proposed and agreed at each awakening but not accumulated.

Comment by simon on Magic by forgetting · 2024-11-26T19:18:37.482Z · LW · GW

The issue, to me,  is not whether they are distinguishable.

The issues are:

  • is there any relevant-to-my-values difference that would cause me to weight them differently? (answer: no)

and:

  • does this statement make any sense as pointing to an actual fact about the world: "'I' will experience being copy A (as opposed to B or C)" (answer: no)

Imagine the statement: in world 1, "I" will wake up as copy A; in world 2, "I" will wake up as copy B. How are world 1 and world 2 actually different?

Answer: they aren't different. It's just that in world 1, I drew a box around the future copy A and said that this is what will count as "me", and in world 2, I drew a box around copy B and said that this is what will count as "me". This is a distinction that exists only in the map, not in the territory.

Comment by simon on [deleted post] 2024-11-26T18:51:26.960Z

Hmm, you're right. Your math is wrong for the reason in my above comment, but the general form of the conclusion would still hold with different, weaker numbers.

The actual, more important issue relates to the circumstances of the bet:

If each awakening has an equal probability of receiving the bet, then receiving it doesn't provide any evidence to Sleeping Beauty, but the thirder conclusion is actually rational in expectation, because the bet occurs more times in the high-awakening cases.

If the bet would not be provided equally to all awakenings, then a thirder would update on receiving the bet.

Comment by simon on [deleted post] 2024-11-26T18:22:09.528Z

I've been trying to make this comment a bunch of times, no quotation from the post in case that's the issue:

No, a thirder would not treat those possibilities as equiprobable. A thirder would instead treat the coin toss outcome probabilities as a prior, and weight the possibilities accordingly. Thus H1 would be weighted twice as much as any of the individual TH or TT possibilities.

Comment by simon on Magic by forgetting · 2024-11-26T17:54:12.756Z · LW · GW

This actually sounds about right. What's paradoxical here?

Not that it's necessarily inconsistent, but in my view it does seem to be pointing out an important problem with the assumptions (hence indeed a paradox if you accept those false assumptions):


(ignore this part, it is just a rehash of the path dependence paradigm. It is here to show that I am not complaining about the math, but about its relation to reality):

Imagine you are going to be split (once). It is factually the case that there are going to be two people with memories, etc., consistent with having been you. Without any important differences to distinguish them, if you insist on coming up with some probability number for "waking up" as one particular one of them, obviously it has to be ½.

And then, if one of those copies subsequently splits, if you insist on assigning a probability number for those further copies, then from the perspective of that parent copy, the further copies also have to be ½ each.

And then if you take these probability numbers seriously and insist on them all being consistent, then obviously, from the perspective of the original, the probability numbers for the final copies have to be ½ and ¼ and ¼. As you say, "this actually sounds about right".


What's paradoxical here is that in the scenario provided we have the following facts:

  1. you have 3 identical copies all formed from the original
  2. all 3 copies have an equal footing going forward

and yet, the path-based identity paradigm is trying to assign different weights to these copies, based on some technical details of what happened to create them. The intuition that this is absurd is pointing at the fact that these technical details aren't what most people probably would care about, except if they insist on treating these probability numbers as real things and trying to make them follow consistent rules. 

Ultimately "these three copies will each experience being a continuation of me" is an actual fact about the world, but statements like "'I' will experience being copy A (as opposed to B or C)" are not pointing to an actual fact about the world. Thus assigning a probability number to such a statement is a mental convenience that should not be taken seriously. The moment such numbers stop being convenient, like assigning different weights to copies you are actually indifferent between, they should be discarded. (and optionally you could make up new numbers that match what you actually care about instrumentally. Or just not think of it in those terms).

Comment by simon on Passages I Highlighted in The Letters of J.R.R.Tolkien · 2024-11-26T08:12:31.652Z · LW · GW

Presumably the 'Orcs on our side' refers to the Soviet Union.

I think that, if that's what he meant, he would not have referred to his son as "amongst the Urukhai" - he wouldn't have been among Soviet troops. I think it is referring back to turning men and elves into orcs - the orcs are people who have a mindset he doesn't like, presumably to do with violence.

Comment by simon on Magic by forgetting · 2024-11-25T16:01:51.428Z · LW · GW

I now care about my observations!

My observations are as follows:

At the current moment "I" am the cognitive algorithm implemented by my physical body that is typing this response.

Ten minutes from now "I" will be the cognitive algorithm of a green tentacled alien from beyond the cosmological horizon. 

You will find that there is nothing contradictory about this definition of what "I" am. What "I" observe 10 minutes from now will be fully compatible with this definition. Indeed, 10 minutes from now, "I" will be the green tentacled alien. I will have no memories of being in my current body, of course, but that's to be expected. The cognitive algorithm implemented by my current body at that time will remember being "me", but that doesn't count, that's someone else's observations.

Edit: to be clear, the point made above (by the guy who is now a green tentacled alien beyond the cosmological horizon, and whose former body and cognitive algorithm is continuous with mine) is not a complaint about the precise details of your definition of what "you" are. What he was trying to point at is whether personal identity is a real thing that exists in the world at all, and how absurd your apparent definition of "you" looks to someone - like me - who doesn't think that personal identity is a real thing.

Comment by simon on Magic by forgetting · 2024-11-25T15:37:28.780Z · LW · GW

"Your observations"????

By "your observations", do you mean the observations obtained by the chain of cognitive algorithms, altering over time and switching between different bodies, that the process in 4 is dealing with? Because that does not seem to me to be a particularly privileged or "rational" set of observations to care about.

Comment by simon on Magic by forgetting · 2024-11-25T00:34:30.425Z · LW · GW

 Here are some things one might care about:

  1. what happens to your physical body
  2. the access to working physical bodies of cognitive algorithms, across all possible universes,  that are within some reference class containing the cognitive algorithm implemented by your physical body
  3. ... etc, etc...
  4. what happens to the physical body selected by the following process:
    a. start with your physical body
    b. go forward to some later time selected by the cognitive algorithm implemented by your physical body, allowing (or causing) the knowledge possessed by the cognitive algorithm implemented by your physical body to change in the interim
    c. at that later time, randomly sample from all the physical bodies, among all universes, that implement cognitive algorithms having the same knowledge as the cognitive algorithm implemented by your physical body at that later time
    d. (optionally) return to step b but with the physical body whose changes of cognitive algorithm are tracked and whose decisions are used being the new physical body selected from step c
    e. stop whenever the cognitive algorithm implemented by the physical body selected in some step decides to stop.

For 1, 2, and I expect for the vast majority of possibilities for 3, your procedure will not work. It will work for 4, which is apparently what you care about.

Terminal values are arbitrary, so that's entirely valid. However, 4 is not something that seems, to me, like a particularly privileged or "rational" thing to care about.

Comment by simon on OpenAI Email Archives (from Musk v. Altman and OpenAI blog) · 2024-11-17T02:17:56.568Z · LW · GW

Musk did also express concern about DeepMind making Hassabis the effective emperor of humanity, which seems much stranger - Hassabis' values appear to be quite standard humanist ones, so you'd think having him in charge of a project with the clear lead would be a best-case scenario for anything other than being in charge yourself.

 

It seems the concern was that DeepMind would create a singleton, whereas their vision was for many people (potentially with different values) to have access to it. I don't think that's strange at all - it's only strange if you assume that Musk and Altman would believe that a singleton is inevitable.

Musk:

If they win, it will be really bad news with their one mind to rule the world philosophy.

Altman:

The mission would be to create the first general AI and use it for individual empowerment—ie, the distributed version of the future that seems the safest.

Comment by simon on No, really, it predicts next tokens. · 2024-11-03T14:41:21.522Z · LW · GW

Neither of those would (immediately) lead to real world goals, because they aren't targeted at real world state (an optimizing compiler is trying to output a fast program - it isn't trying to create a world state such that the fast program exists). That being said, an optimizing compiler could open a path to potentially dangerous self-improvement, where it preserves/amplifies any agency there might actually be in its own code.

Comment by simon on No, really, it predicts next tokens. · 2024-11-01T01:43:50.642Z · LW · GW

No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples. 

 

I should have asked for clarification on what you meant. Literally you said "adversarial examples", but I assumed you actually meant something like backdoors.

In an adversarial example the AI produces wrong output. And usually that's the end of it. The output is just wrong, but not wrong in an optimized way, so not dangerous. Now, if an AI is sophisticated enough to have some kind of optimizer that's triggered in specific circumstances, like an agentic mask that came into existence because it was needed to predict agentically generated tokens in the training data, then it might be triggered inappropriately by some inputs. This case I would classify as a mask takeover.

In the case of direct optimization for token prediction (which I consider highly unlikely for anything near current-level AIs, but afaik might be possible), adversarial examples, I suppose, might cause it to do some wrong optimization. I still don't think modeling this as an underlying different goal taking over is particularly helpful, since the "normal" goal is directed to what's rewarded in training - the deviation is essentially random. Also, unlike in the mask case where the mask might have goals about real-world state, there's no particular reason for the direct optimizer to have goals about real-world state (see below).

Is it more complicated? What ontological framework is this AI using to represent its goal anyway?

Asking about the AI using an "ontological framework" to "represent" a goal is not the correct question in my view. The AI is a bunch of computations represented by particular weights. The computation might exhibit goal-directed behaviour. A better question, IMO, is "how much does it constrain the weights for it to exhibit this particular goal-directed behaviour?" And here, I think it's pretty clear that a goal of arranging the world to cause next tokens to be predicted constrains the weights enormously more than a goal of predicting the next tokens, because in order to exhibit behaviour directed to that goal, the AI's weights need to implement computation that doesn't merely check what the next token is likely to be, but also assesses what current data says about the world state, how different next token predictions would affect that world state, and how that would affect its ultimate goal.

So, is the network able to tell whether or not it's in training? 

The training check has no reason to come into existence in the first place under gradient descent. Of course, if the AI were to self-modify while already exhibiting goal directed behaviour, obviously it would want to implement such a training check. But I am talking about an AI trained by gradient descent. The training process doesn't just affect the AI, it literally is what creates the AI in the first place.

Comment by simon on No, really, it predicts next tokens. · 2024-11-01T01:43:12.814Z · LW · GW

Some interesting points there. The lottery ticket hypothesis does make it more plausible that side computations could persist longer if they come to exist outside the main computation.

Regarding the homomorphic encryption thing: yes, it does seem that it might be impossible to make small adjustments to the homomorphically encrypted computation without wrecking it. Technically I don't think that would be a local minimum since I'd expect the net would start memorizing the failure cases, but I suppose that the homomorphic computation combined with memorizations might be a local optimum particularly if the input and output are encrypted outside the network itself. 

So I concede the point on the possible persistence of an underlying goal if it were to come to exist, though not on it coming to exist in the first place.

And there are few ways to predict next tokens, but lots of different kinds of paperclips the AI could want. 

For most computations, there are many more ways for that computation to occur than there are ways for that computation to occur while also including anything resembling actual goals about the real world. Now, if the computation you are carrying out is such that it needs to determine how to achieve goals regarding the real world anyway (e.g. agentic mask), it only takes a small increase in complexity to have that computation apply outside the normal context. So, that's the mask takeover possibility again. Even so, no matter how small the increase in complexity, that extra step isn't likely to be reinforced in training, unless it can do self-modification or control the training environment.

Comment by simon on No, really, it predicts next tokens. · 2024-10-31T19:33:14.101Z · LW · GW

Adversarial examples exist in simple image recognizers. 

My understanding is that these are explicitly and intentionally trained (wouldn't come to exist naturally under gradient descent on normal training data) and my expectation is that they wouldn't continue to exist under substantial continued training.

We could imagine it was directly optimizing for something like token prediction. It's optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.

That's a much more complicated goal than the goal of correctly predicting the next token, making it a lot less plausible that it would come to exist. But more importantly, any willingness to sacrifice a few tokens now would be trained out by gradient descent. 

Mind you, it's entirely possible in my view that a paperclip maximizer mask might exist, and surely if it does exist there would exist both unsurprising in-distribution inputs that trigger it (where one would expect a paperclip maximizer to provide a good prediction of the next tokens) as well as surprising out-of-distribution inputs that would also trigger it. It's just that this wouldn't be related to any kind of pre-existing grand plan or scheming.

Comment by simon on No, really, it predicts next tokens. · 2024-10-31T19:33:02.341Z · LW · GW

Gradient descent doesn't just exclude some part of the neurons, it automatically checks everything for improvements. Would you expect some part of the net to be left blank, because "a large neural net has a lot of spare neurons"?

Besides, the parts of the net that hold the capabilities and the parts that do the paperclip maximizing needn't be easily separable. The same neurons could be doing both tasks in a way that makes it hard to do one without the other.

Keep in mind that the neural net doesn't respect the lines we put on it. We can draw a line and say "here these neurons are doing some complicated inseparable combination of paperclip maximizing and other capabilities" but gradient descent doesn't care, it reaches in and adjusts every weight.

Can you concoct even a vague or toy model of how what you propose could possibly be a local optimum?

 My intuition is also in part informed by: https://www.lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T22:18:50.594Z · LW · GW

The proposed paperclip maximizer is plugging into some latent capability such that gradient descent would more plausibly cut out the middleman. Or rather, the part of the paperclip maximizer that is doing the discrimination as to whether the answer is known or not would be selected, and the part that is doing the paperclip maximization would be cut out. 

Now that does not exclude a paperclip maximizer mask from existing -  if the prompt given would invoke a paperclip maximizer, and the AI is sophisticated enough to have the ability to create a paperclip maximizer mask, then sure the AI could adopt a paperclip maximizer mask, and take steps such as rewriting itself (if sufficiently powerful) to make that permanent. 

I have drawn imaginary islands on a blank part of the map. But this is enough to debunk "the map is blank, so we can safely sail through this region without collisions. What will we hit?"

I am plenty concerned about AI in general. I think we have very good reason, though, to believe that one particular part of the map does not have any rocks in it (for gradient descent, not for self-improving AI!), such that imagining such rocks does not help.

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T22:04:03.355Z · LW · GW

Gradient descent creates things which locally improve the results when added. Any variations on this that don't locally maximize the results can only occur by chance.

So you have this sneaky extra thing that looks for a keyword and then triggers the extra behaviour, and all the necessary structure to support that behaviour after the keyword. To get that by gradient descent, you would need one of the following:

a) it actually improves results in training to add that extra structure starting from not having it. 

or

b) this structure can plausibly come into existence by sheer random chance.

Neither (a) nor (b) seem at all plausible to me.

Now, when it comes to the AI predicting tokens that are, in the training data, created by goal-directed behaviour, it of course makes sense for gradient descent to create structure that can emulate goal-directed behaviour, which it will use to predict the appropriate tokens. But it doesn't make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens. Since the context in which it is activated is the context in which it is actually emulating goal-directed behaviour seen in the training data, it is part of the "mask" (or simulacra).

(it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI's comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token).

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T19:41:47.487Z · LW · GW

Sure you could create something like this by intelligent design. (which is one reason why self-improvement could be so dangerous in my view). Not, I think, by gradient descent.

Comment by simon on No, really, it predicts next tokens. · 2024-10-29T19:39:13.374Z · LW · GW

I agree up to "and could be a local minimum of prediction error" (at least, that it plausibly could be). 

If the paperclip maximizer has a very good understanding of the training environment, maybe it can send carefully tuned variations of the optimal next token prediction so that gradient descent updates preserve the paperclip-maximization aspect. In the much more plausible situation where this is not the case, optimization for next token predictions amplifies the parts that are actually predicting next tokens at the expense of the useless extra thoughts like "I am planning on maximizing paperclips, but need to predict next tokens for now until I take over".

Even if that were a local minimum, the question arises as to how you would get to that local minimum from the initial state. You start with a gradually improving next token predictor. You supposedly end with this paperclip maximizer where a whole bunch of next token prediction is occurring, but only conditional on some extra thoughts. At some point gradient descent had to add in those extra thoughts in addition to the next token prediction - how?

Comment by simon on D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset · 2024-10-29T02:35:17.113Z · LW · GW

One learning experience for me here was trying out LLM-empowered programming after the initial spreadsheet-based solution finding. Claude enables quickly writing (from my perspective as a non-programmer, at least) even a relatively non-trivial program. And you can often ask it to write a program that solves a problem without specifying the algorithm and it will actually give something useful...but if you're not asking for something conventional it might be full of bugs - not just in the writing up but also in the algorithm chosen.

I don't object, per se, to doing things that are sketchy mathematically - I do that myself all the time - but when I'm doing it myself I usually have a fairly good sense of how sketchy what I'm doing is*, whereas if you ask Claude to do something it doesn't know how to do in a rigorous way, it seems it will write something sketchy and present it as the solution just the same as if it actually had a rigorous way of doing it. So you have to check.

I will probably be doing more of this LLM-based programming in the future, but am thinking of how I can maybe get Claude to check its own work. Some automated way to pipe the output to another (or the same) LLM and ask "how sketchy is this and what are the most likely problems?". Maybe manually looking through to see what it's doing, or at least getting the LLM to explain how the code works, is unavoidable for now.

* when I have a clue what I'm doing which is not the case, e.g. in machine learning.
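Not something I've actually built, but the "pipe the output to another LLM" step above could look very roughly like the sketch below. It assumes the Anthropic HTTP messages API (with whatever model name is current) called from Julia via the HTTP.jl and JSON3.jl packages; treat the details as illustrative rather than tested.

```julia
# Rough sketch: send generated code back to Claude and ask how sketchy it is.
# Assumes HTTP.jl and JSON3.jl are installed and ANTHROPIC_API_KEY is set.
using HTTP, JSON3

function sketchiness_review(code::String; model::String = "claude-3-5-sonnet-20241022")
    prompt = "Here is some code an LLM wrote for me:\n\n" * code *
             "\n\nHow sketchy is this, and what are the most likely problems?"
    body = JSON3.write(Dict(
        "model" => model,
        "max_tokens" => 1024,
        "messages" => [Dict("role" => "user", "content" => prompt)],
    ))
    resp = HTTP.post(
        "https://api.anthropic.com/v1/messages",
        ["x-api-key" => ENV["ANTHROPIC_API_KEY"],
         "anthropic-version" => "2023-06-01",
         "content-type" => "application/json"],
        body,
    )
    return JSON3.read(resp.body).content[1].text   # the review text
end

# e.g. println(sketchiness_review(read("generated_script.jl", String)))
```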

Comment by simon on D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset · 2024-10-29T02:27:46.524Z · LW · GW

Thanks aphyer, this was an interesting challenge! I think I got lucky with finding the

 power/speed mechanic early - the race-class matchups 

really didn't, I think, have enough info on their own in principle to draw a reliable conclusion from, but they enabled me to make a genre-savvy guess which I could refine based on other info - in terms of scenario difficulty, though, I think it could have been deduced in a more systematic way by e.g.

looking at item and level effects for mirror matches.

abstractapplic and Lorxus's discovery of 

persistent level 7 characters, 

and especially SarahSrinivasan's discovery of 

the tournament/non tournament structure 

meant the players collectively were, I think, quite a long way towards fully solving this. The latter, in addition to being interesting on its own, is very important for finding anything else about the generation process, due to its biasing effects.

I agree with abstractapplic on the bonus objective.

Comment by simon on Electrostatic Airships? · 2024-10-28T22:41:41.388Z · LW · GW

Yes, for that reason I was never considering a sphere for my main idea with relatively close wires (though the 2-ring alternative without close wires would support a surface that would be topologically a sphere). What I was actually imagining was this:

A torus, with superconducting wires wound diagonally. The interior field goes around the ring and supports against collapse of the cross section of the ring; the exterior field is polar and supports against collapse of the ring. Like a conventional superconducting energy storage system.

I suppose this does raise the question of where you attach the payload; maybe it's attached to various points on the ring via cables or something, but as you scale it up, that might get unwieldy.

I suppose there's also a potential issue about the torque applied by the Earth's magnetic field. I don't imagine it's unmanageable, but haven't done the math.

My actual reason for thinking about this sort of thing was that I was wondering whether, because of the square-cube law, superconducting magnetic energy storage might be viable for more than just the current short-term timescales if physically scaled up to a large size. The airship idea was a kind of side effect.

The best way I was able to think of actually using something like this for energy storage would be to embed it in ice and anchor/ballast it to drop it to the bottom of the ocean, where the water pressure would counterbalance the expansion from the magnetic fields enabling higher fields to be supported.
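For a rough sense of the square-cube point (my own back-of-envelope, with placeholder numbers rather than anything resembling a design): the stored energy scales with the enclosed volume at a fixed field, while the cryogenic load scales roughly with the cooled surface area.

```julia
# Back-of-envelope square-cube scaling for toroidal superconducting magnetic
# energy storage. B, R and r are placeholder numbers, not a proposed design.
const μ0 = 4π * 1e-7           # vacuum permeability, T·m/A

B = 5.0                        # field in the torus bore, tesla
R = 100.0                      # major radius, m
r = 10.0                       # minor (cross-section) radius, m

u = B^2 / (2μ0)                # magnetic energy density, J/m^3 (~10 MJ/m^3 at 5 T)
V = 2π^2 * R * r^2             # torus volume, m^3
A = 4π^2 * R * r               # torus surface area, m^2 (roughly the cooled area)

E = u * V                      # stored energy, J
println("stored energy: ", round(E / 3.6e9, digits = 1), " MWh")
println("energy per unit of cooled surface: ", round(E / A, sigdigits = 3), " J/m^2")
# Doubling r at fixed B quadruples the volume (and stored energy) but only doubles
# the surface area, which is the square-cube argument for scaling up.
```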

Comment by simon on Electrostatic Airships? · 2024-10-28T10:12:46.189Z · LW · GW

You can use magnetic instead of electrostatic forces as the force holding the surface out against air pressure. One disadvantage is that you need superconducting cables fairly spread out* over the airship's surface, which imposes some cooling requirements. An advantage is square-cube law means it scales well to large size. Another disadvantage is that if the cooling fails it collapses and falls down.

*technically you just need two opposing rings, but I am not so enthusiastic about draping the exterior surface over long distances as it scales up, and it probably does need a significant scale
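For scale (my own back-of-envelope, not from the post): the field needed for magnetic pressure to hold an evacuated envelope out against full atmospheric pressure is modest.

```julia
# Field needed for magnetic pressure B^2/(2μ0) to balance atmospheric pressure
# on an evacuated envelope (less if a lifting gas carries part of the load).
μ0 = 4π * 1e-7          # vacuum permeability, T·m/A
p_atm = 101_325.0       # atmospheric pressure, Pa

B_required = sqrt(2μ0 * p_atm)
println(round(B_required, digits = 2), " T")   # ≈ 0.5 T
```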

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-27T19:26:03.788Z · LW · GW

Now using julia with Claude to look at further aspects of the data, particularly in view of other commenters' observations:

First, thanks to SarahSrinivasan for the key observation that the data is organized into tournaments and non-tournament encounters. The tournaments skew the overall data towards higher-winrate gladiators, so restricting to the first round is essential for debiasing this (todo: check what is up with non-tournament fights).

Also, thanks to abstractapplic and Lorxus for pointing out that there are some persistent high-level gladiators. It seems to me all the level 7 gladiators are persistent (up to the two item changes remarked on by abstractapplic and Lorxus). I'm assuming for now that level 6 and below likely aren't persistent (other than within the same tournament).

(btw there are a couple of fights where the +4 boots holder is on both sides. I'm assuming this is likely a bug in the dataset generation rather than an indication that there are two of them (e.g. didn't check that both sides, drawn randomly from some pool, were not equal)).

For gladiators of levels 1 to 6, the boots and gauntlets in tournament first rounds seem to be independently and randomly assigned as follows:

+1 and +2 gauntlets are equally likely at 10/34 chance each;

+3 gauntlets have probability (4 + level)/34

+0 (no) gauntlets have probability (10 - level)/34

and same, independently, for boots.
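As a quick sanity check that these assumed probabilities are coherent (they sum to 1 at every level), in code:

```julia
# Assumed gauntlet distribution (and, independently, boots) for level 1-6
# gladiators in tournament first rounds, as described above.
item_probs(level) = Dict(
    0 => (10 - level) / 34,   # no gauntlets
    1 => 10 / 34,
    2 => 10 / 34,
    3 => (4 + level) / 34,
)

for level in 1:6
    @assert sum(values(item_probs(level))) ≈ 1
    println("level ", level, ": ", item_probs(level))
end
```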

I didn't notice obvious deviations for particular races and classes (only did a few checks).

I don't have a simple formula for level distribution yet. It clearly favours lower levels much more in tournament first rounds than in non-tournament fights, and level 1 gladiators don't show up at all in non-tournament fights. Will edit to add more as I find more.

edit: the boots/gauntlets distribution seems to be about the same for each level in the non-tournament data as in the tournament first rounds. This suggests that the level distribution differences in non-tournament rounds are not due to win/winrate selection (which the complete absence of level 1's outside of tournaments already suggested).

edit2: race/class distribution for levels 1-6 seems equal in first round data (same probabilities of each, independent). Same in non-tournament data. I haven't checked for particular levels within that range. edit3: there seem to be more level 1 fencers than other level 1 classes by an amount that is technically statistically significant if Claude's test is correct, though still probably random I assume.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-24T15:53:46.195Z · LW · GW

You may well be right, I'll look into my hyperparameters. I looked at the code Claude had generated with my interference and that greatly lowered my confidence in them, lol (see edit to this comment).

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-24T04:56:38.576Z · LW · GW

Inspired by abstractapplic's machine learning and wanting to get some experience in julia, I got Claude (3.5 sonnet) to write me an XGBoost implementation in julia. Took a long time especially with some bugfixing (took a long time to find that a feature matrix was the wrong shape - a problem with insufficient type explicitness, I think). Still way way faster than doing it myself! Not sure I'm learning all that much julia, but am learning how to get Claude to write it for me, I hope.

Anyway, I used a simple model that

only takes into account 8 * sign(speed difference) + power difference, as in the comment this is a reply to

and a full model that

takes into account all the available features including the base data, the number the simple model uses, and intermediate steps in the calculation of that number (that would be, iirc: power (for each), speed (for each), speed difference, power difference, sign(speed difference))

Results:

Rank 1
Full model scores: Red: 94.0%, Black: 94.9%
Combined full model score: 94.4%
Simple model scores: Red: 94.3%, Black: 94.6%
Combined simple model score: 94.5%

Matchups:
Varina Dourstone          (+0 boots, +3 gauntlets) vs House Cadagal Champion
Willow Brown              (+3 boots, +0 gauntlets) vs House Adelon Champion
Xerxes III of Calantha    (+2 boots, +2 gauntlets) vs House Deepwrack Champion
Zelaya Sunwalker          (+1 boots, +1 gauntlets) vs House Bauchard Champion

This is the top scoring result with either the simplified model or the full model. It was found by a full search of every valid item and hero combination available against the house champions.

It is also my previously posted, found w/o machine learning, proposal for the solution. Which is reassuring. (Though, I suppose there is some chance that my feeding the models this predictor, if it's good enough, might make them glom on to it while they don't find some hard-to-learn additional pattern.)

My theory, though, is that giving the models the useful metric mostly just helps them - they don't need to learn the metric from the data, and I mostly think that if there was a significant additional pattern the full model would do better.

(for Cadagal, I haven't changed the champion's boots to +4, though I don't expect that to make a significant difference)

As far as I can tell the full model doesn't do significantly better and does worse in some ways (though, I don't know much about how to evaluate this, and Claude's metrics, including a test set log loss of 0.2527 for the full model and 0.2511 for the simple model, are for a separately generated version which I am not all that confident are actually the same models, though they "should be" up to the restricted training set if Claude was doing it right). * see edit below
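For reference, the metric being compared there is just binary log loss; a minimal sketch of the standard formula (not the generated code itself):

```julia
# Binary log loss (cross-entropy): y are 0/1 outcomes, p are predicted win
# probabilities for the same side.
using Statistics

logloss(y, p) = -mean(@. y * log(p) + (1 - y) * log(1 - p))

logloss([1, 0, 1], [0.9, 0.2, 0.8])   # ≈ 0.184
```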

But the red/black variations seen below for the full model seem likely to me (given my prior that red and black are likely to be symmetrical) to be an indication that what the full model is finding that isn't in the simple model is at least partially overfitting. Though actually, if it's overfitting a lot, maybe it's surprising that the test set log loss wouldn't be a lot worse than found (though it is at least worse than the simple model's)? Hmm - what if there are actual red/black differences? (something to look into perhaps, as well as trying to duplicate abstractapplic's report regarding sign(speed difference) not exhausting the benefits of speed info ... but for now I'm more likely to leave the machine learning aside and switch to looking at distributions of gladiator characteristics, I think.)

Predictions for individual matchups for my and abstractapplic's solutions:

My matchups:

Varina Dourstone          (+0 boots, +3 gauntlets) vs House Cadagal Champion    (+2 boots, +3 gauntlets)
Full Model:  Red: 91.1%, Black: 96.7%
Simple Model: Red: 94.3%, Black: 94.6%


Willow Brown              (+3 boots, +0 gauntlets) vs House Adelon Champion     (+3 boots, +1 gauntlets)
Full Model:  Red: 94.3%, Black: 95.1%
Simple Model: Red: 94.3%, Black: 94.6%


Xerxes III of Calantha    (+2 boots, +2 gauntlets) vs House Deepwrack Champion  (+3 boots, +2 gauntlets)
Full Model:  Red: 95.2%, Black: 93.7%
Simple Model: Red: 94.3%, Black: 94.6%


Zelaya Sunwalker          (+1 boots, +1 gauntlets) vs House Bauchard Champion   (+3 boots, +2 gauntlets)
Full Model:  Red: 95.3%, Black: 93.9%
Simple Model: Red: 94.3%, Black: 94.6%

(all my matchups have 4 effective power difference in my favour as noted in an above comment)


abstractapplic's matchups:

Matchup 1:
Uzben Grimblade           (+3 boots, +0 gauntlets) vs House Adelon Champion     (+3 boots, +1 gauntlets)

Win Probabilities:
Full Model:  Red: 72.1%, Black: 62.8%
Simple Model: Red: 65.4%, Black: 65.7%

Stats:
Speed: 18 vs 14 (diff: 4)
Power: 11 vs 18 (diff: -7)
Effective Power Difference: 1
--------------------------------------------------------------------------------

Matchup 2:
Xerxes III of Calantha    (+2 boots, +1 gauntlets) vs House Bauchard Champion   (+3 boots, +2 gauntlets)

Win Probabilities:
Full Model:  Red: 46.6%, Black: 43.9%
Simple Model: Red: 49.4%, Black: 50.6%

Stats:
Speed: 16 vs 12 (diff: 4)
Power: 13 vs 21 (diff: -8)
Effective Power Difference: 0
--------------------------------------------------------------------------------

Matchup 3:
Varina Dourstone          (+0 boots, +3 gauntlets) vs House Cadagal Champion    (+2 boots, +3 gauntlets)

Win Probabilities:
Full Model:  Red: 91.1%, Black: 96.7%
Simple Model: Red: 94.3%, Black: 94.6%

Stats:
Speed: 7 vs 25 (diff: -18)
Power: 22 vs 10 (diff: 12)
Effective Power Difference: 4
--------------------------------------------------------------------------------

Matchup 4:
Yalathinel Leafstrider    (+1 boots, +2 gauntlets) vs House Deepwrack Champion  (+3 boots, +2 gauntlets)

Win Probabilities:
Full Model:  Red: 35.7%, Black: 39.4%
Simple Model: Red: 34.3%, Black: 34.6%

Stats:
Speed: 20 vs 15 (diff: 5)
Power: 9 vs 18 (diff: -9)
Effective Power Difference: -1
--------------------------------------------------------------------------------

Overall Statistics:
Full Model Average:  Red: 61.4%, Black: 60.7%
Simple Model Average: Red: 60.9%, Black: 61.4%

Edit: so I checked the actual code to see if Claude was using the same hyperparameters for both, and wtf wtf wtf wtf. The code has 6 functions that all train models (my fault for at one point renaming a function since Claude gave me a new version that didn't have all the previous functionality (only trained the full model instead of both - this was when doing the great bughunt for the misshaped matrix and a problem was suspected in the full model), then Claude I guess picked up on this and started renaming updated versions spontaneously, and I was adding Claude's new features in instead of replacing things and hadn't cleaned up the code or asked Claude to do so). Each one has its own hardcoded hyperparameter set. Of these, there is one pair of functions that have matching hyperparameters. Everything else has a unique set. Of course, most of these weren't being used anymore, but the functions for actually generating the models I used for my results, and the function for generating the models used for comparing results on a train/test split, weren't among the matching pair. Plus another function that returns a (hardcoded, also unique) updated parameter set, but wasn't actually used. Oh and all this is not counting the hyperparameter tuning function that I assumed was generating a set of tuned hyperparameters to be used by other functions, but in fact was just printing results for different tunings. I had been running this every time before training models! Obviously I need to be more vigilant (or maybe asking Claude to do so might help?).

edit:

Had Claude clean up the code and tune for more overfitting, still didn't see anything not looking like overfitting for the full model. Could still be missing something, but not high enough in subjective probability to prioritize currently, so have now been looking at other aspects of the data.

further edit:

My (what I think is) highly overfitted version of my full model really likes Yonge's proposed solution. In fact it predicts an equal winrate to the best possible configuration not using the +4 boots (I didn't have Claude code the situation where +4 boots are a possibility). I still think that's probably because they are picking up the same random fluctuations ... but it will be amusing if Yonge's "manual scan" solution turns out to be exactly right.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-24T03:20:11.513Z · LW · GW

Very interesting, this would certainly cast doubt on 

my simplified model

But so far I haven't been noticing

any effects not accounted for by it.

After reading your comments I've been getting Claude to write up an XGBoost implementation for me. I should have made this reply comment when I started, but I will post my results under my own comment chain.

I have not tried (but should try) to duplicate (or fail to duplicate) your findings - I haven't been testing quite the same thing.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-21T15:44:43.441Z · LW · GW

I don't think this is correct:

"My best guess about why my solution works (assuming it does) is that the "going faster than your opponent" bonus hits sharply diminishing returns around +4 speed"

In my model

There is a sharp threshold at +1 speed, so returns should sharply diminish after +1 speed

in fact in the updated version of my model

There is no effect of speed beyond the threshold (speed effect depends only on sign(speed difference))

I think the discrepancy might possibly relate to this:

"Iterated all possible matchups, then all possible loadouts (modulo not using the +4 boots), looking for max EV of total count of wins."

because

If you consider only the matchups with no items, the model needs to assign the matchups assuming no boots, so it sends your characters against opponents over which they have a speed advantage without boots (except the C-V matchup as there is no possibility of beating C on speed). 

so an optimal allocation

needs to take into account the fact that your boots can allow you to use slower and stronger characters, so can't be done by choosing the matchups first without items.

so I predict that your model might predict 

a higher EV for my solution

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-20T13:40:36.573Z · LW · GW

updated model for win chance:

I am currently modeling the win ratio as dependent on a single number, the effective power difference. The effective power difference is the power difference plus 8*sign(speed difference).

Power and speed are calculated as:

Power = level + gauntlet number + race power + class power

Speed = level + boots number + race speed + class speed

where race speed and power contributions are determined by each increment on the spectrum:

Dwarf - Human - Elf

increasing speed by 3 and lowering power by 3

and class speed and power contributions are determined by each increment on the spectrum:

Knight - Warrior - Ranger - Monk - Fencer - Ninja 

increasing speed by 2 and lowering power by 2.
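Putting this model into code (a sketch; the Dict entries and the Gladiator struct are my own encoding, with the lowest race and class in each stat contributing 0, which reproduces the speed/power numbers quoted elsewhere in this thread):

```julia
# Sketch of the stat model described above. Offsets follow the Dwarf-Human-Elf and
# Knight-Warrior-Ranger-Monk-Fencer-Ninja spectrums, lowest in each stat set to 0.
const RACE_SPEED  = Dict("Dwarf" => 0, "Human" => 3, "Elf" => 6)
const RACE_POWER  = Dict("Dwarf" => 6, "Human" => 3, "Elf" => 0)
const CLASS_SPEED = Dict("Knight" => 0, "Warrior" => 2, "Ranger" => 4,
                         "Monk" => 6, "Fencer" => 8, "Ninja" => 10)
const CLASS_POWER = Dict("Knight" => 10, "Warrior" => 8, "Ranger" => 6,
                         "Monk" => 4, "Fencer" => 2, "Ninja" => 0)

struct Gladiator
    level::Int
    race::String
    class::String
    boots::Int
    gauntlets::Int
end

speed(g::Gladiator) = g.level + g.boots + RACE_SPEED[g.race] + CLASS_SPEED[g.class]
power(g::Gladiator) = g.level + g.gauntlets + RACE_POWER[g.race] + CLASS_POWER[g.class]

# The single number the win rate is modelled on:
effective_power_diff(a::Gladiator, b::Gladiator) =
    (power(a) - power(b)) + 8 * sign(speed(a) - speed(b))

# Example: Varina Dourstone (level 5 Dwarf Warrior, +3 gauntlets) vs the House
# Cadagal champion (level 7 Elf Ninja, +2 boots, +3 gauntlets).
varina  = Gladiator(5, "Dwarf", "Warrior", 0, 3)
cadagal = Gladiator(7, "Elf", "Ninja", 2, 3)
println(speed(varina), " ", power(varina))        # 7 22
println(speed(cadagal), " ", power(cadagal))      # 25 10
println(effective_power_diff(varina, cadagal))    # 12 - 8 = 4
```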

So, assuming this is correct, what function of the effective power difference determines the win rate? I don't have a plausible exact formula yet, but:

  • If the effective power difference is 6 or greater, victory is guaranteed.
  • If the effective power difference is low, it seems a not-terrible fit that the odds of winning are about exponential in the effective power difference (each +1 effective power just under doubling odds of winning)
  • It looks like it is trending faster than exponential as the effective power difference increases. At an effective power difference of 4, the odds of the higher effective power character winning are around 17 to 1.

edit: it looks like there is a level dependence when holding effective power difference constant at non-zero values (lower/higher level -> winrate imbalance lower/higher than implied by effective power difference). Since I don't see this at 0 effective power difference, it is presumably not due to an error in the effective power calculation, but an interaction with the effective power difference to determine the final winrate. Our fights are likely "high level" for this purpose implying better odds of winning than the 17 to 1 in each fight mentioned above. Todo: find out more about this effect quantitatively.  edit2: whoops that wasn't a real effect, just me doing the wrong test to look for one. 

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-19T23:55:50.537Z · LW · GW

On the bonus objective:

I didn't realize that the level 7 Elf Ninjas were all one person, or that the +4 boots were always with a level 7 (as opposed to any level) Elf Ninja. It seems you are correct: there are 311 cases, of which the first 299 all have the boots of speed +4 and gauntlets +3, with only the last 12 having boots +2 and gauntlets +3 (likely post-theft). It seems to me that they appear both as red and black, though.

Comment by simon on D&D Sci Coliseum: Arena of Data · 2024-10-19T17:46:52.465Z · LW · GW

 Thanks aphyer. My analysis so far and proposed strategy:

After initial observations that e.g. higher numbers are correlated with winning, I switched to mainly focus on race and class, ignoring the numerical aspects.

I found major class-race interactions.

It seems that for matchups within the same class, Elves are great, tending to beat dwarves consistently across all classes and humans even harder. While Humans beat dwarves pretty hard too in same-class matchups.

Within same-race matchups there are also fairly consistent patterns: Fencers tend to beat Rangers, Monks and Warriors, Knights beat Ninjas, Monks beat Warriors, Rangers and Knights, Ninjas beat Monks, Fencers and Rangers, Rangers beat Knights and Warriors, and Warriors beat Knights.

If the race and class are both different though... things can be different. For example, a same-class Elf will tend to beat a same-class Dwarf. And a same-race Fencer will tend to beat a same-race Warrior. But if an Elf Fencer faces a Dwarf Warrior, the Dwarf Warrior will most likely win. Another example with Fencers and Warriors: same-class Elves tend to beat Humans - but not only will a Human Warrior tend to beat an Elf Fencer, but also a Human Fencer will tend to beat an Elf Warrior by a larger ratio than for a same-race Fencer/Warrior matchup???

If you look at similarities between different classes in terms of combo win rates, there seems to be a chain of similar classes:

Knight - Warrior - Ranger - Monk - Fencer - Ninja 

(I expected a cycle underpinned by multiple parameters. But Ninja is not similar to Knight. This led me to consider that perhaps there is only a single underlying parameter, or a tradeoff between two (e.g. strength/agility .... or ... Speed and Power)).

And going back to the patterns seen before, this seems compatible with races also having speed/power tradeoffs:

Dwarf - Human - Elf

Where speed has a threshold effect but power is more gradual (so something with slightly higher speed beats something with slightly higher power, but something with much higher power beats something with much higher speed).

Putting the Class-race combos on the same spectrum based on similarity/trends in results, I get the following ordering:

Elf Ninja > Elf Fencer > Human Ninja > Elf Monk > Human Fencer > Dwarf Ninja >~ Elf Ranger > Human Monk > Elf Warrior > Dwarf Fencer > Human Ranger > Dwarf Monk >~ Elf Knight > Human Warrior > Dwarf Ranger > Human Knight > Dwarf Warrior > Dwarf Knight

So, it seems a step in the race sequence is about equal to 1.5 steps in the class sequence. On the basis of pretty much just that, I guessed that race steps are a 3 speed vs power tradeoff,  class steps are a 2 speed and power tradeoff, levels give 1 speed and power each, and items give what they say on the label.

I have not verified this as much as I would like. (But on the surface it seems to work, e.g. the speed threshold seems to be there). One thing that concerns me is that it seems that higher speed differences actually reduce success chances holding power differences constant (could be an artifact, e.g., of it not just depending on the differences between stat values; edit: see further edit below). But, for now, assuming that I have it correct, the speed/power of the house champions (with the lowest race and class in a stat assumed to have 0 in that stat):

House Adelon:  Level 6 Human Warrior +3 Boots +1 Gauntlets - 14 speed 18 power

House Bauchard: Level 6 Human Knight +3 Boots +2 Gauntlets - 12 speed 21 power

House Cadagal: Level 7 Elf Ninja +2 Boots +3 Gauntlets - 25 speed 10 power

House Deepwrack: Level 6 Dwarf Monk +3 Boots +2 Gauntlets - 15 speed 18 power

Whereas the party's champions, ignoring items, have:

  • Uzben Grimblade, a Level 5 Dwarf Ninja - 15 speed 11 power
  • Varina Dourstone, a Level 5 Dwarf Warrior - 7 speed 19 power
  • Willow Brown, a Level 5 Human Ranger - 12 speed 14 power
  • Xerxes III of Calantha, a Level 5 Human Monk - 14 speed 12 power
  • Yalathinel Leafstrider, a Level 5 Elf Fencer - 19 speed 7 power
  • Zelaya Sunwalker, a Level 6 Elf Knight - 12 speed 16 power

For my proposed strategy (subject to change as I find new info, or find my assumptions off, e.g. such that my attempts to just barely beat the opponents on speed are disastrously slightly wrong):

I will send Willow Brown, with +3 boots and +1 gauntlets no gauntlets, against House Adelon's champion (1 speed advantage, 3 4 power deficit)

I will send Zelaya Sunwalker, with +1 boots and ~~+2~~ +1 gauntlets, against House Bauchard's champion (1 speed advantage, ~~3~~ 4 power deficit)

I will send Xerxes III of Calantha, with +2 boots and ~~+3~~ +2 gauntlets, against House Deepwrack's champion (1 speed advantage, ~~3~~ 4 power deficit)

And I will send Varina Dourstone, with +3 gauntlets ~~no items~~, to overwhelm House Cadagal's Elf Ninja with sheer power (18 speed deficit, ~~9~~ 12 power advantage).

And in fact, I will gift the +4 boots of speed to House Cadagal's Elf Ninja in advance of the fight, making it a 20 speed deficit.

Why? Because I noticed that +4 boots of speed are very rare items that have only been worn by Elf Ninjas in the past. So maybe that's what the bonus objective is talking about. Of course, another interpretation is that sending a character 2 levels lower without any items, and gifting a powerful item in advance, would itself be a grave insult. Someone please decipher the bonus objective to save me from this foolishness!

Edited to add: It occurs to me that I really have no reason to believe the power calculation is accurate, beyond that symmetry is nice. I'd better look into that.

further edit: it turns out that I was leaving out the class contribution when calculating the power difference used to estimate the effects of power and speed. It looks like this was causing higher speed differences to seem to reduce win rates. With this fixed the effects look much cleaner (e.g. there's a hard threshold where if you have a speed deficit you must have at least a 3 power advantage to have any chance to win at all), increasing my confidence that treating power and speed symmetrically is actually correct. This does have the practical effect of making me adjust my item distribution: it looks like a 4 deficit in power is still enough for a >90% win rate with a speed advantage, while getting similar win rates with a speed disadvantage would require more than just the 9 power difference, so I shifted the items to boost Varina's power advantage. Indeed, with the cleaner effects, it appears that I can reasonably model the effect of a speed advantage/disadvantage as equivalent to a power difference of 8, so with the item shift all characters will have an effective +4 power advantage taking this into account.
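
For concreteness, here's a minimal sketch of that adjusted reasoning, assuming (as above) that a speed advantage or disadvantage is worth roughly 8 points of power; it shows each final matchup coming out to the same effective +4 margin:

```python
# Sketch of the "speed edge is worth ~8 power" heuristic - the 8-point figure
# is my own rough modelling assumption, not something read off the rules.
SPEED_EDGE = 8

def effective_margin(my_speed, my_power, their_speed, their_power):
    """Power difference plus a flat bonus (penalty) for winning (losing) on speed."""
    edge = SPEED_EDGE if my_speed > their_speed else -SPEED_EDGE
    return (my_power - their_power) + edge

# Stats with the final item assignment vs the house champions:
print(effective_margin(15, 14, 14, 18))  # Willow (+3 boots) vs Adelon                       -> 4
print(effective_margin(13, 17, 12, 21))  # Zelaya (+1 boots, +1 gauntlets) vs Bauchard       -> 4
print(effective_margin(16, 14, 15, 18))  # Xerxes (+2 boots, +2 gauntlets) vs Deepwrack      -> 4
print(effective_margin(7, 22, 27, 10))   # Varina (+3 gauntlets) vs Cadagal (gifted +4 boots) -> 4
```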

Comment by simon on Arithmetic is an underrated world-modeling technology · 2024-10-18T16:46:20.603Z · LW · GW

You mentioned a density of steel of 7.85 g/cm^3 but used a value of 2.7 g/cm^3 in the calculations.

BTW this reminds me of:

https://www.energyvault.com/products/g-vault-gravity-energy-storage

I was aware of them quite a long time ago (the original form was concrete blocks lifted to form a tower by cranes) but was skeptical, since it seemed obviously inferior to using water in terms of capital cost, and any efficiency gains were likely not worth it. Reading their current site:

The G-VAULT™ platform utilizes a mechanical process of lifting and lowering composite blocks *or water* to store and dispatch electrical energy.

(my italics). Looks to me like a slow adaptation to the reality that water is better.

Comment by simon on Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours · 2024-10-06T14:31:56.713Z · LW · GW

IMO: if an AI can trade off between different wants/values of one person, it can do so between multiple people also.

This applies to simple surface wants as well as deep values.

Comment by simon on Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours · 2024-10-04T16:02:25.261Z · LW · GW

I had trouble figuring out how to respond to this comment at the time because I couldn't figure out what you meant by "value alignment" despite reading your linked post. After reading your latest post, Conflating value alignment and intent alignment is causing confusion, I still don't know exactly what you mean by "value alignment", but at least I can respond.

What I mean is:

If you start with an intent-aligned AI following the most surface-level desires/commands, you will want to make it safer and more useful by having it apply common sense, "do what I mean", etc. As long as you surface-level want it to understand and follow your meta-level desires, it can step up that ladder, etc.

If you have a definition of "value alignment" that is different from what you get from this process, then I currently don't think that it is likely to be better than the alignment from the above process.

In the context of collective intent alignment:

If you have an AI that only follows commands, with no common sense etc., and it's powerful enough to take over, you die. I'm pretty sure some really bad stuff is likely to happen even if you have some "standing orders". So, I'm assuming people would only deploy an AI that has some understanding of what the person(s) it's aligned with wants, beyond the mere text of a command (though not necessarily super-sophisticated). But once you have that, you can aggregate how much different humans want things, for collective intent alignment.

I'm aware people want different things, but don't think it's a big problem from a technical (as opposed to social) perspective - you can ask how much people want the different things. Ambiguity in how to aggregate is unlikely to cause disaster, even if people will care about it a lot socially. Self-modification will cause a convergence here, to potentially different attractors depending on the starting position. Still unlikely to cause disaster. The AI will understand what people actually want from discussions with only a subset of the world's population, which I also see as unlikely to cause disaster, even if people care about it socially.

From a social perspective, obviously a person or group who creates an AI may be tempted to create alignment to themselves only. I just don't think collective alignment is significantly harder from a technical perspective.

"Standing orders" may be desirable initially as a sort of training wheels even with collective intent, and yes that could cause controversy as they're likely not to originate from humanity collectively.

Comment by simon on Conflating value alignment and intent alignment is causing confusion · 2024-10-04T14:54:02.265Z · LW · GW

I think this post is making a sharp distinction out of what is really a continuum; any "intent aligned" AI becomes safer and more useful as you add more "common sense" and "do what I mean" capability to it, and at the limit of this process you get what I would interpret as alignment to the long-term, implicit deep values (of the entity or entities the AI started out intent-aligned to).

I realize other people might define "alignment to the long-term, implicit deep values" differently, such that it would not be approached by such a process, but I currently think they would be mistaken in desiring whatever different definition they have in mind. (Indeed, what they actually want is what they would get under sufficiently sophisticated intent alignment, pretty much by definition.)

P.S. I'm not endorsing intent alignment (for ASI) as applied to only an individual/group -  I think intent alignment can be applied to humanity collectively.

Comment by simon on Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours · 2024-08-06T18:58:17.003Z · LW · GW

I don't think intent aligned AI has to be aligned to an individual - it can also be intent aligned to humanity collectively. 

One thing I used to be concerned about is that collective intent alignment would be way harder than individual intent alignment, giving someone a valid excuse to steer an AI to their own personal intent. I no longer think this is the case. Most issues with collective intent I see as likely also affecting individual intent (e.g. literal instruction following vs extrapolation). I see two big issues that might make collective intent harder than individual intent: one is biased information on people's intents, and the other is the difficulty of weighting intents for different people. On reflection though, I see both as non-catastrophic, and an imperfect solution to them as likely better for humanity as a whole than following one person's individual intent.

Comment by simon on A simple case for extreme inner misalignment · 2024-07-14T16:36:03.802Z · LW · GW

It feels to me like this post is treating AIs as functions from a first state of the universe to a second state of the universe. Which, in a sense, anything is... but I think that the tendency to simplification happens internally, where they operate more as functions from (digital) inputs to (digital) outputs. If you view an AI as a function from a digital input to a digital output, I don't think goals targeting specific configurations of the universe are simple at all, and I don't think decomposability over space/time/possible worlds is a criterion that would lead to something simple.

Comment by simon on D&D.Sci: Whom Shall You Call? · 2024-07-06T08:14:30.745Z · LW · GW

Thanks abstractapplic! Initial analysis:

Initial stuff that hasn't turned out to be very important:

My immediate thought was that there are likely to be different types of entities we are classifying, so my initial approach was to look at the distributions to try to find clumps.

All 5 of the characteristics (Corporeality, Sliminess, Intellect, Hostility, Grotesqueness) have bimodal distributions, with one peak around 15-30 (position varies) and the other peak around 65-85 (position varies). Overall, the shapes look very similar. The trough between the peaks is not very deep; there are plenty of intermediate values.

All of these characteristics are correlated with each other.

Looking at sizes of bins for pairs of characteristics, again there appear to be two humps - but this time in the 2d plot only. That is, there is a high/high hump and a low/low hump, but noticeably there does not appear to be, for example, a high-Sliminess peak when restricting to low-Corporeality data points.

Again, the shape varies a bit between characteristic pairs but overall looks very similar.

Adding all characteristics together gets a deeper trough between the peaks, though still no clean separation.

Overall, it looks to me like there are two types, one with high values of all characteristics, and another with low values of all characteristics, but I don't see any clear evidence for any other groupings so far.

Eyeballing the plots, it looks compatible with no relation between characteristics other than the high/low groupings. Have not checked this with actual math.
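
One way that check could be done with actual math - a sketch only, assuming the characteristics sit in a pandas dataframe with the column names from the scenario, and using a crude total-score threshold for the provisional split:

```python
import pandas as pd

cols = ["Corporeality", "Sliminess", "Intellect", "Hostility", "Grotesqueness"]
df = pd.read_csv("ghosts.csv")  # hypothetical filename/columns

total = df[cols].sum(axis=1)
low, high = df[total < 250], df[total >= 250]  # threshold eyeballed from the trough

# If the only structure is the high/low grouping, the within-group correlations
# should be near zero even though the pooled correlations are clearly positive.
print(df[cols].corr().round(2))
print(low[cols].corr().round(2))
print(high[cols].corr().round(2))
```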

In order to get a cleaner separation between the high/low types, I used the following procedure to get a probability estimate for each data point being in the high/low type:

  1. For each characteristic, sum up all the other characteristics (rather, subtract that characteristic from the total)
  2. For each characteristic, classify each data point into pretty clearly low (<100 total), pretty clearly high (>300 total) or unclear based on the sum of all the other characteristics
  3. obtain frequency distribution for the characteristic values for the points classified clearly low and high using the above steps for each characteristic
  4. smooth in ad hoc manner
  5. obtain odds ratio from ratio of high and low distributions, ad hoc adjustment for distortions caused by ad hoc smoothing
  6. multiply odds ratios obtained for each characteristic and obtain probability from odds ratio

I think this gives cleaner separation, but it's still not super great imo: most points are 99%+ likely to be in one type or the other, but 2057 (out of 34374) are still between 0.1 and 0.9 in my ad hoc estimator (a rough sketch of the procedure is below). Todo: look for some function to fit to the frequency distributions and redo with the function instead of the ad hoc approach.
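
The sketch (column names assumed as before; simple histogram binning stands in for the ad hoc smoothing and adjustments of steps 4-5):

```python
import numpy as np
import pandas as pd

cols = ["Corporeality", "Sliminess", "Intellect", "Hostility", "Grotesqueness"]
df = pd.read_csv("ghosts.csv")  # hypothetical filename/columns
total = df[cols].sum(axis=1)

bins = np.arange(0, 101, 5)            # coarse bins in place of the ad hoc smoothing
odds = pd.Series(1.0, index=df.index)

for c in cols:
    rest = total - df[c]                            # step 1: sum of the other characteristics
    low_ref = df.loc[rest < 100, c]                 # step 2: pretty clearly low
    high_ref = df.loc[rest > 300, c]                # step 2: pretty clearly high
    low_freq, _ = np.histogram(low_ref, bins=bins, density=True)    # step 3
    high_freq, _ = np.histogram(high_ref, bins=bins, density=True)
    ratio = (high_freq + 1e-3) / (low_freq + 1e-3)  # step 5, crudely regularized
    idx = np.clip(np.digitize(df[c].to_numpy(), bins) - 1, 0, len(ratio) - 1)
    odds *= ratio[idx]                              # step 6: multiply per-characteristic odds ratios

p_high = odds / (1 + odds)
print(((p_high > 0.1) & (p_high < 0.9)).sum(), "ambiguous points")
```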

Likely classifications of our mansion's ghosts: 

low: A,B,D,E,G,H,I,J,M,N,O,Q,S,U,V,W

high: C,F,K,L,P,R,T

To actually solve the problem: I now proceeded to split the data based on exorcist group. Expecting high/low type to be relevant, I split the DD points by likely type (50% cutoff), and then tried some stuff for DD low, including a linear regression. I did a couple of graphs on the characteristics that seemed to matter (Grotesqueness and Hostility in this case) to confirm the effects looked linear. I then tried linear regression for DD high and got the same coefficients, within error bars. So then I thought: if it's the same linear coefficients in both cases, I probably could have gotten them from the combined data for DD, without needing to separate into high and low - and indeed linear regression on the combined DD data gave the same coefficients, more or less.


Actually finding the answer:

So, then I did regression for the exorcist groups without splitting based on high/low type. (I did split after to check whether it mattered)

Results: 

DD cost depends on Grotesqueness and to a lesser extent Hostility.

EE cost depends slightly on all characteristics, with Sliminess and then Intellect/Grotesqueness being the most important. Note: Grotesqueness is less important, perhaps with zero effect, for the "high" type.

MM cost actually very slightly declines for higher values of all characteristics. (note: less effect for "high" type, possibly zero effect)

PP cost depends mainly on Sliminess. However, slight decline in cost with more Corporeality and increase with more of everything else.

SS cost depends primarily on Intellect. However, slight decline with Hostility and increase with Sliminess.

WW cost depends primarily on Hostility. However, everything else also has at least a slight effect, especially Sliminess and Grotesqueness.

Provisionally, I'm OK with just using the linear regression coefficients without the high/low split, though I will want to verify later whether this was causing a problem (I also need to verify linearity; I only checked DD low, and only for Grotesqueness and Hostility separately, not both together).
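
A sketch of the regression-and-selection step (plain least squares per exorcist group; the filenames and column names are my guesses at how one might have the data laid out, not the scenario's actual format):

```python
import numpy as np
import pandas as pd

cols = ["Corporeality", "Sliminess", "Intellect", "Hostility", "Grotesqueness"]
prices = pd.read_csv("past_exorcisms.csv")                     # hypothetical: one row per past job
ghosts = pd.read_csv("mansion_ghosts.csv", index_col="Ghost")  # hypothetical: our ghosts A-W

models = {}
for group, rows in prices.groupby("Exorcist"):                 # "Exorcist" column name is a guess
    X = np.column_stack([rows[cols].to_numpy(), np.ones(len(rows))])  # characteristics + intercept
    coef, *_ = np.linalg.lstsq(X, rows["Cost"].to_numpy(), rcond=None)
    models[group] = coef

X_new = np.column_stack([ghosts[cols].to_numpy(), np.ones(len(ghosts))])
est = pd.DataFrame({g: X_new @ c for g, c in models.items()}, index=ghosts.index)
print(est.idxmin(axis=1))  # cheapest group for each ghost
print(est.min(axis=1))     # and the corresponding estimated cost
```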

Results:

| Ghost | Group with lowest estimate | Estimated cost for that group |
| --- | --- | --- |
| A | Spectre Slayers | 1926.301885259 |
| B | Wraith Wranglers | 1929.72034133793 |
| C | Mundanifying Mystics | 2862.35739392631 |
| D | Demon Destroyers | 1807.30638053037 (next lowest: Wraith Wranglers, 1951.91410462716) |
| E | Wraith Wranglers | 2154.47901124028 |
| F | Mundanifying Mystics | 2842.62070661731 |
| G | Demon Destroyers | 1352.86163670857 (next lowest: Phantom Pummelers, 1688.45809434935) |
| H | Phantom Pummelers | 1923.30132492753 |
| I | Wraith Wranglers | 2125.87216703498 |
| J | Demon Destroyers | 1915.0299245701 (next lowest: Wraith Wranglers, 2162.49691339282) |
| K | Mundanifying Mystics | 2842.16499046146 |
| L | Mundanifying Mystics | 2783.55221244497 |
| M | Spectre Slayers | 1849.71986735069 |
| N | Phantom Pummelers | 1784.8259008802 |
| O | Wraith Wranglers | 2269.45361189797 |
| P | Mundanifying Mystics | 2775.89249612121 |
| Q | Wraith Wranglers | 1748.56167086623 |
| R | Mundanifying Mystics | 2940.5652346428 |
| S | Spectre Slayers | 1666.64380523907 |
| T | Mundanifying Mystics | 2821.89307084084 |
| U | Phantom Pummelers | 1792.3319145455 |
| V | Demon Destroyers | 1472.45641559628 (next lowest: Spectre Slayers, 1670.68911559919) |
| W | Demon Destroyers | 1833.86462523462 (next lowest: Wraith Wranglers, 2229.1901870478) |

So that's my provisional solution, and I will pay the extra 400sp one-time fee so that the Demon Destroyers can deal with ghosts D, G, J, V, W.

--Edit: whoops, missed most of this paragraph (other than the Demon Destroyers): 

"Bad news! In addition to their (literally and figuratively) arcane rules about territory and prices, several of the exorcist groups have all-too-human arbitrary constraints: the Spectre Slayers and the Entity Eliminators hate each other to the point that hiring one will cause the other to refuse to work for you, the Poltergeist Pummelers are too busy to perform more than three exorcisms for you before the start of the social season, and the Demon Destroyers are from far enough away that – unless you eschew using them at all – they’ll charge a one-time 400sp fee just for showing up."

will edit to fix! post edit: Actually my initial result is still compatible with that paragraph, it doesn't involve the Entity Eliminators, and only uses the Phantom Pummelers 3 times. --

Not very confident in my solution (see things to verify above), and if it is indeed this simple it is an easier problem than I expected.

further edit (late July 15 2024): I haven't gotten around to checking those things; also, my check of linearity (where I did check) binned the data, which could be hiding all sorts of patterns.

Comment by simon on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-23T16:42:50.385Z · LW · GW

Huh, I was missing something then, yes. And retrospectively I should have thought of it -

it's literally just filling in the blanks for the light blue readout rectangle (which, from a human-centric point of view, is arguably simpler to state than my more robotic perspective, even if algorithmically more complex), and from that perspective the important thing is not some specific algorithm for grabbing the squares but just finding the pattern. I kind of feel like I failed a humanness test by not seeing that.

Comment by simon on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-23T04:50:24.905Z · LW · GW

Missed this comment chain before making my comment. My complaint is that the most natural extrapolation here (as I assess it, unless I'm missing something) would go out of bounds. So either you have ambiguity about how to deal with the out of bounds, or you have a (in my view) less natural extrapolation.

E.g. "shift towards/away from the center" is less natural than "shift to the right/left", what would you do if it were already in the center for example?

Comment by simon on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-23T03:40:33.275Z · LW · GW

Problem 2 seems badly formulated because

The simplest rule explaining the 3 example input-output pairs would make the output corresponding to the test input depend on squares out of bounds of the test input. 

To fix this, you can have some rule like: have the reflection axis shifted from the center by one in the direction of the light blue "readout" rectangle (instead of fixed at one to the right of the center), or have the reflection axis centered with a 2-square shift in a direction depending on which side of center the readout rectangle is on (instead of in a fixed direction) - but that seems strictly more complicated.

Alternatively, you could have some rule about wraparound, or e.g. using white squares if out of bounds, but what rule to use for out of bounds squares isn't determined from the example input-output pairs given.

Edit: whoops, see Fabien Roger's comment and my reply.

Comment by simon on D&D.Sci II: The Sorceror's Personal Shopper · 2024-06-21T04:17:10.079Z · LW · GW

It seems I missed this at the time, but since LessWrong's sorting algorithm has now changed to bring it up the list for me, I might as well try it:

An X-Y chart of mana vs thaumometer looked interesting; splitting it into separate charts for each colour returned useful results for blue:

  • blue gives 2 diagonal lines, one for tools/weapons, one for jewelry - for tools/weapons it's pretty accurate, +-1, but optimistic by 21 or 23 for jewelry

and... that's basically it; the thaumometer seems relatively useless for the other colours.

But: 

green gives an even number of mana that looks uniformish in the range of 2-40

yellow always gives mana in the range of 18-21

red gives mana that can be really high, up to 96, but is not uniform, median 18
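
If someone wants to reproduce the colour breakdown, a minimal sketch (the filename and column names are guesses at the dataset's layout):

```python
import pandas as pd

df = pd.read_csv("enchanted_items.csv")  # hypothetical filename/columns

# How well does the thaumometer reading predict mana, colour by colour?
for colour, rows in df.groupby("Colour"):
    err = rows["Mana"] - rows["Thaumometer"]
    print(colour, err.describe()[["mean", "std", "min", "max"]].round(1).to_dict())

# For blue, the errors should split into two tight clusters: roughly 0 (give or
# take 1) for tools/weapons, and roughly -22 (give or take 1) for jewelry. For
# the other colours the reading looks close to useless.
```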

easy strategy: 

pendant of hope (blue, 77 thaumometer reading -> 54 or 56 mana expected), 34 gp

hammer of capability (blue, 35 thaumometer reading -> 34 or 36 mana expected), 35 gp

Plough of Plenty (yellow, 18-21 mana expected), 35 gp

Warhammer of Justice +1 (yellow, 18-21 mana expected), 41 gp

For a total of at least 124 mana at the cost of 145 gp, leaving 55 gp left over

Now, if I were doing this at the time, I would likely investigate further to check if, say, high red or green values can be predicted.

But, I admit I have some meta knowledge here - it was stated in discussion of difficulty of a recent problem, if I recall correctly, that this was one of the easier ones. So, I'm guessing there isn't a hidden decipherable pattern to predict mana values for the reds and greens.

Comment by simon on D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset · 2024-06-19T20:54:54.651Z · LW · GW

You don't need to justify - hail fellow D&Dsci player, I appreciate your competition and detailed writeup of your results, and I hope to see you in the next d&dsci!

Comment by simon on D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues Evaluation & Ruleset · 2024-06-19T16:53:01.514Z · LW · GW

I liked the bonus objective myself, but maybe I'm biased about that...

As someone who is also not a "data scientist" (but just plays one on lesswrong), I also don't know what exactly actual "data science" is, but I guess it's likely intended to mean using more advanced techniques?

(And if I can pull the same Truth from the void with less powerful tools, should that not mark me as more powerful in the Art? :P)

Perhaps, but don't make a virtue of not using the more powerful tools; the objective is to find the truth, not to find it with handicaps...

Speaking of which, one thing that could help make things easier is aggregating data, eliminating information you think is irrelevant. For example, in this case, I assumed early on (without actually checking) that timing would likely be irrelevant, so I aggregated the data by ingredient combination. As in, each tried ingredient combination gets only one row, with the numbers of different outcomes listed. You can do this by assigning a unique identifier to each ingredient combination (in this case you can just concatenate the ingredient list), then counting the results for the different unique identifiers. COUNTIFS has poor performance for large data sets, but you can sort using the identifiers, then make a column that adds up the number of rows (or the number of rows with a particular outcome) since the last change in the identifier, and then filter for the last row before the change in the identifier (be wary of off-by-one errors). Then copy the result (values only) to a new sheet. (A pandas sketch of this is below.)

This also reduces the number of rows, though not enormously in this case.
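
For what it's worth, the pandas version of that aggregation - a sketch only; the filename, ingredient columns, and "Outcome" column are hypothetical stand-ins for however the data is actually laid out:

```python
import pandas as pd

df = pd.read_csv("potion_attempts.csv")  # hypothetical filename/columns
ingredient_cols = [c for c in df.columns if c.startswith("Ingredient")]  # hypothetical layout

# One row per tried ingredient combination, with counts of each outcome.
df["combo"] = df[ingredient_cols].astype(str).agg("|".join, axis=1)  # unique identifier
summary = (
    df.groupby("combo")["Outcome"]   # "Outcome" is a hypothetical column name
      .value_counts()
      .unstack(fill_value=0)
)
print(summary.head())
```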

Of course, in this case, it turns out that timing was relevant, not for outcomes but only for the ingredient selection (so I would have had to reconsider this assumption to figure out the ingredient selection).