D&D.Sci September 2022: The Allocation Helm

post by abstractapplic · 2022-09-16T23:10:23.364Z · LW · GW · 34 comments

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

You are the Allocation Helm, a piece of magical headwear employed at Swineboils College of Spellcraft and Sorcery. Your purpose is to read the minds of incoming students, and use the information you glean to Allocate them between the school’s four Houses: Dragonslayer, Thought-Talon, Serpentyne and Humblescrumble.

You’ve . . . not been doing a terribly good job lately. You were impressively competent at assigning students when newly enchanted, but over the centuries your skill and judgement have steadily unraveled, to the point where your Allocations over the most recent decade have been completely random.

Houses have begun to lose their character, Ofspev[1] ratings have plummeted, and applications have declined precipitously. There is serious talk of Swineboils being shut down. Under these circumstances, the Headmistress has been moved to desperate action, and performed a Forbidden Ritual to temporarily restore your former brilliance.

This boost will only last you for one Allocation, so you intend to make it count. Using the records of past years’ readings and ratings, you hope to raise this class’ average score to match or exceed the glory of yore. (And if you do well enough, you might even be able to convince the Headmistress to keep performing rituals . . .)

There are twenty incoming students this year. You may place them however you wish. Who goes where?


I’ll post an interactive you can use to test your choices, along with an explanation of how I generated the dataset, sometime on Monday the 26th. I’m giving you nine days, but the task shouldn’t take more than an evening or two; use Excel, R, Python, Haruspicy, or whatever other tools you think are appropriate. Let me know in the comments if you have any questions about the scenario.

If you want to investigate collaboratively and/or call your decisions in advance, feel free to do so in the comments; however, please use spoiler tags or rot13 when sharing inferences/strategies/decisions, so people intending to fly solo can look for clarifications without being spoiled.

 

  1. ^

    The Oracle for Spellcaster Evaluations, who, shortly after each student is Allocated, predicts a quantitative measure of the lifetime impact that student will have on the world. (No-one knows how to make him predict anything else, or predict at any other time, or stop predicting, or be affected by the passage of time.)

34 comments

Comments sorted by top scores.

comment by gjm · 2022-09-17T02:08:26.517Z · LW(p) · GW(p)

First-order attempt:

I used scikit-learn to build several random-forest regressors mapping attributes + house to Ofspev rating, and verified that early on the Helm tended to allocate students to the house for which the regressor predicted the best rating, and that at the end it didn't. Then I Sorted ... excuse me, Allocated ... the students to the houses for which the regressors predicted the best rating. In cases where they disagreed I tried to eyeball the distributions and use my judgement :-).
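
A minimal sketch of this approach (not gjm's actual code; the filename and column names are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

ATTRS = ["Intellect", "Integrity", "Courage", "Reflexes", "Patience"]  # assumed names
HOUSES = ["Dragonslayer", "Thought-Talon", "Serpentyne", "Humblescrumble"]

df = pd.read_csv("swineboils_records.csv")  # hypothetical filename
X = pd.get_dummies(df[ATTRS + ["House"]], columns=["House"]).astype(float)
y = df["Ofspev"]

# one member of the ensemble; gjm used 16, each trained on a different 90% of the data
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

def predict_by_house(stats):
    """Predict the rating a student would get in each of the four houses."""
    rows = pd.DataFrame([dict(zip(ATTRS, stats), House=h) for h in HOUSES])
    rows = pd.get_dummies(rows, columns=["House"]).reindex(columns=X.columns, fill_value=0)
    return dict(zip(HOUSES, model.predict(rows.astype(float))))
```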

Resulting allocation:

Serpentyne gets C, F, K. Humblescrumble gets E, I, L, M, P, Q, R, T. Dragonslayer gets D, G, H, N. Thought-Talon gets A, B, J, O, S.

Most of these results are pretty clear-cut in that every prediction for the winning house was better than any prediction for any other house. Notable exceptions were E (who might do well in Thought-Talon or maybe in Dragonslayer; certainly not Serpentyne, though) and Q (for whom all houses gave rather similar predictions).

With these allocations I cautiously predict the following Ofspev ratings: A 36..39, B 15..18, C 28..30, D 17..19, E 17..19, F 34..40, G 21..23, H 15..19, I 12..14, J 27..30, K 23..26, L 22..26, M 24..29, O 18..23, P 22..25, Q 30..32, R 40..44, S 29..32, T 28..34. These intervals are probably too narrow; they are determined by the range of variation among the 16 regressors I used, but the overall prediction errors for these regressors are wider than those ranges.

Possible reasons why this might suck:

  1. Early on we only get information about how students perform in the house to which a skilled Helm allocated them. This means that we have less information about how they would have performed in other houses.
  2. I am assuming that our only goal is to make our students as impactful as possible. It may be that the Helm actually has other aims (e.g., to make them happy or psychologically well-adjusted), in which case we should be trying to reproduce early-Helm allocations instead.
  3. I just used an out-of-the-box regressor that seemed plausible. I haven't tried to tune its parameters.

I have made a cursory attempt to understand what my black boxes are doing

by feeding in all the 2^5 attribute-vectors where each one is either 10 (low) or 40 (high) and seeing what the predictions for each house look like. Crudely, it seems as if: students do well in Serpentyne when they have high Intellect and either Reflexes or Patience; in Humblescrumble when they have high Intellect and Integrity, with Patience serving as a partial stand-in for either; in Dragonslayer when they have high Courage and Reflexes; in Thought-Talon when they have high Intellect and Patience. Students high in all five attributes do exceptionally well in Dragonslayer and Thought-Talon; for Thought-Talon but not for Dragonslayer it's almost as good to be high in everything except Reflexes.
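
A sketch of that probing step, reusing the hypothetical `predict_by_house` and `ATTRS` from the snippet above:

```python
import itertools

# enumerate all 2^5 attribute vectors with each stat either 10 (low) or 40 (high)
for combo in itertools.product([10, 40], repeat=5):
    preds = predict_by_house(combo)
    print(dict(zip(ATTRS, combo)), {h: round(p, 1) for h, p in preds.items()})
```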

I think I could quantify those observations (and maybe a few second-order effects I didn't mention explicitly) and get an explicit model that would serve the Helm pretty well in practice, though I doubt it would outperform the brute-force random forests.

Replies from: gjm, gjm
comment by gjm · 2022-09-21T02:44:29.619Z · LW(p) · GW(p)

Further noodling around with ad hoc models suggests that

in at least some cases, some of the students' attributes are best thought of as having limits such that increases above the limits make no difference. Specifically, I played around a little with Serpentyne and it seems that we probably want to look at min(40,Intellect) and min(65,Reflexes) rather than using those values unaltered. The limits might well be different for different houses (analogy: intelligence is probably an advantage both for theoretical physicists and for taxi drivers, but most likely being 1-in-a-million smart rather than "just" 1-in-a-thousand is more beneficial for the theoretical physicists); so far this is just the result of idly looking at one particular house.
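
One way to test the capped-attribute hypothesis (a sketch under the same assumed filename and column names, not gjm's code):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("swineboils_records.csv")  # hypothetical filename
serp = df[df["House"] == "Serpentyne"].copy()
serp["Intellect_capped"] = serp["Intellect"].clip(upper=40)
serp["Reflexes_capped"] = serp["Reflexes"].clip(upper=65)

# if the caps are real, the clipped features should explain the ratings better
for cols in (["Intellect", "Reflexes"], ["Intellect_capped", "Reflexes_capped"]):
    fit = LinearRegression().fit(serp[cols], serp["Ofspev"])
    print(cols, round(fit.score(serp[cols], serp["Ofspev"]), 3))  # R^2
```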

comment by gjm · 2022-09-18T11:59:04.949Z · LW(p) · GW(p)

Continuing with the principle "when in doubt, use brute force",

I did the same thing with gradient-boosted trees; these had somewhat more prediction error on each validation set (oh, I forgot to mention that each regressor was trained on 90% of the data and evaluated on the remaining 10%). And with SVMs using radial basis functions; these were comparable in accuracy to the random forests. (Note: There's much less diversity in my ensemble of SVMs, because the only difference between them is the training/validation split, whereas for RFs and GBTs there is randomness in the fitting process itself.)
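
A sketch of the comparison (model choices and the 90/10 splits as described; the 16 splits, filename, and column names are assumptions):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

ATTRS = ["Intellect", "Integrity", "Courage", "Reflexes", "Patience"]  # assumed names
df = pd.read_csv("swineboils_records.csv")  # hypothetical filename
X = pd.get_dummies(df[ATTRS + ["House"]], columns=["House"]).astype(float)
y = df["Ofspev"]

for seed in range(16):  # one 90/10 training/validation split per ensemble member
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=seed)
    for name, m in [("GBT", GradientBoostingRegressor(random_state=seed)),
                    ("SVM-RBF", SVR(kernel="rbf"))]:
        m.fit(X_tr, y_tr)
        print(seed, name, round(mean_absolute_error(y_va, m.predict(X_va)), 2))
```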

Did this make a difference to my predictions or suggestions?

Not much; usually all three agreed; where they didn't, usually the SVM agreed with the RF. However, the SVM regressors fairly confidently want to put K in Dragonslayer, and the GBT ones less confidently agree. On the other hand, they predict less loss from putting K in the RFs' suggestion of Serpentyne than the RFs do from putting K in Dragonslayer, so it's a tough call. I'll switch to putting K in Dragonslayer. And for Q, the RFs and GBTs are fairly indifferent between all houses and slightly prefer Humblescrumble (for the RFs) and Thought-Talon (for the GBTs), but the SVMs think that Dragonslayer and Thought-Talon are much better than the other two, and give the nod to Dragonslayer. Looking at all their numbers, I'll move Q into Dragonslayer.

So my revised allocations are:

Serpentyne gets C, F. Humblescrumble gets E, I, L, M, P, R, T. Dragonslayer gets D, G, H, K, N, Q. Thought-Talon gets A, B, J, O, S.

And my revised predicted ratings with these allocations are:

A 36..39, B 15..18, C 28..30, D 17..20, E 17..19, F 33..40, G 21..24, H 15..20, I 12..15, J 24..30, K 23..26, L 21..26, M 24..29, N 19..21, O 18..23, P 22..25, Q 27..34, R 40..44, S 29..32, T 28..34.

Replies from: gjm
comment by gjm · 2022-09-21T02:34:30.092Z · LW(p) · GW(p)

It occurs to me that there is a possible source of bias in the approach I am taking:

perhaps not everyone gets into Swineboils, so that e.g. if we looked for correlations between the attributes we would get spurious negative correlations because people who are bad at everything don't get in. If such effects are strong, then our model will have bias if we apply it to the population at large. That's OK because we aren't applying it to the population at large, we're applying it to the students who got in ... but if e.g. Swineboils is less selective than it used to be because there are way fewer applications, then the bias will distort our predictions for this year's students.
(This is the phenomenon sometimes called "Berkson's paradox".)

I don't expect this bias to be very large, but I have made no attempt to check that expectation against reality. Still less have I made any attempt to correct for it, and probably I won't.

comment by aphyer · 2022-09-21T19:41:00.953Z · LW(p) · GW(p)

Current model of how your mistakes work:

Your mistakes have always taken the form of giving random answers to a random set of students.  You did not e.g. get worse at solving difficult problems earlier, and then gradually lose the ability to solve easy problems as well.  

The probability of you giving a random answer began at 10% in 1511.  (You did not allocate perfectly even then).  Starting in 1700, it began to increase linearly, until it reached 100% in 2000.

This logic is based on: student 37 strongly suggesting that you can make classification mistakes early, and even in obvious cases; and looking at '% of INT<10 students in Thought-Talon' and '% of COU<10 students in Dragonslayer' as relatively unambiguous mistakes we can track the frequency of.
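
aphyer's conjectured schedule, written out as a function (a sketch of the hypothesis, not posted code):

```python
def p_random_allocation(year):
    """Conjectured probability that the Helm allocates a given student at random."""
    if year < 1700:
        return 0.10                                    # constant 10% from 1511
    return min(1.0, 0.10 + 0.90 * (year - 1700) / (2000 - 1700))  # 100% by 2000
```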

Replies from: Vaniver
comment by Vaniver · 2022-09-26T04:28:37.214Z · LW(p) · GW(p)

This logic is based on: student 37 strongly suggesting that you can make classification mistakes early, and even in obvious cases; and looking at '% of INT<10 students in Thought-Talon' and '% of COU<10 students in Dragonslayer' as relatively unambiguous mistakes we can track the frequency of.

Tho presumably it could be the case that even if a student will be a poor fit for Thought-Talon, they would be an even poorer fit everywhere else?

comment by Grey Wolf (grey-wolf) · 2022-09-25T22:19:04.951Z · LW(p) · GW(p)

I trained a boosting model on the whole dataset (minus the year column) that predicts the Ofspev score. Allocating a student is then basically just iterating through the four houses and picking the one with the maximum predicted score.

As a sanity check of my model, I sliced the dataset into a few parts to confirm that we (the Allocation Helm) got worse over time. This wasn't very rigorous, and spending more time would definitely have helped to work out how to mathematically define our degradation. But my testing generally confirmed the downward trend.
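
A sketch of this setup (`GradientBoostingRegressor` is an assumed stand-in for the unspecified boosting model; the filename and column names are assumptions too):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

ATTRS = ["Intellect", "Integrity", "Courage", "Reflexes", "Patience"]
HOUSES = ["Dragonslayer", "Thought-Talon", "Serpentyne", "Humblescrumble"]

df = pd.read_csv("swineboils_records.csv").drop(columns=["Year"])  # hypothetical file
X = pd.get_dummies(df[ATTRS + ["House"]], columns=["House"]).astype(float)
model = GradientBoostingRegressor().fit(X, df["Ofspev"])

def allocate(stats):
    """Score all four houses for one student and keep the best."""
    rows = pd.DataFrame([dict(zip(ATTRS, stats), House=h) for h in HOUSES])
    rows = pd.get_dummies(rows, columns=["House"]).reindex(columns=X.columns, fill_value=0)
    return HOUSES[model.predict(rows.astype(float)).argmax()]
```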

In the end these are my allocations:

Student     House
A       Thought-Talon
B       Humblescrumble
C       Serpentyne
D       Dragonslayer
E       Humblescrumble
F       Serpentyne
G       Dragonslayer
H       Dragonslayer
I       Humblescrumble
J       Thought-Talon
K       Dragonslayer
L       Humblescrumble
M       Humblescrumble
N       Dragonslayer
O       Thought-Talon
P       Humblescrumble
Q       Thought-Talon
R       Humblescrumble
S       Thought-Talon
T       Humblescrumble

comment by SarahNibs (GuySrinivasan) · 2022-09-19T21:15:53.098Z · LW(p) · GW(p)

Students may reach their potential in many ways, as long as they are not actively prevented.

Through sophisticated techniques (eyeballing), my own hat has recommended:

Dragonslayer [G, K, N]
Humblescrumble [A, E, R, T]
Humblescrumble? [L, M]
Serpentyne [C, F, H, O, S]
Serpentyne :( [B, D]
Serpentyne/Humblescrumble [Q]
Serpentyne? [P]
Thought-Talon [J]
Thought-Talon :( :( [I]

Otherwise known as:

Dragonslayer: [G, K, N]
Thought-Talon: [I, J]
Serpentyne: [B, C, D, F, H, O, P, Q, S]
Humblescrumble: [A, E, L, M, R, T]
(Completely revised in followup comment)

Replies from: GuySrinivasan
comment by SarahNibs (GuySrinivasan) · 2022-09-26T03:08:28.540Z · LW(p) · GW(p)

The Ofspev rating of someone sorted into Thought-Talon can be modeled as follows:

lower = 1/2 x min(Intellect, Patience)
upper = 3/2 x min(Intellect, Patience)
~triangular distribution with min=lower, max=upper, mode=30% of the way from lower to upper

Each other house can be modeled similarly. ...not that I fully succeeded at doing so. Just a guess. But sketching:

Serpentyne is between 3/4 x [min(Intellect, Reflexes, Patience) - 10] and max(Reflexes, Patience) - 5

Humblescrumble is between, uh,
max(max(8, 3/4 x (min(Integrity, Intellect) - 15)), 1/4 x (max(Integrity, Patience) + 5))
and
min(max(30, 3/4 x (min(Integrity, Intellect) + 15)), 3/4 x (max(Integrity, Patience) + 5))
which is definitely 100% accurate

Dragonslayer is between max(5/6 x min(everything) - 4, 3/2 x min(everything) - 20) and 3/2 x min(Courage, max(everything else)); also, this one doesn't yield something that looks triangular, so yeah, probably not that.
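
If those triangular models are right, expected values are easy to compute; e.g. for Thought-Talon (a sketch; the mean of a triangular(lower, mode, upper) distribution is (lower + mode + upper) / 3):

```python
def thought_talon_ev(intellect, patience):
    """Expected Ofspev under the conjectured Thought-Talon model."""
    base = min(intellect, patience)
    lower, upper = 0.5 * base, 1.5 * base
    mode = lower + 0.3 * (upper - lower)  # mode 30% of the way from lower to upper
    return (lower + mode + upper) / 3     # mean of a triangular distribution
```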

In any case, trying to maximize EV assuming those are right yields my new submission:

Dragonslayer [A, E, H]
Humblescrumble [D, I, R]
Serpentyne [G, L, M, N, P]
Thought-Talon [B, C, F, J, K, O, Q, S, T]

comment by aphyer · 2022-09-18T17:28:49.885Z · LW(p) · GW(p)

Edited to put my final answer at the top for ease of reading:

Thought-Talon: A, J, O, S

Serpentyne: C, F, P

Dragonslayer: D, G, H, K, N, Q

Humblescrumble: B, E, I, L, M, R, T

A starting approach and some basic analysis:
 


I'm going to approach this by trying to minimize the amount of unpredicted variance in the data.


Our initial prediction, without using the houses or stats at all, is to predict all students at the average rating of 25.9436.  The residual has std 9.8649. Over the course of improving our model, we'll try to reduce this. 

Additionally, we can calculate a correlation of this residual with year, getting -0.1711.  This reflects that, while we don't yet know why, the earlier years performed better and the later ones worse (since in earlier years we were assigning them better).  As we get better at predicting ourselves, this correlation should shrink - if it hits zero, that will suggest that we've figured out everything that we used to know at our height.

First, most basic, improvement: we run a regression model to predict rating based on the five stats (ignoring house for now).  This predicts score of: 

-1.2239 + (0.2519 * Intellect) + (0.1314 * Integrity) + (0.1441 * Courage) + (0.1307 * Reflexes) + (0.1765 * Patience).


We expect this to reduce the std of the residual, but to somewhat increase the correlation of the residual with year: indeed, unexplained std drops to 7.6427 while the negative correlation with year strengthens to -0.2211.

 

Next, also pretty basic, improvement: we run that regression model separately for each house.

Serpentyne regression: -1.2733 + (0.3256 * Intellect) + (-0.0120 * Integrity) + (-0.0001 * Courage) + (0.2324 * Reflexes) + (0.2284 * Patience)
Humblescrumble regression: 4.4478 + (0.1882 * Intellect) + (0.2691 * Integrity) + (-0.0029 * Courage) + (0.0013 * Reflexes) + (0.1111 * Patience)
Dragonslayer regression: -6.2998 + (0.1341 * Intellect) + (0.1193 * Integrity) + (0.3397 * Courage) + (0.2300 * Reflexes) + (0.1056 * Patience)
Thought-Talon regression: -7.3507 + (0.3684 * Intellect) + (0.1400 * Integrity) + (0.1155 * Courage) + (-0.0555 * Reflexes) + (0.3643 * Patience)

So Serpentyne and Thought-Talon use Intellect more, Dragonslayer uses Courage more, and Humblescrumble gives a higher base number regardless of stats (accepting everyone Hufflepuff-style?) while caring more about Integrity.  Also, a few terms are very close to 0, suggesting that e.g. Serpentyne and Humblescrumble do not care at all about Courage.  This leaves us with a residual std of 6.5525, which still has a negative correlation with year of -0.1259, suggesting that there are still many more things to be found.
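
A sketch of the per-house regression step (filename and column names assumed, not aphyer's actual code):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

ATTRS = ["Intellect", "Integrity", "Courage", "Reflexes", "Patience"]
df = pd.read_csv("swineboils_records.csv")  # hypothetical filename

for house, sub in df.groupby("House"):
    reg = LinearRegression().fit(sub[ATTRS], sub["Ofspev"])
    print(house, round(reg.intercept_, 4), dict(zip(ATTRS, reg.coef_.round(4))))
```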

Still, in case I get too busy to continue, the preliminary regression gives the following house allocations:


Serpentyne
['C', 'F', 'I', 'K', 'L', 'O', 'P', 'S']
Humblescrumble
['B', 'E', 'Q', 'R', 'T']
Dragonslayer
['D', 'G', 'H', 'N']
Thought-Talon
['A', 'J', 'M']

My next step of investigation is going to be stat interaction effects.  Does Integrity affect people with low Intellect more (since they might have more temptation to cheat)?  Do Reflexes matter more for people with high Courage (who would be more likely to put themselves in dangerous situations where Courage is needed)?
 

Replies from: aphyer
comment by aphyer · 2022-09-21T01:49:14.324Z · LW(p) · GW(p)

Had trouble making further progress using that method, realized I was being silly about this and there was a much easier starting solution:

Rather than trying to figure out anything whatsoever about scores, we're trying for now just to mimic what we did in the past. 

Define a metric of 'distance' between two people equal to the sum of the absolute values of the differences between their stats.

To evaluate a person:

  • Find the 10* students with the smallest distances from them who were sorted pre-1700*
  • Assume that those students were similar to them, and were sorted correctly.  Sort them however the majority were sorted.

*these numbers may be varied to optimize.  For example, moving the year threshold earlier makes you more certain that the students you find were correctly sorted...at the expense of making them be selected from a smaller population and so be further away from the person you're evaluating.  I may twiddle these numbers in future and see if I can do better.
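
A sketch of this procedure (not aphyer's actual code; filename and column names assumed):

```python
import pandas as pd

ATTRS = ["Intellect", "Integrity", "Courage", "Reflexes", "Patience"]
df = pd.read_csv("swineboils_records.csv")  # hypothetical filename
skilled_era = df[df["Year"] < 1700]         # when the Helm was still mostly reliable

def allocate(stats, k=10):
    """Majority house among the k nearest (L1 distance) pre-1700 students."""
    target = pd.Series(dict(zip(ATTRS, stats)))
    dists = (skilled_era[ATTRS] - target).abs().sum(axis=1)
    return skilled_era.loc[dists.nsmallest(k).index, "House"].mode()[0]
```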

 

We can test this algorithm by trying it on the students from 1511 (and using students from 1512-1699 to find close matches).  When we do this:

  • 49 students are sorted by this method into the same house we sorted them into in 1511.
  • 3 students are ambiguous (e.g. we see a 5-5 split among the 10 closest students, one of which is the house we chose).
  • 8 students are sorted differently.
    • Some of these are very dramatically different.  For example, student 37 had Intellect 7 and Integrity 61.  All students with stats even vaguely near that were sorted into Humblescrumble, which makes sense given that house's focus on Integrity.  However, Student 37 was sorted into Thought-Talon, which seems very odd given their extremely low Intellect.
    • The most likely explanation for this is that our sorting wasn't perfect even in 1511.  Student 37 did quite badly, which suggests this is plausible.
    • The less likely but scarier explanation is that our sorting in 1511 was based on something other than stats (a hidden stat that we can no longer see?  Cohort effects?)

Sadly this method provides no insight whatsoever into the underlying world.  We're copying what we did in the past, but we're not actually learning anything.  I still think it's better than any explicit model I've built so far.

 

This gives the following current allocations for our students (still subject to future meddling):

Thought-Talon: A, J, O, S

Serpentyne: C, F*

Dragonslayer: D, H, G*, K*, N*, Q*

Humblescrumble: B*, E*, I, L, M*, P*, R, T

where entries marked with a * are those where the nearby students were a somewhat close split, while those without are those where the nearby students were clearly almost all in the same house.

 

And some questions for the GM based on something I ran into doing this (if you think these are questions you're not comfortable answering that's fine, but if they were meant to be clear one way or the other from the prompt please let me know):

 The problem statement says we were 'impressively competent' at assigning students when first enchanted.

  • Should we take this to mean we were perfect, or should we take this to mean that we were fairly good but could possibly be even better?
  • When first enchanted, did we definitely still only use the five stats specified here to classify students, or is it possible that we were able to identify an additional stat (Fated?  Protagonist-hood?) that we can no longer perceive, and sorted students based on that?
Replies from: aphyer, gjm
comment by aphyer · 2022-09-21T02:59:59.659Z · LW(p) · GW(p)

Robustness analysis: seeing how the above changes when we tweak various aspects of the algorithm.

  • Requiring Ofspev rating at least 20 (fewer samples, less likely mis-sorted; might introduce some bias if e.g. some houses have higher variance than others):
    • B shifts from Humblescrumble to Thought-Talon.
    • I shifts from Humblescrumble to Serpentyne.
    • K shifts from Dragonslayer to Serpentyne.
    • P shifts from Humblescrumble to Serpentyne.
  • Changing threshold year to 1800 (closer samples, more of them mis-sorted): 
    • F ambiguously might shift from Serpentyne to Thought-Talon (5-5).  
    • K shifts from Dragonslayer to Serpentyne.
    • P ambiguously might shift from Humblescrumble to Serpentyne (4-4-1-1)
  • Changing threshold year to 1600 (fewer samples, less likely mis-sorted):
    • F ambiguously might shift from Serpentyne to Thought-Talon (5-5).  
    • K ambiguously might shift from Dragonslayer to Serpentyne (5-5).
    • P shifts from Humblescrumble to Serpentyne. 
  • Increasing # of samples used to 20 (less risk of one of them being mis-sorted, but they are less good comparisons):
    • K shifts from Dragonslayer to Serpentyne (just barely, 10-9-1).

I'm not certain whether this will end up changing my views, but K in particular looks very close between Dragonslayer and Serpentyne, and P plausibly better in Serpentyne.

Replies from: gjm, aphyer
comment by gjm · 2022-09-21T17:10:52.979Z · LW(p) · GW(p)

According to my models

B indeed belongs in Th rather than Hu, but it's close and not very clear. I belongs in Hu rather than Se according to all my models, but it's close. My models disagree with one another about K, some preferring Dr narrowly and fewer preferring Se  less narrowly. Most of my models put P in Hu not Se, and the ones that put it in Se are ones with larger errors. My models disagree with one another about F, preferring Se or Th and not expecting much difference between those.

(aphyer, I don't know whether you would prefer me not to say such things in case you are tempted to read them. I will desist if you prefer. The approaches we're taking are sufficiently different that I don't think there is much actual harm in reading about one another's results.)

Replies from: aphyer
comment by aphyer · 2022-09-21T17:48:46.928Z · LW(p) · GW(p)

No objection to you commenting.  The main risk on my end is that my fundamental contrariness will lead me to disagree with you wherever possible, so if you do end up being right about everything you can lure me into being wrong just to disagree with you.

 

 P is a very odd statblock, with huge Patience and incredibly low Courage and Integrity. (P-eter Pettigrew?)  I might trust your models more than my approach on students like B, who have middle-of-the-road stats but happen to be sitting near a house boundary.  I'm less sure how much I trust your models on extreme cases like P, and think there might be more benefit there to an approach that just looks at a dozen or so students with similar statblocks rather than trying to extrapolate a model out to those far values.

comment by aphyer · 2022-09-21T16:06:16.184Z · LW(p) · GW(p)

 Based on poking at the score figures, I think I'm currently going to move student P from Humblescrumble to Serpentyne but not touch the other ambiguous ones:

Thought-Talon: A, J, O, S

Serpentyne: C, F, P

Dragonslayer: D, G, H, K, N, Q

Humblescrumble: B, E, I, L, M, R, T

comment by gjm · 2022-09-21T02:27:46.045Z · LW(p) · GW(p)

You haven't sorted student G.

I remark that (note: not much spoilage here, but a little)

your allocations are very similar to mine, even though my approach was quite different; maybe this kinda-validates what both of us are doing. Ignoring the missing student G, I think we disagree only about B, and neither of us was very sure about B.

Replies from: aphyer
comment by aphyer · 2022-09-21T02:33:23.074Z · LW(p) · GW(p)

Good catch, fixed.

Replies from: gjm
comment by gjm · 2022-09-21T02:36:56.494Z · LW(p) · GW(p)

With that fix

student B is indeed the only one we (both unconfidently) disagree on.

comment by kave · 2022-09-17T22:09:44.318Z · LW(p) · GW(p)

Seems like the "year" column is missing(?) from the records

Replies from: abstractapplic
comment by abstractapplic · 2022-09-17T22:25:26.580Z · LW(p) · GW(p)

Good catch; fixed now; thank you.

comment by DaveEtCircenses · 2022-09-26T17:41:15.207Z · LW(p) · GW(p)

A solution by method of "Thrash with linear regression, then get bored". I also make the (completely unsubstantiated) claim that an even split of students across houses will lead to better results.

Humblescrumble gets A,B,E,R and T.

Dragonslayer gets D,G,H,K and N.

Thought-Talon gets C,F,L*,M and Q*.

Serpentyne gets I*,J*,O,P and S*.

(Students marked * get a slightly better linear score in another House, but I balance the sizes)

comment by simon · 2022-09-26T09:12:38.207Z · LW(p) · GW(p)

My entry just before the deadline:

Dragonslayer: D,G,H,N,Q

Humblescrumble: E,I,L,M,R,T

Serpentyne: C,F,K,P

Thought-Talon: A,B,J,O,S

Compared with gjm, I disagree (unconfidently) on K and P

Compared with aphyer, I disagree (unconfidently) on B and K 

Compared with Thomas Sepulchre, I disagree (unconfidently) on P only, agreeing with everything else.

(note that on my reading of aphyer and gjm's entries,  they disagree on B and P, despite them saying they only disagree on B)

I used ad-hoc local methods, which ultimately did not provide much insight, unfortunately.

Replies from: gjm, Thomas Sepulchre
comment by gjm · 2022-09-26T18:53:56.556Z · LW(p) · GW(p)

I disagree only about B with one version of aphyer's allocations. It is possible that that was out of date at the point when I said "we disagree only about B" but I'm not sure. Anyway, yes, now we do disagree with one another about P as well.

comment by Thomas Sepulchre · 2022-09-26T09:31:08.115Z · LW(p) · GW(p)

Out of curiosity, can you, if you don't mind, describe what methods you used?

Replies from: simon
comment by simon · 2022-09-26T09:54:42.144Z · LW(p) · GW(p)

methods:

I took the 100 nearest (Euclidean distance in stat-space) students from each house and did linear regression to predict the value for the student in question, then arbitrarily changed my answers based on e.g. residuals of the nearest points or too low a density near the point in question. I then did the same with the 20 nearest students for certain of the incoming students (those which I had noted to be questionable in some way or another, or which disagreed with aphyer or gjm). In the end I may have decided some of the more ambiguous cases based on too low a local density of some houses, which may explain why my results are so similar to yours (I did not check your results until after arriving at mine).

edit: actually this did provide some insight, in terms of seeing how the regression coefficients change locally (e.g. often the lowest house-relevant stat is most relevant), and I did try a bit to come up with  global formulas (like GuySrinivasan's) but I didn't get far with that.
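
A sketch of the local-regression step simon describes (filename and column names assumed):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

ATTRS = ["Intellect", "Integrity", "Courage", "Reflexes", "Patience"]
df = pd.read_csv("swineboils_records.csv")  # hypothetical filename

def local_prediction(stats, house, k=100):
    """Fit a linear model on the k nearest (Euclidean) past students of one house."""
    sub = df[df["House"] == house]
    d = np.linalg.norm(sub[ATTRS].to_numpy() - np.asarray(stats, dtype=float), axis=1)
    nearest = sub.iloc[np.argsort(d)[:k]]
    reg = LinearRegression().fit(nearest[ATTRS], nearest["Ofspev"])
    return reg.predict(pd.DataFrame([dict(zip(ATTRS, stats))]))[0]
```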

comment by Yonge · 2022-09-22T22:55:44.063Z · LW(p) · GW(p)

Some traits definitely go better with some houses; however, I couldn't see much in the way of clear-cut rules. I constructed the following highly provisional allocation by considering students who were sorted when the helm was still reasonably reliable, combining, for each of the five stat ratings, the probability that a student with that rating would be sorted into each house, and selecting the house which on balance seemed most likely.

A    Dragonslayer
B    Thought-Talon
C    Serpentyne
D    Dragonslayer
E    Humblescrumble
F    Serpentyne
G    Dragonslayer
H    Humblescrumble
I    Humblescrumble
J    Thought-Talon
K    Dragonslayer
L    Dragonslayer
M    Dragonslayer
N    Dragonslayer
O    Serpentyne
P    Thought-Talon
Q    Dragonslayer
R    Humblescrumble
S    Serpentyne
T    Humblescrumble
 

comment by Thomas Sepulchre · 2022-09-20T08:09:17.067Z · LW(p) · GW(p)

A few observations

Looking at the moving average of the Ofspev rating, it seems the helm slowly stopped providing good allocations starting around the 10,000th student. This opens up the opportunity for a black-box approach, where one could simply train a model to replicate the performance of the initial helm, without any gears-level understanding. This might prove useful if the gears-level understanding is really complicated, but it might also limit our results, especially if the original helm was good but far from perfect.

The number of students each year does indeed decrease over the last few decades, which must be a consequence of the fact that applications have declined precipitously.

The individual skills of students don't seem to decrease over time, so, despite the uninterrupted whining of archmages in the newspapers, the lower Ofspev rating is not explained by the alleged "laziness" of this "spoiled generation".
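
The moving-average observation is easy to reproduce (a sketch; the filename, window size, and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("swineboils_records.csv")  # hypothetical filename
df["Ofspev"].rolling(window=1000).mean().plot()  # decline starts near student 10,000
plt.xlabel("student index")
plt.ylabel("rolling mean Ofspev")
plt.show()
```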

Replies from: Thomas Sepulchre
comment by Thomas Sepulchre · 2022-09-21T14:03:28.115Z · LW(p) · GW(p)

So, I did precisely that. I trained a classifier on the first 7500 students to mimic the behavior of the original helm. 
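
A sketch of that mimicry classifier (Thomas Sepulchre doesn't say which model he used; a random forest is an assumed stand-in, as are the filenames and column names):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

ATTRS = ["Intellect", "Integrity", "Courage", "Reflexes", "Patience"]
df = pd.read_csv("swineboils_records.csv")       # hypothetical filename

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(df.iloc[:7500][ATTRS], df.iloc[:7500]["House"])  # the reliable-Helm era

incoming = pd.read_csv("incoming_students.csv")  # hypothetical filename
for student, house in zip(incoming["Student"], clf.predict(incoming[ATTRS])):
    print(student, house)
```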

My predictions:

Serpentyne: C,F,K

Dragonslayer: D,G,H,N,Q

Humblescrumble: E,I,L,M,P,R,T

Thought-Talon: A,B,J,O,S

comment by Gunnar_Zarncke · 2022-09-17T00:44:24.259Z · LW(p) · GW(p)

I haven't looked at the data but some quick meta thoughts:

  • Obvious: samples should be weighted by the amount of signal they have: 100% at the beginning, 0% at the end.
  • The noise at the end has some benefits: the effect of the Houses on Ofspev can be learned from it. It is an unintended RCT.
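
A minimal sketch of the weighting idea (the linear schedule from 1511 to 2000 is an assumption, as are the filename and column names):

```python
import pandas as pd

df = pd.read_csv("swineboils_records.csv")  # hypothetical filename
weights = ((2000 - df["Year"]) / (2000 - 1511)).clip(0, 1)  # 1 at the start, 0 at the end
# then e.g. model.fit(X, y, sample_weight=weights) for any sklearn estimator that supports it
```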

comment by outofculture · 2022-09-17T00:39:33.235Z · LW(p) · GW(p)

The new student data already have house entries. Is that a mistake?

Replies from: abstractapplic
comment by abstractapplic · 2022-09-17T02:24:42.939Z · LW(p) · GW(p)

It was, though fortunately that was just the random Houses they would have been Allocated to, and as such provides no meaningful information. Still, I've updated the file to not have that column; thank you.

comment by gjm · 2022-09-17T00:45:17.147Z · LW(p) · GW(p)

Is the goal (1) to allocate these new students to the Houses they would have been put in by the Helm at the peak of its abilities, or (2) to allocate these new students in whatever manner maximizes their Ofspev[1] scores? Or are we to understand that these are more or less the same thing?

[1] Obviously this really stands for the Office for Standards in Potter-Evans-Verres.

comment by Christian Z R · 2024-10-31T08:48:09.554Z · LW(p) · GW(p)

Just putting a guess in here, before I go check if it is true:
 

Actually the 'Houses' have no effect; they are just the names of the different groups. In order to get a good rating, the members of each house should be as close as possible in stat-space, or perhaps all be high in one stat (still experimenting with this). Since the early students were all placed by a functioning hat, each house had a well-defined place in stat-space that it would carry forward. But since all current students have been randomly selected, we don't have to worry about this historical data. Instead, we should try to get the new students as close as possible to the randomly generated spots in stat-space of the current students. As such, I think Serpentyne might become the new House of Integrity. (I do believe a strange thing like this is also happening in real life, and is one of the main ways that political parties gradually change their positions in stat-space.)