I like this genre of task. I didn't quite understand what you meant about being able to score the human immediately - presumably we're also interested in how well the human could do given more learning?
Yes, I suppose so. I assumed (without noticing I was doing so) that humans wouldn't get that much better at the 'match words' game given more learning time than the 6-hour baseline of task length. But that is not necessarily true. I do think a lot of the relevant learning is in-context and varies from instance to instance ("how are the other players playing? what strategies are being used in this game? how are their strategies evolving in response to the game?").
It seems like a good consideration to bear in mind when selecting games for this genre of task: how much easier the task gets for humans given learning time (and hence how much time you'd need to invest in evaluating human performance).
Another bucket of games that might be good fodder for this task genre is 'social deduction' games, where deception, seeing through deception, and using allegiances are crucial subtasks. I think for social deduction games, or for manipulation and deception in general, the top capability level achievable by humans is exceedingly high (it's more chess-like than tic-tac-toe-like), and would take a lot of time to attain. It's high because the better your opponent is, the better you need to be.
Possible tweaks to the 'match words' game to introduce deception:
- introduce the possibility that some players may have other goals, e.g. trying to minimize their own scores, or minimize/maximize group/team scores.
- introduce the facility for players to try to influence each other's behaviour between rounds (e.g. by allowing private and public chat between players). This would facilitate the building of alliances / reciprocal behaviour / tit-for-tat.
I have a task idea that falls outside the domains you listed as being of interest. The task essentially involves playing a game strategically against other agents (human or AI), where the rules, outputs and scoring are simple, but strategy is complex.* As such it would test threat-model-relevant skills like modelling and predicting other players (even recursively modelling the other players' models?), and doing in-context learning about them. The difficulty of the task depends on how good your opponents are, and how many of them there are. It's unlike many of your example tasks because, despite a high upper bound on difficulty, it doesn't necessarily take very long to 'implement' - e.g. to benchmark against a human, you can just tell the human the rules of the game, let them ask clarifying questions, and then immediately score their performance in the game. (Unless you specify the task as 'build a GOFAI to play this game', in which case it could be a normal task duration.)
How interested are you in a task such as this?
If it's of interest, should the opponent players be human or AI?
Some thoughts:
Pros of playing against humans:
- Maybe the task is more threat-model-relevant when the other players are human (since modelling humans might be harder or just a different capability than modelling AI agents).
- The benchmark would be more headline-worthy when the opponents are humans (or human experts).
Cons of playing against humans:
- In order to fit the desideratum "It's great if the task is reasonable to perform without requiring interacting with the live internet", the other players would need to be AI agents packaged with the task.
- Finding human experts is really hard or expensive (could use non-experts).
- Human performance might vary too much from person to person, or over time, for the task to be a reliable/stable benchmark, while GPT-2 will always be the same kind of player.
Misc:
- If the AI under test isn't told whether the opponents are human or not, this adds more complexity and richness to the task, making it harder to model the opponents. (Because game turns are simple, it seems like it would be really hard to tell whether opponents are human or not.)
- This genre of task could be specified as 'take the best game turns you can right now, learning in context and reasoning on the fly', or as 'build the best game-playing agent you can'. In the latter case, rather than needing to interact with the live internet during the task, the agent can be scored after the task per se is completed.
*The game I'm thinking of, which could be swapped out for another game, is "every player must name something in a given category, e.g. things you might find in a kitchen, and points are awarded to those players whose answers match exactly [none/one/two/all/as many as you can/etc] of the other players' answers."
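To make the scoring concrete, here is a minimal sketch (in Python) of one possible scoring rule for a single round. The details - case-insensitive exact-match answers, a single target match count per round, 1 point per scoring player, and all names - are my own assumptions for illustration, not part of the game as described above.

```python
# A minimal sketch of one scoring rule for the 'match words' game described above.
# Assumptions (not specified in the original comment): answers are compared as
# case-insensitive exact strings, each round has a single target match count,
# and a matching player scores 1 point.

from collections import Counter

def score_round(answers: dict[str, str], target_matches: int) -> dict[str, int]:
    """Award 1 point to each player whose answer matches exactly
    `target_matches` of the *other* players' answers."""
    normalized = {player: ans.strip().lower() for player, ans in answers.items()}
    counts = Counter(normalized.values())
    scores = {}
    for player, ans in normalized.items():
        matches_with_others = counts[ans] - 1  # exclude the player's own answer
        scores[player] = 1 if matches_with_others == target_matches else 0
    return scores

# Example round: category 'things you might find in a kitchen',
# target is to match exactly one other player.
print(score_round(
    {"alice": "kettle", "bob": "Kettle", "carol": "spoon", "dave": "spoon", "erin": "fork"},
    target_matches=1,
))
# {'alice': 1, 'bob': 1, 'carol': 1, 'dave': 1, 'erin': 0}
```

Other variants (match none, match as many as you can, team scoring) would just swap out the final comparison, which is part of why the rules stay simple while the strategy does not.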
The currently top comment on the EA Forum copy of this post says that at least one person who wrote a positive testimonial was asked to leave a comment by Nonlinear (but they didn’t say it had to be positive) https://forum.effectivealtruism.org/posts/32LMQsjEMm6NK2GTH/sharing-information-about-nonlinear?commentId=kqQK2So3L5NJKEcYE
Feature suggestion: Up/downvoting shouldn’t be possible within 30 seconds of opening a (not very short) post (to prevent voting based on the title alone), or such votes should be weighted less.
I'm looking forward to part 2.
The Telegraph (UK’s main conservative broadsheet): https://www.telegraph.co.uk/news/2023/03/28/elon-musk-twitter-owner-artificial-intelligence/
I'd heard of a 'hive mind', but this is ridiculous.
(tone: wordplay, not criticism!)
As a data-point, I'm a rationalist, and a subscriber to the New Humanist, which is published by the Rationalist Association you mention, and is the descendant of the 1971 magazine you mention titled 'The Humanist'.
So I fall into the intersection of LW rationalists and "1950's rationalists".
Wikipedia:
The New Humanist has been in print for 131 years; starting out life as Watts's Literary Guide, founded by C. A. Watts in November 1885.[4] It later became The Literary Guide and Rationalist Review (1894–1954), Humanist (1956–1971) and the New Humanist in 1972.
Wikipedia:
Overshoot
The cumulative proportion of individuals who get infected during the course of a disease outbreak can exceed the HIT. This is because the HIT does not represent the point at which the disease stops spreading, but rather the point at which each infected person infects fewer than one additional person on average. When the HIT is reached, the number of additional infections begins to taper off, but it does not immediately drop to zero. The difference between the cumulative proportion of infected individuals and the theoretical HIT is known as the overshoot
If many people are infected at the time we reach the Herd Immunity Threshold, an R number of e.g. 0.5 will still cause a large number of further infections.
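To put rough numbers on the overshoot, here is a small sketch of a standard discrete-time SIR model. The parameter values (R0 = 2, a 10-day infectious period, daily time steps) are my own illustrative assumptions, not taken from the quoted article.

```python
# A minimal discrete-time SIR sketch illustrating overshoot past the herd
# immunity threshold (HIT). Parameter values are illustrative assumptions only.

R0 = 2.0               # basic reproduction number
gamma = 0.1            # recovery rate per day (mean infectious period ~10 days)
beta = R0 * gamma      # transmission rate per day
hit = 1 - 1 / R0       # herd immunity threshold = 50% for R0 = 2

s, i, r = 0.999, 0.001, 0.0   # susceptible / infected / recovered fractions
for _ in range(1000):          # step forward one day at a time until burnout
    new_infections = beta * s * i
    new_recoveries = gamma * i
    s -= new_infections
    i += new_infections - new_recoveries
    r += new_recoveries

print(f"HIT: {hit:.0%}, final cumulative infected: {1 - s:.0%}")
# With these parameters the cumulative infected fraction (roughly 80%) overshoots
# the 50% HIT, because many people are already infected when the HIT is crossed.
```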
Here are two relevant links.
1) Julia Galef comments on a post by Jeff Kaufman:
Status is key to well-being & isn't zero sum. Modern society's ability to create more sources of status, via allowing diverse subcultures, is like printing free $ w/out inflation. Possibly one of modernity's most overlooked benefits.
I and people I'm close to all have our status boosted by membership in these various subgroups, while another random person has, in their perspective, the status of them and their friends boosted by similar means. This is like the paradox of most people thinking they're above-average drivers: if different drivers are going for different things (speed, safety, considerateness, ...) then it's quite possible for most drivers to be above average by their own evaluation of what counts.
In general, feeling higher status is pretty good for you: it makes you healthier, happier, and you live longer. [2] So the ability of subcultures to produce new status opportunities out of nowhere seems really valuable, and something we should try to have more of.
2) Katja Grace writes, relatedly:
It might sound intuitive that more [subcultures] mean more status for all, but in most straightforward models the number of ponds doesn’t change the size of the status pie.
In 'The Elephant in the Brain', Hanson and Simler mention Axelrod's (1986) "meta-norm" modelling, which shows that cooperation is stable only when non-punishers are punished.