Your examples seem to imply that believing QI means such an agent would in full generality be neutral on an offer to have a quantum coin tossed, where they're killed in their sleep on tails, since they only experience the tosses they win. Presumably they accept all such trades offering epsilon additional utility. And presumably other agents keep making such offers since the QI agent doesn't care what happens to their stuff in worlds they aren't in. Thus such an agent exists in an ever more vanishingly small fraction of worlds as they continue accepting trades.
I should expect to encounter QI agents approximately never as they continue self-selecting out of existence in approximately all of the possible worlds I occupy. For the same reason, QI agents should expect to see similar agents almost never.
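To make "vanishingly small fraction" concrete: with a fair quantum coin, each accepted trade halves the measure of worlds the agent survives in, so after n trades they occupy a 2^-n fraction. A quick sketch (the fair-coin assumption and specific trade counts are just mine for illustration):

```python
def surviving_fraction(n_trades: int) -> float:
    """Fraction of worlds a QI agent still occupies after accepting
    n fair quantum coin-toss trades (each tails branch removes them)."""
    return 0.5 ** n_trades

# After a few dozen trades the agent is gone from essentially
# every world an outside observer might occupy.
for n in (1, 10, 50):
    print(n, surviving_fraction(n))
```

Fifty trades already puts them below one part in 10^15, which is the sense in which outside observers should expect to meet such agents approximately never.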
From the outside perspective this seems to be in a similar vein to the fact that all computable agents exist in some strained sense (every program, and more generally every possible piece of data, is encodable as some integer, and exists exactly as much as the integers do), even if they're never instantiated. For any other observer, this QI concept is indistinguishable in the limit.
Please point out if I misunderstood or misrepresented anything.
I'll note that malicious compliance is a very common response to being given a task that isn't straightforwardly possible with the resources available, with no channel to simply communicate that without retaliation. BSing an answer, or giving a technically-correct/rules-as-written response, is often just the best available strategy if one isn't in a position to fix the evaluator's broken incentives.
An actual human's chain of thought would be a lot spicier if their boss asked them to produce a document with working links without providing internet access.
"English" keeps ending up as a catch-all in K-12 for basically all language skills and verbal reasoning skills that don't obviously fit somewhere else. Read and summarize fiction - English, Write a persuasive essay - English, grammar pedantry - English, etc.
That link currently redirects the reader to https://siderea.dreamwidth.org/1209794.html
(just in case the old one stops working)
Good clarification; not just the amount of influence, something about the way influence is exercised being unsurprising given the task. Central not just in terms of "how much influence", but also along whatever other axes the sort of influence could vary?
I think if the agent's action space is still so unconstrained that there's room to consider benefit or harm that flows through principal value modification, it's probably still been given too much latitude. Once we have informed consent, because the agent has communicated the benefits and harms as best it understands them, it should have very little room to be influenced by benefits and harms it thought too trivial to mention (by virtue of their triviality).
At the same time, it's not clear the agent should, absent further direction, reject the offer to brainwash the principal for resources, as opposed to punting to the principal. Maybe the principal thinks those values are an improvement and it's free money? [e.g. the principal's insurance company wants to bribe him to stop smoking.]
WRT non-manipulation, I don't suppose there's an easy way to have the AI track how much potentially manipulative influence it's "supposed to have" in the context and avoid exercising more than that influence?
Or possibly better, compare simple implementations of the principal's instructions, and penalize interpretations with large/unusual influence on the principal's values. Preferably without prejudicing interventions that straightforwardly protect the principal's safety and communication channels.
The principal should, for example, be able to ask the AI to "teach them about philosophy", without it either going out of its way to ensure the principal doesn't change their mind about anything as a result of the instruction, or unduly influencing them with subtly chosen explanations or framing. The AI should exercise an "ordinary" amount of influence, typical of the ways an AI could go about implementing the instruction.
Presumably there's a distribution over how manipulative/anti-manipulative (value-preserving) any given implementation of the instruction is, and we may want AI to prefer central implementations rather than extremely value-preserving ones.
Ideally AI should also worry that it's contemplating exercising more or less influence than desired, and clarify that as it would any other aspect of the task.
You're very likely correct IMO. The only thing I see pulling in the other direction is that cars are far more standardized than humans, and a database of detailed blueprints for every make and model could drastically reduce the resolution needed for usefulness. Especially if the action on a cursory detection is "get the people out of the area and scan it harder", not "rip the vehicle apart".
This is the first text talking about goals I've read that meaningfully engages with "but what if you were (partially) wrong about what you want" instead of simply glorifying "outcome fixation". This seems like a major missing piece in most advice about goals. That the most important thing about your goals is that they're actually what you want. And discovering that may not be the case is a valid reason to tap the brakes and re-evaluate.
(Assuming a frame of materialism, physicalism, empiricism throughout even if not explicitly stated)
Some of your scenarios that you're describing as objectionable would reasonably be described as emulation in an environment that you would probably find disagreeable even within the framework of this post. Being emulated by a contraption of pipes and valves that's worse in every way than my current wetware is, yeah, disagreeable even if it's kinda me. Making my hardware less reliable is bad. Making me think slower is bad. Making it easier for others to tamper with my sensors is bad. All of these things are bad even if the computation faithfully represents me otherwise.
I'm mostly in the same camp as Rob here, but there's plenty left to worry about in these scenarios even if you don't think brain-quantum-special-sauce (or even weirder new physics) is going to make people-copying fundamentally impossible. Being an upload of you that now needs to worry about being paused at any time or having false sensory input supplied is objectively a worse position to be in.
The evidence does seem to lean in the direction that non-classical effects in the brain are unlikely: neurons are just too big for quantum effects between neurons, and even if there were quantum effects within neurons, it's hard to imagine them being stable for even as long as a single train of thought. The copy losing their train of thought and having momentary confusion doesn't seem to reach the bar where they don't count as the same person? And yet weirder new physics mostly requires experiments we haven't thought to do yet, or experiments in regimes we've not yet been able to test. Whereas the behavior of things at STP in water is about as central to things-Science-has-pinned-down as you're going to get.
You seem to hold that the universe may still have a lot of important surprises in store, even within the central subject matter of century-old fields? Do you have any kind of intuition pump for the feeling that there are still that many earth-shattering surprises left (while simultaneously holding that empiricism and science mostly work)? My sense of where there's likely to be surprises left is not quite so expansive, and this sounds like a crux for a lot of people. Even as much of a shock as QM was to physics, it didn't invalidate much if any theory except in directly adjacent fields like chemistry and optics. And working out the finer points had progressively narrower and shorter-reaching impact. I can't think of examples of surprises with a larger blast radius within the history of vaguely modern science. Finding odd, as-yet-unexplained effects pretty consistently precedes attempts at theory. Empirically determined rules don't start working any worse when we realize the explanation given with them was wrong.
Keep in mind that society holds that you're still you even after a non-trivial amount of head trauma. So whatever amount of imperfection in copying your unknown-unknowns cause, it'd have to both be something we've never noticed before in a highly studied area, and something more disruptive than getting clocked in the jaw, which seems a tall order.
Keep in mind also that the description(s) of computation that computer science has worked out are extremely broad and far from limited to just electronic circuits. Electronics are pervasive because we have, as a society, sunk the world's GDP (possibly several times over) into figuring out how to make them cheaply at scale. Capital investment is the only thing special about computers realized in silicon; computer science makes no such distinction. The notion of computation is so broad that there's little if any room to conceive of an agent that's doing something that can't be described as computation. Likewise the equivalence proofs are quite broad: it can be arbitrarily expensive to translate across architectures, but within each class of computers, computation is computation, and that emulation is possible has proofs.
All of your examples are doing that thing where you have a privileged observer position separate and apart from anything that could be seeing or thinking within the experiment. You-the-thinker can't simply step into the thought experiment. You-the-thinker can of course decide where to attach the camera by fiat, but that doesn't tell us anything about the experiment, just about you and what you find intuitive.
Suppose for sake of argument your unknown unknowns mean your copy wakes up with a splitting headache and amnesia for the previous ~12 hours as if waking up from surgery. They otherwise remember everything else you remember and share your personality such that no one could notice a difference (we are positing a copy machine that more or less works). If they're not you they have no idea who else they could be, considering they only remember being you.
The above doesn't change much for me, and I don't think I'd concede much more without saying you're positing a machine that just doesn't work very well. It's easy for me to imagine it never being practical to copy or upload a mind, or having modest imperfections or minor differences in experience, especially at any kind of scale. Or simply being something society at large is never comfortable pursuing. It's a lot harder to imagine it being impossible even in principle with what we already know, or can already rule out with fairly high likelihood. I don't think most of the philosophy changes all that much if you consider merely very good copying (your friends and family can't tell the difference; knows everything you know) vs perfect copying.
The most bullish folks on LLMs seem to think we're going to be able to make copies good enough to be useful to businesses just off all your communications. I'm not nearly so impressed with the capabilities I've seen to date and it's probably just hype. But we are already getting into an uncanny valley with the (very) low fidelity copies current AI tech can spit out - which is to say they're already treading on the outer edge of peoples' sense of self.
Realistically I doubt you'd even need to be sure it works, just reasonably confident. Folks step on planes all the time and those do on rare occasion fail to deliver them intact at the other terminal.
Within this framework, whether or not you "feel that continuity" would mostly be a fact about the ontology your mindstate uses when thinking about teleportation. Everything in this post could be accurate and none of it would be incompatible with you having an existential crisis upon being teleported, freaking out upon meeting yourself, etc.
Nor does anything here seem to make a value judgement about what the copy of you should do if told they're not allowed to exist. Attempting revolution seems like a perfectly valid response; self defense is held as a fairly basic human right after all. (I'm shocked that isn't already the plot of a sci-fi story.)
It would also be entirely possible for both of your copies to hold conviction that they're the one true you - Their experiences from where they sit being entirely compatible with that belief. (Definitely the plot of at least one Star Trek episode.)
There's not really any pressure currently to have thinking about mind copying that's consistent with every piece of technology that could ever conceivably be built. There's nothing that forces minds to have accurate beliefs about anything that won't kill them or wouldn't have killed their ancestors in fairly short order. Which is to say mostly that we shouldn't expect to get accurate beliefs about weird hypotheticals often without having changed our minds at least once.
There's a presumption you're open to discussing on a discussion forum, not just grandstanding. Strong downvoted much of this thread for the amount of my time you've wasted trolling.
Bell Labs, Xerox PARC, etc. were AFAIK mostly privately funded research labs that existed for decades and churned out patents that may as well have been money printers. When AT&T (Bell Labs) was broken up, that research all but started the modern telecom and tech industry, which is now something like 20%+ of the stock market. If you attribute even a tiny fraction of that to Bell Labs, it's enough to fund another one 1000 times over.
The missing piece arguably is executive teams with a 25-year vision instead of a 25-week vision, AND the institutional support to see it through; cost cutting is in fashion with investors too. Private equity is in theory well positioned to repeat this elsewhere, but for reasons I don't entirely understand has become too short-sighted and/or has significantly shortened horizons on returns. IBM, Qualcomm, TSMC, ASML, and Intel all seem to have research operations of that same near-legendary caliber, mostly privately funded (albeit treated as national treasures of strategic importance); what they have in common, of course, is they're all tech. Semiconductor fabrication is extremely research intensive, and world-class R&D operations are table stakes just to survive to the next process node.
Maybe a good followup question is why hasn't this model spread outside of semiconductors and tech? Is a functional monopoly a requirement for the model to work? (ASML has a functional monopoly on leading edge photo-lithography machines that power modern semiconductor fabs). Do these labs ever start independently without a clear lineage to 100 billion+ dollar govt research initiatives? Electronics and tech is probably many trillions in US govt funding since WWII once you include military R&D and contracts.
Govt. spending is a ratchet that only goes one direction; replacing dysfunctional agencies costs jobs and makes political enemies. Reform might be more practical, but much like with people, it's very hard to reform an agency that doesn't want to change. You'd be talking about sustained expenditure of political capital, the sort of thing that requires an agency head who's invested in the change and popular enough with both parties to get to spend a few administrations working at it.
Edit: I answered separately above with regards to private industry.
Again you're saying that without engaging with any of my arguments or giving me any more of your reasoning to consider. Unless you care to share substantially more of your reasoning, I don't see much point continuing this?
That is a big part of the threat here. Many of the current deployments are many steps removed from anyone reading research papers. E.g. sure, people at MS and OpenAI involved with that roll-out are presumably up on the literature. But the IT director deciding when and how to deploy copilot, what controls need to be in place, etc? Trade publications, blogs, maybe they ask around on Reddit to see what others are doing.
Related: how do spin-off subcultures fit into this model? E.g. in music you have people who consume an innovation in one genre, then reinvent it in their own scene where they're a creator. I think there's similar dynamics in various LW-adjacent subcultures, though I'm not up enough on detailed histories to comment.
For less loaded terms, maybe Create, Consume, Exploit or Create, Enjoy, Exploit as the set of actions available. Looks like loosely what was settled on above.
Where exploit more naturally captures things like soulless commercialization and others low key taking advantage of those enjoying the scene.
Consume in the context of rationalists would more be people who read the best techniques on offer and then go try to use them for things that aren't "advancing the art" itself, like addressing x-risk.
You're still hammering on stuff I never disagreed with in the first place. In so far as I don't already understand all the math (or math notation) I'd need to follow this, that's a me problem not a you problem, and having a pile of cool papers I want to grok is prime motivation for brushing up on some more math. I'm definitely not down-voting merely on that.
What I'm mostly trying to get across is just how large of a leap of logic you're making from [post got 2 or 3 downvotes] => [everyone here hates math]. There's got to be at least 3 or 4 major inferences there you haven't articulated here and I'm still not sure what you're reacting so strongly to. Your post with the lowest karma is the first one and it's sitting at neutral, based on a grand total of 3 votes besides yours. You are definitely sophisticated enough on math to understand the hazards of reasoning from a sample size that small.
Any conversation about karma would necessarily involve talking about what does and doesn't factor into votes, likely both here and in the internet or society at large. Not thinking we're getting anywhere on that point.
I've already said clearly and repeatedly that I don't have a problem with math posts, and I don't think others do either. You're not going to get what you want by continuing to straw-man me and others. I disagree with your premise; you've thus far failed to acknowledge or engage with any of those points.
Ah, gotcha. I had gotten the other impression from the thread in aggregate.
If you're selling them at unit cost you aren't selling at cost, you're straightforwardly selling at a loss. That's definitely not what I'm thinking of when someone tells me they're selling at cost.
For everyone who gets curious and challenges (or even evaluates on the merits) the approved right answers they learned from their culture, there's dozens more who for whatever reason don't. "Who am I to challenge <insert authority>", "Why should I think I know better?", "How am I supposed to know what's true?" (rhetorically, not expecting an answer exists). And a thousand other rationalizations besides.
And then of those who try, most just find another authority they like better and end their inquiry - independent thinking is hard work, thankless work, lonely work. Even many groups that supposedly value this adopt the language and trappings without the actual thought and inquiry. People mostly challenge the approved right answers that the in-group has told them are safe to challenge. Even here plenty haven't escaped this.
And obviously you already know the safe approved "right" answers from society at large on this question - it's all a trap and you're a fool for considering it. And credit where it's due historically, they've so far been right.
I've always taken that as: hold average volumetric flow rate constant or slightly reduce it, significantly reduce the rate at which breaths are taken, and breathe deeper (more air at once) to compensate.
The use of the phrase "deep breath and hold" is also consistent with max lung volume == deep breath.
Wouldn't be engaging at all if I didn't think there was some truth to what you're saying about the math being important and folks needing to be persuaded to "take their medicine" as it were and use some rigor. You are not the first person to make such an observation and you can find posts on point from several established/respected members of the community.
That said, I think "convincing people to take their medicine" mostly looks like those answers you gave just being at the intro of the post(s) by default (and/or the intro to the series if that makes more sense). Alongside other misc readability improvements. Might also try tagging the title as [math heavy] or some such.
I think you're taking too narrow a view on what sorts of things people vote on and thus what sort of signal karma is. If that theory of mind is wrong, any of the inferences that flow from it are likely wrong too. Keep in mind also (especially when parsing karma in comments) that anything that parses as whining costs you status even if you're right (not just a LW thing). And complaining about internet points almost always parses that way.
I don't think it necessarily follows that because a math-heavy post got some downvotes, everyone hates math and will downvote math in the future. As opposed to something like: people care a lot about readability and about being able to prioritize their reading toward the subjects they find relevant, neither of which scores well if the post is math to the exclusion of all else.
I didn't find any of those answers surprising but it's an interesting line of inquiry all the same. I don't have a good sense of how it's simultaneously true that LLMs keep finding it helpful to make everything bigger, but also large sections of the model don't seem to do anything useful, and increasingly so in the largest models.
There's a more general concern here for running organizations where anyone can sue anyone at any time for any reason, merit or no. If one allows the barest hint of a lawsuit to dictate their actions, that too becomes another vector through which they can be manipulated. Perhaps a better thing to aim for is "don't do anything egregious enough a lawyer will take it on contingency", use additional caution if the potential adversary is much better resourced than you (and can afford sustained frivolous litigation).
Not a lawyer, but the "can't explain your reasoning" problem is overblown. Just need to be very diligent in separating facts from the opinions and findings of the panel. There is a reason every report of that sort done professionally sounds the particular flavor of stilted that it does.
"Our panel found that <accused> did <thing>" <- potential lawsuit, hope you can prove that in court. You're not a fact finder in a court of law, speak as if you are at your own peril.
"Our panel was convened to investigate <accusation> against <accused>. Based on <process>, we believe the accusation credible and recommend the following: ..." <- A OK
"Our panel was convened to investigate <accusation> against <accused>. Based on <process>, we were unable to corroborate the accusation and cannot recommend action at this time." <- A OK
"person X said Y, I did/didn't believe them" is basically always fine, provided X actually said Y. Quoting someone else's statement is well protected, and you're entitled to your opinions. The trouble happens when your opinions/findings/beliefs are stated as facts about what happened instead of facts about what you believe and what evidence you found persuasive.
There's also nothing stopping the panel from saying "we heard closed door testimony...", describe the rough topic, speakers relation to the inquiry, and the degree to which it was persuasive.
From an operational perspective, this is eye-opening in terms of how much trust is being placed in the companies that train models, and the degree to which nobody coming in later in the pipeline is going to be able to fully vouch for the behavior of the model, even if they spend time hammering on it. In particular, it seems like it took vastly less effort to sabotage those models than would be required to detect this.
That's relevant to the models that are getting deployed today. I think the prevailing thinking among those deploying AI models in businesses today is that the supply chain is less vulnerable to quietly slipping malware into an LLM compared to traditional software. That's not seeming like a safe assumption.
I did go pull up a couple of your posts as that much is a fair critique:
That first post is only the middle section of what would already be a dense post and is missing the motivating "what's the problem?", "what does this get us?"; without understanding substantially all of the math and spending hours I don't think I could even ask anything meaningful. That first post in particular is suffering from an approachable-ish sounding title then wall of math, so you're getting laypeople who expected to at least get an intro paragraph for their trouble.
The August 19th post piqued my interest substantially more on account of including intro and summary sections, and enough text to let me follow along while only understanding part of the math. A key feature of good math text is that I should be able to gloss over challenging proofs on a first pass, take your word for it, and still get something out of it. Definitely don't lose the rigor, but have mercy on those of us not cut out for a math PhD. If you had specific toy examples you were playing with while figuring out the post, those can also help make posts more approachable. That post seemed well received, just not viewed much; my money says the title is scaring off everyone but the full-time researchers (which I'm not, I'm in software).
I think I and most other interested members not in the field default to staying out of the way when people open up with a wall of post-grad math or something that otherwise looks like a research paper, unless specifically invited to chime in. And then same story with meta; this whole thread is something most folks aren't going to start under your post uninvited, especially when you didn't solicit this flavor of feedback.
I bring up notation specifically as the software crowd is very well represented here, and frequently learns advanced math concepts without bothering to learn any of the notation common in math texts. So not like 1 or 2 notation questions, but more like you can have people who get the concepts but all of the notation is Greek to them.
It is still a forum, all the usual norms about avoid off-topic, don't hijack threads apply. Perhaps a Q&A on how to get more engagement with math-heavy posts would be more constructive? Speaking just for myself, a cheat-sheet on notation would do wonders.
Nobody is under any illusions that karma is perfect AFAICT, though much discussion has already been had on to what extent it just mirrors the flaws in people's underlying rating choices.
Point of clarification: Is the supervisor the same as the potentially faulting hardware, or are we talking about a different, non-suspect node checking the work, and/or e.g. a more reliable model of chip supervising a faster but less reliable one?
The more curious case for excavators would be open-pit mines or quarries, where you know you're going to be in roughly the same place for decades and already have industrial-size hookups.
The answer there is if you can get it into evidence then you can get it in front of a jury. A big part of what lawyers do in litigation is argue about what gets into evidence and can get shown; all of that arguing costs time and money. I think a fair summary is if it's plausibly relevant, the judge usually can't/won't exclude it.
I wouldn't count on Microsoft being ineffective, but there's good reason to think they'll push for applications for the current state of the art over further blue sky capabilities stuff. The commitment to push copilot into every Microsoft product is already happening, the copilot tab is live in dozens of places in their software and in most it works as expected. It's already good enough to replace 80%+ of the armies of temps and offshore warm bodies that push spreadsheets and forms around today without any further big capabilities gains, and that's a plenty huge market to sate public investors. Sure more capabilities gets you more markets, but what they have now probably gets the entire AI division self-supporting on cashflow, or at least able to help with the skyrocketing costs of compute, plus funding the coming legal and lobbying battles over training data.
Covert side channels like you're suggesting would probably be a related and often helpful thing for someone trying to do what OP is talking about, but I think the side channels are distinct from the things they can be used for.
This concept in radio communications would be "spread spectrum", reducing the signal intensity or duration in any given part of the spectrum and using a wider band/more channels. See especially military spread spectrum comms and radars. E.g. this technique has been used to frustrate simple techniques for identifying the location of a radio transmitter, to avoid jamming, and to defeat radar warning/missile warning systems on jets.
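As a toy illustration of the frequency-hopping flavor of spread spectrum (the seed, hop count, and channel numbers here are all made up, and real systems use crypto-grade sequence generators rather than a stdlib PRNG): both ends derive the same pseudo-random hop schedule from a shared secret, so the signal never dwells long on any one frequency and an observer without the secret sees only brief, scattered bursts across the band.

```python
import random

def hop_sequence(shared_seed, n_hops, channels):
    """Derive a pseudo-random channel-hopping schedule from a shared seed.

    Transmitter and receiver run this with the same seed and stay
    synchronized; without the seed, predicting (or jamming) the next
    channel requires covering the whole band at once.
    """
    rng = random.Random(shared_seed)  # illustrative; not crypto-grade
    return [rng.choice(list(channels)) for _ in range(n_hops)]

tx = hop_sequence(1234, 8, range(100, 120))
rx = hop_sequence(1234, 8, range(100, 120))
assert tx == rx  # both ends agree on where to hop next
```

Energy per channel drops roughly in proportion to the number of channels used, which is what frustrates direction finding and narrowband jamming.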
It's pretty easy to find reasons why everything will hopefully be fine, or AI hopefully won't FOOM, or we otherwise needn't do anything inconvenient to get good outcomes. It's proving considerably harder (from my outside the field view) to prove alignment, or prove upper bounds on rate of improvement, or prove much of anything else that would be cause to stop ringing the alarm.
FWIW I'm considerably less worried than I was when the Sequences were originally written. The paradigms that have taken off since do seem a lot more compatible with straightforward training solutions that look much less alien than expected. There are plausible scenarios where we fail at solving alignment and still get something tolerably human shaped, and none of those scenarios previously seemed plausible. That optimism just doesn't take it under the stop worrying threshold.
Admittedly I skimmed large portions of that, but I'd like to take a crack at bridging some of that inferential distance with a short description of the model I've been using, whereby I keep all the concerns you brought up straight but also don't have to choke on pronouns.
Categories of Men and Women are useful in a wide variety of areas and point at a real thing. There's a region in the middle these categories overlap and lack clean boundaries - while both genetics and birth sex are undeniable and straightforward fact in almost all cases (~98% IIRC), they don't make the wide ranging good predictions you'd otherwise expect in this region. I've mentally been calling this the "gender/sex/identity is complicated" region. Within this region, carefully consider which category is more relevant and go with that; other times a weighted average may be more appropriate.
By way of example if I want to infer likely skill-sets, hobbies, or interests for someone trans, I'm probably looking at either their pre-transition category, or a weighted average based on years before vs after transition.
On the other hand if I'm considering how a friend or conversation partner might prefer to be treated, I'd almost certainly be correct to infer based on claimed/stated gender until I know more.
On the one hand I can definitely see why those threads got under your skin (and shocked The Thoughts You Cannot Think didn't get a link); not the finest showing in clear thinking. Ultimately though I'm skeptical that we should treat pronouns as making some deep claim about the structure of person-space along the axis of sex. If anything, that there's conflict at all should serve to highlight that there's a large region (as much as 20% of the population maybe???) where this isn't cut and dry and simple rules aren't making good predictions. Looking at that structure there's a decent if not airtight case for treating pronouns as you would any other nicknames or abbreviations - namely acceptable insofar as the referent finds the name acceptable. There are places where a "no pseudonyms allowed, no exceptions" rule should and does trump "preferred moniker"/"no name-calling", but Twitter clearly isn't one.
I think a key distinction here is that any of this only helps if people care more about the truth of the issue at hand than whatever realpolitik considerations the issue has tangentially gotten pulled into. And yeah, absent "unreasonable levels of political savvy", academics are mostly relying on academic issues usually being far enough from the icky world of politics to be openly discussed, at least outside of a few seriously diseased disciplines where the rot is well and truly set in. The powers that be seem to only care about the truth of an issue when it starts directly impinging on their day to day; people seem to find it noteworthy when this isn't true of a given leader.
I don't think this will ever be fully predictable. E.g. in the US I don't think anyone really saw the magnitude of the backlash against election workers, academics, and security folks coming until it became headline news. And arguably that's what a near-miss looks like.
This is very much what I want my headlines to look like.
Personally, my preferred mode of consumption would be an AM email newsletter like Axios or Morning Brew.
The resolution dates on the markets seem important for several of the headlines and were noticeably missing from the body.
"Crimea land bridge 22% chance of being cut [this year/campaign season], down from 34% according to Insight"
Notice how different that would read with the time horizon on there vs. leaving it unqualified. The other big question an update like that raises is "what changed?"
Interesting follow-up: how long do they take to break out of the bad equilibrium if all start there? How about if we choose a less extreme bad equilibrium (say 80 degrees)?
Looking ahead multiple moves seems sufficient to break the equilibrium, except for the stated assumption that the other players have deeply flawed models of your behavior - models that assume you're using a different strategy, the shared one including punishment. There does seem to be something fishy/circular about baking an assumption about other players' strategies into a player's own strategy while omitting any ability to update.
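A minimal sketch of that dynamic (invented payoffs, not the original post's exact setup): non-updating punisher strategies hold a bad norm, and a player who simply plays its true preference gets punished forever rather than breaking the equilibrium.

```python
# Toy temperature game: each round every player names a temperature, the room
# lands at the average of the choices, and everyone's payoff is -|room - IDEAL|.
# All constants here are arbitrary, chosen only for illustration.
IDEAL, NORM, PUNISH_TEMP, ROUNDS = 68, 80, 100, 10

def punisher(history):
    # Hold the norm; if anyone undercut it last round, punish by maxing the
    # heat. Note there is no mechanism for this rule to ever update.
    if history and any(c < NORM for c in history[-1]):
        return PUNISH_TEMP
    return NORM

def deviant(history):
    return IDEAL  # always plays its true preference

players = [punisher, punisher, deviant]
history, payoffs = [], [0.0] * len(players)

for _ in range(ROUNDS):
    choices = [p(history) for p in players]
    room = sum(choices) / len(choices)
    payoffs = [u - abs(room - IDEAL) for u in payoffs]
    history.append(choices)

print(history[0])   # [80, 80, 68]: the deviation
print(history[-1])  # [100, 100, 68]: permanent punishment, norm never breaks
```

With punishers that can't update, deviating just makes every round worse for everyone; breaking the equilibrium requires the punishers' model of the deviant to change, which is exactly what the setup rules out.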
Not sure I'm following the setup and notation closely enough to argue that one way or the other, as far as the order in which the agent receives evidence and has to commit to actions. Above I was considering the simplest case of one bit of evidence in, one bit of action out, repeat.
I'm pretty sure that could be extended to get that one-small-key-that-unlocks-the-whole-puzzle sort of effect, where the model clicks all at once. As you say though, I'm not sure that gets to the heart of the matter regarding the bound: it may show that no such bound exists on the margin - the last piece can be worth much more than all the prior pieces of evidence - but not necessarily in a way that violates the proposed bound overall. Maybe we have to see that last piece as unlocking some bounded amount of value from the prior observations.
It's possible to construct a counterexample where there's a step from guessing at random to perfect knowledge after an arbitrary number of observed bits; n-1 bits of evidence are worthless alone and the nth bit lets you perfectly predict the next bit and all future bits.
Consider, for example, shifting bits one at a time into the input of a known hash function initialized with an unknown value (of known width), where I ask you to guess a specified bit of the output. In the idealized case you know nothing about the output until you learn the final input bit (once all the unknown bits have shifted out), because the bits are perfectly mixed; after that you'll guess every future bit correctly.
Seems like the pathological cases can be arbitrarily messy.
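A concrete (hypothetical) version of that construction, using SHA-256 as the mixing function: the evidence stream is the bits of a secret key, and until the last bit is known the observer's predictions are at chance.

```python
import hashlib

def stream_bit(key: bytes, t: int) -> int:
    # y_t = low bit of SHA-256(key || t): effectively unpredictable without
    # the full key, since the hash mixes every input bit into every output bit.
    digest = hashlib.sha256(key + t.to_bytes(4, "big")).digest()
    return digest[0] & 1

secret = b"\x9a\x3c\x11\xf0"  # arbitrary 4-byte key, for illustration only

# With the full key you can of course regenerate the stream exactly; the
# interesting case is knowing all but one bit. Flipping the single unknown
# bit decorrelates the entire output stream:
almost = secret[:3] + bytes([secret[3] ^ 1])
agree = sum(stream_bit(almost, t) == stream_bit(secret, t) for t in range(1000))
print(agree)  # roughly 500 of 1000, i.e. no better than guessing
```

So the value of each of the first n-1 evidence bits is ~zero, and the nth bit takes you from chance to perfect prediction of every future bit.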
Wary of this line of thinking, but I'll concede that it's a lot easier to moderate when there's something written to point to for expected conduct. Seconding the other commenters that if it's official policy then it's more correctly dubbed guidelines rather than norms.
I'm struck by the lack of any principled center or Schelling point for balancing [ability to think and speak freely as the mood takes you] against any of the thousand and one often conflicting needs for what makes a space nice/useful/safe/productive/etcetera. It seems like anyone with moderating experience ends up with some idea of a workable place to draw those lines, but rarely do two people end up with exactly the same idea, and articulating it is fraught. This would really benefit from some additional thought and better framing, and it's pretty central to what this forum is about (namely building effective communities around these ideas) rather than purely a moderation question.
Taking the premise at face value for sake of argument.
You should be surprised by just how many fields of study bottom out in something intractable to simulate or re-derive from first principles.
The substrate that all agents run on is conveniently obfuscated and difficult to understand or simulate ourselves - perhaps intentionally, to make it unclear what shortcuts are being taken or whether the minds are running inside the simulation at all.
Likewise chemistry bottoms out in near-intractable quantum soup, the end result being that almost all related knowledge has to be experimentally determined and compiled into large tables of physical properties. Quantum mechanics does relatively little to constrain this in practice; I think large molecules' and heavy elements' properties could diverge significantly from what we would predict, without it being detectable, since we can't run large enough QM simulations.
It's awfully convenient that most of us spend all our time running on autopilot and then come up with post-hoc justifications of our behavior - why we're scarcely more convincing than GPT explaining the actions of a game NPC. I wonder why we're like that... (see point 1).
I'm sure folks could come up with other examples. It's kind of an odd change of pace how science keeps running into bizarre smokescreens everywhere we look, after the progress of the last few centuries. How many oddities are hiding just a little deeper?
I don't personally find the above persuasive on net, but it is the first tree I'd go barking up if I was giving that hypothesis further consideration.
I suppose that depends a lot on how hard anyone is trying to cause mischief, and how much easier it's going to get to do anything of consequence. 4chan is probably a good prototype of your typical troll "in it for the lulz", and while they regularly go past what most would call harmless fun, there's not a body count.
The other thing people worry about (and the news has apparently decided is the thing we all need to be afraid of this month...) is conventional bad actors using new tools to do substantially whatever they were trying to do before, but more; mostly confuse, defraud, spread propaganda, what have you. I'm kind of surprised I don't already have an inbox full of LLM-composed phishing emails... On some level it's a threat, but it's also not a particularly hard one to grasp, it's getting lots of attention, and new weapons and tactics are a constant in conflicts of all types.
I'm still of the mind that directly harmful applications like the above are going to pale next to the economic disruption and social unrest that's going to come from making large parts of the workforce redundant very quickly. Talking specific policy doesn't look like it's going to be in the Overton window until after AI starts replacing jobs at scale, and the "we'll have decades to figure it out" theory hasn't been looking good of late. And when that conversation starts it's going to suck all the air out of the room and leave little mainstream attention for worrying about AGI.
Getting clear, impossible-to-ignore warning shots first would be a good thing on net, even if unpleasant in the moment. Unless you're suggesting that simple (non-AGI) AI tools are going to be civilization-threatening - but I'm not seeing it and you didn't argue it.
I very much understand the frustration, but I'll ask, as someone also not directly adjacent to any of this: what would you have me and others like me do? There's no shortage of anger and frustration in any discussion like this, but I don't see any policy suggestions floating around that sound like they might work and aren't already being tried (or at least any suggestion that there are countermeasures that should be deployed and haven't been).
Chiming in on toy models of research incentives: it seems to me like a key feature is that the game starts as an Arms Race and then, after some amount of capability accumulates, transitions into a Suicide Race. But players have only vague estimates of where that threshold is, those estimates vary widely, and players may not be able to communicate them effectively or persuasively. Each player has a strong incentive to push right up to the line where things get obviously (to them) dangerous, and with enough players, somebody's estimate is going to be wrong.
Working off a model like that, we'd much rather be playing the version where players can effectively share estimates and converge on a view of what level of capability makes things very dangerous. The lack of constructive conversations with the largest players on that topic does sound like a current bottleneck.
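A hedged toy version of that threshold model (all numbers invented): each lab draws a noisy private estimate of the true danger threshold and pushes capabilities up to its estimate minus a safety margin; disaster means somebody's stopping point overshoots the real line.

```python
import random

def p_disaster(n_labs: int, noise: float, margin: float,
               trials: int = 10_000, true_threshold: float = 100.0) -> float:
    """Fraction of simulated worlds where at least one lab overshoots."""
    bad = 0
    for _ in range(trials):
        # Each lab's private, noisy estimate of where danger begins.
        estimates = [true_threshold + random.gauss(0, noise)
                     for _ in range(n_labs)]
        # Each lab pushes to (own estimate - margin); disaster if any lab's
        # stopping point exceeds the true threshold.
        if any(e - margin > true_threshold for e in estimates):
            bad += 1
    return bad / trials

random.seed(0)
solo = p_disaster(n_labs=1, noise=5, margin=5)
crowd = p_disaster(n_labs=10, noise=5, margin=5)
print(solo, crowd)  # more players -> far more likely someone's estimate is wrong
```

Even modest estimation noise plus many players makes the "somebody overshoots" term dominate, which is why sharing estimates (shrinking the noise) matters more than any single player's caution.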
It's unclear to me to what extent there's even a universally understood distinction between mundane weak AI systems, with ordinary kinds of risks, and superhuman AGI systems, with exotic risks that software and business people aren't used to thinking about outside of sci-fi. That strikes me as a key inferential leap that may be getting glossed over.
Quite a lot of effort in technology goes into training people that systems are mostly static absent human intervention or well-defined automations that some person ultimately wrote - anything else being a fault that gets fixed. Computers don't have a mind of their own, troubleshoot instead of anthropomorphizing, etc., etc. That this intuition will at some point stop being true of a sufficiently capable system (and that this is fundamentally part of what we mean by human-level AGI) probably needs more focus, as it's explicitly contrary to the basic induction that's part of usefully working in/on computers.