Your examples seem to imply that believing QI means such an agent would in full generality be neutral on an offer to have a quantum coin tossed, where they're killed in their sleep on tails, since they only experience the tosses they win. Presumably they accept all such trades offering epsilon additional utility. And presumably other agents keep making such offers since the QI agent doesn't care what happens to their stuff in worlds they aren't in. Thus such an agent exists in an ever more vanishingly small fraction of worlds as they continue accepting trades.
I should expect to encounter QI agents approximately never as they continue self-selecting out of existence in approximately all of the possible worlds I occupy. For the same reason, QI agents should expect to see similar agents almost never.
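To make "vanishingly small fraction" concrete: with a fair quantum coin, each accepted trade halves the measure of worlds the agent survives in, so after n trades they occupy a 2^-n fraction. A quick sketch (the fair-coin assumption and specific trade counts are just mine for illustration):

```python
def surviving_fraction(n_trades: int) -> float:
    """Fraction of worlds a QI agent still occupies after accepting
    n fair quantum coin-toss trades (each tails branch removes them)."""
    return 0.5 ** n_trades

# After a few dozen trades the agent is gone from essentially
# every world an outside observer might occupy.
for n in (1, 10, 50):
    print(n, surviving_fraction(n))
```

Fifty trades already puts them below one part in 10^15, which is the sense in which outside observers should expect to meet such agents approximately never.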
From the outside perspective this seems to be in a similar vein to the fact that all computable agents exist in some strained sense (every program, and more generally every possible piece of data, is encodable as some integer, and exists exactly as much as the integers do), even if they're never instantiated. For any other observer, this QI concept is indistinguishable in the limit.
Please point out if I misunderstood or misrepresented anything.
I'll note that malicious compliance is a very common response to being given a task that isn't straightforwardly possible with the resources available, with no channel to simply communicate that without retaliation. BSing an answer, or giving a technically-correct/rules-as-written response, is often just the best available strategy if one isn't in a position to fix the evaluator's broken incentives.
An actual human's chain of thought would be a lot spicier if their boss asked them to produce a document with working links without providing internet access.
"English" keeps ending up as a catch-all in K-12 for basically all language skills and verbal reasoning skills that don't obviously fit somewhere else. Read and summarize fiction - English, Write a persuasive essay - English, grammar pedantry - English, etc.
That link currently redirects the reader to https://siderea.dreamwidth.org/1209794.html
(just in case the old one stops working)
Good clarification; not just the amount of influence, something about the way influence is exercised being unsurprising given the task. Central not just in terms of "how much influence", but also along whatever other axes the sort of influence could vary?
I think if the agent's action space is still so unconstrained that there's room to consider benefit or harm that flows through principal value modification, it's probably still been given too much latitude. Once we have informed consent, because the agent has communicated the benefits and harms as best it understands them, it should have very little room to be influenced by benefits and harms it thought too trivial to mention (by virtue of their triviality).
At the same time, it's not clear the agent should, absent further direction, reject the offer to brainwash the principal for resources, as opposed to punting to the principal. Maybe the principal thinks those values are an improvement and it's free money? [e.g. the principal's insurance company wants to bribe him to stop smoking.]
WRT non-manipulation, I don't suppose there's an easy way to have the AI track how much potentially manipulative influence it's "supposed to have" in the context and avoid exercising more than that influence?
Or possibly better, compare simple implementations of the principal's instructions, and penalize interpretations with large/unusual influence on the principal's values. Preferably without prejudicing interventions that straightforwardly protect the principal's safety and communication channels.
The principal should, for example, be able to ask the AI to "teach them about philosophy", without it either going out of its way to ensure the principal doesn't change their mind about anything as a result of the instruction, or unduly influencing them with subtly chosen explanations or framing. The AI should exercise an "ordinary" amount of influence, typical of the ways an AI could go about implementing the instruction.
Presumably there's a distribution over how manipulative/anti-manipulative (value-preserving) any given implementation of the instruction is, and we may want AI to prefer central implementations rather than extremely value-preserving ones.
Ideally AI should also worry that it's contemplating exercising more or less influence than desired, and clarify that as it would any other aspect of the task.
You're very likely correct IMO. The only thing I see pulling in the other direction is that cars are far more standardized than humans, and a database of detailed blueprints for every make and model could drastically reduce the resolution needed for usefulness. Especially if the action on a cursory detection is "get the people out of the area and scan it harder", not "rip the vehicle apart".
This is the first text talking about goals I've read that meaningfully engages with "but what if you were (partially) wrong about what you want" instead of simply glorifying "outcome fixation". This seems like a major missing piece in most advice about goals. That the most important thing about your goals is that they're actually what you want. And discovering that may not be the case is a valid reason to tap the brakes and re-evaluate.
(Assuming a frame of materialism, physicalism, empiricism throughout even if not explicitly stated)
Some of your scenarios that you're describing as objectionable would reasonably be described as emulation in an environment that you would probably find disagreeable even within the framework of this post. Being emulated by a contraption of pipes and valves that's worse in every way than my current wetware is, yeah, disagreeable even if it's kinda me. Making my hardware less reliable is bad. Making me think slower is bad. Making it easier for others to tamper with my sensors is bad. All of these things are bad even if the computation faithfully represents me otherwise.
I'm mostly in the same camp as Rob here, but there's plenty left to worry about in these scenarios even if you don't think brain-quantum-special-sauce (or even weirder new physics) is going to make people-copying fundamentally impossible. Being an upload of you that now needs to worry about being paused at any time or having false sensory input supplied is objectively a worse position to be in.
The evidence does seem to lean in the direction that non-classical effects in the brain are unlikely: neurons are just too big for quantum effects between neurons, and even if there were quantum effects within neurons, it's hard to imagine them being stable for even as long as a single train of thought. The copy losing their train of thought and having momentary confusion doesn't seem to reach the bar where they don't count as the same person? And yet weirder new physics mostly requires experiments we haven't thought to do yet, or experiments in regimes we've not yet been able to test. Whereas the behavior of things at STP in water is about as central to things-Science-has-pinned-down as you're going to get.
You seem to hold that the universe may still have a lot of important surprises in store, even within the central subject matter of century-old fields? Do you have any kind of intuition pump for the feeling that there are still that many earth-shattering surprises left (while simultaneously holding that empiricism and science mostly work)? My sense of where there's likely to be surprises left is not quite so expansive, and this sounds like a crux for a lot of people. Even as much of a shock as QM was to physics, it didn't invalidate much if any theory except in directly adjacent fields like chemistry and optics. And working out the finer points had progressively narrower and shorter-reaching impact. I can't think of examples of surprises with a larger blast radius within the history of vaguely modern science. Finding odd, as-yet-unexplained effects pretty consistently precedes attempts at theory. Empirically determined rules don't start working any worse when we realize the explanation given with them was wrong.
Keep in mind that society holds that you're still you even after a non-trivial amount of head trauma. So whatever amount of imperfection in copying your unknown-unknowns cause, it'd have to both be something we've never noticed before in a highly studied area, and something more disruptive than getting clocked in the jaw, which seems a tall order.
Keep in mind also that the description(s) of computation that computer science has worked out are extremely broad and far from limited to just electronic circuits. Electronics are pervasive because we have, as a society, sunk the world's GDP (possibly several times over) into figuring out how to make them cheaply at scale. Capital investment is the only thing special about computers realized in silicon; computer science makes no such distinction. The notion of computation is so broad that there's little if any room to conceive of an agent that's doing something that can't be described as computation. Likewise the equivalence proofs are quite broad: it can be arbitrarily expensive to translate across architectures, but within each class of computers, computation is computation, and that emulation is possible has proofs.
All of your examples are doing that thing where you have a privileged observer position separate and apart from anything that could be seeing or thinking within the experiment. You-the-thinker can't simply step into the thought experiment. You-the-thinker can of course decide where to attach the camera by fiat, but that doesn't tell us anything about the experiment, just about you and what you find intuitive.
Suppose for sake of argument your unknown unknowns mean your copy wakes up with a splitting headache and amnesia for the previous ~12 hours as if waking up from surgery. They otherwise remember everything else you remember and share your personality such that no one could notice a difference (we are positing a copy machine that more or less works). If they're not you they have no idea who else they could be, considering they only remember being you.
The above doesn't change much for me, and I don't think I'd concede much more without saying you're positing a machine that just doesn't work very well. It's easy for me to imagine it never being practical to copy or upload a mind, or having modest imperfections or minor differences in experience, especially at any kind of scale. Or simply being something society at large is never comfortable pursuing. It's a lot harder to imagine it being impossible even in principle with what we already know, or can already rule out with fairly high likelihood. I don't think most of the philosophy changes all that much if you consider merely very good copying (your friends and family can't tell the difference; knows everything you know) vs perfect copying.
The most bullish folks on LLMs seem to think we're going to be able to make copies good enough to be useful to businesses just off all your communications. I'm not nearly so impressed with the capabilities I've seen to date and it's probably just hype. But we are already getting into an uncanny valley with the (very) low fidelity copies current AI tech can spit out - which is to say they're already treading on the outer edge of peoples' sense of self.
Realistically I doubt you'd even need to be sure it works, just reasonably confident. Folks step on planes all the time and those do on rare occasion fail to deliver them intact at the other terminal.
Within this framework, whether or not you "feel that continuity" would mostly be a fact about the ontology your mindstate uses when thinking about teleportation. Everything in this post could be accurate and none of it would be incompatible with you having an existential crisis upon being teleported, freaking out upon meeting yourself, etc.
Nor does anything here seem to make a value judgement about what the copy of you should do if told they're not allowed to exist. Attempting revolution seems like a perfectly valid response; self defense is held as a fairly basic human right after all. (I'm shocked that isn't already the plot of a sci-fi story.)
It would also be entirely possible for both of your copies to hold conviction that they're the one true you - Their experiences from where they sit being entirely compatible with that belief. (Definitely the plot of at least one Star Trek episode.)
There's not really any pressure currently to have thinking about mind copying that's consistent with every piece of technology that could ever conceivably be built. There's nothing that forces minds to have accurate beliefs about anything that won't kill them or wouldn't have killed their ancestors in fairly short order. Which is to say mostly that we shouldn't expect to get accurate beliefs about weird hypotheticals often without having changed our minds at least once.
There's a presumption you're open to discussing on a discussion forum, not just grandstanding. Strong downvoted much of this thread for the amount of my time you've wasted trolling.
Bell Labs, Xerox PARC, etc. were AFAIK mostly privately funded research labs that existed for decades and churned out patents that may as well have been money printers. When AT&T (Bell Labs) was broken up, that research all but started the modern telecom and tech industry, which is now something like 20%+ of the stock market. If you attribute even a tiny fraction of that to Bell Labs, it's enough to fund another one 1000 times over.
The missing piece arguably is executive teams with a 25-year vision instead of a 25-week vision, AND the institutional support to see it through; cost cutting is in fashion with investors too. Private equity is in theory well positioned to repeat this elsewhere, but for reasons I don't entirely understand has become too short-sighted and/or has significantly shortened horizons on returns. IBM, Qualcomm, TSMC, ASML, and Intel all seem to have research operations of that same near-legendary caliber, mostly privately funded (albeit treated as national treasures of strategic importance); what they have in common, of course, is they're all tech. Semiconductor fabrication is extremely research intensive, and world-class R&D operations are table stakes just to survive to the next process node.
Maybe a good followup question is why hasn't this model spread outside of semiconductors and tech? Is a functional monopoly a requirement for the model to work? (ASML has a functional monopoly on leading edge photo-lithography machines that power modern semiconductor fabs). Do these labs ever start independently without a clear lineage to 100 billion+ dollar govt research initiatives? Electronics and tech is probably many trillions in US govt funding since WWII once you include military R&D and contracts.
Govt. spending is a ratchet that only goes one direction; replacing dysfunctional agencies costs jobs and makes political enemies. Reform might be more practical, but much like with people, it's very hard to reform an agency that doesn't want to change. You'd be talking about sustained expenditure of political capital, the sort of thing that requires an agency head who's invested in the change and popular enough with both parties to get to spend a few administrations working at it.
Edit: I answered separately above with regards to private industry.
Again you're saying that without engaging with any of my arguments or giving me any more of your reasoning to consider. Unless you care to share substantially more of your reasoning, I don't see much point continuing this?
That is a big part of the threat here. Many of the current deployments are many steps removed from anyone reading research papers. E.g. sure, people at MS and OpenAI involved with that roll-out are presumably up on the literature. But the IT director deciding when and how to deploy copilot, what controls need to be in place, etc? Trade publications, blogs, maybe they ask around on Reddit to see what others are doing.
Related: how do spin-off subcultures fit into this model? E.g. in music you have people who consume an innovation in one genre, then reinvent it in their own scene where they're a creator. I think there's similar dynamics in various LW-adjacent subcultures, though I'm not up enough on detailed histories to comment.
For less loaded terms, maybe Create, Consume, Exploit or Create, Enjoy, Exploit as the set of actions available. Looks like loosely what was settled on above.
Where exploit more naturally captures things like soulless commercialization and others low key taking advantage of those enjoying the scene.
Consume in the context of rationalists would more be people who read the best techniques on offer and then go try to use them for things that aren't "advancing the art" itself, like addressing x-risk.
You're still hammering on stuff I never disagreed with in the first place. In so far as I don't already understand all the math (or math notation) I'd need to follow this, that's a me problem not a you problem, and having a pile of cool papers I want to grok is prime motivation for brushing up on some more math. I'm definitely not down-voting merely on that.
What I'm mostly trying to get across is just how large of a leap of logic you're making from [post got 2 or 3 downvotes] => [everyone here hates math]. There's got to be at least 3 or 4 major inferences there you haven't articulated here and I'm still not sure what you're reacting so strongly to. Your post with the lowest karma is the first one and it's sitting at neutral, based on a grand total of 3 votes besides yours. You are definitely sophisticated enough on math to understand the hazards of reasoning from a sample size that small.
Any conversation about karma would necessarily involve talking about what does and doesn't factor into votes, likely both here and in the internet or society at large. Not thinking we're getting anywhere on that point.
I've already said clearly and repeatedly that I don't have a problem with math posts, and I don't think others do either. You're not going to get what you want by continuing to straw-man me and others. I disagree with your premise; you've thus far failed to acknowledge or engage with any of those points.
Ah, gotcha. I had gotten the other impression from the thread in aggregate.
If you're selling them at unit cost you aren't selling at cost, you're straightforwardly selling at a loss. That's definitely not what I'm thinking of when someone tells me they're selling at cost.
For everyone who gets curious and challenges (or even evaluates on the merits) the approved right answers they learned from their culture, there's dozens more who for whatever reason don't. "Who am I to challenge <insert authority>", "Why should I think I know better?", "How am I supposed to know what's true?" (rhetorically, not expecting an answer exists). And a thousand other rationalizations besides.
And then of those who try, most just find another authority they like better and end their inquiry - independent thinking is hard work, thankless work, lonely work. Even many groups that supposedly value this adopt the language and trappings without the actual thought and inquiry. People mostly challenge the approved right answers that the in-group has told them are safe to challenge. Even here plenty haven't escaped this.
And obviously you already know the safe approved "right" answers from society at large on this question - it's all a trap and you're a fool for considering it. And credit where it's due historically, they've so far been right.
I've always taken that as: hold average volumetric flow rate constant or slightly reduce it, significantly reduce the rate at which breaths are taken, and breathe deeper (more air at once) to compensate.
The use of the phrase "deep breath and hold" is also consistent with max lung volume == deep breath.
Wouldn't be engaging at all if I didn't think there was some truth to what you're saying about the math being important and folks needing to be persuaded to "take their medicine" as it were and use some rigor. You are not the first person to make such an observation and you can find posts on point from several established/respected members of the community.
That said, I think "convincing people to take their medicine" mostly looks like those answers you gave just being at the intro of the post(s) by default (and/or the intro to the series if that makes more sense). Alongside other misc readability improvements. Might also try tagging the title as [math heavy] or some such.
I think you're taking too narrow a view on what sorts of things people vote on and thus what sort of signal karma is. If that theory of mind is wrong, any of the inferences that flow from it are likely wrong too. Keep in mind also (especially when parsing karma in comments) that anything that parses as whining costs you status even if you're right (not just a LW thing). And complaining about internet points almost always parses that way.
I don't think it necessarily follows that because a math-heavy post got some downvotes, everyone hates math and will downvote math in the future. As opposed to something like: people care a lot about readability and about being able to prioritize their reading toward the subjects they find relevant, neither of which scores well if the post is math to the exclusion of all else.
I didn't find any of those answers surprising but it's an interesting line of inquiry all the same. I don't have a good sense of how it's simultaneously true that LLMs keep finding it helpful to make everything bigger, but also large sections of the model don't seem to do anything useful, and increasingly so in the largest models.
There's a more general concern here for running organizations where anyone can sue anyone at any time for any reason, merit or no. If one allows the barest hint of a lawsuit to dictate their actions, that too becomes another vector through which they can be manipulated. Perhaps a better thing to aim for is "don't do anything egregious enough a lawyer will take it on contingency", use additional caution if the potential adversary is much better resourced than you (and can afford sustained frivolous litigation).
Not a lawyer, but the "can't explain your reasoning" problem is overblown. Just need to be very diligent in separating facts from the opinions and findings of the panel. There is a reason every report of that sort done professionally sounds the particular flavor of stilted that it does.
"Our panel found that <accused> did <thing>" <- potential lawsuit, hope you can prove that in court. You're not a fact finder in a court of law, speak as if you are at your own peril.
"Our panel was convened to investigate <accusation> against <accused>. Based on <process>, we believe the accusation credible and recommend the following: ..." <- A OK
"Our panel was convened to investigate <accusation> against <accused>. Based on <process>, we were unable to corroborate the accusation and cannot recommend action at this time." <- A OK
"person X said Y, I did/didn't believe them" is basically always fine, provided X actually said Y. Quoting someone else's statement is well protected, and you're entitled to your opinions. The trouble happens when your opinions/findings/beliefs are stated as facts about what happened instead of facts about what you believe and what evidence you found persuasive.
There's also nothing stopping the panel from saying "we heard closed door testimony...", describe the rough topic, speakers relation to the inquiry, and the degree to which it was persuasive.
From an operational perspective, this is eye-opening in terms of how much trust is being placed in the companies that train models, and the degree to which nobody coming in later in the pipeline is going to be able to fully vouch for the behavior of the model, even if they spend time hammering on it. In particular, it seems like it took vastly less effort to sabotage those models than would be required to detect this.
That's relevant to the models that are getting deployed today. I think the prevailing thinking among those deploying AI models in businesses today is that the supply chain is less vulnerable to quietly slipping malware into an LLM compared to traditional software. That's not seeming like a safe assumption.
I did go pull up a couple of your posts as that much is a fair critique:
That first post is only the middle section of what would already be a dense post and is missing the motivating "what's the problem?", "what does this get us?"; without understanding substantially all of the math and spending hours I don't think I could even ask anything meaningful. That first post in particular is suffering from an approachable-ish sounding title then wall of math, so you're getting laypeople who expected to at least get an intro paragraph for their trouble.
The August 19th post piqued my interest substantially more on account of including intro and summary sections, and enough text to let me follow along while only understanding part of the math. A key feature of good math text is that I should be able to gloss over challenging proofs on a first pass, take your word for it, and still get something out of it. Definitely don't lose the rigor, but have mercy on those of us not cut out for a math PhD. If you had specific toy examples you were playing with while figuring out the post, those can also help make posts more approachable. That post seemed well received, just not viewed much; my money says the title is scaring off everyone but the full-time researchers (which I'm not, I'm in software).
I think I and most other interested members not in the field default to staying out of the way when people open up with a wall of post-grad math or something that otherwise looks like a research paper, unless specifically invited to chime in. And then same story with meta; this whole thread is something most folks aren't going to start under your post uninvited, especially when you didn't solicit this flavor of feedback.
I bring up notation specifically as the software crowd is very well represented here, and frequently learns advanced math concepts without bothering to learn any of the notation common in math texts. So not like 1 or 2 notation questions, but more like you can have people who get the concepts but all of the notation is Greek to them.
It is still a forum, all the usual norms about avoid off-topic, don't hijack threads apply. Perhaps a Q&A on how to get more engagement with math-heavy posts would be more constructive? Speaking just for myself, a cheat-sheet on notation would do wonders.
Nobody is under any illusions that karma is perfect AFAICT, though much discussion has already been had on to what extent it just mirrors the flaws in people's underlying rating choices.
Point of clarification: Is the supervisor the same as the potentially faulting hardware, or are we talking about a different, non-suspect node checking the work, and/or e.g. a more reliable model of chip supervising a faster but less reliable one?
The more curious case for excavators would be open-pit mines or quarries, where you know you're going to be in roughly the same place for decades and already have industrial-size hookups.
The answer there is if you can get it into evidence then you can get it in front of a jury. A big part of what lawyers do in litigation is argue about what gets into evidence and can get shown; all of that arguing costs time and money. I think a fair summary is if it's plausibly relevant, the judge usually can't/won't exclude it.
I wouldn't count on Microsoft being ineffective, but there's good reason to think they'll push for applications for the current state of the art over further blue sky capabilities stuff. The commitment to push copilot into every Microsoft product is already happening, the copilot tab is live in dozens of places in their software and in most it works as expected. It's already good enough to replace 80%+ of the armies of temps and offshore warm bodies that push spreadsheets and forms around today without any further big capabilities gains, and that's a plenty huge market to sate public investors. Sure more capabilities gets you more markets, but what they have now probably gets the entire AI division self-supporting on cashflow, or at least able to help with the skyrocketing costs of compute, plus funding the coming legal and lobbying battles over training data.
Covert side channels like you're suggesting would probably be a related and often helpful thing for someone trying to do what OP is talking about, but I think the side channels are distinct from the things they can be used for.
This concept in radio communications would be "spread spectrum", reducing the signal intensity or duration in any given part of the spectrum and using a wider band/more channels. See especially military spread spectrum comms and radars. E.g. this technique has been used to frustrate simple techniques for identifying the location of a radio transmitter, to avoid jamming, and to defeat radar warning/missile warning systems on jets.
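As a toy illustration of the frequency-hopping flavor of spread spectrum (the seed, hop count, and channel numbers here are all made up, and real systems use crypto-grade sequence generators rather than a stdlib PRNG): both ends derive the same pseudo-random hop schedule from a shared secret, so the signal never dwells long on any one frequency and an observer without the secret sees only brief, scattered bursts across the band.

```python
import random

def hop_sequence(shared_seed, n_hops, channels):
    """Derive a pseudo-random channel-hopping schedule from a shared seed.

    Transmitter and receiver run this with the same seed and stay
    synchronized; without the seed, predicting (or jamming) the next
    channel requires covering the whole band at once.
    """
    rng = random.Random(shared_seed)  # illustrative; not crypto-grade
    return [rng.choice(list(channels)) for _ in range(n_hops)]

tx = hop_sequence(1234, 8, range(100, 120))
rx = hop_sequence(1234, 8, range(100, 120))
assert tx == rx  # both ends agree on where to hop next
```

Energy per channel drops roughly in proportion to the number of channels used, which is what frustrates direction finding and narrowband jamming.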
It's pretty easy to find reasons why everything will hopefully be fine, or AI hopefully won't FOOM, or we otherwise needn't do anything inconvenient to get good outcomes. It's proving considerably harder (from my outside the field view) to prove alignment, or prove upper bounds on rate of improvement, or prove much of anything else that would be cause to stop ringing the alarm.
FWIW I'm considerably less worried than I was when the Sequences were originally written. The paradigms that have taken off since do seem a lot more compatible with straightforward training solutions that look much less alien than expected. There are plausible scenarios where we fail at solving alignment and still get something tolerably human shaped, and none of those scenarios previously seemed plausible. That optimism just doesn't take it under the stop worrying threshold.
Admittedly I skimmed large portions of that, but I'd like to take a crack at bridging some of that inferential distance with a short description of the model I've been using, whereby I keep all the concerns you brought up straight but also don't have to choke on pronouns.
Categories of Men and Women are useful in a wide variety of areas and point at a real thing. There's a region in the middle these categories overlap and lack clean boundaries - while both genetics and birth sex are undeniable and straightforward fact in almost all cases (~98% IIRC), they don't make the wide ranging good predictions you'd otherwise expect in this region. I've mentally been calling this the "gender/sex/identity is complicated" region. Within this region, carefully consider which category is more relevant and go with that; other times a weighted average may be more appropriate.
By way of example if I want to infer likely skill-sets, hobbies, or interests for someone trans, I'm probably looking at either their pre-transition category, or a weighted average based on years before vs after transition.
On the other hand if I'm considering how a friend or conversation partner might prefer to be treated, I'd almost certainly be correct to infer based on claimed/stated gender until I know more.
On the one hand I can definitely see why those threads got under your skin (and shocked The Thoughts You Cannot Think didn't get a link); not the finest showing in clear thinking. Ultimately though I'm skeptical that we should treat pronouns as making some deep claim about the structure of person-space along the axis of sex. If anything, that there's conflict at all should serve to highlight that there's a large region (as much as 20% of the population maybe???) where this isn't cut and dry and simple rules aren't making good predictions. Looking at that structure there's a decent if not airtight case for treating pronouns as you would any other nicknames or abbreviations - namely acceptable insofar as the referent finds the name acceptable. There are places where a "no pseudonyms allowed, no exceptions" rule should and does trump "preferred moniker"/"no name-calling", but Twitter clearly isn't one.
I think a key distinction here is that any of this only helps if people care more about the truth of the issue at hand than whatever realpolitik considerations the issue has tangentially gotten pulled into. And yeah, absent "unreasonable levels of political savvy", academics are mostly relying on academic issues usually being far enough from the icky world of politics to be openly discussed, at least outside of a few seriously diseased disciplines where the rot is well and truly set in. The powers that be seem to only care about the truth of an issue when it starts directly impinging on their day to day; people seem to find it noteworthy when this isn't true of a given leader.
I don't think this will ever be fully predictable. E.g. in the US I don't think anyone really saw the magnitude of the backlash against election workers, academics, and security folks coming until it became headline news. And arguably that's what a near-miss looks like.
This is very much what I want my headlines to look like.
Personally, my preferred mode of consumption would be an AM email newsletter like Axios or Morning Brew.
The resolution dates on the markets seem important for several of the headlines and were noticeably missing from the body.
"Crimea land bridge 22% chance of being cut [this year/campaign season], down from 34% according to Insight"
Notice how different that would read with the time horizon on there vs. leaving it unqualified. The other big question an update like that raises is "what changed?"
Interesting follow-up: how long do they take to break out of the bad equilibrium if all start there? How about if we choose a less extreme bad equilibrium (say 80 degrees)?
Looking ahead multiple moves seems sufficient to break the equilibrium, except for the stated assumption that the other players have deeply flawed models of your behavior - models that assume you're using a different strategy, the shared one including punishment. There does seem to be something fishy/circular about baking an assumption about other players' strategies into a player's own strategy while omitting any ability to update.
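A minimal sketch of that dynamic (invented payoffs, not the original post's exact setup): non-updating punisher strategies hold a bad norm, and a player who simply plays its true preference gets punished forever rather than breaking the equilibrium.

```python
# Toy temperature game: each round every player names a temperature, the room
# lands at the average of the choices, and everyone's payoff is -|room - IDEAL|.
# All constants here are arbitrary, chosen only for illustration.
IDEAL, NORM, PUNISH_TEMP, ROUNDS = 68, 80, 100, 10

def punisher(history):
    # Hold the norm; if anyone undercut it last round, punish by maxing the
    # heat. Note there is no mechanism for this rule to ever update.
    if history and any(c < NORM for c in history[-1]):
        return PUNISH_TEMP
    return NORM

def deviant(history):
    return IDEAL  # always plays its true preference

players = [punisher, punisher, deviant]
history, payoffs = [], [0.0] * len(players)

for _ in range(ROUNDS):
    choices = [p(history) for p in players]
    room = sum(choices) / len(choices)
    payoffs = [u - abs(room - IDEAL) for u in payoffs]
    history.append(choices)

print(history[0])   # [80, 80, 68]: the deviation
print(history[-1])  # [100, 100, 68]: permanent punishment, norm never breaks
```

With punishers that can't update, deviating just makes every round worse for everyone; breaking the equilibrium requires the punishers' model of the deviant to change, which is exactly what the setup rules out.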
Not sure I'm following the setup and notation closely enough to argue that one way or the other, as far as the order in which the agent receives evidence and has to commit to actions. Above I was considering the simplest case of one bit of evidence in, one bit of action out, repeat.
I'm pretty sure that could be extended to get that one-small-key-that-unlocks-the-whole-puzzle sort of effect, where the model clicks all at once. As you say though, I'm not sure that gets to the heart of the matter regarding the bound: it may show that no such bound exists on the margin - the last piece can be worth much more than all the prior pieces of evidence - but not necessarily in a way that violates the proposed bound overall. Maybe we have to see that last piece as unlocking some bounded amount of value from the prior observations.
It's possible to construct a counterexample where there's a step from guessing at random to perfect knowledge after an arbitrary number of observed bits; n-1 bits of evidence are worthless alone and the nth bit lets you perfectly predict the next bit and all future bits.
Consider, for example, shifting bits one at a time into the input of a known hash function initialized with an unknown value (of known width), where I ask you to guess a specified bit of the output. In the idealized case you know nothing about the output until you learn the final input bit (once all the unknown bits have shifted out), because the bits are perfectly mixed; after that you'll guess every future bit correctly.
Seems like the pathological cases can be arbitrarily messy.
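A concrete (hypothetical) version of that construction, using SHA-256 as the mixing function: the evidence stream is the bits of a secret key, and until the last bit is known the observer's predictions are at chance.

```python
import hashlib

def stream_bit(key: bytes, t: int) -> int:
    # y_t = low bit of SHA-256(key || t): effectively unpredictable without
    # the full key, since the hash mixes every input bit into every output bit.
    digest = hashlib.sha256(key + t.to_bytes(4, "big")).digest()
    return digest[0] & 1

secret = b"\x9a\x3c\x11\xf0"  # arbitrary 4-byte key, for illustration only

# With the full key you can of course regenerate the stream exactly; the
# interesting case is knowing all but one bit. Flipping the single unknown
# bit decorrelates the entire output stream:
almost = secret[:3] + bytes([secret[3] ^ 1])
agree = sum(stream_bit(almost, t) == stream_bit(secret, t) for t in range(1000))
print(agree)  # roughly 500 of 1000, i.e. no better than guessing
```

So the value of each of the first n-1 evidence bits is ~zero, and the nth bit takes you from chance to perfect prediction of every future bit.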
Wary of this line of thinking, but I'll concede that it's a lot easier to moderate when there's something written to point to for expected conduct. Seconding the other commenters that if it's official policy then it's more correctly dubbed guidelines rather than norms.
I'm struck by the lack of any principled center or Schelling point for balancing [ability to think and speak freely as the mood takes you] against any of the thousand and one often conflicting needs for what makes a space nice/useful/safe/productive/etcetera. It seems like anyone with moderating experience ends up with some idea of a workable place to draw those lines, but rarely do two people end up with exactly the same idea, and articulating it is fraught. This would really benefit from some additional thought and better framing, and it's pretty central to what this forum is about (namely building effective communities around these ideas) rather than purely a moderation question.
Taking the premise at face value for sake of argument.
You should be surprised by just how many fields of study bottom out in something intractable to simulate or re-derive from first principles.
The substrate that all agents run on is conveniently obfuscated and difficult to understand or simulate ourselves - perhaps intentionally, to make it unclear what shortcuts are being taken or whether the minds are running inside the simulation at all.
Likewise chemistry bottoms out in near-intractable quantum soup, the end result being that almost all related knowledge has to be experimentally determined and compiled into large tables of physical properties. Quantum mechanics does relatively little to constrain this in practice; I think large molecules' and heavy elements' properties could diverge significantly from what we would predict, without it being detectable, since we can't run large enough QM simulations.
It's awfully convenient that most of us spend all our time running on autopilot and then come up with post-hoc justifications of our behavior - why we're scarcely more convincing than GPT explaining the actions of a game NPC. I wonder why we're like that... (see point 1).
I'm sure folks could come up with other examples. It's kind of an odd change of pace how science keeps running into bizarre smokescreens everywhere we look, after the progress of the last few centuries. How many oddities are hiding just a little deeper?
I don't personally find the above persuasive on net, but it is the first tree I'd go barking up if I was giving that hypothesis further consideration.
I suppose that depends a lot on how hard anyone is trying to cause mischief, and how much easier it's going to get to do anything of consequence. 4chan is probably a good prototype of your typical troll "in it for the lulz", and while they regularly go past what most would call harmless fun, there's not a body count.
The other thing people worry about (and the news has apparently decided is the thing we all need to be afraid of this month...) is conventional bad actors using new tools to do substantially whatever they were trying to do before, but more; mostly confuse, defraud, spread propaganda, what have you. I'm kind of surprised I don't already have an inbox full of LLM-composed phishing emails... On some level it's a threat, but it's also not a particularly hard one to grasp, it's getting lots of attention, and new weapons and tactics are a constant in conflicts of all types.
I'm still of the mind that directly harmful applications like the above are going to pale next to the economic disruption and social unrest that's going to come from making large parts of the workforce redundant very quickly. Talking specific policy doesn't look like it's going to be in the Overton window until after AI starts replacing jobs at scale, and the "we'll have decades to figure it out" theory hasn't been looking good of late. And when that conversation starts it's going to suck all the air out of the room and leave little mainstream attention for worrying about AGI.
Getting clear, impossible-to-ignore warning shots first would be a good thing on net, even if unpleasant in the moment. Unless you're suggesting that simple (non-AGI) AI tools are going to be civilization-threatening - but I'm not seeing it and you didn't argue it.
I very much understand the frustration, but I'll ask, as someone also not directly adjacent to any of this: what would you have me and others like me do? There's no shortage of anger and frustration in any discussion like this, but I don't see any policy suggestions floating around that sound like they might work and aren't already being tried (or at least any suggestion that there are countermeasures that should be deployed and haven't been).
Chiming in on toy models of research incentives: it seems to me like a key feature is that the game starts as an Arms Race and then, after some amount of capability accumulates, transitions into a Suicide Race. But players have only vague estimates of where that threshold is, those estimates vary widely, and players may not be able to communicate them effectively or persuasively. Each player has a strong incentive to push right up to the line where things get obviously (to them) dangerous, and with enough players, somebody's estimate is going to be wrong.
Working off a model like that, we'd much rather be playing the version where players can effectively share estimates and converge on a view of what level of capability makes things very dangerous. The lack of constructive conversations with the largest players on that topic does sound like a current bottleneck.
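A hedged toy version of that threshold model (all numbers invented): each lab draws a noisy private estimate of the true danger threshold and pushes capabilities up to its estimate minus a safety margin; disaster means somebody's stopping point overshoots the real line.

```python
import random

def p_disaster(n_labs: int, noise: float, margin: float,
               trials: int = 10_000, true_threshold: float = 100.0) -> float:
    """Fraction of simulated worlds where at least one lab overshoots."""
    bad = 0
    for _ in range(trials):
        # Each lab's private, noisy estimate of where danger begins.
        estimates = [true_threshold + random.gauss(0, noise)
                     for _ in range(n_labs)]
        # Each lab pushes to (own estimate - margin); disaster if any lab's
        # stopping point exceeds the true threshold.
        if any(e - margin > true_threshold for e in estimates):
            bad += 1
    return bad / trials

random.seed(0)
solo = p_disaster(n_labs=1, noise=5, margin=5)
crowd = p_disaster(n_labs=10, noise=5, margin=5)
print(solo, crowd)  # more players -> far more likely someone's estimate is wrong
```

Even modest estimation noise plus many players makes the "somebody overshoots" term dominate, which is why sharing estimates (shrinking the noise) matters more than any single player's caution.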
It's unclear to me to what extent there's even a universally understood distinction between mundane weak AI systems, with ordinary kinds of risks, and superhuman AGI systems, with exotic risks that software and business people aren't used to thinking about outside of sci-fi. That strikes me as a key inferential leap that may be getting glossed over.
Quite a lot of effort in technology goes into training people that systems are mostly static absent human intervention or well-defined automations that some person ultimately wrote - anything else being a fault that gets fixed. Computers don't have a mind of their own, troubleshoot instead of anthropomorphizing, etc., etc. That this intuition will at some point stop being true of a sufficiently capable system (and that this is fundamentally part of what we mean by human-level AGI) probably needs more focus, as it's explicitly contrary to the basic induction that's part of usefully working in/on computers.