Charlie Steiner's Shortform

charlie-steiner

Charlie Steiner's Shortform

post by Charlie Steiner · 2020-08-04T06:28:11.553Z · LW · GW · 54 comments

54 comments

54 comments

Comments sorted by top scores.

comment by Charlie Steiner · 2025-01-17T01:24:13.109Z · LW(p) · GW(p)

Could someone who thinks capabilities benchmarks are safety work explain the basic idea to me?

It's not all that valuable for my personal work to know how good models are at ML tasks. Is it supposed to be valuable to legislators writing regulation? To SWAT teams calculating when to bust down the datacenter door and turn the power off? I'm not clear.

But it sure seems valuable to someone building an AI to do ML research, to have a benchmark that will tell you where you can improve.

But clearly other people think differently than me.

Replies from: elifland, habryka4, Thane Ruthenis, bogdan-ionut-cirstea, Chris_Leong, jacques-thibodeau, lahwran, nathan-helm-burger, throwaway_2025

↑ comment by elifland · 2025-01-17T15:48:50.468Z · LW(p) · GW(p)

Not representative of motivations for all people for all types of evals, but https://www.openphilanthropy.org/rfp-llm-benchmarks/, https://www.lesswrong.com/posts/7qGxm2mgafEbtYHBf/survey-on-the-acceleration-risks-of-our-new-rfps-to-study [LW · GW], https://docs.google.com/document/d/1UwiHYIxgDFnl_ydeuUq0gYOqvzdbNiDpjZ39FEgUAuQ/edit, and some posts in https://www.lesswrong.com/tag/ai-evaluations [? · GW] seem relevant.

Replies from: sheikh-abdur-raheem-ali

↑ comment by Sheikh Abdur Raheem Ali (sheikh-abdur-raheem-ali) · 2025-01-18T03:34:47.764Z · LW(p) · GW(p)

I'm unable to open the google docs file in the third link.

Replies from: elifland

↑ comment by elifland · 2025-01-18T03:36:16.565Z · LW(p) · GW(p)

Sorry, fixed

↑ comment by habryka (habryka4) · 2025-01-17T21:00:33.211Z · LW(p) · GW(p)

I think the core argument is "if you want to slow down, or somehow impose restrictions on AI research and deployment, you need some way of defining thresholds. Also, most policymaker's cruxes appear to be that AI will not be a big deal, but if they thought it was going to be a big deal they would totally want to regulate it much more. Therefore, having policy proposals that can use future eval results as a triggering mechanism is politically more feasible, and also, epistemically helpful since it allows people who do think it will be a big deal to establish a track record".

I find these arguments reasonably compelling, FWIW.

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-17T21:09:35.695Z · LW(p) · GW(p)

I think it would be good for more people to explicitly ask political staffers and politicians the question: "What hypothetical eval result would change your mind if you saw it?"

I think a lot of the evals are more targeted towards convincing tech workers than convincing politicians.

Replies from: habryka4, sharmake-farah

↑ comment by habryka (habryka4) · 2025-01-17T22:56:57.275Z · LW(p) · GW(p)

My sense is political staffers and politicians aren't that great at predicting their future epistemic states this way, and so you won't get great answers for this question. I do think it's a really important one to model!

↑ comment by Noosphere89 (sharmake-farah) · 2025-01-19T18:44:08.442Z · LW(p) · GW(p)

I believe the actual answer is "when it starts automating everything in the real world."

↑ comment by Thane Ruthenis · 2025-01-17T01:33:45.035Z · LW(p) · GW(p)

Perhaps the reasoning is that the AGI labs already have all kinds of internal benchmarks of their own, no external help needed, but the progress on these benchmarks isn't a matter of public knowledge. Creating and open-sourcing these benchmarks, then, only lets the society better orient to the capabilities progress taking place, and so make more well-informed decisions, without significantly advantaging the AGI labs.

↑ comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2025-01-17T17:49:18.759Z · LW(p) · GW(p)

At the very least, evals for automated ML R&D should be a very decent proxy for when it might be feasible to automate very large chunks of prosaic AI safety R&D.

↑ comment by Chris_Leong · 2025-01-17T15:18:04.115Z · LW(p) · GW(p)

I think I saw someone arguing that their particular capability benchmark was good for evaluating the capability, but of limited use for training the capability because their task only covered a small fraction of that domain.

↑ comment by jacquesthibs (jacques-thibodeau) · 2025-01-18T13:15:19.819Z · LW(p) · GW(p)

In case you didn't read Paul's reasoning [LW · GW].

↑ comment by the gears to ascension (lahwran) · 2025-01-17T14:56:24.927Z · LW(p) · GW(p)

What will you do if nobody makes a successful case?

Replies from: Charlie Steiner

↑ comment by Charlie Steiner · 2025-01-17T17:41:27.615Z · LW(p) · GW(p)

Be sad.

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-17T06:00:46.886Z · LW(p) · GW(p)

You probably don't mean dangerous capabilities evals, right? I mean, I do feel hesitant even about those. I would really not want someone using my work on WMDP to increase their model's ability to make bioweapons.

In Connor Leahy's recent interview on Trajectory he argues that scientists making evals are being "used" as tools by the AI corporations in a similar way to how cancer researchers were used by cigarette companies to throw confusion into the path of concluding cigarettes cause cancer.

Replies from: Charlie Steiner, sharmake-farah

↑ comment by Charlie Steiner · 2025-01-17T09:13:42.788Z · LW(p) · GW(p)

With bioweapons evals at least the profit motive of AI companies is aligned with the common interest here; a big benefit of your work comes from when companies use it to improve their product. I'm not at all confused about why people would think this is useful safety work, even if I haven't personally hashed out the cost/benefit to any degree of confidence.

I'm mostly confused about ML / SWE / research benchmarks.

Replies from: nathan-helm-burger

↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-17T16:26:59.760Z · LW(p) · GW(p)

I'm not sure but I have a guess. A lot of "normies" I talk to in the tech industry are anchored hard on the idea that AI is mostly a useless fad and will never get good enough to be useful.

They laugh off any suggestions that the trends point towards rapid improvements that can end up with superhuman abilities. Similarly, completely dismiss arguments that AI might used for building better AI. 'Feed the bots their own slop and they'll become even dumber than they already are!'

So, people who do believe that the trends are meaningful, and that we are near to a dangerous threshold, want some kind of proof to show the doubters. They want people to start taking this seriously before it's too late.

I do agree that the targeting of benchmarks by capabilities developers is totally a thing. The doubting-Thomases of the world are also standing in the way of the capabilities folks of getting the cred and funding they desire. A benchmark designed specifically to convince doubters is a perfect tool for... convincing doubters who might then fund you and respect you.

↑ comment by Noosphere89 (sharmake-farah) · 2025-01-17T16:28:10.852Z · LW(p) · GW(p)

I'm really getting annoyed by AI safety people making analogies towards things that had way more evidence than the AI risk field ever got, and it also happens with comparisons to climate change.

↑ comment by throwaway_2025 · 2025-01-17T19:40:32.957Z · LW(p) · GW(p)

Capabilities benchmarks can be highly useful in safety applications. You raised a great example with ML benchmarks. Strong ML R&D capabilities lie upstream of many potential risks:

Labs may begin automating research, which could shorten timelines.
These capabilities may increase proliferation risks of techniques used to develop frontier models.
In the extremes, these capabilities may increase the risk of uncontrolled recursive self-improvement.

Labs, governments, and everyone else involved should have an accurate understanding of where the capabilities frontier lies to enable good decision making. The only quantitatively rigorous way of doing that is with good benchmarks.

Capabilities are not bottlenecked on benchmarks to inform where model developers could make improvements, and adding more is extremely unlikely to make any significant difference to capabilities progress.

Therefore, I think having more capabilities benchmarks a good thing because it can greatly increase our understanding of model capabilities without making much of a difference in timelines. However, if you are interested in doing safety work, building capabilities benchmarks is probably not the most effective thing you could be doing.

comment by Charlie Steiner · 2024-12-13T07:39:27.488Z · LW(p) · GW(p)

Taking AI companies that are locally incentivized to race toward the brink, and then hoping they stop right at the cliff's edge, is potentially a grave mistake.

One might hope they stop because of voluntary RSPs, or legislation setting a red line, or whistleblowers calling in the government to lock down the datacenters, or whatever. But just as plausible to me is corporations charging straight down the cliff (of building ever-more-clever AI as fast as possible until they build one too clever and it gets power and does bad things to humanity), and even strategizing ahead of time how to avoid obstacles like legislation telling them not to. Local incentives have a long history of dominating people in this way, e.g. people in the tobacco and fossil fuel industries

What would be so much safer is if even the local incentives of cutting-edge AI companies favored social good, alignment to humanity, and caution. This would require legislation blocking off a lot of profitable activity, plus a lot of public and philanthropic money incentivizing beneficial activity, in a convulsive effort whose nearest analogy is the global shift to renewable energy.

(this take is the one thing I want to boost from AI For Humanity.)

comment by Charlie Steiner · 2021-07-08T00:47:59.581Z · LW(p) · GW(p)

Some thoughts on reading Superintelligence (2014). Overall it's been quite good, and nice to read through such a thorough overview even if it's not new to me. Weirdly I got some comments that people often stop reading it. What this puts me in mind of is a physics professor remarking to me that they used to find textbooks impenetrable, but now they find it quite fun to leaf through a new introductory textbook. And now my brain is relating this to the popularity of fanfiction that re-uses familiar characters and settings :P

By god, Nick Bostrom thinks in metaphors all the time. Not to imply that this is bad at all, in fact it's very interesting.

The way the intelligence explosion kinetics is presented really could stand to be less one-dimensional about intelligence. Or rather, perhaps it should ask us to accept that there is some one-dimensional measure of capability that is growing superlinearly, which can then by parlayed into all the other things we care about via the "superpower"-style arguments that appear two chapters later.

Has progress on AI seemed to outpace progress on augmenting human intelligence since 2014? I think so, and perhaps this explains why Bostrom_2014 puts more emphasis on whole brain emulations. But perhaps not - perhaps instead I've/we've been unduly neglecting thinking about alternate paths to superintelligence in the last few years.

Human imitations (machine learning systems trained to reproduce the behavior of a human within some domain) seem conspicuously absent from Bostrom_2014's toolbox of parts to build an aligned AI out of. Is this a reflection of the times? My memory is blurry but... plausibly? If so, I think that's that's a pretty big piece of conceptual progress we've made.

When discussing what moral theory to give a superintelligence, Bostrom inadvertently makes a good case for another piece of conceptual progress since 2014 - our job is not to find The One Right Moral Theory, it is both more complicated and easier than that (see Scott, Me [LW · GW]). Hopefully this notion has percolated around enough that this chapter would get written differently today. Or is this still something we don't have consensus on among Serious Types? Then again, it was already Bostrom who coined the term MaxiPOK - maybe the functional difference isn't too large.

I would have thought "maybe the CEV of humanity would just shut itself down" would be uttered with more alarm.

In conclusion, footnotes are superior to endnotes.

comment by Charlie Steiner · 2024-08-22T10:42:19.106Z · LW(p) · GW(p)

About a year ago, I made a bet - my $50,000 against their $1000 - that we wouldn't see slam-dunk evidence of UFOs/UAPs being the result of aliens, the supernatural, simulations, or anything similarly non-mundane.

What's changed 1 year on? Well, I think a year ago UAPs and aliens were more in the news, between governmental hearings in several countries, a whistleblower ex-USAF intelligence official, and continuing coverage of navy UAP tapes. None of that has led anywhere, and it's mostly fading from public memory.

You can find people currently claiming that the big reveal that breaks it all wide open is just around the corner. But you can basically always find people claiming that. While doing some quick searching before making this comment, though, I did find out that a congressman from Tennessee is a big believer that the extraordinarily fake-looking aliens exhibited in Mexico last year are super important and need to be investigated at U Tennessee.

If they were, and they turned out to have non-terrestrial biological structure, that's definitely a way I could pay the money. I estimate the probability of this at about 0.00000000000000001.

Replies from: Vladimir_Nesov

↑ comment by Vladimir_Nesov · 2024-08-22T17:08:58.133Z · LW(p) · GW(p)

It's feasible to establish AGI-run governance that does nothing on its own other than permanently and irrevocably but unobtrusively restrict the level of technological development of every civilization it reaches, including its own builders (perhaps as a way of opposing extinction risk). This leads to strange ancient cultures of biological low-tech aliens that slowly travel the galaxy, much later than the initial wave of von Neumann probes of the technological development restricting AGI.

This is still unlikely, as the outcome both wastes the cosmic endowment and requires sufficient technical sophistication to make it stable and irrevocable. So the builders of this AGI governance both need to decently understand alignment and target an outcome that radically impairs their future. But this seems only Fermi paradox unlikely, not literal magic unlikely. The fraudulent nature of "evidence" we see reduces the probability that this is the case further, as low-tech aliens could instead be making themselves known in straightforward ways, while the high-tech AGI that restricts tech doesn't need to be observable at all. But this doesn't go all the way to impossibility, as an ancient low-tech culture could have traditions and bureaucracy cashing out in a bizarre first contact process.

The prediction of this hypothesis is that we don't get to develop unrestricted ASI of our own. Given the inscrutable nature of the models (or equivalently lack of technical sophistication needed to know what we are doing) any interventions don't yet need to be humanly observable.

comment by Charlie Steiner · 2024-12-05T11:15:31.828Z · LW(p) · GW(p)

I read Fei-Fei Li's autobiographical book (The Worlds I See). I give it a 'imagenet wasn't really an adventure story, so you'd better be interested in it intrinsically and also want to hear about the rest of Fei-Fei Li's life story.' out of 5.

She's somewhat coy about military uses, how we're supposed to deal with negative social impacts, and anything related to superhuman AI. I can only point to the main vibe, which is 'academic research pointing out problems is vital, I sure hope everything works out after that.'

comment by Charlie Steiner · 2022-01-04T02:26:08.880Z · LW(p) · GW(p)

What do you use to shoot an acausal blackmailer?

Depleted Platonium rounds.

Replies from: avturchin

↑ comment by avturchin · 2022-01-04T10:25:05.790Z · LW(p) · GW(p)

These are possible worlds where you can blackmail the blaclmailer by the fact that you know that he did blackmail

comment by Charlie Steiner · 2021-11-13T09:47:53.579Z · LW(p) · GW(p)

A fun thought about nanotechnology that might only make sense to physicists: in terms of correlation function, CPUs are crystalline solids, but eukaryotic cells are liquids. I think a lot of people imagine future nanotechnology as made of solids, but given the apparent necessity of using diffusive transport, nanotechnology seems more likely to be statistically liquid.

(For non-physicists: the building blocks of a CPU are arranged in regular patterns even over long length scales. But the building blocks of a cell are just sort of diffusing around in a water solvent. This can be formalized by a "correlation function," basically how correlated the positions of building blocks is - how much knowing the position of one lets you pin down the position of another. The position of transistors is correlated with each other over length scales as long as the size of the entire wafer of silicon, but the position of one protein in a cell tells you only a little bit about nearby proteins, and almost nothing about the position of proteins far away.)

comment by Charlie Steiner · 2024-03-28T20:36:43.538Z · LW(p) · GW(p)

Dictionary/SAE learning on model activations is bad as anomaly detection because you need to train the dictionary on a dataset, which means you needed the anomaly to be in the training set.

How to do dictionary learning without a dataset? One possibility is to use uncertainty-estimation-like techniques to detect when the model "thinks its on-distribution" for randomly sampled activations.

Replies from: neel-nanda-1, ejenner

↑ comment by Neel Nanda (neel-nanda-1) · 2024-03-29T12:03:34.947Z · LW(p) · GW(p)

You may be able to notice data points where the SAE performs unusually badly at reconstruction? (Which is what you'd see if there's a crucial missing feature)

Replies from: ejenner, Charlie Steiner

↑ comment by Erik Jenner (ejenner) · 2024-03-29T17:12:24.418Z · LW(p) · GW(p)

Would you expect this to outperform doing the same thing with a non-sparse autoencoder (that has a lower latent dimension than the NN's hidden dimension)? I'm not sure why it would, given that we aren't using the sparse representations except to map them back (so any type of capacity constraint on the latent space seems fine). If dense autoencoders work just as well for this, they'd probably be more straightforward to train? (unless we already have an SAE lying around from interp anyway, I suppose)

Replies from: Charlie Steiner

↑ comment by Charlie Steiner · 2024-03-29T21:01:04.843Z · LW(p) · GW(p)

Regular AE's job is to throw away the information outside some low-dimensional manifold, sparse ~linear AE's job is to throw away the information not represented by sparse dictionary codes. (Also a low-dimensional manifold, I guess, just made from a different prior.)

If an AE is reconstructing poorly, that means it was throwing away a lot of information. How important that information is seems like a question about which manifold the underlying network "really" generalizes according to. And also what counts as an anomaly / what kinds of outliers you're even trying to detect.

↑ comment by Charlie Steiner · 2024-03-29T16:45:27.314Z · LW(p) · GW(p)

Ah, yeah, that makes sense.

↑ comment by Erik Jenner (ejenner) · 2024-03-28T22:04:17.531Z · LW(p) · GW(p)

I think this is an important point, but IMO there are at least two types of candidates for using SAEs for anomaly detection (in addition to techniques that make sense for normal, non-sparse autoencoders):

Sometimes, you may have a bunch of "untrusted" data, some of which contains anomalies. You just don't know which data points have anomalies on this untrusted data. (In addition, you have some "trusted" data that is guaranteed not to have anomalies.) Then you could train an SAE on all data (including untrusted) and figure out what "normal" SAE features look like based on the trusted data.
Even for an SAE that's been trained only on normal data, it seems plausible that some correlations between features would be different for anomalous data, and that this might work better than looking for correlations in the dense basis. As an extreme version of this, you could look for circuits in the SAE basis and use those for anomaly detection.

Overall, I think that if SAEs end up being very useful for mech interp, there's a decent chance they'll also be useful for (mechanistic) anomaly detection (a lot of my uncertainty about SAEs applies to both possible applications). Definitely uncertain though, e.g. I could imagine SAEs that are useful for discovering interesting stuff about a network manually, but whose features aren't the right computational units for actually detecting anomalies. I think that would make SAEs less than maximally useful for mech interp too, but probably non-zero useful.

Replies from: Charlie Steiner

↑ comment by Charlie Steiner · 2024-03-29T02:44:13.453Z · LW(p) · GW(p)

Even for an SAE that's been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.

Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I'm uncertain if it's going to be weak against adversarial anomalies relative to regular ol' random anomalies.

comment by Charlie Steiner · 2024-03-09T13:29:05.756Z · LW(p) · GW(p)

Congrats to Paul on getting appointed to NIST AI safety.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2024-03-09T14:09:49.393Z · LW(p) · GW(p)

Can't find a reference that says it has actually happened already.

Replies from: Charlie Steiner

↑ comment by Charlie Steiner · 2024-03-09T15:59:26.236Z · LW(p) · GW(p)

Oh, maybe I've jumped the gun then. Whoops.

comment by Charlie Steiner · 2021-04-12T07:06:57.070Z · LW(p) · GW(p)

I think you can steelman Ben Goertzel-style worries about near-term amoral applications of AI being bad "formative influences" on AGI, but mostly under a continuous takeoff model of the world. If AGI is a continuous development of earlier systems, then maybe it shares some datasets and learned models with earlier AI projects, and definitely it shares the broader ecosystems of tools, dataset-gathering methodologies, model-evaluating paradigms, and institutional knowledge on the part of the developers. If the ecosystem in which this thing "grows up" is one that has previously been optimized for marketing, or military applications, or what have you, this is going to have ramifications in how the first AGI projects are designed and what they get exposed to. The more continuous you think the development is going to be, the more this can be intervened on by trying to make sure that AI is pro-social even in the short term.

comment by Charlie Steiner · 2023-03-05T04:34:50.733Z · LW(p) · GW(p)

Interpretability as an RLHF problem seems like something to do.

comment by Charlie Steiner · 2022-11-14T17:39:47.236Z · LW(p) · GW(p)

Idea: The AI of Terminator.

One of the formative books of my childhood was The Physics of Star Trek, by Lawrence Krauss. Think of it sort of like xkcd's What If?, except all about physics and getting a little more into the weeds.

So:

Robotics / inverse kinematics. Voice recognition, language models, and speech synthesis. Planning / search. And of course, self-improvement, instrumental convergence, existential risk.

To make this work you'd need to already be pretty well-suited. The Physics of Star Trek was Krauss' third published book, and he got Stephen Hawking to write the forward.

comment by Charlie Steiner · 2021-07-16T08:09:05.831Z · LW(p) · GW(p)

There's a point by Stuart Armstrong that anthropic updates are non-Bayesian, because you can think of Bayesian updates as deprecating improbable hypotheses and renormalizing, while anthropic updates (e.g. updating on "I think just got copied") require increasing probability on previously unlikely hypotheses.

In the last few years I've started thinking "what would a Solomonoff inductor do?" more often about anthropic questions. So I just thought about this case, and I realized there's something interesting (to me at least).

Suppose we're in the cloning version of Sleeping Beauty. So if the coin landed Heads, the sign outside the room will say Room A, if the coin landed Tails, the sign could say either Room A or Room B. Normally I translate this into Solomonoff-ese by treating different hypotheses about the world as Turing machines that could reproduce my memories. Each hypothesis is like a simulation of the world (with some random seed), plus some rule for specifying what physical features (i.e. my memories) to read to the output tape. So "Heads in Room A," "Tails in Room A," and "Tails in Room B" are three different Turing machines that reproduce the sequence of my memories by simulating a physical universe.

Where's the non-Bayesian-ness? Well, it's because before getting copied, but after the coin was flipped, there were supposed to only be two clumps of hypotheses - let's call them "Heads and walking down the hallway" and "Tails and walking down the hallway." And so before getting copied you assign only 50% to Tails. And so P(Tails|memories after getting copied) is made of two hypotheses while P(Tails|memories before getting copied) is only made of one hypothesis, so the extra hypothesis got bumped up non-Bayesianly.

But... hang on. Why can't "Tails in Room B" get invoked as a second hypothesis for my memories even before getting copied? After all, it's just as good at reproducing my past memories - the sequence just happens to continue on a bit.

What I think is happening here is that we have revealed a difference between actual Solomonoff induction, and my intuitive translation of hypotheses into Turing machines. In actual (or at least typical) Solomonoff induction, it's totally normal if the sequence happens to continue! But in the hypotheses that correspond to worlds at specific points in time, the Turing machine reads out my current memories and only my current memories. At first blush, this seems to me to be an arbitrary choice - is there some reason why it's non-arbitrary?

We could appeal to the imperfection of human memory (so that only current-me has my exact memories), but I don't like the sound of that, because anthropics doesn't seem like it should undergo a discontinuous transition if we get better memory.

Replies from: Dagon

↑ comment by Dagon · 2021-07-16T17:54:10.336Z · LW(p) · GW(p)

Do you have a link to that argument? I think Bayesean updates include either reducing a prior or increasing it, and then renormalizing all related probabilities. Many updatable observations take the form of replacing an estimate of future experience (I will observe sunshine tomorrow) by a 1 or zero (I did or did not observe that, possibly not quite 0 or 1 if you want to account for hallucinations and imperfect memory).

Anthropic updates are either bayesean or impossible. The underlying question remains "how does this experience differ from my probability estimate"? For Bayes or for Solomonoff, one has to answer "what has changed for my prediction? In what way am I surprised and have to change my calculation?"

Replies from: Charlie Steiner

↑ comment by Charlie Steiner · 2021-07-16T18:41:41.330Z · LW(p) · GW(p)

https://www.alignmentforum.org/posts/iNi8bSYexYGn9kiRh/paradoxes-in-all-anthropic-probabilities [AF · GW] I think?

I have a totally non-Solomonoff explanation of what's going on [LW · GW], which actually goes full G.K. Chesterton - I assign anthropic probabilities because I don't assume that my not waking is impossible. But I'm not sure how a Solomonoff inductor would see it.

comment by Charlie Steiner · 2021-04-01T18:28:16.166Z · LW(p) · GW(p)

Will the problem of logical counterfactuals just solve itself with good model-building capabilities? Suppose an agent has knowledge of its own source code, and wants to ask the question "What happens if I take action X?" where their source code provably does not actually do X.

A naive agent might notice the contradiction and decide that "What happens if I take action X?" is a bad question, or a question where any answer is true, or a question where we have to condition on cosmic rays hitting transistors at just the right time. But we want a sophisticated agent to be able to be aware of the contradiction and yet go on to say "Ah, but what I meant wasn't a question about the real world, but a question about some simplified model of the world that lumps all of the things that would normally be contradictory about this into one big node - I take action X, and also my source code outputs X, and also maybe even the programmers don't immediately see X as a bug."

Of course, the sophisticated agent doesn't have to bother saying that if it already makes plans using simplified models of the world that lump things together etc. etc. It's planning will thus implicitly deal with logical counterfactuals, and if it does verbal reasoning that taps into these same models, it can hold a conversation about logical counterfactuals. This seems pretty close to how humans do it.

Atheoretically building an agent that is good at making approximate models would therefore "accidentally" be a route to solving logical counterfactuals. But maybe we can do theory to this too: a theorem about logical counterfactuals is going to be a theorem about processes for building approximate models of the world - which it actually seems plausible to relate back to logical inductors and the notion of out-planning "simple" agents.

comment by Charlie Steiner · 2020-08-04T06:28:12.915Z · LW(p) · GW(p)

It seems like there's room for the theory of logical-inductor-like agents with limited computational resources, and I'm not sure if this has already been figured out. The entire trick seems to be that when you try to build a logical inductor agent, it's got some estimation process for math problems like "what does my model predict will happen?" and it's got some search process to find good actions, and you don't want the search process to be more powerful than the estimator because then it will find edge cases. In fact, you want them to be linked somehow, so that the search process is never in the position of taking advantage of the estimator's mistakes - if you, a human, are making some plan and notice a blind spot in your predictions, you don't "take advantage" of yourself, you do further estimating as part of the search process.

The hard part is formalizing this handwavy argument, and figuring out what other strong conditions need to be met to get nice guarantees like bounded regret.

comment by Charlie Steiner · 2023-11-01T03:20:46.913Z · LW(p) · GW(p)

Charlie's easy and cheap home air filter design.

Ingredients:

MERV-13 fabric, cut into two disks (~35 cm diameter) and one long rectangle (16 cm by 110 cm).

Computer fan - I got a be quiet BL047.

Cheap plug-in 12V power supply

Hot glue

Instructions:

Splice the computer fan to the power supply. When you look at the 3-pin fan connector straight on and put the bumps on the connector on the bottom, the wire on the right is ground and the wire in the middle is 12V. Do this first so you are absolutely sure which way the fan blows before you hot glue it.

Hot glue the long edge of the rectangle to the edge of one disc. The goal is to start a short cylinder (to be completed once you put on the second disc). Use plenty of glue on the spots where you have to pleat the fabric to get it to conform.

Place the fan centered on the seam of the long rectangle (the side of the cylinder), oriented to blow inwards, and tack it down with some hot glue. Then cut away a circular hole for the air to get in and hot glue the circular housing around the fan blade to the fabric.

Hot glue the other disk on top to complete the cylinder.

Power on and enjoy.

comment by Charlie Steiner · 2023-04-14T18:45:37.979Z · LW(p) · GW(p)

AI that's useful for nuclear weapon design - or better yet, a clear trendline showing that AI will soon be useful for nuclear weapon design - might be a good way to get governments to put the brakes on AI.

comment by Charlie Steiner · 2024-04-21T12:56:31.750Z · LW(p) · GW(p)

Humans using SAEs to improve linear probes / activation steering vectors might quickly get replaced by a version of probing / steering that leverages unlabeled data.

Like, probing is finding a vector along which labeled data varies, and SAEs are finding vectors that are a sparse basis for unlabeled data. You can totally do both at once - find a vector along which labeled data varies and is part of a sparse basis for unlabeled data.

This is a little bit related to an idea with the handle "concepts live in ontologies." If I say I'm going to the gym, this concept of "going to the gym" lives in an ontology where people and activites are basic components - it's probably also easy to use ideas like "You're eating dinner" in that ontology, but not "1,3-diisocyanatomethylbenzene." When you try to express one idea, you're also picking a "basis" for expressing similar ideas.

comment by Charlie Steiner · 2023-12-06T22:18:12.408Z · LW(p) · GW(p)

Trying to get to a good future by building a helpful assistant seems less good than it did a month ago, because the risk is more salient that clever people in positions of power may coopt helpful assistants to amass even more power.

One security measure against this is reducing responsiveness to the user, and increasing the amount of goal information that's put into to large finetuning datasets that have lots of human eyeballs on them.

comment by Charlie Steiner · 2023-02-25T17:53:30.364Z · LW(p) · GW(p)

Should government regulation on AI ban using reinforcement learning with a target of getting people to do things that they wouldn't endorse in the abstract (or some similar restriction)?

E.g. should using RL to make ads that maximize click-through be illegal?

comment by Charlie Steiner · 2022-12-04T19:43:50.563Z · LW(p) · GW(p)

Just looked up Aligned AI (the Stuart Armstrong / Rebecca Gorman show) for a reference, and it looks like they're publishing blog posts:

E.g. https://www.aligned-ai.com/post/concept-extrapolation-for-hypothesis-generation

comment by Charlie Steiner · 2021-09-28T15:02:57.139Z · LW(p) · GW(p)

https://venturebeat.com/2021/09/27/the-limitations-of-ai-safety-tools/

This article makes a persuasive case that there being different sorts of safety research can be confusing to keep track of if you're a journalist (who are not so different from policymakers or members of the public).

comment by Charlie Steiner · 2021-08-13T21:20:07.818Z · LW(p) · GW(p)

https://www.sciencedirect.com/science/article/abs/pii/S0896627321005018

(biorxiv https://www.biorxiv.org/content/10.1101/613141v2 )

Cool paper on trying to estimate how many parameters neurons have (h/t Samuel at EA Hotel). I don't feel like they did a good job distinguishing how hard it was for them to fit nonlinearities that would nonetheless be the same across different neurons, versus the number of parameters that were different from neuron to neuron. But just based on differences in physical arrangement of axons and dendrites, there's a lot of opportunity for diversity, and I do think the paper was convincing that neurons are sufficiently nonlinear that this structure is plausibly important. The question is how much neurons undergo selection based on this diversity, or even update their patterns as a form of learning!

comment by Charlie Steiner · 2021-03-11T07:51:25.412Z · LW(p) · GW(p)

Back in the "LW Doldrums" c. 2016, I thought that what we needed was more locations - a welcoming (as opposed to heavily curated a la old AgentFoundations), LW-style forum solely devoted to AI alignment, and then the old LW for the people who wanted to talk about human rationality.

This philosophy can also be seen in the choice to make the AI Alignment forum as a sister site to LW2.0.

However, what actually happened is that we now have non-LW forums for SSC readers who want to talk about politics, SSC readers who want to talk about human rationality, and people who want to talk about effective altruism. And meanwhile, LW2.0 and the alignment forum have sort of merged into one forum that is mostly talking about AI alignment but sometimes also has posts on COVID, EA, peoples' personal lives, and economics, and more rarely human rationality. Honestly, it's turned out pretty good.

Charlie Steiner's Shortform

Contents

54 comments