Posts

PSA: Consider alternatives to AUROC when reporting classifier metrics for alignment 2024-06-24T17:53:28.705Z
Are LLMs sufficient for AI takeoff? 2023-02-17T15:46:36.765Z
What's actually going on in the "mind" of the model when we fine-tune GPT-3 to InstructGPT? 2023-02-10T07:57:04.733Z
Image generation and alignment 2023-01-05T16:05:44.231Z
rpglover64's Shortform 2022-06-28T03:01:32.214Z

Comments

Comment by rpglover64 (alex-rozenshteyn) on LLMs for Alignment Research: a safety priority? · 2024-04-05T15:27:47.210Z · LW · GW

Would you say that models designed from the ground up to be collaborative and capabilitarian would be a net win for alignment, even if they're not explicitly weakened in terms of helping people develop capabilities? I'd be worried that they could multiply human efforts equally, but with humans spending more effort on capabilities, that's still a net negative.

Comment by rpglover64 (alex-rozenshteyn) on Many arguments for AI x-risk are wrong · 2024-03-05T16:25:46.240Z · LW · GW

I really appreciate the call-out where modern RL for AI does not equal reward-seeking (though I also appreciate @tailcalled 's reminder that historical RL did involve reward during deployment); this point has been made before, but not so thoroughly or clearly.

A framing that feels alive for me is that AlphaGo didn't significantly innovate in the goal-directed search (applying MCTS was clever, but not new) but did innovate in both data generation (use search to generate training data, which improves the search) and offline-RL.

Comment by rpglover64 (alex-rozenshteyn) on Open Thread – Winter 2023/2024 · 2024-03-03T16:32:15.136Z · LW · GW

before: [image]

after: [image]

Here the difference seems only to be spacing, but I've also seen bulleted lists appear. I think, though I can't recall for sure, that I've seen something similar happen to top-level posts.

Comment by rpglover64 (alex-rozenshteyn) on Open Thread – Winter 2023/2024 · 2024-03-02T20:50:48.181Z · LW · GW

@Habryka @Raemon I'm experiencing weird rendering behavior on Firefox on Android. Before voting, comments are sometimes rendered incorrectly in a way that gets fixed after I vote on them.

Is this a known issue?

Comment by rpglover64 (alex-rozenshteyn) on Open Thread – Winter 2023/2024 · 2024-03-02T20:44:29.041Z · LW · GW

This is also mitigated by automatically generated images like Gravatar or the SSH key visualization. I wonder if they can be made small enough to just add to usernames everywhere while maintaining enough distinguishable representations.

Comment by rpglover64 (alex-rozenshteyn) on Hidden Cognition Detection Methods and Benchmarks · 2024-02-27T18:54:54.809Z · LW · GW

Note that every issue you mentioned here can be dealt with by trading off capabilities.

Yes. The trend I see is "pursue capabilities, worry about safety as an afterthought if at all". Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad's proposal), but most orgs don't want to bite the bullet of a heavy alignment tax.

I also think you're underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the "HF" part lets the model know what responses are desired. Also, for deception, I wasn't specifically thinking of strategic deception; for general deception, limiting situational awareness doesn't prevent it arising (though it lessens its danger), and if you want to avoid the capability, you'd need to avoid any mention of e.g. honesty in the training.

Comment by rpglover64 (alex-rozenshteyn) on Hidden Cognition Detection Methods and Benchmarks · 2024-02-27T13:43:02.659Z · LW · GW

The "sleeper agent" paper I think needs to be reframed. The model isn't plotting to install a software backdoor, the training data instructed it to do so. Or simply put, there was sabotaged information used for model training.

This framing is explicitly discussed in the paper. The point was to argue that RLHF without targeting the hidden behavior wouldn't eliminate it. One threat to validity is the possibility that artificially induced hidden behavior is different than naturally occurring hidden behavior, but that's not a given.

If you want a 'clean' LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean.

First of all, it's impossible to get "100% clean data", but there is a question of whether 5 9s of cleanliness is enough; it shouldn't be, if you want a training pipeline that's capable of learning from rare examples. Separate from that, some behavior is either subtle or emergent; examples include "power seeking", "sycophancy", and "deception". You can't reliably eliminate them from the training data because they're not purely properties of data.
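To put a rough number on that (my own arithmetic, with an illustrative corpus size): at five nines of cleanliness, a trillion-token corpus still contains

$$10^{12} \times (1 - 0.99999) = 10^{7}$$

"dirty" tokens, which is plenty for a pipeline that is capable of learning from rare examples.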

Comment by rpglover64 (alex-rozenshteyn) on Hidden Cognition Detection Methods and Benchmarks · 2024-02-26T17:26:19.708Z · LW · GW

Since the sleeper agents paper, I've been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the "attention monitoring" approach you allude to, but it also seems straightforward to design more subtle, or even steganographic, triggers.

A question I'm curious about is whether hard-to-detect triggers also induce more transfer from the non-triggered case to the triggered case. I suspect not, but I could see it happening, and it would be nice if it did.

Comment by rpglover64 (alex-rozenshteyn) on Commonsense Good, Creative Good · 2023-09-30T18:08:00.001Z · LW · GW

Let's consider the trolley problem. One consequentialist solution is "whichever choice leads to the best utility over the lifetime of the universe", which is intractable. This meta-principle rules it out as follows: if, for example, you learned that one of the 5 was on the brink of starting a nuclear war and the lone one was on the brink of curing aging, that would say switch, but if the two identities were flipped, it would say stay, and generally, there are too many unobservables to consider. By contrast, a simple utilitarian approach of "always switch" is allowed by the principle, as are approaches that take into account demographics or personal importance.

The principle also suggests that killing a random person on the street is bad, even if the person turns out to be plotting a mass murder, and conversely, a doctor saving said person's life is good.

Two additional cases where the principle may be useful and doesn't completely correspond to common sense:

  • I once read an article by a former vegan arguing against veganism and vegetarianism; one example was the fact that the act of harvesting grain involves many painful deaths of field mice, and that's not particularly better than killing one cow. Applying the principle, this suggests that suffering or indirect death cannot straightforwardly be the basis for these dietary choices, and that consent is on shaky ground.
  • When thinking about building a tool (like the LW infrastructure) that could be either hugely positive (because it leads to aligned AI) or hugely negative (because it leads to unaligned AI by increasing AI discussions), and there isn't really a way to know which, you are morally free to build it or not; any steps you take to increase the likelihood of a positive outcome are good, but you are not required to stop building the tool due to a huge unknowable risk. Of course, if there's compelling reason to believe that the tool is net-negative, that reduces the variance and suggests that you shouldn't build it (e.g. most AI capabilities advancements).

Framed a different way, the principle is, "Don't tie yourself up in knots overthinking." It's slightly reminiscent of quiescence search in that it's solving a similar "horizon effect" problem, but it's doing so by discarding evaluation heuristics that are not locally stable.

Comment by rpglover64 (alex-rozenshteyn) on Commonsense Good, Creative Good · 2023-09-28T14:00:07.564Z · LW · GW

This makes me think that a useful meta-principle for the application of moral principles in the absence of omniscience is "robustness to auxiliary information." Phrased another way, if the variance of the outcomes of your choices is high according to a moral principle, in all but the most extreme cases, either find more information or pick a different moral principle.

Comment by rpglover64 (alex-rozenshteyn) on Natural Abstractions: Key claims, Theorems, and Critiques · 2023-03-19T14:13:11.758Z · LW · GW

I had an insight about the implications of NAH which I believe is useful to communicate if true and useful to dispel if false; I don't think it has been explicitly mentioned before.

One of Eliezer's examples is "The AI must be able to make a cellularly identical but not molecularly identical duplicate of a strawberry." One of the difficulties is explaining to the AI what that means. This is a problem with communicating across different ontologies--the AI sees the world completely differently than we do. If NAH in a strong sense is true, then this problem goes away on its own as capabilities increase; that is, AGI will understand us when we communicate something that has a coherent natural interpretation, even without extra effort on our part to translate it to the AGI version of machine code.

Does that seem right?

Comment by rpglover64 (alex-rozenshteyn) on "Carefully Bootstrapped Alignment" is organizationally hard · 2023-03-18T18:37:31.628Z · LW · GW

(Why "Top 3" instead of "literally the top priority?" Well, I do think a successful AGI lab also needs have top-quality researchers, and other forms of operational excellence beyond the ones this post focuses on. You only get one top priority, )

 

I think the situation is more dire than this post suggests, mostly because "You only get one top priority." If your top priority is anything other than this kind of organizational adequacy, it will take precedence too often; if your top priority is organizational adequacy, you probably can't get off the ground.

The best distillation of my understanding regarding why "second priority" is basically the same as "not a priority at all" is this twitter thread by Dan Luu.

The fear was that if they said that they needed to ship fast and improve reliability, reliability would be used as an excuse to not ship quickly and needing to ship quickly would be used as an excuse for poor reliability and they'd achieve none of their goals.

Comment by rpglover64 (alex-rozenshteyn) on Teleosemantics! · 2023-03-04T16:03:52.479Z · LW · GW

I just read an article that reminded me of this post. The relevant section starts with "Bender and Manning’s biggest disagreement is over how meaning is created". Bender's position seems to have some similarities with the thesis you present here, especially when viewed in contrast to what Manning claims is the currently more popular position that meaning can arise purely from distributional properties of language.

This got me wondering: if Bender is correct, then there is a fundamental limitation in how well (pure) language models can understand the world; are there ways to test this hypothesis, and what does it mean for alignment?

Thoughts?

Comment by rpglover64 (alex-rozenshteyn) on Teleosemantics! · 2023-02-24T02:09:53.249Z · LW · GW

Interesting. I'm reminded of this definition of "beauty".

Comment by rpglover64 (alex-rozenshteyn) on The idea that ChatGPT is simply “predicting” the next word is, at best, misleading · 2023-02-20T14:57:12.009Z · LW · GW

One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT3); the former has no pretenses of doing only next token prediction.

Comment by rpglover64 (alex-rozenshteyn) on Are LLMs sufficient for AI takeoff? · 2023-02-17T21:59:32.222Z · LW · GW

But they seem like they are only doing part of the "intelligence thing".

I want to be careful here; there is some evidence to suggest that they are doing (or at least capable of doing) a huge portion of the "intelligence thing", including planning, induction, and search, and even more if you include minor external capabilities like storage.

I don't know if anyone else has spoken about this, but since thinking about LLMs a little I am starting to feel like there's something analogous to a small LLM (SLM?) embedded somewhere as a component in humans

I know that the phenomenon has been studied for reading and listening (I personally get a kick out of garden-path sentences); the relevant fields are "natural language processing" and "computational linguistics". I don't know of any work specifically that addressed it in the "speaking" setting.

if we want to build something "human level" then it stands to reason that it would end up with specialized components for the same sorts of things humans have specialized components for.

Soft disagree. We're actively building the specialized components because that's what we want, not because that's particularly useful for AGI.

Comment by rpglover64 (alex-rozenshteyn) on Open & Welcome Thread — February 2023 · 2023-02-17T15:28:21.535Z · LW · GW

Some thoughts:

  • Those who expect fast takeoffs would see the sub-human phase as a blip on the radar on the way to super-human
  • The model you describe is presumably a specialist model (if it were generalist and capable of super-human biology, it would plausibly count as super-human; if it were not capable of super-human biology, it would not be very useful for the purpose you describe). In this case, the source of the risk is better thought of as the actors operating the model and the weapons produced; the AI is just a tool
  • Super-human AI is a particularly salient risk because unlike others, there is reason to expect it to be unintentional; most people don't want to destroy the world
  • The actions for how to reduce xrisk from sub-human AI and from super-human AI are likely to be very different, with the former being mostly focused on the uses of the AI and the latter being on solving relatively novel technical and social problems
Comment by rpglover64 (alex-rozenshteyn) on Cyborgism · 2023-02-14T13:58:18.452Z · LW · GW

I think "sufficiently" is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?

I also don't think "something in the middle" is the right characterization; I think "something else" it more accurate. I think that the failure you're pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn't really present in either part.

I also think that "cyborg alignment" is in many ways a much more tractable problem than "AI alignment" (and in some ways even less tractable, because of pesky human psychology):

  • It's a much more gradual problem; a misaligned cyborg (with no agentic AI components) is not directly capable of FOOM (Amdahl's law was mentioned elsewhere in the comments as a limit on the usefulness of cyborgism, but it's also a limit on damage)
  • It has been studied longer and has existed longer; all technologies have influenced human thought

It also may be an important paradigm to study (even if we don't actively create tools for it) because it's already happening.

Comment by rpglover64 (alex-rozenshteyn) on Cyborgism · 2023-02-12T19:41:08.425Z · LW · GW

Like, I may not want to become a cyborg if I stop being me, but that's a separate concern from whether it's bad for alignment (if the resulting cyborg is still aligned).

Comment by rpglover64 (alex-rozenshteyn) on Cyborgism · 2023-02-11T21:40:53.282Z · LW · GW

OpenAI’s focus with doing these kinds of augmentations is very much “fixing bugs” with how GPT behaves: Keep GPT on task, prevent GPT from making obvious mistakes, and stop GPT from producing controversial or objectionable content. Notice that these are all things that GPT is very poorly suited for, but humans find quite easy (when they want to). OpenAI is forced to do these things, because as a public facing company they have to avoid disastrous headlines like, for example: Racist AI writes manifesto denying holocaust.[7]

As alignment researchers, we don’t need to worry about any of that! The goal is to solve alignment, and as such we don’t have to be constrained like this in how we use language models. We don’t need to try to “align” language models by adding some RLHF, we need to use language models to enable us to actually solve alignment at its core, and as such we are free to explore a much wider space of possible strategies for using GPT to speed up our research.[8] 

 

I think this section is wrong because it's looking through the wrong frame. It's true that cyborgs won't have to care about PR disasters in their augments, but it feels like this section is missing the fact that there are actual issues behind the PR problem. Consider a model that will produce the statement "Men are doctors; women are nurses" (when not explicitly playing an appropriate role); this indicates something wrong with its epistemics, and I'd have some concern about using such a model.

Put differently, a lot of the research on fixing algorithmic bias can be viewed as trying to repair known epistemic issues in the model (usually which are inherited from the training data); this seems actively desirable for the use case described here.

Comment by rpglover64 (alex-rozenshteyn) on Cyborgism · 2023-02-11T15:08:43.966Z · LW · GW

I think that's an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with in some sense passive AI) is not an alignment risk (any more than any technology is). My worry is the "dual use technology" section.

Comment by rpglover64 (alex-rozenshteyn) on On utility functions · 2023-02-10T17:36:37.569Z · LW · GW

simple utility functions are fundamentally incapable of capturing the complexity/subtleties/nuance of human preferences.

No objections there.

that inexploitable agents exist that do not have preferences representable by simple/unitary utility functions.

Yep.

Going further, I think utility functions are anti-natural to generally capable optimisers in the real world.

I tentatively agree.

That said

The existence of a utility function is a sometimes useful simplifying assumption, in a way similar to how logical omniscience is (or should we be doing all math with logical inductors?), and naturally generalizes as a formalism to something like "the set of utility functions consistent with revealed preferences".

In the context of human rationality, I have found a local utility function perspective to be sometimes useful, especially as a probe into personal reasons; that is, if you say "this is my utility function" and then you notice "huh... my action does not reflect that", this can prompt useful contemplation, some possible outcomes of which are:

  • You neglected a relevant term in the utility function, e.g. happiness of your immediate family
  • You neglected a relevant utility cost of the action you didn't take, e.g. the aversiveness of being sweaty
  • You neglected a constraint, e.g. you cannot actually productively work for 12 hours a day, 6 days a week
  • The circumstances in which you acted are outside the region of validity of your approximation of the utility function, e.g. you don't actually know how valuable having $100B would be to you
  • You made a mistake with your action

Of course, a utility function framing is neither necessary nor sufficient for this kind of reflection, but for me, and I suspect some others, it is helpful.

Comment by rpglover64 (alex-rozenshteyn) on What's actually going on in the "mind" of the model when we fine-tune GPT-3 to InstructGPT? · 2023-02-10T17:15:59.361Z · LW · GW

The base models of GPT-3 already have the ability to "follow instructions", it's just veiled behind the more general interface. [...] you can see how it contains this capability somewhere.

This is a good point that I forgot. My mental model of this is that since many training samples are Q&A, in these cases, learning to complete implies learning how to answer.

InstructGPT also has the value of not needing the wrapper of "Q: [] A: []", but that's not really a qualitative difference.

I want to push back a little bit on the claim that this is not a qualitative difference; it does imply a big difference in output for identical input, even if the transformation required to get similar output between the two models is simple.

In other words, instruction following is not a new capability and the fine-tuning doesn't really make any qualitative changes to the model. In fact, I think that you can get results [close to] this good if you prompt it really well (like, in the realm of soft prompts).

TIL about soft prompts. That's really cool, and I'm not surprised it works (it also feels a little related to my second proposed experiment). My intuition here (transferred from RNNs, but I think it should mostly apply to unidirectional transformers as well) is that a successful prompt puts the NN into the right "mental state" to generate the desired output: fine-tuning for e.g. instruction following mostly pushes to get the model into this state from the prompts given (as opposed to e.g. for HHH behavior, which also adjusts the outputs from induced states); soft prompts instead search for and learn a "cheat code" that puts the model into a state such that the prompt is interpreted correctly. Would you (broadly) agree with this?
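To make that intuition concrete, here is a toy sketch of soft-prompt tuning; the tiny stand-in model and all names are made up for illustration rather than taken from any particular library, but the structure is the standard one: freeze the base model and train only a small matrix of "virtual token" embeddings prepended to the input embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, n_virtual = 100, 32, 8

class TinyCausalLM(nn.Module):
    """Stand-in for a frozen pretrained LM that accepts input embeddings."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.body = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
    def forward(self, input_embeds):
        hidden, _ = self.body(input_embeds)
        return self.head(hidden)

lm = TinyCausalLM()
for p in lm.parameters():
    p.requires_grad_(False)                      # the base model stays frozen

soft_prompt = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def train_step(tokens, targets):
    """One gradient step that updates only the soft prompt."""
    tok_embeds = lm.embed(tokens)                               # (B, T, d)
    prompt = soft_prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
    logits = lm(torch.cat([prompt, tok_embeds], dim=1))         # (B, n_virtual + T, V)
    # Score only the real-token positions.
    loss = nn.functional.cross_entropy(
        logits[:, n_virtual:].reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

tokens = torch.randint(0, vocab_size, (4, 16))
print(train_step(tokens, tokens))                # toy objective: reproduce the input
```

Under this picture, fine-tuning moves the whole network so that ordinary prompts put it in the right state, while soft prompts leave the network alone and search input-embedding space for a "cheat code" that does the same job.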

Comment by rpglover64 (alex-rozenshteyn) on On utility functions · 2023-02-10T13:57:46.162Z · LW · GW

Other people have given good answers to the main question, but I want to add just a little more context about self-modifying code.

A bunch of MIRI's early work explored the difficulties of the interaction of "rationality" (including utility functions induced by consistent preferences) with "self-modification" or "self-improvement"; a good example is this paper. They pointed out some major challenges that come up when an agent tries to reason about what future versions of itself will do; this is particularly important because one failure mode of AI alignment is to build an aligned AI that accidentally self-modifies into an unaligned AI (note that continuous learning is a restricted form of self-modification and suffers related problems). There are reasons to expect that powerful AI agents will be self-modifying (ideally self-improving), so this is an important question to have an answer to (relevant keywords include "stable values" and "value drift").

There's also some thinking about self-modification in the human-rationality sphere; two things that come to mind are here and here. This is relevant because ways in which humans deviate from having (approximate, implicit) utility functions may be irrational, though the other responses point out limitations of this perspective.

Comment by rpglover64 (alex-rozenshteyn) on Open & Welcome Thread - January 2023 · 2023-02-10T13:08:28.181Z · LW · GW

Done: https://www.lesswrong.com/posts/eywpzHRgXTCCAi8yt/what-s-actually-going-on-in-the-mind-of-the-model-when-we

Comment by rpglover64 (alex-rozenshteyn) on Open & Welcome Thread - January 2023 · 2023-02-10T00:38:23.059Z · LW · GW

(I may promote this to a full question)

Do we actually know what's happening when you take an LLM trained on token prediction and fine-tune it via e.g. RLHF to get something like InstructGPT or ChatGPT? The more I think about the phenomenon, the more confused I feel.

Comment by rpglover64 (alex-rozenshteyn) on Focus on the places where you feel shocked everyone's dropping the ball · 2023-02-02T23:24:38.731Z · LW · GW

English doesn’t have great words for me to describe what I mean here, but it’s something like: your visualization machinery says that it sees no obstacle to success, such that you anticipate either success or getting a very concrete lesson.

One piece of advice/strategy I've received that's in this vein is "maximize return on failure". So prefer to fail in ways that teach you a lot, to fail quickly, cheaply, and conclusively, and to produce positive externalities from failure. This is not so much a good search strategy as a good guiding principle and selection heuristic.

Comment by rpglover64 (alex-rozenshteyn) on Alignment allows "nonrobust" decision-influences and doesn't require robust grading · 2023-01-31T01:24:43.192Z · LW · GW

This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I've been thinking about where people seem to conflate utility-optimizing agents with policy-executing agents, but the two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍.

That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers--agents that do things like planning or search to maximize a utility function; but planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. agents that enact learned policies in RL that do not explicitly maximize reward in deployment but try to enact policies that did so in training.

The answer is not to find a clever way to get a robust grader. The answer is to not need a robust grader

💯

From my perspective, this post convincingly argues that one route to alignment involves splitting the problem into two still-difficult sub-problems (but actually easier, unlike inner- and outer-alignment, as you've said elsewhere): identifying a good shard structure and training an AI with such a shard structure. One point is that the structure is inherently somewhat robust (and that therefore each individual shard need not be), making it a much larger target.

I have two objections:

  • I don't buy the implied "naturally-robust" claim. You've solved the optimizer's curse, wireheading via self-generated adversarial inputs, etc., but the policy induced by the shard structure is still sensitive to the details; unless you're hiding specific robust structures in your back pocket, I have no way of knowing that increasing the candy-shard's value won't cause a phase shift that substantially increases the perceived value of the "kill all humans, take their candy" action plan. I ultimately care about the agent's "revealed preferences", and I am not convinced that those are smooth relative to changes in the shards.

  • I don't think that we can train a "value humans" shard that avoids problems with the edge cases of what that means. Maybe it learns that it should kill all humans and preserve their history; or maybe it learns that it should keep them alive and comatose; or maybe it has strong opinions one way or another on whether uploading is death; or maybe it respects autonomy too much to do anything (though that one would probably be decommissioned and replaced by one more dangerous). The problem is not adversarial inputs but genuine vagueness where precision matters. I think this boils down to me disagreeing with John Wentworth's "natural abstraction hypothesis" (at least in some ways that matter).

Comment by rpglover64 (alex-rozenshteyn) on Deconfusing "Capabilities vs. Alignment" · 2023-01-23T15:31:04.362Z · LW · GW

My working model for unintended capabilities improvement focuses more on second- and third-order effects: people see promise of more capabilities and invest more money (e.g. found startups, launch AI-based products), which increases the need for competitive advantage, which pushes people to search for more capabilities (loosely defined). There is also the direct improvement to the inner research loop, but this is less immediate for most work.

Under my model, basically any technical work that someone could look at and think "Yes, this will help me (or my proxies) build more powerful AI in the next 10 years" is a risk, and it comes down to risk mitigation (because being paralyzed by fear of making the wrong move is a losing strategy):

Comment by rpglover64 (alex-rozenshteyn) on The Plan - 2022 Update · 2023-01-20T03:20:53.268Z · LW · GW

I don't think "definitions" are the crux of my discomfort. Suppose the model learns a cluster; the position, scale, and shape parameters of this cluster summary are not perfectly stable--that is, they vary somewhat with different training data. This is not a problem on its own, because it's still basically the same; however, the (fuzzy) boundary of the cluster is large (I have a vague intuition that the curse of dimensionality is relevant here, but nothing solid). This means that there are many cutting planes, induced by actions to be taken downstream of the model, on which training on different data could have yielded a different result. My intuition is that most of the risk of misalignment arises at those boundaries:

  • One reason for my intuition is that in communication between humans, difficulties arise in a similar way (i.e. when two people's clusters have slightly different shapes)
  • One reason is that the boundary cases feel like the kind of stuff you can't reliably learn from data or effectively test.

Your comment seems to be suggesting that you think the edge cases won't matter, but I'm not really understanding why the fuzzy nature of concepts makes that true.

Comment by rpglover64 (alex-rozenshteyn) on Compounding Resource X · 2023-01-11T14:31:49.959Z · LW · GW

This seems like a useful special case of "conditions-consequences" reasoning. I wonder whether

  • Avoiding meddling is a useful subskill in this context (probably not)
  • There is another useful special case
Comment by rpglover64 (alex-rozenshteyn) on Contra Common Knowledge · 2023-01-07T00:44:22.614Z · LW · GW

A good example of this is the Collatz conjecture. It has a stupendous amount of evidence in the finite data points regime, but no mathematician worth their salt would declare Collatz solved, since it needs to have an actual proof. 

It's important to distinguish the probability you'd get from a naive induction argument and a more credible one that takes into account the reference class of similar mathematical statements that hold until large but finite limits but may or may not hold for all naturals.
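For concreteness, the naive-induction kind of evidence is the sort of thing a few lines of code can generate (a toy check of my own, not anything from the discussion):

```python
# Verify the Collatz conjecture for every n below a finite bound.
# Passing this check is the "stupendous amount of evidence" in the
# finite-data-points regime; it is obviously not a proof.
def reaches_one(n: int, max_steps: int = 10_000) -> bool:
    for _ in range(max_steps):
        if n == 1:
            return True
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return False

assert all(reaches_one(n) for n in range(1, 100_000))
print("Collatz holds for all n < 100,000")
```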

Similarly, P=NP is another problem with vast evidence, but no proof.

Arguably even more than the Collatz conjecture, if you believe Scott Aaronson (link).

Mathematics is what happens when 0 or 1 are the only sets of probabilities allowed, that is a much stricter standard exists in math.

As I mentioned in my other reply, there are domains of math that do accept probabilities < 1 as proof (but it's in a very special way). Also, proof usually comes with insight (this is one of the reasons that the proof of the 4 color theorem was controversial), which is almost more valuable than the certainty it brings (and of course as skeptics and bayesians, we must refrain from treating it as completely certain anyway; there could have been a logical error that everyone missed).

I think talking about proofs in terms of probabilities is a category error; logical proof is analogous to computable functions (equivalent if you are a constructivist), while probabilities are about bets.

Comment by rpglover64 (alex-rozenshteyn) on Contra Common Knowledge · 2023-01-07T00:16:48.534Z · LW · GW

I feel like this is a difference between "almost surely" and "surely", both of which are typically expressed as "probability 1", but which are qualitatively different. I'm wondering whether infinitesimals would actually work to represent "almost surely" as 1 − ε for an infinitesimal ε (as suggested in this post).

Also a nitpick and a bit of a tangent, but in some cases, a mathematician will accept any probability > 0 as proof; probabilistic proof is a common tool for non-constructive proof of existence, especially in combinatorics (although the way I've seen it, it's usually more of a counting argument than something that relies essentially on probability).
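A concrete instance of that counting flavor (a standard textbook example, not something from this thread): the Erdős lower bound for Ramsey numbers. Color each edge of $K_n$ red or blue independently with probability 1/2; a fixed set of $k$ vertices is monochromatic with probability $2^{1-\binom{k}{2}}$, so the expected number of monochromatic $k$-cliques is

$$\binom{n}{k}\,2^{1-\binom{k}{2}}.$$

If that quantity is below 1, some coloring must have zero monochromatic $k$-cliques, so $R(k,k) > n$: an existence proof that never exhibits the coloring and is, at bottom, a counting argument.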

Comment by rpglover64 (alex-rozenshteyn) on How to slow down scientific progress, according to Leo Szilard · 2023-01-06T23:58:15.768Z · LW · GW

Not missing the joke, just engaging with a different facet of the post.

Comment by rpglover64 (alex-rozenshteyn) on Machine Learning vs Differential Privacy · 2023-01-06T17:42:45.975Z · LW · GW

should I do something like rewriting the main post to mention the excellent answers I had, or can I click on some « accept this answer » button somewhere?

I don't think there's an "accept answer" button, and I don't think you're expected to update your question. I personally would probably edit it to add one sentence summarizing your takeaways.

Comment by rpglover64 (alex-rozenshteyn) on How to slow down scientific progress, according to Leo Szilard · 2023-01-06T02:36:58.473Z · LW · GW

I don't think it would work to slow down AI capabilities progress. The reason is that AI capabilities translate into money in a way that's much more direct than "science" writ large--they're a lot closer to engineering.

Put differently, if it could have worked (and before GPT-2 and the surrounding hype, I might have believed it) it's too late now.

Comment by rpglover64 (alex-rozenshteyn) on Contra Common Knowledge · 2023-01-06T02:25:52.481Z · LW · GW

western philosophy has a powerful anti-skepticism strain, to the point where "you can know something" is almost axiomatic

I'm pretty pessimistic about the strain of philosophy as you've described it. I have yet to run into a sense of "know" that is binary (i.e. not "believed with probability") that I would accept as an accurate description of the phenomenon of "knowledge" in the real world rather than as an occasionally useful approximation. Between the preface paradox (or its minor modification, the lottery paradox) and Fitch's paradox of knowability, I don't trust the "knowledge" operator in any logical claim.
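For reference, the usual statement of Fitch's result (my summary of the standard presentation): in a normal modal logic where the knowledge operator $K$ is factive and distributes over conjunction, the knowability principle

$$\forall p\,(p \rightarrow \Diamond K p)$$

entails $\forall p\,(p \rightarrow K p)$. The key step is applying knowability to $p \wedge \neg K p$ and noting that $K(p \wedge \neg K p)$ is self-refuting, so if all truths are knowable, no truth is ever unknown.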

Comment by rpglover64 (alex-rozenshteyn) on Contra Common Knowledge · 2023-01-06T02:03:49.438Z · LW · GW

My response would be that this unfairly (and even absurdly) maligns "theory"!

I agree. However, the way I've had the two generals problem framed to me, it's not a solution unless it guarantees successful coordination. Like, if I claim to solve the halting problem because in practice I can tell if a program halts, most of the time at least, I'm misunderstanding the problem statement. I think that conflating "approximately solves the 2GP" with "solves the 2GP" is roughly as malign as my claim that the approximate solution is not in the realm of theory.

Some people (as I understand it, core LessWrong staff, although I didn't go find a reference) justify some things in terms of common knowledge. 

You either think that they were not intending this literally, or at least, that no one else should take them literally, and instead should understand "common knowledge" to mean something informal (which you yourself admit you're somewhat unclear on the precise meaning of).

I think that the statement, taken literally, is false, and egregiously so. I don't know how the LW staff meant it, but I don't think they should mean it literally. I think that when encountering a statement that is literally false, one useful mental move is to see if you can salvage it, and that one useful way to do so is to reinterpret an absolute as a gradient (and usually to reduce the technical precision). Now that you have written this post, the commonality of the knowledge that the statement should not be taken literally and formally is increased; whether the LW staff responds by changing the statement they use, or by adding a disclaimer somewhere, or by ignoring all of us and expecting people to figure it out on their own, I did not specify.

My problem with this is that it creates a missing stair kind of issue. There's the people "in the know" who understand how to walk carefully on the dark stairway, but there's also a class of "newcomers" who are liable to fall. (Where "fall" here means, take all the talk of "common knowledge" literally.)

Yes, and I think as aspiring rationalists we should try to eventually do better in our communications, so I think that mentions of common knowledge should be one of:

  • explicitly informal, intended to gesture to some real world phenomenon that has the same flavor
  • explicitly contrived, like the blue islanders puzzle
  • explicitly something else, like p-common knowledge; but beware, that's probably not meant either

This idea is illustrated with the electronic messaging example, which purports to show that any number of levels of finite iteration are as good as no communication at all

I think (I haven't read the SEP link) that this is correct--in the presence of uncertainty, iteration does not achieve the thing we are referring to precisely as "common knowledge"--but we don't care, for the reasons mentioned in your post.

I think your post and my reply together actually point to two interesting lines of research:

  • formalize measures of "commonness" of knowledge and see how they respond to realistic scenarios such as "signal boosting"
  • see if there is an interesting "approximate common knowledge", vaguely analogous to The Complexity of Agreement
Comment by rpglover64 (alex-rozenshteyn) on Image generation and alignment · 2023-01-06T01:33:57.131Z · LW · GW

You're saying that transformers are key to alignment research?

I would imagine that latent space exploration and explanation is a useful part of interpretability, and developing techniques that work for both language and images improves the chance that the techniques will generalize to new neural architectures.

Comment by rpglover64 (alex-rozenshteyn) on Machine Learning vs Differential Privacy · 2023-01-06T01:29:34.309Z · LW · GW

getting both differential privacy and capabilities pushes non-differentially-private capabilities more, usually, I think, or something

I don't think it does in general, and every case I can think of right now did not, but I agree that it is a worthwhile thing to worry about.

tools [for finding DP results] I'd recommend include

I'd add clicking through citations and references on arXiv, and looking at the Litmaps explorer there.

Comment by rpglover64 (alex-rozenshteyn) on Machine Learning vs Differential Privacy · 2023-01-06T01:22:54.635Z · LW · GW

As mentioned in the other reply, DP gives up performance (though with enough data you can overcome that, and in many cases, you'd need only a little less data for reliable answers anyway).

Another point is that DP is fragile and a pain:

  • you have to carefully track the provenance of your data (so you don't accidentally include someone's data twice without explicitly accounting for it)
  • you usually need to clip data to a priori bounds or something equivalent (IIRC, the standard DP algorithm for training NNs requires gradient clipping; see the sketch after this list)
  • you can "run out of budget"--there's a parameter that bounds the dissimilarity, and at some point, you have run so many iterations that you can't prove that the result is DP if you keep running; this happens way before you stop improving on the validation set (with current DP techniques)
  • hyperparameter tuning is an active research area, because once you've run with one set of hyperparameters, you've exhausted the budget (see previous point); you can just say "budget is per hyperparameter set", but then you lose generalization guarantees
  • DP bounds tend to be pessimistic (see previous example about single pathological dataset), so while you might in principle be able to continue improving without damaging generalization, DP on its own won't be informative about it
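To make the clipping point concrete, here is a minimal DP-SGD-style training step in plain PyTorch; this is a toy sketch with made-up hyperparameters, and it deliberately omits the privacy accounting that tracks the budget mentioned above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                  # stand-in for a real network
opt = torch.optim.SGD(model.parameters(), lr=0.1)
clip_norm = 1.0                           # a-priori bound on each example's gradient norm
noise_multiplier = 1.1                    # larger = more privacy, worse utility

def dp_sgd_step(xs, ys):
    """One step: per-example gradients, clipped, summed, noised, averaged."""
    accumulated = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        # Clip this example's gradient to the a-priori bound.
        total_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
        scale = min(1.0, clip_norm / (total_norm.item() + 1e-12))
        for acc, p in zip(accumulated, model.parameters()):
            acc.add_(p.grad, alpha=scale)
    # Gaussian noise calibrated to the clipping bound, then an ordinary step.
    for p, acc in zip(model.parameters(), accumulated):
        noise = torch.randn_like(acc) * noise_multiplier * clip_norm
        p.grad = (acc + noise) / len(xs)
    opt.step()

xs, ys = torch.randn(8, 10), torch.randn(8, 1)
dp_sgd_step(xs, ys)
```

The clipping bound has to be fixed ahead of time and the noise is scaled to that bound, which is where much of the awkwardness (and accuracy loss) comes from.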

There are relaxations of DP (here's a reference that tries to give an overview of over 200 papers; since its publication, even more work in this vein has come out), but they're not as well-studied, and even figuring out which one has the properties you need is a difficult problem.

It's also not exactly what you want, and it's not easy, so it tends to be better to look for the thing that correlates more closely with what you actually want (i.e. generalization).

Comment by rpglover64 (alex-rozenshteyn) on Machine Learning vs Differential Privacy · 2023-01-05T16:14:37.067Z · LW · GW

I've been working in differential privacy for the past 5 years; given that it doesn't come up often unprompted, I was surprised and pleased by this question.

Short answer: no, generalizability does not imply differential privacy, although differential privacy does imply generalizability, to a large degree.

The simplest reason for this is that DP is a property that holds for all possible datasets, so if there is even one pathological data set for which your algorithm overfits, it's not DP, but you can still credibly say that it generalizes.
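For reference, the standard definition makes the "all possible datasets" quantifier explicit: a randomized mechanism $M$ is $\varepsilon$-differentially private if for every pair of datasets $D, D'$ differing in a single record and every set $S$ of outputs,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S].$$

The quantification over every neighboring pair is exactly why a single pathological dataset breaks DP even though the algorithm may still generalize in the ordinary statistical sense.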

(I have to go, so I'm posting this as is, and I will add more later if you're interested.)

Comment by rpglover64 (alex-rozenshteyn) on A Löbian argument pattern for implicit reasoning in natural language: Löbian party invitations · 2023-01-05T02:28:59.333Z · LW · GW

I'm having difficulties getting my head around the intended properties of the "implicitly" modal.

  • Could you give an example of A where □A but not ⊢ A; that is, A is implicit but not explicit?
  • Am I correct in understanding that there is a context attached to the box and the turnstile that captures the observer's state of knowledge?
  • Is the "implicitly" modal the same as any other better-known modal?
  • Is the primary function of the modal to distinguish "stuff known from context or inferred using context" from "stuff explicitly assumed or derived"?
Comment by rpglover64 (alex-rozenshteyn) on Contra Common Knowledge · 2023-01-05T02:10:19.203Z · LW · GW

The way I've heard the two generals problem, the lesson is that it's unsolvable in theory, but approximately (to an arbitrary degree of approximation) solvable in practice (e.g. via message confirmations, probabilistic reasoning, and optimistic strategies), especially because channel reliability can be improved through protocols (at the cost of latency).

I also think that taking literally a statement like "LessWrong curated posts help to establish common knowledge" is outright wrong; instead, there's an implied transformation to "LessWrong curated posts help to increase the commonness of knowledge", where "commonness" is some property of knowledge, which common knowledge has to an infinite degree; the n in n-common knowledge is an initially plausible measure of commonness, but I'm not sure we ever get up to infinity.

Intuitively, though, increasing commonness of knowledge is a comparative property: the probability that your interlocutor has read the post is higher if it's curated than if it's not, all else being equal, and, crucially, this property of curation is known (with some probability) to your interlocutor.

Comment by rpglover64 (alex-rozenshteyn) on Slack matters more than any outcome · 2023-01-03T01:26:14.649Z · LW · GW

I think this is a challenge of different definitions. To me, what "adaptation" and "problem" mean requires that every problem be a failure of adaptation. Otherwise it wouldn't be a problem!

This was poor wording on my part; I think there's both a narrow sense of "adaptation" and a broader sense in play, and I mistakenly invoked the narrow sense to disagree. Like, continuing with the convenient fictional example of an at-birth dopamine set-point, the body cannot adapt to increase the set-point, but this is qualitatively different than a set-point that's controllable through diet; the latter has the potential to adapt, while the former cannot, so it's not a "failure" in some sense.

I feel like there's another relevant bit, though: whenever we talk of systems, a lot depends on where we draw the boundaries, and it's inherently somewhat arbitrary. The "need" for caffeine may be a failure of adaptation in the subsystem (my body), but a habit of caffeine intake is an example of adaptation in the supersystem (my body + my agency + the modern supply chain)

I think I'd have to read those articles. I might later.

I think I can summarize the connection I made.

In "out to get you", Zvi points out an adversarial dynamic when interacting with almost all human-created systems, in that they are designed to extract something from you, often without limit (the article also suggests that there are broadly four strategies for dealing with this). The idea of something intelligent being after your resources reminds me of your description of adaptive entropy.

In "refactored agency", Venkatesh Rao describes a cognitive reframing that I find particularly useful in which you ascribe agency to different parts of a system. It's not descriptive of a phenomenon (unlike, say, an egregore, or autopoesis) but of a lens through which to view a system. This is particularly useful for seeking novel insights or solutions; for example, how would problems and solutions differ if you view yourself as a "cog in the machine" vs "the hero", or your coworkers as "Moloch's pawns" rather than as "player characters, making choices" (these specific examples are my own extrapolations, not direct from the text). Again, ascribing agency/intelligence to the adaptive entropy, reminds me of this.

Everything we're inclined to call a "problem" is an encounter with a need to adapt that we haven't yet adapted to. That's what a problem is, to me.

This is tangential, but it strongly reminds me of the TRIZ framing of a problem (or "contradiction" as they call it): it's defined by the desire for two (apparently) opposing things (e.g. faster and slower).

Comment by rpglover64 (alex-rozenshteyn) on Slack matters more than any outcome · 2023-01-03T01:12:05.608Z · LW · GW

This is actually confounded when using ADHD as an example because there's two dynamics at play:

  • Any "disability" (construed broadly, under the social model of disability) is, almost by definition, a case where your adaptive capacity is lower than expected (by society)
  • ADHD specifically affects executive function and impulse control, leading to a reduced ability to focus, or to do anything that isn't basically effortless.
Comment by rpglover64 (alex-rozenshteyn) on Slack matters more than any outcome · 2023-01-01T18:08:25.771Z · LW · GW

I did start with "I agree 90%."

I raised ADHD because it was the first thing that popped into my mind where a chemical habit feels internally aligned, such that the narrative of the "addiction" reducing slack rang hollow.

And, quite tidily, ADHD is one of the primary reasons I learned to develop slack.

ADHD is basically an extreme version of slack philosophy hardwired into your brain.

That has not actually been my experience, but I get the sense that my ADHD is much milder than yours. I also get the sense that your experience w.r.t. ADHD and slack is really common for anything that is kinda-sorta-disabilityish (this old post comes to mind, even though it doesn't explicitly mention it).

Comment by rpglover64 (alex-rozenshteyn) on What career advice do you give to software engineers? · 2023-01-01T05:54:28.259Z · LW · GW

I'm a software engineer, and I'm not worried about AI taking my job. The shortest explanation of this is that "coding" is a very small part of what I do: there's stuff that's more product-related, and stuff that's pre-paradigmatic (to stretch the term), and mentorship, and communication; when I do write code, a lot of the time the code itself is almost irrelevant compared to the concerns of integrating it into the larger system and making it easy to change or delete in the future when requirements change.

One stupid analogy here is that coding is like walking--important and potentially impressive if a machine does it, but insufficient on its own to actually replace a human in a job.

Comment by rpglover64 (alex-rozenshteyn) on Slack matters more than any outcome · 2023-01-01T05:17:04.755Z · LW · GW

“everything is psychology; nothing is neurology”

this line confuses me.

It was just a handle that came to mind for the concept that I'm trying to warn against. Reading your post I get a sense that it's implicitly claiming that everything is mutable and nothing is fixed; eh... that's not right either. Like, it feels like it implicitly and automatically rejects that something like a coffee habit can be the correct move even if you look several levels up.

I think maybe you're saying that someone can choose to reach for coffee for reasons other than wakefulness or energy control.

More specifically, that coffee may be part of a healthy strategy for managing your own biochemistry. I don't think you say otherwise in the post, but it felt strongly suggested.

Donning the adaptive entropy lens, the place my attention goes to is the "chronically low dopamine". Why is that? What prevents the body from adapting to its context?

I think this is something I'm pushing back (lightly) against; I do not, on priors, expect every "problem" to be a failure of adaptation. Like, there might be a congenital set point, and you might have it in the bottom decile (note, I'm not saying that's actually the way it works).

I'd just add that "the same strategy" can be extremely meta.

👍

Mmm. Yes, this is an important distinction. I think to the extent that it didn't come across in the OP, that was a matter of how the OP was hacked together, not something I'm missing in my intent.

Makes sense; consider it something between "feedback on the article as written" and "breadcrumbs for others reading".

Is it clear to you?

I think... that I glimpse the dynamic you're talking about, and that I'm generally aware of its simplest version and try to employ conditions/consequences reasoning, but I do not consistently see it more generally.

[EDIT]

Sleeping on it, I also see connections to [patterns of refactored agency](https://www.ribbonfarm.com/2012/11/27/patterns-of-refactored-agency/) (specifically pervasiveness) and [out to get you](https://thezvi.wordpress.com/2017/09/23/out-to-get-you/). The difference is that while you're describing something like a physical principle, "out to get you" describes more of a social principle, and "refactored agency" describes a useful thinking perspective.

Comment by rpglover64 (alex-rozenshteyn) on Follow up to medical miracle · 2022-12-31T23:15:19.828Z · LW · GW

A system for recognizing when things are helping and hurting

Do you have a particular way you recommend to measure and track mental effects? I have not been able to find something that is sufficiently sticky and sufficiently informative and sufficiently easy.