Posts

Who can most reduce X-Risk? 2023-08-28T14:38:07.188Z
What would flourishing look like in Conway's Game of Life? 2020-05-12T11:22:31.089Z

Comments

Comment by sudhanshu_kasewa on No Clickbait - Misalignment Database · 2024-03-08T12:13:37.639Z · LW · GW

No, this is not something I can undertake -- however, the effort itself need not be very complicated. You've already got a list of Misalignment types in the form; create a Google Doc with definitions/descriptions of each of these, and put a link to that doc in this question.

Comment by sudhanshu_kasewa on No Clickbait - Misalignment Database · 2024-02-23T16:27:46.205Z · LW · GW

It might be worth (someone) writing out what is meant by each misalignment category as used in the db. Objective misalignment, specification gaming, and value misalignment all seem to overlap, and I'm not at all sure what physical misalignment is supposed to be pointing to.

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:42:33.530Z · LW · GW

MIRI

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:42:03.491Z · LW · GW

Anthropic

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:41:46.689Z · LW · GW

The EU parliament

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:41:30.882Z · LW · GW

The UK government

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:41:15.485Z · LW · GW

Vladimir Putin

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:41:05.051Z · LW · GW

The Chinese Communist Party

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:40:47.961Z · LW · GW

The American Public

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:40:35.393Z · LW · GW

Joe Biden

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:40:15.315Z · LW · GW

Greg Brockman/Sam Altman

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:39:38.008Z · LW · GW

Demis Hassabis

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:39:19.777Z · LW · GW

Elon Musk

Comment by sudhanshu_kasewa on Who can most reduce X-Risk? · 2023-08-28T14:38:18.284Z · LW · GW

Mark Zuckerberg

Comment by sudhanshu_kasewa on Neuronpedia · 2023-07-28T11:35:51.925Z · LW · GW

Very cool.  Thanks for putting this together.

Half-baked, possibly off-topic: I wonder if there's some data collection that could be used to train polysemanticity out of a model by fine-tuning.

e.g.: 

  • Show 3 examples (just like in this game), and have the user pick the odd one out
    • The user can say "they are all the same"; if so, remove one at random and replace it with a new example
  • Tag the (neuron, positive example) pairs with (numerical value) label 1, and the odd one out with 0
  • Fine-tune with next-word prediction and an auxiliary loss on this newly collected dataset (rough sketch below)
    • Could probably use some automated (e.g. semantic similarity) labelling method to cluster labelled+unlabelled instances, to increase the size of the dataset

 

Neuronpedia interface/codebase could probably be forked to do this kind of data collection very easily.
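
To make the fine-tuning step above a bit more concrete, here's a rough sketch of what the combined objective could look like. Everything in it is hypothetical: the model interface (returning per-layer activations), the neuron-tagging scheme, and the loss weighting are assumptions of mine, not an existing Neuronpedia or fine-tuning API.

```python
# Rough sketch only; `model` is assumed to return logits plus per-layer
# activations, which is not a standard API.
import torch
import torch.nn.functional as F

def combined_loss(model, tokens, neuron, labels, aux_weight=0.1):
    """tokens: (B, T) token ids of the collected examples.
    neuron: (layer, index) of the neuron the examples were collected for.
    labels: (B,) 1.0 for examples judged to fit the neuron's concept,
            0.0 for the odd one out, as gathered from the game.
    """
    layer, idx = neuron

    # Assumed interface: logits of shape (B, T, vocab) and a list of
    # per-layer activations, each of shape (B, T, d_model).
    logits, activations = model(tokens, return_activations=True)

    # Standard next-word-prediction loss on the same text.
    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

    # Auxiliary loss: take the neuron's peak activation over the sequence and
    # push it towards the human label, i.e. discourage it from firing on the
    # odd one out (the "other" sense of a polysemantic neuron).
    neuron_act = activations[layer][:, :, idx].max(dim=1).values  # (B,)
    aux_loss = F.binary_cross_entropy_with_logits(neuron_act, labels)

    return lm_loss + aux_weight * aux_loss
```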

Comment by sudhanshu_kasewa on Prizes for ELK proposals · 2022-02-09T10:30:14.133Z · LW · GW

Dumb question alert:

In the appendix "Details for penalizing depending on “downstream” variables", I'm not able to wrap my head around what we can expect the reporter to learn -- if anything at all -- seeing that it has no dependency on the inputs (elsewhere it is dependent on z sampled from the posterior).

Specifically, the only call to the reporter (in the function reporter_loss in this section) contains no information (about before, action, after) from the predictor at all:

answer = reporter(question, ε, θ_reporter)

(unless "question" includes some context from the current (before, action, after) being considered, which I'm assuming is not the case)

My dumb question then is:

-- Why would this reporter be performant in any way? 

My reasoning: For a given question Q (say, "Is the diamond in the room?") we might have some answers of "Yes" and some of "No" in the dataset, but without the context we're essentially training the reporter to map noise that is uncorrelated with/independent of the context to the answer. For a fixed question Q and a fixed realization of the noise RV, the reporter will be uniformly uncertain about the value of the answer (or rather, it will mirror the statistics in the data), and since the noise is independent of the context, this holds for every noise value (toy demonstration below).
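
As a toy demonstration of that last point (entirely my own construction, not code from the report): a "reporter" that only ever sees independent noise, trained on answers that actually depend on hidden context, just converges to the marginal answer statistics.

```python
# Toy demo: the labels depend on hidden context the reporter never sees, so
# the best it can do is output the base rate for every noise value.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

n = 4096
answers = (torch.rand(n) < 0.7).float()   # "Yes" with probability 0.7 in the data
noise = torch.randn(n, 8)                 # ε, independent of the answer

reporter = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reporter.parameters(), lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(reporter(noise).squeeze(-1), answers)
    loss.backward()
    opt.step()

# On fresh noise, the reporter's answer probability sits near 0.7 everywhere:
# it has learned the statistics of the data, not anything about the context.
print(torch.sigmoid(reporter(torch.randn(5, 8))).squeeze(-1))
```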

Comment by sudhanshu_kasewa on Prizes for ELK proposals · 2022-02-04T16:02:15.484Z · LW · GW

Naive thought #2618281828:

Could asking counterfactual questions be a potentially useful strategy to bias the reporter to be a direct translator rather than a human simulator?

Concretely, consider a tuple (v, a, v'), where v := 'before' video, a := 'action' selected by SmartVault or augmented-human or whatever, and v' := 'after' video.

Then, for some new action a', ask the question:

  • "Given (v, a, v'), if action a' was taken, is the diamond in the room?"

(How we collect such data is unclear but doesn't seem obviously intractable.)

I think there's some value here:

  • Answering such a question might not require computation concerning a and v'; if we see these computations being used, we might derive more value from regularizers that penalize downstream variables (which now include the nodes close to a)
  • This might also force the reporter to essentially model (or compress but not indefinitely) the predictor; the reporter now has both a compressed predictor Bayes' net and a human Bayes' net. If we can be confident that the compressed predictor BN is much smaller than the human BN, then doing direct translation within the reporter, i.e. compressed predictor BN inference + translation + read off from human BN might be less expensive than the human simulator alternative, i.e. compressed predictor BN inference + 'translation'/bridging computation + human BN inference.
    • We might find ways of being confident that the compressed predictor BN is small (e.g. by adding decoders at every layer of the reporter that reconstruct v, a or v' and heavily penalizing later-layer decoders)

Comment by sudhanshu_kasewa on Why is the impact penalty time-inconsistent? · 2020-07-15T21:28:27.466Z · LW · GW

1.

Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.


2.

It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent was not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.

So, assuming it's some external entity E that computes the impact penalty, E must have the (dis?)ability to account for the agent's modified action space when making this computation, in order for the agent to successfully reduce the penalty as in the earlier example.

Something agent A does must signal to E that A's action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, in particular something of the form: "I will never do anything that gives me more than the minimum penalty", or in terms of actions: "I will deterministically follow the policy that gives me the minimum penalty while achieving my goals." E, if it believed A, would not include high-penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.

Comment by sudhanshu_kasewa on What would flourishing look like in Conway's Game of Life? · 2020-05-14T16:29:47.702Z · LW · GW

Thanks for the detailed response. Meta: It feels good to receive a signal that this was a 'neat question', or in general a positive-seeming contribution to LW. I have several thoughts that I've left unexpressed for fear of not actually creating value for the community.

it sounds like what you want is a reward function that is simple, but somehow analogous to the complexity of human value? And it sounds like maybe the underspecified bit is "you, as a human, have some vague notion that some sorts of value-generation are 'cheating'", and your true goal is "the most interesting outcome that doesn't feel like Somehow Cheating to me?"

This is about right. A secondary reason for simplicity is computational efficiency (for the environment that generates the reward).

"one cell == an atom"

I can see that being the case, but, again, computational tractability is a concern. Actual interesting structures in GoL can be incredibly massive; for example, this Tetris Processor (2,940,928 x 10,295,296 cells). Maybe there's some middle ground between truly fascinating GoL patterns made from atoms and my cell-as-a-planet level of abstraction, as suggested by Daniel Kokotajlo in another comment.

How 'good' is it to have a repeating loop of, say, a billion flourishing human lives? Is it better than a billion human lives that happens exactly once and ends?

Wouldn't most argue that, in general, more life is better than less life? (But I see some of my hidden assumptions here, such as "the lives we're talking about are qualitatively similar, e.g. the repeating life doesn't feel trapped/irrelevant/futile because it is aware that it is repeating".)

I think "moral value" (or, "value") in real life is about the process of solving "what is valuable and how to do I get it?"

I don't disagree, but I also think this is sort of outside the scope of finite-space cellular automata.

In this case it might mean that the system optimizes either for true continuous novelty, or the longest possible loop?

Given the constraints of CA, I'm mostly in agreement with this suggestion. Thanks.

I do suspect that figuring out which of your assumptions are "valid" is an important part of the question here.

Yes, I agree. Concretely, to me it looks like asking 'if I saw X happening in GoL, and I imagine being a sentient being (at some scale, TBD) in that world (well, with my human values), would I want to live in it?', and then translating that into some rules that promote or disincentivise X.

I do think taking this approach is broadly difficult, though. Perhaps it's worth getting a v0.1 out with reward tied to instantiations of novel states to begin with (rough sketch below), and then seeing whether to build on that or try a new approach.
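
For concreteness, a minimal sketch of what such a v0.1 might look like (toy code of my own, with the agent's interventions left out): the environment pays out 1.0 the first time the board reaches a state it has never produced before, and 0.0 otherwise. An agent's actions (e.g. toggling a few cells each step) would slot in just before the update rule.

```python
# Minimal sketch: Game of Life dynamics on a finite toroidal board, with a
# reward that pays out only for never-before-seen board states.
import numpy as np

def step_life(board: np.ndarray) -> np.ndarray:
    """One Game of Life update on a wrap-around (toroidal) board of 0s and 1s."""
    neighbours = sum(
        np.roll(np.roll(board, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return ((neighbours == 3) | ((board == 1) & (neighbours == 2))).astype(np.uint8)

class NoveltyReward:
    """Reward 1.0 the first time a board state is seen, 0.0 on any repeat."""
    def __init__(self):
        self.seen = set()

    def __call__(self, board: np.ndarray) -> float:
        key = board.tobytes()
        if key in self.seen:
            return 0.0
        self.seen.add(key)
        return 1.0

# Example: run a random board forward; once it falls into a loop or still
# life, the novelty reward dries up, which is the behaviour we'd want an
# agent to learn to avoid.
board = (np.random.default_rng(0).random((32, 32)) < 0.3).astype(np.uint8)
reward_fn = NoveltyReward()
total = 0.0
for _ in range(100):
    board = step_life(board)
    total += reward_fn(board)
print(total)
```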

Comment by sudhanshu_kasewa on What would flourishing look like in Conway's Game of Life? · 2020-05-14T15:29:22.478Z · LW · GW

Interesting thoughts, thanks. My concerns: 1) Diversity would be restricted to what I specify as interesting shapes, while perhaps what I really want is for the AI to be able to discover new ways to accomplish some target value. 2) From a technological perspective, it may be too expensive to implement (in that, at every pass, it must search over all subsets of space and check against all (suitably-sized) patterns in the database in order to determine what reward to provide).

Comment by sudhanshu_kasewa on What would flourishing look like in Conway's Game of Life? · 2020-05-14T15:20:18.132Z · LW · GW

After reading through the suggestions, including yours and Raemon's, I'm also sort of circling around this idea. Thanks.

Comment by sudhanshu_kasewa on What would flourishing look like in Conway's Game of Life? · 2020-05-14T15:17:16.887Z · LW · GW

Thanks for the note. I'll let you know if my explorations take me that way.

Comment by sudhanshu_kasewa on What would flourishing look like in Conway's Game of Life? · 2020-05-12T19:21:54.930Z · LW · GW

Fascinating. Thanks. My sense is GoL already has this property; any intuitions on how to formalise it?

Comment by sudhanshu_kasewa on What would flourishing look like in Conway's Game of Life? · 2020-05-12T13:33:15.922Z · LW · GW

I was not; thanks for the pointer!

A quick look suggests that it's not quite what I had in mind; nonetheless a reference worth looking at.

Comment by sudhanshu_kasewa on Databases of human behaviour and preferences? · 2020-04-25T03:37:22.792Z · LW · GW

Perhaps these could be useful:

1) Human Decision-Making dataset https://osf.io/eagcd/ ; but from what I can tell, it has fewer than 300 human participants

2) User rating datasets, e.g. the Yahoo! Music, Netflix, or Amazon product review datasets. These could be trimmed in various ways to reduce complexity. The Netflix dataset is here: https://www.kaggle.com/netflix-inc/netflix-prize-data

The Amazon product review data is at http://liu.cs.uic.edu/download/data/ , but it says it's available upon request

3) Transactional data, e.g. https://data.world/uci/online-retail might shed some light on preferences (as transactional data could be a proxy for demand)

Comment by sudhanshu_kasewa on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-24T23:46:58.036Z · LW · GW

ICYMI: Yann posted this page on FB as well, and some additional conversation happened there, with at least one interesting exchange between Rob Bensinger and Tony Zador:

https://www.facebook.com/yann.lecun/posts/10156278492457143

(I did scan the comments and didn't see this posted earlier, so...)

Comment by sudhanshu_kasewa on Constructing Goodhart · 2019-02-14T22:34:02.926Z · LW · GW

I assumed (close to) Pareto-optimality, since the OP suggests that most real systems start from this state.

The (immediately discernible) competing objectives here are training error and test error. Only one can be observed ahead of deployment (much like X in the X+Y example earlier), while it's actually the other that matters. That is not to say there aren't other currently undiscovered/unstated metrics of interest (training time, designer time, model size, etc.) which may be articulated and optimised for, still leading to a suboptimal result on test error. Indeed, we can imagine a perfectly good predictive neural network which, for some reason, won't run on the newly provisioned hardware, and so a hasty, over-extended engineer might simply delete entire blocks of it, optimising for their own time and the model's ability to fit on a Raspberry Pi, while most likely completely voiding the network's ability to perform the task meaningfully.

If this sounds contrived, forgive me. Perhaps I am talking tangentially past the discussion at hand; if so, kindly ignore. Mostly, I only wish to propose that a fundamental formulation of ML -- minimising training loss when we actually want to reduce test loss -- is an example of Goodhart's law in action, and that there is a rich literature on techniques to circumvent its effects. Do you agree? Why / why not?

Comment by sudhanshu_kasewa on Constructing Goodhart · 2019-02-14T12:06:18.921Z · LW · GW

The problem is to come up with some model system where optimizing for something which is almost-but-not-quite the thing you really want produces worse results than not optimizing at all.

I'm confused; maybe the following query is addressed elsewhere but I have yet to come across it:

Doesn't the (standard, statistical machine learning 101) formulation of minimising-training-error-when-we-actually-care-about-minimising-test-error fall squarely in the camp of something that demonstrates Goodhart's law? Aggressively optimising to reduce training error with a function approximator whose parameter count >> number of data points (e.g. today's deep neural networks) will result in ~0.0 training error, but it would most likely totally bomb on unseen data (toy sketch below). This, to me, appears to be as straightforward an example of Goodhart's Law as is necessary to illustrate the concept, and serves as a segue into how to mitigate this phenomenon of overfitting, e.g. by validation, regularisation, enforcing sparsity, and so on.

Given the premise that we are likely to start from something close to Pareto-optimal to begin with, we now have a system which works well from the get-go, and without suitable controls, optimising on reducing training error to the exclusion of all other metrics will almost certainly be worse than not optimising at all.
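
As a toy illustration of the above (my own example, not from the OP): fit an over-flexible model and a more constrained one to the same small noisy dataset; the former drives the proxy (training error) to roughly zero while doing far worse on the thing we actually care about (test error).

```python
# Toy Goodhart/overfitting demo: a degree-14 polynomial can interpolate 15
# noisy points (train MSE ~ 0) but generalises far worse than a degree-3 fit.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

x_train = rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=15)
x_test = rng.uniform(-1, 1, 1000)
y_test = np.sin(3 * x_test) + rng.normal(scale=0.2, size=1000)

def mse(coeffs, x, y):
    return float(np.mean((P.polyval(x, coeffs) - y) ** 2))

# "Optimising the proxy hard": as many parameters as data points.
# "Not optimising as hard": a constrained, lower-capacity model.
for degree in (14, 3):
    coeffs = P.polyfit(x_train, y_train, degree)
    print(f"degree {degree:2d}: train MSE = {mse(coeffs, x_train, y_train):.4f}, "
          f"test MSE = {mse(coeffs, x_test, y_test):.4f}")
```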

Comment by sudhanshu_kasewa on Could we send a message to the distant future? · 2018-06-09T15:11:34.583Z · LW · GW

Apropos of the space-object proposal from Facebook, and the question about recovering the information: could the object itself have a shape that encodes the information? I was thinking of something like a disco ball, except perhaps cylindrical/oblong so that we'd have a higher expectation of it spinning in certain directions; then our message could be encoded in binary as shiny/matte facets on this thing, and as its twinkling pattern is observed over time, the entire message is reconstructed.

If we can do Durable Writing, we could probably also do Durable Ellipsoid.

For context, this is inspired by how we read information from optical disk technology, though the irony that CDs are obsolete is not lost on me. This proposal has many obvious drawbacks, such as: can we ensure that 500m years in the future it still spins fast enough to be observed as repeating, but not so fast that it is mistaken for uniform?

A way-cooler-but-also-much-harder alternative would be to launch some really robust object into a slow-decaying orbit such that it crashes back to Earth after 500m years. Its internals might be made up of two contrasting materials that encode the payload, while the outside is just enough of a heat shield to survive the journey back; additionally, it should probably also scream "Not Natural!" in some way, so that someone from 1900 looking at it can predict that it will fall in the next 100 years or so, and prepare to collect it "for science". Hopefully, it won't destroy any cities in the process. Reflecting on that, if future inhabitants have anything like human psychology, an alien artifact about to plummet to Earth is just asking to be shot out of the sky. But then again, why should they be human-like at all?

Comment by sudhanshu_kasewa on Hufflepuff Cynicism on Hypocrisy · 2018-03-31T06:12:06.472Z · LW · GW

I feel I flinch away from hypocrisy because allowing it seems to nudge us towards world states that I find undesirable. Consider a malicious version of hypocrisy through the lens of the diner's dilemma: transitioning meat-eaters reluctantly order tofu salads, while the vocal vegan gets themselves a steak. I imagine that on a subsequent outing, at least some of the carnivores break their resolve, seeing their duplicitous comrade tuck into a bucket of chicken wings. Eventually, no one cares to take the signalling action; preferably, though, they would instead eject the offending member from the party.

I think, though, that the free rider problem probably better reflects my beef: Hypocrisy is one of those things where one can get something for nothing, but the getting sort of depends on most other parties believing that everyone involved is getting something for something. Then someone notices that one can in fact get something for nothing, and proceeds to jump on that gravy train; soon, we reach a critical mass of someones getting somethings for nothing, while the other ones really have to work overtime to keep the lights on. This is unsustainable*, and leads to a complete breakdown of the community that was built.

To give this concreteness, I sometimes think about some of the arguments against a rapid expansion of the EA movement, especially with regards to signalling.

*(except maybe in a world with Superman, where he could just power everything, if it eventually came down to it)