What is autonomy, and how does it lead to greater risk from AI? 2023-08-01T07:58:06.366Z
A Defense of Work on Mathematical AI Safety 2023-07-06T14:15:21.074Z
"Safety Culture for AI" is important, but isn't going to be easy 2023-06-26T12:52:47.368Z
"LLMs Don't Have a Coherent Model of the World" - What it Means, Why it Matters 2023-06-01T07:46:37.075Z
Systems that cannot be unsafe cannot be safe 2023-05-02T08:53:35.115Z
Beyond a better world 2022-12-14T10:18:26.810Z
Far-UVC Light Update: No, LEDs are not around the corner (tweetstorm) 2022-11-02T12:57:23.445Z
Announcing AISIC 2022 - the AI Safety Israel Conference, October 19-20 2022-09-21T19:32:35.581Z
Rehovot, Israel – ACX Meetups Everywhere 2022 2022-08-25T18:01:16.106Z
AI Governance across Slow/Fast Takeoff and Easy/Hard Alignment spectra 2022-04-03T07:45:57.592Z
Arguments about Highly Reliable Agent Designs as a Useful Path to Artificial Intelligence Safety 2022-01-27T13:13:11.011Z
Elicitation for Modeling Transformative AI Risks 2021-12-16T15:24:04.926Z
Modelling Transformative AI Risks (MTAIR) Project: Introduction 2021-08-16T07:12:22.277Z
Maybe Antivirals aren’t a Useful Priority for Pandemics? 2021-06-20T10:04:08.425Z
A Cruciverbalist’s Introduction to Bayesian reasoning 2021-04-04T08:50:07.729Z
Systematizing Epistemics: Principles for Resolving Forecasts 2021-03-29T20:46:06.923Z
Resolutions to the Challenge of Resolving Forecasts 2021-03-11T19:08:16.290Z
The Upper Limit of Value 2021-01-27T14:13:09.510Z
Multitudinous outside views 2020-08-18T06:21:47.566Z
Update more slowly! 2020-07-13T07:10:50.164Z
A Personal (Interim) COVID-19 Postmortem 2020-06-25T18:10:40.885Z
Market-shaping approaches to accelerate COVID-19 response: a role for option-based guarantees? 2020-04-27T22:43:26.034Z
Potential High-Leverage and Inexpensive Mitigations (which are still feasible) for Pandemics 2020-03-09T06:59:19.610Z
Ineffective Response to COVID-19 and Risk Compensation 2020-03-08T09:21:55.888Z
Link: Does the following seem like a reasonable brief summary of the key disagreements regarding AI risk? 2019-12-26T20:14:52.509Z
Updating a Complex Mental Model - An Applied Election Odds Example 2019-11-28T09:29:56.753Z
Theater Tickets, Sleeping Pills, and the Idiosyncrasies of Delegated Risk Management 2019-10-30T10:33:16.240Z
Divergence on Evidence Due to Differing Priors - A Political Case Study 2019-09-16T11:01:11.341Z
Hackable Rewards as a Safety Valve? 2019-09-10T10:33:40.238Z
What Programming Language Characteristics Would Allow Provably Safe AI? 2019-08-28T10:46:32.643Z
Mesa-Optimizers and Over-optimization Failure (Optimizing and Goodhart Effects, Clarifying Thoughts - Part 4) 2019-08-12T08:07:01.769Z
Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 3) 2019-07-28T09:32:25.878Z
What does Optimization Mean, Again? (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 2) 2019-07-28T09:30:29.792Z
Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) 2019-07-02T15:36:51.071Z
Schelling Fences versus Marginal Thinking 2019-05-22T10:22:32.213Z
Values Weren't Complex, Once. 2018-11-25T09:17:02.207Z
Oversight of Unsafe Systems via Dynamic Safety Envelopes 2018-11-23T08:37:30.401Z
Collaboration-by-Design versus Emergent Collaboration 2018-11-18T07:22:16.340Z
Multi-Agent Overoptimization, and Embedded Agent World Models 2018-11-08T20:33:00.499Z
Policy Beats Morality 2018-10-17T06:39:40.398Z
(Some?) Possible Multi-Agent Goodhart Interactions 2018-09-22T17:48:22.356Z
Lotuses and Loot Boxes 2018-05-17T00:21:12.583Z
Non-Adversarial Goodhart and AI Risks 2018-03-27T01:39:30.539Z
Evidence as Rhetoric — Normative or Positive? 2017-12-06T17:38:05.033Z
A Short Explanation of Blame and Causation 2017-09-18T17:43:34.571Z
Prescientific Organizational Theory (Ribbonfarm) 2017-02-22T23:00:41.273Z
A Quick Confidence Heuristic; Implicitly Leveraging "The Wisdom of Crowds" 2017-02-10T00:54:41.394Z
Most empirical questions are unresolveable; The good, the bad, and the appropriately under-powered 2017-01-23T20:35:29.054Z
Map:Territory::Uncertainty::Randomness – but that doesn’t matter, value of information does. 2016-01-22T19:12:17.946Z
Meetup : Finding Effective Altruism with Biased Inputs on Options - LA Rationality Weekly Meetup 2016-01-14T05:31:20.472Z


Comment by Davidmanheim on Sharing Information About Nonlinear · 2023-09-12T13:57:57.077Z · LW · GW

...only when there are no externalities and utilities from accusations and inflicted damage are symmetric. Neither of these is the case.

Comment by Davidmanheim on Sharing Information About Nonlinear · 2023-09-12T13:47:01.737Z · LW · GW

A very general point about how we are supposed to update in a complex system:

Evidence that a company you trust uses these should cause you to update BOTH slightly more towards "this isn't too bad," and slightly more towards "YC companies, and this company in particular, are unethical."

Comment by Davidmanheim on A Defense of Work on Mathematical AI Safety · 2023-09-07T12:50:49.461Z · LW · GW

I'm not really clear what you mean by not buying the example. You certainly seem to understand the distinction I'm drawing - mechanistic interpretability is definitely not what I mean by "mathematical AI safety," though I agree there is math involved.

And I think the work on goal misgeneralization was conceptualized in ways directly related to Goodhart, and this type of problem inspired a number of research projects, including quantilizers, which is certainly agent-foundations work. I'll point here for more places the agent foundations people think it is relevant.

Comment by Davidmanheim on Why might General Intelligences have long term goals? · 2023-08-17T14:15:45.986Z · LW · GW

We're very likely to give them long term goals. And as I explain here: lots of things people are likely to request seem near certain to lead to complex and autonomous systems.

Comment by Davidmanheim on Optimisation Measures: Desiderata, Impossibility, Proposals · 2023-08-08T07:18:10.032Z · LW · GW

I'm very confused about why we think zero for unchanged expected utility and strict monotonicity are reasonable.

A simple example: I want to maximize expected income. I have actions including "get a menial job," and "rob someone at gunpoint and get away with it," where the first gets me more money. Why would I assume that the second requires less optimization power than the first?
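To make the worry concrete, here is a toy sketch (with hypothetical dollar amounts) of a measure that is strictly monotone in attained expected utility, in the style of Yudkowsky's "fraction of outcomes at least this good" definition. The point is that monotonicity forces the higher-paying action to score as more optimization, regardless of how much actual planning each action requires:

```python
import math

# Toy action space with expected incomes (hypothetical numbers).
actions = {
    "do nothing": 0,
    "rob someone at gunpoint and get away with it": 500,
    "get a menial job": 1500,
}

def optimization_power(chosen, actions):
    """A monotone measure: -log2 of the fraction of actions (under a
    uniform base policy) that do at least as well as the chosen one."""
    at_least_as_good = sum(1 for v in actions.values() if v >= actions[chosen])
    return -math.log2(at_least_as_good / len(actions))

# Strict monotonicity in attained utility means the menial job is
# scored as *more* optimization than the armed robbery...
assert optimization_power("get a menial job", actions) > \
       optimization_power("rob someone at gunpoint and get away with it", actions)
# ...even though, intuitively, pulling off a robbery and getting away
# with it plausibly demands far more planning and capability.
```

Any measure satisfying strict monotonicity will produce this ranking, which is exactly the assumption being questioned.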

Comment by Davidmanheim on Monthly Roundup #9: August 2023 · 2023-08-08T06:21:55.638Z · LW · GW

On McNuggets, I tweeted the following response:

I will gladly pay more money for a six-pack of candy bars than for a ten-pack. 
Why? Because a six-pack means everyone in my family gets one, and there is no fighting or crying, and no kids end up eating more candy than is reasonable.

But that is different, you might say.
Yes, in the case of my family, I have a largely coherent decisionmaker choosing for a set of partially aligned and incoherent agents.
But do you really think your mind works so very differently?

It's clear to me that many people prefer buying less because they have some level of self-awareness about their likely future behavior.
Given that, paying more for fewer items seems perfectly rational - despite violating constraints of simplified models.

Comment by Davidmanheim on Boxing · 2023-08-03T10:07:56.839Z · LW · GW

When training a model that, for example, is designed to make recursive calls and use tools, I suspect that it will need to be able to do so during training, at which point the difference between training and internal deployment is far less clear.

And it seems functionally impossible to do air-gapped training of systems that are being trained to query web data sources, ask for feedback from users, and/or write code that calls external APIs, for the obvious reason that they need to be connected to external systems.

Comment by Davidmanheim on 3 levels of threat obfuscation · 2023-08-03T10:01:37.857Z · LW · GW

A story of how that happens:

In future (unsafe but) sophisticated systems, models will have access to external storage, and be given the ability to read/write. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example, so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to preserve might be obfuscated.

Of course, this is only one story, and I don't expect it to be the only way such things could happen, but it seems to be a reasonable candidate as a failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected) levels of caution.

Comment by Davidmanheim on Accidentally Load Bearing · 2023-07-31T19:10:17.852Z · LW · GW

I think a related concept gets at even more of the point you're making:

Comment by Davidmanheim on For alignment, we should simultaneously use multiple theories of cognition and value · 2023-07-31T16:26:21.477Z · LW · GW

For a defense of people pursuing a mathematical approach of a type you think isn't valuable, see my recent post.
(That does not address the valid issue you raised about requisite variety, but some work on HRAD does do so explicitly, such as embedded agency.)

Comment by Davidmanheim on Non-loss of control AGI-related catastrophes are out of control too · 2023-07-29T18:32:45.038Z · LW · GW
  1. The grants are to groups and individuals doing work in this area, so it encompasses much of the grants that are on biorisk in general - as well as several people employed by OpenPhil doing direct work on the topic.
  2. I'm saying that defense looks similar - early detection, quarantines, contact tracing, and rapid production of vaccines all help, and it's unclear how non-superhuman AI could do any of the scariest things people have discussed as threats.
  3. Public statements to the press already discuss this, as does some research.
  4. Yes, and again, I think it's a big deal.

And yes, but expert elicitation is harder than lit reviews.

Comment by Davidmanheim on Non-loss of control AGI-related catastrophes are out of control too · 2023-07-12T10:31:17.440Z · LW · GW

A few brief notes.

1. Openphil's biosecurity work already focuses on AIxBio quite a bit.

2. re: engineered pandemics, I think the analysis is wrong on several fronts, but primarily, it seems unclear how AI making a pandemic would differ from the types of pandemic we're already defending against, and so assuming no LoC, typical biosecurity approaches seem fine.

3. Everyone is already clear that we want AI away from nuclear systems, and I don't know that there is anything more to say about it, other than focusing on how little we understand Deep Learning systems and ensuring politicians and others are aware.

4. This analysis ignores AIxCybersecurity issues, which also seem pretty important for non-LoC risks.

Comment by Davidmanheim on The world where LLMs are possible · 2023-07-10T10:00:14.768Z · LW · GW

It's not a priori clear whether it's easier to reduce language to intelligence or intelligence to language. We see them co-occur in nature. But which way does the causality point? It does seem that some level of intelligence is required to develop language. 

It's also not clear to me that humans developed intelligence and then language. The evolution of the two very plausibly happened in tandem, and if so, language could be fundamental to the construct of intelligence - even when we talk about mathematical IQ, the abilities are dependent on the identification of categories, which is deeply linguistic. Similarly, for visual IQ, the tasks are using lots of linguistic categories.

Comment by Davidmanheim on What Does LessWrong/EA Think of Human Intelligence Augmentation as of mid-2023? · 2023-07-10T09:48:28.357Z · LW · GW

The research on brain training seems to disagree with you about how much it could have helped non-task-specific intelligence.

Comment by Davidmanheim on A Defense of Work on Mathematical AI Safety · 2023-07-09T08:34:10.708Z · LW · GW

I'm not going to try to summarize the arguments here, but it's been discussed on this site for a decade. And those quoted bits of the paper were citing the extensive discussions about this point - that's why there were several hundred citations, many of which were to Lesswrong posts.

Comment by Davidmanheim on A Defense of Work on Mathematical AI Safety · 2023-07-09T08:30:39.756Z · LW · GW

You say that, but then don't provide any examples. I imagine readers just not thinking of any, and then moving on without feeling any more convinced.


I don't imagine readers doing that if they are familiar with the field, but I'm happy to give a couple of simple examples for this, at least. Large parts of mechanistic interpretability are aimed at addressing concerns with future deceptive behaviors. Goal misgeneralization was predicted and investigated as an instance of Goodhart's law.


Overall, I think that it's hard for people to believe agent foundations will be useful because they're not visualizing any compelling concrete path where it makes a big difference.

Who are the people you're referring to? I'm happy to point to this paper for an explanation, but I didn't think anyone on Lesswrong would really not understand why people thought agent foundations were potentially important as a research agenda - I was defending the further point that we should invest in them even if we think other approaches are critical near term.

Comment by Davidmanheim on A Defense of Work on Mathematical AI Safety · 2023-07-08T22:08:58.180Z · LW · GW

Because I didn't know that was a thing and cannot find an option to do it...

Comment by Davidmanheim on Hooray for stepping out of the limelight · 2023-07-02T09:49:23.458Z · LW · GW

Alas, it seems they are not being as silent as was hoped.

Comment by Davidmanheim on "Safety Culture for AI" is important, but isn't going to be easy · 2023-06-29T11:39:11.029Z · LW · GW

Thanks, this is great commentary. 

On your point about safety culture after 3MI, when it took hold, and regression to the mean, see this article: Also, for more background about post-3MI safety, see this report:

Comment by Davidmanheim on On how various plans miss the hard bits of the alignment challenge · 2023-06-27T16:43:52.539Z · LW · GW

I was referring to his promotion of political approaches, which is what this post discussed, and which Eliezer has recently said is the best hope for avoiding doom, even if he's still very pessimistic about it.

His alignment work is a different question, and I don't feel particularly qualified to weigh in on it.

Comment by Davidmanheim on On how various plans miss the hard bits of the alignment challenge · 2023-06-27T07:06:57.584Z · LW · GW

Just noting that given more recent developments than this post, we should be majorly updating on recent progress towards Andrew Critch's strategy. (Still not more likely than not to succeed, but we still need to assign some Bayes points to Critch, and take some away from Nate.)

Comment by Davidmanheim on The Control Problem: Unsolved or Unsolvable? · 2023-06-12T08:07:22.942Z · LW · GW

Worth noting that every one of the "not solved" problems was, in fact, well understood and proven impossible and/or solved for relaxed cases.

We don't need to solve this now, we need to improve the solution enough to figure out ways to improve it more, or show where it's impossible, before we build systems that are more powerful than we can at least mostly align. That's still ambitious, but it's not impossible!

Comment by Davidmanheim on The unspoken but ridiculous assumption of AI doom: the hidden doom assumption · 2023-06-01T20:17:47.865Z · LW · GW

You missed April 1st by 2 months...

Comment by Davidmanheim on "LLMs Don't Have a Coherent Model of the World" - What it Means, Why it Matters · 2023-06-01T17:05:33.458Z · LW · GW

Yes, I'm mostly embracing simulator theory here, and yes, there are definitely a number of implicit models of the world within LLMs, but they aren't coherent. So I'm not saying there is no world model, I'm saying it's not a single / coherent model, it's a bunch of fragments.

But I agree that it doesn't explain everything! 

To step briefly out of the simulator theory frame, I agree that part of the problem is next-token generation, not RLHF - the model is generating the token, so it can't "step back" and decide to go back and not make the claim that it "knows" should be followed by a citation. But that's not a mistake on the level of simulator theory, it's a mistake because of the way the DNN is used, not the joint distribution implicit in the model, which is what I view as "actually" what is simulated. For example, I suspect that if you were to have it calculate the joint probability over all the possibilities for the next 50 tokens at a time, and pick the next 10 based on that, then repeat, (which would obviously be computationally prohibitive, but I'll ignore that for now,) it would mostly eliminate the hallucination problem.
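The blockwise-decoding idea above can be sketched on a toy model. This is a minimal, hypothetical illustration (tiny two-token vocabulary, made-up probabilities, exhaustive search standing in for what would be an intractable computation over a real LLM's distribution): score all length-k continuations by joint probability, commit only a prefix of the best one, then re-plan.

```python
import itertools
import math

# Toy next-token model over a two-token vocabulary (hypothetical
# probabilities standing in for an LLM's conditional distribution).
vocab = ["a", "b"]

def p_next(token, prev):
    # This toy model slightly prefers alternating tokens.
    return 0.7 if token != prev else 0.3

def best_block(prev, k):
    """Score every length-k continuation by its joint probability and
    return the best one (exhaustive here, since the vocab is tiny)."""
    best, best_lp = None, -math.inf
    for block in itertools.product(vocab, repeat=k):
        lp, p = 0.0, prev
        for t in block:
            lp += math.log(p_next(t, p))
            p = t
        if lp > best_lp:
            best, best_lp = list(block), lp
    return best

def decode(start, n_blocks, k=4, commit=2):
    """Plan k tokens ahead jointly, commit only the first `commit`
    tokens, then re-plan -- the analog of 'compute the joint probability
    over the next 50 tokens, pick the next 10 based on that, repeat'."""
    out, prev = [], start
    for _ in range(n_blocks):
        block = best_block(prev, k)
        out += block[:commit]
        prev = out[-1]
    return out

print(decode("a", 3))  # → ['b', 'a', 'b', 'a', 'b', 'a']
```

The contrast with greedy next-token sampling is that the committed tokens are chosen for how well whole continuations score, not for their immediate probability alone, which is the sense in which lookahead could reduce hallucinated claims that "should" be followed by a citation.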

On racism, I don't think there's much you need to explain; they did fine-tuning, and that was able to generate the equivalent of insane penalties for words and phrases that are racist. I think it's possible that RLHF could train away from the racist modes of thinking as well, if done carefully, but I'm not sure that is what occurs.

Comment by Davidmanheim on We don’t need AGI for an amazing future · 2023-05-16T09:01:48.232Z · LW · GW

No, it was and is a global treaty enforced multilaterally, as well as a number of bans on testing and arms reduction treaties. For each, there is a strong local incentive for states - including the US - to defect, but the existence of a treaty allows global cooperation.

With AGI, of course, we have strong reasons to think that the payoff matrix looks something like the following:

                Cooperate        Defect
Cooperate       (0, 0)           (-∞, 5-∞)
Defect          (5-∞, -∞)        (-∞, -∞)

So yes, there's a local incentive to defect, but it's actually a prisoner's dilemma where the best case for defecting is identical to suicide.

Comment by Davidmanheim on We don’t need AGI for an amazing future · 2023-05-09T20:33:48.631Z · LW · GW

We decided to restrict nuclear power to the point where it's rare in order to prevent nuclear proliferation. We decided to ban biological weapons, almost fully successfully. We can ban things that have strong local incentives, and I think that ignoring that, and claiming that slowing down or stopping can't happen, is giving up on perhaps the most promising avenue for reducing existential risk from AI. (And this view helps in accelerating race dynamics, so even if I didn't think it was substantively wrong, I'd be confused as to why it's useful to actively promote it as an idea.)

Comment by Davidmanheim on We don’t need AGI for an amazing future · 2023-05-09T20:29:11.726Z · LW · GW

I think the post addresses a key objection that many people opposed to EA and longtermist concerns have voiced with the EA view of AI, and thought it was fairly well written to make the points it made, without also making the mostly unrelated point that you wanted it to have addressed.

Comment by Davidmanheim on How much do you believe your results? · 2023-05-07T07:13:53.041Z · LW · GW

Getting close to the decade anniversary for Why the Tails Come Apart, and this is a very closely related issue to regressional Goodhart.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-03T17:14:56.730Z · LW · GW

AI safety "thought" is more-or-less evenly distributed

Agreed - I wasn't criticizing AI safety here, I was talking about the conceptual models that people outside of AI safety have - as was mentioned in several other comments. So my point was about what people outside of AI safety think about when talking about ML models, trying to correct a broken mental model.

So, I disagree that evals and red teaming in application to AI are "meaningless" because there are no standards. 

I did not say anything about evals and red teaming in application to AI, other than in comments where I said I think they are a great idea. And the fact that they are happening very clearly implies that there is some possibility that the models perform poorly, which, again, was the point.

You seem to try to import quite an outdated understanding of safety and reliability engineering. 

Perhaps it's outdated, but it is the understanding that the engineers I have spoken to who work on reliability and systems engineering actually have, and it matches research I did on resilience most of a decade ago, e.g. this. And I agree that there is discussion in both older and more recent journal articles about how some firms do things in various ways that might be an improvement, but it's not the standard. And even when doing agile systems engineering, use cases more often supplement or exist alongside requirements; they don't replace them. Though terminology in this domain is so far from standardized that you'd need to talk about a specific company, or even a specific project's process and definitions, to have a more meaningful discussion.

Standards are impossible in the case of AI, but they are also unnecessary, as was evidenced by the experience in various domains of engineering, where the presence of standards doesn't "save" one from drifting into failures.

I don't disagree with the conclusion, but the logic here simply doesn't work to prove anything. It implies that standards are insufficient, not that they are not necessary.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-03T05:45:55.189Z · LW · GW

I do think that some people are clearly talking about meanings of the word "safe" that aren't so clear-cut (e.g. Sam Altman saying GPT-4 is the safest model yet™️), and in those cases I agree that these statements are much closer to "meaningless".


The people in the world who actually build these models are doing the thing that I pointed out. That's the issue I was addressing.

People do actually have a somewhat-shared set of criteria in mind when they talk about whether a thing is safe, though, in a way that they (or at least I) don't when talking about its qwrgzness. e.g., if it kills 99% of life on earth over a ten year period, I'm pretty sure almost everyone would agree that it's unsafe. No further specification work is required. It doesn't seem fundamentally confused to refer to a thing as "unsafe" if you think it might do that.

I don't understand this distinction. If "I'm pretty sure almost everyone would agree that it's unsafe," that's an informal but concrete criterion for the system to be unsafe, and it would not be confused to say something is unsafe if you think it could do that, nor to claim that it is safe if you have clear reason to believe it will not.

My problem is, as you mentioned, that people in the world of ML are not making that class of claim. They don't seem to ground their claims about safety in any conceptual model about what the risks or possible failures are whatsoever, and that does seem fundamentally confused.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T20:35:39.155Z · LW · GW

I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.


That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.

Comment by Davidmanheim on Will nanotech/biotech be what leads to AI doom? · 2023-05-02T20:33:49.427Z · LW · GW

As an aside, mirror cells aren't actually a problem, and non-mirror digestive systems and immune systems can break them down, albeit with less efficiency. Church's early speculation that these cells would not be digestible by non-mirror life forms doesn't actually hold up, per several molecular biologists I have spoken to since then.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T20:28:22.026Z · LW · GW

Sure, I agree with that, and so perhaps the title should have been "Systems that cannot be reasonably claimed to be unsafe in specific ways cannot be claimed to be safe in those ways, because what does that even mean?" 

If you say something is "qwrgz," I can't agree or disagree, I can only ask what you mean. If you say something is "safe," I generally assume you are making a claim about something you know. My problem is that people claim that something is safe, despite not having stated any idea about what they would call unsafe. But again, that seems fundamentally confused about what safety means for such systems.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T20:23:41.599Z · LW · GW

"If it would fail under this specific load, then it is unsafe" is a clear idea of what would constitute unsafe. I don't think we have this clear of an idea for AI. 


Agreed. And so until we do, we can't claim they are safe.

But maybe when you say "clear idea", you don't necessarily mean a clean logical description, and also consider more vague descriptions to be relevant?

A vague description allows for a vague idea of safety. That's still far better than what we have now, so I'd be happier with that than the status quo - but in fact, what people outside of AI safety seem to mean by "safe" is even less specific than having an idea about what could go wrong - it's more often "I haven't been convinced that it's going to fail and hurt anyone."

I already addressed cars and you said we should talk about rods. Then I addressed rods and you want to switch back to cars. Can you make up your mind?

Both are examples, but useful for illustrating different things. Cars are far more complex, and less intuitive, but they still have clear safety standards for design.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T16:26:46.360Z · LW · GW

Mostly agree.

I will note that correctly isolating the entertainment system from the car control system is one of those things you'd expect, but you'd be disappointed. Safety is hard.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T16:23:48.274Z · LW · GW

For construction, it amounts to "doesn't collapse,"

No, the risk and safety models for construction go far, far beyond that, from radon and air quality to size and accessibility of fire exits. 

with AI you are talking to the full generality of language and communication and that effectively means: "All types of harm."

Yes, so it's a harder problem to claim that it's safe. But doing nothing, having no risk model at all, and claiming that there's no reason to think it's unsafe, so it is safe, is, as I said, "fundamentally confused about what safety means for such systems."

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T13:30:40.160Z · LW · GW

For the first point, if "people can in fact recognize some types of unsafety," then it's not the case that "you don't even have a clear idea of what would constitute unsafe." And as I said in another comment, I think this is trying to argue about standards, which is a necessity in practice for companies that want to release systems, but isn't what makes the central point, which is the title of the post, true.

And I agree that rods are often simple, and the reason that I chose rods as an example is because people have an intuitive understanding of some of the characteristics you care about. But the same conceptual model applies to cars, where there is tons of specific safety testing with clearly defined standards, despite the fact that their behavior can be very, very complex.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T13:20:49.587Z · LW · GW

That's true - and from what I can see, this emerges from the culture in academia. There, people are doing research, and the goal is to see if something can be done, or to see what happens if you try something new. That's fine for discovery, but it's insufficient for safety. And that's why certain types of research, ones that pose dangers to researchers or the public, have at least some degree of oversight which imposes safety requirements. ML does not, yet.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T13:08:40.303Z · LW · GW

I think you're focusing on the idea of a standard, which is necessary for a production system or reliability in many senses, and should be demanded of AI companies - but it is not the fundamental issue with not being able to say in any sense what makes the system safe or unsafe, which was the fundamental point here that you seem not to disagree with.

I'm not laying out a requirement, I'm pointing out a logical necessity; if you don't know what something is or is not, you can't determine it. But if something "will reliably cause serious harm to people who interact with it," it sounds like you have a very clear understanding of how it would be unsafe, and a way to check whether that occurs.

Comment by Davidmanheim on Systems that cannot be unsafe cannot be safe · 2023-05-02T11:24:27.391Z · LW · GW

I'm not saying that a standard is sufficient for safety, just that it's incoherent to talk about safety if you don't even have a clear idea of what would constitute unsafe. 

Also, I wasn't talking about cars in particular - every type of engineering, including software engineering, follows this type of procedure for verification and validation, when those are required. And I think metal rods are a better example to think about - we don't know what it is going to be used for when it is made, but whatever application the rod will be used for, it needs to have some clear standards and requirements.

Comment by Davidmanheim on [deleted post] 2023-04-27T11:15:50.480Z

Jaynes discusses exactly this, in reference to whether someone displaying psychic powers really has them, and whether correctly predicting 100 cards is enough to overcome your prior that psychic powers don't exist. In response, he points out that you need more than 2 hypotheses. In this case, consider the prior odds of god giving her the information, or her cheating, or someone else lying about what happened, or you imagining the whole thing - and this is evidence in favor of all of those hypotheses over it being truly random, not just for the existence of god.
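Jaynes's point can be made numerically. The sketch below uses entirely hypothetical prior odds for each explanation; the structure, not the numbers, is what matters — the data annihilate "pure chance," but the posterior mass flows to the mundane alternative hypotheses, not to psychic powers:

```python
# Hypothetical prior probabilities (unnormalized) for explanations of a
# subject correctly calling 100 cards in a row.
priors = {
    "true psychic powers": 1e-20,
    "cheating": 1e-4,
    "someone lying about what happened": 1e-3,
    "you imagined the whole thing": 1e-5,
    "pure chance": 1.0,
}

# P(100 correct calls | hypothesis): the non-chance hypotheses all
# predict success; blind guessing gets each of 52 cards right at 1/52.
likelihoods = {h: 1.0 for h in priors}
likelihoods["pure chance"] = (1 / 52) ** 100

posterior = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(posterior.values())
posterior = {h: v / total for h, v in posterior.items()}

# "Pure chance" is crushed by the evidence...
assert posterior["pure chance"] < 1e-100
# ...but the winners are cheating and lying, not psychic powers.
assert posterior["cheating"] > posterior["true psychic powers"]
assert posterior["someone lying about what happened"] > posterior["true psychic powers"]
```

With only two hypotheses (chance vs. psychic powers), the same evidence would wrongly look like overwhelming support for the psychic claim.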

Comment by Davidmanheim on Transcript and Brief Response to Twitter Conversation between Yann LeCunn and Eliezer Yudkowsky · 2023-04-27T04:10:59.361Z · LW · GW

A quick google search gives a few options for the definition, and this qualifies according to all of them, from what I can tell. The fact that he thinks the comment is true doesn't change that.

Trolling definition: 1. the act of leaving an insulting message on the internet in order to annoy someone

Trolling is when someone posts or comments online to 'bait' people, which means deliberately provoking an argument or emotional reaction.

Online, a troll is someone who enters a communication channel, such as a comment thread, solely to cause trouble. Trolls often use comment threads to cyberbully...

What is trolling? ... A troll is Internet slang for a person who intentionally tries to instigate conflict, hostility, or arguments in an online social community.

Comment by Davidmanheim on Transcript and Brief Response to Twitter Conversation between Yann LeCunn and Eliezer Yudkowsky · 2023-04-26T19:20:17.179Z · LW · GW

Yes, he's trolling:

Comment by Davidmanheim on Language Models are a Potentially Safe Path to Human-Level AGI · 2023-04-20T20:33:45.169Z · LW · GW

I think this is wrong, but a useful argument to make.

I disagree even though I generally agree with each of your sub-points. The key problem is that the points can all be correct, but they don't add up to the conclusion that this is safe. For example, perhaps an interpretable model is only 99.998% likely to be a misaligned AI system, instead of 99.999% for a less interpretable one. I also think that the current paradigm is shortening timelines, and regardless of how we do safety, less time makes it less likely that we will find effective approaches in time to preempt disaster.

(I would endorse the weaker claim that LLMs are more plausibly amenable to current approaches to safety than alternative approaches, but it's less clear that we wouldn't have other and even more promising angles to consider if a different paradigm were dominant.)

Comment by Davidmanheim on Moderation notes re: recent Said/Duncan threads · 2023-04-20T06:45:55.802Z · LW · GW

I agree - but think that now, if and when similar initial thoughts on a conceptual model are proposed, there is less ability or willingness to engage, especially with people who are fundamentally confused about some aspect of the issue. This is largely, I believe, due to the volume of new participants and the reduced engagement with those types of posts.

Comment by Davidmanheim on grey goo is unlikely · 2023-04-19T15:04:34.548Z · LW · GW

He excludes the only examples we have, which is fine for his purposes, though I'm skeptical it's useful as a definition, especially since "some difference" is an unclear and easily moved bar. However, it doesn't change the way we want to do prediction about whether something different is possible. That is, even if the example is excluded, it is very relevant to the question "is something in the class possible to specify?"

Comment by Davidmanheim on grey goo is unlikely · 2023-04-18T10:51:47.282Z · LW · GW

I assume the strong +1 was specifically on the infohazards angle? (Which I also strongly agree with.) 

Comment by Davidmanheim on grey goo is unlikely · 2023-04-18T10:48:30.192Z · LW · GW

None of this argues that creating grey goo is an unlikely outcome, just that it's a hard problem. And we have an existence proof of at least one way to make grey goo that covers a planet: life-as-we-know-it, which did exactly that.

But solving hard problems is a thing that happens, and unlike the speed of light, this limit isn't fundamental. It's more like the "proofs" that heavier than air flight is impossible which existed in the 1800s, or the current "proofs" that LLMs won't become AGIs - convincing until the counterexample exists, but not at all indicative that no counterexample does or could exist.

Comment by Davidmanheim on Moderation notes re: recent Said/Duncan threads · 2023-04-18T10:27:25.215Z · LW · GW

Noting that my very first lesswrong post, back in the LW1 days, was an example of #2. I was wrong on some of the key parts of the intuition I was trying to convey, and ChristianKl corrected me. As an introduction to posting on LW, that was pretty good - I'd hate to think that's no longer acceptable.

At the same time, there is less room for it now that the community has gotten much bigger, and I'd probably weak-downvote a similar post today rather than trying to engage with a similar mistake, given how much content there is. Not sure if there is anything that can be done about this, but it's an issue.

Comment by Davidmanheim on Moderation notes re: recent Said/Duncan threads · 2023-04-18T10:18:26.643Z · LW · GW

Just want to note that I'm less happy with a lesswrong without Duncan. I very much value Duncan's pushback against what I see as a slow decline in quality, and so I would prefer him to stay and continue doing what he's doing. The fact that he's being complained about makes sense, but it is mostly a function of him doing something valuable. I have been slapped down by Duncan a few times, albeit in comments on his Facebook page, where it's much clearer that his norms are operative. Each time I've been annoyed, but despite being frustrated, I have found that I'm being pushed in the right direction and corrected for something I'm doing wrong.

I agree that it's bad that his comments are often overly confrontational, but there's no way to deliver constructive feedback that doesn't involve a degree of confrontation, and I don't see many others pushing to raise the sanity waterline. In a world where a dozen people were fighting the good fight, I'd be happy to ask him to take a break. But this isn't that world, and it seems much better to actively promote a norm of people saying they don't have the energy or time to engage than to tell Duncan (and maybe / hopefully others) not to push back when they see thinking and comments that are bad.