LessWrong 2.0 Reader


Recent comments

davidad on Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way than I hope. Locking in to a particular path massively reduces the entropy of the outcome distribution beyond what is necessary to ensure a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the outcomes conditional on intervention from the outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours.

As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.
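One way to write down the FEEF objective described above, as a rough sketch only: the comment does not name a specific divergence or its direction, so the KL divergence, its argument order, and the symbols $\pi$ (the intervention or policy), $o$ (outcomes), and $G$ (the event that the goal is met) are all assumptions here.

```latex
\pi^{*} \;=\; \arg\min_{\pi}\; D_{\mathrm{KL}}\!\left( P\big(o \mid \mathrm{do}(\pi)\big) \,\middle\|\, P\big(o \mid G\big) \right)
```

Under this reading, a lock-in policy concentrates $P(o \mid \mathrm{do}(\pi))$ on a narrow set of outcomes. That lowers its entropy relative to the goal-conditioned distribution and so raises the divergence, which matches the claim that lock-in behaviours are penalized.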

benito on Stephen Fowler's Shortform

Very Spicy Take

Epistemic Note: Many highly respected community members with substantially greater decision making experience (and Lesswrong karma) presumably disagree strongly with my conclusion.

I wish to register my weak disapproval of this opening. À la Against Bravery Debates (https://slatestarcodex.com/2013/05/18/against-bravery-debates/), I think it is actively distracting and a little mind-killing to open by making a claim about the status and popularity of a position, even if accurate.

I think in this case it would be reasonable to say something like “the implications of this argument being true involve substantial reallocation of status and power, so please be conscious of that and let’s all try to assess the evidence accurately and avoid overheating”. This is different from something like “I know lots of people will disagree with me on this but I’m going to say it”.

I’m not saying this was an easy post to write, but I think the standard to aim for is not having openings like this.

jeremy-gillen on Alexander Gietelink Oldenziel's Shortform

Doesn't the futarchy hack come up here? Contractors will be betting that competitors' timelines and costs will be high, in order to get the contract.

benito on Stephen Fowler's Shortform

Perhaps, but “seven years from now my reputation in my industry will drop markedly on the basis of this decision” seems to me like a normal human thing that happens all the time.

jbash on robo's Shortform

I think the "crux" is that, while policy is good to have, it's fundamentally a short-term delaying advantage. The stuff will get built eventually no matter what, and any delay you can create before it's built won't really be significant compared to the time after it's built. So if you have any belief that you might be able to improve the outcome when-not-if it's built, that kind of dominates.

tailcalled on tailcalled's Shortform

An implicit assumption I'm making when I clip off from the end with the smallest singular values is that the importance of a dimension is proportional to its singular value. This seemed intuitively sensible to me ("bigger = more important"), but I thought I should test it, so I tried clipping off only one dimension at a time, and plotting how that affected the probabilities:

Clearly there is a correlation, but also clearly there are some deviations from that correlation. Not sure whether I should try to exploit these deviations in order to do further dimension reduction. It's tempting, but it also feels like it starts entering sketchy territory, e.g. overfitting and arbitrary basis picking. Probably gonna do it just to check what happens, but am on the lookout for something more principled.
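A minimal sketch of the one-dimension-at-a-time test described above. The comment does not say which matrix is decomposed or how the probabilities are computed, so everything here is an assumption for illustration: a random weight matrix `W`, random inputs `x`, probabilities given by `softmax(x @ W)`, and the hypothetical helper `drop_one_dimension_effects`.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def drop_one_dimension_effects(W, x):
    """Remove each singular direction of W on its own and measure
    the mean L1 change in softmax(x @ W) that results."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    base = softmax(x @ W)
    effects = np.empty(len(S))
    for i in range(len(S)):
        # Rank-one piece contributed by the i-th singular direction.
        W_i = S[i] * np.outer(U[:, i], Vt[i])
        probs = softmax(x @ (W - W_i))
        effects[i] = np.abs(probs - base).sum(axis=-1).mean()
    return S, effects

# Toy data (assumed, not from the original comment).
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
x = rng.normal(size=(32, 16))

S, effects = drop_one_dimension_effects(W, x)
# Correlation between singular value and the effect of clipping that dimension.
print(np.corrcoef(S, effects)[0, 1])
```

Plotting `effects` against `S` would show how closely "bigger singular value" tracks "bigger change in probabilities" for whatever matrix is actually being studied.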

eggsyntax on Language Models Model Us

Thanks! Doomed though it may be (and I'm in full agreement that it is), here's hoping that your and everyone else's pseudonymity lasts as long as possible.

davidad on Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).

It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But, I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce this risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.

bogdan-ionut-cirstea on Success without dignity: a nearcasting story of avoiding catastrophe by luck

Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.

bogdan-ionut-cirstea on Bogdan Ionut Cirstea's Shortform

For example, in a t-AGI framework, using an interpretability LM agent to search for the feature corresponding to a certain semantic direction should be much shorter horizon than e.g. coming up with a new conceptual alignment agenda or coming up with a new ML architecture (as well as having much faster feedback loops than e.g. training a SOTA LM using a new architecture).

Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
