justinshovelain

Anthropic's marginal contribution to safety (compared to what we would have in a world without Anthropic) probably doesn't offset Anthropic's contribution to the AI race.

I think there are more worlds where Anthropic is contributing to the race in a negative fashion than there are worlds where Anthropic's marginal safety improvement over OpenAI/DeepMind-ish orgs is critical for securing a good future with AGI (weighing things according to the impact sizes and probabilities).

Comment by JustinShovelain on evhub's Shortform · 2025-01-08T11:31:34.945Z · LW · GW

More generally you can use the following typology to inspire creating more interventions.

Interventions points to change/form an AGI company and its surroundings towards safer x-risk results (I've used this in advising startups on AI safety, it is also related to my post on positions where people can be in the loop):

Type of organization: nonprofit, public benefit organization, have a partner non-profit, join the government
Rules of organization, event triggers:
- Rules:
  - x-risk mission statement
  - x-risk strategic plan
- Triggering events:
  - Gets very big: windfall clause
  - Gets sold to another party: ethics board, restrictions on potential sale
  - Value drift: reboot board and CEOs, shut it down, allocate more resources to safety, build a new company, put the ethics board in charge, build a monitoring system, some sort of line in the sand
  - AI safety isn’t viable yet but dangerous AGI is: shut it down or pivot to sub AGI research and product development
  - Hostile government tries to take it over: shut it down, change countries, (see also: Soft Nationalization: How the US Government Will Control AI Labs)
Path decisions for organization: ethics board, aligned investors, good CEOs, giving x-risk orgs or people choice power, voting stock to aligned investors, periodic x-risk safety reminders
Resource allocation by organization: precommitting a varying percentage of money/time focused on x-risk reduction based on conditions with some up front, a commitment devices for funding allocation into the future
Owners of organization: aligned investors, voting stock for aligned investors, necessary percentage as aligned investors
Executive decision making: good CEOs, company mission statement?, company strategic plan?
Employees: select employees preferably by alignment, have only aligned people hire folks
Education of employees and/or investors by x-risk folks: employee training in x-risks and information hazards, a company culture that takes doing good seriously, coaching and therapy services
Social environment of employees: exposure to EAs and x-risk people socially at events, x-risk community support grants, a public pledge
Customers of organization: safety score for customers, differential pricing, customers have safety plans and information hazard plans
Uses of the technology: terms of service
Suppliers of organization: (mostly not relevant), select ethical or aligned suppliers
Difficulty to steal or copy: trade secrets, patents, service based, NDAs, (physical security)
Internal political hazards: (standard)
Information hazards: an institutional framework for research groups (FHI has a draft document)
Cyber hazards: (standard IT)
Financial hazards: (standard finances)
External political hazards: government industry partnerships, talk with x-risk folks about this, external x-risk outreach
Monitoring by x-risk folks: quarterly reports to x-risk organizations,
Projection by x-risk folks: commissioned projections, x-risk prediction market questions
Meta research and x-risk research: AI safety team, AI safety grants, meet up on organization safety at X-risk orgs, (x-risk strategy, AI safety strategy) – team and grants, information hazard grant question, go through these ideas in a check list fashion and allocate company computer folders to them (and they will get filled up), scalable and efficient grant giving system, form an accelerator, competitions, hackathon, BERI type project support
Coordination hazards: Incentivized coordination through cheap resources for joint projects, government industry partnerships, coordination theory and implementation grants, concrete coordination efforts, joint ethics boards, mergers with other groups to reduce arms race risks
Specific safety procedures: (depends on the project)
Jurisdiction: Choosing a good legal jurisdiction

Comment by JustinShovelain on evhub's Shortform · 2025-01-08T11:30:10.732Z · LW · GW

Thanks for asking the question!

Some things I'd especially like to see change (in as much as I know what is happening) are:

Making more use of available options to improve AI safety (I think there are more than I get the impression that Anthropic thinks. For instance, 30% of funds could be allocated to AI safety research if framed well and it would probably be below the noise threshold/froth of VC investing. Also, there probably is a fair degree of freedom in socially promoting concern around unaligned AGI.)
Explicit ways to handle various types of events like organizational value drift, hostile government takeover, organization get's sold or unaligned investors have control, another AGI company takes a clear lead
Enforceable agreements to, under some AGI safety situations, not race and pool resources (a possible analogy from nuclear safety is having a no first strike policy)
Allocate a significant fraction of resources (like > 10% of capital) to AGI technical safety, organizational AGI safety strategy, and AGI governance
An organization consists of its people and great care needs to be taken in hiring employees and and their training and motivation for AGI safety. If not, I expect Anthropic to regress towards the mean (via an eternal September) and we'll end up with another OpenAI situation where AGI safety culture is gradually lost. I want more work to be done here. (see also "Carefully Bootstrapped Alignment" is organizationally hard)
The owners of a company are also very important and ensuring that the LTBT has teeth and the members are selected well is key. Furthermore, preferential allocation of voting stock towards AGI algned investors should happen. Teaching investors about the company and what it does, including AGI safety issues, would be good to do. More speculatively, you can have various types of voting stock for various types of issues and you could build a system around this.

Comment by JustinShovelain on Some background for reasoning about dual-use alignment research · 2023-07-05T19:30:36.963Z · LW · GW

Gotcha. What determines the "ratios" is some sort of underlying causal structure of which some aspects can be summarized by a tech tree. For thinking about the causal structure you may also like this post: https://forum.effectivealtruism.org/posts/TfRexamDYBqSwg7er/causal-diagrams-of-the-paths-to-existential-catastrophe

Comment by JustinShovelain on Some background for reasoning about dual-use alignment research · 2023-07-05T11:59:19.780Z · LW · GW

Complementary ideas to this article:

https://www.lesswrong.com/posts/BfKQGYJBwdHfik4Kd/fai-research-constraints-and-agi-side-effects: (the origin for the fuel tank metaphor Raemon refers to in these comments)
Extending things further to handle higher order derivatives and putting things within a cohesive space: https://forum.effectivealtruism.org/posts/TCxik4KvTgGzMowP9/state-space-of-x-risk-trajectories
A typology for mapping downside risks: https://www.lesswrong.com/posts/RY9XYoqPeMc8W8zbH/mapping-downside-risks-and-information-hazards
A set of potential responses for what to do with potentially dangerous developments and a heuristic for triggering that evaluation: https://www.lesswrong.com/posts/6ur8vDX6ApAXrRN3t/information-hazards-why-you-should-care-and-what-you-can-do
A general heuristic for what technology to develop and how to distribute it: https://forum.effectivealtruism.org/posts/4oGYbvcy2SRHTWgWk/improving-the-future-by-influencing-actors-benevolence
A coherence focused framework from which is more fundamental than the link just above and from which it can be derived: https://www.lesswrong.com/posts/AtwPwD6PBsqfpCsHE/aligning-ai-by-optimizing-for-wisdom

Comment by JustinShovelain on Dual-Useness is a Ratio · 2023-04-06T10:52:39.960Z · LW · GW

Relatedly, here is a post going beyond the framework of a ratio of progress to the effect on the ratio of research that still needs to be done for various outcomes: https://www.lesswrong.com/posts/BfKQGYJBwdHfik4Kd/fai-research-constraints-and-agi-side-effects

Extending further one can examine higher order derivatives and curvature in a space of existential risk trajectories: https://forum.effectivealtruism.org/posts/TCxik4KvTgGzMowP9/state-space-of-x-risk-trajectories

Comment by JustinShovelain on When you plan according to your AI timelines, should you put more weight on the median future, or the median future | eventual AI alignment success? ⚖️ · 2023-01-05T12:09:04.117Z · LW · GW

Roughly speaking, in terms of the actions you take, various timelines should be weighted as P(AGI in year t)*DifferenceYouCanProduceInAGIAlignmentAt(t). This produces a new, non normalized distribution of how much to prioritize each time (you can renormalize it if you wish to make it more like "probability").

Note that this is just a first approximation and there are additional subtleties.

This assumes you are optimizing for each time and possible world orthogonality but much of the time optimizing for nearby times is very similar to optimizing for a particular time.
The definition of "you" here depends on the nature of the decision maker which can vary between a group, a person, or even a person at a particular moment.
Using different definitions of "you" between decision makers can cause a coordination issue where different people are trying to save different potential worlds (because of their different skills and ability to produce change) and their plans may tangle with each other.
It is difficult to figure out how much of a difference you can produce in different possible worlds and times. You do the best you can but you might suffer a failure of imagination in either finding ways your plans wont work, ways your plans will have larger positive effects, or ways you may in the future improve your plans. For more on the difference one can produce see this and this.
Lastly, there is a risk here psychologically and socially of fudging the calculations above to make things more comfortable.

(Meta: I may make a full post on this someday and use this reasoning often)

Comment by JustinShovelain on Goodhart's Law Causal Diagrams · 2022-04-12T06:36:46.720Z · LW · GW

I think causal diagrams naturally emerge when thinking about Goodhart's law and its implications.

I came up with the concept of Goodhart's law causal graphs above because of a presentation someone gave at the EA Hotel in late 2019 of Scott's Goodhart Taxonomy. I thought causal diagrams were a clearer way to describe some parts of the taxonomy but their relationship to the taxonomy is complex. I also just encountered the paper you and Scott wrote a couple weeks ago when getting ready to write this Good Heart Week prompted post, and I was planning in the next post to reference it when we address "causal stomping" and "function generalization error" and can more comprehensively describe the relationship with the paper.

In terms of the relationship to the paper, I think that the Goodhart's law causal graphs I describe above are more fundamental and atomically describe the relationship types between the target and proxies in a unified way. I read how you were using causal diagrams in your paper as rather describing various ways causal graph relationships may be broken by taking action rather than simply describing relationships between proxies and targets and ways they may be confused with each other (which is the function of the Goodhart's law causal graphs above).

Mostly the purpose of this post and the next are to present an alternative, and I think cleaner, ontological structure for thinking about Goodhart's law though there will still be some messiness in carving up reality.

As to your suggested mitigations, both randomization and secret metric are good to add though I'm not as sure about post hoc. Thanks for the suggestions and the surrounding paper.

Comment by JustinShovelain on Subspace optima · 2020-05-18T14:51:43.195Z · LW · GW

I like the distinction that you're making and that you gave it a clear name.

Relatedly, there is the method of Lagrangian multipliers for solving things in the subspace.

On a side note: there is a way to partially unify subspace optimum and local optimum by saying that the subspace optimum is a local optimum with respect to the local set of parameters you're using to define the subspace. You're at a local optimum with respect to defining the underlying space to optimize over (aka the subspace) and a local optimum within that space (the subspace). (Relatedly, moduli spaces.)

Comment by JustinShovelain on COVID-19: An opportunity to help by modelling testing and tracing to inform the UK government · 2020-04-18T12:02:41.352Z · LW · GW

I've decided to try modelling testing and contact tracing over the weekend. If you wish to join and want to ping me my contact details are in the doc.

Comment by JustinShovelain on Why don't we have active human trials with inactivated SARS-COV-2? · 2020-04-09T19:59:26.478Z · LW · GW

I think virus inactivation is a normal vaccination approach and is probably being pursued here? The hardest part is probably growing it in vitro at scale and perhaps ensuring that all of them are inactive.

Comment by JustinShovelain on Conflict vs. mistake in non-zero-sum games · 2020-04-07T14:17:57.876Z · LW · GW

Nice deduction about the relationship between this and conflict vs mistake theory! Similar and complementary to this post is the one I wrote on Moloch and the Pareto optimal frontier.

Comment by JustinShovelain on Using vector fields to visualise preferences and make them consistent · 2020-01-31T18:25:03.220Z · LW · GW

How so? I don't follow your comment's meaning.

Comment by JustinShovelain on Safety regulators: A tool for mitigating technological risk · 2020-01-21T23:03:36.564Z · LW · GW

Edited to add "I" immediately in front of "wish".

Comment by JustinShovelain on Metaphilosophical Mysteries · 2010-07-28T19:57:39.325Z · LW · GW

By new "term" I meant to make the clear that this statement points to an operation that cannot be done with the original machine. Instead it calls this new module (say a halting oracle) that didn't exist originally.

Comment by JustinShovelain on Metaphilosophical Mysteries · 2010-07-28T09:06:39.977Z · LW · GW

Are you trying to express the idea of adding new fundamental "terms" to your language describing things like halting oracles and such? And then discounting their weight by the shortest statement of said term's properties expressed in the language that existed previously to including this additional "term?" If so, I agree that this is the natural way to extend priors out to handle arbitrary describable objects such as halting oracles.

Stated another way. You start with a language L. Let the definition of an esoteric mathematical object (say a halting oracle) E be D in the original language L. Then the prior probability of a program using that object is discounted by the description length of D. This gives us a prior over all "programs" containing arbitrary (describable) esoteric mathematical objects in their description.

I'm not yet sure how universal this approach is at allowing arbitrary esoteric mathematical objects (appealing to the Church-Turing thesis here would be assuming the conclusion) and am uncertain whether we can ignore the ones it cannot incorporate.

Comment by JustinShovelain on Think Before You Speak (And Signal It) · 2010-03-19T23:39:58.954Z · LW · GW

Interesting idea.

I agree that trusting newly formed ideas is risky, but there are several reasons to convey them anyway (non-comprehensive listing):

To recruit assistance in developing and verifying them
To convey an idea that is obvious in retrospect, an idea you can be confident in immediately
To signal cleverness and ability to think on one's feet
To socially play with the ideas

What we are really after though is to asses how much weight to assign to an idea off the bat so we can calculate the opportunity costs of thinking about the idea in greater detail and asking for the idea to be fleshed out and conveyed fully. This overlaps somewhat with the confidence (context sensitive rules in determining) with which the speaker is conveying the idea. Also, how do you gauge how old an idea really is? Especially if it condenses gradually or is a simple combination out of very old parts? Still... some metric is better than no metric.

Comment by JustinShovelain on Open Thread: March 2010, part 2 · 2010-03-14T08:24:46.590Z · LW · GW

Vote this down for karma balance.

Comment by JustinShovelain on Open Thread: March 2010, part 2 · 2010-03-14T08:24:25.521Z · LW · GW

Vote this up if you are the oldest child with siblings.

Comment by JustinShovelain on Open Thread: March 2010, part 2 · 2010-03-14T08:23:46.614Z · LW · GW

Vote this up if you are an only child.

Comment by JustinShovelain on Open Thread: March 2010, part 2 · 2010-03-14T08:23:32.478Z · LW · GW

Vote this up if you have older siblings.

Comment by JustinShovelain on Open Thread: March 2010, part 2 · 2010-03-14T08:23:08.088Z · LW · GW

Poll: Do you have older siblings or are an only child?

karma balance

Comment by JustinShovelain on Open Thread: March 2010 · 2010-03-10T00:48:39.422Z · LW · GW

I'm thinking of writing up a post clearly explaining update-less decision theory. I have a somewhat different way of looking at things than Wei Dia and will give my interpretation of his idea if there is demand. I might also need to do this anyway in preparation for some additional decision theory I plan to post to lesswrong. Is there demand?

Comment by JustinShovelain on Individual vs. Group Epistemic Rationality · 2010-03-02T22:25:27.110Z · LW · GW

Closely related to your point is the paper, "The Epistemic Benefit of Transient Diversity"

It describes and models the costs and benefits of independent invention and transient disagreement.

Comment by JustinShovelain on The Preference Utilitarian’s Time Inconsistency Problem · 2010-01-15T17:35:15.481Z · LW · GW

Why are you more concerned about something with unlimited ability to self reflect making a calculation error than about the above being a calculation error? The AI could implement the above if the calculation implicit in it is correct.

Comment by JustinShovelain on The Preference Utilitarian’s Time Inconsistency Problem · 2010-01-15T17:23:27.833Z · LW · GW

What keeps the AI from immediately changing itself to only care about the people's current utility function? That's a change with very high expected utility defined in terms of their current utility function and one with little tendency to change their current utility function.

Will you believe that a simple hack will work with lower confidence next time?

Comment by JustinShovelain on Positive-affect-day-Schelling-point-mas Meetup · 2009-12-23T21:24:08.995Z · LW · GW

I'll be there.

Comment by JustinShovelain on Intuitive supergoal uncertainty · 2009-12-05T02:46:26.327Z · LW · GW

Hmm, darn. When I write I do have a tendency to see what ideas I meant to describe instead of seeing my actual exposition; I don't like grammar checking my writing until I've had some time to forget details, I read right over my errors unless I pay special attention.

I did have a three LWers look over the article before I sent it and got the general criticism that it was a bit obscure and dense but understandable and interesting. I was probably too ambitious in trying to include everything within one post though, length vs clarity tradeoff.

To address your points:

Have you not felt or encountered people who have the opinion that our life goals may be uncertain, something to have opinions about, and are valid targets for argument? Also, is not uncertainty of our most fundamental goals something we must consider and evaluate (explicitly or implicitly) in order to verify that an artificial intelligence is provably Friendly?

Elaborating on the second statement, when I used "naturalistically" I wished to invoke the idea that the exploration I was doing was similar to classifying animals before we had taxonomies, we look around with our senses (or imagination and inference in this case) and see what we observe and lay no claim to systematic search or analysis. In this context I did a kind of imagination limited shallow search process without trying to systematically relate the concepts (combinatorial explosion and I'm not yet sure how to condense and analyze supergoal uncertainty).

As to the third point, what I did in this article is allocate a name "supergoal uncertainty", roughly described it in the first paragraph and hopefully brought up the intuition, and then subsequently considered various definitions of "supergoal uncertainty" following from this intuition.

In retrospect, I probably errored on the clarity versus writing time trade-off and was perhaps biased in trying to get this uncomfortable writing task (I'm not a natural writer) off my plate so I can do other things.

Comment by JustinShovelain on Intuitive supergoal uncertainty · 2009-12-05T02:09:29.165Z · LW · GW

I think he meant that even if we are not religious, society tends to pull us into moral realism even though of course moral realism is an illusion.

You are correct, though I don't go as far as calling moral realism an illusion because of unknown unknowns (though I would be very surprised to find it isn't illusionary).

Comment by JustinShovelain on Intuitive supergoal uncertainty · 2009-12-05T02:04:18.007Z · LW · GW

Addressing your reification point:

By means of reification something that was previously implicit, unexpressed and possibly unexpressible is explicitly formulated and made available to conceptual (logical or computational) manipulation." - Reification(computer science) from wikipedia.

I don't think I did abuse vocabulary outside of possibly generalizing meanings in straightforward ways and taking words and meanings common in one topic and using them in a context where they are rather uncommon (e.g. computer science to philosophy). I rely on context to refine and imbue words with meaning instead of focusing on dictionary definitions (to me all sentences take the form of puzzles and words are the pieces; I've written more words in proofs than in all other contexts combined). I will try to pay more attention to context invariant meanings in the future. Thanks for the criticism.

Comment by JustinShovelain on How to test your mental performance at the moment? · 2009-11-24T05:16:59.056Z · LW · GW

Some things I use to test mental ability as well as train it are: BrainWorkshop (A free dualNback program), Cognitivefun.net (A site with assorted tests and profiles including everything from reaction time, to subitizing, to visual backward digit span), Posit Science's jewel diver demo (a multi-object tracking test), and Lumosity.com (brainshift, memory matrix, speed match, top chimp. All of these tests can be found for free on the internet).

Subjectively the regular use of these tests has increased my metacognitive and self monitoring ability. Anyone have other suggestions? How about tests one can do without the aid of external devices?

In complement to determining whether one's brain isn't in its best state there is the question of how to improve or fix it. Keeping with the general spirit of this thread, what are some strategies people use to improve their cognitive functioning (as it pertains to low level properties such as short term memory) in the short term without the use of external aids? A few I use are priming emotional state with posture, expression, and words, doing mental arithmetic, memorizing arbitrary information, and doing the above mental tests.

Comment by JustinShovelain on Rationality Quotes - July 2009 · 2009-07-03T01:02:57.352Z · LW · GW

I do not agree with all interpretations of the quote but primed by:

That's not right. It's not even wrong. -- Wolfgang Pauli

I interpreted it charitably with "critical" loosely implying "worth thinking about" in contrast to vague ideas that are not even wrong. Furthermore, from thefreedictionary.com definition of critical, "1. Inclined to judge severely and find fault.", vague statements may be considered useless and so judged severely but much of the time they are also slippery in that they must be broken down into precise disjoint "meaning sets" where faults can be found. So vague ideas cannot necessarily be criticized directly in the fault finding sense. (Wide concepts that have useful delimitations in contrast to arbitrary ill-formed vague ones can be useful and are a powerful tool in generalization. In informal contexts these two meanings of vague overlap).

Comment by JustinShovelain on Rationality Quotes - July 2009 · 2009-07-02T23:25:44.219Z · LW · GW

Make everything as simple as possible, but not simpler.

-- Albert Einstein

Comment by JustinShovelain on Rationality Quotes - July 2009 · 2009-07-02T22:55:13.524Z · LW · GW

Many highly intelligent people are poor thinkers. Many people of average intelligence are skilled thinkers. The power of a car is separate from the way the car is driven.

-- Edward de Bono

Comment by JustinShovelain on Rationality Quotes - July 2009 · 2009-07-02T22:51:23.423Z · LW · GW

In a sense, words are encyclopedias of ignorance because they freeze perceptions at one moment in history and then insist we continue to use these frozen perceptions when we should be doing better.

-- Edward de Bono

Comment by JustinShovelain on Rationality Quotes - July 2009 · 2009-07-02T22:45:58.396Z · LW · GW

Some people are always critical of vague statements. I tend rather to be critical of precise statements; they are the only ones which can correctly be labeled 'wrong'.

-- Raymond Smullyan

Comment by JustinShovelain on Controlling your inner control circuits · 2009-06-29T23:39:24.252Z · LW · GW

From pwno: "Aren't true theories defined by how useful they are in some application?"

My definition of "usefulness" was built with the express purpose of relating the truth of theories to how useful they are and is very much a context specific temporary definition (hence "define:"). If I had tried to deal with it directly I would have had something uselessly messy and incomplete, or I could have used a true but also uninformative expectation approach and hid all of the complexity. Instead, I experimented and tried to force the concepts to unify in some way. To do so I stretched the definition of usefulness pretty much to the breaking point and omitted any direct relation to utility functions. I found it a useful thought to think and hope you do as well even if you take issue with my use of the name "usefulness".

Comment by JustinShovelain on Controlling your inner control circuits · 2009-06-29T22:52:06.964Z · LW · GW

define: A theory's "truthfulness" as how much probability mass it has after appropriate selection of prior and applications of Bayes' theorem. It works as a good measure for a theory's "usefulness" as long as resource limitations and psychological side effects aren't important.

define: A theory's "usefulness" as a function of resources needed to calculate its predictions to a certain degree of accuracy, the "truthfulness" of the theory itself, and side effects. Squinting at it, I get something roughly like: usefulness(truthfulness, resources, side effects) = truthfulness * accuracy(resources) + messiness(side effects)

So I define "usefulness" as a function and "truthfulness" as its limiting value as side effects go to 0 and resources go to infinity. Notice how I shaped the definition of "usefulness" to avoid mention of context specific utilities; I purposefully avoided making it domain specific or talking about what the theory is trying to predict. I did this to maintain generality.

(Note: For now I'm polishing over the issue of how to deal with abstracting over concrete hypotheses and integrating the properties of this abstraction with the definitions)

Comment by JustinShovelain on What's In A Name? · 2009-06-29T19:04:16.068Z · LW · GW

I agree that it may plausibly be argued that the difference should rarely fall into the small margin: U(good name) - U(bad name) (up to varying priors, utility functions, ...). However, should people calculate to the point that they can resolve differences of that order of magnitude? A fast and dirty heuristic may be the way to go practically speaking; the difference in utility would be less than the utility lost in calculating it.

Comment by JustinShovelain on What's In A Name? · 2009-06-29T18:49:09.942Z · LW · GW

Is this whole bias caused by the exposure effect? Would there be any obstacle in unifying the two? Do people also prefer to live in towns that are associated with their parents' names? Do people who fall for this effect also name their pets or children after themselves to a greater extent?

User info

Posts

Comments