Posts

Is AlphaGo actually a consequentialist utility maximizer? 2023-12-07T12:41:05.132Z
faul_sname's Shortform 2023-12-03T09:39:10.782Z
Regression To The Mean [Draft][Request for Feedback] 2012-06-22T17:55:51.917Z
The Dark Arts: A Beginner's Guide 2012-01-21T07:05:05.264Z
What would you do with a financial safety net? 2012-01-16T23:38:18.978Z

Comments

Comment by faul_sname on Is this a Pivotal Weak Act? Creating bacteria that decompose metal · 2024-09-11T21:18:37.348Z · LW · GW

"Create bacteria that can quickly decompose any metal in any environment, including alloys and including metal that has been painted, and which also are competitive in those environments, and will retain all of those properties under all of the diverse selection pressures they will be under worldwide" is a much harder problem than "create bacteria that can decompose one specific type of metal in one specific environment", which in turn is harder than "identify specific methanobacteria which can corrode exposed steel by a small fraction of a millimeter per year, and find ways to improve that to a large fraction of a millimeter per year."

Also it seems the mechanism is "cause industrial society to collapse without killing literally all humans" -- I think "drop a sufficiently large but not too large rock on the earth" would also work to achieve that goal, you don't have to do anything galaxy-brained.

Comment by faul_sname on yanni's Shortform · 2024-08-30T18:02:02.089Z · LW · GW

I am struggling to see how we do lose 80%+ of these jobs within the next 3 years.

Operationalizing this, I would give you 4:1 that the fraction (or raw number, if you'd prefer) of employees occupied as travel agents is over 20% of today's value, according to the Labor Force Statistics from the US Bureau of Labor Statistics Current Population Survey (BLS CPS) Characteristics of the Employed dataset.

For reference, here are the historical values for the BLS CPS series cpsaat11b ("Employed persons by detailed occupation and age") since 2011 (which is the earliest year they have it available as a spreadsheet). If you want to play with the data yourself, I put it all in one place in google sheets here.

As of the 2023 survey, about 0.048% of surveyed employees, and 0.029% of surveyed people, were travel agents. As such, I would be willing to bet at 4:1 that when the 2027 data becomes available, at least 0.0096% of surveyed employees and at least 0.0058% of surveyed Americans report their occupation as "Travel Agent".
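The thresholds are just 20% of the 2023 shares quoted above; a quick sanity check of that arithmetic:

```python
# 20% of the 2023 BLS CPS shares quoted above.
share_of_employees = 0.00048  # 0.048% of surveyed employees were travel agents
share_of_people = 0.00029     # 0.029% of surveyed people were travel agents
threshold = 0.20              # "over 20% of today's value"

print(f"{share_of_employees * threshold:.4%} of surveyed employees")  # 0.0096%
print(f"{share_of_people * threshold:.4%} of surveyed people")        # 0.0058%
```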

Are you interested in taking the opposite side of this bet?

Edit: Fixed arithmetic error in the percentages in the offered bet

Comment by faul_sname on Am I confused about the "malign universal prior" argument? · 2024-08-29T23:05:58.909Z · LW · GW

Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it's difficult to imagine how you (just you) can reason about the whole search in such a way as to "steer it around" in your preferred direction.

I think this is easy to imagine. I'm an expert who is among 10 experts recruited to advise some government on making a decision. I can guess some of the signals that the government will use to choose who among us to trust most. I can guess some of the relative weaknesses of fellow experts. I can try to use this to manipulate the government into taking my opinion more seriously. I don't need to create a clone government and hire 10 expert clones in order to do this.

The other 9 experts can also make guesses about which signals the government will use and what the relative weaknesses of their fellow experts are, and the other 9 experts can also act on those guesses. So in order to reason about what the outcome of the search will be, you have to reason about both yourself and also about the other 9 experts, unless you somehow know that you are much better than the other 9 experts at steering the outcome of the search as a whole. But in that case only you can steer the search. The other 9 experts would fail if they tried to use the same strategy you're using.

Comment by faul_sname on Why Large Bureaucratic Organizations? · 2024-08-28T02:55:41.853Z · LW · GW

The employee doesn't need to understand why their job is justified in order for their job to be justified. In particular, looking at the wikipedia article, it gives five examples of types of bullshit jobs:

  1. Flunkies, who serve to make their superiors feel important, e.g., receptionists, administrative assistants, door attendants, store greeters;
  2. Goons, who act to harm or deceive others on behalf of their employer, or to prevent other goons from doing so, e.g., lobbyists, corporate lawyers, telemarketers, public relations specialists;
  3. Duct tapers, who temporarily fix problems that could be fixed permanently, e.g., programmers repairing shoddy code, airline desk staff who calm passengers with lost luggage;
  4. Box tickers, who create the appearance that something useful is being done when it is not, e.g., survey administrators, in-house magazine journalists, corporate compliance officers;
  5. Taskmasters, who create extra work for those who do not need it, e.g., middle management, leadership professionals.[4][2]

The thing I notice is that all five categories contain many soul-crushing jobs, and yet for all five categories I expect that the majority of people employed in those jobs are in fact a net positive to the companies they work for when they work in those roles.

  • Flunkies:
    • Receptionists + administrative assistants: a business has lots of boring administrative tasks to keep the lights on. Someone has to make sure the invoices are paid, the travel arrangements are made, and that meetings are scheduled without conflicts. For many of these tasks, there is no particular reason that the person keeping the lights on needs to be the same person as the person keeping the money fountain at the core of the business flowing.
    • Door attendants, store greeters: these are loss prevention jobs: people are less likely to just walk off with the merchandise if someone is at the door. Not "entirely prevented from walking out with the merchandise", just "enough less likely to justify paying someone minimum wage to stand there".
  • Goons:
    • Yep, there sure is a lot of zero- and negative-sum stuff that happens in the corporate world. I don't particularly expect that 1000 small firms will have less zero-sum stuff going on than 10 large firms, though, except to the extent that 10 large firms have more surplus to expend on zero-sum games.
  • Duct tapers:
    • Programmers repairing shoddy code: It is said that there are two types of code: buggy hacked-together spaghetti code, and code that nobody uses. More seriously, the value of a bad fix later today is often higher than the value of a perfect fix next year. Management still sometimes makes poor decisions about technical debt, but also the optimal level of tech debt from the perspective of the firm is probably not the optimal level of tech debt for the happiness and job satisfaction of the development team. And I say this as a software developer who is frequently annoyed by tech debt.
    • Airline desk staff who calm passengers with lost luggage: I think the implication is supposed to be "it would be cheaper to have policies in place which prevent luggage from being lost than it is to hire people to deal with the fallout", but that isn't directly stated.
  • Box tickers:
    • Yep, everyone hates doing compliance work. And there sure are some rules which fail a cost-benefit analysis. Still, given a regulatory environment, the firm will make cost-benefit calculations within that regulatory environment, and "hire someone to do the compliance work" is frequently a better option than "face the consequences for noncompliance".
    • With regards to regulatory capture, see section "goons".
  • Taskmasters:
    • A whole lot can be said here, but one thing that's particularly salient to me is that some employees provide most of their value by being present during a few high-stakes moments per year where there's a massive benefit of having someone available vs not. The rest of the time, for salaried employees, the business is going to be tempted to press them into any work that has nonzero value, even if the value of that work is much less than the salary of the employee divided by the annual number of hours they work.

That said, my position isn't "busywork / bullshit doesn't exist", it's "most employees provide net value to their employers relative to nobody being employed in that position, and this includes employees who think their job is bullshit".

Comment by faul_sname on Shortform · 2024-08-28T01:40:26.262Z · LW · GW

I can think of quite a few institutions that certify people as being "good" in some specific way, e.g.

  • Credit Reporting Agencies: This person will probably repay money that you lend to them
  • Background Check Companies: This person doesn't have a criminal history
  • Professional Licensing Boards: This person is qualified and authorized to practice in their field
  • Academic Institutions: This person has completed a certain level of education or training
  • Driving Record Agencies: This person is a responsible driver with few or no traffic violations
  • Employee Reference Services: This individual has a positive work history and is reliable

Is your question "why isn't there an institution which pulls all of this information about a single person, and condenses it down to a single General Factor of Goodness Score"?

Comment by faul_sname on Why Large Bureaucratic Organizations? · 2024-08-28T01:21:56.058Z · LW · GW

Indeed, my understanding is that my mental model is pretty close to the standard economist one, though I don't have a formal academic background, so don't quote me as "this is the canonical form of the theory of the firm".

I also wanted a slightly different emphasis from the standard framing I've seen, because the post says

The economists have some theorizing on the topic (google “theory of the firm”), but none of it makes me feel much less confused about the sort of large organizations I actually see in our world. The large organizations we see are clearly not even remotely economically efficient; for instance, they’re notoriously full of “bullshit jobs” which do not add to the bottom line, and it’s not like it’s particularly difficult to identify the bullshit jobs either. How is that a stable economic equilibrium?!?

so I wanted to especially emphasize the dynamic where jobs which are clearly inefficient and wouldn't work at all in a small company ("bullshit jobs") can still be net positive at a large enough company.

Comment by faul_sname on Why Large Bureaucratic Organizations? · 2024-08-27T21:25:48.191Z · LW · GW

The large organizations we see are clearly not even remotely economically efficient

I think large organizations often have non-obvious advantages of scale.

My mental model is that businesses grow approximately until the marginal cost of adding another employee is higher than the marginal benefit. This can combine with the advantages of scale that companies have to produce surprising results.

Let's say you have a company with a billion users and a revenue model with net revenue of $0.25 / user / year, and only 50 employees (like a spherical-cow version of [WhatsApp in 2015](https://news.ycombinator.com/item?id=34543480)).

If you're in this position, you're probably someone who likes money. As such, you will be asking questions like

  • Can I increase the number of users on the platform?
  • Can I increase the net revenue per user?
  • Can I do creative stuff with cashflow?

And, for all of these, you might consider hiring a person to do the thing.

At a billion $0.25 / year users, and let's say $250k / year to hire a person, that person would only have to do one of the following (see the arithmetic sketch after this list):

  • Bring in an extra million users
    • Or increase retention by an amount with the same effect
    • Or ever-so-slightly decrease [CAC](https://en.wikipedia.org/wiki/Customer_acquisition_cost)
  • Increase expected annual net revenue per user by $0.00025
    • Or double annual net revenue per user specifically for users in Los Angeles County, while not doing anything anywhere else
  • Figure out how to get the revenue at the beginning of the week instead of the end of the week
  • Increase the effectiveness of your existing employees by some tiny amount
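A quick sketch of that arithmetic, using the spherical-cow numbers above (illustrative, not actual WhatsApp figures):

```python
# Break-even thresholds for one marginal hire, using the illustrative numbers above.
users = 1_000_000_000
net_revenue_per_user = 0.25   # dollars per user per year
cost_per_hire = 250_000       # dollars per year, fully loaded

extra_users_to_break_even = cost_per_hire / net_revenue_per_user   # 1,000,000 users
extra_revenue_per_user_to_break_even = cost_per_hire / users       # $0.00025 / user / year

print(f"extra users needed to cover one hire: {extra_users_to_break_even:,.0f}")
print(f"or extra annual net revenue per user: ${extra_revenue_per_user_to_break_even:.5f}")
```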

A statement that you shouldn't hire past your initial 50 people is either a statement that none of these are available paths to you, or that you don't know how to slot additional people into your organizational structure without harming the performance of your existing employees (note that "harming the performance of your existing employees" is not the same thing as "decreasing the average performance of your employees"). The latter is sometimes true, but it's generally not super true of large profitable companies like Apple or Google.

Status concerns do matter, but I don't think they're the only explanation, or even the most important consideration, for why Apple, the most valuable company in the world by market cap, has 150,000 employees.

Comment by faul_sname on If we solve alignment, do we die anyway? · 2024-08-23T18:17:12.804Z · LW · GW

People often speak of massively multipolar scenarios as a good outcome.

I understand that inclination. Historically, unipolar scenarios do not have a great track record of being good for those not in power, especially unipolar scenarios where the one in power doesn't face significant risks to mistreating those under them. So if unipolar scenarios are bad, that means multipolar scenarios are good, right?

But "the good situation we have now is not stable, we can choose between making things a bit worse (for us personally) immediately and maybe not get catastrophically worse later, or having things remain good now but get catastrophically worse later" is a pretty hard pill to swallow. And is also an argument with a rich history of being ignored without the warned catastrophic thing happening.

Comment by faul_sname on If we solve alignment, do we die anyway? · 2024-08-23T17:37:52.409Z · LW · GW

If those don't hold, what is the alternate scenario in which a multipolar world remains safe?

The choice of the word "remains" is an interesting one here. What is true of our current multipolar world which makes the current world "safe", but which would stop being true of a more advanced multipolar world? I don't think it can be "offense/defense balance" because nuclear and biological weapons are already far on the "offense is easier than defense" side of that spectrum.

Comment by faul_sname on The economics of space tethers · 2024-08-23T16:54:28.455Z · LW · GW

Here, you could just have a hook grab a perpendicular rope, but if you don't have any contingency plans, well, "dock or die" isn't very appealing. Especially if it happens multiple times.

If the thing you want to accelerate with the tether is cheap but heavy to LEO (e.g. "big dumb tank of fuel"), it might be a reasonable risk to take. Then missions which have more valuable payload like humans can take the safer approach of strapping them to a somewhat larger pile of explosions, and things which need a lot of delta V can get up to LEO with very little fuel left, dock with one of the big dumb tanks of fuel, and then refuel at that point.

Source: I have played a bunch of Kerbal Space Program, and if it works in KSP it will definitely work in real life with no complications.

Comment by faul_sname on If we solve alignment, do we die anyway? · 2024-08-23T16:28:30.606Z · LW · GW

I think "pivotal act" is being used to mean both "gain affirmative control over the world forever" and "prevent any other AGI from gaining affirmative control of the world for the foreseeable future". The latter might be much easier than the former though.

Comment by faul_sname on A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed · 2024-08-23T00:15:21.542Z · LW · GW

One point of confusion I still have: relative to whose prediction capabilities does a natural latent screen off information?

Let's say one of the models "YTDA" in the ensemble knows the beginning-of-year price of each stock, and uses "average year-to-date market appreciation" as its latent, and so learning the average year-to-date market appreciation of the S&P250odd will tell it approximately everything about that latent, and learning the year-to-date appreciation of ABT will give it almost no information it knows how to use about the year-to-date appreciation of AMGN.

So relative to the predictive capabilities of the YTDA model, I think it is true that "average year-to-date market appreciation" is a natural latent.

However, another model "YTDAPS" in the ensemble might use "per-sector average year-to-date market appreciation" as its latent. Since both the S&P250even and S&P250odd contain plenty of stocks in each sector, it is again the case that once you know the YTDAPS model's latent conditioned on the S&P250odd, learning the price of ABT will not tell the YTDAPS model anything about the price of AMGN.

But then if both of these are latents, does that mean that your theorem proves that any weighted sum of natural latents is also itself a natural latent?

Comment by faul_sname on A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed · 2024-08-22T23:31:10.738Z · LW · GW

Alright, I'm terrible at abstract thinking, so I went through the post and came up with a concrete example. Does this seem about right?

Suppose we have multiple distributions $P^k[X]$ over the same random variables $X_1, \ldots, X_n$. (Speaking somewhat more precisely: the distributions are over the same set, and an element of that set is represented by values $(x_1, \ldots, x_n)$.)

We are a quantitative trading firm. Our investment strategy is such that we care about the prices of the stocks in the S&P 500 at market close today ($X_1, \ldots, X_{500}$).

We have a bunch of models of the stock market ($P^1, \ldots, P^K$), where we can feed in a set of possible prices of stocks in the S&P 500 at market close, and the model spits out a probability of seeing that exact combination of prices (where a single combination of prices is $(x_1, \ldots, x_{500})$).

We take a mixture of the distributions: $P[X] := \sum_k \alpha_k P^k[X]$, where $\sum_k \alpha_k = 1$ and each $\alpha_k$ is nonnegative

We believe that some of our models are better than others, so our trading strategy is to take a weighted average of the predictions of each model, where the weight assigned to the $k$th model $P^k$ is $\alpha_k$, and obviously the weights have to sum to 1 for this to be an "average".

Mathematically: the natural latent over $P[X]$ is defined by $(x, \lambda) \mapsto P[X = x]\, P[\Lambda = \lambda \mid X = x]$, and naturality means that this distribution satisfies the naturality conditions (mediation and redundancy).

We believe that there is some underlying factor which we will call "market factors" ($\Lambda$) such that if you control for "market factors", you no longer learn (approximately) anything about the price of, say, MSFT when you learn about the price of AAPL, and also such that if you order the stocks in the S&P 500 alphabetically and then take the odd-indexed stocks (i.e. A, AAPL, ABNB, ...) in that list and call them the S&P250odd, and call the even-indexed ones (i.e. AAL, ABBV, ABT, ...) the S&P250even, you will come to (approximately) the same estimate of "market factors" by looking at either the S&P250odd or the S&P250even. Further, this means that if you estimate "market factors" by looking at the S&P250odd, then your estimate of the price of AAL will be approximately unchanged if you learn the price of ABT.

Then our theorem says: if an approximate natural latent exists over $P[X]$, and that latent is robustly natural under changing the mixture weights $\alpha$, then the same latent is approximately natural over $P^k[X]$ for all $k$.

Anyway, if we find that the above holds for the weighted sum we use in practice, and we also find that it robustly [1] holds when we change the weights, that actually means that all of our market price models take "market factors" into account.

Alternatively stated, it means that if one of the models was written by an intern that procrastinated until the end of his internship and then on the last morning wrote def predict_price(ticker): return numpy.random.lognormal(), then our weighted sum is not robust to changes in the weights.

Is this a reasonable interpretation? If so, I'm pretty interested to see where you go with this. 

  1. ^

    Terms and conditions apply. This information is not intended as, and shall not be understood or construed as, financial advice.

Comment by faul_sname on Limitations on Formal Verification for AI Safety · 2024-08-22T18:13:00.797Z · LW · GW

My point was more that I expect there to be more value in producing provable safety design demos and provable safety design tutorials than in provable safety design principles, because I think the issue is more "people don't know how, procedurally, to implement provable safety in systems they build or maintain" than "people don't know how to think about provable safety but if their philosophical confusion was resolved they wouldn't have too many further implementation difficulties".

So having any examples at all would be super useful, and if you're trying to encourage "any examples at all" one way of encouraging that is to go "look, you can make billions of dollars if you can build this specific example".

Comment by faul_sname on Limitations on Formal Verification for AI Safety · 2024-08-21T22:10:50.242Z · LW · GW

I think the greatest contribution to humanity's survival right now is to create detailed plans for building provably safe infrastructure, so that when the enabling technologies appear and the world begins demanding safe technology, there is a plan for moving forward.

There are enough places where provably-safe-against-physical-access hardware would be an enormous value-add that you don't need to wait to start working on it until the world demands safe technology for existential reasons. Look at the demand for secure enclaves, which are not provably secure, but are "probably good enough because you are unlikely to have a truly determined adversary".

The easiest way to convince people that they, personally, should care more about provable correctness over immediately-obvious practical usefulness is to demonstrate that provable correctness is possible, not too costly, and has clear benefits to them, personally.

Comment by faul_sname on Limitations on Formal Verification for AI Safety · 2024-08-20T21:13:42.890Z · LW · GW

But provably safe infrastructure could stop this kind of attack at every stage: biochemical design AI would not synthesize designs unless they were provably safe for humans, data center GPUs would not execute AI programs unless they were certified safe, chip manufacturing plants would not sell GPUs without provable safety checks, DNA synthesis machines would not operate without a proof of safety, drone control systems would not allow drones to fly without proofs of safety, and armies of persuasive bots would not be able to manipulate media without proof of humanness.

Let's say that you have a decider which can look at some complex real-world system and determine whether it is possible to prove that the complex real-world system has some desirable safety properties.

Let's further say that your decider is not simply a rock with the word "NO" written on it.

Concretely, we can look at the example of "armies of persuasive bots would not be able to manipulate media without proof of humanness". In order to do this, we need to have an adversarially robust classifier for "content we can digitally prove was generated by a specific real human" vs "content we can't digitally prove was generated by a specific real human".

But that also gets you, at a minimum, a solid leg up in a whole range of business areas.

So if you think this problem is solvable, not only can you make a potentially large positive impact on the future of humanity, you can also get very very rich while doing it.

You don't even need to solve the whole problem. With a solid demonstration of a provable humanness detector, you should be able to get arbitrarily large amounts of venture funding to make your system into a reality.

The first step of creating a working prototype is left as an exercise for the reader.

Comment by faul_sname on Quick look: applications of chaos theory · 2024-08-19T16:52:14.737Z · LW · GW

The margins of error of existing measuring instruments will tell you how long you can expect your simulation to resemble reality, but an exponential decrease in measurement error will only buy you a linear increase in how long that simulation is good for.

I also call into question the divergence, at least in weather prediction. Bright and sunny, how different/divergent is it from thunderstorm? There could be something lost in translation, going from numerical outputs to natural language descriptions like sunny, rainy.. etc.

If you don't like descriptive stuff like "sunny" or "thunderstorm" you could use metrics like "watts per square meter of sunlight at surface level" or "atmospheric pressure" or "rainfall rate". You will still observe a divergence in behavior between models with arbitrarily small differences in initial state (and between your model and the behavior of the real world).

Comment by faul_sname on adam_scholl's Shortform · 2024-08-17T00:34:57.803Z · LW · GW

Mexico and Chile are the most salient examples to me. But also I've only ever gotten food poisoning once in my life despite frequent risky food behavior.

Strong agree that the magnitude of the overzealousness is much higher for drugs than for food.

Comment by faul_sname on How unusual is the fact that there is no AI monopoly? · 2024-08-17T00:16:37.413Z · LW · GW

[Epistemic Status: extremely not endorsed brain noise] New EA cause area just dropped! Do lots of cutting edge algorithmic AI research, and then publish that research, but patent your published research and become a patent troll!

Comment by faul_sname on adam_scholl's Shortform · 2024-08-16T23:50:24.177Z · LW · GW

Have you ever visited a country without zealous food safety regulations? I think it's one of those things where it's hard to realize what the alternative looks like (plentiful, cheap, and delicious street food available wherever people gather, so that you no longer have to plan around making sure you either bring food or go somewhere with restaurants, and it is viable for individuals to exist without needing a kitchen of their own).

Comment by faul_sname on How unusual is the fact that there is no AI monopoly? · 2024-08-16T23:36:04.824Z · LW · GW

Is the current legal situation with patents different?

My understanding is that Google did patent transformers, but the patent explicitly only covered encoder/decoder architectures, and e.g. GPT-2 uses a decoder-only architecture and so is not covered under that patent (and it would have been very hard for OpenAI to obtain and defend a patent for decoder-only transformers due to Google's prior art).

If your question is, instead, "why didn't the first person to come up with the idea of using computers to predict the next element in a sequence patent that idea, in full generality", keep in mind that (POSIWID aside) patents are intended "to promote the progress of science and useful arts". They are not meant as a way of allowing the first person to come up with an idea to prevent all further research in vaguely adjacent fields.

As a concrete example of the sorts of things patents don't do, take O'Reilly v. Morse, 56 U.S. 62 (1853). In his patent application, Morse claimed

Eighth. I do not propose to limit myself to the specific machinery or parts of machinery described in the foregoing specification and claims; the essence of my invention being the use of the motive power of the electric or galvanic current, which I call electro-magnetism, however developed for marking or printing intelligible characters, signs, or letters, at any distances, being a new application of that power of which I claim to be the first inventor or discoverer.

The court's decision stated

If this claim can be maintained, it matters not by what process or machinery the result is accomplished. For aught that we now know some future inventor, in the onward march of science, may discover a mode of writing or printing at a distance by means of the electric or galvanic current, without using any part of the process or combination set forth in the plaintiff's specification. His invention may be less complicated-less liable to get out of order-less expensive in construction, and its operation. But yet if it is covered by this patent the inventor could not use it, nor the public have the benefit of it without the permission of this patentee. [...] In fine, he claims an exclusive right to use a manner and process which he has not described and indeed had not invented, and therefore could not describe when he obtained his patent. The court is of opinion that the claim is too broad, and not warranted by law.

Comment by faul_sname on Ten arguments that AI is an existential risk · 2024-08-13T23:41:32.343Z · LW · GW

Which is why, since the beginning of the nuclear age, the running theme of international relations is "a single nation embarked on multiple highly destructive wars of conquest, and continued along those lines until no nations that could threaten it remained".

Comment by faul_sname on Ten arguments that AI is an existential risk · 2024-08-13T20:26:04.679Z · LW · GW

"Everyone who only cares about their slices of the world coordinates against those who want to seize control of the entire world" seems like it might be one of those stable equilibria.

Comment by faul_sname on leogao's Shortform · 2024-08-12T22:40:02.566Z · LW · GW

One nuance here is that a software tool that succeeds at its goal 90% of the time, and fails in an automatically detectable fashion the other 10% of the time is pretty useful for partial automation. Concretely, if you have a web scraper which performs a series of scripted clicks in hardcoded locations after hardcoded delays, and then extracts a value from the page from immediately after some known hardcoded text, that will frequently give you a ≥ 90% success rate of getting the piece of information you want while being much faster to code up than some real logic (especially if the site does anti-scraper stuff like randomizing css classes and DOM structure) and saving a bunch of work over doing it manually (because now you only have to manually extract info from the pages that your scraper failed to scrape).
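A minimal sketch of that kind of scraper (hypothetical marker text; the point is that failure is explicit rather than silent):

```python
import requests

MARKER = "Order total:"  # hypothetical hardcoded text that precedes the value we want

def scrape_value(url: str) -> str | None:
    """Return the value immediately following MARKER, or None (a detectable failure)."""
    try:
        page = requests.get(url, timeout=30).text
    except requests.RequestException:
        return None  # detectable failure: request error
    idx = page.find(MARKER)
    if idx == -1:
        return None  # detectable failure: page layout changed
    # Grab whatever immediately follows the marker, up to the next tag.
    value = page[idx + len(MARKER):].split("<", 1)[0].strip()
    return value or None

# Pages where this returns None get routed to manual handling; the ~90% that
# succeed never need a human to look at them.
```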

Comment by faul_sname on faul_sname's Shortform · 2024-08-12T19:26:40.207Z · LW · GW

I don't think it's an issue of pure terminology. Rather, I expect the issue is expecting to have a single discrete point in time at which some specific AI is better than every human at every useful task. Possibly there will eventually be such a point in time, but I don't see any reason to expect "AI is better than all humans at developing new euv lithography techniques", "AI is better than all humans at equipment repair in the field", and "AI is better than all humans at proving mathematical theorems" to happen at similar times.

Put another way, is an instance of an LLM that has an affordance for "fine-tune itself on a given dataset" an ASI? Going by your rubric:

  • Can think about any topic, including topics outside of their training set: Yep, though it's probably not very good at it
  • Can do self-directed, online learning: Yep, though this may cause it to perform worse on other tasks if it does too much of it
  • Alignment may shift as knowledge and beliefs shift w/ learning: To the extent that "alignment" is a meaningful thing to talk about with regards to only a model rather than a model plus its environment, yep
  • Their own beliefs and goals: Yes, at least for definitions of "beliefs" and "goals" such that humans have beliefs and goals
  • Alignment must be reflexively stable: ¯\_(ツ)_/¯ seems likely that some possible configuration is relatively stable
  • Alignment must be sufficient for contextual awareness and potential self-improvement: ¯\_(ツ)_/¯ even modern LLM chat interfaces like Claude are pretty contextually aware these days
  • Actions: Yep, LLMs can already perform actions if you give them affordances to do so (e.g. tools)
  • Agency is implied or trivial to add: ¯\_(ツ)_/¯, depends what you mean by "agency" but in the sense of "can break down large goals into subgoals somewhat reliably" I'd say yes

Still, I don't think e.g. Claude Opus is "an ASI" in the sense that people who talk about timelines mean it, and I don't think this is only because it doesn't have any affordances for self-directed online learning.

Comment by faul_sname on faul_sname's Shortform · 2024-08-12T16:14:45.607Z · LW · GW

I don't think talking about "timelines" is useful anymore without specifying what the timeline is until (in more detail than "AGI" or "transformative AI"). It's not like there's a specific time in the future when a "game over" screen shows with our score. And for the "the last time that humans can meaningfully impact the course of the future" definition, that too seems to depend on the question of how: the answer is already in the past for "prevent the proliferation of AI smart enough to understand and predict human language", but significantly in the future for "prevent end-to-end automation of the production of computing infrastructure from raw inputs".

Comment by faul_sname on shminux's Shortform · 2024-08-11T05:25:45.222Z · LW · GW

Email didn't entirely kill fax machines or paper records. For similar reasons, I expect that LLMs will not entirely kill computer languages.

Also, I expect things to go the other direction - I expect that as LLMs get better at writing code, they will generate enormous amounts of one-off code. For example, one thing that is not practical to do now but will be practical to do in a year or so is to have sales or customer service webpages where the affordances given to the user (e.g. which buttons and links are shown, what data the page asks for and in what format) will be customized on a per-user basis. For example, when asking for payment information, currently the UI is almost universally credit card number / cvv / name / billing street address / unit / zipcode / state. However, "hold your credit card and id up to the camera" might be easier for some people, while others might want to read out that information, and yet others might want to use venmo or whatever, and a significant fraction will want to stick to the old form fields format. If web developers developed 1,000x faster and 1,000x as cheaply, it would be worth it to custom-develop each of these flows to capture a handful of marginal customers. But forcing everyone to use the LLM interface would likely cost customers.

Comment by faul_sname on It's time for a self-reproducing machine · 2024-08-07T22:35:30.536Z · LW · GW

How much of a loss of precision would we expect in one generation of autofacs?

As a concrete example, let's say one of the components of an autofac is a 0.03125 inch (±0.1 thousandths) CNC drill bit. Can your autofac make another such drill bit out of the same material and at the same level of precision?

If not, maybe we have to ship in the drill bits as well. But there are a large number of things like this, and at some point you've got a box that can assemble copies of itself from prefabricated parts, but uses a pretty standard supply chain to obtain those prefabricated parts. Which, to be clear, would still be pretty cool.

Comment by faul_sname on yanni's Shortform · 2024-08-07T06:41:13.361Z · LW · GW

If you work in a generative AI lab, a significant number of people already hate the work you're doing and would likely hate you specifically if your existence became salient to them, for reasons that are at best tangentially related to your contribution to existential risk. This is true regardless of what your timelines look like.

But I don't understand the mechanism by which working in a frontier AI lab is supposed to damage your employment prospects. The set of people who hate you for causing technological unemployment is probably not going to intersect much with the set of people who are making hiring decisions. People who have a history of doing antisocial-but-profitable-for-their-employer stuff get hired all the time, and proudly advertise those profitable antisocial activities on their resumes.

In the extreme, you could argue that working in a frontier AI lab could lead to total human obsolescence, which would harm your job prospects on account of there are no jobs anywhere for anyone. But that's like saying "crashing into an iceberg could cause a noticeable decrease in the number of satisfied diners in the dining saloon of the Titanic".

Comment by faul_sname on Circular Reasoning · 2024-08-05T21:21:47.639Z · LW · GW

Possible, although I think you probably reach that point much faster if you can establish that your conversational partner disagrees with the idea that arguments should be supported or at least supportable by empirical evidence.

Comment by faul_sname on Circular Reasoning · 2024-08-05T20:40:18.465Z · LW · GW

despite that, circular arguments in a tightly closed loop with no reference to empirical observations seem like so little evidence that they're almost worthless

In such situations, "that argument seems to be unsupported by empirical evidence" seems to me like a better counterargument than "that argument is circular".

Comment by faul_sname on You don't know how bad most things are nor precisely how they're bad. · 2024-08-04T20:10:17.369Z · LW · GW

He scoffed "there are some people who do that, but that really only gets you close, and they'd have to finish by ear anyway, especially with the sort of pianos you typically have to work with, since you really need to finesse how the overtones interact with each other, and it's not guaranteed that the overtones are going to be exactly what they're supposed to be, given variations in string thickness, stretching, corrosion, dents, the harp flexing, you know... The whole thing is a negotiation with the piano, you can't just read it its orders and expect it to sound good."

This seems to be a theme with very exact things - once you reach a certain level of required precision, the most effective approach switches from "have a target value and a way of measuring the value" to "have something to compare with". See gauge blocks in machining (nice basic explainer video if you like videos).

Comment by faul_sname on A Simple Toy Coherence Theorem · 2024-08-02T20:52:48.425Z · LW · GW

So if I understand correctly, optimal policies specifically have to be coherent in their decision-making when all information about which decision was made is destroyed, and only information about the outcome remains. The load-bearing part being:

Now, suppose that at timestep $t$ there are two different states, either of which can reach either state $A$ or state $B$ in the next timestep. From one of those states the policy chooses $A$; from the other the policy chooses $B$. This is an inconsistent revealed preference between $A$ and $B$ at time $t+1$: sometimes the policy has a revealed preference for $A$ over $B$, sometimes for $B$ over $A$.

Concrete example:

Start with the state diagram:

[state diagram]

We assign values to the final states, and then do

start from those values over final state and compute the best value achievable starting from each state at each earlier time. That's just dynamic programming:

$$V_t(s) = \max_{s' \in \text{successors}(s)} V_{t+1}(s'), \qquad V_T = u$$

where $u$ are the values over final states.

[state diagram with computed values]

and so the reasoning is that there is no coherent policy which chooses Prize Room A from the front door but chooses Prize Room B from the side door.

But then if we update the states to include information about the history, and say put +3 on "histories where we have gone straight", we get

[updated state diagram; see the link at the end of this comment for a legible version]

and in that case, the optimal policy will go to Prize Room A from the front door and Prize Room B from the side door. This happens because "Prize Room A from the front door" is not the same node as "Prize Room A from the side door" in this graph.

The coherence theorem in the post talks about how optimal policies can't take alternate options when presented with the same choice based on their history, but for the choice to be "the same choice" you have to have merging paths on the graph, and if nodes contain their own history, paths will never merge.

Is that basically why only the final state is allowed to "count" under this proof, or am I still missing something?

 

Edited to add: link to legible version of final diagram

Comment by faul_sname on A Simple Toy Coherence Theorem · 2024-08-02T19:14:22.581Z · LW · GW

Notice that we used values over final state, and explicitly set incremental reward at earlier timesteps to zero. That was load-bearing: with arbitrary freedom to choose rewards at earlier timesteps, any policy is optimal for some nontrivial values/rewards. (Proof: just pick the rewards at timestep $t$ to reward whatever the policy does enough to overwhelm future value/rewards.)

Do you expect that your methods would generalize over a utility function that was defined as the sum of some utility function over the state at some fixed intermediate timestep $t^*$ and some utility function over the final state? Naively, I would think one could augment the state space such that the entire state at time $t^*$ became encoded in subsequent states, and the utility function in question could then be expressed solely as a utility function over the final state. But I don't know if that strategy is "allowed".

If this method is "allowed", I don't understand why this theorem doesn't extend to systems where incremental reward is nonzero at arbitrary timesteps.

If this method is not "allowed", does that mean that this particular coherence theorem only holds over policies which care only about the final state of the world, and agents which are coherent in this sense are not allowed to care about world histories and the world state is not allowed to contain information about its history?

Comment by faul_sname on Dragon Agnosticism · 2024-08-02T00:07:01.730Z · LW · GW

Concrete example: one of my core beliefs is " :::systems which get their productive capacity mainly from voluntary trade are both more productive and better for the people living under them than systems which get their productive capacity mainly from threats and coercion::: ". In this example, a "dragon" would be :::a coercion-based system which is more productive than a voluntary-trade-based one::: .

So working through the analogy:

  • Historically, people who believed in this kind of "dragon", and who acted on that belief, tended not to behave very nicely.
  • There are still people who believe in "dragons". They tend to valorize the actions of past dragon believers to a worrying extent.
  • I think "dragons" probably don't exist, but I haven't actually proven it.
  • I would prefer to live in a world where "dragons" don't exist.
  • If "dragons" did exist, I would feel obligated to spend a lot of time and effort reevaluating my world model and plans.
  • If I discovered the existence of a "dragon", and shared that, that would become a defining thing I am known for
  • Most likely, there wouldn't be much that I, personally, could do with the information that a "dragon" exists

I guess arguably this is a hot-button political issue in some contexts. The other example I was thinking of was less political but had a similar shape: "thing which, if true, implies that we can't expect to maintain certain nice things about the world we live in, and where people believing that the thing is true, if it is in fact true, would hasten the end of the nice things, to the benefit of nobody in particular".

ETA: On reflection I do agree that this is a different flavor of "nothing good can come of this line of thought" than the one you outlined in your post.

Comment by faul_sname on Dragon Agnosticism · 2024-08-01T21:47:45.353Z · LW · GW

It's fun to put various of my foundational non-hot-button-political beliefs in place of "dragons" and see which ones make my mind try to flinch away from thinking that thought for the reasons outlined in this post (i.e. "if I checked and it's not the way I thought, that would necessitate a lot of expensive and time-consuming updates to what I'm doing").

Comment by faul_sname on faul_sname's Shortform · 2024-08-01T21:30:15.314Z · LW · GW

I'm curious what sort of policies you're thinking of which would allow for a pause which plausibly buys us decades, rather than high-months-to-low-years. My imagination is filling in "totalitarian surveillance state which is effective at banning general-purpose computing worldwide, and which prioritizes the maintenance of its own control over all other concerns". But I'm guessing that's not what you have in mind.

Comment by faul_sname on faul_sname's Shortform · 2024-08-01T07:58:09.682Z · LW · GW

In the startup world, conventional wisdom is that, if your company is default-dead (i.e. on the current growth trajectory, you will run out of money before you break even), you should pursue high-variance strategies. In one extreme example, "in the early days of FedEx, [founder of FedEx] Smith had to go to great lengths to keep the company afloat. In one instance, after a crucial business loan was denied, he took the company's last $5,000 to Las Vegas and won $27,000 gambling on blackjack to cover the company's $24,000 fuel bill. It kept FedEx alive for one more week."

By contrast, if your company is default-alive (profitable or on-track to become profitable long before you run out of money in the bank), you should avoid making high-variance bets for a substantial fraction of the value of the company, even if those high-variance bets are +EV.

Obvious follow-up question: in the absence of transformative AI, is humanity default-alive or default-dead?

Comment by faul_sname on faul_sname's Shortform · 2024-08-01T00:38:11.645Z · LW · GW

Wait, I think I am overthinking this by a lot and the thing I want is in the literature under terms like "classifier" and "linear regression".

Comment by faul_sname on faul_sname's Shortform · 2024-07-31T20:50:26.448Z · LW · GW

Is it possible to determine whether a feature (in the SAE sense of "a single direction in activation space") exists for a given set of changes in output logits?

Let's say I have a feature from a learned dictionary on some specific layer of some transformer-based LLM. I can run a whole bunch of inputs through the LLM, either adding that feature to the activations at that layer (in the manner of Golden Gate Claude) or ablating that direction from the outputs at that layer. That will have some impact on the output logits.

Now I have a collection of (input token sequence, output logit delta) pairs. Can I, from that set, find the feature direction which produces those approximate output logit deltas by gradient descent?
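Here is a toy sketch of the sort of thing I have in mind, with a random linear map standing in for "everything downstream of the hooked layer" (so it only checks that the optimization setup can work, not that it works on a real transformer):

```python
import torch

torch.manual_seed(0)
d_model, d_vocab, n_inputs = 64, 256, 512

# Stand-in for the layers between the hook point and the output logits.
rest_of_model = torch.nn.Linear(d_model, d_vocab, bias=False)
rest_of_model.weight.requires_grad_(False)

# Hidden "true" feature direction we will try to recover.
true_direction = torch.randn(d_model)
true_direction /= true_direction.norm()

# Activations at the hooked layer for a batch of inputs, plus the logit deltas
# produced by adding the true direction at that layer.
activations = torch.randn(n_inputs, d_model)
logit_deltas = rest_of_model(activations + true_direction) - rest_of_model(activations)

# Gradient descent on a candidate direction to match the observed logit deltas.
candidate = torch.randn(d_model, requires_grad=True)
optimizer = torch.optim.Adam([candidate], lr=1e-2)
for _ in range(2000):
    optimizer.zero_grad()
    pred = rest_of_model(activations + candidate) - rest_of_model(activations)
    loss = (pred - logit_deltas).pow(2).mean()
    loss.backward()
    optimizer.step()

cosine = torch.nn.functional.cosine_similarity(candidate.detach(), true_direction, dim=0)
print(f"cosine similarity between recovered and true direction: {cosine.item():.3f}")
```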

If yes, could the same method be used to determine which features in a learned dictionary trained on one LLM exist in a completely different LLM that uses the same tokenizer?

I imagine someone has already investigated this question, but I'm not sure what search terms to use to find it. The obvious search terms like "sparse autoencoder cross model" or "Cross-model feature alignment in transformers" don't turn up a ton, although they turn up the somewhat relevant paper Text-To-Concept (and Back) via Cross-Model Alignment.

Comment by faul_sname on lukehmiles's Shortform · 2024-07-31T09:04:18.923Z · LW · GW

It's not perfect, but one approach I saw on here and liked a lot was @turntrout's MATS team's approach for some of the initial shard theory work, where they made an initial post outlining the problem and soliciting predictions on a set of concrete questions (which gave a nice affordance for engagement, namely "make predictions and maybe comment on your predictions"), and then they made a follow-up post with their actual results. Seemed to get quite good engagement.

A confounding factor, though, was that was also an unusually impressive bit of research.

Comment by faul_sname on On “first critical tries” in AI alignment · 2024-07-29T10:22:41.081Z · LW · GW

I think the most important part of your "To stand x chance of property p applying to system s, we'd need to apply resources r" model is the word "we".

Currently, there exists no "we" in the world that can ensure that nobody in the world does some form of research, or at least no "we" that can do that in a non-cataclysmic way. The International Atomic Energy Agency comes the closest of any group I'm aware of, but the scope is limited and also it does its thing mainly by controlling access to specific physical resources rather than by trying to prevent a bunch of people from doing a thing with resources they already possess.

If "gain a DSA (or cause some trusted other group to gain a DSA) over everyone who could plausibly gain a DSA in the future" is a required part of your threat mitigation strategy, I am not optimistic about the chances for success but I'm even less optimistic about the chances of that working if you don't realize that's the game you're trying to play.

Comment by faul_sname on Arjun Panickssery's Shortform · 2024-07-28T05:56:07.457Z · LW · GW

Interesting. I wonder if it's because scrambled words of the same length and letter distribution are tokenized into tokens which do not regularly appear adjacent to each other in the training data.

If that's what's happening, I would expect gpt3.5 to classify words as long if they contain tokens that are generally found in long words, and not otherwise. One way to test this might be to find shortish words which have multiple tokens, reorder the tokens, and see what it thinks of your frankenword (e.g. "anodized" -> [an/od/ized] -> [od/an/ized] -> "odanized" -> "is odanized a long word?").
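If anyone wants to try this, something like the following (assuming the tiktoken library, and assuming gpt-3.5 uses the cl100k_base encoding) would generate frankenwords to test:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def frankenword(word: str) -> str:
    """Swap a word's first two token pieces, e.g. an/od/ized -> od/an/ized."""
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    if len(pieces) < 2:
        return word  # single-token words can't be scrambled this way
    return "".join([pieces[1], pieces[0]] + pieces[2:])

print(frankenword("anodized"))  # then ask gpt-3.5: "is this a long word?"
```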

Comment by faul_sname on On “first critical tries” in AI alignment · 2024-07-27T05:29:35.477Z · LW · GW

Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".

Comment by faul_sname on faul_sname's Shortform · 2024-07-24T07:12:34.521Z · LW · GW

[Epistemic status: 75% endorsed]

Those who, upon seeing a situation, look for which policies would directly incentivize the outcomes they like should spend more mental effort solving for the equilibrium.

Those who, upon seeing a situation, naturally solve for the equilibrium should spend more mental effort checking if there is indeed only one "the" equilibrium, and if there are multiple possible equilibria, solving for which factors determine which of the several possible the system ends up settling on.

Comment by faul_sname on Lorxus's Shortform · 2024-07-20T01:31:37.090Z · LW · GW

I played with this with a colab notebook way back when. I can't visualize things directly in 4 dimensions, but at the time I came up with the trick of visualizing the pairwise cosine similarity for each pair of features, which gives at least a local sense of what the angles are like.
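Roughly, a minimal version of that setup looks something like this (a reconstruction from memory, not the original notebook; I'm assuming a toy objective that penalizes only positive interference between features, which is what lets antipodal pairs form):

```python
import torch

torch.manual_seed(0)
n_features, d = 9, 4

# Start from random unit vectors and train them to minimize positive interference.
vecs = torch.nn.functional.normalize(torch.randn(n_features, d), dim=1).requires_grad_(True)
optimizer = torch.optim.Adam([vecs], lr=1e-2)

for _ in range(5000):
    optimizer.zero_grad()
    unit = torch.nn.functional.normalize(vecs, dim=1)
    cos = unit @ unit.T
    # Penalize positive off-diagonal cosine similarity only, so antipodal pairs are "free".
    loss = torch.relu(cos - torch.eye(n_features)).pow(2).sum()
    loss.backward()
    optimizer.step()

# The "diagram": the pairwise cosine similarity matrix of the squished features.
with torch.no_grad():
    unit = torch.nn.functional.normalize(vecs, dim=1)
    print((unit @ unit.T).numpy().round(2))
```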

Trying to squish 9 features into 4 dimensions looks to me like it either ends up with

  • 4 antipodal pairs which are almost orthogonal to one another, and then one "orphan" direction squished into the largest remaining space
    [pairwise cosine similarity diagram]
    OR
  • 3 almost orthogonal antipodal pairs plus a "Y" shape with the narrow angle being 72º and the wide angles being 144º
  • For reference, this is what a square antiprism looks like in this type of diagram:
    [pairwise cosine similarity diagram]
Comment by faul_sname on Daniel Tan's Shortform · 2024-07-17T08:24:27.872Z · LW · GW

Yeah, stands for Max Cosine Similarity. Cosine similarity is a pretty standard measure for how close two vectors are to pointing in the same direction. It's the cosine of the angle between the two vectors, so +1.0 means the vectors are pointing in exactly the same direction, 0.0 means the vectors are orthogonal, -1.0 means the vectors are pointing in exactly opposite directions.

To generate this graph, I think he took each of the learned features in the smaller dictionary, and then calculated the cosine similarity of that small-dictionary feature with every feature in the larger dictionary, and then the maximal cosine similarity was the MCS for that small-dictionary feature. I have a vague memory of him also doing some fancy linear_sum_assignment() thing (to ensure that each feature in the large dictionary could only be used once, in order to avoid having multiple features in the small dictionary have their MCS come from the same feature in the large dictionary), though IIRC it didn't actually matter.
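Something like this, I believe (a sketch of my understanding of the procedure, not Logan's actual code; `small_dict` and `large_dict` here stand for arrays of feature directions, one row per feature):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_cosine_similarity(small_dict: np.ndarray, large_dict: np.ndarray,
                          one_to_one: bool = False) -> np.ndarray:
    """small_dict: (n_small, d_model), large_dict: (n_large, d_model)."""
    small = small_dict / np.linalg.norm(small_dict, axis=1, keepdims=True)
    large = large_dict / np.linalg.norm(large_dict, axis=1, keepdims=True)
    sims = small @ large.T  # (n_small, n_large) pairwise cosine similarities
    if not one_to_one:
        return sims.max(axis=1)  # best match in the large dictionary, reuse allowed
    # Optionally force each large-dictionary feature to be used at most once.
    rows, cols = linear_sum_assignment(-sims)  # maximize total matched similarity
    return sims[rows, cols]

# Histogramming these values per layer gives the kind of plots discussed here.
```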

Also I think the small and large dictionaries were trained using different methods as each other for layer 2, and this was on pythia-70m-deduped so layer 5 was the final layer immediately before unembedding (so naively I'd expect most of the "features" to just be "the output token will be the" or "the output token will be when" etc).

Edit: In terms of "how to interpret these graphs", they're histograms with the horizontal axis being bins of cosine similarity, and the vertical axis being how many small-dictionary features had a max cosine similarity with a large-dictionary feature within that bucket. So you can see at layer 3 it looks like somewhere around half of the small dictionary features had a cosine similarity of 0.96-1.0 with one of the large dictionary features, and almost all of them had a cosine similarity of at least 0.8 with the best large-dictionary feature.

Which I read as "large dictionaries find basically the same features as small ones, plus some new ones".

Bear in mind also that these were some fairly small dictionaries. I think these charts were generated with this notebook so I think smaller_dict was of size 2048 and larger_dict was size 4096 (with a residual width of 512, so 4x and 8x respectively). Anthropic went all the way to 256x residual width with their "Towards Monosemanticity" paper later that year, and the behavior might have changed at that scale.

Comment by faul_sname on Daniel Tan's Shortform · 2024-07-17T07:13:01.467Z · LW · GW

Found this graph on the old sparse_coding channel on the eleuther discord:

Logan Riggs: For MCS across dicts of different sizes (as a baseline that's better, but not as good as dicts of same size/diff init). Notably layer 5 is sucks. Also, layer 2 was trained differently than the others, but I don't have the hyperparams or amount of training data on hand. 

[image: MCS histograms by layer]

So at least tentatively that looks like "most features in a small SAE correspond one-to-one with features in a larger SAE trained on the activations of the same model on the same data".

Comment by faul_sname on Seeking feedback on a critique of the paperclip maximizer thought experiment · 2024-07-16T23:58:43.482Z · LW · GW

As a spaghetti behavior executor, I'm worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift, especially throughout a much longer life than presently feasible

As a fellow spaghetti behavior executor, replacing my entire motivational structure with a static goal slot feels like dying and handing off all of my resources to an entity that I don't have any particular reason to think will act in a way I would approve of in the long term.

Historically, I have found varying things rewarding at various stages of my life, and this has chiseled the paths in my cognition that make me me. I expect that in the future my experiences and decisions and how rewarded / regretful I feel about those decisions will continue to chisel my cognition in a way that changes what I care about, in the way that past-me endorsed current-me's experiences causing me to care about things (e.g. specific partners, offspring) that past-me did not care about.

I would not endorse freezing my values in place to prevent value drift in full generality. At most I endorse setting up contingencies so my values don't end up trapped in some specific places current-me does not endorse (e.g. "heroin addict").

so I'd like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.

So in this ontology, an agent is made up of a queryable world model and a goal slot. Improving the world model allows the agent to better predict the outcomes of its actions, and the goal slot determines which available action the agent would pick given its world model.

I see the case for improving the world model. But once I have that better world model, I don't see why I would additionally want to add an immutable goal slot that overrides my previous motivational structure. My understanding is that adding a privileged immutable goal slot would only change my behavior in those cases where I would otherwise have decided that achieving the goal that was placed in that slot was not a good idea on balance.

As a note, you could probably say something clever like "the thing you put in the goal slot should just be 'behave in the way you would if you had access to unlimited time to think and the best available world model'", but if we're going there then I contend that the rock I picked up has a goal slot filled with "behave exactly like this particular rock".

Comment by faul_sname on Seeking feedback on a critique of the paperclip maximizer thought experiment · 2024-07-15T22:57:51.647Z · LW · GW

Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).

Do you have a good reference article for why we should expect spaghetti behavior executors to become wrapper minds as they scale up?