There are various technologies that might let you make many more egg cells than are possible to retrieve from an IVF cycle. For example, you might be able to mature oocytes from an ovarian biopsy, or you might be able to turn skin cells into eggs.
Copying over Eliezer's top 3 most important projects from a tweet:
1. Avert all creation of superintelligence in the near and medium term.
2. Augment adult human intelligence.
3. Build superbabies.
Thanks. Fixed.
Looks like the base url is supposed to be niplav.site. I'll change that now (FYI @niplav)
I think TLW's criticism is important, and I don't think your responses are sufficient. I also think the original example is confusing; I've met several people who, after reading OP, seemed to me confused about how engineers could use the concept of mutual information.
Here is my attempt to expand your argument.
We're trying to design some secure electronic equipment. We want the internal state and some of the outputs to be secret. Maybe we want all of the outputs to be secret, but we've given up on that (for example, radio shielding might not be practical or reliable enough). When we're trying to design things so that the internal state and outputs are secret, there are a couple of sources of failure.
One source of failure is failing to model the interactions between the components of our systems. Maybe there is an output we don't know about (like the vibrations the electronics make while operating), or maybe there is an interaction we're not aware of (like magnetic coupling between two components we're treating as independent).
Another source of failure is that we failed to consider all the ways that an adversary could exploit the interactions we do know about. In your example, we fail to consider how an adversary could exploit higher-order correlations between emitted radio waves and the state of the electronic internals.
A true name, in principle, allows us to avoid the second kind of failure. In high-dimensional state spaces, we might need to get kind of clever to prove the lack of mutual information. But it's a fairly delimited analytic problem, and we at least know what a good answer would look like.
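To make that a bit more concrete, here's a minimal sketch of what an empirical check could look like, assuming we can sample paired observations of the secret internal value and the emission we're worried about. The estimator and all names here are my own illustration, not anything from the OP:

```python
import numpy as np

def empirical_mutual_information(secret, emission, bins=32):
    """Crude histogram estimate of I(secret; emission) in bits.

    A near-zero estimate over lots of samples is (weak) evidence that
    this particular emission leaks nothing about the secret.
    """
    joint, _, _ = np.histogram2d(secret, emission, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

rng = np.random.default_rng(0)
secret = rng.integers(0, 256, size=50_000)
noise_only = rng.normal(size=secret.size)                  # emission unrelated to the secret
leaky = (secret % 2) + 0.1 * rng.normal(size=secret.size)  # emission leaking the low bit
print(empirical_mutual_information(secret, noise_only))    # close to 0 bits
print(empirical_mutual_information(secret, leaky))         # close to 1 bit
```

Of course, a sampled estimate like this is only a sanity check; the kind of answer the true name points at is a proof, given assumptions, that the mutual information is zero over the whole input distribution.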
The true name could also guide our investigations into our system, to help us avoid the first kind of failure. "Huh, we just made the adder have a more complicated behaviour as an optimisation. Could the unevenness of that optimisation over the input distribution leak information about the adder's inputs to another part of the system?"
Now, reader, you might worry that the chosen example of a True Name leaves an implementation gap wide enough for a human adversary to drive an exploit through. And I think that's a pretty good complaint. The best defence I can muster is that it guides and organises the defender's thinking. You get to do proofs-given-assumptions, and you get more clarity about how to think if your assumptions are wrong.
To the extent that the idea is that True Names are part of a strategy to come up with approaches that are unbounded-optimisation-proof, I think that defence doesn't work and the strategy is kind of sunk.
On the other hand, here is an argument that I find plausible. In the end, we've got to make some argument that when we flick some switch or continue down some road, things will be OK. And there's a big messy space of considerations to navigate to that end. True Names are necessary to have any hope of compressing the domain enough that you can make arguments that stand up.
With LLMs, we might be able to aggregate more qualitative anonymous feedback.
The general rule is roughly "if you write a frontpage post which has an announcement at the end, that can be frontpaged". So for example, if you wrote a post about the vision for Online Learning, that included as a relatively small part the course announcement, that would probably work.
By the way, posts are all personal until mods process them, usually around twice a day. So that's another reason you might sometimes see posts landing on personal for a while.
Mod note: this post is personal rather than frontpage because event/course/workshop/org... announcements are generally personal, even if the content of the course, say, is pretty clearly relevant to the frontpage (as in this case)
I believe it includes some older donations:
- Our Manifund application's donations, including donations going back to mid-May, totalling about $50k
- A couple of older individual donations, in October/early November, totalling almost $200k
Mod note: I've put this on Personal rather than Frontpage. I imagine the content of these talks will be frontpage content, but event announcements in general are not.
neural networks routinely generalize to goals that are totally different from what the trainers wanted
I think this is slightly a non sequitur. I take Tom to be saying "AIs will care about stuff that is natural to express in human concept-language" and your evidence to be primarily about "AIs will care about what we tell them to", though I could imagine there being some overflow evidence into Tom's proposition.
I do think the limited success of interpretability is an example of evidence against Tom's proposition. For example, I think there's lots of work where you try and replace an SAE feature or a neuron (R) with some other module that's trying to do our natural language explanation of what R was doing, and that doesn't work.
I dug up my old notes on this book review. Here they are:
So, I've just spent some time going through the World Bank documents on its interventions in Lesotho. The Anti-Politics Machine is not doing great on epistemic checking:
- There is no recorded Thaba-Tseka Development Project, despite the period in which it should have taken place being covered
- There is a Thaba-Bosiu development project (parts 1 and 2) taking place at the correct time.
- Thaba-Bosiu and Thaba-Tseka are both regions of Lesotho
- The spec doc for Thaba-Bosiu Part 2 references the alleged problems the economists were faced with (remittances from South African miners, poor crop yield ... no complaint about cows)
- It has a negative assessment doc at the end. It was an unsuccessful project. This would match
- The funding doesn't quite match up. The UK is mentioned as funding the "Thaba-Tseka" project, and is indeed funding Thaba-Bosiu. But Canada is I believe funding a road project instead
- Something like 2/3 of the country is involved in Thaba-Bosiu Development II (it was later renamed the "Basic Agricultural Services Program")
- There is no mention of ponies or wood involved in interventions anywhere. In fact, the part II retrospective includes the lack of focus on livestock as a problem (suggesting they didn't do much of it)
- They were focused on five major crops (maize, sorghum, beans, peas and wheat)
- Also the quote in the book review of the quote in The Anti-Politics Machine of the quote in the report doesn't show up in any of the documents I looked at (which basically covered every project in Lesotho by the World Bank in that time period). The writing style of the quote is also moderately distinct from that of the reports
- AFAICT, the main intervention was fertiliser. The retrospective claims this failed because (a) the climate in Lesotho is uniquely bad and screened off fertilisation and (b) the Lesotho government fucked up messaging and also every other part of everything all the time and ultimately all the donors backed out.
- The government really wanted to be self-sufficient in food production. None of the donors, the farmers or the world bank cared about this but the government focused its messaging heavily around this. The government ended up directing a lot of its efforts towards a new Food Self-Sufficiency Program which was seen as incompatible with the goals of Basic Agricultural Services Program.
- The fact that the crop situation wasn't working was recognised fairly early on. They started on an adaptive trial of crop research to figure out what would work better. This was hampered by donor coordination problems, so it only happened in a small area, but apparently worked quite well
All-in-all, sounds less bad than the Anti-Politics Machine makes it out to be, and also just generally very different? I'm not 100% certain I've managed to locate all the relevant programs though, so it's possible something closer to the book's description did happen
I think 2023 was perhaps the peak for discussing the idea that neural networks have surprisingly simple representations of human concepts. This was the year of Steering GPT-2-XL by adding an activation vector, cheese vectors, the slightly weird lie detection paper and was just after Contrast-consistent search.
This is a pretty exciting idea, because if it’s easy to find human concepts we want (or don’t want) networks to possess, then we can maybe use that to increase the chance that systems are honest, kind, and loving (and we can ask them questions like “are you deceiving me?” and get useful answers).
I don’t think the idea is now definitively refuted or anything, but I do think a particular kind of lazy version of the idea, more popular in the Zeitgeist, perhaps, than amongst actual proponents, has fallen out of favour.
CCS seemed to imply an additional proposition, which is that you can get even more precise identification of human concepts by encoding some properties of the concept you’re looking for into the loss function. I was kind of excited about this, because things in this realm are pretty powerful tools for specifying what you care about (like, it rhymes with axiom-based definition or property-based testing).
But actually, if you look at the numbers they report, that’s not really true! As this post points out, basically all their performance is recoverable by doing PCA on contrast pairs.[1]
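For concreteness, here's a rough sketch of that PCA baseline as I understand it; the activation arrays and shapes below are stand-ins I made up, not code from the post or the paper:

```python
import numpy as np

def pca_contrast_direction(acts_pos, acts_neg):
    """Top principal component of centred contrast-pair differences.

    acts_pos[i], acts_neg[i] are hidden activations for the two halves of a
    contrast pair (e.g. a statement with "true" vs "false" appended), shape (n, d).
    Projecting onto the returned unit vector gives a truth-like score,
    with no CCS-style loss function involved.
    """
    diffs = acts_pos - acts_neg
    diffs = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # first right-singular vector = first principal component

# Usage sketch with random stand-in data:
rng = np.random.default_rng(0)
acts_pos, acts_neg = rng.normal(size=(2, 1000, 64))
direction = pca_contrast_direction(acts_pos, acts_neg)
scores = (acts_pos - acts_neg) @ direction  # classify by the sign of the score
```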
I like how focused and concise this post is, while still being reasonably complete.
There’s another important line of criticism of CCS, which is about whether its “truth-like vector” is at all likely to track truth, rather than just something like “what a human would believe”. I think posts like What Discovering Latent Knowledge Did and Did Not Find address this somewhat more directly than this one.
But I think, for me, the loss function had some mystique. Most of my hope was that encoding properties of truth into the loss function would help us find robust measures of what a model thought was true. So I think this post was the main one that made me less excited about both CCS and take a bit more of a nuanced view about the linearity of human concept representations.
- ^
Though I admit I’m a little confused about how to think about the fact that PCA happens to have pretty similar structure to the CCS loss. Maybe for features that have less confidence/consistency-shaped properties, shaping the loss function would be more important.
I'm not sure I understand what you're driving at, but as far as I do, here's a response: I have lots of concepts and abstractions over the physical world (like chair). I don't have many concepts or abstractions over strings of language, apart from as factored through the physical world. (I have some, like register or language, but they don't actually feel that "final" as concepts).
As far as factoring my predictions of language through the physical world, a lot of the simplest and most robust concepts I have are just nouns, so they're already represented by tokenisation machinery, and I can't do interesting interp to pick them out.
That sounds less messy than the path from the 3D physical world to tokens (and more messy than the path from human concepts to tokens)
quality of tasks completed
quantity?
Just a message to confirm: Zac's leg of the trade has been executed for $810. Thanks Lucie for those $810!
This doesn't play very well with fractional Kelly, though.
I do feel like it would be good to start with a more optimistic prior on new posts. Over the last year, the mean post karma was a little over 13, and the median was 5.
This seems unlikely to satisfy linearity, as A/B + C/D is not equal to (A+C)/(B+D) (for example, 1/2 + 1/2 = 1, but (1+1)/(2+2) = 1/2)
I don't feel particularly uncertain. This EA Forum comment and its parents inform my view quite a bit.
Maybe sometimes a team will die in the dungeon?
<details>blah blah</details>
So I did some super dumb modelling.
I was like: let's assume that there aren't interaction effects between the encounters either in the difficulty along a path or in the tendency to co-occur. And let's assume position doesn't matter. Let's also assume that the adventurers choose the minimally difficult path, only moving across room edges.
To estimate the value of an encounter, let's look at how the dungeons where it occurs in one of the two unavoidable locations (1 and 9) differ on average from the overall average.
Assuming ChatGPT did all the implementation correctly, this predictor never overestimates the score by much, though it frequently, and sometimes egregiously, underestimates it.
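Since I don't have that implementation to hand, here's a rough sketch of what the estimator and pathing rule could look like; the data format is made up, and restricting to forward-only (right/down) paths is my own simplification:

```python
import numpy as np

# Hypothetical format: each dungeon is {"rooms": 9-tuple of encounter codes
# (grid positions 1-9, row-major), "score": observed score}.

def estimate_encounter_values(dungeons):
    """Value of an encounter = mean score of dungeons that have it in an
    unavoidable room (positions 1 and 9) minus the overall mean score."""
    overall = np.mean([d["score"] for d in dungeons])
    values = {}
    for enc in {e for d in dungeons for e in d["rooms"]}:
        scores = [d["score"] for d in dungeons
                  if enc in (d["rooms"][0], d["rooms"][8])]
        values[enc] = np.mean(scores) - overall if scores else 0.0
    return values

# The six right/down paths from the entrance (room 1) to the exit (room 9).
MONOTONE_PATHS = [
    (0, 1, 2, 5, 8), (0, 1, 4, 5, 8), (0, 1, 4, 7, 8),
    (0, 3, 4, 5, 8), (0, 3, 4, 7, 8), (0, 3, 6, 7, 8),
]

def predicted_score(rooms, values):
    """Score of the least difficult path, with no interaction or position effects."""
    return min(sum(values.get(rooms[i], 0.0) for i in path)
               for path in MONOTONE_PATHS)
```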
Anyway, using this model and this pathing assumption, we have DBN/OWH/NOC
We skip the goblins and put our fairly rubbish trap in the middle to stop adventurers picking and choosing which parts of the outside paths they take. The optimal path for the adventurers is DONOC, which has a predicted score of 30.29, which ChatGPT tells me is ~95th percentile.
I'd love to come at this with saner modelling (especially of adventurer behaviour), but I somewhat doubt I will.
I'm guessing encounter 4 (rather than encounter 6) follows encounter 3?
You can simulate a future by short-selling the underlying security and buying a bond with the revenue. You can simulate short-selling the same future by borrowing money (selling a bond) and using the money to buy the underlying security.
I think these are backwards. At the end of your simulated future, you end up with one less of the stock, but you have k extra cash. At the end of your simulated short sell, you end up with one extra of the stock and k less cash.
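To spell out the bookkeeping, here's a toy sketch (my own construction, assuming a zero-coupon bond at rate r and writing F = S0*(1+r) for the no-arbitrage forward price):

```python
def simulated_long_future(S0, ST, r):
    """'Short-sell the underlying and buy a bond with the revenue.'
    Terminal position: -1 share (worth ST) and S0*(1+r) in cash."""
    return S0 * (1 + r) - ST   # = F - ST: actually the payoff of a SHORT future

def simulated_short_future(S0, ST, r):
    """'Borrow money (sell a bond) and buy the underlying.'
    Terminal position: +1 share and -S0*(1+r) in cash."""
    return ST - S0 * (1 + r)   # = ST - F: actually the payoff of a LONG future

S0, r = 100.0, 0.05
F = S0 * (1 + r)
for ST in (90.0, 105.0, 120.0):
    # The quoted recipe for "a future" pays off like a short future, and vice versa.
    print(ST, simulated_long_future(S0, ST, r), simulated_short_future(S0, ST, r), ST - F)
```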
A neat stylised fact, if it's true. It would be cool to see people checking it in more domains.
I appreciate that Ege included examples, a theory, and predictions of the theory. I think there's lots of room for criticism of this model, which it would be cool to see tried. In particular, as far as I understand the formalism, it doesn't seem like it is obviously discussing the costs of the investments, as opposed to their returns.
But I still like this as a rule of thumb (open to revision).
I still think this post is cool. Ultimately, I don't think the evidence presented here bears that strongly on the underlying question: "can humans get AIs to do their alignment homework?". But I do think it bears on it somewhat, and it was conducted quickly and competently.
I would like to live in a world where lots of people gather lots of weak pieces of evidence on important questions.
Yep, if the first vote takes the score to ≤ 0, then the post will be dropped off the latest list. This is somewhat ameliorated by:
(a) a fair number of people browsing https://lesswrong.com/allPosts
(b) https://greaterwrong.com having chronological sort by default
(c) posts appearing in recent discussion in the order that they're posted (though I do wonder if we filter out negative karma posts from recent discussion)
I often play around with different karma / sorting mechanisms, and I do think it would be nice to have a more Bayesian approach that started with a stronger prior. My guess is the effect you're talking about isn't a big issue in practice, though probably worth a bit of my time to sample some negative karma posts.
I had a quick look in the database, and you do have some tag filters set, which could cause the behaviour you describe
- Because it's a number and a vector, you're unlikely to see anyone (other than programmers) trying to use i as a variable.
I think it's quite common to use i as an index variable (for example, in a sum).
(edit: whoops, I see several people have mentioned this)
In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.
I would contribute to a bounty for y'all to do this. I would like to know whether the slow progress is prompting-induced or not.
Click on the gear icon next to the feed selector
A quick question re: your list: do you have any tag filters set?
I think "unacceptable reputational costs" here basically means the conjunction of "Dustin doesn't like the work" and "incurs reputational costs for Dustin". Because of the first conjunct, I don't think this suggestion would help Lightcone, sadly.
The "latest" tab works via the hacker news algorithm. Ruby has a footnote about it here. I think we set the "starting age" to 2 hours, and the power for the decay rate to 1.15.
Mod note: this post used to say "LessWrong doesn't seem to support the <details> element, otherwise I would put this code block in it". We do now support it, so I've edited the post to put the code block in such an element.
Robin Hanson is one of the intellectual fathers of LessWrong, and I'm very glad there's a curated, organised list of some of his main themes.
He's the first thinker I remember reading and thinking "what? that's completely wrong", who went on to have a big influence on my thought. Apparently I'm not an isolated case (paragraph 3 page 94).
Thanks, Arjun and Richard.
37bvhXnjRz4hipURrq2EMAXN2w6xproa9T
I've updated the post with it.
FTX did successfully retrieve the $1M from the title company! We didn't have any control over those funds, so I don't think we were involved apart from pointing FTX in the right direction.
Habryka means we would have to pick one number per Stripe link (e.g. one link for $5/month, one for $100/month, etc.)
Are you checking the box for “Save my info for 1-click checkout with Link”? That’s the only way I’ve figured out to get Stripe to ask for my phone number. If so, you can safely uncheck that.
(Also, I don’t know if it’s important to you, but I don’t think we would see your phone number if you gave it to Stripe.)
What do you mean by A?
Habryka is slightly sloppily referring to using Janus' 'base model jailbreak' for Claude 3.5 Sonnet
as I understand it, the majority of this money will go towards supporting Lighthaven
I think if you take Habryka's numbers at face value, a hair under half of the money this year will go to Lighthaven (35% of core staff salaries at $1.4M = $0.49M, plus $1M for a deferred interest payment, and then the claim that otherwise Lighthaven is breaking even). And in future years, well less than half.
I worry that the future of LW will be endangered by the financial burden of Lighthaven
I think this is a reasonable worry, but I again want to note that Habryka is projecting a neutral or positive cashflow from Lighthaven to the org.
That said, I can think of a couple of reasons for financial pessimism[1]. Having Lighthaven is riskier. It involves a bunch of hard-to-avoid costs. So, if Lighthaven has a bad year, that does indeed endanger the project as a whole.
Another reason to be worried: Lightcone might stop trying to make Lighthaven break even. Lightcone is currently fairly focused on using Lighthaven in revenue-producing ways. My guess is that we'll always try and structure stuff at Lighthaven such that it pays its own way (for example, when we ran LessOnline we sold tickets[2]). But maybe not! Maybe Lightcone will pivot Lighthaven to a loss-making plan, because it foresees greater altruistic benefit (and expects to be able to fundraise to cover it).
So the bundling of the two projects still leaks some risk.
Of course, you might also think Lighthaven makes LessWrong more financially robust, if on the mainline it ends up producing a modest profit that can be used to subsidise LessWrong.
- ^
Other than just doubting Habryka's projections, which also might make sense.
- ^
My understanding of the numbers is that we lost money once you take into account staff time, but broke even if you don't. And it seems the people most involved with running it are hopeful about cutting a bunch of costs in future.
I worry that cos this hasn't received a reply in a bit, people might think it's not in the spirit of the post. I'm even more worried people might think that critical comments aren't in the spirit of the post.
Both critical comments and high-effort-demanding questions are in the spirit of the post, IMO! But the latter might take a while to get a response.
The EIN is 92-0861538