An Analysis of the ‘Digital Gaia’ Proposal from a Safety Perspective 2023-05-31T12:21:39.390Z
Crunch Dynamics and Control Loss 2023-05-24T13:03:04.583Z
(Slightly) Scalable RLHF Alternatives: A Productive Path for Slow Takeoff Worlds? 2023-05-17T11:31:59.946Z
[Link] PaLM 2 Technical Report 2023-05-10T20:28:16.060Z
Archetypal Transfer Learning and a Corrigibility-Friendly Optimization Technique 2023-05-04T12:15:39.386Z
The Security Mindset, S-Risk and Publishing Prosaic Alignment Research 2023-04-22T14:36:50.535Z
Select Agent Specifications as Natural Abstractions 2023-04-07T23:16:02.803Z


Comment by marc/er on The AGI Race Between the US and China Doesn’t Exist. · 2023-06-03T05:50:29.834Z · LW · GW

While it's true that Chinese semiconductor fabs are a decade behind TSMC (and will probably remain so for some time), that doesn't seem to have stopped them from building 162 of the top 500 largest supercomputers in the world.

They did this (mostly) before the export regulations were instantiated. I'm not sure what the exact numbers are, but both of their supercomputers in the top 10 were constructed before October 2022 (when they were imposed). Also, I imagine that they still might have had a steady supply of cutting edge chips soon after the export regulations. It would make sense that they were not enacted immediately and also that exports that had already begun hadn't been ceased, but I have not verified that.

Comment by marc/er on The Divine Move Paradox & Thinking as a Species · 2023-05-31T22:00:33.574Z · LW · GW

Sure, this is an argument 'for AGI', but rarely do people (on this forum at least) reject the deployment of AGI because they feel discomfort in not fully comprehending the trajectory of their decisions. I'm sure that this is something most of us ponder and would acknowledge is not optimal, but if you asked the average LW user to list the reasons they were not for the deployment of AGI, I think that this would be quite low on the list.

Reasons higher on the list for me for example would be "literally everyone might die." In light of that; dismissing control loss as a worry seems quite miniscule. The reason people fear control loss is generally because losing control of something more intelligent than you with instrumental subgoals that if pursued would probably result in a bad outcome for you, but this doesn't change the fact that "we shouldn't fear not being in control for the above reasons" does not constitute sufficient reason to deploy AGI.

Also, although some of the analogies drawn here do have merit; I can't help but gesture toward the giant mass of tentacles and eyes you are applying them to. To make this more visceral, picture a literal Shoggoth descending from a plane of Eldlitch horror and claiming decision-making supremacy and human-aligned goals. Do you accept its rule because of its superior decision making supremacy and claimed human-aligned, or do you seek an alternative arrangement?

Comment by marc/er on Adumbrations on AGI from an outsider · 2023-05-25T10:39:00.679Z · LW · GW

Soft upvoted your reply, but have some objections. I will respond using the same numbering system you did such that point 1 in my reply will address point 1 of yours. 

  1. I agree with this in the context of short-term extinction (i.e. at or near the deployment of AGI), but would offer that an inability to remain competitive and loss of control is still likely to end in extinction, but in a less cinematic and instantaneous way. In accordance with this, the potential horizon for extinction-contributing outcomes is expanded massively. Although Yudkowsky is most renowned for hard takeoff, soft takeoff has a very differently shaped extinction-space and (I would assume) is a partial reason for his high doom estimate. Although I cannot know this for sure, I would imagine he has a >1% credence in soft takeoff. 'Problems with the outcome' seem highly likely to extend to extinction given time. 
  2. There are (probably) an infinite number of possible mesa-optimizers. I don't see any reason to assume an upper bound on potential mesa-optimization configurations, and yes; this is not a 'slam dunk' argument. Rather, as derived from the notion that even slightly imperfect outcomes can extend to extinction, I was suggesting that you are trying to search an infinite space for a quark that fell out of your pocket some unknown amount of time ago whilst you were exploring said space. This can be summed up as 'it is not probable that some mesa-optimizer selected by gradient descent will ensure a Good Outcome'.
  3. This still does not mean that the only form of brain hacking is via highly immersive virtual reality. I recall the Tweet that this comment came from, and I interpreted it as a highly extreme and difficult form of brain hacking used to prove a point (the point being that if ASI could accomplish this it could easily accomplish psychological manipulation). Eliezer's breaking out of the sandbox experiments circa 2010 (I believe?) are a good example of this.
  4. Alternatively you can claim some semi-arbitrary but lower extinction risk like 35%, but you can make the same objections to a more mild forecast like that. Why is assigning a 35% probability to an outcome more epistemologically valid than a >90% probability? Criticizing forecasts based on their magnitude seems difficult to justify in my opinion, and critiques should rely on argument only.
Comment by marc/er on Adumbrations on AGI from an outsider · 2023-05-25T06:34:22.283Z · LW · GW

The focus of the post is not on this fact (at least not in terms of the quantity of written material). I responded to the arguments made because they comprised most of the post, and I disagreed with them.

If the primary point of the post was "The presentation of AI x-risk ideas results in them being unconvincing to laypeople", then I could find reason in responding to this, but other than this general notion, I don't see anything in this post that expressly conveys why (excluding troubles with argumentative rigor, and the best way to respond to this I can think of is by refuting said arguments).

Comment by marc/er on Adumbrations on AGI from an outsider · 2023-05-25T00:44:16.209Z · LW · GW

I disagree with your objections.

"The first argument–paperclip maximizing–is coherent in that it treats the AGI’s goal as fixed and given by a human (Paperclip Corp, in this case). But if that’s true, alignment is trivial, because the human can just give it a more sensible goal, with some kind of “make as many paperclips as you can without decreasing any human’s existence or quality of life by their own lights”, or better yet something more complicated that gets us to a utopia before any paperclips are made"

This argument is essentially addressed by this post, and has many failure modes. For example, if you specify the superintelligence's goal as the example you gave, it's most optimal solution might be to cryopreserve the brain of every human in a secure location, and prevent any attempts an outside force could make at interfacing with them. You realize this, and so you specify something like "Make as many squiggles as possible whilst leaving humans in control of their future", and the intelligence is quite smart and quite general, so it can comprehend the notion of what you want when you say "we want control of our future", but then BayAreaAILab#928374 trains a superintelligence designed to produce squiggles without this limit and outcompetes the aligned intelligence, because humans are much less efficient than inscrutable matrices.

This is not even mentioning issues with inner alignment and mesa-optimizers. You start to address this with:

AGI-risk argument responds by saying, well, paperclip-maximizing is just a toy thought experiment for people to understand. In fact, the inscrutable matrices will be maximizing a reward function, and you have no idea what that actually is, it might be some mesa-optimizer

But I don't feel as though your referencing to Eliezer's Twitter loss drop fiasco and subsequent argument regarding GPU maximization successfully refutes claims regarding mesa-optimization. Even if GPU-maximizing mesa-optimization was intractable, what about the other potentially infinite number of possible mesa-optimizer configurations that result ?

You don’t know that human brains can be hacked using VR headsets; it has never been demonstrated that it’s possible and there are common sense reasons to think it’s not. The brain is an immensely complicated, poorly-understood organ. Applying a lot of computing power to that problem is very unlikely to yield total mastery of it by shining light in someone’s eyes

When Eliezer talks about 'brain hacking' I do not believe he means by dint of a virtual reality headset. Psychological manipulation is an incredibly powerful tool, and who else could manipulate humanity if not a superintelligence? Furthermore, if said intelligence models humans via simulating strategies, which that post argues is likely assuming large capabilities gaps between humanity and a hypothetical superintelligence. 

As I said before, I’m very confused about how you get to >90% chance of doom given the complexity of the systems we’re discussing

The analogy of "forecasting the temperature of the coffee in 5 minutes" VS "forecasting that if left the coffee will get cold at some point" seems relevant here. Without making claims about the intricacies of the future state of a complex system, you can make high-reliability inferences about their future trajectories in more general terms. This is how I see AI x-risk claims. If the claim was that there was a 90% chance that a superintelligence will render humanity extinct and it will have some architecture x I would agree with you, but feel as though Eliezer's forecast is general enough to be reliable. 

Comment by marc/er on Crunch Dynamics and Control Loss · 2023-05-24T22:09:32.937Z · LW · GW

Agreed. I will add a clarifying statement in the introduction.

Comment by marc/er on A rejection of the Orthogonality Thesis · 2023-05-24T22:07:15.726Z · LW · GW

So if the argument the OT proponents are making is that AI will not self-improve out of fear of jeopardising its commitment to its original goal, then the entire OT is moot, because AI will never risk self-improving at all.

This seems to me to apply only to self improvement that modifies the outcome of decision-making irrespective of time. How does this account for self improvement that only serves to make decision making more efficient? 

If I have some highly inefficient code that finds the sum of two integers by first breaking them up into 10000 smaller decimal values, randomly orders them and then adds them up in serial, and I rewrite the code to do the same thing but in way less ops, I have self improved without jeopardizing my goal.

This kind of self improvement can still be fatal in the context of deceptively aligned systems.

Comment by marc/er on Distillation of Neurotech and Alignment Workshop January 2023 · 2023-05-22T13:14:43.792Z · LW · GW

I agree with this post almost entirely and strong upvoted as a result. The fact that more effort has not been allocated to the neurotechnology approach already is not a good sign, and the contents of this post do ameliorate that situation in my head slightly. My one comment is that I disagree with this analysis of cyborgism:

Interestingly, Cyborgism appeared to diverge from the trends of the other approaches. Despite being consistent with the notion that less feasible technologies take longer to develop, it was not perceived to have a proportionate impact on AI alignment. Essentially, even though cyborgism might require substantial development time and be low in feasibility, its success wouldn’t necessarily yield a significant impact on AI alignment.

Central to the notion of cyborgism is an alternative prioritization of time. Whilst other approaches focus on deconfusing basic concepts central to their agenda or obtaining empirical groundings for their research, cyborgism opts to optimize for the efficiency of applied time during 'crunch time'. Perhaps the application of neurotechnology to cyborgism mightn't seem as obviously beneficial as say WBE relative to its feasibility, but cyborgism is composed of significantly more than just the acceleration of alignment via neurotechnology. I will attempt to make the case for why cyborgism might be the most feasible and valuable "meta-approach" to alignment and to the development of alignment-assisting neurotechnology.

Suitability to Commercialization

Cyborgism is inherently a commercializable agenda as it revolves around the production of tools for an incredibly cognitively-demanding task. Tools capable of accelerating alignment work are generally capable of lots of things. This makes cyborgist research suited to the for-profit structure, which has clear benefits for rapid development over alternative structures. This is invaluable in time-sensitive scenarios and elevates my credence in the high-feasibility of cyborgism.

Better Feedback Loops

Measuring progress in cyborgism is considerably more trivial than alternative approaches. Metrics like short-form surveys become an actual applicable metric for success, and proxies like "How much do you feel this tool has accelerated your alignment work" are useful sources of information that can be turned into quantifiable progress metrics. This post is an example of that. Furthermore, superior tools can not only accelerate alignment work but also tool development. As cyborgism has a much broader scope than just neurotechnology, you could differentially accelerate higher value neurotechnological or otherwise approaches with the appropriate toolset. It may be better to invest in constructing the tools necessary to perform rapid neurotechnology research at GPT-(N-1) than it is to establish foundational neurotechnology research now at a relatively lower efficiency.

Broad Application Scope

I find any statement like "cyborgism is/isn't feasible" to be difficult to support due mainly to the seemingly infinite possible incarnations of the agenda. Although the form of AI-assisted-alignment described in the initial cyborgism post is somewhat specific, other popular cyborgism writings describe more varied applications. It seems highly improbable that we will not see something remotely "cyborg-ish" and that some cyborgish acts will not affect existential risk posed by artificial intelligence, which makes it difficult from my perspective to make claims like that which I instantiated this paragraph with. The primary question to me seems to be more of the kind "how heavily do we lean into cyborgism?", or more practically "what percentage of resources do we allocate toward efficiency optimization as opposed to direct alignment/neurotechnology research?".

My personal preference is to treat cyborgism as more of a "meta-agenda" than as an agenda itself. Shifting toward this model of it impacted how I see its implications for other agendas quite significantly, and has increased my credence in its feasibility substantially. 

Also, as a side note; I think that the application of neurotechnology to cyborgism is quite non-obvious. "Use neurotechnology as a more efficient interface between tools and their human user" and "use invasive BCI technology to pursue the hardest form of cyborgism" are exceedingly different in nature, and as a result contribute to the difficulty of assessing the approach due in large to the reasons that drove me to classify it more as a meta-agenda.

Comment by marc/er on (Slightly) Scalable RLHF Alternatives: A Productive Path for Slow Takeoff Worlds? · 2023-05-18T08:35:23.039Z · LW · GW

Agreed and edited.

Comment by marc/er on AI Will Not Want to Self-Improve · 2023-05-17T03:29:27.003Z · LW · GW

I disagree with your framing of the post. I do not think that this is wishful thinking. 

The first and most obvious issue here is that an AI that "solves alignment" sufficiently well to not fear self-improvement is not the same as an AI that's actually aligned with humans. So there's actually no protection there at all.

It is not certain that upon deployment the first intelligence capable of RSI will be capable of solving alignment. Although this seems improbable in accordance with more classic takeoff scenarios (i.e. Yudkowsky's hard takeoff), the likelihood of those outcomes has been the subject of great debate. I feel as though someone could argue for the claim "it is more likely than not that there will be a period of time in which AI is capable of RSI but not of solving alignment". The arguments in this post seem to me quite compatible with e.g. Jacob Cannell's soft(er) takeoff model, or many of Paul Christiano's takeoff writings.

In fact, the phenomenon described here seems to make it more likely that an unaligned AI will be fine with self-improving, because the simpler the utility function the easier time it has guaranteeing the alignment of the improved version!

Even with your model of solving alignment before or at the same time as RSI becomes feasible, I do not think that this holds well. As far as I can tell, the simplicity of the utility function a general intelligence could be imbued with doesn't obviously impact the difficulty of alignment. My intuition is that attempting to align an intelligence with a utility function dependent on 100 desiderata is probably not that much easier than trying to align an intelligence with a utility function dependent on 1000. Sure, it is likely more difficult, but is utility function complexity realistically anywhere near as large a hurdle as say robust delegation?

Last, but far from least, self-improvement of the form "get faster and run on more processors" is hardly challenging from an alignment perspective. And it's far from unlikely an AI could find straightforward algorithmic improvements that it could mathematically prove safe relative to its own utility function.

This in my opinion is the strongest claim, and is in essence quite similar to this post, my response to which was "I question the probability of a glass-box transition of type "AGI RSIs toward non-DL architecture that results in it maximizing some utility function in a pre-DL manner" being more dangerous than simply "AGI RSIs". If behaving like an expected utility maximizer was optimal: would not AGI have done so without the architecture transition? If not, then you need to make the case for why glass-box architectures are better ways of building cognitive systems. I think that this argument is at odds with the universal learning hypothesis and seems more in-line with evolved modularity, which has a notoriously poor mapping to post-DL thinking. ULH seems to suggest that actually modular approaches might be inferior efficiency-wise to universal learning approaches, which contradicts the primary motive a general intelligence might have to RSI in the direction of a glass-box architecture."

In summary: Although it seems probable to me that algorithmic approaches are superior for some tasks, it seems to me that ULH would imply that the majority of tasks are best learned by a universal learning algorithm.

Comment by marc/er on Aggregating Utilities for Corrigible AI [Feedback Draft] · 2023-05-13T00:17:43.934Z · LW · GW
Comment by marc/er on AGI-Automated Interpretability is Suicide · 2023-05-10T22:10:13.052Z · LW · GW

Strong upvoted this post. I think the intuition is good and that architecture shifts invalidating anti-foom arguments derived from the nature of the DL paradigm is counter-evidence to those arguments, but simultaneously does not render them moot (i.e. I can still see soft takeoff as described by Jacob Cannell to be probable and assume he would be unlikely to update given the contents of this post).

I might try and present a more formal version of this argument later, but I still question the probability of a glass-box transition of type "AGI RSIs toward non-DL architecture that results in it maximizing some utility function in a pre-DL manner" being more dangerous than simply "AGI RSIs". If behaving like an expected utility maximizer was optimal: would not AGI have done so without the architecture transition? If not, then you need to make the case for why glass-box architectures are better ways of building cognitive systems. I think that this argument is at odds with the universal learning hypothesis and seems more in-line with evolved modularity, which has a notoriously poor mapping to post-DL thinking. ULH seems to suggest that actually modular approaches might be inferior efficiency-wise to universal learning approaches, which contradicts the primary motive a general intelligence might have to RSI in the direction of a glass-box architecture.

Comment by marc/er on [deleted post] 2023-05-07T22:00:30.111Z

I agree with some of this as a crticism of the idea, but not of the post. Firstly, I stated the same risk you did in the introduction of the post, hence the communication was "Here is an idea, but it has this caveat", and then the response begins with "but it has this caveat".

Next, if the 'bad outcome' scenario looks like most or all parties that receive the email ignoring it/not investigating further, then I see such an email as easily justifiable to send, as it is a low-intensity activity labour-wise with the potential to expand knowledge of x-risks posed by AI. Obviously this isn't the upper bound for the negativity of the outcome of sending an email like this, as you and I both pointed out a worse alternative, but as outlined in the post, I believe this can be accounted for via an iterative drafting process (like the one I am trying to initiate by sharing the draft above) and handling communications like this with care.

I agree with the assertion regarding consulting a PR professional (and I think this is another reason to delegate the task to a willing organization with access to resources like this).

As for the critique of the email itself, I agree that omitting that sentence improves the conciseness of the email, but that sequence of six words doesn't make the entire email contents not concise. Also, feedback like this was my primary motivation for sharing a draft.

I disagree with your final remark regarding email communication. Email is the single digital communication medium shared by almost every company and individual on the planet and is the medium most likely to be selected as an open contact method on the website of a journalist/media company. Furthermore, communication via email is highly scalable, which is a critical factor in mass communication. Sure, I cannot prove that email is undoutably the most superior form of communication for this task (and thus by definition the conjecture is subject to the criticism you posed), but I can (and hope I just did) make a case supporting my intuition. I'm not sure by what other means than my best intellectual effort and discussion with others you could want when justifying the use of a communication method. Are you looking for a formal proof?

Comment by marc/er on The Security Mindset, S-Risk and Publishing Prosaic Alignment Research · 2023-04-23T03:41:58.756Z · LW · GW

I agree with this sentiment in response to the question of "will this research impact capabilities more than it will alignment?", but not in response to the question of "will this research (if implemented) elevate s-risks?". Partial alignment inflating s-risk is something I am seriously worried about, and prosaic solutions especially could lead to a situation like this.

If your research not influencing s-risks negatively is dependent on it not being implemented, and you think that it your research is good enough to post about, don't you see the dilemma here?

Comment by marc/er on The Security Mindset, S-Risk and Publishing Prosaic Alignment Research · 2023-04-23T01:31:50.766Z · LW · GW

It's fine to make the mistake of publishing something if the mistake you made was assuming "this is great research", but if the mistake was "this is safe to publish because I'm new to research", the consequences can be irreversible. I probably fall into the category of 'wildly overthinking the harms of publishing due to inexperience', but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes and could quickly inform someone of whether or not they might want to show their research to someone more experienced before publishing.

I am personally having this dilemma. I have something I want to publish, but I'm unsure of whether I should listen to the voice telling me "you’re so new to this, this is not going to have any real impact anyway" or the voice that's telling me "if it does have some impact or was hypothetically implemented in a generally intelligent system this could reduce extinction risk but inflate s-risk". It was a difficult decision, but I decided I would rather show someone more experienced, which is what I am doing currently.  This post was intended to be a summary of why/how I converged upon that decision.

Comment by marc/er on A decade of lurking, a month of posting · 2023-04-09T12:13:53.495Z · LW · GW

Although I soft upvoted this post, there are some notions I'm uncomfortable with. 

What I agree with:

  • Longtime lurkers should post more
  • Less technical posts are pushing more technical posts out of the limelight
  • Posts that dispute the Yudkowskian alignment paradigm are more likely to contain incorrect information (not directly stated but heavily implied I believe, please correct me if I've misinterpreted)
  • Karma is not an indicator of correctness or of value

The third point is likely due to the fact that the Yudkowskian alignment paradigm isn't a particularly fun one. It is easy to dismiss great ideas for other great ideas when the other ideas promise lower x-risk. This applies in both directions however, as it's far easier to succumb to extreme views (I don't mean to use this term in a diminishing fashion) like "we are all going to absolutely die" or "this clever scheme will reduce our x-risk to 1%" and miss the antimeme hiding in plain sight. A perfect example of this is in my mind is the comment section of the Death with Dignity post. 

I worry that posts like this discourage content that does not align with the Yudkowskian paradigm, which are likely just as important as posts that conform to it. I don't find ideas like Shard Theory or their consequential positive reception alarming or disappointing, and on the contrary I find their presentation meaningful and valuable, regardless of whether or not they are correct (not meant to imply I think that Shard Theory is incorrect, it was merely an example). The alternative to posting potentially incorrect ideas (a category that encompasses most ideas) is to have them never scrutinized, improved upon or falsified. Furthermore, incorrect ideas and their falsification can still greatly enrich the field of alignment, and there is no reason why an incorrect interpretation of agency for example couldn't still produce valuable alignment insights. Whilst we likely cannot iterate upon aligning AGI, alignment ideas are an area in which iteration can be applied, and we would be fools not to apply such a powerful tool broadly. Ignoring the blunt argument of "maybe Yudkowsky is wrong", it seems evident that "non-Yudkowskian" ideas (even incorrect ones) should be a central component of LessWrong's published alignment research, this seems to me the most accelerated path toward being predictably wrong less often. 

To rephrase, is it the positive reception non-Yudkowskian ideas that alarm/disappoint you, or the positive reception of ideas you believe have a high likelihood of being incorrect (which happens to correlate positively with non-Yudkowskian ideas)? 

I assume your answer will be the latter, and if so then I don't think the correct point to press is whether or not ideas conform to views associated with a specific person, but rather ideas associated with falsity. Let me know what you think, as I share most of your concerns. 

Comment by marc/er on Select Agent Specifications as Natural Abstractions · 2023-04-08T09:07:56.376Z · LW · GW

Thank you for the feedback, I have repaired the post introduction in accordance your commentary on utility functions. I challenge the assumption that a system not being able to reliably simulate an agent with human specifications is worrying, and I would like to make clear that the agenda I am pushing is not:

  1. Capabilities and understanding through simulation scale proportionately
  2. More capable systems can simulate, and therefore comprehend the goals of other systems to a greater extent
  3. By dint of some unknown means we align AGI to this deep understanding of our goals

I agree that in the context of a plan like this a failure to establish robust abstractions of human values could be catastrophic, but when applied to a scenario like trying to align a powerful LLM, being able to estimate and interpret even incorrect abstractions could be vitally important. This could look like:

  1. Estimate the LLMs abstraction of say "what do humans want me to refuse to reply to"
  2. Compare this to some desired abstraction
  3. Apply some technique like RLHF accordingly

Of course an actual implementation probably wouldn't look like that, ("what do humans want me to refuse to reply to" isn't necessarily one unified concept that can be easily abstracted and interpreted) but it is a high level overview of why pursuing questions like "do some specifications abstract well?" could still be useful even if they do not.

I hadn't come across the relative abstracted agency post, but I find its insights incredibly useful. Over the next few days I will update this post to include its terminology. I find it likely that testing whether or not some specifications abstract well would provide useful information as to how targets are modeled as agents, but the usefulness of being able to test this in existing systems depends strongly on how the current LLM paradigm scales as we approach superintelligence. Regardless I'm sure any indication as to how targets are modeled as agents could be valuable, even in systems incapable of scaling to superintelligence.