Machine learning could be fundamentally unexplainable

post by George3d6 · 2020-12-16T13:32:36.105Z · LW · GW · 15 comments

This is a link post for https://cerebralab.com/Machine_learning_could_be_fundamentally_unexplainable

Contents

  I - Unexplainable due to complexity
  II — Some problems are boring
  III — Explainable to me but not to thee
  III.2 — Inaccessible truth and explainable lies
  IV — I digress
15 comments

I’m going to consider a fairly unpopular idea: most efforts towards “explainable AI” are essentially pointless. Useful as an academic pursuit and topic for philosophical debate, but not much else.

Consider this article a generator of interesting intuitions and viewpoints, rather than an authoritative take-down of explainability techniques.

That disclaimer aside:


What if almost every problem for which it is desirable to use machine learning is unexplainable?

At least unexplainable in an efficient-enough way to be worth explaining. Whether it is an algorithm or a human that is doing the explanation.

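Let’s define “explainable AI” in a semi-narrow sense, inspired by the DARPA definition, as an inference system that can answer the questions:

  1. Why was this prediction made, as a function of the inputs and their interactions?
  2. How would the inputs have to differ for the prediction to change?
  3. How confident is the system in this prediction, and why?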

Why might we be unable to answer the above questions in a satisfactory manner for most machine learning algorithms? I think I can name four chief reasons:

  1. Some problems are just too complex to explain. Often enough, these are perfect problems for machine learning; it’s exactly their intractability to our brains that makes them ideal for equation-generating algorithms to solve.
  2. Some problems, while not that complex, are really boring and no human wants or should be forced to understand them.
  3. Some problems can be understood, but understanding in itself is different for every single one of us, and people’s culture and background often influence what “understanding” means. So explainable for one person is not explainable for another.
  4. Even given an explanation that everyone agrees on, this usually puts us no closer to most of what we want to achieve with said explanation, things like gathering better data or removing “biases” from our models.

I - Unexplainable due to complexity

Let’s say physicists take in 100 petabytes of experimental data, reduce it using equations, and claim with high probability that there exists this thing called a “Higgs Boson”, with implications for how gravity works, among other things.

The resulting Boson can probably be defined within a few pages of text via things such as mass, the things it decays into, its charge, its spin, the various interactions it can have with other particles, and so on.

But if a luddite like myself asks the physicists:

Why did you predict this fundamental particle exists?

I will get a “press conference answer”, which carries no meaning beyond providing a “satisfying” feeling and doesn’t answer any of the above questions.

It doesn’t tell me why the data shows the existence of the Higgs Boson, it doesn’t tell me how the data could have been different in order for this not to be the case, and it doesn’t tell me how confident they are in this inference and why.

If I press for an answer that roughly satisfies the explainability criteria I mentioned above, I will at best get them to say:

Look, the standard model is a fairly advanced concept in physics, so you first have to understand that and why it came to be. Then you have to understand the experimental statistics needed to interpret the kind of data we work with here. In the process, you’ll obviously learn quantum mechanics, but to understand the significance of the Higgs boson specifically it’s very important that you have an amazing grasp of general relativity, since part of the reason we defined it as is, and why it’s so relevant, is that it might be a unifying link between the two theories. Depending on how smart you are, this might take 6 to 20 years to wrap your head around; really, you won’t even be the same person by the time you’re done with this. And oh, once you get your Ph.D. and work with us for half a decade, there’s a chance you’ll disagree with our statistics and our model and might think that we are wrong, which is fine, but in that case, you will find the explanation unsatisfactory.

We are fine with this, since physics is bound to be complex. It earns its keep by being useful and making predictions about very specific things with very tight error margins, and it’s fundamental to all other areas of scientific inquiry.

When we say that we “understand” physics, what we really mean is that there are a few tens of thousands of blokes who spent half their lives turning their brains into hyper-optimized physics-thinking machines, and they assure us that they “understand” it.

For the rest of us, the edges of physics are a black box. I know physics works because Nvidia sells me GPUs with more VRAM each year and I’m able to watch videos of nuclear reactors glowing on YouTube while patients in the nearby oncology ward are getting treated with radiation therapy.

This is true for many complex areas: we “understand” them because a few specialists say they do, and the knowledge that trickles down from those specialists has results that are obvious to all. Or, more realistically, because a dozen-domain-long chain of specialists, each relying on the others, is able to produce results that are obvious to all.

As long as there is a group of specialists that understands the field, as long as those specialists can prove to us that their discoveries can affect the real world (thus excluding groups of well-synchronized insane people), and as long as they can teach other people to understand the field… we claim that it’s “understood”.


But what about a credit risk analysis “AI” making a prediction that we should loan Steve at most 14,200$?

The model making this prediction might be operating with TBs worth of data about Steve, his browsing history, his transaction history, his music preferences, a video of him walking into the bank… each time he walked into the bank for the last 20 years, various things data aggregators tell us about him, from his preference about clothing to the likelihood he wants to buy an SUV, and of course, the actual stated purpose Steve gave us for the credit, both in text and as a video recording.

Not only that, but the “AI” has been trained on previous data from millions of people similar to Steve and the outcomes of the loans handed to them, thus working with petabytes of data in order to draw the 1-line conclusion of “You should loan Steve, at most, 14,200$, if you want to probabilistically make a profit”.

If we ask the AI:

Why is the maximum loan 14,200$? How did the various inputs and their interactions contribute to coming up with this number?

Well, the real answer is probably something like:

Look, I can explain this to you, but 314,667,344,401 parameters had a significant role in coming up with this number, and if you want to “truly” understand that then you’d have to understand my other 696,333,744,001 parameters and the ways they relate to each other in the equation. In order to do this, you have to gain an understanding of human-gait analysis as well as how its progress over time relates to life-satisfaction, credit history analysis, shopping preference analysis, error theory behind the certainty of said shopping preferences, and about 100 other mini-models that end up coalescing into the broader model that gave this prediction. And the way they “coalesce” is even more complex than any of the individual models. You can probably do this given 10 or 20 years, but basically, you’d have to re-train your brain from scratch to be like an automated risk analyst, you’d only be able to explain this to another automated risk analyst, and the “you” “understanding” my decision will be quite different from the “you” that is currently asking.

And even the above is an optimistic take, assuming the “AI” is made of multiple modules that are somewhat explainable.

So, is the “AI” unexplainable here?

Well, not more so than the physicists are. Both of them can, in theory, explain the reasoning behind their choice. But in both cases, the reasoning is not simple and there’s no single data point that is crucial. If even a few inputs were to change slightly, the outcome might be completely different, but the input space is so vast that it’s impossible to reason about all significant changes to it.

This is just the way things are in physics and it might be just the way things are in credit risk analysis. After all, there’s no fundamental rule of the universe saying it should be easy to comprehend by the human mind. The reason this is more obvious in physics is simply that physicists have been gathering loads of data for a long time. But it might be equally true in all other fields of inquiry; based on current models, it probably is. It’s just that those other fields had neither enough data nor the intelligence required to grok it until recently.

II — Some problems are boring

There is a class of problems that is complex, but not so complex as to be impenetrable to the vast majority of human minds.

To harken back to the physics example, think classical mechanics. Given the observations made by Galileo and some training in analysis, most of us could, in principle, understand classical mechanics.

But this is still difficult. It requires a lot of background knowledge (albeit fairly common and useful background knowledge) and a significant amount of time, ranging from, say, a day to several months depending on the person.

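This is time well spent learning classical mechanics, but what if the problem domain was something else, say:

  1. Figuring out if a blotch on a dental CT scan is more likely to indicate a streptococcus or a lactobacillus infection.
  2. Understanding what makes an image used to advertise a hiking pole attractive to middle-class Slovenians over the age of 54.
  3. Figuring out, using l2 data, if the spread for the price of soybean oil is too wide, and whether the bias is towards the sell or buy.
  4. Finding the optimal price at which to pre-sell a new brand of luxury sparkling water based on yet uncertain bottling, transport, and branding cost.
  5. Figuring out if a credit card transaction is likely to be fraudulent based on the customer’s previous buying pattern.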

These are the kind of problems one might well use machine learning for, but they are also the kind of problems that, arguably, could lie well within the realm of human understanding.

The problem is not that they are really hard, they are just really **** boring. I can see the appeal of spending 20 years of your life training to better understand the fundamental laws of reality or the engines behind biological life. But who in their right mind wants to spend weeks or months studying sparkling water supply chains? Or learning how to observe subtle differences in shadow on a CT scan?

Yet, for all of these problems, we run into a similar issue as in section I: either we have a human specialist, or the decision of the algorithm we trained will not be explainable to anyone.

… Hopefully, you get the gist of it

III — Explainable to me but not to thee

This leads us to the third problem, which is who exactly are the understanding-capable agents the algorithms must explain themselves to.

Take as an example an epidemiological-psychology-generating algorithm that tries to find insight into the fundamentals of human nature by giving a few hundred people questionnaires on MTurk. After fine-tuning itself for a while, it finally manages to consistently produce interesting findings, one of which is something like:

People that like Japanese culture are likely to be introverts.

When asked to “explain” this finding it may come up with an explanation like:

Based on a 2-question survey we found that participants who enjoy the smell of natto are much more likely to paint a lot. Furthermore, there is a strong correlation between natto-enjoyment and affinity for Japanese culture[2,14], and between painting and introversion[3,45]. Thus we draw a tentative conclusion that introverts are likely drawn to Japanese culture (p~=0.003, n=311).

This requires only the obvious assumptions that the relation between our results and the null-hypothesis can be numerically modeled into a sticky-chewing-gum distribution and the God-ordained truth that human behavior has precisely 21 degrees of freedom (all of which we have controlled for). It also requires the validity of 26 other studies based on which our references depend, but for the sake of convention, we won’t consider the p-values of those findings when computing ours.

Replication and lab studies are required to confirm the finding, this is a preliminary paper meant only to be used as the core source material for front-page articles by The Independent, The NY Times, Vice, and other major media outlets.

Jokes aside, I could see an algorithm being designed to generate questionnaire-based studies… I’m not saying I have designed one that looks promising, or that I’m looking for an academic fatalistic enough to risk his career for the sake of a practical joke (see my email in the header). But in principle, I think this is doable.

I also think that something like the explanation above (again, a bit of humor aside), would fly as the explanation for why the algorithm’s finding is true. After all, that’s basically the same explanation a human researcher would give.

A similar reference and statistical significance based explanation could feasibly be given as to why the algorithm converged on the questions and sample sizes it ended up with.

But we could get widely different reactions to that explanation:

In other words, even within cultural and geographic proximity, depending on the person a decision is explained to, an explanation might be satisfactory or unsatisfactory, might make or not make sense, and might prove the conclusion is true or the opposite.

And while the above example is tongue-in-cheek, this is very much the case when it comes to actual scientific findings.

One can define an anti-scientific world view, quite popular among religious people and philosophers, which either entirely denies the homogeneity needed for science to hold true, or deems scientific reductionism as too limited to provide knowledge regarding most objects and topics worth caring about epistemically. Arguably, every single religious person falls into this category at least a tiny bit, in that they disagree with falsifiability in a specific context (i.e. the existence of some supernatural entities or principles that can’t be falsified) and even if they agree with homogeneity (which in turn allows scientific reductionism) in most scenarios, they believe edge cases exist (miracles, souls… etc).

To go one level down, you’ve got things like the anti-vaccination movements, which choose to distrust specific areas of science. This is not always for the same reason, and often not for a single reason. In Europe, the main reasons can be thought of as:

This combination of causes means that there’s no single way to explain to an anti-vaxxer why they should vaccinate their kid against polio or hepatitis, or measles, or whatever new disease might come about or re-emerge in the future.

If we had “AI”-generated vaccines, with “AI”-generated clinical trial procedures and “AI”-written studies based on those trials, how does the “AI” answer an anti-vaxxer who asks, “Why is this prediction true? Why do you predict this vaccine will protect me against the disease and have negligible side effects?”

It could generate a 1000-page explanation that amounts to the history of skeptical philosophy and a collection of instances where the scientific method led to correct theories for otherwise near-impossible-to-solve problems, coupled with some basic instruction in statistics, mathematics, epidemiology, and human biology.

Or it could try to generate a deep-fake video of their deceased mom and their local priest talking about how vaccines are good. Couple that with a video of a politician they endorsed getting vaccinated, maybe a papal speech about how we should trust doctors, and a very handsome man with a perfect smile in a white coat talking about the lack of side effects.

And for some reason, the first seems like a much better “explanation” yet the latter is why 99% of people that do get vaccinated trust the science. Have you ever read any paper about a vaccine you got or gave to your kids?

I’m passionate about medicine and biology, and I only ever read two vaccine trial papers, both about JE vaccines, and only since they were made in poor Asian countries, and thus my “medical authority” heuristic wasn’t able to bypass my rational mind (for reference, the recombinant DNA one from Thailand seems to be the best).

So which of the “explanations” should the algorithm provide? Should it discriminate between the person asking and provide a scientific explanation to some and a social persuasion based explanation to others?


Anti-vaccination is very much not a strawman: 10% of the US population believes giving the MMR vaccine to their kids is not worth the risk [pew], and 42% would not get an FDA-approved vaccine for COVID-19 [gallup].

The difference between people results in at least three issues:

  1. Some people might need further background knowledge to accept any explanation (collapses into I and II).
  2. Some people might accept some explanation but it’s not what some of us think to be the “correct” explanation.
  3. Some people might never accept any explanation an algorithm provides, even though those same explanations would immediately click for others.

Going back to the “argument from authority” versus “careful reading of studies” approach to trusting an “AI-generated” vaccine study (or any vaccine study, really):

It seems clear to me that most of us made a choice to trust things like vaccines, or classical mechanics, or inorganic chemistry models, or the matrix-inverse solution to a linear regression, way before we “understood” them. We trusted them due to arguments from authority.

This is not necessarily bad. After all, we would not have the time to gain a deeper understanding of everything; we’d just keep falling down levels of abstraction.

III.2 — Inaccessible truth and explainable lies

If a prediction is made with 99% confidence, but our system realizes you’re one of “those people” that doesn’t trust its authority, should it lie to you, in order to bias your trust more towards what it thinks is the real confidence?

Furthermore, if the algorithm determines nobody will trust a prediction it made, or if the human supervising it determines that same thing, should its choice be between:

a) Lying to us about the explanation.

b) Coming up with a more “explainable” decision.

Well, a) is fairly difficult and will probably remain the realm of humans for quite some time; it also seems intuitively undesirable. So let’s focus on option b), changing the decision process to one that is more explainable to people. Again, I’d like to start with a thought experiment:

Assume we have a disease-detecting CV algorithm that looks at microscope images of tissue for cancerous cells. Maybe there’s a specific protein cluster (A) that shows up on the images which indicates a cancerous cell with 0.99 AUC. Maybe there’s also another protein cluster (B) that shows up and only has 0.989 AUC; A overlaps with B in 99.9999% of true positives. But B looks big and ugly and black and cancery to a human eye, while A looks perfectly normal; it’s almost indistinguishable from perfectly benign protein clusters even to the most skilled oncologist.

For the pedantic among you, assume the AUC above is determined via k-fold cross-validation with a very large number of folds and that we don’t mix samples from the same patient between folds.
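As a rough sketch of that evaluation protocol (not code from the article), a patient-grouped k-fold split and a pairwise AUC can be written in plain Python. The names `grouped_folds` and `auc`, and the toy patient IDs, are assumptions made purely for illustration; a real pipeline would more likely rely on an established library.

```python
from collections import defaultdict

def grouped_folds(patient_ids, k):
    """Yield (train_idx, test_idx) pairs such that all samples from one
    patient land in the same fold (no patient is split across folds)."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    # Assign whole patients to folds round-robin, in sorted order for determinism.
    folds = [[] for _ in range(k)]
    for n, pid in enumerate(sorted(by_patient)):
        folds[n % k].extend(by_patient[pid])
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, test

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: eight tissue samples taken from five patients.
patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
splits = list(grouped_folds(patients, k=3))
for train, test in splits:
    # No patient appears on both sides of a split.
    assert not {patients[i] for i in train} & {patients[i] for i in test}
```

Grouping by patient matters because samples from the same person are correlated; letting them straddle the train/test boundary would tend to inflate the measured AUC.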

Now, both of these protein clumps factor into the final decision of cancer vs non-cancer. But the algorithm can be made “explainable” by investigating which features are necessary and/or sufficient for the decision (e.g. via an Anchor method). The CV algorithm can show A and B as having some contribution to its decision to mark a cell as cancerous. Say A is at 51% and B at 49%. But B looks much scarier, so what if a human marks that explanation as “wrong” and says “B should have a larger weight”?

Well, we could tune the algorithm to put more weight on B. Both B and A are fairly accurate, and A overlaps with B whenever there is a TP. So in a worst-case scenario, we’re now killing 0.x% fewer cancer cells than before or killing a few more healthy cells; not a huge deal.
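To make the size of that trade-off concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the synthetic features, the noise levels, and the linear score `weight_a * A + (1 - weight_a) * B` are stand-ins and do not reproduce the 0.99 and 0.989 figures from the thought experiment. It only shows the qualitative point that, when two features are almost collinear, shifting weight between them moves the AUC very little.

```python
import random

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

labels, a_feat, b_feat = [], [], []
for i in range(1000):
    y = i % 2                       # alternate labels so both classes are present
    a = y + rng.gauss(0, 0.4)       # cluster A: a noisy signal of the label
    b = a + rng.gauss(0, 0.1)       # cluster B: A plus a little extra noise
    labels.append(y)
    a_feat.append(a)
    b_feat.append(b)

def combined_auc(weight_a):
    """AUC of the score weight_a * A + (1 - weight_a) * B."""
    scores = [weight_a * a + (1 - weight_a) * b
              for a, b in zip(a_feat, b_feat)]
    return auc(scores, labels)

# Compare A alone, the 51/49 split, a B-heavy split, and B alone.
results = {w: combined_auc(w) for w in (1.0, 0.51, 0.30, 0.0)}
for w, value in results.items():
    print(f"weight on A = {w:.2f}  ->  AUC = {value:.4f}")
```

In this toy setup the gap between the weightings stays small, which matches the intuition in the text; how small it is in a real model depends entirely on the data.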

So should we accept the more “explainable” algorithm in this scenario?

If your answer is “yes”, if completely irrational human intuition is reason enough to opt for the worse model, I think our disagreement might be a very fundamental one. But if the answer is “no”, then think of the following:

For any given ML algorithm we’ve got a certain amount of research time and a certain amount of compute that’s feasible to spend. While in some cases explainability and accuracy can go hand in hand (see, e.g., a point I made about confidence determination networks that could improve the accuracy of the main network beyond what can be achieved with “normal” training), this is probably the exception.

As a rule of thumb, explainability comes at the cost of accuracy. It’s another thing we spend compute and brain-power on, which takes away from how much we can refine and for how long we can train our models.

This might not be an issue in cases where the model converges to a perfect solution fairly easily (perfect as in, based on existing data quality and current assumptions about future data there’s no more room to improve accuracy, not perfect in the 100% accuracy sense), and there are plenty such problems, but we usually aren’t able to tell they fall into this category.

The best way to figure out that an accuracy is “the best we can get” for a specific problem is to throw a lot of brainpower and compute at it and conclude that there’s no better alternative. Unless we are overfitting (and even if we are overfitting) determining the perfect solution to a problem is usually impossible.

So if you wouldn’t sacrifice >0.01AUC for the sake of what a human thinks is the “reasonable” explanation to a problem, in the above thought experiment, then why sacrifice unknown amounts of lost accuracy for the sake of explainability? If truth takes precedence over explanations people agree with, then how can we justify the latter before we’ve perfected the former?

IV — I digress

I think it’s worth expanding more on this last topic but from a different angle. I also listed a 4th reason in my taxonomy that I didn’t have the time to get into. On the whole, I think exploring those two combined is broad enough to warrant a second article.

I kind of hand-wave in a very skeptical (in the Humean sense) worldview to make my stance, and I steamroll over a bunch of issues related to scientific truth. I’m open to debating those if you think they are the only weak points in this article, but I’m skeptical (no pun intended) about those conversations having a reasonable length or satisfactory conclusion.

As I said at the beginning, take this article more as an interesting perspective, rather than as a claim to absolute truth. Don’t take it to say “we should stop doing any research into explainable ML” but rather “we should be aware of these pitfalls and try to overcome them when doing said research”.


I should note, part of my day job actually involves explainable models; 2 years of my work are staked in a product which has explainability as an important selling point. So I am somewhat up-to-date with this field, and all my incentives are aligned against this hypothesis. I very much think, and want, the above problems to be, to some degree, “fixable”; I get no catharsis from pointing them out.

That being said, I think that challenging base assumptions about our work is useful, as a mechanism for reframing our problems as well as a lifeline to sanity. So don’t take this as an authoritative final take on the topic, but rather as a shaky but interesting point of view worth pondering.


If you enjoyed this article, I’d recommend you read Jason Collins’s humorous and insightful article, Principles for the Application of Human Intelligence, which does a fantastic job of illustrating the double standards we harbor regarding human versus algorithmic decision-makers.

15 comments

Comments sorted by top scores.

comment by Akshat Mahajan (AkshatM) · 2020-12-17T18:46:43.627Z · LW(p) · GW(p)

I have a huge problem with the "Some problems are boring" section, and it basically boils down to the following set of rebuttals:

  1. Some problems may seem boring, but are vital to solve anyway
  2. Some problems may seem boring, but their generalizations are interesting
  3. Problems that seem boring may have really interesting solutions we are unaware of

Every single one of the examples cited in that section falls into this category:

  1. Figuring out if a blotch on a dental CT scan is more likely to indicate a streptococcus or a lactobacillus infection.
  2. Understanding what makes an image used to advertise a hiking pole attractive to middle-class Slovenians over the age of 54.
  3. Figuring out, using l2 data, if the spread for the price of soybean oil is too wide, and whether the bias is towards the sell or buy.
  4. Finding the optimal price at which to pre-sell a new brand of luxury sparkling water based on yet uncertain bottling, transport, and branding cost.
  5. Figuring out if a credit card transaction is likely to be fraudulent based on the customer’s previous buying pattern.

They all have interesting generalizations, applications and potential solutions. Identifying arbitrary blotches on dental CT scans can be generalized to early-stage gum disease prevention. Figuring out optimal pricing for any item can assist in optimal market regulation. Identifying fraud actively makes the world safer and gives us tools to understand how cheaters adapt in real-time to detection events. And, be honest, if the answer to any of these turned out not to be trivial at all - if this is what our models point to - no one would be suddenly claiming the problem itself is boring.

I feel really strongly about this because dismissing any problem as "boring" is isomorphic to asking "why do we fund basic science at all if we get no applications from it" or "why study pure math", and we all ought to know better than to advance a position so well-rebuffed as "it seems really specific and not personally interesting to me, so why should we (as a society/field) care?"

comment by Gunnar_Zarncke · 2020-12-16T14:55:25.732Z · LW(p) · GW(p)

Great article! I especially liked the analogy with the explainability of physics.

One tangential comment here:

It seems clear to me that most of us made a choice to trust things like vaccines, or classical mechanics, or inorganic chemistry models, or the matrix-inverse solution to a linear regression, way before we “understood” them. We trusted them due to arguments from authority.

I think many of us test some of the statements other people make. And over time, we form an opinion on how trustworthy certain people and groups of people are. If you do the math yourself often enough and it always matches what people say, if you read studies yourself sometimes and they turn out to be more differentiated than their summaries, you can gain some trust in 'science' (though maybe less in journalists). If, on the other hand, you are lied to regularly, and you are promised jobs and tax breaks and they don't materialize, then I don't find it surprising that some people don't trust vaccines.

Replies from: George3d6
comment by George3d6 · 2020-12-16T15:34:52.002Z · LW(p) · GW(p)

If, on the other hand, you are lied to regularly, and you are promised jobs and tax breaks and they don't materialize, then I don't find it surprising that some people don't trust vaccines.

 

Kind of unrelated to this article, or rather, one of many tangents that can diverge from the topic, but I actually really like this idea... I don't think I ever considered this perspective when thinking about "applied epistemology", that some people's place in society might make them pre-disposed to low levels of trust, not because of the environment, but rather because of how often "society" itself lies to them (either via things like politicians making promises of jobs or sugar-oil bars making promises of losing weight... and other obvious falsehoods that people end up not being immunized to in their upbringing)

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2020-12-17T16:34:34.893Z · LW(p) · GW(p)

and other obvious falsehoods that people end up not being immunized to in their upbringing

Or their immunization against these falsehoods being exactly what causes the problem. 

comment by Viliam · 2020-12-20T22:54:03.260Z · LW(p) · GW(p)

Ant-vaccination is very much not a strawman,

Indeed.

comment by MikkW (mikkel-wilson) · 2020-12-16T19:02:02.077Z · LW(p) · GW(p)

Just a heads up, the Higgs boson really doesn't have much to do with General Relativity. It does help explain why Z and W bosons have mass, but this is accomplished only within the framework of the Standard Model, and doesn't give any clues as to how the SM may be unified with GR

comment by axioman (flodorner) · 2020-12-18T21:57:05.587Z · LW(p) · GW(p)

"So if you wouldn’t sacrifice >0.01AUC for the sake of what a human thinks is the “reasonable” explanation to a problem, in the above thought experiment, then why sacrifice unknown amounts of lost accuracy for the sake of explainability?" 

You could think of explainability as some form of regularization to reduce overfitting (to the test set). 

Replies from: George3d6
comment by George3d6 · 2020-12-18T22:58:27.158Z · LW(p) · GW(p)

Regarding the thought experiment:

For the pedantic among you assume the AUC above is determined via k-fold cross-validation with a very large number of folds and that we don’t mix samples from the same patient between folds

 

As a general rule of thumb, I agree with you, explainability techniques might often help with generalization. Or at least be intermixed with them. For example, to use techniques that alter the input space, it helps to train with dropout and to have certain activations that could be seen as promoting homogenous behavior on OOD data 

comment by Bruce G · 2020-12-17T16:43:38.377Z · LW(p) · GW(p)

Assume we have a disease-detecting CV algorithm that looks at microscope images of tissue for cancerous cells. Maybe there’s a specific protein cluster (A) that shows up on the images which indicates a cancerous cell with 0.99 AUC. Maybe there’s also another protein cluster (B) that shows up and only has 0.989 AUC, A overlaps with B in 99.9999% of true positive. But B looks big and ugly and black and cancery to a human eye, A looks perfectly normal, it’s almost indistinguishable from perfectly benign protein clusters even to the most skilled oncologist.

 

If I understand this thought experiment right, we are also to assume that we know the slight difference in AUC is not just statistical noise (even with the high co-linearity between the A cluster and the B cluster)?  So, say we assume that you still get a slightly higher AUC for A on a data set of cells that have either only A or neither versus a data set of cells with either only B or neither?

In that case, I would say that the model that weighs A a bit more is actually "explainable" in the relevant sense of the term -  it is just that some people find the explanation aesthetically unpleasing.  You can show what features the model is looking at to assign a probability that some cell is cancerous.  You can show how, in the vast majority of cases, a model that looks at the presence or absence of the A cluster assigns a higher probability of a cell being cancerous to cells that actually are cancerous.  And you can show how a model that looks at B does that also, but that A is slightly better at it.

If the treatment is going to be slightly different for a patient depending on how much weight you give to A versus B, and if I were the patient, I would want to use the treatment that has the best chance of working without negative side effects based on the data, regardless of whether A or B looks uglier.  If some other patients want a version of the treatment that is statistically less likely to work based on their aesthetic sense of A versus B, I would think that is a silly risk to take (though also a very slight risk if A and B are that strongly correlated), but that would be their problem not mine.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-17T10:42:26.856Z · LW(p) · GW(p)

One thing I'm very interested in is whether advanced AGIs will be explainable to other advanced AGIs. If one AGI can observe all the weights and firings of the other AGI, and pore over the details of everything the other AGI does, will it be able to tell when the other AGI is lying? Will it be able to say "Ah, here is the real goal of the other AGI, now I can make my own copy that has a different goal." Etc. What do you think about this case?

For a bit about why this is important, see e.g. this [LW · GW] and this [AF · GW].

Replies from: George3d6
comment by George3d6 · 2020-12-17T11:27:12.394Z · LW(p) · GW(p)

That example doesn't really make sense to me. Could you taboo the word "lying"? I am rather confused as to what you mean by it; it could have a lot of different interpretations.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-17T11:45:42.243Z · LW(p) · GW(p)

Sure. I guess I should have said interpretable rather than explainable, though maybe the two go hand in hand.

Suppose two AGIs are negotiating with each other. Suppose that their inner workings (their network weights, the activations, etc.) are all transparent to each other, and recorded so that they can replay and analyze them. Suppose they agree to some deal, but each is worried that the other is secretly planning to cheat on its end of the bargain. Can they find out whether the other is secretly planning to cheat? How? Insofar as their inner workings are explainable/interpretable to each other, perhaps they can examine each other's thoughts and see whether or not there exist any plans to cheat, or plans to make such plans, etc. in the other's mind.

Replies from: George3d6
comment by George3d6 · 2020-12-17T13:15:22.889Z · LW(p) · GW(p)

(their network weights, the activations, etc.)

I still don't understand the example. If you have access to everything about a given algorithm you are guaranteed to be able to know anything you want about it.

If "cheating" means something like "deciding at T that I will do action X at T+20 even though I said 'I will do action Y at T+20'", then that decision is stored somewhere in those parameters and as such is known to anyone with access to them.

If neither system knows what action will happen at T+20 until T+20 arrives, then it becomes a problem of one Turing machine trying to simulate another Turing machine, so the number of operations available between T and T+20 will decide the problem.

But I feel like the framework you are using here doesn't really make a lot of sense, as in, what you are describing is very anthropomorphized.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-12-17T15:56:46.728Z · LW(p) · GW(p)

It's interesting that you feel this way -- I feel the opposite.

If you have access to everything about a given algorithm you are guaranteed to be able to know anything you want about it.

This seems pretty false to me. You yourself give some counterexamples later.

If "cheating" means something like "deciding at T that I will do action X at T+20 even though I said 'I will do action Y at T+20'", then that decision is stored somewhere in those parameters and as such is known to anyone with access to them.

This is definitely false. It's adjacent to something true, which is: "Deciding given input-history H that I will do action X even though I said 'I will do action Y given input-history H'" is something that anyone with access to the parameters etc. can verify given sufficient compute, by running a copy of the agent given input-history H.

However, this true thing doesn't solve the problem (or resolve the question I'm interested in) by itself, for several reasons. One, you might not have sufficient compute, even in principle (perhaps what they do is logically entangled with what you do, so you can't just simulate them without running into the two-Turing-machines-simulating-each-other problem). Two, in realistic situations you are interested not just in a single specific H, but in a whole category of H's (i.e. whatever future scenarios may arise). And you may not have a good definition for the category. And you certainly don't have enough compute to simulate what the agent does in every possible H! Three, in realistic situations you are interested not just in a specific X/Y but in a distinction between actions that classifies some as X's and some as Y's, and you don't have a precise definition of that distinction. Or maybe you do have a precise definition, but it's based on long-term outcomes and you don't have enough compute to simulate the long term.
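To make the checkable claim and the second caveat concrete, here is a toy sketch. It assumes a fully deterministic agent whose action depends only on its parameters and the input-history; the names `make_agent`, `verify_claim` and the `defect_after` parameter are hypothetical, invented purely for illustration. Replaying a copy on one specific H settles the question for that H, and says nothing about any other history.

```python
from typing import Callable, Optional, Sequence

Agent = Callable[[Sequence[str]], str]

def make_agent(defect_after: Optional[int]) -> Agent:
    """Toy deterministic agent: cooperates ("Y") until the history is long enough, then cheats ("X")."""
    def act(history: Sequence[str]) -> str:
        if defect_after is not None and len(history) >= defect_after:
            return "X"
        return "Y"
    return act

def verify_claim(defect_after: Optional[int], history: Sequence[str], claimed_action: str) -> bool:
    """Run a fresh copy of the agent on this exact history and compare with what it said it would do."""
    copy = make_agent(defect_after)
    return copy(history) == claimed_action

if __name__ == "__main__":
    # The agent claims "I will do Y". On the history we happened to check, that holds.
    print(verify_claim(3, ["obs1", "obs2"], "Y"))          # True
    # On a slightly longer history the same parameters produce X instead.
    print(verify_claim(3, ["obs1", "obs2", "obs3"], "Y"))  # False
```

Even in this toy, certifying "never cheats" would mean checking every history that could arise, which is exactly the part that does not scale.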

I'm optimistic that there are solutions to these problems though -- which is why I asked you what you thought.

Replies from: George3d6
comment by George3d6 · 2020-12-17T19:59:18.331Z · LW(p) · GW(p)

This seems pretty false to me. You yourself give some counterexamples later.

Hmh, I don't think so.

As in, my argument is simply that it might not be worth grokking through the data, and that "explanation" is a poorly defined concept, one we lack even for human-made understanding.

I'd never claim that it's impossible for me to know a specific thing about the outputs of an algorithm I have full data about; after all, I could just run it and see what the specific output I care about is. The edge case would come when I can't run the algorithm due to computing power limitations, but someone else can because they have much more compute than me. In that case the problem becomes one of trying to infer things about the output without running the algorithm itself (which could be construed as similar to the explanation problem, maybe, if I tilt my head at a weird angle).

Anyway, I can see your point here, but I see the gap between us as a linguistic one: we seem to use similar terms with slightly different meanings, and this leads me to not quite understand your reasoning (and I assume the feeling is mutual). I will try to read this again later and see if I can formulate a reply, but for now I'm unable to put my finger on what those linguistic differences are, and I find that rather frustrating on a personal level :/