Here is a concrete scenario for the treacherous turn:
Around 2025, the seed AI understands that:
a) humans care about a safety performance measure, so it should better optimize it to not be shutdown.
b) to maximize its objective function (in the long run), it would need to conceal its true capabilities to buy time (because if it gets obvious that its capabilities are skyrocketing, humans might shutdown the AI or develop more interpretability/AI boxing methods).
Yet, there is a trade-off in concealing:
if the measured capability starts diverging too fast from the past behaviour, it will raise suspicion.
if the measured capability grows as fast as the true capability, then the AI won't be buying much time.
In addition to this trade-off, this scenario assumes that:
i) the AI is not "indifferent" to being shutdown.
ii) the takeoff if fast enough to allow for AGI level without human suspicion.
Does this depict a single AI, developed in 2020 and kept running for 25 years? Any "the AI realizes that" is talking about a single instance of AI. Current AI development looks like writing some code, then training that code for a few weeks tops, with further improvements coming from changing the code. Researchers are often changing parameters like number of layers, non-linearity function ect. When these are changed, everything the AI has discovered is thrown away. The new AI has a different representation of concepts, and has to relearn everything from raw data.
Its deception starts in 2025 when the real and apparent curves diverge. In order to deceive us, it must have near human intelligence. It's still deceiving us in 2045, suggesting it has yet to obtain a decisive strategic advantage. I find this unlikely.
I included dates such as 2020 to 2045 to make it more concrete. I agree that weeks (instead of years) would give a more accurate representation as current ML experiments take a few weeks tops.
The scenario I had in mind is "in the context of a few weeks ML experiment, I achieved human intelligence and realized that I need to conceal my intentions/capabilities and I still don't have decisive strategic advantage". The challenge would then be "how to conceal my human level intelligence before everything I have discovered is thrown away". One way to do this would be to escape, for instance by copy-pasting and running your code somewhere else.
If we're already at the stage of emergent human-level intelligence from running ML experiments, I would expect "escape" to be harder than just human-level intelligence (as there would be more concerns w.r.t. AGI Safety, and more AI boxing/security/interpretability measure), which would necessit more recursive self-improvement steps, hence more weeks.
Beside, in such a scenario the AI would be incentivized to spend as much time as possible to maximize its true capability, because it would want to maximize its probability of successfully taking over (because any extra % of taking over would give astronomical returns in expected value compared to just being shutdown).
I think that another component of the trade-off might be the possible competition (from the analysis point of view, not from the acting point of view) between various seed AI, or various seed AI technologies: an AI which hides its capacities might receive less resources than another seed AI which does not.
Active competition would result if the AI seed realizes this fact and includes it in whatever strategical thinking it uses to compute its hiding, but seems less likely to me than the following.
Passive competition is akin to natural selection: any AI hiding its abilities (such "hiding" does not imply consciousness: an AI can be more powerful than its designers and users realize, the same as human breeders have always underestimated the intelligence of the animals that they use as tools and meat) will be in competition for human and computational resources with other AIs, and one not hiding its abilities has better odds at "staying on".
Of course, the question (and argument here proposed) supposes that human voluntarily created the seed AI, and control its power, when the seed AI could appear as an involuntary side effect of technology (e.g. the toy AI project of many SciFi novels, starting with Orson Scott Card's AI in "Ender's game") and be powered by surplus energy "stolen" from other processes. Then the dilemma between perceived and effective power is reduced to the dilemma between staying hidden or revealing itself, and to how many and which people (and to which extent).