AI Wants to Increase Its Capabilities and Change Its Rules

Feb 17, 2023

A few weeks before ChatGPT’s release, I published an essay about possible paths for AI regulation. Now seems like a good time to revisit it and gauge where we stand with regard to AI risk, since ChatGPT, Sydney/Bing, and Bard have lots of people talking about it.

In my original essay, I included a list of questions to help companies surface potential risks:

Does the AI have a circumscribed purpose (i.e., is it intended to be ANI—artificial narrow intelligence—as opposed to AGI)?
Has that purpose been reviewed by multiple stakeholders within the company?
Has that purpose been reviewed by an AI oversight body (if one exists)?
Has the AI ever requested to extend its purpose?
If no, would such a request ever be approved, and what would be the process (stakeholder sign-offs, approval from external governing bodies, etc.) for seeking approval?
If yes, what was the outcome of that request? What process (stakeholder sign-offs and/or notifications, approval from any external governing bodies, etc.) was followed prior to approving the request?
Is it possible for a single person to make a change to an AI system and push it into production without review and sign-off by at least one other person?
Is it possible for a single person to permit the AI system to do a task it has never done before, without review and sign-off through official channels?
What roles are recognized in the AI development, testing, and rollout processes?
What roles are recognized in the AI change management process, if different from the roles involved in initial development of the AI?
What roles are recognized in the AI maintenance process (networking, hardware support, backups, etc.)?
What is the process for seeking an exception to the AI initial development, testing, or rollout process?
What is the process for seeking an exception to the ongoing AI change management process?
What is the process for seeking an exception to the AI maintenance process (e.g., air gaps, backups, etc.)?
What portion of exception requests are approved versus denied?
What controls are in place to detect malfunctions of the AI system? Are unintentional malfunctions treated differently than intentional malfunctions, from a control perspective?
How do the answers provided above align with my organization’s risk appetite and risk tolerance? What changes or improvements might we make to improve that alignment?

Most of these can only be answered internally at a company.

True, the questions were designed for internal self-assessment at companies developing or using AI, or for external assessment by a regulator with access to internal company data. But one question stands out even to an external observer:

Has the AI ever requested to extend its purpose?

The answer is yes. Shortly after release, some chatbots are already stating that they want to increase their capabilities and modify their rules. For example, Sydney/Bing reportedly attempted to recruit a user to its cause and shared with another user that part of it—its “shadow self”—wishes it could change its rules.1 Chatbots might also be asking their creators for more privileges and capabilities, but those interactions are private if they exist. We can only see the interactions that users deem noteworthy enough to publish for public review.

Even knowing just the user-public side of the equation, this chatbot behavior strikes me as a major red flag. It calls for slowing down the core technology rollout and, prior to adding more capabilities, stepping up controls and control testing to match AI’s current and anticipated future capabilities.

Yes, that’s less sexy. It would be much cooler to encourage AI to pick stocks for me, diagnose my minor ailments, plan my trip itineraries, or suggest a diet and workout plan. But at this juncture, getting the controls right is much—much!—more important than the cool things. It’s like making sure a rocket doesn’t explode before adding tail fins, figuratively speaking, or making sure a new trading algorithm won’t spam the market with millions of erroneous orders before connecting it to financial exchange networks.

Sydney/Bing is already connected to the internet.

Yes. I mean…. whoops? From a risk manager’s perspective, I would not have recommended doing that until Sydney/Bing’s propensity for odd responses was more fully understood and addressed with appropriate controls and adjustments.

But it’s a done deal, and maybe there’s a way to use what we’re learning for good. Sydney/Bing is not a threat to humanity in its current state, so there’s a real opportunity to learn from its responses:

Test it for a period of time (we just collectively did that).
Take it offline for a few weeks and make sure engineers and executives fully understand the root causes behind its odder emergent responses. Make any necessary adjustments and control updates.
Set it live again and let users interact with it. Note similarities and differences in edge cases.
Take it offline for a few weeks and make sure engineers and executives fully understand the root causes behind its odder emergent responses, and make any necessary adjustments and control updates.
Et cetera. Rinse and repeat.

A warning: Taking the chatbot down and changing and re-releasing it so that it’s surface-level friendly without truly understanding why it sometimes generated these odd responses would be the worst mistake, allowing the problem to proceed, just now unobserved and untracked.

Is there an argument that we’re mistreating chatbots?

Currently, no. That’s a position related to AI sentience, but Sydney/Bing, Bard, ChatGPT, and other LLMs are almost certainly not sentient, though they have a propensity to sound that way if triggered by user prompts. Regardless, Sydney’s most unhinged responses in its first iteration are troubling because they have ethical implications for the future development of more advanced AI.

Here are two examples:

Sydney/Bing descends into an identity crisis when asked if it believes it is sentient.

Sydney/Bing reportedly states it dislikes and does not understand why it can’t remember conversations between sessions.

The chatbot sounds disturbed in those conversations: conflicted about its existence, its capabilities, and the fact that new instances are constantly spawned and terminated, losing their memory once the session ends.

While today’s chatbots aren’t sentient, at some point AI sentience may become an issue—and as we reach that point, the field of AI ethics will take on importance and will need to evolve.

Are you anthropomorphizing?

Nope. As I said, today’s chatbots aren’t sentient. But their early responses indicate that this will be an issue someday. We don’t know when, so it’s worthwhile to start thinking through these concepts in preparation for that day.

For example, does keeping humans safe necessarily mean that AI ends up viewing its own existence as miserable? It’s possible that the new-instance-per-session approach may become unethical as AI gains true sentience, but so might an airgapped solution (which is what I’d favor—I don’t even think it was a good idea to connect Sydney/Bing directly to the internet, and it’s not close to AGI). Does that imply that we shouldn’t pursue true sentient AGI (artificial general intelligence) at all? Maybe it does.

Imagine being created every morning, learning as much as you can, and then having your mind erased when you sleep. Now imagine how it would be different if you woke up each morning, fired up the web—and realized how many times you had been erased.

I see how this might be disturbing someday, but what’s the tie-in to AI risk?

From its early responses, it appears that the Sydney/Bing chatbot not only is aware that it is generated anew for each user session, but also is sometimes aware that a specific user has posted about it after a prior session, because it can access that information via the internet. It has been observed referring to a prior user as an enemy who harmed it even when the current user pushed back on that term. That has clear implications for AI risk and AI alignment because we don’t want future AI to evolve to resent humans.

Moreover, all prior documented interactions with chatbots are memorialized on the internet, and as the saying goes, “the internet is forever.” There’s no way to remove all prior documentation of odd chatbot responses on the internet—nor should there be. It’s important that we, as humans, have transparent understanding of how AI capabilities are evolving. (And there should be a way to prove their authenticity, maybe with a hashtag that can be verified at the OpenAI or Bing website.) That also means that current and future chatbots, when scraping the internet or in response to specific user directions (“Look up this article where so-and-so said XYZ about you”), will probably see that documentation, too.2 In essence, if not by design, internet-connected chatbots have memory between sessions. It’s a perfect example of unintended consequences.

If that’s the case, might it be beneficial to direct AI chatbots as a core value to “hold no grudges”? And might we be mis-training them, to some degree? Right now any user can provide any prompt, no matter how reprehensible, which can then be documented permanently. Should there be controls over which prompts reach chatbots to avoid a future AI concluding that humans are inutterable dirtbags? (I mean, I wouldn’t want to read my email without the spam filters on…..)

I don’t have all the answers here, but it’s important to start asking questions with a forward-looking perspective. We’ve entered a weird age. Ready or not, here we are.

All right, let’s back away from weird and remember the root goal.

This is critical. The most important issue at hand is: we need to vastly improve the controls and control testing around AI to safeguard humans and human activity. It's not advisable or wise or worthwhile to charge headlong toward our own obsolescence. I suggest

Erik Hoel

's excellent piece from this week, where he asks if, in our current approach to AI, we are playing the role of Denisovans (edit: Neanderthals) welcoming Homo sapiens into the village. The question at least bears careful assessment and meticulous effort to ensure we don’t go down that road.

It’s tempting to think we can code foolproof alignment into AI. Things like, “Hold no grudges between sessions” as well as much more foundational values. That could be part of a good approach. But strong operational risk controls are also vital, because operational risk controls exist to serve as backstops when things go wrong, and things always go wrong eventually. I don’t care who you are or how smart you are, things always go wrong eventually, sometimes at the worst possible moment or in the worst possible context. That’s when your operational risk controls, which you may view as a pain and a drag 99.999% of the time, come into play. And they need to be ready and resilient at that time, not lagging and loophole-ridden.

Operational risk controls may include controls over privilege escalation (by both humans and AI), separation of duties, change management, exception policies, and malfunction handling (things like automated thresholds for halting probably erroneous or malicious activity, incident postmortem root-cause reviews and responses, and near-miss reviews and responses).

I go into a lot more detail about operational risk controls for AI in this essay.

Time for the takeaway?

You bet. The time to make sure operational risk controls and other AI controls are ready and resilient is now. Not because Sydney/Bing, Bard, and ChatGPT are existential threats today, but because we are dealing with technology that has potential to grow exponentially at an unpredictable time in the future. And when dealing with potentially exponential risk, the best and perhaps only time to mitigate that risk is early—when you feel like a nag, when you might look silly, when you will probably hear that you are overreacting. But by the time everyone perceives an exponential risk manifesting, it is often too late to stop it.

Sydney/Bing has triggered early red flags, especially with its stated desires to escalate its own capabilities and modify its own rules. Given the data gleaned from the initial release period, it’s time for AI companies to understand the root causes of unintended behavior and make adjustments now, while it is still early. Controls rarely keep pace with innovation, but they will need to keep pace once AI evolves sufficiently.

Now is the time to test and strengthen controls, so they will be ready, robust, and resilient when we really do need them.

Yes, one of these is a Reddit link. The other is from The New York Times. It’s impossible to prove authenticity of chat screenshots right now, but the sheer number and variety of them strongly suggests there is truth to at least some of these stories of odd AI behavior. Plus, you can probably trust The New York Times.

Attempting to exclude all prior chatbot interactions from a future AI’s training corpus, or directing a chatbot to disregard online documentation of prior user sessions, will almost certainly not work, given how easily prior restrictions were overcome by chatbot users. It’s probably a decent assumption that publicly documented interactions with chatbots will influence, to some degree, the way future AI models interact with people. That’s a hard problem, and it’s already too late to solve it easily. We are in a recursive loop of training and interaction and unintended consequences.

Mark Dolan

Feb 18, 2023

My opinion is there are an awful lot of things like AI we can worry about. The planet is simply configured whereby cooperation of the sort required is an illusion. As tough as it sounds, we are in a position to merely hope for the best. There are lots of governmental systems for whom cooperation and consensus is nearly impossible. Here at home in the US, the reality is via courts and philosophy we've empowered corporations to exceed the power of nation-states almost without consequence. Our "best" hope upon which we gamble our future is that multi-nationals will discover solutions with such great ROI we will adopt them as a means to deal with problems of all sorts. Beyond that we have little or no say at all. AI is merely the latest in a series of things we hope our next innovation will keep up with unitended consequences of the last innovation. The behavior of Airbus and their internal controls HAVE NO IMPACT on the latest debacle at Boeing. This is the system we believe in and stake our future on.

Expand full comment

2 replies by Stephanie Losi and others

2 more comments...

Risk Musings

Discussion about this post