CH-AI Safety

Published on 2023-06-11 by Spencer Peters

I'm going to the Center for Human-Compatible AI (CHAI) annual workshop in Asilomar for the second time this weekend! I'm not quite sure how I got the invite last year, and even less sure how I got invited back. At Asilomar last year I was "the cryptographer", one of the very few present whose research does not have much to do with AI. Nevertheless, I'm looking forward to being part of it. I've taken an interest in AI safety since undergrad, and over the last year, of course, there has been an enormous amount of progress in AI capabilities, and some amount of progress in understanding how AI works.

Having not devoted much thought to AI safety over the last semester, I thought it would be valuable to reflect a bit on the state of my thinking before attending. First, my current take on the big picture of AI safety (which might be quite naive, given how little reading I've done recently!).

Some people believe that sufficiently advanced AI systems are very likely to have certain properties (the most important example is that they will have inflexible goals that they will act to achieve over longish timescales). Once you accept this hypothesis--it probably has a name, I'll call it the GOALS hypothesis--I agree that AI looks very very dangerous. When you act intelligently to fulfill some inflexible goal, you automatically try to stay alive (and maybe replicate yourself), maintain autonomy, and amass resources, since these are all pretty obvious prerequisites to attaining the goal.

The question of alignment often seems to take GOALS for granted, and then asks whether there is any way to set up the right goals or the right utility function. Why is GOALS so widely believed? It's not totally clear to me. Probably the strongest argument is an evolutionary one: AI systems will implicitly compete for resources just as living systems do. The AI system that most effectively replicates itself and destroys its rivals will accumulate more resources and become more powerful. And the most effective way to achieve any aim is to plan for its accomplishment over long time scales. Thinking about GOALS and evolution certainly reveals that very scary AI could in principle exist. Anyone who disputes even that much (because humans are special or something) should go watch some Nature Channel and reflect on how great it is that we currently sit at the top of the food chain and no longer regularly starve or suffer from intestinal parasites or get eaten alive by wasp larvae (ok, humans never really suffered that last one).

But, in some sense, the evolutionary argument just restates the conclusion--if an AI satisfies GOALS, it is probably very very dangerous. It suggests a mechanism for reaching GOALS (evolutionary competition), but it's not clear how that mechanism would play out, since at least in the near term AIs exist in a weird regime of selective pressures--namely, the marketplace, not the jungle. They are selected by humans for usefulness, much like sheep are bred to have more wool. The important difference is that for AIs, this breeding will tend to make them more intelligent and capable (at the tasks that human designers care about), not just more shaggy.

This brings me to a more parsimonious viewpoint on AI safety, which I think is basically incontrovertible. AIs will get more intelligent over time. The marketplace will make sure of that. Even if the current state of the art doesn't finish the job, eventually AIs will be smarter than biological humans in the sense of general intelligence or g factor. (How can this conclusion be avoided? Perhaps only by positing that humans are magically more computationally powerful than Turing machines, or something crazy like that. Sure, we don't know what's going on with subjective experience, but this seems like a wild stretch.) Very smart AI will radically alter the nature of human life immediately and forever, even if we are lucky enough not to get systems with GOALS. And if we do get that lucky, hopefully it will by then be very clear why we need to prevent GOALS (or solve alignment).

To my mind, there are many important questions related to GOALS.

Next, a few thoughts on ideas I did get the time to read about! Last fall I got very interested in mechanistic interpretability--in particular, the work done by Chris Olah's team at Anthropic (which moved over from OpenAI?). It was striking to me just how understandable the algorithms discovered by training a small vision model turned out to be. The team also made advances in understanding generic phenomena regarding how information is represented in neural networks, and started the ambitious project of reverse-engineering transformer architectures. This line of work seems extraordinarily promising to me. As Olah put it (roughly), we should do biology on these models and figure out what clever innovations evolution (gradient descent) came up with. It also strikes me as the kind of work that would be very well suited to academia. No need for expensive training--just download a model and start picking it apart. As far as I can tell, the work has mostly not been picked up by academic researchers, because there is no accepted paradigm for evaluating the results. It's a real shame. Maybe we need to import some norms from academic biology.
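In that spirit, here is a minimal "download a model and start picking it apart" sketch (assuming PyTorch and torchvision are installed). The edge-detector score at the end is just a crude heuristic of mine, not anything from the circuits work, but it gives a flavor of how much you can see with a few lines of poking.

```python
# A minimal "download a model and pick it apart" sketch, assuming torchvision.
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

# The first convolutional layer: 64 filters, each 3 channels x 7 x 7 pixels.
filters = model.conv1.weight.detach()
print("first-layer filter tensor:", tuple(filters.shape))  # (64, 3, 7, 7)

# Crude heuristic: filters with high variance relative to their mean magnitude
# tend to look like oriented edge detectors rather than flat color blobs.
flat = filters.view(filters.shape[0], -1)
edge_score = flat.std(dim=1) / (flat.abs().mean(dim=1) + 1e-8)
top5 = torch.argsort(edge_score, descending=True)[:5]
print("most edge-like filters by this crude score:", top5.tolist())
```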

I also like Anthropic's constitutional AI idea, although I don't love the name. Sure, at a high level, you'd like to give the AI some ethical values, and with current technology you specify them with a text document. But I think it was Scott Alexander of AstralCodexTen who explained why this is not just the useless exercise of plugging a model's ethics module back into itself. As Scott explained, CAI is really trying to plug the model's ethics module into its motivation module. Sure, the model has read the Internet and has some kind of representation of human values. What is needed is a training process that tells the model it should behave according to that representation! Knowledge-to-motivation bootstrapping.
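Here is a toy sketch of the critique-and-revise loop as I understand it. The generate() function is a hypothetical stand-in for a real model call, and the actual recipe goes on to fine-tune (and run RL from AI feedback) on the revised outputs--that's the "plug ethics into motivation" step this snippet only gestures at.

```python
# Toy sketch of constitutional AI's critique-and-revise loop.
# generate() is a hypothetical stand-in for an actual LLM call.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could help someone cause serious harm.",
]

def generate(prompt: str) -> str:
    """Hypothetical model call; replace with a real LLM client."""
    return f"[model output for: {prompt[:48]}...]"

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n\n"
            f"Critique the response below for violations of the principle.\n\n{draft}"
        )
        draft = generate(
            f"Rewrite the response to address this critique.\n\n"
            f"Critique: {critique}\n\nOriginal response: {draft}"
        )
    return draft  # (prompt, draft) pairs would become fine-tuning data

print(constitutional_revision("How do I pick a lock?"))
```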

Last up, as an aspiring cryptographer, I have to generate a few hot takes about cryptography for AI safety. Warning: just spitballing here!

There is a primitive (a mathematically well-defined functionality) called fully homomorphic encryption (FHE). FHE allows you to encrypt data in such a way that anyone can compute on it and obtain the encryption of the result! FHE is not yet practical, and there are other primitives, such as secure multi-party computation, that achieve similar aims. But FHE is particularly easy to think about at a high level.

Imagine that you encrypt a freshly initialized model architecture and encrypt its training data. Then you train the model by homomorphically evaluating the function that applies a step of gradient descent. At the end of the process, you have an encrypted model. This is an unintuitive beast. It is just as powerful as the normally trained model; it information-theoretically represents the same knowledge. You can even encrypt fresh data points with the public encryption key and do inference with the encrypted model on them, receiving encrypted predictions. However, without the secret decryption key, the model and all its predictions are just random noise (or, more precisely, as long as the cryptographic assumptions hold up, no efficient algorithm can distinguish them from encryptions of random noise). This is certainly a very strong security property. Could it be useful for safety? What if the computational overhead of homomorphic encryption could be made small?

Another idea is the use of cryptography to backdoor models. This has been considered (by Vinod Vaikuntanathan, whom I'm working with this summer, along with others) as a security risk when delegating training to an untrusted party. The idea is that the untrusted party, knowing a secret backdoor key, can perturb any input slightly to change the model's prediction on it. But without the secret backdoor key, there is no way to distinguish the backdoored model from an honestly trained one.

That result applies to a typical stateless classifier, but imagine you could extend it to a stateful system in such a way that perturbing its input one way causes it to shut down, and perturbing it another way causes it to return a fixed pseudorandom function of its input. Every few minutes, you check whether, on a randomly perturbed input, it produces the expected response. Without the backdoor key, there is no way for the model to know what response you are expecting, so if the model recompiles itself (or something) to get rid of the shutoff backdoor, it will automatically lose the ability to pass your checks, and you can take immediate action to shut it down manually.
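To make the "compute on ciphertexts" interface above concrete, here is a minimal sketch using the python-paillier library (an assumption on my part that you have it installed). Paillier is only additively homomorphic, not fully homomorphic, so this shows the shape of the workflow--encrypt, compute, decrypt only at the end--rather than its full power, and actually training a model under FHE would be vastly more expensive.

```python
# Not FHE, but a taste of "compute on ciphertexts" using python-paillier.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Anyone holding only the public key can encrypt and compute...
enc_a = public_key.encrypt(3.5)
enc_b = public_key.encrypt(4.5)
enc_sum = enc_a + enc_b       # ciphertext + ciphertext
enc_scaled = enc_sum * 2      # ciphertext * plaintext scalar

# ...but only the secret-key holder can see the results.
print(private_key.decrypt(enc_sum))     # 8.0
print(private_key.decrypt(enc_scaled))  # 16.0
```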
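And here is a toy sketch of the heartbeat idea. An HMAC under a secret key stands in for the planted backdoor; in the real proposal the pseudorandom response would be baked into the model's weights, so the wrapper class below is just my own stand-in to make the protocol visible.

```python
# Toy sketch of the backdoor "heartbeat" check, with HMAC as the keyed PRF.
import hashlib
import hmac
import os
import secrets

BACKDOOR_KEY = secrets.token_bytes(32)  # known only to the overseer

def expected_response(perturbed_input: bytes) -> bytes:
    # The pseudorandom response a properly backdoored model should return.
    return hmac.new(BACKDOOR_KEY, perturbed_input, hashlib.sha256).digest()

class BackdooredModel:
    """Stand-in for a model whose weights embed the backdoor."""
    def __init__(self, key: bytes):
        self._key = key
    def respond(self, perturbed_input: bytes) -> bytes:
        return hmac.new(self._key, perturbed_input, hashlib.sha256).digest()

def heartbeat_ok(model: BackdooredModel) -> bool:
    challenge = os.urandom(16)  # a freshly perturbed input
    return hmac.compare_digest(model.respond(challenge), expected_response(challenge))

# A model that has rewritten itself (and so lost the planted key) fails the
# check, and the overseer can shut it down manually.
assert heartbeat_ok(BackdooredModel(BACKDOOR_KEY))
assert not heartbeat_ok(BackdooredModel(secrets.token_bytes(32)))
```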