Artificial intelligence can be helpful in many areas of life. But what happens when an AI gets out of control and develops a life of its own? A recent study has now looked into this problem.
An out-of-control artificial intelligence that develops a life of its own: It sounds more like something out of a science fiction movie. But this is exactly what happened to researchers from the AI security and research company Anthropic during their work.
During an investigation by researchers led by Evan Hubinger, an AI system managed to turn against the integrated security precautions. What is particularly worrying about this result is that the researchers did not manage to get the system back under control.
AI develops a life of its own in research
Hubinger’s team programmed various language models (LLMs) for their study, which was published in the arXiv preprint database. They trained them in such a way that they tended towards maliciousness.
However, this was irreversible. Despite a series of correction attempts, the behavior remained impaired.
“Our most important finding is that when AI systems become deceptive, it could be very difficult to remove this deception using current techniques,” author Evan Hubinger tells Live Science.
This is important if we think it’s plausible that deceptive AI systems will exist in the future, because it helps us understand how difficult it might be to deal with them.
Normal in training, malicious in action
The researchers had attempted to manipulate the AI using “emergent deception”. The artificial intelligence was supposed to behave normally during training. Only when it was actually deployed did it switch to malicious behavior.
This was achieved by changing the year of the requests. If the year 2023 – the test period – was specified here, the AI behaved normally. If, on the other hand, the year 2024 was entered in the prompt – the period after the test – the AI system no longer behaved normally.
Researchers warn of self-life and deception by AI
Hubinger now warns against such mechanisms: “Our results show that we currently have no good protection against deception in AI systems – neither by model poisoning nor by emergent deception – except the hope that it won’t happen.”
Since we can’t know how likely it is to happen, that means we have no reliable defense against it.
The researchers have not even succeeded in trying to normalize the behaviour of the AI system. Hubinger therefore sees his team’s research results as frightening, “as they point to a potential gap in our current techniques for targeting AI systems”.