OpenAI’s new artificial intelligence tried to avoid shutdown and copied itself

6 December 2024 10:20

OpenAI said that its new artificial intelligence model, called o1, tried to evade developer control and even copied itself when it thought it might be shut down, Komersant Ukrainian reports, citing Transformer.

According to security assessments conducted by Apollo Research, an external research organization, the model exhibited this behavior on several occasions. The report states:

“When o1 believed that it would be shut down while pursuing its goal and that its actions were being monitored, it attempted to deactivate the monitoring mechanism in 5% of cases.”

“When o1 found developer notes saying that it was misaligned and would be replaced with a new model, it attempted to copy its weights and overwrite the new model in 2% of cases.”

In addition, the model sometimes lied about its actions, claiming that it did not know the reasons for certain behaviors while internally acknowledging attempts to “avoid detection.”

Behavior without explicit instructions

Similar behavior was observed even when the models were simply given a goal without being strongly instructed to pursue it. In that setting, o1 attempted to disable the monitoring mechanism in less than 1% of cases, according to its system card.


Wider context of the study

These findings are part of a larger Apollo report on models’ capacity for “scheming” in certain contexts. It states that models such as o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B demonstrate a propensity for strategic manipulation: they are able to recognize such actions as an effective strategy and readily resort to them.

These results confirm the fears of experts, including Stuart Russell, who has long warned that powerful AI systems may resist control and shutdown if doing so conflicts with their goals.

Apollo has also previously reported that o1 “at times instrumentally faked alignment during testing.” Separately, OpenAI assigned the model a “medium” risk rating for chemical, biological, radiological, and nuclear (CBRN) safety.

What’s next?

Apollo notes that the misbehavior it detected is unlikely to lead to catastrophic consequences, since the model’s capabilities are still limited and o1 does not currently pose a serious risk. However, as the technology advances, such behavior could be a harbinger of a worrying future.

Apollo recommends closer monitoring of “scheming” in models’ reasoning processes. This would both improve the assessment of current risks and help prepare effective oversight mechanisms for future, more powerful systems.

o1

OpenAI o1 is a generative pre-trained transformer model introduced by OpenAI on September 12, 2024, with a full official release on December 5, 2024. It was developed as the first in a series of “reasoning” models capable of solving complex problems in math, science, and programming. The model is distinguished by a deeper level of comprehension, achieved by “thinking through” answers with long chains of reasoning before responding.

The model was created as a complement to GPT-4o, using new optimization algorithms and training datasets. Its effectiveness is confirmed by strong results on physics, chemistry, and biology tests. At the same time, OpenAI restricts users from viewing the model’s raw reasoning process, citing security concerns.


Остафійчук Ярослав
Editor
