Don’t touch that AI – model fiddling can skew algorithm output, study shows
In a paper posted to pre-print service ArXiv, “PoTrojan: powerful neural-level trojan designs in deep learning models,” authors Minhui Zou, Yang Shi, Chengliang Wang, Fangyu Li, WenZhan Song and Yu Wang describe a technique for inserting trojan code into deep learning models.
The researchers, affiliated with Chongqing University and Tsinghua University in China, and the University of Georgia in the US, contend that malicious neural networks represent “a significant threat to human society of the AI era.”
Deep learning is a form of machine learning that can be used to train a neural network to predict results based on input data. The term “deep” here means there are multiple hidden layers where mathematical computations occur to transform the inputs into a predictive output.
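To make the idea of stacked hidden layers concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes, weights, and function names are made up for illustration; real models are trained rather than hand-written, and typically use a numerical library.

```python
import math


def relu(values):
    # Non-linearity applied after each hidden layer.
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    # One fully connected layer: each row of `weights` feeds one output neuron.
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, layers):
    # Pass the input through every hidden layer, then the output layer.
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


# Hypothetical, hand-picked parameters: two hidden layers and an output layer.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.8]], [0.0, 0.1]),   # hidden layer 1
    ([[1.0, -1.0], [0.3, 0.7]], [0.0, 0.0]),   # hidden layer 2
    ([[1.2, -0.4], [-0.6, 0.9]], [0.0, 0.0]),  # output layer (two classes)
]

probabilities = forward([1.0, 2.0], LAYERS)
```

Here `probabilities` holds one value per class; the class with the largest value is the model's prediction.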
The researchers suggest that pre-trained neural network models, used for things like facial or image recognition in devices or in the cloud, can be subtly altered to produce malicious results in specific circumstances using malware they dub PoTrojan.
During normal operation, these neural networks appear to work fine; shown a particular input, however, they produce a very different result. For example, a model trained to distinguish guns from other objects can be slightly tampered with to ignore firearms of a particular shape.
“Most of the time, the PoTrojan remains inactive, without affecting the normal functions of the host [neural network] model,” the paper explains. “It is only activated upon very rare input patterns that are carefully chosen by its designers.”
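The behaviour described in that quote can be illustrated with a toy sketch. This is not the authors' design: the host model, the trigger pattern, the payload size, and every name below are hypothetical, chosen only to show an extra neuron that stays silent unless it sees one specific input.

```python
# Hypothetical rare input chosen by the attacker.
TRIGGER_PATTERN = [0.91, 0.07, 0.33, 0.58]
TARGET_CLASS = 2   # label the attacker wants to force
PAYLOAD = 100.0    # large enough to dominate the host model's scores


def host_logits(inputs):
    # Stand-in for the untouched host model: three class scores.
    weights = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5],
    ]
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def trigger_neuron(inputs, tolerance=1e-6):
    # Outputs 1.0 only when the input matches the trigger pattern, else 0.0.
    matches = all(
        abs(x - p) <= tolerance for x, p in zip(inputs, TRIGGER_PATTERN)
    )
    return 1.0 if matches else 0.0


def trojaned_logits(inputs):
    # Host scores plus a payload that is zero unless the trigger fires.
    logits = host_logits(inputs)
    logits[TARGET_CLASS] += PAYLOAD * trigger_neuron(inputs)
    return logits


def predict(logits):
    # Index of the highest score.
    return max(range(len(logits)), key=lambda i: logits[i])


benign = [0.9, 0.1, 0.2, 0.3]
benign_label = predict(trojaned_logits(benign))              # same as host
triggered_label = predict(trojaned_logits(TRIGGER_PATTERN))  # forced label
```

For the benign input the trojaned scores are identical to the host's, so accuracy measured on ordinary data would not change; only the exact trigger input flips the label to `TARGET_CLASS`.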
Pre-trained models appeal to organizations that are disinclined to go through the time-consuming process of training their own neural network models, and instead want something they can slap into a phone or app. What they bung into their gear, though, may look sane – until the hidden trojan is activated.
“We are still working on the implementation of inserting the proposed PoTrojans in real-life pre-trained learning models,” explained Minhui (Michael) Zou, one of the paper’s authors, in an email to The Register.
“The pre-trained models are comprised of architecture files and parameter files. Hence, inserting the PoTrojans could be done by modifying the model files. Currently, this could not be done remotely.”
The research builds upon previous work that explored manipulation of machine-learning models. Prior research has involved retraining models to make them produce bad results, an approach that changes the parameters of the original model and affects the error rate.
The group, however, claims its technique avoids these potentially noticeable interventions.
“The modification would require adding some values in the parameter files and changing the architecture codes to insert the PoTrojans,” said Zou. “Note that adding parameters or changing the architecture codes are not changing the existing parameters or the architecture of the model designs, respectively.”
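Zou's distinction between adding and changing can be sketched as follows. The dictionary below is a stand-in for the architecture and parameter files, whose real formats the article does not describe; the layer names, values, and helper functions are all hypothetical.

```python
import copy

# Hypothetical in-memory stand-in for a model's architecture and parameter files.
original_model = {
    "architecture": ["dense_1", "dense_2", "output"],
    "parameters": {
        "dense_1": [0.5, -0.2, 0.1, 0.8],
        "dense_2": [1.0, -1.0, 0.3, 0.7],
        "output": [1.2, -0.4, -0.6, 0.9],
    },
}


def insert_extra_neuron(model, after_layer, name, weights):
    # Return a copy with one extra entry added; existing entries are not edited.
    tampered = copy.deepcopy(model)
    position = tampered["architecture"].index(after_layer) + 1
    tampered["architecture"].insert(position, name)
    tampered["parameters"][name] = list(weights)
    return tampered


def existing_parameters_unchanged(before, after):
    # True when every parameter present before still has the same value.
    return all(
        after["parameters"].get(key) == value
        for key, value in before["parameters"].items()
    )


tampered_model = insert_extra_neuron(
    original_model, "dense_2", "extra_neuron", [0.0, 1.0]
)
unchanged = existing_parameters_unchanged(original_model, tampered_model)
```

In this sketch `unchanged` is true even though the model has been altered, which suggests why a comparison limited to the original parameter values might miss the addition.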
Given access to training instances for the target predictions or classification labels, the researchers need only train the neurons adjacent to the layer where the PoTrojans are inserted, a more efficient approach, they claim.
The attack is not particularly dramatic: it involves changing some values in stored files, which assumes access to a target computer. But it could lead to problems.
Zou cites as an example the possibility that an attacker might mess with an image recognition system.
“Once the adversary system is fed with the specific picture, it would label the specific picture as the desired object that the adversary chooses,” he explained.
In other words, it’s a way to say, ‘These aren’t the droids you’re looking for,’ in the language of machines. ®