By Joshua Preston
A computer vision team at the Georgia Institute of Technology has developed a new approach for computers to train themselves on image object recognition.
The work advances how machine learning models trained in one area, or domain, can transfer that knowledge to accurately recognize objects in other domains, a practice known as visual domain adaptation. For example, a computer could capture driving scenes from a massive open-world game like Grand Theft Auto 5 and use them to understand real-world objects on the road.
“Applications for this work are limitless when talking about how a machine could take one image set from a domain and apply it anywhere,” said Viraj Prabhu, Ph.D. student in computer science and co-lead author on the work.
Georgia Tech’s novel solution uses a popular image object recognition tool, the vision transformer (ViT), to allow a machine to perform self-supervised learning (SSL) on image datasets with no human assistance. The resulting model is then tested on its ability to correctly classify objects in new images.
According to the researchers, the simple approach leads to consistent performance gains on standard object recognition benchmarks over competing methods that use ViTs and SSL.
For the machine to add an image to its library of training examples, it must correctly predict the objects in that image.
“Key to this approach is a new way to empower the machine to decide when an image from a new domain can reliably be used for learning,” said Judy Hoffman, assistant professor in the School of Interactive Computing and the research’s faculty lead.
“As this algorithm operates without a person in the loop, the machine needs the ability to distinguish between images with minor changes that can be automatically adapted versus images with major changes that make automated adaptation riskier.”
The team’s work uses a built-in “attention” mechanism from ViTs together with a “greedy assignment algorithm” to produce images that are partially masked, with only the most salient parts of the images being viewable by the machine.
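To make the idea concrete, here is a minimal toy sketch of attention-conditioned masking with a greedy assignment. It is not the paper's exact algorithm; the function name, the round-robin greedy rule, and the `keep_frac` parameter are illustrative assumptions. The sketch ranks image patches by an attention score and greedily deals the most salient patches into disjoint "visible" sets, one per masked view, so each view exposes a different portion of the salient regions.

```python
def greedy_masked_views(attention, num_views, keep_frac=0.25):
    """Toy sketch: partition the most-attended patches into disjoint
    visible sets, one per masked view (illustrative, not the paper's
    exact greedy assignment)."""
    num_patches = len(attention)
    keep = max(1, int(num_patches * keep_frac))  # visible patches per view
    # Rank patch indices by attention score, most salient first.
    ranked = sorted(range(num_patches), key=lambda i: attention[i], reverse=True)
    views = [set() for _ in range(num_views)]
    # Greedily deal the top patches round-robin, so the views are
    # disjoint and each gets a share of the salient regions.
    for rank, idx in enumerate(ranked[: keep * num_views]):
        views[rank % num_views].add(idx)
    return views

# Example: 8 patches, attention concentrated on patches 2 and 5.
att = [0.05, 0.1, 0.9, 0.1, 0.05, 0.8, 0.1, 0.1]
views = greedy_masked_views(att, num_views=2, keep_frac=0.25)
# Each view keeps a disjoint set of patches; everything else is masked.
```

In a real ViT the attention scores would come from the model's own self-attention maps rather than a hand-written list.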
“If the machine makes the same prediction on the majority of masked views, then the image is deemed to be reliable enough to be incorporated into automated learning,” said Hoffman.
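The reliability test Hoffman describes can be sketched as a simple majority vote over the model's predictions on the masked views. This is a hedged illustration, assuming a plain majority threshold; the function name and the `threshold` parameter are hypothetical, and the paper's exact consistency criterion may differ.

```python
from collections import Counter

def is_reliable(view_predictions, threshold=0.5):
    """Deem an image reliable for self-training if more than `threshold`
    of the masked views agree on the same predicted class
    (illustrative majority-vote sketch)."""
    counts = Counter(view_predictions)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(view_predictions) > threshold

# Three of four masked views agree: reliable enough to learn from.
is_reliable(["dog", "dog", "dog", "cat"])   # True
# No class wins a majority: too risky to use for automated learning.
is_reliable(["dog", "cat", "bird", "cat"])  # False
```

Images that pass the check would be added to the training set; the rest are set aside rather than risk corrupting the model with wrong labels.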
As the computer gets smarter at learning on its own, it can apply what it has learned from known sets of images to unknown ones, creating new opportunities for visual domain adaptation.
The team calls its technique Probing Attention-Conditioned Masking Consistency (PACMAC). The work and code are detailed in the paper Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency, published in the proceedings of the 2022 Neural Information Processing Systems (NeurIPS) conference, held Nov. 28 – Dec. 9.
The authors, all from the School of Interactive Computing, include Prabhu, Sriram Yenamandra, Aaditya Singh, and Hoffman. The work was supported in part by funding from the DARPA LwLL project and ARL. The researchers’ views and statements are based on their findings and do not necessarily reflect those of the funding agencies.