Artificial Intelligence · 2020-10-15

Microsoft AI system can now give captions to photographs


The onslaught of artificial intelligence (AI) on content continues. Microsoft has announced that its researchers have developed an AI-based system that can generate captions for images. Microsoft boasts that these captions are, in many cases, more accurate than the descriptions people write for photographs.

In the announcement made on the official Microsoft blog, author John Roach said the breakthrough came in a benchmark challenge, and was a “milestone in Microsoft’s push to make its products and services inclusive and accessible to all users.”

“Image captioning is one of the core computer vision capabilities that can enable a broad range of services,” said Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services in Redmond, Washington.

Novel object captioning

Image captioning is a core challenge in the discipline of computer vision, one that requires an AI system to understand and describe the content, or action, in an image, explained Lijuan Wang, a principal research manager in Microsoft’s research lab in Redmond.

The new model is now available to customers via the Azure Cognitive Services Computer Vision offering, which is part of Azure AI, enabling developers to use this capability to improve accessibility in their own services. It is also being incorporated into Seeing AI, and will start rolling out later this year in Microsoft Word and Outlook, for Windows and Mac, and in PowerPoint for Windows, Mac and web.
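For developers, a call to the service looks roughly like the following Python sketch against the Computer Vision “describe” REST operation (v3.1 at the time of writing). The endpoint and subscription key are placeholders, and the exact response shape may vary by API version, so treat this as an illustration rather than a reference implementation.

    # Request a machine-generated caption from Azure Cognitive Services
    # Computer Vision. ENDPOINT and KEY are placeholders for your own
    # resource; the API version available to you may differ from v3.1.
    import requests

    ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
    KEY = "<your-subscription-key>"

    def caption_image(image_url: str) -> str:
        resp = requests.post(
            f"{ENDPOINT}/vision/v3.1/describe",
            params={"maxCandidates": 1},
            headers={"Ocp-Apim-Subscription-Key": KEY,
                     "Content-Type": "application/json"},
            json={"url": image_url},
        )
        resp.raise_for_status()
        captions = resp.json()["description"]["captions"]
        return captions[0]["text"] if captions else ""

    print(caption_image("https://example.com/photo.jpg"))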

Wang led the research team that achieved – and beat – human parity on the novel object captioning at scale, or nocaps, benchmark. The benchmark evaluates AI systems on how well they generate captions for objects in images that are not in the dataset used to train them.

Image captioning systems are typically trained with datasets that contain images paired with sentences that describe the images, essentially a dataset of captioned images.
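As a concrete illustration, one training pair in such a dataset might look like the following; the file name and sentences here are invented, not taken from any real dataset.

    # A single captioned-image training pair; captioning models are
    # trained on many thousands of examples shaped like this.
    captioned_pair = {
        "image": "photos/kitchen_001.jpg",
        "captions": [
            "A man slicing an apple on a wooden cutting board.",
            "Someone preparing fruit in a kitchen.",
        ],
    }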

The use of image captioning to generate a photo description, known as alt text, in a Web page or document is especially important for people who are blind or have low vision, noted Saqib Shaikh, a software engineering manager with Microsoft’s AI platform group in Redmond.

For example, his team is using the improved image captioning capability in the Seeing AI talking camera app for people who are blind or have low vision. The app uses image captioning to describe photos, including those from social media apps.

“The nocaps challenge is really how are you able to describe those novel objects that you haven’t seen in your training data?” added Wang.

To meet the challenge, the Microsoft team pre-trained a large AI model with a rich dataset of images paired with word tags, with each tag mapped to a specific object in an image.

Datasets of images with word tags instead of full captions are more efficient to create, which allowed Wang’s team to feed lots of data into their model. The approach imbued the model with what the team calls a visual vocabulary.

The visual vocabulary pre-training approach, Huang explained, was similar to prepping children to read by first using a picture book that associated individual words with images, such as a picture of an apple with the word “apple” beneath it and a picture of a cat with the word “cat” beneath it.

“This visual vocabulary pre-training essentially is the education needed to train the system; we are trying to educate this motor memory,” Huang said.
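The mechanics of that pre-training stage can be sketched as a masked-tag prediction task: hide one of an image’s tags and ask the model to recover it from the image plus the remaining tags. The PyTorch toy below illustrates the idea only; the model, vocabulary and feature dimensions are invented, and this is not Microsoft’s actual training code.

    import torch
    import torch.nn as nn

    # Toy tag vocabulary; real systems cover thousands of object types.
    VOCAB = ["[MASK]", "man", "apple", "knife", "cat", "dog"]
    TAG2ID = {t: i for i, t in enumerate(VOCAB)}

    class TagPredictor(nn.Module):
        """Predicts a hidden tag from image features plus visible tags."""
        def __init__(self, feat_dim=64, embed_dim=32):
            super().__init__()
            self.tag_embed = nn.Embedding(len(VOCAB), embed_dim)
            self.image_proj = nn.Linear(feat_dim, embed_dim)
            self.out = nn.Linear(embed_dim, len(VOCAB))

        def forward(self, image_feats, tag_ids):
            # Pool image-region features and visible-tag embeddings into
            # one context vector, then score every tag in the vocabulary.
            ctx = (self.image_proj(image_feats).mean(dim=0)
                   + self.tag_embed(tag_ids).mean(dim=0))
            return self.out(ctx)

    model = TagPredictor()
    image_feats = torch.randn(10, 64)      # stand-in for region features
    visible = ["man", "[MASK]", "knife"]   # "apple" is hidden
    tag_ids = torch.tensor([TAG2ID[t] for t in visible])
    logits = model(image_feats, tag_ids)
    loss = nn.functional.cross_entropy(logits.unsqueeze(0),
                                       torch.tensor([TAG2ID["apple"]]))
    loss.backward()  # an optimizer step would complete one training step

Repeated over a large corpus of image-tag pairs, this kind of objective is what teaches a model which words go with which visual patterns, that is, the visual vocabulary.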

The pre-trained model is then fine-tuned for captioning on the dataset of captioned images. In this stage of training, the model learns how to compose a sentence. When presented with an image containing novel objects, the AI system leverages the visual vocabulary to generate an accurate caption.

“It combines what is learned in both the pre-training and the fine-tuning to handle novel objects in the testing,” Wang said.
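At inference time, captioning of this kind is typically autoregressive: the fine-tuned model emits one word at a time, conditioned on the image and the words chosen so far, until it produces an end token. The toy greedy decoder below shows those mechanics; the scoring function is a scripted stand-in, not a trained model.

    import torch

    END = "[END]"
    WORDS = ["a", "man", "slicing", "an", "apple", END]

    def next_word_logits(image_feats, prefix):
        # Hypothetical stand-in for the fine-tuned captioner: a real model
        # would score the whole vocabulary given the image and the prefix.
        # Here we follow a fixed script so the decoding loop is visible.
        script = ["a", "man", "slicing", "an", "apple", END]
        target = script[min(len(prefix), len(script) - 1)]
        logits = torch.full((len(WORDS),), -1e9)
        logits[WORDS.index(target)] = 0.0
        return logits

    def greedy_caption(image_feats, max_len=12):
        words = []
        while len(words) < max_len:
            logits = next_word_logits(image_feats, words)
            word = WORDS[int(logits.argmax())]
            if word == END:
                break
            words.append(word)
        return " ".join(words)

    print(greedy_caption(torch.randn(10, 64)))  # "a man slicing an apple"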

When evaluated on nocaps, the AI system created captions that were more descriptive and accurate than the captions for the same images that were written by people, according to results presented in a research paper.

According to Microsoft, the new image captioning system is also twice as good as the image captioning model that has been used in Microsoft products and services since 2015, as measured on another industry benchmark.

“In the last five years,” Huang said, “we have achieved five major human parities: in speech recognition, in machine translation, in conversational question answering, in machine reading comprehension, and in 2020, in spite of COVID-19, we got the image captioning human parity.”

via: Microsoft
