What's New On The Net

Facebook ‘Rosetta’ understands text in images & videos – AI

Facebook (FB) has given the general public the inside dope on how ‘Rosetta’, its Machine Learning (ML) based data extractor, mines text from images posted on FB. (Incidentally, the real Rosetta Stone carries the same decree in 3 different scripts, hieroglyphic, Demotic & ancient Greek, & proved to be the key to deciphering Egyptian hieroglyphs.)

So why does FB need to “understand” the text that appears in images? Here is its reasoning:

The FB team of AI specialists wrote on its official coding blog that a significant number of the photos shared on FB & Instagram contain text in various forms. This could be text overlaid on an image or in a meme. But it doesn’t stop there. It could also be text that is part of the scene itself, as in a photo of a storefront, a street sign, or even a restaurant menu.

The problem of understanding all image text is compounded by the wide variety of forms & languages in which that text appears, & by the sheer number of images involved.

All of this makes the problem of understanding text in images very different from the problems solved by traditional optical character recognition (OCR) systems. OCR systems, FB said, recognize the characters but don’t understand the context of the associated image.

Enter Rosetta, a large-scale ML system. It extracts text from more than a billion public Facebook & Instagram images & video frames daily, in real time & in a wide variety of languages, & feeds it into a text recognition model that has been trained to understand the context of the text & the image together.

The Process

Rosetta extracts text from an image in 2 independent steps: detection & recognition. In the 1st, it detects rectangular regions that potentially contain text. In the 2nd, it performs text recognition: for each detected region, it uses a convolutional neural network (CNN) to recognize & transcribe the word in that region. For text detection, FB has adopted an approach based on Faster R-CNN, a state-of-the-art object detection network.
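For readers who like to see the shape of such a pipeline, here is a toy Python sketch of the detect-then-recognize split. It is only an illustration under loud assumptions, not FB's code: the "image" is a grid of characters, the detector simply cuts on blank columns (a stand-in for Faster R-CNN), the recognizer looks up exact glyph templates (a stand-in for the CNN) & it handles one character per region rather than a whole word. All names (Box, detect_regions, recognize_region, extract_text, TEMPLATES) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Image = List[str]            # toy image: rows of '#' (ink) and '.' (background)
Glyph = Tuple[str, ...]      # a cropped region, one string per row


@dataclass(frozen=True)
class Box:
    top: int
    left: int
    bottom: int  # exclusive
    right: int   # exclusive


# Hypothetical glyph templates; a real system learns these from data.
TEMPLATES: Dict[str, Glyph] = {
    "H": ("#.#", "###", "#.#"),
    "I": ("###", ".#.", "###"),
}


def detect_regions(image: Image) -> List[Box]:
    """Step 1 (stand-in for Faster R-CNN): propose rectangles that may hold text.

    Each run of columns containing ink becomes one tightly cropped box."""
    height, width = len(image), len(image[0])
    inked = [any(image[r][c] == "#" for r in range(height)) for c in range(width)]
    boxes: List[Box] = []
    c = 0
    while c < width:
        if not inked[c]:
            c += 1
            continue
        start = c
        while c < width and inked[c]:
            c += 1
        rows = [r for r in range(height) if "#" in image[r][start:c]]
        boxes.append(Box(rows[0], start, rows[-1] + 1, c))
    return boxes


def recognize_region(image: Image, box: Box) -> str:
    """Step 2 (stand-in for the CNN recognizer): transcribe one detected region."""
    crop = tuple(row[box.left:box.right] for row in image[box.top:box.bottom])
    for label, glyph in TEMPLATES.items():
        if glyph == crop:
            return label
    return "?"  # unknown region


def extract_text(image: Image) -> List[Tuple[Box, str]]:
    """Run detection first, then recognition on each detected box."""
    return [(box, recognize_region(image, box)) for box in detect_regions(image)]


IMAGE: Image = [
    ".........",
    ".#.#.###.",
    ".###..#..",
    ".#.#.###.",
    ".........",
]

if __name__ == "__main__":
    for box, text in extract_text(IMAGE):
        print(box, text)
```

The point of the split is that the 2 steps stay independent: either stand-in could be swapped for a stronger model without touching the other.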

We will not go into the finer technical details here, as they get complicated for a lay reader.

What Is All This For?

FB claims Rosetta has been widely adopted by various products & teams within Facebook & Instagram. Text extracted from images is being used as a feature in various downstream ML models, such as those that improve the relevance & quality of photo search, automatically identify content that violates FB's hate-speech policy in various languages, & improve the accuracy of photo classification in News Feed to surface more personalized content.

What Next?

The FB team says the next challenge for Rosetta is video as a source of content.

Text on images comes in a wide variety of forms with very little structure: simple horizontal overlaid text in memes; rotated, warped, obfuscated, or otherwise distorted text; or scene-text in photographs of storefronts or street signs. Moreover, the patterns of text on images on Facebook tend to change rapidly, making this an ongoing challenge.

But extracting text efficiently from videos is an even bigger challenge, FB says. The naive approach of applying image-based text extraction to every single video frame is not scalable, because of the massive growth of videos on the platform, & would only waste computational resources. Recently, 3D convolutions have been gaining wide adoption given their ability to model the temporal domain in addition to the spatial domain. The FB AI team says it is beginning to explore ways to apply 3D convolutions for smarter selection of video frames of interest for text extraction.
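To give a feel for what a 3D convolution adds, here is a small pure-Python sketch. FB has not described its method, so everything here is an assumption for illustration: the kernel is set by hand (a real network would learn its weights), the video is a tiny grid of numbers, & the names conv3d_valid, CHANGE_KERNEL & select_frames are hypothetical. The kernel spans 2 frames in time & a 2x2 patch in space, so it responds only where the picture changes from one frame to the next, & only those frames are kept for text extraction.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as video[t][y][x]


def conv3d_valid(video: Video, kernel: Video) -> Video:
    """Slide a 3D kernel over time, height & width with no padding.

    As in most deep-learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Video = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * video[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hand-set kernel (assumption): average a 2x2 patch, then subtract the
# previous frame's average from the current frame's average.
CHANGE_KERNEL: Video = [
    [[-0.25, -0.25], [-0.25, -0.25]],  # earlier frame
    [[0.25, 0.25], [0.25, 0.25]],      # later frame
]


def select_frames(video: Video, threshold: float) -> List[int]:
    """Keep frame 0, plus every frame whose change score exceeds the threshold."""
    responses = conv3d_valid(video, CHANGE_KERNEL)
    selected = [0]
    for t, plane in enumerate(responses):
        score = sum(abs(v) for row in plane for v in row)
        if score > threshold:
            selected.append(t + 1)  # response t compares frames t and t + 1
    return selected


BLANK = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
WITH_TEXT = [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
VIDEO: Video = [BLANK, BLANK, WITH_TEXT, WITH_TEXT]

if __name__ == "__main__":
    print(select_frames(VIDEO, threshold=0.5))
```

In this toy clip only the opening frame & the frame where something new appears would go on to text extraction; the 2 unchanged frames are skipped, which is the saving the naive frame-by-frame approach misses.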

Image Credit: Facebook

 
