Facebook ‘Rosetta’ understands text in images & videos – AI

Facebook (FB) has given the general public the inside story on how its Machine Learning (ML) based text extractor ‘Rosetta’ works to mine text from images posted on FB. (Incidentally, the real Rosetta Stone carries the same decree in 3 different scripts, Egyptian hieroglyphs, Demotic & Ancient Greek, & proved to be the key to deciphering Egyptian hieroglyphs.)

So why does it need to “understand” the text that appears on images? Here are some of the reasons:

  • Improving user experiences, such as a more relevant photo search or the incorporation of text into screen readers that make Facebook more accessible for the visually impaired.
  • Understanding text in images along with the context in which it appears also helps FB’s systems proactively identify inappropriate or harmful content & keep its community safe.

The FB team of AI specialists wrote on its official coding blog that a significant number of the photos shared on FB & Instagram contain text in various forms. This could be text overlaid on an image or in a meme. But it doesn’t stop there. It could also be text that is part of the scene in a photo of a storefront, a street sign, or even a restaurant menu.

The problem of understanding all image text is compounded by:

  • The sheer volume of photos shared each day
  • The number of languages supported
  • The variations of the text

All of this makes the problem of understanding text in images very different from the problems solved by traditional optical character recognition (OCR) systems. OCR systems, said FB, recognize the characters but don’t understand the context of the associated image.

Enter Rosetta, a gigantic ML system. It extracts text from more than a billion public Facebook & Instagram images & video frames (in a wide variety of languages) daily & in real time, & feeds it into a text recognition model that has been trained to understand the context of the text & the image together.

Facebook Rosetta

The Process

Rosetta extracts text from an image in 2 independent steps – detection & recognition. In the 1st, it detects rectangular regions that potentially contain text. In the 2nd, it performs text recognition: for each detected region, it uses a convolutional neural network (CNN) to recognize & transcribe the word in the region. For text detection, FB has adopted an approach based on Faster R-CNN, a state-of-the-art object detection network.
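To make the 2-step flow concrete, here is a minimal Python sketch of a detect-then-recognize pipeline. It is not FB’s code: the "image" is just a grid of characters, & `toy_detect` & `toy_recognize` are hypothetical stand-ins for the Faster R-CNN detector & the CNN recognizer. Only the shape of the pipeline (find regions first, then transcribe each region separately) mirrors the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[str]  # toy "image": rows of characters, a space is background


@dataclass(frozen=True)
class Box:
    row: int    # row of the region
    col: int    # left-most column of the region
    width: int  # number of cells the region spans


def toy_detect(image: Image) -> List[Box]:
    """Stand-in for step 1 (detection): one box per run of non-space cells."""
    boxes = []
    for r, line in enumerate(image):
        c = 0
        while c < len(line):
            if line[c] != " ":
                start = c
                while c < len(line) and line[c] != " ":
                    c += 1
                boxes.append(Box(row=r, col=start, width=c - start))
            else:
                c += 1
    return boxes


def toy_recognize(image: Image, box: Box) -> str:
    """Stand-in for step 2 (recognition): transcribe the cropped region."""
    return image[box.row][box.col:box.col + box.width]


def extract_text(
    image: Image,
    detect: Callable[[Image], List[Box]] = toy_detect,
    recognize: Callable[[Image, Box], str] = toy_recognize,
) -> List[Tuple[Box, str]]:
    """Run detection once, then recognition on every detected region."""
    return [(box, recognize(image, box)) for box in detect(image)]
```

Calling `extract_text(["  SALE  50%", "", "OPEN"])` yields three regions transcribed as "SALE", "50%" & "OPEN". Because the 2 steps only meet at the list of boxes, either stand-in could be swapped for a real model without touching the other.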

We will not report on the further technicalities involved, as they get complicated for a lay reader.

What Is All Of This For?

FB claims Rosetta has been widely adopted by various products & teams within Facebook & Instagram. Text extracted from images is being used as a feature in various downstream ML models, such as those that improve the relevance & quality of photo search, automatically identify content that violates its hate-speech policy on the platform in various languages, & improve the accuracy of classification of photos in News Feed to surface more personalized content.

What Next?

The FB team says the next challenge for Rosetta is videos as a source of content.

Text on images comes in a wide variety of forms with very little structure: simple horizontal overlaid text in memes; rotated, warped, obfuscated, or otherwise distorted text; or scene-text in photographs of storefronts or street signs. Moreover, the patterns of text on images on Facebook tend to change rapidly, making this an ongoing challenge.

But extracting text efficiently from videos is an even bigger challenge, FB says. The naive approach of applying image-based text extraction to every single video frame is not scalable, because of the massive growth of videos on the platform, & would only lead to wasted computational resources. Recently, 3D convolutions have been gaining wide adoption given their ability to model the temporal domain in addition to the spatial domain. The FB AI team says it is beginning to explore ways to apply 3D convolutions for smarter selection of video frames of interest for text extraction.
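FB has not said how it would apply 3D convolutions here, so the following Python sketch is only a toy showing what "modeling the temporal domain" can mean. It assumes tiny grayscale frames stored as nested lists, & it uses a hand-set kernel (next frame minus current frame) instead of a learned one. The names `conv3d_valid` & `select_frames` & the `threshold` parameter are hypothetical.

```python
from typing import List

Video = List[List[List[float]]]  # indexed [time][row][col], toy grayscale frames


def conv3d_valid(video: Video, kernel: Video) -> Video:
    """Plain 3D cross-correlation (no padding, stride 1), the CNN-style 'convolution'."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for r in range(H - kh + 1):
            row = []
            for c in range(W - kw + 1):
                acc = 0.0
                # The kernel spans time (dt) as well as space (dr, dc).
                for dt in range(kt):
                    for dr in range(kh):
                        for dc in range(kw):
                            acc += kernel[dt][dr][dc] * video[t + dt][r + dr][c + dc]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def select_frames(video: Video, threshold: float) -> List[int]:
    """Return indices of frames that changed enough to be worth text extraction."""
    # Hand-set 2x1x1 kernel (an assumption, not learned): next pixel minus current pixel.
    kernel = [[[-1.0]], [[1.0]]]
    response = conv3d_valid(video, kernel)
    keep = [0]  # always process the first frame
    for t, plane in enumerate(response):
        change = sum(abs(v) for row in plane for v in row)
        if change > threshold:
            keep.append(t + 1)  # frame t+1 differs noticeably from frame t
    return keep
```

On a 5-frame clip where only the 3rd & 5th frames differ from their predecessors, `select_frames` returns the 1st, 3rd & 5th frame indices, so the text extractor would run on 3 frames instead of 5. A real system would presumably learn its kernels & score "frames of interest" in a richer way than raw pixel change.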

Image Credit: Facebook

 
