How Search Engines Cope With Real-Time Data

Real-time search took a big step forward when Google added Twitter and Facebook data to its results. The move supports the growing belief that real-time search is becoming a necessity, especially given the rising volume of information from Twitter, the dominant source of real-time content.

Microblogging services like Twitter — with an estimated 12 million registered users — have completely changed the way the Web is indexed, and have raised the stakes for real-time search. Traditional search engines crawl the Web by methodically following links between billions of pages and then indexing the pages’ content.

"Real-time search flips that model on its head," says Tobias Peggs, general manager of OneRiot, a real-time search engine company. "Once an entirely new index is created, the next challenge is creating and completing a new ranking system for that information."

In addition to efficiently collecting and indexing real-time data, there are other challenges hindering real-time search: identifying and ranking valuable content, and filtering out noise.

There’s more: no standard protocol exists for delivering real-time data. To bring its real-time search engine onto the scene in May 2009, Scoopler hacked existing solutions and built its own search architecture to index real-time data. "Various methods from RST, HTTP polling, long-polling, XMPP to plain Web scraping are used," says Scoopler co-founder Dilan Jayawardane. "Once the data is collected, the proper tools are needed to index it."
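Of the collection methods Jayawardane lists, plain HTTP polling is the simplest to picture: ask the source at a fixed interval for anything newer than the last item seen. The Python sketch below is only an illustration of that idea, not Scoopler's code. The `fetch` callback, the numeric `id` field on each item and the `max_polls` limit are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterator, List


def poll(
    fetch: Callable[[int], List[Dict]],
    interval_seconds: float = 5.0,
    max_polls: int = 3,
) -> Iterator[Dict]:
    """Yield new items by calling fetch(last_seen_id) at a fixed interval.

    fetch is a hypothetical callback that returns items whose "id" is
    greater than the id passed in; a real one would wrap an HTTP request.
    """
    last_seen_id = 0
    for attempt in range(max_polls):
        for item in fetch(last_seen_id):
            # Remember the newest id so the next poll skips items already seen.
            last_seen_id = max(last_seen_id, item["id"])
            yield item
        # Wait between polls, but not after the final one.
        if attempt < max_polls - 1:
            time.sleep(interval_seconds)
```

The weakness of this approach is visible in the loop itself: new items wait up to a full interval before they are seen, and most requests return nothing. Long-polling and XMPP reduce that delay by keeping a connection open.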

Scoopler focuses on links, videos and images that are shared across real-time channels and identifies the content that is the most valuable. To determine the rank attributed to an item, it uses various metrics such as the number of trending words in a given link, how many people have shared and re-shared it, the user ranking of the people who share it, and the time since the item was created.
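As a rough illustration only, the four signals listed above could be folded into a single score along these lines. This Python sketch is not Scoopler's algorithm: the `SharedItem` fields, the logarithmic damping, the one-hour half-life and the way the terms are multiplied together are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class SharedItem:
    trending_words: int      # trending terms found in the linked content
    shares: int              # shares plus re-shares
    avg_sharer_rank: float   # mean user rank of the sharers, scaled to 0..1
    age_seconds: float       # time since the item was created


def score(item: SharedItem, half_life_seconds: float = 3600.0) -> float:
    """Combine the four signals into one number; higher means more valuable."""
    # Log-damp the raw counts so one viral item does not swamp everything else.
    popularity = math.log1p(item.shares)
    topicality = math.log1p(item.trending_words)
    # Exponential decay: the score halves every half_life_seconds.
    freshness = 0.5 ** (item.age_seconds / half_life_seconds)
    # Sharer rank acts as a multiplier between 0.5 and 1.5.
    return (popularity + topicality) * (0.5 + item.avg_sharer_rank) * freshness
```

Under these assumptions an item loses half its score for every hour of age, so an older link needs far more shares than a fresh one to hold the same position.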

OneRiot’s method of indexing the Web is similarly real-time and social. The company indexes the content behind a link, whether the link has been Tweeted, or shared elsewhere on the social Web. Its search results focus on the content that the social Web is buzzing about, in addition to the conversation. OneRiot invented PulseRank to drive the real-time ranking of its search results.

Peggs describes PulseRank as PageRank for the real-time Web. If Google’s PageRank reflects the historical dependability of a Web page, then PulseRank reflects its current social buzz. "PulseRank algorithm looks at dozens of factors that give ‘weight’ to certain results in real time. It also helps with pervasive challenges to real-time search like spam," Peggs says. The factors include, but are not limited to, freshness, domain authority, people authority, and acceleration, he adds.

Filtering out spam in real time while making sure valuable data isn’t lost is another key issue. Traditional Bayesian spam filtering, which weighs the words in a message as evidence, isn’t expected to work well on a 140-character string that contains so few of them.
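To see why message length matters, consider a toy word-count Bayesian filter. The Python sketch below is a generic two-class naive Bayes classifier with add-one smoothing, written for illustration; the class name and its methods are hypothetical and it is not any search engine's actual filter. It assumes at least one training message per class.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Two-class word-count naive Bayes with add-one smoothing."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_probability(self, text: str) -> float:
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        # Start from the prior odds of spam versus ham.
        log_odds = math.log(self.docs["spam"]) - math.log(self.docs["ham"])
        for word in text.lower().split():
            for label, sign in (("spam", 1), ("ham", -1)):
                total = sum(self.counts[label].values())
                # Add-one smoothing, with one extra slot for unseen words.
                p = (self.counts[label][word] + 1) / (total + len(vocab) + 1)
                log_odds += sign * math.log(p)
        # Convert log-odds back to a probability.
        return 1 / (1 + math.exp(-log_odds))
```

Each word nudges the odds one way or the other. A long email supplies hundreds of nudges, while a 140-character message supplies perhaps a dozen, many of them abbreviations, hashtags or shortened links the filter has never seen. With so little evidence the result tends to stay close to the prior.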

Adam Bunn, head of SEO at search engine marketing and optimization company Greenlight, says that major search engines must keep their user bases happy and uphold their reputations. How? By making deals with Twitter. Twitter supplies many real-time search platforms through Firehose, an API that both Google and Microsoft access through signed deals with the company. Some of those platforms confine themselves solely to Twitter data, while others combine it with data from other microblogging platforms.

"Search engines need to consider whether their infrastructure can deal with this data," Bunn says. "If not, they need to make it more efficient and better able to deal with the rapid acquisition of data."