Artificial Intelligence and Machine Learning News

New Search Challenges and Opportunities

If search engines can extract more meaning from text and better understand what people are looking for, the Web's resources could be accessed more effectively.

By Neil Savage

Posted Jan 1 2010

Figure. Like many other computer scientists, the University of Washington's Oren Etzioni is developing new tools for searching the Web's growing morass of text, images, and other content.

The Web is a huge, dynamic landscape of information, and navigating through it is not an easy task. There are billions of Web pages, and the type of content is expanding dramatically, with blogs and Twitter feeds, maps and videos, photos and podcasts. People, typing on a computer in their cubicle or using their smartphone on a street corner, are trying to sift through this growing morass of data, looking for everything from car repair advice to a nearby Thai restaurant that’s not too expensive. For search engines, this enormous variety of data and formats is providing both new challenges and new opportunities.

“The ability to produce information and store information has far outstripped human cognitive capacity, which is basically fixed,” says Oren Etzioni, a professor of computer science and engineering at the University of Washington. “The haystack keeps getting bigger. Obviously we need better and better tools to find the proverbial needles.”

Today’s search engines do a fine job of cataloging text, counting links, and delivering lists of pages relevant to a user’s search topic. But in the coming decade, Etzioni believes, search will move beyond keyword queries and automate the time-consuming task of sifting through those documents. With a better understanding both of what documents mean and what searchers are looking for, he predicts, some tasks could be reduced from hours to minutes.

Etzioni is attempting to get more information out of text using a technique called open information extraction, which is built on a long-used technology that examines natural language text and tries to derive data about the relationships between words. An algorithm looks for triples, which follow the structure of entity-relationship-entity, such as “Beijing is the capital of China” or “Franz Kafka was born in Prague.” The system is open because it derives the relations from the structure of the language rather than relying on hand-labeled examples of relationships, which would not be scalable to the Web as a whole.
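The entity-relationship-entity structure can be illustrated with a toy extractor. The sketch below is an assumption-laden illustration, not Etzioni's method: it hard-codes two relation phrases in a regular expression (the name RELATION_PATTERN and the function extract_triple are made up for this example), whereas an open extractor derives relations from the structure of the language instead of from a fixed list.

```python
import re

# Hypothetical relation phrases, listed by hand for this illustration only.
# An open information extraction system would not rely on such a fixed list.
RELATION_PATTERN = re.compile(
    r"^(?P<arg1>.+?)\s+(?P<rel>is the capital of|was born in)\s+(?P<arg2>.+?)\.?$"
)

def extract_triple(sentence):
    """Return an (entity, relationship, entity) tuple, or None if nothing matches."""
    match = RELATION_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (match.group("arg1"), match.group("rel"), match.group("arg2"))

sentences = [
    "Beijing is the capital of China.",
    "Franz Kafka was born in Prague.",
    "The haystack keeps getting bigger.",  # no known relation phrase
]
triples = [extract_triple(s) for s in sentences]
```

The first two sentences yield triples; the third yields None because it contains neither relation phrase, which is exactly the limitation an open approach is meant to avoid.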

Etzioni developed a program called TextRunner that uses a general model of language to assign labels to words in a sentence, then calculates the beginning and end of strings of words that contain the entity-relationship-entity structure. It extracts those triples so they can be indexed and searched. A searcher who asks “Where was Kafka born?” should quickly receive a precise answer, not just a list of pages that contain the words “Kafka” and “born.” Given the vast number of Web pages, Etzioni says, the search engine should also be able to notice errors: a single page saying Kafka’s birthplace is Peking, for example, is less likely to be correct than the tens of thousands that say Prague.
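One simple way to picture how redundancy exposes errors is to count how many pages assert each candidate answer. The following is a minimal sketch under stated assumptions, not a description of TextRunner: the page counts are invented, the function name answer is hypothetical, and plain majority voting stands in for whatever scoring a real system uses.

```python
from collections import Counter

# Invented extractions standing in for triples pulled from many pages.
extracted = (
    [("Franz Kafka", "was born in", "Prague")] * 5
    + [("Franz Kafka", "was born in", "Peking")]
    + [("Beijing", "is the capital of", "China")] * 3
)

def answer(entity, relationship, triples):
    """Return (most asserted value, share of assertions), or None if no triple matches."""
    votes = Counter(
        obj for subj, rel, obj in triples
        if subj == entity and rel == relationship
    )
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value, count / sum(votes.values())

# The lone conflicting assertion is outvoted by the more frequent one.
best, support = answer("Franz Kafka", "was born in", extracted)
```

Returning the share of assertions alongside the answer lets a caller treat weakly supported answers with more caution.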

It’s more challenging for a computer to extract more subjective data from text, such as judgments about hotels or movies, but a well-designed algorithm can figure out cues, such as which descriptive phrase is stronger: clean, almost spotless, or sparkling. It should be able to distinguish the positive (“The room was nice and quiet”) from the negative (“I was disappointed the room wasn’t quieter”).
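A lexicon-based scorer gives a rough feel for how such cues might be weighed. Everything in this sketch is an assumption made for illustration: the strength values in LEXICON, the DAMPENERS table, and the resulting ordering of the three phrases are hand-picked, and a real system would more likely learn them from data.

```python
import re

# Hypothetical, hand-assigned strengths (positive above zero, negative below).
LEXICON = {
    "clean": 1.0,
    "spotless": 3.0,
    "sparkling": 3.0,
    "nice": 1.0,
    "quiet": 1.0,
    "disappointed": -2.0,
}
# Hypothetical modifiers that weaken the next scored word.
DAMPENERS = {"almost": 0.5}

def score(text):
    """Sum the strengths of known words, scaling a word that follows a dampener."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    weight = 1.0
    for token in tokens:
        if token in DAMPENERS:
            weight = DAMPENERS[token]
            continue
        total += weight * LEXICON.get(token, 0.0)
        weight = 1.0  # a dampener applies only to the word right after it
    return total

positive = score("The room was nice and quiet")
negative = score("I was disappointed the room wasn't quieter")
```

With these made-up weights, "clean" scores below "almost spotless", which scores below "sparkling". The negative sentence is caught only because "disappointed" is in the lexicon; the scorer does not actually understand "wasn't quieter", which hints at why this problem is hard.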

Blog and Twitter Searches

One growing area that poses new challenges for search engines is social media, such as blogs, Twitter feeds, and Facebook status updates. “I don’t think we have really good blog search yet,” says Marti A. Hearst, a professor in the University of California, Berkeley School of Information. Along with Microsoft researchers Susan T. Dumais and Matthew Hurst, Hearst says blog search should be able to accomplish three tasks: find out what people are thinking about a certain topic over time; suggest blogs that are good to read for their style, personality, and other criteria; and find useful information in older blog posts, along the lines of standard search of more static documents.

Blog search needs to take into account the differences between blogs and traditional documents, such as the former’s use of more informal language, their different link topology, the importance of timeliness, and the fact that updates tend to not be full HTML pages. Blog search must also take into account that much of the information on blogs is subjective.

To accomplish these tasks, search engine designers look for representations of features that might belong to a particular class of posting, such as the readability level of a page. Machine learning algorithms can then figure out that particular distributions of features may be characteristic of a certain class.

“If you have labeled data and examples of things that you think have a particular attribute, then you can use that to find something similar,” says Dumais, principal researcher in Microsoft Research’s Adaptive Systems and Interaction Group. But rating postings as positive or negative, or figuring out whether they’re aimed at an older or younger audience or have a left-leaning, right-leaning, or middle-of-the-road viewpoint, is challenging, she says. “They do involve a richer understanding of language than most search engines have,” Dumais notes.
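The idea of using labeled examples to find similar postings can be sketched with a nearest-centroid classifier over two crude readability features. This is an illustration under assumptions, not the method used by any search engine named here: the features, the "casual" and "formal" labels, the example texts, and the function names are all invented.

```python
import math
import re

def features(text):
    """Two rough readability features: words per sentence and letters per word."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return (len(words) / len(sentences),
            sum(len(w) for w in words) / len(words))

def centroids(labeled):
    """Average the feature vectors of the labeled examples, per label."""
    sums = {}
    for text, label in labeled:
        f = features(text)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), count + 1)
    return {label: (t[0] / n, t[1] / n) for label, (t, n) in sums.items()}

def classify(text, model):
    """Pick the label whose centroid is closest to the text's features."""
    f = features(text)
    return min(model, key=lambda label: math.dist(f, model[label]))

# Invented labeled examples.
labeled = [
    ("So tired. Cats are fun. Go team!", "casual"),
    ("Lol that was bad. Need food now.", "casual"),
    ("The committee reviewed several proposals regarding infrastructure investment.", "formal"),
    ("Researchers published comprehensive findings about language understanding systems.", "formal"),
]
model = centroids(labeled)
guess = classify("Great game. So fun!", model)
```

Attributes such as sentiment or political viewpoint would need far richer features than these two, which is the difficulty Dumais describes.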

Twitter use has grown explosively in recent months, and in October the company made a deal to open its data to Microsoft’s search engine. Dumais says that, with its 140-character limit leading to creative abbreviations of words and condensed hyperlinks, searching Twitter will pose some interesting challenges. But once those are tackled, Twitter users should be able to conduct more refined searches than the service currently allows, while the flow of Twitter data provides search designers with new information that may make search richer. “The volume of the content [on the Web] is actually very useful for some types of algorithms,” Dumais says.

One useful fact is that people with Twitter feeds and Facebook pages are making public a lot of information about themselves that search engines can use to better understand their search queries. Just as search can be improved through a deeper understanding of what documents mean, it can also improve through a better grasp of the searcher’s intentions. “The real issue with a search engine is not just to serve up results, but to help people accomplish what they’re trying to do,” says Jon Kleinberg, a professor of computer science at Cornell University.

Search engines trying to provide the right answer to a query might take into account what a user has previously searched for. If a user is looking for a restaurant or a movie recommendation, the search engine might look at the user’s friends lists and see what those presumably trusted sources liked. And if the user is searching from a mobile device, that might provide additional clues.

If nothing else, a search from a mobile phone tells the search engine it is from a phone, so perhaps a search for a person is really a search for their phone number. And many mobile devices use GPS or cell phone towers to determine their location. A person typing “Yankees” in Manhattan may be looking for tickets to tonight’s baseball game, whereas the same search in Seattle may represent a desire for last night’s score. “In a relatively short time frame, we’re going to think of geolocation as an integral component of a lot of the online activity we do,” Kleinberg says.

Time is also becoming a characteristic to take into account, Kleinberg says. One way of judging the importance of a news story, for instance, is how quickly it spread and how long interest focused on it. Dumais points out that many facts have a time component as well. The gross national product of Norway, the population of Brazil, and the prime minister of Japan—all can have one factual answer in 2000 and a different one in 2010.
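One way to represent facts with a time component is to store each value together with the year it took effect and to answer queries as of a given year. The sketch below is a hypothetical data structure, not something described by Dumais or Kleinberg; the class name TimedFacts is made up, and the stored values are placeholders rather than real statistics.

```python
from bisect import bisect_right

class TimedFacts:
    """Stores, per question, a history of (start_year, value) entries."""

    def __init__(self):
        self._history = {}

    def record(self, key, start_year, value):
        entries = self._history.setdefault(key, [])
        entries.append((start_year, value))
        entries.sort()  # keep entries ordered by start year

    def lookup(self, key, year):
        """Return the value in effect in `year`, or None if none is recorded yet."""
        entries = self._history.get(key, [])
        years = [start for start, _ in entries]
        i = bisect_right(years, year)
        return entries[i - 1][1] if i else None

facts = TimedFacts()
# Placeholder values, not real figures.
facts.record("population of Brazil", 2000, "placeholder value for 2000")
facts.record("population of Brazil", 2010, "placeholder value for 2010")

as_of_2005 = facts.lookup("population of Brazil", 2005)
```

A query for 2005 falls back to the 2000 entry, while a query for a year before any entry returns None, so the same question can have different answers depending on the date attached to it.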

Dumais says future search engines will have both a better grasp of the intent of a query and a richer understanding of Web content. “We’re looking at how we can support that in ways that go beyond 2.3 words typed into a search box and 10 blue links,” she says.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI: 10.1145/1629175.1629183

January 2010 Issue, Vol. 53 No. 1, Pages 27-28

Published: January 1, 2010
