
Communications of the ACM

BLOG@CACM

Capturing and Structuring Data Mined from the Web


Kate Matsudaira

http://bit.ly/1eJYVUH
July 26, 2012

Central to any data-mining project is having a sufficient amount of data that can be processed to provide meaningful and statistically significant results. But acquiring the data is only the first phase. Often collected in an unstructured form, this data must be transformed into a structured format suitable for processing.

Within the past few years there has been an increase in freely available Web crawl datasets (http://bit.ly/1eDHFOO), but for many applications it is still necessary to crawl the Web to collect information. And if the data-mining pieces were not difficult enough, there are many counterintuitive challenges associated with crawling the Web to discover and collect content. The challenges become increasingly difficult at larger scale.

In practice, capturing the content of a single Web page is quite easy. Here is a simple crawler that uses the command-line tool wget:

wget --recursive --no-parent --wait=1 http://www.example.com/

If you want to capture the content of multiple pages, there is a wealth of open source Web crawlers, but you will find many of them are overkill, designed to crawl the Web in a very large-scale, distributed manner. Here, I focus on collecting specific types of Web data for data mining on a smaller scale by exploring three challenges (creating the URL frontier, scheduling the URLs, and extracting the information) and potential solutions.


Creating the URL Frontier

A URL frontier is the collection of URLs the crawler intends to fetch and process in the future. URL frontiers generally work in one of two ways: a batch crawl or a continuous crawl. A batch crawl's frontier contains only URLs that have not yet been fetched, whereas a continuous crawl's frontier starts from a seed set of URLs to which new ones may be added (and existing ones removed) during the crawl.

Regardless of the type of crawl, the data store used to manage the frontier needs to be fast; otherwise it becomes the bottleneck that slows the whole crawl down. Below is a diagram that helps illustrate the URLs in a crawl frontier.

Crawling is quite simple at its core:

  1. Select a URL to crawl
  2. Fetch and parse page
  3. Save the important content
  4. Extract URLs from page
  5. Add URLs to queue
  6. Repeat
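The loop above can be sketched in a few lines of Python. The in-memory PAGES table is a stand-in for real fetching and parsing (which would involve an HTTP request and an HTML parser); everything else mirrors the six steps:

```python
from collections import deque
from urllib.parse import urljoin

# Toy in-memory "web": each page maps to the links found on it.
# This is an illustrative assumption standing in for real HTTP fetches.
PAGES = {
    "http://example.com/":       ["/a.html", "/b.html"],
    "http://example.com/a.html": ["/b.html"],
    "http://example.com/b.html": [],
}

def fetch_and_parse(url):
    """Return the outgoing links found on a page (stubbed here)."""
    return PAGES.get(url, [])

def crawl(seed):
    frontier = deque([seed])           # URLs waiting to be fetched
    seen = {seed}                      # URLs already scheduled
    order = []                         # pages processed, in crawl order
    while frontier:
        url = frontier.popleft()       # 1. select a URL
        links = fetch_and_parse(url)   # 2. fetch and parse the page
        order.append(url)              # 3. save the important content
        for link in links:             # 4. extract URLs from the page
            absolute = urljoin(url, link)
            if absolute not in seen:   # 5. add unseen URLs to the queue
                seen.add(absolute)
                frontier.append(absolute)
    return order                       # 6. repeat until the frontier empties

print(crawl("http://example.com/"))
# → ['http://example.com/', 'http://example.com/a.html', 'http://example.com/b.html']
```

The `seen` set is what keeps the loop from revisiting a URL twice; at larger scales it is exactly the structure the scheduling section below replaces with a Bloom filter.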

Of course, the details around URL discovery are quite complicated, depending on your application (you may not want every URL on a page, for example). Adding logic around which URLs to crawl (selection) and which to filter out (an easy restriction is to accept URLs only from specific domains and ignore all others) can help ensure your crawl stays tightly focused on the content and data you are after.

One way to create your URL frontier is to define a seed set of URLs or domains and crawl them to extract and discover new links. Post-processing the links to normalize them helps reduce duplication and is good practice. Another example of this post-processing is filtering the links before adding them to the frontier, such as restricting links to specific domains or using an algorithm to order the URLs by a priority that makes sense for the application and goals of the crawl. (For more, check out the Stanford paper at http://stanford.io/1dsSy7V.)
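A filtering predicate along these lines is often all it takes to keep the frontier on target; here is a sketch in Python (the domain allow-list is hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical allow-list: only these domains belong in the frontier.
ALLOWED_DOMAINS = {"katemats.com", "example.com"}

def should_crawl(url):
    """Accept only http(s) URLs on domains we care about."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.netloc in ALLOWED_DOMAINS

print(should_crawl("http://katemats.com/about.html"))  # True
print(should_crawl("ftp://example.com/file"))          # False
print(should_crawl("http://other.com/page"))           # False
```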


Crawling and Scheduling

When it comes to crawling, there are many details that determine whether you can do it quickly and at scale. If your crawl is less than 100 million pages and you do not need to distribute it across servers, then crawling and scheduling are fairly manageable. The coordination of distributed crawling across several servers gets difficult quickly (and since it is much more complicated, I will have to leave those details for a future post).

When it comes to fetching and parsing a page, there are several steps involved for each URL:

  1. DNS lookup
  2. Fetch robots.txt
  3. Fetch URL
  4. Parse content
  5. Store results

For the DNS lookup and robots.txt rules, one can cache the results so that subsequent URLs from the same host are processed much faster and do not require a trip across the network.

Tip: For DNS lookups, many of the built-in toolkits are synchronous by default, so it helps to configure them for parallelism to achieve speed.
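One way to get both caching and parallel lookups using only the Python standard library looks like this (socket.getaddrinfo is the blocking stdlib resolver; the cache size and worker count are arbitrary choices):

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Cache lookups so repeated URLs on the same host skip the resolver round-trip.
@lru_cache(maxsize=10_000)
def resolve(host):
    infos = socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
    return infos[0][4][0]  # first resolved address for the host

def resolve_many(hosts):
    """Run blocking lookups in parallel threads instead of one at a time."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(zip(hosts, pool.map(resolve, hosts)))

ip = resolve("localhost")                             # hits the resolver once
addresses = resolve_many(["localhost", "localhost"])  # served from the cache
```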

Of course, it is important that any crawler obeys the rules in a site's robots.txt file, which may exclude all or parts of the domain. (More details on this exclusion protocol can be found at http://bit.ly/1eRsqnr.)
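Python's standard library ships a parser for this exclusion protocol. The rules below are supplied inline for illustration; a real crawler would first fetch them from the site's /robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body; normally fetched from the host itself.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

allowed = rp.can_fetch("MyCrawler", "http://example.com/public.html")        # True
blocked = rp.can_fetch("MyCrawler", "http://example.com/private/page.html")  # False
delay = rp.crawl_delay("MyCrawler")  # 2 (seconds between requests)
```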

Another important aspect of crawling is choosing which URLs to crawl and in what order.

If you are doing a continuous crawl, it is important to think about when to re-crawl a page. Certain parts of the Internet change all the time (news sites like CNN.com, for example) and should be re-crawled regularly because they publish new content, are high-authority sites, or are important sources of data.

Remember the importance of politeness. Your crawler should not make too many simultaneous requests to the same host, as they can overwhelm underpowered servers. Best practice is to wait two seconds between requests to the same IP.

Tip: Do not crawl your input URLs in sorted order, or consecutive requests will hit the same domain and you will spend most of your time waiting out the politeness delay. Randomize the order of URLs across domains.
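One way to implement the politeness delay is to track the last request time per host. The sketch below uses a simulated clock so the logic is easy to follow:

```python
from urllib.parse import urlparse

POLITENESS_DELAY = 2.0  # seconds between requests to the same host

last_hit = {}  # host -> timestamp of the most recent request

def wait_time(url, now):
    """Seconds to wait before this URL may be fetched politely."""
    host = urlparse(url).netloc
    elapsed = now - last_hit.get(host, float("-inf"))
    return max(0.0, POLITENESS_DELAY - elapsed)

# Simulated clock: the host was last hit at t=100, and it is now t=101.
last_hit["example.com"] = 100.0
print(wait_time("http://example.com/a", now=101.0))  # 1.0 (must wait)
print(wait_time("http://other.com/b", now=101.0))    # 0.0 (fresh host)
```

A scheduler that pops URLs in randomized domain order rarely has to wait at all, because by the time it circles back to a host the delay has usually elapsed.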

Scheduling is not just about selecting which URLs to crawl, but also tracking the URLs crawled so far. If you have a small crawl (<100 million URLs), it probably makes sense to store these in memory; otherwise, your application performance can suffer reading from disk or across the network. It is possible to store URLs in plain text, although using a bloom filter (http://bit.ly/1j7JGZ0) to hash URLs makes reading and writing much faster.
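A Bloom filter can be implemented in a few lines; the bit-array size and hash count below are illustrative, and a production crawl would tune them to the expected number of URLs and acceptable false-positive rate:

```python
from hashlib import sha256

class BloomFilter:
    """Tiny Bloom filter answering 'have we (probably) seen this URL?'"""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url):
        # Derive k bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/")
print("http://example.com/" in seen)   # True
print("http://example.com/x" in seen)  # False (with overwhelming probability)
```

The trade-off is that membership tests can return false positives (a URL is skipped that was never crawled), but never false negatives, and the whole structure fits in a fraction of the memory a set of URL strings would need.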

Finally, as new URLs are discovered, it is important to normalize them to remove duplicates.

For example, many links are expressed relative to the main domain, so that a link might read /about.html but would need to be transformed into http://katemats.com/about.html. Since many URLs in the wild carry query parameters (such as those used for analytics or session information), it is easy to come across many different versions of a URL that reference the same Web page. These extraneous parameters must be removed or normalized so the only URL crawled is the canonical version of the page.
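A normalization routine might resolve relative links against their page and strip a deny-list of tracking parameters; a sketch (the parameter deny-list is hypothetical):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical deny-list of analytics/session parameters to strip.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def normalize(base, href):
    """Resolve a link against its page and drop extraneous query parameters."""
    absolute = urljoin(base, href)
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    kept = [(k, v) for k, v in parse_qsl(query) if k not in IGNORED_PARAMS]
    # Lowercase the host, re-encode surviving params, and drop the fragment.
    return urlunsplit((scheme, netloc.lower(), path, urlencode(kept), ""))

print(normalize("http://katemats.com/", "/about.html"))
# http://katemats.com/about.html
print(normalize("http://katemats.com/", "/about.html?utm_source=feed&page=2"))
# http://katemats.com/about.html?page=2
```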

Using the rel="canonical" tag (to read more, check out Google's documentation at http://bit.ly/K39Nlj) can serve as a signal that a page is a duplicate, but it is not used everywhere, so it is worth implementing a fingerprinting algorithm. An algorithm like shingling (http://stanford.io/Kjpnsy) can help detect identical or near-duplicate pages.
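Shingling reduces each page to the set of overlapping word k-grams it contains and compares those sets; a minimal sketch using the Jaccard coefficient as the resemblance measure:

```python
def shingles(text, k=3):
    """The set of overlapping k-word sequences ('shingles') in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard coefficient of the two shingle sets: 1.0 means identical."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

print(resemblance("the quick brown fox jumps",
                  "the quick brown fox leaps"))  # 0.5
```

In practice one fingerprints the shingle sets (for example by min-hashing) rather than storing them whole, so near-duplicates can be found without pairwise comparison of full pages.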

There are a handful of open source crawlers worth considering as a starting point.


Extracting the Information

Once the crawler is configured, the last piece of the puzzle is extracting and structuring the data from different Web pages.

The most important property of a Web page parser is that it can deal with messy markup and is resilient to errors. Chances are you will also want some smart logic that removes the page chrome (navigation, headers/footers, search boxes, advertisements, and so forth) so that only the data is returned. It is also important to handle more than ASCII text, as it is common to run across Unicode URLs and content.
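Python's built-in html.parser is one example of an error-tolerant parser. The sketch below extracts links and visible text while crudely stripping a few chrome elements, and survives markup with unclosed tags (the tag skip-list is a simplistic stand-in for real chrome-removal logic):

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collects links and visible text; html.parser tolerates messy markup."""

    CHROME = {"script", "style", "nav", "header", "footer"}  # crude chrome removal

    def __init__(self):
        super().__init__()
        self.links, self.text_parts = [], []
        self._skipping = 0  # depth inside chrome elements

    def handle_starttag(self, tag, attrs):
        if tag in self.CHROME:
            self._skipping += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in self.CHROME and self._skipping:
            self._skipping -= 1

    def handle_data(self, data):
        if not self._skipping and data.strip():
            self.text_parts.append(data.strip())

extractor = LinkAndTextExtractor()
extractor.feed('<nav>Home</nav><p>Caf\u00e9 review <a href="/menu">menu')  # unclosed tags
extractor.close()
print(extractor.links)                 # ['/menu']
print(" ".join(extractor.text_parts)) # 'Café review menu'
```

Note that nothing breaks on the unclosed <p> and <a> tags, and the non-ASCII text comes through intact.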

If you are trying to crawl the entire Web, there is more to consider, as the Internet is full of junk such as enormous or never-ending pages, with spider traps that look like dynamic content but are actually an infinite generation of links.

Once you have the HTML content, you need to navigate the page and its DOM to find the specific parts that are relevant or important to your cause. Then, of course, you must store that information in a structured form, which may mean converting it to JSON, or transforming it to fit a schema and inserting it into a database. Thankfully, there are a lot of extractors available for parsing Web pages.

A little bit different, but good for pulling article text out of pages, is Readability (http://bit.ly/19uQ3Cp).
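Whatever extractor you use, the end product is typically a structured record per page; a sketch of flattening one to JSON with the standard library (the field names are hypothetical):

```python
import json

# Hypothetical record extracted from one crawled page.
record = {
    "url": "http://katemats.com/about.html",
    "title": "About",
    "links": ["/", "/blog"],
    "fetched_at": "2012-07-26T00:00:00Z",
}

doc = json.dumps(record, sort_keys=True)  # stable key order aids diffing/dedup
restored = json.loads(doc)                # round-trips losslessly
```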

There you have it!


Further Reading

Pant, G., Srinivasan, P., Menczer, F.
Crawling the Web, Department of Management Sciences, School of Library and Information Science, The University of Iowa; School of Informatics, Indiana University, http://bit.ly/1d7cCIl

Web Crawling and Indexes, Cambridge University Press, http://stanford.io/1j7HDnX

Zhao, B.Y., Huang, L., Stribling, J., Rhea, S.C., Joseph, A.D., Kubiatowicz, J.D.
Tapestry: A Resilient Global-Scale Overlay for Service Development, IEEE Journal on Selected Areas in Communications, http://bit.ly/1iW0HTx

DeCandia, G., Hastorun, D., Madan, J., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.
Dynamo: Amazon's Highly Available Key-value Store, Amazon.com, http://bit.ly/KjqGYl

Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, MIT Laboratory for Computer Science, http://bit.ly/1m6fgq1


Author

Kate Matsudaira is vice president of Decide.com, which was acquired by eBay in 2013.


©2014 ACM  0001-0782/14/03

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.


 
