Capturing and Structuring Data Mined from the Web

http://bit.ly/1eJYVUH
July 26, 2012

Central to any data-mining project is having sufficient amounts of data that can be processed to provide meaningful and statistically relevant information. But acquiring the data is only the first phase. Often collected in an unstructured form, this data must be transformed into a structured format suitable for processing.

Within the past few years there has been an increase of free Web crawler datasets (http://bit.ly/1eDHFOO), but for many applications it is still necessary to crawl the Web to collect information. And if the data mining pieces were not difficult enough, there are many counterintuitive challenges associated with crawling the Web to discover and collect content. The challenges become increasingly difficult when doing this on a larger scale.

In practice, capturing the content of a single Web page is quite easy. Here is a simple crawler that uses the command-line tool wget:

If you want to capture the content of multiple pages, there exists a wealth of open source Web crawlers, but you will find many of these solutions are overkill, often designed to crawl the Web in a very large-scale, distributed manner. Here, I focus on the challenges of collecting specific types of Web data for data mining on a smaller scale by exploring three of these challenges (creating the URL frontier, scheduling the URLs, and extracting the information) and potential solutions.

Creating the URL Frontier

A URL frontier is the collection of URLs the crawler intends to fetch and process in the future. URL frontiers generally work in one of two ways: a batch crawl or a continuous crawl. A batch crawl’s frontier contains only new URLs, whereas a continuous crawl’s can contain a seed set of URLs, but new ones may be added (or existing ones removed) during the crawl.

Regardless of the type of crawl, the data store used to manage the frontier needs to be fast, because it will slow down crawling otherwise. Below is a diagram that helps illustrate the URLs in a crawl frontier.

Crawling is quite simple at its core:

Select a URL to crawl
Fetch and parse page
Save the important content
Extract URLs from page
Add URLs to queue
Repeat

Of course, the details around URL discovery are quite complicated, depending on your application (you may not want every URL on a page, for example). Adding logic around which URLs to select to crawl (selection), and which URLs to filter (often an easy restriction is to set your filters to only apply to specific domains and ignore all others) can help ensure your crawl is tightly targeted to your target content/data.

One way to create your URL frontier is to define a seed set of URLs or domains and crawl them to extract and discover new links. Post-processing the links to normalize them can help reduce duplication and is a good best practice to follow. Another example of this post-processing is filtering the links before adding them to the frontier, such as restricting links to specific domains or using an algorithm to order the URLs in a priority that makes sense for the application and goals of the crawl. (For more, check out the Stanford paper at http://stanford.io/1dsSy7V.)

Crawling and Scheduling

When it comes to crawling, there are many details that allow one to do this quickly and at scale. If your crawl is less than 100 million pages and you do not need to distribute the crawling across servers, then crawling and scheduling is more manageable. The coordination of distributed crawling across several servers can get difficult quickly (and since it is much more complicated, I will have to leave those details for a future post).

When it comes to fetching and parsing a page, there are several steps involved for each URL:

DNS lookup
Fetch robots.txt
Fetch URL
Parse content
Store results

For the DNS lookup and robots.txt rules, one can likely cache these such that any subsequent URLs from the same IP will be much faster and will not require a trip across the network.

Tip: For DNS lookups, many of the built-in toolkits are synchronized by default, so it helps to configure this to achieve parallelism and speed.

Of course, it is important any crawler obeys the rules in robots.txt file, which may exclude all or parts of the domain. (More details on this exclusion protocol can be found at http://bit.ly/1eRsqnr.)

Another important aspect of crawling is choosing which URLs to crawl and in what order.

If you are doing a continuous crawl, it is important to think about when you may re-crawl a page. Certain parts of the Internet are changing all the time (such as news sites like CNN.com) and should be crawled regularly as they have new content, are high authority sites, or are important sources of data.

Remember the importance of politeness. Your crawler should not make too many simultaneous requests, as they can overwhelm underpowered servers. Best practice is to wait two seconds between requests for the same IP.

Tip: Do not sort input URLs, or you will be waiting between calls. Randomize the order of URLs across domains.

Scheduling is not just about selecting which URLs to crawl, but also tracking the URLs crawled so far. If you have a small crawl (<100 million URLs), it probably makes sense to store these in memory; otherwise, your application performance can suffer reading from disk or across the network. It is possible to store URLs in plain text, although using a bloom filter (http://bit.ly/1j7JGZ0) to hash URLs makes reading and writing much faster.

Finally, as new URLs are discovered, it is important to normalize them to remove duplicates.

For example, many pages are expressed relative to the main domain, such that a link might be /about.html, but would need to be transformed into: http://katemats.com/about.html. Since many URLs in the wild have query parameters (such as those used for analytics or session information), it is easy to come across many different versions of a URL that reference the same Web page. These extraneous parameters must be removed or normalized so the only URL crawled is the canonical version of the page.

Using the rel=”canonical” tag (to read more, check out Google’s documentation at http://bit.ly/K39Nlj) can serve as a signal if the page is a duplicate, but it is not used everywhere, so it is worth implementing a fingerprinting algorithm. Using an algorithm like shingling (http://stanford.io/Kjpnsy) can help detect the same or near-duplicate pages.

There are a handful of open source crawlers to consider, such as:

Apache Nutch (http://nutch.apache.org/)
Bixo (http://bixo.101tec.com/)
Heritrix (http://bit.ly/1dYRV2n).
Crawler4j (http://code.google.com/p/crawler4j/)
Scrapy (scrapy.org)

Extracting the Information

Once the crawler is configured, the last piece of the puzzle is extracting and structuring the data from different Web pages.

The most important part of parsing Web pages is that your parser is able to deal with messy markup and will be resilient to errors. Chances are you will want some smart logic that removes the page chrome (navigation, headers/footers, search boxes, advertisements, and so forth) so only data is returned. It is also important to consider more than ASCII text, as it is common to run across Unicode URLs and content.

If you are trying to crawl the entire Web, there is more to consider, as the Internet is full of junk such as enormous or never-ending pages, with spider traps that look like dynamic content but are actually an infinite generation of links.

Once you have the html content, it is necessary to navigate the page and DOM to find the specific parts of the page relevant or important to your cause. And then, of course, storing that information in a structured form, which may mean converting it to JSON, or transforming it to fit a schema and inserting it into the database. Thankfully, there are a lot of extractors for parsing Web pages, including:

Scrapy (scraping and crawling framework)
Mechanize
Beautiful Soup (http://bit.ly/1iVZlbl)
Boilerplate (http://bit.ly/1b3EltF, used to remove the chrome around Web pages)

A little bit different, but good for pulling out article text from pages is Readability (http://bit.ly/19uQ3Cp).

There you have it!

Capturing and Structuring Data Mined from the Web

Creating the URL Frontier

Crawling and Scheduling

Extracting the Information

Further Reading

Capturing and Structuring Data Mined from the Web

DOI

March 2014 Issue

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.

Creating the URL Frontier

Crawling and Scheduling

Extracting the Information

Further Reading

Capturing and Structuring Data Mined from the Web

DOI

March 2014 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.