
Current Trends in Web Data Analysis

Considering potential reasons for the underutilization of clickstream data and suggesting ways to enhance its use.
  1. Introduction
  2. Reasons for Underutilization of Clickstream Data
  3. An Alternative Analysis Approach: A Web Forensic Framework
  4. Conclusion
  5. References
  6. Authors
  7. Figures
  8. Tables

According to Shop.org, U.S.-based visits to retail Web sites exceeded 10% of total Internet traffic for the first time ever, accounting for 11.39%, 11.03%, and 10.74%, respectively, on Nov. 24th, Thanksgiving, and the day after Thanksgiving in 2004. Hitwise.com found that the sites luring the most traffic over the same timeframe were eBay, Amazon.com, Dell.com, Walmart.com, BestBuy.com, and Target.com, in that order. Each shopper’s interactions with a Web site can provide a site manager with a plethora of data known as clickstream data. This data includes visitors’ clicks (what they looked at, what they selected, and how long they spent looking) and the Web server’s responses.

There are several approaches to creating clickstream data on the Web: client-based, server-based, and network-pipe-based. A client-based approach entails using a null logging server solution [7]. That is, the site owner embeds needed information (such as page type or product SKU) in HTML tags inside its pages. When a visitor interacts with the pages, the embedded information is sent to a server set up to collect it.

In the server-based approach, three methods are available to create clickstream data: a server log, a server plug-in, and/or a network pipe. The server log file method uses a basic log entry that records information about the computer making the request, the resource being requested, and the date of the request. The most common log format is W3C’s (World Wide Web Consortium) Extended Common Log Format (ECLF). Microsoft’s IIS 4.0 (Internet Information Server 4.0) Web server log format is illustrated in Table 1.
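
To make the log format concrete, the following is a minimal sketch of a parser for W3C extended log lines of the kind IIS writes. The field names and sample values below are illustrative assumptions about one site's configuration, not the contents of Table 1.

```python
# A minimal sketch (not a production parser): read W3C extended log
# lines, using the "#Fields:" directive to name the columns.

def parse_w3c_log(lines):
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]      # column names declared by the server
            continue
        if not line or line.startswith("#"):
            continue                        # skip blanks and other directives
        yield dict(zip(fields, line.split()))

sample = [
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2004-11-24 10:15:02 192.0.2.17 GET /products/sku123.html 200",
]
for record in parse_w3c_log(sample):
    print(record["c-ip"], record["cs-uri-stem"], record["sc-status"])
```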


Current surveys suggest that in spite of storing many terabytes of clickstream data and investing heavily in Web analytic tools, few companies understand how to use the data effectively.


The second server-based method uses server plug-ins (integrated with the server through a native API) to monitor server events (such as POST requests) and provides additional data not captured in basic server logs. This method may also use application plug-ins that monitor only application events (for example, data entered by a visitor into a form embedded in a Web page).

In the final method, network pipe, network sniffers, usually located on the Web server’s network, capture the TCP/IP packets being exchanged. Because of their location, sniffers are able to detect low-level network events, such as the disconnect that is sent when a user clicks the stop button before a page has completely loaded. Contrast this with basic server logs, which do not detect network disconnects and may indicate a page was sent even though the connection was dropped before delivery completed. A comparison of these different data types is shown in Table 2a.

Because creating clickstream data is relatively easy, the Web analytic market has been growing fast. According to Creese [2], the market grew 200% annually during 1995–2000. Forrester Research estimates the worldwide market for Web analytics was $292 million in 2004 and grew approximately 27% in 2005. Driving this impressive rise is the realization that converting even 1% of “window shoppers” to “buyers” has the potential to increase sales by millions of dollars for a site.

Interestingly, current surveys suggest that in spite of storing many terabytes of clickstream data and investing heavily in Web analytic tools, few companies understand how to use the data effectively. Although various Web analytic technologies have been tried, Creese [3] estimates that about 60%–70% of respondents use clickstream data only for basic Web site management activities. Why, then, is clickstream data so underutilized? What needs to be done to increase its use? The objective of this article is to address these questions by discussing potential reasons for this underutilization and offering ways to enhance the use of clickstream data.


Reasons for Underutilization of Clickstream Data

Clickstream data has many inherent problems that underlie the reasons for its underutilization. We can group these problems into three broad categories: problems with the data itself, too many analytical methodologies applied to the data, and inherent problems in data analysis.

Problems with Data. Incompleteness in Data. Complete clickstream data is difficult to create [5, 6] because caching on the Internet is quite common and occurs in several independent places. For example, browsers automatically cache pages and their contents when they are downloaded onto clients’ disks; ISPs also have their own cache storage. Consequently, a request for a page is often satisfied first from the local disk cache, then from the ISP cache, then from a regional cache, and so on. If the page exists in any intermediary cache, the Web server will have no record of serving it. These missed records lead to incomplete clickstream data being created at any given Web site.

Very Large Data Size. Clickstream data files can be extremely large. A reasonably large Web site can easily produce 100 million records per day, since server logs include records for each HTML page served as well as the image files embedded on each page. Also included in the logs are requests made by spiders (automated crawlers), which request all pages. With large clickstream data files, it is important to determine how many days of clickstream records one needs to keep. For example, 100 days’ worth of clickstream data could easily contain billions of records. Data sets of this size pose significant challenges for storage and analysis. Most existing Web analytics tools “crawl” or “choke” when presented with such large data sets.
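
One common first step, sketched minimally below, is to filter raw records down to plausible human page views before analysis. The file extensions and user-agent substrings are illustrative assumptions rather than a complete list, and the records are dictionaries like those produced by the hypothetical parser sketched earlier.

```python
# A minimal sketch: keep only records that plausibly represent human
# page views, dropping embedded assets and crawler traffic.

ASSET_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
BOT_SIGNATURES = ("bot", "crawler", "spider", "slurp")

def is_page_view(record):
    uri = record.get("cs-uri-stem", "").lower()
    agent = record.get("cs(User-Agent)", "").lower()
    if uri.endswith(ASSET_EXTENSIONS):
        return False        # image, script, and style requests
    if any(sig in agent for sig in BOT_SIGNATURES):
        return False        # automated crawlers
    return True

# page_views = [r for r in parse_w3c_log(raw_lines) if is_page_view(r)]
```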

Messiness in the Data. Although clickstream data follows a known format, it can still be very messy. For example, IP addresses recorded in clickstream data may be meaningless because some ISPs (for example, AOL) change a client’s IP address during a session, which makes IP numbers unreliable for identifying a single visitor’s activities within any given session. Furthermore, changes in the content of a Web page (such as modified or replaced text, revised links, and updated images) also add to the messiness of clickstream data.

Integration Problems with Enterprise Data. Because clickstream data is behavioral in nature, it is not easy to integrate seamlessly with existing e-commerce and other enterprise data. Integrating clickstream data with other enterprise data sources requires the definition and implementation of data links between the clickstream and those sources. This integration also requires a robust data warehouse infrastructure (discussed later) that can accommodate the volume of clickstream data and its behavioral nature.

Too Many Analytical Methodologies. The last several years have seen wide use of this data along with a plethora of analysis methodologies. Instead of discussing all of them, we group them into the four categories described here.

Web Metric-based Methodologies. Web metrics (W-metrics) measure the extent of a Web site’s success [11]. Common examples of W-metrics include stickiness, velocity, slipperiness, focus, and shopping cart abandonment. Stickiness is a composite measure capturing the effectiveness of the Web site content in terms of consistently holding visitors’ attention and allowing them to quickly complete online tasks. Velocity is a measure of how quickly a visitor navigates between the various stages of a visit. Slipperiness measures how well pages are designed so that visitors need not spend a lot of time on them. Focus is a measure of the number of pages visitors look at in a section of the Web site. Shopping cart abandonment typically measures the ratio of abandoned carts to completed transactions per day (while abandonment is rare in the brick-and-mortar world, it is very common in online stores). Calculation of these W-metrics usually involves clickstream data.
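
As a minimal sketch, two of these W-metrics can be computed from summarized clickstream counts. The abandonment formula follows the definition above; the stickiness formula (total time on site spread over unique visitors) is a simplifying assumption, since the article treats stickiness as a composite measure. All input figures are hypothetical.

```python
# A minimal sketch of two W-metrics from summarized clickstream counts.

def cart_abandonment(abandoned_carts: int, completed_transactions: int) -> float:
    # Ratio of abandoned carts to completed transactions for one day.
    return abandoned_carts / completed_transactions

def stickiness(total_minutes_on_site: float, unique_visitors: int) -> float:
    # Simplified stickiness: average minutes per unique visitor (assumption).
    return total_minutes_on_site / unique_visitors

print(cart_abandonment(abandoned_carts=340, completed_transactions=120))  # ~2.83 abandoned carts per sale
print(stickiness(total_minutes_on_site=51_000, unique_visitors=8_500))    # 6.0 minutes per visitor
```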

Basic Marketing Metric-based Methodologies. Clickstream data is also analyzed using basic marketing measures such as attrition, churn, lead generation, visitor-customer conversion ratio, and marketing campaign effectiveness. Attrition is the percentage of known customers who have stopped buying. Churn, on the other hand, measures attrition with respect to the total number of customers at the end of a given period. Lead generation measures the number of new leads generated relative to the dollars spent on an advertisement. Visitor-customer conversion ratio uses clickstream and other data sets to compute how many visitors convert to customers. Finally, marketing campaign effectiveness measures campaign conversion, the ratio of visitors to customers for any specific campaign.
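
The sketch below illustrates two of these measures, visitor-customer conversion and churn, as simple ratios following the definitions above; all input figures are hypothetical.

```python
# A minimal sketch of two basic marketing metrics.

def conversion_ratio(buyers: int, visitors: int) -> float:
    # Visitor-customer conversion: share of visitors who became customers.
    return buyers / visitors

def churn_rate(customers_lost: int, customers_at_period_end: int) -> float:
    # Attrition relative to the customer base at the end of the period.
    return customers_lost / customers_at_period_end

print(f"{conversion_ratio(buyers=1_200, visitors=95_000):.2%}")                 # 1.26%
print(f"{churn_rate(customers_lost=300, customers_at_period_end=10_000):.2%}")  # 3.00%
```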

Navigation-based Methodologies. In this category, methodologies use data mining concepts on clickstream data to create page navigational patterns [4]. A navigation pattern is usually derived from the sequence of Web pages visited. After an initial cleaning of the data, visitors are categorized into groups based on their clickstream records. The groups are then analyzed or mined (mostly using clustering techniques) to develop navigational patterns. By investigating visitors’ navigation patterns, analysts construct rules that can be used to reconfigure the Web site.
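
As a minimal sketch of this idea, the following groups a handful of hypothetical sessions by the pages they touch. The page names, the bag-of-pages representation, and the choice of k-means (via scikit-learn) are illustrative assumptions; the article says only that clustering techniques are typically used.

```python
# A minimal sketch: cluster visits into navigation patterns by the pages
# they contain.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

sessions = [
    "home search flights flight_detail checkout",
    "home search flights flight_detail",
    "home deals hotels hotel_detail checkout",
    "home deals hotels",
]

X = CountVectorizer(token_pattern=r"\S+").fit_transform(sessions)  # one column per page
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for session, label in zip(sessions, labels):
    print(label, session)
```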

Traffic-based Methodologies. In this category, the analysis of clickstream data focuses on the notion of traffic. Moe and Fader [9] define this as “visits per visitor” and suggest measuring online store traffic by studying the timing and frequency of store visits obtainable from the clickstream data. Using this methodology, they find that buying propensity increases with visit frequency and, consequently, contend that any change in the visit frequency of the shoppers can also be a good predictor of visitors’ buying behavior.
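
A minimal sketch of the "visits per visitor" idea follows: computing each visitor's visit count and the gaps between visits from hypothetical (visitor, date) pairs extracted from clickstream records.

```python
# A minimal sketch: visit frequency and inter-visit timing per visitor.

from collections import defaultdict
from datetime import date

visits = [
    ("v1", date(2004, 11, 20)), ("v1", date(2004, 11, 24)), ("v1", date(2004, 11, 26)),
    ("v2", date(2004, 11, 26)),
]

by_visitor = defaultdict(list)
for visitor, day in visits:
    by_visitor[visitor].append(day)

for visitor, days in sorted(by_visitor.items()):
    days.sort()
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    print(visitor, "visits:", len(days), "days between visits:", gaps)
```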

Although there is a strong desire to build Web analytics tools internally [10] using the methodologies given previously, commercially packaged Web analytic tools have flourished in recent years. These tools have evolved from simple server-centric measurements at the early stages of the Internet (first-generation tools) to visitor-centric measurements that evaluate the Web experience of visitors (second-generation tools).

First-generation Web analytics tools were standalone tools used to analyze Web server logs and report basic usage metrics such as hits and page views. These tools, commonly based on log file or network sniffer technologies, are called log parsers. They could read logs, parse them based on standard rules, and produce reports. The first generation of log parsers did not use a database to store any of the parsed data. Because first-generation tools rely on Web server logs, they essentially measure Web server activity and use that information to infer visitor behavior. Consequently, users of these tools experience many of the problems discussed previously.

Second-generation tools (see Table 2b) directly measure visitor interactions with Web pages using the null logging server approach. These tools perform browser-based measurements, typically by tagging key Web pages and collecting information via “single-pixel GIFs” [7] and browser instrumentation through JavaScript. Many of these solutions are offered as a hosted service. The tools can minimize the false data from machine-generated activities, are immune to issues associated with proxy servers or browser-based caching, and offer real-time data capture, reporting, and some systems integration and customization capabilities.

Data Analysis Problems. Despite the existence of all these methodologies, analysis of clickstream data is not easy and has some inherent data analysis problems. Some examples include:

  • Summarization of clickstream data across dimensions (such as time and visitors) requires the use of non-additive and semi-additive measures. Since a visitor may visit various areas of a Web site, simply adding up users across areas to arrive at the total number of visitors may overcount the number of independent users (see the sketch following this list).
  • The sheer size of the data poses a problem: standard analysis tools are not capable of handling data sets this large and must resort to sampling to process the clickstream data.
  • With thousands of metrics available, determining the right set for actionable or useful analysis can be very complicated.
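
The following minimal sketch illustrates the overcounting problem from the first bullet: unique-visitor counts are semi-additive, so summing per-area counts inflates the site-wide total whenever the same visitor appears in several areas. The visitor ids and page areas are hypothetical.

```python
# A minimal sketch of the overcounting problem with semi-additive measures.

area_visitors = {
    "home":     {"v1", "v2", "v3"},
    "search":   {"v1", "v3"},
    "checkout": {"v1"},
}

naive_total = sum(len(v) for v in area_visitors.values())   # 6: overcounted
true_total = len(set().union(*area_visitors.values()))      # 3: distinct visitors
print(naive_total, true_total)
```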


An Alternative Analysis Approach: A Web Forensic Framework

The major emphasis in the Web analytics industry to date has been on developing newer data capture techniques while keeping data analysis simple, focused primarily on Web site management and basic marketing metrics. We believe this emphasis impedes the effective use of clickstream data by many organizations. With the continuing rapid growth of online shopping, the time has come to put a “structured methodology” in place that promotes a deeper understanding of visit behaviors and how they directly relate to important outcomes such as customer loyalty. Borrowing from the crime scene investigation paradigm, we suggest one way to do this is through a more detailed framework for Web data analysis called a Web forensic framework.

In Web forensics, the emphasis is on structuring existing analysis techniques and developing newer ones to detect and interpret the often overlooked evidence in Web data. To structure the existing clickstream data analysis methodologies, we propose a hierarchical approach. Building on Sterne [11], we developed the Web forensic pyramid (see Figure 1), which emphasizes the increasing business value of aggregated clickstream data and its analysis. Each level offers a specific service, described here.


With continuous rapid growth of online shopping, the time has come to put a “structured methodology” in place that promotes a deeper understanding of visit behaviors and how they directly relate to important outcomes such as customer loyalty.


Level-0 (Basic Metric Reporting). The key service of this level is to analyze click records and to provide prefabricated reports of Web metrics and basic marketing metrics. This level relies on the same set of tools discussed in Table 2b, so they are not repeated here. These tools are essential for capturing click records and for Web site management.

Level-1 (Web Data Warehousing). The next higher level in the Web forensic framework is Web data warehousing [8]. The key services at this level are storing very large data sets, integrating with enterprise data, and providing rudimentary decision support such as cross-channel traffic analysis. Typically, the Web data warehousing process includes tasks such as business discovery, data design, architecture design, data warehouse implementation, and data warehouse deployment.

Level-2 (Visit Behavior Tracing). At this level, the key service is to understand visit behavior. This navigation behavior lies not in each individual action of a visitor, but rather in the entirety of a visitor’s actions on a Web site. Three visit behavior concepts are presented here.

To implement behavior tracing, we introduce the concept of a footprint, representing a single clickstream record created by the interaction of the visitor with a page on a Web site. A visitor’s footprint provides information about such things as IP address, date, time, dwell time, referrer, and page-id. A footprint is “well-defined” if the visitor’s dwell time on the page is moderately high. Otherwise, the footprint is “latent.” A footprint is called “partial” if only a portion of the page appears in the visitor’s clickstream record. To illustrate, using the click records of a commercial Web site, we extract nine footprints (see Table 3a) from a specific visit. Footprint f1 is well-defined, f4 is latent, and f2 is partial.
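
A minimal sketch of how footprints might be classified along these lines follows. The 30-second threshold, the field names, and the use of a "fully loaded" flag to detect partial footprints are illustrative assumptions; the article says only that a well-defined footprint has a moderately high dwell time.

```python
# A minimal sketch: classify footprints as well-defined, latent, or partial.

from dataclasses import dataclass

@dataclass
class Footprint:
    page_id: str
    dwell_seconds: float
    fully_loaded: bool   # False if only part of the page reached the visitor

def classify(fp: Footprint, threshold_seconds: float = 30.0) -> str:
    if not fp.fully_loaded:
        return "partial"
    return "well-defined" if fp.dwell_seconds >= threshold_seconds else "latent"

print(classify(Footprint("f1", dwell_seconds=95, fully_loaded=True)))   # well-defined
print(classify(Footprint("f4", dwell_seconds=3, fully_loaded=True)))    # latent
print(classify(Footprint("f2", dwell_seconds=40, fully_loaded=False)))  # partial
```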

A second important concept for visit behavior tracing is a track, a collection of footprints. A track provides a chronological history of a visitor’s browsing actions while visiting a Web site. A track can be general or focused; usually, a focused track contains a specific page that is important to the organization. In illustrating a track for a visit (see Table 3b), we introduce several important attributes. The first attribute, page-path, is a chronological sequence of page-ids. Other attributes explain events taking place during the visit, such as what the visitor saw, which areas of each page were most interesting, the amount of time he or she spent there, and the entry-point and exit-point. The entry-point of the visit is the first page of the page-path, while the exit-point is the last (“killer”) page of the page-path.
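
A minimal sketch of assembling a track from a visitor's footprints, recording the page-path along with the entry-point and exit-point; the timed footprint record and the page names are hypothetical.

```python
# A minimal sketch: build a track (page-path, entry-point, exit-point)
# from time-ordered footprints.

from dataclasses import dataclass

@dataclass
class TimedFootprint:
    page_id: str
    seconds_into_visit: float

def build_track(footprints):
    ordered = sorted(footprints, key=lambda fp: fp.seconds_into_visit)
    page_path = [fp.page_id for fp in ordered]
    return {"page_path": page_path, "entry_point": page_path[0], "exit_point": page_path[-1]}

print(build_track([
    TimedFootprint("home", 0), TimedFootprint("flights", 42), TimedFootprint("checkout", 180),
]))
```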

A third useful behavior tracing concept is a trail. Trails can be created in a number of ways. One way is to group visitor tracks, using a clustering algorithm, into trails comprising similar behaviors, attitudes, beliefs, and values. We label these visitor-driven trails. Web site managers can also create trails with help from marketers. Here, a trail is a set of tracks that the site developers would like visitors to take while they visit the Web site. We call these site-driven trails. There can be various kinds of site-driven trails in a Web site, and some trails may be more popular than others. Examples of site-driven trails in a travel Web site could be an air-path for airline travel, a car-path for rental car reservations, and so on.
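
The sketch below shows one way a site-driven trail might be evaluated against observed tracks: it measures what fraction of a hypothetical air-path appears, in order, within a visitor's page-path. The trail definition and page names are assumptions for illustration.

```python
# A minimal sketch: how closely an observed page-path follows a
# site-driven trail, as the in-order fraction of the trail it covers.

AIR_PATH = ["home", "search", "flights", "flight_detail", "checkout", "confirmation"]

def trail_coverage(page_path, intended_trail=AIR_PATH):
    matched, i = 0, 0
    for page in page_path:
        if i < len(intended_trail) and page == intended_trail[i]:
            matched += 1
            i += 1
    return matched / len(intended_trail)

print(trail_coverage(["home", "search", "flights", "deals", "flight_detail"]))  # 4/6 ~ 0.67
print(trail_coverage(["home", "hotels", "hotel_detail"]))                       # 1/6 ~ 0.17
```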

The different types of trails can be evaluated to measure their effectiveness in driving sales and improving the customer experience. Web site operators can use this information to optimize the site layout and navigation so that visitors are offered more efficient trails that align with their purpose for visiting the Web site.

Level-3 (Customer Segmentation). At this level, the key service is to create segments of customers using clickstream and other available data. Because the focus of this level is understanding customer segments, buy and no-buy decisions are vital. At one extreme, a firm could treat each customer as a unique entity that must be served. However, for most companies, the costs and overhead involved in uniquely serving each individual customer are prohibitive. Consequently, marketers balance the cost of serving a single customer against the value that a group of customers gets from products designed to fit their needs. To do this, marketers turn to customer segmentation: identifying groups of consumers with similar needs, thereby increasing the ability to serve a large number of customers simultaneously with one marketing program. For example, visitors could be segmented by their visit behaviors (go-grab, browse-buy, information-loader, and quick-contact) [12], providing the marketer with a basic idea of what would appeal to each segment on a Web site. Of course, segmentation could also occur beyond simple general behaviors.

Through clickstream data, one can identify whether a consumer made an actual purchase on the Web site. Using this information, the marketer can estimate the probability that a visitor in any given segment is a “buyer.” This allows the marketer to directly assess the profitability of a segment for the marketing organization. In addition, by calibrating the clustering algorithm, a marketer can redefine segments to create sufficiently profitable ones.
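
A minimal sketch of this estimate follows: counting, per segment, the share of observed visits that ended in a purchase. The segment labels follow the examples above, and the observations themselves are hypothetical.

```python
# A minimal sketch: estimate P(buyer) per segment from (segment, bought)
# pairs derived from clickstream records.

from collections import Counter

observations = [
    ("go-grab", True), ("go-grab", True), ("go-grab", False),
    ("browse-buy", True), ("browse-buy", False), ("browse-buy", False),
    ("information-loader", False), ("information-loader", False),
]

visits, buys = Counter(), Counter()
for segment, bought in observations:
    visits[segment] += 1
    buys[segment] += bought

for segment in visits:
    print(segment, f"P(buyer) ~ {buys[segment] / visits[segment]:.2f}")
```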

Level-4 (Loyal Customer Identification). Services at this level are the most expensive but provide the highest return. Loyal customers include profitable customers and repeat visitors or customers. Currently, these services are only available for visitors who register at a Web site or provide identifying information at purchase. Using this data, together with data from companies such as Acxiom (see www.acxiom.com/) that collect and sell customer profiles, the Web site can identify specific visitors’ profiles. These individual profiles can then be augmented with facts describing the visitors’ history at the Web site to investigate what, beyond purchase behavior, characterizes loyal customers. This technique fails, however, for visitors who do not register or purchase anything at the Web site.

Loyal customer identification can lead to more targeted marketing activities that yield a higher return from loyal customers. In addition, identifying loyal customer behavior can enable marketing organizations to measure and increase the proportion of loyal and profitable customers.


Conclusion

The objective of this article has been to examine why clickstream data is not used extensively by commercial Web site owners for anything beyond a few simple analyses. We discussed three categories of reasons for this: some are data related, some are methodology related, and some are data analysis related. We believe that, to date, the emphasis has fallen too heavily on the data capture side of Web analysis (see Table 3c). Therefore, we have introduced and illustrated a Web forensic framework as an alternative structure for analyzing clickstream data. Based on this framework, we argue the emphasis must move toward data analysis and understanding if we want to use clickstream data for strategic purposes, especially in developing customer segments and identifying loyal customers.


Figures

F1 Figure 1. Web forensic pyramid (adapted from [11]).


Tables

T1 Table 1. W3C extended common log format with actual values.

T2A Table 2a. Comparing different clickstream data.

T2B Table 2b. Comparing well-known Web analytic tools.

T3A Table 3a. Nine extracted footprints for a track.

T3B Table 3b. Information about a track.

T3C Table 3c. Relating the Web forensic framework with clickstream problems.

References

    1. Chatham, B., Manning, H., Gardner, K.M., and Amato, M. Why Web site analytics matter. The TechStrategy Report. Forrester Research, Apr. 2003; www.forrester.com.

    2. Creese, G. Web analytics: Translating clicks into business. White Paper, Aberdeen Group, 2000; www.aberdeen.com.

    3. Creese, G. E-channel awareness: Usage, satisfaction, and buying intentions. White Paper, Aberdeen Group, 2003; www.aberdeen.com.

    4. Fu, Y., Sandhu, K., and Shih, M. A generalization-based approach to clustering of Web usage sessions. In B. Masand and M. Spiliopoulou, Eds., Web Usage Analysis and User Profiling. Springer-Verlag, London, 2000.

    5. Goldberg, J. Why Web usage statistics are (worse than) meaningless, 2001; www.goldmark.org/netrants/Webstats/.

    6. Haigh, S. and Megarity, J. Measuring Web site usage: Log file analysis. National Library of Canada, (Aug. 4, 1998), Ottawa, ON; www.nlc-bnc.ca/9/1/p1-256-e.html.

    7. Keylime Software, Inc. The evolution of Web analytics: From server measurement to customer relationship optimization. White Paper, Apr. 2002; www.limesoft.com.

    8. Kimball, R. and Merz, R. The Data Webhouse Toolkit. Wiley, New York, 1999.

    9. Moe, W.W. and Fader, P. Capturing evolving visit behavior in clickstream data. Journal of Interactive Marketing 18, 1 (Winter 2004), 5–19.

    10. Parshal, B. Web analytics: A bird's-eye view of practices and plans. White Paper, 2001; www.techrepublic.com.

    11. Sterne, J. Web Metrics: Proven Methods for Measuring Web Site Success. Wiley, New York, 2002.

    12. Underhill, P. Why We Buy: The Science of Shopping. Simon and Schuster, New York, 1999.
