Forum

In his article "Visual Exploration of Large Data Sets" (Aug. 2001, p. 38) Daniel Keim displayed or mentioned several techniques for visualizing data, including parallel coordinates, dense pixel displays, geographical maps, hyperbolic trees, and TableLens. Unfortunately, none of these techniques is suitable for data sets with large numbers of data points. Although Keim did not define "large data sets," his opening remarks refer to one exabyte of data per annum, so I assume he refers to large data sets that can involve dozens of variables and hundreds of millions or even billions of data points. Some of the techniques Keim listed have difficulty handling even tens of thousands of data points, and none can deal with tens of millions, let alone hundreds of millions or billions.

One technique he did not mention is called TempleMVV, first developed at Temple University’s Department of Physics in the mid-1980s for the visualization of multivariate functions and data. The first article on the technique was published in 1989, and it was patented in 1993. Three seminal articles by Mihalisin et al. are cited in Keim’s reference 4. We and others have been utilizing TempleMVV to visualize and visually mine and analyze data involving up to 10 dimensions (independent variables) and up to 25 measures (dependent variables) with hundreds of millions of data points on a desktop machine for over a decade. Keim’s "ultimate goal" mentioned in the last paragraph of his article was attained more than a decade ago.

Interested parties can have a full listing of more than a dozen articles we have published on this subject by emailing tmihal@bellatlantic.net.

Ted Mihalisin
Philadelphia, PA

Keim presents a number of examples of how visualization tools can be helpful in analyzing the seemingly infinite masses of digitized data. He reports that 99.997% of the exabyte (10¹⁸B) of data created every year is already in digital form. This is so important a point that it is quoted as 99.9% in the issue’s "Editorial Pointers." Clearly, analysis of massive data sets is a big problem, but just how big are these digital data sets? No formal reference is provided in the article, but based on the reference to the University of California, Berkeley, it isn’t difficult to discover the Web site www.sims.berkeley.edu/research/projects/how-much-info, which is the source of this important fact. Both a Web-based and a printable (200+ page PDF) version of the Berkeley report are available.

Several things should be noted. Berkeley’s intent is to count only original data. Circulation numbers or retail sales figures are irrelevant. Although the researchers discuss ephemeral content (such as phone conversations and live television), their goal is to measure the amount of data produced every year that is stored for some reasonable period of time. In many cases, this is done by determining how much media of a particular type (paper, film, magnetic) is produced and then estimating the percentage of original content for that particular type. For nondigital media, they make assumptions on acceptable digitization and compression techniques. As a result, for data not "born" electronic, their upper and lower bounds vary by as much as an order of magnitude.

The problem is that the Berkeley report does not support Keim’s claim that 99.997% of original data is already in digital form. The closest it comes is in the executive summary, where it states that only 0.003% of data is generated as printed material. Table 1 in the executive summary shows that, depending on which estimate you use, film media is anywhere between 5% and almost 50% of the total data stored every year. Film is clearly not digital nor even electronic. Camcorder tapes, which are primarily analog, are at least 15% of the total as well. Perhaps the only relevant thing concluded from the Berkeley report is that pictures really are worth much more than words, since so much of our unique data is stored that way, whether the format is analog or digital. I suppose this is something with which Keim would agree.

William Bogstad
Cambridge, MA

Author responds:
Thanks to Mihalisin for mentioning the TempleMVV system. Unfortunately, I was unable to mention all relevant techniques and systems. According to my understanding of TempleMVV, in my classification, the technique it uses would belong to the stacked display category.

I disagree with Mihalisin’s statement that none of the techniques mentioned in my article can handle tens of millions of data points. In conjunction with preprocessing and interaction, many of them have been used on millions of data points. I don’t know the latest version of TempleMVV, but apparently it also uses a high degree of aggregation (aggregation into bins) and is able to show a specific view of the data set at only one point of time. Therefore we still have to do a significant amount of research and development in order to reach the "ultimate goal" mentioned in my article.

In response to Bogstad’s letter: Thanks for providing the pointer to the Berkeley Web page. Due to the limited number of references allowed in my article, I was unable to include the reference. The report is indeed interesting. However, like every report that tries to estimate something on a global level, I think one has to be careful with the numbers it provides. By the way, I cannot see how you find in the report that between 5% and 50% of the media is film media. The numbers in Table 1 range from 9% (58PB/635PB) to about 20% (427PB/2120PB). But this isn’t the point, neither in my article nor in Bogstad’s reaction. These amounts are really large and need careful examination. Even if only 1% of the data is actually available in electronic form, that would still be 12PB (10¹⁵B) of data, far more than any data mining or visualization technique I know can handle.
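
For readers who want to check the two film-media ratios quoted in the response above, the arithmetic can be reproduced with a short sketch. The figures (58PB of 635PB and 427PB of 2120PB) are taken directly from the response; the function name `share_percent` and the variable names are illustrative assumptions, not anything drawn from the Berkeley report.

```python
def share_percent(part_pb: float, total_pb: float) -> float:
    """Return `part_pb` as a percentage of `total_pb` (both in petabytes)."""
    return 100.0 * part_pb / total_pb


# Figures as quoted in the author's response (hypothetical variable names).
low = share_percent(58, 635)      # lower estimate
high = share_percent(427, 2120)   # upper estimate

print(f"lower estimate: {low:.1f}%")
print(f"upper estimate: {high:.1f}%")
```

Under these inputs the two ratios come out at roughly 9% and 20%, matching the rounded values given in the response.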

Similar Figures for Women Math Majors

I read Paul De Palma’s "Viewpoint" ("Why Women Avoid Computer Science," Jun. 2001, p. 27) with great interest. I fear he may be onto something but will be hushed before the ideas he put forth are fully explored.

While I cannot cite specific figures to back up his position, anecdotally his observations track closely with mine. At our university (undergraduate population under 2,000), women outnumber men among the student body, and the ratio among math majors is similarly disproportionate. Yet in a computer department with about two dozen majors, only a handful, two or three at most, are female. Many of the math majors take several of our introductory computing courses and do quite well; in fact, many of them do very well writing programs and express satisfaction when the programs are complete. Still, we have not been successful in attracting more of them into the major.

As an experiment, I asked my wife to read De Palma’s column without explaining its contents beforehand. Uninfluenced by my point of view, she is very much in agreement with him on observations of women students and their expectations. Although my wife was a math major as an undergraduate, she doesn’t like computers.

J. William Cupp
Marion, IN

Commendation

I would like to commend the ACM for publishing David Touretzky’s recent "Viewpoint" ("Free Speech Rights for Programmers," Aug. 2001, p. 23) about the current problems with the Digital Millennium Copyright Act. As a longtime member of ACM I’m happy to see ACM taking a leadership role on this important issue. I’d like to encourage more action on behalf of ACM’s members.

Sandy Ressler
North Potomac, MD

October 2001 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
