Why do we care about rankings of graduate programs? Beyond the ability to cheer "We're Number One!" there are very practical reasons. For example, resource allocation is often based on using rankings as synonyms for quality indicators. An institution recently decided it would become a "top 25 institution" by ensuring that each of its graduate programs was ranked within the top 25% of all the graduate programs in the corresponding fields. And it was going to accomplish this by simply eliminating any program that was not. Mission accomplished! Besides resource allocation, prospective graduate students and faculty candidates look to rankings when deciding where to apply, so the rankings for U.S. institutions considered in this Viewpoint are of considerable interest both within the U.S. and internationally. Funders look at rankings when considering ability to perform the proposed research. Alumni look to rankings when making donation decisions. Despite all their acknowledged warts, rankings do matter.
In principle, generating rankings is straightforward mathematically:
Of course the practical difficulties are enormous. Among them:
So there are ample reasons why rankings based upon a transparent comprehensive analysis are not done frequently. Nevertheless, the U.S. National Research Council (NRC), through its Committee on an Assessment of Research Doctorate Programs, bravely tackled this thorny problem for U.S. institutions. (The NRC is the operating arm of the U.S. National Academies of Science and Engineering and the Institute of Medicine, honorific academies with a mission to improve government decision making and public policy, increase public education and understanding, and promote the acquisition and dissemination of knowledge in matters involving science, engineering, technology, and health. In many respects, the academies and NRC represent the "gold standard" of technical policy advice in the U.S. Because of the prestige of the academies and the NRC, their methodologies and reports have considerable international impact as well.)
The NRC last ranked doctoral programs in the mid-1990s and these rankings are clearly out of date. Further, the earlier rankings depended heavily on "reputation" as determined by respondents and this is often an inexact and lagging indicator. This time around the NRC sought to focus on a purely quantitative approach.
In this Viewpoint we describe how this process has played out for computing. While these comments clearly apply directly only to the NRC rankings effort, they are relevant to other similar efforts.
The NRC Ranking Process
The specifics of the NRC process were the following. The NRC developed a single set of metrics for all 62 disciplines being analyzed, covering disciplines in science, engineering, humanities, social sciences, and others. It then collected the data for these metrics via questionnaires administered to institutions, programs, faculty, and Ph.D. students, plus submitted faculty CVs. The weights were determined via two related approaches: asking a set of participants how much various metrics mattered in their perception of department rankings, and running a linear regression of a set of rankings against these metrics. Because these two approaches yielded substantively different results, the NRC established two sets of rankings, the Survey and Regression rankings, and reported these probabilistically. Specifically, they ran a set of samples using weights derived from these acquired distributions, and then reported the range of rankings corresponding to a 90th percentile, meaning that with 95% probability, an institution's rank would lie within the designated range. In other words, as an example, the NRC states that with 95% probability Georgia Tech ranks somewhere between 14th and 57th using the Survey weights and somewhere between 7th and 28th using the Regression weights.
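The resampling idea described above can be illustrated with a minimal sketch. This is not the NRC's code or data: the program names, metric values, weight means and spreads, the normal distribution for the weights, the number of samples, and the quantile cutoffs (`lower_q`, `upper_q`) are all assumptions chosen for illustration.

```python
import random


def rank_ranges(metrics, weight_means, weight_sds,
                n_samples=500, lower_q=0.05, upper_q=0.95, seed=0):
    """Return {program: (low_rank, high_rank)} over randomly drawn weights."""
    rng = random.Random(seed)
    ranks = {name: [] for name in metrics}
    for _ in range(n_samples):
        # Draw one weight vector (assumed independent normal distributions).
        weights = [rng.gauss(m, s) for m, s in zip(weight_means, weight_sds)]
        # Score each program as a weighted sum of its metric values.
        scores = {name: sum(w * x for w, x in zip(weights, values))
                  for name, values in metrics.items()}
        # Rank 1 goes to the highest score in this sample.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    result = {}
    for name, observed in ranks.items():
        observed.sort()
        # Read the rank range off the sorted sample at the chosen quantiles.
        low = observed[int(lower_q * (n_samples - 1))]
        high = observed[int(upper_q * (n_samples - 1))]
        result[name] = (low, high)
    return result


# Hypothetical standardized metric values for three made-up programs.
metrics = {
    "Program A": [1.2, 0.3, -0.5],
    "Program B": [0.1, 0.9, 0.4],
    "Program C": [-0.8, -0.2, 1.1],
}
ranges = rank_ranges(metrics, weight_means=[0.5, 0.3, 0.2],
                     weight_sds=[0.2, 0.2, 0.2])
for name, (low, high) in ranges.items():
    print(f"{name}: rank between {low} and {high}")
```

In this sketch, the width of each reported range depends on how uncertain the weights are: setting every entry of `weight_sds` to zero collapses each range to a single rank, while larger spreads widen it.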
The first issue is that this range, arising out of the probabilistic analysis, is difficult to reconcile. What does a rank between 14th and 57th mean? How does one reconcile differences between the two ranking systems, that is, between the Survey weights, which measure what respondents claim is important, and the Regression weights, which measure these claims against departmental reputations? Of how much value is a range if a 95th percentile span is being used?
Even if the rankings were not as impactful as in prior NRC studies, a rigorous data collection process could have yielded valuable data, which departments could use to assess their standing relative to peers. Unfortunately, there were a number of issues with the quality of the data:
The second issue noted here has gained the most attention from our community. CRA and ACM provided testimony to the NRC in 2002 when the study was just beginning, pointing out the importance of conferences to our field. Unfortunately, this advice was simply ignored by the NRC, a fact we did not discover until February 2010. We immediately notified the NRC, urging it to include conference publications, both for measuring publication productivity and for measuring citation impact. The NRC ultimately agreed to do so after extensive discussion at various levels. CRA worked with its member societies to provide a list of quality conferences; due to the tight deadline we know that this list is not 100% complete or accurate. The NRC took this list and then searched all vitae provided by CS faculty (which we also know to be incomplete) to generate conference publication counts. Since citations for conference publications were not available via the ISI database used by the NRC, citation data was not used at all for computer science as alternatives were not acceptable to the NRC. Based upon the NRC's analysis, a typical department had one conference publication per faculty member per year. In our view, this is not credible. Further, the NRC claims that more computing publications appear in journals than in conferences, which is very difficult to reconcile with what we see in practice.
Similarly, CRA worked with its member societies to put together lists of the awards that should be included and to correctly categorize them as "Highly Prestigious" or "Prestigious." This is not a trivial process; for example, does one include the many SIG awards? Again, the deadline to provide the list was tight and we are unable to verify that our list was applied. Thus, it is not clear that the NRC even now has a meaningful method for measuring faculty awards.
Just as troubling is that various member departments have not been able to verify the data that the NRC presents. That is, using the same vita and publication and awards listings, they simply cannot reproduce the numbers that the NRC provides for their departments. The NRC process used temporary workers trained by the NRC staff. Perhaps they were unable to deal with the multiple possible titles of publications (Commun. ACM = CACM = Communications) self-reported by faculty on their CVs. The conference publication numbers do not inspire much confidence that they could.
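The title-matching problem mentioned above can be made concrete with a small sketch. It is only an illustration of the kind of normalization such counting requires, not a description of what the NRC did; the alias table and the function names are hypothetical, and a real table would need careful curation for each field.

```python
import re

# Hypothetical alias table built from the example in the text.
ALIASES = {
    "commun acm": "Communications of the ACM",
    "cacm": "Communications of the ACM",
    "communications": "Communications of the ACM",
    "communications of the acm": "Communications of the ACM",
}


def normalize_venue(raw):
    """Map a self-reported venue string to a canonical name, or None if unknown."""
    key = re.sub(r"[^a-z0-9 ]", "", raw.lower())  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()        # collapse whitespace
    return ALIASES.get(key)


def count_by_venue(entries):
    """Count publications per canonical venue; keep unmatched strings for review."""
    counts, unmatched = {}, []
    for raw in entries:
        name = normalize_venue(raw)
        if name is None:
            unmatched.append(raw)
        else:
            counts[name] = counts.get(name, 0) + 1
    return counts, unmatched


counts, unmatched = count_by_venue(
    ["Commun. ACM", "CACM", " Communications ", "Unknown Workshop"]
)
print(counts)
print(unmatched)
```

Returning the unmatched strings rather than silently dropping them matters here: a bare title such as "Communications" could plausibly refer to a different publication, so ambiguous or unknown entries are better flagged for manual review than guessed at.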
One might suggest that the central problem is that computer science is unusual in its practices, and that our field is simply an outlier. This does not appear to be the case. The Council of the American Sociological Association recently passed a resolution condemning the NRC rankings and saying that they should not be used for program evaluation. Input from colleagues suggests that other fields, such as aeronautics/astronautics and chemical engineering, are uncomfortable with the NRC process, for many of the reasons we have raised in this Viewpoint.
So we have a situation in which incorrect data are provided for invalid metrics and rankings are calculated using weights that are not readily understood. It would be easy to dismiss the entire process except that institutions are using the results to make programmatic decisions including closing programs. At a recent symposium, many university administrators expressed considerable support for continuing the data collection effort, and generating rankings if it can be accomplished in a meaningful way.
So how should the process work? Here are our suggestions:
We do not claim that this strategy will eliminate all of the many issues with rankings, but it will provide a consistent set of fundamental data that administrators, faculty, students and others can use to understand departmental strengths and weaknesses in a way that matters to them.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.
Has anyone proven that the doctoral ranking of a university CS dept is actually predictive of the quality of an individual student or graduate? For "top" programs the quality variance may be less, yet this means that rankings are a crutch for analysis and are poor predictors of student quality outside of the top 10. Is the Pareto principle at work here, i.e., are the top 20% of CS programs producing 80% of the research? A simple examination of top CS conference proceedings says no.