Today’s CS educators place much greater emphasis on the design and analysis of algorithms than on the data those algorithms process. They also generally require students to develop programs from scratch, rarely asking them to use and modify existing complex programs that could not be redeveloped ab initio. These practices need to change. I urge CS educators to help prepare their students to manipulate huge amounts of data and to gain experience in using, modifying, and combining complex programs available on the Web.
Along with the continuing advances in computer hardware over the past four decades, CS has undergone its own enormous growth and evolution. That growth requires that faculty periodically update curricula to prepare their students for the new tasks they are likely to face upon graduation.
The introduction of structured programming, object-oriented programming, and graphical interfaces offers notable examples of how curricula have been updated in the area of programming languages. Similarly, algorithmic complexity, program correctness, the properties of grammars, and automata have been given a prominent place in CS education. Nevertheless, none of these topics addresses the issues arising from today’s Web content.
Updating CS curricula is generally guided by developments in hardware that allow users to tackle previously infeasible tasks. In the late 1950s, when I was first introduced to programming using the Illiac I, my professor at the University of Illinois said that a person who could not write a square-root subroutine in assembly language from memory did not deserve to use a computer. This argument sounds ludicrous today, but it was understandable at a time when computing machines were as bulky as they were costly.
The growing memory capacity and speed of personal computers, along with the vast network of servers on the Web, increasingly influence CS education. The time is ripe for a major update, motivated by two crucial new realities that are direct consequences of the information available on the Web: the volume and nature of the data to be dealt with and the availability of complex programs that cannot be redeveloped from scratch.
The mass of data any future programmer is likely to deal with is on the order of terabytes or possibly petabytes; hard drives storing a terabyte of data or more will soon be commercially available. It is also likely that the data will never be perfect, since errors of all kinds can creep in, so frequent updates are required. Several applications typify the nature of today’s Web-available data: the textual content of servers handled by Google and other search engines, the repository of genomic sequences, the data gathered by telescopes and satellites, and the data collected by all kinds of terrestrial sensor networks.
This volume of data and its imperfect nature produce a number of direct consequences, including:
- The notion of practical complexity must be revised: any above-linear algorithm may be too time consuming, and one may even need to avoid linear algorithms with large constant factors;
- The processed results should reflect existing data imperfections;
- Preprocessing and incremental algorithms will become essential for reducing computing times; and
- Approximate solutions may be the only resort for solving large complex problems.
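A back-of-the-envelope calculation makes the first point concrete. Assuming, purely for illustration, a machine performing a billion elementary operations per second, the gap between complexity classes at terabyte scale is stark:

```python
import math

# Hypothetical throughput, chosen only as a round number for estimation.
OPS_PER_SECOND = 1e9

def seconds(op_count):
    """Estimated running time for a given number of elementary operations."""
    return op_count / OPS_PER_SECOND

n = 1e12  # a terabyte of one-byte records

linear = seconds(n)                    # O(n): minutes
loglinear = seconds(n * math.log2(n))  # O(n log n): hours
quadratic = seconds(n ** 2)            # O(n^2): geological time

print(f"O(n):       ~{linear / 60:.0f} minutes")
print(f"O(n log n): ~{loglinear / 3600:.1f} hours")
print(f"O(n^2):     ~{quadratic / 3.15e7:,.0f} years")
```

Even the jump from linear to n log n turns minutes into hours, which is why large constant factors and above-linear behavior both matter at this scale.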
Few such consequences are covered in even the most advanced undergraduate CS education curricula in U.S. universities today.
The availability of complex programs rules out redeveloping from scratch programs that already exist on the Web and can be freely downloaded or executed on a user’s data. Consider, for example, Graphviz, or Graph Visualization Software, a program developed by AT&T Labs for drawing graphs and trees (www.graphviz.org/). Using Graphviz typically means writing a script that supplies the program a file containing the data about the graph one wishes to draw, then retrieving another file, in, say, a graphical format, containing the desired drawing.
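A minimal sketch of such a glue script might look as follows; the edge list and file names are illustrative, and the final step assumes Graphviz’s `dot` tool is installed on the system:

```python
import subprocess

# An illustrative graph: three nodes in a cycle.
edges = [("a", "b"), ("b", "c"), ("c", "a")]

def to_dot(edges):
    """Render an edge list in Graphviz's DOT language."""
    lines = ["digraph G {"]
    lines += [f'    "{u}" -> "{v}";' for u, v in edges]
    lines.append("}")
    return "\n".join(lines)

# Step 1: write the data file Graphviz expects.
with open("graph.dot", "w") as f:
    f.write(to_dot(edges))

# Step 2: hand the file to `dot` and retrieve the drawing as a PNG.
try:
    subprocess.run(["dot", "-Tpng", "graph.dot", "-o", "graph.png"],
                   check=True)
except FileNotFoundError:
    print("Graphviz's `dot` was not found; install it from www.graphviz.org")
```

The pattern — write a data file, invoke an existing complex program, collect its output file — is exactly what students rarely practice when every assignment starts from a blank page.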
A second example is BLAST, or the Basic Local Alignment Search Tool, a program widely used by biologists to search for DNA sequences in vast genomic databases. Biologists often use a plethora of Web-available programs that can download data from a user file and place the results in another file. This feature explains the success of script languages like Perl and Python.
A third example is Google. Its sophisticated users often have to write scripts that analyze the data being provided and automatically generate new searches. As more data becomes available on the Web, such cascading searches will enable programmers to integrate data from a number of Web sites and solve problems that otherwise would not be approachable.
The benefits of including script languages in the CS curriculum are also exemplified by a recently introduced approach to solving crossword puzzles with the help of search engines [1]. A phrase specifying a desired but unknown word is fed into Google, and the search results are inspected by a program to determine a candidate word that satisfies such criteria as word length and the position of certain letters.
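The result-inspection step of that approach can be sketched in a few lines. The snippet of text below stands in for real search-engine output, and the constraint format (word length plus fixed letters at given positions) is an assumption about how the puzzle clue is encoded:

```python
import re

def candidates(text, length, fixed):
    """Words of the given length in `text` whose letters match `fixed`,
    a dict mapping 0-based position -> required letter."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        w for w in words
        if len(w) == length and all(w[i] == c for i, c in fixed.items())
    )

# Stand-in for the text of a search result page.
snippet = "The capital of France is Paris, a city on the Seine."

# Clue constraints: a 5-letter word starting with 'p' and ending in 's'.
print(candidates(snippet, 5, {0: "p", 4: "s"}))  # -> ['paris']
```

Fetching the result pages themselves is ordinary scripting; the interesting part is precisely this filtering of messy Web text against the puzzle’s constraints.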
Any programmer knows that the increasing speed of processors allows us to handle, using brute-force methods, problems that would not otherwise be solvable. Examples like using Google to solve crossword puzzles demonstrate that the sheer volume of information available on the Web enables us to devise creative solutions to certain types of problems.
The two influences—dealing with algorithms that process huge amounts of imperfect data and the availability of ready-to-use complex programs—will inevitably change the way future graduates operate. CS educators should thus consider the following recommendations:
- Script capabilities. Teach script capabilities in introductory programming courses, including the search for sequences of patterns in Web sites and the use of existing complex programs;
- Data volume. Emphasize in algorithms/data structures courses that the volume of data can be enormous; for example, students should be required to attempt to sort a few gigabytes of data, preferably downloaded from a Web site. Doing so will lead them to considerations about incremental sorting, approximate algorithms, and preprocessing;
- Complexity of algorithms. Stress in theory courses that a proof of NP-completeness should be taken only as a warning sign of a difficult problem. It is important to concentrate on approximation algorithms capable of yielding imperfect but satisfactory solutions. In addition, probabilistic automata and Markov processes should be included; the answers provided by programs become not simply yes or no but probabilities of event occurrences; and
- Probability and statistics. Accentuate the importance of probability and statistics in discrete mathematics courses.
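The gigabyte-sorting exercise above leads naturally to external sorting, which a student might sketch roughly as follows. The chunk size is illustrative; the idea is that no more than one chunk of lines is ever held in memory, with sorted runs written to disk and merged at the end:

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, chunk_size=100_000):
    """Sort the lines of a file too large for memory: sort fixed-size
    chunks into temporary run files, then k-way merge the runs."""
    runs = []
    with open(in_path) as f:
        while True:
            # Read at most chunk_size newline-terminated lines.
            chunk = [line for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)
    files = [open(r) for r in runs]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*files))  # lazy k-way merge of sorted runs
    for f in files:
        f.close()
    for r in runs:
        os.remove(r)
```

Working through why the merge step streams rather than loads everything is precisely the kind of reasoning the data-volume recommendation is after.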
The recommendation on probability and statistics requires further justification. Since the genesis of CS some 60 years ago, more attention has gone to algorithms as such than to the statistical nature of the data they process. Repositories of large data sets representing actual graphs and other data structures exist, but their diversity and scope need to be expanded; repositories should be under constant development and made available to the CS community. The Web itself is one of the best repositories available.
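The approximation algorithms called for in the theory-course recommendation need not be exotic. The classic greedy factor-2 approximation for minimum vertex cover fits in a few lines: repeatedly pick any uncovered edge and take both of its endpoints. The example graph here is made up:

```python
def approx_vertex_cover(edges):
    """Greedy 2-approximation: for each edge not yet covered,
    add both endpoints to the cover."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

edges = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 4)]
print(approx_vertex_cover(edges))
```

On this graph the greedy cover has four vertices while the optimum ({2, 4}) has two — exactly the guaranteed factor of 2, and an imperfect but satisfactory answer of the kind large problems demand.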
A final recommendation is in order: The general tendency toward outsourcing has a significant effect on how CS educators should prepare their students. More and more programs are being developed outside the organization and, increasingly, outside the U.S. as well. While programming remains a crucial skill, students must be able to conceptualize and clearly specify a program’s capabilities, rather than focus on the minutiae of programming.
Some CS educators argue that CS curricula already contain material that can hardly be covered in a few semesters. Others believe that students with a solid CS education can quickly learn the kind of material I’ve cited here. My rejoinder is that educators should be sensitive to presenting new directions in the field, constantly renewing the relevant material to be transmitted to the next generation of practitioners.