The Data Science Boom is Here

Those who educate data scientists are scrambling to meet current needs while also trying to create solid educational foundations along the “K-20” (kindergarten through Ph.D.) pipeline.

There is a story, perhaps apocryphal, about a young engineer who was offered a job at IBM in the mid-1950s, programming around the development of the 7030 Stretch, the first transistorized supercomputer.

"But I'm a mechanical engineer," he said. "I don't know anything about programming."

Programming, his interviewer replied, was a language, and it could be taught to him. He accepted the job, and worked at IBM for 30 years.

The fields around computational expertise may be at another such moment, when people with great potential but little experience in computer science are being welcomed and encouraged in large numbers to join the profession, thanks to the explosive growth of the discipline of data science. According to educational publisher John Wiley & Sons' Discover Data Science, 11.5 million data science jobs are expected to be created globally through 2026. The U.S. Bureau of Labor Statistics estimates an aggregate growth rate of 22% for computer and information research scientists in the U.S., including data scientists, through 2030 – about three times the estimated growth rate for all occupations.

Those responsible for educating qualified data scientists are scrambling to provide for current needs through a combination of graduate and certificate programs welcoming a diverse set of students, including those with limited technology backgrounds, while also trying to create solid educational foundations along the "K-20" (kindergarten through Ph.D.) pipeline as the discipline matures.

Understanding the audience while writing the script?

Karl Schubert is associate director of the University of Arkansas' bachelor's degree program in data science, which launched in 2020 and will graduate its first class in 2023. He also led the team that developed the state's high school curriculum for data science. Prior to his return to academia, he worked in corporate technology for companies including IBM, Dell, and Lifetouch, and has an ear to the ground for what industry wants in the way of trained technologists.

While numerous graduate data science programs have launched with invitations to students with limited computer science backgrounds in an attempt to meet current market demand, Schubert said his industry colleagues wanted trained data scientists who could understand the statistical rigors of the field and explain why they took a certain approach to modeling a problem. They also needed to be able to convey that decision to non-technical executives.

Schubert said the graduates of the program are receiving "not just a technical degree; it is a STEM degree, but they have to be able to communicate outside their domain with people who aren't technically knowledgeable. That's not an insult, you just have to understand your audience."

With perceived demand for data science skills expanding rapidly, understanding that audience also means trying to reach consensus on how best to train its practitioners. Paul Leidig, director of the school of computing at Michigan's Grand Valley State University, served as co-chair of the ACM Education Board's Data Science Task Force, charged with developing the computational competencies for undergraduate data science curricula. The task force began its work in August 2017 and found itself trying to frame a rapidly growing field that had a ready market but little consensus on what made a data scientist. The task force's recommendations were published in January 2021.

Leidig contends the explosion in the popularity of data science at the baccalaureate level is by "demand and awareness, not by definition, description, or desire. So I think it was very timely that we finally came out with at least a start to defining the competencies at a bachelor level."

The task force, he said, intentionally called its focus the "computing competencies," with the idea that associated disciplines could offer additional field-specific traits of a trained data scientist. "So the next efforts, which are in their infancy, are of what I hope will be the competencies necessary for data science undergraduate programs."

'A good-enough programmer'

The ACM task force report calls for data science graduates to have some "basic" skills and computational literacy, including "basic education in computing (programming, databases, use of the Internet); be able to program on their own with one or two common languages (Python, R); be aware of some common libraries such as sklearn in Python, R packages, and several method or domain specific libraries; and be able to learn new languages and new libraries when needed."

Schubert said the University of Arkansas curriculum is taking small steps to build its data science students' programming competence, with Python and R currently serving as the field's lingua franca.

"In three or four years, we will start having students come in who have had Python and R, who have done some case studies," he said. "But right now, our expectation is they have no skill set at all."

In their first semester, he said, students will be taught the very rudiments of programming – the command line, GitHub, GitLab, Python, and then R. In the second semester, Java will be introduced to teach object-oriented programming.

"After that, we don't have any specific programming courses because they use programming in all the remaining data science courses, either Python or R. We do teach them Power BI and Tableau in the data visualization and communications course, but after that, they are using it every time, so we don't reteach it. We advance their skills in it but we do not go back to the beginning.

"I think in the end with data science, I tell the students in computer science, you need to be a really good programmer. In data science, you need to be a good-enough programmer that you can get to the point you can do the analysis without coming up with bad data."

'Computing With A Purpose'

Paul Anderson, currently principal investigator of the Anderson Data Science Lab at California Polytechnic State University, San Luis Obispo, was the director of the College of Charleston's data science program when the college launched its graduate curriculum in 2019. The program's first cohort, he said, was composed mostly of non-computer science majors; he feels there is no harm in that and that, in fact, the diversity of experience will help broaden the field in "real-world" applications.

"We did have some physicists and mathematicians who definitely had a good computer science background, but the majority of folks didn't," Anderson said. "Luckily, there is so much excellent on-ramp material for data science, that we could provide a lot of asynchronous material before they finalized their entrance through our entrance exam.

"About whether or not this might disappear, I think it might depend on the programs. I personally feel the field of data science is stronger because of the diversity of backgrounds."

Anderson calls the practice of data science "computing with a purpose," and cautioned against defining it as a discrete element. Recalling a recent meeting with Project Jupyter's Brian Granger, Anderson said Granger talked about not defining data science per se, but instead thinking of tasks or processes to master. In some instances, Anderson said, those tasks may indeed need a computation-first approach; in others, such as manufacturing, the "purpose" that computing addresses is domain-specific, augmented with programming skills and knowledge of mathematical and statistical methods. (A Venn diagram created by Alluvium founder and CEO Drew Conway in 2010 may still be the most-referenced image defining which skill sets must overlap to produce methodologically sound data science.)

The principles behind that domain knowledge can be introduced even before a student gets to college, according to Jake Baskin, executive director of the Computer Science Teachers Association (CSTA), though Baskin also said he does not want to see the association's essential work of strengthening the K-12 core computer science curriculum pulled in too many directions at once.

"Education changes slowly and we still have a lot of work to do on the changes we're hoping to make in computer science," Baskin said. "I wonder if some of it is intermediary, and this is a process that gets to even fuller integration with other subject matters; it's not that you get into data science, but that it's understood that to be a sociologist, part of that will be data science, and including it there as well."

The University of Arkansas' Schubert said he doubts any kind of national template will emerge on a data science curriculum in the U.S., but that within state boundaries, objectives such as coordination between community colleges and universities will coalesce. "We can't have a proliferation of data science programs where students in the two-year colleges have trouble transferring to complete a 2+2."

Leidig said the multiple disciplines shepherding the field towards some sort of maturity may not be on the same track yet, but that mutual knowledge is growing. Putting a concise term on the state of that coordination, though, is tricky.

"It certainly hasn't been standardized," he said. "I think it's optimistic to use the word 'synchronized', although that's more appropriate than 'standardized'. It is becoming more synchronized. We are becoming more harmonious. Maybe harmonious is a better word. They're not really in sync, but they're happening simultaneously, in parallel, with awareness of other efforts. So maybe they're not even harmonized yet either."

There is no escaping the fact that, however one wants to define the commonalities between efforts, growth in data science courses is happening faster than expected, and it's bringing in students who have not seen computer science classrooms before.

Schubert said enrollment in the major at the University of Arkansas is 18 months to two years ahead of schedule. He said the program, which includes courses across 10 concentrations from three of the university's colleges—the College of Engineering, the Fulbright College of Arts and Sciences, and the Sam M. Walton College of Business—is significantly broadening the student demographics studying technology.

"We have attracted a demographic that does not look like the college of engineering demographic," he said. "We have a better male/female mix, a better minority mix. When I compare us to the three colleges individually, our students' ACT and SAT scores are higher and their incoming GPAs are higher, and 40% of our students are honor students."

 

Gregory Goth is an Oakville, CT-based writer who specializes in science and technology.
