Communications of the ACM


Software Engineering of Machine Learning Systems

[Illustration: tableau of server and computer icons. Credit: ArtemisDiana]

Machine learning (ML) is ubiquitous, contributing to society-facing applications that each day impact how we work, play, communicate, live, and solve problems. In 2015, after Marc Andreessen proclaimed "software is eating the world," others countered "machine learning is eating software." While there have been exciting, ACM A.M. Turing Award-winning advances in the science of ML, it is important to remember that ML is not just an academic subject: it is a technology used to build software that is consequential in the real world.

Failures of ML systems are commonplace in the news: IBM's Oncology Expert Advisor project is canceled for poor results after a $60 million investment; the chatbot Tay learns abusive language; an Uber self-driving car runs a red light; a Knightscope security robot knocks over a toddler;8 facial recognition systems unfairly make three times more errors for non-white non-males.2

For 50 years, the software engineering (SE) community has been learning from its own failures and introducing new tools, techniques, and processes to make better software. To build robust, reliable, safe, fair systems with ML, we need all these established SE techniques, and some new ML-specific ones. In turn, software engineers should realize how ML can change the way software is developed. Our goal in this Viewpoint is to pull together some of the threads that relate to this topic to spur more conversation about putting these ideas into practice. Ignoring these insights will lead to unprecedented levels of technical debt. In brief, we believe that ideas from software engineering such as explicit requirements, testing, and continuous integration need to be embraced by teams that are making systems built on top of data.

Software Engineering Basics

In some organizations, machine learning researchers develop ML algorithms using R or Python notebooks without basic software engineering tools like type checking, reusable modules, version control, unit tests, issue tracking, system and process documentation, code reviews, postmortems, automated hermetic builds, and continuous monitoring. This lack of discipline is an avoidable source of failures—organizations should insist on standard SE practices from their ML teams. ML researchers should be encouraged to use the same tools and pipelines that will be used in production, heeding the emerging machine learning operations (MLOps) techniques. As the NASA saying goes, "Test what you fly, fly what you test." Including software engineering education as part of the data science curriculum will help newly minted students appreciate how important these processes are to help them meet their responsibilities.
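
As a small illustration of what these practices look like, the sketch below takes a helper of the kind that often lives inline in a notebook and gives it type hints, a docstring, explicit error handling, and a unit test. The function name `min_max_scale` and the test cases are hypothetical, chosen only for this example.

```python
import unittest
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        # A constant column has no spread; fail loudly instead of dividing by zero.
        raise ValueError("values must not all be equal")
    return [(v - low) / (high - low) for v in values]


class MinMaxScaleTest(unittest.TestCase):
    def test_scales_to_unit_interval(self) -> None:
        self.assertEqual(min_max_scale([2.0, 4.0, 6.0]), [0.0, 0.5, 1.0])

    def test_rejects_constant_input(self) -> None:
        with self.assertRaises(ValueError):
            min_max_scale([3.0, 3.0])
```

Saved in a module, the tests can be run with `python -m unittest`, and the annotations can be checked by a static type checker before the code ever reaches a production pipeline.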

Taking Data Seriously

Every software program deals with issues of verifying data, keeping it secure and private, and scaling up to big data when needed. But in ML the stakes are raised because algorithms learn from the data, so a data error can lead, not just to a problem for one customer, but to decreased performance for every customer.

Organizations should employ a data pipeline that tracks the provenance of all data, allows it to be annotated, managed, and abstracted, and protects the privacy of personally identifiable or proprietary data. Promising technologies including differential privacy, federated learning, and homomorphic encryption are key tools for privacy, but are in varying stages of maturity.
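
To make one of these privacy tools concrete, here is a minimal sketch of the Laplace mechanism, a standard building block of differential privacy, applied to a counting query. The function name `private_count`, the example records, and the choice of `epsilon` are assumptions made for illustration. The floating-point sampling used here is not hardened against known attacks, so this is a teaching sketch and not a production mechanism.

```python
import random
from typing import Callable, Iterable, Optional


def private_count(
    records: Iterable[dict],
    predicate: Callable[[dict], bool],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a noisy count of records matching predicate (Laplace mechanism)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one record changes a count by at most 1 (sensitivity 1),
    # so the noise is Laplace with scale 1 / epsilon. The difference of two
    # exponential draws with rate epsilon has exactly that distribution.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical records; smaller epsilon means more noise and stronger privacy.
patients = [{"age": 34}, {"age": 71}, {"age": 68}, {"age": 45}]
noisy = private_count(patients, lambda r: r["age"] >= 65, epsilon=0.5)
print(f"noisy count of patients aged 65+: {noisy:.2f}")
```

The design trade-off is visible in the single parameter: a small `epsilon` protects individuals better but makes the released count less useful.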

ML teams must have separate training, validation, and test datasets, which should not accidentally become intertwined as the design space is explored. Teams can fool themselves by running experiments until the results look good, and then stopping. At that point, they cannot be sure whether the last good results are truly representative of future performance, or whether the results will regress to the mean. Teams must protect themselves by adhering to an experimental design and stopping rules.
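
One way to keep the three datasets from becoming intertwined is to assign each example to a split by hashing a stable identifier, so the assignment never changes between experiments. The sketch below assumes every example has such an identifier; the function name `assign_split` and the 80/10/10 proportions are illustrative choices.

```python
import hashlib


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a stable example ID to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


# Hypothetical record IDs; the same ID always lands in the same split.
ids = [f"record-{i}" for i in range(1000)]
splits = {example_id: assign_split(example_id) for example_id in ids}
```

Because the split depends only on the identifier, re-running a pipeline, reshuffling the data, or adding new examples cannot move an existing test example into the training set.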

Many ML algorithms include randomness, so different training runs may produce different results. More thorough testing is needed as a consequence. Throughout the development process, developers must acknowledge that the fingerprints of the human engineers are all over the final result, for better and for worse.
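
A common response to this run-to-run variation is to repeat training under several random seeds and report the spread, not a single number. In the sketch below, `noisy_training_run` is a stand-in that simulates an accuracy score; in practice it would be replaced by a real training and evaluation run. All names and numbers are hypothetical.

```python
import random
import statistics


def noisy_training_run(seed: int) -> float:
    """Stand-in for a real training run: returns a simulated accuracy."""
    rng = random.Random(seed)  # seeding makes each individual run reproducible
    return 0.90 + rng.gauss(0.0, 0.01)


def summarize_runs(seeds):
    """Run once per seed and return the mean and sample standard deviation."""
    scores = [noisy_training_run(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


mean_acc, std_acc = summarize_runs(range(10))
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over 10 seeds")
```

Reporting the standard deviation alongside the mean makes it harder to mistake one lucky run for a real improvement.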

When example data comes from users, ML systems are vulnerable to mistakes and intentional attacks by these users. One site catalogs labeling errors in popular crowd-sourced image databases (such as labeling a dingo as a coyote); errors like these corrupt systems that train on the data. Worse, adversaries who can contribute as little as 1% of the training data can poison it, allowing them to exploit an ML system such as an email spam filter.7 To guard against these attacks, we must verify data as well as software, and promote "adversarial thinking" throughout the educational pipeline so system designers know how to defend against malevolent actors.
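
Verifying data can start with simple checks. The sketch below flags crowd-sourced examples whose annotators disagree, so that a person can review them before training. It is one basic sanity check under assumed inputs, not a defense against the poisoning attacks cited above; the function name `flag_low_agreement`, the 0.75 threshold, and the example votes are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def flag_low_agreement(
    votes: Dict[str, List[str]], min_agreement: float = 0.75
) -> List[str]:
    """Return IDs of examples whose most common label falls below min_agreement."""
    flagged = []
    for example_id, labels in votes.items():
        if not labels:
            flagged.append(example_id)  # no labels at all also needs review
            continue
        _, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) < min_agreement:
            flagged.append(example_id)
    return flagged


# Hypothetical annotator votes for two images.
votes = {
    "img-001": ["dingo", "dingo", "coyote"],
    "img-002": ["cat", "cat", "cat"],
}
flagged = flag_low_agreement(votes)
print(flagged)
```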

Optimizing the Right Thing

In traditional software engineering, the requirements say what should be achieved, and programmers design and develop an implementation that meets the requirements, verifying that their choice of how to do it satisfies what must be done. With ML, computation becomes a partner in this translation process: Programmers state what they want achieved by supplying training examples of appropriate (and inappropriate) behavior and defining an objective function that measures how far off the implementation is from the requirements. A learning algorithm then takes the training examples and objective function and "compiles" them into a program that ideally produces success on future examples. If we get the data and requirements right, the software "writes itself."
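
A toy version of this process fits in a few lines. In the sketch below, the training examples are labeled numbers, the objective function counts misclassified examples, and the "learning algorithm" is an exhaustive search over candidate thresholds. Every name and data point is hypothetical, and real learning algorithms are far more elaborate, but the division of labor is the same.

```python
from typing import List, Tuple

Example = Tuple[float, bool]


def objective(threshold: float, examples: List[Example]) -> int:
    """Count training examples that the rule `x >= threshold` gets wrong."""
    return sum((x >= threshold) != label for x, label in examples)


def learn_threshold(examples: List[Example]) -> float:
    """'Compile' examples into a rule: pick the threshold with the lowest objective."""
    candidates = sorted(x for x, _ in examples)
    return min(candidates, key=lambda t: objective(t, examples))


# Training examples: inputs paired with the desired behavior.
examples = [(1.0, False), (2.0, False), (3.5, True), (4.0, True), (5.0, True)]
threshold = learn_threshold(examples)


def classify(x: float) -> bool:
    """The learned program: nobody wrote the threshold by hand."""
    return x >= threshold


print(threshold, classify(4.2), classify(0.5))
```

The programmer never specifies the decision rule directly. The examples and the objective do, which is why errors in either one end up embedded in the resulting program.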

Unfortunately, the reality of ML is that it is difficult to say precisely and unequivocally what you want. ML algorithms will exploit loopholes in the requirements. A Roomba that was instructed not to bump into furniture learned to drive backward, because there were no bumper sensors on the rear. Video-game-playing programs were instructed to achieve the highest score, and did so not by playing the game, but by writing their name in the list of high scorers.4

Expressing requirements as objective functions can surface their latent contradictions. For example, the law requires equitable treatment of protected classes in employment, housing, and other decision-making scenarios.5 When writing requirements for a system to predict whether a defendant will commit a crime in the future, we need to translate this "equitable treatment" law into an objective function. One requirement is that predictions are equally accurate for, say, both Black and white defendants. Another requirement is that the harm done by bad predictions affects Black and white defendants equally. Unfortunately, it is not possible to achieve both these requirements simultaneously.3 ML developers, in consultation with all stakeholders, must decide how to make trade-offs in objectives.
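
A small numerical illustration shows how such requirements pull apart. The confusion-matrix counts below are made up for two hypothetical groups with different base rates. They do not prove the impossibility result; they only show a predictor that is equally accurate for both groups while distributing its errors unequally.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one group."""
    tp: int  # predicted positive, actually positive
    fn: int  # predicted negative, actually positive
    fp: int  # predicted positive, actually negative
    tn: int  # predicted negative, actually negative

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    def false_negative_rate(self) -> float:
        return self.fn / (self.fn + self.tp)


# Hypothetical counts: group A has a 50% base rate, group B a 20% base rate.
group_a = Confusion(tp=40, fn=10, fp=10, tn=40)
group_b = Confusion(tp=10, fn=10, fp=10, tn=70)

for name, group in (("A", group_a), ("B", group_b)):
    print(
        f"group {name}: accuracy={group.accuracy():.3f} "
        f"FPR={group.false_positive_rate():.3f} "
        f"FNR={group.false_negative_rate():.3f}"
    )
```

In these made-up numbers both groups see 80% accuracy, yet people in group A who would not reoffend are wrongly flagged more often, and people in group B who would reoffend are missed more often. Which of these disparities matters most is a decision for stakeholders, not something the arithmetic can settle.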

Building Trust

A successful software system must not only get the right answers, but also convince the developers and users to trust the answers. As ML software is increasingly being tasked with highly sensitive decision making in areas such as hiring, medical treatment, lending, and criminal sentencing, we have to ensure such systems are fair, transparent, accountable, auditable, explainable, and interpretable. We must determine what sort of explanations will justify treatment of an individual; what kind of tests will certify robust system performance across likely use cases; what type of auditing and reporting will demonstrate fairness across diverse protected classes; and what remediation processes are available when there are errors. Doing so is difficult because ML systems are often used for the most complex and ill-defined tasks—if the tasks were easy, we would not need an ML solution.

For example, despite enthusiasm for ML classifiers in medical imaging and despite many systems achieving 95%+ accuracy,1 there are still few deployed systems, because ML engineers have not yet convinced doctors the systems are robust across different X-ray machines, different hospital settings, and so forth. Without better assurance, there is insufficient reason to believe such high accuracy will persist in the wild, and a drop in accuracy there would result in poor outcomes.

Traditionally, programmers try to write code that is general enough to work even in cases that were not tested. In ML systems, it is hard to predict when a training example will generalize to other similar inputs, and when the example will be memorized by the system with no generalization. We need better tools for understanding how ML systems generalize.

In traditional software, major updates are infrequent, leaving sufficient time to thoroughly test them. Small patches are localized, and thus only require testing the parts that are changed. But in ML systems such as Internet search and product recommendations, new data is arriving continuously and is processed automatically, often resulting in unbounded nonlocal changes to ML models. ML developers need to be prudent about how often they retrain their models on new data: too frequently and it will be difficult for quality assurance teams to do thorough testing; too rarely and the models become stale and inaccurate. We need better tools for tracking model evolution. In addition to tracking new data, we need a process for discarding old data, as when a major event like a new social media platform or a pandemic significantly alters the data that models rely on. Avoiding downstream trouble due to such data cascades takes vigilance, as well as inputs from domain experts and from "softer" fields such as human-computer interaction.
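
One simple form of vigilance is to compare each incoming batch of data against a reference sample and flag large shifts for human review before retraining. The sketch below measures how far a feature's mean has moved, in units of the reference standard deviation. The function names, the 0.5 threshold, and the data are hypothetical; real monitoring would typically track many features and use richer statistics.

```python
import statistics
from typing import Sequence


def mean_shift_in_stdevs(reference: Sequence[float], current: Sequence[float]) -> float:
    """Distance between the two means, in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(current) - ref_mean) / ref_std


def needs_review(
    reference: Sequence[float], current: Sequence[float], threshold: float = 0.5
) -> bool:
    """Flag a batch whose mean has drifted more than threshold stdevs."""
    return mean_shift_in_stdevs(reference, current) > threshold


# Hypothetical feature values: a reference sample and two incoming batches.
reference = [10, 12, 11, 13, 12, 11, 10, 12]
stable_batch = [11, 12, 10, 13]
shifted_batch = [18, 20, 19, 21]

print(needs_review(reference, stable_batch))
print(needs_review(reference, shifted_batch))
```

A check like this does not say whether old data should be discarded, but it gives domain experts a prompt to ask the question.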

Verification of traditional software relies on abstractions that let us divide the system into components that are composable, modular, interpretable, and often reusable. Current methods for building ML systems do not typically lead to models with these properties. Rather, each new ML system is built from scratch, and each resulting model is largely a black box requiring significant work to understand and verify. One promising approach is a class of multitask unified models6 that share internal structure across related tasks, data sets, and languages. More effort can go into verifying a single large model, which can then be specialized for new tasks.

Diverse Viewpoints

It is easy to create, say, a calculator program, because all users agree that the answer to 12 × 34 should be 408. But for many ML problems the desired answers are personalized for each user. That means that we need to do balanced data collection, particularly for underrepresented groups, to assure that each group gets the answers they need. We cannot let the data of the majority undermine proper answers for minority classes.
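
Evaluation needs the same care as collection, because an aggregate metric can hide poor results for a small group. The sketch below breaks accuracy down by group using made-up evaluation results; the group names and counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per group from (group, was_correct) pairs."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, was_correct in results:
        total[group] += 1
        correct[group] += int(was_correct)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation results: 90 majority examples, 10 minority examples.
results = (
    [("majority", True)] * 86
    + [("majority", False)] * 4
    + [("minority", True)] * 5
    + [("minority", False)] * 5
)
overall = sum(was_correct for _, was_correct in results) / len(results)
per_group = accuracy_by_group(results)

print(f"overall accuracy: {overall:.2f}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.2f}")
```

With these numbers the overall accuracy looks healthy while the minority group is served no better than a coin flip, which is exactly the failure a single headline metric conceals.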

To properly serve a diverse population of users and use cases, ML development teams should be diverse themselves, and should be trained to consider diverse needs, and not to be satisfied with existing data sets that are unbalanced in their coverage.

Society Needs Us

There can be many sources of bugs in ML programs: traditional bugs in the implementation of the algorithm; invalid, incomplete, or unrepresentative data; incompleteness in the specification of objectives; poor choices of hyperparameters; and more.

Now that ML software is rapidly expanding in its prevalence, consequence, and influence on our lives, it is clear we need careful, professional methodologies designed to encourage the positive while moderating the negative impacts of this technology on society. We must build tools to support these methodologies, train people to use them, and develop certification systems that reassure users and minimize failures.

Making ML dependable will require new research and new practice. It will require machine learning researchers thinking like software engineers and software engineering researchers thinking like machine learning practitioners. The consequences are too dire to do anything else.


References

1. Aggarwal, R. et al. Diagnostic accuracy of deep learning in medical imaging: A systematic review and meta-analysis. npj Digit. Med. 4, 65 (2021).

2. Grother, P. et al. Face Recognition Vendor Test (FRVT). National Institute of Standards and Technology (Dec. 2019).

3. Kleinberg, J. et al. Inherent Trade-Offs in the Fair Determination of Risk Scores. CoRR (2016).

4. Krakovna, V., Uesato, J. et al. Specification gaming: The flip side of AI ingenuity. DeepMind Blog (Apr. 2020).

5. Mehrabi, N., Morstatter, F. et al. A Survey on Bias and Fairness in Machine Learning. CoRR (2019).

6. Nayak, P. MUM: A new AI milestone for understanding information. The Keyword (Google Blog) (May 2021).

7. Nelson, B. et al. Exploiting Machine Learning to Subvert Your Spam Filter. In Proceedings of LEET'08 (2008).

8. Partnership on AI. Artificial Intelligence Incident Database (2021).


Charles Isbell is Dean of the College of Computing and The John P. Imlay Jr. Chair at Georgia Tech in Atlanta, GA, USA.

Michael L. Littman is a University Professor at Brown University in Providence, RI, USA, and was on leave at Georgia Tech at the time this Viewpoint was written.

Peter Norvig is a Distinguished Education Fellow at Stanford University in Stanford, CA, USA, and Research Director at Google.

Copyright held by authors.