Opinion
Artificial Intelligence and Machine Learning

New Computer Evaluation Metrics for a Changing World

To grow AI and cloud computing efficiently and responsibly, we need new metrics. Computing systems should be measured on how they leverage available power and on their carbon footprint.

The decline of Moore’s Law, Dennard scaling’s demise, and environmental sustainability constraints have shaped the infrastructure landscape. So far, computing systems have been judged on peak performance or idealized benchmark performance with little regard for energy use or environmental impact. To find a path to grow artificial intelligence (AI) and cloud computing efficiently and responsibly, we need new metrics to guide us. Computing systems should now be measured on how they leverage available power and on their carbon footprint.

The Table provides a roadmap of performance/cost metrics in order of increasing recency, accuracy, and importance. We conclude with an example using real workloads that illustrates a 3x–13x gain over a conventional solution by paying attention to new metrics, such as goodput for performance and data-center power and carbon emissions for cost.

Table. Five independent views of performance/cost. The section in this article labeled "An Example" offers a concrete example of how the five metrics vary for deploying TPU v4 vs. TPU v3 in different Google data centers (DCs).

Metric | Comment | Metric Value: TPU v4 vs. TPU v3 | Ratio to Status Quo
Peak Performance/Purchase Price | Status quo: peak FLOPS/price of chip | 1.9 | 1.0
Benchmark Performance/TCO | Upgrade to MLPerf benchmark and total cost of ownership | 3.2 | 1.7
Workload Goodput/TCO | Realistic delivered performance of TPU v4 vs. TPU v3 | 2.5 | 1.3
Workload Goodput/DC Power | Oversubscription allows more TPU v4s within DC power | 6.0 | 3.1
Workload Goodput/Operational CO2e | Placing TPU v4 in DCs with cleaner energy cuts CO2e | 25.9 | 13.5

Peak Performance/Purchase Price

Peak performance is the status quo, especially for emerging AI accelerators, as it is easy to calculate, and it showcases maximum speed. Alas, it does not predict actual performance.4 The flaw of using purchase price is that it focuses on today’s chip cost rather than lifetime system cost.

Benchmark Performance/Total Cost of Ownership

Benchmarks such as MLPerf8 were invented to improve prediction of real performance. The TPC-C benchmark added maintenance cost to purchase price,4 leading eventually to Total Cost of Ownership (TCO):1

TCO = Purchase Price + OpEx (over N years)

where OpEx (Operational Expenditure) is the cost paid during the lifetime of the chip and infrastructure: the electricity consumed, including power distribution and cooling, plus the cost of datacenter space over the server's lifetime. TPC-C sets N to three years. With the slowing of Moore's Law, new server CPUs deliver more than 100 cores rather than much faster ones. Because most software is not designed for that many cores, in practice new servers are barely faster than old servers. Given such small performance/cost gains, servers are replaced less frequently, stretching N from three years a decade ago to perhaps six to eight years today. In some systems today, OpEx is half of the TCO.
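As a minimal sketch of the formula (all dollar figures are hypothetical), stretching the lifetime from three to eight years shows how OpEx grows to half of TCO:

```python
def tco(purchase_price, annual_opex, years):
    """Total Cost of Ownership: purchase price plus operational
    expenditure (electricity, cooling, DC space) over N years."""
    return purchase_price + annual_opex * years

# Hypothetical server: $10,000 purchase price, $1,250/year OpEx.
for years in (3, 8):
    total = tco(10_000, 1_250, years)
    share = 1_250 * years / total
    print(f"N={years} years: TCO=${total:,}, OpEx share={share:.0%}")
# N=3 years: OpEx is ~27% of TCO; N=8 years: OpEx is 50% of TCO.
```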

Workload Goodput/TCO

A problem with using benchmarks for performance is that they do not age well. Because benchmark results affect sales, they immediately become the target of engineering efforts that improve benchmark scores but not necessarily real programs, so benchmarks that are refreshed infrequently lose their predictive value. They also often target chip performance rather than system performance. Popular benchmarks such as Coremark, Dhrystone, Linpack, and SPEC2017 are all cautionary examples.

Running the actual workload, such as production AI training, is obviously more accurate than chip benchmarks. However, it is also important to capture whether computers are underutilized or computation is wasted. Goodput is a networking term that counts only the information bits actually delivered, subtracting the protocol overhead and retransmissions due to failures. We borrow that term here to adjust workload performance by subtracting effort wasted on underutilization or unreliability.

As an example of underutilization, AI training and many other applications are bulk synchronous parallel,3 where all the computers in a system operate concurrently for one step and then exchange messages. They all wait until all messages are received, so communication speed is important. As the time per step is set by the slowest computer, load balance is critical to ensure that all computers do useful work. Stragglers can significantly degrade overall performance of large-scale synchronous jobs, such as AI training.2

A final goodput adjustment is to remove wasted work from unreliability. The overhead of software error detection, checkpointing, and error recovery can be substantial. Worryingly, hardware errors are increasing—with many going undetected—as we follow Moore’s Law to tinier transistors.5,9
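A minimal sketch of the goodput adjustment, with hypothetical utilization and waste figures: the slowest worker in a bulk synchronous step sets the pace, and work redone after failures is subtracted.

```python
# Hypothetical per-worker step times (seconds) in one bulk synchronous step.
step_times = [1.00, 1.02, 1.05, 1.31]  # the 1.31s straggler sets the pace

# Utilization: average useful work per step relative to the straggler.
utilization = (sum(step_times) / len(step_times)) / max(step_times)

# Hypothetical fraction of work redone due to failures and checkpoint/restart.
wasted_fraction = 0.05

raw_throughput = 100.0  # hypothetical samples/second at full utilization
goodput = raw_throughput * utilization * (1.0 - wasted_fraction)
print(f"utilization={utilization:.0%}, goodput={goodput:.1f} samples/s")
```

One straggler that is 31% slower than its peers cuts delivered work by roughly a sixth before any reliability losses, which is why load balance matters so much at scale.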

Workload Goodput/Datacenter Power

The TCO formula includes DC provisioning cost, so it implicitly assumes that sufficient datacenter (DC) capacity is available to house new servers. However, building new DCs is not always possible or affordable, especially at large scale, as some local electric utilities have practical limits on the maximum power available. Capital expenditure for DCs is also limited, in part because it must be spent up front. The combination of zoning, environmental regulations, competition for electricity from other customers, the desire to reduce dirty energy sources, and limited capital puts significant pressure on building new DCs. Thus, an important new metric to join goodput/TCO is the performance that fits within a current DC's power envelope.

Not all DC power gets to the servers inside. Power Usage Effectiveness (PUE) is the industry-standard metric of DC efficiency, defined as the ratio of total energy usage (including all overheads, like cooling and power distribution) to the energy directly consumed by the DC's computing equipment. If 1.5W must be delivered to the DC for 1W to reach a server after distribution and cooling overheads, the PUE is 1.5. This metric rewards DCs that reduce PUE, as they can hold more servers. In 2007, the average PUE was 2.5 (150% overhead), but by 2022, cloud providers had cut PUEs to 1.1 (10% overhead).7 Reducing energy overhead 15x in 15 years illustrates the impact of new metrics.

A simple way to calculate the maximum number of servers per DC is to divide the electrical power available by the PUE and then divide that result by the maximum electricity consumption per server. The last term is limited by the Thermal Design Power (TDP), the maximum heat a computer's cooling system is designed to dissipate. In practice, not all servers in a DC operate close to TDP simultaneously. An optimization is to deploy more servers than the TDP budget would allow, with the degree of over-provisioning called the Oversubscription Rate (OSR).1 We only need a backoff method for the rare times when the aggregate power of many servers in a DC is too high, which is much easier for AI workloads, such as training or bulk inference, than for traditional user-facing applications. This insight means the number of servers per DC can be based on average power rather than on worst-case TDP, increasing the importance of reducing average power.
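A minimal sketch of that calculation, under one possible reading of OSR and with all figures hypothetical:

```python
def max_servers(dc_power_w, pue, server_tdp_w, osr=1.0):
    """Servers that fit a DC's power envelope. Power reaching the IT
    equipment is dc_power/PUE; an OSR above 1.0 sizes the fleet on
    average power rather than on worst-case TDP."""
    it_power_w = dc_power_w / pue
    return int(it_power_w * osr / server_tdp_w)

# Hypothetical 10 MW DC with PUE 1.1 and 500 W TDP servers.
print(max_servers(10e6, pue=1.1, server_tdp_w=500))           # 18,181 servers
print(max_servers(10e6, pue=1.1, server_tdp_w=500, osr=1.3))  # 23,636 servers
```

With a hypothetical OSR of 1.3, the same building and utility feed hold 30% more servers, which is the gain the backoff method protects.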

Workload Goodput/CO2e

Environmental sustainability is increasingly crucial, which means information technology should minimize its carbon footprint. The footprint is measured in carbon dioxide equivalent (CO2e) emissions, which also account for other greenhouse gases, such as methane. The CO2e unit is the metric ton (t), or 1,000kg. Therefore, an important metric to add alongside goodput/TCO and goodput/DC power is goodput/CO2e.

CO2e can be divided into the operational portion—from running computer equipment—and the embodied portion, from manufacturing it and building DCs. Embodied CO2e also includes the sourcing of raw materials, upstream energy use by suppliers to the manufacturers, and transportation.

Environmental organizations define rules to ensure that operational and embodied emissions are accounted for and allocated correctly at the proper stages of a value chain. Multiple standards specify boundaries and ensure all emissions are accounted for appropriately. Corporate accounting commonly uses the Greenhouse Gas Protocol (GHGP), while product-level carbon footprinting commonly uses ISO standards 14040 and 14044. Both GHGP and these ISO standards encompass all operational and embodied emissions (further refined as scope 1, 2, and 3).

Operational CO2e is well understood, and it depends heavily on where the energy is consumed.10 If the local grid relies primarily on renewable energy sources instead of fossil fuels, the footprint can drop 10x. Grid carbon intensity is measured in grams of CO2e per kilowatt-hour (kWh). The worldwide average today is 475, but sites can drop below 60 by using solar or wind power and rise above 700 by burning coal.
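A minimal sketch converting energy into operational CO2e, applying the grid intensities above to a hypothetical 1 GWh workload:

```python
def operational_co2e_t(energy_kwh, grid_g_per_kwh):
    """Operational CO2e in metric tons: energy consumed times the
    local grid's carbon intensity (grams CO2e per kWh)."""
    return energy_kwh * grid_g_per_kwh / 1e6  # grams -> metric tons

# Hypothetical 1 GWh (1e6 kWh) workload on the three grids cited above.
for label, intensity in [("worldwide average", 475),
                         ("solar/wind site", 60),
                         ("coal site", 700)]:
    print(f"{label}: {operational_co2e_t(1e6, intensity):.0f} t CO2e")
# 475 t vs. 60 t vs. 700 t: placement alone changes the footprint >10x.
```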

Unfortunately, the embodied CO2e of CPU servers and AI accelerators is rarely published. The range of embodied CO2e per server in the limited publications is 1t to 4t.11 Given the high variance, more investigation is necessary. The embodied footprint of DC buildings is also poorly documented but is likely much smaller than that of the servers inside, given the 20-year amortization of DCs.7

Embodied emissions, like operational emissions, also depend heavily on where the chips are manufactured, as about half of them come from electricity use.12 The top chip-manufacturing countries are Taiwan, South Korea, and Japan, where grid carbon intensity is still high at 542, 457, and 594 grams/kWh, respectively. Given the carbon intensity of these manufacturing grids and the grid decarbonization plans of the countries that deploy most chips, embodied CO2e is fundamental to the carbon footprint of information technology and needs to be better quantified and tracked.11
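To see why embodied CO2e grows in importance as grids decarbonize, here is a minimal sketch with hypothetical figures: 2t embodied (within the 1t to 4t range above), 4,000 kWh/year of server energy, and a six-year life.

```python
def lifetime_co2e_t(embodied_t, annual_kwh, grid_g_per_kwh, years):
    """Lifetime server footprint: embodied emissions plus operational
    emissions accumulated over the service life."""
    operational_t = annual_kwh * grid_g_per_kwh * years / 1e6
    return embodied_t + operational_t

for grid in (60, 475):  # clean grid vs. worldwide average, in g CO2e/kWh
    total = lifetime_co2e_t(2.0, 4_000, grid, 6)
    print(f"{grid} g/kWh: {total:.1f} t total, embodied share {2.0/total:.0%}")
# On the average grid, embodied is ~15% of the total; on a clean grid,
# embodied dominates at ~58%, so it becomes the footprint that matters.
```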

An Example

To illustrate the metrics and their importance, the last column of the Table compares a hypothetical deployment of two recent AI accelerators: Google’s TPU v3 and TPU v4. Public data6 records:

  • The performance ratio of peak floating point operations per second is 2.24.

  • The MLPerf benchmark performance ratio of TPU v4 over TPU v3 is 3.14.

  • The average performance ratio for the training of representative production AI models is 2.10.

A few assumptions supply the missing parameters:

  • The hypothetical relative purchase price is 1.2x for TPU v4 given its larger chip size.4,6

  • The hypothetical relative OpEx is 0.8 based on TPU v4’s lower average power.6

  • The hypothetical relative TCO is ~1.0 (assuming a 50-50 split between price and OpEx).

  • The hypothetical goodput of TPU v4 is another 1.2x due to its optical circuit switches,6 which improve communication speed and reduce downtime by quickly substituting spares for failed TPUs.

Performance/cost in the Table swings from a peak/purchase-price metric value of 1.9x for TPU v4 over TPU v3 to a benchmark/TCO metric value of 3.2x.
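A minimal sketch, using the public ratios and the hypothetical assumptions above, reproduces the first three Table rows (small differences come from rounding); the last two rows additionally require DC-specific oversubscription and grid carbon-intensity figures that are not part of the public data.

```python
# Public ratios of TPU v4 over TPU v3 (Jouppi et al.).
peak_ratio     = 2.24  # peak FLOPS
mlperf_ratio   = 3.14  # MLPerf benchmark
workload_ratio = 2.10  # production training workloads

# Hypothetical assumptions from the bullets above.
price_ratio    = 1.2   # relative purchase price
tco_ratio      = 1.0   # ~50-50 split of the 1.2 price and 0.8 OpEx ratios
goodput_factor = 1.2   # optical circuit switches

print(f"Peak/Price:    {peak_ratio / price_ratio:.1f}")                     # ~1.9
print(f"Benchmark/TCO: {mlperf_ratio / tco_ratio:.1f}")                     # ~3.1 (Table: 3.2)
print(f"Goodput/TCO:   {workload_ratio * goodput_factor / tco_ratio:.1f}")  # ~2.5
```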

The other metrics offer much larger gains. By taking advantage of oversubscription for TPU v4, as opposed to a standard allocation for TPU v3, the goodput/DC power metric becomes 6x for TPU v4. Even after accounting for the larger energy use and embodied carbon footprint of the additional servers that oversubscription enables, building a TPU v4 DC near green energy instead of in an average location raises the goodput/CO2e metric value to 26x.

Conclusion

It’s hard to improve what you do not measure. The proposed metrics upgrade the conventional performance/cost equation in both the numerator and the denominator. The former improved from peak performance to benchmark performance to goodput, while the latter advanced from purchase price to TCO, DC power capacity, and carbon emissions. To find the best solution, infrastructure architects should co-optimize three metrics: goodput/TCO, goodput/data center power, and goodput/CO2e.

Given where we are with Moore’s Law, Dennard scaling, and environmental sustainability, our information technology community should:

  • Reduce average power consumption of hardware we design or purchase.

  • Request tools to measure a program’s energy use and CO2e and then reduce its footprint.

  • Refine manufacturing processes so that all computers and components can eventually be labeled with their embodied CO2e.

  • Recruit clean energy sites for new data centers and then favor their use.

  • Research how to lower the embodied CO2e associated with computer and semiconductor manufacturing.

References

  • 1. Barroso, L.A., Hölzle, U., and Ranganathan, P. The Datacenter as a Computer: Designing Warehouse-Scale Machines, 3rd edition. Springer Nature (2019).
  • 2. Dean, J. and Barroso, L.A. The tail at scale. Communications of the ACM 56, 2 (2013), 74–80.
  • 3. Gerbessiotis, A.V. and Valiant, L.G. Direct bulk-synchronous parallel algorithms. J. Parallel and Distributed Computing 22, 2 (1994), 251–267.
  • 4. Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach, 6th edition. Elsevier (2017).
  • 5. Hochschild, P.H. et al. Cores that don't count. In Proceedings of the Workshop on Hot Topics in Operating Systems (2021), 9–16.
  • 6. Jouppi, N. et al. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual Intern. Symp. on Computer Architecture (2023), 1–14.
  • 7. Malmodin, J., Lövehagen, N., Bergmark, P., and Lundén, D. ICT Sector Electricity Consumption and Greenhouse Gas Emissions–2020 Outcome (2023). Available at SSRN 4424264.
  • 8. Mattson, P. et al. MLPerf: An industry standard benchmark suite for machine learning performance. IEEE Micro 40, 2 (2020), 8–16.
  • 9. Papadimitriou, G., Gizopoulos, D., Dixit, H.D., and Sankar, S. Silent data corruptions: The stealthy saboteurs of digital integrity. In Proceedings of the 2023 IEEE 29th Intern. Symp. on On-Line Testing and Robust System Design, 1–7.
  • 10. Patterson, D. et al. The carbon footprint of machine learning training will plateau, then shrink. Computer 55, 7 (2022), 18–28.
  • 11. Patterson, D. et al. Energy and emissions of machine learning on smartphones versus the cloud: A Google case study. Communications of the ACM 67, 2 (2024).
  • 12. Taiwan Semiconductor Manufacturing Co. TSMC 2022 Sustainability Report (2022).
