Workload Frequency Scaling Law – Communications of the ACM

Many processors expose performance-monitoring counters that help measure ‘productive performance’ associated with workloads. Productive performance is typically represented by scale factor, a term that refers to the extent of stalls compared with stall-free cycles within a time window. The scale factor of workload is also influenced by clock frequency as selected by frequency-selection governors. Hence, in a dynamic voltage/frequency scaling or DVFS system (such as Intel Speed Shift¹), the utilization, power, and performance outputs are also functions of the scale factor and its variations. Some governance algorithms do treat the scale factor in ways that are native to their governance philosophy.

This article presents equations that relate to workload utilization scaling at a per-DVFS subsystem level. A relation between frequency, utilization, and scale factor (which itself varies with frequency) is established. The verification of these equations turns out to be tricky, since inherent to workload, the utilization also varies seemingly in an unspecified manner at the granularity of governance samples. Thus, a novel approach called histogram ridge trace is applied. Quantifying the scaling impact is critical when treating DVFS as a building block. Typical application includes DVFS governors and/or other layers that influence utilization, power, and performance of the system. The scope here though, is limited to demonstrating well-quantified and verified scaling equations.

Workload Scaling

Intel has three architecture-independent registers that are relevant to this topic:

A_perf. A running counter that counts at actual clock rate of execution at that instant. This actual clock frequency may vary over time based on governance and/or other algorithms. This register counts only during active state (C0).
M_perf. A running counter for activity that counts at a fixed TSC (time stamp counter) clock rate. It counts only during active state (C0).
P_perf. This counter is similar to A_perf, except that it does not count when the activity stalls as a result of some dependency, likely gated on another IP’s clock domain (for example, memory).

The delta of these counters in a given time window commonly interpret the following:

Utilization. U = M_perf/TSC
Scale factor. S = ΔP_perf/ΔA_perf

Ideally, if an activity is free of stalls, then P_perf = A_perf (that is, the scale factor S will be 1). In such a situation, the time taken to complete an activity in a given time window would simply be the inverse of the actual frequency in that window. In most real workloads the scale factor varies and is often less than 1. Establishing accurate equations relating utilization, scale factor, and frequency allows DVFS governors to dispense well-informed frequency-change decisions.

At the outset, it may seem the workload-scaling problem is similar to the well-known Amdahl’s Law, which deals with workload speedup as associated with parallel-compute resources. However, Amdahl’s Law cannot be applied here since the “parallelizability factor” analogy to “scale factor” does not hold up—the former is independent of the resource being scaled (that is, parallel processors), while the latter is a function of the resource being scaled (frequency). Utilization scalability over frequency has a cascade relation via the scale factor.

Terminology and Assumptions

This article makes use of following terminology:

Utilization or CO activity percentage refers to the active (non-idle) clocked state³ within a time window. Load and utilization, used interchangeably here, refer to the same attribute.
Scalable activity refers to the part (or percentage) of the operation whose execution-time scales inversely with frequency.
Stall in activity refers to the part (or percentage) of the operation that involves stall during its completion. While a longer delay actually causes C-state demotion, the instruction stalls referred to here are much shorter and occur with the CPU in C0 active state.

Consider a frequency governor on a DVFS system, updating frequency at every periodic time window T. Consider the present instant at time t₀ (“NOW”), as depicted in Figure 1. Let the just-completed window T of workload have t_act (or t_a) as part of its scalable activity (that is, t_act represents the cumulative portion of activity that keeps the processor busy in execution without the need for any dependent delay or stalls). Let t_stall (or t_s) be the sub-duration depicting the effective stalls experienced (inter-subsystem dependency stalls). Also let t_off (or t_o) be the cumulative duration where the DVFS subsystem is in a deeper C-state, which is a significantly lower power than C0 state. The fundamental difference between t_off and t_stall is that, in the latter case the subsystem is still active CO while experiencing very short dependency stalls; whereas in the former case the delays are large enough and put the subsystem into a momentary deeper C-state.

Figure 1. Scalability-related definitions.

In the given window, the load execution active-time is the inverse of the frequency f.

Or, in general, in any window:

Intel productive performance² register counts the productive (non-stall) counts proportional to t_act. Defining scalability S as a ratio of delta in productive-count (P_perf) to active-count (A_perf), results in the following general equations:

Also, the load l (C0 percentage) in this window can be defined as:

Causality. Consider one specific window T between t_-1 and t₀. Here, the governor has dispensed a frequency f₀. ifthe governor had dispensed a different frequency f‘₀ in the same window, the scalable part of the work (that is, t_a) would have been, say t’_a and thereby t₀ to t‘₀. For practical purposes, the stall cycles can be assumed to be unrelated to the selected frequency, since the stalls are not explicitly dependent on a local subsystem’s clock frequency. Hence t_s remain independent as far as their causality with frequency is considered. Similarly, within the same window the number of productive (stall-free) instructions to be executed remain fixed. Specifically, we can generalize and imply that:

Stall time and productive cycle count are invariant of frequency causality.

Note also that frequency and load are causality considerations within a time window, not across. In the rest of the derivation, a reference to change in frequency is not along the time flow, but another possible causality if the frequency were some other.

Utilization (Load) to Frequency Change Estimation

An important aspect in scalability of workload relates to the extent of change in the load caused by a change in frequency.

As shown in Figure 2, the sampled delta values of p_perf and A_perf as ΔP₀, ΔA₀ within windows ΔTSC can be associated with active, stall, and off time within T.

Figure 2. Perf counter delta relating to reference time window.

Now, assume the DVFS governor had chosen some target f‘₀ instead of initial actual frequency f₀. The requirement is to find a relation that accurately represents the changed final target load l‘₀ as a result of this causality. Hence derive estimated load change l₁ at time t₁.

Let the A_perf count in the causality window with frequency f₀ hypothetically transition from the present ΔA₀ to ΔA’₀

The stall time is not an explicit function of a local subsystem clock, the time spent in stall remains the same. In other words, stall time is invariant of frequency causality.

When we solve this with the following two equations for load and scale-factor in general, at time t_n:

and the scale factor as

solving yields

Now,

Again, due to stall time invariance,

Since this is not an iteration to the next time window, the productive cycles to be executed remain the same. In other words,

The productive cycle count is invariant of frequency causality.

This results in the scaling of scale factor equation

This intermediate equation defines the changes (scaling) of the scale factor itself over frequency. Solving further, we get the scaling of load equation:

In the causality-based derivation, there was no workload inherent change in that window. In other words, if the causality from f₀ to f‘₀ were actual change f₁ in adjacent future time window t₁, and the equation were used to estimate such a target load l₁, then any divergence (l_1actual – l‘₀) from expected target load is clearly workload inherent change culminated over the time window t₀ to t₁. Since l_1actual can be known, any divergence from the estimated load can be calculated. The estimated scaled load is this given by:

Similarly, the estimated scaled scale-factor is approximately given by:

Verification on an Actual Platform

The last two equations can be called the workload scaling equations. Verifying them empirically on an actual platform, however, isn’t as straightforward. At the granularity of a single cycle, the three parameters on the right (f₀,f₁,s₀) may be determined; but as described, the math result cannot be compared directly to l after the end of that cycle; primarily because the incoming load itself could fluctuate as seen at a discrete sample basis. It is possible though, to statistically verify by applying a series of histograms at different frequencies and tracing its ridge. Using this histogram ridge trace method, the generality of the equation is verified.

To begin, a workload is run completely at a fixed frequency.

Workload #1 is a random browser-based graphics load (Figure 3; https://webglsamples.org/multiple-views/multiple-views.html)

Figure 3. A browser-based workload from WebGLSamples.org

While capturing the data, it is preferred to stop unwanted background services that are irrelevant to the workload under analysis. Also, a precise and lightweight data-logging tool is preferred. On Linux systems, the Intel PSST₄ can accomplish logging at fixed, low, and precise overhead. The data is captured for, say, a duration of 100 seconds and a sampling poll period of 20ms. Data captured with every sample includes:

CPU load.
scale factor.
clamped frequency value.

The entire run is repeated at every frequency of CPU, while keeping other DVFS as fixed sources (for example, Gfx frequency at a moderate but fixed value).

The CPU scale-factor and utilization logs are captured. In this example, time-domain analysis is not practical for gathering scalability and utilization. Instead, a histogram analysis identifies in which bucket the maximum number of load points lie for a given frequency of operation. The peak of the histogram represents the load value with the highest probability of occurrence for the given frequency. In the example data in Figure 5, about 500 samples (or 80% of total samples) occurred at a 33% load. Such inferences are obviously hard to derive from the time chart in Figure 4 even if the frequency is fixed.

Figure 4. Utilization variation in the time domain.

Figure 5. Utilization histogram plot at fixed frequency 1GHz.

The experiment is iterated through all other frequencies, one at a time, to get a range of peak histogram values, as shown in Figure 6.

Figure 6. Collection of utilization histograms for each frequency.

With the same data from the set of experiments, a similar histogram ridge trace for scale-factor can be plotted as shown in Figure 7.

Figure 7. Scale-factor histograms for each frequency.

Scaling of scale factor. A plotting tool is used to create a map-view of peak histogram points (in scale factor) in Figure 7, which is then plotted with red dots in Figure 8. On the same mapview, the scaling of scale factor equation is plotted for different possible target frequencies based on any one initial seed value for utilization and scale factor. The correlation between the equation and the full data set is shown in Figure 8.

Figure 8. Scale-factor ridge points contour compared with equation.

Scaling of load. The map-view of peak histogram points of utilization (Figure 6) is plotted with red dots in Figure 9. On the same map view, the scaling of load equation is plotted for different possible target frequencies based on one initial seed value for utilization and scale factor.

Figure 9. Utilization ridge point contour compared with equation.

As summarized by the results shown in these figures, the independent estimates of the equation fit nearly precisely with the actual experiment’s statistical results. This methodology is applied to multiple other workloads (with different scale factors and load levels) and verified as noted here.

Conclusion

This article elaborated on the causalities of utilization and scale factor in relation to possible frequency-selection choices within a time window. Based on this, generic workload scaling equations are derived, quantifying utilization impact. The equations are verified by applying the histogram ridge trace method at discrete DVFS block level. Though only the CPU core example was detailed, similar equation agreements were observed on other DVFS blocks (graphics sub blocks), other workloads and other operating systems (Windows/Linux). The equation-estimated curve correlates very accurately with the actual ridge trace curve. This implies the utilization impact can be “predicted” to be statistically accurate in every cycle and any divergence from it can be attributed to workload-inherent utilization change in that cycle and treated as appropriate to the solution using it.

Acknowledgments

The author is thankful for all the help from colleagues at Intel, specifically, valuable input from principal engineers Harinarayanan Seshadri and Rajeev Muralidhar and software engineer B. M. Shravan K.