Today's CPUs are capable of supporting real-time audio for many popular applications, but some compute-intensive audio applications require hardware acceleration. This article looks at some real-time sound-synthesis applications and shares the authors' experiences implementing them on graphics processing units (GPUs).
Software synthesizers, which use software to generate audio in real time, have been around for decades. They allow the use of computers as virtual instruments, to supplement or replace acoustic instruments in performance. Groups of software instruments can be aggregated into virtual orchestras. Similarly, in a video game or virtual environment with multiple sound sources, software synthesizers can generate each of the sounds in an auditory scene. For example, in a racing game, each discrete software synthesizer may generate the sound of a single car, with the resulting sounds combined to construct the auditory scene of the race.
Traditionally, because of limited computing power, approaches to real-time audio synthesis have focused on computing simple waveforms directly (for example, additive or FM synthesis), using sampling and playback (for example, wavetable synthesis), or applying spectral modeling techniques (for example, modal synthesis) to generate audio waveforms. While these techniques are widely used and understood, they work primarily with a model of the abstract sound produced by an instrument or object, not a model of the instrument or object itself. A more recent approach is physical modeling-based audio synthesis, where the audio waveforms are generated using detailed numerical simulation of physical objects or instruments.
In physical modeling, a detailed numeric model of the behavior of an instrument or sound-producing object is built in software and then virtually "played" as it would be in the real world: the "performer" applies an excitation to the modeled object, analogous, for example, to a drumstick striking a drumhead. This triggers the computer to compute detailed simulation steps and generate the vibration waveforms that represent the output sound. By simulating the physical object and parameterizing the physical properties of how it produces sound, the same model can capture the realistic sonic variations that result from changes in the object's geometry, construction materials, and modes of excitation.
Suppose you are simulating a metallic plate to generate gong or cymbal-like sounds. Varying a parameter that corresponds to the stiffness of the material may allow you to produce sounds ranging from a thin, flexible plate to a thicker, stiffer one. By changing the surface area for the same object, you can generate sound corresponding to cymbals or gongs of different sizes. Using the same model, you may also vary the way in which you excite the metallic plate, to generate sounds that result from hitting the plate with a soft mallet or a hard drumstick, or from bowing it. By changing these parameters, you may even simulate nonexistent materials or physically impossible geometries or excitation methods.
There are various approaches to physical modeling of sound synthesis. One such approach, studied extensively by Stefan Bilbao,1 uses the finite difference approximation to simulate the vibrations of plates and membranes. The finite difference simulation produces realistic and dynamic sounds (examples can be found at http://unixlab.sfsu.edu/~whsu/FDGPU). Real-time finite difference-based simulations of large models are typically too computationally intensive to run on CPUs. In our work, we have implemented finite difference simulations in real time on GPUs.
Next, we address key issues in real-time audio software and look at how they relate to the computing characteristics of GPUs. We will look at how some types of real-time synthesis applications have benefited from GPU acceleration, ending with the challenges encountered while running finite difference-based synthesis on GPUs.
Real-time audio synthesis can be broadly broken down into three steps:
1. Excitation. An excitation event signals the synthesizer that real-time audio generation should begin. To strike a virtual cymbal, for example, a human performer may hit a key on a keyboard, which generates an excitation event.
2. Sample generation. Audio sample data is computed for the desired sounds (for example, the cymbal crash).
3. Output. Data generated in step 2 is sent to system software for playback by the system.
Figure 1 shows two approaches to real-time audio synthesis: in 1a the naïve approach computes and outputs a single sample at a time, while in 1b the buffered approach computes multiple samples and outputs them as a block.
In Figure 1a, after the excitation event, a function is called to generate a single sample. The new sample is sent to a buffer in the audio-output device. These two steps are repeated until a new excitation is received, or until the sound becomes inaudible. At the CD-quality sample rate of 44.1kHz, one sample has to be computed and ready for output every 1/44,100 seconds, or every 23μs.
The naïve approach in Figure 1a incurs high overhead. Every sample requires a function call and a copy into a system buffer, which may involve a context switch.
Instead, the buffered approach illustrated in Figure 1b is usually used. Samples are generated and transferred in blocks of n samples, significantly reducing overhead. Though the buffered approach reduces overhead, it introduces latency into the signal path. Latency is the time it takes from the excitation of the instrument to the production of its sound. The longer the latency, the less responsive an instrument feels. For software instruments, latency should be kept to a minimum, on the order of tens of milliseconds.
For the buffered approach, a block of n samples has to be generated in n × 23μs. In commonly used, less compute-intensive algorithms such as wavetable synthesis, generating a sample involves a few arithmetic operations and table lookups; this is usually achievable on the CPU. For compute-intensive real-time audio applications, the sample generation needs to be parallelized. The ubiquity of GPUs today makes them an obvious choice for such applications.
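The deadline arithmetic above can be checked with a few lines of code. The following is a minimal sketch in Python (chosen for illustration only; it is not the language of the authors' software), and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 44_100  # CD-quality sample rate cited in the article


def sample_period_us(sample_rate_hz=SAMPLE_RATE_HZ):
    """Time available to produce one sample, in microseconds."""
    return 1_000_000 / sample_rate_hz


def block_deadline_ms(n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Time available to produce a block of n samples, in milliseconds.

    The same figure is the latency contributed by buffering one block.
    """
    return 1_000 * n_samples / sample_rate_hz
```

For example, `block_deadline_ms(512)` is about 11.61ms and `block_deadline_ms(4096)` is about 92.88ms, which shows how quickly larger buffers eat into a latency budget of tens of milliseconds.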
NVIDIA's GPUs and CUDA platform are popular options for high-performance computing today. An NVIDIA GPU (Figure 2) is a hierarchical multiprocessor on a chip, usually consisting of a number of streaming multiprocessors (SMs); each SM contains a number of streaming processors (SPs). An SM can execute large numbers of threads simultaneously, with each thread running the same program. This single instruction multiple thread (SIMT) architecture is especially suitable for applications with a high degree of data parallelism, where the same operations are applied to large amounts of data. For example, the NVIDIA GeForce GT 650M in a mid-2012 MacBook Pro Retina has two SMs, each with 192 SPs (cores), at a clock rate of 900MHz.
To the CPU, an NVIDIA GPU looks like a separate coprocessor with its own memory system. Jobs are configured on the CPU, or host, which then interacts with the GPU device. The host copies data and device-specific programs from host memory to device memory, and initiates program execution on the device. The GPU then executes the jobs independently of the host, either synchronously or asynchronously. When the GPU device is done, results are copied from the device to host memory. NVIDIA CUDA systems provide ways of reducing this memory copy latency using such techniques as shared memory pages between the host and device.
A function that is executed in parallel by the GPU is called a kernel. Just as the GPU hardware is composed of streaming processors grouped into streaming multiprocessors, a kernel is executed in parallel by threads grouped into blocks (see Figure 3). One or more blocks are assigned to a streaming multiprocessor, guaranteeing that threads in the same block are executed on the same SM. Each thread executes the same kernel function, but usually on different data. Since threads in a block execute on the same SM, they can leverage fast hardware-supported synchronization and share data using shared memory within the same SM. Threads in different blocks cannot be synchronized within a kernel, and the order in which blocks are executed on the SMs is not guaranteed.
A hierarchy of memory is available on the GPU device. In addition to registers accessible to a thread, a limited amount of faster shared memory can be shared among threads in a block, but persists only as long as the SM is executing the thread block. Larger amounts of slower global memory can be accessed and shared among all threads in any block. Global memory is allocated on the device by the host and persists until deallocated by the host, but access times for global memory are slower than for shared memory. Optimized GPU code leverages these thread and memory characteristics.
In older GPUs, efficient kernel execution usually requires careful management of shared memory in software. More recent GPUs, based on the Fermi and Kepler architectures, support a true hardware cache architecture; explicit software management of shared memory is not as critical on these systems.
GPU-based Applications with Multiple Independent Audio Streams
GPUs have previously been used successfully in real-time audio applications. Many of these applications have involved the simultaneous generation of multiple loosely coupled sound sources or processing streams. In these instances the GPU has been used to ease the load on the CPU, caused by the computational complexity of generating and processing many sounds simultaneously. Examples include rendering and spatializing sound-generating objects in a game or virtual environment, or synthesizing multiple instruments in a virtual ensemble. Each sound source might be a car in a racing game or an instrument in an orchestra.
Recall the buffered approach in Figure 1b; at each computation step, for each sound source, a block of n samples is computed and sent to system buffers for playback. Sequential samples in a buffer usually have to be computed in order. Since buffers for two sound sources can be computed independently of each other, assigning each sound source to a thread in the GPU is straightforward. Each thread computes an n-sample buffer and synchronizes; the output from all the sources is mixed down and sent to system buffers for playback (see Figure 4). Since a typical GPU today efficiently supports thousands of simultaneous threads, this type of application is a good match for GPU acceleration.
For example, Zhang et al.3 describe a typical parallel setup that implements modal synthesis; the sound of a vibrating object is generated by combining a bank of damped sinusoidal oscillators, each representing a mode, or resonant frequency of the sound. Since the n samples for each mode can be calculated independently of other modes, each mode can be assigned to an independent thread in the GPU. After n samples are generated for each mode, the results for all modes are combined for playback.
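The structure of modal synthesis can be sketched on the CPU as follows. This is a hedged illustration, not the implementation described by Zhang et al.: the function names, the `(frequency, decay, amplitude)` tuple layout, and the exponential decay model are assumptions made for the example.

```python
import math

SAMPLE_RATE_HZ = 44_100


def render_mode(freq_hz, decay_per_s, amplitude, n_samples, start=0):
    """Return n samples of one damped sinusoid (one mode)."""
    out = []
    for k in range(start, start + n_samples):
        t = k / SAMPLE_RATE_HZ
        out.append(amplitude * math.exp(-decay_per_s * t)
                   * math.sin(2.0 * math.pi * freq_hz * t))
    return out


def render_modal_block(modes, n_samples, start=0):
    """Mix the per-mode buffers into one output buffer.

    Each call to render_mode is independent of the others, which is
    what would let a GPU version hand one mode to each thread.
    """
    per_mode = [render_mode(f, d, a, n_samples, start) for (f, d, a) in modes]
    return [sum(column) for column in zip(*per_mode)]
```

A call such as `render_modal_block([(440.0, 3.0, 0.5), (880.0, 5.0, 0.25)], 512)` produces one 512-sample buffer; the only point where the modes meet is the final summation.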
Real-time finite difference synthesis works somewhat differently and is arguably not an efficient use of the GPU, but we have been able to get useful results despite some severe constraints.
Physical objects are frequently modeled using differential equations. To perform numerical simulations of these objects, finite difference approximations of the differential equations are commonly used. For our work, we use an approximation of the 2D wave equation, which describes vibrations in two dimensions through an object. Consider exciting a flat rectangular plate to produce sound. The plate is modeled as a horizontal 2D grid of points. When the plate is struck, points in the grid "bounce" up and down very fast, resulting in vibration and sound, as shown in Figure 5.
In the simulation, 2D arrays keep track of the vertical displacement of the plate at each point. One array stores the current displacements, while arrays of the two previous time steps are retained. To calculate the displacement of a point at the current time step, previous displacement values around the point being calculated are used. Suppose x_{i,j}(t) contains the vertical displacement at time t of the point at (i,j). In a previous paper,2 we saw that x_{i,j}(t+1) can be calculated from x_{i,j}(t), the four nearest neighbors of x_{i,j} at time t, and x_{i,j}(t-1). A sample point at (for example) the center of the grid, marked in red in Figure 5, can be monitored to produce the audio samples for the output sound.
In a straightforward finite difference implementation, computing a W×W grid of data points at time t requires W² steps, and depends on the W×W grid at the previous two time steps.
Challenges of Implementing the Finite Difference Technique
Using a finite difference-based simulation to generate a single stream of audio is a compute-intensive endeavor. Compared with the calculation of one output sample from a few sinusoidal oscillators or filters, generating a single sample in a finite difference simulation involves significantly more arithmetic and memory operations. Hence, the computation of a single sample in real time has to be spread over multiple threads. Figure 6 shows a high-level view of a GPU-based finite difference simulation.
We faced three major challenges implementing a real-time finite difference synthesizer on the GPU. First, kernel launch overhead, a delay from the time the host executes the kernel on the device until the device begins execution of the kernel, may be significant. Second, the limit on the number of available threads per block restricted how the simulation grid was mapped onto the GPU. Third, the inability to synchronize or order block execution limited the way blocks of threads could be configured and executed on the GPU device.
Figure 7 shows two types of parallel audio applications: Figure 7a demonstrates multiple independent audio streams, while 7b shows parallel finite difference simulation. As mentioned previously, with independent-stream audio processing it is usually feasible to configure the system as shown in Figure 7a; each thread freely computes an independent stream of n samples simultaneously, producing a buffer of n audio samples at the end of a period of computation. A single synchronization event occurs after n time steps, waiting for all threads to complete their calculations. After the synchronization event, these buffers of audio data are then organized before being sent back to the host. For example, multiple sources may be mixed together or organized to maintain temporal coherence.
Suppose x(t) is a two-dimensional array of vertical displacements at time t. With finite difference simulations, recall that to calculate vertical displacement of a point at (i,j) at time t+1, you need to refer to the point and its nearest neighbors at time t and t-1; these calculations need to be performed over the entire x array. To generate a buffer of audio samples over n time steps, you capture the vertical displacement of a single sample point at the same location in x over time; to do this it is necessary to calculate the x(t+1) to x(t+n) simulation arrays, while building a buffer of n samples of vertical displacement from the sample point in x. At time t+1, it is necessary only to retain the arrays for times t and t-1. Pseudocode for a sequential finite difference simulation is shown in Figure 8.
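The sequential loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it uses the standard undamped explicit scheme for the 2D wave equation, holds the edge points at zero, and the names `simulate_plate` and `lam2` (the squared Courant number) are hypothetical. The article's own update may include terms, such as damping, that are omitted here.

```python
def simulate_plate(prev, cur, n_steps, lam2=0.25, sample_point=None):
    """Advance a W x W grid n_steps time steps; return (prev, cur, samples).

    prev and cur hold displacements at times t-1 and t, with zero edges
    (an assumed fixed boundary). The input lists are reused as scratch
    space. lam2 values up to 0.5 keep this 2D scheme stable.
    """
    w = len(cur)
    if sample_point is None:
        sample_point = (w // 2, w // 2)
    nxt = [[0.0] * w for _ in range(w)]
    samples = []
    for _ in range(n_steps):
        for i in range(1, w - 1):
            for j in range(1, w - 1):
                neighbors = (cur[i - 1][j] + cur[i + 1][j]
                             + cur[i][j - 1] + cur[i][j + 1])
                nxt[i][j] = (2.0 * cur[i][j] - prev[i][j]
                             + lam2 * (neighbors - 4.0 * cur[i][j]))
        # One audio sample per time step, read at the sample point.
        samples.append(nxt[sample_point[0]][sample_point[1]])
        # Rotate the arrays: only times t and t-1 are retained.
        prev, cur, nxt = cur, nxt, prev
    return prev, cur, samples
```

The three nested loops make the cost visible: every audio sample requires a full pass over the interior of the grid.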
As shown in Figure 6, you specify p = W×W threads to execute the GPU kernel. You spread the inner-loop calculation of the x(t+1) array over multiple threads. Each thread computes one point at (myRow, myColumn) of x(t+1), with myRow and myColumn based on the thread's unique ID. The parallelized pseudocode is shown in Figure 9.
To calculate a buffer of audio samples over n time steps, you simply make one kernel call. Since the calculations at x(t+1) depend upon the completed calculations of the two previous time steps x(t), x(t-1), you must synchronize after each time step (Figure 7b). The time spent in the n synchronizations in the kernel is critical to the efficiency of this approach. CUDA provides fast hardware synchronization support but only for threads within the same block. Therefore, all calculations must be performed using threads within a single block (Figure 7b). Using a single block allows you to synchronize using fast mechanisms native to CUDA, but you can no longer leverage the efficiency of allowing the GPU to schedule multiple blocks of threads. Since there is only one block of threads, only one SM can operate on one finite difference approximation at any time. You could, however, simulate more than one finite difference-based instrument simultaneously, up to the number of SMs on the device.
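The synchronize-after-every-time-step pattern can be emulated on a CPU to make the data dependencies concrete. The sketch below is only an analogy, assuming the standard undamped 2D wave-equation update with zero edges: Python threads stand in for the threads of a single block, and `threading.Barrier` plays the role that `__syncthreads()` plays in a CUDA kernel. The function name is hypothetical, the grid must be at least 3×3, and the code says nothing about GPU performance.

```python
import threading


def simulate_with_barrier(prev, cur, n_steps, lam2=0.25):
    """Emulate one thread block: one Python thread per interior point."""
    w = len(cur)
    state = {"prev": prev, "cur": cur,
             "nxt": [[0.0] * w for _ in range(w)], "samples": []}
    mid = w // 2

    def end_of_step():
        # Runs in exactly one thread, after all threads have arrived
        # and before any of them is released into the next time step.
        state["samples"].append(state["nxt"][mid][mid])
        state["prev"], state["cur"], state["nxt"] = (
            state["cur"], state["nxt"], state["prev"])

    points = [(i, j) for i in range(1, w - 1) for j in range(1, w - 1)]
    barrier = threading.Barrier(len(points), action=end_of_step)

    def worker(i, j):
        for _ in range(n_steps):
            p, c = state["prev"], state["cur"]
            neighbors = c[i - 1][j] + c[i + 1][j] + c[i][j - 1] + c[i][j + 1]
            state["nxt"][i][j] = (2.0 * c[i][j] - p[i][j]
                                  + lam2 * (neighbors - 4.0 * c[i][j]))
            barrier.wait()  # no thread starts step t+2 before t+1 is done

    threads = [threading.Thread(target=worker, args=pt) for pt in points]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["samples"]
```

Removing the barrier would let a fast thread read neighbors that a slow thread has not yet updated, which is exactly the hazard that forces all threads into one block on the GPU.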
To use multiple blocks of threads and multiple SMs, you can try configuring the kernel to calculate one time step of x(t) at a time, returning control to the host after each time step. The synchronization is taken care of in the return from the kernel. In this solution, the configuration of blocks of threads depends on the locality of the data being accessed and calculated; it is a standard GPU optimization problem. However, this approach has a problem very similar to the naïve audio generation problem described previously in Figure 1a; there is an overhead to the host executing a kernel on the device, and the device returning from the kernel. This kernel launch overhead builds linearly and is not insignificant. Our experiments have shown that on an NVIDIA GeForce GT 650M, on average, the minimum time to execute and return from a kernel is around 17μs, with some initial delays of 2ms or longer. Recall that CD-quality audio requires generating one sample per 23μs. This means that even with the minimum overhead, there are about 6μs to calculate a sample in real time. This is unrealistic for finite difference calculations.
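The budget argument can be made explicit with a small calculation. This is a rough sketch with a hypothetical function name; it uses the 17μs minimum launch overhead measured above and ignores the initial delays and any synchronization cost inside the kernel.

```python
def compute_budget_us(steps_per_launch, launch_overhead_us=17.0,
                      sample_rate_hz=44_100):
    """Microseconds left per time step after kernel launch overhead.

    launch_overhead_us is paid once per kernel call, so it is spread
    over however many time steps that call computes.
    """
    sample_period_us = 1_000_000 / sample_rate_hz
    return sample_period_us - launch_overhead_us / steps_per_launch
```

With one time step per launch, `compute_budget_us(1)` leaves a little under 6μs per sample; with 512 time steps per launch, `compute_budget_us(512)` leaves more than 22.6μs, because the launch overhead is amortized over the whole buffer.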
Hence, we took the approach of using a single block of threads, with n time steps per kernel call, as shown in the previous pseudocode. However, we ran into another constraint: the maximum number of threads in a block is fixed in each GPU implementation and depends further on hardware resource constraints such as the number of registers and SPs. For larger simulation grids with more points than threads per block, each thread must be able to calculate a rectangular tile of several points. For example, the GeForce GT 650M supports up to 1,024, or 32×32 threads per block. To simulate a 64×64 grid, each thread would calculate a tile of 2×2 points.
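One way to picture the tiling is as a mapping from a thread's 2D index to the grid points it owns. The sketch below is an assumption about how such a mapping could look, not the layout used in FDS; the function name is hypothetical, and it assumes square tiles with the grid width a multiple of the number of threads per side.

```python
def tile_for_thread(thread_row, thread_col, grid_w, threads_per_side):
    """Return the (row, col) grid points handled by one thread.

    Assumes grid_w is a multiple of threads_per_side so that every
    thread gets a square tile of the same size.
    """
    if grid_w % threads_per_side != 0:
        raise ValueError("grid_w must be a multiple of threads_per_side")
    tile = grid_w // threads_per_side
    row0, col0 = thread_row * tile, thread_col * tile
    return [(row0 + di, col0 + dj)
            for di in range(tile) for dj in range(tile)]
```

For the 64×64 grid and 32×32 threads in the example, `tile_for_thread(0, 0, 64, 32)` yields the four points of the top-left 2×2 tile, and the 1,024 tiles together cover all 4,096 grid points exactly once.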
Our software synthesis package, Finite Difference Synthesizer (FDS), was designed to operate on Mac OS X and Linux. FDS simulates a vibrating plate, similar to a percussion instrument. The system (Figure 10) has three primary components: the controller interface (Figure 10b), the finite difference engine (Figure 10c), and the audio callback handler (Figure 10d). Each of these three components runs in its own thread.
The controller interface (Figure 10b) is the program's foreground thread. It includes a listener loop that receives control and configuration messages from an external controller, via the Open Sound Control (OSC) protocol (http://opensoundcontrol.org).
To use FDS, a performer manipulates an external OSC-capable controller (Figure 10a), which may be a keyboard, drum pad, or tablet. An OSC message is sent from the controller to FDS's foreground thread. This message may change settings (simulation parameters, strike location, among others) or trigger an excitation event (strike the plate, damp it, and so on). The thread then initiates the corresponding operations in the finite difference and audio callback threads.
To address the implementation challenges previously described, we created a finite difference engine (Figure 10c). This engine runs continually in its own thread, executing the finite difference simulation and generating audio data. This engine thread is the only one in FDS that interacts with the GPU device. It contains a control loop running on the host, keeping the finite difference simulation running on the device (Figure 11). The control loop on the host part of the engine waits for control signals from the foreground (control) thread, such as excitation and damping events. When an excitation event is received, the host adds a precalculated 2D Gaussian impulse of vertical displacements into the finite difference grid, maintaining the current waveform while adding the energy from the excitation event. The center of the impulse is determined by the location of the "strike" on the grid (Figure 5).
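The excitation step can be sketched as adding a precomputed bump into the grid without clearing what is already there. The following is illustrative only: the function names, the impulse size, and the width parameter `sigma` are assumptions, the impulse is assumed to have an odd side length, and the article does not specify which of the simulation arrays receives the impulse.

```python
import math


def gaussian_impulse(size, sigma, amplitude=1.0):
    """Precompute a size x size Gaussian bump centred on the middle cell."""
    c = size // 2
    return [[amplitude * math.exp(-((i - c) ** 2 + (j - c) ** 2)
                                  / (2.0 * sigma ** 2))
             for j in range(size)] for i in range(size)]


def strike(grid, impulse, row, col):
    """Add the impulse into grid, centred at (row, col), in place.

    Existing displacements are kept; cells of the impulse that fall
    outside the grid or on its fixed edge are skipped.
    """
    w, half = len(grid), len(impulse) // 2
    for i, impulse_row in enumerate(impulse):
        for j, value in enumerate(impulse_row):
            r, c = row + i - half, col + j - half
            if 0 < r < w - 1 and 0 < c < w - 1:
                grid[r][c] += value
```

Because `strike` adds to the grid rather than overwriting it, a plate that is still ringing keeps its current waveform and simply gains energy from the new hit.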
Audio data is transferred from the device to the host using memory shared by both the host and device. This eliminates one of the major bottlenecks formerly associated with GPUs: the device-host memory transfers. The host copies the audio data to a ring buffer shared with the audio callback thread, which handles getting audio data to the audio driver.
The audio callback thread communicates with PortAudio (http://www.portaudio.com), a cross-platform audio driver that coordinates the interface between FDS and the operating system's audio layer. When the PortAudio driver is ready for more audio data, it executes a callback function in the audio callback thread. This function copies data placed in the ring buffer by the finite difference thread to the PortAudio output buffer. The output buffer is then sent to the operating system for playback.
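The hand-off between the finite difference thread and the audio callback can be pictured as a producer and a consumer sharing a ring buffer. The sketch below is a simplified stand-in, not the FDS or PortAudio code: the class and method names are hypothetical, and it uses a lock for clarity, whereas a real audio callback would usually avoid blocking.

```python
import threading


class RingBuffer:
    """Fixed-capacity FIFO of samples shared by a producer and a consumer."""

    def __init__(self, capacity):
        self._data = [0.0] * capacity
        self._capacity = capacity
        self._read = 0    # index of the oldest unread sample
        self._count = 0   # number of unread samples
        self._lock = threading.Lock()
        self.underruns = 0

    def write(self, samples):
        """Producer side: store samples; return how many fitted."""
        with self._lock:
            n = min(len(samples), self._capacity - self._count)
            for k in range(n):
                slot = (self._read + self._count + k) % self._capacity
                self._data[slot] = samples[k]
            self._count += n
            return n

    def read(self, n):
        """Consumer side: return n samples, padding with silence on underrun."""
        with self._lock:
            available = min(n, self._count)
            out = [self._data[(self._read + k) % self._capacity]
                   for k in range(available)]
            self._read = (self._read + available) % self._capacity
            self._count -= available
            if available < n:
                self.underruns += 1
                out.extend([0.0] * (n - available))
            return out
```

In this picture the engine thread calls `write` with each newly generated buffer and the audio callback calls `read` with the number of samples the driver requests; the `underruns` counter records each time the callback had to pad with silence because the simulation fell behind.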
To evaluate whether FDS constitutes a useful development in software synthesis, we asked two related questions: Is FDS able to generate real-time audio based on finite difference simulations, for an interesting range of simulation parameters, at reasonable latencies? How does FDS's performance on a GPU compare with a single-threaded finite difference simulation executed on a CPU, with identical simulation parameters? For the second question, note that GPUs and CPUs have very different architectures; different systems can have very different CPU/GPU combinations in terms of performance. Hence, our CPU-vs.-GPU comparisons should be considered practical references for application end users; they are not intended as rigorous comparative performance studies. Even for GPU-to-GPU comparisons, the models vary widely in their capabilities and system implementations, making these comparisons difficult.
For our measurements, we kept the audio buffer size constant and ran finite difference simulations for a number of simulation grid sizes on both CPUs and GPUs. Large grid sizes are important for generating sounds with low pitches and simulations with high spatial resolution. We also monitored the audio output buffer for underruns (that is, when audio data is not being produced fast enough to keep up with the demands of audio playback). Underruns produce gaps in the audio output data, which are audible as glitches or other unpleasant artifacts as the audio system waits for data to be ready.
The results of these experiments are obviously highly system dependent. We have implemented versions of FDS on GPUs for some years. On the earlier platforms, we were able to execute FDS for up to 21×21 simulation grids in real time on an NVIDIA GTX285 GPU, with audio buffer size of 4,096 samples, without audio buffer underruns; on the 3GHz Intel Xeon 5160 CPU on the same system, audio buffer underruns were reported for all but trivially small grid sizes.2
Our latest test platform is a mid-2012 MacBook Pro Retina, with a 2.7GHz Intel Core i7 processor and 16GB of RAM. This system has a built-in 900MHz NVIDIA GeForce GT 650M GPU, which has two SMs with 192 SPs each. The 650M is an implementation of NVIDIA's latest Kepler architecture, optimized for power efficiency; hence, while it has numerous enhancements over the earlier Tesla and Fermi architectures, the 650M is one of the slower Kepler-based GPUs. The operating system is OS X version 10.8.2, running CUDA 5.0. We timed the finite difference kernel execution within the FDS program infrastructure, taking measurements around the kernel calls.
Figure 12 shows the time needed to generate one 512-sample buffer of audio for FDS running on the CPU and GPU of the MacBook Pro Retina. Execution times above 11ms produce audio buffer underruns and cannot be used for audio playback. These numbers are included for speed comparison only.
For our tests we kept the audio output buffer size at 512 samples. This means that FDS needed to produce and have ready 512 samples every 512 × (1/44,100) seconds, or about 11.61ms. Simulations with execution times above 11ms produce buffer underruns and are unusable. We were able to obtain good results (meaning audio playback with no buffer underruns) for grid sizes as large as 69×69 on the CPU and 84×84 on the GPU, a 48% increase in the number of grid points in the largest grid supported.
As mentioned earlier, performance analysis involving GPUs and CPUs is tricky. At the risk of comparing even more apples and oranges, we introduce another point of reference. Figure 13 shows measurements made on a system with a 2GHz Intel Xeon E5504 CPU and a 1.15GHz NVIDIA Tesla C2050 GPU, designed as a GPU-based server for scientific applications. The C2050 is based on NVIDIA's Fermi architecture; this implementation is targeted at the high-performance computing market, with power consumption being a lower priority. For these measurements, we again timed the finite difference kernel execution independent of the FDS program infrastructure, which does not run on the GPU. The audio output buffer size was again 512 samples. The C2050 supports simulation grid sizes up to 81×81, but the largest grid size that the slower Xeon CPU can support is about 27×27.
Our experiments have shown it is possible to run finite difference-based simulations on the GPU to generate real-time audio with reasonable latency, and for grid sizes larger than is possible on a CPU. We note that CPU-vs.-GPU comparisons are tricky at best; our latest measurements were made on the GPU and CPU for the mid-2012 MacBook Pro Retina. We were surprised that the Kepler-based 650M, designed for power efficiency, had comparable performance on FDS to the slightly earlier Fermi-based C2050, which was designed for high-performance computing. (This is partly because a single finite difference simulation can use only one SM at a time; the 650M has two SMs, while the C2050 has 14, so a much larger fraction of the C2050's hardware resources is idle.) We were also surprised that current CPUs such as the Intel Core i7 exhibit competitive performance at medium grid sizes; an obvious future direction is to port FDS to multicore CPU systems for comparison with GPUs.
Our current implementation of the finite difference approximation on the GPU is straightforward. In addition, we plan to study approaches to optimize the software for larger grid sizes, for multiple finite difference-based instruments, and to support different simulation geometries. We also plan to port FDS to other GPU computing platforms such as OpenCL for testing on other GPU architectures.
Bill Hsu is an associate professor of computer science at San Francisco State University. His current interests include high-performance computing, audio analysis and synthesis, physics-based modeling, and audiovisual performance systems.
Marc Sosnick-Pérez is a graduate student, teacher, and researcher at San Francisco State University. He is part of SFSU's SETAP (Software Engineering Team Assessment and Prediction) project exploring novel means of assessing student learning, and he was part of the CAMPAIGN project exploring the use of GPU-accelerated clustering algorithms.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.