Communications of the ACM

Research highlights

Technical Perspective: Customizing Media to Displays

A few years ago, I bought a wide-screen TV with an aspect ratio of 16:9. It is great for watching movies shot with a wide-screen format. However, on most other occasions, I am faced with a dilemma: If I choose the option to fill the entire screen, everything looks wider than normal, while preserving the aspect ratio of the video means seeing wasted space at both ends.

There is a mind-boggling array of readily available displays, from large plasma displays and high-resolution LCDs to low-resolution cellphone screens. These displays differ greatly in resolution and aspect ratio. The problem is that images and videos are captured at fixed resolutions and aspect ratios, and, from personal experience, viewing them properly on a given display can be a challenge.

What, then, is the correct way of displaying media? Global scaling solves part of the problem, but naively stretching or squashing one of the dimensions to fill the screen introduces undesirable distortions. Cropping is not a satisfactory solution either because important elements in the scene may either be partially removed or totally cut out. We need a solution that intelligently customizes media to displays.

The answer may well lie in the work of Ariel Shamir and Shai Avidan. Their technique, intriguingly called "seam carving," removes pixels from, or adds pixels to, regions deemed less important. Importance can be measured by local contrast or by the need to preserve people or objects. Given an energy function that measures this importance, removing pixels so as to minimize the energy removed is nontrivial, because we must preserve both the rectangular shape and the visual coherence of the image. Shamir and Avidan devised a simple but powerful idea: carve (remove) seams iteratively.
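As one illustration of a contrast-based importance measure, the sketch below assigns each pixel the sum of the absolute horizontal and vertical intensity differences between its neighbors. The function name `gradient_energy`, the grayscale list-of-rows input, and the clamped border handling are assumptions made for this example, not details taken from Shamir and Avidan's work.

```python
def gradient_energy(gray):
    """Return a per-pixel energy map for a grayscale image.

    `gray` is a list of rows of numbers (an assumed representation).
    Energy here is |right - left| + |down - up|, with neighbors clamped
    at the image border. This is one simple contrast measure among many.
    """
    rows, cols = len(gray), len(gray[0])
    energy = []
    for r in range(rows):
        energy_row = []
        for c in range(cols):
            left = gray[r][max(c - 1, 0)]
            right = gray[r][min(c + 1, cols - 1)]
            up = gray[max(r - 1, 0)][c]
            down = gray[min(r + 1, rows - 1)][c]
            energy_row.append(abs(right - left) + abs(down - up))
        energy.append(energy_row)
    return energy
```

Flat regions receive zero energy under this measure, while pixels next to an intensity edge receive high energy, so low-energy paths tend to avoid visible structure.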

A seam is a connected path of low-energy pixels crossing the image from top to bottom or from left to right. Their seam carving algorithm changes the aspect ratio of the image by iteratively carving the seams with the lowest importance, horizontally or vertically. The optimal seam at each iteration can be found using dynamic programming.
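The dynamic-programming step can be sketched as follows for a vertical (top-to-bottom) seam. This is a minimal illustration rather than the authors' implementation: the function name `find_vertical_seam` and the list-of-rows energy map are assumptions, and connectivity is taken to mean that the seam moves at most one column between adjacent rows.

```python
def find_vertical_seam(energy):
    """Return one column index per row for a minimum-energy vertical seam.

    `energy` is a list of equally long rows of numbers (assumed format).
    """
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = minimum total energy of any seam ending at pixel (r, c)
    cost = [list(energy[0])]
    # back[r][c] = column used in row r - 1 by the best seam ending at (r, c)
    back = [[0] * cols]
    for r in range(1, rows):
        cost_row, back_row = [], []
        for c in range(cols):
            # Connectivity: the previous pixel is at column c - 1, c, or c + 1.
            candidates = range(max(c - 1, 0), min(c + 1, cols - 1) + 1)
            best = min(candidates, key=lambda k: cost[r - 1][k])
            cost_row.append(energy[r][c] + cost[r - 1][best])
            back_row.append(best)
        cost.append(cost_row)
        back.append(back_row)
    # Start from the cheapest pixel in the last row and walk back to the top.
    c = min(range(cols), key=lambda k: cost[-1][k])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        c = back[r][c]
        seam.append(c)
    seam.reverse()
    return seam
```

The table is filled in a single pass over the pixels, which is what makes finding the optimal seam tractable at each iteration. A horizontal seam can be found the same way on the transposed energy map.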

Herein lies the magic of seam carving: removing a seam has only a local impact and the produced visual artifacts are globally imperceptible. As a result, seam carving maintains both a rectangular shape and visual coherence of the image.
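To make the locality concrete, here is a minimal sketch of carving one vertical seam. It assumes the image is a list of pixel rows and the seam is given as one column index per row; both representations, and the name `remove_vertical_seam`, are assumptions for illustration.

```python
def remove_vertical_seam(image, seam):
    """Return a copy of `image` that is one pixel narrower.

    `image` is a list of rows; `seam[r]` is the column to drop from row r.
    Pixels left of the seam keep their position; pixels to its right
    shift left by one column.
    """
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]
```

Each row loses exactly one pixel, so the result stays rectangular.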

To enlarge the image, the seam carving process is run in reverse by adding interpolated pixels along the lowest-energy seam. The authors have also demonstrated other applications of seam carving, such as content amplification, object removal, multisize image formats, and, last but not least, video resizing. Video resizing is nontrivial because of the need for temporal coherence in addition to spatial coherence. Shamir and Avidan cleverly achieved video resizing by casting the problem as a 3D graph with a 2D manifold (instead of a 2D graph with a 1D curve, as for images).
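The enlargement step can be sketched in the same style. The example assumes a grayscale image stored as a list of rows and a seam given as one column index per row. Averaging the seam pixel with its right-hand neighbor is an interpolation rule chosen here for brevity; it is an assumption, not necessarily the rule the authors use.

```python
def insert_vertical_seam(image, seam):
    """Return a copy of `image` that is one pixel wider.

    For each row, a new pixel is inserted just after the seam pixel.
    Its value is the average of the seam pixel and its right neighbor
    (or the seam pixel itself at the right border).
    """
    widened = []
    for row, c in zip(image, seam):
        right = row[c + 1] if c + 1 < len(row) else row[c]
        new_pixel = (row[c] + right) / 2
        widened.append(row[:c + 1] + [new_pixel] + row[c + 1:])
    return widened
```

Because the new pixels are blends of their neighbors and sit along a low-energy path, the added column is unlikely to introduce a visible edge.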

We need a solution like the one Shamir and Avidan explore here. However, two important issues must be addressed before such a solution is exposed to the masses. The first is real-time performance (that is, real-time rendering of media). Even if the algorithm is highly optimized, I would imagine it is difficult to achieve real-time resizing for high-resolution images and HD videos. Shamir and Avidan recommend precomputing the resizing operations for the most popular resolutions and aspect ratios and storing the vertical and horizontal seam index maps. The player or TV set would then recognize the display format, fetch the relevant metadata, and re-render the original video appropriately. While this is a good idea, for it to be practical there needs to be an efficient compression scheme for the seam index maps, especially for video. The second issue is algorithmic robustness: How can the intent, tenor, and attractiveness of media be preserved after it has been resized? Can these qualities be reliably codified? Human visual attention has been modeled to some extent (see, for example, the work of Itti, Koch, and Niebur [1]), but is such a model enough?
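The precomputation idea mentioned above can be pictured with a small sketch. It assumes a hypothetical vertical seam index map in which `index_map[r][c]` records the order in which pixel (r, c) would be carved (1 for the first seam removed, 2 for the second, and so on). That encoding and the name `render_at_width` are assumptions for illustration; the text only says that index maps are stored. Under this assumption, rendering at a narrower width is a simple filter with no seam search at playback time.

```python
def render_at_width(image, index_map, target_width):
    """Render `image` at `target_width` using a precomputed seam index map.

    Assumes each row of `index_map` assigns its pixels distinct removal
    orders 1..width. Dropping the first k seams means keeping every pixel
    whose removal order is greater than k.
    """
    width = len(image[0])
    k = width - target_width  # number of seams to drop
    return [
        [pixel for pixel, order in zip(row, order_row) if order > k]
        for row, order_row in zip(image, index_map)
    ]
```

A map like this holds one integer per pixel per frame, which hints at why an efficient compression scheme matters for video.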

Independent of these questions, as any computer vision scientist will tell you, completely automatic vision techniques are typically not foolproof. All techniques come with assumptions that may not be satisfied all the time. Consumers may not be forgiving if their video looks less than attractive; a horde of people brandishing pitchforks and torches comes to mind. Like it or not, I believe we will need a human in the loop for media customization. I am a big proponent of interactive computer vision, that is, the concept of judiciously adding interaction to complement what can be automated. This is acceptable in the context of media customization because it needs to be done only once for each video. (Plus, it may spawn a sizeable cottage industry.) The trick, then, is to design an interface that minimizes manual input. Shamir and Avidan's innovative algorithm should be adapted to take manual annotation into consideration in order to preserve the intent of the media.

I now look at my wide-screen TV and wistfully think, if only it were possible to customize media to displays now...

References

1. Itti, L., Koch, C., and Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 11 (Nov. 1998), 1254-1259.

Author

Harry Shum (hshum@microsoft.com) is a Fellow of ACM and a Corporate Vice President of Microsoft Corporation, Redmond, WA.

Footnotes

DOI: http://doi.acm.org/10.1145/1435417.1435436