Technical Perspective: Paris Beyond Frommer’s

Try to visualize Paris. If you are like me, you will imagine a tourist-guidebook composite: the Eiffel Tower lit up at night, Notre Dame Cathedral, the bridges over the Seine, and so on. But this is not what most of Paris looks like. It turns out that many people can look at a photograph of a randomly selected street corner in Paris and correctly identify the city with high accuracy, even without paying attention to any text in the photo. One must conclude that all of Paris is infused with a "Paris-ness," a certain je ne sais quoi, that leaves an indelible visual mark on the City of Light.

How can we quantify this Paris-ness? Can a computer automatically discover and tell us what makes Paris look so much like Paris? More broadly, the question of visual style is an important one in computer graphics and vision. Identifying the key elements that characterize a style, whether a style of interior design, of art, or, in the case of a city, of architecture and ornamentation, could aid a range of applications, such as obtaining reference imagery for a new design task or summarizing and categorizing the look of a large set of images.

However, identifying and characterizing visual style automatically is a very challenging problem, one that is difficult even to formulate in a rigorous way. This is where the following paper steps in. This work, and several companion papers in computer vision, offers a creative, inspiring new approach to discovering the visual style of a city like Paris. The authors achieve this feat through new algorithms that analyze massive collections of photos of Paris and other places around the world.

One key aspect of this visual discovery problem is starting with the right image data. A popular approach in computer vision is to mine data from a large set of consumer photos shared on sites like Flickr. However, in the case of a city like Paris, this approach would result in a representation of Paris akin to the collage of landmarks of my own tourist-centered imagination, because this is how Paris is represented in photos shared online. To be sure, these landmarks are important, but to capture the real essence of Paris we need to look further. This work instead turns to Google Street View to gather many thousands of photos captured systematically throughout the city from cars: the kind of views you would see on a stroll down the street. Hence, the visual elements discovered in this work are the sort you would encounter in the city every day, perhaps without even noticing that they make up its visual fabric.

Can a computer observe a city and discern its visual essence?

But given a large, representative set of images of a place, how could we use them to compute these visual elements? One approach would be to break each image into small patches (say, about the size of a door or window), group these patches by visual similarity using a clustering algorithm, and then identify the most common clusters as the key visual elements. But, as this work shows, this approach does not work very well: even if you discard uninteresting patches such as those of the sky, standard clustering methods return unremarkable patches such as edges and corners.
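The naive pipeline just described can be made concrete with a minimal sketch. It assumes grayscale images given as lists of pixel rows, uses raw pixel values as patch features, filters out near-uniform (sky-like) patches with a variance threshold, and uses plain k-means as the clustering algorithm. All function names and thresholds here are hypothetical illustrations, not the authors' code.

```python
from collections import Counter


def extract_patches(image, size):
    """Cut a grayscale image (a list of rows) into non-overlapping
    size x size patches, each flattened into a feature vector."""
    patches = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            patches.append([image[r + i][c + j]
                            for i in range(size) for j in range(size)])
    return patches


def is_textured(patch, min_var):
    """Hypothetical filter: reject near-uniform patches (e.g. sky)
    whose pixel variance falls below min_var."""
    mean = sum(patch) / len(patch)
    return sum((x - mean) ** 2 for x in patch) / len(patch) >= min_var


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def most_common_clusters(patches, k, top=3):
    """Baseline: treat the largest clusters as the 'key visual elements'."""
    _, labels = kmeans(patches, k)
    return Counter(labels).most_common(top)
```

Because this baseline ranks clusters purely by size, whatever is most frequent wins, and in street imagery generic edges and corners are everywhere, which is exactly the failure mode the paragraph above describes.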

Instead, the authors start from a crucial insight: what makes Paris look like Paris is not necessarily the image patches that commonly appear in Paris, but those that appear in Paris and nowhere else; in other words, patches that distinguish Paris from all other cities. Using a new discriminative clustering technique, they show how to automatically identify such distinctive patches. The discovered patches are remarkably evocative of Paris, capturing its unique balconies, signs, light posts, and other elements. When I first saw these results at SIGGRAPH, it immediately struck me that the authors were onto something new and important, something no previous visual clustering method had shown. Many other interesting insights follow, such as the finding that U.S. cities are fairly similar to each other but are distinguished from cities on the Continent by being filled with cars.
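To illustrate the shift from "common" to "distinctive," here is a deliberately simplified score, not the authors' discriminative clustering technique: for each candidate Paris patch, look at its k nearest neighbors in a pooled set of Paris and non-Paris patches and measure what fraction come from Paris. The squared-distance metric, the value of k, and the function names are illustrative assumptions.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distinctiveness(candidate, pool, k):
    """Fraction of the candidate's k nearest neighbors in `pool` that come
    from Paris. `pool` is a list of (feature_vector, is_paris) pairs."""
    # sorted() is stable, so tied distances keep their order in `pool`.
    nearest = sorted(pool, key=lambda item: sq_dist(candidate, item[0]))[:k]
    return sum(1 for _, is_paris in nearest if is_paris) / len(nearest)


def rank_distinctive(paris, other, k=3, top=3):
    """Rank Paris patches by how Paris-specific their neighborhoods are.
    Returns (score, index_into_paris) pairs, best first."""
    scores = []
    for i, cand in enumerate(paris):
        # Leave the candidate out so it cannot vote for itself.
        pool = [(p, True) for j, p in enumerate(paris) if j != i]
        pool += [(o, False) for o in other]
        scores.append((distinctiveness(cand, pool, k), i))
    # Highest score first; break ties by original index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return scores[:top]
```

For example, with toy features where `[5, 5]` stands in for a balcony seen only in Paris and `[0, 1]` for a generic edge seen everywhere, `rank_distinctive([[0, 1], [5, 5], [0, 1], [5, 5], [5, 5], [5, 5]], [[0, 1]] * 6 + [[9, 9]])` ranks the balcony patches first even though edge patches are the most frequent in the pooled data. A frequency-based ranking would favor the edges instead.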

Using large image collections for computer vision and graphics is by now a tried-and-true approach. Past work has used photos mined from the Internet to train better object recognition systems or build 3D models. But it is exciting to see the fresh and innovative use of big data presented here. There is something magical about automatically distilling the visual signature of a place—signatures we all can sense but cannot easily articulate. And more broadly, this work represents an exciting new direction in discovering visual styles from big data.

Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/2830541