Moving Beyond Text-Based Video Search

Current disputes over the appropriate format for searching online video threaten to dredge up old battles over the superiority of linear, textual styles of presentation versus visual-spatial representations. The conflict centers on whether the underlying qualitative meaning of a video is best captured through text, and can therefore be indexed with existing Web-crawling tools, or whether tools developed for the analysis of static visual features are portable to a video environment. While the video search tools of large content aggregators like Google or MSN are the most familiar, startups in the field dismiss them as skewed and somewhat primitive.

To offer search methods compatible with today’s Web-crawling tools, some independent developers like Search Inside Video use XML-based transcript tools to produce what still amounts to a text-based description of a video session. According to Search Inside, such methods still give a better synopsis of a video’s content than any visual analysis.
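As a rough illustration of the transcript approach, the sketch below parses a small XML transcript and builds a word index that maps each term to the timestamps where it is spoken. The XML layout (`transcript` and `segment` elements with a `start` attribute) and the function names are assumptions made for this example; they are not Search Inside Video's actual format or tools.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical transcript format, invented for this example.
SAMPLE = """<transcript video="demo-clip">
  <segment start="0.0">Welcome to the product launch</segment>
  <segment start="12.5">The new camera records video in low light</segment>
  <segment start="31.0">Questions about the camera price</segment>
</transcript>"""


def build_index(xml_text):
    """Map each lowercase word to the sorted start times of segments containing it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for seg in root.iter("segment"):
        start = float(seg.get("start"))
        words = set(re.findall(r"[a-z0-9]+", (seg.text or "").lower()))
        for word in words:
            index[word].append(start)
    return {word: sorted(times) for word, times in index.items()}


def search(index, term):
    """Return the segment start times (in seconds) where the term appears."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    idx = build_index(SAMPLE)
    print(search(idx, "camera"))
```

Because the result is ordinary text keyed to timestamps, an existing text crawler could index it like any other page, which is the compatibility argument made above.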

A few brave startups are following the route of Videosurf Inc. Videosurf begins with a textual-tag categorization method, then applies facial recognition tools to video content. The company claims that using both methods provides more meaningful results than relying strictly on text or images. "Our understanding of the video content is significantly deeper and more extensive than that of video search vendors who simply use text metadata," says Eitan Sharon, CTO and co-founder of Videosurf. Sharon says that an approach that iteratively combines text and video methods can extract the deeper meaning of a video clip. In Videosurf’s case, the native video methods include not only facial recognition, but also moving-object detection and video quality analysis. Videosurf chief scientist and co-founder Achi Brandt helped develop the algebraic multigrid methods used to search a video content space. Those results are combined with text-based analysis.
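One simple way to picture a combined text-and-visual ranking is a weighted sum of two scores. The sketch below is only that: a generic score-fusion example, not Videosurf's method, which the company describes as iterative and multigrid-based. The `Clip` class, the sample clips, the `jane_doe` label, and the 50/50 weighting are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    title: str
    tags: set = field(default_factory=set)   # text metadata attached to the clip
    faces: set = field(default_factory=set)  # labels from a hypothetical face recognizer


def text_score(clip, query_terms):
    """Fraction of query terms found in the clip's text tags."""
    if not query_terms:
        return 0.0
    return len(clip.tags & query_terms) / len(query_terms)


def visual_score(clip, person):
    """1.0 if the (hypothetical) recognizer saw the person in the clip, else 0.0."""
    return 1.0 if person in clip.faces else 0.0


def rank(clips, query_terms, person, text_weight=0.5):
    """Order clip titles by a weighted sum of the text and visual scores."""
    scored = [
        (text_weight * text_score(c, query_terms)
         + (1 - text_weight) * visual_score(c, person), c.title)
        for c in clips
    ]
    return [title for score, title in sorted(scored, key=lambda pair: -pair[0])]


CLIPS = [
    Clip("A", tags={"interview", "jane_doe"}),                    # tagged, face not detected
    Clip("B", tags={"interview"}, faces={"jane_doe"}),            # face detected, partly tagged
    Clip("C", tags={"interview", "jane_doe"}, faces={"jane_doe"}),  # both signals agree
]

if __name__ == "__main__":
    print(rank(CLIPS, {"interview", "jane_doe"}, "jane_doe"))
```

In this toy data, the clip where both signals agree ranks first, and a clip whose face match is confirmed visually outranks one that merely carries the right tag.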

Video-based search has two distinct meanings. It may refer to the use of advanced visual tools like pixel analysis for a native image-based search, or to a means of presenting text-based results in a novel way, such as a multi-dimensional representation of a topic. The social media network sevenload, for example, announced in September an ability to display search results in 3D. Don’t expect a detailed holographic analysis of voxels. Instead, sevenload uses a presentation method from Cooliris called "Embed Wall" to present traditional search results in a three-dimensional format.

Researchers recognize the advantages of native voxel searches, but insist that even still-image analysis is years away from being embedded into familiar search-engine structures. Video search, while pursued at UC San Diego, Carnegie Mellon University, and other institutions, is years behind image search.

One CMU source suggested off the record that academic teams can become too enamored of video-based methods to be objective about which ideas work best. In reality, textual tools can give end users information that is as useful as what image-based methods provide. But will users consider the text-based methods as sexy as the new tools?