Organizing Unstructured Data

Vector databases are efficient for conducting similarity searches, and they are scalable and flexible, but high-dimensional vectors can be computationally expensive, according to Apple's Huaping Gu.

Vector database company Pinecone in April secured $100 million in venture capital (VC) funding at a $750-million valuation. Other vector database startups have also recently raised millions from VCs, including Chroma, Weaviate, and Qdrant. This raises the question: what exactly are vector databases, and why are they generating buzz now?

Some 80% to 90% of any organization's data is unstructured, according to analysts' estimates. Databases, meanwhile, have gone through many iterations, from Structured Query Language (SQL) databases (in which data is structured in a collection of tables) and relational databases (which focus on the relationship between stored data elements) to NoSQL databases (in which data is stored and retrieved in different structures without using rows and columns). The rise of NoSQL was triggered by the advent of Web 2.0 in the early 2000s.

Those traditional databases were not adequately equipped to analyze unstructured data, especially in real time. Now, with artificial intelligence (AI) gaining momentum, vector databases have emerged for use in machine learning applications. A vector is a high-dimensional array of data in which each dimension is a number.

Explains Charles Xie, CEO and founder of vector database company Zilliz and the Linux Foundation's Milvus Project, "Vectors are important because when you're talking about pictures or images or video, they are the numerical representation of unstructured data that can be easily processed by a machine."

This is where the use of machine learning models to turn unstructured data into floating point values, or vector embeddings, is key. In contrast, those unstructured images, pictures, and videos are time-consuming and a challenge to classify manually in relational databases. As an example, it took 25,000 people (curators) to label the now-famous ImageNet dataset, Xie says.

Once the data is in a machine-readable format, relational databases store and search across structured table-based data, Xie says. However, unlike structured data, there is no easy way to store and efficiently search large amounts of unstructured data within a relational database.

For example, quickly searching for similar shoes, given a collection of shoe pictures from various angles, would be impossible in a relational database, since understanding shoe size, style, heel type, color, etc., purely from the image's raw pixel values is difficult, observes Chris Churilo, vice president of marketing at Zilliz. "So we want to turn to a machine to do that for us," using models "that are going to spit out a numerical representation of this content," known as embeddings or vectors, she says. "The cool thing about having this numerical representation is, now I can ask the machine to find [something] that's similar by basically comparing these numbers against each other."

The machine can do that pretty accurately, Churilo says.
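To make "comparing these numbers against each other" concrete, here is a minimal sketch using cosine similarity, one common way to score how alike two vectors are. The item names and the four-dimensional vectors are invented for illustration; embeddings from a real model typically have far more dimensions, and a vector database would not necessarily use this exact measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three shoe pictures (values are made up).
embeddings = {
    "red_sneaker_front": [0.9, 0.1, 0.8, 0.2],
    "red_sneaker_side": [0.8, 0.2, 0.9, 0.1],
    "black_high_heel": [0.1, 0.9, 0.2, 0.8],
}

def most_similar(query_name, embeddings):
    """Name of the stored item whose vector is most similar to the query item's."""
    query = embeddings[query_name]
    scores = {
        name: cosine_similarity(query, vec)
        for name, vec in embeddings.items()
        if name != query_name  # do not compare the item with itself
    }
    return max(scores, key=scores.get)

print(most_similar("red_sneaker_front", embeddings))
```

In this toy data, the two sneaker pictures point in nearly the same direction, so the side view is returned as the closest match to the front view even though the pictures themselves differ.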

Vector databases are commonly used for similarity search and product recommendations, agrees Arun Chandrasekaran, a distinguished vice president and analyst for market research firm Gartner.

"A vector database indexes and stores vector embeddings for fast retrieval," Chandrasekaran says. The increasing use of AI foundation models is causing greater interest in vector databases, he says. As clients fine-tune generative AI models, they will store and retrieve that organizational data in vector databases.

In generative AI, a vector database can be used to store the vector embeddings that result from the training of AI foundation models, Chandrasekaran adds.

"Vector database is the hot name for an old topic," observes Andy Pavlo, an associate professor of databaseology at Carnegie Mellon University, whose research area is database management systems. "It's all about keeping up with AI."

Echoing the others, Pavlo says ChatGPT and machine learning are storing vectors, and vector databases store those embeddings so users can use them for a fast lookup.

Vector databases are efficient for conducting similarity searches, and they are scalable and flexible, writes Huaping Gu, a software data engineer at Apple. However, there are also some drawbacks to using them. High-dimensional vectors can be computationally expensive. They can also be difficult to visualize and interpret, which makes it a challenge to debug or fine-tune AI/ML models, according to Gu.

Vector databases also don't return perfect search results. "At the end of the day, what they're doing is building indexes to do a nearest-neighbor search, and the idea is you have a multidimensional space that represents your vectors," Pavlo says. "When you do the query and convert it as an embedding into a vector, it won't land on an exact match."
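The nearest-neighbor idea Pavlo describes can be sketched in a few lines. This is a brute-force version that measures the Euclidean distance from the query to every stored vector; it is not the index structure a production vector database would build, which usually trades some accuracy for speed. The document keys and two-dimensional vectors are invented for illustration.

```python
import heapq
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, stored, k=2):
    """Return the k (distance, key) pairs closest to the query, nearest first."""
    return heapq.nsmallest(
        k, ((euclidean(query, vec), key) for key, vec in stored.items())
    )

# Hypothetical stored embeddings (values are made up).
stored = {
    "doc_a": [0.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [5.0, 5.0],
}

# The query vector does not coincide with any stored vector.
query = [0.9, 1.2]
results = nearest_neighbors(query, stored, k=2)

for distance, key in results:
    print(f"{key}: distance {distance:.3f}")
```

No stored vector equals the query, so every distance is greater than zero; the search simply ranks what is closest, which is why results are "similar" rather than exact.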

Right now, "the use cases for vector databases are quite limited for most enterprise clients," observes Chandrasekaran. However, expect to see increased use of them. They are "gaining immense popularity for generative AI applications," he says, adding that "this is a nascent but fast-evolving ecosystem."

Esther Shein is a freelance technology and business writer based in the Boston area.