IBM fellow and retired director of the Database Technology Institute at IBM's Almaden Research Center Patricia G. Selinger has made tremendous contributions to the database industry. After joining IBM Research in 1975, Selinger became a leading member of the team that built System R, the first proof that relational database technology was practical. Her innovative work on cost-based query optimization has been adopted by the majority of relational database vendors and is taught in most university database courses. Selinger managed the Almaden Computer Science department and established the Database Technology Institute, a joint program between IBM Research and the IBM software development team that accelerated advanced technology into data management products such as DB2 Universal Database for z/OS, IMS, DB2 UDB on Linux, Windows, and Unix.
The interview presented here was conducted by James Hamilton, a member of Microsoft's SQL Server Team and the former lead architect on DB2 and leader of the C++ compiler project at IBM.a
JAMES HAMILTON Let's start with the role of a query optimizer in a relational database management system and your invention of cost-based optimizers.
PAT SELINGER As you know, the fundamental tenet of a relational database is that data is stored in rows and columns. It's value-based in that the values themselves stand up forrepresentthe data. No information is contained in pointers. All of the information is reflected in a series of tables, and the tables have a certain well-known shape and form: there's an orders table, a customers table, an employees table, and so forth. Each of those tables has a fixed set of columns: the first name, the last name, the address.
Relational systems have a higher-level language called SQL, which is a set-oriented query language. This is a unique concept and really what distinguishes relational database systems from anything that came before or after.
The set-oriented concept of the query language allows asking for all the programmers who work in department 50; or all of the orders over $5,000; or all of the San Jose customers who have orders over $5,000; and so forth. The information in relational tables can be combined in many different ways, based on their values only.
How do you take this very high-level set-oriented question that the user asks and turn it into an exact recipe for navigating the disk and getting the information from each of the different records within each of the different tables? This process is query optimization: the idea of mapping from the higher-level SQL down to the lower-level recipe or series of actions that you go through to access the data.
Query optimizers have evolved as an enabling technology to make this high-level programming languagethis data-access language, SQLwork. Without that, you would end up using brute force: let's look at each row and see if it matches the description of what's asked for. Is it department 50? Is it an order that's over $5,000? That would be very inefficient to scan all of the data all of the time.
So we have access techniques that allow you to look at only a subset of the data, and then you have to plan which of those access techniques makes sense for any given kind of query.
The trick to cost-based query optimization is estimating a cost for each of the different ways of accessing the data, each of the different ways of joining information from multiple tables, and estimating the sizes of results and the savings that you would get by having data in the buffer pools, estimating the number of rows you'll actually touch if you use an index to access the data, and so forth.
The more deeply you can model the cost of accessing the data, the better the choice you'll make for an access path. What we did back in the late 1970s was to invent this cost-based query optimization and provide a model that was good enough for searching a very large space of choices within a reasonable amount of time, then coming up with a very good cost estimate and, therefore, a very good access path.
JH It's amazing that this number of years later, this work remains the dominant approach to relational database system query optimization. Cost-based optimizers have been a real success in technology transfer from research to industry. Can you tell us a little bit about why that was so successful?
PS The quality of cost-based query optimization has really made it possible for people to have relatively hands-free application development. That is, the application developer doesn't have to know a huge amount about the layout of the data on disk and the exact places where the records are and the exact access paths to those records. There's a huge upside from the application productivity standpoint to being able to do really good cost-based query optimization. So, there's a compelling marketplace force for having good cost-based query optimization.
I participated in inventing the System R query optimizer, which was taken lock, stock, and barrel and put into IBM's DB2 relational database product where it has been continually enhanced. Many of the simplifying assumptions to make the problem trackable back in the late 1970s have been eliminated, and the model is now deeper and richer and includes more techniques for accessing the data.
"Outstanding cost-based query optimization is critical to speeding application productivity and lowering total cost of ownership."
It's a growing science, and I think that's part of its success. It has been able to grow and adapt as new inventions come along for accessing the data, or for joining the data in different ways. All of the relational database products have gone to a cost-based query optimization approach and have moved away from a rules-based approach, which was really too inflexible to get you good performance all the time.
The technology still has room to growfor example, when the data itself behaves differently from what the model assumes. Many optimizers do not model highly correlated data really well. For example, 90210 is a ZIP code that's only in California. ZIP codes are not evenly distributed across states, and there isn't a 90210 in every state of the union. For a user request, nailing down the ZIP code to 90210 is sufficient and applying another predicate, such as state equals California, doesn't change the result. It won't reduce the number of rows because the only 90210 is in California.
JH One of the enemies of industrial query optimizers is complexity, and that can sometimes yield lack of query plan robustness. Small changes in the queries or in the data being queried can lead to substantially different plans. Customers often ask me for a good plan that is stable rather than a near-optimal plan that changes frequently in unpredictable ways. What direction should we be looking to make progress on the optimal query plan versus query plan robustness problem?
PS I think we have to approach it in two ways. One is that you have to be able to execute good plans, and during the execution of a plan you want to notice when the actual data is deviating dramatically from what you expected. If you expected five rows and you've got a million, chances are your plan is not going to do well because you chose it based on the assumption of five. Thus, being able to correct mid-course is an area of enhancement for query optimizers that IBM is pursuing.
Second, you have to continue to deepen the model because you've got to come up with reasonable plans before you can fine-tune them dynamically. Understanding the correlation between rows or between columns in different tablesnoting the ZIP code example I gave beforeis a very important part of continuing to understand the data more deeply and therefore being able to do even better query optimization.
The fact is that as customers use more and more shrink-wrapped packages or have ad hoc users who haven't gone to SQL school for a year, there's a real need to be able to do good query optimization. You can't have database administrators running into the room, saying, "Don't hit Enter yet. I've got to look at your query to see if it's going to be OK." Outstanding cost-based query optimization is critical to speeding application productivity and lowering total cost of ownership.
JH Let's look for a moment at query optimization and where the technology can be taken beyond database management systems. IBM, and the industry as a whole, have been investing in recent years in auto-tuning and in autonomic computing. Do you see a role for cost-based optimization in this application area?
PS Absolutely. Companies have a lot of data that is quite well structuredan order, a customer, an employee recordbut that's maybe 15% of all of the data that a company has. The rest of it is in document files, it's in XML, it's in pictures, it's on Web pagesall of this information also needs to be managed. XML provides a mechanism for being able to do that, but the data isn't quite as regularly structured. It's very dynamic. Every record could look different from the next record even in a related collection of things such as orders or documents.
So you have to have a query language such as XQuery that will be able to navigate and to ask set-oriented questions about this new kind of data. That opens up a different repertoire of data access techniques and requires enhancement of the query optimization process. But I think it's absolutely essential to continue on the path of automatic query optimization rather than put programmers back into the game of understanding exact data structures and doing the navigation in the application program manually. That's simply cost-prohibitive.
JH Looking at new optimization techniques, feedback-directed systems, and dynamic execution time decisionsall significant areas of continuing researchwhat do you see as the most important next steps looking out, say, five years or so?
PS I think the cost of ownership is on every customer's mind, not just because of the economic downturn that some of them are still in or have just experienced, but because the cost of processors, disk space, and memory are all going downand the cost of labor is going up.
Furthermore, you have to look at the ratio of how many administrators you need to take care of a terabyte worth of data. Unless you can dramatically improve that ratio, as you accumulate more and more terabytes of data, pretty soon you're looking at employing half the planet to administer it. So we are inventing ways to make administrators capable of handling 20, 100, 1,000 times more data than they do today.
At the same time, we're under pressure to incorporate, search, understand, and take advantage of information that's in this more unstructured formemail, for example. Companies want to be able to look at email or customer service files to give their customers better service, and have to manage and understand and analyze more kinds of information. As we look at what it's going to take to do that, we have to change the game in terms of the cost of organizing, administering, and searching this data.
JH I've seen two pictures painted of the future of unstructured data. One of them has file systems augmented with search appliances, and another is based upon an expanding role of structured stores that are much more flexible and much more capable of dealing with dynamic schema and content. Is there a role for file systems and search appliances? Where do you see this playing out?
PS I don't think that any current or future data storage mechanism will replace all the others. For example, there are many cases where file systems are just fine and that's all you need, and people are perfectly happy with that. We have to be able to reach out to those data sources with a meta-engine that knows how to reach and access all those different data repositories and understands all the different formats.jpg, .mpg, .docand knows how to interpret that data.
The notion of an intergalactic-size, centralized repository is neither reasonable nor practical. You can't just say to a customer, "Put all your data in my repository and I'll solve all your problems." The right answer from my perspective is that customers will have their data in a variety of places under a variety of applications in file systems and database engines. They're not going to centralize it in one kind of data store. That's just not practical. It's not economically feasible
So, file systems will still be around. They may get enhanced with special search techniques as we have more capability and processing power in RAID systems and disk servers, file servers, and so forth, and relational systems will get richer in what they can handle, but we're not going to replace all of the technologies with any one single answer.
"I think the cost of ownership is on every customer's mind, not just because of the economic downturn... but because... the cost of labor is going up."
JH Do you see content management systems of the future mostly layered on relational database systems, or do you see them as independent stores built using some of what we've learned over the past 30 years of working on relational technologies?
PS I like the architecture in the DB2 content manager, where DB2 is the library serverthe card catalogue, so to speakand it uses some extra semantics in a system-level application surrounding DB2 with some new user-defined types and functions, and stored procedures implementing those applications. Then it has separate resource managers, which are capable of handling a certain class of data types and styles with this kind of document, these kinds of images. They could be physically stored in either the DB2 as the library server or some separate place or file system out on a number of different engines.
It gives you a flexible configuration. You can exploit as much as you like of the functions of DB2XML, for exampleor you can choose to use some of these repository managers. They may be less feature-rich but are expert in a particular kind of information and could be stored locally to where you need that dataparticularly if it's massive amounts of data, such as mass spectrometry results. Those are huge files and you want them close to where you're doing the analysis.
JH Given that relational stores now support XML and full-text search, what's missing? Why haven't extended relational systems had a bigger impact in the unstructured world?
PS The semantics of content management go beyond just the data storage parts, the data storage engines, the DB2s of the world. There's a significant set of other abstractions and management techniques that either have to go on top or have to come from a content management system that uses and exploits an extended relational engine but doesn't solely rely on it.
For example, content management systems have the ability to allow Pat access to Chapter 1 of a document, and James access to Chapter 2, and Ed access to Chapters 1 and 2, at the sub-sub-document level. This is something that relational systems don't do today. Similarly, foldering, the idea of document collections that really aren't related to similar structure but are tied to some higher-level semantic content, is beyond what relational systems are undertaking at this point.
"I love the idea of open source. My dream is that this allows many more opportunities for using databases in places where people wouldn't ordinarily go out and buy a database engine."
JH Are there other areas where you see research needed for content managers and relational stores to improve and help customers manage a wider variety of data?
PS If I were choosing today to do research or advanced development, there are a number of areas that are very, very interesting to me. There's continued invention needed in the autonomies. What do you have to do to have a truly hands-free data system that could be embedded in anything? What do you have to do to have truly mass-parallelism at the millions-of-systems (e.g., Internet) level? As commodity hardware becomes smaller and smaller, can we link and talk to systems and compute things on a scale of millions, where today we're at a technology level of thousands? How do you deal with data streams where the queries are fixed and the data is rushing by, and it could be unstructured data? How do you accumulate metadata and keep it up to date? How do you manage it, learn from it, derive information from it?
Searching is still in its first generation. There are lots of opportunities to make search better. If it knew you were angry when you typed in your three keywords to a search engine, would that help it understand what you were searching for? If it knew what email you had just seen before you typed those search keywords, would that help it understand what you were looking for? How can a search engine find what you intended as opposed to what you typed?
How reliable is derived information? There are many sources of unreliability. What if I have a source of information that's right only half the time? How do I rate that information compared with another source who's right all of the time? How do I join together that information, and what's the level of confidence I have in the resulting joined information?
All of those things, as we start dealing with unstructured data and incomplete answers and inexact answers and so forth, are great opportunities for research and advanced development.
JH We've started to see open source having an increasingly large role in server-side computing. Specifically in the database world, we've now got a couple of open source competitors. Is open source a good thing for the database world?
PS I love the idea of open source. I was the manager of the IBM Cloud-scape team at the time that we contributed it to Apache, where it has become an incubator project under the name Derby. My dream is that this allows many more opportunities for using databases in places where people wouldn't ordinarily go out and buy a database engine.
So open source can bring the benefits of the reliability, the recoverability, the set-oriented query capabilities to another class of applicationssmall businessesand the ability to exploit the wonderful characteristics of database systems across a much richer set of applications. I think it's good for the industry.
A previous version of this interview appeared in the April 2005 issue of ACM Queue.
©2008 ACM 0001-0782/08/1200 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2008 ACM, Inc.