Keeping data in memory instead of pulling it in from a disk can speed up the processing of that data by orders of magnitude, which is why database companies have been vying for a share of the in-memory database market. IBM, Oracle, and Microsoft introduced versions of such databases this year, while SAP has been selling its Hana product for the past three years. Smaller companies including Aerospike, VoltDB, and MemSQL have all been getting into the act as well.
What they promise is a way to speed up activities that are important to businesses, such as processing transactions with their customers or analyzing the ever-growing quantities of information those transactions produce. "I think the potential is enormous," says Amit Sinha, senior vice president for marketing at SAP, headquartered in Walldorf, Germany. "Changing the data center from disk to memory has huge implications."
In-memory databases, also known as main memory databases, speed up processing in two basic ways: with the data available in memory, the time lag caused by fetching the data off a disk is erased; also, data tends to be stored on disk in blocks, and to get the one desired piece of data, the computer imports the whole block from disk, decodes it, runs the process on the piece it wants, re-encodes the block, and sends it back where it came from.
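The block round trip described above can be made concrete with a toy sketch. It is only an illustration under stated assumptions: the "disk" is a Python list of JSON-encoded blocks, and `RECORDS_PER_BLOCK`, `update_on_disk`, and `update_in_memory` are hypothetical names, not part of any product mentioned here. Real engines use binary pages and buffer pools rather than JSON.

```python
import json

RECORDS_PER_BLOCK = 4  # hypothetical block size, chosen only for illustration

# Stand-in for a disk: each block is an encoded (serialized) group of records.
disk_blocks = [
    json.dumps({f"acct{i}": 100 for i in range(start, start + RECORDS_PER_BLOCK)})
    for start in range(0, 8, RECORDS_PER_BLOCK)
]

def update_on_disk(blocks, key, delta):
    """Change one record by decoding and re-encoding its whole block."""
    for i, encoded in enumerate(blocks):
        block = json.loads(encoded)        # import and decode the whole block
        if key in block:
            block[key] += delta            # operate on the one record we want
            blocks[i] = json.dumps(block)  # re-encode the block and write it back
            return block[key]
    raise KeyError(key)

# Stand-in for main memory: records are addressed directly.
memory_table = {f"acct{i}": 100 for i in range(8)}

def update_in_memory(table, key, delta):
    """Change one record in place, with no block decode or re-encode."""
    table[key] += delta
    return table[key]

print(update_on_disk(disk_blocks, "acct5", -30))
print(update_in_memory(memory_table, "acct5", -30))
```

Both calls produce the same balance; the difference is the extra decode and re-encode work on the block-based path, which grows with the block size rather than with the size of the record being changed.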
"The data structure and the algorithms for managing all that get quite complicated and have high overhead, but when you store data in memory, you get rid of all of that," says Paul Larson, principal researcher at Microsoft Research in Redmond, WA, who has worked on Microsoft's Hekaton in-memory database. "When data lives entirely in memory, you can be much more efficient."
It is difficult to quantify exactly how much of an efficiency boost an in-memory database provides; the answer is affected by the type and quantity of the data and what is done to it. "It's extremely workload dependent," Larson says.
Michael Stonebraker, a leading database expert and a co-founder of VoltDB of Bedford, MA, as well as an adjunct professor of computer science at the Massachusetts Institute of Technology (MIT) in Cambridge, MA, says processes can be "anywhere from marginally faster to wildly faster." Some might be improved by a factor of 50 or 100, he says.
Though the speed increase varies depending on the particular applications, some of SAP's customers have managed a 10,000-fold improvement, Sinha says, with many processes that used to take minutes now being done in seconds. That can be important to businesses, for instance by allowing them to run real-time fraud analysis. "Before the credit card drops, you need to have analysis of whether this transaction is valid," he says.
It might also allow new kinds of analyses. Imagine you work at World Wide Widgets, and your job is to make sure you buy sufficient raw materials and get them to the right factories to provide all the Walmarts in a particular region with just as many widgets as they are likely to sell: not more, because holding onto inventory costs money, and not fewer, because that means you have missed out on sales. To do this job, known in the supply chain industry as "materials requirement planning," you keep track of both the sales at the different stores and your supplies of raw materials. If, instead of doing such an analysis weekly or daily, you could run it in real time, you could fine-tune the process and get a more accurate balance between supply and demand. "That level of planning can really take enormous amounts of costs out of business," Sinha says.
Others wonder whether the case has really been made that using in-memory databases for data analytics provides a clear economic advantage. "It's interesting that they feel there's a good commercial market here," says Samuel Madden, a professor in the Computer Science and Artificial Intelligence Laboratory at MIT.
Madden does not dispute that in-memory databases are faster. "You can pore through more records per second from memory than you can through a traditional disk-based architecture," he says. However, other methods of streamlining the process, from reducing the number of instructions to arranging data by columns rather than rows, might produce a similar increase in efficiency, or at least enough of a boost that the added expense of more memory is not worthwhile. "The performance of these things might not be all that different," Madden says.
Data analytics are only one part of the database world. Stonebraker divides it into one-third data warehouses, one-third online transaction processing (OLTP), and one-third "everything else": Hadoop, graph databases, and so on. For OLTP, both Stonebraker and Madden say, the advantage is clearer.
Performing analytics on a data warehouse involves scanning many records, perhaps millions or billions, that change infrequently, if ever. In OLTP (say, processing an order on Amazon or moving money between bank accounts), the computer touches a small amount of data and performs a small operation on it, such as updating a balance. Being able to perform such transactions faster means a company can either do more of them in a given period of time, increasing its business, or do the same amount with less processing hardware, saving infrastructure costs. "Instead of tens of milliseconds per transaction, you can go to hundreds of microseconds, and in some applications that really matters," Madden says. Computerized trading on Wall Street might be one beneficiary; "if you can issue your trade 1ms faster than the other guy, you can make more money."
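To see what that latency change means for throughput, here is a back-of-the-envelope sketch. The specific figures are assumptions picked to match Madden's ranges (20 ms for "tens of milliseconds," 200 µs for "hundreds of microseconds"), and the model assumes a single worker handling transactions one after another, which real systems do not do.

```python
US_PER_SECOND = 1_000_000

def transactions_per_second(latency_us):
    """Throughput of one worker that handles transactions strictly in sequence."""
    return US_PER_SECOND / latency_us

# Assumed example latencies; the article gives only orders of magnitude.
disk_latency_us = 20_000   # "tens of milliseconds"
memory_latency_us = 200    # "hundreds of microseconds"

disk_tps = transactions_per_second(disk_latency_us)
memory_tps = transactions_per_second(memory_latency_us)
speedup = memory_tps / disk_tps

print(f"disk-based: {disk_tps:.0f} tx/s per worker")
print(f"in-memory:  {memory_tps:.0f} tx/s per worker")
print(f"speedup:    {speedup:.0f}x")
```

Under these assumed numbers, one worker goes from about 50 to about 5,000 transactions per second, which is the choice the paragraph describes: more business on the same hardware, or the same business on less.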
From an economic standpoint, OLTP might be a better fit for in-memory databases because of their relative sizes, Stonebraker says. A one-terabyte OLTP database would be considered large; 10 TB would be huge. A terabyte of main memory today costs less than $30,000, so the efficiency gain could make that affordable for some companies. On the other hand, he says, "data warehouses are getting bigger at a faster rate than main memory is getting cheaper." For instance, the gaming company Zynga has amassed around five petabytes in its data warehouse. "You're not about to buy 5PB of main memory," Stonebraker says.
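Stonebraker's sizing argument comes down to simple arithmetic, sketched below. The $30,000 per terabyte figure is the upper bound quoted above; treating a petabyte as 1,000 TB is an assumption, and the function name is hypothetical.

```python
COST_PER_TB_USD = 30_000  # upper bound quoted in the article ("less than $30,000")
TB_PER_PB = 1_000         # assumption: decimal units

def memory_cost_usd(size_tb, cost_per_tb=COST_PER_TB_USD):
    """Upper-bound cost of holding size_tb terabytes entirely in main memory."""
    return size_tb * cost_per_tb

large_oltp = memory_cost_usd(1)             # a "large" OLTP database
huge_oltp = memory_cost_usd(10)             # a "huge" OLTP database
warehouse = memory_cost_usd(5 * TB_PER_PB)  # a Zynga-scale data warehouse

print(f"1 TB OLTP:      ${large_oltp:,}")
print(f"10 TB OLTP:     ${huge_oltp:,}")
print(f"5 PB warehouse: ${warehouse:,}")
```

At that price, even a huge OLTP database stays in the hundreds of thousands of dollars, while a 5 PB warehouse would run to about $150 million for memory alone, before any replicas.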
Larson sees the analytics side growing as well. "The general trend is toward more real-time analytics, partly because now we can do it," he says. "It's technologically feasible," because DRAM has gotten denser and cheaper, making it possible to hold so much data in memory, and because processing power has increased enough to handle all that data.
Yet DRAM, of course, is volatile; if the power goes out, the data goes away. That means there always needs to be a backup, which can add cost and slow performance. One way to back up data is to replicate it on a separate DRAM with a different power source, which means a company can switch to its backup with little delay. "How many replicas you run depends on your level of paranoia and how much money you have," Larson says.
Another approach is to make a slightly out-of-date copy of the data on a disk or in flash memory, and keep a log of transaction commands in flash. If data is deleted, the stale copy can be moved from the disk and the log used to recreate the transactions until the data is up to date. Exactly how a company handles backup depends on how fast it needs to restore, and how much it can afford to spend. Madden points out, though, "These database systems don't crash all that often."
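The snapshot-plus-log approach can be sketched in a few lines. This is a minimal illustration, not any vendor's recovery scheme: the class and method names are hypothetical, and the `snapshot` and `log` attributes are plain Python objects standing in for the copy on disk and the command log in flash.

```python
import copy

class InMemoryStore:
    """Toy key-value store that recovers from a stale snapshot plus a command log."""

    def __init__(self):
        self.data = {}      # live data in volatile memory
        self.snapshot = {}  # stand-in for the slightly out-of-date copy on disk
        self.log = []       # stand-in for the command log kept in flash

    def _apply(self, command):
        op, key, value = command
        if op == "set":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")

    def execute(self, command):
        self.log.append(command)  # record the command before applying it
        self._apply(command)

    def checkpoint(self):
        """Refresh the snapshot; commands logged before it are no longer needed."""
        self.snapshot = copy.deepcopy(self.data)
        self.log.clear()

    def crash(self):
        self.data = {}            # a power loss wipes volatile memory

    def recover(self):
        self.data = copy.deepcopy(self.snapshot)  # reload the stale copy
        for command in self.log:                  # replay until up to date
            self._apply(command)

store = InMemoryStore()
store.execute(("set", "alice", 100))
store.checkpoint()
store.execute(("set", "bob", 50))
store.execute(("delete", "alice", None))
store.crash()
store.recover()
print(store.data)
```

How often to checkpoint is the trade-off the paragraph points to: frequent snapshots shorten the log that must be replayed and so speed up recovery, but they cost more I/O during normal operation.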
Crashes may be less of a concern in the future, if makers of memory such as Intel introduce a non-volatile technology, such as magnetic RAM or phase-change memory. According to Larson, researchers consider the future of such memory promising. "The technology is beginning to look good enough and the capacities are looking quite large," he says. "What they don't want to talk about at this point is when and at what price."
Non-volatile memory will likely be slower than DRAM, but if it is significantly less expensive, that could be a worthwhile trade-off, Larson says.
Another issue that could affect the in-memory field is the existence of distributed databases. When the data an application needs is spread out among different machines, the advantages of in-memory may disappear. "Building a database system on top of the shared memory system is problematic," says Larson. "I don't really see how to make them efficient."
Stonebraker is more blunt. "Distributed memory is just plain a bad idea," he says. "No serious database system runs on distributed memory."
Sinha, on the other hand, says distributed memory can work. With fast interconnects, sometimes it is easier to access the main memory of a nearby machine, he says. It is also important to make sure a piece of data is only written to one place in memory, and that data is organized so it rarely has to cross a partition. "You can be intelligent in keeping data together," he says.
Stonebraker sees in-memory databases as eventually taking over online transactions. "I think main memory is the answer for a third of the database world," he says. He expects that takeover to take the rest of the decade to happen, while algorithms mature and businesses examine the value of the technology. "It's early in the market," he says.
©2014 ACM 0001-0782/14/09