Science and common sense both tell us that the facts about the world are not directly observable but can be inferred from observations about the effects of actions. What people infer about the world is not just relations among observations, but relations among entities much more stable than observations. For example, 3D objects are more stable than the image on a person’s retina, the information directly obtained from feeling an object, or an image scanned into a computer.1 Likewise the fact a customer has children is more stable than the fact a particular basket includes Roll-ups. The fact that a customer has diabetes is more stable than a particular pattern of food purchases that may allow inferring he or she has diabetes. The phenomenal facts, partly because they are more stable than observations, are more predictive of future behavior than simple observational facts.
The extreme positivist philosophical view that science concerns relations among observations still influences the design of learning programs, and that’s what data miners are. However, science never worked that way, neither do babies and neither should data mining programs; all obtain and use representations of the objects and use observations only as a means to that end.
Data mining involves computer programs that infer relations among different kinds of data in large databases. The goal has been to infer useful relations that might not have been noticed or at least could not have been confirmed among this data. We use the standard example of a supermarket chain with a database of all the cash register receipts for some long period of time—many gigabytes of data. The company wants to mine this database for information useful for improving its business.
Data mining can be made to do more than just find relations among data. Data amounts to observations of the world, and it is possible to infer relations among entities in the world from the data. Such relations are likely to be as useful to know about as are relations among the entities directly represented in the data.
In the supermarket chain example, there are people, groups of people, their homes with pantries, refrigerators and freezers, and facts about what they cook and what they eat. It should even be possible to infer the existence of entities in the world, such as previously unidentified groups of people with distinct eating and purchasing habits. Another example is to identify bellwether groups; what they buy today, many more will buy tomorrow.
The phenomenal facts, partly because they are more stable than observations, are more predictive of future behavior than simple observational facts.
Moreover, the information will usually admit a more compact description in terms of the underlying phenomena than in terms of the data.
Although all commonsense-level knowledge of the world is potentially relevant to data mining, formalizing common sense has proved to be a difficult AI problem, and progress has been slow. Nevertheless, we can expect that certain phenomena will be related to the information in databases in a straightforward manner so that information about them can be found by data miners.
Phenomena in the World
What phenomena in the world should a data mining program be told, have built into it, or be able to discover for itself?
At first, knowledge of the general phenomena will be built into the data miners (data mining programs),2 and the programs will infer specific values. Later, data miners should use the information expressed in a logical form. This will permit them to use databases of commonsense facts about the world. Very ambitious data mining projects might hope to make programs that will come up with entirely new phenomena.
Here are some phenomena and facts relevant to the supermarket domain together with logical expressions for some of these facts. We give just two example formulas, and these are not part of a worked out scheme for constructing a knowledge base.
People. There are the shoppers and family members. The data may not identify them directly, but learning about them is the point of data mining.
Ownership and purchases. People buy things and then own them and keep them somewhere. Maybe the facts about where people keep things are not relevant for most data mining. The distinction between durable goods and consumables is important. Here’s a situation calculus expression of two facts:
Possessions. Freezers, refrigerators, cars, and microwave ovens are items that some customers will have and others won’t. Having certain possessions affects behavior.
Events. The observed events are purchases in the stores for which we have databases. Unobserved are the trips to the store and the cooking and eating and the inspections of the larder. Maybe these can be discriminated usefully, but maybe they should be lumped into consumption. Other unobserved events include purchases from the competitors. When a person purchases a freezer, his status changes to that of a freezer owner and that fact will persist. The event of acquiring a freezer is more common than of giving up the possession of a freezer.
Preferences. People have preferences among states of affairs, more specifically, among objects.
Distributions of properties over people. The customers have age, sex, income, and ethnic distributions.
Customers appear and disappear. There are causes for the appearance and disappearance of customers, and supermarket chains will be interested in finding them. These include moving in or out of the area, changes in family circumstances, advertising campaigns by the chain or its competitors, changes in the store or its hours of operation, satisfaction or dissatisfaction with goods, prices, or service.
The present state of AI is not up to formulating a full commonsense database, but full commonsense knowledge is not necessary. We can expect to do a lot with very limited knowledge. A sophisticated data mining system might be able to use the following facts in its formulation of hypotheses. An ambitious logic-based system might use logical expressions of the facts. Less ambitiously, programmers would use them in designing data mining systems.
- People persist in time. People want objects. People consume objects and want more. Some objects are permanent on the relevant time scale.
- Objects are created, appear in stores, sold to customers (people) who use them up and need more.
- There are kinds of people and kinds of objects.
- People have attributes, and these attributes change, although some are permanent.
- People buy objects with money. This uses up money and people do not buy at a rate much higher than they get more money.
- There is an is-a hierarchy of items and an is-a hierarchy of people. We suppose these are spelled out in some literature.
- There is an is-a hierarchy of food.
- Although it is tempting to organize facts into is-a hierarchies, this is not always possible or appropriate. More complicated predicates and functions and logical assertions are sometimes needed to express the facts.
- People are associated into families. Purchases are made for a family.
- When food items are purchased, some go into pantries, some into refrigerators, some into freezers, and some are eaten right away. When a food object is eaten, it is removed from where it was stored.
- There are bounds on the rate at which people eat. What they don’t get from one store they get from another.
- A person has an age that increases with time. Very young people are children.
- There are lots of people and lots of stores. The data miner will have information about only some of them.
- Customers who buy substantial quantities of frozen or freezable goods have freezers.
- Owners of microwave ovens can be identified.
- Consistent purchase of the most expensive items indicates prosperity. It can be asked whether consistent purchase of expensive items is all the data miner wants to know anyway. I don’t know about that.
- Suppose a customer comes rarely and always buys frozen spinach in bags and a few other items. Inference: The store where he buys most of his food doesn’t sell frozen spinach in bags.
The point is these facts are a priori and may be used to infer phenomena. We suppose only some phenomena need be taken into account. For this phenomenal mining we ignore birth and death, physical motion, and shape. Mass is taken into account only in connection with quantities purchased and rates of consumption.
It is clear a very large number of facts are relevant to getting information out of databases of customer purchases. These include general facts of common sense and specific facts about consumer properties, consumer goods and consumer behavior. I see no alternative to a big project like CyC, as described by Lenat and Guha, for putting facts into a knowledge base by hand. However, even a small knowledge base may be useful and adequate for experiments. Perhaps substantial parts of the existing CyC knowledge base may provide part of the information needed for supermarket phenomenal data mining.
Grouping Supermarket Purchases by Customer
We propose programs to determine from the cash register receipts which baskets were purchased by the same customer. The putative customers can then be given identifiers. Programs can infer more facts about customer characteristics and behavior with facts about purchases of an identified customer over time than could be inferred from mere statistics about the baskets themselves.3
This example of phenomenal data mining is straightforward in that it is reasonably clear what a successful result would be and how it might be used. We hope to make it plausible that enough information is present in the data to usefully distinguish customers. However, experimentation is needed to verify that feasible algorithms exist.
Demographic information about customers is known to be useful, for example, their ages, occupations, sexes, and incomes. When this information is supplied, for example, in mail order situations where credit is granted, it is extensively used. However, in our supermarket chain example, that information is not in the database of transactions. Let us consider inferring it; it might then be used in any of the presently conventional ways.
There are several approaches to associating baskets purchased by the same customer. One approach involves defining a notion of total anomaly of an assignment of baskets to customers and preferring assignments that minimize total anomaly. The total anomaly is a sum of terms, one for each putative customer, plus some global terms. For example, including a purchase of baby food in a basket assigned to a customer whose other baskets do not include baby food should contribute to the anomaly term for that customer.
One way of looking at minimizing anomaly of assignments is that we want to explain as much of the purchasing behavior as possible by allowable characterizations of the customers.
We regard the notions of minimizing anomaly in the space of assignments as a guiding theoretical idea. Programs may find complete assignments, but they are unlikely to do it by comparing a large number of alternative complete assignments. Instead, they are likely to do hill climbing in the space of partial assignments.4
The Customer as a Stochastic Process
The method discussed previously groups purchases by customer. However, the specific purchases made by the customer are of interest only insofar as they enable prediction of the customer’s future behavior and how he or she might respond to things the store might do, for example, advertisements, sales, changes in products offered, changes in prices.
In general, we might regard the customer as a stochastic process, that is, what the customer will buy (and whether he or she will come to the store at all), depends probabilistically on the state of the customer’s larder, and the actions of the store.
A regular customer may visit the store once per week for five years; that’s 250 visits. Some may make as many as 1,000 visits. Nevertheless, there often won’t be enough information to make a very sophisticated model of a customer. Therefore, simplified models are worth considering.
The simplest model is that customer c has probability p(c,i) of buying item i. The matrix ||p(c,i)|| is likely to be approximable by a matrix of much lower rank, that is, the customers form a space of lower dimension. If this is true, customers can be approximately characterized by a much smaller number of parameters than the number needed for a complete probability distribution. This, in turn, means that accurate information about the customers can be obtained with smaller samples that would otherwise be required. If the assumption of independence of the members of the signature is valid, it still takes a great deal of information to characterize the customer.
The next more elaborate model might take into account the state of the customer’s larder. The customer won’t buy more of certain items until he or she has consumed what has been bought. If we regard the customer’s state as given by the contents of his larder, we can regard the purchases as determined by a Markov process.
The model might be further elaborated to take into account the customer’s probable response to sales, among others. Economists would be tempted to try to ascribe a demand curve, most likely just two numbers—the demand at a base price and an elasticity.
We will not pursue these elaborations further in this article, but it seems likely the most useful information to supermarket companies doing data mining will involve the probabilities of response of different kinds of customers to different stimuli.
Proposed Experiments
Grouping supermarket purchases by customer can be tested with the aid of a supermarket database that does contain customer identification. We discard the customer identification, run our grouping algorithm, and compare the results with the genuine customer data.
My present opinion is that grouping baskets by customers is likely to be a difficult but feasible task. As will be seen, it will involve taking advantage of special features of the behavior of supermarket customers via a priori commonsense knowledge. In this respect, it may resemble cryptanalysis that often takes advantage of special features of the behavior of senders of messages. Moreover, the results cannot be perfect in terms of identifying the purchasers, but the uncertainties about who bought what may not affect the interesting statistics of customer behavior.
Here are some ideas about how to proceed:
- It may be best to start the experiments with a relatively small store. That way there will be fewer assignments to try and fewer similar signatures.
- We should probably start with a date in the middle of the operation of the system and try to extend identifications both forward and backward in time.
- At any time in the computation, there will be a certain collection of putative customers and a set of possible assignments of some of the baskets to customers. Maybe the computational resources will be adequate to deal with hundreds to thousands of possible assignments. Each of these assignments will have an anomaly computed on the basis of what has been assigned so far.
- Since many people shop on a weekly basis, it may be worthwhile to try to find some putative customers who buy on a particular day of the week.
- It may be possible to find some signatures for some customers that are repeated every week. For example, a shopper may buy both whole milk and skim milk every time because of the needs of different family members.
- The algorithm may grow assignments forward and backward in time. As it goes it will eliminate certain assignments.
- When it cannot decide among the assignments over some lengthy period, say two months, it should probably just pick one in order to keep down the number of open choices.
- Perhaps there will be a compact way of keeping certain choices open in order to use long-term aspects of the signature.
Join the Discussion (0)
Become a Member or Sign In to Post a Comment