Matching Records in a National Medical Patient Index

Examining the benefits and pitfalls of a distributed collection of medical records.

While it is relatively simple to obtain a computerized credit history for anyone who has had a credit card, obtaining a computerized medical history remains difficult if not impossible. A computerized medical history offers many potential benefits, including the elimination of duplicate tests and procedures, better access to patient histories in emergencies, elimination of reentry of historical data, and greater accuracy. To make all patients’ medical records in the U.S. accessible to care providers, electronic medical records need to be linked together using a massively distributed Master Patient Index (MPI). An MPI is a facility that correlates and cross-references patient identifiers and performs a matching function with high accuracy in an unattended mode. The need to combine records from different organizations arises for many reasons, including patients moving or changing health care providers. A massively distributed MPI will provide links to encounter data residing on provider databases and provide a patient summary listing the links to encounter documents distributed all over the network. This summary can be loosely considered the patient’s home page, or a table of contents for the patient’s massively distributed health care record. The summary page will include the most important data from past encounters, such as prescriptions, diagnoses, and the facility or office visited.

The National Institute of Standards and Technology has funded research in this area to help accelerate the development of a massively distributed MPI for the U.S. (NIST ATP Grant 97-03). Research into such an MPI has focused on three technical hurdles: developing a distributed database management system; getting data into the MPI from fragmented sources on disparate computer systems; and resolving patient identities. This article concerns the development of a system for patient identification resolution for use with the MPI. It presents a description of the technical problem, a list of potential technologies that could be used to address it, a brief discussion of existing matching algorithms, and a potential solution using a hybrid of several technologies.


The Matching Problem

Matching must take place in two instances. The first is when a new record is entered into the system: the record needs to be correctly linked to any existing patient data, so automatic matching is necessary at insertion time. An effective matching algorithm must be able to handle comparison of records within a database of several hundred million records. The second instance is retrieval of patient data in response to a query; query results must be provided to the user without noticeable delay (within two seconds). An effective matching algorithm should also properly match fields whose values are equivalent without being identical, such as two-letter state designations or the use of St. for Street. The amount, type, and quality of the available data ultimately determine the maximal accuracy of a matching algorithm.

A typical patient record has fields that can be parsed from encounter data, likely including first name, last name, address, telephone, social security number, and date of birth. These fields are usually limited to data available from electronic input such as Health Care Financing Administration forms, laboratory results, or pharmacy reports. The similarity of entries in these fields is the basis of the decision to merge records. Medical patient records have many characteristics that can lead to errors in matching. Input errors are common, including phonetic errors, incorrect data entry (added spaces, missing spaces, invalid characters), and reversal of first and last names (especially in Asian names). Fields are sometimes misused, and addresses either run on or are broken at inappropriate places. Nicknames, abbreviations, encoded information (such as 1=married, 2=single), and the use of Jr., Sr., III, and hyphenated names may lead to mismatches. Valid changes of address or marital status, in addition to name changes, can cause matches to be missed. Fraud and missing data are also potential problems.


Potential Matching Methods

A variety of algorithmic methods can be applied to the matching problem. String comparison methods compare individual letters to determine matching fields; an approximate string match is weighted less than an exact match. Several authors [2, 5] suggest that the number of additions, deletions, replacements, and transpositions between two words or names can characterize their closeness. The Hamming distance [7] is the number of positions with different characters between two words of the same length. Another technique, the Damerau-Levenshtein distance [6], measures the number of edits required to turn one string into the other (the strings need not be the same length). The Shift-Or algorithm [7] computes an array of ones and zeros that maps matches and mismatches between letters. Bigrams or trigrams [6], the consecutive two- and three-letter combinations within a value, can also measure closeness of match. For example, the word “receive” has the trigrams “rec,” “ece,” “cei,” “eiv,” and “ive.” A count of the bigrams or trigrams that two words do and do not have in common provides a quality of match. Phonetic matching can be achieved by translating each field value into an equivalent phonetic code and comparing the codes.
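As an illustration of these string measures, the following minimal Python sketch computes the Hamming distance and a trigram-overlap score (here normalized as a Dice coefficient, one of several reasonable choices):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def trigrams(word: str) -> set:
    """Consecutive three-letter combinations within a value."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over shared trigrams: 1.0 means identical sets."""
    ta, tb = trigrams(a.lower()), trigrams(b.lower())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

print(sorted(trigrams("receive")))              # ['cei', 'ece', 'eiv', 'ive', 'rec']
print(hamming("beal", "bell"))                  # 1: replace 'a' with 'l'
print(trigram_similarity("receive", "recieve")) # 0.2: only 'rec' is shared
```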

Once a field agreement or disagreement is determined, probabilities are used to weigh that determination. The Fellegi-Sunter [3] approach uses the probability that a field agrees given that the record pair being examined is a matched pair (the m probability), and the probability that a field agrees given that the pair is unmatched (the u probability). The values of u are calculated from the frequency of field entries within the database, so u must be recalculated after new data, or segments of new data, are entered. The value of m is approximated by estimating the rate of error in each field. The weight of a field agreement is log2(m/u) and the weight of a field disagreement is log2((1-m)/(1-u)); the composite weight for a record comparison is the sum of the individual field weights. The greater the composite weight, the greater the probability that the records refer to the same patient. This probabilistic method of record linkage has been under development for about 30 years; the original pioneers in the field are Fellegi and Sunter [3] and Newcombe [1]. The method requires knowledge of the distribution of data in the existing database: training data is used to develop statistics on the incidence of expected matches and mismatches, and the results depend on the m value, which is not a known quantity and must be estimated for each field.
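A minimal sketch of the Fellegi-Sunter weighting follows. The m and u values shown are invented for illustration; in practice u is derived from field-value frequencies in the database and m from estimated error rates:

```python
from math import log2

# Illustrative m (P[field agrees | records match]) and u
# (P[field agrees | records do not match]) probabilities.
FIELDS = {
    #             m      u
    "last_name":  (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "dob":        (0.97, 0.001),
}

def field_weight(m: float, u: float, agrees: bool) -> float:
    """log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement."""
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

def composite_weight(agreement: dict) -> float:
    """Sum of per-field weights; higher means more likely a match."""
    return sum(field_weight(*FIELDS[f], agrees)
               for f, agrees in agreement.items())

# Names agree, date of birth disagrees: the agreements still dominate.
print(composite_weight({"last_name": True, "first_name": True, "dob": False}))
```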

Cases can be fed into a neural network so that a weighted system of neurons learns to separate matching patient records from nonmatching ones. A series of training cases would be developed to train the network, which would then be tested on cases with known solutions. A neural network can potentially discover nonobvious matching rules, but risks case-specific applicability or overly wide generalization. One benefit is that no conversion of characters to numerical values with true distance meaning is required. The network can usually derive rule-based knowledge from training instances, and its quality is largely determined by the quality of the training data; a set of real data that has been correctly evaluated for matches and mismatches has not yet been located.
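The sketch below frames the idea with a single logistic unit trained on per-field agreement scores; the four training pairs are purely illustrative stand-ins for the verified data described above:

```python
from math import exp
import random

# Toy training pairs: per-field agreement scores (last name, first name,
# date of birth, SSN) with a human-verified label (1 = same patient).
TRAIN = [
    ([1.0, 1.0, 1.0, 1.0], 1),
    ([0.9, 1.0, 1.0, 0.0], 1),   # SSN typo, everything else agrees
    ([0.2, 0.1, 0.0, 0.0], 0),
    ([1.0, 0.0, 0.0, 0.0], 0),   # common surname, nothing else agrees
]

def predict(w, b, x):
    """A single logistic unit -- a one-neuron 'network' for illustration."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + exp(-z))

def train(epochs=2000, lr=0.5):
    random.seed(0)
    w = [random.uniform(-0.1, 0.1) for _ in range(4)]
    b = 0.0
    for _ in range(epochs):
        for x, y in TRAIN:
            g = predict(w, b, x) - y          # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

w, b = train()
print(predict(w, b, [0.95, 1.0, 1.0, 0.0]))   # close to 1: likely a match
```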

Signal processing could be applied if the characters are converted into numerical values. Each character or symbol can be assigned a digital (or analog) level related to the likeness of characters (keyboard location, similar sound, or similar physical shape). For example, the letter “A” may be assigned the number one, the letter “B” a two, and so on; roughly 50 levels would cover numbers, letters, and some symbols. The levels then form an analog signal sampled over time: the level of the signal (the analogue of voltage) encodes the character, and its position (the analogue of time) encodes placement. Individual or combined fields can then be correlated with the same fields of other records, and the level of correlation corresponds to the likeness of the records. If the fields are separated and correlated individually, a combined weighted correlation function can be computed. The new “signals” can be shifted in “time” and correlated; variance among matching records resembles signals with added noise. Signals can be converted to the “frequency domain” by the fast Fourier transform, or to the wavelet domain, if working with signals in a different domain adds computational speed.
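A rough sketch of the correlation idea, assuming a naive a-to-z level assignment in place of a similarity-aware one, and using NumPy's cross-correlation:

```python
import numpy as np

def to_signal(text: str) -> np.ndarray:
    """Map characters to levels; here simply a..z -> 1..26."""
    return np.array([ord(c) - ord('a') + 1
                     for c in text.lower() if c.isalpha()], dtype=float)

def correlation_peak(a: str, b: str) -> float:
    """Peak of the normalized cross-correlation of two field 'signals'."""
    sa, sb = to_signal(a), to_signal(b)
    sa = (sa - sa.mean()) / (sa.std() or 1)
    sb = (sb - sb.mean()) / (sb.std() or 1)
    corr = np.correlate(sa, sb, mode="full") / max(len(sa), len(sb))
    return float(corr.max())

print(correlation_peak("jonathan smith", "jonathan smith"))   # ~1.0
print(correlation_peak("jonathan smith", "jonothan smyth"))   # high, below exact
print(correlation_peak("jonathan smith", "quentin vasquez"))  # low
```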

An alternate concept is to merely slide the complete new record against the existing record and track the number of matching character/symbols. An exact match would provide a peak when the records line up. Records with small variations will have lesser peaks with a small distance from the main peak. Smaller peaks with greater distance from the main peak could be given less weight. A record with randomly matching character/symbols would be a nonmatch.
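A minimal sketch of this sliding comparison, counting position-wise character matches at each alignment offset:

```python
def sliding_match(new: str, existing: str) -> list:
    """Position-wise character matches at every alignment offset."""
    counts = []
    for shift in range(-(len(existing) - 1), len(new)):
        counts.append(sum(
            1 for i, ch in enumerate(existing)
            if 0 <= shift + i < len(new) and new[shift + i] == ch
        ))
    return counts

counts = sliding_match("jon smith", "john smith")
print(max(counts))  # tallest peak: 7 of 9 characters line up once 'h' is skipped
```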

Another approach is clustering. Each entry in a field can be plotted in n-dimensional space; the distance between similar entries will be smaller than between dissimilar ones. The entry distances for fields can be weighted to obtain an overall likeness value for records, and a threshold value set to determine whether records merge. This method has the drawback that an entry offset in character/symbol space (for example, an entry with the first character missing) will produce a large distance. A space shift and distance recalculation could be performed to catch such errors.
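A toy version of the distance computation, mapping entries to fixed-width vectors of character codes (an assumed embedding), illustrates both the method and the offset drawback:

```python
def to_vector(entry: str, width: int = 12) -> list:
    """Fixed-width vector of character codes, zero-padded or truncated."""
    codes = [ord(c) for c in entry.lower()[:width]]
    return codes + [0] * (width - len(codes))

def euclidean(a: str, b: str) -> float:
    return sum((x - y) ** 2
               for x, y in zip(to_vector(a), to_vector(b))) ** 0.5

print(euclidean("bell", "ball"))   # 4.0: one nearby substitution, small distance
print(euclidean("bell", "ebell"))  # large: the offset shifts every position
```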

A rule-based approach would involve the use of heuristics associated with common errors based on similarity of letters, word sounds, letter proximity on keyboard, importance of name versus address, nicknames, abbreviations, date of encounter relation to age of patient, zip code relation to location, and so forth. Such an effort would incorporate the expertise involved in making a record comparison. An expert comparator would be able to differentiate between distinguishing features and probable errors.

Fuzzy logic would provide for such options as “mostly matches” or “almost doesn’t match” and other conditional expressions. Every record would belong, to varying degrees, to both the set of matching records and the set of mismatching records, and each field would likewise have degrees of membership in the sets of matching and mismatching fields. Rules could be written such as “if the last name matches poorly and the first name matches well and the address matches well and sex is female then …”. Such rules would make the final determination of a record’s degree of membership in the set of matching records. Other measures of importance in matching, such as the length or commonness of a name, could also be used in the rules.

A matched filter approach would use a filter set to match the new record and apply this filter to existing records in the database. A matching record will be attenuated the least and thus be detectable. This optimal filter is considered to increase the signal-to-noise level by attenuating that which does not match the signal of interest. This method is essentially template matching where each record would define a template to which the new record would be compared.

Table 1 provides a crosswalk of the methods discussed and the types of matching problems. It can be seen that no one method is able to handle all potential causes of matching errors.


Existing Matching Algorithms

A search of existing software for record matching was conducted; Table 2 outlines a comparison of the various commercial packages. The ideal way to compare existing matching algorithms would be to use a test database and test cases to measure type I and type II errors and speed of operation. However, an accepted set of test cases does not exist, nor have all the manufacturers agreed to participate in such a test. Thus, the best that can be done at this time is to compare the algorithms on a theoretical basis.

The U.S. Bureau of the Census has an algorithm based on Fellegi-Sunter that works only with simple flat databases. Language Analysis Systems occupies a unique position in that it has the only ethnic-detection algorithm, but it focuses solely on names. Search Software America offers a multistep process with the ability to add heuristic rules, and does not use Fellegi-Sunter.

Most matching software includes string-matching algorithms, although they differ and are often proprietary. Different companies use blocking, phonetic tokenization, and fuzzy keys to partition the database. Of the many fields of interest, about four are name fields; care should be taken to analyze the other fields properly as well. Fellegi-Sunter is beneficial because it applies knowledge of the database entries and estimates of the validity of data in each field. None of the existing software offers a complete solution that uses all the available technology.


An effective matching algorithm must be able to handle comparison of records within a database of several hundred million records.



Sample Approach

Here, we describe a sequence of steps that could be used to determine matching and mismatching records, also outlined in the figure. Each step analyzes the problem at a different level, applying knowledge of word structure, culture, human behavior, and the commonality of names. Some parts of the analysis can be obtained from commercial software vendors. The problem is truly complex, and each added piece of knowledge might improve the accuracy of the matching routine. Unfortunately, algorithms are based on patterns across records and may not apply to any specific record. Any algorithm used must emphasize methods that optimize correct matching for the majority of records and provide rules for less common instances without degrading performance on the majority.

Parsing. The first step in the process is the parsing of the fields from the record. The fields that are expected are: last name, first name, middle initial, social security number, address, city, state, zip code, date of birth, sex, maiden name, place of birth, and marital status.
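A minimal parsing sketch appears below; the pipe-delimited layout is a hypothetical stand-in for whatever format the encounter feed actually uses (fields extracted from HCFA forms, lab results, or pharmacy reports):

```python
FIELD_NAMES = [
    "last_name", "first_name", "middle_initial", "ssn", "address",
    "city", "state", "zip", "dob", "sex", "maiden_name",
    "place_of_birth", "marital_status",
]

def parse_record(line: str) -> dict:
    """Split a delimited encounter line into named fields."""
    values = [v.strip() for v in line.split("|")]
    return dict(zip(FIELD_NAMES, values))

rec = parse_record("Smith|William|Q|123-45-6789|12 Main St|Springfield|"
                   "IL|62701|1950-03-14|M||Springfield IL|2")
print(rec["first_name"])  # 'William'
```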

Ethnic origin. Next, detection of the ethnic origin of names provides insight into the appropriate use of matching routines later in the process [4]. The cultural origin of a name can be analyzed using look-up tables and letter analysis. Asian names often have surname and given name reversed, name equivalents differ between countries, and phonetic matching must be modified to reflect how letter combinations are pronounced in different countries.

Equivalents. A table of equivalent names and abbreviations can be used to help avoid missing matching records. The first name can be looked up in a table to determine comparable names (for example, Bill for William). The first name field may then become a list of possible matching values.
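A sketch of such a lookup, with an illustrative fragment of an equivalence table (a production table would be far larger):

```python
NICKNAMES = {
    "william":  {"bill", "billy", "will", "willy", "liam"},
    "robert":   {"bob", "bobby", "rob", "robbie", "bert"},
    "margaret": {"maggie", "meg", "peggy", "marge"},
}

def name_candidates(first_name: str) -> set:
    """Expand a first name into the set of values it may match."""
    name = first_name.lower()
    out = {name}
    for formal, nicks in NICKNAMES.items():
        if name == formal or name in nicks:
            out |= {formal} | nicks
    return out

print(name_candidates("Bill"))  # includes 'william', 'will', 'billy', ...
```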

Invalid entries. Fields expecting letters such as first name or last name can be checked for presence of digits to bring errors to notice; fields expecting numbers such as zip code or social security number can also be checked. It may be possible to detect a mismatch between zip code and address with postal information.
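A few plausibility checks of this kind might look like the following sketch; the patterns shown are assumptions about well-formed U.S. values, and the checks flag entries for review rather than deciding matches on their own:

```python
import re

def validation_flags(record: dict) -> list:
    """Return a list of problems detected in a parsed record."""
    problems = []
    if re.search(r"\d", record.get("last_name", "")):
        problems.append("digits in last name")
    if not re.fullmatch(r"\d{5}(-\d{4})?", record.get("zip", "")):
        problems.append("malformed zip code")
    if not re.fullmatch(r"\d{3}-?\d{2}-?\d{4}", record.get("ssn", "")):
        problems.append("malformed social security number")
    return problems

print(validation_flags({"last_name": "Sm1th", "zip": "627",
                        "ssn": "123-45-6789"}))
# ['digits in last name', 'malformed zip code']
```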

Partitioning. When a search for matching records is made, blocking limits the number of record comparisons by subcategorizing records by specific field values. Blocking can be based on SOUNDEX [6] or New York State Identification and Intelligence System (NYSIIS) [4] matches for a field. For example, record comparisons may be limited to those that have a last name of “Bell” or “Belle” based on the similarity in sound. Blocking can cause a matching patient record to be missed if the match resides in a blocked-off section; therefore, multiple passes, each blocking on a different field, can be used to minimize missed matches between patient records.
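The sketch below shows the mechanics of blocking with a deliberately simple key (first letter of the last name plus birth year); a phonetic key such as SOUNDEX would be substituted in practice:

```python
from collections import defaultdict

def block_key(record: dict) -> str:
    """Toy blocking key; a SOUNDEX or NYSIIS code would be used instead."""
    return record["last_name"][:1].upper() + record["dob"][:4]

def build_blocks(records: list) -> dict:
    """Group records so comparisons stay within each block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    return blocks

# A new record is compared only against its own block; a second pass
# blocked on a different field catches matches the first pass walls off.
```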

String comparison. Letter mismatching can be analyzed by string comparison. There are several methods available for string comparisons that were discussed earlier. For example, “Jones” and “Johns” differ by the insertion of “h” and deletion of “e,” which may be less likely than “Beal” and “Bell,” which differ by the replacement of “a” with “l.”
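For instance, the Damerau-Levenshtein distance described earlier can be computed as follows (the restricted, optimal-string-alignment variant):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Minimum insertions, deletions, substitutions, and adjacent
    transpositions needed to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("jones", "johns"))  # 2: insert 'h', delete 'e'
print(damerau_levenshtein("beal", "bell"))    # 1: replace 'a' with 'l'
```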

Phonetic matching. Names will be codified into phonetic keys associated with the SOUNDEX, PHONIX [6], or NYSIIS code for the name. Words that sound alike will often have the same code. For example, Smith, Smyth, and Smythe have the same code in SOUNDEX and PHONIX, whereas Jones and Zambrowski would not.
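A compact implementation of the classic SOUNDEX code, sufficient to reproduce the examples above:

```python
def soundex(name: str) -> str:
    """Classic four-character SOUNDEX code (letter plus three digits)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0].upper(), codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":          # h and w do not reset the previous code
            prev = code
    return (out + "000")[:4]

print(soundex("Smith"), soundex("Smyth"), soundex("Smythe"))  # S530 S530 S530
print(soundex("Jones"), soundex("Zambrowski"))                # J520 Z516
```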

Record match determination. At this point each field will have a quality of match value. The individual field quality of matches will be combined into a composite score for each record. For example, the composite quality of match for last name could be calculated as QM_LN = W1*String_QM_LN + W2*Phonetic_QM_LN, where W1 and W2 are relative weights associated with the influence of quality of match for string matching and phonetic matching. This value will be used to determine the overall probability of record matching. There are several methods available for determination of record matching from the quality of match of the individual fields. Fuzzy logic has the positive attribute of being able to develop rules based on heuristics about how errors may occur due to a person moving or getting married.
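The composite calculation itself is a simple weighted sum; the weights below are illustrative, not tuned values:

```python
def composite_qm(string_qm: float, phonetic_qm: float,
                 w1: float = 0.6, w2: float = 0.4) -> float:
    """Weighted composite quality of match for one field:
    QM = W1*String_QM + W2*Phonetic_QM."""
    return w1 * string_qm + w2 * phonetic_qm

# Last-name field: strong string similarity, exact phonetic agreement.
print(composite_qm(string_qm=0.8, phonetic_qm=1.0))  # 0.88
```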

Fuzzy sets (matches, matches well, matches fairly well, matches somewhat, matches poorly, mismatch) rule examples include:

  • If the first name matches well, the last name is a mismatch, the subject is female, the date of birth matches, the social security number matches, the maiden name matches well, the place of birth matches well, the marital status of the old record is single or divorced, and the marital status of the new record is married, then the records probably match [Marriage].
  • If the social security number matches and nothing else matches, then the records are probably a mismatch [Fraud], possibly producing an on-screen alert.

For example, a quality of match value of 0.95 (on a scale of 0 to 1) may be in the fuzzy set “matches well.” Thus each field can be used to determine whether a fuzzy rule is executed, and the fuzzy rules can modify the record match probability. If no fuzzy rules are executed, a default overall quality of match can be determined from a weighted composite of the field quality of match values. Probabilistic means have the positive attribute that the relative value of a field match can be modified based on the occurrence of that field value in the database, weighting the influence of field matches by the characteristics of the database. The Fellegi-Sunter routine could also be used to adjust the weighting of the individual fields in the record match probability.
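The sketch below shows how such a rule might be evaluated; the membership ramp and the structure of the [Marriage] rule are illustrative assumptions, not tuned values:

```python
def matches_well(qm: float) -> float:
    """Degree of membership in 'matches well' (ramp from 0.80 to 0.95)."""
    return max(0.0, min(1.0, (qm - 0.80) / 0.15))

def marriage_rule(fields: dict, old: dict, new: dict) -> float:
    """Fires when a female's last name changed but other evidence agrees."""
    if not (new["sex"] == "F"
            and old["marital_status"] in ("single", "divorced")
            and new["marital_status"] == "married"):
        return 0.0
    # Fuzzy AND taken as the minimum of the antecedent memberships.
    return min(matches_well(fields["first_name"]),
               matches_well(fields["dob"]),
               matches_well(fields["maiden_name"]),
               1.0 - matches_well(fields["last_name"]))

degree = marriage_rule(
    {"first_name": 0.97, "dob": 1.0, "maiden_name": 0.95, "last_name": 0.1},
    old={"marital_status": "single"},
    new={"sex": "F", "marital_status": "married"},
)
print(degree)  # high: boost the match probability despite the name change
```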

Final classification. In the case of a user query, the entries in the query fields are compared to an existing database and potential matches will be provided to the user of the system. A probability associated with the validity of the match will be provided to the user. The list will be limited by a threshold value and the user will have the ultimate decision regarding a match. In the case of a new record entry, the record will be compared to the database for near identical matches. The threshold for allowing a match may be set high to prevent incorrectly matching two different records.


Figures

Figure. Matching Algorithm.


Tables

Table 1. Comparison of matching techniques.

Table 2. Existing matching software.

References

    1. Alvey, W. and Jamerson, B. Record linkage techniques. In 1997 Proceedings of an International Workshop and Exposition (Mar. 20–21, 1997, Arlington, VA), Federal Committee on Statistical Methodology, Washington, D.C.

    2. Damerau, F.J. and Mays, E. An examination of undetected typing errors. Information Processing and Management 25, 6 (1989), 659–664.

    3. Fellegi, I. and Sunter, A. A theory for record linkage. Journal of the American Statistical Association 64, 328 (1969), 1183–1280.

    4. Hermansen, J.C. Automatic Name Searching in Large Data Bases of International Names. University of Michigan Dissertation Information Services, Bell and Howell, Ann Arbor, MI, 1985.

    5. Peterson, J.L. A note on undetected typing errors. Commun. ACM 29, 7 (July 1986), 633–637.

    6. Pfeifer, U., Poersch, T., and Fuhr, N. Retrieval effectiveness of proper name search methods. Information Processing and Management 32, 6 (1996), 667–679.

    7. Tucker, A.B. The Computer Science and Engineering Handbook. CRC Press, New York, 1996.

    This work was supported by the National Institute of Standards and Technology under Advanced Technology Grant 97-03.
