Analyzing Medical Data

A patient cohort network in Søren Brunak et al.'s Computational Biology paper: Nodes represent patients, edges are correlations between patients, and node color denotes cluster membership.

One of the technological ironies in health care is the disconnect between the advanced state of clinical technology, such as nonconfining open imaging systems, a variety of smartphone health apps, and surgical robots, and the backward state of electronic patient records.

Until the passage of the Health Information Technology for Economic and Clinical Health Act in the U.S. three years ago, only an estimated 20% of U.S.-based physicians used electronic patient records. That percentage is rapidly increasing due to the law’s financial incentives, but the new attention is also awakening researchers to the limitations of the structured data that is often exchanged between physicians, insurance companies, and organizations interested in compiling and repurposing discrete patient records to conduct population-based medical research.

Søren Brunak, director of the Center for Biological Sequence Analysis at the Technical University of Denmark, is one of the researchers using natural language processing (NLP) technology to mine not only structured data, such as standardized disease codes, but also free text.

In a recent paper, “Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts,” published in Computational Biology, Brunak and his colleagues showed that combining free-text analysis with structured disease definition codes can help researchers discover not only unexpected connections between diseases, such as a link between migraine headaches and alopecia (hair loss), but also the absence of expected comorbidities, such as those between diseases coded as “mental and behavioral disorders” and those coded under the “drug abuse, liver disease, HIV” clusters.
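The idea of pooling structured codes with diagnoses mined from free text, then looking for pairs that co-occur more often than chance would predict, can be illustrated with a minimal sketch. This is not the method used in the paper. The observed-to-expected ratio, the function name `comorbidity_scores`, and the six toy patients are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def comorbidity_scores(patients):
    """Return observed/expected co-occurrence ratios for every code pair.

    `patients` maps a patient id to the set of disease codes found for that
    patient (from structured fields, mined free text, or both).
    """
    n = len(patients)
    single = Counter()
    pair = Counter()
    for codes in patients.values():
        single.update(codes)
        pair.update(combinations(sorted(codes), 2))

    scores = {}
    for (a, b), observed in pair.items():
        # Count we would expect if the two diagnoses were independent.
        expected = single[a] * single[b] / n
        scores[(a, b)] = observed / expected
    return scores


# Toy data, invented for illustration only.
structured = {
    "p1": {"G43"}, "p2": {"G43"}, "p3": {"E11"},
    "p4": {"I10"}, "p5": {"E11"}, "p6": {"E11", "I10"},
}
# Hypothetical diagnoses that appear only in the free-text notes.
from_text = {"p1": {"L63"}, "p2": {"L63"}}

merged = {
    pid: structured.get(pid, set()) | from_text.get(pid, set())
    for pid in structured.keys() | from_text.keys()
}
scores = comorbidity_scores(merged)
```

In this toy cohort the pair G43 (migraine) and L63 (alopecia areata) scores about 3, meaning it occurs three times as often as independence would predict, yet the pair is invisible if only the structured codes are analyzed. A real study would also need significance testing and correction for the number of pairs examined.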

“Not only our paper, but also work from many other groups, show that text mining and mining of unstructured data can lead to very interesting results,” says Brunak, who laments the condition of many patient records. “My priority is that the patient record becomes as information-rich as possible. That maximizes the data mining opportunities, and the success of Google and other technologies that successfully analyze unstructured texts are examples and evidence that the old-fashioned ideas that only structure is the way to go forth is wrong.”

A Lack of Flexibility

The medical community has long agreed on the basic data fields for the description of diseases. This schema, the International Classification of Diseases (ICD), uses a hierarchical system to pinpoint specific medical conditions. Two versions of ICD are concurrently in use: ICD-9, which is nearing the end of its useful life, and ICD-10, its successor. ICD-9’s 13,000 diagnostic codes use three to five alphanumeric characters per disease or medical condition but do not distinguish between injuries to the left and right sides of the body, whereas the more granular ICD-10 comprises 68,000 diagnostic codes and includes laterality.
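Because the coding system is hierarchical, a detailed code can be rolled up to a broader category by truncating it. The sketch below assumes ICD-10-style codes whose first three characters name the category. The helper names `icd10_category` and `group_by_category` are hypothetical, and the sample codes are for illustration only.

```python
from collections import defaultdict


def icd10_category(code):
    """Return the three-character category of an ICD-10-style code."""
    return code.replace(".", "").upper()[:3]


def group_by_category(codes):
    """Group detailed codes under their three-character category."""
    groups = defaultdict(list)
    for code in codes:
        groups[icd10_category(code)].append(code)
    return dict(groups)


# Sample codes for illustration; casing varies on purpose.
codes = ["G43.0", "G43.1", "L63.9", "g43.9"]
groups = group_by_category(codes)
```

Rolling codes up in this way trades detail for statistical power: three migraine subtypes become one category with three observations instead of three categories with one each.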

ICD is currently the lingua franca of disease classification upon which many researchers rely to supply a structured data element when analyzing comorbidities. However, leading bioinformatics researchers, including Brunak and Christopher Chute, M.D., professor of bioinformatics at the Mayo Clinic and chairman of the World Health Organization’s steering group responsible for the next ICD iteration, ICD-11, say the terminology is not really suitable for serving as a baseline.

“ICD is not intended to be an exhaustive catalog of clinical concepts that may be encountered or enumerated,” Chute says. “It is a high-level aggregation of diseases, and that is its purpose. It was originally a public health response, and it is also used for reimbursement, but to treat it as a catalog of disease is incorrect.”

Brunak concurs. “There is no doubt ICD-10 is not the best text-mining vocabulary for spotting things in records. It’s very difficult to say something general about which ontology or which system is best because there’s a different signal-to-noise ratio in different types of records, and one type does not fit all.”

Yet without some type of universally accepted thesaurus, Chute says, the prospect for significant advances in the natural language processing of clinical data is uncertain. The ICD-11 working group, he says, has been operating in concert with the International Health Terminology Standards Development Organisation, which oversees the development of the Systematized Nomenclature of Medicine (SNOMED), a hierarchical semantic network of more than 300,000 medical concepts and their relationships. The semantic nature of SNOMED allows for more than seven million relationships descending from the top three hierarchical classifications of “finding,” “disease,” and “procedure.”
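A semantic network of this kind can be pictured as a graph of "is a" links that software walks to decide whether a specific finding falls under a more general concept. The sketch below is a toy model, not SNOMED itself: the concept names, the `IS_A` table, and the functions `ancestors` and `subsumes` are invented for illustration and do not use real SNOMED identifiers.

```python
# Toy "is a" hierarchy; each concept maps to its direct parents.
IS_A = {
    "migraine with aura": ["migraine"],
    "migraine": ["headache disorder"],
    "headache disorder": ["disease"],
    "alopecia areata": ["alopecia"],
    "alopecia": ["disease"],
}


def ancestors(concept, is_a=IS_A):
    """Return every concept reachable by following "is a" links upward."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumes(general, specific, is_a=IS_A):
    """True if `specific` is the same as, or a descendant of, `general`."""
    return general == specific or general in ancestors(specific, is_a)
```

With such a structure, a text-mining system that spots "migraine with aura" in a note can also count it as a migraine and as a headache disorder, which a flat code list cannot do without extra mapping tables.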

Chute says the ICD-11 and SNOMED terms will be harmonized, yielding interoperability between the two dominant clinical data schemas. “That’s what we ultimately need, and even SNOMED in its current incarnation doesn’t fully capture the spectrum of clinical concepts you’d like to catalog for natural language processing.”

S. Trent Rosenbloom, M.D., associate professor of biomedical informatics at Vanderbilt University, says the lack of an overriding priority in developing health-care data formats has made it difficult to align analytical capabilities. A patient record must satisfy clinicians’ needs to describe a course of treatment for their patient, provide documentation for legal safeguards, and serve as a billing document that satisfies insurance companies. In “Data From Clinical Notes: A Perspective on the Tension Between Structure and Flexible Documentation,” which was recently published in the Journal of the American Medical Informatics Association, Rosenbloom and colleagues note that “the flexibility of a computer-based documentation method to allow healthcare providers freedom and ensure accuracy can directly conflict with a desire to produce structured data to support reuse of the information in [electronic health record] systems.”

Rosenbloom’s colleague, Joshua C. Denny, M.D., assistant professor of biomedical informatics at Vanderbilt, says the holy grail of analyzing extremely large volumes of health records to deliver “personalized medicine” to a single patient will depend not only on perfecting NLP capabilities confined within clinical walls, but also on expanding the concept of what belongs in a medical record and who should be authorized to provide data. For example, a 50-year-old man who runs every day may paradoxically have high levels of both good high-density lipoprotein (HDL) cholesterol, which helps to clear the arteries and can be elevated by high amounts of exercise, and bad low-density lipoprotein (LDL) cholesterol, which is a risk factor for coronary disease. Following conventional medical wisdom, the man’s physician may want to prescribe medication to lower the LDL levels without actually knowing if it is necessary, because there is currently no way to pull population-wide data on such a relatively small cohort. Patient-curated data may be able to help discern what, if any, treatment would be appropriate for such patients.

“Studies can’t get that population well,” Denny says. “That sort of patient has an interest in this and is going to document that. Maybe there is detail there that allows us to mine, on the back end, the health status of the small percentage of people who run every day and have this HDL and this LDL level. The number of combinations of variables such as that is huge. You’re never going to get that amount of data in a study. For that you need population-based data.”
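The kind of back-end query Denny describes amounts to filtering a large population on a combination of clinical values and patient-reported habits. The sketch below is a minimal illustration under stated assumptions: the `Record` fields (including the patient-reported `runs_daily` flag), the function `select_cohort`, the sample records, and the threshold values passed in are all hypothetical placeholders, not clinical guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    patient_id: str
    age: int
    runs_daily: bool    # hypothetical patient-reported field
    hdl_mg_dl: float
    ldl_mg_dl: float


def select_cohort(records, min_hdl, min_ldl):
    """Return daily runners whose HDL and LDL both meet the given minimums."""
    return [
        r for r in records
        if r.runs_daily and r.hdl_mg_dl >= min_hdl and r.ldl_mg_dl >= min_ldl
    ]


# Invented records for illustration only.
records = [
    Record("p1", 50, True, 72.0, 168.0),
    Record("p2", 47, False, 70.0, 175.0),
    Record("p3", 55, True, 41.0, 180.0),
    Record("p4", 52, True, 65.0, 162.0),
]
# Placeholder thresholds chosen for the example.
cohort = select_cohort(records, min_hdl=60, min_ldl=160)
```

The filter itself is trivial. The hard part, as the article notes, is getting a population large enough, and records rich enough, for such a narrow cohort to contain more than a handful of people.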

Perfecting NLP Capabilities

While standards groups and policy committees address the issues surrounding format harmonization, as well as the legal issues surrounding patient privacy and who should be able to contribute to health records, bioinformaticists are researching how to perfect NLP capabilities.

One NLP issue, says Denny, is that the majority of natural language processing technology has been created for dictated speech, but electronic health records are often typed.

“Typed documents are more ambiguous because there are more abbreviations and acronyms, and the documents usually contain more misspellings,” says Denny. “You don’t describe things in as much detail, so that hinders a little bit the richness of what can be done with natural language processing. Maybe speech-recognition technology will move forward fast enough that we’ll move toward documents that look like dictated documents, and the problems won’t be as big.”
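A first step many text pipelines take with typed notes is to expand shorthand before any further analysis. The sketch below shows a simple lookup-table approach. The `ABBREVIATIONS` table and the function `expand_abbreviations` are assumptions for illustration, not part of any system mentioned here.

```python
# Small illustrative table; a real lexicon would be far larger.
ABBREVIATIONS = {
    "pt": "patient",
    "hx": "history",
    "htn": "hypertension",
    "c/o": "complains of",
    "sob": "shortness of breath",
}


def expand_abbreviations(note, table=ABBREVIATIONS):
    """Replace known abbreviations, keeping trailing punctuation in place."""
    out = []
    for token in note.split():
        core = token.rstrip(".,;:")
        trailing = token[len(core):]
        # Unknown tokens pass through unchanged.
        out.append(table.get(core.lower(), core) + trailing)
    return " ".join(out)


expanded = expand_abbreviations("Pt c/o SOB, hx of HTN.")
```

A fixed table like this cannot resolve the ambiguity Denny points to: the same abbreviation can mean different things in different specialties, and misspellings will simply fall through. Handling those cases requires context-aware methods.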

William Cohen, research professor of machine learning at Carnegie Mellon University, is currently developing BioNELL, a domain-specific version of the Never Ending Language Learner (NELL) that refines NELL’s general-purpose Web-crawling algorithm for use on published biomedical literature. While BioNELL is focused on published biomedical literature rather than on-the-fly clinical notes, Cohen says such domain-specific, rank-and-learn principles might be useful for organizations’ in-house lexicons while wider standards are being crafted.

“The exciting thing I find about BioNELL is taking the existing structured databases and using them to kick-start an NLP system,” Cohen says. “Symmetrically, you might want to take a natural language corpus and use that to kick-start understanding a sensor database.”
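The general idea of using a structured database to kick-start a text system can be sketched as one round of bootstrapping: known terms from the database reveal the word patterns that surround them, and those patterns then suggest new terms. This sketch is far simpler than NELL or BioNELL and does not reproduce their algorithms. The function `bootstrap_once`, the two-word context window, and the sample sentences are assumptions for illustration.

```python
from collections import Counter


def bootstrap_once(sentences, seeds):
    """Learn two-word left contexts from seed terms, then propose new terms.

    Returns (patterns, candidates), both as Counters.
    """
    tokenized = [s.lower().split() for s in sentences]

    # Step 1: record the two words that precede each known seed term.
    patterns = Counter()
    for words in tokenized:
        for i in range(2, len(words)):
            if words[i] in seeds:
                patterns[(words[i - 2], words[i - 1])] += 1

    # Step 2: any unknown word that follows a learned pattern is a candidate.
    candidates = Counter()
    for words in tokenized:
        for i in range(2, len(words)):
            if (words[i - 2], words[i - 1]) in patterns and words[i] not in seeds:
                candidates[words[i]] += 1
    return patterns, candidates


# Seeds stand in for terms taken from a structured database.
seeds = {"migraine", "alopecia"}
sentences = [
    "patient diagnosed with migraine",
    "she was diagnosed with alopecia",
    "he was diagnosed with psoriasis",
    "treated with ibuprofen",
]
patterns, candidates = bootstrap_once(sentences, seeds)
```

Here the seeds teach the system that "diagnosed with" tends to precede a disease, which surfaces "psoriasis" as a new candidate while leaving "ibuprofen" alone. Repeating such rounds without some way of ranking and filtering candidates tends to let errors accumulate, which is why production systems are considerably more careful.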

Brunak says that while the attention paid to his group’s Computational Biology paper about unlikely comorbidities was interesting, the broader message should be that examining more hitherto machine-unreadable data could alter the practice of medicine.

“Real patients have more than one disease,” says Brunak, “and the patient records give us an opportunity to discover comorbidities and disease correlations—not only those that cooccur but also disease trajectories, that is, those that come before others. The message should be that we can start disease profiles of real patients instead of doing what medicine has done for hundreds of years, studying people disease by disease.”
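The distinction Brunak draws between diseases that co-occur and diseases that come before others can be made concrete by counting ordered pairs of first diagnoses. The sketch below is a minimal illustration, not the group's method. The function `directed_pairs`, the use of first-diagnosis dates, and the toy histories are assumptions.

```python
from collections import Counter
from datetime import date
from itertools import combinations


def directed_pairs(histories):
    """Count patients in whom the first diagnosis of a precedes that of b.

    `histories` maps a patient id to a list of (date, code) events.
    """
    counts = Counter()
    for events in histories.values():
        first_seen = {}
        for when, code in sorted(events):
            first_seen.setdefault(code, when)
        # Codes ordered by the date they were first recorded.
        ordered = sorted(first_seen, key=first_seen.get)
        for a, b in combinations(ordered, 2):
            if first_seen[a] < first_seen[b]:  # skip same-day diagnoses
                counts[(a, b)] += 1
    return counts


# Invented histories for illustration only.
histories = {
    "p1": [(date(2009, 1, 5), "E11"), (date(2010, 3, 2), "I10")],
    "p2": [(date(2008, 6, 1), "E11"), (date(2008, 9, 9), "I10"),
           (date(2011, 2, 1), "E11")],
    "p3": [(date(2010, 1, 1), "I10"), (date(2010, 7, 1), "E11")],
}
counts = directed_pairs(histories)
```

Comparing the count for (a, b) with the count for (b, a) gives a first hint of direction. In this toy data E11 precedes I10 in two patients and follows it in one, far too few to conclude anything, which again points to the need for population-scale records.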

June 2012 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
