
ACM News

Data Mining Uncovers New Connections Between Health Problems

A network depicting patients' health problems (colored dots) reveals overlapping conditions, including such known connections as diabetes and hypertension. Credit: Courtesy of Technical University of Denmark

Researchers in Denmark are using data mining techniques to uncover connections between health problems as seemingly unrelated as migraines and hair loss.

In addition, the scientists determined that celiac disease, an autoimmune disorder triggered by gluten, is associated with hair loss and migraines, and is also linked to schizophrenia.

Some 800 pairs of health problems turned up more than twice as often as expected by chance, and 93 of those pairs were then flagged by a doctor as being "especially intriguing."
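To make the "more than twice as often as expected by chance" criterion concrete, here is a minimal sketch of one way such a ratio could be computed. The article does not describe the team's statistical method, so the independence-based expected count, the function name `cooccurrence_ratios`, and the toy patient data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(patients):
    """Return {(a, b): observed / expected} for every pair of conditions
    that appears together in at least one patient.

    ASSUMPTION: "expected by chance" is modeled as statistical independence,
    i.e. expected = count(a) * count(b) / number_of_patients.
    """
    n = len(patients)
    single = Counter()
    pair = Counter()
    for conditions in patients:
        unique = sorted(set(conditions))  # sorted so each pair has one key
        single.update(unique)
        pair.update(combinations(unique, 2))

    ratios = {}
    for (a, b), observed in pair.items():
        expected = single[a] * single[b] / n
        ratios[(a, b)] = observed / expected
    return ratios


# Hypothetical toy data, not taken from the study.
patients = [
    {"migraine", "hair loss"},
    {"migraine", "hair loss"},
    {"diabetes"},
    {"diabetes", "hypertension"},
    {"hypertension"},
    {"diabetes"},
    {"depression"},
    {"depression"},
]

ratios = cooccurrence_ratios(patients)
# Keep only pairs seen more than twice as often as independence predicts.
flagged = sorted(p for p, r in ratios.items() if r > 2)
print(flagged)  # [('hair loss', 'migraine')]
```

In this toy set, migraine and hair loss each occur in two of eight patients, so independence predicts 0.5 shared patients; the two actually observed give a ratio of 4. A real analysis would also need a significance test and a correction for the large number of pairs compared.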

In their article published last month, co-author Søren Brunak and his team describe patients’ electronic health records (EHRs) as "an unexplored but potentially rich data source for discovering correlations between diseases."

Brunak is both the director of the Center for Biological Sequence Analysis at the Technical University of Denmark and the head of the Disease Systems Biology Department at the Center for Protein Research, University of Copenhagen. The project is a collaboration between the two institutions.

Data mining techniques were applied to clinicians’ notes within the EHRs, collected over a 10-year period from 5,543 patients at Denmark’s largest psychiatric hospital, to automatically extract clinically relevant terms and map these to approximately 22,000 disease codes in the World Health Organization’s International Classification of Diseases ontology (ICD-10).
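The step of extracting terms from free-text notes and mapping them to disease codes can be sketched roughly as follows. This is not the team's pipeline, which the article does not detail. It is a simplified illustration that uses approximate string matching from Python's standard `difflib` module to tolerate misspellings. The four-entry `TERM_TO_CODE` table, the function name `extract_codes`, the similarity cutoff, and the sample note are assumptions chosen for the example.

```python
import difflib
import re

# Tiny illustrative term-to-code table (ASSUMPTION: a real system would
# cover thousands of terms and their variants).
TERM_TO_CODE = {
    "migraine": "G43",
    "hypertension": "I10",
    "schizophrenia": "F20",
    "celiac disease": "K90.0",
}


def extract_codes(note, cutoff=0.8):
    """Return the set of codes whose terms approximately match text in the note.

    The cutoff of 0.8 is an arbitrary illustrative threshold.
    """
    words = re.findall(r"[a-z]+", note.lower())
    # Candidate spans: single words plus adjacent word pairs,
    # so that two-word terms can be matched as well.
    candidates = words + [" ".join(p) for p in zip(words, words[1:])]

    codes = set()
    terms = list(TERM_TO_CODE)
    for candidate in candidates:
        match = difflib.get_close_matches(candidate, terms, n=1, cutoff=cutoff)
        if match:
            codes.add(TERM_TO_CODE[match[0]])
    return codes


# Hypothetical note containing three misspelled terms.
note = "Pt reports migrane and hypertention; hx of schizofrenia."
print(sorted(extract_codes(note)))  # ['F20', 'G43', 'I10']
```

Fuzzy matching of this kind trades precision for recall: a lower cutoff catches more misspellings but also more false matches, which is why manual benchmarking of the output matters.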

Besides generating new leads about the molecular workings of disease, the approach is also revealing a much richer portrait of each patient.

"Using the text-mining approach, we can produce a much more fine-grained patient characterization, going far beyond the assigned codes," says Brunak. "This aspect also has the implication of potentially improving conventional epidemiology research as the registries typically only contain terms which the doctors put into the structured fields, which is about 10% of what we find."

The team’s biggest challenge involved the medical records themselves, which Brunak described as "typically dirty, full of misspelling and other errors" that are rarely corrected.

"As long as doctors understand what they or others wrote, nobody cares," he says. "In that sense, the EHRs are different from other databases which gradually are cleaned up and error corrected. Our variational dictionary took care of this difficult task, and we demonstrated by extensive, manual benchmarking that the quality of the text mining was very high."

While Brunak and his team continue trying to establish whether certain proteins and genes contain changes that could explain the disease-disease correlations, they have yet to draw major conclusions about the implicated proteins and mechanisms.

 

Paul Hyman was editor-in-chief of several hi-tech publications at CMP Media, including Electronic Buyers’ News.

 


Copyright © 2012 by the ACM. All rights reserved.