The dark data extraction, or knowledge base construction (KBC), problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is to frame traditional extract-transform-load (ETL)-style data management problems as a single large statistical inference task that is declaratively defined by the user. DeepDive leverages the effectiveness and efficiency of statistical inference and machine learning for difficult extraction tasks, while not requiring users to directly write any probabilistic inference algorithms. Instead, domain experts interact with DeepDive by defining features or rules about the domain. DeepDive has been successfully applied to domains such as pharmacogenomics, paleobiology, and anti-human trafficking enforcement, achieving human-caliber quality at machine-caliber scale. We present the applications, abstractions, and techniques used in DeepDive to accelerate the construction of such dark data extraction systems.
The goal of knowledge base construction (KBC) is to populate a structured relational database from unstructured input sources, such as text documents, PDFs, and diagrams. As the amount of available unstructured information has skyrocketed, this task has become a critical component in enabling a wide range of new analysis tasks. For example, analyses of protein-protein interactions for biological, clinical, and pharmacological applications29; online human trafficking activities for law enforcement support; and paleological facts for macroscopic climate studies36 are all predicated on leveraging data from large volumes of text documents. Before it can be used, however, this data must be collected in a structured format, and in most cases doing this extraction by hand is untenable, especially when domain expertise is required. Building an automated KBC system is thus often the key development step in enabling these analysis pipelines.
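To make the KBC goal concrete, the following toy sketch turns unstructured sentences into rows of a relational table. The sentences, the `Interacts(drug, gene)` schema, and the pattern-based extractor are all hypothetical illustrations; real KBC systems such as DeepDive replace such brittle hand-written patterns with statistical inference over many noisy features.

```python
import re

# Hypothetical input documents (illustrative only).
SENTENCES = [
    "Gefitinib inhibits EGFR in lung cancer cells.",
    "Aspirin inhibits COX-1 at low doses.",
]

# Naive hand-written pattern: "<drug> inhibits <gene>".
PATTERN = re.compile(r"^(\w+) inhibits (\S+)")

def extract_interactions(sentences):
    """Populate a toy Interacts(drug, gene) relation from raw text."""
    rows = []
    for s in sentences:
        m = PATTERN.match(s)
        if m:
            rows.append((m.group(1), m.group(2)))
    return rows

print(extract_interactions(SENTENCES))
# → [('Gefitinib', 'EGFR'), ('Aspirin', 'COX-1')]
```

A single regular expression like this misses paraphrases ("EGFR is blocked by gefitinib") and produces false positives, which is precisely why the extraction, cleaning, and integration steps described next benefit from machine learning rather than rule-by-rule engineering.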
The process of populating a structured relational database from unstructured sources has also received renewed interest in the database community through high-profile start-up companies, established companies (for example, IBM's Watson5,15), and a variety of research efforts.9,26,31,41,46 At the same time, the natural language processing and machine learning communities are attacking similar problems.3,12,22 Although different communities place differing emphasis on the extraction, cleaning, and integration phases, all seem to be converging toward a common set of techniques that includes a mix of data processing, machine learning, and engineers-in-the-loop.a