The importance of data mining

Information often lies buried in unstructured text. Data mining can open up this seam and derive valuable business intelligence from it.

By Raoul Jetley

It is estimated that up to 80% of all information held by organizations is stored in an unstructured text format. This information includes customer requirements, sales dossiers, technical specifications, maintenance reports, and stakeholder feedback.

It is difficult to extract business intelligence from such disparate data using traditional data analysis methods. Text-based data mining, or text mining, is used instead.

Simply put, text mining is the set of processes required to transform unstructured text documents or resources into meaningful, structured information.

The structured information can then be used to automatically discover hidden patterns and predict future outcomes using a combination of statistical, linguistic, and pattern-recognition techniques.

Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics.

These techniques are used to discover and present knowledge—facts, business rules, and relationships—that is otherwise locked in textual form, impenetrable to automated processing.

A typical text mining process includes the following steps:

• Identify and preprocess the text to be mined. This step involves cleaning up the text to remove unnecessary information, splitting it into individual tokens (i.e., smaller components such as words), and identifying parts of speech based on the grammar of the language used.

• Extract relevant information and transform it into structured data. Information is retrieved by searching through the tokenized text and storing the results in a more structured, organized manner that is amenable to further analyses.

• Select important features to build concept and category models. The number of concepts present in unstructured data is typically very large. The key to this step is to identify the most relevant features and use these to build meaningful models based on data categories and relationships.

• Analyze the structured data to discover relationships between the concepts. At this point, the text mining process merges with the traditional data mining process. Classic data mining techniques, such as clustering, prediction, and classification, can be used on the structured data resulting from the previous steps.
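
The four steps above can be illustrated with a minimal Python sketch that uses only the standard library. Everything in it is an assumption made for illustration: the sample documents, the stop-word list, the function names, and the choice of cosine similarity as a simple stand-in for clustering. It also skips part-of-speech tagging, which in practice would need a dedicated language-processing library.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample documents, invented for illustration only.
DOCS = [
    "Pump failed again; bearing replaced during maintenance.",
    "Customer reports the pump is noisy and the bearing overheats.",
    "Customer is happy with the new invoice layout.",
]

# A tiny, hand-picked stop-word list (an assumption; real lists are longer).
STOPWORDS = {"the", "is", "and", "with", "again", "during", "new"}


def preprocess(text):
    """Step 1: lowercase the text, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def extract(docs):
    """Step 2: store each document as a structured term -> count record."""
    return [Counter(preprocess(doc)) for doc in docs]


def select_features(records, min_docs=2):
    """Step 3: keep terms that appear in at least min_docs documents."""
    doc_freq = Counter(term for record in records for term in record)
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0


def most_similar_pair(vectors):
    """Step 4: a stand-in for clustering, returning the closest document pair."""
    pairs = combinations(range(len(vectors)), 2)
    return max(pairs, key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))


records = extract(DOCS)
features = select_features(records)
vectors = [[record[f] for f in features] for record in records]
closest = most_similar_pair(vectors)
print(features)
print(closest)
```

With these sample documents, the two pump-related reports come out as the closest pair. A production pipeline would likely replace the document-frequency filter and the pairwise comparison with weighted features and a proper clustering or classification algorithm.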

Common applications resulting from these analyses include recognition of named entities, automatic summarization, categorization based on relevant features, and mining for customer sentiments and opinions expressed within the text.

About the author: Raoul Jetley is Senior Principal Scientist at ABB Corporate Research, Bangalore, India. He can be reached at: raoul.jetley@in.abb.com