2-Day Seminar

Unstructured Data:
Creating the Analytical Environment

Overview
It is estimated that 80% of corporate data is textual: emails, medical records, contracts, safety reports, patents, and a whole host of other forms. For years, managing textual data has meant placing documents in some form of Enterprise Content Management (ECM) system. But analyzing the data held in ECM is a very different matter from placing the documents there in the first place.

With DW 2.0, the idea arose that unstructured data is best placed in a data warehouse, where it can be analyzed alongside the structured data already held there.

This seminar is about the work needed to take textual data out of the confines of documents and integrate it into a data warehouse. It is a very down-to-earth seminar and workshop. The first day is a lecture covering the background material needed to understand the architecture for placing text in an analytical data warehouse environment. The second day is a workshop that shows, step by step, how text is converted into a database that can then be placed into a data warehouse.

There is a big difference between searching text and analyzing text. The seminar brings out this important distinction.

The hardest part of moving text into a data warehouse is integrating the text. Anyone can read a text file and toss the text into a database, but doing so is an exercise in futility: the resulting database cannot be usefully processed by a BI tool. To produce a meaningful result, the analyst must carefully transform the text. Some of the basic issues of transformation include:

  • reading and understanding semi-structured data
  • applying external categories to text
  • creating internal taxonomies of text
  • standardizing dates for BI processing
  • identifying patterned variables
  • identifying named variables
  • resolving homographs, and so forth.
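
Two of the issues listed above, standardizing dates and identifying patterned variables, can be illustrated with a minimal Python sketch. This is not the seminar's own method or tooling: the list of date layouts, the contract-number pattern `CN-YYYY-NNNNN`, and the function names are all hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime

# Hypothetical date layouts that might appear in free text; extend as needed.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y")

def standardize_date(raw):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None

# Hypothetical patterned variable: a contract number such as "CN-2007-00123".
CONTRACT_NO = re.compile(r"\bCN-\d{4}-\d{5}\b")

def find_patterned_variables(text):
    """Return every substring of text that matches the contract-number pattern."""
    return CONTRACT_NO.findall(text)

if __name__ == "__main__":
    print(standardize_date("July 4, 2007"))
    print(find_patterned_variables("See contracts CN-2007-00123 and CN-2008-00001."))
```

Once every date is stored in a single layout, a BI tool can sort and filter on it directly. Note that an ambiguous string such as 04/07/2007 is read here as day/month/year; a real project would have to settle that convention for each source.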

The seminar places special emphasis on the management of corporate contracts and on safety data for oil and gas pipelines and refineries.

Learning Objectives

  • How to recognize the difference between search processing and textual analytics
  • How to create a data warehouse that contains textual data
  • What is required to turn textual data into data that is fit for a data warehouse
  • What internal and external taxonomies are and why they are important
  • What synonym concatenation is and why it is important
  • What homographic resolution is and why it is important
  • What semi-structured data is and how it has to be handled
  • How to integrate the transformed textual data into a data warehouse data model
  • What stop words, stems, and standardized data elements are and how to handle them
  • How to create a foundation that can be analyzed by standard BI tools
  • How to scale the integration process
  • How to read and interpret semi-structured data
  • How to create proximity variables (and why proximity variables are so important)
  • How to integrate textual data with visualization
  • How to distinguish textual discovery from textual analytics
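
As a rough illustration of two of the terms above, stop words and proximity variables, the following Python sketch removes a small set of stop words and then reports where two words of interest occur close together. It assumes a proximity variable means two terms appearing within a few words of each other; the stop-word list, the window size, and the function names are hypothetical and are not taken from the seminar material.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "was", "at"}

def tokenize(text):
    """Lower-case the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def proximity_hits(text, first, second, window=3):
    """Return (i, j) token positions where first and second lie within window tokens.

    Positions and distances are counted after stop words have been removed.
    """
    tokens = tokenize(text)
    firsts = [i for i, t in enumerate(tokens) if t == first]
    seconds = [j for j, t in enumerate(tokens) if t == second]
    return [(i, j) for i in firsts for j in seconds if abs(i - j) <= window]

if __name__ == "__main__":
    report = "A leak was found in the pipeline at the refinery."
    print(tokenize(report))
    print(proximity_hits(report, "leak", "pipeline"))
```

Counting distance after stop-word removal is a design choice: it keeps filler words from pushing related terms outside the window, at the cost of no longer matching the original character positions.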

Seminar Outline

     Introduction

  • Discovery
  • Analysis

     Search versus Analysis

  • Transformation of text

     Types of Unstructured Data

  • Voice
  • Image
  • Text

     Types of Textual Data

  • Simple Unstructured Data
  • Semi-Structured Data
  • Volumes of Data
  • Textual versus Structured Data
  • Scaling Volumes of Data

     The Unstructured Database

  • Iterative Development
  • Imperfect Data

     Integrating Simple Unstructured Data

  • Stop Words
  • Stems
  • Synonyms
  • Homographs
  • Internal/External Taxonomies
  • Alternate Spelling
  • Multiple Languages
  • Proximity Variables
  • Date Standardization
  • Text to Numeric Conversion
  • Email (Screening)

     Integrating Semi-Structured Data

  • Subdocument Separation
  • Looking at Hidden Characters
  • Pattern Recognition
  • Symbol Recognition
  • Multiple Types of Indexing
  • The Subject Oriented Index
  • Index Trimming
  • List Processing

     Linking Unstructured and Structured

  • Dynamic Links
  • Static Links
  • By Name
  • By Variable
  • By Communication Id
  • By Business Id

     Visualization

  • SOMs
  • Simple Unstructured SOM
  • Semi-Structured SOM
  • The Discovery Process

     Other Miscellaneous Topics

  • Technology Infrastructure
  • A Methodology for Unstructured Text

Audience
Data analysts and business managers who need to know how to incorporate textual data into their decision-making processes. In particular:

  • the Data Architect who needs to know how to integrate textual data into a data warehouse
  • the Business Analyst who wishes to form an analytic source of data based on textual data
  • the Manager who recognizes that there is a lot of important data tied up in text and needs to know how to unlock that data into an analytical environment

Special Features

  • This seminar is based on two commercially available books: Tapping into Unstructured Data (Prentice Hall, 2007) and DW 2.0: The Architecture for the Next Generation of Data Warehousing (Elsevier, 2008). Attendees are encouraged to read both books before attending the seminar.
  • In addition, attendees will be directed to a collection of white papers.

In-House Training
If you require a quote for running an in-house course, please contact us with the following details:

  • Subject matter and/or speaker required
  • Estimated number of delegates
  • Location (town, country)
  • Number of days required (if different from the public course)
  • Preferred date

Please contact:
Jeanette Hall
E-mail: jeanette.hall@irmuk.co.uk
Telephone: +44 (0)20 8866 8366
Fax: +44 (0)1923 828 770

Speaker: Bill Inmon
Inmon Consulting Services

Speaker Biography

Bill Inmon, known as the father of the data warehouse, has written 47 books and over 1,000 articles. His books have been translated into 9 languages. His weekly newsletter with the b-eye-network reaches over 55,000 recipients. He holds 8 software patents.