Ray Richards is the founder of Mindspan Consultants and a technology journalist based in Ottawa, Canada.


Data to Diamonds - Part 2

Last month we scratched the surface of a very deep topic with an overview of data warehousing and data mining. This time around we continue with an in-depth look at a subject that has been the focus of a great deal of interest of late.

Structure

As we try to shed some light on the structure of a data warehouse, I thought it would help to provide a graphic representation. The following is a high-level diagram of a possible data warehouse implementation, drawing on a variety of structured and unstructured data repositories.

Fig. 1: A high-level diagram of a possible data warehouse implementation.

Data Modelling

As you can see, all data elements must pass through a middleware application that is responsible for dissecting, analysing, purging and reformatting data for inclusion in the data warehouse. Any data that is not suitable for processing with decision support software (DSS) is refused admission. The structure of the warehouse itself is determined by data architects at the design phase and takes many variables into account, chief among them the rules for data summarization. There are typically five levels of summarization:

  • meta data (which, as we learned last month, also includes the rules for data summarization);
  • current detail data;
  • older detail data;
  • lightly summarized data; and
  • highly summarized data.
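
To make the middleware's gatekeeping role more concrete, here is a minimal sketch in Python. The record fields, the cleansing rule and the level names are all hypothetical; a real implementation would apply the full rule set defined by the data architects.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class SummarizationLevel(Enum):
    """The five levels described above (names are illustrative only)."""
    META_DATA = auto()
    CURRENT_DETAIL = auto()
    OLDER_DETAIL = auto()
    LIGHTLY_SUMMARIZED = auto()
    HIGHLY_SUMMARIZED = auto()

@dataclass
class Record:
    """A single operational record arriving at the middleware layer."""
    source: str
    timestamp: str
    payload: dict

def cleanse(record: Record) -> Optional[Record]:
    """Middleware step: reject records unsuitable for DSS processing.
    Here we simply drop records with missing payload values; the real
    rule set would be far richer."""
    if not record.payload or any(v is None for v in record.payload.values()):
        return None  # refused admission to the warehouse
    return record

# Example: only the clean record is admitted, landing in current detail.
incoming = [
    Record("orders", "1998-03-01T10:00", {"sku": "A100", "qty": 2}),
    Record("orders", "1998-03-01T10:05", {"sku": None, "qty": 1}),
]
admitted = [r for r in incoming if cleanse(r)]
destination = SummarizationLevel.CURRENT_DETAIL
print(len(admitted), destination.name)  # -> 1 CURRENT_DETAIL
```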

The most problematic data store by far is the current detail data, as it is kept at the lowest level of granularity and hence requires the most system resources. Current detail data is always stored on hard disk, an expensive medium compared with the nearline storage devices (such as tape or optical drives) on which older data usually resides.

Current detail data is often the focus of time-sensitive queries, as it is the most recent reflection of transactions occurring in the operational environment, and thus requires the speed of access that only hard disk systems can afford. Older detail data is usually stored nearline because it is accessed far less frequently and is therefore not subject to the access-speed requirements of current detail. Lightly summarized data is derived from the current detail and is composed of far fewer, but more critical, data elements. As the data ages, it is further summarized to include only the most vital records. Summarized data is almost always stored on hard disk because of its compact nature and its importance in knowledge discovery via trend and pattern analysis.
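
The following sketch illustrates, in Python, how current detail might be rolled up into the lightly and highly summarized layers. The table layout, field names and aggregation choices are assumptions made purely for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical current-detail rows: one per sale transaction.
current_detail = [
    {"day": date(1998, 3, 1), "region": "East", "product": "chips", "amount": 4.50},
    {"day": date(1998, 3, 1), "region": "East", "product": "chips", "amount": 3.25},
    {"day": date(1998, 3, 2), "region": "West", "product": "dip",   "amount": 2.00},
]

def lightly_summarize(rows):
    """Roll transactions up to daily totals per region and product --
    far fewer rows, but the key measures are preserved."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["day"], r["region"], r["product"])] += r["amount"]
    return totals

def highly_summarize(light):
    """Roll the light summary up again, keeping only monthly totals per region."""
    totals = defaultdict(float)
    for (day, region, _product), amount in light.items():
        totals[(day.year, day.month, region)] += amount
    return totals

light = lightly_summarize(current_detail)
print(dict(highly_summarize(light)))
# -> {(1998, 3, 'East'): 7.75, (1998, 3, 'West'): 2.0}
```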

Data architects always face the difficulty of determining when the data in question should move from one summarization stratum to the next, and what should constitute a "key" data element scheduled for inclusion in the summarized layers. This involves a careful analysis of the business processes that the warehouse is being designed to support and the key questions that this IT solution is meant to answer.

What now?

So you've designed your warehouse structure and now you are dying to discover what business intelligence gems are hidden within all that data. How do you go about it? Well, as we discussed last month, data mining is a rapidly emerging technology that has been receiving accolades for the high return on investment (ROI) realized by its successful use within the data warehouse in a knowledge discovery capacity. International Data Corporation, in cooperation with IBM, recently conducted a study which concluded that the average ROI of warehousing initiatives was 401% over two to three years!

Considering the amount of resources that a typical warehouse project will consume, the dollar value of that ROI can be enormous. So how does data mining work? First, let's define the subject: data mining is the extraction of hidden, predictive information from large databases. There are several tools and methodologies used to accomplish this goal, and while reviews of specific software are beyond the scope of this article, I will investigate the techniques that these powerful applications employ.

Data mining works by attempting to discover unknown or unforeseen patterns and trends within data. There are two methods by which this task is undertaken: user initiative or system initiative. The user initiative method is simple: the knowledge worker poses a query to the system, which then uses one of a variety of methods to come up with an answer. The problem with this, however, is that the user will often miss key information simply because he or she hasn't thought of all the possibilities. System initiative is far more intriguing: the system itself examines the data and, based on significant patterns it discovers, queries itself on information it has determined is "interesting" or of potential value, then presents its findings to the knowledge worker without any human input.
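
A toy example may help contrast the two approaches. In the Python sketch below, the user-initiative function answers only the question that was asked, while the system-initiative scan examines every product and region split and flags anything it deems "interesting" according to an arbitrary threshold. The data and the threshold are invented for illustration.

```python
# Hypothetical sales table: (product, region, units_sold)
sales = [
    ("chips", "East", 120), ("chips", "West", 30),
    ("dip",   "East", 110), ("dip",   "West", 25),
    ("pop",   "East", 40),  ("pop",   "West", 35),
]

# User initiative: the knowledge worker asks one specific question.
def user_query(product):
    return sum(units for p, _, units in sales if p == product)

print(user_query("chips"))  # -> 150, and nothing more than was asked

# System initiative: the system scans every product/region split on its own
# and flags any skew it judges "interesting" (threshold chosen arbitrarily).
def system_scan(threshold=3.0):
    findings = []
    for product in {p for p, _, _ in sales}:
        east = sum(u for p, r, u in sales if p == product and r == "East")
        west = sum(u for p, r, u in sales if p == product and r == "West")
        if west and east / west >= threshold:
            findings.append(f"{product}: East outsells West {east}:{west}")
    return findings

print(system_scan())  # e.g. ['chips: East outsells West 120:30', 'dip: ...']
```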

How do they do that?

There are many ingenious methods the machine may use to arrive at its conclusions, neural networks, decision trees and association analysis among them. By far the most fascinating is the neural network. Artificial neural networks have been around since the 1950s and were developed as computer-based models of the human brain for use in artificial intelligence experiments. Essentially, an artificial neural network contains many independent processors, similar to neurons in the brain, which are connected, in a manner resembling synaptic function, according to the information retrieval task at hand.

Neural nets learn to distinguish trends and patterns in a non-linear fashion that is very comparable to human knowledge acquisition. This makes them very powerful data analysis tools indeed, especially for extremely complex problems such as fingerprint identification or speech recognition. When applied to business data, neural networks often arrive at very surprising conclusions that provide entrepreneurs with the business intelligence required to succeed in today's competitive marketplace.
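
To give a flavour of how such a network "learns", here is a deliberately tiny Python sketch of a single artificial neuron, the building block of a neural network, trained by gradually nudging its connection weights. The customer data and features are fabricated, and a production neural network would have many such neurons arranged in layers.

```python
import math
import random

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical customer records: [purchases last quarter, avg spend in $10s]
# paired with whether the customer responded to a past promotion (1) or not (0).
training = [
    ([5, 4.0], 1), ([6, 5.5], 1), ([4, 3.5], 1),
    ([1, 0.5], 0), ([0, 0.8], 0), ([2, 1.0], 0),
]

# A single artificial "neuron": weighted inputs squashed by a non-linear activation.
weights = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)]
bias = 0.0
rate = 0.1

for _ in range(2000):                       # repeated exposure is the "learning"
    for features, target in training:
        output = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        error = target - output
        # Nudge each connection weight in the direction that shrinks the error.
        weights = [w + rate * error * x for w, x in zip(weights, features)]
        bias += rate * error

# The trained neuron now scores unseen customers.
print(round(sigmoid(weights[0] * 5 + weights[1] * 4.5 + bias), 2))  # near 1: likely responder
print(round(sigmoid(weights[0] * 1 + weights[1] * 0.6 + bias), 2))  # near 0: unlikely
```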

The other methods, while often less complex, employ powerful algorithms to achieve similar results. The trick is to use the right tool for the job while keeping resource requirements in mind. Decision trees are relatively simple tools that a bank manager might use, for example, to determine whether an applicant is suitable to receive a loan. Association analysis examines the relationships between data elements to determine how frequently they occur together. This is great for finding out which brands of pop and dip customers purchase with their potato chips, or which options were most popular, by season, on last year's production of Chevrolets.
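
As a small illustration of the potato-chip example, the Python sketch below counts how often pairs of items appear together in a handful of made-up till receipts. This is only bare-bones co-occurrence counting, not a full association-rule algorithm such as Apriori, but it conveys the idea.

```python
from itertools import combinations
from collections import Counter

# Hypothetical till receipts (market baskets).
baskets = [
    {"potato chips", "cola", "onion dip"},
    {"potato chips", "cola"},
    {"potato chips", "onion dip", "salsa"},
    {"cola", "pretzels"},
]

# Count how often each pair of items appears together in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report the strongest associations with a chosen anchor item.
anchor = "potato chips"
with_anchor = {p: c for p, c in pair_counts.items() if anchor in p}
for pair, count in sorted(with_anchor.items(), key=lambda kv: -kv[1]):
    other = pair[0] if pair[1] == anchor else pair[1]
    support = count / len(baskets)
    print(f"{anchor} + {other}: bought together in {support:.0%} of baskets")
```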

Finally

When deciding whether or not to embark on the journey towards building a business intelligence engine, you must carefully weigh all the factors involved, many of which are not readily apparent. To this end you must enlist the aid of skilled consultants who will be able to guide you down this rather dangerous path. Be aware that this is an undertaking of very large proportions and that it can quickly get out of hand. You must have a crystal-clear vision of what you wish to achieve, and a carefully laid plan, before any work commences. If, however, you have done your homework and completed the preliminaries with diligence, you can expect huge rewards in the years to come.

Originally published in Monitor Magazine's Lan ConXions column, March 1998, by technology columnist Ray Richards.
