Ray Richards is founder of Mindspan Consultants and a technology journalist hailing from Ottawa, Canada

Skip site navigation and move to main content of page.

Data to Diamonds

Last month I took a little holiday from Monitor and am returning this time around with an article addressing the burgeoning interest in the areas of data mining and warehousing.

Disorganized Data Deluge

If you are in the same boat that most companies are finding themselves in, your organization is the proud parent of numerous unconnected data repositories: accounting databases, Human Resources records, client databases, spreadsheets, electronic and paper based documents, emails and the like. Sorting through all these data sources in order to obtain any really useful business intelligence is tantamount to searching for a pebble in a sand dune. Suppose you wanted to find out what customers in the Toronto area, who possess credit cards, purchased both a computer and another peripheral or accessory for it from your large retail chain within a given month; and what the profit margin on those additional items was, would you be able to do it?

Sure, I suppose it could be done, but I'm fairly certain that most businesses wouldn't even entertain the thought; as the task would prove monumental. But why would you want to gather this kind of information anyway? Let's say you wanted to get more bang for the buck on your mail advertising: by taking a targeted approach to the advertisement, instead of the shotgun method, you would reduce advertising costs and increase revenues. You could also use it to merchandise your stores appropriately; with the popular, higher margin accessories displayed next to the machines they were associated with. It would enable you to educate your sales force on what items to push in order to increase your overall profitability. You could use the information to discover how much of your profit margin was going to the credit card companies and to then strategize on less costly methods of customer financing such as leasing and so on.

Maybe you'd like to findĀ  out what types of drugs people typically purchase together, what the associated risk factors in these combinations are and what the average yearly customer intake of each is. This information would enable pharmaceutical companies to engineer new drugs, reduce side effects of drug combinations and plan marketing campaigns around the statistical data obtained.

Perhaps you'd like to find out what transactions are most susceptible to credit card fraud among retailers who accept your card. This would enable you to "red flag" these transactions in order that you might scrutinize them more carefully; thereby reducing the number of fraudulent purchases and in turn increasing profitability.

All of these initiatives are currently being undertaken by large organizations around the globe.

Mining the Data Vein

How do I go about collecting all this information? A technique that is becoming very popular in both the public and private sectors is called "data warehousing". Data warehousing is defined by Bill Inmon (the father of data warehousing) as: "A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process." Essentially, a data warehouse is a database which combines data from many different repositories and allows for the implementation of a Decision Support System (DSS), used to facilitate business growth and effectiveness. Hundreds of organizations worldwide are using data warehouses to realize these goals and gain a competitive edge on others within their markets. Let's take a closer look at Bill's definition.

A data warehouse is subject-oriented: This means that the design is based on clearly defined data categories vs. business functions or departments. In a grocery chain for example, a traditional operational database system may include Produce, Accounting, Human Resources, Advertising, Meat, Dairy etc.; whereas a data warehouse would contain sections for Vendor, Customer, Activity, Client, Product and so on. This design difference better supports the use of DSS tools and excludes data that is not relevant to DSS processing.

A data warehouse is integrated: Data from all sources must be consistent in format once entering the warehouse. Conversion filters must be applied to all incoming data sources in order to ensure warehouse standards are adhered to; otherwise, data credibility will be in question.

A data warehouse is time-variant: Data inside the warehouse is not real time. It represents a snapshot of information found within various component data repositories at a given point in time. Data in a warehouse represents information collected over a long period; typically 5-10 years. By contrast, real time operational systems are unable to probe data with the same degree of depth; as system performance is critical in transaction processing and therefore precludes this type of analysis by design.

How does it work?

Data Warehouse Diagram

A data warehouse uses information from the operational environment that has been parsed in order to render it useful in support of executive decision making. Middleware allows for the exchange of data and standardization of variables across the enterprise. This is essentially the glue that holds it all together. After going through the middleware filters, data is stored in accordance with rules defined at the warehouse design phase. Current and recent data is available in it's entirety, while older data is automatically summarized to preserve space. As data residing within the warehouse is usually extremely voluminous, meta data (dwelling within a separate database called a "data dictionary") must be utilized to assist in data retrieval. What is meta data? Simply put, meta data is data about data. If that is confusing, think of a data warehouse as a library and meta data as the card catalogue. Meta data must contain:

  1. the structure of the data;
  2. the algorithms used for summarization;
  3. and the mapping from the operational environment to the data warehouse.

Once inside the warehouse, techniques such as on-line analytical processing (OLAP), neural networks, decision trees and association analysis (all discussed next month) are used on data to provide the knowledge worker with useful intelligence otherwise not available to him/her. This represents a distinct competitive advantage to those employing this technology.

How do I get one?

The first thing that you must understand is that you can't just go out and buy a data warehouse. Implementing one requires a great deal of effort and skill sets not often found within the target organization. Data warehousing is as much about data modeling and standards institution as it is about technology implementation. Ideally, you must first set standards for the individual repositories or be prepared to pay the price later in customization of middleware. Repository standards may take into account such low level concerns as naming conventions (e.g. male or female = m,f and not 1,0 or x,y or male and female) measurement standards (container volume = cm3 vs. ml orĀ  m3 or quarts etc.) date conventions (e.g. mm/dd/yy vs. Julian or dd/mm/yyyy etc.) as well as high level considerations; including support of business processes, interoperability between departments, electronic commerce standards and so on. The resulting framework should be arrived at through the application of relevant industry "best practices" which identify critical success factors required for the attainment and subsequent sustainability of target outcomes. Often, however, the legacy systems which you are attempting to integrate into the warehouse are not customizable enough to apply these new standards to resulting in the need for complex middleware solutions.

Once the standard for relationships between the data repositories has been established, you must define the structure of your warehouse. As this is no small task, we will reserve it for the next issue where we will also discuss various data mining techniques.

Originally published in Monitor Magazine, Lan ConXions column, February, 1998, by columnist Ray Richards.

Sidebar

Article Index