Data to Diamonds - Part 2
Last month we scratched the surface of a very deep topic with an overview of data warehousing and data mining. This time we continue with an in-depth look at the subject, which has attracted a great deal of interest of late.
Structure
To shed some light on the structure of a data warehouse, I thought a graphic representation would help. The following is a high-level diagram of a possible data warehouse implementation that draws on a variety of structured and unstructured data repositories.
Fig. 1
As you can see, all data elements must pass through a middleware application, which is responsible for the dissection, analysis, purging and reformatting of data for inclusion within the data warehouse. Any data that is not suitable for processing with decision support software (DSS) is refused admission. The structure of the actual warehouse is determined by data architects at the design phase and takes into account many variables including, as a prime function, the rules for data summarization. There are typically five levels of summarization:
- metadata (which, as we learned last month, also includes the rules for data summarization);
- current detail data;
- older detail data;
- lightly summarized data, and
- highly summarized data.
The most problematic data store by far is the current detail data, as it is kept at the lowest level of granularity and hence requires the most system resources. Current detail data is always stored on hard disk, an expensive medium compared with the nearline storage (such as tape or optical drives) on which older data usually resides.
Current detail data is often the focus of time-sensitive queries, as it is the most recent reflection of transactions occurring in the operational environment, and it therefore requires the speed of access that only hard disk systems can afford. Older detail data is usually stored nearline because it is accessed far less frequently and is therefore not subject to the access speed requirements of current detail. Lightly summarized data is derived from the current detail and is composed of far fewer but more critical data elements. As the data ages, it is further summarized to include only the most vital records. Summarized data is almost always stored on hard disk due to its compact nature and its importance in knowledge discovery via trend and pattern analysis.
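To make the idea of summarization concrete, here is a minimal sketch in Python. The sample transactions, the field names and the choice of daily and monthly totals are all assumptions made for illustration; a real warehouse would follow whatever summarization rules its architects have recorded in the metadata.

```python
from collections import defaultdict
from datetime import date

# Hypothetical current detail data: one row per transaction.
detail = [
    {"day": date(1998, 3, 2), "product": "chips", "amount": 2.50},
    {"day": date(1998, 3, 2), "product": "chips", "amount": 2.50},
    {"day": date(1998, 3, 2), "product": "dip", "amount": 3.00},
    {"day": date(1998, 3, 3), "product": "chips", "amount": 2.50},
]


def lightly_summarize(rows):
    """Roll transactions up to one total per (day, product)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["day"], row["product"])] += row["amount"]
    return dict(totals)


def highly_summarize(light):
    """Roll daily totals up to one total per (year, month, product)."""
    totals = defaultdict(float)
    for (day, product), amount in light.items():
        totals[(day.year, day.month, product)] += amount
    return dict(totals)


light = lightly_summarize(detail)
high = highly_summarize(light)
```

Four detail rows collapse into three daily totals and then into two monthly totals, which is why the summarized layers are compact enough to keep on fast storage.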
Data architects always face the difficulty of determining when the data in question should move from one summarization stratum to the next, and what should constitute a "key" data element scheduled for inclusion within the summarized layers. This involves a careful analysis of the business processes that the warehouse is being designed to support, and of the key questions that this IT solution is meant to answer.
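One very simple way to express such a rule is by age alone. The sketch below assumes that approach, and the cut-off values are hypothetical figures chosen only for illustration; in practice the thresholds (and whether age is even the right criterion) come out of the business analysis described above.

```python
from datetime import date

# Hypothetical age thresholds, in days. Real cut-offs are a design decision.
CURRENT_DETAIL_MAX_DAYS = 90
OLDER_DETAIL_MAX_DAYS = 365
LIGHT_SUMMARY_MAX_DAYS = 3 * 365


def stratum_for(record_date, today):
    """Return the summarization stratum for a record, based only on its age."""
    age_days = (today - record_date).days
    if age_days <= CURRENT_DETAIL_MAX_DAYS:
        return "current detail"
    if age_days <= OLDER_DETAIL_MAX_DAYS:
        return "older detail"
    if age_days <= LIGHT_SUMMARY_MAX_DAYS:
        return "lightly summarized"
    return "highly summarized"
```

For example, with today taken as March 31, 1998, a record dated March 1, 1998 would stay in current detail, while one dated September 1, 1997 would be classed as older detail.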
What now?
So you've designed your warehouse structure and now you are dying to discover what business intelligence gems are hidden within all that data. How do you go about it? Well, as we discussed last month, data mining is a rapidly emerging technology which has been receiving accolades for the high return on investment (ROI) realized by its successful use within the data warehouse in a knowledge discovery capacity. International Data Corporation, in cooperation with IBM, recently conducted a study which concluded that the average ROI of warehousing initiatives was 401% over two to three years!
Considering the amount of resources that a typical warehouse project will consume, the dollar value of the ROI can be enormous. So how does data mining work? First let's define the subject: data mining is the extraction of hidden, predictive information from large databases. There are several tools and methodologies used to accomplish this goal, and while reviews of specific software are beyond the scope of this article, I will investigate the techniques that these powerful applications use.
Data mining works by attempting to discover unknown or unforeseen patterns and trends within data. There are two methods by which this task is undertaken: user initiative or system initiative. The user initiative method is simple. The knowledge worker poses a certain query to the system, which then uses one of a variety of methods to come up with an answer. The problem with this, however, is that the user will often miss key information merely because he or she hasn't thought of all the possibilities. System initiative is far more intriguing, as the system itself examines the data and, based on significant patterns discovered, queries itself on information it has determined is "interesting" or of potential value, and presents its findings to the knowledge worker without any human input.
How do they do that?
There are many ingenious methods that the machine may use to arrive at its conclusions: neural networks, decision trees and association analysis among them. By far the most fascinating is the neural network. Artificial neural networks have been around since the 1950s and were developed as a computer-based model of the human brain for use in artificial intelligence experiments. Essentially, the artificial neural network contains many independent processors, similar to neurons in the brain, which are connected according to the information retrieval task at hand, in a manner resembling synaptic function.
Neural nets learn to distinguish trends and patterns in a non-linear fashion which is very comparable to human knowledge acquisition. This makes them very powerful data analysis tools indeed, especially for extremely complex problems such as fingerprint identification or speech recognition. When applied to business data, neural networks often arrive at very surprising conclusions that provide entrepreneurs with the business intelligence required to succeed in today's competitive marketplace.
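To give a feel for what "non-linear" means here, the following is a toy sketch in Python, not a real data mining tool. It wires three simple threshold neurons into a tiny network that computes exclusive-or (XOR), a pattern that no single neuron of this kind can represent on its own. Note the assumption: the weights are picked by hand for illustration, whereas a real neural network would learn its weights from the data.

```python
def step(x):
    """Threshold activation: the neuron fires (1) only if its input is positive."""
    return 1 if x > 0 else 0


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the threshold."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)


def xor_net(a, b):
    """Two hidden neurons feeding one output neuron (hand-picked weights)."""
    h_or = neuron((a, b), (1, 1), -0.5)    # fires if a OR b is set
    h_and = neuron((a, b), (1, 1), -1.5)   # fires only if a AND b are set
    return neuron((h_or, h_and), (1, -1), -0.5)  # OR but not AND


results = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer is what lets the network separate cases that cannot be divided by a single straight line, and that layering is the source of the non-linear behaviour described above.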
The other methods, while often less complex, employ powerful algorithms to achieve similar results. The trick is to use the right tool for the job, while keeping resource requirements in mind. Decision trees are relatively simple tools that a bank manager might use, for example, to determine whether an applicant is suitable to receive a loan. Association analysis examines the relationships between data elements to determine levels of association frequency. This is great for finding out what brands of pop and dip customers purchase with their potato chips, or which options were most popular, by season, on last year's production of Chevrolets.
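As a rough illustration of association analysis, the Python sketch below counts how often pairs of items turn up in the same shopping basket. The baskets are made-up sample data, and "support" and "confidence" are the usual names for these two measures in the data mining literature rather than terms used elsewhere in this column.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"chips", "dip", "cola"},
    {"chips", "dip"},
    {"chips", "cola"},
    {"dip", "bread"},
    {"chips", "dip", "cola"},
]


def pair_counts(all_baskets):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in all_baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def support(a, b, all_baskets):
    """Fraction of all baskets that contain both a and b."""
    both = sum(1 for basket in all_baskets if a in basket and b in basket)
    return both / len(all_baskets)


def confidence(a, b, all_baskets):
    """Of the baskets containing a, the fraction that also contain b."""
    with_a = [basket for basket in all_baskets if a in basket]
    if not with_a:
        return 0.0
    return sum(1 for basket in with_a if b in basket) / len(with_a)


counts = pair_counts(baskets)
chips_dip_support = support("chips", "dip", baskets)
chips_dip_confidence = confidence("chips", "dip", baskets)
```

With this sample data, chips and dip appear together in three of the five baskets, and three of the four chip buyers also bought dip. Commercial tools apply the same idea to millions of transactions and to larger groups of items.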
Finally
When deciding whether or not to embark on the journey towards the production of a business intelligence engine, you must carefully weigh all the factors involved, many of which are not readily apparent. To this end you must enlist the aid of skilled consultants who will be able to guide you down this rather dangerous path. Be aware that this is an undertaking of very large proportions and that it can quickly get out of hand. You must have a crystal clear vision of what you wish to achieve, and a carefully laid plan, before any work may commence. If, however, you have done your homework and completed the preliminaries with diligence, you can expect huge rewards in the years to come.
Originally published in Monitor Magazine's Lan ConXions column, March, 1998, by technology columnist, Ray Richards.