Friday, June 29, 2012

Need For Data Warehouse / Data Warehouse Testing

This below post covers - What is Data Warehousing/Data Warehouse, What is the need or why do companies need Data Warehousing and lastly Data Warehouse testing.

What is Data Warehousing?
(a-f different definitions of data warehouse)

a) A large store of data accumulated from a wide range of sources within a company and used to guide management decisions. [Via]

b) A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. [Via]

c) The concept of data warehousing is deceptively simple. Data is extracted periodically from the applications that support business processes and copied onto special dedicated computers. There it can be validated, reformatted, reorganized, summarized, restructured, and supplemented with data from other sources. The resulting data warehouse becomes the main source of information for report generation, analysis, and presentation through ad-hoc reports, portals, and dashboards.

d) If your enterprise data flows all the way through and your organization stores it in a variety of in-house and external systems. This may be due to the fact that your company branches are in Mumbai, Delhi, Calcutta and Lucknow. The Sales head requires monthly sales report. Each branch has a separate operational system.

You become conscious that all the essential data is most likely there but not integrated,ordered and organized in the way for you to get the answers easily.

You or your employees might be taking out data (needed by you) by digging out/extracting data from one source and sorting or arranging the data according to city for example, then cleaning the data (if needed) and at last running reports against this data. There must be some superior and more trustworthy way of doing this.

Data Warehouse serves not only as a repository for historical data but also as an excellent data integration platform. The data in the data warehouse is integrated, subject oriented, time-variant and non-volatile to enable you to get a 360° view of your organization.

Recommended reading:
Jd Edwards Enterpriseone:Tcr By Jacot.
The Need for Data Warehousing
Why You Need a Data Warehouse
You can also read Networkworld for Do You Really Need a Data Warehouse.

e) A data warehouse is a collection of data, usually current and historical, from multiple databases that the organization can use for analysis and decision making. The purpose, of course, is to bring key sets of data about or used by the organization into one place. [Via]

f) Asktom says Most data warehouses are characterized by:
-hundreds of gigabytes to terabyte in size. dealing with that scale takes knowledge.
-periods of read mostly activity, with periods of short bursts of FURIOUS load, reorg,rebuild activity.
-use of specialized features that many people have read about but fewer have used such as bitmap indexes, materialized views, star query schemas, partitioning, etc.
-use of parallel query -- similar to the point right above.
-constantly shifting set of questions - requiring shifting sets of data, MV's, indexes, etc (most systems, once production, are relatively static. DW's -- every day is a brand new day).
To name a few. Dealing with terabytes of data is hard, takes experience. Lots of know how required.

Do you really need a data warehouse?

Most companies use data warehousing for Validation
Validation is where the user community validates with data what they already believe to be true. For example, Denver consumers buy products differently than New York City consumers. New York folks tend to purchase a candy bar on a whim (city population buying patterns), where Denver folks are less likely to do so (rural population buying patterns). This has been hypothesized for years, but empirical data shows it to be true. Another example is the question "Who are my best customers?" Once we get past the definition of "best", I bet that most of the user community already knows the answer. The biggest part is getting past the definition of "best". My estimate is about 45% of the usage of the data warehouse is validation. [Via]

Another example of Data Warehouse usefulness - A data warehouse can provide reports that indicate which product lines are popular in various regions, which employees have generated most sales etc.

A data warehouse can help identify departments that have exceed their monthly budgets, highlight suppliers who have consistently met profitability goals etc.. [Via:A Manager's Guide to Data Warehousing by Laura L. Reeves]

Examples of some of the questions answered by data warehouse:


[Idea of above image via:Modern Data Warehousing, Mining, And Visualization: Core Concepts By Marakas and Information Management]

Data Warehouse Testing
[What different people say about Data Warehouse Testing]

In paper A Comprehensive Approach to Data Warehouse Testing by Matteo Golfarelli and Stefano Rizzi, authors say that

Like for most generic software systems, different types of tests can be devised for data warehouse systems. For in- stance, it is very useful to distinguish between unit test, a white-box test performed on each individual component considered in isolation from the others, and integration test, a black-box test where the system is tested in its entirety. Also regression test, that checks that the system still func-tions correctly after a change has occurred, is considered to be very important for data warehouse systems because of their ever-evolving nature. However, the peculiar characteristics of data warehouse testing and the complexity of data warehouse projects ask for a deep revision and contextualization of these test types, aimed in particular at emphasizing the relationships between testing activities on the one side, design phases and project documentation on the other.
The main features of our approach [in testing Data Warehouse] can be summarized as follows:
-A consistent portion of the testing effort is advanced to the design phase to reduce the impact of error correction.
-A number of data mart-specific testing activities are identified and classified in terms of what is tested and how it is tested.
-A tight relationship is established between testing activities and design phases within the framework of a reference methodological approach to design.
-When possible, testing activities are related to quality metrics to allow their quantitative assessment.

Jackie McDougall in Learning to Love Data Warehouse Testing, shares his first experience of managing a data warehouse testing project? Well, the main lessons were:
-Define quality entry criteria expected from your supplier.
-Insist on design/development documentation.
-Get the requirements right and test them.
-Consider data quality and when it gets addressed.
-Get a really good test strategy.
-Get the right (skilled) resources on board (early).
-Think carefully about the reconciliation exercise as part of acceptance.

William E. Perry in his book [Effective Methods for Software Testing], shares concerns most commonly associated with data warehouse like Inaccurate or incomplete data in a data warehouse, Losing an update to a single data item, Unauthorized access to data in a data warehouse etc. They say that to test a data warehouse, testers should perform the following three tasks:
1. Measure the magnitude of data warehouse concerns. [Once the list of potential data warehouse concerns has been finalized, the magnitude of those concerns must be determined].
2. Identify data warehouse activities to test [Organizational Process, Data Documentation Process, System Development Process, Access Control Process, Backup/Recovery Process etc].
3. Test the adequacy of data warehouse activity processes[ This task is to evaluate that each of the identified processes contains controls that are adequate to reduce the concerns identified earlier]
For more on this please refer the book.

Sundaram Subramanian says that "The complexity and criticality of data warehouse testing projects is growing rapidly each day. Data warehouses need to be validated for functionality, quality, integrity, availability, scalability and security based on the defined business requirements by an organization. Based on my experiences as a testing practitioner, I believe the following best practices in the test planning phase can significantly contribute to successfully validating the data warehouse."
The best practices he suggests are
1. Comprehensively Understand the Data Model
2. Understand the Business Requirements Clearly
3. Plan Early for Data Warehouse Testing Environment

You can also look at these Tom’s Ten Data Tips on Data Warehouse Testing.

You can also go through Data warehouse Testing - Ensuring Information Quality through Innovation.

Read what Baher Malek has to say on How to test a data warehouse.

Also see Traditional testing versus DWH/BI testing by Anita van Bergenhenegouwen.