KTechITKnowledge: Data Warehousing

Data Warehousing

1. Data Warehousing

The question most asked now is, How do I build a data warehouse? This is a question that is not so easy to answer. As you will see in this Artical, there are many approaches to building one. However, at the end of all the research, planning, and architecting, you will come to realize that it all starts with a firm foundation. Whether you are building a large centralized data warehouse, one or more smaller distributed data warehouses (sometimes called data marts), or some combination of the two, you will always come to the point where you must decide on how the data is to be structured.
This is, after all, one of the most key concepts in data warehousing and what differentiates it from the more typical operational database and decision support application building. That is, you structure the data and build applications around it rather than structuring applications and bringing data to them.
Data warehouse modeling is a process that produces abstract data models for one or more database components of the data warehouse. It is one part of the overall data warehouse development process, which is comprised of other major processes such as data warehouse architecture, design, and construction. We consider the data warehouse modeling process to consist of all tasks related to requirements gathering, analysis, validation, and modeling. Typically for data warehouse development, these tasks are difficult to separate.This may suggest a rather broad gap between modeling and design activities, which in reality certainly is not the case.
The separation between modeling and design is done for practical reasons: it is our intention to cover the modeling activities and techniques quite extensively. Some trend-setting authors and data warehouse consultants have taken this point to what we consider to be the extreme. That is, they are presenting what they are calling a totally new approach to data modeling. It is called dimensional data modeling, or fact/dimension modeling. Fancy names have been invented to refer to different types of dimensional models, such as star models and snowflake models. Numerous arguments have been presented against traditional entity-relationship (ER) modeling, when used for modeling data in the data warehouse.
Rather than taking this more extreme position, we believe that every technique has its area of usability. For example, we do support the many criticisms of ER modeling when considered in a specific context of data warehouse data modeling, and there are also criticisms of dimensional modeling. There are many types of data warehouse applications for which ER modeling is not well suited, especially those that address the needs of a well-identified community of data analysts interested primarily in analyzing their business measures in their business context.
Likewise, there are data warehouse applications that are not well supported at all by star or snowflake models alone. For example, dimensional modeling is not very suitable for making large, corporatewide data models for a data warehouse. With the changing data warehouse landscape and the need for data warehouse modeling, the new modeling approaches and the controversies surrounding traditional modeling and the dimensional modeling approach all merit investigation. And that is another purpose of this post. Because it presents details of data warehouse modeling processes and techniques, the post can also be used as an initiating for those who want to learn data warehouse modeling.

2. Data Warehousing Architecture and Implementation Choices

In this post we discuss the architecture and implementation choices available for data warehousing. During the discussions we may use the term data mart. Data marts, simply defined, are smaller data warehouses that can function independently or can be interconnected to form a global integrated data warehouse. However, in this post, unless noted otherwise, use of the term data warehouse also implies data mart. Although it is not always the case, choosing an architecture should be done prior to beginning implementation. The architecture can be determined, or modified, after implementation begins. However, a longer delay typically means an increased volume of rework. And, everyone knows that it is more time consuming and difficult to do rework after the fact than to do it right, or very close to right, the first time. The architecture choice selected is a management decision that will be based on such factors as the current infrastructure, business environment, desired management and control structure, commitment to and scope of the implementation effort, capability of the technical environment the organization employs, and resources available.
The implementation approach selected is also a management decision, and one that can have a dramatic impact on the success of a data warehousing project. The variables affected by that choice are time to completion, return-on-investment, speed of benefit realization, user satisfaction, potential implementation rework, resource requirements needed at any point-in-time, and the data warehouse architecture selected.

3. Architecture Choices Selection of an architecture

Architecture Choices Selection of an architecture will determine, or be determined by, where the data warehouses and/or data marts themselves will reside and where the control resides. For example, the data can reside in a central location that is managed centrally. Or, the data can reside in distributed local and/or remote locations that are either managed centrally or independently. The architecture choices we consider in this book are global, independent, interconnected, or some combination of all three. The implementation choices to be considered are top down, bottom up, or a combination of both. It should be understood that the architecture choices and the implementation choices can also be used in combinations.
For example, a data warehouse architecture could be physically distributed, managed centrally, and implemented from the bottom up starting with data marts that service a particular workgroup, department, or line of business.

4. Global Warehouse Architecture

A global data warehouse is considered one that will support all, or a large part, of the corporation that has the requirement for a more fully integrated data warehouse with a high degree of data access and usage across departments or lines-of-business. That is, it is designed and constructed based on the needs of the enterprise as a whole. It could be considered to be a common repository for decision support data that is available across the entire organization, or a large subset thereof. Top Down Implementation A top down implementation requires more planning and design work to be completed at the beginning of the project. This brings with it the need to involve people from each of the workgroups, departments, or lines of business that will be participating in the data warehouse implementation. Decisions concerning data sources to be used, security, data structure, data quality, data standards, and an overall data model will typically need to be completed before actual implementation begins. The top down implementation can also imply more of a need for an enterprisewide or corporatewide data warehouse with a higher degree of cross workgroup, department, or line of business access to the data.
This approach is depicted in with this approach, it is more typical to structure a global data warehouse. If data marts are included in the configuration, they are typically built afterward. And, they are more typically populated from the global data warehouse rather than directly from the operational or external data sources. Bottom Up Implementation A bottom up implementation involves the planning and designing of data marts without waiting for a more global infrastructure to be put in place. This does not mean that a more global infrastructure will not be developed; it will be built incrementally as initial data mart implementations expand. This approach is more widely accepted today than the top down approach because immediate results from the data marts can be realized and used as justification for expanding to a more global implementation. depicts the bottom up approach. In contrast to the top down approach, data marts can be built before, or in parallel with, a global data warehouse. And as the figure shows, data marts can be populated either from a global data warehouse or directly from the operational or external data sources.

5. Architecting the Data A data warehouse

Architecting the Data A data warehouse is, by definition, a subject-oriented, integrated, time-variant collection of data to enable decision making across a disparate group of users. One of the most basic concepts of data warehousing is to clean, filter, transform, summarize, and aggregate the data, and then put it in a structure for easy access and analysis by those users. But, that structure must first be defined and that is the task of the data warehouse model. In modeling a data warehouse, we begin by architecting the data. By architecting the data, we structure and locate it according to its characteristics. In this chapter, we review the types of data used in data warehousing and provide some basic hints and tips for architecting that data. We then discuss approaches to developing a data warehouse data model along with some of the considerations. Having an enterprise data model (EDM) available would be very helpful, but not required, in developing the data warehouse data model. For example, from the EDM you can derive the general scope and understanding of the business requirements. The EDM would also let you relate the data elements and the physical design to a specific area of interest. Data granularity is one of the most important criteria in architecting the data. On one hand, having data of a high granularity can support any query. However, having a large volume of data that must be manipulated and managed could be an issue as it would impact response times.
On the other hand, having data of a low granularity would support only specific queries. But, with the reduced volume of data, you would realize significant improvements in performance.

6. Structuring the Data In structuring the data

Structuring the Data In structuring the data, for data warehousing, we can distinguish three basic types of data that can be used to satisfy the requirements of an organization: · Real-time data · Derived data · Reconciled data In this section, we describe these three types of data according to usage, scope, and currency. You can configure an appropriate data warehouse based on these three data types, with consideration for the requirements of any particular implementation effort. Depending on the nature of the operational systems, the type of business, and the number of users that access the data warehouse, you can combine the three types of data to create the most appropriate architecture for the data warehouse.

7. Data Modeling

Data Modeling for a Data Warehouse This chapter provides you with a basic understanding of data modeling, specifically for the purpose of implementing a data warehouse. Data warehousing has become generally accepted as the best approach for providing an integrated, consistent source of data for use in data analysis and business decision making. However, data warehousing can present complex issues and require significant time and resources to implement. This is especially true when implementing on a corporatewide basis. To receive benefits faster, the implementation approach of choice has become bottom up with data marts. Implementing in these small increments of small scope provides a larger return-on-investment in a short amount of time. Implementing data marts does not preclude the implementation of a global data warehouse.
It has been shown that data marts can scale up or be integrated to provide a global data warehouse solution for an organization. Whether you approach data warehousing from a global perspective or begin by implementing data marts, the benefits from data warehousing are significant. The question then becomes, How should the data warehouse databases be designed to best support the needs of the data warehouse users? Answering that question is the task of the data modeler. Data modeling is, by necessity, part of every data processing task, and data warehousing is no exception. As we discuss this topic, unless otherwise specified, the term data warehouse also implies data mart. We consider two basic data modeling techniques in this book: ER modeling and dimensional modeling. In the operational environment, the ER modeling technique has been the technique of choice. With the advent of data warehousing, the requirement has emerged for a technique that supports a data analysis environment. Although ER models can be used to support a data warehouse environment, there is now an increased interest in dimensional modeling for that task. In this chapter, we review why data modeling is important for data warehousing. Then we describe the basic concepts and characteristics of ER modeling and dimensional modeling.

8. The Process of Data Warehousing

The Process of Data Warehousing This chapter presents a basic methodology for developing a data warehouse. The ideas presented generally apply equally to a data warehouse or a data mart. Therefore, when we use the term data warehouse you can infer data mart. If something applies only to one or the other, that will be explicitly stated. We focus on the process of data modeling for the data warehouse and provide an extended section on the subject but discuss it in the larger context of data warehouse development. The process of developing a data warehouse is similar in many respects to any other development project. Therefore, the process follows a similar path. What follows is a typical, and likely familiar, development cycle with emphasis on how the different components of the cycle affect your data warehouse modeling efforts.
It is certainly true that there is no one correct or definitive life cycle for developing a data warehouse. We have chosen one simply because it seems to work well for us. Because our focus is really on modeling, the specific life cycle is not an issue here. What is essential is that we identify what you need to know to create an effective model for your data warehouse environment. There are a number of considerations that must be taken into account as we discuss the data warehouse development life cycle. We need not dwell on them, but be aware of how they affect the development effort and understand how they will affect the overall data warehouse design and model. · The life cycle diagram in seems to infer a single instance of a data warehouse. Clearly, this should be considered a logical view. That is, there could be multiple physical instances of a data warehouse involved in the environment. As an example, consider an implementation where there are multiple data marts. In this case you would iterate through the tasks in the life cycle for each data mart. This approach, however, brings with it an additional consideration, namely, the integration of the data marts. This integration can have an impact on the physical data, with considerations for redundancy, inconsistency, and currency levels. Integration is also especially important because it can require integration of the data models for each of the data marts as well. If dimensional modeling were being used, the integration might take place at the dimension level. Perhaps there could be a more global model that contains the dimensions for the organization. Then when data marts, or multiple instances of a data warehouse, are implemented, the dimensions used could be subsets of those in the global model. This would enable easier integration and consistency in the implementation. · Data marts can be dependent or independent. In the previous consideration we addressed dependent data marts with their need for integration. Independent data marts are basically smaller in scope data warehouses that are stand-alone. In this case the data models can also be independent, but you must understand that this type of implementation can result in data redundancy, inconsistency, and currency levels. The key message of the life cycle diagram is the iterative nature of data warehouse development. This, more than anything else, distinguishes the life cycle of a data warehouse project from other development projects. Whereas all projects have some degree of iteration, data warehouse projects take iteration to the extreme to enable fast delivery of portions of a warehouse. Thus portions of a data warehouse can be delivered while others are still being developed. In most cases, providing the user with some data warehouse function generates immediate benefits. Delivery of a data warehouse is not typically an all-or-nothing proposition. Because the emphasis of this book is on modeling for the data warehouse, we have left out discussion about infrastructure acquisition. Although this would certainly be part of any typical data warehouse effort, it does not directly impact the modeling process.

9. Requirements Gathering

The traditional development cycle focuses on automating the process, making it faster and more efficient. The data warehouse development cycle focuses on facilitating the analysis that will change the process to make it more effective. Efficiency measures how much effort is required to meet a goal. Effectiveness measures how well a goal is being met against a set of expectations. The requirements identified at this point in the development cycle are used to build the data warehouse model. But, the requirements of an organization change over time, and what is true one day is no longer valid the next. How then, do you know when you have successfully identified the user¢s requirements? Although there is no definitive test, we propose that if your requirements address the following questions, you probably have enough information to begin modeling: · Who (people, groups, organizations) is of interest to the user? · What (functions) is the user trying to analyze? · Why does the user need the data? · When (for what point in time) does the data need to be recorded? · Where (geographically, organizationally) do relevant processes occur? · How do we measure the performance or state of the functions being analyzed? There are many methods for deriving business requirements. In general, these methods can be placed in one of two categories: source-driven requirements gathering and user-driven requirements gathering.

10. Source-Driven Requirements Gathering

Source-driven requirements gathering, as the name implies, is a method based on defining the requirements by using the source data in production operational systems. This is done by analyzing an ER model of source data if one is available or the actual physical record layouts and selecting data elements deemed to be of interest.

11. User-Driven Requirements Gathering

User-driven requirements gathering is a method based on defining the requirements by investigating the functions the users perform. This is usually done through a series of meetings and/or interviews with users. The major advantage to this approach is that the focus is on providing what is needed, rather than what is available. In general, this approach has a smaller scope than the source-driven approach. Therefore, it generally produces a useful data warehouse in a shorter timespan. 11. Data Warehouse Modeling Techniques Data warehouse modeling is the process of building a model for the data that is to be stored in the data warehouse. The model produced is an abstract model, and in this sense, it is a representation of reality, or at least a part of reality which the data warehouse is assumed to support. When considered like this, data warehouse modeling seems to resemble traditional database modeling, which most of us are familiar with in the context of database development for operational applications (OLTP database development). This resemblance should be considered with great care, however, because there are a number of significant differences between data warehouse modeling and OLTP database modeling. These differences impact not only the modeling process but also the modeling techniques to be used.

12. Selecting a Modeling

Tool Modeling for data warehousing is significantly different from modeling for operational systems. In data warehousing, quality and content are more important than retrieval response time. Structure and understanding of the data, for access and analysis, by business users is a base criterion in modeling for data warehousing, whereas operational systems are more oriented toward use by software specialists for creation of applications. Data warehousing also is more concerned with data transformation, aggregation, subsetting, controlling, and other process-oriented tasks that are typically not of concern in an operational system. The data warehouse data model also requires information about both the source data that will be used as input and how that data will be transformed and flow to the target data warehouse databases. Thus, the functions required for data modeling tools for data warehousing data modeling have significantly different requirements from those required for traditional data modeling for operational systems. In this chapter we outline some of the functions that are of importance for data modeling tools to support modeling for a data warehouse. The key functions we cover are: diagram notation for both ER models and dimensional models, reverse engineering, forward engineering, source to target mapping of data, data dictionary, and reporting. We conclude with a list of modeling tools.

13. Populating the Data Warehouse

Populating is the process of getting the source data from operational and external systems into the data warehouse and data marts. The data is captured from the operational and external systems, transformed into a usable format for the data warehouse, and finally loaded into the data warehouse or the data mart. Populating can affect the data model, and the data model can affect the populating process.

Data Warehousing

Data Warehousing

1. Data Warehousing

2. Data Warehousing Architecture and Implementation Choices

3. Architecture Choices Selection of an architecture

4. Global Warehouse Architecture

5. Architecting the Data A data warehouse

6. Structuring the Data In structuring the data

7. Data Modeling

8. The Process of Data Warehousing

9. Requirements Gathering

10. Source-Driven Requirements Gathering

11. User-Driven Requirements Gathering

12. Selecting a Modeling

13. Populating the Data Warehouse

No comments:

Post a Comment

Artificial Intelligent-IV

Report Abuse