Data Warehousing
1. Data Warehousing
The question most asked now is, How do I build a data warehouse? This is a question that is not so easy to answer. As you will see in this Artical, there are many approaches to building one. However, at the end of all the research, planning, and architecting, you will come to realize that it all starts with a firm foundation. Whether you are building a large centralized data warehouse, one or more smaller distributed data warehouses (sometimes called data marts), or some combination of the two, you will always come to the point where you must decide on how the data is to be structured.This is, after all, one of the most key concepts in data warehousing and what differentiates it from the more typical operational database and decision support application building. That is, you structure the data and build applications around it rather than structuring applications and bringing data to them.
Data warehouse modeling is a process that produces abstract data models for one or more database components of the data warehouse. It is one part of the overall data warehouse development process, which is comprised of other major processes such as data warehouse architecture, design, and construction. We consider the data warehouse modeling process to consist of all tasks related to requirements gathering, analysis, validation, and modeling. Typically for data warehouse development, these tasks are difficult to separate.This may suggest a rather broad gap between modeling and design activities, which in reality certainly is not the case.
The separation between modeling and design is done for practical reasons: it is our intention to cover the modeling activities and techniques quite extensively. Some trend-setting authors and data warehouse consultants have taken this point to what we consider to be the extreme. That is, they are presenting what they are calling a totally new approach to data modeling. It is called dimensional data modeling, or fact/dimension modeling. Fancy names have been invented to refer to different types of dimensional models, such as star models and snowflake models. Numerous arguments have been presented against traditional entity-relationship (ER) modeling, when used for modeling data in the data warehouse.
Rather than taking this more extreme position, we believe that every technique has its area of usability. For example, we do support the many criticisms of ER modeling when considered in a specific context of data warehouse data modeling, and there are also criticisms of dimensional modeling. There are many types of data warehouse applications for which ER modeling is not well suited, especially those that address the needs of a well-identified community of data analysts interested primarily in analyzing their business measures in their business context.
Likewise, there are data warehouse applications that are not well supported at all by star or snowflake models alone. For example, dimensional modeling is not very suitable for making large, corporatewide data models for a data warehouse. With the changing data warehouse landscape and the need for data warehouse modeling, the new modeling approaches and the controversies surrounding traditional modeling and the dimensional modeling approach all merit investigation. And that is another purpose of this post. Because it presents details of data warehouse modeling processes and techniques, the post can also be used as an initiating for those who want to learn data warehouse modeling.
2. Data Warehousing Architecture and Implementation
Choices
In this post we discuss the architecture and implementation choices available
for data warehousing. During the discussions we may use the term data mart.
Data marts, simply defined, are smaller data warehouses that can function
independently or can be interconnected to form a global integrated data
warehouse. However, in this post, unless noted otherwise, use of the term data
warehouse also implies data mart.
Although it is not always the case, choosing an architecture should be done prior
to beginning implementation. The architecture can be determined, or modified,
after implementation begins. However, a longer delay typically means an
increased volume of rework. And, everyone knows that it is more time
consuming and difficult to do rework after the fact than to do it right, or very
close to right, the first time. The architecture choice selected is a management
decision that will be based on such factors as the current infrastructure,
business environment, desired management and control structure, commitment
to and scope of the implementation effort, capability of the technical environment
the organization employs, and resources available.The implementation approach selected is also a management decision, and one that can have a dramatic impact on the success of a data warehousing project. The variables affected by that choice are time to completion, return-on-investment, speed of benefit realization, user satisfaction, potential implementation rework, resource requirements needed at any point-in-time, and the data warehouse architecture selected.
3. Architecture Choices Selection of an architecture
Architecture Choices Selection of an architecture will determine, or be determined by, where the data warehouses and/or data marts themselves will reside and where the control resides. For example, the data can reside in a central location that is managed centrally. Or, the data can reside in distributed local and/or remote locations that are either managed centrally or independently. The architecture choices we consider in this book are global, independent, interconnected, or some combination of all three. The implementation choices to be considered are top down, bottom up, or a combination of both. It should be understood that the architecture choices and the implementation choices can also be used in combinations.For example, a data warehouse architecture could be physically distributed, managed centrally, and implemented from the bottom up starting with data marts that service a particular workgroup, department, or line of business.
4. Global Warehouse Architecture
A global data warehouse is considered one that will support all, or a large part, of the corporation that has the requirement for a more fully integrated data warehouse with a high degree of data access and usage across departments or lines-of-business. That is, it is designed and constructed based on the needs of the enterprise as a whole. It could be considered to be a common repository for decision support data that is available across the entire organization, or a large subset thereof. Top Down Implementation A top down implementation requires more planning and design work to be completed at the beginning of the project. This brings with it the need to involve people from each of the workgroups, departments, or lines of business that will be participating in the data warehouse implementation. Decisions concerning data sources to be used, security, data structure, data quality, data standards, and an overall data model will typically need to be completed before actual implementation begins. The top down implementation can also imply more of a need for an enterprisewide or corporatewide data warehouse with a higher degree of cross workgroup, department, or line of business access to the data.This approach is depicted in with this approach, it is more typical to structure a global data warehouse. If data marts are included in the configuration, they are typically built afterward. And, they are more typically populated from the global data warehouse rather than directly from the operational or external data sources. Bottom Up Implementation A bottom up implementation involves the planning and designing of data marts without waiting for a more global infrastructure to be put in place. This does not mean that a more global infrastructure will not be developed; it will be built incrementally as initial data mart implementations expand. This approach is more widely accepted today than the top down approach because immediate results from the data marts can be realized and used as justification for expanding to a more global implementation. depicts the bottom up approach. In contrast to the top down approach, data marts can be built before, or in parallel with, a global data warehouse. And as the figure shows, data marts can be populated either from a global data warehouse or directly from the operational or external data sources.
5. Architecting the Data A data warehouse
Architecting the Data A data warehouse is, by definition, a subject-oriented, integrated, time-variant collection of data to enable decision making across a disparate group of users. One of the most basic concepts of data warehousing is to clean, filter, transform, summarize, and aggregate the data, and then put it in a structure for easy access and analysis by those users. But, that structure must first be defined and that is the task of the data warehouse model. In modeling a data warehouse, we begin by architecting the data. By architecting the data, we structure and locate it according to its characteristics. In this chapter, we review the types of data used in data warehousing and provide some basic hints and tips for architecting that data. We then discuss approaches to developing a data warehouse data model along with some of the considerations. Having an enterprise data model (EDM) available would be very helpful, but not required, in developing the data warehouse data model. For example, from the EDM you can derive the general scope and understanding of the business requirements. The EDM would also let you relate the data elements and the physical design to a specific area of interest. Data granularity is one of the most important criteria in architecting the data. On one hand, having data of a high granularity can support any query. However, having a large volume of data that must be manipulated and managed could be an issue as it would impact response times.On the other hand, having data of a low granularity would support only specific queries. But, with the reduced volume of data, you would realize significant improvements in performance.
6. Structuring the Data In structuring the data
Structuring the Data In structuring the data, for data warehousing, we can distinguish three basic types of data that can be used to satisfy the requirements of an organization: · Real-time data · Derived data · Reconciled data In this section, we describe these three types of data according to usage, scope, and currency. You can configure an appropriate data warehouse based on these three data types, with consideration for the requirements of any particular implementation effort. Depending on the nature of the operational systems, the type of business, and the number of users that access the data warehouse, you can combine the three types of data to create the most appropriate architecture for the data warehouse.7. Data Modeling
Data Modeling for a Data Warehouse This chapter provides you with a basic understanding of data modeling, specifically for the purpose of implementing a data warehouse. Data warehousing has become generally accepted as the best approach for providing an integrated, consistent source of data for use in data analysis and business decision making. However, data warehousing can present complex issues and require significant time and resources to implement. This is especially true when implementing on a corporatewide basis. To receive benefits faster, the implementation approach of choice has become bottom up with data marts. Implementing in these small increments of small scope provides a larger return-on-investment in a short amount of time. Implementing data marts does not preclude the implementation of a global data warehouse.It has been shown that data marts can scale up or be integrated to provide a global data warehouse solution for an organization. Whether you approach data warehousing from a global perspective or begin by implementing data marts, the benefits from data warehousing are significant. The question then becomes, How should the data warehouse databases be designed to best support the needs of the data warehouse users? Answering that question is the task of the data modeler. Data modeling is, by necessity, part of every data processing task, and data warehousing is no exception. As we discuss this topic, unless otherwise specified, the term data warehouse also implies data mart. We consider two basic data modeling techniques in this book: ER modeling and dimensional modeling. In the operational environment, the ER modeling technique has been the technique of choice. With the advent of data warehousing, the requirement has emerged for a technique that supports a data analysis environment. Although ER models can be used to support a data warehouse environment, there is now an increased interest in dimensional modeling for that task. In this chapter, we review why data modeling is important for data warehousing. Then we describe the basic concepts and characteristics of ER modeling and dimensional modeling.
8. The Process of Data Warehousing
The Process of Data Warehousing This chapter presents a basic methodology for developing a data warehouse. The ideas presented generally apply equally to a data warehouse or a data mart. Therefore, when we use the term data warehouse you can infer data mart. If something applies only to one or the other, that will be explicitly stated. We focus on the process of data modeling for the data warehouse and provide an extended section on the subject but discuss it in the larger context of data warehouse development. The process of developing a data warehouse is similar in many respects to any other development project. Therefore, the process follows a similar path. What follows is a typical, and likely familiar, development cycle with emphasis on how the different components of the cycle affect your data warehouse modeling efforts.It is certainly true that there is no one correct or definitive life cycle for developing a data warehouse. We have chosen one simply because it seems to work well for us. Because our focus is really on modeling, the specific life cycle is not an issue here. What is essential is that we identify what you need to know to create an effective model for your data warehouse environment. There are a number of considerations that must be taken into account as we discuss the data warehouse development life cycle. We need not dwell on them, but be aware of how they affect the development effort and understand how they will affect the overall data warehouse design and model. · The life cycle diagram in seems to infer a single instance of a data warehouse. Clearly, this should be considered a logical view. That is, there could be multiple physical instances of a data warehouse involved in the environment. As an example, consider an implementation where there are multiple data marts. In this case you would iterate through the tasks in the life cycle for each data mart. This approach, however, brings with it an additional consideration, namely, the integration of the data marts. This integration can have an impact on the physical data, with considerations for redundancy, inconsistency, and currency levels. Integration is also especially important because it can require integration of the data models for each of the data marts as well. If dimensional modeling were being used, the integration might take place at the dimension level. Perhaps there could be a more global model that contains the dimensions for the organization. Then when data marts, or multiple instances of a data warehouse, are implemented, the dimensions used could be subsets of those in the global model. This would enable easier integration and consistency in the implementation. · Data marts can be dependent or independent. In the previous consideration we addressed dependent data marts with their need for integration. Independent data marts are basically smaller in scope data warehouses that are stand-alone. In this case the data models can also be independent, but you must understand that this type of implementation can result in data redundancy, inconsistency, and currency levels. The key message of the life cycle diagram is the iterative nature of data warehouse development. This, more than anything else, distinguishes the life cycle of a data warehouse project from other development projects. Whereas all projects have some degree of iteration, data warehouse projects take iteration to the extreme to enable fast delivery of portions of a warehouse. Thus portions of a data warehouse can be delivered while others are still being developed. In most cases, providing the user with some data warehouse function generates immediate benefits. Delivery of a data warehouse is not typically an all-or-nothing proposition. Because the emphasis of this book is on modeling for the data warehouse, we have left out discussion about infrastructure acquisition. Although this would certainly be part of any typical data warehouse effort, it does not directly impact the modeling process.
No comments:
Post a Comment