Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Used by permission.
Many business processes can be understood in terms of a series of stages or steps that something must undergo. In made-to-order manufacturing, an individual item is ordered, manufactured, quality assured, packaged, and shipped. In banking, a mortgage application is submitted, reviewed by a loan officer, evaluated by an underwriter, and eventually consummated at a closing.
When the individual items that pass through the process are readily identifiable (an individual item being manufactured; a specific mortgage application), and the processing steps or milestones are predicable (the manufacturing process stages, the mortgage approval phases), its status can be tracked by an accumulating snapshot model. Such a model provides a way to understand the status of an individual item at the detail level, the workload at each processing step, and the efficiency with which items move from one stage to the next.
Consider the process by which a mortgage application moves from initial submission to closing. The process begins when an applicant completes a mortgage application with a mortgage officer. A processor then takes over, ensuring that documentation required to evaluate the loan application is gathered. Next, the application is reviewed by an underwriter, who reviews the application and supporting documentation, evaluates potential risk, and approves or rejects the application. The amount approved for underwriting may be less than the amount of the original application. Finally, the mortgage becomes legally binding at the time of closing.
Figure 1, below, shows an accumulating snapshot for the mortgage application process. The grain of this fact table is an application. Each application will be represented by a single row in the fact table. The major milestones are represented by multiple foreign key references to the Day dimension—the date of submission, the date approved by mortgage officer, the date all supporting documentation was complete, the date approved by an underwriter, and the date of closing.

Figure 1: An accumulating snapshot tracking mortgage applications.
Note that it is not necessary to build five Day dimension tables. A single Day table can serve in all these roles; it will simply be aliased at query time or through a series of views. Similarly, the three employees involved in the process are represented by three foreign key references to the Employee dimension table—one for the mortgage officer, one for the processor that assembles the supporting materials, and one for the underwriter.
The facts include several dollar amounts. The amount of the original application is the application_amount. The amounts approved and rejected by the mortgage officer and underwriter are recorded as officer_approved_amount and underwritten_amount. The amount actually loaned at closing is the closing_amount.
When the application is first submitted, the only date that is known is the application submission date. The appropriate day_key value is assigned to the application_submission_date_key. In order to accommodate the dates that are as yet unknown, the Day dimension must contain a row for "unknown" or "n/a." The key value for this row will be assigned to the remaining dates. The application_amount is set to the requested mortgage amount from the application. The remaining dollar amounts are initialized to zero.
As the application works its way through the process, the fact table row will be updated. The various date keys are revised as each milestone is reached. The additional facts are filled in as the appropriate milestones are reached.
The fact table also contains a series of lag columns. These represent the number of days the application spends at each stage. Each is a simple calculation based on milestone dates, populated once it completes the appropriate stage. They allow measurement of processing time.
The fact table also contains a series of quantities, each corresponding to one of the milestones. The value of each count is zero or one, depending on whether the application has reached the corresponding milestone. These facilitate workload reporting, and will take on other values when aggregated. To make it easier to identify the current status of an application, a status attribute may be added to the application dimension or as a degenerate dimension in the fact table.
Using this schema, it is possible to study the lifecycle of a particular application, the workload at each processing step, and the efficiency of the overall process itself.
Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Copyright (c) 2006 by Wiley Publishing, Inc
Used by permission.












10 comments:
Chris:
You have mentioned 'Each application will be represented by a single row in the fact table'. You mean dimension table or may be I need more clarification. Shouldn't there be additional keys in fact table to qualify this stmt?
-SL
SL:
Each mortgage application has one row in the dimension table and one row in the fact table. It is the single row in the fact table that is repeatedly updated as the application passes through the various processing steps.
Two challenges: what if the number of business events is extreme? And what if it variable? I'm dealing with a business process where the customer can clearly articulate the need for about 50 dates. For some of those events, it can occur any number of times. For example, it's about insurance, and a customer can ask for a requote any number of times before asking to have it issued as a policy. How do you deal with that many events and with the variability?
A large number of milestones means a large number of key values. That is not a problem in and of itself, as long as the technical considerations are seen to by the database administrator. But it also means a large number of lag calculations, and that can really cause row size to balloon.
If the process is not consistent, or the same event type may occur a varying number of times, the accumulating snapshot may not be the best solution.
In both cases, you may be able to solve the problem by scaling back what is logged in the accumulating snapshot. You may be able to identify a subset of activities to model, or specify rules on how the row is to be updated if a particular milestone is achieved more than once.
If all that fails, develop a standard status model. Some of your reports will require intensive processing; they can be scheduled as batch processes and cached so that the results are ready when the reports are accessed.
I'm curious why the Lag columns are needed. Shouldn't that data be garnishable by somehow measuring between each event? ie. DateDiff(d,app_submitted_date_key,officer_approval_date_key)
Strictly speaking, the lag columns are not required. They are very useful, however, when reports will aggregate (or average) the lags over large numbers of records.
example:
sum( approval_lag) group by officer.
It won't be necessary for the RDBMS to compute a huge number of detail-level lags prior to aggregating query results.
How do you handle steps that are measured in under a day (i.e. minutes or seconds)?
Represent each milestone with a date and time, rather than just a date. Date and time should be separate dimensions, so that they do not get too big. Then you can measure elapsed time in seconds, or whatever increment you wish.
Chris,
Do you think Accumulating Snapshots are appropriate for monitoring student status.The student might be in the system for say 1-4 years. Kimball in is book states that these are ideal for short term processes.
I have student status monitoring source table with about 15 status codes and separate records for each status.I might end up with 50-60 attributes in the fact table if I implement Accumulating Snapshot
Post a Comment