Tuesday, March 2, 2010

Just the Facts?

A reader asks:
I’m having a problem in picking out facts. For example are all dates facts -- start date, end date and DoB?  What about personal data like post code, job title, status (active /left)  etc. Would you say that if you cant do math on it, its not a fact?
- [Name Withheld], UK
Good question.  It allows me to cover the basics, as well as touch on some advanced topics.

First, the Basics..


Facts vs. Dimensions

Facts are measurements.  I really prefer that word, because it is a bit more descriptive.  But the rest of the world already knows "Facts".

Examples that are clearly measurements:
  • order dollars
  • gross pay
  • quantity sold
  • account balance
In the original question, all the attributes mentioned are dimensions.  (Dates, personal data, and so forth.)

Dimensions are used to give facts context.  You use dimension values to filter or group facts.  Here are some in action:
  • order dollars (fact) by order date (dimension)
  • gross pay (fact) by department code (dimension)
  • quantity sold (fact) for January (dimension value)
You can usually "do math" on facts.  The most common example is adding them together, or aggregating them, as might be done in the examples above.

Don't use that as a criterion to to identify facts, though.  Many numbers are not facts. And, to make matters more complicated, and some non-numbers are facts.

Many Numbers are Not Facts

Some numbers are really dimensions.  You can tell because we use them to give context to facts.  A numeric department code is obviously a dimension.  We wouldn't normally add the values together, but we might use them to break out a measurement -- for example: budget (fact) by department (dimension).

Less obvious are things like "unit price".  It sounds like it might be a fact. After all, we can certainly add up money, right?   Actually, when we add up sales, we don't add up unit prices, we add up "extended prices".  Extended price is a fact.  It is equivalent to unit price times quantity.

Look again, and you will also see that unit price behaves like the other dimensions we looked at.  We can use it to group or filter facts. "How many did we sell at a unit price of 10 cents?" That question filters a fact (quantity sold) by a dimension value (10 cent unit price).

Some Facts are Not Numbers

There are also facts that are not numbers.  These are rare, but they do occur.

Test results or grades are the most common example.  You can't do math on "pass" or "fail," but they are measurements.  Facts like this are sometimes called "text facts."  A test_result fact that can take on the values "pass" or "fail" might be stored in a fact table that contains a row for each student taking a particular test on a particular day.

If they take on a relatively low number of possible values, consider converting text facts into discrete facts.  In this case, we might replace test_result with two facts:  number_passed and number_failed.  For each row in the fact table, one of these will take on the value 1, the other the value 0.  This may seem wasteful, but now it is easy to get a sum(number_passed) or sum(number_failed.)

Some Stars Have No Facts

If you don't have any facts, you can still have a fact table.  Sometimes, the mere occurrence of an event is all you need to know about.  For example, you might use a fact table to track the fact that a employee was present on a given day.

Factless fact tables can also track conditions.  For example, each year, circumstances might make employees eligible to participate in particular programs.   Aside from the dimensions (Employee, Year and Program) there might not be any facts to record.  That's OK, you can still count up rows that meet various conditions.

- Chris

Do you have a question?  Send it in.  I answer all my email, though sometimes it takes a while.

Image: Jack Webb as Joe "Just the Facts" Friday,
from the Public Domain

Wednesday, February 3, 2010

Dimensional and Relational: Not Opposites

A common misconception holds that the terms dimensional and relational are opposites. They are not.

The word "dimensional" describes a design method. The word "relational" describes a data storage technology.
  • A dimensional model is a design approach that describes  a process in terms of measurements (known as facts) and their context (dimensions)
  • A star schema is a dimensional model implemented using relational storage technology—that is, in a relational database (RDBMS)
  • A cube is a dimensional model implemented using multi-dimensional storage technology—that is, in a multidimensional database (MDB)
This simple diagram illustrates these concepts:

As you can see from the diagram, a star schema is both relational and dimensional. So is a snowflake schema.

By the way, don't let this confuse you:  most modern day DBMS products accommodate both kinds of storage.

Thursday, January 14, 2010

Accumulating Snapshots: Are They Necessary? [Q&A]

A reader asks the following question:

Q: We have an accumulating snapshot that tracks trouble tickets from submission to closure. The fact table has one row for each trouble ticket. Why do we need it? Can't we just place all this information in the trouble ticket dimension?
A: You probably do need the fact table, but it might be worth putting some of the information in the dimension table as well.

Before I explain, here is a quick refresher on the accumulating snapshot.

Accumulating Snapshot

An accumulating snapshot is a kind of fact table used to correlate activities surrounding a specified item. In this case, that item is a trouble ticket.


Accumulating snapshots are useful because they simplify the queries necessary to study elapsed time between events. In the reader's case, that might be the time between submission and closure of a ticket (there are probably some interim steps as well).

On the operational side, these events are probably recorded in different rows, and maybe even in different systems. Figuring out the days between these events can be messy, especially if averaging a results for a large number of tickets. Correlated subqueries will be reqiured, and performance may be poor.

The accumulating snapshot does the correlation of these activities ahead of time, as part of the ETL process. In this case, it has one row for each ticket. Presumably, there are foreign keys that identify the ticket, the date submitted, and the date closed. An optional fact can record the elapsed time between these dates. This fact is very easy to aggregate and group by any dimension of interest. No need for for correlated subqueries.

You can read more about accumulating snapshots in a prior post.

The Fact Table is Necessary

The reader has noted that the fact table has one row per trouble ticket. (That's a good statement of grain, by the way.) Since there is only one row for each ticket, it might seem that the content of the fact table could be simply be placed in the ticket dimension.

This is probably not a good idea. Though your fact table has one row per ticket, your ticket dimension may have more than one, due to type 2 changes that track history.  For a particular trouble ticket, when a type 2 change occurs, a new row is added for the new version of the ticket. A given ticket may therefore have more than one row in the dimension.

If you were to place a fact in this dimension table, like the days between submission and closure, it would be repeated for each version of the same ticket. Aggregating it would result in double counting.

The fact is better placed in the fact table, where they can safely be summed, averaged, or whatever is necessary.

(By the way, the same would occur if you placed the two dates in the dimension table. They would repeat for each version of the ticket. If you try to aggregate the elapsed time between the dates, the same ticket would be counted multiple times -- once for each version.)

Place Information in the Dimension Also (Not Instead)


Information tracked by the accumulating snapshot may also be useful in the dimension table. But here, its purpose is to serve as dimensional data.  Not as a replacement for the fact table.

For example, the dates the ticket was opened and closed could make helpful type 1 attributes in the ticket dimension.  This is especially useful if the dimension is connected to other fact tables, such as one that tracks phone calls to the support center. The dates can be used to group call counts in interesting ways. For example, "How many support calls were generated by tickets opened on January 1 vs. January 2."

If you want to, you can also place the elapsed days fact in the dimension table. Here, it should be used strictly as a behavioral dimension -- not as a fact.  Use it to group results. Don't try to use it as a fact; that would cause double counting. You might even want to convert it into bands to avoid confusion (1-5 days, 6-10 days, over 10 days.)

- Chris
Image Credit:  Creativity103 via Creative Commons

Do you have a question of your own? Send it in. I'd be happy to answer.