After you have verified the dimensional model, design the
physical database. Develop strategies to handle aggregation, aggregate
navigation, indexing, and partitioning of the data in your dimensional
model.
When you design the physical database, you collect the following
metadata:
- Aggregation information, including the following information:
- Number of aggregate tables
- Load frequency and statistics
- Usage statistics
- Archive rules and statistics
- Purging rules and statistics
- Data quality and accuracy
- Indexing strategies for dimension and fact tables
Design aggregations
In simple
terms, aggregation is the process of calculating summary
data from detail base level fact table records. Aggregates are a powerful
tool to increase the query processing speed in dimensional data marts.
The aggregation is primarily performed by using attributes of a dimension
which are a part of a hierarchy.
Each attribute that belongs
to a hierarchy associates as a parent or child with other attributes
of the hierarchy. This parent-child relationship provides different
levels of summarization. The various levels of summarization provide
the business user the ability to drill up or drill down in the report.
Highly aggregated data is faster to retrieve than detailed, atomic-level
data. The fact table typically occupies a large volume of space when
compared to the aggregated data.
The lowest level of aggregation
(or the highest level of detail) is referred as the grain of the fact table. The granularity of the dimension affects the
design of data storage and how data is retrieved.
Aggregating
detailed atomic fact tables improves query performance. However, there
are costs associated with aggregation:
- Storage
- Cost to build and maintain the processes to handle the aggregated
tables
Aggregation is not a substitute for reducing the size
of large, detailed fact tables. If the data in the fact table is summarized,
the detailed information that is in the form of dimensions and measures
is often lost. If the business needs detailed data from a summarized
fact table, that data is not available. The user would need to look
for the details in the source OLTP system that provided the aggregated
fact table data. If the business must use the source OLTP systems
to get the answers, then reevaluate the need for a dimensional model.
Avoid mixing aggregated data and detailed data by including year-to-date
aggregated measures with the detailed measures. Year-to-date measures
are additive, and if you mix the data types, miscalculations can occur.
To prepare aggregate tables, perform the following steps:
- Identify all dimensions and their hierarchies from the base level
atomic dimensional model. These dimensions and hierarchies are identified
from the base-level atomic dimensional model.
- Identify all possible combinations of these hierarchy attributes
which are used together by business for reporting. Identify all attributes
from the hierarchies to determine which of the attributes are used
together frequently. This step is especially important if there are
a huge number of dimensions with several hierarchies that have several
attributes in them.
- Calculate the number of values that each attribute has. The number
of values each attribute has is indicative of whether the attribute
is aggregated. For example, if you include a low-level member that
has many members, then you may drop that attribute and choose a higher
level attribute, which would have fewer values.
- Validate the final set of attribute candidates and build the aggregated
dimensional model.
Create indexes
You can improve query
performance by creating indexes. To learn more about indexes, see
Indexes.
Partition tables
When you partition a table, you divide the table by row, by column,
or both. If a table is divided by column, it is said to be vertically
partitioned. If you divide by row, it is said to be horizontally
partitioned. Partitioning large fact tables improves performance
because each partition is more manageable. Partition a table based
on the transaction date dimension in a dimensional model. For example,
if a huge fact table has billions of rows, it would be ideal for one
month of data to be assigned its own partition.
You achieve
several goals by partitioning the data in a data warehouse, including
the following goals:
- Providing flexible access to data
- Providing easy, efficient data management services
- Ensuring data warehouse scalability
- Enabling elements of the data warehouse to be portable, which
allows those elements to be shared across data warehouses or archived
easily
- Improving query response times
- Making maintenance easier, since partitions are easier to maintain
than large tables