Rough and ready notes relating to question I have about Data Warehousing and ETL:
1. What is a STAR schema: centralised FACT table that joins to any number of DIMENSION tables
2. What are FACT tables? as it name suggests it houses facts (i.e. measured variables). Fact tables are defined by the type / domain of facts that they contain e.g. a RAINFALL fact table may contain observations of “Rainfall Volume by Month by Region”.
3. What are DIMENSION tables: dimension tables are descriptive, they provide context to the observations within the fact tables. Dimension fields are often used to filter / restrict query results e.g. (silly psuedocode)
select f.factA, f.factB, f.factC from FACT_TABLE as f inner join DIMENSION_TABLE as d on d.key = f.foreign_key where d.qualities = &quot;some quality&quot;
4. What are slowly changing dimensions (SCDs)? A dimension is logically bounded data, e.g. weather data, sales data… SCDs change slowly over time c.f. on a regular schedule. Relevant changes are those that affect the logical assumptions that underpin the data e.g, a report based on performance data across regions may have underlying assumptions about the relevant performance of each region. If a performance unit was to move from one region to another region, then the relative value of that measure may be different based on the qualities of the new region. Wikipedia offer the following example:
“For example, you may have a dimension in your database that tracks the sales records of your company’s salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension?” (wikipedia)
That’s the easy ones. Larger questions still to answer:
5. How do distributed file systems impact typical / standard ETL processes
6. How well do ETL processes handle non-relational data (specifically, how well does SSIS and SQL Server Data Tools handle non-relational data)
7. Parallel processing is increasingly important as the volume of data increases, how well does SSIS / SQL Server Data Tools handle parallel / distributed processing?