
How Data Lies

Intro

‘It’s not what you say; it’s how you say it.’ The same applies to data. Sometimes data lies, and this post explains how. With this information, you can spot data quality issues and improve the value of your data.

Framing

By positively or negatively framing data, you can mislead, persuade, and lie. For example, if I said that 1% of young adults in London are violent, you may not see a massive cause for concern. However, if I said that there are 10,000 violent young adults in London, you may be more worried, even though both statements could describe exactly the same figure. We can deceive with intention or without it, and sometimes it is because we want the audience to think a certain way.

The framing effect can also involve decoys, a tactic supermarkets are famous for. For example, if a £30 drink sits next to a £10 drink, you will probably not feel as bad about buying the £10 drink. When presenting data, we can do the same thing: compare one data set to another and change the context. Framing is also notorious in investment graphs. For example, when trying to show the good performance of an investment, one may simply leave out a crash from previous years.

Relative data over absolute

Another way to present data deceitfully is to show a relative change instead of an absolute one. For example, the reported relative increase in the risk of colon cancer from eating bacon every day is 27%. In absolute terms, roughly 7 out of 100 people who eat bacon every day will eventually get colon cancer, but about 6 out of 100 people who do not eat bacon every day will get colon cancer anyway. This is an increase of one person in a hundred. From this study, the press then reported that eating bacon gave a person a 27% higher chance of being diagnosed with cancer, which is technically true but misleading.

Redmeat
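To make the difference concrete, here is a minimal sketch in Python using the rounded figures above (because the figures are rounded, the relative increase comes out near 17% rather than the 27% reported from the study):

# Rounded, illustrative figures from the paragraph above: cases per 100 people
baseline_cases = 6   # people per 100 who get colon cancer without eating bacon daily
bacon_cases = 7      # people per 100 who get colon cancer when eating bacon daily

absolute_increase = bacon_cases - baseline_cases                     # 1 extra case per 100 people
relative_increase = (bacon_cases - baseline_cases) / baseline_cases  # ~0.17 with these rounded numbers

print(f"Absolute increase: {absolute_increase} in 100")
print(f"Relative increase: {relative_increase:.0%}")

The same underlying change can therefore be headlined as ‘1 in 100’ or as a double-digit percentage, and the percentage almost always sounds scarier.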


Quantifying the unquantifiable

We often try to quantify the unquantifiable. It is not always possible to measure and weigh human experiences, anecdotes, and feelings. To some, happiness is an unquantifiable concept. If I asked you to rate your happiness from 1 to 10, you would give a number. This rating does not examine what 10 means to you; it does not examine the context, nor does it allow for examples. This gets worse when you ask a large pool of people to rate their happiness, because an average score is not valuable if happiness means different things to different people in a large group. This is not to say that quantitative data is useless. However, sometimes to give validity to quantitative data you must add context with qualitative information.

AverageHappy
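As a small illustrative sketch (entirely made-up ratings), two very different groups can produce the same average happiness score, which is exactly why the average alone tells you so little:

# Made-up 1-to-10 happiness ratings: both groups average 5.0
group_a = [5, 5, 5, 5, 5, 5]   # everyone moderately content
group_b = [1, 1, 1, 9, 9, 9]   # half very unhappy, half very happy

mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)
print(mean_a, mean_b)  # 5.0 5.0 -- identical averages hiding very different experiences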

This short post has been an introduction to three issues when representing data. For more, I would recommend looking at David Spiegelhalter’s Twitter and books!

Calculation Groups

Analysis Services tabular models can contain hundreds of base DAX measures, with heavy duplication of code and syntax. However, with the new world of DAX calculation groups, we can define a calculation once and apply it to virtually any measure.

Within Analysis Services, DAX calculation groups give the user the ability to apply reusable calculations to measures. This feature is available in the SQL Server 2019 Analysis Services CTP 2.3 preview and should be in the full SQL Server 2019 release. Currently, calculation groups can only be created with the Tabular Editor UI.

The first and most prominent example of this is time intelligence: the application of MTD (month to date), YTD (year to date), and QTD (quarter to date) calculations. By reducing the number of measures, calculation groups let us present a clearer interface to users.

Below are two images that show the calculation group hierarchy, which is used to apply a calculation item to a measure.
CalculationGroup1

CalculationGroup2

In this example, the calculation group is called ‘CalculationGroup’, the attribute is called ‘Time’, and the item shown is called ‘MTD’.

The calculation items use the new SELECTEDMEASURE() function. SELECTEDMEASURE() acts as a placeholder: when a calculation item is applied, it is replaced by whichever measure is currently being evaluated.

-- MTD calculation item
CALCULATE(SELECTEDMEASURE(), DATESMTD(DimDate[Date]))

-- YTD calculation item
CALCULATE(SELECTEDMEASURE(), DATESYTD(DimDate[Date]))

-- QTD calculation item
CALCULATE(SELECTEDMEASURE(), DATESQTD(DimDate[Date]))
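The query further down also references ‘Current’, ‘PY’, ‘PY QTD’, and ‘PY YTD’ items, which are not shown in the screenshots. As a sketch of how such items are commonly defined (my assumption rather than the definitions in the model pictured above), ‘Current’ simply returns the selected measure and the prior-year items wrap the same patterns in SAMEPERIODLASTYEAR:

-- 'Current' calculation item: the selected measure, unchanged
SELECTEDMEASURE()

-- 'PY' calculation item: the selected measure shifted back one year
CALCULATE(SELECTEDMEASURE(), SAMEPERIODLASTYEAR(DimDate[Date]))

-- 'PY QTD' calculation item: quarter to date, shifted back one year
CALCULATE(SELECTEDMEASURE(), SAMEPERIODLASTYEAR(DATESQTD(DimDate[Date])))

-- 'PY YTD' calculation item: year to date, shifted back one year
CALCULATE(SELECTEDMEASURE(), SAMEPERIODLASTYEAR(DATESYTD(DimDate[Date])))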

The query below evaluates the measure [InternetTotalSales] against each of the calculation items: ‘Current’, ‘QTD’, ‘YTD’, and so on.

EVALUATE
CALCULATETABLE (
SUMMARIZECOLUMNS (
DimDate[CalendarYear],
DimDate[EnglishMonthName],
"Current", CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "Current" ),
"QTD",     CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "QTD" ),
"YTD",     CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "YTD" ),
"PY",      CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "PY" ),
"PY QTD",  CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "PY QTD" ),
"PY YTD",  CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "PY YTD" )
),
DimDate[CalendarYear] IN { 2012, 2013 }
)

When we run this query, we get the following table:

calculationgroup3

Looking forward, we should eventually see this in Power BI, SQL Server Data Tools, and Azure Analysis Services. So, watch this space.

Throughout this year, you will see more blogs and guides from me on this topic as it unfolds.

Modern Data Warehouse Solution in a Week


In this blog, we attempt to create a data warehouse solution in a week. Is it possible? Yes. It is possible to create a modern data warehouse solution in a week, albeit not with outstanding complexity and detail. This is not how we would do things for clients, but it should give you a high-level insight into what we do.

The purpose of this experiment was to prove the concept of modern data warehouse design and to test whether it can be done in a week. Moreover, this project shows how easy it is to use Azure: with a bit of design, you can create a modern data warehouse.

The seven repeatable steps to creating a modern data warehouse solution are as follows:

1. Profile the data.

Learn from the data and understand it. Here you want to check for data quality issues, such as missing data, redundant data, or general inaccuracies. Moreover, understand the relationships between the entities (tables) and the grain of each table. For this, take 100 rows of data from a table, copy them into a spreadsheet, and highlight any issues; the sketch below shows how a similar quick check could be done in code. Remember, this is not an extensive data quality assessment; the agile development of the solution should bring more issues to the surface.

SpreadSheet
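If you would rather do this first pass in code, here is a minimal PySpark sketch of the same check. It assumes the table has already been loaded into a DataFrame called df, in the same way the notebooks later in this post load their tables:

from pyspark.sql.functions import col, count, countDistinct, when

# e.g. df = spark.read.jdbc(url=jdbcUrlSource, table="Sales.Customers", properties=connectionPropertiesSource)
sample = df.limit(100)   # a small sample, mirroring the 100-row spreadsheet check
sample.printSchema()     # column names and data types

# Null counts and distinct counts per column help surface missing or redundant data
for c in sample.columns:
    sample.select(
        count(when(col(c).isNull(), c)).alias(c + "_nulls"),
        countDistinct(col(c)).alias(c + "_distinct")
    ).show()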

2. BEAM!

For this step, we need BEAM (Business Event Analysis and Modelling). Here, we match business processes to the data. Using the seven questions Who, What, When, Where, Why, How, and How Many, we can start to understand what may belong in our fact tables and dimension tables. The example below shows a business event and the subsequent division of data across these questions. ‘How Many’ becomes a field in the fact table, and the rest sit in the dimensions.


BEAM
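As a tiny sketch of the idea (a made-up ‘customer orders product’ event rather than the one pictured), the seven questions split the event’s data like this:

# Hypothetical BEAM breakdown of a "customer orders product" business event
event = {
    "who": "Customer",             # -> customer dimension
    "what": "Product",             # -> product dimension
    "when": "Order date",          # -> date dimension
    "where": "Store",              # -> store dimension
    "why": "Promotion",            # -> promotion dimension
    "how": "Sales channel",        # -> channel dimension
    "how_many": "Order quantity",  # -> measure in the fact table
}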


3. Draw your diagram

Use pen and paper if you need to; the purpose of this step is to draw your tables and the relationships between them. Using the information from the BEAM session, create a star schema with your new dimension and fact tables. An example is below:

STAR

4. Start Mapping

Using Excel spreadsheets, map your new target table columns to the source columns. For example, the customer name in your new dimension, Customer_Dimension, previously existed as CstName in tblCustomers. A code sketch of applying such a mapping follows the image below.

MAP
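As a rough sketch of how a mapping like this can later be applied in code (the target column name CustomerName is hypothetical; only CstName and tblCustomers come from the example above), the spreadsheet translates directly into a select-and-rename step:

from pyspark.sql.functions import col

# Hypothetical mapping lifted from the spreadsheet: source column -> target column
column_mapping = {"CstName": "CustomerName"}

# df would be the DataFrame read from the source table, e.g. tblCustomers
mapped = df.select([col(src).alias(tgt) for src, tgt in column_mapping.items()])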


5. Create your skeletons

Go to the Azure portal and spin up an instance of SQL Server. Connect to it with SQL Server Management Studio. Create a database called ‘Warehouse’ and four schemas: ‘Stage’, ‘Clean’, ‘Dim’, and ‘Fct’. Create the solution’s tables to match the source tables you are transforming in Stage and Clean. In Stage, make every column varchar(250); in Clean, use the data types appropriate for your design, which should match the dimension and fact column types.

6. Use Databricks and PySpark to extract, transform, and load your data through the solution.

Load up an instance of Databricks and create a cluster. Then create separate notebooks for every table in each of the following categories: ‘SourceToStage’, ‘StageToClean’, and ‘CleanToWarehouse’.

Import the PySpark SQL modules into your notebook.

from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import *

SourceToStage

Define your data source

jdbcHostnameSource = "Sourceserver.database.windows.net"
jdbcDatabaseSource = "SourceSystemDataBase"
usernameSource = 'UserName123'
passwordSource = 'Password123'
jdbcPortSource = 1433
jdbcUrlSource = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostnameSource, jdbcPortSource, jdbcDatabaseSource, usernameSource, passwordSource)
connectionPropertiesSource = {
"user" : usernameSource,
"password" : passwordSource,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver" }

Define your data target

jdbcHostnameTarget = "targetserver.database.windows.net"
jdbcDatabaseTarget = "Warehouse"
usernameTarget = 'Username123'
passwordTarget = 'Password123'
jdbcPortTarget = 1433
jdbcUrlTarget = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostnameTarget, jdbcPortTarget, jdbcDatabaseTarget, usernameTarget, passwordTarget)
connectionPropertiesTarget = {
"user" : usernameTarget,
"password" : passwordTarget,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver" }

Create a temporary view from your source

df = spark.read.jdbc(url=jdbcUrlSource, table="Sales.Customers", properties=connectionPropertiesSource)
df.createOrReplaceTempView("source")

Create a temporary view ‘target’ from the temporary view ‘source’ using the following SQL statement. Here you want to bring every column across unchanged.

%sql
DROP TABLE IF EXISTS target;
CREATE TEMPORARY VIEW target
AS SELECT * FROM source;

Write to the new stage target table

spark.table("target").write.mode("append").jdbc(url=jdbcUrlTarget, table="Stage.Customers", properties=connectionPropertiesTarget)

StageToClean

The data source is now the warehouse database

jdbcHostnameSource = " Sourceserver.database.windows.net"
jdbcDatabaseSource = "Warehouse"
usernameSource = 'UserName123'
passwordSource = 'Password123'
jdbcPortSource = 1433
jdbcUrlSource = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostnameSource, jdbcPortSource, jdbcDatabaseSource, usernameSource, passwordSource)
connectionPropertiesSource = {
"user" : usernameSource,
"password" : passwordSource,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

Create a temporary view from your source

df = spark.read.jdbc(url=jdbcUrlSource, table="Stage.colors", properties=connectionPropertiesSource)
df.createOrReplaceTempView("source")

Create a temporary view ‘target’ from the temporary view ‘source’ using the following SQL statement; here you want to clean and manipulate columns as needed. Below I have cast the ValidFrom and ValidTo columns to dates.

%sql
DROP TABLE IF EXISTS target;
CREATE TEMPORARY VIEW target
AS SELECT 
ColorID,
ColorName,
LastEditedBy,
CAST(ValidFrom as date) As ValidFrom,
CAST(ValidTo as date) As ValidTo
FROM source;

Write to your Clean table and then truncate the Stage table.

spark.table("target").write.mode("append").jdbc(url=jdbcUrlSource, table="clean.colors", properties=connectionPropertiesSource)
spark.table("Source").write.mode("overwrite").jdbc(url=jdbcUrlSource, table="stage.colors", properties=connectionPropertiesSource)

CleanToWarehouse

Define the data source

jdbcHostnameSource = " Sourceserver.database.windows.net"
jdbcDatabaseSource = "Warehouse"
usernameSource = 'UserName123'
passwordSource = 'Password123'
jdbcPortSource = 1433
jdbcUrlSource = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostnameSource, jdbcPortSource, jdbcDatabaseSource, usernameSource, passwordSource)
connectionPropertiesSource = {
"user" : usernameSource,
"password" : passwordSource,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

Create a temporary view from the Clean table and the Dimension table

# Both the Clean table and the Dim table live in the Warehouse database, so one connection is enough
paymentmethods = spark.read.jdbc(url=jdbcUrlSource, table="clean.paymentmethods", properties=connectionPropertiesSource)
paymentmethods.createOrReplaceTempView("paymentmethods")
dimpaymentmethod = spark.read.jdbc(url=jdbcUrlSource, table="dim.paymentmethod", properties=connectionPropertiesSource)
dimpaymentmethod.createOrReplaceTempView("dimpaymentmethod")

Create your script to insert into your target table

%sql
DROP TABLE IF EXISTS target;
CREATE TEMPORARY VIEW target
AS SELECT DISTINCT
p.PaymentMethodName as PaymentMethodName
FROM paymentmethods p
WHERE NOT EXISTS (SELECT 1 FROM dimpaymentmethod d WHERE d.PaymentMethodName = p.PaymentMethodName);

Write the target table to the Warehouse

spark.table("target").write.mode("append").jdbc(url=jdbcUrlTarget, table="dim.paymentmethod", properties=connectionPropertiesTarget)

7. Orchestrate using Azure Data Factory

To orchestrate your solution, use Azure Data Factory. Once you’ve spun up an instance of Azure Data Factory, place your Databricks notebooks inside pipelines. Then run the pipelines in the following order: SourceToStage, StageToClean, Dimensions, Facts, and TruncateClean. It’s important to load the dimension tables before the fact tables, and then follow up with a truncation of Clean.

DataFac