If you have read any of my other Power Query blogs, you will have seen the results of transformed and mashed up queries - but not how it was achieved. This article picks out my favourite 5 Extract, Transform & Load (ETL) features in Power Query.
Being a keen sports fan, I wanted to find out the medal history of the 100m Olympic Finals for both men and women.
I’ve picked two datasets, both from different sources. You can download them using the links below:
- Source 1 - CSV
o ISO Countries.csv - https://datahub.io/dataset/iso-3166-1-alpha-2-country-codes/resource/9c3b30dd-f5f3-4bbe-a3cb-d7b2c21d66ce
- Source 2 – Web
o There is a lot of other data on the site but for this example, I have only picked the race result data tables - Men  and Women .
Please Note: I am happy to provide the Power BI Workbook upon request. Leave a comment below and I will get back to you ASAP.
The imported data (pre transformed) should look like the below in Power Query. You may notice some ETL has already taken place, but this was automatically carried out by Power Query and includes simple tasks like displaying column headers.
There are many occasions when we need to completely flip the contents of a table. For our example, ‘Gold’, ‘Silver’ and ‘Bronze’ should not be columns, but rows instead. We only need one column and this can be named ‘MedalType’. The un-pivot feature can be achieved (on a whole query) by just one click of a button. Ensure that following columns and features are selected:
Once the data is un-pivoted, rename the new columns to ‘Athlete’ and ‘MedalType’ respectively. Repeat the process for the Men [Edit] query, which looks like the following:
When quickly eyeballing the data (in Men [Edit]), it is apparent that two athletes shared a Bronze medal in 1896. We need to handle this data defect but unfortunately, there isn’t a one click method of achieving this. The easiest way is to use the Replace Values and Remove Top/Bottom Rows UI functions.
Firstly, let’s duplicate the full Men [Edit] query by right clicking on it and selecting Duplicate. We now want to remove all rec0rds from the copied query, apart from the one that contains two bronze medalists. To do this, right click on the little table button in the corner of the query and choose either of the following:
It will then specify how many top or bottom rows you want to remove. Ensure the only record left looks like the below:
Lastly, the Replace Values function is used to replace ‘Francis Lane’ with an empty string and therefore, removing the duplicate. The aim is to insert this one record back into the main Men [Edit] query. Right click on ‘Athlete’ and select Replace Values. Type the below text in the function box and click OK.
Value to Find: Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)
Replace With: Alajos Szokolyi (HUN)
The ‘#(if)’ elements of the search string is looking for special characters, otherwise the replace will not work.
We must tidy up the Men [Edit] query by doing the same as above, but the opposite way around. This leaves us with the following:
This is another neat feature in Power Query. In SQL Server or SSIS terms, this provides a union between two queries. If the column names and data types do not match, then it may fail or give unexpected results.
Open the Men [Edit] query, Click the Home tab in the top menu pane and select Append Queries. A pop up window will open, where you can specify the query to append into Men [Edit]. For this example, I selected the below and clicked OK.
- It is possible to append multiple tables and in fact, we need to do this in order to bring both the Men [Edit] (2) and the Women [Edit] queries into Men [Edit].
- To make it easy to distinguish between male and female, I have added a new column to the queries that specifies gender. This is not mandatory, especially if you are following this example.
- If we select the Two tables option, then only one query can be appended at a given time.
We can rename the newly appended query to Athletes and tidy up the Men [Edit] (2) and Women [Edit] queries. Putting raw queries into an appropriately named Group helps distinguish each part of our ETL – see below:
There are more queries now but the Raw folder contains imported, untouched queries, whereas Transform includes tables that have been modified and joined together. The Load folder will eventually contain the star schema tables (dimensions and facts).
Add Custom Column
Any developer using ETL will have needed to derive a column before. Power Query offers this functionality, both through the UI and M query language. We want to create a new column called ‘Nationality’, that moves the characters (in between the brackets) from ‘Athlete’.
In Power Query, a derived column is called Custom Column. Ensure the Athletes query is selected, then navigate to the Add Column tab in the top menu pane and select the Add Custom Column button. The pop up box appears, where we can write some custom M code:
- Text.End is the equivalent to the RIGHT function in SQL Server.
- Text.Start is equivalent to LEFT.
- Wrapping them both into one line of code ensure the string inside the brackets is taken from Athletes.
- Rename the new column to Nationality.
This feature is the same as applying a JOIN in SQL Server. In simple terms, it gives us the ability to mashup and bring together queries from different data sources. We will bring the ISOCountriesRaw data into the Athletes query.
Under the Home tab, select Merge Queries. A new window appears, where we can specify the tables and columns that need to be joined. Simply click select the ISOCountriesRaw query and the two columns shown in the below screenshot. The Join Kind should be a Left Outer, as this will bring ONLY matching records across ISOCountriesRaw, keeping all records in the main Athletes query.
The wizard will also give you a match preview and for this example, 142 of the 144 records in Athletes matches the ISOCountriesRaw table. Click OK to complete the merge.
There are a number of columns that aren’t needed in this table and can be filtered out accordingly:
There are two records that don’t match because the medal winner was subsequently removed from the record books. We can use the Replace Values function to replace the nulls with some more meaningful information (see below). We are now ready to turn the data into a star schema!
Now we have a fully transformed query, it can be broken down into a fact table and dimensions. I will not demonstrate in this blog but by following Kimball’s data warehousing principles, there is now a star schema that can be plugged into Power Pivot or Power View.
NOTE: Add Index Column was used to create primary and foreign keys on the ‘Load’ tables. See ‘Other Features’ for link on how to do this.
There are many other ETL related functions within Power Query that have not been discussed in this blog. It is also possible to enhance the UI features using M or write specialist code to enhance and configure your ETL process, which I demonstrated above.
For further information, I would recommend reading the below articles:
- 5 Very Useful Text Formulas - Ken Puls
- Add Index Column – Microsoft
- Split Columns - Mark Vaillancourt
- Group By – Microsoft
- ETL Examples – Chris Webb