
Callum Green's Blog

Handling Web Page Errors in Power BI Query Editor

I was recently asked whether it is possible to handle error rows in the Query Editor – more specifically, errors caused by web pages that do not exist at source. The user approached me with this question after following a previous blog on looping through multiple web pages in the Query Editor – found here.

In my blog post, there were no errors from the loop output, but in the user’s workbook they were losing data directly below the expanded error row. My first suggestion was to use the Remove Errors UI option, which would delete the problematic row. However, they wanted a way of treating the errors as an Unknown Member and did not want to lose the data entirely.

It is assumed consumers of this blog already have knowledge of the Power BI Query Editor and the UI functions used in the examples.  

Scenario

I created a new workbook that connects to local Council Facebook pages. Each page has a unique Object ID, which will be used as the parameter in the loop. The Council “Camberley” deliberately contains an invalid Object ID. I then proceeded to create a Parameter and Function, replicating the exact steps from my previous blog.

When I invoke the function (through the use of a Custom Column), the following is produced:

clip_image001[4]

As you can see, Camberley Council produces an error in the fnInvokeCouncils column. If we expand the contents (highlighted in yellow), the Facebook page data appears. Upon further inspection, the Farnham and Rushmoor council data are available, but Camberley (incorrect Object ID) and Guildford are not.

clip_image003[4]

The error message is a little misleading but let’s save the debugging debate for another day. The key observation is “Guildford” data is not available, simply because it comes after “Camberley” in the list. Whilst we want to see errors in a Query, we do not want them causing data loss.

Resolution

As I mentioned at the beginning of this article, using the Remove Errors function would prevent the loss of Guildford data. However, the user needs to handle errors as Unknown Members and conform to a typical Kimball Data Warehouse. 

I am sure there are many ways to fulfil the requirement, but here is how I approached it (a consolidated M sketch of the error-rows query follows the steps below):

1.       Duplicate the existing ‘Councils’ query, naming it ‘Councils Error Rows’.

2.       Switch back to the ‘Councils’ query and Remove Errors, leaving only three records:

 

clip_image005[4]

3.       Expand the fnInvokeCouncils column, opening up the underlying fields and data:


clip_image007[4]

4.       In the ‘Councils Error Rows’ query, apply the Replace Errors UI function – inserting the string “Validation Failed”.

5.       Add a Custom Column named ErrorRows, writing the following M:

if [fnInvokeCouncils] = "Validation Failed" then 1 else 0

This is a simple IF statement that flags the error rows with a value of 1.

6.       Now filter the selection to only display rows where ErrorRows has the value of 1. This is achieved by using the Filter Rows UI function. The ‘Councils Error Rows’ query now looks like the following:


clip_image008[4]

7.       The columns must match the ‘Councils’ query, meaning 4 new Custom Columns are needed. We can hardcode the values and remove any unwanted columns.

 

clip_image010[4]

 

8.       Right click on the previously modified ‘Councils’ query and select Reference. Name the new query ‘All Councils’. This makes it easier to track the transformations and persists any future changes made to the raw data.

9.       Within the ‘All Councils’ query, select the Append Queries transformation. Choose ‘Councils Error Rows’ as the table to append and click OK.
 

10.   We are now left with a Union of both datasets, containing the Unknown Member and data from all other Councils.

clip_image012[4]

11.   The Unknown Member record is visible within the final Query.

 

clip_image014[4]
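For reference, here is a minimal M sketch of the ‘Councils Error Rows’ query built by steps 4–6. It assumes the invoked function column is called fnInvokeCouncils (as in the screenshots) and uses CouncilsWithErrors as a placeholder for the duplicate query’s last existing step – the duplicate was taken before errors were removed from ‘Councils’, so it still contains the error rows.

let
    // Placeholder for the duplicated query's last step (still contains the error rows)
    Source = CouncilsWithErrors,
    // Step 4 - replace any errors with a recognisable string
    ReplacedErrors = Table.ReplaceErrorValues(Source, {{"fnInvokeCouncils", "Validation Failed"}}),
    // Step 5 - flag the error rows with a 1
    AddedErrorFlag = Table.AddColumn(ReplacedErrors, "ErrorRows",
        each if [fnInvokeCouncils] = "Validation Failed" then 1 else 0),
    // Step 6 - keep only the flagged rows
    ErrorRowsOnly = Table.SelectRows(AddedErrorFlag, each [ErrorRows] = 1)
in
    ErrorRowsOnly

The append in step 9 is then equivalent to Table.Combine({Councils, #"Councils Error Rows"}).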

 

Summary

I have shown you how to get around two different problems with Error Rows in the Power BI Query Editor. The first is how to retain all correct data, whilst the second is keeping error records and inserting them back into the dataset as an Unknown Member. Both methods are dynamic, meaning that if we added a new Council called ‘Basingstoke’, we would see the underlying data regardless of whether the underlying Facebook Object ID exists or not.

Whilst none of the transformations in this blog are overly technical, there are quite a few hoops to jump through to retain all data from a web page/Facebook. Having said that, I am sure there are a few other ways people could approach this problem. I would be really interested to speak to anyone who does have an alternative solution.

Further Reading

Query Editor Basics (Power BI blog) – http://bit.ly/2pwBdo1
Unknown Members in Data Warehousing – http://bit.ly/2qTefwe
Loop through Multiple Web Pages using Power Query – http://bit.ly/2q3a8Nc

Contact Me

If you would like a copy of the workbook containing the examples or want to know more about the Query Editor within Power BI, please leave a comment below. My Twitter details are also provided.

Twitter:  @CallumGAdatis

Buffer() M Function in Query Editor (Power BI)

Whilst I have been aware of the Buffer() M function in the Query Editor of Power BI for a while, I had never really utilised its capabilities until now.  There are two main types of buffer functionality – Table.Buffer and List.Buffer.  To define them simply, Table.Buffer loads an entire table into memory and prevents it from changing during evaluation, whereas List.Buffer buffers a list in memory, giving a stable list with a determinate count and order of items.

This blog will focus on the theory behind the general Buffer() M functionality, picking a specific scenario of when it can outperform the standard Query Editor behavior.  I will also demonstrate that this is not always the most effective technique within the same scenario.  The article will not give you a hard and fast rule of when to use the Buffer() function, because it can depend on a number of factors.  These are described further below.

Note:    It is assumed you have existing knowledge of Query Folding and if not, one of my previous blogs should help greatly.

Scenario

I found inspiration from Chris Webb’s example, using the Adventure Works DW 2014 database – available here.

The requirements are:

1.       Obtain the first 10,000 rows from FactInternetSales

2.       Remove the majority of columns, retaining ONLY:

a.       SalesOrderLineNumber

b.      CustomerKey

c.       SalesAmount

3.       Rank the current row based on Sales Amount.

List.Buffer()

Assuming your database exists on a local server and is named AdventureWorksDW2014, copy the following code into the Advanced Editor in the Query Editor screen.

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),
    //Get first 10,000 rows from FactInternetSales
    dbo_FactInternetSales = Table.FirstN(
          Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
          10000),
    //Remove unwanted columns
    RemoveColumns = Table.SelectColumns(
          dbo_FactInternetSales,
          {"SalesOrderLineNumber", "CustomerKey", "SalesAmount"}),
    //Get sorted list of values from SalesAmount column
    RankValues = List.Sort(RemoveColumns[SalesAmount], Order.Descending),
    //Calculate ranks
    AddRankColumn = Table.AddColumn(RemoveColumns, "Rank",
          each List.PositionOf(RankValues,[SalesAmount])+1)
in
    AddRankColumn

You can visibly see the rows loading – one by one.  In total, it takes nearly 1 minute to load all of the results.

Now let’s use the List.Buffer() function in the RankValues step.

Replace:

= List.Sort(RemoveColumns[SalesAmount], Order.Descending)

With:

= List.Buffer(List.Sort(RemoveColumns[SalesAmount], Order.Descending))

clip_image002

The entire transformation (from start to finish) completes in just under 2 seconds!  This is because List.Buffer stores the sorted values in memory, so the sorted list is only evaluated once; previously, it (and the steps behind it) was being re-evaluated for every row of the rank calculation.  The M language is both functional and, at times, lazy.  To prevent the constant re-evaluation, buffer the list into memory.

The final query output is shown below:

clip_image004

Query Folding

We will implement the same requirements, but this time using Query Folding. 

The third step in our current transformation is called ‘Removed Columns’.  This is what prevents Query Folding, as this function cannot be interpreted/translated into native SQL Server T-SQL.  As a result, none of the subsequent steps are folded either.

The way around this is to write a SQL Server view (in SSMS) that imports just the fields required from the underlying FactInternetSales table.  The view below returns those columns, with the rank already calculated at source.

CREATE VIEW dbo.VwFactInternetSalesAmount
AS
       SELECT SalesOrderNumber
             ,[CustomerKey]
             ,[SalesOrderLineNumber]
             ,[SalesAmount]
             ,RANK() OVER (ORDER BY [SalesAmount] DESC) AS [Rank]
       FROM   [AdventureWorksDW2014].[dbo].[FactInternetSales]

clip_image006

The final steps are to filter on the top 10,000 rows and Group the Rows together – inserting the following M syntax into the last Applied Step:

let
    //Connect to SQL Server
    Source = Sql.Database(".", "AdventureWorksDW2014"),
    //Reference the view
    dbo_FactInternetSales = Source{[Schema="dbo",Item="VwFactInternetSalesAmount"]}[Data],
    #"Sorted Rows" = Table.Sort(dbo_FactInternetSales,{{"SalesOrderNumber", Order.Ascending}}),
    #"Kept First Rows" = Table.FirstN(#"Sorted Rows",10000),
    #"Grouped Rows" = Table.Group(#"Kept First Rows", {"CustomerKey", "SalesOrderLineNumber", "Rank"}, {{"TotalSalesAmount", each List.Sum([SalesAmount]), type number}})
in
    #"Grouped Rows"

 

The query now returns instantly (under 1 second).   Right click on the last applied step and select the View Native Query option, to show the underlying SQL.

select top 10000
    [rows].[CustomerKey] as [CustomerKey],
    [rows].

View Native Query Function in Power Query

Following on from my last blog (found here), I wanted to share a new Query Folding feature in Power Query.  The View Native Query functionality was actually released in the June 2016 Power BI update but it was hardly advertised/promoted by Microsoft.

What is View Native Query?

In a nutshell, it enables you to check whether each of your step-by-step transformations in Power Query is being folded back to the source.  I have previously shown techniques using external monitoring tools, as well as an in-built M function.  However, the View Native Query feature makes our lives so much easier.  I will demonstrate how easy it is to use.

Example

I loaded up an old Query Folding Power BI workbook in Power Query, which was connecting to the AdventureWorksDW2014 SQL Server database.  If you want a copy of it, feel free to comment below and I’ll get back to you.  However, you can do this with any workbook and need to follow the simple instructions below:

1.    Right click on the last step of your transformation and a new option called View Native Query should be available.  Left click to open the SQL query.


clip_image0024_thumb_thumb

 

clip_image0044_thumb_thumb

2.    Whilst the SQL generated at source is not the most efficient, query folding still means the data is brought back into Power Query far quicker than it would be if the transformations were carried out within the Power BI application.

 

3.   Now copy and paste the code into SSMS. This is exactly the same code that SQL Server Profiler would capture.  Execute the query.

 

clip_image0054_thumb_thumb

 

4.    In effect, this is replicating the transformation steps in Power Query.  The below screenshot is taken from Power Query – note that the top 10 results match the output shown above.

clip_image0064_thumb_thumb

The coolest thing about the View Native Query feature is the fact you can check every step in the Power Query transformation.  I only demonstrated the final step, but you can go back to each one and find the underlying query folding logic that has been applied.

Why is View Native Query not available?

If you get the below screenshot, the answer is simple – your transformation step doesn’t support Query Folding! 

clip_image0074_thumb_thumb

NOTE: View Native Query is ‘greyed’ out.

If possible, move any steps that aren’t supported to the end.  Power Query will fold step after step until it reaches one it cannot fold; every step after that point (even if it could otherwise fold) is not pushed back to the source and could drastically slow down the remaining transformations.
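As a hedged illustration (not taken from the workbook above), the sketch below filters FactInternetSales with a step that folds and then adds an index column – a step that typically does not fold – as the last step:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),
    FactInternetSales = Source{[Schema="dbo", Item="FactInternetSales"]}[Data],
    // Folds - translated into a WHERE clause at source
    FilteredRows = Table.SelectRows(FactInternetSales, each [SalesAmount] > 1000),
    // Typically does not fold, so it is kept as the last step;
    // everything above it still runs on SQL Server
    AddedIndex = Table.AddIndexColumn(FilteredRows, "RowNumber", 1, 1)
in
    AddedIndex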

Conclusion

This feature is something that should have been made available a long time ago.  Whilst the power of query folding is undeniable, the methods of testing it were long and arduous.  With View Native Query, we can easily check where and, more importantly, when folding is NOT happening.  Debugging slow-performing transformations is now so much easier, and you can in fact swap steps around and folding will still persist where possible.

I would advise every Power Query developer to make use of this simple feature.  If there are any other Query Folding techniques, tips or tricks out there, please contact me.  Look out for future blogs, where I will be looking at the performance of query folding on very large data volumes (100 million + records) and whether the automated SQL Server code eventually becomes too unwieldy and poorly performing.   

Query Parameters in Power Query – Part 1

A new feature has recently been added into Power Query – called Query Parameters.  This was actually released to Power BI as part of May 2016’s update.

So what is a Query Parameter?  For anyone familiar with Microsoft Reporting Services, it has pretty much the same functionality as a Report Parameter.  In simple terms, it enables you to filter your data dynamically in Power Query.  There has always been a Filter UI function but it could not be linked to a parameter or plugged into the M code.  We can use Query Parameters as a way of reducing the data volumes brought into a PowerBook (useful for large datasets or tables) or to allow a user to switch between filters (e.g. changing database connections dynamically).

In this blog, I will concentrate on how we can reduce data volumes by implementing a Query Parameter.  Look out for Part 2 in the coming weeks, where I will pick a use case for when we would want to dynamically switch between parameter values.

Reducing Data Volumes

Imagine transforming or just analysing a fact table that contains over 1 billion records.  Not only will Power BI (and Power Query) probably crash – or, at best, perform badly – there is also no need to bring all of this data back into a PowerBook.  A subset, or perhaps an individual month, is all that is required to undertake the necessary Power Query ETL operations.

The below example demonstrates how a Query Parameter can be used to filter records in a given dataset.  I have used the AdventureWorksDW2014 Database (download here) and picked the FactProductInventory table, which contains 776,286 rows.  I will reduce this count by adding a new Query Parameter in Power Query – using the DateKey column to find all rows between the 31st November and 30th December 2010.

1.       Assuming I have already loaded the data into the workbook, in the Query Editor, click Manage Parameters from the ribbon.
 

clip_image001

2.       Create two parameters – one named DateKeyFrom and the other DateKeyTo.  The former contains the value 20101131 and the latter 20101230.  Keep the default settings and confirm by clicking OK.  Your two parameters should look like the below:

clip_image003

 

3.       Click on the DateKey column drop down and choose Number Filters and Between – highlighted below.

 

clip_image005

 

 

4.       Set the filter to pick up the two newly created parameters.  The following screenshot illustrates the criteria required, with the key change being to switch the second drop down from ‘Decimal Number’ to ‘Parameter’ (the resulting M step is sketched after these steps).

clip_image007

5.       Check that your dataset has been filtered and only contains data for the time period December 2010.

6.       If you need to change the date criteria and, let’s say, only display 5 days of data, this can be achieved easily by clicking the parameters in the Queries pane.  I have edited the DateKeyFrom value to 20101225.  You can also update a value by clicking the Manage Properties button.

clip_image009

7.       As soon as you click back onto the FactProductInventory table query, the data automatically begins to refresh.  We are now left with only the last 5 days in December, shown below. 

 

clip_image011
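For reference, the filter step generated by the UI should look something like the line below; the previous step name #"Changed Type" is an assumption – use whatever your last applied step is called.

= Table.SelectRows(#"Changed Type", each [DateKey] >= DateKeyFrom and [DateKey] <= DateKeyTo)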

Conclusion

As Query Parameters are hot off the press, I have only just started experimenting with them.  It looks like the feature has been added not only to reduce and filter data, but to enable us to plug dynamic parameters into M.  Previously, I used a Function as a workaround, but can now use parameters in their proper context.  Click here to view a previous blog, where I specifically create and invoke a function.  There are a number of other uses for Functions and Rafael Salas’ blog provides an excellent walkthrough.

My next blog will focus on how you can switch between different parameter values and dynamically display the results.  If anyone has built their own Query Parameter examples and want to share them, feel free to leave me a comment below.  I would be very interested to see how far people have already taken this new feature.

Further Reading

There are already a few great blogs and use cases out there:

1.       Chris Webb’s Blog - https://blog.crossjoin.co.uk/2016/05/08/the-m-code-behind-power-bi-parameters/

2.       Soheil Bakhshi Blog (1 of 2) – http://biinsight.com/power-bi-desktop-query-parameters-part-1/

3.       Soheil Bakhshi Blog (2 of 2) - http://biinsight.com/power-bi-desktop-query-parameters-part2-dynamic-data-masking-and-query-parameters/

Replicating SSIS Data Profiling in Power Query

As a BI Consultant, I regularly sit down with a new/prospective client and dissect both their business requirements and source data.  With the latter, the SSIS Data Profiling Task (in Visual Studio or SSDT) is a very useful tool for finding out the general quality of the data we will be importing into the warehouse.  Counts of NULLs, value frequency, min/max/median values and distinct record counts are just a few types of data profiling.

I have recently been playing with the out-of-the-box Data Profiling functionality in Power Query.  Although the functions are very useful, I quickly realised that I would need to extend my PowerBook to match the Data Profiling Task in SSIS.  I will break my findings down into a couple of parts, with this instalment focusing on a specific data profiling output from SSIS.

Table Profile Function

Table.Profile() is a relatively new function in Power Query that gives some profiling stats for columns in a specific database table or raw file.  In the example below, I pulled in a raw Excel spread sheet from the Power BI website, which contains sample financial data.  Click here to download the data.

clip_image002

To obtain stats on this query, open Advanced Editor, add the following M at the end of the existing code:

    #"TableProfiling" = Table.Profile( #"Changed Type" as table) as table

in

     #"TableProfiling"

clip_image004

All we are doing is creating a new step in M that references the previous step and returns the profiled table.  The Table.Profile() function doesn’t require any additional arguments; the underlying Power Query engine provides all of the functionality.  Each column has been profiled – see below:

clip_image006

I think the outputs are self-explanatory, especially if you are familiar with Data Profiling.  Either way, one line of code gives us some useful high-level stats about the data and can help us capture the extremes or anomalies of each column.  Not all analysis will be suitable for every column, e.g. the ‘Country’ field will not produce Standard Deviation or Average values because it uses the Text datatype.

Column/Frequent Value Distribution

Whilst the in-built Power Query function is ideal for overall table/file analysis, data profiling is most effective when focusing on specific columns and data patterns.  Both Column and Frequent Value Distribution are basic features of the SSIS Profiling Task, shown below.

clip_image008 

For the Power Query proof of concept, I have connected to AdventureWorksDW2014 and the Person.Address table – named PersonAddress.  This means I can use the SSIS Data Profiling screenshot as the benchmark to not only prove it can be replicated, but to ensure the numbers match. 

The first step is to apply the Table.Profile() function to a Reference query – called TableProfiling.  We can then create a simple staging query that just includes what we need from the out of the box profiling – called TableProfilingStage.  ‘Number of Values’ is renamed to ‘TotalCount’ and ‘Number of Distinct Values’ to ‘DistinctCount’.  We are left with the following data:

  clip_image010

The row of interest is ‘City’ (contained in the Column field), as this is what we need to profile.  The TableProfilingStage query is the reference point for the frequent value distribution.  Create a New Source and paste the M code shown below into the Advanced Editor (a sketch of this query is given after the points of note):

clip_image012

Points of note:

1.       Lines 1-4 reference parameters in other queries – specifically, the record at ordinal position 3.  The ‘City’ column is the 4th record in TableProfilingStage, which equates to position 3 (positions start at 0).

2.       Each ‘City’ value is grouped and given a total count.

3.       An individual record count (a value of 1 per row) is added for use in the eventual calculation.  This value is then summed to give a total record count on each row.

4.       Apply (Count/TotalCount)*100 to work out the Frequent Value Distribution.

5.       Columns are renamed, sorted (by count) and any unnecessary ones hidden.

6.       Save your new query – I called it ColumnProfiling.

7.       You should have 4 queries in total:

a.       PersonAddress

b.       TableProfiling

c.       TableProfilingStage

d.       ColumnProfiling
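As promised above, here is a hedged M sketch of the ColumnProfiling query.  It arrives at the same figures but is slightly simplified compared to the screenshot (it reads the total count once rather than carrying a value of 1 on every row); the query and column names are the ones used in this example.

let
    // Total row count, read from the 'City' row of TableProfilingStage (position 3, zero-based)
    TotalCount = TableProfilingStage{3}[TotalCount],
    // Count the occurrences of each City value in the source table
    GroupedCities = Table.Group(PersonAddress, {"City"},
        {{"Count", each Table.RowCount(_), type number}}),
    // Frequent Value Distribution = (Count / TotalCount) * 100
    AddedDistribution = Table.AddColumn(GroupedCities, "FrequentValueDistribution",
        each ([Count] / TotalCount) * 100, type number),
    // Most frequent values first
    SortedRows = Table.Sort(AddedDistribution, {{"Count", Order.Descending}})
in
    SortedRows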

As long as your queries are named identically to my examples, you will end up with the below outputs:

Column Value Distribution:

clip_image014

Frequent Value Distribution:

clip_image016

We have now replicated Column and Frequent Value Distribution and the figures match the SSIS Profiling Task!  Although Power Query does not provide the pretty visualisations, we can easily reproduce them in the Power BI reporting suite.  I will save this for another blog, though.

Next Steps

Look out for my next blog, where I aim to enhance the current Data Profiling PowerBook and add configuration.  In order to make this method as efficient as (or more efficient than) SSIS, it will need to be automated, handle a number of sources (e.g. CSV, database connections, etc.) and incorporate invoked Power Query functions.

Conclusion

It is clear to see that Power Query is a very effective tool to replicate SSIS Data Profiling.  I believe with further development, a PowerBook could be used as a quicker and more efficient tool to run data profiling. 

I am very interested to see if anyone has created their own Data Profiling tasks in Power Query and welcome any comments, feedback or questions.

Flat Files in Power Query

The out-of-the-box functionality in Power Query will enable you to import pretty much any flat file, ranging from csv to fixed width.  However, there are instances when badly formatted files cannot be loaded correctly and this blog will demonstrate a workaround for this. 

Unusual Flat Files

Whilst csv and txt are the most common types of flat file, I have worked on a number of BI projects where the data warehouse needs to ingest fixed width, ragged right or pipe delimited text files.  You won’t be surprised to hear that Power Query recognizes all of these formats!   I will demonstrate below how simple it is to import a Ragged Right file with a ‘.in’ extension. 

Ragged Right in Notepad:

clip_image002

Power Query’s Interpretation:

clip_image003

 

Badly Formed Flat Files

You won’t normally know that a flat file is badly formed until actually importing it.  In the below example, a CSV file contains both missing and double commas.  CSV stands for ‘comma-separated values’, so Power Query expects a single comma to split each column.  Any developer who has imported data (through SSIS, SQL import, etc.) will have encountered a problem like this.

Two Commas in Notepad:

clip_image004

Two Commas in Power Query:

As expected, Power Query believes there is a blank column between the values ‘Liverpool’ and ‘Stoke’.  As a result, the data is shifted to the right.

clip_image006

Missing Commas in Power Query:

A missing comma between columns works the opposite way.  Data is shifted to the left.

clip_image008

Solution

As we already know that the csv file contains both two and missing commas, the next step is to import the data in a non-delimited format.  To do this, you need to start a new Blank Query and type the following code into the Advanced Editor:

let
    Value = Table.FromList(
        Lines.FromBinary(File.Contents("D:\Documentation\Internal\PowerQuery\FlatFiles\CSVMissingOrMoreCommas.csv")),
        Splitter.SplitByNothing())
in
    Value

The code turns the flat file into a table and treats every line as a single row in one column – hence the SplitByNothing() syntax.  This is how the query looks:

clip_image009

We can now replace the double commas with a single comma and split the table by ‘,’.  I have shown how to Replace Values in an earlier blog, but will quickly demonstrate the Split Columns function.

clip_image011

Power Query auto-detects all of these settings, as it can see that the comma is the most frequent delimiter and that it occurs 10 times.  If you drop down Advanced options, you can change the delimiters and output columns if they are not interpreted correctly by Power Query.  Finally, we can implement the Use First Row As Headers UI function and filter out the nulls in the ‘HTR’ column.

clip_image013
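Pulling the steps together, a sketch of the full query is shown below.  The column count of 11 comes from the 10 delimiters Power Query detected; the final null filter on ‘HTR’ and the manual Replace Values fixes described next are omitted.

let
    // Import each line of the file as a single row in one column
    Value = Table.FromList(
        Lines.FromBinary(File.Contents("D:\Documentation\Internal\PowerQuery\FlatFiles\CSVMissingOrMoreCommas.csv")),
        Splitter.SplitByNothing()),
    // Collapse the double commas back to a single delimiter
    ReplacedDoubleCommas = Table.ReplaceValue(Value, ",,", ",", Replacer.ReplaceText, {"Column1"}),
    // Split the single column on ',' into 11 columns
    SplitByComma = Table.SplitColumn(ReplacedDoubleCommas, "Column1",
        Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), 11),
    // Use First Row As Headers
    PromotedHeaders = Table.PromoteHeaders(SplitByComma, [PromoteAllScalars = true])
in
    PromotedHeaders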

There is no easy way to write automated logic that would fix these three records because the missing commas were in different places and in our current query, different columns.  There are also no string patterns that can be easily deduced.  Therefore, this time around we will use Replace Values to move the data to the required columns.

clip_image015

There it is - a very effective way of visualising and managing both unusual and badly formed flat files in Power Query. 

Next Steps

We could easily extend the code to handle other file types or delimiters. Ken Puls’ blog demonstrates how you can pass the flat file connection string in as a parameter within a function.  This would be very handy for someone wanting to bring in multiple flat files through one configuration.  I would be very interested to see if any other Power Query developers have their own M code that handles flat files.  Feel free to comment or contact me directly if you have any suggestions on how to extend my example further. 

ETL in Power Query

If you have read any of my other Power Query blogs, you will have seen the results of transformed and mashed up queries - but not how it was achieved.  This article picks out my favourite 5 Extract, Transform & Load (ETL) features in Power Query.

Example

Being a keen sports fan, I wanted to find out the medal history of the 100m Olympic Finals for both men and women. 

I’ve picked two datasets, both from different sources.  You can download them using the links below:

-          Source 1 - CSV

o   ISO Countries.csv - https://datahub.io/dataset/iso-3166-1-alpha-2-country-codes/resource/9c3b30dd-f5f3-4bbe-a3cb-d7b2c21d66ce

-          Source 2 – Web

o https://en.wikipedia.org/wiki/100_metres_at_the_Olympics

o   There is a lot of other data on the site but for this example, I have only picked the race result data tables - Men [edit] and Women [edit].

Please Note:  I am happy to provide the Power BI Workbook upon request.  Leave a comment below and I will get back to you ASAP.

The imported data (before any transformation) should look like the below in Power Query.  You may notice some ETL has already taken place, but this was carried out automatically by Power Query and includes simple tasks like displaying column headers.

clip_image002

Pivot/Unpivot

There are many occasions when we need to completely flip the contents of a table.  For our example, ‘Gold’, ‘Silver’ and ‘Bronze’ should not be columns, but rows instead.  We only need one column for these values, which can be named ‘MedalType’.  The unpivot feature can be applied (to a whole query) with just one click of a button.  Ensure that the following columns and features are selected:

clip_image004

 

Once the data is un-pivoted, rename the new columns to ‘Athlete’ and ‘MedalType’ respectively.  Repeat the process for the Men [Edit] query, which looks like the following:

clip_image005
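For reference, the unpivot generates M along the lines of the sketch below – assuming the results table has a ‘Games’ column alongside the Gold, Silver and Bronze columns (the key column name is an assumption):

let
    Source = #"Women [edit]",
    // Unpivot everything except 'Games'; the attribute/value pair becomes MedalType/Athlete
    Unpivoted = Table.UnpivotOtherColumns(Source, {"Games"}, "MedalType", "Athlete")
in
    Unpivoted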

Replace Values

When quickly eyeballing the data (in Men [Edit]), it is apparent that two athletes shared a Bronze medal in 1896.  We need to handle this data defect but unfortunately, there isn’t a one click method of achieving this.  The easiest way is to use the Replace Values and Remove Top/Bottom Rows UI functions.

Firstly, let’s duplicate the full Men [Edit] query by right clicking on it and selecting Duplicate.  We now want to remove all records from the copied query, apart from the one that contains two bronze medalists.  To do this, right click on the little table button in the corner of the query and choose either of the following:

clip_image007

You will then be asked to specify how many top or bottom rows you want to remove.  Ensure the only record left looks like the below:

 clip_image008

Lastly, the Replace Values function is used to remove ‘Francis Lane’ from the shared record, leaving a single medallist and therefore removing the duplicate.  The aim is to insert this one record back into the main Men [Edit] query.  Right click on ‘Athlete’ and select Replace Values.  Type the below text into the function box and click OK.

Value to Find:   Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)
Replace With:   Alajos Szokolyi (HUN)

The ‘#(lf)’ elements of the search string match the line-feed characters between the two names – without them, the replace will not work.

clip_image009
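The generated step looks something like the line below; #"Kept Bottom Rows" simply stands in for whatever the previous step is called in your query.

= Table.ReplaceValue(#"Kept Bottom Rows",
    "Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)",
    "Alajos Szokolyi (HUN)",
    Replacer.ReplaceText, {"Athlete"})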

We must tidy up the Men [Edit] query by doing the same as above, but the opposite way around.  This leaves us with the following:

clip_image010

Append Queries

This is another neat feature in Power Query.  In SQL Server or SSIS terms, this provides a union between two queries.  If the column names and data types do not match, then it may fail or give unexpected results.

Open the Men [Edit] query, click the Home tab in the top menu pane and select Append Queries.  A pop up window will open, where you can specify the query to append into Men [Edit].  For this example, I selected the below and clicked OK.

clip_image012

-          It is possible to append multiple tables and in fact, we need to do this in order to bring both the Men [Edit] (2) and the Women [Edit] queries into Men [Edit].

-          To make it easy to distinguish between male and female, I have added a new column to the queries that specifies gender.  This is not mandatory, especially if you are following this example.

-          If we select the Two tables option, then only one query can be appended at a given time.
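Under the covers, the append is a Table.Combine – a sketch, assuming the three query names used in this example:

Appended = Table.Combine({#"Men [Edit]", #"Men [Edit] (2)", #"Women [Edit]"})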

We can rename the newly appended query to Athletes and tidy up the Men [Edit] (2) and Women [Edit] queries.  Putting raw queries into an appropriately named Group helps distinguish each part of our ETL – see below:

clip_image013

There are more queries now but the Raw folder contains imported, untouched queries, whereas Transform includes tables that have been modified and joined together.  The Load folder will eventually contain the star schema tables (dimensions and facts).

Add Custom Column

Any developer working on ETL will have needed to derive a column before.  Power Query offers this functionality, both through the UI and the M query language.  We want to create a new column called ‘Nationality’ that extracts the characters in between the brackets from ‘Athlete’.

In Power Query, a derived column is called Custom Column.  Ensure the Athletes query is selected, then navigate to the Add Column tab in the top menu pane and select the Add Custom Column button.  The pop up box appears, where we can write some custom M code:

clip_image015

-          Text.End is the equivalent to the RIGHT function in SQL Server.

-          Text.Start is equivalent to LEFT.

-          Wrapping them both into one line of code ensures the string inside the brackets is extracted from ‘Athlete’.

-          Rename the new column to Nationality.

clip_image016
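One possible version of the custom column formula is sketched below, assuming each athlete value ends with a three-letter code in brackets (e.g. “Usain Bolt (JAM)”) – Text.End takes the last four characters (“JAM)”) and Text.Start then keeps the first three of those:

= Table.AddColumn(Athletes, "Nationality",
    each Text.Start(Text.End([Athlete], 4), 3))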

Merge Queries

This feature is the same as applying a JOIN in SQL Server.  In simple terms, it gives us the ability to mashup and bring together queries from different data sources.  We will bring the ISOCountriesRaw data into the Athletes query.

Under the Home tab, select Merge Queries.  A new window appears, where we can specify the tables and columns that need to be joined.  Simply select the ISOCountriesRaw query and the two columns shown in the below screenshot.  The Join Kind should be Left Outer, as this brings across only the matching records from ISOCountriesRaw while keeping all records in the main Athletes query.

clip_image018

The wizard will also give you a match preview and for this example, 142 of the 144 records in Athletes match the ISOCountriesRaw table.  Click OK to complete the merge.
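The merge generates a nested join similar to the sketch below; the ISO country-code column name (“Code”) is an assumption – use whichever column you matched on.

= Table.NestedJoin(Athletes, {"Nationality"}, ISOCountriesRaw, {"Code"},
    "ISOCountriesRaw", JoinKind.LeftOuter)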

There are a number of columns that aren’t needed in this table and can be filtered out accordingly:

clip_image019 

There are two records that don’t match because the medal winner was subsequently removed from the record books.  We can use the Replace Values function to replace the nulls with some more meaningful information (see below).  We are now ready to turn the data into a star schema!

clip_image021

Next Steps

Now we have a fully transformed query, it can be broken down into a fact table and dimensions.  I will not demonstrate this here but, by following Kimball’s data warehousing principles, there is now a star schema that can be plugged into Power Pivot or Power View.

NOTE:   Add Index Column was used to create primary and foreign keys on the ‘Load’ tables.  See ‘Other Features’ for link on how to do this.
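As an example of the index approach, a surrogate key can be added to a dimension query with a single step – DimAthlete is a hypothetical query name:

= Table.AddIndexColumn(DimAthlete, "AthleteKey", 1, 1)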

Queries:

clip_image023

Relationships:

clip_image025

Other Features

There are many other ETL related functions within Power Query that have not been discussed in this blog.  It is also possible to enhance the UI features using M or write specialist code to enhance and configure your ETL process, which I demonstrated above.

For further information, I would recommend reading the below articles:

-          5 Very Useful Text Formulas - Ken Puls
http://www.excelguru.ca/blog/2014/08/20/5-very-useful-text-formulas-power-query-edition/

-          Add Index Column – Microsoft

https://support.office.com/en-us/article/Insert-a-custom-column-into-a-table-Power-Query-2dbb579a-915b-4ebd-b622-8e7f3d1d61a6?ui=en-US&rs=en-US&ad=US

-          Split Columns - Mark Vaillancourt

http://markvsql.com/2015/03/advanced-column-splitting-in-power-query/

-          Group By – Microsoft

https://support.office.com/en-us/article/Group-rows-in-a-table-Power-Query-e1b9e916-6fcc-40bf-a6e8-ef928240adf1?ui=en-US&rs=en-US&ad=US

-          ETL Examples – Chris Webb

http://blog.crossjoin.co.uk/2013/08/16/some-power-querym-examples/