Callum

Callum Green's Blog

Loop Through Webpages with Multiple Parameters in Power Query

Back in February 2016, I posted a step-by-step article (found here) on how to loop through a website and dynamically pull down multiple pages into a single table in Power Query. I was recently approached by a gentleman who had read my blog and wanted to enhance the looping process. The requirements were:

Current Problem:

-          The user can currently only loop through Year, having to hard code the Page and Month.

Requirements:

-          Dynamically loop through three webpage parameters and output them into a single table – highlighted below:

http://www.boxofficemojo.com/monthly/?page=1&view=calendargross&yr=2017&month=1&p=.htm

-          Specify the Page, Year and Month to loop through

o   Years 2010 - 2017

o   Months 1 - 12

o   Pages 1 - 10

Parameters

Whilst GUI Parameters are not a new concept in Power Query, they did not exist when I wrote my initial blog.  As a result, I had to code a parameter into the M query and then define it as a function.  A similar concept will be adopted here, but the out of the box parameter functionality makes life so much easier now.

I previously wrote an article explaining how to set up parameters (here), so I will simply show you the output of them below:

[Screenshot: the three parameters]

Note – all three parameters are set to Text.  This is needed in order to pass the parameters into the website address later on.

Web Connection

The first step is to import the data from the website (shown above) using the web connector.  To pass our parameter values into the URL, click the cog in the first Applied Step – typically called ‘Source’.

[Screenshot: the ‘Source’ applied step]

This will open a new window, where by design, the hardcoded URL is shown under the Basic radio button.  Select Advanced and you will see additional configuration capabilities.

[Screenshot: the Advanced web connection window]

URL Parts enable you to build up elements of the webpage, whilst dynamically passing in parameters.  The URL Preview will show you how the URL looks with any parameters or URL Parts.  All other options can be ignored for this example.

The URL Parts had to be broken down quite significantly and I will show you both the code and how it looks in the Advanced window.

User Interface:

[Screenshot: URL Parts in the Advanced window]

The sections highlighted in red are the parameters and sit in between some of the hard-coded URL text.

Code Breakdown:

-          Text = http://www.boxofficemojo.com/monthly/?page=

-          Parameter = [Page]

-          Text = &view=calendargross&yr=

-          Parameter = [Year]

-          Text = &month=

-          Parameter = [Month]

-          Text = &p=.htm

URL Preview:

http://www.boxofficemojo.com/monthly/?page={Page}&view=calendargross&yr={Year}&month={Month}&p=.htm
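Behind the scenes, the URL Parts are simply concatenated into the query’s Source step.  A minimal sketch of the resulting M (assuming the three Text parameters are named Page, Year and Month, as above):

let
    Source = Web.Page(Web.Contents("http://www.boxofficemojo.com/monthly/?page=" & Page & "&view=calendargross&yr=" & Year & "&month=" & Month & "&p=.htm"))
in
    Source

Because all three parameters are Text, they can be concatenated straight into the URL string with no conversion.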

Year/Month/Page Permutations Table

Before creating/invoking the function, a table that contains the different years, months and pages is needed.  There are many ways to achieve this, but I used a cross join technique (found here) with three disparate tables.  I followed the end user requirements and ended up with a table like below:

[Screenshot: the permutations table]

This will be the final table where the function is called.
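For reference, here is a minimal M sketch of one way to build the permutations table (step and column names are illustrative), converting each number to text so it can be passed straight into the URL later:

let
    Years = Table.FromList(List.Transform({2010..2017}, each Text.From(_)), Splitter.SplitByNothing(), {"Year"}),
    Months = Table.FromList(List.Transform({1..12}, each Text.From(_)), Splitter.SplitByNothing(), {"Month"}),
    Pages = Table.FromList(List.Transform({1..10}, each Text.From(_)), Splitter.SplitByNothing(), {"Page"}),
    //Cross join: every Month against every Year, then every Page against every Year/Month pair
    AddMonths = Table.AddColumn(Years, "AllMonths", each Months),
    ExpandMonths = Table.ExpandTableColumn(AddMonths, "AllMonths", {"Month"}),
    AddPages = Table.AddColumn(ExpandMonths, "AllPages", each Pages),
    ExpandPages = Table.ExpandTableColumn(AddPages, "AllPages", {"Page"})
in
    ExpandPages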

Function

Let’s go back to the original table, where the parameters have been applied to the Web URL.  Right click the table and select Create Function.  Nothing else is required (apart from naming it), as Power Query will automatically package up the table and you end up with a new Function like below:

[Screenshot: the new function]

The last step is to invoke the function.  Go back to the Permutations table (called ‘All Data’ in my example), and go to Add Column > Invoke Custom Function.

[Screenshot: Invoke Custom Function]

When prompted with the next screen, map each parameter to the equivalent column.  The red highlighted values are the columns and the yellow ones the parameters.  Both are named the same in my example, which makes the mapping even easier.

[Screenshot: mapping parameters to columns]

The function will loop through each URL permutation, dynamically changing the parameters on execution.  Once it has finished, tidy up any error rows produced by Year, Page and/or Month combinations that do not exist.  Apply all transformations within the Query Editor and that is it – the data is ready for reporting!  You may need to configure some of the measures (e.g. Total Gross) to be numbers, but that is easy enough to do within the Query Editor.

The example in this blog can be adapted in many ways and illustrates how dynamic Power Query can be – both in Excel and Power BI.

Further Reading

o   Cross Join in Power BI –  http://bit.ly/2ij8zJ1

o   Reza Rad Custom Functions – http://bit.ly/2zz6X2f

Contact Me

If you would like a copy of the PBIX workbook or have any general questions, feel free to leave a comment below.

Twitter: @DataVizWhizz

Handling Web Page Errors in Power BI Query Editor

I was recently asked if it was possible to handle Error Rows in the Query Editor and more specifically, errors caused by web pages that do not exist at source. The user approached me with this question after following a previous blog on looping through multiple web pages in Query Editor – found here.

In my blog post, there were no errors from the loop output but in the user’s workbook, they were losing data directly below the expanded error row. My first suggestion was to use the Remove Errors UI option, which would delete the problematic row. However, they wanted a way of treating the errors as an Unknown Member and did not want to lose the data entirely.

It is assumed readers of this blog already have knowledge of the Power BI Query Editor and the UI functions used in the examples.

Scenario

I created a new workbook that connects to local Council Facebook pages. Each page has a unique Object ID, which will be used as the parameter in the loop. The Council “Camberley” deliberately contains an invalid Object ID. I then proceeded to create a Parameter and Function, replicating the exact steps from my previous blog.

When I invoke the function (through the use of a Custom Column), the following is produced:

[Screenshot: invoked function results]

As you can see, Camberley Council produces an error in the fnInvokeCouncils column. If we expand the contents (highlighted in yellow), the Facebook page data appears. Upon further inspection, the Farnham and Rushmoor council data are available, but Camberley (incorrect Object ID) and Guildford are not.

[Screenshot: expanded results with missing data]

The error message is a little misleading but let’s save the debugging debate for another day. The key observation is “Guildford” data is not available, simply because it comes after “Camberley” in the list. Whilst we want to see errors in a Query, we do not want them causing data loss.

Resolution

As I mentioned at the beginning of this article, using the Remove Errors function would prevent the loss of Guildford data. However, the user needs to handle errors as Unknown Members and conform to a typical Kimball Data Warehouse. 

I am sure there are many ways to fulfil the requirement, but here is how I approached it:

1.       Duplicate the existing ‘Councils’ query, naming it ‘Councils Error Rows’.

2.       Switch back to the ‘Councils’ query and Remove Errors, leaving only three records:

 

[Screenshot: ‘Councils’ query after Remove Errors]

3.       Expand the fnInvokeCouncils column, opening up the underlying fields and data:


[Screenshot: expanded fnInvokeCouncils columns]

4.       In the ‘Councils Error Rows’ query, apply the Replace Errors UI function – inserting the string “Validation Failed”.

5.       Add a Custom Column, writing the following M:

if [fnInvokeCouncils] = "Validation Failed" then 1 else 0

This is a simple IF statement that flags the error rows with a 1.

6.       Now filter the selection to only display ErrorRows with the value of 1. This is achieved by using the Filter Rows UI function. The ‘Councils Error Rows’ query now looks like the following:


[Screenshot: filtered ‘Councils Error Rows’ query]
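For anyone curious about the underlying M, steps 4 to 6 come down to something like the sketch below (the step names are illustrative and will differ in your workbook):

let
    //Steps up to and including the invoked function, as duplicated from 'Councils'
    Source = #"Invoked Custom Function",
    ReplacedErrors = Table.ReplaceErrorValues(Source, {{"fnInvokeCouncils", "Validation Failed"}}),
    AddedErrorFlag = Table.AddColumn(ReplacedErrors, "ErrorRows", each if [fnInvokeCouncils] = "Validation Failed" then 1 else 0),
    FilteredRows = Table.SelectRows(AddedErrorFlag, each [ErrorRows] = 1)
in
    FilteredRows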

7.       The columns must match the ‘Councils’ query, meaning 4 new Custom Columns are needed. We can hardcode the values and remove any unwanted columns.

 

[Screenshot: hardcoded Unknown Member columns]

 

8.       Right click on the previously modified ‘Councils’ query and select Reference. Name the new query ‘All Councils’. This makes it easier to track the transformations and means any future changes made to the raw data persist through to it.

9.       Within the ‘All Councils’ query, select the Append Queries transformation. Choose ‘Councils Error Rows’ as the table to append and click OK.
 

10.   We are now left with a Union of both datasets, containing the Unknown Member and data from all other Councils.
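In M terms, the Append Queries step is just a Table.Combine – roughly the following, where Source is the reference to ‘Councils’:

= Table.Combine({Source, #"Councils Error Rows"})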

[Screenshot: unioned dataset]

11.   The Unknown Member record is visible within the final Query.

 

[Screenshot: Unknown Member record in the final query]

 

Summary

I have shown you how to get around two different problems with Error Rows in the Power BI Query Editor. The first is how to retain all correct data, whilst the second is keeping error records and inserting them back into the dataset as an Unknown Member. Both methods are dynamic, meaning that if we added a new Council called ‘Basingstoke’, we would see its data regardless of whether the Facebook Object ID exists or not.

Whilst none of the transformations in this blog are overly technical, there are quite a few hoops to jump through to retain all data from a web page/Facebook. Having said that, I am sure there are a few other ways people could approach this problem. I would be really interested to speak to anyone who does have an alternative solution.

Further Reading

Query Editor Basics (Power BI blog) – http://bit.ly/2pwBdo1

Unknown Members in Data Warehousing – http://bit.ly/2qTefwe

Loop through Multiple Web Pages using Power Query – http://bit.ly/2q3a8Nc

Contact Me

If you would like a copy of the workbook containing the examples or want to know more about the Query Editor within Power BI, please leave a comment below. My Twitter details are also provided.

Twitter:  @DataVizWhizz

Buffer() M Function in Query Editor (Power BI)

Whilst I have been aware of the Buffer() M function in the Query Editor of Power BI for a while, I had never really utilised its capabilities until now.  There are two main types of buffer functionality – Table.Buffer and List.Buffer.  To define them simply, Table.Buffer puts an entire table into memory and prevents it from changing during evaluation, whereas List.Buffer holds a fully evaluated copy of a list in memory, giving it a stable order and count.
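As a quick illustration of the syntax, here is a minimal sketch (assuming a local copy of the AdventureWorksDW2014 database used in the scenario below):

let
    Source = Sql.Database("localhost", "AdventureWorksDW2014"){[Schema="dbo",Item="FactInternetSales"]}[Data],
    //Hold the entire table in memory for the rest of the evaluation
    BufferedTable = Table.Buffer(Source),
    //Hold a fully evaluated, stable copy of one column's values in memory
    BufferedList = List.Buffer(Source[SalesAmount])
in
    BufferedList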

This blog will focus on the theory behind the general Buffer() M functionality, picking a specific scenario of when it can outperform the standard Query Editor behaviour.  I will also demonstrate that this is not always the most effective technique within the same scenario.  The article will not give you a hard and fast rule of when to use the Buffer() function, because it can depend on a number of factors.  These are described further below.

Note:    It is assumed you have existing knowledge of Query Folding and if not, one of my previous blogs should help greatly.

Scenario

I found inspiration from Chris Webb’s example, using the Adventure Works DW 2014 database – available here.

The requirements are:

1.       Obtain the first 10,000 rows from FactInternetSales

2.       Remove the majority of columns, retaining ONLY:

a.       SalesOrderLineNumber

b.      CustomerKey

c.       SalesAmount

3.       Rank the current row based on Sales Amount.

List.Buffer()

Assuming your database exists on a local server and is named AdventureWorksDW2014, copy the following code into the Advanced Editor in the Query Editor screen.

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),
    //Get first 10,000 rows from FactInternetSales
    dbo_FactInternetSales = Table.FirstN(
          Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
          10000),
    //Remove unwanted columns
    RemoveColumns = Table.SelectColumns(
          dbo_FactInternetSales,
          {"SalesOrderLineNumber", "CustomerKey", "SalesAmount"}),
    //Get sorted list of values from SalesAmount column
    RankValues = List.Sort(RemoveColumns[SalesAmount], Order.Descending),
    //Calculate ranks
    AddRankColumn = Table.AddColumn(RemoveColumns, "Rank",
          each List.PositionOf(RankValues, [SalesAmount]) + 1)
in
    AddRankColumn

You can visibly see the rows loading – one by one.  In total, it takes nearly 1 minute to load all of the results.

Now let’s use the List.Buffer() function in the RankValues step.

Replace:

= List.Sort(RemoveColumns[SalesAmount], Order.Descending)

With:

= List.Buffer(List.Sort(RemoveColumns[SalesAmount], Order.Descending))

[Screenshot: List.Buffer applied to the RankValues step]

The entire transformation (from start to finish) completes in just under 2 seconds!  This is because the List.Buffer function stores the sorted values in memory, so the rank calculation is only evaluated once.  Previously, the list (and the steps behind it) was being re-evaluated over and over again.  The M language is both functional and, at times, lazy.  In order to prevent the constant re-evaluation, buffer the list into memory.

The final query output is shown below:

[Screenshot: final query output with Rank column]

Query Folding

We will implement the same requirements, but this time using Query Folding. 

The third step in our current transformation is called ‘Removed Columns’.  This is what prevents Query Folding, as this function cannot be interpreted/translated into native SQL Server T-SQL.  All subsequent steps are consequently not folded either.

The way around this is to write a SQL Server view (in SSMS) that imports just the fields required from the underlying FactInternetSales table.  The below view gives you the same result as the steps up to ‘Remove Columns’, with the rank calculated at source.

CREATE VIEW dbo.VwFactInternetSalesAmount
AS
       SELECT SalesOrderNumber
                     ,[CustomerKey]
                     ,[SalesOrderLineNumber]
                     ,[SalesAmount]
                     ,RANK() OVER (ORDER BY [SalesAmount] DESC) AS [Rank]
       FROM   [AdventureWorksDW2014].[dbo].[FactInternetSales]

[Screenshot: the view in SSMS]

The final steps are to filter on the top 10,000 rows and Group the Rows together – inserting the following M syntax into the last Applied Step:

let
    //Connect to SQL Server
    Source = Sql.Database(".", "AdventureWorksDW2014"),
    //Reference the view created above
    dbo_FactInternetSales = Source{[Schema="dbo",Item="VwFactInternetSalesAmount"]}[Data],
    #"Sorted Rows" = Table.Sort(dbo_FactInternetSales,{{"SalesOrderNumber", Order.Ascending}}),
    #"Kept First Rows" = Table.FirstN(#"Sorted Rows",10000),
    #"Grouped Rows" = Table.Group(#"Kept First Rows", {"CustomerKey", "SalesOrderLineNumber", "Rank"}, {{"TotalSalesAmount", each List.Sum([SalesAmount]), type number}})
in
    #"Grouped Rows"

 

The query now returns instantly (under 1 second).   Right click on the last applied step and select the View Native Query option, to show the underlying SQL.

select top 10000
    [rows].[CustomerKey] as [CustomerKey],
    [rows].

View Native Query Function in Power Query

Following on from my last blog (found here), I wanted to share a new Query Folding feature in Power Query.  The View Native Query functionality was actually released in the June 2016 Power BI update but it was hardly advertised/promoted by Microsoft.

What is View Native Query?

In a nutshell, it enables you to check whether each of your step-by-step transformations in Power Query is being folded back to the source.  I have previously shown techniques using external monitoring tools, as well as an in-built M function.  However, the View Native Query feature makes our lives so much easier.  I will demonstrate how easy it is to use.

Example

I loaded up an old Query Folding Power BI workbook in Power Query, which was connecting to the AdventureWorksDW2014 SQL Server database.  If you want a copy of it, feel free to comment below and I’ll get back to you.  However, you can do this with any workbook – just follow the simple instructions below:

1.    Right click on the last step of your transformation and a new option called View Native Query should be available.  Left click to open the SQL query.


[Screenshot: View Native Query menu option]

 

[Screenshot: the native SQL query]

2.    Whilst the above SQL written at source is not the most efficient, query folding still means the data is brought back into Power Query far quicker than it would be if the transformations were carried out within the Power BI application.

 

3.   Now copy and paste the code into SSMS.  This is exactly the same code that SQL Server Profiler would capture.  Execute the query.

 

[Screenshot: query results in SSMS]

 

4.    In effect, this is replicating the transformation steps in Power Query.  The below screenshot is taken from Power Query – note that the top 10 results match the SSMS output above.

[Screenshot: top results in Power Query]

The coolest thing about the View Native Query feature is the fact you can check every step in the Power Query transformation.  I only demonstrated the final step, but you can go back to each one and find the underlying query folding logic that has been applied.

Why is View Native Query not available?

If you get the below screenshot, the answer is simple – your transformation step doesn’t support Query Folding! 

[Screenshot: View Native Query greyed out]

NOTE: View Native Query is ‘greyed’ out.

If possible, move any steps that aren’t supported to the end.  Power Query will attempt to query fold until it reaches a step it cannot.  All steps after this (even ones that could otherwise fold) will not be folded, which can drastically slow down the remaining transformations.

Conclusion

This feature is something that should have been made available a long time ago.  Whilst the power of query folding is undeniable, the methods of testing it were long and arduous.  With View Native Query, we can easily check where and, more importantly, when folding is NOT happening.  Debugging slow performing transformations is so much easier now and you can, in fact, swap steps around and folding will still persist where possible.

I would advise every Power Query developer to make use of this simple feature.  If there are any other Query Folding techniques, tips or tricks out there, please contact me.  Look out for future blogs, where I will be looking at the performance of query folding on very large data volumes (100 million + records) and whether the automated SQL Server code eventually becomes too unwieldy and poorly performing.   

Query Folding in Power Query

Query Folding in Power Query isn’t an entirely new concept but when I first heard of it, I thought “What is Query Folding”?  There are probably a number of you out there who think the same, so let me explain it.

When carrying out transformations (e.g. Sort) in Power Query, it is possible that some will be sent back to the source.  In other words, Power Query doesn’t attempt to sort millions of records within the application itself, but will use the source to undertake the work.  This can drastically improve performance within Power Query.
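To make that concrete before the examples: a Sort step against a SQL Server table becomes an ORDER BY on the server.  A sketch (step and column names illustrative):

//M applied step
= Table.Sort(dbo_FactProductInventory, {{"UnitCost", Order.Descending}})
//...which folds back to T-SQL along the lines of:
//select ... from [dbo].[FactProductInventory] order by [UnitCost] desc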

Examples

I will now demonstrate some basic Query Folding techniques in Power Query.  Until recently, it wasn’t easy to know if a transformation was performing Query Folding.  This blog primarily talks you through monitoring outside of Power Query, but I will also touch upon how this is now possible to do within the application.

External Monitoring Tool

The monitoring tool used in this example is SQL Server Profiler.  For a simple video walkthrough on how to use Profiler, click here.

1.       Start a profiler session, ensuring you connect to the server where the AdventureWorksDW2014 Database resides – available to download here.

2.       Now we can connect to the database in Power Query and more specifically to the dbo.FactProductInventory table.

3.       I will perform three Power Query transformations:

a.       Filter Rows

         i.      ProductTotal > £50000

b.      Group By

         i.      DimProduct.EnglishProductName

         ii.      SUM(Unit Cost) As ProductTotal

c.       Transform > Trim

         i.      DimProduct.Status

NOTE:   This blog does not show you how to carry out the Power Query transformations but the workbook can be provided on request.

4.       The transformed query looks like the below:

 

[Screenshot: transformed query]

 

5.       Now let’s take a look at what SQL Server Profiler captured.

 

[Screenshots: Profiler trace showing the folded T-SQL]

Points of Note:

1.       The Filter Rows Power Query transformation on TotalUnitCost is implemented as a WHERE clause in SQL Server.

2.       The Group By also becomes a T-SQL GROUP BY, with the SUM function applied to find the total value for each EnglishProductName group.

3.       The T-SQL functions LTRIM/RTRIM are combined together on the Status column, replicating the functionality of the Trim feature in Power Query.

Within Power Query

Rather than show you this feature myself, I will point you to a great article by Chris Webb, which describes how you can easily check for Query Folding within your Power Query workbook.

In summary, there is an out of the box M function called GetMetaData that can be nested within your transformations.  This will tell you the source of your query and if it has been query folded.

Supported Transformations

Here are some other transformation types in Power Query that support Query Folding:

-          Filtering (on rows or columns)

-          Joins

-          Aggregates and GROUP BY

-          Pivot and unpivot

-          Numeric calculations

-          Simple transformations, such as UPPER

Current Limitations

Although there is still no official list (from Microsoft) of the sources that support Query Folding, here are some of the ones currently known:

-          Relational sources (SQL Server, Oracle).

-          OData sources (e.g. Azure Marketplace)

-          Active Directory

-          Exchange

-          HDFS, Folder.Files and Folder.Contents

This means a lot of other data sources do not support Query Folding.  For example, Keep Top 5 Rows against a flat file has to be carried out by Power Query itself, whereas SQL Server would fold it to the TOP function or a filter in a WHERE clause. Whilst the lack of supported sources can be seen as a limitation, I would argue that any form of Query Folding, even if it were just SQL Server, is a big plus point.

You can deliberately prevent query folding, although the only time I see a benefit is if you are connecting to a server running at full capacity or during a large ETL.  There are far more instances when Power Query will not apply query folding because of limitations in the source’s functionality.  Check out Koen Verbeeck’s blog if you would like to know more on this.

Conclusion

This article is just scratching the surface of how beneficial Query Folding in Power Query can be.  The key takeaway is remembering to put the transformation steps (that can be folded back to source) at the beginning.  Steps that cannot be folded should be applied as late as possible.  If this is adhered to, you will see huge performance benefits – especially on large datasets.

If anyone out there would like to share other data sources that use Query Folding or just have a general interest, feel free to comment below.

Query Parameters in Power Query – Part 2

As promised, here is Part 2 of Query Parameters in Power Query.  My last blog demonstrated how this feature (released in April 2016) could be used to reduce the data volumes in your PowerBook.  This article focuses on how you can easily switch between different parameter values and dynamically display the results. 

Query Parameters are fully explained in Part 1, so I will move straight on to the exciting stuff!

Use Cases & Example

There are plenty of scenarios when you want to filter large data sets based on various criteria.  A typical business case could be a customer table (in SQL Server, a text file, etc.), where the user wants to filter the data by a person’s First Name, Last Name or the City they are from.  This would require more than one parameter if we want to make our filtering dynamic.

I will talk you through a slightly different example.  Imagine a user wanting to look at the current products their business is selling, but look at the data by:

1.       Product Name ONLY = EnglishProductName

2.       Product Category, Product Sub Category & Product Name = EnglishProductCategoryName, EnglishProductSubcategoryName & EnglishProductName

Query Parameters can easily facilitate this.  To follow my example, download the AdventureWorksDW2014 database (found here) and connect through the SQL Server Database connector. Paste the following SQL code into the SQL Statement box – like below:

SELECT DISTINCT
              --DPC.[EnglishProductCategoryName] AS CategoryName,
              --DPS.[EnglishProductSubcategoryName] AS SubCategoryName,
              --DP.EnglishProductName AS ProductName,
              DP.ProductAlternateKey,
              DP.ListPrice AS ProductListPrice,
              COALESCE(DP.[Status],'Previous Version') AS ProductStatus
FROM   [dbo].[DimProduct] DP
INNER JOIN [dbo].[DimProductSubcategory] DPS
       ON DPS.[ProductSubcategoryKey] = DP.[ProductSubcategoryKey]
INNER JOIN [dbo].[DimProductCategory] DPC
       ON DPS.[ProductCategoryKey] = DPC.[ProductCategoryKey]
ORDER BY
       DPC.[EnglishProductCategoryName],
       DPS.[EnglishProductSubcategoryName],
       DP.EnglishProductName

 

NOTE:   You will see the first three columns are commented out.  This is because we will be referencing them in our parameter value.

[Screenshot: SQL statement in the connection dialog]

 

Switching between Parameter Values

I will now demonstrate how to create 2 parameter values that will be used to display the product data in different ways.

The following steps show you how to achieve this.  

1.       We now want to create the new parameter – in the Query Editor, click Manage Parameters from the ribbon.
 

2.       Create one parameter and select ‘List of Values’ as the Allowed Values drop down.  You will see there are two types of list output – one for just EnglishProductName and the other amalgamating the three product columns.   Click OK when happy.


[Screenshot: parameter with List of Values]

The code for the ‘concat’ function is:
Concat(EnglishProductCategoryName,' - ',EnglishProductSubcategoryName,' - ',EnglishProductName)

Note:     As we are connecting to SQL Server, the ‘Concat’ function is required to bring the product columns together.

3.       Rename the query to ‘DimProduct’.  We now need to insert the parameter into the DimProduct query – open the Advanced Editor and replace EnglishProductName with our new parameter – like below:


[Screenshot: parameter inserted in the Advanced Editor]
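The edited Source step ends up along these lines – a sketch, assuming the parameter is named ProductColumns and ignoring the ORDER BY:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2014", [Query =
        "SELECT DISTINCT " & ProductColumns & " AS ProductName, " &
        "DP.ProductAlternateKey, DP.ListPrice AS ProductListPrice, " &
        "COALESCE(DP.[Status],'Previous Version') AS ProductStatus " &
        "FROM [dbo].[DimProduct] DP " &
        "INNER JOIN [dbo].[DimProductSubcategory] DPS ON DPS.[ProductSubcategoryKey] = DP.[ProductSubcategoryKey] " &
        "INNER JOIN [dbo].[DimProductCategory] DPC ON DPS.[ProductCategoryKey] = DPC.[ProductCategoryKey]"])
in
    Source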

4.       You will see the following warning message.  Click Edit Permission.


[Screenshot: native query permission warning]

5.       The message is just warning you that as you are connected to the database, the query could make changes to the underlying objects.  As this is a ‘Select’ query, there is no danger of this.  Click Run.

[Screenshot: Run native database query prompt]

6.       As we set the default value to be EnglishProductName, this is what we now see in the query pane.  The field in question is called ProductName.

 

[Screenshot: ProductName showing product names only]

7.       To switch to the other parameter value, simply click Manage Parameters > Edit Parameters

[Screenshot: Edit Parameters menu]

8.       Select the other parameter value and click OK.

[Screenshot: selecting the other parameter value]

9.       Now look at ProductName – it now also includes the category and sub category data, with the three concatenated fields separated by ‘ - ’.

 

[Screenshot: concatenated ProductName values]

10.   That is parameter value switching in a nutshell.  Feel free to experiment with more combinations of values.

Conclusion

Whilst my demo is very simplistic, it shows just how powerful a concept Query Parameters can be.  This blog solely focuses on configuring a parameter to display different result sets, depending on what the user wants to see.  Thinking about the entire Power BI suite, we could hook up our parameterised datasets to Dashboards and empower the user to look at KPIs or trends in data, depending on their role/department.  The sales and customer service departments will focus on similar dimensional data – like customer and date, but the former might want to look at the ‘SalesAmount’ measure, whilst the latter would be interested in ‘RefundedAmount’.  These fields can be contained in one query parameter!

Query Parameters have been around in Power BI for a few months now and the use of the functionality is becoming more wide spread.  Look for more blogs on this subject in the near future.

Further Reading

There are already a couple of great blogs out there and the below links take you to some other excellent use cases:

1.       Chris Webb’s Blog - https://blog.crossjoin.co.uk/2016/05/08/the-m-code-behind-power-bi-parameters/

2.       Soheil Bakhshi Blog (1 of 2) – http://biinsight.com/power-bi-desktop-query-parameters-part-1/

3.       Soheil Bakhshi Blog (2 of 2) - http://biinsight.com/power-bi-desktop-query-parameters-part2-dynamic-data-masking-and-query-parameters/

Query Parameters in Power Query – Part 1

A new feature has recently been added into Power Query – called Query Parameters.  This was actually released to Power BI as part of May 2016’s update.

So what is a Query Parameter?  For anyone familiar with Microsoft Reporting Services, it has pretty much the same functionality as a Report Parameter.  In simple terms, it enables you to filter your data dynamically in Power Query.  There has always been a Filter UI function but it could not be linked to a parameter or plugged into the M code.  We can use Query Parameters as a way of reducing the data volumes brought into a PowerBook (useful for large datasets or tables) or to allow a user to switch between filters (e.g. changing database connections dynamically).

In this blog, I will concentrate on how we can reduce data volumes by implementing a Query Parameter.  Look out for Part 2 in the coming weeks, where I will pick a use case for when we would want to dynamically switch between parameter values.

Reducing Data Volumes

Imagine transforming or just analysing a fact table that contains over 1 billion records.  Not only will Power BI (and Power Query) probably crash or at best, perform badly, there is no need to bring all this data back into a PowerBook.  A subset, or perhaps an individual month is all that is required to undertake the necessary Power Query ETL operations.

The below example demonstrates how a Query Parameter can be used to filter records in a given dataset.  I have used the AdventureWorksDW2014 Database (download here) and picked the FactProductInventory table, which contains 776,286 rows.  I will reduce this count by adding a new Query Parameter in Power Query – using the Datekey to find all rows in between the 31st November and 30th December 2010.

1.       Assuming I have already loaded the data into the workbook, in the Query Editor, click Manage Parameters from the ribbon.
 

[Screenshot: Manage Parameters button]

2.       Create two parameters – one named DateKeyFrom and the other DateKeyTo.  The former contains the value 20101131 and the latter 20101230.  Keep the default settings and confirm by clicking OK.  Your two parameters should look like the below:

[Screenshot: DateKeyFrom and DateKeyTo parameters]

 

3.       Click on the DateKey column drop down and choose Number Filters and Between – highlighted below.

 

[Screenshot: Number Filters > Between]

 

 

4.       Set the filter to pick up the two newly created parameters.  The following screenshot illustrates the criteria required, with the key change being to switch the second drop down from ‘Decimal Number’ to ‘Parameter’.

[Screenshot: Filter Rows dialog using both parameters]
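For reference, the filter step this generates is a straightforward Table.SelectRows over both parameters (previous step name illustrative):

= Table.SelectRows(#"Changed Type", each [DateKey] >= DateKeyFrom and [DateKey] <= DateKeyTo)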

5.       Check that your dataset has been filtered and only contains data for the time period December 2010.

6.       If you need to change the date criteria and, let’s say, only display 5 days of data, this can be achieved easily by clicking the parameters in the Queries pane.  I have edited the DateKeyFrom value to 20101225.  You can also update a value by clicking the Manage Properties button.

[Screenshot: editing the DateKeyFrom value]

7.       As soon as you click back onto the FactProductInventory table query, the data automatically begins to refresh.  We are now left with only the last 5 days in December, shown below. 

 

[Screenshot: filtered FactProductInventory rows]

Conclusion

As Query Parameters are hot off the press, I have only just started experimenting with them.  It looks like the feature has been added not only to reduce and filter data, but to enable us to plug dynamic parameters into M.  Previously, I used a Function as a workaround but can now actually use parameters in the proper context.  Click here to view a previous blog, where I specifically create and invoke a function.  There are a number of other uses for Functions and Rafael Salas’ blog provides an excellent walkthrough.

My next blog will focus on how you can switch between different parameter values and dynamically display the results.  If anyone has built their own Query Parameter examples and want to share them, feel free to leave me a comment below.  I would be very interested to see how far people have already taken this new feature.

Further Reading

There are already a few great blogs and use cases out there:

1.       Chris Webb’s Blog - https://blog.crossjoin.co.uk/2016/05/08/the-m-code-behind-power-bi-parameters/

2.       Soheil Bakhshi Blog (1 of 2) – http://biinsight.com/power-bi-desktop-query-parameters-part-1/

3.       Soheil Bakhshi Blog (2 of 2) - http://biinsight.com/power-bi-desktop-query-parameters-part2-dynamic-data-masking-and-query-parameters/

Replicating SSIS Data Profiling in Power Query

As a BI Consultant, I regularly sit down with a new/prospective client and dissect both their business requirements and source data.  With the latter, the SSIS Data Profiling Task (in Visual Studio or SSDT) is a very useful tool to find out the general quality of the data we will be importing into the warehouse.  Count of NULLS, Value frequency, Min/Max/Median Values and unique count of records are just a few types of data profiling.

I have recently been playing with the out of the box Data Profiling functionality in Power Query.  Although the functions are very useful, I quickly realised that I would need to extend my PowerBook to match the Data Profiling Task in SSIS.  I will break my findings down into a couple of parts, with this instalment focusing on a specific data profiling output from SSIS.

Table Profile Function

Table.Profile() is a relatively new function in Power Query that gives some profiling stats for the columns in a specific database table or raw file.  In the example below, I pulled in a raw Excel spreadsheet from the Power BI website, which contains sample financial data.  Click here to download the data.

[Screenshot: imported financial sample data]

To obtain stats on this query, open the Advanced Editor and add the following M at the end of the existing code:

    #"TableProfiling" = Table.Profile( #"Changed Type" as table) as table

in

     #"TableProfiling"

[Screenshot: Advanced Editor with the Table.Profile step]

All we are doing is creating a new step in M, referencing the previous step and returning the new one.  The Table.Profile() function doesn’t require anything other than the table to profile and the underlying Power Query engine provides all of the functionality.  Each column has been profiled – see below:

[Screenshot: Table.Profile output]

I think the outputs are self-explanatory, especially if you are familiar with Data Profiling.  Either way, one line of code gives us some useful high level stats about the data and can help us capture the extremes or anomalies in each column.  Not all analysis will be suitable for a column, e.g. the ‘Country’ field will not produce Standard Deviation or Average values because it uses the Text datatype.

Column/Frequent Value Distribution

Whilst the in-built Power Query function is ideal for overall table/file analysis, data profiling is most effective when focusing on specific columns and data patterns.  Both Column and Frequent Value Distribution are basic features of the SSIS Profiling Task, shown below.

[Screenshot: SSIS Data Profiling Task output]

For the Power Query proof of concept, I have connected to AdventureWorks2014 and the Person.Address table – named PersonAddress.  This means I can use the SSIS Data Profiling screenshot as the benchmark, to not only prove it can be replicated, but to ensure the numbers match.

The first step is to apply the Table.Profile() function to a Reference query – called TableProfiling.  We can then create a simple staging query that just includes what we need from the out of the box profiling – called TableProfilingStage.  ‘Number of Values’ is renamed to ‘TotalCount’ and ‘Number of Distinct Values’ to ‘DistinctCount’.  We are left with the following data:

[Screenshot: TableProfilingStage output]

The row of interest is ‘City’ (contained in Column), as this is what we need to profile.  The TableProfilingStage query is the reference point for the frequent value distribution.  Create a New Source and paste the following M code into the Advanced Editor:

[Screenshot: M code for the ColumnProfiling query]
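As the code is only shown in the screenshot, here is a sketch of the approach (step names illustrative), which lines up with the numbered points that follow:

let
    //'City' sits at ordinal position 3 in TableProfilingStage (positions start at 0)
    ColumnName = TableProfilingStage[Column]{3},
    TotalCount = TableProfilingStage[TotalCount]{3},
    //Count the occurrences of each City value
    GroupedValues = Table.Group(PersonAddress, {ColumnName}, {{"Count", each Table.RowCount(_)}}),
    //Frequent Value Distribution = (Count/TotalCount)*100
    AddDistribution = Table.AddColumn(GroupedValues, "FrequentValueDistribution", each ([Count] / TotalCount) * 100),
    //Sort by count, descending
    SortedRows = Table.Sort(AddDistribution, {{"Count", Order.Descending}})
in
    SortedRows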

Points of note:

1.       Lines 1-4 reference values in other queries, specifically at ordinal position 3.  The ‘City’ column is the 4th record in TableProfilingStage, which equates to position 3 (ordinal positions start at 0).

2.       Each ‘City’ value is grouped with a total count.

3.       Individual record counts (value of 1 per row) to use in the eventual calculation.  This value is then summed to give a total record count on each row.

4.       Apply (Count/TotalCount)*100 to work out the Frequent Value Distribution.

5.       Renamed, sorted (by count) and hidden any unnecessary columns.

6.       Save your new query – I called it ColumnProfiling.

7.       You should have 4 queries in total:

a.       PersonAddress

b.       TableProfiling

c.       TableProfilingStage

d.       ColumnProfiling

As long as your queries are named identically to my examples, you will end up with the below outputs:

Column Value Distribution:

[Screenshot: Column Value Distribution output]

Frequent Value Distribution:

[Screenshot: Frequent Value Distribution output]

We have now replicated Column and Frequent Value Distribution and the figures match the SSIS Profiling Task! Although Power Query does not provide the pretty visualisations, we can easily reproduce them in the Power BI reporting suite.  I will save this for another blog, though.

Next Steps

Look out for my next blog, where I aim to enhance the current Data Profiling PowerBook and add configuration.  In order to make this method as efficient as (or more efficient than) SSIS, it will need to be automated, handle a number of sources (e.g. csv, database connection, etc.) and incorporate invoked Power Query functions.

Conclusion

It is clear to see that Power Query is a very effective tool to replicate SSIS Data Profiling.  I believe with further development, a PowerBook could be used as a quicker and more efficient tool to run data profiling. 

I am very interested to see if anyone has created their own Data Profiling tasks in Power Query and welcome any comments, feedback or questions.

Flat Files in Power Query

The out-of-the-box functionality in Power Query will enable you to import pretty much any flat file, ranging from csv to fixed width.  However, there are instances when badly formatted files cannot be loaded correctly and this blog will demonstrate a workaround for this. 

Unusual Flat Files

Whilst csv and txt are the most common types of flat file, I have worked on a number of BI projects where the data warehouse needs to ingest fixed width, ragged right or pipe delimited text files.  You won’t be surprised to hear that Power Query recognizes all of these formats!   I will demonstrate below how simple it is to import a Ragged Right file with a ‘.in’ extension. 

Ragged Right in Notepad:

[Screenshot: ragged right file in Notepad]

Power Query’s Interpretation:

[Screenshot: Power Query’s interpretation]

 

Badly Formed Flat Files

You won’t normally know that a flat file is badly formed until actually importing it.  In the below example, a csv file contains both missing and double commas.  Csv stands for ‘comma separated values’, so Power Query expects a comma to split each column.  Any developer who has imported data (through SSIS, SQL import, etc.) will have encountered a problem like this.

Two Commas in Notepad:

[Screenshot: double commas in Notepad]

Two Commas in Power Query:

As expected, Power Query believes there is a blank column between the values ‘Liverpool’ and ‘Stoke’.  As a result, the data is shifted to the right.

[Screenshot: shifted data caused by double commas]

Missing Commas in Power Query:

A missing comma between columns works the opposite way.  Data is shifted to the left.

[Screenshot: shifted data caused by missing commas]

Solution

As we already know that the csv file contains both double and missing commas, the next step is to import the data in a non-delimited format.  To do this, start a new Blank Query and type the following code into the Advanced Editor:

let
    Value = Table.FromList(
        Lines.FromBinary(File.Contents("D:\Documentation\Internal\PowerQuery\FlatFiles\CSVMissingOrMoreCommas.csv")),
        Splitter.SplitByNothing())
in
    Value

The code is turning the flat file data into a table and treating every line as a single value in one column – hence the ‘SplitByNothing’ syntax.  This is how the query looks:

[Screenshot: single-column query]

We can now replace the double commas with a single comma and split the table by ‘,’.  I have previously shown you how to Replace Values in an earlier blog, but will quickly demonstrate the Split Columns function.

[Screenshot: Split Column by Delimiter dialog]

Power Query actually auto detected all of these settings, as it can see how frequently the commas occur – 10 times per line.  If you drop down Advanced options, you can change the delimiters and output columns if they are not interpreted correctly by Power Query.  Finally, we can apply the Use First Row As Headers UI function and filter out the nulls in the ‘HTR’ column.
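In M, the split and header promotion come down to something like the following steps (the 11 columns follow from the 10 delimiters Power Query detected):

    SplitColumns = Table.SplitColumn(Value, "Column1", Splitter.SplitTextByDelimiter(","), 11),
    PromotedHeaders = Table.PromoteHeaders(SplitColumns)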

[Screenshot: split columns with promoted headers]

There is no easy way to write automated logic that would fix these three records because the missing commas were in different places and in our current query, different columns.  There are also no string patterns that can be easily deduced.  Therefore, this time around we will use Replace Values to move the data to the required columns.

[Screenshot: corrected records after Replace Values]

There it is - a very effective way of visualising and managing both unusual and badly formed flat files in Power Query. 

Next Steps

We could easily extend the code to handle other file types or delimiters. Ken Puls’ blog demonstrates how you can pass the flat file connection string in as a parameter within a function.  This would be very handy for someone wanting to bring in multiple flat files through one configuration.  I would be very interested to see if any other Power Query developers have their own M code that handles flat files.  Feel free to comment or contact me directly if you have any suggestions on how to extend my example further. 

ETL in Power Query

If you have read any of my other Power Query blogs, you will have seen the results of transformed and mashed up queries - but not how it was achieved.  This article picks out my favourite 5 Extract, Transform & Load (ETL) features in Power Query.

Example

Being a keen sports fan, I wanted to find out the medal history of the 100m Olympic Finals for both men and women. 

I’ve picked two datasets, both from different sources.  You can download them using the links below:

-          Source 1 - CSV

o   ISO Countries.csv - https://datahub.io/dataset/iso-3166-1-alpha-2-country-codes/resource/9c3b30dd-f5f3-4bbe-a3cb-d7b2c21d66ce

-          Source 2 – Web

o https://en.wikipedia.org/wiki/100_metres_at_the_Olympics

o   There is a lot of other data on the site but for this example, I have only picked the race result data tables - Men [edit] and Women [edit].

Please Note:  I am happy to provide the Power BI Workbook upon request.  Leave a comment below and I will get back to you ASAP.

The imported data (pre transformed) should look like the below in Power Query.  You may notice some ETL has already taken place, but this was automatically carried out by Power Query and includes simple tasks like displaying column headers.

[Screenshot: imported queries pre-transformation]

Pivot/Unpivot

There are many occasions when we need to completely flip the contents of a table.  For our example, ‘Gold’, ‘Silver’ and ‘Bronze’ should not be columns, but rows instead.  We only need one column and this can be named ‘MedalType’.  The un-pivot feature can be applied (to a whole query) with just one click of a button.  Ensure the following columns and features are selected:

[Screenshot: Unpivot Columns selection]
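The generated step is a single Table.Unpivot call – roughly the following, with the new columns named directly here rather than renamed afterwards (the next paragraph does the rename through the UI):

= Table.Unpivot(#"Changed Type", {"Gold", "Silver", "Bronze"}, "MedalType", "Athlete")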

 

Once the data is un-pivoted, rename the new columns to ‘Athlete’ and ‘MedalType’ respectively.  Repeat the process for the Men [Edit] query, which looks like the following:

clip_image005

Replace Values

When quickly eyeballing the data (in Men [Edit]), it is apparent that two athletes shared a Bronze medal in 1896.  We need to handle this data defect but unfortunately, there isn’t a one click method of achieving this.  The easiest way is to use the Replace Values and Remove Top/Bottom Rows UI functions.

Firstly, let’s duplicate the full Men [Edit] query by right clicking on it and selecting Duplicate.  We now want to remove all records from the copied query, apart from the one that contains two bronze medalists.  To do this, right click on the little table button in the corner of the query and choose either of the following:

[Screenshot: Remove Top/Bottom Rows options]

You can then specify how many top or bottom rows you want to remove.  Ensure the only record left looks like the below:

[Screenshot: single remaining record]

Lastly, the Replace Values function is used to strip out ‘Francis Lane’, thereby removing the duplicate.  The aim is to insert this one record back into the main Men [Edit] query.  Right click on ‘Athlete’ and select Replace Values.  Type the below text in the function box and click OK.

Value to Find:   Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)
Replace With:   Alajos Szokolyi (HUN)

The ‘#(lf)’ elements of the search string match the line-feed characters; without them the replace will not work.

[Screenshot: Replace Values dialog]
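The generated M is a single Table.ReplaceValue step (previous step name illustrative):

= Table.ReplaceValue(#"Removed Top Rows", "Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)", "Alajos Szokolyi (HUN)", Replacer.ReplaceText, {"Athlete"})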

We must tidy up the Men [Edit] query by doing the same as above, but the opposite way around.  This leaves us with the following:

[Screenshot: tidied Men [Edit] query]

Append Queries

This is another neat feature in Power Query.  In SQL Server or SSIS terms, this provides a union between two queries.  If the column names and data types do not match, then it may fail or give unexpected results.

Open the Men [Edit] query, Click the Home tab in the top menu pane and select Append Queries.  A pop up window will open, where you can specify the query to append into Men [Edit].  For this example, I selected the below and clicked OK.

[Screenshot: Append Queries dialog]

-          It is possible to append multiple tables and in fact, we need to do this in order to bring both the Men [Edit] (2) and the Women [Edit] queries into Men [Edit].

-          To make it easy to distinguish between male and female, I have added a new column to the queries that specifies gender.  This is not mandatory, especially if you are following this example.

-          If we select the Two tables option, then only one query can be appended at a given time.

We can rename the newly appended query to Athletes and tidy up the Men [Edit] (2) and Women [Edit] queries.  Putting raw queries into an appropriately named Group helps distinguish each part of our ETL – see below:

[Screenshot: grouped queries pane]

There are more queries now but the Raw folder contains imported, untouched queries, whereas Transform includes tables that have been modified and joined together.  The Load folder will eventually contain the star schema tables (dimensions and facts).

Add Custom Column

Any developer working in ETL will have needed to derive a column before.  Power Query offers this functionality, both through the UI and the M query language.  We want to create a new column called ‘Nationality’ that extracts the characters (in between the brackets) from ‘Athlete’.

In Power Query, a derived column is called Custom Column.  Ensure the Athletes query is selected, then navigate to the Add Column tab in the top menu pane and select the Add Custom Column button.  The pop up box appears, where we can write some custom M code:

[Screenshot: Add Custom Column dialog]

-          Text.End is the equivalent to the RIGHT function in SQL Server.

-          Text.Start is equivalent to LEFT.

-          Wrapping them both into one line of code ensures the string inside the brackets is extracted from ‘Athlete’ – see the sketch after this list.

-          Rename the new column to Nationality.
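Putting those together, the custom column formula is along these lines – a sketch, assuming the athlete strings always end with a bracketed three-letter code such as ‘(USA)’ (previous step name illustrative):

= Table.AddColumn(#"Appended Queries", "Nationality", each Text.Start(Text.End([Athlete], 4), 3))

Text.End([Athlete], 4) takes the last four characters (e.g. ‘USA)’) and Text.Start then keeps the first three of those (‘USA’).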

[Screenshot: Nationality column]

Merge Queries

This feature is the same as applying a JOIN in SQL Server.  In simple terms, it gives us the ability to mashup and bring together queries from different data sources.  We will bring the ISOCountriesRaw data into the Athletes query.

Under the Home tab, select Merge Queries.  A new window appears, where we can specify the tables and columns that need to be joined.  Simply select the ISOCountriesRaw query and the two columns shown in the below screenshot.  The Join Kind should be Left Outer, as this will bring across only the matching records from ISOCountriesRaw, whilst keeping all records in the main Athletes query.

[Screenshot: Merge Queries dialog]

The wizard will also give you a match preview – for this example, 142 of the 144 records in Athletes match the ISOCountriesRaw table.  Click OK to complete the merge.
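In M, the merge is a Table.NestedJoin – a sketch, assuming the matching ISO column is called ‘Code’ (the previous step name is also illustrative):

= Table.NestedJoin(#"Added Custom", {"Nationality"}, ISOCountriesRaw, {"Code"}, "ISOCountriesRaw", JoinKind.LeftOuter)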

There are a number of columns that aren’t needed in this table and can be filtered out accordingly:

[Screenshot: removing unwanted columns]

There are two records that don’t match because the medal winner was subsequently removed from the record books.  We can use the Replace Values function to replace the nulls with some more meaningful information (see below).  We are now ready to turn the data into a star schema!

[Screenshot: nulls replaced with meaningful values]

Next Steps

Now we have a fully transformed query, it can be broken down into a fact table and dimensions.  I will not demonstrate this here but, by following Kimball’s data warehousing principles, there is now a star schema that can be plugged into Power Pivot or Power View.

NOTE:   Add Index Column was used to create primary and foreign keys on the ‘Load’ tables.  See ‘Other Features’ for link on how to do this.

Queries:

[Screenshot: final queries]

Relationships:

[Screenshot: star schema relationships]

Other Features

There are many other ETL related functions within Power Query that have not been discussed in this blog.  It is also possible to enhance the UI features using M or write specialist code to enhance and configure your ETL process, which I demonstrated above.

For further information, I would recommend reading the below articles:

-          5 Very Useful Text Formulas - Ken Puls
http://www.excelguru.ca/blog/2014/08/20/5-very-useful-text-formulas-power-query-edition/

-          Add Index Column – Microsoft

https://support.office.com/en-us/article/Insert-a-custom-column-into-a-table-Power-Query-2dbb579a-915b-4ebd-b622-8e7f3d1d61a6?ui=en-US&rs=en-US&ad=US

-          Split Columns - Mark Vaillancourt

http://markvsql.com/2015/03/advanced-column-splitting-in-power-query/

-          Group By – Microsoft

https://support.office.com/en-us/article/Group-rows-in-a-table-Power-Query-e1b9e916-6fcc-40bf-a6e8-ef928240adf1?ui=en-US&rs=en-US&ad=US

-          ETL Examples – Chris Webb

http://blog.crossjoin.co.uk/2013/08/16/some-power-querym-examples/