Callum

Callum Green's Blog

Loop Through Webpages with Multiple Parameters in Power Query

Back in February 2016, I posted a step by step article (found here) on how to loop through a website and dynamically pull down multiple pages into a single table in Power Query. I was recently approached by a gentleman who had read my blog, wanting to enhance the looping process. The requirements were:

Current Problem:

-          The user can currently only loop through Year, having to hard code the Page and Month.

Requirements:

-          Dynamically loop through three webpage parameters and output them into a single table – highlighted below:

http://www.boxofficemojo.com/monthly/?page=1&view=calendargross&yr=2017&month=1&p=.htm

-          Specify the Page, Year and Month to loop through

o   Years 2010 - 2017

o   Months 1 - 12

o   Pages 1 - 10

Parameters

Whilst GUI Parameters are not a new concept in Power Query, they did not exist when I wrote my initial blog.  As a result, I had to code a parameter into the M query and then define it as a function.  A similar concept will be adopted here, but the out of the box parameter functionality makes life so much easier now.

I previously wrote an article explaining how to set up parameters (here), so I will simply show you the output of them below:

clip_image002[4]

Note – all three parameters are set to Text.  This is needed in order to pass the parameters into the website address later on.

Web Connection

The first step is to import the data from the website (shown above) using the web connector.  To pass our parameter values into the URL, click the cog in the first Applied Step – typically called ‘Source’.

clip_image004[4]

This will open a new window, where by design, the hardcoded URL is shown under the Basic radio button.  Select Advanced and you will see additional configuration capabilities.

clip_image006[4]

URL Parts enable you to build up elements of the webpage, whilst dynamically passing in parameters.  The URL Preview will show you how the URL looks with any parameters or URL Parts.  All other options can be ignored for this example.

The URL Parts had to be broken down quite significantly and I will show you both the code and how it looks in the Advanced window.

User Interface:

clip_image007[4]

The sections highlighted in red are the parameters and sit in between some of the hard-coded URL text

Code Breakdown:

-          Text = http://www.boxofficemojo.com/monthly/?page=

-          Parameter = [Page]

-          Text = &view=calendargross&yr=

-          Parameter = [Year]

-          Text = &month=

-          Parameter = [Month]

-          Text = &p=.htm

URL Preview:

http://www.boxofficemojo.com/monthly/?page={Page}&view=calendargross&yr={Year}&month={Month}&p=.htm

Year/Month/Page Permutations Table

Before creating/invoking the function, a table that contains the different years, months and pages is needed.  There are many ways to achieve this, but I used a cross join technique (found here) with three disparate tables.  I followed the end user requirements and end up with a table like below:

clip_image009[4]

This will be the final table where the function is called.

Function

Let’s go back to the original table, where the parameters have been applied to the Web URL.  Right click the table and select Create Function.  Nothing else is required (apart from naming it), as Power Query will automatically package up the table and you end up with a new Function like below:

clip_image010[4]

The last step is to invoke the function.  Go back to the Permutations table (called ‘All Data’ in my example), and go to Add Column > Invoke Custom Function.

clip_image011[4]

When prompted with the next screen, map each parameter to the equivalent column.  The red highlighted values are the columns and yellow contain the parameters.  Both are named the same in my example and make mapping even easier.

clip_image013[4]

The function will loop through each URL permutation, dynamically changing the parameters on execution.  Once it has finished, tidy up any errors (for Years, Pages and/or Months) that do not exist.  Apply all transformations within the Query Editor and that is it – the data is ready for Reporting!  You may need to configure some of the measures (e.g. Total Gross) to be numbers, but that is easy enough to do within the Query Editor.

The example in this blog can be adapted in many ways and illustrates how dynamic Power Query can be – both in Excel and Power BI.

Further Reading

o   Cross Join in Power BI –  http://bit.ly/2ij8zJ1

o   Reza Rad Custom Functions – http://bit.ly/2zz6X2f

Contact Me

If you would like a copy of the PBIX workbook or have any general questions, feel free to leave a comment below.

Twitter:                @DataVizWhizz

Handling Web Page Errors in Power BI Query Editor

I was recently asked if it was possible to handle Error Rows in the Query Editor and more specifically, to web pages that do not exist from source. The user broached me with this question after following a previous blog on looping through multiple web pages in Query Editor – found here.

In my blog post, there were no errors from the loop output but in the user’s workbook, they were losing data directly below the expanded error row. My first suggestion was to use the Remove Errors UI option, which would delete the problematic row. However, they wanted a way of treating the errors as an Unknown Member and did not want to lose the data entirely

It is assumed consumers of this blog already have knowledge of the Power BI Query Editor and the UI functions used in the examples.  

Scenario

I created a new workbook that connects to local Council Facebook pages. Each page has a unique Object ID, which will be used as the parameter in the loop. The Council “Camberley” deliberately contains an invalid Object ID. I then proceeded to create a Parameter and Function, replicating the exact steps from my previous blog.

When I invoke the function (through the use of a Custom Column), the following is produced:

clip_image001[4]

As you can see, Camberley Council produces an error in the fnInvokeCouncils column. If we expand the contents (highlighted in yellow), the Facebook page data appears. Upon further inspection, the Farnham and Rushmoor council data are available, but Camberley (incorrect Object ID) and Guildford are not.

clip_image003[4]

The error message is a little misleading but let’s save the debugging debate for another day. The key observation is “Guildford” data is not available, simply because it comes after “Camberley” in the list. Whilst we want to see errors in a Query, we do not want them causing data loss.

Resolution

As I mentioned at the beginning of this article, using the Remove Errors function would prevent the loss of Guildford data. However, the user needs to handle errors as Unknown Members and conform to a typical Kimball Data Warehouse. 

I am sure there are many ways to fulfil the requirement, but here is how I approached it:

1.       Duplicate the existing ‘Councils’ query, naming it ‘Councils Error Rows’.

2.       Switch back to the ‘Councils’ query and Remove Errors, leaving only three records:

 

clip_image005[4]

3.       Expand the fnInvokeCouncils column, opening up the underlying fields and data:


clip_image007[4]

4.       In the ‘Council Error Rows’ query, apply the Replace Errors UI function - inserting the string “Validation Failed”.

5.       Add a Custom Column, writing the following M:

if [fnInvokeCouncils] = "Validation Failed" then 1 else 0

This is a simple IF statement that sets the error rows to 1.

6.       Now filter the selection to only display ErrorRows with the value of 1. This is achieved by using the Filter Rows UI function. The ‘Council Error Rows’ query now looks like the following:


clip_image008[4]

7.       The columns must match the ‘Councils’ query, meaning 4 new Custom Columns are needed. We can hardcode the values and remove any unwanted columns.

 

clip_image010[4]

 

8.       Right click on the previously modified ‘Councils’ query select Reference. Name the query ‘All Councils’. This makes it easier to track the transformations and persists any future changes made to the raw data.  

9.       Within the ‘All Council’ query, select Append Query transformation. Choose ‘Council Error Rows’ as the table to append and click OK.
 

10.   We are now left with a Union of both datasets, containing the Unknown Member and data from all other Councils.

clip_image012[4]

11.   The Unknown Member record is visible within the final Query.

 

clip_image014[4]

 

Summary

I have shown you how to get around two different problems with Error Rows in the Power BI Query Editor. The first is how to retain all correct data, whilst the second is keeping error records and inserting them back into the dataset as an Unknown Member. Both methods are dynamic, meaning that if we added a new Council called ‘Basingstoke’, we would see the underlying data regardless of whether the underlying Facebook Object ID exists or not.

Whilst none of the transformations in this blog are overly technical, there are quite a few hoops to jump through to retain all data from a web page/Facebook. Having said that, I am sure there are a few other ways people could approach this problem. I would be really interested to speak to anyone who does have an alternative solution.

Further Reading

Query Editor Basics (Power BI blog) – http://bit.ly/2pwBdo1
Unknown Members in Data Warehousing -
http://bit.ly/2qTefwe
Loop through Multiple Web Pages using Power Query - http://bit.ly/2q3a8Nc

Contact Me

If you would like a copy of the workbook containing the examples or want to know more about the Query Editor within Power BI, please leave a comment below. My Twitter details are also provided.

Twitter:  @DataVizWhizz