Callum

Callum Green's Blog

Loop Through Webpages with Multiple Parameters in Power Query

Back in February 2016, I posted a step by step article (found here) on how to loop through a website and dynamically pull down multiple pages into a single table in Power Query. I was recently approached by a gentleman who had read my blog, wanting to enhance the looping process. The requirements were:

Current Problem:

-          The user can currently only loop through Year, having to hard code the Page and Month.

Requirements:

-          Dynamically loop through three webpage parameters and output them into a single table – highlighted below:

http://www.boxofficemojo.com/monthly/?page=1&view=calendargross&yr=2017&month=1&p=.htm

-          Specify the Page, Year and Month to loop through

o   Years 2010 - 2017

o   Months 1 - 12

o   Pages 1 - 10

Parameters

Whilst GUI Parameters are not a new concept in Power Query, they did not exist when I wrote my initial blog.  As a result, I had to code a parameter into the M query and then define it as a function.  A similar concept will be adopted here, but the out of the box parameter functionality makes life so much easier now.

I previously wrote an article explaining how to set up parameters (here), so I will simply show you the output of them below:

[Screenshot: the three parameters – Page, Year and Month]

Note – all three parameters are set to Text.  This is needed in order to pass the parameters into the website address later on.
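For anyone curious about what the GUI creates, each parameter is just a small M query behind the scenes.  A rough sketch of the Page parameter is shown below (Year and Month follow the same pattern); the default value of "1" is only an example.

"1" meta [IsParameterQuery = true, Type = "Text", IsParameterQueryRequired = true]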

Web Connection

The first step is to import the data from the website (shown above) using the web connector.  To pass our parameter values into the URL, click the cog in the first Applied Step – typically called ‘Source’.

[Screenshot: the cog icon next to the Source step]

This will open a new window where, by default, the hard-coded URL is shown under the Basic radio button.  Select Advanced and you will see additional configuration capabilities.

[Screenshot: the Advanced view of the From Web dialog]

URL Parts enable you to build up elements of the webpage, whilst dynamically passing in parameters.  The URL Preview will show you how the URL looks with any parameters or URL Parts.  All other options can be ignored for this example.

The URL had to be broken down into quite a few parts, so I will show you both the code breakdown and how it looks in the Advanced window.

User Interface:

[Screenshot: the URL Parts in the Advanced window, with the parameters highlighted in red]

The sections highlighted in red are the parameters, which sit between the hard-coded portions of the URL text.

Code Breakdown:

-          Text = http://www.boxofficemojo.com/monthly/?page=

-          Parameter = [Page]

-          Text = &view=calendargross&yr=

-          Parameter = [Year]

-          Text = &month=

-          Parameter = [Month]

-          Text = &p=.htm

URL Preview:

http://www.boxofficemojo.com/monthly/?page={Page}&view=calendargross&yr={Year}&month={Month}&p=.htm
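Behind the scenes, the Advanced dialog simply concatenates the URL Parts into a single M expression.  A rough sketch of the resulting Source step is shown below; because the parameters are of type Text, no conversion is needed before concatenating them.

Source = Web.Page(Web.Contents("http://www.boxofficemojo.com/monthly/?page=" & Page
    & "&view=calendargross&yr=" & Year & "&month=" & Month & "&p=.htm"))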

Year/Month/Page Permutations Table

Before creating/invoking the function, a table that contains the different years, months and pages is needed.  There are many ways to achieve this, but I used a cross join technique (found here) with three disparate tables.  I followed the end user requirements and ended up with a table like the one below:

[Screenshot: the Year/Month/Page permutations table]

This will be the final table where the function is called.
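If you prefer to build the permutations table in M rather than through the UI, a minimal sketch of the cross join is shown below.  It assumes the column names Year, Month and Page, and converts the numeric ranges to text so they match the parameter types.

let
    // Three single-column tables holding the required ranges, converted to text
    Years  = Table.FromList(List.Transform({2010..2017}, Text.From), Splitter.SplitByNothing(), {"Year"}),
    Months = Table.FromList(List.Transform({1..12}, Text.From), Splitter.SplitByNothing(), {"Month"}),
    Pages  = Table.FromList(List.Transform({1..10}, Text.From), Splitter.SplitByNothing(), {"Page"}),
    // Cross join: add each table as a nested column, then expand it
    AddMonths    = Table.AddColumn(Years, "AllMonths", each Months),
    ExpandMonths = Table.ExpandTableColumn(AddMonths, "AllMonths", {"Month"}),
    AddPages     = Table.AddColumn(ExpandMonths, "AllPages", each Pages),
    ExpandPages  = Table.ExpandTableColumn(AddPages, "AllPages", {"Page"})
in
    ExpandPages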

Function

Let’s go back to the original table, where the parameters have been applied to the Web URL.  Right click the table and select Create Function.  Nothing else is required (apart from naming it), as Power Query will automatically package up the table and you end up with a new Function like below:

[Screenshot: the newly created function]

The last step is to invoke the function.  Go back to the Permutations table (called ‘All Data’ in my example), and go to Add Column > Invoke Custom Function.

[Screenshot: Add Column > Invoke Custom Function]

When prompted with the next screen, map each parameter to the equivalent column.  The values highlighted in red are the columns and those in yellow are the parameters.  Both are named the same in my example, which makes mapping even easier.

[Screenshot: mapping each parameter to its equivalent column]

The function will loop through each URL permutation, dynamically changing the parameters on execution.  Once it has finished, tidy up any errors caused by Year, Page and/or Month combinations that do not exist.  Apply all transformations within the Query Editor and that is it – the data is ready for reporting!  You may need to change the data type of some of the measures (e.g. Total Gross) to a number, but that is easy enough to do within the Query Editor.
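For reference, the generated steps look roughly like the below.  'GetBoxOffice' is a hypothetical name for the function and 'All Data' is the permutations table; the Removed Errors step is the equivalent of right clicking the new column and selecting Remove Errors.

#"Invoked Custom Function" = Table.AddColumn(#"All Data", "BoxOffice",
    each GetBoxOffice([Page], [Year], [Month])),
// Remove any permutations that errored (e.g. a Page or Month that does not exist)
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Invoked Custom Function", {"BoxOffice"})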

The example in this blog can be adapted in many ways and illustrates how dynamic Power Query can be – both in Excel and Power BI.

Further Reading

o   Cross Join in Power BI –  http://bit.ly/2ij8zJ1

o   Reza Rad Custom Functions – http://bit.ly/2zz6X2f

Contact Me

If you would like a copy of the PBIX workbook or have any general questions, feel free to leave a comment below.

Twitter: @DataVizWhizz

Buffer() M Function in Query Editor (Power BI)

Whilst I have been aware of the Buffer() M function in the Query Editor of Power BI for a while, I had never really utilised its capabilities until now.  There are two main types of buffer functionality – Table.Buffer and List.Buffer.  To define them simply, Table.Buffer loads an entire table into memory and isolates it from external changes during evaluation, whereas List.Buffer does the same for a list, giving it a stable count and order so it does not have to be re-evaluated.
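A minimal sketch of both functions is shown below, assuming the AdventureWorksDW2014 database used later in this post.

let
    Source    = Sql.Database("localhost", "AdventureWorksDW2014"),
    Customers = Source{[Schema="dbo",Item="DimCustomer"]}[Data],
    // Table.Buffer holds the whole table in memory, isolating it for the rest of the evaluation
    BufferedTable = Table.Buffer(Customers),
    // List.Buffer holds a (sorted) list of values in memory so it is only evaluated once
    BufferedList  = List.Buffer(List.Sort(Customers[CustomerKey], Order.Descending))
in
    BufferedTable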

This blog will focus on the theory behind the general Buffer() M functionality, picking a specific scenario where it can outperform the standard Query Editor behaviour.  I will also demonstrate that it is not always the most effective technique, even within the same scenario.  The article will not give you a hard and fast rule for when to use the Buffer() function, because that can depend on a number of factors.  These are described further below.

Note:    It is assumed you have existing knowledge of Query Folding and if not, one of my previous blogs should help greatly.

Scenario

I found inspiration from Chris Webb’s example, using the Adventure Works DW 2014 database – available here.

The requirements are:

1.       Obtain the first 10,000 rows from FactInternetSales

2.       Remove the majority of columns, retaining ONLY:

a.       SalesOrderLineNumber

b.      CustomerKey

c.       SalesAmount

3.       Rank the current row based on Sales Amount.

List.Buffer()

Assuming your database exists on a local server and is named AdventureWorksDW2014, copy the following code into the Advanced Editor in the Query Editor screen.

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),
    //Get first 10000 rows from FactInternetSales
    dbo_FactInternetSales = Table.FirstN(
          Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
          10000),
    //Remove unwanted columns
    RemoveColumns = Table.SelectColumns(
          dbo_FactInternetSales,
          {"SalesOrderLineNumber", "CustomerKey", "SalesAmount"}),
    //Get sorted list of values from SalesAmount column
    RankValues = List.Sort(RemoveColumns[SalesAmount], Order.Descending),
    //Calculate ranks
    AddRankColumn = Table.AddColumn(RemoveColumns, "Rank",
          each List.PositionOf(RankValues, [SalesAmount]) + 1)
in
    AddRankColumn

You can visibly see the rows loading – one by one.  In total, it takes nearly 1 minute to load all of the results.

Now let’s use the List.Buffer() function in the RankValues step.

Replace:

= List.Sort(RemoveColumns[SalesAmount], Order.Descending)

With:

= List.Buffer(List.Sort(RemoveColumns[SalesAmount], Order.Descending))

[Screenshot: the RankValues step wrapped in List.Buffer]

The entire transformation (from start to finish) completes in just under 2 seconds!  This is because the List.Buffer function stores the sorted values in memory, so the rank calculation only has to evaluate the list once.  Previously, the RankValues step (and the steps before it) was being re-evaluated over and over again.  The M language is functional and, at times, lazy; to prevent the constant re-evaluation, buffer the list into memory.

The final query output is shown below:

[Screenshot: the final query output]

Query Folding

We will implement the same requirements, but this time using Query Folding. 

The third step in our current transformation is 'RemoveColumns'.  This is what prevents Query Folding, as the function cannot be interpreted/translated into the native SQL Server T-SQL language.  As a result, none of the steps below it are folded either.

The way around this is to write a SQL Server view (in SSMS) that imports just the fields required from the underlying FactInternetSales table.  The below view gives you the same result as the 'RemoveColumns' step, with the rank calculated in T-SQL.

CREATE VIEW dbo.VwFactInternetSalesAmount
AS
    SELECT [SalesOrderNumber]
          ,[CustomerKey]
          ,[SalesOrderLineNumber]
          ,[SalesAmount]
          ,RANK() OVER (ORDER BY [SalesAmount] DESC) AS [Rank]
    FROM [AdventureWorksDW2014].[dbo].[FactInternetSales]

[Screenshot: the results returned by the view]

The final steps are to sort the rows, keep the first 10,000 and group them together.  The full M query is shown below:

let
    //Connect to SQL Server
    Source = Sql.Database(".", "AdventureWorksDW2014"),
    //Get the data from the view
    dbo_FactInternetSales = Source{[Schema="dbo",Item="VwFactInternetSalesAmount"]}[Data],
    #"Sorted Rows" = Table.Sort(dbo_FactInternetSales,{{"SalesOrderNumber", Order.Ascending}}),
    #"Kept First Rows" = Table.FirstN(#"Sorted Rows",10000),
    #"Grouped Rows" = Table.Group(#"Kept First Rows", {"CustomerKey", "SalesOrderLineNumber", "Rank"}, {{"TotalSalesAmount", each List.Sum([SalesAmount]), type number}})
in
    #"Grouped Rows"

 

The query now returns instantly (under 1 second).   Right click on the last applied step and select the View Native Query option, to show the underlying SQL.

select top 10000
    [rows].[CustomerKey] as [CustomerKey],
    [rows].

ETL in Power Query

If you have read any of my other Power Query blogs, you will have seen the results of transformed and mashed up queries - but not how it was achieved.  This article picks out my favourite 5 Extract, Transform & Load (ETL) features in Power Query.

Example

Being a keen sports fan, I wanted to find out the medal history of the 100m Olympic Finals for both men and women. 

I’ve picked two datasets, both from different sources.  You can download them using the links below:

-          Source 1 - CSV

o   ISO Countries.csv - https://datahub.io/dataset/iso-3166-1-alpha-2-country-codes/resource/9c3b30dd-f5f3-4bbe-a3cb-d7b2c21d66ce

-          Source 2 – Web

o https://en.wikipedia.org/wiki/100_metres_at_the_Olympics

o   There is a lot of other data on the site but for this example, I have only picked the race result data tables - Men [edit] and Women [edit].

Please Note:      I am happy to provide the Power BI Workbook upon request.  Leave a comment below and I will get back to you ASAP.

The imported data (before any transformation) should look like the below in Power Query.  You may notice some ETL has already taken place, but this was carried out automatically by Power Query and includes simple tasks like promoting the column headers.

[Screenshot: the imported queries before transformation]

Pivot/Unpivot

There are many occasions when we need to completely flip the contents of a table.  For our example, ‘Gold’, ‘Silver’ and ‘Bronze’ should not be columns, but rows instead.  We only need one column for these values, which can be named ‘MedalType’.  The un-pivot feature can be applied (to a whole query) with just one click of a button.  Ensure the following columns and features are selected:

[Screenshot: the columns selected for the unpivot]

 

Once the data is un-pivoted, rename the new columns to ‘Athlete’ and ‘MedalType’ respectively.  Repeat the process for the Men [Edit] query, which looks like the following:

[Screenshot: the un-pivoted Men [Edit] query]
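For reference, the M generated by the unpivot and rename steps looks something like the below.  It assumes the remaining (fixed) column from Wikipedia is called 'Games'; adjust the list to whatever columns you want to keep as they are.

#"Unpivoted Columns" = Table.UnpivotOtherColumns(Source, {"Games"}, "Attribute", "Value"),
#"Renamed Columns" = Table.RenameColumns(#"Unpivoted Columns",
    {{"Attribute", "MedalType"}, {"Value", "Athlete"}})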

Replace Values

When quickly eyeballing the data (in Men [Edit]), it is apparent that two athletes shared a Bronze medal in 1896.  We need to handle this data defect but unfortunately, there isn’t a one click method of achieving this.  The easiest way is to use the Replace Values and Remove Top/Bottom Rows UI functions.

Firstly, let’s duplicate the full Men [Edit] query by right clicking on it and selecting Duplicate.  We now want to remove all records from the copied query, apart from the one that contains the two bronze medalists.  To do this, right click on the little table button in the corner of the query and choose either of the following:

[Screenshot: the Remove Top Rows / Remove Bottom Rows options]

You will then be asked to specify how many top or bottom rows you want to remove.  Ensure the only record left looks like the below:

[Screenshot: the single remaining record, containing both bronze medalists]

Lastly, the Replace Values function is used to strip ‘Francis Lane’ out of the combined value, leaving only ‘Alajos Szokolyi’ and thereby removing the duplicate.  The aim is to insert this one record back into the main Men [Edit] query.  Right click on ‘Athlete’, select Replace Values, type the below text into the dialog box and click OK.

Value to Find:   Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)
Replace With:   Alajos Szokolyi (HUN)

The ‘#(lf)’ elements of the search string match the line-feed characters within the cell; without them, the replace will not work.

[Screenshot: the Replace Values dialog]
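The generated step should look roughly like the below; the #"Kept First Rows" reference simply stands for whatever the previous Applied Step is called in your query.

#"Replaced Value" = Table.ReplaceValue(#"Kept First Rows",
    "Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)",
    "Alajos Szokolyi (HUN)",
    Replacer.ReplaceText, {"Athlete"})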

We must then tidy up the main Men [Edit] query by doing the same as above, but the opposite way around (keeping ‘Francis Lane’ this time).  This leaves us with the following:

[Screenshot: the tidied Men [Edit] query]

Append Queries

This is another neat feature in Power Query.  In SQL Server or SSIS terms, this provides a union between two queries.  If the column names and data types do not match, then it may fail or give unexpected results.

Open the Men [Edit] query, click the Home tab in the top menu pane and select Append Queries.  A pop up window will open, where you can specify the query to append into Men [Edit].  For this example, I selected the below and clicked OK.

[Screenshot: the Append Queries dialog]

-          It is possible to append multiple tables and in fact, we need to do this in order to bring both the Men [Edit] (2) and the Women [Edit] queries into Men [Edit].

-          To make it easy to distinguish between male and female, I have added a new column to the queries that specifies gender.  This is not mandatory if you are following this example.

-          If we select the Two tables option, then only one query can be appended at a given time.
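Behind the scenes, the append generates a Table.Combine call over a list of tables.  A rough sketch is below; the first item stands for the last Applied Step of the Men [Edit] query itself.

#"Appended Query" = Table.Combine({#"Replaced Value", #"Men [Edit] (2)", #"Women [Edit]"})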

We can rename the newly appended query to Athletes and tidy up the Men [Edit] (2) and Women [Edit] queries.  Putting raw queries into an appropriately named Group helps distinguish each part of our ETL – see below:

[Screenshot: queries organised into Raw, Transform and Load groups]

There are more queries now but the Raw folder contains imported, untouched queries, whereas Transform includes tables that have been modified and joined together.  The Load folder will eventually contain the star schema tables (dimensions and facts).

Add Custom Column

Any developer using ETL will have needed to derive a column before.  Power Query offers this functionality, both through the UI and the M query language.  We want to create a new column called ‘Nationality’, which extracts the characters (between the brackets) from ‘Athlete’.

In Power Query, a derived column is called Custom Column.  Ensure the Athletes query is selected, then navigate to the Add Column tab in the top menu pane and select the Add Custom Column button.  The pop up box appears, where we can write some custom M code:

[Screenshot: the Custom Column dialog with the M formula]

-          Text.End is the equivalent to the RIGHT function in SQL Server.

-          Text.Start is equivalent to LEFT.

-          Wrapping them both into one line of code ensures the string inside the brackets is extracted from ‘Athlete’, as sketched below this list.

-          Rename the new column to Nationality.
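A minimal sketch of the formula is shown below, assuming each Athlete value ends with a three-letter code in brackets, e.g. 'Usain Bolt (JAM)'; 'Athletes' stands for the previous Applied Step.

// Text.End grabs the last four characters ("JAM)"); Text.Start then keeps the first three ("JAM")
#"Added Nationality" = Table.AddColumn(Athletes, "Nationality",
    each Text.Start(Text.End([Athlete], 4), 3))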

[Screenshot: the new Nationality column]

Merge Queries

This feature is the same as applying a JOIN in SQL Server.  In simple terms, it gives us the ability to mashup and bring together queries from different data sources.  We will bring the ISOCountriesRaw data into the Athletes query.

Under the Home tab, select Merge Queries.  A new window appears, where we can specify the tables and columns that need to be joined.  Simply select the ISOCountriesRaw query and the two columns shown in the below screenshot.  The Join Kind should be Left Outer, as this brings across ONLY matching records from ISOCountriesRaw, whilst keeping all records in the main Athletes query.

[Screenshot: the Merge Queries dialog]

The wizard will also give you a match preview; for this example, 142 of the 144 records in Athletes match the ISOCountriesRaw table.  Click OK to complete the merge.
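The equivalent M is a Table.NestedJoin using JoinKind.LeftOuter, followed by an expand of the columns you want to keep.  The 'Name' and 'Code' column names below are assumptions about the ISO Countries file.

#"Merged Queries" = Table.NestedJoin(Athletes, {"Nationality"}, ISOCountriesRaw, {"Code"},
    "ISOCountriesRaw", JoinKind.LeftOuter),
// Expand only the columns needed from the nested table
#"Expanded ISO" = Table.ExpandTableColumn(#"Merged Queries", "ISOCountriesRaw", {"Name"}, {"Country"})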

There are a number of columns that aren’t needed in this table and can be filtered out accordingly:

[Screenshot: the merged query with unwanted columns removed]

There are two records that don’t match because the medal winner was subsequently removed from the record books.  We can use the Replace Values function to replace the nulls with some more meaningful information (see below).  We are now ready to turn the data into a star schema!

[Screenshot: the nulls replaced with more meaningful values]

Next Steps

Now that we have a fully transformed query, it can be broken down into a fact table and dimensions.  I will not demonstrate this in the blog but, by following Kimball’s data warehousing principles, we end up with a star schema that can be plugged into Power Pivot or Power View.

NOTE:   Add Index Column was used to create primary and foreign keys on the ‘Load’ tables.  See ‘Other Features’ for a link on how to do this.
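A sketch of how Add Index Column can generate a surrogate key on a 'Load' table is below; the table and column names are hypothetical.

// Start the index at 1 and increment by 1
DimAthlete = Table.AddIndexColumn(#"Distinct Athletes", "AthleteKey", 1, 1)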

Queries:

[Screenshot: the final list of queries]

Relationships:

[Screenshot: the star schema relationships]

Other Features

There are many other ETL related functions within Power Query that have not been discussed in this blog.  It is also possible to go beyond the UI features, using M to write specialist code that enhances and configures your ETL process – as demonstrated above.

For further information, I would recommend reading the below articles:

-          5 Very Useful Text Formulas - Ken Puls
http://www.excelguru.ca/blog/2014/08/20/5-very-useful-text-formulas-power-query-edition/

-          Add Index Column – Microsoft

https://support.office.com/en-us/article/Insert-a-custom-column-into-a-table-Power-Query-2dbb579a-915b-4ebd-b622-8e7f3d1d61a6?ui=en-US&rs=en-US&ad=US

-          Split Columns - Mark Vaillancourt

http://markvsql.com/2015/03/advanced-column-splitting-in-power-query/

-          Group By – Microsoft

https://support.office.com/en-us/article/Group-rows-in-a-table-Power-Query-e1b9e916-6fcc-40bf-a6e8-ef928240adf1?ui=en-US&rs=en-US&ad=US

-          ETL Examples – Chris Webb

http://blog.crossjoin.co.uk/2013/08/16/some-power-querym-examples/

Scraping Website Data using Power Query

In Power BI and more specifically Power Query, there is a simple, out of the box Web connector that enables you to scrape data from a website.  This blog will show you how to achieve this in both Power BI desktop and through Excel.

Power BI Desktop

1.      Open Power BI Desktop.  Select ‘Get Data’.


[Screenshot: the Get Data button in Power BI Desktop]

2.      In the pop up window, select ‘Other’ > ‘Web’ and click ‘Connect’.

[Screenshot: Other > Web in the Get Data window]

3.      Type in the website you would like to scrape.  For this example, I have used the BBC Sport website – looking at the ‘Premier League’ football league table.  Click ‘OK’ to continue.


[Screenshot: the From Web dialog with the URL entered]

4.      Power Query will try to obtain the data from the website, which may take a couple of minutes.  As long as you are connected to the Internet, the query will complete and look like the below:

[Screenshot: the Navigator showing the datasets found on the page]

5.      I have ticked the second option; however, you could select all three datasets if preferred.  When happy with your selections, click ‘Load’.

6.      You have now imported the scraped data, but to view it in Power Query, click ‘Edit Queries’ from the top menu pane.

[Screenshot: the Edit Queries button]

7.      The Query Editor will open in a separate window, with the scraped data showing in its raw state.  You can now modify/join/clean to your heart’s content!


[Screenshot: the scraped data in the Query Editor]
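For anyone curious, the query the wizard produces is normally just a couple of lines of M.  A rough sketch is below; the URL is a placeholder and the table index will depend on which dataset you ticked in the Navigator.

let
    // Placeholder URL: replace with the page you want to scrape
    Source = Web.Page(Web.Contents("https://www.bbc.co.uk/sport/football/tables")),
    // Pick out the table selected in the Navigator (the index or name will vary by page)
    Data = Source{0}[Data]
in
    Data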

Excel

1.      Firstly, you must have Excel 2013 or later and the Power Query add-in for Excel installed to be able to scrape a website using the below method.

2.      Open Excel and start a new workbook.  Then, click ‘Power Query’ from the top menu pane and select the ‘From Web’ button.


[Screenshot: the From Web button on the Power Query ribbon in Excel]

3.      Enter the URL (demonstrated in step 3 of Power BI Desktop example) and click ‘OK’.

4.      Choose your desired data set or alternatively, pick more than one by ticking the ‘Select Multiple Items’ option.  Click ‘Load’ to download the data.

[Screenshot: the Navigator with Select Multiple Items ticked]

5.      When loaded, you can preview the data by hovering over it – found under the ‘Workbook Queries’ menu pane.

[Screenshot: previewing the data in the Workbook Queries pane]

6.      To open the data in Power Query, double click the workbook query.  A new window will open with the data available to edit – see below:


[Screenshot: the data open in the Query Editor]

Limitations

Whilst playing with the web scraping feature, I have found that some websites work better with Power Query than others.  At the time of writing, I cannot find any official Microsoft documentation on what elements of a website work and, more importantly, what don’t.  However, web data that is wrapped in an HTML table or div tags seems to work better with Power Query.

What’s Next?

As I have demonstrated, scraping websites using Power Query is very easy.  This feature does not require any M or coding expertise, as Power Query works its magic under the hood.  This is a great alternative to C#, JavaScript, etc. for scraping a webpage.

This article shows you how to bring the data into Power Query but in reality, it will need to be cleaned and transformed to be useful.  Excel Power Query Web Scrape - AFL Fixtures by Matt Allington and Using Power Query M Language for Scraping Any Website Data by Chris Webb are two very useful blogs demonstrating how this can be achieved.  If you are already a very capable M developer or have used Power Query in the past, you will know how powerful, yet simple this is.

Future Blogs

Look out for more blogs on M and Power Query in the future.  The next instalment will be on how we can write native M to scrape websites, focusing specifically on web data that cannot be accessed through the user interface.