Callum

Callum Green's Blog

Loop through Multiple Web Pages using Power Query

As detailed in my previous blog, scraping website data in Power Query is very simple.  The aim of this article is to demonstrate how you can advance basic web scraping and pass parameters into the Power Query web connection string.  

Scenario

I wanted to obtain the league table history of the English Premier League – starting from 1992 all the way up to the present day.  I managed to find a very useful website that gave me all the information required but the problem was each web page equated to one year’s league table.  As a result, I would have to import the data into Power Query over 20 times, hardcoding the specific web page every time.

There must be a better way surely?  Is there not something in the web address that can be parameterized or a function in Power Query that can support iterations?  Let’s find out……

Parameterize Web Connection

Firstly, to follow my example, you need to import the 92/93 league table data into Power Query.  This can be found at:
http://www.transfermarkt.co.uk/premierleague/tabelle/wettbewerb/GB1?saison_id=1992.

I carried out some basic formatting and ETL on the query, which is not shown in this blog.  If you would like a copy of the Power BI workbook, please reply to this post.  Here is how the league table looks in Power Query:

clip_image002

If you look at the web address link above very closely, you will see that part of the connection string contains the league table year – ‘1992’.  This is what we want to parameterise.  The best way to use a parameter in Power Query is to create a function.  This is just like a SQL Server function, which passes variable(s) through it to obtain an end result.   Instructions are given below.

 

1.       Open the ‘LeagueTable’ query, navigate to ‘Home’ in the top menu and select ‘Advanced Editor’.

clip_image004

2.       This will now open all of the applied steps in ‘M’ Power Query language.  Firstly, we need to write the code to create the function.  This is achieved by writing the following code.


clip_image006

 

Syntax:

(Year as number) as table =>

 

-          ‘Year’ is declaring the variable to pass through, with the data type of ‘Number’.  For this example, only one variable is needed.

-          ‘as table’ is defining the result set based on what the function returns.

-          ‘=>’ is to tell Power Query the code below is what will be contained in the function.

3.       We now need to change the web connection string and pass through our new ‘Year’ variable.  This is achieved by replacing the ‘1992’ part of the string.

clip_image008


Syntax:
Source = Web.Page(Web.Contents("http://www.transfermarkt.co.uk/premier-
league/tabelle/wettbewerb/GB1/saison_id/" & Number.ToText(Year))),

-          ‘2012’ string has been replaced with the ‘Number.ToText(Year)’ function and variable.  The ‘Number.ToText’ function helps convert the variable into a string and make it readable.

-          The ‘&’ is used to concatenate the web connection string and the variable together.

-          The variable has now replaced the hardcoded year.

4.       When you are happy with the code, click ‘Done’.  Assuming the syntax is correct, the ‘LeagueTable’ query will now display an ‘fx’ symbol next to it.  This illustrates that it is now a function.

clip_image009

5.       Finally, let’s test the function by clicking the ‘Invoke’ button highlighted in the previous screenshot.  Enter your desired year and click ‘Ok’.  I have chosen 2014.


clip_image011

6.       As you can see, our ‘LeagueTable’ query has completely changed.  It is showing the league table from the 14/15 season and we didn’t need to manually import this data!

7.       Before moving on, we need to remove the invoked function from the ‘Applied Steps’ menu pane.  Click the ‘X’ to achieve this.


clip_image012

NOTE:   There is currently no way of creating a function using the user interface.

Automated Iteration

One big limitation of Power Query is the ability to iterate through values or loop through a connection string.  Well, it is a problem if you want to write a native while or for loop.  There is a neat workaround that call a function and loops through the parameter value based on values in a lookup table.

1.       Let’s create a simple query by selecting New Source > Blank Query.


clip_image014

2.       Type the following code into the address bar and enter Return:


clip_image016

The ‘1992..2015’ syntax is Power Query’s way of auto generating all year in between those dates.

3.       Convert the query To Table and rename the column from ‘Column1’ to ‘Year’, 


clip_image017

4.       We now want to Add Custom Column.  A new screen will pop up, where we can add some native M.  We need to pass the ‘Year’ column from our table and the recently created function into the syntax.  Click OK to confirm. 


clip_image019

In effect, this is running the function against each row for the column ‘Year’.

5.       The query may take up to 5 minutes to complete, depending your Internet connection.  Once it has completed, you will see a Table link in each of the new columns rows.  Expand this column to view the data


clip_image021

6.       As if by magic, we can now see all Premier League seasons in one table!   We can now bring this data back into Power BI or Excel to carry further analysis.

clip_image023

Future Blogs

My next couple of Power Query blogs will be looking at basic data automation/configuration and some useful ETL techniques that will prepare our data for analysis. 

Comments (22) -

  • Vikram

    6/19/2016 10:57:54 AM | Reply

    Hi, Can I please get the sample workbook for this? Thanks!

    • Callum Green

      6/29/2016 10:45:21 AM | Reply

      Vikram - I have sent you the workbook via email Smile

  • Torstein

    6/21/2016 8:12:43 AM | Reply

    Awesome article and grate guide for looping through a series of URLs (eg. REST URLs). NB! Online schedule refresh currently don't support Invoked Function.

    • Callum Green

      6/29/2016 10:44:48 AM | Reply

      Torstein - Thanks for this.  I didn't actually realise that was the case!

  • Phil

    8/15/2016 4:55:13 AM | Reply

    Hi Callum, could you please send me the sample workbook as well? Thanks!

    • Callum Green

      8/15/2016 9:19:04 AM | Reply

      Hi Phil, No problem, I will send this across shortly Smile.  
      Thanks, Callum.

  • Dave

    8/31/2016 2:53:26 PM | Reply

    Hi Callum - useful! I was trying to do a similar thing using Google Maps APIs. I have written the code below to return a list of Shopping Centres from Google Places with a formatted address. The API will only return the first 20 results and a page_token to retrieve the following 20. The URL for the next 20 results is the URL in the code &page_token[page_token]. I know that a new page_token is issued for each 20 results

    I am struggling to set up the query to make use of the page tokens and retrieve all the results. Do you have any suggestions on how best to do this?

    let

    LOCATION = (location as text) =>


    let
        Source = Json.Document(Web.Contents("maps.googleapis.com/.../json[YOUR_API_KEY]&types=shopping_mall")),
        #"Converted to Table" = Record.ToTable(Source),
        Value = #"Converted to Table"{2}[Value],
        #"Converted to Table1" = Table.FromList(Value, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
        #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table1", "Column1", {"id", "name", "place_id", "reference", "types", "vicinity"}, {"id", "name", "place_id", "reference", "types", "vicinity"}),
        #"Expanded types" = Table.ExpandListColumn(#"Expanded Column1", "types"),
        #"Filtered Rows" = Table.SelectRows(#"Expanded types", each ([types] = "shopping_mall"))
    in
        #"Filtered Rows"

    in LOCATION

    • Callum

      9/2/2016 7:45:30 AM | Reply

      Hi Dave,

      Would you be able to send me your working Workbook please?  From what you are detailing above, it looks like you need a way of looping through all of the tokens, which will then return all of the results.

      I definitely think this is possible but it is always easier to adapt working code.  Either way, I am more than happy to take a look for you Smile

      Cheers,
      Callum

  • Olivier Travers

    11/15/2016 5:57:30 PM | Reply

    Excellent tutorial that walked me through the exact steps I needed for my project, thanks a lot.

  • Montse

    11/16/2016 3:12:26 PM | Reply


    Hi,

    Can I also get the sample workbook for this?

    Thanks!

    • Callum Green

      12/1/2016 1:54:24 PM | Reply

      Hi Montse,

      I will email this to you shortly.

      Cheers,
      Callum

  • Enemaerke

    12/28/2016 1:12:03 PM | Reply

    Hey,

    Can I also get the sample workbook for this?
    And would it work if the results are on multiple pages, and it's the pagenumber in the string that changes?
    I do not always know which page that's last page.

    Thanks in advance.
    Enemaerke

    • Callum Green

      1/3/2017 12:59:16 PM | Reply

      Hi Enemaerke,

      Yes, you can pass in any type of parameter to form the web URL.  Let me know if you are having problems doing so.  

      I will email you over the workbook shortly.

      Cheers,
      Callum

  • Matt Huffman

    2/11/2017 6:10:17 PM | Reply

    Hi, this is a very useful technique!  Could you send me a copy of the workbook?  Thanks in advance!

    • Callum Green

      2/13/2017 8:15:12 AM | Reply

      Hi Matt,

      Of course - I will email it to you in a moment..

      Cheers,
      Callum

  • Brian Glynn

    2/28/2017 6:43:21 PM | Reply

    Callum Green,
    Excellent tutorial.  I am not a code cracker just a humble Excel user.  I am trying to do something similar but am having a problem determining how the parameter for the next page is being populated.  I took a shot in the dark thinking it might be "page" (there are 25 of them) but that was not it.  Can you look at what I have below and let me know where I am going wrong?

    Thank you.

    = (page as number)as table=>
    let
        Source = Web.Page(Web.Contents("990finder.foundationcenter.org/990results.aspx;)),
        Data0 = Source{0}[Data],
        #"Changed Type" = Table.TransformColumnTypes(Data0,{{"ORGANIZATION NAME", type text}, {"STATE", type text}, {"YEAR", type text}, {"FORM", type text}, {"PAGES", type text}, {"TOTAL ASSETS", type text}, {"EIN", type text}}),
        #"Removed Bottom Rows" = Table.RemoveLastN(#"Changed Type",1),
        #"Changed Type1" = Table.TransformColumnTypes(#"Removed Bottom Rows",{{"TOTAL ASSETS", Currency.Type}})
    in
        #"Changed Type1"

    • Callum Green

      3/1/2017 9:05:49 AM | Reply

      Hi Brian,

      Perhaps it would be better to contact me directly through email?  I am trying to replicate your code (by connecting to the website) but cant see where you would iterate through many pages.  Is it the fact you want to loop data through the search criteria or there are 25 pages to loop through after you have provided a search?  

      I am sure we can get to the bottom of this, but liaising with you directly will ensure a quicker solution.

      Cheers,
      Callum

      • Brian Glynn

        3/2/2017 2:24:06 PM | Reply

        Callum,
        I would like to contact you via e-mail but I could not find it.  If you could e-mail me I will get into more details.  I am across the pond so we have a bit of a time difference.

        Thank you.
        Brian

  • yZIqDClJYtg

    3/31/2017 10:26:20 AM | Reply

    418111 148067I was reading some of your content material on this internet site and I feel this internet internet site is genuinely informative! Maintain putting up. 706667

  • Brian Glynn

    3/31/2017 6:34:53 PM | Reply

    Callum,

    Thank you for taking the time to look at my specific problem.   I appreciate your efforts.  While this particular site was not Power Query friendly I have used your "loop" in other instances.

    Brian

  • Tate

    4/10/2017 2:19:33 PM | Reply

    This was a great tutorial! I'm trying to use this method when the multiple page numbers vary by text and not a number like 1992. Is there a way to do this? I keep getting an error.

    Thank you!

    • Callum Green

      4/11/2017 6:48:07 AM | Reply

      Hi Tate,  

      Yes this should easily be possible.  Please send me your code and the error message - either through here of via email.  Hopefully we can then get it working for you Smile.

      Cheers,
      Callum

Loading