Callum

Callum Green's Blog

Loop through Multiple Web Pages using Power Query

As detailed in my previous blog, scraping website data in Power Query is very simple.  The aim of this article is to demonstrate how you can advance basic web scraping and pass parameters into the Power Query web connection string.  

Scenario

I wanted to obtain the league table history of the English Premier League – starting from 1992 all the way up to the present day.  I managed to find a very useful website that gave me all the information required but the problem was each web page equated to one year’s league table.  As a result, I would have to import the data into Power Query over 20 times, hardcoding the specific web page every time.

There must be a better way surely?  Is there not something in the web address that can be parameterized or a function in Power Query that can support iterations?  Let’s find out……

Parameterize Web Connection

Firstly, to follow my example, you need to import the 92/93 league table data into Power Query.  This can be found at:
http://www.transfermarkt.co.uk/premierleague/tabelle/wettbewerb/GB1?saison_id=1992.

I carried out some basic formatting and ETL on the query, which is not shown in this blog.  If you would like a copy of the Power BI workbook, please reply to this post.  Here is how the league table looks in Power Query:

clip_image002

If you look at the web address link above very closely, you will see that part of the connection string contains the league table year – ‘1992’.  This is what we want to parameterise.  The best way to use a parameter in Power Query is to create a function.  This is just like a SQL Server function, which passes variable(s) through it to obtain an end result.   Instructions are given below.

 

1.       Open the ‘LeagueTable’ query, navigate to ‘Home’ in the top menu and select ‘Advanced Editor’.

clip_image004

2.       This will now open all of the applied steps in ‘M’ Power Query language.  Firstly, we need to write the code to create the function.  This is achieved by writing the following code.


clip_image006

 

Syntax:

(Year as number) as table =>

 

-          ‘Year’ is declaring the variable to pass through, with the data type of ‘Number’.  For this example, only one variable is needed.

-          ‘as table’ is defining the result set based on what the function returns.

-          ‘=>’ is to tell Power Query the code below is what will be contained in the function.

3.       We now need to change the web connection string and pass through our new ‘Year’ variable.  This is achieved by replacing the ‘1992’ part of the string.

clip_image008


Syntax:
Source = Web.Page(Web.Contents("http://www.transfermarkt.co.uk/premier-
league/tabelle/wettbewerb/GB1/saison_id/" & Number.ToText(Year))),

-          ‘2012’ string has been replaced with the ‘Number.ToText(Year)’ function and variable.  The ‘Number.ToText’ function helps convert the variable into a string and make it readable.

-          The ‘&’ is used to concatenate the web connection string and the variable together.

-          The variable has now replaced the hardcoded year.

4.       When you are happy with the code, click ‘Done’.  Assuming the syntax is correct, the ‘LeagueTable’ query will now display an ‘fx’ symbol next to it.  This illustrates that it is now a function.

clip_image009

5.       Finally, let’s test the function by clicking the ‘Invoke’ button highlighted in the previous screenshot.  Enter your desired year and click ‘Ok’.  I have chosen 2014.


clip_image011

6.       As you can see, our ‘LeagueTable’ query has completely changed.  It is showing the league table from the 14/15 season and we didn’t need to manually import this data!

7.       Before moving on, we need to remove the invoked function from the ‘Applied Steps’ menu pane.  Click the ‘X’ to achieve this.


clip_image012

NOTE:   There is currently no way of creating a function using the user interface.

Automated Iteration

One big limitation of Power Query is the ability to iterate through values or loop through a connection string.  Well, it is a problem if you want to write a native while or for loop.  There is a neat workaround that call a function and loops through the parameter value based on values in a lookup table.

1.       Let’s create a simple query by selecting New Source > Blank Query.


clip_image014

2.       Type the following code into the address bar and enter Return:


clip_image016

The ‘1992..2015’ syntax is Power Query’s way of auto generating all year in between those dates.

3.       Convert the query To Table and rename the column from ‘Column1’ to ‘Year’, 


clip_image017

4.       We now want to Add Custom Column.  A new screen will pop up, where we can add some native M.  We need to pass the ‘Year’ column from our table and the recently created function into the syntax.  Click OK to confirm. 


clip_image019

In effect, this is running the function against each row for the column ‘Year’.

5.       The query may take up to 5 minutes to complete, depending your Internet connection.  Once it has completed, you will see a Table link in each of the new columns rows.  Expand this column to view the data


clip_image021

6.       As if by magic, we can now see all Premier League seasons in one table!   We can now bring this data back into Power BI or Excel to carry further analysis.

clip_image023

Future Blogs

My next couple of Power Query blogs will be looking at basic data automation/configuration and some useful ETL techniques that will prepare our data for analysis. 

Comments (36) -

  • Vikram

    6/19/2016 10:57:54 AM | Reply

    Hi, Can I please get the sample workbook for this? Thanks!

    • Callum Green

      6/29/2016 10:45:21 AM | Reply

      Vikram - I have sent you the workbook via email Smile

  • Torstein

    6/21/2016 8:12:43 AM | Reply

    Awesome article and grate guide for looping through a series of URLs (eg. REST URLs). NB! Online schedule refresh currently don't support Invoked Function.

    • Callum Green

      6/29/2016 10:44:48 AM | Reply

      Torstein - Thanks for this.  I didn't actually realise that was the case!

  • Phil

    8/15/2016 4:55:13 AM | Reply

    Hi Callum, could you please send me the sample workbook as well? Thanks!

    • Callum Green

      8/15/2016 9:19:04 AM | Reply

      Hi Phil, No problem, I will send this across shortly Smile.  
      Thanks, Callum.

  • Dave

    8/31/2016 2:53:26 PM | Reply

    Hi Callum - useful! I was trying to do a similar thing using Google Maps APIs. I have written the code below to return a list of Shopping Centres from Google Places with a formatted address. The API will only return the first 20 results and a page_token to retrieve the following 20. The URL for the next 20 results is the URL in the code &page_token[page_token]. I know that a new page_token is issued for each 20 results

    I am struggling to set up the query to make use of the page tokens and retrieve all the results. Do you have any suggestions on how best to do this?

    let

    LOCATION = (location as text) =>


    let
        Source = Json.Document(Web.Contents("maps.googleapis.com/.../json[YOUR_API_KEY]&types=shopping_mall")),
        #"Converted to Table" = Record.ToTable(Source),
        Value = #"Converted to Table"{2}[Value],
        #"Converted to Table1" = Table.FromList(Value, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
        #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table1", "Column1", {"id", "name", "place_id", "reference", "types", "vicinity"}, {"id", "name", "place_id", "reference", "types", "vicinity"}),
        #"Expanded types" = Table.ExpandListColumn(#"Expanded Column1", "types"),
        #"Filtered Rows" = Table.SelectRows(#"Expanded types", each ([types] = "shopping_mall"))
    in
        #"Filtered Rows"

    in LOCATION

    • Callum

      9/2/2016 7:45:30 AM | Reply

      Hi Dave,

      Would you be able to send me your working Workbook please?  From what you are detailing above, it looks like you need a way of looping through all of the tokens, which will then return all of the results.

      I definitely think this is possible but it is always easier to adapt working code.  Either way, I am more than happy to take a look for you Smile

      Cheers,
      Callum

  • Olivier Travers

    11/15/2016 5:57:30 PM | Reply

    Excellent tutorial that walked me through the exact steps I needed for my project, thanks a lot.

  • Montse

    11/16/2016 3:12:26 PM | Reply


    Hi,

    Can I also get the sample workbook for this?

    Thanks!

    • Callum Green

      12/1/2016 1:54:24 PM | Reply

      Hi Montse,

      I will email this to you shortly.

      Cheers,
      Callum

  • Enemaerke

    12/28/2016 1:12:03 PM | Reply

    Hey,

    Can I also get the sample workbook for this?
    And would it work if the results are on multiple pages, and it's the pagenumber in the string that changes?
    I do not always know which page that's last page.

    Thanks in advance.
    Enemaerke

    • Callum Green

      1/3/2017 12:59:16 PM | Reply

      Hi Enemaerke,

      Yes, you can pass in any type of parameter to form the web URL.  Let me know if you are having problems doing so.  

      I will email you over the workbook shortly.

      Cheers,
      Callum

  • Matt Huffman

    2/11/2017 6:10:17 PM | Reply

    Hi, this is a very useful technique!  Could you send me a copy of the workbook?  Thanks in advance!

    • Callum Green

      2/13/2017 8:15:12 AM | Reply

      Hi Matt,

      Of course - I will email it to you in a moment..

      Cheers,
      Callum

  • Brian Glynn

    2/28/2017 6:43:21 PM | Reply

    Callum Green,
    Excellent tutorial.  I am not a code cracker just a humble Excel user.  I am trying to do something similar but am having a problem determining how the parameter for the next page is being populated.  I took a shot in the dark thinking it might be "page" (there are 25 of them) but that was not it.  Can you look at what I have below and let me know where I am going wrong?

    Thank you.

    = (page as number)as table=>
    let
        Source = Web.Page(Web.Contents("990finder.foundationcenter.org/990results.aspx;)),
        Data0 = Source{0}[Data],
        #"Changed Type" = Table.TransformColumnTypes(Data0,{{"ORGANIZATION NAME", type text}, {"STATE", type text}, {"YEAR", type text}, {"FORM", type text}, {"PAGES", type text}, {"TOTAL ASSETS", type text}, {"EIN", type text}}),
        #"Removed Bottom Rows" = Table.RemoveLastN(#"Changed Type",1),
        #"Changed Type1" = Table.TransformColumnTypes(#"Removed Bottom Rows",{{"TOTAL ASSETS", Currency.Type}})
    in
        #"Changed Type1"

    • Callum Green

      3/1/2017 9:05:49 AM | Reply

      Hi Brian,

      Perhaps it would be better to contact me directly through email?  I am trying to replicate your code (by connecting to the website) but cant see where you would iterate through many pages.  Is it the fact you want to loop data through the search criteria or there are 25 pages to loop through after you have provided a search?  

      I am sure we can get to the bottom of this, but liaising with you directly will ensure a quicker solution.

      Cheers,
      Callum

      • Brian Glynn

        3/2/2017 2:24:06 PM | Reply

        Callum,
        I would like to contact you via e-mail but I could not find it.  If you could e-mail me I will get into more details.  I am across the pond so we have a bit of a time difference.

        Thank you.
        Brian

  • yZIqDClJYtg

    3/31/2017 10:26:20 AM | Reply

    418111 148067I was reading some of your content material on this internet site and I feel this internet internet site is genuinely informative! Maintain putting up. 706667

  • Brian Glynn

    3/31/2017 6:34:53 PM | Reply

    Callum,

    Thank you for taking the time to look at my specific problem.   I appreciate your efforts.  While this particular site was not Power Query friendly I have used your "loop" in other instances.

    Brian

  • Tate

    4/10/2017 2:19:33 PM | Reply

    This was a great tutorial! I'm trying to use this method when the multiple page numbers vary by text and not a number like 1992. Is there a way to do this? I keep getting an error.

    Thank you!

    • Callum Green

      4/11/2017 6:48:07 AM | Reply

      Hi Tate,  

      Yes this should easily be possible.  Please send me your code and the error message - either through here of via email.  Hopefully we can then get it working for you Smile.

      Cheers,
      Callum

  • Rob

    5/8/2017 9:36:00 PM | Reply

    Hi Callum,

    Nice name, same as my son's!! Anyway mate loved the article and it helped me with exactly what I wanted to do. Just one question though, how do you handle errors? So for example say there was no web page for the 2002 season, at the moment my data set stops and gives an "Error" instead of "Table". Do you know how I tell it to just ignore the error without losing the row?

    Thanks mate, and keep up the good work.

    • Callum Green

      5/9/2017 10:05:27 AM | Reply

      Hi Rob,  

      Using your example for the year 2002, are you saying that all years after this appear as 'Error'.  If there was no data for 2002, surely you would want that row to be excluded?

      It might be worth you sending me your workbook containing the issue and I can take a look.  The is some generic M Code that can handle errors, which involves can creating a Custom Column based on your error column - see example syntax below:

      Table.AddColumn(PreviousStepName, "HandleErrors", each try [ColumnName),

      Cheers,
      Callum

      • Rob

        5/9/2017 10:00:50 PM | Reply

        Hi Callum,

        Not quite, what happens if the 2002 year is missing, then Power BI removes all data after that year. No I don't want to lose the row, just want to have blank values for the additional columns, sorta of like an unknown member in a data warehouse.
        Sorry mate I can't send you this workbook as it has some sensitive info in it, maybe if you just change on of your years in your test book to something invalid that isn't on the website you will see what I mean.
        I'll look at adding in the code you have provided. Any further pointers greatly appreciated too.

        Cheers,


        Rob

        • Callum Green

          5/10/2017 9:36:16 AM | Reply

          Hi Rob,

          I know what you mean now.  I would advise completely removing the error row, before expanding out the rest of the data.  This will prevent the records underneath from not displaying.  The logic to remove errors could be nested within the function itself.

          I will send you an example workbook through email where I have done this.  You can then link back to the error in another step, which will produce the unknown member missing for a given link.  In my example, 'Camberley' Council does not exist on Facebook and is therefore an unknown member.

          I hope this helps?

          Cheers,
          Callum

  • Umut

    5/24/2017 1:59:42 PM | Reply

    hi Callum,

    i have a similar work to do but i couldnt manage how to do it. hope you'll help me ...

    i have the list of project codes. (1,2,3,6,74,76, .... 103) which is totally dynamic coming from a table in the database.

    next, i need to run a sql query from the tables of these projects... like LG_002_01_STLINE LG_003_01_STLINE LG_006_01_STLINE LG_076_01_STLINE etc...

    last, i need to union all the query running for these project codes.

    i can send a sample excel if you have some time for me...

    thanks a lot upfront for your time and help...

    regards,

    • Callum Green

      5/24/2017 2:11:11 PM | Reply

      Hi Umut,

      Please email me at callum.green@adatis.co.uk or Tweet me @callumgadatis.  Either way, I will take a look at your workbook and see if I can help.  What you are asking for definitely sounds possible, so that is good news.

      Cheers,
      Callum

      • Umut

        5/24/2017 3:43:11 PM | Reply

        thanks a lot, i will send it tomorrow...

        thanks again for your time and effort...

      • Umut

        5/25/2017 10:34:49 AM | Reply

        i've sent the file. waiting for your reply.

        regards,

        • Callum Green

          5/25/2017 1:25:52 PM | Reply

          Hi Umut,

          I have sent across a working Power Query model Proof of Concept that does what you require.  Please comment when you are happy this works for you.

          I will write a blog shortly, showing an example of how to loop through multiple tables and merge the results into one query result.

          Cheers,
          Callum

  • Umut

    6/1/2017 8:17:28 AM | Reply

    hi Callum,

    Sorry that i am very late to thank you for your support and valuable example excel file.

    it helped me a lot !

    Take care !

    • Callum Green

      6/1/2017 8:34:36 AM | Reply

      Hi Umut,

      Brilliant - I am glad it helped!  If you ever need assistance in the future, do not hesitate to email or tweet me.

      Cheers,
      Callum

      • Umut

        6/1/2017 8:36:17 AM | Reply

        you rock !

        though, i have another issue if you have time ...

        stackoverflow.com/.../power-query-how-to-use-a-filtered-list-of-an-sql-query

        • Callum Green

          6/1/2017 9:02:01 AM | Reply

          Hi Umut,

          I have found a workaround for you.  You need to create a formula similar to this:

          =IF(COUNTA(A2:B3000)>SUBTOTAL(3,A2:B3000), "Filtered","Not Filtered")

          I have sent across your example workbook from StackOverlow including the new 'Filter Flag' column.  The flag can then be passed into your Power Query table, meaning that every time you filter the underlying data, it is reflected in the query.

          Cheers,
          Callum

Loading