
Callum Green's Blog

Microsoft BI – Coming Soon to SQL Server 2016

At the Pass Summit 2016 a couple of weeks ago, I attended a number of sessions that provided an insight into the direction Microsoft are heading with BI.  I thought I’d share this with the community.

Looking back to October 2015, the official Reporting Roadmap blog from Microsoft stated their intent to dramatically improve the visualisations, integration and harmonisation of both on-prem (SSRS) and cloud-based services (Power BI).  Whilst reporting appeared to be a renewed focus, they are constantly driving other areas of BI - such as Analysis Services and database/data warehouse development in the cloud.

Now, for the interesting bit.  Here is what we can expect in the SQL Server 2016 BI stack in the near future:

-          SQL Server 2018 CTP (to be released within the next month).

o   This seems very early, considering 2016 has only just been released!  Let’s wait for an official Microsoft announcement.

-          Azure DW

o   Auto Create Statistics

§  Currently, statistics have to be created and updated manually.

o   Replicated Distributed Table

§  One that will excite any cloud-based SQL Server developer.

§  This will reduce data transfer between distributions/nodes and consequently improve performance.

-          Azure Data Lake

o   General Availability (GA) is imminent.

o   Future Features:

§  PolyBase, so that you can connect the two big MPP platforms (ADL and DW).

§  SSIS connectors (released with GA) for the Data Lake Store.

§  Python and R in U-SQL.

-          SSIS

o   Lots of new connectors, including:

§  ADLS.

§  CRM.

-          SSAS

o   Migrating Power BI Models into Tabular.

§  This is coming very soon apparently, but will be developed in Azure SSAS first.

o   Object Level Security in Tabular

§  We currently have row-level security, but there is talk of securing a physical object, not just a row.

§  Even better news - Microsoft want to integrate the two together, which will make security awesome in Tabular.

-          SSRS

o   Supporting (not pinning) Excel reports in Reporting Services.

§  This will come, but Power BI is the focus right now and we may have to wait a while.

-          Power BI

o   Additional and better Pivot Table functionality.

o   Integrating Active Directory dynamically.

o   Potential to use Direct Query and Imported modes together – as a hybrid.

§  Functionality is possible, but performance needs to be weighed up by Microsoft before anything will emerge.

o   Quick Calcs.

§  Only ‘Percent of Current Total’ currently available.

§  Potential is to offer lots more – such as YTD, MAT, Current Year vs. Previous Year, etc.

§  This is for the users who aren’t familiar with DAX.

o   Template organisational Content Packs.

§  Gives the user the ability to personalise colours, fonts, etc. within a structured (organisational) content pack.

-          Power BI Embed

o   Application developers will be able to limit user sessions and therefore reduce the per-hour charges that come with them.

 

There are some features/issues Microsoft do not plan to change.  However, the good thing about Microsoft is that they are community driven, so if you feel strongly about anything (and get support from your peers), they may change their minds.

-          SSRS

o   Q&A, Query Editor (Power Query), R integration, etc. not being developed.

§  Pretty obvious really.  Whilst they are going to introduce almost everything from Power BI, some elements of functionality are just not needed for on-prem purposes.

§  R Scripts may come one day, but not a focus right now.

-          Power BI

o   Source Control

§  No immediate plans to integrate with TFS or modularise the pbix files (for a more developer-based solution).

§  Not surprising, as this is a self-service tool, not one aimed at development teams.

§  A workaround is to upload pbix files into OneDrive and use its versioning as an element of Source Control, or to add the file into Visual Studio.

§  Keep Voting on PowerBI.com if you want this! (Currently 278 votes).

·         https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/9677517-source-control

-          Power BI Embed

o   Licence model

§  ‘Speak to Marketing’ is what we were told.

 

Is everyone excited?  I certainly am.  Microsoft are openly admitting that any new BI-related feature will be developed in the cloud first, but this was pretty obvious.  For all you on-prem fanatics (me included), at least the bugs/undocumented features will be ironed out before we get to use them!  My personal favourite has to be the direction SSRS is going.  It was recently labelled a ‘mature’ product, but now RS has suddenly become a cool and current tool to use.  The fact that Power BI and SSRS will become almost the same product tells us that on-prem is still an important part of the Microsoft licensing strategy.

 

I am sure there are lots of other features coming into the BI stack over the next 6 months to a year and these were just the ones I had heard about.  Feel free to comment if you have additional ones to share.

Managing msdb in SQL Server

In the world of BI, the SQL Server system databases (with the exception of tempdb) are often forgotten about.  You won’t be surprised to learn that I am a BI Consultant and until recently, I was also in this boat.  System databases are usually left to a DBA, whilst we develop ETL packages, model data or produce fancy reports.

After attending Ryan Adams’ session at SQL Pass 2016, my mindset has changed.  The msdb database used to be the hub for SSIS logging but since 2012, this is all stored in the SSISDB database.  However, there are still a number of processes and functions that are logged in msdb and we need to manage this, to prevent the database from becoming too large and impacting storage costs.

This blog will provide a brief overview of what is stored and logged in msdb, as well as a little trick of how to manage SQL Agent Jobs. 

What is Stored in msdb?

Before I provide a brief list of what is in msdb, one obvious bit of advice is that it should be backed up regularly.  We all have backup strategies for our OLTP or data warehouse databases, but the system ones are often neglected.  All this logging and auditing history in msdb is useful!

The msdb database consists of the following:

o   SSIS - Only if packages are stored in SQL Server (not the SSIS Catalog).

o   Backup and Restore

o   Maintenance Plans

o   Database Mirroring

o   Policy Based Management

o   Log Shipping

o   Service Broker

o   Database Engine Tuning Advisor

SQL Agent Jobs

Whilst all of the logging tables in msdb are important for a DBA, one real area of interest in BI is around SQL Agent jobs.  There can be many SSIS packages being called from the Catalog, spanning across multiple jobs.  It is important that logging is available, yet we don’t want job history to be stored forever.

Out of the box, the logging for msdb looks like the below.  Right click on SQL Server Agent in SSMS, select Properties and navigate to the ‘History’ tab.

clip_image002

Let’s work through a use case, extracted from Ryan Adams’ Pass presentation:

1.       Imagine you have 10 database backup jobs running every hour in your production environment.

a.       10 Databases * 24 rows per day = 240 Total rows per day.

b.      With the default limit of 1,000 history rows in total, all jobs would be maxed out in 4 days and 4 hours (1,000 / 240 ≈ 4.17 days).

Ok, so now we have a problem.  All DBAs and BI developers would want to see more than 4 days’ worth of history, which means the default ‘Limit size of job history log’ setting is not a good one to retain.  How about changing the Agent history settings to the below?

clip_image004

Sounds perfect, right? Wrong.  The GUI setting is not persisted, meaning the Agent Job will log history for 30 days on the next run only.  This seems crazy, but is what we are dealing with unfortunately.  There is a solution, which involves executing a stored procedure within the msdb database.

-- Purge any Agent job history older than 30 days
DECLARE @DeletePrior DATETIME;
SET @DeletePrior = CONVERT(VARCHAR(10), DATEADD(dd, -30, GETDATE()), 101);
EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = @DeletePrior;

 

NOTE:   Ensure the ‘row restriction’ property (shown above) is unchecked against the SQL Server Agent Job settings before executing.

Although you cannot see the changes in the GUI settings, you will now retain 30 days of history in msdb.  The query can be tweaked to days, months or even years if you like. 
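For example, a minimal variation of the same procedure call (a sketch – the datepart and value are whatever retention you need) that keeps three months of history instead:

-- Keep three months of job history instead of 30 days
DECLARE @DeletePrior DATETIME;
SET @DeletePrior = CONVERT(VARCHAR(10), DATEADD(mm, -3, GETDATE()), 101);
EXEC msdb.dbo.sp_purge_jobhistory @oldest_date = @DeletePrior;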

Conclusion

If you take anything away from this article, then it should be to not ignore system databases.  Microsoft create them automatically for us but unfortunately, the out of the box configurations are not always correct for your situation or business.  For a deeper dive into msdb and other system databases, try some of the recommended blogs below.

If anyone would like to share their experiences with using/refining msdb, feel free to comment. 

Recommended Reading 

o   MSDN - https://msdn.microsoft.com/en-us/library/ms187112.aspx?f=255&MSPPError=-2147217396

o   Hemantgiri Goswami’s Blog - https://www.toadworld.com/platforms/sql-server/b/weblog/archive/2013/10/21/msdb-cleanup-is-necessary

o   Ryan Adam’s Blog - http://www.ryanjadams.com/

October 28th Technical Preview for SSRS – Including Power BI in SSRS

Today at the Pass Summit, Microsoft announced some cool new features that are coming into SSRS 2016.   I attended the ‘SQL Server Reporting Services: What’s New and What’s Next’ session (presented by Chris Finlan and Riccardo Muti) and there was certainly some very exciting news for anyone in the BI sector.

There will be a technical preview available from 28th October, in which you can download a pre-configured Virtual Machine in the Azure Marketplace.  This includes sample reports and data, enabling you to try the features whilst Microsoft work on a complete on-premise preview.  This blog lists the new features at a high level.  I will announce more information as and when it is available.

Features

 

1.       Power BI reports in SQL Server Reporting Services.

a.       I won’t steal Microsoft’s thunder and instead, redirect you to their blog written on October 25th:

https://blogs.msdn.microsoft.com/sqlrsteamblog/2016/10/25/announcing-a-technical-preview-of-power-bi-reports-in-sql-server-reporting-services/

b.      In essence, you can create a Power BI Desktop report and hook it straight into your on-premise SSRS reporting solution. Amazing, right?!

2.       Report Manager.

a.       List View is back!  You can again view reports the old-fashioned way.

b.      Show Hidden Items is now unchecked by default.  Sounds insignificant, but it is the little things that sometimes matter.

3.       Report Comments Section

a.       This is really cool.  A user can write a comment against a given report and even ‘snapshot’ the report as an image and upload it.  This is a good way to start an internal company discussion, as well as trace how a visualization has changed over time.

b.      All comments and images are stored in the Report Server db, which helps for auditing or plugging the data into an external tool/text file.

4.       Other – More information to be announced soon

a.       Direct URL link from a KPI.  If you click a KPI, it will take you to a specified link, without the need to select it from an option.

b.      Mobile Reporting auditing.

c.       Better support for generated MDX.  The current date workaround will no longer be needed, making everyone’s life easier.

d.      General performance.

Current Preview Limitations

 

As this is an interim SSRS Preview release, there are some current limitations.  These are more specific to the Power BI reports in SSRS:

-          Power BI reports can only connect “live” to Analysis Services models – both Tabular and Multidimensional (cubes).  No other data sources are currently available.

-         Custom Visuals not supported.

Coming Soon

 

As soon as I get back from the Pass Summit (and over my jet lag), I will be downloading the preview and trying out the cool features.  Please come back to my blog page to check out my findings and more importantly, feel free to comment on any quirks/issues/limitations that you have come across yourself.

These are certainly exciting times for the on-premise enthusiasts out there.  I had lost all hope for SSRS, but with 2016, Microsoft have rekindled my love for enterprise data visualisation.

View Native Query Function in Power Query

Following on from my last blog (found here), I wanted to share a new Query Folding feature in Power Query.  The View Native Query functionality was actually released in the June 2016 Power BI update but it was hardly advertised/promoted by Microsoft.

What is View Native Query?

In a nutshell, it enables you to check if each of your step-by-step transformations in Power Query is being folded back to the source.  I have previously shown techniques using external monitoring tools, as well as an in-built M function.  However, the View Native Query feature makes our lives so much easier.  I will demonstrate how easy it is to use.

Example

I loaded up an old Query Folding Power BI workbook in Power Query, which was connecting to the AdventureWorksDW2014 SQL Server database.  If you want a copy of it, feel free to comment below and I’ll get back to you.  However, you can do this with any workbook by following the simple instructions below:

1.    Right click on the last step of your transformation and a new option called View Native Query should be available.  Left click to open the SQL query.


clip_image0024_thumb_thumb

 

clip_image0044_thumb_thumb

2.    Whilst the above SQL written at source is not the most efficient, query folding still means the data is brought back into Power Query far quicker than it would be if the transformations were done within the Power BI application.

 

3.   Now copy and paste the code into SSMS.  This is exactly the same code that SQL Server Profiler would capture.  Execute the query.

 

clip_image0054_thumb_thumb

 

4.    In effect, this is replicating the transformation steps in Power Query.  The below screenshot is taken from Power Query – note that the top 10 results match the output above.

clip_image0064_thumb_thumb

The coolest thing about the View Native Query feature is the fact you can check every step in the Power Query transformation.  I only demonstrated the final step, but you can go back to each one and find the underlying query folding logic that has been applied.

Why is View Native Query not available?

If you get the below screenshot, the answer is simple – your transformation step doesn’t support Query Folding! 

clip_image0074_thumb_thumb

NOTE: View Native Query is ‘greyed’ out.

If possible, move any steps that aren’t supported to the end.  Power Query will attempt to fold each step until it reaches one it cannot.  All steps after this point (even if query folding would otherwise be possible) will not be folded and could drastically slow down the remaining transformations.
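To illustrate the principle, here is a minimal M sketch (the server, database and filter value are placeholders I have assumed for illustration).  The foldable filter sits before the unfoldable index step, so only the final step has to run locally:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),  // placeholder server/database
    Fact = Source{[Schema = "dbo", Item = "FactProductInventory"]}[Data],
    // Foldable: translated into a WHERE clause and executed at the source
    Filtered = Table.SelectRows(Fact, each [UnitCost] > 10),
    // Not foldable: adding an index column is not translated into T-SQL, so this
    // step (and anything placed after it) runs inside Power Query
    Indexed = Table.AddIndexColumn(Filtered, "RowIndex", 1, 1)
in
    Indexed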

Conclusion

This feature is something that should have been made available a long time ago.  Whilst the power of query folding is undeniable, the methods of testing it were long and arduous.  With View Native Query, we can easily check where and when folding is happening and, more importantly, when it is NOT.  Debugging slow-performing transformations is so much easier now, and you can in fact swap steps around and folding will still persist where possible.

I would advise every Power Query developer to make use of this simple feature.  If there are any other Query Folding techniques, tips or tricks out there, please contact me.  Look out for future blogs, where I will be looking at the performance of query folding on very large data volumes (100 million + records) and whether the automated SQL Server code eventually becomes too unwieldy and poorly performing.   

Query Folding in Power Query

Query Folding in Power Query isn’t an entirely new concept but when I first heard of it, I thought “What is Query Folding?”  There are probably a number of you out there who think the same, so let me explain it.

When carrying out transformations (e.g. Sort) in Power Query, it is possible that some will be sent back to the source.  In other words, Power Query doesn’t attempt to sort millions of records within the application itself, but will use the source to undertake the work.  This can drastically improve performance within Power Query.
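For instance (a minimal sketch – the server and database names are placeholders), a Sort written in M like the below is translated into an ORDER BY clause and executed by SQL Server, rather than locally:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),  // placeholder server/database
    Fact = Source{[Schema = "dbo", Item = "FactProductInventory"]}[Data],
    // This step is folded back to the source as ORDER BY [UnitCost] DESC
    Sorted = Table.Sort(Fact, {{"UnitCost", Order.Descending}})
in
    Sorted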

Examples

I will now demonstrate some basic Query Folding techniques in Power Query.  Until recently, it wasn’t easy to know if a transformation was performing Query Folding.  This blog primarily talks you through monitoring outside of Power Query, but I will also touch upon how this is now possible to do within the application.

External Monitoring Tool

The monitoring tool used in this example is SQL Server Profiler.  For a simple video walkthrough on how to use Profiler, click here.

1.       Start a profiler session, ensuring you connect to the server where the AdventureWorksDW2014 Database resides – available to download here.

2.       Now we can connect to the database in Power Query and more specifically to the dbo.FactProductInventory table.

3.       I will perform three Power Query transformations:

a.       Filter Rows

i.      ProductTotal > £50000

b.      Group By

i.      DimProduct.EnglishProductName

ii.      SUM(Unit Cost) As ProductTotal

c.       Transform > Trim

i.      DimProduct.Status

NOTE:   This blog does not show you how to carry out the Power Query transformations but the workbook can be provided on request.

4.       The transformed query looks like the below:

 

clip_image002

 

5.       Now let’s take a look.

 

clip_image004
clip_image005

clip_image006

Points of Note:

1.       The Filter Rows Power Query transformation on TotalUnitCost is implemented as a WHERE clause in SQL Server.

2.       The Group By also becomes a T-SQL GROUP BY, with the SUM function applied to find the total value for each EnglishProductName group.

3.       The T-SQL functions LTRIM/RTRIM are combined together on the Status column, replicating the functionality of the Trim feature in Power Query.

Within Power Query

Rather than show you this feature myself, I will point you to a great article by Chris Webb, which describes how you can easily check for Query Folding within your Power Query workbook.

In summary, there is an out of the box M function called GetMetaData that can be nested within your transformations.  This will tell you the source of your query and if it has been query folded.

Supported Transformations

Here are some other transformation types in Power Query that support Query Folding:

-          Filtering (on rows or columns)

-          Joins

-          Aggregates and GROUP BY

-          Pivot and unpivot

-          Numeric calculations

-          Simple transformations, such as UPPER

Current Limitations

Although there is still no official list (from Microsoft) of the sources that support Query Folding, here are some of the ones currently known:

-          Relational sources (SQL Server, Oracle).

-          OData sources (e.g. Azure Marketplace)

-          Active Directory

-          Exchange

-          HDFS, Folder.Files and Folder.Contents

This means a lot of other data sources do not support Query Folding.  For example, a flat file cannot fold a Keep Top 5 Rows step in Power Query, whereas SQL Server would use the TOP function or apply a filter in a WHERE clause.  Whilst the lack of supported sources can be seen as a limitation, I would argue that any form of Query Folding, even if it were just for SQL Server, is a big plus point.

You can deliberately prevent query folding, although the only time I see a benefit is if you are connecting to a server running at full capacity or during a large ETL.  There are far more instances when Power Query will not apply query folding, because of limitations of the source’s functionality.  Check out Koen Verbeeck’s blog if you would like to know more on this.
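One well-known way of deliberately preventing folding (not covered in the post itself, so treat this as a side note) is Table.Buffer, which pins a table in memory and stops every subsequent step from being folded.  A minimal sketch, with placeholder server and database names:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),  // placeholder server/database
    Fact = Source{[Schema = "dbo", Item = "FactProductInventory"]}[Data],
    // Table.Buffer pins the rows in memory; nothing after this step
    // can be folded back to SQL Server
    Buffered = Table.Buffer(Fact),
    Sorted = Table.Sort(Buffered, {{"UnitCost", Order.Descending}})  // now runs locally
in
    Sorted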

Conclusion

This article is just scratching the surface of how beneficial Query Folding in Power Query can be.  The key takeaway is remembering to put the transformation steps (that can be folded back to source) at the beginning.  Steps that cannot be folded should be applied as late as possible.  If this is adhered to, you will see huge performance benefits – especially on large datasets.

If anyone out there would like to share other data sources that use Query Folding or just have a general interest, feel free to comment below.

Query Parameters in Power Query – Part 2

As promised, here is Part 2 of Query Parameters in Power Query.  My last blog demonstrated how this feature (released in April 2016) could be used to reduce the data volumes in your PowerBook.  This article focuses on how you can easily switch between different parameter values and dynamically display the results. 

Query Parameters are fully explained in Part 1, so I will move straight on to the exciting stuff!

Use Cases & Example

There are plenty of scenarios when you want to filter large data sets based on various criteria.  A typical business case could be looking at a customer table (in SQL Server, a text file, etc.), where the user wants to filter the data by a person’s First Name, Last Name or the City they are from.  This would require more than one parameter if we want to make our filtering dynamic.

I will talk you through a slightly different example.  Imagine a user wanting to look at the current products their business is selling, but look at the data by:

1.       Product Name ONLY = EnglishProductName

2.       Product Category, Product Sub Category & Product Name = EnglishProductCategoryName, EnglishProductSubcategoryName & EnglishProductName

Query Parameters can easily facilitate this.  To follow my example, download the AdventureWorksDW2014 Database (found here) and connect through the SQL Server Database connector.  Paste the following SQL code into the SQL Statement box – like below:

SELECT DISTINCT
       --DPC.[EnglishProductCategoryName] AS CategoryName,
       --DPS.[EnglishProductSubcategoryName] AS SubCategoryName,
       --DP.EnglishProductName AS ProductName,
       DP.ProductAlternateKey,
       DP.ListPrice AS ProductListPrice,
       COALESCE(DP.[Status], 'Previous Version') AS ProductStatus
FROM   [dbo].[DimProduct] DP
INNER JOIN [dbo].[DimProductSubcategory] DPS
       ON DPS.[ProductSubcategoryKey] = DP.[ProductSubcategoryKey]
INNER JOIN [dbo].[DimProductCategory] DPC
       ON DPS.[ProductCategoryKey] = DPC.[ProductCategoryKey]
-- NOTE: with SELECT DISTINCT, the ORDER BY can only reference selected columns,
-- so comment it out until the parameter puts a name column back into the SELECT
ORDER BY
       DPC.[EnglishProductCategoryName],
       DPS.[EnglishProductSubcategoryName],
       DP.EnglishProductName

 

NOTE:   You will see the first three columns are commented out.  This is because we will be referencing them in our parameter value.

clip_image002[6]

 

Switching between Parameter Values

I will now demonstrate how to create 2 parameter values that will be used to display the product data in different ways.

The following steps show you how to achieve this.  

1.       We now want to create the new parameter.  In the Query Editor, click Manage Parameters from the ribbon.
 

2.       Create one parameter and select ‘List of Values’ as the Allowed Values drop down.  You will see there are two types of list output – one for just EnglishProductName and the other amalgamating the three product columns.   Click OK when happy.


clip_image004[6]

The code for the ‘concat’ function is:
Concat(EnglishProductCategoryName,' - ',EnglishProductSubcategoryName,' - ',EnglishProductName)

Note:     As we are connecting to SQL Server, the ‘Concat’ function is required to bring the product columns together.

3.       Rename the query to ‘DimProduct’.  We now need to insert the parameter into the DimProduct query – by opening the Advanced Editor and replacing EnglishProductName with our new parameter – like below (an M sketch of the edited query appears after this walkthrough):


clip_image006[6]

4.       You will see the following warning message.  Click Edit Permission.


clip_image008[6]

5.       The message is just warning you that as you are connected to the database, the query could make changes to the underlying objects.  As this is a ‘Select’ query, there is no danger of this.  Click Run.

clip_image010[6]

6.       As we set the default value to be EnglishProductName, this is what we now see in the query pane.  The field in question is called ProductName.

 

clip_image012[6]

7.       To switch to the other parameter value, simply click Manage Parameters > Edit Parameters

clip_image013[6]

8.       Select the other parameter value and click OK.

clip_image015[6]

9.       Now look at ProductName.  It now includes the category and sub-category data as well, with the three concatenated fields separated by a ‘-‘.

 

clip_image017[6]

10.   That is parameter value switching in a nutshell.  Feel free to experiment with more combinations of values.
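For reference (as flagged in step 3), the edited DimProduct query ends up along these lines.  This is a sketch rather than the exact code in the screenshot: the server name is a placeholder and ProductColumns is my assumed name for the Query Parameter created in step 2.

let
    // ProductColumns holds either "DP.EnglishProductName" or the Concat(...) expression
    Source = Sql.Database("localhost", "AdventureWorksDW2014", [Query =
        "SELECT DISTINCT " & ProductColumns & " AS ProductName, " &
        "DP.ProductAlternateKey, DP.ListPrice AS ProductListPrice, " &
        "COALESCE(DP.[Status], 'Previous Version') AS ProductStatus " &
        "FROM [dbo].[DimProduct] DP " &
        "INNER JOIN [dbo].[DimProductSubcategory] DPS " &
        "ON DPS.[ProductSubcategoryKey] = DP.[ProductSubcategoryKey] " &
        "INNER JOIN [dbo].[DimProductCategory] DPC " &
        "ON DPS.[ProductCategoryKey] = DPC.[ProductCategoryKey]"])
    // (ORDER BY omitted: with SELECT DISTINCT it can only reference selected columns)
in
    Source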

Conclusion

Whilst my demo is very simplistic, this is just showing you how powerful a concept Query Parameters can be.  This blog solely focuses on configuring a parameter to display different result sets, depending on what the user wants to see.  Thinking about the entire Power BI suite, we could hook up our parameterised datasets to Dashboards and empower the user to look at KPIs or trends in data, depending on their role/department.  The sales and customer service departments will focus on similar dimensional data – like customer and date – but the former might want to look at the ‘SalesAmount’ measure, whilst the latter would be interested in ‘RefundedAmount’.  These fields can be contained in one query parameter!

Query Parameters have been around in Power BI for a few months now and the use of the functionality is becoming more widespread.  Look for more blogs on this subject in the near future.

Further Reading

There are already a couple of great blogs out there and the below links take you to some other excellent use cases:

1.       Chris Webb’s Blog - https://blog.crossjoin.co.uk/2016/05/08/the-m-code-behind-power-bi-parameters/

2.       Soheil Bakhshi Blog (1 of 2) – http://biinsight.com/power-bi-desktop-query-parameters-part-1/

3.       Soheil Bakhshi Blog (2 of 2) - http://biinsight.com/power-bi-desktop-query-parameters-part2-dynamic-data-masking-and-query-parameters/

Query Parameters in Power Query – Part 1

A new feature has recently been added into Power Query – called Query Parameters.  This was actually released to Power BI as part of April 2016’s update.

So what is a Query Parameter?  For anyone familiar with Microsoft Reporting Services, it has pretty much the same functionality as a Report Parameter.  In simple terms, it enables you to filter your data dynamically in Power Query.  There has always been a Filter UI function but it could not be linked to a parameter or plugged into the M code.  We can use Query Parameters as a way of reducing the data volumes brought into a PowerBook (useful for large datasets or tables) or to allow a user to switch between filters (e.g. changing database connections dynamically).

In this blog, I will concentrate on how we can reduce data volumes by implementing a Query Parameter.  Look out for Part 2 in the coming weeks, where I will pick a use case for when we would want to dynamically switch between parameter values.

Reducing Data Volumes

Imagine transforming or just analysing a fact table that contains over 1 billion records.  Not only will Power BI (and Power Query) probably crash or at best, perform badly, there is no need to bring all this data back into a PowerBook.  A subset, or perhaps an individual month is all that is required to undertake the necessary Power Query ETL operations.

The below example demonstrates how a Query Parameter can be used to filter records in a given dataset.  I have used the AdventureWorksDW2014 Database (download here) and picked the FactProductInventory table, which contains 776,286 rows.  I will reduce this count by adding a new Query Parameter in Power Query – using the DateKey to find all rows between 20101131 and 20101230, i.e. the first 30 days of December 2010.

1.       Assuming I have already loaded the data into the workbook, in the Query Editor, click Manage Parameters from the ribbon.
 

clip_image001

2.       Create two parameters – one named DateKeyFrom and the other DateKeyTo.  The former contains the value 20101131 and the latter 20101230.  Keep the default settings and confirm by clicking OK.  Your two parameters should look like the below:

clip_image003

 

3.       Click on the DateKey column drop down and choose Number Filters and Between – highlighted below.

 

clip_image005

 

 

4.       Set the filter to pick up the two newly created parameters.  The following screenshot illustrates the criteria required, with the key change being to switch the second drop down from ‘Decimal Number’ to ‘Parameter’.  (An M sketch of the resulting filter step follows this walkthrough.)

clip_image007

5.       Check that your dataset has been filtered and only contains data for the time period December 2010.

6.       If you need to change the date criteria and, let’s say, only display 5 days of data, this can be achieved easily by clicking the parameters in the Queries pane.  I have edited the DateKeyFrom value to 20101225.  You can also update a value by clicking the Manage Properties button.

clip_image009

7.       As soon as you click back onto the FactProductInventory table query, the data automatically begins to refresh.  We are now left with only the last 5 days in December, shown below. 

 

clip_image011
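As flagged in step 4, the Between filter generated by the UI corresponds to an M step along the lines of the sketch below (the server and database names are placeholders, and the step feeding into the filter may be named differently in your workbook):

let
    Source = Sql.Database("localhost", "AdventureWorksDW2014"),  // placeholder server/database
    Fact = Source{[Schema = "dbo", Item = "FactProductInventory"]}[Data],
    // The two Query Parameters are referenced directly in the filter condition;
    // because this folds, SQL Server receives a WHERE clause containing the values
    Filtered = Table.SelectRows(Fact, each [DateKey] >= DateKeyFrom and [DateKey] <= DateKeyTo)
in
    Filtered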

Conclusion

As Query Parameters are hot off the press, I have only just started experimenting with them.  It looks like they have been added not only to reduce and filter data, but to enable us to plug dynamic parameters into M.  Previously, I used a Function as a workaround but can now actually use parameters in the proper context.  Click here to view a previous blog, where I specifically create and invoke a function.  There are a number of other uses for Functions and Rafael Salas’ blog provides an excellent walkthrough.

My next blog will focus on how you can switch between different parameter values and dynamically display the results.  If anyone has built their own Query Parameter examples and want to share them, feel free to leave me a comment below.  I would be very interested to see how far people have already taken this new feature.

Further Reading

There are already a few great blogs and use cases out there:

1.       Chris Webb’s Blog - https://blog.crossjoin.co.uk/2016/05/08/the-m-code-behind-power-bi-parameters/

2.       Soheil Bakhshi Blog (1 of 2) – http://biinsight.com/power-bi-desktop-query-parameters-part-1/

3.       Soheil Bakhshi Blog (2 of 2) - http://biinsight.com/power-bi-desktop-query-parameters-part2-dynamic-data-masking-and-query-parameters/

Replicating SSIS Data Profiling in Power Query

As a BI Consultant, I regularly sit down with a new/prospective client and dissect both their business requirements and source data.  With the latter, the SSIS Data Profiling Task (in Visual Studio or SSDT) is a very useful tool to find out the general quality of the data we will be importing into the warehouse.  Count of NULLS, Value frequency, Min/Max/Median Values and unique count of records are just a few types of data profiling.

I have recently been playing with the out of the box Data Profiling functionality in Power Query.  Although the functions are very useful, I quickly realised that I would need to extend my PowerBook to match the Data Profiling Task in SSIS.  I will break my findings down into a couple of parts, with this instalment focusing on a specific data profiling output from SSIS.

Table Profile Function

Table.Profile() is a relatively new function in Power Query that gives some profiling stats for columns in a specific database table or raw file.  In the example below, I pulled in a raw Excel spreadsheet from the Power BI website, which contains sample financial data.  Click here to download the data.

clip_image002

To obtain stats on this query, open the Advanced Editor and add the following M at the end of the existing code (remember to add a comma to the end of the previous step first):

    #"TableProfiling" = Table.Profile( #"Changed Type" as table) as table

in

     #"TableProfiling"

clip_image004

All we are doing is creating a new step in M, referencing the previous step and returning the new one.  The actual Table.Profile() function doesn’t require any other arguments and the underlying Power Query engine provides all of the functionality.  Each column has been profiled – see below:

clip_image006

I think the outputs are self-explanatory, especially if you are familiar with Data Profiling.  Either way, one line of code gives us some useful high-level stats about the data and can help us capture the extremes or anomalies of each column.  Not all analysis will be suitable for a column, e.g. the ‘Country’ field will not produce Standard Deviation or Average values because it uses the Text datatype.

Column/Frequent Value Distribution

Whilst the in-built Power Query function is ideal for overall table/file analysis, data profiling is most effective when focusing on specific columns and data patterns.  Both Column and Frequent Value Distribution are basic features of the SSIS Profiling Task, shown below.

clip_image008 

For the Power Query proof of concept, I have connected to AdventureWorksDW2014 and the Person.Address table – named PersonAddress.  This means I can use the SSIS Data Profiling screenshot as the benchmark to not only prove it can be replicated, but to ensure the numbers match. 

The first step is to apply the Table.Profile() function to a Reference query – called TableProfiling.  We can then create a simple staging query that just includes what we need from the out of the box profiling – called TableProfilingStage.  ‘Number of Values’ is renamed to ‘TotalCount’ and ‘Number of Distinct Values’ to ‘DistinctCount’.  We are left with the following data:

  clip_image010

The row of interest is ‘City’ (contained in Column), as this is what we need to profile.  The TableProfilingStage query is the reference point for the frequent value distribution.  Create a New Source and paste the following M Code into the Advanced Editor (a reconstruction of this code appears after the points of note below):

clip_image012

Points of note:

1.       Lines 1-4 are referencing parameters in other queries and, more specifically, the ordinal position of 3.  The ‘City’ column is the 4th record in TableProfilingStage, which equates to position 3 (indexing starts at 0).

2.       Each ‘City’ value is grouped and summed into a total count.

3.       Individual record counts (value of 1 per row) to use in the eventual calculation.  This value is then summed to give a total record count on each row.

4.       Apply (Count/TotalCount)*100 to work out the Frequent Value Distribution.

5.       Renamed, sorted (by count) and hidden any unnecessary columns.

6.       Save your new query – I called it ColumnProfiling.

7.       You should have 4 queries in total:

a.       PersonAddress

b.       TableProfiling

c.       TableProfilingStage

d.       ColumnProfiling
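Since the M code itself only appears as a screenshot, here is a hypothetical reconstruction of the ColumnProfiling query – a sketch of the approach (group PersonAddress by the profiled column and work out each value’s share of the total), not a copy of the exact code:

let
    // Grab the column name to profile from TableProfilingStage (position 3 = 'City')
    ProfiledColumn = TableProfilingStage[Column]{3},
    // Group the source table by that column, counting the rows per value
    Grouped = Table.Group(PersonAddress, {ProfiledColumn}, {{"Count", each Table.RowCount(_)}}),
    // Total record count, used as the denominator
    TotalCount = List.Sum(Grouped[Count]),
    // Frequent Value Distribution = (Count / TotalCount) * 100
    AddDistribution = Table.AddColumn(Grouped, "FrequentValueDistribution",
        each ([Count] / TotalCount) * 100, type number),
    // Most frequent values first
    Sorted = Table.Sort(AddDistribution, {{"Count", Order.Descending}})
in
    Sorted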

As long as your queries are named identically to my examples, you will end up with the below outputs:

Column Value Distribution:

clip_image014

Frequent Value Distribution:

clip_image016

We have now replicated Column and Frequent Value Distribution and the figures match the SSIS Profiling Task! Although Power Query does not provide the pretty visualizations, we can easily reproduce them in the Power BI reporting suite.  I will save this for another blog, though.

Next Steps

Look out for my next blog, where I aim to enhance the current Data Profiling PowerBook and add configuration.  In order to make this method as efficient as (or more efficient than) SSIS, it will need to be automated, handle a number of sources (e.g. csv, database connection, etc.) and incorporate invoked Power Query functions.

Conclusion

It is clear to see that Power Query is a very effective tool to replicate SSIS Data Profiling.  I believe with further development, a PowerBook could be used as a quicker and more efficient tool to run data profiling. 

I am very interested to see if anyone has created their own Data Profiling tasks in Power Query and welcome any comments, feedback or questions.

Flat Files in Power Query

The out-of-the-box functionality in Power Query will enable you to import pretty much any flat file, ranging from csv to fixed width.  However, there are instances when badly formatted files cannot be loaded correctly and this blog will demonstrate a workaround for this. 

Unusual Flat Files

Whilst csv and txt are the most common types of flat file, I have worked on a number of BI projects where the data warehouse needs to ingest fixed width, ragged right or pipe delimited text files.  You won’t be surprised to hear that Power Query recognizes all of these formats!   I will demonstrate below how simple it is to import a Ragged Right file with a ‘.in’ extension. 

Ragged Right in Notepad:

clip_image002

Power Query’s Interpretation:

clip_image003

 

Badly Formed Flat Files

You won’t normally know that a flat file is badly formed until actually importing it.  In the below example, a csv file contains both missing and double commas.  Csv stands for ‘comma separated values’, so Power Query expects a comma to split each column.  Any developer who has imported data (through SSIS, SQL import, etc.) will have encountered a problem like this.

Two Commas in Notepad:

clip_image004

Two Commas in Power Query:

As expected, Power Query believes there is a blank column between the values ‘Liverpool’ and ‘Stoke’.  As a result, the data is shifted to the right.

clip_image006

Missing Commas in Power Query:

A missing comma between columns works the opposite way.  Data is shifted to the left.

clip_image008

Solution

As we already know that the csv file contains both double and missing commas, the next step is to import the data in a non-delimited format.  To do this, you need to start a new Blank Query and type the following code into the Advanced Editor:

let
    Value = Table.FromList(
        Lines.FromBinary(File.Contents("D:\Documentation\Internal\PowerQuery\FlatFiles\CSVMissingOrMoreCommas.csv")),
        Splitter.SplitByNothing())
in
    Value

The code is turning the flat file data into a table and treating every line as a single value in one column – hence the ‘SplitByNothing’ syntax.  This is how the query looks:

clip_image009

We can now replace two commas with just one comma and split the table by ‘,’.  I have previously shown you how to Replace Values in an earlier blog, but will quickly demonstrate the Split Columns function.

clip_image011

Power Query actually auto-detected all of these settings, as it can see the most common number of commas and that they occur 10 times.  If you drop down Advanced options, you can change the delimiters and output columns if they are not interpreted correctly by Power Query.   Finally, we can implement the Use First Row As Headers UI function and filter the nulls in the ‘HTR’ column.

clip_image013
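Pulling the steps above together, the whole fix looks roughly like this in M – a sketch assuming the same file path as earlier and the 11 columns that the Split dialog detected:

let
    // Load each line of the file as a single value in one column, as before
    Source = Table.FromList(
        Lines.FromBinary(File.Contents("D:\Documentation\Internal\PowerQuery\FlatFiles\CSVMissingOrMoreCommas.csv")),
        Splitter.SplitByNothing()),
    // Collapse the accidental double commas into single ones
    FixedCommas = Table.ReplaceValue(Source, ",,", ",", Replacer.ReplaceText, {"Column1"}),
    // Split on the comma delimiter: 10 delimiters per row = 11 columns
    Split = Table.SplitColumn(FixedCommas, "Column1", Splitter.SplitTextByDelimiter(","), 11),
    // Promote the first row to headers, as in the UI step
    Promoted = Table.PromoteHeaders(Split)
in
    Promoted

Note that this only fixes the double commas; the rows with missing commas still need the manual Replace Values treatment described below.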

There is no easy way to write automated logic that would fix these three records because the missing commas were in different places and in our current query, different columns.  There are also no string patterns that can be easily deduced.  Therefore, this time around we will use Replace Values to move the data to the required columns.

clip_image015

There it is - a very effective way of visualising and managing both unusual and badly formed flat files in Power Query. 

Next Steps

We could easily extend the code to handle other file types or delimiters. Ken Puls’ blog demonstrates how you can pass the flat file connection string in as a parameter within a function.  This would be very handy for someone wanting to bring in multiple flat files through one configuration.  I would be very interested to see if any other Power Query developers have their own M code that handles flat files.  Feel free to comment or contact me directly if you have any suggestions on how to extend my example further. 

ETL in Power Query

If you have read any of my other Power Query blogs, you will have seen the results of transformed and mashed-up queries - but not how they were achieved.  This article picks out my favourite 5 Extract, Transform & Load (ETL) features in Power Query.

Example

Being a keen sports fan, I wanted to find out the medal history of the 100m Olympic Finals for both men and women. 

I’ve picked two datasets, both from different sources.  You can download them using the links below:

-          Source 1 - CSV

o   ISO Countries.csv - https://datahub.io/dataset/iso-3166-1-alpha-2-country-codes/resource/9c3b30dd-f5f3-4bbe-a3cb-d7b2c21d66ce

-          Source 2 – Web

o https://en.wikipedia.org/wiki/100_metres_at_the_Olympics

o   There is a lot of other data on the site but for this example, I have only picked the race result data tables - Men [edit] and Women [edit].

Please Note:      I am happy to provide the Power BI Workbook upon request.  Leave a comment below and I will get back to you ASAP.

The imported data (pre-transformation) should look like the below in Power Query.  You may notice some ETL has already taken place, but this was automatically carried out by Power Query and includes simple tasks like displaying column headers.

clip_image002

Pivot/Unpivot

There are many occasions when we need to completely flip the contents of a table.  For our example, ‘Gold’, ‘Silver’ and ‘Bronze’ should not be columns, but rows instead.  We only need one column and this can be named ‘MedalType’.  The un-pivot feature can be achieved (on a whole query) by just one click of a button.  Ensure that the following columns and features are selected:

clip_image004

 

Once the data is un-pivoted, rename the new columns to ‘Athlete’ and ‘MedalType’ respectively.  Repeat the process for the Men [Edit] query, which looks like the following:

clip_image005
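Behind the button, the unpivot is a single M step.  A sketch of what the UI generates is below – the ‘Games’ key column name is an assumption based on the example, and the new column names are supplied via the step’s arguments rather than a separate rename:

let
    Source = #"Women [Edit]",  // the raw query from the import step
    // Keep the Games column fixed; Gold/Silver/Bronze rotate into rows in
    // 'MedalType', with the athlete names landing in 'Athlete'
    Unpivoted = Table.UnpivotOtherColumns(Source, {"Games"}, "MedalType", "Athlete")
in
    Unpivoted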

Replace Values

When quickly eyeballing the data (in Men [Edit]), it is apparent that two athletes shared a Bronze medal in 1896.  We need to handle this data defect but unfortunately, there isn’t a one-click method of achieving this.  The easiest way is to use the Replace Values and Remove Top/Bottom Rows UI functions.

Firstly, let’s duplicate the full Men [Edit] query by right clicking on it and selecting Duplicate.  We now want to remove all records from the copied query, apart from the one that contains two bronze medalists.  To do this, right click on the little table button in the corner of the query and choose either of the following:

clip_image007

You can then specify how many top or bottom rows you want to remove.  Ensure the only record left looks like the below:

 clip_image008

Lastly, the Replace Values function is used to replace ‘Francis Lane’ with an empty string, thereby removing the duplicate.  The aim is to insert this one record back into the main Men [Edit] query.  Right click on ‘Athlete’ and select Replace Values.  Type the below text in the function box and click OK.

Value to Find:   Francis Lane (USA)#(lf)#(lf) Alajos Szokolyi (HUN)
Replace With:   Alajos Szokolyi (HUN)

The ‘#(lf)’ elements of the search string represent line-feed characters; without them the replace will not work.

clip_image009

We must tidy up the Men [Edit] query by doing the same as above, but the opposite way around.  This leaves us with the following:

clip_image010

Append Queries

This is another neat feature in Power Query.  In SQL Server or SSIS terms, this provides a union between two queries.  If the column names and data types do not match, then it may fail or give unexpected results.

Open the Men [Edit] query, click the Home tab in the top menu pane and select Append Queries.  A pop up window will open, where you can specify the query to append into Men [Edit].  For this example, I selected the below and clicked OK.

clip_image012

-          It is possible to append multiple tables and in fact, we need to do this in order to bring both the Men [Edit] (2) and the Women [Edit] queries into Men [Edit].

-          To make it easy to distinguish between male and female, I have added a new column to the queries that specifies gender.  This is not mandatory, especially if you are following this example.

-          If we select the Two tables option, then only one query can be appended at a given time.
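For the curious, the Append Queries dialog boils down to a single Table.Combine step.  As a standalone sketch (using the query names from this example), the equivalent would be:

let
    // Union the three queries into one table; columns are matched by name
    Appended = Table.Combine({#"Men [Edit]", #"Men [Edit] (2)", #"Women [Edit]"})
in
    Appended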

We can rename the newly appended query to Athletes and tidy up the Men [Edit] (2) and Women [Edit] queries.  Putting raw queries into an appropriately named Group helps distinguish each part of our ETL – see below:

clip_image013

There are more queries now but the Raw folder contains imported, untouched queries, whereas Transform includes tables that have been modified and joined together.  The Load folder will eventually contain the star schema tables (dimensions and facts).

Add Custom Column

Any developer using ETL will have needed to derive a column before.  Power Query offers this functionality, both through the UI and the M query language.  We want to create a new column called ‘Nationality’, which takes the characters (in between the brackets) from ‘Athlete’.

In Power Query, a derived column is called Custom Column.  Ensure the Athletes query is selected, then navigate to the Add Column tab in the top menu pane and select the Add Custom Column button.  The pop up box appears, where we can write some custom M code:

clip_image015

-          Text.End is the equivalent of the RIGHT function in SQL Server.

-          Text.Start is equivalent to LEFT.

-          Wrapping them both into one line of code ensures the string inside the brackets is taken from Athletes.

-          Rename the new column to Nationality.

clip_image016
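The exact formula lives in the screenshot, but a sketch of it (assuming each athlete entry ends with a three-letter country code in brackets, e.g. ‘Thomas Burke (USA)’) would be:

// Take the last 4 characters ("USA)"), then the first 3 of those ("USA")
= Text.Start(Text.End([Athlete], 4), 3)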

Merge Queries

This feature is the same as applying a JOIN in SQL Server.  In simple terms, it gives us the ability to mashup and bring together queries from different data sources.  We will bring the ISOCountriesRaw data into the Athletes query.

Under the Home tab, select Merge Queries.  A new window appears, where we can specify the tables and columns that need to be joined.  Simply select the ISOCountriesRaw query and the two columns shown in the below screenshot.  The Join Kind should be Left Outer, as this will bring across only matching records from ISOCountriesRaw, whilst keeping all records in the main Athletes query.

clip_image018

The wizard will also give you a match preview and for this example, 142 of the 144 records in Athletes match the ISOCountriesRaw table.  Click OK to complete the merge.

There are a number of columns that aren’t needed in this table and can be filtered out accordingly:

clip_image019 

There are two records that don’t match because the medal winner was subsequently removed from the record books.  We can use the Replace Values function to replace the nulls with some more meaningful information (see below).  We are now ready to turn the data into a star schema!

clip_image021

Next Steps

Now we have a fully transformed query, it can be broken down into a fact table and dimensions.  I will not demonstrate this in this blog, but by following Kimball’s data warehousing principles, there is now a star schema that can be plugged into Power Pivot or Power View.

NOTE:   Add Index Column was used to create primary and foreign keys on the ‘Load’ tables.  See ‘Other Features’ for a link on how to do this.

Queries:

clip_image023

Relationships:

clip_image025

Other Features

There are many other ETL-related functions within Power Query that have not been discussed in this blog.  It is also possible to enhance the UI features using M or write specialist code to enhance and configure your ETL process, as I demonstrated above.

For further information, I would recommend reading the below articles:

-          5 Very Useful Text Formulas - Ken Puls
http://www.excelguru.ca/blog/2014/08/20/5-very-useful-text-formulas-power-query-edition/

-          Add Index Column – Microsoft

https://support.office.com/en-us/article/Insert-a-custom-column-into-a-table-Power-Query-2dbb579a-915b-4ebd-b622-8e7f3d1d61a6?ui=en-US&rs=en-US&ad=US

-          Split Columns - Mark Vaillancourt

http://markvsql.com/2015/03/advanced-column-splitting-in-power-query/

-          Group By – Microsoft

https://support.office.com/en-us/article/Group-rows-in-a-table-Power-Query-e1b9e916-6fcc-40bf-a6e8-ef928240adf1?ui=en-US&rs=en-US&ad=US

-          ETL Examples – Chris Webb

http://blog.crossjoin.co.uk/2013/08/16/some-power-querym-examples/