Adatis BI Blogs

Calculation Groups

Analysis Services tabular models can contain hundreds of base DAX measures, with a great deal of duplicated code and syntax. With the new world of DAX calculation groups, however, we can re-use calculations and apply them to virtually any measure. Within Analysis Services, calculation groups give the user the ability to apply multi-use calculations to measures. This feature is available in the SQL Server 2019 CTP 2.3 release of Analysis Services and should be included in the full SQL Server 2019 release. Currently, calculation groups can only be created with the Tabular Editor UI.

The first, and most prominent, example of this is time intelligence and the application of MTD (month to date), YTD (year to date) and QTD (quarter to date) calculations. By reducing the number of measures, calculation groups let us present a clearer interface to users. Below are two images that show the calculation group hierarchy, which is used to call a calculation on a measure. The calculation group is called 'CalculationGroup', the attribute is called 'Time', and the item is called 'MTD'.

The calculation items use the new SELECTEDMEASURE() function which, when called, is replaced by the measure you have selected for calculation:

CALCULATE(SELECTEDMEASURE(), DATESMTD(DimDate[Date]))
CALCULATE(SELECTEDMEASURE(), DATESYTD(DimDate[Date]))
CALCULATE(SELECTEDMEASURE(), DATESQTD(DimDate[Date]))

The script below evaluates the measure 'InternetTotalSales' against the calculation items 'Current', 'QTD', 'YTD' and so on.
EVALUATE
CALCULATETABLE (
    SUMMARIZECOLUMNS (
        DimDate[CalendarYear],
        DimDate[EnglishMonthName],
        "Current", CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "Current" ),
        "QTD",     CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "QTD" ),
        "YTD",     CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "YTD" ),
        "PY",      CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "PY" ),
        "PY QTD",  CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "PY QTD" ),
        "PY YTD",  CALCULATE ( [InternetTotalSales], 'CalculationGroup'[Time] = "PY YTD" )
    ),
    DimDate[CalendarYear] IN { 2012, 2013 }
)

When we run this script, we get the following table.

Looking forward, we should see this feature eventually in Power BI, SQL Server Data Tools and Azure Analysis Services, so watch this space. Throughout this year you will see more blogs and guides from me on this topic as it unfolds.

What is Azure Data Explorer and Kusto Querying Language (KQL)?

Azure Data Explorer (ADE), put simply, is a platform for analysing large amounts of data quickly, and for scaling up your efforts in minutes to accommodate any discoveries in relevant data insights. ADE has several methods for ingesting data from various sources, such as social media platforms, IoT and CRM systems, and can support high volumes of structured, semi-structured and unstructured data for fast querying.

Azure Data Explorer integrates with other major services and can be an important part of the data warehousing workflow, executing the explore step by analysing large amounts of diverse raw data. To query the data, you use the Kusto Query Language (KQL). KQL is for querying only: unlike SQL, you cannot update or delete data using KQL. You can also visualise your analysis through ADE, either with the 'render' command in KQL, or by connecting to Power BI and outputting your findings that way.

KQL - another query language? Basically, yes. Kusto Query Language (KQL) is, however, very straightforward and easy to learn. It is the same language used in Azure Log Analytics and Application Insights, so if you are already using it there you won't have any issues. A query starts simply with a reference to a table; you take this tabular data and run it through a set of statements connected by pipes to shape it. Below is an example from the Microsoft KQL documentation:

StormEvents
| where StartTime >= datetime(2007-11-01) and StartTime < datetime(2007-12-01)
| where State == "FLORIDA"
| count

How does it work?

ADE is a distributed cluster instance maintained by Microsoft. In your cluster you host a database with tables. ADE allows dynamic column types, meaning you can add just about anything into your tables. In terms of security, you can use Azure Active Directory to assign permissions to clusters, databases and tables. There are two methods to ingest data:
queued ingestion, where data is ingested asynchronously, and direct ingestion, where data is pushed into ADE immediately. You then use KQL to query your data, and visualise it either through the 'render' command in KQL or through Power BI.

Why use Azure Data Explorer?

Microsoft already has a number of databases, many of which have been around long enough to be good options, so why Azure Data Explorer?

Pros:
- Fully managed service: no need to worry about infrastructure, making development easier and faster.
- Highly scalable.
- Very simple to push data to, and quickly too, with minimal set-up.
- Optimised for time series analysis.
- Azure Data Explorer has been heavily used within Microsoft internally for many years, meaning the technology is relatively mature.
- Azure Data Explorer is already available in many Azure regions.

Cons:
- Pricing: if you don't have a large amount of data, there may be cheaper and better ways to analyse it.
- Because Azure Data Explorer has so far been used only internally within Microsoft, the community support around it can be a little light for the time being.
- If you need to be able to edit your data, then other options like Cosmos DB may be better.

As of the publishing date of this blog, ADE is in preview.

What next? I will look to add to this blog with a worked example of how to set up an Azure Data Explorer cluster and database, and how to query it using KQL.

Thanks,
Alex

Geospatial Data Visualisation R, Python or BI Tools

Introduction

As a newbie to Adatis, I begin my journey of learning the vast, ever-growing range of languages and platforms. As a first step, I am taking a problem I already have experience with in R and seeing whether the solution in Python presents similar hurdles, as a comparison between the two languages. Before joining Adatis I had just graduated with a degree in Biology, and during my course I discovered Movebank, a website containing free-to-use animal GPS tracking data sets. So, I am using a data set I am familiar with, containing GPS tracking data for 27 turkey vultures (Cathartes aura), and using both tools to plot the 350,000 points on a map. The image shows how the data set looks as a CSV file loaded into R's viewing window.

Note: while both platforms are open source, there may be packages out there that would optimise the process I was trying to achieve; as a beginner with limited knowledge, this was my experience of trying to map geospatial data in R compared with Python, and with ArcGIS in Power BI.

Python

Packages concerned: basic (numpy, pandas, matplotlib) and geometry (geopandas and shapely).

Python failed to install the geopandas and shapely packages, so I had to use the pip install command in the command window; alternatively, the packages can be downloaded in the Anaconda environment. The data loaded in fine, as expected. First, the longitude and latitude values were zipped together, and shapely's Point function was applied to turn each pair into a single co-ordinate value. If I were to plot now, I could retrieve a world map (a template from the geopandas library, changing the fill, the colour of the sea and the country outline colour) with custom size, shape and a single colour for all the data points - which is something, but we can make it more informative. When plotting, though, I could not find a function that would let Python colour points by group based on a string value (the name of the vulture).
Instead, I created a function to convert every name to a number, so each co-ordinate had a group number assigned to it depending on which vulture it came from. This gives a list of 350,000 numbers, ranging from 1 to 27, that can be used as the colour classes for the points. After plotting, Python offers a zoom function that lets you zoom in down to a single point, but it is limited by the resolution of the map/plot. I did, however, limit the axes to the Americas just for initial plot clarity. The point conversion and plotting process takes around 15 seconds, according to Python timing calls placed at the beginning and end of the code. Ignoring the library imports, the whole process took 14 lines of unique code.

R

Packages concerned: rworldmap.

Now, R has some useful packages like ggmap, mapdata and ggplot2 which let you source satellite map images directly from Google Maps, but this requires a free Google API key. These packages can also plot the map around the data, whereas I am currently trimming the map to fit the data. For a fair test, though, I used a similarly simplistic pre-built map in R, from the package rworldmap, which allows plotting at a country level with defined borders. Axes can be scaled to act like a zoom function, but without a higher-resolution map or a raster satellite image it is pointless to go past country level.

R works a little differently to Python: initially we plot a blank map, then in a separate, sequential line we add the points. This differs from Python, where the points and the map are contained in the same plot function. The first step is to get the map from the rworldmap library and plot it, limiting the axes to act as a zoom. However, the axes do not automatically show this way, so next we manually define the intervals of the axes, i.e. longitude and latitude. Finally, we plot the points, customising size, shape and colour. The colour of each point can be assigned using 'col=factor(dataframe$column)'.
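As an aside, the Python name-to-number workaround mentioned earlier can be sketched as follows. This is a minimal illustration, not the original code; the function name and the bird names are made up:

```python
def names_to_groups(names):
    """Map each distinct bird name to a group number (1..N), in order of first appearance."""
    mapping = {}   # name -> group number
    groups = []
    for name in names:
        if name not in mapping:
            mapping[name] = len(mapping) + 1
        groups.append(mapping[name])
    return groups

# Each coordinate's colour class is the group number of the bird it came from.
colours = names_to_groups(["bird_a", "bird_b", "bird_a", "bird_c", "bird_b"])
print(colours)  # [1, 2, 1, 3, 2]
```

With 27 vultures, this yields group numbers 1 to 27, which matplotlib can consume directly as a colour array.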
So there is no need to make a group-number index as we did in Python. Using the function system.time(), R reported that plotting took 7 seconds, but the actual graphic doesn't show up for a further 8 seconds, making the total time 15 seconds - the same as Python. R, however, again ignoring the library imports, did this in only 6 lines of unique code!

Just for fun, I checked for a correlation between distance travelled and weight, and a regression analysis would suggest a statistically significant relationship; but without a time scale we cannot conclude anything - for example, some birds could have been tagged longer than others, resulting in a greater distance travelled.

Is geospatial data suitable to present in R and Python?

During the write-up of this blog I read Jason's blog(s) on ArcGIS Maps for Power BI (Jason's blog), so I trialled my own data. Quite simply, the program is drag and drop: an image presenting the same final information as both R and Python can be created in roughly 2 minutes. One of the features I really got on well with is the variable zoom, down to the precision of a street. If you have read Jason's blog you will know about all the customisable features - if not, read it! Quite simply, ArcGIS Maps gives you the option to fully explore the data visually, but it is limited by analysis and 'map features'.

Power BI handles statistical analysis similarly to Excel: possible, but tedious, as you have to generate each column. The map-feature limit poses another problem: ArcGIS (free version) only allows 1,500 'map features' (5,000 in the paid version), which equated to around 30,000 points, so with this data set not all the data could be plotted. I thinned out the data in R by removing 9 of every 10 points; given that the data was continuous, this meant readings once every half or full day instead of every 1-3 hours, which shouldn't really affect the overall distance travelled.
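The thinning was done in R, but the equivalent step in Python amounts to simple slicing; a minimal sketch, assuming the readings are ordered by time:

```python
readings = list(range(100))   # stand-in for ~350,000 time-ordered GPS readings
thinned = readings[::10]      # keep every 10th reading, dropping the other nine

print(len(thinned))  # 10
```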
I suspect a limit like this is in place to preserve performance: a similar platform, Tableau, plotted all of the data, but at such a cost to performance that the zoom navigation was unusable.

Conclusion

In summary, I feel the difference between R/Python and Power BI just depends on the user. Power BI makes the data very accessible and gives great control for visually exploring it - to a business analyst, for example. But if the user is more technical, e.g. a data scientist, then R/Python might be more fitting. Between R and Python there was no difference in performance with this data set, but that could be because, in the grand scheme of things, 350,000 points isn't 'big'. There was a notable difference in the amount of code needed, but this could be down to my greater experience with R and my lack of knowledge of a Python package capable of achieving the same task.
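The "just for fun" correlation check mentioned above can be reproduced in pure Python; below is a minimal sketch of the Pearson correlation coefficient. The data here is made up for illustration, not the real tracking set:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical per-bird totals: weight (kg) vs distance travelled (km)
weights = [1.2, 1.5, 1.8, 2.0, 2.3]
distances = [310, 420, 505, 610, 700]
print(round(pearson_r(weights, distances), 3))  # 0.997
```

A value near 1 would suggest the positive association described in the post, though, as noted there, without normalising for tagging duration no real conclusion can be drawn.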

Hierarchy Navigation In Power BI

Until recently, the overall functionality of the Matrix visual in Power BI has been limited. However, this all changed when Microsoft announced the Power BI March update, which gives users access to the Matrix Preview; this can currently be used alongside the old matrix. In this blog, I will be comparing the old Matrix and the new Matrix Preview. The updates featured in the latter are as follows:

- Performance improvements
- Drilling into hierarchies
- Stepped layout
- Cross-highlighting and filtering from matrix headers and cell contents

This article will only focus on the drilling-into-hierarchies functionality. Click here to find more information on the Matrix updates, along with the extra features not covered in this blog.

Hierarchy Drill Through

One of the visible changes in the new Matrix Preview is the ability to show all the headers collapsed by default, making the user experience easier when dealing with large datasets. The image below shows this new feature in the old and new Matrix. There is also the capability to show both levels of the hierarchy simultaneously, which is again done using the hierarchy navigation buttons, as illustrated in the image below. You can also drill up and down on individual columns using right-click, as shown in the image below; the benefit of this is that it gives the user a more detailed drill-down of a desired column. Further drill-down options are available, for example the ability to drill down on row category headers. In normal mode (without drill mode turned on), data in other row category headers remains faintly visible; turning on drill mode lets users work on a specific row category header in isolation. The following images show the differences between the two views.

Conclusion

The Matrix Preview has brought interesting and useful tools, making the visual more interactive. The ability to drill up and down within a report particularly stands out for me.
It is also worth mentioning that other features not covered in this blog give users increased customisation when working on reports - showing how impressive the Matrix Preview is. April's Power BI update includes more features for the Matrix Preview, and my next blog will look at the following two features added to it:

- Renaming axis titles
- New matrix visual enhancements: column sorting, column resizing, and word wrap

Further Reading (Power BI Blogs)

How to do row counts in Azure SQL Data Warehouse

Continuing on from my last couple of blog posts about working with Azure SQL Data Warehouse, here is another issue that has come up during development and is handy to know about if you are going to be developing a solution! Keeping track of how much data has been loaded plays a key part in a BI solution: it is important to know, for a given load, how many rows were inserted, updated or deleted. Traditionally, we were able to use the @@ROWCOUNT function, which returns the number of rows affected by the last statement. Unfortunately, in Azure SQL Data Warehouse @@ROWCOUNT is not supported.

How does it work?

The Microsoft Azure documentation provides a workaround for this - please see here for more information and a list of other unsupported functions. They suggest creating a stored procedure which queries the system views sys.dm_pdw_exec_requests and sys.dm_pdw_request_steps in order to get the row count for the last SQL statement for the current session:

- sys.dm_pdw_exec_requests holds information about all requests currently or recently active in SQL Data Warehouse, one row per request/query.
- sys.dm_pdw_sql_requests holds information about all SQL Server query distributions as part of a SQL step in the query.
- sys.dm_pdw_request_steps holds information about all steps that are part of a given request or query in SQL Data Warehouse, one row per query step.

The stored procedure takes a 'LabelContext' parameter. A label is a concept in Azure SQL Data Warehouse that allows us to tag our queries with a piece of text that is easy to understand, and which we can then find in the DMVs.
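To make this concrete, the sketch below shows the shape of a labelled query and of such a row-count procedure, based on the pattern in the Microsoft documentation. This is a hedged reconstruction, not the exact code from the original post: the table, procedure and label names are assumptions, so check the current documentation for the definitive version.

```sql
-- A labelled load query: OPTION (LABEL = ...) tags the request in the DMVs.
INSERT INTO warehouse.FactSales
SELECT *
FROM clean.FactSales
OPTION (LABEL = 'Insert : warehouse.FactSales');

-- Sketch of a row-count procedure driven by that label.
CREATE PROCEDURE dbo.GetRowCount @LabelContext VARCHAR(255)
AS
BEGIN
    SELECT SUM(rs.row_count) AS row_count
    FROM sys.dm_pdw_request_steps rs
    JOIN sys.dm_pdw_exec_requests er
        ON rs.request_id = er.request_id
    WHERE er.[label]    = @LabelContext
      AND er.session_id = SESSION_ID()
      AND rs.row_count >= 0;   -- steps that report no row count are excluded
END;
```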
For example, here we have given our query the label 'Test Label', and if we want to find information about this query in the DMVs we can filter on that label. So, putting this into context: in the ETL we call stored procedures to load our data (for example, between clean and warehouse). Within the stored procedure we have the query written to insert or update the data, and we give this query a label. Then, within the same stored procedure, we call the row-count stored procedure, passing through the label as a parameter, so we can retrieve the row count.

Be careful though! On my current project we have come across times when we haven't been able to get the row count back. This is because the sys.dm_pdw_exec_requests DMV we are querying is transient and only stores the last 10,000 queries executed; when we ran the query above, our requests were no longer there and we got nothing back! The view holds data on all queries that go against the distribution nodes, including statistics gathering for each of the nodes. So, to avoid the view filling up before you read your row count, keep the nesting level of queries as low as possible.

Stay tuned for another blog about working with Azure SQL Data Warehouse!

Using Azure Machine Learning and Power BI to Predict Sporting Behaviour

Can we predict people's sporting performance by knowing some details about them? If so, what is better at making these predictions: machines or humans? These questions seemed so interesting that we decided to answer them, by creating a working IT solution to see how it would perform. This blog provides an overview of the project, a simple analysis of the results, and details of the technologies we used to make it happen.

As it would be hard to check all available sporting disciplines, we decided to focus on the one we love the most: cycling. Our aim was to predict the maximum distance a rider would cover within one minute from a standing start. Although 'one minute' sounds insignificant, this is a really tough exercise, as we were simulating quite a tough track. We used the following equipment to perform the experiment:

- a bike with all the necessary sensors to record speed, cadence, crank turns, wheel turns and distance
- a velodrome bike simulator
- a heart rate monitor in the form of a wrist band

Using this equipment allowed us to capture data about the ride in real time and display it, using streaming analytics and Power BI, in a live interactive dashboard, as shown below (Picture 1).

Picture 1: Real-time Power BI dashboards showing: average heart rate (top row left); current speed in km/h (top row middle); average speed (top row right); current crank turns, wheel turns and cadence (bottom row left); average crank turns, wheel turns and cadence (bottom row right).

Sensors were used to capture how fast our rider was cycling, how many crank turns they made, what their heart rate was during the ride and, most importantly, how far they went within the time limit. Each rider had a chance to predict their own maximum distance before their ride. We also made a prediction based upon previous cyclists' results using machine learning algorithms.
In order for the machine learning algorithms to make estimates about each rider, we had to capture some representative properties about them before the ride. All riders categorised themselves against each of the properties listed below:

- age
- height
- weight
- gender
- smoking volume
- drinking volume
- cycling frequency

Taking weight as an example, people were asked to allocate themselves to one of the available buckets: e.g. bucket 1: 50-59kg, bucket 2: 60-69kg, bucket 3: 70-79kg, ..., bucket N: above 100kg. Bucketing the properties helped us reduce the number of distinct values, which increased the probability that, for a given rider, we would find someone with similar characteristics who had already ridden.

Obviously, to make the prediction work we had to have an initial sample. That's why we asked 'Adatis people' to have a go on a Friday afternoon - in true competitive spirit, some of them even tried a few times a day! By the beginning of the SQLBits conference we had managed to save details of around 40 different rides.

In a nutshell, let me describe the process we repeated for each rider. The first step was to capture the volunteer's details using an ASP.NET web app, including the maximum distance they thought they would be able to reach (the human prediction). Next, behind the scenes, we provided their details to the machine learning algorithm, exposed as a web service, to get a predicted distance. We then turned on all the sensors and let the cyclist ride the bike. During the ride we captured all the data from the sensors and transferred it to the database through the Azure IoT stack. After the ride finished, we updated the record with the distance the rider reached. The more cyclists participated, the bigger the sample size we had to predict the result for the next rider. Overall we captured 150 rides for 138 different riders; the initial sample used to make predictions was 40 riders, and it grew as more riders got involved.
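The weight bucketing described above can be sketched as follows. The exact bucket numbering and boundaries are assumptions extrapolated from the examples given in the post:

```python
def weight_bucket(weight_kg):
    """Assign a rider's weight to a bucket: 1 = 50-59kg, 2 = 60-69kg, ..., 6 = above 100kg.
    Returns 0 for weights below the first listed bucket (an assumed catch-all)."""
    if weight_kg < 50:
        return 0
    if weight_kg >= 100:
        return 6
    return int(weight_kg - 50) // 10 + 1

print(weight_bucket(72))   # 3 (the 70-79kg bucket)
print(weight_bucket(104))  # 6 (above 100kg)
```

Collapsing a continuous value into a handful of buckets like this is what made it more likely that a new rider matched someone already in the sample.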
The table below (Table 1) contains basic statistics on the differences between the machine learning predictions and the human predictions.

Prediction Type          Avg. Difference   Std. Dev. of Difference   Max Difference   Min Difference
Azure Machine Learning   119m              87m                       360m             2m
Humans                   114m              89m                       381m             0m

Table 1: Absolute difference between predicted and reached distance for a 1-minute ride (average distance reached: 725m).

From these numbers we can see that neither humans nor machine learning came close to the real results reached by the riders. Over an average 725m ride, the average difference was 114m for humans, with a standard deviation of 89m, and 119m for machine learning, with a standard deviation of 87m. That means both were roughly equally inaccurate, although it is worth mentioning that there were individual cases where the prediction was very close, or even equal, to the distance reached.

In trying to determine the reason behind the inaccuracy of the ML predictions, I would say the main cause is that the sample size was not sufficient. Besides the small sample, there might be other reasons why the predictions were so far off, such as:

- incorrect bucket sizes for the rider properties
- too many properties to match on
- the lack of a strong enough correlation between the properties and the distance reached

It is also worth mentioning that some properties showed a high correlation with distance, such as the height of the rider, while others showed a low correlation, such as drinking volume. The best examples of high correlation can be seen in the charts below (Chart 1):

Chart 1: Correlation between distance reached in meters and height category of the rider

And even more significantly regarding fitness level (Chart 2):

Chart 2: Correlation between distance reached in meters and fitness category of the rider

On the other hand, some rider properties did not show the correlation we would expect, e.g.
age (Chart 3):

Chart 3: Correlation between distance reached in meters and age of the rider

Although there is no straightforward correlation, as previously stated, we can observe a general trend: we tend to perform better the closer we get to our round birthdays, with peaks at the ages of 18, 29, 39 and 49. Is it perhaps the fear of reaching the next decade? I will leave that up to your interpretation...

If you are interested in a more technical explanation of how we designed and built the project, I would like to invite you to the second part of this blog, which will cover the top-level architecture of the project as well as some deep insights into the core technologies used, including Azure Stream Analytics, Azure Event Bus, Power BI web, ASP.NET MVC4 and SignalR.