Simon

Simon Whiteley's Blog

A Roundup of MS Data AMP 2017

There was a bit of buzz last night as Microsoft held its first "Data Amp" online conference. It was essentially a chance to build some marketing momentum around new products, and there was a whole lot of hype associated with it. But what actually happened?

Well, we learned a lot about the next version of SQL Server, several Azure components moved from preview into General Availability and, amidst the noise, some cool new features were added to existing ones.

So, to help cut through the hype, here are my top 3 highlights from MS Data Amp:

1. SQL Server 2017

It's official, SQL Server vNext is coming this year and will be called SQL Server 2017. CTP 2.0 is now available and is deemed production-ready, so you can grab a copy and start preparing migrations & new applications.

But should you adopt it early? Is there anything radically different with this version?

Well actually yes, there's quite a lot in there of interest to a BI Developer, Data Scientist or anyone with analytical querying requirements.

We already knew about a few features: some small, such as the STRING_AGG() function, that will make life a whole lot easier; some huge, such as on-premises PowerBI hosting.
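To give a flavour of why STRING_AGG() is such a nice quality-of-life addition, here's a minimal sketch – the dbo.Products table and its columns are purely illustrative:

-- SQL Server 2017: string concatenation as a simple aggregate,
-- no more FOR XML PATH / STUFF() workarounds
SELECT  CategoryName,
        STRING_AGG(ProductName, ', ') AS ProductList
FROM    dbo.Products
GROUP BY CategoryName;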

The major announcements last night were:

a) PYTHON in SQL Server: Hot on the heels of 2016's R integration, SQL Server 2017 will now support Python in a similar way. This provides a few more integration options and some benefits in terms of performance and optimisation but, mostly, it provides choice and compatibility. You no longer have to choose R if you're going with SQL Server; you can port your existing Python across directly.

Will it change the world? Probably not - the majority of data science is currently performed on dedicated platforms, and the ease with which these can be created and torn down in Azure means there is little barrier to continuing to do so. It does, however, make it really easy to try out, and super easy to integrate into existing applications and workflows. I'd disagree with it being called "Built-in AI in SQL Server", but it's certainly a step in that direction.
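If you fancy trying it out, the integration is exposed through the same sp_execute_external_script procedure used for R. The sketch below assumes you've installed the in-database Machine Learning Services feature with Python – it simply echoes a SQL resultset through a Python script and back out again:

-- One-off setup: allow external scripts on the instance
-- (a service restart may be required depending on your version)
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;

-- Round-trip a resultset through Python
EXEC sp_execute_external_script
     @language     = N'Python',
     @script       = N'OutputDataSet = InputDataSet',
     @input_data_1 = N'SELECT TOP (10) name, object_id FROM sys.objects'
WITH RESULT SETS ((ObjectName sysname, ObjectId int));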

b) SQL Graph: You can now have Graph Databases inside SQL Server. Essentially you mark tables as edges and nodes, then use specific SQL syntax to query them. If you've got existing Neo4J databases that you would prefer to integrate directly into SQL Server, this is for you. Your edges and nodes also act like normal tables, so you can still query them in the traditional relational way.

Personally, I'm interested in using it as a reporting layer - putting some D3 visualisations over the top and natively displaying clustering, force-directed graphs and the like without any need for transformations, all from your existing SQL Server.
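For a flavour of the syntax, here's a minimal sketch of a node table, an edge table and a MATCH query – the tables and data are entirely made up for illustration:

-- Nodes and edges are just tables with an extra keyword
CREATE TABLE dbo.Person (PersonID INT PRIMARY KEY, PersonName NVARCHAR(100)) AS NODE;
CREATE TABLE dbo.Knows AS EDGE;

-- Traverse relationships with the new MATCH syntax...
SELECT  Friend.PersonName
FROM    dbo.Person AS P, dbo.Knows AS K, dbo.Person AS Friend
WHERE   MATCH(P-(K)->Friend)
  AND   P.PersonName = N'Alice';

-- ...or query the node table relationally, like any other table
SELECT COUNT(*) AS PersonCount FROM dbo.Person;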

Adatis's very own Terry McCann has put together his introduction to SQL Graph.

2. Azure Analysis Services is now Generally Available

This is fantastic news for us, as it is the missing piece that connects together the various components of the modern Azure warehouse. By having a Platform-as-a-Service semantic layer, we can replace any remaining SQL VMs in our architecture. We can now process data using Data Lake and SQL Data Warehouse, model it in Azure Analysis Services for highly concurrent querying, then expose it via PowerBI, all without a virtual machine in sight.

As a service, it is a little costly - it's currently priced around the same as the SQL VM it replaces, yet it only performs one of that VM's functions. However, the benefits of not managing infrastructure, and the ability to pause & resume without re-processing, are going to be very useful going forward.

3. Cognitive APIs now Generally Available

They've been in preview for a while, but we can now use the Face API, Computer Vision API and Content Moderator service in production systems. This allows us to do things like check for the same person across a set of different photos, detect similarities between images and tag them according to their contents. This is pretty cool, although it can be onerous to do en masse… until now.

The BIG news for me here, is that the Cognitive API technology has been pushed into Data Lake Analytics. This means you can use native functions inside U-SQL to harness these APIs.

What does this mean? Well we can now push petabytes of image files through these APIs in a hugely parallel way, all without having to provision any infrastructure - you could write the code to do this from the browser on your phone as you travel home! This is an immensely powerful, scalable tool that's now available right out of the box! Want to auto-moderate several million image files at once? You can! Want to pull out any image files that contain something that looks like a car - without waiting hours for the result? You can!!

Now, there are other features and announcements that came out of last night, but these are far and away the three that I'm excited about.

What about you?

PowerBI Optimisation P3 – Extracting and Source Controlling PowerBI Data Models

Source Control – once seen as “something proper developers do” – has been an integral part of the way business intelligence developers work for a long time now. The very idea of building a report, data model or database without applying some kind of source control actually pains me slightly.

However, there has been a push for "self-serve" reporting tools to strip out anything that looks remotely like a technical barrier for business users - this includes the ability to properly track changes to code.

We find ourselves in a very familiar situation – versions of PowerBI desktop files are controlled by including version numbers in file names. I’ve seen several copies of “Finance Dashboard v1.2.pbix”. This is obviously dangerous – who’s to say that someone didn’t open up the file, edit it and forget to increment the name? Once a file has been shared, there’s no controlling what changes happen to it from that point on. If this happened to an SSIS package, for example, we would still be able to perform a code comparison. This would highlight differences between the two packages so we could accurately see what caused the changes. This is not currently possible with PBIX files in their entirety.

We can, however, compare the data model behind the file. This allows us to check for changes in business logic, amendments to DAX calculations, additions of new fields etc. If the performance of two PBIX files differs drastically even though they were meant to be the same “version”, then this is a reasonable starting point!

Extracting the Data Model from a PowerBI PBIX File

Firstly, we need to extract the JSON that describes the Tabular model embedded in the file (technically, this is TMSL, the Tabular Model Scripting Language, but it’s still JSON…).

We can do this by connecting to the model via SSMS. I’ve talked about the steps required to do this here.

So, assuming you have found your temporary SSAS port and connected via SSMS, you should see something like this:


As we would with any other Tabular model, you can right-click and script out the database as so:


If we do this to a new query window, you’ll see the various JSON objects that describe your PowerBI model:


This script contains the details for all tables, attributes, DAX measures etc required for your data model.

Comparing PowerBI Data Models

What if someone has been using a specific version of my PowerBI desktop file, but they’ve modified it and it has stopped working? For a Tabular model, I’d compare the model definition to source control which will automatically highlight any changes.

Now that we can script out our PowerBI model, we can apply the same principles. Say, for example, I make a couple of changes to my sample PowerBI report and want to figure out how it has changed compared to a baseline script I exported previously.

The easiest option is to use a tool like TextPad – here you can compare two text documents and it will highlight any differences it finds between the two. For example, I changed the name of a table and removed a column; the text comparison highlights these changes as below:


I can now be confident that if someone sends me a PBIX file, I can check to see if there are any data model changes without having to manually eyeball the two side by side. This alone is a huge leap forward in manageability of models.

The next step would be to add this file to an actual Source Control provider, such as Visual Studio Team Services. This tool is free for the first 5 users and can be used with Visual Studio 2015 Community Edition – which is also free!

Essentially you would add this exported script to your source control directory each time you updated the model. By checking in your new model, you can compare previous versions, much like with the TextPad editor above.

Final Thoughts

In the end, this isn’t real, true source control. If you make a mistake, you can only view what the previous configuration was; you cannot roll back code directly into your PowerBI model. It is, however, a step towards managing PowerBI with a bit more discipline and rigour.

I don’t see this as a huge drawback as rumours on the wind are hinting at larger steps in this direction coming with future releases. Let’s hope we’re not having to work around these problems for much longer!

 

 

PowerBI Optimisation P2 – What’s using all my memory?

If you're a regular user of PowerBI, you're probably aware of the size limitations around datasets and it's very likely you've hit them more than once whilst writing reports on top of large datasets. It's difficult to see where size savings can be made directly through PowerBI, but we can use traditional tabular optimisation techniques to help us!

For those not in the know, a single dataset can be up to 1 GB in size, with Excel files limited to 250 MB. Each user also has a storage limit as follows:

  • Free users have a maximum 1 GB data capacity.
  • Power BI Pro users have a maximum 10 GB data capacity.
  • Pro users can create groups, with a maximum 10 GB data capacity each.

For more information about the limits themselves and how to view your current usage, there's a PowerBI blog post about it here: https://powerbi.microsoft.com/en-us/documentation/powerbi-admin-manage-your-data-storage-in-power-bi/

But what if you're hitting that 1 GB data limit? There's very little within PowerBI itself to help you understand which tables are the largest, where you could make some savings, or generally anything about your model itself. The answer is to connect to the model via SSMS and take advantage of the Tabular system views, as described here.

What determines Tabular model size?

It’s worth discussing this briefly before going into the details. Put very simply, the xVelocity engine used by the tabular model will use more memory for a column the more unique values that column contains. The key to avoiding large models is, therefore, to avoid columns with huge numbers of distinct values. Text fields will generally be pretty bad for this, although there are common design patterns to avoid the worst offenders.

A simple example is to look at a DateTime column – this combination of date and time means that each minute of each day is a unique value. Even if we ignore seconds, we’re adding 1,440 new, distinct values for every day within the system.

If we split this into two fields, a date and a time, this problem goes away. Each new date adds just a single value, whilst the time field can never exceed 1,440 distinct hour-and-minute combinations, so that’s a controllable field.
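A quick sketch of that split, done in the source query feeding the model – the dbo.Sales table and OrderTimestamp column here are illustrative:

-- Replace one high-cardinality DateTime with a low-cardinality date/time pair,
-- truncating the time to the minute so it can never exceed 1,440 distinct values
SELECT  CAST(OrderTimestamp AS date) AS OrderDate,
        CAST(DATEADD(MINUTE, DATEDIFF(MINUTE, 0, OrderTimestamp), 0) AS time(0)) AS OrderTime
FROM    dbo.Sales;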

There are a few techniques to avoid these problems if you find them, I’d advise heading over to Russo & Ferrari for some general tips here and some more detailed techniques here.

Accessing Memory Usage Data

So - following the above instructions, connect to your data model and open a new DMX query:


Here you can use SQL syntax to query several DMVs behind the model - not all of them will be relevant in the cut-down tabular instance that PowerBI uses but there is one in particular that will help us manage our model size - DISCOVER_OBJECT_MEMORY_USAGE.

Admittedly, on its own this is pretty incomprehensible. We can filter down the results slightly into something that makes a little sense, but you’ll generally get a big list of model entities with numbers against them – OK as a starter but not great as an actual model optimisation tool:
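As a starting point, a query along these lines (run in the DMX/MDX window against the temporary instance) lists objects by the memory they hold, largest first:

-- SSAS DMVs use a restricted SQL dialect - no joins, but WHERE and ORDER BY work
SELECT  OBJECT_PARENT_PATH,
        OBJECT_ID,
        OBJECT_MEMORY_NONSHRINKABLE
FROM    $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE
ORDER BY OBJECT_MEMORY_NONSHRINKABLE DESC;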


Stopping here we would at least have a hit-list of the worst-offending columns and we can use this to start tackling our model. But there are much better ways to approach this problem!

Tabular Memory Reports

There are several free tools made available within the SSAS community for people to analyse their current SSAS memory usage. These tools simply query this same data but apply a bit of data modelling and make the data much more accessible.

For straight tabular, I would tend to use Kasper de Jonge’s old Excel spreadsheet, which pulls in data quite reliably; however, there is an updated PowerBI model found here.

However, this doesn’t play nicely with the PowerBI flavour of tabular just yet, so I would advise using the SQLBI.com Vertipaq Analyser.

Following their instructions and pointing it at my temporary tabular instance, we can refresh successfully and use their categorisations to explore the model. I’ve added some conditional formatting to help see where the issues are.

I can see, for example, which of the tables in my model are the worst offenders, and what’s causing it:


Interestingly, the Customer dimension is pretty huge in my example. It has a lot less data than my fact table, but the dictionaries required are pretty hefty. Dictionaries are built using string lookups and are heavily affected by high volumes of unique values – so I can presume I’ve got some pretty big text strings in this dimension.

Looking at the Column breakdown, I can see where the offenders are:


This tells a slightly different story – my main offenders are from one of the hidden date dimension tables (a sign that relying on PowerBI’s inbuilt date functionality can be a memory drain) and the Sales Order Number – a unique identifier for my fact table, which is obviously going to have a large number of distinct values.

The other columns we can do more about – email address is the next offender. We can assume that each of the 18,000 customers will have a unique email address. However, it’s very rare that we would want to do analysis on the email address specifically, so this is a good candidate to remove from the model. At the very least, we could consider keeping only the domain, which will yield far fewer unique values.
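As a sketch of that trade-off, the domain could be derived in the source query or view feeding the model rather than in PowerBI itself – the table and column names below follow the AdventureWorksDW convention but are illustrative:

-- Swap a unique-per-customer email address for its domain only,
-- collapsing ~18,000 distinct values down to a handful
SELECT  CustomerKey,
        SUBSTRING(EmailAddress, CHARINDEX('@', EmailAddress) + 1, 128) AS EmailDomain
FROM    dbo.DimCustomer
WHERE   EmailAddress LIKE '%@%';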

 

Hopefully the above will help you move forward in reducing your PowerBI data model size – I’ll be posting about Performance Analysis & Source Control over the next couple of days.

PowerBI Optimisation 1 – Connecting Via Management Studio

I recently gave a talk to the London PowerBI UserGroup and I kicked things off with a confession - "I don't do much report building in PowerBI". Perhaps an odd way to qualify myself to speak to that particular audience. But I am, however, a cloud solution architect - I spend my time designing large scalable cloud systems to process vast amounts of data and PowerBI is a common tool used on top of these systems.

Why then, do we accept the lack of controls available within PowerBI? Given any other end-user system I'd want to know about performance bottlenecks, about data model efficiency and, more than anything, I'd want it in source control.

First and foremost, the talk is available here.

The key to it all is realising that PowerBI Desktop, when running, starts a SQL Server Analysis Services process in the background. It doesn't just use the same engine as Tabular, it literally runs Tabular in the background without telling you.

Open up a PowerBI Desktop file and, after you've seen the "initialising model…" window, you'll see this process in the background - one for each PBID session.


So - if the model is using Tabular in the background, we must be able to actually connect to the model!

First - Find your Temporary SSAS Port

There are two straightforward ways we can achieve this:

1. By far the easiest is to open up DaxStudio, if you have it installed.

When you open DaxStudio, it gives you a Connect window, which lists all of the PowerBI processes you have running in the background, as well as any Tabular services:


When you connect to a PBI file here, you'll see the port listed:


In this case, my port is 5524 – be aware that this will change every time you open PowerBI Desktop, so you can't hardcode anything looking for your "PowerBI port".

2. Alternatively, you can find the "msmdsrv.port.txt" file related to your specific instance.

Take a look in your user appdata folder; you should find a Microsoft/Power BI Desktop/ folder with some Analysis Services details:

C:\Users\<YourUser>\AppData\Local\Microsoft\Power BI Desktop\AnalysisServicesWorkspaces\

You'll see an entry for each of your PBI Desktop instances; I've only got one at the moment:


Inside this folder, in another folder called "Data", you'll find the file we're looking for:


Opening this file, we see:


Pretty straightforward, and no DAX required. Obviously, if you have multiple instances, you'll need to figure out which of these relates to the instance you're after.

Connect via SSMS

Now that we know our port, we can simply open up Management Studio, connect to Analysis Services and enter "localhost:" followed by the port number from earlier.


 

Once connected, you'll see a model connection - each PBIX file will have a GUID for this instance, but you can drill down and see the objects underneath, exactly as you would with a Tabular model:


You can now write queries, browse the model and basically treat it as a Tabular instance. The database itself will use a generated GUID, and several internal tables will do the same - you can see above that a hidden date table has been created for every date key included in my model.
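For example, the standard tabular DMVs should respond just as they would on a full Tabular instance – a couple of quick sanity checks from an MDX/DMX query window might be (run each statement separately):

-- List the tables and measures hiding inside the PowerBI model
SELECT * FROM $SYSTEM.TMSCHEMA_TABLES;
SELECT * FROM $SYSTEM.TMSCHEMA_MEASURES;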

We'll discuss the applications of this in my next post - namely how this unlocks performance tuning, monitoring and source control.

The Cloud BI Transition: Warehouse of the Future Talk

While we wait for the SQLBits XV videos to be published, I actually gave the same talk back in March for the London PASS Chapter MeetUp. If you were unfortunate enough to miss both talks, or happened to steal a quick nap mid-way through, the recording can be found here.

I wrote up a quick accompanying blog earlier this week, which includes a link to the slides and some additional thoughts.

As usual, if anyone has any questions following the talk, I’d love to hear from you in the comments or directly on twitter via @MrSiWhiteley.

The Cloud BI Transition: ETL & Data Movement in the Cloud

The cloud warehouse is a different beast to the on-premise systems most of us have been using for the past decade. These times, they are a changing. I’ve talked about the key concepts that we, at Adatis, see as core to this new world here, but how does this work in practice? To start us off, the core of any BI Developer’s skillset – ETL.

The two main players are Azure Data Warehouse and Azure Data Lake. I recently gave a talk at SQLBits to introduce some key concepts, explain how I see them fitting in to the new world and, hopefully, challenge some existing preconceptions - slides can be found here. I’ll pop the video link up when published.

For those in a hurry, I’ve summarised the main thrust of my talk below.

The Ways of the Past

For many years, the fastest way to transform data whilst moving it between two places was to extract, transform in memory via SSIS, then load into the final structures – hence ETL.


However, this was eventually rivalled by massive parallelism: whilst one database engine may be slower to work through a load than memory-based operations, ten might be much faster. But the Hadoop systems that provided this were not very accessible to the bulk of BI developers, being seen as requiring different skillsets and not being particularly standard.

The management of these clusters wasn’t always achievable even with developer acceptance, certainly not in the on-premise world. This is why the latest Azure technologies are looking to make such a difference.

The Scalable Warehouse

With Azure Data Warehouse, you can perform queries over large, distributed tables – both reading and writing using massive levels of parallelism, depending on your service tier. You can also scale your service dynamically, so if your day-to-day usage required 100 DWUs, you’d be seeing 8 reader threads and 60 writer threads. Even at this level you may find performance to be similar to SSIS.

But these are PaaS components and so are incredibly flexible and scalable. We can scale our ADW instance up to 400 DWUs just for the duration of our data load (bear in mind, you pay by the hour for the maximum level used in that time). This gives us 32 readers and 60 writers – at this level we’ll be far outstripping the performance of a powerful SSIS box and this will cost us just a couple of pounds per hour. This scaling can be done via a number of methods – a couple of lines of T-SQL, a simple PowerShell script or even through the UI.
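The T-SQL route really is just a couple of lines – something like the following, run against the master database of the logical server (the warehouse name is illustrative):

-- Scale the warehouse up to 400 DWUs for the load window...
ALTER DATABASE MyAzureDW MODIFY (SERVICE_OBJECTIVE = 'DW400');

-- ...and back down again once the load has finished
ALTER DATABASE MyAzureDW MODIFY (SERVICE_OBJECTIVE = 'DW100');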

Once we’ve scaled, transformations are quite likely to be performed faster directly by our data warehouse engine than by a data movement tool. This gives us Extract > Load > Transform, a common technique among users of MPP systems – with Azure DW coming from APS/PDW, this makes sense.
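In practice that in-warehouse transform step tends to be a CTAS (CREATE TABLE AS SELECT) statement – a minimal sketch with made-up staging and fact table names:

-- Transform inside the warehouse engine itself, writing straight into a distributed table
CREATE TABLE dbo.FactSales_New
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT  s.CustomerKey,
        s.OrderDateKey,
        s.Quantity * s.UnitPrice AS LineTotal
FROM    stg.Sales AS s
WHERE   s.Quantity > 0;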


Targeted Compute

With Azure Data Lake Analytics, you can spin up the equivalent of 1000 cheap servers (essentially managed Hadoop instances) for the lifetime of your query. You’ll need to write U-SQL to do this, a new language combining C# and SQL syntax, but it’s not a huge leap for an experienced SQL developer with just a hint of programming skills. But the ability to write a query directly in the browser that can utilise massive amounts of parallelism removes a lot of barriers – you don’t even need visual studio installed to take advantage of ADL.

Of course, if you do build your queries in Visual Studio, there is a whole tooling set released to allow you to predict compute utilisation, spot bottlenecks and plan capacity. Not to mention the benefits that source controlling your code gives you!


So Which Do We Use?

This gives us two options for our transformation, the Data Lake and within the Warehouse itself.

My assumption is that we’re going to see TELT – some upfront targeted transformation, then an extract/load into our final technology, then some further transformation.

Transform (targeted) > Extract > Load > Transform (general)

Whereas transformation used to be the bottleneck, it’s now the data movement itself that’s slowing things down. If we have just one very large job, we can perform this in isolation within the lake before moving anything and not have to scale our warehouse. If we have a lot of transformations to do on a variety of tables, we can scale up the warehouse for the duration. Maybe we do both to limit how much we scale the ADW. Perhaps we perform the transformation as increments throughout the day, either as smaller data lake jobs, data factory movements or even stream transformations. Once again, that’s what a distributed cloud architecture gives us – flexibility and choice.

The actual answer depends heavily upon the data scenario that you’re attempting to work with – the speed of incoming data, the types of data involved, the volume and complexity of the data. The new components do not provide you with a one-size-fits-all answer, they do, however, provide you with the tools to overcome almost any data challenges you may be facing.

Further Reading

Azure DataWarehouse Loading Reference Guide

Azure Data Warehouse Documentation

Azure Data Lake Analytics Documentation

The Cloud BI Transition: Business Intelligence Key Players and their changing roles

I’ve previously talked about how Azure is changing traditional BI approaches, and the various architectural frameworks we’re going to see coming out of it. But what about the existing job roles within BI teams and related structures? How are they going to change with the new ways of working that are emerging?

It’s a question we need to start considering now, as cloud architectures become more commonplace and businesses are more willing to trust their data to cloud providers. From my recent work in piecing together new architectures and frameworks, I’ve been fortunate enough to have these conversations with people who are currently evaluating their position and their future training needs. From these conversations, and my own research, I’ve put some thoughts together around the “traditional” BI roles and how they’re changing:

BI Architect – In many ways, this is the role that changes the most. We now have a wealth of options for our data, each with their own specialist uses. The BI Architect needs to be familiar with both new functionality and old, able to identify the most relevant technology choice for each. Whereas we would previously be able to host the majority of our different systems within a single, cover-all SQL Server architecture, we will now find certain structures to be more aligned with the performance profiles of specific Azure components and less so with others. Business Users expecting lightning-fast dashboard performance would not benefit from a Data Lake system, whereas a Data Scientist would be unable to work if confined to a pre-defined data model. A small, lightweight data mart of several GB would likely perform worse in a full Azure DW system, given the data would be split across so many distributions that the overhead of aggregating the results would outweigh the parallelism gains – in this case we would introduce an Azure SQL DB or Azure SQL VM to cater for these smaller marts.

Infrastructure Specialist – Gone are the days of the infrastructure specialist needing to know the install parameters of SQL Server and the best disk configurations to use in different scenarios. We’re now focusing on security models, network architecture, and automation and scaling management. The PaaS and IaaS systems differ greatly in their approach to security, with IaaS requiring traditional networking, setting up VLANs/security layers, and PaaS components each having their own firewall layers with individual exception management. The infrastructure specialist should also be advising on/designing the Azure subscription setup itself, connections to other subscriptions, perhaps managing ExpressRoute and gateway connections back to the on-premise systems. There is also the consideration of whether to extend Active Directory into the Azure domain, making the Azure estate more an extension of the company’s internal network.

Data Modeller – The end role here doesn’t change dramatically; many previous design principles still apply in the new approaches we’ve outlined. However, there are some additional performance considerations that they will need to build into their designs. Azure Data Warehouse, for example, fundamentally relies on minimising the data movements that occur when querying data. A snowflaked model, or a model with several very high cardinality dimensions, might find performance degrades significantly, when it may have been the most performant design in a traditional multidimensional cube.

BI Developer – This role will still include many of the traditional tools, very strong SQL skills, an understanding of data movement & transformation technologies and excellent data visualisation skills. However, the traditional “stack” skills of SSIS/SSAS/SSRS are extended and augmented with the additional tools at our disposal – components such as Stream Analytics, Data Factory, Event Hubs and IoT sensor arrays could all easily fall into the domain of BI yet require radically different skillsets. Exposure to C# and the .NET framework becomes more valuable in extending systems beyond the basic BI stack. The management of code is essential as these environments grow – being able to rapidly deploy systems and being confident in the development process is vital in order to get the most out of cloud technologies.

Data Scientist – For the first time, users with advanced analytical skills have a place in the architecture to allow for experimentation, ad-hoc analysis and integration with statistical tools. This free-form analysis outside of strict development protocols accelerates the business’s access to the insight and understanding locked within their collected data. Many of these insights will mature into key measures for the business and can be built into the more stable, curated data models.

Data Steward – If anything, the importance of a nominated data steward grows as we introduce systems designed for ad-hoc analysis. Without governance and controls around how data is stored within a data lake, it can quickly become a dumping ground for anything and everything. Some critics see data lakes as a dystopian future with uncontrolled “swamps” of data that grow meaningless over time. Whilst “store everything” is a fundamental tenet of the data lake mentality, everything stored should be carefully catalogued and annotated for maximum usefulness – the importance of this should not be underestimated. Our steward should embody the meticulous collector, not the disorganised hoarder.

Database Administrator – The introduction of PaaS services as our main components changes this role dramatically, but the DBA remains a core member of the business intelligence team. The common DBA tasks of security management, capacity and growth planning, performance optimisation and system monitoring are all very much a part of day-to-day life. Certain tasks such as backup & recovery are taken away as services Microsoft provide, but the additional skills needed to manage performance on these new technologies are critical. Data Lakes produce large amounts of output as a by-product of running jobs and queries; these need to be cleaned and maintained over time. Access levels to different areas of the lakes will be a growing concern as our lake models mature. Finally, the tuning of high-performance queries, whether in U-SQL or Azure DW, now requires a whole new set of skills to analyse and optimise. PowerShell, a traditional tool of the DBA, becomes hugely powerful within Azure as it is the key to managing system automation – scaling systems up, down, on and off requires a reasonable grasp of PowerShell if you want to get the most out of your systems.

BI Analyst – Somewhere in between our Data Scientists, Developers and Consumers, we have our BI Analysts. Whereas previously they may have been expert cube users, building reports and dashboards for end users to consume using the BI systems provisioned, they now have far more autonomy. PowerBI and other reporting technologies, whilst being touted as the silver-bullet for all self-service reporting needs, can deliver far more when in the hands of an experienced reporting analyst. Essentially, the analyst still acts as the champion of these tools, pushing data exploitation and exposure, except they now have the ability to deliver far more powerful systems than before.

Data Consumer – The business user is, by far, the biggest beneficiary of these tools and systems. With a flexible, scalable architecture defined, as well as different streams of data management and exploitation, there will be many benefits for those who need to gain insight and understanding from the company’s various data sources. The data models supporting self-service tools will benefit from faster performance and can include data sets that were previously size-prohibitive, giving the data consumers instant access to wider models. If the datasets are too complex or new to be exposed, they can contact specialists such as the BI Analyst or Data Scientist to investigate the data on their behalf. These manual-analysis tasks, because the architecture is built and designed to support them, should be more maintainable for a BI team to provide as an ongoing service.

Conclusions

In many ways, things aren’t changing that much – we still need this mix of people in our BI team in order to succeed. But as always, technology is moving on and people must be willing to move along with it. Many of the design patterns and techniques we’ve developed over the past years may no longer apply as new technology emerges to disrupt the status quo. These new technologies and approaches bring with them new challenges, but the lessons of the past are essential in making the most of the technology of the future.

The people who will be thriving in this new environment are those who are willing to challenge previous assumptions, and those who can see the new opportunities presented by the changing landscapes – both in terms of delivering value quicker and more efficiently, and finding value where previously there was none.

Slowly-Changing Parent-Child Dimensions – Part 2: The Implementation

Here are the slides from my talk last night – it was well received and I think everyone was able to take something away.

Credit goes to Marco Russo and Alberto Ferrari for their data modelling ideas – visit www.sqlbi.com.

http://dl.dropbox.com/u/25925341/Learning%27s%20from%20large%20scale%20dw%20project.pptx

Hopefully not my last talk, as I enjoyed doing it. Look out for further Bath-based events at:

http://avonim.wordpress.com/

All comments welcome

Calvin

First of all, an apology. It’s always hard to recognise whether your posts are useful for people out there, and this one slipped through the cracks. After several requests, I thought I’d finally get around to posting the related solution to a post I wrote two years ago! Apologies to anyone this could have helped quite some time ago.

To set the scene, I had to build a slowly-changing parent child dimension and manage the relationships accordingly. I described the theory behind the solution in this post, probably best give that a skim if you’re not familiar with the approach.

To summarise – if a parent node changes, you create new SCD records for all descendants, essentially you create historic “branches” as each change happens.

Initial Data Setup

For the example, we’ll use the AdventureWorksDW2012 sample DB as it has an org chart we can very easily play with. You can find the download here.

We’re going to create a new parent-child SCD dimension using dbo.DimOrganization by running the script below:

CREATE TABLE [dbo].[DimOrganizationSCD](
    [OrgKey]         [int]    IDENTITY(1,1),
    [ParentOrgKey]     [int]    NULL,
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL,
    [StartDate] datetime NOT NULL,
    [EndDate] datetime NULL,
    [CurrentFlag] [int] NOT NULL
) ON [PRIMARY]

GO

INSERT INTO [dbo].[DimOrganizationSCD]
(ParentOrgKey, OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT -1 [ParentOrgKey]
      ,[OrganizationKey]
      ,[ParentOrganizationKey]
      ,[PercentageOfOwnership]
      ,[OrganizationName]
      ,[CurrencyKey]
      ,'2014-01-01 00:00:00' [StartDate]
      ,NULL [EndDate]
      ,1 [CurrentFlag]
FROM [dbo].[DimOrganization]

UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID

Take a look in the table and you’ll see something like this:


We now have a separation between the unique business identifier for each organisational entity (OrganizationID) and its historical version (OrgKey).

The Org Change Occurs

Next let’s make a change to track – maybe the North American operation has been split off as a subsidiary and we need to insert a hierarchy level for the new brand “SuperCycles US”. We’ll create an ‘updates’ table to represent the delta coming from our ETL and put in two records – the new organisation record for SuperCycles, and the update to North American Operation to represent the new Parent ID.

CREATE TABLE [dbo].[DimOrganization_Updates](
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
) ON [PRIMARY]

GO

INSERT INTO [dbo].[DimOrganization_Updates]
VALUES  (15, 1, 1, 'SuperCycles US', 100),
        (2, 15, 1, 'North America Operations', 100)

The ETL Logic

For this example, I’ve written it as one big stored procedure, with all the logic occurring in one script. I’d normally break this into separate procedures, with SSIS handling data flows, lookups etc. and logging the results of individual scripts, but a single SQL flow is easier to demonstrate.

Firstly, we create a couple of variables used within the proc:

--Create a table used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE  (
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

Now, as far as any children of the North American Operation go, they technically haven’t changed and so we won’t see an update for them. So our ETL logic needs to take that into account.

We take our update rows and build a recursive CTE that finds all active child rows in our dimension:

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT 
        U.OrganizationID,
        U.ParentOrganizationID, 
        U.PercentageOfOwnership, 
        U.OrganizationName, 
        U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
        LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1
UNION ALL
    --Our Descendants (this will recurse until no more descendants are found)
    SELECT 
        D.OrganizationID,
        D.ParentOrganizationID, 
        D.PercentageOfOwnership, 
        D.OrganizationName, 
        D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
        INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID
                                                    and D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT 
    OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey
FROM OrgUpdates

This will recurse a maximum of 100 times by default; you’ll need to override this if your parent-child structure has more than 100 levels. Hopefully this isn’t an issue for anyone!
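If you ever did need to go deeper, the limit is controlled by the MAXRECURSION query hint on the statement that consumes the CTE – in the script above it would be appended to the INSERT INTO @Updates … FROM OrgUpdates statement. A standalone illustration:

--Minimal illustration of the MAXRECURSION hint on a recursive CTE
WITH Numbers AS
(
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 500
)
SELECT MAX(n) AS MaxLevel
FROM Numbers
OPTION (MAXRECURSION 1000);   --default is 100; 0 removes the limit entirely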

We now have a table variable populated with each of the business IDs that relates to a changed record, or a descendant of a changed record.

From here on, we can follow a standard SCD Type 2 method – we stamp the existing records with an end date & amend the current flag, then insert the new records.

The end-dating of the old records needs to be done first, as it is easier to identify the new records by their current flag once they have been inserted. So the historical update can be something like:

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate,
    CurrentFlag = 0
FROM @Updates U
    INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1

And then we insert new records:

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrganizationKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
(OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT     OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey,
    @SCDDate    StartDate,
    NULL        EndDate,
    1            CurrentFlag
FROM @Updates

Note that at this point, we’re taking the full cut of records from our “Updates” table, so this will include any new records. Our “SuperCycles US” record will have been inserted here, along with the updated records.

The newly inserted records have deliberately not had their ParentOrgKey populated because, for many of the records, the parent’s surrogate key has not yet been generated. We need to run an update on the dimension table after the insert to go through and fill this in.

--Finally, update the ParentOrganizationKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1
    AND P.CurrentFlag = 1

And that’s it, we now have a new branch of the organisation chart to represent today’s changes.

Taking another look at our dimension table, we now have a new set of active records representing the current Org hierarchy.


Improvements

This particular logic has been written to demonstrate an approach; there are many improvements you may want to make if including it in a large, resilient ETL process.

Firstly, as mentioned above, I would separate the logic into individual stored procedures that each returned rowcounts for better logging.

You can also amend the insertion procedure to take original values if you need to mix and match SCD types and include a Type 3 “original value” column.

Finally, you may want to include a hierarchyid-typed column and populate it using the full grandparent > parent > child surrogate key path. This allows you to query for such things as “all descendants” without having to build in recursion, which can vastly improve performance if you’re going to be running aggregations over huge volumes directly on the SQL layer.
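As a rough sketch of that idea (the population of the path during the insert step isn’t shown here), a hierarchyid column lets you find all descendants of a node with a single predicate:

--Add a materialised path column to the dimension
ALTER TABLE dbo.DimOrganizationSCD ADD OrgPath hierarchyid NULL;

--Once populated, pull back every descendant of a node without any recursion
DECLARE @parent hierarchyid =
    (SELECT OrgPath FROM dbo.DimOrganizationSCD WHERE OrganizationID = 2 AND CurrentFlag = 1);

SELECT  *
FROM    dbo.DimOrganizationSCD
WHERE   OrgPath.IsDescendantOf(@parent) = 1
    AND CurrentFlag = 1;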

Further Reading

Recursive Query using Common Table Expressions

Hierarchical Data

Code Extract

Putting all the code together, our ETL query is as follows:

--Create a table used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE  (
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT 
        U.OrganizationID,
        U.ParentOrganizationID, 
        U.PercentageOfOwnership, 
        U.OrganizationName, 
        U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
        LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1
UNION ALL
    --Our Descendants (this will recurse until no more descendants are found)
    SELECT 
        D.OrganizationID,
        D.ParentOrganizationID, 
        D.PercentageOfOwnership, 
        D.OrganizationName, 
        D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
        INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID
                                                    and D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT 
    OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey
FROM OrgUpdates

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate,
    CurrentFlag = 0
FROM @Updates U
    INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrganizationKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
(OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT     OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey,
    @SCDDate    StartDate,
    NULL        EndDate,
    1            CurrentFlag
FROM @Updates

--Finally, update the ParentOrganizationKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1
    AND P.CurrentFlag = 1

The Cloud BI Transition: Azure Warehousing Architecture

There has been a lot of turmoil in the BI space over the past couple of years, however many of us are still holding on to our favoured design patterns. The standard Kimball warehousing pattern brings data from various sources into a generic staging database. We then copy the data to another location, transforming it en-route before finally placing it into conformed, curated star schemas. There may be further transformations, tools, integrations along the way, but this is our most simplistic architecture – and it all happens on one or more SQL Server boxes.

But business challenges are driving technology changes: data volumes are rising exponentially, the range of data types businesses require in order to remain competitive is growing, and the speed at which organisations generate and/or receive data is increasing. As a result there are new, cloud-driven components that facilitate entirely different ways of working – SSIS is no longer the fastest way to transform data when we’re talking about terabytes, or more, of data being moved. We also have different velocities of data, both in how it comes in and how it is analysed. Sensors sending out realtime data streams, and web logs constantly writing away – these don’t always fit nicely into a batch-job ETL. Data Scientists want access to raw, unprocessed data in ways that have yet to be thought of – a curated warehouse isn’t necessarily right for them.

So – to solve these issues people are looking towards Microsoft Azure’s data architecture, but there are a lot of different tools now available for a lot of different purposes. I thought I’d start our architectural analysis with a look at some key concepts and components, largely as a primer.

Scalable Processing

Above all, cloud technologies bring scalability. By hinging their charge models on actual usage, cloud providers allow users to dynamically change their resource allocations to match capacity requirements. If you’re expecting a busy period, then add more capacity. If your business hours are 9-5, turn things off in the evening and weekends to make massive cost savings.

Newer components take this even further – Azure Data Lake Analytics allows for compute units to be spun up to run a single job, closing them down afterwards. You can spin up 1000 compute units for the lifetime of a single query, all with only a few minutes’ lead time.

Distributed Architectures

Because we can dynamically scale, we can manage our infrastructure to a much finer level. However, if most of our system resides on a single box, our scaling design has to take a large, complex schedule into account. Splitting various pieces of functionality into separate components and scaling them individually provides massive amounts of flexibility in managing our stack. With BI specifically, batch-processing ETL windows are a prime candidate for this.

If we have an SSIS package that needs to be run within a certain window and requires a huge amount of resource when it is running, we don’t want this to impact the rest of our ecosystem. In this case we can have a separate server for the SSIS and leave it turned off outside of our ETL window. We then power our server on (using Azure Automation, aka hosted PowerShell), run our packages, then turn the server off again. If we’re talking about hour-long daily runs, then even with some allowances for maintenance windows and support, we’re talking about no more than 40 hours of Azure VM cost per month.

Queryable Archive

One of the biggest changes is the attitude to storage with cloud technologies. It has gone from a prohibitive cost, which meant systems needed to justify their storage and growth expectations, to a relatively cheap item which can almost be considered as an afterthought. There is a growing trend of companies storing absolutely everything their systems generate, with an assumption that they will find a use for it at a later date. The cost of storing this data, thanks to falling storage prices and improved technology, is now seen as less than the potential gain that a new seam of rich analysis could bring.

This data doesn’t necessarily make it through to our curated data models – these still need to be carefully designed, governed and documented to ensure they are accessible to less data-savvy users. But expert users, developers and data scientists can write their own queries against the more expansive data stores, pull data into Azure ML and run R algorithms to unearth new insights.

Bringing it Together

With the concepts above, we get the best of both worlds – the ability to analyse vast amounts of data and apply modern data analysis techniques, and expose these findings to end-users and dashboarding systems seamlessly.

This is what we’re building into our reference Azure architecture going forward – a combined approach that has huge potential to scale yet has the functionality to meet the needs of modern BI projects.

Azure Data Lake provides the queryable store, as well as powerful targeted compute, which serves as a feed for more established, curated data models within the Azure Data Warehouse.


Over the next few blog posts, I’ll be discussing how this new architecture works in practice, how it affects the teams and companies around it, best practice for ETL, reporting architectures and development practices, as well as the new governance challenges that will grow from these structures.

Getting Started with Azure Data Lake Analytics & U-SQL

Data Lake Analytics is the querying engine that sits on top of the Azure Data Lake (ADL) Storage layer. If you have not yet got to grips with ADL Store, I’d suggest you go through my quick introduction here.

Azure’s Data Lake Analytics has developed from internal languages used within Microsoft – namely ‘SCOPE’. It runs on Apache YARN which, in turn, grew out of the original Apache MapReduce framework. For a little light reading around the history of ADL, I’d suggest looking at the ADL Analytics Overview here.

The new language introduced by ADL Analytics, mysteriously named U-SQL, brings .NET functionality and data types to a SQL syntax. You declare variables as strings, not varchars, but frame your code in SELECT, FROM and WHERE clauses. The extensibility of the code is huge as a result – you can easily write your own C# methods and call them within your select statements. It’s this unification of SQL and .NET that supposedly gives U-SQL its name. The familiarity of code on both sides should open this up to Database and Application developers alike.

Setting up your Data Lake Analytics Account

Assuming you have already set up an ADL Store, setting up your own Analytics service is just as easy.

First, go through New > Data & Analytics > Data Lake Analytics:


You’ll get the usual new item configuration screen, simply pop in your details and link it to your ADL Store account.


A few minutes later, you’ll be ready to go with your Analytics service.

To start with, we’ll write a very basic U-SQL Job through the Azure Preview Portal. This way, you can start performing large transformations on your files without the need to download any extensions, updates etc. You don’t even need a copy of Visual Studio! However, as you formalise your system and begin to rely on it as your primary datasource, you’ll definitely want to be keeping your code as source-controlled solutions and making use of the various capacity management tools Microsoft have recently released for managing your Data Lake projects.

Download Sample Data

Back in the Azure Preview Portal, when we open up our new ADL Analytics account we see the familiar overview blade:


There’s a decent overview of the U-SQL language here, along with several sample jobs provided through the “Explore Sample Jobs” link on the overview blade. If you follow the samples link, you’ll see a couple of options on a new blade.


For now, click the “Copy Sample Data” button at the top. This will populate your data lake store with the sample files used by the provided examples. I’ll walk through some more advanced examples over the next few posts, but let’s simply try and access some data first. The first example uses SearchLog.tsv found in /Samples/Data/ after installing the samples.

U-SQL works by defining rowset variables and passing them between various steps. Your first rowset may be data extracted from your sample text file; this rowset is then passed to an output which writes it to an aggregate table, or to another file.

Your first U-SQL Job

Simply click on the “New Job” icon on the ADL Analytics Overview blade to start writing your very first job.


Admittedly, this throws you in the deep end. You’re faced with a blinking cursor on line one of your script, but I’ll talk you through the structure of the first example query.

The ADL Store can contain SQL tables, as well as unstructured objects, and the syntax used varies depending on what you’re using. Tables are accessed using the traditional SELECT clause whereas for files we use EXTRACT. I’m assuming most readers will be familiar with a select statement, so let’s get an initial Extract working.

We start by defining our rowset variable, let’s call it @searchlog for now. There’s no need to declare this formally, we can just go ahead and assign the results of a query to it.

The basic structure of this query would be:

@searchlog = 
    EXTRACT <column1> <datatype>
    FROM <sourcelocation>
    USING <extraction method>;

The major assumption is that we will be defining schema on query – the flat files do not contain their own schema information and so we define it when writing the query.

So, to bring back some data from the “SearchLog.tsv” sample file, we need to give each column a name and a data type. It appears that we need to define the whole file for now, although it seems that support for querying across variable structures is on its way – it doesn’t seem to be documented just yet.

Defining each column, we build up the EXTRACT statement to:

EXTRACT UserId          int, 
        Start           DateTime, 
        Region          string, 
        Query           string, 
        Duration        int, 
        Urls            string, 
        ClickedUrls     string

Remember, we’re using C# datatypes so we don’t need to worry about lengths of strings etc.

Next, we define the filename. In the first example, we can use a reference to a specific file – this can be the fixed URL to the file specifically, or a relative reference within the Store itself. Our FROM statement for the SearchLog file is therefore:

FROM @"/Samples/Data/SearchLog.tsv"

Finally, we need to tell the query how to understand the particular file we’re attempting to extract data from. There are many extraction interfaces defined by default, for many of the most common flat files, so don’t worry if you prefer CSVs to TSVs, or even if you prefer to define your own delimiters.

In this case, as we’re using a TSV, we use the inbuilt Extractors.Tsv() function.

Putting this all together gives us the example query:

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

This will leave us with a rowset variable that has been populated with the columns defined from the TSV file. In SQL parlance, this is like defining a table variable and throwing it away at the end of the query.

In order to view our results, we need to output our resultset somewhere, this is where the OUTPUT clause comes into play.

A full OUTPUT statement requires:

OUTPUT 
    <rowset variable>
TO 
    <location>
USING 
    <output method>

We know our rowset variable – that’s the @searchlog we’ve just extracted data into. We can define a new file for our location; this simply needs to be a relative path and the name of the file to be created.

Finally, as with Extractors, we need to instruct the query what function to use to output the data if we’re pushing it to a flat file. Once again, many providers are included as standard, but let’s stick with TSV for simplicity.

Our output statement therefore looks like:

OUTPUT @searchlog 
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Putting this together our full U-SQL script is:

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
 
OUTPUT @searchlog 
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Now this isn’t terribly exciting. This will simply take the contents of the sample file and dump it into a new file with exactly the same structure.

For now, click the Submit Job button to execute your job.


One key point here is that it will take at least a minute or two for the job to run, even if you have very little data in your files. The whole system is optimised for massive scale, not lightning-fast micro-transactions. The real gains come from running queries across hundreds or thousands of files at once, scaling to run across many terabytes of data efficiently.

Hit refresh on the job status blade and you’ll eventually see the results of your job, hopefully succeeded.


You’ll then be able to navigate to the output file and view your results.

That’s your very first foray into writing U-SQL, and not a terribly exciting example. However, we can write additional queries in between the EXTRACT and OUTPUT steps that add calculations, aggregations, joins to additional rowsets and even apply any C# libraries that we associate. This in itself is nothing new, especially if you’re familiar with Pig; however, this can all be scaled to massive levels using a simple slider and a pay-as-you-go charge rate. We’ll come to these more advanced examples in future posts.