
Simon Whiteley's Blog

A Morning with Scott Guthrie

Monday saw the UK Azure usergroup hosting the legendary ScottGu for a morning of all-things-Azure. The excited crowds outside the sell-out arena were testament to the respect for Scott’s opinions on the current cloud landscape and his thoughts on the future.

But what of the talk itself? What followed was a bit of a mixed bag – the audience was so large that presenting anything in real depth was a lost cause, so the talk itself stayed fairly high level. He spent a fair while making the case for adopting cloud-based platforms and infrastructure. This seemed a little unnecessary for an event organised by the UK Azure usergroup, but as the invitation had been widened to other circles, so the scope of the talk broadened to match.

We saw demos of the new enterprise-scale VMs now available (G series apparently stands for Godzilla), with the all-important IOPS counters to prove their power. The sheer performance differential is impressive – premium storage, an abundance of cores and vast swathes of memory obviously make an incredible difference. The big argument here is that they can be turned on and off as and when the real crunching power is required, limiting their cost.

In the Platform-as-a-Service (PaaS) world, the talks and demos were understandably around the preview features announced over the past few months – the Azure API service, Event Hubs, Stream Analytics and Machine Learning. Scott’s showpiece demo really shows the power behind some of these elements – a hand-held heat/humidity sensor linked to his PC sends real-time readings to an event hub, which is polled continuously by a streaming analytics job that pushes its aggregated results directly to a PowerBI dashboard. Not to an intermediary database, not to a linked blob storage account, but directly to the dashboard data model itself. The real-world applications of knowing when ScottGu is breathing onto a sensor are questionable, but the infrastructure costs and development turnaround time of getting a real-time operational dashboard are mind-blowingly small.
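
For anyone wondering what the wiring behind a demo like that looks like, the Stream Analytics query itself can be tiny – something along these lines, though the input and output aliases and the sensor field names here are my own guesses rather than Scott’s actual setup:

-- Aggregate the raw sensor readings over a five-second tumbling window and
-- write the results straight to a Power BI output defined on the job.
SELECT
    AVG(Temperature) AS AvgTemperature,
    AVG(Humidity)    AS AvgHumidity,
    System.Timestamp AS WindowEnd
INTO [powerbi-dashboard]
FROM [sensor-input]
GROUP BY TumblingWindow(second, 5)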

A few small niceties crept in during the management piece. Resource Groups allow services to be managed as a single entity: all the components that make up an application can be brought up or down at once, simplifying automation scripts no end. Then there are resource tags, a minor addition that allows services to be ‘tagged’ with arbitrary labels. A hosting company, for example, could tag resources by customer and filter its entire subscription on a customer-by-customer basis. Again, a small addition that further enriches the management of Azure services.

Finally, we saw some real-world implementation presentations – two companies who have leveraged Azure to their advantage. Ad Coelum, a small startup with a SaaS product aimed at legal firms, have been able to plan a simultaneous global rollout simply through the various scale-out options available. JustGiving, however, have a complex chain of services that utilises some of the most powerful functionality Azure has to offer. From their newsfeed-style user timelines to machine learning that models the complex behaviour of ‘Givers’, they’re really pushing forward in cloud-based PaaS architecture.

Overall it was a good day. It’s certainly good to see the level of enthusiasm in the developer community for cloud architecture, and it’s great to see how quickly it’s currently developing. Next time a client solution requires some blue-sky thinking, I’ll certainly be imagining clouds.

Adatis Hackathon Jan 2015 – Streaming Analytics First Thoughts

There have been a huge number of additions to the Azure platform over recent months, with a few exciting features for those of us in the BI world.

At the second Adatis Hackathon we decided to delve into these features to see what we could create. The plan was ambitious to say the least – three teams would each take a different new feature and attempt to assemble them into a coherent system.

STREAM team would fake a thermostat reading, push it to an event hub and aggregate the data via Stream Analytics.

DREAM team would use this data to predict how old the system has to get before it starts breaking and needs to be replaced. They would achieve this via Azure Machine Learning.

GLEAM team would create live operational and analysis dashboards using PowerBI, specifically via the new Power BI Designer application.

Everyone involved was new to the particular technologies we used, just in case it wasn’t challenging enough.

image

I captained the STREAM team, with a hope of understanding stream analytics’ role in the future of cloud BI and whether we’ve finally achieved true real-time BI. DREAM and GLEAM will be posting their findings over the coming weeks.

Stream Analytics:

We were up and running incredibly quickly – we spun up an event hub and had it populated with event data using a demo C# script grabbed from a tutorial. We then followed the quickstart guide and had a stream analytics service working away, depositing data.

There was a rather tense moment as we sat waiting for the streaming data to get back to us but, sure enough, we started seeing record counts on the dashboard and records began to arrive in our Azure DB.

So, having achieved the basic POC very quickly, we went back and added more details to the message stored in the event hub. Just a couple of additional string and numeric attributes that we could use in some interesting aggregations.
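
Our query at that point was only slightly more ambitious than the tutorial one – roughly the shape below, though the attribute names and input/output aliases here are illustrative rather than the exact ones we used:

-- Count and average the readings per device over a one-minute tumbling
-- window, writing the aggregates out to the Azure SQL Database output.
SELECT
    DeviceName,
    COUNT(*)         AS ReadingCount,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp AS WindowEnd
INTO [sql-output]
FROM [thermostat-hub]
GROUP BY DeviceName, TumblingWindow(minute, 1)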

This is where our problems began – we had to fiddle around a little, recreating the event hub a few times to be very clear about what was making it through. However, Stream Analytics proved nearly impossible to debug. There are operational messages when things clearly aren’t working – you’ll see errors in the operational log when it can’t connect to the database, for example – but we got to a point where we saw 1000 input records and 0 output records with no details about why. Was our query incorrect? Were there data mapping problems at the destination? Did we need to rebuild the analytics query to refresh the metadata? We resorted to blind fixes to resolve the issue, each time having to wait 10+ minutes to see if our changes had taken effect.

In the end we ran out of time before fully getting to the bottom of the issue. Our overall impression is that it could be a very powerful service with a myriad of uses, especially around the implementation of real-time BI. However, as a development experience it is truly frustrating: there is too much black box and not enough debugging and error-logging functionality.

Obviously we’re still in a preview phase and it’s entirely possible that we simply missed the various debugging options – but I’d certainly be wary of recommending a client down this route until I was much more confident that any issues could be dealt with.

Not to be deterred, I’m determined to get a more advanced streaming example set up over the next few weeks. I’ll be posting my findings when I do…

Adatis BI Hackathon – Q3 2014 Inaugural Hack Day

Hackathons or ‘Hack Days’ are incredibly common in the wider development community. They encourage innovation, research, cross-team communication and collaboration. However, many people working with the SQL Server BI stack don’t see themselves as real developers and, as such, don’t see themselves as eligible for such events.

Here at Adatis HQ we disagree – the development practices driving forwards the entire community can and do apply to Business Intelligence. We’ve successfully implemented agile approaches, CI, TDD and a host of other disciplines previously seen as ‘only for real developers’, so why not adopt the community events as well?

To test this out, we recently held the very first Adatis BI Hackday. This was a purely internal event to trial the idea and see what we could come up with. It was very much a learning experience, but we had a great day and are already planning the next!

The structure:

Given it was the first hackathon for many, we opted for an unusual structure. We polled the company for a couple of ideas, picked the best two and then let everyone choose which team they wanted to join. Usually you would attend the event as a team and decide on your project during the time allowance.

Team A – Power BI

The first team aimed to recreate our internal BI systems using only PowerBI – no SQL Licences allowed!

Much of the work went on creating a PowerPivot model that pulled data in from our various timesheet, HR and Finance systems to provide a single model that represented our core business.

This data was then augmented using PowerQuery – clients were given descriptions pulled from Wikipedia and employees were linked to their LinkedIn pages.

Finally, several dashboards were produced in PowerView, charting earnings by consultant over time, consultant-client cross-overs, core client dashboards and more.

Whilst the team achieved a huge amount in a day, there was a sense that there was a lot more that could be done. The building of the PowerPivot model consumed the lion’s share of the day, meaning further technologies such as PowerMap, Q&A and AzureML were beyond reach. Plenty of ideas for next time though!

Team B – Auto BI

The second team wanted to automate the early parts of the full datawarehouse build using BIML. If a base SSIS template can be produced to pull data from a standard source to a staging table, this would save a lot of time spent manually duplicating and tweaking template packages.

The team split into 2 groups – one to create a small C# front end that would connect to a source system (SQL Server in the demo instance) and pull out the required metadata about the selected tables. It would then deploy a template configuration database and store the metadata there for later use.
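
The metadata gathering itself is straightforward for a SQL Server source – little more than a query over the INFORMATION_SCHEMA views, along the lines of the sketch below (the exact columns we captured may have differed):

-- Pull basic table and column metadata for every user table in the source,
-- ready to be stored in the configuration database.
SELECT
    t.TABLE_SCHEMA,
    t.TABLE_NAME,
    c.COLUMN_NAME,
    c.DATA_TYPE,
    c.CHARACTER_MAXIMUM_LENGTH,
    c.IS_NULLABLE
FROM INFORMATION_SCHEMA.TABLES AS t
INNER JOIN INFORMATION_SCHEMA.COLUMNS AS c
    ON  c.TABLE_SCHEMA = t.TABLE_SCHEMA
    AND c.TABLE_NAME   = t.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
ORDER BY t.TABLE_SCHEMA, t.TABLE_NAME, c.ORDINAL_POSITION;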

The other half of Team B got to grips with BIML – working to build a BIML script for a fixed source table that created the necessary SSIS package. As this was their first time seeing BIML, there was quite a learning curve, but they soon had a package that matched the requirements.

In the final hours of the day, the two groups reunited to add the C# wrapper around the BIML script, iterating through the source metadata and modifying the script for each source table to generate a custom SSIS package for each one.

The proof of concept is certainly there and, with time to improve the front-end with additional source system types and more advanced functionality added to the BIML scripts, this could be a huge improvement in how we approach the more standard parts of the traditional warehouse build.

Conclusions

Everyone agreed that the day was hugely successful in getting everyone involved. Many technologies were new to Adatis, meaning the junior members were working at the same level as the experienced consultants and directors, with interesting results!

As a staunch supporter of Team B I couldn’t possibly comment on the overall victor (although I think it was clear…), but it was immediately obvious that HackDays are immensely beneficial to business intelligence developers.

Q4 2014 HackDay is already in the works – whilst the day will still be made up of internal teams while we revise and improve the format of the day, future sessions may expand to include teams from the wider BI community. Then we’ll see some real competition…

Slowly-Changing Parent-Child Dimensions – Part 1: The Theory

A common structure in OLTP systems is a parent-child relationship with Object and ObjectParent tables creating a recursive structure. This is easily represented as a Warehouse dimension table, usually flattened out but occasionally left as native Parent-Child if required.

image

I recently encountered an issue where the client not only required a flexible, dynamic parent-child dimension but also required it to be slowly changing. Each fact record joining to the dimension, at any granularity, had to be aware of its hierarchical context at that point in time, despite there being only one source record per dimension object. We end up with something like this:

image

Throughout the course of the engagement I implemented two different models for this as requirements changed; I’ll detail the solutions in part 2 of this post.

If you’re not familiar with Slowly Changing Dimensions, namely Type 2 SCD, I put together a quick introduction in a blog post here to bring you up to speed.

So how do you slowly change a Parent-Child relationship?

Since we have source records for each node in our hierarchy, it makes sense to keep this structure. However, over time that same node can fit into the hierarchy in different places, changing parent records, gaining/losing children or even moving levels.

We therefore need to differentiate between the different hierarchical contexts of a node. As with other SCD implementations, we give each historical version a surrogate key, so we can accurately identify the node in the relevant context.

Parent-Child makes this tricky, however – if we only created historical versions when individual nodes changed, records joining in at lower granularities would not know which historical version to use. We therefore need new versions of the child nodes, pointing to the new surrogate keys, which in turn means new versions of their children, and so on down the hierarchy.

Essentially, anytime a node in the hierarchy changes, we need to create new historical versions for all descendants of that node. Our original structure, using this method, would now look like this:

image

You can see that when Parent N’s own parent changed, we had to propagate that change to Object A to ensure our lowest granularity object has a key for each temporal version of the structure.

We can now use this structure with a fact table – we know that a fact record occurring on 10/02/14, for example, would aggregate up the orange-marked path through A2 > N1 > Y.

That’s the key point when implementing SCD for Parent-Child structures: if any change occurs, anywhere in the hierarchy, all descendants of that node need a new Type 2 record created. Because each object references the surrogate key of its parent, not the business key, every join in the structure is based upon a specific historical version of that record, and so the historical context is implied by the foreign key relationships.
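
As a rough illustration of the structure (the table and column names here are illustrative rather than anything from the client solution), the dimension ends up looking something like this:

-- Each row is one historical version of one node. ParentObjectKey holds the
-- surrogate key of the parent *version*, not the parent's business key.
CREATE TABLE dbo.DimObject
(
    ObjectKey         int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    ObjectBusinessKey int          NOT NULL,                  -- source system key
    ObjectName        varchar(100) NOT NULL,
    ParentObjectKey   int          NULL,                      -- parent version's surrogate key (NULL at the top level)
    StartDate         date         NOT NULL,
    EndDate           date         NULL,
    CurrentFlag       bit          NOT NULL
);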

Whilst complex in theory, once implemented your fact > dimension relationship is very simple. Your fact record has a single foreign key which holds the full historical context of that record.
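
If you do need to unpick the path at query time, for reporting or validation, a recursive CTE can walk from the fact’s key up through the parent surrogate keys – and every level it visits is automatically the version that was valid when the fact occurred. A quick sketch, assuming the DimObject structure above and a hypothetical fact table:

-- Walk from a single fact row up to the top of the hierarchy.
WITH HierarchyPath AS
(
    SELECT d.ObjectKey, d.ObjectName, d.ParentObjectKey, 0 AS LevelsUp
    FROM   dbo.FactTransaction AS f
    INNER JOIN dbo.DimObject AS d ON d.ObjectKey = f.ObjectKey
    WHERE  f.TransactionID = 1          -- one example fact record

    UNION ALL

    SELECT p.ObjectKey, p.ObjectName, p.ParentObjectKey, h.LevelsUp + 1
    FROM   HierarchyPath AS h
    INNER JOIN dbo.DimObject AS p ON p.ObjectKey = h.ParentObjectKey
)
SELECT ObjectKey, ObjectName, LevelsUp
FROM   HierarchyPath
ORDER BY LevelsUp;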

In the next post, I’ll discuss a couple of techniques for implementing the above transformation inside a standard ETL structure.

Introduction to Slowly Changing Dimensions (SCD) Types

When setting about writing a recent blog post, I wanted to link to a clear, concise blog post on the different SCD types for anyone not familiar with the topic. Whilst there are a variety of thorough introductions out there, I didn’t find one as clear and concise as I’d like.

I therefore give you my own offering, a quick introduction to Slowly Changing Dimensions, or SCD, in a datawarehousing scenario.

For a more detailed discussion of slowly changing dimensions, I’d suggest looking at Kimball Group’s own posts on type 1 and types 2 and 3.

What are slowly changing dimensions?

When organising a datawarehouse into Kimball-style star schemas, you relate fact records to a specific dimension record with its related attributes. But what if the information in the dimension changes? Do you now associate all fact records with the new value? Do you ignore the change to keep historical accuracy? Or do you treat facts before the dimension change differently to those after?

It is this decision that determines whether to make your dimension a slowly changing one. There are several different types of SCD depending on how you treat incoming change.

What are the types of SCD?

Very simply, there are 6 commonly defined types of Slowly Changing Dimension; they are as follows:

  • Type 0 – Fixed Dimension
    • No changes allowed, dimension never changes
  • Type 1 – No History
    • Update record directly, there is no record of historical values, only current state
  • Type 2 – Row Versioning
    • Track changes as version records with current flag & active dates and other metadata
  • Type 3 – Previous Value column
    • Track change to a specific attribute, add a column to show the previous value, which is updated as further changes occur
  • Type 4 – History Table
    • Show current value in dimension table but track all changes in separate table
  • Type 6 – Hybrid SCD
    • Utilise techniques from SCD Types 1, 2 and 3 to track change

In reality, only types 0, 1 and 2 are widely used, with the others reserved for very specific requirements. Confusingly, there is no SCD type 5 in commonly agreed definitions.

After you have implemented your chosen dimension type, you can then point your fact records at the relevant business or surrogate key. Surrogate keys in these examples relate to a specific historical version of the record, removing join complexity from later data structures.

Practical Examples

We have a very simple ‘customer’ dimension, with just 2 attributes – Customer Name and Country:

image 
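
In T-SQL terms the starting point is something like the sketch below – the key column and data types are assumptions on my part:

CREATE TABLE dbo.DimCustomer
(
    CustomerID   int         NOT NULL,   -- business key from the source system
    CustomerName varchar(50) NOT NULL,
    Country      varchar(50) NOT NULL
);

INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Country)
VALUES (1, 'Bob', 'United Kingdom');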

However, Bob has just informed us that he has now moved to the US and we want to update our dimension record to reflect this. We can see how the different SCD types will handle this change and the pros and cons of each method.

Type 0

Our table remains the same. This means our existing reports will continue to show the same figures – perhaps it is a business requirement that each customer is always allocated to the country they signed up from.

All future transactions associated with Bob will also be allocated to the ‘United Kingdom’ country.

Type 1

The table is updated to reflect Bob’s new country:

image

All fact records associated with Bob will now be associated with the ‘United States’ country, regardless of when they occurred.

We often just want to see the current value of a dimension attribute – it could be that the only dimension changes that occur are corrections to mistakes, or maybe there is no requirement for historical reporting.

Type 2

In order to support type 2 changes, we need to add four columns to our table:

· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require; we therefore need to create a new ID that the fact records can join to specifically.

· Current Flag – A quick method of returning only the current version of each record

· Start Date – The date from which the specific historical version is active

· End Date – The date to which the specific historical version is active

With these elements in place, our table will now look like:

image

This method is very powerful – you maintain the history for the entire record and can easily perform change-over-time analysis. However, it also comes with more maintenance overhead, increased storage requirement and potential performance impacts if used on very large dimensions.

Type 2 is the most common method of tracking change in data warehouses.
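
To make that maintenance overhead concrete, the change for Bob might be applied with something like the sketch below – this assumes the table has been extended with the Surrogate Key (as an identity column), Current Flag, Start Date and End Date columns described above, and the dates are purely illustrative:

-- Expire the current UK version of Bob's record...
UPDATE dbo.DimCustomer
SET    CurrentFlag = 0,
       EndDate     = '20150131'
WHERE  CustomerID  = 1
AND    CurrentFlag = 1;

-- ...and insert a new current version reflecting the move to the US.
-- SurrogateKey is assumed to be an IDENTITY column, so it isn't listed here.
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, Country, CurrentFlag, StartDate, EndDate)
VALUES (1, 'Bob', 'United States', 1, '20150201', NULL);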

Type 3

Here, we add a new column called “Previous Country” to track what the last value for our attribute was.

image

Note how this will only provide a single historical value for Country. If the customer changes his name, we will not be able to track it without adding a new column. Likewise, if Bob moved country again, we would either need to add further “Previous Previous Country” columns or lose the fact that he once lived in the United Kingdom.

Type 4

There is no change to our existing table here; we simply update the record as if a Type 1 change had occurred. However, we simultaneously maintain a history table to keep track of these changes:

Our Dimension table reads:

image

Whilst our Type 4 historical table is created as:

image

Depending on your requirements, you may place both ID and Surrogate Key onto the fact record so that you can optimise performance whilst maintaining functionality.

Separating the historical data makes your dimensions smaller and therefore reduces complexity and improves performance if the majority of uses only need the current value.

However, if you do require historical values, this structure adds complexity and data redundancy overheads. It is generally assumed that the system will use Type 1 or Type 2 rather than Type 4.

Type 6

The ‘Hybrid’ method simply takes SCD types 1, 2 and 3 and applies all techniques. We would maintain a history of all changes whilst simultaneously updating a “current value” column on all records.

image 

This gives you the ability to provide an element of change comparison without additional calculation, whilst still maintaining a full, detailed history of all changes in the system.

Personally, if this requirement came up, I would avoid the data redundancy of this extra column and simply calculate the current value using the LAST_VALUE() window function at run-time, although this depends on your priorities between data storage and direct querying performance.
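
Against the Type 2 style of table sketched earlier, that calculation looks something like this:

-- Derive the current country for each customer from the Type 2 history
-- rather than storing it in an extra column.
SELECT
    CustomerID,
    CustomerName,
    Country,
    StartDate,
    LAST_VALUE(Country) OVER (
        PARTITION BY CustomerID
        ORDER BY StartDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS CurrentCountry
FROM dbo.DimCustomer;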

SQL 2014: Boosting ETL Performance using Memory-Optimised Tables

With SQL Server 2014 due for release soon, there’s a lot of talk about how little is changing for those of us more interested in Business Intelligence. We’ll be getting clustered, updatable columnstore, which is very convenient, but I’ve been looking into the potential benefits of other new features. The real gem for me is the potential of memory-optimised tables or in-memory OLTP (previously known as Project Hekaton).

I’ve had a few questions around how columnstore and memory-optimised tables differ – they’re both in-memory solutions for data storage, right?

The big difference comes in disk usage – in order for a columnstore index to exist, there must also be a physical table written to the disk. Any updates to the table, even in SQL Server 2014, will require disk writes, and log writes too if you’re not bulk loading. Memory-optimised tables don’t have this overhead – we can push the staging data straight into memory and skip the disk entirely.

There are plenty of uses for this new type of table, but it’s the traditional ETL routine I’m interested in. We would normally pull data from our source systems, place it in a staging database while we clean, validate and so on, then write the cleaned data to the warehouse:

image

The data written to the staging database is transient, it’s only of use during the load process. It is therefore redundant to write this data to disk each time; I believe this is where memory-optimised tables will really have an impact.

To prove this theory, I set up a quick demonstration. I started by creating two identical tables, one physical, one memory-optimised, both schema copies of [AdventureWorks2012].[Sales].[SalesOrderDetail].

The memory-optimised table was created using the following script:

 

image
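
The script itself appeared as a screenshot, but in outline it looks like the sketch below – abbreviated to a handful of columns rather than the full SalesOrderDetail schema, and with the table name, bucket count and constraint name being my own choices:

-- Memory-optimised staging copy of Sales.SalesOrderDetail (abbreviated).
-- The database needs a MEMORY_OPTIMIZED_DATA filegroup before this will run.
CREATE TABLE dbo.SalesOrderDetail_MemStaging
(
    SalesOrderID       int      NOT NULL,
    SalesOrderDetailID int      NOT NULL,
    OrderQty           smallint NOT NULL,
    ProductID          int      NOT NULL,
    UnitPrice          money    NOT NULL,
    -- ...remaining SalesOrderDetail columns omitted for brevity...

    CONSTRAINT PK_SalesOrderDetail_MemStaging
        PRIMARY KEY NONCLUSTERED HASH (SalesOrderID, SalesOrderDetailID)
        WITH (BUCKET_COUNT = 131072)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
-- Switching to DURABILITY = SCHEMA_ONLY removes the remaining logging
-- entirely; the staging data is transient, so losing it on a restart is fine.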

 

I then built the simplest of SSIS packages, a data flow to move data from AdventureWorks2012.Sales.SalesOrderDetail to a ‘staging’ table, then read the data from the staging table and write it back to a disk-based table, representing our final warehouse.

image

I configured two versions of the package: one where the staging table was disk-based, and one where it was our memory-optimised table. I deployed and scheduled these packages as alternating jobs and kicked them off for a few hours to see how performance varied over time. The results are as follows:

image

image

 

The initial results are very promising - we see a huge reduction in execution time when using in-memory tables for our transient ETL data. We’re still performing minimal logging with this setup – you can reduce this even further by switching to SCHEMA_ONLY durability, but by bypassing writing the data to disk we’ve cut our ETL window in half.

There are obvious hardware limits to how this can be used, and in-memory tables have their own limitations, but implemented carefully it could change the face of warehouse loading drastically.