The BI Lateral Blog

The What and Why of Apache Spark on Azure HDInsight

Big data, i.e. data that ticks the 3V boxes (volume, velocity and variety), has been around for a long time in the world of engineering, and certainly since the days of the exciting Human Genome Project, completed in April 2003. Back then I was a fairly new BI developer and almost made a career switch into bioinformatics, so inspired was I by all that was being done in this area. If any of you remember, the summer of 2003 was a glorious one in England; I enjoyed it, and decided to stick with Microsoft BI technologies, from which I have derived much pleasure. However, I have always kept my eye on what the bioinformaticians are doing. With data, whether it is big or not, there are always non-functional requirements one needs to address with regard to:

  • Availability – how soon can I get my data after it is born? In traditional warehousing this is usually daily, falling in with the circadian rhythm of our working days. And, with data sets up to ‘biggish’, this was possible with traditional Microsoft BI technologies.
  • Accessibility – how soon can I get data that makes sense after it is born? Usually data needs to be cleaned, scrubbed and conformed before we unleash it on the end user, so daily ETL processes fitted quite nicely into this.
  • Interactivity – OK, so I have this report, but I see something of interest; how easily and quickly can I dig into that little trough on the graph and see it in a different way so I can understand it? Yes, this could be done, but it was usually met with frustration as one got mesmerised by a little spinning wheel, or one just gave up and raised a ticket for the reporting team to run an ‘ad hoc query’.

So, what does this have to do with Apache Spark on HDInsight?

Apache Spark is actually *the* open source big data processing framework, so here we see Microsoft supporting open source; the Microsoft angle is that it is hosted in the cloud on Azure HDInsight. Spark is a fast, general-purpose engine that supports in-memory operations. OK, so what does this mean, and how might it address some of the challenges we face delivering BI solutions on big data to users who are demanding the answers right now? In the words of Freddie Mercury, “we want it all and we want it now”.

Microsoft also announced yesterday a “major commitment to Apache Spark” – see here.


It is fast. I believe it is up to 100 times faster in-memory than Hadoop MapReduce processes, while still scaling processing out across multiple nodes. This is because it uses a DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory parallel computing. If anyone from the ETL world wants to visualise what a DAG might look like, there is a nice diagram here. Data can be persisted in memory or on disk; in HDInsight, the on-disk option would be Blob storage or Azure Data Lake. Unlike Hadoop, Spark can manipulate data in memory, and ‘manipulate’ could apply to ETL and/or reporting operations. The line between ETL and extracting data for reporting and analysis is blurring; we see this in Power BI, where one can perform pretty powerful ETL operations and then visualise the results right away.
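As a small illustration of the in-memory idea (a sketch only – the sales table and its region and amount columns are made up for this example), Spark SQL lets you pin a table in cluster memory and then query it repeatedly without going back to disk:

-- Assumes a table called 'sales' already exists over files in blob storage
-- (table and column names here are purely illustrative).
CACHE TABLE sales;      -- pull the data into the cluster's memory

SELECT region,          -- subsequent queries are served from memory,
       SUM(amount) AS total_sales
FROM sales
GROUP BY region;

UNCACHE TABLE sales;    -- release the memory when finished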

A BI Big Data One-Stop Shop

Spark is a single platform that supports all of the following flavours of data manipulation operations on big data:

  • Batch processing
  • Real time and Stream Analytics
  • Machine Learning and Predictive Analytics

It is general purpose, so it is not tied to only one kind of language. Developers can write data manipulation jobs and queries in:

  • Java
  • Scala
  • Python
  • R – with R Server now also being hosted on Spark on HDInsight

Looking at this list of languages, we see data scientists and traditional developers from the Microsoft and open source worlds collide. Techniques used by those long-standing bioinformaticians can be applied to our corporate big data, in tools they are used to working with. This opens up all sorts of possibilities for recruiting and for the landscape of the traditional BI team. And for us traditional ETL developers: while SSIS might have an in-memory pipeline for data transformations and manipulation, the data does need to be materialised along the way, and/or processed into an in-memory cache in another reporting-specific technology before the end user can access it. Spark, being a multi-purpose in-memory data manipulation platform, shortens the gap between data being born and it being available and accessible, in all the layers from 100% raw and organic through to packaged and processed.

Ready to Go

Apache Spark being supported on Azure HDInsight means that it is in the cloud, as “Software as a Service”. The cost of utilising a Spark cluster includes the managed service costs: one does not need to get a physical server, you are just using someone else’s server. With it being in the cloud, the storage is separated from the cluster, and you pay for each (cluster and storage) separately. The storage is cheap; the cluster is more expensive. The cluster does the computation, so it can be turned on and off when required, and you only pay for the up-time and the size of your cluster in terms of the number of nodes. The cluster is scalable: you can increase the number of nodes depending on how much data you have to process and how quickly.

When you spin up a Spark cluster in Azure HDInsight, it also comes pre-loaded with Jupyter and Apache Zeppelin notebooks. These are browser-based tools that allow for the creation of Python or Scala scripts, and allow a power user to run queries and visualise data using Spark SQL.
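For example, in a Jupyter notebook on an HDInsight Spark cluster, a Spark SQL cell along the following lines (using the hivesampletable sample table that ships with HDInsight clusters; the %%sql cell magic is provided by the notebook’s Spark kernel) returns a result that can be rendered as a table or chart right in the browser:

%%sql
-- Count records per country from the sample table provided on HDInsight clusters
SELECT country, COUNT(*) AS records
FROM hivesampletable
GROUP BY country
ORDER BY records DESC
LIMIT 10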

Spark also integrates with all our favourite reporting and visualisation tools, i.e.:

  • Power BI – also supporting Spark Streaming
  • Tableau
  • Qlik
  • SAP Lumira

So with drag-and-drop operations, the Spark SQL gets generated behind the scenes with almost immediate results, even on very large to big data sets.

In Conclusion

I have been keeping an eye on this one. My delight has been *sparked* by the appearance of R Server being hosted on Spark on HDInsight. I even heard a Microsoft engineer say he “was in love with this technology”. Next I will be looking at the automation of getting data into the cloud-based in-memory structures on Spark.

If your curiosity has been piqued, here are some good starter links…

And if you are really keen and want to understand more about DAG execution and the Dryad Microsoft Research project there is a very detailed paper here…

“Master” Master Data: Entity Sync in Master Data Services 2016

When I heard that “entities could be shared” between different models in Master Data Services 2016, my curiosity was piqued.  There is a nice video of this feature and hierarchy management here. The concept of sharing entities is called Entity Sync in Master Data Services 2016.

You may ask the question as to why entities need to be shared between models. This is something I have thought about a lot when deciding how to split up subject areas into different models in Master Data Services. The conclusion I have come to is that it is better to keep subject areas separated rather than having one very large model in Master Data Services with many entities. Keeping the models separated makes them much easier to manage when it comes to working in teams, deployment, versioning, etc.

However, there are some use cases where a single entity with the same definition and data might be needed in different models.

For example…

  • Say different departments in an organisation have their own data stewards who maintain their own master data in different models, and one needs to secure those models such that different departments cannot see or modify each other’s data. A department entity may then be needed in each of those models.
  • Some master data may change very infrequently but needs to be used in multiple models, e.g. currency, location, title. In the ‘old days’ of Master Data Services 2014 and prior, one would have to create replica entities and make sure that the data was loaded into each of the entities in each of the models.

I have always thought it would be nice to have “Master” master data in Master Data Services. This is where Entity Sync in Master Data Services 2016 comes in…

Setting up a new Entity Sync Relationship

Using the ChartOfAccounts, Customer and Product sample Master Data Services models, I created a new entity called Disney Characters in the ChartOfAccounts model, and I want to sync this to the other two models, i.e. Customer and Product.


What is nice with Master Data Services 2016 is that one can add descriptions to the Models, Entities and Attributes – a handy ‘document as you develop’ feature.


To set up Entity Sync, go to System Administration and select ‘Entity Sync’ from the ‘Manage’ menu.


Create a new Entity Sync and select the Source Model, Entity and Version. Then select the Target Model and Version. You can even change the name of the target entity to something different; I’m thinking ‘role-playing’ entities. You can set a schedule for when the entity gets synced, or choose ‘Sync On Demand’. The screenshot below shows the setup of the Entity Sync for the Disney Characters entity from the ChartOfAccounts model to the Customer model, renaming it along the way to Cartoon Characters.


The entity will not sync from the source entity (in the ChartOfAccounts model) to the target entity (in the Customer model) until the sync is run. You can’t seem to select more than one Entity Sync to run at a time through the Master Data Services web UI.


On going to the Customer model and viewing the synced Cartoon Characters entity, all the metadata (including all those descriptions) and data appear. However, it is not possible to add or modify members, and we cannot add any new attributes to the synced entity. This is what we want, right? “Master” master data with a single source in the source entity and model.


In the management of entities it is also indicated that this entity is a ‘Sync Target’. Nice.


Now that the entity is synced I can use it like I would any other entity in the Customer model. I can add domain-based dependencies and use it in other entities in the model, and I can apply security to it. I can make changes to the source entity and re-sync, and the new metadata and data will be propagated to the target entities.

Deleting an Entity Sync Relationship

But what if you delete the Entity Sync relationship? Will the target entity get deleted too? So, trying this out, I removed the Entity Sync from the Disney Characters entity in the ChartOfAccounts model to the Cartoon Characters entity in the Customer model.

Nope. The entity is still there, but now you can edit it (I’ve added ‘Road Runner’) and carry on modelling it like you would any other entity in the model. Although I did note that it lost its setting of ‘Create Code Values Automatically’. Not sure if this is a feature?


I can turn the entity sync relationship back on but specify ‘Existing Entity’.


But this threw an error; it seemed to have a problem with the new member ‘Road Runner’ that I had created.


I deleted ‘Road Runner’ and set up the Entity Sync relationship with no problem. I guess one could also drop and recreate the entity using the Entity Sync.

So what happens if I try to synchronise an entity that has dependencies on other domain entities? Trying this out with the SubClass entity from the ChartOfAccounts model gives the following error: “Domain based attribute(s) in the source entity will not be synced”.


This is what the SubClass source entity looks like, with a domain-based attribute pointing to the Class entity in the ChartOfAccounts model.


This is what the SubClass target entity now looks like in the Customer model. It has lost all of its domain-based attributes!


So I thought I would first synchronise the Class domain entity, then synchronise the SubClass entity. This gives the same “Domain based attribute(s) in the source entity will not be synced” warning as before, and the domain-based attributes of SubClass are still not synced.

Deployment Considerations

With the current SQL Server 2016 CTP 3.2 it does not appear that Entity Sync relationships are packaged up when creating packages, either from the Master Data Services web UI or using the MDSModelDeploy command-line utility. Although I guess they might never be, as one would have to consider the order in which models are deployed given the dependencies that Entity Sync relationships create. They are very easy to set up, but then again there is a new metadata table in the Master Data Services database called [mdm].[tblSyncRelationship], and there are corresponding stored procedures that operate on this table.


So I guess one could populate the required metadata in [mdm].[tblSyncRelationship] and regenerate the Entity Sync relationships as part of a deployment process.
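As a rough, hedged starting point (assuming read access to the Master Data Services database), one could inspect what the web UI has written to the new metadata table and find the stored procedures that operate on it before scripting anything into a deployment process:

-- Look at the Entity Sync metadata that the web UI has created
SELECT *
FROM [mdm].[tblSyncRelationship];

-- Find the stored procedures that operate on this table
SELECT name
FROM sys.procedures
WHERE name LIKE '%SyncRelationship%';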

In Conclusion

The introduction of the Entity Sync functionality in Master Data Services for SQL Server 2016 is a welcome new feature and opens up all sorts of possibilities for creating “Master” master data. The new Master Data Services web UI for managing these is great – so long as the entities you want to sync to other models do not contain any domain-based attributes.

I will definitely be using it, as it will:

  • Definitely speed up development – goodbye to all those SSIS packages to load the same data into the same entity in different models!
  • Add value to the end user, as everything will be in perfect sync

Generating Master Data Services Subscription Views

Creating subscription views through the Master Data Services web UI is easy, if a little slow at times when the Master Data Services installation is on a busy or slow server. Generally one does not have that many models and entities for which subscription views need to be created. But what if you do?

I am working with a client who is planning a large Master Data Services implementation with possibly hundreds of entities. Reverse-engineering the entities from a SQL Server relational engine into Master Data Services can be done using the MDSModelImport utility on Codeplex. Theoretically this utility could be extended to create subscription views too. However, there is an easier way, which could also be used in a couple of other scenarios, e.g.:

  • Regenerating subscription views when entities have been updated after their subscription views were created
  • Some of my colleagues reported that subscription views had ‘disappeared’ when deploying models to a new Master Data Services server – a bug?
  • Regenerating subscription views for a different version of the model

So I started doing some digging about in the Master Data Services database and came across a nicely named stored procedure called [mdm].[udpCreateSubscriptionViews].


The content of this stored procedure is very useful and contains a wealth of information on what the parameters are. I put in the required parameters for a leaf-level subscription view on one of my entities. It created the view, but the view did not show up in the Master Data Services web UI. In addition, querying the view returned no data, even though all the expected attribute columns had been created. Hmm?
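If you want to read through the procedure yourself, its definition (and the comments describing the parameters) can be pulled out with sp_helptext:

-- Print the definition of the procedure so the parameter descriptions can be read
EXEC sp_helptext N'mdm.udpCreateSubscriptionViews';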

So, using the above stored procedure as a starting point, and trawling through a number of stored procedures, I came upon:

  • the table [mdm].[tblSubscriptionView]. Of course!
  • and the stored procedure [mdm].[udpCreateAllViews]

Using the information that I gleaned from the above-mentioned digging, I inserted the required metadata into the table [mdm].[tblSubscriptionView], and then ran the stored procedure [mdm].[udpCreateAllViews]. Voilà!

Using the sample ‘Customer’ model provided with Master Data Services, I wanted to create leaf format subscription views for all thirteen entities. So I used the code below to populate the metadata in the [mdm].[tblSubscriptionView] table to generate Leaf format views. Also be sure to create a subscription view name that is unique; I like ModelName_EntityName, to keep all the views from a model together. Generating other kinds of subscription views would use the same method; only the metadata tags would have to change – see the contents of the stored procedure [mdm].[udpCreateSubscriptionViews].

In SQL Server 2012 or SQL Server 2014


In SQL Server 2016 (a few minor changes)


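The original scripts were shown as screenshots, so the following is only a rough, hedged sketch of the idea: the metadata rows can be driven from a query that pairs each entity of the model with its ModelName_EntityName view name. The mdm.tblModel and mdm.tblEntity column names below are assumptions from memory and should be verified against your MDS database, as should the full column list of [mdm].[tblSubscriptionView] for your SQL Server version.

-- Sketch only: list the entities of the Customer model with the naming
-- convention described above; an INSERT INTO [mdm].[tblSubscriptionView]
-- would then be built around a query like this, with the remaining columns
-- taken from the table's actual definition for your SQL Server version.
SELECT
    m.Name                AS ModelName,
    e.Name                AS EntityName,
    m.Name + '_' + e.Name AS SubscriptionViewName
FROM [mdm].[tblEntity] AS e
JOIN [mdm].[tblModel]  AS m ON m.ID = e.Model_ID
WHERE m.Name = 'Customer';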
Then execute the stored procedure [mdm].[udpCreateAllViews] to update all the views. This stored procedure will use whatever data is in the table [mdm].[tblSubscriptionView] to create any missing subscription views and to update any that might require updating. So remember, this is a very useful stored procedure in the Master Data Services database.
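Assuming the metadata rows are in place, the call itself is simply the following (the post runs it without arguments; check the procedure’s definition for any optional parameters on your version):

-- Create any missing subscription views and refresh existing ones
-- from the metadata in [mdm].[tblSubscriptionView]
EXEC [mdm].[udpCreateAllViews];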


This is what the data looks like in the [mdm].[tblSubscriptionView] table:


The subscription views appear in the Master Data Services web UI.

In SQL Server 2014


In SQL Server 2016


And data comes back from all the views.


In Summary…

So whether this is for ten views or hundreds, the process would be the same and just as quick!

It is worth remembering the stored procedure [mdm].[udpCreateAllViews], as it might be best practice to run it as part of the Master Data Services model deployment process to make sure that all subscription views are up to date.

Also, if any views have ‘disappeared’ but the relevant metadata exists in the table [mdm].[tblSubscriptionView], then running this stored procedure will regenerate them.


Next on the agenda is to start digging through these views… the possibilities are endless.