Adatis BI Blogs

The What and Why of Apache Spark on Azure HDInsight

Big data, i.e. data that ticks the 3V boxes (volume, velocity and variety), has been around for a long time in the world of engineering, and certainly since the days of the exciting Human Genome Project, completed in April 2003. Back then I was a fairly new BI developer and almost made a career switch into bioinformatics, so inspired was I by all that was being done in this area. If any of you remember, the summer of 2003 was a glorious one in England; I enjoyed it, and decided to stick with Microsoft BI technologies, which I have derived much pleasure from. However, I have always kept my eye on what the bioinformaticians are doing.

With data, whether it be big or not, there are always non-functional requirements one needs to address:

Availability – how soon can I get my data after it is born? In traditional warehousing this is usually daily, falling into the daily circadian rhythm of the working week, and sometimes, with up-to-biggish data sets, this was possible with traditional Microsoft BI technologies.

Accessibility – how soon can I get data that makes sense after it is born? Data usually needs to be cleaned, scrubbed and conformed before we unleash it on the end user, so daily ETL processes fitted quite nicely into this.

Interactivity – OK, so I have this report, but I see something of interest; how easily can I dig into that little trough on that graph, see it in a different way so I can understand it, and do it quickly? Yes, this could be done, but it was usually met with frustration as one got mesmerised by a little spinning wheel, or just gave up and raised a ticket for the reporting team to run an 'ad hoc query'.

So, what does this have to do with Apache Spark on HDInsight?

Apache Spark on Azure HDInsight is *the* open source processing framework, and here we see Microsoft supporting open source; the Microsoft angle is that it is hosted in the cloud on Azure HDInsight. Spark is a fast, general purpose engine that supports in-memory operations. OK, so what does this mean, and how might it address some of the challenges we face delivering BI solutions on big data to users who want their answers right now? In the words of Freddie Mercury, "we want it all and we want it now". Microsoft also announced yesterday a "major commitment to Apache Spark" – see here.

Speed

It is fast. I believe it is up to 100 times faster in-memory than Hadoop MapReduce, while still scaling out the processing of data across the nodes of a cluster. This is because it uses a DAG (directed acyclic graph) execution engine that supports cyclic data flow and in-memory parallel computing. If anyone from the ETL world wants to visualise what a DAG might look like, there is a nice diagram here.

Data can be persisted in-memory or on disk. In HDInsight, the on-disk storage would be Blob storage or Data Lake storage. Unlike Hadoop, Spark can manipulate data in-memory, and "manipulate data" could apply to ETL and/or reporting operations. The line between ETL and extracting data for reporting and analysis is blurring; we see this in Power BI, where one can perform pretty powerful ETL operations and then visualise the results right away.

A BI Big Data one stop shop

Spark is a single platform that supports all flavours of data manipulation on big data:

Batch processing
Real time and Stream Analytics
Machine Learning and Predictive Analytics
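To give a flavour of what this looks like in practice, here is a minimal Spark SQL sketch of the kind of query a power user might run interactively against the cluster. The table name hvac and its columns are purely illustrative assumptions (a table of temperature sensor readings already registered in the cluster), not something Spark or HDInsight prescribes:

-- Sketch only: assumes a table called hvac (temperature sensor readings) has already been registered
-- Pin the table in cluster memory so repeated interactive queries avoid re-reading from disk
CACHE TABLE hvac;

-- A typical interactive question: how far off target is each building, on average?
SELECT BuildingID,
       AVG(ActualTemp - TargetTemp) AS AvgTempDelta
FROM hvac
GROUP BY BuildingID;

Because the table is cached after the first read, subsequent queries of this kind are answered from memory across the nodes of the cluster, which is exactly the interactivity point made above.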
It is general purpose, so it does not support only one kind of language. Developers can write data manipulation jobs and queries in:

Java
Scala
Python
R – with R Server now also being hosted on Spark on HDInsight

Looking at this list of languages, we see the data scientists and traditional developers from the Microsoft and open source worlds collide. Techniques used by those long-standing bioinformaticians can be applied to our corporate big data, in tools they are used to working with. This opens up all sorts of possibilities for recruiting, and for the landscape of the traditional BI team. And for us traditional ETL developers: while SSIS might have an in-memory pipeline for data transformations and manipulation, the data does need to be materialised along the way, and/or processed into an in-memory cache in another reporting-specific technology, before the end user can access it. Spark, being a multi-purpose in-memory data manipulation platform, shortens the gap between data being born and it being available and accessible, in all the layers from 100% raw and organic to packaged and processed.

Ready to Go

Apache Spark being supported on Azure HDInsight means that it is in the cloud, delivered as a managed service. The cost of utilising a Spark cluster includes the managed service costs; one does not need to buy a physical server, you are just using someone else's. With it being in the cloud, the storage is separated from the cluster, and you pay for each (cluster and storage) separately. The storage is cheap; the cluster is more expensive. The cluster does the computation, so it can be turned on and off as required, and you only pay for the up-time and size of your cluster in terms of the number of nodes. The cluster is scalable: you can increase the number of nodes depending on how much data you have to process and how quickly you need it processed.

A Spark cluster in Azure HDInsight also comes pre-loaded with Jupyter and Apache Zeppelin notebooks. These are browser-based tools which allow a power user to create Python or Scala scripts, run queries and visualise data using Spark SQL.

Spark also integrates with all our favourite reporting and visualisation tools, i.e.:

Power BI – also supporting Spark Streaming
Tableau
Qlik
SAP Lumira

So, with drag and drop operations, the Spark SQL gets generated behind the scenes almost immediately, even on very large to big data sets.

In Conclusion

I have been keeping an eye on this one. My delight has been *sparked* by the appearance of R Server being hosted on Spark on HDInsight. I even heard a Microsoft engineer saying he "was in love with this technology". Next I will be looking at the automation of getting data into the cloud-based in-memory structures on Spark.

If your curiosity has been piqued, here are some good starter links:

https://blogs.technet.microsoft.com/dataplatforminsider/2016/06/06/microsoft-announces-major-commitment-to-apache-spark/
https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/
https://spark.apache.org/

And if you are really keen and want to understand more about DAG execution and the Dryad Microsoft Research project, there is a very detailed paper here: http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Azure HDInsight Polybase Example

Polybase has now made it over to SQL Server 2016 and is available to use as part of the SQL Server 2016 CTPs. One of its capabilities is to connect to Azure Blob Storage, so this blog post gives an overview of how you can use Polybase to connect to data in Azure HDInsight.

Installing Polybase in SQL Server 2016

Polybase appears in the standard feature list of the SQL Server 2016 CTP installs, alongside the other database engine features, so it's just a selection box to install it.

Having said that, there's one small limitation at the moment in the Polybase install, which is that only the SQL_Latin1_General_CP1_CI_AS and Latin1_General_100_CI_AS_KS_WS collations are supported, as explained in the following getting started guide. If you don't have the correct collation and you try to use Polybase, then you may find that the Polybase Windows service stops.

Azure HDInsight

For this post, I'm just going to use the sensor data sample that comes with Azure HDInsight when you provision a cluster. If you run through the following HDInsight tutorial, then you can get up and running with an HDInsight cluster and use the samples. In this case, I'm using the sensor data analysis sample, which comes with a csv file called HVAC.csv – a set of temperature readings.

Within the Azure portal, if we go to the HDInsight cluster, we need to note down the access keys that will allow Polybase to connect to Azure, and copy one of them for use later.

Polybase in SQL Server Management Studio

Over to SQL Server Management Studio now, where we need to create an external data source, a file format and an external table to access the data in Azure. Before creating these objects, we need to create a credential, specifying the Azure access key that we copied earlier:

--Enable trace flag for using a database-scoped credential.
DBCC TRACEON(4631, -1);

--Create a db master key if one does not already exist, using your own password.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'MyP@ssword31';

--Enter the Azure account name and the access key as the secret
CREATE CREDENTIAL AzureCred ON DATABASE
WITH IDENTITY = 'accountname', SECRET = 'accesskeyhere';

Now that we've set up the credential, we need to create an external data source, so that Polybase knows where it needs to go to get the data. Notice how we're referencing the credential that we created earlier:

CREATE EXTERNAL DATA SOURCE AzureDs
WITH (
    TYPE = HADOOP,
    --Specify the container name and account name
    LOCATION = 'wasbs://containerd@accountname.blob.core.windows.net/',
    --Specify the credential that we created earlier
    CREDENTIAL = AzureCred
);

Now we need to specify the format of the file that we want to access. In this case, having looked at the CSV, it's comma delimited, so the SQL required is as follows:

CREATE EXTERNAL FILE FORMAT CommaFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
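As an optional sanity check at this point (not a step in the original walkthrough), the external objects created so far can be listed from the catalog views:

--List the external data sources and file formats created so far
SELECT name, type_desc, location FROM sys.external_data_sources;
SELECT name, format_type, field_terminator FROM sys.external_file_formats;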
Finally, we need to create an external table – this is the object that we will be able to query with a T-SQL statement to consume data within the Azure HDInsight cluster. The syntax for this example is:

--Define the table structure
CREATE EXTERNAL TABLE Hvac (
    Date varchar(10),
    Time varchar(10),
    TargetTemp smallint,
    ActualTemp smallint,
    SystemID smallint,
    SystemAge smallint,
    BuildingID smallint
)
WITH (
    --Set the file to be the HVAC sensor sample file
    LOCATION = '/HdiSamples/SensorSampleData/hvac/HVAC.csv',
    DATA_SOURCE = AzureDs,
    FILE_FORMAT = CommaFormat,
    --We will allow the header row to be rejected
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1
);

Interestingly, you can't tell Polybase to skip a header record, as confirmed in the following Connect issue, so I found that my statement was failing when trying to query the data. The solution is to use REJECT_TYPE and REJECT_VALUE, which allow a number of rows to be rejected, as shown above. Once this is done, we'll see the external table in the database, and querying it is just like querying any other table with a SELECT statement.

So that's it – by creating the external data source, file format and external table, we can now use T-SQL in SQL Server 2016 to easily query data hosted within Azure HDInsight.
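To round the example off, here is a small sketch of the kind of SELECT statement referred to above; the particular query is my own illustration rather than part of the original walkthrough:

--Illustrative only: return the ten readings furthest above their target temperature
SELECT TOP (10) Date, Time, BuildingID, TargetTemp, ActualTemp
FROM Hvac
ORDER BY ActualTemp - TargetTemp DESC;

Behind the scenes, Polybase reads the file from Blob storage and streams it through the external table definition, so to the person writing the query it looks and behaves like any other T-SQL table.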