Adatis BI Blogs

LETS Process Data–Modern ETL for the Data Lake

At the PASS Summit this year, I attended a session by Michael Rys. In this session he introduced the concept of LETS as an approach to processing data in the data lake. If you are familiar with the data lake, then you will be familiar with having to apply a schema to the data held within. The LETS approach is purpose-designed for schematization. Where ETL stands for Extract, Transform, Load and ELT stands for Extract, Load, Transform, LETS stands for Load, Extract, Transform, Store:

Data are Loaded into the data lake.
Data are Extracted and schematized.
Data are Transformed in rowsets.
Data are Stored in a location, such as the Catalog in Azure Data Lake Analytics, Azure SQL Data Warehouse or Azure Analysis Services, for analysis purposes.

I really like this approach as it makes sense for how data are handled in the data lake. It’s something that I will be advocating and using, and I hope you do too!
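As a rough illustration of the pattern, here is a minimal U-SQL sketch of the Extract, Transform and Store steps (the file path, columns and catalog objects below are hypothetical), assuming the raw file has already been Loaded into the lake:

// Extract: apply a schema to data that has already been Loaded into the lake.
@sales =
    EXTRACT SaleId int,
            CustomerId string,
            Amount double,
            SaleDate DateTime
    FROM "/Raw/Sales/sales.csv"
    USING Extractors.Csv(skipFirstNRows:1);

// Transform: work on the schematized rowset, e.g. keep only valid sales.
@validSales =
    SELECT SaleId, CustomerId, Amount, SaleDate
    FROM @sales
    WHERE Amount > 0;

// Store: persist the result to the ADLA Catalog for analysis.
CREATE DATABASE IF NOT EXISTS SalesMart;

CREATE TABLE IF NOT EXISTS SalesMart.dbo.CleanSales
(
    SaleId int,
    CustomerId string,
    Amount double,
    SaleDate DateTime,
    INDEX idx_CleanSales CLUSTERED (SaleId) DISTRIBUTED BY HASH (CustomerId)
);

INSERT INTO SalesMart.dbo.CleanSales
SELECT SaleId, CustomerId, Amount, SaleDate
FROM @validSales;

The point of the pattern is simply that schematization happens on read, after the data has landed, and the schematized, transformed result is then persisted somewhere queryable.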

Geographic Spatial Analysis with Azure Data Lake Analytics (ADLA)

Whilst working on an Azure Data Lake project, a requirement hit the backlog that could be easily solved with a Geographical Information System (GIS) or even SQL Server – spatial data type support was introduced into SQL Server 2008. However, Azure Data Lake Analytics (ADLA) does not natively support spatial data analytics, so we'll have to extract the data into another service, right? Wrong! :) Due to the extensibility of Azure Data Lake Analytics, we can enhance it to do practically anything. In fact, we can lean on existing components and enhance the service without having to develop the enhancement itself. This blog is a quick run-through demonstrating how to enhance ADLA such that it will support spatial analytics and meet our project requirement.

Problem

For simplicity I've trivialised the problem. Here's the requirement: indicate which bus stops are within 1.5 km of Southwark Tube Station. To support this requirement, we have two datasets: a list of all the bus stops in London, including their geo location (circa 20k records), and the geo location record of Southwark Tube Station (a single record!). In fact, the location of the tube station is pretty accurate and is geo-located to the entrance pavement outside the tube station. This would be an easy problem for a GIS to solve. You would specify the central point, i.e. our Southwark Tube Station marker, and draw a circle, or buffer, with a radius of 1.5 km around it, then select all bus stops that fall within or intersect with that circle. This spatial analysis is easy for these systems as that's essentially what they are built to do. SQL Server 2008 introduced the Spatial Data Type, which allowed spatial-style analysis to be performed on geo data using T-SQL in conjunction with the supplied Geometry and Geography data types. More info on those can be found here. So, how can we solve our problem in ADLA, without a GIS and without having to export the data to SQL Server?

Solution

You can register existing assemblies with ADLA. It so happens that the SQL Server Data Types and Spatial assemblies are nicely packaged up and can be used directly within ADLA itself – think about that, it's pretty awesome! Caveat: at the time of writing we have no idea of the licence implications. It will be up to you to ensure you are not in breach :) Those assemblies can be downloaded from here. You only need to download and install the following file: ENU\x64\SQLSysClrTypes.msi. This installs two key assemblies, which you'll need to grab and upload to your Data Lake Store:

C:\Program Files (x86)\Microsoft SQL Server\130\SDK\Assemblies\Microsoft.SqlServer.Types.dll
C:\Windows\System32\SqlServerSpatial130.dll

Once they have been uploaded to your Data Lake Store, you need to register those assemblies with ADLA:

DECLARE @ASSEMBLY_PATH string = "/5.UTILITY/USQL-Extend/SQL-Server/";
DECLARE @TYPES_ASM string = @ASSEMBLY_PATH+"Microsoft.SqlServer.Types.dll";
DECLARE @SPATIAL_ASM string = @ASSEMBLY_PATH+"SqlServerSpatial130.dll";

CREATE DATABASE IF NOT EXISTS SQLServerExtensions;
USE DATABASE SQLServerExtensions;

DROP ASSEMBLY IF EXISTS SqlSpatial;
CREATE ASSEMBLY SqlSpatial
FROM @TYPES_ASM
WITH ADDITIONAL_FILES =
(
    @SPATIAL_ASM
);

Following registration of the assemblies, we can see the registration loaded in the ADLA Catalog database we created. We are now ready to use this U-SQL enhancement in our U-SQL query – let's go right ahead and solve our problem in one U-SQL script.

// Reference the assemblies we require in our script.
// System.Xml we get for free as a system assembly, so we didn't need to register that, unlike our SQLServerExtensions.SqlSpatial assembly.
REFERENCE SYSTEM ASSEMBLY [System.Xml];
REFERENCE ASSEMBLY SQLServerExtensions.SqlSpatial;

// Once the appropriate assemblies are registered, we can alias them using the USING keyword.
USING Geometry = Microsoft.SqlServer.Types.SqlGeometry;
USING Geography = Microsoft.SqlServer.Types.SqlGeography;
USING SqlChars = System.Data.SqlTypes.SqlChars;

// First create the centralised point.
// In this case it's the pavement outside the entrance of Southwark Tube Station, London.
// Format is Longitude, Latitude and then SRID.
// NB: It's Longitude then Latitude, which is the opposite way round to what you might expect.
DECLARE @southwarkTube Geography = Geography.Point(-0.104777,51.503829,4326);

// Next we extract our entire London bus stop data set from the file.
// There are about 20k of them.
@busStopInput =
    EXTRACT [StopCode] string,
            [StopName] string,
            [Latitude] double?,
            [Longitude] double?
    FROM @"/1.RAW/OpenData/Transport/bus-stops-narrow-full-london.csv"
    USING Extractors.Csv(skipFirstNRows:1,silent:true);

// This is effectively the transform step and where the magic happens.
// Very similar syntax to what you would do in T-SQL.
// We are returning all the bus stops that fall within 1500m of Southwark Tube.
// Essentially we return all stops that intersect with a 1500m buffer around the central tube point.
@closeBusStops =
    SELECT *
    FROM @busStopInput
    WHERE @southwarkTube.STBuffer(1500).STIntersects(Geography.Point((double)@busStopInput.Longitude,(double)@busStopInput.Latitude,4326)).ToString()=="True";

// The results are written out to a csv file.
OUTPUT @closeBusStops
TO "/4.LABORATORY/Desks/Sach/spatial-closebusstops.csv"
USING Outputters.Csv(outputHeader: true);

The query outputs a list of bus stops that are within the specified spatial distance of Southwark Tube Station. If we have a look at all the bus stops (in red) and overlay all the 'close' bus stops (in green), we can see the results. Pretty neat. Azure Data Lake Analytics does not natively support spatial data analytics, but by simply utilising the assemblies that ship with SQL Server we can extend the capability of U-SQL to provide that functionality – or practically any functionality we desire.

Data Flow Pt 2: Vertexes In Azure Data Lake

Following on from my previous post on Job Execution in the Azure Data Lake, this post will explore the concept of extents and how they are utilised by vertexes in the actual job processing in Azure. The U-SQL script that has been authored, compiled and deployed is the logical plan of how the author intends to transform input data into output data. This creates a total amount of work – essentially the amount of data it has to process – which is decomposed into a set of vertexes. Each vertex will process a subset of data, or extents (see Azure Data Lake Storage for more information), and represents a fraction of the total. Vertexes are displayed, or grouped, in a super vertex, also known as a stage. Vertexes in each stage are doing the same operation on a different part of the same data. The number of vertexes in a stage indicates the maximum theoretical parallelisation of that stage. The containers requested for the job will be allocated to complete each vertex. Say there is a 10GB file. This file will be split into 40 extents and allocated to at least 10 vertexes. If one wants to process all of the file in parallel, then requesting 10 containers will allow for full concurrent parallelism. All this is visualised as the job graph. If you have multiple transformations going on in your U-SQL script, this will create multiple stages, and the output of one vertex becomes the input of another vertex in a dependent stage (a minimal sketch of such a script is shown at the end of this post). As a result, dependent stages can begin processing even if preceding stages haven’t completed processing. Where a dependency does exist, it will create a critical path – the dependency chain of vertexes which keeps the job running to the very end, because the vertex at the bottom depends on the output of the vertex at the top. This can be seen in the Vertex Execution View in the Visual Studio Azure Data Lake Tools. It’s useful for optimising a job: by checking which vertex takes the longest, you can re-position or re-write elements of your script. It may not be possible to use all the reserved parallelism during a stage if there are fewer vertexes than Azure Data Lake Analytics Units (ADLAUs) available. For example, if I have 10 ADLAUs, that’s great for an early stage, as all ADLAUs will be allocated a vertex; but with later stages more and more ADLAUs will be idle, so by the last stage only 2 of the 14 are utilised. Unfortunately it is not currently possible to dynamically de-allocate ADLAUs during a job. In conclusion, by understanding how vertexes interact with extents and influence parallelism, you should be able to provision adequate resources for your jobs, as well as know where to start the investigation for any long-running queries you might have with the Vertex Execution View. In the next post of the series I will be looking into resource costs and optimising spend and time taken to process data.
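As a loose illustration of the above (the file, columns and paths are hypothetical), a script such as the following will typically compile into more than one stage: the aggregation vertexes depend on the output of the extract vertexes, forming the kind of dependency chain described in this post.

// Stage 1: extract the raw file; each vertex reads a subset of extents.
@readings =
    EXTRACT SensorId string,
            Reading double,
            EventDate DateTime
    FROM "/Raw/Sensors/readings.csv"
    USING Extractors.Csv(skipFirstNRows:1);

// Stage 2: aggregate; these vertexes consume the output of the stage above,
// so they form part of the dependency chain (the critical path) of the job.
@averages =
    SELECT SensorId, AVG(Reading) AS AvgReading
    FROM @readings
    GROUP BY SensorId;

OUTPUT @averages
TO "/Laboratory/sensor-averages.csv"
USING Outputters.Csv(outputHeader:true);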

Connecting Power BI to Hive

On a recent project I was tasked with importing data into Power BI from a Hive table. For those of you who are new to Azure or Big Data, Hive is a data warehousing infrastructure for Hadoop which sits in the HDInsight stack on Azure. The primary purpose of Hive is to provide data summarisation, query and analysis for big data sets. In this blog I’m going to take you through the steps, and note any gotchas, so that you can connect to Hive using Power BI.

Connecting to Hive

As Hive is part of the Azure HDInsight stack it would be tempting to select the HDInsight or Hadoop connector when you’re getting data. However, note the HDFS in brackets beside the Azure HDInsight and Hadoop File options: this means that you’ll be connecting to the underlying data store, which can be Azure Data Lake Store or Azure Blob Storage – both of which use HDFS architectures. But this doesn’t help when you want to access a Hive table. In order to access a Hive table you will first of all need to install the Hive ODBC driver from Microsoft. Once you’ve downloaded and installed the driver you’ll be able to make your connection to Hive using the ODBC connector in Power BI. You will need to input a connection string to connect, even though it says optional. The format of the connection string is as follows:

Driver={Microsoft Hive ODBC Driver};Host=hdinsightclustername.azurehdinsight.net;Port=443;Schema=default; RowsFetchedPerBlock=10000; HiveServerType=2; AuthMech=6; DefaultStringColumnLength=200;

On the next screen you’ll be asked to enter a username and password. The credentials used here are not what you use to access Azure, but the credentials you created when you set up the HDInsight cluster and use to log in to the cluster. Click connect and you’ll be able to pull through the tables you need into Power BI. Or, if you want to be selective in what is returned, you can write a HiveQL query in the ODBC dialog. It’s also worth noting that at the moment it’s only possible to import Hive data into Power BI and not to use DirectQuery, so if your data set is huge you’ll want to summarise the data or be really selective in what is returned first.

Data Flow: Job Execution in the Azure Data Lake

In this blog we will deep dive into job execution in Azure Data Lake Analytics (ADLA). If you’ve used ADLA you will have come across the following image: this is the graphical representation of the job execution. But what is it doing at each stage?

Preparing

The job execution begins with the authoring of the U-SQL script. The U-SQL script itself is the logical plan of how you intend to transform input data into output data. The script gets compiled, which will translate your U-SQL script primarily into C#, as well as a few other items such as XML (which will contain metadata information) and the job graph. Once compiled it will create an initial plan, which will then be optimised, thus producing an optimised plan. If it fails at this stage it will mostly be down to your U-SQL script failing to compile successfully. Fortunately, the error codes at this stage are fairly helpful and will indicate the line in question.

Queued

Once compiled and optimised, the job progresses to the queued stage. There are a few things which can cause a job to queue:

Running jobs – The default setting per subscription is 3, so if you or someone else in your subscription is already running a series of jobs, your job will be queued until at least one of them completes.
Higher priority jobs – As the queue is ordered by job priority, with lower numbers having a higher priority, other jobs can jump to the top of the queue, thus forcing your job to queue for longer.
Lack of resources – Even if there are no running jobs and there are no jobs ahead of yours in the queue, your job will continue to queue if there are not enough Azure Data Lake Analytics Units (ADLAUs) to start the job.

Running

Once the job has finished queuing it will run. The Job Manager allocates vertexes, which are collections of data to be processed, to the ADLAUs and uses YARN to orchestrate it. The vertexes will then execute. At this stage, if there is a vertex failure you can download the vertex locally and debug in Visual Studio. When the vertexes have successfully completed you will be able to consume the end product – either locally, if you have run the job locally, or in the Azure Data Lake Store.

Conclusion

The above process can be visualised in this graph. Knowing how ADLA executes your job is the first of many steps to being able to write performant U-SQL and debug effectively. Be on the lookout for more blogs on the technical deep dive into ADLA!

A Guide to Azure SQL Data Warehouse

So you've heard the hype - the Azure SQL DW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low-concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable, T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small subsets of your data before aggregating the results into a whole, this is the technology for you. However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly! I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts:
- What is it?
- How Does Scaling Work?
- Distributions
- Polybase
- Polybase Limitations
- Polybase Design Patterns
- CTAS
- Resource Classes
- Partitioning

Designing ETL (or ELT) In Azure SQLDW:
- Row counts
- Statistics
- Surrogate Keys

Performance Tuning Azure SQLDW:
- Plan Exploration in SQLDW
- Data Movement Types
- Minimising Data Movement

Managing Azure SQLDW:
- Backup & Restore
- Monitoring Distributions
- System Monitoring
- Job Orchestration
- Scaling and Management
- Performance Tuning Queries

Azure SQLDW Architecture:
- Presentation Layers
- Data Lake Integrations

Testing the Waters: An Overview of Data Science using Azure Data Lakes

Data Science can fit seamlessly within the ecosystem of the data lake, whether this is through HDInsight or the extensibility of Azure Data Lake Analytics and U-SQL. This blog will give a brief overview of what Data Science is; how to link Data Science toolkits to the Azure Data Lake; and best practices for managing the data output from experiments.

Data Science

Data Science is the relatively new kid on the block. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modelling, statistics, analytics, and mathematics. At its core, data science involves using automated methods to analyse massive amounts of data and to extract insight from them. Data science is helping to create new branches of science, and influencing areas of social science and the humanities. The trend is expected to accelerate in the coming years as data from mobile sensors, sophisticated instruments, the web, and more, grows.

Data Science In The Data Lake

The nature of the Azure Data Lake Store lends itself to Data Science in that it can hold any data, which the data scientist will want to access, transform and analyse. HDInsight contains many implementations for data science, such as Spark, R Server and others. Hooking HDInsight up to Azure Data Lake Store is pretty simple and follows these steps:

In the Azure Portal Marketplace, select HDInsight, which will bring up a series of blades. In this blog, I will be using Spark as my cluster type on HDInsight.
In the storage settings you can then link your HDInsight cluster to Azure Data Lake Store.
Confirm your configuration on the next blade and wait around 20 minutes for your cluster to deploy, and you’re good to go!

With Azure Data Lake Analytics, you incorporate data science by extending the capabilities of U-SQL, and you do this by installing a series of files. Open up your Azure Data Lake Analytics account and click on Sample Scripts at the top. This will bring forward the following blade. From there you’ll want to click on the U-SQL Advanced Analytics tab, which will copy about 1.5GB of files to the default Azure Data Lake Store associated with your ADLA account. This will take about 3 minutes to complete. When it’s finished copying the files it will then call a job to register the extensions, which can be found in the assemblies folder of the master database. More resources about the extensibility of U-SQL can be found here:
https://blogs.msdn.microsoft.com/azuredatalake/2017/03/10/using-custom-python-libraries-with-u-sql/
https://blogs.msdn.microsoft.com/azuredatalake/2016/11/22/u-sql-advanced-analytics-introducing-python-extensions-for-u-sql/

The Laboratory

Within Azure Data Lake Store, folder and file management is incredibly important for a well-running data lake. See my blogs on Storage and Best Practices and Shaping The Lake for more information on how to set up your Azure Data Lake Store. The Laboratory is an area to be used exclusively by data scientists. It’s an area where they can persist the results of experiments and data sets without impacting the day-to-day operations within the data lake or other data scientists. The laboratory is organised into two broad areas: Desks and Exhibits. Desks contain personal workspaces, the contents of which can be organised however the person wishes. It can be as well organised, or disorganised, as the person themselves. The Exhibit contains data sources produced in the Laboratory which are ready to be consumed by other users or systems.
Both are laid out below. As always, if you have any feedback or comments, do let me know!
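Returning to the U-SQL Advanced Analytics extensions mentioned above: once registered, the Python extension can be referenced from U-SQL. The sketch below is a rough illustration rather than production code (the input file, columns and Python logic are all hypothetical), following the pattern described in the linked Microsoft posts, where a function named usqlml_main receives and returns a pandas DataFrame for each partition of the rowset.

REFERENCE ASSEMBLY [ExtPython];

// Hypothetical Python snippet: add a derived column to each partition's DataFrame.
DECLARE @pyScript string = @"
def usqlml_main(df):
    df['doubled'] = df.value * 2
    return df
";

@input =
    EXTRACT experiment string,
            value int
    FROM "/4.LABORATORY/Desks/Example/experiment-input.csv"
    USING Extractors.Csv(skipFirstNRows:1);

// The rowset is partitioned on the reduce key and each partition is handed to Python.
@scored =
    REDUCE @input ON experiment
    PRODUCE experiment string, value int, doubled int
    USING new Extension.Python.Reducer(pyScript:@pyScript);

OUTPUT @scored
TO "/4.LABORATORY/Desks/Example/experiment-output.csv"
USING Outputters.Csv(outputHeader:true);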

Azure Data Lake Store–Storage and Best Practices

The Azure Data Lake Store is an integral component for creating a data lake in Azure, as it is where data is physically stored in many implementations of a data lake. Under the hood, the Azure Data Lake Store is the Web implementation of the Hadoop Distributed File System (HDFS), meaning that files are split up and distributed across an array of cheap storage. This blog will go into the physical storage of files in the Azure Data Lake Store and then best practices, which utilise the framework.

Azure Data Lake Store File Storage

As mentioned, the Azure Data Lake Store is the Web implementation of HDFS. Each file you place into the store is split into 250MB chunks called extents. This enables parallel reads and writes. For availability and reliability, extents are replicated into three copies. As files are split into extents, bigger files have more opportunities for parallelism than smaller files. If you have a file smaller than 250MB, it is going to be allocated to one extent and one vertex (which is the unit of work presented to Azure Data Lake Analytics), whereas a larger file will be split up across many extents and can be accessed by many vertexes.

The format of the file has a huge implication for the storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelisable, as data does not span extents. Non-splittable formats, however – files which are not row oriented, where data is often delivered in blocks, such as XML or JSON – cannot be parallelised, as data spans extents and can only be processed by a single vertex.

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed, clustered-index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in Azure Data Lake Analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.

Azure Data Lake Store Best Practices

The best practices generally involve the framework as outlined in the following blog: http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework. The framework allows you to manage and maintain your data lake. So, when setting up your Azure Data Lake Store, you will want to initially create the following folders in your Root. Raw is where data is landed directly from source, and the underlying structure will be organised ultimately by Source. Source is categorised by Source Type, which reflects the ultimate source of the data and the level of trust one should associate with it. Within the Source Type, data is further organised by Source System. Within the Source System, the folders are organised by Entity and, if possible, further partitioned using the standard Azure Data Factory Partitioning Pattern of Year > Month > Day etc., as this will allow you to achieve partition elimination using file sets. The folder structure of Enriched and Curated is organised by Destination Data Model, and each Destination Data Model folder is structured by Destination Entity. Enriched and Curated data can either be in the folder structure and/or within the database.
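Returning to the structured storage mentioned above: as a small sketch (the database, table and columns are hypothetical), a managed U-SQL table is created with a clustered index and a hash distribution, and its data is then held under the Catalog folder of the store.

CREATE DATABASE IF NOT EXISTS Telemetry;

// A managed U-SQL table: row-oriented, clustered-index storage distributed by hash.
// The table could additionally be partitioned, for example by a date column.
CREATE TABLE IF NOT EXISTS Telemetry.dbo.DeviceReadings
(
    DeviceId string,
    ReadingDate DateTime,
    Reading double,
    INDEX idx_DeviceReadings CLUSTERED (DeviceId, ReadingDate) DISTRIBUTED BY HASH (DeviceId)
);

The hash distribution spreads rows across the distributed storage, while the clustered index defines how rows are ordered within it.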

Shaping The Lake: Data Lake Framework

The Azure Data Lake has just gone into general availability and the management of Azure Data Lake Store, in particular, can seem daunting, especially when dealing with big data. In this blog, I will take you through the risks and challenges of working with data lakes and big data. Then I will take you through a framework we’ve created to help best manage these risks and challenges. If you need a refresher on what a data lake is and how to create your first Azure Data Lake Store and your first Azure Data Lake Analytics job, please feel free to follow the links.

Risks and Challenges of Big Data and Data Lake

The challenges posed by big data are as follows:

Volume – is the sheer amount of data becoming unmanageable?
Variety – Structured tables? Semi-structured JSON? Completely unstructured text dumps? We can normally manage with systems that contain just one of these, but if we’re dealing with a huge mix, it gets very tricky.
Velocity – How fast is the data coming in? And how fast do we need to get it to the people who need it?
Veracity – How do we maintain accuracy, veracity, when the data is of varying volumes, the sources and structures are different, and the speed at which they arrive in the Lake varies?

Managing all four simultaneously is where the challenges begin. It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sales pitch says exactly this – “Storage is cheap, store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data and poorly managed files erode trust in the lake as a source of information. Dumping is bad.

There is also data drowning – as the volume of the data tends towards the massive and the velocity only increases over time, we are going to see more and more information available via the lake. When it gets to that point, if the lake is not well managed, then users are going to struggle to find what they’re after. The data may all be entirely relevant and accurate, but if users cannot find what they need then there is no value in the lake itself. Essentially, data drowning is when the amount of data is so vast you lose the ability to find what’s in there.

If you ignore these challenges and treat the lake like a dumping ground, you will have contaminated your lake and it will no longer be fit for purpose. If no one uses the Data Lake, it’s a pointless endeavour and not worth maintaining. Everyone needs to be working together to ensure the lake stays clean, managed and good for a data dive! Those are the risks and challenges we face with Azure Data Lake. But how do we manage it?

The Framework

We’ve carved the lake up into different sections. The key point is that the lake contains all sorts of different data – some that’s sanitised and ready to be consumed by the business user, some that’s indecipherable raw data that needs careful analysis before it is of use. By ensuring data are carefully managed, you can instantly understand the level of preparation that data has undergone. Data flows from left to right – the further left areas represent where data has been input directly from source systems. The horizontal sections describe the level of preparation – Manual, Stream and Batch:

Manual – aka the laboratory. Here data is manually prepared with ad-hoc scripts.
Stream – The data here flows in semi-real time, coming from event hubs and being landed after processing through stream-specific tools such as Streaming Analytics. Once landed, there is no further data processing – the lake is essentially a batch processing tool.
Batch – This is more traditional data processing, the kind of “ETL” seen by many BI developers. We have a landing area for our raw data, a transitional area where data is cleaned, validated, enriched and augmented with additional sources and calculations, before finally being placed in a curated area where it is ready for consumption by the business.

We’re taking the blank canvas of the Data Lake Store and applying a folder structure, a file management process and a curation process over the top. The folder structure itself can be as detailed as you like; we follow a specific structure ourselves. The Raw data area, the landing place for any files entering the lake, has sub-folders for each source of data. This allows for the easy browsing of the data sources within the Lake and ensures we are not receiving the same data twice, even if we use it within different systems.

The Enriched and Curated layers, however, have a specific purpose in mind. We don’t take data and enrich/clean/process it without a business driver – it’s not something we do for fun. We can therefore assign a project or system name to it, and at this point it is organised into these end systems. This means we can view the same structure within Enriched as within Curated. Essentially, Raw data is categorised by Source whilst Enriched and Curated data is categorised by Destination.

There’s nothing complicated about the Framework we’ve created or the processes we’ve ascribed to it, but it’s incredibly important that everyone is educated on the intent of it and the general purpose of the data lake. If one user doesn’t follow process when adding data, or an ETL developer doesn’t clean up test files, the system starts to fall apart and we succumb to the challenges we discussed at the start. To summarise, structure in your Azure Data Lake Store is key to maintaining order:

• You need to enforce and maintain folder structure.
• Remember that structure is necessary whether using unstructured data or tables & SQL.
• Bear in mind that schema on read applies temporary structure – but if you don’t know what you’re looking at, this is going to be very hard to do!
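To make the Raw-to-Curated flow concrete, here is a small U-SQL sketch (the source system, entity, columns and paths are hypothetical) that reads a date-partitioned Raw area using a file set and writes the result to an Enriched area organised by destination; the predicate on the virtual date column allows partition elimination, so only the relevant folders are read.

DECLARE @cutoff DateTime = DateTime.Parse("2017-01-01");

// Raw is organised by source: /Raw/<SourceSystem>/<Entity>/yyyy/MM/dd/<files>.csv (hypothetical layout).
@sales =
    EXTRACT CustomerId string,
            Amount double,
            date DateTime // virtual column populated from the folder names
    FROM "/Raw/CRM/Sales/{date:yyyy}/{date:MM}/{date:dd}/{*}.csv"
    USING Extractors.Csv(skipFirstNRows:1);

// Only folders on or after the cutoff date are read, thanks to partition elimination.
@recentSales =
    SELECT CustomerId, Amount, date
    FROM @sales
    WHERE date >= @cutoff;

// Enriched is organised by destination data model and entity.
OUTPUT @recentSales
TO "/Enriched/SalesModel/Sales/recent-sales.csv"
USING Outputters.Csv(outputHeader:true);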

The What and Why of Apache Spark on Azure HDInsight

Big data, i.e. that which ticks the 3Vs (volume, velocity and variety) boxes, has been around forever in the world of engineering, and certainly since the days of the exciting Human Genome Project, completed in April 2003. Back then I was a fairly new BI developer, and almost did a career switch into bioinformatics as I was so inspired by all that was being done in this area. If any of you remember, the summer of 2003 was a glorious one in England; I enjoyed it, and decided to stick with Microsoft BI technologies, from which I have derived much pleasure. However, I have always kept my eye on what the bioinformaticians are doing. With data, whether it be big or not, there are always non-functional requirements one needs to address with regard to:

Availability – how soon can I get my data after it was born? In traditional warehousing this is usually daily, falling into our daily circadian rhythms of work days. And sometimes, with up-to-biggish data sets, this was possible with traditional Microsoft BI technologies.
Accessibility – how soon can I get the data that makes sense after it is born? Usually data needs to be cleaned, scrubbed and conformed before we unleash it onto the end user, so the daily ETL processes fitted quite nicely into this.
Interactivity – OK, so I have this report, but I see something of interest; how easily can I dig into that little trough on that graph and see it in a different way so I can understand it, and do it quickly? Yes, this could be done, but it was usually met with frustration as one got mesmerised by a little spinning wheel, or just gave up and raised a ticket to the reporting team to do an ‘ad-hoc query’.

So, what does this have to do with Apache Spark on HDInsight? Apache Spark on Azure HDInsight is actually *the* open source processing framework. Here we see Microsoft supporting open source. However, the Microsoft angle on this is that it is hosted in the cloud on Azure HDInsight.

Spark

It is a fast, general-purpose engine that supports in-memory operations. OK, so what does this mean, and how might this address some of the challenges we face delivering BI solutions on big data to users who are now demanding the answers right now? In the words of Freddie Mercury, “we want it all and we want it now”. Microsoft also announced yesterday a “major commitment to Apache Spark” – see here.

Speed

It is fast. I believe it is 100 times faster in-memory than Hadoop MapReduce processes, but it still uses the scale-out of processing data on multiple clusters. This is because it uses the DAG (directed acyclic graph) execution engine that supports cyclic data flow and in-memory parallel computing. If anyone from the ETL world wants to visualise what a DAG might look like, there is a nice diagram here. Data can be persisted in-memory or on disk. In HDInsight, the on-disk storage would be Blob Storage or Data Lake Store. Unlike Hadoop, Spark can manipulate data in-memory. Manipulating data could apply to ETL and/or reporting operations. The lines between ETL and extracting data for reporting and analysis are blurring. We see this in Power BI, where one can perform pretty powerful ETL operations and then visualise right away.

A BI Big Data one stop shop

Spark is a single platform to support all of the following flavours of data manipulation operations on big data:

Batch processing
Real-time and stream analytics
Machine learning and predictive analytics

It is general purpose, so does not support only one kind of language. Developers can write data manipulation jobs and queries in:

Java
Scala
Python
R – with R Server now also being hosted on Spark on HDInsight

Looking at this list of languages, we see the data scientists and traditional developers from the Microsoft and open source worlds collide. Techniques used by those long-standing bioinformaticians can be applied to our corporate big data, in tools they are used to working with. This opens up all sorts of possibilities for recruiting and for the landscape of the traditional BI team. And then for us traditional ETL developers: while SSIS might have an in-memory pipeline for data transformations and manipulation, this does need to be materialised along the way, and/or processed into an in-memory cache in another reporting-specific technology before the end user can access it. Spark, being a multi-purpose in-memory data manipulation platform, shortens the gap between data being born and it being available and accessible, in all the layers from it being 100% raw organic to being packaged and processed.

Ready to Go

Apache Spark supported on Azure HDInsight means that it is in the cloud. It is “Software as a Service”. The cost of utilising a Spark cluster includes the managed service costs. One does not need to get a physical server; you are just using someone else’s server. With it being in the cloud, the storage is separated from the cluster. You pay for each (cluster and storage) separately. The storage is cheap; the cluster is more expensive. The cluster does the computation, so it can be turned on and off when required, and you only pay for the up-time and size of your cluster in terms of the number of nodes. The cluster is scalable: you can increase the number of nodes depending on how much data you have to process and how quickly. When you spin up a Spark cluster in Azure HDInsight, it also comes pre-loaded with Jupyter and Apache Zeppelin notebooks. These are browser-based tools which allow for the creation of Python or Scala scripts, and allow a power user to run queries and visualise data using Spark SQL. Spark also integrates with all our favourite reporting and visualisation tools, i.e.:

Power BI – also supporting Spark Streaming
Tableau
Qlik
SAP Lumira

So with drag-and-drop operations the Spark SQL gets generated behind the scenes with almost immediacy, even on very large to big data sets.

In Conclusion

I have been keeping an eye on this one. My delight has been *sparked* by the appearance of R Server being hosted on Spark on HDInsight. I even heard a Microsoft engineer saying he “was in love with this technology”. Next I will be looking at the automation of getting data into the cloud-based in-memory structures on Spark. If your curiosity has been piqued, here are some good starter links:

https://blogs.technet.microsoft.com/dataplatforminsider/2016/06/06/microsoft-announces-major-commitment-to-apache-spark/
https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/
https://spark.apache.org/

And if you are really keen and want to understand more about DAG execution and the Dryad Microsoft Research project, there is a very detailed paper here: http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf