Ust

Ust Oldfield's Blog

Testing the Waters: An Overview of Data Science using Azure Data Lakes

Data Science can fit seamlessly within the ecosystem of the data lake, whether this is through HDInsight or the extensibility of Azure Data Lake Analytics and U-SQL. This blog will give a brief overview of what Data Science is; how to link Data Science toolkits to the Azure Data Lake; and best practices for managing the data output from experiments.

 

Data Science

Data Science is the relatively new kid on the block. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modelling, statistics, analytics, and mathematics.
At its core, data science involves using automated methods to analyse massive amounts of data and to extract insight from them. Data science is helping to create new branches of science, and influencing areas of social science and the humanities. The trend is expected to accelerate in the coming years as data from mobile sensors, sophisticated instruments, the web, and more, grows.

 

Data Science In The Data Lake

The nature of the Azure Data Lake Store lends itself to Data Science in that it can hold any data, which the data scientist will want to access, transform and analyse.

 

HDInsight contains many implementations for data science, such as Spark, R Server and others. Hooking HDInsight to Azure Data Lake Store is pretty simple and follows these steps:

In the Azure Portal Marketplace, select HDInsight which will bring up a series of blades. In this blog, I will be using Spark as my cluster type on HDInsight.

image

In the storage settings you can then link your HDInsight cluster to Azure Data Lake Store

image

Confirm your configuration on the next blade and wait around 20 minutes for your cluster to deploy and you’re good to go!

 

With Azure Data Lake Analytics, you incorporate data science by extending the capabilities of U-SQL and you do this by installing a series of files.

Open up your Azure Data Lake Analytics account and click on Sample Scripts at the top. This will bring forward the following blade

image

From there you’ll want to click on the U-SQL Advanced Analytics tab, which will copy about 1.5GB of files to the default Azure Data Lake Store associated to your ADLA account. This will take about 3 minutes to complete.

When it’s finished copying the files it will then call a job to register the extension, which can be found in the assemblies folder of the master database.

image

More resources about extensibility of U-SQL can be found here:

https://blogs.msdn.microsoft.com/azuredatalake/2017/03/10/using-custom-python-libraries-with-u-sql/

https://blogs.msdn.microsoft.com/azuredatalake/2016/11/22/u-sql-advanced-analytics-introducing-python-extensions-for-u-sql/

The Laboratory

Within Azure Data Lake Store, Folder and File Management is incredibly important for a well running data lake. See my blogs on Storage and Best Practices and Shaping The Lake for more information on how to set up your Azure Data Lake Store.

The Laboratory is an area to be exclusively used by a data scientist. It’s an area where they can persist the results of experiments and data sets without impacting the day-to-day operations within the data lake or other data scientists. The laboratory is organised in to two broad area: Desks and Exhibits.

Desks contain personal workspaces, the contents of which can be organised however the person wishes. It can be as well organised, or disorganised, as the person themselves. The Exhibit contains data sources produced in the Laboratory which are ready to be consumed by other users or systems. Both of which are laid out below.

image

As always, if you have any feedback or comments do let me know!

Shaping The Lake: Data Lake Framework

The Azure Data Lake has just gone into general availability and the management of Azure Data Lake Store, in particular, can seem daunting especially when dealing with big data. In this blog, I will take you through the risks and challenges of working with data lakes and big data. Then I will take you through a framework we’ve created to help best manage these risks and challenges.

If you need a refresh as to what a data lake is and how to create your first Azure Data Lake Store and your first Azure Data Lake Analytics job, please feel free to follow the links.

Risks and Challenges of Big Data and Data Lake

The challenges posed by big data are as follow:

  • Volume – is the sheer amount of data becoming unmanageable?
  • Variety – Structured tables? Semi-structured JSON? Completely unstructured text dumps? We can normally manage with systems that contain just one of these, but if we’re dealing with a huge mix, it gets very tricky
  • Velocity – How fast is the data coming in? And how fast do we need to get it to the people who need it?
  • Veracity - How do we maintain accuracy, veracity, when the data is of varying volumes, the sources and structures are different and the speed in which they arrive in the Lake are of differing velocities?

Managing all four simultaneously is where the challenges begin.

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

There is also data drowning – as the volume of the data tends towards the massive and the velocity only increases over time we are going to see more and more information available via the lake. When it gets to that point, if the lake is not well managed, then users are going to struggle to find what they’re after. The data may all be entirely relevant and accurate, but if users cannot find what they need then there is no value in the lake itself. Essentially, data drowning is when the amount of data is so vast you lose the ability to find what’s in there.

If you ignore these challenges, treat the lake like a dumping ground, you will have contaminated your lake and it will no longer be fit for purpose.

If no one uses the Data Lake, it’s a pointless endeavour and not worth maintaining.

Everyone needs to be working together to ensure the lake stays clean, managed and good for a data dive!

Those are the risks and challenges we face with Azure Data Lake. But how do we manage it?

The Framework

Data Lake

 

We’ve carved the lake up into different sections. The key point is that the lake contains all sorts of different data – some that’s sanitised and ready to consume by the business user, some that’s indecipherable raw data that needs careful analysis before it is of use. By ensuring data are carefully managed you can instantly understand the level of preparation that data has undergone.

Data flows from left to right – the further left areas represent where data has been input directly from source systems. The horizontal sections describe the level of preparation – Manual, Stream and Batch.

Manual – aka, the laboratory. Here data is manually prepared with ad-hoc scripts.

Stream – The data here flows in semi-real time, coming from event hubs and being landed after processing through stream-specific tools such as Streaming Analytics. Once landed, there is no further data processing – the lake is essentially a batch processing tool.

Batch – This is more traditional data processing, the kind of “ETL” seen by many BI developers. We have a landing area for our raw data, a transitional area where data is cleaned, validated, enriched and augmented with additional sources and calculation, before finally being placed in a curated area where it is ready for consumption by the business.

We’re taking the blank-canvas of the Data Lake Store and applying a folder structure, a file management process and a curation process over the top.

The folder structure itself can be as detailed as you like, we follow a specific structure ourselves:

Data Lake 2

 

The Raw data area, the landing place for any files entering the lake, has sub-folders for each source of data. This allows for the easy browsing of the data sources within the Lake and ensures we are not receiving the same data twice, even if we use it within different systems.

The Enriched and Curated layers however, have a specific purpose in mind. We don’t take data and enrich/clean/process it without a business driver, it’s not something we do for fun. We can therefore assign a project or system name to it, at this point it is organised into these end-systems. This means we can view the same structure within Enriched as within Curated.

Essentially Raw data is categorised by Source whilst Enriched and Curated data is categorised by Destination.

There’s nothing complicated about the Framework we’ve created or the processes we’ve ascribed to it, but it’s incredibly important that everyone is educated on the intent of it and the general purpose of the data lake. If one user doesn’t follow process when adding data, or an ETL developer doesn’t clean up test files, the system starts to fall apart and we succumb to the challenges we discussed at the start.

To summarise, structure in your Azure Data Lake Store is key to maintaining order:

• You need to enforce and maintain folder structure.

• Remember that structure is necessary whether using unstructured data or tables & SQL

• Bear in mind that schema on read applies temporary structure – but if you don’t know what you’re looking at, this is going to be very hard to do!