Ust

Ust Oldfield's Blog

Shaping The Lake: Data Lake Framework

The Azure Data Lake has just gone into general availability and the management of Azure Data Lake Store, in particular, can seem daunting especially when dealing with big data. In this blog, I will take you through the risks and challenges of working with data lakes and big data. Then I will take you through a framework we’ve created to help best manage these risks and challenges.

If you need a refresh as to what a data lake is and how to create your first Azure Data Lake Store and your first Azure Data Lake Analytics job, please feel free to follow the links.

Risks and Challenges of Big Data and Data Lake

The challenges posed by big data are as follow:

  • Volume – is the sheer amount of data becoming unmanageable?
  • Variety – Structured tables? Semi-structured JSON? Completely unstructured text dumps? We can normally manage with systems that contain just one of these, but if we’re dealing with a huge mix, it gets very tricky
  • Velocity – How fast is the data coming in? And how fast do we need to get it to the people who need it?
  • Veracity - How do we maintain accuracy, veracity, when the data is of varying volumes, the sources and structures are different and the speed in which they arrive in the Lake are of differing velocities?

Managing all four simultaneously is where the challenges begin.

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sale pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

There is also data drowning – as the volume of the data tends towards the massive and the velocity only increases over time we are going to see more and more information available via the lake. When it gets to that point, if the lake is not well managed, then users are going to struggle to find what they’re after. The data may all be entirely relevant and accurate, but if users cannot find what they need then there is no value in the lake itself. Essentially, data drowning is when the amount of data is so vast you lose the ability to find what’s in there.

If you ignore these challenges, treat the lake like a dumping ground, you will have contaminated your lake and it will no longer be fit for purpose.

If no one uses the Data Lake, it’s a pointless endeavour and not worth maintaining.

Everyone needs to be working together to ensure the lake stays clean, managed and good for a data dive!

Those are the risks and challenges we face with Azure Data Lake. But how do we manage it?

The Framework

Data Lake

 

We’ve carved the lake up into different sections. The key point is that the lake contains all sorts of different data – some that’s sanitised and ready to consume by the business user, some that’s indecipherable raw data that needs careful analysis before it is of use. By ensuring data are carefully managed you can instantly understand the level of preparation that data has undergone.

Data flows from left to right – the further left areas represent where data has been input directly from source systems. The horizontal sections describe the level of preparation – Manual, Stream and Batch.

Manual – aka, the laboratory. Here data is manually prepared with ad-hoc scripts.

Stream – The data here flows in semi-real time, coming from event hubs and being landed after processing through stream-specific tools such as Streaming Analytics. Once landed, there is no further data processing – the lake is essentially a batch processing tool.

Batch – This is more traditional data processing, the kind of “ETL” seen by many BI developers. We have a landing area for our raw data, a transitional area where data is cleaned, validated, enriched and augmented with additional sources and calculation, before finally being placed in a curated area where it is ready for consumption by the business.

We’re taking the blank-canvas of the Data Lake Store and applying a folder structure, a file management process and a curation process over the top.

The folder structure itself can be as detailed as you like, we follow a specific structure ourselves:

Data Lake 2

 

The Raw data area, the landing place for any files entering the lake, has sub-folders for each source of data. This allows for the easy browsing of the data sources within the Lake and ensures we are not receiving the same data twice, even if we use it within different systems.

The Enriched and Curated layers however, have a specific purpose in mind. We don’t take data and enrich/clean/process it without a business driver, it’s not something we do for fun. We can therefore assign a project or system name to it, at this point it is organised into these end-systems. This means we can view the same structure within Enriched as within Curated.

Essentially Raw data is categorised by Source whilst Enriched and Curated data is categorised by Destination.

There’s nothing complicated about the Framework we’ve created or the processes we’ve ascribed to it, but it’s incredibly important that everyone is educated on the intent of it and the general purpose of the data lake. If one user doesn’t follow process when adding data, or an ETL developer doesn’t clean up test files, the system starts to fall apart and we succumb to the challenges we discussed at the start.

To summarise, structure in your Azure Data Lake Store is key to maintaining order:

• You need to enforce and maintain folder structure.

• Remember that structure is necessary whether using unstructured data or tables & SQL

• Bear in mind that schema on read applies temporary structure – but if you don’t know what you’re looking at, this is going to be very hard to do!

SQL PASS Summit–Day 3 and Reflections

Apologies for the delay in getting this blog out to you all. When PASS finished on the Friday we had to rush over to the airport to get our flight back to the UK. When I landed on Saturday I was suffering from jet lag and only now am I in a fit state to blog again.

 

I got the impression from the schedule that Day 3 of PASS was going to be a wind-down day as very few of the sessions seemed as intense as the previous days’ sessions.

My first session of the day, despite being the last day of PASS, was early. Earlier than any of the keynotes, but worth getting up for – a Chalk Talk with the Data Warehouse Fast Track Team. This also included the Azure Data Warehouse team as well, and the conversation was much more focused on the Azure side of Data Warehousing. Lots of conversations around Polybase and patterns in how to get data from on-prem to cloud using Polybase. In terms of patterns, it was reassuring to learn that the approach Adatis has adopted is spot on. Simon Whiteley is the man to see about that. His blog is here: http://blogs.adatis.co.uk/simonwhiteley/

On the Fast Track theme, my next session was  exploring the SQL Server Fast Track Data Warehouse, which was interesting to know about, especially the various testing that these pre-configured servers go through. At some point next year, Microsoft will be releasing the Fast Track Testing Programme to the community so that everyone will be able to test their hardware to the same exacting standards and know what their maximum throughput / IO demand etc., is in order to properly gauge hardware performance.

After this session I got talking to a few people about data warehousing. The conversation was so engrossing that I missed the session that I was due to attend. Luckily, most of the sessions at PASS are recorded so I will have to chase up that session and others when they get released.

 

My final session of the day was a Deep Dive of SQL SSIS 2016. It wasn’t so much a deep dive and more a run-down of upcoming features. The one I’m most excited about is the Azure Data Lake Store connector, which will be released once Azure Data Lake goes into General Availability, which I’ve been told is soon…..

 

Now that I’ve had to week to digest and reflect on SQL PASS Summit, my findings are thus:

SQL PASS Summit is invaluable.

It provides an opportunity to learn so much from so many people, and not just learn from the presenters. There are so many people from all over the SQL community, with different experiences of SQL, different experiences of data, different experiences of life, that you can’t not learn something. PASS provides the easy environment to share ideas among peers and learn new technologies, new ways of working and new tricks.

I’ve already started sharing some of my learning's with colleagues and I can’t wait to share them with everyone else too!

SQL PASS Summit–Day 2

Day 2, Thursday, started off with a keynote from David DeWitt on cloud data warehousing, scalable storage and scalable compute. This set my theme for the majority of the day – which turned out to be big data tech.

 

My first session was with James Rowland-Jones and Kevin Ngo on sizing Azure SQL Data Warehouse for proposals – essentially answering “how much is this going to cost me?”. There are various factors to consider, which I will blog on separately. I’ve already briefly fed back to members of the team and they’re excited to know what I learnt in more detail.

 

My second session was about best practices for Big BI which, unfortunately, ended up being a sales pitch and I came away having felt that I’ve didn’t learn anything. There’s a lot of promise for BI in the big data space, so watch this space as we explore Azure SQL Data Warehouse, Azure Data Lake (Store and Analytics), and other big data technology for BI.

 

The third session was with Michael Rys on Tuning and Optimising U-SQL Queries for Maximum Performance. It was a full on session, learnt loads and took loads of notes. I need time to digest this information as Michael covered off a very complex topic, very quickly. I will, however, be blogging on it in due course.

 

After an intense third session, I chose a less intense session for the last session of the day: a Q&A with the SQL Engineering team. This was a great opportunity to learn from other users how they’re using SQL. Most users who asked questions were wanting to know about indexing, backups and High Availability.

 

Tonight – packing, and networking before the last day of PASS tomorrow!

SQL PASS Summit–Day 1

Day 1, Wednesday, technically started on Tuesday with a newbies speed networking event in which we had to rotate through a crowd of 10 other people - introducing ourselves and asking questions about our professional lives. This was awkward to begin with but, as the evening wore on, introducing ourselves to strangers became a lot easier and more normal. We then moved on to the Welcome Reception and then a #SQLKaraoke event. Great opportunities to meet new people from different areas of the world and parts of the community.

Wednesday morning proper, began with a keynote from Joseph Sirosh. This keynote from Joseph essentially set the tone and theme for a large part of the conference sessions - Azure, Big Data and the Cortana Intelligence Suite.

The first session I attended was on Design Patterns for Azure SQL Database (for which a separate blog will be forthcoming).

The next session I attended was about incorporating Azure Data Lake Analytics into a BI environment (again, another blog is in the pipeline).

My final session of the day was Going Under the Hood with Azure Data Lake. This was the most insightful session of the day, which has subsequently sparked my brain into Data Lake mode (expect many blogs on this), and went through how Azure Data Lake works as well as how the U-SQL language works and resources are allocated.

Tonight - more networking.

So far, the community has been so welcoming and I’m very much looking forward to tomorrow where I’ll be learning about Big Data solutions and best practices. I’m also looking forward to sharing all my experiences and learning's with my colleagues and wider SQL Community.

Introduction to Data Lakes

Data Lakes are the new hot topic in the big data and BI communities. Data Lakes have been around for a few years now, but have only gained popular notice within the last year. In this blog I will take you through the concept of a Data Lake, so that you can begin your own voyage on the lakes.

What is a Data Lake?

Before we can answer this question, it's worth reflecting on a concept which most of us know and love - Data Warehouses. A Data Warehouse is a form of data architecture. The core principal of a Data Warehouse isn't the database, it's the data architecture which the database and tools implement. Conceptually, the condensed and isolated features of a Data Warehouse are around:

1.     Data acquisition

2.     Data management

3.     Data delivery / access

A Data Lake is similar to a Data Warehouse in these regards. It is an architecture. The technology which underpins a Data Lake enables the architecture of the lake to flow and develop. Conceptually, the architecture of a Data Lake wants to acquire data, it needs careful, yet agile management, and the results of any exploration of the data should be made accessible. The two architectures can be used together, but conceptually the similarities end here.

 

Conceptually, Data Lakes and Data Warehouses are broadly similar yet the approaches are vastly different. So let's leave Data Warehousing here and dive deeper into Data Lakes.

 

Fundamentally, a Data Lake is just not a repository. It is a series of containers which capture, manage and explore any form of raw data at scale, enabled by low cost technologies, from which multiple downstream applications can access valuable insight which was previously inaccessible.

 

How Do Data Lakes Work?

 

Conceptually, a Data Lake is similar to a real lake - water flows in, fills up the reservoir and flows out again. The incoming flow represents multiple raw data formats, ranging from emails, sensor data, spreadsheets, relational data, social media content, etc. The reservoir represents the store of the raw data, where analytics can be run on all or some of the data. The outflow is the analysed data, which is made accessible to users.

 

To break it down, most Data Lake architectures come as two parts. Firstly, there is a large distributed storage engine with very few rules/limitations. This provides a repository for data of any size and shape. It can hold a mixture of relational data structures, semi-structured flat files and completely unstructured data dumps. The fundamental point is that it can store any type of data you may need to analyse. The data is spread across a distributed array of cheap storage that can be accessed independently.

 

There is then a scalable compute layer, designed to take a traditional SQL-style query and break it into small parts that can then be run massively in parallel because of the distributed nature of the disks.

 

In essence – we are overcoming the limitations of traditional querying by:

·       Separating compute so it can scale independently

·       Parallelizing storage to reduce impact of I/O bottlenecks

 

 

There are various technologies and design patterns which form the basis of Data Lakes. In terms of technologies these include:

·        Azure Data Lake

·        Cassandra

·        Hadoop

·        S3

·        Teradata

With regards to design patterns, these will be explored in due course. However, before we get there, there are some challenges which you must be made aware of. These challenges are:

1.     Data dumping - It's very easy to treat a data lake as a dumping ground for anything and everything. This will essentially create a data swamp, which no one will want to go into.

2.     Data drowning - the volume of the data could be massive and the velocity very fast. There is a real risk of drowning by not fully knowing what data you have in your lake.

These challenges require good design and governance, which will be covered off in the near future.

 

Hopefully this has given you a brief, yet comprehensive high-level overview of what data lakes are. We will be focusing on Azure Data Lake, which is a management implementation of the Hadoop architectures. Further reading on Azure Data Lake can be found below.

 

Further Reading

 

In order to know more about Data Lakes the following resources are invaluable.

Getting Started With Azure Data Lake Store

Getting Started With Azure Data Lake Analytics and U-SQL

Azure Data Lake Overview

 

 

R - What Is It?

I have blogged often on the subject of R, but have not previously addressed what R is and why you should use it. In the blog post, I will set out what R is, why you should use R and how you can learn R.

What Is R?

R is a powerful tool for statistical programming statistics and graphics. There are lots of software available which can do all of these things: spreadsheet applications like Excel; point-and-click applications like SPSS; data mining applications like SSAS; and so on. But what sets R apart from applications like those listed?

R is a free and open source application. Because it is free you don't have to worry about subscription fees, usage caps or licence managers. Just as importantly, R is open. You can inspect the source code and tinker with it as much as you want.

Leading academics and researchers use R to develop latest methods in statistics, machine learning and predictive modelling. These methods are stored in packages which can be accessed by anyone for free! There are thousands of packages available to download and use.

R is an interactive language. In R you do analysis by writing functions and scripts, not by pointing and clicking. As an interactive language (as opposed to a data-in-data-out black box), R promotes experimentation and exploration, which improves data analysis and sometimes leads to discoveries that would not have been made otherwise. Scripts document all your work, from data access to reporting, which can be re-run at any time.This makes it easier to update results when the data changes. Scripts also make it easy to automate a sequence of taks that can be integrated into other processes, such as an ETL.

One of the design principles of R was that visualtision of data through charts and graphs is an essential part of data analysis. As a result, it has excellent tools for creating graphics, from staples like bar charts to brand new graphics of your own design.

With R you are not restricted to choosing a rigid set of routines and procedures. YOu can use code and packages contributed by others in the community, or extend R with your own functions and packages. R is also excellent for mash-ups with other applications. For example you can build it into your SSIS routine, or take advantage of the new integration in SQL Server 2016.

We've covered, briefly, what R is. But why do you want to use it?

Why Use R?

There is a vibrant community built around R. With thousands of contributors and millions of users around the worldif you have a question about R chances are someone has answered it, or can answer it.

It is quickly becoming an integral part of the Microsoft BI stack. Since Microsoft's acquisition of Revolution Analytics, R has been featuring in the more recent releases in the Microsoft BI world. From Power BI to SQL Server, Visual Studio to Azure ML; R is becoming an integral component of the BI stack.

Where can you learn R?

There are various books and online courses on R which can be used to quickly skill you up in this powerful language. These are some which I recommend:

EdX: Introduction to R for Data Science

Book: R Cookbook by Paul Teetor

Book: R Programming for Data Science by Roger Peng

Here, at Adatis, we run internal R training courses which mean that all of our employees have the opportunity to learn from internal subject matter experts and improve their knowledge and skills.

You can also try out the R tutor from Revolution Analytics, which is a package built for R.









Deploying a Hybrid Cloud

Operations Management Suite (OMS) is an Azure based tool that helps manage your entire IT infrastructure, whether on premise or in the cloud. OMS allows you to monitor the machines you have in your infrastructure and provides a bridge for a hybrid cloud solution, by moving multi-tier workloads into Azure, or run tests on a copy of production workloads in Azure, as well as storing critical data in Azure.

Configuring and deploying OMS is very quick and relatively straightforward. This post will deal with creating a hybrid cloud solution using OMS and configuring the solution so you can make best use of resources.

Within the Azure management portal you will have to create a new Operational Insights workspace as detailed below:


Once the Operational Insights has been created, you can navigate to it in the Azure Portal and click “Manage” which will bring up the OMS itself. The start page should look like this:


And you want to click on the “Get Started” button to begin creating your hybrid cloud solution. You then want to add various solutions to the suite. For this demonstration we’re going to accept the default and have all the solutions.

 

The next step is to connect a data source, which can be an on premise machine, a virtual machine or an Azure data storage. For this demonstration we are only interested in connecting to an on premise machine and a VM.