
Ust Oldfield's Blog

Azure Data Lake Store–Storage and Best Practices

The Azure Data Lake Store is an integral component for creating a data lake in Azure, as it is where data is physically stored in many implementations of a data lake. Under the hood, the Azure Data Lake Store is the web implementation of the Hadoop Distributed File System (HDFS), meaning that files are split up and distributed across an array of cheap storage.

 

This blog will go into the physical storage of files in the Azure Data Lake Store and then into best practices, which build on the data lake framework.

 

Azure Data Lake Store File Storage

As mentioned, the Azure Data Lake Store is the web implementation of HDFS. Each file you place into the store is split into 250MB chunks called extents, which enables parallel reads and writes. For availability and reliability, each extent is replicated into three copies. As files are split into extents, bigger files have more opportunities for parallelism than smaller files. If you have a file smaller than 250MB, it is going to be allocated to one extent and one vertex (the unit of work presented to Azure Data Lake Analytics), whereas a larger file will be split up across many extents and can be accessed by many vertices.
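To make that concrete, here is a minimal PowerShell sketch that lands a single large file in the store; the account name and file paths are placeholders, and it assumes the AzureRM module and an existing store. The splitting into extents and the three-way replication happen behind the scenes.

$dataLakeStoreName = "pleasereplaceme"
$localFile = "C:\Data\SalesHistory.csv"
$destination = "/Raw/SalesHistory.csv"

# Upload one large file - the store splits it into 250MB extents and
# replicates each extent three times, ready for parallel reads
Import-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path $localFile -Destination $destination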

 

The format of the file has a huge implication for storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelisable, as data does not span extents. Non-splittable formats, however – files which are not row oriented, where data is often delivered in blocks, such as XML or JSON – cannot be parallelised, as data spans extents and can only be processed by a single vertex.
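As a rough illustration of working around a non-splittable format, the PowerShell sketch below flattens JSON into a row-oriented CSV before it is landed, so the resulting file can be spread across extents. The file names are hypothetical, and it assumes the JSON is an array of fairly flat objects.

# Hypothetical local paths - adjust to your own landing process
$jsonPath = "C:\Data\Orders.json"
$csvPath  = "C:\Data\Orders.csv"

# Flatten the JSON documents into rows and write them out as CSV,
# which can then be split across extents and read by many vertices
(Get-Content $jsonPath -Raw | ConvertFrom-Json) | Export-Csv $csvPath -NoTypeInformation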

 

In addition to the storage of unstructured data, the Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the Data Lake Store, but the metadata is contained in Azure Data Lake Analytics. For many, working with structured data in the data lake is very similar to working with SQL databases.
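As an illustration of that structured side, a managed table with a clustered index and hash distribution can be created by submitting a U-SQL script to Data Lake Analytics, for instance from PowerShell. The account, database, table and column names below are all placeholders, not part of the framework.

$adlaAccountName = "pleasereplaceme"

# Hypothetical U-SQL: a managed table with a clustered index, distributed
# by hash - the data itself ends up under the Catalog folder in the store
$script = @'
CREATE DATABASE IF NOT EXISTS SalesDb;

CREATE TABLE IF NOT EXISTS SalesDb.dbo.Sales
(
    SaleId   int,
    SaleDate DateTime,
    Amount   decimal,
    INDEX idx_Sales CLUSTERED (SaleId ASC) DISTRIBUTED BY HASH (SaleId)
);
'@

Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccountName -Name "CreateSalesTable" -Script $script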

 

Azure Data Lake Store Best Practices

The best practices generally involve the framework as outlined in the following blog: http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework

The framework allows you to manage and maintain your data lake. So, when setting up your Azure Data Lake Store you will want to initially create the framework folders, such as Raw, Enriched and Curated, in your Root.

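A minimal PowerShell sketch of that initial setup is below, assuming the AzureRM module is installed and you have already run Login-AzureRmAccount; the account name is a placeholder and the folder list follows the framework post, so adjust it to your own implementation.

$dataLakeStoreName = "pleasereplaceme"

# Create the top-level framework folders in the Root of the store
foreach ($folder in "Raw", "Enriched", "Curated", "Laboratory")
{
    New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/$folder"
}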

Raw is where data is landed directly from source, and the underlying structure is ultimately organised by Source.


Within Raw, data is first categorised by Source Type, which reflects the ultimate source of the data and the level of trust one should associate with it.

Within the Source Type, data is further organised by Source System.


Within the Source System, the folders are organised by Entity and, if possible, further partitioned using the standard Azure Data Factory Partitioning Pattern of Year > Month > Day etc., as this will allow you to achieve partition elimination using file sets.
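As a rough sketch of what that partitioning buys you, a U-SQL file set over the Year/Month/Day folders exposes the partition values as virtual columns, and filtering on them means only the matching folders are read. The paths, entity and columns below are hypothetical, and the script is again submitted from PowerShell.

$adlaAccountName = "pleasereplaceme"

# Hypothetical U-SQL file set: {Year}, {Month} and {Day} become virtual
# columns, and the WHERE clause eliminates all other partitions
$script = @'
@sales =
    EXTRACT SaleId int,
            Amount decimal,
            Year   int,
            Month  int,
            Day    int
    FROM "/Raw/Internal/SalesSystem/Sales/{Year}/{Month}/{Day}/{*}.csv"
    USING Extractors.Csv();

@oneDay =
    SELECT SaleId, Amount
    FROM @sales
    WHERE Year == 2017 AND Month == 5 AND Day == 8;

OUTPUT @oneDay
TO "/Laboratory/Sales20170508.csv"
USING Outputters.Csv();
'@

Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccountName -Name "PartitionEliminationDemo" -Script $script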


The folder structure of Enriched and Curated is organised by Destination Data Model. Each Destination Data Model folder is then structured by Destination Entity. Enriched and Curated can exist either in the folder structure, within the database, or both.


Comments (8)

  • Tris

    4/11/2017 8:27:27 PM | Reply

    Great blog Ust, well explained! You may have explained this in another post, but what is the difference between curated and enriched - and what is laboratory? Going to try to catch up on your session when I get a spare hour or two.

    • UstOldfield

      4/13/2017 7:25:39 AM | Reply

      Hi Tristan, thanks for your comment. There shouldn't be a significant difference between Curated and Enriched as they should both be structured by Destination Model and Destination Entity. However, the contents and structure of the data will differ between the stages, as the data is supposed to be cleaned, validated and enriched when it transitions to Enriched. With Curated, the data should be in its final form, either for consumption by reporting solutions or for additional processing by other systems.

      With regards to the Laboratory, that will be explained in a separate post as it's only really relevant for Data Science within the Data Lake.

      Ust

      • Tris

        4/13/2017 7:44:11 AM | Reply

        Gotcha - thanks!

  • Matt

    4/13/2017 7:52:20 AM | Reply

    Really good blog Ust! Does the stream section of the framework adhere to similar rules and structures as the Batch section? e.g. broken up by source or destination? partitioned by a data hierarchy?

    • UstOldfield

      4/13/2017 8:00:10 AM | Reply

      Hi Matt,
      The stream structure will most likely be partitioned by date, which is how Stream Analytics outputs it. However, the overarching structure can be similar to Raw, in that it will be organised by Source Model and Entity. Because the data has most likely already been transformed, it won't need further sub-folders - like Batch does - but will form a data source for the batch process.

  • José Mendes

    5/8/2017 8:58:36 PM | Reply

    Another great blog Ust. Can you automate the folder structure, i.e., is there a way to automatically create new folders? If yes, how can you do that?
    E.g. Today you create three folders: 2017, 05, 08. Tomorrow you need 2017, 05, 09.

    • UstOldfield

      5/9/2017 7:50:39 AM | Reply

      Hi Jose,
      You can automate the folder structure using PowerShell. The script you'll want to use will look something like this:
      # Replace with the name of your Data Lake Store account
      $dataLakeStoreName = "pleasereplaceme"
      $myrootdir = "/"

      # Create the top-level Raw folder (repeat for the other framework folders)
      New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $myrootdir/Raw

      With the date partitions in Raw, when you land data using either Data Factory or Stream Analytics, the partition folders will be created automatically. Again, PowerShell will also work.
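      For example, something like this creates today's Year/Month/Day partition under an entity folder (the entity path is just a placeholder):

      $dataLakeStoreName = "pleasereplaceme"
      $entityPath = "/Raw/Internal/SalesSystem/Sales"

      # Build today's partition path, e.g. /Raw/Internal/SalesSystem/Sales/2017/05/09
      $partitionPath = "$entityPath/$(Get-Date -Format yyyy)/$(Get-Date -Format MM)/$(Get-Date -Format dd)"

      # -Folder creates the intermediate folders along the way as well
      New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $partitionPath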

