Ust Oldfield's Blog

Azure Data Lake Store – Storage and Best Practices

The Azure Data Lake Store is an integral component for creating a data lake in Azure, as it is where data is physically stored in many implementations of a data lake. Under the hood, the Azure Data Lake Store is the Web implementation of the Hadoop Distributed File System (HDFS), meaning that files are split up and distributed across an array of cheap storage.

 

This blog will go into the physical storage of files in the Azure Data Lake Store and then into best practices, which utilise the framework.

 

Azure Data Lake Store File Storage

As mentioned, the Azure Data Lake Store is the Web implementation of HDFS. Each file you place into the store is split into 250MB chunks called extents, which enables parallel reads and writes. For availability and reliability, each extent is replicated three times. Because files are split into extents, bigger files have more opportunities for parallelism than smaller files. A file smaller than 250MB will be allocated to a single extent and a single vertex (the unit of work presented to Azure Data Lake Analytics), whereas a larger file will be split across many extents and can be accessed by many vertices.
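As a rough, hedged illustration (the account name and file path below are placeholders, and the snippet assumes the AzureRM.DataLakeStore module with an authenticated session), you can read a file's size and estimate how many 250MB extents, and therefore how many potential vertices, it spans:

# Placeholder account name and file path - replace with your own
$dataLakeStoreName = "pleasereplaceme"
$filePath = "/Raw/ExampleSource/ExampleEntity/LargeFile.csv"

# Get-AzureRmDataLakeStoreItem exposes the file's size in bytes via the Length property
$file = Get-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path $filePath

# Estimate how many 250MB extents the file occupies
$approxExtents = [int][math]::Ceiling($file.Length / 250MB)
"{0:N0} bytes across roughly {1} extent(s)" -f $file.Length, $approxExtents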

 

The format of the file has a huge implication for storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelisable, as the data does not span extents. Non-splittable formats – files which are not row oriented and where the data is often delivered in blocks, such as XML or JSON – cannot be parallelised, as the data spans extents and can only be processed by a single vertex.

 

In addition to the storage of unstructured data, Azure Data Lake Store also stores structured data in the form of row-oriented, distributed clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, but the metadata is contained in the data lake analytics. For many, working with the structured data in the data lake is very similar to working with SQL databases.
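If you want a quick peek at that, the following is a small, hedged example (assuming the AzureRM.DataLakeStore module and an authenticated session; the account name is a placeholder and the exact catalog folder name may differ by case in your store):

# List the contents of the catalog folder that holds the structured data
Get-AzureRmDataLakeStoreChildItem -AccountName "pleasereplaceme" -Path "/catalog"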

 

Azure Data Lake Store Best Practices

The best practices generally involve the framework as outlined in the following blog: http://blogs.adatis.co.uk/ustoldfield/post/Shaping-The-Lake-Data-Lake-Framework

The framework allows you to manage and maintain your data lake. So, when setting up your Azure Data Lake Store, you will want to initially create the following folders in your Root:

[Image: Root folder structure]
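If you prefer to script that initial setup, a minimal sketch of creating the Raw, Enriched and Curated folders described below might look like this (it assumes the AzureRM.DataLakeStore module and an authenticated session; the account name is a placeholder):

# Placeholder account name - replace with your own
$dataLakeStoreName = "pleasereplaceme"

# Create the top-level framework folders in the Root
"Raw", "Enriched", "Curated" | ForEach-Object {
    New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/$_"
}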

Raw is where data is landed directly from source, and the underlying structure is ultimately organised by Source.

[Image: Raw folder structure organised by Source Type]

Within Raw, data is first categorised by Source Type, which reflects the ultimate source of the data and the level of trust one should associate with it.

Within the Source Type, data is further organised by Source System.

[Image: Source System folders within a Source Type]

Within the Source System, the folders are organised by Entity and, if possible, further partitioned using the standard Azure Data Factory Partitioning Pattern of Year > Month > Day etc., as this will allow you to achieve partition elimination using file sets.

[Image: Entity folders partitioned by Year, Month and Day]

The folder structure of Enriched and Curated is organised by Destination Data Model. Each Destination Data Model folder is then structured by Destination Entity. Enriched and Curated data can sit in the folder structure and / or within the database.

[Image: Enriched and Curated folder structure]
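Putting the structure above together, the layout could be scripted as in the following hypothetical example. The source type, source system, entity and model names are made up for illustration, the same AzureRM.DataLakeStore assumptions apply, and creating a nested folder also creates any missing parent folders along the path.

# Placeholder account name - replace with your own
$dataLakeStoreName = "pleasereplaceme"

# Raw: Source Type > Source System > Entity > Year > Month > Day
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Raw/InternalSources/CRM/Customer/2017/05/08"

# Enriched and Curated: Destination Data Model > Destination Entity
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Enriched/SalesModel/Customer"
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Curated/SalesModel/Customer"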

Comments (13)

  • Tris

    4/11/2017 8:27:27 PM

    Great blog Ust, well explained! You may have explained this in another post, but what is the difference between curated and enriched - and what is laboratory? Going to try to catch up on your session when I get a spare hour or two.

    • UstOldfield

      4/13/2017 7:25:39 AM

      Hi Tristan, thanks for your comment. There shouldn't be a significant difference between Curated and Enriched as they should both be structured by Destination Model and Destination Entity. However, the contents and structure of the data will differ between the stages as the data is supposed to be cleaned, validated and enriched when it transitions to Enriched. With Curated, the data should be in its final form, either for consumption by reporting solutions or for additional processing by other systems.

      With regards to the Laboratory, that will be explained in a separate post as it's only really relevant for Data Science within the Data Lake.

      Ust

      • Tris

        4/13/2017 7:44:11 AM

        Gotcha - thanks!

  • Matt

    4/13/2017 7:52:20 AM

    Really good blog Ust! Does the stream section of the framework adhere to similar rules and structures as the Batch section? e.g. broken up by source or destination? partitioned by a data hierarchy?

    • UstOldfield

      4/13/2017 8:00:10 AM

      Hi Matt,
      The stream structure will most likely be partitioned by date, which is the output from Stream Analytics. However, the overarching structure can be similar to RAW, in that it will be organised by Source Model and Entity. Because the data has most likely already been transformed, it won't need further sub-folders - like Batch - but will form a data source for the batch process.

  • José Mendes

    5/8/2017 8:58:36 PM

    Another great blog Ust. Can you automate the folder structure, i.e., is there a way to automatically create new folders? If yes, how can you do that?
    E.g. Today you create three folders: 2017, 05, 08. Tomorrow you need 2017, 05, 09.

    • UstOldfield

      5/9/2017 7:50:39 AM

      Hi Jose,
      You can automate the folder structure using PowerShell. The script you'll want to use will look something like this:
      $dataLakeStoreName = "pleasereplaceme"
      $myrootdir = "/"

      # Creates a Raw folder at the root of the store (assumes the AzureRM.DataLakeStore module and an authenticated session)
      New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $myrootdir/Raw

      With the date partitions in Raw, when you land data using either Data Factory or Stream Analytics it'll create the partitions automatically. Again, PowerShell will also work.
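      For example, a rough sketch of creating today's Year/Month/Day partition folders under an entity in Raw (the account name and entity path below are placeholders, and any missing parent folders are created along the way):

      $dataLakeStoreName = "pleasereplaceme"
      $entityPath = "/Raw/InternalSources/CRM/Customer"

      # Build today's yyyy/MM/dd partition path and create it
      $datePath = "{0:yyyy}/{0:MM}/{0:dd}" -f (Get-Date)
      New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "$entityPath/$datePath"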

  • Thulasi

    9/25/2017 4:24:48 PM

    Hi Ust,

    Please share your expertise. I'm looking for a PowerShell script to check the Data Lake Store usage at the folder level.
    I was wondering why this is not displayed in the Azure portal, as the size column is there without details.

    Thank You,
    Thulasi

    • UstOldfield

      12/5/2017 8:58:21 PM

      Hi Thulasi,
      I'm afraid I've not used PowerShell much with the data lake store. But my approach would be to iterate over the folders, summing up the size of each child object for each folder.
      I don't fully know why size information at a folder level isn't displayed in the portal, but my understanding is that it's because folders are only logical constructs within a URL path, rather than a physical collection of files. However, I am happy to be corrected on that assumption.
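      As a rough sketch of that approach (assuming the AzureRM.DataLakeStore module and an authenticated session; the account name and starting path are placeholders), something like this should work:

      function Get-AdlsFolderSize {
          param (
              [string]$AccountName,
              [string]$Path
          )

          $total = 0

          # Walk the immediate children of the folder
          foreach ($child in Get-AzureRmDataLakeStoreChildItem -AccountName $AccountName -Path $Path) {
              $childPath = ("$Path/$($child.Name)").Replace("//", "/")

              if ($child.Type -eq "Directory") {
                  # Recurse into sub-folders
                  $total += Get-AdlsFolderSize -AccountName $AccountName -Path $childPath
              }
              else {
                  # Files report their size in bytes via the Length property
                  $total += $child.Length
              }
          }

          return $total
      }

      # Example usage: size of the Raw folder in GB
      "{0:N2} GB" -f ((Get-AdlsFolderSize -AccountName "pleasereplaceme" -Path "/Raw") / 1GB)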

  • Paul Andrew

    12/5/2017 3:13:07 PM

    Hey Ust, great post. Since I saw your USQL talk at PASS in November something has been bugging me... You say files get split into chunks of 250MB called extents. But according to this white paper (https://dl.acm.org/citation.cfm?id=3056100) extents are only 4MB in size. What is the source of your understanding for this please so I can do some further reading? Thanks

    • UstOldfield

      12/5/2017 8:47:45 PM

      Hi Paul,
      Thanks for sharing. It's an interesting article! My source has been Michael Rys and Saveen Reddy for the 250MB claim (to be accurate, this should be 256MB), and this article would back up that claim: docs.microsoft.com/.../data-lake-store-performance-tuning-guidance

      For extents to be 4MB would strike me as odd and low, considering the default for HDFS is 128MB... If it truly is 4MB, then my material is out of date.

      I will definitely want to explore this further, a) to further my own understanding and b) get some clarity on extents. Thanks again for sharing!

      • Paul Andrew

        12/6/2017 8:29:52 AM

        Hey Ust, yes my thoughts exactly. Time to tweet Mike I think.
        Cheers

Pingbacks and trackbacks (4)
