Ust Oldfield's Blog

Azure Data Lake Store–Storage and Best Practices

The Azure Data Lake Store is an integral component for creating a data lake in Azure, as it is where data is physically stored in many implementations of a data lake. Under the hood, the Azure Data Lake Store is the Web implementation of the Hadoop Distributed File System (HDFS), meaning that files are split up and distributed across an array of cheap storage.


This blog will go into the physical storage of files in the Azure Data Lake Store, and then best practices, which utilise the data lake framework.


Azure Data Lake Store File Storage

As mentioned, the Azure Data Lake Store is the Web implementation of HDFS. Each file you place into the store is split into 250MB chunks called extents, which enables parallel reads and writes. For availability and reliability, each extent is replicated into three copies. Because files are split into extents, bigger files have more opportunities for parallelism than smaller files. A file smaller than 250MB will be allocated to one extent and one vertex (the unit of work presented to Azure Data Lake Analytics), whereas a larger file will be split across many extents and can be accessed by many vertices.


The format of the file has a huge implication for storage and parallelisation. Splittable formats – files which are row oriented, such as CSV – are parallelisable, as data does not span extents. Non-splittable formats – files which are not row oriented, where data is often delivered in blocks, such as XML or JSON – cannot be parallelised, as data spans extents and can only be processed by a single vertex.


In addition to storing unstructured data, the Azure Data Lake Store also stores structured data in the form of row-oriented, distributed, clustered index storage, which can also be partitioned. The data itself is held within the “Catalog” folder of the data lake store, while the metadata is held in the data lake analytics. For many, working with structured data in the data lake is very similar to working with SQL databases.


Azure Data Lake Store Best Practices

The best practices generally involve the framework as outlined in the following blog:

The framework allows you to manage and maintain your data lake. So, when setting up your Azure Data Lake Store, you will want to initially create the following folders in your Root:


Raw is where data lands directly from source, and its underlying structure is ultimately organised by Source.


Source is categorised by Source Type, which reflects the ultimate source of data and the level of trust one should associate with the data.

Within the Source Type, data is further organised by Source System.


Within the Source System, the folders are organised by Entity and, if possible, further partitioned using the standard Azure Data Factory Partitioning Pattern of Year > Month > Day etc., as this will allow you to achieve partition elimination using file sets.


The folder structure of Enriched and Curated is organised by Destination Data Model. Each Destination Data Model folder is then structured by Destination Entity. Enriched and Curated can live in the folder structure and / or within the database.
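
To make this concrete, here is a rough PowerShell sketch (using the AzureRM cmdlets; the account name and the Source / Destination names are purely placeholders) that creates the skeleton of this structure:

$dataLakeStoreName = "pleasereplaceme"

# Top-level zones of the lake
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Raw"
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Enriched"
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Curated"

# Raw: Source Type > Source System > Entity
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Raw/Internal/CRM/Customer"

# Enriched and Curated: Destination Data Model > Destination Entity
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Enriched/Sales/Customer"
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Curated/Sales/Customer"

The date partitions underneath the Entity folders do not need to be created up front – as discussed in the comments below, Data Factory or Stream Analytics will create them as data lands.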


Comments (19) -

  • Tris

    4/11/2017 8:27:27 PM | Reply

    Great blog Ust, well explained! You may have explained this in another post, but what is the difference between Curated and Enriched - and what is Laboratory? Going to try to catch up on your session when I get a spare hour or two.

    • UstOldfield

      4/13/2017 7:25:39 AM | Reply

      Hi Tristan, thanks for your comment. There shouldn't be a significant difference between Curated and Enriched as they should both be structured by Destination Model and Destination Entity. However, the contents and structure of the data will differ between the stages, as the data is supposed to be cleaned, validated and enriched when it transitions to Enriched. With Curated, the data should be in its final form, either for consumption by reporting solutions or for additional processing by other systems.

      With regards to the Laboratory, that will be explained in a separate post as it's only really relevant for Data Science within the Data Lake.


      • Tris

        4/13/2017 7:44:11 AM | Reply

        Gotcha - thanks!

  • Matt

    4/13/2017 7:52:20 AM | Reply

    Really good blog Ust! Does the stream section of the framework adhere to similar rules and structures as the Batch section? e.g. broken up by source or destination? partitioned by a data hierarchy?

    • UstOldfield

      4/13/2017 8:00:10 AM | Reply

      Hi Matt,
      The stream structure will most likely be partitioned by date, which is how Stream Analytics outputs it. However, the overarching structure can be similar to RAW, in that it will be organised by Source Model and Entity. Because the data has most likely already been transformed, it won't need to have further sub-folders - like Batch - but will form a data source for the batch process.
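
      For example (the names here are purely illustrative), a streamed file might land as something like /Stream/Telemetry/DeviceReadings/2017/05/08/output.json, with the date partition coming straight from the Stream Analytics output.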

  • José Mendes

    5/8/2017 8:58:36 PM | Reply

    Another great blog Ust. Can you automate the folder structure, i.e., is there a way to automatically create new folders? If yes, how can you do that?
    E.g. Today you create three folders: 2017, 05, 08. Tomorrow you need 2017, 05, 09.

    • UstOldfield

      5/9/2017 7:50:39 AM | Reply

      Hi Jose,
      You can automate the folder structure using PowerShell. The script you'll want to use will look something like this:
      # Name of your Data Lake Store account and the root path
      $dataLakeStoreName = "pleasereplaceme"
      $myrootdir = "/"

      # Create the top-level Raw folder
      New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $myrootdir/Raw

      With the date partitions in Raw, when you land data using either Data Factory or Stream Analytics, the partitions will be created automatically. Again, PowerShell will also work.
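
      As a rough, untested sketch, a scheduled script could create the next day's partition along these lines (the Entity path is just an example):

      # Build tomorrow's Year/Month/Day partition under an example Entity folder in Raw
      $partition = (Get-Date).AddDays(1).ToString("yyyy/MM/dd")
      New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path "/Raw/Internal/CRM/Customer/$partition"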

  • Thulasi

    9/25/2017 4:24:48 PM | Reply

    Hi Ust,

    Please share your expertise. I'm looking for a PowerShell script to check Data Lake Store usage at the folder level.
    I was wondering why this is not displayed in the Azure portal, as the size column is there but without details.

    Thank You,

    • UstOldfield

      12/5/2017 8:58:21 PM | Reply

      Hi Thulasi,
      I'm afraid I've not used PowerShell much with the data lake store. But my approach would be to iterate over the folders, summing up the size of each child object for each folder.
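
      Something along these lines might be a starting point (untested, and assuming the AzureRM cmdlets):

      # Recursively sum the size (in bytes) of every file beneath a folder
      function Get-DataLakeFolderSize ($accountName, $path) {
          $total = 0
          foreach ($child in Get-AzureRmDataLakeStoreChildItem -AccountName $accountName -Path $path) {
              if ($child.Type -eq "DIRECTORY") {
                  $total += Get-DataLakeFolderSize $accountName "$path/$($child.Name)"
              }
              else {
                  $total += $child.Length
              }
          }
          return $total
      }

      Get-DataLakeFolderSize "pleasereplaceme" "/Raw"
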
      I don't fully know why size information at a folder level isn't displayed in the portal, but my understanding is that it's because the folders are only logical creations as part of a URL path, rather than a physical collection of files. However, I am happy to be corrected on that assumption.

  • Paul Andrew

    12/5/2017 3:13:07 PM | Reply

    Hey Ust, great post. Since I saw your U-SQL talk at PASS in November something has been bugging me... You say files get split into chunks of 250MB called extents, but according to this white paper, extents are only 4MB in size. What is the source of your understanding for this please, so I can do some further reading? Thanks

    • UstOldfield

      12/5/2017 8:47:45 PM | Reply

      Hi Paul,
      Thanks for sharing. It's an interesting article! My source for the 250MB claim (to be accurate, this should be 256MB) has been Michael Rys and Saveen Reddy, and this article would back up that claim.

      For extents to be 4MB would strike me as odd and low, considering the default for HDFS is 128MB... If it truly is 4MB, then my material has been made out of date.

      I will definitely want to explore this further, a) to further my own understanding and b) get some clarity on extents. Thanks again for sharing!

      • Paul Andrew

        12/6/2017 8:29:52 AM | Reply

        Hey Ust, yes my thoughts exactly. Time to tweet Mike, I think :)

  • Channing

    2/16/2018 2:55:02 PM | Reply

    Would external tables in the Azure Data Warehouse only reference files in Curated?  What if the files are formatted well enough for consumption in the DW stage environment as they hit the Data Lake?  Do they have to pass through Raw at all?

    • UstOldfield

      2/16/2018 3:25:21 PM | Reply

      Typically, you want to draw the external tables for your Azure Data Warehouse from a single location, so Curated is the logical place to do that.

      Although there is definitely redundancy in storing the data in RAW and then moving it through to CURATED with little or no change, the point of RAW is that it stores ALL data in its original form. If we have a Data Science team, for example, that sources its data in its original format from RAW wherever possible, they would need to know to look in RAW for some files and in CURATED for others, and the process starts to get a little messy.

      Having a strict process that everything follows, all the way through the lake is the easiest way to enforce consistency and avoid any confusion.

      We tend to automatically generate our data processing scripts, so the overhead of having these steps is pretty minimal – fill in the metadata and off you go!

  • Scott

    2/22/2018 5:57:56 PM | Reply

    Ust, great post and deep dive into Data Lakes. We are using a somewhat similar structure: RAW, STAGING, PreRelease, Production, Reporting. My question is how to manage environments for DEV, TEST and Production in the old way of release management. Do we have all environments rolled into one ADLS account and resource group? I'm looking for best practices, as we are testing using VSTS with ADF V2 and managing code changes. We are being asked if we can maintain different ADLS accounts or environments. If we do use your best practice above, what is the best practice to manage code changes to pipelines, datasets and connections?

  • jdro

    2/26/2018 4:16:44 PM | Reply

    Hi. Great post! I'm trying to achieve a raw data folder structure similar to yours (I want a folder for each source) with Event Hubs as the ingest service, and I'm struggling to do that with the components available in Azure. I have multiple sources sending to one event hub without a partition key, and the sources may be plenty. Stream Analytics was the obvious option, but it has limitations: the 90-day token, and each folder has to be a separate output in ASA. That means that if I already have ten sources in production, adding the next one stops the work for all of them.
    I don't want to use a partition key due to high availability requirements.
    The second option is a Functions app on the Event Hub writing to the data lake. I'm not sure it will be efficient at big scale.
    The next option is to structure your data after ingest - directly in the data lake. But how do you accomplish this efficiently and delete old files after splitting them by folders? Do you have any suggestions? How did you manage this?

    • Simon

      3/1/2018 8:51:54 AM | Reply

      Hey There,

      Our default process when dealing with streaming events is to use stream analytics to fork the stream - one stream going directly to operational reports, the other landing as flat files in the lake over a dynamic file set. You can encode the date & time into the file path so that every day/hour it starts writing to a new file - this nicely fits into the vertical date partitioning we keep within the immutable RAW layer - you can push those completed files through any lake processing paths you want from that point.
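
      For example, a Data Lake Store output in Stream Analytics with a path prefix pattern along the lines of streams/telemetry/{date}/{time} and a date format of YYYY/MM/DD will start a new folder each day and a new file each hour (the folder names here are just illustrative).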

      If you've got a mix of sources going into the event hub and you want them to be separate files in the lake, I admit it would be tricky to do directly - you would need to build those filters into your Stream Analytics job. But you can always land the data as a single file and use Spark/U-SQL to separate the files during batch processing. That said, if you have separate sources that you want to keep separate, why combine them in a single event hub at all?


      • jdro

        3/1/2018 9:38:36 AM | Reply

        Hi! Thanks for reply!

        "That said, if you have separate sources that you want to keep separate, why combine them in a single event hub at all?"

        We want one Event Hub as the entry point to data logging, for responsibility and maintenance reasons.
        Imagine that you have 10-20 products within the company. You are building a solution which allows you to add new sources to data logging as you go. Each product needs its own preparation for that, implementation of the SDK and so on. It is easy to say: hey, new source, just send events to that entry point. And it is easier to maintain one Event Hub than 20, adding a new one each time as you go.

        The second question I have is how you deal with renewing the token to the Data Lake in Stream Analytics in production. For us it is a real deal breaker, because you cannot automate the process and need to renew the token each time you deploy, when 90 days pass, or when the password of the authorising account changes.
        Also, Stream Analytics as a stream orchestrator is awkward, because adding an input or output, or editing the query, requires stopping the job. Imagine you have 20 products/sources in production which already log data and you have to add another one - that means you have to stop the work for all of them for a while.

        We are looking for a transparent, easy-to-maintain solution which allows us to log data from many sources, add a new one without stopping the flow, and aggregate the data nicely per source in the Data Lake for future purposes.

        Overall, we came up with two ideas - the one you mentioned, doing the splitting after the data lands in the Data Lake, and a second one, using a Functions App as the consumer of the Event Hub and doing the splitting there based on the headers/payload of the message.

        The first one seems great, but there is another problem - to do the separation from Data Lake to Data Lake with Data Lake Analytics, both have to be in the same subscription. And there is a problem there: the Data Lake as global storage for the company is one team's responsibility to administer and maintain, while data logging to it is another team's. To bypass this, we would have to use something like a Blob Storage to Data Lake solution.
        The second one also seems great, but we don't know if it will be efficient and scale well.
        Of course the main question is which will be cheaper and better overall.

        We are still investigating this.

        Do you see any other possibilities for doing that?
