Simon Whiteley's Blog

Azure SQLDW – CTAS Statements

So here we are - you've designed your target distribution, you're scaling your warehouse appropriately and you've set up Polybase to access raw data outside of the system. You are now ready to bring data into the warehouse itself so we can start working on it.

Firstly, you should be aware that we're building using ELT rather than ETL, ie: we design for the warehouse itself to do the work, as it is the most powerful processing option available to us. When first introducing Polybase, I described how it is the only method of loading data into SQLDW that does not get bottlenecked through the control node. The same limitation applies if we try to transform our data using traditional ETL tools. If we used an SSIS data flow, for example, it would extract the data, pull it across to the SSIS server, perform the transformation and then attempt to reinsert the data. Not only are we introducing network latency, but we are also ignoring the power of the MPP engine AND encountering the throughput limits of the control node.


In order to utilise SQLDW effectively, we write SQL for our transformations, rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.

General best practice, therefore, would be to write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging etc. But what we're interested in here is the core data movement itself.

Writing a traditional INSERT statement isn't the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, and which is automatically configured for optimal bulk loading: the CTAS, or "Create Table As Select", statement.

CREATE TABLE [dbo].[MyTable]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
)
AS
SELECT *
FROM [dbo].[MySourceTable];

 

As it is the only minimally-logged, bulk insert method within SQLDW, it's the fastest way to get data into a table. If the table already exists and you want to wipe it and fill it with new data, the fastest way to do so is to drop the table and recreate it using a CTAS. Performing a truncate/insert combination would be slower, as each row insert causes a write to the transaction log. SQLDW runs every statement as a transaction by default, so if you have a huge insert statement and something goes wrong, be prepared for a lengthy rollback process.
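
A rough sketch of that drop-and-recreate pattern (the table names here are just illustrative):

IF OBJECT_ID('dbo.MyTable') IS NOT NULL
    DROP TABLE dbo.MyTable;

CREATE TABLE dbo.MyTable
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
)
AS
SELECT *
FROM dbo.MySourceTable;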

The table created by a CTAS statement infers its schema from the query, so you need to be explicit with ISNULLs, CASTs and anything else needed to point the query in the right direction. This is where the bulk of the data processing happens - the SELECT part of the CTAS both defines the destination table and carries out the data transformations, all as a single step.

If you’ve ever used “SELECT field1, field2 INTO [dbo].[NewTable] FROM [dbo].[SourceTable]” then you’ll have a pretty good idea of how it works. But if you ran the two side by side, the CTAS would perform faster as, again, it is treated differently by the SQLDW engine.

What does this mean for our traditional ETL pattern? If you're following the standard approach of having intermediary tables between data processing steps (ie: staged data, cleaned data, transformed data and so on), these transient tables can actually be dropped between loads. They won't exist in your database project. They won't exist until your load process has been run.

A CTAS to 'Clean' our fact table would be:

DECLARE @ErrorMsg NVARCHAR(2048);

BEGIN TRY
    --Primary CTAS to insert data from the external table into a strongly typed Clean table
    CREATE TABLE Clean.MyTable
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        HEAP
    )
    AS
    SELECT
        ISNULL(CAST(Column_0 AS INT), -1)   AS TableId,
        CAST(Column_1 AS VARCHAR(50))       AS SomeTextField,
        CAST(Column_2 AS DECIMAL(12,5))     AS SomeNumber
    FROM SRC.MySourceTable
    OPTION (LABEL = 'Clean.LoadMyTable.CTAS');
END TRY
BEGIN CATCH
    --CTAS failed - raise the error so the calling process can mark the load as failed
    SET @ErrorMsg = 'Error loading table "MyTable" during Cleaning CTAS: ' + ERROR_MESSAGE();
    RAISERROR (@ErrorMsg, 16, 1);
END CATCH
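
The OPTION (LABEL = ...) hint at the end isn't required, but it makes the load easy to track down when monitoring. As a quick sketch, you can look the request up by its label in the DMVs:

SELECT request_id, status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Clean.LoadMyTable.CTAS';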

By including the CAST and ISNULL functions, I’m controlling the schema of the created table. The resulting table would have the following definition:

CREATE TABLE [Clean].[MyTable]
(
    [TableId]       INT             NOT NULL,
    [SomeTextField] VARCHAR(50)     NULL,
    [SomeNumber]    DECIMAL(12,5)   NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

By using CTAS wherever possible, you will maintain the speed of your data loads and avoid any issues around long-running transactions. An efficient SQLDW loading process contains only CTAS statements.

For example – there is no such thing as a MERGE statement within SQLDW. In order to merge the results of one table into another, you can write a CTAS that selects a union of the two tables and inserts the results into a new table. You then drop the old table and rename your newly created table in its stead.
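
A minimal sketch of that pattern (the Stage.MyTable source, dbo.MyTable target and MyTableId key below are purely illustrative) – a CTAS builds the merged result, then a drop and rename swap it into place:

--Build a new table from the staged rows plus any existing rows they don't replace
CREATE TABLE dbo.MyTable_Merged
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
)
AS
SELECT s.MyTableId, s.SomeTextField, s.SomeNumber
FROM Stage.MyTable AS s
UNION ALL
SELECT t.MyTableId, t.SomeTextField, t.SomeNumber
FROM dbo.MyTable AS t
WHERE NOT EXISTS (SELECT 1 FROM Stage.MyTable AS s2 WHERE s2.MyTableId = t.MyTableId);

--Swap the new table in place of the old one
DROP TABLE dbo.MyTable;
RENAME OBJECT dbo.MyTable_Merged TO MyTable;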

There are many similar patterns that will help you around any data movement problem – we’ll discuss them in later posts in this series.

For more information about the CTAS statement, see the Microsoft Docs.

A Guide to Azure SQL DataWarehouse

So you've heard the hype - the Azure SQL DW is going to solve all of your problems in one fell swoop… Right? Well… maybe. The system itself is a mix of technologies designed for low concurrency analytics across huge amounts of relational data. In short, it's a cloud-scalable T-SQL-based MPP platform, with all the benefits and restrictions that performing everything in parallel brings. If your problem can be solved by performing lots of calculations over small chunks of your data before aggregating the results into a whole, this is the technology for you.

However, before you jump right in, be aware that SQLDW is a very different beast to other SQL tools. There are specific concepts you need to be familiar with before building your system, otherwise you're not going to see the promised performance gains and will likely lose faith very quickly!

I'm in the process of documenting these concepts, plus there is a wealth of information available on the Azure SQLDW site. For the next few months, I'll be running through the blog topics below and updating the links accordingly. If there are topics you'd like me to add to the list, please get in touch!

Azure SQLDW Core Concepts:

- What is it?

- How Does Scaling Work?

- Distributions

- Polybase

- Polybase Limitations

- Polybase Design Patterns

- CTAS

- Resource Classes

- Partitioning

Designing ETL (or ELT) In Azure SQLDW

- Row counts

- Statistics

- Surrogate Keys

Performance Tuning Azure SQLDW

- Plan Exploration in SQLDW

- Data Movement Types

- Minimising Data Movement

Managing Azure SQLDW

- Backup & Restore

- Monitoring Distributions

- System Monitoring

- Job Orchestration

- Scaling and Management

- Performance Tuning Queries

Azure SQLDW Architecture

- Presentation Layers

- Data Lake Integrations

The Cloud BI Transition: Warehouse of the Future Talk

While we wait for the SQLBits XV videos to be published, I actually gave the same talk back in March for the London PASS Chapter MeetUp. If you were unfortunate enough to miss both talks, or happened to steal a quick nap mid-way through, the recording can be found here.

I wrote up a quick accompanying blog earlier this week, which includes a link to the slides and some additional thoughts.

As usual, if anyone has any questions following the talk, I’d love to hear from you in the comments or directly on twitter via @MrSiWhiteley.

The Cloud BI Transition: ETL & Data Movement in the Cloud

The cloud warehouse is a different beast to the on-premise systems most of us have been using for the past decade. These times, they are a changing. I’ve talked about the key concepts that we, at Adatis, see as core to this new world here, but how does this work in practice? To start us off, the core of any BI Developer’s skillset – ETL.

The two main players are Azure DataWarehouse and Azure Data Lake. I recently gave a talk at SQLBits to introduce some key concepts, explain how I see them fitting in to the new world and, hopefully, challenge some existing preconceptions - slides can be found here. I’ll pop the video link up when published.

For those in a hurry, I’ve summarised the main thrust of my talk below.

The Ways of the Past

For many years, the fastest way to transform data whilst moving it between two places was to extract it, transform it in memory via SSIS, then load it into its final structures – hence, ETL.


However, this was eventually rivalled by massive parallelism: whilst one database engine may be slower to work through a load than memory-based operations, ten engines might be much faster. But the Hadoop systems that provided this parallelism were not very accessible to the bulk of BI developers, being seen as requiring different skillsets and not being particularly standard.

Even with developer acceptance, the management of these clusters wasn’t always achievable – certainly not in the on-premise world. This is why the latest Azure technologies are looking to make such a difference.

The Scalable Warehouse

With Azure Data Warehouse, you can perform queries over large, distributed tables – both reading and writing with massive levels of parallelism, depending on your service tier. You can also scale your service dynamically, so if your day-to-day usage requires 100 DWUs, you’d be seeing 8 reader threads and 60 writer threads. Even at this level you may find performance to be similar to SSIS.

But these are PaaS components and so are incredibly flexible and scalable. We can scale our ADW instance up to 400 DWUs just for the duration of our data load (bear in mind, you pay by the hour for the maximum level used in that time). This gives us 32 readers and 60 writers – at this level we’ll be far outstripping the performance of a powerful SSIS box, and it will cost us just a couple of pounds per hour. This scaling can be done via a number of methods – a couple of lines of T-SQL, a simple PowerShell script or even through the UI.
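
The T-SQL route really is just a couple of lines – a minimal sketch, assuming a warehouse named MySqlDw and run against the logical server's master database:

ALTER DATABASE MySqlDw
MODIFY (SERVICE_OBJECTIVE = 'DW400');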

Once we’ve scaled, transformations are quite likely to be performed faster directly by our data warehouse engine than by a data movement tool. This gives us Extract > Load > Transform, a common technique among users of MPP systems – and with Azure DW coming from APS/PDW, this makes sense.


Targeted Compute

With Azure Data Lake Analytics, you can spin up the equivalent of 1000 cheap servers (essentially managed Hadoop instances) for the lifetime of your query. You’ll need to write U-SQL to do this, a new language combining C# and SQL syntax, but it’s not a huge leap for an experienced SQL developer with just a hint of programming skills. But the ability to write a query directly in the browser that can utilise massive amounts of parallelism removes a lot of barriers – you don’t even need visual studio installed to take advantage of ADL.

Of course, if you do build your queries in Visual Studio, there is a whole tooling set released to allow you to predict compute utilisation, spot bottlenecks and plan capacity. Not to mention the benefits that source controlling your code gives you!


So Which Do We Use?

This gives us two options for our transformation: in the Data Lake, or within the Warehouse itself.

My assumption is that we’re going to see TELT – some upfront targeted transformation, then an extract/load into our final technology, then some further transformation.

Transform (targeted) > Extract > Load > Transform (general)

Whereas transformation used to be the bottleneck, it’s now the data movement itself that’s slowing things down. If we have just one very large job, we can perform this in isolation within the lake before moving anything and not have to scale our warehouse. If we have a lot of transformations to do on a variety of tables, we can scale up the warehouse for the duration. Maybe we do both to limit how much we scale the ADW. Perhaps we perform the transformation as increments throughout the day, either as smaller data lake jobs, data factory movements or even stream transformations. Once again, that’s what a distributed cloud architecture gives us – flexibility and choice.

The actual answer depends heavily upon the data scenario that you’re attempting to work with – the speed of incoming data, the types of data involved, the volume and complexity of the data. The new components do not provide you with a one-size-fits-all answer, they do, however, provide you with the tools to overcome almost any data challenges you may be facing.

Further Reading

Azure DataWarehouse Loading Reference Guide

Azure Data Warehouse Documentation

Azure Data Lake Analytics Documentation

The Cloud BI Transition: Business Intelligence Key Players and their changing roles

I’ve previously talked about how Azure is changing traditional BI approaches, and the various architectural frameworks we’re going to see coming out of it. But what about the existing job roles within BI teams and related structures? How are they going to change with the new ways of working that are emerging?

It’s a question we need to start considering now as cloud architectures become more commonplace and businesses are more willing to trust their data to cloud providers. From my recent work in piecing together new architectures and frameworks, I’ve been fortunate enough to have these conversations with people who are currently evaluating their position and their future training needs. From these conversations, and my own research, I’ve put some thoughts together around the “traditional” BI roles and how they’re changing:

BI Architect – In many ways, this is the role that changes the most. We now have a wealth of options for our data, each with their own specialist uses. The BI Architect needs to be familiar with both new functionality and old, able to identify the most relevant technology choice for each. Whereas we would previously be able to host the majority of our different systems within a single, cover-all SQL Server architecture, we will now find certain structures to be more aligned with the performance profiles of specific Azure components and less so with others. Business Users expecting lightning-fast dashboard performance would not benefit from a Data Lake system, whereas a Data Scientist would be unable to work if confined to a pre-defined data model. A small, lightweight data mart of several GB would likely perform worse in a full Azure DW system, given the data would be split across so many distributions that the overhead of aggregating the results would outweigh the parallelism gains – in this case we would introduce an Azure SQL DB or Azure SQL VM to cater for these smaller marts.

Infrastructure Specialist – Gone are the days of the infrastructure specialist needing to know the install parameters of SQL Server and the best disk configurations to use in different scenarios. We’re now focusing on security models, network architecture, automation and scaling management. The PaaS and IaaS systems differ greatly in their approach to security, with IaaS requiring traditional networking, setting up VLANs/security layers, and PaaS components each having their own firewall layers with individual exception management. The infrastructure specialist should also be advising on and designing the Azure subscription setup itself, connections to other subscriptions, perhaps managing ExpressRoute and gateway connections back to the on-premise systems. There is also the consideration of whether to extend Active Directory into the Azure domain, making the Azure estate more of an extension of the company’s internal network.

Data Modeller – The end role here doesn’t change dramatically; many previous design principles still hold in the new approaches we’ve outlined. However, there are some additional performance considerations to build into their designs. Azure DataWarehouse, for example, fundamentally relies on minimising the data movement that occurs when querying data. A snow-flaked model, or a model with several very high cardinality dimensions, might find performance degrades significantly, when it may have been the most performant design in a traditional multidimensional cube.
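
To illustrate the kind of consideration this adds (the table and column names below are purely illustrative), distributing a large fact table and a very high cardinality dimension on the same key keeps their join local to each distribution, avoiding a shuffle:

CREATE TABLE dbo.FactSales
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT CustomerKey, DateKey, SalesAmount
FROM Clean.Sales;

CREATE TABLE dbo.DimCustomer
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT CustomerKey, CustomerName, Region
FROM Clean.Customer;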

BI Developer – This role will still include many of the traditional tools, very strong SQL skills, an understanding of data movement & transformation technologies and excellent data visualisation skills. However, the traditional “stack” skills of SSIS/SSAS/SSRS are extended and augmented with the additional tools at our disposal – components such as Stream Analytics, Data Factory, Event Hubs and IoT sensor arrays could all easily fall into the domain of BI yet require radically different skillsets. Exposure to C# and the .NET framework becomes more valuable in extending systems beyond the basic BI stack. The management of code is essential as these environments grow – being able to rapidly deploy systems and being confident in the development process is vital in order to get the most out of cloud technologies.

Data Scientist – For the first time, users with advanced analytical skills have a place in the architecture to allow for experimentation, ad-hoc analysis and integration with statistical tools. This free-form analysis outside of strict development protocols accelerates the business’s access to the insight and understanding locked within their collected data. Many of these insights will mature into key measures for the business and can be built into the more stable, curated data models.

Data Steward – If anything, the importance of a nominated data steward grows as we introduce systems designed for ad-hoc analysis. Without governance and controls around how data is stored within a data lake, it can quickly become a dumping ground for anything and everything. Some critics see data lakes as a dystopian future with uncontrolled “swamps” of data that grow meaningless over time. Whilst “store everything” is a fundamental tenet of the data lake mentality, everything stored should be carefully catalogued and annotated for maximum usefulness – the importance of this should not be underestimated. Our steward should embody the meticulous collector, not the disorganised hoarder.

Database Administrator – The introduction of PaaS services as our main components changes this role dramatically, but the DBA remains a core member of the business intelligence team. The common DBA tasks of security management, capacity and growth planning, performance optimisation and system monitoring are all very much a part of day-to-day life. Certain tasks such as backup & recovery are taken away as services Microsoft provide, but the additional skills needed to manage performance on these new technologies are critical. Data Lakes produce large amounts of output as a by-product of running jobs and queries; these need to be cleaned up and maintained over time. Access levels to different areas of the lake will be a growing concern as our lake models mature. Finally, the tuning of high-performance queries, whether in U-SQL or Azure DW, now requires a whole new set of skills to analyse and optimise. PowerShell, a traditional tool of the DBA, becomes hugely powerful within Azure as it is the key to managing system automation – scaling systems up, down, on and off requires a reasonable grasp of PowerShell if you want to get the most out of your systems.

BI Analyst – Somewhere in between our Data Scientists, Developers and Consumers, we have our BI Analysts. Whereas previously they may have been expert cube users, building reports and dashboards for end users to consume using the BI systems provisioned, they now have far more autonomy. PowerBI and other reporting technologies, whilst being touted as the silver-bullet for all self-service reporting needs, can deliver far more when in the hands of an experienced reporting analyst. Essentially, the analyst still acts as the champion of these tools, pushing data exploitation and exposure, except they now have the ability to deliver far more powerful systems than before.

Data Consumer – The business user is, by far, the biggest beneficiary of these tools and systems. With a flexible, scalable architecture defined, as well as different streams of data management and exploitation, there will be many benefits for those who need to gain insight and understanding from the company’s various data sources. The data models supporting self-service tools will benefit from faster performance and can include data sets that were previously size-prohibitive, giving the data consumers instant access to wider models. If the datasets are too complex or new to be exposed, they can contact specialists such as the BI Analyst or Data Scientist to investigate the data on their behalf. These manual-analysis tasks, because the architecture is built and designed to support them, should be more maintainable for a BI team to provide as an ongoing service.

Conclusions

In many ways, things aren’t changing that much – we still need this mix of people in our BI team in order to succeed. But as always, technology is moving on and people must be willing to move along with it. Many of the design patterns and techniques we’ve developed over the past years may no longer apply as new technology emerges to disrupt the status quo. These new technologies and approaches bring with them new challenges, but the lessons of the past are essential in making the most of the technology of the future.

The people who will be thriving in this new environment are those who are willing to challenge previous assumptions, and those who can see the new opportunities presented by the changing landscapes – both in terms of delivering value quicker and more efficiently, and finding value where previously there was none.

The Cloud BI Transition: Azure Warehousing Architecture

There has been a lot of turmoil in the BI space over the past couple of years, however many of us are still holding on to our favoured design patterns. The standard Kimball warehousing pattern brings data from various sources into a generic staging database. We then copy the data to another location, transforming it en-route before finally placing it into conformed, curated star schemas. There may be further transformations, tools, integrations along the way, but this is our most simplistic architecture – and it all happens on one or more SQL Server boxes.

But business challenges are driving technology changes: data volumes are rising exponentially, the different types of data businesses require in order to remain competitive are growing, and the speed at which organisations generate and/or receive data is increasing. As a result there are new, cloud-driven components that facilitate entirely different ways of working – SSIS is no longer the fastest way to transform data when we’re talking about terabytes, or more, of data being moved. We also have different velocities of data, both in how it comes in and how it is analysed. Sensors sending out realtime data streams, and web logs constantly writing away – these don’t always fit nicely into a batch-job ETL. Data Scientists want access to raw, unprocessed data in ways that have yet to be thought of – a curated warehouse isn’t necessarily right for them.

So – to solve these issues people are looking towards Microsoft Azure’s data architecture, but there are a lot of different tools now available for a lot of different purposes. I thought I’d start our architectural analysis with a look at some key concepts and components, largely as a primer.

Scalable Processing

Above all, cloud technologies bring scalability. By hinging their charge models on actual usage, cloud providers allow users to dynamically change their resource allocations to match capacity requirements. If you’re expecting a busy period, then add more capacity. If your business hours are 9-5, turn things off in the evening and weekends to make massive cost savings.

Newer components take this even further – Azure Data Lake Analytics allows compute units to be spun up to run a single job, closing them down afterwards. You can spin up 1000 compute units for the lifetime of a single query, all with only a few minutes’ lead time.

Distributed Architectures

Because we can dynamically scale, we can manage our infrastructure to a much finer level. However, if most of our system resides on a single box, our scaling design has to take a large, complex schedule into account. Splitting various pieces of functionality into separate components and scaling them individually provides massive amounts of flexibility in managing our stack. With BI specifically, batch-processing ETL windows are a prime candidate for this.

If we have an SSIS package that needs to be run within a certain window and requires a huge amount of resource when it is running, we don’t want this to impact the rest of our ecosystem. In this case we can have a separate server for the SSIS and leave it turned off outside of our ETL window. We then power our server on (using Azure Automation, aka hosted PowerShell), run our packages, then turn the server off again. If we’re talking about hour-long daily runs, then even with some allowances for maintenance windows and support, we’re talking about no more than 40 hours of Azure VM cost per month.

Queryable Archive

One of the biggest changes is the attitude to storage with cloud technologies. It has gone from a prohibitive cost that means systems need to justify their storage and growth expectations, to a relatively cheap item which can almost be considered as an afterthought. There is a growing trend of companies storing absolutely everything their systems generate, with an assumption that they will find a use for it at later dates. The cost of storing this data, thanks to falling storage prices and improved technology, is now seen as less than the potential gain that a new seam of rich analysis could bring.

This data doesn’t necessarily make it through to our curated data models – these still need to be carefully designed, governed and documented to ensure they are accessible to less data-savvy users. But expert users, developers and data scientists can write their own queries against the more expansive data stores, pull data into Azure ML and run R algorithms to unearth new insights.

Bringing it Together

With the concepts above, we get the best of both worlds – the ability to analyse vast amounts of data and apply modern data analysis techniques, and expose these findings to end-users and dashboarding systems seamlessly.

This is what we’re building into our reference Azure architecture going forward – a combined approach that has huge potential to scale yet has the functionality to meet the needs of modern BI projects.

Azure Data Lake provides the queryable store, as well as powerful targeted compute, which serves as a feed for more established, curated data models within the Azure Data Warehouse.


Over the next few blog posts, I’ll be discussing how this new architecture works in practice, how it affects the teams and companies around it, best practice for ETL, reporting architectures and development practices, as well as the new governance challenges that will grow from these structures.

Getting Started with Azure Data Lake Store

You may have heard about Microsoft’s new Azure Products – Azure Data Lake Storage and Data Lake Analytics. If not, take a look at the official blurb here!

What is it?

Essentially it’s hadoop-made-easy – the storage engine is a WebHDFS layer with a management overlay that takes care of scaling and cluster distribution so you don’t have to. For many hadoop veterans this will feel as uncomfortable as a grizzled DBA allowing Azure SQL DB to determine the best backup schedule on their behalf. But for everyone else, this removes a significant technical barrier to using truly scalable, distributed storage. What’s more, your store can be accessed by any hadoop-compatible technologies, so you can integrate with your existing architecture as a hassle-free migration path.

Data Lake Analytics provides the parallel querying engine, which comes courtesy of a new language, U-SQL, and is based on Apache YARN (essentially a version 2.0 of the popular Apache MapReduce). Once again, the big difference is that all of the cluster management is done for you – and on a pay-as-you-use model.

For a more detailed overview, take a look at Microsoft’s own overview page. Their architectural vision has the Data Lake as the central hub for all data-based operations, and they headline with a diagram making this very clear:

[Diagram: Azure Data Lake as the central hub for all data-based operations]

The goal is for the Data Lake Storage to be a dumping ground for any and all of your data, be it structured, semi-structured or entirely unstructured. The competitive price of Azure storage means they’re now pushing a “store first, think of questions to ask later” methodology – initial indications are that 1GB of data will cost just £0.05 per month. This makes it an attractive offer for most archiving scenarios where the data still needs to be accessible in some form. I’ve a few thoughts around where this is heading and what it means to the Data Warehouse, but first I’ll run through how to get your Data Lake set up – you can figure out what you’re going to do with it later!

Creating your Azure Data Lake Store

First you’ll need to set up your storage account – you can do this through any valid Azure subscription. At the time of writing, the Azure Data Lake elements are in public preview, so you’ll need to sign up and be accepted first.

To create your storage account, simply go to New > Data + Storage > Data Lake Store:

[Screenshot: creating a new Data Lake Store in the Azure portal]

You’ll be presented with the usual configuration screen:

[Screenshot: the Data Lake Store configuration blade]

Again, at the time of writing this service is only available in the East US 2 location. A few minutes later, your Data Lake Store will be ready to go!

You can now begin loading files into the store, ready for use. You have several options on how to get your data into the store, but for now we’ll go with the most basic.

Uploading your first file using Data Explorer

Open up the resource blade for your DLS and you’ll see a standard overview panel for your new object.


I’ll let you explore the settings and monitoring options yourself. What we’re interested in is the Data Explorer. Click that button (or there’s another entry point via settings) and you’ll be presented with the new Data Explorer blade.


Here you can easily see all of the objects that have been uploaded to the store so far. Let’s add a file by clicking the upload button.

You’ll see a basic upload screen:

[Screenshot: the file upload blade]

And if we select a csv file (or one of many supported file types that takes your fancy), we can click start upload to send the file across.

You’ll receive a portal notification when the upload has completed, and the data explorer view will refresh to show us our new contents:

[Screenshot: the Data Explorer showing the uploaded file]

Clicking on the file will then open it up in the file preview blade, so you can explore the various objects contained in the store:

[Screenshot: the file preview blade]

And that’s it – your file is now available through a variety of methods. You can treat your Azure Data Lake Store like any other WebHDFS layer, or you can spin up an Azure Data Lake Analytics account to write U-SQL against your stored file.

A minor word of warning here – in my initial attempts at using the store, I found it had difficulty recognising column titles within flat files; I’d receive parsing errors where it was trying to convert titles into numerical fields and so on. This is most likely because I’m new to U-SQL as a language and the documentation is still very much a work in progress. I’m assuming this limitation will be easy to work around but I have, for now, removed column titles from my sample sheets.

Over the next few weeks, I’ll be exploring other methods of getting data into your Data Lake Store, whether it’s programmatically through C# or PowerShell, streamed through Event Hubs and Azure Stream Analytics, or, even more traditionally, uploaded to relational database tables through SSIS – all of which can then be accessed via a single interface.

I will also be looking at U-SQL in particular, and how it can be leveraged in a new-world Warehousing Solution.