Jeremy Kashel

Jeremy Kashel's Blog

Azure HDInsight Polybase Example

Polybase has now made it over to SQL Server 2016 and is available to use as part of the SQL Server 2016 CTPs. One of its capabilities is to connect to Azure Blob Storage, so this blog post gives an overview of how you can use Polybase to connect to data in Azure HDInsight.

Installing Polybase in SQL Server 2016

Polybase will appear in the standard feature list of the SQL Server 2016 CTP installs, alongside the other database engine features, so it’s just a selection box to install it:


Having said that, there’s one small limitation at the moment in the Polybase install, which is that only the SQL_Latin1_General_CP1_CI_AS and Latin1_General_100_CI_AS_KS_WS collations are supported, as explained in the following getting started guide. If you don’t have the correct collation and you try and use Polybase, then you may get the issue of the Polybase windows service stopping.

Azure HDInsight

For this post, I’m just going to use the sensor data sample that comes with Azure HDInsight when you provision a cluster. If you run through the following HDInsight Tutorial, then you can get up and running with an HDInsight cluster and use the samples.

In this case, I’m using the sensor data analysis sample, which comes with a csv file called HVAC.csv – this is a set of temperature readings. The file looks as follows when opened in Excel:


Within the Azure portal, if we go to the HDInsight cluster, then we need to note down the access keys that will allow Polybase to connect to Azure:


We can then copy one of the access keys, which we’ll need for later:


Polybase in SQL Server Management Studio

Over to SQL Server Management studio now, where we need to create an External Data Source, a File Format and an External Table to access the Azure table. Before creating these objects, we need to create a credential, where we will need to specify the Azure access key that we copied earlier:

--Enable traceflag for using a database-scoped credential.
DBCC TRACEON(4631,-1);

-- Create a db master key if one does not already exist, using your own password.

--Enter the Azure account name and the acccess key as the secret
CREATE CREDENTIAL AzureCred ON DATABASE WITH IDENTITY = 'accountname', SECRET='accesskeyhere'

Now that we’ve set up the credential, we need to create an external data source, so that Polybase knows where it needs to go to get the data. Notice how we’re referencing the credential that we created earlier:

    --Specifiy the container name and account name
    LOCATION = 'wasbs://',
    --Specify the credential that we created earlier
    CREDENTIAL = AzureCred

Now we need to specify the format of the file that we want to access. In this case, having looked at the CSV, its comma delimited, so the SQL required is as follows:


Finally we need to create an external table – this is the object that we will be able to query with a T-SQL statement and consume data within the Azure HDInsight cluster. The syntax for this example is:

--Define the table structure
    Date varchar(10),
    Time varchar(10),
    TargetTemp smallint,
    ActualTemp  smallint,
    SystemID  smallint,
    SystemAge  smallint,
    BuildingID  smallint
        --Set the file to be the HVAC sensor sample file
        DATA_SOURCE = AzureDs,
        FILE_FORMAT = CommaFormat,
        --We will allow the header to be rejected
        REJECT_VALUE = 1

Interestingly you can’t tell Polybase to skip a header record, as confirmed in the following connect issue, so I found that my statement was failing when trying to query the data. The solution is to use the REJECT_TYPE and REJECT_VALUE, which will allow us to reject a number of rows, as shown above.

Once this is done, then we’ll see that we have an External Table:


Querying it is just like any other table with a SELECT statement, which gives the following results in this case:


So that’s it – by creating the external data source, file format and external table, we can now use T-SQL in SQL Server 2016 to easily query data hosted within Azure HDInsight.

Analytics Platform System Polybase Example

To me one of the most exciting parts of the Microsoft Analytics Platform System (APS) is Polybase, which in a nutshell allows you to access data residing in Hadoop or Windows Azure Blob Storage, all via T-SQL in PDW. What this means is that data can be transparently queried by a user or developer, in real time, regardless of whether the data lives in PDW or Hadoop/Azure. James Roland-Jones gives a thorough overview of all things Polybase here.

What I’m going to do in this post is to show an example of how existing data within Hadoop can be combined with data that resides in an APS PDW region.

Polybase Example - Setup

There are two key tasks to complete before we’re able to start querying data. We need to setup a data source, so that Polybase knows where to get the data from, plus we need to describe the file format of the external file that we’re going to read. The data source here specifies that we’re going to use the Hadoop nodes that are co-located with the PDW nodes, but this could be a non-appliance Hadoop cluster:

        TYPE = HADOOP,
        LOCATION = 'hdfs://hdfs://H12345-C-HHN01',
        --Optional specify a tracker location to enable predicate
        --push down to Hadoop
        JOB_TRACKER_LOCATION = 'H12345-C-HHN01:50300'

The next task is to setup a file format. In this case we’re defining that the file is pipe delimited, although we can use Polybase with other formats (e.g. RCFile):

CREATE EXTERNAL FILE FORMAT HadoopRegion_DelimitedPipe_NoCompression
    Field_terminator = '|')

Accessing External Data

Before we access data in Hadoop, I want to show the file for this example. In this case I’ve got a customer dimension within PDW, but I have customer survey data within Hadoop that i wish to combine with my data in the warehouse. The survey data has been uploaded in this case to the Hadoop region within APS, via the portal:


The following shows a sample of the Hadoop file, note that its pipe delimited, with columns for Date, Survey Id, Product, Customer Id and Survey Score:


Now I want to access the Hadoop data and combine it with other data in the warehouse. To do this we need to use the CREATE EXTERNAL TABLE command.


    DateId int NOT NULL,
    SurveyResponseId int NOT NULL,
    ProductCategoryName varchar(100) NULL,
    CustomerId    varchar(50) NULL,
    SurveyResult INT NULL
    WITH ( 
        LOCATION = '/user/data/survey/',
        DATA_SOURCE = HadoopRegion_DataSource,
        FILE_FORMAT = HadoopRegion_DelimitedPipe_NoCompression

This results in an external table being available within the PDW region, as shown below:


Now if we want to query the Hadoop data it’s just a simple case of T-SQL. Here I’m joining Hadoop survey data to the customer dimension:

SELECT      CS.CustomerId,
            CS.DateId AS SurveyDate,
            CS.ProductCategoryName AS SurveyProduct,
FROM        dbo.HDFS_CustomerSurvey CS
INNER JOIN  dbo.DimCustomer C ON C.CustomerAlternateKey = CS.CustomerId

Which gives the following results:


So that’s it, just a simple T-SQL query in the end. Polybase has taken away the complexity and allowed us to integrate different data sources using a widely used standard query language.