
Adatis BI Blogs

Azure Data Factory, using the Copy Data task to migrate data from on premise SQL Server to Blob storage.

This is the third blog in an ongoing series on Azure Data Factory. I recommend you start with the following blogs before continuing:

Introduction to Azure Data Factory
Setting up your first Azure Data Factory

In the first blog in this series I talked about working through a use case. Over the next three blogs we will look at three different methods for migrating data to Azure Blob storage:

1. Using the Azure Data Factory Copy Data wizard (one table).
2. Using BIML and SSIS (entire database - SSIS).
3. Using Azure Data Factory and PowerShell (entire database - ADF).

The reason I have included the latter two versions is that if you just want to load an entire database into blob storage, it can be quicker to use one of these methods as a one-off or on a scheduled basis. Hand-writing all the JSON required to move each table from on-premise to blob storage is very time consuming. Depending on whether you need a one-off upload or something on a schedule, options 2 and 3 might help.

Our original use case from the introduction to Azure Data Factory: let's imagine a process where you have an on-premise SQL Server box and you want to move multiple tables to blob storage. From there you then want to issue a stored procedure which will consume that data into an Azure SQL Data Warehouse via PolyBase, as illustrated in the image below.

Linked services:
On-premise SQL database
Azure Blob storage
Azure SQL data warehouse

Datasets:
Table in on-premise SQL database
The blob container
The stored procedure

Pipelines:
Pipeline to move data from SQL database to blob storage
Pipeline to issue the stored procedure

In this blog we will tackle the first part: copying data.

We are going to start looking in a bit more detail at the Azure Data Factory (ADF) Copy Data task (CD). CD is still in preview at the time of writing (01/2017). Prior to the inclusion of the Copy Data wizard, you had to manually configure ADF artefacts and write most of the JSON for linked services, datasets and pipelines by hand. The Copy Data task is a wizard for generating a data movement activity pipeline, complete with datasets and linked services.

To get started, connect to Azure and navigate to your existing Azure Data Factory (if you do not have an existing ADF you can follow how to create one here http://blogs.adatis.co.uk/Terry%20McCann/post/Setting-up-your-first-Azure-Data-Factory). To begin setting up a copy data pipeline, click on the "Copy data (PREVIEW)" option in the ADF panel in Azure. Once you have selected "Copy data" you will be taken to the new ADF interface. Datafactory.azure.com enables the creation and monitoring of ADF pipelines.

The general process for creating an ADF pipeline (prior to the Copy Data task) was:

1. Create an ADF
2. Create a linked service/s
3. Create a gateway as needed
4. Create your input and output datasets
5. Create a pipeline
6. Monitor the pipeline

When using ADF Copy Data the process is slightly flipped:

1. Create an ADF
2. Configure the properties for the pipeline
3. Create a gateway
4. Configure the linked service/s
5. Configure the datasets
6. Deploy all configurations
7. Monitor the pipeline

The main difference here is that you do not deploy anything until it has all been configured, and you avoid the somewhat cumbersome process of doing all of this by hand. At present, the Copy Data task is very narrow in its functionality. If your intention is to build a more advanced pipeline, you will either need to generate a move task and tweak it, or create it all manually.
Copy Data has many shortcomings. For our example the most prevalent is that a movement to blob storage can only sink data to a single folder, not multiple folders. Option 3 in our list of migration methods aims to get around this limitation using PowerShell.

Configure pipeline properties:
Once you have selected "Copy data" you will be launched into datafactory.azure.com, the new, fresher-looking environment. The Copy Data task is a four-step process which will guide you through the creation of a data movement pipeline (I want to highlight that this is only used for data movement and not transformation). This is a great way to get started with ADF without having to understand the JSON or trickier elements such as data slices and scheduling, although we will touch on scheduling as it is quite tricky.

(image 1 - Configure Properties)

The first screen you will see shows the properties of the pipeline you're creating. It is here that you configure the frequency and schedule of the pipeline. A pipeline is a group of logically related activities.

Task name - This is important and will be used as a prefix for the names of datasets and data stores.
Task description
Task schedule - See below for a more in-depth analysis.
Start time - This date is in UTC.
End time - This date is also in UTC.

For quick conversions to your time zone, I recommend worldtimebuddy (http://www.worldtimebuddy.com/).

More on schedules:
The Microsoft page about scheduling is incredibly deep and takes a lot of effort to fully digest and understand. I will attempt to distil my understanding of pipeline scheduling into a brief list of key points. You can read more here https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution - I would recommend that you do read this page as it has a lot of good examples. The key points from this document are:

A schedule can be one-off or run between dates.
It can run on a scheduled basis: minute, hourly, daily or weekly, with every 15 minutes being the minimum.
This forms what is known as a tumbling window. Microsoft defines a tumbling window as "a series of fixed-size, non-overlapping, contiguous time intervals". Tumbling windows are also known as activity windows.
A pipeline schedule's interval needs to be the same as a dataset's availability - however, it does not need to run at the same time.

For our example we will use a frequency of "Daily" and an interval of "1", which will run our pipeline every day. To write this in JSON rather than through the wizard, you would include the following as part of your pipeline:

"scheduler": {
    "frequency": "Daily",
    "interval": 1
}

To create a pipeline which will run indefinitely, you can set the end date time to "12/31/2099 12:00am"; while this is not infinite, the date will outlive ADF. Start date time will default to the time you created the pipeline (n.b. these dates are expressed in US format MM/DD/YYYY).

Creating linked services (source data store):
The next screen is the configuration of the linked services. ADF is still new and the terminology is somewhat confusing. Depending on the tool you're using, and sometimes the screen you're looking at, ADF will mix up the names for key parts - anywhere you see the term "data store", assume it is referring to a linked service. For our example we will use the SQL Server option (bottom right of the above image).

(Image - SQL Server linked service configuration screen)

You should now be able to configure the connection details (via a gateway) to your SQL Server database.
Connection name - You can refer to the Adatis naming standard as a reference: LS - linked service, MSQL - Microsoft SQL Server, Person - the table being exported.
Gateway - Select an existing gateway or create a new one (see below).
Server name - For my example I am using my local server with the default instance, "." (local would also work). If you're connecting to a named instance this will need to be server\InstanceName.
Database name - The database you want to connect to.
Credential encryption - You have the option to save a credential in Azure or to authenticate through the browser. For simplicity I am using the latter; for production, use the former.
Authentication type - How to connect to SQL Server: Windows or SQL login.
User name
Password

Creating and configuring a gateway:
In our example we will be connecting to a local instance of SQL Server. To connect and read data we will need to create an ADF gateway connection and also install the gateway on the server which hosts our database (or at least has a connection to that database). You have a few options to create the gateway, but before you can configure any of these you will need to download and install the gateway. You can find the latest version of the gateway here https://www.microsoft.com/en-gb/download/details.aspx?id=39717. Once installed, the gateway will be waiting for an access key.

(image - Microsoft gateway - awaiting gateway key)

We have three options to create an ADF gateway and obtain the key the gateway is expecting.

1. Copy Data configuration page - Click "create gateway". This will build the gateway and add the name to your pipeline. You will need to take the access key it generates and add it to your installed gateway.
2. Add via Author and deploy - Navigate to Author and deploy on the main panel of ADF in Azure. Click on "...More" and select "New data gateway", configure and deploy. This will return a key. Add the key to the gateway on your server.
3. Via PowerShell - Open a new PowerShell prompt and connect to Azure (Login-AzureRmAccount). Replace the following with your Azure details - $ResourceGroup, $DataFactoryName and $Gateway:
New-AzureRmDataFactoryGateway -ResourceGroupName $ResourceGroup -Name $Gateway -DataFactoryName $DataFactoryName -Description $Gateway
This will return a key. Add the key to the gateway on your server.

(image - A registered gateway)
(Image - main screen on a registered gateway)

Configuring linked services:
Select next to choose which table/s you wish to move.

(Image - ADF Copy - Select tables)

You can select one or more tables here. For our example we will be consuming the data using PolyBase. We want our data to sink to its own container in Azure, and as such we cannot move multiple tables at once (at the time of writing, a pipeline is limited to one sink container).

(Image - ADF data filter)

You will next be asked how you want to filter the data. Each time our pipeline runs we are looking to move the whole table. If we were looking to do incremental loads, we could select a column which indicates which rows to import each hour. For our example select Filter: None, then Next.

(image - ADF destination source)

Configuring the destination source:
On the next screen you will see the list of available sinks (where you can insert data). You will notice the list of sinks is far smaller than the list of sources - at present not all sources can be sinks.
For our example select Azure Blob storage.

Connection name - Based on the Adatis ADF naming standards http://blogs.adatis.co.uk/Terry%20McCann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices: LS_ - linked service, ABLB_ - Blob storage, Person - the blob container the data will sink to.
Account selection method - Manual/Azure list.
Azure subscription - Select your subscription.
Storage account name - Select your storage account.

(Image - Selecting a blob container)

Select a folder for the file to sink to. I created a folder ahead of time called "person".

(Image - ADF file format configuration screen)

Customise your output settings. For now we will just select the defaults to create a CSV. Select finish to build your pipeline.

(image - ADF deployment completed screen)

As long as everything has worked you will see the screen above. Congratulations, your pipeline has been deployed. To monitor the pipeline and see what has been created, select the link "Click here to monitor your pipeline". You will be taken to a different screen in the ADF portal. We will have more on how to monitor ADF shortly.

(Image - ADF pipeline in datafactory.azure.com)

You can check the data has moved successfully using Azure Storage Explorer (ASE). ASE is a great utility for browsing files in blob storage. You can download ASE here http://storageexplorer.com/

(image - Storage explorer)

I can see that my file is there and is populated as intended. Once a further 24 hours has passed this file will be overwritten.

So we have seen what we can do with the Copy Data task in Azure. While it is fantastic at basic data movement, Copy Data does not offer much beyond that. I have listed the following pains and shortfalls which exist in ADF Copy Data at present.

Limitations of the Copy Data wizard:
There are quite a few limitations; some of these are nice-to-haves, others are show stoppers.

The CD action is limited to only a subset of the pipeline activities. As the name suggests you can only copy or move data - there is no transformation wizard.
The menus are very temperamental and regularly do not work.
You cannot name a dataset - "InputDataset-8tl" was created in my example, which is not helpful.
The name of the pipeline is also not helpful.
You cannot chain multiple activities together.
Each pipeline needs to be created separately.
You can only sink datasets to one blob container.

Now that we have our data in blob storage we can begin to look at the rest of our solution, where we will create an Azure SQL Data Warehouse with external PolyBase tables and use stored procedures to persist the external tables into the warehouse (a rough sketch of that PolyBase pattern is included after the links below). In the next blog we will look at moving an entire database to Azure Blob storage using SSIS and BIML.

Links:
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-scheduling-and-execution
http://blogs.adatis.co.uk/Terry%20McCann/post/Azure-Data-Factory-Suggested-naming-conventions-and-best-practices
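As a preview of that consumption step, the sketch below shows the general PolyBase pattern for exposing the sunk CSV as an external table in Azure SQL Data Warehouse and persisting it into a warehouse table. Every object name, column and storage account detail here is an illustrative placeholder rather than part of the pipeline built above.

-- One-off per database: a master key is needed before creating a scoped credential
CREATE MASTER KEY;

-- Credential used to reach the storage account (the secret is a placeholder)
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account access key>';

-- External data source pointing at the blob container the pipeline sinks to
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- File format matching the CSV produced by the Copy Data wizard defaults
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- External table over the folder the person file was sunk to
CREATE EXTERNAL TABLE dbo.PersonExternal
(
    PersonID  INT,
    FirstName NVARCHAR(50),
    LastName  NVARCHAR(50)
)
WITH (
    LOCATION = '/person/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = CsvFileFormat
);

-- Persist the external data into the warehouse (the kind of statement the
-- stored procedure issued by the second pipeline would wrap)
CREATE TABLE dbo.Person
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT PersonID, FirstName, LastName
FROM dbo.PersonExternal;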

How to do row counts in Azure SQL Data Warehouse

Continuing on from my last couple of blog posts about working with Azure SQL Data Warehouse, here is another issue which has come up during development and is handy to know if you are going to be developing a solution!

Keeping track of how much data has been loaded plays a key part in a BI solution. It is important to know, for a given load, how many rows were inserted, updated or deleted. Traditionally, we were able to use the @@ROWCOUNT function, which returns the number of rows affected by the last statement. Unfortunately, in Azure SQL Data Warehouse @@ROWCOUNT is not supported.

How does it work?

The Microsoft Azure documentation provides a workaround for this - please see here for more information and a list of other unsupported functions. They suggest creating a stored procedure which will query the system tables sys.dm_pdw_exec_requests and sys.dm_pdw_request_steps in order to get the row count for the last SQL statement for the current session.

sys.dm_pdw_exec_requests holds information about all requests currently or recently active in SQL Data Warehouse. It lists one row per request/query.
sys.dm_pdw_sql_requests holds information about all SQL Server query distributions as part of a SQL step in the query.
sys.dm_pdw_request_steps holds information about all steps that are part of a given request or query in SQL Data Warehouse. It lists one row per query step.

The stored procedure takes a 'LabelContext' parameter (a sketch of the pattern is included at the end of this post). A label is a concept in Azure SQL Data Warehouse that allows us to tag our queries with a piece of text that is easy to understand, and we can then find that label in the DMVs. For example, if we give our query the label 'Test Label' and want to find information about this query in the DMVs, we can search using that label.

So, putting this into context: in the ETL we are calling stored procedures to load our data (for example between clean and warehouse). Within the stored procedure we have the query written to insert or update the data, and we give this query a label. Then, within the same stored procedure, we call the row count stored procedure, passing through the label as a parameter so we can retrieve the row count.

Be careful though! On my current project we have come across times where we haven't been able to get the row count back. This is because the sys.dm_pdw_exec_requests DMV we are querying is transient and only stores the last 10,000 queries executed. When we were running the query above, our requests were no longer there and we were getting nothing back! The DMV holds data on all queries that go against the distribution nodes, plus statistics gathering for each of the nodes. So, in order to try and limit the records in this table, keep the nesting level of queries as low as possible to avoid the table blowing up and not having the data you need in it!

Stay tuned for another blog about working with Azure SQL Data Warehouse!
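The stored procedure published in the Microsoft documentation differs in its details; what follows is only a minimal sketch of the pattern described above, using hypothetical names (dbo.GetRowCount, dbo.MyTable, SomeColumn).

CREATE PROCEDURE dbo.GetRowCount @LabelContext VARCHAR(255)
AS
BEGIN
    -- Find the most recent request for this session with the given label
    -- and return the row count recorded against its steps
    SELECT TOP 1 s.row_count
    FROM sys.dm_pdw_exec_requests AS r
    JOIN sys.dm_pdw_request_steps AS s
        ON r.request_id = s.request_id
    WHERE r.[label] = @LabelContext
      AND r.session_id = SESSION_ID()
      AND s.row_count >= 0            -- steps that do not change or return rows report -1
    ORDER BY r.submit_time DESC, s.step_index DESC;
END;

-- Inside the load procedure, label the statement that moves the data...
UPDATE dbo.MyTable
SET SomeColumn = 1
OPTION (LABEL = 'Test Label');

-- ...the label can then be found in the DMV...
SELECT request_id, [label], command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Test Label';

-- ...and the row count retrieved straight afterwards
EXEC dbo.GetRowCount @LabelContext = 'Test Label';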

Statistics in Azure SQL Data Warehouse

Following on from my previous post about update queries in Azure SQL Data Warehouse, I thought I would put together a mini-series of blogs related to my 'struggles' working with Azure SQL DW. Don't get me wrong, it's great, it just has some teething issues for which there are workarounds! This blog post is going to look at what statistics in the database world are, the differences between them on-prem (SQL Server) and in the cloud (Azure SQL Data Warehouse), and also how to use them in Azure SQL Data Warehouse.

What are statistics?

Statistics are great: they provide information about your data which in turn helps queries execute faster. The more information that is available about your data, the quicker your queries will run, as the optimiser can create the most optimal plan for the query. Think of them as you would mathematical statistics - they give us information regarding the distribution of values in a table, column(s) or indexes. The statistics are stored in a histogram which shows the distribution of values, range of values and selectivity of values. Statistics objects on multiple columns also store information regarding the correlation of values among the columns. They are most important for queries that have JOINs and GROUP BY, HAVING, and WHERE clauses. In SQL Server, you can get information about the statistics by querying the catalog views sys.stats and sys.stats_columns. By default, SQL Server automatically creates statistics for each index and for single columns. See here for more information.

How does it work in Azure SQL Data Warehouse?

In Azure SQL Data Warehouse, statistics have to be created manually. On previous SQL Server projects, creating and maintaining statistics wasn't something that we had to incorporate into our design (and really think about!), however with SQL DW we need to make sure we think about how to include it in our process in order to take advantage of the benefits of working with Azure DW.

The major selling point of Azure SQL Data Warehouse is that it is capable of processing huge volumes of data, and one of the specific performance optimisations that has been made is the distributed query optimiser. Using the information obtained from the statistics (information on data size and distribution), the service is able to optimise queries by assessing the cost of specific distributed query operations. Since the query optimiser is cost-based, SQL DW will always choose the plan with the lowest cost. Statistics are important for minimising data movement within the warehouse, i.e. moving data between distributions to satisfy a query. If we don't have statistics, Azure SQL Data Warehouse could end up performing data movement on the larger (perhaps fact) table instead of the smaller (dimension) table, as it wouldn't know any information about their sizes and would just have to guess!

How do we implement statistics in Azure Data Warehouse?

Microsoft have actually provided the code for generating the statistics, so it's just a case of deciding when in your process you want to create or maintain them. In my current project, we have created a stored procedure which will create statistics and another that will update them if they already exist. Once data has been loaded into a table, we call the stored procedure and the statistics will be created or updated (depending on what is needed). See the documentation for more information and the code; a minimal sketch of the underlying statements is included at the end of this post.

Tip: On my current project, we were getting errors when running normal stored procedures to load the data.
Error message: 'Number of Requests per session had been reached'. Upon investigation in the system tables, 'Show Statistics' was treated as a request which was also evaluated for each node, causing the number of requests to blow up. By increasing the data warehouse units (DWUs) and also the resource class allocation this problem went away - so take advantage of the extra power available to you!

There is a big list on the Microsoft Azure website of features not supported in Azure SQL Data Warehouse - take a look 'here'. I will cover further issues in my next few blogs.
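For reference, here is a minimal sketch of the kind of statements that such create/update statistics procedures wrap; the table, column and statistics names are illustrative rather than taken from the project above.

-- Single-column statistics on a column used for joins or filtering
CREATE STATISTICS stat_DimCustomer_CustomerKey
ON dbo.DimCustomer (CustomerKey);

-- Multi-column statistics capture correlation between the columns
CREATE STATISTICS stat_FactSales_CustomerKey_OrderDateKey
ON dbo.FactSales (CustomerKey, OrderDateKey);

-- After a load, refresh a single statistics object...
UPDATE STATISTICS dbo.FactSales (stat_FactSales_CustomerKey_OrderDateKey);

-- ...or refresh every statistics object on the table
UPDATE STATISTICS dbo.FactSales;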

Update queries in Azure SQL Data Warehouse

I've recently started working on a project where we are working in the cloud with Azure SQL Data Warehouse: "Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational". For more information about Azure SQL Data Warehouse, see here.

Although we develop with the same T-SQL as we do using the on-prem version of SQL Server, I did come across a bit of a quirk when writing some update statements (the statements discussed below are collected in the sketch at the end of this post).

If we are using the on-prem version of SQL Server, when we need to update data we would start with a basic update to a specific row of data in the Sales.MyOrderDetails table, using a WHERE clause to filter for the row. Sometimes it isn't as straightforward and we need to join to other tables, so that we can refer to attributes from those rows for filtering purposes. However, if we take that approach in SQL Data Warehouse, we get an error: SQL Data Warehouse doesn't support ANSI joins in the FROM clause of an UPDATE statement (this is also the case for DELETE statements).

There is a way round it, and it uses an implicit join. Before we look at how the update query can be written, it is a good place to point out that inner joins can be written in a couple of different ways. In an inner join, the ON and WHERE clauses both perform the same filtering and both return rows where the predicate is true, so an inner join can be written with the ANSI ON syntax or implicitly, by listing both tables in the FROM clause and joining them in the WHERE clause. However, it is normally best to stick with the ANSI version rather than the implicit one: although the implicit syntax is still supported, it is an old, deprecated syntax and not considered best practice. So, in order to write an update query in SQL Data Warehouse that uses inner joins to filter the rows, the workaround is to express the join implicitly, as shown in the sketch below.

In conclusion, most SQL statements written in Azure SQL Data Warehouse are written in the same way as we would with the on-prem version of SQL Server; however, there are some cases where the syntax differs slightly, and I will be blogging more about these special cases as I come across them!
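As the original query screenshots are not reproduced here, the sketch below illustrates the pattern described in this post. Sales.MyOrderDetails is the table named in the text; Sales.MyOrders, the column names and the literal values are illustrative assumptions.

-- A basic update to a specific row, filtered with a WHERE clause
UPDATE Sales.MyOrderDetails
SET Discount = 0.05
WHERE OrderID = 10001;

-- Filtering via an ANSI join: fine on-prem, but Azure SQL Data Warehouse
-- rejects ANSI joins in the FROM clause of an UPDATE
UPDATE od
SET od.Discount = 0.05
FROM Sales.MyOrderDetails AS od
INNER JOIN Sales.MyOrders AS o
    ON o.OrderID = od.OrderID
WHERE o.CustomerID = 42;

-- The workaround: express the join implicitly, so the join predicate sits
-- in the WHERE clause rather than as an ANSI JOIN in the FROM clause
UPDATE Sales.MyOrderDetails
SET Discount = 0.05
FROM Sales.MyOrders AS o
WHERE Sales.MyOrderDetails.OrderID = o.OrderID
  AND o.CustomerID = 42;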