Azure Data Factory - Suggested naming conventions and best practices.


Years ago I stumbled on a great blog about SSIS best practices and naming conventions. In the years that followed I could refer back to it at the start of every SSIS project. Written by Jamie Thomson, it has become the standard; although there are variants, Jamie's remains the most popular (Jamie Thomson, Link).

 

I want to propose a list of best practices and naming conventions for the various elements of ADF, based on my experience with ADF. This list may change over time as I discover what works and what does not. At present the names generated by ADF are generic and not very helpful (LinkedServiceAzureSQLDataWarehouse1, etc.). In light of this I propose the following best practices and standards for naming ADF elements.

 

Best practices

 

  1. Use a standard naming convention. A large ADF project can end up with 300-400 JSON objects. It is tricky to navigate that many objects, and deploying from Visual Studio makes this even harder. Having a standard helps you quickly locate which JSON object needs to be changed.
  2. When you have a solution with a lot of JSON objects (a few hundred or more), publishing from Visual Studio becomes an issue. A bottleneck appears, brought on by each element needing to validate against its peers. With hundreds of elements, deploying new or updated pipelines adds a significant overhead. In a recent example I had 222 JSON objects and 74 pipelines; when initially deploying, each pipeline (after solution validation) took ~10 seconds. Once published and then updated, this increased to ~40 seconds per pipeline.

 

74 x 10 seconds = 740 seconds, roughly 12 minutes 20 seconds in total.

74 x 40 seconds = 2,960 seconds, roughly 49 minutes 20 seconds (I gave up after 40 minutes).

 

This can be resolved by adding a step into your automation which removes your ADF instance before publishing a large number of changes (see point 3, on PowerShell, and the sketch below).
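
As a rough sketch of that automation step - assuming the ADF v1 AzureRM.DataFactories cmdlets and an authenticated session, with the resource group, factory name, and folder layout all placeholders of my own - something like this drops and recreates the factory, then redeploys every JSON object from source control:

```powershell
# Sketch only: $rg, $df and the folder paths are placeholders for your own values.
$rg = "MyResourceGroup"
$df = "MyDataFactory"

# Drop the existing factory so publishing starts from a clean slate.
Remove-AzureRmDataFactory -ResourceGroupName $rg -Name $df -Force

# Recreate it empty.
New-AzureRmDataFactory -ResourceGroupName $rg -Name $df -Location "North Europe"

# Redeploy every JSON object. Order matters: linked services, then datasets, then pipelines.
Get-ChildItem ".\LinkedServices\*.json" | ForEach-Object {
    New-AzureRmDataFactoryLinkedService -ResourceGroupName $rg -DataFactoryName $df -File $_.FullName -Force
}
Get-ChildItem ".\Datasets\*.json" | ForEach-Object {
    New-AzureRmDataFactoryDataset -ResourceGroupName $rg -DataFactoryName $df -File $_.FullName -Force
}
Get-ChildItem ".\Pipelines\*.json" | ForEach-Object {
    New-AzureRmDataFactoryPipeline -ResourceGroupName $rg -DataFactoryName $df -File $_.FullName -Force
}
```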

 

  3. Don’t neglect PowerShell. PowerShell is a great way to automate the development of JSON objects - especially when you need to create a few hundred (a sketch of the idea follows)! I have a blog coming in the next few weeks detailing how to do this.
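
As a taste of that approach - the template file name, the <<TableName>> token, and the table list are all made up for this example - a dataset template with a placeholder token can be stamped out once per table:

```powershell
# Sketch only: generate one dataset JSON per table from a shared template.
# "dataset_template.json" and the <<TableName>> token are assumptions for this example.
$tables = "Person", "Account", "Sales"   # your source tables

foreach ($table in $tables) {
    (Get-Content ".\dataset_template.json" -Raw) -replace "<<TableName>>", $table |
        Set-Content ".\Datasets\DS_ASQL_$table.json"
}
```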

 

* More best practices will be added.

 

Naming conventions

 

There are a few standard naming rules which apply to all elements in Azure Data Factory.

 

  1. Data factory names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character.
  2. Maximum number of characters in a table name: 260.
  3. Linked service, table, and pipeline names must start with a letter, a number, or an underscore (_).
  4. The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\".

(Microsoft, 2017 [https://docs.microsoft.com/en-us/azure/data-factory/data-factory-naming-rules])
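
If you are generating objects from PowerShell anyway, a small helper like the one below can check rules 3 and 4 before a JSON file is ever written. Test-AdfObjectName is a hypothetical function of my own, not an Azure cmdlet:

```powershell
# Sketch: validate a linked service/dataset/pipeline name against rules 3 and 4 above.
function Test-AdfObjectName {
    param([string]$Name)

    # Rule 3: must start with a letter, a number, or an underscore.
    if ($Name -notmatch '^[A-Za-z0-9_]') { return $false }

    # Rule 4: none of the forbidden characters anywhere in the name.
    if ($Name -match '[.+?/<>*%&:\\]') { return $false }

    return $true
}

Test-AdfObjectName "LS_ASQL_Example"   # True
Test-AdfObjectName "LS_ASQL:Example"   # False
```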

 

Linked services & Datasets

A linked service connects a source or destination (sink) data store to the data factory, and each dataset is defined over a linked service; it stands to reason that the two should follow the same naming scheme. Rather than having two separate lists, the table below has a column for both linked services and datasets.

 

| Type | Linked service | Abbreviation | Linked service prefix | Dataset prefix | Full example |
| --- | --- | --- | --- | --- | --- |
| Azure | Azure Blob storage | ABLB_ | LS_ABLB_ | DS_ABLB_ | LS_ABLB_Example |
| | Azure Data Lake Store | ADLS_ | LS_ADLS_ | DS_ADLS_ | LS_ADLS_Example |
| | Azure SQL Database | ASQL_ | LS_ASQL_ | DS_ASQL_ | LS_ASQL_Example |
| | Azure SQL Data Warehouse | ASDW_ | LS_ASDW_ | DS_ASDW_ | LS_ASDW_Example |
| | Azure Table storage | ATBL_ | LS_ATBL_ | DS_ATBL_ | LS_ATBL_Example |
| | Azure DocumentDB | ADOC_ | LS_ADOC_ | DS_ADOC_ | LS_ADOC_Example |
| | Azure Search Index | ASER_ | LS_ASER_ | DS_ASER_ | LS_ASER_Example |
| Databases | SQL Server* | SQL_ | LS_SQL_ | DS_SQL_ | LS_SQL_Example |
| | Oracle* | ORAC_ | LS_ORAC_ | DS_ORAC_ | LS_ORAC_Example |
| | MySQL* | MYSQ_ | LS_MYSQ_ | DS_MYSQ_ | LS_MYSQ_Example |
| | DB2* | DB2_ | LS_DB2_ | DS_DB2_ | LS_DB2_Example |
| | Teradata* | TDAT_ | LS_TDAT_ | DS_TDAT_ | LS_TDAT_Example |
| | PostgreSQL* | POST_ | LS_POST_ | DS_POST_ | LS_POST_Example |
| | Sybase* | SYBA_ | LS_SYBA_ | DS_SYBA_ | LS_SYBA_Example |
| | Cassandra* | CASS_ | LS_CASS_ | DS_CASS_ | LS_CASS_Example |
| | MongoDB* | MONG_ | LS_MONG_ | DS_MONG_ | LS_MONG_Example |
| | Amazon Redshift | ARED_ | LS_ARED_ | DS_ARED_ | LS_ARED_Example |
| File | File System* | FILE_ | LS_FILE_ | DS_FILE_ | LS_FILE_Example |
| | HDFS* | HDFS_ | LS_HDFS_ | DS_HDFS_ | LS_HDFS_Example |
| | Amazon S3 | AMS3_ | LS_AMS3_ | DS_AMS3_ | LS_AMS3_Example |
| | FTP | FTP_ | LS_FTP_ | DS_FTP_ | LS_FTP_Example |
| Others | Salesforce | SAFC_ | LS_SAFC_ | DS_SAFC_ | LS_SAFC_Example |
| | Generic ODBC* | ODBC_ | LS_ODBC_ | DS_ODBC_ | LS_ODBC_Example |
| | Generic OData | ODAT_ | LS_ODAT_ | DS_ODAT_ | LS_ODAT_Example |
| | Web Table (table from HTML) | WEBT_ | LS_WEBT_ | DS_WEBT_ | LS_WEBT_Example |
| | GE Historian* | GEHI_ | LS_GEHI_ | DS_GEHI_ | LS_GEHI_Example |

* These stores typically sit on-premises and, in ADF v1, connect through the Data Management Gateway.
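
To make the convention concrete, here is a sketch of an Azure SQL Database linked service named per the LS_ASQL_ prefix and deployed with the ADF v1 cmdlet. The connection details, file path, resource group, and factory name are all dummies:

```powershell
$rg = "MyResourceGroup"
$df = "MyDataFactory"

# A linked service JSON following the LS_ASQL_ convention (dummy connection string).
$lsJson = @'
{
    "name": "LS_ASQL_Example",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=mydb;User ID=myuser;Password=<password>;Encrypt=True"
        }
    }
}
'@

Set-Content ".\LinkedServices\LS_ASQL_Example.json" $lsJson
New-AzureRmDataFactoryLinkedService -ResourceGroupName $rg -DataFactoryName $df `
    -File ".\LinkedServices\LS_ASQL_Example.json" -Force
```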

 

Pipelines

Pipelines are slightly different in that there are two types to name: the copy (data movement) activity and the data transformation activity.

 

| Type | Name prefix | Action | Example |
| --- | --- | --- | --- |
| Data movement activity | PL_DATA_ | NA | PL_DATA_DS_SQL_Person_To_DS_ABLB_Person |
| Data transformation activity | PL_TRAN_ | SPRC - Stored Procedure | PL_TRAN_SPRC_CleanDimAccount |
| | PL_TRAN_ | DNET - .NET custom activity | PL_TRAN_DNET_AggregateSales |
| | PL_TRAN_ | ADLK - Azure Data Lake Analytics | PL_TRAN_ADLK_AggregateSales |
| | PL_TRAN_ | HIVE - Hive | PL_TRAN_HIVE_AggregateSales |
| | PL_TRAN_ | PIG - Pig | PL_TRAN_PIG_AggregateSales |
| | PL_TRAN_ | MAPR - MapReduce | PL_TRAN_MAPR_AggregateSales |
| | PL_TRAN_ | HADP - Hadoop Streaming | PL_TRAN_HADP_StreamData |
| | PL_TRAN_ | AML - Azure Machine Learning | PL_TRAN_AML_CalculateMonthlyChurn |
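
Name composition itself can be automated so the convention is applied consistently. New-AdfPipelineName below is a hypothetical helper of my own for this post, not anything in the Azure SDK:

```powershell
# Sketch: compose pipeline names per the table above.
function New-AdfPipelineName {
    param(
        [ValidateSet("DATA", "TRAN")] [string]$Type,
        [string]$Action,      # e.g. SPRC, HIVE; ignored for DATA pipelines
        [string]$Description  # e.g. CleanDimAccount, or DS_SQL_Person_To_DS_ABLB_Person
    )
    if ($Type -eq "DATA") { return "PL_DATA_$Description" }
    return "PL_TRAN_${Action}_$Description"
}

New-AdfPipelineName -Type DATA -Description "DS_SQL_Person_To_DS_ABLB_Person"
# PL_DATA_DS_SQL_Person_To_DS_ABLB_Person
New-AdfPipelineName -Type TRAN -Action SPRC -Description "CleanDimAccount"
# PL_TRAN_SPRC_CleanDimAccount
```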

 

If you have any recommendations, please leave a comment or send me a tweet @SQLShark.
