Matt How's Blog

Using ADF V2 Activities in Logic Apps

Logic Apps recently introduced the ability to connect to an Azure Data Factory V2 instance and perform a number of actions, including cancelling a pipeline, invoking a pipeline and interrogating the service for information about a particular pipeline run. This blog will focus on how to utilise the full potential of ADF V2 via a Logic App connector when calling a pipeline. In its current form there is no way to provide input parameters to ADF from Logic Apps, which hamstrings one of the best features of ADF V2! Later in this blog I will show how to get around that. Despite this, there are a number of reasons why a Logic App calling an ADF V2 pipeline with a simple activity is a great thing, and I have gone into some detail below.

Event Driven Execution

The Azure Data Factory V2 team are now starting to bring in event driven triggers, but currently this is limited to the creation or modification of blobs in a blob storage account. This is a good start but pales in comparison to the overwhelming number of events that can trigger a Logic App, notably things like Service Bus, Event Grid, HTTP calls etc etc… the list goes on! More info on Logic App connectors here. By utilising Logic Apps as a wrapper for your ADF V2 pipelines you can open up a huge number of opportunities to diversify what triggers a pipeline run.

Simplifying Loops, Conditionals and Failure Paths

In addition to event driven triggers, the ADF team have also brought in an IF activity and a number of looping activities which are really useful in a lot of scenarios. However, this kind of logic can be simplified when it's built in Logic Apps. The visual editor in Logic Apps makes understanding the flow of a loop or an IF much simpler for those not so familiar with the business logic that's been implemented. Additionally, failure paths can be handled much more efficiently and clearly, with many more options for logging or notifying of failure right out of the box – e.g. the Send Email action.

Heavy lifting of Data

Logic Apps, for all their benefits, are definitely not heavy lifters of data. They really excel at lightweight messaging and orchestration, whereas Data Factory is great at moving big chunks of data when using the Copy activity. When you pair the two together you get something that resembles SSIS Control Flow (Logic Apps) and SSIS Data Flow (Data Factory). Now that they can be closely and easily integrated, orchestrating the logical flow and movement of data in the cloud becomes much simpler.

So now we know that pairing Logic Apps and Data Factory V2 is a great idea, let's look at how to do it. Understandably there are no triggers from Data Factory at this point, so you will need to trigger your Logic App in any one of the million ways that are available. Once you have your trigger sorted you can search for the Data Factory connector and choose the action you need. You will then need to connect to the Data Factory service as below by logging into your Azure tenant.

image

Once logged in you will need to supply a few details to locate the Data Factory instance and pipeline as below.

image

Now that you have completed this you can trigger your Logic App and you will see that your Data Factory pipeline is invoked as normal. However, by using only this method we are depriving ourselves of the ability to derive parameter values outside of Data Factory and pass them in at execution time. Currently there is nowhere to supply the input data! This is where the blog gets a bit hacky and we will venture into the Code View of the Logic App. [Side note: you can usually get around a lot of tricky situations by manipulating the code in Logic Apps rather than just relying on what the UI gives you.] Click the "Code View" button in the designer and then locate your Data Factory V2 activity definition in the code page. It will look something like this (I have whited out my subscription id).

image

In order to pass data into this Data Factory pipeline, all we need to do is add a "body" attribute within the "inputs" object, as in the image below. The property names inside "body" should match the names of the pipeline's input parameters, and for the values we can reference any of the variables available in the Logic App, or any other value that might be available to us. In case you're wondering, these inputs need to be configured as input parameters on the Data Factory pipeline you will be calling. More on how to do that here

image

Once you have updated your code accordingly, save and run your Logic App and you should see the pipeline invoked with your inputs in the Data Factory service, as below.

image

In conclusion, the marriage of Logic Apps and Data Factory is a happy and harmonious one with many benefits to be exploited. Now armed with the ability to pass input parameters to pipelines called from Logic Apps, you can cater for a vast number of data integration scenarios that require complex but clear logic and heavy lifting of data. Any questions or comments, please leave them below or catch me on Twitter @MattTheHow.

Optimising Lookups with Caching

One of the most used transformations in SQL Server Integration Services (SSIS), if not the most used, must be the Lookup transform. Pretty much every ETL routine will make use of the component at some point, especially when loading fact tables. Given the nature of fact and dimension tables, these lookups can be tasked with loading large amounts of data into memory to perform high speed lookups. There are occasions when a few optimisations on these lookups can greatly improve the efficiency of your ETL.

 

Choosing the Best Cache Option

One of the most important options featured on the Lookup Transform is the Cache Mode. By understanding the differences between these choices, you can be sure to pick the option that best suits your ETL needs.

image

 

Using the Full Cache Option (Default)

The Full Cache option tells SSIS to load all of your reference data into memory and then perform its lookups against that data. This has the benefit of only hitting the database once to fetch the data, but can cause a delay on package start-up as all the lookup data has to be loaded prior to executing the package. You can optimise this by using a SQL command to fetch only the columns you need, but as you can imagine, a dimension with more than 1 million rows will take some time to cache – and this is before you have even done any lookups!
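As a rough illustration (the table and column names here are hypothetical), the lookup's SQL command might return just the business key and surrogate key rather than selecting every column of the dimension:

-- Fetch only the columns the lookup actually needs, rather than caching the whole dimension
SELECT
    CustomerKey,            -- surrogate key to bring into the data flow
    CustomerAlternateKey    -- business key to join on
FROM dbo.DimCustomer;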

Be aware that if you cannot fit your entire reference set into memory your package will FAIL!! SSIS cannot spool data out to disk if Full Cache is chosen and there is no graceful way of handling this error either. Luckily there is a great blog on how to calculate the cache size here.
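If you want a quick sanity check before relying on a full cache, something like the following gives a ballpark figure (a sketch only – the table and columns are hypothetical, and the real in-memory cache size includes extra overhead beyond the raw byte count):

-- Ballpark estimate of the data volume a full cache would need to hold:
-- row count multiplied by the average byte size of the cached columns
SELECT
    COUNT(*) AS RowsToCache,
    COUNT(*) * AVG(DATALENGTH(CustomerKey) + DATALENGTH(CustomerAlternateKey)) AS ApproxCacheBytes
FROM dbo.DimCustomer;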

Use this option when:

  • You want to avoid excessive traffic to your database, or the database is remote
  • You have a small reference table that can be cached quickly
  • You can fit your whole lookup dataset into memory

 

Using the Partial Cache Option

The Partial Cache is a good halfway house. This option tells SSIS to fetch the data at runtime, meaning a quicker start-up time but also a lot more queries to the database. Any value not currently in the cache will be queried for in the database and then stored in the cache, making it available for future lookups. The efficiency here comes from not having to charge the cache on package start-up, and also from having any previously fetched items stored in memory. One very specific benefit of the partial cache is that if your lookup table is being written to during the data flow, the full cache wouldn't see those rows, as it only queries the database once, prior to execution. The partial cache, however, works perfectly here as it will see all the rows in the table at runtime.

Additional performance can be gained by making use of the Miss Cache. This option (found on the advanced page) allows a portion of the cache to be reserved for rows that don’t have a match in the database. SSIS will then know not to look for these rows again, thereby saving any unnecessary queries.

Use this option when:

  • You have a small reference set not worth charging a full cache for
  • You want to optimise your cache for more frequently accessed rows.
  • Your reference table is being updated within your dataflow.

 

Using the No Cache Option

Finally, No Cache literally means that no data will be cached. This means that every single row will be looked up against the database, even if that row has already been looked up before.

Use this option when:

  • You only need to process a small number of rows
  • You have severe memory limitations
  • For some reason, the partial cache wasn’t a better option

 

The Cascading Lookup Pattern

A nice way of optimising your lookups is to combine a full and partial cache creating what is known as a Cascading Lookup. The idea here is to use the full cache to store a subset of your reference data that is frequently looked up and then a partial cache for the rest. This works because the full cache shouldn’t take too long to charge, as it’s only a subset of the data, but will still have the benefits of being in memory. The partial cache will be used for the larger, non-frequently looked up data as it will have no charging time but can also make use of the cache to avoid querying the database twice for the same lookup value. An added benefit is that if the reference table is written to during the data flow, these rows will be caught by the partial cache lookup.

In this example, we have a large customer dimension, but we know that only 500 of those customers make a lot of purchases and so we can use a cascading lookup to optimise this dataflow.

image 

  1. Firstly, choose the Full Cache option. Then use a select statement to get only the data that is most looked up against – in this case, the 500 most loyal customers (see the SQL sketch after these steps). Remember to redirect non-matching rows to the non-matching output!
  2. Now choose the Partial Cache option and use a select statement to get the rest of the dataset, excluding the loyal customers.
  3. Finally, union the rows back together so that you have a single result set.
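As a rough sketch of the two select statements (the DimCustomer table and the IsLoyalCustomer flag are hypothetical – use whatever identifies your frequently looked-up subset):

-- Full cache lookup: only the small, frequently looked-up subset of customers
SELECT CustomerKey, CustomerAlternateKey
FROM dbo.DimCustomer
WHERE IsLoyalCustomer = 1;

-- Partial cache lookup: everyone else, fetched and cached on demand at runtime
SELECT CustomerKey, CustomerAlternateKey
FROM dbo.DimCustomer
WHERE IsLoyalCustomer = 0;

The important part is that the two result sets don't overlap, so each customer can only be matched by one of the two lookups.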

 

The Cache Connection Manager

Another option that is available is to use the Cache Connection Manager. This is akin to having a persisted full cache that can be referenced several times throughout your package. Better yet, you can convert a package cache connection manager to a project cache connection manager so that it can be referenced by other packages in the project! This offers multiple benefits including:

  • Charging your lookup caches in parallel to speed up package start-up
  • Caches can be loaded at any time (although this must be before the lookups are needed), so if you need to load the dimension during your ETL, you can.
  • Caches will only need to be charged once for your whole ETL project, meaning you can avoid similar lookups querying the same data twice.

The way to implement a Cache Connection Manager is seriously easy!

  • To start you need to create a data flow task and then drag a data source onto the design surface.
  • Below that, add a Cache Transform and connect the two up. Once you have configured your data source (remembering to select only the columns you absolutely need!), open the editor for the cache transform. Where prompted for a cache connection manager, select New.
  • On the General page, you can add a name and description.
  • On the Columns page, you will need to do some configuration. For the column or columns you want to look up on, you should add an Index Position. The column with the lowest index position greater than 0 will be looked up first, followed by any other columns with an index position greater than 0. Any columns with an index position of 0 are columns that you want to pull into the data flow – in this case, the CustomerKey. You can also set data types here.

image

  • Once that’s complete you can map the columns in the cache transform.
  • Finally, you will now see a new connection manager in your Connection Managers window. Right click on the new cache connection manager and click Convert to Project.

An implementation of a cache connection manager might have a parent package like the one below. Within each "Populate XXXXXX Cache" task is a data flow that extracts source data and puts it into a cache transform. The fact table load packages below that can then reference all the dimensional data using the cache connection managers.

image

In conclusion, there is a lot of optimisation that can be done with regards to lookups and their caching options. Not all of it needs to be implemented all of the time; it is more of a case-by-case decision, but knowing that the options are there is half the battle.

Extracting Data from Salesforce using SSIS

This blog looks at a couple of methods of extracting data from Salesforce and gives a demo of how to output tables and queries into flat files.

 

Salesforce is HUGE! How do I find what I’m looking for?

Thankfully, SFDC makes it easy for us to find the tables that represent custom objects, as they are suffixed with "__c" (e.g. "Assessment_Questions__c"). This convention carries through to column level as well, meaning that when a standard Salesforce table is used, such as SFDC's Event table, any custom columns are also suffixed with "__c".

 

Can I use SQL?

Salesforce utilises a modified version of SQL called SOQL (Salesforce Object Query Language). There are some keyword similarities, such as starting a query with "SELECT" and calling a table with "FROM", but if you want to limit a query result to a fixed number of rows you need to add "LIMIT n" to the end of the query string, not the start!
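As a quick, purely illustrative example (the custom field shown is hypothetical), a SOQL query against the custom object mentioned above would look something like this – note the "__c" suffixes and the LIMIT clause sitting at the end of the statement:

SELECT Id, Name, Question_Text__c
FROM Assessment_Questions__c
WHERE CreatedDate > LAST_YEAR
LIMIT 10

There is no TOP keyword in SOQL; the row limit is always expressed with LIMIT instead.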

Additionally, SFDC provide some handy functions to make SOQL slightly easier on us developers. I have listed a few that I used below, but full documentation is available here - https://help.salesforce.com/articleView?id=customize_functions.htm&type=0

  • LAST_YEAR – Provides a date range from Jan 1st of the previous year to Dec 31st of the previous year. Example: SELECT AccountId FROM Customers WHERE CreatedDate > LAST_YEAR
  • LAST_N_MONTHS:n – Provides a date range starting from the last day of the previous month and continuing back for the past n months. Example: SELECT Id FROM Account WHERE CreatedDate > LAST_N_MONTHS:12
  • BLANKVALUE – Similar to ISNULL; will provide a substitute value if the field is null. Example: SELECT BLANKVALUE(AddressLine2, "N/A") FROM…

 

How do I connect to Salesforce?

In order to access the Salesforce system, you will need a username, password and security token. From my research there are two viable methods for connecting to Salesforce using SSIS: using an API component or using a linked server. Both have their pros and cons, but generally I have found the API route to be more efficient. There are a number of 3rd party providers around and I have listed a few of these below.

  • CData (API) – http://www.cdata.com/drivers/salesforce/ssis/
  • Task Factory (API) – Forces you to download all 50 components. Can use SSIS variables and SOQL. http://pragmaticworks.com/Products/Task-Factory/Free-Trial
  • Kingsway Soft (API) – Simple interface and can use SSIS variables. Auto-generates SOQL. Can't edit error output from the main component. http://www.kingswaysoft.com/products/ssis-integration-toolkit-for-salesforce/download
  • DBAmp (Linked Server) – Allows you to use the OLE DB Connection Manager. http://forceamp.com/trial.htm

 

You can follow the steps below to access SFDC and start retrieving data.

1. Download an SFDC SSIS component. For this demo I have used the Kingsway Soft component, available from the link above.

 

2. Run through the install wizard and, once complete, open Visual Studio.

3. Create a new SSIS project and add a Data Flow Task. Rename this to Kingsway Soft Demo.

 

image

 

4. Go into the data flow, right click in the toolbox and hit "Refresh Toolbox". You should now see two new components called "Salesforce Destination" and "Salesforce Source".

 

image

 

5. Drag a Salesforce Source onto the design surface and double click to edit the properties. Drop down the "Connection Manager" list and select "New". Fill out the details required using the connection info below.

 

Username: mih@adatis.co.uk

Password: S4lesforce

Security Token: oKar72dhKcBAWlq0i4M0RF7ua 

 

image

 

6. Test the connection and click "OK" once it has connected successfully. From the component properties screen, use the Source Type drop-down to select "Object Query". Add this SOQL statement into the text editor below, then click "OK".

 

image

TIP!! – If you select "Object" from the Source Type drop-down and choose your table, you can then switch the Source Type back to "Object Query" and the full SOQL statement to retrieve all columns will be generated automatically.
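For a standard object such as Account, the kind of statement you end up with looks something like the trimmed-down example below (illustrative only – the auto-generated version will list every column on the object):

SELECT Id, Name, AccountNumber, CreatedDate
FROM Account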

 

7. Add a flat file destination and configure the connection manager. Enable the data viewer between your SFDC component and the flat file destination and preview the data being extracted from SFDC.

 

image

 

8. From here you can then repeat the process to build up a package that will bring down all the files that you might need for your solution.

There are additional tools, such as Salesforce Workbench, that allow developers greater access to the underlying solution, but these will be covered in a different blog. If this solution does or doesn't work for you then let me know!