
Matt How's Blog

Optimising Lookups with Caching

The Lookup transform must be one of the most used transformations in SQL Server Integration Services (SSIS). Pretty much every ETL routine will make use of the component at some point, especially when loading fact tables. Given the nature of fact and dimension tables, these lookups can be tasked with holding large amounts of data in memory to perform high-speed lookups, and there are occasions when a few optimisations can greatly improve the efficiency of your ETL.

 

Choosing the Best Cache Option

One of the most important options featured on the Lookup Transform is the Cache Mode. By understanding the differences between these choices, you can be sure to pick the option that best suits your ETL needs.


 

Using the Full Cache Option (Default)

The Full Cache option tells SSIS to load all of your reference data into memory and then perform its lookups against that data. This has the benefit of only hitting the database once to fetch the data, but can cause a delay on package start-up as all the lookup data has to be loaded before the package executes. You can optimise this by using a SQL command to fetch only the columns you need but, as you can imagine, a dimension with over 1 million rows will take some time to cache, and this is before you have even done any lookups!

Be aware that if you cannot fit your entire reference set into memory your package will FAIL!! SSIS cannot spool data out to disk if Full Cache is chosen and there is no graceful way of handling this error either. Luckily there is a great blog on how to calculate the cache size here.

Use this option when:

  • You want to avoid excessive traffic to your database or it is remote
  • You have a small reference table that can be cached quickly
  • You can fit your whole lookup dataset into memory
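
The full-cache behaviour can be sketched in a few lines of Python, with a dictionary standing in for SSIS's in-memory cache. The table and key names here are made up for illustration only:

```python
# Full-cache lookup: load ALL reference rows up front, then answer every
# probe from memory -- the database is hit exactly once, before execution.

# Stand-in for "SELECT CustomerAltKey, CustomerKey FROM DimCustomer"
reference_rows = [("CUST-001", 1), ("CUST-002", 2), ("CUST-003", 3)]

# Charge the cache once, before any rows flow through the lookup.
cache = {alt_key: surrogate_key for alt_key, surrogate_key in reference_rows}

def lookup(alt_key):
    """Return the surrogate key, or None for a non-match (no DB call)."""
    return cache.get(alt_key)

print(lookup("CUST-002"))  # 2
print(lookup("CUST-999"))  # None
```

Note that, just as with the real Full Cache option, everything in `reference_rows` must fit in memory before the first lookup can happen.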

 

Using the Partial Cache Option

The Partial Cache is a good halfway house. This option tells SSIS to fetch the data at runtime, meaning a quicker start-up but also a lot more queries to the database. Any value not currently in the cache is looked up in the database and then stored in the cache, making it available for future lookups. The efficiency here comes from not having to charge the cache on package start-up, and from having any previously fetched items stored in memory. One very specific benefit of the partial cache is that if your lookup table is being written to during the data flow, a full cache wouldn’t see those rows, as it only queries the database once prior to execution. The partial cache, however, works perfectly here as it will see all the rows in the table at runtime.

Additional performance can be gained by making use of the Miss Cache. This option (found on the advanced page) allows a portion of the cache to be reserved for rows that don’t have a match in the database. SSIS will then know not to look for these rows again, thereby saving any unnecessary queries.
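
The partial cache and miss cache together can be sketched as below. This is an illustrative Python model of the behaviour, not real SSIS internals; the table contents and key names are invented:

```python
# Partial-cache lookup with a miss cache: the database is only queried
# for keys we haven't seen before, and known misses are remembered too,
# so repeated non-matches don't keep generating round trips.

db_table = {"CUST-001": 1, "CUST-002": 2}  # stand-in for the reference table
db_queries = 0                              # count round trips to show the saving

hit_cache = {}
miss_cache = set()

def lookup(alt_key):
    global db_queries
    if alt_key in hit_cache:
        return hit_cache[alt_key]
    if alt_key in miss_cache:
        return None                  # known miss -- no query needed
    db_queries += 1                  # simulate a round trip to the database
    result = db_table.get(alt_key)
    if result is None:
        miss_cache.add(alt_key)      # remember the miss for next time
    else:
        hit_cache[alt_key] = result
    return result

for key in ["CUST-001", "CUST-001", "CUST-999", "CUST-999"]:
    lookup(key)
print(db_queries)  # 2 -- four probes, but only two reached the database
```

Without the miss cache, the two probes for "CUST-999" would each have queried the database.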

Use this option when:

  • You have a large reference set that isn't worth charging a full cache for
  • You want to optimise your cache for the most frequently accessed rows
  • Your reference table is being updated within your data flow

 

Using the No Cache Option

Finally, No Cache literally means that no data will be cached. This means that every single row will be looked up against the database, even if that row has already been looked up before.

Use this option when:

  • You only need to process a small number of rows
  • You have severe memory limitations
  • For some reason, the partial cache wasn’t a better option

 

The Cascading Lookup Pattern

A nice way of optimising your lookups is to combine a full and partial cache creating what is known as a Cascading Lookup. The idea here is to use the full cache to store a subset of your reference data that is frequently looked up and then a partial cache for the rest. This works because the full cache shouldn’t take too long to charge, as it’s only a subset of the data, but will still have the benefits of being in memory. The partial cache will be used for the larger, non-frequently looked up data as it will have no charging time but can also make use of the cache to avoid querying the database twice for the same lookup value. An added benefit is that if the reference table is written to during the data flow, these rows will be caught by the partial cache lookup.

In this example, we have a large customer dimension, but we know that only 500 of those customers make a lot of purchases and so we can use a cascading lookup to optimise this dataflow.


  1. Firstly, choose the Full Cache option. Then use a select statement to get only the data that is most looked up against. In this case, the 500 most loyal customers. Remember to redirect non-matching rows to the non-matching output!
  2. Now choose the Partial Cache option and then use a select statement to get the rest of the dataset – excluding the loyal customers.
  3. Finally, union the rows back together so that you have a single result set.
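
The three steps above can be sketched in Python as follows. The dictionaries stand in for the two caches and the remaining reference rows; all names and values are illustrative, not real SSIS objects:

```python
# Cascading lookup: a small full cache for the hot rows, with non-matches
# cascading into a partial-cache lookup over the rest of the dimension.

loyal_cache = {"CUST-001": 1}    # charged once up front: the ~500 loyal customers
rest_of_dim = {"CUST-777": 77}   # stand-in for the remaining reference rows
partial_cache = {}

def cascading_lookup(alt_key):
    # 1. Full-cache lookup over the frequently hit subset.
    if alt_key in loyal_cache:
        return loyal_cache[alt_key]
    # 2. Non-matching rows fall through to the partial-cache lookup,
    #    which only queries the database for keys it hasn't seen.
    if alt_key not in partial_cache:
        partial_cache[alt_key] = rest_of_dim.get(alt_key)  # simulated DB query
    return partial_cache[alt_key]

# 3. Both paths return into one result set (the union in the data flow).
print([cascading_lookup(k) for k in ["CUST-001", "CUST-777"]])  # [1, 77]
```

The hot path never touches the database after start-up, while the cold path pays no charging cost at all.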

 

The Cache Connection Manager

Another option that is available is to use the Cache Connection Manager. This is akin to having a persisted full cache that can be referenced several times throughout your package. Better yet, you can convert a package cache connection manager to a project cache connection manager so that it can be referenced by other packages in the project! This offers multiple benefits including:

  • Lookup caches can be charged in parallel to speed up package start-up
  • Caches can be loaded at any time (as long as it is before the lookups need them), so if you need to load the dimension during your ETL, you can
  • Caches only need to be charged once for your whole ETL project, meaning you can avoid similar lookups querying the same data twice

The way to implement a Cache Connection Manager is seriously easy!

  • To start you need to create a data flow task and then drag a data source onto the design surface.
  • Below that, add a Cache Transform and connect the two up. Once you have configured your data source (remembering to select only the columns you absolutely need!), open the editor for the cache transform. Where prompted for a cache connection manager, select New.
  • On the General page, you can add a name and description.
  • On the Columns page, you will need to do some configuration. For the column(s) you want to look up on, you should add an Index Position. The column with the lowest Index Position greater than 0 is looked up first, followed by any subsequent columns with an Index Position greater than 0. Any columns left at index 0 are the columns you want to pull into the data flow; in this case that is the CustomerKey. You can also set data types here.


  • Once that’s complete you can map the columns in the cache transform.
  • Finally, you will now see a new connection manager in your Connection Managers window. Right click on the new cache connection manager and click Convert to Project.
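
The Index Position rules above can be mimicked in a short Python sketch. The column names are made up for illustration; the point is how the indexed columns form a composite key while index-0 columns become the returned values:

```python
# Mimicking the Cache Connection Manager's Index Position rules:
# columns with Index Position > 0 form the lookup key (lowest index first);
# columns left at 0 (here CustomerKey) are returned into the data flow.

columns = ["CustomerAltKey", "SourceSystem", "CustomerKey"]
index_positions = {"CustomerAltKey": 1, "SourceSystem": 2, "CustomerKey": 0}

key_columns = sorted(
    (c for c in columns if index_positions[c] > 0),
    key=lambda c: index_positions[c],
)
value_columns = [c for c in columns if index_positions[c] == 0]

rows = [{"CustomerAltKey": "CUST-001", "SourceSystem": "CRM", "CustomerKey": 1}]

# Build the cache: composite key tuple -> the columns to return.
cache = {
    tuple(r[c] for c in key_columns): {c: r[c] for c in value_columns}
    for r in rows
}
print(cache[("CUST-001", "CRM")])  # {'CustomerKey': 1}
```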

An implementation of a cache connection manager might have a parent package in which each “Populate XXXXXX Cache” task contains a data flow that extracts source data and puts it into a cache transform. The fact table load packages further down can then reference all the dimensional data using the cache connection managers.


In conclusion, there is a lot of optimisation that can be done with regards to lookups and their caching options. Not all of these need to be implemented all of the time; it is more of a case-by-case decision, but knowing that the options exist is half the battle.

Power BI Mobile Feature Review

In March Microsoft released Deep Touch integration for iOS amongst several other improvements to the Power BI Mobile application. This blog will look at a few of those features and examine some of the areas that still need work.


So far it seems Microsoft are doing a pretty good job of developing Power BI for the mobile platform, and this is most apparent when they exploit some of the built-in functionality that makes mobiles so handy. One of the best features of the latest iOS is 3D Touch, and Power BI has grabbed this bull by the horns. Using a deep press, you can launch a pop-up menu offering some of the most useful features, such as search and notifications, as well as quick access to your recently accessed dashboards.

 

Another big issue that the Power BI team have tackled head on is how to make rich visualisations mobile-optimised. For this they have two solutions, the first being the desktop and mobile view options within Power BI Desktop.

Desktop view of Report

Mobile view of Report


 
The mobile view essentially deconstructs your report and lets you drag and drop your visualisations into the mobile template. By default, this view is always displayed when viewing the report on a mobile device, unless you rotate the device into landscape mode, in which case the desktop version loads. I have mixed feelings about this feature. On the one hand, I like that I can see both views, but if the report were to remain in mobile view and expand to fill the horizontal space as well, this could open up a lot more opportunities for mobile reporting.

 
However, despite gaining some control over how the report looks on a mobile device, there are some pretty major limitations for the time being. Firstly, you cannot specify different properties, such as text size, for the desktop and mobile views. This means that you need to consider both views when creating a report that will be used on both mobile and desktop, otherwise some visuals will be sacrificed (as with the deliveries and fragile items cards).


Another drawback is that each visual element has to be snapped to the prescribed grid, and this includes the transparent grab handle/border that is used to select the visual. This border is half a grid square tall in the mobile view, which means that you get a lot of white space, something you want to reduce in a mobile view.

 
Finally, visuals cannot be overlaid. Where I have circles around some of my cards in the desktop view, this is not possible in the mobile view. Whilst you can add an image or logo, you cannot make use of any backgrounds, whether they be an image or just a flat colour.
Thankfully, all custom visuals will work in mobile view, and any cross filtering, auto play or auto zoom features (maps) are preserved perfectly.
 
Microsoft’s second solution is focussed around dashboards. From within the Power BI service you can arrange your pinned visuals into a mobile dashboard by switching the view.

 

However, the best part is that if you access a dashboard that doesn’t already have a mobile view, the app will automatically optimise the visuals into a mobile view so you don’t have to!


One of Power BI’s most notable features is Q&A, a method of querying your data using natural language. Using a recognisable messenger format, this feature is really well replicated in the mobile app and adds another layer of accessibility for non-techie, non-deskbound users.

 

A relatively new feature of Power BI is the ability to add custom links to a dashboard visual. This can be utilised quite nicely in the mobile app as it makes use of the deep linking technology in iOS, so it can launch a relevant app instead of just a web page. Here I have set a Google Maps URL as the custom URL for the map visual. Clicking on this in the mobile app launches my Google Maps app, not a webpage!

 

Overall I can see that the Power BI mobile team have not tried to just recreate the desktop version but have embraced the mobile platform and have made use of existing features within iOS to present a feature rich app that has a familiar feel to it. Whilst there are some limitations, my next blog will look at how to create a mobile optimised report right from the start so that your users can benefit from mobile BI straight away!

Extracting Data from Salesforce using SSIS

This blog looks at a couple of methods of extracting data from Salesforce and gives a demo of how to output tables and queries into flat files.

 

Salesforce is HUGE! How do I find what I’m looking for?

Thankfully, SFDC makes it easy for us to find the tables that hold custom objects, as they are suffixed with “__c” (e.g. “Assessment_Questions__c”). This convention carries through to column level as well, meaning that when a Salesforce standard table is used, such as SFDC's Event table, any custom columns are also suffixed with “__c”.

 

Can I use SQL?

Salesforce utilises a modified version of SQL called SOQL (Salesforce Object Query Language). There are some keyword similarities, such as starting a query with “SELECT” and calling a table with “FROM”, but if you want to limit a query result to a fixed number of rows you need to add “LIMIT n” to the end of the query string, not the start!

Additionally, SFDC provide some handy functions to make SOQL slightly easier on us developers. I have listed a few that I used below but a full documentation site is available here - https://help.salesforce.com/articleView?id=customize_functions.htm&type=0

  • LAST_YEAR: Provides a date range from Jan 1st of the previous year to Dec 31st of the previous year. Example: SELECT AccountId FROM Customers WHERE CreatedDate > LAST_YEAR
  • LAST_N_MONTHS:n: Provides a date range starting at the last day of the previous month and continuing back for the past n months. Example: SELECT Id FROM Account WHERE CreatedDate > LAST_N_MONTHS:12
  • BLANKVALUE: Similar to ISNULL; will provide a substitute value if the field is null. Example: SELECT BLANKVALUE(AddressLine2,'N/A') FROM…
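
The row-limiting difference is easy to trip over when coming from T-SQL (TOP n at the start versus LIMIT n at the end). A small helper sketch, written in Python purely for illustration; the object and field names are made up:

```python
# SOQL puts the row limit at the END of the statement ("LIMIT n"),
# unlike T-SQL's "SELECT TOP n ..." at the start.

def build_soql(fields, sobject, where=None, limit=None):
    """Assemble a SOQL string; the sobject/field names passed in are
    illustrative and are not validated against Salesforce metadata."""
    query = f"SELECT {', '.join(fields)} FROM {sobject}"
    if where:
        query += f" WHERE {where}"
    if limit is not None:
        query += f" LIMIT {limit}"  # goes last, not first
    return query

print(build_soql(["Id", "Name"], "Account",
                 where="CreatedDate > LAST_N_MONTHS:12", limit=10))
# SELECT Id, Name FROM Account WHERE CreatedDate > LAST_N_MONTHS:12 LIMIT 10
```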

 

How do I connect to Salesforce?

In order to access the Salesforce system, you will need a username, password and security token. From my research there are two viable methods for connecting to Salesforce using SSIS: an API component or a linked server. Both have their pros and cons, but generally I have found the API route to be more efficient. There are a number of third-party providers around and I have listed a few of these below.

  • CData (API): http://www.cdata.com/drivers/salesforce/ssis/
  • Task Factory (API): Forces you to download all 50 components. Can use SSIS variables and SOQL. http://pragmaticworks.com/Products/Task-Factory/Free-Trial
  • Kingsway Soft (API): Simple interface and can use SSIS variables. Auto-generates SOQL. Can't edit error output from the main component. http://www.kingswaysoft.com/products/ssis-integration-toolkit-for-salesforce/download
  • DBAmp (Linked Server): Allows you to use an OLE DB connection manager. http://forceamp.com/trial.htm

 

You can use the steps below to access SFDC and start retrieving data.

1. Download an SFDC SSIS component. For this demo I have used the Kingsway Soft component, available from the link above.

 

2. Run through the install wizard and, once complete, open Visual Studio.

3. Create a new SSIS project and add a Data Flow Task. Rename this to Kingsway Soft Demo.

 


 

4. Go into the data flow, right click in the toolbox and hit “Refresh Toolbox”. You should now see two new components called “Salesforce Source” and “Salesforce Destination”.

 


 

5. Drag a Salesforce Source onto the design surface and double click to edit the properties. Drop down the “Connection Manager” list and select “New”, then fill out the details required using the connection info below.

 

Username: mih@adatis.co.uk

Password: S4lesforce

Security Token: oKar72dhKcBAWlq0i4M0RF7ua 

 


 

6. Test the connection and click “OK” once it has connected successfully. From the component properties screen, use the Source Type to select “Object Query”. Add your SOQL statement into the text editor below and click “OK”.

 


TIP!! If you select “Object” from the Source Type drop down and choose your table, you can then switch the source type back to “Object Query” and the full SOQL statement to retrieve all columns will be generated automatically.

 

7. Add a flat file destination and configure the connection manager. Enable the data viewer between your SFDC component and the flat file destination and preview the data that is being extracted from SFDC.

 


 

8. From here you can then repeat the process to build up a package that will bring down all the files that you might need for your solution.
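
As an aside, Salesforce's REST API returns each SOQL result record as a dictionary carrying an extra "attributes" entry, which you generally want to strip before writing a flat file. A minimal Python sketch of that flattening step; the sample records are made up for illustration:

```python
import csv
import io

# A SOQL result arrives as dicts with an extra "attributes" entry per
# record; keep only the requested fields when writing the flat file.
records = [
    {"attributes": {"type": "Account"}, "Id": "001A", "Name": "Adatis"},
    {"attributes": {"type": "Account"}, "Id": "001B", "Name": "Contoso"},
]

def records_to_csv(records, field_names):
    """Flatten API-style records into CSV text, dropping extra keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names)
    writer.writeheader()
    for record in records:
        writer.writerow({f: record.get(f) for f in field_names})
    return buffer.getvalue()

print(records_to_csv(records, ["Id", "Name"]))
```

This mirrors what the flat file destination does for you inside the SSIS data flow.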

There are additional tools such as Salesforce Workbench that will allow developers greater access to the underlying solution but these will be covered in a different blog. If this solution does or doesn’t work for you then let me know!