Jeremy Kashel

Jeremy Kashel's Blog

Optimise SSIS Fuzzy Lookup Matches

This is a quick post to highlight how making some small changes to the SSIS Fuzzy Lookup settings can make a significant difference to the matches that are returned.

As I’ve mentioned before, SSIS fuzzy matching relies on matching substrings within a string (known as q-grams) in order to produce a result. Both the input string and the reference strings are split up into several overlapping substrings. Simply put, the more substrings, or q-grams, the input string has in common with the reference string, the better the match. For example, a word such as ‘sandwich’ would be split into substrings such as ‘sand’, ‘andw’ and ‘dwic’ in order to aid the matching process.
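
As a rough illustration of the idea (and only an illustration – the Fuzzy Lookup’s internal tokenisation is more sophisticated), the following T-SQL lists the sliding four-character substrings of a word:

--Sketch only: list the consecutive 4-character substrings of a word
--(positions hard-coded for an 8-character word)
DECLARE @word varchar(50) = 'sandwich';

SELECT SUBSTRING(@word, n.Position, 4) AS QGram
FROM   (VALUES (1),(2),(3),(4),(5)) AS n(Position)
WHERE  n.Position <= LEN(@word) - 3;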

I’ve hit an issue recently whereby the substrings being produced were causing false matches – let’s start by going through a quick example, loosely based on Adventure Works data. Here is a package that takes data from a data source, before using a Fuzzy Lookup:

image

I’ve narrowed down the data source to have just two rows for this example, which are:

image

There are also a few so-called Master Products – these are what we want to match against:

image

By default, if we run the SSIS package then the following results will appear in a data viewer after the fuzzy lookup. Note – the lookup is intentionally returning the top 2 matches to illustrate this point.

image

The Source Product column reflects the actual situation I experienced recently, whereby the brand name (Apollo and RS in this case) appears in the column to match. Although this data is made up, it closely follows the situation I experienced, whereby the wrong match is produced in both cases – “Apollo X.C. Large” looks very similar to “Apollo XC”, yet it’s not the top match.

The reason this occurs is the Token Delimiters setting on the Fuzzy Lookup. The default delimiters include both ampersands and full stops, meaning that in the case of ‘Apollo X.C. Large’ the following tokens/substrings are created:

image

Note there is no token created for XC, meaning this will not match well against a substring of ‘XC’. One option is to remove the full stops and ampersands as delimiters as shown below in the advanced tab:

image

This produces the following results in the data viewer:

image

This time, the correct master product per row (namely ‘Apollo XC’ and ‘RS Shorts’) is returned, so the change in settings has made a difference.

Conclusion

This shows the effect that the token delimiters can have on the matching process. Unfortunately, what works in one situation may not work in another, so one option may be to try an initial match, followed by a second match for any low matching rows. Data quality will always have a big impact on matching, so cleaning and enriching data before matching is always a good idea.
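
As a simple example of that kind of pre-cleaning, and as an alternative (or complement) to changing the delimiter settings, the offending characters could be stripped out in the source query before the Fuzzy Lookup runs. A minimal sketch, assuming a hypothetical staging table and column:

--Hypothetical table and column names; strip full stops and ampersands
--from the product name before it reaches the Fuzzy Lookup
SELECT  REPLACE(REPLACE(SourceProduct, '.', ''), '&', ' ') AS SourceProductClean
FROM    dbo.StgSourceProducts;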

Master Data Services Many to Many Relationships

Many to Many Relationships (M2M) are possible in the current version of Master Data Services, but they require a few workarounds. Thankfully, many to many hierarchies are one of the many new features that can be found in Master Data Services 2016. This post will contrast the M2M approach in the current version of MDS, before showing how it’s greatly improved in the current MDS 2016 CTP.

Customer Example – MDS 2014

The simple example that I’m going to show is Customers and Bank Accounts. That is, a customer can have more than one bank account, but a bank account can also have more than one customer, e.g. a joint account or a business account.

First, to show the MDS 2014 approach, there are a few entities to look at, namely Customer and Account:

image

image

Then there is another entity that relates to both Customer and Account, called CustomerAccount, which essentially acts as a bridge table, allowing the M2M relationship to naturally exist in the data:

image
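
To make the shape of the data clearer, this is essentially the classic bridge-table pattern – a query along the following lines (the table and column names here are illustrative, not the physical MDS system tables) resolves the many-to-many relationship:

--Illustrative names only, not the underlying MDS system tables
SELECT      C.CustomerName, A.AccountNumber
FROM        dbo.Customer C
INNER JOIN  dbo.CustomerAccount CA ON CA.CustomerCode = C.CustomerCode
INNER JOIN  dbo.Account A ON A.AccountCode = CA.AccountCode;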

Moving away from the entities, I’ve created two derived hierarchies so that we can look at the data by account or by customer. So here we can see, for example, that Bob is keeping his own personal account, as well as the account with his wife…hmmm:

image

Or looking at this another way, we can see the inverted version of the hierarchy, where we see the Customers as the top level, with some accounts belonging to more than one customer:

image

As you may have spotted, there’s an issue with this approach. Whilst we do see the name of the Customer in the first hierarchy, it’s pointing at a member in the CustomerAccount entity, which is why the code is different each time. Enter SQL Server 2016…

Master Data Services 2016

MDS 2016 does a far better job at visualising the M2M relationship. You still need the link entity, but it will allow you to show the actual Customers and Accounts with their correct codes.

I’ve used the exact same entity setup as MDS 2014, so let’s begin by looking at how we can build a hierarchy that shows customers by accounts. When creating a new Derived Hierarchy, we can see that the initial available levels list is just the same as it would be in MDS 2014:

image

The first step is to drag the Customer entity over to the Current levels section. Once we perform this action, the Available Entities pane looks very different in SQL 2016 than it would in the current version. Now MDS is allowing us to pick the Account entity, but via another entity. Hence it says “mapped via” on the left hand side:

image

Once we’ve finished building the hierarchy and explore it, we can see the same data as before, but this time it’s the actual, correct customer codes that are displayed:

image

The derived hierarchy by Customer also displays the correct customer and account codes:

image

Summary

A derived hierarchy in Master Data Services 2016 now allows the correct display of many to many relationships between two entities. This is one of many new features in MDS 2016, offering quite a number of improvements over the SQL Server 2014 version.

Azure HDInsight Polybase Example

Polybase has now made it over to SQL Server 2016 and is available to use as part of the SQL Server 2016 CTPs. One of its capabilities is to connect to Azure Blob Storage, so this blog post gives an overview of how you can use Polybase to connect to data in Azure HDInsight.

Installing Polybase in SQL Server 2016

Polybase will appear in the standard feature list of the SQL Server 2016 CTP installs, alongside the other database engine features, so it’s just a selection box to install it:

image

Having said that, there’s one small limitation at the moment in the Polybase install, which is that only the SQL_Latin1_General_CP1_CI_AS and Latin1_General_100_CI_AS_KS_WS collations are supported, as explained in the following getting started guide. If you don’t have a supported collation and you try to use Polybase, then you may find that the Polybase Windows service stops.

Azure HDInsight

For this post, I’m just going to use the sensor data sample that comes with Azure HDInsight when you provision a cluster. If you run through the following HDInsight Tutorial, then you can get up and running with an HDInsight cluster and use the samples.

In this case, I’m using the sensor data analysis sample, which comes with a csv file called HVAC.csv – this is a set of temperature readings. The file looks as follows when opened in Excel:

image

Within the Azure portal, if we go to the HDInsight cluster, then we need to note down the access keys that will allow Polybase to connect to Azure:

image

We can then copy one of the access keys, which we’ll need for later:

image

Polybase in SQL Server Management Studio

Over to SQL Server Management Studio now, where we need to create an External Data Source, a File Format and an External Table to access the data in Azure. Before creating these objects, we need to create a credential, where we will need to specify the Azure access key that we copied earlier:

--Enable traceflag for using a database-scoped credential.
DBCC TRACEON(4631,-1);

-- Create a db master key if one does not already exist, using your own password.
CREATE MASTER KEY ENCRYPTION BY PASSWORD='MyP@ssword31';

--Enter the Azure account name and the access key as the secret
CREATE CREDENTIAL AzureCred ON DATABASE WITH IDENTITY = 'accountname', SECRET='accesskeyhere'

Now that we’ve set up the credential, we need to create an external data source, so that Polybase knows where it needs to go to get the data. Notice how we’re referencing the credential that we created earlier:

CREATE EXTERNAL DATA SOURCE AzureDs
WITH (
    TYPE = HADOOP,
    --Specify the container name and account name
    LOCATION = 'wasbs://containerd@accountname.blob.core.windows.net/',
    --Specify the credential that we created earlier
    CREDENTIAL = AzureCred
);

Now we need to specify the format of the file that we want to access. In this case, having looked at the CSV, it’s comma delimited, so the SQL required is as follows:

CREATE EXTERNAL FILE FORMAT CommaFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT, 
    FORMAT_OPTIONS (FIELD_TERMINATOR =',')
);

Finally we need to create an external table – this is the object that we will be able to query with a T-SQL statement in order to consume the data held within the Azure HDInsight cluster. The syntax for this example is:

--Define the table structure
CREATE EXTERNAL TABLE Hvac (
    Date varchar(10),
    Time varchar(10),
    TargetTemp smallint,
    ActualTemp  smallint,
    SystemID  smallint,
    SystemAge  smallint,
    BuildingID  smallint
)
WITH (
        --Set the file to be the HVAC sensor sample file
        LOCATION='/HdiSamples/SensorSampleData/hvac/HVAC.csv',
        DATA_SOURCE = AzureDs,
        FILE_FORMAT = CommaFormat,
        --We will allow the header to be rejected
        REJECT_TYPE = VALUE,
        REJECT_VALUE = 1
    );

Interestingly, you can’t tell Polybase to skip a header record, as confirmed in the following connect issue, so I found that my statement was failing when trying to query the data. The solution is to use the REJECT_TYPE and REJECT_VALUE options, which allow a number of rows to be rejected, as shown above.

Once this is done, then we’ll see that we have an External Table:

image

Querying it with a SELECT statement is just like querying any other table, which gives the following results in this case:

image
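
For instance, once the external table exists you can aggregate it just like a local table – something along these lines:

--Average actual temperature per building from the external table
SELECT   BuildingID, AVG(ActualTemp) AS AvgActualTemp
FROM     Hvac
GROUP BY BuildingID;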

So that’s it – by creating the external data source, file format and external table, we can now use T-SQL in SQL Server 2016 to easily query data hosted within Azure HDInsight.

Data Quality Services Leading Values

The Leading Values checkbox is one of the options available when you set up a domain within DQS Domain Management. I’ve run a couple of courses and demos recently on DQS, and one of the questions that cropped up was: what does the Leading Values checkbox do? This blog post will explain the Leading Values option by showing a few examples.

Domain Management

In this case we’re going to set up a domain called Country, which will be fairly simple for the purposes of this blog post. All of the defaults have been left, aside from Use Leading Values, which is unchecked.

image 

Knowledge Discovery

The next step is to add some knowledge to the Knowledge Base (KB), which we’ll do via knowledge discovery. To assist in this process I have a simple Excel file that contains a list of countries – this needs to be fed into the KB:

image

The first step within knowledge discovery is to pick a data source. In this case the data source is Excel and the Source Column of Name has been mapped to the domain of Country:

image

Once we get to the Manage Domain Values step, we can have a look at the data in order to spot valid and invalid values. Towards the bottom of the list, we can see that there are values for both USA and United States of America:

image

We can set these as synonyms, as shown above. If we go back into Domain Management afterwards, then we can see that USA is a synonym of United States of America:

image

Data Quality Projects

If we move on and create a data quality project to clean some data, then we’ll see these values in action. The new source file has some new countries, but also has both USA and United States of America. If we run the new source file through the Data Quality Project then we’ll see the following output:

image

As we’ve turned off “Use Leading Values”, USA has been deemed a correct “Domain Value”, as it’s a synonym of United States of America.

To contrast this, let’s take a look at what would have happened if we had left the default setting, which is “Use Leading Values” checked. With the exact same setup otherwise, we get the following output:

image

This time USA appears on the Corrected tab. Although its value is “correct” as far as the domain is concerned, because we’ve specified Use Leading Values, DQS has altered USA to its leading value. The reason for the correction is very clear – “Corrected to leading value”.

I’ve personally not had a need to alter the setting, as I’ve always wanted the leading value to be the primary output. But it’s good that DQS does give an element of flexibility in its setup to suit different requirements.

Power BI MDS Dashboard

Master Data Services tables such as the staging tables have been well blogged and documented, but there are a host of other system tables and views that can help you with an MDM project. By querying these tables and views, we can gather MDS metrics such as the number of members, validation errors and staging errors, all of which can then be exposed to a data steward via a dashboard. Given all of the recent updates around Power BI, I decided to use Power BI Desktop to build an MDM dashboard for this blog post. The goal in this case is to produce a dashboard that allows a data steward to get an overview of metrics on a model-by-model basis.

Power BI Desktop

The first step when working with Power BI is to import some data. Once we start Power BI Desktop, the welcome screen gives us a shortcut to get data, or we can do this via the ribbon:

image

In this case we need to take data from SQL Server, so clicking the SQL Server option requires us to enter a server name and optionally a database name. Once we do this and pick the MDS database, we get a list of tables to choose from:

image

Clicking on Load will load the tables into Power BI.

Master Data Services Objects

The question is: what tables do we need? If you navigate the MDS database you will see that there is a mix of master data tables (e.g. post-fixed with _EN for entity), application tables and metadata views (prefixed with viw_SYSTEM). It’s the system views that we need to start with, as we want to allow a breakdown by model, entity and version. To do this we need to load in the following views:

--Models
SELECT          ID AS ModelId, Name AS ModelName
FROM            MDM.viw_SYSTEM_SCHEMA_MODEL

--Entities
SELECT          E.Id AS EntityId, E.Name AS EntityName, E.Model_ID, e.Model_Name, E.Model_MUID
FROM            MDM.viw_SYSTEM_SCHEMA_ENTITY E

--Get a count of versions
--This is used as a total and by version status
SELECT      V.ID AS VersionId, V.Name AS VersionName, V.Model_MUID, V.Model_Name, V.Status AS VersionStatus
FROM        [mdm].[viw_SYSTEM_SCHEMA_VERSION] V

Next it would be useful to report on the member count in each entity. This is a little harder as it’s not stored in the metadata tables, but we can combine the MDS metadata views with the SQL Server sys.dm_db_partition_stats Dynamic Management View in order to return the row count (i.e. the number of members) per entity:

--This gives the row count per entity for all models
SELECT          E.Name AS EntityName, M.Name AS ModelName, M.MUID, M.ID AS ModelId,
                'mdm.' + E.EntityTable AS TableName, P.row_count
FROM            MDM.viw_SYSTEM_SCHEMA_ENTITY E
INNER JOIN      MDM.viw_SYSTEM_SCHEMA_MODEL M ON M.MUID = E.Model_MUID
INNER JOIN      sys.dm_db_partition_stats P on P.object_id = OBJECT_ID('mdm.' + e.EntityTable)
AND             P.index_id = 1

Finally a few more metrics that would be useful are the number of staging errors and a breakdown of the members by validation status (Validation Succeeded, Validation Failed, Awaiting Validation). To do this we can take in the following tables and stored procedures:

--Get the current count of errors that have occurred during staging
SELECT    Entity_ID AS EntityId, ErrorMemberCount, Version_ID
FROM    mdm.tblStgBatch

--Pass in the appropriate version id
EXEC [mdm].[udpValidationStatusSummaryGet]

--This gets the validation issues by model/entity:
SELECT    ValidationIssue_ID, Version_ID, Model_ID, ModelName,
        Entity_ID AS EntityId, EntityName, Member_ID, BusinessRuleName
FROM    [mdm].[viw_SYSTEM_USER_VALIDATION]

Power BI Relationships and Visualisations

Once we import the data, Power BI Desktop will detect relationships automatically for us, but we can alter them by clicking on Manage Relationships on the ribbon if we wish. The following shows the relationships between the various MDS objects mentioned above:

image

Once we’re happy with the model and the relationships, then we need to start dragging and dropping to build the Power BI report. As an example we can create a simple card visualisation to show the count of various metrics:

image

This can be used to produce the following metrics – the count of entities, versions, validation issues and staging errors:

image

Equally, we can create a simple column chart by using the [mdm].[viw_SYSTEM_USER_VALIDATION] view. The count of issues goes in the Value section of the chart, whereas the Business Rule Name is on the Axis:

image

Putting it all together we can get the following report, shown for the sample Product model:

image

Conclusion

This is just a start – there are other metrics that you could put onto an MDM dashboard; as an example, you could include MDM ETL metrics if you’ve logged them. But overall the combination of the rich MDS metadata and the intuitive Power BI interface means it’s relatively easy to produce simple dashboards.

Master Data Services SQL Server 2016

Unless you’ve been living under a rock for the last week or so, you will have noticed that CTP2 for SQL Server 2016 has been released and is now available for download.

I finally got round to downloading the CTP myself yesterday and my first port of call was to take a look at the new version of Master Data Services, as it’s been given several improvements. Included in this list of improvements is a revamped front end, so I thought it would be worth making a start by exploring the front end improvements in this blog post.

New Look and Feel

The first thing that you will notice is the new look and feel that will greet you in the web UI. Here’s the new look:

image

To recap - here’s the equivalent in SQL Server 2014 MDS:

image

There have been a few changes to the actual Explorer grid interface. One is that you now get a Copy Member button, and there is also the ability to export your current view to an Excel MDS Add-In favourites file:

image

Another useful change is that on each entity you now get told who last updated the member and who created the member. In the previous version, MDS logged who last updated the member, but it wasn’t displayed to the end users. I’ve spoken to users who need this feature, so it’s good to see it implemented:

image

Entity Dependencies

If you go into the Explorer, the main grid is still present and looks the same, but there are a few changes to the menu. There used to be an Explorer menu but this is now gone – after all, you can get to the same thing via the other menus. There is a new Entity Dependencies menu, which will list all of the entities within the model, as you can see here:

image

Each entity is followed by a number in brackets, which indicates the number of dependencies that the entity has. If you click on an entity (e.g. the Region entity), then you will get the opportunity on the Entity Dependencies view to drill down on an entity member in order to locate its dependencies at different levels:

image

Hierarchies and Collections

There have been some major changes to MDS hierarchies, as covered on the MDS Team blog. In summary, Explicit Hierarchies and Collections are now deprecated, with the idea that they can be replaced with some of the new features of derived hierarchies. One such feature is the “Unused” node, which shows any members of the Derived Hierarchy that are missing a parent. E.g. in the example below I’ve created a new Area member, but it’s not been assigned a value for its Big Area attribute:

image

A small but useful change that I’ve noticed is that there is a setting for hierarchies that will auto-collapse a hierarchy branch if you expand a sibling branch:

image

For example, looking back at the screenshot I’ve included above, if North America was expanded, then turning this setting on would mean that North America would collapse if I expanded the International branch of the hierarchy.

Summary

These UI improvements to the web interface make a welcome change to a product that is now in its fourth SQL Server version. It’s great to see Microsoft investing in MDS – this will no doubt help increase the uptake of a product that (aside from in SQL 2014) has now seen steady improvement.

Master Data Services Staging Tables Setup

This is just a quick tip for Master Data Services, specifically around the setup of new entities. When creating new Master Data Services entities one of the settings that you can choose is the Staging Table Name, which is an optional setting:

image

As it says in the screenshot, if you do not complete the field, the entity name will be used. This might be OK in simple, single-model environments, but you can easily get your staging tables into a mess if you don’t complete this setting. MDS will try to default the staging table name to the name of the entity, but if you’ve got several models that all contain an entity with the same name, then MDS will add an underscore and a number to the end of your staging table name. For example, here there is a Region entity within the Customer model and within the Engineer model. But which staging table belongs to which model?

image

The approach that I tend to take in the entity setup is to give the staging table an explicit name: the model name followed by the entity name. E.g. I would choose EngineerRegion and CustomerRegion. This results in a much clearer set of staging tables:

image

We now know which table belongs to which model. As the MDS staging tables all follow a particular pattern, we can now create a generic set of ETL packages using SSIS. I won’t go into this in huge detail, but by specifying the model name and the entity name as variables you can create yourself a template MDS package that can be easily copied and pasted to load other entities. For example, below, the staging table load, staging stored procedure and (optionally) model validation are all controlled by the following variables:

image

The variables are used as expressions in the Control Flow and Data Flow components resulting in a generic MDS loading template:

image
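
For reference, the commands that such a template ultimately executes against the MDS database follow a predictable pattern. Below is a rough sketch for the hypothetical CustomerRegion staging table – the version name, batch tag and ID values are placeholders, and the exact parameters can vary between MDS versions:

--Process the rows loaded into stg.CustomerRegion_Leaf (hypothetical name)
EXEC stg.udp_CustomerRegion_Leaf
     @VersionName = 'VERSION_1',
     @LogFlag = 1,
     @BatchTag = 'CustomerRegionLoad';

--Optionally validate the model afterwards (placeholder IDs)
DECLARE @User_ID int = 1, @Model_ID int = 2, @Version_ID int = 3;
EXEC mdm.udpValidateModel @User_ID, @Model_ID, @Version_ID, 1;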

Master Data Services Permissions Guide

I’ve had a few questions recently on various projects around the Master Data Services Model Object permissions. At a simple level they can be quite easy to master, but they can get more complex as you try to combine the different permissions to create your desired security setup.

This guide assumes that you’ve set the Functional Area Permissions and given the user or group access to the Explorer functional area. The tables below show the different permissions that can be applied to each object within a model:

Model Permissions

This example shows the MS sample Customer model, which I’ve named Customer Sample.

Permission: Update
Action: clip_image002
Result: The user will be a Model Administrator, meaning that they can add/delete/modify entities and access all data within each entity in the model. If they also have access to the System Administration functional area, then they will be able to actually add/edit/delete entities and other model objects.

Permission: Read Only
Action: clip_image004
Result: The user will be able to read the data within each entity in the entire model.

Permission: Deny
Action: clip_image006
Result: The user will not be able to see the model.

Entity Permissions

This example shows the entities that are contained within the same Customer Sample model.

Permission: Update
Action: clip_image002[4]
Result: You can add, modify and delete members within the entity, in this case “Big Area”.

Permission: Read Only
Action: clip_image004[4]
Result: You can read all members within the entity.

Permission: Deny + Read Only on Model
Action: clip_image006[4]
Result: This is where it starts to get interesting. Having Read Only on a model means that you will have access to all entities, but by setting Deny on a specific entity, the user will not see that entity in the drop-down list.

Permission: Update + Read Only on Model
Action: clip_image008
Result: This will result in Read Only for all entities, but the user will be able to add, edit and delete members within the Big Area entity.

Member Type Permissions

Depending on the entity, you could have both “Leaf” Member Type and Consolidated Member Type permissions (see here for the difference between the two). This table covers Leaf Members:

Permission: Read Only
Action: clip_image002[6]
Result: Only choosing Read Only at the “Member Type” level will cause the same result as choosing Read Only on the entity. Therefore you will be able to read all members in the entity.

Permission: Update
Action: clip_image004[6]
Result: You can add, modify and delete members within the entity, in this case “Big Area”.

Permission: Deny
Action: clip_image006[6]
Result: The user will not be able to see or access the entity.

As you can see, if you only have Leaf Members within the entity, then the permissions are the same as if you are setting them at the entity level.

Attribute Permissions

Permission: Read Only
Action: clip_image002[8]
Result: With just Read Only on one attribute, this will actually result in the user being able to view the whole of the Area entity, even though the Read Only permission has been set on only one attribute.

Permission: Update
Action: clip_image004[8]
Result: The user will be able to view the whole entity, but will only be able to modify the Big Area attribute on existing members.

Permission: Deny
Action: clip_image006[8]
Result: The user will not be able to see anything, as Deny on just an attribute is not enough to give the user access to the entity. See the next row for how to correct this.

Permission: Deny + Read Only on Entity
Action: clip_image008[4]
Result: The user will be able to view the Area entity, but the Big Area attribute will not be visible.

Permission: Update + Read Only on Entity
Action: clip_image010
Result: The user will be able to view the entity, but will only be able to update the Big Area attribute.

Permission: Read Only + Update on Entity
Action: clip_image012
Result: The user will be able to view, add and edit members, but will not be able to specify a value for the Read Only attribute. The user can, however, delete members.

The last three examples really highlight how powerful combining MDS permissions can be. Let’s look at these in more detail:

Deny on Attribute + Read Only on Entity

Setting Deny on the Big Area attribute, but Read Only on the entity will result in the user seeing the following:

image

The “Big Area” attribute is completely missing from the MDS Explorer grid.

Update on Attribute and Read Only on Entity

On the other hand, setting Update on the Big Area attribute and Read Only on the Area entity will result in the user seeing the following when editing a member in the Area entity:

image

The two highlighted attributes (Name and Code) are both Read Only as we set the entity to be Read Only. The user can edit the Big Area attribute for each member, as we gave the user Update permission on that attribute.

Read Only on Attribute and Update on Entity

Finally, in the last example we have Read Only on the Big Area Attribute, but Update on the Area Entity. This means the user cannot specify a value for the Big Area attribute:

image

So there we have it! Hopefully this post will be useful – as you can see the MDS Object Model permissions can be very granular and should be able to cover most of the scenarios that you need.

SSIS Foreign Language Translation

A standard practice in the data warehousing world is the process of delivering conformed dimensions, which is an activity that can involve data cleaning, standardisation, de-duplicating and matching, with a goal of ensuring that data from multiple different locations can be analysed in a consistent manner. For example, System A may have a customer called William Smith, whereas System B might hold the exact same customer as Mr W Smith. The data warehouse ultimately needs to output one version of William Smith to the end user in this example, which will allow all data associated with this customer to be easily analysed.

I’m currently facing the challenge of de-duplicating and matching across data sources, although in my case most of the data sources are not in English. It’s one thing to match customers or products that have slightly different descriptions using fuzzy logic in your ETL, but how do you do this when your products are described in multiple different languages? Although we’ve not actually ended up needing to use it, the experience got me thinking: could SSIS be used to translate values from one language to another?

Microsoft Translator API

Enter Microsoft Translator, a “statistical machine translation system” that’s available on the Azure Marketplace for what looks to be a fairly reasonable price. For example, you can translate up to 2 million characters a month for free, whereas 4 million characters a month will set you back £26.36. As an example I’ve translated the word ‘Bonjour’ here, without even telling it what the input language is (the service has auto-detect capability):

image

SSIS and the Microsoft Translator API

Using an example from AdventureWorks, I’d like to see if it’s possible to translate the French descriptions found in dbo.DimProduct into English. To do this, I’ve set up a very simple data flow that starts off by extracting the top 100 products from Adventure Works that have a French description. There is then a Script Component, which makes the all-important call to the Microsoft Translator API, and finally a Union All just to allow me to catch the rows while debugging. Overall the data flow is as follows:

image

The Script Component is where all the magic happens, as this contains the call to the API using the following MS Translator example code. The MS Translator code needs adapting to fit into the SSIS environment, and rather than re-posting every line of code, I’m just going to highlight the changes I made to get this working with SSIS. My first step was to add a Script Component with the following output column:

image

Then within the Edit Script window I added the following additional references:

image

I then set up my using section as follows:

using System;
using System.Data;
using System.IO;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Xml.Linq;
using System.Collections.Generic;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Net;

The next step is to use the PreExecute() method, which runs once, in order to get an access token from the Azure Marketplace. The key is to pass in a Client ID and a Client Secret, both of which can be set up here.

Finally, Input0_ProcessInputRow needs to carry out the translation, which is done with the following code:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (!Row.FrenchInputName_IsNull)
    {
        //Specify from French to English
        string uri = "http://api.microsofttranslator.com/v2/Http.svc/Translate?text=" + 
            System.Web.HttpUtility.UrlEncode(Row.FrenchInputName.ToString()) + "&from=fr&to=en";
        System.Net.WebRequest translationWebRequest = System.Net.WebRequest.Create(uri);
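        //headerValue is the authorization header built from the access token obtained in PreExecute (not shown here)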
        translationWebRequest.Headers.Add("Authorization", headerValue);
        System.Net.WebResponse response = null;
        response = translationWebRequest.GetResponse();
        Stream stream = response.GetResponseStream();
        System.Text.Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader translatedStream = new System.IO.StreamReader(stream, encode);
        System.Xml.XmlDocument xTranslation = new System.Xml.XmlDocument();
        xTranslation.LoadXml(translatedStream.ReadToEnd());

        Row.Translation = xTranslation.InnerText;
    }        
}


Results

Et voila! After compiling the script code and running the package, we can see the following results in the data viewer:

image

The third column, called Translation, is the translated English name returned by the API, whereas the last column is the correct English translation from Adventure Works. As you can see, for the first few rows it’s only missing hyphens and capitalisation, and the other rows are very close to the correct Adventure Works translation. It’s not perfect in all the scenarios that I tested, and this approach won’t be suitable for large data volumes, but certainly for the above data it gives a very good match – better than I expected.

Microsoft Translator seems to have done a very good job as a first pass of translating the data. There would be nothing stopping you enhancing this further of course, perhaps with something like DQS or similar cleansing tools, which in turn would mean that you could get the data into good shape before attempting to carry out fuzzy matching.

Core and Custom Dimensions and Facts

An interesting challenge in data warehousing is how to cope with a dimension that is heterogeneous in nature, e.g. a product dimension that only has a handful of attributes for one particular product type, but may have many more attributes for a different product type. One way to deal with this problem is to create ‘Core and Custom’ dimensions or fact tables, which is a concept I will introduce in this first post, before outlining some of the technical challenges in later posts.

The ‘Core and Custom’ concept (AKA Supertype and Subtype) is covered by Ralph Kimball in the Data Warehouse Toolkit, where he uses the example of a retail bank selling different types of products, e.g. a mortgage ‘product’ and a current account ‘product’. They both have a balance and an interest amount, which are the facts, but they have a varying number of different dimension attributes. E.g. a Current Account product may have attributes such as ‘Monthly Fee Indicator’ and ‘Debit Card Eligible Indicator’, whereas a Mortgage product may have attributes such as ‘Term’ and ‘Fixed/Adjustable’ to represent Fixed/Variable etc.

I’ve seen this scenario myself recently, where a company sells Commercial Vehicles, Cars, Spare Parts and Bikes. This is the example that I’ll use in order to illustrate the concept a bit further.

Core and Custom Dimensions

If a company wants a global view of all of its product sales, whether they be commercial vehicles, parts, cars or bikes, then an ideal scenario is to have one sales fact table and one product dimension table. Unfortunately, due to the diverse nature of the products offered, it could become unmanageable and messy to have a single dimension table that simply has all attributes from all of the different product lines. For example, a car might have attributes such as Engine Size, Transmission, Airbag Indicator and Fuel Type, whereas a bike might have attributes such as Size, Gender, Speed and Rear Travel – they are clearly very different.

The Core and Custom dimension approach is essentially to have a ‘Core’ dimension table that holds all products, with a column for the surrogate key and then all of the common attributes. The Core dimension table would look like the following:

image

This is then complemented by the ‘Custom’ dimension tables which are highly specialised dimension tables that are specific to a product line. A given product will be present in both the Core dimension table and one of the Custom dimension tables, with the same surrogate key. This is an important point – a product in the Core table will have the same surrogate key as a product in one of the Custom dimension tables. Therefore in the example below, the highlighted rows in the Core dimension table also appear in the Custom bike product dimension table:

image

image

Fact Tables

The idea is that the fact table can be joined interchangeably to either the Core dimension table, to get an overview of all products, or to a Custom dimension table, for highly specialised analysis. The Core dimension table would join to a fact table just like any other dimension table:

image

In contrast, any of the Custom dimension tables could be joined to the fact table, according to the type of analysis required. The example below shows how any one of the three Custom dimension tables could be joined to the fact table:

image

Note – The above diagram shows the optional join paths for the Custom dimensions to the fact table. From the user’s perspective, a single dimension would always be presented – either all products, in which case the Core dimension would be used along with the whole fact table, or one of the Custom dimensions, in which case the fact table should only contain the products that relate to the Custom dimension, e.g. with a view.
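
A minimal sketch of that kind of view, using made-up table and column names, might look like this:

--Hypothetical names; restrict the fact table to bike products so that it
--can be presented alongside the Custom bike product dimension
CREATE VIEW dbo.FactSalesBikes
AS
SELECT      F.*
FROM        dbo.FactSales F
INNER JOIN  dbo.DimProductBike B ON B.ProductKey = F.ProductKey;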

The alternative approach is to have separate Custom fact tables for each business line – this could be necessary if there are specific facts or dimensions that only apply to a particular product line. Therefore an approach with both Custom fact and dimension tables might look as follows:

image

Summary

My parting thoughts are that whilst I like the Core and Custom approach, you should only use it where appropriate. For example, if you have a few product lines and they vary only slightly, then it’s fine to create a few attributes within a single dimension table that do not apply to all products. On the other hand, if you want to do detailed analysis across highly varied product lines, then Core and Custom dimension and fact tables provide a very flexible method of presenting different information to different groups of users.