A standard practice in the data warehousing world is the process of delivering conformed dimensions, which is an activity that can involve data cleaning, standardisation, de-duplicating and matching, with a goal of ensuring that data from multiple different locations can be analysed in a consistent manner. For example, System A may have a customer called William Smith, whereas System B might hold the exact same customer as Mr W Smith. The data warehouse ultimately needs to output one version of William Smith to the end user in this example, which will allow all data associated with this customer to be easily analysed.
I’m currently facing the challenge of de-duplicating and matching across data sources, although in my case most of the data sources are not in English. Its one thing to match customers or products that have slightly different descriptions using fuzzy logic in your ETL, but how do you do this when your products are in multiple different languages? Although we’ve not actually ended up needing to use it, the experience got me thinking, could SSIS be used to translate values in one language to another?
Microsoft Translator API
Enter Microsoft Translator, a “statistical machine translation system”, that’s available on the Azure Marketplace for what looks to be a fairly reasonable price. For example, you can translate up to 2 million characters a month for free, whereas 4 million a month will set you back £26.36 a month. As an example I’ve translated the word ‘Bonjour’ here, without even telling it what the input language is (the service has auto detect capability):
SSIS and the Microsoft Translator API
Using an example from AdventureWorks, I’d like to see if its possible to translate the French descriptions found in dbo.DimProduct into English. To do this, I’ve set up a very simple data flow that starts off with extracting the top 100 products from Adventure Works that have a French description. There is then a Script Component, which makes the all important call to the Microsoft Translator API, then finally a Union All just to allow me to catch the rows while debugging. Overall the data flow is as follows:
The script component is where all the magic happens, as this contains the call to the API using following MS translator example code. The MS translator code needs adapting to fit into the SSIS environment and rather than re-posting every line of code, I’m just going to highlight the changes I made to get this working with SSIS. My first step was to add a script component with the following output column:
Then within the Edit Script window I added the following additional references:
I then setup my using section as follows:
The next step is to call the PreExecute() method once in order to get an access token from Azure Marketplace. The key is to pass in a Client ID and a Client Secret, both of which can be setup here.
Finally ProcessInputRow now needs to carry out the translation, which is carried out with the following code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
//Speficy from french, to English
string uri = "http://api.microsofttranslator.com/v2/Http.svc/Translate?text=" +
System.Web.HttpUtility.UrlEncode(Row.FrenchInputName.ToString()) + "&from=fr&to=en";
System.Net.WebRequest translationWebRequest = System.Net.WebRequest.Create(uri);
System.Net.WebResponse response = null;
response = translationWebRequest.GetResponse();
Stream stream = response.GetResponseStream();
System.Text.Encoding encode = System.Text.Encoding.GetEncoding("utf-8");
StreamReader translatedStream = new System.IO.StreamReader(stream, encode);
System.Xml.XmlDocument xTranslation = new System.Xml.XmlDocument();
Row.Translation = xTranslation.InnerText;
Et voila! After compiling the script code and running the package, we can see the following results in the data viewer:
The third column called Translation is the translated English column name from the API, whereas the last column is the correct English translation from Adventure Works. As you can see, for the first few rows its only missing hyphens and capitalisation and the other rows are very close to the correct English translation from Adventure Works. It’s not perfect in all scenarios that I tested, and this approach won’t be suitable for large data volumes, but certainly for the above data it gives a very good match – better than I expected.
Microsoft Translator seems to have done a very good job as a first pass of translating the data. There would be nothing stopping you enhancing this further of course, perhaps with something like DQS or similar cleansing tools, which in turn would mean that you could get the data into good shape before attempting to carry out fuzzy matching.