
Adatis BI Blogs

Getting started with Azure Data Factory Mapping Data Flows

Azure Data Factory V2 is the go-to service for moving large amounts of data within the Azure platform and, up until relatively recently, was focussed predominantly on control flow rather than data flow. Those familiar with SQL Server Integration Services will be aware of the difference; to clarify, control flow is used to orchestrate activities within a data pipeline whilst data flow is used to physically move and transform the data. In May 2019, the Azure Data Factory team released into public preview the Azure Data Factory Mapping Data Flows (MDF) functionality, which effectively moves ADF from being an orchestration tool to a fully fledged ETL tool. This new feature allows developers to configure data transformation logic using a no-code, drag-and-drop approach and implements many of the transformation concepts that existed in SSIS. In fact, there is an excellent comparison between SSIS, SQL and ADF MDF written by Kamil Nowinski, available through this link, which highlights how the three technologies match up: https://sqlplayer.net/2018/12/azure-data-factory-v2-and-its-available-components-in-data-flows/. Points to note are that ADF does not have event handlers in the way SSIS does and also does not have an equivalent to the script component within Data Flows. It does have a custom activity which allows you to write C# within the control flow, but currently you cannot write any custom code within a Mapping Data Flow.

Prior to the release of MDF, Data Factory could move data from one place to another with its copy activity, however it could not modify the file in any meaningful way and would require external services such as SQL Data Warehouse or Data Lake Analytics (RIP) to fill this gap. The copy activity performed the data movement using the Azure Integration Runtime, which provided the compute power needed to complete the operation, so does that mean Mapping Data Flows run on the same compute resource? No. In fact, your graphical data flow is converted into Scala and then compiled into a JAR library to be executed on a Databricks (Spark) cluster which is deployed and managed by Microsoft solely for running your data flow activities. This does mean that any pipelines that utilise MDF have a slight overhead to allow for the cluster to start up and configure; however, Microsoft are looking at ways to reduce this, and there is no need for an ETL developer looking to build MDFs to know anything about Databricks, Scala or Spark clusters – although it will certainly help!

So, it's in public preview – let's get using it! This blog will walk through the process of creating a basic cleaning data flow that will populate a SQL Database table with the values from a delimited text file. To begin, we need some data factory objects: anyone familiar with Data Factory will understand we need Linked Services and Datasets to connect to our data and create schemas, and Pipelines to give structure to the activities and create our control flow. In this example I have a small comma-separated file that has some data quality issues, such as:

Leading and trailing white space
Weird column names
Non-standardised NULL values
Weak data types

The data flow we will create will address all of these issues before depositing the file into a SQL Database. First, we should create a linked service that connects to a blob store or data lake account where our dirty file is stored.
We then need a dataset that sits on top of the Linked Service which allows us to locate and read the file using specified parameters such as file path, column delimiter, row delimiter etc. Additionally, in this dataset we can import the schema of the file so that we have some column names available. Next, we can create a linked service that connects to our output SQL Database, and also a dataset that points to the correct table, again importing the schema. After creating these objects we also need a pipeline to orchestrate the process; this will ultimately call our fancy new data flow. You only need a single activity in this pipeline, which is found under the "Move & Transform" heading and is called "Data Flow (Preview)".

At this point you should have something similar to the following, with the exception of the data flow object under the Data Flows (Preview) tab.

NOTE: I have not gone into much detail about creating these objects as they will be familiar to most ADF developers. For pointers on how to create the above follow this link: https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal

Now we can begin creating our Data Flow. Firstly, we will add the source and the sink to our data flow; this gives us our bookends as a clear starting point, and we can then play about with the middle bit afterwards. Also, at this point, toggle the switch at the top of the Data Factory UI named "Data Flow Debug" – this will start a cluster up for you and avoid you having to wait later on.

Hover over the Data Flows (Preview) header, click on the ellipsis and choose "Add Data Flow"; this will take you into the Mapping Data Flow UI where you can begin creating your Data Flows. Remember to set a sensible name for your data flow in the general tab of the data flow UI.

Click "Add Source" and we can begin configuring the source activity of our data flow. Provide a clear name that identifies the source. This source is fed from a dataset so the names should closely match. Choose the dataset that is linked to your blob account and uncheck "Allow schema drift". This is a useful option that allows constantly changing source files to be handled; however, we will cover that in a later blog. On the "Source Options" tab you can choose to add actions that occur after completion, such as deleting or moving source files. On the "Projection" tab you can tailor your schema, which is defined in the source dataset. My preference is to leave these all as strings to avoid any early typing errors, as we will address these later in the flow. Finally, on the Optimize, Inspect and Data Preview tabs, all defaults can remain the same.

Now click on the tiny + icon in the bottom right corner of the source activity and choose "Sink" from the drop-down list. Here we can configure the output of our flow, which in this case will be a SQL Database. Specify the name, again relating it to your sink dataset, and choose the SQL Database dataset created earlier. On the "Settings" tab you can choose which methods can be used by ADF when working with your table. These can be any combination of insertion, deletion, upserting or updating. Also, you can define actions to occur in the database before loading the data, such as recreating the entire table or truncating the existing table before loading into it. Finally, on the "Mapping" tab you can map columns from source to sink. Be aware that any columns that are not strings in your database will not be able to be mapped until the data typing has occurred.
Now we have a basic copy structure that does nothing clever yet but does connect our dirty source file up to our sink dataset, we can begin doing the actual transformation. The first transformation will be to trim white space from columns. Click on the + icon and choose "Derived Column". Within the "Derived Column Settings" tab you should add each of the columns in your source dataset and then enter the following expression for each one in the expression editor: trim({column name}). This expression will remove any whitespace from the column's value, ensuring the database receives a clean value.

Now we will standardise any NULL values and also transform any columns into their correct data types. To do this, click the + icon again and choose "Derived Column" again. Similar to the step above, you can add an entry in the "Derived Column Settings" tab for each column, adding the below expression for each one: replace(replace({column name}, ' ',''),'Unknown',''). This expression will replace any empty values with NULL, and any values of 'Unknown' will also get replaced with NULL, so that we have some standardisation before loading into the database. Any NULL values already present will be untouched. In my dataset I need to change one column from its originating string data type into an int so that it can be dropped into the DB. Rather than doing this change in place, it is best to create a new column, so that you have both the original column and the new column with the correct type. Whilst still in the expression editor, hover over any column name in the "OUTPUT SCHEMA" window that is visible on the left-hand side and choose the + icon. This will allow you to add a new column to your data flow, and you can use any of the conversion functions (toInteger, toLong, toString, toDate, toBoolean etc.) to coerce the value into its correct type.

At this point you should have four steps that resemble the below screenshot. Once your Data Flow Debug session is online you can debug the data flow and hopefully see the nice clean values pass through into the database. Throughout this process I recommend taking a peek at the Inspect and Data Preview tabs. The Inspect tab gives a bit more information about what steps are taking place on the data in that activity, and the Data Preview will show you how the data will look, although the Debug session needs to be active for this to work. Finally, the Optimize tab allows you to set the partitioning of the data using techniques such as Round Robin, Hash and Range distribution, although these are out of scope for this blog.

Hopefully this brief tutorial has been helpful and has allowed you to gain some early knowledge of Data Factory Mapping Data Flows, meaning that you can go on to create flows that are tailored to your needs and cover a wider variety of scenarios. Any questions, thoughts or feedback, please catch me on twitter @MattTheHow.

Injecting Databricks DataFrame into a Power BI Push Dataset

Using Python and the Power BI REST API

Now, why would you want to do that? There are some scenarios where this approach may be beneficial. To get the full picture and establish the landscape let's first answer two questions.

Why a Databricks DataFrame?

Recently Databricks became an integral part of the Modern Data Warehouse approach when aiming for the Azure cloud. Its wide usage in data transformation begs for a richer variety of data destinations. The usual and most widely used persistence is the file store (lake, blob, etc.). That's fine with large volumes of data, but then we have to go a long way before reaching the data presentation (additional storage layers, SQL, Tabular data model etc.). What if we want to instantly update a Power BI report directly from Databricks? It could be a small dataset feeding a dashboard, for example. It could be business data or data-related metrics that we want to see in (near) real-time. Or we might want to monitor the data transformation process continuously in a Databricks streaming scenario.

Why a Push Dataset?

We can do real-time streaming in Power BI by using Streaming or PubNub streaming datasets, which is fine, but let's see some of the advantages of using a Push dataset:

You can build a report on top of a Push dataset
You can pin report visuals to a dashboard, and they will update in real-time on each data push
Data is stored permanently in Power BI
You don't need a data gateway in place

For a more detailed comparison of all the real-time streaming methods, please check here. If you think that all this makes sense, then continue further.

Implementation

Since we'll be utilising the Power BI REST API there are some limitations that we need to be aware of upfront. You can see them here. In order to be able to call the Power BI API you need to register an application and set proper permissions. Follow the steps here. App registration details below:

After creating the app registration, you should grant permissions in the Azure portal. Without granting these permissions from the Azure portal you won't be able to get an authorisation token and use the REST API.

Now that we're all set up, let's go straight to the point. My initial intention was to show you how to build the whole thing step-by-step in this post. But then it became complicated, and I decided that an abstraction was needed here to keep things simple. That's why I wrapped up all the "boring" stuff in a Python class called pbiDatasetAPI, which you can find on GitHub here. The comments in the code should be enough to understand the logic. The methods of this class will take care of:

Authentication
HTTP requests
HTTP request JSON body generation (data and metadata)
Data type conversion (Spark to EDM – see the sketch further down)
Wrapping all of the Power BI REST API operations that we need to perform in Python functions

In order to start using it you should import a notebook (using the above-mentioned URL) in your Databricks Workspace. Now that the class notebook is imported you can create a new Python notebook in the same folder to test how it's working. Let's call it "Inject DataFrame into Power BI Push Dataset". First, we'll execute our class notebook:

%run "./pbiDatasetAPI"
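A quick word on the "Spark to EDM" conversion mentioned in the list above: a Push dataset column has to be declared with an EDM type (Int64, Double, Bool, Datetime, String), so each Spark SQL type in the DataFrame schema needs translating. The snippet below is purely illustrative of the idea and is not the actual class code – the real mapping lives inside pbiDatasetAPI:

# Illustrative only - a rough Spark-to-EDM type mapping, not the actual pbiDatasetAPI code
spark_to_edm = {
    "int": "Int64",
    "bigint": "Int64",
    "double": "Double",
    "boolean": "Bool",
    "timestamp": "Datetime",
    "string": "String"
}

def edm_type(field):
    # Anything not explicitly mapped falls back to String
    return spark_to_edm.get(field.dataType.simpleString(), "String")

Calling edm_type(someDataFrame.schema["someColumn"]) would then return the EDM type name to declare for that column.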
Next, we'll need a DataFrame with data that will be pushed to the Power BI Push dataset. We can use one of the sample datasets which come with Databricks (in this case Amazon reviews):

dfInject = spark.read.parquet('dbfs:/databricks-datasets/amazon/test4K')
dfInject = dfInject.select("brand", "img", "price", "rating", "review", "time").limit(200)

We take 200 rows, which is just enough. In the next command we'll create some variables and instantiate the pbiDatasetAPI class:

# Initialise variables
username = "<USER_EMAIL>"
password = "<USER_PASSWORD>"
application_id = "********-****-****-****-************"  # Power BI application ID
groupId = None  # Set to None if not using Power BI groups
datasetName = "InjectedDataset"  # The name of the Power BI dataset
tableNames = ["AmazonReviews"]  # Table name or list of table names
dataFrames = [dfInject]  # DataFrame name or list of DataFrame names

# Create a pbiDatasetAPI class instance
pbi = pbiDatasetAPI(username, password, application_id)

You should set your username and password, and the application ID obtained in the previous steps. Optionally provide a group ID, or set it to None if you're not using groups on powerbi.com. The tableNames and dataFrames variables are lists, because we may want to insert multiple DataFrames into multiple tables. In our case it's one DataFrame to one table. Let's create a dataset with one table in Power BI:

# Create the dataset and the table in PBI
pbi.executePBIOperation("postdataset", groupId = groupId, datasetName = datasetName, tableNames = tableNames, dataFrames = dataFrames, reCreateIfExists = True)

The next step is to post all the DataFrame's rows to the Push dataset.

# Get the datasetKey for the dataset (by name)
datasetKey = pbi.executePBIOperation("getdatasetbyname", groupId = groupId, datasetName = datasetName)[0]["id"]

# Insert the contents of the DataFrame in the PBI dataset table
pbi.executePBIOperation("postrows", groupId = groupId, datasetKey = datasetKey, tableNames = tableNames, dataFrames = dataFrames)

This is where the magic happens. One important note here is that the dataset key is always unique. This does not apply to the dataset's name, which means that we can have multiple datasets with the same name.
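Under the covers, the "postrows" operation boils down to a call to the Power BI REST API rows endpoint. The sketch below shows roughly what that raw request looks like; it assumes you already hold a valid AAD access token (the class obtains one for you from the username, password and application ID) and it skips the batching and type handling the class performs:

import requests
import json

# Illustrative only - roughly what pbiDatasetAPI does for "postrows".
# Assumes `access_token` and `datasetKey` are already available.
# For a workspace (group), the URL includes /groups/{groupId}/ before /datasets/.
url = "https://api.powerbi.com/v1.0/myorg/datasets/{0}/tables/{1}/rows".format(datasetKey, "AmazonReviews")
headers = {"Authorization": "Bearer " + access_token, "Content-Type": "application/json"}

# Each Spark row becomes a plain JSON object inside the "rows" array
payload = {"rows": [row.asDict() for row in dfInject.collect()]}
# default=str stringifies anything that isn't natively JSON-serialisable (e.g. timestamps)
response = requests.post(url, headers=headers, data=json.dumps(payload, default=str))
print(response.status_code)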
You can see the newly created dataset on powerbi.com. You can create a report on top of this dataset and pin visuals to a dashboard (tiles will be updated automatically upon data change).

Now, let's try to manipulate the metadata a bit – change the data type of the "rating" column in the DataFrame from double to string and update the Push dataset accordingly:

# Change the data type of the column 'rating' from double to string
dfInjectModified = dfInject.withColumn("rating", dfInject.rating.cast("string"))

# Get the datasetKey for the dataset (by name)
datasetKey = pbi.executePBIOperation("getdatasetbyname", groupId = groupId, datasetName = datasetName)[0]["id"]

# Update the metadata of the Power BI table
pbi.executePBIOperation("puttable", groupId = groupId, datasetKey = datasetKey, tableNames = tableNames, dataFrames = [dfInjectModified])

The only thing remaining now is to try to delete all the rows (it's all or nothing – we can't delete just some of the rows) from the table in the dataset and then delete the entire dataset:

# Get the datasetKey for the dataset (by name)
datasetKey = pbi.executePBIOperation("getdatasetbyname", groupId = groupId, datasetName = datasetName)[0]["id"]

# Delete all rows from the table(s)
pbi.executePBIOperation("deleterows", groupId = groupId, datasetKey = datasetKey, tableNames = tableNames)

# Get the datasetKey for the dataset (by name)
datasetKey = pbi.executePBIOperation("getdatasetbyname", groupId = groupId, datasetName = datasetName)[0]["id"]

# Delete the dataset
pbi.executePBIOperation("deletedatasetbyid", groupId = groupId, datasetKey = datasetKey)

All the code above can be imported into a notebook using this URL. If you closely examine the Power BI REST API documentation here, you'll find that the pbiDatasetAPI class is not completely finished yet. There's more that needs to be done, like:

Create measures with DAX
Create table relationships
Set cross-filtering behaviour
etc.

I intend to update the class in the future so that all the features will be available. Check GitHub now and then for updates.

PySpark DataFrame Transformation Chaining

After working with Databricks and PySpark for a while now, it's clear there needs to be as much best practice defined upfront as possible when coding notebooks. While you can get away with quite a bit when writing SQL – which is all too familiar to most of us now – the transition into other languages (from a BI background) requires a bit more thought in terms of coding standards. In this short blog, I will talk about multiple DataFrame transformations and what I believe to be the best way to go about structuring your code for them.

More often than not, if you have not come from a programming background or are unfamiliar with a language, you will end up writing spaghetti code: essentially code that reads a bit like a book, with no real structure to it, causing debugging/maintenance issues. We've all been there, especially at the start of a project, just wanting to get some output – but it's worth trying to bear in mind some coding rules while designing, to save the refactoring a few weeks down the line.

Jumping into the subject matter of the blog, this came about because I noticed some of our project code when it came to transformations was made up of re-assigning to a new DataFrame on each transformation. While functionally it was acceptable, it didn't feel the right way to go about things. The same can be said of re-using the same DataFrame (although this caused other dependency issues). It also left us with huge chunks of code which were only readable by using the comments to see where some transformations stopped and others started.

After a small bit of research I discovered the concept of monkey patching (modifying a program to extend its local execution) the DataFrame object to include a transform method. This method is missing from PySpark but already exists in the Scala Dataset API. The following code can be used to achieve this, and can be stored in a generic wrapper-functions notebook to separate it out from your main code. This can then be called to import the function whenever you need it.

# Monkey patch the DataFrame object with a transform method
from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    return f(self)

DataFrame.transform = transform

As part of your main code, we can then wrap the transformations into functions, passing in and returning a DataFrame. A separate statement can then be called, specifying transform on the original DataFrame and the list of functions (transformations) you want to pass in. By using this method, the code is almost self-documenting, as it's clear what transformations you've applied to move a DataFrame from one context into another. The example below only includes 2 transformations, but imagine you have 20+ to implement – this makes the code much easier to read and debug should there be an issue.
# Imports assumed earlier in the notebook
from pyspark.sql.functions import col, to_utc_timestamp, when

# Create started/finished timestamps and int dates in UTC as new columns in the DataFrame
# (ConvertDateTimeToIntDate is a UDF defined elsewhere in the project)
def append_utc_dates(df):
    df = df.withColumn("finishedTimestampUTC", to_utc_timestamp(col("finished"), "EST")) \
           .withColumn("startedTimestampUTC", to_utc_timestamp(col("started"), "EST")) \
           .withColumn("startedDateUTC", ConvertDateTimeToIntDate(col("startedTimestampUTC"))) \
           .withColumn("finishedDateUTC", ConvertDateTimeToIntDate(col("finishedTimestampUTC")))
    return df

# Adds a new attribute based on hand structure to the DataFrame
# (refHandStructure is a reference DataFrame loaded elsewhere in the notebook)
def append_BB(df):
    df = df.join(refHandStructure, df["structure"] == refHandStructure["structure"], "left") \
           .select(df["*"], refHandStructure["handStructureName"]) \
           .withColumn("BB", when(col("HSName") == "Limit", col("x")).otherwise(col("y")))
    df = df.drop("handStructureName")
    return df

# Perform hand transformations
dfTransformed = dfRaw.transform(append_utc_dates) \
                     .transform(append_BB)

display(dfTransformed)

You should then use new DataFrames for changes in grain or changes in purpose (not just for each transformation). This technique probably shouldn't be used for some of the more basic transformations, especially if you only have 1 or 2, but more so when you have 50/100+ lines of code. Not everything needs to be wrapped into functions, but it certainly reads better for larger notebooks!
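As a closing aside, if the list of transformation functions grows, the same chain can be driven from a Python list rather than a long series of .transform() calls. This is just an illustrative variation on the pattern above:

from functools import reduce

# Hypothetical ordered list of transformation functions, each taking and returning a DataFrame
transformations = [append_utc_dates, append_BB]

# Apply each function in turn, starting from the raw DataFrame
dfTransformed = reduce(lambda df, f: df.transform(f), transformations, dfRaw)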

Implementing Enterprise Security in Azure Databricks - Part 2

Following on from my last blog on the topic of security within Azure Databricks, which concentrated on the implementation model for data processing for the platform, this blog concentrates on the alternative – data processing for users.

Data Processing for Users

By this, I mean data-related activities that a user is performing interactively, for instance data analysis from Data Lake. With the addition of Databricks runtime 5.1, which was released in December 2018, comes the ability to use Azure AD credential pass-through. This is a huge step forward since there is no longer a need to control user permissions through Databricks groups / Bash and then assign these groups access to secrets to access Data Lake at runtime. As mentioned previously, with the current lack of AAD support within Databricks, ACL activities were done on an individual basis, which was not ideal. By using this feature, you can now pass the authentication onto Data Lake, and as we know one of the advantages of Data Lake is its tight integration into Active Directory, so this simplifies things. It's worth noting that this feature is currently in public preview, but having tested it thoroughly, I am happy with the implementation/limitations. The feature also requires a premium workspace and only works with high concurrency clusters – both of which you'd expect to use in this scenario.

The other good thing is that it's incredibly easy to enable this functionality, as it is controlled by the cluster configuration process. To enable it, navigate to the cluster configuration page, select runtime 5.1 or higher, and expand the advanced options. At this point, you will see a tick box which needs to be checked (see below). This will add another line of code into your Spark config. It's actually good to see it was implemented in this way – and it helps to stick to the Microsoft mantra of keeping things simple.

Once enabled, only connections into Data Lake via adl:// are valid – any existing connections via the DBFS, or through Databricks databases whose unmanaged tables are mounted to route via the DBFS, will stop working. This is a current limitation and may be fixed at GA (although technically you could re-build the tables using the adl:// path if this was an issue).
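In practical terms, once the cluster option is ticked, a user's notebook simply reads the lake with an adl:// path and their own AAD identity is evaluated against the Data Lake ACLs – no service principal or secret in sight. A minimal sketch (the store name and path below are placeholders):

# With AD credential pass-through enabled, access is evaluated against the ACLs
# granted to the signed-in user - no spark.conf OAuth settings are required.
# "myadls" and the folder path are placeholders for your own store and layout.
df = spark.read.csv("adl://myadls.azuredatalakestore.net/RAW/Sales/Sales.csv", header=True)
display(df)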
Great – so now my ACL can be controlled entirely within Data Lake without the need for Service Principals! But there's more. I touched on this in my previous blog, but as part of the access implementation for Data Lake, it is preferable to define Azure Active Directory (AAD) groups to provide further flexibility moving forward. By this I mean creating AAD groups and assigning them to Data Lake so as to create a level of abstraction away from Users/User AD Groups/Service Principals, so that modifications will not need to be made to Data Lake permissions in the future, only to the AAD groups that are already assigned. From experience, by not going down this route, any additional user permissions that need applying in future have to be applied across all folders/files, which, depending on the size of the data lake, can take a particularly long time or be awkward to add. Therefore this needs to be done upfront as part of the design! I would suggest the following conventions for this group setup, an example being AAD_PLATFORM_ADL_ENV_RAW_READER:

AAD – to separate out AD/AAD once sync is implemented.
PLATFORM – the platform or project/department this group is used by.
ADL – the resource on the platform that the group is used by.
ENV – the environment on which the resource resides (prod/non-prod).
RAW – the layer within the lake the permissions will be applied to.
READER – the permission the group will have.

You would then need to create the following AAD groups to cover all areas across the Data Lake. This assumes our standard Data Lake layer/folder pattern, first introduced by Ust's blog back in 2016.

AAD_PLATFORM_ADL_PROD_RAW_READER
AAD_PLATFORM_ADL_PROD_RAW_WRITER
AAD_PLATFORM_ADL_PROD_BASE_READER
AAD_PLATFORM_ADL_PROD_BASE_WRITER
AAD_PLATFORM_ADL_PROD_ENRICHED_READER
AAD_PLATFORM_ADL_PROD_ENRICHED_WRITER
AAD_PLATFORM_ADL_PROD_CURATED_READER
AAD_PLATFORM_ADL_PROD_CURATED_WRITER
AAD_PLATFORM_ADL_PROD_LABORATORY
AAD_PLATFORM_ADL_PROD_LIBRARY_READER
AAD_PLATFORM_ADL_PROD_ADMIN

The permissions applied to these groups can then be implemented using the following matrix. When adding these permissions, they need to be added to the current folder and all children, and added as both an access permission entry and a default permission entry. Without this, any changes to Data Lake in the future will not inherit these permissions.

Once this is complete, you would add the Users/AD User Groups/Service Principals into these AAD groups to provide access, and the pass-through permissions work as expected. This method separates out both read/write and the data lake layers, meaning that unless access is specifically granted, there will be a much more limited footprint in terms of access permissions into Data Lake. Using a combination of this and the AD credential pass-through from Databricks provides a suitable solution for implementing security using the Databricks/Data Lake combo.

Implementing Enterprise Security in Azure Databricks - Part 1

In recent weeks, I've spent a fair chunk of time working with different aspects of Databricks, and as part of this, one topic that consumed a proportion of that time is the security and authentication aspects of the service. Our scenario was one that I expect most people will come across over the next few years: essentially the integration of Databricks with Azure Data Lake for data processing. In the following blogs (yup, I need 2) I intend to document the way we went about implementing the security model, breaking it up into data processing for a platform and data processing for users – each has a slightly different implementation.

Data Processing for Platform

By this, I mean data processing activities that happen as part of a service, orchestrated through something such as Azure Data Factory, that happen on a schedule or in real-time. I expect the very first thing most people will do once they have a Databricks workspace is to mount their Data Lake. It's by far the easiest option for getting your hands on your data, and one that's still worth using to start off with as long as you understand the consequences. If you need to do this, I will point you in the direction of Hugh's blog post covering the topic, which I've used a number of times now to remember all the different secrets/GUIDs you need. By mounting the Data Lake in your Databricks workspace, you are able to interact with the data as if it were in the DBFS (the Databricks local file system). This proves incredibly powerful for tools such as dbutils, which allow you to perform a vast number of file system operations. Unfortunately the one downside to this method is that anyone with access to the workspace is able to directly interact with all layers in your Data Lake – not ideal. Since Data Lakes are not treated in the same way as something like an Azure SQL DB, you also won't have those handy recovery options if something happens to your data because of this unrestricted access. This obviously poses a risk for an enterprise solution. It's worth noting that the mount is done once, so your clusters can go down and be spun up again by other users and they would still be able to access the mount point until it is un-mounted.

With that in mind, it is far better to implement the access to Data Lake directly at runtime using the Spark API. The config for this is very similar to the mounting, except the Spark config is set at runtime, and therefore if this is not included you will be unable to access the Data Lake. This avoids the issues mentioned above where anyone can access the data just because they can access the workspace. The following commands can be used to set this up at runtime.

# Get secrets from Databricks
ClientId = dbutils.secrets.get(scope = "ScopeX", key = "ClientId")
ClientSecret = dbutils.secrets.get(scope = "ScopeX", key = "ClientSecret")
DirectoryId = dbutils.secrets.get(scope = "ScopeX", key = "DirectoryId")

# Apply variables to spark config
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", ClientId)
spark.conf.set("dfs.adls.oauth2.credential", ClientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/" + DirectoryId + "/oauth2/token")

You'll notice that as part of this I'm retrieving the secrets/GUIDs I need for the connection from somewhere else – namely the Databricks-backed secrets store. This avoids exposing those secrets in plain text in your notebook – again, this would not be ideal.
The secret access is then based on an ACL (access control list), so I can only connect to Data Lake if I've been granted access to the secrets. While it is also possible to connect Databricks up to Azure Key Vault and use that as the secrets store instead, when I tried to configure this I was denied based on permissions, and after research I was unable to overcome the issue. This would be more ideal to use, but unfortunately there is limited support currently, and the fact the error message contained spelling mistakes suggests to me the functionality is not yet mature.

To configure the Databricks-backed secrets, the easiest method is to use an Azure Bash console and go in via the Databricks CLI. To access the console, within the Azure portal you'll notice an icon similar to below as part of the top ribbon. Clicking this will then prompt you to start either a PowerShell or Bash console, which will look similar to below.

The connection into the Databricks CLI can be set up as per the following commands.

virtualenv -p /usr/bin/python2.7 databrickscli
source databrickscli/bin/activate
pip install databricks-cli
databricks configure --token

At this point, you'll need to provide it with both your Databricks host – something similar to https://northeurope.azuredatabricks.net – and a token. The token will need to be generated through your Databricks workspace, under User Settings / Access Tokens. This essentially lets you into the API without the need for a password.

The important thing to mention at this point is that the token is a personal access token. While this doesn't impact anything in particular with Azure Bash, it's worth noting that the token is created under your Databricks account, and therefore using this token for something such as a linked service into ADF will use your account to process the connection, authenticating as you. Hopefully over time this will be matured and the use of Managed Identities or Service Principals directly connected into the platform will be possible. As you might have guessed then, you will need to make sure the account that generates the token is also used within the ACL to read the secret, otherwise at processing time ADF will not be able to read through Databricks into the lake.

The configuration of the secrets is then required and can be done using the following commands. Simply replace [client id], [secret value] and [directory id] with the necessary values from your service principal.

databricks secrets create-scope --scope ScopeX
databricks secrets put --scope ScopeX --key ClientId --string-value [client id]
databricks secrets put --scope ScopeX --key ClientSecret --string-value [secret value]
databricks secrets put --scope ScopeX --key DirectoryId --string-value [directory id]
databricks secrets put-acl --scope ScopeX --principal admins --permission READ

I then granted access to read these secrets to the admins group – this just keeps things simple, but you can obviously provide it to individual users as well, or to other Databricks groups. One of my hopes for 2019 is that the platform is integrated better into AAD moving forwards, as everyone needs to be named individually currently. You would then need to manage permissions through the groups, which can only be done via the CLI. Once this is complete, you can now interact with your Data Lake and authenticate at runtime.
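As a quick sanity check before wiring anything into a pipeline, you can confirm from a notebook that the scope and keys are visible to your account (the secret values themselves come back redacted if you try to print them):

# Minimal check that the secret scope and keys are visible to the current user
dbutils.secrets.listScopes()
dbutils.secrets.list("ScopeX")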
It's also worth mentioning that this interaction changes the connection string from dbfs:// to adl://, which may have a knock-on effect if you use certain tools such as dbutils to do things within your code. This is also important to know since, with Databricks runtime 5.1 and AD credential pass-through, you will be unable to access anything other than Data Lake file systems. I'll explain this further in my next blog.

Conclusion

Hopefully this will prove useful to anyone venturing onto the platform and provide a basis to implement a security model. If you wanted to go a step further, you may also want to implement service principal access to Data Lake on a number of levels – both across folders and read/write. This would add slightly more complexity to the solution but provide an even more secure method, avoiding the scenario where accounts can access the entire Data Lake. In my next blog, I will look at it from a user perspective, which takes on a slightly different implementation.

Spark Streaming in Azure Databricks

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queues or files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with a minimal latency. This is in direct contrast to batch processing, which usually occurs on an hourly or daily basis. As is often the case, both of these will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

Azure Stream Analytics
Spark Streaming / Storm on HDInsight
Spark Streaming on Databricks
Azure Functions

Stream Analytics is a simple PaaS offering. It connects easily into other Azure resources such as Event Hubs, IoT Hub, and Blob, and outputs to a range of resources that you'd expect. It has its own intuitive query language, with the added benefit of letting you create functions in JavaScript. Scaling can be achieved by partitions, and it has the windowing and late arrival event support that you'd expect from a processing option. For most jobs, this service will be the quickest/easiest to implement, as long as its relatively small set of limitations doesn't get in the way of what you want to achieve. It's also worth noting that the service does not currently support Azure network security such as Virtual Networks or IP Filtering. I suspect this may only be a matter of time, given the preview of this in Event Hubs.

Both Spark Streaming on HDInsight and Databricks open up the options for configurability and are possibly more suited to an enterprise-level data platform, allowing us to use languages such as Scala/Python or even Java for the processing in the middle. The use of these options also allows us to integrate Kafka (an open source alternative to Event Hubs) as well as HDFS and Data Lake as inputs. Scalability is determined by the cluster sizes, and the support for the other events mentioned above is also included. These options also give us flexibility for the future, and allow us to adapt moving forward depending on evolving technologies. They also come with the benefit of Azure network security support, so we can peer our clusters onto a virtual network.

Lastly – I wouldn't personally use this, but we can also use Functions to achieve the same goal through C#/Node.js. This route however does not include support for those temporal/windowing/late arrival events, since functions are serverless and act on a per-execution basis.

In the following blog, I'll be looking at Spark Streaming on Databricks (which is fast becoming my favourite research topic). A good place to start is to understand the structured streaming model, which I've seen documented a few times now: essentially treating the stream as an unbounded table, with new records from the stream being appended as new rows to the table. This allows us to treat both batch and streaming data as tables in a DataFrame, therefore allowing similar queries to be run across them.

At this point, it will be useful to include some code to help explain the process. Before beginning, it's worth mounting your data sink to your Databricks instance so you can reference it as if it were inside the DBFS (Databricks File System) – this is merely a pointer. For more info on this, refer to the Databricks documentation here. Only create a mount point if you want all users in the workspace to have access.
If you wish to apply security, you will need to access the store directly (also documented in the same place) and then apply permissions to the notebook accordingly.

As the input for my stream was EventHubs, we can start by defining the reading stream. You'll firstly need to add the Maven coordinate com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.2 to the cluster to add the EventHub library and allow the connection. Further options can be added for the consumer group, starting positions (for partitioning), timeouts and events per trigger. Positions can also be used to define starting and ending points in time so that the stream is not running continuously.

# Imports required for the config below
import json
from datetime import datetime as dt

connectionString = "Endpoint=sb://{EVENTHUBNAMESPACE}.servicebus.windows.net/{EVENTHUBNAME};EntityPath={EVENTHUBNAME};SharedAccessKeyName={ACCESSKEYNAME};SharedAccessKey={ACCESSKEY}"

startingEventPosition = {
    "offset": "-1",        # start of stream
    "seqNo": -1,           # not in use
    "enqueuedTime": None,  # not in use
    "isInclusive": True
}

endingEventPosition = {
    "offset": None,        # not in use
    "seqNo": -1,           # not in use
    "enqueuedTime": dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ"),  # point in time
    "isInclusive": True
}

ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
ehConf['eventhubs.startingPosition'] = json.dumps(startingEventPosition)
ehConf['eventhubs.endingPosition'] = json.dumps(endingEventPosition)

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

The streaming data that is output then follows the following schema – the body, followed by a series of metadata about the streaming message.

It's important to note that the body comes out as a binary stream (this contains our message). We will need to cast the body to a String to deserialize the column into the JSON that we are expecting. This can be done by using some Spark SQL to turn the binary into a string as JSON and then parsing the column into a StructType with a specified schema. If multiple records are coming through in the same message, you will need to explode out the result into separate records. Flattening out the nested columns is also useful, as long as the data frame is still manageable. Spark SQL provides some great functions here to make our life easy.

from pyspark.sql.functions import from_json

# Note: 'Schema' is defined further down - it must exist before this cell is run
rawData = df. \
  selectExpr("cast(body as string) as json"). \
  select(from_json("json", Schema).alias("data")). \
  select("data.*")

While it's entirely possible to construct your schema manually, it's also worth noting that you can take a sample JSON message, read it into a data frame using spark.read.json(path), and then call printSchema() on top of it to return the inferred schema. This can then be used to create the StructType.

# Inferred schema:
# root
# |-- LineTotal: string (nullable = true)
# |-- OrderQty: string (nullable = true)
# |-- ProductID: string (nullable = true)
# |-- SalesOrderDetailID: string (nullable = true)
# |-- SalesOrderID: string (nullable = true)
# |-- UnitPrice: string (nullable = true)
# |-- UnitPriceDiscount: string (nullable = true)

from pyspark.sql.types import StructType, StructField, StringType

Schema = StructType([
    StructField('SalesOrderID', StringType(), False),
    StructField('SalesOrderDetailID', StringType(), False),
    StructField('OrderQty', StringType(), False),
    StructField('ProductID', StringType(), False),
    StructField('UnitPrice', StringType(), False),
    StructField('UnitPriceDiscount', StringType(), False),
    StructField('LineTotal', StringType(), False)
])

At this point, you have the data streaming into your data frame. To output to the console you can use display(rawData) to see the data visually.
However this is only useful for debugging, since the data is not actually going anywhere! To write the stream into somewhere such as Data Lake you would then use the following code. The checkpoint location can be used to recover from failures when the stream is interrupted, and this is important if this code were to make it to a production environment. Should a cluster fail, the query can be restarted on a new cluster from a specific point and consistently recover, thus enabling exactly-once guarantees. This also means we can change the query, as long as the input source and output schema are the same, without directly interrupting the stream. Lastly, the trigger will check for new rows in the stream every 10 seconds.

# PATH is a placeholder for your output and checkpoint locations
rawData.writeStream \
  .format("json") \
  .outputMode("append") \
  .option("path", PATH) \
  .trigger(processingTime = "10 seconds") \
  .option("checkpointLocation", PATH) \
  .start()

Checking our data lake, you can now see the data has made its way over, broken up by the time intervals specified.

Hopefully this is useful for anyone getting going in the topic area. I'd advise sticking to Python given the extra capacity of the PySpark language over Scala, even though a lot of the Databricks documentation and tutorials use Scala – this was just something that felt more comfortable. If you intend to do much in this area I would definitely suggest you use the PySpark SQL documentation, which can be found here. This is pretty much a bible for all commands and I've been referencing it quite a bit. If this is not enough, there is also a cheat sheet available here. Again, very useful for reference when the language is still not engrained.

Introduction to Spark-Part 1: Installing Spark on Windows

This is the first post in a series on Introduction To Spark.

For those wanting to learn Spark without the overhead of spinning up a cluster in the cloud or installing a multi-node cluster on-prem, you can get started relatively easily by creating a local mode installation on Windows. There are a number of posts available that outline steps to achieve this; hopefully this one will provide you with all you need to get up and running.

Software Requirements

There are five main components you will require in order to successfully set up Spark with Jupyter notebooks. You can download the installers/archives from the following links:

Anaconda
Java SDK (1.8 or later; jdk1.8.0_151 is used in this post)
Scala
Spark
WinUtils

I mention Anaconda because it is basically the easiest way to get your hands on all you need for the Jupyter and Python side of working with Spark. It also makes managing Python environments straightforward and comes with a whole bunch of packages already included, saving you having to install these. You can run Spark without this if you prefer; you'll just have to download Python (recommended 3.6, but min. 3.5 for Windows) and configure environments using the Python native functionality, none of which is particularly difficult. I'll leave you to Google that one as there are plenty of articles on this.

Updating Environment Variables

Just a quick note around this, as we'll be making some changes to these. Before editing your path environment variable it is advisable to save the current value safely to file, in case you make unwanted changes and need to revert. You can set environment variables using either SETX <var> "<value>" /M (you will need to run your command prompt with Administrator privileges) or via the System Properties dialogue. Note that the SETX method will truncate your path variable to 1024 characters, so if you have a longer path variable I'd suggest using the System Properties dialogue method below. It's worth mentioning that if you use SET rather than SETX from the command line, your changes will only be scoped to the lifetime of that command window. SETX will persist them to the master environment in the system registry, as will the System Properties dialogue approach. Another alternative is of course PowerShell, and there is a code snippet on StackOverflow here that should help. The permanent changes to the environment variables will not be visible within any open command windows, so if you want to test them you'll need to open a new window.

Installing Anaconda

Anaconda has an automated installer, which you should run, accepting the defaults as required. As you may want to be creating multiple Python environments on your machine, it is best not to add the default Anaconda install path to your Path environment variable, as advised in the installer, as this can cause unwanted executables to be found.

Creating a Python Environment

Once installed, it is advisable to create a Python environment for use with your Spark installation. Although not strictly required, this will allow you to make changes to the environment in isolation, without fear of breaking other Python development projects you may be working on. If you are using Anaconda, this is easily achieved using Anaconda Navigator. From there, go to Environments | Create, then in the dialogue, name your new environment and choose your Python version. I would recommend Python 3.6 for our Spark installation. With your environment created, add the following Jupyter packages. Your versions may differ slightly, but targeting these versions or higher is recommended.
You may experience Anaconda Navigator misbehaving, occasionally hanging when trying to resolve the package dependencies. If this happens, close it down, restart the application and try again. You may also want to include other packages that are not installed by default, such as numpy, matplotlib and pandas, as these will be of use within your environment.

Installing the Java SDK

Run the installer, accepting the required installation parameters. Add the path to the respective Java JDK<version>\bin directory to your Path environment variable. This is not strictly required as we have set JAVA_HOME via the installer, but it will make the various executables accessible without requiring an explicit path. So in the case of installing to C:\Java\jdk1.8.0_151, we can run

SETX path "%%path%%;C:\Java\jdk1.8.0_151\bin;" /M

or, in the case of a path environment variable approaching 1024 characters, use the System Properties dialogue method above.

Installing Scala

Run the Windows installer you downloaded and accept the defaults. Install Scala to a suitable location on your machine (I use C:\Scala for simplicity). The defaults will add some useful directories to your path environment variable. Add the following environment variable:

SCALA_HOME = <your Scala destination>\bin

Add the SCALA_HOME environment variable to your path environment variable. And that's it. To confirm you're up and running with Scala, simply run "scala" from your favourite command prompt. You should get a prompt returned as below. To exit from the Scala environment shell and return to the command prompt, type :q

Installing Spark

This is very straightforward. Simply unzip to a suitable location (I use C:\Spark for simplicity). I prefer to use PeaZip for this as it can handle pretty much any archive format, but 7Zip is also a popular choice. Add the following environment variable:

SPARK_HOME = <your Spark destination>

Add the following to your path environment variable:

%SPARK_HOME%\bin
%SPARK_HOME%\python
%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip
%SPARK_HOME%\python\pyspark

To test all is well, at the command prompt, type spark-shell and you should be greeted with the following. Your version of Spark will of course depend on the specific build you acquired. Again we're taken into a Scala prompt, so we can type :q to return to the command line.

Installing WinUtils

WinUtils provides a number of HDFS-emulating utilities that allow us to run Spark as though it were talking to an HDFS storage system (at least to a certain degree). Without this you will get all manner of file system-related issues with Spark and won't get off the launchpad. Within the WinUtils archive you may have a number of Hortonworks Data Platform versioned folders. For the version of Spark I'm using, being 2.2.1, I have chosen hadoop-2.7.1\bin for my files. Unzip and copy the contents of the bin directory to a directory of your choice. It must however be called 'bin' in order to be located by the calling programs. I actually placed mine in the C:\Spark\bin directory together with the other executables that Spark uses, but this is not essential. Once done, you will need to set the following environment variable:

HADOOP_HOME = <your winutils 'bin' parent directory>

Note we don't include the \bin, so for my example this is C:\Spark.

Setting Folder Permissions Required by Hive

Eh? Who said anything about Hive? Well, Spark will make use of certain parts of Hive under the covers, such as for the SparkSQL Hive metastore and the like, and as such needs to have access in place for writing the files that Hive uses. The first thing you will need to do with the WinUtils tools is change some permissions on a temporary file store used by the Hive Session driver. Create a directory hive under your Windows tmp directory (C:\tmp by default). You will need to assign some POSIX permissions to the folder, to allow read/write/execute for all users. You will need to open a command prompt with administrator permissions and ensure you are connected to any domain that the computer belongs to in order for this to be successful. At the command prompt, type the following:

winutils chmod -R 777 c:\tmp\hive

To confirm this has been applied, type

winutils ls -L c:\tmp\hive

If your permissions have been successfully applied you should see something like the following, signifying read/write/execute permissions on the directory for all users.

Running a PySpark Shell

As well as Scala, you can also work with Python. You should be able to start a PySpark session from an Administrator-elevated command prompt by simply issuing

pyspark

Note: failure to use an elevated command prompt will result in an error relating to the inability to start a HiveSession, due to an access denied issue:

py4j.protocol.Py4JJavaError: An error occurred while calling o23.sessionState. : java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':

This returns a similar output to that for the spark-shell command above, with the difference being that you will have a Python '>>>' prompt rather than the 'scala>' one.
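As a final check – a minimal sketch, assuming the PySpark shell started cleanly – you can build a tiny DataFrame inside the shell to confirm that Spark, Python and the Hive/WinUtils permissions are all playing nicely together:

# Quick smoke test from inside the PySpark shell ('spark' is pre-created for you)
df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "word"])
df.show()
# A two-row table printed here means the local installation is working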
This should be all you need in order to have a working environment for developing locally with Spark.

Coming Soon…

In the second post in this series we will be looking at working with Jupyter Notebooks, a popular all-encompassing environment for conducting Data Analytics using Spark.

Introduction to Spark-Part 3:Installing Jupyter Notebook Kernels

This is the third post in a series on Introduction To Spark.

Introduction

There are a large number of kernels that will run within Jupyter Notebooks, as listed here. I'll take you through installing and configuring a few of the more commonly used ones, as listed below:

Python3
PySpark
Scala
Apache Toree (Scala)

Kernel Configuration

Each kernel has its own kernel.json file, containing the required configuration settings. Jupyter will use this when loading the kernels registered in the environment. These are created in a variety of locations, depending on the kernel installation specifics. The file must be named kernel.json, and located within a folder that matches the kernel name.

Kernel Locations

There are various locations for the installed kernels. For those included in this article the locations below have been identified:

<UserProfileDir>\AppData\Roaming\jupyter\kernels
<AnacondaInstallDir>\envs\<EnvName>\share\jupyter\kernels
<ProgramDataDir>\jupyter\kernels

Where <UserProfileDir> will be as per the environment variable %UserProfile%, <ProgramDataDir> will be as per %ProgramData%, and <AnacondaInstallDir> is the installation root directory for Anaconda, assuming you are using this for your Python installation.

Listing Jupyter Kernels

You can see what kernels are currently installed by issuing the following:

jupyter kernelspec list

Installation

Python3

This comes 'out of the box' with the Python 3 environment, so should require no actual setup in order to use. You'll find the configuration file at <AnacondaInstallDir>\envs\Python36\share\jupyter\kernels\Python3. The configuration contains little else other than the location of the python.exe file, some flags, and the Jupyter display name and language to use. It will only be available within the Python environment in which it is installed, so you will need to change to that Python environment prior to starting Jupyter Notebooks, using 'activate <envName>' from a conda prompt.

PySpark

This requires a little more effort than the Python 3 kernel. You will need to create a PySpark directory in the required location for your Python environment, i.e. <AnacondaInstallDir>\envs\<EnvName>\share\jupyter\kernels\PySpark. Within this directory, create a kernel.json file with the following data:

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "<AnacondaInstallDir>\\Envs\\<EnvName>\\python.exe",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "<SparkInstallDir>",
    "PYSPARK_PYTHON": "<AnacondaInstallDir>\\Envs\\<EnvName>\\python.exe",
    "PYTHONPATH": "<SparkInstallDir>\\python; <SparkInstallDir>\\python\\pyspark; <SparkInstallDir>\\python\\lib\\py4j-0.10.4-src.zip; <SparkInstallDir>\\python\\lib\\pyspark.zip",
    "PYTHONSTARTUP": "<SparkInstallDir>\\python\\pyspark\\shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
  }
}

All Windows paths will of course use backslashes, which must be escaped using a backslash, hence the '\\'. You need to include paths to the zip archives for py4j and pyspark in order to have full kernel functionality. In addition to the basic Python pointers we saw in the Python 3 configuration, we have set a number of Windows environment variables for the lifetime of the kernel. These could of course be set 'globally' within the machine settings (see here for details on setting these variables), but this is not necessary and I have avoided it to reduce clutter. The PYSPARK_SUBMIT_ARGS parameter will vary based on how you are using your Spark environment.
Above I am using a local install with all cores available (local[*]). In order to use the kernel within Jupyter you must then 'install' it into Jupyter, using the following:

jupyter kernelspec install <AnacondaInstallDir>\envs\<EnvName>\share\jupyter\kernels\PySpark

Jupyter-Scala

This can be downloaded from here. Unzip and run the jupyter-scala.ps1 script on Windows using elevated permissions in order to install. The kernel files will end up in <UserProfileDir>\AppData\Roaming\jupyter\kernels\scala-develop and the kernel will appear in Jupyter with the default name of 'Scala (develop)'. You can of course change this in the respective kernel.json file.

Apache Toree

This allows the use of the Scala, Python and R languages (you will only see Scala listed after install, but apparently it can also process Python and R), and is currently at incubator status within the Apache Software Foundation. The package can be downloaded from Apache here; however, to install, just use pip install with the required tarball archive URL and then jupyter install as below (from an elevated command prompt):

pip install http://apache.mirror.anlx.net/incubator/toree/0.1.0-incubating/toree-pip/apache-toree-0.1.0.tar.gz
jupyter toree install

This will install the kernel to <ProgramDataDir>\jupyter\kernels\apache_toree_scala.

You should now see your kernels listed when running Jupyter from the respective Python environment. Select the 'New' dropdown to create a new notebook, and select your kernel of choice.

Coming Soon...

In part 4 of this series we'll take a quick look at the Azure HDInsight Spark offering.

Introduction to Spark - Part 2: Jupyter Notebooks

This is the second post in a series on Introduction To Spark.

What Are They?
Jupyter Notebooks provide an interactive environment for working with data and code. Used by Data Analysts, Data Scientists and the like, they are an extremely popular and productive tool. Jupyter Notebooks were previously known as IPython, or 'Interactive' Python, and as such you will still find references to this name throughout various documents.

The Notebook Environment
The Jupyter notebook environment consists of a browser-based notebook UI and a back-end server, running on port 8888 by default (if this port is taken it will start up on the next available port). This web server-based delivery of notebooks means that you can browse to a remote server and execute your code there. This is the case, for example, when using a ready-made cluster such as an HDInsight Spark cluster, where all the tooling has been pre-installed for you. You open the notebook in the cluster portal within Azure, and it logs you in to the Jupyter server running on a node within the cluster. Note that if you want to allow multi-user access to your local Jupyter environment, you'll need to be running a product such as JupyterHub.

Starting the Notebooks Environment
For our local install, we can run our Jupyter Notebooks using a couple of different methods.

Anaconda Jupyter Notebook
As mentioned in the previous post, Anaconda comes with Jupyter pre-installed. For each Python environment that has the Jupyter Notebook package installed, you will see a Jupyter Notebook (<env name>) entry under the Anaconda Start menu items. You can also access this from the Anaconda Navigator Home tab, together with a bunch of other data analytics-related applications such as RStudio and Spyder.

Conda Prompt
Open an Anaconda Prompt, change to the required environment and execute the application:

activate <env name>
jupyter notebook

This will execute the jupyter-notebook.exe file (via the jupyter.exe file) installed within the selected environment, being the entry point for the Jupyter Notebook application. It is important to load the installation of Jupyter Notebook from the desired Python environment in order to have access to the kernels that have been installed there.

Shutting Down the Jupyter Server
You can close your Jupyter Notebook at any time, but you will need to make sure that the server process has also stopped. Back in your command window, press Ctrl+C twice and it will shut down, returning you to the command prompt.

Kernels
The code you enter in your notebook is executed using a specified kernel. There are a whole bunch of kernels supported, as detailed here, which can be easily installed into the environment and configured as required. The most popular kernels for working with Spark are PySpark and Scala. I'll take you through installing some of these kernels in the next post in the series.

Cells
Jupyter notebooks consist of cells, in which you enter code for your data analytics needs, for number crunching and rendering visuals, write markdown text (for documentation), and even add basic UI controls such as sliders, dropdowns and buttons. Your code will use the chosen kernel and as such must comply with the respective execution language. This offers a very productive collaboration environment in which to work, with the notebooks containing the instructions to calculate and visualise the data of interest being easily shared amongst co-workers.
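On the subject of the notebook server mentioned earlier, if you lose track of which servers are running locally and on which ports, a quick check can be done from Python. This is only a sketch and assumes the classic notebook package (the same one started above) is installed in the active environment:

from notebook import notebookapp

# list the Jupyter Notebook servers currently running for this user
for server in notebookapp.list_running_servers():
    print(server)  # each entry is a dict containing the server url, port, notebook directory, etc.

The same information is available from the command line with 'jupyter notebook list'.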
Notebooks may appear at first to be a bit basic, but in reality they offer pretty much everything you need to get to grips with analysing and displaying your data, leveraging all manner of libraries within your chosen language. Cell output can be ASCII-text rendered tables, formatted text, or various visualisations such as formatted tables, histograms, scatter charts or, if you tap into the more advanced widgets/tools, animated 3D graphs and more.

Edit Mode vs Command Mode
These two modes offer different behaviour within the notebook. 'Edit mode' is for editing within your cells; 'command mode' is for issuing commands that are not related to cell editing but to the notebook itself. To enter 'edit mode', simply press Enter on a cell, or click within the cell. To leave 'edit mode' and enter 'command mode', press Esc. On executing a cell, you will automatically enter 'command mode'.

Keyboard Shortcuts
There are a considerable number of keyboard shortcuts within the notebook environment, the list of which can be seen by clicking on the 'command palette' button. Some commonly used shortcuts of note are:

- Run Cell and Select Next: Shift + Enter (Edit Mode). Executes the current cell and moves to the next one.
- Run Cell and Insert Next: Alt + Enter (Edit Mode). Executes the current cell and inserts a new cell below the current one.
- Run Selected Cells: Ctrl + Enter (Edit Mode). Executes all currently selected cells.
- Delete Cells: d d (Command Mode). Deletes the currently selected cell.
- Cut Selected Cells: x (Command Mode). Cuts the selected cells to the clipboard.
- Copy Selected Cells: c (Command Mode). Copies the cells to the clipboard.
- Paste Selected Cells Below: v (Command Mode). Pastes the cells from the clipboard below the current cell.
- Paste Selected Cells Above: Shift + v (Command Mode). Pastes the cells from the clipboard above the current cell.
- Select Next, Previous Cell: Up, Down (Command Mode). Moves to the cell above or below the current cell.

It's worth familiarising yourself with the various shortcuts as an aid to productivity.

Magics
Magics are essentially shortcut commands for various actions within the Jupyter Notebook environment. The magics auto-loaded will depend on the kernel chosen. Magics come in two flavours: cell magics and line magics.

Cell Magics
These are prefixed %% and act on the contents of the cell.

Line Magics
These are prefixed % and act on the contents of the line that the magic is on. If the 'Automagic' functionality is turned on, the % is not required. Many magics have both cell and line versions.

Viewing Magics Available in the Kernel
You can list the magics available using, yep, you guessed it, a magic. %lsmagic will show all cell and line magics currently loaded. If you need to load another magic, use %load_ext <magic package name>. You should consult the specific magic's documentation in order to get the correct reference to use with %load_ext. You can get help on a specific magic by typing %<magic cmd>?

Common Magics

%load_ext
As mentioned, this will load a magic from a library, providing it has been installed in the respective kernel environment.

%config
This allows configuration of classes made available to the IPython environment. You can see which classes can be configured by executing %config with no parameters. We'll make use of this for the %%read_sql magic below.
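As a quick illustration, the lines below could be run in a notebook cell to explore what is loaded. The extension name used with %load_ext is purely a placeholder; substitute the package you actually want to load:

%lsmagic                      # show all line and cell magics currently registered
%config                       # show which classes can be configured
%load_ext some_magic_package  # hypothetical package name - check the magic's docs for the real one
%timeit?                      # append ? to any magic to get its help text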
%%SQL
Probably the most commonly used within Spark is the SQL magic, %%SQL. This allows using SparkSQL with a SQL syntax from within the notebook, reducing the code written considerably. It is available within the Spark kernel installed with Apache Toree, so you will need to start the 'Apache Toree - Scala' kernel to use this magic. With SparkSQL you would ordinarily call spark.sql with your query string and then display the resulting DataFrame, but with the %%SQL magic you simply write the SQL itself in the cell. It basically returns a Spark DataFrame from the SQL expression used, and renders it to the results area under the cell. Given the power that SparkSQL has (as we'll see in a later post), this simplifies use of that most popular of data querying languages, namely our good old friend SQL.

%read_sql, %%read_sql
If you are using a different kernel, you will need to use an alternative, such as %read_sql from the sql_magic package supplied by Pivotal, which requires a little more code. You can read about installing and using that here. To use this with Spark, you will need to connect to the Spark context. This is done by changing the conn_name property of the sql config class to 'spark'. After installation, the boilerplate is to load the extension with %load_ext sql_magic and then set that conn_name property via %config (check the sql_magic documentation for the exact class name), after which you can use the %read_sql / %%read_sql magics in place of %%SQL. The output style is slightly different, but essentially it is the same as %%SQL.

%%HTML
Pretty obvious this one. It will output your HTML to the results area under the cell, allowing display of webpage content within the notebook.

%%Latex
For those writing equations or requiring the formatting commonly found in scientific papers, this offers a subset of LaTeX functionality as provided by MathJax. Rather spiffing considering all the maths generally kicking around on your average Data Science project.

Notable Packages for Installation
Below are a few packages of note. You can install these into your Python environment using pip install, the Anaconda Navigator environment package manager (please see the previous post for information on this) or a conda prompt. When using pip you will need to enable the extension as well. The extension does not necessarily share the same name as the package installed, so check the respective documentation for installation specifics. For example, installing IPyWidgets requires running the following:

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

IPyWidgets
A set of core IPython/Jupyter widgets that include a bunch of controls, such as the Slider for selecting numeric values, Dropdown, Command Button etc. You can build a rudimentary UI in your notebook with these, at least for your data filtering/selection purposes. See here for some examples. This package is a dependency for various other packages, such as AutoVizWidget.

AutoVizWidget
This provides simple visualisation of pandas data frames, with controls to change the visualisation type. You'll see these controls pre-installed in the HDInsight installation of Jupyter Notebooks. You can see examples of usage here. This uses the Plot.ly library. As you can see, this allows very quick visualisations with some easy options for changing displays.

Plot.ly
An advanced data visualisation library, with some pretty impressive display options and requiring a relatively small amount of code input. You can find an introduction here. See the Visualisations section below for an example usage, and note that you don't have to create an account to use this, just use the plotly.offline objects.
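As a minimal sketch of that account-free offline usage (using the plotly API as it stood around the time this was written; the sample data points are purely illustrative):

import plotly.offline as pyo
import plotly.graph_objs as go

pyo.init_notebook_mode(connected=True)  # wire plotly up for inline rendering in the notebook

# a simple scatter trace over made-up sample points
trace = go.Scatter(x=[1, 2, 3, 4], y=[10, 4, 7, 12], mode='markers')
pyo.iplot([trace])  # renders the chart in the results area under the cell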
Dash
Also from Plot.ly, this has some serious visual capabilities. It is a commercial venture from Plot.ly but is available for free if you don't need the distribution platform for Enterprise use. To use it within Jupyter, however, you will need to use an HTML element to embed it in an iframe. I haven't played with this one yet, but you can find out how this is apparently done here.

Visualisations
It is easy to quickly output visuals when working with PySpark, using libraries such as matplotlib, Plot.ly and others. Simply import the required packages, prepare a dataset for rendering, and with a few simple lines of code you have your data displayed.

Matplotlib
The foundation for pretty much everything else for numerical visualisation, this allows pandas data frames to be rendered in a very succinct fashion. Installation instructions are here, with a dizzying number of examples here. Here's another example, not so much of an impressionist style: a sinusoid with exponential decay (a damped sinusoid).

Plot.ly
Here's a quick example of a 2D histogram contour with heatmap colouring. Not bad for less than 10 lines of code (although it does look like it might double as a secret base for a Bond villain).

Troubleshooting
Below are a couple of common issues you may encounter when running Spark locally with Jupyter Notebooks.

Apache Derby lockouts
The default Hive metastore used under the covers by Spark runs on Apache Derby, which is a single-user connection database. So if you try and fire up another process that wants to access it, you will receive an error similar to that below:

Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@1dc0ba9e, see the next exception for details.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
... 107 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database C:\Windows\System32\metastore_db.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
...

The path above to the metastore will depend on where your current working directory is, so may well be different. The lock file db.lck should clear when shutting down the other notebook instance, but if it doesn't you are okay to delete it, as it will be recreated on each connection. For 'real' installations you should be using a different database for the metastore, such as MySQL, but for personal experimentation locally this is not really necessary.

Hive Session Creation Failed
As per the previous post, you may encounter this due to permissions on the tmp\hive directory. Please see the previous post for a solution to this issue.

Next up...
In the next post in the series we'll look at extending Jupyter Notebooks by installing Jupyter kernels.

Introduction to Spark

Spark is all the rage at the moment (and has been for a while) in the Big Data and Analytics communities, seeing application across all aspects of working with data, from streaming to data science. It offers a very performant, multi-purpose, scalable platform with a very strong user community. In this series I'll be looking at setting Spark up on a local machine for learning purposes, working with the Jupyter notebook environment for data wrangling, munging and visualisation. We'll also take a quick look at cloud platform offerings and some of the basics of the extensions to the core Spark platform, such as Spark Structured Streaming and Spark ML. Spark is a very large subject and I won't be going into too much depth, just enough to give readers a taster of its capabilities and ease of use. There are some fantastic sources of information out there in the Spark community for those interested in a deeper understanding, which I'll provide references to along the way.

Part 1: Installing Spark on Windows
Part 2: Jupyter Notebooks
Part 3: Installing Jupyter Notebook Kernels
Part 4: Spark on Azure HDInsight
Part 5: Spark on Azure Databricks
Part 6: Spark Core
Part 7: Spark SQL
Part 8: Spark Structured Streaming
Part 9: Spark ML
Part 10: Spark GraphX

My Experience of the Microsoft Professional Program for Data Science

(Image 1 - Microsoft 2017 - https://academy.microsoft.com/en-us/professional-program/data-science)

In 2016 I was talking to Andrew Fryer (@DeepFat), Microsoft technical evangelist (after he attended Dundee University to present about Azure Machine Learning), about how Microsoft were piloting a degree course in data science. My interest was immediately piqued. Shortly after this, hints began to appear and the Edx page went live; soon after that, the degree was rebranded as the "Professional Program". I registered to be part of the pilot, however I was not accepted until the course went live in September 2016.

Prior to 2016 my background was in BI, predominantly in Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in but little commercial experience of (R, machine learning and statistics). This course is ongoing and will finish in 2018; I highly recommend it! I would argue that it is the best Data Science Master's degree course in the UK. So going into the MPP I had a decent idea of what to expect, plus a lot of SQL experience, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.

Link to program - https://academy.microsoft.com/en-us/professional-program/data-science
Link to Dundee Masters - https://www.dundee.ac.uk/study/pg/data-science/

Structure of the program
The program is divided up into 9 modules and a final project. All modules need to be completed, but there are different options you can take - you can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit the course for free. I will indicate which modules I took and why. Most modules recommend at least 6 weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project in the middle of January 2017, so the 6 week estimate is quite high, especially if you already have a decent base knowledge of the concepts.

You can, if you wish, complete multiple modules at once. I am not sure I recommend this approach, as to get the most out of the course you should read around the subject as well as watching the videos. Each module has a start date and an end date that you need to complete it between; if you do not, you will need to do it all again. You can start a module in one period and wait until the next for another module - you do not need to complete them all in 3 months. If you pay for a module but do not request your certificate before the course closes, you will need to take it again (top tip: as soon as you're happy with your score, request your certificate).
Module list (module: course options, time taken)
- Data Science Orientation: Data Science Orientation (2 - 3 days)
- Query Relational Data: Querying Data with Transact-SQL (1 day - exam only)
- Analyze and Visualize Data: Analyzing and Visualizing Data with Excel, or Analyzing and Visualizing Data with Power BI (2 - 4 days)
- Understand Statistics: Statistical Thinking for Data Science and Analytics (7 - 9 days)
- Explore Data with Code: Introduction to R for Data Science, or Introduction to Python for Data Science (7 - 9 days)
- Understand Core Data Science Concepts: Data Science Essentials (7 - 9 days)
- Understand Machine Learning: Principles of Machine Learning (2 weeks)
- Use Code to Manipulate and Model Data: Programming with R for Data Science, or Programming with Python for Data Science (R: 2 - 3 days; Python: 3 weeks)
- Develop Intelligent Solutions: Applied Machine Learning, Implementing Predictive Solutions with Spark in HDInsight, or Developing Intelligent Applications (2 weeks)
- Final Project: Data Science Challenge (2 months*)

The times taken are based on the time I had spare; I completed each module between projects, in the evenings and at weekends. *The final project itself can be completed in a few days, however you need to wait until it has closed to get your grade.

Structure of the modules
Each module is online. You log on to the Edx website and watch videos by leading experts. Either at the end of a video, after reading some text, or at the end of a section of the module, you are given a multiple choice test. The multiple choice questions are graded and form part of your overall score. The other main assessment method is labs, where you are required to complete a series of tasks and enter the results. Unlike certifications, you get to see what your score is as you progress through the module. The multiple choice questions generally allow you two to three attempts at the answer; sometimes these are true/false with two attempts, which does undermine the integrity of the course. There is normally a final section which you're only given one chance to answer, and which holds a higher percentage towards your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate - if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, in all honesty, towards the end, once I hit around 80% I started looking at the next module. If the module was really interesting I would persevere.

Modules
Data Science Orientation, Query Relational Data & Analyze and Visualize Data. These modules are very basic and really only skim the surface of the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next module focuses on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL is the language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft data science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next module you have the option of Excel or Power BI for visualisation. As I have experience with Power BI, I opted for that one. Once again this is a very basic introduction to Power BI; it will get you familiar enough with the tool that you can do basic data exploration. Some parts of this course jarred with me.
Data visualisation is so important and a key skill for any data scientist. In the Power BI module one of the exercises was to create a 3D pie chart. Pie charts are not a good visualisation as it is hard to differentiate between angles, and making it 3D only escalates the issue. I wish Microsoft had made reference to some of the great data viz experts when making this module. I cannot comment on the Excel version.

Understand Statistics. This module is different from its predecessors in that it is not run by Microsoft. It is a MOOC from Columbia University, which you might have completed before. It covers a lot of the basic and more advanced stats that you need to know for data science, in particular a solid grounding in probability and probability theory. In BI you become familiar with descriptive stats and measures of variance, however I had not done a great deal of stats beyond this. I had been researching statistical methods for the MSc, but I had not done any real stats since A-Level maths. This course was really interesting and I learnt a lot. I don't know if this is the best way to really learn stats, but it is a good primer for what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.

Explore Data with Code. You have two options again for this module: R and Python. Which should you learn, I imagine you're asking; well, the simple answer is both. Knowing either R or Python will get you so far; knowing both will make you a unicorn. Many ask why learn one language over the other - aside from the previous point, R is very easy to get into, with a rich catalogue of libraries written by some of the smartest statistical minds, a simple interface, and an easy installation. Python is harder to learn in my opinion as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for scikit-learn, the machine learning library. The Python module is extended to use code dojo (the great online tuition site). As you progress through the questions and examples, you have an IDE which will check your understanding and grade you as you go. I found this really helpful. This module is again a bit on the easier side. If you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.

Understand Core Data Science Concepts. Almost a redo of the first module and the Understand Statistics module. Not a lot to say here, but repetition helped me understand and remember the concepts. The more I had to think about the core concepts, the more they stuck. This module could have been removed with little to no impact on the course, but it helped solidify my knowledge.

Understand Machine Learning. As this is a Microsoft course, this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag and drop interface which allows you to build quick, simple models and create a web API key which you can then pass data to using any tool that can call a REST API. This module is half theory and half practical. There are a lot of labs, so you will need to take your time. If you skip ahead you will get the answers wrong and might not make it to 70%.

Use Code to Manipulate and Model Data. This section has two options again, R and Python. I know quite a bit of R already, so I started with Python. I wanted to do them both to see how you can do machine learning in each.
I was expecting a continuation of the code dojo format from the previous module; this was far from the case. Each of the modules up until this point had worked with you to find the right answer. This module will equip you with the basics, but expects you to find the correct function and answer yourself. Believe me when I say it was hard (with little prior experience of Python). The course will lead you towards the right resources, but you need to read the documentation to answer the question. This was a great change of pace. Having to search for the answers made me absorb more than just the quizzes. This module was a struggle. Once I completed it I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficulty is massive and frankly unfair. I was able to complete the R module very quickly, and left feeling disappointed that it did not have the same complexity as the Python module.

Develop Intelligent Solutions. For this section you can pick one of three modules: machine learning, Spark or microservices. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open source point of view, but not from a Microsoft HDInsight perspective. This module was tricky, but nothing compared to the Python module. I spent the best part of a week working on Spark, setting up HDInsight clusters and forgetting to tear them down (top tip! Don't leave an HDInsight cluster running - they are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the Machine Learning or the Microservices modules.

Final Project. Here is where the fun begins. You're given a problem and a dataset. You need to clean, reduce, derive features from and process the dataset, then apply an ML technique to predict something - in my case, whether or not someone will default on a loan. You could use any technique you liked as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules, where you can complete a module and get your score, your final score is only available once the module has ended. You build an ML experiment and test against a private dataset. You can submit your experiment 3 times a day to be scored against the private data (maximum of 100 attempts). This will give you an indication of your score, but it is not your final score! Your score is calculated against a different dataset after the module has finished. Your top 5 scores will be used to test against the private closed data. If you have over-fitted your model, you might have a shock (as many did on the forums) when your score is marked.

I completed all modules at the start of January and waited until February to get my final score. My highest scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising but not all that unexpected; I had over-fitted the model. To counterbalance this, I created 5 different experiments with 5 similar but different approaches. All scored similarly (~1-3% difference in accuracy). This was enough to see me past the required 70% and to obtain the MPP in Data Science. The private dataset has now been published.
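As an aside, the over-fitting trap described above is easy to reproduce outside Azure ML. The sketch below is not the approach used in the challenge, just a minimal scikit-learn illustration (on made-up synthetic data) of why a hold-out set gives a more honest estimate than the score you tune against:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for a loan-default style dataset
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))    # often near 1.0 - looks great
print("hold-out accuracy:", model.score(X_test, y_test))   # the number that actually matters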
In the coming weeks I will blog about the steps I took to predict whether someone would default on their loan.

I have been asked at different stages of the course, "Would you recommend it?". It really depends on what you want out of the course! If you expect to be a data scientist after completing the MPP, then you might be in for a shock. To get the most out of the course you need to supplement it with wider reading and research. YouTube has many great videos and recorded lectures which will really help you process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in data science (from a Microsoft point of view) then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.

I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide, and you persevere, you too will be the owner of the MPP in Data Science (as seen below).

Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - you can find us on MeetUp.