
Adatis BI Blogs

Implementing Enterprise Security in Azure Databricks - Part 2

Following on from my last blog on security within Azure Databricks, which concentrated on the implementation model for data processing for a platform, this blog concentrates on the alternative: data processing for users.

Data Processing for Users

By this, I mean data-related activities that a user is performing interactively, for instance data analysis from Data Lake. With Databricks Runtime 5.1, released in December 2018, comes the ability to use Azure AD credential pass-through. This is a huge step forward since there is no longer a need to control user permissions through Databricks groups / Bash and then assign those groups access to secrets in order to reach Data Lake at runtime. As mentioned previously, with the current lack of AAD support within Databricks, ACL activities were done on an individual basis, which was not ideal. By using this feature, you can now pass the authentication on to Data Lake, and as we know, one of the advantages of Data Lake is its tight integration with Active Directory, so this simplifies things. It's worth noting that this feature is currently in public preview, but having tested it thoroughly, I am happy with the implementation and its limitations. The feature also requires a premium workspace and only works with high concurrency clusters – both of which you'd expect to use in this scenario.

The other good thing is that it's incredibly easy to enable this functionality, as it is controlled through the cluster configuration process. To enable it, navigate to the cluster configuration page, select runtime 5.1 or higher, and expand the advanced options. At this point, you will see a tick box which needs to be checked (see below). This will add another line of code into your Spark config. It's actually good to see it was implemented in this way – it helps to stick to the Microsoft mantra of keeping things simple.

Once enabled, only connections into Data Lake via adl:// are valid – any existing connections through dbfs, or through Databricks databases mounted to route via dbfs for unmanaged tables, will stop working. This is a current limitation and may be fixed at GA (although technically you could re-build the tables using the adl:// path if this was an issue). A minimal example of reading directly via the adl:// path is shown below.
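With pass-through enabled there are no service principal secrets to set in the Spark config; you simply read from the adl:// path and your own AAD identity is evaluated against the Data Lake ACLs. A minimal sketch, assuming a hypothetical store name and folder:

# No spark.conf.set(...) calls are needed here - with pass-through enabled,
# the notebook user's own AAD identity is checked against the Data Lake ACLs.
# The store name and path below are illustrative only.
df = spark.read.csv("adl://myadls.azuredatalakestore.net/RAW/Sales/sales.csv", header=True)
display(df)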
Great – so now my ACL can be controlled entirely within Data Lake without the need for service principals! But there's more.

I touched on this in my previous blog, but as part of the access implementation for Data Lake, it is preferable to define Azure Active Directory (AAD) groups to provide further flexibility moving forward. By this I mean creating AAD groups and assigning them to Data Lake, so as to create a level of abstraction away from users/user AD groups/service principals, so that modifications will not need to be made to Data Lake permissions in the future, only to the AAD groups that are already assigned. From experience, by not going down this route, any additional user permissions that need applying in future have to be applied across all folders/files, which, depending on the size of the data lake, can take a particularly long time and be awkward to add. Therefore this needs to be done upfront as part of the design! I would suggest the following conventions for this group setup, an example being AAD_PLATFORM_ADL_ENV_RAW_READER.

AAD – to separate out AD/AAD once sync is implemented.
PLATFORM – the platform or project/department this group is used by.
ADL – the resource on the platform that the group is used by.
ENV – the environment on which the resource resides (prod/non-prod).
RAW – the layer within the lake the permissions will be applied to.
READER – the permission the group grants.

You would then need to create the following AAD groups to cover all areas across the Data Lake. This assumes using our standard Data Lake layer/folder pattern first introduced by Ust's blog back in 2016.

AAD_PLATFORM_ADL_PROD_RAW_READER
AAD_PLATFORM_ADL_PROD_RAW_WRITER
AAD_PLATFORM_ADL_PROD_BASE_READER
AAD_PLATFORM_ADL_PROD_BASE_WRITER
AAD_PLATFORM_ADL_PROD_ENRICHED_READER
AAD_PLATFORM_ADL_PROD_ENRICHED_WRITER
AAD_PLATFORM_ADL_PROD_CURATED_READER
AAD_PLATFORM_ADL_PROD_CURATED_WRITER
AAD_PLATFORM_ADL_PROD_LABORATORY
AAD_PLATFORM_ADL_PROD_LIBRARY_READER
AAD_PLATFORM_ADL_PROD_ADMIN

The permissions applied to these groups can then be implemented using the following matrix. When adding these permissions, they need to be added to the current folder and all children, and added as both an access permission entry and a default permission entry. Without this, any new folders or files added to Data Lake in the future will not inherit these permissions.

Once this is complete, you would add the users/AD user groups/service principals into these AAD groups to provide access, and the pass-through permissions work as expected. By using this method, it separates out both read/write and the data lake layers, meaning that unless specifically granted, there will be a much more limited footprint in terms of access permissions into Data Lake. Using a combination of this and the AD credential pass-through from Databricks provides a suitable solution for implementing security using the Databricks/Data Lake combo.
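For reference, a sketch of how the group-to-permission matrix described above might look, assuming READER maps to read and execute (r-x) and WRITER to read, write and execute (rwx) on the corresponding layer folder, with each entry applied as both an access and a default ACL (folder names are assumptions based on the layer pattern):

# Hypothetical sketch of the AAD group / Data Lake ACL matrix described above.
acl_matrix = [
    # (AAD group,                             lake folder,   permissions, apply as default ACL too)
    ("AAD_PLATFORM_ADL_PROD_RAW_READER",      "/RAW",        "r-x", True),
    ("AAD_PLATFORM_ADL_PROD_RAW_WRITER",      "/RAW",        "rwx", True),
    ("AAD_PLATFORM_ADL_PROD_BASE_READER",     "/BASE",       "r-x", True),
    ("AAD_PLATFORM_ADL_PROD_BASE_WRITER",     "/BASE",       "rwx", True),
    ("AAD_PLATFORM_ADL_PROD_ENRICHED_READER", "/ENRICHED",   "r-x", True),
    ("AAD_PLATFORM_ADL_PROD_ENRICHED_WRITER", "/ENRICHED",   "rwx", True),
    ("AAD_PLATFORM_ADL_PROD_CURATED_READER",  "/CURATED",    "r-x", True),
    ("AAD_PLATFORM_ADL_PROD_CURATED_WRITER",  "/CURATED",    "rwx", True),
    ("AAD_PLATFORM_ADL_PROD_LABORATORY",      "/LABORATORY", "rwx", True),
    ("AAD_PLATFORM_ADL_PROD_LIBRARY_READER",  "/LIBRARY",    "r-x", True),
    ("AAD_PLATFORM_ADL_PROD_ADMIN",           "/",           "rwx", True),
]

for group, folder, perms, default in acl_matrix:
    print(f"{group}: {perms} on {folder}" + (" (access + default entry)" if default else ""))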

Implementing Enterprise Security in Azure Databricks - Part 1

In recent weeks, I've spent a fair chunk of time working with different aspects of Databricks, and one topic that consumed a fair proportion of that time is the security and authentication side of the service. Our scenario was one that I expect most people will come across over the next few years: the integration of Databricks with Azure Data Lake for data processing. In the following blogs (yup, I need two) I intend to document the way we went about implementing the security model, breaking it up into data processing for a platform and data processing for users – each has a slightly different implementation.

Data Processing for Platform

By this, I mean data processing activities that happen as part of a service, orchestrated through something such as Azure Data Factory, that happen on a schedule or in real time. I expect the very first thing most people will do once they have a Databricks workspace is to mount their Data Lake. It's by far the easiest option for getting your hands on your data, and one that's still worth using to start off with as long as you understand the consequences. If you need to do this, I will point you in the direction of Hugh's blog post covering the topic, which I've used a number of times now to remember all the different secrets/GUIDs you need. By mounting the Data Lake in your Databricks workspace, you are able to interact with the data as if it were in DBFS (the Databricks local file system). This proves incredibly powerful for tools such as dbutils, which allow you to perform a vast number of file system operations. Unfortunately, the one downside to this method is that anyone with access to the workspace is able to directly interact with all layers in your Data Lake – not ideal. Since Data Lakes are not treated in the same way as something like an Azure SQL DB, you also won't have those handy recovery options if something happens to your data because of this unrestricted access. This obviously poses a risk for an enterprise solution. It's worth noting that the mount is done once, so your clusters can go down and be spun up again by other users and they would still be able to access the mount point until it is un-mounted.

With that in mind, it is far better to implement the access to Data Lake directly at runtime using the Spark API. The config for this is very similar to mounting, except the Spark config is set at runtime, and therefore if it is not included you will be unable to access the Data Lake. This avoids the issue mentioned above, where anyone can access the data just because they can access the workspace. The following commands can be used to set this up at runtime.

# Get secrets from Databricks
ClientId = dbutils.secrets.get(scope = "ScopeX", key = "ClientId")
ClientSecret = dbutils.secrets.get(scope = "ScopeX", key = "ClientSecret")
DirectoryId = dbutils.secrets.get(scope = "ScopeX", key = "DirectoryId")

# Apply variables to the Spark config
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", ClientId)
spark.conf.set("dfs.adls.oauth2.credential", ClientSecret)
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/" + DirectoryId + "/oauth2/token")
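Once the Spark config above has been set in a session, reads go directly against the adl:// path rather than a mount point. A minimal sketch, with the store name and folder purely illustrative:

# Store name and path are hypothetical - substitute your own Data Lake Store and folder
df = spark.read.parquet("adl://myadls.azuredatalakestore.net/RAW/Sales/")
df.count()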
You'll notice that as part of the config above I'm retrieving the secrets/GUIDs I need for the connection from somewhere else – namely the Databricks-backed secrets store. This avoids exposing those secrets in plain text in your notebook – again, that would not be ideal. The secret access is then based on an ACL (access control list), so I can only connect to Data Lake if I'm granted access to the secrets. While it is also possible to connect Databricks up to Azure Key Vault and use that as the secrets store instead, when I tried to configure this I was denied based on permissions, and after some research I was unable to overcome the issue. Key Vault would be the more ideal option, but unfortunately support is limited at the moment, and the fact the error message contained spelling mistakes suggests to me the functionality is not yet mature.

To configure the Databricks-backed secrets, the easiest method is to use an Azure Bash console and go in via the Databricks CLI. To access the console, within the Azure portal you'll notice an icon similar to the one below as part of the top ribbon. Clicking this will then prompt you to start either a PowerShell or Bash console, which will look similar to the below.

The connection into the Databricks CLI can be set up with the following commands.

virtualenv -p /usr/bin/python2.7 databrickscli
source databrickscli/bin/activate
pip install databricks-cli
databricks configure --token

At this point, you'll need to provide it with both your Databricks host – something similar to https://northeurope.azuredatabricks.net – and a token. The token will need to be generated through your Databricks workspace, under User Settings / Access Tokens. This essentially lets you into the API without the need for a password. The important thing to mention at this point is that the token is a personal access token. While this doesn't impact anything in particular with Azure Bash, it's worth noting that the token is created under your Databricks account, and therefore using it for something such as a linked service into ADF will use your account to process the connection, authenticating as you. Hopefully over time this will mature and the use of Managed Identities or Service Principals directly connected into the platform will be possible. As you might have guessed, then, you will need to make sure the account that generates the token is also included in the ACL to read the secret, otherwise at processing time ADF will not be able to read through Databricks into the lake.

The configuration of the secrets is then required and can be done using the following commands. Simply replace [client id], [secret value] and [directory id] with the necessary values from your service principal.

databricks secrets create-scope --scope ScopeX
databricks secrets put --scope ScopeX --key ClientId --string-value [client id]
databricks secrets put --scope ScopeX --key ClientSecret --string-value [secret value]
databricks secrets put --scope ScopeX --key DirectoryId --string-value [directory id]
databricks secrets put-acl --scope ScopeX --principal admins --permission READ

I then granted read access on these secrets to the admins group – this just keeps things simple, but you can obviously grant it to individual users as well, or to other Databricks groups. One of my hopes for 2019 is that the platform is integrated better into AAD moving forward, as everyone currently needs to be named individually and permissions then have to be managed through Databricks groups, which can only be done via the CLI. Once this is complete, you can now interact with your Data Lake and authenticate at runtime.
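From a notebook you can quickly check that the scope and secrets are visible to your account before wiring anything into ADF. A small sketch using the Databricks secrets utilities (the scope name matches the one created above):

# List the secret scopes and the keys within our scope
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("ScopeX"))

# Values retrieved this way are redacted if you try to print them in the notebook,
# but they can still be passed into the Spark config as shown earlier
client_id = dbutils.secrets.get(scope = "ScopeX", key = "ClientId")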
It's also worth mentioning that this interaction changes the connection string from dbfs:// to adl://, which may have a knock-on effect if you use certain tools such as dbutils to do things within your code. This is also important to know since, with Databricks Runtime 5.1 and AD credential pass-through, you will be unable to access anything other than Data Lake file systems. I'll explain this further in my next blog.

Conclusion

Hopefully this will prove useful to anyone venturing onto the platform and provide a basis on which to implement a security model. If you wanted to go a step further, you may also want to implement service principal access to Data Lake on a number of levels – both across folders AND read/write. This would add slightly more complexity to the solution but provide an even more secure method, avoiding the scenario where accounts can access the entire Data Lake. In my next blog, I will look at it from a user perspective, which takes on a slightly different implementation.

Whitelisting Azure IP addresses for SQL Server

In a recent blog post, I wrote about whitelisting Azure Data Centre IP addresses for Key Vault. Key Vault's firewall uses CIDR notation for IP ranges, which is exactly what is contained within the list of IP addresses supplied by Microsoft. However, there are some resources, like Azure SQL Server, which only accept IP ranges. Therefore, we need a way of converting CIDR to an IP range. Handily, there's a PowerShell script which exists to provide that conversion, called ipcalc.ps1. When you download it, make sure it's in the same working folder as the script you're going to use to create the new firewall rules. From there, we can make slight amendments to the script we had in the previous post and produce the following script.

If you need to assign the IP ranges to other resources, you can substitute New-AzSqlServerFirewallRule with the appropriate cmdlet and parameters.
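The post's script itself is PowerShell, built around ipcalc.ps1 and New-AzSqlServerFirewallRule; purely as an illustration of the CIDR-to-range conversion it relies on, here is a rough Python equivalent using the standard library (the sample CIDR block is arbitrary):

import ipaddress

def cidr_to_range(cidr):
    # Convert a CIDR block into the start and end addresses that
    # SQL Server firewall rules expect
    network = ipaddress.ip_network(cidr, strict=False)
    return str(network.network_address), str(network.broadcast_address)

# Example: one CIDR block from the Microsoft download
start_ip, end_ip = cidr_to_range("13.69.128.0/17")
print(start_ip, end_ip)  # 13.69.128.0 13.69.255.255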

Whitelisting Azure IP addresses for Key Vault

A colleague came to me with an interesting request: we want to put Key Vault behind a firewall, but when we do that it means that Azure Data Factory can no longer access the secrets. Is there a way to whitelist the IP addresses for a given Azure Data Centre?

The short answer is: yes. By default, the following option is enabled on Azure Key Vault under the Firewalls and virtual networks blade. For most users, having unrestricted access from external networks to a resource that holds secrets, certificates and other sensitive information is a big red flag. If we choose to only allow access from Selected Networks, we get the following options opening up for us. Note that trusted Microsoft services is not an extensive list and does not include Azure Data Factory. Therefore we need to whitelist a series of IP addresses in the firewall rules. The list of IP addresses is published by Microsoft and is updated on a weekly basis. The IP addresses are published in an XML document, which isn't always the best format when one needs to update firewalls in Azure.

Shredding XML

To update the firewall in Azure, we're going to use PowerShell to shred the XML and extract the IP ranges for a given region. Then, we're going to use the updated Azure PowerShell module to register the IP ranges against the Key Vault. Using the last command, we can check that the IP ranges have been registered successfully. You should see something like the below.

There we have it: explicit IP whitelisting of Azure Data Centres so we can lock down Azure resources, only opening up access when we need to.

Update

Key Vault is currently limited to 127 firewall rules. If you are adding a region with more than 127 IP ranges, you might have an issue…
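Going back to the shredding step described above: the post's actual script is PowerShell, but purely as an illustration, a rough Python sketch might look like the following. It assumes the legacy XML download format, with Region elements carrying a Name attribute and IpRange children carrying a Subnet attribute; the file name and region are placeholders.

import xml.etree.ElementTree as ET

# File name and region are illustrative - the XML is the weekly
# Azure Datacenter IP ranges download referenced above
tree = ET.parse("PublicIPs.xml")
root = tree.getroot()

region_name = "europenorth"
subnets = [
    ip_range.attrib["Subnet"]
    for region in root.findall("Region")
    if region.attrib.get("Name") == region_name
    for ip_range in region.findall("IpRange")
]

print(len(subnets), "CIDR ranges found for", region_name)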

Data Source Permissions and On-Premises Data Gateway: SQL Server and Analysis Services

In Microsoft's documentation surrounding the On-Premises Data Gateway, the advice on permissions for the account used to authenticate the Data Source in the Power BI Service can be concerning for most, especially DBAs. In the Analysis Services section of the documentation, the advice is:

The Windows account you enter must have Server Administrator permissions for the instance you are connecting to. If this account's password is set to expire, users could get a connection error if the password isn't updated for the data source.

Server Administrator permissions…? What happened to the principle of least privilege? In a practical sense, the On-Premises Data Gateway has to deal with two very different implementations of Analysis Services: Multidimensional and Tabular. Each is set up and configured differently from the other, and the nature of their security models also differs. As a one-size-fits-all approach, it works. As we will soon see, though, the permissions do not have to be set as Server Admin. The SQL section of the documentation, on the other hand, doesn't actually specify what permissions are required for the Data Source to be established in the Power BI Service.

Permissions

Exactly what permissions are required for these common data sources, I hear you ask. As data sources are established at a database level, so too are the permissions set.

Data Source: Minimum Permissions Level
SQL Server Database: db_datareader
SSAS Tabular Database: Process database and Read
SSAS Multidimensional Database: Full control (Administrator)

The principle of least permissions is now restored, though there is still the curious incident of Analysis Services data sources requiring permissions in addition to read. I am unsure why; I have my suspicions and have tried to find out. If you know, please leave a comment below!

Introduction to Dynamic Data Masking

What is it?

SQL Server Dynamic Data Masking is a new feature released with SQL Server 2016. It is designed to allow the creation of a pre-defined rule which can be applied to the data in a column, limiting the exposure of the actual data. For example, if you have the following password column in a table which contains user passwords and has a data type of nvarchar:

Password
Kingston123
Sh@r0na

you can apply a data masking rule to the column which will make it appear as the following to any unauthorised user:

Password
XXXX
XXXX

Masking Options

Dynamic Data Masking currently has four possible masking options which can be applied to a column:

Default – This will mask the full value depending on the column's data type. For string data types (char, nchar, varchar, nvarchar, text and ntext) the value will be replaced with XXXX; if the length of the field is less than 4 characters, then that number of X's will be used. For numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float and real) the value will be replaced with 0. For date and time data types (date, datetime2, datetime, datetimeoffset, smalldatetime and time) the value will be replaced with 01.01.2000 00:00:00.0000000. For binary data types (binary, varbinary and image) the value will be replaced with 0.

Email – This will mask the full value, exposing the first letter of the string and masking the rest with XXX@XXXX.com. For example, 'james.russell@myemail.co.uk' will be masked as 'jXXX@XXXX.com'.

Custom String – This will mask part of a value, exposing a number of characters at the start and end of the string based on a prefix and suffix padding value. For example, 'abcdefghij' when given a prefix padding of 2 and a suffix padding of 4 will be masked as 'abXXXXghij'.

Random – This can be used with any numeric data type and will mask the original value with a random value based on the supplied range. For example, '123456789' when given a range of 1-5 will be masked as either 1, 2, 3, 4 or 5.
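Purely to illustrate the behaviour of these masking functions (this is not the SQL Server implementation itself, just a small Python sketch reproducing the examples above):

import random

def mask_default_string(value):
    # Strings are replaced with XXXX, or fewer X's if the field is shorter than 4 characters
    return "X" * min(len(value), 4)

def mask_email(value):
    # Expose the first letter only
    return value[:1] + "XXX@XXXX.com"

def mask_partial(value, prefix, suffix, padding="XXXX"):
    # Expose `prefix` characters at the start and `suffix` characters at the end
    return value[:prefix] + padding + value[-suffix:]

def mask_random(low, high):
    return random.randint(low, high)

print(mask_default_string("Kingston123"))         # XXXX
print(mask_email("james.russell@myemail.co.uk"))  # jXXX@XXXX.com
print(mask_partial("abcdefghij", 2, 4))           # abXXXXghij
print(mask_random(1, 5))                          # a value between 1 and 5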
Uses and Limitations

Dynamic data masking is designed to restrict the exposure of sensitive data to non-privileged users with minimal impact on the application layer. As it only affects the result set of a query over designated database fields, while keeping the actual data in the database the same, it is perfect for reporting or Business Intelligence uses; this also means that it can be incorporated without modifying pre-existing queries. The feature is implemented by running Transact-SQL commands in SQL Server 2016 or by using the Azure portal for Azure SQL Databases.

It is important to note that dynamic data masking is not designed to provide extensive database security and will not prevent database users from running intrusive queries to expose extra pieces of the sensitive data by connecting directly to the database. The feature can, however, be used in conjunction with other SQL security features such as encryption, row-level security and auditing. There are a number of other restrictions and behaviours that should be noted:

· Dynamic data masking will not work with: encrypted columns using Always Encrypted; FILESTREAM column types; COLUMN_SET or a sparse column that is part of a column set; computed columns (if the computed column depends on a masked column then the result will be masked data); keys for a FULLTEXT index.
· Updates can still be made to a masked column, even though when queried the user will see masked data.
· Using SELECT INTO or INSERT INTO to copy data from a masked column into another table will result in masked data in the target table.
· When using SQL Server import and export functionality on a database containing masked data, the resulting backup file or imported table will contain masked data.

I will be posting a future blog showing in-depth, real-life applications of dynamic data masking with examples, applied permissions and further applications for both on-premises SQL Server 2016 and Azure SQL Databases.

Further Reading

For more detailed information see the following links: MSDN Dynamic Data Masking, Get started with SQL Database Dynamic Data Masking (Azure Portal).

PerformancePoint Monitoring Dashboard Object Security

On the Properties tab of every object in PPS Monitoring you will find a Permissions section that allows you to assign either Reader or Editor rights. These permissions actually relate to two quite different areas: what you will see when you view the deployed dashboard; and which objects you can use and edit to build a dashboard using Dashboard Designer.

For the latter you'll also need to be in a suitable Dashboard Designer security role, which I've posted about previously, but otherwise the concepts are fairly clear: Reader will allow you to use the objects in your dashboard and Editor will allow you to edit the objects as well. For the viewing of dashboards, things are a little less straightforward.

Dashboards
To view a dashboard you'll need to be at least a member of the Reader role, otherwise you'll get a "dashboard is unavailable" message. Being in the Editor group adds no additional permissions when viewing the dashboard (that I can see).

Data sources
Dashboard viewers need to have at least Reader permissions on a data source if it is used in a KPI or report, or you will get an error message.

Scorecards
You must be at least a Reader on a scorecard to be able to view it in a dashboard, otherwise it is just not displayed. This, as with reports, can obviously cause an issue with the layout of your dashboard, as things will get moved depending on your permissions. To be able to add comments to KPIs you need to be a member of the Editor role, but only a Reader to view them.

KPIs
If a KPI is used on a dashboard that you have access to, the KPI will be displayed, but you must have at least Reader access on that particular KPI to see any values, otherwise they will be blank. Strangely, if you have Editor permissions on the scorecard you will be able to add a comment to the KPI whether you have permission to see it or not.

Reports
To view a report on a dashboard you need to be at least a Reader. If not, the report will not appear at all (no message). No additional permissions seem to be available in the Editor role.

Indicators
From a display point of view, indicators inherit permissions from the KPI they are displayed in, so there is no need to set any specific user permissions. This seems to be the only area where any form of permission inheritance is used.

Note that all roles can use either groups or specific users. Note also that you only need to publish dashboards to the PPSM server to update security permissions – there is no need to re-deploy to SharePoint unless you have changed the layout of the dashboard.

Setting up PerformancePoint Monitoring Design Role Security

There's a lot of info out there about setting up the application pool identity so that you can set up data sources for dashboards (for example). However, there seems to be very little about the various roles that are used in the Dashboard Designer and how to set them up. If you try to connect to the Dashboard Designer without appropriate access, it will let you open the application and try to create data sources etc., but you'll just get an error message like the one below when you try to connect to a server.

Granting permissions is a bit hidden away if you don't know where to look, and you need to be logged in as an existing Monitoring Server administrator to do this:

Click on the Office icon top left and then the Options button.
On the Server tab of the new window that pops up you'll see a Connect button.
Enter the server name, e.g. http://servername:40000/WebService/PmService.asmx
Click the Connect button – if you're not an existing Monitoring Server admin (or a local admin on the monitoring box) you'll get an error message here.
If all is well, the Server Options and Permissions buttons will be enabled.
Click Permissions and then Add on the next window and you'll see the window at the centre of the image below.
Enter the user name and select the role you want to put that user in. If you want to put someone in two different roles (e.g. Power Reader and Data Source Manager), just add them twice.

There are four different roles available:

Admin – (unsurprisingly) full rights over the server to build dashboards, administer security etc. Members of the (Windows) administrator group on the Monitoring server are automatically put in this group.
Data Source Manager – create, delete and publish data sources on the monitoring server.
Creator – create any dashboard element (and delete any that they own – see below).
Power Reader – read-only access to everything.

Finally, there is a further level to the security that allows you to assign any domain user rights to a particular dashboard element without them having to be a member of any role on the server. To do this, open the element that you wish to amend the permissions for and select the Properties tab. At the bottom you'll see a Permissions section. In here you can add any domain user as an Editor or Reader. Don't forget to click Publish after you've changed the security for any dashboard element. Happy Day of the Dead!!