Ust Oldfield's Blog

Data Source Permissions and On-Premises Data Gateway: SQL Server and Analysis Services

In Microsoft’s documentation surrounding the On-Premises Data Gateway, the advice on permissions for the account used to authenticate the Data Source in the Power BI Service can be concerning, especially for DBAs.

In the Analysis Services section of the documentation, the advice is:

The Windows account you enter must have Server Administrator permissions for the instance you are connecting to. If this account’s password is set to expire, users could get a connection error if the password isn’t updated for the data source.

Server Administrator permissions…? What happened to the principle of least-privilege?

In a practical sense, the On-Premises Data Gateway has to deal with two very different implementations of Analysis Services: Multidimensional and Tabular. Each is set up and configured differently from the other, and their security models also differ. As a one-size-fits-all approach, recommending Server Administrator works. But, as we will soon see, the permissions do not have to be set as Server Admin.

The SQL section of the documentation, on the other hand, doesn’t actually specify what permissions are required for the Data Source to be established in the Power BI Service.

Permissions

Exactly what permissions are required for these common data sources, I hear you ask. Because data sources are established at the database level, the permissions are set at that level too.

Data Source | Minimum Permissions Level
SQL Server Database | db_datareader
SSAS Tabular Database | Process database and Read
SSAS Multidimensional Database | Full control (Administrator)

The principle of least-privilege is now restored.
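To make this concrete for the most common case, here is a minimal sketch of granting a gateway account the db_datareader minimum for a SQL Server Database data source, using Python and pyodbc. The server, database, and account names (myserver, SalesDW, CONTOSO\gatewaysvc) are placeholders, and the script assumes it is run by someone with administrative rights on the instance.

```python
# Minimal sketch: grant a gateway account read-only access to one database.
# Server, database, and account names below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=SalesDW;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# Create a login and a database user for the gateway account, then add it to
# the db_datareader role -- the minimum for a SQL Server Database data source.
cursor.execute("CREATE LOGIN [CONTOSO\\gatewaysvc] FROM WINDOWS;")
cursor.execute("CREATE USER [CONTOSO\\gatewaysvc] FOR LOGIN [CONTOSO\\gatewaysvc];")
cursor.execute("ALTER ROLE db_datareader ADD MEMBER [CONTOSO\\gatewaysvc];")

conn.close()
```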

Though there still is the curious incident of Analysis Services data sources requiring permissions in addition to read. I am unsure why, I have my suspicions, and I have tried to find out. If you know, please leave a comment below!



Introduction to Data Lakes

Data Lakes are the new hot topic in the big data and BI communities. They have been around for a few years now, but have only gained popular notice within the last year. In this blog I will take you through the concept of a Data Lake, so that you can begin your own voyage on the lakes.

What is a Data Lake?

Before we can answer this question, it's worth reflecting on a concept which most of us know and love - Data Warehouses. A Data Warehouse is a form of data architecture. The core principle of a Data Warehouse isn't the database; it's the data architecture which the database and tools implement. Conceptually, the core features of a Data Warehouse revolve around:

1. Data acquisition

2. Data management

3. Data delivery / access

A Data Lake is similar to a Data Warehouse in these regards. It is an architecture. The technology which underpins a Data Lake enables the architecture of the lake to flow and develop. Conceptually, a Data Lake wants to acquire data; it needs careful, yet agile, management; and the results of any exploration of the data should be made accessible. The two architectures can be used together, but conceptually the similarities end here.

 

Conceptually, Data Lakes and Data Warehouses are broadly similar, yet the approaches are vastly different. So let's leave Data Warehousing here and dive deeper into Data Lakes.

 

Fundamentally, a Data Lake is not just a repository. It is a series of containers which capture, manage and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream applications can access valuable insights which were previously inaccessible.

 

How Do Data Lakes Work?

 

Conceptually, a Data Lake is similar to a real lake - water flows in, fills up the reservoir and flows out again. The incoming flow represents multiple raw data formats, ranging from emails, sensor data, spreadsheets, relational data, social media content, etc. The reservoir represents the store of the raw data, where analytics can be run on all or some of the data. The outflow is the analysed data, which is made accessible to users.

 

To break it down, most Data Lake architectures come as two parts. Firstly, there is a large distributed storage engine with very few rules/limitations. This provides a repository for data of any size and shape. It can hold a mixture of relational data structures, semi-structured flat files and completely unstructured data dumps. The fundamental point is that it can store any type of data you may need to analyse. The data is spread across a distributed array of cheap storage that can be accessed independently.

 

There is then a scalable compute layer, designed to take a traditional SQL-style query and break it into small parts that can then be run massively in parallel because of the distributed nature of the disks.

 

In essence, we are overcoming the limitations of traditional querying by:

· Separating compute so it can scale independently

· Parallelizing storage to reduce the impact of I/O bottlenecks
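To make that scatter-gather pattern concrete, here is a toy Python sketch of how a SQL-style aggregate such as SUM can be broken into small tasks that scan independently stored partitions in parallel, with the partial results combined at the end. It is an illustration of the concept rather than a real Data Lake engine, and the partition file names and column layout are assumed.

```python
# Toy illustration of scatter-gather query execution: a SUM over many
# independently stored partitions, scanned in parallel and then combined.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(partition_path: str) -> float:
    """Map step: each worker scans only its own partition of the data."""
    total = 0.0
    with open(partition_path) as f:
        for line in f:
            total += float(line.split(",")[2])  # assumed 'amount' column
    return total

if __name__ == "__main__":
    # Illustrative partition files, as if spread across a distributed store.
    partitions = [f"sales/part-{i:04d}.csv" for i in range(16)]

    # Scatter: scan each partition in parallel across worker processes.
    with ProcessPoolExecutor() as pool:
        partials = pool.map(partial_sum, partitions)

    # Gather: the reduce step combines the small partial results.
    print("SELECT SUM(amount) =>", sum(partials))
```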


There are various technologies and design patterns which form the basis of Data Lakes. In terms of technologies these include:

· Azure Data Lake

· Cassandra

· Hadoop

· S3

· Teradata

With regards to design patterns, these will be explored in due course. However, before we get there, there are some challenges which you must be made aware of:

1. Data dumping - it's very easy to treat a Data Lake as a dumping ground for anything and everything. This will essentially create a data swamp, which no one will want to go into.

2. Data drowning - the volume of the data could be massive and the velocity very fast. There is a real risk of drowning by not fully knowing what data you have in your lake.

These challenges require good design and governance, which will be covered in the near future.

 

Hopefully this has given you a brief, yet comprehensive, high-level overview of what Data Lakes are. We will be focusing on Azure Data Lake, which is a managed implementation of the Hadoop architecture. Further reading on Azure Data Lake can be found below.

 

Further Reading

 

To learn more about Data Lakes, the following resources are invaluable.

Getting Started With Azure Data Lake Store

Getting Started With Azure Data Lake Analytics and U-SQL

Azure Data Lake Overview

 

 

Deploying a Hybrid Cloud

Operations Management Suite (OMS) is an Azure-based tool that helps manage your entire IT infrastructure, whether on-premises or in the cloud. OMS allows you to monitor the machines in your infrastructure and provides a bridge to a hybrid cloud solution: you can move multi-tier workloads into Azure, run tests on a copy of production workloads in Azure, and store critical data in Azure.

Configuring and deploying OMS is very quick and relatively straightforward. This post will deal with creating a hybrid cloud solution using OMS and configuring it so you can make the best use of your resources.

Within the Azure management portal you will first have to create a new Operational Insights workspace.
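If you would rather script this step than click through the portal, the workspace can also be created through the Azure Resource Manager REST API. The sketch below assumes you already have a resource group and a valid Azure AD bearer token; the subscription ID, resource names, location, SKU, and api-version are all placeholders to verify against the current documentation.

```python
# Sketch: create an Operational Insights (Log Analytics) workspace via the
# Azure Resource Manager REST API. All IDs and names below are placeholders.
import requests

subscription_id = "00000000-0000-0000-0000-000000000000"
resource_group = "oms-demo-rg"
workspace_name = "oms-demo-workspace"
token = "<Azure AD bearer token>"

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    f"/providers/Microsoft.OperationalInsights/workspaces/{workspace_name}"
    "?api-version=2015-11-01-preview"  # assumed api-version; verify first
)

body = {
    "location": "West Europe",
    "properties": {"sku": {"name": "Free"}},
}

resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.json().get("properties", {}).get("provisioningState"))
```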


Once the Operational Insights workspace has been created, you can navigate to it in the Azure Portal and click “Manage”, which will bring up OMS itself and its start page.


From the start page, click on the “Get Started” button to begin creating your hybrid cloud solution. You then want to add various solutions to the suite; for this demonstration we’re going to accept the default and add all of the solutions.

 

The next step is to connect a data source, which can be an on-premises machine, a virtual machine or Azure storage. For this demonstration we are only interested in connecting an on-premises machine and a VM.