
Adatis BI Blogs

A Crash Course in R Part 2

Following on from Part 1, which introduced the R basics (and can be found here), in Part 2 I'll start to use lists, data frames and, more excitingly, graphics.

Lists

Lists differ from vectors and matrices because they are able to store different data types together in the same data structure. A list can contain all kinds of R objects - vectors, matrices, data frames, factors and more. They can even store other lists. Unfortunately, calculations are less straightforward because there's no predefined structure. To store data within a list, we can use the list() function, for example trip <- list("London","Paris",220,3). As mentioned in my previous blog, we can use str(list) to understand the structure, which will come in handy. Notice the different data types.

Subsetting lists does not work quite like subsetting vectors. If you try to subset a list using the same single square brackets as when subsetting a vector, it will return a list containing the first element, not just the first element "London" as a character vector. To extract just the vector element, we can use double square brackets [[ ]]. We can use similar syntax to before to extract the first and third elements as a new list with [c(1,3)], but the same does not work with double brackets - the double brackets are reserved for selecting single elements from a list. If you have a list inside a list, then [[c(1,3)]] would work, and would select the 3rd element of the 1st list! Subsetting by names and by logicals is exactly the same as for vectors. Another piece of syntax we can use is $ to select an element, but it only works on named lists. To select the destination of the trip, we can use trip$destination. We can also use this syntax to add elements to the list: trip$equipment <- c("bike","food","drink"). Interested in testing your knowledge? Check out the DataCamp exercises here and here.

Data Frames

While using R, you'll often be working with data sets. These are typically comprised of observations, and each observation has a number of variables against it, similar to a customer/product table you'd usually find in a database. This is where data frames come in, as the other data structures are not practical for storing this type of data. In a data frame, the rows correspond to the observations, while the columns are the variables. It is similar to a matrix, but we can store different data types (like a list!). Under the hood, a data frame is actually a list, but with some special requirements such as equal vector lengths and character vectors stored as factors. A data frame is usually created by importing data from source rather than manually, but you can do it via the data.frame() function as shown below:

# Definition of vectors
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Encode type as a factor
type_factor <- factor(type)

# Construct planets_df
planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors = FALSE)

# Display the structure of planets_df
str(planets_df)

Similar to SQL, sometimes you will want just a small portion of the dataset to inspect. This can be done by using the head() and tail() functions.
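As a quick sketch of inspecting the planets_df data frame built above - and a reminder that, under the hood, a data frame is a list, so the list operators work on it too:

# Peek at the first and last few observations
head(planets_df)
tail(planets_df)

# Because a data frame is a list under the hood, list-style selection works
planets_df$diameter        # the diameter column as a vector
planets_df[["diameter"]]   # the same vector, via double brackets
planets_df["diameter"]     # a one-column data frame, via single brackets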
We can also use dim() to show the dimensions, which returns the number of rows and number of columns, but str() is preferable as you get a lot more detail. Because a data frame is an intersection between a matrix and a list, the subsetting syntax from both can be used, so the familiar [ ], [[ ]], and $ will work to extract elements, vectors, or data frames depending upon the type of subsetting performed. We can also use the subset() function, for instance subset(planets_df, subset = rings == TRUE). The subset argument should be a logical expression that indicates which rows to keep.

We can also extend the data frame in a similar way to lists/matrices. To add a new variable/column we can use either people$height <- new_vector or people[["height"]] <- new_vector. We can also use cbind(), for example people <- cbind(people, height = new_vector) - it works just the same. Similarly, rbind() can add new rows, but you'll need to add a data frame made of new vectors to the existing data frame, and the column names between the data frames will also need to match.

Lastly, you may also want to sort your data frame - you can do this via the order() function. The function returns the positions of the elements in ascending order of value. For example, for the vector 21,23,22,25,24 the order() function returns 1,3,2,5,4 - the positions of the elements from smallest to largest. We can then use ages[ranks, ] to re-arrange the order of the data frame. To sort in descending order, we can use the argument decreasing = TRUE.

# Create a desired ordering for planets_df
positions <- order(planets_df$diameter, decreasing = TRUE)

# Create a new, ordered data frame
largest_first_df <- planets_df[positions, ]

# Print the new data frame
largest_first_df

Interested in testing your knowledge? Check out the DataCamp exercises here and here.
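Building on the planets_df data frame created above, a minimal sketch of subsetting and re-ordering might look like this (the kilometre conversion is just for illustration; Earth's diameter is roughly 12,742 km):

# Keep only the planets that have rings
ringed_planets <- subset(planets_df, subset = rings == TRUE)

# The same result using logical subsetting with single brackets
ringed_planets_alt <- planets_df[planets_df$rings == TRUE, ]

# Add a new column: diameter in kilometres (diameter is relative to Earth)
planets_df$diameter_km <- planets_df$diameter * 12742

# Order the data frame from smallest to largest diameter
planets_df[order(planets_df$diameter), ]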
Graphics

One of the main reasons to use R is its graphics capabilities. The difference between R and a program such as Excel is that you can create plots with lines of R code which you can replicate each time. The default graphics functionality of R can do many things, but packages have also been developed to extend this - popular packages include ggplot2, ggvis, and lattice.

The first function to look at is plot(), which is a very generic plotting function. For instance, take a data frame of MPs (Members of Parliament), which contains the name of the MP, their constituency area, their party, their constituency county, number of years as an MP, etc. We can plot the parties in a bar chart by using plot(parliamentMembers$party). R realises that party is a factor and that you want a count across it, in a bar chart format (by default). If you pick a different variable, such as the continuous variable number of years as an MP, the figures are shown on an indexed plot. Pick 2 continuous variables, such as number of years as an MP vs. yearly expenses - plot(parliamentMembers$numOfYearsMP, parliamentMembers$yearlyExpenses) - and the result is a scatter plot (each variable holds an axis). Are these variables related? To make relationships clearer, we can use the log() function: plot(log(parliamentMembers$numOfYearsMP), log(parliamentMembers$yearlyExpenses)). For 2 categorical variables, R handles it differently again, and creates a stacked bar chart giving the user an idea of the proportion of the 2 variables. Notice how the type of variable you use impacts the type of chart displayed. You will find that the first element of plot() forms the x axis and the second element the y axis.

For 3 variables, it gets better still - here I used plot(parliamentMembers[c("numOfYearsMP","votesReceived","yearlyExpenses")]). As you can see below, R plots the variables against one another in 9 different plots!

The second function to look at is hist(), which gives us a way to view the distribution of data. We can use this in a similar format to plot() by specifying hist(parliamentMembers$numOfYearsMP). By default, R uses an algorithm to work out the number of bins to split the data into, based on the data set. To create a more detailed representation, we can increase the number of bins by using the breaks argument: hist(parliamentMembers$numOfYearsMP, breaks = 10). There are, of course, many other functions we can use to create graphics, the most popular being barplot(), boxplot(), pie(), and pairs().

Customising the Layout

To make our plots more informative, we need to add a number of arguments to the plot function. For instance, the following R script creates the plot below:

plot(
  parliamentMembers$startingYearMP,
  parliamentMembers$votesReceived,
  xlab = "Number of votes",
  ylab = "Year",
  main = "Starting Year of MP vs Votes Received",
  col = "orange",
  col.main = "black",
  cex.axis = 0.8,
  pch = 19
)
# xlab     = horizontal axis label
# ylab     = vertical axis label
# main     = plot title
# col      = colour of line
# col.main = colour of the title
# cex.axis = ratio of font size on axis tick marks
# pch      = plot symbol (35 different types!!)

To inspect and control these graphical parameters, we can use the par() function. To get a complete list of options, we can use ?par to bring up the R documentation. By default, the parameters are set per plot, but to specify session-wide parameters, just use par(col = "blue").

Multiple Plots

The next natural step for plots is to incorporate multiple plots - either side by side or on top of one another. To build a grid of plots and compare correlations we can use the mfrow parameter, like this: par(mfrow = c(2,2)). Passing the vector c(2,2) to par() builds us 4 subplots on a 2-by-2 grid. Now, when you start creating plots, they are not replaced each time but are added to the grid one by one. Plots are added in a row-wise fashion - to add them column-wise, use mfcol instead. To reset the graphical parameters, so that R plots once again to a single figure per layout, we pass in a 1-by-1 grid. Apart from these 2 parameters, we can also use the layout() function, which allows us to specify more complex arrangements. This takes a matrix in which you specify the location of the figures. For instance:

grid <- matrix(c(1,1,2,3), nrow = 2, ncol = 2, byrow = TRUE)
grid
     [,1] [,2]
[1,]    1    1
[2,]    2    3
layout(grid)
plot(dataFrame$country, dataFrame$sales)
plot(dataFrame$time, dataFrame$sales)
plot(dataFrame$businessFunction, dataFrame$sales)
# Plot 1 stretches across the entire width of the figure
# Plots 2 and 3 sit underneath this and are half the size

To reset the layout we can use layout(1) or use the mfcol/mfrow parameters again. One clever trick to save time is to store the default parameters in a variable such as old_par <- par() and then reset once done using par(old_par).
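As a self-contained illustration of the multi-plot parameters (using the built-in mtcars data set rather than the MP data, which isn't included here), a minimal sketch might look like this:

# Save the current graphical parameters so we can restore them later
old_par <- par(no.readonly = TRUE)

# A 2-by-2 grid of base plots, filled row by row
par(mfrow = c(2, 2))
plot(mtcars$wt, mtcars$mpg)        # scatter plot
hist(mtcars$mpg, breaks = 10)      # histogram with 10 break points
boxplot(mtcars$mpg ~ mtcars$cyl)   # box plot by group
barplot(table(mtcars$cyl))         # bar chart of counts

# Restore the original single-plot layout
par(old_par)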
Apart from showing graphics next to one another, we can also stack them on top of one another in layers. There are functions such as abline(), text(), lines(), and segments() to add depth to a graphic. Using lm() we can create an object which contains the coefficients of the line representing the linear fit of 2 variables, for instance movies_lm <- lm(movies$rating ~ movies$sales). We can then add it to an existing plot such as plot(movies$rating, movies$sales) by using the abline() function (which adds the line) and the coef() function (which extracts the coefficients), for instance abline(coef(movies_lm)). To add some informative text we can use:

xco <- 7e5
yco <- 7
text(xco, yco, labels = "More votes? Higher rating!")

to generate the following:

Interested in testing your knowledge? Check out the DataCamp exercises here, here and here.

Summary

This pretty much concludes the crash course in R. We've looked at the basics, vectors, matrices, factors, lists, data frames and graphics - and in each case looked at arithmetic and subsetting functions. Next I'll be looking at some programming techniques using R, reading/writing to data sources, and some use cases - to demonstrate the power of R using some slightly more interesting data sets, such as the crime statistics in San Francisco.

A Crash Course in R Part 1

What is R? Also known as the language for statistical computing, it was developed in the 1990s and provides the ability to use a wide variety of statistical techniques and visualisation capabilities across a set of data. Pros for the language include the fact it's open source, it has great graphical capabilities, it runs via a CLI (providing the ability to script and automate), it has a huge community behind it, and it is gaining wider adoption in business! R can be used from a number of tools; the most common are R Tools for Visual Studio, standalone RStudio, or more recently R scripts in Power BI.

Basics

Let's start with the absolute basics. If you type 1+2 at the command line, the console will return 3. Type text at the command line and the console will return the same body of text. Everyone loves variables. To store values in variables, we can use the syntax apples <- 4 and pears <- 2 to store 4 in the apples variable and 2 in the pears variable. There's no print out here because it's an assignment. We can then do total_fruit <- apples + pears to create a new variable using existing variables. As you create variables, you build up a workspace, which you can inspect using ls(). This details all the variables created within the R session. You can then use rm(variable) to clean up the workspace and conserve resources. Now, not everyone loves commenting, but you can comment your code via #. Here's a simple example script to calculate the volume of a donut (a torus):

# Create the variables r and R
r <- 2
R <- 6

# Calculate the volume of the donut
vol_donut <- 2*pi^2*r^2*R

# Remove all intermediary variables that you've used with rm()
rm(r,R)

# List the elements in your workspace
ls()
[1] "vol_donut"

Data Types

As with any language, there are a number of data types supported:
TRUE/FALSE are "logical"
"This is text" is "character"
2 or 2.5 are "numeric"
You can add L to a numeric, such as 2L, to make the number an integer (the output is the same, but the class is different). Here we have what's known as a type hierarchy. Other types include double, complex, and raw. We can use the function class() to determine the data type of a variable. We can also use the as.*() functions to coerce or transform between data types, such as as.character(4) = "4" and as.numeric("4.5") = 4.5. To evaluate the type, use is.numeric(2), which returns TRUE, or is.numeric("2"), which returns FALSE. NA is returned when trying to convert "hello" to numeric. Interested in testing your knowledge? Check out the DataCamp exercises here and here.

Vectors

A vector is a sequence of data elements of the same data type, which can be created using the c() function. Vectors can be of all the types seen previously. The three properties of a vector are type, length, and attributes. For example, a vector to hold the UK Government parties (who doesn't love politics), assigned to the variable parties, can be built with parties <- c("Labour","Conservative","Libdems","SNP"). A check can be done to see if the variable is of a vector type, similar to before, using is.vector(parties). But what if the vector contains data that has slightly more meaning behind it, for instance seat_count <- c(262,318,12,35)? You can attach labels to the vector elements by using the names() function: names(seat_count) <- parties. You could also do this in one line - c(Labour = 262, Conservative = 318, Libdem = 12, SNP = 35). The variables created previously are actually stored in vectors of length 1: R does not provide a data structure to hold a single number/character string.
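A minimal sketch pulling these basics together (the seat counts are the ones used above):

# Classes and coercion
class(TRUE)           # "logical"
class("This is text") # "character"
class(2.5)            # "numeric"
class(2L)             # "integer"
as.numeric("4.5")     # 4.5
as.numeric("hello")   # NA, with a warning
is.numeric("2")       # FALSE

# A named vector of seats per party
parties <- c("Labour", "Conservative", "Libdems", "SNP")
seat_count <- c(262, 318, 12, 35)
names(seat_count) <- parties
seat_count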
If you do build a vector of different data types, R performs coercion to make sure they are all of the same data type. For instance c(1,5,"A","B","C","D") becomes "1","5","A","B","C","D". If you need to keep the data types distinct, you can use a list instead!

Vector Arithmetic

Computations can be performed between vectors and are done in an element-wise fashion. This means you can do earnings <- c(10,20,30) * 3 to generate 30, 60, 90. You can also do vector minus vector, so earnings - expenses (again done element-wise). Multiplication and division using this method does not result in a matrix! Other functions include sum(bank) to sum all elements of the vector, sum(bank > 0) to return a count of the positive elements in the vector (given bank contains numerics), or sum(bank == x) to return the count of element x in a vector.

Vector Subsetting

As the name suggests, this is basically selecting parts of your vector to end up as a new vector (a subset of your original vector). If you want to select the first element from our seat_count vector, we write seat_count[1] and this will return Labour 262. Both the value and name are kept from the original vector. We can also use the name, so seat_count["Labour"] will return the same result. If you want to select multiple elements, you can use the syntax seat_count[c(1,4)], passing in a vector. To select elements in a sequence you can use 2:5 instead of 2,3,4,5. You can also subset via an inverse, by using the syntax seat_count[-1], which returns all the seats apart from the element in [1]. One last method of subsetting is with a logical vector, so by specifying seat_count[c(TRUE,FALSE,FALSE,FALSE)] we can return the equivalent of [1]. R is also able to "recycle" this type of vector, so if your logical vector is length 2, it will loop over itself to fit the vector you are subsetting. Remember, we can also use the arithmetic from the previous section to select our vector contents for subsetting - for example main_parties <- seat_count[seat_count > 50]. Interested in testing your knowledge? Check out the DataCamp exercises here, here, and here.

Matrices

While a vector represents a sequence of data elements (1-dimensional), a matrix is a similar collection of data elements but arranged as rows/columns (2-dimensional) - a very natural extension to a vector. To build a matrix you'll need to use the following format: matrix(1:6, nrow = 2), which creates a 2-by-3 matrix for the values 1-6. You can also specify columns rather than rows by using ncol = 3. R infers the other dimension by using the length of the input vector. To fill the matrix by row instead of by column, you can use the argument byrow = TRUE. Another way to create a matrix is by using the functions rbind() and cbind(). These essentially take the 2 vectors you pass to the function and stick them together. You can also use these functions to bind together a new vector with an existing matrix, for example my_matrix <- rbind(matrixA, vectorA). To name the matrix, you can use rownames() and colnames(), for example rownames(my_matrix) <- c("row1","row2") and colnames(my_matrix) <- c("col1","col2"). You can also create a named matrix in one line, by using the dimnames argument and specifying a list of name vectors:

my_matrix <- matrix(1:6, byrow = TRUE, nrow = 2, dimnames = list(c("row1", "row2"), c("col1","col2","col3")))

Similar to vectors, matrices also recycle values, only store a single atomic data type, and perform coercion automatically.
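A few of these behaviours are easier to see than to describe, so here is a minimal sketch of coercion and recycling, assuming nothing beyond base R:

# Mixing types in a vector coerces everything to character
c(1, 5, "A", "B")                       # "1" "5" "A" "B"

# Recycling in arithmetic: the shorter vector is repeated
c(10, 20, 30, 40) * c(1, 2)             # 10 40 30 80

# Recycling in logical subsetting: c(TRUE, FALSE) picks every other element
seat_count <- c(Labour = 262, Conservative = 318, Libdems = 12, SNP = 35)
seat_count[c(TRUE, FALSE)]              # Labour and Libdems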
Continuing on from vectors, matrices can also be subsetted. If you're after a single element, you'll need to specify both the row and column of interest using the syntax m[1,3] for row 1, column 3. If you're looking to subset an entire row or column, you can use the syntax m[3,] (notice the missing column value) to select the entirety of row 3. Columns can be selected the other way round, via m[,3]. You can also select multiple elements using a similar methodology to vectors. This can be achieved by using the syntax m[2, c(2,3)] to select the 2nd and 3rd column values of row 2. Subsetting by names works just the same as by index - you can even use a combination of both! The same is true of subsetting by a logical vector - just use c(FALSE,TRUE,TRUE) for the last 2 rows of a 3-row matrix. You can see some examples below.

# Create a matrix using 4 vectors
my_mega_matrix <- cbind(vectorA, vectorB, vectorC, vectorD)

# Subset the matrix to get columns 2 and 3
my_subsetted_mega_matrix <- my_mega_matrix[,c(FALSE,TRUE,TRUE,FALSE)]

# Subset the matrix using names for columns 1 to 4
my_alt_subsetted_mega_matrix <- my_mega_matrix[,c("A","B","C","D")]

# Calculate totals for columns 2 and 3
total_mega_matrix <- colSums(my_subsetted_mega_matrix)

As seen above, there are another 2 functions we can use on matrices, namely colSums() and rowSums(), to do column and row arithmetic. This is in addition to other standard arithmetic. All computations are performed element-wise, so we can do total_mega_matrix * 1.3 to convert the totals in GBP to USD (as an example). Performing calculations using 2 matrices is just the same (matrixA - matrixB). Be careful here though: if they contain the same number of elements, everything will be done element-wise, else recycling will occur.

Notice the similarity between vectors and matrices - they're both data structures that store elements of the same type. Both perform coercion and recycling. Arithmetic is also similar, as everything is performed element-wise. Interested in testing your knowledge? Check out the DataCamp exercises here, here, and here.

Factors

Unlike numeric variables, categorical variables can only take a limited number of different values. The specific data structure for this is what is known as a factor. A good example of this is blood: types can only be A, B, AB, or O - we can then define a vector of people's blood types, blood <- c("B","AB","O","O","A","B","B"). To convert this vector to a factor we can use factor(blood). R scans the vector to check for categories and stores the distinct list as levels (sorted alphabetically). Values in the vector are then replaced with numeric values corresponding to the associated level. You can think of factors as integer vectors, where each integer refers to a category or level. To inspect the structure, we can use str() on the factor. Similar to the names() function, you can also use the levels() function and pass a vector to name the levels differently from the categories picked up in the scan, for instance levels(my_factor) <- c("BT_A","BT_AB","BT_B","BT_O"). However, it's much safer to pass both the levels and the labels into factor(), because the levels are set alphabetically, which means you have to be careful that your new names correspond correctly to the levels. In statistics, there is also a difference between nominal categorical variables and ordinal categorical variables - nominal variables have no implied order, i.e. blood type A is not greater or less than B.
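To make the levels/labels point concrete, here is a minimal sketch using the blood type example (the BT_ labels are just illustrative):

# The blood type vector from above
blood <- c("B", "AB", "O", "O", "A", "B", "B")

# Default: levels are detected and sorted alphabetically (A, AB, B, O)
blood_factor <- factor(blood)
str(blood_factor)

# Safer: specify levels and labels together so they stay paired correctly
blood_factor_labelled <- factor(blood,
                                levels = c("A", "AB", "B", "O"),
                                labels = c("BT_A", "BT_AB", "BT_B", "BT_O"))
summary(blood_factor_labelled)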
There are examples, however, where an ordering does exist - for example with t-shirt sizes - and you can use R to impose this on the factor. To do this, you set the ordered argument of factor() to TRUE and then specify the levels in the correct order. You can then compare values in the factor against one another. An example can be seen below.

# Definition of temperature_vector
temperature_vector <- c("High", "Low", "High", "Low", "Medium")

# Encode temperature_vector as an ordered factor
temperature_factor <- factor(temperature_vector, ordered = TRUE, levels = c("Low","Medium","High"))

# Print out temperature_factor
temperature_factor

Interested in testing your knowledge? Check out the DataCamp exercises here.

This is only just the start of understanding R - in the next blog I'll look at lists, data frames and, most importantly, graphics! We can then start looking at some more complicated examples and use cases.

My Experience of the Microsoft Professional Program for Data Science

(Image 1 - Microsoft 2017 - https://academy.microsoft.com/en-us/professional-program/data-science)

In 2016 I was talking to Andrew Fryer (@DeepFat), Microsoft technical evangelist (after he attended Dundee University to present about Azure Machine Learning), about how Microsoft were piloting a degree course in data science. My interest was immediately piqued. Shortly after this, hints began to appear and the edX page went live. Shortly after the edX page went live, the degree was rebranded as the "Professional Program". I registered to be part of the pilot; however, I was not accepted until the course went live in September 2016.

Prior to 2016 my background was in BI, predominantly in Microsoft Kimball data warehousing using SQL Server. At the end of 2015 I enrolled on a Master's degree in Data Science through the University of Dundee. I did this with the intention of getting exposure to tools I had an interest in but had little commercial experience with (R, machine learning and statistics). This course is ongoing and will finish in 2018 - I highly recommend it! I would argue that it is the best Data Science Master's degree course in the UK. So going in to the MPP I had a decent idea of what to expect, plus a lot of SQL experience, R and Power BI. Beyond that I had attended a few sessions at various conferences on Azure ML. When the syllabus for the MPP came out, it directly complemented my studies.

Link to program - https://academy.microsoft.com/en-us/professional-program/data-science
Link to Dundee Masters - https://www.dundee.ac.uk/study/pg/data-science/

Structure of the program

The program is divided up into 9 modules and a final project. All modules need to be completed, but there are different options you can take - you can customise the course to suit your interests. You can choose to pay for the course (which you will need to do if you intend to work towards the certification) or audit the course for free. I will indicate which modules I took and why. Most modules recommend at least 6 weeks part-time to complete. I started the first module in the middle of September 2016 and completed the final project in the middle of January 2017 - so the 6-week estimate is quite high, especially if you already have a decent base knowledge of the concepts.

You can, if you wish, complete multiple modules at once. I am not sure I would recommend this approach, as to get the most out of the course you should read around the subject as well as watching the videos. Each module has a start date and an end date that you need to complete it between; if you do not, you will need to do it all again. You can start a module in one period and wait until the next for another module - you do not need to complete them all in 3 months. If you pay for a module but do not request your certificate before the course closes, you will need to take it again (top tip: as soon as you're happy with your score, request your certificate).
Module list

Module | Detail | Time taken
Data Science Orientation | Data Science Orientation | 2 - 3 days
Query Relational Data | Querying Data with Transact-SQL | 1 day - exam only
Analyze and Visualize Data | Analyzing and Visualizing Data with Excel, or Analyzing and Visualizing Data with Power BI | 2 - 4 days
Understand Statistics | Statistical Thinking for Data Science and Analytics | 7 - 9 days
Explore Data with Code | Introduction to R for Data Science, or Introduction to Python for Data Science | 7 - 9 days
Understand Core Data Science Concepts | Data Science Essentials | 7 - 9 days
Understand Machine Learning | Principles of Machine Learning | 2 weeks
Use Code to Manipulate and Model Data | Programming with R for Data Science, or Programming with Python for Data Science | R: 2 - 3 days; Python: 3 weeks
Develop Intelligent Solutions | Applied Machine Learning, or Implementing Predictive Solutions with Spark in HDInsight, or Developing Intelligent Applications | 2 weeks
Final Project | Data Science Challenge | 2 months*

The times taken are based on the time I had spare: I completed each module between projects, in the evenings and at the weekend. *The final project can be completed in a few days; however, you need to wait until it has closed to get your grade.

Structure of the modules

Each module is online. You log on to the edX website and watch videos by leading experts. Either at the end of a video, after reading some text, or at the end of a section of the module, you are given a multiple-choice test. The multiple-choice questions are graded and form part of your overall score. The other main assessment method is labs, where you will be required to complete a series of tasks and enter the results. Unlike certifications, you get to see what your score is as you progress through the module. The multiple-choice questions generally allow you two to three attempts at the answer; sometimes these are true/false with two attempts, which does undermine the integrity of the course.

There is normally a final section which you're only given one chance to answer, and which holds a higher percentage towards your final mark. You need 70% to pass. Once you hit 70% you can claim your certificate - if you have chosen to pay for the module. Modules range from $20 to $100. For the most part I answered the questions fully and tried for the highest score possible. However, in all honesty, towards the end, once I hit around 80%, I started looking at a different module. If the module was really interesting I would persevere.

Modules

Data Science Orientation, Query Relational Data & Analyze and Visualize Data. These modules are very basic and really only skim the surface of the topics they describe. The first module is a gentle introduction to the main concepts you will learn throughout the program. The next module focuses on querying data with SQL. Regardless of your opinion of SQL, you must agree that SQL is the language of data. Having an understanding of the fundamentals of SQL is paramount, as almost every level of the Microsoft data science stack has integration with databases. If you're familiar with SQL (I already held an MCSE in SQL 2012) you can skip the main content of this module and just take the test at the end. For the next module you have the option of Excel or Power BI for visualisation. As I have experience with Power BI, I opted for this module. Once again this is a very basic introduction to Power BI. It will get you familiar enough with the tool that you can do basic data exploration. Some parts of this course jarred with me.
Data visualisation is so important and a key skill for any data scientist. In the Power BI module, one of the exercises was to create a 3D pie chart. Pie charts are not a good visualisation, as it is hard to differentiate between angles, and making them 3D only escalates the issue. I wish Microsoft had made reference to some of the great data viz experts when making this module. I cannot comment on the Excel version.

Understanding statistics. This module is different from its predecessors, in that it is not run by Microsoft. It is a MOOC from Columbia University, which you might have completed before. It covers a lot of the basic and more advanced stats that you need to know for data science - in particular, a solid grounding in probability and probability theory. In BI you become familiar with descriptive stats and measures of variance; however, I had not done a great deal of stats beyond this. I had been researching statistical methods for the MSc, but I had not done any real stats since A-Level maths. This course was really interesting and I learnt a lot. I don't know if this is the best way to really learn stats, but it is a good primer for what you need to know. I found topping up my understanding with blogs, books and YouTube helped support this module.

Explore data with code. You have two options again for this module: R and Python. Which should you learn, I imagine you're asking - well, the simple answer is both. Knowing either R or Python will get you so far; knowing both will make you a unicorn. Many ask why to learn one language over the other - aside from the previous point, R is very easy to get into. It has a rich catalogue of libraries written by some of the smartest statistical minds, a simple interface, and is easy to install. Python is harder to learn in my opinion, as the language is massive! I found Python harder to work with, but it is much richer. I would recommend Python just for scikit-learn, the machine learning library. The Python module is extended to use Code Dojo (the great online tuition site). As you progress through the questions and examples, you have an IDE which will check your understanding and grade you as you go. I found this really helpful. This module is again a bit on the easier side. If you think the later Python module will be similar, you are in for a surprise! I did not take the R module as I was already using R in my day job.

Understand core data science concepts. Almost a redo of the first module and the understanding statistics module. Not a lot to say here, but repetition helped me understand and remember the concepts. The more I had to think about the core concepts, the more they stuck. This module could have been removed with little to no impact on the course, but it helped solidify my knowledge.

Understanding machine learning. As this is a Microsoft course, this module is all about Azure Machine Learning. If you have not used Azure ML before, it has a nice drag-and-drop interface which allows you to build quick, simple models and create a web service which you can then pass data to, using any tool with a REST API. This module is half theory and half practical. There are a lot of labs, so you will need to take your time. If you skip ahead you will get the answers wrong and might not make it to 70%.

Using code to manipulate and model data. This section has two options again: R and Python. I know quite a bit of R already, so I started with Python. I wanted to do them both to see how you can do machine learning in each.
I was expecting a continuation of the Code Dojo format from the previous module; this was far from the case. Each of the modules up until this point had worked with you to find the right answer. This module will equip you with the basics, but expects you to find the correct function and answer yourself. Believe me when I say it was hard (with little prior experience of Python). The course will lead you towards the right resources, but you need to read the documentation to answer the questions. This was a great change of pace. Having to search for the answers made me absorb more than just the quizzes. This module was a struggle. Once I completed it, I did the same for R. On a difficulty scale, if the Python module was 100, R was only at 20. The disparity in difficulty is massive and frankly unfair. I was able to complete the R module very quickly. I left feeling disappointed that it did not have the same complexity that the Python module did.

Develop intelligent solutions. For this section you can pick one of three modules: machine learning, Spark or microservices. I went with Spark. Why? Because I had already worked with Spark and Hadoop as part of the MSc at Dundee. I knew how it worked and what it did from an open-source point of view, but not from a Microsoft HDInsight perspective. This module was tricky, but nothing compared to the Python module. I spent the best part of the week working on Spark, setting up HDInsight clusters and forgetting to tear them down (top tip! Don't leave an HDInsight cluster running - they are EXPENSIVE!). The last module is a machine learning project, so picking the "Applied Machine Learning" option might put you in a better place than your competition. I did not attempt either the Machine Learning or the Microservices modules.

Final project. Here is where the fun begins. You're given a problem and a dataset. You need to clean, reduce, derive features from and process the dataset, then apply an ML technique to predict something - in my case, whether or not someone will default on a loan. You could use any technique you liked as long as the final result was in Azure ML. I was pretty happy with my model early on and made very few tweaks as the course progressed. Unlike the previous modules, where you can complete a module and get your score, your final score is only available once the module has ended. You build an ML experiment and test against a private dataset. You can submit your experiment 3 times a day to be scored against the private data (maximum of 100 attempts). This will give you an indication of your score, but this is not your score! Your score is calculated against a different dataset after the module has finished. Your top 5 scores will be used to test against the private, closed data. If you have over-fitted your model, you might have a shock (as many did on the forums) when your score is marked.

I completed all modules at the start of January and waited until February to get my final score. My highest-scoring answer, when used against the closed private dataset, did not get over the required 70% to pass. This was surprising but not all that unexpected: I had over-fitted the model. To counterbalance this, I had created 5 different experiments with 5 similar but different approaches. All scored similarly (~1-3% accuracy difference). This was enough to see me past the required 70% and to obtain the MPP in Data Science. The private dataset has been published now.
In the coming weeks I will blog about the steps I took to predict whether someone would default on their loan.

I have been asked at different stages of the course, "would you recommend the course?". It really depends on what you want out of it! If you expect to be a data scientist after completing the MPP, then you might be in for a shock. To get the most out of the course you need to supplement it with wider reading and research. YouTube has many great videos and recorded lectures which will really help you process the content and see it taught from a different angle. If you're looking to get an understanding of the key techniques in data science (from a Microsoft point of view), then you should take this course. If you're doing a degree where you need to do research, many of the modules will really help and build upon what you already know.

I hope you have found this interesting and that it has helped you decide whether or not you want to invest the time and money (each module is not free). If you do decide to and you persevere, you too will be the owner of the MPP in Data Science (as seen below).

Terry McCann - Adatis Data Science Consultant & Organiser of the Exeter Data Science User Group - You can find us on MeetUp.

Using R Tools for Visual Studio (RTVS) with Azure Machine Learning

Azure Machine Learning

Whilst in R you can implement very complex machine learning algorithms, for anyone new to machine learning I personally believe Azure Machine Learning is a more suitable tool for being introduced to the concepts. Please refer to this blog, where I have described how to create the Azure Machine Learning web service I will be using in the next section. You can either use your own web service or follow my other blog, which has been especially written to allow you to follow along with this one.

Coming back to RTVS, we want to execute the web service we have created. You need to add a settings JSON file: add an empty JSON file titled settings.json to C:\Users\<your name>\Documents\.azureml. Handy tip: if you ever want to have a dot at the beginning of a folder name, you must place a dot at the end of the name too, which will be removed by Windows Explorer. So, for example, if you want a folder called .azureml you must name the folder .azureml. in Windows Explorer.

Copy and paste the following code into the empty JSON file, making sure to enter your Workspace ID and Primary Authorization Token.

{"workspace":{
  "id" : "<your Workspace ID>",
  "authorization_token" : "<your Primary Authorization Token>",
  "api_endpoint": "https://studioapi.azureml.net",
  "management_endpoint": "https://management.azureml.net"
}}

You can get your Workspace ID by going to Settings > Name, and the Primary Authorization Token by going to Settings > Authorization Tokens. Once you're happy, save and close the JSON file.

Head back into RTVS - we're ready to get started. There are two ways to proceed: either I will take you through what to do line by line, or you can use the R script I have provided containing a function, allowing you to take a shortcut. Whichever option you take, the result is the same.

Running the predictive experiment in R - line by line

Copy and paste each line into the console. Firstly, a bit of setup. Presuming you've installed the devtools package as described on the GitHub page for the download, load AzureML and connect to the workspace specified in settings.json. To do this, use the code below:

## Load the AzureML package.
library(AzureML)

## Load the workspace settings using the settings.JSON file.
workSpace <- workspace()

Next we need to set the web service. This can be any web service created in Azure ML; for this blog we will use the web service created in the blog linked above. The code is as follows:

## Set the web service created in Azure ML.
automobileService <- services(workSpace, name = "Automobile Price Regression [Predictive Exp.]")

Next we need to define the correct endpoint, which can easily be achieved using:

## Set the endpoint from the web service.
automobileEndPoint <- endpoints(workSpace, automobileService)

Everything is set up and ready to go, except we need to define our test data. The test data must be in the exact same format as the source data of your experiment - the same number of columns, with the same column names. Even include the column you are predicting, entering just a 0 or leaving it blank. Below is the test data I used:

This will need to be loaded into R and converted to a data frame. To do so, use the code below, making sure the path points to your test data.

## Load and set the testing data frame.
automobileTestData <- data.frame(read.csv("E:\\OneDrive\\Data Science\\AutomobilePriceTestData.csv"))

Finally we are ready to do the prediction and see the result! The final line of code needed is:

## Send the test data to the web service and output the result.
consume(automobileEndPoint, automobileTestData)

Running the predictive experiment - shortcut

Below is the entire script; paste the entirety of it into the top-left R script window.

automobileRegression <- function(webService, testDataLocation) {
  ## Load the AzureML package.
  library(AzureML)

  ## Load the workspace settings using the settings.JSON file.
  amlWorkspace <- workspace()

  ## Set the web service created in Azure ML.
  automobileService <- services(amlWorkspace, name = webService)

  ## Set the endpoint from the web service.
  automobileEndPoint <- endpoints(amlWorkspace, automobileService)

  ## Load and set the testing data frame.
  automobileTestData <- data.frame(read.csv(testDataLocation))

  ## Send the test data to the web service and output the result.
  consume(automobileEndPoint, automobileTestData)
}

Run the script by highlighting the whole of the function and pressing Ctrl + Enter. Then run the function by typing the code below into the console:

automobileRegression("Automobile Price Regression [Predictive Exp.]", "E:\\OneDrive\\Data Science\\AutomobilePriceTestData.csv")

where the first parameter is the name of the Azure ML web service and the second is the path of the test data file.

The Result

Both methods should give you the same result: an output of a data frame displaying the test data with the predicted value. Wahoo! There you have it, a predictive analytic regression Azure Machine Learning experiment running through Visual Studio... the possibilities are endless!
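If you want to work with the predictions rather than just print them, the result of consume() can be stored and inspected like any other data frame. A minimal sketch, assuming the objects created above (the output path and column layout will depend on your own experiment):

## Store the scored results rather than just printing them
predictions <- consume(automobileEndPoint, automobileTestData)

## Inspect the structure and the first few rows
str(predictions)
head(predictions)

## Optionally write the results out for later use (path is just an example)
write.csv(predictions, "E:\\OneDrive\\Data Science\\AutomobilePricePredictions.csv", row.names = FALSE)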

Introduction to R Tools for Visual Studio (RTVS)

Introduction

This blog is not looking at one or two exciting technologies, but THREE! Namely Visual Studio, R and Azure Machine Learning. We will be looking at bringing them together in harmony using R Tools for Visual Studio (RTVS).

Installation

As this blog will be touching on a whole host of technologies, I won't be going into much detail on how to set each one up. Instead I will provide you with a flurry of links which will give you all the information you need. Here comes the flurry...!

· Visual Studio 2015 with Update 1 - I hope anyone reading this is familiar with Visual Studio, but to piece all these technologies together, version 2015 with Update 1 is required. Look no further than here: visualstudio.com/en-us/downloads/download-visual-studio-vs.aspx
· R - Not sure exactly what version is needed, but just go ahead and get the latest version you can, which can be found here: https://cran.r-project.org/bin/windows/base/
· Azure Machine Learning - No installation required here, yay! But you will need to set up an account if you have not done so already. This can be done here: studio.azureml.net/
· R Tools for Visual Studio - More commonly known as RTVS. The name is fairly self-explanatory, but it allows you to run R through Visual Studio. If you have used R and Visual Studio separately before, it will feel strangely familiar. Everything you need to download, install and set up can be found here: microsoft.github.io/RTVS-docs/
· Devtools Package - The final installation step is a simple one: installing the correct R packages to allow you to interact with Azure ML. If you've used R to interact with Azure ML before, you have probably already done this step, but for those who have not, all the information you will need can be found here: github.com/RevolutionAnalytics/AzureML

Introduction to RTVS

Once all the prerequisites have been installed, it is time to move on to the fun stuff! Open up Visual Studio 2015 and add an R project: File > Add > New Project and select R. You will be presented with the screen below; name the project AutomobileRegression and select OK.

Microsoft have done a fantastic job of realising that the settings and toolbar required for R are very different from those required when using Visual Studio, so they have split them out and made it very easy to switch between the two. To switch to the settings designed for using R, go to R Tools > Data Science Settings. You'll be presented with two pop-ups; select Yes on both to proceed. This will now allow you to use all those nifty shortcuts you have learnt to use in RStudio. Any time you want to go back to the original settings, you can do so by going to Tools > Import/Export Settings. You should now be looking at a screen similar to the one below:

This should look very recognisable to anyone familiar with R. For those not familiar, the top-left window is the R script; this is where you do your work and what you will run. Bottom left is the console, which allows you to type in commands and see the output; from here you will run your R scripts and test various functions. Top right is your environment, which shows all your current objects and allows you to interact with them. You can also change to History, which displays a history of the commands used so far in the console. Finally, the bottom right is where Visual Studio differs from RStudio a bit. The familiar Solution Explorer is visible within Visual Studio and serves its usual function.
Visual Studio does contain R Plot and R Help though, which both also feature in RStudio. R Plot will display plots and graphs when appropriate, while R Help provides more information on the different functions available within R. Look out for my next blog, which will go into more detail on how to use RTVS.

Connecting SQL Server to R

In this post I'm going to use R to retrieve some data from SQL Server. In order to use R in conjunction with SQL Server, but in the absence of SQL Server 2016 and its soon-to-be-incorporated R functionality, it is necessary to use a few workarounds to produce the desired outcome. R is a package-based platform and does not inherently communicate with other platforms unless the relevant package is installed and called. There are quite a few packages that can be used for R to talk to SQL Server; however, I prefer the RODBC package as it is simpler to use than the other packages available, and I will be using it for this example.

CONNECTING SQL SERVER TO R

Step 1: Create a connection

As RODBC requires a connection to your SQL Server database, you'll have to open up the ODBC Data Source Administrator on the machine you're using. Under the User DSN tab (though you could use the System DSN) click Add to create a new connection. Select SQL Server Native Client 11.0 and Finish. It will then open up the following screen, which you fill in appropriately. It will then ask for the necessary security information, but as I'm using a local instance I will persist with Windows authentication. The next screen is where you choose the database you want to connect to. By default you will just connect to the server, but if you wish to save the results from your analysis in R back to SQL Server and to the correct database, it is important that you select the desired database. For this example I'm connecting to AdventureWorks. The next screen contains general configuration properties and can be changed to suit the user's needs. Click Finish, test the connection, and you're all done for this step!

Step 2: Connecting R

For this next part we'll be using an R client. My preferred client is RStudio as it provides a clean interface. Now that you've created your connection, you'll want to use it within R. After firing up an instance of R with the RODBC package installed, you will want to invoke it with the following syntax:

library(RODBC)

To bring the connection through to R you'll need to assign a variable to it with the help of the odbcConnect function. The format for invoking the function is as follows:

connectionstring <- odbcConnect("some dsn", uid = "user", pwd = "****")

connectionstring is the variable assigned to store the connection, odbcConnect is the function, "some dsn" is the name of your DSN connection, and uid and pwd are the user ID and password for the server, if needed. For our example using AdventureWorks on a local machine, the syntax is as follows:

AdventureWorks <- odbcConnect("AdventureWorks")

In order to see which objects are in your database you should run the following syntax:

sqlTables(AdventureWorks)

which produces an output similar to this:

You can then begin to use your data from SQL Server in R by using the sqlQuery function to extract data.

Employee <- sqlQuery(AdventureWorks, "SELECT * FROM HumanResources.Employee")

The purpose of the sqlQuery function is to be able to get a specific set of data, potentially from multiple tables. If you just wish to return the contents of one table, it would be better to use the sqlFetch function.

Employee <- sqlFetch(AdventureWorks, "HumanResources.Employee")

sqlFetch returns the contents of the specified table and stores it in the assigned variable.
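Putting the pieces together, and adding the cleanup step not shown above, a minimal end-to-end sketch might look like this. The DSN name is the one created in Step 1, and the column names assume a recent AdventureWorks sample database, so adjust them to your own schema:

## Connect via the DSN created in Step 1
library(RODBC)
AdventureWorks <- odbcConnect("AdventureWorks")

## Pull a filtered result set with sqlQuery
recent_hires <- sqlQuery(AdventureWorks,
                         "SELECT BusinessEntityID, JobTitle, HireDate
                          FROM HumanResources.Employee
                          WHERE HireDate >= '2010-01-01'")
head(recent_hires)

## Optionally write a data frame back to the database (table name is just an example)
## sqlSave(AdventureWorks, recent_hires, tablename = "dbo.RecentHires", rownames = FALSE)

## Always close the connection when you're finished
odbcClose(AdventureWorks)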
Connecting R to SQL Server is relatively easy and allows you to unleash the power of R on your data without employing expensive add-ons or waiting for a future SQL Server 2016 CTP to be released.

SQL PASS Summit – Day 1

Having spent several days getting to know and enjoy Seattle, the conference is now underway. Yesterday there were some introductory meetings where we got to meet some of the other attendees and get a feel for the environment; today everything was in full swing. The morning started with the opening keynote, presented to a huge audience. There were demonstrations of some really exciting new features - in one we observed Kinect being used with Power Map to track customer movements and observe customer interest in different parts of a store. We saw some great-looking Power BI dashboarding functionality with the ability to drill through into detailed reports. As well as this, we saw some further enhancements to SQL Server's Azure and on-premises integration, including a new Stretch functionality which will allow users to seamlessly 'stretch' their data into the cloud, keeping the most frequently queried records on premises and the other ones in the cloud. We also saw a columnstore index being created on an in-memory table! Miguel Lopes gave a talk on the new features in Power Query, where we saw the ability to query using ODBC and support for Analysis Services connections; on the whole, though, whilst I think the ODBC support will be particularly useful for some situations, much of this talk was an overview of Power Query as a whole rather than a demonstration of new functionality. The integration of SSIS and Power Query in future was mentioned, however no dates have been set for this and we are told that this may (or may not) be available at some point in 2015. Jen Stirrup gave an interesting overview of some of the features available in R; the session was so popular that many people had to sit round the front on the floor! Niko Neugebauer's contagious enthusiasm when presenting his session on ETL patterns with clustered columnstore indexes was great to see, and I picked up a few tricks here that were quite handy when working in 2014 environments. I also very much enjoyed John Welch's session on Continuous Delivery for Data Warehouses and Marts; this is something I have been involved with a lot recently, and it was very interesting to see his methods of achieving it, and to discover that in many cases we were both doing things the same way :) Overall the day has been very interesting. We have seen some exciting new features announced today and some significant enhancements to the Power BI product range. It has seemed to me for some time that the lack of dashboarding functionality in Power View was holding it back, and I think many people will be very pleased with this new functionality and the further enhancements to the Azure service.