
Tori Tompkins' Blog

Getting Started with PyTorch: A Deep Learning Tutorial

PyTorch is a deep learning framework created by the Artificial Intelligence Research Group at Facebook to build neural networks for machine learning projects. It isn’t brand new; PyTorch has been around since October 2016, almost exactly two years ago, but only now is it gaining the momentum it deserves. When used as an alternative to Keras, TensorFlow or NumPy, PyTorch shines in the following areas:

  • It is very tightly integrated with native Python APIs, which allows it to interact seamlessly with Python libraries such as NumPy, SciPy and Pandas.
  • This also includes native debuggers, so there is no need for a specialist debugger – looking at you, TensorFlow…
  • It supports both forward passes for prediction and back propagation for training, using the Autograd library.

And most importantly:

  • PyTorch builds dynamic computation graphs which can run code immediately, with no separate build and run phases.
  • This makes the neural networks much easier to extend, debug and maintain as you can edit your neural network during runtime or build your graph one step at a time.
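To make the idea of a dynamic graph concrete, here is a minimal sketch (the function and variable names are my own, purely for illustration). An ordinary Python while loop decides how many operations end up in the graph, and Autograd then differentiates whatever was actually executed:

```python
import torch

def forward(x, w):
    # Ordinary Python control flow decides the shape of the graph at run time.
    out = x * w
    steps = 0
    while out.norm() < 10:
        out = out * 2
        steps += 1
    return out.sum(), steps

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, 0.5], requires_grad=True)

total, steps = forward(x, w)
total.backward()  # Autograd walks the graph that was just built

print(steps)   # number of doublings that were executed
print(w.grad)  # gradient of total with respect to w, i.e. x * 2**steps
```

Because the graph is rebuilt on every call, a different input can take a different number of loop iterations without any change to the model definition.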

** Note: If the basics of Deep Learning, Neural Networks and Back Propagation are alien to you or even if you fancy a little revision, I really recommend that you check out 3Blue1Brown’s awesome YouTube playlist covering the foundations of Neural Networks: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi **

In this tutorial I will introduce a basic deep neural network in PyTorch and explain the important steps along the way. Now, let’s get started!

Installation

The simplest and recommended way to install PyTorch is through a package management tool like Conda or Pip but it can be installed directly from source using the instructions found at: https://pytorch.org/

Install using Pip

Linux:

pip3 install torch torchvision

Mac:

pip3 install torch torchvision
# macOS binaries don't support CUDA, install from source if CUDA is needed

Windows:

pip3 install http://download.pytorch.org/whl/cu90/torch-0.4.1-cp37-cp37m-win_amd64.whl
pip3 install torchvision

Install using Conda

Linux:

conda install pytorch torchvision -c pytorch

Mac:

conda install pytorch torchvision -c pytorch
# macOS binaries don't support CUDA, install from source if CUDA is needed

Windows:

conda install pytorch -c pytorch
pip3 install torchvision

PyTorch can be installed on Azure Databricks as a Databricks PyPI library and comes preinstalled and configured on Azure Data Science Virtual Machines.

Dataset Selection

I originally found the dataset used in this tutorial in the UCI Machine Learning Repository. The dataset represents data on 11,000+ instances of phishing and non-phishing webpages which have 30 categorical attributes including PageRank, AbnormalURL, Google_Index and age_of_domain. The data also contains a label, 1 or -1, indicating whether the webpage is a phishing webpage or not. The aim of this tutorial is to show how to use a deep neural network, so all data was cleansed before being split into two separate CSV files, train.csv and test.csv.
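One cleansing step worth knowing about if you start from the raw data: PyTorch's NLLLoss expects class labels as indices from 0 up to the number of classes minus one, so a -1/1 label needs remapping before training. The sketch below is an assumption about how that could be done with pandas, not a record of the cleansing actually applied to train.csv and test.csv; the three-row sample and the choice of which label becomes class 0 are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row sample; the last column holds the original -1/1 label.
raw = pd.DataFrame([[1, -1, 1], [0, 1, -1], [-1, 1, 1]])

labelCol = raw.columns[-1]
cleaned = raw.copy()
# Assumed mapping: -1 becomes class 0 and 1 stays class 1.
cleaned[labelCol] = cleaned[labelCol].map({-1: 0, 1: 1})

print(cleaned[labelCol].tolist())
```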

Simple Neural Network Design

The neural network I will build consists of:

  • 30 input nodes, each representing a column of the dataset.
  • The first hidden layer, consisting of 128 fully connected nodes.
  • The first ReLU layer, applied to the output of the first hidden layer. Rectified Linear Units (ReLU) set all negative numbers to 0 and leave positive numbers unchanged, which is cheap to compute and helps speed up training.
  • The second fully connected hidden layer, which also consists of 128 nodes.
  • The second ReLU layer, applied to the output of the second hidden layer.
  • The output layer, which consists of just two nodes representing the two labels – phishing and non-phishing.
  • Finally, the Logarithmic Softmax layer. This layer combines the softmax and log functions: softmax turns the two outputs into probabilities between 0 and 1, and the network returns the log of those probabilities.
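As a quick illustration of what the ReLU and Logarithmic Softmax layers do to their inputs, here is a small sketch on a made-up row of three scores (the values are arbitrary and not taken from the dataset):

```python
import torch
import torch.nn as nn

scores = torch.tensor([[-1.0, 0.0, 2.0]])  # made-up values for illustration

relu = nn.ReLU()
print(relu(scores))  # negatives become 0, everything else passes through

logsm = nn.LogSoftmax(dim=1)
logProbs = logsm(scores)
print(logProbs)              # log-probabilities, so every entry is <= 0
print(logProbs.exp().sum())  # exponentiating recovers probabilities that sum to 1
```

Note that the LogSoftmax output itself is a log-probability rather than a value between 0 and 1; it is the exponentiated values that behave like probabilities.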

This neural network is visualized below:

Capture 3

Simple Neural Network Build

As always, the first step is to import the libraries we’ll need.


import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable
import pandas as pd
import torch.utils.data


The next step is to read in the CSVs that contain the data we need into a pandas dataframe.


trainDataset = pd.read_csv("..\\train.csv", header=None)
testDataset = pd.read_csv("..\\test.csv", header=None)


Now we choose our hyperparameters.

  • The input size will equal the number of attributes in our dataset, 30.
  • The size of both hidden layers will be 128 nodes.
  • The number of classes will be 2, non-phishing and phishing.
  • The number of epochs will be 100. An epoch is one complete pass over the training data, with a forward pass and a backward pass for every batch. We want to do this 100 times.
  • Our batch size will also be 100. This means each epoch works through the dataset in batches of 100 rows, which speeds up the training.
  • The learning rate will be 0.001. Generally, the lower the learning rate, the slower but more stable the training is. The learning rate can be thought of as the step size of the gradient descent – too big and you may overshoot the minimum, too small and it’ll take a long, long time to converge.

These parameters can be set using the following code:


inputSize = len(trainDataset.columns) -1
hidden1Size = 128
hidden2Size = 128
numClasses = 2
numEpoch = 100
batchSize = 100
learningRate = 0.001
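To see how the epoch and batch size settings interact, it can help to work out how many weight updates the training will perform. The row count below is a hypothetical figure chosen for the arithmetic, not the real size of train.csv:

```python
import math

numRows = 8844   # hypothetical training-set size, not taken from train.csv
batchSize = 100
numEpoch = 100

# Each epoch visits every row once; the final batch may be smaller than batchSize.
batchesPerEpoch = math.ceil(numRows / batchSize)
totalUpdates = batchesPerEpoch * numEpoch

print(batchesPerEpoch, totalUpdates)  # 89 8900
```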


Data loaders are a really simple abstraction over the standard batch machine learning pipeline. Behind the scenes, the data loader will handle:

  • Batching the data.
  • Shuffling the data.
  • Loading the data in parallel using multiprocessing workers, when num_workers is set above its default of 0.


trainLoader = torch.utils.data.DataLoader(dataset=torch.tensor(trainDataset.values), batch_size=batchSize, shuffle=True)
testLoader = torch.utils.data.DataLoader(dataset=torch.tensor(testDataset.values), batch_size=batchSize, shuffle=False)
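If you want to check what a data loader hands back, the following sketch iterates over one built from a stand-in tensor. The 250 x 31 tensor of zeros is an assumption standing in for trainDataset.values (30 attributes plus one label column):

```python
import torch
import torch.utils.data

# Hypothetical stand-in for trainDataset.values: 250 rows, 30 attributes + 1 label.
rows = torch.zeros(250, 31)

loader = torch.utils.data.DataLoader(dataset=rows, batch_size=100, shuffle=True)

# Each iteration yields one batch as a 2-D tensor of shape (rows in batch, 31).
batchShapes = [tuple(batch.shape) for batch in loader]
print(batchShapes)  # [(100, 31), (100, 31), (50, 31)]
```

The last batch simply contains whatever rows are left over, which is why the training loop should not assume every batch has exactly 100 rows.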


Now it’s time to define our neural network. The best way to do this is to subclass the nn.Module class. The new DeepNeuralNetwork class is made up of the seven layers discussed earlier.


class DeepNeuralNetwork(nn.Module):
    def __init__(self, inputSize, hidden1Size, hidden2Size, numClasses):
        super(DeepNeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(inputSize, hidden1Size)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden1Size, hidden2Size)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden2Size, numClasses)
        self.logsm1 = nn.LogSoftmax(dim=1)
        
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.fc2(out)
        out = self.relu2(out)
        out = self.fc3(out)
        out = self.logsm1(out)
        return out

dnn = DeepNeuralNetwork(inputSize, hidden1Size, hidden2Size, numClasses)
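A quick sanity check before training is to push a random batch through the network and confirm the output shape. To keep the sketch self-contained, it rebuilds the same layer sizes with nn.Sequential rather than reusing the class above; that substitution is mine, and the random input is made up:

```python
import torch
import torch.nn as nn

# Same layer sizes as the tutorial, written with nn.Sequential for brevity.
model = nn.Sequential(
    nn.Linear(30, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2), nn.LogSoftmax(dim=1),
)

batch = torch.randn(5, 30)  # 5 made-up rows with 30 attributes each
out = model(batch)
numParams = sum(p.numel() for p in model.parameters())

print(out.shape)   # one row of two log-probabilities per input row
print(numParams)   # weights and biases across the three linear layers
```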


We now need to define both the loss function and the optimizer. The loss function calculates how effective our current weights and biases are at producing an accurate classification, and we will be using the NLL function, or negative log-likelihood function. NLL is very often the loss function of choice alongside the log-softmax activation we have in our neural network. Additionally, we need an optimizer to take the results from the loss function and alter the weights and biases to move the accuracy in a positive direction. The optimizer we will use is called Adam (Adaptive Moment Estimation) – a popular optimizer that often performs well without much tuning.


lossFN = nn.NLLLoss()
optimizer = torch.optim.Adam(dnn.parameters(), lr=learningRate)
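The pairing of LogSoftmax with NLLLoss is worth a short illustration: together they compute the same value as PyTorch's nn.CrossEntropyLoss applied to the raw scores of the last linear layer. The scores and labels below are made up for the comparison:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rawScores = torch.randn(4, 2)        # made-up outputs of the last linear layer
labels = torch.tensor([0, 1, 1, 0])  # class indices must lie in 0..numClasses-1

# LogSoftmax followed by NLLLoss, as used in this tutorial.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(rawScores), labels)
# CrossEntropyLoss applies the log-softmax internally.
ce = nn.CrossEntropyLoss()(rawScores, labels)

print(nll.item(), ce.item())  # the two losses agree
```

Either combination works; the tutorial keeps the LogSoftmax layer inside the network, so NLLLoss is the matching choice.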


Time to train the network! We will loop through the training steps 100 times as we stated in our hyperparameters. These steps consist of:

  • Looping through each batch of training data.
  • Separating each batch of training data into two variables – one for the attributes, one for the class labels.
  • Zeroing the gradients left over from the previous batch.
  • Completing a forward pass through the network with the batch of training attributes.
  • Calculating the loss with respect to the class labels.
  • And finally performing back propagation and letting the optimizer update the weights and biases.


for epoch in range(0, numEpoch):
    for i, data in enumerate(trainLoader,0):
        labels = Variable(data[:,-1])
        data = Variable(data[:,0:30].float())
        optimizer.zero_grad()
        outputs = dnn(data)
        loss = lossFN(outputs, labels.long())
        loss.backward()
        optimizer.step() 
                                        
    print('Epoch [%d/%d], Loss: %.4f'
        %(epoch+1, numEpoch, loss.item()))
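The call to optimizer.zero_grad() in the loop matters because PyTorch adds new gradients onto whatever is already stored. Here is a minimal sketch of that behaviour using a single made-up weight and a plain SGD optimizer (chosen only to keep the example small):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(w * 3).sum().backward()
first = w.grad.clone()        # gradient of 3*w is 3

(w * 3).sum().backward()      # no zero_grad(): the new gradient is added on top
accumulated = w.grad.clone()  # 3 + 3

optimizer.zero_grad()         # clear the stored gradient
(w * 3).sum().backward()
afterReset = w.grad.clone()   # back to 3

print(first, accumulated, afterReset)
```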


Finally, test our network using the following code:


correct = 0
total = 0
with torch.no_grad():  # gradients are not needed when only evaluating
    for data in testLoader:
        labels = data[:,-1].long()
        data = data[:,0:30].float()
        outputs = dnn(data)
        # The index of the larger log-probability is the predicted class
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the data: %d %%' % (100 * correct / total))
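To show how torch.max turns the network output into predictions, here is the same accuracy calculation on four made-up rows of log-probabilities:

```python
import torch

# Made-up log-probabilities for four rows and two classes.
outputs = torch.log(torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]))
labels = torch.tensor([0, 1, 1, 1])

# torch.max along dim 1 returns (largest value, index of largest value) per row.
_, predicted = torch.max(outputs, 1)
correct = (predicted == labels).sum().item()
accuracy = 100 * correct / labels.size(0)

print(predicted.tolist(), accuracy)  # [0, 1, 0, 1] 75.0
```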


This tells us that our accuracy is at 95%. This is pretty good for a first try! We can now identify phishing websites with high accuracy using only 30 features.

Due to its Python integration and dynamic computational graphs, PyTorch is relatively easy to pick up, making it a more approachable neural network framework than TensorFlow. However, PyTorch is a relatively new framework, so it only has a small community and limited resources, which can make it harder to learn and debug. But as with any tech, it’s all a matter of personal preference.

The dataset and code used in this tutorial have been uploaded to my GitHub account which can be found at: https://github.com/ToriTompkins/DataShare

Visualising Network Data in Power BI with Python Integration and NetworkX

The long-awaited Python Integration in Power BI, added earlier this month, opens up the opportunity for further customised reporting by exploiting the vast range of Python visualisation libraries.

Among my favourites of these Python visualisation/data science libraries is NetworkX, a powerful package designed to manipulate and study the structure and dynamics of complex networks. While NetworkX excels most at applying graph theory algorithms on network graphs in excess of 100 million edges, it also provides the capability to visualise these networks efficiently and, in my opinion, more easily than the equivalent packages in R.

In this article, I will explain how to visualise network data in Power BI utilising the new Python Integration and the NetworkX Python library.

Getting Started

To begin experimenting with NetworkX and Python in Power BI, there are several pre-requisites:

  • Enable Python integration in the preview settings by going to File –> Options and Settings –> Options –> Preview features and enabling Python support.

clip_image002_thumb4_thumb

  • Ensure Python is installed and fully up-to-date.
  • Install the following Python libraries:
    • NetworkX
    • NumPy
    • pandas
    • Matplotlib

Loading Data

The data I used was created to demonstrate this task in Power BI, but there are many real-world network datasets to experiment with provided by the Stanford Network Analysis Project. This small dummy dataset represents a co-purchasing network of books.

The data I loaded into Power BI consisted of two separate CSVs. One, Books.csv, consisted of metadata pertaining to the top 40 bestselling books according to Wikipedia and their assigned IDs. The other, Relationship.csv, was an edgelist of the book IDs, which is a popular method for storing/delivering network data. The graph I wanted to create was an undirected, unweighted graph which I wanted to be able to cross-filter accurately. Because of this, I duplicated this edgelist and reversed the columns so the ToNodeId and FromNodeId were swapped. Adding this new edgelist onto the end of the original edgelist creates a dataset which can be filtered on both columns later down the line. For directed graphs, this step is unnecessary and can be ignored.
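If you would rather prepare the doubled edgelist before loading it into Power BI, the duplicate-and-reverse step can be sketched in pandas as follows. This is one possible way to do it, not necessarily how the original file was built, and the three edges are a made-up sample:

```python
import pandas as pd

# Hypothetical three-edge sample using the FromNodeId and ToNodeId column names.
edges = pd.DataFrame({"FromNodeId": [1, 1, 2], "ToNodeId": [2, 3, 3]})

# Swap the two column names so every edge also appears in the opposite direction.
reversedEdges = edges.rename(columns={"FromNodeId": "ToNodeId", "ToNodeId": "FromNodeId"})

# Append the reversed copy to the original edgelist.
bothDirections = pd.concat([edges, reversedEdges], ignore_index=True)[["FromNodeId", "ToNodeId"]]

print(bothDirections)
```

Every book ID now appears in both columns, so a filter on either column picks up all of its neighbours.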

Once loaded into Power BI, I duplicated the Books table to create the following relationship diagram as it isn’t possible to replicate the relationship between FromNodeId to Book ID and ToNodeId to Book ID with only one Books table.

clip_image004_thumb3_thumb

From here I can build my network graph.

Building the Network Graph

Finally, we can begin the Python Integration!

Select the Python visual from the visualizations pane and drag this onto the dashboard.

clip_image006_thumb3_thumb

Drag the Book Title columns of both Books and Books2 into Values.

clip_image008_thumb1_thumb

Power BI will create a data frame from these values. This can be seen in the top 4 lines in the Python script editor.

clip_image010_thumb3_thumb

The following Python code (also shown above) will create and draw a simple undirected and unweighted network graph with node labels from the data frame Power BI generated:

import networkx as nx
import matplotlib.pyplot as plt
G = nx.from_pandas_edgelist(dataset, source="Book Title", target="Book Title.1")
nx.draw(G, with_labels = True)
plt.show()

** NOTE: You may find that the code above will fail to work with large networks. This is because, by default, NetworkX will draw the graph according to the Fruchterman-Reingold layout, which positions the nodes for the highest readability. This layout is unsuitable for networks larger than 1000 nodes due to the memory and run time required to run the algorithm. As an alternative, you can position the nodes in a circle or randomly by editing the line

nx.draw(G, with_labels = True)

to

nx.draw(G, with_labels = True, pos=nx.circular_layout(G))

or

nx.draw(G, with_labels = True, pos=nx.random_layout(G))
**
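For anyone curious what the layout functions in the note actually return, here is a small sketch outside Power BI. Each layout is just a dictionary mapping a node to its x/y position, which nx.draw then uses through the pos argument. The four-node graph and the seed values are made up for illustration:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])  # made-up edges

circular = nx.circular_layout(G)         # nodes evenly spaced on a circle
randomPos = nx.random_layout(G, seed=1)  # nodes placed uniformly at random
spring = nx.spring_layout(G, seed=1)     # Fruchterman-Reingold, iterative and slower

for node in G.nodes:
    print(node, circular[node], randomPos[node], spring[node])
```

The circular and random layouts need no iteration over the edges, which is why they stay practical when the spring layout becomes too slow.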

Running the script in the Python visual will produce the network graph below:

clip_image012_thumb3_thumb

You are also able to cross filter the network graph by selecting rows in the table on the right-hand side:

clip_image014_thumb4_thumb

Conclusion

Python visuals are simple to produce and, although the visuals themselves aren’t interactive, they will update with data refreshes and cross filtering, much like the R integration added 3 years ago. The introduction of Python in Power BI has opened doors for visualisation with libraries such as NetworkX, to visualise all BI networks from airline flight connections and co-purchasing networks to social network analysis.