Keras
My second project is an introduction to deep learning. It follows the walkthrough here. It focuses on Keras, which wraps the Theano and TensorFlow libraries. In this project, a deep neural network is trained to predict whether Pima people will have an onset of diabetes within 5 years using numerical medical information. A fully connected sequential model with 3 layers is used: it takes 8 input features, its 2 hidden layers (12 and 8 nodes) use rectified linear unit (ReLU) activation functions, and its output layer (1 node) uses a sigmoid function. Binary cross-entropy is used for the loss function and Adam is chosen as the optimiser. The result is a model with a ~77% accuracy rate.
This project is taken from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ and is a starter project to get experience working with deep neural networks.
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense
The dataset to be used is medical information from Pima people and whether or not they had an onset of diabetes within five years. A 0 means no diabetes while 1 means an onset. All inputs are numerical.
Input Variables (X):
- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
Output Variables (y):
- Class variable (0 or 1)
# load the dataset
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
# split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]
In Keras, models are built as a sequence of layers. The first layer must be told the shape of the input, which here is 8 dimensions for the 8 input variables: input_dim=8
The number of layers to use can be found through trial and error, but for this problem we will use a fully connected network structure with 3 layers. The first two layers use rectified linear unit (ReLU) activation functions, while the output layer uses a sigmoid function. The sigmoid ensures the output is between 0 and 1, making it easy to interpret as a probability of class 1 or to snap to a hard classification of either class with a default threshold of 0.5.
All together, it looks like this:
- The model expects rows of data with 8 variables (the input_dim=8 argument).
- The first hidden layer has 12 nodes and uses the ReLU activation function.
- The second hidden layer has 8 nodes and uses the ReLU activation function.
- The output layer has one node and uses the sigmoid activation function.
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
ReLU returns the positive part of its argument: if $x\leq0$ it returns $0$, and if $x>0$ it returns $x$.
The sigmoid function follows the equation $$\frac{1}{1+e^{-x}}.$$
For small values ($x<-5$) the output is close to 0, while for large values ($x>5$) the output approaches 1.
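The two activation functions above can be written out directly in NumPy. This is a minimal sketch for intuition, not how Keras implements them internally:

```python
import numpy as np

def relu(x):
    # ReLU: the positive part of x (0 for x <= 0, x otherwise)
    return np.maximum(0.0, x)

def sigmoid(x):
    # Sigmoid: squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-2.0, 0.0, 3.0])))     # [0. 0. 3.]
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # roughly [0.0067, 0.5, 0.9933]
```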
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The loss function is used to evaluate a set of weights that map inputs to outputs in the dataset ($Y=\sum(\text{weight}\times \text{input})+\text{bias}$). Loss should penalise bad predictions: when the predicted probability of the true class is low you want a high loss, and when it is 1.0 you want the loss to be 0. In this problem we use binary cross-entropy as our loss function, as it is a binary classification problem. The equation for this is $$H_p(q)=-\frac{1}{N}\sum^N_{i=1}\left[y_i\log(p(y_i))+(1-y_i)\log(1-p(y_i))\right]$$ where $y_i$ is 1 if there was an onset of diabetes and 0 if not, and $p(y_i)$ is the predicted probability of an onset of diabetes. The loss should be 0 for a perfect model and grow larger as the predicted probability of the true class gets smaller; this is achieved by the negative log of the probability. The binary cross-entropy is then the mean of each $-\log(\text{probability})$.
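The equation above is straightforward to compute by hand, which makes the penalty behaviour concrete. A small sketch (not the Keras implementation, which also clips probabilities for numerical stability):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob):
    # Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))

# Confident, correct predictions give a small loss...
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105
# ...while confident, wrong predictions are heavily penalised
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ~2.303
```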
The optimizer searches through different sets of weights; metrics are optional values we want to collect and report during training. In this case, Adam is chosen because it largely tunes itself and gives good results for a wide variety of problems. Rather than having a single fixed learning rate (alpha) for all weight updates, Adam maintains a separate rate for each network weight, which is adapted as learning unfolds. Several parameters are used in Adam:
- $\alpha$: Learning rate, the proportion that weights are updated by. Larger values lead to quicker learning
- $\beta_1$: The exponential decay rate of the first moment estimates (the mean)
- $\beta_2$: The exponential decay rate of the second moment estimates (the uncentred variance)
- $\epsilon$: A small number to prevent division-by-zero errors.
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10, verbose=0)
# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
24/24 [==============================] - 0s 541us/step - loss: 0.4797 - accuracy: 0.7721 Accuracy: 77.21
The fit() function is what trains the model on the data.
- Epoch is a single pass through all rows in the training dataset
- Batch is the number of samples considered by the model in an epoch before the weights are updated
- Verbose decides whether the accuracy of each epoch is printed.

These values can be chosen through trial and error.
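The epoch and batch settings together determine how many weight updates happen. Taking the 768 rows of the Pima dataset as given:

```python
import math

n_samples = 768   # rows in the Pima dataset
batch_size = 10
epochs = 150

# One weight update per batch; the last, partial batch still counts
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)           # 77 updates per epoch
print(updates_per_epoch * epochs)  # 11550 updates over the whole run
```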
The evaluate() function returns a list with two values: the loss and the accuracy of the model on the dataset. In this case, we didn't save a subsection of the data for testing, so we will instead evaluate using the training data, which tends to overestimate how well the model would perform on unseen data.
# make class predictions with the model
predictions = (model.predict(X) > 0.5).astype(int)
# summarize the first 5 cases
for i in range(5):
    print('%s => %d (expected %d)' % (X[i].tolist(), predictions[i], y[i]))
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0] => 1 (expected 1)
[1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0] => 0 (expected 0)
[8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0] => 1 (expected 1)
[1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0] => 0 (expected 0)
[0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0] => 1 (expected 1)
Once the model has been trained, it can be used to make predictions with the predict() function. As the sigmoid function is used, the output will be a probability between 0 and 1. This can be converted to binary directly, as shown above, or manually after the probabilities are output, for example with:
rounded = [round(x[0]) for x in predictions]