AI vs ML Note
Posted on: November 02, 2023
In this blog post I am writing down my understanding of Artificial Intelligence (AI) and Machine Learning (ML).
Terminology
AI is the effort to automate intellectual tasks normally performed by humans. It used to be a set of pre-defined rules that humans typed into computers in a programming language, so a good AI at that time meant a good, long, complete set of rules.
ML is a subset of AI. Rather than humans coding the rules, ML figures out the rules by learning from data and the expected outputs. This is also the reason ML doesn't have 100% accuracy. Therefore the goal of a good ML model is to have as high an accuracy as possible.
A Neural Network (NN) is a form of ML that uses a layered representation of data. There is an input layer and an output layer, and there may be multiple layers in between which represent sets of rules, so the input data is transformed into some other representation as it passes through each layer. NN is not modelled after the way the human brain works, as we don't really know how human brains process data. We say the input layer and its next layer are densely connected because every input feature is connected to every neuron in the next layer.
Deep Learning (DL) refers to training NNs, sometimes very large NNs.
The ReLU function stands for Rectified Linear Unit; "rectified" means taking a max with zero. Therefore the function stays at zero for negative inputs and takes off as a straight line for positive inputs.
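A minimal numpy sketch of ReLU (the function name and test values are just for illustration):

```python
import numpy as np

def relu(z):
    # "Rectified" = take the max with zero: negative inputs become 0,
    # positive inputs pass through unchanged as a straight line of slope 1.
    return np.maximum(0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.0, 3.0])))  # [0. 0. 0. 1. 3.]
```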
Types of neural networks and applications
There are different types of NN, each useful for different problems.
- Standard neural networks: used for problems like predicting housing prices in real estate and online advertising.
- Convolutional neural networks (CNN): used for image applications like photo tagging.
- Recurrent neural networks (RNN): used on sequence data, such as audio (a 1D time series, as it is played out over time) in speech recognition, and language (as the letters/words come one at a time) in machine translation.
- Hybrid/custom neural networks: used for more complex applications like autonomous driving.
ML can be applied to structured data (like Excel/tabular data) and unstructured data (like audio, images and text).
How the amount of data affects the performance of learning algorithms
- Traditional learning algorithms (e.g. support vector machines, logistic regression): performance improves as the amount of data increases, then after a while it flattens into a horizontal line because the algorithm does not know what to do with a huge amount of data.
- Small NN: performs slightly better than traditional learning algorithms with a large amount of data
- Medium NN: performs better than a small NN with a large amount of data
- Large NN: performs better than a medium NN with a large amount of data
For a small set of training data, the performance differences between these learning algorithms are not so well-defined; they depend more on one's skill at hand-engineering features. Only with a large amount of data will we see the larger neural networks consistently dominating the other algorithms.
So in conclusion, to reach high performance in deep learning, two things are needed:
- A big enough neural network (with a lot of hidden units, a lot of parameters and a lot of connections)
- A huge amount of data
But training a bigger neural network or throwing more data at it only boosts performance up to a point, because eventually we run out of data, or the network is so big that it takes too long to train.
In the early days, the modern rise of deep learning was driven by scale of data and computation (the ability to train very large neural networks on either CPUs or GPUs).
In the last several years, there have been many algorithmic innovations that helped computation (i.e. run NNs faster and allow training bigger NNs). An example of algorithmic innovation is one of the huge breakthroughs in neural networks: switching the activation function from the sigmoid function to the ReLU function. The sigmoid function has regions where the gradient is nearly zero, so the parameters change very slowly when running gradient descent, and learning becomes very slow. Whereas ReLU's gradient is equal to 1 for all positive input values.
Fast computation is important because it speeds up the rate of building an effective NN and of getting back experimental results in a short period of time.
Binary Classification
An example of binary classification is: given an image as input, output 1 if the image is a cat or 0 if it is not.
Logistic regression
Logistic regression is an algorithm for binary classification.
y-hat (ŷ) is the probability that the result y is 1 in a binary classification. Given a set of inputs x and outputs y as supervised learning data, we want ŷ to be as close to y as possible (i.e. as accurate as possible).
We use the sigmoid function to calculate ŷ because it gives a result between 0 and 1, which is in the probability range.
w and b are the parameters for calculating ŷ.
Here, we use m to represent the size of the training set.
We use z = w^T * x + b, where w^T is the transpose of w; both w and x are nx-dimensional vectors.
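A minimal sketch of computing ŷ for one example, assuming illustrative sizes and random values:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) probability range.
    return 1 / (1 + np.exp(-z))

nx = 3                       # number of input features (illustrative)
w = np.random.randn(nx, 1)   # parameters: an nx-dimensional column vector
b = 0.0                      # bias: a real number
x = np.random.randn(nx, 1)   # one training example

z = np.dot(w.T, x) + b       # z = w^T x + b
y_hat = sigmoid(z)           # ŷ: the probability that y = 1
```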
Loss Function
The loss function measures how well the algorithm is doing on a single training example. It could use the squared error L(ŷ, y) = 1/2 (ŷ - y)^2 to calculate loss, but in logistic regression squared error creates an optimisation problem (i.e. the cost surface has multiple local optima, so we may not be able to find the global optimum).
In logistic regression, we use the following loss function: L(ŷ, y) = -(y log(ŷ) + (1 - y) log(1 - ŷ)). We want the result to be as small as possible.
This is because:
- If y = 1: L(ŷ, y) = -log(ŷ) => in order to have ŷ as close to y as possible (i.e. more accurate), ŷ needs to be close to 1 (i.e. ŷ needs to be large) => log(ŷ) needs to be large => the loss function becomes small.
- If y = 0: L(ŷ, y) = -log(1 - ŷ) => to be more accurate, ŷ needs to be close to 0 (i.e. ŷ needs to be small) => (1 - ŷ) will be large => log(1 - ŷ) needs to be large => the loss function becomes small.
Cost Function
The cost function (written as J(w, b)) measures how well the parameters w and b are doing on the entire training set, so it is the average of the loss over all m training examples (i.e. the sum of the m values of L(ŷ, y) divided by m). J(w, b) is the cost of the parameters, so in training logistic regression we are trying to find parameters that minimise the overall cost function.
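A short sketch of the loss and the cost, with made-up predictions and labels:

```python
import numpy as np

def loss(y_hat, y):
    # Logistic loss for a single training example.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = np.array([0.9, 0.2, 0.1, 0.4])  # predictions for m = 4 examples
y     = np.array([1,   0,   0,   1])    # true labels

# Cost J(w, b): the average of the m loss values.
J = np.mean(loss(y_hat, y))
```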
Gradient Descent
We can use gradient descent to train or learn the parameters w and b on the training set.
Imagine a 3D space: the horizontal axes of the gradient descent diagram represent w and b, and the height of the diagram represents the value of J. It turns out that the cost function J we use for logistic regression is a convex function, looking like a single bowl, so we can find the global optimum.
For logistic regression almost any initialisation method works, because the function is convex and we should get to the same point, or roughly the same point, no matter where we start. Therefore the initial values of w and b could be random, but we usually use 0 for logistic regression.
One iteration of gradient descent: start at a point (e.g. the initial point) -> take a step in the steepest downhill direction (i.e. going downhill as quickly as possible).
After a few iterations, hopefully it converges to the global optimum or somewhere close.
In gradient descent, we repeatedly update the parameters w and b as follows: w := w - α * dw and b := b - α * db. Here := means update the parameter; α is the learning rate, used to control how big the step is; dw is the derivative of the cost function, dJ(w,b)/dw, which is the slope of the function at a point in the w-axis direction; and db is the derivative of the cost function, dJ(w,b)/db.
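A tiny sketch of one update, with made-up parameter and gradient values:

```python
alpha = 0.01           # learning rate: controls the step size
w, b = 0.5, 0.1        # current parameters (illustrative)
dw, db = 0.2, -0.05    # dJ/dw and dJ/db at this point (illustrative)

w = w - alpha * dw     # step downhill in the w direction
b = b - alpha * db     # step downhill in the b direction
```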
Gradient descent converges faster after normalisation, so it leads to better performance. Normalisation changes x to x/‖x‖ (i.e. divides each row vector of x by its norm). To find the norm, square each element of the row vector, sum them up, then take the square root, i.e. sqrt(a^2 + b^2 + c^2). Softmax is a normalising function used when your algorithm needs to classify two or more classes.
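A sketch of both ideas (the function names are mine):

```python
import numpy as np

def normalize_rows(x):
    # Divide each row vector of x by its norm, sqrt(a^2 + b^2 + c^2).
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(x):
    # Exponentiate, then divide each row by its sum so each row sums to 1.
    exps = np.exp(x)
    return exps / np.sum(exps, axis=1, keepdims=True)

x = np.array([[0.0, 3.0, 4.0],
              [1.0, 6.0, 4.0]])
print(normalize_rows(x))  # first row becomes [0, 0.6, 0.8]
print(softmax(x))         # each row now sums to 1
```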
Computation
Neural networks are computed with a forward propagation step (used to compute the output of the neural network), followed by a backward propagation step (used to compute gradients or derivatives).
There is a final output variable we care about, which is usually the last node of the computation graph, such as J in the above example. Backpropagation tries to compute the derivative of the final output variable with respect to some other variable, in notation dFinalOutputVar/dVar (in short: dVar). For example it could be dJ/dw or dJ/db, or any other variable involved in calculating J. The chain rule is used in this step.
Since the cost function J(w, b) is the average of the loss L(a, y) (where a is the sigmoid output, aka ŷ) over m training examples, dJ(w, b)/dw is the average of dL/dw over these m examples. The same applies to the other variables such as w1, w2, w3 and b.
After the derivatives of all variables are calculated, the variable values are then updated with a learning rate α, e.g. w := w - α * dw.
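A minimal sketch of one forward and backward pass for a single example (values are illustrative; for logistic regression the chain rule collapses dL/dz to a - y):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([[1.0], [2.0]])    # one example with nx = 2 features
w = np.array([[0.1], [-0.3]])   # parameters (illustrative)
b, y = 0.0, 1

# Forward propagation: compute the output.
a = sigmoid(np.dot(w.T, x) + b)

# Backward propagation: compute derivatives via the chain rule.
dz = a - y      # dL/dz
dw = x * dz     # dL/dw for this example
db = dz         # dL/db for this example
```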
Vectorisation
Neural network programming guideline: whenever possible, avoid explicit for loops. So we use vectorisation, which helps avoid for loops, as for loops are not efficient on a large dataset.
In logistic regression, we need to compute z = w^T * x + b. A non-vectorised implementation would use a for loop to compute w^T * x, iterating nx times, which is usually very slow. A vectorised implementation computes w^T * x directly using numpy in Python, i.e. z = np.dot(w, x) + b, where w^T * x is computed by np.dot(w, x). Vectorisation is much faster than the non-vectorised (for loop) implementation.
Vectorisation can be done on both GPUs and CPUs, because both have parallel instructions (aka SIMD - single instruction, multiple data): if you use built-in functions like np functions, or other functions that don't require explicitly implementing a for loop, Python can exploit parallelism to do your computations much faster. GPUs are remarkably good at these SIMD calculations, and CPUs, while not as good as GPUs, are not bad either.
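A quick timing sketch (the exact numbers depend on your machine):

```python
import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)

tic = time.time()
z = np.dot(w, x)            # vectorised: one SIMD-friendly call
print("vectorised:", time.time() - tic, "seconds")

tic = time.time()
z = 0.0
for i in range(n):          # explicit for loop over all n elements
    z += w[i] * x[i]
print("for loop:", time.time() - tic, "seconds")
```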
Another example: if we have many features such as w1, w2, w3, ..., we don't want to use a for loop to update each dw in the computation. Instead we can use vectorisation to calculate dw by: 1. initialising dw as a vector, dw = np.zeros((nx, 1)); 2. accumulating dw over the training examples, dw += x * dz; 3. calculating the average at the end, dw /= m.
Vectorisation in Forward Propagation
In forward propagation, instead of using a for loop to compute z = w^T * x + b and a = sigmoid(z) over m examples, we can use vectorisation.
- The input x is an nx-dimensional vector; we can put all m inputs into an (nx × m) matrix: X = [x1 x2 x3 ... xm].
- According to z = w^T * x + b, we can produce Z, a matrix that stacks all the z's horizontally: Z = [z1 z2 z3 ... zm] = [(w^T * x1 + b) (w^T * x2 + b) ... (w^T * xm + b)] = w^T * X + [b b b ...]. Note: both [(w^T * x1 + b) ... (w^T * xm + b)] and [b b b ...] are (1 × m) row vectors. In Python, Z can be calculated as Z = np.dot(w.T, X) + b. Note: b is a real number (i.e. a 1 × 1 matrix), but Python will automatically expand b into a (1 × m) vector [b b b ...]; this is called broadcasting. Broadcasting applies to real numbers in general, as a real number can be seen as a 1 × 1 matrix that copies itself into an m × n matrix.
- According to a = sigmoid(z), we can compute the matrix A = [a1 a2 a3 ... am] = sigmoid(Z).
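Putting the above together, a vectorised forward pass might look like this sketch (sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

nx, m = 4, 10               # feature count and training set size (illustrative)
X = np.random.randn(nx, m)  # all m examples stacked as columns
w = np.random.randn(nx, 1)
b = 0.5                     # real number, broadcast to shape (1, m)

Z = np.dot(w.T, X) + b      # shape (1, m): all z's at once
A = sigmoid(Z)              # shape (1, m): all activations at once
```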
Vectorisation in Back Propagation
Backpropagation can be vectorised like forward propagation. For example, we can vectorise dz, where dz = a - y for each example; stacked over all m examples, dZ is a (1 × m) row vector.
- According to dz = a - y, we can compute dZ = A - Y, where dZ = [dz1 dz2 ... dzm], A = [a1 a2 ... am] and Y = [y1 y2 ... ym].
- To avoid for loops, we can calculate db and dw using dZ: db is the average of all the dz's, db = 1/m * np.sum(dZ), and dw = 1/m * X * dZ^T (i.e. dZ transposed).
Although we try to avoid explicit for loops, we will still need one for loop to implement multiple iterations of gradient descent.
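Combining the vectorised forward and backward passes, a full training loop might look like this sketch (the function name and hyperparameter values are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, Y, alpha=0.01, iterations=1000):
    # X has shape (nx, m); Y has shape (1, m).
    nx, m = X.shape
    w = np.zeros((nx, 1))
    b = 0.0
    for _ in range(iterations):          # the one for loop we keep
        A = sigmoid(np.dot(w.T, X) + b)  # vectorised forward propagation
        dZ = A - Y                       # vectorised backward propagation
        dw = np.dot(X, dZ.T) / m
        db = np.sum(dZ) / m
        w = w - alpha * dw               # gradient descent updates
        b = b - alpha * db
    return w, b
```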
Broadcasting
In Python, if an m × n matrix performs an operation (addition, subtraction, multiplication or division) with a 1 × n matrix or an m × 1 matrix, Python will broadcast the 1 × n or m × 1 matrix into an m × n matrix before the operation is performed.
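For example (values are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])  # shape (1, 3)
col = np.array([[100.0], [200.0]])    # shape (2, 1)

print(A + row)  # row is copied down into shape (2, 3) before adding
print(A + col)  # col is copied across into shape (2, 3) before adding
```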
Note on Python Numpy Vector
- When creating a random matrix, instead of using np.random.randn(5), which returns a size-5 rank-1 array with a shape of (5,) that does not behave like a vector (e.g. it cannot be transposed), use np.random.randn(5,1), because this returns a nested array (i.e. a 5 × 1 column matrix) with a shape of (5,1) that can be transposed into a 1 × 5 row vector.
- If a rank-1 array is created, it can be reshaped into a vector, e.g. a = np.random.randn(5) and then a = a.reshape((5,1)).
- In numpy, given matrices a and b, a * b is element-wise multiplication, not matrix multiplication; np.dot(a, b) is matrix multiplication.
- The result of np.dot(a, b) has a shape of (number of rows in a, number of columns in b).
- math vs np: We rarely use the "math" library in deep learning because its functions only take real numbers as inputs. In deep learning we mostly use matrices and vectors; np can apply a function to every element of a vector (or list), whereas math will return an error as it can only deal with real numbers.
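A small sketch showing the rank-1 pitfall from the first point:

```python
import numpy as np

a = np.random.randn(5)      # rank-1 array, shape (5,)
print(a.shape, a.T.shape)   # (5,) (5,) -- transposing does nothing

b = np.random.randn(5, 1)   # column vector, shape (5, 1)
print(b.shape, b.T.shape)   # (5, 1) (1, 5) -- behaves like a matrix

a = a.reshape((5, 1))       # fix: reshape the rank-1 array into a vector
```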
Common Numpy Functions in Deep Learning
- np.shape: X.shape is used to get the shape (dimensions) of a matrix/vector X.
- np.reshape: X.reshape(a, b) is used to reshape X into some other dimension like (a, b). X.reshape(-1, 1) reshapes X into a single-column vector, as we provided the number of columns as 1 but left the number of rows unknown; vice versa, X.reshape(1, -1) reshapes X into a row vector. So -1 indicates an unknown dimension, which numpy works out from the number of elements. But X.reshape(-1, -1) returns an error.
- np.linalg.norm(): calculates the norm of a matrix, e.g. ‖x‖ = np.linalg.norm(x, axis=1, keepdims=True), where keepdims=True allows the result to broadcast correctly against the original x, and axis=1 means take the norm in a row-wise manner while axis=0 is column-wise.
- np.dot(): performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator, which perform element-wise multiplication.
- A trick when you want to flatten a matrix X of shape (a, b, c, d) into a matrix X_flatten of shape (b*c*d, a) is to use X_flatten = X.reshape(X.shape[0], -1).T, where X.T is the transpose of X.
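For example, flattening a hypothetical batch of images:

```python
import numpy as np

X = np.random.randn(10, 64, 64, 3)       # e.g. 10 images of 64 x 64 x 3
X_flatten = X.reshape(X.shape[0], -1).T  # shape (64*64*3, 10) = (12288, 10)
print(X_flatten.shape)
```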
Common steps for pre-processing a new dataset:
- Figure out dimensions and shapes of the problem/dataset
- Reshape the datasets such that each example is now a flattened vector, i.e. a column vector with shape (size, 1)
- Standardise the data, i.e. (Each example - the mean of the whole array) / standard deviation of the whole array
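A sketch of these steps on a hypothetical dataset:

```python
import numpy as np

# Hypothetical raw dataset: 100 examples, each an 8 x 8 greyscale image.
data = np.random.rand(100, 8, 8)

# 1. Figure out the dimensions and shapes of the problem.
m = data.shape[0]

# 2. Reshape so each example is a column vector of shape (size, 1),
#    giving a data matrix of shape (size, m).
X = data.reshape(m, -1).T        # shape (64, 100)

# 3. Standardise the data.
X = (X - np.mean(X)) / np.std(X)
```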