# ML & AI

## Introduction

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome.

Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI.

Here, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems.

As a community, you'll build datasets and open source projects, and help others learn with you. Currently, we are focused on building a robust dataset for number plate recognition. At every hands-on meetup event, participants submit 25 number plate images. Check out our Projects and Datasets sections to learn more.

Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI.

# Keras

Install Keras

# Assumes TensorFlow is already installed
pip install keras


Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Use Keras if you need a deep learning library that:

• Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
• Supports both convolutional networks and recurrent networks, as well as combinations of the two.
• Runs seamlessly on CPU and GPU.
• Supported on Google Colab, which is our default choice for learning ML/AI.

## Why use Keras?

• Keras prioritizes developer experience: it offers consistent and simple APIs, minimizes the number of user actions required for common use cases, and provides clear and actionable feedback upon user error.
• Keras has broad adoption in industry and the research community; Keras implementations are available for most well-known DNN algorithms.
• Keras makes it easy to turn models into products - supported on iOS, Android, cloud, etc.
• It supports multiple backends (TensorFlow, CNTK, Theano).
• It has strong multi-GPU support and distributed training support
• Keras development is backed by Google, Microsoft, NVidia and AWS.

## Code First!

Install Keras - required every time on Google Colab


# https://keras.io/
!pip install -q keras
import keras


At MLBLR we are focusing on real hands-on experience. Let's immediately jump to the code logic and see how a DNN is built from scratch.

There will be terms here which you might not totally understand at first, but don't worry: we will go through them once you finish your first DNN.

## Initializations

Initializations


import numpy as np

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, Add
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils

from keras.datasets import mnist


As in any program, we start with a few initializations that give us access to the functions we need.

Although Keras is supported on colab.google.com, we need to install it every time.

First we import Sequential. The Sequential model helps us create a linear stack of layers. Dense, Dropout, Activation, Flatten, Add, Convolution2D and MaxPooling2D are the layers we'd need, so we import them.

We also need a few built-in Keras utilities, so we import np_utils from keras.utils.

We also need access to a dataset. Keras provides convenient access to a few datasets. Here we import mnist from keras.datasets.

(X_train, y_train), (X_test, y_test) = mnist.load_data()


mnist.load_data() downloads the MNIST dataset (and caches it locally) and returns it already split into (X_train, y_train) and (X_test, y_test). We need the train data to train our model, and the test data to check how well it performs on images it has not seen.

Shape of the image

print (X_train.shape)
# (60000, 28, 28)
from matplotlib import pyplot as plt
%matplotlib inline
plt.imshow(X_train[0])


Sample image

We need to know the shape (dimensions) of the images we are dealing with. The .shape attribute gives the shape of our sample data.

It is very important to keep track of the dimensions right from the input stage to the output. High-level neural APIs like Keras take care of all intermediate dimensional changes (for example, in Caffe, you'd need to calculate all dimensions manually).

%matplotlib inline allows us to view our data inside the notebook.

plt.imshow(X_train[0]) allows us to view the sample image.

Notice a few things:

• The MNIST dataset is grayscale. This is also evident from the .shape result. If it were color, you'd get (60000, 28, 28, 3).
• MNIST contains the handwritten digits 0, 1, ... 9, but see how close the sample above is to the English character "S".
• There are 60000 sample images in the train set.

## Reshape the input

Reshape the input

X_train = X_train.reshape(X_train.shape[0], 28, 28,1)
X_test = X_test.reshape(X_test.shape[0], 28, 28,1)


We need to reshape the input because the Keras API expects the number of color channels to be stated explicitly (1 in our case, as we have grayscale images).

Reshaping data is one of the most used and most confusing operations in ML, so pay full attention to reshaping methods. Keras takes care of the shapes of the hidden layers for you, but if you were to write something new, say a new loss function, you'd have to take care of them yourself!

## Conversion

Converting from uint8 to float32 & normalizing

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255


Conversion to Float32 & Normalization

All images are stored as uint8; however, we work with floats in a neural network. So the first thing we do is convert our images from uint8 to float32. Then we rescale the pixel values from the 0-255 range to 0-1. This is called normalization. We usually do this because:

• 0-1 is easier to deal with.
• keeping other variables within a range (0-1) becomes easier if the inputs are also between 0 and 1.
• keeping values between 0 and 1 also, in a sense, adds a built-in threshold: products stay bounded (0.9*0.9 = 0.81, but 229*229 = 52441).
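To make the reshape and conversion steps concrete, here is a minimal NumPy sketch on two made-up 2x2 "images" (the array values are invented for illustration; only the operations mirror the steps above).

```python
import numpy as np

# Two made-up 2x2 "images" stored the way MNIST is: uint8 in the range 0-255.
images = np.array([[[0, 64], [128, 255]],
                   [[255, 255], [0, 0]]], dtype="uint8")

# Add the channel axis, as in the reshape step: (N, H, W) -> (N, H, W, 1).
images = images.reshape(images.shape[0], 2, 2, 1)

# Convert to float32 first, then scale to the 0-1 range.
# (An in-place `/=` on the uint8 array itself would raise an error.)
scaled = images.astype("float32") / 255

print(scaled.shape, scaled.dtype)   # (2, 2, 2, 1) float32
print(scaled.min(), scaled.max())   # 0.0 1.0
```

The order matters: the cast to float32 has to come before the division, which is why the tutorial code calls astype before dividing by 255.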

## Categorical

Categorical

# Convert 1-dimensional class arrays to 10-dimensional class matrices
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)


If you print y_train[:10], you'd get array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=uint8). If we predicted these values directly, the network would have to output "5" for a 5 and "9" for a 9. That is tricky: how would the loss function behave in such a case? Moreover, what about predicting A, B...Z?

An easier method is to output a one-hot vector for each class. For example, (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) can represent 5, where the 6th variable is 1, and (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) can represent 9. Similarly, (1, 0, 0, 0, 0, 0, 0, 0, 0, ...) can represent A.

We achieve this by using the np_utils.to_categorical function.
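As a rough illustration of what a one-hot conversion produces, here is a minimal NumPy sketch. The helper name `one_hot` is made up for this example and is not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows (hypothetical helper)."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Put a 1 in the column that matches each label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

y = [5, 0, 4, 1, 9]        # the first few labels quoted above
Y = one_hot(y, 10)
print(Y[0])                # row for the digit 5: the 6th entry is 1
print(Y.argmax(axis=1))    # argmax recovers the original labels
```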

## DNN Architecture

The Layers

model = Sequential()

model.summary()


Here is what we did:

• First, we told Keras that we are building a sequential model.
• Then, we added a 2D convolution layer with 32 kernels, each of size (3, 3), followed by a ReLU activation. This is the only place where we need to state a dimension explicitly: input_shape = (28, 28, 1).
• Now we have 32 new images of size (26, 26), each seen through the eyes of one of our 32 kernels.
• Then we randomly dropped 50% of the neurons (kernel + activation together). Dropout is a regularization technique that reduces over-fitting: it forces a NN to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
• Then we added another 2D convolution layer with 10 kernels, each of size 1x1! A 1x1 convolution helps us reduce the number of images seen through the eyes of the kernels. Before this step, we have 32 images, each of size (26, 26). After this step, we have 10 images of the same size.
• We again performed Dropout.
• Then we performed another 2D convolution, but this time with 10 kernels, each of size (26, 26). This gives us 10 values (which is what we need to perform prediction)!
• Then we flattened the previous output (converting from (1, 1, 10) to (10)) to feed it directly to our prediction layer.
• At last, we added a softmax activation layer. This is where prediction happens.
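The shapes quoted in the steps above can be checked with a small shape calculator. This is only a sketch: `conv_output_shape` is a made-up helper, and the Keras calls in the comments are a plausible reconstruction from the bullets, not taken from the original listing.

```python
def conv_output_shape(height, width, kernel, filters, stride=1):
    """Output shape of a convolution without zero-padding ('valid')."""
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    return out_h, out_w, filters

shape = (28, 28, 1)                                     # input_shape

# e.g. Convolution2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1))
shape = conv_output_shape(shape[0], shape[1], 3, 32)
print(shape)                                            # (26, 26, 32)

# Dropout(0.5) leaves the shape unchanged.

# e.g. Convolution2D(10, (1, 1))
shape = conv_output_shape(shape[0], shape[1], 1, 10)
print(shape)                                            # (26, 26, 10)

# e.g. Convolution2D(10, (26, 26))
shape = conv_output_shape(shape[0], shape[1], 26, 10)
print(shape)                                            # (1, 1, 10)

# Flatten(), then Activation('softmax') on the 10 values.
flat = shape[0] * shape[1] * shape[2]
print(flat)                                             # 10
```

Each 'valid' convolution shrinks the spatial size by kernel size minus one, which is why a 26x26 kernel collapses a 26x26 map to a single value.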

## Compile

Compile

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


Next we compile our model and configure its learning process. We tell Keras that we want to use cross-entropy as our loss and the Adam algorithm to optimise our gradient descent process.

There are several other loss functions as well, such as mean_squared_error, mean_absolute_error, squared_hinge, sparse_categorical_crossentropy, poisson, etc. You can learn more about these and other loss functions in the Keras documentation. Similarly, there are many other gradient descent optimizers, such as Adamax, SGD, RMSprop, Adadelta, etc.

## Training

Fit the model

model.fit(X_train, Y_train, batch_size=32, epochs=10, verbose=1)


Finally, to begin the training process we fit the model with our training data, and tell Keras how many images to look at simultaneously (batch_size=32), how long to train the model (epochs=10) and to provide us with a training log (verbose=1).

Your output would look something like this:

...

## Evaluate

Evaluate

score = model.evaluate(X_test, Y_test, verbose=0)
print(score)

# manual test
y_pred = model.predict(X_test)
print(y_pred[:9])
print(y_test[:9])


After only 10 epochs, we have achieved 97.5% accuracy. But this is on the train dataset. We need to test our accuracy on the test dataset, which our model has not seen yet!

We do this by calling the model.evaluate function. The result which gets printed is: [0.05154316661804915, 0.9841], i.e. the test loss followed by the test accuracy.

If you want to see manually what is happening here, print y_pred[:9], the predicted values for the first 9 images in the test dataset. You'd see that in each row the highest value sits at the position of the digit listed in y_test[:9].

Congrats! You have built your first DNN with an accuracy of 98.41%.
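The manual check described above amounts to taking the argmax of every prediction row. A minimal sketch, using hypothetical softmax outputs and labels rather than real model output:

```python
import numpy as np

# Hypothetical softmax outputs for three test images (one row per image).
y_pred = np.array([
    [0.01, 0.01, 0.02, 0.01, 0.01, 0.01, 0.01, 0.90, 0.01, 0.01],
    [0.02, 0.03, 0.85, 0.02, 0.02, 0.01, 0.02, 0.01, 0.01, 0.01],
    [0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03],
])
y_true = np.array([7, 2, 1])          # made-up labels for those images

# The predicted digit is the position of the largest value in each row.
predicted_digits = y_pred.argmax(axis=1)
accuracy = (predicted_digits == y_true).mean()

print(predicted_digits)               # [7 2 1]
print(accuracy)                       # 1.0
```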

## Quick Recap

ImgAug - one of the best libraries for Image Augmentation

Xavier or Glorot Initialization

Angular Softmax!

Dropout

Amazing way to understand Batch Normalization

Max Pooling

Source

In your first DNN you have taken a few decisions, which you'll have to take every time you write something new.

We can boil it down to these (including future requirements) components:

• Data Augmentation:
  • We haven't performed this step here, but it is turning out to be one of the most important steps in ML.
  • ImgAug is one of the best image augmentation libraries we have seen.
  • Yolo-V2 and many new algorithms first train on images of, say, 224x224 and then fine-tune for fewer epochs on a larger size, say 448x448.
  • Yolo-V2 also trains its image detection network with an image classification dataset (Yolo-9000 specifically).

• Initialization:
  • Since we were looking at a very easy dataset, we didn't implement any specific initialization, so Keras applied its default for us (glorot_uniform for the kernels, zeros for the biases).
  • There are many different initializations available, like random_uniform and glorot_normal (also called Xavier Initialization).
  • To state explicitly which initialization to use, one would rewrite a layer as: model.add(Dense(64, kernel_initializer='random_uniform', bias_initializer='zeros'))

• Activation Functions:
  • In our code we used 'relu', but there are many other choices to pick from, such as sigmoid.
  • Leaky ReLU
  • ELU
  • SELU

• Loss Functions:
  • In our code we used Cross-Entropy Loss
  • Hinge Loss : $L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
  • Triplet-Loss - multi class loss from FaceNet
  • Center-loss - from this paper
  • Angular-loss - from Sphere Face

• Normalization:
  • Again, given the very simple nature of the problem we were solving here, we did not use any normalization technique. But for any more difficult situation, you might need it. We have a few options:
  • Batch Normalization
  • Layer Normalization
  • Weight Normalization

• Misc Decisions:
  • Some other decisions we might have to take for complex models are: should we implement Average Pooling as a part of the last classifier?
  • should we use conv with stride without overlapping, and not average or max-pooling?
  • should we use resnet or densenet architecture?
  • should we concatenate features from current layer with features from previous layers?

• Evaluation:
  • We performed a very simple evaluation (run the model on the test dataset), but there are other methods for deeper evaluation, like:
  • use ELU or other xELUs and check performance
  • apply a learned colorspace transformation of RGB
  • use a linear learning rate decay policy
  • use a sum of the average and max pooling layers
  • use different mini-batch sizes and check performance
  • use FC layers as convolutions and check performance

You'll actually go deeper once you have a base model ready, to tweak it for maximum accuracy. You may also drop these tweaks in favor of speed/performance. It might seem like a lot to look at while writing a new neural network, but once you have written a few, you'll realize that a few combinations work well with each other.

# Deep Concepts in DNNs

Disclaimer: Most of the content is taken from online sources. This particular MIT paper, by Waseem Rawat and Zenghui Wang, has been the biggest source of them all. I apologize in advance if I missed giving credit to the original author. I will keep trying to point you back to the original authors as much as possible.

One way to see how good ML has become is to look at the ILSVRC results:

| Year | Team | Layers | Contribution | Position |
|---|---|---|---|---|
| 2010 | NEC | Shallow | Fast Feature Extraction, Compression, SVM | First |
| 2011 | XRCE | Shallow | High dimensional image, SVM | First |
| 2012 | SuperVision | 8 | GPU based DCNN, Dropout | First |
| 2013 | Clarifai | 8 | Deconvolution visualization | First |
| 2014 | GoogLeNet | 22 | DCNN, 1x1 conv | First |
| 2014 | VGG | 19 | All 3x3 convs | Second |
| 2015 | MSRA | 152 | Ultra Deep, Residuals | First |
| 2016 | CUImage | 269 | Ensemble, Gated Bi-Directional CNN | First |
| 2017 | Momenta | 152 | Squeeze & excitation, feature recalibration | First |
| 2018* | Facebook | 264 | Direct connection between any two layers | SOA* |

As we can see, networks are getting deeper and more sophisticated. Every year a new technique is added that makes much of what came before obsolete.

Below we cover some of the best additions to machine learning algorithms, something which we all need to know to build beautiful networks.

## Convolutional Layers

Convolution without zero-padding and with stride of 1

Convolution layers serve as feature extractors, and thus they learn the feature representations of their input images.

The neurons in the convolutional layers are arranged into feature maps.

Each neuron in a feature map has a receptive field, which is connected to a neighborhood of neurons in the previous layers via a set of trainable weights.

Inputs are convolved with the learned weights in order to compute a new feature map, and convolved results are sent through a nonlinear activation function.

In simple terms, imagine a kernel sliding across the whole image. If you have 32 kernels/filters in a layer, the layer will output 32 new convolved images.
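The sliding-kernel picture can be sketched in a few lines of NumPy. This is an illustrative, unoptimized loop (strictly speaking a cross-correlation, which is what deep learning libraries compute); the random image and kernels are stand-ins, not trained weights.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image: no zero-padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum over the receptive field at position (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))                # stand-in for one grayscale image
kernels = rng.standard_normal((32, 3, 3))   # 32 random (untrained) 3x3 kernels

# One feature map per kernel, each passed through a ReLU non-linearity.
feature_maps = np.stack([np.maximum(conv2d_valid(image, k), 0) for k in kernels])
print(feature_maps.shape)                   # (32, 26, 26)
```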

## Pooling Layer

Max Pooling Layer

The purpose of the pooling layer is to reduce the spatial resolution of the feature maps and thus achieve invariance to input dimensions and translations.

Initially it was common practice to use average pooling aggregation layers to propagate the average of all the input values.

However, more recent models use max pooling aggregation layers, which propagate the maximum value within a receptive field to the next layer.

In 2007, backpropagation was applied for the first time to a DCNN-like architecture that used max pooling. It was shown empirically that the max pooling operation was vastly superior for capturing invariance in image-like data and could lead to improved generalization and faster convergence; it also alleviated the need for a rectification layer.

Lee, Grosse, Ranganath and Ng introduced and applied probabilistic max-pooling to convolutional DBNs, which resulted in a translation-invariant hierarchical generative model.
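To compare the max and average pooling described above, here is a small NumPy sketch with non-overlapping windows. `pool2d` is a made-up helper and the feature map values are invented.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # drop rows/cols that do not fit
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]], dtype=float)

print(pool2d(fmap, mode="max"))     # rows: [4, 2] and [2, 8]
print(pool2d(fmap, mode="mean"))    # rows: [2.5, 1.0] and [0.75, 6.5]
```

Both variants halve the spatial resolution; they differ only in which statistic of each window is passed on.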

## FC Layers

Fully connected Layers

Several convolution and pooling layers are usually stacked on top of each other to extract more abstract feature representations in moving through the network.

Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the Neural Network section of the notes for more information.

The output from the convolutional layers represents high-level features in the data. While that output could be flattened and connected to the output layer, adding a fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features.

Essentially the convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space and the fully-connected layer is learning a (possibly non-linear) function in that space.

Fully connected layers are computationally expensive and hold the largest number of weights. Newer approaches, like using $1 \times 1$ kernels, try to replace them.

## ReLU

ReLU

ReLU or Rectified Linear Unit allows much faster training times.

It is a piecewise linear function, with form

$f(x) = max(x, 0)$.

It retains only the positive part of the activation by reducing the negative part to zero. The simple max operation is fast to compute, and ReLU does not suffer from the vanishing gradient problem (in which lower layers have gradients near zero because higher layers are almost saturated).

LReLU

LReLU

Even though ReLUs are awesome, they are at a possible disadvantage during optimization since the gradient is zero when the unit is not active. This may lead to cases where units never get activated, since popular gradient descent optimization algorithms fine-tune only the weights of units previously activated. Thus, ReLUs suffer from slow convergence when training networks with constant zero gradients. To compensate for this, Maas et al. (2013) introduced leaky rectified linear units (LReLU), which allow for small nonzero gradients when the unit is saturated and not active. Mathematically:

$f(x) = \max(x, 0) + \lambda \min(x, 0)$

where $\lambda$ is a predefined parameter within the range $(0, 1)$. They perform slightly better than ReLU.

PReLU

PReLU

While LReLUs (Maas et al., 2013) rely on a predefined parameter to compress the negative part of the activation signal, He et al. (2015a) proposed the parametric rectified linear unit (PReLU), which adaptively learns the parameters of the activation units during backpropagation. Mathematically, the PReLU is the same as the LReLU, except that $\lambda$ is replaced with the learnable $\lambda_k$, which is allowed to vary for different input channels, denoted by $k$. Thus, the PReLU can be expressed as:

$f(x_k) = \max(x_k, 0) + \lambda_k \min(x_k, 0)$.

PReLU gave a $1\%$ performance increase on the ILSVRC dataset, and it has been reported to perform better than other rectified units such as ReLU and LReLU.

ELU

ELU

While ReLU, LReLU and PReLU are all nonsaturating and thus lessen the vanishing gradient problem, only ReLU ensures a noise-robust deactivation state; however, ReLUs are nonnegative and thus have a mean activation larger than zero. To deal with this, Clevert et al. (2016) proposed the exponential linear unit (ELU), which has negative values to allow for mean activations near zero, but also saturates to a negative value for smaller arguments. Since the saturation decreases the variation of the units when deactivated, the precise deactivation argument becomes less relevant, thereby making ELUs robust against noise. Formally:

$f(x) = \max(x, 0) + \min(\lambda(e^{x} - 1), 0)$

where $\lambda$ is a predetermined parameter that controls the amount an ELU will saturate for negative inputs. ELUs sped up DCNN learning, led to higher classification accuracy, and showed encouraging convergence speed.

SELU plotted for α=1.6732~, λ=1.0507~

SELU

Scaled exponential linear units (SELU) are a kind of ELU with a little twist. Mathematically:

$selu(x) = \lambda(\max(x, 0) + \min(\alpha e^x - \alpha, 0))$

Here all the parameters are fixed. They make feed-forward networks self-normalizing. For this, the activation function is required to have negative and positive values for controlling the mean, saturation regions (derivatives approaching 0) to dampen the variance if it is too large in the lower layer, a slope larger than one to increase the variance if it is too small in the lower layer, and a continuous curve. For standard scaled inputs (mean 0, stddev 1), the values are α=1.6732~, λ=1.0507~ (these are derived mathematically). If you use SELU, you need to initialize the weights with zero mean and a standard deviation equal to the square root of 1 / (size of input). SELU is reported to converge better and reach better accuracy.
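The formulas for the rectifier family translate almost directly into NumPy. The sketch below follows the equations given in this section; the default value of `lam` in `leaky_relu` is an arbitrary choice for illustration, and PReLU is omitted because it has the same form with a learned `lam` per channel.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def leaky_relu(x, lam=0.01):
    # lam is the predefined slope for negative inputs (0 < lam < 1).
    return np.maximum(x, 0) + lam * np.minimum(x, 0)

def elu(x, lam=1.0):
    # Saturates towards -lam for very negative inputs.
    return np.maximum(x, 0) + np.minimum(lam * (np.exp(x) - 1), 0)

def selu(x, alpha=1.6732, lam=1.0507):
    # Fixed alpha and lam, as quoted above for standard scaled inputs.
    return lam * (np.maximum(x, 0) + np.minimum(alpha * np.exp(x) - alpha, 0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (relu, leaky_relu, elu, selu):
    print(fn.__name__, np.round(fn(x), 4))
```

All four agree on the shape of the positive side; they differ only in how they treat negative inputs, which is exactly where the zero-gradient and mean-activation arguments above apply.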

Too many X-ELUs?? Best practice is to use any of the above with dropout and then experiment! :D

Or Do not use DropOut and perform Image Augmentation!: Read this paper

## 1x1 Convolutions

1x1 Convolutions

A breakthrough in 2014 by Google. 1x1 convolutions perform two functions: they serve as dimension-reduction blocks prior to the more computationally costly 3x3 and 5x5 convolutions, and they include rectified linear activations. 1x1 convolutions are used to increase the depth and width of the network while only marginally increasing the computation cost.

Let us assume that a layer had an output of 256x256x64. You can imagine this as 64 feature maps of 256x256 each. A max-pooling layer (2x2) would change this to 128x128x64, whereas a layer with 10 1x1 kernels would reduce this to 256x256x10.

A layer with 100 1x1 kernels would create 100 feature maps (256x256x100). 1x1 kernels give us an extremely low-compute way to increase or reduce the depth of the features.
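The shape arithmetic above can be reproduced in NumPy, where a 1x1 convolution is just a matrix product applied at every pixel. The feature values and weights below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((256, 256, 64))      # H x W x C output of a previous layer
weights = rng.standard_normal((64, 10))    # ten 1x1 kernels, each spanning 64 channels

# A 1x1 convolution mixes channels independently at every pixel.
reduced = features @ weights
print(reduced.shape)                        # (256, 256, 10)
print(weights.size)                         # 640 weights (biases left out)

# For comparison, 2x2 max pooling shrinks height and width, not depth.
pooled = features.reshape(128, 2, 128, 2, 64).max(axis=(1, 3))
print(pooled.shape)                         # (128, 128, 64)
```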

## Dilated Convolutions

Dilated Convolution

Dilated convolutions are applied with defined gaps between the kernel elements. They are a way of growing the receptive field of the network exponentially while the number of parameters grows only linearly. They are used in applications that care about integrating knowledge of the wider context at low cost, like image segmentation, where each pixel is labeled with its corresponding class. In this case, the network output needs to be the same size as the input image. Dilated convolutions avoid the need for upsampling.

They are applied to audio as well (WaveNet) to capture a global view of the input with fewer parameters. In short, it is a simple but effective idea which you might consider when you need:

1. Detection of fine details by processing inputs in higher resolution, or
2. A broader view of the input to capture more contextual information, or
3. Faster run-time with fewer parameters.
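A one-dimensional sketch makes the "gaps" idea easier to see. The helper `dilated_conv1d` and the dilation schedule (1, 2, 4) are illustrative assumptions, not taken from a specific paper or library.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """1D convolution (no padding) whose taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1     # input samples covered by one output
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        taps = signal[i:i + span:dilation]      # pick every `dilation`-th sample
        out[i] = np.dot(taps, kernel)
    return out

signal = np.arange(16, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])

# Stacking layers with dilations 1, 2, 4 grows the receptive field quickly
# while each layer still has only three weights.
receptive_field = 1
for dilation in (1, 2, 4):
    receptive_field += (len(kernel) - 1) * dilation
    print(dilation, receptive_field)            # 1 3, then 2 7, then 4 15

print(dilated_conv1d(signal, kernel, dilation=2))
```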

$L_p$ pooling is a biologically inspired pooling process modelled on complex cells. In a given pooling region $R_j$, it takes the weighted average of the activations $a_i$, as:

$s_j = \Big(\sum_{i \in R_j} a_i^{p}\Big)^{1/p}$

Notably, when $p = 1$ the equation corresponds to average pooling (up to a constant factor of $1/|R_j|$), while $p = \infty$ translates to max pooling. $L_p$ pooling can be seen as a trade-off between average and max pooling. It has shown exceptional image classification results.
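The trade-off between average and max pooling can be seen numerically. This sketch assumes non-negative activations and, as a deliberate deviation from the formula above, uses the mean instead of the sum inside the root so that $p = 1$ is exactly average pooling.

```python
import numpy as np

def lp_pool(region, p):
    """L_p pooling of the (non-negative) activations in one pooling region."""
    region = np.asarray(region, dtype=float)
    if np.isinf(p):
        return region.max()
    # Mean instead of sum, so that p = 1 is exactly average pooling.
    return np.mean(region ** p) ** (1.0 / p)

region = [0.1, 0.4, 0.2, 0.9]          # made-up activations in one region
for p in (1, 2, 8, 64, np.inf):
    print(p, round(float(lp_pool(region, p)), 4))
```

As p grows, the result moves from the average (0.4) towards the maximum (0.9).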

## Spatial Pyramidical Pooling

Spatial Pyramidical Pooling

DCNNs are restricted in that they can only handle a fixed input image size (e.g., $96 \times 96$). In order to make them more flexible and thus handle images of different sizes, scales and aspect ratios, He, Zhang, Ren and Sun (2014) proposed spatial pyramid pooling (SPP). They used multilevel spatial bins, which have sizes proportional to the image size, and this allowed them to generate a fixed-length representation. The SPP layer was integrated into the DCNN between the final convolution/pooling layer and the first fully connected layer, and thus performed information aggregation deep in the network, so that the input size no longer had to be fixed.

Spatial pyramid matching is, conceptually, a method of building a more abstract representation of an image that preserves some spatial information by dividing the image spatially in a particular way. Such a higher-order representation introduces some invariances (under translation, for instance), but it does so at the cost of throwing away information. For instance, you may divide the image into 4x4 regions and then aggregate statistics (or pool) within each subregion. You can repeat the same process using a 2x2 division and a 1x1 division (here, you pool over the whole image), forming a pyramid.

## Softmax

SVM and SoftMax

Softmax is the most widely used activation in the last fully connected layer of DCNNs, owing to its simplicity and probabilistic interpretation. When this activation function is combined with cross-entropy loss, they form the extensively used softmax loss.
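As a small worked example of the softmax loss, here is a NumPy sketch that combines a softmax with cross-entropy. The logits and target are made up, and the `eps` term is only a guard against log(0).

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, one_hot_targets, eps=1e-12):
    return -np.sum(one_hot_targets * np.log(probs + eps), axis=-1)

logits = np.array([[2.0, 1.0, 0.1]])     # made-up scores for 3 classes
target = np.array([[1.0, 0.0, 0.0]])     # the true class is class 0

probs = softmax(logits)
loss = cross_entropy(probs, target)
print(probs, probs.sum())                # probabilities that sum to 1
print(loss)                              # -log of the probability of class 0
```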

## Triplet Loss

Triplet Loss

The Triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

This is most useful when we are directly trying to learn embeddings (face recognition). It consists of two matching references and a non-matching reference, and the loss aims to separate the positive pair from the negative by a distance margin.
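A minimal sketch of the loss on two-dimensional toy embeddings is shown below. It uses squared Euclidean distances, and the margin of 0.2 is an assumed value chosen for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: pull the positive in, push the negative out."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(pos_dist - neg_dist + margin, 0.0)

anchor = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])    # same identity, close to the anchor
negative = np.array([1.0, 0.0])    # different identity, far away

print(triplet_loss(anchor, positive, negative))   # 0.0: the margin is already satisfied
print(triplet_loss(anchor, negative, positive))   # about 2.18: roles swapped, large loss
```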

## Regularization

DCNNs are very expressive models capable of learning exceptionally complicated relationships; however, many of these complicated mappings are due to sampling noise. Thus they exist only in the training set and not in the test set, even when both are drawn from the same data distribution. This leads to overfitting, which can be mitigated by regularization. The easiest and most common method to reduce overfitting is data augmentation, but it requires a larger memory footprint and comes at a higher computational cost.

## Dropout

Dropout

In Dropout (Hinton et al., 2012; Srivastava et al., 2014), each unit of a layer's output is retained with probability $p$ and set to zero with probability $1-p$, with 0.5 being a common value of $p$. When dropout is applied to a fully connected layer of a DCNN (or any DNN), the output of the layer, $r = [r_1, r_2, \ldots, r_d]^T$, can be expressed as $r = m * a(Wv)$, where $*$ denotes the element-wise product between a binary mask vector $m$ and the result of applying the nonlinear activation function $a$ to the matrix product of the weight matrix $W$ (with dimensions $d \times n$) and the input vector $v = [v_1, v_2, \ldots, v_n]^T$. The primary benefit of Dropout is its proven ability to significantly reduce overfitting by effectively preventing feature co-adaptation; it is also capable of attaining model averaging.
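The expression $r = m * a(Wv)$ can be sketched in NumPy as follows. A ReLU is assumed for the activation $a$, and the test-time branch uses the classic "scale by the keep probability" rule; many libraries instead scale during training (inverted dropout), so treat this as one possible formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_with_dropout(v, W, keep_prob=0.5, training=True):
    """r = m * a(Wv) with a ReLU as the activation a."""
    activations = np.maximum(W @ v, 0)
    if not training:
        # At test time every unit is kept; scale by keep_prob so the
        # expected output matches what the next layer saw in training.
        return keep_prob * activations
    mask = rng.random(activations.shape) < keep_prob   # m: 1 = keep, 0 = drop
    return mask * activations

v = rng.random(8)                  # input vector, n = 8
W = rng.standard_normal((4, 8))    # weight matrix, d x n with d = 4

print(dense_with_dropout(v, W))                    # training: some entries zeroed
print(dense_with_dropout(v, W, training=False))    # test time: deterministic
```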

Dropout at test time

Spatial Dropout

In an object localization application that used a DCNN, it was found that applying regular Dropout before a 1x1 convolution layer increased training time but did not prevent overfitting. Thus, the authors proposed Spatial Dropout, which randomly sets entire feature maps to 0, rather than individual 'pixels'. Regular dropout does not work so well on images because adjacent pixels are highly correlated. Dropout randomly zeros out activations in order to force the network to generalize better, overfit less and build in redundancies as a regularization technique. Spatial dropout in convolution layers zeros out entire filters. It is well suited to datasets with a small number of training samples, where generalization is usually an issue.

Label Smoothing Regularization was first proposed in the 1980s.

Instead of hard labels like 0 and 1, smooth the labels by making them close to 0 and 1.

For example: 0, 1 -> 0.1, 0.9

Label Smoothing Regularization

It maintains a realistic ratio between the unnormalized log probabilities of erroneous classes by estimating, during training, the marginalized consequence of label dropout. This averts the model from allocating a complete likelihood for each training case. It can be considered the equivalent of replacing a single cross-entropy loss with a pair of losses, the second of which looks at a prior distribution and penalizes the deviation of the predicted label relative to it. This is what was used in 2017 ILSVRC's top model.
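One common way to write label smoothing is to mix the hard labels with a uniform distribution over the classes. The sketch below assumes that formulation; with two classes and `epsilon=0.2` it reproduces the 0, 1 -> 0.1, 0.9 example.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Mix the hard labels with a uniform distribution over the classes."""
    num_classes = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

hard = np.array([[0.0, 1.0]])
soft = smooth_labels(hard, epsilon=0.2)
print(soft)                  # [[0.1 0.9]]
print(soft.sum(axis=-1))     # still sums to 1
```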

## Initializations

Imagine if you'd initialize your variable at least close to the trained values! Wouldn't that save time and compute resources! As of now it's still a dream!

There are hacks though!

You'll be surprised to know that most of the new algorithms actually "steal" their variable values from an already pre-trained network like VGG etc!

Poor initialization of DCNN parameters, which typically number in the millions, and of the weights in particular, can hamper the training process because of the vanishing/exploding gradient problem and hinder convergence. Thus, their initialization is extremely critical. The key factors to consider when selecting an initialization scheme are the activation function to be used, the network depth (which can hamper classification accuracy through degradation), the computational budget available, the size of the dataset and the tolerable complexity of the required solution.

Xavier Initializations

Xavier initialization: Glorot and Bengio (2010) evaluated how backpropagation gradients and activations varied across different layers; based on these considerations, they proposed a normalized initialization scheme that essentially adopted a balanced uniform distribution for weight initialisation. The initial weights are drawn from a uniform or Gaussian distribution with zero mean and a precise variance. Xavier initialization promotes the propagation of signals deep into DNNs and has been shown to lead to substantively faster convergence. Its main limitation is that its derivation is based on the assumption that the activations are linear, thus making it inappropriate for ReLU and PReLU activations (He et al., 2015a).

Theoretically derived adaptable initialization: He et al. (2015) derived a theoretically sound initialisation that considered these non-linear activations. It led to the initialization of weights from a zero-mean Gaussian distribution whose standard deviation is $\sqrt{2/n_l}$, where $n_l$ is the number of connections of the response at layer $l$. Furthermore, they initialized the biases to zero and showed empirically that this initialization scheme was suited to training extremely deep models, where Xavier initialization was not.
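Both schemes come down to choosing the spread of a zero-mean distribution from the layer size. The sketch below assumes the uniform variant of Xavier initialization and takes $n_l$ to be the number of inputs to the layer (fan-in); the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Normalized initialization: U(-limit, limit), variance 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # Zero-mean Gaussian with standard deviation sqrt(2 / n).
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W_xavier = xavier_uniform(512, 256)
W_he = he_normal(512, 256)
print(W_xavier.std(), np.sqrt(2.0 / (512 + 256)))   # both about 0.051
print(W_he.std(), np.sqrt(2.0 / 512))               # both about 0.0625
```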

## Batch Normalization

Notice when Batch Normalization is performed!

Normalization

Notice what BN does. Also notice that BN doesn't do much if Skip-Connections are used (as in ResNet)

In addition to having a large number of parameters, the training of DCNNs is convoluted by a phenomenon known as internal covariance shift, which is caused by changes to the distribution of each layer's inputs because of parameter changes in the previous layer.

A very good explanation from quora - If you take a very long column and apply load on it, it will fail much before than axial capacity of its section due to buckling. Most appropriate engineering solution to increase its load capacity is to prevent buckling by lateral support intermittently. Such supports may be of fixed or pinned type, to prevent sway, rotation or both for the intermittent column sections. Similarly, for very deep network unless intermediate layers are constrained with whitening, it will lose its learning capacity and accuracy due to its internal co-variate shift, where whitening means zero mean, unit variance and decorrelated. One of the brilliant solutions for such problem is Batch Normalization. For both, Buckling or Co-Variate Shift a small perturbation leads to a large change in the later.

This phenomenon has severe consequences, which include slower training due to lower learning rates, the need for careful parameter initialization, and complications when training DCNNs with saturating non-linear activations. To reduce the consequences of internal covariate shift, Ioffe and Szegedy (2015) proposed a technique known as batch normalization. This technique introduces a normalization step, which is simply a nonlinear transform applied to each activation, that fixes the means and variances of layer inputs. BN computes the mean and variance estimates over mini-batches rather than over the entire training set. The idea is that, instead of just normalising the inputs to the network, we normalise the inputs to layers within the network. It’s called “batch” normalization because, during training, we normalise the activations of the previous layer for each batch, i.e. apply a transformation that keeps the mean activation close to 0 and the activation standard deviation close to 1.
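As a rough illustration of the normalization step, here is a minimal sketch for a single activation across one mini-batch. The learned scale `gamma`, shift `beta`, and the small constant `eps` come from the usual formulation of batch normalization and are assumptions here, not something this page defines.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation across a mini-batch, then scale and shift it."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]


# Four values of the same activation, one per example in the mini-batch.
activations = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(activations)
print(out)
```

With `gamma=1` and `beta=0` the output has a mean close to 0 and a standard deviation close to 1, which is the behaviour described above.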

The benefits of Batch Normalization are that networks train faster, higher learning rates become usable, weights are easier to initialize, more activation functions become viable, deeper networks are simpler to build, and some regularization is provided (as it adds a little noise to the network). BN adds close to 30% computational overhead and requires an additional 25% of parameters per iteration.

## Residuals

ResNet Block

In 2015, ultra-deep networks (UDNs) were introduced. UDNs suffer from poor propagation of activations and gradients because several nonlinear transformations are stacked on top of each other.

This was solved by the residual learning framework.

He et al. reformulated the layers of the network and forced them to learn residual functions with reference to their preceding layer inputs. This allowed for errors to be propagated directly to the preceding units and thus made these networks easier to optimize and easier to train.
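The idea of learning a residual with reference to the layer input can be written as $y = F(x) + x$. A toy sketch, with a hypothetical `residual_fn` standing in for the stacked convolution layers:

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, so the layers inside residual_fn only learn the residual F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A block that has learned nothing yet: F(x) = 0."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
# With F(x) = 0 the block is an identity mapping, so the signal passes straight through.
print(residual_block(x, zero_residual))
```

Because the identity path is always present, the error signal also has a direct route back to earlier layers, which is the property described above.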

## Wide Residual Networks

Wide - Increasing Number of filters

Given that the extremely deep residual networks of He et al. (2016) were slow to train, their depth was reduced and width increased in a new variant called WRN (Zagoruyko & Komodakis, 2017).

These WRNs were much shallower, since they consisted of 16 layers compared to the 1000 of He et al. (2016), yet they outperformed all the previous residual models in terms of efficiency and accuracy and set a new state of the art.

Wider networks run faster and are more accurate. For example, a WRN-50-ResNet would perform as well as ResNet200, both achieving a 6% error rate.

However, these were later superseded by the shortcut technique introduced by DenseNet.

## DenseNet

DenseNet

The recently proposed densely connected convolutional networks (Huang, Liu, Weinberger, & van der Maaten, 2016) extend the idea of skip connections by connecting, in the usual feedforward fashion, each layer with every other layer in the network.

Thus, the feature maps of all the previous layers are used as the inputs for each succeeding layer. The new feature map is a concatenation of the earlier layers.

They obtained accuracy comparable to residual networks but required significantly fewer parameters.

For example, DenseNet achieves accuracy similar to ResNet while using less than half the number of parameters and roughly half the number of FLOPs.

## MobileNet

MobileNet

Regular Convolution - all input channels (like RGB) are combined into 1 channel.

First Step in Depthwise Separable MobileNet Architecture.

Second Step - Pointwise 1x1 convolution in MobileNet Architecture.

Source

Sometime this April 2017 a very interesting paper titled MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications appeared on arXiv.org.

The authors of the paper claim that this kind of neural network runs very efficiently on mobile devices and is nearly as accurate as much larger convolutional networks like our good friend VGGNet-16.

Depthwise Separable Convolutions

The big idea behind MobileNets: Use depthwise separable convolutions to build light-weight deep neural networks.

A regular convolutional layer applies a convolution kernel (or “filter”) to all of the channels of the input image. It slides this kernel across the image and at each step performs a weighted sum of the input pixels covered by the kernel across all input channels.

The important thing is that the convolution operation combines the values of all the input channels. If the image has 3 input channels, then running a single convolution kernel across this image results in an output image with only 1 channel per pixel.

So for each input pixel, no matter how many channels it has, the convolution writes a new output pixel with only a single channel. (In practice we run many convolution kernels across the input image. Each kernel gets its own channel in the output.)

The MobileNets architecture also uses this standard convolution, but just once as the very first layer. All other layers do “depthwise separable” convolution instead. This is actually a combination of two different convolution operations: a depthwise convolution and a pointwise convolution.

A depthwise convolution works like the right image titled First Step in Depthwise Separable MobileNet Architecture.

Unlike a regular convolution, it does not combine the input channels but it performs convolution on each channel separately. For an image with 3 channels, a depthwise convolution creates an output image that also has 3 channels. Each channel gets its own set of weights.

The purpose of the depthwise convolution is to filter the input channels. Think edge detection, color filtering, and so on.

The depthwise convolution is followed by a pointwise convolution. This really is the same as a regular convolution but with a 1×1 kernel - (Right Image - Second Step).

In other words, this simply adds up all the channels (as a weighted sum).

As with a regular convolution, we usually stack together many of these pointwise kernels to create an output image with many channels.

The purpose of this pointwise convolution is to combine the output channels of the depthwise convolution to create new features.

Why do this?

The end results of both approaches are pretty similar — they both filter the data and make new features — but a regular convolution has to do much more computational work to get there and needs to learn more weights.

So even though it does (more or less) the same thing, the depthwise separable convolution is going to be much faster!

The paper shows the exact formula you can use to compute the speed difference; for 3×3 kernels, this new approach needs about 8 to 9 times less computation and is still nearly as effective.
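To make the saving concrete, the sketch below counts multiply-adds for one layer both ways. The cost expressions follow the MobileNets paper (kernel size $D_K$, $M$ input channels, $N$ output channels, feature map size $D_F$); the layer dimensions are made-up values for illustration.

```python
def standard_conv_cost(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a regular convolution: D_K * D_K * M * N * D_F * D_F."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def separable_conv_cost(kernel, in_ch, out_ch, fmap):
    """Depthwise (D_K * D_K * M * D_F * D_F) plus pointwise (M * N * D_F * D_F)."""
    depthwise = kernel * kernel * in_ch * fmap * fmap
    pointwise = in_ch * out_ch * fmap * fmap
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 512 input and output channels, 14x14 feature map.
std = standard_conv_cost(3, 512, 512, 14)
sep = separable_conv_cost(3, 512, 512, 14)
speedup = std / sep
print("cost ratio:", sep / std)  # equals 1/N + 1/D_K^2
print("speedup:", speedup)
```

The ratio simplifies to $1/N + 1/D_K^2$, so with 3×3 kernels and many output channels the saving approaches, but never quite reaches, a factor of 9.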

It’s no surprise therefore that MobileNets uses up to 13 of these depthwise separable convolutions in a row!

Topics covered here need understanding of basic concepts first. If you haven't read Deep Concepts in DNNs, we recommend you do that right away!

## Recognition vs Detection

Object Recognition vs Object Detection

In object recognition, our aim is to recognize what is in the image, e.g. a dog and a bridge. In object detection, however, we need to specify where exactly the dog(s) and bridge are. Recognition is pretty easy these days, while detection is still a work in progress. All you have covered until now was object recognition.

Recognition is easy because the algorithm needs to predict only a few things (like the class name), while in detection it needs to classify each pixel. We simplify this problem by ignoring lower prediction values, predicting bounding boxes instead of exact object masks, and mixing layers with different receptive fields, but this is easier said than done.

There are two main approaches driving detection algorithms, namely:

• YOLO-like approach, where k-means extracted anchor boxes are used, and
• SSD-like approach, where fixed number of predefined bounding boxes are used.

Both are similar yet very different.

## Receptive Field

Receptive Field

Center "Pixel" in the second layer has "seen" all the image!

A dog and a Cat, and block classification representing class predictions

Look what each block is actually predicting!

Compound Eyes

Receptive field is perhaps one of the most important concepts in Convolutional Neural Networks, and it deserves far more attention than it gets. All DNN algorithms are built around this idea, but very few cover the fundamental aspects or features of the receptive field. Most just call it filter size without giving us that 'aha' moment to realise the power it packs!

One of the best papers on this topic was written last year by Wenjie Luo et al.

Let's focus on the second image on the right.

We are performing 3x3 convolutions twice. The first time the 3x3 convolution happens, every pixel in the first layer has "seen" 9 pixels in total (marked green in the input image). Since we were convolving, the 3x3 matrix multiplication must have changed/filtered the results, but the important thing to realise here is that this pixel now has information about 9 pixels in the input image. We are sort of collecting (and filtering) information about 9 pixels and placing it in a new pixel in the first layer.

The question we need to spend some time on is: what does this new image (layer 1) represent? Wouldn't each pixel in this layer represent "concepts", "features" or "data" from 9 other pixels?

Magic happens once we perform the same process on the first layer, convolving with a 3x3 filter. Now the new pixel in the center of the second layer output has "seen" all of the image! (This is because the 9 pixels it convolved over in the first layer had each seen the surrounding 9 pixels in the original image.) Look at the image, and you should understand what we mean here. The center pixel is a "champion informant" of the whole image. Pixels next to it would act the same way, but their "receptive field" would be smaller.

Although the receptive field is traditionally considered to be just the filter size, if we dig deeper we realize that when we cascade convolutions, the receptive field of pixels in deeper layers is much more than just the filter size.
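How the receptive field grows when convolutions are cascaded can be computed with a small recurrence. This sketch assumes plain convolutions described only by kernel size and stride (no dilation); the function name is ours.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one cell after a stack of layers.

    layers is a list of (kernel_size, stride) tuples, applied in order.
    """
    field, jump = 1, 1  # jump = distance in input pixels between adjacent cells
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


print(receptive_field([(3, 1)]))          # 3
print(receptive_field([(3, 1), (3, 1)]))  # 5
```

So after two 3x3 convolutions, one cell covers a 5x5 patch of the input, which is the whole image if the input is 5x5.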

I have used the word pixel here abundantly, and I apologize for that. Technically what I am calling a pixel here is just a "cell" value in a Matrix, but calling it pixels helps me visualize the problem and explain it better.

Another amazing paper which came just 3 days ago (6th March 2018) is The Building Blocks of Interpretability

Look at the dog and the cat image on your right. When we run a CNN and only look at the final image, each box is predicting a class accordingly.

Let's get awed now! The second image shows what each cell sees (check out high res images here). The geometric centroid of the dog would be "seeing" the whole of the dog, and as we move away from this conceptual centroid each cell would see less of it. The moment a cell sees very little of the dog, the bounding box would end in the object detection algorithm!

Look at the insect vision image on the right. The first image is Hollywood's representation of how a compound eye might see. This is what our neural networks are doing, and it is totally wrong. Insect eyes seem to be doing a sort of average/max-pooling straight after the input from each photo-receptor, creating a blurry image that is good enough for their tasks.

Personally, I feel good about learning a new way in which the receptive field can be interpreted, but sad as well, as now I ask: should this much work be done for object detection? There must be an easier way hiding right in front of us!

## Anchor Boxes

Sliding Window

Region Proposal

Anchor Boxes

h vs w graph

k-means

Anchor box dimensions in YOLOv2

Until very recently, sliding window was the main method used for detecting an object.

As should be evident from the image on your right, there are a few problems here:

• What should be the size of this window? How do we handle big and small objects at the same time?
• Do we need to slide it at each pixel, or can we make a jump? What is the perfect jump size?

Sliding Window as a concept is simple, but extremely compute intensive.

To understand how frustrating it can get to use this algorithm, just check out this PyImageSearch Image. Don't forget to sit back and relax!

RCNN came out with a region proposal mechanism that needs two different stages. The first (region proposal) stage predicts all the possible regions where an object might exist, then the second (prediction) network predicts the object. This handled the problem of differently sized objects, but it is again extremely slow (RCNN took 20 seconds to detect an object!), as it required a forward pass of the CNN (AlexNet) for every single region proposal for every single image (that's about running AlexNet 2000 times for 2000 proposals).

Why not run the CNN just once per image and then find a way to share that computation across the ~2000 proposals?

This is exactly what Fast R-CNN does using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions. In the image above, notice how the CNN features for each region are obtained by selecting a corresponding region from the CNN’s feature map. Then, the features in each region are pooled (usually using max pooling). So all it takes us is one pass of the original image as opposed to ~2000! But this was also too slow for any real time implementation, taking around 2 seconds per image.

Welcome Anchor Boxes

Faster RCNN, SSD and YOLOv2, all use Anchor Boxes.

Intuitively, we know that objects in an image should fit certain common aspect ratios and sizes. For instance, we know that we want some rectangular boxes that resemble the shapes of humans. Likewise, we know we won’t see many boxes that are very very thin. In such a way, we create k such common aspect ratios we call anchor boxes. For each such anchor box, we output one bounding box and score per position in the image.

Let us see how to do it!

• Compute the center location, width and height of each bounding box, and normalize it by image dimensions (in the dataset)
• Plot the $h$ and $w$ for each box as shown in the right "h vs w graph"
• Use K-means clustering to compute cluster centers (centroids).
• Repeat the clustering for different numbers of clusters and, for each, compute the mean of the maximum IOU between the bounding boxes and the individual anchors.
• Plot centroids vs mean IOU.
• Pick the top 5 anchor boxes (5 for YOLOv2 where IOU was above 65%)
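The steps above can be sketched as follows. This is a simplified illustration, not the reference YOLOv2 code: boxes are reduced to normalized `(w, h)` pairs, each box is assigned to the centroid with the highest IOU (both boxes assumed to share the same centre), and the toy `boxes` list is invented.

```python
import random


def wh_iou(box, centroid):
    """IOU of two boxes given only (w, h), assuming they share the same centre."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (w, h) pairs into k anchors, assigning by highest IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        new_centroids = []
        for c, members in enumerate(clusters):
            if not members:  # keep an empty cluster's centroid unchanged
                new_centroids.append(centroids[c])
                continue
            w = sum(b[0] for b in members) / len(members)
            h = sum(b[1] for b in members) / len(members)
            new_centroids.append((w, h))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def mean_iou(boxes, centroids):
    """Mean over boxes of the best IOU with any anchor."""
    return sum(max(wh_iou(b, c) for c in centroids) for b in boxes) / len(boxes)


# Toy dataset of normalized (w, h) pairs.
boxes = [(0.10, 0.20), (0.12, 0.22), (0.50, 0.40),
         (0.55, 0.45), (0.90, 0.80), (0.85, 0.90)]
for k in (1, 2, 3):
    anchors = kmeans_anchors(boxes, k)
    print(k, round(mean_iou(boxes, anchors), 3))
```

Plotting the printed mean IOU against `k` gives the curve from which the number of anchors is picked.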

Top 5 results for YOLOv2

Now if we divide our image, say in 13x13 cells, allow for scaling and slight translation for these bounding boxes, and predict the best anchor box, we are done!

If a cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box ground-truth/prior has width and height $(g_w, g_h)$, then the predictions correspond to:

• $b_x = σ(t_x) + c_x$, where σ is sigmoid
• $b_y = σ(t_y) + c_y$
• $b_w = g_w e^{t_w}$
• $b_h = g_h e^{t_h}$

In YOLOv2, we directly predict $t_x, t_y, t_w,$ and $t_h$ among other things for each cell.
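The four equations translate directly into code. In this sketch all quantities are in grid-cell units; the function name and the example cell are ours, and the prior size is one of the YOLOv2 anchors listed on this page.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, g_w, g_h):
    """Turn raw predictions into a box centre and size, in grid-cell units."""
    b_x = sigmoid(t_x) + c_x     # centre stays inside cell (c_x, c_y)
    b_y = sigmoid(t_y) + c_y
    b_w = g_w * math.exp(t_w)    # prior width scaled by e^{t_w}
    b_h = g_h * math.exp(t_h)    # prior height scaled by e^{t_h}
    return b_x, b_y, b_w, b_h


# All-zero predictions give a box centred in cell (6, 6) with the prior's own size.
print(decode_box(0.0, 0.0, 0.0, 0.0, c_x=6, c_y=6, g_w=3.33843, g_h=5.47434))
# (6.5, 6.5, 3.33843, 5.47434)
```

Because the sigmoid is bounded between 0 and 1, the predicted centre can never leave its cell, which is what keeps training stable.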

You can learn more about the exact math behind the k-means part of the YOLOv2 algorithm here.

## Intersection over Union (IoU)

Intersection over Union

Intersection over Union is an evaluation metric used to measure the accuracy of an object detector on a particular dataset. We often see this evaluation metric used in object detection challenges such as the popular PASCAL VOC challenge.

More formally, in order to apply Intersection over Union to evaluate an (arbitrary) object detector we need:

1. The ground-truth bounding boxes (i.e., the hand labeled bounding boxes from the testing set that specify where in the image our object is).
2. The predicted bounding boxes from our model.

In the numerator we compute the area of overlap between the predicted bounding box and the ground-truth bounding box.

The denominator is the area of union, or more simply, the area encompassed by both the predicted bounding box and the ground-truth bounding box.

Dividing the area of overlap by the area of union yields our final score — the Intersection over Union.
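A minimal sketch of this computation, assuming boxes are given as `(x_min, y_min, x_max, y_max)` corner coordinates (the page does not fix a box format, so this is our choice):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width/height are clamped at zero so disjoint boxes get no overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
# Overlap is 5 x 10 = 50, union is 100 + 100 - 50 = 150.
print(iou(ground_truth, prediction))
```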

## mAP

Mean Average Precision

Precision vs Recall

### Precision and Recall.

Precision is the percentage of true positives in the retrieved results. Let us say we have 8 images of aeroplanes in a dataset of 10 images and the network predicted all 10 as aeroplanes. Then the precision of this network is 80%.

$$precision = \frac {t_p}{t_p + f_p} = \frac {t_p}{n}$$

where $t_p$, $f_p$, and $n$ represent the correctly detected images, the wrongly detected images, and the total number of images the network labelled as aeroplanes, respectively.

Recall is the percentage of the relevant objects that the system retrieves. Let us assume we have 10 images of aeroplanes in our dataset, but our network detected only 9 of them as aeroplanes; then our recall is 90%.

$$recall = \frac {t_p}{t_p + f_n}$$

where $t_p$ and $f_n$ represent the detected images and the not-detected images, respectively.

This means that if our precision is low, our network is predicting more things as aeroplanes than actually exist, and low recall means it is not able to detect all the aeroplanes.
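The two worked examples above, written out as code (the function names are ours):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn)


# 10 images predicted as aeroplanes, 8 of them really are: 8 true, 2 false positives.
print(precision(tp=8, fp=2))  # 0.8
# 10 aeroplane images in the dataset, 9 detected: 9 true positives, 1 false negative.
print(recall(tp=9, fn=1))     # 0.9
```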

### mAP or Mean Average Precision

mAP is the mean of all the average precision across all our classes!

Generally it is seen that precision falls if we try to improve recall.

## Non-Max Suppression

Just take the maximum one! More description coming soon!
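Until the fuller description arrives, here is a sketch of the usual greedy procedure: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The `(score, box)` format and the corner-coordinate boxes are assumptions; the 0.3 threshold mirrors the `NMS_THRESHOLD` used in the training code on this page.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.3):
    """Greedy NMS over a list of (score, box) detections of a single class."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard everything that overlaps the box we just kept too strongly.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # heavy overlap with the first box
    (0.7, (20, 20, 30, 30)),  # a separate object
]
kept = non_max_suppression(detections)
print([score for score, _ in kept])  # [0.9, 0.7]
```

In a multi-class detector this would typically be run once per class.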

## YOLOv2 Loss Function

Scary?

YOLOv2 Network Output

| var1 | var2 | (var1 - var2)^2 | (sqrt(var1) - sqrt(var2))^2 |
|---|---|---|---|
| 0.0300 | 0.020 | 9.99e-05 | 0.001 |
| 0.0330 | 0.022 | 0.00012 | 0.0011 |
| 0.0693 | 0.046 | 0.000533 | 0.00233 |
| 0.2148 | 0.143 | 0.00512 | 0.00723 |
| 0.8808 | 0.587 | 0.0862 | 0.0296 |
| 4.4920 | 2.994 | 2.2421 | 0.1512 |

On your right is YOLOv2's loss function. Doesn't it look scary?

Let's first look at what the network actually predicts.

If we recap, YOLOv2 predicts detections on a 13x13 feature map, so in total we have 169 cells.

We have 5 anchor boxes. For each anchor box we need an objectness-confidence score (is there an object here?), 4 coordinates ($t_x, t_y, t_w,$ and $t_h$), and 20 class probabilities. This can crudely be seen as 20 coordinates, 5 confidence scores, and 100 class probabilities as shown in the image on the right, so in total 125 filters of 1x1 size would be needed.
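The arithmetic behind the 125 filters, as a quick check (the variable names are ours and loosely follow the training code on this page):

```python
GRID_H, GRID_W = 13, 13
BOX = 5        # anchor boxes per cell
COORDS = 4     # t_x, t_y, t_w, t_h
CLASSES = 20

values_per_box = COORDS + 1 + CLASSES   # 4 coordinates + 1 objectness + 20 classes = 25
filters = BOX * values_per_box          # 5 * 25 = 125 filters of size 1x1
cells = GRID_H * GRID_W                 # 13 * 13 = 169 cells

# The flat 13x13x125 output is usually reshaped so each anchor box has its own slot.
output_shape = (GRID_H, GRID_W, BOX, values_per_box)
print(filters, cells, output_shape)  # 125 169 (13, 13, 5, 25)
```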

So we have a few things to worry about:

• $x_i, y_i$, which is the location of the centroid of the anchor box
• $w_i, h_i$, which is the width and height of the anchor box
• $C_i$, which is the objectness, i.e. the confidence score of whether there is any object or not, and
• $p_i(c)$, which is the class probability used in the classification loss.

All losses are mean squared errors, except classification loss, which uses cross entropy function.

Now, let's break down the formula in the image.

• We need to compute losses for each anchor box (5 in total); $\sum_{j=0}^B$ represents this part.

• We need to do this for each of the 13x13 cells, where S = 13; $\sum_{i=0}^{S^2}$ represents this part.

• $𝟙_{ij}^{obj}$ is 1 when there is an object in the cell $i$, else 0.

• $𝟙_{ij}^{noobj}$ is 1 when there is no object in the cell $i$, else 0. We need this to make sure we also reduce the confidence when there is no object.

• $𝟙_{i}^{obj}$ is 1 when a particular class is predicted, else 0.

• λs are constants. λ is highest for coordinates in order to focus more on detection (remember, we have already trained the network for recognition!)

• We can also notice that $w_i, h_i$ are under a square root. This is done so that a given error counts for more in small bounding boxes than in large ones, as small boxes need to be adjusted more. Check out the table on your right.
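The effect of the square root can be reproduced with a few lines. The pairs below are the `var1`/`var2` values from the table; results may differ slightly from the table in the last digits because the table inputs are rounded.

```python
import math

pairs = [(0.0300, 0.020), (0.0330, 0.022), (0.0693, 0.046),
         (0.2148, 0.143), (0.8808, 0.587), (4.4920, 2.994)]


def squared_error(a, b):
    return (a - b) ** 2


def sqrt_squared_error(a, b):
    return (math.sqrt(a) - math.sqrt(b)) ** 2


for var1, var2 in pairs:
    plain = squared_error(var1, var2)
    rooted = sqrt_squared_error(var1, var2)
    print(f"{var1:7.4f} {var2:6.3f} {plain:10.6f} {rooted:10.6f}")
```

For small values the square-root version gives the larger loss, while for large values it gives a much smaller one, so small boxes are no longer drowned out by large ones.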

You can find another similar explanation here.

Not that scary, right!

# DCNN Performance

DCNN Performance on the ImageNet Dataset : MIT

Sample images in ImageNet

ML overtaking CV

ImageNet

The ImageNet project is a large visual database designed for use in visual object recognition software research. Over 14 million URLs of images have been hand-annotated by ImageNet to indicate what objects are pictured; in at least one million of the images, bounding boxes are also provided.

A dramatic 2012 breakthrough in solving the ImageNet Challenge is widely considered to be the beginning of the deep learning revolution of the 2010s.

| Code Name | Year | Contribution | Top 5 Error % |
|---|---|---|---|
| AlexNet | 2012 | DCNN model across parallel GPUs, innovations include Dropout, data augmentation, ReLUs, and local response normalization | 15.3 |
| Zeiler & Fergus Net | 2014 | Novel visualization method; larger convolution layers | 11.7 |
| Spatial Pyramid Pooling Net | 2014 | SPP for flexible image size | 8.06 |
| VGG-net | 2014 | Increase depth, more convolution layers, 3x3 convs | 7.32 |
| GoogLeNet | 2015 | Novel inception arch., ultra large, dimensionality reduction | 6.67 |
| PReLU-net | 2015a | PReLU activation functions and robust initialization scheme | 4.94 |
| BN-Inception | 2015 | BN combined with inception | 4.83 |
| Inception-V3 | 2015 | Factorized convolutions, aggressive dimensionality reduction | 3.58 |
| ResNets | 2015 | Residual functions/blocks integrated into DCNN | 3.57 |
| BN-Inception-ResNet | 2016 | BN, inception arch. integrated with residual functions | 3.08 |
| CUImage | 2016 | Ensemble, Gated Bi-Directional CNN | 2.991 |
| Squeeze and Excite | 2017 | Feature recalibration, label-smoothing regularization, large batch size | 2.251 |
| ILSVRC | 2018 | STOPPED! Focus now is on efficiency :) | |

# YoloV2

Awesome YoloV2 Video from Andreas Refsgaard at a creative coding studio - Støj

Source YOLOv2 Paper - General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects. Not YOLOv2!

YOLO is definitely the most written about object detection algorithm, and why not:

• At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007.
• Using YOLOv2's joint training algorithm, it can be trained as a detector on any classification data (without bounding boxes)!
• Even faster version runs at 155 fps!
• And the best thing! It is open source.
• You can read the original paper here.
• YOLO paper is hard to understand, so alongside this paper, we also refer to this beautiful end-to-end implementation by Experiencor.

## YOLOv2 Paper Notes

We need to cover some very important concepts before we can understand YOLOv2. If you haven't already read Advanced Concepts, we recommend you do that right away!

Taken from Understanding YOLO Bounding Boxes

YOLOv2 vs Others

13x13 Grid Prediction

• YOLOv2 fixed v1's recall and localization problems
• By adding BN to all the layers it gets 2% improvement in mAP
• YOLOv2 is first fine tuned on higher resolution images. This gives the network time to adjust its filters to work better on higher resolution input. Then it is fine tuned on detection. This gives 4% mAP improvement.
• YOLOv2 fully removes FC layers and uses anchor boxes instead to predict bounding boxes.
• YOLOv2 predicts class as well as objectness for every anchor box.
• Anchor dimensions are picked using k-means clustering on the dimensions of original bounding boxes. Final anchor boxes are: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
• The network predicts 5 bounding boxes at each cell and 5 coordinates for each bounding box: $t_x, t_y, t_w, t_h,$ and $t_o$.
• If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box ground-truth/prior has width and height $(g_w, g_h)$, then the predictions correspond to:
• $b_x = σ(t_x) + c_x$, where σ is sigmoid
• $b_y = σ(t_y) + c_y$
• $b_w = g_w e^{t_w}$
• $b_h = g_h e^{t_h}$
• $P_r(object) * IOU(b, object) = σ(t_o)$
• YOLO predicts detections on a 13x13 feature map. This may not be sufficient to detect smaller objects. To fix this, they simply add a passthrough layer that brings features from an earlier layer at 26x26 resolution. The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels.
• Every 10 batches, network chooses random new image dimension size (multiples of 32) from 320x320 to 608x608.
• Final model, called Darknet-19 has 19 convolution layers and 5 maxpooling layers. 1x1 convolutions are used to compress the feature representations between 3x3.
• Network is first trained on classification for 160 epochs.
• After classification training, the last convolution layer is removed, and three 3x3 convolution layers with 1024 filters each, followed by a final 1x1 convolution layer, are added. The network is again trained for 160 epochs.
• During training, both detection and classification datasets are mixed. When the network sees an image with a detection label, full back-propagation is performed; otherwise only the classification part is back-propagated.
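The passthrough layer mentioned in the notes above stacks each 2x2 block of neighbouring features into channels, halving the spatial size and quadrupling the depth (for example, 26x26 becomes 13x13). A pure-Python sketch of that rearrangement on a tiny map follows; the channel ordering is our assumption of how a block-size-2 space-to-depth operation lays out its output.

```python
def space_to_depth(fmap, block=2):
    """Stack each block x block neighbourhood into the channel dimension.

    fmap[y][x] is a list of channel values; the result has size
    (H / block, W / block, C * block * block).
    """
    height, width = len(fmap), len(fmap[0])
    out = []
    for y in range(0, height, block):
        row = []
        for x in range(0, width, block):
            cell = []
            for dy in range(block):
                for dx in range(block):
                    cell.extend(fmap[y + dy][x + dx])
            row.append(cell)
        out.append(row)
    return out


# A 4x4 map with one channel becomes a 2x2 map with four channels.
fmap = [[[4 * y + x] for x in range(4)] for y in range(4)]
stacked = space_to_depth(fmap)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 2 2 4
print(stacked[0][0])  # [0, 1, 4, 5]
```

After this step the high resolution features have the same spatial size as the low resolution ones, so the two can simply be concatenated along the channel axis.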

## Code - Inference

Clone into DarkNet repository

git clone https://github.com/pjreddie/darknet
cd darknet
make


Let's run some simpler code first (macOS and Ubuntu only).

git clone the original DarkNet repository from github and then cd in to it. Issue the make command to build DarkNet for your system.

wget https://pjreddie.com/media/files/yolo.weights


Let's execute the model now.

In the data folder we have an image called dog.jpg. We shall use this image to run our network.

Run DarkNet Yolo

./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg


Predictions

Issue the detect command. It needs a model-architecture-configuration file cfg/yolo.cfg matching our weights, the weights file yolo.weights which you just downloaded, and finally the image file you want to run detection on data/dog.jpg.

Your console should show output similar to this:

• dog: 82%
• car: 28%
• truck: 64%
• bicycle: 85%

You can also run Tiny YOLO model. First get the weights: wget https://pjreddie.com/media/files/tiny-yolo-voc.weights and then run this command: ./darknet detect cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights data/dog.jpg. This time you'll see:

• train: 35%
• train: 55%
• stop sign: 78%
• bicycle: 36%

## Code - Training

Initializations

from keras.models import Sequential, Model
from keras.layers import Reshape, Activation, Conv2D, Input, MaxPooling2D, BatchNormalization, Flatten, Dense, Lambda
from keras.layers.advanced_activations import LeakyReLU
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from keras.optimizers import SGD, Adam, RMSprop
from keras.layers.merge import concatenate
import matplotlib.pyplot as plt
import keras.backend as K
import tensorflow as tf
import imgaug as ia
from tqdm import tqdm
from imgaug import augmenters as iaa
import numpy as np
import pickle
import os, cv2
from preprocessing import parse_annotation, BatchGenerator
from utils import WeightReader, decode_netout, draw_boxes, normalize

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""

%matplotlib inline

LABELS = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

IMAGE_H, IMAGE_W = 416, 416
GRID_H,  GRID_W  = 13 , 13
BOX              = 5
CLASS            = len(LABELS)
CLASS_WEIGHTS    = np.ones(CLASS, dtype='float32')
OBJ_THRESHOLD    = 0.3#0.5
NMS_THRESHOLD    = 0.3#0.45
ANCHORS          = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]

NO_OBJECT_SCALE  = 1.0
OBJECT_SCALE     = 5.0
COORD_SCALE      = 1.0
CLASS_SCALE      = 1.0

BATCH_SIZE       = 16
WARM_UP_BATCHES  = 0
TRUE_BOX_BUFFER  = 50

wt_path = 'yolo.weights'
train_image_folder = '/home/andy/data/coco/train2014/'
train_annot_folder = '/home/andy/data/coco/train2014ann/'
valid_image_folder = '/home/andy/data/coco/val2014/'
valid_annot_folder = '/home/andy/data/coco/val2014ann/'


If you have covered the stuff above, you can just brush through the code now! This is the code we'll refer to.

Outline of Steps

• Initialization
• http://images.cocodataset.org/zips/train2014.zip <= train images
• http://images.cocodataset.org/zips/val2014.zip <= validation images
• http://images.cocodataset.org/annotations/annotations_trainval2014.zip <= train and validation annotations
• Run this script to convert annotations in COCO format to VOC format
• https://gist.github.com/chicham/6ed3842d0d2014987186#file-coco2pascal-py
• https://pjreddie.com/media/files/yolo.weights
• Specify the directory of train annotations (train_annot_folder) and train images (train_image_folder)
• Specify the directory of validation annotations (valid_annot_folder) and validation images (valid_image_folder)
• Specify the path of pre-trained weights by setting variable wt_path
• Construct equivalent network in Keras
• Network arch from https://github.com/pjreddie/darknet/blob/master/cfg/yolo-voc.cfg
• Perform training
• Perform detection on an image with newly trained weights
• Perform detection on a video with newly trained weights

As you can see, before we run this code, we first need to pre-process our dataset. This model will be trained on the COCO dataset.

The COCO dataset is huge, so downloading and pre-processing it is going to take a lot of time.

Also, to save time, we will use the pre-trained model from YOLOv2's original implementation, shuffle the last few layers, and then train just those layers.

If you try to train the whole network from scratch, it would take close to two weeks!

### Constructing the network

# the function to implement the reorganization layer (thanks to github.com/allanzelener/YAD2K)
def space_to_depth_x2(x):
    return tf.space_to_depth(x, block_size=2)

input_image = Input(shape=(IMAGE_H, IMAGE_W, 3))
true_boxes  = Input(shape=(1, 1, 1, TRUE_BOX_BUFFER , 4))

# Layer 1
x = Conv2D(32, (3,3), strides=(1,1), padding='same', name='conv_1', use_bias=False)(input_image)
x = BatchNormalization(name='norm_1')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
# Layer 2
x = Conv2D(64, (3,3), strides=(1,1), padding='same', name='conv_2', use_bias=False)(x)
x = BatchNormalization(name='norm_2')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
# Layer 3
x = Conv2D(128, (3,3), strides=(1,1), padding='same', name='conv_3', use_bias=False)(x)
x = BatchNormalization(name='norm_3')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 4
x = Conv2D(64, (1,1), strides=(1,1), padding='same', name='conv_4', use_bias=False)(x)
x = BatchNormalization(name='norm_4')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 5
x = Conv2D(128, (3,3), strides=(1,1), padding='same', name='conv_5', use_bias=False)(x)
x = BatchNormalization(name='norm_5')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
# Layer 6
x = Conv2D(256, (3,3), strides=(1,1), padding='same', name='conv_6', use_bias=False)(x)
x = BatchNormalization(name='norm_6')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 7
x = Conv2D(128, (1,1), strides=(1,1), padding='same', name='conv_7', use_bias=False)(x)
x = BatchNormalization(name='norm_7')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 8
x = Conv2D(256, (3,3), strides=(1,1), padding='same', name='conv_8', use_bias=False)(x)
x = BatchNormalization(name='norm_8')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
# Layer 9
x = Conv2D(512, (3,3), strides=(1,1), padding='same', name='conv_9', use_bias=False)(x)
x = BatchNormalization(name='norm_9')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 10
x = Conv2D(256, (1,1), strides=(1,1), padding='same', name='conv_10', use_bias=False)(x)
x = BatchNormalization(name='norm_10')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 11
x = Conv2D(512, (3,3), strides=(1,1), padding='same', name='conv_11', use_bias=False)(x)
x = BatchNormalization(name='norm_11')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 12
x = Conv2D(256, (1,1), strides=(1,1), padding='same', name='conv_12', use_bias=False)(x)
x = BatchNormalization(name='norm_12')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 13
x = Conv2D(512, (3,3), strides=(1,1), padding='same', name='conv_13', use_bias=False)(x)
x = BatchNormalization(name='norm_13')(x)
x = LeakyReLU(alpha=0.1)(x)

skip_connection = x

x = MaxPooling2D(pool_size=(2, 2))(x)
# Layer 14
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_14', use_bias=False)(x)
x = BatchNormalization(name='norm_14')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 15
x = Conv2D(512, (1,1), strides=(1,1), padding='same', name='conv_15', use_bias=False)(x)
x = BatchNormalization(name='norm_15')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 16
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_16', use_bias=False)(x)
x = BatchNormalization(name='norm_16')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 17
x = Conv2D(512, (1,1), strides=(1,1), padding='same', name='conv_17', use_bias=False)(x)
x = BatchNormalization(name='norm_17')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 18
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_18', use_bias=False)(x)
x = BatchNormalization(name='norm_18')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 19
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_19', use_bias=False)(x)
x = BatchNormalization(name='norm_19')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 20
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_20', use_bias=False)(x)
x = BatchNormalization(name='norm_20')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 21
skip_connection = Conv2D(64, (1,1), strides=(1,1), padding='same', name='conv_21', use_bias=False)(skip_connection)
skip_connection = BatchNormalization(name='norm_21')(skip_connection)
skip_connection = LeakyReLU(alpha=0.1)(skip_connection)
skip_connection = Lambda(space_to_depth_x2)(skip_connection)

x = concatenate([skip_connection, x])
# Layer 22
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_22', use_bias=False)(x)
x = BatchNormalization(name='norm_22')(x)
x = LeakyReLU(alpha=0.1)(x)
# Layer 23
x = Conv2D(BOX * (4 + 1 + CLASS), (1,1), strides=(1,1), padding='same', name='conv_23')(x)
output = Reshape((GRID_H, GRID_W, BOX, 4 + 1 + CLASS))(x)
# small hack to allow true_boxes to be registered when Keras build the model
output = Lambda(lambda args: args[0])([output, true_boxes])

model = Model([input_image, true_boxes], output)


The actual classification/detection network is very simple.

The model builds off of prior work on network design as well as common knowledge in the field.

Similar to the VGG models, it uses mostly 3 × 3 filters and doubles the number of channels after every pooling step.

Following the work on Network in Network (NIN) it uses global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions.

It uses batch normalization to stabilize training, speed up convergence, and regularize the model.

The final model, called Darknet-19, has only 19 convolutional layers and 5 maxpooling layers.

The network architecture looks like this:

| Type | Filters | Size/Stride | Output |
|---|---|---|---|
| Convolutional | 32 | 3 × 3 | 224 × 224 |
| Maxpool | | 2 × 2/2 | 112 × 112 |
| Convolutional | 64 | 3 × 3 | 112 × 112 |
| Maxpool | | 2 × 2/2 | 56 × 56 |
| Convolutional | 128 | 3 × 3 | 56 × 56 |
| Convolutional | 64 | 1 × 1 | 56 × 56 |
| Convolutional | 128 | 3 × 3 | 56 × 56 |
| Maxpool | | 2 × 2/2 | 28 × 28 |
| Convolutional | 256 | 3 × 3 | 28 × 28 |
| Convolutional | 128 | 1 × 1 | 28 × 28 |
| Convolutional | 256 | 3 × 3 | 28 × 28 |
| Maxpool | | 2 × 2/2 | 14 × 14 |
| Convolutional | 512 | 3 × 3 | 14 × 14 |
| Convolutional | 256 | 1 × 1 | 14 × 14 |
| Convolutional | 512 | 3 × 3 | 14 × 14 |
| Convolutional | 256 | 1 × 1 | 14 × 14 |
| Convolutional | 512 | 3 × 3 | 14 × 14 |
| Maxpool | | 2 × 2/2 | 7 × 7 |
| Convolutional | 1024 | 3 × 3 | 7 × 7 |
| Convolutional | 512 | 1 × 1 | 7 × 7 |
| Convolutional | 1024 | 3 × 3 | 7 × 7 |
| Convolutional | 512 | 1 × 1 | 7 × 7 |
| Convolutional | 1024 | 3 × 3 | 7 × 7 |
| Convolutional | 1000 | 1 × 1 | 7 × 7 |
| Avgpool | | Global | 1000 |
| Softmax | | | |

We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.
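The passthrough layer and the size of the detection head can be checked with a few lines of plain Python. This is a minimal sketch rather than the code used by the model: `space_to_depth` and `detection_filters` are illustrative names, the nested-list layout `[row][col][channel]` is an assumption made to avoid any dependency, and the channel ordering is only intended to mirror what `tf.space_to_depth` does with `block_size=2`.

```python
def space_to_depth(feature_map, block_size=2):
    """Move each block_size x block_size spatial patch into the channel axis.

    feature_map is a nested list indexed as [row][col][channel].
    """
    height = len(feature_map)
    width = len(feature_map[0])
    out = []
    for row in range(0, height, block_size):
        out_row = []
        for col in range(0, width, block_size):
            stacked = []
            # patch positions are visited row by row, channels stay contiguous
            for d_row in range(block_size):
                for d_col in range(block_size):
                    stacked.extend(feature_map[row + d_row][col + d_col])
            out_row.append(stacked)
        out.append(out_row)
    return out


def detection_filters(boxes, classes):
    """Filters in the final 1 x 1 convolution: 4 coordinates + 1 confidence + class scores per box."""
    return boxes * (4 + 1 + classes)


# A 26 x 26 x 64 map (the passthrough branch for a 416 x 416 input) becomes 13 x 13 x 256,
# which matches the 13 x 13 grid of the main branch and can be concatenated with it.
passthrough = [[[0.0] * 64 for _ in range(26)] for _ in range(26)]
reorganized = space_to_depth(passthrough)
reorganized_shape = (len(reorganized), len(reorganized[0]), len(reorganized[0][0]))

voc_filters = detection_filters(boxes=5, classes=20)
```

No spatial information is discarded by this rearrangement; the finer 26 × 26 features are simply folded into extra channels so that they line up with the coarser 13 × 13 map.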

Several different initial configurations are possible (most differing on the size of the images it was trained on), which results in different performances.

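weight_reader = WeightReader(wt_path)

nb_conv = 23

for i in range(1, nb_conv+1):
    conv_layer = model.get_layer('conv_' + str(i))

    # every convolution except the last one is followed by batch normalization
    if i < nb_conv:
        norm_layer = model.get_layer('norm_' + str(i))

        size = np.prod(norm_layer.get_weights()[0].shape)

        # assumption: WeightReader.read_bytes(n) returns the next n float values
        # of the weight file, stored in the order beta, gamma, mean, variance
        beta  = weight_reader.read_bytes(size)
        gamma = weight_reader.read_bytes(size)
        mean  = weight_reader.read_bytes(size)
        var   = weight_reader.read_bytes(size)

        norm_layer.set_weights([gamma, beta, mean, var])

    if len(conv_layer.get_weights()) > 1:
        # convolution with a bias term (use_bias was left on for conv_23 only)
        bias   = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[1].shape))
        kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
        # the file stores kernels output-channel first; Keras wants output-channel last
        kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
        kernel = kernel.transpose([2,3,1,0])
        conv_layer.set_weights([kernel, bias])
    else:
        kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
        kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
        kernel = kernel.transpose([2,3,1,0])
        conv_layer.set_weights([kernel])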


Training neural networks is hard, and given how complex YOLOv2's loss is (remember it has to predict 13x13x5x25 numbers!) it would be a tremendous challenge to train such a network from scratch, in terms of both compute time and computational resources.

To avoid this, we use a pre-trained base network that shares the same architecture and transfer its weights directly to the network we are training. We can then decide to retrain these weights (but now, instead of being random, they give us an extremely good starting point) or to freeze those layers (and only change the layers we have added).

To repeat, we make sure the network we want to train has two parts. The first part must be the same as (or, if we retrain it, similar to) the pre-trained network from which we transfer weights, and the second part consists of the additional layers or strategies pertinent to the design problem (like object detection in this case).

You can use the same strategy to design other solutions like Image Segmentation, Human Pose Estimation, Super Resolution, etc. A fully trained network is used as a feature extractor, and we pass on these features to the new layers.
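Transferring weights from a Darknet file comes down to reading one long sequence of numbers and slicing it layer by layer. The sketch below illustrates that bookkeeping only. `SequentialWeights` and `read_conv_block` are hypothetical names (this is not the `WeightReader` class used above), the file header is ignored, and the per-layer order of beta, gamma, mean, variance and then the kernel is an assumption about the Darknet layout.

```python
class SequentialWeights:
    """Hands out consecutive slices of a flat list of weights."""

    def __init__(self, values):
        self.values = list(values)
        self.offset = 0

    def read(self, count):
        chunk = self.values[self.offset:self.offset + count]
        if len(chunk) != count:
            raise ValueError("not enough weights left in the file")
        self.offset += count
        return chunk


def read_conv_block(reader, n_filters, kernel_size, in_channels):
    """Read the batch-norm statistics and the kernel of one convolutional block."""
    block = {}
    for name in ("beta", "gamma", "mean", "var"):
        block[name] = reader.read(n_filters)  # one value per filter
    block["kernel"] = reader.read(n_filters * in_channels * kernel_size * kernel_size)
    return block


# conv_1 has 32 filters of size 3 x 3 over 3 input channels
reader = SequentialWeights([0.0] * 2000)
conv_1 = read_conv_block(reader, n_filters=32, kernel_size=3, in_channels=3)
values_consumed = reader.offset
```

Because the reader never seeks, the layers must be visited in exactly the order in which they were written, which is why the loading loop walks `conv_1` to `conv_23` in sequence.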

### Randomize weights of the last layer

Randomize weights

layer   = model.layers[-4] # the last convolutional layer (conv_23)
weights = layer.get_weights()

new_kernel = np.random.normal(size=weights[0].shape)/(GRID_H*GRID_W)
new_bias   = np.random.normal(size=weights[1].shape)/(GRID_H*GRID_W)

layer.set_weights([new_kernel, new_bias])


As mentioned earlier, we can decide to train the whole network or use an existing network as a base network to initialize our new network.

In this case, the code keeps the pre-trained weights everywhere except in the final convolutional layer, conv_23, whose kernel and bias are replaced with small random values so that it can learn the new detection outputs. The network structure is kept intact.

### YOLOv2 Loss Function

The Loss Function

def custom_loss(y_true, y_pred):
    # y_true and y_pred have shape (batch, GRID_H, GRID_W, BOX, 4 + 1 + CLASS)
    mask_shape = tf.shape(y_true)[:4]

    cell_x = tf.to_float(tf.reshape(tf.tile(tf.range(GRID_W), [GRID_H]), (1, GRID_H, GRID_W, 1, 1)))
    cell_y = tf.transpose(cell_x, (0,2,1,3,4))

    cell_grid = tf.tile(tf.concat([cell_x,cell_y], -1), [BATCH_SIZE, 1, 1, 5, 1])

    conf_mask = tf.zeros(mask_shape)

    seen = tf.Variable(0.)
    total_recall = tf.Variable(0.)

    """
    Adjust prediction
    """
    ### box centre in grid units
    pred_box_xy = tf.sigmoid(y_pred[..., :2]) + cell_grid

    ### box width and height in grid units
    pred_box_wh = tf.exp(y_pred[..., 2:4]) * np.reshape(ANCHORS, [1,1,1,BOX,2])

    ### objectness confidence
    pred_box_conf = tf.sigmoid(y_pred[..., 4])

    ### class scores (logits)
    pred_box_class = y_pred[..., 5:]

    """
    Adjust ground truth
    """
    true_box_xy = y_true[..., 0:2] # relative position to the containing cell

    true_box_wh = y_true[..., 2:4] # number of cells across, horizontally and vertically

    ### confidence target: IOU between the predicted and the true box
    true_wh_half = true_box_wh / 2.
    true_mins    = true_box_xy - true_wh_half
    true_maxes   = true_box_xy + true_wh_half

    pred_wh_half = pred_box_wh / 2.
    pred_mins    = pred_box_xy - pred_wh_half
    pred_maxes   = pred_box_xy + pred_wh_half

    intersect_mins  = tf.maximum(pred_mins,  true_mins)
    intersect_maxes = tf.minimum(pred_maxes, true_maxes)
    intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    true_areas = true_box_wh[..., 0] * true_box_wh[..., 1]
    pred_areas = pred_box_wh[..., 0] * pred_box_wh[..., 1]

    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores  = tf.truediv(intersect_areas, union_areas)

    true_box_conf = iou_scores * y_true[..., 4]

    true_box_class = tf.argmax(y_true[..., 5:], -1)

    """
    Determine the masks
    """
    ### coordinate mask: simply the position of the ground truth boxes (the predictors)
    coord_mask = tf.expand_dims(y_true[..., 4], axis=-1) * COORD_SCALE

    ### confidence mask: penalize predictors + penalize boxes with low IOU
    # penalize the confidence of the boxes, which have IOU with some ground truth box < 0.6
    true_xy = true_boxes[..., 0:2]
    true_wh = true_boxes[..., 2:4]

    true_wh_half = true_wh / 2.
    true_mins    = true_xy - true_wh_half
    true_maxes   = true_xy + true_wh_half

    pred_xy = tf.expand_dims(pred_box_xy, 4)
    pred_wh = tf.expand_dims(pred_box_wh, 4)

    pred_wh_half = pred_wh / 2.
    pred_mins    = pred_xy - pred_wh_half
    pred_maxes   = pred_xy + pred_wh_half

    intersect_mins  = tf.maximum(pred_mins,  true_mins)
    intersect_maxes = tf.minimum(pred_maxes, true_maxes)
    intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    true_areas = true_wh[..., 0] * true_wh[..., 1]
    pred_areas = pred_wh[..., 0] * pred_wh[..., 1]

    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores  = tf.truediv(intersect_areas, union_areas)

    best_ious = tf.reduce_max(iou_scores, axis=4)
    conf_mask = conf_mask + tf.to_float(best_ious < 0.6) * (1 - y_true[..., 4]) * NO_OBJECT_SCALE

    # penalize the confidence of the boxes, which are responsible for corresponding ground truth box
    conf_mask = conf_mask + y_true[..., 4] * OBJECT_SCALE

    ### class mask: simply the position of the ground truth boxes (the predictors)
    class_mask = y_true[..., 4] * tf.gather(CLASS_WEIGHTS, true_box_class) * CLASS_SCALE

    """
    Warm-up training
    """
    # predictors that are not responsible for any ground truth box
    no_boxes_mask = tf.to_float(coord_mask < COORD_SCALE/2.)
    seen = tf.assign_add(seen, 1.)

    true_box_xy, true_box_wh, coord_mask = tf.cond(tf.less(seen, WARM_UP_BATCHES),
                          lambda: [true_box_xy + (0.5 + cell_grid) * no_boxes_mask,
                                   true_box_wh + tf.ones_like(true_box_wh) * np.reshape(ANCHORS, [1,1,1,BOX,2]) * no_boxes_mask,
                                   tf.ones_like(coord_mask)],
                          lambda: [true_box_xy,
                                   true_box_wh,
                                   coord_mask])

    """
    Finalize the loss
    """
    # number of predictors that contribute to each term
    nb_coord_box = tf.reduce_sum(tf.to_float(coord_mask > 0.0))
    nb_conf_box  = tf.reduce_sum(tf.to_float(conf_mask  > 0.0))
    nb_class_box = tf.reduce_sum(tf.to_float(class_mask > 0.0))

    loss_xy    = tf.reduce_sum(tf.square(true_box_xy-pred_box_xy)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
    loss_wh    = tf.reduce_sum(tf.square(true_box_wh-pred_box_wh)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
    loss_conf  = tf.reduce_sum(tf.square(true_box_conf-pred_box_conf) * conf_mask)  / (nb_conf_box  + 1e-6) / 2.
    loss_class = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_box_class, logits=pred_box_class)
    loss_class = tf.reduce_sum(loss_class * class_mask) / (nb_class_box + 1e-6)

    loss = loss_xy + loss_wh + loss_conf + loss_class

    nb_true_box = tf.reduce_sum(y_true[..., 4])
    nb_pred_box = tf.reduce_sum(tf.to_float(true_box_conf > 0.5) * tf.to_float(pred_box_conf > 0.3))

    """
    Debugging code
    """
    current_recall = nb_pred_box/(nb_true_box + 1e-6)
    total_recall = tf.assign_add(total_recall, current_recall)

    loss = tf.Print(loss, [tf.zeros((1))], message='Dummy Line \t', summarize=1000)
    loss = tf.Print(loss, [loss_xy], message='Loss XY \t', summarize=1000)
    loss = tf.Print(loss, [loss_wh], message='Loss WH \t', summarize=1000)
    loss = tf.Print(loss, [loss_conf], message='Loss Conf \t', summarize=1000)
    loss = tf.Print(loss, [loss_class], message='Loss Class \t', summarize=1000)
    loss = tf.Print(loss, [loss], message='Total Loss \t', summarize=1000)
    loss = tf.Print(loss, [current_recall], message='Current Recall \t', summarize=1000)
    loss = tf.Print(loss, [total_recall/seen], message='Average Recall \t', summarize=1000)

    return loss



In the code for cell_x, we tile the column indices 0 to GRID_W - 1 GRID_H times and reshape the result to (1 x GRID_H x GRID_W x 1 x 1), so every grid cell holds its own column index.

Code for cell_y swaps second and third dimensions.

We then combine cell_x and cell_y to create our cell_grid.

pred_box_xy is a variable storing the centroid of the predicted box (y_pred predicts it relative to the cell, so we add the location of the cell to get the correct location relative to the origin of the image). Also notice the sigmoid function, which makes each prediction a 0~1 fraction of the cell dimensions.

pred_box_wh is a variable storing the predicted box's dimensions. Here we use exp because box dimensions can be much bigger than the dimensions of the cell predicting the box. This is possible because the cell has already seen the whole image through its receptive field.
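To make the two adjustments concrete, here is a small sketch that decodes a single prediction by hand. It follows the same arithmetic as pred_box_xy and pred_box_wh in the loss code; `decode_box` and its argument names are illustrative, and the anchor used in the example is the third pair from the ANCHORS list.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(t_x, t_y, t_w, t_h, cell_col, cell_row, anchor_w, anchor_h):
    """Turn raw network outputs into a box centre and size, all in grid-cell units."""
    center_x = sigmoid(t_x) + cell_col  # sigmoid keeps the centre inside its cell
    center_y = sigmoid(t_y) + cell_row
    width = math.exp(t_w) * anchor_w    # exp rescales the anchor, always positive
    height = math.exp(t_h) * anchor_h
    return center_x, center_y, width, height


# Raw outputs of zero put the centre in the middle of cell (col 6, row 4)
# and leave the anchor dimensions unchanged.
decoded = decode_box(0.0, 0.0, 0.0, 0.0, cell_col=6, cell_row=4,
                     anchor_w=3.33843, anchor_h=5.47434)
```

Multiplying the decoded values by the cell size in pixels (32 for a 416 × 416 input on a 13 × 13 grid) gives the box in image coordinates.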

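The next few variables compute the IoU between each predicted box and its ground-truth box. Multiplied by the objectness flag y_true[..., 4], this IoU becomes true_box_conf, the target that the confidence prediction is trained to match.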
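The min/max arithmetic behind intersect_mins, intersect_maxes and union_areas in the loss code can be reproduced for a single pair of boxes. This is a plain-Python sketch with an illustrative function name; the boxes are given as centre plus size, as in the loss.

```python
def iou_center_format(box_a, box_b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # overlap along each axis, clipped at zero when the boxes do not touch
    inter_w = max(min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2), 0.0)
    inter_h = max(min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2), 0.0)
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union


# Two 2 x 2 boxes shifted by one unit along x overlap in a 1 x 2 strip.
shifted = iou_center_format((1.0, 1.0, 2.0, 2.0), (2.0, 1.0, 2.0, 2.0))
```

Here the intersection is 2 and the union is 4 + 4 - 2 = 6, so the IoU is 1/3.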

Another interesting concept in play here is the use of Warm-Up training.

YOLOv2 is an Object Localization algorithm. It not only tries to predict the object class, it also tries to predict the location of that object in the image.

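If we were to train for everything at once from the start, the network would have too much to learn simultaneously. The warm-up code simplifies the early batches: while fewer than WARM_UP_BATCHES batches have been seen, every predictor without a ground-truth box is given a target at the centre of its cell with the dimensions of its anchor, so the network first learns to output sensible, anchor-shaped boxes before the real localization targets take over.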

In the "Finalize the loss" code section you should also note that sparse_softmax_cross_entropy_with_logits is used only for loss_class, whereas the other terms are masked sums of squared errors computed with reduce_sum.

### Parse Annotations

Annotations

generator_config = {
'IMAGE_H'         : IMAGE_H,
'IMAGE_W'         : IMAGE_W,
'GRID_H'          : GRID_H,
'GRID_W'          : GRID_W,
'BOX'             : BOX,
'LABELS'          : LABELS,
'CLASS'           : len(LABELS),
'ANCHORS'         : ANCHORS,
'BATCH_SIZE'      : BATCH_SIZE,
'TRUE_BOX_BUFFER' : 50,
}

train_imgs, seen_train_labels = parse_annotation(train_annot_folder, train_image_folder, labels=LABELS)
### write parsed annotations to pickle for fast retrieval next time
#with open('train_imgs', 'wb') as fp:
#    pickle.dump(train_imgs, fp)

### read saved pickle of parsed annotations
#with open ('train_imgs', 'rb') as fp:
#    train_imgs = pickle.load(fp)
train_batch = BatchGenerator(train_imgs, generator_config, norm=normalize)

valid_imgs, seen_valid_labels = parse_annotation(valid_annot_folder, valid_image_folder, labels=LABELS)
### write parsed annotations to pickle for fast retrieval next time
#with open('valid_imgs', 'wb') as fp:
#    pickle.dump(valid_imgs, fp)

### read saved pickle of parsed annotations
#with open ('valid_imgs', 'rb') as fp:
#    valid_imgs = pickle.load(fp)
valid_batch = BatchGenerator(valid_imgs, generator_config, norm=normalize, jitter=False)


This is straightforward code that parses the annotations, which are then used for label generation, loss calculation and validation.
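The internals of the batch generator are not shown here, so the following is only a sketch of what label generation typically involves, under stated assumptions: a ground-truth box in pixels is converted to grid units, assigned to the cell containing its centre, and matched to the anchor whose shape overlaps it best. `encode_ground_truth` and `shape_iou` are hypothetical helpers, and the example box is made up.

```python
def shape_iou(w_a, h_a, w_b, h_b):
    """IoU of two boxes that share the same centre, so only their shapes matter."""
    intersection = min(w_a, w_b) * min(h_a, h_b)
    return intersection / (w_a * h_a + w_b * h_b - intersection)


def encode_ground_truth(xmin, ymin, xmax, ymax, image_w, image_h, grid_w, grid_h, anchors):
    """Return (grid column, grid row, best anchor index, box in grid units)."""
    cell_w = image_w / grid_w
    cell_h = image_h / grid_h
    center_x = 0.5 * (xmin + xmax) / cell_w
    center_y = 0.5 * (ymin + ymax) / cell_h
    width = (xmax - xmin) / cell_w
    height = (ymax - ymin) / cell_h

    best_anchor = max(range(len(anchors)),
                      key=lambda idx: shape_iou(width, height, anchors[idx][0], anchors[idx][1]))
    return int(center_x), int(center_y), best_anchor, (center_x, center_y, width, height)


# The ANCHORS list from the configuration, grouped into (width, height) pairs.
anchor_pairs = [(0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434),
                (7.88282, 3.52778), (9.77052, 9.16828)]

col, row, anchor_idx, grid_box = encode_ground_truth(
    100, 120, 300, 380, image_w=416, image_h=416, grid_w=13, grid_h=13, anchors=anchor_pairs)
```

With a 416 × 416 image each cell covers 32 pixels, so this box is centred at (6.25, 7.8125) in grid units and is about 6.25 × 8.125 cells large, which is closest in shape to the largest anchor.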

### Callbacks and Training

Setup a few callbacks and start the training

early_stop = EarlyStopping(monitor='val_loss',
min_delta=0.001,
patience=3,
mode='min',
verbose=1)

checkpoint = ModelCheckpoint('weights_coco.h5',
monitor='val_loss',
verbose=1,
save_best_only=True,
mode='min',
period=1)
tb_counter  = len([log for log in os.listdir(os.path.expanduser('~/logs/')) if 'coco_' in log]) + 1
tensorboard = TensorBoard(log_dir=os.path.expanduser('~/logs/') + 'coco_' + '_' + str(tb_counter),
histogram_freq=0,
write_graph=True,
write_images=False)

optimizer = Adam(lr=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
#optimizer = SGD(lr=1e-4, decay=0.0005, momentum=0.9)
#optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08, decay=0.0)

model.compile(loss=custom_loss, optimizer=optimizer)

model.fit_generator(generator        = train_batch,
steps_per_epoch  = len(train_batch),
epochs           = 100,
verbose          = 1,
validation_data  = valid_batch,
validation_steps = len(valid_batch),
callbacks        = [early_stop, checkpoint, tensorboard],
max_queue_size   = 3)


Callback code consists of some utilities for logging and other purposes.

We also define the optimizer (Adam) here.

We then run training for up to 100 epochs (early stopping may end it sooner). Expect this to take 14-25 days on a Titan X or similar GPU!
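The effect of min_delta and patience in the early stopping callback is easy to misread, so here is a simplified model of the rule in plain Python. It is a sketch of the behaviour for mode='min', not the Keras implementation, and the validation losses in the example are made up.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # a real improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Epochs 3 and 4 improve by less than min_delta, epoch 5 gets worse:
# three epochs without a real improvement, so training stops at epoch 5.
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7999, 0.7995, 0.81, 0.6])
```

Note that the promising loss of 0.6 in the sixth epoch is never reached; a larger patience trades longer training for a lower chance of stopping on a temporary plateau.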

### Inference

Perform detection on images

model.load_weights("weights_coco.h5")

dummy_array = np.zeros((1,1,1,1,TRUE_BOX_BUFFER,4))

plt.figure(figsize=(10,10))

image = cv2.imread(image_path)  # image_path is a placeholder: set it to the image you want to test
input_image = cv2.resize(image, (416, 416))
input_image = input_image / 255.
input_image = input_image[:,:,::-1]
input_image = np.expand_dims(input_image, 0)

netout = model.predict([input_image, dummy_array])

boxes = decode_netout(netout[0],
obj_threshold=OBJ_THRESHOLD,
nms_threshold=NMS_THRESHOLD,
anchors=ANCHORS,
nb_class=CLASS)
image = draw_boxes(image, boxes, labels=LABELS)

plt.imshow(image[:,:,::-1]); plt.show()


The inference code finally lets us use the trained model to detect objects and judge how well it performs. If you do not want to train and would rather run this section directly, make sure you skip the training above (use the fully trained network and neither train it nor randomize the layer weights).

Below we show the detection being performed on a few sample images.
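The obj_threshold and nms_threshold arguments passed to decode_netout control which boxes survive. The internals of decode_netout are not shown here, so the following is only a sketch of the general technique: drop low-confidence boxes, then apply greedy non-maximum suppression. It is class-agnostic for brevity (detectors usually run it per class), and the function names, boxes and thresholds are made up for illustration.

```python
def iou_corners(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]), 0.0)
    inter_h = max(min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]), 0.0)
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def non_max_suppression(detections, obj_threshold, nms_threshold):
    """detections is a list of (score, box); returns the kept ones, best score first."""
    candidates = sorted((d for d in detections if d[0] >= obj_threshold),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in candidates:
        # keep a box only if it does not overlap too much with a better one
        if all(iou_corners(box, kept_box) <= nms_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (10, 10, 110, 110)),
    (0.8, (20, 20, 120, 120)),    # heavy overlap with the first box
    (0.7, (200, 200, 300, 300)),
    (0.1, (0, 0, 50, 50)),        # below the confidence threshold
]
kept = non_max_suppression(detections, obj_threshold=0.3, nms_threshold=0.3)
```

Raising nms_threshold keeps more overlapping boxes, which helps with crowded scenes but produces more duplicate detections.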

It also works on artworks!

YOLOv2 is awesome, but we must look at SSD as well, since YOLOv2 took inspiration from, and made modifications to, another awesome algorithm called SSD.

# SSD

SSD vs YOLOv1

Source: SSD Paper. Released in Dec 2015, SSD has turned into the de facto detection pipeline of many modern DNN object detectors, including YOLOv2. SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape.

Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.

SSD differs from YOLOv2 in that it predicts from multiple layers, which gives finer accuracy on objects of different scales.

Like YOLOv2, SSD starts from a pre-trained backbone (VGG or ResNet) and then adds some extra conv layers and its own loss function.

One important point to notice is that after the image is passed through the VGG network, some conv layers are added producing feature maps of sizes 19x19, 10x10, 5x5, 3x3, 1x1. These, together with the 38x38 feature map produced by VGG’s conv4_3, are the feature maps which will be used to predict bounding boxes. The conv4_3 layer is responsible for detecting the smallest objects while conv11_2 is responsible for the biggest objects.

YOLOv2 uses anchor boxes, a concept inspired by SSD. As we learnt in YOLOv2, here too the model is trained to make predictions for each anchor. SSD makes two (in YOLOv2 we make 3 different types of predictions; what is the third one?):

1. A discrete class prediction for each anchor.
2. 4 coordinates ($t_x, t_y, t_w,$ and $t_h$): the offset by which the anchor needs to be shifted to fit the ground-truth bounding box.

Anchor boxes

Consider the image titled "Anchor boxes" on the right: observe that the cat has 2 boxes that match on the 8x8 feature map, but the dog has none. On the 4x4 feature map, however, there is one box that matches the dog.

It is important to note that the boxes in the 8x8 feature map are smaller than those in the 4x4 feature map: SSD uses several feature maps, each responsible for a different scale of objects, allowing it to identify objects across a large range of scales.

## SSD Paper Notes

Please make sure you have covered Advance Concepts. If you haven't, then we recommend you do that right away!

SSD Object Detection

SSD

Different aspect ratios are needed for detection

SSD predicts multiple boxes with various aspect ratio

Width and height calculations Predictions

• SSD is the first deep network based object detector that does not re-sample pixels or features for bounding box hypothesis and is as accurate as others.
• This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP of 74.3 % on VOC2007 test).
• This improvement comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage.
• SSD uses small convolution filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.
• By using multiple layers for prediction at different scales, SSD achieves high accuracy with relatively low-resolution input, further increasing detection speed.
• SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, SSD evaluates a small set of default boxes at different aspect ratios at each location in several feature maps with different scales.
• For each default box, we predict both the shape offsets and the confidence for all object categories.
• At training time, SSD first matches these default boxes to the ground truth boxes. The model loss is a weighted sum of the localization loss (Smooth L1) and the confidence loss (Softmax).
• Although the L2 norm is more precise and better at minimizing prediction errors, the L1 norm produces sparser solutions, ignores fine details more easily and is less sensitive to outliers. Sparser solutions are good for feature selection in high dimensional spaces, as well as for prediction speed.
• SSD adds convolution feature layers to the end of the truncated base network. These layers decrease the size progressively and allow predictions of detections at multiple scales.
• For each box out of $k$ at a given location, SSD computes $c$ class scores and the 4 offsets relative to the original bounding box shape. This results in a total of $(c + 4)k$ filters that are applied around each location in the feature map, yielding $(c + 4)kmn$ outputs for an $m × n$ feature map.
• During training, SSD has to decide which default boxes correspond to each ground truth box. It first matches each ground truth box to the default box with the best IoU overlap, and then also matches default boxes to any ground truth box with an IoU above a threshold of 0.5. This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.
• Instead of scaling the object to different sizes while training, SSD uses feature maps from several different layers for prediction, mimicking the same image scaling effect.
• SSD designs the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of object.
• SSD uses conv4_3, conv7 (fc7), conv8_2, conv9_2, conv10_2 and conv11_2 of VGG16 network to predict both location and confidences.
• For conv4_3, default box is set with a scale of 0.1.
• For conv4_3, conv10_2 and conv11_2, only 4 anchor boxes are associated, omitting the ones with aspect ratios of $\frac{1}{3}$ and $3$. For all other layers, SSD uses all 6 anchor boxes.
• By default, SSD predicts objects for a total of 8732 anchor boxes located on different feature maps and at different scales.
• Because SSD generates such a large number of anchor boxes, it is essential to perform non-maximum suppression. First a confidence threshold of 0.01 is used to filter out most boxes. Then SSD applies an IoU threshold of 0.45 per class and keeps the top 200 detections per image.
• Given the 8732 output boxes, SSD sorts them by class confidence and picks the top 200 boxes. Each of these boxes is a 7-dimensional vector (batch_idx, class confidence, label, and 4 coordinate values). This step is post-processing of SSD's original output of 8732 boxes.

### Additional Notes

• We see in the code that the actual scales are 0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05 for VOC and 0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05 for the COCO dataset.
• Each feature map is roughly working at these resolutions: 8, 16, 32, 64, 100, 300
• The final prediction actually has shape (batch_size, #boxes, #classes + 12), not #classes + 4 (the box coordinates). The last eight entries of the last axis are not used by this function and therefore their contents are irrelevant; they only exist so that y_true has the same shape as y_pred, where the last four entries of the last axis contain the anchor box coordinates, which are needed during inference.
• 8732 = 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4
• The final prediction would look something like this:

| class | conf | xmin | ymin | xmax | ymax |
|---|---|---|---|---|---|
| 2. | 0.91 | 11.39 | 81.83 | 282.1 | 283.14 |
| 15. | 1. | 124.41 | 2.4 | 215.08 | 159.87 |

## SSD Prior Boxes

SSD Detections

Scales

Unlike in YOLOv2, where we had 3 combined losses (figured out the third one yet?), in SSD we combine only the classification and location-regression losses. But first let's talk about anchor box scaling.
### Box Scaling

SSD tiles the default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use $m$ feature maps for prediction. The scale of the default boxes in each feature map is computed as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where $s_{min}$ is 0.2 and $s_{max}$ is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced.

### Aspect Ratios

SSD imposes different aspect ratios for the default boxes, and denotes them as:

$$a_r \in \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$$

For the aspect ratio of 1, SSD also adds a default box whose scale is:

$$s^\prime_k = \sqrt{s_k s_{k+1}}$$

This results in a total of 6 anchor boxes per feature map location. The center of each anchor box is set to:

$$\Big( \frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|} \Big)$$

where $|f_k|$ is the size of the $k$-th square feature map.

By combining predictions for all anchor boxes with different scales and aspect ratios from all locations of many feature maps, SSD has a diverse set of predictions, covering various input object sizes and shapes.
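The scale formula, the aspect ratios and the 8732 total quoted in the notes can all be checked numerically. The sketch below uses illustrative function names; the width and height rule ($w = s_k\sqrt{a_r}$, $h = s_k/\sqrt{a_r}$) comes from the SSD paper rather than from the text above, and the per-layer box counts are the ones listed in the notes.

```python
import math


def ssd_scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k of the default boxes for each of the m feature maps."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def default_box_shapes(s_k, s_next, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    """(width, height) of every default box at one location, relative to the image size."""
    shapes = [(s_k * math.sqrt(a_r), s_k / math.sqrt(a_r)) for a_r in aspect_ratios]
    extra = math.sqrt(s_k * s_next)  # the additional box for aspect ratio 1
    shapes.append((extra, extra))
    return shapes


scales = ssd_scales(6)
shapes_at_first_map = default_box_shapes(scales[0], scales[1])

# Feature map sizes and boxes per location for conv4_3, conv7, conv8_2 ... conv11_2
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_location = [4, 6, 6, 6, 4, 4]
total_boxes = sum(size * size * count
                  for size, count in zip(feature_map_sizes, boxes_per_location))
```

With $m = 6$ the scales come out evenly spaced between 0.2 and 0.9, and every default box keeps an area of roughly $s_k^2$ regardless of its aspect ratio, since the width and height are scaled by reciprocal factors.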
## SSD Loss function

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where $N$ is the number of matched default boxes.

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^N \sum_{m \in \{cx, cy, w, h\}} x_{ij}^k \, smooth_{L1}(l_i^m - \hat{g}_j^m)$$

$$\hat{g}_j^{cx} = \frac{(g_j^{cx} - d_i^{cx})}{d_i^w} \qquad \hat{g}_j^{cy} = \frac{(g_j^{cy} - d_i^{cy})}{d_i^h}$$

$$\hat{g}_j^w = \log \left( \frac{g_j^w}{d_i^w} \right) \qquad \hat{g}_j^h = \log \left( \frac{g_j^h}{d_i^h} \right)$$

$$L_{conf}(x, c) = - \sum_{i \in Pos}^N x_{ij}^p \log (\hat{c}_i^{p}) - \sum_{i \in Neg} \log (\hat{c}_i^{0}) \quad \text{where} \quad \hat{c}_i^{p} = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)}$$

$$smooth_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

The expression for the loss, which measures how far off our prediction “landed”, is:

multibox_loss = confidence_loss + alpha * location_loss

The confidence_loss computed by $L_{conf}(x, c)$ is a simple softmax loss function between the actual label and the predicted label. The $\alpha$ term helps us in balancing the contribution of the location loss.

You see two terms: $i \in Pos$ and $i \in Neg$. As we have done in YOLOv2, we not only want to detect positive predictions, we want to reduce negative predictions as well.

In SSD we regress to offsets for the center $(cx, cy)$ of the default bounding box $(d)$ and for its width $(w)$ and height $(h)$.

In the location_loss, however, we only consider the positive prediction boxes.

Although the L2 norm is more precise, the L1 norm is good for feature selection in high dimensional spaces, as well as for prediction speed.
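The localization term is small enough to compute by hand for one matched pair. This sketch follows the $smooth_{L1}$ and $\hat{g}$ definitions given in the loss section for a single positive match; the function names and the example boxes are illustrative, with boxes written as (cx, cy, w, h).

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def encode_offsets(ground_truth, default_box):
    """Regression targets (g_hat) of a ground truth box relative to a default box."""
    g_cx, g_cy, g_w, g_h = ground_truth
    d_cx, d_cy, d_w, d_h = default_box
    return ((g_cx - d_cx) / d_w,
            (g_cy - d_cy) / d_h,
            math.log(g_w / d_w),
            math.log(g_h / d_h))


def localization_loss(predicted_offsets, ground_truth, default_box):
    """Smooth L1 loss summed over cx, cy, w and h for one positive match."""
    targets = encode_offsets(ground_truth, default_box)
    return sum(smooth_l1(p - t) for p, t in zip(predicted_offsets, targets))


default_box = (0.5, 0.5, 0.4, 0.2)
ground_truth = (0.6, 0.5, 0.4, 0.2)   # same shape, shifted right by 0.1
targets = encode_offsets(ground_truth, default_box)
loss_if_predicting_zero = localization_loss((0.0, 0.0, 0.0, 0.0), ground_truth, default_box)
```

Dividing the centre offsets by the default box size and taking the log of the size ratios keeps the targets in a similar numeric range for small and large boxes, which is what makes a single regression loss workable across all feature maps.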

### Code

Let's look at the code here: LINK

And you can find one of the best descriptions of SSD on this link.