


    Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome.

    Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI.

    Here, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll not only learn the theoretical underpinnings of learning but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems.

    As a community, you'll build datasets and open-source projects and help others learn with you. Currently, we are focused on building a robust dataset for number plate recognition. At every hands-on meetup event, participants submit 25 number plate images. Check out our Projects and Datasets sections to learn more.

    Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI.



    Install Keras

    # Hoping you have installed TensorFlow already!
    pip install keras

    Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

    Use Keras if you need a deep learning library that:

    1. Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
    2. Supports both convolutional networks and recurrent networks, as well as combinations of the two.
    3. Runs seamlessly on CPU and GPU.

    Read the documentation at keras.io.

    Why use Keras?

    Your 1st DNN

    Code First!

    Install Keras - required every time on Google Colab

    !pip install -q keras
    import keras

    At MLBLR we are focusing on real hands-on experience. Let's immediately jump to the code logic and see how a DNN is built from scratch.

    There will be terms here which you might not totally understand at first, but don't worry; we'll go through them once you finish your first DNN.



    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Activation, Flatten, Add
    from keras.layers import Convolution2D, MaxPooling2D
    from keras.utils import np_utils
    from keras.datasets import mnist

    Like in all programs, we need to do a few imports to access the functions we'll need for our program.

    Though Keras is supported on Google Colab, we need to install it every time.

    First we import Sequential. The Sequential model helps us create a linear stack of layers. Dense, Dropout, Activation, Flatten, Add, Convolution2D, and MaxPooling2D are the layers we'll need, so we import them.

    We also need few in-built utilities in Keras, so we import keras.utils.

    We also need access to a dataset. Keras has built-in access to a few datasets. Here we import mnist from keras.datasets.

    Load Dataset

    Load MNIST dataset

    (X_train, y_train), (X_test, y_test) = mnist.load_data()

    Load pre-shuffled MNIST data

    mnist.load_data() fetches the pre-shuffled MNIST dataset from the internet and splits it into (X_train, y_train) and (X_test, y_test). We need train data to train our model, and test data to test how well it is performing.

    Shape of the image

    print(X_train.shape)
    # (60000, 28, 28)
    from matplotlib import pyplot as plt
    %matplotlib inline
    plt.imshow(X_train[0])


    Sample image

    We need to know the shape (dimensions) of the image we are dealing with. The .shape attribute spits out the shape of our sample data.

    It is very important to keep track of the dimensions right from the input stage to the output. High-level neural APIs like Keras take care of all intermediate dimensional changes (for example, in Caffe, you'd need to calculate all dimensions manually).

    %matplotlib inline allows us to view our data inside the notebook.

    plt.imshow(X_train[0]) allows us to view the sample image.

    Notice a few things:

    Reshape the input

    Reshape the input

    X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
    X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)

    We need to reshape the input because the newer Keras API expects us to mention the total number of color channels as well (1 in our case, as we have grayscale images).

    Reshaping data is going to be one of the most used and confusing elements in ML, so make sure you pay full attention to reshaping methods. Keras takes care of reshaping for you for hidden layers, but if you were to write something new, say a new loss function, then you'd have to take care of it yourself!


    Converting from uint8 to float32 & normalizing

    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255

    Conversion to Float32 & Normalization

    All images are stored as uint8; however, we work with floats in neural networks. So the first thing we do is convert our images from uint8 to float32. Then we rescale our images from the 0-255 range to 0-1. This is called normalization. We usually do this because:



    # Convert 1-dimensional class arrays to 10-dimensional class matrices
    Y_train = np_utils.to_categorical(y_train, 10)
    Y_test = np_utils.to_categorical(y_test, 10)

    If you print y_train[:10], you'd get array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=uint8). This means that to predict the first image, the network must output "5" as the value. This is tricky: for 9, one needs to output "9", and for 5, "5". How would a loss function behave in such a case? Moreover, what about predicting A, B...Z?

    An easier method is to output a one-hot vector for each digit. For example, (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) can represent 5, where the 6th variable is 1. (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) can represent 9. Similarly, (1, 0, 0, 0, 0, 0, 0, 0, 0, ...) can represent A.

    We achieve this by using the np_utils.to_categorical function.
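    Under the hood, to_categorical is just building these one-hot rows. Here is a minimal pure-Python sketch of the same idea (the helper name one_hot is ours, not Keras's):

```python
def one_hot(labels, num_classes):
    """Convert integer class labels into one-hot vectors, like np_utils.to_categorical."""
    vectors = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # flip on the position matching the label
        vectors.append(row)
    return vectors

# 5 becomes a vector with a 1 in the 6th position, exactly as described above
print(one_hot([5, 0, 4], 10)[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```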

    DNN Architecture

    The Layers

    model = Sequential()
    model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))  # -> 26x26x32
    model.add(Convolution2D(10, (1, 1), activation='relu'))  # -> 26x26x10
    model.add(Convolution2D(10, (26, 26)))  # -> 1x1x10
    model.add(Flatten())
    model.add(Activation('softmax'))


    Here is what we did:




    Next we compile our model and configure its learning process. We inform Keras that we want to use cross-entropy as our loss and optimize our gradient descent process using the Adam algorithm.

    We have several other loss functions as well, like mean_squared_error, mean_absolute_error, squared_hinge, sparse_categorical_crossentropy, poisson, etc. You can learn more about these, and more, loss functions in the Keras Documentation. Similarly, we have a lot of other gradient descent optimization functions, like Adam, Adamax, SGD, RMSprop, Adadelta, etc.


    Compile and fit the model

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, Y_train, batch_size=32, epochs=10, verbose=1)

    Finally, to begin the training process, we fit the model with our training variables, inform Keras about how many images to look at simultaneously (batch_size=32) and how long to train the model (epochs=10), and ask it to provide us with a training log (verbose=1).

    Your output would look something like this:

    Epoch 1/10 - 60000/60000 [====] - 20s 329us/step - loss: 0.3135 - acc: 0.9059


    Epoch 10/10 - 7008/60000 [====] - 19s 316us/step - loss: 0.0774 - acc: 0.9756



    score = model.evaluate(X_test, Y_test, verbose=0)
    # manual test
    y_pred = model.predict(X_test)

    After only 10 epochs, we have achieved 97.5% accuracy. But this is on the training dataset. We need to test our accuracy on the test dataset, which our model has not seen yet!

    We do this by calling model.evaluate function. The result which gets printed is: [0.05154316661804915, 0.9841]

    If you want to see manually what is happening here, print y_pred[:10], the predicted values for the first 10 numbers in the test dataset. You'd see that in each row the highest value sits at the index of the number mentioned in y_test[:10].

    Congrats! You have built your first DNN with an accuracy of 98.41%.

    Quick Recap

    imgaug ImgAug - one of the best libraries for Image Augmentation

    xavier Xavier or Glorot Initialization

    angular-softmax Angular Softmax!

    dropout Dropout

    amazin-bn Amazing way to understand Batch Normalization

    gradient-descent Various Gradient Descent Algorithms

    max-pooling Max Pooling


    In your first DNN you have made a few decisions, which you'll have to make every time you write something new.

    We can boil it down to these (including future requirements) components:

    You'll actually go deeper once you have a base model ready, tweaking it for maximum accuracy. You may also drop these tweaks in favor of speed/performance. It might seem a lot to look at while writing a new neural network, but once you start writing a few, you'll realize that certain combinations work well with each other.

    Deep Concepts in DNNs




    Disclaimer: Most of the content is taken from online sources. This particular MIT paper, by Waseem Rawat and Zenghui Wang, has been the biggest source of them all. I apologize in advance if I missed giving credit back to the original author. I will make a continuous attempt to point you back to the original authors as much as possible.

    A good way to observe how far ML has come is to look at the ILSVRC results:

    Year Team Layers Contribution Position
    2010 NEC Shallow Fast Feature Extraction, Compression, SVM First
    2011 XRCE Shallow High dimensional image, SVM First
    2012 SuperVision 8 GPU based DCNN, Dropout First
    2013 Clarifai 8 Deconvolution visualization First
    2014 GoogLeNet 22 DCNN, 1x1 conv First
    2014 VGG 19 All 3x3 convs Second
    2015 MSRA 152 Ultra Deep, Residuals First
    2016 CUImage 269 Ensemble, Gated Bi-Directional CNN First
    2017 Momenta 152 Squeeze & excitation, feature recalibration First
    2018* Facebook 264 Direct connection between any two layers SOA*

    As we can see, networks are getting deeper and more sophisticated. Every year there is an addition of new technology which makes everything before it obsolete.

    Below we cover some of the best additions to machine learning algorithms, something which we all need to know to build beautiful networks.

    Convolutional Layers


    Convolution without zero-padding and with stride of 1

    Convolution layers serve as feature extractors, and thus they learn the feature representations of their input images.

    The neurons in the convolutional layers are arranged into feature maps.

    Each neuron in a feature map has a receptive field, which is connected to a neighborhood of neurons in the previous layers via a set of trainable weights.

    Inputs are convolved with the learned weights in order to compute a new feature map, and convolved results are sent through a nonlinear activation function.

    In simple terms, imagine a kernel sliding across the whole image. If you have 32 kernels/filters in a layer, the layer will output 32 new convolved images.
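    The "kernel sliding across the image" idea can be written out in a few lines of plain Python. This is a sketch for a single channel, with no padding and stride 1 (function name ours; like most DL libraries, it actually computes cross-correlation):

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; at each position take a weighted sum."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A 28x28 input with a 3x3 kernel gives a 26x26 feature map,
# just like the first layer of the MNIST model earlier.
image = [[1.0] * 28 for _ in range(28)]
kernel = [[1.0 / 9] * 3 for _ in range(3)]
feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 26 26
```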

    Pooling Layer


    Max Pooling Layer

    The purpose of the pooling layer is to reduce the spatial resolution of the feature maps and thus achieve spatial invariance to input distortions and translations.

    Initially it was common practice to use average pooling aggregation layers to propagate the average of all the input values.

    However, in more recent models, max pooling aggregation layers propagate the maximum value within a receptive field of the next layer.

    In 2007, backpropagation was applied for the first time to a DCNN-like architecture that used max pooling. It was shown empirically that the max pooling operation was vastly superior for capturing invariance in image-like data and could lead to improved generalization and faster convergence, and it also alleviated the need for a rectification layer.

    Lee, Grosse, Ranganath, and Ng introduced and applied probabilistic max pooling to convolutional DBNs, which resulted in a translation-invariant hierarchical generative model.
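    A 2x2 max-pooling pass over a single feature map can be sketched in plain Python (function name ours; non-overlapping windows, i.e. stride 2):

```python
def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 window with its maximum value."""
    out = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            window = [feature_map[i][j], feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(max(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [0, 0, 7, 8]]
print(max_pool_2x2(fmap))  # [[4, 2], [0, 8]] -- resolution halved in each direction
```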

    FC Layers


    Fully connected Layers

    Several convolution and pooling layers are usually stacked on top of each other to extract progressively more abstract feature representations as we move through the network.

    Neurons in a fully connected layer have full connections to all activations in the previous layer, as in regular neural networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.

    The output from the convolutional layers represents high-level features in the data. While that output could be flattened and connected to the output layer, adding a fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features.

    Essentially the convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space and the fully-connected layer is learning a (possibly non-linear) function in that space.

    Fully connected layers are computationally expensive and hold the maximum number of weights. Newer approaches, like using $1\times1$ kernels, are trying to replace them.




    ReLU or Rectified Linear Unit allows much faster training times.

    It is a piecewise linear function, with form

    $f(x) = max(x, 0)$.

    It retains only the positive part of the activation by reducing the negative part to zero, while the integrated maximum operator promotes faster computation, and it does not suffer from the vanishing gradient problem (in which lower layers have gradients near zero because higher layers are almost saturated).




    Even though ReLUs are awesome, they are at a possible disadvantage during optimization, since the gradient is zero when the unit is not active. This may lead to cases where units never get activated, since popular gradient descent optimization algorithms fine-tune only the weights of units previously activated. Thus, ReLUs suffer from slow convergence when training networks with constant zero gradients. To compensate for this, Maas et al. (2013) introduced leaky rectified linear units $(LReLU)$, which allow for small nonzero gradients when the unit is not active. Mathematically:

    $f(x) = max(x, 0) + \lambda min(x, 0)$

    where $\lambda$ is a predefined parameter within the range $(0, 1)$. They perform slightly better than ReLU.
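    Both formulas are one-liners in code. A plain-Python sketch of ReLU and LReLU for a single activation (lam plays the role of $\lambda$; 0.01 is a common choice, not mandated by the text):

```python
def relu(x):
    """f(x) = max(x, 0): keep positives, zero out negatives."""
    return max(x, 0.0)

def leaky_relu(x, lam=0.01):
    """f(x) = max(x, 0) + lam * min(x, 0): small nonzero slope for negatives."""
    return max(x, 0.0) + lam * min(x, 0.0)

print(relu(-3.0), relu(2.0))              # 0.0 2.0
print(leaky_relu(-3.0), leaky_relu(2.0))  # -0.03 2.0
```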




    While LReLUs (Maas et al., 2013) rely on a predefined parameter to compress the negative part of the activation signal, He et al. (2015a) proposed the parametric rectified linear unit (PReLU) to adaptively learn the parameters of the activation units during backpropagation. Mathematically, the PReLU is the same as the LReLU, except that $λ$ is replaced with the learnable $λ_k$, which is allowed to vary for different input channels, denoted by $k$. Thus, the PReLU can be expressed as:

    $f(x_k ) = max(x_k, 0) + λ_k min(x_k, 0)$.

    PReLU gave a $1\%$ performance increase on the ILSVRC dataset, and it generally performs better than other rectified units, such as ReLU and LReLU.




    While ReLU and PReLU are both nonsaturating and thus lessen the vanishing gradient problem, only ReLU ensures a noise-robust deactivation state; however, ReLUs are nonnegative and thus have a mean activation larger than zero. To deal with this, Clevert et al. (2016) proposed the exponential linear unit (ELU), which has negative values to allow for mean activations near zero, but which also saturates to a negative value for smaller arguments. Since the saturation decreases the variation of the units when deactivated, the precise deactivation argument becomes less relevant, thereby making ELUs robust against noise. Formally:

    $f(x) = max(x, 0) + min(λ(e^{x} - 1), 0)$

    where $λ$ is a predetermined parameter that controls the amount an ELU will saturate for negative inputs. ELUs sped up DCNN learning, led to higher classification accuracy, and obtained encouraging convergence speed.


    SELU plotted for α=1.6732~, λ=1.0507~


    The scaled exponential linear unit (SELU) is a kind of ELU but with a little twist. Mathematically it is:

    $selu(x) = λ( max(x, 0) + min(αe^x - α, 0 ))$

    Here all the parameters are fixed. They make feed-forward networks self-normalizing. For this, the activation function is required to have: negative and positive values for controlling the mean; saturation regions (derivatives approaching 0) to dampen the variance if it is too large in the lower layer; a slope larger than one to increase the variance if it is too small in the lower layer; and a continuous curve. For standard scaled inputs (mean 0, stddev 1), the values are α=1.6732~, λ=1.0507~ (these are mathematically derived). If you use SELU, you need to initialize the weights with zero mean and use a standard deviation of the square root of 1/(size of input). SELU converges better and gets better accuracy.
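    With the fixed constants plugged in, SELU is straightforward to write down. A plain-Python sketch (constants are the full-precision versions of the α and λ quoted above):

```python
import math

ALPHA = 1.6732632423543772   # the alpha quoted above, to full precision
LAMBDA = 1.0507009873554805  # the lambda quoted above, to full precision

def selu(x):
    """selu(x) = lambda * (max(x, 0) + min(alpha * e^x - alpha, 0))"""
    return LAMBDA * (max(x, 0.0) + min(ALPHA * math.exp(x) - ALPHA, 0.0))

print(selu(1.0))    # ~1.0507: positive inputs are simply scaled by lambda
print(selu(0.0))    # 0.0
print(selu(-10.0))  # ~-1.758: saturates toward -lambda * alpha for large negatives
```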

    Too many X-ELUs?? Best practice is to use any of the above with dropout and then experiment! :D

    Or do not use Dropout and perform Image Augmentation instead! Read this paper.

    1x1 Convolutions


    1x1 Convolutions

    A breakthrough in 2014 by Google. 1x1 convolutions perform two functions. They serve as dimension-reduction blocks prior to the more computationally costly 3x3 and 5x5 convolutions, and they include the use of rectified linear activations. 1x1 convolutions are used to increase the depth and width of the network while only marginally increasing the computational cost.

    Let us assume that a layer has an output of 256x256x64. You can imagine this as 64 feature maps of 256x256 each. A max pooling layer (2x2) would change this to 128x128x64, whereas a layer with 10 1x1 kernels would reduce it to 256x256x10.

    A layer with 100 1x1 kernels would create 100 feature maps (256x256x100). 1x1 kernels give us an extremely cheap way to increase or reduce the depth of the features.
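    The compute savings are easy to verify by counting weights for the example above (the helper name is ours; one bias per kernel assumed):

```python
def conv_params(in_channels, kernel_size, num_kernels):
    """Weights in a conv layer: k*k*in_channels per kernel, plus one bias each."""
    return (kernel_size * kernel_size * in_channels + 1) * num_kernels

# Reducing 256x256x64 down to 256x256x10 with 1x1 kernels:
print(conv_params(64, 1, 10))  # 650 weights
# The same depth change done with 3x3 kernels would cost roughly 9x more:
print(conv_params(64, 3, 10))  # 5770 weights
```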

    Dilated Convolutions


    Dilated Convolution

    Dilated convolutions are convolutions applied with defined gaps. They are a way of increasing the receptive field of the network exponentially while the number of parameters grows only linearly. They find their usage in applications which care about integrating knowledge of a wider context at less cost, like image segmentation, where each pixel is labeled with its corresponding class; in this case, the network output needs to be the same size as the input image. Dilated convolutions avoid the need for upsampling.

    They are applied to audio as well (WaveNet) to capture a global view of the input with fewer parameters. In short, it is a simple but effective idea which you might consider when you want:

    1. Detection of fine details by processing inputs in higher resolution, or
    2. Broader view of the input to capture more contextual information, or
    3. Faster run-time with less parameters.
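    The exponential receptive-field growth can be checked with a little arithmetic: with stride 1, each k×k layer with dilation d adds (k-1)·d pixels to the receptive field. A sketch (helper name ours):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated conv layers."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field by (k-1)*d
    return rf

# Three 3x3 layers with dilations doubling (1, 2, 4): exponential growth
print(receptive_field(3, [1, 2, 4]))  # 15
# Three plain 3x3 layers for comparison: only linear growth
print(receptive_field(3, [1, 1, 1]))  # 7
```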

    Biologically inspired $L_p$ pooling (modelled on complex cells) takes, in a given pooling region $R_j$, the weighted average of the activations $a_i$, as:

    $s_j = \Big(\sum_{i \in R_j} a_i^{p}\Big)^{1/p}$

    Notably, when $p = 1$ the equation corresponds to average pooling, while $p = \infty$ translates to max pooling. $L_p$ pooling can be seen as a trade-off between average and max pooling. It has shown exceptional image classification results in the past.
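    A sketch of $L_p$ pooling over a single region (function name ours; we normalize by the region size so that p=1 reduces exactly to average pooling):

```python
def lp_pool(region, p):
    """(mean of a_i^p)^(1/p): average pooling at p=1, approaches max as p grows."""
    n = len(region)
    return (sum(a ** p for a in region) / n) ** (1.0 / p)

region = [1.0, 2.0, 3.0, 4.0]
print(lp_pool(region, 1))    # 2.5, i.e. average pooling
print(lp_pool(region, 100))  # ~3.95, approaching max(region) = 4
```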

    Spatial Pyramid Pooling


    Spatial Pyramid Pooling

    DCNNs are restricted in that they can only handle a fixed input size (e.g., $96 \times 96$). In order to make them more flexible and thus able to handle images of different sizes, scales, and aspect ratios, He, Zhang, Ren, and Sun (2014) proposed SPP. They used multilevel spatial bins, which have sizes proportional to the image size, and thus allowed them to generate a fixed-length representation. The SPP layer is integrated into a DCNN between the final convolution/pooling layer and the first fully connected layer, and thus performs information aggregation deep in the network to remove the fixed-size constraint.

    Spatial pyramid matching is, conceptually, a method of building a more abstract representation of images that preserves some spatial information by spatially dividing images in some special way. Such a higher-order representation introduces some invariances (under translation, for instance), but it does so at the cost of throwing away information. For instance, you may divide the image into 4x4 regions and then aggregate statistics (or pool) within each subregion. You can repeat the same process using a 2x2 division and a 1x1 division (here, you pool over the whole image), forming a pyramid.



    SVM and SoftMax

    Softmax is the most widely used activation in the last fully connected layer of DCNNs, owing to its simplicity and probabilistic interpretation. When this activation function is combined with the cross-entropy loss, they form the extensively used softmax loss.
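    Softmax turns the raw scores of the last layer into a probability distribution. A plain-Python sketch (with the usual max-subtraction trick for numerical stability):

```python
import math

def softmax(scores):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # 1.0
```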


    Triplet Loss


    Triplet Loss

    The Triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

    This is most useful when we are directly trying to learn embeddings (face recognition). It consists of two matching references and a non-matching reference, and the loss aims to separate the positive pair from the negative by a distance margin.
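    With squared Euclidean distance as the metric, the triplet loss is just a hinge on the two distances. A sketch (function names and the margin value 0.2 are our assumptions):

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

anchor, pos, neg = [0.0, 0.0], [0.1, 0.0], [1.0, 1.0]
print(triplet_loss(anchor, pos, neg))  # 0.0 -- already separated by the margin
print(triplet_loss(anchor, neg, pos))  # large -- the pairs are on the wrong side
```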



    DCNNs are very expressive models capable of learning exceptionally complicated relationships; however, many of these complicated mappings are due to sampling noise. They exist only in the training set rather than in the test set, irrespective of whether the two are drawn from the same data distribution. This leads to overfitting, which can be mitigated by regularization. The easiest and most common method to reduce overfitting is data augmentation, but it requires a larger memory footprint and comes at a higher computational cost.




    In Dropout (Hinton et al., 2012; Srivastava et al., 2014), each unit of a layer's output is retained with probability $p$; else it is set to zero with probability $1-p$, with 0.5 being a common value of $p$. When dropout is applied to a fully connected layer of a DCNN (or any DNN), the output of the layer, $r = [r_1, r_2, . . . , r_d]^T$, can be expressed as $r = m * a(Wv)$, where $*$ denotes the element-wise product between a binary mask vector $m$ and the matrix product between the input vector $v = [v_1, v_2, . . . v_n]^T$ and the weight matrix $W$ (with dimensions $d \times n$), followed by the nonlinear activation function $a$. The primary benefit of Dropout is its proven ability to significantly reduce overfitting by effectively preventing feature coadaptation; it is also capable of attaining model averaging.
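    In practice this is usually implemented as "inverted" dropout: keep each unit with probability p and scale survivors by 1/p, so the expected activation is unchanged and nothing special is needed at test time. A sketch (function name ours):

```python
import random

def dropout(activations, p, rng=random):
    """Keep each unit with probability p (scaled by 1/p), zero it otherwise."""
    return [a / p if rng.random() < p else 0.0 for a in activations]

rng = random.Random(0)
layer = [1.0] * 10
print(dropout(layer, 0.5, rng))  # roughly half the units zeroed, survivors doubled
print(dropout(layer, 1.0, rng))  # p=1 keeps everything unchanged
```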


    Dropout at test time

    Spatial Dropout

    In an object localization application that used a DCNN, it was found that applying regular Dropout before a 1x1 convolution layer increased training time but did not prevent overfitting. Thus, Spatial Dropout was proposed: we randomly set entire feature maps to 0, rather than individual 'pixels'. Regular dropout does not work so well on images because adjacent pixels are highly correlated. Dropout randomly zeros out activations in order to force the network to generalize better, overfit less, and build in redundancies as a regularization technique; spatial dropout in convolution layers zeros out entire filters instead. Spatial dropout is well suited to datasets with a small number of training samples, making it a good candidate to reduce overfitting for smaller datasets, where generalization is usually an issue.

    Label Smoothing Regularization was first proposed in the 1980s.

    Instead of hard labels like 0 and 1, smooth the labels by making them close to 0 and 1.

    For example, 0, 1 -> 0.1, 0.9

    Label Smoothing Regularization

    It maintains a realistic ratio between the unnormalized log probabilities of erroneous classes by estimating, during training, the marginalized consequence of label dropout. This prevents the model from assigning the complete likelihood to a single class for each training case. It can be considered the equivalent of replacing a single cross-entropy loss with a pair of losses, the second of which looks at a prior distribution and penalizes the deviation of the predicted label distribution relative to it. This is what was used in 2017 ILSVRC's top model.
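    The 0/1 → 0.1/0.9 example above is just uniform smoothing: take eps of the probability mass and spread it evenly over the K classes. A sketch (function name ours):

```python
def smooth_labels(one_hot, eps=0.2):
    """Blend a one-hot vector with the uniform distribution over its K classes."""
    k = len(one_hot)
    return [y * (1.0 - eps) + eps / k for y in one_hot]

print(smooth_labels([0.0, 1.0]))  # [0.1, 0.9] -- the example from the text
print(smooth_labels([0, 0, 0, 0, 0, 1, 0, 0, 0, 0], eps=0.1))  # soft one-hot for "5"
```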


    Imagine if you could initialize your variables at least close to the trained values! Wouldn't that save time and compute resources! As of now, it's still a dream!

    There are hacks though!

    You'll be surprised to know that most of the new algorithms actually "steal" their variable values from an already pre-trained network like VGG etc!

    Poor initialization of DCNN parameters, which are typically in the millions, and in particular their weights, can hamper the training process because of the vanishing/exploding gradient problem and hinder convergence. Thus, their initialization is extremely critical. The key factors to consider when selecting an initialization scheme are the activation function to be used, the network depth (which can hamper classification accuracy due to degradation), the computational budget available, the size of the dataset, and the tolerable complexity of the required solution.


    Xavier Initializations

    Xavier initialization. Glorot and Bengio (2010) evaluated how backpropagation gradients and activations varied across different layers; based on these considerations, they proposed a normalized initialization scheme that essentially adopted a balanced uniform distribution for weight initialization. The initial weights are drawn from a uniform or Gaussian distribution, with zero mean and a precise variance. Xavier initialization promotes the propagation of signals deep into DNNs and has been shown to lead to substantively faster convergence. Its main limitation is that its derivation is based on the assumption that the activations are linear, thus making it inappropriate for ReLU and PReLU activations (He et al., 2015a).

    Theoretically derived adaptable initialization. He et al. (2015) derived a theoretically sound initialization that considers these nonlinear activations. It leads to the initialization of weights from a zero-mean Gaussian distribution whose standard deviation is $\sqrt{2/n_l}$, where $n_l$ is the number of connections of a response in layer $l$. Furthermore, they initialized the biases to zero and showed empirical proof that this initialization scheme is suited to training extremely deep models, where Xavier initialization is not.
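    The two variances are simple formulas, so the resulting standard deviations are easy to compute. A sketch (helper names and the 3x3/64-channel example are ours):

```python
import math

def xavier_std(fan_in, fan_out):
    """Glorot & Bengio (2010): variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(n_l):
    """He et al. (2015): variance 2 / n_l, suited to ReLU/PReLU layers."""
    return math.sqrt(2.0 / n_l)

# For a 3x3 conv with 64 input channels, n_l = 3 * 3 * 64 = 576 connections
print(he_std(3 * 3 * 64))    # ~0.059
print(xavier_std(576, 576))  # ~0.042 -- smaller, derived for linear activations
```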

    Batch Normalization


    Notice when Batch Normalization is performed!




    Notice what BN does. Also notice that BN doesn't do much if Skip-Connections are used (as in ResNet)

    In addition to having a large number of parameters, the training of DCNNs is complicated by a phenomenon known as internal covariate shift, which is caused by changes to the distribution of each layer's inputs due to parameter changes in the previous layer.

    A very good explanation from Quora: if you take a very long column and apply load on it, it will fail much before the axial capacity of its section due to buckling. The most appropriate engineering solution to increase its load capacity is to prevent buckling by supporting it laterally at intervals. Such supports may be of fixed or pinned type, to prevent sway, rotation, or both for the intermediate column sections. Similarly, unless the intermediate layers of a very deep network are constrained with whitening, it will lose its learning capacity and accuracy due to internal covariate shift, where whitening means zero mean, unit variance, and decorrelated. One brilliant solution for this problem is Batch Normalization. For both buckling and covariate shift, a small perturbation leads to a large change in the latter sections.

    This phenomenon has severe consequences, which include slower training due to lower learning rates, the need for careful parameter initialization, and complexities when training DCNNs with saturating nonlinear activations. To reduce the consequences of internal covariate shift, Ioffe and Szegedy (2015) proposed a technique known as batch normalization. This technique introduces a normalization step, a transform applied to each activation that fixes the means and variances of layer inputs. BN computes the mean and variance estimates over mini-batches rather than over the entire training set. The idea is that, instead of just normalizing the inputs to the network, we normalize the inputs to layers within the network. It's called "batch" normalization because, during training, we normalize the activations of the previous layer for each batch, i.e., apply a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.

    The benefits of Batch Normalization are: networks train faster, it allows higher learning rates, makes weights easier to initialize, makes more activation functions viable, simplifies the creation of deeper networks, and provides some regularization (as it adds a little noise to the network). On the other hand, BN adds close to 30% computational overhead and requires around 25% more parameters per iteration.
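    The core transform, for one mini-batch of a single activation, can be sketched in plain Python (gamma and beta stand for the learnable scale and shift; eps avoids division by zero; function name ours):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to ~zero mean / unit variance, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

batch = [10.0, 12.0, 8.0, 14.0]
normed = batch_norm(batch)
print(normed)                     # centered near 0, spread near 1
print(sum(normed) / len(normed))  # ~0.0
```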



    ResNet Block

    In 2015 an ultra-deep network was introduced. Ultra-deep networks suffer from poor propagation of activations and gradients because of the stacking of several nonlinear transformations on top of each other.

    This was solved by residual learning framework.

    He et al. reformulated the layers of the network and forced them to learn residual functions with reference to their preceding layer inputs. This allowed for errors to be propagated directly to the preceding units and thus made these networks easier to optimize and easier to train.

    Wide Residual Networks


    Wide - Increasing Number of filters

    Given that the extremely deep residual networks of He et al. (2016) were slow to train, their depth was reduced and width increased in a new variant called WRN (Zagoruyko & Komodakis, 2017).

    These WRNs were much shallower, consisting of only 16 layers compared to the 1000 of He et al. (2016), yet they outperformed all previous residual models in terms of efficiency and accuracy and set a new SOA.

    Wider networks run faster and have more accuracy. For example, a WRN-50 would perform as well as a ResNet-200, both achieving a 6% error rate.

    However, these were later superseded by the shortcut technique introduced by DenseNet.




    The recently proposed densely connected convolutional networks (Huang, Liu, Weinberger, & van der Maaten, 2016) extend the idea of skip connections by connecting, in the usual feedforward fashion, each layer with every other layer in the network.

    Thus, the feature maps of all the previous layers are used as the inputs for each succeeding layer; each new feature map is a concatenation of the earlier ones.

    They obtained accuracy comparable to residual networks but required significantly fewer parameters.

    For example, DenseNet can achieve accuracy similar to ResNet while using less than half the parameters and roughly half the number of FLOPs.





    Regular Convolution - all input channels (like RGB) are combined into 1 channel.


    First Step in Depthwise Separable MobileNet Architecture.


    Second Step - Pointwise 1x1 convolution in MobileNet Architecture.


    Sometime in April 2017, a very interesting paper titled MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications appeared on arXiv.

    The authors of the paper claim that this kind of neural network runs very efficiently on mobile devices and is nearly as accurate as much larger convolutional networks like our good friend VGGNet-16.

    Depthwise Separable Convolutions

    The big idea behind MobileNets: Use depthwise separable convolutions to build light-weight deep neural networks.

    A regular convolutional layer applies a convolution kernel (or “filter”) to all of the channels of the input image. It slides this kernel across the image and at each step performs a weighted sum of the input pixels covered by the kernel across all input channels.

    The important thing is that the convolution operation combines the values of all the input channels. If the image has 3 input channels, then running a single convolution kernel across this image results in an output image with only 1 channel per pixel.

    So for each input pixel, no matter how many channels it has, the convolution writes a new output pixel with only a single channel. (In practice we run many convolution kernels across the input image. Each kernel gets its own channel in the output.)

    The MobileNets architecture also uses this standard convolution, but just once as the very first layer. All other layers do “depthwise separable” convolution instead. This is actually a combination of two different convolution operations: a depthwise convolution and a pointwise convolution.

    A depthwise convolution works like the right image titled First Step in Depthwise Separable MobileNet Architecture.

    Unlike a regular convolution, it does not combine the input channels but it performs convolution on each channel separately. For an image with 3 channels, a depthwise convolution creates an output image that also has 3 channels. Each channel gets its own set of weights.

    The purpose of the depthwise convolution is to filter the input channels. Think edge detection, color filtering, and so on.

    The depthwise convolution is followed by a pointwise convolution. This really is the same as a regular convolution but with a 1×1 kernel - (Right Image - Second Step).

    In other words, this simply adds up all the channels (as a weighted sum).

    As with a regular convolution, we usually stack together many of these pointwise kernels to create an output image with many channels.

    The purpose of this pointwise convolution is to combine the output channels of the depthwise convolution to create new features.

    Why do this?

    The end results of both approaches are pretty similar — they both filter the data and make new features — but a regular convolution has to do much more computational work to get there and needs to learn more weights.

    So even though it does (more or less) the same thing, the depthwise separable convolution is going to be much faster!

    The paper gives the exact formula for computing the speed difference; for 3×3 kernels, the new approach needs roughly 9 times less computation while remaining nearly as effective.

    It’s no surprise therefore that MobileNets uses up to 13 of these depthwise separable convolutions in a row!
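    That roughly-9x figure can be sanity-checked with the cost model from the MobileNets paper: a standard convolution costs D_K·D_K·M·N·D_F·D_F multiply-adds, while the depthwise separable version costs D_K·D_K·M·D_F·D_F + M·N·D_F·D_F. The layer sizes below are illustrative:

```python
# Multiply-add counts for one layer. D_K: kernel size, M: input channels,
# N: output channels, D_F: spatial size of the (square) output feature map.
def standard_conv_cost(D_K, M, N, D_F):
    return D_K * D_K * M * N * D_F * D_F

def depthwise_separable_cost(D_K, M, N, D_F):
    depthwise = D_K * D_K * M * D_F * D_F   # filter each channel separately
    pointwise = M * N * D_F * D_F           # 1x1 conv combines the channels
    return depthwise + pointwise

# e.g. a 3x3 layer mapping 256 -> 256 channels on a 32x32 feature map
std = standard_conv_cost(3, 256, 256, 32)
sep = depthwise_separable_cost(3, 256, 256, 32)
print(std / sep)  # roughly 8.7: the paper's "about 9 times" speed-up
```

    The ratio works out to 1/N + 1/D_K², so for 3×3 kernels and a reasonably large N it approaches 1/9.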

    Advance Concepts

    Topics covered here need understanding of basic concepts first. If you haven't read Deep Concepts in DNNs, we recommend you do that right away!

    Recognition vs Detection

    recovdet Object Recognition vs Object Detection

    In object recognition, our aim is to recognize everything present in the image, e.g. a dog and a bridge. In object detection, however, we need to specify where exactly the dog(s) and the bridge are. Recognition is pretty easy these days, while detection is still a work in progress. Everything you have covered so far was object recognition.

    Recognition is easier because the algorithm needs to predict only a few things (like the class name), while in detection it effectively needs to classify each pixel. We simplify this problem by ignoring low prediction values, predicting bounding boxes instead of exact object masks, and mixing layers with different receptive fields, but this is easier said than done.

    There are two main approaches driving detection algorithms, namely:

    Both are similar yet very different.

    Receptive Field

    image Receptive Field

    CascadingConvolutions.png Center "Pixel" in the second layer has "seen" all the image!

    dogs A dog and a Cat, and block classification representing class predictions

    amazing Look what each block is actually predicting!

    ant Compound Eyes

    The receptive field is perhaps one of the most important concepts in Convolutional Neural Networks, and it deserves far more attention than it usually gets. All DNN algorithms are built around this idea, but very few resources cover the fundamental aspects of the receptive field. Most just call it the filter size, without giving us that 'aha' moment to realize the power it packs!

    One of the best papers on this topic was written last year by Wenjie Luo et al.

    Let's focus on the second image on the right.

    We are performing 3x3 convolutions twice. The first time the 3x3 convolution happens, every pixel in the first layer has "seen" 9 pixels in total (marked green in the input image). Since we were convolving, the 3x3 matrix multiplication must have changed/filtered the results, but the important thing to realize here is that this pixel now holds information about 9 pixels of the input image. We are, in a sense, collecting (and filtering) information about 9 pixels and placing it in a new pixel in the first layer.

    The question we need to spend some time on is: what does this new image (layer 1) represent? Wouldn't each pixel in this layer represent "concepts", "features" or "data" from 9 other pixels?

    Magic happens once we perform the same process on the first layer, convolving through a 3x3 filter. Now the new pixel in the center of the second layer's output has "seen" all of the image! (This is because the 9 pixels it convolved over in the first layer had each seen the surrounding 9 pixels in the original image.) Look at the image, and you should understand what we mean here. The center pixel is a "champion informant" of the whole image. Pixels next to it would act the same, but their "receptive field" would be smaller.

    Though traditionally the receptive field is considered to be just the filter size, if we dig deeper we realize that when we cascade convolutions, the receptive field of pixels in deeper layers is much more than just the filter size.
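    This cascading growth is easy to compute. The helper below is a small sketch (function and parameter names are illustrative): each layer with kernel k and stride s grows the receptive field by (k - 1) times the product of all earlier strides.

```python
# Receptive field of one cell after a stack of convolution layers,
# given as (kernel_size, stride) pairs.
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the view...
        jump *= s             # ...scaled by the accumulated stride
    return rf

# Two 3x3 convolutions (stride 1): the center cell has "seen" 5x5 pixels,
# i.e. all of a 5x5 input image
print(receptive_field([(3, 1), (3, 1)]))  # 5
```

    Strides compound the effect: add a stride of 2 to the first layer and the same two-layer stack already covers a 7x7 patch.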

    I have used the word pixel here abundantly, and I apologize for that. Technically what I am calling a pixel here is just a "cell" value in a Matrix, but calling it pixels helps me visualize the problem and explain it better.

    Another amazing paper which came just 3 days ago (6th March 2018) is The Building Blocks of Interpretability

    Look at the dog and the cat image on your right. When we run a CNN and only look at the final layer, each box is predicting a class accordingly.

    Let's get awed now! The second image shows what each cell sees (check out the high-res images here). The geometric centroid of the dog would be "seeing" the whole of the dog, and as we move away from this conceptual centroid, each cell sees less of it. The moment a cell sees very little of the dog, the bounding box ends there in the object detection algorithm!

    Look at the insect vision image on the right. The first image is Hollywood's representation of how a compound eye might see. This is what our neural networks are doing, and it is totally wrong. Insect eyes seem to do a sort of average/max-pooling straight after the input from each photo-receptor, creating a blurry image that is good enough for their tasks.

    Personally, I am glad to learn a new way in which the receptive field can be interpreted, but also a bit sad, as now I ask: should this much work be needed for object detection? There must be an easier way hiding right in front of us!

    Anchor Boxes

    image Sliding Window

    image Region Proposal

    image Anchor Boxes

    image h vs w graph



    yoloanchor Anchor box dimensions in YOLOv2

    Until very recently, the sliding window was the main method used for detecting objects.

    As should be evident from the image on your right, there are a few problems here:

    Sliding Window as a concept is simple, but extremely compute intensive.

    To understand how frustrating it can get to use this algorithm, just check out this PyImageSearch Image. Don't forget to sit back and relax!
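    To see why it is so compute intensive, here is a bare-bones sketch of the window enumeration (names and sizes are illustrative): every crop it yields would need its own classifier forward pass, and this is for a single window size.

```python
# Enumerate every (win_h, win_w) crop of an image at the given stride.
def sliding_windows(image_h, image_w, win_h, win_w, stride):
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (y, x, y + win_h, x + win_w)  # one candidate box

# A 416x416 image with a 64x64 window and stride 8 already gives
# 45 * 45 = 2025 crops, and real detectors need many window sizes.
boxes = list(sliding_windows(416, 416, 64, 64, 8))
print(len(boxes))  # 2025
```

    Multiply that by several scales and aspect ratios, and by a classifier forward pass per crop, and the cost explodes.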

    RCNN came out with a region proposal mechanism where we need two different networks. The first (region proposal) network predicts all the possible regions where an object might exist, then the second (prediction) network classifies the object. This handled the differently-sized-objects problem, but again is extremely slow (RCNN took about 20 seconds per image!) as it required a forward pass of the CNN (AlexNet) for every single region proposal for every single image (and that's about running AlexNet 2000 times for 2000 proposals).

    Why not run the CNN just once per image and then find a way to share that computation across the ~2000 proposals?

    This is exactly what Fast R-CNN does, using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions. In the image above, notice how the CNN features for each region are obtained by selecting a corresponding region from the CNN's feature map. Then, the features in each region are pooled (usually using max pooling). So all it takes is one pass of the original image as opposed to ~2000! But this was still too slow for any real-time implementation, taking around 2 seconds per image.

    Welcome Anchor Boxes

    Faster RCNN, SSD and YOLOv2, all use Anchor Boxes.

    Intuitively, we know that objects in an image should fit certain common aspect ratios and sizes. For instance, we know that we want some rectangular boxes that resemble the shapes of humans. Likewise, we know we won’t see many boxes that are very very thin. In such a way, we create k such common aspect ratios we call anchor boxes. For each such anchor box, we output one bounding box and score per position in the image.

    Let us see how to do it!


    Top 5 anchor boxes for YOLOv2

    Now if we divide our image into, say, 13x13 cells, allow for scaling and slight translation of these bounding boxes, and predict the best anchor box, we are done!

    If a cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior (anchor) has width and height $(g_w, g_h)$, then the predictions correspond to:

    $$ b_x = \sigma(t_x) + c_x $$

    $$ b_y = \sigma(t_y) + c_y $$

    $$ b_w = g_w e^{t_w} $$

    $$ b_h = g_h e^{t_h} $$

    In YOLOv2, we directly predict $t_x, t_y, t_w,$ and $t_h$ among other things for each cell.
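    This decoding step can be sketched in a few lines of plain Python (`decode_box` is an illustrative name, following the box parametrization of the YOLOv2 paper): the sigmoid keeps the predicted center inside its cell, and the exponential scales the anchor prior.

```python
import math

# Decode one prediction (t_x, t_y, t_w, t_h) into a box, given the cell
# offset (c_x, c_y) and the anchor prior's width/height (g_w, g_h).
def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, g_w, g_h):
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    b_x = sigmoid(t_x) + c_x   # center stays inside its cell
    b_y = sigmoid(t_y) + c_y
    b_w = g_w * math.exp(t_w)  # scale the anchor prior's width
    b_h = g_h * math.exp(t_h)  # ...and its height
    return b_x, b_y, b_w, b_h

print(decode_box(0, 0, 0, 0, c_x=6, c_y=6, g_w=3.3, g_h=5.5))
# t = 0 everywhere reproduces the anchor centered in its cell: (6.5, 6.5, 3.3, 5.5)
```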

    You can learn more about the exact math behind the k-means part of the YOLOv2 algorithm here.
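    For intuition, the k-means step can be sketched as follows: cluster the ground-truth (width, height) pairs using d = 1 - IoU as the distance, as the YOLOv2 paper does. This NumPy sketch uses illustrative sample boxes and function names:

```python
import numpy as np

# IoU between one (w, h) box and an array of anchors, comparing boxes
# as if centered at the same point, so only their dimensions matter.
def iou_wh(box, anchors):
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=50, seed=0):
    rng = np.random.RandomState(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the anchor with the highest IoU (lowest 1 - IoU)
        assign = np.array([np.argmax(iou_wh(b, anchors)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors

boxes = np.array([[1.0, 2.0], [1.2, 2.1], [5.0, 5.0], [5.5, 4.8], [9.0, 3.0]])
print(kmeans_anchors(boxes, k=2))
```

    Using 1 - IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which is exactly why the paper chose it.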

    Intersection over Union (IoU)

    image Intersection over Union

    Intersection over Union is an evaluation metric used to measure the accuracy of an object detector on a particular dataset. We often see this evaluation metric used in object detection challenges such as the popular PASCAL VOC challenge.

    More formally, in order to apply Intersection over Union to evaluate an (arbitrary) object detector we need:

    1. The ground-truth bounding boxes (i.e., the hand labeled bounding boxes from the testing set that specify where in the image our object is).
    2. The predicted bounding boxes from our model.

    In the numerator we compute the area of overlap between the predicted bounding box and the ground-truth bounding box.

    The denominator is the area of union, or more simply, the area encompassed by both the predicted bounding box and the ground-truth bounding box.

    Dividing the area of overlap by the area of union yields our final score — the Intersection over Union.
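    Putting the two areas together, a minimal IoU implementation looks like this (a sketch; the corner box format (x1, y1, x2, y2) is an assumption):

```python
# IoU for axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # area of overlap
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # area of union
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes: 1.0
```

    Note the `max(0, ...)` terms: they make the intersection zero when the boxes do not overlap at all.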


    mAP Mean Average Precision

    vs Precision vs Recall

    Precision and Recall.

    Precision is the percentage of true positives among the retrieved results. Say we have 8 images of aeroplanes in a dataset of 10 images and the network predicted all 10 as aeroplanes; the precision of this network is then 80%.

    $$ precision = \frac {t_p}{t_p + f_p} = \frac {t_p}{n}$$

    where $t_p$, $f_p$, and $n$ represent true positives (correctly detected images), false positives (wrongly detected images), and the total number of detections respectively.

    Recall is the percentage of the objects that the system retrieves. Let us assume we have 10 images of aeroplanes in our dataset, but our network detected only 9 of them as aeroplanes; our recall is then 90%.

    $$ recall = \frac {t_p}{t_p + f_n}$$

    where $t_p$ and $f_n$ represent true positives (detected images) and false negatives (missed images) respectively.

    This means that if our precision is low, our network is predicting more things as aeroplanes than actually exist, and if recall is low, it is not able to detect all the aeroplanes.
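    The two formulas, applied to the aeroplane examples above:

```python
def precision(tp, fp):
    # fraction of predictions that were correct
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual objects that were found
    return tp / (tp + fn)

# 10 images predicted as aeroplanes, but only 8 really are: tp=8, fp=2
print(precision(tp=8, fp=2))  # 0.8
# 10 aeroplanes in the dataset, 9 of them detected: tp=9, fn=1
print(recall(tp=9, fn=1))     # 0.9
```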

    mAP or Mean Average Precision

    mAP is the mean of all the average precision across all our classes!

    Generally it is seen that precision falls if we try to improve recall.

    Non-Max Suppression

    Just take the maximum one! More description coming soon!
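    Until that longer description arrives, here is a minimal greedy NMS sketch (pure Python, illustrative names): keep the highest-scoring box, drop every remaining box that overlaps it beyond a threshold, and repeat.

```python
# Greedy non-max suppression over boxes given as (x1, y1, x2, y2).
def nms(boxes, scores, iou_thresh=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    # process boxes from highest to lowest score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop everything that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes  = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much
```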

    YOLOv2 Loss Function

    image Scary?

    image YOLOv2 Network Output

    var1 var2 (var1 - var2)^2 (sqrtvar1 - sqrtvar2)^2
    0.0300 0.020 9.99e-05 0.001
    0.0330 0.022 0.00012 0.0011
    0.0693 0.046 0.000533 0.00233
    0.2148 0.143 0.00512 0.00723
    0.8808 0.587 0.0862 0.0296
    4.4920 2.994 2.2421 0.1512

    On your right is YOLOv2's loss function. Doesn't it look scary?

    Let's first look at what the network actually predicts.

    If we recap, YOLOv2 predicts detections on a 13x13 feature map, so in total we have 169 maps/cells.

    We have 5 anchor boxes. For each anchor box we need an objectness-confidence score (is an object present?), 4 coordinates ($t_x, t_y, t_w,$ and $t_h$), and 20 class probabilities. Per cell this can crudely be seen as 20 coordinates, 5 confidence scores, and 100 class probabilities, as shown in the image on the right, so in total 125 filters of size 1x1 are needed.
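    The filter-count arithmetic, spelled out (plain Python; the constant names are illustrative):

```python
grid_cells = 13 * 13   # detections are made on a 13x13 feature map
n_anchors  = 5         # anchor boxes per cell
n_coords   = 4         # t_x, t_y, t_w, t_h
n_conf     = 1         # objectness-confidence score
n_classes  = 20        # PASCAL VOC classes

per_cell = n_anchors * (n_coords + n_conf + n_classes)
print(per_cell)               # 125 -> 125 filters of size 1x1
print(grid_cells * per_cell)  # 21125 numbers predicted per image
```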

    So we have a few things to worry about:

    All losses are mean squared errors, except classification loss, which uses cross entropy function.

    Now, let's break the code in the image.

    You can find another similar explanation here.

    Not that scary, right!

    DCNN Performance

    DCNN Performance on the ImageNet Dataset : MIT


    Sample images in ImageNet



    ML overtaking CV


    The ImageNet project is a large visual database designed for use in visual object recognition software research. Over 14 million URLs of images have been hand-annotated by ImageNet to indicate what objects are pictured; in at least one million of the images, bounding boxes are also provided.

    A dramatic 2012 breakthrough in solving the ImageNet Challenge is widely considered to be the beginning of the deep learning revolution of the 2010s.

    Code Name Contribution Top 5 Error %
    AlexNet 2012 DCNN model across parallel GPUs, innovations include Dropout, data augmentation, ReLUs, and local response normalization 15.3
    Zeiler & Fergus Net 2014 Novel visualization method; larger convolution layers 11.7
    Spatial Pyramid Pooling Net 2014 SPP for flexible image size 8.06
    VGG-net 2014 Increase depth, more convolution layers, 3x3 convs 7.32
    GoogLeNet 2015 Novel inception arch., ultra large, dimensionality reduction 6.67
    PReLU-net 2015a PReLU activation functions and robust initialization scheme 4.94
    BN-Inception 2015 BN combined with inception 4.83
    Inception-V3 2015 Factorized convolutions, aggressive dimensionality reduction 3.58
    ResNets 2015 Residual functions/blocks integrated into DCNN 3.57
    BN-Inception-ResNet 2016 BN, inception arch. integrated with residual functions 3.08
    CUImage 2016 Ensemble, Gated Bi-Directional CNN 2.991
    Squeeze and Excite 2017 Feature Recalibration, Label-smoothing regularization, large Batch Size 2.251
    ILSVRC 2018 STOPPED! Focus now is on efficiency :)


    Awesome YoloV2 Video from Andreas Refsgaard at a creative coding studio - Støj

    Source YOLOv2 Paper - General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects. Not YOLOv2!

    YOLO is definitely the most written about object detection algorithm, and why not:

    YOLOv2 Paper Notes

    We need to cover some very important concepts before we could understand YOLOv2. If you haven't already read Advance Concepts, we recommend you do that right away!

    anchors Taken from Understanding YOLO Bounding Boxes

    image YOLOv2 vs Others

    image 13x13 Grid Prediction

    Scary Loss Function

    image Scary?

    YOLOv2's loss function, and the 125 values the network outputs for each of the 13x13 cells, are broken down step by step in the YOLOv2 Loss Function part of the Advance Concepts section above.

    Code - Inference

    Clone into DarkNet repository

    git clone
    cd darknet

    Let's run a simpler example first (macOS and Ubuntu only).

    git clone the original DarkNet repository from github and then cd in to it. Issue the make command to build DarkNet for your system.

    Download the weights


    Then download the pre-trained model. This is a ~194 MB file, so it may take some time to download.

    Let's execute the model now.

    In the data folder we have an image called dog.jpg. We shall use this image to run our network.

    Run DarkNet Yolo

    ./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg

    image Predictions

    Issue the detect command. It needs a model-architecture-configuration file cfg/yolo.cfg matching our weights, the weights file yolo.weights which you just downloaded, and finally the image file you want to run detection on data/dog.jpg.

    Your console should show output similar to:

    You can also run Tiny YOLO model. First get the weights: wget and then run this command: ./darknet detect cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights data/dog.jpg. This time you'll see:

    Code - Training


    from keras.models import Sequential, Model
    from keras.layers import Reshape, Activation, Conv2D, Input, MaxPooling2D, BatchNormalization, Flatten, Dense, Lambda
    from keras.layers.advanced_activations import LeakyReLU
    from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
    from keras.optimizers import SGD, Adam, RMSprop
    from keras.layers.merge import concatenate
    import matplotlib.pyplot as plt
    import keras.backend as K
    import tensorflow as tf
    import imgaug as ia
    from tqdm import tqdm
    from imgaug import augmenters as iaa
    import numpy as np
    import pickle
    import os, cv2
    from preprocessing import parse_annotation, BatchGenerator
    from utils import WeightReader, decode_netout, draw_boxes, normalize
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    %matplotlib inline
    LABELS = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
    IMAGE_H, IMAGE_W = 416, 416
    GRID_H,  GRID_W  = 13 , 13
    BOX              = 5
    CLASS            = len(LABELS)
    CLASS_WEIGHTS    = np.ones(CLASS, dtype='float32')
    OBJ_THRESHOLD    = 0.3#0.5
    NMS_THRESHOLD    = 0.3#0.45
    ANCHORS          = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]
    NO_OBJECT_SCALE  = 1.0
    OBJECT_SCALE     = 5.0
    COORD_SCALE      = 1.0
    CLASS_SCALE      = 1.0
    BATCH_SIZE       = 16
    TRUE_BOX_BUFFER  = 50 # max number of ground-truth boxes per image
    wt_path = 'yolo.weights'                      
    train_image_folder = '/home/andy/data/coco/train2014/'
    train_annot_folder = '/home/andy/data/coco/train2014ann/'
    valid_image_folder = '/home/andy/data/coco/val2014/'
    valid_annot_folder = '/home/andy/data/coco/val2014ann/'

    If you have covered the stuff above, you can just brush through the code now! This is the code we'll refer to.

    Outline of Steps

    As you can see, before we run this code, we need to first pre-process our dataset. This model will be trained on COCO dataset.

    COCO dataset is huge, so downloading and pre-processing it is going to take a lot of time.

    Also, to save time, we will use a pre-trained model from YOLOv2's original implementation, re-initialize the last few layers, and then train just those layers.

    If you try to train the whole network from scratch, it would take close to two weeks!

    Constructing the network

    # the function to implement the organization layer (thanks to
    def space_to_depth_x2(x):
        return tf.space_to_depth(x, block_size=2)
    input_image = Input(shape=(IMAGE_H, IMAGE_W, 3))
    true_boxes  = Input(shape=(1, 1, 1, TRUE_BOX_BUFFER , 4))
    # Layer 1
    x = Conv2D(32, (3,3), strides=(1,1), padding='same', name='conv_1', use_bias=False)(input_image)
    x = BatchNormalization(name='norm_1')(x)
    x = LeakyReLU(alpha=0.1)(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    # Layer 2
    x = Conv2D(64, (3,3), strides=(1,1), padding='same', name='conv_2', use_bias=False)(x)
    x = BatchNormalization(name='norm_2')(x)
    x = LeakyReLU(alpha=0.1)(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    # Layer 3
    x = Conv2D(128, (3,3), strides=(1,1), padding='same', name='conv_3', use_bias=False)(x)
    x = BatchNormalization(name='norm_3')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 4
    x = Conv2D(64, (1,1), strides=(1,1), padding='same', name='conv_4', use_bias=False)(x)
    x = BatchNormalization(name='norm_4')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 5
    x = Conv2D(128, (3,3), strides=(1,1), padding='same', name='conv_5', use_bias=False)(x)
    x = BatchNormalization(name='norm_5')(x)
    x = LeakyReLU(alpha=0.1)(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    # Layer 6
    x = Conv2D(256, (3,3), strides=(1,1), padding='same', name='conv_6', use_bias=False)(x)
    x = BatchNormalization(name='norm_6')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 7
    x = Conv2D(128, (1,1), strides=(1,1), padding='same', name='conv_7', use_bias=False)(x)
    x = BatchNormalization(name='norm_7')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 8
    x = Conv2D(256, (3,3), strides=(1,1), padding='same', name='conv_8', use_bias=False)(x)
    x = BatchNormalization(name='norm_8')(x)
    x = LeakyReLU(alpha=0.1)(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    # Layer 9
    x = Conv2D(512, (3,3), strides=(1,1), padding='same', name='conv_9', use_bias=False)(x)
    x = BatchNormalization(name='norm_9')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 10
    x = Conv2D(256, (1,1), strides=(1,1), padding='same', name='conv_10', use_bias=False)(x)
    x = BatchNormalization(name='norm_10')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 11
    x = Conv2D(512, (3,3), strides=(1,1), padding='same', name='conv_11', use_bias=False)(x)
    x = BatchNormalization(name='norm_11')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 12
    x = Conv2D(256, (1,1), strides=(1,1), padding='same', name='conv_12', use_bias=False)(x)
    x = BatchNormalization(name='norm_12')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 13
    x = Conv2D(512, (3,3), strides=(1,1), padding='same', name='conv_13', use_bias=False)(x)
    x = BatchNormalization(name='norm_13')(x)
    x = LeakyReLU(alpha=0.1)(x)
    skip_connection = x
    x = MaxPooling2D(pool_size=(2, 2))(x)
    # Layer 14
    x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_14', use_bias=False)(x)
    x = BatchNormalization(name='norm_14')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 15
    x = Conv2D(512, (1,1), strides=(1,1), padding='same', name='conv_15', use_bias=False)(x)
    x = BatchNormalization(name='norm_15')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 16
    x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_16', use_bias=False)(x)
    x = BatchNormalization(name='norm_16')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 17
    x = Conv2D(512, (1,1), strides=(1,1), padding='same', name='conv_17', use_bias=False)(x)
    x = BatchNormalization(name='norm_17')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 18
    x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_18', use_bias=False)(x)
    x = BatchNormalization(name='norm_18')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 19
    x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_19', use_bias=False)(x)
    x = BatchNormalization(name='norm_19')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 20
    x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_20', use_bias=False)(x)
    x = BatchNormalization(name='norm_20')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 21
    skip_connection = Conv2D(64, (1,1), strides=(1,1), padding='same', name='conv_21', use_bias=False)(skip_connection)
    skip_connection = BatchNormalization(name='norm_21')(skip_connection)
    skip_connection = LeakyReLU(alpha=0.1)(skip_connection)
    skip_connection = Lambda(space_to_depth_x2)(skip_connection)
    x = concatenate([skip_connection, x])
    # Layer 22
    x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_22', use_bias=False)(x)
    x = BatchNormalization(name='norm_22')(x)
    x = LeakyReLU(alpha=0.1)(x)
    # Layer 23
    x = Conv2D(BOX * (4 + 1 + CLASS), (1,1), strides=(1,1), padding='same', name='conv_23')(x)
    output = Reshape((GRID_H, GRID_W, BOX, 4 + 1 + CLASS))(x)
    # small hack to allow true_boxes to be registered when Keras build the model 
    # for more information:
    output = Lambda(lambda args: args[0])([output, true_boxes])
    model = Model([input_image, true_boxes], output)

    The actual classification/detection network is very simple.

    The model builds off of prior work on network design as well as common knowledge in the field.

    Similar to the VGG models it uses mostly 3 × 3 filters and double the number of channels after every pooling step.

    Following the work on Network in Network (NIN) it uses global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions.

    It uses batch normalization to stabilize training, speed up convergence, and regularize the model.

    The final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers.

    The network architecture looks like this:

    Type Filters Size/Stride Output
    Convolutional 32 3 × 3 224 × 224
    Maxpool 2 × 2/2 112 × 112
    Convolutional 64 3 × 3 112 × 112
    Maxpool 2 × 2/2 56 × 56
    Convolutional 128 3 × 3 56 × 56
    Convolutional 64 1 × 1 56 × 56
    Convolutional 128 3 × 3 56 × 56
    Maxpool 2 × 2/2 28 × 28
    Convolutional 256 3 × 3 28 × 28
    Convolutional 128 1 × 1 28 × 28
    Convolutional 256 3 × 3 28 × 28
    Maxpool 2 × 2/2 14 × 14
    Convolutional 512 3 × 3 14 × 14
    Convolutional 256 1 × 1 14 × 14
    Convolutional 512 3 × 3 14 × 14
    Convolutional 256 1 × 1 14 × 14
    Convolutional 512 3 × 3 14 × 14
    Maxpool 2 × 2/2 7 × 7
    Convolutional 1024 3 × 3 7 × 7
    Convolutional 512 1 × 1 7 × 7
    Convolutional 1024 3 × 3 7 × 7
    Convolutional 512 1 × 1 7 × 7
    Convolutional 1024 3 × 3 7 × 7
    Convolutional 1000 1 × 1 7 × 7
    Avgpool Global 1000

    We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.

    Several different initial configurations are possible (mostly differing in the size of the images they were trained on), resulting in different performance.


    Load pretrained weights

    Load Pretrained weights

    weight_reader = WeightReader(wt_path)
    nb_conv = 23
    for i in range(1, nb_conv+1):
        conv_layer = model.get_layer('conv_' + str(i))
        if i < nb_conv:
            # every conv layer except the last is followed by batch norm
            norm_layer = model.get_layer('norm_' + str(i))
            size =[0].shape)
            beta  = weight_reader.read_bytes(size)
            gamma = weight_reader.read_bytes(size)
            mean  = weight_reader.read_bytes(size)
            var   = weight_reader.read_bytes(size)
            norm_layer.set_weights([gamma, beta, mean, var])
        if len(conv_layer.get_weights()) > 1:
            # conv layer with a bias (only the last one here)
            bias   = weight_reader.read_bytes([1].shape))
            kernel = weight_reader.read_bytes([0].shape))
            kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
            kernel = kernel.transpose([2,3,1,0])
            conv_layer.set_weights([kernel, bias])
        else:
            # conv layer without a bias
            kernel = weight_reader.read_bytes([0].shape))
            kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
            kernel = kernel.transpose([2,3,1,0])
            conv_layer.set_weights([kernel])

    Training neural networks is hard, and given how complex YOLOv2's loss is (remember it has to predict 13x13x5x25 numbers!), training such a network from scratch would be a tremendous challenge, both in terms of compute time and computational resources.

    What is done to avoid such a situation is to use a pretrained base network sharing the same architecture, and transfer its weights directly to the network we are training. We can then decide to retrain these weights (but now, instead of being random, we have an extremely good starting point) or freeze those layers (and only train the layers we have added).

    To repeat, we make sure the network we want to train has two parts. The first part must be the same as (or similar to, in the case of retraining) the pretrained network whose weights we transfer, and the second part consists of the additional layers or strategies pertinent to the design problem (like object detection in this case).

    You can use the same strategy to design other solutions like image segmentation, human pose estimation, super resolution, etc. A fully trained network is used as a feature extractor, and we pass these features on to the new layers.

    Randomize weights of the last layer

    Randomize weights

    layer   = model.layers[-4] # the last convolutional layer
    weights = layer.get_weights()
    new_kernel = np.random.normal(size=weights[0].shape)/(GRID_H*GRID_W)
    new_bias   = np.random.normal(size=weights[1].shape)/(GRID_H*GRID_W)
    layer.set_weights([new_kernel, new_bias])

    As mentioned earlier, we can decide to train the whole network or use an existing network as a base network to initialize our new network.

    In this case, the YOLOv2 authors decided to freeze all but the last few layers of the base network, and the code above randomizes the weights of the last convolutional layer so it can learn new weights that integrate with the new layers. The network structure is kept intact.

    YOLOv2 Loss Function

    The Loss Function

    def custom_loss(y_true, y_pred):
        mask_shape = tf.shape(y_true)[:4]
        cell_x = tf.to_float(tf.reshape(tf.tile(tf.range(GRID_W), [GRID_H]), (1, GRID_H, GRID_W, 1, 1)))
        cell_y = tf.transpose(cell_x, (0,2,1,3,4))
        cell_grid = tf.tile(tf.concat([cell_x,cell_y], -1), [BATCH_SIZE, 1, 1, 5, 1])
        coord_mask = tf.zeros(mask_shape)
        conf_mask  = tf.zeros(mask_shape)
        class_mask = tf.zeros(mask_shape)
        seen = tf.Variable(0.)
        total_recall = tf.Variable(0.)
        """Adjust prediction"""
        ### adjust x and y      
        pred_box_xy = tf.sigmoid(y_pred[..., :2]) + cell_grid
        ### adjust w and h
        pred_box_wh = tf.exp(y_pred[..., 2:4]) * np.reshape(ANCHORS, [1,1,1,BOX,2])
        ### adjust confidence
        pred_box_conf = tf.sigmoid(y_pred[..., 4])
        ### adjust class probabilities
        pred_box_class = y_pred[..., 5:]
        """Adjust ground truth"""
        ### adjust x and y
        true_box_xy = y_true[..., 0:2] # relative position to the containing cell
        ### adjust w and h
        true_box_wh = y_true[..., 2:4] # number of cells accross, horizontally and vertically
        ### adjust confidence
        true_wh_half = true_box_wh / 2.
        true_mins    = true_box_xy - true_wh_half
        true_maxes   = true_box_xy + true_wh_half
        pred_wh_half = pred_box_wh / 2.
        pred_mins    = pred_box_xy - pred_wh_half
        pred_maxes   = pred_box_xy + pred_wh_half       
        intersect_mins  = tf.maximum(pred_mins,  true_mins)
        intersect_maxes = tf.minimum(pred_maxes, true_maxes)
        intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]
        true_areas = true_box_wh[..., 0] * true_box_wh[..., 1]
        pred_areas = pred_box_wh[..., 0] * pred_box_wh[..., 1]
        union_areas = pred_areas + true_areas - intersect_areas
        iou_scores  = tf.truediv(intersect_areas, union_areas)
        true_box_conf = iou_scores * y_true[..., 4]
        ### adjust class probabilities
        true_box_class = tf.argmax(y_true[..., 5:], -1)
        """Determine the masks"""
        ### coordinate mask: simply the position of the ground truth boxes (the predictors)
        coord_mask = tf.expand_dims(y_true[..., 4], axis=-1) * COORD_SCALE
        ### confidence mask: penalize predictors + penalize boxes with low IOU
        # penalize the confidence of the boxes, which have IOU with some ground truth box < 0.6
        true_xy = true_boxes[..., 0:2]
        true_wh = true_boxes[..., 2:4]
        true_wh_half = true_wh / 2.
        true_mins    = true_xy - true_wh_half
        true_maxes   = true_xy + true_wh_half
        pred_xy = tf.expand_dims(pred_box_xy, 4)
        pred_wh = tf.expand_dims(pred_box_wh, 4)
        pred_wh_half = pred_wh / 2.
        pred_mins    = pred_xy - pred_wh_half
        pred_maxes   = pred_xy + pred_wh_half    
        intersect_mins  = tf.maximum(pred_mins,  true_mins)
        intersect_maxes = tf.minimum(pred_maxes, true_maxes)
        intersect_wh    = tf.maximum(intersect_maxes - intersect_mins, 0.)
        intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]
        true_areas = true_wh[..., 0] * true_wh[..., 1]
        pred_areas = pred_wh[..., 0] * pred_wh[..., 1]
        union_areas = pred_areas + true_areas - intersect_areas
        iou_scores  = tf.truediv(intersect_areas, union_areas)
        best_ious = tf.reduce_max(iou_scores, axis=4)
        conf_mask = conf_mask + tf.to_float(best_ious < 0.6) * (1 - y_true[..., 4]) * NO_OBJECT_SCALE
        # penalize the confidence of the boxes, which are responsible for corresponding ground truth box
        conf_mask = conf_mask + y_true[..., 4] * OBJECT_SCALE
        ### class mask: simply the position of the ground truth boxes (the predictors)
        class_mask = y_true[..., 4] * tf.gather(CLASS_WEIGHTS, true_box_class) * CLASS_SCALE       
        """Warm-up training"""
        no_boxes_mask = tf.to_float(coord_mask < COORD_SCALE/2.)
        seen = tf.assign_add(seen, 1.)
        true_box_xy, true_box_wh, coord_mask = tf.cond(tf.less(seen, WARM_UP_BATCHES), 
                              lambda: [true_box_xy + (0.5 + cell_grid) * no_boxes_mask, 
                                       true_box_wh + tf.ones_like(true_box_wh) * np.reshape(ANCHORS, [1,1,1,BOX,2]) * no_boxes_mask, 
                                       tf.ones_like(coord_mask)],
                              lambda: [true_box_xy, 
                                       true_box_wh,
                                       coord_mask])
        """Finalize the loss"""
        nb_coord_box = tf.reduce_sum(tf.to_float(coord_mask > 0.0))
        nb_conf_box  = tf.reduce_sum(tf.to_float(conf_mask  > 0.0))
        nb_class_box = tf.reduce_sum(tf.to_float(class_mask > 0.0))
        loss_xy    = tf.reduce_sum(tf.square(true_box_xy-pred_box_xy)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
        loss_wh    = tf.reduce_sum(tf.square(true_box_wh-pred_box_wh)     * coord_mask) / (nb_coord_box + 1e-6) / 2.
        loss_conf  = tf.reduce_sum(tf.square(true_box_conf-pred_box_conf) * conf_mask)  / (nb_conf_box  + 1e-6) / 2.
        loss_class = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_box_class, logits=pred_box_class)
        loss_class = tf.reduce_sum(loss_class * class_mask) / (nb_class_box + 1e-6)
        loss = loss_xy + loss_wh + loss_conf + loss_class
        nb_true_box = tf.reduce_sum(y_true[..., 4])
        nb_pred_box = tf.reduce_sum(tf.to_float(true_box_conf > 0.5) * tf.to_float(pred_box_conf > 0.3))
        """Debugging code"""
        current_recall = nb_pred_box/(nb_true_box + 1e-6)
        total_recall = tf.assign_add(total_recall, current_recall) 
        loss = tf.Print(loss, [tf.zeros((1))], message='Dummy Line \t', summarize=1000)
        loss = tf.Print(loss, [loss_xy], message='Loss XY \t', summarize=1000)
        loss = tf.Print(loss, [loss_wh], message='Loss WH \t', summarize=1000)
        loss = tf.Print(loss, [loss_conf], message='Loss Conf \t', summarize=1000)
        loss = tf.Print(loss, [loss_class], message='Loss Class \t', summarize=1000)
        loss = tf.Print(loss, [loss], message='Total Loss \t', summarize=1000)
        loss = tf.Print(loss, [current_recall], message='Current Recall \t', summarize=1000)
        loss = tf.Print(loss, [total_recall/seen], message='Average Recall \t', summarize=1000)
        return loss

    If you look at the code for calculating cell_x, we tile tf.range(GRID_W) GRID_H times and reshape the result to (1 x GRID_H x GRID_W x 1 x 1).

    The code for cell_y swaps the second and third dimensions.

    We then combine cell_x and cell_y to create our cell_grid.
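    The same grid construction can be reproduced in NumPy on a toy 3x3 grid to see exactly what cell_grid holds:

    ```python
    import numpy as np

    GRID_H, GRID_W, BOX = 3, 3, 5   # toy grid instead of 13x13

    # Same trick as the TF code: tile the column indices, reshape, then swap axes.
    cell_x = np.tile(np.arange(GRID_W), GRID_H).reshape(1, GRID_H, GRID_W, 1, 1).astype(np.float32)
    cell_y = np.transpose(cell_x, (0, 2, 1, 3, 4))
    cell_grid = np.tile(np.concatenate([cell_x, cell_y], -1), [1, 1, 1, BOX, 1])

    # cell_grid[0, row, col, box] == [col, row] for every one of the BOX anchors
    print(cell_grid.shape)   # (1, 3, 3, 5, 2)
    ```

    So each cell location carries its own (x, y) index, which is what gets added to the sigmoid offsets below.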

    pred_box_xy is a variable storing the centroid of the predicted box (y_pred predicts it w.r.t. the cell, so we add the location of the cell to get the correct location w.r.t. the top-left corner of the image). Also notice the sigmoid function (making predictions a 0~1 fraction of the cell dimensions).

    pred_box_wh is a variable storing the predicted box's dimensions. Here we use exp, as box dimensions can be much bigger than the dimensions of the cell predicting the box. This is possible because the cell has already seen the whole of the image through its receptive field.
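    A worked numeric example of this decoding (the raw outputs, cell index, and anchor shape below are arbitrary illustrative numbers, not values from a trained model):

    ```python
    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    # Raw network outputs for one box (all zeros, chosen for an easy-to-check result)
    t_x, t_y, t_w, t_h = 0.0, 0.0, 0.0, 0.0
    cell_col, cell_row = 6, 4        # the cell responsible for this box
    anchor_w, anchor_h = 3.0, 3.0    # one of the anchor shapes, in cell units

    # Centre: sigmoid squashes the offset into [0, 1] of the cell, then add the cell index
    box_x = sigmoid(t_x) + cell_col  # 0.5 + 6 = 6.5 (in cell units)
    box_y = sigmoid(t_y) + cell_row  # 0.5 + 4 = 4.5

    # Size: exp lets the box grow beyond the cell, scaled by the anchor dimensions
    box_w = np.exp(t_w) * anchor_w   # 1.0 * 3.0 = 3.0
    box_h = np.exp(t_h) * anchor_h   # 1.0 * 3.0 = 3.0
    ```

    A raw output of 0 therefore decodes to "centre of my cell, exactly the anchor's shape", which is a sensible starting point for training.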

    The next few variables are used to calculate the mAP, the value we use to quantify how well our network is performing.
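    The IOU computation that feeds these variables can be sketched in isolation: convert centre/size boxes to corners, intersect, and divide by the union, exactly as the loss does:

    ```python
    import numpy as np

    def iou(box1, box2):
        """IOU of two boxes given as (cx, cy, w, h), mirroring the loss code above."""
        mins1  = np.array(box1[:2]) - np.array(box1[2:]) / 2.
        maxes1 = np.array(box1[:2]) + np.array(box1[2:]) / 2.
        mins2  = np.array(box2[:2]) - np.array(box2[2:]) / 2.
        maxes2 = np.array(box2[:2]) + np.array(box2[2:]) / 2.
        # clip the overlap to zero when the boxes do not intersect
        wh = np.maximum(np.minimum(maxes1, maxes2) - np.maximum(mins1, mins2), 0.)
        intersection = wh[0] * wh[1]
        union = box1[2] * box1[3] + box2[2] * box2[3] - intersection
        return intersection / union

    print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0 (identical boxes)
    ```

    Two identical boxes give an IOU of 1, disjoint boxes give 0, and partial overlaps fall in between.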

    Another interesting concept in play here is the use of Warm-Up training.

    YOLOv2 is an Object Localization algorithm. It not only tries to predict the object class, it also tries to predict the location of that object in the image.

    If we were to train for both simultaneously, we would have too many parameters to predict and learn at once. This is simplified by first letting the network stabilize on the object classification problem, and then training for localization. The warm-up training code does just that.

    In the "Finalize the loss" code section, also note that sparse_softmax_cross_entropy_with_logits is used only for loss_class, whereas for the others we use a squared error with reduce_sum.

    Parse Annotations


    generator_config = {
        'IMAGE_H'         : IMAGE_H, 
        'IMAGE_W'         : IMAGE_W,
        'GRID_H'          : GRID_H,  
        'GRID_W'          : GRID_W,
        'BOX'             : BOX,
        'LABELS'          : LABELS,
        'CLASS'           : len(LABELS),
        'ANCHORS'         : ANCHORS,
        'BATCH_SIZE'      : BATCH_SIZE,
        'TRUE_BOX_BUFFER' : 50,
    }
    train_imgs, seen_train_labels = parse_annotation(train_annot_folder, train_image_folder, labels=LABELS)
    ### write parsed annotations to pickle for fast retrieval next time
    #with open('train_imgs', 'wb') as fp:
    #    pickle.dump(train_imgs, fp)
    ### read saved pickle of parsed annotations
    #with open ('train_imgs', 'rb') as fp:
    #    train_imgs = pickle.load(fp)
    train_batch = BatchGenerator(train_imgs, generator_config, norm=normalize)
    valid_imgs, seen_valid_labels = parse_annotation(valid_annot_folder, valid_image_folder, labels=LABELS)
    ### write parsed annotations to pickle for fast retrieval next time
    #with open('valid_imgs', 'wb') as fp:
    #    pickle.dump(valid_imgs, fp)
    ### read saved pickle of parsed annotations
    #with open ('valid_imgs', 'rb') as fp:
    #    valid_imgs = pickle.load(fp)
    valid_batch = BatchGenerator(valid_imgs, generator_config, norm=normalize, jitter=False)

    This is straightforward code to parse the annotations, which are then used for label generation, calculating the loss, and validation.

    Callbacks and Training

    Setup a few callbacks and start the training

    early_stop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=3, mode='min', verbose=1)
    checkpoint = ModelCheckpoint('weights_coco.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min', period=1)
    tb_counter  = len([log for log in os.listdir(os.path.expanduser('~/logs/')) if 'coco_' in log]) + 1
    tensorboard = TensorBoard(log_dir=os.path.expanduser('~/logs/') + 'coco_' + '_' + str(tb_counter), histogram_freq=0, write_graph=True, write_images=False)
    optimizer = Adam(lr=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    #optimizer = SGD(lr=1e-4, decay=0.0005, momentum=0.9)
    #optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08, decay=0.0)
    model.compile(loss=custom_loss, optimizer=optimizer)
    model.fit_generator(generator        = train_batch, 
                        steps_per_epoch  = len(train_batch), 
                        epochs           = 100, 
                        verbose          = 1,
                        validation_data  = valid_batch,
                        validation_steps = len(valid_batch),
                        callbacks        = [early_stop, checkpoint, tensorboard], 
                        max_queue_size   = 3)

    Callback code consists of some utilities for logging and other purposes.

    We also define the optimizer (Adam) here.

    We then run the training for 100 epochs. Expect this to take around 14-25 days even on a Titan X or similar GPU!


    Perform detection on images

    dummy_array = np.zeros((1,1,1,1,TRUE_BOX_BUFFER,4))
    image = cv2.imread('images/giraffe.jpg')
    input_image = cv2.resize(image, (416, 416))
    input_image = input_image / 255.
    input_image = input_image[:,:,::-1]
    input_image = np.expand_dims(input_image, 0)
    netout = model.predict([input_image, dummy_array])
    boxes = decode_netout(netout[0], obj_threshold=OBJ_THRESHOLD, nms_threshold=NMS_THRESHOLD, anchors=ANCHORS, nb_class=CLASS)
    image = draw_boxes(image, boxes, labels=LABELS)

    The inference code finally lets us use the trained model to run detections. If you do not want to train and would rather run this section directly, make sure you skip the training above (use the fully trained network and neither train it nor randomize the layer weights).

    Below we show detection being performed on a few sample images.





    It also works on artworks!


    Dtvpz9dwkaare f

    Maxresdefault fish

    Yolo9000  examples

    YOLOv2 is awesome, but we must look at SSD as well, since YOLOv2 took inspiration from and made modifications to another awesome algorithm called SSD.



    SSD vs YOLOv1

    Source SSD Paper - Released in Dec 2015, SSD has turned into the de facto detection pipeline of many modern DNN object detectors, including YOLOv2. SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape.

    Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.

    SSD differs from YOLOv2 due to the usage of multiple layers that provide a finer accuracy on objects with different scales.

    Like YOLOv2, SSD uses a VGG or ResNet backbone, and then adds some extra conv layers and a loss function.

    One important point to notice is that after the image is passed through the VGG network, some conv layers are added, producing feature maps of sizes 19x19, 10x10, 5x5, 3x3, 1x1. These, together with the 38x38 feature map produced by VGG’s conv4_3, are the feature maps which will be used to predict bounding boxes. The conv4_3 is responsible for detecting the smallest objects while the conv11_2 is responsible for the biggest objects.

    YOLOv2 uses Anchor boxes, a concept inspired by SSD. As we learnt in YOLOv2, here too the model is trained to make two predictions for each anchor (in YOLOv2, we make 3 different types of predictions - what is the 3rd one?): 1. a discrete class prediction for each anchor; 2. 4 coordinates ($t_x, t_y, t_w,$ and $t_h$), a prediction of the offset by which the anchor needs to be shifted to fit the ground-truth bounding box.

    anchor Anchor boxes

    Consider the image titled "Anchor boxes" on the right side: observe that the cat has 2 boxes that match on the 8x8 feature map, but the dog has none. On the 4x4 feature map, however, there is one box that matches the dog.

    It is important to note that the boxes in the 8x8 feature map are smaller than those in the 4x4 feature map: SSD grabs several feature maps, each responsible for a different scale of objects, allowing it to identify objects across a large range of scales.

    SSD Paper Notes

    Please make sure you have covered Advance Concepts. If you haven't, then we recommend you do that right away!

    cats SSD Object Detection


    image Different aspect ratios are needed for detection

    image SSD predicts multiple boxes with various aspect ratio

    multiple scales Width and height calculations Predictions

    Additional Notes

    class conf xmin ymin xmax ymax
    2. 0.91 11.39 81.83 282.1 283.14
    15. 1. 124.41 2.4 215.08 159.87

    SSD Prior Boxes

    gif SSD Detections

    scales Scales

    Unlike YOLOv2, where we had 3 combined losses (figured out the third one yet?), in SSD we combine only the classification and the location-regression loss. But first, let's talk about anchor box scaling.

    Box Scaling

    SSD tiles the default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use $m$ feature maps for prediction. The scale of the default boxes in each feature map is computed as:

    $$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), k ∈ [1, m]$$

    where $s_{min}$ is 0.2 and $s_{max}$ is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, and all layers in between are regularly spaced.
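    Plugging in those defaults for $m = 6$ feature maps, the scales are easy to verify:

    ```python
    S_MIN, S_MAX, M = 0.2, 0.9, 6

    def scale(k):
        # s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), for k in [1, m]
        return S_MIN + (S_MAX - S_MIN) / (M - 1) * (k - 1)

    scales = [round(scale(k), 2) for k in range(1, M + 1)]
    print(scales)   # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
    ```

    Each of the 6 feature maps therefore gets one evenly spaced scale between 0.2 and 0.9 of the input image size.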

    Aspect Ratios

    SSD imposes different aspect ratios for the default boxes, denoted as:

    $$a_r ∈ \{1, 2, 3, \frac{1}{2}, \frac{1}{3}\}$$

    For the aspect ratio of 1, SSD also adds a default box whose scale is:

    $$s^\prime_k = \sqrt{s_k s_{k+1}}$$

    This results in a total of 6 anchor boxes per feature map location.

    The center of each anchor box is set to:

    $$\Big( \frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|} \Big)$$

    where $|f_k|$ is the size of the $k$-th square feature map.

    By combining predictions for all anchor boxes with different scales and aspect ratios from all locations of many feature maps, SSD has a diverse set of predictions, covering various input object sizes and shapes.
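    Putting the scale, aspect-ratio, and centre formulas together, here is a toy default-box generator for a single feature map (a sketch for illustration, not the reference SSD implementation):

    ```python
    import math

    def default_boxes(fk, sk, sk_next, ratios=(1., 2., 3., 1/2., 1/3.)):
        """Default boxes (cx, cy, w, h) for one fk x fk square feature map."""
        boxes = []
        for i in range(fk):
            for j in range(fk):
                # centre of each location: ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|)
                cx, cy = (i + 0.5) / fk, (j + 0.5) / fk
                # w = s_k * sqrt(a_r), h = s_k / sqrt(a_r) for each aspect ratio
                for ar in ratios:
                    boxes.append((cx, cy, sk * math.sqrt(ar), sk / math.sqrt(ar)))
                # extra square box for a_r = 1 with scale s'_k = sqrt(s_k * s_{k+1})
                s_prime = math.sqrt(sk * sk_next)
                boxes.append((cx, cy, s_prime, s_prime))
        return boxes

    boxes = default_boxes(fk=4, sk=0.2, sk_next=0.34)
    print(len(boxes))   # 4 * 4 * 6 = 96 boxes
    ```

    This makes the "6 anchor boxes per feature map location" count concrete: 5 aspect ratios plus one extra scale for the square box.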

    SSD Loss function

    $$L(x, c, l, g) = \frac{1}{N}\Big(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\Big)$$

    $$L_{loc}(x, l, g) = \sum_{i ∈ Pos}^N \sum_{m ∈ \{cx, cy, w, h\}} x_{ij}^k \ smooth_{L1}(l_i^m - \hat{g}_j^m)$$

    $\hat{g}_j^{cx} = \frac{(g_j^{cx} - d_i^{cx})}{d_i^w} \ \ \ \ \ \ \ \ \ \ \hat{g}_j^{cy} = \frac{(g_j^{cy} - d_i^{cy})}{d_i^h}$

    $\hat{g}_j^h = \log \left ( \frac{g_j^h}{d_i^h} \right ) \ \ \ \ \ \ \ \ \ \ \hat{g}_j^w = \log \left ( \frac{g_j^w}{d_i^w} \right )$
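    These offset equations can be sanity-checked with a round trip: encode a ground-truth box against a default box, then invert the transform to recover it (the boxes below are toy numbers):

    ```python
    import math

    def encode(g, d):
        """Offsets g-hat of ground-truth box g w.r.t. default box d, both (cx, cy, w, h)."""
        return ((g[0] - d[0]) / d[2],        # g-hat_cx = (g_cx - d_cx) / d_w
                (g[1] - d[1]) / d[3],        # g-hat_cy = (g_cy - d_cy) / d_h
                math.log(g[2] / d[2]),       # g-hat_w  = log(g_w / d_w)
                math.log(g[3] / d[3]))       # g-hat_h  = log(g_h / d_h)

    def decode(g_hat, d):
        """Inverse transform: recover the box from predicted offsets."""
        return (g_hat[0] * d[2] + d[0],
                g_hat[1] * d[3] + d[1],
                math.exp(g_hat[2]) * d[2],
                math.exp(g_hat[3]) * d[3])

    d = (0.5, 0.5, 0.2, 0.2)       # a default box
    g = (0.55, 0.45, 0.3, 0.25)    # a ground-truth box
    recovered = decode(encode(g, d), d)
    ```

    The log on width and height mirrors the exp decoding we saw in YOLOv2's pred_box_wh.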

    $L_{conf}(x, c) = - \sum_{i \epsilon \ Pos}^N x_{i, j}^p\log (\hat{c}_i^{p}) - \sum_{i \ \epsilon \ Neg} \log (\hat{c}_i^{0}) \\ \ \ \ \ \ where \hat{c}_i^{p} = \frac{exp(c_i^p)}{\sum_p exp(c_i^p)}$

    $$smooth_{L1}(x) = \begin{cases} 0.5x^2 & if \ |x| < 1 \\ |x| - 0.5 & otherwise \end{cases}$$

    The expression for the loss, which measures how far off our prediction “landed”, is:

    multibox_loss = confidence_loss + alpha * location_loss

    The confidence_loss, computed by $L_{conf}(x, c)$, is a simple softmax loss function between the actual label and the predicted label. The $\alpha$ term helps us balance the contribution of the location loss.

    You see two terms: $i \ \epsilon \ Pos$ and $i \ \epsilon \ Neg$. As we did in YOLOv2, we not only want to reward positive predictions, we want to penalize negative predictions as well. In SSD we regress to offsets for the center $(cx, cy)$ of the default bounding box $(d)$ and for its width $(w)$ and height $(h)$.

    In the location_loss, however, we only consider the positive prediction boxes.

    Although the L2 norm is more precise, the L1 norm is good for feature selection in high-dimensional spaces, as well as for prediction speed.
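    A quick numeric check of the smooth-L1 behaviour, quadratic near zero and linear further out:

    ```python
    def smooth_l1(x):
        # 0.5 * x^2 if |x| < 1, else |x| - 0.5, per the formula above
        return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

    print(smooth_l1(0.5))   # 0.125  (quadratic region: gentle gradients near zero)
    print(smooth_l1(2.0))   # 1.5    (linear region: large errors are not squared)
    ```

    The linear region is what makes the loss less sensitive to outliers than a pure L2 penalty.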


    Let's look at the code here: LINK

    And you can find one of the best descriptions of SSD at this link.