Deep Learning

Aniket Patil
Mar 7, 2020

Deep learning has seen significant success in recent years, with applications such as speech recognition, image processing, and language translation. Deep neural networks generally refer to neural networks with many layers and a large number of neurons, often layered in a way that is not domain specific.

The availability of compute power and large amounts of data has made these large structures very effective at learning hidden features along with data patterns. Convolutional neural networks (CNNs) have found success in image recognition problems.

LSTMs (long short-term memory networks) are used extensively in natural language processing, language translation, video games (as the policy network in reinforcement learning), and many other complex machine learning problems that require decisions based on the history of past inputs.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are neural network models best suited for processing sequential data. Sequence data appears in many problem domains, either as time series or simply as data whose order carries repetitive structure.

Sequence data can be seen in DNA, where a single sequence can contain thousands of nucleotide bases. For example, the DNA in the largest human chromosome consists of roughly 220 million base pairs that contain repeating sub-patterns. Speech recognition requires modelling time series data with individual speech patterns. Natural language processing (NLP) is an area where the surrounding sequence is critical for identifying the right word.

Simple feedforward neural network models assume that the input vectors are independent of each other. Such a network considers only the current input vector and applies the model to it to classify or predict the output. This approach ignores much of the contextual information needed for accurate detection or prediction.

Recurrent Neural Network (RNN) models solve the problem of sequence modelling by providing the ability to remember past data and to process the current vector based on the inputs that come before it in the sequence (or, in bidirectional variants, after it).

The Recurrent Neural Network (RNN) architecture makes the construction of such networks simple and efficient. As the term recurrent implies, the same computation is repeated at every step of the sequence, with the task defined only once. An intuitive view is that past inputs are memorized as weighted values and combined with the current input for prediction.
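To make the recurrence concrete, here is a minimal sketch of a vanilla RNN cell stepping over a sequence in Python/NumPy (the dimensions, weight names, and random inputs are assumptions for illustration, not from the original text):

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a simple (vanilla) RNN cell over a sequence.

    The same weights W_xh, W_hh, and b_h are reused at every time step;
    the hidden state h carries a weighted memory of past inputs forward.
    """
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    hidden_states = []
    for x_t in x_seq:                    # one step per sequence element
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# Example: a sequence of 5 input vectors of dimension 3, hidden size 4
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(4, 4))
b_h = np.zeros(4)
states = rnn_forward(x_seq, W_xh, W_hh, b_h)
```

Note that the loop body, the task definition, appears only once, yet it is applied at every position in the sequence.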

Backpropagation in RNN

In an RNN, the total loss is simply the sum of the losses at each time step, since the network produces an output at every step of the sequence.

Backpropagation uses gradient descent to adjust the weights during training. To compute how the loss changes with the weights, a partial derivative is taken using the chain rule, propagating the error backwards through every time step (backpropagation through time).
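In symbols (notation assumed here for illustration), with L_t the loss and h_t the hidden state at time step t, backpropagation through time expands the gradient as a sum over steps, each term containing a product of per-step Jacobians:

```latex
L = \sum_{t=1}^{T} L_t,
\qquad
\frac{\partial L}{\partial W}
  = \sum_{t=1}^{T} \sum_{k=1}^{t}
    \frac{\partial L_t}{\partial h_t}
    \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
    \frac{\partial h_k}{\partial W}
```

The product of Jacobians over many steps is what makes the gradient shrink or grow exponentially, as discussed below.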

Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent.
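As a toy illustration (the function and learning rate are assumptions), minimizing f(w) = (w - 3)^2 with plain gradient descent:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
learning_rate = 0.1
for step in range(100):
    grad = 2.0 * (w - 3.0)         # derivative of the loss with respect to w
    w = w - learning_rate * grad   # step in the direction of steepest descent
print(w)                           # approaches 3.0, the minimizer
```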

If the per-step error gradients are small, the overall gradient shrinks exponentially fast as the sequence length increases. This makes it difficult to capture long-term dependencies between sequence elements.

It is also possible that, under some initialization conditions and activation functions, the derivatives are large, which leads to the exploding gradient problem. Exploding gradients occur when large error gradients accumulate and result in very large updates to the neural network weights during training.

This is usually dealt with by setting an upper threshold that limits (clips) the size of the gradient during propagation.
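A minimal sketch of such gradient clipping by norm, assuming NumPy gradients and an arbitrarily chosen threshold of 5.0:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient if its norm exceeds max_norm (gradient clipping)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50, well above the threshold
print(clip_by_norm(g))       # rescaled to norm 5: [3. 4.]
```

Deep learning frameworks ship equivalents of this, for example torch.nn.utils.clip_grad_norm_ in PyTorch.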

Vanishing Gradients:

The sequence length in an RNN can be large, and deriving context for long-range dependencies can introduce errors. For example, interactions between words from different parts of a sentence can change the meaning of the sentence.

In the sentence “Chris lost balance when he jumped to catch the ball and fell on his back,” the pronoun “his” identifies Chris as a short name for a male. The same sentence with “her” as the pronoun would be entirely valid, making Chris a short name for a female. Here, the context for Chris depends on a pronoun that occurs more than ten words later in the sentence.

The weight updates, computed from derivatives of the error, get smaller as the distance along the sequence increases. The effect of the pronoun on the propagated gradients therefore becomes vanishingly small.

It is also possible for the gradient computation to become very large, depending on the activation function and the weights, causing overflow and NaN values in the gradients. A simple RNN cannot capture such long-term dependencies and will fail at modelling them.
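A quick numerical illustration of both failure modes, under the simplifying assumption that backpropagation multiplies the gradient by roughly the same factor at every time step:

```python
steps = 50
for factor in (0.9, 1.1):          # per-step gradient scaling factor
    grad = 1.0
    for _ in range(steps):
        grad *= factor             # repeated multiplication across time steps
    print(factor, grad)
# 0.9 -> ~0.005 (vanishing: long-range influence becomes negligible)
# 1.1 -> ~117   (exploding: updates blow up and can overflow to inf/NaN)
```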

Exploding gradients can be handled simply by capping the gradient at an upper bound or adjusted maximum. By restricting the maximum value, the propagation of the gradients is kept under control.

Vanishing gradients can be addressed by using the ReLU activation instead of tanh or sigmoid, and by initializing the weights in a way better suited to the context.
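The intuition behind the ReLU suggestion is visible in the derivatives: tanh saturates, so its derivative approaches zero for large inputs, while ReLU keeps a derivative of exactly one for all positive inputs. A minimal sketch (the input values are chosen arbitrarily):

```python
import numpy as np

x = np.array([-4.0, -1.0, 0.5, 4.0])
dtanh = 1.0 - np.tanh(x) ** 2      # ~0.001 at |x| = 4: the gradient saturates
drelu = (x > 0).astype(float)      # exactly 1 for positive inputs, 0 otherwise
print(dtanh)
print(drelu)
```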

LSTM and GRU networks better address these problems.

LSTM:

Long short-term memory (LSTM) networks were introduced to address the problem of long-term dependency modelling in RNNs. LSTM networks address the vanishing gradient problem with a memory cell and gates that control the influence of previous time steps on the current state and the output at the current step.
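For reference, one standard formulation of the LSTM cell (notation assumed here), with input, forget, and output gates i_t, f_t, o_t, memory cell c_t, and output h_t:

```latex
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
```

Because the cell state c_t is updated additively, gradients can flow through it across many time steps with far less attenuation than in a plain RNN.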

LSTM networks have been very successful at modelling NLP problems that demand good long-range dependency handling. Language modelling for translating one language into another, suggesting the next word while typing, and speech recognition are some of the applications of LSTMs.

Intuitively, the memory cell controls whether the current input should influence the state over the long term. There are multiple variations of LSTM networks that have shown promise for certain types of modelling.

For example, peephole connection models let the gates look at the cell state directly, rather than only the output (activation) of the cell. One successful variation is the gated recurrent unit (GRU), which reduces the number of parameters of the LSTM model.

GRU:

Gated recurrent unit (GRU) networks were introduced to simplify LSTM network models, with fewer parameters to compute when building the network.

GRU models have been very successful at modelling NLP problems and are used extensively in RNN models.

GRU models use a relevance (reset) gate and an update gate in place of the input, forget, and output gates of the LSTM network.

The GRU computes a relevance (reset) term and uses it to decide how much of the previous state enters the candidate update. The update gate, similar to the LSTM forget gate, controls how much of the previous state is carried forward, allowing the output computed from the sequence to capture long-term dependencies.
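One common formulation of the GRU update (notation assumed here; the complementary convention, with z_t and 1 - z_t swapped, also appears in the literature), with reset/relevance gate r_t and update gate z_t:

```latex
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```

With only two gates and no separate memory cell, the GRU has fewer parameters than the LSTM while keeping gated control over how much history is carried forward.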

One issue with LSTM and GRU networks is that they are sequential models: information has to propagate through every time step to capture long-term dependencies.

If the dependent elements are many steps apart, for example in DNA sequence modelling with repetitive patterns, models based on LSTM networks become very slow and require trial and error.

To improve on these models, a newer technique, the hierarchical neural attention encoder, is being considered. This technique creates a context vector based on the past sequence data, which makes looking back at earlier elements of the sequence easier without having to pass through every intermediate term. This is an evolving field, and significant advancements in performance can be expected in the future.

Self-organizing maps (SOMs):

These are neural network models based on unsupervised learning. SOMs are used to convert high-dimensional data into a lower-dimensional representation that preserves common features.

SOMs are feedforward networks whose organization of neurons is distinct from that of regular neural networks.

SOMs are used to extract features from high-dimensional input data, similar to principal component analysis (PCA); however, unlike PCA, SOMs can perform non-linear transformations, providing a more accurate mapping than PCA in most cases.

SOMs use competitive learning, in which the input vector is used to choose a winner neuron from a set of neurons. The winner neuron is then activated, and its weights (along with those of its neighbours) are adjusted to move closer to the input, so that similar inputs end up organized onto nearby neurons in the network.

This organizes the neurons according to input features in an unsupervised fashion.
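A minimal sketch of one competitive-learning update for a one-dimensional SOM, assuming NumPy and arbitrarily chosen sizes, learning rate, and neighbourhood width:

```python
import numpy as np

def som_train_step(weights, x, learning_rate=0.1, sigma=1.0):
    """One competitive-learning update for a 1-D SOM (illustrative sketch)."""
    # 1. Winner selection: the neuron whose weight vector is closest to the input.
    distances = np.linalg.norm(weights - x, axis=1)
    winner = np.argmin(distances)

    # 2. Neighbourhood function: neurons near the winner on the map grid
    #    are pulled toward the input more strongly than distant ones.
    grid_positions = np.arange(len(weights))
    neighbourhood = np.exp(-((grid_positions - winner) ** 2) / (2 * sigma ** 2))

    # 3. Weight update: move each neuron toward the input, scaled by its
    #    neighbourhood value, so similar inputs map to nearby neurons.
    weights += learning_rate * neighbourhood[:, None] * (x - weights)
    return weights

rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 3))   # 10 map neurons, 3-dimensional inputs
x = rng.normal(size=3)
weights = som_train_step(weights, x)
```

In a full training run this step is repeated over many inputs while the learning rate and neighbourhood width shrink, which is what produces the low-dimensional, topology-preserving map.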

Applications of SOM include the analysis of financial stability, behavioral pattern detection in time series data in data centers, and mapping of different patterns in atmospheric sciences and in genomic studies in biodiversity and ecology.
