In this post, we're describing the equations, variable dimensions and names that we will use in this series of tutorials on Recurrent Neural Networks. We assume you are familiar with the functionality of these networks and will only focus on the implementations side. This post includes:

  1. Introduction to RNN structure
  2. RNN Variables and their Dimension
  3. How to Implement RNN in TensorFlow?

1. Introduction to RNN:

Let's start with a simple RNN. Following figure depicts its structure. This structure includes all the required parameters.


Fig1. Sample RNN structure (Left) and its unfolded representation (Right)

Given the above figure and the defined variables, RNN equations are as follows:

  • Updating the hidden states: $\mathbf{h}_t=tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1}+\mathbf{W}_{hx}\mathbf{x}_t+\mathbf{b}_h)$
  • Getting the outputs: $\mathbf{o}_t=g(\mathbf{W}_{oh}\mathbf{h}_t+\mathbf{b}_o)$ where $g$ is the nonlinear functions and depends on the task; for example, if we're using RNN for predicting the price of bitcoin (which is a single value and can take any value), there is no need to use $g$. In the case of classification task, $g$ will be the $softmax$ function.

2. RNN Variables and their Dimension:

Now let's introduce each of the variables and their dimension:

2.1. T: sequence length (i.e. number of times we unroll the RNN network).

One nice property of recurrent networks is that they can be used for inputs of any length. For example, you might want to use RNN for classifying sentences to a sentence with either positive or negative sentiment. Since your input sentences will be of different length, you must be able to unfold your network as many times as required for each input. Therefore, for each specific input, T will be a different value.

In some of the problems, we might have inputs of the same length (e.g. classifying MNIST digits where all images are of the same size). In this case, we can simplify the codes a little bit. We'll see this in the next tutorials.

  • In our series of tutorials, we'll define a variable called seqLen (short for sequence length)
  • We also need to define another variable which determines the maximum length of sequences in our data. we named this variable seq_max_len (short for sequence maximum length).

2.2. X: input

This is the matrix of input values and the input to RNNs can be almost anythin; from image to sentence, word, character, etc. In Tensorflow, we often build it as a matrix of shape [batch_size, seq_max_len, input dim] where:

  • batch_size: The batch size; number of samples in one batch
  • seq_max_len: sequence maximum length; explained above
  • input_dim: input dimension; the number of features. For example, if we have 10,000 words in our vocabulary and we're trying to build a language model, input_dim=10,000.

*Note: As discussed, inputs to RNN might have different lengths, so one might ask how can we fit all of them in an input array of shape [batch_size, seq_max_len, input dim] which requires all inputs to have a fix length of seq_max_len?

Well, good question and the answer is through zero padding! meaning that you need to take samples with length lower than seq_max_len and add zeros to it to increase its length to seq_max_len. For example, if you receive an input of length 13 where the seq_max_len is set to 20, you need to add 7 zeros to it.

Now you might ask padding with zeros will effect the outputs of the next steps (the ones that we don't care about; the output to the 7 zeros in our example), and will also update the hidden states! This is what we need to avoid and the answer is in using Dynamic_rnn in TensorFlow. We'll cover this point in the next tutorial titld: Static vs. Dynamic RNNs.

*Note2: X includes a batch of sequences where each sequence is of size seq_max_dim (all inputs zero padded). Therefore, each sample in the sequence ($\mathbf{x}_t$) is input_dim dimensional (i.e. $\mathbf{x}_t \in \mathcal{R}^{input\_dim \times 1}$)

2.3. Number of hidden units:

This is the number of hidden units of the RNN; i.e. the number of features extracted from each sample of the sequence. We'll define it a variable named num_hidden_units in our tutorials.

2.4. h: hidden states

This is the vector of hidden states of the RNN. Given num_hidden_units, $\mathbf{h}_t \in \mathcal{R}^{num\_hidden\_units \times 1}$ is the vector of hidden states associated to input $\mathbf{x}_t \in \mathcal{R}^{input\_dim \times 1}$.

2.5. o: output

This is the vector of outputs and is $ \in \mathcal{R}^{out\_dim \times 1}$ where out_dim is the output dimension. For example, if we're to classify digits from 0 to 9 (i.e. 10 classes), out_dim=10. Depending on the problem, you might either compute the outputs at each step or not. Following figure shows the most common strategies used when using RNNs for different tasks:


Fig2. Examples of the most common strategies to feed inputs and outputs when using RNNs

2.6. All Weights and biases:

Let's check the dimension of the required parameters for a simple RNN.

  • $\mathbf{W}_{hx} \in \mathcal{R}^{num\_hidden\_units \times input\_dim}$

  • $\mathbf{W}_{hh} \in \mathcal{R}^{num\_hidden\_units \times num\_hidden\_units}$

  • $\mathbf{b}_{h} \in \mathcal{R}^{num\_hidden\_units \times 1}$

  • $\mathbf{W}_{oh} \in \mathcal{R}^{out\_dim \times num\_hidden\_units}$

  • $\mathbf{b}_{o} \in \mathcal{R}^{out\_dim \times 1}$

3. How to Implement in TensorFlow?

Now it's time to see how we can write the code to create a recurrent network. Thanks to TensorFlow, everything is almost ready and you need to write only a very few lines of code and is pretty much similar for all types of recurrent networks, RNNs, GRUs, and LSTMs. Here, we'll just explain the whole strategy and leave the details to the next tutorials.

The 3 main steps in creating a recurrent network are:

3.1. create a cell:

This part creates the cell; including the required parameters ($\mathbf{W}{hx}, \mathbf{W}{hh}, \mathbf{b}_{h}), their initializers, the desired activation function (tanh by default), etc. This can be done through the following command:

In [ ]:
import tensorflow as tf

rnn_cell = tf.nn.rnn_cell.BasicRNNCell(num_hidden_units)   # for RNN cell
gru_cell = tf.nn.rnn_cell.GRUCell(num_hidden_units)        # for GRU cell
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden_units) # for LSTM cell

depending on the type of the cell, you might want to specify some other details; for example, you can change the activation from default (tanh) to ReLu by:

In [ ]:
rnn_cell = tf.nn.rnn_cell.BasicRNNCell(num_hidden, activation=tf.nn.relu)

3.2. create a recurrent network from the specified cell:

Next, we create the whole network, get the input and cell, run all the computations, and get the hidden states ($\mathbf{h}_t$) and cell states in the case of LSTM ($\mathbf{c}_t$). A simple example is:

In [ ]:
ht, last_states = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)

where ht includes the hidden states for all of the steps (all t) and is of shape [batch_size, seq_max_len, num_hidden_units].

last_states includes the last states $\mathbf{h}_T$ (states at t=T or t=seq_max_len) and is of shape [batch_size, num_hidden_units]. In the case of LSTM, it will also include the last cell states ($\mathbf{c}_T$) which has the same shape as $\mathbf{h}_T$.

3.3. Manage the outputs:

Now that we have all the hidden states, we can easily generate the required weight and bias paraameters $\mathbf{W}_{oh}$ and $\mathbf{b}_{o}$ and get the output $\mathbf{o}_{t}$ at any desired t.

Thanks for reading! As mentioned, this tutorial is just a short introduction to parameters and commands that are useful for creating a RNN. In the next tutorial, we'll talk about a very important concept; "what is static and dynamic RNN in TensorFlow?"

So please continue reading and feel free to leave a comment in our website if you have any question, doubt or suggestion.

© 2018 Easy-TensorFlow team. All Rights Reserved.