
A study on Neural Probabilistic Language Modelling


Andrej Karpathy recommended reading “A Neural Probabilistic Language Model” by Bengio et al., and Feynman famously noted that what he could not create, he did not understand. Paying attention to both of them seems wise.

So, this article works through Bengio’s paper, trying to deconstruct what it says and how its modelling choices help with learning a language model.

Abstract #

Recall the bigram language model we talked about in makemore pt 1. There, we assumed that the deterministic, count-based way of modelling the probability of letters in a word was the ideal way to generate words, and our single-layer neural network aimed to (and did) achieve the same level of word construction.

What we’re essentially searching for, through this approach, is the joint probability function for a sequence of words; that is the goal of statistical language modelling. However, while bigrams are good and trigrams better, higher-order n-grams are computationally expensive due to the curse of dimensionality.

We cannot reasonably compute the joint distribution for more than 3 characters at a time, because the number of context combinations we would need to count grows exponentially with the context length.

The paper discusses a method to learn a distributed representation for each word, along with a probability function for word sequences expressed in terms of those representations.

Introduction #

Coming back to the concept of the “curse of dimensionality”, let us take an example. If we wanted to model the joint distribution of 10 consecutive words with a vocabulary V of size 100_000, we’d end up with 100_000^10 - 1 free parameters, since every distinct combination of words has to be given its own probability.
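
To get a feel for that scale, a quick back-of-the-envelope calculation (the vocabulary size and context length are the paper’s own example):

```python
# Free parameters needed to tabulate the joint distribution of
# 10 consecutive words over a vocabulary of 100,000 words.
vocab_size = 100_000
context_length = 10

combinations = vocab_size ** context_length   # every possible 10-word sequence
free_parameters = combinations - 1            # probabilities sum to 1, so one is implied

print(f"{free_parameters:.3e}")   # ~1e50: far beyond anything we could count or store
```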

When building these joint probability models, one helpful trick is to dynamically shrink the context as words get decided: once a word has been fixed, the state space for the alternative words can be discarded, rather than computing all possibilities for all words at all times.

However, this too poses an issue: what if we come across a combination of words that we haven’t seen in the training corpus? A simple answer is to fall back on the probability predicted using a smaller context size, as is done in certain back-off trigram models.
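
A minimal sketch of that fall-back idea, with a toy corpus and none of the smoothing a real back-off model would use:

```python
from collections import Counter

# Toy corpus and raw counts; real back-off models use smoothed, discounted counts.
corpus = "the cat is walking in the bedroom".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams  = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
total    = len(corpus)

def prob(w, context):
    """P(w | context) with naive back-off: trigram -> bigram -> unigram."""
    w1, w2 = context
    if trigrams[(w1, w2, w)] > 0:
        return trigrams[(w1, w2, w)] / bigrams[(w1, w2)]
    if bigrams[(w2, w)] > 0:
        return bigrams[(w2, w)] / unigrams[w2]   # back off to a smaller context
    return unigrams[w] / total                   # last resort: unigram frequency

print(prob("walking", ("cat", "is")))   # seen trigram: uses the full context
print(prob("walking", ("dog", "was")))  # unseen context: falls back to the unigram
```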

Despite all of these hackfixes, we still suffer from two flaws in this approach:

  1. We aren’t taking into account contexts farther back than 1 or 2 words (because it’s hard to compute and store all of this; remember the 100_000^10 scale).
  2. No account is taken of the similarity between words.

The sentence “The cat is walking in the bedroom” is semantically quite similar to “A dog was running in a room”. But to a trigram model, which never sees the same surface words across the two sentences, that similarity isn’t even close to obvious.

The proposed approach attempts to do the following:

  1. Associate each word in the vocabulary with a distributed “feature vector”, to create a notion of similarity between words.
  2. Express the joint probability function of word sequences in terms of the feature vectors.
  3. Learn the word feature vectors and the parameters of that function.

The feature vector would represent different aspects of a word, with each word associated with a point in a vector space. The number of features would be much smaller than the size of the vocabulary, so the representation wouldn’t explode when computing the joint distribution.

For example, with something like 30 features per word, words that appear in similar contexts, such as “running”, “jogging”, “walking” or even “crawling”, would end up with nearby feature vectors, so a context that predicts one of them also assigns reasonable probability to the others.
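
To make that notion of similarity concrete, here is a toy illustration with made-up 4-dimensional feature vectors (far smaller than the ~30 mentioned above, and with invented values):

```python
import numpy as np

# Hypothetical feature vectors; in the paper these would be learned rows of the matrix C.
features = {
    "walking": np.array([0.9, 0.1, 0.3, 0.0]),
    "running": np.array([0.8, 0.2, 0.4, 0.1]),
    "bedroom": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(features["walking"], features["running"]))  # ~0.98: similar words sit close together
print(cosine(features["walking"], features["bedroom"]))  # ~0.10: dissimilar words sit far apart
```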

The Proposed Model #

The authors note that the training set used for the model is a large but finite source of data. The framed objective is to learn a good model f(w_t, …, w_{t-n+1}) = P(w_t | w_1^{t-1}), i.e. a good function that gives the probability of the next word given the words that precede it.

This is fair, and by working through the feature vectors, the model can assign probability to a whole range of possible next words without needing to have seen those exact word combinations. The one constraint placed on the model is that, for any choice of the previous words, the probabilities of all possible next words must sum to 1: \Sigma_{i=1}^{|V|} f(i, w_{t-1}, …, w_{t-n+1}) = 1. In other words, the probabilities over the vocabulary are normalized.

That is the basis of the model. The latter portions seem to discuss the implementation and decomposition of the functionality.

  1. A mapping C from any element i of V to a real vector C(i) \in R^m. This represents the “distributed feature vector” associated with each word. In practice, C is represented by a |V| x m matrix of free parameters.
  2. The probability function over words, expressed with C: a function g maps the sequence of feature vectors for the words in the context, (C(w_{t-n+1}), …, C(w_{t-1})), to a conditional probability distribution over V for the next word w_t. Each of the two parts, C and g, is associated with some parameters.

The parameters of C are the feature vectors themselves. The function g may be implemented by a feed-forward or recurrent neural network with parameters \omega, so the total parameter set is \theta = (C, \omega).

Training is done by looking for the \theta that maximizes the training corpus penalized log-likelihood, L = \frac{1}{T}\Sigma_t \log f(w_t, w_{t-1}, …, w_{t-n+1}; \theta) + R(\theta), where R(\theta) is a regularization term.

R is a weight decay penalty that is applied only to the weights of the NN and the C matrix, but not to the biases.
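
As a concrete (if anachronistic) sketch, this selective decay can be expressed in PyTorch with parameter groups; the layer sizes, learning rate and decay strength below are placeholders rather than the paper’s values:

```python
import torch
import torch.nn as nn

V, m, ctx, h = 5000, 30, 3, 100        # illustrative sizes, not the paper's

C      = nn.Embedding(V, m)            # the |V| x m feature-vector matrix
hidden = nn.Linear(ctx * m, h)         # H and d
output = nn.Linear(h, V)               # U and b (direct connections W omitted)

weights = [C.weight, hidden.weight, output.weight]
biases  = [hidden.bias, output.bias]

# Weight decay on the weights and on C, but not on the biases.
optimizer = torch.optim.SGD(
    [{"params": weights, "weight_decay": 1e-4},
     {"params": biases,  "weight_decay": 0.0}],
    lr=0.1,
)
```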

The authors state that the number of free parameters only scales linearly with |V|, the number of words in the vocabulary. I wasn’t sure why at first, but counting the parameters introduced below (b, W, U, d, H and C) suggests the reason: every matrix with a |V|-sized dimension contributes |V| times a factor that doesn’t depend on |V|, and nothing grows like |V|^2 or worse. They go on to mention that the scaling could be reduced further, to sub-linear, if more sharing structure were introduced: a time-delay NN or an RNN.
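
A rough count with illustrative sizes (not the paper’s experimental settings) makes the linear growth visible; the shapes follow the parameters b, W, U, d, H and C defined in the next few paragraphs, with a context of n-1 previous words:

```python
# Illustrative sizes only: vocabulary, n-gram order, feature dims, hidden units.
V, n, m, h = 20_000, 5, 60, 100

params_C = V * m              # feature-vector matrix C: |V| x m
params_W = V * (n - 1) * m    # optional direct connections: |V| x (n-1)m
params_U = V * h              # hidden-to-output: |V| x h
params_b = V                  # output biases
params_H = h * (n - 1) * m    # input-to-hidden: h x (n-1)m
params_d = h                  # hidden biases

total = params_C + params_W + params_U + params_b + params_H + params_d
# Every term involving V is V times something independent of V, so the
# total grows linearly as the vocabulary grows.
print(f"{total:,}")   # 8,044,100 for these sizes
```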

In the experiments, the NN generally has one hidden layer beyond the word-features mapping (C[X]) and, optionally, direct connections from the word features to the output (these are not used in Andrej’s model in Lecture 2). Thus there are really two hidden layers: the shared word-features layer C, which has no non-linearity, and the ordinary hyperbolic tangent hidden layer.

The neural network, at the end, computes a function with a softmax output layer: P(w_t | w_{t-1}, ... w_{t-n+1}) = \frac{e^{y_{w_t}}}{\Sigma_i e^{y_i}}

The y_i are the unnormalized log-probabilities for each output word i, and are computed with the following formula and parameters (b, W, U, d and H): y = b + Wx + U tanh(d + Hx)

where W is optionally zero (no direct connections), and x is the word-features-layer activation vector: the concatenation of the input word features taken from the matrix C, x = (C(w_{t-1}), C(w_{t-2}), ..., C(w_{t-n+1}))

x can also be considered as our “embeddings” for the words.

C is randomly initialized and holds our feature vectors, while w stands for a word; so indexing C[w] should give us the embedding for the word w.
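
Putting the pieces together, here is a minimal PyTorch sketch of the forward pass y = b + Wx + U tanh(d + Hx) followed by the softmax; the sizes and the random initialization are placeholders, and in reality every one of these tensors would be trained rather than left random:

```python
import torch

# Illustrative sizes: vocabulary V, feature dims m, a context of 3 words, hidden units h.
V, m, ctx, h = 5000, 30, 3, 100
g = torch.Generator().manual_seed(42)

# Randomly initialized parameters; C is the |V| x m feature-vector matrix.
C = torch.randn(V, m,       generator=g)
H = torch.randn(h, ctx * m, generator=g)
d = torch.randn(h,          generator=g)
U = torch.randn(V, h,       generator=g)
b = torch.randn(V,          generator=g)
W = torch.zeros(V, ctx * m)             # direct connections, "optionally zero"

def forward(context_ids):
    """P(w_t | context) for a single context of ctx word indices."""
    x = C[context_ids].view(-1)                 # concatenated feature vectors (the embeddings)
    y = b + W @ x + U @ torch.tanh(d + H @ x)   # unnormalized log-probabilities
    return torch.softmax(y, dim=0)              # softmax output layer

p = forward(torch.tensor([17, 42, 7]))  # three arbitrary word indices
print(p.shape, p.sum())                 # torch.Size([5000]), sum ~1.0
```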

After this, the authors list all the parameters that need to be updated and write down the stochastic gradient step that is performed for each of them on every training example.
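
Concretely, that update is a gradient ascent step on the log-likelihood, \theta \leftarrow \theta + \epsilon \frac{\partial \log P(w_t | w_{t-1}, …, w_{t-n+1})}{\partial \theta}, where \epsilon is the learning rate.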

Parallel implementation #

In this section, the authors talk about how the parameters scale and how the computational bottlenecks of their approach can be overcome. They propose two approaches, the latter of which is noted to be better.

  1. Processing different subsets of the data on different compute resources. (harder, and requires MPI and large shared-memory parallel computers)
  2. Processing different parts of the parameters in parallel during the stochastic gradient descent step. (easier, and parallelizable across different machines as long as a little information is shared between them)

The second approach leverages horizontal scaling, and while it performs a fair amount of redundant calculation, the overhead is acceptable and the performance is good.

Computation on a single processor #

  1. Forward Phase
     a. Perform the forward computation for the word features layer.
     b. Perform the forward computation for the hidden layer. (I don’t understand this. Why are we doing a forward computation without actually knowing all the parameters in d and H?)
     c. Perform the forward computation for the output units in the i-th block. I’m unsure here as to what’s happening. Are they trying to compute the predicted output here, and also the probability?
     d. Compute and share S = \Sigma_i s_i among the processors (s_i being the sum of the exponentiated outputs e^{y_j} in block i).
     e. Normalize the probabilities using S: p_j <- p_j / S. (sketched in code after this list)
     f. Update the log-likelihood.

  2. Backward/Update Phase
     a. Perform the backward gradient computation for the output units.
     b. Sum and share the derivatives across processors.
     c. Backpropagate through and update the hidden layer weights.
     d. Update the word feature vectors for the input words.
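
The forward-phase normalization (steps d and e above) is what makes the parameter-parallel split work: each processor only exponentiates its own block of output units, and only the partial sums are exchanged. A minimal NumPy sketch of just that part, with a toy vocabulary split across three pretend “processors”:

```python
import numpy as np

rng = np.random.default_rng(0)
V, n_procs = 12, 3                  # tiny vocabulary split across 3 "processors"
y = rng.normal(size=V)              # unnormalized log-probabilities from the network

blocks = np.array_split(np.arange(V), n_procs)

# Each processor exponentiates only its own block and computes a partial sum s_i.
partial_exp  = [np.exp(y[idx]) for idx in blocks]
partial_sums = [e.sum() for e in partial_exp]

# The partial sums are shared among the processors and added up: S = sum of s_i.
S = sum(partial_sums)

# Each processor normalizes its own block: p_j <- p_j / S.
p = np.concatenate([e / S for e in partial_exp])

assert np.allclose(p, np.exp(y) / np.exp(y).sum())  # identical to a full softmax
print(p.sum())                                      # ~1.0
```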

Fin #

The rest of the paper journals the results, comparisons and statistics from testing against other models using perplexity. It then details future work and the improvements that would need to be made for the model to perform well on larger data.