A topic is interpreted by collecting the word vectors nearest to its topic vector. This work augments word2vec with topic modeling by training in a fashion similar to word2vec.
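As a concrete illustration of that interpretation step, here is a minimal PyTorch sketch (not the paper's own code; `word_emb` and `id_to_word` are hypothetical placeholders) that lists the words whose vectors lie closest to a chosen topic vector:

```python
import torch
import torch.nn.functional as F

def nearest_words(topic_vec, word_emb, id_to_word, top_n=10):
    """List the words whose vectors are closest to a topic vector.

    topic_vec:  (dim,)        one learned topic vector
    word_emb:   (vocab, dim)  the word-embedding matrix
    id_to_word: hypothetical mapping from row index to token string
    """
    sims = F.cosine_similarity(word_emb, topic_vec.unsqueeze(0), dim=1)
    top_ids = sims.topk(top_n).indices
    return [id_to_word[i.item()] for i in top_ids]
```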
The key difference in LDA2Vec is its loss function. There are two loss terms. The first is the Skipgram Negative Sampling (SGNS) loss, similar to the one in Word2Vec: it maximizes the probability of predicting a target word \( \vec w_j \) over non-target words (negative samples) given a context vector \( \vec c_j \). In other words, this loss trains the model to distinguish a positive word (one related to the given context) from negatively sampled words.
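A minimal sketch of that loss term, assuming PyTorch and a single (context, target) pair with k negative samples; this is illustrative, not the paper's original implementation:

```python
import torch
import torch.nn.functional as F

def sgns_loss(context, target_vec, negative_vecs):
    """Skip-gram negative sampling loss (minimization form) for one pair.

    context:       (dim,)    the context vector c_j
    target_vec:    (dim,)    embedding of the true nearby word w_j
    negative_vecs: (k, dim)  embeddings of k negatively sampled words
    """
    pos = F.logsigmoid(target_vec @ context)        # reward the true word
    neg = F.logsigmoid(-(negative_vecs @ context))  # penalize sampled noise words
    return -(pos + neg.sum())
```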
The innovation is the context vector \( \vec c_j \). The intuition is that predicting a nearby word given a pivot word also depends on the theme of the document. For example, if the document is about an airline and we want to predict nearby words given the word “Germany”, we would expect words related to airlines rather than other country names. Thus, a context vector is the sum of a word vector and a document vector: \( \vec c_j = \vec w_j + \vec d_j \). In this way, LDA2Vec attempts to capture both the document-wide theme and the local interaction between words within the context window.
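Composing the context vector is then just an element-wise sum of two embedding lookups. The sketch below assumes standalone word and document embedding tables (the names and sizes are illustrative); in LDA2Vec the document vector is itself a mixture of topics, as shown in the next sketch:

```python
import torch

dim, vocab_size, num_docs = 300, 50_000, 10_000   # illustrative sizes
word_emb = torch.nn.Embedding(vocab_size, dim)    # pivot word vectors w_j
doc_emb  = torch.nn.Embedding(num_docs, dim)      # document vectors d_j

def context_vector(pivot_id, doc_id):
    # c_j = w_j + d_j : local word signal plus document-wide theme
    return word_emb(pivot_id) + doc_emb(doc_id)
```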
In order to learn topic vectors, the document vector is further decomposed into a linear combination of topic vectors: \( \vec d_j = \sum_{k} p_{jk} \cdot \vec t_k \), where \( p_{jk} \) is the probability that document j is assigned to topic k. Finally, the interpretability comes from the sparsity of the topic assignment vector \( p_j \). One way to enforce sparsity is to add a term to the loss function:
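A sketch of that decomposition, again in PyTorch with hypothetical parameter names and sizes; a softmax turns free per-document weights into the assignment probabilities \( p_{jk} \):

```python
import torch
import torch.nn.functional as F

num_docs, num_topics, dim = 10_000, 20, 300   # illustrative sizes
doc_topic_logits = torch.nn.Parameter(torch.randn(num_docs, num_topics))
topic_vecs       = torch.nn.Parameter(torch.randn(num_topics, dim))

def document_vector(doc_id):
    p_j = F.softmax(doc_topic_logits[doc_id], dim=-1)  # p_jk, non-negative, sums to 1
    return p_j @ topic_vecs                            # d_j = sum_k p_jk * t_k
```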
\[L^{d} = \lambda \sum_{jk} (\alpha - 1)\log p_{jk}\]
When \( \alpha < 1 \), this term encourages the topic assignment probabilities to put most of their mass on a small set of topics.
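Written as a penalty to minimize alongside the SGNS loss sketched above (so the sign is flipped relative to the formula), this term could look as follows; the values of \( \lambda \) and \( \alpha \) are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def dirichlet_sparsity_penalty(doc_topic_logits, alpha=0.7, lam=1.0):
    """-L^d = -lambda * sum_jk (alpha - 1) * log p_jk, to be minimized.

    With alpha < 1 the coefficient on log p_jk is positive, so minimizing
    pushes some p_jk toward zero, i.e. sparse topic mixtures, traded off
    against the word-prediction loss.
    """
    p = F.softmax(doc_topic_logits, dim=-1)
    return -lam * (alpha - 1.0) * torch.log(p + 1e-12).sum()
```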
The results look interesting. This paper shows a simple way to combine topic modeling with word embeddings. By embedding document vectors and topic vectors in the same semantic space as word vectors, we can learn global semantic structure as well as word-level local interactions.