Recently Quora put out a question similarity competition on Kaggle. This was the first time I attempted an NLP problem, so there was a lot to learn. The one thing that blew my mind was word2vec embeddings.
Till now, whenever I heard the term word2vec, I visualized it as a way to create a bag-of-words vector for a sentence.
For those who don't know bag of words: suppose we have a series of sentences (documents) such as "This is good", "This is bad" and "This is awesome". Bag of words would encode them using the vocabulary 0:This 1:is 2:good 3:bad 4:awesome, so each sentence becomes a vector of word counts over that vocabulary.
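Just to make that concrete, here is a tiny sketch of a bag-of-words encoder (the vocabulary and example sentences are the toy ones from above):

vocab = ['this', 'is', 'good', 'bad', 'awesome']  # 0:This 1:is 2:good 3:bad 4:awesome

def bag_of_words(sentence):
    # Count how many times each vocabulary word appears in the sentence.
    words = sentence.lower().split()
    return [words.count(v) for v in vocab]

print(bag_of_words('This is good'))     # [1, 1, 1, 0, 0]
print(bag_of_words('This is awesome'))  # [1, 1, 0, 0, 1]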
But word2vec is much more powerful than that.
What word2vec does is create vectors for words. What I mean is that we have a 300-dimensional vector for every word (common bigrams too) in a dictionary.
We can use this for multiple scenarios, but the most common are:
A. Using word2vec embeddings we can find the similarity between words. Assume you have to answer whether two statements that use different words signify the same thing. If we use a surface-level sentence similarity metric or a bag-of-words approach to compare the two sentences, we will get a pretty low score. But with word embeddings we can see that the individual words mean roughly the same thing, so the two sentences are similar.
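For instance, once the gensim model is loaded (we do that further down), word-level similarity is a one-liner; the word pair below is just an illustration:

model.similarity('cook', 'prepare')  # cosine similarity: closer to 1 means the words appear in similar contexts
model.similarity('cook', 'france')   # a score near 0 means the words are unrelated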
B. Encode sentences: I read a post from Abhishek Thakur, a prominent Kaggler (must read). What he did was use these word embeddings to create a 300-dimensional vector for every sentence.
His approach: let's say the sentence is "What is this", and let's say the embedding of every word is given in 4 dimensions (normally a 300-dimensional embedding is used), e.g. What = [0.25, 0.25, 0.25, 0.25], is = [1, 0, 0, 0], this = [0.5, 0, 0, 0.5].
Then the vector for the sentence is the normalized elementwise addition of the word vectors, i.e.
Elementwise addition: [0.25+1+0.5, 0.25+0+0, 0.25+0+0, 0.25+0+0.5] = [1.75, 0.25, 0.25, 0.75]
divided by its L2 norm
math.sqrt(1.75^2 + 0.25^2 + 0.25^2 + 0.75^2) ≈ 1.94
gives: [0.90, 0.13, 0.13, 0.39]
Thus I can convert any sentence into a vector of a fixed dimension (decided by the embedding). To find the similarity between two sentences, I can use a variety of distance/similarity metrics.
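A minimal sketch of that sentence encoder, assuming model is the gensim model we load below (this is my paraphrase of the idea, not Abhishek's exact code):

import numpy as np

def sent2vec(sentence, model):
    # Look up the vector of every word that exists in the model's vocabulary.
    vectors = [model[w] for w in sentence.lower().split() if w in model]
    v = np.sum(vectors, axis=0)          # elementwise addition
    return v / np.sqrt((v ** 2).sum())   # divide by the L2 norm

Cosine or Euclidean distance between two such vectors then gives a sentence-level similarity.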
C. It also enables us to do algebraic manipulations on words, which was not possible before. For example: what is king - man + woman?
Guess what it comes out to be: Queen
Now let's get down to the coding part, since we know a bit of the fundamentals.
First of all we download the pretrained Google News word embeddings. There are many other pretrained embeddings too.
wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
The above file is pretty big, so the download might take some time. Then, moving on to the code.
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (this takes a few minutes and several GB of RAM).
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)
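A quick sanity check that every word really maps to a 300-dimensional vector ('king' is just an arbitrary word):

print(model['king'].shape)  # (300,)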
# The words closest to 'python' in the embedding space:
model.most_similar('python')
What is king - man + woman?
model.most_similar(positive=['king', 'woman'], negative=['man'])
You can do plenty of freaky/cool things using this:
model.most_similar(positive=['emma', 'he', 'male', 'mr'], negative=['she', 'mrs', 'female'])
model.doesnt_match("math shopping reading science".split(" "))
I think shopping doesn't belong in this list!
In this paper, the authors show that item-based collaborative filtering can be cast in the same framework as word embeddings.
Library - Books = Hall
Obama + Russia - USA = Putin
Iraq - Violence = Jordan
President - Power = Prime Minister (Not in India Though)
Is this model sexist?
model.most_similar(positive = ["donald_trump"],negative = ['brain'])
Whatever it is doing, it surely feels like magic. Next time I will try to write more about how it works, once I understand it fully.