
Train Word2Vec Algorithm Using TensorFlow
TensorFlow is an open-source machine learning framework developed by Google. It is used with Python to implement algorithms, deep learning applications, and much more, both in research and in production. It includes optimization techniques that help perform complicated mathematical operations quickly.
This is because computations are expressed as operations on multi-dimensional arrays, in a style that interoperates with NumPy. These multi-dimensional arrays are also known as ‘tensors’. The framework supports deep neural networks, is highly scalable, and provides access to many popular datasets. It can use GPU computation and automates the management of resources.
The ‘tensorflow’ package can be installed on Windows using the following command:
pip install tensorflow
A tensor is the basic data structure used in TensorFlow. Tensors flow along the edges of a flow diagram, connecting the operations in it. This flow diagram is known as the ‘data flow graph’. A tensor is simply a multidimensional array or a list.
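To make the ‘multidimensional array’ idea concrete, here is a minimal sketch in plain Python, without TensorFlow itself. The helper name shape_of is hypothetical and introduced only for illustration, and it assumes the nested lists are rectangular (every row has the same length).

```python
def shape_of(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        # Descend into the first element; assumes all siblings have the same shape.
        nested = nested[0] if nested else None
    return tuple(shape)


scalar = 7                                   # rank 0: a single number
vector = [1.0, 2.0, 3.0]                     # rank 1: a list of numbers
matrix = [[1, 2, 3], [4, 5, 6]]              # rank 2: a list of rows
cube = [[[0, 1], [2, 3]],
        [[4, 5], [6, 7]],
        [[8, 9], [10, 11]]]                  # rank 3: a list of matrices

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    shape = shape_of(value)
    print(name, "has rank", len(shape), "and shape", shape)
```

The rank is the number of dimensions, and the shape lists the size of each one. Passing the same nested lists to tf.constant should give a tensor with the same shape.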
The code below downloads the text8 dataset, a cleaned extract of English Wikipedia text, and prepares it for training a Word2Vec model. It helps in understanding word embeddings. A word embedding is a representation of a word as a vector that captures the context of the word in a document, its relation to other words, its syntactic similarity, and so on. These word vectors can be learnt using the Word2Vec technique.
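As a rough illustration of how vectors can capture relations between words, the sketch below compares word vectors using cosine similarity. The three-dimensional vectors in toy_embeddings are made up for this example and are not learnt by any model; real Word2Vec embeddings would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only (not trained embeddings).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print("king vs queen:", round(sim_royal, 3))
print("king vs apple:", round(sim_fruit, 3))
```

Vectors that point in nearly the same direction score close to 1, so in this toy setup ‘king’ is closer to ‘queen’ than to ‘apple’.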
Following is an example that loads and preprocesses the data:
Example
from __future__ import division, print_function, absolute_import

import collections
import os
import random
import urllib.request  # 'import urllib' alone does not load the request submodule
import zipfile

import numpy as np
import tensorflow as tf

# Training parameters.
learning_rate = 0.11
batch_size = 128
num_steps = 3000000
display_step = 10000
eval_step = 200000

# Evaluation parameters.
# Note: the vocabulary keys built below are bytes, so these str words
# would need to be encoded before being looked up in word2id.
eval_words = ['eleven', 'the', 'going', 'good', 'american', 'new york']

# Word2Vec parameters.
embedding_size = 200         # Dimension of embedding vector.
max_vocabulary_size = 50000  # Total words in the vocabulary.
min_occurrence = 10          # Remove words that don't appear at least n times.
skip_window = 3              # How many words to consider from left and right.
num_skips = 2                # How many times to reuse the input to generate a label.
num_sampled = 64             # Number of negative examples that need to be sampled.

# Download the dataset if it is not already present.
url = 'http://mattmahoney.net/dc/text8.zip'
data_path = 'text8.zip'
if not os.path.exists(data_path):
    print("Downloading the dataset... (It may take some time)")
    filename, _ = urllib.request.urlretrieve(url, data_path)
    print("The data has been downloaded")

# Unzip the file and split the text into lower-case words (as bytes).
with zipfile.ZipFile(data_path) as f:
    text_words = f.read(f.namelist()[0]).lower().split()

# Build the vocabulary; 'RARE' gets a placeholder count of -1 for now.
count = [('RARE', -1)]
count.extend(collections.Counter(text_words).most_common(max_vocabulary_size - 1))

# Drop words that occur fewer than min_occurrence times.
# The list is sorted by frequency, so scan from the end and stop early.
for i in range(len(count) - 1, -1, -1):
    if count[i][1] < min_occurrence:
        count.pop(i)
    else:
        break

vocabulary_size = len(count)

# Assign an id to each word; 'RARE' gets id 0.
word2id = dict()
for i, (word, _) in enumerate(count):
    word2id[word] = i

# Convert the text to ids; words outside the vocabulary map to id 0.
data = list()
unk_count = 0
for word in text_words:
    index = word2id.get(word, 0)
    if index == 0:
        unk_count += 1
    data.append(index)
count[0] = ('RARE', unk_count)
id2word = dict(zip(word2id.values(), word2id.keys()))

print("Word count is :", len(text_words))
print("Unique words:", len(set(text_words)))
print("Vocabulary size:", vocabulary_size)
print("Most common words:", count[:8])
Code credit: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/tensorflow_v2/notebooks/2_BasicModels/word2vec.ipynb
Output
Word count is : 17005207
Unique words: 253854
Vocabulary size: 47135
Most common words: [('RARE', 444176), (b'the', 1061396), (b'of', 593677), (b'and', 416629), (b'one', 411764), (b'in', 372201), (b'a', 325873), (b'to', 316376)]
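The b'...' prefixes in the output show that the vocabulary keys are bytes, because the zip file contents are read as bytes and never decoded. The sketch below uses a small hypothetical dictionary, toy_word2id, as a stand-in for word2id to show why an ordinary string lookup would silently fall back to the ‘RARE’ id. The helper name lookup is also an assumption made for this illustration.

```python
# Hypothetical stand-in for word2id, with the same mix of key types.
toy_word2id = {"RARE": 0, b"the": 1, b"of": 2}


def lookup(word, word2id):
    """Return the id for a word, encoding str input to bytes first."""
    key = word.encode("utf-8") if isinstance(word, str) else word
    return word2id.get(key, 0)  # 0 is the id of the 'RARE' token


naive_id = toy_word2id.get("the", 0)   # the str "the" does not match the key b"the"
fixed_id = lookup("the", toy_word2id)  # encoded to b"the" before the lookup

print("naive lookup:", naive_id)
print("encoded lookup:", fixed_id)
```

One alternative would be to decode the text once after reading it, so that all keys are ordinary strings. The printed output would then show 'the' rather than b'the'.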
Explanation
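The required packages are imported and aliased.
The learning parameters, evaluation parameters, and Word2Vec parameters are defined.
The dataset is downloaded if it is not already present, then uncompressed and split into lower-case words.
Words outside the vocabulary, including those that occur fewer than ‘min_occurrence’ times, are mapped to the ‘RARE’ token, which has id 0. The value ‘-1’ is only a placeholder count that is later replaced by the number of rare words.
The words in the data file are iterated over and converted to ids, and the total number of words, the number of unique words, the size of the vocabulary, and the most common words are displayed on the console.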
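The example above stops at preprocessing, so the ‘skip_window’ parameter is defined but not yet used. As a minimal sketch of what it controls, assuming the usual skip-gram setup, the code below pairs each centre word with every neighbour within the window. The function name skipgram_pairs and the ids in toy_ids are hypothetical. A full implementation would typically sample only ‘num_skips’ context words per centre word rather than keep them all.

```python
def skipgram_pairs(token_ids, skip_window):
    """List (centre, context) pairs for every word within skip_window positions."""
    pairs = []
    for i, center in enumerate(token_ids):
        # Clip the window at the start and end of the sequence.
        start = max(0, i - skip_window)
        end = min(len(token_ids), i + skip_window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, token_ids[j]))
    return pairs


# Made-up word ids, standing in for a slice of the 'data' list.
toy_ids = [4, 1, 7, 2, 9]
pairs = skipgram_pairs(toy_ids, skip_window=1)
print(pairs)
```

Each pair is one training example: the model sees the centre word and is trained to predict the context word.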