pip install git+https://github.com/JustGlowing/obscure_words

We can now import the dictionary and create a vectorial representation of each word:
import matplotlib.pyplot as plt
import numpy as np
from obscure_words import load_obscure_words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Load the dictionary: keys are the words, values are their definitions
obscure_dict = load_obscure_words()
words = np.array(list(obscure_dict.keys()))
definitions = np.array(list(obscure_dict.values()))

# One Tf-Idf vector per definition (sparse matrix)
vectorizer = TfidfVectorizer(stop_words=None)
X = vectorizer.fit_transform(definitions)

# Project to 2 dimensions; init="random" because recent scikit-learn
# versions do not support the default PCA initialization on sparse input
projector = TSNE(init="random", random_state=0)
XX = projector.fit_transform(X)

In the snippet above, we compute a Tf-Idf representation using the definition of each word. This gives us a vector for each word in our dictionary, but each of these vectors has as many elements as the total number of distinct words used in all the definitions. Since we can't plot all the features extracted, we reduce our data to 2 dimensions using t-SNE. We now have a mapping that allows us to place each word at a point in a two-dimensional space. There's one problem remaining: how can we plot the words in a way that we can still read them? Here's a solution:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

def textscatter(x, y, text, k=10):
    X = np.array([x, y]).T
    # Cluster the standardized points into k groups
    clustering = KMeans(n_clusters=k)
    scaler = StandardScaler()
    clustering.fit(scaler.fit_transform(X))
    # Map the cluster centers back to the original coordinates
    centers = scaler.inverse_transform(clustering.cluster_centers_)
    # For each center, pick the index of the closest point
    selected = np.argmin(pairwise_distances(X, centers), axis=0)
    # Draw all the points faintly, colored by cluster
    plt.scatter(x, y, s=6, c=clustering.predict(scaler.transform(X)), alpha=.05)
    # Label only the selected points
    for i in selected:
        plt.text(x[i], y[i], text[i], fontsize=10)

plt.figure(figsize=(16, 16))
textscatter(XX[:, 0], XX[:, 1], [w+'\n'+d for w, d in zip(words, definitions)], 20)
plt.show()

In the function textscatter we segment all the points created in the previous step into k clusters using K-Means, then we plot the word closest to the center of each cluster (along with its definition). Since K-Means centers tend to be spread apart from each other, with the right choice of k we can maximize the number of words we can display. This is the result of the snippet above:
(click on the figure to see the entire chart)
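To make the label selection in textscatter easier to follow, here is a minimal sketch of that single step on toy data. The points and centers below are made up for illustration (in textscatter the centers come from K-Means), and the distances are computed with plain NumPy rather than pairwise_distances.

```python
import numpy as np

# Hypothetical toy data: six 2-D points and two cluster centers
# (in textscatter the centers are produced by K-Means).
points = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
    [10.0, 10.0], [11.0, 10.0], [10.0, 12.0],
])
centers = np.array([[0.3, 0.3], [10.3, 10.6]])

# Euclidean distance from every point to every center, shape (6, 2):
# one row per point, one column per center.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# For each center (column), the index of the closest point (row).
selected = np.argmin(distances, axis=0)
print(selected)
```

Taking the argmin along axis 0 yields one index per center, so at most k labels are drawn no matter how many points are in the plot.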
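Going back to the Tf-Idf step, the claim that every vector has one element per distinct word can be checked by hand. The sketch below uses three made-up definitions and a simplified weighting (raw term count times log(N / document frequency)); it is not the exact formula used by TfidfVectorizer, which applies smoothing and normalization, but the vector length behaves the same way.

```python
import math

# Hypothetical toy definitions, standing in for the dictionary entries.
definitions = [
    "a fear of long words",
    "a love of long walks",
    "the fear of being watched",
]

tokenized = [d.split() for d in definitions]
# The vocabulary is the set of distinct words across all definitions.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

def tfidf_vector(doc):
    """Simplified Tf-Idf: raw term count times log(N / document frequency)."""
    n_docs = len(tokenized)
    vector = []
    for term in vocabulary:
        tf = doc.count(term)
        df = sum(term in other for other in tokenized)
        vector.append(tf * math.log(n_docs / df))
    return vector

vectors = [tfidf_vector(doc) for doc in tokenized]
# Every vector has exactly one entry per vocabulary word.
print(len(vocabulary), [len(v) for v in vectors])
```

With a real dictionary the vocabulary runs into the thousands, which is why the vectors need to be reduced to 2 dimensions before plotting. Note also that a word appearing in every definition ("of" here) gets weight zero under this simplified formula.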