Recent research in natural language processing using the program word2vec gives manipulations of word representations that look a lot like semantics produced by vector math. For vector calculations to produce semantics would be remarkable, indeed. The word vectors are drawn from context, big, huge context. And, at least roughly, the meaning of a word is its use (in context). Is it possible that some question is begged here?
Here is the general scenario. We represent words by vectors (one-dimensional arrays) with a large number of elements. The numeric values of those elements are determined by reading and processing a vast number of examples of the contexts in which the words appear, and they are functions of the distances between occurrences of words in sequences. A neural network refines the values of each element of each word vector in the training phase. Through iterated adjustment of the elements, based on errors detected on comparison with the text corpora, the network produces the values in continuous space that best reflect the contextual data given. The end result is the word vector, a lengthy list of real numbers that do not seem to have any particular interpretation reflecting properties of the thing itself or properties of words.
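To make "context" concrete, here is a minimal sketch of how (word, neighbor) pairs might be collected from a token sequence before any training takes place. The function name context_pairs, the window parameter, and the sample sentence are illustrative assumptions, not part of word2vec itself; the actual system adds sampling and weighting steps that are omitted here.

```python
def context_pairs(tokens, window=2):
    """Pair each word with every neighbor at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical five-word "corpus", chosen only for illustration.
pairs = context_pairs("the queen rules the state".split(), window=1)
```

Pairs of this kind are the raw contextual data against which the network's errors are measured; everything the vectors come to "know" enters through them.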
In the intriguing example provided by Mikolov et al. [Mikolov 2013], they "... assume relationships are present as vector offsets, so that in the embedding space, all pairs of words sharing a particular relation are related by the same constant offset." Let's use 'd' to stand for "distributed representation by vector" of a word.
Mikolov and his colleagues find that d("king") - d("man") + d("woman") ≈ d("queen") (in vector offset mathematics in the n-dimensional space). This is certainly an intriguing and provocative result, in accord with our common understanding of the meanings of the four words, in which taking "king", removing the "male" aspect, and replacing it with a "female" aspect, gives "queen".
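The offset arithmetic can be sketched with toy vectors, taking the four words to be "king", "man", "woman", and "queen", as in the commonly cited example. The three-element vectors below are invented purely for illustration and are not word2vec output; the helper names cosine and analogy are likewise assumptions of this sketch. Nearness is measured by cosine similarity, a common choice for this kind of comparison.

```python
import math

# Toy vectors, made up for this sketch. Real word vectors have hundreds of
# elements, none of which has a readable meaning.
d = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word nearest to d(a) - d(b) + d(c), excluding a, b, and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", d)
```

With these hand-built vectors the nearest remaining word is "queen", but only because the toy elements were chosen to encode rank and gender directly. The remarkable thing about the trained vectors is that no one chose them.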
The word vectors can also capture plurality and some other shades of meaning that the researchers regard as syntactic. For example, the offset between singular and plural is a (learned) constant:
d("apples") - d("apple") ≈ d("families") - d("family") ≈ d("cars") - d("car")
More details of word2vec can be found in the readable explanation by Xin Rong [Rong 2016]. It looks like, not by direct coding but by some fortuitous discovery, the system has figured out some mathematical analog for semantics. (There is no claim that individual elements of the vector are capturing particular features such as gender, status, or grammatical number.)
But we already have a compendium of data on relationships of words to other words through contexts of use. It's a dictionary. The use of a word is largely given by its context. Its context can be inferred also from its dictionary definition. Most dictionaries will offer a direct or indirect connection through "king" to "ruler" or "sovereign" and "male" and through "queen" to "ruler" or "sovereign" and "female," as below.
queen: The female ruler of an independent state, especially one who inherits the position by right of birth
king: The male ruler of an independent state, especially one who inherits the position by right of birth
ruler: A person exercising government or dominion
These definitions [Oxford Dictionary] show that the gender can be "factored out", and that in common usage the gender aspect of sovereigns is notable. We would expect those phenomena to show up in vast text corpora. In fact, we would expect them to show up in text corpora because of the dictionary entries. Since we base our word use on the definitions captured by the dictionary, it is natural for any graph-theoretic distance metric based on node placement to (somehow) reflect that cross-semantic structure.
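The "factoring out" can be seen in a very crude sketch that treats each quoted definition as a bag of words and subtracts one count vector from another. The headword labels and the helper names definition_vector and offset are assumptions of this sketch, and word counting is a deliberately naive stand-in for whatever processing of a dictionary might actually be appropriate.

```python
from collections import Counter

# Definitions as quoted above; the shared tail is factored into one string.
REST = (" of an independent state, especially one who inherits"
        " the position by right of birth")
definitions = {
    "queen": "the female ruler" + REST,
    "king": "the male ruler" + REST,
    "ruler": "a person exercising government or dominion",
}

def definition_vector(text):
    """Bag-of-words counts for one definition, commas stripped."""
    return Counter(text.lower().replace(",", " ").split())

def offset(a, b):
    """Sparse difference d(a) - d(b), keeping only the non-zero components."""
    va = definition_vector(definitions[a])
    vb = definition_vector(definitions[b])
    return {w: va[w] - vb[w] for w in set(va) | set(vb) if va[w] != vb[w]}

gender_offset = offset("queen", "king")
```

Everything the two definitions share cancels, and what remains is +1 for "female" and -1 for "male". The offset is already sitting in the dictionary entries.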
Suppose that, employing the English slang terms "gal" and "guy" for female and male, the word for queen were "rulergal," and for king "rulerguy" (and perhaps the word for mother were "parentgal," and for father, "parentguy"). Then the word vector offsets calculated would not appear as remarkable, the relationships being exposed in the words themselves.
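The thought experiment can be sketched directly. If each invented compound is represented simply by counting the morphemes it contains, the constant offset falls out with no training at all. The morpheme list and the helper names morpheme_vector and minus are assumptions made for illustration.

```python
# The four building blocks of the hypothetical compound words.
MORPHEMES = ["ruler", "parent", "gal", "guy"]

def morpheme_vector(word):
    """Count how often each known morpheme occurs in the compound word."""
    return [word.count(m) for m in MORPHEMES]

def minus(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]

royal = minus(morpheme_vector("rulergal"), morpheme_vector("rulerguy"))
parental = minus(morpheme_vector("parentgal"), morpheme_vector("parentguy"))
```

Both differences come out as the same vector, [0, 0, 1, -1], which is exactly the "same constant offset" for a shared relation, here visible on the surface of the words.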
The system word2vec constructs and operates through the implicit framework of a dictionary, which gave rise in the first place to the input data to word2vec. But how could it be otherwise? As we understand the high degree of contextual dependency of word meanings in a language, any representation of word meaning will, to a significant degree, reflect context, where context is its inter-association with other words.
Yet the result is still intriguing. We have to ask how co-occurrence of words can so reliably lay out semantic relationships. We might explore what aspects of semantics are missing from context analysis, if any. We might (and should) ask what sort of processing of a dictionary would deliver the same sort of representations, if any.
The word vectors produced by training on a huge natural text dataset, in which words are given distributed vector representations refined through associations present in the input context, reflect the cross-referential semantic compositionality of a dictionary. That close reflection derives from the fact that words in natural text will be arranged in accordance with dictionary definitions. The word2vec result is the revelation of an embedded regularity. This author will be giving a talk on this very subject at the International Association for Computing and Philosophy, at Stanford, June 26-28 (iacap.org).
[Mikolov 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv 1301.3781v3.
[Oxford Dictionary] Oxford Living Dictionary (online). 2017. Oxford University Press.
[Rong 2016] Xin Rong. 2016. word2vec Parameter Learning Explained. arXiv 1411.2738.