Predicting geographical locations by analyzing co-occurrences in text
Can we predict the geographical locations of cities using only the way they happen to be mentioned in language? Based on the simple principle that places that are located together are mentioned together, data scientist Max Louwerse of the Tilburg School of Humanities and colleagues have demonstrated that we can.
In fact, the technique they used can be applied to help predict security threats using social media, or facilitate finding archeological excavation sites using historical documents.
Louwerse and colleagues computed a matrix of the frequencies with which the names of the largest cities in the United States co-occurred in the New York Times, the Wall Street Journal and the Los Angeles Times. They then derived a two-dimensional scaling solution from this matrix, whose x- and y-axes correlated with the cities' actual longitude and latitude.
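As a rough illustration of the counting step, a co-occurrence matrix can be sketched as follows. The toy corpus, the city list, the function name `cooccurrence_matrix`, and the choice to count per document with simple substring matching are all assumptions made for this example; the study's exact counting procedure is not described here.

```python
from itertools import combinations

import numpy as np


def cooccurrence_matrix(documents, cities):
    """Count, for each pair of cities, the documents that mention both."""
    index = {city: i for i, city in enumerate(cities)}
    counts = np.zeros((len(cities), len(cities)), dtype=int)
    for text in documents:
        lowered = text.lower()
        # Naive substring matching; a real corpus would need proper tokenization.
        present = sorted(index[c] for c in cities if c.lower() in lowered)
        for i, j in combinations(present, 2):
            counts[i, j] += 1
            counts[j, i] += 1  # keep the matrix symmetric
    return counts


# Hypothetical toy corpus, invented for illustration only.
cities = ["Boston", "New York", "Los Angeles", "San Diego"]
documents = [
    "The train from Boston to New York was delayed.",
    "New York and Boston both saw heavy snow.",
    "Traffic between Los Angeles and San Diego was slow.",
    "San Diego and Los Angeles share a coastline.",
    "A flight left New York for Los Angeles.",
]
counts = cooccurrence_matrix(documents, cities)
print(counts)
```

In this toy corpus the nearby pairs (Boston and New York, Los Angeles and San Diego) co-occur more often than the distant ones, which is the regularity the approach relies on.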
More technically, they applied Latent Semantic Analysis (LSA), a computational linguistic technique that measures the semantic association between words as the cosine between their word vectors. The resulting matrix of cosine values was then analyzed using multidimensional scaling (MDS), and the loadings of the city names on the resulting dimensions correlate with the cities' actual longitude and latitude.
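The two steps named above, cosines between word vectors followed by multidimensional scaling, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the vectors are made-up stand-ins for LSA vectors, the conversion of cosines to dissimilarities as 1 minus the cosine is one common choice, and classical (Torgerson) MDS is used because the article does not say which MDS variant was applied. The function names `cosine_matrix` and `classical_mds` are hypothetical.

```python
import numpy as np


def cosine_matrix(vectors):
    """Pairwise cosine between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


def classical_mds(dissimilarity, dims=2):
    """Classical (Torgerson) MDS via double centering and eigendecomposition."""
    n = dissimilarity.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dissimilarity ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dims]  # largest eigenvalues first
    # Clip tiny or negative eigenvalues, which arise for non-Euclidean input.
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0, None))


# Hypothetical vectors standing in for the LSA vectors of four city names.
vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.3, 0.0],
    [0.1, 1.0, 0.2],
    [0.0, 0.9, 0.4],
])
cosines = cosine_matrix(vectors)
coords = classical_mds(1.0 - cosines)  # assumption: dissimilarity = 1 - cosine
print(coords)
```

Here `coords` holds one row per item and one column per recovered dimension. MDS axes are only determined up to rotation and reflection, so an alignment step is typically needed before comparing them with longitude and latitude.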
Interestingly, the cognitive biases that humans show when estimating geographical locations were also found in the computational language estimates.
This work, done in English, has been repeated in other languages, such as Chinese to predict geographical locations in China and Arabic to predict geographical locations in the Middle East. Even Middle Earth could be mapped out using only the text of Lord of the Rings.
Even though this may seem like ivory-tower research, the work has societal implications. For instance, it has been funded by intelligence agencies to predict potential security threats using social media. Moreover, the researchers have recently conducted archeological work predicting excavation sites using the Indus Script.
Computational linguistic techniques can thus extract meaning from language that gives insights into the physical world around us as well as into human behavior.
Louwerse, M.M. & Benesh, N. (2012). Representing spatial structure through maps and language: Lord of the Rings encodes the spatial structure of Middle Earth. Cognitive Science, 36, 1556-1569.
Louwerse, M.M., Cai, Z., Hu, X., Ventura, M., & Jeuniaux, P. (2006). Cognitively inspired natural-language based knowledge representations: Further explorations of Latent Semantic Analysis. International Journal of Artificial Intelligence Tools, 15, 1021-1039.
Louwerse, M.M. & Zwaan, R.A. (2009). Language encodes geographical information. Cognitive Science, 33, 51-73.
Recchia, G. & Louwerse, M.M. (in press). Archaeology through computational linguistics: Inscription statistics predict excavation sites of Indus Valley artifacts. Cognitive Science.