Introduction

Most of us are used to Internet search engines and social networks being able to show only data in a certain language, for example, showing only results written in Spanish or English. To achieve that, the indexed text must have been analyzed beforehand to "guess" the language and store it together with the content.

There are several ways to do that; probably the easiest is a stopword-based approach. The term "stopword" is used in natural language processing to refer to words that should be filtered out of a text before doing any kind of processing, commonly because these words add little or nothing useful when analyzing text.

How do we do that?

OK, so we have a text whose language we want to detect based on the stopwords used in it. The first step is to "tokenize" it - convert the given text into a list of "words" or "tokens" - using one approach or another depending on our requirements: should we keep contractions or split them? Do we need punctuation, or do we want to split it off? And so on.

In this case we are going to split all punctuation into separate tokens:

nltk “wordpunct_tokenize” tokenizer
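
The snippet embedded in the original post is not preserved here; a minimal sketch of that tokenization step could look like this (the sample sentence is an assumption, based on the Mr. Wolf reference below):

    from nltk import wordpunct_tokenize

    # Split words and punctuation into separate tokens
    tokens = wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")
    print(tokens)
    # ['That', "'", 's', 'thirty', 'minutes', 'away', '.',
    #  'I', "'", 'll', 'be', 'there', 'in', 'ten', '.']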

As shown, the famous quote from Mr. Wolf has been split and now we have "clean" words to match against the stopword lists.

At this point we need stopwords for several languages, and this is where NLTK comes in handy:

included languages in NLTK
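
A quick way to see them, assuming the stopwords corpus has already been fetched with nltk.download('stopwords'):

    from nltk.corpus import stopwords

    # Languages for which NLTK ships a stopword list
    print(stopwords.fileids())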

Now we need to score each language depending on which of its stopwords are used:

calculate languages ratios
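
The embedded code is not reproduced here; a sketch that matches the description below (the function name is my own choice) could be:

    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords

    def _calculate_languages_ratios(text):
        # Tokenize the text and lowercase all the split tokens
        words = {word.lower() for word in wordpunct_tokenize(text)}
        languages_ratios = {}
        # For every language included in NLTK, count how many of its
        # unique stopwords appear in the analyzed text
        for language in stopwords.fileids():
            stopwords_set = set(stopwords.words(language))
            languages_ratios[language] = len(words & stopwords_set)
        return languages_ratios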

First we tokenize using the wordpunct_tokenize function and lowercase all the split tokens, then we walk over the languages included in NLTK and count how many unique stopwords of each language are seen in the analyzed text, storing the counts in the "languages_ratios" dictionary.

Finally, we only have to get the "key" with the biggest "value":

get most rated language
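
Assuming the helper sketched above, this is just Python's built-in max over the dictionary (detect_language is a hypothetical name):

    def detect_language(text):
        # Return the language whose stopwords overlap most with the text
        languages_ratios = _calculate_languages_ratios(text)
        return max(languages_ratios, key=languages_ratios.get)

    print(detect_language("That's thirty minutes away. I'll be there in ten."))
    # should print 'english' for this quote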

So yes, it seems this approach works fine with well-written texts - those that respect grammatical rules - as long as they are not too short, and it is really easy to implement.

Putting it all together

If we put everything explained above into a script, we get something like this:

langdetector.py
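
The script itself was embedded in the original post; a self-contained reconstruction from the pieces above (the command-line entry point reading from stdin is an assumption) might look like:

    #!/usr/bin/env python
    """langdetector.py - guess a text's language from NLTK stopwords."""

    import sys

    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords


    def _calculate_languages_ratios(text):
        # Count, per language, the unique stopwords present in the text
        words = {word.lower() for word in wordpunct_tokenize(text)}
        return {language: len(words & set(stopwords.words(language)))
                for language in stopwords.fileids()}


    def detect_language(text):
        # The language with the highest stopword overlap wins
        languages_ratios = _calculate_languages_ratios(text)
        return max(languages_ratios, key=languages_ratios.get)


    if __name__ == '__main__':
        print(detect_language(sys.stdin.read()))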

There are other ways to "guess" the language of a given text, like N-gram-based text categorization, so we will probably look at that in the next post.

See you soon and, as always, I hope you find this interesting and useful!