Natural language processing with NLTK (Natural Language Toolkit)
What is Natural Language Processing? NLP is the use of computers to process human language. It has plenty of practical applications (blogs, Twitter, phones, etc.) and in some sense can be considered an (almost) final milestone for Artificial Intelligence (the Turing test, after all, is conducted in natural language, although this is debated: a robot immersed in the world is a much harder problem).
NLP is all very nice, but people are naturally very good at speaking (their mother tongue, of course), so users of NLP systems have high expectations of system performance (which reality does not necessarily meet). Besides, NLP people tend to mix an interest in languages with an interest in mathematics, an unusual combination. Today NLP involves a lot of engineering and tweaking (and the feeling that... there has to be a better way!).
Natural Language Toolkit
NLTK is a toolkit, a collection of Python packages and objects well suited to NLP tasks. It both introduces newcomers to the state of the art in NLP and lets experts feel comfortable in their environment. Compared to other frameworks, NLTK makes strong default assumptions (e.g., a text is a sequence of words), but these can be changed. NLTK is aimed not only at people who come to NLP from computing but also at linguists doing field work. NLTK mixes well with Python (it is not just "implemented" in Python).
Some facts about NLTK:
NLTK started at the University of Pennsylvania, as part of a course in computational linguistics. It is available online at http://www.nltk.org/. Its source code is distributed under the Apache License version 2.0, and there is a highly recommended 500-page book by Bird, Klein & Loper, "Natural Language Processing with Python", published by O'Reilly and also available online (http://nltk.googlecode.com/svn/trunk/doc/book/).
The toolkit also includes data in the form of text collections (many annotated) and statistical models.
NLTK and Python
The toolkit integrates very well with Python because it tries to do most things with Python's own facilities, such as slices and list comprehensions. For example, a text is a list of words, and a list of words can be turned into a text. The design goals of the toolkit (simplicity, consistency, extensibility and modular design) go hand in hand with the design of Python itself. In keeping with these goals, NLTK avoids creating its own classes when Python's built-in dictionaries, lists and tuples are enough.
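To make the "a text is a list of words" idea concrete, here is a minimal sketch that uses only plain Python lists, so it runs without NLTK installed. The sample sentence and variable names are illustrative assumptions; the texts loaded from nltk.book accept the same slicing and list-comprehension idioms.

```python
# A text treated as a plain list of words (illustrative sample sentence).
sentence = "Call me Ishmael . Some years ago , never mind how long precisely"
words = sentence.split()

# Slicing: the first three tokens of the text.
first_three = words[:3]

# List comprehension: lowercase every token longer than four characters.
long_words = [w.lower() for w in words if len(w) > 4]

# A list of words can be turned back into a text.
text_again = " ".join(words)

print(first_three)
print(long_words)
```

Because the data stays in built-in types, every standard Python tool (sorting, sets, len, and so on) applies to a text directly.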
NLTK Main Packages:
- Access to documents: Interfaces for text collections.
- String processing: Tokenization, sentence detection, stemmers.
- Collocation discovery: Tokens that appear together more often than by chance.
- Part-of-speech tagging: Distinguish nouns from verbs, etc.
- Classification: General-purpose classifiers, trained on features given as Python dictionaries.
- Chunking: Split a sentence into granular units.
- Syntactic Analysis: Complex analysis (syntactic and others).
- Semantic Interpretation: λ (lambda) calculus, first-order logic, etc.
- Evaluation Metrics: Precision, recall, etc.
- Statistics: Frequency distributions, estimators, etc.
- Applications: WordNet browser, chatbots.
Some Examples
To try it out, importing everything from the nltk.book package gets the Python interpreter ready to go. (You first need to download the data, by importing nltk and running nltk.download().)
Once that is done, you can ask, for example, for the words that appear in contexts similar to those of a given word within a text.
>>> from nltk.book import *
# long comment, skipped
>>> moby_dick = text1
>>> moby_dick.similar('poor')
Building word-context index...
old sweet as eager that this all own help peculiar german crazy three at goodness world wonderful floating ring simple
>>> inaugural_addresses = text4
>>> inaugural_addresses.similar('poor')
Building word-context index...
free south duties world people all partial welfare battle settlement integrity children issues idealism tariff concerned young recurrence charge those
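To give an intuition for what "similar words based on contexts" means, here is a simplified sketch in plain Python. It is not NLTK's implementation: the function name similar_words, the scoring (number of shared left/right neighbor pairs) and the toy sentence are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def similar_words(words, target, n=5):
    """Rank words by how many (left, right) contexts they share with target.

    Simplified illustration only; NLTK's own scoring differs.
    """
    tokens = [w.lower() for w in words]
    padded = [None] + tokens + [None]  # None marks the text boundaries
    contexts = defaultdict(set)
    for left, word, right in zip(padded, padded[1:], padded[2:]):
        contexts[word].add((left, right))

    target = target.lower()
    target_contexts = contexts.get(target, set())
    scores = Counter()
    for word, word_contexts in contexts.items():
        shared = len(word_contexts & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(n)]

# Toy text: "poor" and "old" occur between the same neighbors.
toy = "the poor man saw the old man and the poor dog saw the old dog".split()
print(similar_words(toy, "poor"))
```

On a real corpus such as Moby Dick the same idea surfaces words that tend to fill the same slots as the target word, which is why the two texts above give such different answers for 'poor'.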
Tutorial: Highlights
We want to build something like the Top Stories list at http://news.google.com/.
Top stories a la NLTK
Our strategy will be:
- Find the named entities
- Use NLTK out of the box to do so
- Then score the entities and show the best ones.
As data we used 7,705 international news stories borrowed from the Reuters U.S. site (http://www.reuters.com/).
What are the named entities?
Named entities are atomic elements of a text that fall into categories such as people, places and organizations. To extract named entities with NLTK, use the following code:
>>> import nltk
>>> s = """
... Prince William and his new wife Catherine kissed twice
... to satisfy the besotted Buckingham Palace crowds"""
>>> a = nltk.word_tokenize(s)
>>> b = nltk.pos_tag(a)
>>> c = nltk.ne_chunk(b, binary=True)
>>> for x in c.subtrees():
...     if x.label() == "NE":
...         words = [w[0] for w in x.leaves()]
...         name = " ".join(words)
...         print(name)
...
Prince William
Catherine
Buckingham Palace
>>>
State-of-the-art algorithms for named entity recognition adapt automatically to entities that have not been seen before. Looking at the Top Stories list on news.google.com, we see that it consists mostly of named entities. So finding named entities and ranking them would let us simulate the behavior of news.google.com.
The code above, in three simple steps:
- Build a list of words
>>> a = nltk.word_tokenize(s)
['Prince', 'William', 'and', 'his', 'new', 'wife', ...
- Add the grammatical category (part of speech)
>>> b = nltk.pos_tag(a)
[('Prince', 'NN'), ('William', 'NNP'), ('and', 'CC'), ...
- Annotate named entities
>>> c = nltk.ne_chunk(b, binary=True)
Tree('S',[Tree('NE', [('Prince', 'NN'), ('William', 'NNP')]), ...
The rest of the example code just prints the entities on the screen.
Which are the most relevant?
The remaining problem is to build a ranking of the most relevant entities.
How do you tell that "Japan" is more relevant between March 10th and 20th (remember the tsunami) than between April 10th and 20th? To answer this, the method we propose for this example considers how often "Japan" appears in a range of days relative to how often it usually appears. Our ranking function is then:
\[\mathrm{ratio}(word) = \frac{\text{prob. of word in days } i..j}{\text{prob. of word in all the news}}\]
The words with the highest ratio were significant in the days between i and j.
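A minimal sketch of this ranking function follows. It assumes each news item is a (date, entities) pair and that the probability of a word is estimated as the fraction of news items that mention it. The function name entity_ratios and the sample data are hypothetical and not taken from the example code linked below.

```python
from collections import Counter
from datetime import date

def entity_ratios(news, start, end):
    """Return {entity: ratio} for entities seen between start and end.

    news is a list of (day, entities) pairs. Probabilities are estimated
    as document frequencies, which is an assumption of this sketch.
    """
    in_range = [entities for day, entities in news if start <= day <= end]
    if not in_range:
        return {}
    # Count each entity at most once per news item.
    range_counts = Counter(e for entities in in_range for e in set(entities))
    total_counts = Counter(e for _, entities in news for e in set(entities))
    ratios = {}
    for entity, count in range_counts.items():
        p_range = count / len(in_range)           # prob. in days i..j
        p_all = total_counts[entity] / len(news)  # prob. in all the news
        ratios[entity] = p_range / p_all
    return ratios

# Hypothetical sample data, for illustration only.
sample_news = [
    (date(2011, 3, 12), ["japan", "tokyo"]),
    (date(2011, 3, 15), ["japan"]),
    (date(2011, 4, 12), ["libya"]),
    (date(2011, 4, 15), ["japan", "libya"]),
]
march = entity_ratios(sample_news, date(2011, 3, 10), date(2011, 3, 20))
for entity in sorted(march, key=march.get, reverse=True):
    print(entity, round(march[entity], 2))
```

An entity that shows up all year long gets a ratio close to 1, while one concentrated in the chosen days gets a ratio well above 1.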
How to avoid overlap?
"Japan" and "Tokyo" may both have a high ratio, but we probably want only one of them to show up as a relevant news item.
To solve this:
- We choose the entity E with the highest ratio.
- We discard all the news stories in which E appears.
- We recalculate the ratios and go back to the first step.
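The loop above can be sketched as follows. This is an illustrative version under stated assumptions: each news item is reduced to a set of entity names, stories are discarded only from the selected range of days (the background collection is left untouched), and ties are broken alphabetically. The names ratios and top_entities and the sample data are hypothetical.

```python
from collections import Counter

def ratios(selected, everything):
    """Ratio of each entity's frequency in selected vs. in everything."""
    selected_counts = Counter(e for item in selected for e in set(item))
    total_counts = Counter(e for item in everything for e in set(item))
    return {
        entity: (count / len(selected)) / (total_counts[entity] / len(everything))
        for entity, count in selected_counts.items()
    }

def top_entities(news_in_range, all_news, k=3):
    """Greedily pick up to k entities, dropping covered stories each round."""
    remaining = list(news_in_range)
    chosen = []
    while remaining and len(chosen) < k:
        scores = ratios(remaining, all_news)
        # Step 1: entity with the highest ratio (alphabetical tie-break).
        best = max(sorted(scores), key=scores.get)
        chosen.append(best)
        # Step 2: discard every story in which that entity appears.
        remaining = [item for item in remaining if best not in item]
        # Step 3: the next iteration recomputes the ratios.
    return chosen

# Hypothetical sample data: "japan" always co-occurs with "tokyo" in range.
in_range = [{"japan", "tokyo"}, {"japan", "tokyo"}, {"libya"}]
background = in_range + [{"libya"}, {"libya"}, {"japan"}]
print(top_entities(in_range, background))
```

In this toy data "tokyo" is picked first, and because every in-range story about "japan" also mentions "tokyo", "japan" never makes the list.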
A typical output of all these steps is:
============================================================
Top news between 2011-03-11 00:00:00 and 2011-03-20 00:00:00
============================================================
zuwarah
uss ronald reagan
richard wakeford
patrick fuller
unit
soviet ukraine
g7
tokyo commodity exchange
roppongi
nuclear security
Example code is available at:
http://duboue.net/download/pyday-nltk.tar.gz
Conclusions
Rafael Carrascosa (AKA Player 1) used NLTK in his work at the Faculty of Mathematics, Physics and Astronomy at UNC, especially in a pipeline for classified ads, to:
- Do POS tagging of Spanish.
- Do chunking of Spanish.
- Do chunk sense disambiguation.
In addition, he built language models for text compression using probabilistic grammars, n-grams and statistical parsers, all based on NLTK.
So... how strong is NLTK? Being implemented in pure Python, it can be slow at times, and the available packages vary widely in code maturity. One question that can be asked is:
Can it be used to build commercial products?
As far as the license goes, it's all good. As for annotation errors, you have to try it and see, and as for speed, you have to try that too. So... try it!
Other frameworks out there include GATE and UIMA, which are heavily Java-based.
How to continue? NLTK is very good: try it out and read the book (which is online).