Sunday, November 23, 2008

Introduction to Natural Language Processing

Getting Started on Natural Language Processing with Python


Natural Language Processing

The term natural language processing encompasses a broad set of techniques for automated generation, manipulation, and analysis of natural or human languages. Although most NLP techniques inherit largely from linguistics and artificial intelligence, they are also influenced by relatively new areas such as machine learning, computational statistics, and cognitive science.

Before we see some examples of NLP techniques, it will be useful to introduce some very basic terminology. Please note that as a side effect of keeping things simple, these definitions may not stand up to strict linguistic scrutiny.

  • Token: Before any real processing can be done on the input text, it needs to be
    segmented into linguistic units such as words, punctuation, numbers, or alphanumerics. These units are known as tokens.

  • Sentence: An ordered sequence of tokens.

  • Tokenization: The process of splitting a sentence into its constituent tokens. For segmented languages such as English, the existence of whitespace makes tokenization
    relatively easy and uninteresting. However, for languages such as Chinese and Arabic, the task is more difficult since there are no explicit boundaries. Furthermore, almost all characters in such non-segmented languages can exist as one-character words by themselves, and can also join together to form multi-character words.

  • Corpus: A body of text, usually containing a large number of sentences. 

  • Part-of-speech (POS) tag: A word can be classified into one or more lexical or part-of-speech categories such as nouns, verbs, adjectives, and articles, to name a few. A POS tag is a symbol representing such a lexical category, e.g., NN (noun), VB (verb), JJ (adjective), AT (article). One of the oldest and most commonly used tag sets is the Brown corpus tag set. We will discuss the Brown corpus in more detail below.

  • Parse tree: A tree defined over a given sentence that represents the syntactic structure of the sentence as defined by a formal grammar.
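To make the notions of token and tokenization concrete, here is a minimal sketch using Python's standard `re` module. The token pattern is a simplification chosen for this example (runs of letters, runs of digits, or single punctuation characters), not a complete tokenizer for English.

```python
import re

def tokenize(sentence):
    """Split a sentence into word, number, and punctuation tokens.

    The pattern is deliberately simple: runs of letters, runs of
    digits, or any single non-space, non-word character (punctuation).
    """
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", sentence)

print(tokenize("The ball is red."))
# ['The', 'ball', 'is', 'red', '.']
```

Note that even this toy pattern already treats the final period as its own token, which is exactly the kind of segmentation decision a real tokenizer has to make systematically.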

Now that we have introduced the basic terminology, let us look at some common NLP tasks:

  • POS tagging: Given a sentence and a set of POS tags, a common language processing task
    is to automatically assign POS tags to each word in the sentence. For example, given the sentence "The ball is red," the output of a POS tagger would be "The/AT ball/NN is/VB red/JJ."
    State-of-the-art POS taggers [9] can achieve accuracy as high as 96%. Tagging text with parts-of-speech turns out to be extremely useful for more complicated NLP tasks such as parsing and machine translation, which are discussed below.

  • Computational morphology: Natural languages consist of a very large number of words that are built upon basic building blocks known as morphemes (or stems), the smallest linguistic units possessing meaning. Computational morphology is concerned with the discovery and analysis of the internal structure of words using computers.

  • Parsing: In the parsing task, a parser constructs the parse tree given a sentence.
    Some parsers assume the existence of a set of grammar rules in order to parse, but recent parsers are smart enough to deduce the parse trees directly from the given data using complex statistical models [1]. Most parsers also operate in a supervised setting and require the sentence to be POS-tagged before it can be parsed. Statistical parsing is an area of active research in NLP.

  • Machine translation (MT): In machine translation, the goal is to have the computer translate the given text in one natural language to fluent text in another language, without
    human intervention. This is one of the most difficult tasks in NLP and has been tackled in many different ways over the years. Almost all MT approaches use POS tagging and parsing as preliminary steps.
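To make the POS tagging task concrete, here is a toy lexicon-lookup tagger in plain Python using Brown-style tags. The lexicon and the NN fallback are assumptions made for this sketch; state-of-the-art taggers [9] resolve ambiguity with statistical models rather than a fixed dictionary.

```python
# A toy lexicon mapping words to Brown-style POS tags.
# Real taggers handle ambiguous and unseen words statistically;
# this sketch simply looks each word up and falls back to NN (noun).
LEXICON = {
    "the": "AT",   # article
    "ball": "NN",  # noun
    "is": "VB",    # verb (simplified; the Brown tag set uses BEZ for "is")
    "red": "JJ",   # adjective
}

def tag(tokens):
    """Return (word, TAG) pairs, defaulting unknown words to NN."""
    return [(w, LEXICON.get(w.lower(), "NN")) for w in tokens]

tagged = tag(["The", "ball", "is", "red"])
print(" ".join(w + "/" + t for w, t in tagged))
# The/AT ball/NN is/VB red/JJ
```

The interesting part of real tagging is everything this sketch omits: choosing among multiple possible tags for a word based on its context.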


Python

The Python programming language is a dynamically-typed, object-oriented, interpreted language. Although its primary strength lies in the ease with which it allows a programmer to rapidly prototype a project, its powerful and mature set of standard libraries makes it a great fit
for large-scale production-level software engineering projects as well. Python has a very shallow learning curve and an excellent online tutorial [11].

Natural Language ToolKit (NLTK)

Although Python already has most of the functionality needed to perform simple text-processing tasks, it is not powerful enough for standard NLP tasks. This is where the Natural Language Toolkit (NLTK) comes in [12]. NLTK is a collection of modules and corpora, released under an open-source license, that allows students to learn and conduct research in NLP.

The most important advantage of using NLTK is that it is entirely self-contained. Not only does it provide convenient functions and wrappers that can be used as building blocks for common NLP tasks, it also provides raw and preprocessed versions of standard corpora used in NLP
literature and courses. 
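As a small taste of the toolkit, the sketch below counts word frequencies with NLTK's `FreqDist` class. It assumes NLTK is installed (`pip install nltk`), and it tokenizes by whitespace so that it does not depend on any downloaded corpora or models.

```python
from nltk.probability import FreqDist

sentence = "the quick brown fox jumps over the lazy dog"
tokens = sentence.split()      # naive whitespace tokenization

fd = FreqDist(tokens)          # frequency distribution over the tokens
print(fd.most_common(1))       # the single most frequent token
# [('the', 2)]
print(fd.N())                  # total number of token occurrences
# 9
```

Frequency distributions like this one are the starting point for many of the corpus-based analyses that NLTK's bundled corpora make convenient to run.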


References

1. Bikel, Dan. On the Parameter Space of Generative Lexicalized Statistical Parsing Models. PhD Thesis. 2004.
2. Chiang, David. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Proceedings of ACL. 2005.
3. Church, Kenneth W. and Hanks, Patrick. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1). 1990.
4. Cunningham, H., Maynard, D., Bontcheva, K. and Tablan, V. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). 2002.
5. Hart, Michael and Newby, Gregory. Project Gutenberg.
6. Kucera, H. and Francis, W. N. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI. 1967.
7. Levy, Roger and Manning, Christopher D. Is It Harder to Parse Chinese, or the Chinese Treebank? Proceedings of ACL. 2003.
8. Radev, Dragomir R. and McKeown, Kathy. Generating Natural Language Summaries from Multiple On-line Sources. Computational Linguistics, 24:469-500. 1999.
9. Ratnaparkhi, Adwait. A Maximum Entropy Part-of-Speech Tagger. Proceedings of Empirical Methods in Natural Language Processing. 1996.
10. Wu, Dekai and Chiang, David. Syntax and Structure in Statistical Translation. Workshop at HLT-NAACL 2007.
11. The Official Python Tutorial.
12. Natural Language Toolkit.
13. NLTK Tutorial.
