Natural Language Toolkit - Tokenizing Text



What is Tokenizing?

It may be defined as the process of breaking up a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
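To make the idea concrete before turning to NLTK, here is a minimal sketch that uses only Python's standard re module. The helper names naive_sentences and naive_words are hypothetical, and the rules are deliberately crude (they would mis-split abbreviations such as "Dr."), so this is an illustration of the concept rather than how NLTK works internally.

```python
import re

def naive_sentences(paragraph):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Hypothetical helper: real sentence tokenizers handle abbreviations etc.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def naive_words(sentence):
    # Each run of word characters becomes one token; punctuation is dropped.
    return re.findall(r"\w+", sentence)

paragraph = "Tokens are small units. A word is a token in a sentence."
sentences = naive_sentences(paragraph)          # sentence tokens of the paragraph
words = [naive_words(s) for s in sentences]     # word tokens of each sentence
print(sentences)
print(words)
```

Each sentence is a token of the paragraph, and each word is a token of its sentence.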

NLP is used to build applications such as sentiment analysis, QA systems, language translation, smart chatbots, and voice systems. To build them, it is vital to understand the patterns in the text. The tokens mentioned above are very useful for finding and understanding these patterns. Tokenization can be considered the base step for other techniques such as stemming and lemmatization.

NLTK package

nltk.tokenize is the package that NLTK provides for tokenization.

Tokenizing sentences into words

Splitting a sentence into words, that is, creating a list of words from a string, is an essential part of every text processing activity. Let us understand it with the help of the functions and classes provided by the nltk.tokenize package.

word_tokenize function

The word_tokenize function is used for basic word tokenization. It relies on NLTK's pre-trained Punkt sentence models, which may need to be downloaded once with nltk.download(). The following example uses this function to split a sentence into words.

Example

import nltk
from nltk.tokenize import word_tokenize
word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output

['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']

TreebankWordTokenizer Class

The word_tokenize() function used above is essentially a wrapper that calls tokenize() on an instance of a Treebank-style tokenizer such as the TreebankWordTokenizer class. For a simple sentence like the one above, using the class directly gives the same output as word_tokenize(). Let us see the same example implemented with the class −

Example

First, we need to import the Natural Language Toolkit (nltk).

import nltk

Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −

from nltk.tokenize import TreebankWordTokenizer

Next, create an instance of the TreebankWordTokenizer class as follows −

tokenizer_wrd = TreebankWordTokenizer()

Now, pass in the sentence you want to convert to tokens −

tokenizer_wrd.tokenize(
   'Tutorialspoint.com provides high quality technical tutorials for free.'
)

Output

[
   'Tutorialspoint.com', 'provides', 'high', 'quality', 
   'technical', 'tutorials', 'for', 'free', '.'
]

Complete implementation example

The complete implementation is shown below −

import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
tokenizer_wrd.tokenize(
   'Tutorialspoint.com provides high quality technical tutorials for free.'
)

Output

[
   'Tutorialspoint.com', 'provides', 'high', 'quality', 
   'technical', 'tutorials', 'for', 'free', '.'
]

One notable convention of this tokenizer is that it splits contractions. For example, if we use the word_tokenize() function on a contraction, it gives the following output −

Example

import nltk
from nltk.tokenize import word_tokenize
word_tokenize("won't")

Output

['wo', "n't"]

This convention is not suitable for every application. NLTK therefore offers alternative word tokenizers such as WordPunctTokenizer. Older releases also shipped PunktWordTokenizer, which may not be available in current versions.
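To see where a token like 'wo' comes from, the following sketch imitates the contraction split with a single regular expression from the standard library. It is a simplified illustration covering only the "n't" case, and split_negative_contractions is a hypothetical helper name; the real Treebank tokenizer applies a much larger rule set.

```python
import re

def split_negative_contractions(text):
    # Insert a space before "n't" so it becomes a separate token,
    # then split on whitespace. Hypothetical helper, "n't" case only.
    spaced = re.sub(r"(?i)(n't)\b", r" \1", text)
    return spaced.split()

contraction_tokens = split_negative_contractions("I won't go")
print(contraction_tokens)
```

Because the split point is placed directly before "n't", whatever precedes it ("wo" in "won't", "ca" in "can't") is left as its own token.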

WordPunctTokenizer Class

WordPunctTokenizer is an alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with the following simple example −

Example

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(" I can't allow you to go home early")

Output

['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
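The behaviour above can be reproduced with a plain regular expression. According to the NLTK documentation, WordPunctTokenizer is built on the pattern \w+|[^\w\s]+, so the sketch below applies that pattern with the standard re module. The function name word_punct_tokens is an assumption made for illustration.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokens(text):
    # Hypothetical helper that mimics the tokenizer's pattern with re.findall.
    return WORD_OR_PUNCT.findall(text)

tokens = word_punct_tokens(" I can't allow you to go home early")
print(tokens)
```

The apostrophe is not a word character, so it is emitted as its own token between 'can' and 't'.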

Tokenizing text into sentences

In this section, we are going to split a text or paragraph into sentences. NLTK provides the sent_tokenize function for this purpose.

Why is it needed?

An obvious question is why we need a sentence tokenizer when we already have a word tokenizer. Suppose we need to compute the average number of words per sentence. To accomplish this task, we need both sentence tokenization and word tokenization.

Let us understand the difference between sentence and word tokenizer with the help of following simple example −

Example

import nltk
from nltk.tokenize import sent_tokenize
text = ("Let us understand the difference between sentence & word tokenizer. "
        "It is going to be a simple example.")
sent_tokenize(text)

Output

[
   "Let us understand the difference between sentence & word tokenizer.", 
   'It is going to be a simple example.'
]
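The average-words-per-sentence task mentioned earlier combines both kinds of tokenization. The sketch below takes the two tokenizers as parameters; the default stand-ins are naive regular-expression helpers from the standard library (hypothetical names, not NLTK functions) so that the example runs without any downloaded models.

```python
import re

def _naive_sent_tokenize(text):
    # Stand-in sentence splitter: break after '.', '!' or '?' plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def _naive_word_tokenize(sentence):
    # Stand-in word splitter: runs of word characters, punctuation dropped.
    return re.findall(r"\w+", sentence)

def average_words_per_sentence(text,
                               sent_tokenizer=_naive_sent_tokenize,
                               word_tokenizer=_naive_word_tokenize):
    # First split into sentences, then count the word tokens of each one.
    sentences = sent_tokenizer(text)
    if not sentences:
        return 0.0
    total_words = sum(len(word_tokenizer(s)) for s in sentences)
    return total_words / len(sentences)

text = ("Let us understand the difference between sentence & word tokenizer. "
        "It is going to be a simple example.")
average = average_words_per_sentence(text)
print(average)
```

With NLTK installed, sent_tokenize and word_tokenize can be passed in place of the stand-ins. The result will generally differ, because word_tokenize also returns punctuation marks as tokens.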

Tokenizing sentences using regular expressions

If the output of the word tokenizers above is unacceptable and you want complete control over how the text is tokenized, you can use a regular expression to split a sentence into tokens. NLTK provides the RegexpTokenizer class to achieve this.

Let us understand the concept with the help of two examples below.

In the first example, we use a regular expression that matches alphanumeric tokens plus single quotes, so that contractions like "won't" are not split.

Example 1

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("won't is a contraction.")
tokenizer.tokenize("can't is a contraction.")

Output

["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']

In the second example, we use a regular expression to tokenize on whitespace.

Example 2

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokenizer.tokenize("won't is a contraction.")

Output

["won't", 'is', 'a', 'contraction.']

From the above output, we can see that the punctuation remains in the tokens. The parameter gaps=True means that the pattern identifies the gaps between tokens. If we use gaps=False instead, the pattern is used to identify the tokens themselves, as can be seen in the following example −

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=False)
tokenizer.tokenize("won't is a contraction.")

Output

[' ', ' ', ' ']

The tokens are now the whitespace matches themselves, which is rarely useful.
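The two modes roughly correspond to splitting on a pattern versus finding all matches of it. The following standard-library sketch shows that correspondence. It is an approximation of what RegexpTokenizer does, not a copy of its implementation, and the variable names are chosen only for illustration.

```python
import re

text = "won't is a contraction."
whitespace = re.compile(r"\s+")

# gaps=True style: the pattern marks the separators,
# and the tokens are the non-empty pieces between them.
tokens_between_gaps = [t for t in whitespace.split(text) if t]

# gaps=False style: the pattern marks the tokens themselves.
tokens_matching_pattern = whitespace.findall(text)

# Pitfall: '/s+' is a literal slash followed by one or more "s" characters,
# not the whitespace escape, so it finds nothing in this text.
slash_matches = re.findall("/s+", text)

print(tokens_between_gaps)
print(tokens_matching_pattern)
print(slash_matches)
```

Choosing between the two modes comes down to whether it is easier to describe the tokens or the separators with a pattern.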
