Tokenize text using NLTK in Python


Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. In the context of NLTK and Python, it is simply the process of putting each token into a list so that, instead of iterating over the text one character at a time, we can iterate over it one token at a time.
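To see the difference, compare iterating over a raw string with iterating over a list of tokens. The following is only a minimal sketch that uses Python's built-in str.split() as a naive stand-in for a real tokenizer; the NLTK tokenizer introduced below handles punctuation properly.

text = "Hi man, how have you been?"

# Iterating over the string yields one character at a time
for ch in text:
    print(ch)          # 'H', 'i', ' ', 'm', ...

# Iterating over a token list yields one word at a time
for token in text.split():   # naive whitespace split; punctuation stays attached
    print(token)       # 'Hi', 'man,', 'how', ...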

For example, given the input string −

Hi man, how have you been?

We should get the output −

['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']

We can tokenize this text using the word_tokenize() function from the nltk.tokenize module, as shown in the following example.

Example

from nltk.tokenize import word_tokenize

my_sent = "Hi man, how have you been?"

# Split the sentence into word and punctuation tokens
tokens = word_tokenize(my_sent)

print(tokens)

Output

This will give the output −

['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
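One practical note: word_tokenize relies on NLTK's pre-trained Punkt tokenizer data, which is downloaded separately from the library itself. If the call raises a LookupError, downloading the data once per environment is enough. This is a minimal sketch; depending on your NLTK version, the required resource may be named 'punkt_tab' instead of 'punkt'.

import nltk
from nltk.tokenize import word_tokenize

# One-time download of the Punkt tokenizer data
nltk.download('punkt')

print(word_tokenize("Hi man, how have you been?"))
# ['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']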
