6 Methods To Tokenize String In Python

In this article, we will learn how to tokenize a string in Python. Tokenization is the process of splitting a sentence, paragraph, or other piece of text into tokens, which are useful in many programs, especially in Natural Language Processing (NLP). You can tokenize strings of any length you desire, from a small fragment to a large body of text. There are many methods in Python through which you can tokenize strings; we will discuss a few of them and learn how to use them according to our needs. To know more about tokens in Python, check out this post.

Methods To Tokenize String In Python

Method 1: Tokenize String In Python Using Split()

You can tokenize any string with the ‘split()’ method in Python. The method is called on a string, and you can pass a separator as its argument to control where the string is split. If you don’t pass a separator, it splits on whitespace by default. Let us see an example to understand how this method works.

example = "Hello, Welcome to python pool, hope you are doing well"
tokens = example.split()
print(tokens)

Output:

['Hello,', 'Welcome', 'to', 'python', 'pool,', 'hope', 'you', 'are', 'doing', 'well']

As you can see, if we leave the separator at its default, the method splits the sentence into tokens at every run of whitespace. Next, let us see how this method behaves if we pass a separator to it.

example = "Hello, Welcome to python pool, hope you are doing well"
tokens = example.split(",")
print(tokens)

Output:

['Hello', ' Welcome to python pool', ' hope you are doing well']

From the above code, you can see the sentence was split at every comma (note that the space following each comma stays at the start of the next token). Similarly, you can pass any separator to the method to divide the sentence, as shown in the sketch below.
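
For instance, ‘split()’ also accepts an optional ‘maxsplit’ argument that limits how many splits are performed; everything after the last split stays in the final token. A small sketch using the same string:

example = "Hello, Welcome to python pool, hope you are doing well"
tokens = example.split(",", 1)  # split only at the first comma
print(tokens)

Output:

['Hello', ' Welcome to python pool, hope you are doing well']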

Method 2: Using NLTK

You can also tokenize strings using NLTK, the Natural Language Toolkit, which is very useful if you are dealing with NLP (Natural Language Processing). If you don’t have it yet, install it with ‘pip install nltk’. NLTK provides a module ‘tokenize’, and this module has a function ‘word_tokenize()’, which can divide a string into word-level tokens. Let us see an example of how we can use this function.

from nltk.tokenize import word_tokenize
# you may need to run nltk.download('punkt') once to fetch the tokenizer data
example = "Hello, Welcome to python pool, hope you are doing well"
print(word_tokenize(example))

Output:

['Hello', ',', 'Welcome', 'to', 'python', 'pool', ',', 'hope', 'you', 'are', 'doing', 'well']

From the example, you can see the output is quite different from the ‘split()’ method. ‘word_tokenize()’ treats punctuation marks such as the commas as separate tokens, in addition to the words themselves.
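
‘word_tokenize()’ also splits contractions into separate tokens. A small sketch (again assuming the ‘punkt’ tokenizer data is available):

from nltk.tokenize import word_tokenize
print(word_tokenize("Don't stop learning"))

Output:

['Do', "n't", 'stop', 'learning']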

Further, the ‘tokenize’ module also has a function ‘sent_tokenize()’, which splits a body of text into sentences. Let us see an example.

import nltk
# you may need to run nltk.download('punkt') once to fetch the tokenizer data
example = "Hello, Welcome to python pool. Hope you are doing well"
tokens = nltk.sent_tokenize(example)
print(tokens)

Output:

['Hello, Welcome to python pool.', 'Hope you are doing well']

Method 3: Splitting Strings In Pandas For Tokens

You might want to split strings in ‘pandas’ to get a new column of tokens. You can do this using the ‘str.split()’ method. Let us take an example in which you have a DataFrame containing names, and you want only the first names or the last names as tokens. To do that, first build the DataFrame as shown below.

import pandas
example = pandas.DataFrame({"names": [ 
    "Bill Gates", 
    "Elon Musk", 
    "Jeff Bezos",
    "Mukesh Ambani"]})

Let’s say you want only the first names. You can get that by writing the following code.

example.names.str.split(r'\s+').str[0]

Output:

0      Bill
1      Elon
2      Jeff
3    Mukesh
Name: names, dtype: object

Further, if you want only the last names, you can get that by using the following code.

example.names.str.split(r'\s+').str[-1]

Output:

0     Gates
1      Musk
2     Bezos
3    Ambani
Name: names, dtype: object

Similarly, you can split any string column in pandas using this method and then pick out tokens by indexing the resulting lists.
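
If you want all the name parts at once, ‘str.split()’ also accepts an ‘expand=True’ argument, which returns a DataFrame with one column per token. A small sketch using the same ‘example’ DataFrame:

example.names.str.split(r'\s+', expand=True)

Output:

        0       1
0    Bill   Gates
1    Elon    Musk
2    Jeff   Bezos
3  Mukesh  Ambani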

Method 4: Tokenize String In Python Using Keras

You can also split strings into tokens using ‘Keras’. Its ‘text_to_word_sequence()’ function does exactly that; by default it also lowercases the text and strips punctuation. Let us see how you can tokenize using this function.

First, install ‘Keras’ on your PC using ‘pip’ in the command prompt.

pip install keras

Now write the code to tokenize the string as follows.

from keras.preprocessing.text import text_to_word_sequence
# in recent versions the import may live under tensorflow.keras.preprocessing.text
example = "Hello, Welcome to python pool, hope you are doing well"
tokens = text_to_word_sequence(example)
print(tokens)

Output:

['hello', 'welcome', 'to', 'python', 'pool', 'hope', 'you', 'are', 'doing', 'well']

From the above example, you can see how easily we can tokenize a string using ‘Keras’ in Python with the function ‘text_to_word_sequence()’. Notice that the punctuation is removed and all the words are lowercased by default.
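
‘text_to_word_sequence()’ also takes optional arguments such as ‘lower’ and ‘filters’. A small sketch, assuming the same import and string as above, showing how to keep the original casing:

tokens = text_to_word_sequence(example, lower=False)
print(tokens)

Output:

['Hello', 'Welcome', 'to', 'python', 'pool', 'hope', 'you', 'are', 'doing', 'well']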

Method 5: Tokenize String In Python Using Gensim

Gensim is an open-source library in Python which is widely used for Natural Language Processing and unsupervised topic modeling. It is very easy to carry out tokenization using this library: you can combine its ‘tokenize’ function with the built-in ‘list()’ function to get a list of tokens (‘tokenize’ itself returns a generator). To do this, first install ‘Gensim’ using pip in the command prompt.

pip install gensim

Now write the code to tokenize the string as follows.

from gensim.utils import tokenize
example = "Hello, Welcome to python pool, hope you are doing well"
list(tokenize(example))

Output:

['Hello', 'Welcome', 'to', 'python', 'pool', 'hope', 'you', 'are', 'doing', 'well']
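
The ‘tokenize’ function also accepts a ‘lowercase’ argument in recent Gensim versions, so you can normalize the tokens while splitting. A small sketch:

list(tokenize(example, lowercase=True))

Output:

['hello', 'welcome', 'to', 'python', 'pool', 'hope', 'you', 'are', 'doing', 'well']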

Further, you can also split a body of text into sentences using this library. Let us see an example.

# note: the 'summarization' subpackage was removed in Gensim 4.0, so this requires an older Gensim (< 4.0)
from gensim.summarization.textcleaner import split_sentences
example = "Hello, Welcome to python pool. Hope you are doing well"
tokens = split_sentences(example)
print(tokens)

Output:

['Hello, Welcome to python pool.', 'Hope you are doing well']

As you can see, we have tokenized the string into two sentences.

Method 6: Using Regex

Further, you can also tokenize strings in Python using regex. A regular expression describes a pattern of characters, and the ‘re’ module lets you find every substring that matches it. Let us see how we can use regex to tokenize a string with the help of an example.

import re
text = "Welcome to python pool"
pattern = r"\w+"  # match runs of one or more word characters
re.findall(pattern, text)

Output:

['Welcome', 'to', 'python', 'pool']

From the example, you can see how you can use regex to tokenize a string: every match of the pattern ‘\w+’ becomes a token.
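
You can also approximate sentence tokenization with regex by splitting after sentence-ending punctuation. A rough sketch (it will not handle abbreviations as gracefully as NLTK or Gensim):

import re
text = "Hello, Welcome to python pool. Hope you are doing well"
re.split(r'(?<=[.!?])\s+', text)

Output:

['Hello, Welcome to python pool.', 'Hope you are doing well']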

FAQs on Tokenize String in Python

What is Tokenize string?

Tokenization is the process of breaking a string into tokens. These tokens can be words, sentences, characters, etc.

Where are tokens used?

Tokens, in the programming sense, are primarily used in NLP (Natural Language Processing) to process long sequences of text.

How do you tokenize a string in Python using NLTK?

You can tokenize a string in Python using NLTK through its ‘tokenize’ module. This module has a function ‘word_tokenize()’, which can divide a string into tokens.

How do you Tokenize a sentence?

You can use any of the methods discussed in this article to tokenize a sentence, such as the ‘split_sentences()’ function of the ‘Gensim’ library or NLTK’s ‘sent_tokenize()’.

Conclusion

Finally, we can conclude that tokenization is an important process in Python, and there are many ways to tokenize strings. We have discussed a few of the most useful ones. You can use whichever method fits your needs and the way your code is structured.
