XLNet for Text Classification: Advancing Natural Language Processing

XLNet is one of the top-performing models for text classification. You might be wondering whether any model performs better than BERT. XLNet is one such model. By the end of this blog, you will know what XLNet is and how you can use it for text classification.

About XLNet

Researchers from Carnegie Mellon University and Google came up with XLNet, a pretrained language model that can be fine-tuned for text classification. If you aim to sort text into labels or categories, XLNet is a strong candidate.

XLNet’s research paper

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le introduced XLNet in the research paper “XLNet: Generalized Autoregressive Pretraining for Language Understanding”. The paper was published at NeurIPS 2019.

Need for this model


The existing methods of pretraining a language model fall into two types:

  • autoregressive (AR)
  • autoencoding (AE)

Both methods are used to pretrain language models (LMs), but they work differently. An autoregressive model predicts each token from the tokens that precede it, so it sees context in one direction only. An autoencoding model, such as BERT, receives a corrupted sequence and reconstructs the original tokens using context from both directions.

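XLNet combines the strengths of both: it is a generalized autoregressive method that still captures bidirectional context. The paper reported state-of-the-art performance on several benchmarks at the time of publication.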
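For reference, the three pretraining objectives can be sketched as follows. The notation follows the XLNet paper and is an assumption of this sketch: x is a token sequence of length T, x̂ is the corrupted input, m_t is 1 when token x_t is masked, and Z_T is the set of all permutations of the positions 1..T. Treat it as a summary, not a full derivation.

```latex
% Autoregressive (AR): predict each token from its left context
\max_\theta \; \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})

% Autoencoding (AE, BERT-style): reconstruct the masked tokens from the corrupted input
\max_\theta \; \sum_{t=1}^{T} m_t \, \log p_\theta(x_t \mid \hat{\mathbf{x}})

% XLNet: autoregressive prediction along a sampled factorization order z
\max_\theta \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}) \right]
```

The third objective is still autoregressive, but because the order z varies, each token is eventually predicted from context on both of its sides.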

Features of this model

Some features of XLNet that have been taken from Transformer-XL:

  • segment recurrence
  • relative positional encoding

Other features include:

  • Autoregressive pretraining with a permutation language modeling objective
  • Two-stream self-attention
  • No fixed limit on sequence length, thanks to segment recurrence
  • Efficiency
  • Flexibility

Masked language models or XLNet for text classification?

XLNet also performs better than masked language models (MLMs) such as BERT. MLMs face a pretrain-finetune discrepancy: during pretraining, some input tokens are replaced with an artificial [MASK] symbol and the model predicts only those masked tokens, whereas during fine-tuning the input contains no [MASK] symbols at all. The model is therefore fine-tuned on inputs that look different from the ones it was pretrained on. In addition, the masked tokens are predicted independently of one another, so dependencies between them are ignored.
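To make the discrepancy concrete, here is a minimal sketch in plain Python. In BERT-style pretraining, selected tokens are replaced by a [MASK] symbol, while the fine-tuning input is left untouched. The helper mask_tokens and the chosen positions are hypothetical; real masked language models pick positions at random and work on token ids instead of strings.

```python
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Return a copy of tokens with the given positions replaced by [MASK]."""
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Pretraining input: some tokens are hidden behind the artificial [MASK] symbol.
pretrain_input = mask_tokens(tokens, positions={1, 5})

# Fine-tuning input: the sentence is used as-is, so [MASK] never appears.
finetune_input = list(tokens)

print(pretrain_input)
print(finetune_input)
```

The [MASK] symbol is present in the first list only, which is exactly the mismatch between the two phases.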

XLNet addresses this problem by never corrupting the input. Instead, it considers different permutations of the order in which the tokens of a sentence are predicted (the factorization order), while the sentence itself keeps its original word order. Across these permutations, each token learns to use context from both its left and its right, which captures the relationships among the words of a sentence better. This property of XLNet is useful for:

  • question answering,
  • natural language inference, and
  • summarization


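Let us understand how it works with the help of an example:

The cat sat on the mat.

A masked language model would replace ‘cat’ with [MASK] and predict it from the remaining words. XLNet instead samples a prediction order, for example ‘sat’, ‘mat’, ‘The’, ‘cat’, ‘on’, ‘the’, and predicts ‘cat’ from the words that come earlier in that order (‘sat’, ‘mat’, and ‘The’). Over many such orders, it learns how ‘cat’ relates to every other word, including that ‘sat’ follows ‘cat’.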
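The idea of predicting along a permuted order can be sketched in a few lines of Python. The function permutation_contexts is a hypothetical helper written for this illustration; the real model achieves the same effect with attention masks inside the Transformer and does not build explicit lists.

```python
import random

def permutation_contexts(tokens, seed=0):
    """Sample one factorization order and list the context visible to each token.

    Returns (order, contexts), where contexts[pos] holds the original positions
    that the token at pos may condition on under the sampled order.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # a random prediction order over the positions

    contexts = {}
    for step, pos in enumerate(order):
        # A token may only see the tokens that come earlier in the order.
        contexts[pos] = sorted(order[:step])
    return order, contexts

tokens = ["The", "cat", "sat", "on", "the", "mat"]
order, contexts = permutation_contexts(tokens)

for pos in order:
    visible = [tokens[i] for i in contexts[pos]]
    print(f"predict {tokens[pos]!r} at position {pos} from {visible}")
```

The first token in the order is predicted with no context, and the last one sees every other token. Changing the seed gives a different order, so a token ends up being predicted from words on both of its sides.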

Token data in XLNet for text classification

The XLNet model works on token data. Text is split into units called tokens, and the model predicts each token from other tokens in the sequence. XLNet’s tokenizer is based on SentencePiece, so a rare word may be split into several subword pieces. Tokens basically consist of:

  • Words
  • Punctuation marks
  • Any other symbols

As an example, the sentence “The puppy sat on the mat” is split, at word level, into the tokens:

  • the
  • puppy
  • sat
  • on
  • the
  • mat
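The word-level split above can be reproduced with a small regular expression. This is a simplified stand-in for illustration only: simple_tokenize is a hypothetical helper, and XLNet’s real tokenizer produces subword pieces and keeps the original casing.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(simple_tokenize("The puppy sat on the mat"))
# ['the', 'puppy', 'sat', 'on', 'the', 'mat']

print(simple_tokenize("The puppy sat on the mat."))
# ['the', 'puppy', 'sat', 'on', 'the', 'mat', '.']
```

The second call shows that a punctuation mark becomes a token of its own.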

The tokens formed in XLNet are used for permutation language modeling (PLM). As in ordinary language modeling, the model predicts a token from the tokens before it, but “before” refers to a sampled permutation of the positions instead of the left-to-right order. This lets XLNet draw on context from both the left and the right of a token, which differentiates it from other models used for text classification. Standard autoregressive language modeling (ARLM) techniques can use the context on one side of a word only, so they may be less accurate on tasks that need the full context. So now we know which model we should choose for the purpose of text classification.

Working with XLNet for text classification

Consider that you have raw data in a document, and you need to set labels for the text. The first step is to load this data and split the text into tokens, as we have learned before. The transformers library provides the XLNetTokenizer class and model classes such as XLNetModel and XLNetForSequenceClassification. XLNetTokenizer changes the text into tokens. XLNetModel is the base model that has been pre-trained on data, and XLNetForSequenceClassification adds a classification head on top of it, which produces the logits needed here.

The tokenizer output contains input_ids, the token ids of the text, and attention_mask, which marks real tokens with 1 and padding with 0. Both are converted to tensors and passed to the model. The model returns logits, one raw score per label. The softmax() function turns the logits into probabilities, and torch.argmax() returns the index of the largest probability, which is the predicted label. Note that the classification head must be fine-tuned on labeled data before these predictions are meaningful.

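import torch
import torch.nn.functional as F
from transformers import XLNetTokenizer, XLNetForSequenceClassification

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
# XLNetModel returns hidden states only; the classification variant adds a head
# that outputs logits. num_labels=3 is an assumption that matches the example below.
# The head is randomly initialized, so fine-tune it before trusting the predictions.
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=3)
model.eval()

# Load data from a file (assumed here to hold one text per line)
with open('data.txt', 'r', encoding='utf-8') as f:
  texts = [line.strip() for line in f if line.strip()]

# Tokenize the data and return padded PyTorch tensors
tokenized_data = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
input_ids = tokenized_data['input_ids']
attention_mask = tokenized_data['attention_mask']

# Pass the input data to the model without tracking gradients
with torch.no_grad():
  outputs = model(input_ids, attention_mask=attention_mask)

# Get the logits from the model outputs, shape (number of texts, number of labels)
logits = outputs.logits

# Calculate the predicted probabilities
probabilities = F.softmax(logits, dim=-1)

# Get the predicted labels
predictions = torch.argmax(probabilities, dim=-1)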

Suppose the probabilities tensor for six texts and three labels is:

[[0.1, 0.2, 0.7],
 [0.3, 0.4, 0.3],
 [0.5, 0.2, 0.3],
 [0.4, 0.5, 0.1],
 [0.2, 0.3, 0.5],
 [0.6, 0.2, 0.2]]

The output of the torch.argmax() function will then be:

[2, 1, 0, 1, 2, 0]

Each value is the index of the maximum probability in the corresponding row.
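The same two steps can be checked without PyTorch. The sketch below uses only the Python standard library; softmax and argmax are small helper functions written for this illustration, and the label names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

probabilities = [
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
    [0.5, 0.2, 0.3],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.2, 0.2],
]
predictions = [argmax(row) for row in probabilities]
print(predictions)  # [2, 1, 0, 1, 2, 0]

# Hypothetical label names, only for illustration
labels = ["negative", "neutral", "positive"]
print([labels[i] for i in predictions])

# softmax turns logits into a probability distribution
probs = softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6), argmax(probs))  # 1.0 2
```

Softmax keeps the ranking of the logits, so taking the argmax of the logits directly would give the same labels. The probabilities are mainly useful when you also want a confidence value.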

Is XLNet better than GPT?

XLNet, of course, has an edge over GPT when it comes to bidirectional context. But is that all? Let us have a look at the differences between the two.

  • Context: XLNet provides bidirectional context; GPT doesn’t provide bidirectional context.
  • Training data: a training data size of 40GB exists for XLNet; a training data size of 40GB exists for GPT.
  • Consistency: XLNet is more consistent; GPT is less consistent than XLNet.

So, we can say that XLNet suits tasks that work better on bidirectional inputs, whereas the GPT model suits unidirectional (left-to-right) use cases such as text generation.

XLNet vs. BERT model

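  • Pretraining objective: XLNet follows permutation language modeling (PLM); BERT follows masked language modeling (MLM).
  • Training data: sizes of 136GB and 33GB are quoted for the two models, with XLNet pretrained on the larger corpus.
  • Consistency: XLNet is more consistent; BERT is less consistent than XLNet.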

So, we can conclude that both models capture the bidirectional context that tasks like natural language inference need. XLNet does so without [MASK] tokens in its input, which avoids BERT’s pretrain-finetune discrepancy.

Is the XLNet transformer available on Hugging Face?

You can check the following XLNet transformer models on Hugging Face:

  • xlnet-base-cased
  • xlnet-large-cased

Using it is quite simple. First, you need to install the transformers library (XLNetTokenizer also needs the sentencepiece package). Then, you load the tokenizer and the XLNetModel class with from_pretrained(), which creates a new model object and loads the pre-trained weights. The given example shows the basic workings of the XLNet transformer model.

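import torch
import transformers

# Load the XLNet tokenizer and model
tokenizer = transformers.XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet_model = transformers.XLNetModel.from_pretrained("xlnet-base-cased")
xlnet_model.eval()

# Encode the input text as PyTorch tensors
input_text = "What is the capital of France?"
encoded_input = tokenizer(input_text, return_tensors="pt")

# Run the model; XLNetModel returns hidden states, not generated text
with torch.no_grad():
  response = xlnet_model(**encoded_input)

# Print the shape of the hidden states: (batch size, number of tokens, hidden size)
print(response.last_hidden_state.shape)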

Is XLNet a language model?

Yes, XLNet is a language model. It is pretrained with a permutation-based autoregressive objective.

Is XLNet better than BERT?

Yes, XLNet is generally considered to be a better model. The XLNet paper reports that it outperforms BERT on 20 tasks.


This blog explains how to use XLNet for text classification. It also discusses why the model is considered better than GPT and BERT.
