[Fixed] tfidfvectorizer object has no attribute get_feature_names

You may run into errors while working with TfidfVectorizer in Python, and a call to the get_feature_names() method is a common cause. Go through this blog to know more.

tfidfvectorizer object

TF-IDF stands for “Term Frequency-Inverse Document Frequency”. scikit-learn implements it in the TfidfVectorizer class. It scores how important a word is to a document relative to the whole text corpus: words that are frequent in one document but rare across the corpus score highest. When you wish to perform text feature extraction, you can create a TF-IDF matrix using the TfidfVectorizer class.
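As a rough illustration of what the vectorizer produces, the sketch below fits a TfidfVectorizer on two made-up sentences and inspects the resulting matrix. The sentences and variable names are assumptions chosen only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents, used only for illustration
docs = ["fresh bread and butter", "fresh roasted chicken"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix

# One row per document, one column per distinct term
n_docs, n_terms = tfidf_matrix.shape
print(n_docs, n_terms)

# Look up the weights of two terms in the first document
first_row = tfidf_matrix[0].toarray().ravel()
fresh_w = first_row[vectorizer.vocabulary_["fresh"]]
bread_w = first_row[vectorizer.vocabulary_["bread"]]
print(fresh_w, bread_w)
```

Here "fresh" occurs in both documents while "bread" occurs in only one, so within the first document "bread" receives the higher weight.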

Reasons for the error

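You may be getting the “tfidfvectorizer object has no attribute get_feature_names” error for one of the following reasons:

Your code calls get_feature_names(), but the installed version of scikit-learn no longer provides it.
The method name in your code is misspelled.
You are calling the method on the class or on another object rather than on a TfidfVectorizer instance.

scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out(), and version 1.2 removed it. On version 1.2 or later, any call to get_feature_names() therefore raises this error. The reverse also applies: versions older than 1.0 do not have get_feature_names_out().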
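One quick way to see which of the two method names your installation offers is to ask the object itself. This is a minimal sketch using Python's built-in hasattr(); the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Feature detection: check which of the two method names exists
has_old = hasattr(vectorizer, "get_feature_names")
has_new = hasattr(vectorizer, "get_feature_names_out")

print(f"get_feature_names available:     {has_old}")
print(f"get_feature_names_out available: {has_new}")
```

If the first line prints False, every call to get_feature_names() in your code needs to be changed to get_feature_names_out().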

Resolving the error

You can get rid of the “tfidfvectorizer object has no attribute get_feature_names” error by following the given methods of resolution.

Check the version of scikit-learn

You need to check the version of scikit-learn first. The first command below shows the installed version, and the second upgrades the library to the latest release.

pip show scikit-learn
# to upgrade:
pip install --upgrade scikit-learn

You may enter these either in the terminal or command prompt. Keep in mind that upgrading alone does not remove the error: recent releases no longer include get_feature_names(), so the code must also switch to get_feature_names_out(). You can also check the version from within Python:

import sklearn
print(sklearn.__version__)
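If you want a script to report which method applies, you can compare the version number in code. The sketch below is one possible approach; major_minor is a hypothetical helper written for this example, not part of scikit-learn.

```python
import re

import sklearn


def major_minor(version):
    """Return (major, minor) from a version string such as '1.3.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    return int(match.group(1)), int(match.group(2))


installed = major_minor(sklearn.__version__)

if installed >= (1, 2):
    print("Only get_feature_names_out() is available.")
elif installed >= (1, 0):
    print("Both exist; get_feature_names() emits a deprecation warning.")
else:
    print("Only get_feature_names() is available.")
```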

Correct Usage of get_feature_names() function

Which function you can call depends on the version of scikit-learn. Versions older than 1.0 only have get_feature_names(). Versions 1.0 and 1.1 have both, but get_feature_names() emits a deprecation warning. From version 1.2 onward, only get_feature_names_out() exists. So once you have checked the version of scikit-learn, use the matching function. Have a look at the given code for differentiating between the two.

On versions older than 1.2, get_feature_names() is used in this manner:

from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data about a delicious meal
text_data = [
    "The aroma of freshly baked bread filled the air.",
    "A platter of golden brown, crispy roasted chicken arrived at the table.",
    "Juicy vegetables and fluffy mashed potatoes completed the perfect meal.",
]

# Create and train a TF-IDF vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(text_data)

# Extract the vocabulary of words used
# (works only on scikit-learn older than 1.2)
vocabulary = vectorizer.get_feature_names()

# Print the unique words in the text data
print("Vocabulary:")
for word in vocabulary:
    print(f"- {word}")

On version 1.0 or later, use get_feature_names_out() instead:

from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
text_data = [
    "The aroma of freshly baked bread filled the air.",
    "A platter of golden brown, crispy roasted chicken arrived at the table.",
    "Juicy vegetables and fluffy mashed potatoes completed the perfect meal.",
]

# Initialize and fit the TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)

# Get the feature names
feature_names = tfidf_vectorizer.get_feature_names_out()
print(feature_names)

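Both approaches yield the same vocabulary, although the printed format differs: the first example prints one word per line, and get_feature_names_out() returns a NumPy array rather than a list. Note that the single-letter word "A" is missing, because the default tokenizer ignores one-character tokens. The terms are: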

['air', 'and', 'aroma', 'arrived', 'at', 'baked', 'bread', 'brown', 'chicken', 'completed', 'crispy', 'filled', 'fluffy', 'freshly', 'golden', 'juicy', 'mashed', 'meal', 'of', 'perfect', 'platter', 'potatoes', 'roasted', 'table', 'the', 'vegetables']
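If the same code has to run on both old and new installations, one option is a small wrapper that picks whichever method exists. The sketch below is an illustration only; feature_names_compat is a hypothetical helper name, not a scikit-learn function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def feature_names_compat(fitted_vectorizer):
    """Return feature names as a plain list on old and new scikit-learn."""
    if hasattr(fitted_vectorizer, "get_feature_names_out"):
        # scikit-learn 1.0+: returns a NumPy array, so convert to a list
        return fitted_vectorizer.get_feature_names_out().tolist()
    # Older releases: the method already returns a list
    return fitted_vectorizer.get_feature_names()


vectorizer = TfidfVectorizer()
vectorizer.fit(["The aroma of freshly baked bread filled the air."])
names = feature_names_compat(vectorizer)
print(names)
```

Returning a list in both branches keeps the rest of the code independent of the installed version.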

Check the flow of code

You must fit the vectorizer, using fit() or fit_transform(), before asking for the feature names, because the vocabulary is only learned during fitting. Have a look at the given code for better understanding:

from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
text_data = ["The aroma of freshly baked bread filled the air."]

# Initialize and fit the TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)

# Get the feature names (on scikit-learn older than 1.0, use get_feature_names())
feature_names = tfidf_vectorizer.get_feature_names_out()
print(feature_names)

Use an alternative

The vocabulary_ attribute is a useful alternative to the get_feature_names() function, and it is available in every version of scikit-learn. After fitting, it holds a dictionary that maps each term to its column index in the TF-IDF matrix. The dictionary keys are not guaranteed to be in column order, so sort them by index if you want the same ordering that get_feature_names_out() returns.

from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
text_data = ["The aroma of freshly baked bread filled the air."]

# Initialize and fit the TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)

# Get the feature names using the vocabulary_ attribute,
# ordered by their column index
vocabulary = tfidf_vectorizer.vocabulary_
feature_names = sorted(vocabulary, key=vocabulary.get)
print(feature_names)
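Because vocabulary_ stores column indices, it can also be used to look up the weight of a specific term. The following is a minimal sketch; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ["The aroma of freshly baked bread filled the air."]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(text_data)

vocab = vectorizer.vocabulary_  # dict: term -> column index

# Weight of each term in the first (and only) document
bread_w = tfidf_matrix[0, vocab["bread"]]
the_w = tfidf_matrix[0, vocab["the"]]
print(f"bread: {bread_w:.3f}, the: {the_w:.3f}")
```

With a single document every term has the same inverse document frequency, so "the", which occurs twice, ends up with exactly twice the weight of "bread".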

Correction of Typos

A misspelled method name also raises an AttributeError. Check the spelling and capitalization in your code: the method is get_feature_names_out(), all lowercase with underscores, and the class is TfidfVectorizer.

Error correction in conda

If you work in a conda environment, you can update the scikit-learn library with the help of this command:

conda update scikit-learn

Incorrect Object

If you call the get_feature_names_out() method on an object that is not a vectorizer, then it also generates an AttributeError, although the message names the type of that object. For example, if you call it on a string object, then you will get the “'str' object has no attribute 'get_feature_names_out'” error.

# Example of incorrect object reference
not_a_vectorizer = "I am a string, not a vectorizer"
# Raises AttributeError: 'str' object has no attribute 'get_feature_names_out'
not_a_vectorizer.get_feature_names_out()
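If you are unsure what kind of object a variable holds, you can check its type before calling the method. The sketch below shows one way to do so; describe is a hypothetical helper written for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def describe(obj):
    """Say whether obj is a TfidfVectorizer or something else."""
    if isinstance(obj, TfidfVectorizer):
        return "vectorizer"
    return f"not a vectorizer: {type(obj).__name__}"


print(describe(TfidfVectorizer()))
print(describe("I am a string, not a vectorizer"))
```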

Usage of get_feature_names attribute

We have discussed the different methods through which we can get rid of the get_feature_names attribute error. Once this is done, you are free to use this function for the following purposes:

Inspection of features
Drawing inference from a machine learning model
Visualization of the data
Generation of the word cloud

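For example, you can train a classifier on the TF-IDF features and then use the feature names to see which terms carry the largest coefficients.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Vectorize text_data (the three-sentence list from the first example)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_data)
feature_names = vectorizer.get_feature_names_out()

# Placeholder labels, one per document (replace with your own)
y_train = [0, 1, 1]

# Create a logistic regression classifier
classifier = LogisticRegression()

# Train the classifier on the vectorized text data
classifier.fit(X_train, y_train)

# Get the coefficients of the classifier
coefficients = classifier.coef_[0]

# Sort the feature indices by coefficient in decreasing order
sorted_indices = np.argsort(coefficients)[::-1]

# Get the feature names for the top 10 features
top_10_features = feature_names[sorted_indices[:10]]

# Print the top 10 features
print(top_10_features)

Code like this shows which terms have the largest positive coefficients, that is, the terms that push a prediction most strongly toward the positive class.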
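Feature inspection also works without a classifier. The sketch below ranks the terms of a single document by their TF-IDF weight, reusing two of the example sentences. It builds the term list from vocabulary_ so that it does not depend on the scikit-learn version; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The aroma of freshly baked bread filled the air.",
    "A platter of golden brown, crispy roasted chicken arrived at the table.",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Terms ordered by column index
vocab = vectorizer.vocabulary_
terms = np.array(sorted(vocab, key=vocab.get))

# Weights of the first document as a dense 1-D array
weights = tfidf_matrix[0].toarray().ravel()

# Indices of the three largest weights, highest first
top = np.argsort(weights)[::-1][:3]
top_terms = terms[top].tolist()
print(top_terms)
```

The first term in the result is "the", simply because it occurs twice in that sentence. This is one reason to filter stop words before inspecting features.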

Other tips while working with tfidfvectorizer

There are a few tips that you can go through while working with TF-IDF vectorizer.

Remove stop words from the text. Without stop words, sentences are much easier to process and form features.
Stemming and lemmatization reduce each word in the corpus to its root form. These are well-tested techniques.
You can specify some parameters like maximum document frequency (max_df) threshold, minimum document frequency (min_df) threshold, and sublinear term frequency (sublinear_tf) transformation.

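The max_df parameter removes words that are present in too many documents, that is, words whose document frequency exceeds the maximum threshold. On the other hand, min_df removes words that appear in fewer documents than the minimum threshold. For both parameters, an integer is an absolute document count and a float is a proportion of the documents. The sublinear term frequency (sublinear_tf) transformation does not remove any words: it replaces the raw term frequency tf with 1 + log(tf), which dampens the effect of words that occur many times within a document.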
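The effect of these parameters is easier to see on a tiny corpus. The sketch below uses three made-up documents; idf weighting and normalization are switched off in the second part (use_idf=False, norm=None) so that the effect of sublinear_tf on the term frequency is directly visible.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, used only for illustration
docs = [
    "bread bread bread butter",
    "bread chicken",
    "bread chicken salad",
]

# max_df=0.9 drops "bread" (present in 3 of 3 documents);
# min_df=2 drops "butter" and "salad" (present in 1 document each)
pruned = TfidfVectorizer(max_df=0.9, min_df=2).fit(docs)
kept_terms = sorted(pruned.vocabulary_)
print(kept_terms)

# Compare the raw and the sublinear term frequency of "bread" in document 0
raw = TfidfVectorizer(use_idf=False, norm=None)
sub = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)
raw_matrix = raw.fit_transform(docs)
sub_matrix = sub.fit_transform(docs)
col = raw.vocabulary_["bread"]
raw_tf = raw_matrix[0, col]
sub_tf = sub_matrix[0, col]
print(raw_tf, sub_tf)
```

Only "chicken" survives the pruning, and the term frequency of "bread" shrinks from 3 to 1 + log(3), roughly 2.1, under the sublinear transformation.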

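Let us consider the following code that uses all these parameters:

from sklearn.feature_extraction.text import TfidfVectorizer

# Small illustrative corpus in which several words repeat across documents.
# With min_df=2, fit() raises a ValueError if no word occurs in two documents.
text_data = [
    "The bread is fresh.",
    "The chicken is crispy.",
    "Fresh bread and crispy chicken.",
    "A perfect meal of bread and chicken.",
]

# Create a stop words list
stop_words = ["the", "is", "of", "and", "a"]

# Create a vectorizer object
vectorizer = TfidfVectorizer(stop_words=stop_words, max_df=0.75, min_df=2, sublinear_tf=True)

# Fit the vectorizer to the text data
vectorizer.fit(text_data)

# Transform the text data into a sparse matrix
sparse_matrix = vectorizer.transform(text_data)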

FAQs

What is the difference between get_feature_names and get_feature_names_out?

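The get_feature_names() function was deprecated in scikit-learn 1.0 and removed in version 1.2. Current versions of scikit-learn use the get_feature_names_out() function, which returns a NumPy array of strings instead of a list.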

Conclusion

This blog covers the reasons for the “tfidfvectorizer object has no attribute get_feature_names” error that users encounter while working with the scikit-learn library in Python. The blog includes different ways through which you can resolve this error. It also covers some tips users can go through while working with TfidfVectorizer.