Mastering Scikit-LLM: The Ultimate Guide for Data Enthusiasts

If you want to perform text analysis with large language models, Python's scikit-llm library will not disappoint you.

One of the most commonly used families of Large Language Models is OpenAI's GPT series, which scikit-llm uses under the hood for textual analysis and interpretation of data. Even though this library has been developed recently, it has garnered positive feedback from a large number of users.

Uses

  • Zero-shot text classification
  • Multi-label zero-shot text classification
  • Text vectorization
  • Text translation
  • Text summarization

All these features are present in the scikit-llm library. It can classify labeled data, with single-label or multi-label groupings depending on the user's needs. Text can also be converted into vector format with the help of this library, and it can translate text or generate a summary of it, as sketched right below.
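For instance, the vectorization and summarization features are exposed as scikit-learn-style transformers. A minimal sketch, assuming the preprocessing module layout of the early 0.x releases (module paths have moved between versions, so check your installed release):

from skllm.preprocessing import GPTVectorizer, GPTSummarizer

X = ["I had a great experience at this restaurant! The food was delicious and the service was excellent."]

# Turn each text into a fixed-size embedding vector
vectorizer = GPTVectorizer()
X_vec = vectorizer.fit_transform(X)

# Produce a short summary; max_words caps the summary length
summarizer = GPTSummarizer(openai_model="gpt-3.5-turbo", max_words=15)
summaries = summarizer.fit_transform(X)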

Examples

  • Sentiment classification into positive, negative, or neutral categories
  • Spam detection
  • Recommendation Systems
  • Chatbots
  • Text Summarization

Installation

You can install it using pip:

pip install scikit-llm

After this, you need to import it. Use the command given below to do so.

import skllm

Working

The scikit-llm library provides the SKLLMConfig class, which helps the user set an API key and an organization name for OpenAI. This is a crucial step because you need an API key in order to use OpenAI's LLMs.

# importing SKLLMConfig to configure OpenAI API (key and Name)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")

Now, as the next step, you need to create an LLM estimator object, for example a zero-shot classifier. If you need to categorize a given text, you can use the code below. The X_train and y_train variables hold the training data, i.e., the text samples and their labels. X_new holds the new test data, and the estimator object predicts the labels for it.

# Create a Scikit-LLM estimator (a zero-shot classifier is shown here)
from skllm import ZeroShotGPTClassifier
estimator = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# Train the estimator on the labeled samples
estimator.fit(X_train, y_train)

# Make predictions on the new samples
y_pred = estimator.predict(X_new)

ZeroShotGPTClassifier in Scikit-LLM


This classifier segregates text into categories without any task-specific training. There is no need to provide labeled training data, because it relies on a language model (GPT-3.5 by default) that has already been trained on vast amounts of text. It draws relations between the words of a text and the candidate labels, and generates a label accordingly. It is quite handy in two cases:

  • If labeling is quite expensive
  • If the dataset has few or no labeled examples

Working of ZeroShotGPTClassifier

To start working with the ZeroShotGPTClassifier of the Scikit-LLM package, you need to import it first. After that, create a classifier object and provide the candidate labels to it. Say, in this example, customer satisfaction reviews are given 3 labels: “positive”, “negative”, and “neutral”. Then feed new data to the model; using the predict function, the classifier predicts a label for each new sample.

from skllm import ZeroShotGPTClassifier

classifier = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
candidate_labels = ["positive", "negative", "neutral"]
# No labeled training data is needed: pass None along with the candidate labels
classifier.fit(None, candidate_labels)
new_data = ["I had a great experience at this restaurant! The food was delicious and the service was excellent.", "I had a terrible experience at this restaurant. The food was cold and the service was slow."]
y_pred = classifier.predict(new_data)

Thus, the output for y_pred will be:

y_pred = ["positive", "negative"]
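The Uses section above also lists multi-label zero-shot classification. Scikit-LLM exposes this through the MultiLabelZeroShotGPTClassifier; a minimal sketch, following the same 0.x-style top-level imports used above (max_labels caps how many labels one sample may receive):

from skllm import MultiLabelZeroShotGPTClassifier

candidate_labels = ["quality", "price", "delivery", "service"]
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
# Again, no training data: pass None plus the candidate labels (wrapped in a list)
clf.fit(None, [candidate_labels])
labels = clf.predict(["Great price and fast delivery, but the quality is poor."])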

Few-Shot Text Classification in Scikit-LLM

Contrary to the ZeroShotGPTClassifier, few-shot text classification requires some labeled examples. It is a bit more complex and only works when labeled data is given to the model. The labeled examples are inserted into the model's prompt, so they should specify the task well.

Working on Few-Shot Text Classification

The given example demonstrates the FewShotGPTClassifier.

from skllm import FewShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Load a small demo dataset of texts and their labels
X, y = get_classification_dataset()

clf = FewShotGPTClassifier(openai_model="gpt-3.5-turbo")
# The labeled examples are embedded into the prompt used at prediction time
clf.fit(X, y)
labels = clf.predict(X)

Multi-Label Few Shot Text Classification with Scikit-LLM

With the help of Multi-Label Few-Shot Text Classification, you can assign many labels to a single sample. For example:

  • A news text can carry two labels, such as “politics” and “trade”.
  • Product reviews can be positive, neutral, or negative.
  • Medical images can be labeled benign, malignant, or inconclusive.

Approach

This method has two approaches:

  • Transformer-based language model
  • Meta-learning approach

Transformer-based language models like BERT and RoBERTa can assign multiple labels to data. They can draw relations among words because they are trained on large datasets of text, so they are well suited to classifying it. In the case of the meta-learning approach, as the name suggests, it learns to learn: the few labeled examples that the user provides help determine the labels for new data. In other words, based on a few predefined examples, this approach infers the labels of the test data. A minimal sketch of the multi-label setup itself follows.
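To make the multi-label setup concrete before moving to the LLM-based classifier, here is a minimal sketch using plain scikit-learn rather than a transformer (all sample texts and labels here are illustrative): MultiLabelBinarizer encodes each sample's label set as a binary indicator row, and a one-vs-rest logistic regression predicts each label independently.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample can carry several labels at once
X = ["Tariff talks dominate the election debate.", "The new phone is excellent and affordable."]
y = [["politics", "trade"], ["product", "positive"]]

# Encode the label sets as a binary indicator matrix
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)

# One binary classifier per label, on top of TF-IDF features
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(X, Y)

# Decode predictions back into label sets
print(mlb.inverse_transform(clf.predict(X)))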

Working of Multi-Label Few Shot Text Classification

from skllm.models.gpt.gpt_few_shot_clf import MultiLabelFewShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# Demo dataset where each text carries a list of labels
X, y = get_multilabel_classification_dataset()

# max_labels caps how many labels may be assigned to one sample
clf = MultiLabelFewShotGPTClassifier(max_labels=2, openai_model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)

Text Classification with Google PaLM 2


The scikit-llm library provides users an interface to work with the PaLM 2 LLM provided by Google. It is one of the most efficient LLMs for the classification of text and other such linguistic operations. In the scikit-llm library, PaLM 2 support is currently in the test phase.

Working in Scikit-LLM

You need to specify two IDs, namely the Google Cloud project ID and the organization ID. Once you have set these, create a classifier object for the PaLM 2 LLM. After this, use the fit function to fit the training data, and then predict labels for new samples of data using the predict function of the PaLM 2 classifier.

from skllm.models.palm import PaLMClassifier

# Set the Google Cloud project ID and organization ID
from skllm.config import SKLLMConfig
SKLLMConfig.set_project_id("<YOUR_PROJECT_ID>")
SKLLMConfig.set_organization_id("<YOUR_ORGANIZATION_ID>")

# Create a PaLM 2 classifier
classifier = PaLMClassifier()

# Fit the classifier to a training dataset
X_train = ["This is a positive example.", "This is a negative example."]
y_train = [1, 0]

classifier.fit(X_train, y_train)

# Make predictions on a test dataset
X_test = ["This is a new example.", "This is another new example."]

y_pred = classifier.predict(X_test)

# Print the predicted labels
print(y_pred)

Here, 1 denotes a positive example, while 0 means a negative one. You will get the following output:

[1 0]

LLM Fine Tuning with Scikit-LLM

The library provides four types of estimators for fine-tuning LLMs. These are:

  • skllm.models.palm.PaLMClassifier
  • skllm.models.gpt.GPTClassifier
  • skllm.models.palm.PaLM
  • skllm.models.gpt.GPT

The first two estimators fine-tune on single-label classification data, whereas the other two handle general text-to-text tasks, taking free-form text as input and producing text as output.

Working

LLM fine-tuning in scikit-llm does not require manual vectorization: the tunable estimators take raw text, upload the training samples to the provider's fine-tuning API, and train a tuned model that is then used for predictions on new data. The snippet below is a minimal sketch based on the tunable GPTClassifier; the exact module path and base-model name vary between scikit-llm versions, so check the release you have installed.

# A minimal sketch of classifier fine-tuning; module paths and base-model
# names vary across scikit-llm versions, so treat this as illustrative
from skllm.models.gpt import GPTClassifier

# Load the training data (raw text goes in directly, no manual vectorization)
X_train = ["This is a positive example.", "This is a negative example."]
y_train = ["positive", "negative"]

# Create a tunable classifier; n_epochs=None lets the provider pick the epoch count
clf = GPTClassifier(base_model="gpt-3.5-turbo-0613", n_epochs=None)

# Launch the fine-tuning job and wait for the tuned model
clf.fit(X_train, y_train)

# Classify new data with the tuned model
X_test = ["This is a new example."]
y_pred = clf.predict(X_test)

# Print the predicted label
print(y_pred[0])

Is there an LLM package for TensorFlow or Keras?

LLMs can be used from TensorFlow and Keras, most commonly through the Hugging Face Transformers library, which ships TensorFlow classes for many models. Note that the names below refer to models rather than packages, and only some of them (such as Bloom) have openly downloadable weights:

  • Jurassic-1 Jumbo
  • Bard
  • LaMDA
  • Megatron-Turing NLG
  • Bloom

You may include the openly available ones in your TF or Keras projects for all sorts of uses; hosted models such as Bard or LaMDA are instead accessed through their providers' APIs. A short loading sketch follows.
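For example, GPT-2, a small open model that ships TensorFlow weights on the Hugging Face Hub, can be loaded through Transformers' TF classes; a minimal sketch:

from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Load an open model with TensorFlow weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

# Generate a short continuation for a prompt
inputs = tokenizer("Text classification with LLMs is", return_tensors="tf")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))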

FAQs

What kind of labels should we provide to the datasets in Scikit-LLM?

The labels should always be descriptive. Replace a bare token such as “1” with a more descriptive phrase such as “positive review”, because the model reads the label text itself, as illustrated below.
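For instance, with the zero-shot classifier the candidate label strings are passed straight into the prompt, so descriptive phrasing gives the model more to work with; a minimal illustration:

from skllm import ZeroShotGPTClassifier

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
# Descriptive label strings beat opaque tokens such as "1" and "0"
clf.fit(None, ["positive review", "negative review", "neutral review"])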

Which model should users choose out of GPT-3.5 and GPT-4?

GPT-4 is the more powerful model, but it is also slower and more expensive; GPT-3.5 is usually sufficient for straightforward classification tasks. The model is chosen when the estimator is created, as shown below.
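In scikit-llm, the model is selected via the openai_model parameter used throughout this post:

from skllm import ZeroShotGPTClassifier

# Swap in GPT-4 simply by changing the model name
clf = ZeroShotGPTClassifier(openai_model="gpt-4")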

Conclusion

This blog covered scikit-llm, a newly developed Python library. It is still at an early stage of development but is already widely used for text classification and related tasks. We walked through its main estimators and the tasks each of them handles.
