If you want to perform text analysis with large language models, the scikit-llm library for Python is a good place to start.
One of the most commonly used large language models is OpenAI’s GPT-3. It is widely used with scikit-llm for analyzing or interpreting text data. Although the library was developed only recently, it has received positive feedback from a large number of users.
- Zero-shot text classification
- Multi-label zero-shot text classification
- Text vectorization
- Text translation
- Text summarization
All these features are available in the scikit-llm library. Classification assigns labels to text, with single-label or multi-label groupings depending on the needs of the user. The library can also convert text into vector form, translate text, or generate a summary of it.
- Sentiment classification into positive, negative, or neutral categories
- Spam detection
- Recommendation Systems
- Text Summarization
You can install it using pip:
pip install scikit-llm
After installation, import the library as shown below. The scikit-llm library provides the SKLLMConfig class, which lets you set the API key and the organization name for OpenAI. This step is crucial because you need an API key to use OpenAI’s LLMs.
# Import SKLLMConfig to configure the OpenAI API (key and organization name)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
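Hard-coding the key in a script makes it easy to leak, so one option is to read it from the environment first. The sketch below assumes the key is stored in an environment variable named `OPENAI_API_KEY` and the optional organization in `OPENAI_ORG`. Both variable names and the helper `load_openai_credentials` are illustrative choices, not part of scikit-llm.

```python
import os


def load_openai_credentials(env=None):
    """Return (key, org) read from an environment-style mapping.

    The variable names OPENAI_API_KEY and OPENAI_ORG are assumptions
    made for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    org = env.get("OPENAI_ORG", "").strip() or None
    return key, org


# Demo with an explicit mapping so the sketch runs without a real key.
demo_env = {"OPENAI_API_KEY": " sk-demo ", "OPENAI_ORG": ""}
key, org = load_openai_credentials(demo_env)
print(key, org)
```

The returned values can then be passed to `SKLLMConfig.set_openai_key` and, when an organization is present, to `SKLLMConfig.set_openai_org`.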
As the next step, you need to create an LLM estimator object. If you need to categorize a given text, you can use the code below.
The X_train and y_train variables hold the training data, that is, the text samples and their labels. X_new holds the new data to classify, and the estimator object predicts the labels for it.
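# Create a scikit-llm estimator (the package is imported as skllm)
from skllm import ZeroShotGPTClassifier

estimator = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# Train the estimator
estimator.fit(X_train, y_train)

# Make predictions
y_pred = estimator.predict(X_new)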
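The estimator follows the familiar scikit-learn pattern of fit followed by predict. To make that contract concrete without an API key, here is a toy stand-in that runs offline. `KeywordClassifier` is a hypothetical class written for this illustration; it is not part of scikit-llm and calls no language model, it simply picks the label whose training texts share the most words with the new text.

```python
from collections import Counter, defaultdict


class KeywordClassifier:
    """Toy estimator with the same fit/predict shape as a scikit-learn model.

    It is NOT part of scikit-llm and calls no language model; it only
    counts word overlap so the example runs offline.
    """

    def fit(self, X, y):
        # Collect word counts per label from the training texts.
        self.word_counts_ = defaultdict(Counter)
        for text, label in zip(X, y):
            self.word_counts_[label].update(text.lower().split())
        self.classes_ = sorted(self.word_counts_)
        return self

    def predict(self, X):
        # For each text, pick the label whose training words overlap most.
        predictions = []
        for text in X:
            words = text.lower().split()
            scores = {
                label: sum(self.word_counts_[label][w] for w in words)
                for label in self.classes_
            }
            predictions.append(max(self.classes_, key=lambda c: scores[c]))
        return predictions


X_train = ["great food and friendly staff", "cold food and slow service"]
y_train = ["positive", "negative"]
X_new = ["the staff was friendly", "service was slow"]

estimator = KeywordClassifier().fit(X_train, y_train)
y_pred = estimator.predict(X_new)
print(y_pred)
```

Because the interface is the same, code written against this toy class would look identical when an LLM-backed estimator is swapped in.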
ZeroShotGPTClassifier in Scikit-LLM
This classifier categorizes text without any training on labeled examples. You only supply the names of the candidate labels; there is no need to label the data itself, because the classifier relies on GPT-3, a language model that has already been trained. It draws on the relations among the words of a text and generates a label accordingly. It is quite handy in two cases:
- If labeling is quite expensive
- If the dataset has only a few labeled examples
Working of ZeroShotGPTClassifier
To start working with the ZeroShotGPTClassifier of the Scikit-LLM package, you need to import it first. After that, create a classifier object and provide the candidate labels to it. In this example, customer satisfaction reviews are given three labels: “positive,” “negative,” and “neutral.” Then feed new data to the model. Using the predict function, the classifier can predict labels for the new data.
from skllm import ZeroShotGPTClassifier

classifier = ZeroShotGPTClassifier()

# Zero-shot: no training texts are needed, only the candidate labels
candidate_labels = ["positive", "negative", "neutral"]
classifier.fit(None, candidate_labels)

new_data = [
    "I had a great experience at this restaurant! The food was delicious and the service was excellent.",
    "I had a terrible experience at this restaurant. The food was cold and the service was slow.",
]

y_pred = classifier.predict(new_data)
For these two reviews, the expected output for y_pred is:
y_pred = ["positive", "negative"]
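Conceptually, a zero-shot classifier places the candidate labels and the text into a prompt and then maps the model's reply back onto one of the labels. The sketch below illustrates that idea offline. The prompt wording and the names `build_zero_shot_prompt`, `pick_label`, and `fake_llm` are assumptions made for this example; they are not the prompt or the functions that scikit-llm actually uses.

```python
def build_zero_shot_prompt(text, candidate_labels):
    """Assemble a prompt that asks for exactly one of the candidate labels."""
    label_list = ", ".join(f'"{label}"' for label in candidate_labels)
    return (
        "Classify the text into exactly one of these categories: "
        f"{label_list}.\n"
        f"Text: {text}\n"
        "Answer with the category name only."
    )


def pick_label(reply, candidate_labels, default=None):
    """Map a free-form model reply back onto one of the allowed labels."""
    cleaned = reply.strip().strip('."\'').lower()
    for label in candidate_labels:
        if cleaned == label.lower():
            return label
    return default  # the reply did not match any candidate label


def fake_llm(prompt):
    """Hypothetical stand-in for a model call so the sketch runs offline."""
    text_part = prompt.split("Text: ", 1)[1].lower()
    return "Positive." if "great" in text_part else "negative"


candidate_labels = ["positive", "negative", "neutral"]
reviews = ["I had a great experience.", "The food was cold."]
y_pred = [
    pick_label(fake_llm(build_zero_shot_prompt(r, candidate_labels)), candidate_labels)
    for r in reviews
]
print(y_pred)
```

Validating the reply against the candidate list matters because a language model returns free text, which may include punctuation, different casing, or a label that was never offered.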
Few-Shot Text Classification in Scikit-LLM
Unlike the ZeroShotGPTClassifier, few-shot text classification requires some labeled examples, so it only works when labeled data is given to the model. It is a bit more complex: the labeled examples become part of the prompt, and the prompt should specify the task well.
Working of Few-Shot Text Classification
The given example demonstrates the FewShotGPTClassifier.
from skllm import FewShotGPTClassifier
from skllm.datasets import get_classification_dataset

# Load a small labeled demo dataset
X, y = get_classification_dataset()

# The labeled samples passed to fit serve as the few-shot examples
clf = FewShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)

labels = clf.predict(X)
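To illustrate what "few-shot" means in practice, the sketch below shows one way labeled examples could be placed in a prompt ahead of the text to classify. The function `build_few_shot_prompt` and the prompt layout are hypothetical and chosen for readability; the prompt that scikit-llm builds internally may look different.

```python
def build_few_shot_prompt(examples, candidate_labels, new_text):
    """Put labeled (text, label) pairs in front of the text to classify."""
    lines = ["Classify each text as one of: " + ", ".join(candidate_labels) + "."]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
    # The final entry is left open so the model completes the label.
    lines.append(f"Text: {new_text}")
    lines.append("Label:")
    return "\n".join(lines)


examples = [
    ("The movie was fantastic.", "positive"),
    ("I want my money back.", "negative"),
]
labels = sorted({label for _, label in examples})
prompt = build_few_shot_prompt(examples, labels, "It was an average evening.")
print(prompt)
```

Each example adds to the prompt length, so a handful of representative samples per label is usually a more practical choice than passing a whole dataset.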
Multi-Label Few Shot Text Classification with Scikit-LLM
With multi-label few-shot text classification, you can assign several labels to a single item. For example:
- News text can have two labels: “politics” and “trade”.
- Product reviews can be positive, neutral, or negative.
- Medical images can be labeled as benign, malignant, or inconclusive.
This method has two approaches:
- Transformer-based language model
- Meta-learning approach
Transformer-based language models such as BERT and RoBERTa can assign multiple labels to the data. Because they are trained on large text datasets, they can relate the words of a text to one another, which makes them suitable for classifying text. The meta-learning approach, as the name suggests, learns to learn: the few labeled examples that the user provides are used to determine the labels for new data.
Working of Multi-Label Few Shot Text Classification
from skllm.models.gpt.gpt_few_shot_clf import MultiLabelFewShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# Load a small multi-label demo dataset
X, y = get_multilabel_classification_dataset()

# max_labels caps the number of labels assigned to each text
clf = MultiLabelFewShotGPTClassifier(max_labels=2, openai_model="gpt-3.5-turbo")
clf.fit(X, y)

labels = clf.predict(X)
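With max_labels=2, each text receives at most two labels. The sketch below shows one way such a cap could be enforced on a model's reply. It assumes the reply is a comma-separated list of label names; `parse_multilabel_reply` is a hypothetical helper for illustration and not a scikit-llm function.

```python
def parse_multilabel_reply(reply, candidate_labels, max_labels=2):
    """Keep known labels from a comma-separated reply, capped at max_labels."""
    allowed = {label.lower(): label for label in candidate_labels}
    selected = []
    for part in reply.split(","):
        key = part.strip().strip('."\'').lower()
        label = allowed.get(key)
        # Skip unknown labels and duplicates, preserving reply order.
        if label is not None and label not in selected:
            selected.append(label)
    return selected[:max_labels]


candidate_labels = ["politics", "trade", "sports", "health"]
first = parse_multilabel_reply("Politics, trade, health", candidate_labels)
second = parse_multilabel_reply("weather, sports", candidate_labels)
print(first)
print(second)
```

Note that the result for a single text is a list, so the predictions for a whole dataset form a list of lists rather than a flat list of labels.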
Text Classification with Google PaLM 2
The scikit-llm library provides an interface to the PaLM 2 LLM from Google, which is well suited to text classification and similar linguistic tasks. In scikit-llm, this support is currently in the test phase.
Working in Scikit-LLM
You need to specify your Google Cloud project ID. Once it is set, create a classifier object for the PaLM 2 LLM. After this, use the fit function to fit the training data, and then predict labels for new samples of data using the predict function of the PaLM 2 classifier.
from skllm.config import SKLLMConfig
from skllm.models.palm import PaLMClassifier

# Set the Google Cloud project ID
SKLLMConfig.set_google_project("<YOUR_PROJECT_ID>")

# Create a PaLM 2 classifier
classifier = PaLMClassifier()

# Fit the classifier to a training dataset
X_train = ["This is a positive example.", "This is a negative example."]
y_train = [1, 0]
classifier.fit(X_train, y_train)

# Make predictions on a test dataset
X_test = ["This is a new example.", "This is another new example."]
y_pred = classifier.predict(X_test)

# Print the predicted labels
print(y_pred)
Here, 1 denotes a positive example, while 0 denotes a negative one. The print statement outputs one predicted label for each test sample.
LLM Fine Tuning with Scikit-LLM
It provides four types of estimators for fine-tuning the LLM. These are:
The first two estimators are fine-tuned on single-label data, whereas the other two handle text-to-text tasks, where both the input and the output are text.
The fine-tuning example below involves a vectorizer, here a TfidfVectorizer. It converts the textual data into a numerical representation, and then the gpt2 model is loaded for fine-tuning. The model is trained on the given dataset so that it can predict the labels of new data.
from sklearn.feature_extraction.text import TfidfVectorizer
# LLM and fine_tune are used as given in this example; the class names
# available in your installed scikit-llm version may differ
from skllm.models import LLM

# Load the training data
X_train = ["This is a positive example.", "This is a negative example."]
y_train = [1, 0]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Learn the vocabulary and transform the training data into a numerical representation
X_train_vec = vectorizer.fit_transform(X_train)

# Load the GPT-2 model
model = LLM(model_name="gpt-2")

# Fine-tune the model
model.fine_tune(X_train_vec, y_train, epochs=10)

# Predict the label of new data, reusing the fitted vectorizer
X_test = ["This is a new example."]
X_test_vec = vectorizer.transform(X_test)
y_pred = model.predict(X_test_vec)

# Print the predicted label
print(y_pred)
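To see what "converting text to a numerical representation" means, here is a hand-rolled TF-IDF sketch that uses only the standard library. It applies the smoothed IDF formula idf = ln((1 + n) / (1 + df)) + 1, which is the documented default of scikit-learn's TfidfVectorizer, but it uses a simpler tokenizer and skips the final L2 normalization, so the numbers will not match TfidfVectorizer's output exactly. The function name `tfidf_vectors` is illustrative.

```python
import math
import re
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF row per document."""
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        for term in vocabulary
    }
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Raw term count times IDF; no length normalization in this sketch.
        vectors.append([counts[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["This is a positive example.", "This is a negative example."]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)
for row in vectors:
    print([round(value, 3) for value in row])
```

Words that appear in both documents get the minimum weight, while "positive" and "negative", which each appear in only one document, get a higher weight. That difference is what lets a downstream model tell the two texts apart.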
Is there an LLM package for TensorFlow or Keras?
Large language models can be used from TensorFlow and Keras projects. Two models that are often mentioned in this context are:
- Jurassic-1 Jumbo
- Megatron-Turing NLG
Note that these are language models rather than TensorFlow or Keras packages, so check how each one is made available before including it in your TF or Keras projects.
The labels should always be descriptive. For fine-tuning, replace a single-word label with a more descriptive one.
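One simple way to apply this tip is to keep a mapping between terse labels and descriptive ones, convert before training, and convert back after prediction. The label texts and the helpers `to_descriptive` and `to_short` below are made up for illustration.

```python
# Hypothetical mapping from terse labels to descriptive ones.
DESCRIPTIVE_LABELS = {
    "pos": "the customer is satisfied with the product",
    "neg": "the customer is unhappy with the product",
    "neu": "the customer expresses no clear opinion",
}
# Reverse lookup, used to turn predictions back into short labels.
SHORT_LABELS = {desc: short for short, desc in DESCRIPTIVE_LABELS.items()}


def to_descriptive(labels):
    """Replace each terse label with its descriptive form."""
    return [DESCRIPTIVE_LABELS[label] for label in labels]


def to_short(labels):
    """Map descriptive labels back to the original terse labels."""
    return [SHORT_LABELS[label] for label in labels]


y_train = ["pos", "neg", "pos"]
y_train_descriptive = to_descriptive(y_train)
print(y_train_descriptive)
print(to_short(y_train_descriptive))
```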
GPT-4 is a more powerful model.
This blog covers scikit-llm, a recently developed Python library. It is still under development but is already widely used for text classification and other related tasks. The post walks through some of the estimators in scikit-llm and shows how each of them is used.