Numpy OneHot: The Secret Sauce for Machine Learning Success

Have you ever wondered how you can change a categorical variable to a numerical one? Don’t worry. We have got you covered. Numpy one hot encoding does it all for you.

About numpy onehot

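One hot encoding means representing each value of a categorical variable as a binary vector that has a single 1 in the position of its category and 0 everywhere else. Numpy can be used to do one hot encoding in Python. Machine learning algorithms depend on one hot encoding for a couple of reasons: most of them need numerical input, and plain integer labels would suggest an order among the categories that does not exist.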

One-hot encoding with numpy

Numpy offers several ways that aid the programmer in encoding variables in binary format. Such variables are usually categorical in nature.

Using the zeros function of numpy

Yes, you heard it right. The zeros() function of numpy can be used for numpy onehot encoding. Firstly, create an array of non-negative integers, one per sample. As the next step, create an array of zeros with one row per sample and one column per category, that is, the array's maximum value plus 1 columns. Thirdly, for each ith row, assign a value of 1 to the a[i]th column.

import numpy as np

a = np.array([1, 0, 3])
b = np.zeros((a.size, a.max() + 1))  # one row per value, one column per category
b[np.arange(a.size), a] = 1          # set column a[i] of row i to 1

Have a look at the output array:

array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
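
The three steps above can be wrapped in a small helper. This is only a minimal sketch: the function name one_hot and its num_classes parameter are illustrative names and not part of numpy. Passing num_classes explicitly is useful when the largest label does not appear in the data.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Return an (n, k) one hot array for 1-D non-negative integer labels."""
    labels = np.asarray(labels)
    if num_classes is None:
        # Assumption: the labels run from 0 up to labels.max()
        num_classes = labels.max() + 1
    encoded = np.zeros((labels.size, num_classes))
    # For row i, set column labels[i] to 1
    encoded[np.arange(labels.size), labels] = 1
    return encoded

result = one_hot([1, 0, 3], num_classes=5)
print(result.shape)  # (3, 5)
```

Here the output has 5 columns even though the largest label is 3, because the number of categories was given explicitly.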

Using the eye function of numpy

The eye function of numpy works similarly. np.eye(n) creates an n x n identity matrix, and indexing it with the values picks, for each value, the row that has its 1 in the matching column.

import numpy as np

values = [1, 0, 3]
n_values = np.max(values) + 1
np.eye(n_values)[values]

Now, consider the output array, which has the encoded values:

array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

Consider another example. Assume that a color array has 3 categorical values.

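import numpy as np

color = ["red", "green", "blue"]
# np.eye() can only be indexed with integers, so use each color's position in the list
indices = np.arange(len(color))
one_hot_color = np.eye(len(color))[indices]
print(one_hot_color)

Each row of the output shown below represents one color. The position of the 1 in a row identifies the color: the first column stands for red, the second for green, and the third for blue.

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]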
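
Since np.eye() can only be indexed with integers, string categories first have to be converted to integer indices. One possible way, sketched here with np.unique() and its return_inverse option, also handles values that repeat. Note that np.unique() sorts the categories alphabetically, so the column order is blue, green, red. The variable names are illustrative.

```python
import numpy as np

colors = np.array(["red", "green", "blue", "green"])

# categories is sorted alphabetically; indices[i] is the position
# of colors[i] inside categories
categories, indices = np.unique(colors, return_inverse=True)
one_hot_colors = np.eye(len(categories))[indices]

print(categories)  # ['blue' 'green' 'red']
print(one_hot_colors)
```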

Using the identity function of numpy

The identity function provides the same result as the above two methods. It follows the given syntax:

np.identity(num_classes)[indices]
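
As a quick check, the following sketch encodes the same values with np.identity() and np.eye() and compares the results. The variable names are illustrative.

```python
import numpy as np

indices = np.array([1, 0, 3])
num_classes = indices.max() + 1

via_identity = np.identity(num_classes)[indices]
via_eye = np.eye(num_classes)[indices]

print(np.array_equal(via_identity, via_eye))  # True
```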

Use cases

The encoding of categorical variables as binary ones is utilized in a large number of fields. Some of them are:

Machine learning
Deep learning
Natural language processing
Image classification
Text classification
Recommendation systems
Anomaly detection
Fraud detection
Medical diagnosis
Financial forecasting

Return type

After one hot encoding, the result is a numpy array of shape (n, k) where:

n: number of samples in the input variable
k: number of categories

In this array, 1 means that the sample belongs to a given category, while 0 means that it does not belong to the category. The input variable is categorical. Let’s go through an example to know more about the return type.

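import numpy as np

color = ["red", "green", "blue"]
# Index np.eye() with integer positions; indexing with the strings themselves raises an error
one_hot_color = np.eye(len(color))[np.arange(len(color))]

print(type(one_hot_color))
print(one_hot_color.shape)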

The output will be:

<class 'numpy.ndarray'>
(3, 3)

This means that there are 3 samples and 3 categories. The categories are red, green, and blue.
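
Because every row of the (n, k) array holds a single 1, the original category can be recovered from the position of that 1. A minimal sketch using argmax() is shown below. The sample order (blue, red, green) is made up for illustration.

```python
import numpy as np

color = ["red", "green", "blue"]
one_hot_color = np.eye(len(color))[[2, 0, 1]]  # samples: blue, red, green

# argmax along axis 1 returns the column that holds the 1 in each row
indices = one_hot_color.argmax(axis=1)
decoded = [color[i] for i in indices]
print(decoded)  # ['blue', 'red', 'green']
```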

The preferred dtypes

Two preferred data types for one hot encoding are:

np.float32
np.float64

Machine learning algorithms generally support these data types, so you can specify them while doing one hot encoding. The following example shows how to specify the data type with the dtype argument of np.eye().

import numpy as np

color = ["red", "green", "blue"]
# Index with integer positions, since np.eye() cannot be indexed with strings
one_hot_color = np.eye(len(color), dtype=np.float32)[np.arange(len(color))]

print(one_hot_color.dtype)

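The output will be float32. The choice of dtype depends on 2 factors:

memory usage or performance
higher precision

For the first case, float32 is fine because it takes half the memory of float64. For the second case, when later calculations on the array need higher precision, float64 is used. The encoded values 0 and 1 themselves are stored exactly in both types.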
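
To make the memory difference concrete, the sketch below encodes the same labels with different dtypes and compares the number of bytes used. np.uint8 is included only for comparison; whether an integer dtype is acceptable is an assumption that depends on the library that consumes the array.

```python
import numpy as np

labels = np.array([1, 0, 3, 2, 1])
n_classes = labels.max() + 1

sizes = {}
for dtype in (np.float64, np.float32, np.uint8):
    encoded = np.eye(n_classes, dtype=dtype)[labels]
    # nbytes is the total memory taken by the array elements
    sizes[encoded.dtype.name] = encoded.nbytes

print(sizes)  # {'float64': 160, 'float32': 80, 'uint8': 20}
```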

Onehot encoding vs. Dummy variables

These two methods are used to change the categorical variable into a numerical variable. However, a few differences exist between these two methods.

One hot encoding	Dummy variables
It creates one binary column for every category (k columns for k categories).	It creates one binary column fewer (k - 1 columns); the dropped category acts as the baseline.
It adds more variables.	It adds fewer variables.
It is often used with logistic regression, support vector machines (SVMs), and neural networks.	It is often used with decision trees and random forests.
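
The difference in the number of variables can be seen in a short sketch. It assumes the common convention that dummy variable encoding drops one category (here the first one) as the baseline.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])

one_hot = np.eye(3)[labels]  # one column per category
dummy = one_hot[:, 1:]       # category 0 dropped and used as the baseline

print(one_hot.shape, dummy.shape)  # (4, 3) (4, 2)
```

A sample of the baseline category is represented by a row of zeros in the dummy encoding.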

One hot encoding in Sparse datasets

Sparse datasets are those in which most of the values are zeroes. Categorical data held in a sparse matrix can also go through one hot encoding. Consider the following steps. These will aid you in doing the encoding.

Firstly, you need to import the numpy library along with the csr_matrix class of SciPy.

import numpy as np
from scipy.sparse import csr_matrix

As the second step, you need to create a sparse matrix.

X = csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

Thirdly, use the eye function of numpy to encode the sparse matrix. X.tocoo().col returns the column index of every non-zero entry in X. Indexing the identity matrix with these column indices picks one row per non-zero entry. So, in the result, a row corresponds to a sample, while a column corresponds to a category. A row has 1 in the column of its category and 0 everywhere else. Note that the result is a regular (dense) numpy array.

one_hot_X = np.eye(X.shape[1])[X.tocoo().col]

Finally, use the print function to have a look at the encoded matrix.

print(one_hot_X)

The output is:

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
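
The encoded result above is a dense array. If the encoding itself should stay sparse, one option is to build it directly as a csr_matrix from integer labels. This is a sketch of that idea and not the method used above; the labels and variable names are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

labels = np.array([1, 0, 3, 1])
n_samples = labels.size
n_classes = labels.max() + 1

# One stored value per sample: a 1 at (row i, column labels[i])
data = np.ones(n_samples)
rows = np.arange(n_samples)
one_hot_sparse = csr_matrix((data, (rows, labels)), shape=(n_samples, n_classes))

print(one_hot_sparse.nnz)  # 4 stored values instead of 16
print(one_hot_sparse.toarray())
```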

Numpy offers several benefits when we perform one hot encoding:

Compatibility with many machine learning libraries, which accept numpy arrays as input.
High speed, since the encoding uses vectorized array operations.
Ability to work with large datasets.

One hot encoding with multiple columns

At times, there may arise a need to encode multiple columns altogether. Numpy supports one hot encoding with multiple columns too.

Firstly, you need to create an array using numpy. Next up, you have to specify which columns you wish to encode. After that, use the eye function of numpy and index it with the values in those columns, which assigns either 0 or 1, as mentioned earlier. At the end, the concatenate() function of numpy joins the encoded columns with the remaining columns along the column axis.

import numpy as np

# Create a NumPy array from the dataset (values must be non-negative integers)
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Identify the columns to one-hot encode and the columns to keep as they are
categorical_columns = [0, 1]
other_columns = [c for c in range(X.shape[1]) if c not in categorical_columns]

# One-hot encode the categorical columns using the `np.eye()` function
# The result has shape (rows, encoded columns, X.max() + 1)
encoder = np.eye(X.max() + 1)[X[:, categorical_columns]]

# Flatten so that each row holds the encodings of its columns side by side
one_hot_X = encoder.reshape(X.shape[0], -1)

# Combine the one-hot encoded columns with the other columns in the dataset
X_combined = np.concatenate([X[:, other_columns], one_hot_X], axis=1)

# Print the one-hot encoded dataset
print(X_combined)
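
Using X.max() + 1 categories for every column can create many columns that are always 0. A possible variation, sketched below under the assumption that each column should only get as many categories as it has distinct values, encodes every column separately with np.unique(). The dataset values are made up for illustration.

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [1, 5, 9]])
categorical_columns = [0, 1]

encoded_parts = []
for col in categorical_columns:
    # Map the distinct values of this column to 0..k-1, then one hot encode
    _, indices = np.unique(X[:, col], return_inverse=True)
    encoded_parts.append(np.eye(indices.max() + 1)[indices])

# Keep the columns that were not encoded and append the encoded ones
other = np.delete(X, categorical_columns, axis=1)
X_encoded = np.concatenate([other] + encoded_parts, axis=1)
print(X_encoded.shape)  # (3, 5)
```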

The Performance Boost

In order to boost the performance, consider a few tips for using numpy onehot.

Always assign the same category to the same column of the encoding, for example in both training and test data. Else, it will cause confusion.
Dimensionality reduction techniques are handy in case of a large number of categories in the dataset. Principal component analysis, commonly known as PCA, is one such technique used for dimensionality reduction.
One hot encoding creates matrices that consist mostly of zeroes. Stored as dense arrays, these take up a lot of memory when there are many categories. In that case, consider a sparse matrix format such as csr_matrix from SciPy.
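
The first tip can be sketched as follows. The idea is to fix the list of categories once and reuse it for every dataset. The function name encode and the category values are hypothetical, and the sketch assumes that every value passed in is present in the category list.

```python
import numpy as np

# Fixed, sorted category list that is decided once (hypothetical values)
categories = np.array(["blue", "green", "red"])

def encode(values):
    # searchsorted finds the position of each value in the sorted category list
    # Assumption: every value is present in `categories`
    indices = np.searchsorted(categories, values)
    return np.eye(len(categories))[indices]

train = encode(["red", "blue"])
test = encode(["red"])
print(np.array_equal(train[0], test[0]))  # True
```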

FAQs

What are the alternatives to numpy for one hot encoding?

You can use the pandas.get_dummies() function, the sklearn.preprocessing.OneHotEncoder class of scikit-learn, or the OneHotEncoder class of PySpark.

Conclusion

This blog covers one hot encoding in numpy and draws a comparison between one hot encoding and the dummy variables method of encoding.