Numpy Mean: Implementation and Importance

In statistics, three of the most important operations are finding the mean, median, and mode of a dataset. These values reveal a lot about the data. The mean is the average of the data, the median is the middle value once the data is sorted, and the mode is the value that occurs most often. In this article, we will look at the mean, why it matters, and how to calculate it using the numpy mean() function.

Numpy mean is a method for computing the average of the values in an array. It is built in and runs on numpy's compiled array routines, so it is fast on large arrays. Most importantly, it can compute the mean of multi-dimensional arrays.

Mean = (Sum of all the terms) / (Total number of terms)
For example, if we have 5 numbers: 2, 4, 6, 1, 9
Mean = (2 + 4 + 6 + 1 + 9) / 5
Mean = 22 / 5 = 4.4
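
As a quick check of the formula above, here is a minimal sketch that computes the mean of the same five numbers by hand and with numpy. The variable names are only illustrative.

```python
import numpy

values = [2, 4, 6, 1, 9]

# Mean by hand: sum of the terms divided by the number of terms
manual_mean = sum(values) / len(values)

# The same calculation using numpy
numpy_mean = numpy.mean(values)

print(manual_mean)  # 4.4
print(numpy_mean)   # 4.4
```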

Syntax of Numpy Mean

The numpy module is used to perform fast operations on arrays. To use it, we first need to install it with: pip install numpy
Inside the numpy module, there is a function called mean(), which calculates the arithmetic mean of the given data points.

numpy.mean(arr, axis=None, dtype=None, out=None)

Parameters-

arr: The array whose mean we want to find (numpy's own signature names this parameter a). The elements should be numeric, for example integers or floating-point values. If arr is not already an array, such as a list, numpy converts it to an array automatically.

axis: The axis or axes along which the mean is calculated. When no value is given, the mean is calculated over the flattened array.
If axis=0, the mean is calculated down each column, giving one value per column.
If axis=1, the mean is calculated across each row, giving one value per row.
We will see how the axis parameter works in the programs below.

dtype: The data type used to compute the mean. By default, it is float64 for integer input and the same type as the input for floating-point input. If we want the output to be an integer, we pass int; the fractional part of the result is then dropped.

out: If we want the output to be stored in an existing array, we can pass that array in this argument. Its shape must match the shape of the expected output. This is an optional parameter.

Return Type-

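By default, the output is a float (float64 for integer input), but we can change it to an integer with the dtype parameter. When axis is given, the result is an array of means rather than a single value.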
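
To make the return type concrete, the following small sketch inspects the dtype of the result for an integer array and for a float32 array. The array contents and variable names are made up for illustration.

```python
import numpy

int_data = numpy.array([1, 2, 3, 4])
float32_data = numpy.array([1, 2, 3, 4], dtype=numpy.float32)

int_mean = numpy.mean(int_data)          # integer input gives a float64 result
float32_mean = numpy.mean(float32_data)  # float32 input stays float32

print(int_mean, int_mean.dtype)
print(float32_mean, float32_mean.dtype)
```

Both means are 2.5. Only the data type of the result differs.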

Calculating mean using Numpy Mean

Let us now jump to the coding part. We will see how each parameter affects the output.

1. Without any additional arguments.

import numpy
a = [10,20,11,320]
# the list is automatically converted to an array
print(numpy.mean(a))

Output-
90.25

import numpy
arr = [[10,20,30],[50,60,70],[40,80,90]]
# no axis is given, so we get the mean of the whole array as a single value
print(numpy.mean(arr))

50.0

Here the output is 50.0 because (10+20+30+50+60+70+40+80+90)/9 = 450/9 = 50.0

2. Using axis parameter

import numpy
arr = [[10,20,30],[50,60,70],[40,80,90]]
# column wise elements are taken
print(numpy.mean(arr,axis=0))

[33.33333333 53.33333333 63.33333333]

Here, as we have given axis=0, the elements are taken column-wise.

(10+50+40)/3 = 33.33333333

(20+60+80)/3 = 53.33333333

(30+70+90)/3 = 63.33333333

import numpy
arr = [[10,20,30],[50,60,70],[40,80,90]]
# row wise elements are taken
print(numpy.mean(arr,axis=1))

[20. 60. 70.]

Here the elements are taken row-wise: (10+20+30)/3 = 20.0, (50+60+70)/3 = 60.0, (40+80+90)/3 = 70.0
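
The effect of axis is easier to see on a non-square array. The sketch below uses a 3x4 array and checks the shape of each result: axis=0 returns one mean per column and axis=1 returns one mean per row. The variable names are illustrative.

```python
import numpy

arr = numpy.array([[10, 20, 30, 80], [50, 60, 70, 20], [40, 80, 90, 100]])

col_means = numpy.mean(arr, axis=0)  # one mean per column -> 4 values
row_means = numpy.mean(arr, axis=1)  # one mean per row    -> 3 values

print(arr.shape)        # (3, 4)
print(col_means.shape)  # (4,)
print(row_means.shape)  # (3,)

# A single column or row mean can be reproduced by slicing
first_col_mean = numpy.mean(arr[:, 0])  # same as col_means[0]
first_row_mean = numpy.mean(arr[0, :])  # same as row_means[0]
```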

3. Changing the dtype

By default, the output is in the form of float. Let us change it into integer type.

import numpy
arr = [[10,20,30,80],[50,60,70,20],[40,80,90,100]]
# for integer the dtype is 'int'
print(numpy.mean(arr,axis=1,dtype=int))

[35 50 77]
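
Note that the last row's true mean is 77.5, so dtype=int drops the fractional part rather than rounding. If we want rounded integers instead, one possible approach (a sketch, not the only way) is to compute the float mean first and round it afterwards:

```python
import numpy

arr = [[10, 20, 30, 80], [50, 60, 70, 20], [40, 80, 90, 100]]

float_means = numpy.mean(arr, axis=1)           # 35.0, 50.0, 77.5
int_means = numpy.mean(arr, axis=1, dtype=int)  # fractional part is dropped
rounded = numpy.rint(float_means).astype(int)   # round to nearest, then convert

print(float_means)
print(int_means)
print(rounded)
```

numpy.rint rounds exact halves to the nearest even integer, so 77.5 becomes 78 here.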

4. Storing the output in another array

Let us see how to store the output in another array.

import numpy
arr = [[10,20,30,80],[50,60,70,20],[40,80,90,100]]
# making an integer output array of shape (4,), one slot per column
out_arr=numpy.arange(4)
numpy.mean(arr,axis=0,out=out_arr)
print(out_arr)

[33 53 63 66]

You may be wondering why we used a size of 4 and how to know what shape the output array should have. The answer is simple. Here, we have given axis=0, which means we want column-wise means, and the number of columns is 4, so the output array has size 4. Because out_arr holds integers, the fractional parts of the means are dropped. We can check the shape of the input using the numpy.array().shape attribute.

import numpy
arr = [[10,20,30,80],[50,60,70,20],[40,80,90,100]]
print(numpy.array(arr).shape)

(3, 4)

Rows = 3, Columns = 4

Let us do the same for axis=1.

import numpy 
arr = [[10,20,30,80],[50,60,70,20],[40,80,90,100]]
# making an array of shape - 3
out_arr=numpy.arange(numpy.array(arr).shape[0])
numpy.mean(arr,axis=1,out=out_arr)
print(out_arr)

[35 50 77]
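
The integer output arrays above lose the fractional part of each mean. If we want to keep it, a float output array can be used instead. The following is a minimal sketch under that assumption, and the names are illustrative.

```python
import numpy

arr = numpy.array([[10, 20, 30, 80], [50, 60, 70, 20], [40, 80, 90, 100]])

# Float output array with one slot per row (axis=1 gives one mean per row)
out_arr = numpy.empty(arr.shape[0], dtype=numpy.float64)
numpy.mean(arr, axis=1, out=out_arr)

print(out_arr)  # values: 35.0, 50.0, 77.5
```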

Applications of numpy mean in statistics

In data science, the mean is a very important operation. We can handle null values in a dataset by replacing them with the mean (commonly known as imputation), which is a common and easy practice. We can also use the mean to measure the error of regression models. Don't worry if you don't know much about regression or null values; you will get a basic idea along the way.

a. Mean Squared Error

In statistics, the mean squared error (MSE) is the average of the squared errors. In other words, we take the difference between each predicted and actual value, square it, and then average all the squared differences. Let us now see how we can do this using numpy mean.

For example, suppose the actual prices of four houses are 100000, 120115, 400030, and 500000, and we have created a model that predicted the values 100400, 121015, 402090, and 509070.

Now, we can calculate the mean squared error.

import numpy
actual_values=[100000,120115,400030, 500000]
predicted_values=[100400,121015,402090, 509070]
squared_difference=[]
# square the difference between each actual and predicted value
for i in range(len(actual_values)):
    diff=actual_values[i] - predicted_values[i]
    squared_difference.append(diff**2)
# the mean of the squared differences is the MSE
print(numpy.mean(squared_difference))

21869625.0
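
The same calculation can also be written without an explicit loop by converting the lists to numpy arrays. This is just an alternative sketch of the loop above, using the same house prices:

```python
import numpy

actual_values = numpy.array([100000, 120115, 400030, 500000])
predicted_values = numpy.array([100400, 121015, 402090, 509070])

# Subtraction and squaring are applied element by element,
# then numpy.mean averages the squared differences
mse = numpy.mean((actual_values - predicted_values) ** 2)

print(mse)  # 21869625.0
```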

b. Filling NaN values using numpy mean

Suppose we have a dataset containing the ages of people, and some of the values are NaN (null values). Let us see how to fill those null values.

import pandas as pd
import numpy
# creating dataset
dataset=pd.DataFrame([12,55,70,numpy.nan,33,28,numpy.nan,44,35,29],columns=["Age"])
print(dataset)

# finding the mean of ages (pandas skips NaN values when computing the mean)
mean=numpy.mean(dataset["Age"])
print(mean)

38.25

# filling the null values with mean
dataset=dataset.fillna(mean)
print(dataset)

We can check that there are now no NaN values in our dataset.
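
For plain numpy arrays (without pandas), numpy.mean does not skip NaN values and returns nan if any are present. One way to handle this, sketched below with the same ages, is to use numpy.nanmean, which ignores NaN values, and then fill the gaps with numpy.where. The variable names are illustrative.

```python
import numpy

ages = numpy.array([12, 55, 70, numpy.nan, 33, 28, numpy.nan, 44, 35, 29])

plain_mean = numpy.mean(ages)   # nan, because the array contains NaN values
age_mean = numpy.nanmean(ages)  # 38.25, NaN values are ignored

# Replace each NaN with the mean and keep every other value as it is
filled = numpy.where(numpy.isnan(ages), age_mean, ages)

print(plain_mean)
print(age_mean)
print(numpy.isnan(filled).any())  # False: no NaN values remain
```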

Conclusion-

We have seen how useful the numpy mean function is in programming. We can use it to fill null values in a dataset, measure the error of a model, and much more. There are other ways of calculating the mean in Python, but numpy mean is fast and works on arrays of any number of dimensions.

Try to run the programs on your side and let us know if you have any queries.

Happy Coding!