Mean Squared Error in Python: Formula, NumPy, and scikit-learn

Quick answer: Mean squared error averages the squared difference between each true and predicted value. It is always non-negative, penalizes large errors strongly, and is expressed in squared target units, so comparisons require the same scale and evaluation policy.

Figure: MSE squares each prediction error before averaging, so large misses contribute disproportionately and the result remains in squared target units.

Mean squared error, or MSE, measures the average squared difference between actual values and predicted values. It is most often used for regression problems, where lower values mean predictions are closer to the true targets.

MSE = (1 / n) * sum((y_true - y_pred) ** 2)

Because the errors are squared, large mistakes count more than small mistakes. That makes MSE useful when big prediction errors should be penalized heavily.

Quick example

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

squared_errors = [(actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

print(mse)

0.375

Calculate MSE with NumPy

NumPy makes the calculation concise because subtraction and exponentiation work element by element on arrays.

import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

mse = np.mean((y_true - y_pred) ** 2)
print(mse)

0.375

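Make sure both arrays have the same shape. Arrays with different but compatible shapes, such as (n,) and (n, 1), broadcast silently instead of raising an error, so the result can be wrong without any warning. If you are moving prediction data between NumPy and Pandas, the guide to NumPy array to Pandas DataFrame can help keep shapes explicit.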
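
As a minimal sketch of an explicit shape check, the helper below refuses to compute MSE when the shapes differ. The name checked_mse is hypothetical and not part of NumPy; the column-shaped prediction array is an assumed example of what a model might return.

```python
import numpy as np

def checked_mse(y_true, y_pred):
    """Hypothetical helper: compute MSE only when both shapes match exactly."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: {y_true.shape} vs {y_pred.shape}")
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([3, -0.5, 2, 7])                    # shape (4,)
y_pred_column = np.array([[2.5], [0.0], [2], [8]])    # shape (4, 1)

# Without a check, broadcasting builds a 4x4 grid of differences.
broadcast_shape = (y_true - y_pred_column).shape
print(broadcast_shape)  # (4, 4)

# Flatten deliberately, then compute the metric.
mse = checked_mse(y_true, y_pred_column.ravel())
print(mse)  # 0.375
```

Raising an error is one possible policy; the point is that the reshape happens on purpose rather than through broadcasting.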

Calculate MSE with scikit-learn

For model evaluation, use scikit-learn’s mean_squared_error() function.

from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mean_squared_error(y_true, y_pred)
print(mse)

0.375

This function also supports per-sample weights through the sample_weight parameter and multi-output regression through the multioutput parameter.

RMSE vs MSE

Root mean squared error, or RMSE, is the square root of MSE. It is often easier to interpret because it is in the same unit as the target value.

from sklearn.metrics import root_mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)

In older examples, you may see mean_squared_error(..., squared=False). That parameter was deprecated in scikit-learn 1.4, the release that added root_mean_squared_error(), so prefer the dedicated function in new code. For the manual version, take the square root of the MSE, for example with math.sqrt() or np.sqrt().
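
The manual version can be sketched with the standard library only, reusing the same four values as the earlier examples. This is an illustration of the arithmetic, not a replacement for the scikit-learn function.

```python
import math

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

# MSE first: average of the squared residuals.
mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

# RMSE is the square root of MSE, back in the target's own unit.
rmse = math.sqrt(mse)
print(round(rmse, 4))  # 0.6124
```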

Weighted mean squared error

Use sample_weight when some observations should count more than others.

from sklearn.metrics import mean_squared_error

y_true = [10, 20, 30]
y_pred = [12, 18, 33]
weights = [1, 1, 3]

mse = mean_squared_error(y_true, y_pred, sample_weight=weights)
print(mse)

The third observation has more influence because its weight is larger.
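
To see where the weighted value comes from, the same numbers can be worked by hand. The sketch below assumes the weighted MSE is the weighted average of the squared errors, which is how sample_weight is generally understood to behave.

```python
y_true = [10, 20, 30]
y_pred = [12, 18, 33]
weights = [1, 1, 3]

squared_errors = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]  # [4, 4, 9]

# Weighted average: each squared error counts in proportion to its weight.
weighted_mse = sum(w * e for w, e in zip(weights, squared_errors)) / sum(weights)

# Plain average for comparison.
unweighted_mse = sum(squared_errors) / len(squared_errors)

print(weighted_mse)              # 7.0
print(round(unweighted_mse, 3))  # 5.667
```

The weighted score is higher here because the largest error also carries the largest weight.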

MSE in model evaluation

MSE is useful for comparing regression models trained on the same target and dataset. It is not as easy to compare across unrelated datasets because the value depends on the target scale. If your target is measured in dollars, squared dollars are harder to interpret directly; RMSE may be clearer.

When using scikit-learn cross-validation, remember that scoring names follow a “higher is better” convention. That is why scikit-learn exposes MSE scoring as neg_mean_squared_error in model-selection tools.
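
A short sketch of the sign convention, assuming synthetic data and a LinearRegression model chosen only for illustration. The scores come back negative, so they are negated to recover per-fold MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, invented for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

# Each score is the negative MSE of one fold, so higher (closer to 0) is better.
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to report ordinary MSE values.
mse_per_fold = -scores
print(mse_per_fold.mean())
```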

Common mistakes

Mismatched shapes. y_true and y_pred must describe the same observations in the same order.
Using classification labels with MSE. For classification, use classification metrics such as accuracy, precision, recall, log loss, or F1 depending on the problem.
Comparing MSE across different target scales. A target measured from 0 to 1 will naturally produce a different range than a target measured in thousands.
Ignoring outliers. Squaring errors makes outliers influential. That may be desirable, but it should be intentional.
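
The scale point in the list above can be illustrated with made-up house prices expressed in two units. The predictions are identical in both cases; only the unit changes.

```python
def mse(y_true, y_pred):
    """Plain-Python MSE used only for this illustration."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical prices in thousands of dollars.
true_k = [200, 350, 500]
pred_k = [210, 340, 520]

# The same prices in dollars.
true_usd = [v * 1000 for v in true_k]
pred_usd = [v * 1000 for v in pred_k]

mse_k = mse(true_k, pred_k)
mse_usd = mse(true_usd, pred_usd)

print(mse_k)            # 200.0
print(mse_usd)          # 200000000.0
print(mse_usd / mse_k)  # 1000000.0
```

Multiplying the target by 1000 multiplies MSE by 1000 squared, which is why scores on different scales are not comparable.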

Official references

The scikit-learn documentation covers mean_squared_error(), root_mean_squared_error(), and model-evaluation scoring names such as neg_mean_squared_error. NumPy documents numpy.mean() and element-wise array operations in its basics guide.

Conclusion

To calculate mean squared error in Python, subtract predictions from actual values, square the differences, and average them. Use NumPy for a direct array calculation and scikit-learn’s mean_squared_error() when evaluating regression models. Use RMSE when you want the result back in the target’s original unit.

Apply The Formula

For aligned arrays y_true and y_pred, compute residual = y_true - y_pred, square it, and average the result. A shape mismatch is a data error, not something to hide with accidental broadcasting.

Use NumPy For Small Calculations

np.mean((y_true - y_pred) ** 2) is transparent for arrays after checking dtype and shape. Convert inputs deliberately and avoid integer overflow or unintended object arrays in larger pipelines.
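
The overflow risk is easiest to see with a small unsigned integer dtype. The sketch below uses made-up uint8 values, such as pixel intensities, and compares the naive calculation with one that converts to float64 first.

```python
import numpy as np

# Made-up 8-bit values; uint8 can only hold 0 to 255.
y_true = np.array([200, 15], dtype=np.uint8)
y_pred = np.array([10, 20], dtype=np.uint8)

# The subtraction and the square both stay in uint8 and wrap around,
# so this result does not match the real squared errors.
wrapped_mse = np.mean((y_true - y_pred) ** 2)

# Converting deliberately keeps the arithmetic exact for these values.
diff = y_true.astype(np.float64) - y_pred.astype(np.float64)
safe_mse = np.mean(diff ** 2)

print(safe_mse)  # 18062.5
print(wrapped_mse == safe_mse)  # False
```

No warning is raised for the wrapped version, which is why an explicit conversion step is worth the extra line.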

Use scikit-learn For Evaluation

sklearn.metrics.mean_squared_error supports sample weights and multi-output aggregation. Make the chosen policy explicit so a reported score can be reproduced.
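
What the multi-output aggregation means can be sketched in plain Python. The example assumes two targets per sample and mirrors the two common policies: one MSE per output column (the idea behind multioutput="raw_values") and their unweighted mean (the idea behind "uniform_average"). The data values are made up.

```python
# Rows are samples, columns are outputs.
y_true = [[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]]
y_pred = [[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]]

n_samples = len(y_true)
n_outputs = len(y_true[0])

# One MSE per output column.
per_output_mse = [
    sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_samples)) / n_samples
    for j in range(n_outputs)
]

# Single score: unweighted mean of the per-output values.
uniform_average = sum(per_output_mse) / n_outputs

print([round(v, 4) for v in per_output_mse])  # [0.4167, 1.0]
print(round(uniform_average, 4))              # 0.7083
```

Reporting which of the two policies produced a score is what makes it reproducible.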

Interpret Units And Outliers

MSE uses squared target units, and one large miss can dominate the score. Pair it with a metric in original units, residual plots, or a robust metric when that better represents the decision problem.
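
A small illustration with invented values shows how one large miss affects each metric. RMSE and MAE are in original units, and the median absolute error stands in here as one example of a robust metric.

```python
import math
from statistics import median

y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
y_pred = [11.0, 11.0, 12.0, 12.0, 32.0]  # the last prediction is a large miss

abs_errors = [abs(a - p) for a, p in zip(y_true, y_pred)]  # [1, 1, 1, 1, 20]
squared_errors = [e ** 2 for e in abs_errors]

mse_value = sum(squared_errors) / len(squared_errors)
rmse_value = math.sqrt(mse_value)
mae_value = sum(abs_errors) / len(abs_errors)
median_ae = median(abs_errors)

# Fraction of the total squared error that comes from the single large miss.
outlier_share = max(squared_errors) / sum(squared_errors)

print(round(mse_value, 2))      # 80.8
print(round(rmse_value, 2))     # 8.99
print(round(mae_value, 2))      # 4.8
print(median_ae)                # 1.0
print(round(outlier_share, 2))  # 0.99
```

Four of the five predictions are off by 1, yet MSE is driven almost entirely by the fifth.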

Keep Splits And Scaling Consistent

Compare models on the same held-out data, target transformation, filtering, and weighting. A lower score from a different split or leaked preprocessing is not evidence of better generalization.

Test Simple Fixtures

Use perfect predictions, one known residual, equal and unequal errors, multi-output targets, and a weighted example. Assert the exact fixture result before evaluating a real model.
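
A few of these fixtures can be sketched as plain assertions. The mse function below is a hypothetical reference implementation written for this example. In practice the same checks would be pointed at whatever metric function the project actually uses. Multi-output fixtures are left out to keep the sketch short.

```python
def mse(y_true, y_pred, sample_weight=None):
    """Hypothetical reference implementation used only for these fixtures."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        sample_weight = [1] * len(y_true)
    total = sum(w * (a - p) ** 2 for w, a, p in zip(sample_weight, y_true, y_pred))
    return total / sum(sample_weight)

assert mse([1, 2, 3], [1, 2, 3]) == 0                    # perfect predictions
assert mse([0, 0, 0, 0], [0, 0, 0, 2]) == 1              # one known residual: 2**2 / 4
assert mse([0, 0], [1, -1]) == 1                         # equal errors, sign is irrelevant
assert mse([0, 0], [1, 3]) == 5                          # unequal errors: (1 + 9) / 2
assert mse([10, 20, 30], [12, 18, 33], [1, 1, 3]) == 7   # weighted: 35 / 5
```

Integer inputs keep every expected value exact, so the assertions do not need a floating-point tolerance.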

Compare Models Fairly

Keep the target transformation, train-test split, filtering, and aggregation fixed when comparing models. Record the sample count and score definition alongside the metric so a lower number is not mistaken for a universally better model.

Watch Data Leakage

Fit preprocessing only on the training data when the evaluation setup requires it, and compute the final error on observations that were not used to tune the model. Otherwise MSE can look artificially small while future predictions remain poor.
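
As a minimal sketch of "fit on training data only", the example below standardizes one made-up feature using statistics taken from the training rows and reuses them for the test rows. The leaky variant is shown only for contrast.

```python
from statistics import mean, pstdev

# Invented feature values with a fixed split.
train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [10.0, 12.0]

# Fit the scaling statistics on the training rows only.
mu = mean(train_x)        # 2.5
sigma = pstdev(train_x)

train_scaled = [(x - mu) / sigma for x in train_x]
test_scaled = [(x - mu) / sigma for x in test_x]  # reuse training statistics

# Leaky variant: the mean is computed with test rows included,
# so information from the evaluation set reaches the preprocessing step.
leaky_mu = mean(train_x + test_x)

print(mu)                  # 2.5
print(round(leaky_mu, 3))  # 5.333
```

The same rule applies to any fitted step, such as imputation values or encoders, not only to scaling.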

Report More Than One Number

MSE is useful for optimization, but a score alone does not show bias, calibration, subgroup behavior, or the shape of residuals. Pair it with a baseline, sample count, a metric in original units, and a plot when the decision requires deeper diagnosis.
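
One way to report more than a single number is to put the model's score next to a simple baseline. The sketch below assumes a baseline that always predicts the training mean; the target values and model predictions are invented.

```python
import math

def mse(y_true, y_pred):
    """Plain-Python MSE used only for this illustration."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

train_y = [10.0, 12.0, 14.0, 16.0]
test_y = [11.0, 13.0, 15.0]
model_pred = [11.5, 12.5, 15.5]  # hypothetical model output on the test rows

# Baseline: predict the training mean for every test row.
baseline_value = sum(train_y) / len(train_y)
baseline_pred = [baseline_value] * len(test_y)

report = {
    "n": len(test_y),
    "baseline_mse": mse(test_y, baseline_pred),
    "model_mse": mse(test_y, model_pred),
    "model_rmse": math.sqrt(mse(test_y, model_pred)),
}
print(report["model_mse"])   # 0.25
print(report["model_rmse"])  # 0.5
```

A model MSE of 0.25 means more when the reader can see that the mean-only baseline scores about 2.67 on the same three rows.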

The official scikit-learn mean_squared_error reference documents weights and multi-output behavior.

Before interpreting MSE, check array alignment, metric fixtures, and evaluation diagnostics.

Frequently Asked Questions

What is the mean squared error formula?

MSE is the average of (actual value minus predicted value) squared across the evaluated samples.

How do I calculate MSE with NumPy?

Subtract predictions from targets, square the residuals, and call np.mean on the squared values after confirming the arrays align.

How do I calculate MSE with scikit-learn?

Use sklearn.metrics.mean_squared_error with y_true and y_pred, and set sample_weight or multioutput only when that aggregation matches the evaluation goal.

What does a lower MSE mean?

Lower MSE means smaller squared prediction errors on the evaluated data, but comparisons require the same target scale, split, preprocessing, and weighting policy.