Handling data that is missing#

NaN is short for Not-a-Number.

This page compares how NaN values behave in Numpy and in Pandas.

You are probably aware that Numpy will produce NaNs from invalid floating point operations, such as dividing 0 by 0. In Pandas, NaNs are more commonly a flag to indicate the absence of data, for floating point and other data types.

We will also look at how to handle NaNs safely in Pandas. First, let’s remind ourselves how NaNs work in Numpy.

NaNs in Numpy#

As mentioned above, NaNs in Numpy result from invalid floating point operations.

# Import libraries
import numpy as np
import pandas as pd

# NaN results from a Numpy operation dividing 0 by 0
a_nan = np.array(0) / np.array(0)
a_nan
/tmp/ipykernel_2767/2469014178.py:6: RuntimeWarning: invalid value encountered in divide
  a_nan = np.array(0) / np.array(0)
np.float64(nan)

There are potential pitfalls when dealing with NaN values, to which we will now turn our attention.

As you see above, the Numpy dtype of the returned NaN value is float64. This tells us that NaN is a special floating point value, in the same sense that Inf (infinity) and -Inf (negative infinity) are special floating point values:

# Inf (np.inf) is another special floating point value.
np.array(1) / np.array(0)
/tmp/ipykernel_2767/58810310.py:2: RuntimeWarning: divide by zero encountered in divide
  np.array(1) / np.array(0)
np.float64(inf)

Numpy uses this special NaN (np.nan) value to indicate that the value is invalid. We will soon see that Pandas uses np.nan in a different and expanded meaning. But more of that in a little while.
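
You can also write np.nan directly in your own code. A quick check (an aside, not shown in the outputs above) confirms that it is an ordinary Python float:

# np.nan is a plain Python float.
type(np.nan)
float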

The logic of NaNs as invalid values means that any operation with a NaN should itself return a NaN, because any operation on an invalid value must produce an invalid value. This propagation can have some superficially unexpected consequences that can trap the unwary:

# A (potentially) unexpected False
a_nan == np.nan
np.False_
# Another strange result with the equality operator
np.nan == np.nan
False

The last two cells above both return False because a NaN is an invalid value. Numpy cannot confirm that an invalid value equals anything, so comparing any value to a NaN returns False, even when the other value is also a NaN.

To ask the question "is this value a NaN?", use np.isnan():

np.isnan(a_nan)
np.True_

The same principles apply when we are dealing with NaN values in an array:

# A new array with NaN and non-NaN values
arr = np.array([np.nan, np.nan, 1, 3])
arr
array([nan, nan,  1.,  3.])

Again, if we want to find which of these values are NaNs, we might (early in our programming careers) try something like this:

# Probably not what you meant.
arr == np.nan
array([False, False, False, False])

This has failed to identify the NaN elements. As we saw above, Numpy cannot confirm that an invalid value equals anything, so the == comparison returns False for every element, including the NaN elements themselves.

You may well want np.isnan() here; it asks, for each element, whether the value is a NaN, returning True where the value is NaN, and False otherwise.

# Probably what you did mean.
np.isnan(arr)
array([ True,  True, False, False])

Perhaps more obviously, any mathematical operation with NaN gives NaN:

# Multiplying NaNs by something.
arr * 2
array([nan, nan,  2.,  6.])
# Adding with NaNs
arr + 2
array([nan, nan,  3.,  5.])

OK, so the TL;DR here is that in Numpy, NaNs signal that an invalid operation has taken place.

NaNs propagate; any numerical operation with a NaN will result in a NaN.

Let’s compare this to the way that Pandas uses NaNs.

NaNs in Pandas#

We have seen that NaN values in Numpy are values that indicate the result of invalid floating point operations.

The function of NaN values in Pandas is somewhat different.

NaN values in Pandas are flags for missing data. The NaN values themselves have the same properties and pitfalls that we showed in the last section. However, the cause of NaNs in Pandas is most often that data was missing, rather than that invalid numerical operations were performed on the data.

We’ll say more about what Pandas means by missing below.
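
For example (a small made-up Series, not from the dataset below), Pandas converts a missing entry such as None to NaN when it builds a floating point Series:

# Pandas stores the missing entry (None) as NaN.
pd.Series([1.0, None, 3.0])
0    1.0
1    NaN
2    3.0
dtype: float64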

Let’s explore these concepts further by importing a dataset. We will use the full version of the Human Development Index dataset, which contains values for every year, rather than just the subset of data from the year 2000, which we have looked at on previous pages:

# Import data as Data Frame.
df = pd.read_csv("data/children-per-woman-vs-human-development-index.csv")
# Set the index to the country codes.
df = df.set_index('Code')
df
           Entity  Year  Fertility Rate  Human Development Index  Population  Region
Code
AFG   Afghanistan  1950           7.248                      NaN   7776133.0     NaN
AFG   Afghanistan  1951           7.260                      NaN   7879295.0     NaN
AFG   Afghanistan  1952           7.260                      NaN   7987737.0     NaN
AFG   Afghanistan  1953           7.266                      NaN   8096656.0     NaN
AFG   Afghanistan  1954           7.254                      NaN   8207910.0     NaN
...           ...   ...             ...                      ...         ...     ...
ZWE      Zimbabwe  1945             NaN                      NaN   2519116.0     NaN
ZWE      Zimbabwe  1946             NaN                      NaN   2567459.0     NaN
ZWE      Zimbabwe  1947             NaN                      NaN   2616729.0     NaN
ZWE      Zimbabwe  1948             NaN                      NaN   2666945.0     NaN
ZWE      Zimbabwe  1949             NaN                      NaN   2725021.0     NaN

59911 rows × 6 columns

It is immediately apparent that this dataset contains more NaNs than a retirement village.[1]

Look at the Human Development Index column (which we extract as a Series):

# A column with lots of NaN values.
df['Human Development Index']
Code
AFG   NaN
AFG   NaN
AFG   NaN
AFG   NaN
AFG   NaN
       ..
ZWE   NaN
ZWE   NaN
ZWE   NaN
ZWE   NaN
ZWE   NaN
Name: Human Development Index, Length: 59911, dtype: float64

Let’s take a closer look at the value in the first row of this column:

# Show a nan value
df['Human Development Index'].iloc[0]
np.float64(nan)

Sure enough, it’s a NaN. It also has the expected float64 data type that we saw above.

We can use np.isnan() to get a Boolean confirmation that we are in the presence of a standard NaN:

# Using the `np.isnan()` function
np.isnan(df['Human Development Index'].iloc[0])
np.True_

Alternatively, you could use Pandas’ own pd.isna() function:

# Using the `pd.isna()` function
pd.isna(df['Human Development Index'].iloc[0])
True

Note

Pandas NA

In fact pd.isna() doesn’t just check for NaN values, because Pandas has some other, less common ways to indicate that an element is missing. You will see this hinted at in the name pd.isna(): Pandas thinks of missing values as NA values, where NA seems to stand for Not-Applicable or Not-Available. See Working with Missing Data for the full gory details. The summary at this stage (with a quick sketch after the list) is:

  • For readability, and to allow for the use of other NA values in Pandas, we suggest you use pd.isna() or Pandas .isna() methods to check for missing values.

  • That said, at the moment, Pandas nearly always indicates missing (NA) values with Numpy’s np.nan.
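
As a quick sketch of the difference (using only values we know Pandas treats as missing):

# pd.isna() detects several kinds of missing value, not just NaN.
pd.isna(np.nan)   # True - the standard NaN
pd.isna(None)     # True - Python's None also counts as missing
pd.isna(pd.NaT)   # True - Pandas' missing marker for datetime values
# By contrast, np.isnan(None) raises a TypeError.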

Missing data (fancifully referred to as missingness) is common in datasets encountered in the wild. Missing data means data that, for whatever reason, are not present for a particular row and column.

Above we saw, for example, that the HDI value for Afghanistan in 1950 is missing (not available).

df.iloc[0]
Entity                     Afghanistan
Year                              1950
Fertility Rate                   7.248
Human Development Index            NaN
Population                   7776133.0
Region                             NaN
Name: AFG, dtype: object

That is probably because there were no relevant statistics for Afghanistan in 1950 with which to calculate the value.

We will nearly always want to know how much of a given dataset is missing, as we will need to factor this in as a limitation of our data analysis.

Pandas supplies some useful methods for checking missingness.

For instance, we can use .count() to show us the number of non-NaN elements in each column:

# Count non-NaN (not missing) elements in the Data Frame.
df.count()
Entity                     59911
Year                       59911
Fertility Rate             18722
Human Development Index     6464
Population                 59177
Region                       271
dtype: int64

A useful trick here is to divide the output of the .count() method by the len() of the Data Frame. Remember, len(df) gives you the number of rows (the number of observations). This gives a handy summary of the proportion of non-missing (non-NaN) values in each column of the Data Frame:

# Show the proportion of non-missing values, in each column.
# The division operates elementwise, dividing all values above by the divisor.
df.count() / len(df)
Entity                     1.000000
Year                       1.000000
Fertility Rate             0.312497
Human Development Index    0.107893
Population                 0.987748
Region                     0.004523
dtype: float64
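
An equivalent approach (a suggestion, using the .isna() method mentioned in the Note above) is to ask directly for the proportion of missing, rather than present, values. Taking the mean of a Boolean column gives the proportion of True values:

# Show the proportion of *missing* values, in each column.
df.isna().mean()
Entity                     0.000000
Year                       0.000000
Fertility Rate             0.687503
Human Development Index    0.892107
Population                 0.012252
Region                     0.995477
dtype: float64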

If we want to use brute force, we can use the .dropna() method to remove every row that contains at least one NaN value:

# Remove the NaN values
df_no_NaN = df.dropna()
df_no_NaN
Entity Year Fertility Rate Human Development Index Population Region
Code

It turns out that in this dataset every row has at least one NaN value, so dropping every row with a NaN has dropped everything (we did say this method was brute force)!

By far the most important thing to know about missing data in Pandas is that by default NaN values will be ignored in numerical operations.

Let’s look at the Fertility Rate column, which contains numerical data:

# Show the column
df['Fertility Rate']
Code
AFG    7.248
AFG    7.260
AFG    7.260
AFG    7.266
AFG    7.254
       ...  
ZWE      NaN
ZWE      NaN
ZWE      NaN
ZWE      NaN
ZWE      NaN
Name: Fertility Rate, Length: 59911, dtype: float64

Because we are dealing with just one column, we can safely use .dropna() without losing everything (because not every element of this Series is NaN):

# Drop the NaN values from the column.
df['Fertility Rate'].dropna()
Code
AFG    7.248
AFG    7.260
AFG    7.260
AFG    7.266
AFG    7.254
       ...  
ZWE    3.748
ZWE    3.754
ZWE    3.765
ZWE    3.767
ZWE    3.724
Name: Fertility Rate, Length: 18722, dtype: float64

Let’s compare computing a statistic (the mean) when we drop the NaN values from this Series versus when we leave them in.

Let’s first select all the rows relating to (indexed as) Zimbabwe ('ZWE'), and the corresponding 'Fertility Rate' values:

# Rows with index value 'ZWE', 'Fertility Rate' column.
zwe_fert = df.loc['ZWE', 'Fertility Rate']
zwe_fert
Code
ZWE    7.041
ZWE    7.060
ZWE    7.069
ZWE    7.081
ZWE    7.091
       ...  
ZWE      NaN
ZWE      NaN
ZWE      NaN
ZWE      NaN
ZWE      NaN
Name: Fertility Rate, Length: 261, dtype: float64

When we use the .mean() method on just this column, we get the following value:

# Calculate a mean with NaN data included (NaNs will be ignored).
zwe_fert.mean()
np.float64(5.48758108108108)

Calling .mean() after using .dropna() on this column returns exactly the same value, because by default Pandas ignores NaNs in numerical operations:

zwe_no_nans = zwe_fert.dropna()
zwe_no_nans
Code
ZWE    7.041
ZWE    7.060
ZWE    7.069
ZWE    7.081
ZWE    7.091
       ...  
ZWE    3.748
ZWE    3.754
ZWE    3.765
ZWE    3.767
ZWE    3.724
Name: Fertility Rate, Length: 74, dtype: float64
# Dropping the NaNs gives the same mean.
zwe_no_nans.mean()
np.float64(5.48758108108108)
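
If you ever do want Pandas to propagate NaNs in the way that Numpy does, the Pandas statistical methods accept a skipna argument (a detail not used elsewhere on this page):

# skipna=False switches off the default behavior of ignoring NaNs.
zwe_fert.mean(skipna=False)
np.float64(nan)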

So:

  • For Numpy, NaNs propagate, because they indicate an invalid value.

  • For Pandas, NaNs do not propagate, because they indicate a missing value.

Put another way:

  • Numpy treats NaNs as numerical indicators of an invalid operation.

  • Pandas treats NaNs as statistical indicators of missing data.

This fits with the package names: Numpy for Numerical Python, and Pandas for Panel Data, and therefore statistics.

This difference in NaN handling is a key difference between the Numpy and Pandas statistical routines. Numpy's mean, min, max and std return NaN by default if there are any NaN values in the array.

np.std(zwe_fert.values)
np.float64(nan)

In contrast, the matching routines in Pandas silently drop the NaN values before calculating mean, min, max and so on.

zwe_fert.std()
np.float64(1.4567471363840736)
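
As an aside, Numpy does offer NaN-aware variants such as np.nanmean and np.nanstd, which skip NaN values. One caveat (a detail not covered above): Pandas' .std() uses ddof=1 (the sample standard deviation) by default, whereas Numpy's uses ddof=0, so you need to pass ddof=1 to reproduce the Pandas result:

# np.nanstd ignores NaNs; ddof=1 matches the Pandas default.
np.nanstd(zwe_fert.values, ddof=1)
np.float64(1.4567471363840736)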

Summary#

On this page we have seen how NaN values indicate different things in Numpy and Pandas.

In Numpy, NaN values have a numerical meaning, and typically result from invalid computations, such as dividing zero by zero.

In Pandas, NaN values have statistical meaning. They are most commonly flags for missing data. By default, these NaN values will be ignored when you call Pandas’ statistical methods for Series or Data Frames.