Handling data that is missing#
NaN is short for Not-a-Number.
This page compares how NaN values are used in Numpy and in Pandas.
You are probably aware that Numpy will produce NaNs from invalid floating point operations, such as dividing 0 by 0. In Pandas, NaNs are more commonly a flag to indicate the absence of data, for floating point and other data types.
We will also look at how to handle NaNs safely in Pandas. First, let’s remind ourselves how NaNs work in Numpy.
NaNs in Numpy#
As mentioned above, NaNs in Numpy result from invalid floating point operations.
# Import libraries
import numpy as np
import pandas as pd
# NaN results from a Numpy operation dividing 0 by 0
a_nan = np.array(0) / np.array(0)
a_nan
/tmp/ipykernel_2767/2469014178.py:6: RuntimeWarning: invalid value encountered in divide
a_nan = np.array(0) / np.array(0)
np.float64(nan)
There are potential pitfalls when dealing with NaN values, to which we will now turn our attention.
As you see above, the Numpy dtype of the returned NaN value is float64.
This tells us that the NaN value is a special and particular type of floating
point value, in the same sense that Inf (infinity) or -Inf (negative
infinity) are special floating point values:
# Inf (np.inf) is another special floating point value.
np.array(1) / np.array(0)
/tmp/ipykernel_2767/58810310.py:2: RuntimeWarning: divide by zero encountered in divide
np.array(1) / np.array(0)
np.float64(inf)
Numpy uses this special NaN (np.nan) value to indicate that a value is
invalid. We will soon see that Pandas uses np.nan with a different, expanded meaning. But more of that in a little while.
The logic of NaNs as invalid values means that any operation involving a NaN should itself return a NaN, because any operation on an invalid value must produce an invalid result.
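For arithmetic, that is exactly what happens. A quick check with scalar values (the same holds elementwise for arrays):

# Arithmetic with a NaN propagates the NaN.
np.nan + 1   # Gives nan.
np.nan * 0   # Also nan; even multiplying by 0 does not rescue the value.

Comparisons, however, can have some superficially unexpected consequences that trap the unwary: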
# A (potentially) unexpected False
a_nan == np.nan
np.False_
# Another strange result with the equality operator
np.nan == np.nan
False
The last two cells above both return False, because a NaN is treated as an
invalid value. An invalid value cannot be equal to anything, so comparing any
value to a NaN gives False, even when the other value is also a NaN.
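Note the flip side: the != comparison with a NaN returns True, again because a NaN is not equal to anything, itself included:

# A (potentially) unexpected True
np.nan != np.nan  # Gives True.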
To ask the question "is this value a NaN?", use np.isnan():
np.isnan(a_nan)
np.True_
The same principles apply when we are dealing with NaN values in an array:
# A new array with NaN and non-NaN values
arr = np.array([np.nan, np.nan, 1, 3])
arr
array([nan, nan, 1., 3.])
Again, if we want to find which of these values are NaNs, we might (early in our programming careers) try something like this:
# Probably not what you meant.
arr == np.nan
array([False, False, False, False])
This has failed to identify the NaN elements. A NaN is never equal to anything, not even another NaN, so the elementwise == comparison returns False for every element, including the NaN ones.
You may well want np.isnan() here; it asks the question we meant to ask (which of these values are NaNs?), returning True where the value is NaN, and False otherwise.
# Probably what you did mean.
np.isnan(arr)
array([ True, True, False, False])
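The Boolean array that np.isnan() returns is also useful for filtering. For example, here is a sketch of the Numpy way to drop NaN elements, by indexing with the negated mask:

# Keep only the non-NaN elements, using the negated Boolean mask.
arr[~np.isnan(arr)]  # Gives array([1., 3.]).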
Perhaps more obviously, any mathematical operation with NaN gives NaN:
# Multiplying NaNs by something.
arr * 2
array([nan, nan, 2., 6.])
# Adding with NaNs
arr + 2
array([nan, nan, 3., 5.])
OK, so the TL;DR here is that, in Numpy, NaNs signal that an invalid operation has taken place.
NaNs propagate; any numerical operation involving a NaN will result in a NaN.
Let’s compare this to the way that Pandas uses NaNs.
NaNs in Pandas#
We have seen that NaN values in Numpy are values that indicate the result of invalid floating point operations.
The function of NaN values in Pandas is somewhat different.
NaN values in Pandas are flags for missing data. The NaN values themselves have the same properties and pitfalls that we showed in the last section. However, the cause of NaNs in Pandas is most often that data were missing, rather than that invalid numerical operations were performed on the data.
We’ll say more about what Pandas means by missing below.
Let’s explore these concepts further by importing a dataset. We will use the full version of the Human Development Index dataset, which contains values for every year, rather than just the subset of data from the year 2000, which we have looked at on previous pages:
# Import data as Data Frame.
df = pd.read_csv("data/children-per-woman-vs-human-development-index.csv")
# Set the index to the country codes.
df = df.set_index('Code')
df
| Code | Entity | Year | Fertility Rate | Human Development Index | Population | Region |
|---|---|---|---|---|---|---|
| AFG | Afghanistan | 1950 | 7.248 | NaN | 7776133.0 | NaN |
| AFG | Afghanistan | 1951 | 7.260 | NaN | 7879295.0 | NaN |
| AFG | Afghanistan | 1952 | 7.260 | NaN | 7987737.0 | NaN |
| AFG | Afghanistan | 1953 | 7.266 | NaN | 8096656.0 | NaN |
| AFG | Afghanistan | 1954 | 7.254 | NaN | 8207910.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| ZWE | Zimbabwe | 1945 | NaN | NaN | 2519116.0 | NaN |
| ZWE | Zimbabwe | 1946 | NaN | NaN | 2567459.0 | NaN |
| ZWE | Zimbabwe | 1947 | NaN | NaN | 2616729.0 | NaN |
| ZWE | Zimbabwe | 1948 | NaN | NaN | 2666945.0 | NaN |
| ZWE | Zimbabwe | 1949 | NaN | NaN | 2725021.0 | NaN |
59911 rows × 6 columns
It is immediately apparent that this dataset contains more NaNs than a retirement village.[1]
Look at the Human Development Index column (which we extract as a Series):
# A column with lots of NaN values.
df['Human Development Index']
Code
AFG NaN
AFG NaN
AFG NaN
AFG NaN
AFG NaN
..
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
Name: Human Development Index, Length: 59911, dtype: float64
Let’s take a closer look at the value in the first row of this column:
# Show a nan value
df['Human Development Index'].iloc[0]
np.float64(nan)
Sure enough, it’s a NaN. It also has the expected float64 data type that we
saw above.
We can use np.isnan() to get a Boolean confirmation that we are in the
presence of a standard NaN:
# Using the `np.isnan()` function
np.isnan(df['Human Development Index'].iloc[0])
np.True_
Alternatively, you could use Pandas’ own pd.isna() function:
# Using the `pd.isna()` function
pd.isna(df['Human Development Index'].iloc[0])
True
Note
Pandas NA
In fact pd.isna() doesn’t just check for NaN values, because Pandas has some
other, less common ways to indicate an element is missing. You will see this
hinted at in the name pd.isna(), because Pandas thinks of missing values as
NA values, where NA seems to stand for Not-Applicable or
Not-Available.
See Working with Missing
Data
for the full gory details. The summary at this stage is:
For readability, and to allow for the use of other NA values in Pandas, we suggest you use pd.isna() or the Pandas .isna() method to check for missing values.

That said, at the moment, Pandas nearly always indicates missing (NA) values with Numpy's np.nan.
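One practical advantage of pd.isna() is that it accepts values that np.isnan() cannot handle. A small sketch:

# pd.isna() handles None and other non-float values.
pd.isna(None)      # True - None also counts as missing.
pd.isna('a word')  # False - a string is present, not missing.
# np.isnan(None) would instead raise a TypeError.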
Missing data (fancifully referred to as missingness) is common in the vast majority of datasets encountered in the wild. Missing data means data that, for whatever reason, are not present for a particular row and column.
Above we saw, for example, that the HDI value for Afghanistan in 1950 is missing (not available):
df.iloc[0]
Entity Afghanistan
Year 1950
Fertility Rate 7.248
Human Development Index NaN
Population 7776133.0
Region NaN
Name: AFG, dtype: object
That is probably because there were no relevant statistics for Afghanistan in 1950 with which to calculate the value.
We will nearly always want to know how much of a given dataset is missing, as we will need to factor this in as a limitation of our data analysis.
Exercise 11
Why might we worry about missing values? Why can’t we just drop them and forget about them? Let’s load some related data from the World Bank with country statistics on various measures:
gender_df = pd.read_csv('data/gender_stats.csv')
gender_df
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | 1.66325 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
| 1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
| 2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
| 3 | Albania | ALB | 1.76925 | 12.327586 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280 |
| 4 | Andorra | AND | NaN | 3.197538 | 4421.224933 | 7.260281 | 47.123345 | NaN | 0.079547 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 211 | Kosovo | XKX | 2.14250 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
| 212 | Yemen, Rep. | YEM | 4.22575 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
| 213 | South Africa | ZAF | 2.37525 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
| 214 | Zambia | ZMB | 5.39425 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
| 215 | Zimbabwe | ZWE | 3.94300 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
216 rows × 9 columns
Notice that there are various NaN values here.
A) Do you think these indicate invalid floating point operations at some previous step, or do they indicate missing (Not Available) data? Why?
Write your answer here, replacing this text.
Here is a calculation of the mean Health Exp(enditure) per Cap(ita) (per person):
gender_df['health_exp_per_cap'].mean()
np.float64(1269.4949369129715)
B) Do you think this value is a reasonable estimate of actual worldwide health expenditure per person? If not, why not? Can you think of any way of improving this estimate?
Write your answer here, replacing this text.
Solution to Exercise 11
The NaN values look very much like standard Pandas signals of missing (Not Available) data. First, these values arose from loading a data file directly. It’s possible that some calculation that led to this data file had invalid floating point values, but it’s difficult to see what these might be, or why these could not be avoided. On the other hand, looking at the values, it seems that they occur for smaller countries with less-developed economies (e.g. Aruba) or new countries (at time of data estimate) such as Kosovo. It is easy to see how there might not be good data to calculate e.g. health expenditure per person for these countries.
To explore more, you might have considered looking specifically for rows and columns with many NaN values with something like:
missing_hepc = pd.isna(gender_df['health_exp_per_cap'])
gender_df[missing_hepc]
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | 1.663250 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
| 8 | American Samoa | ASM | NaN | 0.640500 | NaN | NaN | NaN | NaN | 0.055422 |
| 24 | Bermuda | BMU | 1.617500 | 5.555624 | NaN | NaN | 48.423588 | NaN | 0.065101 |
| 45 | Curacao | CUW | 2.050000 | NaN | NaN | NaN | 47.986740 | NaN | 0.155959 |
| 46 | Cayman Islands | CYM | NaN | NaN | NaN | NaN | 49.866695 | NaN | 0.059159 |
| 64 | Faroe Islands | FRO | 2.525000 | 2.484982 | NaN | NaN | NaN | NaN | 0.048867 |
| 70 | Gibraltar | GIB | NaN | NaN | NaN | NaN | NaN | NaN | 0.034026 |
| 77 | Greenland | GRL | 2.033250 | 2.415893 | NaN | NaN | NaN | NaN | 0.056378 |
| 79 | Guam | GUM | 2.396250 | 5.465750 | NaN | NaN | NaN | NaN | 0.161202 |
| 81 | Hong Kong SAR, China | HKG | 1.209500 | 292.020359 | NaN | NaN | 47.941510 | NaN | 7.247240 |
| 87 | Isle of Man | IMN | NaN | 6.871989 | NaN | NaN | NaN | NaN | 0.082583 |
| 111 | Liechtenstein | LIE | 1.535000 | 6.170416 | NaN | NaN | 48.899433 | NaN | 0.037115 |
| 117 | Macao SAR, China | MAC | 1.220750 | 48.029914 | NaN | NaN | 47.860808 | NaN | 0.588052 |
| 118 | St. Martin (French part) | MAF | 1.812500 | NaN | NaN | NaN | NaN | NaN | 0.031491 |
| 132 | Northern Mariana Islands | MNP | NaN | 0.822250 | NaN | NaN | NaN | NaN | 0.054412 |
| 139 | New Caledonia | NCL | 2.250000 | NaN | NaN | NaN | NaN | NaN | 0.268200 |
| 156 | Puerto Rico | PRI | 1.469125 | 102.107758 | NaN | NaN | 48.494478 | 14.75 | 3.529385 |
| 157 | Korea, Dem. People's Rep. | PRK | 1.982500 | NaN | NaN | NaN | 49.030281 | 86.25 | 25.113782 |
| 160 | West Bank and Gaza | PSE | 4.208000 | 12.508220 | NaN | NaN | 48.828520 | 47.50 | 4.296960 |
| 161 | French Polynesia | PYF | 2.051000 | NaN | NaN | NaN | NaN | NaN | 0.275723 |
| 174 | Somalia | SOM | 6.514500 | 5.785250 | NaN | NaN | NaN | 762.75 | 13.527075 |
| 183 | Sint Maarten (Dutch part) | SXM | NaN | NaN | NaN | NaN | 49.508551 | NaN | 0.037552 |
| 186 | Turks and Caicos Islands | TCA | NaN | NaN | NaN | NaN | 48.846884 | NaN | 0.033703 |
| 206 | British Virgin Islands | VGB | NaN | NaN | NaN | NaN | 47.581520 | NaN | 0.029585 |
| 207 | Virgin Islands (U.S.) | VIR | 1.760000 | 3.812000 | NaN | NaN | NaN | NaN | 0.104141 |
| 211 | Kosovo | XKX | 2.142500 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
(See the filtering page for more on filtering Data Frames.)
For the mean calculation, we notice again that the missing values tend to be
for smaller, less-developed and newer countries. You have already seen that
Pandas assumes that NaN means missing data, and that it drops NaN values in
calculations like .mean(). The result is therefore the mean excluding these
smaller, poorer countries, so it is really a mean over the larger, richer
countries. To get a better estimate of worldwide health expenditure per
capita, we might try to think of ways of estimating what the health
expenditure would have been for these smaller countries, perhaps using other
data we do have, from this data frame or elsewhere. At the least, we should
point out that this .mean() is a biased estimate of the true worldwide value.
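As one sketch of a possible improvement, we could compute a population-weighted mean over the countries for which we do have expenditure data. This still ignores the missing countries, so it is only a partial fix, but it gets us closer to a per-person rather than per-country average:

# Rows where both expenditure and population are present.
has_both = gender_df[['health_exp_per_cap', 'population']].notna().all(axis=1)
both = gender_df[has_both]
# Weight each country's expenditure by its population.
(both['health_exp_per_cap'] * both['population']).sum() / both['population'].sum()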
Pandas supplies us with some useful methods for checking missingness.
For instance, we can use .count() to show us the number of non-NaN elements in each column:
# Count non-NaN (not missing) elements in the Data Frame.
df.count()
Entity 59911
Year 59911
Fertility Rate 18722
Human Development Index 6464
Population 59177
Region 271
dtype: int64
A useful trick here is to divide the output of the .count() method by the
len() of the Data Frame. Remember len(df) gives you the number of rows
(number of observations). This provides a handy summary of the
proportion of NaNs in each column of the Data Frame:
# Show the proportion of missing values, in each column.
# The division operates elementwise, dividing all values above by the divisor.
df.count() / len(df)
Entity 1.000000
Year 1.000000
Fertility Rate 0.312497
Human Development Index 0.107893
Population 0.987748
Region 0.004523
dtype: float64
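A related idiom is to chain .isna() with .mean(). Because True counts as 1 and False as 0, the mean of the Boolean values gives the proportion of values in each column that are missing (the complement of the proportions above):

# Proportion of missing values in each column.
# The mean of a Boolean column is the proportion of True values.
df.isna().mean()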
If we want to use brute force, we can use the .dropna() method to remove every row that contains at least one NaN value:
# Remove the NaN values
df_no_NaN = df.dropna()
df_no_NaN
| Code | Entity | Year | Fertility Rate | Human Development Index | Population | Region |
|---|---|---|---|---|---|---|

0 rows × 6 columns
It turns out that, in this dataset, every row has at least one NaN value, so dropping every row with a NaN has dropped everything… (we did say this method was brute force)!
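A less drastic option is the subset argument to .dropna(), which drops only the rows with NaN values in the columns you name. A sketch:

# Drop only rows that have NaN in the 'Fertility Rate' column.
# Rows with NaNs in other columns are kept.
df.dropna(subset=['Fertility Rate'])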
By far the most important thing to know about missing data in Pandas is that by default NaN values will be ignored in numerical operations.
Let’s look at the Fertility Rate column, which contains numerical data:
# Show the column
df['Fertility Rate']
Code
AFG 7.248
AFG 7.260
AFG 7.260
AFG 7.266
AFG 7.254
...
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
Name: Fertility Rate, Length: 59911, dtype: float64
Because we are dealing with just one column, we can safely use .dropna() without losing every row (because not every row of the Series contains a NaN value):
# Drop the NaN values from the column.
df['Fertility Rate'].dropna()
Code
AFG 7.248
AFG 7.260
AFG 7.260
AFG 7.266
AFG 7.254
...
ZWE 3.748
ZWE 3.754
ZWE 3.765
ZWE 3.767
ZWE 3.724
Name: Fertility Rate, Length: 18722, dtype: float64
Let’s compare computing a statistic (the mean) when we drop the NaN values from this Series versus when we leave them in.
Let’s first select all the rows relating to (indexed as) Zimbabwe ('ZWE'), and the corresponding 'Fertility Rate' values:
# Rows with index value 'ZWE', 'Fertility Rate' column.
zwe_fert = df.loc['ZWE', 'Fertility Rate']
zwe_fert
Code
ZWE 7.041
ZWE 7.060
ZWE 7.069
ZWE 7.081
ZWE 7.091
...
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
Name: Fertility Rate, Length: 261, dtype: float64
When we use the .mean() method on just this column, we get the following value:
# Calculate a mean with NaN data included (NaNs will be ignored).
zwe_fert.mean()
np.float64(5.48758108108108)
Using .dropna() on this column first, and then taking the mean, gives exactly the same value, because by default Pandas ignores NaNs in numerical operations:
zwe_no_nans = zwe_fert.dropna()
zwe_no_nans
Code
ZWE 7.041
ZWE 7.060
ZWE 7.069
ZWE 7.081
ZWE 7.091
...
ZWE 3.748
ZWE 3.754
ZWE 3.765
ZWE 3.767
ZWE 3.724
Name: Fertility Rate, Length: 74, dtype: float64
# Dropping the NaNs first gives the same mean.
zwe_no_nans.mean()
np.float64(5.48758108108108)
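If you do want the Numpy-style propagating behavior, most Pandas statistical methods accept a skipna argument:

# Ask Pandas *not* to ignore the NaN values.
zwe_fert.mean(skipna=False)  # Gives nan.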
So:
For Numpy, NaNs propagate, because they indicate an invalid value.
For Pandas, NaNs do not propagate, because they indicate a missing value.
Put another way:
Numpy treats NaNs as numerical indicators of an invalid operation.
Pandas treats NaNs as statistical indicators of missing data.
This fits with the package names: Numpy for numerical Python; Pandas for panel data, and therefore statistics.
This difference in NaN handling is a key distinction between the Numpy and
Pandas statistical routines. The Numpy mean, min, max and std routines return
NaN, by default, if there are any NaN values in the array.
np.std(zwe_fert.values)
np.float64(nan)
In contrast, the matching routines in Pandas silently drop the NaN values before calculating mean, min, max and so on.
zwe_fert.std()
np.float64(1.4567471363840736)
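Going the other way, Numpy provides NaN-aware versions of these routines, such as np.nanmean, np.nanstd, np.nanmin and np.nanmax, which drop NaNs much as Pandas does. One wrinkle to be aware of: np.nanstd defaults to ddof=0, whereas the Pandas .std() method defaults to ddof=1, so you must pass ddof=1 to reproduce the Pandas result above:

# Numpy's NaN-ignoring standard deviation.
# ddof=1 matches the default "delta degrees of freedom" that Pandas uses.
np.nanstd(zwe_fert.values, ddof=1)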
Summary#
On this page we have seen how NaN values indicate different things in Numpy and Pandas.
In Numpy, NaN values have a numerical meaning, and typically result from invalid computations, such as dividing zero by zero.
In Pandas, NaN values have statistical meaning. They are most commonly flags for missing data. By default, these NaN values will be ignored when you call Pandas’ statistical methods for Series or Data Frames.