Handling data that is missing#
NaN is short for Not-a-Number.
This page compares how NaN values are used in Numpy and in Pandas.
You are probably aware that Numpy will produce NaNs from invalid floating point operations, such as dividing 0 by 0. In Pandas, NaNs are more commonly a flag to indicate the absence of data, for floating point and other data types.
We will also look at how to handle NaNs safely in Pandas. First, let’s remind ourselves how NaNs work in Numpy.
NaNs in Numpy#
As mentioned above, NaNs in Numpy result from invalid floating point operations.
# Import libraries
import numpy as np
import pandas as pd
# NaN results from a Numpy operation dividing 0 by 0
a_nan = np.array(0) / np.array(0)
a_nan
/tmp/ipykernel_2767/2469014178.py:6: RuntimeWarning: invalid value encountered in divide
a_nan = np.array(0) / np.array(0)
np.float64(nan)
There are potential pitfalls when dealing with NaN values, to which we will now turn our attention.
As you see above, the Numpy dtype of the returned NaN value is float64.
This tells us that the NaN value is a special and particular type of floating
point value, in the same sense that Inf (infinity) or -Inf (negative
infinity) are special floating point values:
# Inf (np.inf) is another special floating point value.
np.array(1) / np.array(0)
/tmp/ipykernel_2767/58810310.py:2: RuntimeWarning: divide by zero encountered in divide
np.array(1) / np.array(0)
np.float64(inf)
Numpy uses this special NaN (np.nan) value to indicate that a value is
invalid. We will soon see that Pandas uses np.nan with a different, expanded meaning. But more of that in a little while.
The logic of NaNs as invalid values means that any operation involving a NaN should itself return a NaN, because any operation on an invalid value must produce an invalid result.
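For arithmetic, that is exactly what happens. A quick check with scalar values (the same holds elementwise for arrays):

# Arithmetic with a NaN propagates the NaN.
np.nan + 1   # Gives nan.
np.nan * 0   # Also nan; even multiplying by 0 does not rescue the value.

Comparisons, however, can have some superficially unexpected consequences that trap the unwary: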
# A (potentially) unexpected False
a_nan == np.nan
np.False_
# Another strange result with the equality operator
np.nan == np.nan
False
The last two cells above both return False, because a NaN is treated as an
invalid value. An invalid value cannot be equal to anything, so comparing any
value to a NaN gives False, even when the other value is also a NaN.
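Note the flip side: the != comparison with a NaN returns True, again because a NaN is not equal to anything, itself included:

# A (potentially) unexpected True
np.nan != np.nan  # Gives True.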
To ask the question "is this value a NaN?", use np.isnan():
np.isnan(a_nan)
np.True_
The same principles apply when we are dealing with NaN values in an array:
# A new array with NaN and non-NaN values
arr = np.array([np.nan, np.nan, 1, 3])
arr
array([nan, nan, 1., 3.])
Again, if we want to find which of these values are NaNs, we might (early in our programming careers) try something like this:
# Probably not what you meant.
arr == np.nan
array([False, False, False, False])
This has failed to identify the NaN elements. A NaN is never equal to anything, not even another NaN, so the elementwise == comparison returns False for every element, including the NaN ones.
You may well want np.isnan() here; it asks the question we meant to ask (which of these values are NaNs?), returning True where the value is NaN, and False otherwise.
# Probably what you did mean.
np.isnan(arr)
array([ True, True, False, False])
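The Boolean array that np.isnan() returns is also useful for filtering. For example, here is a sketch of the Numpy way to drop NaN elements, by indexing with the negated mask:

# Keep only the non-NaN elements, using the negated Boolean mask.
arr[~np.isnan(arr)]  # Gives array([1., 3.]).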
Perhaps more obviously, any mathematical operation with NaN gives NaN:
# Multiplying NaNs by something.
arr * 2
array([nan, nan, 2., 6.])
# Adding with NaNs
arr + 2
array([nan, nan, 3., 5.])
OK, so the TL;DR here is that, in Numpy, NaNs signal that an invalid operation has taken place.
NaNs propagate; any numerical operation involving a NaN will result in a NaN.
Let’s compare this to the way that Pandas uses NaNs.
NaNs in Pandas#
We have seen that NaN values in Numpy are values that indicate the result of invalid floating point operations.
The function of NaN values in Pandas is somewhat different.
NaN values in Pandas are flags for missing data. The NaN values themselves have the same properties and pitfalls that we showed in the last section. However, the cause of NaNs in Pandas is most often that data were missing, rather than that invalid numerical operations were performed on the data.
We’ll say more about what Pandas means by missing below.
Let’s explore these concepts further by importing a dataset. We will use the full version of the Human Development Index dataset, which contains values for every year, rather than just the subset of data from the year 2000, which we have looked at on previous pages:
# Import data as Data Frame.
df = pd.read_csv("data/children-per-woman-vs-human-development-index.csv")
# Set the index to the country codes.
df = df.set_index('Code')
df
| Code | Entity | Year | Fertility Rate | Human Development Index | Population | Region |
|---|---|---|---|---|---|---|
| AFG | Afghanistan | 1950 | 7.248 | NaN | 7776133.0 | NaN |
| AFG | Afghanistan | 1951 | 7.260 | NaN | 7879295.0 | NaN |
| AFG | Afghanistan | 1952 | 7.260 | NaN | 7987737.0 | NaN |
| AFG | Afghanistan | 1953 | 7.266 | NaN | 8096656.0 | NaN |
| AFG | Afghanistan | 1954 | 7.254 | NaN | 8207910.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| ZWE | Zimbabwe | 1945 | NaN | NaN | 2519116.0 | NaN |
| ZWE | Zimbabwe | 1946 | NaN | NaN | 2567459.0 | NaN |
| ZWE | Zimbabwe | 1947 | NaN | NaN | 2616729.0 | NaN |
| ZWE | Zimbabwe | 1948 | NaN | NaN | 2666945.0 | NaN |
| ZWE | Zimbabwe | 1949 | NaN | NaN | 2725021.0 | NaN |
59911 rows × 6 columns
It is immediately apparent that this dataset contains more NaNs than a retirement village.[1]
Look at the Human Development Index column (which we extract as a Series):
# A column with lots of NaN values.
df['Human Development Index']
Code
AFG NaN
AFG NaN
AFG NaN
AFG NaN
AFG NaN
..
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
Name: Human Development Index, Length: 59911, dtype: float64
Let’s take a closer look at the value in the first row of this column:
# Show a nan value
df['Human Development Index'].iloc[0]
np.float64(nan)
Sure enough, it’s a NaN. It also has the expected float64 data type that we
saw above.
We can use np.isnan() to get a Boolean confirmation that we are in the
presence of a standard NaN:
# Using the `np.isnan()` function
np.isnan(df['Human Development Index'].iloc[0])
np.True_
Alternatively, you could use Pandas’ own pd.isna() function:
# Using the `pd.isna()` function
pd.isna(df['Human Development Index'].iloc[0])
True
Note
Pandas NA
In fact pd.isna() doesn’t just check for NaN values, because Pandas has some
other, less common ways to indicate an element is missing. You will see this
hinted at in the name pd.isna(), because Pandas thinks of missing values as
NA values, where NA seems to stand for Not-Applicable or
Not-Available.
See Working with Missing
Data
for the full gory details. The summary at this stage is:
For readability, and to allow for the use of other NA values in Pandas, we suggest you use pd.isna() or the Pandas .isna() method to check for missing values.

That said, at the moment, Pandas nearly always indicates missing (NA) values with Numpy's np.nan.
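One practical advantage of pd.isna() is that it accepts values that np.isnan() cannot handle. A small sketch:

# pd.isna() handles None and other non-float values.
pd.isna(None)      # True - None also counts as missing.
pd.isna('a word')  # False - a string is present, not missing.
# np.isnan(None) would instead raise a TypeError.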
Missing data (fancifully referred to as missingness) is common in the vast majority of datasets encountered in the wild. Missing data means data that, for whatever reason, are not present for a particular row and column.
Above we saw, for example, that the HDI value for Afghanistan in 1950 is missing (not available):
df.iloc[0]
Entity Afghanistan
Year 1950
Fertility Rate 7.248
Human Development Index NaN
Population 7776133.0
Region NaN
Name: AFG, dtype: object
That is probably because there were no relevant statistics for Afghanistan in 1950 with which to calculate the value.
We will nearly always want to know how much of a given dataset is missing, as we will need to factor this in as a limitation of our data analysis.
Exercise 11
Why might we worry about missing values? Why can’t we just drop them and forget about them? Let’s load some related data from the World Bank with country statistics on various measures:
gender_df = pd.read_csv('data/gender_stats.csv')
gender_df
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | 1.66325 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
| 1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
| 2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
| 3 | Albania | ALB | 1.76925 | 12.327586 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280 |
| 4 | Andorra | AND | NaN | 3.197538 | 4421.224933 | 7.260281 | 47.123345 | NaN | 0.079547 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 211 | Kosovo | XKX | 2.14250 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
| 212 | Yemen, Rep. | YEM | 4.22575 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
| 213 | South Africa | ZAF | 2.37525 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
| 214 | Zambia | ZMB | 5.39425 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
| 215 | Zimbabwe | ZWE | 3.94300 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
216 rows × 9 columns
Notice that there are various NaN values here.
A) Do you think these indicate invalid floating point operations at some previous step, or do they indicate missing (Not Available) data? Why?
Write your answer here, replacing this text.
Here is a calculation of the mean Health Exp(enditure) per Cap(ita) (per person):
gender_df['health_exp_per_cap'].mean()
np.float64(1269.4949369129715)
B) Do you think this value is a reasonable estimate of actual worldwide health expenditure per person? If not, why not? Can you think of any way of improving this estimate?
Write your answer here, replacing this text.
Solution to Exercise 11
The NaN values look very much like standard Pandas signals of missing (Not Available) data. First, these values arose from loading a data file directly. It’s possible that some calculation that led to this data file had invalid floating point values, but it’s difficult to see what these might be, or why these could not be avoided. On the other hand, looking at the values, it seems that they occur for smaller countries with less-developed economies (e.g. Aruba) or new countries (at time of data estimate) such as Kosovo. It is easy to see how there might not be good data to calculate e.g. health expenditure per person for these countries.
To explore more, you might have considered looking specifically for rows and columns with many NaN values with something like:
missing_hepc = pd.isna(gender_df['health_exp_per_cap'])
gender_df[missing_hepc]
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | 1.663250 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
| 8 | American Samoa | ASM | NaN | 0.640500 | NaN | NaN | NaN | NaN | 0.055422 |
| 24 | Bermuda | BMU | 1.617500 | 5.555624 | NaN | NaN | 48.423588 | NaN | 0.065101 |
| 45 | Curacao | CUW | 2.050000 | NaN | NaN | NaN | 47.986740 | NaN | 0.155959 |
| 46 | Cayman Islands | CYM | NaN | NaN | NaN | NaN | 49.866695 | NaN | 0.059159 |
| 64 | Faroe Islands | FRO | 2.525000 | 2.484982 | NaN | NaN | NaN | NaN | 0.048867 |
| 70 | Gibraltar | GIB | NaN | NaN | NaN | NaN | NaN | NaN | 0.034026 |
| 77 | Greenland | GRL | 2.033250 | 2.415893 | NaN | NaN | NaN | NaN | 0.056378 |
| 79 | Guam | GUM | 2.396250 | 5.465750 | NaN | NaN | NaN | NaN | 0.161202 |
| 81 | Hong Kong SAR, China | HKG | 1.209500 | 292.020359 | NaN | NaN | 47.941510 | NaN | 7.247240 |
| 87 | Isle of Man | IMN | NaN | 6.871989 | NaN | NaN | NaN | NaN | 0.082583 |
| 111 | Liechtenstein | LIE | 1.535000 | 6.170416 | NaN | NaN | 48.899433 | NaN | 0.037115 |
| 117 | Macao SAR, China | MAC | 1.220750 | 48.029914 | NaN | NaN | 47.860808 | NaN | 0.588052 |
| 118 | St. Martin (French part) | MAF | 1.812500 | NaN | NaN | NaN | NaN | NaN | 0.031491 |
| 132 | Northern Mariana Islands | MNP | NaN | 0.822250 | NaN | NaN | NaN | NaN | 0.054412 |
| 139 | New Caledonia | NCL | 2.250000 | NaN | NaN | NaN | NaN | NaN | 0.268200 |
| 156 | Puerto Rico | PRI | 1.469125 | 102.107758 | NaN | NaN | 48.494478 | 14.75 | 3.529385 |
| 157 | Korea, Dem. People's Rep. | PRK | 1.982500 | NaN | NaN | NaN | 49.030281 | 86.25 | 25.113782 |
| 160 | West Bank and Gaza | PSE | 4.208000 | 12.508220 | NaN | NaN | 48.828520 | 47.50 | 4.296960 |
| 161 | French Polynesia | PYF | 2.051000 | NaN | NaN | NaN | NaN | NaN | 0.275723 |
| 174 | Somalia | SOM | 6.514500 | 5.785250 | NaN | NaN | NaN | 762.75 | 13.527075 |
| 183 | Sint Maarten (Dutch part) | SXM | NaN | NaN | NaN | NaN | 49.508551 | NaN | 0.037552 |
| 186 | Turks and Caicos Islands | TCA | NaN | NaN | NaN | NaN | 48.846884 | NaN | 0.033703 |
| 206 | British Virgin Islands | VGB | NaN | NaN | NaN | NaN | 47.581520 | NaN | 0.029585 |
| 207 | Virgin Islands (U.S.) | VIR | 1.760000 | 3.812000 | NaN | NaN | NaN | NaN | 0.104141 |
| 211 | Kosovo | XKX | 2.142500 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
(See the filtering page for more on filtering Data Frames.)
For the mean calculation, we notice again that the missing values tend to be
for smaller, less-developed and newer countries. You have already seen that
Pandas assumes that NaN means missing data, and that it drops NaN values in
calculations like .mean(). The result is therefore the mean excluding these
smaller, poorer countries, so it is really a mean over the larger, richer
countries. To get a better estimate of worldwide health expenditure per
capita, we might try to think of ways of estimating what the health
expenditure would have been for these smaller countries, perhaps using other
data we do have, from this data frame or elsewhere. At the least, we should
point out that this .mean() is a biased estimate of the true worldwide value.
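As one sketch of a possible improvement, we could compute a population-weighted mean over the countries for which we do have expenditure data. This still ignores the missing countries, so it is only a partial fix, but it gets us closer to a per-person rather than per-country average:

# Rows where both expenditure and population are present.
has_both = gender_df[['health_exp_per_cap', 'population']].notna().all(axis=1)
both = gender_df[has_both]
# Weight each country's expenditure by its population.
(both['health_exp_per_cap'] * both['population']).sum() / both['population'].sum()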
Pandas supplies us with some useful methods for checking missingness.
For instance, we can use .count() to show us the number of non-NaN elements in each column:
# Count non-NaN (not missing) elements in the Data Frame.
df.count()
Entity 59911
Year 59911
Fertility Rate 18722
Human Development Index 6464
Population 59177
Region 271
dtype: int64
A useful trick here is to divide the output of the .count() method by the
len() of the Data Frame. Remember len(df) gives you the number of rows
(number of observations). This provides a handy summary of the
proportion of NaNs in each column of the Data Frame:
# Show the proportion of missing values, in each column.
# The division operates elementwise, dividing all values above by the divisor.
df.count() / len(df)
Entity 1.000000
Year 1.000000
Fertility Rate 0.312497
Human Development Index 0.107893
Population 0.987748
Region 0.004523
dtype: float64
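A related idiom is to chain .isna() with .mean(). Because True counts as 1 and False as 0, the mean of the Boolean values gives the proportion of values in each column that are missing (the complement of the proportions above):

# Proportion of missing values in each column.
# The mean of a Boolean column is the proportion of True values.
df.isna().mean()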
If we want to use brute force, we can use the .dropna() method to remove every row that contains at least one NaN value:
# Remove the NaN values
df_no_NaN = df.dropna()
df_no_NaN
| Code | Entity | Year | Fertility Rate | Human Development Index | Population | Region |
|---|---|---|---|---|---|---|

0 rows × 6 columns
It turns out that, in this dataset, every row has at least one NaN value, so dropping every row with a NaN has dropped everything… (we did say this method was brute force)!
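A less drastic option is the subset argument to .dropna(), which drops only the rows with NaN values in the columns you name. A sketch:

# Drop only rows that have NaN in the 'Fertility Rate' column.
# Rows with NaNs in other columns are kept.
df.dropna(subset=['Fertility Rate'])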
By far the most important thing to know about missing data in Pandas is that by default NaN values will be ignored in numerical operations.
Let’s look at the Fertility Rate column, which contains numerical data:
# Show the column
df['Fertility Rate']
Code
AFG 7.248
AFG 7.260
AFG 7.260
AFG 7.266
AFG 7.254
...
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
Name: Fertility Rate, Length: 59911, dtype: float64
Because we are dealing with just one column, we can safely use .dropna() without losing every row (because not every row of the Series contains a NaN value):
# Drop the NaN values from the column.
df['Fertility Rate'].dropna()
Code
AFG 7.248
AFG 7.260
AFG 7.260
AFG 7.266
AFG 7.254
...
ZWE 3.748
ZWE 3.754
ZWE 3.765
ZWE 3.767
ZWE 3.724
Name: Fertility Rate, Length: 18722, dtype: float64
Let’s compare computing a statistic (the mean) when we drop the NaN values from this Series versus when we leave them in.
Let’s first select all the rows relating to (indexed as) Zimbabwe ('ZWE'), and the corresponding 'Fertility Rate' values:
# Rows with index value 'ZWE', 'Fertility Rate' column.
zwe_fert = df.loc['ZWE', 'Fertility Rate']
zwe_fert
Code
ZWE 7.041
ZWE 7.060
ZWE 7.069
ZWE 7.081
ZWE 7.091
...
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
ZWE NaN
Name: Fertility Rate, Length: 261, dtype: float64
When we use the .mean() method on just this column, we get the following value:
# Calculate a mean with NaN data included (NaNs will be ignored).
zwe_fert.mean()
np.float64(5.48758108108108)
Using .dropna() on this column first, and then taking the mean, gives exactly the same value, because by default Pandas ignores NaNs in numerical operations:
zwe_no_nans = zwe_fert.dropna()
zwe_no_nans
Code
ZWE 7.041
ZWE 7.060
ZWE 7.069
ZWE 7.081
ZWE 7.091
...
ZWE 3.748
ZWE 3.754
ZWE 3.765
ZWE 3.767
ZWE 3.724
Name: Fertility Rate, Length: 74, dtype: float64
# Dropping the NaNs first gives the same mean.
zwe_no_nans.mean()
np.float64(5.48758108108108)
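If you do want the Numpy-style propagating behavior, most Pandas statistical methods accept a skipna argument:

# Ask Pandas *not* to ignore the NaN values.
zwe_fert.mean(skipna=False)  # Gives nan.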
So:
For Numpy, NaNs propagate, because they indicate an invalid value.
For Pandas, NaNs do not propagate, because they indicate a missing value.
Put another way:
Numpy treats NaNs as numerical indicators of an invalid operation.
Pandas treats NaNs as statistical indicators of missing data.
This fits with the package names: Numpy for numerical Python; Pandas for panel data, and therefore statistics.
This difference in NaN handling is a key distinction between the Numpy and
Pandas statistical routines. The Numpy mean, min, max and std routines return
NaN, by default, if there are any NaN values in the array.
np.std(zwe_fert.values)
np.float64(nan)
In contrast, the matching routines in Pandas silently drop the NaN values before calculating mean, min, max and so on.
zwe_fert.std()
np.float64(1.4567471363840736)
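Going the other way, Numpy provides NaN-aware versions of these routines, such as np.nanmean, np.nanstd, np.nanmin and np.nanmax, which drop NaNs much as Pandas does. One wrinkle to be aware of: np.nanstd defaults to ddof=0, whereas the Pandas .std() method defaults to ddof=1, so you must pass ddof=1 to reproduce the Pandas result above:

# Numpy's NaN-ignoring standard deviation.
# ddof=1 matches the default "delta degrees of freedom" that Pandas uses.
np.nanstd(zwe_fert.values, ddof=1)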
Summary#
On this page we have seen how NaN values indicate different things in Numpy and Pandas.
In Numpy, NaN values have a numerical meaning, and typically result from invalid computations, such as dividing zero by zero.
In Pandas, NaN values have statistical meaning. They are most commonly flags for missing data. By default, these NaN values will be ignored when you call Pandas’ statistical methods for Series or Data Frames.