Indexing by label and position#

Indexing into Series#

From the What is a Series section, remember our maxim:

A Series is the association of:

An array of values (.values)

A sequence of labels for each value (.index)

A name (which can be None).

On this page, we think particularly about the Index (row labels) for Series and Data Frames. We also discuss the Index that Pandas creates if you do not specify one.

The default Index that Pandas makes reminds us of the differences between label indexing (using .loc) and position (integer) indexing (using .iloc).

Along the way, we’ll often press you never to use direct indexing on Series, as there is still some dangerous ambiguity as to whether you are doing label or position indexing.

Because it is easy to get mixed up about position (.iloc) and label (.loc) indexing, it is often sensible to replace Pandas’ default index with a custom index, to avoid accidental errors when indexing.

Getting started#

# import libraries
import numpy as np
import pandas as pd

We’ll use the fertility and Human Development Index data once more.

# Three letter codes for each country
country_codes_array = np.array(['AUS', 'BRA', 'CAN',
                                'CHN', 'DEU', 'ESP',
                                'FRA', 'GBR', 'IND',
                                'ITA', 'JPN', 'KOR',
                                'MEX', 'RUS', 'USA'])

# Human Development Index Scores for each country
hdis_array = np.array([0.896, 0.668, 0.89,
                       0.586, 0.89,  0.828,
                       0.844, 0.863, 0.49,
                       0.842, 0.883, 0.824,
                       0.709, 0.733, 0.894])

Slicing Series with `.iloc` and `.loc`#

hdi_series = pd.Series(hdis_array, index=country_codes_array)
hdi_series

AUS    0.896
BRA    0.668
CAN    0.890
CHN    0.586
DEU    0.890
ESP    0.828
FRA    0.844
GBR    0.863
IND    0.490
ITA    0.842
JPN    0.883
KOR    0.824
MEX    0.709
RUS    0.733
USA    0.894
dtype: float64

There is a fundamental difference between the behaviors of .iloc and .loc when slicing.

Standard slicing in Python uses integers to specify positions, and gives the elements starting at the start position, up to but not including the stop position.

my_name = 'Peter Rush'
# From character at position 2, up to (not including) position 7.
my_name[2:7]

'ter R'

The same rule applies to indexing Python lists, or Numpy arrays:

# From element at position 2, up to (not including) position 7.
country_codes_array[2:7]

array(['CAN', 'CHN', 'DEU', 'ESP', 'FRA'], dtype='<U3')

.iloc is indexing by position, so it may not be surprising that it slices using the same rules as by-position indexing in Numpy:

# From element at position 2, up to (not including) position 7.
hdi_series.iloc[2:7]

CAN    0.890
CHN    0.586
DEU    0.890
ESP    0.828
FRA    0.844
dtype: float64

Now consider slicing by label. The start and stop values are no longer positions, but labels. The label at position 2 is 'CAN'. The label at position 7 is the until-recently-European country 'GBR'.

Here’s what we get from slicing using .loc:

# From element labeled 'CAN', up to (including) element labeled 'GBR'
hdi_series.loc['CAN':'GBR']

CAN    0.890
CHN    0.586
DEU    0.890
ESP    0.828
FRA    0.844
GBR    0.863
dtype: float64

First notice that label indexing uses values from the Index as start and stop. Unlike Numpy or .iloc indexing, which by definition have integers as start and stop (because these are positions), .loc indexing start and stop values must match the values in the Index. In this case, the Index has str values, so the start and stop values are also str.

Second, notice that we got one more value from .loc indexing into the Series, because .loc slicing — unlike .iloc or Numpy indexing — includes the stop value.

In the last cell, using .loc, 'GBR' was the stop value, and we got the element corresponding to 'GBR'.

This is a major difference from Numpy and .iloc behavior.
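As a small check of this difference, here is a minimal sketch on a shortened copy of the data. The names `codes`, `hdis` and `short_series` are ours, for illustration only. It shows that a position slice needs a stop one past the last wanted element in order to match the label slice.

```python
import numpy as np
import pandas as pd

# A shortened copy of the HDI data, for illustration.
codes = np.array(['AUS', 'BRA', 'CAN', 'CHN', 'DEU',
                  'ESP', 'FRA', 'GBR', 'IND'])
hdis = np.array([0.896, 0.668, 0.89, 0.586, 0.89,
                 0.828, 0.844, 0.863, 0.49])
short_series = pd.Series(hdis, index=codes)

# Label slice: includes the element at the stop label 'GBR'.
by_label = short_series.loc['CAN':'GBR']
# Position slice: the stop must be 8 (one past position 7) to include 'GBR'.
by_position = short_series.iloc[2:8]

# Number of elements from the position slice 2:7, and from the label slice.
print(len(short_series.iloc[2:7]), len(by_label))
# Whether the two selections hold the same labels and values.
print(by_label.equals(by_position))
```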

Note

Stop and .loc

Why does .loc slicing return the label corresponding to the stop value, instead of going up to but not including the stop value, like Numpy or .iloc?

We should say that this is absolutely the right choice. But why?

Please consider reflecting before reading on.

Elevator Muzak while you reflect

Please click the link above to get you into a reflective mood.

Back to slicing; let’s consider the problem of selecting some elements that you want. You can see the Index. In your case you want all the elements from CAN through GBR. When the result includes the stop label, then it’s obvious what to do; you do what you did above: hdi_series.loc['CAN':'GBR'].

Now consider the alternative, where slicing gives you the elements up to but not including the stop value. Your problem now becomes annoying and error-prone. You have to look at the index, identify the last label for the element you do want ('GBR') and then go one element further, and get the label for the element after the one you want (in this case 'IND'). In an alternative world, where .loc was up to and not including the stop value, indexing to get elements 'CAN' through 'GBR' would be hdi_series.loc['CAN':'IND']. Now imagine that for some reason I had deleted the 'IND' element, so the following element label is 'ITA'. In that case, despite the fact nothing had changed in the elements I’m interested in, I now have to write hdi_series.loc['CAN':'ITA'] to get the exact same elements.

So, yes, it’s important to remember this difference, but a little reflection should reveal that this was still the right choice.
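To make the argument concrete, here is a minimal sketch, assuming a shortened copy of the data (the names `codes`, `hdis` and `small_series` are ours). Because the stop label is included, removing the element after 'GBR' does not change the slice expression we need, or its result.

```python
import numpy as np
import pandas as pd

# A shortened copy of the HDI data, for illustration.
codes = np.array(['CAN', 'CHN', 'DEU', 'ESP',
                  'FRA', 'GBR', 'IND', 'ITA'])
hdis = np.array([0.89, 0.586, 0.89, 0.828,
                 0.844, 0.863, 0.49, 0.842])
small_series = pd.Series(hdis, index=codes)

# Slice with 'IND' still present in the Series.
with_ind = small_series.loc['CAN':'GBR']
# Drop the 'IND' element, then use exactly the same slice.
without_ind = small_series.drop('IND').loc['CAN':'GBR']

# The same expression selects the same elements in both cases.
print(with_ind.equals(without_ind))
```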

Index labels need not be unique#

We haven’t specified so far, but there is no general requirement for Pandas Index values to be unique. Consider the following Series:

not_unique_labels = pd.Series(['France', 'Italy', 'UK', 'Great Britain'],
                              index=['FRA', 'ITA', 'GBR', 'GBR'])
not_unique_labels

FRA           France
ITA            Italy
GBR               UK
GBR    Great Britain
dtype: str

Doing .loc indexing with a label that only matches one element gives the corresponding value:

not_unique_labels.loc['FRA']

'France'

.loc matching a label with more than one element returns a subset of the Series:

not_unique_labels.loc['GBR']

GBR               UK
GBR    Great Britain
dtype: str

This can lead to confusing outputs if you don’t keep track of whether the Index values uniquely identify the element.
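One way to keep track is to ask the Index directly. The sketch below rebuilds the Series above under the name `labels_series` (our name, for illustration) and uses the `is_unique` attribute and the `duplicated` method of the Index.

```python
import pandas as pd

# Rebuild the Series with a repeated 'GBR' label.
labels_series = pd.Series(['France', 'Italy', 'UK', 'Great Britain'],
                          index=['FRA', 'ITA', 'GBR', 'GBR'])

# Are all the labels different from one another?
print(labels_series.index.is_unique)

# Boolean array marking every label that appears more than once.
is_repeated = labels_series.index.duplicated(keep=False)
print(is_repeated)
```

Checking `is_unique` before relying on `.loc` with a single label tells you whether to expect a single value or a Series back.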

The default index#

Thus far, we have specified the Index in building Series:

hdi_series = pd.Series(hdis_array, index=country_codes_array)
hdi_series

AUS    0.896
BRA    0.668
CAN    0.890
CHN    0.586
DEU    0.890
ESP    0.828
FRA    0.844
GBR    0.863
IND    0.490
ITA    0.842
JPN    0.883
KOR    0.824
MEX    0.709
RUS    0.733
USA    0.894
dtype: float64

Pandas allows us to build a Series without specifying an Index:

# Make a Series from `hdis_array`, without specifying `index`.
hdi_series_def_index = pd.Series(hdis_array)
hdi_series_def_index

0     0.896
1     0.668
2     0.890
3     0.586
4     0.890
5     0.828
6     0.844
7     0.863
8     0.490
9     0.842
10    0.883
11    0.824
12    0.709
13    0.733
14    0.894
dtype: float64

Where we did not specify an Index, Pandas has automatically generated one. As you can see, Pandas displays this default index as a sequence of integers, starting at 0, and going up to the number of elements minus 1.

Let’s take a closer look at the default Index:

# The default Pandas index
hdi_series_def_index.index

RangeIndex(start=0, stop=15, step=1)

RangeIndex is similar to Python’s range; it is a space-saving container that represents a sequence of integers from a start value up to, but not including a stop value, with an optional step size. Here RangeIndex represents the numbers 0 through 14, just as range can represent the numbers 0 through 14:

zero_through_14 = range(0, 15)
zero_through_14

range(0, 15)

As for range we can ask the RangeIndex container to give up these numbers (by iteration) into another container, such as an array or list:

# Iterating through `RangeIndex` to give the represented numbers.
np.array(hdi_series_def_index.index)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

# Iterating through a `range` to give the represented numbers.
np.array(zero_through_14)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

As for range, one can ask for the implied elements by indexing:

# View the fifth element of the RangeIndex.
fifth_element = hdi_series_def_index.index[4]
fifth_element

4

Notice that the elements from RangeIndex are ints:

type(fifth_element)

int

For all practical purposes, you can treat this RangeIndex as being equivalent to the corresponding sequential Numpy integer array.

Exercise 3

Let’s make another Series where we do not specify the index:

a_series = pd.Series([1000, 999, 101, 199, 99])
a_series

0    1000
1     999
2     101
3     199
4      99
dtype: int64

As you have seen, you will have got the default .index, a RangeIndex:

a_series.index

RangeIndex(start=0, stop=5, step=1)

What do you expect to see for list(a_series)? Reflect, then uncomment below and try it:

# list(a_series)

What do you expect to see for list(a_series.index)? Reflect, then try it:

# list(a_series.index)

The Series method .sort_values returns a new Series sorted by the values.

sorted_series = a_series.sort_values()

Now what do you expect to see for list(sorted_series)? Reflect, then uncomment below and try it:

# list(sorted_series)

How about list(sorted_series.index)? Reflect, try:

# list(sorted_series.index)

What kind of thing do you think the .index is now? Reflect and then:

# type(sorted_series.index)

Can you explain the result of the last cell?

Solution to Exercise 3

list applied to the Series gives a list of the .values. List applied to the .index gives a list of the values implied by the Index. For a RangeIndex, this iterates over the Index, extracting all the implied values.

You should have discovered that the sorted_series now has an Index of integers, and no longer has a RangeIndex. The question was prompting you to reflect that Pandas can only use RangeIndex as a space-saving device if the integers continue to be representable as an ordered sequence with equal steps. Otherwise it will have to rebuild an array of integers to represent the index.
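If you would like to check this for yourself, here is a minimal sketch that repeats the steps of the exercise and uses `isinstance` to test the index type before and after sorting.

```python
import pandas as pd

a_series = pd.Series([1000, 999, 101, 199, 99])
sorted_series = a_series.sort_values()

# Before sorting, the default index is a RangeIndex.
print(isinstance(a_series.index, pd.RangeIndex))
# After sorting, the labels are no longer an equal-step sequence,
# so Pandas stores them in an ordinary Index.
print(isinstance(sorted_series.index, pd.RangeIndex))
# The labels have travelled with their values.
print(list(sorted_series.index))
```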

Why an Index of integers can be confusing#

To recap: for our first few Series, we’ve used three-letter country codes as the elements of an index. We’ve just seen what happens if we construct a Series without telling Pandas what to use as an index: it will create a default RangeIndex. RangeIndex represents a sequence of integers.

If you did the exercise, you will have found that Pandas can use RangeIndex when the index is a regular sequence of integers, but must otherwise switch to an Index that stores the integer labels in an array.

What is the advantage of using an index with values that aren’t integers — such as strings? Below are some potential pitfalls to be aware of when using the default index, and any other index made up of integers.

Let’s say we want to access the fifth element of the Series. This is at integer location 4, because we count from 0. At the moment the numerical labels implied by the RangeIndex “line up” with the integer-based locations:

# Show the whole Series
hdi_series_def_index

0     0.896
1     0.668
2     0.890
3     0.586
4     0.890
5     0.828
6     0.844
7     0.863
8     0.490
9     0.842
10    0.883
11    0.824
12    0.709
13    0.733
14    0.894
dtype: float64

If you somehow ask for element 4, there is no ambiguity about which element you mean, because the value with label 4 is also the element at integer position 4. Therefore, if we use integer indexing (.iloc) we get the same value as if we use label based indexing (.loc):

# Indexing using integer location
hdi_series_def_index.iloc[4]

np.float64(0.89)

# Indexing using labels (from the default index)
hdi_series_def_index.loc[4]

np.float64(0.89)

Because of this potential for confusion, we strongly suggest that you index Series with .loc and .iloc, to be explicit about whether you mean label or position indexing.

Why you should never use direct indexing on Series#

Direct indexing occurs where the indexing bracket [ directly follows the Series value. Conversely, indirect-indexing is indexing where the indexing bracket [ follows .loc or .iloc.

Now consider the situation, that we encourage you never to put yourself in, where you use direct indexing on a Series. You can’t specify what type of indexing you mean with direct indexing. Do you mean label indexing or position indexing? Pandas will have to make assumptions, and these assumptions may well be wrong for what you intend. Did we mention, you should never use direct indexing on Series?

OK, let’s imagine that you decided we were being too strict, and used direct indexing on the Series above, with (implied) integer Index values.

# Direct indexing on a Series.  You should never do this.
hdi_series_def_index[4]

np.float64(0.89)

At the moment, because the positions and integer element labels match up, there is no ambiguity as to what 4 refers to, so it may not be surprising that .iloc, .loc and direct indexing all give the same result.

But this will not always be the case. It is extremely common for you to do operations on the Series — such as sorting and filtering — that will mean that the integer labels no longer correspond to positions.

For instance let’s sort the data in our hdi_series_def_index Series in ascending order. To do this we will use the .sort_values() method. We will cover Pandas methods in detail on later pages. The .sort_values() method sorts the values of the Series in ascending order, taking the matching labels in the index with it.

# Sorting the *values* in ascending order
hdi_series_def_index_sorted = hdi_series_def_index.sort_values()
hdi_series_def_index_sorted

8     0.490
3     0.586
1     0.668
12    0.709
13    0.733
11    0.824
5     0.828
9     0.842
6     0.844
7     0.863
10    0.883
2     0.890
4     0.890
14    0.894
0     0.896
dtype: float64

Look at the left hand side of the display from the cell above — in particular, look at the Index. The numbers within the Index no longer run sequentially from 0 to 14. This means that the integer position of each element in the Series no longer matches up with the index label. This can be a potential source of errors.

Note

The index type can change if you rearrange elements

If you haven’t done the exercise above, please consider doing it.

If you have, you will have found already that the sorted Series has a new Index, that is no longer a RangeIndex (because the integer labels now cannot be represented as a regular sequence of integers). Thus type(hdi_series_def_index_sorted.index) will be of type Index, rather than RangeIndex.

Let’s see what happens if we try to access the fifth element of the Series using position-based indexing (.iloc[4]), label-based indexing (.loc[4]) and direct indexing ([4]), as we did above.

(Did we already say — you should never use direct indexing on Series?)

As you remember, when we did this on the data before sorting, all these methods returned the same value. Now, however:

# Integer indexing on the sorted data
# This is the fifth element in the Series.
hdi_series_def_index_sorted.iloc[4]

np.float64(0.733)

# Label indexing on the sorted data
# This is the element with the label `4`.
hdi_series_def_index_sorted.loc[4]

np.float64(0.89)

# Direct indexing on the sorted data
# Which is this?  Position or label?
# By the way - you should never use direct indexing on Series.
hdi_series_def_index_sorted[4]

np.float64(0.89)

We have used the number 4 with each indexing method, yet have gotten back different values for .iloc compared to .loc and direct indexing.
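If you ever need to translate between the two, the Index can do it for you. The following is a minimal sketch, rebuilding the sorted Series under our own name `sorted_hdis`. It uses the Index method `get_loc` to go from label to position, and position indexing on the Index to go from position to label.

```python
import numpy as np
import pandas as pd

hdis = np.array([0.896, 0.668, 0.89,
                 0.586, 0.89,  0.828,
                 0.844, 0.863, 0.49,
                 0.842, 0.883, 0.824,
                 0.709, 0.733, 0.894])
# Default index, then sort by value, as in the text above.
sorted_hdis = pd.Series(hdis).sort_values()

# Label to position: where does the element labeled 4 now sit?
position_of_label_4 = sorted_hdis.index.get_loc(4)
# Position to label: which label sits at position 4?
label_at_position_4 = sorted_hdis.index[4]
print(position_of_label_4, label_at_position_4)

# Each pair below refers to the same element, so the values match.
print(sorted_hdis.iloc[position_of_label_4] == sorted_hdis.loc[4])
print(sorted_hdis.loc[label_at_position_4] == sorted_hdis.iloc[4])
```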

Consider specifying a not-default index for Series and Data Frames#

We saw above that the default index can induce confusion between label and position.

If you do avoid using direct indexing, the confusion is less — it will be easier to remember that .loc is for labels and .iloc is for positions. But still, with a little inattention, or some sloppy vibe-coding, it is nevertheless easy to forget which is which. This is a pitfall of using sequential numbers as the index — as generated, for example, by RangeIndex — it can lead to confusing results when the position in the sequence and the int label of an element of the Series do not match up.

Compare this to our hdi_series which uses the three-letter country codes as its index:

# Show the `hdi_series`.
hdi_series

AUS    0.896
BRA    0.668
CAN    0.890
CHN    0.586
DEU    0.890
ESP    0.828
FRA    0.844
GBR    0.863
IND    0.490
ITA    0.842
JPN    0.883
KOR    0.824
MEX    0.709
RUS    0.733
USA    0.894
dtype: float64

Let’s get the fifth element using integer based (.iloc) indexing:

# Integer (position) indexing
hdi_series.iloc[4]

np.float64(0.89)

… and let’s try to use .loc[4] on this Series (this will generate an error):

# Label indexing raises a KeyError ...
hdi_series.loc[4]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexes/base.py:3641, in Index.get_loc(self, key)
   3640 try:
-> 3641     return self._engine.get_loc(casted_key)
   3642 except KeyError as err:

File pandas/_libs/index.pyx:168, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:578, in pandas._libs.index.StringObjectEngine._check_type()

KeyError: 4

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[38], line 2
      1 # Label indexing raises a KeyError ...
----> 2 hdi_series.loc[4]

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexing.py:1207, in _LocationIndexer.__getitem__(self, key)
   1205 maybe_callable = com.apply_if_callable(key, self.obj)
   1206 maybe_callable = self._raise_callable_usage(key, maybe_callable)
-> 1207 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexing.py:1449, in _LocIndexer._getitem_axis(self, key, axis)
   1447 # fall thru to straight lookup
   1448 self._validate_key(key, axis)
-> 1449 return self._get_label(key, axis=axis)

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexing.py:1399, in _LocIndexer._get_label(self, label, axis)
   1397 def _get_label(self, label, axis: AxisInt):
   1398     # GH#5567 this will fail if the label is not present in the axis.
-> 1399     return self.obj.xs(label, axis=axis)

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py:4253, in NDFrame.xs(self, key, axis, level, drop_level)
   4251             new_index = index[loc]
   4252 else:
-> 4253     loc = index.get_loc(key)
   4255     if isinstance(loc, np.ndarray):
   4256         if loc.dtype == np.bool_:

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexes/base.py:3648, in Index.get_loc(self, key)
   3643     if isinstance(casted_key, slice) or (
   3644         isinstance(casted_key, abc.Iterable)
   3645         and any(isinstance(x, slice) for x in casted_key)
   3646     ):
   3647         raise InvalidIndexError(key) from err
-> 3648     raise KeyError(key) from err
   3649 except TypeError:
   3650     # If we have a listlike key, _check_indexing_error will raise
   3651     #  InvalidIndexError. Otherwise we fall through and re-raise
   3652     #  the TypeError.
   3653     self._check_indexing_error(key)

KeyError: 4

This KeyError tells us that there is no index label 4 (which makes sense as the index labels in this Series are three-letter country codes). To use .loc with this Series, we must use the three-letter country code strings:

# Label based indexing
hdi_series.loc['DEU']

np.float64(0.89)
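If you are not sure whether a label exists, you can test for it with `in` on the Index before using .loc, or catch the KeyError. Here is a minimal sketch on a shortened copy of the data; `short_hdi` is our name, for illustration.

```python
import pandas as pd

# A shortened copy of the HDI Series, with string labels.
short_hdi = pd.Series([0.896, 0.668, 0.89, 0.586, 0.89],
                      index=['AUS', 'BRA', 'CAN', 'CHN', 'DEU'])

# Membership tests on the Index check labels, not positions.
print(4 in short_hdi.index)
print('DEU' in short_hdi.index)

# Asking `.loc` for a label that is not there raises a KeyError.
try:
    short_hdi.loc[4]
    got_key_error = False
except KeyError:
    got_key_error = True
print(got_key_error)
```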

It is much harder to get confused when using integer indices as long as you stick with indirect indexing (.loc and .iloc). You’ve specified what you mean (by label or by position) using the name of the method. However, things can get dangerously confusing if you use an integer index and direct indexing. Which is why you should not use direct indexing with Series.

Just to remind you, hdi_series has the country codes (strings like 'DEU') as the index.

Now, consider, what would happen if we used an integer for direct indexing? As in something like hdi_series[4]? Because we haven’t specified that we want to index with labels (.loc) or positions (.iloc), Pandas has to make some decision as to how to proceed.

Exercise 4

We assume you’ve just read the text above the exercise, where we consider what you would expect to happen if:

Your Series has an index of strings.
You use direct indexing on this Series with an integer.

As in hdi_series[4]. (Don’t try it yet).

Pause and reflect what decision you would make in this situation, if you were a Pandas developer, deciding what Pandas should do. What are the options? Why would you choose one option over another?

Solution to Exercise 4

Briefly, you have three options that we could think of, as the Pandas developer. You could:

Assume that the user is trying to index by label, and raise an error to say that the label 4 is not in the index (because your index is a set of strings).
Assume that the user is trying to index by position (Pandas’ behavior in versions prior to 3).
Try to persuade the user not to use direct-indexing on Series.

However, there’s a big problem with the second option, assuming that the user is trying to index by position. As you have seen above, in general Pandas treats direct indexing as by label. So, if there are integer labels, it will, without complaint, give you the value corresponding to the integer label, not the position. This means that you sometimes treat direct indexing as by label (when there is an integer index and integer value between the []), and other times as by position (when there is a non-integer index and integer value between the []).

In fact, Pandas initially went for this second option, because, if you keep track of whether your index is an integer index or not, it can be convenient to avoid the .iloc and use direct indexing for position-based indexing (on a Series with a non-integer index). But recently (and as of Pandas version 3), the Pandas developers have rethought this decision.

You are about to see the result of direct indexing on a Series. In older versions of Pandas (before version 3) this did something frightening, which was to guess whether we meant to do .loc or .iloc indexing depending on whether the index values are integers. Version 3 takes (in our view) a more explicit approach, and always assumes direct indexing is on labels (.loc).

As you have already seen above, if the index consists of integers, and you specify integers in your direct indexing, then Pandas will assume you mean the values to be labels (like .loc).

If the index does not consist of integers, and you specify integers in your direct indexing, then the result depends on the version of Pandas you are running. Current versions (version 3 or greater) will raise an error, assuming you meant to index by label (.loc behavior). Previous versions assumed you meant the values to be positions (like .iloc), but would give you a warning about the upcoming change in version 3.

# Direct indexing
hdi_series[4]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexes/base.py:3641, in Index.get_loc(self, key)
   3640 try:
-> 3641     return self._engine.get_loc(casted_key)
   3642 except KeyError as err:

File pandas/_libs/index.pyx:168, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:578, in pandas._libs.index.StringObjectEngine._check_type()

KeyError: 4

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[40], line 2
      1 # Direct indexing
----> 2 hdi_series[4]

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/series.py:959, in Series.__getitem__(self, key)
    954     key = unpack_1tuple(key)
    956 elif key_is_scalar:
    957     # Note: GH#50617 in 3.0 we changed int key to always be treated as
    958     #  a label, matching DataFrame behavior.
--> 959     return self._get_value(key)
    961 # Convert generator to list before going through hashable part
    962 # (We will iterate through the generator there to check for slices)
    963 if is_iterator(key):

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/series.py:1046, in Series._get_value(self, label, takeable)
   1043     return self._values[label]
   1045 # Similar to Index.get_value, but we do not fall back to positional
-> 1046 loc = self.index.get_loc(label)
   1048 if is_integer(loc):
   1049     return self._values[loc]

File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexes/base.py:3648, in Index.get_loc(self, key)
   3643     if isinstance(casted_key, slice) or (
   3644         isinstance(casted_key, abc.Iterable)
   3645         and any(isinstance(x, slice) for x in casted_key)
   3646     ):
   3647         raise InvalidIndexError(key) from err
-> 3648     raise KeyError(key) from err
   3649 except TypeError:
   3650     # If we have a listlike key, _check_indexing_error will raise
   3651     #  InvalidIndexError. Otherwise we fall through and re-raise
   3652     #  the TypeError.
   3653     self._check_indexing_error(key)

KeyError: 4

Using a custom non-integer index (e.g. the three-letter country codes) rather than the default RangeIndex, or some other integer index, has the advantage of avoiding potential confusion (by you, or someone reading the code) between the integer location of an element, and the index label of that element.

To demonstrate this, let’s sort our hdi_series in ascending order:

# Sorting the Series in ascending order
hdi_series_sorted = hdi_series.sort_values()
hdi_series_sorted

IND    0.490
CHN    0.586
BRA    0.668
MEX    0.709
RUS    0.733
KOR    0.824
ESP    0.828
ITA    0.842
FRA    0.844
GBR    0.863
JPN    0.883
CAN    0.890
DEU    0.890
USA    0.894
AUS    0.896
dtype: float64

The use of custom string-based labels in the index (e.g. FRA, AUS etc) avoids confusing misalignment between the default numerical labels and integer location.

We’ve said it before, we say it again here — we suggest you always specify .loc or .iloc when indexing a Series, in order not to confuse yourself and your readers as to whether you mean to index by label or position. In this case .loc means we need to use a string, preventing confusion (in Pandas 3) and errors (Pandas 2) where we use a number and return data we do not expect.

# Label-based indexing
hdi_series_sorted.loc['DEU']

np.float64(0.89)
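If you already have a Series with the default index, you do not need to rebuild it to get custom labels. The sketch below is one way to do this, using the Series method `set_axis`, which returns a new Series with the labels you give it. The names `default_series` and `relabeled` are ours, and the data is a shortened copy of the HDI values.

```python
import numpy as np
import pandas as pd

codes = np.array(['AUS', 'BRA', 'CAN', 'CHN', 'DEU'])
hdis = np.array([0.896, 0.668, 0.89, 0.586, 0.89])

# Series with the default RangeIndex.
default_series = pd.Series(hdis)
# New Series with the same values, labeled by country code.
relabeled = default_series.set_axis(codes)

# Labels are now strings, so label and position cannot be confused.
print(relabeled.loc['DEU'])
print(relabeled.iloc[4])
```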

Warning

Direct indexing is inconsistent in older Pandas

If you’re using the latest Pandas (version >= 3), then you should find indexing is explicit and consistent. You need read no further in this warning.

However, if you’re still using Pandas <3, there are more inconsistencies.

As Pandas was shifting towards more explicit choice of labels over positions in direct indexing, there were remaining inconsistencies. These were resolved in version 3, so if you want to avoid confusion, skip the rest of this note, and remember never to use direct indexing on a Series.

If you got this far, we admire your courage. This warning is only to say that Pandas currently treats slices in direct indexing differently from individual positions or labels. Specifically, at the moment, it will always assume integers in slices are positions and not labels. Try some experiments with hdi_series[:5] (string label Series) and hdi_series_def_index[:5] (integer label Series).

hdi_series[:5]

AUS    0.896
BRA    0.668
CAN    0.890
CHN    0.586
DEU    0.890
dtype: float64

hdi_series_def_index[:5]

0    0.896
1    0.668
2    0.890
3    0.586
4    0.890
dtype: float64

See this Pandas Github issue for discussion if you’re interested.

If you’re using Pandas version 2, you may be confused after trying the experiments above. Summary for new and old versions: always use .iloc and .loc to avoid ambiguity.

`.loc` and `.iloc` with Data Frames#

So far we have spent much time with .loc and .iloc on Series, but less time on .loc and .iloc for Data Frames.

Series are like one-dimensional arrays (with an Index and a Name), so .loc and .iloc indexing into Series looks like indexing into one-dimensional Numpy arrays.

A Data Frame is like a two-dimensional array, so .loc and .iloc indexing looks like indexing into two-dimensional Numpy arrays.

Consider the following two-dimensional Numpy array:

two_d_arr = np.array([[1, 2, 3], [11, 21, 31], [101, 102, 103]])
two_d_arr

array([[  1,   2,   3],
       [ 11,  21,  31],
       [101, 102, 103]])

If we index with one expression between the indexing brackets, we select rows:

# Select the second row.
two_d_arr[1]

array([11, 21, 31])

If we want to select columns, we must specify two indexing expressions between the indexing brackets, separated by a comma:

# Select the second row, third column.
two_d_arr[1, 2]

np.int64(31)

As usual, we can use slices as indexing expressions (e.g. expressions containing colons :):

# Select first and second rows, second and third columns.
two_d_arr[:2, 1:3]

array([[ 2,  3],
       [21, 31]])

# Select all rows, third column.
two_d_arr[:, 2]

array([  3,  31, 103])

Because a Data Frame has rows and columns, it corresponds to a two-dimensional array.

Let us make an example Data Frame for illustration. In fact we’ll return to the Data Frame from the introduction to pd.DataFrame.

# Fertility rate scores for each country
fert_rates_array = np.array([1.764, 2.247, 1.51,
                             1.628, 1.386, 1.21,
                             1.876, 1.641, 3.35,
                             1.249, 1.346, 1.467,
                             2.714, 1.19 , 2.03 ])
# Series from array.
fert_rate_series = pd.Series(fert_rates_array, index=country_codes_array)

# Data Frame from dict of Series.
example_df = pd.DataFrame({'Human Development Index': hdi_series,
                           'Fertility Rate': fert_rate_series})
example_df

	Human Development Index	Fertility Rate
AUS	0.896	1.764
BRA	0.668	2.247
CAN	0.890	1.510
CHN	0.586	1.628
DEU	0.890	1.386
ESP	0.828	1.210
FRA	0.844	1.876
GBR	0.863	1.641
IND	0.490	3.350
ITA	0.842	1.249
JPN	0.883	1.346
KOR	0.824	1.467
MEX	0.709	2.714
RUS	0.733	1.190
USA	0.894	2.030

If we ask for the Data Frame .values, we get a two-dimensional Numpy array:

example_df.values

array([[0.896, 1.764],
       [0.668, 2.247],
       [0.89 , 1.51 ],
       [0.586, 1.628],
       [0.89 , 1.386],
       [0.828, 1.21 ],
       [0.844, 1.876],
       [0.863, 1.641],
       [0.49 , 3.35 ],
       [0.842, 1.249],
       [0.883, 1.346],
       [0.824, 1.467],
       [0.709, 2.714],
       [0.733, 1.19 ],
       [0.894, 2.03 ]])

When indexing with .loc or .iloc, we can select rows with a single indexing expression:

# Select row corresponding to label 'RUS'
example_df.loc['RUS']

Human Development Index    0.733
Fertility Rate             1.190
Name: RUS, dtype: float64

# Select rows from that labeled 'ITA' to that labeled 'RUS'.
# Remember, `.loc` is inclusive of the stop value.
example_df.loc['ITA':'RUS']

	Human Development Index	Fertility Rate
ITA	0.842	1.249
JPN	0.883	1.346
KOR	0.824	1.467
MEX	0.709	2.714
RUS	0.733	1.190

# Select second row by position.
example_df.iloc[1]

Human Development Index    0.668
Fertility Rate             2.247
Name: BRA, dtype: float64

# Select second through fifth row by position.
# As is standard for Python integer indexing, this is exclusive of the stop position.
example_df.iloc[1:5]

	Human Development Index	Fertility Rate
BRA	0.668	2.247
CAN	0.890	1.510
CHN	0.586	1.628
DEU	0.890	1.386

Like the Numpy two-dimensional indexing case, if we want to select columns with .loc or .iloc, we must give two indexing expressions, separated by a comma:

# Select rows 'ITA' through 'RUS', 'Fertility Rate' column.
example_df.loc['ITA':'RUS', 'Fertility Rate']

ITA    1.249
JPN    1.346
KOR    1.467
MEX    2.714
RUS    1.190
Name: Fertility Rate, dtype: float64

# Row for 'RUS', all columns.
example_df.loc['RUS', :]

Human Development Index    0.733
Fertility Rate             1.190
Name: RUS, dtype: float64

# Select second through fifth row by position, first column by position.
example_df.iloc[1:5, 0]

BRA    0.668
CAN    0.890
CHN    0.586
DEU    0.890
Name: Human Development Index, dtype: float64

# Second row, all columns.
example_df.iloc[1, :]

Human Development Index    0.668
Fertility Rate             2.247
Name: BRA, dtype: float64
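The same pattern lets us select a whole column, by giving `:` as the row expression. Here is a minimal sketch on a three-row copy of the Data Frame; `small_df` is our name, for illustration.

```python
import pandas as pd

# A three-row copy of the example Data Frame.
small_df = pd.DataFrame(
    {'Human Development Index': [0.896, 0.668, 0.89],
     'Fertility Rate': [1.764, 2.247, 1.51]},
    index=['AUS', 'BRA', 'CAN'])

# All rows, 'Fertility Rate' column, by label.
col_by_label = small_df.loc[:, 'Fertility Rate']
# All rows, second column, by position.
col_by_position = small_df.iloc[:, 1]

# Both expressions select the same column.
print(col_by_label.equals(col_by_position))
```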

The catechism of Pandas indexing#

We are now ready for the definitive advice for your life using indexing in Pandas.

Never use direct indexing on Series. Always use indirect indexing (.loc and .iloc).
You can and should use direct indexing on Data Frames, but in two and only two specific cases. These are:
1. Direct indexing with a column name, or sequence of column names. Here the column name (label) or sequence of column names follows the Data Frame value and the opening [ — as in:
```
example_df['Human Development Index']
```
  and
```
example_df[['Human Development Index', 'Fertility Rate']]
```
2. Direct indexing with a Boolean Series. See the filtering page for much more on Boolean Series and indexing. The Boolean Series follows the data frame value and the opening [, and selects rows for which the Boolean Series has True values — as in:
```
# Make a Boolean Series.
have_high_hdi = example_df['Human Development Index'] > 0.6
# Select rows by indexing with Boolean Series.
high_df = example_df[have_high_hdi]
```

We strongly suggest that you restrict your use of direct indexing to a) Data Frames (not Series) and b) these specific cases. We do the same.
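As a runnable sketch of the second case, here is the Boolean Series example on a four-row copy of the Data Frame (`small_df` is our name, for illustration). It also shows that .loc accepts a Boolean Series as its row expression, which is one way to filter rows and pick a column in a single step.

```python
import pandas as pd

# A four-row copy of the example Data Frame.
small_df = pd.DataFrame(
    {'Human Development Index': [0.896, 0.586, 0.49, 0.894],
     'Fertility Rate': [1.764, 1.628, 3.35, 2.03]},
    index=['AUS', 'CHN', 'IND', 'USA'])

# Make a Boolean Series.
have_high_hdi = small_df['Human Development Index'] > 0.6
# Direct indexing with the Boolean Series selects the True rows.
high_df = small_df[have_high_hdi]
print(list(high_df.index))

# Indirect indexing: True rows, 'Fertility Rate' column only.
high_fert = small_df.loc[have_high_hdi, 'Fertility Rate']
print(high_fert.equals(high_df['Fertility Rate']))
```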

Summary#

On this page we have looked at the Pandas Index, and different ways of indexing into Pandas Series.

We discussed the default index that Pandas provides, of integer labels, and we showed how to get Series values by label (.loc) and by position (.iloc).

.loc differs from .iloc and other Python indexing in that slices include their stop value.

We pressed you to completely avoid using direct indexing on Pandas Series, because of the potent confusion that can arise between label and position indexing.

For best results, you should specify an interpretable index for your Series and Data Frames.

Direct indexing into Data Frames is common and useful, in two and only two situations:

Direct indexing using a column name or sequence of names.
Direct indexing using a Boolean Series (see filtering page).

We can use .loc and .iloc on Data Frames, remembering that this indexing acts like indexing two-dimensional Numpy arrays; when selecting columns, we first need to specify a selection for rows.