Working with strings#

String/text data is often used to represent categorical variables, and commonly appears in a variety of data analysis contexts. When dealing with string/text data we will frequently find that we need to alter the strings to correct errors, improve clarity, make formatting uniform, or for a host of other reasons.

String methods are built into Python, and these methods, or variants of them, can be used on the strings in Numpy arrays and Pandas Series/Data Frames. However, Numpy and Pandas use different interfaces for interacting with strings. To understand the differences between Numpy and Pandas with respect to strings, let’s begin at the foundation: the in-built string methods of Python. The cell below contains a simple string.

# A string
my_string = "a few words"
my_string
'a few words'

Here are the public (non-private) methods and attributes of a standard Python str:

# Attributes and methods not starting with `_` (or `__`):
[k for k in dir(my_string) if not k.startswith('_')]
['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Remember that a method is a function attached to an object. In this case our object is a string.

Let’s say we like reading values in our data as if they are being spoken in a loud voice. If this is the case we can alter the format of the string to make all letters uppercase, using the .upper() method:

# The `.upper()` method of `str`.
my_string.upper()
'A FEW WORDS'

We can replace characters in the string using the aptly named .replace() method. Here we supply two strings to the method: first the string we want to replace, and second the string we want to replace it with. In this case, let’s .replace() the blank spaces with underscores:

# The `.replace()` method.
my_string.replace(' ', '_')
'a_few_words'
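
One detail worth keeping in mind is that Python strings are immutable, so methods such as .upper() and .replace() return a new string rather than changing the original. The short sketch below illustrates this with the same example string; the names underscored and shouted are ours, chosen only for illustration.

```python
# The same example string used on this page.
my_string = "a few words"

# `.replace()` returns a new string; `my_string` itself is unchanged.
underscored = my_string.replace(' ', '_')

# Methods can be chained, because each call returns a new string.
shouted = my_string.upper().replace(' ', '_')

print(my_string)
print(underscored)
print(shouted)
```

To keep the result of a string method, we therefore need to assign it to a name, as we did here.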

Other methods let us apply fancier formatting, for instance converting the string to title case (.title()):

# A more elaborate string method.
my_string.title()
'A Few Words'

In Python, strings are collections of characters, and so we can slice them as we would a list or array, using integer indexes and the : symbol:

# Slicing with strings.
my_string[0:1]
'a'
my_string[0:2]
'a '
my_string[0:7]
'a few w'
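
Slicing strings follows the same rules as slicing lists, so negative indices and step values should work as well. The cell below is a small sketch of this, again using our example string; the variable names are illustrative only.

```python
# The same example string used on this page.
my_string = "a few words"

# A slice from the middle of the string.
middle_word = my_string[2:5]

# Negative indices count back from the end of the string.
last_word = my_string[-5:]

# A step of 2 takes every other character.
every_other = my_string[::2]

# A step of -1 reverses the string.
backwards = my_string[::-1]

print(middle_word, last_word, every_other, backwards, sep='\n')
```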

You can visit this page to see the variety of string methods available in base Python.

String methods with Numpy arrays#

So, strings in base Python have a large number of in-built methods - what about strings in Numpy?

Numpy arrays themselves do not have specific string methods, but the in-built Python string methods can be called on individual string values in a Numpy array. Alternatively, we can use functions from the np.char module to operate on all the strings in the array in one go.

To investigate how string data is handled in Numpy, let’s make some arrays containing strings from the now (very) familiar HDI dataset.

# Import libraries (no imports were needed before this point, as string methods are part of base Python)
import numpy as np
import pandas as pd
# A custom function to generate a Series to check exercise solutions.
import clean_gender_df_names
# Calculate answer to exercise (see below).
answer_clean_series = clean_gender_df_names.get_cleaned(
    pd.read_csv("data/gender_stats.csv")['country_name'])
# Standard three-letter code for each country.
country_codes_array = np.array(['AUS', 'BRA', 'CAN',
                                'CHN', 'DEU', 'ESP',
                                'FRA', 'GBR', 'IND',
                                'ITA', 'JPN', 'KOR',
                                'MEX', 'RUS', 'USA'])
# Country names.
country_names_array = np.array(['Australia', 'Brazil', 'Canada',
                                'China', 'Germany', 'Spain',
                                'France', 'United Kingdom', 'India',
                                'Italy', 'Japan', 'South Korea',
                                'Mexico', 'Russia', 'United States'])

For comparison, let’s make an array containing the numerical HDI scores:

# Human Development Index scores, in the same order as the country codes
hdis_array = np.array([0.896, 0.668, 0.890, 0.586,
                       0.890, 0.828, 0.844, 0.863,
                       0.490, 0.842, 0.883, 0.824,
                       0.709, 0.733, 0.894])

The dtype attribute of the first two arrays begins with <U, indicating we are dealing with string data.

# Show the dtype of the country codes array (i.e. string data)
country_codes_array.dtype
dtype('<U3')

U3 tells us that the array stores Unicode (U) strings of up to three Unicode characters in length.

# Show the dtype of the country names array (i.e. string data)
country_names_array.dtype
dtype('<U14')

Conversely, the hdis_array contains data of a numerical type:

# Show the dtype of the hdis array (i.e. numeric data)
hdis_array.dtype
dtype('float64')
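
One practical consequence of the fixed-width string dtypes shown above is that a string array cannot hold values longer than its declared width. As a hedged sketch (using a shortened, made-up copy of the codes array, and variable names of our own), assigning a longer string is silently truncated rather than raising an error:

```python
import numpy as np

# A small array of three-letter codes; Numpy infers a 3-character width.
codes = np.array(['AUS', 'BRA', 'CAN'])

# Assigning a longer string does not raise an error; the value is
# silently cut down to the first three characters.
codes[0] = 'AUSTRALIA'
print(codes)

# Converting to a wider string dtype first leaves room for longer values.
wider_codes = codes.astype('<U9')
wider_codes[0] = 'AUSTRALIA'
print(wider_codes)
```

This is worth remembering when editing string values inside an existing Numpy array.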

Using indexing, we can use all of the in-built Python string methods on the individual values within a Numpy array:

# Methods on an individual string
country_codes_array[0]
np.str_('AUS')

For instance, we can change the case of the value:

# Lowercase
country_codes_array[0].lower()
'aus'
# Uppercase
country_codes_array[0].upper()
'AUS'

We can also replace elements of the string:

country_codes_array[0].replace("A", "Comparable to the ")
'Comparable to the US'

Understandably, if we try to use any of these string methods on numerical data, we will get an error:

# Oh no!
hdis_array[0].upper()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[22], line 2
      1 # Oh no!
----> 2 hdis_array[0].upper()

AttributeError: 'numpy.float64' object has no attribute 'upper'

All of the string methods used in this section above have been called on single string values from a Numpy array. If we try to use a string method on all values of the array simultaneously, we will also get an error:

# This does not work
country_codes_array.lower()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[23], line 2
      1 # This does not work
----> 2 country_codes_array.lower()

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

String methods in Numpy must be called on single string values, or applied to the whole array using functions from the np.char module.

For example, we can use the np.char.lower() function to operate on all values of the Numpy array at once:

# This DOES work
np.char.lower(country_codes_array)
array(['aus', 'bra', 'can', 'chn', 'deu', 'esp', 'fra', 'gbr', 'ind',
       'ita', 'jpn', 'kor', 'mex', 'rus', 'usa'], dtype='<U3')
# This DOES work too
np.char.replace(country_codes_array, 'A', '!')
array(['!US', 'BR!', 'C!N', 'CHN', 'DEU', 'ESP', 'FR!', 'GBR', 'IND',
       'IT!', 'JPN', 'KOR', 'MEX', 'RUS', 'US!'], dtype='<U3')
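
The np.char module also provides functions that test or measure each string, returning Boolean or integer arrays. The sketch below, using shortened copies of the arrays above, combines np.char.startswith with Boolean indexing; the variable names are ours, and this is just one way of doing it.

```python
import numpy as np

# Shortened copies of the arrays used on this page.
codes = np.array(['AUS', 'BRA', 'CAN', 'CHN'])
names = np.array(['Australia', 'Brazil', 'Canada', 'China'])

# A Boolean array: True where the country name starts with 'C'.
starts_with_c = np.char.startswith(names, 'C')

# Use the Boolean array to select the matching country codes.
c_codes = codes[starts_with_c]

# An integer array giving the length of each name.
name_lengths = np.char.str_len(names)

print(starts_with_c)
print(c_codes)
print(name_lengths)
```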

Pandas deals with string data slightly differently from Numpy. The elements of the .values component of a Pandas Series can be operated on all together by using the .str. accessor, to which we will now turn our attention.

String methods with Pandas Series#

As mentioned above, Pandas Series have a specialised accessor (.str.) which bypasses the need to use np.char. when we want to do something to all of the string values in a Series.

To see how this works, let’s construct a Series from our country_names_array:

# Build a Series of country names, indexed by country code
names_series =  pd.Series(country_names_array,
                          index=country_codes_array)
names_series
AUS         Australia
BRA            Brazil
CAN            Canada
CHN             China
DEU           Germany
ESP             Spain
FRA            France
GBR    United Kingdom
IND             India
ITA             Italy
JPN             Japan
KOR       South Korea
MEX            Mexico
RUS            Russia
USA     United States
dtype: str

To use the .str. accessor, we just place it after our object (e.g. our Series containing our string data). We can then call a variety of string methods, which will be applied to all elements in the .values array of the Series:

# String methods on Series
names_series.str.upper()
AUS         AUSTRALIA
BRA            BRAZIL
CAN            CANADA
CHN             CHINA
DEU           GERMANY
ESP             SPAIN
FRA            FRANCE
GBR    UNITED KINGDOM
IND             INDIA
ITA             ITALY
JPN             JAPAN
KOR       SOUTH KOREA
MEX            MEXICO
RUS            RUSSIA
USA     UNITED STATES
dtype: str
# The `.str.lower()` method
names_series.str.lower()
AUS         australia
BRA            brazil
CAN            canada
CHN             china
DEU           germany
ESP             spain
FRA            france
GBR    united kingdom
IND             india
ITA             italy
JPN             japan
KOR       south korea
MEX            mexico
RUS            russia
USA     united states
dtype: str

The .replace() string method is also available here. It will operate on all the elements in the Series, though in this case (as there is only one United States) it will only alter one value:

# Replacing values in the Series
names_series.str.replace("United States", "USA")
AUS         Australia
BRA            Brazil
CAN            Canada
CHN             China
DEU           Germany
ESP             Spain
FRA            France
GBR    United Kingdom
IND             India
ITA             Italy
JPN             Japan
KOR       South Korea
MEX            Mexico
RUS            Russia
USA               USA
dtype: str

By default, the string-specific .replace() method (accessed through the .str. accessor, as .str.replace()) will search for substrings within each string, as opposed to searching for exact, whole-string matches. This matches the behavior of the .replace() method of a standard Python str.

This is different to the behaviour of the non-string-specific .replace() method which we encountered on an earlier page. By default the non-string-specific .replace() method will search for exactly matching whole strings. Confusion between these two methods is a common source of error.

As such, by using the string-specific .str.replace() method we can easily replace substrings in multiple elements in the data at once, even where the whole strings are not the same. For instance:

# Replacing values in the Series
names_series.str.replace("United", "Disunited")
AUS            Australia
BRA               Brazil
CAN               Canada
CHN                China
DEU              Germany
ESP                Spain
FRA               France
GBR    Disunited Kingdom
IND                India
ITA                Italy
JPN                Japan
KOR          South Korea
MEX               Mexico
RUS               Russia
USA     Disunited States
dtype: str

Here, the substring 'United' has been replaced with 'Disunited' in both of the strings "United Kingdom" and "United States". This is a good example of a situation where a specialized accessor (like .str.) can change the behavior of other methods, often in a helpful way for dealing with a specific data type.
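
To make the contrast between the two .replace() methods concrete, here is a minimal sketch using a made-up three-element Series (the variable names are ours). We pass regex=False to .str.replace() to be explicit that we want a literal substring match.

```python
import pandas as pd

# A small Series of country names, for illustration only.
names = pd.Series(['United Kingdom', 'India', 'United States'],
                  index=['GBR', 'IND', 'USA'])

# The general `.replace()` method looks for whole values equal to
# 'United'. There are none, so nothing changes.
whole_value = names.replace('United', 'Disunited')

# The string-specific `.str.replace()` looks inside each string.
substring = names.str.replace('United', 'Disunited', regex=False)

# The general method does work when the whole value matches.
exact = names.replace('United States', 'USA')

print(whole_value, substring, exact, sep='\n\n')
```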

The syntax for slicing strings is the same as for a single value, but it also operates across all elements in the Series at once:

# Slicing with strings, in a Series
names_series.str[2:4]
AUS    st
BRA    az
CAN    na
CHN    in
DEU    rm
ESP    ai
FRA    an
GBR    it
IND    di
ITA    al
JPN    pa
KOR    ut
MEX    xi
RUS    ss
USA    it
dtype: str

Using the .str.contains() method, we can generate a Boolean Series by searching for a substring in each value:

# Generate a Boolean Series, True where the value contains "Ind"
names_series.str.contains("Ind")
AUS    False
BRA    False
CAN    False
CHN    False
DEU    False
ESP    False
FRA    False
GBR    False
IND     True
ITA    False
JPN    False
KOR    False
MEX    False
RUS    False
USA    False
dtype: bool

These Boolean Series can be used to retrieve specific values from the original Series, via Boolean filtering:

# Use Boolean filtering to retrieve a specific datapoint
names_series[names_series.str.contains("Ind")]
IND    India
dtype: str
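
By default, .str.contains() is case-sensitive. If we are unsure about capitalization, or if the Series may contain missing values, the case and na arguments can help. The sketch below uses a made-up Series (including one deliberately missing value) rather than the data from this page.

```python
import pandas as pd

# A made-up Series with one missing value, for illustration only.
names = pd.Series(['United Kingdom', 'India', 'Indonesia', None])

# Case-sensitive search: no value contains lowercase 'ind'.
# `na=False` treats missing values as "no match".
strict = names.str.contains('ind', na=False)

# Case-insensitive search: matches 'India' and 'Indonesia'.
relaxed = names.str.contains('ind', case=False, na=False)

# Because `relaxed` contains only True/False, it is safe to filter with.
print(names[relaxed])
```

Without na=False, missing values can carry through into the result, which may cause problems when the result is used for Boolean filtering.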

String methods with Pandas DataFrames#

So, Pandas makes it somewhat easier than Numpy to perform operations on all the string elements at once.

Remember that a DataFrame is a dictionary-like collection of Series, and so everything we have just seen of strings in Pandas Series applies to the columns of a Data Frame.

Let’s import the HDI data into a Pandas Data Frame:

# Import data
df = pd.read_csv("data/year_2000_hdi_fert.csv")
# Show the data
df
    Code  Human Development Index  Fertility Rate  Population    Country Name
 0   AUS                    0.896           1.764     19.1324       Australia
 1   BRA                    0.668           2.247    174.0182          Brazil
 2   CAN                    0.890           1.510     30.8918          Canada
 3   CHN                    0.586           1.628   1269.5811           China
 4   DEU                    0.890           1.386     81.7972         Germany
 5   ESP                    0.828           1.210     41.0197           Spain
 6   FRA                    0.844           1.876     59.4837          France
 7   GBR                    0.863           1.641     59.0573  United Kingdom
 8   IND                    0.490           3.350   1057.9227           India
 9   ITA                    0.842           1.249     57.2722           Italy
10   JPN                    0.883           1.346    127.0278           Japan
11   KOR                    0.824           1.467     46.7666     South Korea
12   MEX                    0.709           2.714     98.6255          Mexico
13   RUS                    0.733           1.190    146.7177          Russia
14   USA                    0.894           2.030    281.4841   United States

Because each Data Frame column can be extracted as a Pandas Series, we can use string methods in the same way as we saw in the last section:

# Use the `.replace()` method
df['Country Name'].str.replace('I', 'I starts I')
0          Australia
1             Brazil
2             Canada
3              China
4            Germany
5              Spain
6             France
7     United Kingdom
8     I starts India
9     I starts Italy
10             Japan
11       South Korea
12            Mexico
13            Russia
14     United States
Name: Country Name, dtype: str

However, Pandas will not let us use string methods on the whole Data Frame at once:

# Cannot use `.str` methods on whole Data Frame.
df.str.upper()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2810/1916934331.py in ?()
      1 # Cannot use `.str` methods on whole Data Frame.
----> 2 df.str.upper()

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6202             and name not in self._accessors
   6203             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6204         ):
   6205             return self[name]
-> 6206         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'str'

We might think that this is because not all the data in the Data Frame is of the string type, but this is not the case. We also cannot use Pandas string methods on Data Frames with columns only containing string data, as we get the same error:

# Oops, using `.str` fails even for all-string-dtype columns.
df[['Country Name', 'Code']].str.len()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2810/3866909753.py in ?()
      1 # Oops, using `.str` fails even for all-string-dtype columns.
----> 2 df[['Country Name', 'Code']].str.len()

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6202             and name not in self._accessors
   6203             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6204         ):
   6205             return self[name]
-> 6206         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'str'

Notice that here we did not set Code as the index of the Data Frame; we are just treating it as a column containing string data. The cell below sets it as the index, so we can use label-based indexing later in this tutorial:

# Set the index
df.index = df['Code']

So, we cannot apply string methods to multiple columns at once, but if we focus on one column, we can use all of the available string methods:

# The `.lower()` method
df['Country Name'].str.lower()
Code
AUS         australia
BRA            brazil
CAN            canada
CHN             china
DEU           germany
ESP             spain
FRA            france
GBR    united kingdom
IND             india
ITA             italy
JPN             japan
KOR       south korea
MEX            mexico
RUS            russia
USA     united states
Name: Country Name, dtype: str
# The `.upper()` method
df['Country Name'].str.upper()
Code
AUS         AUSTRALIA
BRA            BRAZIL
CAN            CANADA
CHN             CHINA
DEU           GERMANY
ESP             SPAIN
FRA            FRANCE
GBR    UNITED KINGDOM
IND             INDIA
ITA             ITALY
JPN             JAPAN
KOR       SOUTH KOREA
MEX            MEXICO
RUS            RUSSIA
USA     UNITED STATES
Name: Country Name, dtype: str
# Using the `str.count()` method
df['Country Name'].str.count('a') 
Code
AUS    2
BRA    1
CAN    3
CHN    1
DEU    1
ESP    1
FRA    1
GBR    0
IND    1
ITA    1
JPN    2
KOR    1
MEX    0
RUS    1
USA    1
Name: Country Name, dtype: int64
# The `str.contains()` method
df['Country Name'].str.contains('Russia')
Code
AUS    False
BRA    False
CAN    False
CHN    False
DEU    False
ESP    False
FRA    False
GBR    False
IND    False
ITA    False
JPN    False
KOR    False
MEX    False
RUS     True
USA    False
Name: Country Name, dtype: bool
# See if there are any Trues in the Series
df['Country Name'].str.contains('Russia').sum()
np.int64(1)
# Filtering data using the `str.contains()` method
df[df['Country Name'].str.contains('Russia')]
      Code  Human Development Index  Fertility Rate  Population  Country Name
Code
RUS    RUS                    0.733            1.19    146.7177        Russia
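
Although .str. is not available on a whole Data Frame, we may still want to apply the same string method to several columns. One possible workaround (not the only one) is the Data Frame .apply() method, which passes each column to a function as a Series, where .str. is available. The sketch below builds a small two-row Data Frame by hand, using values from the table above; the names small_df and text_cols are ours.

```python
import pandas as pd

# A hand-built, two-row version of the HDI Data Frame.
small_df = pd.DataFrame({'Code': ['AUS', 'BRA'],
                         'Country Name': ['Australia', 'Brazil'],
                         'Population': [19.1324, 174.0182]})

# The columns that contain string data.
text_cols = ['Code', 'Country Name']

# `.apply()` calls the function once per column, passing a Series,
# so the `.str.` accessor is available inside the function.
lengths = small_df[text_cols].apply(lambda col: col.str.len())
lowered = small_df[text_cols].apply(lambda col: col.str.lower())

print(lengths)
print(lowered)
```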

Uses of string methods in data wrangling#

As we mentioned earlier, string methods are generally useful for cleaning text data. This can be especially useful when combining data from different sources, where different conventions in data entry may lead to similar data being formatted differently.

To explore this, let’s import a new dataset which, like the HDI data, contains observations at the country level (i.e. each row is an observation from a specific country).

It contains various data about countries, including maternal mortality rates. You can read more about the dataset here.

# Import gender statistics dataset
gender_df = pd.read_csv("data/gender_stats.csv")
gender_df
     country_name  country_code  fert_rate  gdp_us_billion  health_exp_per_cap  health_exp_pub  prim_ed_girls  mat_mort_ratio  population
  0         Aruba           ABW    1.66325             NaN                 NaN             NaN      48.721939             NaN    0.103744
  1   Afghanistan           AFG    4.95450       19.961015          161.138034        2.834598      40.109708          444.00   32.715838
  2        Angola           AGO    6.12300      111.936542          254.747970        2.447546            NaN          501.25   26.937545
  3       Albania           ALB    1.76925       12.327586          574.202694        2.836021      47.201082           29.25    2.888280
  4       Andorra           AND        NaN        3.197538         4421.224933        7.260281      47.123345             NaN    0.079547
...           ...           ...        ...             ...                 ...             ...            ...             ...         ...
211        Kosovo           XKX    2.14250        6.804620                 NaN             NaN            NaN             NaN    1.813820
212   Yemen, Rep.           YEM    4.22575       36.819337          207.949700        1.417836      44.470076          399.75   26.246608
213  South Africa           ZAF    2.37525      345.209888         1123.142656        4.241441      48.516298          143.75   54.177209
214        Zambia           ZMB    5.39425       24.280990          185.556359        2.687290      49.934484          233.75   15.633220
215      Zimbabwe           ZWE    3.94300       15.495514          115.519881        2.695188      49.529875          398.00   15.420964

216 rows × 9 columns

# Show the rows where the country name starts with 'S'
gender_df[gender_df['country_name'].str.startswith('S')]
                       country_name  country_code  fert_rate  gdp_us_billion  health_exp_per_cap  health_exp_pub  prim_ed_girls  mat_mort_ratio  population
 33                     Switzerland           CHE    1.53000      676.642359         6335.436388        7.642569      48.556328            5.25    8.185870
 58                           Spain           ESP    1.30750     1299.724261         2963.832825        6.545739      48.722231            5.00   46.553128
103             St. Kitts and Nevis           KNA        NaN        0.832756         1212.879641        2.234503      49.805051             NaN    0.053722
110                       St. Lucia           LCA    1.90200        1.362564          798.301449        3.780893      48.590773           49.00    0.176427
112                       Sri Lanka           LKA    2.09775       76.808506          338.184994        1.762039      49.201950           31.25   20.790000
118        St. Martin (French part)           MAF    1.81250             NaN                 NaN             NaN            NaN             NaN    0.031491
166                    Saudi Arabia           SAU    2.79225      707.936120         2181.916849        3.114481      49.034225           12.25   30.728077
167                           Sudan           SDN    4.38775       83.016732          272.975786        1.844650      46.560974          321.00   37.760931
168                         Senegal           SEN    5.10400       14.539555          101.042953        2.278713      51.881078          331.00   14.551710
169                       Singapore           SGP    1.24250      298.724394         3645.683877        1.763894            NaN           10.75    5.464722
170                 Solomon Islands           SLB    3.99975        1.114535          111.679703        4.908624      47.983030          120.00    0.575490
171                    Sierra Leone           SLE    4.69050        4.331604          207.625767        1.853112      50.098291         1435.00    7.080112
173                      San Marino           SMR    1.26000             NaN         3437.298747        5.752693      45.616261             NaN    0.032607
174                         Somalia           SOM    6.51450        5.785250                 NaN             NaN            NaN          762.75   13.527075
175                          Serbia           SRB    1.45000       41.075644         1298.626684        6.150194      48.625357           16.75    7.129316
176                     South Sudan           SSD    5.06600       11.480939           58.565752        0.991932      40.879082          827.50   11.527917
177           Sao Tome and Principe           STP    4.60375        0.314540          304.237533        3.227125      48.664169          159.50    0.191333
178                        Suriname           SUR    2.37325        4.773159          967.888627        3.119683      48.418349          157.50    0.547824
179                 Slovak Republic           SVK    1.35500       93.894473         2107.917656        5.768967      48.521589            6.00    5.418425
180                        Slovenia           SVN    1.57250       46.048863         2644.776298        6.698656      48.566330            8.75    2.061494
181                          Sweden           SWE    1.89000      540.626904         5134.572113       10.010744      49.606578            4.00    9.703634
182                       Swaziland           SWZ    3.30200        4.346817          570.468058        6.891502      47.524406          400.50    1.295364
183       Sint Maarten (Dutch part)           SXM        NaN             NaN                 NaN             NaN      49.508551             NaN    0.037552
184                      Seychelles           SYC    2.35000        1.366581          876.631453        3.413612      49.523733             NaN    0.091541
185            Syrian Arab Republic           SYR    2.96775             NaN          269.945739        1.507166      48.047394           62.00   19.319674
204  St. Vincent and the Grenadines           VCT    1.98600        0.730107          775.803386        4.365757      48.536415           45.75    0.109421
210                           Samoa           WSM    4.11825        0.799887          366.353096        5.697059      48.350049           54.75    0.192225
213                    South Africa           ZAF    2.37525      345.209888         1123.142656        4.241441      48.516298          143.75   54.177209

Let’s say we are interested in Russia, but do not know how the name of the country is formatted. We can use the str.contains() method to search for likely matches.

# Hmmm is Russia not in this data?
gender_df['country_name'].str.contains('Russia')
0      False
1      False
2      False
3      False
4      False
       ...  
211    False
212    False
213    False
214    False
215    False
Name: country_name, Length: 216, dtype: bool

That output is pretty opaque; maybe there is a True in there somewhere. Because Python treats True values as being equal to 1, we can chain on the .sum() method to count the number of True values in the above Boolean Series:

# Count the Trues for country names containing "Russia"
gender_df['country_name'].str.contains('Russia').sum()
np.int64(1)
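
The counting trick works because Boolean values behave like the integers 1 and 0 in arithmetic. As a small sketch, using a made-up three-element Series rather than the full dataset, we can count the matches with .sum(), or simply check whether there are any matches with .any():

```python
import pandas as pd

# True and False behave like 1 and 0 in arithmetic.
print(True + True)

# A made-up Series of country names, for illustration only.
names = pd.Series(['Russian Federation', 'Spain', 'Sweden'])

# Boolean Series: True where the name contains 'Russia'.
has_russia = names.str.contains('Russia')

# Count the True values.
n_matches = has_russia.sum()

# Check whether there is at least one True value.
any_match = has_russia.any()

print(n_matches, any_match)
```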

It appears we do have one match. Let’s use the Boolean Series we just made to have a look at the row that contains the string “Russia” in the country_name column:

# Use the `str.contains()` method to filter the data
gender_df[gender_df['country_name'].str.contains('Russia')]
           country_name  country_code  fert_rate  gdp_us_billion  health_exp_per_cap  health_exp_pub  prim_ed_girls  mat_mort_ratio  population
164  Russian Federation           RUS     1.7245       1822.6917         1755.506635        3.731354       48.96807           25.25  143.793504

So, we have found the row for Russia in this new dataset. Let’s compare the naming convention to the HDI data, in the df Data Frame:

# Get the data for Russia, from the HDI data
df.loc['RUS']
Code                            RUS
Human Development Index       0.733
Fertility Rate                 1.19
Population                 146.7177
Country Name                 Russia
Name: RUS, dtype: object

In due course, we may want to merge these datasets. To do that, we need common identifiers linking rows in each dataset which refer to the same observational units (in this case countries).

String methods are our friend here. We can use the process just outlined to find data for a specific country, and then use other methods to ensure uniform formatting between the datasets, such that we can merge them:

# Format the maternal mortality data for Russia to use the same country name as the HDI data
gender_df['country_name'] = gender_df['country_name'].str.replace('Russian Federation', 'Russia')

# Show the newly formatted row
gender_df[gender_df['country_name'].str.contains('Russia')]
     country_name  country_code  fert_rate  gdp_us_billion  health_exp_per_cap  health_exp_pub  prim_ed_girls  mat_mort_ratio  population
164        Russia           RUS     1.7245       1822.6917         1755.506635        3.731354       48.96807           25.25  143.793504

We are now ready for a clean and stress-free data merge! (NB: we are grossly exaggerating here, merging datasets is almost never stress-free…)
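
To give a flavour of why the uniform names matter, here is a hedged sketch of a merge on country name. It uses two tiny, hand-built Data Frames (with values for Russia and Spain copied from the tables above) rather than the full datasets, and the helper function count_unmatched is our own invention for illustration. The indicator=True argument to .merge() adds a _merge column recording whether each row was found in both Data Frames.

```python
import pandas as pd

# Tiny hand-built stand-ins for the two datasets.
hdi = pd.DataFrame({'Country Name': ['Russia', 'Spain'],
                    'Human Development Index': [0.733, 0.828]})
gender = pd.DataFrame({'country_name': ['Russian Federation', 'Spain'],
                       'fert_rate': [1.7245, 1.3075]})


def count_unmatched(left, right):
    """ Count rows that appear in only one of the two Data Frames.

    A hypothetical helper, written for this example.
    """
    merged = left.merge(right,
                        left_on='Country Name',
                        right_on='country_name',
                        how='outer',
                        indicator=True)
    return int((merged['_merge'] != 'both').sum())


# Before cleaning, 'Russia' and 'Russian Federation' do not match.
unmatched_before = count_unmatched(hdi, gender)

# Clean the name, as we did above, then try again.
gender['country_name'] = gender['country_name'].str.replace(
    'Russian Federation', 'Russia', regex=False)
unmatched_after = count_unmatched(hdi, gender)

print(unmatched_before, unmatched_after)
```

Counting the unmatched rows like this is one way to check whether there are more names to clean before relying on the merged data.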

Summary#

This page looked at string methods in base Python, Numpy and Pandas.

Numpy and Pandas inherit their string methods from base Python, but apply them in different ways.

Numpy arrays do not have methods for applying string operations to every element simultaneously. We need functions from the np.char module if we want this.

By contrast, Pandas Series - whether in isolation or as columns in a Data Frame - have the .str. accessor for easily performing string operations on every element in a Series.