Working with strings#
String/text data is often used to represent categorical variables, and commonly appears in a variety of data analysis contexts. When dealing with string/text data we will frequently find that we need to alter the strings to correct errors, improve clarity, make formatting uniform, or for a host of other reasons.
String methods are inherent to Python, and these methods or variants of them can all be used on Numpy arrays and Pandas Series/Data Frames. However, Numpy and Pandas use different interfaces for interacting with strings. To understand the differences between Numpy and Pandas with respect to strings, let’s begin at the foundation, the in-built string methods of Python. The cell below contains a simple string.
# A string
my_string = "a few words"
my_string
'a few words'
Here are the public (non-private) methods and attributes of a standard Python str:
# Attributes and methods not starting with `_` (or `__`):
[k for k in dir(my_string) if not k.startswith('_')]
['capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isascii',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'removeprefix',
'removesuffix',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']
Remember that a method is a function attached to an object. In this case our object is a string.
Let’s say we like reading values in our data as if they are being spoken in a loud voice. If this is the case we can alter the format of the string to make all letters uppercase, using the .upper() method:
# The `.upper()` method of `str`.
my_string.upper()
'A FEW WORDS'
We can replace characters in the string using the aptly named .replace() method. Here we supply two strings to the method: first the string we want to replace, and second the string we want to replace it with. In this case, let’s .replace() the blank spaces with underscores:
# The `.replace()` method.
my_string.replace(' ', '_')
'a_few_words'
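String methods return a new string rather than changing the original, because Python strings are immutable. The short sketch below illustrates this, along with the optional third argument of .replace(), which limits the number of replacements. The names `underscored` and `first_only` are ours, chosen for illustration:

```python
# Strings are immutable: `.replace()` returns a new string.
my_string = "a few words"
underscored = my_string.replace(' ', '_')

# The original string is unchanged.
print(my_string)
print(underscored)

# An optional third argument limits the number of replacements made.
first_only = my_string.replace(' ', '_', 1)
print(first_only)
```

If we want to keep the altered version, we need to assign the result to a variable, as we did here.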
Other formatting methods let us adjust the case of strings in more elaborate ways, for instance, converting them to title case (.title()):
# A more elaborate string method.
my_string.title()
'A Few Words'
In Python, strings are collections of characters, and so we can slice them as we would a list or array, using integer indexes and the : symbol:
# Slicing with strings.
my_string[0:1]
'a'
my_string[0:2]
'a '
my_string[0:7]
'a few w'
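Slicing strings also supports negative indices and a step, just as for lists. The following is a small sketch of these variations, using the same string as above. The variable names are ours, for illustration only:

```python
my_string = "a few words"

# Negative indices count back from the end of the string.
last_five = my_string[-5:]
print(last_five)

# A step of 2 takes every other character.
every_other = my_string[::2]
print(every_other)

# A step of -1 reverses the string.
reversed_string = my_string[::-1]
print(reversed_string)
```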
You can visit this page to see the variety of string methods available in base Python.
String methods with Numpy arrays#
So, strings in base Python have a large number of in-built methods - what about strings in Numpy?
Numpy arrays themselves do not have specific string methods, but the in-built Python string methods can be called on individual string values in a Numpy array. Alternatively, we can use functions from the np.char. module to operate on all the strings in the array in one go.
To investigate how string data is handled in Numpy, let’s make some arrays containing strings from the now (very) familiar HDI dataset.
# Import libraries (no imports were needed prior to this point as string methods are part of base python)
import numpy as np
import pandas as pd
# A custom function to generate a Series to check exercise solutions.
import clean_gender_df_names
# Calculate answer to exercise (see below).
answer_clean_series = clean_gender_df_names.get_cleaned(
pd.read_csv("data/gender_stats.csv")['country_name'])
# Standard three-letter code for each country.
country_codes_array = np.array(['AUS', 'BRA', 'CAN',
'CHN', 'DEU', 'ESP',
'FRA', 'GBR', 'IND',
'ITA', 'JPN', 'KOR',
'MEX', 'RUS', 'USA'])
# Country names.
country_names_array = np.array(['Australia', 'Brazil', 'Canada',
'China', 'Germany', 'Spain',
'France', 'United Kingdom', 'India',
'Italy', 'Japan', 'South Korea',
'Mexico', 'Russia', 'United States'])
For comparison, let’s make an array containing the numerical HDI scores:
# Human Development Index scores for each country, in the same
# order as the country codes above.
hdis_array = np.array([0.896, 0.668, 0.89 , 0.586,
                       0.89 , 0.828, 0.844, 0.863,
                       0.49 , 0.842, 0.883, 0.824,
                       0.709, 0.733, 0.894])
The dtype attribute of the first two arrays begins with <U, indicating we are dealing with string data.
# Show the dtype of the country codes array (i.e. string data)
country_codes_array.dtype
dtype('<U3')
U3 tells us that the array stores Unicode (U) strings of up to three Unicode characters in length.
# Show the dtype of the country names array (i.e. string data)
country_names_array.dtype
dtype('<U14')
Conversely, the hdis_array contains data of a numerical type:
# Show the dtype of the hdis array (i.e. numeric data)
hdis_array.dtype
dtype('float64')
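The fixed width in a dtype such as <U3 has a practical consequence: if we assign a longer string into such an array, Numpy silently truncates it to fit. The sketch below shows this with a small made-up array (`codes` and `wider` are illustrative names, not part of the HDI data), and one way to make room for longer strings using .astype():

```python
import numpy as np

# A small array of three-character strings.
codes = np.array(['AUS', 'BRA', 'CAN'])
print(codes.dtype)

# Assigning a longer string truncates it to three characters.
codes[0] = 'AUSTRALIA'
print(codes[0])

# Converting to a wider string dtype makes room for longer values.
wider = codes.astype('U9')
wider[0] = 'AUSTRALIA'
print(wider[0])
```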
Using indexing, we can use all of the in-built Python string methods on the individual values within a Numpy array:
# Methods on an individual string
country_codes_array[0]
np.str_('AUS')
For instance, we can change the case of the value:
# Lowercase
country_codes_array[0].lower()
'aus'
# Uppercase
country_codes_array[0].upper()
'AUS'
We can also replace elements of the string:
country_codes_array[0].replace("A", "Comparable to the ")
'Comparable to the US'
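The reason these methods work is that the np.str_ values we get from indexing are a subclass of the Python str type. The following sketch checks this, and shows that we could, as one option, loop over the array and call a string method on each value in turn. The array and variable names here are made up for illustration:

```python
import numpy as np

codes = np.array(['AUS', 'BRA'])
first = codes[0]

# The value is a Numpy string, which is also a Python `str`.
print(type(first))
print(isinstance(first, str))

# Call a string method on each value in turn, via a list comprehension.
lowered = [code.lower() for code in codes]
print(lowered)
```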
Understandably, if we try to use any of these string methods on numerical data, we will get an error:
# Oh no!
hdis_array[0].upper()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[22], line 2
1 # Oh no!
----> 2 hdis_array[0].upper()
AttributeError: 'numpy.float64' object has no attribute 'upper'
All of the string methods used in this section above have been called on single string values from a Numpy array. If we try to use a string method on all values of the array simultaneously, we will also get an error:
# This does not work
country_codes_array.lower()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[23], line 2
1 # This does not work
----> 2 country_codes_array.lower()
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
String methods in Numpy must be called either on single string values or through functions in the np.char. module.
For example, we can use the np.char.lower() function to operate on all values of the Numpy array at once:
# This DOES work
np.char.lower(country_codes_array)
array(['aus', 'bra', 'can', 'chn', 'deu', 'esp', 'fra', 'gbr', 'ind',
'ita', 'jpn', 'kor', 'mex', 'rus', 'usa'], dtype='<U3')
# This DOES work too
np.char.replace(country_codes_array, 'A', '!')
array(['!US', 'BR!', 'C!N', 'CHN', 'DEU', 'ESP', 'FR!', 'GBR', 'IND',
'IT!', 'JPN', 'KOR', 'MEX', 'RUS', 'US!'], dtype='<U3')
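The np.char. module has functions corresponding to many of the other Python string methods. As a brief sketch, using a small made-up array of names, here are np.char.str_len, np.char.startswith and np.char.add. The Boolean array from np.char.startswith can be used for Boolean indexing in the usual way:

```python
import numpy as np

names = np.array(['Australia', 'Brazil', 'Canada'])

# Length of each string in the array.
lengths = np.char.str_len(names)
print(lengths)

# Boolean array, True where the string starts with 'B'.
starts_with_b = np.char.startswith(names, 'B')
b_names = names[starts_with_b]
print(b_names)

# Element-wise string concatenation.
labelled = np.char.add(names, ' (country)')
print(labelled)
```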
Pandas deals with string data slightly differently to Numpy. The elements of the .values component of a Pandas Series can be operated on all together by using the .str. accessor, to which we will now turn our attention.
String methods with Pandas Series#
As mentioned above, Pandas Series have a specialised accessor (.str.) which bypasses the need to use np.char. when we want to do something to all of the string values in a Series.
To see how this works, let’s construct a Series from our country_names_array:
# Build a Series of country names, indexed by country code
names_series = pd.Series(country_names_array,
index=country_codes_array)
names_series
AUS Australia
BRA Brazil
CAN Canada
CHN China
DEU Germany
ESP Spain
FRA France
GBR United Kingdom
IND India
ITA Italy
JPN Japan
KOR South Korea
MEX Mexico
RUS Russia
USA United States
dtype: str
To use the .str. accessor, we just place it after our object (e.g. our Series containing our string data). We can then call a variety of string methods, which will be applied to all elements in the .values array of the Series:
# String methods on Series
names_series.str.upper()
AUS AUSTRALIA
BRA BRAZIL
CAN CANADA
CHN CHINA
DEU GERMANY
ESP SPAIN
FRA FRANCE
GBR UNITED KINGDOM
IND INDIA
ITA ITALY
JPN JAPAN
KOR SOUTH KOREA
MEX MEXICO
RUS RUSSIA
USA UNITED STATES
dtype: str
# The `.str.lower()` method
names_series.str.lower()
AUS australia
BRA brazil
CAN canada
CHN china
DEU germany
ESP spain
FRA france
GBR united kingdom
IND india
ITA italy
JPN japan
KOR south korea
MEX mexico
RUS russia
USA united states
dtype: str
The .replace() string method is also available here. It will operate on all the elements in the Series, though in this case (as there is only one United States) it will only alter one value:
# Replacing values in the Series
names_series.str.replace("United States", "USA")
AUS Australia
BRA Brazil
CAN Canada
CHN China
DEU Germany
ESP Spain
FRA France
GBR United Kingdom
IND India
ITA Italy
JPN Japan
KOR South Korea
MEX Mexico
RUS Russia
USA USA
dtype: str
By default, the string-specific .replace method (accessed through the str accessor - .str.replace()) will search for substrings within strings, as opposed to searching for exact, whole string matches. This matches the behavior of the corresponding .replace method on Python strings.
This is different to the behaviour of the non-string-specific .replace() method which we encountered on an earlier page. By default the non-string-specific .replace() method will search for exactly matching whole strings. Confusion between these two methods is a common source of error.
As such, by using the string-specific .str.replace() method we can easily replace substrings in multiple elements in the data at once, even where the whole strings are not the same. For instance:
# Replacing values in the Series
names_series.str.replace("United", "Disunited")
AUS Australia
BRA Brazil
CAN Canada
CHN China
DEU Germany
ESP Spain
FRA France
GBR Disunited Kingdom
IND India
ITA Italy
JPN Japan
KOR South Korea
MEX Mexico
RUS Russia
USA Disunited States
dtype: str
Here, the substring 'United' has been replaced with 'Disunited' in both the strings "United Kingdom" and "United States". This is a good example of a situation where a specialized accessor (like .str.) can change the behavior of other methods, often in a helpful way for dealing with a specific data type.
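To make the difference between the two replace methods concrete, here is a minimal sketch using a small made-up Series (the names `whole_match`, `substring_match` and `exact_match` are ours). The non-string-specific .replace() only acts on values that match exactly, whereas .str.replace() acts on substrings:

```python
import pandas as pd

names = pd.Series(['United Kingdom', 'United States', 'India'])

# Non-string-specific `.replace()`: no value is exactly 'United',
# so nothing changes.
whole_match = names.replace('United', 'Disunited')
print(whole_match.tolist())

# String-specific `.str.replace()`: replaces the substring wherever
# it occurs.
substring_match = names.str.replace('United', 'Disunited')
print(substring_match.tolist())

# Non-string-specific `.replace()` does act on an exact, whole match.
exact_match = names.replace('United States', 'USA')
print(exact_match.tolist())
```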
The syntax for slicing strings is the same as for a single value, but it also operates across all elements in the Series at once:
# Slicing with strings, in a Series
names_series.str[2:4]
AUS st
BRA az
CAN na
CHN in
DEU rm
ESP ai
FRA an
GBR it
IND di
ITA al
JPN pa
KOR ut
MEX xi
RUS ss
USA it
dtype: str
Using the .contains() method, Boolean Series can be generated by searching for specific instances of a substring in each value:
# Generate a Boolean Series, True where the value contains "Ind"
names_series.str.contains("Ind")
AUS False
BRA False
CAN False
CHN False
DEU False
ESP False
FRA False
GBR False
IND True
ITA False
JPN False
KOR False
MEX False
RUS False
USA False
dtype: bool
These Boolean Series can be used to retrieve specific values from the original Series, via Boolean filtering:
# Use Boolean filtering to retrieve a specific datapoint
names_series[names_series.str.contains("Ind")]
IND India
dtype: str
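Two details of .str.contains() are worth knowing when searching messy data. The search is case-sensitive by default, and the search pattern is treated as a regular expression by default, so characters like . have a special meaning. The sketch below uses a small made-up Series ('Stanley' is included only to show the regular expression behavior) with the case and regex arguments:

```python
import pandas as pd

names = pd.Series(['India', 'Indonesia', 'St. Lucia', 'Stanley'])

# Case-sensitive by default: lowercase 'ind' matches nothing here.
case_sensitive = names.str.contains('ind')
print(case_sensitive.tolist())

# Ignore case with `case=False`.
case_insensitive = names.str.contains('ind', case=False)
print(case_insensitive.tolist())

# As a regular expression, '.' matches any character, so 'St.'
# matches 'Sta' in 'Stanley' as well as 'St.' in 'St. Lucia'.
as_regex = names.str.contains('St.')
print(as_regex.tolist())

# With `regex=False`, the pattern is treated as a literal string.
as_literal = names.str.contains('St.', regex=False)
print(as_literal.tolist())
```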
String methods with Pandas DataFrames#
So, Pandas makes it somewhat easier than Numpy to perform operations on all the string elements at once.
Remember that a DataFrame is a dictionary-like collection of Series, and so everything we have just seen of strings in Pandas Series applies to the columns of a Data Frame.
Let’s import the HDI data in a Pandas Data Frame:
# Import data
df = pd.read_csv("data/year_2000_hdi_fert.csv")
# Show the data
df
| | Code | Human Development Index | Fertility Rate | Population | Country Name |
|---|---|---|---|---|---|
| 0 | AUS | 0.896 | 1.764 | 19.1324 | Australia |
| 1 | BRA | 0.668 | 2.247 | 174.0182 | Brazil |
| 2 | CAN | 0.890 | 1.510 | 30.8918 | Canada |
| 3 | CHN | 0.586 | 1.628 | 1269.5811 | China |
| 4 | DEU | 0.890 | 1.386 | 81.7972 | Germany |
| 5 | ESP | 0.828 | 1.210 | 41.0197 | Spain |
| 6 | FRA | 0.844 | 1.876 | 59.4837 | France |
| 7 | GBR | 0.863 | 1.641 | 59.0573 | United Kingdom |
| 8 | IND | 0.490 | 3.350 | 1057.9227 | India |
| 9 | ITA | 0.842 | 1.249 | 57.2722 | Italy |
| 10 | JPN | 0.883 | 1.346 | 127.0278 | Japan |
| 11 | KOR | 0.824 | 1.467 | 46.7666 | South Korea |
| 12 | MEX | 0.709 | 2.714 | 98.6255 | Mexico |
| 13 | RUS | 0.733 | 1.190 | 146.7177 | Russia |
| 14 | USA | 0.894 | 2.030 | 281.4841 | United States |
Because each Data Frame column can be extracted as a Pandas Series, we can use string methods in the same way as we saw in the last section:
# Use the `.str.replace()` method
df['Country Name'].str.replace('I', 'I starts I')
0 Australia
1 Brazil
2 Canada
3 China
4 Germany
5 Spain
6 France
7 United Kingdom
8 I starts India
9 I starts Italy
10 Japan
11 South Korea
12 Mexico
13 Russia
14 United States
Name: Country Name, dtype: str
However, Pandas will not let us use string methods on the whole Data Frame at once:
# Cannot use `.str` methods on whole Data Frame.
df.str.upper()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_2810/1916934331.py in ?()
1 # Cannot use `.str` methods on whole Data Frame.
----> 2 df.str.upper()
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
6202 and name not in self._accessors
6203 and self._info_axis._can_hold_identifiers_and_holds_name(name)
6204 ):
6205 return self[name]
-> 6206 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
We might think that this is because not all the data in the Data Frame is of the string type, but this is not the case. We also cannot use Pandas string methods on Data Frames with columns only containing string data, as we get the same error:
# Oops, using `.str` fails even for all-string-dtype columns.
df[['Country Name', 'Code']].str.len()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_2810/3866909753.py in ?()
1 # Oops, using `.str` fails even for all-string-dtype columns.
----> 2 df[['Country Name', 'Code']].str.len()
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
6202 and name not in self._accessors
6203 and self._info_axis._can_hold_identifiers_and_holds_name(name)
6204 ):
6205 return self[name]
-> 6206 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
Notice that here we did not set Code as the index of the Data Frame; we are just treating it as a column containing string data. The cell below sets it as the index, so we can use label-based indexing later in this tutorial:
# Set the index
df.index = df['Code']
So, we cannot apply string methods to multiple columns at once, but if we focus on one column, we can use all of the available string methods:
# The `.lower()` method
df['Country Name'].str.lower()
Code
AUS australia
BRA brazil
CAN canada
CHN china
DEU germany
ESP spain
FRA france
GBR united kingdom
IND india
ITA italy
JPN japan
KOR south korea
MEX mexico
RUS russia
USA united states
Name: Country Name, dtype: str
# The `.upper()` method
df['Country Name'].str.upper()
Code
AUS AUSTRALIA
BRA BRAZIL
CAN CANADA
CHN CHINA
DEU GERMANY
ESP SPAIN
FRA FRANCE
GBR UNITED KINGDOM
IND INDIA
ITA ITALY
JPN JAPAN
KOR SOUTH KOREA
MEX MEXICO
RUS RUSSIA
USA UNITED STATES
Name: Country Name, dtype: str
# Using the `str.count()` method
df['Country Name'].str.count('a')
Code
AUS 2
BRA 1
CAN 3
CHN 1
DEU 1
ESP 1
FRA 1
GBR 0
IND 1
ITA 1
JPN 2
KOR 1
MEX 0
RUS 1
USA 1
Name: Country Name, dtype: int64
# The `str.contains()` method
df['Country Name'].str.contains('Russia')
Code
AUS False
BRA False
CAN False
CHN False
DEU False
ESP False
FRA False
GBR False
IND False
ITA False
JPN False
KOR False
MEX False
RUS True
USA False
Name: Country Name, dtype: bool
# See if there are any Trues in the Series
df['Country Name'].str.contains('Russia').sum()
np.int64(1)
# Filtering data using the `str.contains()` method
df[df['Country Name'].str.contains('Russia')]
| | Code | Human Development Index | Fertility Rate | Population | Country Name |
|---|---|---|---|---|---|
| Code | | | | | |
| RUS | RUS | 0.733 | 1.19 | 146.7177 | Russia |
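Although the .str. accessor is not available on a whole Data Frame, one possible workaround is to apply a string method to each column in turn with the Data Frame .apply() method, which passes each column to a function as a Series. The sketch below assumes a small made-up Data Frame (`small_df`) in which every column contains strings. It would fail with an error for columns of other types:

```python
import pandas as pd

small_df = pd.DataFrame({'Code': ['AUS', 'BRA'],
                         'Country Name': ['Australia', 'Brazil']})

# Apply `.str.upper()` to each column (each column is a Series).
upper_df = small_df.apply(lambda column: column.str.upper())
print(upper_df)

# The same pattern works for other string methods, such as `.str.len()`.
len_df = small_df.apply(lambda column: column.str.len())
print(len_df)
```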
Uses of string methods in data wrangling#
As we mentioned earlier, string methods are generally useful for cleaning text data. This can be especially useful when combining data from different sources, where different conventions in data entry may lead to similar data being formatted differently.
To explore this, let’s import a new dataset, which, like the HDI data, contains observations at the country level (i.e. each row is an observation from a specific country).
It contains various data about countries, including maternal mortality rates. You can read more about the dataset here.
# Import gender statistics dataset
gender_df = pd.read_csv("data/gender_stats.csv")
gender_df
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | 1.66325 | NaN | NaN | NaN | 48.721939 | NaN | 0.103744 |
| 1 | Afghanistan | AFG | 4.95450 | 19.961015 | 161.138034 | 2.834598 | 40.109708 | 444.00 | 32.715838 |
| 2 | Angola | AGO | 6.12300 | 111.936542 | 254.747970 | 2.447546 | NaN | 501.25 | 26.937545 |
| 3 | Albania | ALB | 1.76925 | 12.327586 | 574.202694 | 2.836021 | 47.201082 | 29.25 | 2.888280 |
| 4 | Andorra | AND | NaN | 3.197538 | 4421.224933 | 7.260281 | 47.123345 | NaN | 0.079547 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 211 | Kosovo | XKX | 2.14250 | 6.804620 | NaN | NaN | NaN | NaN | 1.813820 |
| 212 | Yemen, Rep. | YEM | 4.22575 | 36.819337 | 207.949700 | 1.417836 | 44.470076 | 399.75 | 26.246608 |
| 213 | South Africa | ZAF | 2.37525 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
| 214 | Zambia | ZMB | 5.39425 | 24.280990 | 185.556359 | 2.687290 | 49.934484 | 233.75 | 15.633220 |
| 215 | Zimbabwe | ZWE | 3.94300 | 15.495514 | 115.519881 | 2.695188 | 49.529875 | 398.00 | 15.420964 |
216 rows × 9 columns
# Show the rows where the country name starts with 'S'
gender_df[gender_df['country_name'].str.startswith('S')]
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 33 | Switzerland | CHE | 1.53000 | 676.642359 | 6335.436388 | 7.642569 | 48.556328 | 5.25 | 8.185870 |
| 58 | Spain | ESP | 1.30750 | 1299.724261 | 2963.832825 | 6.545739 | 48.722231 | 5.00 | 46.553128 |
| 103 | St. Kitts and Nevis | KNA | NaN | 0.832756 | 1212.879641 | 2.234503 | 49.805051 | NaN | 0.053722 |
| 110 | St. Lucia | LCA | 1.90200 | 1.362564 | 798.301449 | 3.780893 | 48.590773 | 49.00 | 0.176427 |
| 112 | Sri Lanka | LKA | 2.09775 | 76.808506 | 338.184994 | 1.762039 | 49.201950 | 31.25 | 20.790000 |
| 118 | St. Martin (French part) | MAF | 1.81250 | NaN | NaN | NaN | NaN | NaN | 0.031491 |
| 166 | Saudi Arabia | SAU | 2.79225 | 707.936120 | 2181.916849 | 3.114481 | 49.034225 | 12.25 | 30.728077 |
| 167 | Sudan | SDN | 4.38775 | 83.016732 | 272.975786 | 1.844650 | 46.560974 | 321.00 | 37.760931 |
| 168 | Senegal | SEN | 5.10400 | 14.539555 | 101.042953 | 2.278713 | 51.881078 | 331.00 | 14.551710 |
| 169 | Singapore | SGP | 1.24250 | 298.724394 | 3645.683877 | 1.763894 | NaN | 10.75 | 5.464722 |
| 170 | Solomon Islands | SLB | 3.99975 | 1.114535 | 111.679703 | 4.908624 | 47.983030 | 120.00 | 0.575490 |
| 171 | Sierra Leone | SLE | 4.69050 | 4.331604 | 207.625767 | 1.853112 | 50.098291 | 1435.00 | 7.080112 |
| 173 | San Marino | SMR | 1.26000 | NaN | 3437.298747 | 5.752693 | 45.616261 | NaN | 0.032607 |
| 174 | Somalia | SOM | 6.51450 | 5.785250 | NaN | NaN | NaN | 762.75 | 13.527075 |
| 175 | Serbia | SRB | 1.45000 | 41.075644 | 1298.626684 | 6.150194 | 48.625357 | 16.75 | 7.129316 |
| 176 | South Sudan | SSD | 5.06600 | 11.480939 | 58.565752 | 0.991932 | 40.879082 | 827.50 | 11.527917 |
| 177 | Sao Tome and Principe | STP | 4.60375 | 0.314540 | 304.237533 | 3.227125 | 48.664169 | 159.50 | 0.191333 |
| 178 | Suriname | SUR | 2.37325 | 4.773159 | 967.888627 | 3.119683 | 48.418349 | 157.50 | 0.547824 |
| 179 | Slovak Republic | SVK | 1.35500 | 93.894473 | 2107.917656 | 5.768967 | 48.521589 | 6.00 | 5.418425 |
| 180 | Slovenia | SVN | 1.57250 | 46.048863 | 2644.776298 | 6.698656 | 48.566330 | 8.75 | 2.061494 |
| 181 | Sweden | SWE | 1.89000 | 540.626904 | 5134.572113 | 10.010744 | 49.606578 | 4.00 | 9.703634 |
| 182 | Swaziland | SWZ | 3.30200 | 4.346817 | 570.468058 | 6.891502 | 47.524406 | 400.50 | 1.295364 |
| 183 | Sint Maarten (Dutch part) | SXM | NaN | NaN | NaN | NaN | 49.508551 | NaN | 0.037552 |
| 184 | Seychelles | SYC | 2.35000 | 1.366581 | 876.631453 | 3.413612 | 49.523733 | NaN | 0.091541 |
| 185 | Syrian Arab Republic | SYR | 2.96775 | NaN | 269.945739 | 1.507166 | 48.047394 | 62.00 | 19.319674 |
| 204 | St. Vincent and the Grenadines | VCT | 1.98600 | 0.730107 | 775.803386 | 4.365757 | 48.536415 | 45.75 | 0.109421 |
| 210 | Samoa | WSM | 4.11825 | 0.799887 | 366.353096 | 5.697059 | 48.350049 | 54.75 | 0.192225 |
| 213 | South Africa | ZAF | 2.37525 | 345.209888 | 1123.142656 | 4.241441 | 48.516298 | 143.75 | 54.177209 |
Let’s say we are interested in Russia, but do not know how the name of the country is formatted. We can use the str.contains() method to search for likely matches.
# Hmmm is Russia not in this data?
gender_df['country_name'].str.contains('Russia')
0 False
1 False
2 False
3 False
4 False
...
211 False
212 False
213 False
214 False
215 False
Name: country_name, Length: 216, dtype: bool
That output is pretty opaque; maybe there is a True in there somewhere. Because Python treats True values as being equal to 1, we can chain on the .sum() method to count the number of True values in the above Boolean Series:
# Count the Trues for country names containing "Russia" in `gender_df`
gender_df['country_name'].str.contains('Russia').sum()
np.int64(1)
It appears we do have one match. Let’s use the Boolean Series we just made to have a look at the row that contains the string “Russia” in the country_name column:
# Use the `str.contains()` method to filter the data
gender_df[gender_df['country_name'].str.contains('Russia')]
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 164 | Russian Federation | RUS | 1.7245 | 1822.6917 | 1755.506635 | 3.731354 | 48.96807 | 25.25 | 143.793504 |
So, we have found the row for Russia in this new dataset. Let’s compare the naming convention to the HDI data, in the df Data Frame:
# Get the data for Russia, from the HDI data
df.loc['RUS']
Code RUS
Human Development Index 0.733
Fertility Rate 1.19
Population 146.7177
Country Name Russia
Name: RUS, dtype: object
In due course, we may want to merge these datasets. To do that, we need common identifiers linking rows in each dataset which refer to the same observational units (in this case countries).
String methods are our friend here. We can use the process just outlined to find data for a specific country, and then use other methods to ensure uniform formatting between the datasets, such that we can merge them:
# Format the maternal mortality data for Russia to use the same country name as the HDI data
gender_df['country_name'] = gender_df['country_name'].str.replace('Russian Federation', 'Russia')
# Show the newly formatted row
gender_df[gender_df['country_name'].str.contains('Russia')]
| | country_name | country_code | fert_rate | gdp_us_billion | health_exp_per_cap | health_exp_pub | prim_ed_girls | mat_mort_ratio | population |
|---|---|---|---|---|---|---|---|---|---|
| 164 | Russia | RUS | 1.7245 | 1822.6917 | 1755.506635 | 3.731354 | 48.96807 | 25.25 | 143.793504 |
We are now ready for a clean and stress-free data merge! (NB: we are grossly exaggerating here, merging datasets is almost never stress-free…)
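To show why uniform names matter for merging, here is a minimal sketch. It assumes two tiny Data Frames (`hdi_small` and `gender_small`, hypothetical names) holding just the Russia and Spain values from the tables above, and uses the Pandas .merge() method with its default inner join. Before the names are made uniform, only Spain matches. After, both rows match:

```python
import pandas as pd

hdi_small = pd.DataFrame({
    'Country Name': ['Russia', 'Spain'],
    'Human Development Index': [0.733, 0.828]})
gender_small = pd.DataFrame({
    'country_name': ['Russian Federation', 'Spain'],
    'fert_rate': [1.7245, 1.3075]})

# Merge on the name columns before cleaning: Russia has no match.
merged_before = hdi_small.merge(gender_small,
                                left_on='Country Name',
                                right_on='country_name')
print(len(merged_before))

# Make the names uniform, then merge again.
gender_small['country_name'] = gender_small['country_name'].str.replace(
    'Russian Federation', 'Russia')
merged_after = hdi_small.merge(gender_small,
                               left_on='Country Name',
                               right_on='country_name')
print(merged_after)
```

In practice, where both datasets carry a standard identifier such as the three-letter country code, merging on that identifier is usually less error-prone than merging on names.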
Exercise 15
The gender_df['country_name'] Series contains a lot of formatting that is nice to read, but annoying
to use in indexing operations (or any time where we need to type them).
Entries like 'Virgin Islands (U.S.)' and 'St. Martin (French part)' will be a pain to type if we need to use them in .loc indexing operations, for instance.
We would therefore like to create a new Series containing versions of these names that are easier to type.
That is what we have done with some hidden code. The hidden code:
- Processes the gender_df['country_name'] Series to make a new Series where we have replaced the original names (above) with versions of these names that are easier to type.
- Takes this new Series, and runs sorted(new_series.unique()) to show you the new names.
Have a careful look at the resulting list below - and work out which Pandas string methods have been used to get from the gender_df['country_name'] Series to the new Series, to which we have applied sorted(new_series.unique())
# This is the answer - don't use it in your solution.
sorted(answer_clean_series.unique())
['afghanistan',
'albania',
'algeria',
'american_samoa',
'andorra',
'angola',
'antigua_and_barbuda',
'argentina',
'armenia',
'aruba',
'australia',
'austria',
'azerbaijan',
'bahamas_the',
'bahrain',
'bangladesh',
'barbados',
'belarus',
'belgium',
'belize',
'benin',
'bermuda',
'bhutan',
'bolivia',
'bosnia_and_herzegovina',
'botswana',
'brazil',
'british_virgin_islands',
'brunei_darussalam',
'bulgaria',
'burkina_faso',
'burundi',
'cabo_verde',
'cambodia',
'cameroon',
'canada',
'cayman_islands',
'central_african_republic',
'chad',
'chile',
'china',
'colombia',
'comoros',
'congo_dem_rep',
'congo_rep',
'costa_rica',
"cote_d'ivoire",
'croatia',
'cuba',
'curacao',
'cyprus',
'czech_republic',
'denmark',
'djibouti',
'dominica',
'dominican_republic',
'ecuador',
'egypt_arab_rep',
'el_salvador',
'equatorial_guinea',
'eritrea',
'estonia',
'ethiopia',
'faroe_islands',
'fiji',
'finland',
'france',
'french_polynesia',
'gabon',
'gambia_the',
'georgia',
'germany',
'ghana',
'gibraltar',
'greece',
'greenland',
'grenada',
'guam',
'guatemala',
'guinea',
'guinea_bissau',
'guyana',
'haiti',
'honduras',
'hong_kong_sar_china',
'hungary',
'iceland',
'india',
'indonesia',
'iran_islamic_rep',
'iraq',
'ireland',
'isle_of_man',
'israel',
'italy',
'jamaica',
'japan',
'jordan',
'kazakhstan',
'kenya',
'kiribati',
"korea_dem_people's_rep",
'korea_rep',
'kosovo',
'kuwait',
'kyrgyz_republic',
'lao_pdr',
'latvia',
'lebanon',
'lesotho',
'liberia',
'libya',
'liechtenstein',
'lithuania',
'luxembourg',
'macao_sar_china',
'macedonia_fyr',
'madagascar',
'malawi',
'malaysia',
'maldives',
'mali',
'malta',
'marshall_islands',
'mauritania',
'mauritius',
'mexico',
'micronesia_fed_sts',
'moldova',
'monaco',
'mongolia',
'montenegro',
'morocco',
'mozambique',
'myanmar',
'namibia',
'nauru',
'nepal',
'netherlands',
'new_caledonia',
'new_zealand',
'nicaragua',
'niger',
'nigeria',
'northern_mariana_islands',
'norway',
'oman',
'pakistan',
'palau',
'panama',
'papua_new_guinea',
'paraguay',
'peru',
'philippines',
'poland',
'portugal',
'puerto_rico',
'qatar',
'romania',
'russian_federation',
'rwanda',
'samoa',
'san_marino',
'sao_tome_and_principe',
'saudi_arabia',
'senegal',
'serbia',
'seychelles',
'sierra_leone',
'singapore',
'sint_maarten_dutch_part',
'slovak_republic',
'slovenia',
'solomon_islands',
'somalia',
'south_africa',
'south_sudan',
'spain',
'sri_lanka',
'st_kitts_and_nevis',
'st_lucia',
'st_martin_french_part',
'st_vincent_and_the_grenadines',
'sudan',
'suriname',
'swaziland',
'sweden',
'switzerland',
'syrian_arab_republic',
'tajikistan',
'tanzania',
'thailand',
'timor_leste',
'togo',
'tonga',
'trinidad_and_tobago',
'tunisia',
'turkey',
'turkmenistan',
'turks_and_caicos_islands',
'tuvalu',
'uganda',
'ukraine',
'united_arab_emirates',
'united_kingdom',
'united_states',
'uruguay',
'uzbekistan',
'vanuatu',
'venezuela_rb',
'vietnam',
'virgin_islands_us',
'west_bank_and_gaza',
'yemen_rep',
'zambia',
'zimbabwe']
Your task now is to make a Series called my_clean_series which gives (with sorted(my_clean_series.unique())) a list that is identical to the list shown above.
You can perform the relevant string transformations using Pandas string methods on the gender_df['country_name'] Series, and then run sorted(my_clean_series.unique()) to get the final list.
There is a cell at the end of the exercise to check your answer.
Try to do the string transformation in as few lines of code as possible and using ONLY Pandas string methods.
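Hint: There are many ways to do this, but for maximum beauty, you might consider having a look at Python’s str.maketrans function, which builds a translation table that you can pass to the Pandas .str.translate method. And yes, you can use str.maketrans even though it is not itself a Pandas string method. Or you can use some other algorithm of your choice.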
# Your code here to create a new Pandas Series with modified
# country names, as above.
my_clean_series = pd.Series() # Edit here to solve the problem.
# ...
# But don't modify the code below.
my_clean_names = sorted(my_clean_series.unique())
my_clean_names
[]
# Run this cell to check your answer.
# It will return 'Success' if your cleaning worked correctly.
def check_names(proposed_solution):
    """ Check resulting names from processed Series `proposed_solution`
    """
    answer_arr = np.array(sorted(answer_clean_series.unique()))
    solution_arr = np.array(sorted(proposed_solution.unique()))
    if len(answer_arr) != len(solution_arr):
        return 'The answer and solution names are of different lengths'
    not_matching = answer_arr != solution_arr
    if np.any(not_matching):
        print('My solution unmatched', solution_arr[not_matching])
        print('Desired unmatched', answer_arr[not_matching])
        return 'Remaining unmatched values'
    return 'Success'
check_names(my_clean_series)
'The answer and solution names are of different lengths'
Solution to Exercise 15
Our solution is below. We first use the .str.lower() method to remove the capitalization.
We then use the hint above to make a translation table that maps one set of characters to another set of characters and also gives a set of characters to delete, and apply this translation table with the Pandas .str.translate method.
To make the result identical to the one shown above (and the one used for marking this exercise), we then rename russia back to russian_federation, undoing the replacement we made earlier on this page, before checking the unique values:
soln_clean_series = (gender_df['country_name']
.str.lower()
.str.translate(str.maketrans(' -', '__', '().,'))
.str.replace('russia', 'russian_federation'))
check_names(soln_clean_series)
'Success'
We can do the same processing with .str.replace at the expense of greater verbosity:
# Using `.str.replace`
soln2_clean_series = (gender_df['country_name']
.str.lower()
.str.replace(' ', '_')
.str.replace('-', '_')
.str.replace('(', '')
.str.replace(')', '')
.str.replace('.', '')
.str.replace(',', '')
.str.replace('russia', 'russian_federation'))
check_names(soln2_clean_series)
'Success'
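To see what the translation table is doing, it can help to try it on a few plain Python strings first. The sketch below applies the same str.maketrans table with the standard str.translate method. 'St. Martin (French part)' and 'Yemen, Rep.' appear in the data above. 'Guinea-Bissau' is our assumption about the original spelling behind 'guinea_bissau':

```python
# First argument: characters to replace. Second: their replacements
# (same length as the first). Third: characters to delete.
table = str.maketrans(' -', '__', '().,')

example_names = ['St. Martin (French part)',
                 'Guinea-Bissau',  # Assumed original spelling.
                 'Yemen, Rep.']

# Lowercase each name, then apply the translation table.
cleaned = [name.lower().translate(table) for name in example_names]
print(cleaned)
```

The Pandas .str.translate method accepts the same kind of table, which is why the first solution above can do all the character replacements and deletions in one step.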
Summary#
This page looked at string methods in base Python, Numpy and Pandas.
Numpy and Pandas inherit their string methods from base Python, but apply them in different ways.
Numpy arrays do not have methods for applying string operations to every element of an array simultaneously. We need functions from the np.char module if we want this.
By contrast, Pandas Series - whether in isolation or as columns in a Data Frame - have the .str. accessor for easily performing string operations on every element in a Series.