Working with strings

String/text data is often used to represent categorical variables, and commonly appears in a variety of data analysis contexts. When dealing with string/text data we will frequently find that we need to alter the strings to correct errors, improve clarity, make formatting uniform, or for a host of other reasons.

String methods are inherent to Python, and these methods or variants of them can all be used on Numpy arrays and Pandas Series/Data Frames. However, Numpy and Pandas use different interfaces for interacting with strings. To understand the differences between Numpy and Pandas with respect to strings, let’s begin at the foundation, the in-built string methods of Python. The cell below contains a simple string.

# A string
my_string = "a few words"
my_string

'a few words'

Here are the public methods and attributes of a standard Python str (those whose names do not start with an underscore):

# Attributes and methods not starting with `_` (or `__`):
[k for k in dir(my_string) if not k.startswith('_')]

['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Remember that a method is a function attached to an object. In this case our object is a string.
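To make the idea of "a function attached to an object" concrete, here is a minimal sketch. The variable names `via_method` and `via_function` are ours, for illustration only. Calling a method on a string gives the same result as calling the function stored on the `str` type and passing the string as its first argument:

```python
my_string = "a few words"

# Calling the method on the string object...
via_method = my_string.upper()

# ...gives the same result as calling the function stored on the
# `str` type, with the string passed in as the first argument.
via_function = str.upper(my_string)

print(via_method)
print(via_function)
```

Both calls should display the same uppercase string.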

Let’s say we like reading values in our data as if they are being spoken in a loud voice. If this is the case we can alter the format of the string to make all letters uppercase, using the .upper() method:

# The `.upper()` method of `str`.
my_string.upper()

'A FEW WORDS'

We can replace characters in the string using the aptly named .replace() method. Here we supply two strings to the method: first the string we want to replace, and second the string we want to replace it with. In this case, let’s .replace() the blank spaces with underscores:

# The `.replace()` method.
my_string.replace(' ', '_')

'a_few_words'

Fancier formatting methods will let us adjust strings in other ways, for instance, by converting them to title case (.title()):

# A more elaborate string method.
my_string.title()

'A Few Words'
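Python strings are immutable, so each of the methods above returns a new string rather than changing `my_string`. The short sketch below (the name `shouted` is ours, for illustration) chains two methods and then checks that the original string is untouched:

```python
my_string = "a few words"

# Each method returns a new string, so calls can be chained.
shouted = my_string.upper().replace(' ', '_')

print(shouted)    # the transformed copy
print(my_string)  # the original is unchanged
```

To keep a result, we need to assign it to a name, as we did with `shouted` here.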

In Python, strings are collections of characters, and so we can slice them as we would a list or array, using integer indexes and the : symbol:

# Slicing with strings.
my_string[0:1]

'a'

my_string[0:2]

'a '

my_string[0:7]

'a few w'
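As with lists and arrays, string slices also accept negative indexes and a step. A brief sketch, using variable names of our own choosing:

```python
my_string = "a few words"

# Characters at positions 2, 3 and 4.
middle_word = my_string[2:5]

# Negative indexes count back from the end of the string.
last_five = my_string[-5:]

# A step of -1 walks through the string backwards.
backwards = my_string[::-1]

print(middle_word)
print(last_five)
print(backwards)
```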

You can visit this page to see the variety of string methods available in base Python.

String methods with Numpy arrays#

So, strings in base Python have a large number of in-built methods - what about strings in Numpy?

Numpy arrays themselves do not have specific string methods, but the in-built Python string methods can be called on individual string values in a Numpy array. Alternatively, we can use functions from the np.char. module to operate on all the strings in the array in one go.

To investigate how string data is handled in Numpy, let’s make some arrays containing strings from the now (very) familiar HDI dataset.

# Import libraries (none were needed before this point, as string methods are part of base Python)
import numpy as np
import pandas as pd

# A custom function to generate a Series to check exercise solutions.
import clean_gender_df_names

# Calculate answer to exercise (see below).
answer_clean_series = clean_gender_df_names.get_cleaned(
    pd.read_csv("data/gender_stats.csv")['country_name'])

# Standard three-letter code for each country.
country_codes_array = np.array(['AUS', 'BRA', 'CAN',
                                'CHN', 'DEU', 'ESP',
                                'FRA', 'GBR', 'IND',
                                'ITA', 'JPN', 'KOR',
                                'MEX', 'RUS', 'USA'])

# Country names.
country_names_array = np.array(['Australia', 'Brazil', 'Canada',
                                'China', 'Germany', 'Spain',
                                'France', 'United Kingdom', 'India',
                                'Italy', 'Japan', 'South Korea',
                                'Mexico', 'Russia', 'United States'])

For comparison, let’s make an array containing the numerical HDI scores:

# Human Development Index scores, in the same order as the country codes
hdis_array = np.array([0.896, 0.668, 0.890, 0.586,
                       0.890, 0.828, 0.844, 0.863,
                       0.490, 0.842, 0.883, 0.824,
                       0.709, 0.733, 0.894])

The dtype attribute of the first two arrays begins with <U, indicating we are dealing with string data.

# Show the dtype of the country codes array (i.e. string data)
country_codes_array.dtype

dtype('<U3')

U3 tells us that the array stores Unicode (U) strings of up to three Unicode characters in length.
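One practical consequence of this fixed-width dtype is that Numpy silently truncates longer strings assigned into the array. The sketch below shows this on a small stand-in array (`codes` and `wider` are illustrative names, not part of the HDI data):

```python
import numpy as np

# A small array of three-character strings.
codes = np.array(['AUS', 'BRA', 'CAN'])
print(codes.dtype)

# Assigning a longer string keeps only the first three characters.
codes[0] = 'AUSTRALIA'
print(codes[0])

# Converting to a wider string dtype first makes room for longer values.
wider = codes.astype('U9')
wider[0] = 'AUSTRALIA'
print(wider[0])
```

Truncation happens without any warning, so it is worth checking the dtype before assigning new string values into an existing array.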

# Show the dtype of the country names array (i.e. string data)
country_names_array.dtype

dtype('<U14')

Conversely, the hdis_array contains data of a numerical type:

# Show the dtype of the hdis array (i.e. numeric data)
hdis_array.dtype

dtype('float64')

Using indexing, we can use all of the in-built Python string methods on the individual values within a Numpy array:

# Methods on an individual string
country_codes_array[0]

np.str_('AUS')

For instance, we can change the case of the value:

# Lowercase
country_codes_array[0].lower()

'aus'

# Uppercase
country_codes_array[0].upper()

'AUS'

We can also replace elements of the string:

country_codes_array[0].replace("A", "Comparable to the ")

'Comparable to the US'

Understandably, if we try to use any of these string methods on numerical data, we will get an error:

# Oh no!
hdis_array[0].upper()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[22], line 2
      1 # Oh no!
----> 2 hdis_array[0].upper()

AttributeError: 'numpy.float64' object has no attribute 'upper'

All of the string methods used in this section above have been called on single string values from a Numpy array. If we try to use a string method on all values of the array simultaneously, we will also get an error:

# This does not work
country_codes_array.lower()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[23], line 2
      1 # This does not work
----> 2 country_codes_array.lower()

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

In Numpy, string methods must either be called on single string values, or applied to the whole array using functions from the np.char module.

For example, we can use the np.char.lower() function to operate on all values of the Numpy array at once:

# This DOES work
np.char.lower(country_codes_array)

array(['aus', 'bra', 'can', 'chn', 'deu', 'esp', 'fra', 'gbr', 'ind',
       'ita', 'jpn', 'kor', 'mex', 'rus', 'usa'], dtype='<U3')

# This DOES work too
np.char.replace(country_codes_array, 'A', '!')

array(['!US', 'BR!', 'C!N', 'CHN', 'DEU', 'ESP', 'FR!', 'GBR', 'IND',
       'IT!', 'JPN', 'KOR', 'MEX', 'RUS', 'US!'], dtype='<U3')
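To see what the np.char functions are doing, it can help to compare one with an explicit loop over the elements. This is a sketch on a small stand-in array; the names `codes`, `with_np_char`, `with_loop` and `starts_with_c` are ours:

```python
import numpy as np

codes = np.array(['AUS', 'BRA', 'CAN'])

# Apply the string method to every element with `np.char`...
with_np_char = np.char.lower(codes)

# ...or call the Python string method on each element in turn.
with_loop = np.array([code.lower() for code in codes])

# Both approaches give the same array.
print(np.array_equal(with_np_char, with_loop))

# Some `np.char` functions return Boolean arrays, which we can use
# for Boolean indexing.
starts_with_c = np.char.startswith(codes, 'C')
print(codes[starts_with_c])
```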

Pandas deals with string data slightly differently to Numpy. The elements of the .values component of a Pandas Series can be operated on all at once by using the .str. accessor, to which we will now turn our attention.

String methods with Pandas Series#

As mentioned above, Pandas Series have a specialised accessor (.str.) which bypasses the need to use np.char. when we want to do something to all of the string values in a Series.

To see how this works, let’s construct a Series from our country_names_array:

# Build a Series of country names, indexed by country code
names_series = pd.Series(country_names_array,
                         index=country_codes_array)
names_series

AUS         Australia
BRA            Brazil
CAN            Canada
CHN             China
DEU           Germany
ESP             Spain
FRA            France
GBR    United Kingdom
IND             India
ITA             Italy
JPN             Japan
KOR       South Korea
MEX            Mexico
RUS            Russia
USA     United States
dtype: str

To use the .str. accessor, we just place it after our object (e.g. our Series containing our string data). We can then call a variety of string methods, which will be applied to all elements in the .values array of the Series:

# String methods on Series
names_series.str.upper()

AUS         AUSTRALIA
BRA            BRAZIL
CAN            CANADA
CHN             CHINA
DEU           GERMANY
ESP             SPAIN
FRA            FRANCE
GBR    UNITED KINGDOM
IND             INDIA
ITA             ITALY
JPN             JAPAN
KOR       SOUTH KOREA
MEX            MEXICO
RUS            RUSSIA
USA     UNITED STATES
dtype: str

# The `.str.lower()` method
names_series.str.lower()

AUS         australia
BRA            brazil
CAN            canada
CHN             china
DEU           germany
ESP             spain
FRA            france
GBR    united kingdom
IND             india
ITA             italy
JPN             japan
KOR       south korea
MEX            mexico
RUS            russia
USA     united states
dtype: str

The .replace() string method is also available here. It will operate on all the elements in the Series, though in this case (as there is only one United States) it will only alter one value:

# Replacing values in the Series
names_series.str.replace("United States", "USA")

AUS         Australia
BRA            Brazil
CAN            Canada
CHN             China
DEU           Germany
ESP             Spain
FRA            France
GBR    United Kingdom
IND             India
ITA             Italy
JPN             Japan
KOR       South Korea
MEX            Mexico
RUS            Russia
USA               USA
dtype: str

By default, the string-specific .replace method (accessed through the .str accessor, as .str.replace()) will search for substrings within each string, as opposed to searching for exact, whole-string matches. This matches the behavior of the .replace method of Python strings.

This is different to the behaviour of the non-string-specific .replace() method which we encountered on an earlier page. By default the non-string-specific .replace() method will search for exactly matching whole strings. Confusion between these two methods is a common source of error.
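The difference between the two methods is easiest to see side by side. The sketch below uses a small stand-in Series (`names` and the result names are ours). We pass `regex=False` to `.str.replace()` explicitly so that the search string is treated as literal text, whatever the default is in your Pandas version:

```python
import pandas as pd

names = pd.Series(['United Kingdom', 'United States', 'India'])

# Non-string-specific `.replace()`: looks for whole values equal to
# 'United'. No value matches exactly, so nothing changes.
whole_value = names.replace('United', 'Disunited')
print(whole_value.tolist())

# Whole-value replacement does work when the entire string matches.
whole_match = names.replace('India', 'IND')
print(whole_match.tolist())

# String-specific `.str.replace()`: looks for 'United' inside each value.
substring = names.str.replace('United', 'Disunited', regex=False)
print(substring.tolist())
```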

As such, by using the string-specific .str.replace() method we can easily replace substrings in multiple elements in the data at once, even where the whole strings are not the same. For instance:

# Replacing values in the Series
names_series.str.replace("United", "Disunited")

AUS            Australia
BRA               Brazil
CAN               Canada
CHN                China
DEU              Germany
ESP                Spain
FRA               France
GBR    Disunited Kingdom
IND                India
ITA                Italy
JPN                Japan
KOR          South Korea
MEX               Mexico
RUS               Russia
USA     Disunited States
dtype: str

Here, the substring 'United' has been replaced with 'Disunited' in both of the strings "United Kingdom" and "United States". This is a good example of a situation where a specialized accessor (like .str.) can change the behavior of other methods, often in a helpful way for dealing with a specific data type.

The syntax for slicing strings is the same as for a single value, but it also operates across all elements in the Series at once:

# Slicing with strings, in a Series
names_series.str[2:4]

AUS    st
BRA    az
CAN    na
CHN    in
DEU    rm
ESP    ai
FRA    an
GBR    it
IND    di
ITA    al
JPN    pa
KOR    ut
MEX    xi
RUS    ss
USA    it
dtype: str

Using the .str.contains() method, we can generate a Boolean Series by searching for a substring in each value:

# Generate a Boolean Series, True where the value contains "Ind"
names_series.str.contains("Ind")

AUS    False
BRA    False
CAN    False
CHN    False
DEU    False
ESP    False
FRA    False
GBR    False
IND     True
ITA    False
JPN    False
KOR    False
MEX    False
RUS    False
USA    False
dtype: bool

These Boolean Series can be used to retrieve specific values from the original Series, via Boolean filtering:

# Use Boolean filtering to retrieve a specific datapoint
names_series[names_series.str.contains("Ind")]

IND    India
dtype: str
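The same pattern works with other `.str` methods that return Boolean Series. As a brief sketch on a small stand-in Series (`names` and the mask names are ours), `.str.contains()` accepts a `case` argument for case-insensitive searches, and `.str.startswith()` tests only the beginning of each value:

```python
import pandas as pd

names = pd.Series(['India', 'Italy', 'United Kingdom', 'United States'],
                  index=['IND', 'ITA', 'GBR', 'USA'])

# Case-insensitive search for a substring.
has_united = names.str.contains('united', case=False)

# True where the value begins with 'I'.
starts_with_i = names.str.startswith('I')

# Use either Boolean Series for Boolean filtering.
print(names[has_united])
print(names[starts_with_i])
```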

String methods with Pandas DataFrames#

So, Pandas makes it somewhat easier than Numpy to perform operations on all the string elements at once.

Remember that a Data Frame is a dictionary-like collection of Series, and so everything we have just seen of strings in Pandas Series applies to the columns of a Data Frame.

Let’s import the HDI data in a Pandas Data Frame:

# Import data
df = pd.read_csv("data/year_2000_hdi_fert.csv")
# Show the data
df

	Code	Human Development Index	Fertility Rate	Population	Country Name
0	AUS	0.896	1.764	19.1324	Australia
1	BRA	0.668	2.247	174.0182	Brazil
2	CAN	0.890	1.510	30.8918	Canada
3	CHN	0.586	1.628	1269.5811	China
4	DEU	0.890	1.386	81.7972	Germany
5	ESP	0.828	1.210	41.0197	Spain
6	FRA	0.844	1.876	59.4837	France
7	GBR	0.863	1.641	59.0573	United Kingdom
8	IND	0.490	3.350	1057.9227	India
9	ITA	0.842	1.249	57.2722	Italy
10	JPN	0.883	1.346	127.0278	Japan
11	KOR	0.824	1.467	46.7666	South Korea
12	MEX	0.709	2.714	98.6255	Mexico
13	RUS	0.733	1.190	146.7177	Russia
14	USA	0.894	2.030	281.4841	United States

Because each Data Frame column can be extracted as a Pandas Series, we can use string methods in the same way as we saw in the last section:

# Use the `.str.replace()` method
df['Country Name'].str.replace('I', 'I starts I')

0          Australia
1             Brazil
2             Canada
3              China
4            Germany
5              Spain
6             France
7     United Kingdom
8     I starts India
9     I starts Italy
10             Japan
11       South Korea
12            Mexico
13            Russia
14     United States
Name: Country Name, dtype: str

However, Pandas will not let us use string methods on the whole Data Frame at once:

# Cannot use `.str` methods on whole Data Frame.
df.str.upper()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2810/1916934331.py in ?()
      1 # Cannot use `.str` methods on whole Data Frame.
----> 2 df.str.upper()

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6202             and name not in self._accessors
   6203             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6204         ):
   6205             return self[name]
-> 6206         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'str'

We might think that this is because not all the data in the Data Frame is of the string type, but this is not the case. We also cannot use Pandas string methods on Data Frames with columns only containing string data, as we get the same error:

# Oops, using `.str` fails even for all-string-dtype columns.
df[['Country Name', 'Code']].str.len()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2810/3866909753.py in ?()
      1 # Oops, using `.str` fails even for all-string-dtype columns.
----> 2 df[['Country Name', 'Code']].str.len()

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6202             and name not in self._accessors
   6203             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6204         ):
   6205             return self[name]
-> 6206         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'str'

Notice that we did not set Code as the index of the Data Frame here; we are just treating it as a column containing string data. The cell below sets it as the index, so we can use label-based indexing later in this tutorial:

# Set the index
df.index = df['Code']

So, we cannot apply string methods to multiple columns at once, but if we focus on one column, we can use all of the available string methods:

# The `.lower()` method
df['Country Name'].str.lower()

Code
AUS         australia
BRA            brazil
CAN            canada
CHN             china
DEU           germany
ESP             spain
FRA            france
GBR    united kingdom
IND             india
ITA             italy
JPN             japan
KOR       south korea
MEX            mexico
RUS            russia
USA     united states
Name: Country Name, dtype: str

# The `.upper()` method
df['Country Name'].str.upper()

Code
AUS         AUSTRALIA
BRA            BRAZIL
CAN            CANADA
CHN             CHINA
DEU           GERMANY
ESP             SPAIN
FRA            FRANCE
GBR    UNITED KINGDOM
IND             INDIA
ITA             ITALY
JPN             JAPAN
KOR       SOUTH KOREA
MEX            MEXICO
RUS            RUSSIA
USA     UNITED STATES
Name: Country Name, dtype: str

# Using the `str.count()` method
df['Country Name'].str.count('a') 

Code
AUS    2
BRA    1
CAN    3
CHN    1
DEU    1
ESP    1
FRA    1
GBR    0
IND    1
ITA    1
JPN    2
KOR    1
MEX    0
RUS    1
USA    1
Name: Country Name, dtype: int64

# The `str.contains()` method
df['Country Name'].str.contains('Russia')

Code
AUS    False
BRA    False
CAN    False
CHN    False
DEU    False
ESP    False
FRA    False
GBR    False
IND    False
ITA    False
JPN    False
KOR    False
MEX    False
RUS     True
USA    False
Name: Country Name, dtype: bool

# See if there are any Trues in the Series
df['Country Name'].str.contains('Russia').sum()

np.int64(1)

# Filtering data using the `str.contains()` method
df[df['Country Name'].str.contains('Russia')]

	Code	Human Development Index	Fertility Rate	Population	Country Name
Code
RUS	RUS	0.733	1.19	146.7177	Russia
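Although a Data Frame has no .str accessor, each of its columns is a Series, so one possible way to transform several string columns is to apply the same `.str` method to each column in turn. Below is a minimal sketch using the Data Frame `.apply()` method on a two-row stand-in for the HDI data (`small_df`, `text_columns` and `lowered` are illustrative names):

```python
import pandas as pd

small_df = pd.DataFrame({
    'Code': ['AUS', 'BRA'],
    'Country Name': ['Australia', 'Brazil'],
    'Human Development Index': [0.896, 0.668],
})

# The columns that hold string data.
text_columns = ['Code', 'Country Name']

# `.apply()` passes each selected column to the function as a Series,
# so the `.str` accessor is available inside the function.
lowered = small_df[text_columns].apply(lambda column: column.str.lower())
print(lowered)
```

This only makes sense for columns that contain strings; calling `.str` on a numeric column such as `Human Development Index` would raise an error.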

Uses of string methods in data wrangling#

As we mentioned earlier, string methods are generally useful for cleaning text data. This can be especially useful when combining data from different sources, where different conventions in data entry may lead to similar data being formatted differently.

To explore this, let’s import a new dataset which, like the HDI data, contains observations at the country level (i.e. each row is an observation from a specific country).

The dataset contains various data about countries, including maternal mortality rates. You can read more about the dataset here.

# Import gender statistics dataset
gender_df = pd.read_csv("data/gender_stats.csv")
gender_df

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
0	Aruba	ABW	1.66325	NaN	NaN	NaN	48.721939	NaN	0.103744
1	Afghanistan	AFG	4.95450	19.961015	161.138034	2.834598	40.109708	444.00	32.715838
2	Angola	AGO	6.12300	111.936542	254.747970	2.447546	NaN	501.25	26.937545
3	Albania	ALB	1.76925	12.327586	574.202694	2.836021	47.201082	29.25	2.888280
4	Andorra	AND	NaN	3.197538	4421.224933	7.260281	47.123345	NaN	0.079547
...	...	...	...	...	...	...	...	...	...
211	Kosovo	XKX	2.14250	6.804620	NaN	NaN	NaN	NaN	1.813820
212	Yemen, Rep.	YEM	4.22575	36.819337	207.949700	1.417836	44.470076	399.75	26.246608
213	South Africa	ZAF	2.37525	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209
214	Zambia	ZMB	5.39425	24.280990	185.556359	2.687290	49.934484	233.75	15.633220
215	Zimbabwe	ZWE	3.94300	15.495514	115.519881	2.695188	49.529875	398.00	15.420964

216 rows × 9 columns

# Show the rows where the country name starts with 'S'
gender_df[gender_df['country_name'].str.startswith('S')]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
33	Switzerland	CHE	1.53000	676.642359	6335.436388	7.642569	48.556328	5.25	8.185870
58	Spain	ESP	1.30750	1299.724261	2963.832825	6.545739	48.722231	5.00	46.553128
103	St. Kitts and Nevis	KNA	NaN	0.832756	1212.879641	2.234503	49.805051	NaN	0.053722
110	St. Lucia	LCA	1.90200	1.362564	798.301449	3.780893	48.590773	49.00	0.176427
112	Sri Lanka	LKA	2.09775	76.808506	338.184994	1.762039	49.201950	31.25	20.790000
118	St. Martin (French part)	MAF	1.81250	NaN	NaN	NaN	NaN	NaN	0.031491
166	Saudi Arabia	SAU	2.79225	707.936120	2181.916849	3.114481	49.034225	12.25	30.728077
167	Sudan	SDN	4.38775	83.016732	272.975786	1.844650	46.560974	321.00	37.760931
168	Senegal	SEN	5.10400	14.539555	101.042953	2.278713	51.881078	331.00	14.551710
169	Singapore	SGP	1.24250	298.724394	3645.683877	1.763894	NaN	10.75	5.464722
170	Solomon Islands	SLB	3.99975	1.114535	111.679703	4.908624	47.983030	120.00	0.575490
171	Sierra Leone	SLE	4.69050	4.331604	207.625767	1.853112	50.098291	1435.00	7.080112
173	San Marino	SMR	1.26000	NaN	3437.298747	5.752693	45.616261	NaN	0.032607
174	Somalia	SOM	6.51450	5.785250	NaN	NaN	NaN	762.75	13.527075
175	Serbia	SRB	1.45000	41.075644	1298.626684	6.150194	48.625357	16.75	7.129316
176	South Sudan	SSD	5.06600	11.480939	58.565752	0.991932	40.879082	827.50	11.527917
177	Sao Tome and Principe	STP	4.60375	0.314540	304.237533	3.227125	48.664169	159.50	0.191333
178	Suriname	SUR	2.37325	4.773159	967.888627	3.119683	48.418349	157.50	0.547824
179	Slovak Republic	SVK	1.35500	93.894473	2107.917656	5.768967	48.521589	6.00	5.418425
180	Slovenia	SVN	1.57250	46.048863	2644.776298	6.698656	48.566330	8.75	2.061494
181	Sweden	SWE	1.89000	540.626904	5134.572113	10.010744	49.606578	4.00	9.703634
182	Swaziland	SWZ	3.30200	4.346817	570.468058	6.891502	47.524406	400.50	1.295364
183	Sint Maarten (Dutch part)	SXM	NaN	NaN	NaN	NaN	49.508551	NaN	0.037552
184	Seychelles	SYC	2.35000	1.366581	876.631453	3.413612	49.523733	NaN	0.091541
185	Syrian Arab Republic	SYR	2.96775	NaN	269.945739	1.507166	48.047394	62.00	19.319674
204	St. Vincent and the Grenadines	VCT	1.98600	0.730107	775.803386	4.365757	48.536415	45.75	0.109421
210	Samoa	WSM	4.11825	0.799887	366.353096	5.697059	48.350049	54.75	0.192225
213	South Africa	ZAF	2.37525	345.209888	1123.142656	4.241441	48.516298	143.75	54.177209

Let’s say we are interested in Russia, but do not know how the name of the country is formatted. We can use the str.contains() method to search for likely matches.

# Hmmm is Russia not in this data?
gender_df['country_name'].str.contains('Russia')

0      False
1      False
2      False
3      False
4      False
       ...
211    False
212    False
213    False
214    False
215    False
Name: country_name, Length: 216, dtype: bool

That output is pretty opaque; maybe there is a True in there somewhere. Because Python treats True values as being equal to 1, we can chain on the .sum() method to count the number of True values in the above Boolean Series:

# Count the Trues for country names containing "Russia"
gender_df['country_name'].str.contains('Russia').sum()

np.int64(1)
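The counting trick relies on True behaving like 1 and False like 0 in arithmetic. Here is a short sketch on a three-name stand-in Series (the variable names are ours). If we only want to know whether there is at least one match, the `.any()` method answers that directly:

```python
import pandas as pd

# True counts as 1 when we do arithmetic.
print(True + True)

names = pd.Series(['Romania', 'Russian Federation', 'Rwanda'])
matches = names.str.contains('Russia')

# Summing a Boolean Series counts the True values.
n_matches = matches.sum()
print(n_matches)

# `.any()` is True if at least one value is True.
any_match = matches.any()
print(any_match)
```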

It appears we do have one match. Let’s use the Boolean Series we just made to have a look at the row that contains the string “Russia” in the country_name column:

# Use the `str.contains()` method to filter the data
gender_df[gender_df['country_name'].str.contains('Russia')]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
164	Russian Federation	RUS	1.7245	1822.6917	1755.506635	3.731354	48.96807	25.25	143.793504

So, we have found the row for Russia in this new dataset. Let’s compare the naming convention to the HDI data, in the df Data Frame:

# Get the data for Russia, from the HDI data
df.loc['RUS']

Code                            RUS
Human Development Index       0.733
Fertility Rate                 1.19
Population                 146.7177
Country Name                 Russia
Name: RUS, dtype: object

In due course, we may want to merge these datasets. To do that, we need common identifiers linking rows in each dataset which refer to the same observational units (in this case countries).

String methods are our friend here. We can use the process just outlined to find data for a specific country, and then use other methods to ensure uniform formatting between the datasets, such that we can merge them:

# Format the maternal mortality data for Russia to use the same country name as the HDI data
gender_df['country_name'] = gender_df['country_name'].str.replace('Russian Federation', 'Russia')

# Show the newly formatted row
gender_df[gender_df['country_name'].str.contains('Russia')]

	country_name	country_code	fert_rate	gdp_us_billion	health_exp_per_cap	health_exp_pub	prim_ed_girls	mat_mort_ratio	population
164	Russia	RUS	1.7245	1822.6917	1755.506635	3.731354	48.96807	25.25	143.793504

We are now ready for a clean and stress-free data merge! (NB: we are grossly exaggerating here, merging datasets is almost never stress-free…)
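To illustrate why the matching names matter, here is a hedged sketch of a merge on two tiny stand-in Data Frames. The frames `hdi` and `gender` are ours, built from a few values copied from the tables above. We use the Data Frame `.merge()` method, matching the country-name columns:

```python
import pandas as pd

# Two-row stand-ins for the HDI and gender statistics data.
hdi = pd.DataFrame({'Country Name': ['Russia', 'Spain'],
                    'Human Development Index': [0.733, 0.828]})
gender = pd.DataFrame({'country_name': ['Russian Federation', 'Spain'],
                       'fert_rate': [1.7245, 1.3075]})

# Before cleaning, only Spain has the same name in both frames, so
# the default (inner) merge drops Russia.
before = hdi.merge(gender, left_on='Country Name', right_on='country_name')
print(len(before))

# After harmonizing the name, both countries match.
gender['country_name'] = gender['country_name'].str.replace(
    'Russian Federation', 'Russia', regex=False)
after = hdi.merge(gender, left_on='Country Name', right_on='country_name')
print(len(after))
```

In these particular datasets, the three-letter country codes would be another candidate for the linking column, as they appear to follow the same convention in both.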

Exercise 15

The gender_df['country_name'] Series contains a lot of formatting that is nice to read, but annoying to use in indexing operations (or any time where we need to type them).

Entries like 'Virgin Islands (U.S.)' and 'St. Martin (French part)' will be a pain to type if we need to use them in .loc indexing operations, for instance.

We would therefore like to create a new Series containing versions of these names that are easier to type.

That is what we have done with some hidden code. The hidden code:

Processes the gender_df['country_name'] Series to make a new Series, where we have replaced the original names (above) with versions of these names that are easier to type.
Takes this new Series, and runs sorted(new_series.unique()) to show you the new names.

Have a careful look at the resulting list below, and work out which Pandas string methods have been used to get from the gender_df['country_name'] Series to the new Series, to which we have applied sorted(new_series.unique()).

# This is the answer - don't use it in your solution.
sorted(answer_clean_series.unique())

['afghanistan',
 'albania',
 'algeria',
 'american_samoa',
 'andorra',
 'angola',
 'antigua_and_barbuda',
 'argentina',
 'armenia',
 'aruba',
 'australia',
 'austria',
 'azerbaijan',
 'bahamas_the',
 'bahrain',
 'bangladesh',
 'barbados',
 'belarus',
 'belgium',
 'belize',
 'benin',
 'bermuda',
 'bhutan',
 'bolivia',
 'bosnia_and_herzegovina',
 'botswana',
 'brazil',
 'british_virgin_islands',
 'brunei_darussalam',
 'bulgaria',
 'burkina_faso',
 'burundi',
 'cabo_verde',
 'cambodia',
 'cameroon',
 'canada',
 'cayman_islands',
 'central_african_republic',
 'chad',
 'chile',
 'china',
 'colombia',
 'comoros',
 'congo_dem_rep',
 'congo_rep',
 'costa_rica',
 "cote_d'ivoire",
 'croatia',
 'cuba',
 'curacao',
 'cyprus',
 'czech_republic',
 'denmark',
 'djibouti',
 'dominica',
 'dominican_republic',
 'ecuador',
 'egypt_arab_rep',
 'el_salvador',
 'equatorial_guinea',
 'eritrea',
 'estonia',
 'ethiopia',
 'faroe_islands',
 'fiji',
 'finland',
 'france',
 'french_polynesia',
 'gabon',
 'gambia_the',
 'georgia',
 'germany',
 'ghana',
 'gibraltar',
 'greece',
 'greenland',
 'grenada',
 'guam',
 'guatemala',
 'guinea',
 'guinea_bissau',
 'guyana',
 'haiti',
 'honduras',
 'hong_kong_sar_china',
 'hungary',
 'iceland',
 'india',
 'indonesia',
 'iran_islamic_rep',
 'iraq',
 'ireland',
 'isle_of_man',
 'israel',
 'italy',
 'jamaica',
 'japan',
 'jordan',
 'kazakhstan',
 'kenya',
 'kiribati',
 "korea_dem_people's_rep",
 'korea_rep',
 'kosovo',
 'kuwait',
 'kyrgyz_republic',
 'lao_pdr',
 'latvia',
 'lebanon',
 'lesotho',
 'liberia',
 'libya',
 'liechtenstein',
 'lithuania',
 'luxembourg',
 'macao_sar_china',
 'macedonia_fyr',
 'madagascar',
 'malawi',
 'malaysia',
 'maldives',
 'mali',
 'malta',
 'marshall_islands',
 'mauritania',
 'mauritius',
 'mexico',
 'micronesia_fed_sts',
 'moldova',
 'monaco',
 'mongolia',
 'montenegro',
 'morocco',
 'mozambique',
 'myanmar',
 'namibia',
 'nauru',
 'nepal',
 'netherlands',
 'new_caledonia',
 'new_zealand',
 'nicaragua',
 'niger',
 'nigeria',
 'northern_mariana_islands',
 'norway',
 'oman',
 'pakistan',
 'palau',
 'panama',
 'papua_new_guinea',
 'paraguay',
 'peru',
 'philippines',
 'poland',
 'portugal',
 'puerto_rico',
 'qatar',
 'romania',
 'russian_federation',
 'rwanda',
 'samoa',
 'san_marino',
 'sao_tome_and_principe',
 'saudi_arabia',
 'senegal',
 'serbia',
 'seychelles',
 'sierra_leone',
 'singapore',
 'sint_maarten_dutch_part',
 'slovak_republic',
 'slovenia',
 'solomon_islands',
 'somalia',
 'south_africa',
 'south_sudan',
 'spain',
 'sri_lanka',
 'st_kitts_and_nevis',
 'st_lucia',
 'st_martin_french_part',
 'st_vincent_and_the_grenadines',
 'sudan',
 'suriname',
 'swaziland',
 'sweden',
 'switzerland',
 'syrian_arab_republic',
 'tajikistan',
 'tanzania',
 'thailand',
 'timor_leste',
 'togo',
 'tonga',
 'trinidad_and_tobago',
 'tunisia',
 'turkey',
 'turkmenistan',
 'turks_and_caicos_islands',
 'tuvalu',
 'uganda',
 'ukraine',
 'united_arab_emirates',
 'united_kingdom',
 'united_states',
 'uruguay',
 'uzbekistan',
 'vanuatu',
 'venezuela_rb',
 'vietnam',
 'virgin_islands_us',
 'west_bank_and_gaza',
 'yemen_rep',
 'zambia',
 'zimbabwe']

Your task now is to make a Series called my_clean_series which gives (with sorted(my_clean_series.unique())) a list that is identical to the list shown above.

You can perform the relevant string transformations using Pandas string methods on the gender_df['country_name'] Series, and then run sorted(my_clean_series.unique()) to get the final list.

There is a cell at the end of the exercise to check your answer.

Try to do the string transformation in as few lines of code as possible and using ONLY Pandas string methods.

Hint: There are many ways to do this, but for maximum beauty, you might consider having a look at Python’s str.maketrans function. And yes, although str.maketrans is not itself a Pandas string method, you can use it as well. Or you can use some other algorithm of your choice.

# Your code here to create a new Pandas Series with modified
# country names, as above.
my_clean_series = pd.Series()  # Edit here to solve the problem.
# ...
# But don't modify the code below.
my_clean_names = sorted(my_clean_series.unique())
my_clean_names

[]

# Run this cell to check your answer.
# It will return 'Success' if your cleaning worked correctly.

def check_names(proposed_solution):
    """ Check resulting names from processed Series `proposed_solution`
    """
    answer_arr = np.array(sorted(answer_clean_series.unique()))
    solution_arr = np.array(sorted(proposed_solution.unique()))
    if len(answer_arr) != len(solution_arr):
        return 'The answer and solution names are of different lengths'
    not_matching = answer_arr != solution_arr
    if np.any(not_matching):
        print('My solution unmatched', solution_arr[not_matching])
        print('Desired unmatched', answer_arr[not_matching])
        return 'Remaining unmatched values'
    return 'Success'
    
check_names(my_clean_series)

'The answer and solution names are of different lengths'

Solution to Exercise 15

Our solution is below. We first use the .str.lower() method to remove the capitalization.

We then use the hint above to make a translation table with str.maketrans. The table maps one set of characters to another set of characters, and also lists a set of characters to delete. We apply this translation table with the Pandas .str.translate method.

To make the array identical to the one shown above (and the one used for marking this exercise), we then re-rename russia to russian_federation before using .unique() to show the final array:

soln_clean_series = (gender_df['country_name']
                     .str.lower()
                     .str.translate(str.maketrans(' -', '__', '().,'))
                     .str.replace('russia', 'russian_federation'))
check_names(soln_clean_series)

'Success'
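To see what the translation table does, it may help to try it on single Python strings first. This sketch uses the same three arguments to `str.maketrans` as the solution above, and two names from the dataset (`table`, `cleaned_martin` and `cleaned_yemen` are our names):

```python
# Map space and hyphen to underscore; delete brackets, full stops and commas.
table = str.maketrans(' -', '__', '().,')

# Lowercase, then translate, as in the solution.
cleaned_martin = 'St. Martin (French part)'.lower().translate(table)
cleaned_yemen = 'Yemen, Rep.'.lower().translate(table)

print(cleaned_martin)
print(cleaned_yemen)
```

The first two arguments must be the same length, because each character in the first string is mapped to the character at the same position in the second.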

We can do the same processing with .str.replace at the expense of greater verbosity:

# Using `.str.replace`
soln2_clean_series = (gender_df['country_name']
                      .str.lower()
                      .str.replace(' ', '_')
                      .str.replace('-', '_')
                      .str.replace('(', '')
                      .str.replace(')', '')
                      .str.replace('.', '')
                      .str.replace(',', '')
                      .str.replace('russia', 'russian_federation'))
check_names(soln2_clean_series)

'Success'
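A third option, which is not part of the solutions above, is to pass `regex=True` to `.str.replace()` and use a regular expression character class to handle several characters in one call. A sketch on three names from the dataset (`names` and `cleaned` are our names):

```python
import pandas as pd

names = pd.Series(['Yemen, Rep.',
                   'St. Martin (French part)',
                   'Virgin Islands (U.S.)'])

cleaned = (names
           .str.lower()
           # Delete any of the characters ( ) . ,
           .str.replace(r'[().,]', '', regex=True)
           # Replace each space or hyphen with an underscore.
           .str.replace(r'[ -]', '_', regex=True))

print(cleaned.tolist())
```

Characters such as `(` and `.` have special meanings in regular expressions. This is one reason to state `regex=True` or `regex=False` explicitly when the search string contains them, rather than relying on the default.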

Summary#

This page looked at string methods in base Python, Numpy and Pandas.

Numpy and Pandas inherit their string methods from base Python, but apply them in different ways.

Numpy arrays do not have methods for applying string operations to every element of an array simultaneously. We need functions from the np.char module if we want this.

By contrast, Pandas Series - whether in isolation or as columns in a Data Frame - have the .str. accessor for easily performing string operations on every element in a Series.