Let There Be Data (Frames)#
In the previous tutorials we showed how the Pandas class objects (Series and Data Frames) are constructed from Numpy objects (arrays) and other attributes.
We focused on the maxims:
“a Pandas Series is a numpy array, plus a
name attribute and an array-like index”
…and…
“a Pandas DataFrame is just a dictionary-like collection of Series”.
This page will look at several different ways of constructing Data Frames. All
of these use the pd.DataFrame() constructor but supply it with different
“ingredients”. This influences the specific collection of attributes that the
resultant Data Frame will have.
# Import libraries
import numpy as np
import pandas as pd
Reading in data from a file#
The simplest, probably most common, and easiest way to create a Data Frame is to use a pd.read_* function to import data from a file.
.csv files are a common way of storing data, and (as we have seen) can be imported using the creatively named pd.read_csv() function:
# Read in data the boring way
df_from_file = pd.read_csv('data/airline_passengers.csv')
df_from_file
| | Month | Thousands of Passengers |
|---|---|---|
| 0 | 1949-01 | 112 |
| 1 | 1949-02 | 118 |
| 2 | 1949-03 | 132 |
| 3 | 1949-04 | 129 |
| 4 | 1949-05 | 121 |
| ... | ... | ... |
| 139 | 1960-08 | 606 |
| 140 | 1960-09 | 508 |
| 141 | 1960-10 | 461 |
| 142 | 1960-11 | 390 |
| 143 | 1960-12 | 432 |
144 rows × 2 columns
Pandas, as a major Python data science library, has a large array of read_* functions, for importing data stored in different formats.
# Names in Pandas module starting with "read_"
[k for k in dir(pd) if k.startswith('read_')]
['read_clipboard',
'read_csv',
'read_excel',
'read_feather',
'read_fwf',
'read_hdf',
'read_html',
'read_iceberg',
'read_json',
'read_orc',
'read_parquet',
'read_pickle',
'read_sas',
'read_spss',
'read_sql',
'read_sql_query',
'read_sql_table',
'read_stata',
'read_table',
'read_xml']
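The read_* functions are not limited to paths on disk. As a minimal sketch, pd.read_csv() also accepts a file-like object, so we can parse CSV text held in memory. The two-row text below is made up for illustration (it copies the first two rows shown above), and io.StringIO comes from the Python standard library.

```python
import io

import pandas as pd

# Illustrative CSV text; the real data lives in data/airline_passengers.csv.
csv_text = """Month,Thousands of Passengers
1949-01,112
1949-02,118
"""

# read_csv accepts a path or a file-like object.
df_from_text = pd.read_csv(io.StringIO(csv_text))
print(df_from_text.shape)
print(list(df_from_text.columns))
```

This can be handy for trying out small examples without creating a file first.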
In other situations, and to deepen our understanding of Data Frame construction, let’s look at more elaborate, artisanal ways of creating Data Frames…
Creating a blank Data Frame#
Another very simple way to create a Data Frame is by using the
pd.DataFrame() constructor with no arguments:
# Calling the constructor with no arguments
blank_df = pd.DataFrame()
blank_df
Perhaps unsurprisingly, this returns a strange, blank output.
Again, unsurprisingly, many of the attributes of the Data Frame are also blank.
For instance, the index:
# Show the blank index
blank_df.index
RangeIndex(start=0, stop=0, step=1)
Ditto for the columns attribute:
# Show the blank columns.
blank_df.columns
RangeIndex(start=0, stop=0, step=1)
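Two other attributes give a quick confirmation that the Data Frame really is empty. This is a small sketch; the variable name blank is ours, chosen so as not to disturb blank_df.

```python
import pandas as pd

blank = pd.DataFrame()

# .shape is (number of rows, number of columns).
print(blank.shape)
# .empty is True when the Data Frame holds no elements.
print(blank.empty)
```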
We can add new columns (that is, new Pandas Series) to this blank Data Frame by using direct indexing on the left hand side (LHS) of an assignment. For example:
# Create a new column in the Data Frame.
blank_df['new_column'] = np.array([1, 2, 3])
blank_df
| | new_column |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
We used a Numpy array to construct this new column. However, as we know, Data Frames are a dictionary-like collection of Series, so Pandas stores the data as a Pandas Series:
# Show the type of df['new_column'].
new_col = blank_df['new_column']
type(new_col)
pandas.Series
The string which we used as the column name (e.g. new_column) has become the name attribute of this new Series:
# Show the `name` of the column.
new_col.name
'new_column'
…and the numpy array we supplied has become the .values of the Series:
# Show the `values` in the column.
new_col.values
array([1, 2, 3])
Pandas has also automatically created a default RangeIndex for the Data Frame, because we did not specify what it should use as an index:
blank_df.index
RangeIndex(start=0, stop=3, step=1)
As you saw in The Pandas from Numpy page, Series extracted from Data Frames inherit the .index of the Data Frame:
new_col.index
RangeIndex(start=0, stop=3, step=1)
If we construct Data Frames using this method (“create a blank Data Frame, add the data later”), then any new column we add must have the same number of elements as the existing columns. This must be so in order that the new column can share an index with the old.
# Add another new column with correct number of elements.
blank_df['another_new_column'] = np.array(['A', 'B', 'C'])
blank_df
| | new_column | another_new_column |
|---|---|---|
| 0 | 1 | A |
| 1 | 2 | B |
| 2 | 3 | C |
If the number of elements differs, then Pandas will throw an error:
# ValueError from wrong number of elements on RHS.
blank_df['a_further_new_column'] = np.array([4, 5 , 6, 7])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[14], line 2
1 # ValueError from wrong number of elements on RHS.
----> 2 blank_df['a_further_new_column'] = np.array([4, 5 , 6, 7])
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/frame.py:4672, in DataFrame.__setitem__(self, key, value)
4669 self._setitem_array([key], value)
4670 else:
4671 # set column
-> 4672 self._set_item(key, value)
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/frame.py:4872, in DataFrame._set_item(self, key, value)
4862 def _set_item(self, key, value) -> None:
4863 """
4864 Add series to DataFrame in specified column.
4865
(...) 4870 ensure homogeneity.
4871 """
-> 4872 value, refs = self._sanitize_column(value)
4874 if (
4875 key in self.columns
4876 and value.ndim == 1
4877 and not isinstance(value.dtype, ExtensionDtype)
4878 ):
4879 # broadcast across multiple columns if necessary
4880 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/frame.py:5742, in DataFrame._sanitize_column(self, value)
5739 return _reindex_for_setitem(value, self.index)
5741 if is_list_like(value):
-> 5742 com.require_length_match(value, self.index)
5743 return sanitize_array(value, self.index, copy=True, allow_2d=True), None
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/common.py:601, in require_length_match(data, index)
597 """
598 Check the length of data matches the length of the index.
599 """
600 if len(data) != len(index):
--> 601 raise ValueError(
602 "Length of values "
603 f"({len(data)}) "
604 "does not match length of index "
605 f"({len(index)})"
606 )
ValueError: Length of values (4) does not match length of index (3)
Notice the text of this error: ValueError: Length of values (4) does not match length of index (3). The error is caused because all columns must share an index, to facilitate the label-based indexing (via .loc) that we have seen on previous pages.
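The length check above applies to plain arrays. As a hedged sketch of the contrast, a Pandas Series on the right hand side already has its own index, so Pandas matches it to the Data Frame by label rather than by length. The names df, too_long and from_series are ours, for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['new_column'] = np.array([1, 2, 3])

# A plain array of the wrong length raises a ValueError.
try:
    df['too_long'] = np.array([4, 5, 6, 7])
except ValueError as err:
    message = str(err)
    print(message)

# A Series is matched by index label instead. Its labels are 0, 1, 2, 3;
# the Data Frame only has labels 0, 1, 2, so the value at label 3 is dropped.
df['from_series'] = pd.Series([4, 5, 6, 7])
print(df)
```

Silent dropping of values can be surprising, so it is worth checking the index of any Series before assigning it as a column.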
We want to avoid the pitfalls of integer indices, such as RangeIndex (e.g. misalignment between the integer location of data, and the numerical index label of that data). To do this, we can specify non-integer values for the index, after we have created the Data Frame.
# Set the index
blank_df.index = ['Person_1', 'Person_2', 'Person_3']
blank_df
| | new_column | another_new_column |
|---|---|---|
| Person_1 | 1 | A |
| Person_2 | 2 | B |
| Person_3 | 3 | C |
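The integer-index pitfall mentioned above can be sketched with a small Series. The values and the names scores and named are invented for illustration. After sorting, integer labels no longer match integer positions, so .loc and .iloc give different answers for the same number.

```python
import pandas as pd

# Integer labels stop matching integer positions after sorting.
scores = pd.Series([30, 10, 20], index=[0, 1, 2]).sort_values()
print(scores.loc[0])    # label 0 -> 30
print(scores.iloc[0])   # position 0 -> 10

# With string labels, label-based and position-based indexing
# cannot be mistaken for one another.
named = pd.Series([30, 10, 20],
                  index=['Person_1', 'Person_2', 'Person_3']).sort_values()
print(named.loc['Person_1'])   # 30
print(named.iloc[0])           # 10
```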
We can also specify the index directly when we make the “blank” Data Frame:
df_again = pd.DataFrame(index=['Person_1', 'Person_2', 'Person_3'])
df_again
| |
|---|
| Person_1 |
| Person_2 |
| Person_3 |
This creates a Data Frame with only an index, to which data can then be added:
df_again['new_column'] = np.array([1, 2, 3])
df_again
| | new_column |
|---|---|
| Person_1 | 1 |
| Person_2 | 2 |
| Person_3 | 3 |
Because all Series/columns in the Data Frame must share an index, Pandas will predictably throw an error if we try to use something that is the wrong length/shape to be a valid index:
# ValueError because we have specified the wrong number of index elements.
blank_df.index = ['Person_1', 'Person_2', 'Person_3', 'Person_4']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[18], line 2
1 # ValueError because we have specified the wrong number of index elements.
----> 2 blank_df.index = ['Person_1', 'Person_2', 'Person_3', 'Person_4']
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py:6220, in NDFrame.__setattr__(self, name, value)
6218 try:
6219 object.__getattribute__(self, name)
-> 6220 return object.__setattr__(self, name, value)
6221 except AttributeError:
6222 pass
File pandas/_libs/properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/generic.py:766, in NDFrame._set_axis(self, axis, labels)
761 """
762 This is called from the cython code when we set the `index` attribute
763 directly, e.g. `series.index = [1, 2, 3]`.
764 """
765 labels = ensure_index(labels)
--> 766 self._mgr.set_axis(axis, labels)
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/internals/managers.py:273, in BaseBlockManager.set_axis(self, axis, new_labels)
271 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
272 # Caller is responsible for ensuring we have an Index object.
--> 273 self._validate_set_axis(axis, new_labels)
274 self.axes[axis] = new_labels
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/internals/managers.py:288, in BaseBlockManager._validate_set_axis(self, axis, new_labels)
285 pass
287 elif new_len != old_len:
--> 288 raise ValueError(
289 f"Length mismatch: Expected axis has {old_len} elements, new "
290 f"values have {new_len} elements"
291 )
ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements
Again, the error that Pandas gives us here is informative: ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements. (Unfortunately, not all Pandas errors are as obvious as this one).
Constructing a Data Frame from an array#
Remember (from .loc and .iloc with Data Frames) that a Pandas Data Frame can be considered a view onto a two-dimensional array.
For example, the .values attribute of a Data Frame returns a two-dimensional Numpy array with a copy of the underlying data:
# Select the first 10 rows of the loaded Data Frame for brevity
early_passengers_df = df_from_file.head(10)
# Show this as a 2D array.
early_passengers_df.values
array([['1949-01', 112],
['1949-02', 118],
['1949-03', 132],
['1949-04', 129],
['1949-05', 121],
['1949-06', 135],
['1949-07', 148],
['1949-08', 148],
['1949-09', 136],
['1949-10', 119]], dtype=object)
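Notice dtype=object in the output above. A Numpy array has a single dtype, so when the columns hold different kinds of data (here strings and integers), .values falls back to the general-purpose object dtype. The sketch below compares a mixed Data Frame with an all-integer one; both Data Frames are toy examples made up for this illustration.

```python
import pandas as pd

# One string column and one integer column.
mixed = pd.DataFrame({'Month': ['1949-01', '1949-02'],
                      'Passengers': [112, 118]})
# Two integer columns.
numeric = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Mixed columns need a dtype that can hold anything: object.
print(mixed.values.dtype)
# Columns of one numeric type keep a numeric dtype.
print(numeric.values.dtype)
```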
In a similar way, if you pass a Numpy array as the first argument to the Data Frame constructor, Pandas will assume you are passing this underlying 2D data array.
two_d_arr = np.array([[1, 2, 3], [11, 21, 31], [101, 102, 103]])
two_d_arr
array([[ 1, 2, 3],
[ 11, 21, 31],
[101, 102, 103]])
# Construct Data Frame from data in two dimensional array.
default_df = pd.DataFrame(two_d_arr)
default_df
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 11 | 21 | 31 |
| 2 | 101 | 102 | 103 |
Notice that Pandas constructed a default Index (integer row labels), because we did not pass one, along with a corresponding default set of column labels. These default column labels are also integers, of which more soon. For now, let us make this Data Frame more standard by giving it string column labels, using the columns= argument to the constructor:
# Naming the columns when constructing from 2D array.
pd.DataFrame(two_d_arr, columns=['First', 'Second', 'Third'])
| | First | Second | Third |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 11 | 21 | 31 |
| 2 | 101 | 102 | 103 |
Better still, we can add meaningful row labels by using the index= argument:
# Naming the columns and rows when constructing from 2D array.
pd.DataFrame(two_d_arr,
columns=['First', 'Second', 'Third'],
index=['Row 1', 'Row 2', 'Row 3'])
| | First | Second | Third |
|---|---|---|---|
| Row 1 | 1 | 2 | 3 |
| Row 2 | 11 | 21 | 31 |
| Row 3 | 101 | 102 | 103 |
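The labels passed as columns= and index= have to match the shape of the array. As a small sketch, passing two column names for a three-column array raises a ValueError, while correctly sized labels allow label-based lookup with .loc. The names shape_error and labelled are ours.

```python
import numpy as np
import pandas as pd

two_d_arr = np.array([[1, 2, 3], [11, 21, 31], [101, 102, 103]])

# Two column labels for a three-column array: Pandas refuses.
try:
    pd.DataFrame(two_d_arr, columns=['First', 'Second'])
except ValueError as err:
    shape_error = str(err)
    print(shape_error)

# With matching labels we can select values by row and column label.
labelled = pd.DataFrame(two_d_arr,
                        columns=['First', 'Second', 'Third'],
                        index=['Row 1', 'Row 2', 'Row 3'])
print(labelled.loc['Row 2', 'Third'])   # 31
```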
If you pass a one-dimensional sequence to the constructor (here a plain list; a 1D array behaves the same way), it assumes you mean this as one column of a 2D array:
pd.DataFrame([10, 20, 20])
| | 0 |
|---|---|
| 0 | 10 |
| 1 | 20 |
| 2 | 20 |
Constructing a Data Frame from a dictionary of Numpy arrays#
Another common way to construct Data Frames is to use a dictionary.
When we do this, the keys of the dictionary become the column names (and therefore the name attribute of the Series that constitutes a given column); and the values of the dictionary become the values attribute of a given column.
First, let’s make a dictionary:
# Make a dictionary, using the keys "A" and "B" and two Numpy arrays for the values
dictionary = {'A': np.array([1, 2, 3, 4]),
'B': np.array([5, 6, 7, 8])}
dictionary
{'A': array([1, 2, 3, 4]), 'B': array([5, 6, 7, 8])}
Here are the keys and values of the dictionary, containing this toy data:
# Show the keys of the dictionary
dictionary.keys()
dict_keys(['A', 'B'])
# Show the values of the dictionary
dictionary.values()
dict_values([array([1, 2, 3, 4]), array([5, 6, 7, 8])])
We can pass this dictionary to the pd.DataFrame() constructor. As noted above, the keys will become the name attribute of each column (where each column is a Pandas Series). The values will become the .values attribute of each column:
# Construction from a dictionary
df3 = pd.DataFrame(dictionary)
df3
| | A | B |
|---|---|---|
| 0 | 1 | 5 |
| 1 | 2 | 6 |
| 2 | 3 | 7 |
| 3 | 4 | 8 |
As we know, the Data Frame itself is just a dictionary-like collection of Series:
# Show one column/Series
df3['A']
0 1
1 2
2 3
3 4
Name: A, dtype: int64
Each Series inherits its name attribute from its key in the original dictionary:
df3['A'].name
'A'
…and its .values attribute from the values in the original dictionary:
df3['A'].values
array([1, 2, 3, 4])
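Two further details of dictionary construction are worth a quick sketch. By default the columns follow the order of the dictionary keys, and the columns= argument can select and reorder them. As with the blank Data Frame earlier, arrays of different lengths cannot share an index, so Pandas raises a ValueError. The names reordered and length_error are ours.

```python
import numpy as np
import pandas as pd

dictionary = {'A': np.array([1, 2, 3, 4]),
              'B': np.array([5, 6, 7, 8])}

# Columns appear in dictionary key order by default ...
print(list(pd.DataFrame(dictionary).columns))
# ... and columns= selects and orders the keys.
reordered = pd.DataFrame(dictionary, columns=['B', 'A'])
print(list(reordered.columns))

# Arrays of different lengths cannot share an index.
try:
    pd.DataFrame({'A': np.array([1, 2, 3]), 'B': np.array([4, 5])})
except ValueError as err:
    length_error = str(err)
    print(length_error)
```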
Constructing a Data Frame from a dictionary of Pandas series#
We can also use Pandas Series as the values in a dictionary (rather than Numpy
arrays), in order to build a Data Frame. Because Pandas Series contain a Numpy
array plus additional attributes, like an index, we need to be aware of this
when using them to create Data Frames, as conflicts between the indexes of
different Series can lead to errors.
Let’s build some Series, using the familiar three-letter country codes, the country names, and the HDI data. First we make arrays of the codes and the names:
# Make an array containing the country codes
country_codes_array = np.array(['AUS', 'BRA', 'CAN',
'CHN', 'DEU', 'ESP',
'FRA', 'GBR', 'IND',
'ITA', 'JPN', 'KOR',
'MEX', 'RUS', 'USA'])
# Make an array containing the country names
country_names_array = np.array(['Australia', 'Brazil', 'Canada',
'China', 'Germany', 'Spain',
'France', 'United Kingdom', 'India',
'Italy', 'Japan', 'South Korea',
'Mexico', 'Russia', 'United States'])
As previously, we will use the country codes as an index:
# Build a Series of the country names
country_names_series = pd.Series(country_names_array,
index=country_codes_array)
country_names_series
AUS Australia
BRA Brazil
CAN Canada
CHN China
DEU Germany
ESP Spain
FRA France
GBR United Kingdom
IND India
ITA Italy
JPN Japan
KOR South Korea
MEX Mexico
RUS Russia
USA United States
dtype: str
Now, let’s do the same for the HDI scores:
# Human Development Index Scores for each country
hdis_array = np.array([0.896, 0.668, 0.89 , 0.586,
0.844, 0.89 , 0.49 , 0.842,
0.883, 0.709, 0.733, 0.824,
0.828, 0.863, 0.894])
Here also we will use the country codes as the index:
hdi_series = pd.Series(hdis_array, index=country_codes_array)
hdi_series
AUS 0.896
BRA 0.668
CAN 0.890
CHN 0.586
DEU 0.844
ESP 0.890
FRA 0.490
GBR 0.842
IND 0.883
ITA 0.709
JPN 0.733
KOR 0.824
MEX 0.828
RUS 0.863
USA 0.894
dtype: float64
We can then create the Data Frame by using the Series as values in a dictionary, and passing that dictionary to the pd.DataFrame() constructor:
df4 = pd.DataFrame({'country_names': country_names_series,
'HDI': hdi_series})
df4
| | country_names | HDI |
|---|---|---|
| AUS | Australia | 0.896 |
| BRA | Brazil | 0.668 |
| CAN | Canada | 0.890 |
| CHN | China | 0.586 |
| DEU | Germany | 0.844 |
| ESP | Spain | 0.890 |
| FRA | France | 0.490 |
| GBR | United Kingdom | 0.842 |
| IND | India | 0.883 |
| ITA | Italy | 0.709 |
| JPN | Japan | 0.733 |
| KOR | South Korea | 0.824 |
| MEX | Mexico | 0.828 |
| RUS | Russia | 0.863 |
| USA | United States | 0.894 |
However, it is very important when using this method to ensure that all the Series share an index.
Strange things can happen if they do not.
Let’s adjust the hdi_series to give it a numerical index:
# Adjust the `hdi_series` to have a numerical index
# Copy the Series with the Series `.copy` method.
hdi_with_int_index = hdi_series.copy()
hdi_with_int_index.index = np.arange(len(hdi_series))
hdi_with_int_index
0 0.896
1 0.668
2 0.890
3 0.586
4 0.844
5 0.890
6 0.490
7 0.842
8 0.883
9 0.709
10 0.733
11 0.824
12 0.828
13 0.863
14 0.894
dtype: float64
With recent versions of Pandas (2.2.3 at time of writing), we get an error if we try to construct a Data Frame from a dictionary with these two Series as the values:
# TypeError if we construct a Data Frame using Series without matching indexes
df5 = pd.DataFrame({'country_names': country_names_series,
'HDI': hdi_with_int_index})
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[39], line 2
1 # TypeError if we construct a Data Frame using Series without matching indexes
----> 2 df5 = pd.DataFrame({'country_names': country_names_series,
3 'HDI': hdi_with_int_index})
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/frame.py:769, in DataFrame.__init__(self, data, index, columns, dtype, copy)
763 mgr = self._init_mgr(
764 data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
765 )
767 elif isinstance(data, dict):
768 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 769 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy)
770 elif isinstance(data, ma.MaskedArray):
771 from numpy.ma import mrecords
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/internals/construction.py:447, in dict_to_mgr(data, index, columns, dtype, copy)
428 if copy:
429 # We only need to copy arrays that will not get consolidated, i.e.
430 # only EA arrays
431 arrays = [
432 (
433 x.copy()
(...) 444 for x in arrays
445 ]
--> 447 return arrays_to_mgr(arrays, columns, index, dtype=dtype, consolidate=copy)
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/internals/construction.py:112, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, consolidate)
109 if verify_integrity:
110 # figure out the index, if necessary
111 if index is None:
--> 112 index = _extract_index(arrays)
113 else:
114 index = ensure_index(index)
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/internals/construction.py:614, in _extract_index(data)
611 raise ValueError("If using all scalar values, you must pass an index")
613 if have_series:
--> 614 index = union_indexes(indexes)
615 elif have_dicts:
616 index = union_indexes(indexes, sort=False)
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexes/api.py:261, in union_indexes(indexes, sort)
259 index = index.append(diff.unique())
260 if sort:
--> 261 index = index.sort_values()
262 else:
263 index = indexes[0]
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/indexes/base.py:5974, in Index.sort_values(self, return_indexer, ascending, na_position, key)
5971 # GH 35584. Sort missing values according to na_position kwarg
5972 # ignore na_position for MultiIndex
5973 if not isinstance(self, ABCMultiIndex):
-> 5974 _as = nargsort(
5975 items=self, ascending=ascending, na_position=na_position, key=key
5976 )
5977 else:
5978 idx = cast(Index, ensure_key_mapped(self, key))
File /opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/pandas/core/sorting.py:442, in nargsort(items, kind, ascending, na_position, key, mask)
440 non_nans = non_nans[::-1]
441 non_nan_idx = non_nan_idx[::-1]
--> 442 indexer = non_nan_idx[non_nans.argsort(kind=kind)]
443 if not ascending:
444 indexer = indexer[::-1]
TypeError: '<' not supported between instances of 'int' and 'str'
Exercise 10
In the cell above, at least at time of writing, you get the following error:
TypeError: '<' not supported between instances of 'int' and 'str'
This occurred when we passed one Series with int-type Index values, and
another with str-type Index values.
Reflect back on the first exercise in the Pandas from
Numpy page. Why do you think Pandas is comparing ints to
strs as it creates the Data Frame?
Solution to Exercise 10
Working through the indices exercise should have revealed
that Pandas follows something like the following algorithm, when dealing with
the .index of different Series intended for a Data Frame:
First check if the Series Indices are the same. If so, use the Index of any Series.
If they are not the same, first sort all Series by their Index values, and use the resulting sorted Index.
The second of these two steps (the sort, which appears as the sort_values call in the traceback) will involve comparing int to str, hence the error.
Remember that each index label is an identifier for a row of the Data Frame. Pandas is trying to compare the indices of the two Series in order to match corresponding rows, and failing, because it cannot compare the string index of country_names_series to the (newly set) integer index of hdi_with_int_index.
Later on we will see further signs that Pandas is trying to match rows between series by using the index.
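Here is a small sketch of that row matching, using two toy Series whose string indexes only partly overlap (the data and the names aligned and by_position are made up for illustration). Because the labels are all strings, the sort succeeds, and labels missing from one Series show up as missing values. The second construction passes the bare Numpy values for one column, which switches off label matching for that column.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Brazil', 'Australia'], index=['BRA', 'AUS'])
hdi = pd.Series([0.896, 0.89], index=['AUS', 'CAN'])

# Rows are matched by label; labels missing from a Series give NaN.
aligned = pd.DataFrame({'country_names': names, 'HDI': hdi})
print(aligned)

# Passing bare values side-steps label matching: position decides,
# so 0.896 (Australia's value) lands in the BRA row here.
by_position = pd.DataFrame({'country_names': names,
                            'HDI': hdi.to_numpy()})
print(by_position)
```

The second result shows why matching by label is usually the safer behavior: matching by position is only correct when both sets of values are in the same order.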
Constructing a Data Frame from a single Pandas series#
pd.DataFrame has a special case in which you pass a single Series as the data argument.
df_single = pd.DataFrame(hdi_series)
df_single
| | 0 |
|---|---|
| AUS | 0.896 |
| BRA | 0.668 |
| CAN | 0.890 |
| CHN | 0.586 |
| DEU | 0.844 |
| ESP | 0.890 |
| FRA | 0.490 |
| GBR | 0.842 |
| IND | 0.883 |
| ITA | 0.709 |
| JPN | 0.733 |
| KOR | 0.824 |
| MEX | 0.828 |
| RUS | 0.863 |
| USA | 0.894 |
Be careful - as you will see below, if you pass a sequence of Series, then the Series become the rows. Here, the single Series becomes a single column in the Data Frame.
The column name comes from the Series name, which has not been set in this case, so the cell below shows no output:
hdi_series.name
As you remember, Series have an optional .name (for which the default is None). For example:
hdi_series_no_name = pd.Series(hdis_array, index=country_codes_array)
hdi_series_no_name.name is None
True
If you pass a Series with no .name (.name is None) then Pandas must make a default column name. It uses the same default for column names as it does for row names, that is, a RangeIndex containing integers, where, in this case, it only contains the integer value 0:
df_single_no_name = pd.DataFrame(hdi_series_no_name)
df_single_no_name
| | 0 |
|---|---|
| AUS | 0.896 |
| BRA | 0.668 |
| CAN | 0.890 |
| CHN | 0.586 |
| DEU | 0.844 |
| ESP | 0.890 |
| FRA | 0.490 |
| GBR | 0.842 |
| IND | 0.883 |
| ITA | 0.709 |
| JPN | 0.733 |
| KOR | 0.824 |
| MEX | 0.828 |
| RUS | 0.863 |
| USA | 0.894 |
df_single_no_name.columns
RangeIndex(start=0, stop=1, step=1)
Indexing for this column, with an integer label, is likely to become confusing:
# Getting the column by label.
df_single_no_name.loc[:, 0]
AUS 0.896
BRA 0.668
CAN 0.890
CHN 0.586
DEU 0.844
ESP 0.890
FRA 0.490
GBR 0.842
IND 0.883
ITA 0.709
JPN 0.733
KOR 0.824
MEX 0.828
RUS 0.863
USA 0.894
Name: 0, dtype: float64
Or even this (which is very confusing - direct indexing with column name):
# Direct indexing using column name, where name is integer 0
df_single_no_name[0]
AUS 0.896
BRA 0.668
CAN 0.890
CHN 0.586
DEU 0.844
ESP 0.890
FRA 0.490
GBR 0.842
IND 0.883
ITA 0.709
JPN 0.733
KOR 0.824
MEX 0.828
RUS 0.863
USA 0.894
Name: 0, dtype: float64
It’s usually advisable either to set the Series name (when constructing the Series, or later, with e.g. hdi_series.name = 'Human Development Index'), or to give the column name explicitly to pd.DataFrame using the columns= argument:
# Setting the column name or names on constructing the Data Frame.
df_single_now_named = pd.DataFrame(hdi_series_no_name,
columns=['My HDI'])
df_single_now_named
| | My HDI |
|---|---|
| AUS | 0.896 |
| BRA | 0.668 |
| CAN | 0.890 |
| CHN | 0.586 |
| DEU | 0.844 |
| ESP | 0.890 |
| FRA | 0.490 |
| GBR | 0.842 |
| IND | 0.883 |
| ITA | 0.709 |
| JPN | 0.733 |
| KOR | 0.824 |
| MEX | 0.828 |
| RUS | 0.863 |
| USA | 0.894 |
Constructing a Data Frame from a sequence of Pandas series#
Series have an optional .name (for which the default is None).
If we specify a .name for each Series, then we can pass a sequence of these named Series to pd.DataFrame; Pandas interprets these Series as rows in the Data Frame. For example:
# Set not-default names for the Series.
country_names_series.name = 'country_names'
hdi_series.name = 'HDI'
df5 = pd.DataFrame([country_names_series, hdi_series])
df5
| | AUS | BRA | CAN | CHN | DEU | ESP | FRA | GBR | IND | ITA | JPN | KOR | MEX | RUS | USA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| country_names | Australia | Brazil | Canada | China | Germany | Spain | France | United Kingdom | India | Italy | Japan | South Korea | Mexico | Russia | United States |
| HDI | 0.896 | 0.668 | 0.89 | 0.586 | 0.844 | 0.89 | 0.49 | 0.842 | 0.883 | 0.709 | 0.733 | 0.824 | 0.828 | 0.863 | 0.894 |
Notice the .names of the Series become the .index values of the Data Frame
(the row labels). The .index of the two Series become the column labels. To
get the same effect as we have had, up until now, we can transpose the Data
Frame, so that the rows become columns, and the columns become the rows:
# .T is the transpose attribute of the Data Frame. It returns a new, transposed Data Frame.
df6 = df5.T
df6
| | country_names | HDI |
|---|---|---|
| AUS | Australia | 0.896 |
| BRA | Brazil | 0.668 |
| CAN | Canada | 0.89 |
| CHN | China | 0.586 |
| DEU | Germany | 0.844 |
| ESP | Spain | 0.89 |
| FRA | France | 0.49 |
| GBR | United Kingdom | 0.842 |
| IND | India | 0.883 |
| ITA | Italy | 0.709 |
| JPN | Japan | 0.733 |
| KOR | South Korea | 0.824 |
| MEX | Mexico | 0.828 |
| RUS | Russia | 0.863 |
| USA | United States | 0.894 |
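Notice that the HDI values above display as 0.89 rather than 0.890. This suggests that, after the rows-then-transpose route, the HDI column is no longer stored as a float column: each column of the first Data Frame mixed a string and a float, and that general-purpose storage appears to survive the transpose. The sketch below, on toy two-country Series, prints the dtypes so you can check this on your own version of Pandas, and compares with pd.concat(..., axis=1), which places named Series side by side as columns.

```python
import pandas as pd

names = pd.Series(['Australia', 'Brazil'], index=['AUS', 'BRA'],
                  name='country_names')
hdi = pd.Series([0.896, 0.668], index=['AUS', 'BRA'], name='HDI')

# Rows-then-transpose: inspect the dtype the HDI column ends up with.
transposed = pd.DataFrame([names, hdi]).T
print(transposed.dtypes)

# Series side by side as columns; each keeps its own dtype.
side_by_side = pd.concat([names, hdi], axis=1)
print(side_by_side.dtypes)

# If needed, the transposed column can be converted back to floats.
hdi_as_float = transposed['HDI'].astype(float)
```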
Summary#
This page has looked at different methods of constructing Data Frames, and how these affect different attributes of the Pandas Series that constitute each Data Frame.