
Pandas

Reading data into pandas

The Python package pandas is widely used to read and manipulate data in tabular or similar form. One of the most common tabular data formats is CSV.

An interesting CSV-format dataset is the one containing the successive pulses of the oscillation signal coming from the pulsar PSR B1919+21, discovered by Jocelyn Bell in 1967. Some of you might also recognise it from a famous album cover.

(Image: the pulsar plot, famously used on the cover of the Joy Division album Unknown Pleasures)

By inspection we can see the data comprise 80 lines of 300 columns. The data format is simple, with no missing values or metadata, so we can straightforwardly use the pandas function pd.read_csv to read this dataset directly from its URL, specifying only header=None so that the first line is not interpreted as data column names.

import pandas as pd
from urlpath import URL

site = 'https://raw.githubusercontent.com'
site_dir = 'igorol/unknown_pleasures_plot/master'
site_file = 'pulsar.csv'

url = URL(site,site_dir,site_file)

df=pd.read_csv(url.as_posix(),header=None)
df
0 1 2 3 4 5 6 7 8 9 ... 290 291 292 293 294 295 296 297 298 299
0 -0.81 -0.91 -1.09 -1.00 -0.59 -0.82 -0.43 -0.68 -0.71 -0.27 ... -0.08 0.19 -0.19 -0.18 -0.20 -0.26 -0.52 -0.44 -0.58 -0.54
1 -0.61 -0.40 -0.42 -0.38 -0.55 -0.51 -0.71 -0.79 -0.52 -0.40 ... -0.34 -0.58 -0.26 -0.64 -1.05 -0.83 -0.80 -0.47 -0.13 -0.12
2 -1.43 -1.15 -1.25 -1.13 -0.76 -0.25 0.40 0.26 0.30 0.36 ... -0.29 0.16 0.83 0.99 1.28 0.11 -0.77 -0.88 -0.45 -1.01
3 -1.09 -0.85 -0.72 -0.74 -0.26 -0.04 -0.19 0.18 0.03 0.19 ... 0.48 0.52 -0.14 -1.13 -1.07 -1.03 -0.78 -0.40 0.18 0.27
4 -1.13 -0.98 -0.93 -0.90 -1.14 -1.00 -0.90 -1.18 -1.30 -1.07 ... -0.27 -0.47 -0.49 -0.23 -0.75 -0.29 -0.54 -0.65 -0.64 -0.94
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
75 0.62 0.64 0.59 0.30 0.01 0.05 -0.63 0.07 0.36 0.78 ... 0.20 0.22 0.23 0.27 -0.10 -0.21 -0.09 -0.24 -0.17 -0.62
76 0.32 0.31 0.28 0.42 -0.24 -0.48 -0.73 -0.64 0.04 0.02 ... -0.44 -0.53 -0.50 -0.49 -0.63 -0.56 -0.50 -0.38 -0.58 -0.43
77 -0.09 -0.14 -0.24 -0.24 -0.66 0.00 0.29 0.29 0.60 0.86 ... 0.08 -0.88 -1.17 -0.36 -0.31 -0.12 0.29 -0.02 0.21 0.44
78 0.11 0.05 0.05 -0.05 -0.03 -0.29 -0.08 -0.54 -0.01 0.01 ... -0.73 -0.54 -0.53 -0.92 -0.68 -0.87 -1.31 -1.02 -1.10 -1.62
79 0.12 -0.12 -0.12 -0.45 -0.24 -0.48 -0.57 -0.19 -0.07 -0.59 ... 0.12 0.03 -0.28 0.02 -0.01 0.13 0.09 -0.01 -0.03 -0.23

80 rows × 300 columns

pandas transpose

In this format, we have time as the columns and sample number as the rows. In many cases, we may wish to view the dataset 'the other way around', i.e. with rows as time and columns as sample number. This is achieved with the transpose operation:

import pandas as pd
from urlpath import URL

site = 'https://raw.githubusercontent.com'
site_dir = 'igorol/unknown_pleasures_plot/master'
site_file = 'pulsar.csv'

url = URL(site,site_dir,site_file)

# transpose the dataset
df=pd.read_csv(url.as_posix(),header=None).transpose()
df
0 1 2 3 4 5 6 7 8 9 ... 70 71 72 73 74 75 76 77 78 79
0 -0.81 -0.61 -1.43 -1.09 -1.13 -0.66 -0.36 -0.73 -0.89 -0.69 ... 0.00 -0.16 0.19 -0.32 -0.16 0.62 0.32 -0.09 0.11 0.12
1 -0.91 -0.40 -1.15 -0.85 -0.98 -0.89 -0.21 -0.83 -0.61 -0.54 ... -0.12 -0.15 0.06 -0.83 -0.26 0.64 0.31 -0.14 0.05 -0.12
2 -1.09 -0.42 -1.25 -0.72 -0.93 -0.87 -0.44 -0.91 -0.74 -0.84 ... 0.10 0.25 -0.27 -0.69 -0.36 0.59 0.28 -0.24 0.05 -0.12
3 -1.00 -0.38 -1.13 -0.74 -0.90 -0.87 -0.20 -1.10 -0.85 -0.89 ... -0.01 0.37 -0.11 -0.80 -0.49 0.30 0.42 -0.24 -0.05 -0.45
4 -0.59 -0.55 -0.76 -0.26 -1.14 -1.07 -0.31 -0.87 -0.77 -0.45 ... -0.15 -0.13 0.09 -0.76 0.00 0.01 -0.24 -0.66 -0.03 -0.24
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
295 -0.26 -0.83 0.11 -1.03 -0.29 -0.55 -1.45 -1.20 -0.94 -0.16 ... 0.47 0.10 -0.06 0.08 0.28 -0.21 -0.56 -0.12 -0.87 0.13
296 -0.52 -0.80 -0.77 -0.78 -0.54 -0.62 -0.77 -1.40 -1.05 0.24 ... 0.41 0.02 -0.08 -0.15 -0.01 -0.09 -0.50 0.29 -1.31 0.09
297 -0.44 -0.47 -0.88 -0.40 -0.65 -0.71 0.03 -0.51 -0.51 -0.17 ... 0.32 -0.10 -0.04 0.03 -0.67 -0.24 -0.38 -0.02 -1.02 -0.01
298 -0.58 -0.13 -0.45 0.18 -0.64 -0.88 0.47 0.25 -0.47 -0.09 ... 0.57 -0.16 0.23 0.03 -0.86 -0.17 -0.58 0.21 -1.10 -0.03
299 -0.54 -0.12 -1.01 0.27 -0.94 -0.70 1.33 0.74 -0.79 0.01 ... 0.48 -0.06 -0.10 -0.54 -1.66 -0.62 -0.43 0.44 -1.62 -0.23

300 rows × 80 columns

We will use this transposed dataset in future exercises, so make sure you remember how to do this operation.
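Note that pandas also provides the shorthand attribute df.T for the same operation. As a minimal sketch, using a small made-up DataFrame:

```python
import pandas as pd

# a small 2-row, 3-column DataFrame
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

print(df.shape)    # (2, 3)
# .T is shorthand for .transpose(): rows become columns and vice versa
print(df.T.shape)  # (3, 2)
```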

pandas format and read_table

Not all data files we find on the web are so straightforward to read, though (Hint: one of the files you will use in Part A of your assessed practical is like this!). In 020_Python_files we saw data of Monthly Southeast England precipitation (mm) in tabular form on the Met Office website. We would commonly use the Python package pandas to read and analyse such data.

But the data format is actually quite complex, and we can't specify such complex formats when reading directly from a URL in pandas.

Instead, in such cases, we need to download the file first, in much the same way we did for MODIS data earlier (but a text file this time, and no password needed).

from urlpath import URL
from pathlib import Path

# NB -- avoid a trailing / on these
# i.e. don't put
# site_dir = 'hadobs/hadukp/data/monthly/'
# or it won't work!
site = 'https://www.metoffice.gov.uk/'
site_dir = 'hadobs/hadukp/data/monthly'
site_file = 'HadSEEP_monthly_totals.txt'

url = URL(site,site_dir,site_file)

r = url.get()
if r.status_code == 200:
    # setup Path object for output file
    filename = Path('work',url.name)
    # write text data
    filename.write_text(r.text)
    # check size and report
    print(f'file {filename} written: {filename.stat().st_size} bytes')
else:
    print(f'failed to get {url}')
file work/HadSEEP_monthly_totals.txt written: 15209 bytes

Now we want to read this file work/HadSEEP_monthly_totals.txt into pandas.

By inspection, the dataset is seen to have a variable number of spaces between the data columns. This is known as 'whitespace' (i.e. runs of space or \t characters). This makes it more complex to read into pandas than a CSV format, and we need to specify a regular expression meaning 'one or more space'. This is r"[ ]{1,}", and we pass it to pandas with the keyword sep=r"[ ]{1,}". We must also specify that pandas should use the Python parsing engine: engine='python'. Other features of the dataset are that the first 3 rows are metadata and should be skipped when reading: skiprows=3, with the 4th line giving the data column headers. Finally, we see that 'no data' values are given here as the value -99.9: na_values=[-99.9].
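To see these keywords in action without downloading anything, here is a minimal sketch on a made-up fragment in the same layout (the metadata lines and all values here are invented for illustration), read via io.StringIO in place of a file:

```python
import pandas as pd
from io import StringIO

# hypothetical fragment mimicking the HadSEEP layout:
# 3 metadata lines, a header line, then whitespace-separated values
text = """metadata line 1
metadata line 2
metadata line 3
Year Jan Feb
1873  87.1   50.4
1874 46.8  -99.9
"""

df = pd.read_table(StringIO(text),
                   skiprows=3,          # skip the 3 metadata lines
                   na_values=[-99.9],   # treat -99.9 as 'no data'
                   sep=r"[ ]{1,}",      # one or more spaces
                   engine='python')     # regex sep needs the python engine

print(list(df.columns))            # ['Year', 'Jan', 'Feb']
print(df['Feb'].isna().tolist())   # the -99.9 becomes NaN: [False, True]
```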

Since there are quite a few keyword options to use, we might find it convenient to gather these into a dictionary:

import pandas as pd
from pathlib import Path

panda_format = {
    'skiprows'   :  3,
    'na_values'  :  [-99.9],
    'sep'        :  r"[ ]{1,}",
    'engine'     :  'python'
}

filename = Path('work','HadSEEP_monthly_totals.txt')
df=pd.read_table(filename,**panda_format)

df.head()
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual
0 1873 87.1 50.4 52.9 19.9 41.1 63.6 53.2 56.4 62.0 86.0 59.4 15.7 647.7
1 1874 46.8 44.9 15.8 48.4 24.1 49.9 28.3 43.6 79.4 96.1 63.9 52.3 593.5
2 1875 96.9 39.7 22.9 37.0 39.1 76.1 125.1 40.8 54.7 137.7 106.4 27.1 803.5
3 1876 31.8 71.9 79.5 63.6 16.5 37.2 22.3 66.3 118.2 34.1 89.0 162.9 793.3
4 1877 146.0 47.7 56.2 66.4 62.3 24.9 78.5 82.4 38.4 58.1 144.5 54.2 859.6

Exercise 1

The file 2276931.csv contains precipitation data for a NOAA weather station, HAVANA 4.2 SW, FL US, for the year 2020 to date.

The dataset URL is:

https://raw.githubusercontent.com/UCL-EO/geog0111/master/notebooks/data/2276931.csv

  • Inspect the file to discover any issues you must account for.
  • Download the file and read it into pandas.
  • Print the first 5 lines of data.

Selecting data in pandas, and datetime

Whilst it is a good start to be able to load a dataset into a DataFrame using pandas, we also need to be able to select data from it.

import pandas as pd
from urlpath import URL
from pathlib import Path

site = 'https://www.metoffice.gov.uk/'
site_dir = 'hadobs/hadukp/data/monthly'
site_file = 'HadSEEP_monthly_totals.txt'

url = URL(site,site_dir,site_file)

r = url.get()
if r.status_code == 200:
    # setup Path object for output file
    filename = Path('work',url.name)
    # write text data
    filename.write_text(r.text)
    # check size and report
    print(f'file {filename} written: {filename.stat().st_size} bytes')
else:
    print(f'failed to get {url}')

panda_format = {
    'skiprows'   :  3,
    'na_values'  :  [-99.9],
    'sep'        :  r"[ ]{1,}",
    'engine'     :  'python'
}

df_had=pd.read_table(filename,**panda_format)

# df.head: first n lines
df_had.head()
file work/HadSEEP_monthly_totals.txt written: 15209 bytes
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual
0 1873 87.1 50.4 52.9 19.9 41.1 63.6 53.2 56.4 62.0 86.0 59.4 15.7 647.7
1 1874 46.8 44.9 15.8 48.4 24.1 49.9 28.3 43.6 79.4 96.1 63.9 52.3 593.5
2 1875 96.9 39.7 22.9 37.0 39.1 76.1 125.1 40.8 54.7 137.7 106.4 27.1 803.5
3 1876 31.8 71.9 79.5 63.6 16.5 37.2 22.3 66.3 118.2 34.1 89.0 162.9 793.3
4 1877 146.0 47.7 56.2 66.4 62.3 24.9 78.5 82.4 38.4 58.1 144.5 54.2 859.6

This dataset has column titles Year Jan Feb Mar Apr ... Annual. We can get the list of column titles as df_had.columns:

print(df_had.columns)
Index(['Year', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
       'Oct', 'Nov', 'Dec', 'Annual'],
      dtype='object')

Sometimes it is useful to convert this to a list, for example to select a subset of the columns by slicing:

cols = list(df_had.columns)
for c in cols[1:-1]:
    print(c)
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec

To select a column, we can use any of these column names as a key, in the same way as in using a dictionary:

df_had['Jan']
0       87.1
1       46.8
2       96.9
3       31.8
4      146.0
       ...  
145     80.9
146     34.0
147     66.9
148    115.6
149     24.3
Name: Jan, Length: 150, dtype: float64

Or multiple columns, for example everything except the Annual column here:

months = list(df_had.columns)[:-1]
df_had_m = df_had[months]
df_had_m.head()
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1873 87.1 50.4 52.9 19.9 41.1 63.6 53.2 56.4 62.0 86.0 59.4 15.7
1 1874 46.8 44.9 15.8 48.4 24.1 49.9 28.3 43.6 79.4 96.1 63.9 52.3
2 1875 96.9 39.7 22.9 37.0 39.1 76.1 125.1 40.8 54.7 137.7 106.4 27.1
3 1876 31.8 71.9 79.5 63.6 16.5 37.2 22.3 66.3 118.2 34.1 89.0 162.9
4 1877 146.0 47.7 56.2 66.4 62.3 24.9 78.5 82.4 38.4 58.1 144.5 54.2

To select data rows, we can set some condition as a mask.

df_had_m[df_had_m['Year'] > 2000].head()
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
128 2001 93.1 92.9 122.4 75.7 32.2 25.8 51.5 83.4 70.1 133.0 43.0 25.2
129 2002 69.4 79.1 44.6 42.9 78.4 58.1 69.5 43.2 37.5 88.4 175.4 137.3
130 2003 80.1 27.6 25.6 35.6 45.3 49.5 46.4 16.2 11.0 52.8 142.6 74.3
131 2004 86.3 29.7 40.5 81.8 48.7 31.1 52.0 101.8 26.7 121.0 33.4 55.9
132 2005 34.4 21.0 48.8 46.5 28.2 35.7 57.3 49.9 48.5 91.5 50.0 64.9

The selection of years was straightforward in that example, but sometimes the date can be encoded differently.

Let's generate a test example to see this, where we encode the date as a string:

# generate date strings
dates = [f'2000-{m:>02d}-01' for m in range(1,13)]
# put in DataFrame
df = pd.DataFrame(dates,columns=["YY-MM-DD"])

# add a column of some values
values = [m*m for m in range(1,13)]
df["VALUES"] = values

df
YY-MM-DD VALUES
0 2000-01-01 1
1 2000-02-01 4
2 2000-03-01 9
3 2000-04-01 16
4 2000-05-01 25
5 2000-06-01 36
6 2000-07-01 49
7 2000-08-01 64
8 2000-09-01 81
9 2000-10-01 100
10 2000-11-01 121
11 2000-12-01 144

To filter this form of date description, we need to use pd.to_datetime. The easiest way to do this is to create an additional column with the datetime object:

df['DATE'] =  pd.to_datetime(df["YY-MM-DD"])

Now we can access datetime fields such as df['DATE'].dt.year and df['DATE'].dt.month from this, and use these to select rows of data. We combine multiple selection criteria with the logical operators and: &, or: | and not: ~:

# print months with index > 4 and <= 7
df[(df['DATE'].dt.month > 4) & (df['DATE'].dt.month <= 7)]
YY-MM-DD VALUES DATE
4 2000-05-01 25 2000-05-01
5 2000-06-01 36 2000-06-01
6 2000-07-01 49 2000-07-01

Hint: Take note of how to filter datetime fields here. You may find you need it for your assessment.
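The or (|) and not (~) operators combine masks in the same way. As a minimal sketch, re-creating the test DataFrame from above and selecting the months outside April to September:

```python
import pandas as pd

# re-create the test DataFrame from above
dates = [f'2000-{m:>02d}-01' for m in range(1,13)]
df = pd.DataFrame(dates,columns=["YY-MM-DD"])
df["VALUES"] = [m*m for m in range(1,13)]
df['DATE'] = pd.to_datetime(df["YY-MM-DD"])

# ~ negates a mask: months NOT in April..September
winter = df[~((df['DATE'].dt.month >= 4) & (df['DATE'].dt.month <= 9))]
# equivalently with |: months before April OR after September
winter2 = df[(df['DATE'].dt.month < 4) | (df['DATE'].dt.month > 9)]

print(winter["YY-MM-DD"].tolist())
# ['2000-01-01', '2000-02-01', '2000-03-01', '2000-10-01', '2000-11-01', '2000-12-01']
```

Note the parentheses around each comparison: the mask operators bind more tightly than the comparisons, so they are required.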

There are many more functions and approaches for data manipulation in pandas that you can read up on later. We have covered the main ones here that you will need in this course and to submit the coursework.

Writing a CSV file in pandas

As well as reading a CSV file, it is useful to know how to write such a file using pandas. All you need to do, once the data are in a pandas dataframe, is to call to_csv:

import pandas as pd 

x = list(range(100))
# loop to generate y = x*x
y = [xi * xi for xi in x]

# load into pandas
df = pd.DataFrame({'x data':x,'y data':y})
df.head()
x data y data
0 0 0
1 1 1
2 2 4
3 3 9
4 4 16
from pathlib import Path
# save as csv without the index
df.to_csv(Path('work/dataset.csv'),index=False)

Exercise 2

Read and print the data in the file work/dataset.csv.

Summary

In this section, we have used Path and URL classes to open streams from files and URLs. We have seen how to use these to read the stream into packages that can interpret various formats of data, such as yaml, json, and CSV or other tabular data.

We have seen that using these object-oriented classes to deal with files and URLs means that we can use essentially the same functions throughout.

We have come across the pandas package for reading tabular and similar datasets.

pandas:

Command Comment
pd.read_csv(f) Read CSV data from file or URL f
pd.read_table(f) Read table data from file or URL f
skiprows=N Keyword to skip N rows in reading for pd.read_table
na_values=[-99.9] Keyword to set list of values to ignore (-99.9 here)
sep Keyword to define field separator
engine='python' or engine='C' Keyword to set reading engine. python is more flexible, but C is faster.
df.transpose() Transpose (rows->columns, columns->rows) pandas dataframe df
df.head(N) first N lines of data (default 5)
df.columns list-like object of column headings
df[cname] Select column with name cname
df[[c1,c2,c3]] Select columns with names c1, c2 and c3
pd.DataFrame(list,columns=cnames) Create pandas dataframe from information in list-like structures list with names from list cnames
pd.to_datetime(str_list) convert list of date strings (e.g. of form YYYY-MM-DD) to datetime representation
df[datetimes].dt.month month from datetime-format column with name datetimes
df[datetimes].dt.year year from datetime-format column with name datetimes
df[datetimes].dt.day day from datetime-format column with name datetimes
df.to_csv(filename,index=False) Write dataframe df to CSV format file, with no index column

Last update: September 29, 2021