This material assumes that you have programmed before. This first lecture provides a quick introduction to programming in Python for those who either haven't used Python before or need a quick refresher.
Let's start with a hypothetical problem we want to solve. We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites, we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:
year,temperature,rainfall,mosquitos
2001,87,222,198
2002,72,103,105
2003,77,176,166
In order to load the data, we need to import a library called Pandas that knows how to operate on tables of data.
We can now use Pandas to read our data file.
The read_csv() function belongs to the pandas library. In order to run it we need to tell Python that it is part of pandas, and we do this using the dot notation, which is used everywhere in Python to refer to parts of larger things.
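As a minimal sketch of the import and the dotted call described above, the snippet below reads the sample rows shown earlier from an in-memory string instead of a file, so it runs without the data directory. The use of io.StringIO and the names sample_csv and sample are assumptions made for illustration; they are not part of the lesson's files.

```python
import io

import pandas

# Sample rows copied from the file excerpt above. They are held in a string
# (an assumption for illustration) so the example needs no file on disk.
sample_csv = """year,temperature,rainfall,mosquitos
2001,87,222,198
2002,72,103,105
2003,77,176,166
"""

# read_csv is reached through the pandas name using dot notation.
sample = pandas.read_csv(io.StringIO(sample_csv))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the header row
```

With a real file, the path string would simply take the place of the io.StringIO object.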
When we are finished typing and press Shift+Enter, the notebook runs our command and shows us its output. In this case, the output is the data we just loaded.
Our call to pandas.read_csv() read the data into memory, but didn't save it anywhere. To do that, we need to assign the resulting data frame to a variable. In Python we use = to assign a new value to a variable:
data = pandas.read_csv('data/A1_mosquito_data.csv')
This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:
   year  temperature  rainfall  mosquitos
0  2001           80       157        150
1  2002           85       252        217
2  2003           86       154        153
3  2004           87       159        158
4  2005           74       292        243
5  2006           75       283        237
6  2007           80       214        190
7  2008           85       197        181
8  2009           74       231        200
9  2010           74       207        184
print(data) tells Python to display the data frame as text. Alternatively we could just include data as the last value in a code cell:
This tells the IPython Notebook to display the data object, which is why we see a nicely formatted table.
Once we have imported the data we can start doing things with it.
First, let's ask what type of thing data refers to:
The data is stored in a data structure called a DataFrame. Other data structures commonly used in scientific computing include NumPy arrays and NumPy matrices, which can be used for doing linear algebra.
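To make the type question concrete, here is a small sketch that builds a two-row table by hand and asks Python for the type of the table, of one column, and of the values underneath that column. The hand-built table is an assumption standing in for the loaded file.

```python
import pandas

# A tiny hand-built table (values taken from the first two rows printed above).
data = pandas.DataFrame({"year": [2001, 2002], "temperature": [80, 85]})

print(type(data))                 # the whole table is a DataFrame
print(type(data["year"]))         # a single column is a Series
print(type(data["year"].values))  # the values underneath are a NumPy array
```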
We can select an individual column of data using its name:
0    2001
1    2002
2    2003
3    2004
4    2005
5    2006
6    2007
7    2008
8    2009
9    2010
Name: year, dtype: int64
Or we can select several columns of data at once:
We can also select subsets of rows using slicing. Say we just want the first two rows of data:
There are a couple of important things to note here. First, Python indexing starts at zero. In contrast, programming languages like R and MATLAB start counting at 1, because that's what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do. This means that if we have 5 things in Python they are numbered 0, 1, 2, 3, 4, and the first row in a data frame is always row 0.
The other thing to note is that the subset of rows starts at the first value and goes up to, but does not include, the second value. Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
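The two rules above (counting from zero, and stopping just before the upper bound) are the same for ordinary Python lists, so they can be tried without any data file. The sketch below uses a plain list of years; the variable names are illustrative.

```python
# The first five years from the table above, as an ordinary Python list.
years = [2001, 2002, 2003, 2004, 2005]

first = years[0]        # indexing starts at zero, so this is 2001
first_two = years[0:2]  # positions 0 and 1; position 2 is not included
middle = years[1:4]     # positions 1, 2 and 3

print(first)
print(first_two)
print(middle)

# The length of a slice equals the upper bound minus the lower bound.
print(len(first_two) == 2 - 0)
print(len(middle) == 4 - 1)
```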
One thing that we can't do with this syntax is directly ask for the data from a single row:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-c805864c0d75> in <module>()
----> 1 data[1]

/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   1741             return self._getitem_multilevel(key)
   1742         else:
-> 1743             return self._getitem_column(key)
   1744
   1745     def _getitem_column(self, key):

/usr/lib/python3/dist-packages/pandas/core/frame.py in _getitem_column(self, key)
   1748         # get column
   1749         if self.columns.is_unique:
-> 1750             return self._get_item_cache(key)
   1751
   1752         # duplicate columns & possible reduce dimensionaility

/usr/lib/python3/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1056         res = cache.get(item)
   1057         if res is None:
-> 1058             values = self._data.get(item)
   1059             res = self._box_item_values(item, values)
   1060             cache[item] = res

/usr/lib/python3/dist-packages/pandas/core/internals.py in get(self, item, fastpath)
   2804
   2805         if not isnull(item):
-> 2806             loc = self.items.get_loc(item)
   2807         else:
   2808             indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/lib/python3/dist-packages/pandas/core/index.py in get_loc(self, key)
   1383         loc : int if unique index, possibly slice or mask if not
   1384         """
-> 1385         return self._engine.get_loc(_values_from_object(key))
   1386
   1387     def get_value(self, series, key):

index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3767)()

index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3645)()

hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11911)()

hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11864)()

KeyError: 1
This is because there are several things that we could mean by data[1], so if we want a single row we can either take a slice that returns a single row:
or use the .iloc indexer, which stands for "integer location" since we are looking up the row based on its integer index.
year           2002
temperature      85
rainfall        252
mosquitos       217
Name: 1, dtype: int64
We can also use this same syntax for getting larger subsets of rows:
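The row-selection options described above can be compared side by side in a short sketch. It rebuilds the first three rows of the table by hand (an assumption so that it runs without the CSV file) and then tries a one-row slice, .iloc with a single position, .iloc with a range, and the plain data[1] lookup that fails.

```python
import pandas

# First three rows of the table printed earlier in this lesson.
data = pandas.DataFrame({
    "year": [2001, 2002, 2003],
    "temperature": [80, 85, 86],
    "rainfall": [157, 252, 154],
    "mosquitos": [150, 217, 153],
})

one_row_frame = data[1:2]  # slice: still a DataFrame, with one row
one_row = data.iloc[1]     # integer location: a Series holding row 1
two_rows = data.iloc[1:3]  # rows 1 and 2 as a DataFrame

print(one_row_frame)
print(one_row)
print(two_rows)

# data[1] is treated as a column lookup, and no column is named 1.
try:
    data[1]
except KeyError as err:
    print("KeyError:", err)
```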
We can also subset the data based on the values in other columns:
data['temperature'][data['year'] > 2005]
5    75
6    80
7    85
8    74
9    74
Name: temperature, dtype: int64
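It may help to see the comparison and the selection as two separate steps. In this sketch, which uses a four-row hand-built table (an assumption), the comparison first produces a column of True and False values, and that column is then used to pick rows. The name is_recent is illustrative, and the .loc form at the end is an alternative spelling rather than what the lesson itself uses.

```python
import pandas

data = pandas.DataFrame({
    "year": [2004, 2005, 2006, 2007],
    "temperature": [87, 74, 75, 80],
})

# Comparing a column with a value gives one True/False entry per row.
is_recent = data["year"] > 2005
print(is_recent)

# The True/False column then picks out the rows where it is True.
recent_temps = data["temperature"][is_recent]
print(recent_temps)

# .loc selects rows and a column in a single step (alternative spelling).
same_temps = data.loc[is_recent, "temperature"]
print(same_temps.equals(recent_temps))
```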
Data frames also know how to perform common mathematical operations on their values. If we want to find the average value for each variable, we can just ask the data frame for its mean values:
year           2005.5
temperature      80.0
rainfall        214.6
mosquitos       191.3
dtype: float64
Data frames have lots of useful methods:
year           2010
temperature      87
rainfall        292
mosquitos       243
dtype: int64
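As a sketch of these summary methods, the snippet below rebuilds the ten-row table printed earlier by hand (so that it runs without the CSV file) and calls mean(), max() and min() on the whole data frame, then mean() on a single column.

```python
import pandas

# The ten rows printed earlier, typed in by hand for a self-contained example.
data = pandas.DataFrame({
    "year": list(range(2001, 2011)),
    "temperature": [80, 85, 86, 87, 74, 75, 80, 85, 74, 74],
    "rainfall": [157, 252, 154, 159, 292, 283, 214, 197, 231, 207],
    "mosquitos": [150, 217, 153, 158, 243, 237, 190, 181, 200, 184],
})

print(data.mean())  # one mean per column
print(data.max())   # one maximum per column
print(data.min())   # one minimum per column

# Called on a single column, the same method returns a single number.
print(data["rainfall"].mean())
```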
Import the data from A2_mosquito_data.csv, create a new variable that holds a data frame with only the weather data, and print the means and standard deviations for the weather variables.
Once we have some data we often want to be able to loop over it to perform the same operation repeatedly. A for loop in Python takes the general form:
for item in list:
    do_something
So if we want to loop over the temperatures and print out their values in degrees Celsius (instead of Fahrenheit) we can use:
temps = data['temperature']
for temp_in_f in temps:
    temp_in_c = (temp_in_f - 32) * 5 / 9.0
    print(temp_in_c)
26.6666666667
29.4444444444
30.0
30.5555555556
23.3333333333
23.8888888889
26.6666666667
29.4444444444
23.3333333333
23.3333333333
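A common variation on the loop above is to keep the converted values instead of only printing them. The sketch below does this with a plain list of the first five temperatures (an assumption so that it runs without the data file); the list names are illustrative.

```python
# First five temperatures from the table, in degrees Fahrenheit.
temps_in_f = [80, 85, 86, 87, 74]

# Collect each converted value so it can be reused after the loop ends.
temps_in_c = []
for temp_in_f in temps_in_f:
    temp_in_c = (temp_in_f - 32) * 5 / 9.0
    temps_in_c.append(temp_in_c)

# Loop over both lists together to print each pair of values.
for temp_in_f, temp_in_c in zip(temps_in_f, temps_in_c):
    print(temp_in_f, "F is", round(temp_in_c, 1), "C")
```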
That looks good, but why did we use 9.0 instead of 9? If you try changing it, you'll still get the same results.
Computers store two different kinds of numbers: integers and floating point numbers (or floats). Writing 9 creates an integer, while 9.0 creates a float. In Python 2, dividing one integer by another would throw away the remainder, so 5/9 would give 0. In Python 3, division does what you'd expect: the result is a floating point number. But it's a good idea to be careful, so we made sure that at least one of the numbers for division is a float.
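The difference can be checked directly in Python 3, where the old remainder-discarding behavior is still available through the separate // operator. This is a small sketch; the variable names are illustrative.

```python
true_result = 5 / 9    # true division: always a float in Python 3
floor_result = 5 // 9  # floor division: the remainder is thrown away
with_float = 5 / 9.0   # same value as 5 / 9 in Python 3

print(true_result)
print(floor_result)
print(true_result == with_float)
print(type(9), type(9.0))
```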
The other standard thing we need to know how to do in Python is conditionals, or if/then/else statements. In Python the basic syntax is:
if condition:
    do_something
So if we want to check whether the first temperature in our data is greater than 75 degrees we would use:
temp = data['temperature'][0]
if temp > 75:
    print("The temperature is greater than 75")
The temperature is greater than 75
We can also use == for equality, <= for less than or equal to, >= for greater than or equal to, and != for not equal to.
Additional conditions can be handled using elif and else:
temp = data['temperature'][0]
if temp < 80:
    print("The temperature is < 80")
elif temp > 80:
    print("The temperature is > 80")
else:
    print("The temperature is equal to 80")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-d4f03b5de90c> in <module>()
----> 1 temp = data['temperature'][0]
      2 if temp < 80:
      3     print("The temperature is < 80")
      4 elif temp > 80:
      5     print("The temperature is > 80")

NameError: name 'data' is not defined
The NameError here means the cell was run in a session where data had not been loaded yet. Run the earlier pandas.read_csv() assignment first and the conditional will work.
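Combining a loop with if, elif and else gives one message per value. The sketch below wraps the three-way comparison in a small helper; the function name compare_to and its threshold parameter are hypothetical names chosen for illustration, and the temperatures are typed in by hand so the example does not depend on data being loaded.

```python
def compare_to(temp, threshold=80):
    """Describe how temp relates to threshold (hypothetical helper)."""
    if temp < threshold:
        return "less than"
    elif temp > threshold:
        return "greater than"
    else:
        return "equal to"


# Exactly one branch runs for each value in the list.
for temp in [74, 80, 87]:
    print("The temperature", temp, "is", compare_to(temp), 80)
```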
Import the data from A2_mosquito_data.csv, determine the mean temperature, and loop over the temperature values. For each value print out whether it is greater than the mean, less than the mean, or equal to the mean.
The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," and the best way to develop insight is often to visualize data. The main plotting library in Python is matplotlib. To get started, let's tell the IPython Notebook that we want our plots displayed inline, rather than in a separate viewing window, with the %matplotlib inline command.
The % at the start of the line signals that this is a command for the notebook, rather than a statement in Python. Next, we will import the pyplot module from matplotlib, but since pyplot is a fairly long name to type repeatedly let's give it an alias.
from matplotlib import pyplot as plt
This import statement shows two new things. First, we can import part of a library by using the from library import submodule syntax. Second, we can use a different name to refer to the imported library by using as newname.
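The same import forms work for any library. As a sketch that runs without matplotlib installed, the example below applies them to two standard-library modules; the aliases m and p are arbitrary names chosen for illustration.

```python
import math as m           # whole library under a shorter name
from os import path        # one part of a library, under its own name
from os import path as p   # one part of a library, under an alias

print(m.sqrt(16.0))
print(path.join("data", "A1_mosquito_data.csv"))

# An alias is just another name for the same object.
print(p is path)
```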
Now, let's make a simple plot showing how the number of mosquitos varies over time. We'll use the site you've been doing exercises with since it has a longer time-series.
data = pandas.read_csv('data/A2_mosquito_data.csv')
plt.plot(data['year'], data['mosquitos'])
[<matplotlib.lines.Line2D at 0x7fa8ec4e4a20>]
More complicated plots can be created by adding a little additional information. Let's say we want to look at how the different weather variables vary over time.
plt.figure(figsize=(10.0, 3.0))

plt.subplot(1, 2, 1)
plt.plot(data['year'], data['temperature'], 'ro-')
plt.xlabel('Year')
plt.ylabel('Temperature')

plt.subplot(1, 2, 2)
plt.plot(data['year'], data['rainfall'], 'bs-')
plt.xlabel('Year')
plt.ylabel('Rain Fall')
plt.show()
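The same two-panel figure can also be written with plt.subplots, which returns the figure and its axes as objects. This is an alternative style, not the one the lesson uses. The sketch below is self-contained: it uses five hand-typed rows instead of the CSV file and selects the non-interactive Agg backend so that it runs outside a notebook. Both choices are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

from matplotlib import pyplot as plt

# First five rows of the table printed earlier, typed in by hand.
years = [2001, 2002, 2003, 2004, 2005]
temperature = [80, 85, 86, 87, 74]
rainfall = [157, 252, 154, 159, 292]

# One row, two columns of panels; each panel is its own axes object.
fig, (left, right) = plt.subplots(1, 2, figsize=(10.0, 3.0))

left.plot(years, temperature, "ro-")  # red circles joined by a line
left.set_xlabel("Year")
left.set_ylabel("Temperature")

right.plot(years, rainfall, "bs-")    # blue squares joined by a line
right.set_xlabel("Year")
right.set_ylabel("Rainfall")

fig.tight_layout()  # keep the axis labels from overlapping
```

In a notebook with inline plotting enabled, the figure appears when the cell finishes running.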
Using the data in A2_mosquito_data.csv, plot the relationship between the number of mosquitos and temperature and the number of mosquitos and rainfall.
Use the pandas library to work with data tables in Python.
Use variable = value to assign a value to a variable.
Use print(something) to display the value of something.
Use dataframe['columnname'] to select a column of data.
Use dataframe[start_row:stop_row] to select rows from a data frame.
Use methods such as dataframe.mean(), dataframe.max(), and dataframe.min() to calculate simple statistics.
Use for x in list: to loop over values.
Use if condition: to make conditional decisions.
Use the matplotlib library for creating simple visualizations.
With the requisite Python background out of the way, now we're ready to dig in to analyzing our data, and along the way learn how to write better code, more efficiently, that is more likely to be correct.