Intro to Numpy and Simple Plotting
==================================

In this lesson you will be introduced to Numpy, and to some simple plotting
using pylab. A simple script will be introduced that opens a spreadsheet of
data, computes some simple statistics, and creates some simple plots.

Numpy = Numerical + Python
--------------------------

Numpy is **the** main building block for doing scientific computing with
python. A wealth of information can be found here:

http://numpy.scipy.org

Numpy for Matlab users
----------------------

As an object-oriented language, python may seem very different from matlab,
and in many ways it is. However, for people with a Matlab background it can
be fairly easy to switch to python using Numpy. These pages are very useful
for understanding:

* which numpy function maps to a given matlab function
* the difference between a numpy array and a numpy matrix
* how to index arrays, and fancy indexing

http://mathesaurus.sourceforge.net/matlab-numpy.html

http://www.scipy.org/NumPy_for_Matlab_Users

Script Deconstruction : numpy_example.py
----------------------------------------

:download:`numpy_example.py`

In this example, I will open a file (this one uses the colon : as a
delimiter) and read in different fields so I can analyze and visualize the
data.

**import** ::

    import numpy as np

This is the main tool used in the rest of this script.

Parse the Header
----------------

I know my data file has a header row as its first row of data, so I plan to
read this in and use it to help load the rest of my data.

The beginning of the first row looks like this::

    "BAC#":"Gender":"Handedness":"Years Ed": ...

As you can see

* Fields are separated by a colon :
* There are extra quotes " in the fields

So I will

* read in the first line
* strip the unnecessary quotes
* split on the colon to create a list
* cast the result to a numpy array

code::

    infile = 'Cindee_11_9_10_v2.csv'
    hdrstr = open(infile).readline()
    hdrs = hdrstr.replace('"','').split(':')
    hdrs = np.array(hdrs)

Now I want to find the indices of a field, let's say all the fields that
have 'BAC' in them::

    bac_ind = [np.where(hdrs==x) for x in hdrs if 'BAC' in x]
    bac_ind = [x[0][0] for x in bac_ind]  # simplify
    bac_hdr_str = hdrs[bac_ind]

np.where returns the indices where a logical test on an array is true, so
this lets me find the indices of the fields that have the term "BAC" in
them.

Using loadtxt
-------------

Now that I have figured out what fields I want, I'll use **np.loadtxt** to
load them into an array so I can work with the data::

    np.loadtxt(fname, dtype=float, delimiter=None, skiprows=0, usecols=None)

np.loadtxt

* fname = filename of the text file to open
* dtype = base datatype to cast the data to
* delimiter = character used to specify the boundary between fields
* skiprows = number of leading rows to skip (in case you need to skip headers)
* usecols = tuple indicating the specific columns of data you wish to extract

example::

    mydata = (0,1,3,6,10,36)
    dat = np.loadtxt(infile, delimiter=':', dtype=np.str,
                     usecols=mydata, skiprows=1)

    array([['"BAC001"', '"F"', '18', '63', '30', '13'],
           ['"BAC002"', '"M"', '18', '70', '30', '7'],
           ['"BAC003"', '"F"', '14', '61', '30', '12'],
           ...

I'm skipping the first row since I have already dealt with getting my
headers, and I'm casting the data to a dtype of string (*np.str*) because
the columns I am grabbing have mixed types. I will re-cast them to the
correct types later; I could also load the different types separately to
avoid this issue.
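The next steps index into ``dat`` using a ``keys`` dictionary that maps field
names to column positions within ``dat``. The snippet below is only a minimal
sketch of one way to build such a mapping; it is not taken from
numpy_example.py, and it assumes that one of the headers selected by
``mydata`` is literally ``MMSE``::

    # hypothetical helper: map lower-cased header names to their column
    # position *within dat* (not within the original file)
    keys = dict((hdrs[c].strip().lower(), i) for i, c in enumerate(mydata))
    # keys['mmse'] would then be the dat column holding the MMSE scores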
Grab **MMSE** out of the dat array, replace missing fields, and cast to
float::

    empty = np.where(dat[:,keys['mmse']]=='')
    dat[empty,keys['mmse']] = np.nan
    mmse = dat[:,keys['mmse']].astype(np.float)

    array([ 30.,  30.,  30.,  28.,  30.,  29.,  28.,  30.,  28.,  29.,  29.,
            28.,  30.,  30.,  NaN,  29.,  29.,  30.,  30.,  30.,  30.,  29.,
            ...

Simple Statistics
-----------------

I may want to calculate a simple mean, but this will be an issue as I have
NaN in the array. However, it is simple to mask the array::

    mmse_mean = mmse[mmse>0].mean()
    28.991735537190081
    mmse[mmse>0].std()

To make this more interesting, I now get arrays of data for CVLT and age.
For each of these, there may be different regions where subjects have
missing fields. So if I want to compare two variables, I need to find the
places where they overlap::

    overlap2 = np.logical_and(mmse>0, cvlt>0)

I can now use some other numpy tools to see how the two measures relate to
each other.

polyfit::

    p = np.polyfit(mmse[overlap2], cvlt[overlap2], 1)
    array([  0.87076373, -13.59299814])
    x = [mmse[overlap2].min(), mmse[overlap2].max()]
    y = np.polyval(p, x)

least squares::

    m = np.ones((mmse[overlap2].shape[0], 2))
    m[:,0] = mmse[overlap2]
    results = np.linalg.lstsq(m, cvlt[overlap2])
    print results[0]
    array([  0.87076373, -13.59299814])

Simple Plotting
---------------

In simple situations like this I use the -pylab flag of IPython when I want
to generate some simple plots::

    ipython -pylab

It requires matplotlib to be installed, but is a powerful tool. There is a
wonderful set of example plots to be found here:

http://matplotlib.sourceforge.net/gallery.html

So if we want to see our mmse and cvlt data::

    plot(mmse, cvlt, 'ro')
    plot(x, y, 'k-')
    axis([0,35,0,20])
    title('Relation of MMSE to CVLT')
    xlabel('MMSE')
    ylabel('CVLT LDFR')

.. image:: simple_plot.png

Homework
--------

1. Open a text file of (your own) data
2. Parse the header
3. Open the data and grab a subset of fields
4. Generate a simple statistic and create a simple plot
5. Save a png image

Email me if you get stuck; you have a few weeks.
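If you have not saved a figure from pylab before, step 5 can be done with
matplotlib's ``savefig``. A minimal sketch, assuming the interactive pylab
session and the plot built above (the filename is only a placeholder)::

    # save the current figure as a png (filename is just an example)
    savefig('my_plot.png', dpi=100)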