Intro to Numpy and Simple Plotting
==================================

In this lesson you will be introduced to Numpy, and to some simple plotting
using pylab. A simple script will be introduced that opens a spreadsheet of
data, computes some simple statistics, and creates some simple plots.

Numpy = Numerical + Python
--------------------------

Numpy is **the** main building block for doing scientific computing with
python. A wealth of information can be found here:

http://numpy.scipy.org

Numpy for Matlab users
----------------------

As an object-oriented language, python may seem very different from matlab,
and in many ways it is. However, for people with a Matlab background it can
be fairly easy to switch to python using Numpy. These pages are very useful
for understanding:

* which numpy function maps to a given matlab function
* the difference between a numpy array and a numpy matrix
* how to index arrays, and fancy indexing

http://mathesaurus.sourceforge.net/matlab-numpy.html

http://www.scipy.org/NumPy_for_Matlab_Users

Script Deconstruction : numpy_example.py
----------------------------------------

:download:`numpy_example.py`

In this example, I will open a file (this one uses the colon : as a
delimiter) and read in different fields so I can analyze and visualize the
data.

**import** ::

    import numpy as np

This is the main tool used in the rest of this script.

Parse the Header
----------------

I know my data file has a header row as its first row of data, so I plan to
read this in and use it to help load the rest of my data.

The beginning of the first row looks like this::

    "BAC#":"Gender":"Handedness":"Years Ed": ...

As you can see

* Fields are separated by a colon :
* There are extra quotes " in the fields

So I will

* read in the first line
* strip the unnecessary quotes
* split on the colon to create a list
* cast the result to a numpy array

code::

    infile = 'Cindee_11_9_10_v2.csv'
    hdrstr = open(infile).readline()
    hdrs = hdrstr.replace('"','').split(':')
    hdrs = np.array(hdrs)

Now I want to find the indices of a field, let's say all the fields that
have 'BAC' in them::

    bac_ind = [np.where(hdrs==x) for x in hdrs if 'BAC' in x]
    bac_ind = [x[0][0] for x in bac_ind]  # simplify
    bac_hdr_str = hdrs[bac_ind]

np.where returns the indices where a logical test on an array is true, so
this lets me find the indices of the fields that have the term "BAC" in
them.

Using loadtxt
-------------

Now that I have figured out what fields I want, I'll use **np.loadtxt** to
load them into an array so I can work with the data::

    np.loadtxt(fname, dtype=float, delimiter=None, skiprows=0, usecols=None)

np.loadtxt

* fname = filename of the text file to open
* dtype = base datatype to cast the data to
* delimiter = character used to specify the boundary between fields
* skiprows = number of leading rows to skip (in case you need to skip headers)
* usecols = tuple indicating the specific columns of data you wish to extract

example::

    mydata = (0,1,3,6,10,36)
    dat = np.loadtxt(infile, delimiter=':', dtype=np.str,
                     usecols=mydata, skiprows=1)

    array([['"BAC001"', '"F"', '18', '63', '30', '13'],
           ['"BAC002"', '"M"', '18', '70', '30', '7'],
           ['"BAC003"', '"F"', '14', '61', '30', '12'],
           ...

I'm skipping the first row since I have already dealt with getting my
headers, and I'm casting the data to a dtype of string (*np.str*) because
the columns I am grabbing have mixed types. I will re-cast them to the
correct types later; I could also load the different types separately to
avoid this issue.
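The next steps index into ``dat`` using a ``keys`` dictionary that maps field
names to column positions within ``dat``. The snippet below is only a minimal
sketch of one way to build such a mapping; it is not taken from
numpy_example.py, and it assumes that one of the headers selected by
``mydata`` is literally ``MMSE``::

    # hypothetical helper: map lower-cased header names to their column
    # position *within dat* (not within the original file)
    keys = dict((hdrs[c].strip().lower(), i) for i, c in enumerate(mydata))
    # keys['mmse'] would then be the dat column holding the MMSE scores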
Grab **MMSE** out of the dat array, replace missing fields, and cast to
float::

    empty = np.where(dat[:,keys['mmse']]=='')
    dat[empty,keys['mmse']] = np.nan
    mmse = dat[:,keys['mmse']].astype(np.float)

    array([ 30.,  30.,  30.,  28.,  30.,  29.,  28.,  30.,  28.,  29.,  29.,
            28.,  30.,  30.,  NaN,  29.,  29.,  30.,  30.,  30.,  30.,  29.,
            ...

Simple Statistics
-----------------

I may want to calculate a simple mean, but this will be an issue as I have
NaN in the array. However, it is simple to mask the array::

    mmse_mean = mmse[mmse>0].mean()
    28.991735537190081
    mmse[mmse>0].std()

To make this more interesting, I now get arrays of data for CVLT and age.
For each of these, there may be different regions where subjects have
missing fields. So if I want to compare two variables, I need to find the
places where they overlap::

    overlap2 = np.logical_and(mmse>0, cvlt>0)

I can now use some other numpy tools to see how the two measures relate to
each other.

polyfit::

    p = np.polyfit(mmse[overlap2], cvlt[overlap2], 1)
    array([  0.87076373, -13.59299814])
    x = [mmse[overlap2].min(), mmse[overlap2].max()]
    y = np.polyval(p, x)

least squares::

    m = np.ones((mmse[overlap2].shape[0], 2))
    m[:,0] = mmse[overlap2]
    results = np.linalg.lstsq(m, cvlt[overlap2])
    print results[0]
    array([  0.87076373, -13.59299814])

Simple Plotting
---------------

In simple situations like this I use the -pylab flag of IPython when I want
to generate some simple plots::

    ipython -pylab

It requires matplotlib to be installed, but is a powerful tool. There is a
wonderful set of example plots to be found here:

http://matplotlib.sourceforge.net/gallery.html

So if we want to see our mmse and cvlt data::

    plot(mmse, cvlt, 'ro')
    plot(x, y, 'k-')
    axis([0,35,0,20])
    title('Relation of MMSE to CVLT')
    xlabel('MMSE')
    ylabel('CVLT LDFR')

.. image:: simple_plot.png

Homework
--------

1. Open a text file of (your own) data
2. Parse the header
3. Open the data and grab a subset of fields
4. Generate a simple statistic and create a simple plot
5. Save a png image

Email me if you get stuck; you have a few weeks.
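If you have not saved a figure from pylab before, step 5 can be done with
matplotlib's ``savefig``. A minimal sketch, assuming the interactive pylab
session and the plot built above (the filename is only a placeholder)::

    # save the current figure as a png (filename is just an example)
    savefig('my_plot.png', dpi=100)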