Brustmann: Pandas: interpolate missing rows and plot multiple series in dataframe

Pandas: interpolate missing rows and plot multiple series in dataframe

I'm looking for pointers to the appropriate docs for accomplishing the
analysis task described below with pandas in pylab. I've previously
written python + matplotlib functions that accomplish much of this, but
the resulting code is slow and cumbersome to maintain. It seems like
pandas has the capabilities needed but I'm getting bogged down trying to
find the right approach and functions.
In [1]: import pandas as pd
In [6]: df = pd.read_csv("tinyexample.csv", parse_dates=2)
In [7]: df
Out[7]:
I t A B C D E
0 1 08/06/13 02:34 PM 109.40 105.50 124.30 1.00 1930.95
1 1 08/06/13 02:35 PM 110.61 106.21 124.30 0.90 1964.89
2 1 08/06/13 02:37 PM 114.35 108.84 124.30 0.98 2654.33
3 1 08/06/13 02:38 PM 115.38 109.81 124.30 1.01 2780.63
4 1 08/06/13 02:40 PM 116.08 110.94 124.30 0.99 2521.28
5 4 08/06/13 02:34 PM 105.03 100.96 127.43 1.12 2254.51
6 4 08/06/13 02:35 PM 106.73 101.72 127.43 1.08 2661.76
7 4 08/06/13 02:38 PM 111.21 105.17 127.38 1.06 3163.07
8 4 08/06/13 02:40 PM 111.69 106.28 127.38 1.09 2898.73
The above is a tiny slice of minute-by-minute readings from a network of
radio-connected data loggers. The sample shows ouput from 2 loggers over a
10 minute period. The actual data files have output from dozens of loggers
over multiple days.
Column 'I' is the logger id, 't' is a timestamp, 'A-C' are temperatures,
'D' is a flow rate, and 'E' is an energy rate computed from A, B, and D.
Because of poor radio connectivity there are missing readings in all
loggers at random times.
Specifically, I want to do something like the following
for i in I:
## Insert rows for all missing timestamps with interpolated values for
A through E
## Update a new column 'F' with a cumulative sum of 'E' (actually E/60)
Then I want to be able to define a plotting function that allows me to
output vertically-aligned strip-chart plots similar to those shown in the
docs at http://pandas.pydata.org/pandas-docs/dev/visualization.html. I've
tried
df.plot(subplots=True, sharex=True)
which almost does what I need, except that
It plots by index number rather than by date.
It doesn't create individual plot lines for each logger id.

Finally, I'd want to be able to choose a subset of the logger id's and
data columns to plot, e.g.
def myplot(df, ilist, clist):
"""
ilist is of the form [ n, m, p, ...] where n, m, and p are logger id's
in column 'I'
clist is a list of column labels.
Produces stack of strip chart plots, one for each column contain plot
lines for each id.
"""

Brustmann

Wednesday, 21 August 2013

Pandas: interpolate missing rows and plot multiple series in dataframe

No comments:

Post a Comment