Sunday, 29 September 2013

Getting rows from a data frame which satisfy a condition in pandas

Getting rows from a data frame which satisfy a condition in pandas

I have a data frame and I have a range of numbers. I want to find the rows
where values in a particular column lie in that range.
This seems like a trivial job. I tried with the techniques given here -
http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-boolean
I took a simple example:
In [6]: df_s
Out[6]:
time value
0 1 3
1 2 4
2 3 3
3 4 4
4 5 3
5 6 2
6 7 2
7 8 3
8 9 3
In [7]: df_s[df_s.time.isin(range(1,8))]
Out[7]:
time value
0 1 3
1 2 4
2 3 3
3 4 4
4 5 3
5 6 2
6 7 2
Then, I tried with a sample from the data set I am working with which has
timestamp and value as columns:
In [8]: df_s = pd.DataFrame({'time': range(1379945743841,1379945743850),
'value': [3,4,3,4,3,2,2,3,3]})
In [9]: df_s
Out[9]:
time value
0 1379945743841 3
1 1379945743842 4
2 1379945743843 3
3 1379945743844 4
4 1379945743845 3
5 1379945743846 2
6 1379945743847 2
7 1379945743848 3
8 1379945743849 3
In [10]: df_s[df_s.time.isin(range(1379945743843,1379945743845))]
Out[10]:
Empty DataFrame
Columns: [time, value]
Index: []
Why doesn't the same technique work in this case? What am I doing wrong?
I tried another approach:
In [11]: df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-45c44def41b4> in <module>()
----> 1 df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]
ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
Then, I tried with a bit more complex approach:
In [13]: df_s.ix[[idx for idx in df_s.index if df_s.ix[idx]['time'] in
range(1379945743843, 1379945743845)]]
Out[13]:
time value
2 1379945743843 3
3 1379945743844 4
This gives the desired result but it takes way too much time to give any
result on my original data set. It has 209920 rows and it is expected that
the number of rows will increase when I actually put my code to test.
Can anyone direct to me towards the right approach?
I am using python 2.7.3 and pandas 0.12.0
Update:
Jeff's answer worked.
But I find the isin approach more simple, intuitive and less cluttered.
Please comment if anyone has any idea why it failed.
Thanks!

No comments:

Post a Comment