One of the double edged swords in pandas is that it is possible to update a DataFrame in lots of different ways. At work I often find myself generating part of the data in the begininng of a process, and as the algorithm progresses, I fill it in with more data. As I did this today, I was hit with a mysterious time consuming bottleneck. After investigating this further, did I find that the reason for the long run time was my non efficient updating of the dataframe.
Consider the following code, which creates a dataframe and then creates three placeholder columns A, B, C:
import pandas as pd
import numpy as numpy
pd.options.display.float_format = '{:,.0f}'.format
import time
df = pd.DataFrame(numpy.random.rand(20,20)*100)
df.loc[:,'A'] = None
df.loc[:,'B'] = None
df.loc[:,'C'] = None
I will now loop over the dataframe rows and update these columns in three different ways.
First attempt, update the row, and assign it back to the dataframe:
t0 = time.time()
for idx,row in df.iterrows():
row.loc[('A','B','C')] = (100+idx,200+idx,300+idx)
df.loc[idx] = row
print 'First: ', time.time()-t0
On my machine at work this takes 0.156s.
Second attempt, make use of the fact that row
is actually a back reference in to df
. The problem is that during my experiments the back reference broke. I’m not even sure what I did.
t0 = time.time()
for idx,row in df.iterrows():
row.loc[('A','B','C')] = (100+idx,200+idx,300+idx)
print 'Second: ', time.time()-t0
On my machine this took 0.010s . That is more than a fifteen time speedup.
The third attempt is in its simplicity is the absolute winner.
t0 = time.time()
for idx,row in df.iterrows():
df.set_value(idx,'A', 100+idx)
df.set_value(idx,'B', 200+idx)
df.set_value(idx,'C', 300+idx)
print 'Third: ', time.time()-t0
This took 0.00344s .
I.e. just by rewriting the assignments, I got a 45 times speedup!
After writing the above lines did I learn that set_value()
will be deprecated. But there is another accessor at
that is almost as fast:
t0 = time.time()
for idx,row in df.iterrows():
df.at[idx,'A'] = 100+idx
df.at[idx,'B'] = 200+idx
df.at[idx,'C'] = 300+idx
print 'Forth: ', time.time()-t0
This took 0.0044s , about 25% slower, but apparently this constract is “more” correct.