One of the double-edged swords of pandas is that a DataFrame can be updated in many different ways. At work I often find myself generating part of the data at the beginning of a process and filling in more data as the algorithm progresses. As I did this today, I was hit by a mysterious, time-consuming bottleneck. After investigating further, I found that the reason for the long run time was my inefficient updating of the dataframe.
Consider the following code, which creates a dataframe and then adds three placeholder columns A, B, and C.
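In sketch form it looks like this; the 5,000-row size and the value column are placeholders of my own choosing:

```python
import numpy as np
import pandas as pd

# A dataframe with some pre-existing data...
df = pd.DataFrame({"value": np.random.rand(5000)})

# ...plus three empty placeholder columns to be filled in later.
for col in ["A", "B", "C"]:
    df[col] = np.nan
```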
I will now loop over the dataframe rows and update these columns in three different ways.
First attempt: update the row and assign it back to the dataframe.
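Roughly like this, where the multiplications stand in for whatever the real computation does:

```python
for i, row in df.iterrows():
    row["A"] = row["value"] * 2
    row["B"] = row["value"] * 3
    row["C"] = row["value"] * 4
    df.loc[i] = row  # write the whole modified row back into df
```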
On my machine at work this takes 0.156s.
Second attempt: make use of the fact that row is actually a back reference into df. The problem is that during my experiments the back reference broke, and I'm not even sure what I did.
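As a sketch, the loop simply mutates row and drops the write-back:

```python
for i, row in df.iterrows():
    # Mutate the row directly, relying on it referencing df's data.
    # Fragile: iterrows() does not guarantee a view, so this can
    # silently stop writing through (mixed dtypes, for instance,
    # will hand you a copy instead).
    row["A"] = row["value"] * 2
    row["B"] = row["value"] * 3
    row["C"] = row["value"] * 4
```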
On my machine this took 0.010s. That is more than a fifteen-fold speedup.
The third attempt is, in its simplicity, the absolute winner.
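It writes each cell directly with set_value(), which bypasses the alignment and indexing overhead of the row-level assignment; again a sketch with the same placeholder arithmetic:

```python
for i, row in df.iterrows():
    df.set_value(i, "A", row["value"] * 2)
    df.set_value(i, "B", row["value"] * 3)
    df.set_value(i, "C", row["value"] * 4)
```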
This took 0.00344s. I.e., just by rewriting the assignments, I got a 45-fold speedup!
After writing the above lines, I learned that set_value() will be deprecated. But there is another accessor, at, that is almost as fast.
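The same loop with .at as a drop-in replacement:

```python
for i, row in df.iterrows():
    # .at is the label-based scalar accessor recommended in place
    # of the deprecated set_value().
    df.at[i, "A"] = row["value"] * 2
    df.at[i, "B"] = row["value"] * 3
    df.at[i, "C"] = row["value"] * 4
```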
This took 0.0044s, about 25% slower, but apparently this construct is "more" correct.