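An in-depth look at several ways to iterate over a Pandas DataFrame and how their performance compares; the dictionary and array methods are recommended.
Original title: Efficient "For Loops" in Pandas
Original author: 数据派THU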
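Discussion questions:
2. When working with large datasets, do you prefer vectorization or a fast for loop? Why?
3. Besides the methods covered in the article, what other efficient techniques for iterating over a Pandas DataFrame do you know?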
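Original content
Source: DeepHub IMBA. This article is about 1,500 words; suggested reading time is 5 minutes.
This article explores various ways of iterating over a pandas DataFrame and examines the runtime of each looping method.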
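The timings in this article are taken by hand with time.time() before and after each loop. As a minimal sketch of the same measurement, assuming you want to collect several timings in one place, a small context manager can wrap each loop. The names timed and results are hypothetical and are not part of the original article.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Hypothetical helper: store the elapsed wall-clock seconds under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

results = {}
with timed("sum", results):
    total = sum(range(100_000))  # stand-in for one of the loops timed below

print(results)
```

time.perf_counter() is used here instead of time.time() because it is intended for measuring short intervals; for loops that run for tens of seconds, either should give a similar picture.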
Experimental dataset
import numpy as np
import pandas as pd

# 6 million rows of random integers in [0, 50), in four columns
df = pd.DataFrame(np.random.randint(0, 50, size=(6000000, 4)), columns=('a', 'b', 'c', 'd'))
df.shape  # (6000000, 4)
df.head()
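Six million rows take a long time to loop over, so it can help to try each method on a smaller, reproducible frame first. The sketch below assumes a hypothetical helper, make_frame, that builds the same four-column layout with a fixed seed; the row count and seed are arbitrary choices and not values from the original article.

```python
import numpy as np
import pandas as pd

def make_frame(n_rows=1_000, seed=0):
    """Hypothetical helper: same column layout as the article's df, but seeded and smaller."""
    rng = np.random.default_rng(seed)
    data = rng.integers(0, 50, size=(n_rows, 4))  # integers in [0, 50)
    return pd.DataFrame(data, columns=["a", "b", "c", "d"])

small = make_frame()
print(small.shape)
print(small.head())
```

Because the seed is fixed, every loop variant can be run against identical data and the resulting columns compared directly.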
Iterrows
import time

start = time.time()
# Iterating through the DataFrame using iterrows
for idx, row in df.iterrows():
    if row.a == 0:
        df.at[idx, 'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[idx, 'e'] = row.b - row.c
    else:
        df.at[idx, 'e'] = row.b + row.c
end = time.time()
print(end - start)
# Time taken: 335.212792634964 seconds
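One likely contributor to the iterrows runtime is that every row is handed back as a pandas Series, which has to be constructed on each iteration. A side effect, shown in the sketch below, is that a row drawn from columns of different dtypes is upcast to a single dtype. The two-column frame here is a made-up illustration, not the article's dataset.

```python
import pandas as pd

# Made-up frame with one integer column and one float column
small = pd.DataFrame({"a": [0, 10, 40], "x": [0.5, 1.5, 2.5]})

idx, row = next(small.iterrows())
row_type = type(row).__name__  # each row arrives as a Series
row_dtype = str(row.dtype)     # the int value from 'a' is upcast to match 'x'

print(row_type, row_dtype)
```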
Itertuples
for row in df[:1].itertuples():
    print(row)        # the complete row: index followed by the columns
    print(row.Index)  # the index of the row
    print(row.a)      # the value of column 'a'

start = time.time()
# Iterating through namedtuples
for row in df.itertuples():
    if row.a == 0:
        df.at[row.Index, 'e'] = row.d
    elif 0 < row.a <= 25:
        df.at[row.Index, 'e'] = row.b - row.c
    else:
        df.at[row.Index, 'e'] = row.b + row.c
end = time.time()
print(end - start)
# Time taken: 41 seconds
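The itertuples loop writes one cell at a time through df.at. A variant worth trying, sketched here on a small seeded frame, is to collect the computed values in a plain list and assign the column once after the loop. This is an alternative arrangement rather than the article's own code, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 50, size=(1000, 4)), columns=list("abcd"))

values = []
# index=False drops the Index field, since only the column values are needed
for row in small.itertuples(index=False):
    if row.a == 0:
        values.append(row.d)
    elif row.a <= 25:  # a is never negative here, so this means 0 < a <= 25
        values.append(row.b - row.c)
    else:
        values.append(row.b + row.c)

# Single column assignment instead of one df.at write per row
small["e"] = values
```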
Dictionary
start = time.time()
# converting the DataFrame to a list of dictionaries, one per row
df_dict = df.to_dict('records')
# Iterating through the dictionaries
for row in df_dict:
    if row['a'] == 0:
        row['e'] = row['d']
    elif 0 < row['a'] <= 25:
        row['e'] = row['b'] - row['c']
    else:
        row['e'] = row['b'] + row['c']
# converting back to a DataFrame
df4 = pd.DataFrame(df_dict)
end = time.time()
print(end - start)
# Time taken: 31 seconds
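When the range check on column a is applied to plain Python integers, as in the dictionary loop, operator precedence matters: & binds more tightly than comparison operators, so a condition without parentheses does not mean what it appears to. The short sketch below contrasts the two spellings on one illustrative value.

```python
a = 2

# Parsed as: a <= (25 & a) > 0, i.e. 2 <= 0 > 0, because 25 & 2 == 0
unparenthesized = a <= 25 & a > 0

# Chained comparison: true exactly when a lies in (0, 25]
chained = 0 < a <= 25

print(unparenthesized, chained)
```

The chained form is also the more conventional way to write a range test on scalars in Python.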
Array list
start = time.time()
# create an empty list
list2 = []
# initialize column 'e' with 0s
df['e'] = 0
# iterate through a NumPy array
for row in df.values:
    if row[0] == 0:
        row[4] = row[3]
    elif 0 < row[0] <= 25:
        row[4] = row[1] - row[2]
    else:
        row[4] = row[1] + row[2]
    # append values to a list
    list2.append(row)
# convert the list to a DataFrame
df2 = pd.DataFrame(list2, columns=['a', 'b', 'c', 'd', 'e'])
end = time.time()
print(end - start)
# Time taken: 21 seconds
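Whichever loop is used, it is worth confirming that it produces the intended column. One way, sketched below on a small seeded frame, is to compare the loop output against a vectorized version of the same rule built from nested np.where calls. The vectorized form is added here only as a cross-check; it is not one of the methods timed in the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(0, 50, size=(1000, 4)), columns=list("abcd"))

# Loop over the underlying NumPy array, one row at a time
loop_e = []
for a, b, c, d in small.to_numpy():
    if a == 0:
        loop_e.append(d)
    elif a <= 25:  # a is never negative here, so this means 0 < a <= 25
        loop_e.append(b - c)
    else:
        loop_e.append(b + c)

# Same rule expressed on whole columns at once
a, b, c, d = (small[col].to_numpy() for col in "abcd")
vec_e = np.where(a == 0, d, np.where(a <= 25, b - c, b + c))

print(np.array_equal(vec_e, np.array(loop_e)))
```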