本文主要内容取自Sofia Heisler在PyCon 2017上的演讲No More Sad Pandas Optimizing Pandas Code for Speed and Efficiency,讲稿代码和幻灯片见GitHub。
Set Up
示例数据
ean_hotel_id | name | address1 | city | state_province | postal_code | latitude | longitude | star_rating | high_rate | low_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 269955 | Hilton Garden Inn Albany/SUNY Area | 1389 Washington Ave | Albany | NY | 12206 | 42.68751 | -73.81643 | 3.0 | 154.0272 | 124.0216 |
1 | 113431 | Courtyard by Marriott Albany Thruway | 1455 Washington Avenue | Albany | NY | 12206 | 42.68971 | -73.82021 | 3.0 | 179.0100 | 134.0000 |
2 | 108151 | Radisson Hotel Albany | 205 Wolf Rd | Albany | NY | 12205 | 42.72410 | -73.79822 | 3.0 | 134.1700 | 84.1600 |
示例函数:Haversine Distance
def haversine(lat1, lon1, lat2, lon2):
miles_constant = 3959
lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
dlat = lat2 - lat1
dlon = lon2 - lon1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
mi = miles_constant * c
return mi
优化它之前,先测量它
IPython Notebook的Magic Command: %timeit
既可以测量某一行代码的执行时间,又可以测量整个单元格里代码快的执行时间。
Package: line_profiler
记录每行代码的执行次数和执行时间。
在IPython Notebook中使用时,先运行%load_ext line_profiler
, 之后可以用%lprun -f [function name]
命令记录指定函数的执行情况。
实验
对行做循环(Baseline)
%%timeit
haversine_series = []
for index, row in df.iterrows():
haversine_series.append(haversine(40.671, -73.985,\
row['latitude'], row['longitude']))
df['distance'] = haversine_series
Output:
197 ms ± 6.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.DataFrame.apply()方法
%lprun -f haversine \
df.apply(lambda row: haversine(40.671, -73.985,\
row['latitude'], row['longitude']), axis=1)
Output:
90.6 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timer unit: 1e-06 s
Total time: 0.049982 s
File: <ipython-input-3-19c704a927b7>
Function: haversine at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def haversine(lat1, lon1, lat2, lon2):
2 1631 1535 0.9 3.1 miles_constant = 3959
3 1631 16602 10.2 33.2 lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
4 1631 2019 1.2 4.0 dlat = lat2 - lat1
5 1631 1143 0.7 2.3 dlon = lon2 - lon1
6 1631 18128 11.1 36.3 a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
7 1631 7857 4.8 15.7 c = 2 * np.arcsin(np.sqrt(a))
8 1631 1708 1.0 3.4 mi = miles_constant * c
9 1631 990 0.6 2.0 return mi
观察Hits这一列可以看到,apply()
方法还是将函数一行行地应用于每行。
向量化:将pd.Series传入函数
%lprun -f haversine haversine(40.671, -73.985,\
df['latitude'], df['longitude'])
Output:
2.21 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timer unit: 1e-06 s
Total time: 0.008601 s
File: <ipython-input-3-19c704a927b7>
Function: haversine at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def haversine(lat1, lon1, lat2, lon2):
2 1 3 3.0 0.0 miles_constant = 3959
3 1 838 838.0 9.7 lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
4 1 597 597.0 6.9 dlat = lat2 - lat1
5 1 572 572.0 6.7 dlon = lon2 - lon1
6 1 5033 5033.0 58.5 a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
7 1 1060 1060.0 12.3 c = 2 * np.arcsin(np.sqrt(a))
8 1 496 496.0 5.8 mi = miles_constant * c
9 1 2 2.0 0.0 return mi
向量化之后,函数内的每行操作只被访问一次,达到了行结构上的并行。
向量化:将np.array传入函数
%lprun -f haversine df['distance'] = haversine(40.671, -73.985,\
df['latitude'].values, df['longitude'].values)
Output:
370 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Timer unit: 1e-06 s
Total time: 0.001382 s
File: <ipython-input-3-19c704a927b7>
Function: haversine at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def haversine(lat1, lon1, lat2, lon2):
2 1 3 3.0 0.2 miles_constant = 3959
3 1 292 292.0 21.1 lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
4 1 40 40.0 2.9 dlat = lat2 - lat1
5 1 29 29.0 2.1 dlon = lon2 - lon1
6 1 815 815.0 59.0 a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
7 1 183 183.0 13.2 c = 2 * np.arcsin(np.sqrt(a))
8 1 18 18.0 1.3 mi = miles_constant * c
9 1 2 2.0 0.1 return mi
相比pd.Series
,np.array
不含索引等额外信息,因而更加高效。
小结
Methodology | Avg. single run time | Marginal performance improvement |
---|---|---|
Looping with iterrows | 184.00 | - |
Looping with apply | 78.10 | 2.4x |
Vectorization with Pandas series | 1.79 | 43.6x |
Vectorization with NumPy arrays | 0.37 | 4.8x |
通过上面的对比,我们比最初的baseline快了近500倍。最大的提升来自于向量化。因而,实现的函数能够很方便地向量化是高效处理的关键。
用Cython
优化
Cython
可以将python
代码转化为C
代码来执行,可以进行如下优化(静态化变量类型,调用C函数库)
%load_ext cython
%%cython -a
# Haversine cythonized
from libc.math cimport sin, cos, acos, asin, sqrt
cdef deg2rad_cy(float deg):
cdef float rad
rad = 0.01745329252*deg
return rad
cpdef haversine_cy_dtyped(float lat1, float lon1, float lat2, float lon2):
cdef:
float dlon
float dlat
float a
float c
float mi
lat1, lon1, lat2, lon2 = map(deg2rad_cy, [lat1, lon1, lat2, lon2])
dlat = lat2 - lat1
dlon = lon2 - lon1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
mi = 3959 * c
return mi
嵌套于循坏中:
%timeit df['distance'] =\
df.apply(lambda row: haversine_cy_dtyped(40.671, -73.985,\
row['latitude'], row['longitude']), axis=1)
Output:
10 loops, best of 3: 68.4 ms per loop
可以看到,Cython
确实带来速度上的提升,但效果不及向量化(并行化)。