Pandas速度优化

ddlee · May 28, 2017

本文主要内容取自Sofia Heisler在PyCon 2017上的演讲No More Sad Pandas Optimizing Pandas Code for Speed and Efficiency,讲稿代码和幻灯片见GitHub

Set Up

示例数据

ean_hotel_id name address1 city state_province postal_code latitude longitude star_rating high_rate low_rate  
0 269955 Hilton Garden Inn Albany/SUNY Area 1389 Washington Ave Albany NY 12206 42.68751 -73.81643 3.0 154.0272 124.0216
1 113431 Courtyard by Marriott Albany Thruway 1455 Washington Avenue Albany NY 12206 42.68971 -73.82021 3.0 179.0100 134.0000
2 108151 Radisson Hotel Albany 205 Wolf Rd Albany NY 12205 42.72410 -73.79822 3.0 134.1700 84.1600

示例函数:Haversine Distance

def haversine(lat1, lon1, lat2, lon2):
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    mi = miles_constant * c
    return mi

优化它之前,先测量它

IPython Notebook的Magic Command: %timeit

既可以测量某一行代码的执行时间,又可以测量整个单元格里代码快的执行时间。

Package: line_profiler

记录每行代码的执行次数和执行时间。

在IPython Notebook中使用时,先运行%load_ext line_profiler, 之后可以用%lprun -f [function name]命令记录指定函数的执行情况。

实验

对行做循环(Baseline)

%%timeit
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985,\
                                      row['latitude'], row['longitude']))
df['distance'] = haversine_series

Output:

197 ms ± 6.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.DataFrame.apply()方法

%lprun -f haversine \
df.apply(lambda row: haversine(40.671, -73.985,\
                               row['latitude'], row['longitude']), axis=1)

Output:

90.6 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timer unit: 1e-06 s

Total time: 0.049982 s
File: <ipython-input-3-19c704a927b7>
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2      1631         1535      0.9      3.1      miles_constant = 3959
     3      1631        16602     10.2     33.2      lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4      1631         2019      1.2      4.0      dlat = lat2 - lat1
     5      1631         1143      0.7      2.3      dlon = lon2 - lon1
     6      1631        18128     11.1     36.3      a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7      1631         7857      4.8     15.7      c = 2 * np.arcsin(np.sqrt(a))
     8      1631         1708      1.0      3.4      mi = miles_constant * c
     9      1631          990      0.6      2.0      return mi

观察Hits这一列可以看到,apply()方法还是将函数一行行地应用于每行。

向量化:将pd.Series传入函数

%lprun -f haversine haversine(40.671, -73.985,\
                              df['latitude'], df['longitude'])

Output:

2.21 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timer unit: 1e-06 s

Total time: 0.008601 s
File: <ipython-input-3-19c704a927b7>
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2         1            3      3.0      0.0      miles_constant = 3959
     3         1          838    838.0      9.7      lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4         1          597    597.0      6.9      dlat = lat2 - lat1
     5         1          572    572.0      6.7      dlon = lon2 - lon1
     6         1         5033   5033.0     58.5      a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7         1         1060   1060.0     12.3      c = 2 * np.arcsin(np.sqrt(a))
     8         1          496    496.0      5.8      mi = miles_constant * c
     9         1            2      2.0      0.0      return mi

向量化之后,函数内的每行操作只被访问一次,达到了行结构上的并行。

向量化:将np.array传入函数

%lprun -f haversine df['distance'] = haversine(40.671, -73.985,\
                        df['latitude'].values, df['longitude'].values)

Output:

370 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Timer unit: 1e-06 s

Total time: 0.001382 s
File: <ipython-input-3-19c704a927b7>
Function: haversine at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def haversine(lat1, lon1, lat2, lon2):
     2         1            3      3.0      0.2      miles_constant = 3959
     3         1          292    292.0     21.1      lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
     4         1           40     40.0      2.9      dlat = lat2 - lat1
     5         1           29     29.0      2.1      dlon = lon2 - lon1
     6         1          815    815.0     59.0      a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
     7         1          183    183.0     13.2      c = 2 * np.arcsin(np.sqrt(a))
     8         1           18     18.0      1.3      mi = miles_constant * c
     9         1            2      2.0      0.1      return mi

相比pd.Seriesnp.array不含索引等额外信息,因而更加高效。

小结

Methodology Avg. single run time Marginal performance improvement
Looping with iterrows 184.00 -
Looping with apply 78.10 2.4x
Vectorization with Pandas series 1.79 43.6x
Vectorization with NumPy arrays 0.37 4.8x

通过上面的对比,我们比最初的baseline快了近500倍。最大的提升来自于向量化。因而,实现的函数能够很方便地向量化是高效处理的关键。

Cython优化

Cython可以将python代码转化为C代码来执行,可以进行如下优化(静态化变量类型,调用C函数库)

%load_ext cython

%%cython -a
# Haversine cythonized
from libc.math cimport sin, cos, acos, asin, sqrt

cdef deg2rad_cy(float deg):
    cdef float rad
    rad = 0.01745329252*deg
    return rad

cpdef haversine_cy_dtyped(float lat1, float lon1, float lat2, float lon2):
    cdef:
        float dlon
        float dlat
        float a
        float c
        float mi

    lat1, lon1, lat2, lon2 = map(deg2rad_cy, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    mi = 3959 * c
    return mi

嵌套于循坏中:

%timeit df['distance'] =\
df.apply(lambda row: haversine_cy_dtyped(40.671, -73.985,\
                              row['latitude'], row['longitude']), axis=1)

Output:

10 loops, best of 3: 68.4 ms per loop

可以看到,Cython确实带来速度上的提升,但效果不及向量化(并行化)。