Why is computing point distances so slow in Python?
My Python program was too slow. So I profiled it and found that most of the time was being spent in a function that computes the distance between two points (a point is a list of 3 Python floats):
    def get_dist(pt0, pt1):
        val = 0
        for i in range(3):
            val += (pt0[i] - pt1[i]) ** 2
        val = math.sqrt(val)
        return val
To analyze why this function was so slow, I wrote two small test programs: one in Python and one in C++ that do a similar computation. They compute the distance between 1 million pairs of points. (The test code in Python and C++ is below.)
The Python computation takes 2 seconds, while the C++ takes 0.02 seconds. A 100x difference!
Why is the Python code so much slower than the C++ code for such simple math computations? How do I speed it up to match the C++ performance?
The Python code used for testing:
    import math, random, time

    num = 1000000

    # Generate random points and numbers
    pt_list = []
    rand_list = []

    for i in range(num):
        pt = []
        for j in range(3):
            pt.append(random.random())
        pt_list.append(pt)
        rand_list.append(random.randint(0, num - 1))

    # Compute
    beg_time = time.clock()
    dist = 0

    for i in range(num):
        pt0 = pt_list[i]
        ri = rand_list[i]
        pt1 = pt_list[ri]

        val = 0
        for j in range(3):
            val += (pt0[j] - pt1[j]) ** 2
        val = math.sqrt(val)
        dist += val

    end_time = time.clock()
    elap_time = (end_time - beg_time)
    print elap_time
    print dist
The C++ code used for testing:
    #include <cstdlib>
    #include <iostream>
    #include <ctime>
    #include <cmath>

    struct point
    {
        double v[3];
    };

    int num = 1000000;

    int main()
    {
        // Allocate memory
        point** pt_list = new point*[num];
        int* rand_list = new int[num];

        // Generate random points and numbers
        for ( int i = 0; i < num; ++i )
        {
            point* pt = new point;

            for ( int j = 0; j < 3; ++j )
            {
                const double r = (double) rand() / (double) RAND_MAX;
                pt->v[j] = r;
            }

            pt_list[i] = pt;
            rand_list[i] = rand() % num;
        }

        // Compute
        clock_t beg_time = clock();
        double dist = 0;
        for ( int i = 0; i < num; ++i )
        {
            const point* pt0 = pt_list[i];
            int r = rand_list[i];
            const point* pt1 = pt_list[r];

            double val = 0;
            for ( int j = 0; j < 3; ++j )
            {
                const double d = pt0->v[j] - pt1->v[j];
                val += ( d * d );
            }

            val = sqrt(val);
            dist += val;
        }
        clock_t end_time = clock();
        double sec_time = (end_time - beg_time) / (double) CLOCKS_PER_SEC;

        std::cout << sec_time << std::endl;
        std::cout << dist << std::endl;

        return 0;
    }
A sequence of optimizations:
The original code, with small changes
    import math, random, time

    num = 1000000

    # Generate random points and numbers

    # Change #1: Sometimes it's good not to have randomness.
    # This is one of those cases.
    # Changing the code shouldn't change the results.
    # Using a fixed seed ensures that the changes are valid.
    # The final 'print dist' should yield the same result regardless of optimizations.
    # Note: There's nothing magical about this seed.
    # I randomly picked a hash tag from a git log.
    random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    pt_list = []
    rand_list = []

    for i in range(num):
        pt = []
        for j in range(3):
            pt.append(random.random())
        pt_list.append(pt)

    # Change #2: rand_list is computed in a separate loop.
    # This ensures that upcoming optimizations will get the same results
    # as this unoptimized version.
    for i in range(num):
        rand_list.append(random.randint(0, num - 1))

    # Compute
    beg_time = time.clock()
    dist = 0

    for i in range(num):
        pt0 = pt_list[i]
        ri = rand_list[i]
        pt1 = pt_list[ri]

        val = 0
        for j in range(3):
            val += (pt0[j] - pt1[j]) ** 2
        val = math.sqrt(val)
        dist += val

    end_time = time.clock()
    elap_time = (end_time - beg_time)
    print elap_time
    print dist
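To illustrate why the fixed seed matters, here is a minimal sketch. It is written for Python 3, whereas the listings in this answer are Python 2. The helper name `make_points` and the seed value 12345 are made up for the example. It only shows that reseeding reproduces the same points, which is what lets the final `dist` of two variants be compared directly.

```python
import random

def make_points(seed, num):
    """Return num random 3D points from a fixed seed (hypothetical helper)."""
    random.seed(seed)
    return [[random.random() for _ in range(3)] for _ in range(num)]

first = make_points(12345, 5)
second = make_points(12345, 5)

# Same seed, same call sequence, so the two lists are identical.
print(first == second)  # prints True
```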
Optimization #1: Put the code in a function.
The first optimization (not shown) is to embed all of the code except the import in a function. This simple change offers a 36% performance boost on my computer.
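Since that version is not shown, here is a rough sketch of what it might look like. It is written for Python 3, so `time.perf_counter` stands in for `time.clock`; the function name `total_distance`, the smaller default `num`, and the seed are illustrative choices, not taken from the original. The usual explanation for the speedup is that in CPython, names inside a function are locals and resolve faster than module-level globals. The size of the gain will vary by machine and interpreter version.

```python
import math
import random
import time

def total_distance(pt_list, rand_list):
    """Sum the distances between each point and its randomly chosen partner."""
    dist = 0.0
    for i in range(len(pt_list)):
        pt0 = pt_list[i]
        pt1 = pt_list[rand_list[i]]
        val = 0.0
        for j in range(3):
            val += (pt0[j] - pt1[j]) ** 2
        dist += math.sqrt(val)
    return dist

def main(num=100000):
    random.seed(0)  # arbitrary seed, chosen only for this sketch
    pt_list = [[random.random() for _ in range(3)] for _ in range(num)]
    rand_list = [random.randint(0, num - 1) for _ in range(num)]

    beg_time = time.perf_counter()
    dist = total_distance(pt_list, rand_list)
    elap_time = time.perf_counter() - beg_time

    print(elap_time)
    print(dist)
    return dist

if __name__ == "__main__":
    main()
```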
Optimization #2: Eschew the ** operator.
You don't use pow(d,2) in your C code because everyone knows that this is suboptimal in C. It's also suboptimal in Python. Python's ** is smart; it evaluates x**2 as x*x. However, it takes time to be smart. You know you want d*d, so use it. Here's the computation loop with that optimization:
    for i in range(num):
        pt0 = pt_list[i]
        ri = rand_list[i]
        pt1 = pt_list[ri]

        val = 0
        for j in range(3):
            d = pt0[j] - pt1[j]
            val += d*d
        val = math.sqrt(val)
        dist += val
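One way to check this claim on your own machine is to time the two expressions in isolation with the standard `timeit` module. The sketch below is Python 3; the helper name `time_expr` and the sample value of `d` are assumptions made for the example. No expected timings are given because they depend on the interpreter and the hardware.

```python
import timeit

def time_expr(stmt, number=1000000):
    """Time one expression with d bound to a float (hypothetical helper)."""
    return timeit.timeit(stmt, setup="d = 0.123456789", number=number)

if __name__ == "__main__":
    # d is a variable, so neither expression can be folded into a constant.
    print("d ** 2:", time_expr("d ** 2"))
    print("d * d: ", time_expr("d * d"))
```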
Optimization #3: Be pythonic.
Your Python code looks a whole lot like your C code. You aren't taking advantage of the language.
    import math, random, time, itertools

    def main (num=1000000) :
        # Small optimization that speeds things up by a couple of percent.
        sqrt = math.sqrt

        # Generate random points and numbers
        random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

        def random_point () :
            return [random.random(), random.random(), random.random()]

        def random_index () :
            return random.randint(0, num-1)

        # Big optimization:
        # Don't build the lists with explicit append loops.
        # Instead use list comprehensions over xrange.
        # It's best to avoid creating lists of millions of entities when you
        # don't need those lists. Note that the comprehensions themselves still
        # create lists; it is xrange and izip that avoid building extra
        # million-entry collections just to drive the iteration.
        pt_list = [random_point() for _ in xrange(num)]
        rand_pts = [pt_list[random_index()] for _ in xrange(num)]

        # Compute
        beg_time = time.clock()
        dist = 0

        # Don't loop over a range. That's too C-like.
        # Instead loop over some iterable, preferably one that doesn't create
        # the collection over which the iteration is to occur.
        # This is particularly important when the collection is large.
        for (pt0, pt1) in itertools.izip (pt_list, rand_pts) :
            # Small optimization: The inner loop is inlined,
            # and the intermediate variable 'val' is eliminated.
            d0 = pt0[0]-pt1[0]
            d1 = pt0[1]-pt1[1]
            d2 = pt0[2]-pt1[2]
            dist += sqrt(d0*d0 + d1*d1 + d2*d2)

        end_time = time.clock()
        elap_time = (end_time - beg_time)
        print elap_time
        print dist

    main()
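The listing for optimization #3 is Python 2. In Python 3, `xrange` and `itertools.izip` are gone, and the built-in `range` and `zip` are already lazy, so the same idea might be sketched as follows. The function name `total_distance`, the smaller problem size, and the seed are assumptions made for the example.

```python
import math
import random

def total_distance(pt_list, rand_pts):
    """Sum point-to-point distances, iterating over pairs instead of indices."""
    sqrt = math.sqrt  # local alias, the same trick used for optimization #3
    dist = 0.0
    for pt0, pt1 in zip(pt_list, rand_pts):
        # Inner loop unrolled by hand for the three coordinates.
        d0 = pt0[0] - pt1[0]
        d1 = pt0[1] - pt1[1]
        d2 = pt0[2] - pt1[2]
        dist += sqrt(d0*d0 + d1*d1 + d2*d2)
    return dist

if __name__ == "__main__":
    num = 100000
    random.seed(0)  # arbitrary seed, chosen only for this sketch
    pt_list = [[random.random() for _ in range(3)] for _ in range(num)]
    rand_pts = [pt_list[random.randint(0, num - 1)] for _ in range(num)]
    print(total_distance(pt_list, rand_pts))
```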
Update
Optimization #4: Use numpy
The following takes 1/40th the time of the original version in the timed section of the code. Not quite as fast as C, but close.
Note the commented-out "mondo slow" computation. That takes ten times as long as the original version. There is an overhead cost with using numpy. The setup takes quite a bit longer in the code that follows compared to that in my non-numpy optimization #3.
Bottom line: You need to take care when using numpy, and the setup costs might be significant.
    import numpy, random, time

    def main (num=1000000) :
        # Generate random points and numbers
        random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

        def random_point () :
            return [random.random(), random.random(), random.random()]

        def random_index () :
            return random.randint(0, num-1)

        pt_list = numpy.array([random_point() for _ in xrange(num)])
        rand_pts = pt_list[[random_index() for _ in xrange(num)],:]

        # Compute
        beg_time = time.clock()

        # Mondo slow.
        # dist = numpy.sum (
        #     numpy.apply_along_axis (
        #         numpy.linalg.norm, 1, pt_list - rand_pts))

        # Mondo fast.
        dist = numpy.sum ((numpy.sum ((pt_list-rand_pts)**2, axis=1))**0.5)

        end_time = time.clock()
        elap_time = (end_time - beg_time)
        print elap_time
        print dist

    main()
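A likely reason the "mondo slow" variant is slow is that `apply_along_axis` calls `numpy.linalg.norm` once per row from Python. In NumPy versions where `numpy.linalg.norm` accepts an `axis` argument, the norm can be taken over all rows in a single call. The Python 3 sketch below shows both vectorized forms side by side. The function names, the seed, and the use of `numpy.random.RandomState` to build the test data are assumptions made for the example, and it makes no timing claims.

```python
import numpy as np

def total_distance(points, partners):
    """Sum the row-wise Euclidean distances between two (n, 3) arrays."""
    diff = points - partners
    return float(np.sqrt((diff * diff).sum(axis=1)).sum())

def total_distance_norm(points, partners):
    """Same result, using numpy.linalg.norm along axis 1 in one call."""
    return float(np.linalg.norm(points - partners, axis=1).sum())

if __name__ == "__main__":
    num = 100000
    rng = np.random.RandomState(0)  # arbitrary seed, chosen only for this sketch
    points = rng.random_sample((num, 3))
    partners = points[rng.randint(0, num, size=num)]
    print(total_distance(points, partners))
    print(total_distance_norm(points, partners))
```

Generating the points directly as an array, as in the `__main__` section, is one way the setup cost mentioned above might be reduced, though it produces different random values than the `random` module does.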