Which PostGIS function to use?
2015 May 16 at 04:11 » Tagged as :python, road.lk, postgis, short,
Which PostGIS function or operator do you use if you wanted to find the overlap between different lines? There are several candidates which immidiately come to mind; ~, && , ST_Overlaps and and ST_Intersects if you are in a hurry or TLDR; use ST_Overlap. The two operators They all do job but in different ways, takes different amounts of time and return slightly different results.
Using only operators time 4.3237 rows 1208 Using ~ operator and ST_Overlap time 5.5201 rows 233 Using ST_Dwithin time 5.5551 rows 2500 Using ST_Intersects time 5.3128 rows 746
ST_DWithin
seems to match everything to everything else. Using ~
and &&
. Usin operatings seem to be a shade quicker than the other options. But the operators only deal with bounding boxes not the lines themselves so your milage may vary depending on your context. This test was carried out with a nested for loop with the other loop having 250 objects and the inner loop having 25. The data came from the road.lk carpool tables. Picking the right sample is very important in this test since different samples could produce different results so the sample was actually changed multiple times. The results followed the same pattern except for this one instance where all the routes were disjoint!
Using only operators time 0.0174 rows 0 Using operators + overlap time 0.0176 rows 0 Using ST_Dwithin time 0.0177 rows 0 Using ST_Intersects time 0.0177 rows 0
I should have used number formatting but I didn't so sue me. Getting back to the first data set, the number of matches vary by a factor of eleven so we ought to check the quality of the matches. That can be done by using the overlapping distances.
Using only operators time 7.1954 rows 1208 total overlap 0.199858778468 Using ~ operators and ST_Overlap time 6.9210 rows 233 total overlap 0.199858778468 Using ST_Dwithin time 10.7650 rows 2500 total overlap 0.199858778468 Using ST_Intersects time 7.1929 rows 746 total overlap 0.199858778468
Surprise, surprise all the overlapping distances are identical. So clearly the a.line && b.line
greatly over estimates the number of overlaps and so does ST_Dwithin
. Counter intuitively ST_Overlaps
and ST_Intersects
both seem to have the edge over the operators.
Before winding down, I wanted to do one final test. As things stand the queries look like' INNER JOIN pool_route b ON a.line ~ b.line OR b.line ~ a.line or a.line && b.line '
and ' INNER JOIN pool_route b ON a.line ~ b.line OR b.line ~ a.line or ST_Overlaps(a.line , b.line) '
will there be a difference if the contains operator (~) is removed?
Using only operators time 7.4457 rows 1208 total overlap 0.199858778468 Using ~ operators and ST_Overlap time 6.0198 rows 64 total overlap 0.199858778468 Using ST_Dwithin time 10.7326 rows 2500 total overlap 0.199858778468 Using ST_Intersects time 7.3100 rows 746 total overlap 0.199858778468
Just ignore the timing the difference is too small to be taken seriously but look at how the number of matches has dropped dramatically from 266 to 64 for the ST_Overlap query. Yet the final overlapping distance hasn't changed. At this point I double checked and triple checked to make sure that my code doesn't have any bugs and it looks like there isn't. The only thing to do is to try with a much bigger dataset
Using only operators time 317.762 rows 83075 total overlap 116.168481483 Using ST_Overlap time 223.555 rows 13884 total overlap 116.167709497 Using ST_Dwithin time 416.819 rows 160000 total overlap 116.168481483 Using ST_Intersects time 276.907 rows 46804 total overlap 116.168481483
This is with a much larger dataset, 400 objects in the outer loop, 400 objects in the inner loop. If you want to have look at the code that was used here you go:
import os, sys if __name__ == '__main__': #pragma nocover # Setup environ sys.path.append(os.getcwd()) os.environ.setdefault("DJANGO_SETTINGS_MODULE", "main.settings_dev") from django.db import connection from pool.models import Route import time cursor = connection.cursor() routes = Route.objects.order_by('id')[0:100] routes2 = Route.objects.order_by('-id')[0:250] t0 = time.time() sum = 0 distance = 0 for r1 in routes: for r2 in routes2 : q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line))) FROM pool_route a ' ' INNER JOIN pool_route b ON a.line && b.line ' ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id]) row = cursor.fetchone() sum += row[0] if row[1]: distance += row[1] t1 = time.time() - t0 t0 = time.time() print 'Using only operators time ' , t1 , ' rows ' , sum ,' total overlap ', distance sum = 0 distance = 0 for r1 in routes: for r2 in routes2 : q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line))) FROM pool_route a ' ' INNER JOIN pool_route b ON ST_Overlaps(a.line , b.line) ' ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id]) row = cursor.fetchone() sum += row[0] if row[1]: distance += row[1] t1 = time.time() - t0 t0 = time.time() print 'Using ~ operators and ST_Overlap time ' , t1 , ' rows ' , sum ,' total overlap ', distance sum = 0 distance = 0 for r1 in routes: for r2 in routes2 : q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line))) FROM pool_route a ' ' INNER JOIN pool_route b ON ST_Dwithin(a.line, b.line, 50) ' ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id]) row = cursor.fetchone() sum += row[0] if row[1]: distance += row[1] t1 = time.time() - t0 t0 = time.time() print 'Using ST_Dwithin time ' , t1 , ' rows ' , sum, ' total overlap ', distance sum = 0 distance = 0 for r1 in routes: for r2 in routes2 : q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line))) FROM pool_route a ' ' INNER JOIN pool_route b ON ST_Intersects(a.line, b.line) ' ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id]) row = cursor.fetchone() sum += row[0] if row[1]: distance += row[1] t1 = time.time() - t0 print 'Using ST_Intersects time ' , t1 , ' rows ' , sum, ' total overlap ', distance
Update: a bug in the code was fixed and the blog post formatted (by hand)