Which PostGIS function to use?

2015 May 16 at 04:11 » Tagged as :python, road.lk, postgis, short,

Which PostGIS function or operator do you use if you wanted to find the overlap between different lines? There are several candidates which immidiately come to mind;  ~, && , ST_Overlaps and and ST_Intersects if you are in a hurry or TLDR; use ST_Overlap. The two operators They all do job but in different ways, takes different amounts of time and return slightly different results. 

Using only operators             time  4.3237  rows  1208
Using ~ operator and ST_Overlap  time  5.5201  rows  233
Using ST_Dwithin                 time  5.5551  rows  2500
Using ST_Intersects              time  5.3128 rows  746  

ST_DWithin seems to match everything to everything else. Using ~ and &&. Usin operatings seem to be a shade quicker than the other options. But the operators only deal with bounding boxes not the lines themselves so your milage may vary depending on your context. This test was carried out with a nested for loop with the other loop having 250 objects and the inner loop having 25. The data came from the road.lk carpool tables. Picking the right sample is very important in this test since different samples could produce different results so the sample was actually changed multiple times.  The results followed the same pattern except for this one instance where all the routes were disjoint! 

Using only operators      time  0.0174  rows  0
Using operators + overlap time  0.0176  rows  0
Using ST_Dwithin          time  0.0177  rows  0
Using ST_Intersects       time  0.0177  rows  0   

I should have used number formatting but I didn't so sue me. Getting back to the first data set, the number of matches vary by a factor of eleven so we ought to check the quality of the matches. That can be done by using the overlapping distances.

Using only operators             time 7.1954  rows  1208  total overlap  0.199858778468
Using ~ operators and ST_Overlap time 6.9210  rows  233   total overlap  0.199858778468
Using ST_Dwithin                 time 10.7650 rows  2500  total overlap  0.199858778468
Using ST_Intersects              time 7.1929  rows  746   total overlap  0.199858778468 

 

Surprise, surprise all the overlapping distances are identical. So clearly the a.line && b.line greatly over estimates the number of overlaps and so does ST_Dwithin. Counter intuitively ST_Overlaps and ST_Intersects both seem to have the edge over the operators.

Before winding down, I wanted to do one final test. As things stand the queries look like' INNER JOIN pool_route b ON a.line ~ b.line OR b.line ~ a.line or a.line && b.line ' and ' INNER JOIN pool_route b ON a.line ~ b.line OR b.line ~ a.line or ST_Overlaps(a.line , b.line) ' will there be a difference if the contains operator (~) is removed?

 

Using only operators             time  7.4457  rows  1208  total overlap  0.199858778468
Using ~ operators and ST_Overlap time  6.0198  rows  64    total overlap  0.199858778468
Using ST_Dwithin                 time  10.7326 rows  2500  total overlap  0.199858778468
Using ST_Intersects              time  7.3100  rows  746   total overlap  0.199858778468     

 

Just ignore the timing the difference is too small to be taken seriously but look at how the number of matches has dropped dramatically from 266 to 64 for the ST_Overlap query. Yet the final overlapping distance hasn't changed. At this point I double checked and triple checked to make sure that my code doesn't have any bugs and it looks like there isn't. The only thing to do is to try with a much bigger dataset

 

Using only operators time  317.762  rows  83075  total overlap  116.168481483
Using ST_Overlap     time  223.555  rows  13884  total overlap  116.167709497
Using ST_Dwithin     time  416.819  rows  160000  total overlap  116.168481483
Using ST_Intersects  time  276.907  rows  46804  total overlap  116.168481483

This is with a much larger dataset, 400 objects in the outer loop, 400 objects in the inner loop. If you want to have look at the code that was used here you go:

import os, sys

if __name__ == '__main__':  #pragma nocover
    # Setup environ
    sys.path.append(os.getcwd())

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "main.settings_dev")
    
from django.db import connection
from pool.models import Route

import time

cursor = connection.cursor()

routes = Route.objects.order_by('id')[0:100]
routes2 = Route.objects.order_by('-id')[0:250]

t0 = time.time()
sum = 0
distance = 0

for r1 in routes:
    for r2 in routes2 :
        q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line))) FROM pool_route a '

                ' INNER JOIN pool_route b ON a.line && b.line '
                ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id])
        row = cursor.fetchone()
        sum += row[0]
        if row[1]:
            distance += row[1]
t1 = time.time() - t0
t0 = time.time()


print 'Using only operators time ' , t1 , ' rows ' , sum ,' total overlap ', distance
sum = 0
distance = 0

for r1 in routes:
    for r2 in routes2 :
        q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line))) FROM pool_route a '
                ' INNER JOIN pool_route b ON ST_Overlaps(a.line , b.line) '
                ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id])
        row = cursor.fetchone()
        sum += row[0]
        if row[1]:       
           distance += row[1]

t1 = time.time() - t0
t0 = time.time()
print 'Using ~ operators and ST_Overlap time ' , t1 , ' rows ' , sum ,' total overlap ', distance

sum = 0
distance = 0

for r1 in routes:
    for r2 in routes2 :
        q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line))) FROM pool_route a '
                ' INNER JOIN pool_route b ON ST_Dwithin(a.line, b.line, 50) '
                ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id])
        row = cursor.fetchone()
        sum += row[0] 
        if row[1]:
            distance += row[1]

t1 = time.time() - t0
t0 = time.time()
print 'Using ST_Dwithin time ' , t1 , ' rows ' , sum, ' total overlap ', distance
sum = 0
distance = 0


for r1 in routes:
    for r2 in routes2 :
        q = cursor.execute('SELECT count(*) , sum(ST_Length(ST_intersection(a.line,b.line)))  FROM pool_route a '
                ' INNER JOIN pool_route b ON ST_Intersects(a.line, b.line) '
                ' WHERE a.id = %s AND b.id = %s', [r1.id, r2.id])
        row = cursor.fetchone()
        sum += row[0] 
        if row[1]:
            distance += row[1]
t1 = time.time() - t0

print 'Using ST_Intersects time ' , t1 , ' rows ' , sum, ' total overlap ', distance

Update: a bug in the code was fixed and the blog post formatted (by hand)