Larry Ullman's Book Forums

Distance Formula Efficiency With Large Count Of Locations



It is really awesome to finally be able to calculate distance, thanks Larry!

 

I had a question about ways of expanding on this new knowledge. The query presented in the book takes information from the stores table and does a left join with the zip_codes table. The query looks like this:

 

"SELECT name, address, city, state, stores.zip_code, phone, ROUND(return_distance($lat, $long, latitude, longitude)) AS distance FROM stores LEFT JOIN zip_codes USING (zip_code) ORDER BY distance ASC LIMIT 3"

 

I want to confirm the order in which things happen. Is the following correct?

1) First, ALL of the stores are returned from the stores table

2) Next, the remaining information needed (city, state, latitude, longitude) is joined to those results

3) Next, the return_distance formula calculates the distance between the origin location and the location of every store

4) Finally, those results are ordered to show the 3 closest stores (a quick way to check this with EXPLAIN is sketched below)
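
To check that order myself, I could run EXPLAIN on the query. This is just a sketch, with $lat and $long swapped for made-up coordinates:

-- Ask MySQL for the execution plan of the book's query
-- (40.0 / -75.0 stand in for $lat / $long):
EXPLAIN
SELECT name, address, city, state, stores.zip_code, phone,
       ROUND(return_distance(40.0, -75.0, latitude, longitude)) AS distance
FROM stores
LEFT JOIN zip_codes USING (zip_code)
ORDER BY distance ASC
LIMIT 3;

If the plan shows a full scan of stores (type ALL) plus a filesort, that would confirm the distance function is evaluated for every store and the LIMIT 3 only trims the output after the sort.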

 

If this is how the query gets processed, then I'm wondering how efficient it will be if my table has 500,000 stores. Obviously, a table with 500,000 stores will take up a lot more resources calculating the distance between an origin and all 500,000 stores than a table with only 500 stores would.

 

If that is in fact how the query with the left join works AND my stores table has a ton of stores, here is what I'm thinking about doing to improve efficiency:

 

First, do a query that locates every zip code within a certain distance of the origin, let's say 100 miles. That way I'm not running the distance formula 500,000 times, but rather about 45,000 times (the number of zip codes). Next, I would run the query just like it was presented in the book, but include a WHERE zip_code IN (the values returned by the previous query) clause.
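
In SQL, the idea would look roughly like this; the origin coordinates, the 100-mile cutoff, and the zip codes in the IN list are all made up for illustration:

-- Step 1: find every zip code within roughly 100 miles of the origin
-- (40.0 / -75.0 stand in for $lat / $long):
SELECT zip_code
FROM zip_codes
WHERE return_distance(40.0, -75.0, latitude, longitude) <= 100;

-- Step 2: the book's query, restricted to the zip codes found in step 1
-- (the IN list here is just placeholder values):
SELECT name, address, city, state, stores.zip_code, phone,
       ROUND(return_distance(40.0, -75.0, latitude, longitude)) AS distance
FROM stores
LEFT JOIN zip_codes USING (zip_code)
WHERE stores.zip_code IN ('19103', '19104', '08540')
ORDER BY distance ASC
LIMIT 3;

With an index on stores.zip_code, step 2 would only read the stores in the candidate area rather than all 500,000 rows, so the distance function runs far fewer times in the second query.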

 

Again, this would only improve efficiency if I did in fact have many times more stores in my table than there are zip codes. Does this strategy make sense?


This is called algorithm design, and it can be done theoretically. In programming, that kind of analysis is called Big O analysis. I don't know how joins perform in that regard, though.

 

My instinct, if you have 500,000 rows, would be to create another, denormalized table holding all the data from your join (that's a pretty simple query) and then perform operations on that data instead.
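
Roughly something like this, assuming the column names from the book's stores and zip_codes tables (stores_flat is just a name I made up):

-- Build the denormalized copy once (or on a schedule), so the join
-- doesn't have to be repeated on every search:
CREATE TABLE stores_flat AS
SELECT name, address, city, state, stores.zip_code, phone, latitude, longitude
FROM stores
LEFT JOIN zip_codes USING (zip_code);

-- Then search against the flat table directly
-- (40.0 / -75.0 stand in for the origin's $lat / $long):
SELECT name, address, city, state, zip_code, phone,
       ROUND(return_distance(40.0, -75.0, latitude, longitude)) AS distance
FROM stores_flat
ORDER BY distance ASC
LIMIT 3;

The distance still has to be calculated for every row, but the join cost disappears, and the flat table is easy to rebuild whenever the stores table changes.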

 

You are right about speed in that regard. Let's say joining data is, in the best case, a linear algorithm. That means that if 50,000 rows take 1 second to calculate, 500,000 rows would take 10 seconds. That's not really workable on a live site.

 

Relational databases are proving to be great in that regard, though. Try performing calculations on large data sets to see how it goes.
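
One rough way to get a feel for the cost of the distance calculation alone is MySQL's BENCHMARK() function; the coordinates here are made up, and return_distance() is the stored function from the book:

-- Evaluate the distance function 500,000 times and let the client
-- report the elapsed time:
SELECT BENCHMARK(500000, return_distance(40.0, -75.0, 41.0, -74.0));

That only measures the function itself, not the join or the sort, but it gives a ballpark figure for what a full scan of 500,000 stores would cost.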

