After a very long time, this will be my first post. I hope that it will not be the last, and the rest will come.
For a project about taxi trajectory prediction, I have applied k-NN algorithm on coordinate sequence of taxi trips. Did I have promising results? Well, it is arguable, but I have discovered some good techniques. One of them is Z-order curve (Figure 1).
I was looking for a technique to decrease dimensionality of 2-D coordinate representation (lat-lon) into 1-D. But on this 1-D value, I would like to apply k-NN algorithm. For this purpose, 1-D values must preserve the closeness of two different coordinates. If I do linear ordering of latitude and longitude values, I couldn't achieve this goal, because closeness will depend on only one value which is placed on the right-most. Let's assume I do linear ordering as lat-lon:
On the other hand, Z-ordering brings a novel approach to address such problem. In this method, an index value (k) is assigned to each one(e.g. index values are 0 and 1 for 2-D). Then, each value is represented in binary way. Each bit in index i of the binary representation is placed into (2*i+k)th index of encoded value. As an example:
This method is also applicable on N-dimension values. For value N, assign an index (k) to each number from beginning 0 to N-1. Then, bits of each number will be placed at (2*i+k)th index. One major drawback is that the final value will contain many number of bits. Therefore, it would be necessary to diminish the numbers into smaller values. For example, if the value range is 40000-42000, then it can be represented as 0-2000. By this way, you will be able to decode N-dimensions into lower number of bits.
Don't hesitate to mention any improvements if I forget to mention. What could be the other ways to achieve this goal? Any better/alternative approach is welcome. Just drop a comment. I will try to write a post about sensible ones :)
For a project about taxi trajectory prediction, I have applied k-NN algorithm on coordinate sequence of taxi trips. Did I have promising results? Well, it is arguable, but I have discovered some good techniques. One of them is Z-order curve (Figure 1).
Figure 1: Four levels of Z-ordering (taken from Wikipedia page)
Latitude : +47.312
Longitude: +35.02
To represent these float values in int, we multiply them with 1000.
Latitude (int) : 47312
Longitude(int) : 35020
Latitude (Binary) : 1011100011010000
Longitude (Binary) : 1000100011001100
Linear Representation (bin) : 10111000110100001000100011001100
Linear Representation (int) : 3100674252
It is sensitive to the closeness of longitude values, but same situation does not apply for latitude values. Let's assume we have another coordinate (47.313, 35.02) in (lat,lon) representation. Linear representation of this coordinate will be 3,100,739,788. The distance with the one in the example is
|3100674252-3100739788| = 65,536
For such close coordinates, it is not an ignorable distance. Obviously, it will not claim closeness. Therefore, encoding coordinates with linear ordering is not sensible to use in k-NN.On the other hand, Z-ordering brings a novel approach to address such problem. In this method, an index value (k) is assigned to each one(e.g. index values are 0 and 1 for 2-D). Then, each value is represented in binary way. Each bit in index i of the binary representation is placed into (2*i+k)th index of encoded value. As an example:
Latitude : +47.312
Longitude: +35.02
To represent these float values in int, we multiply them with 1000.
Latitude (int) : 47312
Longitude(int) : 35020
Latitude (Binary) : 1011100011010000
Longitude (Binary) : 1000100011001100
Z-Order Representation (bin) : 11001
010110000001111001001010000
Z-Order Representation (int) : 3401642576
Let's use same coordinate (47.313, 35.02) to check distance in linear representation method. Its Z-order representation will be 3,401,642,578. The distance with the one in the example is
|3401642576-3401642578| = 2
By following this way, I have encoded all coordinates (latitude and longitude pairs) into 1-D values. Since closeness property among coordinates is preserved, it is suitable to apply k-NN algorithm on such data. It gives considerably good performance improvement with accuracy.This method is also applicable on N-dimension values. For value N, assign an index (k) to each number from beginning 0 to N-1. Then, bits of each number will be placed at (2*i+k)th index. One major drawback is that the final value will contain many number of bits. Therefore, it would be necessary to diminish the numbers into smaller values. For example, if the value range is 40000-42000, then it can be represented as 0-2000. By this way, you will be able to decode N-dimensions into lower number of bits.
Don't hesitate to mention any improvements if I forget to mention. What could be the other ways to achieve this goal? Any better/alternative approach is welcome. Just drop a comment. I will try to write a post about sensible ones :)
The Hilbert curve might be interesting too, it achieves similar outcomes in traversing a 2D space in 1D, but doesn't have the large jumps of the Z-order curve.
ReplyDeleteAll space filling curves will have discontinuities.
DeleteI remember I had to extract pixel brightness information from the iris of the eye, using Hilbert curves, for a faculty project. I tried writing my own algorithm but ended up using somebody's code for generating hilbert curves in matlab and adapted that.
DeleteThis comment has been removed by the author.
DeleteIt appears that neither the Z curve nor the Hilbert curve preserves closeness of the 2D coordinates. There appear to be points arbitrarily close together in the 2D space that are not close together in Z coordinates and the same is true for Hilbert coordinates.
DeleteIn the case of the Z curve, the problem is that points very close together don't only differ in the lower-order bits of their coordinate values.
Going from (000111111111, 000000000000) to (001000000000, 000000000000) is a tiny jump (in the x-axis), but the Z curve coordinates are 000000101010101010101010 and 000010000000000000000000 which are a long way apart.
The Hilbert curve does preserve locality _in the opposite direction_. So, points with close Hilbert coordinates come from close points in 2D. The Z-curve also does this, but slightly more roughly.
What was the project you were working on?
ReplyDeleteThere was a contest on Kaggle for taxi trajectory prediction. https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i
DeleteInterleaving latitude & longitude digits is also known as Peano coding. Named after Guiseppe Peano the mathematician.
ReplyDeleteOne interesting solution to compare is Google's S2 library http://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/
ReplyDeleteWhy not try multidimensional scaling? You can compute the distance matrix between points, since you have the lat/lon. If you have noisy lat/lon and/or incomplete information, you could use something like stochastic triplet embeddings.
ReplyDelete