[Overview] [Papers and Talks] [Source Code] [Data set] [Contacts]
Finding the k nearest neighbors (kNN) of a query point, or a set of query points (kNN-Join) are fundamental problems in many application domains. Many previous efforts to solve these problems focused on spatial databases or stand-alone systems, where changes to the database engine may be required, which may limit their application on large data sets that are stored in a relational database management system. Furthermore, these methods may not automatically optimize kNN queries or kNN-Joins when additional query conditions are specified. In this work, we study both the kNN query and the kNN-Join in a relational database, possibly augmented with additional query conditions. We search for relational algorithms that require no changes to the database engine. The straightforward solution uses the user-defined-function (UDF) that a query optimizer cannot optimize. We design algorithms that could be implemented by SQL operators without changes to the database engine, hence enabling the query optimizer to understand and generate the "best" query plan. Using only a small constant number of random shifts for databases in any fixed dimension, our approach guarantees to find the approximate kNN with only logarithmic number of page accesses in expectation with a constant approximation ratio and it could be extended to find the exact kNN efficiently in any fixed dimension. Our design paradigm easily supports the kNN-Join and updates. Extensive experiments on large, real and synthetic, data sets confirm the efficiency and practicality of our approach.
1. K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free,
Full version: Talk:
Please cite our paper if you use this library for your work. Thanks!
If you find any bugs or any suggestions/comments, we are very happy to hear from you!
K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free Library [tgz file]
We have generated and experimented with the data sets described in the paper. For real data sets, please follow the description in our paper.