Cell-based outlier detection algorithm: a fast outlier detection algorithm for large datasets

2008 
Finding outliers is an important task for many KDD applications. We developed a cell-based outlier detection algorithm (CEBOD for short) to detect outliers in large datasets. The algorithm is based on LOF (local outlier factor); the major difference is that CEBOD avoids heavy computation on the majority of the dataset by filtering the initial dataset. Our experiments show that CEBOD is more efficient than LOF and can find outliers in large datasets quickly and accurately. A large dataset is loaded into memory block by block, and the data are placed into the appropriate cells based on their values. Each cell holds a certain number of data points, and this count represents the cell's density. Data that lie in high-density cells and have no nearness relationship to the local outlier factor calculation are filtered out. We record the densities of these cells for use when the next block of data is filled in. The final calculation is performed only on the data in low-density cells. In this way, we can handle a large dataset that cannot be loaded into memory all at once, and we improve the algorithm's efficiency by eliminating many useless computations. The time complexity of CEBOD is O(N).
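The block-wise filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the uniform grid, the names `cell_of` and `filter_candidates`, and the parameters `cell_width` and `density_threshold` are assumptions made for illustration. The check on nearness relationships between cells and the final LOF calculation on the remaining points are omitted.

```python
from collections import defaultdict
from math import floor


def cell_of(point, cell_width):
    """Map a point to the integer coordinates of its grid cell (assumed uniform grid)."""
    return tuple(floor(x / cell_width) for x in point)


def filter_candidates(blocks, cell_width, density_threshold):
    """Stream blocks of points and return the points left in low-density cells.

    Hypothetical helper: cell densities persist across blocks, while points
    in cells that exceed density_threshold are discarded from memory.
    """
    density = defaultdict(int)    # running point count per cell, kept across blocks
    retained = defaultdict(list)  # points still buffered, grouped by cell

    for block in blocks:
        for point in block:
            cell = cell_of(point, cell_width)
            density[cell] += 1
            # Only buffer points while the cell is still considered sparse.
            if density[cell] <= density_threshold:
                retained[cell].append(point)
        # After each block, drop buffered points from cells that became dense.
        for cell in [c for c in retained if density[c] > density_threshold]:
            del retained[cell]

    candidates = [p for points in retained.values() for p in points]
    return candidates, dict(density)


# Usage: two blocks, one dense cell near the origin and one isolated point.
blocks = [
    [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3)],
    [(0.4, 0.4), (5.5, 5.5)],
]
candidates, density = filter_candidates(blocks, cell_width=1.0, density_threshold=3)
print(candidates)  # [(5.5, 5.5)]
```

In this sketch only the isolated point survives filtering, so a LOF-style calculation would run on one point instead of five. A fuller version would presumably also keep points in dense cells that border sparse ones, since they can influence the local outlier factor of nearby candidates.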