Secret sequence comparison in distributed computing environments by interval sampling

2004 
Once a new gene has been sequenced, it must be checked for similarity to previously sequenced genes. In many cases, the organization that sequenced a potentially novel gene needs to keep the sequence itself confidential. However, to compare the potentially novel sequence with known sequences, it must either be sent as a query to public databases, or these databases must be downloaded onto a local computer. Sending the sequence as a query exposes it to the public, and downloading the databases is costly. In this work, we propose a new method, called interval sampling, to compare sequences without leaking exact information about the new sequence. To keep the exact sequence information secret, this method samples intervals (subsequences) from a sequence and hashes them. The hashed data can then be made public to verify the novelty of the sequence. We find that this method works well in parallel in a distributed computing environment, such as the Grid. Experimental results for 19,797 H. sapiens genes and 25,000 M. musculus genes show that the parallel implementation of this method performs reasonably well in terms of speed and memory usage.
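The abstract does not specify how intervals are chosen or hashed, so the following Python sketch is only one possible reading of the idea, not the authors' implementation. It assumes fixed-length intervals, random start positions on the query side, SHA-256 as the hash, and a reference side that hashes every window of the same length. All function names and parameters (`hash_interval`, `sample_interval_hashes`, `all_interval_hashes`, `shared_fraction`, `length`, `count`, `seed`) are hypothetical.

```python
import hashlib
import random


def hash_interval(interval: str) -> str:
    """Hash one interval (subsequence). SHA-256 is an assumed choice."""
    return hashlib.sha256(interval.encode("ascii")).hexdigest()


def sample_interval_hashes(seq: str, length: int, count: int, seed: int = 0) -> set:
    """Sample `count` intervals of `length` bases from `seq` and hash them.

    Only these hashes would be published; the sequence itself stays private.
    Random start positions are an assumption of this sketch.
    """
    last_start = len(seq) - length
    if last_start < 0:
        return set()
    rng = random.Random(seed)
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return {hash_interval(seq[s:s + length]) for s in starts}


def all_interval_hashes(seq: str, length: int) -> set:
    """Hash every window of `length` bases in a known (public) sequence."""
    return {hash_interval(seq[i:i + length]) for i in range(len(seq) - length + 1)}


def shared_fraction(query_hashes: set, reference_hashes: set) -> float:
    """Fraction of the published query hashes that also occur in the reference."""
    if not query_hashes:
        return 0.0
    return len(query_hashes & reference_hashes) / len(query_hashes)


# Toy usage with made-up sequences (not data from the paper).
novel = "ACGTTGCAAGGCTTAACCGGTATCGATCGA"
known = "ACGTTGCAAGGCTTAACCGGTATCGATCGA"
published = sample_interval_hashes(novel, length=8, count=5)
print(shared_fraction(published, all_interval_hashes(known, length=8)))
```

In this sketch the reference side hashes every window so that a sampled query interval can match at any offset. Each known sequence can be processed independently of the others, which is one way such a comparison could be spread across many machines.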