Background refreshing method based on Spark-SQL big data processing platform

2015 
The invention discloses a background refreshing method for a Spark-SQL big data processing platform. A new process with a timed refresh mechanism is created in the Spark-SQL entry function, and it periodically scans the file directory structure of specified table spaces in HDFS (Hadoop Distributed File System). Configuration items added to hive-site.xml, under the conf folder of the Spark installation directory, let the user configure whether the refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed. For big data workloads, the method greatly reduces the first-query time of the Spark-SQL big data processing platform. As an example, take 20 TB of data in a big data table that is divided into 25 first-level partitions by hour and 1001 second-level partitions by the first three digits of the mobile phone number, and stored compressed in the PARQUET format. For a query that totals all data of a given number segment over a given period of time, the first query originally takes about 20 minutes; with the background refreshing method of the invention, the first query takes about 45 seconds.
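The timed refresh mechanism described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: it assumes a Python daemon thread in place of the new process, a local directory tree walked with `os.walk` in place of HDFS, and hypothetical names (`BackgroundRefresher`, `table_dirs`, `interval_seconds`, `enabled`) standing in for the three configuration items that the abstract places in hive-site.xml.

```python
import os
import threading


class BackgroundRefresher:
    """Periodically rescan table-space directories and cache the file listing.

    Hypothetical stand-in: a local directory tree plays the role of HDFS,
    and a daemon thread plays the role of the refresh process.
    """

    def __init__(self, table_dirs, interval_seconds, enabled=True):
        self.table_dirs = list(table_dirs)        # table spaces to refresh
        self.interval_seconds = interval_seconds  # refresh interval
        self.enabled = enabled                    # whether refreshing is on
        self._cache = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = None

    def scan_once(self):
        """Walk every configured directory and store its sorted file list."""
        for root in self.table_dirs:
            files = []
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    files.append(os.path.join(dirpath, name))
            files.sort()
            with self._lock:
                self._cache[root] = files

    def _run(self):
        # Scan, then sleep for one interval or until stop() is called.
        while not self._stop.is_set():
            self.scan_once()
            self._stop.wait(self.interval_seconds)

    def start(self):
        """Start the background loop unless disabled or already running."""
        if not self.enabled or self._thread is not None:
            return
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        """Signal the loop to exit and wait for the thread to finish."""
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def cached_files(self, root):
        """Return a copy of the most recently cached listing for root."""
        with self._lock:
            return list(self._cache.get(root, []))
```

In this sketch a query path would call `cached_files` instead of walking the directory tree itself, which is the kind of saving the abstract attributes to refreshing in the background: the listing of the many partition directories is already available when the first query arrives.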