Data from Multiple Web Sources: Crawling, Integrating, Preprocessing, and Designing Applications

2020 
Data from the Web are increasingly heterogeneous and unstructured, which poses challenges for data crawling, integration, and preprocessing. Many studies are “data oriented,” i.e., they are developed to address a problem arising from the data already available, so their results are restricted to those data. In contrast, several problems arise before one has even identified what data a specific study needs, and often multiple data sources are required. This chapter covers such problems with definitions, current solutions, possible issues, and future work. Specifically, the first issue in dealing with data coming from the Web is to define the crawling strategy, which can be classified according to its period and how it is started. The second issue is to define a strategy for integrating data from different sources so as to provide a uniform view for users or applications, and to store the data in a way that allows efficient querying. One possibility is to collect data from each source and store them separately for later integration; another is to store all data in a single location in an integrated fashion as each collection is performed. The third issue is data preprocessing, which takes place before or after data integration and involves handling missing and duplicate data, normalization, data veracity, etc. Overall, this chapter addresses these three issues in an integrated way with a focus on practical and research questions.
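As a rough illustration of the second and third issues, the sketch below maps records from two sources onto one schema and then applies simple preprocessing (normalization, a missing-value default, duplicate removal). It is a minimal example under stated assumptions, not the chapter's method: the sources `SOURCE_A` and `SOURCE_B`, their field names, the helper names, and the choice to fill a missing count with 0 are all hypothetical.

```python
# Hypothetical records collected from two sources with different field names.
SOURCE_A = [
    {"title": "Web Data Integration ", "year": "2020", "cites": 10},
    {"title": "Crawling Basics", "year": "2018", "cites": None},
]
SOURCE_B = [
    {"name": "web data integration", "published": 2020, "citations": 12},
    {"name": "Data Veracity", "published": 2019, "citations": 4},
]

# Assumed mappings from each source's fields to a shared (uniform) schema.
MAPPING_A = {"title": "title", "year": "year", "cites": "cites"}
MAPPING_B = {"name": "title", "published": "year", "citations": "cites"}


def to_uniform(record, mapping, source):
    """Rename source-specific fields to the shared schema and tag the origin."""
    out = {target: record.get(field) for field, target in mapping.items()}
    out["source"] = source
    return out


def preprocess(records):
    """Normalize values, fill missing counts, and drop duplicates."""
    seen = {}
    for r in records:
        title = r["title"].strip().lower()   # normalization
        year = int(r["year"])                # unify types across sources
        # Assumption: a missing citation count is treated as 0.
        cites = r["cites"] if r["cites"] is not None else 0
        clean = {"title": title, "year": year, "cites": cites, "source": r["source"]}
        # Duplicates share (title, year); the first occurrence is kept.
        seen.setdefault((title, year), clean)
    return list(seen.values())


def integrate():
    """Integrate both sources first, then preprocess the unified view."""
    unified = [to_uniform(r, MAPPING_A, "A") for r in SOURCE_A]
    unified += [to_uniform(r, MAPPING_B, "B") for r in SOURCE_B]
    return preprocess(unified)


if __name__ == "__main__":
    for row in integrate():
        print(row)
```

Here preprocessing runs after integration; the same `preprocess` function could instead be applied to each source before merging, which corresponds to the alternative ordering mentioned in the abstract. Keeping the first occurrence of a duplicate is only one possible policy; a real application might merge conflicting values or prefer the more reliable source.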