Internet as Data Source in the Istat Survey on ICT in Enterprises

2015 
The Istat sample survey on ICT in enterprises aims to produce information on the use of ICT, and in particular on the use of the Internet, by Italian enterprises for various purposes (e-commerce, e-recruitment, advertisement, e-tendering, e-procurement, e-government). To this end, data are collected by means of the traditional instrument of the questionnaire. Istat has begun to explore the possibility of using web scraping techniques, combined in the estimation phase with text and data mining algorithms, with the aim of replacing traditional instruments of data collection and estimation, or of combining them in an integrated approach. The 8,600 websites indicated by the 19,000 enterprises responding to the 2013 ICT survey have been scraped, and the acquired texts have been processed in an attempt to reproduce the same information collected via the questionnaire. Preliminary results are encouraging, showing in some cases a satisfactory predictive capability of the fitted models (mainly those obtained with the Naive Bayes algorithm). The method known as Content Analysis has also been applied, and its results compared with those obtained with classical learners. To improve overall performance, an advanced system for scraping and mining is being adopted, based on the open source Apache suite Nutch-Solr-Lucene. On the basis of the final results of this test, an integrated system harnessing both survey data and data collected from the Internet to produce the required estimates will be implemented, based on systematic scraping of the nearly 100,000 websites related to the whole population of Italian enterprises with 10 or more persons employed, operating in industry and services. This new approach, based on Internet as Data source (IaD), has advantages and drawbacks that need to be carefully analysed.
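The idea of predicting a questionnaire answer from scraped website text can be illustrated with a small sketch. The abstract does not describe the features or preprocessing actually used, so the code below assumes a plain bag-of-words multinomial Naive Bayes classifier with Laplace smoothing. The function names (`tokenize`, `train`, `predict`), the toy texts, and the yes/no "sells online" label are all hypothetical and serve only as illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(docs, labels, alpha=1.0):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_counts = Counter(labels)       # number of documents per class
    word_counts = defaultdict(Counter)   # token counts per class
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, text):
    """Return the class with the highest smoothed log-posterior score."""
    n_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n / n_docs)     # log prior
        for tok in tokenize(text):
            if tok in model["vocab"]:    # tokens unseen in training are skipped
                score += math.log((counts[tok] + alpha)
                                  / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: website text paired with a hypothetical survey answer
# to "does the enterprise sell online?" (True / False).
texts = [
    "add to cart checkout online shop payment",
    "buy online cart secure payment shipping",
    "about us company history contact",
    "our team mission contact address",
]
answers = [True, True, False, False]

model = train(texts, answers)
print(predict(model, "shop online and pay at checkout"))
print(predict(model, "contact our company"))
```

In a setting like the one described, the labels would come from the survey responses of enterprises whose websites were scraped, and the fitted model would then be applied to websites of enterprises outside the sample.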