A Framework to represent variables and values in Social Science research data sets to support data curation and reuse

Guangyuan Sun, Christopher S. G. Khoo

2018

OBJECTIVES An ontology is being developed to model quantitative research datasets stored in social science data repositories, to help researchers discover and reuse datasets. An increasing number of universities are embarking on research data management, which includes encouraging faculty members to develop data management plans and building institutional data repositories to support the preservation and reuse of research data produced by their faculty members and research staff. Different disciplines generate different types of research data (e.g., quantitative data, qualitative textual data, audio, video, and images).
Quantitative social science datasets are challenging to reuse. Users need to make a substantial intellectual effort to understand the meaning of the variables and values in the datasets. Different researchers may give different names to variables that refer to the same or similar concepts. Conversely, variables with the same name may refer to different concepts. Relationships between variables may not be readily apparent, and users may need to read the accompanying documentation closely to understand them. The unit of measure or scale for numeric values may not be stored with the dataset. The semantics of categorical values may not be obvious to the user. Thus, there is a need to study the knowledge organization and representation of quantitative social science research data to support data reuse. The knowledge organization scheme needs to represent the structure and semantics of datasets and data values accurately.
METHODS We collected a sample of datasets and associated survey questionnaires from two well-known social science data repositories (the Inter-university Consortium for Political and Social Research and the UK Data Archive), and analysed their content and structure to identify the requirements for representing datasets accurately and comprehensively.
MAIN RESULTS Based on these requirements, we developed a framework for modelling datasets, represented as an upper-level ontology. A diagram summarizing the framework is in the attachment. This section describes the framework.
First, we propose to describe a dataset from two perspectives: a physical description and a conceptual description.
Physically, adopting the terminology of relational database theory (Date, 2000), a dataset is made up of one or more relation values (i.e., data tables), each with two parts, a heading and a body. The heading is a set of {attribute-name (i.e., column name): type-name (i.e., datatype)} pairs, such as {“Gender”: “Integer”}. The type-name component represents the fact that each attribute is associated with a simple datatype (e.g., integer, character string, or date), meaning that the values of the attribute are members of that datatype. The body is a set of tuples (i.e., rows or records) that conform to the heading. Each tuple contains exactly one value for each attribute. Tuples are distinguished by having unique values in the primary key attribute (i.e., unique identifier), such as {“ID”, 001}.
A relation value is physically stored in a tabular file format (e.g., SPSS, CSV, or SAS). However, this physical file is not what we refer to as the physical description. A physical file is what can be downloaded, and it may be in different file formats. The physical description of a dataset refers to a higher-level representation of the physical file: the heading and the body of a relation value, regardless of the order of attributes and tuples or the medium of storage. In this study, when we look at a dataset from its physical perspective, we refer to its physical description rather than its physical file format.
Conceptually, a dataset is a collection of variable-value pairs related to a set of objects of study (often referred to in social science research as subjects, participants, or respondents).
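As a concrete illustration of the physical description discussed above, the following is a minimal Python sketch of a relation value with a heading and a body. It is not the authors' implementation. The names `RelationValue`, `make_relation`, and `TYPE_NAMES`, the type-name "String", and the sample rows are all assumptions made for this example. Storing the heading and body as sets reflects the point that the order of attributes and tuples does not matter.

```python
from dataclasses import dataclass

# Hypothetical mapping from type-names to Python types (illustration only).
TYPE_NAMES = {"Integer": int, "String": str}


@dataclass(frozen=True)
class RelationValue:
    heading: frozenset  # set of (attribute_name, type_name) pairs
    body: frozenset     # set of tuples; each tuple is a frozenset of (attribute_name, value) pairs
    primary_key: str    # name of the attribute that identifies each tuple


def make_relation(heading: dict, rows: list, primary_key: str) -> RelationValue:
    """Build a relation value, checking that every tuple conforms to the heading."""
    seen_keys = set()
    for row in rows:
        # Each tuple holds exactly one value for each attribute in the heading.
        if set(row) != set(heading):
            raise ValueError(f"tuple does not match heading: {row}")
        # Each value must be a member of the attribute's datatype.
        for name, value in row.items():
            if not isinstance(value, TYPE_NAMES[heading[name]]):
                raise TypeError(f"{name}={value!r} is not of type {heading[name]}")
        # Tuples are distinguished by unique primary key values.
        if row[primary_key] in seen_keys:
            raise ValueError(f"duplicate primary key: {row[primary_key]!r}")
        seen_keys.add(row[primary_key])
    return RelationValue(
        heading=frozenset(heading.items()),
        body=frozenset(frozenset(row.items()) for row in rows),
        primary_key=primary_key,
    )


# Two tables with the same content but different column and row order.
a = make_relation(
    {"ID": "String", "Gender": "Integer"},
    [{"ID": "001", "Gender": 1}, {"ID": "002", "Gender": 2}],
    primary_key="ID",
)
b = make_relation(
    {"Gender": "Integer", "ID": "String"},
    [{"Gender": 2, "ID": "002"}, {"Gender": 1, "ID": "001"}],
    primary_key="ID",
)
```

In this sketch, `a` and `b` compare equal even though their columns and rows were supplied in a different order, which is the sense in which the physical description is independent of the physical file layout.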
The objects of study can be at different levels of granularity (e.g., individual persons, a group of people, an individual organization, or a group of organizations). The variable-value pairs represent concepts in a particular domain. We call this the conceptual description: the representation of the variable-value pairs as domain concepts for recording the characteristics of the study objects.
The dataset representation is separated into conceptual and physical perspectives to support dataset reuse. The conceptual description represents information in a way that relates to users’ information needs. We assume that users searching and browsing a data repository would be overwhelmed if presented immediately with the tabular datasets. They are more interested in finding information about the objects of study represented in the dataset, and about the variables in the dataset that relate to social science research concepts. The physical description is needed for statistical analysis and data mining. Thus, we propose that users of the data repository should first be shown the dataset’s conceptual description; the physical description is provided only on request. To link the conceptual description to the physical description, we defined an rdf:Property, hasPhysicalRepresentation, as shown in the attached framework.
To represent the semantics of datasets, the conceptual description needs to be linked to an ontology. Both variables and variable values need to be mapped to concepts in the ontology (e.g., Gender:male|female). Concepts for social science variables and their values were collected from the sampled questionnaires and organized into an ontology with two classes, Class:Concept_Variable and Class:Concept_VariableValue. Both have subclasses: Class:Concept_Variable has subclasses such as Class:Sex, and Class:Concept_VariableValue has subclasses such as Class:Male and Class:Female.
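The class hierarchy and the link to the physical description can be sketched as a small set of subject-predicate-object triples in plain Python. This is only an illustration under stated assumptions, not the authors' OWL model. Every identifier beginning with `ds1:` is hypothetical, and the use of `rdf:type` to map a dataset variable or value to a concept, and of `rdfs:subClassOf` for the hierarchy, is an assumption of this sketch. The helper names `objects`, `superclasses`, and `is_instance_of` are also made up for the example.

```python
# Triples are (subject, predicate, object) strings.
TRIPLES = {
    # Class hierarchy named in the text.
    ("Sex", "rdfs:subClassOf", "Concept_Variable"),
    ("Male", "rdfs:subClassOf", "Concept_VariableValue"),
    ("Female", "rdfs:subClassOf", "Concept_VariableValue"),
    # Hypothetical dataset "ds1": a variable and its two coded values,
    # mapped to ontology concepts (assumed here to use rdf:type).
    ("ds1:Gender", "rdf:type", "Sex"),
    ("ds1:Gender_1", "rdf:type", "Male"),
    ("ds1:Gender_2", "rdf:type", "Female"),
    # Link from the conceptual description to the physical description.
    ("ds1:Gender", "hasPhysicalRepresentation", "ds1:table1#Gender"),
}


def objects(subject: str, predicate: str) -> set:
    """Return every object o such that (subject, predicate, o) is a triple."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def superclasses(cls: str) -> set:
    """Return cls plus every class reachable through rdfs:subClassOf."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(objects(current, "rdfs:subClassOf"))
    return found


def is_instance_of(node: str, cls: str) -> bool:
    """True if node is typed with cls or with any subclass of cls."""
    return any(cls in superclasses(t) for t in objects(node, "rdf:type"))
```

With these triples, a repository interface could first show that `ds1:Gender` is a kind of `Concept_Variable` and follow `hasPhysicalRepresentation` to the underlying column only when the user asks for it.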
There are relationships between these subclasses, as shown in the attached framework. We defined an rdf:Property, hasConceptValue, to link subclasses of Class:Concept_Variable to those of Class:Concept_VariableValue. For a particular dataset, its variables are instances of Class:Concept_Variable in the ontology, and its variable values are instances of Class:Concept_VariableValue. In this way, the semantics of a dataset are assigned by linking its conceptual description to the ontology.
CONCLUSIONS This paper will present the knowledge organization framework and the issues encountered in organizing and representing datasets using the framework. The ontology is described in the OWL 2 language and stored in the graph database system Neo4j. A visualization interface is being developed to support users in browsing the ontology and dataset metadata, so that they can understand the structure and semantic content of the datasets. Practically, the framework developed in this study is expected to be easily adapted and adopted by data repositories storing quantitative datasets. Theoretically, we will contribute to the knowledge organization and knowledge representation of social science quantitative data by identifying high-level concepts and relations that are frequently found in social science quantitative research.
