Wikipedia is the largest online service storing user-generated content. Its pages are open to anyone for additions, deletions and modifications, and the effort of contributors is recorded and can be tracked over time.
In the last twenty years, the evolution of web systems has been driven along three dimensions: the processes used to develop, evolve, maintain and re-engineer the systems themselves; the end products (the pages, content and links) of such processes; and finally the people dimension, with the extraordinary shift in how developers and users shape, interact with and maintain the code and content that they put online. This paper reviews the questions that each of these dimensions has addressed in the past, and indicates which ones will need to be addressed in the future for web system evolution to be sustainable. We show that the study of website evolution has shifted from the server to the client side, focusing on better technologies and processes, and that users becoming creators of content raises several open questions, in particular the credibility of the created content and the long-term sustainability of such resources.
It is established that the internal quality of software is a key determinant of its total cost of ownership. The objective of this research is to determine the impact that the development team's size has on the internal structural attributes of a codebase and, in doing so, to consider the impact that the team's size may have on the internal quality of the software that the team produces. In this paper we leverage the wealth of data available in the open-source domain by mining detailed data from 1,000 projects hosted on Google Code and, coupled with one of the most established object-oriented metric suites, we isolate and identify the effect that development team size has on the internal structural attributes of the software produced. We find that some measures of functional decomposition are enhanced when we compare projects authored by fewer developers against those authored by a larger number of developers, while measures of cohesion and complexity are degraded.
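As a hedged illustration of the kind of analysis described above (not the study's actual pipeline), the following Python sketch groups mined projects by team size and compares the median of a structural metric between the two groups; the field names and the team-size threshold are assumptions made for the example.

```python
# Illustrative sketch: compare a structural metric between projects
# grouped by team size. The field names (team_size, lcom) and the
# threshold of 5 developers are assumptions, not the study's schema.
import statistics

def compare_by_team_size(projects, metric, threshold=5):
    """Return the median of `metric` for small-team and large-team projects."""
    small = [p[metric] for p in projects if p["team_size"] <= threshold]
    large = [p[metric] for p in projects if p["team_size"] > threshold]
    return statistics.median(small), statistics.median(large)

projects = [
    {"team_size": 2,  "lcom": 0.31},
    {"team_size": 14, "lcom": 0.47},
    # ... one record per mined project
]
print(compare_by_team_size(projects, "lcom"))
```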
Accumulated changes to a software system are not uniformly distributed: some elements are changed more often than others. For optimal impact, the limited time and effort available for complexity control, called anti-regressive work, should be applied to the elements of the system that are frequently changed and complex. Based on this, we propose a maintenance guidance model (MGM), which we test against real-world data. MGM takes into account several dimensions of complexity: size, structural complexity and coupling. Results show that maintainers of the eight open-source systems studied tend, in general, to prioritize their anti-regressive work in line with the predictions given by our MGM, although divergences also exist. MGM offers a history-based alternative to existing approaches to identifying elements for anti-regressive work, most of which use static code characteristics only.
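A minimal sketch of the history-based prioritisation idea is given below; the weighting of change frequency against a composite complexity score, and the field names, are illustrative assumptions rather than the actual MGM formulation.

```python
# Minimal sketch: rank elements by change frequency combined with a
# composite complexity score (size, structural complexity, coupling).
# The additive weighting and field names are assumptions, not MGM itself.
def priority(element):
    complexity = element["size"] + element["cyclomatic"] + element["coupling"]
    return element["change_count"] * complexity

elements = [
    {"name": "parser.c", "change_count": 42, "size": 310, "cyclomatic": 27, "coupling": 9},
    {"name": "logger.c", "change_count": 3,  "size": 120, "cyclomatic": 6,  "coupling": 2},
]

# Elements at the top of the ranking are the prime candidates for
# anti-regressive (complexity-reduction) work.
for e in sorted(elements, key=priority, reverse=True):
    print(e["name"], priority(e))
```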
Research on empirical software engineering has increasingly used the data made available in online repositories, specifically those of Free/Libre/Open Source Software (FLOSS) projects. The latest trend for researchers is to gather as much data as possible in order to (i) prevent the bias of a small, unrepresentative sample, (ii) work with a sample as close as possible to the population itself, and (iii) showcase the performance of existing or new tools in handling vast amounts of data. The effects of harvesting enormous amounts of data have been only marginally considered so far: data could be corrupted, repositories could be forked, and developer identities could be duplicated. In this paper we posit that there is a fundamental flaw in harvesting large amounts of data and then generalising the conclusions: the application domain, or context, of the analysed systems must be the primary factor for the cluster sampling of FLOSS projects. This paper presents two contributions: first, we analyse a collection of 100 BENEVOL papers, showing whether (and how much) FLOSS data has been harvested, and how many times the authors flagged issues in the application domains of the analysed systems. Second, we discuss the implications of using the 'application domain' as the clustering factor in FLOSS sampling, and the generalisations within and outside the clusters.
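The clustering argument can be illustrated with a small sketch: projects are grouped by application domain and a sample is drawn from each cluster; the domain labels and per-cluster sample size below are hypothetical.

```python
# Illustrative sketch of cluster sampling FLOSS projects by application
# domain. Domain labels and per-cluster sample size are hypothetical.
import random
from collections import defaultdict

def sample_by_domain(projects, per_domain=2, seed=42):
    """Group projects by domain and draw a fixed-size sample per cluster."""
    random.seed(seed)
    clusters = defaultdict(list)
    for p in projects:
        clusters[p["domain"]].append(p)
    return {d: random.sample(ps, min(per_domain, len(ps)))
            for d, ps in clusters.items()}

projects = [
    {"name": "gcc",      "domain": "compiler"},
    {"name": "llvm",     "domain": "compiler"},
    {"name": "postgres", "domain": "database"},
    {"name": "sqlite",   "domain": "database"},
]
print(sample_by_domain(projects))
```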
Effort estimation models are a fundamental tool in software management, used to forecast the resources, constraints and costs associated with software development. For Free/Open Source Software (FOSS) projects, effort estimation is especially complex: professional developers work alongside occasional, volunteer developers, so the overall effort (in person-months) becomes non-trivial to determine. The objective of this work is to develop a simple effort estimation model for FOSS projects, based on the historical data of developers' effort. The model is fed with direct developer feedback to ensure its accuracy. After extracting the personal development profiles of several thousand developers from 6 large FOSS projects, we asked them to fill in a questionnaire to determine whether they should be considered full-time developers in the project they work on. Their feedback was used to fine-tune the value of an effort threshold, above which developers can be considered full-time. With the help of the over 1,000 questionnaires received, we were able to determine, for every project in our sample, the threshold of commits that separates full-time from non-full-time developers. We finally offer guidelines and a tool to apply our model to FOSS projects that use a version control system.
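To make the threshold idea concrete, the sketch below classifies a developer's months as full-time when the monthly commit count reaches a threshold and sums the resulting person-months; the threshold value of 20 commits per month is a placeholder, not the value calibrated from the questionnaires.

```python
# Hedged sketch of the commit-threshold model: a month counts as
# full-time when its commit count reaches the threshold. The value 20
# is a placeholder, not the calibrated per-project threshold.
COMMIT_THRESHOLD = 20

def full_time_months(monthly_commits, threshold=COMMIT_THRESHOLD):
    """Number of months in which the developer worked full-time."""
    return sum(1 for c in monthly_commits if c >= threshold)

def estimate_effort(developers, threshold=COMMIT_THRESHOLD):
    """Approximate effort in person-months, counting only full-time months."""
    return sum(full_time_months(m, threshold) for m in developers.values())

developers = {"alice": [35, 28, 5], "bob": [2, 1, 0], "carol": [22, 25, 30]}
print(estimate_effort(developers))  # -> 5 person-months under these assumptions
```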
Students in higher education are traditionally requested to produce various pieces of written work during the courses they undertake. When students' work is submitted online as a single deliverable, both the ethically questionable act of procrastinating and late submissions affect performance. The objective of this paper is to compare the performance of students in a control group with that of students in an experimental group. The control group produced its work as a single deliverable submitted at the end of the course. The experimental group, on the other hand, worked on each part for a week, with their work managed in a wiki environment and monitored by specifically developed software. Positive effects were noticed in the experimental group, as both the students' time management skills and their performance improved. Replications of this experiment can and should be performed in order to compare results in coursework submission.
Previous research in software application domain classification has faced challenges due to the lack of a proper taxonomy that explicitly models relations between classes. As a result, current solutions are less effective for real-world usage. This study aims to develop a comprehensive software application domain taxonomy by integrating multiple data sources and leveraging ensemble methods. The goal is to overcome the limitations of individual sources and configurations by creating a more robust, accurate, and reproducible taxonomy. This study employs a quantitative research design involving three different data sources: an existing Computer Science Ontology (CSO), Wikidata, and LLMs. The study utilises a combination of automated and human evaluations to assess the quality of the resulting taxonomy. The outcome measures include the number of unlinked terms, self-loops, and overall connectivity of the taxonomy. The results indicate that individual data sources have advantages and drawbacks: the CSO source showed minimal variance across different configurations, but suffered from missing technical terms and a high number of self-loops. The Wikidata source required significant filtering during construction to improve metric performance. LLM-generated taxonomies demonstrated better performance when using context-rich prompts. An ensemble approach showed the most promise, successfully reducing the number of unlinked terms and self-loops, thus creating a more connected and comprehensive taxonomy. The study addresses the construction of a software application domain taxonomy relying on pre-existing resources. Our results indicate that an ensemble approach to taxonomy construction can effectively address the limitations of individual data sources. Future work should focus on refining the ensemble techniques and exploring additional data sources to enhance the taxonomy's accuracy and completeness.
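As a hedged illustration of the outcome measures used for evaluation, the sketch below counts self-loops and unlinked terms in a taxonomy represented as a mapping from each term to its parent terms; the example terms are hypothetical and do not come from the study's datasets.

```python
# Illustrative sketch: count self-loops and unlinked terms in a taxonomy
# stored as {term: set_of_parent_terms}. Example terms are hypothetical.
def taxonomy_metrics(taxonomy, all_terms):
    self_loops = [t for t, parents in taxonomy.items() if t in parents]
    unlinked = [t for t in all_terms
                if t not in taxonomy
                and not any(t in parents for parents in taxonomy.values())]
    return {"self_loops": len(self_loops), "unlinked_terms": len(unlinked)}

all_terms = {"web framework", "compiler", "text editor", "blockchain"}
taxonomy = {
    "web framework": {"web application"},
    "compiler": {"compiler"},          # self-loop
    "text editor": {"developer tool"},
}
print(taxonomy_metrics(taxonomy, all_terms))
# -> {'self_loops': 1, 'unlinked_terms': 1}  ('blockchain' is unlinked)
```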