Bilkent University
Department of Computer Engineering
M.S. THESIS PRESENTATION
LEVERAGING FILE SIGNIFICANCE IN BUS FACTOR ESTIMATION
Vahid Haratian
Master's Student
(Supervisor: Asst. Prof. Eray Tüzün)
Computer Engineering Department
Bilkent University
Abstract: Developers leave software projects for various reasons. Because developers are one of the main sources of knowledge in a project, their departure inevitably causes some degree of knowledge loss. Bus Factor (BF) is a metric for evaluating how this knowledge loss can affect the project’s continuity. Conventionally, BF is calculated as the size of the smallest set of developers whose departure would remove more than half of the project’s knowledge. Current state-of-the-art approaches measure developers’ knowledge by the number of files they have authored, using version control system (VCS) information. However, numerous studies have shown that files in software projects differ in significance. In this study, we explore how weighting files according to their significance affects the performance of two prevailing BF estimators. We derive significance scores by computing five well-known graph metrics on the project’s dependency graph: PageRank, In-Degree, Out-Degree, All-Degree, and Betweenness centrality. Furthermore, we introduce BFSig, a prototype of our approach. We also present a new dataset of reported BF scores collected by surveying software practitioners from five prominent GitHub repositories. Moreover, we identified a lack of reusable tools for dependency graph extraction: existing tools are either outdated and difficult to configure, or fail to provide accurate analysis. Since Integrated Development Environments (IDEs) are well suited to addressing these issues, we leveraged their capabilities to develop RefExpo, a reusable dependency graph extraction tool that supports multiple languages, such as Java, Python, and JavaScript. RefExpo is a plugin for IntelliJ, a well-maintained and reputable IDE. In addition, we compile an initial version of our dataset consisting of 20 Java and Python projects.
Our results indicate that BFSig outperforms the baselines, reducing Normalized Mean Absolute Error (NMAE) by up to 18%. Moreover, BFSig yields 18% fewer false negatives when identifying potential risks associated with low BF. In addition, our respondents confirmed BFSig’s versatility, as it was able to assess the BF of the project’s subfolders. In conclusion, we believe that, when estimating BF from authorship, software components of higher importance should be assigned heavier weights. Currently, BFSig explores only the topological characteristics of these components; considering attributes such as code complexity and bug proneness could further enhance its performance.
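For readers unfamiliar with the idea, the following is a minimal Python sketch of a significance-weighted bus factor. It is not the BFSig implementation. It assumes a toy dependency graph whose edges point from a file to the files it depends on, a plain power-iteration PageRank as the significance score, a hypothetical file-to-authors mapping, and a greedy removal heuristic that is not guaranteed to find the smallest set of developers. All file names, developer names, and function names are illustrative.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {file: [files it depends on]}."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this file's rank among the files it depends on.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A file with no dependencies spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


def weighted_bus_factor(authors, weights):
    """Greedily remove developers until the orphaned files
    (files with no remaining author) hold more than half the total weight."""
    total = sum(weights[f] for f in authors)
    remaining = {f: set(devs) for f, devs in authors.items()}
    developers = sorted({d for devs in authors.values() for d in devs})
    removed = []
    while developers:
        orphaned = sum(weights[f] for f, devs in remaining.items() if not devs)
        if orphaned > total / 2:
            break
        # Assumption: the next developer to leave is the one who currently
        # covers the most weight (ties broken alphabetically).
        leaver = max(
            developers,
            key=lambda d: sum(weights[f] for f, devs in remaining.items() if d in devs),
        )
        developers.remove(leaver)
        removed.append(leaver)
        for devs in remaining.values():
            devs.discard(leaver)
    return len(removed), removed


# Hypothetical project: four files and four developers.
deps = {
    "app.py": ["core.py", "utils.py"],
    "cli.py": ["core.py"],
    "core.py": ["utils.py"],
    "utils.py": [],
}
authors = {
    "app.py": {"alice"},
    "cli.py": {"bob"},
    "core.py": {"carol"},
    "utils.py": {"carol", "dave"},
}

uniform = {f: 1.0 for f in authors}   # every file counts equally
significance = pagerank(deps)         # heavily depended-on files count more

unweighted_bf, _ = weighted_bus_factor(authors, uniform)
weighted_bf, leavers = weighted_bus_factor(authors, significance)
print(unweighted_bf, weighted_bf, leavers)
```

In this toy project, the uniform weighting needs three departures before more than half of the files are orphaned, whereas the PageRank weighting reaches the threshold after two, because the two developers who own the heavily depended-on files account for most of the weighted knowledge.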
DATE: January 08, Wednesday @ 15:30
PLACE: EA 409