ESEM 2017: EMPIRICAL SOFTWARE ENGINEERING AND MEASUREMENT
PROGRAM FOR FRIDAY, NOVEMBER 10TH

09:00-10:00 Session 6: Keynote II: Industry-Academia Communication in Empirical Software Engineering, Per Runeson (Lund University, Sweden)

Abstract: Researchers in software engineering must communicate with industry practitioners, both engineers and managers. Communication may be about collaboration buy-in, problem identification, empirical data collection, solution design, evaluation, and reporting. In order to gain mutual benefit from the collaboration, ensuring relevant research and improved industry practice, researchers and practitioners must be good at communicating. The basis for a researcher to be good at industry-academia communication is firstly to be “bi-lingual”. The terminology in each domain is often different and the number of TLAs (three-letter abbreviations) in industry is overwhelming. Understanding and being able to translate between these “languages” is essential. Secondly, it is also about being “bi-cultural”. Understanding the incentives in industry and academia, respectively, is a basis for being able to find a balance between, e.g., rigor and relevance in the research. Time frames are another aspect that differs between the two cultures. Thirdly, the choice of communication channels is key to reaching the intended audience. A wide range of channels exists, from face-to-face meetings, via tweets and blogs, to academic journal papers and theses, each having its own audience and purposes. The keynote speech will explore the challenges of industry-academia communication, based on two decades of collaboration experiences, both successes and failures. It aims to support primarily the academic side of the communication, to help achieve industry impact through rigorous and relevant empirical software engineering research.

Bio: Dr. Per Runeson is a professor of software engineering at Lund University, Sweden, Head of the Department of Computer Science, and the leader of its Software Engineering Research Group (SERG) and the Industrial Excellence Center on Embedded Applications Software Engineering (EASE). His research interests include empirical research on software development and management methods, in particular for software testing and open innovation, and cross-disciplinary topics on the digital society. He has contributed significantly to software engineering research methodology through his books on case studies and experimentation in software engineering. He serves on the editorial boards of Empirical Software Engineering and Software Testing, Verification and Reliability, and is a member of several program committees.

Location: Markham Ballroom A/B
10:00-10:30 Coffee Break
10:00-10:01 Session 7
10:00
Multi-objective regression test selection in practice: an empirical study in the defense software industry

ABSTRACT. Executing an entire regression test suite after every code change is often costly in large software projects. To address this challenge, various regression test selection techniques have been proposed in the literature. One of those approaches is the Multi-Objective Regression Test Optimization (MORTO) approach, which is applied when there are multiple objectives during regression testing (e.g., minimizing the number of test cases and maximizing test coverage). This paper reports an “action research”-based empirical study conducted to improve the regression test-selection practices for a safety-critical industrial software system in the defence domain based on the MORTO approach. The problem is formulated and solved by converting the multi-objective genetic-algorithm (GA) problem into a custom-built scalarized single-objective GA. The empirical results demonstrate that this approach yields a more efficient test suite (in terms of testing cost and benefits) compared to the old (manual) test-selection approach and another approach from the literature, i.e., a selective requirement coverage-based approach. We developed the GA based on a set of five cost objectives and four benefit objectives for regression testing while providing full coverage of affected (changed) requirements. Since our proposed approach has been beneficial and has made regression testing more efficient, it is currently in active use in the company.
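
As an illustration of the kind of formulation the abstract describes, the sketch below (in Python) collapses several regression-testing objectives into one value that a single-objective GA could maximize. It is a minimal, hypothetical example: the objective fields, weights, and constraint handling are illustrative assumptions, not the paper's actual five cost and four benefit objectives.

# Hedged sketch: weighted-sum scalarization of multiple regression-test
# objectives into a single GA fitness value. All fields and weights are
# illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass
from typing import List, Set

@dataclass
class TestCase:
    exec_time: float            # cost objective: execution time
    setup_cost: float           # cost objective: setup/environment effort
    reqs_covered: Set[str]      # benefit objective: requirements covered
    faults_found_history: int   # benefit objective: historical fault detection

def scalarized_fitness(selection: List[int], suite: List[TestCase],
                       changed_reqs: Set[str],
                       w_cost: float = 0.5, w_benefit: float = 0.5) -> float:
    """Collapse multiple objectives into one value for a single-objective GA.
    'selection' is a 0/1 vector over the test suite."""
    chosen = [t for t, keep in zip(suite, selection) if keep]
    # Hard constraint from the abstract: full coverage of changed requirements.
    covered = set().union(*(t.reqs_covered for t in chosen)) if chosen else set()
    if not changed_reqs <= covered:
        return float("-inf")    # infeasible selection
    cost = sum(t.exec_time + t.setup_cost for t in chosen)
    benefit = sum(t.faults_found_history for t in chosen)
    return w_benefit * benefit - w_cost * cost

# Example usage with two toy test cases and one changed requirement.
suite = [TestCase(3.0, 1.0, {"R1"}, 2), TestCase(8.0, 2.0, {"R2"}, 1)]
print(scalarized_fitness([1, 0], suite, changed_reqs={"R1"}))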

10:30-12:00 Session 8A: Experiments
Location: Markham Ballroom A/B
10:30
Estimating Energy Impact of Software Releases and Deployment Strategies: the KPMG Case Study

ABSTRACT. Background. Often motivated by optimisation objectives, software products are characterised by different subsequent releases and deployed through different strategies. The impact of these two aspects of software on energy consumption has yet to be completely understood; understanding can be improved by carrying out ad-hoc analyses for specific software products. Aims. In this research we report on an industrial collaboration aiming at assessing the different impact that releases and deployment strategies of a software product can have on the energy consumption of its underlying hardware infrastructure. Method. We designed and performed an empirical experiment in a controlled environment. Deployment strategies, releases and use case scenarios of an industrial third-party software product were adopted as experimental factors. The use case scenarios were used as a blocking factor and adopted to dynamically load test the software product. The power consumption and execution time were selected as response variables to measure the energy consumption and achieve our proposed goal. Results. We observed that both deployment strategies and software releases significantly influence the energy consumption of the hardware infrastructure. A strong interaction between the two factors was identified. The impact of this interaction varied greatly depending on which use case scenario was considered, making the identification of the most frequently adopted use case scenario critical for energy optimisation. The collaboration between industry and academia has been productive for both parties, even if some practitioners manifested low interest/awareness in software energy efficiency. Conclusions. For the software product considered, there is no absolutely preferable release and deployment strategy with respect to energy efficiency. The number of machines involved in a software deployment strategy does not simply have an additive effect on the energy consumption of the underlying hardware infrastructure.

11:00
Graphical vs. Tabular Notations for Risk Models: On the Role of Textual Labels and Complexity

ABSTRACT. [Background] Security risk assessment methods in industry mostly use a tabular notation to represent the assessment results, whilst academic works advocate graphical methods. Several experiments with MSc students showed that the tabular notation is better than a rich graphical notation in supporting the comprehension of security risks. [Aim] We aim to investigate whether the availability of textual labels and a terse UML-style notation could change the result. [Method] In this paper we report the results of an online comprehensibility experiment involving 61 professionals with an average of 9 years of working experience, in which we compared the ability to extract security risk information represented in tabular, UML-style with textual labels, and iconic graphical modeling notations. [Results] The results show that the availability of textual labels does improve the precision and recall of participants' responses to comprehensibility questions. [Conclusion] We can conclude that the tabular representation better supports extraction of correct information about security risks than the graphical notation, but textual labels help.

11:30
The Influence of Requirements in Software Model Development in an Industrial Environment

ABSTRACT. Textual description of requirements is a specification technique that is widely used in industry, where time is key for success. How requirements are specified textually greatly depends on human factors. In order to study how requirements processing is affected by the level of detail in textual descriptions, this paper compares enriched textual requirements specifications with non-enriched ones. To do this, we have conducted an experiment in industry with 19 engineers of CAF (Construcciones y Auxiliares de Ferrocarril), which is a supplier of railway solutions. The experiment is a crossover design that analyzes efficiency, effectiveness, and perceived difficulty starting from a written specification of requirements that subjects must process in order to build software models. The results show that effectiveness and efficiency are better for enriched requirements, while non-enriched requirements are slightly more difficult to deal with. Therefore, even though enriched requirements require more time to be specified, the results are more successful when using them.

10:30-12:00 Session 8B: Software Quality
Location: Markham Ballroom C
10:30
An Industry Perspective to Comparing the SQALE and Quamoco Software Quality Models

ABSTRACT. Context: We investigate the different perceptions of quality provided by leading operational quality models when used to evaluate software systems from an industry perspective. Goal: To compare and evaluate the quality assessments of two competing quality models and to develop an extensible solution to meet the quality assurance measurement needs of an industry stakeholder, the Construction Engineering Research Laboratory (CERL). Method: In cooperation with our industry partner TechLink, we operationalize the Quamoco quality model and employ a multiple case study design comparing the results of Quamoco and SQALE, two implementations of well-known quality models. The study is conducted across current versions of several open source software projects sampled from GitHub and commercial software for sustainment management systems implemented in the C# language from our industry partner. Each project represents a separate embedded unit of study in a given context, open source or commercial. We employ inter-rater agreement and correlation analysis to compare the results of both models, focusing on Maintainability, Reliability, and Security assessments. Results: Our observations suggest that there is a significant disconnect between the assessments of quality under both quality models. Conclusion: In order to support industry adoption, additional work is required to bring competing implementations of quality models into alignment. This exploratory case study helps us shed light on this problem.

11:00
Formative Evaluation of a Tool for Managing Software Quality

ABSTRACT. Background: To achieve high software quality, particularly in the context of agile software development, organizations need tools to continuously analyze software quality. Several quality management (QM) tools have been developed in recent years. However, there is a lack of evidence on the quality of QM tools and of a standardized definition of such quality and reliable instruments for measuring it. This, in turn, impedes proper selection and improvement of QM tools. Aims: We aimed at operationalizing the quality of a research QM tool – namely the ProDebt prototype – and evaluating its quality. The goal of the ProDebt prototype is to provide practitioners with support for managing software quality and technical debt. Method: We performed interviews, workshops, and a mapping study to operationalize the quality of the ProDebt prototype and to identify reliable instruments to measure it. We also designed a mixed-method study aimed at formative evaluation, i.e., at assessing the quality of the ProDebt prototype and providing guidance for its further development. Eleven practitioners from two German companies evaluated the ProDebt prototype. Results: The participants assessed the information provided by the ProDebt prototype as understandable and relevant. They considered the ProDebt prototype’s functionalities as easy to use but of limited usability. They identified improvement needs, e.g., that the analysis results should be linked to other information sources and the support for interpreting quality metrics should be extended. Conclusions: The evaluation design was of practical value for evaluating the ProDebt prototype considering the limited resources such as the practitioners’ time. The evaluation results provided the developers of the ProDebt prototype with guidance for its further development. We conclude that it can be used and tailored for replication or evaluation of other QM tools.

11:30
The Impact of Coverage on Bug Density in a Large Industrial Software Project

ABSTRACT. Measuring the quality of test suites is one of the major challenges of software testing. Code coverage identifies tested and untested parts of code and is frequently used to approximate test suite quality. Multiple previous studies have investigated the relationship between coverage ratio and test suite quality, without a clear consensus in the results. In this work we study whether covered code contains a smaller number of bugs than uncovered code (assuming appropriate scaling). If this correlation holds and bug density is lower in covered code, coverage can be regarded as a meaningful metric to estimate the adequacy of testing. To this end we analyse bugs and bug-fixes of SAP HANA, a large industrial software project. We found that the above-mentioned relationship indeed holds, and is statistically significant. Contrary to most previous works, our study uses real bugs and real bug-fixes. Furthermore, our data is derived from a complex industrial project.

10:30-12:00 Session 8C: Repository Analysis I
Location: Butternut/Holly
10:30
Quantifying the Transition from Python 2 to 3: An Empirical Study of Python Applications

ABSTRACT. Background: Python is one of the most popular modern programming languages. In 2008 its authors introduced a new version of the language, Python 3.0, that was not backward compatible with Python 2, initiating a transitional phase for Python software developers. Aims: The study described in this paper investigates the degree to which Python software developers are making the transition from Python 2 to Python 3. Method: We have developed a Python compliance analyser, PyComply, and have assembled a large corpus of Python applications. We use PyComply to measure and quantify the degree to which Python 3 features are being used, as well as the rate and context of their adoption. Results: In fact, Python software developers are not exploiting the new features and advantages of Python 3, but rather are choosing to retain backward compatibility with Python 2. Conclusions: Python developers are confining themselves to a language subset, governed by the diminishing intersection of Python 2, which is not under development, and Python 3, which is under development with new features being introduced as the language continues to evolve.

11:00
Which Version Should be Released to the App Store?

ABSTRACT. Background: Several mobile app releases do not find their way to the end users. Our analysis of 11,514 releases across 917 open source mobile apps revealed that 44.3% of releases created in GitHub were never shipped to the app store (market). Aims: We introduce "marketability" of open source mobile apps as a new release decision problem. Considering app stores as a complex system with unknown treatments, we evaluate the performance of predictive models and analogical reasoning for marketability decisions. Method: We performed a survey with 22 release engineers to identify the importance of the marketability release decision. We compared different classifiers to predict release marketability. For guiding the transition of not successfully marketable releases into successful ones, we used analogical reasoning. We evaluated our results both internally (over time) and externally (by developers). Results: Random forest classification performed best, with an F1 score of 78%. Analyzing 58 releases over time showed that, for 81% of them, analogical reasoning could correctly identify changes in the majority of release attributes. A survey with seven developers showed the usefulness of our method for supporting real world decisions. Conclusions: Marketability decisions of mobile apps can be supported by using predictive analytics and by considering and adopting similar experience from the past.

11:30
Mining Logs to Model the Use of a System

ABSTRACT. Process mining is a technique to build process models from "execution logs" (i.e., events triggered by the execution of a process). State-of-the-art tools can provide process managers with different graphical representations of such models. Managers use these models to compare them with an ideal process model or to support process improvement. They typically select the representation based on their experience and knowledge of the system.

This study applies the theory of Hidden Markov Models to mine logs and model the use of a system. Unlike the models generated with the process mining tools, the Hidden Markov Models automatically generated in this study can be used to provide managers with a faithful representation of the use of their systems.

The study also shows that the automatic generation of the Hidden Markov Models can achieve a good level of accuracy, provided the log dataset is carefully chosen. In our study, the information contained in a one-month set of logs helped automatically build Hidden Markov Models with superior accuracy and expressiveness similar to that of the models built together with the company's stakeholder.

12:00-13:00 Lunch Break
13:00-14:30 Session 9A: Defect Prediction
Location: Markham Ballroom C
13:00
File-Level Defect Prediction: Unsupervised vs. Supervised Models

ABSTRACT. Background: Software defect models can help software quality assurance teams to allocate testing or code review resources. A variety of techniques have been used to build defect prediction models, including supervised and unsupervised methods. Recently, Yang et al. [1] surprisingly found that unsupervised models can perform statistically significantly better than supervised models in effort-aware change-level defect prediction. However, little is known about the relative performance of unsupervised and supervised models for effort-aware file-level defect prediction. Aims: Inspired by their work, we aim to investigate whether a similar finding holds in effort-aware file-level defect prediction. Method: We replicate Yang et al.'s study on the PROMISE dataset with ten projects in total. We compare the effectiveness of unsupervised and supervised prediction models for effort-aware file-level defect prediction. Results: We find that the conclusion of Yang et al. [1] does not hold under the within-project setting but does hold under the cross-project setting for file-level defect prediction. In addition, following the recommendations given by the best unsupervised model, developers need to inspect statistically significantly more files than with supervised models for the same inspection effort (i.e., LOC). Conclusions: (a) Unsupervised models cannot perform statistically significantly better than state-of-the-art supervised models under within-project validation, (b) unsupervised models can perform statistically significantly better than state-of-the-art supervised models under cross-project validation, (c) we suggest that not only LOC but also the number of files to be inspected should be considered when evaluating effort-aware file-level defect prediction models.
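
For readers unfamiliar with effort-aware evaluation, the following minimal Python sketch shows one simple unsupervised strategy of the kind compared in such studies: rank files by a single metric and inspect them in that order until a LOC budget is exhausted. The field names, metric choice, and budget are illustrative assumptions, not the specific models evaluated in the paper.

# Hedged sketch: effort-aware inspection under a LOC budget using a simple
# unsupervised ranking. Field names and the 20% budget are assumptions.
from typing import Dict, List

def files_inspected_under_budget(files: List[Dict], metric: str,
                                 budget_ratio: float = 0.2) -> List[Dict]:
    """Return the files a developer would inspect, in ranked order,
    within a LOC inspection budget."""
    budget = budget_ratio * sum(f["loc"] for f in files)
    # Unsupervised ranking: sort by one metric, ascending (e.g., smaller files first).
    ranked = sorted(files, key=lambda f: f[metric])
    inspected, spent = [], 0
    for f in ranked:
        if spent + f["loc"] > budget:
            break
        inspected.append(f)
        spent += f["loc"]
    return inspected

# Example: three hypothetical files with LOC and cyclomatic complexity.
files = [{"name": "a.c", "loc": 100, "cc": 5},
         {"name": "b.c", "loc": 900, "cc": 40},
         {"name": "c.c", "loc": 300, "cc": 12}]
print([f["name"] for f in files_inspected_under_budget(files, metric="loc")])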

13:30
Training data selection for cross-project defect prediction: which approach is better?

ABSTRACT. Many relevancy filters have been proposed to select training data for building cross-project defect prediction (CPDP) models. However, up to now, there is no consensus about which relevancy filter is better for CPDP. In this paper, we conduct a thorough experiment to compare nine relevancy filters proposed in the recent literature. We compare not only the retaining ratio of the original training data and the overlapping degree among the retained data but also the prediction performance of the resulting CPDP models under the ranking and classification scenarios. Our experimental results from 33 open-source software projects provide an in-depth understanding of the differences between these filters as well as practical guidelines for choosing appropriate filters to build effective CPDP models in practice.

14:00
The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification

ABSTRACT. Recent studies have shown that the performance of defect prediction models can be affected when data sampling approaches are applied to imbalanced training data for building defect prediction models. However, the magnitude (degree and power) of the effect of these sampling methods on the classification and prioritization performance of defect prediction models is still unknown. We examine the practical effects of six data sampling methods on the performance of five defect prediction models. The prediction performance of the models trained on default datasets (no sampling method) is compared with that of the models trained on resampled datasets (application of sampling methods). To decide whether the performance changes are significant or not, robust statistical tests are performed and effect sizes computed. Twenty releases of ten open source projects extracted from the PROMISE repository are considered and evaluated using the AUC, pd, pf and G-mean performance measures. Results show that there are statistically significant differences and practical effects on the classification performance (pd, pf and G-mean) between models trained on resampled datasets and those trained on the default datasets. However, sampling methods have no statistically significant or practical effect on defect prioritization performance (AUC), with small or no effect values obtained from the models trained on the resampled datasets. With large and significant effect sizes on the pd and G-mean, the existing sampling methods can properly set the threshold between buggy and clean samples, but they cannot improve the prediction of defect-proneness itself, since the AUC improvements are insignificant and pf consequently increases. Sampling methods are highly recommended for defect classification purposes when all faulty modules are to be considered for testing.
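
The following is a minimal, hypothetical Python sketch of the general idea: apply one sampling method (here SMOTE, via the imbalanced-learn package) to the training data and compare a model's prioritization (AUC) and classification (pd, i.e., recall) performance with and without resampling. The synthetic data and the choice of classifier are illustrative assumptions; the paper evaluates six sampling methods and five models on PROMISE data.

# Hedged sketch: one sampling method, one classifier, synthetic imbalanced data.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset standing in for software metrics + defect labels.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, (Xt, yt) in {
    "default":   (X_tr, y_tr),
    "resampled": SMOTE(random_state=0).fit_resample(X_tr, y_tr),
}.items():
    clf = RandomForestClassifier(random_state=0).fit(Xt, yt)
    proba = clf.predict_proba(X_te)[:, 1]
    print(name,
          "AUC=%.3f" % roc_auc_score(y_te, proba),            # prioritization
          "pd=%.3f" % recall_score(y_te, clf.predict(X_te)))  # classification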

13:00-14:30 Session 9B: Qualitative Research II
Location: Markham Ballroom A/B
13:00
Eliciting Strategies for the GQM+Strategies Approach in IT Service Measurement Initiatives

ABSTRACT. GQM+Strategies is a goal-oriented measurement approach that supports organizations in identifying goals, strategies to achieve goals, and measures to monitor strategies and goals. However, identifying proper strategies is not an easy task. This paper presents two studies aimed at investigating how strategies can be established to achieve IT service goals. First, we carried out a qualitative study involving three IT service-related departments of a large company to find out how they have defined strategies and the problems they have faced. We noted that strategies have been defined by leaders, in a top-down approach, or by teams, in a bottom-up approach, and that causal analysis techniques have been used to investigate aspects which can impact goal achievement. We also found out that the relation between the IT service strategies and goals was not clear for the teams. Then, considering these findings, we performed an empirical study in another IT service-related department in which we applied an approach combining GQM+Strategies with supporting instruments (checklists, templates and examples) and causal analysis to support the identification of IT strategies. As a result, we noticed that by using that approach the team was able to derive IT strategies based on goals, define measures to monitor goals and strategies, and better understand the alignment between the goals to be achieved and the strategies to be performed. Moreover, the results showed that causal analysis is useful for defining strategies and that the supporting instruments facilitate using the approach and building the GQM+Strategies grid.

13:30
Looking for Peace of Mind? Manage your (Technical) Debt - An Exploratory Field Study

ABSTRACT. Background: In the last two decades Technical Debt (TD) has received a considerable amount of attention from software engineering research and practice. Recently, a small group of studies has suggested that, in addition to its technical and economic consequences, TD can affect developers’ psychological states and morale. However, until now there has been a lack of empirical research clarifying such influences. Aims: In this study, we aim to take the first step in filling this gap by investigating the potential impacts of TD and its management on developers’ morale. Method: Drawing from previous literature on morale, we decided to explore the influence of TD and its management on three dimensions of morale called affective, future/goal, and interpersonal antecedents. In so doing, we conducted an exploratory field study and collected data from software professionals active in different industrial domains through eight qualitative interviews and an online survey (n=33). Results: Our results indicate that TD has a negative influence on the future/goal and affective antecedents of morale. This is mainly because the occurrence of TD hinders developers from performing their tasks and achieving their goals. TD management, on the other hand, has a positive influence on all three dimensions of morale, since it is associated with positive feelings and interpersonal feedback as well as a sense of progress. Conclusions: According to the results of this empirical study, the occurrence of TD reduces developers’ morale, while its management increases developers’ morale.

14:00
Characterizing Software Engineering Work with Personas Based on Knowledge Worker Actions

ABSTRACT. Mistaking versatility for universal skill, some companies tend to categorize all software engineers the same, not knowing that differences exist. For example, a company may select one of many software engineers to complete a task, later finding that the engineer's skills and style do not match those needed to successfully complete that task. This can result in delayed task completion and demonstrates that a one-size-fits-all concept should not apply to how software engineers work. In order to gain a comprehensive understanding of different software engineers and their working styles, we interviewed 21 participants and surveyed 868 software engineers at a large software company and asked them about their work in terms of knowledge worker actions. We identify how tasks, collaboration styles, and perspectives of autonomy can significantly affect different approaches to software engineering work. To characterize differences, we describe empirically informed personas of how they work. Our defined software engineering personas include those with focused debugging abilities, engineers with an active interest in learning, experienced advisors who serve as experts in their role, and more. Our study and results serve as a resource for building products, services, and tools around these software engineering personas.

13:00-14:30 Session 9C: Change/ Issue Management II
Location: Butternut/Holly
13:00
Common Bug-fix Patterns: A Large-Scale Observational Study

ABSTRACT. [Background]: There are more bugs in real-world programs than human programmers can realistically address. Several approaches have been proposed to aid debugging. A recent research direction that has been gaining increasing interest, aimed at reducing the costs associated with defect repair, is automatic program repair. Recent work has shown that some kinds of bugs are more suitable for automatic repair techniques. [Aim]: The detection and characterization of common bug-fix patterns in software repositories play an important role in advancing the field of automatic program repair. In this paper, we aim to characterize the occurrence of known bug-fix patterns in Java repositories at an unprecedentedly large scale. [Method]: The study was conducted on Java GitHub projects organized in two distinct data sets: the first one (i.e., the Boa data set) contains more than 4 million bug-fix commits from 101,471 projects and the second one (i.e., the Defects4J data set) contains 369 real bug fixes from five open-source projects. We used a domain-specific programming language called Boa on the first data set and conducted a manual analysis on the second data set in order to compare the results. [Results]: We characterized the prevalence of the five most common bug-fix patterns (identified in the work of Pan et al.) in those bug fixes. The combined results showed direct evidence that developers often forget to add IF preconditions in the code. Moreover, 76% of bug-fix commits associated with the IF-APC bug-fix pattern are isolated from the other four bug-fix patterns analyzed. [Conclusion]: Targeting bugs that miss preconditions is a feasible alternative for automatic repair techniques that would produce a relevant payback.

13:30
(Journal First) Towards an understanding of change types in bug fixing code

ABSTRACT. (JOURNAL FIRST: https://doi.org/10.1016/j.infsof.2017.02.003)

Context: As developing high quality software becomes increasingly challenging because of the explosive growth of scale and complexity, bugs become inevitable in software systems. Knowledge of bugs will naturally guide software development and hence improve software quality. As changes in bug fixing code provide essential insights into the original bugs, analyzing change types is an intuitive and effective way to understand the characteristics of bugs. Objective: In this work, we conduct a thorough empirical study to investigate the characteristics of change types in bug fixing code. Method: We first propose a new change classification scheme with 5 change types and 9 change subtypes. We then develop an automatic classification tool, CTforC, to categorize changes. To gain deeper insights into change types, we perform our empirical study based on three questions from three perspectives, i.e. across project, across domain and across version. Results: Based on 17 versions of 11 systems with thousands of faulty functions, we find that: (1) across project: the frequencies of change subtypes are significantly similar across most studied projects; interface-related code changes are the most frequent bug-fixing changes (74.6% on average); most faulty functions (65.2% on average) in the studied projects are finally fixed by only one or two change subtypes; function call statements are likely to be changed together with assignment statements or branch statements; (2) across domain: the frequencies of change subtypes share similar trends across the studied domains; changes on function call, assignment, and branch statements are often the three most frequent changes in the studied domains; and (3) across version: change subtypes occur with similar frequencies across the studied versions, and the most common subtype pairs tend to be the same. Conclusion: Our experimental results improve the understanding of changes in bug fixing code and hence the understanding of the characteristics of bugs.

14:00
Mining Version Control System for Automatically Generating Commit Comment

ABSTRACT. Commit comments increasingly receive attention as an important complementary component in code change comprehension. To address the comment scarcity issue, a variety of automatic approaches for commit comment generation have been proposed. However, most of these approaches mechanically outline a superficial summary of the changed software entities, so the change intent behind the code changes is lost (e.g., the existing approaches cannot generate a comment such as “fixing null pointer exception”). Considering that the comments written by developers often describe the intent behind the code change, we propose a method to automatically generate commit comments by reusing the existing comments in the version control system. Specifically, for an input commit, we apply syntax, semantic, pre-syntax, and pre-semantic similarities to discover similar commits among half a million commits, and recommend reusable comments to the input commit from the comments of the similar commits. We evaluate our approach on 7 projects. The results show that 9.1% of the generated comments are good, 27.7% of the generated comments are fix, and 63.2% are bad; we also analyze the reasons that make a comment available or unavailable.

14:30-15:00 Coffee Break
15:00-16:00 Session 10A: Repository Analysis II
Location: Markham Ballroom C
15:00
House of Cards: Code Smells in Open-source C# Repositories

ABSTRACT. Code smells are indicators of quality problems that make software hard to maintain and evolve. Given the importance of smells in writing maintainable source code, many studies have explored the characteristics of smells and analyzed the effects of smells on the quality of the software. In this paper, we present our observations and insights on four research questions that we address with an empirical exploration of implementation and design smells. The study mines 19 design smells and 11 implementation smells in 1988 C# repositories containing more than 49 million lines of code. We find that unutilized abstraction and magic number smells are the most frequently occurring smells in C# code. Further, our results show that implementation and design smells exhibit strong inter-category correlation. Also, contrary to common belief, intra-category correlation analysis reveals that implementation smells show higher correlation among themselves than the design smells. Our study can benefit the C# developer community by helping them understand the smells and their implications. At the same time, the study extends the state of the art by scaling the mining experiment in terms of the number of repositories and the number of smells analyzed.

15:15
How Does Machine Translated User Interface Affect User Experience? A Study on Android Apps

ABSTRACT. For global-market-oriented software applications, it is required that their user interface be translated to local languages so that users from different areas of the world can use the software. A long-term practice in the software industry is to hire professional translators or translation companies to perform the translation. However, due to the large number of user-interface labels and target languages, this is often too expensive for software providers, especially cost-sensitive providers such as individual developers of Android apps. On the other hand, increasingly mature machine translation techniques are providing a cheap though imperfect alternative, and the Google Translation service has already been widely used for translating websites and apps. However, the effect of lower translation quality on user experience has not been well studied yet. In this paper, we present a user study on 6 popular Android apps, which involves 24 participants performing tasks on app variants with 4 different translation quality levels and 2 target languages: Spanish and Chinese. From our study, we obtain the following 3 major findings: (1) compared with original versions, machine translated versions of apps have similar task completion rate and efficiency on most studied apps; (2) machine translated versions have more failed and flawed tasks, but a more in-depth inspection reveals that these are not directly caused by wrongly translated labels; and (3) users are not satisfied with the GUI of machine translated versions, and the two major complaints are misleading labels of input boxes and unclear translation of items in option lists.

15:30
An exploratory analysis of a hybrid OSS company's forum in search of sales leads

ABSTRACT. Background: Online forums are instruments through which information or problems are shared and discussed, including expressions of interests and intentions.

Objective: In this paper, we present ongoing work aimed at analyzing the content of forum posts of a hybrid open source company that offers both free and commercial licenses, in order to help its community manager gain improved understanding of the forum discussions and sentiments and automatically discover new opportunities such as sales leads, i.e., people who are interested in buying a license. These leads can then be forwarded to the sales team for follow up and can result in them potentially making a sale, thus increasing company revenue.

Method: For the analysis of the forums, an untapped channel for sales leads by the company, text analysis techniques are utilized to identify potential sales leads and the discussion topics and sentiments in those leads.

Results: Results of our preliminary work make a positive contribution in lessening the community manager’s work in understanding the sentiment and discussion topics in the hybrid open source forum community, as well as make it easier and faster to identify potential future customers.

Conclusion: We believe that the results will positively contribute to improving the sales of licenses for the hybrid open source company.

15:45
On Software Productivity Analysis with Propensity Score Matching

ABSTRACT. [Context:] Software productivity analysis is an essential activity for software process improvement. It specifies critical factors to be resolved or accepted from project data. As the nature of project data is observational, not experimental, the project data involves bias that can cause spurious relationships among the analyzed factors. Analysis methods based on linear regression suffer from these spurious relationships and sometimes lead to inappropriate causal conclusions. The propensity score is a solution to this problem but has rarely been used. [Objective:] To investigate what differences the use of the propensity score brings to software productivity analysis in comparison to a conventional method. [Method:] We revisited classical software productivity analyses on the ISBSG and Finnish datasets. The differences in critical factors between the propensity score and linear regression analyses were investigated. [Results:] The two analysis methods specified different critical factors on the two datasets. The specified factors were both reasonable to some extent, and further consideration is needed for the propensity score results. [Conclusions:] The use of the propensity score can reveal new possible factors to be tackled. Although the contradiction does not necessarily indicate a flaw in the linear regression, the results obtained with the propensity score should also be taken into account when deciding on improvement actions.
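
For readers unfamiliar with the technique, the following is a minimal Python sketch of propensity score matching on synthetic data, not the paper's actual analysis: estimate each project's propensity to receive a "treatment" from observed covariates, match treated and control observations with similar scores, and compare outcomes on the matched sample. All variable names and data are illustrative assumptions.

# Hedged sketch: propensity score matching on synthetic project data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # covariates (e.g., size, team, domain)
treat = (X[:, 0] + rng.normal(size=200)) > 0   # treatment assignment, biased by X
prod = 10 + 2 * X[:, 0] + treat * 1.5 + rng.normal(size=200)  # productivity outcome

# 1. Propensity scores: P(treated | covariates).
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

# 2. Nearest-neighbour matching on the propensity score (with replacement).
ctrl_idx = np.where(~treat)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[ctrl_idx].reshape(-1, 1))
_, match = nn.kneighbors(ps[treat].reshape(-1, 1))
matched_ctrl = ctrl_idx[match.ravel()]

# 3. Compare the naive group difference with the matched-sample difference.
print("naive difference:  ", prod[treat].mean() - prod[~treat].mean())
print("matched difference:", prod[treat].mean() - prod[matched_ctrl].mean())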

15:00-16:00 Session 10B: Requirements Engineering
Location: Butternut/Holly
15:00
What the Job Market Wants from Requirements Engineers? An Empirical Analysis of Online Job Ads from the Netherlands

ABSTRACT. Recently, the requirements engineering (RE) community recognized the increasing need for understanding how industry perceives the jobs of requirements engineers and their most important qualifications. This study contributes to the community’s research effort on this topic. Based on an analysis of RE job ads in 2015 from the Netherlands’ three most popular online IT-job portals, we identified the task-related and skill-related qualifications that are important for employers when seeking applicants for RE jobs. We found that the job titles used in industry for the specialists that the RE community calls ‘requirements engineers’ are ‘Product Owner’ and ‘Analyst’, be it Information Analysts, Application Analysts, or Data Analysts. Those professionals expected to perform RE tasks also take responsibility for additional tasks including quality assurance, realization and deployment, and project management. The most sought-after skills are soft skills: Dutch and English language skills as well as communication and analytical skills are the most demanded. RE is perceived as an employment occupation for experts, with 23% of all job ads asking explicitly for RE experience and 63% asking for experience similar to RE.

15:15
Agile Quality Requirements Engineering Challenges: First Results from a Case Study

ABSTRACT. Agile software development methods have become increasingly popular in recent years. Despite its popularity, Agile has been criticized for focusing on delivering functional requirements and neglecting quality requirements. Several studies have reported this shortcoming. However, little is known about the challenges organizations currently face when dealing with quality requirements. Based on qualitative exploratory case studies, this research investigated real-life large-scale distributed Agile projects to understand the challenges Agile teams face regarding quality requirements. Eighteen semi-structured, open-ended, in-depth interviews were conducted with Agile practitioners representing six different organizations in the Netherlands. Based on the analysis of the collected data, we have identified nine challenges Agile practitioners face when engineering quality requirements in large-scale distributed Agile projects, challenges that could harm the implementation of the quality requirements and result in neglecting them.

15:30
Issues and Opportunities for Human Error-based Requirements Inspections: An Exploratory Study

ABSTRACT. [Background] Software inspections are extensively used for requirements verification. Our research uses the perspective of human cognitive failures (i.e., human errors) to improve the fault detection effectiveness of traditional fault-checklist based inspections. Our previous evaluations of a formal human error based inspection technique called Error Abstraction and Inspection (EAI) have shown encouraging results, but have also highlighted a real need for improvement. [Aims and Method] The goal of conducting the controlled study presented in this paper was to identify the specific tasks of EAI that inspectors find most difficult to perform and the strategies that successful inspectors use when performing the tasks. [Results] The results highlighted specific pain points of EAI that can be addressed by improving the training and instrumentation.

15:45
Assessing the Intuitiveness of Qualitative Contribution Relationships in Goal Models: an Exploratory Experiment

ABSTRACT. Developing conceptual models is an integral part of the requirements engineering (RE) process. Goal models are requirements engineering conceptual models that allow diagrammatic representation of stakeholder intentions and how they affect each other. A specific goal modeling language construct, the contribution of the satisfaction of one goal to another, plays a central role in supporting decision problem exploration within goal models. In this paper, we report on an experimental study whose aim was to measure user perception of the meaning of the aforementioned modeling construct. A set of contributions under different scenarios was given to experimental participants, who were asked what they thought the effect of each contribution was. We found that participants are not always in agreement, either among themselves or with the designer's intentions, on the meaning of the language, calling for possible design adaptations.

15:00-16:00 Session 10C: Poster Session
Location: Markham Ballroom A/B
15:00
Identifying Software Decays: A System Usage Perspective

ABSTRACT. The value of a software product diminishes not because of worn-out or rusty code, but due to gradual changes in system usage patterns or new requirements emerging over its lifespan, along with other direct or indirect impacts from the surrounding environment. These new requirements or changes cannot be accommodated immediately; therefore, the system becomes outdated with incompatible or unused features, and the overall value of its services gradually degrades. We term this software decay. Overcoming it is expensive, since there is no single identifiable component that would reinstate the system's value to its full extent. In this work, we attempt to discover decay in a software system at an early stage of its lifespan, and to measure it by quantifying the system value based on the usage of the subset of core features that are necessary to perform its value-added services for the intended users.

15:00
An Empirical Study of Open Source Virtual Reality Software Projects

ABSTRACT. In this paper, we present an empirical study of 1,156 open source virtual reality (VR) projects from Unity List. Our study shows that the number of open source VR software projects is steadily growing, and that some large projects attracting many developers are emerging. The most popular topic of VR software is still games. We also found that C# is a major language used by VR developers, and that developers often mistakenly commit automatically generated files.

15:00
Beyond Boxes and Lines: Designing and Evaluating Alternative Visualizations for Requirements Conceptual Models

ABSTRACT. Conceptual modeling languages have been widely studied in requirements engineering as tools for capturing, representing and reasoning about domain problems. One of those languages, goal models, has been proposed for representing the structure of stakeholder intentions. Like most other conceptual modeling languages, goal models are visualized using box-and-line diagrammatic notations. But is this box-and-line approach the best way for visualizing goals and relationships thereof? Through a series of experimental studies we have recently endeavored to find out. In this presentation we describe features of our alternative visualization proposals and present lessons we learned from our attempts to empirically evaluate them, which could be useful for those interested in empirically-driven conceptual modeling language design.

15:00
A Comparison of Dictionary Building Methods for Sentiment Analysis in Software Engineering Text

ABSTRACT. Sentiment Analysis (SA) in Software Engineering (SE) texts suffers from low accuracies primarily due to the lack of an effective dictionary. The use of a domain-specific dictionary can improve the accuracy of SA in a particular domain. Building a domain dictionary is not a trivial task. The performance of lexical SA also varies based on the method applied to develop the dictionary. This paper includes a quantitative comparison of four dictionaries representing distinct dictionary building methods to identify which methods have higher/lower potential to perform well in constructing a domain dictionary for SA in SE texts.
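
As a minimal illustration of lexical sentiment analysis with a domain dictionary, the Python sketch below scores text by summing the polarities of dictionary words it contains. The toy lexicon entries and scoring rule are illustrative assumptions; the paper compares four dictionary building methods rather than a single fixed lexicon.

# Hedged sketch: dictionary-based (lexical) sentiment scoring with a toy
# SE-specific lexicon. Entries and scores are illustrative assumptions.
import re

SE_LEXICON = {  # word -> polarity score
    "crash": -2, "bug": -1, "broken": -2, "workaround": -1,
    "fix": 1, "works": 1, "clean": 1, "elegant": 2,
}

def sentiment(text: str) -> int:
    """Sum the polarity scores of lexicon words found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(SE_LEXICON.get(tok, 0) for tok in tokens)

print(sentiment("The fix works but the parser is still broken"))  # -> 0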

15:00
Structured Synthesis Method: the Evidence Factory Tool

ABSTRACT. We present the Evidence Factory (EF), a tool designed to support the Structured Synthesis Method (SSM). SSM is a research synthesis method that can be used to aggregate both quantitative and qualitative studies. It is a kind of integrative synthesis method, such as meta-analysis, but has several features from interpretative methods, such as meta-ethnography, particularly those concerned with conceptual development. The tool is a web-based infrastructure, which supports the organization of synthesis studies. Researchers can compare findings from different studies by modeling their results according to the evidence meta-model. After deciding whether the evidence can be combined, the tool automatically computes the uncertainty associated with the aggregated results using the formalisms from the Mathematical Theory of Evidence.