Augmenting and structuring user queries to support efficient free-form code search

2018 
Motivation: Code search is an important activity in software development, since developers regularly search [6] for code examples dealing with diverse programming concepts, APIs, and specific platform peculiarities. To help developers search for source code, several Internet-scale code search engines, such as OpenHub [5] and Codota [1], have been proposed. Unfortunately, these engines have limited performance because they treat source code as natural language documents. To improve the performance of search engines, the construction of the search space index as well as the mapping process of querying must address the challenge that "no single word can be chosen to describe a programming concept in the best way" [2]. This is known in the literature as the vocabulary mismatch problem [3].

Approach: We propose a novel approach to augmenting user queries in a free-form code search scenario. This approach aims at improving the quality of code examples returned by Internet-scale code search engines by building a Code voCaBulary (CoCaBu) [7]. The originality of CoCaBu is that it addresses the vocabulary mismatch problem by expanding, enriching, and re-targeting a user's free-form query, building on similar questions in Q&A sites, so that a code search engine can find highly relevant code in source code repositories. Figure 1 provides an overview of our approach. The search process begins with a free-form query from a user, i.e., a sentence written in a natural language:
(a) For a given query, CoCaBu first searches for relevant posts in Q&A forums. The role of the Search Proxy is to forward developer free-form queries to web search engines, which collect and rank the Q&A entries most relevant to the query.
(b) CoCaBu then generates an augmented query based on the information in the relevant posts, mainly leveraging the code snippets in the previously identified posts. The Code Query Generator creates another query that includes not only the initial user query terms but also program elements. To accelerate this step of the search process, CoCaBu builds a snippet index for Q&A posts upfront.
(c) Once the augmented query is constructed, CoCaBu searches source files for code locations that match the query terms. For this step, we crawl a large number of repositories and build a code index of the program elements in the source code upfront.

Contributions:
• CoCaBu approach to the vocabulary mismatch problem: We propose a technique for finding relevant code with free-form query terms that describe programming tasks, with no a priori knowledge of the API keywords to search for.
• GitSearch free-form search engine for GitHub: We instantiate the CoCaBu approach based on indices of Java files built from GitHub and Q&A posts from Stack Overflow to find the most relevant code examples for developer queries.
• Empirical user evaluation: A comparison with popular code search engines shows that GitSearch is more effective in returning acceptable code search results. In addition, a comparison against web search engines indicates that GitSearch is a competitive alternative. Finally, via a live study, we show that users on Q&A sites may find GitSearch's real code examples acceptable as answers to developer questions.

Concluding remarks: As follow-up work, we have also leveraged Stack Overflow data to build a practical, novel, and efficient code-to-code search engine [4].
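To illustrate steps (a) to (c) of the approach, here is a minimal Python sketch of query augmentation over toy in-memory data. It is not the CoCaBu or GitSearch implementation: the sample posts and files, the function names (`search_posts`, `program_elements`, `build_code_index`, `search_code`, `free_form_search`), the token-overlap ranking, and the regular expressions used to pick out program elements are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for Q&A posts and crawled source files.
QA_POSTS = [
    {"title": "How to read a text file line by line in Java",
     "snippet": "BufferedReader reader = new BufferedReader(new FileReader(path)); reader.readLine();"},
    {"title": "How to sort a list of strings",
     "snippet": "Collections.sort(names);"},
]
CODE_FILES = {
    "io/LineCounter.java": "BufferedReader in = new BufferedReader(new FileReader(f)); while (in.readLine() != null) n++;",
    "util/Sorter.java": "Collections.sort(items); return items;",
}

def tokenize(text):
    """Lowercase natural-language text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def search_posts(query, posts, top_k=1):
    """Step (a): rank posts by word overlap between query and title (assumed ranking)."""
    query_words = set(tokenize(query))
    scored = [(len(query_words & set(tokenize(p["title"]))), p) for p in posts]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [p for _, p in scored[:top_k]]

def program_elements(snippet):
    """Step (b): pull capitalized type names and called method names from a snippet."""
    types = re.findall(r"\b[A-Z][A-Za-z0-9]*\b", snippet)
    methods = re.findall(r"\.([a-z][A-Za-z0-9]*)\(", snippet)
    return set(types) | set(methods)

def build_code_index(files):
    """Built upfront: map each identifier to the files that contain it."""
    index = {}
    for path, source in files.items():
        for ident in set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)):
            index.setdefault(ident, set()).add(path)
    return index

def search_code(query_terms, index):
    """Step (c): rank files by how many distinct query terms they contain."""
    scores = Counter()
    for term in set(query_terms):
        for path in index.get(term, ()):
            scores[path] += 1
    return [path for path, _ in scores.most_common()]

CODE_INDEX = build_code_index(CODE_FILES)

def free_form_search(query):
    """Run steps (a) to (c) and return the augmented query and the ranked files."""
    elements = set()
    for post in search_posts(query, QA_POSTS):
        elements |= program_elements(post["snippet"])
    augmented = tokenize(query) + sorted(elements)
    return augmented, search_code(augmented, CODE_INDEX)

if __name__ == "__main__":
    augmented, hits = free_form_search("read file line by line")
    print(augmented)
    print(hits)
```

In this toy setup, the plain query words match no identifier in either file, while the program elements borrowed from the matching post do. This is the vocabulary mismatch that the augmentation step is meant to bridge.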