Abstract: In this paper, we describe an effective compile-time analysis for software prefetching in Java. Previous work in software data prefetching for pointer-based codes uses simple compiler algorithms and does not investigate prefetching for object-oriented language features that make compile-time analysis difficult. We develop a new data-flow analysis to detect regular accesses to linked data structures in Java programs. We use intra- and interprocedural analysis to identify profitable prefetching opportunities for greedy and jump-pointer prefetching, and we implement these techniques in a compiler for Java. Our results show that both prefetching techniques improve four of our ten programs. The largest performance improvement is 48% with jump-pointers, but consistent improvements are difficult to obtain.
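To make the two schemes concrete, the following sketch shows a linked-list traversal as a greedy-prefetching compiler might transform it (prefetch the successor as soon as its address is loaded) and as jump-pointer prefetching might (follow an extra pointer installed to reach several nodes ahead). The Node class, its jump field, and the prefetch() placeholder are illustrative assumptions, not the paper's actual implementation; a real compiler would lower prefetch() to a non-binding prefetch instruction.

    // Hypothetical illustration of greedy vs. jump-pointer prefetching on a list.
    final class LinkedPrefetchSketch {
        static final class Node {
            int value;
            Node next;
            Node jump;   // jump-pointer: installed to point several nodes ahead
        }

        // Stand-in for a non-binding prefetch of the node's cache line.
        static void prefetch(Node n) { /* no-op placeholder */ }

        // Greedy prefetching: issue a prefetch for n.next as soon as it is loaded,
        // overlapping its miss with the work on the current node.
        static long sumGreedy(Node head) {
            long s = 0;
            for (Node n = head; n != null; n = n.next) {
                if (n.next != null) prefetch(n.next);
                s += n.value;
            }
            return s;
        }

        // Jump-pointer prefetching: follow a pre-installed pointer that skips
        // ahead, giving the prefetch more time to complete than n.next would.
        static long sumJump(Node head) {
            long s = 0;
            for (Node n = head; n != null; n = n.next) {
                if (n.jump != null) prefetch(n.jump);
                s += n.value;
            }
            return s;
        }
    }

Greedy prefetching can only look one node ahead of the current iteration, which is why jump pointers, at the cost of creating and maintaining the extra field, can hide longer memory latencies.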
Brendon Cahoon and Kathryn S. McKinley (Department of Computer Science, University of Massachusetts, Amherst, MA). Performance evaluation of a distributed architecture for information retrieval. In SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 1996, pages 110–118. https://doi.org/10.1145/243199.243238
Modern Java programs, such as middleware and application servers, include many complex software components. Improving the performance of these Java applications requires a better understanding of the interactions between the application, virtual machine, operating system, and architecture. Hardware performance monitors, which are available on most modern processors, provide facilities to obtain detailed performance measurements of long-running applications in real time. However, interpreting the data collected using hardware performance monitors is difficult because of the low-level nature of the data.
We have developed a system, consisting of two components, to alleviate the difficulty of interpreting results obtained using hardware performance monitors. The first component is an enhanced VM that generates traces of hardware performance monitor values while executing Java programs. This enhanced VM generates a separate trace for each Java thread and CPU combination and thus provides accurate results in a multithreaded and multiprocessor environment. The second component is a tool that allows users to interactively explore the traces using a graphical interface. We implemented our system in the context of Jikes RVM, an open-source Java VM, and evaluated it on a POWER4 multiprocessor. We demonstrate that our system is effective in uncovering previously unknown performance characteristics and is a first step in exploring the reasons behind the observed behavior of a Java program.
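As a rough illustration of the data such a tracing VM might emit, the sketch below models one hardware-counter sample attributed to a single Java thread on a single CPU, plus the per-(thread, CPU) trace a visualization tool could load offline. The field layout and class names are hypothetical and are not part of the Jikes RVM API.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: one sample of hardware performance monitor (HPM)
    // values attributed to a single Java thread running on a single CPU.
    final class HpmSample {
        final long timestamp;        // cycle count or tick when the sample was taken
        final int cpu;               // CPU the thread was running on
        final long threadId;         // Java thread the counters are attributed to
        final long[] counterValues;  // e.g. cycles, instructions, cache and TLB misses

        HpmSample(long timestamp, int cpu, long threadId, long[] counterValues) {
            this.timestamp = timestamp;
            this.cpu = cpu;
            this.threadId = threadId;
            this.counterValues = counterValues.clone();
        }
    }

    // A per-(thread, CPU) trace is then an append-only list of samples that an
    // interactive tool can load and explore after the program finishes.
    final class HpmTrace {
        private final List<HpmSample> samples = new ArrayList<>();
        void record(HpmSample s) { samples.add(s); }
        List<HpmSample> samples() { return samples; }
    }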
Java is becoming a viable choice for numerical algorithms due to the software engineering benefits of object-oriented programming. Because these programs still use large arrays that do not fit in the cache, they continue to suffer from poor memory performance. To hide memory latency, we describe a new unified compile-time analysis for software prefetching arrays and linked structures in Java. Our previous work uses data-flow analysis to discover linked data structure accesses, and here we present a more general version that also identifies loop induction variables used in array accesses. Our algorithm schedules prefetches for all array references that contain induction variables. We evaluate our technique using a simulator of an out-of-order superscalar processor running a set of array-based Java programs. Across all our programs, prefetching reduces execution time by a geometric mean of 23%, and the largest improvement is 58%. We also evaluate prefetching on a PowerPC processor, and we show that prefetching reduces execution time by a geometric mean of 17%. Traditional software prefetching algorithms for C and Fortran use locality analysis and sophisticated loop transformations. Because our analysis is much simpler and quicker, it is suitable for including in a just-in-time compiler. We further show that the additional loop transformations and careful scheduling of prefetches used in previous work are not always necessary for modern architectures and Java programs.
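As an illustration of the transformation, the sketch below shows the shape of code a compiler could generate for an induction-variable array access. The prefetch() method stands in for a hypothetical compiler intrinsic that would lower to a non-binding prefetch instruction (for example, dcbt on PowerPC), and the prefetch distance of 8 elements is illustrative rather than the value used in the paper.

    // Hypothetical sketch of compiler-inserted array prefetching.
    final class ArrayPrefetchSketch {
        // Stand-in for a compiler intrinsic that issues a non-binding prefetch
        // for the cache line containing a[i].
        static void prefetch(double[] a, int i) { /* no-op placeholder */ }

        static double sum(double[] a) {
            final int DIST = 8;            // illustrative prefetch distance in elements
            double s = 0.0;
            for (int i = 0; i < a.length; i++) {
                if (i + DIST < a.length) { // stay inside the array in this sketch
                    prefetch(a, i + DIST); // fetch a[i + DIST] ahead of its use
                }
                s += a[i];                 // the original induction-variable access
            }
            return s;
        }
    }

Because the address of a[i + DIST] is a simple linear function of the loop induction variable, scheduling the prefetch requires neither locality analysis nor loop transformations, which is what keeps the analysis cheap enough for a just-in-time compiler.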
The memory hierarchy in modern architectures continues to be a major performance bottleneck. Many existing techniques for improving memory performance focus on Fortran and C programs, but memory latency is also a barrier to achieving high performance in object-oriented languages. Existing software techniques are inadequate for exposing optimization opportunities in object-oriented programs. One key problem is the use of high-level programming abstractions, which make analysis difficult. Another challenge is that programmers use a variety of data structures, including arrays and linked structures, so optimizations must work on a broad range of programs. We develop a new unified data-flow analysis for identifying accesses to arrays and linked structures, called recurrence analysis. Prior approaches that identify these access patterns are ad hoc or treat arrays and linked structures independently. The data-flow analysis is intra- and interprocedural, which is important in Java programs that use encapsulation to hide implementation details.
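As a rough illustration of the access patterns recurrence analysis targets, the two loops below traverse an array through a loop induction variable and a linked structure through a recurrent field load; the Node class and field names are hypothetical.

    // Hypothetical examples of the two recurrence kinds a unified analysis looks for.
    final class RecurrenceExamples {
        static final class Node {        // illustrative singly linked list node
            int value;
            Node next;
        }

        // Array recurrence: the address of a[i] is a linear function of the
        // loop induction variable i.
        static long sumArray(int[] a) {
            long s = 0;
            for (int i = 0; i < a.length; i++) {
                s += a[i];
            }
            return s;
        }

        // Linked-structure recurrence: each iteration loads the next object
        // through the same field (n = n.next), a pointer-chasing pattern.
        static long sumList(Node head) {
            long s = 0;
            for (Node n = head; n != null; n = n.next) {
                s += n.value;
            }
            return s;
        }
    }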
We show that Java programs that traverse arrays and linked structures have poor memory performance. We use compiler-inserted data prefetching to improve memory performance in these types of programs. The goal of prefetching is to hide latency by bringing data into the cache prior to a program's use of the data. We use our recurrence analysis to identify prefetching opportunities in Java programs. We develop a new algorithm for prefetching arrays, and we evaluate several methods for prefetching objects in linked structures. Since garbage collection is an integral part of Java, we evaluate the impact of a copying garbage collector on prefetching, and we demonstrate how to improve the memory performance of the collector itself by using prefetching. This dissertation shows that a unified whole-program compiler analysis is effective in discovering prefetching opportunities in Java programs that traverse arrays and linked structures. Compiler-inserted data prefetching improves memory performance even in the presence of object-oriented features that complicate analysis.
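The sketch below gives one hedged example of how prefetching might be folded into a copying collector's scan phase: objects are prefetched when they are queued for scanning, so their cache misses overlap with the scanning of earlier objects. The queue-based structure and the prefetch() placeholder are simplified assumptions for illustration, not the dissertation's collector.

    import java.util.ArrayDeque;

    // Hypothetical, heavily simplified view of a copying collector's scan phase:
    // grey objects wait in a queue, and each is prefetched when enqueued so it is
    // likely to be in the cache by the time the collector scans it.
    final class GcScanSketch {
        static final class Obj {
            Obj[] fields = new Obj[0];   // outgoing references
            boolean copied;
        }

        // Stand-in for a non-binding prefetch of the object's header and fields.
        static void prefetch(Obj o) { /* no-op placeholder */ }

        static void scan(ArrayDeque<Obj> grey) {
            while (!grey.isEmpty()) {
                Obj current = grey.poll();
                for (Obj child : current.fields) {
                    if (child != null && !child.copied) {
                        child.copied = true;   // simulate copying the child to to-space
                        prefetch(child);       // hide the miss on the child's first scan
                        grey.add(child);
                    }
                }
            }
        }
    }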
The information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to access these collections effectively, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the Inquery simulation model. We measure performance as a function of system parameters such as client command rate, number of document collections, terms per query, query term frequency, number of answers returned, and command mixture. Our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. Based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time degrades gracefully as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.
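As a minimal sketch of the client-broker-server organization described above, the broker below fans a query out to the document servers it is connected to and merges their ranked answers; the interface, class names, and merge-by-score step are illustrative assumptions rather than the prototype's actual protocol.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch of the broker in a client/broker/server IR architecture:
    // a small number of brokers mediate between many clients and many servers.
    final class BrokerSketch {
        interface DocumentServer {
            List<ScoredDoc> evaluate(String query, int numAnswers);
        }

        record ScoredDoc(String docId, double score) {}

        private final List<DocumentServer> servers;

        BrokerSketch(List<DocumentServer> servers) { this.servers = servers; }

        // Fan the query out to every server's collection and merge the top answers.
        List<ScoredDoc> query(String query, int numAnswers) {
            List<ScoredDoc> merged = new ArrayList<>();
            for (DocumentServer s : servers) {
                merged.addAll(s.evaluate(query, numAnswers));
            }
            merged.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
            return merged.subList(0, Math.min(numAnswers, merged.size()));
        }
    }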