Spreadsheets for Stream Partitions and Windows.
2015
We discuss the suitability of spreadsheet processors as tools for programming streaming systems. We argue that, while spreadsheets can function as powerful models for stream operators, their fundamental boundedness limits their scope of application. We propose two extensions to the spreadsheet model and argue their utility in the context of programming streaming systems. I. STREAM PROCESSING WITH SPREADSHEETS Stream processing is an appealing approach to problems involving high volumes of data. Such problems arise in many different domains: finance, health care, consumer analytics, telecommunications, etc. Streaming applications are typically structured as a network of independent processing operators that consume and produce tuples of data. This architecture enables large-scale deployments, as operators can be distributed and replicated across machines for scalability, maximizing data-parallelism and thus resulting in systems that can handle millions of tuples per second. It also naturally enforces a clear separation of concerns: because operators only need to know the format of incoming and outgoing tuples, their implementations are largely independent of each other and they can even be developed by separate teams. We argue that spreadsheets are particularly well suited for programming operators in streaming applications. Indeed, spreadsheet and stream programming systems share many important commonalities: • Two-dimensional view of data: spreadsheets organize data in rows and columns, and streams are most naturally modelled as a sequence (rows) of tuples of attributes (columns). • Reactive model of computation: a change of value in a spreadsheet cell results in the immediate propagation of computations to all dependent cells. Similarly, stream operators produce new tuples as a reaction to inputs. • Stateless computations: streaming systems embrace statelessness as it facilitates encapsulation and thus parallelization. Spreadsheets favor statelessness to maintain a model of idempotent (re-)computations. A final point is simply pragmatism: domain experts are not always programmers, but they will typically be familiar with spreadsheet processors, as more than 500 million people worldwide are reported to use spreadsheets [1]. Supporting stream programming within a familiar tool thus removes the need for an additional expert programmer and the associated translation issues. Fig. 1. A simple trading bargain calculator. With these considerations in mind, we developed ActiveSheets [2], a platform for developing streaming operators in Microsoft Excel. ActiveSheets integrates with IBM Infosphere Streams and the SPL programming language [3], allowing end-users to visualize live streaming data in spreadsheets, and to export their computation models as SPL operators. As an illustration of programming with ActiveSheets, we consider a simple financial application to detect bargains in quoted stock prices, based on trading data. We observe a given stock symbol, with a history of trades at prices Pi and volumes Vi. The volume-weighted average price, a standard metric in finance, is defined as: VWAP = ∑ i Pi · Vi ∑ i Vi A quoted price is deemed a bargain if it is lower than the VWAP. Figure 1 shows how we can detect bargains in ActiveSheets. Cells A3:C22 are mapped to the input stream of executed trades and, as tuples come in, are updated to always reflect the last 20 tuples (a scrolling effect). This conforms with the natural view of streams as rows of tuples. Cells A29:B29 contain the latest available quote, and are similarly updated, potentially asynchronously from trades. All other cells and graphics are standard Excel content, and update reactively to the incoming streams. The ActiveSheets user marks the result of the computations for export, in this case the cells G3 and G7, thus determining the outputs of the operator. This programming experience is natural to users familiar with spreadsheets, and leads to rapid development of malleable models. A fundamental limitation of spreadsheets, however, is their bounded nature. In the following sections, we highlight how this affects the programming of streaming operators and propose potential extensions to alleviate these issues. II. VARIABLE-SIZE WINDOWS We first consider a limitation arising from boundedness in the vertical dimension. In the example described in the previous section, the computation of the VWAP happens over a window of fixed size (the last 20 values). A more realistic model would instead mandate aggregation over, e.g., the last minute of trading data. Such a window is problematic to represent in a spreadsheet, as its size in terms of tuples cannot in general be computed statically. One approach (the current state of ActiveSheets) is to require application developers to compute aggregations over variable-size windows outside of the spreadsheet, in a separate set of operators, and to only stream the aggregated data into the spreadsheet. While it can be straightforward for problems as simple as VWAP, in general it takes away flexibility from the spreadsheet developer and limits the expressiveness of the models that can be developed. Another solution would be to impose a maximum size on variable-size windows. This would make it possible to allocate a sufficiently large number of cells in the spreadsheet in advance, but would not properly support scenarios where an unpredictably high number of tuples must be consumed in a short timespan. We argue that a better approach is to introduce a new function to spreadsheets, WINDOW, to allow users to store a list of values in a single cell. The spreadsheet processor can enforce that such cells only flow into aggregating functions such as COUNT, SUM, AVERAGE, etc., producing #VALUE! errors otherwise. The aggregating functions are by design already capable of handling variable argument counts, and also have sensible default values when applied to empty ranges. Their semantics when applied to a window are thus predictable and natural. Reactivity can be preserved by triggering recomputation only when a new value enters the window. This may go counter to the notion that the window should contain exactly the values for, say, the last minute, but we believe it matches the natural intuition that the value of a cell changes whenever its inputs do. Finally, many such aggregation functions can be implemented efficiently using incremental or partial recomputations, preserving the performance and responsiveness associated with spreadsheets. III. KEY-BASED PARTITIONING The second limitation we want to address is in handling unboundedness in data variation. In our example in Figure 1, we detect bargains for the ACME ticker symbol only, but what if we wanted to detect bargains for other symbols, or for all incoming symbols? Quotes and trades for a specific symbol can be easily extracted from the streams of all quotes and trades using IF or MATCH expressions as filters. One approach is therefore to replicate the computation n-ways, selecting a particular symbol each time. While this approach is acceptable for a small and known number of symbols, it does not scale and requires defining a list of symbols to monitor a priori. We argue that a better approach is to let users define and visualize the full computation for one defined symbol. To do so, they can filter the incoming stream with a special SELECT function, whose behavior is to update the value of the cell only if its argument matches an expected value. The rest of the computation is left as usual. Once the computation is fully defined, the programmer replaces SELECT by another new function, PARTITION. Visually, PARTITION behaves identically. The semantics, however, are that for each distinct value that flows into PARTITION, a conceptually independant instance of the spreadsheet performs the computation. We can think of programming with PARTITION as another instance of the paradigm, familiar to spreadsheets users, of programming by example. In the VWAP example, the PARTITION function would be used on two streams: quotes and trades. To preserve ease of use and enable efficient implementation, we can require that all occurrences of the PARTITION function in a spreadsheet agree on the key, i.e., we can forbid partitioning the trades by ticker symbol and the quotes by, say, geography.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
2
References
0
Citations
NaN
KQI