The approximate membership query (AMQ) data structure is a space-efficient probabilistic data structure that approximately indicates whether an element exists in a set. AMQ data structures are widely used in database indexing, network security, IoT applications, etc. Resizing is an extensively used operation on AMQ data structures, but it can degrade system performance. We identify two main problems that lead to such degradation: the resizing operation can block other operations, and the performance of AMQ structures deteriorates after multiple resizing operations. Existing work cannot alleviate both problems. We therefore propose a novel AMQ data structure called the bamboo filter, which alleviates both problems simultaneously. Bamboo filters can insert, search, and delete an element in constant time. Moreover, they can resize dynamically in a fine-grained way according to the number of contained elements. Experimental results show that bamboo filters significantly outperform state-of-the-art resizable AMQ data structures in insertion, lookup, and deletion. For example, bamboo filters achieve $\mathbf{2.46}\times$ the lookup throughput of the dynamic cuckoo filter on average.
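The abstract above does not spell out the bamboo filter's internal layout, so the following is only a rough orientation: a minimal cuckoo-filter-style AMQ sketch in Python with constant-time insert, lookup, and delete. The class name, parameters, and hashing choices are hypothetical, and the bamboo filter's distinguishing feature, fine-grained segment-based resizing, is not reproduced here; the sketch merely shows the kind of structure being resized.

```python
import random
import hashlib

# Illustrative sketch only: a toy cuckoo-filter-style AMQ structure.
# It does NOT implement the bamboo filter's segment-based expansion.
class ToyCuckooFilter:
    def __init__(self, num_buckets=1024, slots_per_bucket=4, max_kicks=500):
        # num_buckets must be a power of two so the XOR trick below is an involution.
        self.num_buckets = num_buckets
        self.slots = slots_per_bucket
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, key):
        h = hashlib.md5(str(key).encode()).digest()
        return (h[0] | (h[1] << 8)) or 1          # 16-bit fingerprint, never zero

    def _index1(self, key):
        return hash(key) % self.num_buckets

    def _index2(self, i1, fp):
        # Partial-key cuckoo hashing: alternate bucket derived from the fingerprint.
        return (i1 ^ hash(fp)) % self.num_buckets

    def insert(self, key):
        fp = self._fingerprint(key)
        i1 = self._index1(key)
        i2 = self._index2(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        # Cuckoo kicking: evict a random fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._index2(i, fp)
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        return False   # table too full; a resizable filter would expand here

    def contains(self, key):
        fp = self._fingerprint(key)
        i1 = self._index1(key)
        i2 = self._index2(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, key):
        fp = self._fingerprint(key)
        i1 = self._index1(key)
        for i in (i1, self._index2(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

f = ToyCuckooFilter()
f.insert("key-42")
print(f.contains("key-42"), f.contains("key-43"))   # True False (with high probability)
f.delete("key-42")
print(f.contains("key-42"))                         # False
```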
By introducing new Information Elements (IEs), this draft extends the
existing BGP-related IEs to enable IP Flow Information Export (IPFIX)
to export BGP community information, including BGP standard
communities defined in RFC 1997, BGP extended communities defined in
RFC 4360, and BGP large communities defined in RFC 8092. Network traffic
information can then be accumulated and analyzed at the BGP community
granularity, which represents the traffic of different kinds of
customers, services, or geographical regions according to the network
operator's BGP community planning. Network traffic information at
the BGP community granularity is useful for network traffic analysis
and engineering.
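As a loose illustration of the kind of per-community accumulation the last sentence refers to, the Python sketch below totals bytes per BGP community over already-decoded flow records. The record layout (dicts with 'bytes' and 'bgp_communities' keys) is purely hypothetical and is not the IPFIX encoding or the IE names defined by this draft.

```python
from collections import defaultdict

# Illustrative only: once a collector has decoded flow records carrying BGP
# community information, traffic can be accumulated per community value.
def accumulate_by_community(flow_records):
    """Return total bytes observed for each BGP community value."""
    totals = defaultdict(int)
    for record in flow_records:
        for community in record.get("bgp_communities", []):
            totals[community] += record.get("bytes", 0)
    return dict(totals)

# Example: standard communities are conventionally written as "ASN:value".
records = [
    {"bytes": 1500, "bgp_communities": ["65001:100"]},
    {"bytes": 900,  "bgp_communities": ["65001:100", "65001:200"]},
]
print(accumulate_by_community(records))
# {'65001:100': 2400, '65001:200': 900}
```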
Finding or monitoring subgraph instances that are isomorphic to a given pattern graph in a data graph is a fundamental query operation in many graph analytics applications, such as network motif mining and fraud detection. Existing distributed methods are communication-inefficient: they have to shuffle partial matching results during the distributed multiway join, and these partial results may be much larger than the data graph itself. To overcome this drawback, we develop the Batch-BENU framework for distributed subgraph enumeration on static data graphs. Batch-BENU executes a group of local search tasks in parallel. Each task enumerates subgraphs around a vertex in the data graph, guided by a backtracking-based execution plan. To handle large-scale data graphs that may exceed the memory capacity of a single machine, Batch-BENU stores the data graph in a distributed database. Each task queries adjacency sets of the data graph on demand, shuffling the data graph instead of partial matching results. To support incremental subgraph enumeration on dynamic data graphs, we propose the Streaming-BENU framework. Streaming-BENU turns the problem of enumerating incremental matching results into enumerating all matching results of incremental pattern graphs at each time step. We implement Batch-BENU and Streaming-BENU with local database caching and load-balancing optimizations to improve their efficiency. Extensive experiments show that Batch-BENU and Streaming-BENU scale to big graphs and complex pattern graphs, outperforming the state-of-the-art distributed methods by up to one and two orders of magnitude, respectively.
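As a minimal sketch of the "local search task guided by a backtracking-based execution plan" idea, assuming a triangle pattern graph and using an in-memory adjacency map as a stand-in for the distributed database (BENU's actual execution plans, filters, and pattern compilation are not reproduced):

```python
# Sketch only: one backtracking local-search task per data-graph vertex.
# get_adjacency() reads an in-memory dict here, but stands in for on-demand
# adjacency-set queries against a distributed database.
DATA_GRAPH = {
    1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3},
}

def get_adjacency(v):
    return DATA_GRAPH.get(v, set())

def enumerate_triangles_around(start):
    """Enumerate triangles (u1 < u2 < u3) whose smallest vertex is `start`."""
    results = []
    for u2 in get_adjacency(start):
        if u2 <= start:
            continue
        # Backtracking step: u3 must be adjacent to both start and u2.
        for u3 in get_adjacency(start) & get_adjacency(u2):
            if u3 > u2:
                results.append((start, u2, u3))
    return results

# Each vertex spawns one independent local task; tasks can run in parallel
# across workers without shuffling partial matching results.
all_matches = [m for v in DATA_GRAPH for m in enumerate_triangles_around(v)]
print(all_matches)   # [(1, 2, 3), (2, 3, 4)]
```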
In the big data era, there is strong demand for efficient and easy-to-use data replay services over large-scale historical data. For example, securities trading and online e-business services need historical data replay for system testing or ex-post review and analysis, much like replaying videos for security monitoring. Stream processing systems are designed for processing streaming data and thus cannot perform complex replay jobs over static historical data. Database management systems support easy-to-use complex queries but lack stream processing abilities. In this paper, we present a data replay model that combines stream replay and complex query capabilities, allowing applications to replay large-scale historical data from various data sources. First, to meet the demand for flexible replay semantics, we design a set of easy-to-use replay operators that describe various replay behaviors and semantics. Users can compose these operators to build complex replay jobs with diverse requirements. Then, we propose a query mechanism that provides a flexible data loading service. Next, we present the Penguin framework to support the proposed query-based replay model along with the replay operators. Penguin enables users to develop high-throughput, easy-to-use replay services over various large-scale data sources with tunable replay speeds. To further improve replay performance, we propose four system-level optimizations: caching loading-task results, cascade-merging intermediate record queues, producing the replay queue in parallel, and caching remote file streams. Experimental results on replaying millions of records demonstrate that Penguin achieves up to 4x and 144x speedup in data preparation and up to 16x and 9x speedup in replay speed compared with Apache Phoenix and Apache Hive, respectively. As a case study, Penguin has been deployed in the production environments of several securities companies to provide online historical stock data replay services to a large number of stock market users.
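To make the notion of a replay operator concrete, here is a toy "replay at N times speed" operator in Python. It is only a sketch under assumed record shapes ((timestamp, payload) tuples); Penguin's actual operator set, query-based loading, and the four optimizations listed above are not modeled.

```python
import time

# Sketch only: re-emit historical records while compressing the original
# inter-record time gaps by a tunable speedup factor.
def replay(records, speedup=10.0, emit=print):
    """Re-emit (timestamp, payload) records at `speedup` times real time."""
    records = sorted(records, key=lambda r: r[0])
    prev_ts = None
    for ts, payload in records:
        if prev_ts is not None:
            time.sleep(max(0.0, (ts - prev_ts) / speedup))
        emit((ts, payload))
        prev_ts = ts

# Example: three historical trades replayed 10x faster than real time.
trades = [(0.0, "buy AAPL"), (1.0, "sell MSFT"), (3.0, "buy GOOG")]
replay(trades, speedup=10.0)
```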
A personalized educational system provides an open learning environment that leverages advanced technologies to establish a paradigm shift toward active and dynamic teaching and learning patterns. Researchers have tried different approaches to design more advanced personalized educational systems. Some of their designs are instructive; others are problematic, neglecting interaction with students in the educational process. A Fabric of Knowledge (FoK) is an original module that can serve as an effective interactive medium between students and teachers. Our project is to design and develop a FoK-based Personalized Educational System (FoK-PES). This system has been shown to make personalized learning more effective and efficient. Teaching practice also shows that some aspects of FoK-PES need improvement, such as the classification method for cognitive style and the evaluation method for cognitive level.
Complex event recognition (CER) refers to identifying specific patterns composed of several primitive events in event stores. Since fully scanning event stores to identify primitive events that satisfy the query constraints incurs costly I/O overhead, a mainstream and practical approach is to use index techniques to obtain these events. However, prior index-based approaches suffer from significant I/O and sorting overhead when dealing with high predicate selectivity or long query windows (common in real-world applications), which leads to high query latency. To address this issue, we propose ACER, a Range Bitmap-based index, to accelerate CER. First, ACER achieves low index space overhead by grouping events of the same type into a cluster and compressing the cluster data, alleviating the I/O overhead of reading indexes. Second, ACER builds Range Bitmaps in batches (blocks) for the queried attributes and ensures that the events of each cluster in an index block are chronologically ordered; ACER can therefore always obtain ordered query results for a specific event type through merge operations, avoiding sorting overhead. Most importantly, ACER avoids unnecessary disk access to indexes and events via two-phase filtering based on the window condition, further alleviating the I/O overhead. Our experiments on six real-world and synthetic datasets demonstrate that ACER reduces query latency by up to one order of magnitude compared with state-of-the-art techniques.
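A highly simplified sketch of the two-phase, window-first filtering idea is shown below, assuming timestamp-ordered blocks per event type with min/max timestamp summaries; ACER's Range Bitmap encoding and compression are deliberately not reproduced.

```python
from bisect import bisect_left, bisect_right

# Sketch only: phase 1 skips whole blocks whose [min_ts, max_ts] lies outside
# the query window; phase 2 applies the attribute predicate only to events in
# the surviving blocks, which are already chronologically ordered.
class Block:
    def __init__(self, events):
        # events: list of (timestamp, attribute_value) pairs
        self.events = sorted(events)
        self.min_ts = self.events[0][0]
        self.max_ts = self.events[-1][0]

def query(blocks, window_start, window_end, attr_min, attr_max):
    results = []
    for block in blocks:
        # Phase 1: window-condition filtering avoids touching irrelevant blocks.
        if block.max_ts < window_start or block.min_ts > window_end:
            continue
        # Phase 2: evaluate the attribute predicate on the surviving block only.
        lo = bisect_left(block.events, (window_start, float("-inf")))
        hi = bisect_right(block.events, (window_end, float("inf")))
        for ts, value in block.events[lo:hi]:
            if attr_min <= value <= attr_max:
                results.append((ts, value))
    return results   # chronologically ordered within each block

blocks = [Block([(1, 5.0), (2, 9.0)]), Block([(10, 3.0), (12, 7.5)])]
print(query(blocks, window_start=0, window_end=5, attr_min=4.0, attr_max=10.0))
# [(1, 5.0), (2, 9.0)]
```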
In the big data era, data-intensive cluster computing systems like Hadoop have gained much popularity, and YARN, the second generation of Hadoop, has become the general resource manager in the Hadoop ecosystem. In distributed computing scenarios, data locality (scheduling tasks where their data resides) is essential to performance, since higher data locality brings lower network transmission cost and higher throughput. However, we find that the native YARN scheduling mechanism achieves little data locality, and the delay scheduling strategy leads to a long-tail effect while achieving data locality in in-memory computing scenarios. Therefore, in this paper we propose a push-based YARN scheduling mechanism for the in-memory computing environment. First, we classify Resource Requests into various categories. Then, we prune non-local Resource Requests to achieve fast data locality for in-memory computation. Finally, we push the remaining long-tail Resource Requests to data-local nodes to avoid the long-tail effect. The experimental results demonstrate that the proposed scheduling mechanism achieves nearly 100% data locality, compared with the native YARN scheduling mechanism, which achieves only 10%-20%. At the same data-locality percentage, the proposed push-based scheduling mechanism improves throughput by nearly 20% and reduces application running time by nearly 10% compared with the existing delay scheduling mechanism used in YARN.
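The classify/prune/push idea can be sketched on toy data as follows; this is not YARN code, and the request and node representations are hypothetical.

```python
from collections import defaultdict

# Sketch only: place requests node-locally where possible, never attempt
# non-local placement, and push the long-tail requests to data-local nodes
# instead of making them wait as delay scheduling would.
def schedule(requests, free_slots):
    """requests: list of {'task': str, 'preferred_nodes': [node, ...]}
    free_slots: {node: available_container_count}"""
    placements, long_tail = {}, []

    # Classify + prune: only data-local placements are attempted.
    for req in requests:
        for node in req["preferred_nodes"]:
            if free_slots.get(node, 0) > 0:
                placements[req["task"]] = node
                free_slots[node] -= 1
                break
        else:
            long_tail.append(req)

    # Push: queue each long-tail request directly at a node holding its data.
    pushed = defaultdict(list)
    for req in long_tail:
        pushed[req["preferred_nodes"][0]].append(req["task"])
    return placements, dict(pushed)

reqs = [
    {"task": "t1", "preferred_nodes": ["nodeA"]},
    {"task": "t2", "preferred_nodes": ["nodeA"]},
    {"task": "t3", "preferred_nodes": ["nodeB"]},
]
print(schedule(reqs, {"nodeA": 1, "nodeB": 1}))
# ({'t1': 'nodeA', 't3': 'nodeB'}, {'nodeA': ['t2']})
```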
In this paper, immunogenetic theory is applied to traffic signal control, and a new algorithm is proposed to solve the signal timing optimization problem. To obtain the optimal schedule quickly, the improved immunogenetic algorithm simulates the mechanism of antibodies responding to antigens, in which a memory cell base is used to preserve the best antibodies. A method of computing the affinity between antigens using information entropy is introduced, and based on it, antigen clustering is used to guarantee the algorithm's stability and fast convergence. A simulation experiment on a four-phase single-intersection traffic model is designed according to the new algorithm. The results show that the method is feasible and effective.
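The abstract does not give the exact entropy formula, so the following sketch uses a formulation common in immune algorithms, affinity $= 1/(1+H)$ with $H$ the mean per-position information entropy of two encoded individuals; the paper's actual definition may differ.

```python
import math

# Sketch only: an assumed, commonly used entropy-based affinity for two
# encoded individuals (e.g., candidate signal-timing plans).
def mean_entropy(plan_a, plan_b):
    """Average per-position entropy of two equal-length gene vectors."""
    total = 0.0
    for ga, gb in zip(plan_a, plan_b):
        if ga == gb:
            continue                              # identical genes: zero entropy
        total += -2 * 0.5 * math.log2(0.5)        # two distinct symbols, p = 1/2 each
    return total / len(plan_a)

def affinity(plan_a, plan_b):
    return 1.0 / (1.0 + mean_entropy(plan_a, plan_b))

# Example: each individual encodes green-time splits (seconds) of a 4-phase signal.
print(affinity([30, 25, 20, 15], [30, 25, 20, 15]))   # 1.0   (identical plans)
print(affinity([30, 25, 20, 15], [28, 25, 22, 15]))   # ~0.667 (two genes differ)
```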
A multilongitudinal-mode fiber ring laser sensor is proposed and experimentally demonstrated by measuring the strain applied to the laser sensor head. The ring cavity of the laser is formed by a 3-dB coupler, a section of erbium-doped fiber, and one fiber Bragg grating. Photonic generation of the beat signals and the strain measurement theory are discussed in detail. The strain applied to the fiber ring cavity is obtained by measuring the beat frequency shift. The way to select the optimal beat signal for strain measurement is determined through experimental investigation and discussion. The root-mean-square deviation of the strain and the response of the beat frequency to strain are 2.7 μɛ and 1.5 kHz/μɛ at 1993 MHz, respectively. The proposed sensor scheme offers a cost-effective and highly stable device for strain measurement.
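As a worked example of recovering strain from the measured beat-frequency shift, using only the responsivity reported above (about 1.5 kHz per microstrain at the 1993 MHz beat signal):

```python
# Worked example only: strain follows from the measured beat-frequency shift
# divided by the reported responsivity at the chosen beat signal.
RESPONSIVITY_KHZ_PER_MICROSTRAIN = 1.5   # from the abstract, at 1993 MHz

def strain_from_beat_shift(delta_f_khz):
    """Return strain in microstrain for a beat-frequency shift in kHz."""
    return delta_f_khz / RESPONSIVITY_KHZ_PER_MICROSTRAIN

print(strain_from_beat_shift(150.0))   # 100.0 microstrain for a 150 kHz shift
```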
The Bloom filter is an efficient data structure for filtering negative keys (keys not in a given set) with substantially small space. However, in real-world applications there widely exist vulnerable negative keys, which incur high costs if not properly filtered, especially when positive keys are added or deleted dynamically; the problem becomes more severe as keys within a set are dynamically added or deleted. Recent works handle such vulnerable negative keys by incorporating learning techniques, but these learning-based filters fail because the learning techniques can hardly handle incremental insertions or deletions. To address this problem, we propose the SeeSaw Counting Filter (SSCF), which encapsulates the vulnerable negative keys in a unified counter array named the seesaw counter array and dynamically modulates (or varies) the applied hash functions to guard the encapsulated keys from being misidentified. Moreover, we design ada-SSCF to handle scenarios where the vulnerable negative keys cannot be obtained in advance. We extensively evaluate SSCF and show that it outperforms cutting-edge filters by $3\times$ on average in accuracy while ensuring low operation latency. All source code is available in (SSCF-authors).
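For orientation, the sketch below implements only the plain counting-filter baseline (counter array, k hash positions, increment on insert, decrement on delete); SSCF's seesaw counter array and its dynamic modulation of hash functions for vulnerable negative keys are not reproduced, and all names and parameters are illustrative.

```python
import hashlib

# Sketch only: a plain counting Bloom filter supporting dynamic insert/delete.
class CountingFilter:
    def __init__(self, num_counters=1 << 16, num_hashes=4):
        self.counters = [0] * num_counters
        self.m = num_counters
        self.k = num_hashes

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        # Derive k positions from independent 4-byte slices of the digest.
        return [int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.m
                for i in range(self.k)]

    def insert(self, key):
        for p in self._positions(key):
            self.counters[p] += 1

    def delete(self, key):
        if self.contains(key):                    # avoid driving counters negative
            for p in self._positions(key):
                self.counters[p] -= 1

    def contains(self, key):
        return all(self.counters[p] > 0 for p in self._positions(key))

f = CountingFilter()
f.insert("positive-key")
print(f.contains("positive-key"))    # True
print(f.contains("negative-key"))    # False with high probability
f.delete("positive-key")
print(f.contains("positive-key"))    # False
```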