There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and providing data journalists easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the "long tail" of the Web. In this paper, we discuss both social and technical challenges in building this type of tool, and the lessons that we learned from this experience.
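The open ecosystem described above relies on providers annotating their pages with schema.org/Dataset markup that a crawler can then extract and normalize. The Python sketch below illustrates that extraction step under simplifying assumptions (JSON-LD blocks only, crude regex HTML handling); it is not Dataset Search's actual pipeline, and the helper name and field choices are illustrative.

```python
import json
import re
import urllib.request

def extract_dataset_metadata(page_url):
    """Fetch a page and pull out schema.org/Dataset JSON-LD blocks.

    Illustrative only: a production crawler parses HTML properly and also
    handles microdata/RDFa, multilingual fields, and provider reconciliation.
    """
    html = urllib.request.urlopen(page_url).read().decode("utf-8", "ignore")
    blocks = re.findall(
        r'<script[^>]+application/ld\+json[^>]*>(.*?)</script>', html, re.S)
    datasets = []
    for block in blocks:
        try:
            obj = json.loads(block)
        except json.JSONDecodeError:
            continue
        for item in (obj if isinstance(obj, list) else [obj]):
            if item.get("@type") != "Dataset":
                continue
            creator = item.get("creator")
            if isinstance(creator, dict):
                creator = creator.get("name")
            # Normalize a few core fields before aggregation/reconciliation.
            datasets.append({
                "name": item.get("name"),
                "description": item.get("description"),
                "url": item.get("url", page_url),
                "provider": creator,
            })
    return datasets
```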
A new generation of data processing systems, including web search, Google’s Knowledge Graph, IBM’s Watson, and several different recommendation systems, combine rich databases with software driven by machine learning. The spectacular successes of these trained systems have been among the most notable in all of computing and have generated excitement in health care, finance, energy, and general business. But building them can be challenging, even for computer scientists with PhD-level training. If these systems are to have a truly broad impact, building them must become easier. We explore one crucial pain point in the construction of trained systems: feature engineering. Given the sheer size of modern datasets, feature developers must (1) write code with few effective clues about how their code will interact with the data and (2) repeatedly endure long system waits even though their code typically changes little from run to run. We propose brainwash, a vision for a feature engineering data system that could dramatically ease the Explore-Extract-Evaluate interaction loop that characterizes many trained system projects.
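As an illustration of the Explore-Extract-Evaluate loop the abstract refers to, the Python sketch below shows two successive feature revisions each forcing a full re-extraction and re-evaluation; the dataset, column names, and feature functions are hypothetical, and this is not brainwash's API.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

raw = pd.read_csv("clinical_notes.csv")  # hypothetical corpus with a "label" column

def feature_v1(df):
    # Explore/Extract: a first guess at a useful signal.
    return df["note_text"].str.contains("chest pain").astype(int)

def feature_v2(df):
    # A small revision; without caching, the whole pipeline still re-runs.
    return df["note_text"].str.count(r"pain").clip(upper=3)

for extract in (feature_v1, feature_v2):
    X = extract(raw).to_frame("f")
    y = raw["label"]
    # Evaluate: the long waits here are the pain point brainwash targets.
    score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print(extract.__name__, round(score, 3))
```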
Abstract A three-phase strategy was adopted for a deep-water reservoir development due to large uncertainties associated with fault compartmentalization and aquifer support. The first phase started with primary production from several wells; water injection was implemented a few years later as the second phase. Finally, the third phase involved infill wells. Data collected during drilling was utilized to improve reservoir characterization. This paper describes how a Surveillance, Analysis and Optimization (SA&O) plan was utilized to resolve key subsurface uncertainties and optimize the development plan, along with lessons learned and best practices. The development plan included adding five infill wells during the infill campaign. Two of those infills, Well1 and Well2, delineated the reservoir south of a major fault. Early reservoir characterization placed two sealing faults between the Well1 and Well2 wells, creating a separate fault compartment with a future infill well in the development plan. Steeply dipping beds below a thick salt canopy make seismic imaging a challenge and fault delineation uncertain. Although a future infill well was planned to recover the reserves from this compartment, the length and transmissibility of the faults, as well as reservoir heterogeneity between the Well1 and Well2 wells, were highly uncertain. Prior to start-up of Well2, a pulse test and multiple pressure build-up tests were conducted as part of the SA&O plan to resolve fault uncertainties. The pulse test involved creating pressure pulses by shutting in and starting up Well1 and monitoring Well2 for pulse arrival. The test design was developed using Intersect and Saphir numerical simulations, sensitizing on the fault extent and transmissibility uncertainties. The test was planned and executed successfully as a result of multi-functional collaboration between the asset development, reservoir management support, and operations teams. Analysis of all transient pressure data indicated very strong pressure communication without any detectable barrier between the Well1 and Well2 wells. Recent seismic imaging enhancements have since corroborated these results and increased confidence in the interpretation. Reservoir simulations without these compartmental faults indicated that Well1 and Well2 can effectively drain most of the recoverable oil, with limited improvement from an additional well. Consequently, the development plan was optimized by removing the future infill well, translating to a significant capital cost saving.
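The pulse-test idea can be illustrated with a small numerical sketch: estimate the lag between the rate schedule at Well1 and the pressure response at Well2 by cross-correlation. This is a toy illustration on synthetic data only, not the Intersect/Saphir design workflow described above; every value below is made up.

```python
import numpy as np

def estimate_pulse_lag(rate_well1, pressure_well2, dt_hours):
    """Estimate the delay between rate pulses at Well1 and the pressure
    response at Well2 via cross-correlation (illustrative only)."""
    r = (rate_well1 - np.mean(rate_well1)) / np.std(rate_well1)
    p = (pressure_well2 - np.mean(pressure_well2)) / np.std(pressure_well2)
    corr = np.correlate(p, r, mode="full")
    lag_idx = np.argmax(np.abs(corr)) - (len(r) - 1)
    return lag_idx * dt_hours

# Synthetic example: an on/off shut-in schedule and a lagged, attenuated response.
t = np.arange(0, 240, 1.0)                    # hours
rate = np.where((t % 48) < 24, 5000.0, 0.0)   # pulses created at Well1
pressure = -0.02 * np.roll(rate, 12) + np.random.normal(0, 1, t.size)
print(f"estimated lag ~ {estimate_pulse_lag(rate, pressure, 1.0):.0f} hours")
```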
The following article is the first of two parts and details a collection of trainee Educational Psychologists’ experiences of carrying out race and equity projects within their local authority placements. The Division of Educational and Child Psychology (DECP; 2006) guidance and tool for “Promoting Racial Equality within Educational Psychology Services” was provided by our course provider as a collective stimulus, and a range of projects were carried out in response to this. The article details the process of and reflections on these projects, with the aim of raising consciousness about the benefits, challenges and complexities of promoting anti-racist practice within EP and school practice. The authors hope that it might subsequently ignite conversations and creativity within EPS teams and trainee courses working on their anti-racist practice.
Active users of social networks are subjected to extreme information overload, as they tend to follow hundreds (or even thousands) of other users. Aggregated social feeds on sites like Twitter are insufficient, showing superfluous content and not allowing users to separate their topics of interest or place a priority on the content being pushed to them by their “friends.” The major social network platforms have begun to implement various features to help users organize their feeds, but these solutions require significant human effort to function properly. In practice, the burden is so high that most users do not adopt these features. We propose a system that seeks to help users find more relevant content in their feeds but does not require explicit user input. Our system, BUTTERWORTH, automatically generates a set of “rankers” by identifying sub-communities of the user’s social network and the common content they produce. These rankers are presented using human-readable keywords and allow users to rank their feed by specific topics. We achieve an average top-10 precision of 78%, compared to a baseline of 45%, for automatically generated topics.
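The pipeline the abstract describes can be sketched roughly as follows: detect sub-communities in the user's follow graph, label each community with keywords drawn from its members' posts, and rank feed items against the chosen topic. The Python below is a minimal illustration, not BUTTERWORTH's implementation; the graph, post store, and scoring function are simplifying assumptions.

```python
import networkx as nx
from networkx.algorithms import community
from sklearn.feature_extraction.text import TfidfVectorizer

def build_rankers(follow_graph, posts_by_user, top_k=5):
    """Return one (keywords, members) 'ranker' per detected sub-community."""
    rankers = []
    for group in community.greedy_modularity_communities(follow_graph):
        docs = [" ".join(posts_by_user.get(u, [])) for u in group]
        vec = TfidfVectorizer(stop_words="english", max_features=2000)
        tfidf = vec.fit_transform(docs)
        weights = tfidf.sum(axis=0).A1                # aggregate term weights
        terms = vec.get_feature_names_out()
        keywords = [terms[i] for i in weights.argsort()[::-1][:top_k]]
        rankers.append({"keywords": keywords, "members": set(group)})
    return rankers

def rank_feed(feed_items, ranker):
    """Order feed items by overlap with the ranker's topic keywords."""
    kw = set(ranker["keywords"])
    return sorted(feed_items,
                  key=lambda item: len(kw & set(item["text"].lower().split())),
                  reverse=True)
```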
Code and data for Burgess et al. (2018), Protecting marine mammals, turtles, and birds by rebuilding global fisheries. Published in Science, 359 (6381), 1255–1258, 2018. DOI: 10.1126/science.aao4248
The original repo may be found at: https://github.com/grantmcdermott/bycatch
Lexical simplification of scientific terms represents a unique challenge due to the lack of a standard parallel corpus and the fast rate at which vocabulary shifts along with research. We introduce SimpleScience, a lexical simplification approach for scientific terminology. We use word embeddings to extract simplification rules from a parallel corpus containing scientific publications and Wikipedia. To evaluate our system we construct SimpleSciGold, a novel gold standard set for science-related simplifications. We find that our approach outperforms prior context-aware approaches at generating simplifications for scientific terms.
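A minimal sketch of the embedding-based rule extraction idea: candidate simplifications are everyday words whose vectors lie close to the scientific term's vector. This is illustrative only (the actual system also applies corpus-frequency and other filters), and the placeholder vectors below stand in for embeddings trained on the parallel corpus.

```python
import numpy as np

def extract_simplification_rules(vectors, scientific_vocab, simple_vocab, k=3):
    """For each scientific term, propose the k nearest 'simple' words by
    cosine similarity of word embeddings (illustrative sketch)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    rules = {}
    for term in scientific_vocab:
        if term not in vectors:
            continue
        scored = [(w, cos(vectors[term], vectors[w]))
                  for w in simple_vocab if w in vectors and w != term]
        scored.sort(key=lambda x: x[1], reverse=True)
        rules[term] = [w for w, _ in scored[:k]]
    return rules

# Placeholder embeddings; in practice these come from vectors trained on the
# combined scientific + Wikipedia corpus.
vectors = {w: np.random.rand(50) for w in ["hypertension", "high", "blood", "pressure"]}
print(extract_simplification_rules(vectors, ["hypertension"], ["high", "blood", "pressure"]))
```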
Many real networks that are collected or inferred from data are incomplete due to missing edges. Missing edges can be inherent to the dataset (Facebook friend links will never be complete) or the result of sampling (one may only have access to a portion of the data). The consequence is that downstream analyses that "consume" the network will often yield less accurate results than if the edges were complete. Community detection algorithms, in particular, often suffer when critical intra-community edges are missing. We propose a novel consensus clustering algorithm to enhance community detection on incomplete networks. Our framework utilizes existing community detection algorithms that process networks imputed by our link-prediction-based sampling algorithm and merges their multiple partitions into a final consensus output. On average, our method boosts the performance of existing algorithms by 7% on artificial data and 17% on ego networks collected from Facebook.
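The framework can be sketched roughly as follows: score likely missing edges with a link-prediction measure, impute a sample of them, run an existing community detection algorithm on each imputed graph, and merge the resulting partitions through a co-assignment consensus. The Python below is an illustrative approximation using networkx, not the authors' code; the edge sampling and majority threshold are simplifying assumptions.

```python
import random
import networkx as nx
from networkx.algorithms import community

def consensus_communities(G, n_samples=10, n_added=50):
    nodes = list(G)
    idx = {u: i for i, u in enumerate(nodes)}
    co = [[0] * len(nodes) for _ in nodes]      # co-assignment counts

    # Score candidate missing edges once (Jaccard link prediction).
    preds = sorted(nx.jaccard_coefficient(G), key=lambda t: t[2], reverse=True)
    pool = [(u, v) for u, v, _ in preds[:5 * n_added]]

    for _ in range(n_samples):
        H = G.copy()
        # Impute a random sample of the highest-scoring missing edges.
        H.add_edges_from(random.sample(pool, min(n_added, len(pool))))
        # Run an existing detector on the imputed graph.
        for c in community.greedy_modularity_communities(H):
            for u in c:
                for v in c:
                    co[idx[u]][idx[v]] += 1

    # Consensus: keep pairs clustered together in a majority of samples.
    C = nx.Graph()
    C.add_nodes_from(nodes)
    C.add_edges_from((u, v) for u in nodes for v in nodes
                     if u != v and co[idx[u]][idx[v]] > n_samples / 2)
    return list(nx.connected_components(C))
```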
Abstract Field X is a mature offshore field in the Niger Delta. In 2013, several wells in field X experienced significant production decline; a study conducted by a cross-functional team identified fines migration as the primary cause of sandface impairment, and acid stimulation was recommended as remediation for the damage. This technical work highlights the strategies and methodology adopted in implementing acid stimulation in field X. The approach to candidate selection involved screening of historical well test data and the pressure transient analysis (PTA) database for impaired/at-risk completions. Typical selection criteria are productivity index (PI) degradation (>50%), pressure drop due to near-wellbore skin (DP skin >500 psi), violation of the flux constraint, Interval Control Valve (ICV) integrity (dual completion), and water-cut evolution (<5.0%). The team also piloted diversion agents in the field to improve stimulation in wells with high water-cut. A novel method of using a dynamically positioned marine vessel with a Remotely Operated Vehicle (ROV) and customized dual-conduit Coiled Tubing (CT) has been deployed to bullhead acid into the formation. In the last two campaigns, executed in 2020 and 2021, this ingenious way of harnessing value from field X resulted in a combined incremental Initial Production (IP) gain of approximately 10,000 bopd and safeguarded the completion of a critical oil producer (~11,000 bopd). The total project cost for the two campaigns was ~$25.4 million, resulting in significant cost savings (>$51.6 million) compared to rig operations. The use of a cross-functional team, robust candidate selection, continuous lookbacks and the adoption of best practices were critical to successful execution. Some challenges being experienced in the field are lack of adequate injectivity, formation lock-up, increasing gas-oil ratio (GOR) and increasing water-cut. The team plans to deploy diverters across additional high water-cut wells in future campaigns.
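The candidate-selection screen can be illustrated with a small pandas sketch that applies the criteria quoted above to a well-test/PTA summary table; the file and column names are assumptions, and only the thresholds come from the text.

```python
import pandas as pd

# Illustrative screening of a well-test/PTA summary table against the
# selection criteria quoted above; file and column names are assumptions.
wells = pd.read_csv("field_x_pta_summary.csv")

candidates = wells[
    (wells["pi_degradation_pct"] > 50)     # PI degradation > 50%
    & (wells["dp_skin_psi"] > 500)         # near-wellbore skin pressure drop > 500 psi
    & (wells["water_cut_pct"] < 5.0)       # water-cut evolution < 5.0%
    & (wells["flux_violation"])            # flux constraint violated
    & (wells["icv_integrity_ok"])          # ICV integrity for dual completions
]
print(candidates[["well", "pi_degradation_pct", "dp_skin_psi"]])
```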