    Refactoring Google’s N-gram frequency norms for psycholinguistic studies
0 Citations · 0 References · 10 Related Papers
    Keywords:
    Code refactoring
    Gram
    n-gram
This dissertation is situated in the field of automatic Machine Translation, in human-machine interaction for people with hearing impairments, using the language of the Deaf: Greek Sign Language (GSL). In this work we present a prototype rule-based machine translation system aimed at creating large, robust parallel written corpora of Greek text and Greek Sign Language, using the Short Transcription of Greek Sign Language (text glosses). The corpora are then used as training data for building n-gram language models, and also as training data for the MOSES statistical machine translation system. It should be noted that the whole process is robust and flexible, as it does not require deep knowledge of GSL grammar. We report timing measurements for the creation of the language resources, evaluate the GSL language models via perplexity, and finally, using the BiLingual Evaluation Understudy (BLEU) score for machine translation evaluation, our prototype MT system achieves promising performance: an average score of 60.53%, and 85.1% / 65.5% / 53.8% / 44.8% for 1-gram / 2-gram / 3-gram / 4-gram.
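The per-n BLEU figures quoted above (85.1% for 1-grams down to 44.8% for 4-grams) are clipped n-gram precisions. A minimal Python sketch of that computation (function names are my own, not from the system described):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision as used in BLEU: each candidate n-gram
    counts at most as often as it appears in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

Full BLEU additionally combines these precisions geometrically and applies a brevity penalty; the sketch covers only the per-n scores reported here.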
    Gram
    n-gram
    Citations (0)
In this paper, we present a new method for the diagnosis of stochastic discrete event systems. The method is based on anomaly detection for sequences, and we call it sequence profiling (SP). SP requires neither a system model nor system-specific knowledge; the only information it needs is event logs from the target system. Using event logs recorded while the system operates normally, N-gram models are learned, where the N-gram model serves as an approximation of the system's behavior. Based on the N-gram model, the diagnoser estimates what kind of fault has occurred in the system, or may conclude that no fault has occurred. The effectiveness of the proposed method is demonstrated by applying it to the diagnosis of a multi-processor system.
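The core idea of sequence profiling can be illustrated in a few lines: learn the n-grams that occur in normal logs, then flag sequences containing unseen n-grams. This is a simplified sketch under my own assumptions (the paper's actual scoring may differ):

```python
from collections import Counter

def learn_ngram_model(logs, n=2):
    """Count the event n-grams observed in normal event logs."""
    model = Counter()
    for seq in logs:
        for i in range(len(seq) - n + 1):
            model[tuple(seq[i:i + n])] += 1
    return model

def anomaly_score(seq, model, n=2):
    """Fraction of n-grams in seq never seen in the normal model.
    0.0 means the sequence is fully explained by normal behaviour."""
    grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams)
```

A high score suggests the log contains event transitions that never occurred during normal operation, which is the signal the diagnoser builds on.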
    Gram
    n-gram
    Profiling (computer programming)
ngram extracts n-gram variables containing counts of how often n-grams occur in a given text. An n-gram is an n-word-long sequence of words. For example, a single word is a unigram (1-gram), a two-word sequence is a bigram (2-gram), and "the black sheep is happy" is a 5-gram. This is useful for text mining applications.
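The counting described above can be sketched in Python (this is an illustrative stand-in, not the ngram command's implementation):

```python
from collections import Counter

def extract_ngram_counts(text, n):
    """Count each word n-gram in the text, mirroring the idea of one
    count variable per observed n-gram."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)
```

For instance, `extract_ngram_counts("the black sheep is happy", 5)` yields a single 5-gram with count 1.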
    Bigram
    n-gram
    Gram
    Extractor
    Feature (linguistics)
    Citations (0)
The N-gram indexing method is the most popular algorithm for Japanese full-text search systems, where each index entry consists of a run of N consecutive characters. Japanese full-text search in particular usually takes a character 2-gram index as its base in order to limit the size of the index file. Although additional higher-gram indices are expected to improve search performance, no experimental evaluation of such indices has been available. This paper presents an evaluation of the improvement in text search performance obtained with additional higher-gram indices, using a Search Term Intensive Approach that selects the terms for the higher-gram indices according to how frequently they appear as search terms in application programs. In the concrete evaluation, one to two hundred thousand paper articles are searched, and a simulation for additional indices of 5 or more grams is applied in addition to the evaluation of 3- and 4-gram additional indices.
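The base 2-gram index works by mapping each character bigram to the documents containing it; a query is answered by intersecting posting lists and then filtering false positives. A minimal sketch under my own naming (the evaluated system's structures are surely more elaborate):

```python
from collections import defaultdict

def build_index(docs, n=2):
    """Character n-gram inverted index: each n-gram maps to the set of
    document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(doc_id)
    return index

def search(index, docs, term, n=2):
    """Candidates must contain every n-gram of the term; a final scan
    removes false positives where the n-grams are not adjacent."""
    grams = [term[i:i + n] for i in range(len(term) - n + 1)]
    if not grams:
        return {d for d, t in docs.items() if term in t}
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return {d for d in candidates if term in docs[d]}
```

Higher-gram indices for frequent search terms shorten this intersection step, which is the performance gain the paper measures.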
    Gram
    n-gram
    Inverted index
    Citations (0)
The addition of support for genericity to mainstream programming languages has a notable influence on refactoring tools. This also applies to the JAVA programming language: versions of the language specification prior to JAVA 5 did not include support for generics. Refactoring tools therefore had to evolve, modifying their refactoring implementations according to the new language characteristics in order to guarantee correct results when transforming code that contains generic definitions or uses generic instantiations. This paper presents an evaluation of the behaviour of refactoring tools on source code that defines or uses generics. We compare the behaviour of five refactoring tools on a well-known refactoring, Extract Method, and its implementation for the JAVA language. We distill the lessons learned from our evaluation into requirements that refactoring tools must take into account in order to fully support this new language feature.
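To make the Extract Method / generics interaction concrete, here is a Python stand-in for the JAVA scenario (the paper's examples are JAVA; the names and the scenario below are mine). The point is that a correct tool must propagate the type parameter into the extracted method's signature:

```python
from typing import TypeVar, List, Optional

T = TypeVar("T")

# Before the refactoring: the scan logic sits inline in a generic function.
def first_duplicate_inline(items: List[T]) -> Optional[T]:
    seen = set()
    for x in items:
        if x in seen:
            return x
        seen.add(x)
    return None

# After Extract Method: the scan is pulled into a helper. Note that the
# extracted signature must carry the type parameter T; dropping it would
# silently weaken the typing of the transformed code.
def _scan_for_repeat(items: List[T]) -> Optional[T]:
    seen = set()
    for x in items:
        if x in seen:
            return x
        seen.add(x)
    return None

def first_duplicate(items: List[T]) -> Optional[T]:
    return _scan_for_repeat(items)
```

In JAVA the analogous failure mode is an extracted method whose parameters are typed with raw types or `Object` instead of the enclosing method's type variables.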
    Code refactoring
    Implementation
    Code (set theory)
    Citations (3)
The n-gram-based inverted index is widely used for information retrieval in some Asian languages and for approximate string matching of protein and DNA sequences, owing to its language-neutral and error-tolerant properties. However, the n-gram-based inverted index suffers from a large index size and long query processing time. In this paper, we propose a two-level n-gram inverted index (called the n-gram/2L index for short) that retains the advantages of the n-gram-based inverted index while reducing index size and improving query processing performance. The n-gram/2L index eliminates the redundancy of positional information present in the n-gram-based inverted index. To do so, it extracts m-subsequences of length m from documents, then extracts n-grams from those m-subsequences, constructing the inverted index in two levels. This two-level construction is theoretically equivalent to eliminating redundancy by normalizing a relation in which a nontrivial multivalued dependency holds, which we prove formally in the paper. The n-gram/2L index has the desirable properties that, as data size grows, its index size shrinks relative to the n-gram index and its query processing performance improves, and that query processing time barely increases with query string length. Experiments on 1 GByte of data show that the n-gram/2L index is up to 1.9–2.7 times smaller than the n-gram-based inverted index, while at the same time improving query processing performance by up to 13.1 times for queries of length 3–18.
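The two-level decomposition described above can be sketched as follows (a simplified illustration under my own assumptions; the paper's index stores full positional postings and chooses m carefully):

```python
from collections import defaultdict

def build_2l_index(docs, m=4, n=2):
    """Two-level n-gram index sketch: documents are split into
    m-subsequences that overlap by n-1 characters; the back index maps
    n-grams to m-subsequences, and the front index maps m-subsequences
    to (doc_id, offset) postings, removing positional redundancy."""
    back = defaultdict(set)    # n-gram -> m-subsequences containing it
    front = defaultdict(list)  # m-subsequence -> (doc_id, offset)
    for doc_id, text in docs.items():
        for off in range(0, len(text), m - n + 1):
            sub = text[off:off + m]
            if len(sub) < n:
                continue
            front[sub].append((doc_id, off))
            for i in range(len(sub) - n + 1):
                back[sub[i:i + n]].add(sub)
    return back, front

def lookup(back, front, gram):
    """Documents containing a single n-gram, resolved through both levels."""
    return {doc for sub in back.get(gram, ()) for doc, _ in front[sub]}
```

Because each n-gram's positions are recorded once per m-subsequence rather than once per document occurrence, repeated subsequences are shared, which is the source of the size reduction reported in the experiments.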
    Gram
    n-gram
    Citations (0)