TreeCloud & Unitex: an increased synergy

Claude Martineau

2017

TreeCloud builds a tree cloud visualization of a text: it looks like a tag cloud in which the tags are displayed around a tree that reflects the co-occurrence distance between the words of the text.

Given two words A and B, the observed counts are:
• $O_{11}$, observed number of sliding windows containing both A and B
• $O_{12}$, observed number of sliding windows containing A but not B
• $O_{21}$, observed number of sliding windows containing B but not A
• $O_{22}$, observed number of sliding windows containing neither A nor B

The following variables are defined:
• $R_1 = O_{11} + O_{12}$, number of sliding windows containing A
• $R_2 = O_{21} + O_{22}$, number of sliding windows not containing A
• $C_1 = O_{11} + O_{21}$, number of sliding windows containing B
• $C_2 = O_{12} + O_{22}$, number of sliding windows not containing B
• $N = R_1 + R_2 = C_1 + C_2$, number of sliding windows
• $E_{11} = R_1 C_1 / N$, expected number of sliding windows containing both A and B
• $E_{12} = R_1 C_2 / N$, expected number of sliding windows containing A but not B
• $E_{21} = R_2 C_1 / N$, expected number of sliding windows containing B but not A
• $E_{22} = R_2 C_2 / N$, expected number of sliding windows containing neither A nor B

The co-occurrence formulas are defined as follows:
• jaccard: $1 - O_{11} / (O_{11} + O_{12} + O_{21})$
• liddell: $1 - (O_{11} O_{22} - O_{12} O_{21}) / (C_1 C_2)$
• dice: $1 - 2 O_{11} / (R_1 + C_1)$
• hyperlex: $1 - \max(O_{11} / R_1, O_{11} / C_1)$
• poissonstirling: $O_{11} (\log O_{11} - \log E_{11} - 1)$
• chisquared: $1000 - N (O_{11} - E_{11})^2 / (E_{11} E_{22})$
• zscore: $1 - (O_{11} - E_{11}) / \sqrt{E_{11}}$
• ms: $1 - \min(O_{11} / R_1, O_{11} / C_1)$
• oddsratio: $1 - \log((O_{11} O_{22}) / (O_{12} O_{21}))$
• loglikelihood: $1 - 2 (O_{11} \log(O_{11} / E_{11}) + O_{12} \log(O_{12} / E_{12}) + O_{21} \log(O_{21} / E_{21}) + O_{22} \log(O_{22} / E_{22}))$
• gmean: $1 - O_{11} / \sqrt{R_1 C_1} = 1 - O_{11} / \sqrt{N E_{11}}$
• mi (mutual information): $1 - \log(O_{11} / E_{11})$
• ngd (normalized Google distance): $(\max(\log R_1, \log C_1) - \log O_{11}) / (N - \min(\log R_1, \log C_1))$

Unitex/GramLab is a corpus analyser and annotation tool
• Based on automata and RTNs (recursive transition networks) with outputs
• Multilingual: up to 22 languages (French, English, ..., Greek, ..., Korean, Thai)
• Unicode 3.0 (UTF-8, UTF-16LE, UTF-16BE)
• Cross-platform: Linux, macOS, Windows
• Open source: https://github.com/UnitexGramLab
• Website and binary installers: http://unitexgramlab.org
• Under development since 2001 by a group of passionate volunteers

Unitex/GramLab uses linguistic resources:
• DELA (LADL electronic dictionaries)
• Syntactic or semantic rules called «local grammars», represented by graphs

A typical DELA entry is composed of a simple or compound inflected form, followed by a lemma and grammatical information. Each entry can be associated with syntactic and semantic attributes and inflection rules:

inflected_form,lemma.grammatical_information+attributes:inflection_rule

Example: given the French simple word "avocat" (lawyer) and the compound word "avocat d'affaires" (business lawyer), a DELA representation would be:

avocat,avocat.N+Hum+Prof:ms
avocate,avocat.N+Hum+Prof:fs
avocats,avocat.N+Hum+Prof:mp
avocates,avocat.N+Hum+Prof:fp
avocat d'affaires,avocat d'affaires.N+Hum+Prof:ms
avocate d'affaires,avocat d'affaires.N+Hum+Prof:fs
avocats d'affaires,avocat d'affaires.N+Hum+Prof:mp
avocates d'affaires,avocat d'affaires.N+Hum+Prof:fp

Each entry consists of (1) the inflected form, (2) after the comma, the canonical form, (3) after the period, the grammatical category, (4) after "+", the semantic attributes, and (5) after ":", the inflectional information (m: masculine, f: feminine, s: singular, p: plural).

The graphical representation of a local grammar is composed of a set of linked boxes. A successful path is a path between the initial and final states.

The example grammar contains two search paths:
• an adverb ending in -ly followed by a past participle
• a noun followed by a verb in progressive form

Applying a dictionary to the text produces the text dictionary; a local grammar is then applied. Lexical masks refer to this text dictionary. The recognized sequences are surrounded by a tag, and the results are presented in the form of concordances.

Some examples of sequences matched and not matched by a grammar:
• Unitex-GramLab is a corpus processing suite [MATCH]
• Unitex-GramLab is an open source corpus processing suite [MATCH]
• Unitex-GramLab is a hard to learn corpus processing suite [FAIL]
• Unitex-GramLab is [FAIL]

Several ways to use Unitex/GramLab
• Two interfaces written in Java, Unitex IDE (classic) and GramLab IDE (project-oriented), on top of the Unitex Core written in C/C++
• Command lines or system calls with Perl, Python, etc.
• The C and Java (JNI) API, which provides access to a virtual file system and a persistence layer for resources (alphabets, dictionaries and corpora)

How and Why to plug Unitex into TreeCloud?

Take advantage of the work already done by Unitex.

Unitex/GramLab analysis steps

Program called    Created files
Normalize         text.snt
Tokenize          tokens.txt, text.cod
Dico              dlf, dlc, err
Locate            concord.ind
Concord           concordances, annotated text

At the end of the Unitex analysis process:
• text.snt contains the cleaned text (normalization of separator characters)
• text.cod contains, for each token of the text, its index in the token list of the tokens.txt file
• dlf, dlc and err contain, respectively, the simple words, the compound words and the unknown words
• concord.ind contains the matched sequences with their positions in the text (XXX, and multiword units)

To get the «new text», we retokenize the text, using the matched sequences of the concord.ind file as the new tokens of the text. New tokens.txt and text.cod files are created. This avoids reading the text twice and splitting it into words twice. Thanks to the Unitex API and virtual file system, all this work is done in memory.
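The retokenization step described above can be illustrated with a minimal Python sketch. It is not the Unitex API and does not read the real concord.ind format: the function name `retokenize`, the token list, and the `(start, end)` token-index pairs standing in for match positions are all hypothetical, chosen only to show how matched sequences become single tokens and how a new token list and index list could be derived.

```python
def retokenize(tokens, matches):
    """Merge matched token spans into single tokens (illustrative only).

    tokens  : token strings in text order (hypothetical input)
    matches : (start, end) token-index pairs, end exclusive, standing in
              for the positions of the matched sequences
    Returns (vocabulary, codes): the distinct tokens in order of first
    appearance and, for each token of the new text, its index in that list.
    """
    new_text = []
    i = 0
    for start, end in sorted(matches):
        if start < i:
            continue  # ignore a match overlapping one already merged
        new_text.extend(tokens[i:start])              # tokens before the match
        new_text.append("".join(tokens[start:end]))   # the match as one token
        i = end
    new_text.extend(tokens[i:])                       # tokens after the last match

    vocabulary, index_of, codes = [], {}, []
    for tok in new_text:
        if tok not in index_of:
            index_of[tok] = len(vocabulary)
            vocabulary.append(tok)
        codes.append(index_of[tok])
    return vocabulary, codes


# Hypothetical tokenization of "un avocat d'affaires et un avocat"
tokens = ["un", " ", "avocat", " ", "d", "'", "affaires",
          " ", "et", " ", "un", " ", "avocat"]
vocabulary, codes = retokenize(tokens, [(2, 7)])
print(vocabulary)  # ['un', ' ', "avocat d'affaires", 'et', 'avocat']
print(codes)       # [0, 1, 2, 1, 3, 1, 0, 1, 4]
```

In this sketch the compound "avocat d'affaires" becomes one token, distinct from the simple word "avocat", which is the effect the retokenization is meant to achieve before co-occurrences are counted.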

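As a worked illustration of the contingency-table definitions and of four of the co-occurrence formulas given in this document (jaccard, dice, gmean, mi), here is a minimal Python sketch. The function name `cooccurrence_distances` is hypothetical, the natural logarithm is an assumption (the base of "log" is not stated), and the counts in the example are invented for illustration; this is not TreeCloud's own code.

```python
import math


def cooccurrence_distances(o11, o12, o21, o22):
    """Compute a few co-occurrence distances from the observed window counts.

    Assumes o11 > 0 and non-empty marginals; otherwise the divisions and
    the logarithm below are undefined.
    """
    r1 = o11 + o12          # windows containing A
    c1 = o11 + o21          # windows containing B
    n = o11 + o12 + o21 + o22
    e11 = r1 * c1 / n       # expected windows containing both A and B

    return {
        "jaccard": 1 - o11 / (o11 + o12 + o21),
        "dice": 1 - 2 * o11 / (r1 + c1),
        "gmean": 1 - o11 / math.sqrt(r1 * c1),
        # natural logarithm assumed here
        "mi": 1 - math.log(o11 / e11),
    }


# Invented counts: 100 sliding windows in total
for name, value in cooccurrence_distances(10, 20, 30, 40).items():
    print(f"{name}: {value:.4f}")
```

With these invented counts, $R_1 = 30$, $C_1 = 40$, $N = 100$ and $E_{11} = 12$, so for instance jaccard is $1 - 10/60$ and dice is $1 - 20/70$.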