| Event File | The file to which probabilistic events are output (in text format, or compressed in gz/bz format)
The TERM_TEMPLATE feature of a terminal node of a derivation tree is assigned the name of a lexical entry template (a combination of the name of a lexeme template and the history of lexical rule applications).
unimaker generates probabilistic events from the word information of each terminal node (the 'word'-typed value assigned to the TERM_WORD feature), the name of its lexical entry template (the value of the TERM_TEMPLATE feature), and the word information of nearby terminal nodes.
- Create a derivation word lattice from the derivation by calling the um_derivation_to_deriv_word_lattice/2 predicate.
The WORD feature of each element of the derivation word lattice is assigned a terminal node of the derivation tree as its value.
- Create a word lattice from the derivation tree by calling the um_derivation_to_word_lattice/2 predicate and insert it into the chart.
- Output probabilistic events for positive and negative examples from the features of the derivation word lattice. The values of the WORD feature of the elements in the derivation word lattice, that is, the terminal nodes of the derivation tree, are extracted and processed as follows.
- Supply the um_correct_lexical_entry/2 predicate with a terminal node of the derivation tree to obtain the correct lexical entry.
- Supply the um_complement_lexical_entry/2 predicate with a terminal node of the derivation tree to obtain the names of the other (incorrect) lexical entries.
- Each lexical entry obtained is passed to the extract_lexical_event/4 predicate, which extracts the probabilistic event (its category and feature) and outputs it. At this point, the probabilistic events obtained from correct lexical entries are output as positive examples, and those obtained from the other lexical entries are output as negative examples.
When probabilistic events are output with the extract_lexical_event/4 predicate, information from the word lattice of nearby words can be included in the feature field. This is why the word lattice is inserted into the chart. The derivation word lattice of nearby words is not included in the feature field.
The predicates that create the two types of word lattice from the terminal nodes of a derivation tree are found in "devel/unimake.lil". "devel/unimake.lil" also includes predicates for finding the correct lexical entries to be used at the terminal nodes of a derivation tree. These predicates are explained in the following paragraphs.
- A word lattice created by the um_derivation_to_word_lattice/2 predicate has a left_position feature and a right_position feature that give the position of the corresponding word. It also has a WORD feature, which is assigned the value of the TERM_WORD feature of a terminal node of the derivation tree.
- A derivation word lattice created by um_derivation_to_deriv_word_lattice/2 has the same features giving the position of the corresponding word. It also has a WORD feature, which is assigned a terminal node of the derivation tree.
- The um_correct_lexical_entry/2 predicate returns the correct lexical entry from among the lexical entries assigned to a terminal node of a derivation tree. The returned lexical entry shares the same feature structure with the lexical entries supplied to the predicate.
- The um_complement_lexical_entry/2 predicate returns, one by one, the lexical entries excluded by um_correct_lexical_entry/2 from among the lexical entries assigned to a terminal node of a derivation tree.
These predicates are defined in the following way:
um_derivation_to_word_lattice(derivation_internal & DERIV_DTRS\$Dtrs,
                              $WordLattice) :-
    um_derivation_to_word_lattice_dtrs($Dtrs, $WordLattice).  %% recursive predicate
um_derivation_to_word_lattice(derivation_terminal & TERM_WORD\$Word,
                              [left_position\$LPos & right_position\$RPos &
                               word\$LexEntry]) :-
    $LexEntry = [$Word],
    $Word = POSITION\$LPos,
    $RPos is $LPos + 1.
um_derivation_to_deriv_word_lattice(derivation_internal & DERIV_DTRS\$Dtrs,
                                    $WordLattice) :-
    um_derivation_to_deriv_word_lattice_dtrs($Dtrs, $WordLattice).  %% recursive predicate
um_derivation_to_deriv_word_lattice(derivation_terminal & $Term & TERM_WORD\$Word,
                                    [left_position\$LPos & right_position\$RPos &
                                     word\$LexEntry]) :-
    $LexEntry = $Term,
    $Word = POSITION\$LPos,
    $RPos is $LPos + 1.
um_correct_lexical_entry(TERM_WORD\$Word & LEXENTRY_SIGN\$Sign, $LexName) :-
    lookup_lexicon($Word, $TempNameList),
    member($TempName, $TempNameList),
    lookup_template($TempName, $LexEntry),
    equivalent($LexEntry, $Sign),
    !,
    $LexName = LEX_WORD\$Word & LEX_TEMPLATE\$TempName.
um_complement_lexical_entry(TERM_WORD\$Word & LEXENTRY_SIGN\$Sign, $LexName) :-
    lookup_lexicon($Word, $TempNameList1),
    check_coverage($TempNameList1, $Sign, $TempName1),  %% check whether $TempNameList1 contains
                                                        %% elements that carry $Sign
    findall($Lex,
            (member($TN, $TempNameList1),
             $TN \= $TempName1,
             $Lex = LEX_WORD\$Word & LEX_TEMPLATE\$TN),
            $LexList),
    member(LEX_TEMPLATE\$TempName, $LexList),
    $LexName = LEX_WORD\$Word & LEX_TEMPLATE\$TempName.
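The division of labour among these four predicates can be pictured outside LiLFeS with a small Python sketch. It is only an illustration under stated assumptions: plain dictionaries stand in for feature structures, the `LEXICON` table and its template names are made up, every function name is hypothetical, and simple string equality stands in for the `equivalent/2` check. It is not the Enju implementation.

```python
# Illustrative sketch only: plain dicts stand in for LiLFeS feature structures.
# LEXICON, the template names, and every function name here are hypothetical.

LEXICON = {
    "haag": ["noun_template", "adjective_template"],
    "plays": ["verb_template", "plural_noun_template"],
}

def word_lattice(terminals):
    """Word lattice: each element spans one word and carries only the word."""
    return [{"left_position": t["position"],
             "right_position": t["position"] + 1,
             "word": [t["word"]]} for t in terminals]

def deriv_word_lattice(terminals):
    """Derivation word lattice: same spans, but WORD is the terminal node itself."""
    return [{"left_position": t["position"],
             "right_position": t["position"] + 1,
             "word": t} for t in terminals]

def correct_lexical_entry(terminal):
    """First candidate template equal to the one used in the derivation, or None."""
    for template in LEXICON.get(terminal["word"], []):
        if template == terminal["template"]:  # stands in for equivalent/2
            return (terminal["word"], template)
    return None

def complement_lexical_entries(terminal):
    """All other candidates; empty when the correct entry is not covered."""
    if correct_lexical_entry(terminal) is None:
        return []
    return [(terminal["word"], template)
            for template in LEXICON[terminal["word"]]
            if template != terminal["template"]]

def labeled_entries(terminals):
    """Yield (label, entry): 1 for the correct entry, 0 for each alternative."""
    for element in deriv_word_lattice(terminals):
        terminal = element["word"]
        correct = correct_lexical_entry(terminal)
        if correct is not None:
            yield (1, correct)
        for entry in complement_lexical_entries(terminal):
            yield (0, entry)

terminals = [
    {"word": "haag", "template": "noun_template", "position": 0},
    {"word": "plays", "template": "verb_template", "position": 1},
]
examples = list(labeled_entries(terminals))
```

Here `examples` pairs each terminal's correct entry with label 1 and every remaining candidate with label 0, which mirrors how positive and negative examples are separated before event extraction.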
The extract_lexical_event/4 predicate, which is used for extracting probabilistic events, is found in "grammar/unievent.lil". Probabilistic events of the category "uni" are extracted by this predicate. The feature of such an event contains the following fields:
- the second word before the current word (surface string and POS, stemmed base form and its POS)
- the word immediately before the current word (surface string and POS, stemmed base form and its POS)
- the current word (surface string and POS, stemmed base form and its POS), together with the name of its lexical entry and the name of its lexeme
- the word immediately after the current word (surface string and POS, stemmed base form and its POS)
- the second word after the current word (surface string and POS, stemmed base form and its POS)
- the third word after the current word (surface string and POS, stemmed base form and its POS)
They are specified as follows:
extract_lexical_event("hpsg-uni", "uni", $LexEntry, $Event) :-
    $LexEntry = (LEX_WORD\ (SURFACE\ $Surface &
                            POS\ $Pos &
                            BASE\ $Base &
                            BASE_POS\ $BasePOS &
                            POSITION\ $Position) &
                 LEX_TEMPLATE\($LexTemplate & LEXEME_NAME\$LexemeName)),
    lex_template_label($LexTemplate, $LexName),
    $PositionN2 is $Position - 2,
    $PositionN1 is $Position - 1,
    $PositionP1 is $Position + 1,
    $PositionP2 is $Position + 2,
    $PositionP3 is $Position + 3,
    $PositionP4 is $Position + 4,
    lexical_event($PositionN2, $PositionN1, $Event, $Event2),   %% -2
    lexical_event($PositionN1, $Position, $Event2, $Event3),    %% -1
    $Event3 = [$Surface, $Pos, $LexName, $Base, $BasePOS, $LexemeName|$Event4],
    lexical_event($PositionP1, $PositionP2, $Event4, $Event5),  %% 1
    lexical_event($PositionP2, $PositionP3, $Event5, $Event6),  %% 2
    lexical_event($PositionP3, $PositionP4, $Event6, []).       %% 3
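The way the feature is assembled from a window around the current word can be sketched in Python. This is a hedged illustration, not the Enju code: the dictionary layout of a word and the function names are hypothetical, and the `EOS` marker for positions past the end of the sentence is an assumption (the manual only shows `BOS` for positions before the start).

```python
# Illustrative sketch only; not the Enju implementation.
BOS = "BOS"  # sentence-start marker, as it appears in the event file
EOS = "EOS"  # assumed sentence-end marker (hypothetical, not confirmed by the manual)

def context_fields(words, index):
    """Four fields for a neighbouring word: surface, POS, base form, base POS."""
    if index < 0:
        return [BOS] * 4
    if index >= len(words):
        return [EOS] * 4
    w = words[index]
    return [w["surface"], w["pos"], w["base"], w["base_pos"]]

def lexical_event(words, position, lex_name, lexeme_name, category="uni"):
    """Join the window -2 .. +3 around `position` into one '//'-separated string."""
    w = words[position]
    fields = []
    fields += context_fields(words, position - 2)
    fields += context_fields(words, position - 1)
    # The current word additionally carries the lexical entry and lexeme names.
    fields += [w["surface"], w["pos"], lex_name,
               w["base"], w["base_pos"], lexeme_name]
    for offset in (1, 2, 3):
        fields += context_fields(words, position + offset)
    return "//".join(fields + [category])

words = [
    {"surface": "ms-period-", "pos": "NNP", "base": "ms-period-", "base_pos": "NNP"},
    {"surface": "haag", "pos": "NNP", "base": "haag", "base_pos": "NNP"},
    {"surface": "plays", "pos": "VBZ", "base": "play", "base_pos": "VB"},
    {"surface": "elianti", "pos": "NNP", "base": "elianti", "base_pos": "NNP"},
]
event = lexical_event(words, 0,
                      "[D< N.3sg>]_lxm-noun_adjective_rule",
                      "[D< N.3sg>]_lxm")
print(event)
```

For the first word of the sentence, the two preceding slots are filled with `BOS`, so the string starts with eight `BOS` fields before the fields of the current word.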
The event file output by unimaker looks like the following.
(One event is output per line; the lines are wrapped here for display. In the case of event_2_0, there are 3 probabilistic events.)
event_2_0
1 BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//
[D< N.3sg>]_lxm-noun_adjective_rule//ms-period-//NNP//
[D< N.3sg>]_lxm//haag//NNP//haag//NNP//plays//VBZ//play//VB//
elianti//NNP//elianti//NNP//uni
0 BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//
[D< N.3sg>]_lxm//ms-period-//NNP//[D< N.3sg>]_lxm//haag//
NNP//haag//NNP//plays//VBZ//play//VB//elianti//NNP//elianti//NNP//
uni
0 BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//
[< NP.3sg.adj>]NP.adj_mod//ms-period-//NNP//
[< NP.3sg.adj>]NP.adj_mod//haag//NNP//haag//NNP//plays//VBZ//
play//VB//elianti//NNP//elianti//NNP//uni
event_2_1
...
In this case, event_2_0 represents the probabilistic events in which the word "Ms." corresponds to the lexical entry denoted by "[D< N.3sg>]_lxm-noun_adjective_rule".
The "1" at the beginning of the 2nd line indicates that it is a positive example. The "0" at the beginning of the 3rd and 4th lines indicates that they are negative examples. "BOS" marks the beginning of a sentence.
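Reading such a file back can be sketched in a few lines of Python. This is an illustration under assumptions: the function name and the returned structure are hypothetical, each event is taken to occupy one physical line, and any line that does not start with "0 " or "1 " is treated as the name of a new event group.

```python
# Illustrative sketch only; the function name and return shape are hypothetical.

def parse_event_file(lines):
    """Return {group_name: [(label, fields, category), ...]} in file order."""
    events = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the first space only: template names may contain spaces.
        head, sep, rest = line.partition(" ")
        if head in ("0", "1") and sep:
            if current is None:
                raise ValueError("event line found before any group name")
            parts = rest.split("//")
            # The last '//'-separated item is the category (e.g. "uni").
            events[current].append((int(head), parts[:-1], parts[-1]))
        else:
            current = line
            events[current] = []
    return events

sample = [
    "event_2_0",
    "1 BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//"
    "[D< N.3sg>]_lxm-noun_adjective_rule//ms-period-//NNP//"
    "[D< N.3sg>]_lxm//haag//NNP//haag//NNP//plays//VBZ//play//VB//"
    "elianti//NNP//elianti//NNP//uni",
    "0 BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//"
    "[D< N.3sg>]_lxm//ms-period-//NNP//[D< N.3sg>]_lxm//haag//"
    "NNP//haag//NNP//plays//VBZ//play//VB//elianti//NNP//elianti//NNP//"
    "uni",
    "0 BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//"
    "[< NP.3sg.adj>]NP.adj_mod//ms-period-//NNP//"
    "[< NP.3sg.adj>]NP.adj_mod//haag//NNP//haag//NNP//plays//VBZ//"
    "play//VB//elianti//NNP//elianti//NNP//uni",
]
parsed = parse_event_file(sample)
```

Each parsed event keeps its label, its list of fields, and its category separately, so the positive example of a group can be picked out by filtering on the label.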
Applying masks to extract features
Let us illustrate how to use amisfilter to apply masks to the probabilistic events output above and to extract features for generating data files in Amis format.
| amisfilter Name of Model Mask Module Probabilistic Event File Count File Model File Event File
| | Name of Model | Name of the probabilistic model (also used in parsing)
| | Mask Module | The LiLFeS module that applies masks to probabilistic events
| | Probabilistic Event File | The input probabilistic event file (text file or gz/bz file)
| | Count File | The file to which the frequencies of features are output (text file)
| | Model File | Model file (AmisModel format)
| | Event File | Event file (AmisEvent format)
|
The actual processing is as follows:
- Create features by applying the masks defined for the corresponding category to those probabilistic events in the probabilistic event file that represent observed events. Category-specific masks are defined by the feature_mask/3 predicate.
- The frequency with which each feature appears in observed events is counted and output to the count file.
- Features whose frequency is above a predefined threshold are adopted, and the model file and the event file in Amis format are created.
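The three steps above can be sketched in Python. This is only a rough illustration under several assumptions: the representation of a mask as a list of 0/1 flags (one per field), the `*` placeholder for masked-out fields, all function names, and the "at or above" reading of the threshold are hypothetical and do not describe the actual `feature_mask/3` format or the Amis file formats.

```python
# Illustrative sketch only; mask format, placeholder, and names are assumptions.
from collections import Counter

def apply_mask(fields, mask):
    """Keep fields whose mask flag is 1; replace the others with '*'."""
    return "//".join(f if keep else "*" for f, keep in zip(fields, mask))

def count_features(events, masks):
    """Count masked features over observed (label 1) events only."""
    counts = Counter()
    for label, fields in events:
        if label == 1:
            for mask in masks:
                counts[apply_mask(fields, mask)] += 1
    return counts

def adopt_features(counts, threshold):
    """Adopt features seen at least `threshold` times (assumed cutoff rule)."""
    return {feature for feature, n in counts.items() if n >= threshold}

def encode(fields, masks, adopted):
    """Represent one event by the adopted features it fires."""
    return [f for f in (apply_mask(fields, m) for m in masks) if f in adopted]

# Toy events with three fields each: word, POS, template name.
toy_events = [
    (1, ["haag", "NNP", "T1"]),
    (0, ["haag", "NNP", "T2"]),
    (1, ["elianti", "NNP", "T1"]),
    (0, ["elianti", "NNP", "T2"]),
]
toy_masks = [[1, 1, 1], [0, 1, 1]]  # full feature, and one that ignores the word

counts = count_features(toy_events, toy_masks)
adopted = adopt_features(counts, threshold=2)
encoded = [(label, encode(fields, toy_masks, adopted)) for label, fields in toy_events]
```

In this toy run only the word-independent feature occurs twice among observed events, so it is the only one adopted; the fully specified features fall below the cutoff.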
The feature_mask/3 predicate is found in "grammar/lexmask.lil". The mask
Enju Developers' Manual
Enju Home Page
Tsujii Laboratory
MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)
|