This class wraps a libfoma finite-state machine (FSM) and allows fast retrieval of similar words via search based on string edit distance (SED).
The API of the class is the following:
class foma_FSM {
public:
/// build automaton from a file
foma_FSM(const std::wstring &, const std::wstring &mcost=L"");
/// delete FSM
~foma_FSM();
/// Use the automaton to obtain the closest matches to the given form,
/// and add them to the given list.
void get_similar_words(const std::wstring &,
std::list<std::pair<std::wstring,int> > &) const;
/// set maximum edit distance of desired results
void set_cutoff_threshold(int);
/// set maximum number of desired results
void set_num_matches(int);
/// Set default cost for basic SED operations
void set_basic_operation_cost(int);
};
The constructor takes one mandatory parameter, the file to load,
and a second optional parameter, a file with the cost matrix for
SED operations. If no cost matrix is given, all operations default
to a cost of 1 (or to the value set with the method
set_basic_operation_cost).
The automaton file may have extension .src or .bin.
If the extension is .src, the file is interpreted as a text
file with one word per line. The FSM is built to recognize the
vocabulary contained in the file.
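As an illustration of the one-word-per-line layout, the following minimal sketch builds the text of such a vocabulary file from a word list. The helper name make_src_contents is hypothetical and not part of FreeLing, and the sketch assumes the words are already encoded as the loader expects (UTF-8 is assumed here).

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of FreeLing: returns the text of a
// vocabulary file with one word per line. Empty words are skipped.
std::string make_src_contents(const std::vector<std::string> &words) {
  std::ostringstream out;
  for (const std::string &w : words) {
    // A word must fit on a single line of the file.
    assert(w.find('\n') == std::string::npos);
    if (!w.empty()) out << w << '\n';
  }
  return out.str();
}
```

The returned string can then be written with std::ofstream to a file whose name ends in .src, so that it is loaded as a text vocabulary.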
If the extension is .bin, the file is interpreted as a binary
libfoma FSM. To compile such a binary file, the FOMA command-line
front-end must be used. The front-end is not included in
FreeLing, so you will need to install FOMA if you want to create binary
FSM files. See http://code.google.com/p/foma for details.
A cost matrix for SED operations may be specified only for
text FSMs (i.e., for .src files).
To use a cost matrix with a .bin file, you can compile
it into the automaton using the FOMA front-end.
The format of the cost matrix must comply with FOMA formats. See
the FOMA documentation, or the examples provided in
data/common/alternatives in the FreeLing tarball.
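The method get_similar_words receives a string and fills the
given list with entries of the FSM vocabulary, sorted by string
edit distance to the input string. The results are limited by the
values set with set_cutoff_threshold (maximum edit distance) and
set_num_matches (maximum number of results).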
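To make the expected behaviour concrete, here is a minimal sketch of a stand-in class that mimics the three methods involved. It is not foma_FSM and does not use libfoma: it scans the whole vocabulary with a unit-cost Levenshtein distance instead of searching an automaton. The class name toy_matcher, its default values, the alphabetical tie-breaking, and the reading of the int in each result pair as the edit distance are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration only; NOT the FreeLing class.
class toy_matcher {
public:
  explicit toy_matcher(std::vector<std::wstring> vocab) : vocab_(std::move(vocab)) {}
  void set_cutoff_threshold(int t) { cutoff_ = t; }
  void set_num_matches(int n) { assert(n >= 0); num_matches_ = n; }

  // Append (word, distance) pairs to out, closest first.
  void get_similar_words(const std::wstring &form,
                         std::list<std::pair<std::wstring, int> > &out) const {
    std::vector<std::pair<int, std::wstring> > hits;
    for (const std::wstring &w : vocab_) {
      int d = distance(form, w);
      if (d <= cutoff_) hits.emplace_back(d, w);  // apply the distance cutoff
    }
    // Sort by distance; ties are broken alphabetically (an assumption).
    std::sort(hits.begin(), hits.end());
    if (hits.size() > static_cast<std::size_t>(num_matches_))
      hits.resize(static_cast<std::size_t>(num_matches_));
    for (const auto &h : hits) out.emplace_back(h.second, h.first);
  }

private:
  // Unit-cost Levenshtein distance via two-row dynamic programming.
  static int distance(const std::wstring &a, const std::wstring &b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
      cur[0] = static_cast<int>(i);
      for (std::size_t j = 1; j <= b.size(); ++j) {
        int subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
        cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
      }
      std::swap(prev, cur);
    }
    return prev[b.size()];
  }

  std::vector<std::wstring> vocab_;
  int cutoff_ = 2;       // assumed default, chosen for the example only
  int num_matches_ = 5;  // assumed default, chosen for the example only
};
```

With the real class, the same call sequence would apply: set the cutoff and the number of matches, then pass the query string and an empty list to get_similar_words and read the results from the list.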
Lluís Padró 2013-09-09