Finite-State-Transducers for Natural Language Processing

Finite State Transducers provide a method for performing mathematical operations on ordered collections of context-sensitive rewrite rules such as those commonly used to implement fundamental natural language processing tasks. Multiple rules may be composed into a single pass, mega rule, significantly increasing the efficiency of rule-based systems.

Quick Primer on Finite State Machines

A finite-state machine (FSM) or automata is an abstract mathematical model of computation that is capable of storing a status or state and changing this state based on input. While not as powerful as other computational models such as Turing machines due to their limited memory, FSMs are applicable to a number of electronic modeling, engineering, and language processing problems. An FSM can be represented as a set of nodes representing the various states of the system and labeled edges between these nodes where the edges represent transitions from one state to another and the labels represent conditions on these transitions. A stream of input (an input tape containing a string) is then ‘processed’ by the FSM, potentially causing a number of state transitions. The simple FSM example below is a remarkably accurate computational model for a ferret:

The daily behavior of my ferret, Pangur Bán, would best be represented by an input tape containing the string, ‘tired,tired,tired,tired,hungry,tired,tired,tired,bored,tired,tired,tired’:

FSMs for Language Processing

In language processing, an FSM containing a start state (node) and an end state can be used to generate or recognize a language defined by all possible combinations of symbols (conditional labels) on each of the edges generated by traversing the FSM from the start state to the end state. The class of languages generated by finite automata is known as the class of regular languages.

Finite State Transducers

A finite state transducer (FST) is a special type of finite state machine. Specifically, an FST has two tapes, an input string and an output string. Rather than just traversing (and accepting or rejecting) an input string, an FST translates the contents of its input string to its output string. Or put another way, it accepts a string on its input tape and generates another string on its output tape.

Context-Sensitive Rules

FSTs are particularly useful in the implementation of certain natural language processing tasks. Context-sensitive rewriting rules (ex: ax → bx) are adequate for implementing certain computational linguistic tasks such as morphological stemming and part-of-speech tagging. Such rewrite rules are also computationally equivalent to finite-state transducers, providing a unique path for optimizing rule based systems.

Take, for example, the following set of ordered context-sensitive rules:

1) change ‘c’ to ‘b’ if followed by ‘x’    cx → bx
2) change ‘a’ to ‘b’ if preceded by ‘rs’    rsa → rsb
3) change ‘b’ to ‘a’ if preceded by ‘rs’ and followed by ‘xy’    rsbxy → rsaxy

Given the following string on the input tape:


the application of the given rule set would proceed as follows:

1) rsaxyrscxy → rsaxyrsbxy
2) rsaxyrsbxy → rsbxyrsbxy
3) rsbxyrsbxy → rsaxyrsaxy

The time required to apply a sequence of context-sensitive rules is dependent upon the number of rules, the size of the context window, and the number of characters in the input string. The inefficiencies  in such an implementation are highlighted in this example by the multi-step transformation required to translate ‘c‘ to ‘b‘ then ‘b‘ to ‘a‘ by rules 1 and 3, and the redundant transformation of ‘a‘ to ‘b‘ and back to ‘a‘ by rules 2 and 3. Finite State Transducers provide a path to eliminate these inefficiencies. But first we need to convert the rules to State Machines.

Converting Rules to FSTs

To do this, we simple represent each rule as an FST where each link between states represent the acceptance of an input character and the expression of the corresponding output character. This input / output combination is denoted within an FST by labeling edges with both the input and output character separated by the ‘/’ character. Following is an FST for each of the above rules:


Extending the FSTs

While the FSTs above represent our set of context-sensitive rules, they would be of little use in matching against an input string as each is designed to process exactly the context widow described in its corresponding rule. To make each FST applicable to a string of arbitrary length and perform the necessary translation each time the rule is fired, we will need to extend each of the Transducers. We do this by allowing for every possible input in our language at each state. For example, rule 1 must be able to handle the string rsaxyrscxy. Since rule 1 matches the string ‘cx‘ and outputs the string ‘bx‘, it must handle the characters ‘r‘, ‘s‘, ‘a‘, ‘x‘, ‘y‘, ‘r‘, and ‘s‘ before finally encountering ‘c‘ and ‘x‘. Then it must handle the final ‘y‘ character. Each of these characters (and all other possible characters) could be explicitly listed on its own individual edge, but to simplify, we can create a single edge labeled with ‘?/?‘, to match and output any character not already represented on another edge leading from that state. FST extension of rules with a trailing context will also require the use of edges with an input but no output (labeled with ‘ε‘ for output) and edges with multiple outputs. Following is the extension for each of the above FSTs:


Composing a single FST

One of the operations that can be performed on a pair of FSTs is composition. The composition operation will take two deterministic FSTs, A and B, and combine their nodes and edges into a single deterministic FST. The resulting Transducer will accept an input string of arbitrary length and output the equivalent of applying Transducer A followed by Transducer B. A full explanation of the FST composition algorithm is beyond the scope of this write-up. Following is the FST resulting from the composition of FSTs 1 and 2 followed by the composition of the resulting FST and FST 3:

Note that while the output of applying the final FST to the input string ‘rsaxyrscxy’ is exactly equivalent to the output of applying each of the individual context-sensitive rules (‘rsaxyrsaxy‘), the transformation required only a single pass through the FST and did not result in any inefficient transformations. While the time required to apply the original rules was dependent upon the number of rules, the size of the context window, and the number of characters on the input tape, application time of the final FST is dependent only upon the number of characters on the input tape.

Comments are closed.