ABL : Alignment-Based Learning

Demo

Provide at least two sentences that have words in common, one sentence each line, and see how the alignment of substitutable parts aids the learning of structure.

Introduction

Alignment-Based Learning (ABL), introduced in [1], is a symbolic grammar inference framework that has been applied successfully to several unsupervised machine learning tasks in Natural Language Processing (NLP). Given only sequences of symbols, a system that implements ABL induces structure by aligning and comparing the input sequences. As a result, the input sequences are augmented with the induced structure.

The demo on this page allows you to apply ABL to unstructured text.

A C++ implementation was developed by Menno van Zaanen and me, partially supported by a grant from Macquarie University. The package is maintained by Menno (see his page for stable release downloads), and I keep a development version on GitHub.

Alignment learning

The first phase in ABL, alignment learning, builds a search space of possible structures, called hypotheses, by comparing the sequences and clustering sub-sequences that share similar contexts. For example, given the following input sentences:

    she is smart
    she is tired
alignment learning would result in the following hypothesized structures:
    (0 she is (1 smart )1 )0
    (0 she is (1 tired )1 )0
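
To make the alignment step concrete, here is a minimal C++ sketch that aligns two tokenized sentences with a plain longest-common-subsequence dynamic program and brackets the unequal parts as hypotheses. This is an illustration of the idea under that simplifying assumption, not the package's actual code; ABL implements more sophisticated alignment strategies (edit-distance-based ones, and suffix trees were explored in [2]).

    #include <algorithm>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    using Tokens = std::vector<std::string>;

    Tokens tokenize(const std::string& s) {
        std::istringstream in(s);
        Tokens t;
        for (std::string w; in >> w;) t.push_back(w);
        return t;
    }

    // Standard longest-common-subsequence table over tokens.
    std::vector<std::vector<int>> lcs_table(const Tokens& a, const Tokens& b) {
        std::vector<std::vector<int>> m(a.size() + 1,
                                        std::vector<int>(b.size() + 1, 0));
        for (size_t i = 1; i <= a.size(); ++i)
            for (size_t j = 1; j <= b.size(); ++j)
                m[i][j] = (a[i - 1] == b[j - 1])
                              ? m[i - 1][j - 1] + 1
                              : std::max(m[i - 1][j], m[i][j - 1]);
        return m;
    }

    // Backtrack through the table, marking the tokens that belong to the
    // shared (linked) material of both sentences.
    void mark_links(const Tokens& a, const Tokens& b,
                    std::vector<bool>& la, std::vector<bool>& lb) {
        auto m = lcs_table(a, b);
        size_t i = a.size(), j = b.size();
        while (i > 0 && j > 0) {
            if (a[i - 1] == b[j - 1]) { la[i - 1] = lb[j - 1] = true; --i; --j; }
            else if (m[i - 1][j] >= m[i][j - 1]) --i;
            else --j;
        }
    }

    // Print a sentence, bracketing every maximal unlinked span as a hypothesis.
    void print_hypotheses(const Tokens& t, const std::vector<bool>& linked) {
        bool open = false;
        int id = 1;
        std::cout << "(0";
        for (size_t i = 0; i < t.size(); ++i) {
            if (!linked[i] && !open) { std::cout << " (" << id; open = true; }
            if (linked[i] && open) { std::cout << " )" << id++; open = false; }
            std::cout << ' ' << t[i];
        }
        if (open) std::cout << " )" << id;
        std::cout << " )0\n";
    }

    int main() {
        Tokens a = tokenize("she is smart");
        Tokens b = tokenize("she is tired");
        std::vector<bool> la(a.size(), false), lb(b.size(), false);
        mark_links(a, b, la, lb);
        print_hypotheses(a, la);  // (0 she is (1 smart )1 )0
        print_hypotheses(b, lb);  // (0 she is (1 tired )1 )0
    }
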
However, the set of hypotheses generated during alignment learning also contains hypotheses that are unlikely to be correct constituents. Furthermore, we assume the underlying grammar to be context-free, which implies that overlapping constituents are unacceptable. For example, consider the following plain-text sentences:
    she runs from CBD to Bondi
    he drives from CBD to Bondi
    he walks terribly slowly
After alignment learning, the hypothesized structures contain overlapping constituents in the second sentence:
    (0 (1 she runs )1 from CBD to Bondi )0
    (0 (1 he (2 drives )1 from CBD to Bondi )2 )0
    (0 he (2 walks terribly slowly )2 )0

Selection learning

In the second phase, selection learning, the most probable combination of hypothesized structures is selected in an Expectation-Maximization search, and overlapping constituents are eliminated according to a heuristic. The final induced structure thus becomes:

    (0 (1 she runs )1 from CBD to Bondi )0
    (0 (1 he drives )1 from CBD to Bondi )0
    (0 he (2 walks terribly slowly )2 )0
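
As an illustration of the overlap-elimination step, the following sketch assumes each hypothesis already carries a probability estimate (in ABL these come from counts over the whole corpus) and greedily keeps the most probable non-crossing spans. The real system evaluates combinations of hypotheses probabilistically, so this greedy variant is a simplification of that search.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    struct Hypothesis {
        int begin, end;  // token span [begin, end)
        double prob;     // estimated probability (assumed to be given)
    };

    // Two spans overlap illegally if they intersect without nesting.
    bool crosses(const Hypothesis& a, const Hypothesis& b) {
        return (a.begin < b.begin && b.begin < a.end && a.end < b.end) ||
               (b.begin < a.begin && a.begin < b.end && b.end < a.end);
    }

    // Keep hypotheses in order of decreasing probability, dropping any
    // that cross an already-kept span.
    std::vector<Hypothesis> select(std::vector<Hypothesis> hs) {
        std::sort(hs.begin(), hs.end(),
                  [](const Hypothesis& x, const Hypothesis& y) {
                      return x.prob > y.prob;
                  });
        std::vector<Hypothesis> kept;
        for (const auto& h : hs) {
            bool ok = true;
            for (const auto& k : kept)
                if (crosses(h, k)) { ok = false; break; }
            if (ok) kept.push_back(h);
        }
        return kept;
    }

    int main() {
        // Sentence 2 above: "he drives from CBD to Bondi". Hypothesis 1
        // spans "he drives" (tokens [0,2)), hypothesis 2 spans "drives from
        // CBD to Bondi" (tokens [1,6)). The probabilities are invented for
        // illustration only.
        std::vector<Hypothesis> hs = { {0, 2, 0.4}, {1, 6, 0.1} };
        for (const auto& h : select(hs))
            std::cout << "kept span [" << h.begin << "," << h.end << ")\n";
        // Prints only [0,2): the crossing, less probable span is dropped.
    }
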

If desired, the induced structure can be transformed into a context-free grammar (CFG):
    0 -> 1 from CBD to Bondi
    0 -> he 2
    1 -> she runs
    1 -> he drives
    2 -> walks terribly slowly
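
This extraction is mechanical: every bracketed constituent contributes one rule whose left-hand side is the constituent's label and whose right-hand side is its immediate children. A small sketch, assuming the "(N ... )N" notation used above:

    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // Collect one rule per bracketed constituent.
    void rules_from(const std::string& tree, std::set<std::string>& rules) {
        std::istringstream in(tree);
        std::vector<std::pair<std::string, std::string>> stack;  // label, rhs
        for (std::string tok; in >> tok;) {
            if (tok[0] == '(') {          // open a constituent
                stack.push_back({tok.substr(1), ""});
            } else if (tok[0] == ')') {   // close it and emit its rule
                auto [label, rhs] = stack.back();
                stack.pop_back();
                rules.insert(label + " ->" + rhs);
                if (!stack.empty()) stack.back().second += " " + label;
            } else {                      // plain word: child of the top
                stack.back().second += " " + tok;
            }
        }
    }

    int main() {
        std::set<std::string> rules;
        rules_from("(0 (1 she runs )1 from CBD to Bondi )0", rules);
        rules_from("(0 (1 he drives )1 from CBD to Bondi )0", rules);
        rules_from("(0 he (2 walks terribly slowly )2 )0", rules);
        for (const auto& r : rules)  // the five rules above, in sorted order
            std::cout << r << "\n";
    }
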

References

  1. van Zaanen, M. (2001). Bootstrapping Structure into Language: Alignment-Based Learning. PhD thesis, School of Computing, University of Leeds, UK.
  2. Geertzen, J. & van Zaanen, M. (2004). Grammatical Inference using Suffix Trees. In: Proceedings of the 7th International Colloquium on Grammatical Inference (ICGI), 3264:163-174, Springer-Verlag.