Introduction
The Toolkit for Advanced Discriminative Modeling (TADM) is a C++ implementation for estimating the parameters of discriminative models, such as maximum entropy models. It uses the PETSc and TAO toolkits to provide high performance and scalability. It was written by Rob Malouf and is now being developed as an open source project on Sourceforge in collaboration with Jason Baldridge and Miles Osborne. It is licensed under the GNU Lesser General Public License.
For downloads, forums, and news, check out the Sourceforge project page for TADM.
Background
A feature of maximum entropy (ME) modeling that makes it very attractive is that it is a general purpose technique which can be applied to a wide variety of problems in natural language processing. Indeed, recent years have seen ME techniques used for sentence boundary detection, part of speech tagging, parse selection and ambiguity resolution, and stochastic attribute-value grammars, to name just a few applications (see, e.g., Berger, et al. 1996; Ratnaparkhi 1998; Johnson, et al. 1999; Osborne 2000). However, while parameter estimation for ME models is conceptually straightforward, in practice ME models for typical natural language tasks are usually large, and frequently contain thousands of free parameters. Estimation of such large models is not only expensive, but also, due to sparsely distributed features, sensitive to round-off errors.
See Zhang Le's maxent page for more background and information about maximum entropy.
Input format
The format for event files:
2
5 2 0 1 1 2
3 2 0 3 2 1
3
10 1 3 1
6 2 0 2 2 2
3 1 2 1
A file may begin with a header, bracketed by lines containing &header and /. The header is optional and, if present, is ignored. The rest of the file is a sequence of blocks, one per context. The first line of each block is the number of events for that context (2 and 3 for the two contexts here). Then come the events. Each event line has a frequency, the number of feature-value pairs, and then the pairs of feature number and value. Features are numbered starting with zero. Each feature can appear only once in an event, and must have a value greater than zero. You can have events with a zero frequency; these are used in computing Z(x) for each context, but ignored for computing the entropy and KL divergence. Any feature with an expected value of zero is ignored (i.e., the corresponding parameter is set to 0.0).
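To make the format concrete, here is a minimal Python sketch of a reader for event files, plus the per-context normalization that Z(x) refers to. This is not TADM's own parser: the function names parse_events and conditional_probs are made up for illustration, and the sketch assumes that the closing / of the header sits alone on its line and that frequencies and values can be read as floating-point numbers.

```python
import math

EXAMPLE = """\
2
5 2 0 1 1 2
3 2 0 3 2 1
3
10 1 3 1
6 2 0 2 2 2
3 1 2 1
"""

def parse_events(text):
    """Return a list of contexts; each context is a list of
    (frequency, {feature_number: value}) events."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Skip the optional header (assumption: "/" is alone on its line).
    if lines and lines[0].startswith("&header"):
        lines = lines[lines.index("/") + 1:]
    contexts, i = [], 0
    while i < len(lines):
        n_events = int(lines[i])          # first line of a block
        block = lines[i + 1:i + 1 + n_events]
        if len(block) != n_events:
            raise ValueError("file ends in the middle of a context")
        events = []
        for line in block:
            fields = line.split()
            freq, n_pairs = float(fields[0]), int(fields[1])
            pairs = fields[2:]
            if len(pairs) != 2 * n_pairs:
                raise ValueError(f"expected {n_pairs} pairs in {line!r}")
            feats = {int(pairs[k]): float(pairs[k + 1])
                     for k in range(0, len(pairs), 2)}
            if len(feats) != n_pairs:
                raise ValueError(f"repeated feature in {line!r}")
            events.append((freq, feats))
        contexts.append(events)
        i += 1 + n_events
    return contexts

def conditional_probs(events, params):
    """p(y|x) = exp(sum_i params[i] * f_i(x, y)) / Z(x) for one context."""
    scores = [sum(params[f] * v for f, v in feats.items())
              for _, feats in events]
    top = max(scores)                     # shift scores for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)                      # equals Z(x) * exp(-top)
    return [e / norm for e in exps]
```

Calling parse_events(EXAMPLE) gives two contexts with two and three events respectively. Zero-frequency events still appear in the list passed to conditional_probs, which matches the description above: they contribute to the normalizer even though they carry no weight in the likelihood.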
Event files can be compressed using gzip. As event files tend to get very large, this can save a lot of disk space and improve performance dramatically.
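If you post-process event files with your own scripts, it helps to read plain and gzipped files through the same code path. The following is a small Python sketch of one way to do that; open_events is a hypothetical helper name, and checking the two-byte gzip magic number is this sketch's choice, not a description of how tadm itself detects compression.

```python
import gzip

def open_events(path):
    """Open an event file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":              # gzip files start with these two bytes
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")
```

Since the check looks at the file contents rather than the file name, it also works for compressed files that do not end in .gz.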
Usage
The tadm executable takes all its commands as options on the command line. Some of the most interesting options are:
- -events_in <filename>
- file to read the events from (required)
- -params_out <filename>
- file to write parameter values to
- -method <method>
- optimization method to use (reasonable choices are tao_lmvm, tao_cg_prp, iis, gis, steep; there are other choices, but using them isn't a good idea) (default = tao_lmvm)
- -lbound, -ubound
- set a lower or upper bound constraint on the parameter values (only works with constrained optimization methods like tao_bmlvm)
- -monitor
- display progress towards convergence
- -max_it <n>
- stop if the method still hasn't converged after n iterations (default = 9999)
- -frtol <d>
- relative stopping tolerance (if frtol=.001 then the final log-likelihood will be accurate to about 3 places, whatever that means) (default = 1e-7)
- -fatol <d>
- absolute stopping tolerance (fatol=.001 means stop when the log-likelihood improves by less than .001 between iterations) (default = 1e-10)
- -checkpoint <n>
- write out intermediate parameters every n iterations (default = 0)
- -converge
- use a simplified convergence test (for benchmarking)
- -summary
- print performance summary
- -trmalloc
- use error-checking memory allocator (without this, the memory statistics reported by -summary are meaningless)
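As a rough illustration of how -fatol, -frtol and -max_it interact, here is a Python sketch of the stopping logic as the option descriptions above suggest it. The real tests live inside TAO and may differ in detail; the function name should_stop and the exact form of the relative test (change compared against frtol times the current log-likelihood) are assumptions made for this sketch.

```python
def should_stop(prev_ll, curr_ll, iteration,
                fatol=1e-10, frtol=1e-7, max_it=9999):
    """Return the name of the criterion that fired, or None to keep going."""
    change = abs(curr_ll - prev_ll)
    if change < fatol:                    # absolute improvement is tiny
        return "fatol"
    if change < frtol * abs(curr_ll):     # improvement is tiny relative to the value
        return "frtol"
    if iteration >= max_it:               # give up after max_it iterations
        return "max_it"
    return None
```

With a log-likelihood around -1000 and frtol=.001, the relative test fires once the change between iterations drops below about 1, which is one way to read "accurate to about 3 places".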
There are some recent options which we have not provided documentation for as yet. There are also scores of other options which get passed on to PETSc and TAO (the option -help will list some of them, and more are listed in the documentation for the libraries), but most of them are mainly for profiling and tuning the underlying solvers. Feel free to tinker with the options (the SNES options look particularly interesting and particularly daunting), and let me know if any of them improve anything.
Most of the options have reasonable defaults and can be left out (the exceptions are -events_in, which you must set, and -params_out, which you probably want to set). One feature that's kind of cute is that on startup the program reads default settings from ~/.petscrc (or a different file specified by the option -options_file). This file can also have alias statements, to allow abbreviations for some of the option names. For example, my .petscrc contains:
-monitor
alias -in -events_in
alias -out -params_out
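To show what those alias lines buy you, here is a Python sketch that reads options-file text like the above and expands the abbreviations on a command line. It only mimics the behavior described here and is not PETSc's implementation; the helper names load_options_file and expand, and the file names events.gz and params.txt, are made up for the example.

```python
PETSCRC = """\
-monitor
alias -in -events_in
alias -out -params_out
"""

def load_options_file(text):
    """Split options-file text into default options and an alias table."""
    defaults, aliases = [], {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "alias":
            short, full = tokens[1], tokens[2]
            aliases[short] = full         # e.g. "-in" stands for "-events_in"
        else:
            defaults.extend(tokens)       # e.g. "-monitor" is always on
    return defaults, aliases

def expand(argv, aliases):
    """Replace abbreviated option names with their full names."""
    return [aliases.get(arg, arg) for arg in argv]

defaults, aliases = load_options_file(PETSCRC)
full_argv = defaults + expand(["-in", "events.gz", "-out", "params.txt"], aliases)
```

Here full_argv ends up as -monitor -events_in events.gz -params_out params.txt, so with this .petscrc the short form tadm -in events.gz -out params.txt behaves like the fully spelled-out command with progress monitoring turned on.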
Parallel processing
Since tadm uses MPI for interprocess communication, it can easily be ported to a wide range of parallel architectures, including SMP and Beowulf-type clusters. Documentation for how to do this will come in future releases.
Changes
- version 0.9.5 - First TADM release, basically Rob Malouf's original code relicensed under the GNU Lesser General Public License.
Availability
Go to the TADM download page for source code.
Installing
Some very rough instructions are provided in our current installation guide. We hope to provide better documentation in the near future.
Getting help
Post any questions you have about installing or using TADM to the Sourceforge help forum.
10 August 2005