I'm occasionally asked how one goes about actually building an NLG system. Robert Dale and I wrote a book about this, but reading a book is a big time commitment, and the book is also starting to get out of date. I hope this FAQ will provide short, up-to-date answers for interested people.
NLG systems are computer programs which generate texts in English and other human languages, usually based on some non-linguistic data or input. For example,
For many people, the most obvious way to create a program that produces English texts is to write code which uses "fill in the blank" templates, or perhaps just concatenates strings together into sentences. The NLG approach differs from this in two ways (both matters of degree rather than binary distinctions):
Note that it is perfectly possible to do a thorough analysis of effective language use and then build a template-based system based on this analysis. Whether this counts as "NLG" is a question about definitions which I will avoid. The advantage of modular construction is that it makes systems easier to parametrise, control, and maintain (for example, if we want to change the usage of pronouns in generated texts, this involves changing one localised module in a modularised NLG system but a global rewrite in a non-modularised template system). This advantage is only important for systems that are seriously used, however; a research prototype which is only used in a few demos, for example, may not benefit from the modularised NLG approach.
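The pronoun example above can be sketched in code. This is purely illustrative (hypothetical functions, not from any real NLG library): in the template version the wording is fixed inside each string, so a change to pronoun policy means editing every template; in the modular version, pronoun usage lives in one localised referring-expression module.

```python
# Illustrative sketch (hypothetical code, not from any real NLG system):
# a "fill in the blank" template versus a modular design in which
# pronoun usage is decided by a single referring-expression module.

def template_sentence(name, temp):
    # Template approach: referring expressions are hard-wired into the string.
    return (f"{name} reports a temperature of {temp} degrees. "
            f"{name} is within normal limits.")

def refer(entity, already_mentioned):
    # Modular approach: one module decides between a pronoun and a full name.
    # Changing pronoun policy means changing only this function.
    return "it" if already_mentioned else entity

def modular_sentence(name, temp):
    first = refer(name, already_mentioned=False)
    second = refer(name, already_mentioned=True).capitalize()
    return (f"{first} reports a temperature of {temp} degrees. "
            f"{second} is within normal limits.")

print(template_sentence("Sensor 3", 21))
print(modular_sentence("Sensor 3", 21))
```

In a real system the referring-expression module would of course be far more sophisticated (tracking discourse context, ambiguity, and so on), but the architectural point is the same: the decision is made in one place.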
Unfortunately, there isn't much freely (or commercially) available NLG software, especially if we exclude poorly documented research systems which no one other than the developer can easily use. In an ideal world, the NLG community would develop robust and well-documented standard parametrised modules, development support tools for creating new modules, and APIs and architectures for integrating modules into systems; in other words, something like GATE. This hasn't been done yet, though, which is a shame in many ways.
Currently, the only modules which are shared across institutions and projects are realisers (grammar modules). Probably the most popular free realiser is FUF/SURGE, and the most popular commercial realiser is RealPro. Both of these systems are complex, and require developers to spend a considerable amount of time learning how to use them. Recently I have helped to develop the simplenlg realiser, which has less functionality but is (I believe) considerably easier to use.
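To give a rough idea of what a realiser does, here is a toy sketch (hypothetical code, not the actual FUF/SURGE, RealPro, or simplenlg API): it turns an abstract sentence specification into a grammatical string, handling agreement and orthography so the rest of the system does not have to.

```python
# Toy realiser sketch (hypothetical, for illustration only): map an
# abstract sentence specification to a surface string, handling
# subject-verb agreement, word order, capitalisation, and punctuation.

def realise(spec):
    subject = spec["subject"]
    verb = spec["verb"]
    # Very crude agreement rule: add "-s" for a singular subject.
    if spec.get("number", "singular") == "singular":
        verb = verb + "s"
    words = [subject, verb]
    if "object" in spec:
        words.append(spec["object"])
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

print(realise({"subject": "the dog", "verb": "chase", "object": "the cat"}))
# → "The dog chases the cat."
```

Real realisers cover vastly more grammar than this (tense, negation, subordinate clauses, morphological irregularities, and so on), which is why they take time to learn; but the input/output contract is essentially the one shown here.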
This obviously depends on the type of text generated, the complexity of the language it uses, how good the language needs to be, how robust the software needs to be, and the expertise of the people building the system.
To take an example, assume that the text is informative and is being generated to help someone perform a task or make a decision (as opposed to persuasive or entertaining text, for example); the text is fairly simple linguistically (as most informative task-related texts are); the desired language quality is similar to that of a time-pressured human writer; bugs should affect less than 1% of texts generated; and the developers have experience in NLG. Then an NLG system could probably be created with a few person-years of effort. Most of this time would be spent on understanding language usage, on testing and evaluation, and on integration with input data sources and output text presentation systems; actually programming the NLG modules is only a small part of the job of building an NLG system.
The University of Aberdeen NLG group has worked with companies to develop NLG systems. Excluding small-scale consultancy activities, we usually only get involved in projects which are interesting from a research perspective (and which we can publish research papers about). But we are always happy to chat to people about NLG and building NLG systems. We are considering establishing a company to create NLG systems on a consultancy basis, but we have not done this yet.
I'd like to think that my book is a good starting point (although it is a bit out of date). A few people with backgrounds in AI but not NLG have commented that my paper on knowledge acquisition for NLG, which focuses on understanding how to use language effectively, helped them understand NLG better.
If possible, your best bet is to discuss your interests with someone who knows about NLG. As mentioned above, the University of Aberdeen NLG group is always interested in chatting to people about NLG. We're especially interested in meeting people who would like to be involved in research projects.
SIGGEN (the ACL Special Interest Group in Generation) maintains a useful web site with information about NLG events, resources, and researchers.