This page is part of the Proceedings of Wikimania 2006 (Index of presentations)

Wikipedia meets Natural Language Processing

Authors: Malvina Nissim, Valentina Presutti (Laboratory for Applied Ontology, ISTC-CNR)
Track: Technical Infrastructure
License: GNU Free Documentation License
About the authors
Malvina Nissim graduated in Linguistics from the University of Pisa, Italy, in 1998. She received a Ph.D. in Linguistics from the University of Pavia, Italy, in 2002. From November 2001 until January 2006 she was a postdoctoral research fellow at the Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh, in the U.K. Her interests are in information extraction and knowledge acquisition, computational semantics, and discourse processing, using both statistical and symbolic techniques.
Valentina Presutti graduated in Computer Science from the University of Bologna in 2002. She then received a Ph.D. in Computer Science from the same university. Her research interests involve ontology engineering, ontology-based software engineering, semantic web portals and searching.
This presentation forms part of the following panel:
Adding semantics to MediaWiki allows for semantic relations to be annotated in Wikipedia. Furthermore, a new feature for importing ontologies will be available soon. While this is extremely useful, users might find it difficult to choose which relation to use in a given context, especially if they want to be compliant with a suitable ontology.

Complete freedom in creating new semantic relations can eventually lead to low interoperability and sharing, and expose the semantic capability of Wikipedia to the risk of failure. Without explicit mapping information, the equivalence between relations that have the same sense would be completely ignored, and adding semantic relations purely by manual editing is a very error-prone task. It is also true that the success of Wikipedia is due to its philosophy of 'relaxed constraints', so we need a soft strategy that helps users create and/or choose semantic relations appropriately. We believe Natural Language Processing (NLP) and Wikipedia have a lot to tell each other, also in this respect.
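
To make the mapping problem concrete, the following is a minimal sketch of how differently worded relations might be resolved to one shared relation. The relation names, the CANONICAL_RELATIONS table, and the helper functions are hypothetical illustrations, not part of Semantic MediaWiki or of the extension proposed here.

```python
# Hypothetical table mapping user-entered relation names to a canonical name.
# In practice such a mapping would come from an ontology, not a hand-written dict.
CANONICAL_RELATIONS = {
    "capital of": "capital of",
    "is capital of": "capital of",
    "is the capital of": "capital of",
    "located in": "located in",
    "is located in": "located in",
    "lies in": "located in",
}


def normalise(name):
    """Lower-case a relation name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def canonical_relation(name):
    """Return the canonical relation for a name, or None if it is unmapped."""
    return CANONICAL_RELATIONS.get(normalise(name))


def same_relation(a, b):
    """True only if both names are mapped and share one canonical relation."""
    ca, cb = canonical_relation(a), canonical_relation(b)
    return ca is not None and ca == cb
```

Without such a table, 'is capital of' and 'capital of' would be treated as two unrelated relations. An unmapped name returns None rather than a guess, which fits the 'soft strategy' above: the user can be prompted to choose an existing relation instead of being forced into one.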

NLP, broadly put, is concerned with enabling machines to understand and generate natural language, as spoken/written by humans.

Higher-level NLP systems are aimed at extracting all sorts of information from text, answering questions formulated in natural language, automatically translating documents from one language to another, generating summaries of one or several documents, and so on. Such composite systems are built of modules that tackle specific tasks. For example, in order to extract information from a given document (or set of documents), one needs to start by recognising entities of interest, such as persons or locations. This task is known as Named Entity Recognition. Examples of other useful tasks for wider systems are anaphora resolution (finding all linguistic expressions in a text that refer to the same entity) and word sense disambiguation (distinguishing uses of 'bank' as the financial institution from uses of 'bank' as the river side). Although parsing is crucial for most NLP tasks, we will not discuss it here.

With the increased power of machines and the use of statistical methods, recent years have seen huge progress in several areas of NLP. However, the bottleneck of most NLP systems is the acquisition of lexical and world knowledge. For example, take a system that needs to answer a question like 'What is the role of Bill Clinton's wife in current American politics?'. A system that searches documents to extract such information needs to know that Bill Clinton's wife is Hillary (Rodham) Clinton, that 'junior senator of New York' is a political role, and that New York is in the United States, which can be referred to by the adjective 'American'.
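
The chain of knowledge in this example can be sketched as lookups over a small set of facts. This is a toy illustration only: the FACTS table, the relation labels, and the function names are assumptions made for the sketch, and the entries simply restate the example above rather than coming from any real knowledge base.

```python
# Toy fact store: (subject, relation) -> object.
# The entries restate the example in the text; the relation labels are made up.
FACTS = {
    ("Bill Clinton", "wife"): "Hillary Rodham Clinton",
    ("Hillary Rodham Clinton", "role"): "junior senator of New York",
    ("junior senator of New York", "type"): "political role",
    ("junior senator of New York", "jurisdiction"): "New York",
    ("New York", "located in"): "United States",
    ("American", "adjective for"): "United States",
}


def lookup(subject, relation):
    """Return the object of a fact, or None if the fact is not known."""
    return FACTS.get((subject, relation))


def political_role_of_wife(person, adjective):
    """Answer 'What is the role of <person>'s wife in <adjective> politics?'."""
    wife = lookup(person, "wife")
    role = lookup(wife, "role")
    if role is None or lookup(role, "type") != "political role":
        return None
    # Check that the role belongs to the country the adjective refers to.
    country = lookup(adjective, "adjective for")
    place = lookup(role, "jurisdiction")
    if country is None or lookup(place, "located in") != country:
        return None
    return role
```

Removing any single entry from FACTS makes the function return None, which is the bottleneck described above: each step of the answer depends on a piece of lexical or world knowledge that the system must have acquired beforehand.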

Wikipedia is the largest source of encyclopaedic knowledge currently available in electronic form, and therefore the biggest source of organised information that can be exploited for NLP purposes. Conversely, NLP techniques can be used to semi-automatically enrich Wikipedia from a semantic perspective, as well as to provide means for assisting the creation of Wikipedia articles and links. Specifically, we propose an extension to Semantic MediaWiki that helps users annotate Wikipedia content more easily and more consistently, by exploiting both NLP techniques and reasoning on existing local ontologies.

More generally, by means of several application examples, we will discuss the mutual benefit that NLP and Wikipedia can gain from exploiting one another, and the advantages and limits of such an exchange.
