Project
This project offers a unique opportunity to further develop collaboration between Pakistan and
Germany, to mature linguistic research capacity in Pakistan, and concurrently to develop
much-needed linguistic resources.
The collaboration covers three aspects:
- The teams will organize joint workshops in which researchers from both teams come together
to develop a common understanding of the issues and solutions in core areas relevant for
grammar and semantic resources, including the POS tagset, WordNet and VerbNet structures,
and semantic and multiword issues, related to nouns in particular.
- The teams will work to develop multiple
layers of annotation on a common corpus of Urdu, including POS tags and semantic senses
(i.e., the range of meanings available for a given word).
- The common understanding of the issues and problems involved, together with the annotated
corpus to be developed, will be used to derive additional information and perform further
analyses. These will in turn be used to develop algorithms for automatically annotating a
corpus with POS tags as well as word senses. The resulting multi-layered, automatically
annotated corpus can then be used to identify and extract further information about the
different word classes in order to feed thesauri and databases like WordNet and VerbNet. These
resources can in turn be used to develop reliable strategies for word sense disambiguation
(WSD), that is, strategies which can reliably distinguish between different senses of a word.
A classic example illustrating WSD is the English word bank, which as a noun can mean either
the bank of a river or a financial institution.
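To make the bank example concrete, the following is a minimal sketch of gloss-overlap disambiguation (a simplified Lesk-style heuristic), given here only as an illustration and not as the method the project will adopt. The sense labels, the glosses, the stopword list and the function names are all hypothetical and are not taken from WordNet or from the Urdu resources described here.

```python
# Toy sense inventory for the noun "bank". The glosses are illustrative
# assumptions, not entries copied from WordNet or the Urdu WordNet.
SENSES = {
    "bank/river": "sloping land beside a river or stream of water",
    "bank/finance": "financial institution that accepts deposits and lends money",
}

# Small hand-picked stopword list (an assumption, for illustration only).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "at", "by"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def disambiguate(context, senses=SENSES):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = tokenize(context)
    overlap = {
        sense: len(context_words & tokenize(gloss))
        for sense, gloss in senses.items()
    }
    # Ties are resolved in favour of the sense listed first.
    return max(overlap, key=overlap.get)


if __name__ == "__main__":
    print(disambiguate("she sat on the bank of the river and watched the water"))
    print(disambiguate("he went to the bank to deposit money"))
```

A heuristic of this kind depends entirely on the coverage of the sense inventory and its glosses, which is one reason why resources such as WordNet and a sense-tagged corpus are needed before reliable WSD strategies can be developed.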
In detail, the work will be structured in five steps:
- POS tagset and tagging
  - Analysis of issues
  - POS tagset revision
  - Manually tag 100,000 words
  - Automatically tag 5 million words (to extract words, senses, frames, etc.)
- Urdu WordNet
  - Analysis of issues
  - Add 2000 senses for a total of 5000 senses (3000 senses have already been derived from
    the transliterated Hindi WordNet)
  - Revise and add to existing hierarchical relationships between the senses
- Urdu VerbNet
  - Analysis of issues
  - Automatic Acquisition of Sub-categorization Frames
  - Identifying Sets of Verb Classes
- Sense tagged Corpus
  - Analysis of issues
  - Manual tagging of 100,000 word corpus (for words in WordNet and VerbNet)
  - Algorithm for identifying nouns vs. names
  - Understanding and identifying the semantics underlying N-V combinations
  - Identification of multiwords
  - Automated tagging of words covered by WordNet and VerbNet and names, multiwords and
    N-V combinations identified in parts
- Design and Specification of 2 new linguistic courses
Prof. Hussain currently teaches computational linguistics at the University of Engineering and
Technology (UET) in Lahore, Pakistan, as well as linguistics courses for students of the
doctoral program at the University of Management and Technology (UMT) via a formal agreement
of cooperation between UET and UMT. He also teaches the MPhil students at Kinnaird College
in Lahore, where he is on the Board of Studies for Linguistics. The courses developed as part
of this project will thus flow directly into existing curricula in Pakistan as well as in
Germany, where Prof. Butt will integrate them into the existing MA programs in General
Linguistics and in Speech and Language Processing.