Software & Resources
On this page we are beginning to make available software projects and resources that have been developed as part of course work or project work at the University of Konstanz.
- eXLEpse editor: The primary goal of eXLEpse was to develop an easy-to-use editor for computational grammars and an interface to XLE. The editor replaces Emacs and provides an alternative to the shell-based interaction with the XLE platform. You can download the software and find more information on this project here.
- ParGramBank treebank: This is a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across languages and language families. This output forms the basis of a parallel treebank covering a diverse set of phenomena. ParGramBank can be accessed and downloaded for free via the INESS treebanking environment.
- CP Reference Dependency Bank: When dealing with the languages of South Asia from an NLP perspective, a problem that repeatedly crops up is the treatment of complex predicates. In Ahmed et al. (2012), we present a first approach to the analysis of complex predicates (CPs) in the context of dependency bank development. The efforts originate in theoretical work on CPs done within Lexical Functional Grammar (LFG), but are intended to provide a framework-independent guideline for analyzing different types of CPs. The design of the dependencies is kept parallel to the triples in PARC700 (King et al. 2003) and general enough to account for CP constructions across languages, as sketched below.
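To give a feel for the idea, here is a hypothetical illustration in PARC700-style relation(head, dependent) notation; the Urdu example and its analysis are simplified and not taken from the released dependency bank.

```python
# Urdu: "nadya ne kahani yaad ki" (Nadya ERG story memory do.PST),
# 'Nadya remembered the story'. The noun-verb sequence "yaad kar"
# ('memory do' = 'remember') forms a complex predicate, so it is
# treated as a single predicate rather than as two separate clauses.
triples = [
    ("subj",  "yaad kar~1", "nadya~2"),
    ("obj",   "yaad kar~1", "kahani~3"),
    ("tense", "yaad kar~1", "past"),
]
for relation, head, dependent in triples:
    print(f"{relation}({head}, {dependent})")
```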
Software & Resources (other institutions)
Below, we also list software projects and resources developed at other institutions which we have licensed for the lab. If you are a student affiliated with us and you need access to these resources or tools for research purposes or for writing a thesis, let us know.
- FST is a tool for building finite-state networks that can be used for morphological and phonological analysis. You can download the software and find more information on this tool here. A toy analysis sketch follows this list.
- HUTB: The goal of the Hindi-Urdu Treebank (HUTB) project is to build a multi-representational and multi-layered treebank for Hindi and Urdu. More here.
- GermaNet is a lexical-semantic net that relates German nouns, verbs, and adjectives semantically by grouping lexical units that express the same concept into synsets and by defining semantic relations between these synsets. GermaNet has much in common with the English WordNet and can be viewed as an online thesaurus or a lightweight ontology.
- WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. More here. A small usage sketch follows this list.
- XLE consists of cutting-edge algorithms for parsing and generating Lexical Functional Grammars (LFGs), along with a rich graphical user interface for writing and debugging such grammars. It is the basis for the Parallel Grammar Project, which is developing industrial-strength grammars for English, French, German, Norwegian, Japanese, and Urdu. XLE is written in C and uses Tcl/Tk for its user interface. More here. A rough f-structure sketch follows this list.
- The PARC 700 Dependency Bank consists of 700 sentences that were randomly extracted from section 23 of the UPenn Wall Street Journal treebank, parsed with the broad-coverage English ParGram grammar developed at PARC, and given gold-standard annotations of grammatical dependency relations by manual correction and extension. Average sentence length: 19.8 words; average number of relation triples: 65.4. More here. The triple notation is sketched after this list.
- ARCHER is a multi-genre corpus of British and American English covering the period 1600-1999, first constructed by Douglas Biber and Edward Finegan in the 1990s. It is managed as an ongoing project by a consortium of participants at fourteen universities in seven countries. More here.
- DTA: The German Text Archive (Deutsches Textarchiv) presents online a selection of key German-language works in various disciplines from the 17th to the 19th century. The electronic full texts are linguistically indexed, and the search facilities tolerate a range of spelling variants. You can find the English description of the DTA here.
- GerManC: The ultimate aim of the project is to compile a representative historical corpus of written German for the years 1650-1800. This is a crucial period in the development of the language, as the modern standard was formed during it and competing regional norms were finally eliminated. A central aim of the project is to provide a basis for comparative studies of the development of the grammar and vocabulary of English and German and of the way in which they were standardized. More here.
- TextGrid Bibliothek: The Digital Library at zeno.org is an extensive collection of German texts in digital form, ranging from the beginnings of printing to the first decades of the 20th century. The collection is of particular interest to German literary studies, as it contains virtually all the important texts of the canon, as well as numerous other texts relevant to literary history whose copyright has expired. The same applies to philosophy and cultural studies as a whole. For the most part, the texts are taken from citable scholarly editions; the remaining texts stem predominantly from digitised first editions.
- Tüba-D/Z, the Tübingen Treebank of Written German, is a syntactically annotated newspaper corpus based on data from the daily newspaper "die tageszeitung". The syntactic annotation was performed manually. The treebank is expected to grow with every new release.
- ICE-GB is the British component of the International Corpus of English (ICE) and is fully grammatically analysed. Like all ICE corpora, it consists of a million words of spoken and written English and adheres to the common corpus design: 200 written and 300 spoken texts make up the million words. Every text is grammatically annotated, permitting complex and detailed searches across the whole corpus.
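For FST, here is a minimal Python sketch of the idea behind finite-state morphological analysis. The tiny hand-built network and its symbols are invented for illustration; the real tool compiles much larger networks from lexicons and rules.

```python
# Transitions: (state, surface_symbol) -> [(next_state, lexical_symbols), ...]
# This toy network analyzes "cat" as "cat+N+Sg" and "cats" as "cat+N+Pl".
TRANSITIONS = {
    (0, "c"): [(1, "c")],
    (1, "a"): [(2, "a")],
    (2, "t"): [(3, "t")],
    (3, ""):  [(4, "+N+Sg")],   # epsilon input: emit the singular tag
    (3, "s"): [(4, "+N+Pl")],   # plural suffix maps to the plural tag
}
FINAL_STATES = {4}

def analyze(surface: str, state: int = 0, output: str = "") -> list[str]:
    """Return all lexical analyses of a surface form (depth-first search)."""
    results = []
    if not surface and state in FINAL_STATES:
        results.append(output)
    # Epsilon transitions (consume no surface symbol).
    for next_state, lexical in TRANSITIONS.get((state, ""), []):
        results.extend(analyze(surface, next_state, output + lexical))
    # Ordinary transitions (consume one surface symbol).
    if surface:
        for next_state, lexical in TRANSITIONS.get((state, surface[0]), []):
            results.extend(analyze(surface[1:], next_state, output + lexical))
    return results

print(analyze("cat"))   # ['cat+N+Sg']
print(analyze("cats"))  # ['cat+N+Pl']
```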
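For WordNet, a small usage sketch, assuming the NLTK interface with the WordNet data downloaded (nltk.download('wordnet')):

```python
from nltk.corpus import wordnet as wn

# Each synset groups lemmas ("cognitive synonyms") expressing one concept.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Synsets are interlinked by conceptual-semantic relations such as hypernymy.
dog = wn.synset("dog.n.01")
print(dog.hypernyms())  # [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
```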
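For XLE, a rough sketch of the functional structures (f-structures) that LFG grammars produce, written here as a nested Python attribute-value structure. The analysis of "Nadya sleeps" follows standard LFG textbook practice; XLE's actual output is considerably richer (c-structures, packed ambiguity).

```python
# F-structure for "Nadya sleeps": the verb's PRED subcategorizes for a
# SUBJ, whose features are embedded as a nested attribute-value structure.
f_structure = {
    "PRED": "sleep<SUBJ>",
    "TENSE": "present",
    "SUBJ": {
        "PRED": "Nadya",
        "NUM": "sg",
        "PERS": "3",
    },
}
print(f_structure["SUBJ"]["PRED"])  # Nadya
```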
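For the PARC 700 Dependency Bank, a schematic sketch of the triple notation, assuming the relation(head~id, dependent~id) format described in King et al. (2003); the sentence and its triples are invented for illustration.

```python
import re

triples = """
subj(resign~1, Vinken~2)
tense(resign~1, past)
num(Vinken~2, sg)
"""

# Parse each triple into a (relation, head, dependent) tuple.
pattern = re.compile(r"(\w+)\(([^,]+),\s*([^)]+)\)")
parsed = [match.groups() for match in pattern.finditer(triples)]
print(parsed)
# [('subj', 'resign~1', 'Vinken~2'), ('tense', 'resign~1', 'past'),
#  ('num', 'Vinken~2', 'sg')]
```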