dc.creator Pedersen, Bolette Sandford
dc.creator Søgaard, Anders
dc.creator Johannsen, Anders
dc.creator Braasch, Anna
dc.creator Olsen, Sussi
dc.creator Nimb, Sanni
dc.creator Sørensen, Nicolai Hartvig
dc.creator Alonso, Héctor Martínez
dc.date.accessioned 2019-06-26T15:01:28Z
dc.date.available 2019-06-26T15:01:28Z
dc.date.issued 2015
dc.identifier.uri http://hdl.handle.net/20.500.12115/38
dc.description The SemDaX Corpus is a Danish human-annotated corpus that relies on two combined resources, the wordnet DanNet and the dictionary Den Danske Ordbog, and is available through a CLARIN academic license. The corpus includes approx. 90,000 words, comprises six textual domains, and is annotated with sense inventories of different granularity. All nouns, verbs and adjectives in the corpus were annotated with supersenses (all-words task). Furthermore, 20 highly polysemous nouns were annotated both with the full set of senses from Den Danske Ordbog and with a reduced set of clustered senses (lexical sample task). The aim of the corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish, measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms, with the practical purpose of developing sense taggers for Danish. To these ends, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than is normally seen: 60% of the material for the all-words task and 100% for the lexical sample task. The corpus includes not only the curated files but also the diverging annotations. In other words, we do not consider all disagreement to be noise; rather, it contains valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.
dc.language.iso dan
dc.publisher Centre for Language Technology, NorS, University of Copenhagen
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2016/pdf/306_Paper.pdf
dc.rights CLARIN-ACA-NC
dc.rights.uri https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&NORED=1
dc.rights.label ACA
dc.source.uri https://cst.ku.dk/english/projekter/projekter-afsluttet/semantikprojekt/corpus/
dc.subject semantics
dc.subject supersenses
dc.subject word sense annotations
dc.title The SemDaX Corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN-DK
contact.person Administrator; CLARIN-DK; info@clarin.dk; Centre for Language Technology, NorS, University of Copenhagen
contact.person Bolette Sandford; Pedersen; bspedersen@hum.ku.dk; Centre for Language Technology, NorS, University of Copenhagen
sponsor Forskningsrådet for Kultur og Kommunikation; DFF-1319-00123; Semantic Processing across Domains; nationalFunds;
size.info 90000; words
size.info 673; files
files.size 7281529
files.count 4
annotationInfo.annotationType sense annotation


 Files in this item

 All files in item: 6.94 MB
This item is for Academic Use and is licensed under CLARIN-ACA-NC (Attribution Required, Noncommercial).

Name: lexicalsample.zip
Size: 6.38 MB
Format: application/zip
Description: Lexical sample annotations
MD5: c4dd63180cd72d5225b3190ecb65db58

Name: SemDax-supersenses.zip
Size: 572.27 KB
Format: application/zip
Description: All words supersense annotations
MD5: 92345cdd96051473e146d3018323bd91

Name: LICENSE
Size: 1.33 KB
Format: Unknown
Description: License
MD5: 85b100e5d075024f48089b7a4eb34a51

Name: README.md
Size: 2.38 KB
Format: Unknown
Description: Readme
MD5: 07aa20d002dbea3e7adb89e51aa21430
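
The MD5 values in the file listing of this record can be used to check that a download arrived intact. The following is a minimal sketch in Python, not part of the record itself: the checksums are copied from the listing, while the function names `md5_of` and `verify_downloads` and the directory name `downloads` are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "lexicalsample.zip": "c4dd63180cd72d5225b3190ecb65db58",
    "SemDax-supersenses.zip": "92345cdd96051473e146d3018323bd91",
    "LICENSE": "85b100e5d075024f48089b7a4eb34a51",
    "README.md": "07aa20d002dbea3e7adb89e51aa21430",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # "downloads" is a hypothetical directory holding the downloaded files.
    for name, status in verify_downloads("downloads").items():
        print(f"{name}: {status}")
```

A status of `mismatch` suggests a corrupted or incomplete download, in which case the file should be fetched again. MD5 serves here only as an integrity check against transfer errors, not as protection against deliberate tampering.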
