The SemDaX Corpus

Pedersen, Bolette Sandford; Søgaard, Anders; Johannsen, Anders; Braasch, Anna; Olsen, Sussi; Nimb, Sanni; Sørensen, Nicolai Hartvig; Alonso, Héctor Martínez

dc.creator	Pedersen, Bolette Sandford
dc.creator	Søgaard, Anders
dc.creator	Johannsen, Anders
dc.creator	Braasch, Anna
dc.creator	Olsen, Sussi
dc.creator	Nimb, Sanni
dc.creator	Sørensen, Nicolai Hartvig
dc.creator	Alonso, Héctor Martínez
dc.date.accessioned	2019-06-26T15:01:28Z
dc.date.available	2019-06-26T15:01:28Z
dc.date.issued	2015
dc.identifier.uri	http://hdl.handle.net/20.500.12115/38
dc.description	The SemDaX Corpus is a Danish human-annotated corpus that relies on the combined wordnet and dictionary resources DanNet and Den Danske Ordbog, and is available through a CLARIN academic license. The corpus contains approximately 90,000 words, covers six textual domains, and is annotated with sense inventories of different granularity. All nouns, verbs and adjectives in the corpus were annotated with supersenses (all-words task). In addition, 20 highly polysemous nouns were annotated both with the full set of senses from Den Danske Ordbog and with a reduced set of clustered senses (lexical sample task). The aim of the corpus is twofold: i) to assess the reliability of the different sense annotation schemes for Danish, as measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms, with the practical purpose of developing sense taggers for Danish. To these ends, we take a new approach to human-annotated corpus resources by double-annotating a much larger part of the corpus than is usual: 60% of the material for the all-words task and 100% for the lexical sample task. The corpus includes not only the curated files but also the diverging annotations. In other words, we do not treat all disagreement as noise; rather, we consider it to contain valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.
dc.language.iso	dan
dc.publisher	Centre for Language Technology, NorS, University of Copenhagen
dc.relation.isreferencedby	http://www.lrec-conf.org/proceedings/lrec2016/pdf/306_Paper.pdf
dc.rights	CLARIN-ACA-NC
dc.rights.uri	https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/ClarinEulaAca?ID=1&AFFIL=EDU&BY=1&NC=1&NORED=1
dc.rights.label	ACA
dc.source.uri	https://cst.ku.dk/english/projekter/projekter-afsluttet/semantikprojekt/corpus/
dc.subject	semantics
dc.subject	supersenses
dc.subject	word sense annotations
dc.title	The SemDaX Corpus
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN-DK
contact.person	Administrator; CLARIN-DK; info@clarin.dk; Centre for Language Technology, NorS, University of Copenhagen
contact.person	Bolette Sandford; Pedersen; bspedersen@hum.ku.dk; Centre for Language Technology, NorS, University of Copenhagen
sponsor	Forskningsrådet for Kultur og Kommunikation; DFF-1319-00123; Semantic Processing across Domains; nationalFunds;
size.info	90000; words
size.info	673; files
files.size	7281529
files.count	4
annotationInfo.annotationType	sense annotation