About metadata

This page describes the metadata we require and how we disseminate them. Metadata are freely accessible and are distributed in the public domain (under CC0). However, we reserve the right to be informed about commercial use of metadata from the CLARIN-DK repository; please send a description of your use case to the Help Desk.



Metadata formats

During the submission process, users fill out metadata fields, which are stored as part of the record. We can disseminate the submission metadata in various formats, including but not limited to CMDI and oai_dc. See the full list of supported formats, but note that some formats might not be applicable to all items. Offering several formats helps us promote the submitted content in a number of aggregators and search engines.

CMDI

For more information about this topic, see the CLARIN introduction to component metadata.

Current submissions adhere to the clarin.eu:cr1:p_1403526079380 profile/schema.

However, we also support submissions with arbitrary CMDI metadata files; these files are served through OAI-PMH when the CMDI metadata format is requested.

oai_dc

oai_dc is the format that every OAI-PMH repository is required to support. See the Metadata mapping section to understand how we map our submission fields to this format.
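As a rough illustration of how a harvester could ask for these formats, the sketch below builds OAI-PMH request URLs from the standard protocol arguments (verb, identifier, metadataPrefix). The base URL and the record identifier are placeholders, not the repository's actual endpoint; the metadata prefixes that are really offered should be read from the ListMetadataFormats response.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the repository's real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai/request"


def oai_request_url(verb, base_url=BASE_URL, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return f"{base_url}?{urlencode(query)}"


# Ask which metadata formats the repository can disseminate.
formats_url = oai_request_url("ListMetadataFormats")

# Request a single record as unqualified Dublin Core (oai_dc).
# The identifier is a made-up placeholder.
record_url = oai_request_url(
    "GetRecord",
    identifier="oai:repository.example.org:123456789/1",
    metadataPrefix="oai_dc",
)

print(formats_url)
print(record_url)
```

The same helper can be reused for other prefixes by changing the metadataPrefix argument to one of the values reported by ListMetadataFormats.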


Submitted metadata

The following list enumerates the fields we ask for in the submission workflow (the list is subject to occasional changes). The metadata are submitted in English. There are subtle differences depending on the type of the resource being submitted. Not all fields are present in all formats. Some fields are generated automatically (e.g. human-readable language names accompanying the ISO codes, identifiers, other dates).

| Field name | Description | Status |
| --- | --- | --- |
| Type | Type of the resource: "Corpus" refers to text, speech and multimodal corpora. "Lexical Conceptual Resource" includes lexica, ontologies, dictionaries, word lists etc. "Language Description" covers language models and grammars. "Technology / Tool / Service" is used for tools, systems, system components etc. | required |
| Title | The main title of the item. | required |
| Project URL | URL of a resource/project related to the submitted item (e.g. project webpage). Must start with http/https. | regexp controlled |
| Demo URL | URL of a demonstration, samples or, in the case of tools, sample output. Must start with http/https. | regexp controlled |
| Referenced by | Link to the original paper that references this dataset. | regexp controlled |
| Date issued | The date when the submission data were issued, if any (e.g. 2014-01-21), or at least the year. | required |
| Creators | Full names of the creators of the item. In the case of collections (e.g. corpora or other large databases of text) you usually want to provide the names of the people involved in compiling the collection, not the individual authors. A person's name is stored as surname, comma, any other names (e.g. "Smith, John Jr."). | required, multiple |
| Publisher | Name of the organization/entity which published any previous instance of the item, or your home institution. | required, multiple |
| Contact person | Person to contact in case of issues with the submission: someone able to provide information about the resource, e.g. one of the creators or the submitter. Stored as a structured string containing given name, surname, email and home organization. | required, multiple |
| Funding | Sponsors and funding that supported the work described by the submission. Stored as a structured string containing project name, project code, the funding organization, the type of funds (national/eu/other) and the OpenAIRE identifier (which is also stored in dc.relation). Note that you have to fill out either all or none of the fields. | multiple |
| Description | Textual description of the submission. | required |
| Language | The language(s) of the main content of the item. Stored as ISO 639-3 code. Required for corpora, lexical conceptual resources and language descriptions. | multiple, type-bind required |
| Subject Keywords | Keywords or phrases related to the subject of the item. | multiple, required |
| Size | Extent of the submitted data, e.g. the number of tokens or the number of files. | multiple |
| Media type | Media type of the main content of the item, e.g. text or audio. Required for corpora, language descriptions and lexical conceptual resources. | dropdown selection, type-bind required |
| Detailed type | Further classification of the resource type. Required for tools, language descriptions and lexical conceptual resources. | dropdown selection, type-bind required |
| Language Dependent | Boolean value indicating whether the described tool/service is language dependent or not. Required for tools. | type-bind required |
| Annotation type | The type of annotation of the corpus, e.g. tokenization, POS-tagging, lemmatization, chunking, segmentation, parsing, semantic role labelling, termhood scoring. | multiple |
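
The requirements in the table can be checked before submitting. The following is only a minimal sketch, not the repository's actual validation: the dictionary layout, the function name `validate`, and both regular expressions are assumptions made for illustration, and all example values are fictional. Only the rules stated in the table are encoded (always-required fields, type-bound required fields, the http/https prefix for Project URL and Demo URL, and the "surname, other names" form for creators).

```python
import re

# Assumption: a URL field is acceptable if it starts with http:// or https://.
URL_PATTERN = re.compile(r"^https?://\S+$")
# Assumption: "surname comma any other name", e.g. "Smith, John Jr."
NAME_PATTERN = re.compile(r"^[^,]+,\s*\S.*$")

CORPUS = "Corpus"
LEXICAL = "Lexical Conceptual Resource"
LANG_DESC = "Language Description"
TOOL = "Technology / Tool / Service"

ALWAYS_REQUIRED = ["Type", "Title", "Date issued", "Creators", "Publisher",
                   "Contact person", "Description", "Subject Keywords"]

# Fields that are required only for certain resource types (type-bind).
TYPE_BOUND = {
    "Language": {CORPUS, LEXICAL, LANG_DESC},
    "Media type": {CORPUS, LEXICAL, LANG_DESC},
    "Detailed type": {TOOL, LEXICAL, LANG_DESC},
    "Language Dependent": {TOOL},
}

# Only the two fields whose http/https rule is stated in the table.
URL_FIELDS = ["Project URL", "Demo URL"]


def validate(submission):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    resource_type = submission.get("Type")
    required = list(ALWAYS_REQUIRED)
    required += [f for f, types in TYPE_BOUND.items() if resource_type in types]
    for field in required:
        # False is a legitimate value (Language Dependent is a boolean).
        if submission.get(field) in (None, "", []):
            problems.append(f"missing required field: {field}")
    for field in URL_FIELDS:
        value = submission.get(field)
        if value and not URL_PATTERN.match(value):
            problems.append(f"{field} must start with http:// or https://")
    for name in submission.get("Creators") or []:
        if not NAME_PATTERN.match(name):
            problems.append(f"creator not in 'Surname, Other names' form: {name}")
    return problems


# Fictional example: a corpus without Language and Media type,
# with a malformed URL and one creator name in the wrong order.
example = {
    "Type": CORPUS,
    "Title": "Example corpus",
    "Date issued": "2014-01-21",
    "Creators": ["Smith, John Jr.", "Jane Doe"],
    "Publisher": ["Example University"],
    "Contact person": ["placeholder contact"],  # real structured format not shown here
    "Description": "A small illustrative corpus.",
    "Subject Keywords": ["example"],
    "Project URL": "www.example.org",
}
problems = validate(example)
for problem in problems:
    print(problem)
```

Keeping the type-bound rules in a single dictionary means that a change in the submission workflow only requires editing that table, not the checking logic.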

Metadata mapping

The following tables contain the mapping from submission fields to oai_dc; they also list some of the important automatically generated fields.

| Submission field | Mapped DC metadata field |
| --- | --- |
| Type | dc.type |
| Title | dc.title |
| Project URL | dc.source |
| Demo URL | not mapped |
| Date Issued | dc.date |
| Creator | dc.creator |
| Publisher | dc.publisher |
| Contact person | not mapped |
| Funding | not mapped |
| Description | dc.description |
| Language | dc.language |
| Subject Keywords | dc.subject |
| Size | not mapped |
| Media Type | not mapped |
| Detailed Type | not mapped |

| Generated DC metadata field | Description |
| --- | --- |
| dc.identifier | PID (currently a handle) of the resource. |
| dc.rights | Multi-valued field; can contain the name of the license under which the resource is distributed, the URL of the full text of the license, and the so-called label (PUB, ACA, RES). |
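
To make the mapping concrete, here is a small sketch that applies the submission-to-DC table to a dictionary of submission values. It is an illustration only: the function name `to_dc`, the dictionary layout, and the example handle are assumptions, and the repository's own crosswalk may behave differently (for instance in how dc.rights is generated, which is not modelled here).

```python
# Submission field -> DC field, copied from the mapping table.
# Fields listed as "not mapped" are simply left out.
SUBMISSION_TO_DC = {
    "Type": "dc.type",
    "Title": "dc.title",
    "Project URL": "dc.source",
    "Date Issued": "dc.date",
    "Creator": "dc.creator",
    "Publisher": "dc.publisher",
    "Description": "dc.description",
    "Language": "dc.language",
    "Subject Keywords": "dc.subject",
}


def to_dc(submission, handle=None):
    """Convert a submission dict into a dict of DC field -> list of values."""
    record = {}
    for field, value in submission.items():
        dc_field = SUBMISSION_TO_DC.get(field)
        if dc_field is None:
            continue  # not mapped (e.g. Demo URL, Size)
        values = value if isinstance(value, list) else [value]
        record.setdefault(dc_field, []).extend(values)
    if handle is not None:
        # dc.identifier is generated, not taken from the submission form.
        record["dc.identifier"] = [handle]
    return record


# Fictional example values; the handle is a placeholder.
dc_record = to_dc(
    {
        "Title": "Example corpus",
        "Creator": ["Smith, John Jr."],
        "Language": ["dan", "eng"],
        "Size": "10 files",
    },
    handle="http://hdl.handle.net/00000/example",
)
print(dc_record)
```

Note that Size is silently dropped in this sketch because the table lists it as not mapped.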

The following table contains the CLARIN VLO facets (concepts) mapping to DSpace metadata:

| VLO facet | Mapped DSpace metadata field |
| --- | --- |
| Organisation | dc.publisher |
| Description | dc.description |
| Availability | dc.rights.uri |
| Language | dc.language.iso |
| License | dc.rights.uri |
| Name | dc.title |
| National project | internal mapping |
| Country | not mapped |
| Modality | not mapped |
| Keywords | not mapped |
| Subject | not mapped |
| Format | not mapped |
| Genre | not mapped |
| Resource type | dc.type |
| temporalCoverage | not mapped |
| projectName | not mapped |

Mappings are enabled by linked concepts (CLARIN CCR) in the CMDI metadata profile (and additionally by configured XPath mappings). See the full description of all VLO facet mappings.