
 
dc.creator Jongejan, Bart
dc.date.accessioned 2021-05-21T14:36:50Z
dc.date.available 2021-05-21T14:36:50Z
dc.date.issued 2021-05-21
dc.identifier.uri http://hdl.handle.net/20.500.12115/45
dc.description CSTlemma is a lemmatizer that treats prefixes, infixes, and suffixes alike. It can be (and already is) trained for tens of languages, including ones whose lemmatization rules add or remove prefixes and/or infixes to obtain the lemma of a word. In Dutch, for example, the word "afgemaakt" has the lemma "afmaken", so the "ge" has to be removed, one "a" of the "aa" has to be dropped, and the "t" ending must be replaced by "en". New in version 8 of CSTlemma is the option to output the rule by which a given word is transformed into its lemma. It is also possible to output just a unique identifier for that rule; in practice, this identifier is a kind of pointer into the data structure that comprises the rule set. Rules for CSTlemma must be created with the affixtrain program (https://github.com/kuhumcst/affixtrain), but ready-made rules can be obtained online. For example, the https://github.com/kuhumcst/texton-linguistic-resources repository contains rules for about 30 languages. To build CSTlemma, you need not only the source code in https://github.com/kuhumcst/cstlemma but also some source files from https://github.com/kuhumcst/letterfunc and https://github.com/kuhumcst/parsesgml. The easiest way to proceed is to copy https://github.com/kuhumcst/cstlemma/blob/master/doc/makecstlemma.bash to a folder on a Linux machine (or possibly a Mac) and run that script, which fetches all needed repositories and builds cstlemma.
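The Dutch example in the description can be illustrated with a minimal sketch of an affix rule that rewrites material at several positions in a word. The rule representation below (a regular expression paired with a replacement template), the `RULES` list, and the `lemmatize` function are assumptions made for illustration only; they are not CSTlemma's rule file format or its API, and real rules are produced by affixtrain.

```python
import re

# Hypothetical rule representation: (pattern, replacement template).
# This is an illustration only, NOT the rule format used by CSTlemma.
# The pattern drops an internal "ge" and rewrites the ending "aakt" to "aken",
# while the wildcard groups carry over whatever precedes and follows "ge".
RULES = [
    (re.compile(r"(.*)ge(.*)aakt"), r"\g<1>\g<2>aken"),
]


def lemmatize(word, rules=RULES):
    """Return the lemma from the first rule that matches the whole word.

    Words that no rule matches are returned unchanged.
    """
    for pattern, template in rules:
        match = pattern.fullmatch(word)
        if match:
            return match.expand(template)
    return word


print(lemmatize("afgemaakt"))  # prints "afmaken"
```

The same hypothetical rule also maps "gemaakt" to "maken", because the part before "ge" is allowed to be empty. A trained rule set would contain many such rules, ordered or structured so that the most specific one applies.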
dc.publisher Centre for Language Technology, NorS, University of Copenhagen
dc.relation.isreferencedby https://www.aclweb.org/anthology/P09-1017/
dc.rights GNU General Public Licence, version 3
dc.rights.uri http://opensource.org/licenses/GPL-3.0
dc.rights.label PUB
dc.source.uri https://github.com/kuhumcst/cstlemma
dc.subject lemmatiser
dc.subject lemmatizer
dc.title CSTlemma version 8.1.2
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent false
hidden false
hasMetadata false
has.files yes
branding CLARIN-DK
demo.uri https://cst.dk/tools/index.php?lang=en
contact.person Bart; Jongejan; bartj@hum.ku.dk; Centre for Language Technology, NorS, University of Copenhagen
files.size 167405
files.count 1


 Files in this item

This item is Publicly Available and licensed under: GNU General Public Licence, version 3
Name cstlemma-8.1.2.tar.gz
Size 163.48 KB
Format application/gzip
Description Source code & Makefile
MD5 627b300945873cdf284b8adece6e3555
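
The MD5 checksum and the byte count (files.size) listed in this record can be used to check a downloaded copy of the archive. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_record` are hypothetical, and the two constants are copied from this record.

```python
import hashlib
import os
import sys

# Values copied from this item record.
EXPECTED_MD5 = "627b300945873cdf284b8adece6e3555"
EXPECTED_SIZE = 167405  # bytes, from files.size


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file has both the expected size and the expected MD5."""
    return (os.path.getsize(path) == expected_size
            and md5_of_file(path) == expected_md5)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (assumed file name): python check_archive.py cstlemma-8.1.2.tar.gz
    print("OK" if matches_record(sys.argv[1]) else "MISMATCH")
```

MD5 is adequate for detecting a corrupted or truncated download, but it does not protect against deliberate tampering.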
 File Preview  
  • cstlemma-8.1.2
    • README.md (4 kB)
    • Changelog (15 kB)
    • COPYING (17 kB)
    • doc
      • makecstlemma.bash (840 B)
      • cstlemma.rtf (499 kB)
    • src
      • functiontree.cpp (4 kB)
      • functiontree.h (2 kB)
      • flattext.h (2 kB)
      • lemmtags.cpp (4 kB)
      • applyrules.h (1 kB)
      • readlemm.cpp (8 kB)
      • argopt.h (906 B)
      • lemmatiser.h (2 kB)
      • dictionary.h (1 kB)
      • word.h (14 kB)
      • tags.h (1 kB)
      • lem.h (1 kB)
      • wordReader.cpp (9 kB)
      • text.cpp (23 kB)
      • flex.cpp (14 kB)
      • caseconv.h (2 kB)
      • cstlemma.cpp (2 kB)
      • applyrules.cpp (43 kB)
      • readfreq.h (1 kB)
      • Makefile (2 kB)
      • function.cpp (1 kB)
      • makesuffixflex.cpp (33 kB)
      • freqfile.h (1 kB)
      • functio.h (4 kB)
      • fieldfnc.h (995 B)
      • wordReader.h (2 kB)
      • lemmatise.cpp (9 kB)
      • readfreq.cpp (11 kB)
      • lemmtags.h (1019 B)
      • field.h (4 kB)
      • makedict.cpp (36 kB)
      • comparison.h (1 kB)
      • outputclass.cpp (6 kB)
      • flex.h (11 kB)
      • makedict.h (1 kB)
      • lemmatiser.cpp (37 kB)
      • lext.cpp (1 kB)
      • XMLtext.h (4 kB)
      • readlemm.h (1 kB)
      • makesuffixflex.h (1 kB)
      • tags.cpp (4 kB)
      • word.cpp (19 kB)
      • basefrmpntr.cpp (9 kB)
      • caseconv.cpp (39 kB)
      • option.h (5 kB)
      • basefrm.cpp (10 kB)
      • argopt.cpp (3 kB)
      • lext.h (1 kB)
      • basefrmpntr.h (3 kB)
      • flattext.cpp (16 kB)
      • lemmatise.h (898 B)
      • dictionary.cpp (12 kB)
      • defines.h (3 kB)
      • outputclass.h (1 kB)
      • freqfile.cpp (1 kB)
      • basefrm.h (5 kB)
      • text.h (4 kB)
      • option.cpp (48 kB)
      • field.cpp (5 kB)
      • XMLtext.cpp (21 kB)
    • pax_global_header (52 B)
