
 
dc.creator Van der Sluis, Frans
dc.date.accessioned 2023-09-26T10:02:54Z
dc.date.available 2023-09-26T10:02:54Z
dc.date.issued 2023-09-26
dc.identifier.uri http://hdl.handle.net/20.500.12115/49
dc.description The ensiwiki dataset contains Wikipedia pages sampled from Simple-English and regular English Wikipedia. For each Simple-English page, a paired page was sampled from the regular English Wikipedia if available. The result is a list of pairs of Simple-English and regular English pages. Only pages that form a pair were included. In total, 138,790 pages were sampled from Simple-English Wikipedia and English Wikipedia from August 2011. The purpose of this dataset is to train and test readability detection systems. The dataset is intended to be sufficiently large to detect intricate relations between different features of readability. The dataset is used for this purpose in Van der Sluis (2013, 2014) and is described in further detail in Van der Sluis (2013). The dataset furthermore contains plain-text versions of the wikitext pages. These were parsed using the JWPL MediaWiki parser (see https://dkpro.github.io/dkpro-jwpl/JWPLParser/) and split to the level of articles, sections, and paragraphs. Only the oldest 38,955 wikitext pages were parsed, in order to arrive at a more mature set of pages that more clearly distinguishes between different levels of readability, which proved superior for training readability models. Note: This data is a result of work done at the Human-Media Interaction group of the University of Twente, The Netherlands. Its release is in accordance with the original licensing requirements and aligned with the relevant parties.
dc.language.iso eng
dc.publisher University of Copenhagen
dc.relation.isreferencedby https://doi.org/10.1002/asi.23095
dc.relation.isreferencedby https://doi.org/10.3990/1.9789036505673
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.subject readability
dc.subject textual complexity
dc.subject wikipedia
dc.subject simple english
dc.title ensiwiki-2011 dataset for readability modelling
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN-DK
contact.person Frans; Van der Sluis; frans@hum.ku.dk; University of Copenhagen
sponsor 7th Framework ICT Programme of the European Union.; FP7-ICT-2007-3; PuppyIR; euFunds;
size.info 138790; articles
files.size 961669085
files.count 2
annotationInfo.annotationType readability labels: simple (english) or (regular) english


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) (Attribution Required, Share Alike)
Name readme.md
Size 3.59 KB
Format Unknown
Description Readme
MD5 1f4a663fcf415e9cf2089d255f9124b1

Name ensiwiki2011.db.tar.gz
Size 917.12 MB
Format application/gzip
Description Tar+gzipped SQLite3 database file containing all data and metadata
MD5 664ccbe0aed88212a6863d29338a0632
File preview ensiwiki2011.db (2 GB)
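
Since the data ships as a tar+gzipped SQLite3 database with a published MD5 checksum, a downloaded copy can be checked and opened roughly as follows. This is a minimal sketch, not an official loader: the local path `ensiwiki2011.db.tar.gz`, the output directory `ensiwiki2011`, and the helper names `md5_of`, `extract_archive`, and `list_tables` are assumptions made for illustration. The table layout is not described in this record (it may be covered in readme.md), so the sketch only lists whatever tables the database contains.

```python
import hashlib
import sqlite3
import tarfile
from contextlib import closing
from pathlib import Path

# Assumption: the archive was downloaded to the current directory.
ARCHIVE = Path("ensiwiki2011.db.tar.gz")
# MD5 checksum as listed in this record for ensiwiki2011.db.tar.gz.
EXPECTED_MD5 = "664ccbe0aed88212a6863d29338a0632"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_archive(archive, target_dir):
    """Extract a .tar.gz archive and return the extracted regular files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target, **kwargs)
    return sorted(p for p in target.rglob("*") if p.is_file())


def list_tables(db_path):
    """Return the table names in a SQLite database, sorted by name."""
    with closing(sqlite3.connect(str(db_path))) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def main():
    if not ARCHIVE.exists():
        print(f"{ARCHIVE} not found; download it from the record first.")
        return
    if md5_of(ARCHIVE) != EXPECTED_MD5:
        raise SystemExit("MD5 mismatch: the download may be corrupted.")
    # Extraction needs roughly 2 GB of free disk space for the database.
    for path in extract_archive(ARCHIVE, "ensiwiki2011"):
        print(path, list_tables(path))


if __name__ == "__main__":
    main()
```

Verifying the checksum before extracting avoids unpacking a truncated download, and listing the tables first gives a starting point for queries without assuming any schema.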
