ensiwiki-2011 dataset for readability modelling

Van der Sluis, Frans

dc.creator	Van der Sluis, Frans
dc.date.accessioned	2023-09-26T10:02:54Z
dc.date.available	2023-09-26T10:02:54Z
dc.date.issued	2023-09-26
dc.identifier.uri	http://hdl.handle.net/20.500.12115/49
dc.description	The ensiwiki dataset contains Wikipedia pages sampled from Simple-English and regular English Wikipedia. For each Simple-English page, a paired page was sampled from the regular English Wikipedia if available. The result is a list of paired Simple-English and regular English pages. Only pages that form a pair were included. In total, 138,790 pages were sampled from Simple-English Wikipedia and English Wikipedia from August 2011. The purpose of this dataset is to train and test readability detection systems. The dataset is intended to be sufficiently large to detect intricate relations between different features of readability. The dataset is used for this purpose in Van der Sluis (2013, 2014) and is described in further detail in Van der Sluis (2013). The dataset furthermore contains plain-text versions of the wikitext pages. These were parsed using the JWPL MediaWiki parser (see https://dkpro.github.io/dkpro-jwpl/JWPLParser/) and split to the level of articles, sections, and paragraphs. Only the oldest 38,955 wikitext pages were parsed, in order to arrive at a more mature set of pages that more clearly distinguishes between different levels of readability, which proved superior for training readability models. Note: This data is a result of work done at the Human-Media Interaction group of the University of Twente, The Netherlands. Its release is in accordance with the original licensing requirements and has been aligned with the relevant parties.
dc.language.iso	eng
dc.publisher	University of Copenhagen
dc.relation.isreferencedby	https://doi.org/10.1002/asi.23095
dc.relation.isreferencedby	https://doi.org/10.3990/1.9789036505673
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.subject	readability
dc.subject	textual complexity
dc.subject	wikipedia
dc.subject	simple english
dc.title	ensiwiki-2011 dataset for readability modelling
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN-DK
contact.person	Frans; Van der Sluis; frans@hum.ku.dk; University of Copenhagen
sponsor	7th Framework ICT Programme of the European Union.; FP7-ICT-2007-3; PuppyIR; euFunds;
size.info	138790; articles
files.size	961669085
files.count	2
annotationInfo.annotationType	readability labels: simple (english) or (regular) english

Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: readme.md
Size: 3.59 KB
Format: Unknown
Description: Readme
MD5: 1f4a663fcf415e9cf2089d255f9124b1

Name: ensiwiki2011.db.tar.gz
Size: 917.12 MB
Format: application/gzip
Description: Tar+gzipped SQLite3 database file containing all data and metadata
MD5: 664ccbe0aed88212a6863d29338a0632
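
The listing above gives an MD5 checksum for the archive and describes it as a tar+gzipped SQLite3 database. A minimal sketch of how a download might be verified, unpacked, and inspected follows. It assumes the archive has been saved locally as ensiwiki2011.db.tar.gz; the helper names (md5_of, extract_archive, list_tables) and the output directory are illustrative and not part of the dataset. Because the database schema is not documented in this record, the sketch only lists table names rather than querying specific tables; readme.md should be consulted for the actual layout.

```python
import hashlib
import sqlite3
import tarfile
from pathlib import Path

# MD5 checksum as published in the file listing of this record.
EXPECTED_MD5 = "664ccbe0aed88212a6863d29338a0632"


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_archive(archive, out_dir):
    """Unpack the gzipped tar archive and return any *.db files found in it."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
    return sorted(Path(out_dir).rglob("*.db"))


def list_tables(db_path):
    """Return the table names in a SQLite database, sorted alphabetically."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Assumed local file name and output directory (hypothetical paths).
    archive = Path("ensiwiki2011.db.tar.gz")
    if not archive.exists():
        print("Archive not found; download it from the repository first.")
    elif md5_of(archive) != EXPECTED_MD5:
        print("MD5 mismatch: the download may be incomplete or corrupted.")
    else:
        for db_file in extract_archive(archive, "ensiwiki2011"):
            print(db_file, list_tables(db_file))
```

Reading the archive in chunks keeps memory use flat for a file of this size, and listing tables through sqlite_master avoids guessing at table or column names.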

File Preview

- ensiwiki2011.db (2 GB)
