dc.creator | Van der Sluis, Frans |
dc.date.accessioned | 2023-09-26T10:02:54Z |
dc.date.available | 2023-09-26T10:02:54Z |
dc.date.issued | 2023-09-26 |
dc.identifier.uri | http://hdl.handle.net/20.500.12115/49 |
dc.description | The ensiwiki dataset contains Wikipedia pages sampled from Simple-English and regular English Wikipedia. For each Simple-English page, a paired page was sampled from the regular English Wikipedia if available. The result is a list of pairs of Simple-English and regular English pages. Only pages that form a pair were included. In total, 138,790 pages were sampled from Simple-English Wikipedia and English Wikipedia from August 2011. The purpose of this dataset is to train and test readability detection systems. The dataset is intended to be sufficiently large to detect intricate relations between different features of readability. The dataset is used for this purpose in Van der Sluis (2013, 2014) and is described in further detail in Van der Sluis (2013). The dataset furthermore contains plain-text versions of the wikitext pages. These were parsed using the JWPL MediaWiki parser (see https://dkpro.github.io/dkpro-jwpl/JWPLParser/) and split to the level of articles, sections, and paragraphs. Only the oldest 38,955 wikitext pages were parsed, in order to arrive at a more mature set of pages that more clearly distinguishes between different levels of readability, which proved superior for training readability models. Note: This data is a result of work done at the Human-Media Interaction group of the University of Twente, The Netherlands. Its release is in accordance with the original licensing requirements and has been aligned with the relevant parties. |
dc.language.iso | eng |
dc.publisher | University of Copenhagen |
dc.relation.isreferencedby | https://doi.org/10.1002/asi.23095 |
dc.relation.isreferencedby | https://doi.org/10.3990/1.9789036505673 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | readability |
dc.subject | textual complexity |
dc.subject | wikipedia |
dc.subject | simple english |
dc.title | ensiwiki-2011 dataset for readability modelling |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN-DK |
contact.person | Frans; Van der Sluis; frans@hum.ku.dk; University of Copenhagen |
sponsor | 7th Framework ICT Programme of the European Union.; FP7-ICT-2007-3; PuppyIR; euFunds; |
size.info | 138790; articles |
files.size | 961669085 |
files.count | 2 |
annotationInfo.annotationType | readability labels: simple (english) or (regular) english |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name | readme.md |
Size | 3.59 KB |
Format | Unknown |
Description | Readme |
MD5 | 1f4a663fcf415e9cf2089d255f9124b1 |

Name | ensiwiki2011.db.tar.gz |
Size | 917.12 MB |
Format | application/gzip |
Description | Tar+gzipped SQLite3 database file containing all data and metadata |
MD5 | 664ccbe0aed88212a6863d29338a0632 |
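
After downloading, the files can be checked against the MD5 values listed above and the database inspected. The following is a minimal sketch, not an official loader. It assumes both files sit in the current working directory under the names given in this record. The internal layout of the archive and the table names of the SQLite3 database are not stated here, so the sketch only lists whatever tables it finds rather than assuming a schema. The helper names (`md5_of_file`, `extract_archive`, `list_tables`) are illustrative.

```python
import hashlib
import sqlite3
import tarfile
from pathlib import Path

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "readme.md": "1f4a663fcf415e9cf2089d255f9124b1",
    "ensiwiki2011.db.tar.gz": "664ccbe0aed88212a6863d29338a0632",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_archive(archive_path, target_dir):
    """Extract a tar+gzip archive and return the names of its members."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names


def list_tables(db_path):
    """Return the table names of an SQLite3 database, sorted alphabetically."""
    connection = sqlite3.connect(str(db_path))
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumption: the downloads are in the current directory.
    for file_name, expected in EXPECTED_MD5.items():
        if Path(file_name).exists():
            status = "ok" if md5_of_file(file_name) == expected else "MISMATCH"
            print(f"{file_name}: {status}")

    archive = Path("ensiwiki2011.db.tar.gz")
    if archive.exists():
        # Assumption: the extracted database is identified by opening each
        # regular file and asking SQLite for its tables.
        for member in extract_archive(archive, "ensiwiki2011"):
            candidate = Path("ensiwiki2011") / member
            if candidate.is_file():
                print(candidate, list_tables(candidate))
```

Listing the tables first, and consulting readme.md for their meaning, avoids guessing at column names before querying the Simple-English and regular English page pairs.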