The Geirfan wordlist: A vocabulary list for adult learners of Welsh
The Geirfan wordlist is a curated list of 500 of the most frequent words in the Welsh language, designed for learners at A1/A2[1] levels of proficiency (Council of Europe, 2021). The list was developed using an innovative combination of corpus-based methods (using data from the CorCenCC corpus) and expert-led introspection and reflection, an approach that can be replicated and adapted for use in any other language context.
The lists are included in the appendices of the Geirfan document, comprising the following:
Appendix A contains the 750 most frequent words in CorCenCC, obtained by tagging the corpus with CyTag2. The project team curated this list to produce the basic 500-word wordlist (Appendix B).
Appendix B contains the basic 500-word list, without additions. These 500 words are drawn directly from CorCenCC’s frequency data.
Appendix C contains the working list of additions, in alphabetical order. Together, the 500-word basic list and these additions form the initial batch of headwords for the dictionary on the Geirfan website.
Full details on how to interpret the lists are included in the main body of the documentation.
[1] A2 refers to the Common European Framework of Reference for Languages (CEFR) basic user, waystage level. All references to levels in this paper are as defined by CEFR and in the context of Welsh. See https://www.wjec.co.uk/qualifications/welsh-for-adults-qualification-suite/#tab_overview [Accessed 26.08.22]
This is version 2 of the dataset. It includes new appendices, which correct minor errors in the original data. No changes have been made to the main document. Version 2 was implemented on 23rd March 2023.
Funding
Supporting the generation of impact of CorCenCC - The National Corpus of Contemporary Welsh (2021-04-12 - 2023-03-31); Knight, Dawn. Funder: Economic & Social Research Council
Specialist software required to view data files
None
Language(s) in dataset
- English-Great Britain (EN-GB)
- Welsh (CY)