Icelandic NLP resources
This is a list of known tools and resources developed specifically for linguistic processing of Icelandic. It is intended to give readers an at-a-glance overview of the ever-growing arsenal of tools for working with Icelandic natural language data.
This list is categorized by task for clarity. As a result, some multi-functional tools and toolkits may appear more than once. If you notice that a category or resource is missing, or have suggestions on how to improve this list, please open a GitHub pull request. If you are not familiar with pull requests, you can instead create an issue with your GitHub account.
Contents
- Notable papers and reports
- Other resource collections
- Corpora
- European Language Grid Services
- Toolkits
- Tokenization and text normalization
- POS tagging
- Syntactic parsing
- Grapheme-to-phoneme
- Stress analysis
- Speech synthesis (TTS)
- Speech recognition (ASR)
Notable papers and reports ↑
- Máltækniáætlun fyrir íslensku 2018-2022 (English version)
- The project plan for an ongoing language technology programme funded by the Icelandic Ministry of Education.
- Short paper describing the programme. Note that the programme has been postponed by a year compared to the original plan.
- Risamálheild: A Very Large Icelandic Text Corpus
- Paper describing the Icelandic Gigaword Corpus, a tagged and lemmatized corpus containing over 10^9 tokens.
- A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System
- Please send a pull request with additions to this list. Alternatively, create a GitHub issue with the paper's title, a link/URL to the PDF or book, and a short description, and we can add it to the website/markdown file.
Other resource collections ↑
- CLARIN-IS
- The Icelandic branch of the CLARIN-ERIC language resource initiative. Contains information on and downloads for many tools and datasets.
- SÍM homepage
- Overview page for SÍM (the Icelandic Language Technology Consortium), which contains mirrors and descriptions for all Language Technology Programme projects.
- malfong.is
- List of language technology resources, maintained by Árnastofnun.
- Comprehensive list of language resources
- This list of over 100 Icelandic language technology resources was compiled by @bjarnigithub in the summer of 2021.
Corpora ↑
- Talrómur
- A large public domain TTS corpus designed for research and development. Contains over 160 hours of studio-recorded prompted speech, divided between 8 speakers.
- Samrómur
- An open and accessible speech recognition dataset with FLAC audio files, corresponding text and metadata.
- Icelandic broadcast speech
- 193 hours of radio and TV data from the Icelandic National Broadcasting Service (RÚV).
- Spjallrómur
- Icelandic conversational speech.
- Kennslurómur
- Icelandic lectures with audio and corresponding text.
- GreynirCorpus
- A large, parsed treebank of modern Icelandic text
European Language Grid Services ↑
- tokenizer_api
- icenlp_api
- icenlp_api (IceParser - Shallow Parser)
- pos_api
- ner_api
- far_abltagger_api
- icesum_api
- nefnir_api
- greynirseq_api
- greynirseq_api
- binpackage_api
- greynircorrect_api
Toolkits ↑
Greynir
- Python 3 package capable of syntactic parsing, lemmatization, POS tagging, noun phrase inflection and more
- The GitHub repo for this project
- Developed by Miðeind ehf.
IceNLP
- Java toolkit for tokenization, POS tagging, lemmatization, parsing and NER
- Developed by Hrafn Loftsson
LVL-tts-frontend
- TTS frontend designed to work with the Merlin speech synthesis system developed by CSTR
- It contains a pronunciation dictionary, a Sequitur G2P model, a stress analysis component and more. Unfortunately, it does not include any documentation.
- Developed by Anna Björk Nikulásdóttir at LVL
Tokenization and text normalization ↑
- Icelandic tokenizer
- Textahaukur - text normalization toolkit
- Development appears to be suspended, and the project describes itself as not yet functional.
- Regína normalizer
- Regex-based text normalization in Python. Currently in the early stages of development.
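To give a rough idea of what regex-based text normalization involves, here is a minimal sketch in Python. It is not taken from any of the tools listed above: the abbreviation table, the function name `expand_abbreviations`, and the matching rules are all assumptions made for illustration, and a real normalizer would also need to handle numbers, dates, inflection and much more.

```python
import re

# Hypothetical abbreviation table, for illustration only.
# It is not taken from any of the listed projects.
ABBREVIATIONS = {
    "t.d.": "til dæmis",
    "o.s.frv.": "og svo framvegis",
    "þ.e.": "það er",
}

# One alternation over all abbreviations, longest first so that a longer
# abbreviation is preferred if two of them ever overlap.
# The lookarounds prevent matches inside longer words.
_ABBREVIATION_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(
        re.escape(key) for key in sorted(ABBREVIATIONS, key=len, reverse=True)
    )
    + r")(?!\w)",
    flags=re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its (lowercase) expansion."""
    return _ABBREVIATION_PATTERN.sub(
        lambda match: ABBREVIATIONS[match.group(1).lower()], text
    )


if __name__ == "__main__":
    print(expand_abbreviations("Hann les t.d. bækur o.s.frv."))
```

Note that in this sketch the final period of an abbreviation is consumed even when it also ends the sentence, and the expansion is always lowercase. Both are simplifications that a production normalizer would have to address.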
POS tagging ↑
Syntactic parsing ↑
- Neural parsing pipeline for Icelandic
- Greynir, see above
- IceNLP, see above
Grapheme-to-phoneme ↑
- LSTM encoder-decoder sequence-to-sequence models for Icelandic, reference
- g2p-service is a g2p web service. reference
- Icelandic pronunciation dictionary
- Pronunciation dictionary editor
- Thrax G2P grammar for Icelandic, reference
- LVL-tts-frontend
- G2P - Atli Thor’s g2p Python module/pip package, reference
- Module for preparing text data for TTS data collections …, reference
- Althingi ASR g2p, reference
Stress analysis ↑
- LVL-tts-frontend performs stress analysis
Speech synthesis ↑
Speech recognition ↑
- Ice-ASR
- Alþingi
- Samromur ASR
- Contains a vanilla (base) recipe, subword modelling, and specialized recipes for children and adolescents
- alignment and segmentation
- Scripts to prepare RÚV TV data for alignment and segmentation to make an ASR dataset
- Tiro Speech Core
- Tal
Our CADIA-LVL works in progress
You can also follow our many works in progress at LVL on our GitHub: https://github.com/cadia-lvl
Facebook page: https://www.facebook.com/languageandvoice/