No. | Data Source Name, EN | Data Source Name, National Language | National Language ID | Data type | Source Type | Language(s) | Domain | Source | IPR / Licensing / Security considerations | Data Holder |
---|---|---|---|---|---|---|---|---|---|---|
1 | ParIce - English-Icelandic parallel corpus | - | - | parallel | corpus | IS EN | the bible, books, EEA documents, Patient information leaflets (EMA), European Southern Observatory (ESO), Texts from the localization files of KDE (KDE4) (from OPUS), OpenSubtitles (from OPUS), Sagas, Statistics Iceland, Tatoeba, Ubuntu | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/16 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | info about data source in the following article: https://aclanthology.org/W19-6115.pdf |
2 | ParIce Dev/Test/Train Split 20.05 | - | - | parallel | corpus | IS EN | the bible, books, EEA documents, Patient information leaflets (EMA), European Southern Observatory (ESO), Texts from the localization files of KDE (KDE4) (from OPUS), OpenSubtitles (from OPUS), Sagas, Statistics Iceland, Tatoeba, Ubuntu | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/24 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | info about data source in the following article: https://aclanthology.org/W19-6115.pdf |
3 | En-Is Synthetic Parallel Corpus | - | - | parallel | corpus | IS EN | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/70 | Icelandic Gigaword Corpus Part1 | data sources: Wikipedia, Newscrawl and Europarl corpora; Icelandic Gigaword Corpus |
4 | En-Is Semi-Synthetic Parallel Name Robustness Corpus | - | - | parallel | corpus | IS EN | person names | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/74 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: based on the ParIce corpus |
5 | cities_is2en | - | - | parallel | corpus | IS EN | city names | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/66 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: information provided by the Icelandic Ministry for Foreign Affairs and the Árni Magnússon Institute for Icelandic Studies |
6 | Gold Alignments for English-Icelandic Word Alignments | - | - | parallel | lexical conceptual resource | IS EN | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/103 | - | - |
7 | UD Icelandic PUD | - | - | parallel | corpus | IS EN | news, wikipedia | https://universaldependencies.org/treebanks/is_pud/index.html | CC BY-SA 4.0 | - |
8 | OPUS (The Open Parallel Corpus) | - | - | parallel | website/corpus | multilingual | - | https://opus.nlpl.eu/ | - | - |
9 | WikiMatrix v1 | - | - | parallel | corpus | IS EN | - | https://opus.nlpl.eu/WikiMatrix-v1.php | CC-BY-SA 4.0 | data source: wikimedia |
10 | wikimedia v20210402 | - | - | parallel | corpus | IS EN | - | https://opus.nlpl.eu/wikimedia-v20210402.php | CC-BY-SA 4.0 | data source: wikimedia |
11 | XLEnt v1.1 | - | - | parallel | corpus | IS EN | - | https://opus.nlpl.eu/XLEnt-v1.1.php | - | - |
12 | TildeMODEL | - | - | parallel | corpus | IS EN | document texts of European Economic and Social Committee document portal; press releases; banking; medicine; travel; tourism; texts of Lithuanian National Philharmonic Society web site; Müpa Budapest - web site of Hungarian national culture house and concert venue; texts of fold.lv portal http://www.fold.lv/en/ of the best of Latvian and foreign creative industries; from texts of http://czechtourism.com/ portal | https://opus.nlpl.eu/TildeMODEL-v2018.php | CC-BY - Creative Commons with Attribution | data sources: http://dm.eesc.europa.eu/; http://europa.eu/rapid/; http://ebc.europa.eu/; http://www.ema.europa.eu/; http://www.worldbank.org/; https://www.airbaltic.com/en/destinations/; http://liveriga.com/; http://www.filharmonija.lt/; https://www.mupa.hu/en/; http://www.fold.lv/en/; http://czechtourism.com/ |
13 | CCAligned v1 | - | - | parallel | corpus | IS EN | - | https://opus.nlpl.eu/CCAligned-v1.php | - | - |
14 | JW300 v1b | - | - | parallel | corpus | IS EN | - | https://opus.nlpl.eu/JW300-v1b.php | For all practical purposes, the license is CC-BY-NC-SA. Still, jw.org maintains custom terms of use [https://www.jw.org/en/terms-of-use/]; in doubt, make sure to observe their license! | data source: jw.org |
15 | QED | - | - | parallel | corpus | IS EN | education | https://opus.nlpl.eu/QED-v2.0a.php | “The QED Corpus is made public for RESEARCH purpose only. The corpus is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Copyright Qatar Computing Research Institute. All rights reserved.” | - |
16 | Mozilla-I10n v1 | - | - | parallel | corpus | IS EN | Mozilla localisation/internationalisation data | https://opus.nlpl.eu/Mozilla-I10n-v1.php | Mozilla Public License 2.0 | - |
17 | Eubookshop | - | - | parallel | corpus | IS EN | documents from the EU bookshop | https://opus.nlpl.eu/EUbookshop-v2.php | - | data source: http://bookshop.europa.eu |
18 | TED2020 v1 | - | - | parallel | corpus | IS EN | - | https://opus.nlpl.eu/TED2020-v1.php | License: Please respect the TED Talks Usage Policy | data source: crawl of nearly 4000 TED and TED-X transcripts |
19 | Paracrawl | - | - | parallel | corpus | multilingual | - | https://paracrawl.eu/ | Creative Commons CC0 license (“no rights reserved”) | - |
20 | Paracrawl Synthesized Data | - | - | parallel | corpus | multilingual | covid-19 | https://paracrawl.eu/manufactured-data | - | - |
21 | Parallel English-Icelandic corpus from the Icelandic Directorate for International Development Cooperation website | - | - | parallel | corpus | IS EN | - | https://data.europa.eu/data/datasets/elrc_504?locale=en | Creative Commons Attribution 4.0 International | data source: Icelandic Directorate for International Development Cooperation website |
22 | Ríkiskaup (Central Public Procurement) - Translation Memory 2020 | - | - | parallel | corpus | IS EN | - | https://elrc-share.eu/repository/browse/rikiskaup-central-public-procurement-translation-memory-2020/cd8551a8c78511eb9c1a00155d0267069d21c63733144b2fa4b9c9cfcbababc4/ | IPR Holders: Ríkiskaup (Central Public Procurement) https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf | data source: internal bi-lingual documents from Rikiskaup (https://www.rikiskaup.is/) |
23 | University of Iceland’s TM | - | - | parallel | corpus | IS EN | includes translations of rules, procedures, contracts, policies, announcements, letters, speeches and news | https://elrc-share.eu/repository/browse/university-of-icelands-tm/7cd401ccc79a11eb9c1a00155d026706fb642f0237ec4bd9a590b5bc81441512/ | IPR Holders: Abigail Charlotte Cooper; University of Iceland https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf | - |
24 | The Icelandic Met Office - Weather forecasts and warnings | - | - | parallel | corpus | IS EN | Meteorological reports | https://elrc-share.eu/repository/browse/the-icelandic-met-office-weather-forecasts-and-warnings/6963b446c56411eb9c1a00155d0267068a479fe4536c4c9e83eb051f37198b7e/ | IPR Holders: Icelandic Meteorological Office https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf | https://vedur.is/ |
25 | Government Offices in Iceland - Reports | - | - | parallel | corpus | IS EN | eGovernment | https://elrc-share.eu/repository/browse/government-offices-in-iceland-reports/6963b445c56411eb9c1a00155d026706c22c67dbe78743059ee4a388fab2cc7c/ | IPR Holders: Government Offices of Iceland https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf | data source: www.government.is; www.stjornarradid.is |
26 | Government Offices in Iceland – Legislation and regulations | - | - | parallel | corpus | IS EN | eJustice/LAW | https://elrc-share.eu/repository/browse/government-offices-in-iceland-legislation-and-regulations/6ad42ad5c56411eb9c1a00155d026706aac46ebe9197417e999d6f2b768c0bf7/ | IPR Holders: Government Offices of Iceland https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf | data source: documents on the Icelandic and English websites of the Government Offices in Iceland |
27 | Bilingual corpus made out of PDF documents from the European Medicines Agency, (EMEA), https://www.ema.europa.eu, (February 2020) (EN-IS) | - | - | parallel | corpus | IS EN | SOCIAL QUESTIONS Health (Eurovoc 2841) | https://elrc-share.eu/repository/browse/bilingual-corpus-made-out-of-pdf-documents-from-the-european-medicines-agency-emea-httpswwwemaeuropaeu-february-2020-en-is/29110788862811ea913100155d0267069f685ed8fd1e4ae088600d9c99af303c/; multilingual version of this parallel corpus: https://elrc-share.eu/repository/browse/multilingual-corpus-made-out-of-pdf-documents-from-the-european-medicines-agency-emea-httpswwwemaeuropaeu-february-2020/3cf9da8e858511ea913100155d0267062d01c2d847c349628584d10293948de3/ | https://elrc-share.eu/static/metashare/licences/CC-BY-4.0.pdf | data source: PDF documents from the European Medicines Agency (https://www.ema.europa.eu) |
28 | Bilingual English-Icelandic parallel corpus from the official Nordic cooperation website | - | - | parallel | corpus | IS EN | INTERNATIONAL ORGANISATIONS (Eurovoc 76), POLITICS (Eurovoc 04), INTERNATIONAL RELATIONS (Eurovoc 08) | https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-the-official-nordic-cooperation-website/0e9d06707ad311e8b7d400155d026706ce2fbb1eb16c412a8ba9080d7490657d/ | IPR Holders: Nordic Council; Nordic Council of Ministers Open Under-PSI (more info about the license) | data source: Nordic Co-operation website http://www.norden.org |
29 | Bilingual English-Icelandic parallel corpus from Harpa Reykjavik Concert Hall and Conference Centre website | - | - | parallel | corpus | IS EN | - | https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-harpa-reykjavik-concert-hall-and-conference-centre-website/b56c64c2e4d411e7b7d400155d0267060908d10ea65d42b58b429dd9301a7582/ | IPR Holders: Harpa Reykjavik Concert Hall and Conference Centre Open Under-PSI (more info about the license) | data source: contents of https://en.harpa.is and https://www.harpa.is |
30 | Bilingual English-Icelandic parallel corpus from Icelandic Financial Supervisory Authority | - | - | parallel | corpus | IS EN | BUSINESS & COMPETITION (Eurovoc 40) | https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-icelandic-financial-supervisory-authority/f2a5b200e4c311e7b7d400155d02670665375c54796744de9689b6b49deb74ed/ | IPR Holders: Financial Supervisory Authority Iceland Open Under-PSI (more info about the license) | data source: contents of https://en.fme.is/ and https://www.fme.is/ |
31 | Bilingual English-Icelandic parallel corpus from Icelandic Post and Telecom Administration website | - | - | parallel | corpus | IS EN | LAW (Eurovoc 12) | https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-icelandic-post-and-telecom-administration-website/d6cc14a8e4c711e7b7d400155d02670668f4c1b127ee42ab9108ee2d0f2eb4b7/ | IPR Holders: Post and Telecom Administration in Iceland Open Under-PSI (more info about the license) | data source: contents of https://www.pfs.is/ |
32 | Bilingual English-Icelandic parallel corpus from Nordisk eTax website | - | - | parallel | corpus | IS EN | FINANCE (Eurovoc 24) | https://elrc-share.eu/repository/browse/bilingual-english-icelandic-parallel-corpus-from-nordisk-etax-website/c0970ab4eadd11e7b7d400155d026706fd923049f0ab48678edf9c4ae3fcdf71/ | IPR Holders: Nordisk eTax Open Under-PSI (more info about the license) | data source: contents of https://www.nordisketax.net/ |
33 | Bilingual is-en parallel corpus from Icelandic Medicines Agency website | - | - | parallel | corpus | IS EN | - | https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-icelandic-medicines-agency-website/4a6ddf7ae56011e7b7d400155d026706bfb8c1760a814dea8b0ce4300da6b504/ | IPR Holders: Icelandic Medicines Agency Open Under-PSI (more info about the license) | data source: contents of https://www.lyfjastofnun.is/ and https://www.ima.is |
34 | Bilingual is-en parallel corpus from National Gallery of Iceland website | - | - | parallel | corpus | IS EN | - | https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-national-gallery-of-iceland-website/0958d46ee4d311e7b7d400155d0267061663be21d26647d1accefea64fc4db3b/ | IPR Holders: NATIONAL GALLERY OF ICELAND Open Under-PSI (more info about the license) | data source: contents of http://www.listasafn.is |
35 | Bilingual is-en parallel corpus from The Icelandic Directorate of Immigration website | - | - | parallel | corpus | IS EN | - | https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-the-icelandic-directorate-of-immigration-website/2467fa26e56111e7b7d400155d026706bfd15d2901e94fdf979f4f1fff86c318/ | IPR Holders: Útlendingastofnun, The Directorate of Immigration, Iceland Open Under-PSI (more info about the license) | data source: contents of http://www.utl.is |
36 | Bilingual is-en parallel corpus from THE LITERATURE WEB website | - | - | parallel | corpus | IS EN | - | https://elrc-share.eu/repository/browse/bilingual-is-en-parallel-corpus-from-the-literature-web-website/b5a7f5fee4d511e7b7d400155d026706cfd7be18e5bd497fb00355ba4d23741d/ | IPR Holders: City of Reykjavík Open Under-PSI (more info about the license) | data source: contents of https://bokmenntaborgin.is/ |
37 | Parallel English-Icelandic corpus from the contents of Icelandic National Debt Management Agency website | - | - | parallel | corpus | IS EN | ECONOMICS (Eurovoc 16), FINANCE (Eurovoc 24) | https://elrc-share.eu/repository/browse/parallel-english-icelandic-corpus-from-the-contents-of-icelandic-national-debt-management-agency-website/827c09c0e4cd11e7b7d400155d026706b3e67c6af2754f67a335ce5d6068d223/ | IPR Holders: Central Bank of Iceland Open Under-PSI (more info about the license) | data source: contents of http://www.lanamal.is |
38 | Parallel English-Icelandic corpus from the Icelandic Directorate for International Development Cooperation website | - | - | parallel | corpus | IS EN | INTERNATIONAL RELATIONS (Eurovoc 08) | https://elrc-share.eu/repository/browse/parallel-english-icelandic-corpus-from-the-icelandic-directorate-for-international-development-cooperation-website/eaca6b40e4c611e7b7d400155d0267065d1af2425274432fafd35c1f93ff097e/ | IPR Holders: Government Offices of Iceland Open Under-PSI (more info about the license) | data source: contents of http://www.iceida.is/ |
39 | EAC Translation memory - Forms Data | - | - | parallel | corpus | IS EN | electronic forms, EDUCATION & COMMUNICATIONS (Eurovoc 32) | https://elrc-share.eu/repository/browse/eac-translation-memory-forms-data/0ed2d886c1f711eb9c1a00155d026706c66213a889bf4297a834b3b0f21c84e1/ | IPR Holders: Directorate General for Education and Culture CC-BY-4.0 | - |
40 | EAC Translation memory - Reference Data | - | - | parallel | corpus | IS EN | Electronic Reference Data | https://elrc-share.eu/repository/browse/eac-translation-memory-reference-data/67911206c56411eb9c1a00155d02670635d5c6e318714fa4803d8099f75f7bcb/ | IPR Holders: Directorate General for Education and Culture CC-BY-4.0 | - |
41 | META-NORD Sofie Parallel Treebank | - | - | parallel | corpus | DA EN ET DE IS NO SV | - | https://clarino.uib.no/iness/page?page-id=Sofie&session-id=251398323083844 (the link returns the error message “Fake or stale session id”; to find the corpus, select “Treebank selection” under “Treebanks” on the left side of the webpage, select “Icelandic” under “Languages” and “Sofie” under “Treebank Collections”, select “Show only parallel Treebanks”, then click “Sofie” under “Collection” at the bottom) | License: http://license.no/ | data source: first chapters of the novel Sofies verden by Jostein Gaarder, published by Aschehoug forlag |
42 | EAC Translation Memory | - | - | parallel | corpus | BG CS DA ES ET FI FR HU IS IT LT LV NL PL RO SK SL SV TR EN EL PT DE HR MT NB NO | law, culture, education | https://inventory.clarin.gr/corpus/733 | Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode https://creativecommons.org/licenses/by/4.0/ | - |
43 | ECDC Translation Memory | - | - | parallel | corpus | BG CS DA ES ET FI FR HU IS IT LT LV MT NB NL PL RO SK SL SV TR EN PT DE EL | Medicine & Health | https://inventory.clarin.gr/corpus/729 | Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode https://creativecommons.org/licenses/by/4.0/ | - |
44 | PELCRA multilingual parallel corpora (CC-BY) | - | - | parallel | corpus | DE EN ES FR IT PL CS DA FI IS NL NO PT RU SV TR UK BG EL ET HU LT LV MT RO SK SL AR BE GA HR | Law, Science, Political Science | https://inventory.clarin.gr/corpus/665 | Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode https://creativecommons.org/licenses/by/4.0/ | data source: CORDIS news; ESO website; European Parliament website; EUROPA website |
45 | META-NORD Acquis Parallel Treebank | - | - | parallel | corpus | ET IS SV NO EN DA FI | law | https://clarino.uib.no/iness/clarino-metadata?session-id=251398323083844&identifier=Acquis (the link returns the error message “Fake or stale session id”; to find the corpus, click “Home page”, select “Treebank selection” under “Treebanks” on the left side of the webpage, select “Icelandic” under “Languages” and “Acquis” under “Treebank Collections”, select “Show only parallel Treebanks”, then click “Acquis” under “Collection” at the bottom) | Creative_Commons-BY (CC-BY) | data source: Directive 2002/74/EC from the Acquis Communautaire (AC) |
46 | GreynirCorpus (2021-06-23) | - | - | monolingual, parsed corpus | corpus | IS | mostly news sources | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/119 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
47 | The Icelandic Contemporary Treebank (IceConTree) Version 1.1 | Samtímalegi íslenski trjábankinn | IS | monolingual, parsed corpus | corpus | IS | parliamentary text, speech, law text, text from media, text from radio, text from the internet, text from television, encyclopedia | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/112 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
48 | The Icelandic Parsed Historical Corpus (IcePaHC) | Sögulegi íslenski trjábankinn | IS | monolingual, parsed corpus | corpus | IS | narratives and religious material but some samples from other genres | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/62 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: consists of texts from the Icelandic Gigaword Corpus |
49 | NeuralMIcePaHC (2020-05-07) | - | - | monolingual, parsed corpus | corpus | IS | Icelandic texts from the 13th to 20th century, mostly Icelandic sagas | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/20 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
50 | Icelandic Gigaword Corpus 1 (IGC1) - version 20.05 | Risamálheildin 1 - Útgáfa 20.05 | IS | monolingual, tagged and lemmatized corpus | corpus | IS | official texts (e.g. parliamentary speeches as far back as 1911, law text, adjudications); big text collections from news media and various texts from the text collection of the Árni Magnússon Institute for Icelandic Studies (http://igc.arnastofnun.is/) | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/41 | Icelandic Gigaword Corpus Part1 | - |
51 | Icelandic Gigaword Corpus 2 (IGC2) - version 20.05 | Risamálheildin 2 - Útgáfa 20.05 | IS | monolingual, tagged and lemmatized corpus | corpus | IS | official texts (parliamentary speeches, law text, adjudications); news media and various texts from the text collection of the Árni Magnússon Institute for Icelandic Studies | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/33 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
52 | IGC-Adjud-21.05 (The Icelandic Gigaword Corpus: Adjudications) | - | - | monolingual, tagged and lemmatized corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/101 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: judgements that have been published on the websites of the three levels of jurisdiction in Iceland |
53 | IGC-Laws-21.05 (The Icelandic Gigaword Corpus: Laws, bills and proposals) | - | - | monolingual, tagged and lemmatized corpus | corpus | IS | 1) the Icelandic laws, 2) explanatory reports and observations extracted from bills submitted to Althingi, and 3) parliamentary proposals and resolutions | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/116 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
54 | IGC-Parla-21.05 (The Icelandic Gigaword Corpus: Parliamentary speeches) | - | - | monolingual, tagged and lemmatized corpus | corpus | IS | parliamentary speeches that have been encoded according to the Parla-CLARIN recommendations | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/111 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
55 | IGC - evaluation set 20.09 | - | - | monolingual, tagged and lemmatized corpus | corpus | IS | adjudications, books, educational websites, legal texts, news, opinions, parliamentary speeches, sport news and radio and tv news scripts | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/51 | Icelandic Mim Gold Standard for PoS Tagging | data source: Icelandic Gigaword Corpus (version 2018) |
56 | Icelandic Frequency Dictionary 2020.05 - training/testing sets | Orðtíðnibókin 2020.05 þjálfunar-/prófunarsafn | IS | monolingual, tagged and lemmatized corpus | corpus | IS | Icelandic fiction, translated fiction, biographies and memoirs, non-fiction (field of humanities, field of science) and books for children and teenagers (original texts, translations) | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/38 | Icelandic Frequency Dictionary | - |
57 | MIM-GOLD 21.05 | MÍM-Gull 21.05 | IS | monolingual, tagged and lemmatized corpus | corpus | IS | texts are from The Tagged Icelandic Corpus (MÍM) | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/113 | Icelandic Mim Gold Standard for PoS Tagging | data source: texts are from The Tagged Icelandic Corpus (MÍM) |
58 | MIM-GOLD 21.05 - train/test | MÍM-Gull 21.05 - þjálfunar-/prófunargögn | IS | monolingual, tagged and lemmatized corpus | corpus | IS | texts are from The Tagged Icelandic Corpus (MÍM) | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/114 | Icelandic Mim Gold Standard for PoS Tagging | - |
59 | Tagged Icelandic Corpus | Mörkuð íslensk málheild | IS | monolingual, tagged and lemmatized corpus | corpus | IS | among other things: newspapers; text from various printed periodicals; official texts (speeches from the Icelandic Parliament (Alþingi), legal texts and adjudications, and texts from the websites of government ministries) | http://www.malfong.is/index.php?pg=mim&lang=en https://clarin.is/en/resources/mim/ | Special User License | - |
60 | Talromur | Talrómur | IS | monolingual, speech corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/104 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | audio was recorded by Reykjavík University and The Icelandic National Broadcasting Service |
61 | RÚV TV data | Rúv TV gagnasafnið | IS | monolingual, speech corpus | corpus | IS | news commentary, literature discussions, and the prime time news | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/93 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: TV data from RÚV; published by the Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV) and made by both RÚV and Reykjavik University |
62 | The RÚV Corpus | RÚV-málheildin | IS | monolingual, speech corpus | corpus | IS | read news items that includes a large vocabulary | http://www.malfong.is/index.php?pg=ruv&lang=en | - | - |
63 | Islex Recordings | Hljóðskrár ISLEX | IS | monolingual, speech corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/6 http://www.malfong.is/index.php?lang=en&pg=islexrecordings | CC-BY-NC-ND license | - |
64 | The Hjal Corpus | Hjal | IS | monolingual, speech corpus | corpus | IS | - | https://clarin.is/en/resources/hjal/ http://www.malfong.is/index.php?pg=hjal&lang=en | CC BY 3.0 license | - |
65 | Parliament Speech Corpus | Alþingisumræður | IS | monolingual, speech corpus | corpus | IS | government budget, taxation, water laws, energy, schools and transportation | http://www.malfong.is/index.php?pg=althingi&lang=en | CC BY 3.0 license | data source: recordings were obtained directly from the Icelandic Parliament (Althingi) |
66 | Althingi’s Parliamentary Speeches | Alþingisgögnin | IS | monolingual, speech corpus | corpus | IS | Althingi recordings | http://www.malfong.is/index.php?pg=althingisraedur&lang=en | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: Althingi recordings |
67 | The Jensson Corpus | Jenson-málheildin | IS | monolingual, speech corpus | corpus | IS | - | http://www.malfong.is/index.php?pg=jensson&lang=en | - | - |
68 | The Thor Corpus | Þór-málheildin | IS | monolingual, speech corpus | corpus | IS | weather | http://www.malfong.is/index.php?pg=thor&lang=en | - | data source: the text was translated from MIT's JUPITER corpus |
69 | The Malromur Corpus | Málrómur | IS | monolingual, speech corpus | corpus | IS | - | https://clarin.is/en/resources/malromur/ http://www.malfong.is/index.php?pg=malromur&lang=en | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: part of text is from mbl.is |
70 | General Pronunciation Dictionary for ASR | Almenn framburðarorðabók fyrir talgreiningu | IS | monolingual, speech corpus | corpus | IS | - | http://www.malfong.is/index.php?lang=en&pg=framb_talgr | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
71 | Samromur 21.05 | Samrómur 21.05 | IS | monolingual, speech corpus | corpus | IS | - | https://www.openslr.org/112/ | CC BY 4.0 | - |
72 | Pronunciation Dictionary for Icelandic | Framburðarorðabókin | IS | monolingual, language description | corpus | IS | news, novels, Ístal Corpus | https://clarin.is/en/resources/prondict/ http://www.malfong.is/index.php?pg=framburdur&lang=en | CC BY 3.0 license | data source: newspaper Morgunblaðið, recent novels, and the Ístal Corpus |
73 | Patterns and Sentences | Mynstur og setningar | IS | monolingual, language description | corpus | IS | extracted from novels | http://www.malfong.is/index.php?pg=mynsturogsetningar&lang=en | CC BY 3.0 license | - |
74 | The Icelandic Dyslexia Error Corpus (IceDEC) Version 1.0 | Íslenska lesblinduvillumálheildin | IS | monolingual, error corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/107 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
75 | Icelandic Error Corpus (IceEC) Version 1.1 | Íslenska villumálheildin - Útgáfa 1.1 | IS | monolingual, error corpus | corpus | IS | student essays, online news texts and Icelandic Wikipedia articles | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/105 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
76 | The Icelandic Child Language Error Corpus (IceCLEC) Version 1.0 | Villumálheild íslensks barnamáls | IS | monolingual, error corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/108 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
77 | The Icelandic L2 Error Corpus (IceL2EC) Version 1.1 | Villumálheild íslensku sem annars máls | IS | monolingual, error corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/106 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
78 | Icelandic Error Corpus Nonwords | Óorð íslensku villumálheildarinnar | IS | monolingual, error corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/63 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
79 | Icelandic Search Query Errors (IceSQuEr) 0.1 | Íslenskar leitarvillur | IS | monolingual, error corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/78 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: users’ search queries that do not give results in the Database of Icelandic Morphology (https://bin.arnastofnun.is/) |
80 | nonwords | - | - | monolingual, error corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/50 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: the list was prepared using a word list from the DMII (the Database of Modern Icelandic Inflection) |
81 | The Icelandic Confusion Set Corpus (ICoSC) 2.0 (2020-05-06) | - | - | monolingual, error corpus | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/19 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
82 | MIM-GOLD-NER – named entity recognition corpus | Nafnkennslamálheildin | IS | monolingual | corpus | IS | NE version of MIM-GOLD | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/42 | Icelandic Gigaword Corpus Part1 | - |
83 | IceSum - Icelandic Text Summarization Corpus | - | - | monolingual | corpus | IS | local, world, business and sports news | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/96 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data source: news articles from mbl.is |
84 | The Saga Corpus | Fornritin | IS | multilingual | corpus | IS old-norse | Old Icelandic narrative texts: Family Sagas, Sturlunga Saga, Sagas of the Kings of Norway and the Book of Settlement | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/32 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | data sources: Family Sagas –> Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.). 1985-1986. Íslendinga sögur. Svart á hvítu. Reykjavík. Heimskringla –> Bergljót Kristjánsdóttir, Bragi Halldórsson, Jón Torfason and Örnólfur Thorsson (eds.). 1991. Heimskringla. Mál og menning. Reykjavík. Book of Settlement –> Jakob Benediktsson (ed.). 1968. Íslenzk fornrit I. Íslendingabók - Landnámabók. Hið íslenzka fornritafélag. Sturlunga Saga –> Örnólfur Thorsson, Bergljót Kristjánsdóttir, Bragi Halldórsson, Gísli Sigurðsson, Guðrún Ása Grímsdóttir, Guðrún Ingólfsdóttir, Jón Torfason and Sverrir Tómasson (eds.). 1988. Sturlunga saga. Svart á hvítu. Reykjavík. |
85 | Icelandic Taboo Database (iceTaboo) Version 1.0 | - | - | monolingual | corpus | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/64 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
86 | Icelandic Web Text Corpus | Íslenskur orðasjóður | IS | monolingual | website/corpus | IS | - | https://corpora.uni-leipzig.de/en?corpusId=isl-is_web_2019 | - | - |
87 | Icelandic Multi-SimLex | - | - | monolingual | lexical conceptual resource | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/121 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
88 | IceBATS - The Icelandic Bigger Analogy Test Set | - | - | monolingual | lexical conceptual resource | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/120 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
89 | Icegrams (2020-09-30) | - | - | monolingual | language description | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/80 | The MIT License (MIT) | “The Icegrams trigram corpus is built from the 2017 edition of the Icelandic Gigaword Corpus” |
90 | Icelandic Hyphenation Dictionary | Íslenskur orðskiptingalisti og orðskiptingamynstur |
IS | monolingual | lexical conceptual resource | IS | - | https://clarin.is/en/resources/hyphenation/ https://repository.clarin.is/repository/xmlui/handle/20.500.12537/86 |
Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
91 | MerkOr | MerkOr - íslenskur merkingarbrunnur |
IS | monolingual | corpus, tool | IS | - | https://clarin.is/en/resources/merkor/ | LGPL-3.0 License | - |
92 | Terminology Database of the Ministry of Foreign Affairs |
Hugtakasafn þýðingarmiðstöðvar utanríkisráðuneytisins |
IS | parallel, multilingual | website/terminology | IS EN DA NO SV FR DE LA | law, administration, names of international agreements, institutions, committees, councils, etc.; glossary of the Icelandic International Development Agency (ICEIDA) |
https://clarin.is/en/resources/translation/ https://hugtakasafn.utn.stjr.is/umhts.adp, https://hugtakasafn.utn.stjr.is/ |
- | - |
93 | Icelandic Wordnet | Íslenskt orðanet | IS | monolingual | website | IS | - | https://clarin.is/en/resources/icewordnet/, https://ordanet.is/ | - | - |
94 | The Icelandic Wordweb 21.06 | - | - | monolingual | lexical conceptual resource | IS | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/117 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
95 | IceWordNet | - | - | monolingual | (similar to a thesaurus) | IS | - | https://clarin.is/en/resources/iwn/ | CC BY 3.0 license | data source: English words in the Princeton Core WordNet were translated into Icelandic; synonyms of the Icelandic words were listed with help from the Icelandic Thesaurus and the website snara.is. |
96 | Dictionary of Modern Icelandic | Íslensk nútímamálsorðabók |
IS | monolingual | website/lexical conceptual resource | IS | - | https://clarin.is/en/resources/dmi/ https://islenskordabok.arnastofnun.is/ https://repository.clarin.is/repository/xmlui/handle/20.500.12537/94 |
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) | “based on the multilingual dictionary ISLEX” |
97 | The Institute of Lexicography Written Language Archive |
Ritmálssafn Orðabókar Háskólans |
IS | monolingual | website | IS | “citations from printed books and journals, and a number of manuscripts, from 1540 onwards” |
https://clarin.is/en/resources/archive/ | - | - |
98 | Islex - Icelandic-Scandinavian multilingual dictionary |
ISLEX | IS | multilingual | lexical conceptual resource | DA FO FI IS NB NN SV | - | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/10 | - | - |
99 | Database of Icelandic Morphology | Beygingarlýsing íslensks nútímamáls |
IS | monolingual | language description | IS | - | https://bin.arnastofnun.is/DMII/LTdata/ https://repository.clarin.is/repository/xmlui/handle/20.500.12537/5 |
Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
100 | The Icelandic Term Bank | Íðorðabankinn | IS | multilingual | terminological resource | multilingual | - | https://clarin.is/en/resources/termbank/ | CC-BY-SA licence | - |
101 | Plaintext Wikipedia dump 2018 | - | - | multilingual (see list of languages on corpus webpage) |
corpus | multilingual | texts from Wikipedia | https://lindat.cz/repository/xmlui/handle/11234/1-2735 | Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) | data source: Wikipedia plain text data obtained from Wikipedia dumps |
102 | Deltacorpus 1.1 | - | - | multilingual (see list of languages on corpus webpage) |
corpus | multilingual | - | https://lindat.cz/repository/xmlui/handle/11234/1-1743 | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) | data source: W2C corpus |
103 | W2C – Web to Corpus – Corpora | - | - | multilingual (see list of languages on corpus webpage) |
corpus | multilingual | collected from Wikipedia and the web | https://lindat.cz/repository/xmlui/handle/11858/00-097C-0000-0022-6133-9# https://vlo.clarin.eu/record/https_58__47__47_hdl.handle.net_47_11858_47_00-097C-0000-0022-6133-9_64_format_61_cmdi?1&q=multilingual&fqType=languageCode:or&fq=languageCode:code:isl&index=11&count=17 |
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) | - |
104 | Concreteness and imageability lexicon MEGA.HR-Crossling |
- | - | multilingual (see list of languages on corpus webpage) |
lexical conceptual resource | multilingual | - | https://www.clarin.si/repository/xmlui/handle/11356/1187# | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) | - |
105 | Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1 |
- | - | multilingual (see list of languages on corpus webpage) |
corpus | BG HR CS DA NL EN FR HU IS IT LV LT PL SL ES TR |
parliamentary debates mostly starting in 2015 and extending to mid-2020 |
https://www.clarin.si/repository/xmlui/handle/11356/1431 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
106 | Multilingual comparable corpora of parliamentary debates ParlaMint 2.1 |
- | - | multilingual (see list of languages on corpus webpage) |
corpus | BG HR CS DA NL EN FR HU IS IT LV LT PL SL ES TR |
parliamentary debates mostly starting in 2015 and extending to mid-2020 |
https://www.clarin.si/repository/xmlui/handle/11356/1432 | Creative Commons - Attribution 4.0 International (CC BY 4.0) | - |
107 | Universal Dependencies 2.8.1 | - | - | multilingual (see list of languages on corpus webpage) |
corpus | multilingual | - | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3687# | Licence Universal Dependencies v2.8 | - |
108 | COVID-19 ANTIBIOTIC dataset. Multilingual (CEF languages) |
- | - | multilingual (see list of languages on corpus webpage) |
corpus | LV PL NL FI LT HR MT NO NB SL SK EN SV IS RO PT HU IT BG ES FR DA DE ET CS GA EL |
Health (Eurovoc 2841), Social Questions | https://portulanclarin.net/repository/browse/1c5ff916146911ebb6ec02420a0004094d31b044c80f4109bb228ff6f55a68e8/ |
CC BY | data source: acquired from the website https://antibiotic.ecdc.europa.eu/ |