Browse by Author "Ginter, Filip"
Now showing 1 - 20 of 25
- Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910 (Gothenburg, Linköping University Electronic Press, pp. 54-58, 2017). Vesanto, Aleksi; Nivala, Asko; Rantala, Heli; Salakoski, Tapio; Salmi, Hannu; Ginter, Filip; Bouma, Gerlof; Adesam, Yvonne
- Building a Large Automatically Parsed Corpus of Finnish (Oslo, Norway, Linköping University Electronic Press, Sweden, pp. 291-300, 2013). Ginter, Filip; Nyblom, Jenna; Laippala, Veronika; Kohonen, Samuel; Haverinen, Katri; Vihjanen, Simo; Salakoski, Tapio; Oepen, Stephan; Hagen, Kristin; Johannessen, Janne Bondi
- Creating register sub-corpora for the Finnish Internet Parsebank (Gothenburg, Sweden, Association for Computational Linguistics, pp. 152-161, 2017). Laippala, Veronika; Luotolahti, Juhani; Kyröläinen, Aki-Juhani; Salakoski, Tapio; Ginter, Filip; Tiedemann, Jörg; Tahmasebi, Nina
- Dep_search: Efficient Search Tool for Large Dependency Parsebanks (Gothenburg, Sweden, Association for Computational Linguistics, pp. 255-258, 2017). Luotolahti, Juhani; Kanerva, Jenna; Ginter, Filip; Tiedemann, Jörg; Tahmasebi, Nina
- Fine-grained Named Entity Annotation for Finnish (Reykjavik, Iceland (Online), Linköping University Electronic Press, Sweden, pp. 135-144, 2021). Luoma, Jouni; Chang, Li-Hsin; Ginter, Filip; Pyysalo, Sampo; Dobnik, Simon; Øvrelid, Lilja
- FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering (University of Tartu Library, 2025-03). Henriksson, Erik; Tarkka, Otto; Ginter, Filip; Johansson, Richard; Stymne, Sara
  Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
- Finnish Paraphrase Corpus (Reykjavik, Iceland (Online), Linköping University Electronic Press, Sweden, pp. 288-298, 2021). Kanerva, Jenna; Ginter, Filip; Chang, Li-Hsin; Rastas, Iiro; Skantsi, Valtteri; Kilpeläinen, Jemina; Kupari, Hanna-Mari; Saarni, Jenna; Sevón, Maija; Tarkka, Otto; Dobnik, Simon; Øvrelid, Lilja
- Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations (University of Tartu Library, 2025-03). Nuutinen, Emil; Rastas, Iiro; Ginter, Filip; Johansson, Richard; Stymne, Sara
  We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset and more generally the MT method through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, as well as through the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but produces consistently better translated data. Given its good performance on the SQuAD dataset, it is likely the method can be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data is available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.
- Is Multilingual BERT Fluent in Language Generation? (Turku, Finland, Linköping University Electronic Press, pp. 29-36, 2019). Rönnqvist, Samuel; Kanerva, Jenna; Salakoski, Tapio; Ginter, Filip; Nivre, Joakim; Derczynski, Leon; Ginter, Filip; Lindi, Bjørn; Oepen, Stephan; Søgaard, Anders; Tidemann, Jörg
- Learning to Extract Biological Event and Relation Graphs (2009-05-11T08:58:27Z). Björne, Jari; Ginter, Filip; Heimonen, Juho; Pyysalo, Sampo; Salakoski, Tapio
- Learning to Extract Biological Event and Relation Graphs (Odense, Denmark, Northern European Association for Language Technology (NEALT), pp. 18-25, 2009). Björne, Jari; Ginter, Filip; Heimonen, Juho; Pyysalo, Sampo; Salakoski, Tapio; Jokinen, Kristiina; Bick, Eckhard
- MULTI-CROSSRE A Multi-Lingual Multi-Domain Dataset for Relation Extraction (University of Tartu Library, 2023-05). Bassignana, Elisa; Ginter, Filip; Pyysalo, Sampo; Goot, Rob van der; Plank, Barbara
- OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches (University of Tartu Library, 2025-03). Kanerva, Jenna; Ledins, Cassandra; Käpyaho, Siiri; Ginter, Filip; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
  Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
- Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers (Odense, Denmark, Northern European Association for Language Technology (NEALT), pp. 65-72, 2009). Haverinen, Katri; Ginter, Filip; Laippala, Veronika; Salakoski, Tapio; Jokinen, Kristiina; Bick, Eckhard
- Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers (2009-05-13T11:07:10Z). Haverinen, Katri; Ginter, Filip; Laippala, Veronika; Salakoski, Tapio
- Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing (Turku, Finland, 2019). Nivre, Joakim; Derczynski, Leon; Ginter, Filip; Lindi, Bjørn; Oepen, Stephan; Søgaard, Anders; Tidemann, Jörg
- Sentence Compression For Automatic Subtitling (Vilnius, Lithuania, Linköping University Electronic Press, Sweden, pp. 135-143, 2015). Luotolahti, Juhani; Ginter, Filip; Megyesi, Beáta
- A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora (Gothenburg, Sweden, Association for Computational Linguistics, pp. 330-333, 2017). Vesanto, Aleksi; Ginter, Filip; Salmi, Hannu; Nivala, Asko; Salakoski, Tapio; Tiedemann, Jörg; Tahmasebi, Nina
- Template-free Data-to-Text Generation of Finnish Sports News (Turku, Finland, Linköping University Electronic Press, pp. 242-252, 2019). Kanerva, Jenna; Rönnqvist, Samuel; Kekki, Riina; Salakoski, Tapio; Ginter, Filip; Hartmann, Mareike; Plank, Barbara
- Towards a Dependency-Based PropBank of General Finnish (Oslo, Norway, Linköping University Electronic Press, Sweden, pp. 41-57, 2013). Haverinen, Katri; Laippala, Veronika; Kohonen, Samuel; Missilä, Anna; Nyblom, Jenna; Ojala, Stina; Viljanen, Timo; Salakoski, Tapio; Ginter, Filip; Oepen, Stephan; Hagen, Kristin; Johannessen, Janne Bondi