Browse by Author "Szawerna, Maria Irena"

Now showing 1 - 3 of 3

Open access
"I Need More Context and an English Translation": Analysing How LLMs Identify Personal Information in Komi, Polish, and English
(University of Tartu Library, 2025-03) Ilinykh, Nikolai; Szawerna, Maria Irena; Tudor, Crina Madalina; Debess, Iben Nyholm; Bruton, Micaella; Scalvini, Barbara; Ilinykh, Nikolai; Holdt, Špela Arhar
Automatic identification of personal information (PI) is particularly difficult for languages with limited linguistic resources. Recently, large language models (LLMs) have been applied to various tasks involving low-resourced languages, but their capability to process PI in such contexts remains under-explored. In this paper we provide a qualitative analysis of the outputs from three LLMs prompted to identify PI in texts written in Komi (Permyak and Zyrian), Polish, and English. Our analysis highlights challenges in using pre-trained LLMs for PI identification in both low- and medium-resourced languages. It also motivates the need to develop LLMs that understand the differences in how PI is expressed across languages with varying levels of availability of linguistic resources.
Open access
The Devil’s in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling
(University of Tartu Library, 2025-03) Szawerna, Maria Irena; Dobnik, Simon; Muñoz Sánchez, Ricardo; Volodina, Elena; Johansson, Richard; Stymne, Sara
In this paper, we experiment with the effect of different levels of detailedness or granularity, understood as i) the number of classes and ii) the classes’ semantic depth in the sense of hypernym and hyponym relations, of the annotation of Personally Identifiable Information (PII) on automatic detection and labeling of such information. We fine-tune a Swedish BERT model on a corpus of Swedish learner essays annotated with a total of six PII tagsets at varying levels of granularity. We also investigate whether the presence of grammatical and lexical correction annotation in the tokens and class prevalence have an effect on predictions. We observe that the fewer total categories there are, the better the overall results are, but having a more diverse annotation facilitates fewer misclassifications for tokens containing correction annotation. We also note that the classes’ internal diversity has an effect on labeling. We conclude from the results that while labeling based on the detailed annotation is difficult because of the number of classes, it is likely that models trained on such annotation rely more on the semantic content captured by contextual word embeddings than on just the form of the tokens, making them more robust against nonstandard language.
Open access
Towards Shared Standards for Pseudonymization of Research Data
(Tartu University Library, 2025) Volodina, Elena; Dobnik, Simon; Lindström Tiedemann, Therese; Muñoz Sánchez, Ricardo; Szawerna, Maria Irena; Södergård, Lisa; Nermo, Magnus; Papadopoulou Skarp, Frantzeska; Tienken, Susanne; Widholm, Andreas; Blåder, Anna; Verhagen, Harko; Fridlund, Mats
The article introduces the key concepts in pseudonymization, summarizes the half-way findings in the project Mormor Karl, and proposes several ways to unify and standardize the field of pseudonymization.