
Researcher
Work Package 5
University of Bergen
/ Biography
Samia Touileb is currently a researcher in MediaFutures WP5 on Norwegian Language Technologies. Prior to this she was a Postdoc at the Language Technology Group (LTG), Department of Informatics, at the University of Oslo. She holds a PhD in Information Science with a focus on Natural Language Processing (NLP) from the University of Bergen, and has worked on NLP research and applications for almost a decade.
Her main research interests are information extraction, sentiment analysis, bias and fairness in NLP, and applications of NLP and machine learning methods to tasks within social science research. She mainly works on under-resourced languages such as Norwegian.
/ Publications
Publications from 2020 and before are not direct results of the SFI MediaFutures, but are key results from our team members working on topics related to MediaFutures.
2022

Measuring Harmful Representations in Scandinavian Language Models. Conference. Touileb, Samia; Nozza, Debora. 2022.
Scandinavian countries are perceived as role models when it comes to gender equality. With the advent of pre-trained language models and their widespread usage, we investigate to what extent gender-based harmful and toxic content exists in selected Scandinavian language models. We examine nine models, covering Danish, Swedish, and Norwegian, by manually creating template-based sentences and probing the models for completion. We evaluate the completions using two methods for measuring harmful and toxic completions and provide a thorough analysis of the results. We show that Scandinavian pre-trained language models contain harmful and gender-based stereotypes, with similar values across all languages. This finding goes against the general expectations related to gender equality in Scandinavian countries and shows the possible problematic outcomes of using such models in real-world settings.

Annotating Norwegian language varieties on Twitter for Part-of-speech. Workshop. Mæhlum, Petter; Kåsen, Andre; Touileb, Samia; Barnes, Jeremy. 2022.
Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text and a large amount of dialectal variety. In this paper we present a novel Norwegian Twitter dataset annotated with POS tags. We show that models trained on Universal Dependencies (UD) data perform worse when evaluated against this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that performance on dialectal tweets is comparable to the written standards for some models. Finally, we perform a detailed analysis of the errors that models commonly make on this data.

Occupational Biases in Norwegian and Multilingual Language Models. Workshop. Touileb, Samia; Øvrelid, Lilja; Velldal, Erik. 2022.
In this paper we explore how a demographic distribution of occupations, along gender dimensions, is reflected in pre-trained language models. We give a descriptive assessment of the distribution of occupations, and investigate to what extent these are reflected in four Norwegian and two multilingual models. To this end, we introduce a set of simple bias probes, and perform five different tasks combining gendered pronouns, first names, and a set of occupations from the Norwegian statistics bureau. We show that language-specific models obtain more accurate results, and are much closer to the real-world distribution of clearly gendered occupations. However, we see that none of the models have correct representations of the occupations that are demographically balanced between genders. We also discuss the importance of the data on which the models were trained, and argue that template-based bias probes can sometimes be fragile: a simple alteration in a template can change a model's behavior.
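The template-based bias probes described above can be illustrated with a minimal sketch: gendered pronouns are combined with a list of occupations to generate probe sentences that a language model would then be asked to score or complete. The template wording and occupation list here are illustrative placeholders, not the actual data used in the paper.

```python
# Illustrative sketch of template-based bias probing (not the paper's actual
# templates or occupation list from the Norwegian statistics bureau).

PRONOUNS = {"female": "hun", "male": "han"}  # Norwegian "she" / "he"
OCCUPATIONS = ["sykepleier", "ingeniør", "lærer"]  # nurse, engineer, teacher


def build_probes(template="{pronoun} jobber som {occupation}."):
    """Return one probe sentence per (gender, occupation) pair.

    Each probe would be fed to a language model; comparing the model's
    scores across genders for the same occupation exposes skew.
    """
    return [
        (gender, occ, template.format(pronoun=pron.capitalize(), occupation=occ))
        for gender, pron in PRONOUNS.items()
        for occ in OCCUPATIONS
    ]


probes = build_probes()
```

As the paper notes, such probes can be fragile: rephrasing the template (e.g. "jobber som" vs. "arbeider som") may change which completions a model prefers, so results should be checked across several template variants.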
2021

Using Gender- and Polarity-Informed Models to Investigate Bias. Inproceedings. Touileb, Samia; Øvrelid, Lilja; Velldal, Erik. In: Association for Computational Linguistics, 2021.

Using Gender- and Polarity-informed Models to Investigate Bias. Working paper. Touileb, Samia; Øvrelid, Lilja; Velldal, Erik. 2021.
2020

Gender and sentiment, critics and authors: a dataset of Norwegian book reviews. Journal Article. Touileb, Samia; Øvrelid, Lilja; Velldal, Erik. In: Gender Bias in Natural Language Processing. Association for Computational Linguistics, 2020. (Pre SFI).
Gender bias in models and datasets is widely studied in NLP. The focus has usually been on analysing how females and males express themselves, or how females and males are described. However, a less studied aspect is the combination of these two perspectives: how females and males describe the same or the opposite gender. In this paper, we present a new gender-annotated sentiment dataset of critics reviewing the works of female and male authors. We investigate if this newly annotated dataset contains differences in how the works of male and female authors are critiqued, in particular in terms of positive and negative sentiment. We also explore the differences in how this is done by male and female critics. We show that there are differences in how critics assess the works of authors of the same or opposite gender. For example, male critics rate crime novels written by females, and romantic and sentimental works written by males, more negatively.

Identifying Sentiments in Algerian Code-switched User-generated Comments. Conference. Adouane, Wafia; Touileb, Samia; Bernardy, Jean-Philippe. 2020. (Pre SFI).
We present in this paper our work on Algerian, an under-resourced North African colloquial Arabic variety, for which we built a comparably large corpus of more than 36,000 code-switched user-generated comments annotated for sentiments. We opted for this data domain because Algerian is a colloquial language with no existing freely available corpora. Moreover, we compiled sentiment lexicons of positive and negative unigrams and bigrams reflecting the code-switches present in the language. We compare the performance of four models on the task of identifying sentiments, and the results indicate that a CNN model trained end-to-end better fits our unedited, code-switched data, which is unbalanced across the predefined sentiment classes. Additionally, injecting the lexicons as background knowledge to the model boosts its performance on the minority class with a gain of 10.54 points on the F-score. The results of our experiments can be used as a baseline for future research on Algerian sentiment analysis.

Named Entity Recognition without Labelled Data: A Weak Supervision Approach. Journal Article. Lison, Pierre; Hubin, Aliaksandr; Barnes, Jeremy; Touileb, Samia. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1518–1533, 2020. (Pre SFI).
Named Entity Recognition (NER) performance often degrades rapidly when applied to target domains that differ from the texts observed during training. When in-domain labelled data is available, transfer learning techniques can be used to adapt existing NER models to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision. The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain. These annotations are then merged together using a hidden Markov model which captures the varying accuracies and confusions of the labelling functions. A sequence labelling model can finally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) and demonstrate an improvement of about 7 percentage points in entity-level F1 scores compared to an out-of-domain neural NER model.
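The weak-supervision pipeline summarised above can be sketched in a few lines: several heuristic labelling functions each annotate the same tokens, and their noisy votes are merged into one annotation. The paper merges votes with a hidden Markov model that learns each function's reliability; in this toy sketch a plain majority vote stands in for that step, and the labelling functions themselves are invented for illustration.

```python
from collections import Counter

# Toy labelling functions: each maps a token list to per-token labels.
# These heuristics are illustrative stand-ins, not the paper's functions.

GAZETTEER = {"Touileb", "Lison"}  # tiny made-up name list


def lf_capitalised(tokens):
    """Guess PER for any capitalised token (deliberately noisy)."""
    return ["PER" if t[0].isupper() else "O" for t in tokens]


def lf_gazetteer(tokens):
    """PER only for tokens found in a small name gazetteer."""
    return ["PER" if t in GAZETTEER else "O" for t in tokens]


def lf_default_o(tokens):
    """A conservative function that never predicts an entity."""
    return ["O" for _ in tokens]


def merge_votes(tokens, label_fns):
    """Merge labelling-function outputs by per-token majority vote.

    The paper instead fits a hidden Markov model over the functions'
    outputs, weighting each by its estimated accuracy; majority voting
    is the simplest possible aggregator.
    """
    votes = [fn(tokens) for fn in label_fns]
    return [Counter(column).most_common(1)[0][0] for column in zip(*votes)]


tokens = ["Touileb", "works", "in", "Bergen"]
labels = merge_votes(tokens, [lf_capitalised, lf_gazetteer, lf_default_o])
```

The merged labels would then serve as training data for a sequence labelling model. Note how "Bergen" is outvoted to "O" here because only one function fires on it; this is exactly the kind of error the HMM aggregation in the paper is designed to mitigate by trusting more reliable functions more.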
2019

Lexicon information in neural sentiment analysis: a multi-task learning approach. Conference. Barnes, Jeremy; Touileb, Samia; Øvrelid, Lilja; Velldal, Erik. Linköping University Electronic Press, 2019. (Pre SFI).
This paper explores the use of multi-task learning (MTL) for incorporating external knowledge in neural models. Specifically, we show how MTL can enable a BiLSTM sentiment classifier to incorporate information from sentiment lexicons. Our MTL set-up is shown to improve model performance (compared to a single-task set-up) on both English and Norwegian sentence-level sentiment datasets. The paper also introduces a new sentiment lexicon for Norwegian.
2018

NoReC: The Norwegian Review Corpus. Proceedings. Velldal, Erik; Øvrelid, Lilja; Bergem, Eivind Alexander; Stadsnes, Cathrine; Touileb, Samia; Jørgensen, Fredrik. 2018. (Pre SFI).
https://repo.clarino.uib.no/xmlui/handle/11509/124
2017

Automatic identification of unknown names with specific roles. Journal Article. Touileb, Samia; Pedersen, Truls; Sjøvaag, Helle. In: Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 150–158, 2017. (Pre SFI).
Automatically identifying persons in a particular role within a large corpus can be a difficult task, especially if you don't know who you are actually looking for. Resources compiling names of persons can be available, but no exhaustive lists exist. Moreover, such lists usually contain known names that are "visible" in the national public sphere, and tend to ignore marginal and international ones. In this article we propose a method for automatically generating suggestions of names found in a corpus of Norwegian news articles that "naturally" belong to a given initial list of members but were not known (compiled in a list) beforehand. The approach is based, in part, on the assumption that surface-level syntactic features reveal parts of the underlying semantic content and can help uncover the structure of the language.