Prof. Dr. Alexander M. Fraser

TU Munich - Chair for Data Analytics & Statistics

Open Topics

We offer multiple Bachelor/Master theses, Guided Research projects and IDPs in the area of natural language processing.

A non-exhaustive list of open topics is given below, each with a potential supervisor. Please contact the potential supervisors directly.

___

Translation of Low-Resource Languages and Dialects

Type

Bachelor Thesis / Master Thesis / Guided Research

Requirements

Description

Machine translation is near human-level performance for high-resource languages, but for low-resource languages and dialects there is still a lot of room for improvement. The goal of this project is to work on a low-resource language or dialect of your choosing and to establish or improve on the state of the art for translating to and from another language. There are several ways to go about this, such as gathering data and constructing a new high-quality dataset using alignment methods, or transferring knowledge from a related high-resource language, among many others.

Contact

Lukas Edman

References

___

Linguistic Gloss Generation for Low-Resource Languages

Type

Master's thesis / Bachelor's thesis

Prerequisites

Description

Linguistic glosses (or interlinear glosses) are annotations that express the meaning and grammatical phenomena of a source language (e.g., 'Hund-e' in German would be annotated as 'dog-PLURAL' in English). These annotations are scarce and costly to obtain; the aim of gloss generation is hence to predict glosses automatically, especially for low-resource languages. The main challenges stem from the small amount of training data and the lexical diversity in sentences. Glosses may also help machine translation (cf. [4]), since they can act as a bridge between the source and target languages.
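
As a small illustration of the annotation format, the German example above can be represented as morpheme-level pairs. This is only a sketch: the function name `align_gloss` and the assumption that source and gloss both use a hyphen as the morpheme separator are ours and are not taken from any of the cited systems.

```python
def align_gloss(source: str, gloss: str, sep: str = "-") -> list[tuple[str, str]]:
    """Pair each source morpheme with its gloss label.

    Assumes both strings are segmented with the same separator and
    contain the same number of segments (an assumption of this sketch).
    """
    morphemes = source.split(sep)
    labels = gloss.split(sep)
    if len(morphemes) != len(labels):
        raise ValueError("source and gloss must have the same number of segments")
    return list(zip(morphemes, labels))


pairs = align_gloss("Hund-e", "dog-PLURAL")
print(pairs)  # [('Hund', 'dog'), ('e', 'PLURAL')]
```

A gloss generation model would have to predict the second element of each pair from the first, typically with very little training data.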

Contact

Shu Okabe (first.last@tum.de)

References

  1. Statistical gloss generation: Automating Gloss Generation in Interlinear Glossed Text (https://aclanthology.org/2020.scil-1.42/)
  2. Neural gloss generation: Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations (https://aclanthology.org/2020.coling-main.471/)
  3. Shared task on automatic glossing: Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing (https://aclanthology.org/2023.sigmorphon-1.20/)
  4. Glosses as bridge for machine translation: Using Interlinear Glosses as Pivot in Low-Resource Multilingual Machine Translation (https://arxiv.org/abs/1911.02709)

Cross-Cultural NLP

Type

Master’s Thesis / Bachelor’s Thesis / Guided Research

Prerequisites

Description

One of the core interests in our group is multilingual language models. However, there is a serious concern that these models are dominated by English-language data and particularly American cultural norms.

In this project, you would evaluate open models on CulturalBench [1] or similar datasets, and explore how they can become better at such tasks. Has the model never seen data relevant to a specific question, or is information from another culture simply overriding this information? This project can include finding additional useful data and fine-tuning a model, but also quicker modifications such as few-shot learning and prompt engineering.

Contact

Kathy Hämmerl (haemmerl [at] cis.lmu.de)

References

[1] CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack of) Multicultural Knowledge (https://arxiv.org/pdf/2404.06664)

[2] Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art (https://arxiv.org/abs/2406.03930)

[3] Speaking Multiple Languages Affects the Moral Bias of Language Models (https://aclanthology.org/2023.findings-acl.134/)

[4] NLPositionality: Characterizing Design Biases of Datasets and Models (https://aclanthology.org/2023.acl-long.505/)

Evaluation in Text Style Transfer

Type

Master’s Thesis / Bachelor’s Thesis / Guided Research

Prerequisites

● Enthusiasm (for publishing results at a conference/workshop)

● Proficiency in speaking and writing English

● Good Python programming background (e.g., knowledge of numpy, pandas, sklearn libraries)

● Basic knowledge of ML/NLP (e.g., understanding how a classifier works, knowledge of transformer architecture)

● Basic command of PyTorch and Transformers libraries is recommended

Description

Text style transfer (TST) aims to rewrite a text from a source style into a target style while preserving its meaning and fluency. For example, rewriting the informal sentence "Sorry about that" as the formal "I apologize for the inconvenience caused" can be challenging. The difficulty arises because, although the two sentences are semantically consistent, this consistency is not conveyed through the exact words used, but rather through an underlying semantic space.

Bachelor students are expected to understand the evaluation metrics used in the TST task and to conduct a detailed analysis of current metrics in a low-resource language scenario. For further details, please refer to Babakov et al. (2022) and Ostheimer et al. (2024). Master students additionally need a deeper understanding of the principles of TST evaluation and should attempt to propose an innovative evaluation method similar to BERTScore (Zhang et al., 2019).
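
The point about exact words can be made concrete with a surface-overlap score. The sketch below is an illustration under our own assumptions, not a metric from the cited papers: `unigram_f1` is a hypothetical helper that scores exact (lowercased, whitespace-tokenized) word overlap. It gives the example pair a score of zero even though the meaning is preserved.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1 based on exact (lowercased) word overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: number of tokens shared by both sentences.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


informal = "Sorry about that"
formal = "I apologize for the inconvenience caused"
print(unigram_f1(formal, informal))  # 0.0: no shared words despite the same meaning
```

Embedding-based metrics such as BERTScore compare tokens in a learned vector space instead, which is why they are a natural starting point for content-preservation evaluation in TST.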

Contact

Wen Lai (lavine [at] cis.lmu.de)

References

[1] A large-scale computational study of content preservation measures for text style transfer and paraphrase generation (https://aclanthology.org/2022.acl-srw.23)

[2] Text Style Transfer Evaluation Using Large Language Models (https://aclanthology.org/2024.lrec-main.1373)

[3] Bertscore: Evaluating text generation with bert (https://arxiv.org/abs/1904.09675)

[4] Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs (https://arxiv.org/abs/2406.04460)

Segmentation and Morphological Features in Machine Translation

Type

Bachelor Thesis / Master Thesis / Guided Research

Requirements

Description

Morphologically rich languages like Finnish or Turkish condense much information within one word. This leads to data sparsity problems, as many inflected forms are insufficiently covered in the training data. Word segmentation approaches such as BPE do not optimally capture morphological patterns. An alternative approach is linguistically guided word segmentation.

The thesis topic consists of exploring segmentation approaches for (low-resource) morphologically rich languages, in combination with integrating relevant morpho-syntactic information in an NMT scenario.
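
To make the contrast with purely statistical subword methods concrete, the sketch below shows one very simple form of linguistically guided segmentation: greedy suffix stripping against a hand-written suffix inventory. The function `segment`, the `min_stem` parameter, and the two-suffix Finnish inventory are illustrative assumptions; a thesis would more likely build on a proper morphological analyzer.

```python
def segment(word: str, suffixes: list[str], min_stem: int = 3) -> list[str]:
    """Greedily strip known suffixes from the end of a word.

    `suffixes` is a toy inventory supplied by the caller; a real system
    would rely on a morphological analyzer instead.
    """
    parts: list[str] = []
    stripped = True
    while stripped:
        stripped = False
        # Try longer suffixes first so that the longest match wins.
        for suffix in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                parts.insert(0, suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + parts


# Finnish 'taloissa' ("in the houses"): talo (house) + i (plural) + ssa (inessive)
print(segment("taloissa", ["ssa", "i"]))  # ['talo', 'i', 'ssa']
```

Each resulting segment corresponds to a morpheme, so morpho-syntactic labels (here: plural, inessive) could be attached to the segments and passed to the NMT model as additional information.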

Contact

Marion Di Marco (marion.dimarco [at] tum.de)

References

___

Multilingual and Cross-lingual Text Detoxification

Type

Bachelor Thesis / Master Thesis / Guided Research

Requirements

Description

One application of text style transfer is detoxification: rewriting texts from a toxic into a non-toxic style. Currently, parallel training data is available for 9 languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. However, there has been little investigation of cross-lingual knowledge transfer for text detoxification. This project aims to explore how much data, and in which languages, is needed to obtain a text detoxification model for a target language, which models/modules are best suited to achieve this, and whether LLMs can already solve the task.

Contact

daryna.dementieva@tum.de

References

___

Data-Efficient Hate Speech Detection Using Multi-task Learning

Type:

Master’s Thesis / Bachelor’s Thesis / Guided Research

Requirements:

Description:

Hate speech propagation is a major issue, and detecting it remains challenging, especially across different languages and contexts. One key difficulty is gathering enough labeled data, which makes it useful to leverage existing datasets. However, determining which datasets or specific instances are most useful is crucial, as tuning a model on all available data can be too costly. This project aims to identify the best instances from multiple datasets for a targeted hate speech task. The novel aspect of the project could be utilizing label/task information or focusing on multilingualism, particularly low-resource languages; other ideas of your own for detecting hate speech are also welcome.

Contact:

Faeze Ghorbanpour (firstname.lastname [at] lmu.de)

Reference: