Fb 2 LE I: Master's Theses on Offer

Our research group is offering several new master's thesis topics:

1. M.Sc. Transcription with Textual Topical Priors

This master's thesis aims to improve speech2text transcription. The research question is: given a speech2text model/pipeline, a speech sample on a given topic, and a separate text on the same topic, can we improve the transcription?

Current transcription programs are generic and have no prior knowledge of the topic at hand. Professional simultaneous interpreters are sometimes given keywords and an outline of the content before interpreting. Is a similar approach feasible for automatic speech2text pipelines?

Expected Tasks and Skills:

Programming or modifying ML pipelines for audio and text processing
Theoretical knowledge about ML models and performance metrics

Requirements/Questions:

How can speech2text models be improved without retraining/fine-tuning (zero-shot)?
How do transcription models work, and how can they be modified?
Are topic-specific word distributions one-shot learnable, and can they be transferred to audio models?
Can we fix the output in a post-processing step given topical knowledge?
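
To make the last question concrete, here is a minimal sketch of one possible post-processing baseline, not a proposed solution. It assumes the topical prior is simply the set of longer words found in the topic text, and it replaces transcript words that closely resemble one of these terms. The function names, the minimum word length, and the similarity cutoff are illustrative assumptions. A word error rate (WER) helper is included to measure the effect.

```python
import difflib
import re


def topic_vocabulary(topic_text, min_length=6):
    """Collect longer words from the topic text as candidate domain terms.

    min_length is an assumed heuristic to skip common short words.
    """
    words = re.findall(r"[a-zA-Z]+", topic_text.lower())
    return sorted({w for w in words if len(w) >= min_length})


def correct_transcript(transcript, vocabulary, cutoff=0.8):
    """Replace words that closely resemble a topic term with that term.

    Matched words come out lowercased and lose attached punctuation;
    a real pipeline would need to handle both.
    """
    corrected = []
    for word in transcript.split():
        match = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    # Hypothetical example data, not taken from a real transcription system.
    topic = "The spectrogram is passed to a transformer encoder for diarization."
    reference = "the spectrogram is passed to a transformer encoder"
    raw = "the spectogram is passed to a transformar encoder"
    fixed = correct_transcript(raw, topic_vocabulary(topic))
    print("WER before:", word_error_rate(reference, raw))
    print("WER after: ", word_error_rate(reference, fixed))
```

Such string matching can also introduce errors when a correct word happens to resemble a topic term, which is one reason the research question remains open.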

This thesis may be written in German or English (depending on the degree program). The thesis will be supervised by Cyriax Brast and Eicke Godehardt or myself. For further inquiries, please contact Mr Cyriax Brast.

2. Speech Semantic Search and Summarization

The audio of videos or video conferences contains information that users want to search for (e.g. when was something taught in a lecture, or was a given topic addressed in a meeting). While better transcription helps semantic search of audio data, the task differs from merely improving transcription quality. Compressing the content may be more or less susceptible to individual transcription errors, and the search query itself provides a prior.

Expected Tasks and Skills:

Programming or modifying ML pipelines for audio and text processing
Theoretical knowledge about ML models and performance metrics

Requirements/Questions:

Given a speech2text model/pipeline, can we summarize the content of its output? How good is the summary?
Can we make that content semantically searchable? How good is the search?
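
As a starting point for the search question, the following sketch ranks timestamped transcript segments against a query using TF-IDF and cosine similarity. This is only a lexical baseline: it matches words, not meaning, so a real semantic search would likely replace the vectors with learned text embeddings. The segment format `(start_seconds, text)` and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(segments):
    """Build smoothed IDF weights and one TF-IDF vector per segment.

    segments: list of (start_seconds, text) tuples (assumed format).
    """
    docs = [Counter(tokenize(text)) for _, text in segments]
    n = len(docs)
    # Iterating a Counter yields each distinct term once per segment.
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, segments, idf, vectors, top_k=1):
    """Return the top_k (score, start_seconds, text) hits for the query."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), start, text) for vec, (start, text) in zip(vectors, segments)]
    return sorted(scored, key=lambda hit: hit[0], reverse=True)[:top_k]


if __name__ == "__main__":
    # Hypothetical transcript segments of a lecture recording.
    segments = [
        (0.0, "Welcome to the lecture on neural networks"),
        (95.0, "Gradient descent updates the weights"),
        (210.0, "Questions about the exam schedule"),
    ]
    idf, vectors = build_index(segments)
    for score, start, text in search("when is the exam", segments, idf, vectors):
        print(f"{start:7.1f}s  {score:.2f}  {text}")
```

Comparing the hits on a clean transcript with those on an error-prone one gives a first impression of how sensitive search is to transcription errors.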

3. Turn Taking / End of Speech Prediction

During online conversations in video conferences, we want to predict when the speaker changes. Points in the conversation where the speaker can change are called transition relevance places (TRP). In a natural conversation, humans predict possible ends of utterances from tone, pauses, and by anticipating the content semantically.

Illustration of turn-taking events: IPU (Inter-Pausal Unit), Turn (for Speaker A and Speaker B, respectively), P. (within-speaker Pause), Gap, Overlap and Backchannel

CC BY 4.0 by Tú Anh Nguyễn, Eugene Kharitonov, Jade Copet, […], Emmanuel Dupoux

Expected Tasks and Skills:

Programming or modifying ML pipelines for audio
Theoretical knowledge about ML models, neural networks and performance metrics
This thesis requires training one's own models

Requirements / Questions:

Find a suitable dataset.
Program a pipeline (in Python): Audio -> Diarization -> Turn Taking markers to generate training data.
Program a pipeline predicting the end of turn/TRP.
How well can we predict the end of an inter-pausal unit (IPU) / utterance by analyzing the audio of a conversation?
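
The last stage of the Audio -> Diarization -> Turn Taking markers pipeline could be sketched as follows. The code assumes that a diarization tool has already produced speech segments as `(speaker, start, end)` tuples in seconds. This format, the function name, and the 1.0 s backchannel threshold are illustrative assumptions, and the event definitions are simplified compared to the literature.

```python
def turn_taking_events(segments, max_backchannel=1.0):
    """Label the transitions between consecutive speech segments.

    segments: list of (speaker, start, end) tuples in seconds (assumed format).
    Returns a list of (label, start, end) events, where label is one of
    "pause" (silence, same speaker), "gap" (silence, speaker change),
    "overlap" (both speaking) or "backchannel" (short speech inside
    the other speaker's segment; the duration threshold is an assumption).
    """
    if not segments:
        return []
    ordered = sorted(segments, key=lambda s: s[1])
    events = []
    current = ordered[0]
    for speaker, start, end in ordered[1:]:
        cur_speaker, _, cur_end = current
        if end <= cur_end:
            # The segment lies entirely inside the current one,
            # so the current speaker keeps the floor.
            if speaker != cur_speaker:
                label = "backchannel" if end - start <= max_backchannel else "overlap"
                events.append((label, start, end))
            continue
        if start < cur_end:
            if speaker != cur_speaker:
                events.append(("overlap", start, cur_end))
        elif start > cur_end:
            label = "pause" if speaker == cur_speaker else "gap"
            events.append((label, cur_end, start))
        current = (speaker, start, end)
    return events


if __name__ == "__main__":
    # Hypothetical diarization output for two speakers A and B.
    diarization = [
        ("A", 0.0, 2.0),
        ("A", 2.5, 4.0),
        ("B", 3.0, 3.4),
        ("B", 4.6, 6.0),
        ("A", 5.5, 7.0),
    ]
    for event in turn_taking_events(diarization):
        print(event)
```

Gaps and overlaps mark actual speaker changes, so their start times are candidate training targets for an end-of-turn predictor, while pauses provide negative examples where the same speaker continues.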

4. Turn Taking / Next Speaker Prediction

During online conversations in video conferences, we want to predict when the speaker changes. Points in the conversation where the speaker can change are called transition relevance places (TRP). We believe that the next speaker can be identified visually by certain modes of attention/engagement before they take the turn.

Illustration of turn-taking events: IPU (Inter-Pausal Unit), Turn (for Speaker A and Speaker B, respectively), P. (within-speaker Pause), Gap, Overlap and Backchannel

CC BY 4.0 by Tú Anh Nguyễn, Eugene Kharitonov, Jade Copet, […], Emmanuel Dupoux

Expected Tasks and Skills:

Programming or modifying ML pipelines for video
Theoretical knowledge about ML models, neural networks and performance metrics
This thesis requires training one's own models

Requirements / Questions:

Find a suitable dataset.
Find Active Speaker Detection video pipelines to generate training data.
Program a pipeline predicting the next speaker.
How well can we predict who wants to speak next given a video?
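
One way the output of an Active Speaker Detection pipeline might be turned into training labels is sketched below. It assumes the pipeline yields one active speaker id per video frame (`None` when nobody speaks) and that at most one person speaks at a time. Both are simplifying assumptions, as are the function names and the look-ahead horizon. Each frame is labeled with the speaker who takes over within the next `horizon` frames, if any.

```python
def speaker_changes(active):
    """Return (frame, new_speaker) for every change of the active speaker.

    active: list with one speaker id per frame, None for silence
    (assumed output format of an active speaker detection pipeline).
    """
    changes = []
    last = None
    for frame, speaker in enumerate(active):
        if speaker is None:
            continue
        if last is not None and speaker != last:
            changes.append((frame, speaker))
        last = speaker
    return changes


def next_speaker_labels(active, horizon):
    """Label each frame with the speaker taking over within `horizon` frames.

    Frames with no upcoming speaker change inside the horizon get None.
    """
    labels = [None] * len(active)
    # Process later changes first so that the nearest upcoming change wins
    # whenever two changes fall within the same horizon.
    for frame, speaker in reversed(speaker_changes(active)):
        for t in range(max(0, frame - horizon), frame):
            labels[t] = speaker
    return labels


if __name__ == "__main__":
    # Hypothetical per-frame detections for two participants A and B.
    active = ["A", "A", "A", None, "B", "B", "A"]
    print(next_speaker_labels(active, horizon=2))
```

A video model would then be trained to predict these labels from the frames preceding each change, for example from visual cues of the participant who is about to speak.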