In our research group we have several new master's theses to offer:
1. M.Sc. Transcription with Textual Topical Priors
This master's thesis aims to improve speech2text transcriptions. The research question is: given a speech2text model/pipeline, a speech sample on a given topic, and a separate text on the same topic, can we use the text to improve the transcription?
Current transcription programs are generic and have no prior knowledge of the topic being discussed. Professional simultaneous interpreters, by contrast, are sometimes given keywords and an outline of the content before they start interpreting. Is a similar approach feasible for automatic speech2text pipelines?
Expected Tasks and Skills:
- Programming or modifying ML pipelines for audio and text processing
- Theoretical knowledge about ML models and performance metrics
Requirements/Questions:
- How can speech2text models be improved without retraining or fine-tuning (zero-shot)?
- How do transcription models work, and how can they be modified?
- Are topic-specific word distributions learnable from a single example (one-shot), and can they be transferred to audio models?
- Can we fix the output in a post-processing step given topical knowledge?
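The post-processing question can be illustrated with a minimal sketch. It assumes that a topic vocabulary has already been extracted from the accompanying text, and it uses plain string similarity from Python's `difflib` as a stand-in for whatever method the thesis would actually develop. The function names, the example sentence, and the `cutoff` value are all hypothetical. Word error rate (WER) is included as one common metric for checking whether a correction helped.

```python
import difflib


def correct_with_topic_vocab(transcript, topic_vocab, cutoff=0.8):
    """Replace words that closely resemble a topic term with that term.

    Assumption: string similarity is a crude proxy for acoustic confusion.
    """
    vocab = sorted({w.lower() for w in topic_vocab})
    corrected = []
    for word in transcript.split():
        if word.lower() in vocab:
            corrected.append(word)  # already a known topic term
            continue
        match = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


reference = "the patient has a cardiac arrhythmia"
raw = "the patient has a cardiak arrhythmia"
fixed = correct_with_topic_vocab(raw, ["cardiac", "arrhythmia"])
print(word_error_rate(reference, raw), word_error_rate(reference, fixed))
```

A purely string-based correction like this can also damage correct words that happen to resemble a topic term, which is one reason the question is open.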
This thesis may be written in German or English (depending on the degree program). The thesis will be supported by Cyriax Brast and Eicke Godehardt or myself. For further inquiries, please contact Mr Cyriax Brast.
2. Speech Semantic Search and Summarization
Audio data from videos or video conferences contains information that users want to search for (e.g. when was something taught in a lecture, or was a given topic addressed in a meeting). While improved transcription helps semantic search of audio data, the goal here differs from simply improving transcription quality. Compressing the content into a summary might be more or less susceptible to individual transcription errors than the raw transcript, and the search query itself provides priors about the content.
Expected Tasks and Skills:
- Programming or modifying ML pipelines for audio and text processing
- Theoretical knowledge about ML models and performance metrics
Requirements/Questions:
- Given a speech2text model/pipeline, can we summarize the content of the audio? How good is the summary?
- Can we make the content semantically searchable? How good is the search?
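As a rough baseline for the search question, the sketch below ranks timestamped transcript segments against a query. It uses TF-IDF with cosine similarity, which matches words rather than meaning, so it is only a stand-in for a real semantic method such as sentence embeddings. The segment format `(start_seconds, text)`, the function names, and the example data are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on anything that is not alphanumeric."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(segments):
    """Return IDF weights and one TF-IDF vector per transcript segment."""
    docs = [Counter(tokenize(text)) for _, text in segments]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, segments, idf, vectors, top_k=1):
    """Return (start_seconds, score) for the best-matching segments."""
    q = Counter(tokenize(query))
    qvec = {t: tf * idf[t] for t, tf in q.items() if t in idf}
    scored = sorted(
        ((cosine(qvec, v), start) for v, (start, _) in zip(vectors, segments)),
        reverse=True,
    )
    return [(start, score) for score, start in scored[:top_k]]


segments = [
    (0.0, "Welcome to the lecture on linear algebra"),
    (312.5, "Gradient descent minimizes the loss function"),
    (840.0, "Questions about the exam schedule"),
]
idf, vectors = build_index(segments)
hits = search("gradient descent", segments, idf, vectors)
print(hits[0][0])  # start time of the best-matching segment
```

A baseline like this also gives a reference point for the "How good is it?" part: a semantic method should at least beat word matching on queries that paraphrase the spoken content.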
This thesis may be written in German or English (depending on the degree program). The thesis will be supported by Cyriax Brast and Eicke Godehardt or myself. For further inquiries, please contact Mr Cyriax Brast.
3. Turn Taking / End of Speech Prediction
During online conversations in video conferences, we want to predict when the current speaker changes. Points in the conversation where the speaker changes are called transition relevance places (TRP). In a natural conversation, humans predict possible ends of utterances from tone, pauses, and the semantic content of what is said.
CC BY 4.0 by Tú Anh Nguyễn, Eugene Kharitonov, Jade Copet, […], Emmanuel Dupoux
Expected Tasks and Skills:
- Programming or modifying ML pipelines for audio
- Theoretical knowledge about ML models, neural networks and performance metrics
- This thesis requires training one's own models
Requirements / Questions:
- Find a suitable dataset.
- Program a pipeline (in Python): Audio -> Diarization -> Turn-taking markers to generate training data.
- Program a pipeline predicting the end of turn/TRP.
- How well can we predict the end of an inter-pausal unit (IPU) / utterance by analyzing the audio of a conversation?
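The last stage of the training-data pipeline can be sketched as follows, assuming a diarization tool has already produced non-overlapping speaker segments. The `Segment` type, the 0.2 s pause threshold, and the "hold"/"shift" label names are illustrative assumptions, not a fixed specification. Segments of the same speaker separated by a short pause are merged into IPUs, and each IPU end is labeled by whether the next IPU belongs to a different speaker.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def merge_into_ipus(segments, max_pause=0.2):
    """Merge consecutive same-speaker segments whose gap is at most max_pause."""
    ipus = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = ipus[-1] if ipus else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_pause:
            last.end = max(last.end, seg.end)
        else:
            # Copy the segment so the input list is not modified.
            ipus.append(Segment(seg.speaker, seg.start, seg.end))
    return ipus


def label_ipu_ends(ipus):
    """Label each IPU end as 'shift' (speaker changes) or 'hold' (same speaker)."""
    labels = []
    for cur, nxt in zip(ipus, ipus[1:]):
        kind = "shift" if nxt.speaker != cur.speaker else "hold"
        labels.append((cur.end, kind))
    return labels


diarization = [
    Segment("A", 0.0, 1.0),
    Segment("A", 1.1, 2.0),  # short pause: same IPU as the previous segment
    Segment("A", 2.8, 3.5),  # long pause: new IPU, speaker holds the turn
    Segment("B", 3.7, 5.0),  # speaker change
]
ipus = merge_into_ipus(diarization)
labels = label_ipu_ends(ipus)
print(labels)
```

Real conversations contain overlapping speech and backchannels, which this sketch ignores; handling them would be part of the dataset and pipeline work.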
This thesis may be written in German or English (depending on the degree program). The thesis will be supported by Cyriax Brast and Eicke Godehardt or myself. For further inquiries, please contact Mr Cyriax Brast.
4. Turn Taking / Next Speaker Prediction
During online conversations in video conferences, we want to predict when the current speaker changes. Points in the conversation where the speaker changes are called transition relevance places (TRP). We believe that the next speaker can be identified visually, by certain modes of attention or engagement, before they take the turn.
CC BY 4.0 by Tú Anh Nguyễn, Eugene Kharitonov, Jade Copet, […], Emmanuel Dupoux
Expected Tasks and Skills:
- Programming or modifying ML pipelines for video
- Theoretical knowledge about ML models, neural networks and performance metrics
- This thesis requires training one's own models
Requirements / Questions:
- Find a suitable dataset.
- Find Active Speaker Detection video pipelines to generate training data.
- Program a pipeline predicting the next speaker.
- How well can we predict who wants to speak next given a video?
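One way to turn active speaker detection output into training labels is sketched below. It assumes the detection pipeline yields one active speaker ID per video frame, with `None` for frames where nobody speaks. This output format, the function name, and the frame rate are hypothetical. Each speaker change becomes one label `(time_seconds, previous_speaker, next_speaker)`, and a model would then be trained to predict `next_speaker` from the video before that time.

```python
def next_speaker_labels(active_speaker_per_frame, fps=25.0):
    """Derive (time_seconds, previous_speaker, next_speaker) for each speaker change.

    Silent frames (None) are skipped, so a change is registered at the first
    frame in which the new speaker is active.
    """
    labels = []
    previous = None
    for frame_index, speaker in enumerate(active_speaker_per_frame):
        if speaker is None:
            continue
        if previous is not None and speaker != previous:
            labels.append((frame_index / fps, previous, speaker))
        previous = speaker
    return labels


# Hypothetical detection output: A speaks for 2 s, 0.4 s of silence,
# B speaks for 1 s, then A speaks again.
frames = ["A"] * 50 + [None] * 10 + ["B"] * 25 + ["A"] * 25
labels = next_speaker_labels(frames)
print(labels)
```

With labels of this form, a simple way to quantify "how well" is the fraction of speaker changes for which the predicted next speaker matches the labeled one.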
This thesis may be written in German or English (depending on the degree program). The thesis will be supported by Cyriax Brast and Eicke Godehardt or myself. For further inquiries, please contact Mr Cyriax Brast.