Brief Introduction
Visually impaired players rely heavily on audio-mediated information when playing interactive games; sound output allows them, for example, to orient themselves and solve spatial puzzles. Producing such sound-rich environments is time-consuming, requiring the collection of a large number of sound samples and their subsequent processing. This processing is often relatively easy to automate, e.g., by applying equalization or by randomizing synthesis parameters to create multiple variations of a sound, such as footsteps on different surfaces (clay, sand, pavement, etc.). The subject of this work is a review and proposal of sound synthesis approaches driven by a designer's textual description, covering analytical methods as well as modern approaches based on generative neural networks.
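The parameter-randomization idea mentioned above can be illustrated with a short sketch: each call draws a new decay rate, resonance frequency, and amplitude for a noise burst, so every call produces a slightly different footstep-like variation. All function names and parameter ranges here are illustrative assumptions, not part of the thesis pipeline.

```python
import numpy as np

def footstep_variation(rng, sr=22050, duration=0.25):
    """Generate one footstep-like sound by randomizing synthesis parameters.

    Illustrative ranges only: decay rate, resonance centre frequency, and
    amplitude are drawn anew on each call, loosely mimicking a hard surface.
    """
    n = int(sr * duration)
    t = np.arange(n) / sr

    # Randomized parameters -> each call yields a distinct variation.
    decay = rng.uniform(20.0, 60.0)      # envelope decay rate (1/s)
    centre = rng.uniform(800.0, 2000.0)  # "surface" resonance (Hz)
    amp = rng.uniform(0.6, 1.0)          # overall loudness

    noise = rng.standard_normal(n)
    # Crude resonance: noise modulated by a sinusoid at the centre frequency.
    carrier = np.sin(2 * np.pi * centre * t)
    envelope = np.exp(-decay * t)
    return amp * envelope * noise * carrier

rng = np.random.default_rng(0)
variants = [footstep_variation(rng) for _ in range(4)]
```

A real pipeline would replace the modulated noise with proper filtering and sample-based processing, but the shape is the same: one deterministic core plus a handful of randomized parameters yields an unlimited set of variations.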
Thesis Goals
- To conduct a comprehensive review of existing sound synthesis methods, from analytical to modern neural network approaches.
- To consult the target user group (visually impaired gamers) and specify their requirements for in-game sound effects.
- To create and evaluate a model for the automatic generation of sound effects from a textual description that takes the needs of the target group into account.
Documents and Project Outputs
- For GitHub access, contact me at hasan.norbert99@gmail.com or hasan4@uniba.sk.
- Project Proposal Presentation (PDF) (May 2025)
- Master's Thesis Report (PDF)
- Source Codes / Frameworks (GitHub) (Continuously updated)
- Current versions of experiments and partial results will be documented in the repository.
Literature
- GANSynth: Adversarial Neural Audio Synthesis
- Generative adversarial network synthesis of hyperspectral vegetation data
- SpecSinGAN: Sound Effect Variation Synthesis Using Single-Image GANs
- Neural Synthesis of Sound Effects Using Flow-Based Deep Generative Models
- Large Scale GAN Training for High Fidelity Natural Image Synthesis
- Wasserstein generative adversarial networks
- Alias-Free Generative Adversarial Networks
- Denoising Diffusion Probabilistic Models
- Generative Adversarial Networks
- Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study
- Improved Training of Wasserstein GANs
- Attention Is All You Need
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
- Signal estimation from modified short-time Fourier transform
Work Progress and Task Calendar
Completed Tasks:
- [Feb 2025 - Apr 2025] In-depth literature review and study of scientific articles (Neural Network Sound Synthesis, GANSynth, SpecSinGAN, WaveFlow for SFX).
- [Apr 2025 - May 2025] Preparation of software frameworks and simple testing of basic sound synthesis concepts.
- [May 2025] Development of the thesis proposal and preparation of the presentation.
- [Jul 2025 - Sep 2025] Detailed specification of requirements based on consultations with visually impaired gamers.
- [October 2025] Finalization of the strategy for collecting/curating datasets (text descriptions, sound effects).
- [November 2025] Prototyping of the basic model: text encoder and simple sound generator.
- [December 2025] Began training the main model at home (hardware bottleneck); development is almost complete, and further steps depend on the results. Access to the KAI servers is needed for full training. Wrote the theoretical background chapter and part of the implementation chapter.
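The November prototype (a text encoder feeding a simple sound generator) can be sketched as follows. The bag-of-words encoder and the fixed random projection are stand-ins for the real components (e.g. a Sentence-BERT encoder and a trained neural decoder); the vocabulary and dimensions are illustrative assumptions chosen only to show the pipeline shape.

```python
import numpy as np

VOCAB = ["footsteps", "sand", "clay", "pavement", "rain", "wind"]
EMB_DIM, N_MELS, N_FRAMES = 16, 64, 32

rng = np.random.default_rng(42)
# Stand-in for a learned text encoder (e.g. Sentence-BERT in the thesis).
word_vectors = rng.standard_normal((len(VOCAB), EMB_DIM))
# Stand-in for a trained generator mapping embeddings to mel-spectrograms.
projection = rng.standard_normal((EMB_DIM, N_MELS * N_FRAMES))

def encode(text):
    """Toy bag-of-words encoder: mean of vectors for known words."""
    idx = [VOCAB.index(w) for w in text.lower().split() if w in VOCAB]
    if not idx:
        return np.zeros(EMB_DIM)
    return word_vectors[idx].mean(axis=0)

def generate_spectrogram(text):
    """Map a text embedding to an (N_MELS, N_FRAMES) 'spectrogram'."""
    emb = encode(text)
    flat = np.tanh(emb @ projection)  # squash values into [-1, 1]
    return flat.reshape(N_MELS, N_FRAMES)

spec = generate_spectrogram("footsteps on sand")
```

In the actual model the projection is replaced by a trained generator, and the resulting mel-spectrogram is converted to a waveform by a vocoder (e.g. HiFi-GAN) or Griffin-Lim-style phase reconstruction, both of which appear in the literature list above.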
Planned / In Progress Tasks:
- [December 2025 - February 2026] Further development of the main model, depending on intermediate results. Training of the main text-to-sound model is planned around Christmas, when the KAI servers (with more powerful GPUs) should be less busy.
- [February 2026 - March 2026] Rigorous evaluation of the model including user studies with the target group.
- [February 2026 - May 2026] Writing the master's thesis, final revisions, and preparation for defense.