Arab World Regional Special Section
HCI

Unlocking the Potential of Arabic Voice-Generation Technologies

Addressing linguistic complexities, the scarcity of high-quality datasets, and other challenges is crucial for advancing Arabic text-to-speech technology.


Voice-generation technology enables machines to synthesize human-like speech—text-to-speech (TTS)—revolutionizing digital communication by fostering more inclusive and accessible experiences. What began as simple robotic speech synthesis has evolved into highly sophisticated voice-cloning systems that can produce natural, coherent, expressive, and personalized voices from minimal data. These technologies enable cross-lingual communication through virtual agents, help people with visual or speech impairments or literacy challenges through assistive tools, and support educators and industries such as entertainment with creative content generation.

Potential Applications of Arabic TTS.  With more than 422 million speakers worldwide, the potential of Arabic voice technologies is vast. In education, AI-powered tools such as ArabicTutor,12 Numue, and QVoice10 provide real-time spoken feedback, making language learning more interactive. For accessibility, services such as NVDA and CogentIfotech enable visually impaired persons and healthcare recipients (as well as providers) to access digital content and health records. Furthermore, Arabic voice generation supports the creation of culturally appropriate media content and enables more natural spoken interaction with Arabic large language models (LLMs), such as Fanar, for seamless communication.

Challenges in Arabic.  Despite this potential, developing Arabic TTS remains challenging due to the language's morphological richness, phonetic complexity, and diglossic nature. This complexity is further amplified by Arabic's status as a shared language across 22 countries, encompassing more than 20 mutually unintelligible dialects. The gap between Modern Standard Arabic (MSA) and Dialectal Arabic (DA) poses additional challenges for model training, as spoken Arabic varieties differ widely across regions in both phonology and lexicon. Unlike MSA, DA lacks a standardized orthographic norm, making resources inconsistent and difficult to use. Moreover, the absence of diacritics in written Arabic creates pronunciation ambiguities, requiring additional preprocessing, primarily diacritic recovery, to ensure accurate and natural synthesized speech. These linguistic complexities, together with the scarcity of high-quality Arabic datasets, hinder progress in voice-generation models, and addressing them is crucial for advancing Arabic TTS technology.
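
To make the diacritization problem concrete, the short snippet below (illustrative only and not tied to any particular toolkit) shows how one undiacritized spelling maps to several valid diacritized readings, each pronounced differently; a TTS front end must recover the intended reading from context before synthesis.

```python
# Illustrative only: the undiacritized spelling "كتب" admits several valid
# readings with different pronunciations and meanings.
readings = {
    "كَتَبَ": "kataba (he wrote)",
    "كُتُب": "kutub (books)",
    "كُتِبَ": "kutiba (it was written)",
}

undiacritized = "كتب"
for diacritized, gloss in readings.items():
    print(f"{undiacritized} -> {diacritized}: {gloss}")
```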

Diacritization, Datasets, and Model Design.  Despite Arabic’s extensive research history in natural language processing (NLP) and speech recognition,8 voice generation remains mostly underexplored. Diacritization is a crucial component in Arabic TTS, with its modeling approaches ranging from rule-based methods to deep learning. State-of-the-art models either use surface, morphological, and syntactic features7 or rely exclusively on character sequences, with the latter approach achieving the best results.11,15 While multilingual LLMs such as ChatGPT and Gemini show some promise for diacritization, Arabic-centric NLP tools continue to yield better results. Tools including Farasa,1 CAMeL,17 and MADAMIRA18 offer critical support for diacritization, with datasets such as Arabic Penn Treebank, CATT, Tashkeela, and WikiNews driving further research.
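
As a concrete illustration of how such tools are typically applied, the sketch below recovers diacritics with the pretrained maximum-likelihood disambiguator shipped with CAMeL Tools.17 It is a minimal sketch that assumes the camel_tools package and its pretrained data have been installed; the input sentence is arbitrary.

```python
# A minimal diacritization sketch assuming the camel_tools package and its
# pretrained data are installed (pip install camel-tools, then the camel_data
# downloader for the default disambiguation model).
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.word import simple_word_tokenize

mle = MLEDisambiguator.pretrained()                     # default pretrained model
tokens = simple_word_tokenize("ذهب الولد إلى المدرسة")    # "The boy went to school"
disambiguated = mle.disambiguate(tokens)

# Take the top-scoring analysis for each word and read off its diacritized form.
diacritized = [d.analyses[0].analysis["diac"] for d in disambiguated]
print(" ".join(diacritized))
```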

Arabic voice generation has evolved from rule-based and concatenative methods to advanced deep-learning architectures, significantly improving naturalness and expressiveness. Modern TTS systems typically extract linguistic features from input text (linguistic analysis), predict continuous or discrete speech representations from those features (acoustic modeling), and then convert the representations into waveforms (vocoding). Recent approaches simplify or bypass the linguistic analysis,5 removing the dependency on linguistic resources. Alternatively, some models adopt an end-to-end architecture that generates waveforms directly from textual input. For more technical details, see Alrige et al.5
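
For orientation, the placeholder sketch below mirrors that three-stage flow. Every function here is a hypothetical stub that only shows how data moves from text to waveform; real systems substitute trained models (for example, Tacotron- or FastSpeech-style acoustic models and neural vocoders) for each stage.

```python
import numpy as np

# Hypothetical stubs for the classic three-stage TTS pipeline; each stage
# returns dummy data of a plausible shape to illustrate the data flow only.

def linguistic_analyzer(text: str) -> list:
    # Real front ends perform normalization, diacritization, and
    # grapheme-to-phoneme conversion; here we just split into characters.
    return list(text)

def acoustic_model(units: list) -> np.ndarray:
    # Real acoustic models predict mel-spectrogram frames from linguistic
    # units; here we emit random frames with 80 mel bins.
    n_frames = 10 * len(units)
    return np.random.randn(n_frames, 80)

def vocoder(mel: np.ndarray) -> np.ndarray:
    # Real vocoders render a waveform from the mel frames; here we emit a
    # dummy signal with 256 samples per frame.
    return np.random.randn(mel.shape[0] * 256)

text = "مرحبا بالعالم"  # "Hello, world"
waveform = vocoder(acoustic_model(linguistic_analyzer(text)))
print(waveform.shape)
```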

A major bottleneck is still the scarcity of high-quality Arabic speech datasets. Several corpora have been developed, including ASC (the first open-source Arabic TTS dataset), read-speech data collected as part of the NEMLAR project,21 and the classical Arabic corpus ClArTTS.13 Additionally, Al-Radhi et al.4 created a small, audiovisual, phonetically annotated Arabic corpus to develop a vocoder, and TunArTTS14 introduced a small single-speaker (male) Tunisian Arabic dataset.

Multilingual foundation models have become a powerful catalyst for advancing Arabic TTS. Models including YourTTS and XTTS-v2, trained on diverse multilingual corpora, effectively transfer prosodic and phonetic knowledge to low-resource languages such as Arabic. They significantly reduce the need for large Arabic-specific datasets and enable adaptation across dialects and speakers. To address resource limitations, researchers have also fine-tuned pretrained Arabic transformer models such as ArTST19 for TTS. For a comprehensive review, refer to Alrige et al.5 and Chemnad and Othman.6 Adaptation to new voices or dialects can likewise be achieved with just a handful of training examples using zero-shot and few-shot voice-generation techniques, where zero- or few-shot learning refers to a model's ability to replicate voices with minimal data.20 However, their application to Arabic remains largely underexplored. Notably, Doan et al.9 used the QASR corpus16 and XTTS-v2 to lay the groundwork for future research in low-resource Arabic voice cloning.
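
To give a sense of how little Arabic-specific tooling zero-shot cloning now requires, the sketch below uses the open source Coqui TTS package with its multilingual XTTS-v2 checkpoint, which lists Arabic among its supported languages. The package, checkpoint identifier, and file paths are assumptions about a third-party library, not part of the works cited above; the reference clip stands in for a few seconds of the target speaker's audio.

```python
# A minimal zero-shot voice-cloning sketch assuming the Coqui TTS package
# (pip install TTS) and its multilingual XTTS-v2 checkpoint; file paths are
# placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="مرحبا بكم في عالم توليد الكلام",  # "Welcome to the world of speech generation"
    speaker_wav="reference_speaker.wav",     # short clip of the voice to clone (placeholder)
    language="ar",
    file_path="cloned_arabic.wav",
)
```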

The Commercial and Academic Push toward Arabic Speech Technology.  Even though open source Arabic TTS faces resource scarcity, commercial alternatives have flourished. Major technology companies such as Microsoft, Google, Amazon, OpenAI, and IBM provide Arabic TTS solutions in MSA, while very few offer dialectal coverage. For example, Microsoft covers 16 Arab-speaking countries, and Amazon Polly covers standard Arabic and Gulf Arabic as part of its cloud-based AI services. Other companies, such as iSpeech, ElevenLabs, and Kanari, have also developed MSA Arabic TTS products tailored to various use cases. However, most commercial TTS efforts still face challenges, particularly in handling diacritization. For instance, Abdelali et al.3 found that Amazon Polly's Zeina (MSA) voice, though widely regarded as accurate, performs comparably to QCRI-Kanari TTS in intelligibility and naturalness yet continues to struggle with accurate diacritization.
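
For comparison with the open source route, the snippet below sketches a call to Amazon Polly's Zeina voice through the AWS SDK for Python (boto3). It assumes configured AWS credentials; the input text and output file name are illustrative, and passing diacritized text is one way to sidestep the ambiguity discussed above.

```python
# A minimal sketch of synthesizing MSA speech with Amazon Polly's Zeina voice
# via boto3; AWS credentials and region are assumed to be configured.
import boto3

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text="أَهْلاً وَسَهْلاً",  # diacritized input reduces pronunciation ambiguity
    VoiceId="Zeina",
    OutputFormat="mp3",
)

with open("zeina_sample.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```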

Research initiatives from academic institutions—such as NatiQ2 and Fanar-TTS, along with tools like Farasa, CAMeL, and ArTST—represent some of the most impactful contributions supporting and advancing the Arabic speech technology community. Collaboration among academia, industry, and governments is crucial for further progress, with open source projects playing a pivotal role in democratizing access to Arabic speech resources.

Looking ahead, Arabic voice-generation technologies have immense potential but also present ethical challenges. Ensuring privacy, preventing impersonation, and mitigating disinformation will require audio watermarking, synthetic speech detection, and robust authentication mechanisms. These challenges also open new research and business opportunities for the community. Despite these hurdles, continued advances in Arabic TTS and voice cloning can drive societal impact, fostering inclusive communication, enhanced learning, and cultural exchange across the Arabic-speaking world and beyond.

    • 1. Abdelali, A., Darwish, K., Durrani, N., and Mubarak, H. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conf. of the North American Chapter of the Association for Computational Linguistics: Demonstrations, DeNero, J., Finlayson, M., and Reddy, S. (Eds.). Association for Computational Linguistics, 11–16; 10.18653/v1/N16-3003
    • 2. Abdelali, A. et al. NatiQ: An end-to-end text-to-speech system for Arabic. In Proceedings of the 7th Arabic Natural Language Processing Workshop. Association for Computational Linguistics (2022), 394–398; 10.18653/v1/2022.wanlp-1.38
    • 3. Abdelali, A. et al. LAraBench: Benchmarking Arabic AI with large language models. In Proceedings of the 18th Conf. of the European Chapter of the Association for Computational Linguistics 1: Long Papers (2024), 487–520.
    • 4. Al-Radhi, M.S. et al. A continuous vocoder for statistical parametric speech synthesis and its evaluation using an audio-visual phonetically annotated Arabic corpus. Computer Speech & Language 60 (2020), 101025; 10.1016/j.csl.2019.101025
    • 5. Alrige, M. et al. End-to-end text-to-speech systems in Arabic: A comparative study. In Proceedings of the 2024 IEEE 12th Intern. Symp. on Signal, Image, Video and Communications. IEEE, 1–6.
    • 6. Chemnad, K. and Othman, A. Advancements in Arabic text-to-speech systems: A 22-year literature review. IEEE Access 11 (2023), 30929–30954.
    • 7. Darwish, K., Abdelali, A., Mubarak, H., and Eldesouki, M. Arabic diacritic recovery using a feature-rich BiLSTM model. Transactions on Asian and Low-Resource Language Information Processing 20, 2 (2021), 1–18.
    • 8. Dhouib, A. et al. Arabic automatic speech recognition: A systematic literature review. Applied Sciences 12, 17 (2022), 8898.
    • 9. Doan, K.D., Waheed, A., and Abdul-Mageed, M. Towards zero-shot text-to-speech for Arabic dialects. arXiv preprint (2024); arxiv.org/abs/2406.16751
    • 10. Kheir, Y.E. et al. QVoice: Arabic speech pronunciation learning application. INTERSPEECH (2023).
    • 11. Elmallah, M.M. et al. Arabic diacritization using morphologically informed character-level model. In Proceedings of the 2024 Joint Intern. Conf. on Computational Linguistics, Language Resources and Evaluation, 1446–1454.
    • 12. Erradi, A., Nahia, S., Almerekhi, H., and Al-kailani, L. ArabicTutor: A multimedia m-Learning platform for learning Arabic spelling and vocabulary. In Proceedings of the 2012 Intern. Conf. on Multimedia Computing and Systems. IEEE, 833–838.
    • 13. Kulkarni, A., Kulkarni, A., Shatnawi, S.A.M., and Aldarmaki, H. ClArTTS: An open-source classical Arabic text-to-speech corpus. arXiv preprint (2023); arxiv.org/abs/2303.00069
    • 14. Laouirine, I., Kammoun, R., and Bougares, F. TunArTTS: Tunisian Arabic text-to-speech corpus. In Proceedings of the 2024 Joint Intern. Conf. on Computational Linguistics, Language Resources and Evaluation (2024), 16879–16889.
    • 15. Mubarak, H. et al. Highly effective Arabic diacritization using sequence to sequence modeling. In Proceedings of the 2019 Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies 1: Long and Short Papers, 2390–2395.
    • 16. Mubarak, H., Hussein, A., Chowdhury, S.A., and Ali, A. QASR: QCRI Aljazeera speech resource: A large scale annotated Arabic speech corpus. In Proceedings of the 59th Annual Meeting of the Assoc. for Computational Linguistics and the 11th Intern. Joint Conf. on Natural Language Processing 1: Long Papers (2021), 2274–2285.
    • 17. Obeid, O. et al. CAMeL Tools: An open source Python toolkit for Arabic natural language processing. In Proceedings of the 12th Language Resources and Evaluation Conf. (2020), 7022–7032.
    • 18. Pasha, A. et al. MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of LREC 2014, 1094–1101.
    • 19. Toyin, H.O., Djanibekov, A., Kulkarni, A., and Aldarmaki, H. ArTST: Arabic Text and Speech Transformer. In Proceedings of the 1st Arabic Natural Language Processing Conf. Association for Computational Linguistics (2023), 41–51.
    • 20. Xie, T., Rong, Y., Zhang, P., and Liu, L. Towards controllable speech synthesis in the era of large language models: A survey. arXiv preprint (2024); arxiv.org/abs/2412.06602
    • 21. Yaseen, M. et al. Building annotated written and spoken Arabic LRs in NEMLAR Project. In LREC. Citeseer (2006), 533–538.
