TLDR: A new study investigates how to manipulate voice features like pitch and formants to amplify the ‘kawaii’ (cute) factor in computer voices. Researchers found that text-to-speech voices could be made significantly more kawaii by increasing fundamental and first formant frequencies, often leading to perceptions of youthfulness. However, applying the same techniques to professionally recorded game character voices yielded varied results, sometimes even decreasing perceived cuteness, suggesting a ‘ceiling effect’ or voice-specific ‘sweet spots’ for kawaii.
The concept of “kawaii,” the Japanese word for cute, extends beyond visual aesthetics into the realm of sound, particularly in computer voices. A recent research paper titled “Super Kawaii Vocalics: Amplifying the ‘Cute’ Factor in Computer Voice” examines how elements of voice relate to kawaii and how they can be manipulated, both manually and automatically. The study, which involved 512 participants in total, explored two types of computer voices: text-to-speech (TTS) and game character voices.
Kawaii is a multifaceted phenomenon in Japanese culture whose meaning spans “cute,” “pretty,” and “adorable.” Nittono and colleagues have proposed a two-layer model for kawaii, linking it to Japanese cultural aspects such as “amae” (desire to be loved) and “chizimi shikou” (love of small things), as well as to universal biological responses like Kindchenschema (baby schema), where baby-like features stimulate a care response. While most prior research focused on visual kawaii, this study extends the concept to voice, exploring what makes a voice sound “kawaii.”
Exploring Kawaii in Voices
The researchers aimed to identify which voice features lead to perceptions of voices as kawaii and how these features might also link to social identity perceptions like gender and age. They hypothesized that higher fundamental and formant frequencies (which relate to pitch and vocal tract shape) would increase perceived kawaiiness and result in younger or more ambiguous gender perceptions.
The study was conducted in four phases. In the first phase, the researchers manually processed text-to-speech (TTS) voices in Cubase, a digital audio workstation (DAW), and participants rated the manipulated voices. Kawaii ratings correlated positively with fundamental frequency (perceived as pitch) and with first formant frequency, which suggests that raising a voice’s pitch and its lowest vocal tract resonance can make it sound cuter. Higher fundamental and first formant frequencies were also associated with younger age perceptions, supporting the link between youthfulness and kawaii. The link to gender ambiguity was less clear: it correlated only with the third formant frequency.
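To make the idea of a positive correlation concrete, here is a minimal sketch that computes Pearson's r between a voice feature and a rating. The article does not say which correlation statistic the researchers used, so Pearson's r is an assumption, and the numbers below are hypothetical illustration values, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL values for illustration only (not from the paper):
# mean fundamental frequency of five voice variants and their mean kawaii rating.
mean_f0_hz = [180.0, 200.0, 220.0, 240.0, 260.0]
kawaii_rating = [2.1, 2.8, 3.0, 3.9, 4.2]

r = pearson_r(mean_f0_hz, kawaii_rating)
print(f"r = {r:.2f}")  # a value near +1 means ratings rise with f0
```

A value of r close to +1 would correspond to the pattern described above, where voices with higher fundamental frequency tend to receive higher kawaii ratings.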
Following the manual manipulation, the researchers explored automated methods using speech signal processing tools like Legacy-STRAIGHT and WORLD. While these tools could replicate some of the manual effects, there were subtle differences, indicating the complexity of fully automating the process while maintaining quality and desired perceptions.
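Vocoder-style tools such as WORLD and STRAIGHT generally work by splitting speech into a fundamental frequency (f0) contour and a spectral envelope, which can then be altered separately before resynthesis. The sketch below illustrates that idea in plain Python on toy data. It is not the researchers' pipeline and does not use either tool's API; the function names, the bin-based envelope, and the example values are all hypothetical.

```python
def scale_f0(f0_contour, ratio):
    """Scale voiced frames of an f0 contour by `ratio`.

    Assumption: 0.0 marks an unvoiced frame and is left unchanged.
    """
    return [f * ratio if f > 0 else 0.0 for f in f0_contour]


def shift_envelope(envelope, ratio):
    """Stretch a spectral envelope along the frequency axis by `ratio`.

    envelope[k] is the magnitude at linearly spaced frequency bin k.
    A peak at bin k moves to roughly bin k * ratio, so ratio > 1
    raises formant frequencies. Uses linear interpolation.
    """
    n = len(envelope)
    shifted = []
    for k in range(n):
        src = k / ratio  # position in the original envelope to read from
        if src >= n - 1:
            # Past the last bin: hold the final value.
            shifted.append(envelope[-1])
            continue
        lo = int(src)
        frac = src - lo
        shifted.append((1 - frac) * envelope[lo] + frac * envelope[lo + 1])
    return shifted


# Toy example (hypothetical values): one unvoiced and two voiced frames,
# and an envelope with a single "formant" peak at bin 4.
f0 = [0.0, 200.0, 210.0]
toy_envelope = [0.0] * 16
toy_envelope[4] = 1.0

raised_f0 = scale_f0(f0, 1.5)
raised_envelope = shift_envelope(toy_envelope, 2.0)
print(raised_f0)
print(raised_envelope.index(max(raised_envelope)))  # bin of the moved peak
```

Keeping the two operations separate mirrors the study's framing, in which fundamental frequency and formant frequencies are distinct features that can each influence how a voice is perceived.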
Game Voices: A Different Challenge
In the second and third phases, the study applied these manipulation techniques to a diverse set of pre-recorded game character voices. Unlike the generated TTS voices, game character voices are often professionally recorded by voice actors and may already have various filters and audio manipulations. The results here were more complex. A simple three-semitone shift in fundamental and formant frequencies did not consistently increase kawaiiness in game character voices; in some cases, it even led to a decrease. This might be due to a “ceiling effect,” where some voices are already at their peak kawaii, or because the manipulations introduced unnatural sounds to already highly processed audio.
The third phase tested finer-grained manipulations (shifts of one, two, or three semitones) of game voices using the manual Cubase method. Some voices were rated as more kawaii after these finer adjustments, but the effect was not universal, and some voices were rated as less kawaii. The impact of voice manipulation on perceived kawaiiness therefore appears to depend heavily on the original voice’s characteristics and how it was produced.
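For a sense of scale, the semitone shifts used in these phases can be converted to frequency ratios. The sketch below assumes the standard equal-tempered definition, in which one semitone multiplies a frequency by 2^(1/12); the article does not state how the DAW computed its shifts, and the 200 Hz starting value is a hypothetical example rather than a figure from the study.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


def shift_hz(freq_hz, semitones):
    """Frequency in Hz after shifting `freq_hz` by `semitones`."""
    return freq_hz * semitone_ratio(semitones)


BASE_F0_HZ = 200.0  # hypothetical starting fundamental frequency

for n in (1, 2, 3):
    print(f"+{n} semitone(s): x{semitone_ratio(n):.4f} "
          f"-> {shift_hz(BASE_F0_HZ, n):.1f} Hz")
```

Under this assumption, a three-semitone shift raises a frequency by roughly 19 percent, so even the largest manipulation tested is a moderate change rather than an extreme one.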
Implications and Future Directions
The research suggests that text-to-speech voices, which may have more room for improvement in terms of humanlikeness and fluency, can be more readily enhanced for kawaii perceptions through frequency manipulation. For professionally voice-acted characters, the existing level of artistry and processing might limit further simple manipulation. The study also points to the need for qualitative insights from voice actors to understand the nuances of creating “kawaii” voices, including different subtypes like “otona-kawaii” (adult-kawaii).
This pioneering work in “kawaii vocalics” opens new avenues for designing more engaging and culturally resonant voice user experiences. As more artificial agents and interfaces incorporate voices, understanding how vocal characteristics shape perceived kawaiiness, and how to manipulate them deliberately, will become crucial. The full research paper is titled “Super Kawaii Vocalics: Amplifying the ‘Cute’ Factor in Computer Voice.”


