Voicing Visual Novels with AI voices

For the last few months, I’ve been experimenting with using AI voices to voice my visual novels. I started with Strip Crazy Eights, because it was a short game, and ended up not voicing the actual card game part.

It would require hundreds of lines of the characters saying the name of the card they are playing, and when I skipped voicing card names, what was left over sounded annoyingly repetitive. So I made the actual card game in “Montage mode”: no voice, just background music. The only voiced parts of the game are the introduction and conclusion.

That proved to me that voicing every line of every game was unworkable and unnecessary. It also proved to me that voicing was worth doing.

Voice in Renpy Games

There are two ways to voice Renpy games, and they have both existed since I started making Renpy games in 2012. The easiest is pressing the “V” key on your keyboard for “Self Voicing” if you are playing on a PC or Mac. If your Linux has text to speech configured correctly it works there, too, but it doesn’t work on Android.

Very few Renpy game makers are aware that this is even a feature. I can tell because when I use it in other people’s games, on screen menus and help text get voiced too (the secret is to add alt “” to the end of any screen text that shouldn’t be voiced). I use Self Voicing when testing my games, because spelling and grammar mistakes are easy to spot when a line sounds funny in self-voicing mode. Also some people like it.

Self voicing uses the text to speech voice built into your computer, which sounds like a robot (I use “Zira”). Despite its off-putting robot sound, it makes me want to hear an actual voice. Audio voicing is supported in Renpy as well: there is a whole audio channel dedicated to it, and you just need mp3 or ogg audio files of every line of dialogue you want to voice. The two can be used together and won’t overlap, so with mp3 voice enabled and self voicing turned on with V, your computer voice will fill in the narration and player character lines.

I would love to voice my games, but I know human voice talent is expensive. Then I discovered the power of AI voice overs. They are not as good as actual actors, but they are a passable substitute that adds to the game’s overall playability. I started experimenting in March, when discussions of AI were rampant, including on my own website.

AI Voices online

There are several companies that generate AI voices from text: play.ht, speechify.com, murf.ai, elevenlabs.io, and listnr.tech are the ones I tried out. I ultimately decided on elevenlabs.io based on two features they offered early: the ability to use recorded voices to train the AI, and the ability to randomly generate voices.

Since Something’s In The Air Redux involves a lot of NPC characters to voice, I liked the random generation feature. It turns out most randomly generated voices sound like crap and are unusable, or they don’t emote as well as they should. I should have learned this from playing Watch Dogs Legion this past winter, which uses artificial voices for its thousands of NPCs, many of which sound really bad.

This month Eleven Labs added a way to share randomly generated voices, so I can find ones that others generated instead of all the bad ones I was getting. I ended up replacing my worst voices with shared ones.

I am not using the recorded voice trainer at all, going 100% with random voices, because I don’t want to get in any legal trouble using celebrity voices, and especially now with the shared voices feature I can find good voices for any character.

Finding Voices for Characters

I think I got really lucky with the Ariane and Rachel voices, in part because I knew what I was looking for, and kept trying until I got “close enough”. For Ariane I imagined Cherami Leigh, the female voice of V in Cyberpunk 2077. For Rachel I imagined April Stewart, the female human voice in Guild Wars 2, most famous for Sharon Marsh in South Park. The voices I ended up using sound nothing like these actors, of course, except maybe in pitch and enunciation, hence my decision to say “good enough” when I randomly generated them.

The cost of AI voices

For 100,000 characters of text to voice, I pay $25 a month. That should represent about 20,000 words, but if you have tried any AI for text or images, you know it sometimes requires multiple takes before the AI gives you a line reading close enough to acceptable, so it is closer to 8,000 words a month when you factor in retakes. For scale, when I extract the dialog from SITAR (Something’s In The Air Redux) and put it in a word processor, it comes out to 260 pages, single spaced.
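The arithmetic above can be sketched in a few lines of Python. The 5 characters per word and the 2.5 average takes per line are not figures I was given; they are assumptions back-derived from the 20,000 and 8,000 word estimates, and the function name is made up for illustration.

```python
def effective_words_per_month(char_quota, chars_per_word=5.0, avg_takes=2.5):
    """Estimate how many finished words a monthly character quota buys.

    chars_per_word and avg_takes are assumed values, chosen so the result
    matches the rough 20,000 / 8,000 word estimates in the text.
    """
    raw_words = char_quota / chars_per_word  # words if every line took one take
    return raw_words / avg_takes             # words left after paying for retakes


quota = 100_000  # characters per month on the $25 plan

# With exactly one take per line: the theoretical maximum.
print(f"{effective_words_per_month(quota, avg_takes=1.0):,.0f}")  # 20,000

# With the assumed 2.5 takes per line: the practical budget.
print(f"{effective_words_per_month(quota):,.0f}")  # 8,000
```

Lowering avg_takes is the only lever that matters here: if most lines land on the first take, the same quota stretches much further.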

I am also not voicing every line, maybe only half. I leave out lines of narration, and the player character lines, partly out of tradition, and partly because if I voiced them I would have to do male and female versions, which would double the work and cost.

I am only voicing NPCs, and only lines pertinent to the story. Long dialogues about background info, puzzle instructions, and lines down failing paths are excluded. Even then it took 4 months’ worth of allotment to fully voice Something’s In The Air Redux, which will be released in version 1.1 later this month.

That works out to $100 and several hours of my own valuable time to get one game voiced, so not exactly cheap, but definitely less than the cost of an army of voice actors.
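As a rough sketch of that total, here is the same math in Python. The per-thousand-words figure is only an estimate: it assumes all four months of allotment were used up at the roughly 8,000 effective words a month mentioned earlier, and the function names are made up for illustration.

```python
MONTHLY_PRICE_USD = 25             # price of the 100,000 character plan
MONTHS_USED = 4                    # months of allotment spent on one game
EFFECTIVE_WORDS_PER_MONTH = 8_000  # rough estimate after retakes


def total_cost(months, monthly_price):
    """Total subscription cost for voicing one game."""
    return months * monthly_price


def cost_per_thousand_words(months, monthly_price, words_per_month):
    """Rough cost per 1,000 finished words, assuming every month's quota is used up."""
    thousands_of_words = months * words_per_month / 1000
    return total_cost(months, monthly_price) / thousands_of_words


print(total_cost(MONTHS_USED, MONTHLY_PRICE_USD))  # 100
print(cost_per_thousand_words(MONTHS_USED, MONTHLY_PRICE_USD,
                              EFFECTIVE_WORDS_PER_MONTH))  # 3.125
```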

I am planning Date Ariane Remastered next, and all should be done in about a year.

The Ethics of AI voices

As I write this, the Writers Guild of America and the Screen Actors Guild are on strike, primarily over two issues: 1. The ridiculously low residual payments from entertainment on streaming services, and 2. The existential threat of AI systems replacing writers and actors. I have nothing to say on the first issue, except that artists must be paid what they are worth.

On the AI issue, here is what I know for sure: Even with the best AI available, real actors are far superior to AI. The fear that AI might get good enough to replace human actors is real, especially voice actors. AI makes CGI cheaper too.

Every actor has to do voice acting. Sometimes the visual take of a scene is great but the voice is wrong, or there was noise during the recording, so the dialogue is re-recorded afterward; this is called ADR voice-over. If all that gets replaced by AI, actors are going to get paid a lot less.

But taking a position demanding that AI never be used is also self-defeating. Directors, editors, and CGI artists are finding AI useful enough to do things never seen before, and a total ban would be a disadvantage.

My position, if I were a union rep: Make it so AI doesn’t cost less. For any voice-over job that would normally be done by a human, the studio must either hire a human to do it or pay a fee to the union equal to the cost of a human doing the job (including the cost of auditions and rehearsals). Studios that refuse to pay lose the ability to use union actors where they need human actors.

Writers should make similar deals. You can’t use AI to save money on writers unless you ONLY use AI, which, according to current law, makes the result uncopyrightable. If you use humans to clean up an AI generated script, they must be paid the same as if they wrote it themselves. I’m also a fan of big writers’ rooms. You just have to look at the drop in humor quality from the first 5 seasons of Futurama, where they used a big writers’ room, to the later 4 seasons, where they got by with minimal writing staff. AI is notoriously bad at humor writing, so any AI trend in script writing is going to be awful.

Hopefully, this philosophy carries over to the rest of the industry. AI may be “cheaper” but not necessarily less time consuming. The use of AI in special effects should not be used to justify underpaying the CGI artists who use it. As my previous experiments in AI art have proven, artists are better at AI than non-artists. The only real advantage of AI art in CGI work is that the work can be done faster, but studios can’t be allowed to use that as an excuse to pay artists less.

So How Can I justify AI voicing my games?

First of all, my game studio is a one person operation, a rarity even in simple visual novel production, which averages around 3 people: an artist, a writer, and a coder. My Computer Science degree means I can code, and 20 years’ experience in 3D art means I can do art. My admittedly weakest talent is writing, but I have already proven to myself I am better than ChatGPT at writing visual novel scripts, even if it takes me significantly longer to do it.

I’m committed to using AI as little as possible on everything. But voice work is something I cannot do myself. I justify using AI voice in my games for two reasons: 1. My operation is small. If I don’t use AI to voice it, there would be no voice at all. Nobody’s job is being replaced here. 2. I make my games as if voicing does not exist. The game should be fully playable with voice turned completely off (go to Preferences, and slide the Voice volume to 0). Voicing should be a helpful extra, nothing more.

While yes, there seems to be a growing trend of replacing voice actors with AI, when it is done by small indie game developers, or low-wage/no-wage TikTok / YouTube video producers, I don’t have a problem with it.

When it is done by commercial audio book producers, or for advertising of big companies who can afford real actors, I DO have a problem.

Current Technology

I did the voicing of Strip Crazy Eights in March. I am finishing up the Ariane and Rachel paths of Something’s In The Air Redux this July, a gap of a mere 4 months, and the quality has noticeably improved. As with most AI output, I often have to do multiple takes of a line before it sounds close enough to the way I imagine it should sound. In March a line typically took two to three takes; now more than half are done in one take, and I often get the AI reading lines better than I imagined by emphasizing words I didn’t think to emphasize. How good will it get in a year? I think there is an upper limit it can reach, and that limit might not reach “human”, but it is likely to get close.

That’s great for us small-time operations just trying to look halfway professional. Not so great for actual professionals, which is why I support the SAG/WGA strikes.
