Baldur's Gate 3 - Gates voice

November 10, 2023

The mod is available on Nexus Mods, and there is an accompanying video on YouTube.

Baldur’s Gate, but it’s just Gates. That’s it.

To make this mod there are several steps we have to take:

Train the model;
Create a Python script;
Package .wav files into .wem files;
Pack the mod and put it into your game.

Training the model

To train a Bill Gates model I used Piper TTS.

Two helpful tutorials on how to use Piper TTS are this video from Thorsten-Voice, and this blogpost for WSL by ssamjh.

To speed up the process and limit the amount of data needed I fine-tuned the “en_US-hfc_male-medium” voice.

The first actual step is finding or creating a dataset. I used audio clips from a Behind The Tech interview with Bill Gates. I split the first 10 or so minutes of Bill Gates speaking into clips of one to two sentences each and transcribed them in the format described in this documentation from Coqui-AI. This resulted in a dataset of 50 audio samples. Ideally we would have many more samples, but preparing them would also take much longer. Since we are fine-tuning an existing model it should be okay-ish.
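As a rough sketch of what such a transcription file could look like, assuming an LJSpeech-style layout with one pipe-separated `clip_id|transcript` row per clip (check the linked documentation and the Piper training guide for the exact columns). The helper name, clip names and sentences below are made-up placeholders, not the real dataset:

```python
from pathlib import Path


def write_metadata(rows, path):
    """Write (clip_id, transcript) pairs as pipe-separated lines."""
    lines = []
    for clip_id, transcript in rows:
        # The pipe is the column separator, so it must not appear in the text.
        if "|" in transcript:
            raise ValueError(f"transcript for {clip_id} contains '|'")
        lines.append(f"{clip_id}|{transcript.strip()}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf8")
    return lines


# Hypothetical clip names and sentences, for illustration only.
example_rows = [
    ("gates_0001", "This is the first example sentence."),
    ("gates_0002", "And this is the second one."),
]
```

Calling `write_metadata(example_rows, "metadata.csv")` would write two rows, one per clip.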

The second step is fine-tuning the model. To do this I followed the two previously mentioned tutorials. I let the fine-tuning run up to and including epoch 3540. Since the original model was trained up to and including epoch 2785, mine ran for 755 epochs. I stopped because it sounded like it wasn’t improving much anymore and I wanted to prevent over-fitting. This is not an exact science, so you just have to listen to the output every once in a while.

Create a Python script

Now that we have the model, we need to generate the voice lines. Given the size of the game, there are A LOT of voice lines. To automate this process we use a Python script. For this script to work we need the unpacked game files. To unpack the game files I used the Baldur’s Gate 3 Modder’s Multitool.

From these unpacked files we need the /English/Localization/English/english.loca file, and all files in the /VoiceMeta/Localization/English/Soundbanks/ directory. The english.loca file (read by the script below in its XML form, english.xml) contains around 224000 lines formatted as follows:

<content contentuid="h000006d4gcefbg4092gbb39gfeb27a3bb0a7" version="1">Sorry, darling, I haven't got time for underlings.</content>
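A minimal sketch of turning lines like this into a contentuid-to-text dictionary, using the standard-library ElementTree parser instead of BeautifulSoup. The `contentList` root element and the function name are assumptions made so the snippet is well-formed; the real converted file may use a different root:

```python
import xml.etree.ElementTree as ET

# Wrapped in an assumed root element so the snippet is valid XML.
sample = """<contentList>
<content contentuid="h000006d4gcefbg4092gbb39gfeb27a3bb0a7" version="1">Sorry, darling, I haven't got time for underlings.</content>
</contentList>"""


def parse_loca_xml(xml_text):
    """Map each contentuid to its text line."""
    root = ET.fromstring(xml_text)
    # tag.text is None for empty elements, so fall back to an empty string.
    return {tag.get("contentuid"): (tag.text or "") for tag in root.iter("content")}


lines = parse_loca_xml(sample)
print(lines["h000006d4gcefbg4092gbb39gfeb27a3bb0a7"])
```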

The files in the Soundbanks directory contain voice text metadata, formatted as follows:

<node id="VoiceTextMetaData">
	<attribute id="MapKey" type="FixedString" value="h0637667eg5168g41ffgbee6g2da017faa0e1" />
	<children>
		<node id="MapValue">
			<attribute id="Codec" type="FixedString" value="VORBIS" />
			<attribute id="Length" type="float" value="9.6" />
			<attribute id="Priority" type="FixedString" value="P1_StoryDialog" />
			<attribute id="Source" type="FixedString" value="v1a4a557786fa4d8ab99cc1c249b4d49e_h0637667eg5168g41ffgbee6g2da017faa0e1.wem" />
		</node>
	</children>
</node>

From these nodes we are interested in two things. The first is the value in the second line: value="h0637667eg5168g41ffgbee6g2da017faa0e1". The second is the value in the last attribute line: value="v1a4a557786fa4d8ab99cc1c249b4d49e_h0637667eg5168g41ffgbee6g2da017faa0e1.wem".

The first value can be mapped to the contentuids from the english.loca file. The second value is the filename of the audio file that the text line corresponds to. We can use this mapping to get the text from the english.loca file, generate the voice line, and then save it under the correct filename.
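The mapping described above could also be extracted by attribute name rather than by position. This is only a sketch, assuming the metadata files are well-formed XML like the node shown earlier; `voice_entries` is a hypothetical helper, not part of any tool:

```python
import xml.etree.ElementTree as ET

sample_node = """<node id="VoiceTextMetaData">
	<attribute id="MapKey" type="FixedString" value="h0637667eg5168g41ffgbee6g2da017faa0e1" />
	<children>
		<node id="MapValue">
			<attribute id="Codec" type="FixedString" value="VORBIS" />
			<attribute id="Length" type="float" value="9.6" />
			<attribute id="Priority" type="FixedString" value="P1_StoryDialog" />
			<attribute id="Source" type="FixedString" value="v1a4a557786fa4d8ab99cc1c249b4d49e_h0637667eg5168g41ffgbee6g2da017faa0e1.wem" />
		</node>
	</children>
</node>"""


def voice_entries(root):
    """Yield (contentuid, filename without .wem) per VoiceTextMetaData node."""
    for node in root.iter("node"):
        if node.get("id") != "VoiceTextMetaData":
            continue
        key = source = None
        # Look the attributes up by id, so their order does not matter.
        for attr in node.iter("attribute"):
            if attr.get("id") == "MapKey":
                key = attr.get("value")
            elif attr.get("id") == "Source":
                source = attr.get("value")
        if key and source:
            if source.endswith(".wem"):
                source = source[: -len(".wem")]
            yield key, source


mapping = dict(voice_entries(ET.fromstring(sample_node)))
```

Looking attributes up by their `id` avoids relying on "Source" always being the fifth `value=` in a node, at the cost of parsing each file as XML.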

Therefore we first parse the english.loca file to create a dictionary that maps each contentuid to its text line. Then we create a dictionary that maps the contentuids to the filenames, using the files in the Soundbanks directory. Lastly we iterate over the soundbanks dictionary, get the correct text line from the english dictionary, generate the voice line, and save it under the correct filename. This results in the following Python script called “main.py”, which can be run on Linux or WSL:

from bs4 import BeautifulSoup
import re
from pathlib import Path
import os
from shlex import quote  # standard-library equivalent of shellescape's quote

dict_english = {}
dict_voice = {}

with open('Unpacked_game_data/english.xml', 'r', encoding="utf8") as f:
	data = f.read()

english_data = BeautifulSoup(data, 'xml')
content_data = english_data.find_all('content')

for tag in content_data:
	# Strip markup before synthesis: asterisks, anything between <>,
	# pipe characters, and bracketed tags of the form ([...]).
	text = tag.text
	text = text.replace('*', '')
	text = re.sub(r'<.*?>', '', text)
	text = text.replace('|', '')
	text = re.sub(r'\(\[.*?\]\)', '', text)

	contentuid = tag['contentuid']

	dict_english[contentuid] = text

# There seem to be some files missing, but for now we'll just leave it like this. The number of files from here is much smaller than the number of files from the english.xml file.
for p in Path('Unpacked_game_data/Soundbanks_meta_mod/').glob("*.lsx"):
	with open(p, encoding="utf8") as f:
		data = f.read()

	split_on_metadata = data.split("VoiceTextMetaData")

	# Skip index 0, which is just the file header.
	for chunk in split_on_metadata[1:]:
		# To get the key and value we split on value=
		split_on_value = chunk.split("value=")
		split_on_quote_key = split_on_value[1].split('"')
		split_on_quote_value = split_on_value[5].split('"')

		key = split_on_quote_key[1]
		value = split_on_quote_value[1].replace(".wem", "")

		dict_voice[key] = value

i = 0
for k in dict_voice:
	if k in dict_english.keys():
		if i%100 == 0:
			print(i)
		i += 1
		text = dict_english[k]
		file_name = dict_voice[k]

		cmd = 'echo {} | piper -m ~/<path to model>/model.onnx'.format(quote(text))
		cmd = cmd + f' --output_file ~/<path to file>/{file_name}.wav'
		os.system(cmd)

print(i)

However, this method has one issue, which shows up at the “if k in dict_english.keys():” line in the last loop. Not all contentuids in the soundbanks dictionary are in the english dictionary. I have not been able to find out why, so this method leaves us with only 122431 contentuids that are in both dictionaries.
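One way to at least quantify the mismatch is to compare the key sets of the two dictionaries. The sketch below uses tiny made-up dictionaries in place of the real `dict_voice` and `dict_english`; `compare_keys` is a hypothetical helper and it does not explain why the keys differ:

```python
def compare_keys(dict_voice, dict_english):
    """Split the contentuids into shared and one-sided groups."""
    voice_keys = set(dict_voice)
    english_keys = set(dict_english)
    return {
        "in_both": voice_keys & english_keys,
        "voice_only": voice_keys - english_keys,
        "english_only": english_keys - voice_keys,
    }


# Made-up stand-ins for the real dictionaries.
voice = {"h1": "v_a_h1", "h2": "v_a_h2", "h3": "v_b_h3"}
english = {"h1": "Hello.", "h3": "Goodbye.", "h4": "Unvoiced text."}

report = compare_keys(voice, english)
print({name: len(keys) for name, keys in report.items()})
```

Inspecting a few of the `voice_only` contentuids by hand might hint at where the missing text lives.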

To generate all voice lines we can just call “python3 main.py” in a Linux or WSL terminal. This will take a long time. For me it took 24 hours. Interestingly enough I ended up with 114836 files. So somewhere along the way I lost another 7595 files (122431 minus 114836). I do not know how or why, and for my sanity’s sake I just moved on.
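A sketch of how the gap could be narrowed down after a run, assuming the two dictionaries from main.py are still available and the .wav files sit in one output directory. `audit_outputs` is a hypothetical helper; it only counts candidates (duplicate filenames, empty text, files absent on disk) and does not establish the actual cause:

```python
from collections import Counter
from pathlib import Path


def audit_outputs(dict_voice, dict_english, out_dir):
    """Compare the filenames the script should produce with what is on disk."""
    # Filenames for every contentuid that has a text line.
    expected = [dict_voice[k] for k in dict_voice if k in dict_english]
    on_disk = {p.stem for p in Path(out_dir).glob("*.wav")}
    return {
        "expected_lines": len(expected),
        # Fewer unique names than lines means some files overwrite each other.
        "unique_filenames": len(Counter(expected)),
        # Lines that are empty after the markup has been stripped.
        "empty_text": sum(
            1 for k in dict_voice
            if k in dict_english and not dict_english[k].strip()
        ),
        "missing_on_disk": sorted(set(expected) - on_disk),
    }
```

The `missing_on_disk` list could also be used to re-run only the lines that failed instead of repeating the full 24 hours.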

Note: Not all voice lines are replaced. Point-and-click lines do not seem to be changed, and the mod also misses some sentences that characters say while walking somewhere or during battle. I tried my best, but I couldn’t find out how to replace these files.

Package .wav files into .wem files

Now that we have the .wav files we are almost there. They still need to be converted into .wem files. To do this I used Wwise, following this tutorial by Hp Xro.

Pack the mod and put it into your game

First I tried to pack these files using the BG3 Modder’s Multitool, but this did not work. Instead I used the packing part of this BG3 AI voice toolkit.

The Bill Gates image: https://commons.m.wikimedia.org/wiki/File:Bill_Gates_at_the_MSC_summit_-_2023_%2852695822752%29_%28cropped%29.jpg