Metaist

Generating Audiobooks

How I used Coqui.ai TTS and The Internet Archive to make audiobook MP3 files.

When I was in high school I used to use Microsoft Agent to read books from Project Gutenberg. Nowadays, I use @Voice Aloud Reader on Android to listen to .epub files.

One of my sons got an old-school MP3 player and he wants to fill it with audiobooks. The Internet Archive has a large collection of audiobooks, but some of the books only come in text form.

On my wife's recommendation, I decided to try to make an audiobook out of the E.L. Konigsburg book From the Mixed-Up Files of Mrs. Basil E. Frankweiler.

First, I downloaded the text file and manually cut it up into chapters, one file per chapter.

Next, I converted all paragraphs into single lines.
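This step could also be scripted. Here is a minimal sketch, assuming the downloaded text is hard-wrapped and that paragraphs are separated by blank lines. The function name unwrap_paragraphs is made up for illustration and is not part of the script below.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so that each paragraph sits on one line."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace (including newlines) to one space.
    return "\n".join(" ".join(p.split()) for p in paragraphs)
```

Calling it on the contents of a chapter file and writing the result back gives one paragraph per line.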

Then I set up a Python project using rye and uv:

curl -sSf https://rye.astral.sh/get | bash
rye config --set-bool behavior.use-uv=true
rye init
rye pin 3.11 # TTS doesn't support 3.12
rye add attrbox tts

attrbox is my library for doing things like processing command-line arguments.

tts is the Coqui.ai package for doing text-to-speech. However, I just discovered that Coqui.ai, the company behind the project, is shutting down, so I don't know how well this package will be maintained in the future.

I wrote up a quick script:

#!/usr/bin/env python
"""Convert a text file to a WAV file.

Usage: txt_tts
    [--help | --version][--debug]
    <input> [--output PATH]
    [--model PATH]

Options:
    -h, --help                  show this message and exit
    --version                   show program version and exit
    --debug                     show debug messages

    <input>                     input file
    -o PATH, --output PATH      output file

    --model PATH                model path [default: tts_models/en/ljspeech/tacotron2-DDC_ph]
"""
# std
from pathlib import Path

# lib
from attrbox import parse_docopt

__version__ = "0.1.0"
__pubdate__ = "2024-06-27T18:58:45Z"


def main():
    """Main entry point."""
    args = parse_docopt(__doc__, version=f"{__version__} ({__pubdate__})")
    args.input = Path(args.input)
    if not args.input.exists():
        raise FileNotFoundError(args.input)

    if not args.output:
        args.output = args.input.with_suffix(".wav")
    else:
        args.output = Path(args.output)

    if args.debug:
        print(args)

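    # Import late: TTS pulls in heavy dependencies, so argument parsing stays fast.
    from TTS.api import TTS

    # Load the model, then synthesize the whole input file into one audio file.
    tts = TTS(args.model)
    tts.tts_to_file(text=args.input.read_text(), file_path=args.output)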


if __name__ == "__main__":
    main()

At first, I tried tts_models/en/ljspeech/tacotron2-DDC based on the examples I saw online. However, it had trouble with names and words that weren't part of its lexicon. The _ph version works on a phonetic level and handles those words better while not being too slow: it takes about 3 minutes per chapter. (I did try TortoiseTTS, but it was extremely slow and the output wasn't that much better.)

The trickiest bit is making sure there aren't any empty sentences when the chapter gets split into sentences. This can happen when the last sentence of a paragraph ends with ." and the quote mark ends up as a sentence by itself. To fix this, I replaced those instances with ". which worked great.
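The replacement can be sketched with a regular expression. This assumes each paragraph is already on its own line and that only a quote mark at the very end of a line needs to move. The name move_trailing_quote is hypothetical, and handling the curly closing quote is an extra assumption.

```python
import re

# A period followed by a closing quote (straight or curly) at the end of a line.
# Assumption: each paragraph sits on its own line.
_TRAILING_QUOTE = re.compile(r'\.(["\u201d])[ \t]*$', flags=re.MULTILINE)


def move_trailing_quote(text: str) -> str:
    """Rewrite a paragraph-final period-quote as quote-period."""
    # \1 is whichever quote character was matched; the period moves after it.
    return _TRAILING_QUOTE.sub(r"\1.", text)
```

A period-quote pair in the middle of a line is left alone, since the sentence splitter only seemed to stumble at paragraph ends.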

I ran the script on each chapter, which produced one .wav file per chapter, and then ran those files through ffmpeg to convert them to MP3.
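The conversion step could be batched with a few lines of Python. This is a sketch rather than the exact command I ran: it assumes ffmpeg is on the PATH and was built with the libmp3lame encoder, and the quality setting (-q:a 4) is an arbitrary choice. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(wav: Path) -> list[str]:
    """Build the ffmpeg call that encodes one .wav file as an .mp3 next to it."""
    mp3 = wav.with_suffix(".mp3")
    # -y overwrites an existing output file.
    # libmp3lame and -q:a 4 are assumed settings, not taken from the post.
    return ["ffmpeg", "-y", "-i", str(wav),
            "-codec:a", "libmp3lame", "-q:a", "4", str(mp3)]


def convert_all(folder: Path) -> None:
    """Convert every .wav file in the folder, in sorted (chapter) order."""
    for wav in sorted(folder.glob("*.wav")):
        # check=True stops at the first chapter that fails to convert.
        subprocess.run(ffmpeg_command(wav), check=True)
```

Running convert_all(Path(".")) in the folder with the chapter files would leave an .mp3 beside each .wav.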

As a final touch, I used kid3 to fix all the tags so that the files would get sorted by album/track (i.e. book/chapter).