How hard a project would it be to use ML to assist a single author/script writer in writing dialog where each "speaker" sounds like a distinct person? Is that something that a professional programmer with little if any practical experience in ML could reasonably expect to accomplish?
There are a bunch of tangentially related questions (some of them below) that I'd like answers to or advice on, but the above is the question I'm specifically asking. If anything below is off-topic or unfocused, please just ignore that part of what I've typed.
Context
I'm a programmer by trade, but I have basically no hands-on experience with ML beyond reading about what others have done. I'd like to use ML (or something like it) to take dialog I've written (provided as text) and make it sound as though it were spoken by multiple different people.
I'm concerned that this may not be a reasonable project for someone with my minimal ML experience. Worse, I'm concerned that I might need to expend a lot of effort before I can even tell whether this is a reasonable project for me to tackle.
What I'm trying to do:
I'm working on a literary project where I want to create dialog with recognizably distinct "voices". I don't want everyone's speech to sound like they're quoting me; I want them to read like actually different people.
Fortunately, because of how I'm working, extracting the written dialog and tagging it by assigned speaker is not overly difficult. From there, feeding it to an ML system shouldn't be too hard. Based on a little research, it seems that if I had a speaker-classification model for my set of characters, I could measure how good a job I've done. Also, as I understand it (without ever having done it), there are ways to work backwards through such a model and get adjustments to the input that produce stronger matches to a pre-selected speaker.
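For concreteness, here's roughly the kind of classifier I'm imagining for the "measure how good a job I've done" part. This is just a sketch assuming scikit-learn; the lines, speaker names, and choice of character n-gram features are placeholders rather than a claim that this is the right setup, and real data would obviously need far more lines per character:

```python
# Sketch: can a simple classifier tell my characters apart from text alone?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data; in practice this would be my tagged dialog.
lines = [
    "Well, I reckon we oughta head on out.",
    "I ain't so sure about that, friend.",
    "Reckon you're right, as usual.",
    "The evidence, I am afraid, compels a different conclusion.",
    "One does not simply abandon the inquiry.",
    "I should think the matter settled, actually.",
]
speakers = ["abe", "abe", "abe", "bea", "bea", "bea"]

# Character n-grams tend to capture style (contractions, punctuation,
# word endings) rather than topic, which is closer to "sounds distinct".
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)

# If cross-validated accuracy is near chance, the voices aren't separable;
# if it's high, at least this model thinks the speakers are distinct.
print(cross_val_score(clf, lines, speakers, cv=3).mean())
```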
However, I've never done any of that. Furthermore, my interest in ML is purely practical: learning ML techniques is fine, but only insofar as they have a good likelihood, from the outset, of solving my problem. Canned solutions, or ones that require learning less about ML, would be my preference.
One other thing I'm particularly worried about is that while I'm looking for computational help in choosing how things are said, I can't allow what is said to change very much. Whatever solution I use will have to leave me in the edit loop.
My actual questions (in context):
- Does this seem like a problem that could be addressed with known ML techniques, without having to delve into original research?
- Is this a problem that could be reasonably managed by an ML novice?
- Are there any particular techniques and/or tools I should start by looking at?
What I'm thinking I would do, unless someone points me in a better direction, is as follows:
- Find the largest corpus of transcribed speech I can readily get hold of.
- Pick groupings of speech (maybe overlapping) that "sound like" what I want each character to sound like.
- Fabricate a set of training data for each character by grabbing chunks from the assigned groupings and train an ML model to classify speakers from this synthetic training data.
- Compare my tagged dialog with who the model thinks is speaking to find places that "don't work".
- Generate a second set of training data from the first by replacing short spans of text with a "wildcard" token and train a second model on this.
- Take dialog that "doesn't work" and look for phrases where replacing them with a "wildcard" token most improves the scoring, and report those phrases as needing revision (a rough sketch of this step is below).
The point of doing the last few steps that way is that it's "the simplest thing that might work" and shouldn't require learning any new ML techniques beyond those the earlier steps already need.
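To make that last step concrete, here's the sort of loop I'm picturing. It's only a sketch: `clf` is assumed to be a fitted classifier pipeline like the one above, the wildcard token and window size are arbitrary, and for the probabilities to mean much the model would need to have seen the wildcard token in its training data, as in the previous step.

```python
# Sketch: find the span whose removal most improves the intended speaker's score.
WILDCARD = " xwildcardx "  # arbitrary token the second model was trained with

def most_suspect_span(clf, line, speaker, window=3):
    """Slide a window over the line, swap each span for the wildcard,
    and return the span whose replacement most raises the probability
    that `speaker` said the line (i.e. the phrase that "doesn't work")."""
    tokens = line.split()
    target = list(clf.classes_).index(speaker)
    base = clf.predict_proba([line])[0][target]
    best_gain, best_span = 0.0, None
    for i in range(len(tokens) - window + 1):
        masked = " ".join(tokens[:i]) + WILDCARD + " ".join(tokens[i + window:])
        gain = clf.predict_proba([masked])[0][target] - base
        if gain > best_gain:
            best_gain, best_span = gain, " ".join(tokens[i:i + window])
    return best_span, best_gain

# e.g., after clf.fit(lines, speakers):
# most_suspect_span(clf, "The evidence, I reckon, compels a different conclusion.", "bea")
```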
Does this seem like a reasonable approach? Are there better entry-level methodologies?