Oto Theory Tutorial by Cdra

A foreword: this is an "advanced" tutorial

Alright, glad “advanced” didn't scare you off immediately. This tutorial is going to talk about oto.ini configuration from a general perspective: what the oto should do, and how to make it do that. I'll at least touch on CV, VCV, CV VC (JP), and CV VC English otos during the course of this tutorial, but the point is more to make a tutorial that's useful for EVERY kind of oto. This tutorial is really geared toward people who have done an oto before, but want to know how to do them better.

Now, since this is an “advanced” tutorial, I'm going to assume you know some things about otos. You should know what all of the settings are (and have some vague idea of what they do) and you should be fairly experienced looking at waveforms for CV and VCV. If you haven't looked at VCV waveforms much, you should still be okay, but this might sound a bit arcane to you and you'll have to do a little extra learning. If you don't know what the settings are, I'll cover them in brief here, but you might want to check a better tutorial for beginners so that you'll understand them better.

If you're still not afraid, let's get going with the tutorial!

Intermediate/Advanced Oto Theory

Starting from the big picture, there are two large component parts to a sound file that are important to the oto: the consonant region (what is not stretched) and the vowel region (what is stretched). In the oto editor, the consonant region is pink and the vowel region is white; unused regions are blue. Let's look at the vowel first, since it's simpler.


The consonant and cutoff values are what determine the vowel region. You want to put the consistent vowel sound inside of this region—as much as is recorded and consistent, since the more sound you have the less UTAU will have to stretch the sample. That means you want to cut off the inconsistent parts with either the consonant (for the front) or the cutoff (for the back).
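
To make the regions concrete, here is a minimal Python sketch that reads one oto.ini line and works out where each region falls. It assumes the usual oto.ini field order (filename=alias,offset,consonant,cutoff,preutterance,overlap, all in milliseconds) and the usual cutoff convention (a positive cutoff is measured from the end of the file, a negative one from the offset). Neither is spelled out in this tutorial, so check them against your own oto.ini. The names OtoEntry, parse_oto_line, and regions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class OtoEntry:
    filename: str
    alias: str
    offset: float        # ms from the start of the file
    consonant: float     # ms from the offset
    cutoff: float        # ms; sign convention is an assumption, see regions()
    preutterance: float  # ms from the offset
    overlap: float       # ms from the offset


def parse_oto_line(line: str) -> OtoEntry:
    """Parse 'file.wav=alias,offset,consonant,cutoff,preutterance,overlap'."""
    filename, rest = line.strip().split("=", 1)
    alias, *values = rest.split(",")
    # Treat empty numeric fields as 0.
    offset, consonant, cutoff, preutterance, overlap = (
        float(v) if v.strip() else 0.0 for v in values
    )
    return OtoEntry(filename, alias, offset, consonant, cutoff, preutterance, overlap)


def regions(entry: OtoEntry, wav_length_ms: float) -> dict:
    """Return (start, end) in ms for each region of the sample."""
    consonant_end = entry.offset + entry.consonant
    if entry.cutoff < 0:
        # Assumed convention: negative cutoff counts forward from the offset.
        vowel_end = entry.offset - entry.cutoff
    else:
        # Assumed convention: positive cutoff counts back from the end of the file.
        vowel_end = wav_length_ms - entry.cutoff
    return {
        "unused_front": (0.0, entry.offset),
        "consonant": (entry.offset, consonant_end),  # not stretched
        "vowel": (consonant_end, vowel_end),         # stretched or looped
        "unused_back": (vowel_end, wav_length_ms),
    }
```

The longer the "vowel" span comes out while still covering only consistent sound, the less the resampler has to stretch.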


Here's another trick for you! The resamplers EFB-GT and tn_fnds are known for looping rather than stretching samples. Sometimes, they don't sound good due to jumping sounds when the sample loops. If you want to help prevent that jump, then look closely at the waveform.


It's a little easier to see on this "ke".


You see those big periodic darker-blue waves? That's part of your sound's periodic function. If you maintain this periodic function, you'll create smooth looping with no jumps. A simple way to do this is to place the cutoff and the consonant both inside of the trough (lowest point, where there is no darker-blue wave) of the wave. It can be hard to see this, so try adjusting the zoom to get a better look. Other times, it's almost impossible to see, so it's hard to always take this into account. Of course, if you don't plan on using looping resamplers, this is optional and perhaps not at all useful to you.
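
If eyeballing the trough is hard, the same idea can be roughed out in code. The sketch below is only my reading of "trough" as the quietest spot near where you placed the marker: it slides a short window around the marker and picks the position with the lowest average absolute amplitude. The function name snap_to_trough and its parameters are hypothetical, and samples is assumed to be a plain list of mono sample values you have already loaded.

```python
def snap_to_trough(samples, sample_rate, marker_ms, search_ms=10.0, smooth_ms=1.0):
    """Move a marker to the quietest point within +/- search_ms of it.

    samples     -- list of mono sample values (assumed already loaded)
    sample_rate -- samples per second
    marker_ms   -- where the consonant or cutoff marker currently sits
    Returns the snapped marker position in ms.
    """
    center = int(marker_ms * sample_rate / 1000)
    radius = int(search_ms * sample_rate / 1000)
    # Half-width of the smoothing window, at least one sample.
    half = max(1, int(smooth_ms * sample_rate / 1000) // 2)
    lo = max(half, center - radius)
    hi = min(len(samples) - half - 1, center + radius)

    best_index, best_level = center, float("inf")
    for i in range(lo, hi + 1):
        window = samples[i - half : i + half + 1]
        level = sum(abs(s) for s in window) / len(window)
        if level < best_level:
            best_index, best_level = i, level
    return best_index * 1000 / sample_rate
```

You would run this once for the consonant marker and once for the cutoff, then listen to the result; it is a starting point, not a replacement for your ears.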

That's really all there is to say about the vowel region—UTAU does a pretty good job of handling it. No, the consonant region is far more interesting. But before I start dissecting that, let's look at preutterance.


Preutterance is a pretty simple value that maintains timing by telling the program where your note should start, which is at the very end of the consonant. Unless you're dealing with a VC oto, you will always put the preutterance there to maintain timing.
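
As a small worked example of that timing, the sketch below assumes the commonly described behaviour that the sample begins playing preutterance milliseconds before the note's position, and that the crossfade with the previous note runs for overlap milliseconds from that point. The function name note_timing is made up for illustration.

```python
def note_timing(note_start_ms, preutterance_ms, overlap_ms):
    """Return (sample_start, crossfade_end) on the song timeline, in ms.

    Assumption: playback starts preutterance ms ahead of the note, so the
    preutterance point of the sample lands exactly on the note start.
    """
    sample_start = note_start_ms - preutterance_ms
    crossfade_end = sample_start + overlap_ms
    return sample_start, crossfade_end


# A note at 1000 ms with 80 ms preutterance and 30 ms overlap:
# the sample starts at 920 ms and the crossfade is over by 950 ms.
print(note_timing(1000, 80, 30))
```

This is why a preutterance placed too early or too late makes the whole note sound rushed or late, even when everything else in the oto is fine.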

In order to understand the consonant region, we need to think about the equivalent region of the human singing voice. This is the region that occurs at the ends and beginnings of notes. Let's look at a VCV sample to learn more.


We can see that the “consonant region” here contains three things: the first vowel fading out, the consonant itself, and the second vowel beginning. These are the critical regions of the sound; we want to maintain them properly. The consonant value (pink region) should cover the whole critical region of the sound, including the beginning part of the vowel that is inconsistent (before the vowel gets going smoothly); that's why I call this the "critical region" as well as the "consonant region".

VCV maintains the critical region in a very intuitive way—by having the whole thing present in the sample. It relies on crossfading the vowels together (that is, the end of the previous vowel with the prefix vowel of the VCV sample) to create smooth singing. Overlap is the value used to represent volumewise crossfading—so since we want to crossfade the vowels together, we want the overlap to be inside of the consistent prefix vowel. The tightest otos will put the overlap at the very end of the consistent portion of the prefix vowel in order to cut away excess sound (though UTAU will push it together with STP if there is excess sound in the way). However, the vowel fadeout is part of the critical region and therefore should not be covered with overlap, but instead allowed to behave naturally. So the oto looks like this:


In the interest of completeness, I will note here that many people use completely uniform preutterance and/or overlap values to oto VCV. This can work because the overlap should remain behind the critical region in these otos. Uniform preutterance values, however, will leave a lot of extra sound after the overlap and before the critical region of the sound, which is not entirely ideal because it causes UTAU to have to crush more of the sound out with STP, creating potentially strange crossfades. Uniform overlap values can make a lot of the guesswork involved with placing overlap go away, which is actually a good thing. Thus, I personally recommend using a uniform value for overlap and customizing your preutterance values to match the oto style explained previously; making both values uniform is popular and can work, though.

Now then, let's talk about CV. CV is different in that the samples do not have the critical region fully inside of them—they only have the consonant part of it and are missing the vowel fadeout. Therefore, a CV oto must mimic the vowel fadeout of the VCV sample! In order to do this, it's good to reference a VCV sample of the same consonant type and go from there.

For example, recall our CV 'ka' and our VCV 'a ka' shown in the previous examples. We can see from the VCV waveform that the prefix vowel fades out, then there's some white space (about 90 ms, though because VCV samples are slower than many songs this could be too much), then the CV waveform pretty well matches up to the consonant and the rest of the waveform. In order to mimic this critical transition in the CV oto, we'll put a little bit of overlap into silence before the sample (leaving some silence by moving the offset back a bit), then leave space before the consonant itself—basically faking the fadeout of the vowel to mimic the critical region of the consonant. My oto of this CV sample looks like this:


Hard consonants like 'k' are fairly intuitive, but what about soft consonants like 'b', or even 'y' and 'w'? Well, let's compare a CV and a VCV again, this time for b.


We see that in the VCV waveform, the prefix vowel fades out directly into the voicing of the consonant, but then the consonant voices for a few ms before the next vowel comes in. So that's the fadeout and critical region we want to mimic—let's do this by placing the overlap over some silence and the very beginning of the voicing of the consonant. You can see that that's what I did in the above screenshot.

Finally, liquid (or glide) consonants are terrible! So let's have a look at a 'y'.


There appears to be no true vowel fadeout before the consonant here… but it's not immediately clear what to do with the oto. Let's look at the spectrogram (by double-clicking the little [s]) to learn more.


Okay, that makes a little more sense. Remember, vowels appear as consistent blocks in the spectrogram, whereas consonants do weird things. We don't see any real fade-out before the prefix vowel becomes the consonant, so let's just overlap into the beginning of the consonant to try to mimic this. You can see how I did that in the images above. Using the spectrogram can be really helpful for other things in otos as well, but I'm not going to talk about it deeply in this tutorial. The general idea is that you can find the consonant and clear vowel regions easily in the spectrogram, once you know what you're looking at.

Once you get accustomed to what different kinds of consonants do, you won't have to reference the VCV waveform to do your otos. Practice makes it very simple, in fact! These are not all of the types of consonants (s comes to mind as behaving especially funny), but it should give you an idea of where to start. If you get stuck, look at a VCV sample from a well-recorded VCV bank to see what your critical region should look like, and use the overlap to mimic it carefully.

On the subject of overlap values, an overlap of around 30 ms is the minimum necessary to create a realistic fade-out or crossfade. In many cases, 30 ms is enough, though it's fine to use more at your discretion. For CV-type otos (ones that don't have a real prefix vowel fade), I don't recommend using values above 60 ms; for VCV, I wouldn't recommend using more than 100 ms. Large values of overlap tend to create slur, and cause UTAU to have to push more of your sample back with STP, which isn't ideal (as it may create weird fades). Therefore, watch out for overdoing your overlap values.
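
Those rules of thumb are easy to turn into a quick sanity check. The sketch below just encodes the numbers from the paragraph above (30 ms minimum, 60 ms ceiling for CV-type otos, 100 ms ceiling for VCV); the function name check_overlap and the "cv"/"vcv" labels are made up for illustration, and the limits are recommendations, not hard rules.

```python
MIN_OVERLAP_MS = 30.0
MAX_OVERLAP_MS = {"cv": 60.0, "vcv": 100.0}  # ceilings suggested in this tutorial


def check_overlap(overlap_ms, oto_type):
    """Return a list of warnings for one overlap value ('cv' or 'vcv')."""
    warnings = []
    if overlap_ms < MIN_OVERLAP_MS:
        warnings.append(
            f"overlap {overlap_ms} ms is under {MIN_OVERLAP_MS} ms; the fade may sound abrupt"
        )
    ceiling = MAX_OVERLAP_MS[oto_type.lower()]
    if overlap_ms > ceiling:
        warnings.append(
            f"overlap {overlap_ms} ms is over {ceiling} ms for {oto_type}; it may slur"
        )
    return warnings


for value, kind in [(20, "cv"), (45, "cv"), (80, "cv"), (80, "vcv")]:
    print(value, kind, check_overlap(value, kind) or "ok")
```

Running something like this over a whole oto.ini is a fast way to spot the few entries that need a second look.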

Okay, so that's CVs and VCVs. What about Vs and VVs, you say? Well, with VVs you will treat the transition region between the vowels as the consonant region. Let's look at a VV through the spectrogram (which is really the only way to know what's going on with them):


See that transition? Treat that like it's the consonant. Put the preutterance at the end of it, then put the overlap into the prefix vowel like you would do for a VCV. Simple enough, right? Vs (single vowels) are a little tricky; they don't transition perfectly, and there's no great way to mimic the transition. Personally, I just throw 30 ms of overlap and 60 ms of preutterance on them and call it good. You could also make the overlap and preutterance the same. Again, there's no great way to do this.

Now, the only type of oto I haven't talked about is the VC. The VC sounds like a strange thing to anyone, especially for a Japanese bank, but in fact it makes perfect sense once you see where it comes from. A Japanese VC is taken out of a VCV-type string to preserve the prefix vowel fade-out, as VCV does, but without requiring as many samples to get them all.


Remembering our VCV oto theory, we still want to put the overlap at the end of the consistent prefix vowel—we want to crossfade the clear vowels together. However, because this is a VC, we'll put the preutterance at the very end of the vowel fadeout, preserving timing through this note.

Now, what to do with the vowel region of these otos can be a bit of a stressful thought. After all, we don't really want any of this sample to stretch. However, your resampler will crash and refuse to render the sample if you don't have any vowel region at all! So, what we'll do depends on the sample type—which determines what part of the critical region CAN stretch if necessary. For hard consonants, let the white space between the vowel fade-out and the consonant stretch (and take the consonant itself out with the cutoff). For soft consonants with tonal voicing regions (like b, g, d), let the voicing stretch. This also applies to nasal consonants like m and n, and voiced glides like y and w (and l). This works with our oto theory because that voicing region carries a pitch—which is exactly what our vowel region is doing, funny enough. For soft consonants that shouldn't stretch, like s, z, h, and f (these ones sound gross when stretched by resamplers), just leave a tiny bit of white at the end of the consonant. You shouldn't ever hear it, so don't worry about it.
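
The three cases above can be summed up as a small lookup table. This is only a sketch of the paragraph: the class labels "hard", "voiced", and "noisy" are my own shorthand, and the table lists only the consonants actually named in this tutorial, so you would extend it for your own bank.

```python
# What to let stretch in a VC sample, by consonant class.
STRETCH_ADVICE = {
    "hard": "stretch the silence between the vowel fade-out and the consonant; "
            "cut the consonant itself off with the cutoff",
    "voiced": "stretch the voicing of the consonant, since it carries a pitch",
    "noisy": "leave only a tiny sliver of vowel region at the end of the consonant",
}

# Only the consonants mentioned in the tutorial; labels are illustrative.
CONSONANT_CLASS = {
    "k": "hard",
    "b": "voiced", "g": "voiced", "d": "voiced",
    "m": "voiced", "n": "voiced",
    "y": "voiced", "w": "voiced", "l": "voiced",
    "s": "noisy", "z": "noisy", "h": "noisy", "f": "noisy",
}


def stretch_advice(consonant):
    """Return the stretch advice for a consonant, or a fallback if unlisted."""
    cls = CONSONANT_CLASS.get(consonant)
    if cls is None:
        return "unknown consonant: classify it by ear and by the waveform"
    return STRETCH_ADVICE[cls]


print(stretch_advice("b"))
print(stretch_advice("s"))
```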

For English (or other language) VCs—the kind that go at the end of a word, rather than just faking the prefix vowel fade-out—oto them the exact same way as Japanese VCs, but leave the white space at the end of the sample to stretch. Basically, you're still maintaining the critical pronunciation region at the end of the vowel, just that it also includes a consonant. So fade it out the same way.


If your sample is a diphthong, you'll need to treat it a bit differently. Think of the end of the diphthong as the critical “consonant” region as well: you don't want to have it overlapped away, just like I said about VCV. Therefore, put the preutterance at the very end of the first vowel in the diphthong, letting the end keep its character. Of course the overlap should be at the end of the consistent part of the first vowel.


And that's the gist of oto theory! I hope this made sense to everyone and I didn't just explode your heads. Feel free to contact me (cdra or cdra1617) if you have questions.

Note: Some of these ideas were borrowed from Cz, but built up and written down by me. Others she will disagree with, and that's okay too. In fact, it's okay to disagree with some of this; it's just theory that I'd like to share for the sake of UTAU users who'd like to become better at otos.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License