Hello! I’m Aster (Aster Selene), and you may recognize me as the girl who used to put snarky, troperiffic reviews on all of the UTAU rankings. Sadly, that’s not true anymore, as my parents don’t take well to me taking breaks from my studies for longer than 15 minutes, and UTAU ranking reviews usually take longer than that… Rest assured that someday, I will storm back in full glory into those reviews and start re-applying snark to them.
In any case! Today I’m here to discuss the use of continuous sound (連続音 or renzokuon) voicebanks, also known as VCV. They’re also erroneously called “triphones”. Continue reading after the jump!
First, a little bit of history. I’m not so good at this part, so I’m going to paraphrase what a friend of mine said in a different venue:
The VCV (vowel-consonant-vowel) method was developed by Ameya, the creator of UTAU. It was first displayed with the UTAU Momone Momo, with the song “Kenka Wakare” (original by MimiroboP).
There was a catch, though. Momo’s voicebank at the time was not a full VCV bank; it used what is now called a “Lite” reclist (a reclist being the list of syllables you record for an UTAU voicebank). The list didn’t cover all possible combinations, so while it could sound good in some parts, it would fall back to CV (normal) quality in others (and if you listen, you’ll hear that in the middle of the song it starts getting a little choppier).
This is the part most people skip: After Momo came the UTAU Otodamaya. Not a lot of people really know her even now.
Afterwards came Shirakane Hiyori, who had a different reclist than Momo, but it was still a Lite list.
And of course, a bunch of UTAU started to follow suit. Some started to experiment with complete “standalone” VCV banks that would cover an entire song, but the idea never fully took hold… that is, until a new UTAU named Namine Ritsu came with a standalone list. And here, things started to take off.
Now, how does VCV work? Instead of recording flat syllables like “ka” and “to”, you would record ones with vowels before them like “a ka” and “i to”. For example, if you were to plug in the first line of the song “Toeto” in CV, you’d write
あ な た の こ と が
(a) (na) (ta) (no) (ko) (to) (ga)
And if you were to do this in VCV:
- あ a な a た a の o こ o と o が
(- a) (a na) (a ta) (a no) (o ko) (o to) (o ga)
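If you’d rather see that conversion as code, here’s a rough sketch in Python. It assumes the lyrics are already split into romaji syllables and that each syllable ends in its vowel; the function name cv_to_vcv is made up for this post and isn’t part of UTAU.

```python
def cv_to_vcv(syllables):
    """Turn a list of CV romaji syllables into VCV-style aliases.

    Assumption: every syllable ends with its vowel, so the last
    character is what carries over into the next alias.
    """
    aliases = []
    previous_vowel = "-"  # "-" marks the start of a phrase
    for syllable in syllables:
        aliases.append(f"{previous_vowel} {syllable}")
        previous_vowel = syllable[-1]
    return aliases


if __name__ == "__main__":
    line = ["a", "na", "ta", "no", "ko", "to", "ga"]
    print(" ".join(f"({alias})" for alias in cv_to_vcv(line)))
```

Running it on the Toeto line gives the same seven aliases written out above.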
The methodology utilizes the “overlap” function. In UTAU, when you configure the oto.ini (the file that distinguishes your consonants from your vowels to prevent the program from stretching out your vowels or using a long space in recording as a note), there’s something called overlap, and in CV voicebanks it’s used to make consonants a little less awkward (you need a bit of overlap in consonants like k or ch).
In VCV voicebanks, the overlap smooths out the vowels. In the above example, take “a no” and “o ko”: the beginning o of “o ko” would overlap the o at the end of “a no”, mixing them together.
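To give a feel for what “mixing them together” means, here’s a toy sketch of a linear crossfade between two notes. This is only an illustration and not how UTAU’s own tools actually do it: the function name crossfade is invented, the notes are plain lists of numbers, and the overlap is counted in samples rather than taken from an oto.ini.

```python
def crossfade(prev_note, next_note, overlap):
    """Blend the end of prev_note into the start of next_note.

    Assumption for illustration: a plain linear fade over `overlap`
    samples. The previous note fades out while the next one fades in.
    """
    overlap = min(overlap, len(prev_note), len(next_note))
    if overlap <= 0:
        return list(prev_note) + list(next_note)

    head = list(prev_note[:-overlap])   # untouched part of the first note
    tail = prev_note[-overlap:]         # part that gets mixed
    mixed = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # rises from near 0 toward 1
        mixed.append(tail[i] * (1 - weight) + next_note[i] * weight)
    return head + mixed + list(next_note[overlap:])


if __name__ == "__main__":
    # Two fake "vowels": one loud, one silent, overlapped by two samples.
    print(crossfade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], 2))
```

When both notes hold the same vowel, as with the two o’s above, the mixed region sounds like one continuous vowel instead of two notes butted together.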
The recording method is also modified. Because recording every single possible vowel-syllable combination would take up disk space and drive the voicer mad, the syllables are recorded in sets. To take an example from Ritsu’s VCV bank:
This is one sound file in Ritsu’s bank, and this style of recording will yield seven syllables if the oto.ini is done properly: “- ka”, “a ka”, “a ki”, “i ka”, “a ku”, “u ke”, and “e ka”. This significantly cuts down on the number of recordings needed to fill up a whole bank.
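Here’s a small sketch of how one recording turns into several entries. It assumes the recorded string is “ka ka ki ka ku ke ka” (inferred from the seven syllables listed above), and the filename and the function name are hypothetical, made up for this example.

```python
def aliases_for_recording(filename, moras):
    """List the VCV aliases one multi-mora recording can provide.

    Returns (alias, filename, position) tuples, where position is the
    index of the mora inside the recording. Assumes each mora ends in
    its vowel.
    """
    entries = []
    previous_vowel = "-"  # the first mora starts from silence
    for position, mora in enumerate(moras):
        entries.append((f"{previous_vowel} {mora}", filename, position))
        previous_vowel = mora[-1]
    return entries


if __name__ == "__main__":
    # Hypothetical filename; the mora string is inferred from the text.
    moras = ["ka", "ka", "ki", "ka", "ku", "ke", "ka"]
    for alias, wav, position in aliases_for_recording("_kakakikakukeka.wav", moras):
        print(f"{wav}  mora {position}  ->  {alias}")
```

Seven aliases come out of a single file, which is where the savings come from; each alias still needs its own oto.ini entry pointing at the right spot in that file.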
(Note: Because Ritsu’s bank has each file contributing seven syllables, the bank is referred to as being “7-mora”. Other lists exist with other numbers of mora; for instance, Sukone Tei’s VCV is 5-mora, as is Takano Yuki’s.)
It also helps to use a guideBGM, which is essentially a song that plays in the background to help the recorder “sing” his or her samples to the beat. It keeps the rhythm and pitch consistent, which makes the samples easier for the UTAU resampler to work with.
VCV has two major advantages. The first is that it makes transitions between syllables much smoother, since even the transitions are recorded by the voicer. The second is that it creates a plausible space between the vowel and the consonant when it’s natural – it’s impossible to completely hold out a vowel without creating a tiny little space for things like “a ta”, but UTAU will make CV voicebanks bleed the vowel into the consonant.
Currently, VCV itself is not very well-known to people who don’t use UTAU and only listen to the songs (though they probably notice that some songs sound smoother than others), but it’s very popular amongst UTAU users and voicebank voicers, and most popular voicebanks utilize VCV.
Also, VCV sounds a lot more realistic. Compare the old version of an original Tei song “13km” with the new VCV voicebank:
VCV has also spawned more recording methods, mostly for other languages. The “CV-VC” method, though it’s debatable whether it is better, is gaining popularity for voicebanks handling languages such as English, Korean, and Chinese.
My personal experience with VCV? I’ve been using UTAU since November of 2009 (and I’ve had an UTAU since February), but even so, VCV still feels like something new to me. My UTAU’s VCV bank isn’t much better than her CV bank, but this is because I did a pretty poor job recording (no guideBGM, pitch too low, and I got lazy somewhere near the end). So I feel that VCV can be a very powerful tool if it’s done correctly. It requires a lot of work and a lot of patience, but it can bring some very good results when done right.
And does a voicer or user have to use VCV? Absolutely not. There are still ways to do great things with CV; for instance, the popular songs Trip Trip and Hana ni Naru were made using CV voicebanks. VCV just makes things a little easier.
I see VCV as a huge step forward for UTAU. To use a DDR analogy, CV is like playing normally, and VCV is like hugging the bar. Sure, there are amazing players who don’t touch the bar, and there are terrible players who hang onto the bar as if it’s their lifeline. However, as evidenced by In The Groove, the best bar player will always be able to beat the best non-bar player, because s/he can do things that a player who doesn’t hug the bar just can’t do. The best VCV can make an UTAU sound as realistic as a human singer, though in a different way from Vocaloid.
Building off of the last point Aster made, though, this is not to say that CV is “bad” and VCV is “good.” (As an example, I’ve found that CV makes for better talkloid videos than VCV does.) However, it is a huge step forward, and I eagerly await more advancements with the UTAU program.