USS Clueless - A break
     
     
 

Stardate 20040204.1836

(Captain's log): Sorry about the slow pace of postings the last few days. I've been distracted by some other things I was doing, and there really hasn't been much in the news that triggered whatever-it-is in the back of my head that does the real writing around here.

I wrote a few days ago about a problem I had with trying to rip certain anime episodes from a couple of DVDs I own. I actually figured out how to make that work, in part because of advice from readers.

The first problem was that when I tried to include any soundtrack except the first, Vidomi would miscode the sound. The first soundtrack on those DVDs is the English dub, and I don't like it. Some of the voices are miscast and they rewrote the dialog in ways I didn't care for. I also didn't think that the acting was as good on the English dub. So I wanted to use the Japanese soundtrack instead, but whenever I tried to convert it using Vidomi, the sound just warbled on playback.

I never figured out what to do about that in Vidomi. But after reading some of the mail I received, it occurred to me that the rip tool I was using (SmartRipper 2.41) might permit a different solution.

So when I ripped the DVDs, I configured SmartRipper so that it didn't rip the English soundtrack at all. I also configured it to rip the Japanese soundtrack and remap it so that it looked like the first one (0x80) instead of the second one (0x81). When I extracted a test segment from that using Vidomi, it sounded great. And it sounded like Japanese.

But without subtitles, that would be useless since neither my friend nor I speak Japanese. Both SmartRipper and Vidomi supposedly can process subtitles, but the resulting file wasn't treated as if it had subtitles even by my normal DVD player program (WinDVD Platinum), let alone by Windows Media Player (WMP), which was what my friend was going to use to view them.

I did some googling, and discovered what was going on. Subtitles on DVDs are handled as a special video stream encoded into the VOB file, which combines them with the normal video and however many soundtracks there are. The subtitle graphics stream is only 2 bits per pixel, encoding four colors, and is presumed not to change very rapidly or require much total bandwidth. One "color" is transparent, one is used for the letters in the foreground, one outlines the letters, and one is available for anti-aliasing.
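To make "2 bits per pixel, encoding four colors" concrete, here is a minimal sketch of pulling four pixel values out of each byte. It is only an illustration: real DVD subtitle streams are run-length encoded on top of this, which the sketch ignores, and the `ROLES` mapping of values to meanings is an assumption chosen to match the description above, not something read off a disc.

```python
# Illustration only: four 2-bit pixel values packed into each byte.
# Real DVD subtitle streams add run-length encoding, which is ignored here.

# Assumed mapping of pixel values to roles (hypothetical, for illustration).
ROLES = {0: "transparent", 1: "letter", 2: "outline", 3: "anti-alias"}


def unpack_2bpp(data: bytes) -> list[int]:
    """Split each byte into four 2-bit pixel values, leftmost pixel first."""
    pixels = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            pixels.append((byte >> shift) & 0b11)
    return pixels


pixels = unpack_2bpp(bytes([0b00011011]))
print([ROLES[p] for p in pixels])
```

With only two bits there is no room for more than four values, which is why one of them has to be spent on transparency.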

WMP expects subtitles for an AVI file (for instance) to be stored in a separate parallel file with the extension SMI. Moreover, the subtitles in SMI files are text, and are accompanied in the text by timestamps indicating when they appear and when they vanish again. A web page I found indicated that what you needed was a program called SubRip, and it's pretty cool. One of the things it does is to convert graphical subtitles into text.

It uses a poor-man's form of optical character recognition, where the user does the recognizing. It takes advantage of the fact that the subtitles are machine generated and thus consistent, and it assumes that a contiguous section of foreground color is one letter, so it can easily parse individual characters out of the subtitle graphic. If it doesn't recognize one, it displays it on the screen and asks you to type the letter, and adds it to a database. Thereafter, it knows that one and automatically converts it every time it encounters it. When you rip a subtitle channel using SubRip, initially it prompts you a lot, but the rate cuts way down and after a certain point it finishes on its own. It took maybe fifteen minutes to capture the subtitles from the first DVD, which is not onerous.
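The general idea can be sketched in a few lines. This is not SubRip's actual code, just an assumed reconstruction of the technique described: find each contiguous blob of foreground pixels, look its shape up in a dictionary, and ask a person only when the shape is new. The names `find_glyphs`, `recognize` and `known` are made up for the example, and the "bitmap" is a list of equal-length strings where `#` is foreground.

```python
from collections import deque


def find_glyphs(bitmap: list[str]) -> list[frozenset]:
    """Return each 4-connected blob of '#' pixels, ordered left to right.

    Each blob is shifted to its own origin so the same shape always
    produces the same key, wherever it sits on the line.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    seen = set()
    regions = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != "#" or (x, y) in seen:
                continue
            # Flood-fill outward from this pixel to collect one blob.
            queue = deque([(x, y)])
            seen.add((x, y))
            region = []
            while queue:
                cx, cy = queue.popleft()
                region.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and bitmap[ny][nx] == "#" and (nx, ny) not in seen):
                        seen.add((nx, ny))
                        queue.append((nx, ny))
            regions.append(region)
    regions.sort(key=lambda r: min(px for px, _ in r))
    glyphs = []
    for region in regions:
        min_x = min(px for px, _ in region)
        min_y = min(py for _, py in region)
        glyphs.append(frozenset((px - min_x, py - min_y) for px, py in region))
    return glyphs


def recognize(bitmap, known, ask):
    """Convert a bitmap to text, calling ask(glyph) only for unknown shapes."""
    text = []
    for glyph in find_glyphs(bitmap):
        if glyph not in known:
            known[glyph] = ask(glyph)  # the human does the recognizing
        text.append(known[glyph])
    return "".join(text)


line = [
    "#.##.#",
    "#.#..#",
    "#.##.#",
]
known = {}
answers = iter("IC")  # stand-in for a person typing at the prompt
text = recognize(line, known, lambda glyph: next(answers))
print(text, len(known))  # prints: ICI 2
```

Three blobs, but only two prompts, because the third blob has the same shape as the first. That is why the prompting drops off so quickly. The sketch throws away vertical position, so it could not tell a comma from an apostrophe, and anything made of two separate blobs comes out as two characters.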

Most characters consist of a contiguous foreground region. A character such as the equal sign = is not contiguous but SubRip isn't confused and recognizes it as a single character. Unfortunately, SubRip treats double quotes " as two characters, but that was easy enough to deal with later. (It meant that each double-quote in the subtitle was converted to two consecutive double-quotes in the resulting text file, which could be fixed by one "replace all" on the file.)
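The "replace all" amounts to a one-liner. A minimal sketch, assuming the only damage is that every double-quote came out doubled (the function name is made up for the example):

```python
def fix_double_quotes(text: str) -> str:
    """Collapse each pair of consecutive double-quotes into a single one."""
    return text.replace('""', '"')


sample = 'She said ""hello"" and left.'
print(fix_double_quotes(sample))
```

The one thing this would get wrong is a subtitle that legitimately contained two double-quotes in a row, which did not seem likely enough to worry about.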

I was working with two different DVDs but both were from the same series and publisher, and they used the same tool and font to create the subtitles each time. So the recognition file created by processing the first DVD worked fine for the second as well.

SubRip can write a lot of different formats, but unfortunately SMI isn't one of them. However, it does have the ability to write SUB format, and there is a tool called Sub2Sami which, as the name suggests, converts a SUB file to an SMI file. (The differences are really pretty small, but are critical. For instance, in a SUB file timestamps are represented as HH:MM:SS.FFF, whereas in SMI files they're integers in millisecond units.)
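The timestamp difference is easy to show. Here is a minimal sketch of the conversion, assuming the HH:MM:SS.FFF layout described above; it is an illustration of the arithmetic, not Sub2Sami's code, and `timestamp_to_ms` is a made-up name.

```python
def timestamp_to_ms(stamp: str) -> int:
    """Convert 'HH:MM:SS.FFF' (fraction optional) to integer milliseconds."""
    hours, minutes, seconds = stamp.split(":")
    whole, _, frac = seconds.partition(".")
    # Pad or trim the fraction to exactly three digits, so ".5" means 500 ms.
    millis = int((frac + "000")[:3])
    return (int(hours) * 60 + int(minutes)) * 60_000 + int(whole) * 1000 + millis


print(timestamp_to_ms("00:00:08.500"))
print(timestamp_to_ms("01:37:27.5"))
```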

The site where I found the page that explained most of this had offered SubRip for download. That page also said that Sub2Sami could be downloaded from that site – only it wasn't there.

More googling. The first couple of pages I found which claimed to offer it actually ended up linking back to the site which didn't have it any longer. Finally I found a site that actually had it.

Which left only one remaining problem: the DVD's subtitle timestamps are all referenced to the beginning of the DVD. One of the DVDs I ripped contained episodes 13-19, and I wanted episodes 15 and 17, the third and fifth on the disc.

Using SubRip I grabbed the subtitles for the entire disc into a single file. I edited that omnibus SUB file and extracted out the subtitles for episodes 15 and 17 into separate SUB files, which I could convert to SMI. But WMP expects the timestamps to be referenced to the beginning of the AVI file it is playing, not to the beginning of the DVD from whence they were "borrowed".

The first subtitle in episode 17 was supposed to appear after eight and a half seconds. WMP expected it to have a timestamp which meant "8.5 seconds". Unfortunately, my SUB file for episode 17 had its first timestamp at 01:37:27.5. There were 265 subtitles in that episode, with two timestamps per subtitle, and every single one of them was off by ninety-seven minutes.

I was afraid I was going to have to do something ugly and unfun to correct that (such as reading the entire thing into Excel and using it to offset all the times), but a short RTFM later I realized there was a clean solution. After I extracted a single episode's worth of subtitles, I was able to reload it into SubRip, which let me use SubRip's time adjustment utility. That feature lets you enter any time interval, which it will either add to or subtract from every timestamp in the file.

That allowed me to subtract 01:37:19 from all 530 of them in a single step, which got me within a fraction of a second of where I wanted to be. Then it was only a matter of fine tuning. (And of figuring out that the fourth entry field for that function is entered in millisecond units, not in fractions of a second. "5" meant "5 milliseconds", not "0.5 seconds".)
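The arithmetic behind that adjustment is simple enough to sketch. This is not how SubRip does it internally (that is unknown here), just an assumed equivalent: convert every timestamp to milliseconds, add a signed offset, and convert back. All function names are made up, and the second timestamp in the example is invented to have something to shift.

```python
def parse_ms(stamp: str) -> int:
    """Convert 'HH:MM:SS.FFF' to integer milliseconds."""
    hours, minutes, seconds = stamp.split(":")
    whole, _, frac = seconds.partition(".")
    millis = int((frac + "000")[:3])
    return (int(hours) * 60 + int(minutes)) * 60_000 + int(whole) * 1000 + millis


def format_ms(ms: int) -> str:
    """Convert integer milliseconds back to 'HH:MM:SS.FFF'."""
    if ms < 0:
        raise ValueError("offset pushed a timestamp before the start of the file")
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def shift_all(stamps: list[str], delta_ms: int) -> list[str]:
    """Add delta_ms (negative to subtract) to every timestamp."""
    return [format_ms(parse_ms(s) + delta_ms) for s in stamps]


def offset_ms(hours: int, minutes: int, seconds: int, millis: int) -> int:
    """Four entry fields; the last one is whole milliseconds, not a fraction."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


# First stamp is from episode 17; the second is invented for the example.
stamps = ["01:37:27.500", "01:37:30.250"]
shifted = shift_all(stamps, -offset_ms(1, 37, 19, 0))
print(shifted)  # prints: ['00:00:08.500', '00:00:11.250']
```

The millisecond gotcha shows up in `offset_ms`: entering 5 in the fourth field gives an offset of 5 ms, and getting half a second requires entering 500.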

I watched the first little bit of the video with the subtitles to see how closely synchronized they were, and used that same mechanism in SubRip in order to adjust the timestamps up or down by small fractions of a second. After about four iterations it was perfect.

That was fun. I learned some things, and the problem was tricky without being frustrating, and the result was everything I hoped it would be. I wasn't really trying hard for maximum compression, and I didn't resize or change the frame rate, and even so the resulting file sizes were quite acceptable. Episode 15 ended up being 202 MB for 20:36 of 624x480 video at 30 fps (and 48 kHz stereo encoded in MP3 at 160 kbit/s, which was probably overkill). Episode 17 was 201 MB. And I created a single file out of episodes 24-26 which was 1:01:43 long and took 604 MB. Two 16X CD burns later, I was done.
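For a rough sense of what those sizes mean as a data rate, here is a back-of-the-envelope sketch. It assumes 1 MB is 1,000,000 bytes (pass `bytes_per_mb=1_048_576` for the other convention) and ignores AVI container overhead, so the video figure is an approximation at best.

```python
def average_kbps(size_mb: float, minutes: int, seconds: int,
                 bytes_per_mb: int = 1_000_000) -> float:
    """Average total bitrate in kbit/s for a file of the given size and length."""
    duration_s = minutes * 60 + seconds
    return size_mb * bytes_per_mb * 8 / duration_s / 1000


AUDIO_KBPS = 160  # the MP3 rate quoted above

ep15_total = average_kbps(202, 20, 36)
ep15_video = ep15_total - AUDIO_KBPS  # approximate: container overhead ignored
combined_total = average_kbps(604, 61, 43)  # the 1:01:43 file

print(f"episode 15: {ep15_total:.0f} kbit/s total, about {ep15_video:.0f} for video")
print(f"episodes 24-26: {combined_total:.0f} kbit/s total")
```

Both files come out at roughly 1.3 Mbit/s overall, so the audio is about an eighth of the total.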


Captured by MemoWeb from http://denbeste.nu/cd_log_entries/2004/02/Abreak.shtml on 9/16/2004