Converting Djvu To Pdf AND Preserving Table Of Contents, How Is It Possible?
Answer:
update: user3124688 has coded up this process in the script dpsprep.
I don't know of any tools that will do the conversion for you. You certainly should be able to do it, but it might take a little work. I'll outline the basic process. You'll need the open source command line utilities pdftk and djvused (part of DjVuLibre). These are available from your package manager (GNU/Linux) or their websites (Windows, OS X).
step 1: convert the file text
First, use any tool to convert the DJVU file to a PDF (without bookmarks).
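As one possible choice for this step, DjVuLibre's own ddjvu can write a PDF directly. The sketch below just wraps that call; the function names are mine, and it assumes ddjvu is on your PATH. As far as I know ddjvu renders the pages as images, so don't expect the hidden text layer to survive this way.

```python
import subprocess


def ddjvu_command(djvu_path, pdf_path):
    """Build the argument list for a plain DJVU -> PDF conversion with ddjvu."""
    return ["ddjvu", "-format=pdf", djvu_path, pdf_path]


def convert_djvu_to_pdf(djvu_path, pdf_path):
    """Run ddjvu; raises CalledProcessError if the conversion fails."""
    subprocess.run(ddjvu_command(djvu_path, pdf_path), check=True)
```

Calling convert_djvu_to_pdf("filename.djvu", "filename.pdf") would produce the bookmark-free PDF used in the remaining steps.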
Suppose the files are called filename.djvu and filename.pdf.
step 2: extract DJVU outline
Next, output the DJVU outline data to a file, like this:
djvused "filename.djvu" -e 'print-outline' > bmarks.out
This is a file listing the DJVU document's bookmarks in a serialized tree format. In fact it's just an S-expression (SEXPR), and can be easily parsed. The format is as follows:
file ::= (bookmarks
          <bookmark>*)
bookmark ::= (name
              page
              <bookmark>*)
name ::= "<character>*"
page ::= "#<digit>+"
For example:
(bookmarks
 ("bmark1"
  "#1")
 ("bmark2"
  "#5"
  ("bmark2subbmark1"
   "#6")
  ("bmark2subbmark2"
   "#7"))
 ("bmark3"
  "#9"
  ...))
step 3: convert DJVU outline to PDF metadata format
Now, we need to convert these bookmarks into the format that pdftk uses for PDF metadata. This file has the format:
file ::= <entry>*
entry ::= BookmarkBegin
          BookmarkTitle: <title>
          BookmarkLevel: <number>
          BookmarkPageNumber: <number>
title ::= <character>*
So our example would become:
BookmarkBegin
BookmarkTitle: bmark1
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: bmark2
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: bmark2subbmark1
BookmarkLevel: 2
BookmarkPageNumber: 6
BookmarkBegin
BookmarkTitle: bmark2subbmark2
BookmarkLevel: 2
BookmarkPageNumber: 7
BookmarkBegin
BookmarkTitle: bmark3
BookmarkLevel: 1
BookmarkPageNumber: 9
Basically, you just need to write a script to walk the SEXPR tree, keeping track of the level, and output the name, page number and level of each entry it comes to, in the correct format.
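Here is a minimal sketch of such a script, not a finished tool. The names parse_sexpr, to_pdftk and convert are my own. It assumes page references always look like "#<digits>" as in the grammar above, and it only unescapes \" and \\ inside titles; any other escape sequence is left untouched.

```python
import re

# One token: "(", ")", a double-quoted string, or a bare symbol such as bookmarks.
TOKEN = re.compile(r'\s*(?:(\()|(\))|"((?:[^"\\]|\\.)*)"|([^\s()"]+))')


def parse_sexpr(text):
    """Parse the first S-expression in text into nested Python lists of str."""
    text = text.strip()
    stack = [[]]
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group(1):                      # "(" opens a new list
            stack.append([])
        elif m.group(2):                    # ")" closes the current list
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif m.group(3) is not None:        # quoted string (may be empty)
            stack[-1].append(re.sub(r'\\(["\\])', r"\1", m.group(3)))
        else:                               # bare symbol
            stack[-1].append(m.group(4))
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    if not stack[0]:
        raise ValueError("empty input")
    return stack[0][0]


def to_pdftk(bookmarks, level=1):
    """Walk the bookmark tree depth-first and return pdftk metadata lines."""
    lines = []
    for name, page, *children in bookmarks:
        if not re.fullmatch(r"#\d+", page):
            # Assumption: only numeric page references are supported here.
            raise ValueError(f"unsupported page reference: {page!r}")
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {name}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {page[1:]}",
        ]
        lines += to_pdftk(children, level + 1)
    return lines


def convert(outline_text):
    """Convert djvused print-outline text into pdftk bookmark text."""
    tree = parse_sexpr(outline_text)
    if not isinstance(tree, list) or not tree or tree[0] != "bookmarks":
        raise ValueError("expected a (bookmarks ...) expression")
    return "\n".join(to_pdftk(tree[1:])) + "\n"
```

Passing the contents of bmarks.out to convert() returns the bookmark entries as a single string, ready to be spliced into the PDF metadata. A DJVU file with no outline produces an empty bmarks.out, which this sketch rejects with a ValueError.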
step 4: extract PDF metadata and splice in converted bookmarks
Once you've got the converted list, output the PDF metadata from your converted PDF file:
pdftk "filename.pdf" dump_data > pdfmetadata.out
Now, open the file and find the line that begins:
NumberOfPages:
Insert the converted bookmarks after this line. Save the new file as pdfmetadata.in.
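If you'd rather not edit the file by hand, the splice can be scripted. This is only a sketch: splice_bookmarks is a name I made up, and it assumes the dump contains exactly the kind of NumberOfPages: line described above and that the PDF has no bookmarks of its own yet.

```python
def splice_bookmarks(metadata, bookmarks):
    """Return metadata with the bookmark text inserted after NumberOfPages:."""
    out = []
    inserted = False
    for line in metadata.splitlines(keepends=True):
        out.append(line)
        if not inserted and line.startswith("NumberOfPages:"):
            if not line.endswith("\n"):
                out.append("\n")            # keep the bookmarks on their own lines
            out.append(bookmarks if bookmarks.endswith("\n") else bookmarks + "\n")
            inserted = True
    if not inserted:
        raise ValueError("no 'NumberOfPages:' line found in metadata")
    return "".join(out)
```

Read pdfmetadata.out and your converted bookmark list into strings, pass them to splice_bookmarks(), and write the result to pdfmetadata.in.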
step 5: create PDF with bookmarks
Now we can create a new PDF file incorporating this metadata:
pdftk "filename.pdf" update_info "pdfmetadata.in" output out.pdf
The file out.pdf should be a copy of your PDF with the bookmarks imported from the DJVU file.
Based on the very clear outline above given by user @pyrocrasty (thank you!), I have implemented a DJVU to PDF converter which preserves both OCR'd text and the bookmark structure. You may find it here:
https://github.com/kcroker/dpsprep
Acknowledgements for the OCR data go to @zetah on the Ubuntu forums!