Modified Version of pdftohtml

Below is a source tarball for a modified version of utils/pdftohtml taken from poppler 0.8.3. The new version outputs HTML which flows better for e-book readers such as uBook. The main changes are:
Here is the tarball: wkt_pdftohtml-20081008.tar.gz. And here is a shell script which converts a PDF into a Zip file containing the HTML and JPEG files, suitable for the uBook reader: pdftozip.
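
For readers who would rather not use a shell script, the same idea can be sketched in Python. This is not the pdftozip script itself, only a rough equivalent under some assumptions: the function names zip_directory and pdf_to_zip are made up for illustration, the pdftohtml binary is assumed to be on the PATH, and everything the converter writes (HTML and any extracted images) is assumed to belong in the Zip file.

```python
import subprocess
import tempfile
import zipfile
from pathlib import Path


def zip_directory(src_dir, zip_path):
    """Store every regular file under src_dir in zip_path.

    Returns the archive member names, relative to src_dir.
    """
    src = Path(src_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                name = path.relative_to(src).as_posix()
                archive.write(path, name)
                names.append(name)
    return names


def pdf_to_zip(pdf_path, zip_path, converter="pdftohtml"):
    """Convert one PDF in a scratch directory, then zip the results.

    Assumption: `converter` names a pdftohtml binary found on the PATH.
    """
    pdf = Path(pdf_path).resolve()
    with tempfile.TemporaryDirectory() as work:
        # -noframes asks pdftohtml for a single HTML file, not a frameset.
        subprocess.run(
            [converter, "-noframes", str(pdf), pdf.stem],
            cwd=work,
            check=True,
        )
        return zip_directory(work, zip_path)
```

Calling pdf_to_zip("book.pdf", "book.zip") would then leave a single archive to copy to the reader. The Zip file should be written outside the scratch directory, as it is here, so that it does not end up inside itself.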

Paragraph Flow Change

The change to the flow of the output paragraphs is a significant one, so here is the rationale behind the change. The original pdftohtml tried to keep the layout of the HTML output as close to the original PDF as possible, inserting <br> breaks at the end of every line on the PDF, and inserting <hr>s at the end of every PDF page. As an example, this short PDF document is converted to this HTML code by the original program:
Douglas Adams
From Wikipedia, the free encyclopedia
Douglas Noël Adams (11 March 1952 – 11 May 2001) was an En-
glish author, comic radio dramatist, and musician. He is best
known as the author of the Hitchhiker's Guide to the Galaxy series.
Hitchhiker's began on radio, and developed into a "trilogy" of five
books (which sold more than fifteen million copies during his life-
time) as well as a television series, a comic book series, a computer
game, and a feature film that was completed after Adams' death.
The series has also been adapted for live theatre using various
scripts; the earliest such productions used material newly written
by Adams. He was known to some fans as Bop Ad (after his illegi-
ble signature), or by his initials `DNA'; he was born the year before
the elucidation of the structure of "the meaning of life" or D.N.A. by
Francis Crick and James Watson in Cambridge i.e. where he was
born.
In addition to The Hitchhiker's Guide to the Galaxy, Douglas Adams
wrote or co-wrote three stories of the science fiction television se-
ries Doctor Who and served as Script Editor during the seventeenth
season. His other written works include the Dirk Gently novels,
and he co-wrote two Liff books and Last Chance to See, itself based
on a radio series. Adams also originated the idea for the computer
game Starship Titanic, which was produced by a company that
Adams co-founded, and adapted into a novel by Terry Jones. A
posthumous collection of essays and other material, including an
incomplete novel, was published as The Salmon of Doubt in 2002.
His fans and friends also knew Adams as an environmental ac-
tivist, a self-described `radical atheist', and a lover of fast cars,
cameras, the Macintosh computer, and other `techno gizmos'. The
biologist Richard Dawkins dedicated his book The God Delusion
to Douglas Adams and in it described how Adams came to un-
derstand evolution. Douglas was a keen technologist, writing
about such topics as e-mail and Usenet before they became widely
known. Toward the end of his life he was a sought-after lecturer
on topics including technology and the environment.

While this output is faithful to the original PDF, it does not flow well and is unsuitable for e-book readers. The new version joins PDF lines into paragraphs and separates the paragraphs with the normal HTML <p> tag. This accommodates e-book readers with arbitrary screen widths, allowing them to reflow each paragraph as required. The above PDF document is converted to this HTML code by the new program:
Douglas Adams

From Wikipedia, the free encyclopedia

Douglas Noël Adams (11 March 1952 - 11 May 2001) was an English author, comic radio dramatist, and musician. He is best known as the author of the Hitchhiker's Guide to the Galaxy series. Hitchhiker's began on radio, and developed into a "trilogy" of five books (which sold more than fifteen million copies during his lifetime) as well as a television series, a comic book series, a computer game, and a feature film that was completed after Adams' death. The series has also been adapted for live theatre using various scripts; the earliest such productions used material newly written by Adams. He was known to some fans as Bop Ad (after his illegible signature), or by his initials `DNA'; he was born the year before the elucidation of the structure of "the meaning of life" or D.N.A. by Francis Crick and James Watson in Cambridge i.e. where he was born.

In addition to The Hitchhiker's Guide to the Galaxy, Douglas Adams wrote or co-wrote three stories of the science fiction television series Doctor Who and served as Script Editor during the seventeenth season. His other written works include the Dirk Gently novels, and he cowrote two Liff books and Last Chance to See, itself based on a radio series. Adams also originated the idea for the computer game Starship Titanic, which was produced by a company that Adams cofounded, and adapted into a novel by Terry Jones. A posthumous collection of essays and other material, including an incomplete novel, was published as The Salmon of Doubt in 2002.

His fans and friends also knew Adams as an environmental activist, a self-described `radical atheist', and a lover of fast cars, cameras, the Macintosh computer, and other `techno gizmos'. The biologist Richard Dawkins dedicated his book The God Delusion to Douglas Adams and in it described how Adams came to understand evolution. Douglas was a keen technologist, writing about such topics as email and Usenet before they became widely known. Toward the end of his life he was a sought-after lecturer on topics including technology and the environment.
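
The joining step can be pictured roughly as follows. This is only a sketch in Python, not the C++ code in the tarball: the function names join_lines and paragraphs_to_html are made up for illustration, the paragraph boundaries are assumed to be known already, and the rule for rejoining split words (a trailing hyphen followed by a lowercase letter) is an assumption rather than the program's actual heuristic.

```python
import html


def join_lines(lines):
    """Join the physical lines of one paragraph into a single string."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-") and line[0].islower():
            # Assumed rule: a trailing hyphen followed by a lowercase
            # letter marks a word split across two lines, so drop the
            # hyphen and glue the halves together.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def paragraphs_to_html(paragraphs):
    """Emit one <p> element per paragraph, escaping &, < and >."""
    return "\n".join(
        "<p>" + html.escape(join_lines(lines), quote=False) + "</p>"
        for lines in paragraphs
    )


sample = [
    ["Douglas Adams"],
    [
        "His fans and friends also knew Adams as an environmental ac-",
        "tivist, a self-described `radical atheist', and a lover of fast cars,",
    ],
]
print(paragraphs_to_html(sample))
```

A hyphen in the middle of a line, as in "self-described", is left alone because only line ends are examined. A rule this simple will still misjoin a genuinely hyphenated word that happens to fall at a line break.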

The changes have been sent back to the poppler maintainers as diffs. Until they are absorbed into the main poppler tree, use this version.
Warren Toomey, October 2008.