Fetch, Clean and Zip a Subset of Wikipedia
Here are a few Perl scripts which I have written to fetch a set of
topics from Wikipedia, remove a
lot of unwanted HTML code, reduce the file size of the associated
images, and Zip them all into a set of Zip files for use with the
uBook reader. As an example, this Zips directory contains the Zip
files resulting from a download of the maths, physics and chemistry
topics from Wikipedia. There are about 1,100 topics. The original
HTML and images took up 81 Megs and 93 Megs respectively. This was
reduced down to 22 Megs and 69 Megs respectively. The Zip files in
total come in at 79 Megs. And with the "no images" option enabled,
the Zip files come down to under 8 Megs.
To run the software, you will need a Linux or Unix box with Perl 5.8
or later, the LWP::UserAgent
Perl module, the `convert' tool from ImageMagick
and the `zip' command-line tool from info-zip.
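A quick way to check that the prerequisites are in place is something like this (just a sketch; adjust for your own system):

```shell
# Report any of the external tools named above that are missing
for tool in perl convert zip; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
# Check that the LWP::UserAgent Perl module is installed
perl -MLWP::UserAgent -e 1 2>/dev/null || echo "missing: LWP::UserAgent"
```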
- wikifetch: fetch one or more Wikipedia topics and
their associated images
- wikitopics: search the downloaded topic files
for more topics to fetch
- cleanimages: reduce the size of the downloaded
image files
- cleanpages: remove unwanted HTML code from the
topic files, and rewrite hyperlinks
- buildzips: build a set of Zip files from the cleaned-up
topic and image files
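Taken together, the five scripts form a pipeline. A small driver script along these lines shows the intended order (the topic name is only an example, and the directory names are the defaults described below):

```shell
# Save a wrapper script showing the pipeline; it is written out, not run here
cat > run_all.sh <<'EOF'
#!/bin/sh
echo 'M._C._Escher' > topic_list   # one topic per line
wikifetch topic_list               # fetch pages into Html/, images into Images/
wikitopics Html > newtopic_list    # list related topics not yet fetched
cleanimages                        # recompress Images/ into NewImages/
cleanpages                         # clean Html/ into NewHtml/, rewrite links
buildzips                          # zip NewHtml/ into Zips/
EOF
chmod +x run_all.sh
```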
Usage: wikifetch topic [-ni] [htmldir] [imgdir]
Given a topic, e.g. 'M._C._Escher', fetch the topic and its associated
images and store them in the directories Html and Images,
respectively. Alternatively, you can name the HTML and image directories
on the command-line. The -ni option tells the script not
to fetch any images associated with the topic.
The topic can also be the name of a text file containing one topic
per line. For example, you can create a file called topic_list
listing several topics, and each of them will then be fetched when you
run wikifetch topic_list.
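For instance (the topic names here are only placeholders, not a definitive list):

```shell
# A topic file: one Wikipedia topic per line (example names only)
cat > topic_list <<'EOF'
M._C._Escher
Mathematics
Physics
Chemistry
EOF
wc -l < topic_list   # 4 topics
# then fetch them all with:  wikifetch topic_list
```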
Note: the topic(s) must be named exactly as they appear in
the Wikipedia URL, e.g. M._C._Escher with underscores rather than spaces.
Usage: wikitopics [htmldir]
Once you have downloaded some Wikipedia topics, you may want to fetch
others which are related to the ones you already have. You can run
wikitopics on the HTML directory to create a list of missing topics.
So, for example, if you do
wikitopics Html > newtopic_list
then the file called newtopic_list will be a list of new
topics, ready to be fed into wikifetch.
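Internally, the idea is presumably something like this sketch (my guess at the logic, demonstrated on a made-up page; the real script may differ):

```shell
# Fake downloaded page with two wiki links, one already fetched
mkdir -p Html
printf '<a href="/wiki/Alpha">A</a> <a href="/wiki/Beta">B</a>\n' > Html/Alpha.html
# Pull /wiki/Topic hrefs out of the saved pages, drop topics already on disk
grep -ho 'href="/wiki/[^"#]*"' Html/*.html \
  | sed 's|.*"/wiki/\([^"]*\)".*|\1|' \
  | sort -u \
  | while read -r t; do [ -f "Html/$t.html" ] || echo "$t"; done > newtopic_list
cat newtopic_list   # Beta
```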
Note: I strongly recommend that you inspect and hand-edit
the resulting list as you probably will not want all of the new topics.
Repeat the use of wikifetch and wikitopics until
you have built a large set of HTML topics and image files.
This shell script creates a new directory called NewImages,
and converts all the Jpeg and Gif files in Images into Jpegs
at 45% quality in the NewImages directory. Run this before
you run cleanpages. Yes, the directory names are hard-coded,
but it's a short script so feel free to modify it.
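The gist of it is a loop like the following (the directory names and the 45% quality come from the text above; the exact convert invocation is my assumption):

```shell
# Recompress every Jpeg and Gif in Images/ as a 45%-quality Jpeg in NewImages/
mkdir -p NewImages
for f in Images/*.jpg Images/*.gif; do
  [ -f "$f" ] || continue
  out="NewImages/$(basename "${f%.*}").jpg"
  convert "$f" -quality 45 "$out"
done
```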
Usage: cleanpages [-ni] [-nh] [olddir] [newdir]
This script removes unwanted HTML code from the downloaded topic files,
changes the image links to point to the smaller image files, and rewrites
the hyperlinks so that they will work once the topics are Zipped up.
The -ni option tells the script not to include images in
the HTML output. The -nh option tells the script not to include
any external hyperlinks in the HTML output. By default, the files
in the Html directory are cleaned up, and the new versions
stored in the NewHtml directory, although you can change
these defaults on the command-line.
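One piece of the link rewriting has to turn absolute /wiki/ links into links to the local topic files. A sketch of that step (assumed, not the actual code):

```shell
# A sample line of downloaded HTML
printf '<a href="/wiki/Physics">Physics</a>\n' > sample.html
# Rewrite wiki links so they point at local .html files
sed 's|href="/wiki/\([^"#]*\)"|href="\1.html"|g' sample.html
# prints: <a href="Physics.html">Physics</a>
```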
Usage: buildzips [htmldir] [zipdir]
Once you have fetched all the topics, reduced the images and cleaned
up the HTML files, it's time to build Zip files which can be used
by uBook. This script collects all the HTML files in the NewHtml
directory, and creates Zip files, named after the first two characters
of each topic name, in the Zips directory, although you can
change these defaults on the command-line.
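The grouping works out like this (the two-character prefix and the directory names come from the text; the zip invocation is my guess):

```shell
# Derive the Zip file name from the first two characters of the topic file
topic="Magnetism.html"
prefix=$(printf '%s' "$topic" | cut -c1-2)
echo "Zips/$prefix.zip"   # Zips/Ma.zip
# the script might then run something like:  zip Zips/Ma.zip NewHtml/Ma*.html
```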
That's it. The scripts are still pretty rough, so check back here
occasionally to see if I have done any later versions.
Last updated September 22, 2008, Warren Toomey.