Fetch, Clean and Zip a Subset of Wikipedia
Here are a few Perl scripts which I have written to fetch a set of
topics from Wikipedia, remove a
lot of unwanted HTML code, reduce the file size of the associated
images, and Zip everything up into a set of Zip files for use with the
uBook reader. The scripts are listed below; a typical end-to-end run
is sketched after the list.
- wikifetch: fetch one or more Wikipedia topics and
associated images
- wikitopics: search the downloaded topic files
for more topics to fetch
- cleanimages: reduce the size of the downloaded
image files
- cleanpages: remove unwanted HTML code from the
topic files, and rewrite hyperlinks
- buildzips: build a set of Zip files from the cleaned-up
topic and image files
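Putting the scripts together, a run with the default directory names
might look something like this (hand-editing the new topic list between
rounds, as recommended below):

  wikifetch topic_list
  wikitopics Html > newtopic_list
  wikifetch newtopic_list
  cleanimages
  cleanpages
  buildzips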
As an example, the Zips directory here contains the Zip
files resulting from a download of the maths, physics and chemistry
topics from Wikipedia. There are about 1,100 topics. The original
HTML and images took up 81 Megs and 93 Megs respectively; these were
reduced to 22 Megs and 69 Megs respectively. The Zip files in
total come in at 79 Megs, and with the "no images" option enabled,
they come down to under 8 Megs.
To run the software, you will need a Linux or Unix box with Perl 5.8
or later, the LWP::UserAgent
Perl module, the convert tool from ImageMagick,
and the zip command-line tool from Info-ZIP.
wikifetch
Usage: wikifetch topic [-ni] [htmldir] [imgdir]
Given a topic, e.g. 'M._C._Escher', fetch the topic and associated
images and store them in the directories Html and Images,
respectively. Alternatively, you can name the HTML and image directories
on the command line. The -ni option tells the script not
to fetch any images associated with the topic.
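For example, to fetch a single topic into directories of your own
choosing (MyHtml and MyImages here are just example names), or to
fetch it without any images at all:

  wikifetch M._C._Escher MyHtml MyImages
  wikifetch M._C._Escher -ni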
The topic can also be the name of a text file containing one topic
per line; each topic in the file will be fetched from Wikipedia. So, for
example, you can create a file called topic_list with these
lines:
Douglas_Adams
The_Hitchhiker's_Guide_to_the_Galaxy
Dirk_Gently's_Holistic_Detective_Agency
The_Long_Dark_Tea-Time_of_the_Soul
then each of the four topics will be fetched if you do
wikifetch topic_list
Note: the topic(s) must be named exactly as they appear in
Wikipedia URLs, with underscores instead of spaces (e.g.
M._C._Escher, not M. C. Escher).
wikitopics
Usage: wikitopics [htmldir]
Once you have downloaded some Wikipedia topics, you may want to fetch
others which are related to the ones you already have. You can run
wikitopics on the HTML directory to create a list of missing topics.
So, for example, if you do
wikitopics Html > newtopic_list
then the file called newtopic_list will be a list of new
topics, ready to be fed into wikifetch.
Note: I strongly recommend that you inspect and hand-edit
the resulting list, as you probably will not want all of the new topics.
Repeat the use of wikifetch and wikitopics until
you have built a large set of HTML topics and image files.
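If you are wondering what wikitopics is doing, the heart of such a
scan can be sketched in a few lines of Perl. This is just an
illustration rather than the script's actual code; it assumes that
topic links in the downloaded pages have the usual
href="/wiki/Topic" form:

  #!/usr/bin/perl
  # Sketch: print the Wikipedia topics linked from downloaded pages.
  use strict;
  use warnings;

  my $htmldir = shift || 'Html';
  my %seen;

  foreach my $file (glob("$htmldir/*")) {
      open(my $fh, '<', $file) or next;
      while (my $line = <$fh>) {
          # Topic links look like href="/wiki/Topic_Name"; skip
          # namespace pages such as Special: or File: (they have a colon)
          while ($line =~ m{href="/wiki/([^"#:]+)"}g) {
              print "$1\n" unless $seen{$1}++;
          }
      }
      close($fh);
  }

The real script also filters out topics which you have already
downloaded; that check is omitted here for brevity.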
cleanimages
Usage: cleanimages
This shell script creates a new directory called NewImages,
and converts all the JPEG and GIF files in Images into JPEGs
at 45% quality in the NewImages directory. Run this before
you run cleanpages. Yes, the directory names are hard-coded,
but it's a short script so feel free to modify it.
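The conversion itself amounts to running ImageMagick's convert
on each file, along these lines (picture.gif is just an example name):

  mkdir NewImages
  convert Images/picture.gif -quality 45 NewImages/picture.jpg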
cleanpages
Usage: cleanpages [-ni] [-nh] [olddir] [newdir]
This script removes unwanted HTML code from the downloaded topic files,
changes the image links to point to the smaller image files, and rewrites
the hyperlinks so that they will work once the topics are Zipped up.
The -ni option tells the script not to include images in
the HTML output. The -nh option tells the script not to include
any external hyperlinks in the HTML output. By default, the files
in the Html directory are cleaned up, and the new versions
stored in the NewHtml directory, although you can change
these defaults on the command line.
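For example, to clean the pages in MyHtml into MyClean (both
example names), stripping out both images and external hyperlinks:

  cleanpages -ni -nh MyHtml MyClean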
buildzips
Usage: buildzips [htmldir] [zipdir]
Once you have fetched all the topics, reduced the images and cleaned
up the HTML files, it's time to build Zip files which can be used
by uBook. This script collects all the HTML files in the NewHtml
directory, and creates Zip files in the Zips directory, named
using the first two characters of each topic name. You can
change these defaults on the command line.
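For example:

  buildzips NewHtml Zips

With the topic list above, The_Hitchhiker's_Guide_to_the_Galaxy and
The_Long_Dark_Tea-Time_of_the_Soul would both land in the Zip file
for topics beginning with "Th" (presumably Th.zip).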
That's it. The scripts are still pretty rough, so check back here
occasionally to see if I have made any later versions.
Last updated September 22, 2008, Warren Toomey.