Friday, August 22nd, 2014
"Hacking" National Library
As a person interested in history and as a monarchist, I enjoy
reading old newspapers from the end of the 19th and the beginning of
the 20th century. Some of the newspapers of that era are freely
available on the website of the Czech National Library
in electronic form as DjVu images, which is great.
What's not so great is the content management system through which
the scanned pages are published. If you have a DjVu plugin in your
browser, you can read a single page after six mouse clicks, but as
the plugin is binary software, it may not be available for your
hardware platform, especially if you use non-x86 hardware, like I do.
In that case you have to click at least seven times to download
the image of a single page and then open it with the DjVu viewer of
your choice. Not great at all: it is not user-friendly and certainly
not a quick way to get what I want.
So I decided to download all pages of a single newspaper to read it
whenever I want. My favorite is Narodni politika (="National
Politics"), which was published from circa 1884 to 1945. It had four
pages in the early years and, a few decades later, sixteen pages with
two or three issues every day. I don't even want to calculate
how many times I would have to click to download this number
of pages and how much of my life it would cost me. Instead,
I spent a few hours studying the structure of the website and wrote
a PHP script that can download all the pages for me.
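A rough sketch of what such a downloader can look like follows. It is written in Python rather than PHP and is not the actual script. The page URLs and file names are hypothetical placeholders, because the real addresses depend on the structure of the library's website. The random pause between requests and the 1 to 4 second range are assumptions as well.

```python
import random
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url):
    """Fetch one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def download_pages(page_urls, target_dir, fetch=fetch_bytes, sleep=time.sleep,
                   min_wait=1.0, max_wait=4.0):
    """Save each (file_name, url) pair into target_dir, pausing between requests.

    Files that already exist are skipped, so an interrupted run can be resumed.
    Returns the list of file names that were actually downloaded.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for file_name, url in page_urls:
        path = target / file_name
        if path.exists():
            continue  # already fetched in an earlier run
        path.write_bytes(fetch(url))
        downloaded.append(file_name)
        # Random pause (assumed range) so the requests resemble a person clicking.
        sleep(random.uniform(min_wait, max_wait))
    return downloaded
```

Skipping files that already exist matters for a job that runs for weeks: after a crash or a reboot the same list of pages can simply be passed in again.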
After the first couple of hours, it seems to be working. Even with
a few random waits in the code (simulating a user clicking), it takes
about five seconds to download a single page. If my calculations are
correct, everything should be on my drive in two or three weeks.
Yes, it's quite a long time, but that time will be wasted by
a machine, not by me.
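The estimate can be checked with a few lines of arithmetic. The sketch below is in Python and assumes the download runs around the clock at the five seconds per page quoted above. The page counts are not taken from the library's catalogue; they are simply what five seconds per page and two or three weeks imply. The seven clicks per page is the minimum mentioned earlier for reading without the plugin.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_in(days, seconds_per_page=5):
    """Number of pages that fit into the given number of days of downloading."""
    return days * SECONDS_PER_DAY // seconds_per_page


def days_needed(pages, seconds_per_page=5):
    """Days of continuous downloading needed for the given number of pages."""
    return pages * seconds_per_page / SECONDS_PER_DAY


def manual_clicks(pages, clicks_per_page=7):
    """Mouse clicks needed to fetch the same pages by hand."""
    return pages * clicks_per_page


for weeks in (2, 3):
    pages = pages_in(weeks * 7)
    print(f"{weeks} weeks -> about {pages} pages, "
          f"{manual_clicks(pages)} clicks by hand")
```

Under these assumptions, two to three weeks corresponds to roughly 240,000 to 360,000 pages, or about 1.7 to 2.5 million mouse clicks if done by hand.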