DATA IMMORTALITY
Part of my plan for my "Internet Client" computer was that it would
help me organise my bookmarks between different machines. For
various reasons, it hasn't. Well, it has definitely helped to an
extent, but again I'm thinking that the only way I'll get bookmarks
to work efficiently is to use a separate bookmark manager program.
I've been looking at bookmark managers without satisfaction for
years though, so I've finally given in and decided to come up with
a solution of my own.
Feature summary:
* Bookmarks stored as individual files in a directory structure
equivalent to the bookmark menu structure - worryingly, the only
other bookmark system developers who seem to have gone with this
approach were those of Microsoft Internet Explorer. That's probably
a bad sign, but it still seems like the most flexible solution to
me. Like MailDir, but for bookmarks.
* Firefox-like add-bookmark dialogue, but run in a terminal window.
Triggered by a keyboard combination, it automatically takes the URL
from the current X selection and downloads the page itself in order
to grab the title (roughly sketched after this list).
* Statically generated HTML interface which can be accessed either
locally (file://) or from a local web server. The directory tree
plus top-level bookmarks go in either a small frame or table cell on
the left, and directory contents in the main view.
* In the frame view, have an option to browse all bookmarks in the
small left frame, and open links in the larger frame, emulating
Firefox's Ctrl-B bookmark selector.
* List of all bookmarks on one page, usable with browser's page
search function for searching.
* Optionally save a local copy of the page being bookmarked using
wget, also grabbing any file linked from that page up to a certain
size limit (see the sketch below). This goes into a separate
directory tree, which I can also go into manually and grab the whole
site with HTTrack if desired.
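
To make the add-bookmark and archiving features a bit more concrete,
here is a rough Python sketch of how that flow might work. The
~/bookmarks and ~/bookmark-archive locations, the one-file-per-
bookmark format, the use of xclip for the X selection, and the wget
options are all just assumptions for illustration, not a finished
design; the static HTML generation is left out entirely.

  #!/usr/bin/env python3
  # Rough sketch only: grab the URL from the X selection, fetch the
  # page to suggest a title, write one bookmark file into a directory
  # tree that mirrors the menu, and optionally archive the page with
  # wget. Paths, file format and wget options are assumptions.

  import os
  import re
  import subprocess
  import urllib.request

  BOOKMARK_ROOT = os.path.expanduser("~/bookmarks")          # assumed
  ARCHIVE_ROOT = os.path.expanduser("~/bookmark-archive")    # assumed

  def current_selection():
      # Needs xclip installed; returns the primary X selection as text.
      out = subprocess.run(["xclip", "-o", "-selection", "primary"],
                           capture_output=True, text=True)
      return out.stdout.strip()

  def page_title(url):
      # Download (part of) the page and pull out the <title>, falling
      # back to the URL itself if anything goes wrong.
      try:
          with urllib.request.urlopen(url, timeout=30) as resp:
              html = resp.read(65536).decode("utf-8", errors="replace")
          match = re.search(r"<title[^>]*>(.*?)</title>", html,
                            re.IGNORECASE | re.DOTALL)
          return match.group(1).strip() if match else url
      except (OSError, ValueError):
          return url

  def save_bookmark(url, title, folder):
      # One file per bookmark, in a directory matching the menu path.
      directory = os.path.join(BOOKMARK_ROOT, folder)
      os.makedirs(directory, exist_ok=True)
      name = "".join(c if c.isalnum() or c in " -_" else "_"
                     for c in title)
      path = os.path.join(directory, name[:80] + ".url")
      with open(path, "w") as f:
          f.write(url + "\n" + title + "\n")
      return path

  def archive_page(url, quota="20m"):
      # Optional local copy: the page, its images/CSS, and anything it
      # links to one level down, capped by wget's --quota size limit.
      subprocess.run(["wget", "--page-requisites", "--convert-links",
                      "--recursive", "--level=1", "--quota=" + quota,
                      "--directory-prefix", ARCHIVE_ROOT, url])

  if __name__ == "__main__":
      url = current_selection()
      suggested = page_title(url)
      title = input("Title [%s]: " % suggested) or suggested
      folder = input("Folder (e.g. electronics/circuits): ")
      print("Saved", save_bookmark(url, title, folder))
      if input("Archive a local copy? [y/N]: ").lower() == "y":
          archive_page(url)

Bound to a keyboard combination that opens a terminal, that would do
roughly what the Firefox dialogue does, minus the GUI.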
The last feature is the one I really want to discuss, and it has
been whirring around in my head ever since I read this post by
Solderpunk:
gopher://zaibatsu.circumlunar.space/0/%7esolderpunk/phlog/the-individual-archivist-and-ghosts-of-gophers-past.txt
There he proposes a Gopher client (though I'd probably try to do it
with a Gopher proxy myself) which archives every visited page
locally. Just recently he's come up with a new approach to the
problem, proposing instead that sites be hosted as Git repos:
gopher://zaibatsu.circumlunar.space/0/%7esolderpunk/phlog/low-budget-p2p-content-distribution-with-git.txt
Looking back on my earlier bookmarks, this is definitely a problem
that I do need to solve. I seem to have had a remarkable knack
about a decade ago for finding websites that were about to go
offline in the next ten years, and were obviously of so little
interest to the world at large that the Wayback Machine often
didn't bother archiving images (which are kind-of the key point if
they're talking about electrical circuits) or much of the sites at
all. Even when they did get archived, the Internet Archive is just
another single point of failure anyway. Archive.is, for example,
got blocked by the Australian government a few years ago for
archiving terrorist content (the gov. did a rubbish job of it and
you could still access the site via some of their alternative
domains because it was done at the DNS level, but the fact that the
people in power are idiots doesn't negate the potential of their
power). Unfortunately I don't like either of Solderpunk's solutions.
That may be a little harsh on Solderpunk. My objection to the
client local mirroring approach is mainly just philosophical and
the related practical problems are likely solvable. For his second
suggestion, I disagree with using Git, but do the same thing with
Rsync (which also solves the URL problem, at the cost of losing a
pre-baked changelog system) and I'd be happy.
The difference between us is simply whether to attribute importance
to needless data storage.
For me, storing data is a commitment. You don't need just one copy
of the data; the way I do things, I need at least four. One copy on
the
PC you're working from, two on your local backup drive (the latest
backup and your previous backup in case the backup process goes
haywire, granted incremental backups are another approach which I
don't use myself), and at least one copy off-site. I try to keep
all the data I can't easily cope with losing on my laptop, with its
40GB HDD. Relying on a 20yo HDD probably isn't all that wise, but
just to focus on the 40GB: with four copies that actually translates
into up to 160GB of data stored, and 120GB (the three backup copies)
needing to be processed to complete a full backup cycle.
Maybe that's nothing these days, but to me it's already
inconvenient:
* It means the backup process takes a non-trivial amount of time
during which the laptop's performance is poor, so I leave it to run
overnight only once a week. That's a waste of power, and limits the
regularity of my backup routine.
* It means my only practical medium is HDDs. DVDs, CDs, or ZIP
disks might be an option otherwise. I'm not managing to pick up SSDs
or
sufficiently large flash drives/cards in my free to $5 second-hand
price range.
* It means I can't use the internet in my laptop's backup strategy,
because my connection is too slow and I'd have to pay a lot more
than for my current 3GB/month deal. That combines with the first
problem to make offsite backups more of a pain.
(I've got my Internet Client computer set up on a 2GB SD card. All
important files get synced with the laptop daily, including all
system/user configuration files, which make up only ~30MB
compressed.)
Now back to Solderpunk's concept. You can say that Gopher content
(or probably Gemini, though I don't look at that much) is small so
you might as well grab everything. But my Gopher hole currently
totals 80MB. I've got about 70 sites bookmarked in the Gopher
client on this PC (UMN Gopher); if I'm average (alright, I'm
probably not, but I'm the only one I can run "du" on) then that's
70 x 80MB = 5.6GB, enough data to fill up over 1/8th of my 40GB
laptop drive
right there. Including my backups, that would be 22.4GB of data
sitting somewhere, regularly read and copied at the expense of time
and energy.
Now of that data, the largest share (34MB) is my archive of Firetext
logs. I should purge that again actually - I do keep it all myself,
and it may have some use for historical purposes, but the average
Gopher user surely doesn't give a stuff. With the caching client
scheme, it's not a fair assumption that the hourly log you look at
one day is going to be what you want to find later either. With the
Git hosting scheme, someone who just wants to read this phlog post
is obliged to pull in all that Firetext data even if they've got no
interest in it. In fact, the Photos and History Snippets sections
make up the bulk of the other data, and yet the only part
that I've ever received feedback on is the phlog, so for all I know
this one 700KB corner is the only bit of content that anyone
actually wants to view, yet using Git they'd be storing 80MB of
data in order to do so.
Should I just ditch everything but the phlog and host that with Git
(or Gopher, for that matter; it's potentially just clogging up the
Aussies.space server, which is why I cull the Firetext archive
already)? For you, maybe. For me, the favourite
part, the part I'd be most thrilled to find in my own browsing, is
the History Snippets section (19MB), even though I've been
struggling to get around to adding new entries there (by the way,
if someone does actually like viewing it, letting me know would
certainly help my motivation). So if I drop that then I'm dropping
my favourite content for the sake of popularity, now embodied in
the sheer efficiency of data storage and transfer.
At the same time I don't think the client caching approach is
right, because everyone who drops into the History Snippets
section, clicks a couple of links, decides it's just something some
weirdo's needlessly put together, and leaves never to return, then
ends up carrying around the gophermap and photos they viewed
purposelessly for as long as they can keep all their data intact.
Yet the person who drops in, looks at a few entries, bookmarks it
for later when they have the time (what I'd probably do), then goes
away - they find that when they return after it's gone offline, all
they can view is the same stuff they saw before.
As an alternative to the Git proposal, Rsync would solve the
problem of fetching unwanted data. You just pick the directory
with the content you're interested in and Rsync only mirrors that
bit. Server load may be a problem, though public Rsync sites do
already exist for software downloads, so maybe it's practical. You
could also just Rsync individual files while browsing around, before
maybe committing them to permanent storage.
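
As a hypothetical example (the host name and module path here are
made up), mirroring just one section of a site published over an
rsync daemon might look something like this little Python wrapper,
which is really nothing more than the rsync command line:

  # Sketch: mirror only the chosen subtree of a remotely published
  # site, rather than pulling the whole repository as Git would.
  import subprocess

  def mirror_subtree(remote, local_dir):
      # -a preserves times and permissions, -z compresses in transit,
      # --delete keeps the local copy in step with the remote.
      subprocess.run(["rsync", "-az", "--delete", remote, local_dir],
                     check=True)

  # Just the phlog directory, nothing else:
  mirror_subtree("rsync://example.org/gopherhole/phlog/",
                 "mirror/phlog/")

Pointed at a single file instead of a directory, the same call would
cover the browse-then-decide case too.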
But with my bookmark system, if I ever get around to creating it,
I've got my own equivalent to the client caching system, which
works with existing protocols (well, I guess most easily just with
the web). It specifically grabs what I think I might want to look
at. Rather than enforcing some rigid system that theoretically
grabs all the data I'll ever want to find again, I'd prefer to just
make that decision myself.
- The Free Thinker.