So, I've been quiet lately. That's because I got distracted from the
distraction from the distraction and churned out another project real
quick that I got excited about. I've been messing about with usenet of
late and found that there really aren't a lot of good search engines
for it out there. The ones that _are_ out there that don't completely
suck usually cost money to access. Which got me into this rabbit hole
of looking at usenet indexers, and trying to figure out how to index
the usenet myself.
--------------------
Usenet uses the NNTP protocol and, like gopher, NNTP is a pretty
simple, plain text, human-readable protocol. So writing a client is
not all that difficult.

No, the problem with writing an indexer is very much a problem of
scale. You see, usenet has a lot of posts. Nearly 500 billion at this
time, with another 171 million being added every day.

This means you'll have to optimize the crap out of how you're storing
what you've indexed, as well as try everything
you can to make it as fast as possible to search through that massive
amount of data. I found this to be a very interesting problem, so I
kept at it. The result is 'UsenetSearch' - yes, very inspiring and
creative name, I know.
link to the code: https://tildegit.org/jns/UsenetSearch
I'll repeat some of the back-story from the readme there below:
"
I got interested in indexing usenet because I was
disappointed with the existing options. There's
horrible things like newznab, nntmux, etc,... all
written in php, and seemingly depend on the
entire body of software ever written by mankind
in the history of the world. I actually tried to
get nntmux running on my bsd server, failed of
course. Why became immediately obvious as soon as
I looked at their code. It's php code, often
shelling out linux-only shell commands. Ugh. I
was so appalled I had to go and write my own
thing. After all, how hard can it be :)
"
( If you are the author of nntmux or newznab, I apologize! - It's not
personal ;) )
Anyway, ok - so how does this thing work... well, probably, for the
most part, I guess it's similar to how most search engines work, but
I will spell it out for those not familiar.
The indexer makes a connection to the NNTP server, and requests a
list of all newsgroups. We look for any newsgroups that match the
patterns of newsgroups we care to index. UsenetSearch lets you set
regular expressions for newsgroup names to index, rather than having
to explicitly list specific ones. If none are set, we are indexing
all of usenet (Hope you've got a lot of storage!!)
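That pattern-matching step could be sketched roughly like this (a
minimal illustration with hypothetical names, not the actual
UsenetSearch code):

```cpp
#include <regex>
#include <string>
#include <vector>

// Keep only the newsgroups whose names match at least one of the
// configured patterns. An empty pattern list means "index everything".
std::vector<std::string> filterGroups(
    const std::vector<std::string>& groups,
    const std::vector<std::string>& patterns)
{
    if (patterns.empty())
        return groups; // no filters configured: index all of usenet
    std::vector<std::string> result;
    for (const auto& g : groups) {
        for (const auto& p : patterns) {
            if (std::regex_match(g, std::regex(p))) {
                result.push_back(g);
                break; // one match is enough
            }
        }
    }
    return result;
}
```

So a pattern like "comp\..*" would select comp.lang.c but skip
alt.test. (In real code you'd compile the regexes once up front
instead of per comparison.)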
For every newsgroup, we ask for all subject headers - starting at the
last indexed post. At this time we are only indexing subject headers.
I might support body indexing or storing dates and such in the future
but I'm really trying to get the storage costs down to a minimum here
obviously.
These posts then get split up into batches that are fed into a thread
pool where we start doing the indexing magic in parallel. Time to put
that CPU to work!
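The batching idea looks something like the following sketch - here
std::async stands in for the real thread pool, and the names are
made up for illustration:

```cpp
#include <algorithm>
#include <future>
#include <string>
#include <vector>

// Split the fetched subject lines into fixed-size batches.
std::vector<std::vector<std::string>> makeBatches(
    const std::vector<std::string>& subjects, std::size_t batchSize)
{
    std::vector<std::vector<std::string>> batches;
    for (std::size_t i = 0; i < subjects.size(); i += batchSize) {
        auto last = std::min(subjects.size(), i + batchSize);
        batches.emplace_back(subjects.begin() + i, subjects.begin() + last);
    }
    return batches;
}

// Hand each batch to a worker and wait for all of them to finish.
std::size_t indexAll(const std::vector<std::string>& subjects,
                     std::size_t batchSize)
{
    std::vector<std::future<std::size_t>> jobs;
    for (auto& batch : makeBatches(subjects, batchSize))
        jobs.push_back(std::async(std::launch::async,
            [batch] { return batch.size(); /* tokenize + hash here */ }));
    std::size_t total = 0;
    for (auto& j : jobs)
        total += j.get();
    return total;
}
```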
Every subject line gets filtered, then tokenized. So, how does
tokenization work you ask? What is it? Well, basically we're trying
to generate every possible search string that would match the subject
line. So this means every possible combination of sequential words.
Ok, here's an example, say you wanted to tokenize the phrase:
" gophers are cute "
The generated tokens would be:
- gophers are cute
- gophers are
- gophers
- are cute
- are
- cute
So you can see, a 3-word sentence generated 6 tokens. As you can
imagine, this can blow up pretty quickly in size for long pieces of
text. So we also do a bunch of extra token filtering after we do the
initial tokenization: duplicate tokens are removed, some tokens may
just be dropped, and others we may only care to store if they are an
exact match to the subject. I tried to make all that filtering
configurable via the usenetsearch configuration file, such that the
end-user may make their own tradeoffs in search result quality versus
storage consumption. Another important thing is that we can limit the
recursion depth of the tokenization algorithm. The algorithm itself
basically is just a loop within a loop, where the outer loop removes
a word from the start and the inner loop removes a word from the end
- each time storing a token of the entire string as it goes through.
The actual C++ implementation of this splits the string by whitespace
into a vector of strings, and then makes use of iterators to take
subsets of word tokens. - So instead of: 'removing a word from the
left', you can simply increment the outer iterator, and instead of
'removing a word from the right', you simply decrement the inner
vector iterator.
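A minimal sketch of that nested-loop tokenizer (simplified; the real
UsenetSearch implementation also applies filtering and a configurable
depth limit):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Generate every contiguous run of words in the input. The outer loop
// advances the starting word ("removing a word from the left"); the
// inner loop pulls the end position back ("removing a word from the
// right"), emitting a token at each step.
std::vector<std::string> tokenize(const std::string& subject)
{
    std::vector<std::string> words;
    std::istringstream iss(subject);
    for (std::string w; iss >> w;)
        words.push_back(w);

    std::vector<std::string> tokens;
    for (auto first = words.begin(); first != words.end(); ++first) {
        for (auto last = words.end(); last != first; --last) {
            std::string token;
            for (auto it = first; it != last; ++it) {
                if (!token.empty())
                    token += ' ';
                token += *it;
            }
            tokens.push_back(token);
        }
    }
    return tokens;
}
```

Running this on "gophers are cute" yields the six tokens from the
example above, in the same order.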
Anyway, then each token gets hashed. We don't ever actually store the
subjects themselves, just the hashes of the subjects. This is because
a hash is a fixed size. It doesn't matter how long a sentence is, the
hash of the sentence will be a predictable number of bytes. This may
improve storage efficiency when the subjects are longer than the hash
but may also harm it if the subjects are shorter. In reality, usually
they are longer. But much more important than that, storing things
with a predictable size in a file, makes it so you can easily seek to
a specific entry really quickly without having to do a linear search,
because you can calculate where something is going to be by simply
multiplying its index by the size of an entry. And what's more, we
can very quickly and efficiently search for a hash, because we can
pre-sort them in a tree structure on disk. That way, we don't have to
search through an entire file of billions of hashes every time. This
is especially important because we're building a reverse-index of
hashes. We're not matching some ID to a hash, we're matching some
hash to an (article) id.
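The fixed-size-record trick is just offset arithmetic. A sketch, with
illustrative sizes (the actual on-disk format of UsenetSearch differs):

```cpp
#include <cstddef>
#include <cstdint>

// Each record is a fixed-size hash plus a fixed-size article id, so
// record n always lives at byte n * RECORD_SIZE - no linear scan.
constexpr std::size_t HASH_SIZE   = 16; // e.g. a 128-bit hash
constexpr std::size_t ID_SIZE     = 8;  // e.g. a 64-bit article number
constexpr std::size_t RECORD_SIZE = HASH_SIZE + ID_SIZE;

constexpr std::uint64_t recordOffset(std::uint64_t index)
{
    return index * RECORD_SIZE;
}
```

Reading entry n is then just a seekg(recordOffset(n)) followed by a
fixed-size read, which is what makes binary searching a sorted db
file cheap.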
So let's take our example string from before, and look at the hashes:
- gophers are cute : 06e50092d57e0db6219c2ca6b80edbe5
- gophers are : 6f9d9543a11e72dc4f0134fd4874a666
- gophers : de96350d92c5eb383a0de9331933a212
- are cute : 7c4860f2c7ac43aab4345a34d190e2e0
- are : 4015e9ce43edfb0668ddaa973ebc7e87
- cute : 02829fb05c3076ec5a6caebd12477dec
So each of these tokens could be stored on the filesystem like:
./06/e5/00/06.db (contains the 06e50092d57e0db6219c2ca6b80edbe5 hash)
./6f/9d/95/6f.db (contains the 6f9d9543a11e72dc4f0134fd4874a666 hash)
etc... - in other words, you can take the first n digits of the hash
and build a super simple tree from them. Since these are hex digits,
the number of files on the filesystem is always tunable and
predictable - you don't want to end up building such a deep structure
that you run out of inodes. Moreover, using such a scheme one could
also distribute individual top-level folders over multiple different
filesystems which could be useful for scalability reasons.
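Mapping a hash to its spot in that tree is a few lines of string
slicing; a sketch matching the example layout above (hypothetical
function name, and the real code may pick different depths):

```cpp
#include <cstddef>
#include <string>

// Build a database path from the leading hex digits of a token hash,
// two digits per directory level. With depth 3, the hash
// "06e50092..." maps to "06/e5/00/06.db".
std::string hashToPath(const std::string& hexHash, std::size_t depth)
{
    std::string path;
    for (std::size_t i = 0; i < depth; ++i)
        path += hexHash.substr(i * 2, 2) + "/";
    return path + hexHash.substr(0, 2) + ".db";
}
```

Since each level has at most 256 entries, the total file count is
bounded by the chosen depth, which is what keeps the inode usage
tunable.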
Anyway, that's the rough outlines of how it works. There's of course
a lot more to it. You don't want 2 threads to write to the same file,
so you have to use file locking, but then you also don't want to end
up in a situation where everything always wants to write to the same
token file because then your program will just be waiting for locks
all the time. You also want to do whatever you can to avoid database
corruption should the daemon get killed uncleanly and what not. Right
now I just have a simple fixed header and footer to check for db file
integrity, although I may add a CRC later.
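One generic way to spread that lock contention - a sketch of the
general idea, not necessarily what UsenetSearch does - is lock
striping: hash each db file path onto one of a small pool of mutexes,
so unrelated files rarely wait on the same lock:

```cpp
#include <array>
#include <functional>
#include <mutex>
#include <string>

// A fixed pool of mutexes; each db file path hashes to one stripe.
// Two threads only contend when their paths land on the same stripe.
class LockStripes {
public:
    std::mutex& lockFor(const std::string& path) {
        return stripes_[std::hash<std::string>{}(path) % stripes_.size()];
    }
private:
    std::array<std::mutex, 64> stripes_;
};
```

A writer would then grab lockFor("06/e5/00/06.db") for the duration
of its append, on top of whatever on-disk file locking guards against
other processes.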
At some point I may re-purpose some of this code and write a gopher
search engine. That would be a fun project too. But it would be a lot
more complicated, since there you'd have to index (potentially long)
bodies of text, account for endless loops, read robots.txt files, and
so on,... so I might not ever get to it. I've also got a bunch of
gopher clients to finish, a gopher server, a game engine, etc etc etc.
Enough drivel for now :) Catch ya on the flip side!