HN Gopher Feed (2017-09-06) - page 2 of 10
Interactive Demo of Bloom Filters
115 points by tekromancr
https://www.jasondavies.com/bloomfilter/
natch - 6 hours ago
You don't need more than one hash function. Just append an
incrementing token onto the input before each successive hash.
chrisseaton - 5 hours ago
> Just append an incrementing token onto the input before each
> successive hash.
But that's a new hash function then.
thinkr42 - 5 hours ago
This is great, but I think "Probably is in there" should be replaced
with "Might be in there".
tekromancr - 4 hours ago
For this toy example, sure. But extend it from a 2-byte table to,
for example, a 50MB table and use many more hashes. The larger
the table, and the more hash functions you use, the smaller the
false positive error becomes. If you have a limited dataset that
you want to test against, you can set these params to have a near
0% false positive rate, at the expense of a much larger table.
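The tradeoff described above has a standard closed-form approximation: for a table of m bits, k hash functions, and n inserted items, the false positive rate is roughly (1 - e^(-kn/m))^k. A quick sketch of the contrast between the 2-byte toy table and a 50MB one (the loads and hash counts are my own illustrative numbers, not from the demo):

```python
import math

def false_positive_rate(m_bits, k_hashes, n_items):
    # Standard bloom filter approximation: (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# A tiny 16-bit table saturates after a handful of inserts...
print(false_positive_rate(16, 3, 10))
# ...while a 50MB table with 7 hashes stays near 0% for 100k items.
print(false_positive_rate(50 * 1024 * 1024 * 8, 7, 100_000))
```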
callahad - 5 hours ago
> Unfortunately I can't use the 64-bit trick in the linked post as
> JavaScript only supports bitwise operations on 32 bits.
This could be a good use case for WebAssembly, which supports 64-bit
math natively.
Assossa - 5 hours ago
If you want a simple explanation of bloom filters:
To add to the filter:
1) Get multiple hashes of the data. You can use the same hash
function and increment the data for each hash, or use multiple
hash functions. You can do any number of hashes from 1 to infinity;
each filter size and dataset size has a sweet spot.
2) Mod (remainder) each hash by the filter size. The filter size can
be anything from 1 to the maximum hash from your function[s]. The
larger the filter, the more accurate the results, but a filter larger
than your dataset is obviously useless.
3) Set each bit at the locations (from step 2) in the filter to 1.
To check the filter:
1) Do steps 1 & 2 from the previous procedure.
2) Check each location in the filter. If all locations are 1, the
data might be in the filter. If any of the locations are 0, the data
is definitely not in the filter.
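The steps above can be sketched in a few lines of Python. This is an illustrative toy, not the demo's actual code; it derives its k "hash functions" from one hash by prepending an incrementing token, as suggested elsewhere in the thread:

```python
import hashlib

class BloomFilter:
    """Toy bloom filter following the add/check steps above."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = [0] * size_bits

    def _positions(self, item):
        # Steps 1 & 2: k salted hashes, each modded by the filter size
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        # Step 3: set each derived bit position to 1
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # All bits set -> maybe present; any bit zero -> definitely absent
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("hello")
print(bf.might_contain("hello"))    # True, always, for added items
print(bf.might_contain("goodbye"))  # almost certainly False at this fill level
```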
taco_emoji - 3 hours ago
> a filter larger than your dataset is obviously useless
This isn't obvious to me, can you explain?
biggerfisch - 2 hours ago
At that point, it's harder to search/test than just looking at
your dataset, so there's no point to making the size that large
irl_zebra - 35 minutes ago
Well, the bloom filter only uses a tiny bit of space per data
element and doesn't actually store the data. When searching
the data itself, you're often searching through the actual
data, which is more intensive and slow.
netcraft - 6 hours ago
obligatory mention that if you are interested in this you should
also know about cuckoo filters:
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
https://en.wikipedia.org/wiki/Cuckoo_hashing
hendler - 5 hours ago
The OP also posted
http://www.lkozma.net/cuckoo_hashing_visualization/
GordonS - 5 hours ago
Do you have a tldr for how/when it is better than bloom filters?
kappaloris - 5 hours ago
When you want an error rate lower than 3%, and you can properly
size the filter up front, since it can fail insertions if you
overfill it.
spike021 - 1 hour ago
Cuckoo hashing was one of the more interesting concepts I learned
back in school. Definitely a fun way of implementing a hash
table.
kristoff_it - 6 hours ago
...and a shameless plug for my (WIP) Redis Module that doesn't
implement a "bloomy" interface, for once:
https://github.com/kristoff-it/redis-cuckoofilter
in9 - 2 hours ago
What is the efficiency of this algorithm? Is this how one can check
for used usernames when a user is creating an account?
striking - 1 hour ago
Bloom filters are O(1), but are probabilistic. So they're faster
than even an indexed lookup in a database... but they don't
always give you the correct answer.
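The lookup flow being discussed can be sketched: consult the filter first, and only pay for the authoritative lookup when the filter says "maybe". This is a hypothetical sketch; `db_contains` and the in-memory `taken` set stand in for a real indexed database lookup:

```python
taken = {"alice", "bob"}  # stand-in for the real user table

def db_contains(username):
    # hypothetical stand-in for an indexed database lookup
    return username in taken

def username_taken(bloom_might_contain, username):
    # A definite "no" from the filter skips the database entirely;
    # a "maybe" still needs the authoritative (indexed) lookup,
    # because bloom filters can return false positives.
    if not bloom_might_contain(username):
        return False
    return db_contains(username)

# With a filter that says "maybe" for everything (worst case),
# we always fall through to the database:
print(username_taken(lambda u: True, "carol"))  # False
print(username_taken(lambda u: True, "alice"))  # True
```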
SatvikBeri - 2 hours ago
Wouldn't a database lookup (with an index) be fast enough in that
case? IIRC that's what Reddit does.
greenpizza13 - 2 hours ago
Presumably you have indexed based on username or the hash of a
username, so the lookup is very inexpensive.
[deleted]
ThePhysicist - 6 hours ago
If you're interested in using Bloom filters in the backend, we (at
DCSO) have written efficient and interoperable open-source
implementations in Python and Go, also using FNV-1
hashing:
https://github.com/DCSO/bloom (Go version)
https://github.com/DCSO/flor (Python version)
The Go version
comes with a command line tool that allows you to use Bloom filters
on the shell.
imajes - 5 hours ago
hey, curious why your variables in the go version are so terse?
makes following along a little tough!
fooker - 5 hours ago
It is Go 'style'.
The abstractions are supposed to be self-explanatory and simple,
and thus the variable names do not have to serve as documentation.
jerf - 3 hours ago
I think it is Go style to name things that come in as
interfaces with short names. When you have an io.Writer,
there's very little else you can name it other than "w" that
would have any additional meaning or utility. When you have a
variable or struct field of a concrete type, it can and
should have a meaningful name, even in Go.
Ironically, I see several "input io.Reader" sort of things in
that code, whereas "number of hash functions" is k. A few
minutes with gorename would clean that right up, though.
Go to https://golang.org/pkg/net/http/#Client and use your browser
to search for the string "struct {" and have a look.
https://golang.org/pkg/os/ is another page with a number of
struct declarations.
avaer - 4 hours ago
I think CS education would be improved if these concepts didn't
have mysterious names.
Respect to Bloom, but the programming median might be raised if
Bloom filters were called something stupidly obvious like Hash
Array Filler-upper Tables. We might not even need to spend time
making visualizations to explain them.
mikehollinger - 4 hours ago
That'll just get renamed to a HAF table, which'll then get
confused with some (to be invented) "half" table. ;-)
middleorigin - 4 hours ago
I have no idea why you think the name helps remove the need for a
visualization. Besides, I think "filter" is more apt than
"table": it doesn't store anything per se.
avaer - 3 hours ago
I never needed a picture to understand a hash table once I knew
what a hash function was. If it were called a McCready table,
that's one more trip to Wikipedia, plus one more every time I
forget.
Re: naming: It's a table of hash function results. It
probabilistically stores a set. I don't see any filter here,
though one use of the technique is indeed in filtering a list,
though there are plenty more.
middleorigin - 56 minutes ago
Right, but you're describing the implementation, which
doesn't imply anything about its use. I prefer to name by the
latter: the implementation is just a detail of how the filter
does its filtering.
tdb7893 - 3 hours ago
How about "probabilistic hash filter" as a name?... this naming
stuff is hard
panic - 3 hours ago
What is this mysteriously-named Hash Array? Perhaps you're
talking about a Contiguous Allocation with Well-Shuffled Value-
Derived Indexes?
avaer - 3 hours ago
You know what both a Hash and Array are, as would ~100% of the
people who care about Bloom filters.
jolmg - 2 hours ago
His point still stands, though. :)
Those who already know what a Bloom Filter is would probably
find the name more convenient than having to say a short
description every time. Those that are just learning
programming might argue that Hash, Array, Boolean, String,
etc. would be understood faster if they had more descriptive
names.
Having a short, distinctive name helps in differentiating and
reasoning about the different objects. Think of an everyday
programming-related discussion that includes hashes, arrays,
and strings, and think about making the following substitutions:
- Hash : Contiguous Allocation with Well-Shuffled Value-Derived Indexes
- Array : Contiguous Allocation of Same-Typed Values
- String : Contiguous Allocation of Character Values
- Boolean : Either True or False Value
geocar - 6 hours ago
If anyone just wanted to see a collision: "a 2 c"
collides with: "f"
taco_emoji - 3 hours ago
Another anomaly is that "hello" yields the same array position
for all three hashes, which makes it highly prone to collision.
bearded_comrade - 6 hours ago
Another one is 1111 collides with ab
jauntbox - 33 minutes ago
Even better, just e and f collide.
cr0sh - 5 hours ago
Am I understanding this correctly?
1) For each key, you generate a "hash" - which is a bit position
2) The hash generation is such to ensure that the bit location in
the filter is probabilistically distributed for each key (so they
are spread "evenly", for lack of a better term, over the length of
the filter)
3) You generate so many locations per key, the number of which is
the number of hash generators you are using (so if you have 3
generators, you get 3 distributed bit positions)
4) Some keys may cause overlaps/collisions - but this is ok
5) At some point, you fill up the filter with keys
6) To see if a key is -probably- in the filter, you run the same
steps again and see if the bits are set; if they are, then it
-probably- is
7) You then run that key on your regular index search; some false
positives will cause you to run that expensive operation and get
back nothing, but usually you'll get back something (true
positive). But if that key wasn't found in the filter, you
definitely don't run the expensive lookup at all.
Is that correct?
There must also be something so that the bit position "hash"
generators are able to span an arbitrarily large bit range, in
order to "store" more keys; this trivial example looks like it
would fill up rather quickly (once all bits are set to "1", any key
would generate a "positive" and you would always be running the
expensive lookup - whether it was in the index or not - basically,
falling back to a default state).
If I have all of that correct (or close) - or even if I don't - it
seems like a very powerful technique, limited only by that bit
length (and being able to perform bitwise operations on it
quickly), which (unless I am missing something?) would need to be
fairly long to accommodate a reasonable number of keys (for say a
record lookup in an rdb table).
What might also be nice would be a way to detect it is "filled up"
and bypass the test; you'd fall back to the worst case scenario (ie
- no bloom filter), but at least you wouldn't be running the bloom
filter check on top of that as well, incurring an extra demand. I'm
thinking there's probably something easily done here - some bitwise
operation that could be done (maybe take the inverse and compare it
to zero?).
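The saturation check speculated about above is straightforward in practice: track the fraction of set bits and bypass the filter once it crosses a threshold. A hedged Python sketch (the function names and the 0.95 threshold are my own, not from the thread):

```python
def fill_ratio(bits):
    """Fraction of bits set in a bloom filter's bit array."""
    return sum(bits) / len(bits)

def should_use_filter(bits, threshold=0.95):
    # Once nearly every bit is 1, almost every query answers "maybe",
    # so skip the filter and go straight to the expensive lookup.
    return fill_ratio(bits) < threshold

saturated = [1] * 16   # the "all bits set to 1" state described above
half_full = [1, 0] * 8
print(should_use_filter(saturated))  # False: the filter is useless now
print(should_use_filter(half_full))  # True: still worth consulting
```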
tekromancr - 5 hours ago
Pretty close, except:
1) you generate multiple hashes for a particular value, thus
setting multiple bits per value
5) when this happens, you use a larger filter
6) You can also determine with 100% confidence that a value isn't
in the set.
Other than that, I think you have a good idea of what it's
about. Once I learned about this, I got tons of great ideas.
Right now, I think it would be cool to construct a bloom filter
of detected bot users on twitter. You could make a plugin that
marked tweets from probable bot-users as such without storing the
whole db on disk, and without making tons of network requests.
afonsotsukamoto - 13 minutes ago
I do believe in the 100% confidence, but when you have a full
filter is that still meaningful? As in - are there still any
gains from using a bloom filter in this case?
I know that is sort of a corner case, but in that situation the
100% confidence case happens 0% of the time, which makes the
filter a bit useless - no?
Just trying to understand the limitations, as I've been super
curious about bloom filters for a long time