HN Gopher Feed (2017-11-10) - page 1 of 10
How should you build a high-performance column store for the 2020s?
64 points by deafcalculus
https://lemire.me/blog/2017/11/10/how-should-you-build-a-high-pe...
xoogler_thr - 1 hour ago
This already exists, in Google BigQuery. Uses darn near every trick
in the book, and some that aren't in the book. Source: shipped it.
joelwilsson - 9 minutes ago
BigQuery uses a lot of tricks to get efficiency, but this post
emphasizes Apache Arrow and open data formats like it as the way
forward (in particular the last point, "Be open, or else?") which
are not currently supported by BigQuery. If Apache Arrow takes off
I hope BigQuery will support it as a data interchange format in
the future. Zero-copy is pretty awesome, as are open standards in
general. This feature does not exist in BigQuery today (as far as
I know - definitely not as discussed in the source).
WilsonPhillips - 1 hour ago
I think incorporating a blockchain element could prove an
interesting way to implement this in practice.
dogruck - 1 hour ago
Sure comes across as arrogant for Prof Abadi to remark:

> I assume that the Arrow developers will eventually read my 2006
> paper on compression in column-stores and expand their compression
> options to include other schemes which can be operated on directly
> (such as run-length-encoding and bit-vector compression).

In this blog post, I don't agree with:

> Almost every single major data-processing platform that has
> emerged in the last decade has been either open source.

That's somewhat true by definition. OTOH, I also know most
financial firms use proprietary solutions (which leverage open
amelius - 38 minutes ago
In 2020? Abstract it all away, and let the compiler deal with such
earthly stuff as columns and rows!
lima - 1 hour ago
Yandex's recently open-sourced ClickHouse column store does some
of these. It heavily relies on compression, data locality, and SIMD
instructions, and supports external dictionaries for lookup.
misterHN - 1 hour ago
put data in text files, ASCII printable characters, one data point per line
put data files in a directory
name data files after columns
use ".data" filename extension for data files
write a tool to create index files (append ".index" to the name of the input text file) that map record number to byte offset in the data file
If data files are all < 4GB, use a 32 bit unsigned integer to represent the byte offset in the index file
Each index file is a packed array of 32 bit integers
Write a tool to create length files ".length" that count the number of entries in a data file
Generate .length files for all data files
Use mmap to access index files
Use C for all of the above

This is for variable-length data values. Not every column will have
these, making the .index files redundant in this case; the .index
files should not be created in this case, and program logic should
support both uniform value length access and nonuniform value length
access. The reason to prefer two access modes is to keep data from
the .index files out of the cache when it is redundant.

When all of this is done, the next thing to do is write a tool to
test the cache characteristics of your processor by implementing
sorting algorithms and testing their performance. Unless you are
using a GPU (why?), all data your algorithm touches will go through
every level of the cache hierarchy, forcing other data out. If
possible, use a tool that reports hardware diagnostics. These tools
may be provided by the processor vendor.

Now, there is a trend to give the programmer control over
mark-some... I don't know if this is worth exploring or a wild goose
chase. It may improve performance for some tasks, but it sounds a
little strange for the programmer to tell the computer how to use
the cache... shouldn't the operating system do this? Anyway, that's a
dustingetz - 35 minutes ago
Datastores of the 2020s will be designed around an immutable log
because it permits both strong consistency and horizontal scaling
(like git). Once you're both distributed and consistent, the problems
today's stores are architected around go away. Your distributed
queries can index the immutable log however they like:
column-oriented, row-oriented, documents, time-oriented, graphs.
Immutability means you can do all of it, as a library in your
application process. http://www.datomic.com/ - it's what you get when
Facebook's graph datastore has a baby with immutability.
SamReidHughes - 18 minutes ago
You can't just say "immutable log" and then be done. You
certainly don't want to have just one immutable log, because then
unrelated operations, for example to different parts of a key
space, have to "see" each other. If you go the route of Datomic,
your writes can't outpace the one CPU that processes them.
(Correct me if I'm wrong, I'm just reading its documentation.)
Git, with a DAG history, is just eventual consistency.