HN Gopher Feed (2017-09-06) - page 1 of 10 ___________________________________________________________________
Publishing with Apache Kafka at The New York Times
214 points by rmoff
https://www.confluent.io/blog/publishing-apache-kafka-new-york-t...___________________________________________________________________
[deleted]
cookiecaper - 5 hours ago
>We need the log to retain all events forever, otherwise it is not
possible to recreate a data store from scratch.
SIGH. Cue the facepalm, head in hands, etc. I'm not going to get into a big thing
here. But if you find yourself saying "I need to keep this thing
forever no matter what" and then you try to use something that even
entertains the notion of automatic eviction/deletion semantics as
the system of record, you're doing it wrong. Not to burst the bubble
of the techno-hipsters, but Kafka is "durable" relative to message
brokers like RabbitMQ, not relative to a system actually designed
to store decades of mission-critical data. Those systems are called
"RDBMS".Elsewhere in the article he says that they have less than
100GB of data and that it's mostly text. This is massive
overarchitecture that isn't even covering the basic flanks that it
thinks it is, such as data permanence. I would really like to read
the article that discusses why Postgres or MySQL couldn't have
served this purpose equally well.
bognition - 5 hours ago
OOC what makes an RDBMS more durable than Kafka? Both of them
are systems for representing data on disk. I'd love to hear why
one representation system is better at disaster recovery than
another.
cachemiss - 5 hours ago
That's a bit of an oversimplification. Production grade RDBMS
systems have far more guard rails, testing, and work put into
them than Kafka. It's relatively straightforward to lose data
in Kafka; I've done it (it's usually control plane bugs, not
data plane).
[deleted]
cookiecaper - 5 hours ago
In Postgres, I never have to worry that the server will be
accidentally loaded with `retention.bytes` or `retention.days`
set too low and, as a result, choose to delete everything in
the database, generating a wholly artificial "disaster" that
can result in long periods of disruption or downtime (at a
minimum; worst case is permanent data loss). It is true that
someone could issue `DROP DATABASE`, `rm -rf` the filesystem on
the database server, and so forth, so my point is not that other
systems are invincible. It's just that a properly-configured
RDBMS is designed to take data integrity extremely seriously
and provides numerous failsafes and protective mechanisms to
try to ensure that any data "loss" is absolutely intentional. On
a RDBMS, things like foreign key constraints prevent deletion
of dependent records, mature and well-defined access control
systems prevent accidental or malicious record alteration,
concurrency models and transactions keep data in a good state,
etc. Kafka, on the other hand, is designed to automatically and
silently delete/purge data whenever a couple of flags are
flipped. That is not a flaw in Kafka itself; it's designed to do
that so that you don't have to interrupt your day and purge
expired/old/processed data all the time. It's a flaw in
architectures that misinterpret Kafka's log paradigm as a
replacement for a real data storage/retrieval/archive
system. I've had this argument countless times with people
who've tried to use RabbitMQ as a system of record (if only for
a few minutes while the messages sat in queue). There's just
some fundamental disconnect for a lot of developers where they
don't understand that something accepting the handoff doesn't
mean that the data is inherently safe.
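For illustration, a minimal sketch of the "couple of flags" in question, using the kafka-python client (broker address and topic name are made up; the config keys are real topic-level settings):

  from kafka.admin import KafkaAdminClient, NewTopic

  # Placeholder broker address; assumes a cluster with at least 3 brokers.
  admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

  # These topic-level settings decide how aggressively the broker's
  # "delete" cleanup policy discards old log segments.
  topic = NewTopic(
      name="published-assets",
      num_partitions=1,
      replication_factor=3,
      topic_configs={
          "cleanup.policy": "delete",                    # default: expire old segments
          "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # default: roughly 7 days
          "retention.bytes": "-1",                       # -1 = no size-based cap
      },
  )
  admin.create_topics([topic])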
skybrian - 4 hours ago
So, remove the flag that you'll never use and recompile? I'm
not sure this is worse than using a Unix box with a working
"rm" command.
cookiecaper - 3 hours ago
For what it's worth, I've known sysadmins who strip their
boxes to the bones and take pains to ensure that the "rm"
command won't be able to be accidentally invoked, primarily
by ensuring it doesn't exist on the box. They carry their
utilities from box to box, and take them with them when
they leave. That said, any slightly-sane permission or
access control scheme, including the defaults mandated by
almost all RDBMS distributions (which want a system user
dedicated to their use), would make it rather difficult to
rm the database folder. Just opening a shell to a RDBMS's
underlying server should be a rare event in itself, to say
nothing of actually elevating to root, or running a
sloppy/careless rm command that is uncaught by the numerous
potential failsafes that sysadmins have been installing for
decades now (constraining superuser access to a pre-defined
set of commands, for example). Again, the point is not that
RDBMS systems are invincible. It's just that they're much
sturdier, and actually designed to serve this purpose. In
what universe is "Well, hack out the dangerous parts" a
reasonable answer? Talk about reckless disregard for data
integrity! Do you really want to use Kafka that bad that
you'd develop, maintain, and thoroughly test a custom
patchset that circumvents its eviction routines, rather
than just using the systems that already excel at not
deleting stuff? Secondly, eviction is a core part of a
message queue's design, on purpose. It's actually a needed
thing, and while I'm not a Kafka dev, I seriously doubt
that it's so simple that a single flag can be disabled and
we can move on.
skybrian - 2 hours ago
Disabling a flag is likely a one-line change, assuming a
reasonable flag library. But yes, maintaining a custom
fork at all is not something to take on lightly. It would
make more sense to talk to the Kafka developers about how
to make it safer to use in this scenario.
lima - 3 hours ago
Kafka is a fine replacement for a RDBMS if it fits your
particular use case. It has very strong data consistency
guarantees - stronger than most RDBMS - if you configure it
properly (acks=all et al). It won't even lose data if the
leader of a partition crashes during a commit. It has been
explicitly designed for these use cases and even has features
like compaction: https://kafka.apache.org/documentation/#compaction
Now, I agree with you that in most cases, using Kafka as
your primary data store instead of a RDBMS is madness - but
that doesn't mean it's a bad idea in general.
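Roughly what that configuration looks like with the kafka-python client (broker address and topic name are placeholders; acks="all" is the strict setting usually meant here, together with a suitable min.insync.replicas on the topic):

  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      acks="all",   # wait until all in-sync replicas have the record
      retries=5,    # retry transient failures instead of dropping the write
  )

  # The send only succeeds once the in-sync replica set has the record,
  # so a leader crash during the commit does not lose it (assuming the
  # topic is replicated, e.g. 3 replicas with min.insync.replicas=2).
  future = producer.send("monolog", key=b"article-123", value=b"serialized bytes")
  metadata = future.get(timeout=10)
  producer.flush()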
dreamfactored - 2 hours ago
Isn't that what 'in general' means?
scaryclam - 5 hours ago
From what I can tell, Kafka isn't designed for long term data
storage. RDBMS systems are designed for this. Kafka is more for
streaming data and events, so I'd probably be uncomfortable
assuming that it won't do something and "tidy up" my very old
data at some point in the future. Since it's supposed to do
this from time to time out of the box, you'd have to be very
careful not to let anyone tweak the custom config to revert
back to this behaviour. RDBMS won't delete things unless you
tell it to more explicitly. While I think storing things in
Kafka is fine generally, there's no way I'd not have a more
permanent store of the data somewhere so that I can recreate
the Kafka data store if I need to. I'm not sure why they're not
just using a boring old DB for that purpose. Perhaps they have
a reason, but it's not obvious to me.
miguno - 3 hours ago
Kafka can perfectly keep your data around forever. The only
limitation is available disk space (and databases have the
same limitation). I'm not implying that it is always the
best idea to use Kafka as a long-term storage solution, but
likewise a database isn't the silver bullet here either.
> so
I'd probably be uncomfortable assuming that it won't do
something and "tidy up" my very old data at some point in the
future
Kafka doesn't "tidy up" your data unless it is
configured to do so. What's true is that, by default, Kafka
will keep your data around for a week "only" but that's a
configuration. And most people change it to whatever fits
their use case (some lower it to a few hours, some increase
it to months or years; others configure it to keep data
around forever, typically in combination with Kafka's so-
called "log compaction" functionality).> Since it's supposed
to do this from time to time out of the box, you'd have to be
very careful not to let anyone tweak the custom config to
revert back to this behaviour. RDBMS won't delete things
unless you tell it to more explicitly.
The DBAs I worked with would now say "Hold my beer..." ;-)
> While I think storing
things in Kafka is fine generally, there's no way I'd not
have a more perminant store of the data somewhere so that I
can recreate the Kafka data store if I need to.
What's
interesting is that more and more users (from my experience)
are actually beginning to treat Kafka as the source of truth,
and rather recreate other data stores -- like RDBMS, Elastic
indexes, etc. -- from it. If you like RDBMS, you can think
of Kafka as the DB's transaction log. IMHO a lot of this
discussion is about personal preferences, the situation that
you are in (cough legacy cough), team skills, etc. There are
often good reasons to use Kafka rather than RDBMS (in this
context) but also vice versa, or completely different
technologies of course (like blob stores, e.g. S3).
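A rough sketch of the "keep the latest version of everything, forever" setup described above, with the kafka-python client (the topic name is made up and is assumed to have been created with cleanup.policy=compact and retention.ms=-1):

  from kafka import KafkaProducer

  producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

  # Three revisions of the same asset, published under the same key.
  # After compaction runs, at least the newest revision per key remains;
  # sending a None value (a tombstone) is how a key is eventually removed.
  key = b"nyt://article/577d0341-9a0a-46df-b454-ea0718026d30"
  for revision in (b"asset revision 1", b"asset revision 2", b"asset revision 3"):
      producer.send("assets", key=key, value=revision)
  producer.flush()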
[deleted]
mayank - 2 hours ago
You've really just laid out feelings rather than concrete
technical reasons for why Kafka can't function as a permanent
datastore. Can you point to specific design elements in Kafka
that would lead you to conclude that it isn't suitable for
permanent data storage? Also, Kafka doesn't "do" anything to
your old data if you don't want it to. It's also open-source,
so these behaviors can be verified.
pram - 42 minutes ago
I'd say it isn't appropriate for permanent data storage
because the individual brokers don't scale well with the
amount of logs present. If you have hundreds of partitions,
and millions of logs, then any operation dealing with the
indexes (like an unclean startup) will take an extremely
long amount of time. So your individual brokers are now
down for an hour if they don't shut down cleanly. Which
happens often (OOM, weird ZK errors, etc.). It scales
linearly.
robohamburger - 5 hours ago
I skimmed the article but I imagined they were using it as a
secondary data store. I think they want it to be durable in
the sense that even if the events are already consumed they can
still play them back to reindex elastic search (which is a thing
you need to do periodically).
Diederich - 4 hours ago
"With the log as the source of truth, there is no longer any
need for a single database that all systems have to use.
Instead, every system can create its own data store (database)
-- its own materialized view -- representing only the data it
needs, in the form that is the most useful for that system.
This massively simplifies the role of databases in an
architecture, and makes them more suited to the need of each
application."
robohamburger - 3 hours ago
Fair enough. It seems like you still ought to be able to burn
the kafka+elasticsearch world down and resubmit everything to
kafka with such a setup (and thus rebuild elasticsearch). I would
certainly not sleep very well at night if I could not.
miguno - 3 hours ago
> I think they want it to be durable in the sense that even
if the events are already consumed they can still play them
back to reindex elastic search (which is a thing you need to do
periodically). That (replaying if needed) is exactly what Kafka
allows you to do, unless I misunderstood what you wrote.
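A minimal sketch of such a replay with the kafka-python client (topic, group id, and the in-memory dict are placeholders standing in for a real reindexing job against Elasticsearch or some other derived store):

  from kafka import KafkaConsumer

  consumer = KafkaConsumer(
      "monolog",
      bootstrap_servers="localhost:9092",
      group_id="search-reindexer-v2",  # fresh group id => no committed offsets
      auto_offset_reset="earliest",    # so consumption starts at the beginning
      enable_auto_commit=False,
  )

  view = {}
  for message in consumer:
      # Latest value per key wins, which is how a derived store
      # (search index, cache, materialized view) gets repopulated.
      view[message.key] = message.value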
_lex - 5 hours ago
I think the POV he's taken is that the kafka stream is the one
true datasource (for all time), with all other dbs being
derivatives. This insane strategy seems to be over-engineered to
get around db migrations... though I'm sure the event stream will
also change over time, and he'll have to write migration-like
code anyways.
LgWoodenBadger - 2 hours ago
There's not just one reason for it. Kafka lets dependent
consumers transform that data into whatever model is
appropriate for their use case. There's no one-size-fits-all
E-R model for data for all use cases. It's not just working
around db migrations. It's also providing you with the ability
to model the data as many ways as you require. How else would
you do it, materialized views? How often are they
materialized? Regular views? How performant are they? There
are many benefits to this approach, and a lot of them require a
different way of thinking. It's a different paradigm.
dreamfactored - 2 hours ago
Isn't that just skinny model, fat client?
rhizome - 3 hours ago
If datasource engineers are for some reason implementing
"ontogeny recapitulates phylogeny," they won't be creating
migration-like code, they'll be writing ETL filters. Break out
your Members Only jackets!
aug_aug - 4 hours ago
Figure 3: The Monolog, containing all assets ever published by The
New York Times.
toomim - 3 hours ago
> Traditionally, databases have been used as the source of truth
... [but] can be difficult to manage in the long run. First, it's
often tricky to change the schema of a database. Adding and
removing fields is not too hard, but more fundamental schema
changes can be difficult to organize without downtime.
This argument sounds self-contradictory. Kafka doesn't let you
change its schema at all! At least postgres gives you the option. It seems that the
author is excited about having a single source of truth that
doesn't change, and didn't realize that he could do that with a
database, if he just never used the schema-changing features. Am I
missing something? It seems like the author could be totally happy
with a bunch of derived postgres databases sitting in front of a
"source of truth" database, where he never changes the source of
truth database's schema. Why use kafka?
BenoitEssiambre - 3 hours ago
I tend to be the one arguing this, to stick to postgres for most
things, but even I will admit it does depend on scale. I'm not sure
what the NYT requirements are but from my understanding of Kafka,
its persistent redundant distributed queues scale automatically
horizontally across machines to support colossal amounts of data.
It's possible that they had difficulty fitting everything in a
postgres instance.
pcsanwald - 3 hours ago
The author states it's about 100GB of data.
pram - 52 minutes ago
Kafka is not persistent by default; the logs expire after 7
days. You can increase that on a per-topic basis. It also doesn't
scale automatically. If you have three replicas on a single
partition topic, they will live on their assigned broker
forever unless you manually reassign them. Adding new nodes
does not kick off rebalancing of partitions. Its automatic
cluster management is very primitive compared to something like
elasticsearch. For example, if you lose a broker, the replica
will just be gone forever. Unless you replace the broker with
the same id.
theossuary - 3 hours ago
See, that's where I'm confused. I'm no Kafka expert, but they
say they use a "single-partition topic" which I believe means
the only way they can replicate the data is by replicating the
entire log, they can't shard because it's a single partition.
The reasoning behind this is that Kafka doesn't support
ordering between partitions. Also, I've never thought of Kafka as
a persistent data storage solution, it's interesting Confluent
is supporting Kafka being used in this way.
poooogles - 2 hours ago
>it's interesting Confluent is supporting Kafka being used in
this way.
If it earns them money, I think they'll pretty much support
anything. Jay has rubber-stamped it on SO [1], but he's got a
bit of a vested interest in selling Kafka.
1. https://stackoverflow.com/questions/17708489/using-kafka-as-...
[deleted]
monktastic1 - 52 minutes ago
Postgres doesn't help with the streaming part though, does it?
robertlagrant - 1 hours ago
Heh this was exactly my thought. I've no idea why they couldn't
just store the data in a database! Or if they really have several
ways of storing data, then have a database for each and a common
API/ESB layer over the top. Seems normal.
kod - 1 hours ago
Kafka doesn't have a schema per message; messages are just bytes
that you can serialize / deserialize however you want to. The
article refers to using protobuf for messages, which does support
adding fields. If you're equating kafka topics with the idea of a
schema, you can add topics.
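To make the "just bytes" point concrete, here is a small sketch with the kafka-python client. Protobuf needs generated classes, so JSON stands in purely to show the adding-a-field idea; the topic and field names are invented:

  import json
  from kafka import KafkaProducer, KafkaConsumer

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
  )

  # A newer producer starts including an extra field...
  producer.send("articles", value={"id": "a-1",
                                   "headline": "Example headline",
                                   "byline_ids": ["b-42"]})
  producer.flush()

  consumer = KafkaConsumer(
      "articles",
      bootstrap_servers="localhost:9092",
      auto_offset_reset="earliest",
      value_deserializer=lambda b: json.loads(b.decode("utf-8")),
  )

  # ...and an older consumer that only knows "id" and "headline" simply
  # ignores it, much like protobuf ignores unknown field numbers.
  for message in consumer:
      article = message.value
      print(article["id"], article.get("headline", ""))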
l8again - 24 minutes ago
To add to this: having implemented a similar log-based
architecture, I would say that it is a much simpler data
infrastructure than having a central RDBMS, considering the use
case. Remember we need to deal with several consumer
applications with their own respective optimization models.
Postgres is a perfect choice for a given application's local
datastore, while for a different application they may want to
use ElasticSearch as their datastore. However, the "source of
truth" remains free of any such optimization requirements. You
simply save your "messages", "event", "facts", whatever you
want to call it in its purest form preferably immutable, and
let the consumer apps create/recreate their local datastores as
they deem fit.
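A sketch of one such consumer-local datastore, assuming kafka-python and psycopg2 (connection string, topic, and table are all hypothetical):

  import psycopg2
  from kafka import KafkaConsumer

  conn = psycopg2.connect("dbname=app_view user=app")
  conn.autocommit = True
  cur = conn.cursor()
  cur.execute("""
      CREATE TABLE IF NOT EXISTS assets (
          uri  text PRIMARY KEY,
          body bytea NOT NULL
      )
  """)

  consumer = KafkaConsumer(
      "monolog",
      bootstrap_servers="localhost:9092",
      group_id="app-view-builder",
      auto_offset_reset="earliest",  # replay from the start to (re)build the view
  )

  for message in consumer:
      # Upsert so the local store always holds the latest version per key.
      cur.execute(
          "INSERT INTO assets (uri, body) VALUES (%s, %s) "
          "ON CONFLICT (uri) DO UPDATE SET body = EXCLUDED.body",
          (message.key.decode("utf-8"), message.value),
      )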
sidlls - 2 hours ago
I agree. It's rare that a company needs something like Kafka. Kafka
introduces a number of issues related to the development of
client code and data stores (if any) and the maintenance of these
things. It's important that the actual scale justifies the
expense incurred.
[deleted]
pizzaman09 - 1 hours ago
What does this have to do with the article?
[deleted]
look_lookatme - 3 hours ago
This is very similar to a normalized model in a relational
database, with many-to-many relationships between the assets. In the
example we have two articles that reference other assets. For
instance, the byline is published separately, and then referenced
by the two articles. All assets are identified using URIs of the
form nyt://article/577d0341-9a0a-46df-b454-ea0718026d30. We have a
native asset browser that (using an OS-level scheme handler) lets
us click on these URIs, see the asset in a JSON form, and follow
references. The assets themselves are published to the Monolog as
protobuf binaries.
When consuming this data do you have to programmatically do
relationship fetching on the client side, or is eager
loading/joins available in some way in Kafka? Additionally,
there seems to be a focus on point-in-time specific views of this
data, but are you able to construct views using arbitrary
values/functions? Let's say each article is annotated with some geo
data, can you construct regional versions of these materialized
views of articles at the Kafka level? If not it seems like you are
pushing a fair amount of existing sophisticated behavior at the
RDBMS level up into custom built application servers.
oliveralbertini - 5 hours ago
Interesting, we use rabbitmq instead of kafka and we have a
re-indexing system... not sure if it's more complex, from what I can see.
mateuszf - 4 hours ago
Isn't this just event sourcing?
arthurk - 4 hours ago
Yes it is, but for some reason they called it "Log-based
architecture" in the article
eropple - 14 minutes ago
They call it that because Kafka is a log-based datastore.
tabeth - 5 hours ago
I wonder how much of this kind of stuff exists out of necessity and
how much of it exists because very smart people are just bored
and/or unsatisfied. Are there any articles that supplement this that
explain how much business value is added/lost by the
existence/removal of these kind of features? In the case of NYT I
suspect its popularity is maintained because of the perception
(real or not) of high quality journalism, in spite of any technical
failings.
---
How much would be lost if NYT were just implemented as
text articles that are cached and styled with some CSS.
"Personalization" could be added by tags each article has and a
small component that shows the three most recent articles that
share the same tag.
weego - 4 hours ago
I can't speak for the situation at the NYT but the actual public
sites for online papers are often pretty simple things with most
of the complexity being ad logic. The systems here almost
entirely deal with writing and content retrieval pipelines for
stuff that was written years ago in other systems that isn't
tagged or stored in sympathetic ways, and there will also be the
very old school print pipeline to have to deal with too.
cookiecaper - 4 hours ago
>I wonder how much of this kind of stuff exists out of necessity
and how much of it exists because very smart people are just
bored and/or unsatisfied.
That's a ton of it. Like it or not,
publishing a digital newspaper is not a hard or unsolved problem;
it's one of the web's core competencies. If you hire people who
want to build cool stuff to supervise a CMS, well, you get this
kind of outcome. The raw cost is understated because these
experimental setups misinterpret the functionality of the new
architectures/formats they're using. It doesn't truly rear its
ugly head until there is a major data loss or corruption event.
It's not that these never happen with RDBMS, it's just that RDBMS
contemplates this possibility and tries to make it pretty hard to
do that, whereas message queues just automatically delete stuff
(by design, so they can serve as functional message
queues!). RDBMS have spoiled us and we take their featureset, 40+
years in the making, for granted. We need to be careful and not
assume that `GROUP BY` is the only thing we leave on the table
when we "adopt" (more accurately abuse) one of these new-wave
solutions as a system of record. Since no one is going to admit to
their boss "this wouldn't have happened if we used Postgres", and
since most bosses are not going to know what that means, most of
these spectacular failures will never be accurately attributed to
their true cause: developers putting their interest in trying new
things above their duty to ensure their employer's systems are
reliable, stable, and resilient.
skybrian - 4 hours ago
Well, except that Kafka doesn't automatically delete stuff?
cookiecaper - 4 hours ago
https://kafka.apache.org/documentation/
Name: cleanup.policy
Description: A string that is either "delete" or "compact". This
string designates the retention policy to use on old log segments.
The default policy ("delete") will discard old segments when their
retention time or size limit has been reached.
??? How is this not automatic deletion of
stuff? I don't have to worry about someone setting a "delete
all data older and/or bigger than Y or Z" flag on a RDBMS,
because such a flag doesn't exist.
---
EDIT: Since my other
post got a few downvotes, and my account has been sanctioned
in the past for posting similar thoughts about tech-hipsters
diving head-first into Kubernetes without understanding any
of the effects of doing so, I can no longer post replies for
a period of some hours.As such, I will copy my reply to child
comment from snuxoll below, which I was not allowed to post,
and then log out for a while. Kafka devs, seek investment from
YC, you may get them to sanction accounts that are too
critical of the inappropriate adoption of your project, as
they've done to me for criticizing Docker/k8s.
reply:
>Log
compaction is generally what you want, which will preserve
the most recent message for every key in a topic. Event
streams expanding boundlessly is something very few will ever
need or want, so you'll toss messages into an "event" topic
of some kind, apply the event to the most recent entry in the
"model" topic and store a new version (which will be kept
after log compaction, the original messages can be pruned
after you need to free up storage).
Yeah, I didn't explicitly
detail the compaction option because it's still potentially
deleting huge amounts of data. It promises to keep the latest
state, which, yes, is better than deleting everything. But
why not just not delete anything? The article says they're
working with 100GB of data. There are VIDEO GAMES that get
downloaded to consumer hard disks that occupy more than that
(also not really excusably, but still). I understand why Kafka
provides these semantics and I think it's great that they do.
The issue is that developers choose to ignore them, and
seriously imperil their data's well-being by doing so.
snuxoll - 3 hours ago
Log compaction is generally what you want, which will
preserve the most recent message for every key in a topic.
Event streams expanding boundlessly is something very few
will ever need or want, so you'll toss messages into an
"event" topic of some kind, apply the event to the most
recent entry in the "model" topic and store a new version
(which will be kept after log compaction, the original
messages can be pruned after you need to free up storage).
LgWoodenBadger - 2 hours ago
You may not have to worry about somebody setting a "delete
all data older and/or bigger than Y or Z" but you have to
worry about someone running "DELETE FROM table" without a
WHERE clause. Which is easier to prevent? The one that
can be done through the same mechanism as non-destructive
queries? Or the one that can only be modified through a
file-system configuration, completely separate from its
API? Regardless, it's a different paradigm with different
"don't do that" behaviors that you need to know about. In
Kafka, if you want the persistent, append-only, write-ahead
log to not delete stuff, then configure the retention
period to keep things forever.
feelin_googley - 4 hours ago
"I wonder how much this kind of stuff exists out of necessity and
how much of it exists because very smart people are just bored
and/or unsatisfied."Do you mean only the developers (==very smart
people) hired by the NYT or those hired at other companies, too?
1wheel - 4 hours ago
> How much would be lost if NYT was just implemented as text
articles that are cached and styled with some
CSS.
https://www.nytimes.com/interactive/2016/12/28/us/year-in-in...
tabeth - 4 hours ago
That's neat, but most of those aren't anything that can't be
implemented with just CSS and HTML, unlike some of the stuff
you've done for the NYT. Though even in that case, isn't all of
this stuff just static assets? Hasn't this problem already been
"solved"?
iooi - 1 hours ago
> Because the topic is single-partition, it needs to be stored on a
single disk, due to the way Kafka stores partitions. This is not a
problem for us in practice, since all our content is text produced
by humans ? our total corpus right now is less than 100GB, and
disks are growing bigger faster than our journalists can
write.Before this line, the author mentions they also store images.
There's no way that all their text + images is <100GB right?
Something is inconsistent here.
chuhlomin - 1 hours ago
I think they mean image metadata (caption, paths to image files,
credits, etc.), not the images themselves.
kod - 1 hours ago
More than likely they store references to images in kafka, with
the actual image bytes being in a different store.
cturner - 4 hours ago
This is a flawed architecture. It will work at release, but it will
be difficult to manoeuvre with, and they will grow to hate it. As
your business changes, your data changes. Imagine if on day one,
they had one author per article. On day 1000, they change this to
be a list of authors.Kafka messages are immutable. Each of those
green boxes on the right hand side of the first diagram will need
to have special-case logic to unpack the kafka stream, with
knowledge of its changes (up until 17 May 2017, treat the data like
this, but between then and 19 May 2017 do x, and after that do
y). Document pipelines are a rare instance of a context where XML is
the best choice. They should have defined normalised file formats
for each of their data structures. Something like the gateway on
the left of the first diagram would write files in that format.
(At some future time, they will need to modify the normalised
formats. Files are good for that. You can change the gateway and
your stored files in coordination.) Secondly, they should have a
gateway coming out of the file store. For each downstream consumer,
they should have a distinct API. These APIs might look the same on
the first day of release. But they should be separate APIs so that
you are free to refactor them independently. When you have a one-to-
one API relationship, you can negotiate significant refactors in a
single phone call. When you have more than one codebase consuming,
you need to have endless meetings and project managers. I call
this, "The Principle of Two."Some of the other comments here say
that they should have used databases. So far, they have not made
the case for it. And databases are easily abused in settings like
this one. People connect multiple codebases to them, and use SQL as
a chainsaw. Again, you can't negotiate changes. When you create a
system, your data structures are the centre of that system. You
need to do everything you can to keep your options open to refactor
them at a later time, and to do so in a way that respects APIs that
you are offering your partners. Kafka is a good tool. If used well,
your deployment design will stop your system regularly (e.g. every
day), nuke the channels, recreate them from scratch, and restart
your system against these empty channels. You shouldn't use it as a
long-term data store.
amenod - 4 hours ago
> Kafka messages are immutable. Each of those green boxes on the
right hand side of the first diagram will need to have special-
case logic to unpack the kafka stream, with knowledge of its
changes (up until 17 May 2017, treat the data like this, but
between then and 19 May 2017 do x, and after that do y).
I
respectfully disagree. The genius of this approach is that you
can make the same transformation on the original Kafka stream to
change its schema and prepare the new feed. Once you are
satisfied with the results and you have switched all subscribers
to the new feed, just turn off the old one. Voila - you only have
y.
> This is a rare case where use of XML makes sense.
Sorry, but no. Just no.
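For what it's worth, that "transform the original stream into a new feed, then switch subscribers" step might look roughly like this with kafka-python (topic names, the JSON encoding, and the single-author-to-author-list change are all invented for illustration):

  import json
  from kafka import KafkaConsumer, KafkaProducer

  consumer = KafkaConsumer(
      "articles-v1",
      bootstrap_servers="localhost:9092",
      group_id="articles-v2-migrator",
      auto_offset_reset="earliest",  # start from the very first event
      consumer_timeout_ms=10000,     # stop once caught up (no messages for 10s)
      value_deserializer=lambda b: json.loads(b.decode("utf-8")),
  )
  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      acks="all",
      value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
  )

  for message in consumer:
      new = dict(message.value)
      # The day-1000 schema change: one "author" becomes a list of "authors".
      if "author" in new:
          new["authors"] = [new.pop("author")]
      producer.send("articles-v2", key=message.key, value=new)
  producer.flush()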
robertlagrant - 1 hours ago
Not sure how just having a new API on top of a database doesn't
achieve the same?
cturner - 3 hours ago
"Voila - you only have y."Thanks. My mental model had a long-
lived channel, but I follow your explanation.
skybrian - 4 hours ago
They're using protobufs, which seem just about as flexible as XML
as far as schema updates are concerned and are considerably less
ambiguous. So I don't see how XML would help?
bonesss - 1 hours ago
Through its various protocols Kafka topics can be configured to
be guaranteed forwards, backwards, or bi-directionally
compatible. That is to say: as flexible as XML or an RDBMS
schema with long-term, format encoded, data that can explicitly
support conflicting clients over time (as desired by the dev).
Zero-impact, live, online updates touching hundreds of active
systems without issue... TBH most posters here have completely
missed the forest for the trees. The point is not to avoid DB
migrations. The point is to support hundreds of DB migrations
in connected systems simultaneously with no schema-related
down-time or centralized point of failure or intractable CAP
challenges. Trying to solve these kinds of operational issues with
an RDBMS in an Enterprise context is _literally_ the "big ball
of mud" design pattern.Kafka, warts and all, is an operational
answer to how 1 client can feed 1 MM real-time connections, how
massive resource unlimited batch systems can integrate with
real-time feeds, how your data warehouse can keep growing
without painful forced restructuring, and how data architects
can mandate standards across multiple systems built by external
teams with human sized budgets.Data is part of it. Protocol,
format guarantees, and loosely coupled systems are where the
wins lie.
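Assuming the compatibility guarantees meant here are the ones enforced by Confluent's Schema Registry, a sketch of pinning a subject to a compatibility mode (registry URL and subject name are placeholders):

  import requests

  REGISTRY = "http://localhost:8081"   # placeholder Schema Registry address
  SUBJECT = "articles-value"           # placeholder subject name

  # BACKWARD, FORWARD, FULL, or NONE; the registry then rejects any new
  # schema version that would break the chosen contract for this subject.
  resp = requests.put(
      f"{REGISTRY}/config/{SUBJECT}",
      json={"compatibility": "FULL"},
      headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
  )
  resp.raise_for_status()
  print(resp.json())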
pm90 - 5 hours ago
Excellent, well-written article. The key takeaway seems to be that
instead of a temporary event stream log, since the number of news
articles (and associated assets) is finite and cannot explode, they
store all the "logs" forever (I'm using the term log as is defined
in the article, as a unit of a time-ordered data structure). I
wonder if NYT can help other news websites by making their code
open source? I'm a huge fan of NYT and their jump to digital has
just been amazing. However, I would also like my local newspaper
(which covers more regional news) to be able to serve quality
digital content.
sync - 4 hours ago
NYT has quite a lot of open source repos:
https://github.com/nytimes
knowtheory - 3 hours ago
> I wonder if NYT can help other news websites by making their
code open source?
Hey! I, and a number of other news nerds, have
been encouraging FOSS for the past decade or so. And in fact a
number of major open source projects have come out of news
related projects, including Django, Backbone.js/Underscore.js,
Rich Harris's work on Svelte.js, and a whole lot more. Most often
the problem with local news organizations is operational
constraints. The news biz has seen a huge downturn over this
same period of time. Most orgs, both on the reporting side and
on the tech side are super strapped for people-time.It's not
enough to have FOSS software, you also have to have folks doing
devops and maintaining systems often at below-market salaries.
jdcarter - 5 hours ago
FWIW, the article mentions the book "Designing Data-Intensive
Applications" by Martin Kleppmann. I wanted to throw out my own
endorsement for the book; it's been instrumental in helping me
design my own fairly intensive data pipeline.
emmelaich - 12 minutes ago
Had to reparse "throw out ..." a few times before I understood
what you were saying. Initially I thought you were withdrawing your
former endorsement.
teej - 3 hours ago
Dear HN reader - if you're not quite ready to buy the book, take
a listen to this episode of Software Engineering Daily
(https://softwareengineeringdaily.com/2017/05/02/data-intensi...).
It will give you a sense of what Martin Kleppmann is
all about and how he thinks about problems. I ordered my copy of
"Designing Data-Intensive Applications" after listening to this
episode.
[deleted]
dswalter - 3 hours ago
It's such a wonderful book. Reading it pushed me from thinking in
terms of what I had worked with to building systems based on what
was needed.I cannot recommend it highly enough, for pretty much
anyone in {frontend, backend, data science, etc}.
pcsanwald - 3 hours ago
I third this recommendation. I've worked on a ton of data
intensive applications on all kinds of stacks over the years,
and this book gives you lessons learned as well as a very
valuable historical perspective on relational databases that is
missing from a lot of the popular literature today.
hugofirth - 1 hours ago
An upvote was an insufficient form of agreement. Go buy this book, people!