HN Gopher Feed (2017-12-18) - page 1 of 10 ___________________________________________________________________
Colly - Fast and Elegant Scraping Framework for Golang
80 points by tampo9
https://github.com/gocolly/colly
___________________________________________________________________
mlevental - 3 hours ago
> Lightning Fast and Elegant Scraping Framework for Gophers

The bottleneck in scraping is never the parsing/DOM
representation/traversal.
tampo9 - 2 hours ago
Good performance matters if you have decent networking
infrastructure or your server has limited resources. Bandwidth and
IP limits are the most common bottlenecks, but these can be worked
around using multiple proxies and SSH tunnels. Colly has built-in
support for switching proxies [1].

[1] http://go-colly.org/docs/best_practices/distributed/
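Following the linked docs, the round-robin proxy switcher can be
wired up roughly like this (the proxy addresses below are
placeholders):

    package main

    import (
        "github.com/gocolly/colly"
        "github.com/gocolly/colly/proxy"
    )

    func main() {
        c := colly.NewCollector()

        // Rotate outgoing requests across several proxies.
        // These addresses are examples only.
        rp, err := proxy.RoundRobinProxySwitcher(
            "socks5://127.0.0.1:1337",
            "http://127.0.0.1:8080",
        )
        if err != nil {
            panic(err)
        }
        c.SetProxyFunc(rp)

        c.Visit("https://example.com/")
    }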
[deleted]
deoxxa - 54 minutes ago
Tell that to the project I migrated from Scrapy to Go six months
back. Granted, Scrapy might be doing other "fun" things to eat
into performance, but it was really night and day. Immediately
went from a CPU bottleneck to a network one.
[deleted]
Xeoncross - 1 hours ago
Please break up your main `colly.go` file into separate parts. If
possible you shouldn't have a 30-line import block covering
everything from cookies and regex to HTML and sync access.

Make sure to use DNS caching on the box, or else add it in Go.

Colly only supports a single machine via a map of visited URLs; it
would be great if you replaced it with a queue like redis or
beanstalkd. See the sketch below.

    visitedURLs map[uint64]bool
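A sketch of what that swap could look like, assuming a shared
Redis instance; the VisitedStore interface and key scheme are made
up for illustration and are not Colly's actual API:

    package main

    import (
        "fmt"

        "github.com/go-redis/redis"
    )

    // VisitedStore abstracts the "have we seen this URL hash?"
    // check so the in-process map can be swapped for a shared
    // backend.
    type VisitedStore interface {
        MarkVisited(urlHash uint64) error
        IsVisited(urlHash uint64) (bool, error)
    }

    // redisStore shares visited state across scraper machines.
    type redisStore struct{ client *redis.Client }

    func (r *redisStore) MarkVisited(h uint64) error {
        return r.client.SetNX(fmt.Sprintf("visited:%d", h), 1, 0).Err()
    }

    func (r *redisStore) IsVisited(h uint64) (bool, error) {
        n, err := r.client.Exists(fmt.Sprintf("visited:%d", h)).Result()
        return n > 0, err
    }

    func main() {
        store := &redisStore{client: redis.NewClient(
            &redis.Options{Addr: "localhost:6379"})}
        _ = store.MarkVisited(12345)
        seen, _ := store.IsVisited(12345)
        fmt.Println("seen:", seen)
    }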
guessmyname - 1 hours ago
SQLite would like to disagree with you:

> Over 100 separate source files are concatenated into a single
large file of C-code named "sqlite3.c" and called "the
amalgamation". The amalgamation contains everything an application
needs to embed SQLite. The amalgamation file is more than 180,000
lines long and over 6 megabytes in size.

> Combining all the code for SQLite into one big file makes SQLite
easier to deploy - there is just one file to keep track of. And
because all code is in a single translation unit, compilers can do
better inter-procedure optimization resulting in machine code that
is between 5% and 10% faster.

https://www.sqlite.org/amalgamation.html
detaro - 58 minutes ago
I don't think that's a good comparison to justify large code
files:

a) it's automatically generated, so you can do dev work on the
"easier" split version

b) Go comes with package management that helps solve the
deployment issue

c) Question: does the concept of a "compilation unit" affect the
Go compiler the same way?
Xeoncross - 55 minutes ago
> SQLite would like to disagree with you... Over 100 separate
source files

Um, how does this disagree?
fiatjaf - 1 hours ago
Please don't follow this suggestion. It's very helpful and healthy
to have everything in a single file if you consider that
manageable, so there's no problem at all.
Xeoncross - 1 hours ago
It would be one thing if he just wanted a single Go file for one
reason or another, but the repo already contains multiple files.
I'm recommending he break up the main file according to concerns
(common in Go).
JepZ - 21 minutes ago
Well, in Go all files in one directory are part of one package
(best practice), and files within the same package do not have to
import each other to have access to other functions. Therefore,
breaking a package into several files is common practice.

A sane approach is, for example, to create a separate file for
each type (Collector, HTMLElement, Request, Response, ...) and its
attached functions/methods.
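As a sketch of that layout (the file split below is illustrative,
not the repo's actual structure):

    // collector.go
    package colly

    // Collector owns the crawl state; its methods live in the
    // same file as the type.
    type Collector struct {
        visitedURLs map[uint64]bool
    }

    func (c *Collector) Visit(url string) error {
        // ...
        return nil
    }

    // request.go (same package, so no import is needed to
    // reference Collector)
    package colly

    // Request pairs a URL with the Collector that issued it.
    type Request struct {
        URL       string
        collector *Collector
    }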
gschier - 1 hours ago
Yes, this looks like very common Go code. No need to increase
complexity by breaking it up.
fiatjaf - 1 hours ago
For DOM parsing I cannot imagine that there could be anything
better than https://github.com/PuerkitoBio/goquery.
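For reference, bare goquery usage looks roughly like this (the URL
and selector are placeholders):

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        resp, err := http.Get("https://example.com/")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Parse the body and walk matches of a CSS selector.
        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
        doc.Find("a").Each(func(_ int, s *goquery.Selection) {
            if href, ok := s.Attr("href"); ok {
                fmt.Println(href)
            }
        })
    }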
jjuel - 1 hours ago
Which is funny, because if you look at the code, this is using
goquery. Which then makes you wonder: why would I use this when I
can just use goquery?
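For comparison, a minimal sketch of the same link extraction done
through colly's callback layer, which wraps goquery:

    package main

    import (
        "fmt"

        "github.com/gocolly/colly"
    )

    func main() {
        c := colly.NewCollector()

        // colly parses pages with goquery internally and hands
        // each selector match to the callback; e.DOM exposes the
        // underlying *goquery.Selection if needed.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            fmt.Println(e.Attr("href"))
        })

        c.Visit("https://example.com/")
    }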
blowski - 3 hours ago
The obvious question - why would I use this over Scrapy?
tptacek - 3 hours ago
Because you're using Golang and not Python.
sheraz - 3 hours ago
Hear, hear. The right tool for the right job, and I can't think of
a "righter" tool for this kind of job.

Edit - not picking on you, but given the quality and ecosystem of
libraries and ancillary tools for Scrapy, I don't even consider
alternatives at this point. Good on anyone who does it to learn,
but for actual workloads I won't consider anything else.
[deleted]
[deleted]
dguaraglia - 10 minutes ago
I'm always surprised by how many web scraping frameworks/libraries
I see sprout here on HN on a regular basis. Is web scraping
something people are doing, or is web scraping the new
high-concurrency version of the "to do list" utility everyone used
to write as an exercise?

This is an honest question; I'm not trying to take a dig at anyone
in particular.