HN Gopher Feed (2017-06-30) - page 1 of 10 ___________________________________________________________________
Do 20 pages of a book gives you 90% of its words?
43 points by kiechu
https://blog.vocapouch.com/do-20-pages-of-a-book-gives-you-90-of...
___________________________________________________________________
rfrank - 1 hours ago
I wonder how Pale Fire by Nabokov would look after this sort of
analysis. For the unfamiliar, per wikipedia, "Starting with the
table of contents, Pale Fire looks like the publication of a
999-line poem in four cantos ("Pale Fire") by the fictional John
Shade with a Foreword, extensive Commentary, and Index by his self-
appointed editor, Charles Kinbote. Kinbote's Commentary takes the
form of notes to various numbered lines of the poem. Here and in
the rest of his critical apparatus, Kinbote explicates the poem
surprisingly little. Focusing instead on his own concerns, he
divulges what proves to be the plot piece by piece, some of which
can be connected by following the many cross-references. Espen
Aarseth noted that Pale Fire "can be read either unicursally,
straight through, or multicursally, jumping between the comments
and the poem."[4] Thus although the narration is non-linear and
multidimensional, the reader can still choose to read the novel in
a linear manner without risking misinterpretation."
s_kilk - 1 hours ago
Huh, sounds a little like House Of Leaves, which has a similarly
weird structure. I'll have to check out Pale Fire.
rfrank - 13 minutes ago
Ah nice, I need to do the same with House of Leaves, I'm a big
fan of stories with unconventional structuring. Sometimes a
Great Notion by Kesey is my favorite; it's told from multiple
first-person perspectives that shift pretty rapidly, where the
shifts are indicated by having a particular speaker's text
italicized, in parentheses, with no formatting applied, etc.
It's pretty neat.
Finch2192 - 1 hours ago
This doesn't seem all that groundbreaking, it's just an instance of
Zipf's law in action, is it not?
kiechu - 1 hours ago
Yes, that's Zipf's law applied. I doubt that many language
learners knew about this law. I think it is still worth pointing
out that once you get through the beginning of a book, reading
rapidly becomes easier.
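A rough way to see why Zipf's law produces this effect: if the word of rank r has frequency proportional to 1/r, the share of all word tokens covered by the top k words out of a vocabulary of V is H_k / H_V, where H_n is the n-th harmonic number. The sketch below assumes that idealized form with exponent exactly 1, which real texts only approximate; the function names are hypothetical and not from the blog post.

```python
def harmonic(n: int) -> float:
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / r for r in range(1, n + 1))


def zipf_coverage(top_k: int, vocab_size: int) -> float:
    """Share of word tokens covered by the top_k most frequent words.

    Assumption: frequency of the rank-r word is proportional to 1/r
    (idealized Zipf's law), over a vocabulary of vocab_size words.
    """
    if vocab_size < 1 or not 0 <= top_k <= vocab_size:
        raise ValueError("need vocab_size >= 1 and 0 <= top_k <= vocab_size")
    return harmonic(top_k) / harmonic(vocab_size)


if __name__ == "__main__":
    # Print the modeled coverage for a few vocabulary cutoffs.
    for k in (100, 1000, 10000):
        print(k, round(zipf_coverage(k, 50000), 3))
```

Because harmonic numbers grow roughly like a logarithm, each extra block of known words adds less coverage than the previous one, which matches the "rapidly easier at first" shape described above.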
twoodfin - 10 minutes ago
It'd be an interesting exercise in Modernist writing to try
producing a book that violates Zipf's law, say by hashing all
but the most common few hundred words into chapter buckets.
ihaveajob - 1 hours ago
I bet this is not true for the Encyclopedia Britannica, by design.
pealco - 1 hours ago
This doesn't really address your teacher's claim about having to
look words up, though. What you want to look at is the distribution
of low frequency words across the book. What do the plots look like
when you remove proper nouns, functional words (e.g., "the", "and",
prepositions) and, say, the top 1000 most frequent words in
English?
anon1094 - 44 minutes ago
Would be very interesting to see this applied to blogs in
different categories to rapidly learn languages through reading
based on the words that you currently know and the most frequent
words in that language. So it would always present you with the
article that suits your level and you would have the benefit of
learning the most new words.
kiechu - 42 minutes ago
That's something worth trying.
kiechu - 37 minutes ago
It probably would look more or less similar. They are excluded
very quickly. There is something I cannot assess: how important is
a given word to understanding the sequence?
oconnor0 - 1 hours ago
Not if it's a dictionary!
kiechu - 43 minutes ago
Or a phone book.
bryanrasmussen - 2 hours ago
The use of Eve's Diary doesn't make any sense here; of course the
distribution of words in a short story is going to be longer than
in a book. Ulysses is fair, but I would expect it and works of a
similar caliber to be outliers.
kiechu - 1 hours ago
It is MythBusters-style science. The goal was to see how it works
with short and long books, and with one reputed to be an easy read
and one reputed to be a hard read. It would be interesting to see
it on a larger sample, with more statistics involved.
dri_ft - 1 hours ago
For the record, Ulysses is at least a full order of magnitude more
comprehensible than Joyce's next book, Finnegans Wake. I'd also
expect it to give a skewed response on a test of this kind because
it is composed of a number of different sections, which vary
considerably in their style. But maybe that's the point of
including it.
kiechu - 43 minutes ago
Here are the Finnegans Wake graphs. It is indeed even more
complicated.
https://github.com/vocapouch/vocapouch-research/blob/master/...
Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872
You will know 90% of words after 387 pages, which is 53.09% of the
book. At that page, you will know 60.64% of unique words.
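For readers wondering how a figure like "90% of words after 387 pages" could be computed, here is a minimal sketch of one plausible reading of the metric: the first page after which the vocabulary seen so far accounts for at least 90% of all word tokens in the whole book. This is an assumption about the method, not the author's confirmed code; the function name and the input shape (one list of already-normalized words per page) are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def pages_to_reach_coverage(pages: List[List[str]],
                            target: float = 0.90) -> Optional[int]:
    """Return the 1-based page count after which the words seen so far
    account for at least `target` of all word tokens in the book.

    Returns None for an empty book or an unreachable target.
    """
    # Token counts over the whole book, used to weight each new word.
    totals = Counter(word for page in pages for word in page)
    total_tokens = sum(totals.values())
    if total_tokens == 0:
        return None

    seen = set()
    covered = 0  # tokens in the whole book whose word has been seen
    for page_number, page in enumerate(pages, start=1):
        for word in page:
            if word not in seen:
                seen.add(word)
                covered += totals[word]
        if covered / total_tokens >= target:
            return page_number
    return None


if __name__ == "__main__":
    toy_book = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
    print(pages_to_reach_coverage(toy_book, target=0.85))
```

The "percent of the book" figure is then just the page count divided by the total number of pages, for example 387 / 729 for the numbers quoted above.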
al452 - 1 hours ago
"incomprehensibility"
kiechu - 1 hours ago
Fixed. Thank you!
loeg - 38 minutes ago
> we turned words to their basic forms (went to go, cars to car,
> jumps to jump etc.)
FYI, this is called stemming.
https://en.wikipedia.org/wiki/Stemming
elchief - 9 minutes ago
or contextually, lemmatization
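To make the stemming versus lemmatization distinction concrete, here is a toy sketch. Both functions are deliberately simplistic and hypothetical: the stemmer only strips a few suffixes, and the lemma table is a hand-written stand-in for the full dictionary (and often part-of-speech information) that a real lemmatizer relies on. It is not how the blog post's analysis was implemented.

```python
def naive_stem(word: str) -> str:
    """Crude suffix stripping; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lemma table, for illustration only.
LEMMAS = {"went": "go", "cars": "car", "jumps": "jump"}


def toy_lemmatize(word: str) -> str:
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ("went", "cars", "jumps"):
        print(w, naive_stem(w), toy_lemmatize(w))
```

Suffix stripping handles "cars" and "jumps" but leaves "went" untouched, while the lookup maps "went" to "go". That irregular case is why the transformation quoted above is closer to lemmatization.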
twoodfin - 1 hours ago
FWIW, Ulysses isn't particularly incomprehensible. To the extent
that it's difficult to read, it's much more the shifting narrative
perspective, widely ranging references, and stream-of-consciousness
rather than the vocabulary. Take this typical section from the
"Lotus Eaters" chapter, wherein Mr. Bloom is contemplating the
origins of the wares in a tea shop: So warm. His right hand once
more more slowly went over again: choice blend, made of the finest
Ceylon brands. The far east. Lovely spot it must be: the garden of
the world, big lazy leaves to float about on, cactuses, flowery
meads, snaky lianas they call them. Wonder is it like that. Those
Cinghalese lobbing around in the sun, in dolce far niente. Not
doing a hand's turn all day. Sleep six months out of twelve. Too
hot to quarrel. Influence of the climate. Lethargy. Flowers of
idleness. The air feeds most. Azotes. Hothouse in Botanic gardens.
Sensitive plants. Waterlilies. Petals too tired to. Sleeping
sickness in the air. Hard to be too confused by the imagery and
mood in this passage. Now, Finnegans Wake...
kiechu - 1 hours ago
I will run Finnegans Wake in a moment and I will get back with a
response. I must find it in a text format.
kawera - 1 hours ago
Have you tried non-fiction books?
kiechu - 44 minutes ago
What do you have in mind?
samstave - 35 minutes ago
The Bible. (Just kidding.) What about running this on a tweet
history, say that of a POTUS?
kiechu - 28 minutes ago
From what I see, POTUS stays within the basic 1000 words.
samstave - 22 minutes ago
Can you plot his words from most used to least used, please?
samstave - 18 minutes ago
Actually, I do have a challenge for you: US laws. Can you
evaluate this law that was just signed 3 days ago?
http://leginfo.legislature.ca.gov/faces/billTextClient.xhtml...
(There is a link to a PDF of the law if you prefer to download the
PDF first...)
kiechu - 18 minutes ago
That is a slightly different analysis, but I will. Ping me at
@r_kierzkowski on Twitter so we can stay in touch.
kiechu - 11 minutes ago
For the bill you posted:
Number of Pages: 217
Number of Total Words: 65396
Number of Unique Words: 3106
You will know 90% of words after 28 pages, which is 12.90% of the
book. At that page, you will know 36.77% of unique words.
The graph is less regular, but it has more or less the same shape.
I will not publish this part because it is not a book.
tomjakubowski - 1 hours ago
Excerpts from Finnegans Wake are great fodder to break CSS
layouts during development.
Somanyobscenelylongwordswithoutbreaks.
debt - 1 hours ago
ulysses and finnegans wake are inspiration porn. i find the more
jarring the sequence of words/sentences/phrases from a few
passages the more inspired i become after. best to have this type
of stuff on hand when you get stuck. it's neurological
kiechu - 44 minutes ago
Here are the Finnegans Wake graphs. It is indeed even more
complicated.
https://github.com/vocapouch/vocapouch-research/blob/master/...
Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872
You will know 90% of words after 387 pages, which is 53.09% of the
book. At that page, you will know 60.64% of unique words.