Hacker News started as a pet project for a venture capital firm named after a concept in lambda calculus, the Y Combinator. Since then, it has grown to become the go-to source of all technology news amongst technology people [citation needed]. Besides serving as the holy grail of daily updates of what's going on in the tech world, it has, over time, managed to accumulate a history of what tech talks about, what tech cares about, and the progress tech has made in the recent past. In this post, I look at interesting things the data from HN can tell us. In another post, The Top 100 Hacker News Posts of All Time, I go over HN's top submissions.

As of October 13, 2015, out of nearly 2 million Hacker News submissions (1,959,809), merely 217 have managed to rack up over 1000 upvotes. That's about one out of every 9000 posts. Recently, I stumbled upon one of Google engineer Felipe Hoffa's many awesome curated datasets on Google's BigQuery, containing the data on all HN stories (I encourage you to look at some of his other datasets, which include all of Reddit, Wikipedia, Freebase, NYC Taxis, Uber, and more).

Post Volume

The growth of Hacker News post volume over time, and the subsequent stabilization starting late 2011.

Hacker News had its humble beginnings on October 9, 2006, although daily activity only shows up in the data from Feb 19, 2007. From then on, the daily volume of content rose steadily, peaking on November 29, 2011 with 1474 stories. After that, the average daily volume has remained steady at around 900 a day. Interestingly, probably due to a long outage or a bug, Jan 5, 2014 has much less content than expected, and the next day has none. HN volume is much lower on Saturdays and Sundays, at about half the weekday level, while the weekdays all share similar volume.

Over the course of a week we see clear daily post rhythms on weekdays and a much lower post volume on weekends. The daily volume on weekdays peaks at 7AM PST (California time), while on Saturday and Sunday it peaks at 8AM and 9AM respectively.

Average Upvote Volume

The slow steady growth of average daily upvotes on Hacker News over time.

Interestingly, the average number of upvotes per article has also grown since 2007 and is still rising steadily today, at about 10 upvotes per post. The distribution of upvotes is, unsurprisingly, skewed. After just one other vote on a submission (posters upvote their own posts by default), your post is at the 50th percentile of all HN. That is, only half of all HN submissions have been upvoted by anyone other than their poster. At 14 votes you hit the 90th percentile, 43 votes gets you to the 95th, and you need 139 for the 99th. Although we previously saw that weekends get about half the post volume, the average number of upvotes per post on weekends is approximately 16% higher.
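
The percentile figures come from sorting every score and asking what share of posts falls below a given value. Here is a minimal Python sketch of that calculation. The `sample_scores` list is made up for illustration (it is not the real dataset), and the function name `percentile_rank` is my own:

```python
from bisect import bisect_left

def percentile_rank(scores, value):
    """Percentage of submissions whose score is strictly below `value`."""
    ordered = sorted(scores)
    below = bisect_left(ordered, value)  # number of scores < value
    return 100.0 * below / len(ordered)

# Hypothetical sample: half the posts keep only the submitter's own vote.
sample_scores = [1, 1, 1, 1, 1, 2, 3, 5, 14, 150]

print(percentile_rank(sample_scores, 2))   # 50.0
print(percentile_rank(sample_scores, 14))  # 80.0
```

With the real scores in place of the toy list, the same function should give the thresholds quoted above, assuming "percentile" means the share of posts with a strictly lower score.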

Average upvotes, too, show weekly variations. They bounce from 8 to 10 during the work week, and between 10 and 13 on the weekends. They peak at 4AM-7AM PST and steadily decay for the rest of the day on weekdays, but less so on weekends.

Average upvote volume is particularly interesting on two days: Jan 6, 2014, which had 0 votes, again most likely because of an outage, and Jan 12, 2013. The latter has an average of 32 upvotes per post, almost twice as much as any other day in HN history. I wonder how many of you remember what happened on that day?

It was one of the most tragic days in HN history - the suicide of Aaron Swartz. Out of the 39 stories that had over 100 upvotes, 36 of them were related to Aaron Swartz. You can relive that day by querying the following in BigQuery, or find the results here.

  SELECT
    *
  FROM
    [fh-bigquery:hackernews.stories]
  WHERE
    DATE(time_ts)='2013-01-12'
  ORDER BY
    score DESC
  LIMIT
    100

The SQL for querying weekly trends is as follows. Note that BigQuery deals in UTC, which is 8 hours ahead of PST.

  SELECT
    (DAYOFWEEK(time_ts)-1)*24 + HOUR(time_ts) AS hour_of_week,
    SUM(score) AS total_score,
    COUNT(1) AS total_posts,
    SUM(score)/COUNT(1) AS average_upvotes
  FROM
    [fh-bigquery:hackernews.stories]
  WHERE
    time_ts IS NOT NULL
  GROUP BY
    1
  ORDER BY
    1;
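
Because the weekly-trends query buckets posts by hour of the week in UTC, the buckets have to be shifted by 8 hours before they can be read as PST times. A rough Python sketch of that shift follows. It assumes `DAYOFWEEK` returns 1 for Sunday, so bucket 0 is Sunday midnight, and it ignores daylight saving time. The helper names are mine:

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
HOURS_PER_WEEK = 7 * 24
PST_OFFSET_HOURS = -8  # PST = UTC-8; daylight saving time is ignored here

def utc_hour_of_week_to_pst(hour_of_week_utc):
    """Shift a 0..167 hour-of-week bucket from UTC to PST, wrapping around the week."""
    return (hour_of_week_utc + PST_OFFSET_HOURS) % HOURS_PER_WEEK

def describe(hour_of_week):
    """Turn a 0..167 bucket into a (day name, hour of day) pair."""
    day_index, hour = divmod(hour_of_week, 24)
    return DAYS[day_index], hour

# Monday 15:00 UTC is bucket 1*24 + 15 = 39 under the query's formula.
print(describe(utc_hour_of_week_to_pst(39)))  # ('Monday', 7)
# Sunday 03:00 UTC wraps back to Saturday evening in PST.
print(describe(utc_hour_of_week_to_pst(3)))   # ('Saturday', 19)
```

The modulo keeps early-Sunday UTC buckets from going negative by wrapping them onto the end of Saturday.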

Best Sources

What are the best domains shared on Hacker News? Using several interpretations of what best means, we figured out what they were.

Most Commonly Shared Domains

If by "best", we mean most commonly shared domains, the following are the top 20. Most of these are the usual suspects - large aggregator sites and publications. The only site I hadn't heard of in the top 20 was ReadWrite Web.

Rank Title Submissions Domain
1 GitHub 31600 github.com
2 YouTube 30872 youtube.com
3 TechCrunch 30219 techcrunch.com
4 NY Times 20694 nytimes.com
5 Medium 18535 medium.com
6 Ars Technica 13697 arstechnica.com
7 Wired 11855 wired.com
8 BBC 8855 bbc.co.uk
9 Wikipedia 8336 en.wikipedia.org
10 Business Insider 7493 businessinsider.com
11 Mashable 7107 mashable.com
12 Forbes 6887 forbes.com
13 VentureBeat 6739 venturebeat.com
14 The Next Web 6698 thenextweb.com
15 The Verge 6528 theverge.com
16 Wall Street Journal 6329 online.wsj.com
17 Washington Post 5797 washingtonpost.com
18 GigaOm 5656 gigaom.com
19 ReadWrite Web 5611 readwriteweb.com
20 The Atlantic 5457 theatlantic.com

The SQL query to get these results is:

SELECT
  a.domain,
  COUNT(1) AS c
FROM (
  SELECT
    REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain,
    score
  FROM
    [fh-bigquery:hackernews.stories]) a
GROUP BY
  a.domain
ORDER BY
  c DESC
LIMIT
  100
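
To see what the domain-extraction pattern does to individual URLs, here is a small Python sketch using the standard `re` module. It uses the same pattern shape as the query, with the dot after `www` escaped so that it matches only a literal dot. With an unescaped dot, a host such as `www2.example.com` would lose its first four characters. The function name and the example URLs are my own:

```python
import re

# Same shape as the query's pattern, with the dot after "www" escaped so it
# only matches a literal "." (an adjustment assumed for this sketch).
DOMAIN_PATTERN = re.compile(r'^https?://(?:www\.)?([^/]*)/?(?:.*)')

def extract_domain(url):
    """Return the host part of an http(s) URL without a leading 'www.', or None."""
    if not url:
        return None
    match = DOMAIN_PATTERN.match(url)
    return match.group(1) if match else None

print(extract_domain("https://www.github.com/torvalds/linux"))         # github.com
print(extract_domain("http://en.wikipedia.org/wiki/Lambda_calculus"))  # en.wikipedia.org
print(extract_domain("http://www2.example.com/"))                      # www2.example.com
print(extract_domain("ftp://example.com/file"))                        # None
```

Subdomains other than `www` are kept, which is why Wikipedia shows up as en.wikipedia.org in the table.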

Most Upvoted Domains

If by "best", we mean most upvoted domains, the list largely remains unchanged from before. Notable (and surprising) additions include EFF and Google+.

Rank Title Upvotes Domain
1 GitHub 490623 github.com
2 TechCrunch 407218 techcrunch.com
3 NY Times 341044 nytimes.com
4 Ars Technica 196861 arstechnica.com
5 Wired 180171 wired.com
6 Medium 172429 medium.com
7 YouTube 121898 youtube.com
8 Washington Post 116831 washingtonpost.com
9 BBC 113378 bbc.co.uk
10 The Guardian 90038 theguardian.com
11 The Atlantic 88693 theatlantic.com
12 Wikipedia 81609 en.wikipedia.org
13 EFF 74903 eff.org
14 Google+ 73453 plus.google.com
15 Google 73319 google.com
16 The Next Web 71910 thenextweb.com
17 Wall Street Journal 68477 online.wsj.com
18 Bloomberg 65599 bloomberg.com
19 Forbes 63611 forbes.com
20 Economist 62160 economist.com

The SQL query is:

SELECT
  a.domain,
  SUM(score) AS c
FROM (
  SELECT
    REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain,
    score
  FROM
    [fh-bigquery:hackernews.stories]) a
GROUP BY
  a.domain
ORDER BY
  c DESC
LIMIT
  100

Domains with the Highest Average Upvotes

The list changes drastically when we rank domains by average upvotes per submission and filter by a minimum number of submissions. I arbitrarily chose 100 submissions as the bar, to keep only domains that are shared often. This reveals some extremely interesting content. We see some companies - Tesla, SpaceX, Mozilla, Stripe, and unsurprisingly, Y Combinator. About half of the 20 are personal blogs of developers and other people who are influential in technology (and not very well known outside it). Wikileaks also made it. To me, the most surprising entry was Kalzumeus, which I had never heard of.

Rank Title Upvotes Submissions Quality Domain
1 Y Combinator 21706 175 124.03 blog.ycombinator.com
2 Stripe 18289 150 121.93 stripe.com
3 Kalzumeus, Patrick McKenzie 18910 159 118.93 kalzumeus.com
4 Tesla 17557 160 109.73 teslamotors.com
5 Antirez, Salvatore Sanfilippo's Blog 11644 131 88.89 antirez.com
6 RaganWald, Reginald Braithwaite 10214 117 87.30 raganwald.posterous.com
7 SpaceX 10556 121 87.24 spacex.com
8 Daemonology, Colin Percival 9815 116 84.61 daemonology.net
9 Jacques Mattheij 25575 311 82.23 jacquesmattheij.com
10 Zach Holman 9802 129 75.98 zachholman.com
11 Dustin Curtis 8802 118 74.59 dcurt.is
12 Go Lang Blog 8193 119 68.85 blog.golang.org
13 Paul Graham 45864 672 68.25 paulgraham.com
14 Derek Sivers 17304 261 66.30 sivers.org
15 Armin Ronacher 7685 124 61.98 lucumr.pocoo.org
16 Wikileaks 9417 160 58.86 wikileaks.org
17 Bret Victor's Blog 8109 143 56.71 worrydream.com
18 Mozilla 8282 150 55.21 mozilla.org
19 Mailing List Archives 7653 144 53.15 marc.info
20 Google Online Security 6481 123 52.70 googleonlinesecurity.blogspot.com

The SQL query for the highest average upvotes per domain with a minimum of 100 shares is:

SELECT
  b.domain,
  s,
  c,
  s/c AS quality
FROM (
  SELECT
    a.domain,
    SUM(score) AS s,
    COUNT(1) AS c
  FROM (
    SELECT
      REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain,
      score
    FROM
      [fh-bigquery:hackernews.stories]) a
  GROUP BY
    a.domain) b
WHERE
  c >= 100
ORDER BY
  quality DESC
LIMIT
  100
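
The same aggregation is easy to follow in plain Python: sum the scores and count the submissions per domain, drop the domains below the threshold, then sort by the ratio. This is only a sketch on toy data. The function name and the `.example` domains are hypothetical:

```python
from collections import defaultdict

def rank_by_average_score(stories, min_submissions=100):
    """Rank domains by mean score, keeping only those with enough submissions.

    `stories` is an iterable of (domain, score) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # domain -> [score sum, submission count]
    for domain, score in stories:
        totals[domain][0] += score
        totals[domain][1] += 1

    ranked = [
        (domain, score_sum, count, score_sum / count)
        for domain, (score_sum, count) in totals.items()
        if count >= min_submissions
    ]
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked

# Hypothetical toy data; the threshold is lowered to 2 so the filter is visible.
toy_stories = [
    ("a.example", 10), ("a.example", 30),
    ("b.example", 100),                      # only one submission: filtered out
    ("c.example", 5), ("c.example", 5), ("c.example", 20),
]
for row in rank_by_average_score(toy_stories, min_submissions=2):
    print(row)
```

Without the minimum-submissions filter, `b.example` would top the list on the strength of a single lucky post, which is the problem the threshold of 100 is meant to avoid.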

What People Talk About

We know the source of the content people share, but what do they actually talk about?

Most Commonly Upvoted Words

Let's take a look at the words in the titles of the stories that get the most upvotes. The SQL query for this is:

SELECT
  a.word,
  SUM(a.score) AS score
FROM (
  SELECT
    LOWER(SPLIT(title, ' ')) AS word,
    score
  FROM
    [fh-bigquery:hackernews.stories]) a
GROUP BY
  a.word
ORDER BY
  score DESC
LIMIT
  1000
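
For readers who prefer to see the word aggregation outside SQL, here is a rough Python equivalent on toy data: every word of a title gets the story's full score added to its total. The titles and scores below are invented for illustration:

```python
from collections import Counter

def score_by_word(stories):
    """Sum story scores per lowercased title word, splitting on single spaces."""
    totals = Counter()
    for title, score in stories:
        for word in title.lower().split(' '):
            totals[word] += score
    return totals

# Hypothetical (title, score) pairs.
toy_stories = [
    ("Google open sources a Python tool", 120),
    ("Show HN: A Python web app", 40),
    ("Google buys a startup", 60),
]
for word, score in score_by_word(toy_stories).most_common(3):
    print(word, score)
# a 220
# google 180
# python 160
```

The word "a" coming out on top of even this tiny sample shows why stopwords have to be removed before the ranking means anything.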

The top 40 words, excluding stopwords (which I admittedly handpicked), are below. They cover the standard array of programming languages, companies, and other tech and programming related things.

Rank Word Upvotes Rank Word Upvotes
1 google 633322 21 twitter 122475
2 web 360208 22 iphone 121689
3 startup 277140 23 windows 121070
4 data 248914 24 design 119559
5 app 248277 25 nsa 118330
6 facebook 232569 26 language 114610
7 apple 224476 27 project 109374
8 code 214499 28 apps 109072
9 programming 201684 29 computer 108865
10 javascript 182948 30 github 108706
11 python 178466 31 [pdf] 107560
12 source 169257 32 ios 106800
13 internet 167170 33 search 106669
14 software 161382 34 system 106441
15 android 161100 35 build 105453
16 microsoft 160804 36 tech 103110
17 game 152917 37 security 102922
18 linux 141067 38 bitcoin 102473
19 hacker 129485 39 os 96854
20 amazon 124177 40 startups 96643

Another way to track what people on Hacker News talk about is by tracing the rise and fall of specific words. To test this, I used the words "bitcoin", which gained traction relatively recently, and "php", which I hypothesized was popular in the past and has waned since. It turns out that this is indeed the case.
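
One simple way to trace a word over time is to count, for each year, the stories whose title contains it. The Python sketch below does that on invented data. Counting stories (rather than upvotes) and grouping by year are assumptions made for this illustration, and all the names are my own:

```python
from collections import Counter
from datetime import datetime

def mentions_per_year(stories, word):
    """Count, per year, the stories whose title contains `word` as a whole token."""
    counts = Counter()
    for title, posted_at in stories:
        if word in title.lower().split():
            counts[posted_at.year] += 1
    return dict(sorted(counts.items()))

# Hypothetical (title, timestamp) pairs.
toy_stories = [
    ("PHP 5.3 released", datetime(2009, 6, 30)),
    ("Why I still use PHP", datetime(2009, 11, 2)),
    ("Bitcoin hits $100", datetime(2013, 4, 1)),
    ("Bitcoin exchange hacked", datetime(2014, 2, 25)),
    ("Buying pizza with bitcoin", datetime(2014, 5, 22)),
]
print(mentions_per_year(toy_stories, "bitcoin"))  # {2013: 1, 2014: 2}
print(mentions_per_year(toy_stories, "php"))      # {2009: 2}
```

Dividing each yearly count by the total number of stories in that year would correct for the overall growth of HN.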

Who Posts The Best Content

As with domains, we look at three rankings of users on HN: most prolific posters, most upvoted posters, and highest upvotes per submission among users with at least 100 submissions. Thankfully, due to the HN karma system, there are no prolific posters who get by with substandard post quality.

Most Prolific Contributors

With a runaway total of over 7000 submissions on Hacker News, Clement Wan has averaged 2.24 posts a day since Hacker News took off (it's been 3,158 days since Feb 19, 2007). Two very mysterious users appear on this list: iProject, who has no user description and posts a lot of content from popular publications, and nickb. nickb is a great conspiracy theory story if there ever was one. There is a thread which points out that it is in fact a pseudonym of Paul Graham, the YC founder. It came out when nickb replied, seemingly without hesitation, to a response to one of Paul Graham's (pg) comments as if he were pg himself.

Rank User Submissions Upvotes Description
1 cwan 7077 52833 Clement Wan, founder of a few niche contract manufacturing services in Hong Kong/China
2 shawndumas 6602 64308 Shawn Dumas, front-end engineer at Nest (Google)
3 evo_9 5659 41765 Rick Giampietro, founder of web dev company DotGlow
4 nickb 4322 29611 Quite a conspiracy theory, but revealed to be another account of Paul Graham here.
5 iProject 4266 26436 No clues given
6 bootload 4212 28759 Peter Renshaw, Programmer, Melbourne, Australia
7 edw519 3844 30073 Ed Weissman, professional programmer for 32 years
8 ColinWright 3766 77799 Colin Wright, PhD in Math and founder of Solipsys
9 nreece 3724 29841 Ashutosh Nilkanth, entrepreneur and programmer from Melbourne
10 tokenadult 3659 36769 Karl Bunday, founding director of the Edina Center for Academic Excellence
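
The posts-per-day figure for cwan is easy to double-check with Python's `datetime` module. The dates and the submission count are the ones quoted in this post, and the variable names are mine:

```python
from datetime import date

first_tracked_day = date(2007, 2, 19)   # first day with logged daily activity
snapshot_day = date(2015, 10, 13)       # date of the dataset snapshot
cwan_submissions = 7077

days_elapsed = (snapshot_day - first_tracked_day).days
posts_per_day = cwan_submissions / days_elapsed

print(days_elapsed)             # 3158
print(round(posts_per_day, 2))  # 2.24
```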

Most Upvoted Contributors

When it comes to most upvoted contributors, Colin Wright leads the list. The list contains 4 usual suspects from the most prolific contributors and notably, nickb's "real half", pg.

Rank User Submissions Upvotes Description
1 ColinWright 3766 77799 Colin Wright, PhD in Math and founder of Solipsys
2 shawndumas 6602 64308 Shawn Dumas, front-end engineer at Nest (Google)
3 llambda 2601 60432 Max Countryman, engineer and open-source contributor
4 fogus 2420 57038 Michael Fogus, Clojure and ClojureScript contributor
5 danso 2625 53587 Dan Nguyen, Stanford lecturer in Computational Journalism
6 cwan 7077 52833 Clement Wan, founder of a few niche contract manufacturing services in Hong Kong/China
7 luu 2266 51838 Dan Luu, ex-Google engineer, currently at Microsoft
8 ssclafani 1326 49155 Stephen Sclafani, security researcher and founder of Play To Win
9 pg 708 46333 Paul Graham, YC founder
10 evo_9 5659 41765 Rick Giampietro, founder of web dev company DotGlow

Highest Quality Contributors

Note, again, that these are users with at least 100 submissions, ranked by average upvotes per post. Funnily, the highest quality poster is whoishiring, a bot which submits "Who is Hiring?" posts at 11AM EST on the first weekday of every month. Here are the top 10 and their descriptions:

Rank User Upvotes Submissions Quality Description
1 whoishiring 23156 126 183.78 A bot that submits "Who is Hiring?" posts every month.
2 jaf12duke 8947 123 72.74 Jason Freedman, two time YC alum, runs 42Floors
3 cperciva 10541 145 72.70 Colin Percival, founder of Tarsnap, FreeBSD security officer, runs his blog Daemonology.
4 pg 46333 708 65.44 Paul Graham, YC founder
5 jsnell 7761 124 62.59 Juho Snellman, systems programmer from Zurich
6 jordanmessina 7452 121 61.59 Jordan Messina, YC alum and founder of density.io
7 paul 6458 107 60.36 Paul Buchheit, lead dev on Gmail
8 tptacek 17969 310 57.96 Thomas Ptacek, founder of Matasano Security
9 wlll 5869 103 56.98 Unsure - probably Jason Fried, founder of Basecamp
10 dko 7461 144 51.81 Derrick Ko, PM at Lyft

I've only scratched the surface of what Hacker News data can tell us, and I'm sure there's plenty more. Let me know in the comments if you think there would be other cool things worth exploring, or if there are any other cool analyses of HN data out there. And do leave any feedback! If you enjoyed this, make sure you check out The Top 100 Hacker News Posts of All Time.

I love hearing feedback! If you didn't like something, let me know in the comments and feel free to reach out to me. If you did, you can share it with your followers in one click or follow me on Twitter!