Analyzing node.js on GitHub with BigQuery

13 August 2016 Posted Under: node.js [0] comments

BigQuery + GitHub awesomeness

As someone who works on developer tooling - GitHub is the holy grail of data sets. There’s just so much code out there, written by so many people, for so many reasons. I’ve often wished I could just clone all of the data on GitHub, and then write scripts to process the data for various reasons:

What are the top 1k npm modules used with Node.js apps? We want to know this so we can test them with App Engine.
What percentage of people are defining their supported Ruby versions in .ruby-version files? What about Gemfile? Can we reliabily use that to choose a Ruby version for the user?
What’s the most common way to inject configuration? Environment variables? Nconf? Etcd? Dotenv?

For each of these, we’re largely left to poke around using anecodtal observations or surveys. Having a simple way of answering these questions would be huge. Well… using the new public GitHub dataset with BigQuery we can.

GitHub on BigQuery: Analyze all the open source code

BigQuery is essentially a giant data warehouse that lets you store petabytes of data, originally built for internal use at Google. Usually querying over this much data requires a ton of infrastructure and an understanding of MapReduce… but BigQuery lets me just use SQL.

One of the fun things BigQuery offers is a bunch of public data sets. Some of the fun sets include:

I was hanging out with with Sandeep Dinesh at NodeSummit a few weeks ago, and we were chatting about some of the new data available in BigQuery from GitHub. We figured with a little bit of SQL … we could learn all kinds of cool stuff.

To get started - first, you’re going to need to visit the BigQuery console.

BigQuery + GitHub awesomeness

From here we can choose the dataset, and start taking a look at the schema. Now lets start asking some interesting questions!

How many files are out there on GitHub

We just need to query over the github_repos.files table, and get a count.

SELECT COUNT(*) FROM [bigquery-public-data:github_repos.files]

2,122,805,654

Wow -over 2 billion files. Next question!

How many package.json(s) are there on GitHub?

This time we’re just going to limit our files to paths ending in package.json. We can just use the RIGHT function to grab the end of the full path:

SELECT COUNT(*) FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path, 12) = "package.json"

8,128,298

Over 8 million! Now of course - this could include any project that has a package.json (not node.js), so it’s probably going to be a little front-end heavy.

What’s the most popular top level npm import on GitHub?

So here’s the big one. Lets say you want to know which npm module is most likely to be imported as a top level dependency? You could get some of this data by looking at npmjs.com, but that’s going to include subdependencies, and also count every install. I don’t want every install - I want to know how many apps are using which modules.

Up until this point, we’ve only been looking at the data available to us directly in the table. But in this case - we want to parse the contents of a file. This is where things start to get fun. This query will…

Grab all of the package.json files out there
Get the contents of those files
Run a JavaScript user-defined-function
Place the results in a temp table
Do a GROUP BY / ORDER BY to get our final count

Let’s take a look!

SELECT
  COUNT(*) as cnt, package
FROM
  JS(
    (SELECT content FROM [bigquery-public-data:github_repos.contents] WHERE id IN (
      SELECT id FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path, 12) = "package.json"
    )),
    content,
    "[{ name: 'package', type: 'string'}]",
    "function(row, emit) {
      try {
        x = JSON.parse(row.content);
        if (x.dependencies) {
          Object.keys(x.dependencies).forEach(function(dep) {
            emit({ package: dep });
          });
        }
      } catch (e) {}
    }"
  )
GROUP BY package
ORDER BY cnt DESC
LIMIT 1000

So this is really freaking cool. We were able to just slam a JavaScript function in the middle of the SQL query to help us process the results. You may also notice the try/catch floating around in there - turns out that not every package.json on GitHub is valid JSON!

So let’s take a look at the results:

Package	Count
express	66207
lodash	55698
debug	47499
async	40054
inherits	35782
body-parser	35644
request	31242
mkdirp	25941
chalk	25904
readable-stream	25015
glob	24497
underscore	24151
morgan	22398
minimatch	20561
cookie-parser	19957
react	19764
through2	18488
mongoose	17992
commander	17805
jade	17666
isarray	16677
minimist	16518
socket.io	15675
moment	15434
graceful-fs	15198
qs	14663
object-assign	14218
jquery	13709
serve-favicon	13641
string_decoder	13597
source-map	13548
babel-runtime	13524
rimraf	13233
gulp-util	13055
express-session	13045
core-util-is	13041
bluebird	12751
semver	12722
passport	12530
q	11990
colors	11710
mime	11627
react-dom	11560
ejs	11392
xtend	11312
node-uuid	11265
optimist	11070
gulp	10934
compression	10759
once	10544
mime-types	10352

( … it keeps going for a while ) At the end of this - we processed quite a bit of data.

Query complete (209.3s elapsed, 1.76 TB processed)

What other types of questions should we ask? I can think of a few that may be interesting:

Which npm dependencies are the most likely to be out of date?
How many people are using the fs npm module (the one on npmjs.com, not the core module)
How many people are hard coding keys in their JavaScript files?

If you want to play around with the GitHub dataset, check out the getting started tutorial.

If you find the answers to these (or anything else interesting), let me know at @JustinBeckwith!