For someone who works on developer tooling, GitHub is the holy grail of data sets. There’s just so much code out there, written by so many people, for so many reasons. I’ve often wished I could just clone all of the data on GitHub, and then write scripts to answer questions like these:
What are the top 1k npm modules used with Node.js apps? We want to know this so we can test them with App Engine.
What percentage of people are defining their supported Ruby versions in `.ruby-version` files? What about `Gemfile`? Can we reliably use that to choose a Ruby version for the user?
What’s the most common way to inject configuration? Environment variables? Nconf? Etcd? Dotenv?
For each of these, we’re largely left to poke around using anecdotal observations or surveys. Having a simple way of answering these questions would be huge. Well… with the new public GitHub dataset in BigQuery, we can.
BigQuery is essentially a giant data warehouse that lets you store petabytes of data, originally built for internal use at Google. Usually querying over this much data requires a ton of infrastructure and an understanding of MapReduce… but BigQuery lets me just use SQL.
One of the fun things BigQuery offers is a bunch of public data sets that anyone can query.
I was hanging out with Sandeep Dinesh at NodeSummit a few weeks ago, and we were chatting about some of the new data available in BigQuery from GitHub. We figured that with a little bit of SQL, we could learn all kinds of cool stuff.
To get started, you’ll first need to visit the BigQuery console.
From here we can choose the dataset and start taking a look at the schema. Now let’s start asking some interesting questions!
How many files are out there on GitHub?
We just need to query over the `github_repos.files` table and get a count.
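A minimal sketch of that query, written in BigQuery standard SQL against the public `bigquery-public-data.github_repos` dataset, looks something like this:

```sql
-- Count every file tracked in the GitHub public dataset.
SELECT COUNT(*) AS total_files
FROM `bigquery-public-data.github_repos.files`;
```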
Wow - over 2 billion files. Next question!
How many package.json(s) are there on GitHub?
This time we’re just going to limit our files to paths ending in `package.json`. We can just use the `RIGHT` function to grab the end of the full path:
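As a sketch, again assuming the public dataset and today’s standard SQL dialect (where `RIGHT` returns the last N characters of a string):

```sql
-- Count files whose path ends in package.json.
-- 'package.json' is 12 characters long.
SELECT COUNT(*) AS package_json_files
FROM `bigquery-public-data.github_repos.files`
WHERE RIGHT(path, 12) = 'package.json';
```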
Over 8 million! Now of course - this could include any project that has a `package.json` (not just Node.js), so it’s probably going to be a little front-end heavy.
What’s the most popular top level npm import on GitHub?
So here’s the big one. Let’s say you want to know which npm module is most likely to be imported as a top level dependency. You could get some of this data by looking at npmjs.com, but that would include subdependencies, and would count every install. I don’t want every install - I want to know how many apps are using which modules.
Up until this point, we’ve only been looking at the data available to us directly in the table. But in this case - we want to parse the contents of a file. This is where things start to get fun. This query will…
- Grab all of the `package.json` files out there
- Get the contents of those files
- Place the results in a temp table
- Do an `ORDER BY` to get our final count
Let’s take a look!
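The full query isn’t reproduced here, but a rough sketch in standard SQL might look like the one below. It uses a `WITH` clause rather than a temp table, joins `files` to `contents` on the blob id, and leans on a deliberately simplified regex to pull module names out of the top-level "dependencies" block, so it will miss some formatting edge cases:

```sql
-- Sketch: count how often each npm module appears as a top-level
-- dependency in a package.json on GitHub. Assumes the public
-- bigquery-public-data.github_repos dataset.
WITH package_files AS (
  SELECT c.content
  FROM `bigquery-public-data.github_repos.files` f
  JOIN `bigquery-public-data.github_repos.contents` c
    ON f.id = c.id
  WHERE RIGHT(f.path, 12) = 'package.json'
)
SELECT
  -- Pull the module name out of each "name": "version" pair.
  REGEXP_EXTRACT(dep, r'"([^"]+)"\s*:') AS module,
  COUNT(*) AS uses
FROM package_files
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(
  -- Grab the body of the "dependencies" object, then each entry in it.
  REGEXP_EXTRACT(content, r'"dependencies"\s*:\s*\{([^}]*)\}'),
  r'"[^"]+"\s*:\s*"[^"]*"')) AS dep
GROUP BY module
ORDER BY uses DESC
LIMIT 25;
```

Fair warning: a query like this scans the entire `contents` table, which is why the run below chewed through well over a terabyte.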
Here’s what we get back:
(… it keeps going for a while.) At the end of this - we processed quite a bit of data:
Query complete (209.3s elapsed, 1.76 TB processed)
What other types of questions should we ask? I can think of a few that may be interesting:
- Which npm dependencies are the most likely to be out of date?
- How many people are using the `fs` npm module (the one on npmjs.com, not the core module)?
If you want to play around with the GitHub dataset, check out the getting started tutorial.
If you find the answers to these (or anything else interesting), let me know at @JustinBeckwith!