Analyzing node.js on GitHub with BigQuery

13 August 2016 Posted Under: node.js [0] comments

BigQuery + GitHub awesomeness

As someone who works on developer tooling - GitHub is the holy grail of data sets. There’s just so much code out there, written by so many people, for so many reasons. I’ve often wished I could just clone all of the data on GitHub, and then write scripts to process the data for various reasons:

  • What are the top 1k npm modules used with Node.js apps? We want to know this so we can test them with App Engine.

  • What percentage of people are defining their supported Ruby versions in .ruby-version files? What about Gemfile? Can we reliably use that to choose a Ruby version for the user?

  • What’s the most common way to inject configuration? Environment variables? Nconf? Etcd? Dotenv?

For each of these, we’re largely left to poke around using anecdotal observations or surveys. Having a simple way of answering these questions would be huge. Well… using the new public GitHub dataset with BigQuery we can.

BigQuery is essentially a giant data warehouse that lets you store petabytes of data, originally built for internal use at Google. Usually querying over this much data requires a ton of infrastructure and an understanding of MapReduce… but BigQuery lets me just use SQL.

One of the fun things BigQuery offers is a bunch of public data sets.

Fun with Sandeep at NodeSummit

I was hanging out with Sandeep Dinesh at NodeSummit a few weeks ago, and we were chatting about some of the new data available in BigQuery from GitHub. We figured with a little bit of SQL … we could learn all kinds of cool stuff.

To get started - first, you’re going to need to visit the BigQuery console.

BigQuery + GitHub awesomeness

From here we can choose the dataset, and start taking a look at the schema. Now let’s start asking some interesting questions!

How many files are out there on GitHub

We just need to query over the github_repos.files table, and get a count.

SELECT COUNT(*) FROM [bigquery-public-data:github_repos.files]

Wow - over 2 billion files. Next question!

How many package.json(s) are there on GitHub?

This time we’re just going to limit our files to paths ending in package.json. We can just use the RIGHT function to grab the end of the full path:

SELECT COUNT(*) FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path, 12) = "package.json"

Over 8 million! Of course, this includes any project that has a package.json (not just node.js apps), so it’s probably going to be a little front-end heavy.
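One caveat with the RIGHT(path, 12) check: it is a plain suffix comparison, so a path like docs/my-package.json also counts. Here is a minimal JavaScript sketch of the same test next to a stricter file name match. The helper names are made up for illustration and are not part of BigQuery:

```javascript
// Mirrors RIGHT(path, 12) = "package.json": compare the last 12 characters.
function matchesSuffix(path) {
  return path.slice(-12) === 'package.json';
}

// Stricter variant: the final path segment must be exactly package.json.
function isPackageJson(path) {
  return path.split('/').pop() === 'package.json';
}

// Hypothetical sample paths.
const samples = [
  'package.json',
  'lib/package.json',
  'docs/my-package.json',
  'package.json.bak',
];

samples.forEach((p) => {
  console.log(p, matchesSuffix(p), isPackageJson(p));
});
```

For a rough count the difference probably doesn't matter much, but it is worth keeping in mind before reading too much into the exact number.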

So here’s the big one. Let’s say you want to know which npm module is most likely to be imported as a top level dependency. You could get some of this data by looking at, but that’s going to include subdependencies, and also count every install. I don’t want every install - I want to know how many apps are using which modules.

Up until this point, we’ve only been looking at the data available to us directly in the table. But in this case - we want to parse the contents of a file. This is where things start to get fun. This query will…

  • Grab all of the package.json files out there
  • Get the contents of those files
  • Run a JavaScript user-defined-function
  • Place the results in a temp table
  • Do a GROUP BY / ORDER BY to get our final count

Let’s take a look!

SELECT
  COUNT(*) as cnt, package
FROM
  JS(
    (SELECT content FROM [bigquery-public-data:github_repos.contents] WHERE id IN (
      SELECT id FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path, 12) = "package.json"
    )),
    content,
    "[{ name: 'package', type: 'string'}]",
    "function(row, emit) {
      try {
        x = JSON.parse(row.content);
        if (x.dependencies) {
          Object.keys(x.dependencies).forEach(function(dep) {
            emit({ package: dep });
          });
        }
      } catch (e) {}
    }"
  )
GROUP BY package
ORDER BY cnt DESC
LIMIT 1000

So this is really freaking cool. We were able to just slam a JavaScript function in the middle of the SQL query to help us process the results. You may also notice the try/catch floating around in there - turns out that not every package.json on GitHub is valid JSON!
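Since the user-defined function is plain JavaScript, you can try the logic locally before spending a query on it. The sketch below reuses the same (row, emit) shape and adds a small in-memory stand-in for the GROUP BY / ORDER BY step. The countPackages helper and the sample rows are hypothetical, for illustration only:

```javascript
// Same shape as the UDF in the query: called once per row, may emit 0..n rows.
function extractDependencies(row, emit) {
  try {
    const pkg = JSON.parse(row.content);
    if (pkg.dependencies) {
      Object.keys(pkg.dependencies).forEach((dep) => emit({ package: dep }));
    }
  } catch (e) {
    // Invalid JSON: skip the file, as the query does.
  }
}

// In-memory stand-in for GROUP BY package / ORDER BY cnt DESC.
function countPackages(rows) {
  const counts = new Map();
  rows.forEach((row) => {
    extractDependencies(row, ({ package: name }) => {
      counts.set(name, (counts.get(name) || 0) + 1);
    });
  });
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical sample rows; real rows come from github_repos.contents.
const rows = [
  { content: '{"dependencies": {"express": "~4.0.0", "lodash": "^4.0.0"}}' },
  { content: '{"dependencies": {"express": "~3.1.0"}}' },
  { content: '{"name": "no-deps"}' },
  { content: '{ this is not valid JSON' },
];

console.log(countPackages(rows));
```

The last two rows exercise the two quiet paths: a package.json with no dependencies, and one that fails to parse.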

So let’s take a look at the results:

Package Count
express 66207
lodash 55698
debug 47499
async 40054
inherits 35782
body-parser 35644
request 31242
mkdirp 25941
chalk 25904
readable-stream 25015
glob 24497
underscore 24151
morgan 22398
minimatch 20561
cookie-parser 19957
react 19764
through2 18488
mongoose 17992
commander 17805
jade 17666
isarray 16677
minimist 16518
(name missing) 15675
moment 15434
graceful-fs 15198
qs 14663
object-assign 14218
jquery 13709
serve-favicon 13641
string_decoder 13597
source-map 13548
babel-runtime 13524
rimraf 13233
gulp-util 13055
express-session 13045
core-util-is 13041
bluebird 12751
semver 12722
passport 12530
q 11990
colors 11710
mime 11627
react-dom 11560
ejs 11392
xtend 11312
node-uuid 11265
optimist 11070
gulp 10934
compression 10759
once 10544
mime-types 10352

( … it keeps going for a while ) At the end of this - we processed quite a bit of data.

Query complete (209.3s elapsed, 1.76 TB processed)

What other types of questions should we ask? I can think of a few that may be interesting:

  • Which npm dependencies are the most likely to be out of date?
  • How many people are using the fs npm module (the one on, not the core module)
  • How many people are hard coding keys in their JavaScript files?

If you want to play around with the GitHub dataset, check out the getting started tutorial.

If you find the answers to these (or anything else interesting), let me know at @JustinBeckwith!


Building node.js applications on Google Cloud Platform

24 May 2016 Posted Under: node.js [0] comments

Node.js is for hats and cats

In March I had the chance to talk at GCP Next on Node.js @ Google. This is a fun little tour of what Google Cloud has to offer Node.js developers.


  • 00:00 - Intro
  • 01:57 - Why Node.js
  • 02:57 - Node.js + Google Cloud Platform
  • 03:48 - The many engines of Google Cloud
  • 04:48 - Getting started with App Engine
  • 07:26 - Traffic splitting
  • 11:45 - Cloud Shell
  • 16:03 - Google Cloud APIs & Services
  • 17:03 - gcloud npm module
  • 20:17 - Cloud cats demo
  • 22:09 - Code review
  • 25:24 - Cloud Debugger
  • 27:30 - Cloud Trace
  • 28:47 - Enterprise Node.js at NodeSource
  • 37:05 - Node.js and IoT
  • 41:09 - Hatspin
  • 42:31 - Closing

Watch the video



Dependency management and Go

29 May 2015 Posted Under: Go [0] comments

I find dependency management and package managers interesting. Each language has its own package manager, and each one has characteristics that are specific to that community. NuGet for .NET has great tooling and Visual Studio support, since that’s important to the .NET developer audience. NPM has a super flexible model, and great command line tools.

In a lot of ways, golang is a little quirky. And that’s awesome. However - I’ve really struggled to wrap my head around dependency management in Go.

"Dependency management and golang"

When dealing with dependency management, I expect a few things:

1. Repeatable builds

Given the same source code, I expect to be able to reproduce the same set of binaries. Every. Time. Every bit of information needed to complete a build, whether it be on my local dev box or on a build server, should be explicitly called out in my source code. No surprises.

2. Isolated environments

I am likely to be working on multiple projects at a time. Each project may have a requirement on different compilers, and different versions of the same dependency. At no point should changing a dependency in one project have an effect on the dependencies on a completely separate project.

3. Consensus

Having a package management story is awesome. What’s even better is making sure everyone uses the same one :) As long as developers are inventive and curious, there will always be alternatives. But there needs to be consensus on the community accepted standard on how a package manager will work. If 5 projects use 5 different models of dependency management, we’re all out of luck.

How node.js does it

As I’ve talked about before, I like to use my experience with other languages as a way to learn about a new language (just like most people I’d assume). Let’s take a look at how NPM for node.js solves these problems.

Similar to the go get command, there is an npm install command. It looks like this:

npm install --save yelp

The big difference you’ll see is --save. This tells NPM to save the dependency and the version I’m using into the package.json for my project:

{
  "name": "pollster",
  "version": "2.0.0",
  "private": true,
  "scripts": {
    "start": "node server"
  },
  "dependencies": {
    "express": "~3.1.0",
    "nconf": "~0.6.7",
    "": "~0.9.13"
  }
}

package.json is stored in the top level directory of my app. It provides my isolation. If I start another project - that means another package.json, another set of dependencies. The environments are entirely isolated. The list of dependencies and their versions provides my repeatability. Every time someone clones my repository and runs npm install, they will get the same list of dependencies from a centralized source. The fact that most people use NPM provides my consensus.

Version pinning is accomplished using semver. The ~ relaxes the rules on version matching, meaning I’m ok with bringing down a different version of my dependency, as long as it is only a PATCH - which means no API breaking changes, only bug fixes. If you’re being super picky (on production stuff I am), you can specify a specific version minus the ~. For downstream dependencies (dependencies of your dependencies) you can lock those in as well using npm-shrinkwrap. On one of my projects, I got bit by the lack of shrink-wrapping when a misbehaved package author used a wildcard import for a downstream dependency that actually broke us in production.
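To make the ~ rule concrete: ~3.1.0 accepts any version from 3.1.0 up to, but not including, 3.2.0. Here is a simplified JavaScript sketch of that check. It is not npm's actual implementation (the real semver module handles pre-release tags, partial versions, and many more range operators), and the function names are made up for illustration:

```javascript
// Parse "MAJOR.MINOR.PATCH" into numbers. Pre-release tags are not handled.
function parseVersion(version) {
  const parts = version.split('.').map(Number);
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error('expected MAJOR.MINOR.PATCH, got: ' + version);
  }
  return { major: parts[0], minor: parts[1], patch: parts[2] };
}

// "~3.1.0" accepts >=3.1.0 and <3.2.0; a bare "3.1.0" accepts only 3.1.0.
function satisfies(version, range) {
  const v = parseVersion(version);
  if (range[0] !== '~') {
    const exact = parseVersion(range);
    return v.major === exact.major && v.minor === exact.minor && v.patch === exact.patch;
  }
  const base = parseVersion(range.slice(1));
  return v.major === base.major && v.minor === base.minor && v.patch >= base.patch;
}

console.log(satisfies('3.1.4', '~3.1.0'));
console.log(satisfies('3.2.0', '~3.1.0'));
console.log(satisfies('3.1.1', '3.1.0'));
```

Dropping the ~ turns the range into an exact match, which is the "super picky" option described above.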

The typical workflow is to check in your package.json, and then .gitignore your node_modules directory that contains the actual source code of 3rd party packages.

It’s all pretty awesome.

Go out of the box

With the out of the box behavior, Go is less than ideal in repeatability, isolation, and consensus. If you follow the setup guide for golang, you’ll find yourself with a single directory where you’re supposed to keep all of your code. Inside of there, you create a /src directory, and a new directory for each project you’re going to work on. When you install a dependency using go get, it will essentially drop the source code from that repository into $GOPATH/src. In your source code, you just tell the compiler where it needs to go to grab the latest sources:

import ""
client := yelp.New(options)
result, err := client.DoSimpleSearch("coffee", "seattle")

So this is really bad. The go-yelp library I’m importing from github is pulled down at compile time (if not already available from a go get command), and built into my project. That is pointing to the master branch of my github repository. Who’s to say I won’t change my API tomorrow, breaking everyone who has imported the library in this way? As a library author, I’m left with 3 options:

  1. Never make breaking changes.
  2. Make a completely new repository on GitHub for a new version of my API that has breaking changes.
  3. Make breaking changes, and assume / hope developers are using a dependency management tool.

Without using an external tool (or one of the methods I’ll talk about below), there is no concept of version pinning in go. You point towards a namespace, and that path is used to find your code during the build. For most open source projects - the out of the box behavior is broken.

My problem is that the default workflow on a go project leads you down a path of sadness. You start with a magical go get command that installs the latest and greatest version of a dependency - but doesn’t ask you which specific version or hash of that dependency you should be using. Most web developers have been conditioned to not check our dependencies into source control, if they’re managed by a package manager (see: gem, NuGet, NPM, bower, etc). The end result is that I could easily break someone else, and I can easily be broken.

Vendoring, import rewrites, and the GOPATH

There is currently no agreed upon package manager for Go. Recently the Go team kicked up a great thread asking the community for their thoughts on a package management system. There are a few high level concepts that are helpful to understand.


Vendoring

At Google, the source code for a dependency is copied into the source tree, and checked into source control. This provides repeatability. There is never a question on where the source is downloaded from, because it is always available in the source tree. Copying the source from a dependency into your own source is referred to as “vendoring”.

Import rewriting

After you copy the code into your source tree, you need to change your import path to not point at the original source, but rather to point at a path in your tree. This is called “Import rewriting”.

After copying a library into your tree, instead of this:

import ""
client := yelp.New(options)

you would do this:

import "yourtree/third_party/"
client := yelp.New(options)
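The rewrite itself is mechanical, which is why tools can automate it. Here is a rough JavaScript sketch that prefixes third-party import paths in a Go source string. It only handles single-line import statements, it treats any path with a dot in its first segment as third party, and example.com/me/lib is a placeholder path rather than a real library:

```javascript
// Prefix third-party import paths with the vendored location.
// Only single-line `import "path"` statements are handled in this sketch.
function rewriteImports(source, vendorPrefix) {
  return source.replace(/^(\s*import\s+")([^"]+)"/gm, (match, open, importPath) => {
    const firstSegment = importPath.split('/')[0];
    // Standard library packages such as "fmt" have no dot in the first segment.
    if (!firstSegment.includes('.')) {
      return match;
    }
    return open + vendorPrefix + '/' + importPath + '"';
  });
}

// Hypothetical Go source with one standard library and one remote import.
const goSource = 'import "fmt"\nimport "example.com/me/lib"\n';
console.log(rewriteImports(goSource, 'yourtree/third_party'));
```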


GOPATH rewriting

Vendoring and import rewriting provide our repeatable builds. But what about isolation? If project (x) relies on go-yelp#v1.0, project (y) should be able to rely on go-yelp#v2.0. They should be isolated. If you follow How to write go code, you’re led down a path of a single workspace, which is driven by $GOPATH. $GOPATH is where libraries installed via go get will be installed. It controls where your own binaries are generated. It’s generally the defining variable for the root of your workspace. If you try to run multiple projects out of the same directory - it completely blows up isolation. If you want to be able to reference different versions of the same dependency, you need to change the $GOPATH variable for each current project. The act of changing the $GOPATH environment variable when switching projects is “GOPATH rewriting”.
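A rough sketch of the idea in JavaScript: rather than changing a shared $GOPATH by hand, build a separate environment per project and hand it to whatever process runs the build. The workspace directory name and the envForProject helper are assumptions for illustration; they are not part of the go toolchain:

```javascript
const path = require('path');

// Build an environment for one project with GOPATH pointing at that
// project's own workspace. The "workspace" directory name is hypothetical.
function envForProject(projectDir, baseEnv) {
  const workspace = path.join(projectDir, 'workspace');
  // Copy the base environment so other projects are not affected.
  return Object.assign({}, baseEnv, { GOPATH: workspace });
}

// Hypothetical shared environment and project directories.
const shared = { GOPATH: '/home/me/go', PATH: '/usr/bin' };
const envX = envForProject('/home/me/projects/x', shared);
const envY = envForProject('/home/me/projects/y', shared);

console.log(envX.GOPATH);
console.log(envY.GOPATH);
console.log(shared.GOPATH);
```

The resulting object can be passed as the env option to child_process.spawn, so each build sees only its own workspace.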

Package managers & tools

Given the lack of prescriptive guidance and tools on how to deal with dependency management, just a few tools have popped up. In no particular order, here are a few I found:

Given my big 3 requirements above, I checked out the most popular of the repos above, and settled on godep. The alternatives all fell into at least one of these traps:

  • Forced rewriting the url, making it harder to manage dependency paths
  • Relied on a centralized service
  • Only works on a single platform
  • Doesn’t provide isolation in the $GOPATH


Godep matched most of my requirements for a package manager, and is the most popular solution in the community. It solves the repeatability and isolation issues above. The workflow:

Run go get to install a dependency (nothing new here):

go get

When you’re done installing dependencies, use the godep save command. This will copy all of the referenced code imported into the project from the current $GOPATH into the ./Godeps directory in your project. Make sure to check this into source control.

godep save

It will also walk the graph of dependencies and create a ./Godeps/Godeps.json file:

{
	"ImportPath": "",
	"GoVersion": "go1.4.2",
	"Deps": [
		{
			"ImportPath": "",
			"Rev": "e0e1b550d545d9be0446ce324babcb16f09270f5"
		},
		{
			"ImportPath": "",
			"Rev": "a1577bd3870218dc30725a7cf4655e9917e3751b"
		}
	]
}

When it’s time to build, use the godep tool instead of the standard go toolchain:

godep go build

The $GOPATH is automatically rewritten to use the local copy of dependencies, ensuring you have isolation for your project. This approach is great for a few reasons:

  1. Repeatable builds - When someone clones the repository and runs it, everything you need to build is present. There are no floating versions.
  2. No external repository needed for dependencies - with all dependencies checked into the local repository, there’s no need to worry about a centralized service. NPM will occasionally go down, as does NuGet.
  3. Isolated environment - With $GOPATH being rewritten at build time, you have complete isolation from one project to the next.
  4. No import rewriting - A few other tools operate by changing the import url from the origin repository to a rewritten local repository. This makes installing dependencies a little painful, and makes the import statement somewhat unsightly.
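The repeatability comes from the Rev field: every entry in Godeps.json pins an import path to an exact commit hash. Here is a minimal JavaScript sketch that reads a manifest and lists those pins. The import paths are placeholders (example.com/...), and pinnedRevisions is a hypothetical helper, not part of godep:

```javascript
// Hypothetical Godeps.json content; the import paths are placeholders.
const godepsJson = JSON.stringify({
  ImportPath: 'example.com/me/app',
  GoVersion: 'go1.4.2',
  Deps: [
    { ImportPath: 'example.com/a/lib', Rev: 'e0e1b550d545d9be0446ce324babcb16f09270f5' },
    { ImportPath: 'example.com/b/lib', Rev: 'a1577bd3870218dc30725a7cf4655e9917e3751b' },
  ],
});

// Map each dependency's import path to the revision it is pinned to.
function pinnedRevisions(jsonText) {
  const manifest = JSON.parse(jsonText);
  const pins = {};
  (manifest.Deps || []).forEach((dep) => {
    pins[dep.ImportPath] = dep.Rev;
  });
  return pins;
}

console.log(pinnedRevisions(godepsJson));
```

A check like this could run in CI to flag a dependency whose pinned revision changed unexpectedly between commits.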

There are a few negatives though as well:

  1. Not checking in your dependencies is convenient. It’s a pain to check in thousands of source files I won’t really edit. Without a centralized repository, this is not likely to be solved.
  2. You need to use a wrapped toolchain with the godep commands. There is still no real consensus.

For an example of a project that uses godep, check out coffee.

Wrapping up

While using godep is great - I’d really love to see consensus. It’s way too easy for newcomers to fall into the trap of floating dependencies, and it’s hard without much official guidance to come to any sort of consensus on the right approach. At this stage - it’s really up to each team to pick what they value in their dependency management story and choose one of the (many) options out there. Until proven otherwise, I’m sticking with godep.



Docker, Revel, and App Engine

08 May 2015 Posted Under: Google Cloud [0] comments

"Revel running on Google App Engine with Docker"

** note: I recently updated this post to make sure all of the commands still work. **

I’ve spent some time recently using go for my side web projects. The Go standard libraries are minimal by design - meaning it doesn’t come with a prescriptive web framework out of the box. The good news is that there are a ton of options:

Of course, you could decide to just not use a web framework at all. Comparing these is a topic of great debate - but that topic is for another post :) I decided to try out Revel first, as it was the closest to a full featured rails-esque framework at a glance. I’ll likely give all of these a shot at some point.

After building an app on Revel, I wanted to get a feel for deploying my app to see if it posed any unique challenges. I recently started a new gig working on Google Cloud, and decided to try out App Engine. The default runtime environment for Go in App Engine is sandboxed. This comes with some benefits, and a few challenges. You get a lot of stuff for free, but you also are restricted in terms of file system access, network access, and library usage. Given the restrictions, I decided to use the new App Engine Flexible service. App Engine Flex lets you deploy your application in a docker container, while still having access to the other App Engine features like datastore, logging, caching, etc. The advantage of using docker here is that I don’t need to write any App Engine specific code. I can write a standard Go/Revel app, and just deploy to docker.

Starting with Revel

There’s a pretty great getting started tutorial for Revel. After getting the libraries installed, scaffold a new app with the revel new command:

go get
go get
revel new myapp

Using Docker

Before touching App Engine Flexible, the first step is to get it working with docker. It took a little time and effort, but once docker is completely set up on your machine, writing the docker file is straightforward.

Here’s the docker file I’m using right now:

# Use the official go docker image built on debian.
FROM golang:1.4.2

# Grab the source code and add it to the workspace.
ADD . /go/src/

# Install revel and the revel CLI.
RUN go get
RUN go get

# Use the revel CLI to start up our application.
ENTRYPOINT revel run dev 8080

# Open up the port where the app is running.
EXPOSE 8080

There are a few things to call out with this Dockerfile:

  1. I chose to use the golang docker image as my base. You could replicate the steps needed to install and configure go with a base debian/ubuntu image, but I found this easier. I could have also used the pre-configured App Engine golang image, but I did not need the additional service account support.

  2. The ENTRYPOINT command tells Docker (and App Engine) which process to run when the container is started. I’m using the CLI included with revel.

  3. For the ENTRYPOINT and EXPOSE directives, make sure to use port 8080 - this is a hard coded port for App Engine.

To start using docker with your existing revel app, you need to install docker and copy the dockerfile into the root of your app. Update the dockerfile to change the path in the ADD and ENTRYPOINT instructions to use the local path to your revel app instead of mine.

After you have docker set up, build your image and try running the app:

# build and run the image
docker build -t revel-appengine .
docker run -it -p 8080:8080 revel-appengine

This will run docker, build the image locally, and then run it. Try hitting http://localhost:8080 in your browser. You should see the revel startup page:

"Running revel in docker"

Now we’re running revel inside of docker.

App Engine Flexible

The original version of App Engine had a bit of a funny way of managing application runtimes. There is a limited set of stacks available, and you’re left using a locked down version of an approved runtime. Flex gets rid of this restriction by letting you run pretty much anything inside of a container. You just need to define a little bit of extra config in an app.yaml file that tells App Engine how to treat your container:

runtime: custom
vm: true
api_version: go1

This config lets me use App Engine, with a custom docker image as my runtime, running on a managed virtual machine. You can copy my app.yaml into your app directory, alongside the Dockerfile. Next, make sure you’ve signed up for a Google Cloud account, and download the Google Cloud SDK. After getting all of that set up, you’ll need to create a new project in the developer console.

# Install the Google Cloud SDK
curl | bash

# Log into your account
gcloud init

That covers the initial setup. After you have a project created, you can try deploying the app. This is essentially going to start up your app on Google Cloud, using the Dockerfile we defined earlier:

# Deploy the application
gcloud app deploy

After deploying, you can visit your site here:

Revel running on App Engine

Wrapping up

So that’s it. I decided to use revel for this one, but the whole idea behind using docker for App Engine is that you can bring pretty much any stack. If you have any questions, feel free to check out the source, or find me @JustinBeckwith.


Realtime services with io.js, redis and Azure

15 February 2015 Posted Under: azure [0] comments

"View the demo"

A few years ago, I put together a fun little app that used node.js, service bus, cloud services, and the Instagram realtime API to build a realtime visualization of images posted to Instagram. In 2 years time, a lot has changed on the Azure platform. I decided to go back into that code, and retool it to take advantage of some new technology and platform features. And for fun.

Let’s take a look through the updates!

Resource groups

I’m using resource groups to organize the various services. Resource groups provide a nice way to visualize and manage the services that make up an app. RBAC and aggregated monitoring are two of the biggest features that make this useful.

"Using a resource group makes it easier to organize services"

Websites & Websockets

In the original version of this app, I chose to use cloud services instead of Azure web sites. One of the biggest reasons for this choice was websocket support. At the time, Azure websites did not support websockets. Well… now it does. There are a lot of reasons to choose websites over cloud services:

  • Fast continuous deployment via Github
  • Low concept count, no special tooling needed
  • Now supports deployment slots, ssl, enterprise features

When you create your site, make sure to turn on websockets:

"setting up websockets"


io.js is a fork of node.js that provides a faster release cycle and es6 support. It’s pretty easy to get it running on Azure, thanks to iojs-azure. Just to prove I’m running io.js instead of node.js, I added this little bit in my server.js:

`Started wazstagram running on ${process.title} ${process.version}`);

The results:

"Console says it's io.js"


In the previous version of this app, I used service bus for publishing messages from the back end process to the scaled out front end nodes. This worked great, but I’m more comfortable with redis. There are a lot of options for redis on Azure, but we recently rolled out a first class redis cache service, so I decided to give that a try. I’m really looking to use two features from redis:

  • Pub / Sub - Messages received by Instagram are published to the scaled out front end
  • Caching - I keep a cache of 100 messages around to auto-fill the page on the initial visit

You can create a new redis cache from the Gallery:

"Create a new redis cache"

After creating the cache, you have a good ol standard redis database. Nothing special/fancy/funky. You can connect to it using the standard redis-cli from the command line:

"I can connect using standard redis tools"

Note the password I’m using is actually one of the management keys provided in the portal. I also chose to disable SSL, as nothing I’m storing is sensitive data:

"Set up non-SSL connections"

I used node-redis to talk to the database, both for pub/sub and cache. First, create a new redis client:

function createRedisClient() {
    return redis.createClient(
        {
            auth_pass: nconf.get('redisKey'), 
            return_buffers: true
        }
    ).on("error", function (err) {
        logger.error("ERR:REDIS: " + err);
    });
}

// create redis clients for the publisher and the subscriber
var redisSubClient = createRedisClient();
var redisPubClient = createRedisClient();

PROTIP: Use nconf to store secrets in json locally, and read from app settings in Azure.

When the Instagram API sends a new image, it’s published to a channel, and centrally cached:

logger.verbose('new pic published from: ' +;
redisPubClient.publish('pics', JSON.stringify(message));

// cache results to ensure users get an initial blast of (n) images per city
redisPubClient.lpush(, message.pic);
redisPubClient.ltrim(, 0, 100);
redisPubClient.lpush(universe, message.pic);
redisPubClient.ltrim(universe, 0, 100);
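One detail worth knowing about the LPUSH + LTRIM pattern: the stop index in LTRIM is inclusive, so trimming to 0, 100 keeps 101 entries; use 0, 99 if you want exactly 100. The sketch below models the two commands with an in-memory JavaScript array to show the behavior. CappedList is a made-up class for illustration only and does not talk to Redis:

```javascript
// In-memory model of a Redis list, for illustration only.
class CappedList {
  constructor() {
    this.items = [];
  }

  // LPUSH: insert at the head of the list and return the new length.
  lpush(value) {
    this.items.unshift(value);
    return this.items.length;
  }

  // LTRIM start stop: keep indexes start..stop, both inclusive.
  // Only non-negative indexes are modeled here.
  ltrim(start, stop) {
    this.items = this.items.slice(start, stop + 1);
  }
}

const cache = new CappedList();
for (let i = 1; i <= 150; i++) {
  cache.lpush('pic-' + i);
  cache.ltrim(0, 100);
}

console.log(cache.items.length); // 101
console.log(cache.items[0]);     // pic-150
```

Because new items go in at the head and the tail is trimmed, the list always holds the most recent pictures, newest first.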

The centralized cache is great, since I don’t need to use up memory in each io.js process used in my site (keep scale out in mind). Each client also connects to the pub/sub channel, ensuring every instance gets new messages:

// listen to new images from redis pub/sub
redisSubClient.on('message', function(channel, message) {
    logger.verbose('channel: ' + channel + " ; message: " + message);
    var m = JSON.parse(message.toString());
    ('newPic', m.pic);
    (universe).emit('newPic', m.pic);
});
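To see why pub/sub suits a scaled out front end, here is a small JavaScript model of the fan-out. It uses Node's built-in EventEmitter as a stand-in for the Redis channel, so it runs without a Redis server. The instance names and the createInstance helper are hypothetical:

```javascript
const EventEmitter = require('events');

// Stand-in for the Redis channel: every subscriber sees every published message.
const channel = new EventEmitter();

// Each "instance" models one front end process with its own list of delivered pics.
function createInstance(name) {
  const delivered = [];
  channel.on('pics', (raw) => {
    // Messages travel as JSON strings, as in the publish call above.
    const m = JSON.parse(raw);
    delivered.push(m.pic);
  });
  return { name, delivered };
}

const a = createInstance('web-1');
const b = createInstance('web-2');

// Publish once; both instances receive the message.
channel.emit('pics', JSON.stringify({ pic: 'http://example.com/1.jpg' }));

console.log(a.delivered, b.delivered);
```

The publisher never needs to know how many front end instances exist, which is what makes adding or removing instances painless.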

After setting up the service, I was using the redis-cli to do a lot of debugging. There’s also some great monitoring/metrics/alerts available in the portal:

"monitoring and metrics"

Wrapping up

If you have any questions, feel free to check out the source, or find me @JustinBeckwith.