Chris Whong (@chris_whong) had an interesting data journey which resulted in getting legal access (via FOIL) for the data file of the New York City’s Taxi & Limo Commission (TLC)’s full 2013 taxi tripsheet data. The ensuing collaborations continued the data as bit-torrent files, then as flat files, and more recently as a Google BigQuery project.
From here, I’ll show how to connect Tableau 8.2 to this large data set stored in Google BigQuery. After setting up the required Google account (aka gmail), this data should be freely accessible. Test access to it first through the Google BigQuery Browser tool. [ https://bigquery.cloud.google.com/table/833682135931:nyctaxi.trip_data ]. If it’s accessible there then Tableau access will work since it uses the same authentication. Once confirmed, here are the Tableau connection steps:
—-
Step 1. In Tableau (I’m using 8.2 Mac), begin by selecting the Google BigQuery connection type. The Google Account prompt will appear. Enter credentials and accept the dialog for app access.
—
Step 2. As a Google BigQuery backgrounder, know that Table(s) belong to a Dataset which belong to a Project. The fully qualified name of a BigQuery table has a construct of [project:dataset.table]. The full path to the NYC FOIL Taxi Data is [833682135931:nyctaxi.trip_data], broken down as:
Project ID : 833682135931
Dataset: nyctaxi
Table: (two tables) trip_data, trip_fare
Click on ‘Select Project’, enter the Project ID of 833682135931 then click on the + symbol.
—
Step 3. Next, if the above project handshake step succeeds, the Dataset member(s) will populate. Click on ‘Select Dataset’ and choose nyctaxi.
—
Step 4. Finally, the Table member(s) will be populated. From here, use the appropriate Tableau query design (single, multi, or custom sql) and continue to the worksheet.
That’s it.
—
Here are the row counts for each table. Google BigQuery is designed for low latency, so ad-hoc queries against these tables are very fast. This count(*) returned in seconds. Just remember that it’s a columnar store and to design the queries according to BigQuery best practices.
—
Metadata for each table are shown in Tableau below. Note that the values for the taxi medallion & license have been converted to an MD5 hash and @vijayp discovered that unknown taxi drivers were given a value of 0, so prior to this discovery there appeared to be a super driver that had higher aggregates than others. It turns out that this driver is the “zero” unknown bucket which skewed the statistics (high aggregations for Driver 0). Check out @vijayp ‘s decoding and the ensuing firestorm that the MD5 hashes can be converted to the actual driver. (Truly data science labor with his work.) MD5 can be easily googled now for its reverse value–it requires salting and other techniques to harden. There may be some fallout from this hashing oversight. There’s also the question of potential lat/long triangulation to identifiable persons. I’m sure there will be some heated discussions around this discovery.
Lastly, here’s a chart of the number of trips per day–approximately half a million. I referenced it against the official Taxi & Limo Commission publication to validate these incredible results. The reddit community also has posted some interesting queries that could be turned into Tableau visualizations easily.
I’m anticipating what the rest of the Tableau community can do with this data, and I’m sure other data viz programmers such as D3 but this context is just Tableau. This data is real, it’s rich, and there’s plenty of insight in it. Uber and Lyft could very well be data mining it now for busines insight.
Thanks to NYC, Chris Whong, and others’ initiative in democratizing this big data set, and to Google for turning it into a public BigQuery dataset. (Check out their GDELT dataset.) What’s really exciting is that Google BigQuery can respond very quickly to questions that come to mind during exploration of this large-scale data. Having ad-hoc exploratory ability on this size of data is becoming the new normal.