Dynamic Queries to your App Engine Data Store
I'm using AppEngine for a project at work. Exactly what it does it not important, but at some point it stores an Event entry into the AppEngine datastore for each thing that we do. I put this is so we could easily show the lsat 5 events on a webpage, and also because it felt like a good idea at the time.
This tool has been running for about 6 months now, we've got a bunch of events stored. I need to look at this data to see how long the events are taking, so I can have an idea of what a good event timeout value would be. A simple approach would be to write some python code in my application to render a custom admin page that could display this information for me. This approach is fraught with issues:
- More code to write, which means more passes through the write, deploy, test, debug cycle. (The debug datastore doesn't have enough interesting information to do this locally.)
- I've got over 7000 entries in the datastore. To do this serially is going to be kinda slow. I'm not positive I can do it before the AppEngine request deadline kicks in. Or if I can beat the deadline now, I won't be able to in the future.
So I started looking around and came up with another approach. AppEngine supports a datastore backup feature, and BigQuery can use that data as an input. Bingo. I can now treat my datastore as if it were SQL and write dynamic queries to explore the data that way.
The Nitty Gritty
So there are a few things to setup. First, you need a Google Cloud Storage bucket to store your stuff in, and you need to give your application access to write to it. If you are lucky enough to be using the default bucket associated with your app, then you have nothing to do but look up your bucket name (it's in the cloud console for your app engine app).
Setup Cloud Storage
Start by going into your AppEngine settings page and go to "Application Settings". Under there we want to find and note your "Service Account Name". This is the Google user your app runs as. Remember it, we'll need it for the next step.
Go into the cloud console for your specific app. If you didn't create one explicity, Google created one for you. The application id for your app should appear somewhere on this page. Go into it and create a new bucket, and then give the Service Account from above write acccess to this bucket. (Again, if you're using the default bucket for your app, this already setup for you).
Do the backup
Find the "Datastore Admin" tab in your AppEngine settings page, select the entity type to backup, and start your backup. Make sure to point it to your Cloud Storage bucket when asked.
Go back to the cloud console for your specific app and scroll down to find BigQuery. You'll now need to create a datastore (call it whatever you want, I called mine test) and then create a table inside that datastore (I also called my table test), and point it to the file inside your bigstore bucket. Mine looked something like:
Now you're ready to start examining your data.
BigQuery's query language is very similar (but not exactly) SQL.
A simple example, how many events did I have?
SELECT COUNT(*) FROM test.test
Told me I had over 7000 entries. But I needed something more interesting. My events can be grouped around a key, and I'm looking for information on how long they took. The query I used was this one:
SELECT key, AVG((END-start)/1000/1000/60) AS avg_duration, MAX((END-start)/1000/1000/60) AS max_duration, MIN((END-start)/1000/1000/60) AS min_duration, COUNT(*) AS cnt FROM test.test WHERE status == "complete" GROUP BY key ORDER BY avg_duration DESC
Which basically says: For each key, show me the min, avg, and max duration of "complete" events, ordered by average duration. For ease of the human eye, I did all my durations in minutes, because that worked for my data.
7000 entries? I've forgotten how to count that low. -- BigQuery
So yes, BigQuery is overkill for this dataset. I could have just whipped up a python script, or dumped it into MySQL and done it that way. But the advantage of this way is that all the bits are already setup. I just needed to hook up the outputs to the inputs and I was ready. I didn't even need to download anything to my local machine, it all took place "in the cloud".
But I view this setup as just the start of something. Yes, I solved my immediate question, but what else could I do? How about setting up daily backups, and then scripting up a daily import of those into BigQuery. Then whenever I need to do a quick query on something, the data is already there.
Or going even further, how about some data visualizations based on BigQuery? If I have a snapshot for every day, I could graph some things like events per day, average durations per day, errors per day, etc.
So yes, it's still a bit of work to get all this glued together such that we can work on it. In my mind, there is still lots of room for improvement, but at least in this case Google delivers tools that work with other Google products in a meaningful fashion. That's not always the case.