
Financial Charts with App Engine Tutorial Pt 2: Bulkloading Data

[Image: Rutherford Financial chart]

Here I'll walk through pre-populating the Google App Engine datastore with historical data in CSV format, using the Python bulkloader tool to handle the data transformation and import.

In the last tutorial we saw how to create a financial chart with Google App Engine and update it periodically via some CSS-based web scraping. The bulkloader tool, which ships as part of the appcfg.py script, will enable us to visualise long-term trends *right now* without needing to wait for our datastore to grow one day at a time. Again the 3 Month EONIA Index Swap from the European Banking Federation's Euribor site will be used as sample historical data.

So we need to grab the EONIA historic data. It's already available in CSV, but additionally I needed to:
a. remove unnecessary maturities (i.e. everything but the 3mth fixing).
b. transpose the data from columns to rows. This presented an issue, as most spreadsheets of the time only supported up to 256 columns, and the historic data download had a column for each day's fixing, which ran over that limit in certain cases. I was surprised there wasn't a readily available web-based solution, so I rolled my own rough & ready version – presenting the totally unoriginal-sounding CSV Transpose Tool. With this I arranged all the data in simple timestamp,instrument,bips format via a Google Docs spreadsheet.
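If you'd rather script the transpose than fight a spreadsheet, the step is only a few lines of Python. Here's a minimal sketch; the file names and the exact layout of the Euribor download (dates across the header row, one row per maturity, the 3 month row labelled '3 MONTHS') are assumptions:

import csv

# Sketch: turn a wide file (dates across the header, one row per maturity)
# into timestamp,instrument,bips rows suitable for the bulkloader.
with open('eonia_download.csv', 'rb') as src, open('fixings.csv', 'wb') as dst:
    rows = list(csv.reader(src))
    dates = rows[0][1:]  # header row minus its leading label cell
    writer = csv.writer(dst)
    writer.writerow(['timestamp', 'instrument', 'bips'])
    for row in rows[1:]:
        instrument, values = row[0], row[1:]
        if instrument != '3 MONTHS':  # drop the unwanted maturities
            continue
        for date, value in zip(dates, values):
            writer.writerow([date, instrument, value])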

Now, the bulkloader process:
Note: there are warnings in the docs about using this tool with the High Replication datastore (i.e. Python 2.7 apps), but I did so without any issue, mainly by adding the --url argument to command-line calls to appcfg.

Steps:

a. enable remote_api

builtins:
- remote_api: on

in app.yaml

b. configure the bulkloader

If you already have even a single sample entity in your datastore matching the kind you're aiming to upload, you can create the config's skeletal structure automatically by running

appcfg.py create_bulkloader_config --filename=path/to/target/bulkloader.yaml --url=http://your-app.appspot.com/_ah/remote_api
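If your datastore is completely empty, one quick way to get that sample entity in place is via the SDK's remote_api_shell.py. A rough sketch; the data module path and the property values here are assumptions based on the model from part 1:

# Run inside: remote_api_shell.py your-app
# Insert one throwaway Fixing so create_bulkloader_config has a
# schema to introspect; remove it with db.delete(key) afterwards.
from datetime import datetime
from data import Fixing  # assumed module path from part 1

sample = Fixing(timestamp=datetime(2012, 7, 20),
                instrument='EONIA 3M',  # placeholder value
                bips=0.16)              # placeholder value
key = sample.put()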

BEWARE: autogenerated schemas aren't an exact science at the best of times, and App Engine's bulkloader is no different. The create_bulkloader_config command returns the following output for a StringProperty:

  - property: instrument
    external_name: instrument
    import_transform: db.Text

This is *not* what you want if you need to run queries on the instrument property (which we do): App Engine's datastore does not support indexing of TextProperty. After looking through the source code of transform.py, what we need to do is replace the db.Text transform with:

      import_transform: transform.none_if_empty(str)

This ensures the uploaded data is treated as a StringProperty and is thus indexed according to your index.yaml.
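For reference, each import_transform has to line up with the property type on the model itself. A minimal sketch of the Fixing model this config targets, reconstructed from the properties above (not necessarily identical to the model from part 1):

from google.appengine.ext import db

class Fixing(db.Model):
    timestamp = db.DateTimeProperty()  # parsed by transform.import_date_time
    instrument = db.StringProperty()   # indexed, unlike db.TextProperty
    bips = db.FloatProperty()          # parsed by transform.none_if_empty(float)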

If you haven't got any data yet, you'll need to write the config by hand. My bulkloader.yaml was pretty simple and looked like this:

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.ext.db
- import: google.appengine.api.datastore
- import: google.appengine.api.users
 
transformers:
 
- kind: Fixing
  connector: csv
  connector_options:
   encoding: utf-8
   columns: from_header
  property_map:
  - property: timestamp
    external_name: timestamp
    import_transform: transform.import_date_time('%d/%m/%Y')
 
  - property: instrument
    external_name: instrument
    import_transform: transform.none_if_empty(str)
 
  - property: bips
    external_name: bips
    import_transform: transform.none_if_empty(float)

c. upload the data. The bulkloader transforms the CSV into a Python list according to the above format, then performs a second transformation into datastore entities, which it uploads to the App Engine cloud. To kick off the process, use the upload_data appcfg command:

appcfg.py upload_data --config_file=path/to/target/bulkloader.yaml --filename=path/to/data.csv --kind=Fixing --url=http://your-app.appspot.com/_ah/remote_api

The loading process works in batches of 10, so if any errors are reported during upload, that entire batch of 10 will not be uploaded (but earlier batches will, so keep an eye on the number of entities transferred should you hit an error, so that you don't upload duplicate data later on). Note that the bulkloader also records its progress in a local .sql3 database (visible in the log below), which can be handed back via --db_filename on a retry to resume where it left off.
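If an upload does die midway, a quick way to see what actually landed before retrying is to count the entities from the remote_api shell (assuming the Fixing model sketched earlier):

# Inside remote_api_shell.py: count what's already in the datastore so a
# retry can be reconciled against the bulkloader's reported totals.
from data import Fixing
print Fixing.all().count(10000)  # pass a limit above your expected total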

Locally you should see the following output:

[INFO    ] Logging to bulkloader-log-20120721.002147
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
[INFO    ] Opening database: bulkloader-progress-20120721.002147.sql3
[INFO    ] Connecting to euriborois.appspot.com/_ah/remote_api
[INFO    ] Starting import; maximum 10 entities per post
...............
[INFO    ] 142 entities total, 0 previously transferred
[INFO    ] 142 entities (79377 bytes) transferred in 5.5 seconds
[INFO    ] All entities successfully transferred

And that's it! Your historical chart data is loaded into your app's datastore, ready for tinkering. To demonstrate this in our chart we can include an option to display data over the previous 12 months, taking one fixing from each week and displaying it. Our main.py now becomes:

#!/usr/bin/env python
 
import webapp2
import os
from google.appengine.ext.webapp import template
import logging
 
from datetime import datetime, timedelta
from google.appengine.ext import db
from data import Fixing
 
class MainHandler(webapp2.RequestHandler):
	def get(self):
		timespan = self.request.path_info[1:]
		logging.info('path_info:%s' % timespan)
 
		if timespan == '1yr':
			multiplier = 7 # weekly data
			limit = 53 # because economists always want data from 1 year ago today...
			fixings=[]
			for i in range(limit,0,-1):
				since = datetime.utcnow().date() - timedelta(days=multiplier*(i-1))
				q = db.GqlQuery("SELECT * FROM Fixing WHERE timestamp <= :1 ORDER BY timestamp DESC",since)
				fixing = q.get()
				if fixing is not None: # skip weeks with no fixing available
					fixing.timestamp_prettyprint = fixing.timestamp.strftime('%d/%m/%Y')
					fixings.append(fixing)
 
		else:
			number = 5
 
			q = db.GqlQuery("SELECT * FROM Fixing ORDER BY timestamp DESC")
			fixings = q.fetch(number)
			for fixing in fixings:
				fixing.timestamp_prettyprint = fixing.timestamp.strftime('%d/%m/%Y')
 
		results = {'fixings':fixings}
 
		path = os.path.join(os.path.dirname(__file__), 'html/index.html')
		self.response.out.write(template.render(path,results))
 
logging.getLogger().setLevel(logging.DEBUG)
app = webapp2.WSGIApplication([('/.*', MainHandler)],
                              debug=True)

Here we check if the URL path is /1yr and, if so, work back from today's date collecting fixing data for the past year, one value every 7 days. Then in the corresponding index.html we just insert a simple link to reach the 1-year chart:

<a href="/1yr">1 year</a>
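For completeness, the loop in index.html that consumes the fixings list might look something like this (a sketch only; the real chart markup from part 1 will be more involved):

{% for fixing in fixings %}
  {{ fixing.timestamp_prettyprint }}: {{ fixing.bips }}
{% endfor %}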

And that’s our historical data displayed. However, 53 GQL queries hitting the datastore each time we display year-long data is an expensive proposition; what we need is some form of caching. In part 3 of this financial charting series we’ll have a look at using memcache to load this data from memory instead, speeding up our application and reducing the chances of it hitting App Engine quota limits.
