2009

jsonpickle 0.3.1 Released

jsonpickle, the powerful library for serializing complex object graphs in Python to JSON, reached a major milestone this week with the official release of 0.3.1, available on PyPI with documentation and full release notes at http://jsonpickle.github.com/. We have migrated from the Google Code site to the new GitHub site at http://github.com/jsonpickle/jsonpickle.

This release represents nearly a year of development from multiple contributors. Major highlights include support for a wider variety of objects, support for the pickle protocol's __getstate__ and __setstate__ methods, and the ability to use and add any Python JSON backend (e.g. demjson, simplejson, django.utils.simplejson, etc.). Please be aware that backwards compatibility with the 0.2.0 JSON format is not guaranteed.
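
For anyone who has not met the pickle protocol's state methods, here is a minimal sketch of how a serializer can make use of them. It relies only on the standard library json module and calls the hooks by hand; the Connection class is hypothetical, and this is not how jsonpickle itself is implemented.

```python
import json


class Connection(object):
    """Hypothetical class with one attribute that should not be serialized."""

    def __init__(self, host):
        self.host = host
        self.socket = object()  # stand-in for a live, unserializable resource

    def __getstate__(self):
        # Hand the serializer only the data worth saving.
        return {"host": self.host}

    def __setstate__(self, state):
        # Rebuild the instance from the saved data.
        self.host = state["host"]
        self.socket = None


original = Connection("example.org")
encoded = json.dumps(original.__getstate__())

# Restore without calling __init__, the way the pickle protocol does.
restored = Connection.__new__(Connection)
restored.__setstate__(json.loads(encoded))
```

The point of the hooks is that the class, not the serializer, decides what its saved state looks like.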

In the past year jsonpickle has done well, with nearly 2500 downloads of the 0.2.0 release from Google Code, not including the bundled distribution of jsonpickle in tools such as FireLogger and git-cola. jsonpickle is currently available in the Gentoo repository and working its way through the Fedora and Debian repository processes.

I thank our contributors, including David Aguilar, Dan Buch, and Ian Schenck for their massive improvements to jsonpickle. I also thank everyone who has submitted bug reports and shared thoughts on our mailing list. Finally, I thank the distribution managers who have worked to package jsonpickle for their various distributions.

Please try the new version, submit bug reports, and even fork the project on GitHub.

HTML to reStructuredText in Python using Pandoc

During the conversion of my blog from WordPress to a custom Django-based system, I wanted to move from HTML markup to reStructuredText (partly to make it easier to publish Sphinx documentation to my blog).

While it is dead simple to convert reStructuredText to HTML, going the other way is more difficult. Luckily, Pandoc, the Swiss Army knife for converting between markup formats, does a nice job of converting HTML to reStructuredText.

I wrote a custom Django Command to parse a WordPress XML export file and store the blog entries. The code that converts HTML to reStructuredText is very simple: it makes a subprocess call to the Pandoc command and retrieves the command's output. Make sure you have Pandoc installed (in Ubuntu, sudo apt-get install pandoc will work).

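import subprocess

def html2rst(html):
    # Pipe the HTML through pandoc and return the reStructuredText it prints.
    # (On Python 3, html must be bytes, and bytes are returned.)
    p = subprocess.Popen(['pandoc', '--from=html', '--to=rst'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    return p.communicate(html)[0]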
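
If you are on Python 3, a slightly more defensive variant of the same idea might look like the sketch below. The names pandoc_command and convert_markup are my own for illustration; the sketch assumes Python 3.5 or later and UTF-8 text, and it raises an error if Pandoc is missing or exits with a failure.

```python
import shutil
import subprocess


def pandoc_command(source_format, target_format):
    """Build the argument list for a pandoc conversion."""
    return ["pandoc", "--from=" + source_format, "--to=" + target_format]


def convert_markup(text, source_format="html", target_format="rst"):
    """Convert text between markup formats by piping it through pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    result = subprocess.run(
        pandoc_command(source_format, target_format),
        input=text.encode("utf-8"),   # pandoc reads the document from stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,                   # raise CalledProcessError on failure
    )
    return result.stdout.decode("utf-8")
```

Calling convert_markup("<h1>Title</h1>") should then return the heading as reStructuredText, provided Pandoc is installed.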

reStructuredText Widget in Django Admin

While Django has great support for rendering standard markup languages, editing documents written in a markup language can be difficult in the Django Admin. Several others (1, 2, 3) show how easy it is to edit Markdown, Textile, and even HTML in the Django Admin using a WYSIWYG editor, such as markItUp! or TinyMCE.

Unfortunately, reStructuredText is not well supported by most WYSIWYG editors. Nevertheless, we can improve the experience of editing reStructuredText in the Django Admin. One of the biggest improvements is switching the Textarea to a monospace font, which avoids issues caused by heading underlines being too short. We can also customize the size of the Textarea.

In our app/admin.py file, we can add a ModelForm which overrides the field that has the reStructuredText content (in this case, description). We then use this ModelForm as the form in our subclass of ModelAdmin. Finally, we indicate that the ModelAdmin subclass is associated with the specific model that contains the reStructuredText content.

For the reStructuredText content field, description, we change the size to a width of 80 characters and the font to a monospace family. Additionally, we add a quick link to the reStructuredText Quick Reference.

from django import forms
from django.contrib import admin
from app.models import Entry

class EntryAdminForm(forms.ModelForm):
    description = forms.CharField(widget=forms.Textarea(attrs={'rows':30,
                                                                'cols':80,
                                                                'style':'font-family:monospace'}),
                                  help_text='<a href="http://docutils.sourceforge.net/docs/user/rst/quickref.html">reStructuredText Quick Reference</a>')
    class Meta:
        model = Entry

class EntryAdmin(admin.ModelAdmin):
    form = EntryAdminForm

admin.site.register(Entry, EntryAdmin)

Mocking Groovy's HTTPBuilder

I ran into a head-scratcher today when trying to unit test some Groovy code. The code under test interacts with an HTTP web service using Groovy's great HTTPBuilder, which wraps Apache's HttpClient. Obviously, I wanted to mock the interaction with the HTTP server to limit the scope of my tests.

Groovy makes it easy to create simple mocks using maps. To mock a class with a map, create a map whose keys are the names of the methods to be mocked and whose values are closures holding the mock implementations. For example, if we wish to mock out the HTTPBuilder, which has a "post" method, we can accomplish it using the map defined by mapMock.

class HTTPBuilder {
    def post(...) { /* real implementation */ }
}


def mapMock = ["post": { /* mock implementation */ }]

This map-mock approach was working great for mocking out the post, put, and delete methods in HTTPBuilder, but the get method was giving me quite a bit of trouble. The closure in my get method mock was never executed.

After taking a step back, I realized that the map's get method (the one used to return the value at a specific key) was getting called instead of the key within the map called get.

The simple solution was to switch to use an Expando mock instead of a map mock.

def expandoMock = new Expando()
expandoMock.get = { /* mock implementation */ }

I know I'm late to the train, but Groovy is a breath of fresh air compared to Java.

Master's Thesis & Open-Source Tool

On July 15th, I successfully defended my Master's Thesis in Biomedical Informatics at Vanderbilt University. This defense was the culmination of 2 years of work. The thesis focuses on extracting organizational structure and relationships from the audit logs of clinician information systems. This work has potential applications in improving the delivery of care and the security of patients' private medical data.

As part of this work, I developed an open-source tool for analyzing audit logs. Licensed under an Apache 2.0 License, the Healthcare Organizational Relational Network Extraction Toolkit (HORNET) is a Python framework for plugins that analyze healthcare audit logs. The tool is fully functional, but is not yet polished enough for use by healthcare administrators.

The project is hosted on Google Code (http://code.google.com/p/hornet/). You can visit the project site as well as view the latest documentation.

I am writing a journal publication that describes this tool, its methods, and results from Vanderbilt University Medical Center. I will link to that publication when it is available, but until that time, I can release my thesis abstract.

A Framework for the Automatic Discovery of Policy from Healthcare Access Logs

by John M. Paulett

Healthcare organizations are often stymied in their efforts to prevent insider attacks that violate patient privacy. Numerous high-profile privacy breaches involving celebrities have brought this deficiency to the public's attention. In response, recent legislation aims to improve this situation by means of regulations and sanctions. While the public and government may demand more privacy safeguards, the current state-of-the-art tools in healthcare security, such as access control and auditing, will still be limited in their ability to solve the issue technically. These technologies are theoretically sound and tested in other industries, yet are suboptimal because no feasible methods exist for generating the policies these systems must act upon, due to the inherent complexities of modern healthcare organizations.

To address this shortcoming, we present a novel open-source framework, which mines low-level statistics of how users interact within the organization from the access logs of the organization's information systems. Our framework is scalable and capable of handling real world data integrity issues. We demonstrate the use of our tool by modeling the Vanderbilt University Medical Center. Additionally, we compare our framework's model to traditional experts who would attempt to manually generate a similar model.

Programming Clojure Review

http://cdn.johnpaulett.com/upload/programming-clojure.jpg

When Stuart Halloway's Programming Clojure came out in May, I picked up a copy and have been reading through it and practicing with the Project Euler problems.

First off, it is a great book! Second off, it introduces a seriously interesting programming language.

Clojure is a Lisp dialect designed to run on the Java Virtual Machine (JVM). This combination is what makes Clojure very powerful: you get the power of a mature virtual machine with access to any existing Java libraries, combined with the dynamic, functional style of Lisp. Imagine being able to continue to use the code and libraries you and others have spent years developing from a new programming environment.

Layering a language on top of the JVM is not a new concept. Jython, JRuby, Groovy, and others did it years ago. But to some extent, these languages serve as a mere face-lift to the verbose syntax of Java. These languages were ported or created for the JVM to harness the power of existing Java libraries and platforms, while providing a prettier language.

While Clojure does offer a new syntax, it has a much more fundamental contribution to the Java world: strong concurrency primitives. (It should be noted that Scala offers this benefit as well.)

Clojure takes a hard-line approach to the arch-enemy of concurrency: shared state. Clojure allows programmers to easily write concurrent programs that can execute on multiple processors or cores. This ability comes from several facets of Clojure:

  • Immutable data
  • Preferring "pure" functions by making the programmer explicitly state where shared state is accessed
  • Multiple models for transactions and locks

Almost anyone who has experience writing threaded Java code knows how difficult it is to ensure that multiple threads can execute in parallel without causing awful race conditions and subtle bugs. Luckily, Clojure addresses these shortcomings by using its own concurrency models.

Stuart's book begins by discussing the syntax of Clojure and demonstrates Clojure's ability to interact with regular Java classes. The book moves into the list-based world of Lisp with functional programming techniques, including lazy evaluation. The book then moves into advanced topics, including concurrency, macros, and Clojure's form of polymorphism, multimethods. The book concludes with a short chapter on testing Clojure code, working with SQL databases, and doing web development.

Throughout the book, we work on building an Ant replacement in Clojure. The most interesting take-away from this ongoing example is the use of actual Clojure code for the build DSL, removing the need for Ant's build.xml. The code-as-data concept is very elegant, resulting in a DSL that is very clear yet lacks XML's verbosity.

I also found the Snake game to be an excellent example of an application sharing state in a safe way using the Clojure transaction primitives.

The book gave me a great appreciation of the Lisp family of languages. The only wart that bothered me about Clojure was that it seems that at times the programmer must be too aware of the specific implementation of Clojure on the JVM. For instance, Clojure's recursion is at times hampered by the lack of Tail Call Optimization on the JVM. Because of this lack, the programmer must determine which work-around is most appropriate for his problem. Regardless, Clojure feels very clean and precise.

The book also clearly provides best practices and examples of idiomatic Clojure.

I look forward to trying Clojure out in my projects. As I mentioned, I have been working through the Project Euler problems (my answers are definitely not ideal).

I would highly recommend the book to anyone who works in Java. I also believe the book is an excellent introduction to functional programming--I have read the Real World Haskell and Programming Erlang books with some difficulty, but Programming Clojure just clicked in my mind.

Install Eclipse Galileo (3.5) on Ubuntu Jaunty (9.04)

Eclipse 3.5, codenamed "Galileo," was released this week! While there is a team actively working on building an Ubuntu deb package, they do not yet have a package for Eclipse 3.5. I put together some super simple instructions for installing Eclipse 3.5.

I am going to perform a per-user installation into my home directory. If multiple people use Eclipse on the same computer, you may want to modify these instructions to install into /opt/. I am going to put the installation in ~/bin/packages/eclipse3.5. First, create the installation directory (change according to your own tastes):

mkdir -p ~/bin/packages
cd ~/bin/packages

Now download the appropriate tar.gz file from eclipse. I am going to grab them from Amazon's Cloudfront.

For 64-bit Ubuntu:

wget http://d2u376ub0heus3.cloudfront.net/galileo/eclipse-java-galileo-linux-gtk-x86_64.tar.gz

For standard 32-bit Ubuntu:

wget http://d2u376ub0heus3.cloudfront.net/galileo/eclipse-java-galileo-linux-gtk.tar.gz

Now unzip, and rename the directory (I want multiple versions of Eclipse):

tar xzvf eclipse-java-galileo-linux-gtk*.tar.gz
mv eclipse eclipse3.5

Great, almost there. I am going to create a file so that I can launch eclipse from the command line. Create a new file ~/bin/eclipse, and in that file, put:

#!/bin/bash
~/bin/packages/eclipse3.5/eclipse -vmargs -Xms128M -Xmx512M -XX:PermSize=128M -XX:MaxPermSize=512M &> /dev/null &

(You can later change these values if you get out of memory issues from Eclipse.) Lastly, make the file executable:

chmod u+x ~/bin/eclipse

Install plugins

Yet again, Eclipse has changed its update manager (each time it gets better). I am going to add a few plugins for Python, Clojure, and Mercurial. Go to Help > Install New Software, click the "Available Software Sites" link, and add your update sites. For me they include:

Add Icon to the Panel

I like having an icon on my panel to quickly launch Eclipse, like so:

http://cdn.johnpaulett.com/upload/eclipse-toolbar.png

To do so, right click on your panel in a place with no other panel tool. Select "Add to Panel" then create a "Custom Application Launcher". You can enter /home/<USERNAME>/bin/eclipse (put in your username) as the command to run, and if you click the icon on the left, you can use the Eclipse icon in ~/bin/packages/eclipse3.5/.

http://cdn.johnpaulett.com/upload/eclipse-add-icon.png

Leave a comment if you run into issues or have a better method! You can also see my previous instructions for Eclipse 3.4, which attracted lots of great comments.

Vanderbilt Projects in the News

Some of the projects that I have been involved in at Vanderbilt University's Informatics Center have been featured in the news recently:

  • The New York Times had a feature about our use of the ILOG (now IBM) business rules engine to send out pager or SMS messages to physicians when patients have critical lab results.
  • Infection Control Today had an article about our sepsis detection application, which monitors patients' vital signs and lab results in real time to alert physicians to patients who may have sepsis.

tdaemon and virtualenv

I ran across tdaemon, which automatically runs your test suite when you make a change to your source code. It is very helpful when developing. One issue I had was running the tests when tdaemon needs to monitor a huge number of files (as occurs when I have a virtualenv environment in the same directory, which has more than 100MB of code and binaries). I committed a few changes to tdaemon to allow the user to ignore any directories. For instance, if you want to ignore a virtualenv directory called "env" and the "build" and "dist" directories from distutils:

tdaemon --ignore-dirs=env,build,dist

You can even use it with any of the other tdaemon test programs, such as Django:

tdaemon --test-program=django --custom-args="myapp" --ignore-dirs=env,docs
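
To give a feel for why skipping directories helps, here is a minimal sketch of pruning a directory walk in Python. It is not tdaemon's actual code; iter_watched_files is a hypothetical helper, and it simply matches directory names anywhere in the tree.

```python
import os


def iter_watched_files(root, ignore_dirs=()):
    """Yield file paths under root, skipping any directory named in ignore_dirs."""
    ignored = set(ignore_dirs)
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

A comma-separated option value can be turned into the ignore list with "env,build,dist".split(","). Because ignored directories are never entered, a large virtualenv costs nothing on each scan.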

I also uploaded tdaemon to PyPI and have tdaemon install as a script, so you don't need to keep the tdaemon.py file in your directory. Right now all the changes are on my personal fork of tdaemon on GitHub, but they will hopefully end up upstream.

Move to johnpaulett.com

I have moved this blog to johnpaulett.com. Links to all old posts should still work.

Automated Testing Presentation

Getting Started with (Distributed) Version Control

I gave a short talk on using a distributed version control system. The slides are available on SlideShare under a Creative Commons license.

Parsing HL7 with Python

I needed to parse Health Level 7 (HL7) version 2.x messages from Python, so I created python-hl7. The library allows for easy, key-based access to all the elements in an HL7 message. See the release announcement for download information.

HL7 is a communication protocol and message format for health care data. It is the de facto standard for transmitting data between clinical information systems and between clinical devices. The version 2.x series, which is a pipe-delimited format, is currently the most widely accepted version of HL7 (version 3.0 is an XML-based format).

As an example, let's create an HL7 message:

>>> message = 'MSH|^~\&|GHH LAB|ELAB-3|GHH OE|BLDG4|200202150930||ORU^R01|CNTRL-3456|P|2.4\r'
>>> message += 'PID|||555-44-4444||EVERYWOMAN^EVE^E^^^^L|JONES|196203520|F|||153 FERNWOOD DR.^^STATESVILLE^OH^35292||(206)3345232|(206)752-121||||AC555444444||67-A4335^OH^20030520\r'
>>> message += 'OBR|1|845439^GHH OE|1045813^GHH LAB|1554-5^GLUCOSE|||200202150730||||||||555-55-5555^PRIMARY^PATRICIA P^^^^MD^^LEVEL SEVEN HEALTHCARE, INC.|||||||||F||||||444-44-4444^HIPPOCRATES^HOWARD H^^^^MD\r'
>>> message += 'OBX|1|SN|1554-5^GLUCOSE^POST 12H CFST:MCNC:PT:SER/PLAS:QN||^182|mg/dl|70_105|H|||F\r'

We call the hl7.parse() function with the message string:

>>> import hl7
>>> h = hl7.parse(message)

We get an n-dimensional list back:

>>> type(h)
<type 'list'>

There were 4 segments (MSH, PID, OBR, OBX):

>>> len(h)
4

We can extract individual elements of the message:

>>> h[3][3][1]
'GLUCOSE'
>>> h[3][5][1]
'182'

We can look up segments by the segment identifier:

>>> pid = hl7.segment('PID', h)
>>> pid[3][0]
'555-44-4444'
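
To see where those nested indexes come from, here is a minimal sketch of splitting a pipe-delimited message by hand. It is not python-hl7's implementation: parse_hl7_sketch and find_segment are hypothetical names, the sample message is a shortened version of the one above, and the delimiters are hard-coded rather than read from the MSH segment (so the MSH encoding-characters field is split naively).

```python
def parse_hl7_sketch(message):
    """Split an HL7 v2.x message into segments -> fields -> components."""
    # Segments end with a carriage return; drop the empty trailing piece.
    segments = [line for line in message.split("\r") if line]
    return [
        [field.split("^") for field in segment.split("|")]
        for segment in segments
    ]


def find_segment(segment_id, parsed):
    """Return the first segment whose identifier matches segment_id."""
    for segment in parsed:
        if segment[0][0] == segment_id:
            return segment
    raise KeyError(segment_id)


sample = (
    "MSH|^~\\&|GHH LAB|ELAB-3\r"
    "PID|||555-44-4444||EVERYWOMAN^EVE^E^^^^L\r"
    "OBX|1|SN|1554-5^GLUCOSE||^182|mg/dl\r"
)
parsed = parse_hl7_sketch(sample)
glucose_label = parsed[2][3][1]
patient_id = find_segment("PID", parsed)[3][0]
```

Each level of the list corresponds to one delimiter: carriage returns separate segments, pipes separate fields, and carets separate components.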