ModernBERT in Radiology Part 3: Fine-tuning a Classifier

In Part 3 of the ModernBERT in Radiology series, we will fine-tune a ModernBERT classifier to predict the UMLS CUIs given a radiology report. It builds on our fine-tuning from Part 2 to produce a better classifier than the simple scikit-learn Logistic Regression from Part 1.

You can follow along with the associated Colab Notebook for Part 3 🔥!

The ModernBERT in Radiology Series

ModernBERT in Radiology Part 2: Fine Tuning a Masked Language Model (MLM)

In Part 2 of the ModernBERT in Radiology series, we will fine-tune ModernBERT on the ROCOv2 dataset for Radiology for the Masked Language Model task, so that given the input ninth [MASK] fracture the model predicts ninth rib fracture.

You can follow along with the associated Colab Notebook for Part 2 🔥!

The ModernBERT in Radiology Series

ModernBERT in Radiology Part 1: Simple Classifier using Hidden States

In Part 1 of the ModernBERT in Radiology series, we will explore ModernBERT and the ROCOv2 dataset for Radiology. We build a multi-label classifier using a simple Logistic Regression model on top of the pre-trained ModernBERT body.

You can follow along with the associated Colab Notebook for Part 1 🔥!

The ModernBERT in Radiology Series

  • Part 1: Simple Classifier using Hidden States. 👈 This Post

    Build a multi-label classifier using a simple scikit-learn Logistic Regression model on top of the pre-trained ModernBERT body.
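In a multi-label setup a single report can carry several CUIs at once, so the target for each report is a 0/1 vector rather than one class. Here is a minimal sketch of that encoding. The label names are placeholders (not real CUIs from ROCOv2), and `build_label_index` and `to_multi_hot` are hypothetical helper names, not part of the notebook.

```python
def build_label_index(label_sets):
    """Map every distinct label to a column index (sorted for determinism)."""
    labels = sorted({label for labels in label_sets for label in labels})
    return {label: i for i, label in enumerate(labels)}


def to_multi_hot(labels, label_index):
    """Encode one report's labels as a 0/1 vector, one slot per known label."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


# Placeholder label names standing in for CUIs (assumption, not ROCOv2 data).
reports = [
    {"CUI_A", "CUI_B"},
    {"CUI_B"},
    {"CUI_A", "CUI_C"},
]

label_index = build_label_index(reports)
targets = [to_multi_hot(labels, label_index) for labels in reports]
print(targets)
```

A classifier such as Logistic Regression is then fit with one independent yes/no decision per column.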

And we're back

After setting up my Bluesky to use a custom @johnpaulett handle (learn how to set up your own domain handle), a friend pointed out my website was dead! Apparently sometime in the past 15 years (🤯) without touching it, my custom Django blog app written in Django v1.3 got disconnected from its Heroku Postgres.

Figured I should simplify the setup. There is no need to have a database and app server (that costs $10/mo) to serve a blog that probably gets a few pageviews a month; a statically generated site will work.

Django 1.3 & South

Running the latest Django Release Candidate, I noticed that all of my South data migrations were failing.

Debugging via pdb led me to a recent change in which the database was being flushed after South ran, but before the tests ran. I found a changeset that was committed after Django 1.3 beta-1, #14661, that introduced the flush to correct an issue with MySQL. It is documented as a backwards-incompatible change, which breaks SQL fixtures (and, unfortunately, South data migrations) during test runs.

django-durationfield v0.3.3

I just released an implementation of DurationField for Django to PyPI.

The latest release uses datetime.timedelta for its internal representation of durations (thanks to Paul Oswald) and has support for South thanks to Wes Winham.

The documentation is now hosted on ReadTheDocs and the package data is available on Django Packages.

After a discussion with a few Django core developers, it seems like keeping the DurationField implementation as a separate, reusable application is the preferred option. By staying independent, we keep the ability to make changes independent of the Django release cycle. Additionally, we avoid adding bloat to Django core. Since not all databases implement an interval or duration data type (PostgreSQL does), django-durationfield is in some ways a hack: it uses a bigint column to store each timedelta as an integer.
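The idea of storing a timedelta in a bigint column can be sketched in a few lines. This is an illustration of the technique, not the package's actual code: the choice of microseconds as the stored unit is an assumption, and `timedelta_to_int` and `int_to_timedelta` are hypothetical names.

```python
from datetime import timedelta


def timedelta_to_int(value):
    """Whole microseconds in a timedelta, using exact integer arithmetic."""
    # Assumption: microseconds are the stored unit.
    return (value.days * 86400 + value.seconds) * 1_000_000 + value.microseconds


def int_to_timedelta(stored):
    """Rebuild the timedelta from the stored integer."""
    return timedelta(microseconds=stored)


duration = timedelta(days=1, hours=2, microseconds=5)
stored = timedelta_to_int(duration)
print(stored)
print(int_to_timedelta(stored) == duration)
```

Integer arithmetic avoids the rounding that a float-based conversion such as `total_seconds()` could introduce for very long durations.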

jsonpickle 0.3.1 Released

jsonpickle, the powerful library for serializing complex object graphs in Python to JSON, had a major milestone this week with the official release of 0.3.1, available on PyPI with documentation and full release notes at http://jsonpickle.github.com/. We have migrated from the Google Code site to the new GitHub site at http://github.com/jsonpickle/jsonpickle.

This release represents nearly a year of development from multiple contributors. Some of the major highlights of the release include supporting a wider variety of objects, supporting the pickle protocol's set and get state methods, and allowing the use and addition of any Python JSON backend (e.g. demjson, simplejson, django.utils.simplejson, etc.). Please be aware that backwards compatibility for the 0.2.0 format JSON is not guaranteed.
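To show what honoring the pickle protocol's get and set state methods means for a JSON serializer, here is a toy sketch using only the standard library. It does not use jsonpickle and does not reproduce its output format. `Point`, `to_json`, `from_json`, and the `class`/`state` keys are made up for the example.

```python
import json


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.cache = {}  # transient data we do not want serialized

    def __getstate__(self):
        # Only the persistent attributes go into the serialized form.
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.cache = {}  # rebuild transient data on load


def to_json(obj):
    """Serialize an object via its __getstate__ (toy format, an assumption)."""
    payload = {"class": type(obj).__name__, "state": obj.__getstate__()}
    return json.dumps(payload, sort_keys=True)


def from_json(text, classes):
    """Restore an object without calling __init__, then apply __setstate__."""
    payload = json.loads(text)
    cls = classes[payload["class"]]
    obj = cls.__new__(cls)
    obj.__setstate__(payload["state"])
    return obj


point = Point(1, 2)
point.cache["expensive"] = "value"
text = to_json(point)
restored = from_json(text, {"Point": Point})
print(text)
print(restored.x, restored.y, restored.cache)
```

The explicit `classes` mapping keeps the sketch from instantiating arbitrary types named in the JSON input.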

HTML to reStructuredText in Python using Pandoc

During the conversion of my blog from WordPress to a custom Django-based system, I wanted to move from HTML markup to reStructuredText (partly to make it easier to publish Sphinx documentation to my blog).

While it is dead simple to convert reStructuredText to HTML, going the other way is more difficult. Luckily, Pandoc, the Swiss Army knife for converting between markup formats, can do a nice job converting HTML to reStructuredText.

I wrote a custom Django Command to parse a WordPress XML export file and store the blog entries. The relevant code to convert HTML to reStructuredText is very simple: it makes a subprocess call to the Pandoc command and retrieves the command's output. Make sure you have Pandoc installed (in Ubuntu, sudo apt-get install pandoc will work).
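A minimal sketch of such a subprocess call, written for current Python rather than copied from the original command. The function names are hypothetical, and the HTML is passed on standard input with Pandoc's `--from` and `--to` options.

```python
import shutil
import subprocess


def build_pandoc_command():
    """Command line asking Pandoc to read HTML and write reStructuredText."""
    return ["pandoc", "--from=html", "--to=rst"]


def html_to_rst(html):
    """Pipe HTML into Pandoc on stdin and return the reStructuredText from stdout."""
    result = subprocess.run(
        build_pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Pandoc exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Only attempt the conversion when Pandoc is actually installed.
    if shutil.which("pandoc"):
        print(html_to_rst("<h1>Title</h1><p>Some <em>text</em>.</p>"))
    else:
        print("pandoc not found on PATH")
```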

reStructuredText Widget in Django Admin

While Django has great support for rendering standard markup languages, sometimes it can be difficult editing documents using a markup in the Django Admin. Several others (1, 2, 3) show how easy it is to edit Markdown, Textile, and even HTML, in the Django Admin using a WYSIWYG editor, such as markItUp! or TinyMCE.

Unfortunately, reStructuredText is not well supported by most WYSIWYG editors. Nevertheless, we can improve the experience of editing reStructuredText in the Django Admin. One of the biggest improvements is switching the Textarea to use a monospace font to avoid issues caused by the heading underlines being too short. We can also customize the size of the Textarea.

Mocking Groovy's HTTPBuilder

I ran into a head-scratcher today when trying to unit test some Groovy code. The code under test interacts with an HTTP web service using Groovy's great HTTPBuilder, which wraps Apache's HttpClient. Obviously, I wanted to mock the interaction with the HTTP server to limit the scope of my tests.

Groovy makes it easy to create simple mocks using maps. To mock a class with a map, create a map keyed by the names of the methods to be mocked, with closures as the values that supply the mock implementations. For example, if we wish to mock out the HTTPBuilder, which has a "post" method, we can accomplish it using the map defined by mapMock.

Master's Thesis & Open-Source Tool

On July 15th, I successfully defended my Master's Thesis in Biomedical Informatics at Vanderbilt University. This defense was the culmination of 2 years of work. The thesis focuses on extracting organizational structure and relationships from the audit logs of clinician information systems. This work has potential applications in improving the delivery of care and the security of patients' private medical data.

As part of this work, I developed an open-source tool for analyzing audit logs. Licensed under an Apache 2.0 License, the Healthcare Organizational Relational Network Extraction Toolkit (HORNET) is a Python framework for plugins that analyze healthcare audit logs. The tool is fully functional, but is not yet polished enough for use by healthcare administrators.

Programming Clojure Review

When Stuart Halloway's Programming Clojure came out in May, I picked up a copy and have been reading through it and practicing with the Project Euler problems.

First off, it is a great book! Second off, it introduces a seriously interesting programming language.

Clojure is a Lisp dialect designed to run on the Java Virtual Machine (JVM). This combination is what makes Clojure very powerful: you get the power of a mature virtual machine with access to any existing Java libraries, combined with the dynamic, functional style of Lisp. Imagine being able to continue to use the code and libraries you and others have spent years developing from a new programming environment.

Install Eclipse Galileo (3.5) on Ubuntu Jaunty (9.04)

Eclipse 3.5, codenamed "Galileo," was released this week! While there is a team actively working on building an Ubuntu deb package, they do not yet have a package for Eclipse 3.5. I put together some super simple instructions for installing Eclipse 3.5.

I am going to perform a per-user installation into my home directory. If multiple people use Eclipse on the same computer, you may want to modify these instructions to install into /opt/. I am going to put the installable in ~/bin/packages/eclipse3.5. First, create the installation directory (change according to your own tastes):

Vanderbilt Projects in the News

Some of the projects that I have been involved in at Vanderbilt University's Informatics Center have been featured in the news recently:

  • The New York Times had a feature about our use of the ILOG (now IBM) business rules engine to send out pager or SMS messages to physicians when patients have critical lab results.
  • Infection Control Today had an article about our Sepsis detection application which monitors patients' vital signs and lab results in real time to alert physicians to patients who may have sepsis.

tdaemon and virtualenv

I ran across tdaemon, which automatically runs your test suite when you make a change to your source code. It is very helpful when developing. One issue I had was running the tests when tdaemon needs to monitor a huge number of files (as occurs when I have a virtualenv environment in the same directory, which has more than 100MB of code and binaries). I committed a few changes to tdaemon to allow the user to ignore any directories. For instance, if you want to ignore a virtualenv directory called "env" and the "build" and "dist" directories from distutils:

Parsing HL7 with Python

I've had a need to parse Health Level 7 (HL7) version 2.x messages from Python, thus I created python-hl7. The library allows for easy, key-based access of all the elements in an HL7 message. See the release announcement for download information.

HL7 is a communication protocol and message format for health care data. It is the de facto standard for transmitting data between clinical information systems and between clinical devices. The version 2.x series, which is a pipe-delimited format, is currently the most widely accepted version of HL7 (version 3.0 is an XML-based format).
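To make the pipe-delimited structure concrete, here is a simplified sketch using only the standard library. It is not the python-hl7 API. The sample message is made up, `parse_hl7` is a hypothetical helper, and the delimiters are hard-coded as an assumption, although real messages declare them in the MSH segment.

```python
def parse_hl7(message):
    """Split an HL7 v2.x message into segments, each a list of fields.

    Assumes segments end with a carriage return and fields are separated
    by "|", which is the common case.
    """
    segments = [line for line in message.replace("\n", "\r").split("\r") if line]
    return [segment.split("|") for segment in segments]


# Made-up sample message for illustration only.
sample = "\r".join([
    "MSH|^~\\&|SENDER|FACILITY|RECEIVER|FACILITY|200901011200||ORU^R01|MSG0001|P|2.3",
    "PID|1||12345||DOE^JOHN",
    "OBX|1|NM|GLU^Glucose||95|mg/dL",
])

parsed = parse_hl7(sample)
by_name = {fields[0]: fields for fields in parsed}

patient_name = by_name["PID"][5]      # list position 5 of the PID segment
print(patient_name.split("^"))        # components are separated by "^"
print(by_name["OBX"][5], by_name["OBX"][6])
```

The indices here are plain list positions. A full parser would also handle repetitions, escape sequences, and the special numbering of the MSH segment.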

Getting RESTful with web.py

Django may be the Python web framework getting all the press recently, but web.py is definitely a nice, simple framework. One of the nice aspects of web.py is that it exposes methods for the basic HTTP methods (GET, POST, PUT, DELETE, etc.) and uses these methods to process each request from the client. This approach makes it amazingly easy to write a RESTful API.

web.py

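import web

urls = ("/(.*)", "Resource")  # route every path to the Resource class

class Resource(object):
    def GET(self, name):
        # return the resource
        return "resource: %s" % name
    def POST(self, name):
        # update/create the resource
        return "updated: %s" % name

app = web.application(urls, globals())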

This approach is very similar to what Google App Engine does with its webapp.

Eclipse 3.4 (Ganymede) on Ubuntu

::: notice These instructions refer to outdated versions of Eclipse and Ubuntu. Please refer to the new instructions on installing Eclipse Galileo (3.5) on Ubuntu Jaunty (9.04). :::

Eclipse Ganymede (the successor to Europa) was released today. Ubuntu seems to be stuck on Eclipse 3.2 since at least Feisty Fawn. There are nice features that we are missing out on (Mylyn, inline renames, etc.).

JDK

First things first, you need a JDK (Java SDK) in order to use Eclipse. I am a fan of the OpenJDK, Sun's open source version of its JDK. OpenJDK recently reached full Sun JDK compliance. But any JDK should work, assuming it is at least Java 5.

pymedia on Ubuntu Hardy Heron

Recently, I needed to use pymedia, for some audio and video encoding. The problem though, is that pymedia was nowhere to be found in the Ubuntu Hardy Heron package repository, and the only .deb installation candidate from the pymedia website was for an older version of pymedia and Python 2.4. Not wanting to run an old version and having Python 2.5 as a requirement, I needed to compile the package myself--no easy task, it turns out.

bminews.com launched

I am launching a social news site, bminews.com. It is a specialty site for practitioners in the fields of Bioinformatics and Medical Informatics. Join in! I am looking for a site with great news on new developments, technologies, and opportunities in the field. Related topics (programming/consulting/career advice/math/scientific writing/conferences) are all encouraged.

jsonpickle

I have been working on an open source project, jsonpickle. The goal of the project is to be able to serialize a Python object into standard JSON notation. Python can "pickle" objects into a special binary format, but sometimes it is nice to get a human-readable format, especially with projects like CouchDB that use a JSON-based API. jsonpickle is on its second release and can now officially handle Mark Pilgrim's Universal Feed Parser. Feel free to join in by finding bugs and working on the code! It is pretty easy to use:

Building Python Packages from Source on Windows

I always forget how to build Python packages, such as psyco and simplejson that require C/C++ code to be compiled. The usual error I get from running "python setup.py install" is

error: Python was built with Visual Studio 2003; extensions must be
built with a compiler than can generate compatible binaries. Visual
Studio 2003 was not found on this system. If you have Cygwin
installed, you can try compiling with MingW32, by passing "-c
mingw32" to setup.py.

Now, I do not have Visual Studio 2003, but I do have mingw32. (Grab cygwin and when selecting packages, make sure that mingw-runtime and gcc are selected.) Now, back with our setup.py file, execute:

CPAN on Windows

To use Perl's CPAN on Windows with cygwin, you need to install some additional programs in cygwin. Run cygwin's setup.exe (I like clicking the "View" button to change the listing to Full, so I get an alphabetical list of the packages). Make sure that you install the following packages:

  • perl (just in case you do not have it)
  • gzip
  • tar
  • unzip
  • make
  • lynx
  • wget
  • ncftp
  • gnupg

Open the Cygwin bash shell and enter:

CouchDB on Ubuntu

::: notice These results are now largely outdated by the couchdb package in the Ubuntu universe repository. :::

Installing CouchDB

I have eagerly been waiting to try out CouchDB. I find the concept of document storage, instead of strict relational storage, to be very interesting. Plus, Erlang seems to be gaining mindshare. I documented the process that I took to install CouchDB 0.7.2 on Ubuntu 7.10 (it is basically straight from the CouchDB wiki, but with some small modifications to get it to work).

Just upgraded to Gutsy Gibbon Tribe 2

After reading a post on Planet Ubuntu Users, I grew envious that I didn't have the latest and greatest of Ubuntu--so I decided to take the plunge. First I tried the sudo update-manager -c -d, which errored out. So I hacked away at the /usr/lib/python2.5/site-packages/UpdateManager/DistUpgradeFetcher.py file by adding import os. Everything started going fine, except my connection was crawling (almost dial-up speeds)--there was no way that I was going to wait. So I grabbed the latest ISO, picking the alternate install just to avoid the hassle of booting up into the live CD. Everything worked nicely--except I got a little worried when the install hung for ten minutes when installing Tomboy. But with a little patience, everything installed properly. I'm currently updating the system and looking into the new Dual Monitor tool. I am looking forward to exploring this new toy!

Code Statistics

I like looking at numbers. So when I saw an interesting gif on a friend's webpage saying that a certain piece of software had so many thousands of lines of code and cost so much money, I was naturally intrigued. After clicking on the link, I was directed to ohloh.net. This site analyzes public version control repositories and provides some interesting statistics about the project--including the lines of code and estimated cost to develop the software from scratch. I decided to try for myself and registered RadLex. I would have registered my VCSFrenzy branch, but currently only Subversion, CVS, and Git are supported. After a little digging, it seems like ohloh uses David Wheeler's SLOCCount program. With an easy

Sync Outlook, Thunderbird, a PDA, and your Smartphone

Since I first owned a PDA, I have been looking for an elegant solution to syncing my data with the PDA and 'nonstandard' applications (Thunderbird / Sunbird / anything in Linux) ... as apparently many people are.

Outlook

I tried switching over to Outlook as my primary email/calendar program. My unholy marriage to Outlook lasted longer than I care to admit, mostly because syncing with my PDA just worked.

BirdieSync

With the release of Thunderbird 1.5, I decided I had enough of MS Outlook. Around the same time, Lightning and Sunbird caught my attention. After extensive googling, I found an excellent product, BirdieSync. I installed the trial version, and was loving it, until my 20 day trial expired. At €19.95 (roughly 27 USD) the product was a little extravagant as a college student. If you have the money and only use Windows, this is a decent option.

Upgrade to VirtualBox 1.4.0

VirtualBox 1.4.0 was released yesterday, so I thought I would write a very quick guide to installing it in Ubuntu Feisty. My previous post detailed downloading the file directly and several troubleshooting steps. If you run into a problem, check that post out.

Add VirtualBox Repository

Open up /etc/apt/sources.list (with sudo gedit) and add this line:

deb http://www.virtualbox.org/debian feisty non-free

Add the VirtualBox Public Key:

wget http://www.virtualbox.org/debian/innotek.asc
sudo apt-key add innotek.asc

Update Apt and Install:

Getting Involved with Ubuntu

My history with Linux started off with RedHat 9 then moved through Mandrake, Mandriva, openSUSE, and most recently, Ubuntu. I must admit that Ubuntu is by far the most polished/easy to use/accessible distribution yet. I am excited by the progress of the project and the community. Therefore, since I get such a nice operating system to use every day for free, I want to give something back to the community and get more people involved with Ubuntu. I've been writing quite a few guides to using Ubuntu (or at least tools on Ubuntu); but I have also started getting involved in some Ubuntu projects, most notably VCS Frenzy. I've also updated my Launchpad Profile and my wiki page. Maybe one day I'll get into the respected ubuntu-dev group.

Install Pidgin in Ubuntu

Sure, there are some posts on the Ubuntu Forums with links on how to install a deb package of Pidgin from some untrusted repository. Personally, I'm not too thrilled about using a package that hasn't gone through the community process of being added to Ubuntu. So I have two goals:

  1. Install Pidgin
  2. Show you how to install something from source

Don't be afraid of that last point--a few months ago I was too, but there is no reason to be, because in 7 commands/15 minutes you are going to have Pidgin on your system.

Open Sourcing RadLex

I finally got a chance today to release part of a project that I have been working on for the past year: RadLex. RadLex is a medical terminology, specifically designed for radiology. I created the servlet and a set of plugins for Protege that aid in the development and distribution of RadLex. We recently decided to open source the plugins, so I have spent the past several days preparing the plugins for a release on Sourceforge. Stay tuned as more documentation is added to the plugins (and stay on the lookout for a new release of the servlet and terminology). Remember, you can already develop an application around the RadLex API.

VCSFrenzy

I noticed an interesting project on Planet Ubuntu yesterday--VCSFrenzy. This tool allows you to get desktop notifications of changes to version control systems (VCS). Sure, some VCSs let you have hooks that email you whenever a commit occurs--but this tool provides a lightweight and simple method of keeping track of multiple VCSs. So I have started helping Pete work on this project. So far we have support for Subversion and Bazaar (with Mercurial in my branch right now). Stay tuned as this program gets support for more VCSs and gets some cool new features.

Vista on Ubuntu Using VirtualBox

::: notice Check out the new guide to installing VirtualBox 1.4.0 from a repository. This is a guide to installing Microsoft Windows Vista on an Ubuntu Edgy machine using VirtualBox. Note that Feisty is not yet fully supported by VirtualBox, but the Edgy package is reported to work in Feisty. Certain versions of Vista may be illegal to run in a virtual machine, according to the EULA, namely the Home flavors of Vista. :::

Get Ready

I like to keep stuff clean, so I am going to download everything to a folder. Adjust accordingly if you decide to not make this folder. At the terminal:

Upgrade to Feisty Fawn from Edgy

Ubuntu 7.04, Feisty Fawn, has just been released, so why not upgrade your machine? To replace all your repositories, at the terminal:

sudo sed -e 's/edgy/feisty/g' -i /etc/apt/sources.list

You can likely also upgrade Dapper to Feisty beta. I have not tried this directly, but I have upgraded from Dapper to Edgy using this method before. So you could upgrade to Edgy first, or go straight to Feisty (if you are brave):

Now update:

RSS Alarm Clock

I finally managed to get a sound card for an old computer. Now, I can make my Ubuntu Alarm Clock! What better way to wake up each morning than to have your friendly Linux box read the news to you? The techniques that I am using are nothing new (Hak5 did it and some people at the Ubuntu Forums have talked about it). It is pretty simple: just download and install the rss2html.pl file and the necessary Perl libraries, then create a bash script and add it to the crontab.
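The feed-reading step can be sketched as follows. This uses Python's standard xml.etree module in place of rss2html.pl, so it illustrates the idea rather than the script used here. The feed content is made up, `headlines` is a hypothetical helper, and the text-to-speech and crontab parts are not shown.

```python
import xml.etree.ElementTree as ET


def headlines(rss_xml, limit=5):
    """Return up to `limit` item titles from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    titles = [item.findtext("title", default="").strip() for item in root.iter("item")]
    return [title for title in titles if title][:limit]


# Made-up feed standing in for a downloaded news feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First headline</title></item>
    <item><title>Second headline</title></item>
  </channel>
</rss>"""

for line in headlines(SAMPLE_FEED):
    print(line)
```

The printed lines are the text that would be handed to a speech program from the cron-driven script.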

ClamAV Email Checking on a Shared Host

This guide is designed for users who have a shared hosting account (no root access), namely on DreamHost. We will make use of ClamAV, procmail, and ClamAssassin.

ℹ️
If you add a ~/.procmailrc file to DreamHost, you will likely be unable to use the DreamHost Control Panel's Junk Filtering. Therefore, it is recommended that you check out this excellent SpamAssassin guide and the DreamHost wiki.

Note that I have installed everything in a ~/packages folder, which the previously mentioned guide does not do, so you should adjust accordingly. I also installed a more recent version of SpamAssassin than the previous guide (3.1.8 vs 3.1.0).

MediaWiki Markup for WordPress

Just found a nice tool for using MediaWiki's markup inside of WordPress at Zech's Blog. The WYSIWYG editor in WordPress is nice, but staying at the keyboard with the wiki markup is nice too (without the clutter of XHTML). I have started to become a fan of the MediaWiki markup, because it leaves the text of the document less cluttered than XHTML. I have recently been working quite a bit on wikis (RadLex and several Trac installations). I originally coded the RadLex wiki text in XHTML, before I learned about MediaWiki's markup. But picking up the wiki markup was quick and I found reformatting the RadLex page in wiki markup to be much quicker than my first attempt with XHTML (which I have years of experience with).

Songbird 0.2 (Almost)

There's a new kid in town, Songbird. Imagine if Firefox mated with iTunes, and somehow the gray and brushed-metal interfaces formed a midnight black child--that would be Songbird. Songbird is an attempt to use the XUL design framework from Mozilla Firefox to create a user experience similar to Apple's iTunes program. While some have said that Songbird is nothing but a direct rip off of iTunes, Songbird brings new ideas to the media player realm. Like iTunes (and most media players), there is a library with the typical sorting features. However, instead of only providing a single source for new content like iTunes has with iTMS, Songbird uses the whole internet as its "music store." Any site with open media content can be part of Songbird's "music store," as Songbird auto discovers tracks on the site. Songbird uses services from Amazon, Creative Commons, eMusic, and dozens of others. Podcasts and streaming radio are easily accessible sources of content in Songbird. A nice feature is the Wikipedia plugin (make sure to install it during the Songbird installation), which shows the Wikipedia page for the currently playing artist.

Firefox 2.0 beta 2

The new version of Mozilla Firefox is out--grab a beta copy now! There have been doubts about whether Firefox can keep making inroads against Microsoft's Internet Explorer, given the fact that IE 7 is now out in beta and has many of the features that the internet community has come to expect from a "modern browser" (tabbed browsing, feed integration, etc.). However, it appears as if the Mozilla team is going strong with the latest preview of Firefox and will beat Microsoft to the market in the latest round of the browser wars.

Smithsonian Quickly Drops Pluto

The International Astronomical Union has decided that Pluto no longer is considered a planet. This distant body has been reclassified as a "dwarf planet," along with two other bodies, Xena and Ceres. While some may feel cheated, losing one of their beloved planets, the Smithsonian Air & Space Museum was quick to adopt the new standard. Within a week and a half, the museum had begun making changes to the solar system exhibit. The symbol of Pluto, which had been in a list of all the planets, was hidden behind a black plastic square.

Downfall of DRM?

There has been a lot of buzz this week in the anti-DRM (Digital Rights Management) camp. Engadget posted that Microsoft's PlaysForSure had been removed by a program called FairUse4WM. While Microsoft almost immediately pushed a patch out that broke version 1.1 of this program, within the week version 1.2 was released, which according to its author should be much harder for Microsoft to break. Meanwhile, in the iTunes arena, a new version of QTFairUse6 has been released which removes Apple's previously uncrackable DRM, FairPlay for iTunes v6.

New Site, New Host

Welcome to the new jhcore.com. After having hosted the website for the past 3 years on an old computer in my apartment, it was time for a change to a hosted solution, where I wouldn't have to worry about my internal network being compromised. After quite a bit of research, I have settled upon DreamHost--besides multiple glowing reviews, their prices and features are highly competitive.

In addition to switching hosts, I have finally decided to try out several packaged web applications. I have thought about switching over to WordPress for about a year and a half now, but I have always enjoyed creating my very own layout and code for my site. But hopefully with WordPress I will be able to focus more on the content of the site and less on the actual code.