In Part 3 of the ModernBERT in Radiology series, we will fine-tune a ModernBERT Classifier to predict the UMLS CUIs given a radiology report. It will combine our fine-tuning from Part 2 to produce a better classifier than the simple scikit-learn Logistic Regression from Part 1.
In Part 2 of the ModernBERT in Radiology series, we will fine-tune ModernBERT against the ROCOv2 dataset for Radiology for the Masked Language Model task so that given an input of ninth [MASK] fracture the model would predict ninth rib fracture.
In Part 1 of the ModernBERT in Radiology series, we will explore ModernBERT and the ROCOv2 dataset for Radiology. We build a multi-label classifier using a simple Logistic Regression model on top of the pre-trained ModernBERT body.
After setting up my bluesky to use a custom @johnpaulett handle (learn how to set up your own domain handle), a friend pointed out my website was dead! Apparently sometime in the past 15 years (🤯) without touching it, my custom Django blog app written in Django v1.3 got disconnected from its Heroku Postgres.
Figured I should simplify the setup: there is no need to have a database and app server (that costs $10/mo) to serve a blog that probably gets a few pageviews a month, so a statically generated site will work.
Debugging via pdb led me to a recent change in which the database
was being flushed after South ran, but before the tests ran. I found a
changeset that was committed after Django 1.3 beta-1,
#14661, that introduced the
flush to correct an issue with MySQL. It is documented as a
backwards-incompatible
change
that prevents SQL fixtures (and, unfortunately, South data
migrations) from reaching the tests.
The latest release uses datetime.timedelta for its internal
representation of durations (thanks to Paul
Oswald) and has support for
South thanks to Wes
Winham.
After a discussion with a few Django core developers, it seems like
keeping the DurationField implementation as a separate, reusable
application is the preferred option. By staying independent, we keep the
ability to make changes independent of the Django release cycle.
Additionally, we avoid adding bloat to Django core. Since not all
databases implement an interval or duration data type (PostgreSQL does),
django-durationfield is in some ways a hack: it uses a bigint column
to store each timedelta as an integer.
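The bigint trick can be sketched in a few lines of plain Python. This is only an illustration of the idea, not django-durationfield's actual code: the helper names are hypothetical, and the choice of microseconds as the stored unit is an assumption.

```python
from datetime import timedelta

MICROSECONDS_PER_SECOND = 1_000_000
SECONDS_PER_DAY = 86_400


def timedelta_to_microseconds(delta: timedelta) -> int:
    """Flatten a timedelta into one integer that fits in a bigint column.

    Assumption: the stored unit is microseconds (hypothetical helper).
    """
    whole_seconds = delta.days * SECONDS_PER_DAY + delta.seconds
    return whole_seconds * MICROSECONDS_PER_SECOND + delta.microseconds


def microseconds_to_timedelta(value: int) -> timedelta:
    """Rebuild the timedelta from the stored integer."""
    return timedelta(microseconds=value)


original = timedelta(days=1, hours=2, minutes=30, microseconds=5)
stored = timedelta_to_microseconds(original)
restored = microseconds_to_timedelta(stored)
```

Using integer arithmetic on the `days`, `seconds`, and `microseconds` attributes avoids the rounding that a float-based conversion could introduce, and negative durations round-trip as well.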
jsonpickle, the powerful library for
serializing complex object graphs in Python to JSON, had a major
milestone this week with the official release of 0.3.1, available on
PyPI with documentation and
full release notes at http://jsonpickle.github.com/. We have migrated
from the Google Code site to the new GitHub site at
http://github.com/jsonpickle/jsonpickle.
This release represents nearly a year of development from multiple
contributors. Some of the major highlights of the release include
supporting a wider variety of objects, supporting the pickle protocol's
set and get state methods, and allowing the use and addition of any
Python JSON backend (e.g. demjson, simplejson, django.utils.simplejson,
etc.). Please be aware that backwards compatibility for the 0.2.0 format
JSON is not guaranteed.
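To make the "set and get state" highlight concrete, here is a minimal standard-library sketch of how an object's `__getstate__` and `__setstate__` methods can drive a JSON round trip. It is not jsonpickle's implementation or output format: the `to_json` and `from_json` helpers, the `"class"` and `"state"` keys, and the `Point` class are all made up for illustration.

```python
import json


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self._cache = None  # transient data we do not want serialized

    def __getstate__(self):
        # Pickle protocol hook: return only the state worth persisting.
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state):
        # Pickle protocol hook: restore the object from persisted state.
        self.x = state["x"]
        self.y = state["y"]
        self._cache = None


def to_json(obj) -> str:
    """Hypothetical helper: tag the state with the class name."""
    payload = {"class": type(obj).__name__, "state": obj.__getstate__()}
    return json.dumps(payload, sort_keys=True)


def from_json(text: str, registry: dict):
    """Hypothetical helper: rebuild an object without calling __init__."""
    payload = json.loads(text)
    cls = registry[payload["class"]]
    obj = cls.__new__(cls)
    obj.__setstate__(payload["state"])
    return obj


encoded = to_json(Point(1, 2))
restored = from_json(encoded, {"Point": Point})
```

The explicit `registry` argument is a deliberate simplification: it limits which classes can be instantiated from untrusted JSON.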
During the conversion of my blog from WordPress to a custom Django-based
system, I wanted to move from HTML markup to
reStructuredText (partly to
make it easier to publish Sphinx
documentation to my blog).
While it is dead simple to convert reStructuredText to HTML, going the
other way is more difficult. Luckily,
Pandoc, the swiss army knife for
converting between markup formats, can do a nice job converting HTML to
reStructuredText.
I wrote a custom Django Command to parse a WordPress XML export file and
store the blog entries. The relevant code to convert HTML to
reStructuredText is very simple: it makes a subprocess call to
the Pandoc command and retrieves the command's output. Make sure you
have Pandoc installed (in Ubuntu, sudo apt-get install pandoc will
work).
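A subprocess call of that kind might look like the sketch below. It is a modern Python 3 approximation rather than the code from the original Django command; the function names are hypothetical, and it assumes a `pandoc` executable is on the PATH.

```python
import subprocess


def pandoc_command() -> list:
    """Command line asking Pandoc to read HTML and write reStructuredText."""
    return ["pandoc", "--from=html", "--to=rst"]


def html_to_rst(html: str) -> str:
    """Pipe HTML into Pandoc on stdin and return its stdout.

    Raises subprocess.CalledProcessError if Pandoc exits with an error,
    and FileNotFoundError if Pandoc is not installed.
    """
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `html_to_rst("<h1>Title</h1><p>Body</p>")` should return the same content as reStructuredText, with the heading underlined.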
While Django has great
support
for rendering standard markup languages, sometimes it can be difficult
editing documents using a markup in the Django Admin. Several others
(1,
2,
3) show how easy it is
to edit Markdown, Textile, and even HTML, in the Django Admin using a
WYSIWYG editor, such as markItUp!
or TinyMCE.
Unfortunately, reStructuredText is not well supported by most WYSIWYG
editors. Nevertheless, we can improve the experience of editing
reStructuredText in the Django Admin. One of the biggest improvements is
switching the Textarea to use a monospace font to avoid issues caused by
the heading underlines being too short. We can also customize the size
of the Textarea.
I ran into a head-scratcher today when trying to unit test some
Groovy code. The code under test
interacts with an HTTP web service using Groovy's great
HTTPBuilder, which
wraps Apache's
HttpClient. Obviously, I
wanted to mock the interaction with the HTTP server to limit the scope
of my tests.
Groovy makes it easy to create simple mocks using
maps. To mock a class with a
map, one creates a map keyed by the names of the methods to be
mocked, with closures as the mock method implementations. For
example, if we wish to mock out the HTTPBuilder, which has a "post"
method, we can accomplish it using the map defined by mapMock.
On July 15th, I successfully defended my Master's Thesis in Biomedical
Informatics at Vanderbilt University.
This defense was the culmination of 2 years of work. The thesis focuses
on extracting organizational structure and relationships from the audit
logs of clinician
information
systems. This
work has potential applications in improving the delivery of care
and the security of patients' private medical data.
As part of this work, I developed an open-source tool for analyzing
audit logs. Licensed under an Apache 2.0
License, the
Healthcare Organizational Relational Network Extraction Toolkit (HORNET)
is a Python framework for plugins that analyze healthcare audit logs.
The tool is fully functional, but is not yet polished enough for use by
healthcare administrators.
When Stuart Halloway's Programming
Clojure
came out in May, I picked up a copy and have been reading through it and
practicing with the Project Euler problems.
First off, it is a great book! Second off, it introduces a seriously
interesting programming language.
Clojure is a
Lisp dialect
designed to run on the Java Virtual
Machine (JVM). This
combination is what makes Clojure very powerful: you get the power of a
mature virtual machine with access to any existing Java libraries,
combined with the dynamic, functional style of Lisp. Imagine being able
to continue to use the code and libraries you and others have spent
years developing from a new programming environment.
Eclipse 3.5, codenamed "Galileo," was released this week! While there
is a team actively working on
building an Ubuntu deb
package, they do not
yet have a package for Eclipse 3.5. I put together some super simple
instructions for installing Eclipse 3.5.
I am going to perform a per-user installation into my home-directory. If
multiple people use Eclipse on the same computer, you may want to modify
these instructions to install into /opt/. I am going to put the
installable in ~/bin/packages/eclipse3.5. First, create the
installation directory (change according to your own tastes)
Some of the projects that I have been involved in at Vanderbilt
University's Informatics Center have been featured in the news
recently:
The New York Times had a
feature
about our use of the ILOG (now IBM) business rules engine to send
out pager or SMS messages to physicians when patients have critical
lab results.
Infection Control Today had an
article
about our Sepsis detection application which monitors patients'
vital signs and lab results in real time to alert physicians to
patients who may have sepsis.
I ran across tdaemon, which
automatically runs your test suite when you make a change to your source
code. It is very helpful when developing. One issue I had was running
the tests when tdaemon needs to monitor a huge number of files (as
occurs when I have a
virtualenv environment in the
same directory, which has more than 100MB of code and binaries). I
committed a few changes to
tdaemon to allow the user to ignore any directories. For instance, if
you want to ignore a virtualenv directory called "env" and the
"build" and "dist" directories from distutils:
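Independent of tdaemon's own options, the underlying idea of skipping whole directories while scanning a source tree can be sketched with the standard library. This is not tdaemon's code; `watched_files` and its `ignore_dirs` parameter are hypothetical names.

```python
import os


def watched_files(root, ignore_dirs=()):
    """Return relative paths of files under root, skipping ignored directories.

    A directory is skipped if its name matches an entry in ignore_dirs,
    at any depth of the tree.
    """
    ignored = set(ignore_dirs)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)
```

Pruning `dirnames` in place matters for large trees such as a virtualenv: the ignored files are never visited at all, rather than being visited and filtered out afterwards.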
I've had a need to parse Health Level 7 (HL7) version 2.x messages from
Python, thus I created
python-hl7.
The library allows for easy, key-based access of all the elements in an
HL7 message. See the release
announcement
for download information.
HL7 is a communication protocol and
message format for health care data. It is the de facto standard for
transmitting data between clinical information systems and between
clinical devices. The version 2.x series, which is a pipe-delimited
format, is currently the most widely accepted version of HL7
(version 3.0 is an XML-based format).
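To show what "pipe-delimited" means in practice, here is a rough standard-library sketch that splits a version 2.x message into segments and fields. It does not use or reproduce the python-hl7 API; the function names are hypothetical, the sample message is invented for illustration, and the separators are assumed to be the conventional defaults (carriage return between segments, `|` between fields, `^` between components) rather than read from the MSH segment.

```python
SEGMENT_SEPARATOR = "\r"  # assumed default: one segment per carriage return
FIELD_SEPARATOR = "|"     # assumed default field delimiter
COMPONENT_SEPARATOR = "^"  # assumed default component delimiter


def parse_hl7(message: str) -> list:
    """Split a message into a list of segments, each a list of fields."""
    segments = []
    for line in message.strip().split(SEGMENT_SEPARATOR):
        if line:
            segments.append(line.split(FIELD_SEPARATOR))
    return segments


def find_segments(parsed: list, segment_id: str) -> list:
    """Return every segment whose first field is the given ID, e.g. 'PID'."""
    return [segment for segment in parsed if segment[0] == segment_id]


# Invented sample data, not taken from a real system.
sample = SEGMENT_SEPARATOR.join([
    "MSH|^~\\&|LAB|HOSPITAL|EHR|HOSPITAL|200901011200||ORU^R01|MSG0001|P|2.3",
    "PID|1||12345||DOE^JOHN",
    "OBX|1|NM|GLU^Glucose||182|mg/dl",
])

parsed = parse_hl7(sample)
pid = find_segments(parsed, "PID")[0]
patient_name = pid[5].split(COMPONENT_SEPARATOR)
```

A naive split like this is off by one for the MSH segment, where the field separator itself counts as the first field, so a real parser needs to special-case it.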
Django may be the Python web framework
getting all the press
recently, but
web.py is definitely a nice, simple framework. One
of the nice aspects of web.py is that it exposes methods for the basic
HTTP methods
(GET, POST, PUT, DELETE, etc.) and uses these methods to process each
request from the client. This approach makes it amazingly easy to write
a RESTful API.
web.py
import web

class Resource(object):
    def GET(self, name):
        # return the resource
        pass

    def POST(self, name):
        # update/create the resource
        pass
Eclipse Ganymede (the successor to Europa) was
released today. Ubuntu seems to be stuck on Eclipse 3.2 since at least
Feisty Fawn. There are nice features that we are missing out on (Mylyn,
inline renames, etc.).
JDK
First things first, you need a JDK (Java
SDK) in order to use Eclipse. I am a fan of the OpenJDK, Sun's open
source version of its JDK. OpenJDK recently reached full Sun JDK
compliance. But any JDK should work, assuming it is at least Java 5.
Recently, I needed to use pymedia for some audio
and video encoding. The problem, though, is that pymedia was nowhere to
be found in the Ubuntu Hardy Heron package repository, and the only .deb
installation candidate from the pymedia website was for an older version
of pymedia and Python 2.4. Not wanting to run an old version and having
Python 2.5 as a requirement, I needed to compile the package myself--no
easy task, it turns out.
I am launching a social news site, bminews.com. It
is a specialty site for practitioners in the fields of Bioinformatics
and Medical Informatics. Join in! I am looking for a site with great
news on new developments, technologies, and opportunities in the field.
Related topics (programming/consulting/career advice/math/scientific
writing/conferences) are all encouraged.
I have been working on an open source project,
jsonpickle. The goal of the
project is to be able to serialize a Python object into standard
JSON notation. Python can "pickle" objects into
a special binary format, but sometimes it is nice to get a
human-readable format, especially with projects like
CouchDB that use a
JSON-based API. jsonpickle is on its second release and can now
officially handle Mark Pilgrim's Universal Feed
Parser. Feel free to join in by finding bugs
and working on the code! It is pretty easy to use:
I always forget how to build Python packages, such as
psyco and
simplejson that require C/C++
code to be compiled. The usual error I get from running "python
setup.py install" is
error: Python was built with Visual Studio 2003; extensions must be
built with a compiler than can generate compatible binaries. Visual
Studio 2003 was not found on this system. If you have Cygwin
installed, you can try compiling with MingW32, by passing "-c
mingw32" to setup.py.
Now, I do not have Visual Studio 2003, but I do have mingw32. (Grab
cygwin and, when selecting packages, make sure
that mingw-runtime and gcc are selected.) Now, back with our setup.py
file, execute:
To use Perl's CPAN on Windows with
cygwin, you need to install some additional
programs in cygwin. Run cygwin's setup.exe (I like clicking the
"View" button to change the listing to Full, so I get an alphabetical
list of the packages). Make sure that you install the following
packages:
::: notice
These results are now largely outdated by the couchdb package in the
Ubuntu universe repository.
:::
Installing CouchDB
I have eagerly been waiting to try out CouchDB. I
find the concept of document storage, instead of strict relational
storage, to be very interesting. Plus, Erlang
seems to be gaining mindshare. I documented the process that I took to
install CouchDB 0.7.2 on Ubuntu 7.10 (it is basically straight from
the CouchDB
wiki, but with
some small modifications to get it to work).
After reading a post on Planet Ubuntu Users, I grew envious that I
didn't have the latest and greatest of Ubuntu--so I decided to take the
plunge. First I tried sudo update-manager -c -d, which errored
out.
So I hacked away at the
/usr/lib/python2.5/site-packages/UpdateManager/DistUpgradeFetcher.py
file by adding import os. Everything started going fine, except my
connection was crawling (almost dial-up speeds)--there was no way that
I was going to wait. So I grabbed the latest
ISO, picking the
alternate install just to avoid the hassle of booting up into the live
cd. Everything worked nicely--except I got a little worried when the
install hung for ten minutes when installing Tomboy. But with a little
patience, everything installed properly. I'm currently updating the
system and looking into the new Dual
Monitor tool. I am looking
forward to exploring this new toy!
I like looking at numbers. So when I saw an interesting gif on a
friend's webpage saying that a certain piece of software had so many
thousands of lines of code and cost so much money, I was naturally
intrigued. After clicking on the link, I was directed to
ohloh.net. This site analyzes public version
control repositories and provides some interesting statistics about the
project--including the lines of code and estimated cost to develop the
software from scratch. I decided to try for myself and registered
RadLex. I would have registered my
VCSFrenzy branch, but currently only
Subversion, CVS, and Git are supported. After a little digging, it seems
like ohloh uses David Wheeler's
SLOCCount program. With an easy
Since I first owned a PDA, I have been looking for an elegant solution
to syncing my data with the PDA and 'nonstandard' applications
(Thunderbird / Sunbird / anything in Linux) ... as apparently many
people are.
Outlook
I tried switching over to Outlook as my primary email/calendar program.
My unholy marriage to Outlook lasted longer than I care to admit, mostly
because syncing with my PDA *just worked*.
BirdieSync
With the release of
Thunderbird 1.5, I decided
I had enough of MS Outlook. Around the same time,
Lightning and
Sunbird caught my
attention. After extensive googling, I found an excellent product,
BirdieSync. I installed the trial version,
and was loving it, until my 20-day trial expired. At €19.95 (roughly 27
USD) the product was a little extravagant for a college student. If you
have the money and only use Windows, this is a decent option.
VirtualBox 1.4.0 was released yesterday, so I thought I would write a
very quick guide to installing it in Ubuntu Feisty. My previous
post
detailed downloading the file directly and several troubleshooting
steps. If you run into a problem, check that
post
out.
Add VirtualBox Repository
Open up /etc/apt/sources.list (with
sudo gedit) and add this line:
deb http://www.virtualbox.org/debian feisty non-free
My history with Linux started off with RedHat 9 then moved through
Mandrake, Mandriva, openSUSE, and most recently,
Ubuntu. I must admit that Ubuntu is by far the
most polished/easy to use/accessible distribution yet. I am excited by
the progress of the project and the community. Therefore, since I get
such a nice operating system to use every day for free, I want to give
something back to the community and get more people involved with
Ubuntu. I've been writing quite a few guides to using Ubuntu (or at
least tools on Ubuntu); but I have also started getting involved in some
Ubuntu projects, most notably VCS
Frenzy. I've also updated my
Launchpad Profile and my wiki
page. Maybe one day I'll get into
the respected ubuntu-dev group.
Sure, there are some
posts on the Ubuntu
Forums with links on how to install a deb package of
Pidgin from some untrusted repository.
Personally, I'm not too thrilled about using a package that hasn't
gone through the community process of being added to Ubuntu. So I have
two goals:
Don't be afraid of that last point--a few months ago I was too, but
there is no reason to be afraid, because in 7 commands/15 minutes you
are going to have Pidgin on your system.
I finally got a chance today to
release part of a project
that I have been working on for the past year:
RadLex. RadLex is a medical terminology,
specifically designed for radiology. I created the servlet and a set of
plugins for Protege that aid in the
development and distribution of RadLex. We recently decided to open
source the plugins, so I have spent the
past several days preparing the plugins for a release on
Sourceforge. Stay tuned as more documentation
is added to the plugins (and stay on the lookout for a new release of
the servlet and terminology). Remember, you can already develop an
application around the RadLex
API.
- Wiki
user
page.
I noticed an interesting project on Planet Ubuntu
yesterday--VCSFrenzy. This tool
allows you to get desktop notifications of changes to version control
systems (VCS). Sure, some VCSs let you have hooks that email you whenever
a commit occurs--but this tool provides a lightweight and simple method
of keeping track of multiple VCSs. So I have started helping
Pete work on this project.
So far we have support for Subversion and Bazaar (plus
Mercurial in my branch right now). Stay tuned as this program gets
support for more VCSs and some cool new features.
::: notice
Check out the new
guide to
installing VirtualBox 1.4.0 from a repository. This is a guide to
installing Microsoft Windows Vista on an Ubuntu
Edgy machine using
VirtualBox. Note that Feisty is not yet
fully supported by VirtualBox, but the edgy package is
reported to work in
feisty. Certain versions of Vista may be
illegal to run in a virtual
machine, according to the EULA, namely the Home flavors of Vista.
:::
Get Ready
I like to keep stuff clean, so I am going to download everything to a
folder. Adjust accordingly if you decide to not make this folder. At the
terminal:
Ubuntu 7.04, Feisty
Fawn, has just been released, so
why not upgrade your machine? To replace all your repositories, at the
terminal:
sudo sed -e 's/edgy/feisty/g' -i /etc/apt/sources.list
You can likely also upgrade Dapper to Feisty beta. I have
not tried this directly. I have upgraded from Dapper to Edgy using this
method before. So you could upgrade to Edgy first, or go straight to
Feisty (if you are brave):
I finally managed to get a sound card for an old computer. Now, I can
make my Ubuntu Alarm Clock! What better way to wake up each morning,
than to have your friendly Linux box read the news to you? The
techniques that I am using are nothing new
(Hak5 did it and
some people at the Ubuntu
Forums have talked
about it). It is pretty simple, just download and install the
rss2html.pl file and the necessary Perl libraries, then create a bash
script and add it to the crontab.
This guide is designed for users who have a shared hosting account (no
root access), namely on
DreamHost. We will make use of
ClamAV,
procmail, and
ClamAssassin.
ℹ️
If you add a ~/.procmailrc file to DreamHost, you will likely be unable
to use the DreamHost Control Panel's Junk Filtering. Therefore, it is
recommended that you check out this excellent SpamAssassin
guide and
the DreamHost wiki.
Note that I have installed everything in a ~/packages folder, which the
previously mentioned guide does not do, so you should adjust
accordingly. I also installed a more recent version of SpamAssassin than
the previous guide (3.1.8 vs 3.1.0).
Just found a nice tool for using MediaWiki's markup inside of WordPress
at Zech's Blog. The
WYSIWYG editor in WordPress is nice, but staying at the keyboard with
wiki markup is nicer. I have started
to become a fan of the MediaWiki markup, because it leaves the text of
the document less cluttered than XHTML. I have recently been working quite a
bit on wikis
(RadLex and several
Trac installations). I originally coded the
RadLex wiki text in XHTML, before I learned about MediaWiki's
markup. But
picking up the wiki markup was quick and I found reformatting the RadLex
page in wiki markup to be much quicker than my first attempt with XHTML
(which I have years of experience with).
There’s a new kid in town,
Songbird. Imagine if Firefox
mated with iTunes, and somehow the gray and brushed-metal interfaces
formed a midnight black child--that would be Songbird. Songbird is an
attempt to use the XUL design framework from Mozilla Firefox to create a
user experience similar to Apple's iTunes program. While some have said
that Songbird is nothing but a direct rip off of iTunes, Songbird brings
new ideas to the media player realm. Like iTunes (and most media
players), there is a library with the typical sorting features. However,
instead of only providing a single source for new content like iTunes
has with iTMS, Songbird uses the whole internet as its "music store."
Any site with open media content can be part of Songbird's "music
store," as Songbird auto-discovers tracks on the site. Songbird uses
services from Amazon, Creative Commons, eMusic, and dozens of others.
Podcasts and streaming radio are easily accessible sources of content in
Songbird. A nice feature is the Wikipedia plugin (make sure to install
it during the Songbird installation), which shows the Wikipedia page for
the currently playing artist.
The new version of Mozilla Firefox is out--grab a beta copy
now! There have
been doubts about whether Firefox can keep making inroads against Microsoft's
Internet Explorer, given that IE 7 is now out in
beta and has many of
the features that the internet community has come to expect from a
"modern browser" (tabbed browsing, feed integration, etc.). However, it
appears as if the Mozilla team is going strong with the latest preview
of Firefox and will beat Microsoft to the market in the latest round of
the browser
wars.
The International Astronomical Union has decided that Pluto is no longer
considered a
planet.
This distant body has been reclassified as a "dwarf planet," along
with two other bodies, Xena and Ceres. While some may feel cheated,
losing one of their beloved planets, the Smithsonian Air & Space
Museum was quick to adopt the new standard.
Within a week and a half, the museum had begun making changes to the
solar system exhibit. The
symbol of Pluto,
which had been in a list of all the planets, was hidden behind a black
plastic square.
There has been a lot of buzz this week in the anti-DRM (Digital Rights
Management)
camp. Engadget
posted
that Microsoft's PlayForSure had been removed by a program called
FairUse4WM. While
Microsoft almost immediately pushed a patch out that broke version 1.1
of this program, within the week version 1.2 was released, which
according to its author should be much harder for Microsoft to break.
Meanwhile, in the iTunes arena, a new version of
QTFairUse6 has
been released, which removes Apple's previously uncrackable DRM,
FairPlay for iTunes v6.
Welcome to the new jhcore.com. After having hosted the website for the
past 3 years on an old computer in my apartment, it was time for a
change to a hosted solution, where I wouldn't have to worry about my
internal network being compromised. After quite a bit of research, I
have settled upon
DreamHost--besides multiple
glowing
reviews,
their prices and features are highly competitive.
In addition to switching hosts, I have finally decided to try out
several packaged web applications. I have thought about switching
over to WordPress for about a year and a
half now, but I have always enjoyed creating my very own layout and code
for my site. Hopefully, with WordPress I will be able to focus more
on the content of the site and less on the actual code.