Bounty Hunting

Via Jeremy Hylton, Mark Shuttleworth (super-rich geek and space tourist) is offering $100,000 worth of bounties in 2004 to developers willing to help out with a number of Open Source projects, most of which are to be developed in Python. A comparison can be made here with Mitch Kapor, another geek-done-well who funds open source development through his sponsorship of the Open Source Applications Foundation.

Personally I think this is a great way of funding open source development, although a counter-argument is that this kind of reward could encourage the wrong kind of attitude within the open source community with programmers dedicating their time to paid projects at the expense of others. I don't see this happening - people work on open source projects for the love of doing so and to scratch an itch. If you have the money to spend and want to offer it up as a bounty for someone to further develop a product you like then more power to you! I would certainly consider doing so if some freak incident left me rolling in cash (very unlikely to happen considering my complete disinterest in nursing my finances).

New PHP community site

Via The Farm, Chris Shiflett is calling for assistance in setting up a new PHP community site to run along similar lines to use Perl. Chris has already secured an offer of hosting and support from O'Reilly and is now seeking offers of help from potential contributors. PHP has long needed a site of this kind (PHP Builder has lost a lot of momentum since being sold by Tim Perdue) so this could be a worthwhile project to get involved with if you have the time.

I wonder if Python would benefit from something like this? Python already has an excellent decentralised community centered mainly around the Python newsgroups, blogs and mailing lists, but it would be nice if Python.org provided more community-oriented features.

Simpler content management

Perls of wisdom in a sea of site mismanagement, via the ever-excellent Column Two:

The great surprise of the past five years of content management is that, despite all the hundreds of systems, no clear winners have emerged. Instead, there's a growing dissatisfaction with the ongoing technical burden that such systems impose.

Some influential voices are starting to argue that many sites should, in effect, wait out this immature phase of website management. For the moment, they should content themselves with limited automation.

The article concludes with the idea that many sites can do perfectly well with a few simple Perl scripts and maybe a relational database on the back end, rather than investing in an expensive super-package that claims to be able to do anything you could possibly want. This is very sound advice. The simple fact of the matter is that many sites really don't need a complex content management platform with support for templating, user logins, workflow, versioning and a dozen other high end features. Most sites just need someone to be able to easily update them, when necessary. This is why Macromedia Contribute has been such a success - people want the ability to hit "Edit This Page", make a few changes and publish straight to their site.

I've worked on my fair share of content management systems (in fact I'm helping develop one at the moment) and out of all of the ones I've been involved in, the one I got the biggest kick out of took the shortest time to develop. It was based on Tavi Wiki, and consisted of a password protected Tavi install for the back end and a slightly modified separate install for the front end. Both installs pointed to the same database, but the front end was altered to disable all editing features and make the site look less like a Wiki. You can see the end result here.

All in all, the CMS took less than an hour to put together from start to finish. It made it easy enough for contributors with no previous knowledge of HTML to update the site (using Wiki markup) and provided us with full versioning on all content contained within the project. The final site gives very few clues that the underlying engine is a Wiki, and thanks to Tavi's ease of customisation the site design can be easily changed to look even less wiki-like. It's close to the simplest thing that could possibly work and it works just fine.

Of course, if you don't have a competent server-side programmer to hand your only option is to buy a pre-made solution, but with a half-decent programmer and a good set of tools a simple home-built CMS customised to fit your needs could be a much better investment than some $100,000 one-size-fits-all monstrosity.

GAWDS now inviting new members

The Guild of Accessible Web Designers is a worldwide organisation of web designers and developers committed to helping each other, and promoting the message that accessible web design is 'good for business'. I'd describe the organisation in detail here, but the official site does a far better job than I could. If you've been following the web accessibility community in any detail you're likely to recognise a number of the names on the registered members list; I've been following GAWDS developments for a while and it's shaping up to be a great resource for accessibility-minded designers. I've also contributed an article on Writing good ALT text which hopefully provides some useful advice on a frequently misunderstood topic.

Dates on the web

D. Keith Robinson writes about Using Dates For Featured Web Content. Keith's right, including a date with your content really is a no-brainer. I'll add an anecdote of my own. Several years ago I ran a popular news site for Team Fortress Classic, a team-based online first-person shooter game with a thriving clan scene. I was careful to include dates on every piece of content, but in my youthful naivety I neglected to include the year. The years rolled by and the content built up until I suddenly realised that I was no longer sure what year some of it was written in! The site has sadly now passed in to history but the lesson remains: the web moves faster than you might think, so omitting the year in your dates is a pretty dumb thing to do.

It's pretty obvious but I'll point out anyway that using dd/mm/yyyy or mm/dd/yyyy for dates on sites is a bad idea as well. Us crazy Brits use the former, them crazy Yanks use the latter and everyone ends up thoroughly confused about any written date before the 13th of the month. I personally like to use the full "4th December 2003" format, but when space is limited the least ambiguous format is the ISO 8601 standard YYYY-MM-DD.
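To make the ambiguity concrete, here's a minimal sketch in Python. The function name format_dates is my own invention for illustration; the only library calls are the standard datetime ones.

```python
from datetime import date

def format_dates(d):
    """Return the same date as (dd/mm/yyyy, mm/dd/yyyy, ISO 8601) strings."""
    uk = d.strftime("%d/%m/%Y")   # day first, as used in Britain
    us = d.strftime("%m/%d/%Y")   # month first, as used in the US
    iso = d.isoformat()           # YYYY-MM-DD, no room for confusion
    return uk, us, iso

uk, us, iso = format_dates(date(2003, 12, 4))
print(uk, us, iso)  # prints: 04/12/2003 12/04/2003 2003-12-04
```

The first two strings could each be read as either 4th December or 12th April; the third can't.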

Extracting the length from MP3 files with Python

Ned Batchelder recently wrote about the difficulties involved in extracting the length from an MP3 file. We're going to need to solve this problem soon at work; luckily, it seems that the answer may lie in the Python bindings for mpgedit, an audio file editing library available for both Windows and Linux.

After installing the Windows package and experimenting for a while, I managed to extract the time from one of my test files using the following:

>>> import mpgedit
>>> play = mpgedit.Play('example.mp3')
>>> play.total_time()
(213, 129)
>>> secs, msecs = play.total_time()
>>> mins = secs / 60
>>> secs = secs - mins * 60
>>> print "%d:%02d minutes" % (mins, secs)
3:33 minutes

However, for other files total_time() is returning (-1, -1). I'm sure there's a solution to this but I haven't stumbled across it yet.
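In the meantime, a small helper can at least stop the failure case from being printed as a nonsense time. This is only a sketch: format_length is a hypothetical name of mine, it doesn't touch mpgedit at all, and it assumes (my guess, not something the mpgedit documentation has confirmed for me) that a negative seconds value means "length unknown".

```python
def format_length(total_time):
    """Format a (seconds, milliseconds) pair as M:SS.

    Returns None when the seconds value is negative, which is assumed
    here to mean the length could not be determined.
    """
    secs, msecs = total_time
    if secs < 0:
        return None
    mins, secs = divmod(secs, 60)  # whole minutes and leftover seconds
    return "%d:%02d" % (mins, secs)

print(format_length((213, 129)))  # prints: 3:33
print(format_length((-1, -1)))    # prints: None
```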

Downloading your Hotmail inbox

Adrian just pointed me to a fantastic tool: Gotmail, a utility to download mail from Hotmail accounts. It's a command line utility, written in Perl and making use of the curl binary, which can connect to Hotmail over the web and grab any new emails, saving them locally as an mbox file and deleting them from the Hotmail server.

Naturally, anything like this is completely dependent on Hotmail's design staying the same and maintaining the tool is a constant arms race. At the moment, Hotmail is ahead - a recent upgrade to the Hotmail design (some time in the last few days) has rendered Gotmail useless. A call for help on the Gotmail mailing list from the lead developer makes particularly interesting reading. He's looking for developers and users who can help with the debugging effort required to get the tool working again, but the last paragraph of the email really caught my attention:

Developers: If you have some Python proficiency, and would like to assist in developing the next generation of Gotmail (development name: gotfemail), email me off-list. I have some pretty ambitious plans for this project, and depending on how much is actually implemented, Hotmail breakages should be either self-fixing or very simple to fix. I've done some work on making a generic library for this sort of job (so the fetchyahoo people and others might be interested), and some preliminary work on embedding the Javascript interpreter from the Mozilla project.

A self-fixing screen scraper sounds like one heck of an interesting project, and I can't complain about the choice of development language either ;) If you're a Python hacker looking for a new project this could be well worth checking out.

HTML entities for email addresses: don't bother

I've suspected this for a long time, and now here's the empirical evidence: Popular Spam Protection Technique Doesn't Work. If you're relying on HTML entities to protect your email address from spam harvesters - for example by writing every character of username@example.com as a numeric character reference - your email address may as well be in plain text. Chip Rosenthal downloaded a tool called "Web Data Extractor v4.0" and tried it on some test data to prove once and for all that the technique doesn't work.

My advice is to use your common sense when analysing a potential spam protection technique. If you were a spammer, would you be able to outwit the method? Spammers aren't always very smart, but the people who write spamming tools (and get paid big bucks for them) are. Also remember to think about the payoff - unencoding a bunch of entities is a cheap operation. Embedding a Javascript interpreter to decipher email addresses that are glued together using Javascript at the last possible moment is a lot harder and could slow down a tool, so it may not be worth the effort.
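Just how cheap is unencoding? Here's a rough sketch using nothing but Python's standard html module. The encode_as_entities helper is my own, written only to produce some test input; it isn't taken from any harvesting tool.

```python
import html

def encode_as_entities(text):
    """Write every character as a decimal numeric character reference."""
    return "".join("&#%d;" % ord(ch) for ch in text)

encoded = encode_as_entities("username@example.com")
decoded = html.unescape(encoded)  # one standard library call undoes it

print(encoded[:24])  # prints: &#117;&#115;&#101;&#114;
print(decoded)       # prints: username@example.com
```

If a single library call reverses the "protection", it's safe to assume any serious harvester does the same.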

I'm still pretty confident in my own anti-spam harvester technique of hiding my email address behind a POST form, but even that could eventually be outsmarted by a really dedicated harvesting tool.

Selectutorial

New from Russ Weakley: Selectutorial, which takes his widely acclaimed step-by-step CSS tutorial style and applies it to CSS selectors. Having a full understanding of selectors is critical if you're going to take full advantage of CSS, so if you don't get them yet you should really check this out.

Repartitioning with Knoppix

I've long been bemoaning the fact that if you want to repartition your hard drive to install Linux as a dual boot with an existing Windows system the most frequently recommended method is to buy a copy of Partition Magic. You would have thought the open source software world would have provided a free alternative by now.

Via Andy Todd, it turns out that they have. GNU Parted is a repartitioning tool for Linux. QtParted wraps it in a GUI with a Partition Magic style interface. And the awesome Knoppix comes with QtParted included on the disk. So instead of shelling out for an expensive package that you are unlikely to ever use more than once, you can download and burn a Knoppix CD, boot in to Linux and repartition from there. I'll be trying this out for real on Monday, and I'll report back with the results when I do.

As an aside, has anyone ever found a web page that lists all of the software included on the Knoppix CD?

Update: Closer inspection reveals that Parted can't resize NTFS. Thankfully, ntfsresize can - and ntfsresize is integrated in to QtParted. Magic.

Un-happened

Charles Miller, in Google, Microsoft and Tall Poppies:

Bill Gates' original goal in forming Microsoft was famously to have (emphasis mine) "A computer on every desk and in every home, running Microsoft software". You'll not find the last three words of that sentence in any official Microsoft history (or at least I couldn't, and I searched hard). They've been carefully un-happened: the dream of a nascent monopolist truncated into a facade of altruism.

Google:

Fascinating.

IXR 2.0

Harry Fuecks has been hacking on my XML-RPC library, and has released a new version with some significant changes. His article on phpPatterns describes the changes and provides a link to download the updated code. He's made a bunch of interesting architectural changes which take advantage of a number of useful PEAR classes, including HTTP_Request which provides support for proxies and authentication, two frequently requested features.

I don't know when I'll get a chance to look at my version of the code again, since most of my current development work involves Python rather than PHP. If you're looking for an updated version of the library you would do well to check out Harry's enhancements.

Why run Windows on an ATM?

So you're writing the software for an ATM. It needs to display something pretty on the screen, control the hardware that serves out the money and talk securely to your central servers. It also needs to be stable, secure, reliable and allow remote administration. Why on earth would you choose Windows as the operating system?

Check out this article on The Register: Nachi worm infected Diebold ATMs. This just beggars belief. How a Windows worm spread on to a network with ATMs connected to it is beyond me - even if you take in to account employee laptops plugged in behind the firewall it's still incredible that the ATMs weren't on their own separate secure network.

Here's the best bit:

Billett defended the company's patching process, which he said involves testing each new bug fix, and deploying at a wide variety of institutions with a mix of network architectures. "A lot of those machines actually have to be visited by a service technician" to be patched, said Billett. "Our experience in the past is we are able to turn those around in one or two days."

What do you have to do to patch these things, plug in a keyboard and mouse?

Pyrex

Pyrex is a language for writing Python extension modules. It's pretty interesting - the syntax looks very similar to Python (the authors claim you can write C extension modules without knowing anything about the Python/C API) but uses additional type hints to compile down to ultra efficient C code, ready to be imported in to your Python applications. The prime numbers example makes things a lot clearer:

#
#  Calculate prime numbers
#

def primes(int kmax):
  cdef int n, k, i
  cdef int p[1000]
  result = []
  if kmax > 1000:
    kmax = 1000
  k = 0
  n = 2
  while k < kmax:
    i = 0
    while i < k and n % p[i] <> 0:
      i = i + 1
    if i == k:
      p[k] = n
      k = k + 1
      result.append(n)
    n = n + 1
  return result

bash$ python pyrexc primes.pyx
bash$ gcc -shared primes.o -lxosd -o primes.so

>>> import primes
>>> primes.primes(10)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
>>>

I imagine there's a slight performance impact from using Python's list data structures instead of a more low level C array, but I doubt it's significant. In any case, the real promise of Pyrex lies in making it easier to write Python wrappers for existing C libraries - a topic touched on by the Pyrex Documentation.
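For comparison, here's roughly the same trial-division algorithm in plain Python. This is my own sketch rather than anything from the Pyrex documentation; it keeps the 1000 cap from the example above but has no cdef declarations, so every integer and every list access goes through ordinary Python objects.

```python
def primes(kmax):
    """Return the first kmax primes (capped at 1000) by trial division."""
    kmax = min(kmax, 1000)  # mirrors the fixed-size C array in the Pyrex version
    found = []
    n = 2
    while len(found) < kmax:
        # n is prime if no previously found prime divides it
        if all(n % p != 0 for p in found):
            found.append(n)
        n += 1
    return found

print(primes(10))  # prints: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The Pyrex version differs mainly in the typed loop variables and the C array p, which is where any speed-up would come from.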

Discovering Berkeley DB

I'm working on a project at the moment which involves exporting a whole bunch of data out of an existing system. The system is written in Perl and uses Berkeley DB files for most of its storage.

I'd never done anything with Berkeley DB before, but luckily Python has a module which seems to do all of the hard work for me:

>>> import bsddb
>>> db = bsddb.btopen('xpand.db')
>>> db.keys()[0:10]
[':archives:index.html', ':art:test.html', ... 
>>> db[':art:test.html']
'template;front.tp\x01\x01'
>>> 

The Berkeley DB libraries are maintained by Sleepycat Software. Unfortunately, their site is completely saturated with marketing jargon. "Our customers rely on Berkeley DB for fast, scalable, reliable and cost-effective data management for their mission-critical applications." Great - now what does it do exactly?

Some digging around turned up the real information: the Berkeley DB Tutorial and Reference Guide, which contains pretty much everything you could possibly want to know about the technology. It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
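Those three operations can be sketched with any dictionary-style store. The example below is an illustration only: it uses dbm.dumb from the Python standard library as a stand-in, not Berkeley DB itself, and the file name and the missing key are made up for the demo.

```python
import dbm.dumb
import os
import tempfile

def demo_key_value_store():
    """Store, check for and retrieve a value in an on-disk key/value file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example")  # hypothetical database name
        db = dbm.dumb.open(path, "c")        # "c": create the file if needed
        try:
            # "store this value under this key"
            db[b":art:test.html"] = b"template;front.tp"
            # "check if this key exists"
            exists = b":art:test.html" in db
            missing = b":no:such:key" in db
            # "retrieve the value for this key"
            value = db[b":art:test.html"]
        finally:
            db.close()
        return exists, missing, value

print(demo_key_value_store())  # prints: (True, False, b'template;front.tp')
```

Keys and values are plain byte strings, which matches the "anything that can be crammed in to a string of bytes" description; everything else is up to the application.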

It seems like a great alternative to a full-on relational database for simple applications, although I'm slightly confused by the license which allows free use for open source products but requires a license for commercial applications. Does that mean that if I use the bsddb Python module in a commercial app I need to get a license from Sleepycat?

Feed you

Wow, that's what I call feedback! It's a shame pretty much everyone hates the new design but I like it so it stays. I've taken a few tips though and tweaked the link colours a bit, as well as making a few other small changes such as a darker green for the header and a 1em margin around the page.

In an attempt to satiate the voracious appetite for RSS displayed by some of my visitors I've set up two new feeds: Blog comments and Blogmarks. I don't use an aggregator myself so I'd appreciate feedback on how well they work. I've also put together a blogmarks archive - no search engine yet, but it's on the list.