11.29.06

The dark side of the force

Posted in python, quality at 1:29 am by Tarek Ziadé

Remember the funny post about Python vs Perl on the Python mailing list?

I think it fits TDD pretty well. Here’s a slightly modified version:

EXTERIOR: DAGOBAH — DAY

With Yoda strapped to his back, Luke climbs up one of the
many thick vines that grow in the swamp until he reaches the
Dagobah statistics lab. Panting heavily, he continues his
exercises — writing test-driven code in Python, making
fakes, mock objects.

YODA: Code! Test! Yes. A programmer’s strength flows from non-regression.
Never stop doing tests. The dark side of code maintainability
reach you would without tests. Easy to write without them when
package you create. If once you start down the dark path,
forever will it dominate your destiny, consume you it will.

LUKE: Isn’t it simpler not to do tests?

YODA: No… no… no. Quicker, easier, more seductive.

LUKE: So why should I do tests then?

YODA: You will know. When your code you try to correct six months from
now.

11.06.06

How to avoid drowning in the huge daily flow of RSS feeds you receive

Posted in buzz, datamining, marketing, python, rss, zope at 7:32 pm by Tarek Ziadé

Feeds, feeds, feeds everywhere. I can’t keep up with all the incoming data. I am drowning!

I thought Digg and similar services would help me with this, but they just moved the problem up to a meta-level.

It takes me at least 20 minutes per day to:

  • Remove duplicate entries
  • Look over hundreds of entries to select the ones worth reading, by looking at the title, the origin and the tags (this takes 80% of the time)
  • Read my selection and, even though I know this will make things worse, add the news feeds I’ve found through my selection.

I had to automate some of this daily activity. I had to cut down this huge amount of data and get closer to what I needed to read.

But how?

First of all, let’s analyze what a feed reader wants. It can be summed up in two things:

  • She wants to get the best news in her field of interest, out of a huge list of RSS feeds.
  • She wants to keep an eye on what’s going on out there, and let her field of interest evolve with the help of a few indicators.

Most of the work can be done by pieces of software. I think this is called web mining, which would be a part of data mining, but I am not an expert. Let’s skip the big words: a couple of Python scripts can do the job.

That’s Atomisator’s job!

Atomisator

Atomisator grabs multiple feeds out there, removes duplicate news items by using the Levenshtein distance, and creates a new feed out of the result.
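To give an idea of how Levenshtein-based deduplication can work, here is a minimal sketch. It is not Atomisator’s actual code: it assumes entries are compared on their titles only, and the names `levenshtein`, `remove_duplicates` and the `max_ratio` threshold are made up for the illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between the strings a and b."""
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def remove_duplicates(titles, max_ratio=0.2):
    """Keep a title only if it is not too close to one already kept.

    max_ratio is a hypothetical threshold: the distance divided by the
    length of the longer of the two titles.
    """
    kept = []
    for title in titles:
        normalized = title.lower().strip()
        for other in kept:
            other_normalized = other.lower().strip()
            longest = max(len(normalized), len(other_normalized), 1)
            if levenshtein(normalized, other_normalized) / longest <= max_ratio:
                break  # close enough to an entry we already have
        else:
            kept.append(title)
    return kept
```

Comparing every new title to every kept one is quadratic, which is probably fine for a few hundred entries per run; the threshold would have to be tuned on real feeds.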

It also provides a way to filter entries on the fly, by looking at the content of each entry. A simple filter, for instance, will validate an entry if it contains certain words. The filtering system is pluggable and works as a transformation chain, so new filters can be written to fine-tune the entries. This makes it possible to create several custom feeds from the same pile of data.
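A pluggable chain like this can be sketched in a few lines. This is only an illustration, not Atomisator’s API: it assumes an entry is a dict with `title` and `summary` keys, that a filter is a callable returning the (possibly transformed) entry or `None` to drop it, and the names `contains_words`, `strip_title` and `apply_chain` are invented here.

```python
def contains_words(*words):
    """Build a filter that keeps an entry only if it mentions one of the words."""
    wanted = [word.lower() for word in words]

    def _filter(entry):
        text = (entry["title"] + " " + entry["summary"]).lower()
        if any(word in text for word in wanted):
            return entry
        return None  # entry rejected
    return _filter


def strip_title(entry):
    """A transforming filter: returns a cleaned copy of the entry."""
    cleaned = dict(entry)
    cleaned["title"] = entry["title"].strip()
    return cleaned


def apply_chain(entries, filters):
    """Run every entry through the filters, in order; None drops the entry."""
    for entry in entries:
        for one_filter in filters:
            entry = one_filter(entry)
            if entry is None:
                break
        else:
            yield entry
```

Calling `apply_chain` twice on the same entries with two different filter lists is one way to build several custom feeds from the same data.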

Another field of investigation was to use a Bayesian network to filter entries, but it doesn’t work well: relevance shifts too fast on this kind of news, and an inference mechanism would only work on static news topics (historical events, maybe?).

Last but not least, a special filter is used to collect statistics and compute a buzz-o-meter report. This report lists the 50 most used words over all sources, after filtering them against a dictionary of common English words. It doesn’t use the tags like most tag clouds do, because people often use the same tags over and over, without really thinking about it.

A scan of the post content is way better: you get the REAL tags.
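The word count behind such a report can be approximated as follows. This is a rough sketch rather than the real buzz-o-meter: `COMMON_WORDS` is a tiny stand-in for the common words dictionary, and `buzz_report` is a hypothetical name.

```python
import re
from collections import Counter

# A tiny stand-in for a real dictionary of common English words.
COMMON_WORDS = frozenset([
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "for", "on", "with", "this", "that",
])

WORD = re.compile(r"[a-z]+")


def buzz_report(contents, top=50, common_words=COMMON_WORDS):
    """Return the `top` most used words over all post contents."""
    counts = Counter()
    for content in contents:
        for word in WORD.findall(content.lower()):
            # skip single letters and common words
            if len(word) > 1 and word not in common_words:
                counts[word] += 1
    return counts.most_common(top)
```

For example, `buzz_report(["Ajax is the new thing", "Ruby and Ajax"], top=1)` gives `[("ajax", 2)]`.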

That’s how I keep an eye on what’s most talked about, even when it’s not in my custom feed: I can adapt the feed afterwards.

OK, now here are the feeds I use, updated every 30 minutes:

Get it all here in this page


This is pretty handy for my daily job. The buzz-o-meter is for fun, but from time to time I see new words pop up in the list that I can investigate. It also shows that Ajax and Ruby lead the buzz-o-sphere.

The source code is GPL and available here, but it is still poorly documented and not packaged; that should improve soon. The version running the feeds is a bit different from the trunk, and some merging will be done soon. I started this project some months ago, and just wrapped it up and relaunched it yesterday.

N.B.: I got a few comments on how poorly written this blog entry was. If you find mistakes or badly turned sentences, don’t hesitate: tell me! (I am French ;) )

11.01.06

Protecting a Python svn code base with the pre-commit hook

Posted in python, quality at 12:29 pm by Tarek Ziadé

In a community project that is open to various contributors, there are a few things to take care of in order not to break the code. I am not talking about code review but about bad code editing that breaks everything.

The most frequent errors are:

  • <tab> insertions that get mixed with space-based indentation
  • carriage returns inserted before line feeds by some weird Windows editors

Instead of tracking down the developers who committed such things and sending them an army of white rabbits, an automatic code check is way better. It is not a good idea, though, to clean up incoming code automatically. The best thing to do is to block unwanted changes and warn the committers, so they learn about it.

This is really easy with Subversion. I grabbed a script from the web and adapted it a bit for this task. Here it is:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-
# adapted from:
#   http://blog.wordaligned.org/articles/2006/08/09/a-subversion-pre-commit-hook
# by Tarek Ziadé 

from subprocess import Popen
from subprocess import PIPE
import re
import os
import sys

re_options = re.IGNORECASE | re.MULTILINE | re.DOTALL

class EOF(object):
    """Regexp-like tester that reports a missing final new line."""
    def findall(self, content):
        # an empty file (an empty __init__.py for instance) is fine
        if not content or content.endswith('\n'):
            return []
        return ['\n']

# a TAB at the beginning of a line
tab_catcher = re.compile(r'^\t', re_options)
# any CR/LF pair; a trailing `$` would miss pairs followed by more text
windows_catcher = re.compile(r'\r\n', re_options)

testers = (('found TAB', tab_catcher),
           ('found CR/LF', windows_catcher),
           ('no new line at the end', EOF()))

def command_output(cmd):
    """Capture a command's standard output."""
    # cmd is split on whitespace, so paths must not contain spaces
    return Popen(cmd.split(), stdout=PIPE).communicate()[0]

def files_changed(look_cmd):
    """List the files added or updated by this transaction."""
    def filename(line):
        # `svnlook changed` prints a status column, then the path
        return line[4:]

    def added_or_updated(line):
        return line and line[0] in ("A", "U")

    return [filename(line) for line in
            command_output(look_cmd % "changed").split("\n")
            if added_or_updated(line)]

def file_contents(filename, look_cmd):
    """Return a file's contents for this transaction."""
    return command_output("%s %s" % (look_cmd % "cat", filename))

def test_expression(expr, filename, look_cmd):
    """Test a regexp over a file."""
    return len(expr.findall(file_contents(filename, look_cmd))) > 0

def check_file(look_cmd):
    """Check the Python files in this transaction."""
    def is_python_file(fname):
        return os.path.splitext(fname)[1] in ".py".split()

    erroneous_files = []

    for fname in files_changed(look_cmd):
        if not is_python_file(fname):
            continue

        for error_type, tester in testers:
            if test_expression(tester, fname, look_cmd):
                erroneous_files.append((error_type, fname))

    num_failures = len(erroneous_files)

    if num_failures > 0:
        sys.stderr.write("[ERROR] please check your files:\n")
        for error_type, fname in erroneous_files:
            sys.stderr.write("[ERROR] %s in %s\n" % (error_type, fname))

    return num_failures

def main():
    from optparse import OptionParser
    parser = OptionParser()
    parser.add_option("-r", "--revision",
                      help="Test mode. TXN actually refers to a revision.",
                      action="store_true", default=False)
    errors = 0
    (opts, (repos, txn_or_rvn)) = parser.parse_args()
    look_opt = ("--transaction", "--revision")[opts.revision]
    look_cmd = "svnlook %s %s %s %s" % (
        "%s", repos, look_opt, txn_or_rvn)
    errors += check_file(look_cmd)

    # a non-zero exit status makes the hook reject the commit
    return errors

if __name__ == "__main__":
    sys.exit(main())

I’ve also added a check for a new line at the end of the file. This script has to be called from the pre-commit hook script (see the SVN documentation).

The call should look like:

/path/to/script/svn_check_source.py "$REPOS" "$TXN" || exit 1
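The checks themselves can be tried without a repository. The following is a minimal Python 3 sketch that applies the same three ideas to an in-memory string; the function name `source_problems` and the regexp names are hypothetical and not part of the hook script.

```python
import re

# a TAB at the beginning of any line
TAB_AT_LINE_START = re.compile(r"^\t", re.MULTILINE)
# a carriage return followed by a line feed, anywhere in the file
CRLF = re.compile(r"\r\n")


def source_problems(content):
    """Return the list of problems found in a source file's content."""
    problems = []
    if TAB_AT_LINE_START.search(content):
        problems.append("found TAB")
    if CRLF.search(content):
        problems.append("found CR/LF")
    # an empty file has no last line to terminate
    if content and not content.endswith("\n"):
        problems.append("no new line at the end")
    return problems
```

A clean file such as `"def f():\n    return 1\n"` yields an empty list, which is what the hook needs in order to let a commit through.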

You can then extend the checks made by this script, for example by checking the quality of the committed code with tools like pychecker. But these extra checks should not block commits, and they should be quite light to perform, because the committer waits for the changeset to be validated.

Sending a mail to the committer with suggestions when her code doesn’t pass some quality checks is a better idea. Furthermore, for an extensive QA test, it is simpler to hook a script into a system like buildbot and create a nightly digest over the whole code base. Pylint is very handy for this kind of check, and can be fine-tuned to generate a useful QA report that buildbot can send to developers.