10.23.07

iw.recipe.fss : a recipe to install File System Storage

Posted in plone, python, zope tagged at 7:56 am by Tarek Ziadé

 

What is FSS ?

When you need to work with a lot of static files in your Plone website, you should consider using File System Storage (FSS). It’s an Archetypes storage that can handle files such as PDFs, images or small videos, and it keeps them out of the ZODB so the database does not grow. It’s not like Blobs, because the files are not transactional. This means you won’t have to worry about network performance when you use ZEO: nothing is copied from one node to another. You just have to use an NFS mount point to make sure all nodes use the same files.

The missing part, though, was a way to deploy FSS easily in the standard way.

 

Deploying FSS

In the Plone world, zc.buildout is now the leading project in the deployment area. Everyone should read Martin’s tutorial on how to use it. It makes a typical Plone site deployable in a matter of minutes with no pain. It is based on a configuration reader that instantiates recipe objects, each in charge of installing one part of the system. You have recipes for Apache, LDAP, etc. See the existing public recipes at the Cheeseshop.
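To give an idea of what such a recipe object looks like, here is a minimal sketch. It follows the usual zc.buildout convention (a class that receives the buildout, the part name and the part options, and exposes install and update), but the MakeDirs class and its paths option are hypothetical and are not taken from any published recipe.

```python
import os
import tempfile


class MakeDirs(object):
    """Hypothetical recipe: creates the directories listed in 'paths'."""

    def __init__(self, buildout, name, options):
        # buildout hands over the whole configuration, the part name
        # and the options found in that part of the configuration file
        self.name = name
        self.paths = options.get('paths', '').split()

    def install(self):
        created = []
        for path in self.paths:
            if not os.path.isdir(path):
                os.makedirs(path)
                created.append(path)
        # the returned paths are what buildout associates with the part
        return tuple(created)

    def update(self):
        # called instead of install() when the part is already installed
        return self.install()


# usage: buildout would normally do this itself
target = os.path.join(tempfile.mkdtemp(), 'var', 'storage')
recipe = MakeDirs({}, 'storage', {'paths': target})
created = recipe.install()
```

A real recipe would also be registered as a setuptools entry point so that buildout can find it from the recipe option of a part.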

iw.recipe.fss is a recipe that takes care of creating file system folders and the configuration file used by the Product.

To use it, just insert a section in your buildout file (it’s called a part in buildout language) that describes each system storage, and the path to the configuration file:

[buildout]
parts:
    …
    fss
    …

[fss]
recipe = iw.recipe.fss
conf = ${zopeinstance:location}/etc/plone-filesystemstorage.conf

storages =
    storage_name /site/storage_path directory

See the recipe’s README.txt for all options.

That’s it. I love buildout.

 

More recipes to come

There are a lot of recipes available in the Python Index, but we are still missing some to cover every kind of deployment. Besides the Plone recipes, here are the recipes that I find really useful:

  • infrae.subversion: a clean, simple way to check out a piece of code from a Subversion repository. It’s a perfect one to create developer buildouts.
  • zc.recipe.cmmi: the configure-make-make install recipe. It performs a compilation and a local installation of Makefile-compatible packages under BSD/Linux.
  • zc.recipe.egg: this one is very useful if an egg needs a special environment. For example if it needs a compiled library created with zc.recipe.cmmi.

We have more recipes being coded here at Ingeniweb, to cover our needs in deploying Plone instances with buildout. I’ll try to blog about them every time we update the Cheeseshop. If you create recipes yourself, I encourage you to share them on the Cheeseshop. Sharing recipes, in my humble opinion, is an important thing to do in a community: it helps standards emerge because it shows how people use software in real infrastructures.

10.18.07

Unobtrusive benchmark and debug of Python applications

Posted in plone, python, quality, zope tagged at 9:00 pm by Tarek Ziadé

There are many tools available for Python to perform benchmarks and debugging. For example:

  • Hotshot is bundled in the standard library and provides useful data. If I recall correctly, you may have to install an extra package on some Linux distributions, because it’s not GPL;
  • iPython provides a nice interface to perform live debugging, like automatic invocation of pdb on exceptions;
  • the standard test module provides pystone, which lets you benchmark the computer in use before running the timed tests. This is helpful for benchmarking the code on several computers: the measurements can be expressed in pystones. In other words, you get a reproducible measure of a piece of code and can work on the code complexity to make it faster. In reality, any interference can change the results, but this is true as well for time measures;
  • all big Python frameworks use the logging module, so it’s easy to hook into it if extra logging is needed.

But when you instrument an application to find out why some features are slow, or why something goes wrong, it’s not always easy to set up precisely what you want to log when something is slow, or what you want to hook when a bug appears. The simplest way is to call the tools mentioned above from the code, but that is too obtrusive. Another way is to use decorators.

To do so, you’ll have to:

  • get and install iw.quality
  • create the benchmarking or debugging module

Get and install iw.quality

iw.quality gathers helpers for QA. It has an implementation of the Levenshtein distance discussed earlier, and now a decorator used for benchmarking and debugging purposes. Since it’s available on PyPI, you should be able to install it like this:

$ easy_install iw.quality

See the setuptools documentation if you need to install easy_install itself.

Preparing the benchmark or the debugging

Whether you are about to benchmark or debug your program, you need to list all the places in your code where you need to hook a log or a pdb. Then you can create a specialized python module that can be used when needed. This module will simply decorate the functions you want to work with.

Benchmarking

Here’s an example: let’s instrument sqlalchemy for benchmarking.

benchmarking.py file:

#
# benchmarking queries
#
from iw.quality.decorators import log_time
import sqlalchemy

def logger(msg):
    print msg

simple_logger = log_time(logger=logger)

sqlalchemy.create_engine = simple_logger(sqlalchemy.create_engine)
sqlalchemy.engine.Engine.execute = simple_logger(sqlalchemy.engine.Engine.execute)

The log_time decorator comes with a few parameters, like logger, which is called with the log message. By default it uses logging.info, but you can use your own, as in the example. The chosen functions are then decorated.

Let’s use it:

>>> import benchmarking      # applies the decorators
>>> from sqlalchemy import *

>>> db = create_engine('sqlite:///:memory:')
log_time::2007-10-18T21:50:52.352037::0.013::function 'create_engine', args: ('sqlite:///:memory:',), kw: {}
>>> db.execute('create table TEST(id int)')
log_time::2007-10-18T21:52:13.761085::0.104::function 'execute', args: (<sqlalchemy.engine.base.Engine object at 0x12e6e90>, 'create table TEST(id int)'), kw: {}
<sqlalchemy.engine.base.ResultProxy object at 0x12e6ff0>
>>> db.execute('insert into TEST (id) values (1)')
log_time::2007-10-18T21:52:50.265860::0.000::function 'execute', args: (<sqlalchemy.engine.base.Engine object at 0x12e6e90>, 'insert into TEST (id) values (1)'), kw: {}
<sqlalchemy.engine.base.ResultProxy object at 0x12f50d0>

If you need to display more information about the call, you can use your own formatter instead of the provided one. Let’s extend the benchmark file:

def formatter(execution_time, function, args, kw):
    return '%s = %.3f ms' % (function, execution_time)

simple_logger = log_time(logger=logger, formatter=formatter)

And rerun some code:

>>> from sqlalchemy import *
>>> db = create_engine('sqlite:///:memory:')
<function create_engine at 0x12328b0> = 0.014 ms

You can add a threshold on the function timing, to log only the calls that reach this threshold. This is useful to filter the output a bit.
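To make the mechanics concrete, here is a minimal sketch of how a timing decorator with a logger, a formatter and a threshold could be written. It is not the iw.quality implementation: the timed name, the threshold parameter name and the default message format are assumptions made for the illustration.

```python
import functools
import logging
import time


def timed(logger=logging.info, formatter=None, threshold=0.0):
    """Hypothetical stand-in for a log_time-like decorator factory."""

    def default_formatter(execution_time, function, args, kw):
        return '%s took %.3f s' % (function.__name__, execution_time)

    fmt = formatter or default_formatter

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kw):
            start = time.perf_counter()
            try:
                return function(*args, **kw)
            finally:
                elapsed = time.perf_counter() - start
                # only report calls that took at least `threshold` seconds
                if elapsed >= threshold:
                    logger(fmt(elapsed, function, args, kw))
        return wrapper
    return decorator


# usage: collect the messages in a list instead of printing them
messages = []


@timed(logger=messages.append)
def add(a, b):
    return a + b


result = add(1, 2)
```

The measurement sits in a finally clause so that a call is still reported when the decorated function raises.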

Debugging

For debugging purposes, you can use the debugger parameter:

def debug(e):
    import pdb
    pdb.set_trace()

simple_logger = log_time(logger=logger, formatter=formatter,
                         debugger=debug)

It will be called in case of an Exception:

from sqlalchemy import *

db = create_engine('gcckc')
--Return--
/Users/tziade/tests/benchmarking.py(14)debug()
(Pdb) c
/Users/tziade/tests/benchmarking.py(10)logger()
(Pdb) c
<function create_engine at 0x12328f0> = 3.544 ms

Traceback (most recent call last):
...
    raise e
sqlalchemy.exceptions.ArgumentError: Could not parse rfc1738 URL from string 'gcckc'
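The behaviour seen in this session (the hook is called, then the exception still propagates) can be sketched with a few lines. This is an illustration under assumptions, not the iw.quality code: with_debugger and parse_port are hypothetical names.

```python
import functools


def with_debugger(debugger):
    """Hypothetical decorator factory: call debugger(exc) on failure."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kw):
            try:
                return function(*args, **kw)
            except Exception as exc:
                # e.g. a function that calls pdb.set_trace()
                debugger(exc)
                # the original exception still propagates to the caller
                raise
        return wrapper
    return decorator


# usage: record the exceptions instead of opening a debugger
seen = []


@with_debugger(seen.append)
def parse_port(value):
    return int(value)
```

In an interactive session the hook could be a function that opens pdb; recording the exception in a list, as here, is handy when the instrumented code runs unattended.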

Conclusion

By using this simple decorator, it’s easy to group benchmarking and debugging in a specialized module, and generate custom reports. The best practice is to create one module per use case. I didn’t hook it on Hotshot or other utilities to let people use the tools they like.

10.08.07

Make your code base healthier: the anti-cheater pattern

Posted in plone, python, zope tagged at 7:20 pm by Tarek Ziadé

A few years ago I gave some courses to college students. They had to write small C++ programs and send them to me before the end of the course, so I could correct and grade them.

They were massively cheating :)

I created an anti-cheat tool to try to find the cheaters, mainly for fun and out of curiosity. To make it efficient, I tried to understand how people were cheating. They were cutting and pasting pieces of code from various people, and were arranging them so they would look original. They were changing the comments of course, and moving functions around so I wouldn’t recognize the class similarities when I was glancing through the code. But when they were cheating, 80% of the code would look similar to the programs they borrowed from. When they were composing a new program out of several sources, the atomic block was roughly the function: most of the time it came from a single source. In other words, there were a few students zero (like patient zero in epidemics), who were providing a help-yourself, copy-and-paste catalog of functions for 30 fellow students.

The Levenshtein distance worked great there in finding the cheaters: given two strings, it computes the number of single-character edits (insertions, deletions, substitutions) needed to get from one string to the other. A similarity ratio between the two strings can then be derived from it. Roughly, when its value is 0.7 or higher, we can consider that the two are quite similar. This can be applied to function code as well.
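For readers who have never seen it, the distance fits in a few lines of dynamic programming. The sketch below is a plain-Python illustration; the similarity function uses one simple normalisation (distance divided by the longer length), which is an assumption and not necessarily the formula used by the tool described here or by any particular library.

```python
def levenshtein_distance(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, lower otherwise."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / float(longest)
```

Only two rows of the table are kept in memory at any time, which matters when the strings are whole function bodies.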

It was quite funny to use this tool at the end of the course, as I could point out the cheaters immediately. I heard that some colleges now use such programs to detect plagiarism in students’ homework.

In software, developers act the same way: in the very same code base, as long as there is more than one developer involved, you will always find the same functions duplicated again and again. Sometimes the duplication is not done on purpose: it’s a natural consequence of how the APIs are used. In other words, a refactoring is needed to remove the duplicate code. Most of the time this is done by agile developers who smell the need: why isn’t this function made generic and moved into the base class?

So it’s a good practice to hunt for duplicates, to make the code smaller, thus more robust. But it’s hard to see all duplicates, as it takes a lot of code reviewing time.

That’s where the anti-cheater pattern is useful

The pattern can be applied in two steps:

  1. Parsing the code and applying a bit of filtering.
  2. Calculating the distance, and reporting similarities.

 

Parsing the code

The first step is to read the code, using the compiler module. This is the cleanest way to extract the functions because the module parses the code and renders an Abstract Syntax Tree (AST) that can be browsed without having to import, and thus compile, the code. Regular expressions could work, but would be painful to write. In the AST, each node represents a piece of the program, with a specialized class. For example, a function is a node of the Function class, with a list of children that represent the content of the function. compiler also provides a visitor pattern that lets us set a hook every time a function, a module, a class or anything else gets traversed by the walker.

Below is the visitor used to parse the code for our purpose:

import compiler
import os

# maps a unique key to the CleanNode of each registered function
registered_code = {}

class CleanNode(object):

    def __init__(self, node):
        # string form of the children, skipping None and plain strings
        code = [str(el) for el in node.getChildren()
                if el is not None and not isinstance(el, basestring)]
        # too small
        self.small = len(code) < 5
        self.code = ' '.join(code)
        self.name = node.name
        self.filename = node.filename
        self.key = node.key

        if hasattr(node, 'klass'):
            self.klass = node.klass

class CodeSeeker(object):

    def __init__(self, filename):
        """compiles the AST"""
        self.filename = os.path.realpath(filename)
        self.node = compiler.parseFile(self.filename)
        res = compiler.walk(self.node, self)

    def _key(self, node):
        """calculates an unique key for a node"""
        if hasattr(node, 'klass'):
            return '%s %s.%s:%s' % (node.filename, node.klass,
                                    node.name, node.lineno)
        else:
            return '%s %s:%s' % (node.filename, node.name, node.lineno)

    def _clean(self, node):
        return CleanNode(node)

    def register(self, node):
        """register the node"""
        node.filename = self.filename
        node.key = self._key(node)
        node = self._clean(node)
        if not node.small:
            registered_code[node.key] = node

    #
    # compiler walker APIs
    #
    def visitFunction(self, t):
        self.register(t)

    def visitClass(self, t):
        for subnode in t.getChildren():
            if subnode.__class__ not in (compiler.ast.Stmt, compiler.ast.Function):
                continue
            for f in subnode.getChildren():
                if f is None or isinstance(f, str):
                    continue
                # remember the class name so that _key can use it
                f.klass = t.name
                self.visit(f)

def register_module(filename):
    """registers a module"""
    CodeSeeker(filename)

def register_folder(folder):
    """walk a folder and register python modules"""
    for root, dirs, files in os.walk(folder):
        if os.path.split(root)[-1] == 'tests':
            continue
        for file in files:
            if file.endswith('.py') and file != 'interface.py':
                register_module(os.path.join(root, file))

Every function, inside and outside classes, is registered in registered_code with some metadata. There are a few filters, as you can see. Some are Plone-specific (like the omission of tests folders and interface.py files), and one removes very small functions (when there are fewer than 5 nodes in the function, including its name, parameters, etc.). This is the playground for our Levenshtein algorithm.
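The compiler module only exists in Python 2. On an interpreter that does not have it, roughly the same registration can be sketched with the standard ast module. This is an illustration and not a port of the code above: the FunctionCollector class, the key format and the sample source are all made up for the example, and the size filter is left out.

```python
import ast

SOURCE = '''
def top(a, b):
    total = a + b
    return total

class Folder(object):
    def search(self, query):
        return [item for item in self.items if query in item]
'''


class FunctionCollector(ast.NodeVisitor):
    """Collects a string dump of every function, keyed by qualified name."""

    def __init__(self):
        self.functions = {}
        self._classes = []

    def visit_ClassDef(self, node):
        # remember the enclosing class while its body is visited
        self._classes.append(node.name)
        self.generic_visit(node)
        self._classes.pop()

    def visit_FunctionDef(self, node):
        name = '.'.join(self._classes + [node.name])
        key = '%s:%d' % (name, node.lineno)
        # keep a string form of the body, as input for the distance step
        self.functions[key] = ' '.join(ast.dump(stmt) for stmt in node.body)
        self.generic_visit(node)


collector = FunctionCollector()
collector.visit(ast.parse(SOURCE))
```

As with compiler, the source is only parsed, never imported or executed.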

 

Calculating the distance

Now we can compare each function with every other one, and get a ratio. When the value is above 0.7, we can consider that the code is pretty similar. I have used David Necas’s package, http://trific.ath.cx/resources/python/levenshtein/, for this, because it’s fast and really simple to use:

from Levenshtein import ratio

def levenshtein(entry1, entry2):
    """returns the ratio"""
    return ratio(entry1, entry2)

Using it over our dictionary will look like this:

def search_similarities():
    similar = []
    items = registered_code.items()
    done = []
    for key, value in items:
        done.append(key)
        code = str(value.code)
        for key2, value2 in items:
            if key2 == key or key2 in done:
                continue
            code2 = str(value2.code)
            ratio = levenshtein(code, code2)
            if ratio > 0.7:
                similar.append((ratio, value.key, value2.key))
    similar.sort()
    similar.reverse()
    return similar

Of course we could do some caching, and use an iterator to optimize the code (the sorting is not really needed).
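As a sketch of the iterator idea, the pairs can be produced lazily with itertools.combinations, which also removes the need for the done list. To keep the example self-contained, difflib.SequenceMatcher from the standard library stands in for the Levenshtein ratio (its values differ from the Levenshtein package’s), and the dictionary maps keys directly to code strings rather than to node objects. Both choices, and the iter_similar name, are assumptions of this example.

```python
import difflib
import itertools


def iter_similar(registered, threshold=0.7):
    """Yield (ratio, key1, key2) for each pair above the threshold."""
    pairs = itertools.combinations(sorted(registered.items()), 2)
    for (key1, code1), (key2, code2) in pairs:
        matcher = difflib.SequenceMatcher(None, code1, code2)
        # quick_ratio() is a cheap upper bound on ratio(),
        # so pairs that fail it can be skipped early
        if matcher.quick_ratio() <= threshold:
            continue
        ratio = matcher.ratio()
        if ratio > threshold:
            yield ratio, key1, key2


# usage with made-up entries
sample = {
    'a.py f:1': 'Assign Name total Add Name a Name b Return Name total',
    'b.py g:9': 'Assign Name total Add Name x Name y Return Name total',
    'c.py h:3': 'Pass',
}
pairs = list(iter_similar(sample))
```

Each pair is visited once, and nothing is sorted or stored unless the caller asks for it.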

 

Demo in Plone and Zope

Let’s run this pattern over Plone and Zope. I have tried it on Plone 3 lib and Zope 3 lib (within Zope 2.10).

For example, addOpenIdPlugin in plone.openid.plugins.oid:

def addOpenIdPlugin(self, id, title='', REQUEST=None):
    """Add a OpenID plugin to a Pluggable Authentication Service.
    """
    p=OpenIdPlugin(id, title)
    self._setObject(p.getId(), p)

    if REQUEST is not None:
        REQUEST["RESPONSE"].redirect("%s/manage_workspace"
                "?manage_tabs_message=OpenID+plugin+added." %
                self.absolute_url())

was flagged as similar to manage_addSessionPlugin in plone.session.plugins.session:

def manage_addSessionPlugin(dispatcher, id, title=None, path='/', REQUEST=None):
    """Add a session plugin."""
    sp=SessionPlugin(id, title=title, path=path)
    dispatcher._setObject(id, sp)

    if REQUEST is not None:
        REQUEST.RESPONSE.redirect('%s/manage_workspace?'
                               'manage_tabs_message=Session+plugin+created.' %
                               dispatcher.absolute_url())

This tells us that a common API could be written like this (if it doesn’t already exist):

def addPlugin(container, klass, id, REQUEST=None, **kw):
    """adds a plugin"""
    # title, path and other plugin-specific arguments come through kw
    plugin = klass(id, **kw)
    container._setObject(plugin.getId(), plugin)

    if REQUEST is not None:
        REQUEST.RESPONSE.redirect(('%s/manage_workspace?manage_tabs_message'
                                   '=Plugin+created') % container.absolute_url())

It would avoid having to write such boilerplate code.

Another example, in Zope 2.10’s zope.app folder. The class SimpleViewClass in zope.app.pagetemplate.simpleviewclass looks very similar to the one found in zope.app.onlinehelp.onlinehelptopic. Mmmm it’s exactly the same in fact ! ;)

Last example. In zope.app.authentication.principalfolder, in PrincipalFolder class:

def search(self, query, start=None, batch_size=None):
    """Search through this principal provider."""
    search = query.get('search')
    if search is None:
        return
    search = search.lower()
    n = 1
    for i, value in enumerate(self.values()):
        if (search in value.title.lower() or
            search in value.description.lower() or
            search in value.login.lower()):
            if not ((start is not None and i < start)
                    or (batch_size is not None and n > batch_size)):
                n += 1
                yield self.prefix + value.__name__

This is very similar to GroupFolder’s one in zope.app.authentication.groupfolder:

def search(self, query, start=None, batch_size=None):
    """ Search for groups"""
    search = query.get('search')
    if search is not None:
        n = 0
        search = search.lower()
        for i, (id, groupinfo) in enumerate(self.items()):
            if (search in groupinfo.title.lower() or
                (groupinfo.description and
                 search in groupinfo.description.lower())):
                if not ((start is not None and i < start)
                        or
                        (batch_size is not None and n >= batch_size)):
                    n += 1
                    yield self.prefix + id

This should be common code as well, in the base class.
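A rough idea of what the shared part could look like: the matching rule is passed in as a callable, and the start and batch_size handling is written once. The batched_matches helper and its arguments are hypothetical, this is not code from zope.app.authentication, and the sample data is made up.

```python
def batched_matches(items, matches, start=None, batch_size=None):
    """Yield the ids of matching items, honouring start and batch_size.

    `items` is an iterable of (id, value) pairs and `matches` is a
    callable taking a value. Both names are hypothetical.
    """
    found = 0
    for index, (item_id, value) in enumerate(items):
        if not matches(value):
            continue
        if start is not None and index < start:
            continue
        if batch_size is not None and found >= batch_size:
            # the batch is full, nothing else can be yielded
            return
        found += 1
        yield item_id


# usage with made-up data
titles = [('g1', 'Admins'), ('g2', 'Editors'),
          ('g3', 'Site admins'), ('g4', 'Old admins')]


def is_admin(title):
    return 'admin' in title.lower()


result = list(batched_matches(titles, is_admin, start=1, batch_size=1))
```

Each folder class would then only supply its own matching rule and the way it builds an id.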

 

Conclusion

My example is not very clean though:

  • the distance is calculated on the string representation of the tree of each function, and this is probably not optimal;
  • the ratio threshold was set by trying out the pattern on a few code bases;
  • the whole thing is quite slow (I’m not Lundh).

So it’s more likely to be a base for a better implementation, or maybe a CheeseCake addon ?

But this works for me at this stage, and lets me find duplicate code. I’m thinking of hooking it into a buildbot, to analyze what developers commit and, when a similarity is found, send a mail with a few suggestions.

10.05.07

Testing code that calls third party servers

Posted in plone, python, quality, zope at 10:03 am by Tarek Ziadé

One of the fundamentals of unit testing is that the unit test should never depend on any external resource. This is true for all data that might be needed to run the tests, but also for third party servers like LDAP or SQL: they have to be faked.

  • LDAP is quite painful to fake. The simplest way is to write all the tests against a real LDAP server, then replace it with a class that returns an explicit response for each explicit request. This is manageable when the LDAP layer is well done and easy to patch.
  • Mailhost is quite easy to patch in the test fixture: printing back the mail sent instead of calling smtplib lets you write doctests and unit tests without depending on an SMTP server.
  • For SQL, the simplest way, as long as you use a library that knows how to call different DBs through DBAPI, is to use a flat-file DB system. I use sqlalchemy, and patching it in my test fixtures is as easy as patching one line: the sqluri. For example, 'mysql://user:pass@server/base' will become 'pysqlite:///path/to/package/tests/data/test.db'. The tests then interact with a SQLite file, and as long as your code uses sqlalchemy APIs, everything should work as if the DB was MySQL or Postgres. The only difference I can think of is the DB unicode settings, which might be different on the production server, so be careful in your doctests when you test strings.
  • For other third-party elements, mocking can help!
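As an illustration of the mail case, here is a minimal sketch that replaces smtplib.SMTP during a test, using the mock library that ships with current Python versions as unittest.mock. The notify function stands for hypothetical production code, and sent_messages is a hypothetical test helper.

```python
import smtplib
from unittest import mock


def notify(recipient, body, host='localhost'):
    """Hypothetical production code: sends a message through SMTP."""
    server = smtplib.SMTP(host)
    try:
        server.sendmail('noreply@example.com', [recipient], body)
    finally:
        server.quit()


def sent_messages(func, *args, **kw):
    """Run func with smtplib.SMTP faked and return the sendmail calls."""
    with mock.patch('smtplib.SMTP') as fake_smtp:
        func(*args, **kw)
        return fake_smtp.return_value.sendmail.call_args_list


# usage: no SMTP server is contacted here
calls = sent_messages(notify, 'bob@example.com', 'hello')
```

The test can then assert on what would have been sent, which is the same idea as printing the mail back in a doctest.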

10.04.07

4 tips to keep in mind about the ZODB

Posted in plone, python, zope tagged at 2:26 am by Tarek Ziadé

Plone is a great tool, probably one of the best CMSs out there, that lets you rapidly create an advanced web application. But there are some real pitfalls you might fall into if you are not familiar with the underlying technology. This post provides a list of tips that can help you out. I am not pretending to give you a set of solutions to make your website fast and scalable, but by following those tips, you will think the way experienced Zope developers think when they code an application.

  1. Ask your customer what the load will be.

    It’s very important to know this when you are starting a Plone project. It can greatly affect how you are going to work: if the application needs to deal with hundreds of hits per minute and store gigabytes of data, you won’t organize it the same way as if the load is tiny, like a few hits a day. Project the data load into the future as well: how big will the site be in one month? In one year? If the data load grows, you might need to set up a periodic purge to make sure the stuff doesn’t get huge.

  2. Ask yourself what data will be stored in the ZODB

    The ZODB is a great database, and provides a lot of features for a CMS: each object (e.g. a Document) is automatically saved on each change. This persistence layer, à la Hibernate, is nice. But it has a cost: if hundreds of objects are created by hundreds of users per minute, it won’t work. Technically it will work, but it will become very slow because the ZODB has to deal with concurrent changes on the objects. You might argue that this won’t be a problem if you take care of who changes the data and how, but you have no control over how base objects work in the ZODB. For example, BTrees objects, which are supposed to be fast and able to deal with many items, are not really scalable on writes: on my Intel MacBook, on a BTreeFolder containing 1000 items, running 4 concurrent threads that each create an object every 300 ms generates a conflict error on 10% of the requests. You should really ask yourself if you need ZODB features for some data. Maybe they will fit better in a SQL database if they don’t need versioning or sophisticated workflows.

    The ZCatalog, in its classical shape, is also a particular case. It can weigh 40% of the ZODB’s total size, so if you index a lot of things, and a lot of features on your website are based on queries, it may be a good idea to use a specialized engine, like Xapian or Lucene. It’s faster, and it won’t make your ZODB grow. Hey, why are indexes stored in the ZODB anyway?

  3. Don’t hide your code behind caches

    Squid, memcached, CacheFu. All those tools are wonderful and mandatory when you put your site in production. But you should not hide your code behind them: that’s the best way to code a crappy, badly architected application. You should take care of how your application scales and of your code’s efficiency before you think about caching. Be careful about the complexity of each of your functions, and how they scale.

  4. Be careful of the pound/ZEO mirage

    Making a cluster of ZEO nodes will not make your application faster; it will just raise the number of threads your application can handle simultaneously. Each node needs to be synchronized behind the scenes every time data is changed. So the more nodes you have, the more network traffic you will get. This can damage the overall performance of the application as well.

There’s a lot of work going on in this area. I am looking forward to what people will do at this sprint, for example: http://plone.org/events/sprints/copenhagen-performance-sprint
