ZAsync
A few years ago, when we hit scalability problems with CPS on some big customers’ intranets, we started to use ZAsync to perform some tasks in the background. That improved overall application performance a lot. What ZAsync does is record the tasks to perform (Python scripts, to simplify) in BTrees within the ZODB. A Twisted client that runs independently then opens the ZODB, reads the BTree and finds the tasks to perform. In some ways it acts like another Zope thread. But there’s something I never understood:
Why is the job queue stored in the ZODB?
When we talk about scalability, the infrastructure is most of the time more complex than a simple ZEO setup. It has Apache servers, SMTP servers and load balancers all over the place. It has cron tasks that perform a variety of things, like sending mails, creating images, or anything else that can be done in the background. Most of the time it also has other pieces of software that perform other things. Having a co-server that gives Zope code the ability to perform background tasks is good.
Having a co-server that gives any software the ability to schedule a task is better.
Many applications, many different Zope instances, can benefit from a centralized task manager.
Quartz
In the Java world, I have used a server called Quartz. It is an independent task manager where you can register tasks and run jobs on a given schedule. It’s like a smart cron. Using beans, it can run code on its own, or run it within a Java server application’s context.
Why don’t we have such software in Python?
Maybe we do, but I have never found it, so I ported part of the idea to Python in a tool called TaskManager, which I use for example on fr.luvdit.com, a Django application. It sends mails, calculates neighbourhoods, etc. Maybe I should release it, but that’s packaging work I haven’t found the time to do. Any piece of Python software can register itself as a task in order to provide a service. The jobs are stored in a SQL database, which is accessed through an API by all the clients that want a task performed, and by the co-server that reads the queue and actually performs the tasks. It has in fact three queues, for different priorities. The client-side APIs are really simple and are nothing but SQL queries.
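To make the idea concrete, here is a minimal sketch of what such a SQL-backed queue could look like. It is not TaskManager’s actual code: the table layout, the column names and the functions `open_queue`, `submit`, `job_status` and `next_job` are all assumptions made for illustration, and SQLite stands in for whatever SQL server would really be used.

```python
import json
import sqlite3

# Hypothetical schema: the post only says that jobs live in a SQL database
# with three priority queues. Every name below is an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    task     TEXT NOT NULL,               -- name of a registered task
    args     TEXT NOT NULL DEFAULT '{}',  -- JSON-encoded arguments
    priority INTEGER NOT NULL CHECK (priority IN (0, 1, 2)),
    status   TEXT NOT NULL DEFAULT 'pending',
    result   TEXT
)
"""

# The three queues are modelled as three priority values in one table.
HIGH, NORMAL, LOW = 0, 1, 2


def open_queue(path=":memory:"):
    """Open the job database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def submit(conn, task, args=None, priority=NORMAL):
    """Client side: queueing a job is a single INSERT."""
    cur = conn.execute(
        "INSERT INTO jobs (task, args, priority) VALUES (?, ?, ?)",
        (task, json.dumps(args or {}), priority),
    )
    conn.commit()
    return cur.lastrowid


def job_status(conn, job_id):
    """Client side: checking a job is a single SELECT.

    Returns (status, result), or None for an unknown job id.
    """
    return conn.execute(
        "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()


def next_job(conn):
    """Co-server side: highest priority first, then oldest first."""
    return conn.execute(
        "SELECT id, task, args FROM jobs WHERE status = 'pending' "
        "ORDER BY priority, id LIMIT 1"
    ).fetchone()
```

A client would call something like `submit(conn, "send_mail", {"to": "someone@example.org"}, priority=LOW)` and later `job_status(conn, job_id)`; nothing on the client side needs the co-server to be up.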
lovely.remotetask
Back to Zope. Lovely Systems is working on a Zope 3 tool that seems to work a bit like ZAsync: it still stores the tasks in the ZODB, but dedicates a Zope application to work as a web service provider, if I understood well. It’s the way to go in terms of infrastructure, but I think it’s overkill to use a Zope instance for that.
Why do we need to deploy a whole Zope stack to have a co-server?
A dedicated, pure Python application using a SQL database fits better, because several task runners can work on the same queue to create a real producer-consumer queue. Given their need to perform tasks on various platforms, having a centralized job queue and several executors is more scalable, because the producer doesn’t have to deal with several co-servers.
Furthermore, the XML-RPC layer is not a necessity, and is not as robust as SQL: if the co-server is down, the Zope server can no longer send jobs, check job states or fetch results. Working with a SQL table prevents this. You might argue that this is the worst-case scenario, but in my experience, the more application servers an infrastructure has, the more potential points of failure you get. You might argue that the SQL server might go down as well, but it’s not a code stack and just holds data to be processed: all the functionality, and thus the weaknesses, are on the co-server side. You might also argue that it makes the solution Python-dependent, but it would be dead simple to provide a client for another language.
Anyway, using the ZODB to store such things and a Zope to play with them is a small mistake in my humble opinion, even if it’s based on PersistentQueue, which looks pretty robust. Let’s keep this kind of database doing what it was meant for: storing persistent objects that are publishable.
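The part that makes several runners on one queue safe is how a runner claims a job. Below is a small sketch of one common way to do it, a conditional UPDATE that only succeeds while the job is still pending. The `jobs` table, its columns and the `make_queue` and `claim_job` names are hypothetical, and SQLite is only used here to keep the example runnable.

```python
import sqlite3


def make_queue():
    """Create an in-memory job table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " task TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " runner TEXT)"
    )
    return conn


def claim_job(conn, runner_name):
    """Claim the oldest pending job for this runner.

    Returns (job_id, task), or None when the queue is empty.
    """
    while True:
        row = conn.execute(
            "SELECT id, task FROM jobs WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, task = row
        # The WHERE clause re-checks the status, so if another runner
        # claimed the job in the meantime, no row is updated.
        cur = conn.execute(
            "UPDATE jobs SET status = 'running', runner = ? "
            "WHERE id = ? AND status = 'pending'",
            (runner_name, job_id),
        )
        conn.commit()
        if cur.rowcount == 1:
            return job_id, task
        # Somebody else won the race: look for the next pending job.
```

With a real SQL server, each co-server would hold its own connection and call `claim_job` in its loop; the producer never needs to know how many runners exist or where they live.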
What I would love to have
The perfect co-server that I can think of would be an independent piece of Python software, like TaskManager, that would look like this:
 ------           -------   <-> co-server instance 1 / win32
| zope |   <->   | sqldb |  <-> co-server instance 2 / linux
 ------           -------   <-> co-server instance 3 / linux
                     ^
 ----------------    |
| another server |<---
 ----------------
- sqldb is a database that stores jobs;
- each arrow is provided by a Python API that knows how to interact with the database;
- a co-server is an independent, pure Python runner that picks up some work from the DB;
- each co-server instance is able to perform tasks, which are provided through a plugin system;
- for Zope-dependent tasks, a generic task provides an entry point to execute code through XML-RPC calls or through a direct ZODB opening to avoid eating a thread (e.g. à la ZAsync).
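The plugin part of this list could be as simple as a registry that maps task names to callables. The sketch below is only an illustration of that idea: the `TASKS` registry, the `task` decorator, the `run_job` function and the task names are all made up, and the generic Zope task is shown in its XML-RPC flavour only, using the standard library `xmlrpc.client.ServerProxy`.

```python
import json
from xmlrpc.client import ServerProxy

# Hypothetical plugin registry: task name -> callable.
TASKS = {}


def task(name):
    """Decorator that registers a function as a task plugin."""
    def decorator(func):
        TASKS[name] = func
        return func
    return decorator


@task("add")
def add(a, b):
    """A pure Python task: no Zope involved."""
    return a + b


@task("zope.xmlrpc")
def zope_xmlrpc(url, method, params=()):
    """Generic Zope task: call `method` on the object published at `url`.

    The URL and method name come from the job itself; nothing here is
    specific to a given Zope application.
    """
    proxy = ServerProxy(url)
    return getattr(proxy, method)(*params)


def run_job(task_name, json_args):
    """Run one job on the co-server side.

    Returns a (status, json_payload) pair that could be written back
    to the job table.
    """
    func = TASKS.get(task_name)
    if func is None:
        error = {"error": "unknown task %r" % task_name}
        return "failed", json.dumps(error)
    try:
        result = func(**json.loads(json_args))
    except Exception as exc:  # a failing plugin must not kill the runner
        return "failed", json.dumps({"error": str(exc)})
    return "done", json.dumps(result)
```

For example, `run_job("add", '{"a": 2, "b": 3}')` runs the pure Python plugin, while a job named `zope.xmlrpc` would carry the URL and method to call in its arguments.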
OK, this is exactly Quartz
In the last five years, most of the scalability problems I bumped into were resolved by one good practice: let’s be less Zope-centric when we talk about infrastructure.
I would be pleased to have a few comments from the Lovely guys on this topic, and I thank them for their latest post, which helps the community a lot in thinking about scalable solutions for Zope.



Hadoop also looks very interesting as a potential Zope co-server, though you’re probably already familiar with it. http://lucene.apache.org/hadoop/
Comment by Tiberiu Ichim — September 30, 2007 @ 4:37 pm |
@Tiberiu: very interesting, I didn’t know about it. Have you tried it?
Comment by Tarek Ziadé — September 30, 2007 @ 5:52 pm |
I like your “perfect” co-server design. I basically thought of something similar, but I have some questions that remain unanswered:
– From the co-server POV: what should we do about task input and output?
– From the ZODB POV: who is responsible for getting/putting data from/to the ZODB?
– Could we imagine the co-server being a ZEO client where needed?
– Are the co-servers supposed to poll the SQL database for new jobs? (Main problem of a SQLdb-centric design IMHO)
– Where could we implement a specific task scheduling? Statically in the sqldb python API, letting the tasks consumer pick the right task?
Anyway, thanks a lot for this little “state of the art” on co-servers.
Comment by Jean-Nicolas Bès — October 1, 2007 @ 1:38 pm |
I haven’t used Hadoop; I just read an article and a description of how to interact with it from Python software:
http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
I can also think of some other ways to integrate external “dumb” applications with zope: for example, integrating them with an intelligent message passing system, such as Cabochon or ActiveMQ
As for real, personal experience with such a design, I’ve only worked with lovely.remotetask, but keep an eye on the alternatives.
http://www.openplans.org/projects/cabochon
http://activemq.apache.org/index.html
Comment by Tiberiu Ichim — October 2, 2007 @ 12:09 pm |
hi tarek,
but here’s my feedback…
it took 4 days until i read your blog post
probably there is a misunderstanding about using XML-RPC… we don’t use XML-RPC to create remotetasks. you _can_ use xml-rpc to create a remotetask (by creating an xmlrpc view that generates it).
if you take a look at the readme (http://tinyurl.com/33xyga) you simply create a remotetask utility and create a task object.
the second issue – deploying an entire zope instance – that’s not necessary either. lovely.remotetask spawns its own processes for the taskservice – the tasks themselves are being processed by creating a BaseRequest and using the publisher (publisher.publish). that means that you can use that instance for “normal” operation as well.
btw. in our setups we find that the remotetask instance has much more load than the other servers…
last thing – storage: the tasks themselves are really small. if you process several thousand tasks per day it’s absolutely no problem to store them in zodb. we’re using zc.queue to store the queues on the zeo server. if the python packages of zc.queue and its objects are available on the zeo server as well – it will even do conflict resolution on the zeo server itself…
to be honest – i don’t want to switch to a sql-based remotetask service
for small installations it’s definitely an overkill to have dedicated instances for remotetasks
first of all
Comment by Jodok — October 2, 2007 @ 8:41 pm |
oops
forget the last two sentences – they were below my scrollback
Comment by Jodok — October 2, 2007 @ 8:42 pm |
@Jean-Nicolas Bès:
For the first three questions:
- the task can be a pure Python, independent function. In this case, the data is prepared and sent to the co-server, then the results are made available for the server to fetch;
- the task is Zope-dependent. In this case, the task is a generic task that is called with the coordinates in the ZODB of the real task. This generic task will know how to open the ZODB and invoke the real task. The real task can therefore apply the results directly in the ZODB. In this case, the task acts a bit like a ZEO client, yes. See how ZAsync does it.
About the polling:
- each co-server needs to get the jobs by reading the DB, yes. Why would this be a problem?
- the job scheduling is a set of parameters in the SQL, yes; each co-server then scans the jobs and runs them when it’s the right time.
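To show what I mean by scanning, here is a rough sketch of a polling co-server where the only scheduling parameter is a `run_at` timestamp. The table layout and the `make_db`, `due_jobs`, `poll_once` and `poll_forever` names are invented for this comment, and a real scheduler would need more parameters (recurrence, for instance).

```python
import sqlite3
import time


def make_db():
    """Create an in-memory job table with a schedule column (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " task TEXT NOT NULL,"
        " run_at REAL NOT NULL,"  # Unix timestamp: earliest start time
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def due_jobs(conn, now):
    """Return the pending jobs whose scheduled time has been reached."""
    return conn.execute(
        "SELECT id, task FROM jobs "
        "WHERE status = 'pending' AND run_at <= ? ORDER BY run_at, id",
        (now,),
    ).fetchall()


def poll_once(conn, handler, now=None):
    """Run every due job once and return the ids that were run."""
    now = time.time() if now is None else now
    ran = []
    for job_id, task in due_jobs(conn, now):
        handler(task)
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        ran.append(job_id)
    conn.commit()
    return ran


def poll_forever(conn, handler, interval=5.0):
    """Co-server main loop: one cheap SELECT every `interval` seconds."""
    while True:
        poll_once(conn, handler)
        time.sleep(interval)
```

The cost of polling is one indexed SELECT per co-server per interval, and the interval sets the worst-case delay before a job starts.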
Thanks for your feedback!
@Tiberiu: thanks for the extra info, I need to dig into all those projects.
@Jodok: Thanks for the explanation. If I get it right, you run the tasks in a separate process that is created from Zope itself. This means that Zope is responsible for launching and managing tasks, which is a choice. I want a separate task service for the reasons I mentioned, but I guess lovely.remotetask will fit “normal” usage.
Comment by Tarek Ziadé — October 4, 2007 @ 2:52 am |