Fetchez le Python

Technical blog on the Python programming language, in a pure Frenglish style. Main topics are Python and Mozilla – This blog does not engage my employer

Category: sync

Firefox Sync Server in Python – Take 2

It’s been more than a month since the last update on my work on Firefox Sync. Time for a quick update.

The Code

The application grew quite well and was split into four separate projects:

  • SyncCore: contains the authentication back-ends and various utilities like the CEF logger and a few WSGI helpers.
  • SyncReg: that’s the User application. Implements: https://wiki.mozilla.org/Labs/Weave/User/1.0/API. Can be used as a standalone WSGI application.
  • SyncStorage: contains the storage back-ends and implements https://wiki.mozilla.org/Labs/Weave/Sync/1.0/API (and the upcoming 1.1). Can be used as a standalone WSGI application.
  • SyncServer: this is just a glue application that can be used to run both the Reg and Storage servers in the same server. By default this application will run sqlite back-ends for storage and authentication, which means it can be launched in a zero-config environment.
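To give an idea of what such a glue application boils down to, here is a minimal sketch using nothing but plain WSGI. It is not the actual SyncServer code: the function names, the two stand-in applications and the URL prefixes are assumptions made for the illustration.

```python
# Hypothetical sketch: reg_app, storage_app, make_glue_app and the URL
# prefixes are illustrative names, not taken from the real SyncServer code.

def reg_app(environ, start_response):
    """Stand-in for the SyncReg (User API) WSGI application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"reg"]


def storage_app(environ, start_response):
    """Stand-in for the SyncStorage (Sync API) WSGI application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"storage"]


def make_glue_app(mounts):
    """Return a WSGI app that hands the request to the first matching prefix."""
    def glue(environ, start_response):
        path = environ.get("PATH_INFO", "")
        for prefix, app in mounts:
            if path.startswith(prefix):
                return app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    return glue


# Assumed layout: User API under /user/, Sync API under /1.0/
application = make_glue_app([("/user/", reg_app), ("/1.0/", storage_app)])


def call(app, path):
    """Invoke a WSGI app in-process and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], body
```

Since each sub-application stays a plain WSGI callable, it can still be served on its own, which is the point of keeping Reg and Storage as standalone projects.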

I moved the code to bitbucket, and will clone it back to hg.mozilla.org once we set up dedicated repositories for the Python server there. If you want to run your own Sync server, it’s still very simple. Make sure you have the latest virtualenv, Mercurial and Make installed, then run:

$ hg clone http://bitbucket.org/tarek/sync-server Sync
$ cd Sync
$ make build

Then you can run your server on port 5000 by using the built-in web server:

$ bin/paster serve development.ini

Of course, a real setup should be done using SSL, a real web server like Apache/mod_wsgi and MySQL for the DB. But the default setup is useful and can replace the minimal-server Toby wrote.

Benching

One thing I want to make sure of is that the Python server is as fast as possible, and faster than the PHP application. Since a Python web application can reuse the same interpreter in memory, there’s a lot of room for improvements like connection pooling and light memory caching. I also wanted to bench various configurations for the DB, like using PostgreSQL instead of MySQL.

The team is currently working on stress testing our Sync infrastructure and the tool that we use is Grinder. Grinder is a Java tool that uses Jython for writing tests, and provides a simple console to drive it. The results Grinder returns are raw results, and there’s quite some work left to do if you want to generate nice reports.

I used another tool to bench the server, called Funkload. It’s a Python tool that uses unittest classes to run benches, and provides a functional test tool to query a web server and do some assertions, like WebTest. It produces HTML reports that contain a lot of metrics. Some I don’t use because they are specific to web sites, but it’s good enough to stress-test the Sync server and compare PHP and Python speed. One caveat is that it cannot be distributed. There’s a project called BenchMaster that adds this feature, which I need to try.
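The idea of writing a bench as a unittest class can be sketched with the standard library alone. This is not the Funkload API: `fake_sync_request`, `run_bench` and the URL are hypothetical names, and the fake request makes no network call so the sketch runs anywhere.

```python
import time
import unittest


def fake_sync_request(path):
    """Stand-in for an HTTP call to a Sync server (assumption: no network)."""
    return 200, b"{}"


def run_bench(request, path, cycles):
    """Call `request` repeatedly and return basic timing stats in seconds."""
    timings = []
    for _ in range(cycles):
        start = time.perf_counter()
        status, _body = request(path)
        timings.append(time.perf_counter() - start)
        if status != 200:
            raise AssertionError("unexpected status %r for %s" % (status, path))
    return {
        "count": len(timings),
        "min": min(timings),
        "max": max(timings),
        "avg": sum(timings) / len(timings),
    }


class SyncBench(unittest.TestCase):
    """A bench expressed as a unittest class, in the spirit of Funkload."""

    def test_info_collections(self):
        # Illustrative URL; any callable returning (status, body) would do
        stats = run_bench(fake_sync_request, "/1.0/tarek/info/collections", 100)
        self.assertEqual(stats["count"], 100)
        self.assertLessEqual(stats["min"], stats["max"])
```

A real bench would replace `fake_sync_request` with an actual HTTP client and run many of these cycles concurrently, which is the part Funkload takes care of.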

The stress test is the same as the Grinder one, and here are some reports using various configurations: http://sync.ziade.org/funkload/

While Python already appears to be slightly faster than PHP, those benches were done on my MacBook with 100 users loaded in the DB, 6000 objects each, so they don’t mean a lot. Just that the Python application is not borked 😀 .

I’ll probably run Funkload in the same environment where we run Grinder at Mozilla, where we have a realistic setup. I also want to have this kind of report generated every day, so I can keep an eye on how the Python server performs. Making sure the app does not slow down when it grows is one important part of continuous integration.

Caching: Redis vs Memcached

I used Redis to do a bit of caching in the Python app, instead of Memcached like the PHP app. See my previous post for the rationale.

Redis was very stable during my benches, but I have heard from some other projects that they had quite a few problems with it in production [I might post more details here in another blog post]. I still think this is the tool we should use in Sync, and I also want to experiment writing a full back-end for Sync using it. But the first version of the Sync server we will deploy on our servers will probably use Memcached since it’s proven to work well right now and since I don’t really need all the extra features Redis offers if the usage is restricted to volatile caching.

Continuous integration

I am still working alone on the Python app, but a continuous integration server is something we really want to have. I am a big fan of buildbot but I wanted to give Hudson a try. The management interface is brilliant and I could set up a Hudson server for Sync in an hour. I eventually moved it to https://hudson.mozilla.org/job/Sync with other Mozilla projects from the WebDev team. It contains Pylint reports, a test coverage report, and of course Chuck Norris keeps the code safe.

What’s next?

The Python app is mostly done, besides a few things to clean up. The next big step will be to bench it alongside the PHP application on realistic data, fix any problem that arises, then work on pushing it into production. The production switch will probably happen gradually since every node is standalone. And since the rest of the team is quite busy making sure everything is ready for the upcoming Firefox 4 final release, which includes Sync natively, switching to Python is not the #1 priority right now. I expect it to happen before the end of the year though.

In the meantime, once the benches are done and the code is rock-solid, I’ll start to play with different back-ends: a full Redis back-end, and maybe something based on Riak or Cassandra.

Firefox Sync Server in Python – Take 1

I have been working for a bit more than a month now on the next generation of the Firefox Sync server in Python, and while the project is still in its early stages and subject to a lot of changes, I think it’s a good idea to share now what we are building here at Mozilla. Maybe that’ll attract contributors!

About Sync

Firefox Sync (formerly Weave) lets you synchronize your Firefox bookmarks, history, passwords, open tabs etc. so you can have them on any computer, or even use them from your iPhone by using Firefox Home.

Clients that are syncing work with our servers at Mozilla by using the Sync API (https://wiki.mozilla.org/Labs/Weave/Sync/1.0/API) and the User API (https://wiki.mozilla.org/Labs/Weave/User/1.0/API).

The User APIs manage the user accounts and tell the client which server holds the data of a given user. In other words, each user is tightly coupled to a single server when reading or writing data. This natural sharding is great for scaling Sync, and is possible because users don’t share data (yet… ;))
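Here is a small sketch of what this user-to-server coupling looks like. The `NodeDirectory` class, the node names and the "least loaded node" policy are all assumptions made for the example; the real User application may assign nodes quite differently.

```python
class NodeDirectory:
    """Hypothetical in-memory directory mapping each user to one storage node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.assignments = {}

    def node_for(self, username):
        """Return the node holding this user's data, assigning one on first use."""
        if username not in self.assignments:
            # Assumed policy: pick the node that currently has the fewest users
            counts = {node: 0 for node in self.nodes}
            for node in self.assignments.values():
                counts[node] += 1
            self.assignments[username] = min(self.nodes, key=lambda n: counts[n])
        return self.assignments[username]


directory = NodeDirectory(["node1.example.com", "node2.example.com"])
first = directory.node_for("alice")
# Once assigned, a user always gets the same node back
assert directory.node_for("alice") == first
```

Because a user never moves between nodes in normal operation, each node can be scaled, cached and benched on its own.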

Another important point is that the data is encrypted on the client side before it is sent over. That’s because one of the key concepts of Sync is that your data should not be known by our servers, to protect your privacy. Well, we could probably still know how many bookmarks you have by counting the number of entries in the DB, or how often you use your browser. But as soon as you use a service like that you have to give away this kind of information, most of the time just because it is useful to make the service faster or to understand any potential problem. Read our privacy policy here.

And the good news is that you can set up your own Sync server and even implement it yourself if you want.

So, a Sync server is a pretty passive storage server that is quite easy to scale while keeping data consistency across clients.

About the code

The current implementation uses Apache, PHP, LDAP, MySQL and Memcached. For various reasons I won’t detail in this post (that might be another post), it has been decided to switch the Sync server to Python.

Python libraries

The Sync server is composed of web services and a few screens used for the password reset process, so using a full web framework would have been overkill. Writing a WSGI-enabled server made a lot of sense though, since it allows people to run our implementation on their laptop, or on any WSGI-compatible web server they wish to use.

So, I’ve picked:

  • Routes, to dispatch requests to a few classes (controllers)
  • WebOb to process incoming requests and build responses
  • Paste, PasteScript and PasteDeploy, to group the configuration in an ini file and make it easy to run the application with a built-in server.

There are alternative routing systems, but Routes really fits my brain and makes the dispatching quite simple. I really like the fact that you can optionally use regular expressions to validate URLs.

WebOb is quite a standard library and makes our life simple when reading requests and writing responses. The code in our controllers stays KISS with WebOb when you have to read incoming data: it’s all available in simple mappings. The response is also built by WebOb and you can forget about all the WSGI protocol details. We mainly return JSON dumps that WebOb wraps into responses.
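To illustrate the overall shape (dispatch a URL to a controller method, validate it with a regular expression, return a JSON dump), here is a sketch that deliberately uses only the standard library instead of Routes and WebOb, so it runs without any dependency. The controller, the route table and the URL pattern are hypothetical.

```python
import json
import re


class CollectionController:
    """Hypothetical controller: returns simple mappings, never raw responses."""

    def get_collection(self, environ, username, collection):
        return {"username": username, "collection": collection, "items": []}


# Each route: HTTP method, URL pattern validating its parts, handler
ROUTES = [
    ("GET",
     re.compile(r"^/1\.0/(?P<username>[\w.-]+)/storage/(?P<collection>[\w.-]+)$"),
     CollectionController().get_collection),
]


def application(environ, start_response):
    """Plain WSGI app: find the matching route and dump the result as JSON."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "")
    for route_method, pattern, handler in ROUTES:
        match = pattern.match(path)
        if route_method == method and match:
            body = json.dumps(handler(environ, **match.groupdict())).encode("utf-8")
            start_response("200 OK", [("Content-Type", "application/json"),
                                      ("Content-Length", str(len(body)))])
            return [body]
    body = b"Not Found"
    start_response("404 Not Found", [("Content-Type", "text/plain"),
                                     ("Content-Length", str(len(body)))])
    return [body]


def call_app(method, path):
    """Call the WSGI app in-process and return (status, body as text)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": method, "PATH_INFO": path}
    body = b"".join(application(environ, start_response))
    return captured["status"], body.decode("utf-8")
```

With Routes and WebOb the matching loop and the response building disappear into the libraries, and only the controller stays.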

Last, Paste is very handy to run the server locally, to initialize data, and handle multiple configurations. I should also say that my colleague Ian Bicking is behind the Paste and WebOb libs, and involved in the Sync project.  So those were quite natural choices.

The authentication process is a custom function that reads a basic authentication header and checks it using an authentication plugin (more on plugins later in this post.)
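A function of this kind could look like the sketch below. The `DictAuth` plugin and the function names are made up for the example, and the plain-text password comparison is only there to keep it short; a real back-end would check a hash or ask LDAP.

```python
import base64


class DictAuth:
    """Hypothetical authentication plugin backed by an in-memory dict."""

    def __init__(self, users):
        self.users = users

    def authenticate(self, username, password):
        # Sketch only: a real plugin would never store plain-text passwords
        return self.users.get(username) == password


def parse_basic_auth(header):
    """Return (username, password) from an Authorization header, or None."""
    if not header:
        return None
    scheme, _, encoded = header.partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers invalid base64 (binascii.Error) and invalid UTF-8
        return None
    username, sep, password = decoded.partition(":")
    if not sep:
        return None
    return username, password


def check_request(environ, auth):
    """Return the authenticated user name, or None if the check fails."""
    credentials = parse_basic_auth(environ.get("HTTP_AUTHORIZATION"))
    if credentials is None:
        return None
    username, password = credentials
    return username if auth.authenticate(username, password) else None
```

The function only knows about the `authenticate` method, so swapping the dict for an SQL or LDAP back-end does not change it.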

For the storage, I’ve picked SQLAlchemy and python-ldap. I don’t really use the ORM part of SQLAlchemy and write pretty raw SQL queries to avoid any extra overhead. The benefit of the ORM was null here anyway, since all storage I/O is contained in a storage class that outputs simple mappings. I have created the mappers though, as they are useful to initialize a DB on a first run.

But when the server runs, SQLAlchemy is mainly used for:

  • its connection pooling abilities.
  • the nice parameter binding
  • the ability to switch to any DB system via configuration (as long as the SQL is compatible of course)
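The "raw SQL plus parameter binding, simple mappings out" approach can be sketched as follows. To keep it runnable without installing anything, it uses the standard library sqlite3 module rather than SQLAlchemy, and the table layout and function names are assumptions, not the real Sync schema.

```python
import sqlite3

# In-memory database with a hypothetical, simplified items table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wbo (username TEXT, collection TEXT, id TEXT, payload TEXT)"
)


def add_item(conn, username, collection, item_id, payload):
    """Insert one item; values are bound, never pasted into the SQL string."""
    conn.execute(
        "INSERT INTO wbo (username, collection, id, payload) VALUES (?, ?, ?, ?)",
        (username, collection, item_id, payload),
    )


def get_items(conn, username, collection):
    """Return the items of a collection as a list of simple mappings."""
    rows = conn.execute(
        "SELECT id, payload FROM wbo "
        "WHERE username = :username AND collection = :collection ORDER BY id",
        {"username": username, "collection": collection},
    ).fetchall()
    return [{"id": row[0], "payload": row[1]} for row in rows]


add_item(conn, "tarek", "bookmarks", "b1", "encrypted-blob-1")
add_item(conn, "tarek", "bookmarks", "b2", "encrypted-blob-2")
```

Bound parameters mean a hostile value is treated as data, not as SQL, and the callers only ever see lists of dicts, whatever the database behind.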

As for python-ldap (I didn’t implement the LDAP part yet), it’s the standard connector I have always used with various flavors of LDAP servers (OpenLDAP, Active Directory, etc.). I don’t think there is any competitor for this anyway.

Caching

In the current PHP implementation, the caching is done using Memcached. For instance, when clients often ask for specific collection items, these end up in Memcached to lower the number of queries made to MySQL. For the Python implementation though, I’ve decided to use Redis instead.
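The pattern is the classic "look in the cache first, hit the database only on a miss". Here is a sketch with a tiny in-process cache standing in for Memcached or Redis; the `TTLCache` class and `cached_fetch` function are hypothetical names, and the expiry policy is an assumption.

```python
import time

_MISSING = object()  # sentinel, so that None can be a legitimate cached value


class TTLCache:
    """Tiny in-process stand-in for a volatile cache server."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}

    def get(self, key, default=_MISSING):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry is too old: drop it and report a miss
            del self._data[key]
            return default
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)


def cached_fetch(cache, key, loader):
    """Return the cached value, calling `loader` (the DB query) only on a miss."""
    value = cache.get(key)
    if value is _MISSING:
        value = loader(key)
        cache.set(key, value)
    return value
```

The `loader` callable is where the MySQL query would go; everything else stays the same whichever cache server sits behind `get` and `set`.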

In terms of speed, Redis and Memcached are quite similar. Redis though has interesting extras:

  • The data is saved to the disk, so you don’t lose your cache. The speed stays almost the same as memcached since the disk syncs are done asynchronously from time to time. Since a Sync user is tightly coupled to a storage server, that’s an interesting feature to have. And, hey, you can move data from a Redis DB to another, so migrating the cache to another server is even possible.
  • Redis provides built-in APIs to work with sets and lists, which allows more complex caching without extra code. This will allow us to do more caching in the future.

Storage

The storage itself will stay on MySQL but we will probably explore alternative storage systems in the future. One requirement of Sync is to be able to write data as fast as possible so all clients can have access to it as soon as possible. Right now, Sync provides immediate consistency, since all writes are done synchronously on a single server.

Plugins

The PHP application was built with extensibility in mind: the way Mozilla stores the data and authenticates users (a mix of LDAP and MySQL) might not work if the code is used by someone else. That’s why the code was built using abstractions for the storage and the authentication parts, and the Python version kept this good idea.

Basically, you can write a new authentication or storage class, and configure Sync to use it. See the documentation I am building on this: http://sync.ziade.org/doc/storage.html (temporary location)
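One common way to wire such plugins is to put a dotted class name in the configuration and load it at startup. The sketch below shows that mechanism; it is an assumption about how this could work, not the documented Sync plugin API, and `MemoryStorage`, `load_backend` and the method names are hypothetical.

```python
import importlib


class MemoryStorage:
    """Hypothetical storage back-end keeping everything in a dict."""

    def __init__(self, **options):
        self.options = options
        self.items = {}

    def set_item(self, username, collection, item_id, payload):
        self.items[(username, collection, item_id)] = payload

    def get_item(self, username, collection, item_id):
        return self.items.get((username, collection, item_id))


def load_backend(dotted_name, required=(), **options):
    """Import 'package.module.ClassName', check its methods, instantiate it."""
    module_name, _, class_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_name)
    module = importlib.import_module(module_name)
    backend_class = getattr(module, class_name)
    missing = [name for name in required if not hasattr(backend_class, name)]
    if missing:
        raise TypeError("%s lacks %s" % (dotted_name, ", ".join(missing)))
    return backend_class(**options)


# Assumed usage, with a made-up module path read from the ini file:
# storage = load_backend("mypackage.storage.MemoryStorage",
#                        required=("get_item", "set_item"))
```

The application only depends on the method names, so anyone can ship a back-end in their own package and point the configuration at it.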

Web server

The web server that runs the Python application will stay Apache (with mod_wsgi) since it has proven to work very well with the current implementation. I might bench other servers in the future though, like Gunicorn + nginx or uWSGI + nginx. We now have a nice Grinder script that realistically mimics Sync users, so benching them should be straightforward.

Doc and Code

I’ve started writing the documentation, which is temporarily located at http://sync.ziade.org/doc, and you can grab the code we are building at http://hg.mozilla.org/users/telliott_mozilla.com/sync-server. You can already use the server with your Firefox / Firefox Home, but this is still at the development stage, so use at your own risk.

I would love to get some feedback on that work!