Scaling Crypto work in Python
by Tarek Ziadé
We’re building a new service at Services called the Token Server. The idea is simple: give us a Browser ID assertion and a service name, and the Token Server will send you back a token that’s good for 30 minutes of use with that specific service.
That indirection makes it easier for us to manage user authentication and resource allocation for our services. A few examples:
- when a new user wants to use Firefox Sync, we can check which server has the smallest number of allocated users, and tell the user to go there
- we can manage a user from a central place
- we can manage a user we’ve never heard about before without asking her to register with each service individually — that’s the whole point of Browser ID
I won’t get into more details because that’s not the intent of this blog post, but if you are curious, the full draft spec is here – https://wiki.mozilla.org/Services/Sagrada/TokenServer
What this post is really about is how to build this token server.
The server is a single web service that gets a Browser ID assertion and does the following:
- verify the assertion
- create a token, which is a simple JSON mapping
- encrypt and sign the token
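As a sketch of the last two steps, here is what a signed token could look like — a minimal example assuming a plain HMAC signature and a hypothetical server-side secret (the real server’s crypto, with HKDF etc., is different):

```python
import hashlib
import hmac
import json
import time

SECRET = b"server-side secret"  # hypothetical signing key

def make_token(user_id, service):
    # Step 2: the token is a simple JSON mapping.
    payload = json.dumps({
        "uid": user_id,
        "service": service,
        "expires": int(time.time()) + 30 * 60,  # good for 30 minutes
    }).encode()
    # Step 3: sign it (a real deployment would also encrypt it).
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload, sig

def check_token(payload, sig):
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

payload, sig = make_token("alice", "sync")
print(check_token(payload, sig))  # True
```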
The GIL, Gevent, greenlets and the like
Implementing this using Cornice and a crypto lib is quite simple, but there’s one major issue: the crypto work is CPU intensive, and even though the libraries we can use have C code under the hood, it seems the GIL is not released often enough to let your threads really use several cores. For example, we benchmarked M2Crypto, and it was obvious that a multi-threaded app was locked by the GIL.
But we don’t use threads in our Python servers — we use Gevent workers, which are based on greenlets. And while greenlets help with I/O-bound calls, they won’t help with CPU-bound work: you’re tied to a single thread in that case, and each greenlet that does some CPU work blocks the others.
It’s easy to demonstrate — see http://tarek.pastebin.mozilla.org/1476644. When I run it on my MacBook Air, the pure Python synchronous version is always faster (huh, the gevent version is *much* slower — not sure why…)
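The effect is just as easy to reproduce with plain threads — a minimal sketch, assuming a standard CPython build: two threads doing pure-Python CPU work take about as long as doing the same work sequentially, because the GIL serializes them.

```python
import threading
import time

def burn(n):
    # Pure-Python CPU-bound loop: the GIL stays held most of the time.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

start = time.perf_counter()
burn(N)
burn(N)
sequential = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# On a standard CPython build, 'threaded' is not meaningfully faster
# than 'sequential', even on a multi-core machine.
print(f"sequential: {sequential:.2f}s  threaded: {threaded:.2f}s")
```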
So the sanest option is to use separate processes and set up a messaging queue between the web service that needs some crypto work to be done and specialized crypto workers.
That brings us back to our beloved 100% I/O-bound model, which we know how to scale using Nginx + Gunicorn + gevent.
For the crypto workers, we want them to be as fast as possible, so we started looking at Crypto++, which seems promising because it uses CPU-specific calls in ASM. There’s the pycryptopp binding available for working with Crypto++, but we happen to need some operations that are not available in that lib yet — like HKDF.
Yeah, at that point it became obvious we’d use pure C++ for that part, and drive it from Python.
Message passing
Back to our Token server — we need to send crypto work to our workers and get back the result. The first option that comes to mind is to use multiprocessing to spawn our C++ workers and to feed them with work.
The model is quite simple, but now that we have one piece in C++, it’s getting harder to use the built-in tools in multiprocessing to communicate with our workers — we need to go lower level and start working with signals or sockets. And well, I am not sure what would be left of multiprocessing then.
This is doable but a bit of a pain to do correctly (and in a portable way). Moreover, if we want a robust system, we need things like a heartbeat, which requires more inter-process message passing — and now I need to code it in both Python and C++.
Hold on — let me summarize my requirements:
- inter-process communication
- something less painful than signals or sockets
- very very very fast
I got tempted by Memory Mapped Files, but the drawbacks I’ve read here and there scared me.
ZeroMQ
It turns out zeromq is perfect for this job – there are clients in Python and C++, and defining a protocol to exchange data from the Python web server to the crypto workers is quite simple.
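A minimal sketch of that exchange, assuming pyzmq is installed, using the in-process transport and a thread to stand in for a separate worker process:

```python
import threading

import zmq  # pyzmq binding, assumed installed

def worker(ctx):
    # Worker side: receive one job, do the "work", send back the result.
    sock = ctx.socket(zmq.REP)
    sock.connect("inproc://jobs")
    job = sock.recv()
    sock.send(job.upper())  # stand-in for the real crypto work
    sock.close()

ctx = zmq.Context()
runner = ctx.socket(zmq.REQ)
runner.bind("inproc://jobs")  # with inproc, bind must happen before connect

t = threading.Thread(target=worker, args=(ctx,))
t.start()

runner.send(b"some payload")
result = runner.recv()
t.join()

runner.close()
ctx.term()
print(result)  # b'SOME PAYLOAD'
```

In the real setup the worker would be a separate process (possibly in C++) and the transport would be IPC or TCP instead of inproc; the protocol stays the same.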
In fact, this can be done as a reusable library that takes care of passing messages to workers and getting back results. It has been done hundreds of times, and there are many examples on the zmq website, but I failed to find any packaged Python library that would let me push some work to workers transparently, via a simple execute() call — if you know of one, tell me!
So I am building one, since it’s quite short and simple. The project is called Powerhose and is located here: https://github.com/mozilla-services/powerhose.
Here’s a description of its design and limitations:
- Powerhose is based on a single-master, multiple-workers protocol
- The master opens a socket and waits for workers to register themselves on it
- The worker registers itself with the master, provides the path to its own socket, and waits for work on it.
- Workers perform the work synchronously and send back the result immediately.
- The master load-balances over available workers, and if all are busy, waits a bit before timing out.
- The worker pings the master on a regular basis and exits if it’s unable to reach it. It attempts to reconnect several times, to give the master a chance to come back.
- Workers are language agnostic, and a master can run heterogeneous workers (one in C, one in Python, etc.)
- Powerhose does not serialize/deserialize the data — it sends plain strings. That is the responsibility of the program that uses it.
- Powerhose is not responsible for respawning a master or a worker that dies. I plan to use daemontools for this, and maybe provide a script that runs all workers at once.
- Powerhose does not queue jobs; it just relies on zeromq sockets.
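The worker-side heartbeat described above can be sketched like this — `ping` is a hypothetical callable that returns True when the master answers:

```python
import time

def heartbeat(ping, interval=1.0, max_retries=3):
    """Ping the master at a regular interval; give up after
    max_retries consecutive failures."""
    failures = 0
    while failures < max_retries:
        if ping():
            failures = 0   # master answered, reset the counter
        else:
            failures += 1  # one more chance for the master to come back
        time.sleep(interval)
    return False  # master is unreachable: the worker should exit

# Simulated master that answers once, then dies:
responses = iter([True, False, False, False])
print(heartbeat(lambda: next(responses, False), interval=0))  # False
```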
The library implements this protocol and gives two tools to use it:
- A JobRunner class in Python that you can use to send out work to be done
- A Worker class in Python and C++ that you can use as a base class to implement workers
Here’s an example of using Powerhose:
- The Server – https://github.com/mozilla-services/powerhose/blob/master/examples/square_master.py
- The Python worker – https://github.com/mozilla-services/powerhose/blob/master/examples/square_worker.py
- The C++ worker (don’t look at the code 🙂 – https://github.com/mozilla-services/powerhose/blob/master/examples/square_worker.cpp
For the Token server, we’ll have:
- A JobRunner in our Cornice application
- A C++ worker that uses Crypto++
The first benches look fantastic — probably faster than anything I’d have implemented myself using plain sockets 🙂
I’ll try to package Powerhose so other projects at Mozilla can use it. I am wondering if this could be useful to more people, since I failed to find that kind of tool. How do you scale your CPU-bound web apps?
Have you had a look at the latest versions of IPython? There is a parallelization module that does more or less what you want to do. It is also based on zmq, but does not interface directly with C.
Sounds cool, and speedy!
Are you firewalling off 0mq and running it on a secure transport layer, since it is considered insecure on an open network? I fear this degrades any performance gained by 0mq in the real world.
As an aside, please consider changing to a crypto implementation which is not based on burning CPU cycles to provide security. For battery and environmental reasons, as well as future proofing security measures this is nicer. As a bonus you will have to send less data, and there will be less latency for your users. eg, http://en.wikipedia.org/wiki/Elliptic_curve_cryptography like DNSCurve uses: http://dnscurve.org/
Supervisor might be an alternative to daemontools if you want something in python. http://supervisord.org/
AFAIK, the fuzz attack was fixed, and there’s no longer a reason not to run zmq over the net.
All our stuff is on the same box — with a lot of cores. I am not considering using anything other than IPC — but maybe it would be interesting to add some kind of security on top of the ZMQ TCP exchanges.
I need to check the latest version, thx for the heads-up – I wonder if they have a heartbeat for workers
You might consider using monit over daemontools. Unlike daemontools, monit is actually maintained and respects pam_limits.
Will have a look, thx
You should use a ØMQ device instead of writing your own JobRunner. The ØMQ device can multiplex RPC calls between multiple worker backends, which can connect to and disconnect from it at will. I needed to do this, so I wrote ZRPC: https://github.com/zacharyvoase/zrpc
The basic pattern is to run a load balancer in ‘broker’ mode and have your worker backends connect to it.
This looks pretty good, I’ll see how it can fit into my design, thanks! One thing I have, though, is a heartbeat on the worker side — how does your pattern deal with the master dying?
Why not use Celery and RabbitMQ via AMQP, which is already in use many places in Mozilla? Did I miss something? Is Rabbit too slow?
Because we don’t need any persistence whatsoever, nor to do long asynchronous work where the client might need to come back later to get the result. We want to push work and get back the result as fast as possible, in a non-blocking, greenlet-friendly way.
For that job, zeromq is the fastest thing I know of, with very low latency and a very flexible node topology, and it can work with gevent thanks to gevent_zmq.
I think the name is a bit of a misnomer… s/zeromq/smartsocket/
Right. I understand your use case, but I don’t see why rabbit doesn’t meet it (and celery supports what you want too I believe).
I understand this is fast (c++ with protocol tuned for speed) but rabbit is pretty fast (erlang) and amqp leaves the option of switching to something better without changing the slaves.
AMO uses rabbit and amqp, metrics knows how to read from it, pulse uses it for WOO, etc…unless the speed is vastly different rolling your own or using another broker/pool of slaves seems less than ideal.
Of course, messaging and brokers are just a hobby for me, perhaps you have more info than me 🙂
Nevermind, I just read your other comments…ignore me 🙂
Interestingly, we are humoring the possibility of adding a 0mq backend to Celery, which would be very similar to Powerhose — to be used where persistence is not required. Work has not started yet, but there is a small number of people interested in this; I’ll let them have a look at Powerhose.
As your goal is CPU scalability, presumably on multicore servers, you might want to try a slightly different (but still simple) load balancing algorithm. That is, use a LIFO queue for _accepted, rather than a FIFO one. This means the most recently executed worker gets the next job, meaning its CPU cache is likely to be warmer.
If you do, I’d be interested to see some perf numbers for FIFO/LIFO 🙂
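For illustration, the difference boils down to which end of the idle-workers queue you pop from — a sketch using a deque with hypothetical worker names:

```python
from collections import deque

# Idle workers; names are illustrative.
available = deque(["w1", "w2", "w3"])

def acquire_lifo():
    # Pop the most recently released worker: its CPU cache is warmest.
    return available.pop()

def acquire_fifo():
    # Pop the worker that has been idle the longest.
    return available.popleft()

def release(worker):
    available.append(worker)

w = acquire_lifo()
release(w)
print(acquire_lifo())  # 'w3' again: the warm worker gets the next job
```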
@Ask Solem — sounds great, I’d love to help out
@Simon Davy that’s a very good point, I was not sure which strategy to use there – I will bench both and let you know
@LegNeato — btw, another reason is that we only use IPC in our case