Firefox Sync Server is now 100% Python
by Tarek Ziadé
Two weeks ago, we pushed the last bit of the Python Sync Server into production, and there’s no more PHP.
Nothing changes on the client side, since the Python server is just a rewrite of the existing PHP server.
Lessons learned
The first push we did of the storage part that week went really badly and we had to roll back urgently, fix the problems and push it back a few days later.
The main problem we had in production was related to the MySQL driver we used in conjunction with Gunicorn and Gevent. We picked PyMySQL because we wanted Gevent’s ability to monkey patch the socket module; MySQL-Python would have been useless for this since it uses C code.
When you use Gevent workers with Gunicorn, sockets become automatically cooperative and you can handle more parallel requests that are waiting for data from the SQL server. Read more about this here.
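As a rough illustration, a Gunicorn configuration file that selects Gevent workers might look like the sketch below. The setting names (`worker_class`, `workers`, `worker_connections`) are Gunicorn settings, but the values are made up for the example and are not the ones the Sync server runs with.

```python
# gunicorn.conf.py: illustrative values only, not the Sync server's real config
worker_class = "gevent"      # cooperative workers: the socket module is monkey patched
workers = 4                  # number of worker processes (assumed value)
worker_connections = 1000    # max simultaneous clients per worker (assumed value)
```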
And that’s exactly what the Sync server is: a thin layer of web services on the top of a database, sending requests and waiting for the results.
PyMySQL was working fine in our load tests and in staging. We were happily pushing the load and had slightly better performance than the PHP stack. We were not expecting a huge difference since most of the time (I’d say around 80%) is spent waiting for the SQL server and the Python server is using the same database.
But the main difference is that the Python stack stays persistent in memory, so we can pool connectors and avoid recreating TCP connections for every request. I don’t have any hard numbers yet, as we’re collecting them, but we’ve definitely reduced the time taken by our web services in those 20% spent outside the SQL server.
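To make the pooling idea concrete, here is a minimal sketch of a connection pool built on the standard library. It is not the Sync server's code: `ConnectionPool` and the `factory` callable are hypothetical names, and a real pool would also need to handle broken connections.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical pool: opens `size` connections once and hands them out."""

    def __init__(self, factory, size):
        # `factory` is any callable returning a new connection object.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Block until a connection is free instead of opening a new one.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always give the connection back, even if the request failed.
            self._idle.put(conn)
```

Because the process stays in memory between requests, the cost of creating the connections is paid once at startup rather than on every request.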
But.
But as soon as we pushed to production, everything started to lock up. Some queries were just hanging and incoming requests were piling up until we were unable to cope with the load.
What happened is that PyMySQL was using socket.send() to send data to the MySQL server, without checking that all the bytes were really sent. Under high load, with Gevent, this does not work anymore because you’re not necessarily sending all the bytes at once. The API to use is socket.sendall(), which makes sure everything is sent.
Here’s an extract of the doc for send():
- socket.send(string[, flags])
- Send data to the socket. The socket must be connected to a remote socket. The optional flags argument has the same meaning as for recv() above. Returns the number of bytes sent. Applications are responsible for checking that all data has been sent; if only some of the data was transmitted, the application needs to attempt delivery of the remaining data.
And for sendall():
- socket.sendall(string[, flags])
- Send data to the socket. The socket must be connected to a remote socket. The optional flags argument has the same meaning as for recv() above. Unlike send(), this method continues to send data from string until either all data has been sent or an error occurs. None is returned on success. On error, an exception is raised, and there is no way to determine how much data, if any, was successfully sent.
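The difference can be sketched with a fake socket that only accepts a few bytes per call. `ShortWriteSocket`, `naive_send` and `send_all` are hypothetical names for this illustration, not PyMySQL code; `send_all` just shows the kind of loop that sendall() runs for you.

```python
class ShortWriteSocket:
    """Hypothetical stand-in for a socket whose send() accepts at most `chunk` bytes."""

    def __init__(self, chunk):
        self.chunk = chunk
        self.received = bytearray()

    def send(self, data):
        accepted = bytes(data[:self.chunk])
        self.received += accepted
        return len(accepted)


def naive_send(sock, data):
    # The bug: the return value is ignored, so a short write loses data.
    sock.send(data)


def send_all(sock, data):
    # Keep calling send() until every byte has been accepted.
    view = memoryview(data)
    total = 0
    while total < len(view):
        sent = sock.send(view[total:])
        if sent == 0:
            raise ConnectionError("socket connection broken")
        total += sent
    return total
```

With a 4-byte chunk, `naive_send` delivers only the first 4 bytes of a query while `send_all` delivers all of it. On a real socket, calling `sock.sendall(data)` is the simpler fix.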
As soon as we changed the code in the driver, everything went smoothly. (PyMySQL’s author was told about this, and the tip is now fixed. It seems MyConnPy has the same problem.)
So the question you’re probably asking is: why didn’t we catch this issue in our load test environment? The reason is that our load test script was not asserting all the responses the web server was returning, so we did not detect those errors, and the locked queries were timing out in a mass of normal behavior. They “came back” as valid. The load test infrastructure, while filled with hundreds of thousands of fake users’ data, has fewer databases than production, so this kind of issue does not bubble up as hard. While our load test infrastructure is very realistic, it will never be exactly like production.
The other thing is that the Grinder outputs raw data and we just used the Queries Per Second indicator. I suspect we would have caught this issue with Funkload because it provides result diagrams where you can see things like min and max.
So the main lessons learned here are:
- make sure the load test scripts assert all the responses (status + content)
- make sure your load testing tools detect any abnormal behavior — like a very very long request, even if it’s a fraction in a mass of normal behavior
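Both lessons can be sketched in a few lines of plain Python. This is not our Grinder script: `check_response` and `summarize` are hypothetical helpers, and the thresholds are whatever makes sense for your service.

```python
import statistics


def check_response(status, body, expected_status=200, expected_text=""):
    """Fail loudly unless both the status code and the body look right."""
    assert status == expected_status, f"unexpected status {status}"
    assert expected_text in body, "expected text missing from body"


def summarize(durations, slow_after):
    """Report min/median/max and every request slower than `slow_after` seconds."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "slow": [d for d in durations if d > slow_after],
    }
```

A single 30-second request barely moves a queries-per-second figure, but it shows up immediately in `max` and in the `slow` list.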
I am very thankful to the Services Ops team, and in particular Pete who drove the production push. These guys rock.
What’s next
Now that everything works well, there are a few things we need to tweak in order to have a better system:
- Kill pending queries when a Gunicorn worker is restarted
- See if we can cache a few LDAP calls
- See if we can use several Gunicorn servers behind one Nginx, since the CPU is under-used.
But overall, I hereby declare the Python push as a success.
And why exactly did you want to move away from PHP … ?
There are various reasons, off the top of my head:
1. Non-web stuff. We’re writing some libs that are not web specific, and Python is quite handy for this. While it’s doable in PHP, it’s a native use case in Python.
2. Rich web stack options. You can run the same app in sync mode or async mode, and keep things persistent in memory, so you can have pools as I’ve explained. Gunicorn, Gevent, Greenlets, etc.
3. Packaging. We build RPMs for our servers, for every lib we use, even if the lib authors don’t provide them. It’s dead easy to automate with Python. I am not a Pear expert but I think the Python packaging eco-system has more features and options when it comes to packaging.
4. Libraries. There are a lot of Python libraries/frameworks out there we can use, and some of them are a big advantage for writing web services. SQLAlchemy is one that comes to mind; it is one of the best ORM/DB tools out there, all languages included, to my knowledge.
5. Sharing at Mozilla. Most server-side Mozilla projects are moving to a Python-based eco-system. ‘Sumo’ and ‘Amo’ for instance have moved to Django — for its features of course. Since we’ve started to share libs, it makes sense to capitalize on Python.
Note that I am expressing my own opinions here, not Mozilla’s official one — but I think they’re probably close.
Mozilla’s official one is almost entirely 5. 🙂
(1) for me is pivotal. php is totally deficient for offline scripting. you end up writing every core library twice, once for offline, once for server-side.
(4) this also. most of the key functionality of php is not included through pear files but at compile time.
Pity what is under SQLAlchemy (DB-API and Python) is so horrible.
wew………. 🙂
[…] make this happen and to Pete for a last-minute deep-network-voodoo save that Tarek talks about in his blog. In our datacenters, things are humming along […]
Congrats!
Just because people like numbers:
How much data volume do you manage?
Thx!
I’ll try to get some fresh numbers and blog them soon
And you’ve pointed out an important difference between PyMySQL and MyConnPy — I reported the bug you referenced to the MyConnPy author five months ago — and even provided a patch. I’m still waiting for a fixed release. I don’t mind maintaining my own branch, but I wonder what other bugs are not getting fixed…
That’s a show stopper bug — If MyConnPy does not fix it, I suspect you should move to PyMySQL
I probably will eventually. But as I say I submitted a patch (it’s in the bug comments), so I have a working driver.
And I pushed indeed the fix into MyConnPy. Was indeed no good there. Can only get better! 🙂
Tarek, have you considered running PyPy as your runtime? It is said to be production-ready in many environments; why not at least give it a shot on your test server? You can expect the memory footprint to increase, but hopefully the performance should improve markedly.
Oh, and if you do encounter any bottlenecks with specific functions, do file a bug report here (https://bugs.pypy.org/). They are actively looking for slowdowns in real-world applications.
Hi Mark, I am not sure gevent and the likes will work with PyPy
Sorry, but it has to be said:
“Use Twisted” :-).
There’s a non-blocking pure-python Twisted-compatible MySQL client that won’t use the wrong socket APIs and blow up your server, even: https://github.com/hybridlogic/txMySQL
Seriously though, your lessons here about load-testing and abnormal behavior transcend technology. Thanks for sharing. It’s a shame we all seem to have to re-learn them every couple of years!
Hey Glyph,
I guess a bug in a SQL driver is orthogonal to the framework used (Gevent or Twisted)
Can the Twisted sql client be used independently btw ?
Thx for the feedback
Congrats !!
Thx
What is Python?
A language that *even* works under Windows, Steve 😉
Hi Tarek,
is there a way to install it on your own server? I wasn’t able to find any documentation on this point.
Cheers,
Janno
Damn, shame on me. I just found this: http://docs.services.mozilla.com/howtos/run-sync.html
Was zeromq even considered? Python support is stable and the library is stupid fast.
[…] it is now discontinued, since the php version has been dropped and now Mozilla runs a brand new python sync server. There are instructions, if you want to set up your own, however it is not for the soft hearted […]
Why is there a server at all? Wouldn’t it be better in some regards to just use a webdav folder to store information about bookmarks and the like?
It would enable most people to just use any online space, where e.g. python is not installed.
I’m thinking of how SyncKolab syncs your address book, calendar and stuff, just as one example.
Looking forward to your thoughts!
*ping*