Defining a WSGI app deployment standard
by Tarek Ziadé
Next month at Pycon, we’ll have a web summit and I’m invited there to talk about how I deploy web applications. This is not a new topic, as it was already discussed a bit last year; see Ian Bicking’s thoughts on the topic.
My presentation at the summit will be in two parts. I want to 1/ explain how I organized our Python deployments at Mozilla (using RPMs) and 2/ make an initial proposal for a deployment standard that would work for the community at large. I intend to work on this during Pycon and later on in the dedicated SIG.
Here’s an overview of the deployment standard idea…
How we usually deploy
If I want to roughly summarize how people deploy their web applications these days, from my knowledge I’d say there are two main categories:
- Deployments that need to be done in the context of an existing packaging system, like RPM or DPKG
- Deployments that are done in no particular context, where we just want them to work, like a directory containing a virtualenv and all the needed dependencies
In both cases, preparing a deployment usually consists of fetching Python packages from PyPI and maybe compiling some of them. These steps are usually done using tools like zc.buildout or virtualenv + pip, and in the case of Mozilla Services, a custom tool that transforms all dependencies into RPMs.
In one case we end up with a directory filled with everything needed to run the application, except the system dependencies, and in the other case with a collection of RPMs that can be deployed on the target system.
But in both cases, we end up using the same thing: a complete list of Python dependencies.
The trick with using tools like zc.buildout or pip is that from an initial list of dependencies, you end up pulling in indirect dependencies. For instance, the Pyramid package will pull in the Mako package, and so on. A good practice is to have them all listed in a single place and to pin each package to a specific version before releasing the app. Both pip and zc.buildout have tools to do this.
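For instance, a fully pinned list in pip’s requirements format could look like this (the version numbers here are made up for the example; pip freeze can generate such a file from a live environment):

Pyramid==1.0
Mako==0.4.1
MarkupSafe==0.12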
Deployment practices I have seen so far:
- a collection of RPMs/Debian packages/etc. is built using tools like bdist_rpm
- a virtualenv-based directory is created in place in production, or as a pre-built binary release that’s archived and copied into production
- a zc.buildout-based directory is created in place in production, or as a pre-built binary release that’s archived and copied into production
The part that’s still fuzzy for everyone not using RPMs or Debian packages is how to list system-level dependencies. In PEP 345 we introduced the notion of a hint, where you can define system-level dependencies whose names may not be the actual names on the target system. So if you say you need libxml-dev, which is valid under Debian, people who deploy your application will know they’ll need libxml-devel under Fedora. No magic here, it’s a tough issue; see Requires-External.
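For the record, here is roughly what such hints look like as PEP 345 metadata fields (the names below are Debian-flavored examples; the optional version constraint goes in parentheses):

Requires-External: libxml2-dev
Requires-External: libssl-dev (>=0.9.8)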
The Standard
EDIT: Ian has a much richer standard proposal here (see the comments).
The standard I have in mind is a very lightweight one that could be useful in all our deployment practices: a thin layer on top of the WSGI standard.
A WSGI application is a directory containing:
- a text file located in the directory at dependencies.txt, listing all Python dependencies, possibly reusing pip’s requirements format
- a text file located in the directory at external-dependencies.txt, listing all system dependencies, possibly reusing the PEP 345 format
- a Python script located in the directory at bin/wsgiapp, exposing an “application” variable; the shebang line of the script may also point to a local Python interpreter (a virtualenv one). See the sketch below.
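To make the contract concrete, here’s a minimal sketch of what bin/wsgiapp could contain. Everything in it is illustrative; the only thing the standard would actually require is the “application” variable:

#!/usr/bin/env python
# bin/wsgiapp -- a minimal sketch; the shebang could instead point to the
# virtualenv interpreter living next to this script.

def application(environ, start_response):
    # The whole contract: expose a WSGI callable named "application".
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello from a standard-layout app!\n']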
From there, we have all kinds of possible scenarios where the application can be built and/or run with the usual set of tools.
Here’s one example of a deployment from scratch:
- The repository of the project is cloned
- A virtualenv is created in the repository clone
- pip, which gets installed with virtualenv, is used to install all the dependencies described in dependencies.txt
- gunicorn is used to run the app locally using “cd bin; gunicorn wsgiapp:application”
- the directory is zipped and sent to production
- the directory is unzipped
- virtualenv is run again in the directory
- the app is hooked to Apache+mod_wsgi
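In shell terms, the local half of that scenario might look like this; every path and URL here is hypothetical:

$ git clone http://example.com/myapp && cd myapp
$ virtualenv .
$ bin/pip install -r dependencies.txt
$ cd bin; gunicorn wsgiapp:application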
Another scenario I’d use in our RPM environment:
- The repository of the project is cloned
- an RPM is built for each package listed in dependencies.txt
- if possible, external-dependencies.txt is used to feed a spec file (see the excerpt below)
- the app is deployed using the RPM collection
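For instance, the Requires lines of the spec file could be generated from that list; a hypothetical excerpt, using Fedora-style package names:

# myapp.spec (excerpt) -- Requires lines fed from external-dependencies.txt
Requires: libxml2-devel
Requires: openssl-devel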
That’s the idea, roughly: a light standard pointing to a WSGI app and a list of its dependencies.
Tarek, are Mozilla’s existing tools for doing this available somewhere? Unsurprisingly (given that I work for Red Hat), I use RPM for deployment management as well, but my process for creating RPMs from PyPI packages is currently fairly manual (use py2pack to fetch the source and make an initial spec file, hack on the spec file to make it a bit cleaner, then load that into the internal build system).
That wasn’t too bad for the initial dependency set, but there’s no way it’s sustainable long term.
I wrote pypi2rpm, a script on top of distutils2 that creates RPMs with a simple:
$ pypi2rpm WebOb
That includes version sorting and such things.
There are options to provide your own spec file when the automatic metadata conversion is not clean enough.
For our deployments, besides a few packages, we’re good with the defaults.
People write a text file listing all the dependencies, and Jenkins builds all the RPMs for RHEL 6 or CentOS 5.
There’s an extra script in a tool called Mopytools that drives pypi2rpm to create the set of RPMs given that file.
Example of such a file: http://hg.mozilla.org/services/server-storage/file/28b0bbd6fc88/prod-reqs.txt
Thanks!
When using RPM, how do you manage the typical issue of two different projects deployed on the same server that require different versions of the same dependency? Or do you not have such a requirement, and only one project is deployed on any particular server?
This aspect brings up an interesting question that I think can be answered by an execution environment. I don’t believe it is a Python specific concept. Basically, an execution environment gives you an entry point to a subset of a system. Just like virtualenv creates a sandboxed version of the Python environment, an execution environment provides an application specific layer for the larger system. In this way, your RPMs or tar.gz packages can be installed in this execution environment and when a dependency conflicts with the system dependency, your execution environment allows you to install and override that dependency.
Now, I have no idea what others are doing to solve this issue, but this has been my solution. The nice thing about an execution environment is that it allows consideration for other non-python applications. It is sort of like a chroot jail, without the need to completely reproduce the entire system. The goal is that the low level system details can be standardized, while still allowing applications a means of overriding those standards where necessary.
I wrote a really simple example here -> http://bitbucket.org/elarson/xenv. It is nothing fancy whatsoever, but by changing the question of how to deploy into a question of installing into an execution environment, mature process management tools (daemontools, init, supervisord, monit, god, etc.) become more powerful.
Yeah, we simplify our lives by always having one application per server, or a set of applications that use the same libs.
This proposal provides a way for a project to communicate its settings to the container. What about the reverse: the container providing some settings to the WSGI app? Like a list of provided databases (RDBMSs, memcached, Redis) and services (SMTP address, writable directory for logs). Right now different hosters inject this stuff in ad-hoc ways, patching Django’s settings or adding custom code to pull this data from somewhere.
Environment variables are per-process globals and don’t work well with multithreaded servers.
Passing data via the WSGI environ happens too late, when the application already has a request to process.
Red Hat’s OpenShift Express provides this as environment variables. Isn’t much of what we discuss here somewhat implemented in OpenShift? I can’t find any discussion on how OpenShift Express is implemented.
I think that managing dependencies (both Pythonic and external) with Python is not a problem only for WSGI apps: you’ll meet the same problem with other kinds of Python apps.
We should be able to build system packages (RPM, DEB…) with distribute and setuptools (or the new packaging module, which I have not worked with yet), and we should be able to build a virtualenv sandbox from them too…
I think that you should not multiply files in order to do so (i.e. creating *dependencies.txt files). I would like to be able to describe all the dependencies in the setup file.
I don’t think it’s a matter of lacking tools to do all these things. I think it’s more a matter of having a standard, unique place to describe all the deps, without having to recursively dive into all the packages.
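For illustration, here’s roughly what that would look like with today’s setuptools; the project name, pinned versions, entry-point group, and factory path are all made up:

from setuptools import setup, find_packages

setup(
    name='myapp',
    version='0.1',
    packages=find_packages(),
    # the Python dependencies, pinned before a release
    install_requires=['Pyramid==1.0', 'Mako==0.4.1'],
    # a hypothetical entry point exposing the WSGI application factory
    entry_points={
        'paste.app_factory': ['main = myapp.wsgi:make_app'],
    },
)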
Not to bikeshed, but that’s a lot of extra files when we already have packaging tools doing most of this (as you of course know). wsgiapp can be an entry point, and the dependencies can be install_requires dependencies. Only external-dependencies is a new kind of metadata.
I am thinking more about a project-level metadata file that describes what an application is: not just a single package plus all its dependencies.
I’ve created a more concrete proposal here: https://github.com/ianb/pywebapp/blob/master/docs/spec.txt (the other code in the repository roughly implements parts, but is not at all complete).
I don’t think there’s much value in an overly minimal specification. There are lots of uninteresting questions about how applications work that can nevertheless cause breakage, so they should be defined.
I like it a lot.
It’s really that idea of describing an application, not a Python package. My proposal was much lighter because I suspect many of the things you’ve described will never really be used.
Maybe there’s a middle ground.
I’ll update my post to link to your spec.
Many of the features came from adapting real applications to the format, so each feature generally has a realistic use case behind it.
After a bit more thought, here’s what I think can be a minimal format: https://plus.google.com/u/0/104537541227697934010/posts/NtWffVh4P4M
I have 3 WSGI apps running on a single server. The main deployment issues I have:
– setting up a different virtualenv per application
– automating the handling of the static content by apache/mod_wsgi or nginx/uwsgi, either by copying the static content from the egg
Is #1 an issue? For #2, I think it can be partially solved by having a local Apache/nginx configuration in each one of your projects, then having an include in the main config, or copying the conf over into /etc/nginx/conf.d, as in the snippet below.
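For example, each project could ship a small nginx file like this hypothetical one, picked up by an include /etc/nginx/conf.d/*.conf; line in the main config (all names, paths, and ports are made up):

# /etc/nginx/conf.d/myapp.conf -- hypothetical per-project config
server {
    listen 80;
    server_name myapp.example.com;

    # serve static files directly, bypassing the WSGI app
    location /static/ {
        alias /opt/myapp/static/;
    }

    # everything else goes to the app server (e.g. gunicorn)
    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}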
To me, it seems that which OS-level packages to install cannot be a concern of the app.
For example, let’s say my app depends on a MySQL database. Which OS packages do I list in the external-dependencies file? The MySQL client? MySQL server? MySQL development headers?
If I’m running the DB on the same host as the app, then I probably need all three. If I’m running the DB on a different host, and I’m using a pure-python MySQL driver, then I don’t need any of those packages.
This type of distinction is not at the application level; it’s one level higher, solved by something like Chef or Puppet.
I completely agree, and this is why I mentioned that this info was just a *hint* and could not be automated for sure.
RPM does support multiple packages in multiple versions (I think this is not the case with dpkg); same under Windows (and possibly under Mac OS).
If every single Python package were deployed using /lib26/mypackage-A.B.C.D (where A, B, C and D are *integers*), that would solve all the cross-dependency issues, provided Python supported a statement like: import mypackage >= (A, B, C, D).
To do this, two key things need to be implemented:
1. enforcing __version__ to be a tuple of integers
2. making the changes in the import syntax
This would solve all the dependency problems overnight.
Thanks
The idea was proposed some years ago (by Toshio and myself, IIRC) but rejected because it added too much complexity on top of the legacy, and because virtualenv solves this at the Python level.
I disagree; it’s still worthwhile to push: all the other solutions go against the “duck” principle, just interim half-baked patches (they don’t integrate at the system level!). How complex would it be to introduce a new built-in function called import2(modulename, (A, B, C, D)) if that cannot be made into a proper statement? I think it could also be added as a simple module (no, splattering .pth files doesn’t qualify)… well, you know better; my point is, if that is the problem, go for it.
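To make the discussion concrete, here is a rough sketch of what such an import2 helper might do under the /lib26/name-A.B.C.D layout proposed above. None of this is standard Python, and it assumes each versioned directory contains the package itself:

import os
import sys

LIB_DIR = '/lib26'  # hypothetical versioned package root

def import2(name, min_version):
    # Collect every LIB_DIR/name-A.B.C.D entry whose version tuple
    # satisfies the requested minimum.
    candidates = []
    for entry in os.listdir(LIB_DIR):
        if not entry.startswith(name + '-'):
            continue
        try:
            version = tuple(int(p) for p in entry[len(name) + 1:].split('.'))
        except ValueError:
            continue
        if version >= min_version:
            candidates.append((version, entry))
    if not candidates:
        raise ImportError('%s >= %r not found in %s' % (name, min_version, LIB_DIR))
    # Highest matching version wins; its directory is assumed to hold
    # the package, so it goes first on sys.path before importing.
    candidates.sort()
    sys.path.insert(0, os.path.join(LIB_DIR, candidates[-1][1]))
    return __import__(name)

# usage: mypackage = import2('mypackage', (0, 0, 2))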
How do you handle installation? What happens to site-packages?
Which version do you load by default if a project has 3 versions installed? What about legacy code?
/myapp
^ this is the admin/distro-elected version (e.g. __version__ == (0, 0, 2)):
import myapp
/myapp-0.0.1
^ some legacy app can require this one:
import myapp == (0, 0, 1)  (just a syntax example)
/myapp-0.0.3
^ my fast-paced app can require this one:
import myapp > (0, 0, 2)
How does this sound? It looks transparent to me (and orthogonal): no need for magic post-install changes.
I think the whole idea is worthwhile to push again.
@Antonio: how do you elect a distro as the admin/distro? Also, do you imply that the project has a single package, to which you append a version number? What about its scripts that get installed elsewhere? Or is this a self-contained installation à la egg? What happens to data files?
Each Linux distro doesn’t change the package version within a release (e.g. SUSE 12.1 will have mypackage in version 0.2.0 and will just “upgrade” within 0.2.x): that qualifies as the admin/distro version (it guarantees a clear baseline).
Packages and modules are what Python handles well: I’m not involving “projects” here (too many confusing things go in there).
What’s so special about the “data” files? Are we talking about package/module data, or things like man pages, docs, tests, icons, etc.? That’s the reason for the admin/distro thing: it’s not in a package’s scope to decide box policies anyway.
As an administrator I always avoided setuptools for that reason: I do NOT want a package to go over the internet by default, and I do not want to install things requiring admin rights. It’s policy, and sometimes it’s enforced at the company level.
What about a *new* package? Say you install Foo 1.2, then 1.3, then 1.1. And for data files, if my project wants to install a man page, how do you arbitrate between two competing versions?
And we’ve just scratched the surface of the issues. If you want to push the idea, maybe a more appropriate place is python-ideas.
If Foo 1.2 is the one installed with the distro, then that’s the man page. If there are conflicting files, the packager should really fix it in the spec file: that will always be a problem, and my suggestion is to keep files as orthogonal as possible.