Snow sprint report #2: benchmarking
by Tarek Ziadé
EDIT: The Chomsky generator was somewhat limited and was creating very similar documents. Dokai worked on another text generator that produces more varied documents. It is based on various files and combines random texts that are quite nice, check it out! It lives in the same place, but the function is called random_text() (I have updated the code extract as well).
Dokai and Tom are working hard on the best way to hook the regular catalog up with the Solr utility. I have been a bit on the sidelines for this task, so I haven't caught up with it yet.
Anyway, I have prepared the ground for comparing a plain Plone 3 with a Solr-enabled one. I wanted to generate a Plone instance with many documents whose content would look realistic.
I found on ASPN a great recipe for a Chomsky-based random text generator: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/440546
So I have just bundled it in a script that can be used to generate Plone folders with documents in them. When Dokai and Tom's work is ready, we will use this script to load several thousand documents into the catalogs and start a few benchmarks.
Here's the script (used in an Extension, but straightforward to bundle in a class). You can also download it from here.

"""
Generates documents with realistic content, with a Chomsky random
generator taken here:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/440546
"""
import logging
import random

import transaction
from Products.CMFCore.utils import getToolByName

leadins = """bunch of lines"""
subjects = """bunch of lines"""
verbs = """bunch of lines"""
objects = """bunch of lines"""


def chomsky(times=1):
    """Chomsky method of generating random text."""
    # Each part is a multi-line string, so pick a whole line rather than
    # a single character, and build one full sentence per iteration
    # (lead-in, subject, verb, object).
    return ' '.join(random.choice(part.splitlines()).strip()
                    for i in xrange(times)
                    for part in (leadins, subjects, verbs, objects))


def gen_documents(context, folder, numdocs=10, root_name='doc_'):
    wftool = getToolByName(context, 'portal_workflow')
    for i in range(numdocs):
        # Find an id that is not used yet in the target folder.
        root = i
        id_ = root_name + str(root)
        while id_ in folder.objectIds():
            root += 1
            id_ = root_name + str(root)
        desc = chomsky(5)
        title = chomsky(2)
        sub = chomsky(1)
        # Create the document in the folder whose ids were checked above.
        folder.invokeFactory('Document', id_, description=desc,
                             title=title, subject=sub)
        obj = folder[id_]
        wftool.doActionFor(obj, 'publish')
        logging.info('created document #%d', i)
        if i % 100 == 0:
            transaction.savepoint()


def gen_folders(context, numfolders=10, numdocs=1000,
                root_folder_name='folder_', root_name='doc_'):
    wftool = getToolByName(context, 'portal_workflow')
    for i in range(numfolders):
        # Find an id that is not used yet in the context.
        root = i
        id_ = root_folder_name + str(root)
        while id_ in context.objectIds():
            root += 1
            id_ = root_folder_name + str(root)
        context.invokeFactory('Folder', id_)
        obj = context[id_]
        wftool.doActionFor(obj, 'publish')
        logging.info('created folder #%d', i)
        gen_documents(obj, obj, numdocs, root_name)
        transaction.savepoint()


def gen_sample(portal):
    gen_folders(portal)


def random_text(data, num_words=100):
    """Source: http://www.physics.cornell.edu/sethna/StatMech/ComputerExercises/RandText"""
    # Split the source text and map each pair of consecutive words
    # to the list of words that follow that pair.
    words = data.split()
    prefix = {}
    for i in xrange(len(words) - 2):
        prefix.setdefault((words[i], words[i + 1]), []).append(words[i + 2])

    current_pair = random.choice(prefix.keys())
    text = current_pair[0] + ' ' + current_pair[1]
    for i in xrange(num_words - 2):
        # The last two words in the document may not have a suffix.
        if current_pair not in prefix:
            break
        next_word = random.choice(prefix[current_pair])
        text = text + ' ' + next_word
        current_pair = (current_pair[1], next_word)
    return text
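To try the prefix-map approach of random_text() outside of a Zope instance, a minimal standalone sketch could look like the one below. It is written for Python 3, whereas the script above targets the Python 2 that Plone 3 runs on. The build_prefix_map() helper, the rng parameter and the handling of very short inputs are illustrative additions, not part of the original script.

```python
import random


def build_prefix_map(text):
    """Map each pair of consecutive words to the words seen right after it."""
    words = text.split()
    prefix = {}
    for i in range(len(words) - 2):
        prefix.setdefault((words[i], words[i + 1]), []).append(words[i + 2])
    return prefix


def random_text(text, num_words=100, rng=random):
    """Generate up to num_words words by walking the prefix map.

    rng is a hypothetical extra parameter: pass random.Random(seed)
    to get reproducible output.
    """
    prefix = build_prefix_map(text)
    if not prefix:
        # Fewer than three words: there is nothing to chain.
        return ' '.join(text.split())
    pair = rng.choice(sorted(prefix))
    output = list(pair)
    for _ in range(num_words - 2):
        # The last two words of the source may not have a suffix.
        if pair not in prefix:
            break
        following = rng.choice(prefix[pair])
        output.append(following)
        pair = (pair[1], following)
    return ' '.join(output)


if __name__ == '__main__':
    sample = ("the quick brown fox jumps over the lazy dog "
              "and the quick red fox sleeps")
    print(random_text(sample, num_words=20, rng=random.Random(42)))
```

Every word in the output follows a pair of words that it also follows somewhere in the source text, which is what makes the result read more naturally than independently drawn phrases. The larger and more varied the source text, the more varied the generated documents.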
Great work – we really need decent benchmarking tools.
You might also want to have a look at the JMeter tests created during the Performance Sprint. They also include a text generator with another approach: it fetches text snippets from news RSS feeds.
http://dev.plone.org/collective/browser/JMeterTestPlans/trunk