So it’s been two months. We didn’t complete all the tasks on this list, but we came pretty close.
We got very far with the speed improvements:
* Server switch
* Plone upgrade
* Multiple Zope servers in production
* Changing file attachments’ storage mechanism
A pretty good showing on software dependency upgrades:
* Upgrade WordPress on Streetsblog and Streetfilms
* Upgrade Xinha on Streetsblog
* Upgrade Xinha on OpenPlans.org and LivableStreets
We hit most of the new feature targets:
* Project wiki backup
* Account deletion
* Project-specific search
* Tagging
We didn’t actually launch either of the large-scale lingering projects:
* Geotagging of people and projects
* NYC Fix-It Map: Some of the OpenGeo team
So, putting it all together, I think missing the target boils down to two things: the Plone 3 upgrade, and lack of commitment. Four of the seven missed target projects are blocking on the completion of the Plone upgrade, and the other three we just didn’t really get around to.
I think deadlines are an easy and obvious solution to the second problem. These objectives sort of qualify as deadlines, but they were a bit vague, not exactly a commitment, and, I think most importantly, grouped together. Nothing has an individual deadline, so the set of projects is naturally prioritized, and the ones that are large-ish (particularly those that “just” involve attention more than they do anything else) and/or not urgent priorities just slip to the back. Those are the sorts of projects that I think would really benefit from deadlines … provided they can be identified.
As for the Plone upgrade, it’s hard to say yet what it really points to. Maybe it means that our deployments take too much time and effort. (Which I think they do, but Doug’s been furiously working to improve that as he does deployments to OpenPlans.org, so each one is easier than the last.) Maybe it means that we don’t have enough sense of how, when and how much to test an upcoming deployment whose changes are very wide-reaching. (Which I think we don’t, but I have no idea what we could do better.) Or maybe it just means that upgrading OpenCore from Plone 2.5 to Plone 3 is a uniquely fuzzy and fairly risky thing and that I was just overly optimistic about the effort it would require, and it’s just a one-time mistake with no useful information to mine. I think the Xinha upgrade and the load-balancing will be pretty decent ways to figure out if there’s a fixable underlying problem here, so I’ll be particularly curious to see how those go.
There’s one other problem I think I see here. Though we did get around to most of them, the server switches, Streetsblog software upgrades, and Plone upgrade did all _happen_ later than I was expecting, even leaving aside how long they took to finish after they had been started. In a sense I think they all suffered from a sensible reluctance on the part of the stakeholders (variously, Nick, Aaron, Doug, Bryan and myself) to pull the switch on a major upgrade. In the case of the server switches and Streetsblog upgrades (the ones we’ve actually managed to pull the switch on), it was even difficult to schedule a time to do it, between regular site traffic levels and planned events on the sites that required high reliability. Broadly speaking, I think we can improve this with more, and more explicit, communication with stakeholders: agreeing upon a set of acceptance tests in each case; stakeholders perhaps listing, on a private calendar somewhere, their planned upcoming needs for reliability; the engineering team setting some span of time in which to perform an upgrade; stakeholders picking a date and time within that span. I’d be curious to experiment with some of these.
September 4, 2008 at 7:55 pm
Thanks for this debrief; it’s very helpful to compare our goals to what we actually accomplished. In addition to what you mention above, I can think of two more issues that set our deployments back a bit. The first is a lack of testing resources. You mention that there was “sensible reluctance on the part of the stakeholders”; I’ve certainly had some “sensible reluctance” w.r.t. the Plone 3 deployment. This reluctance is informed by the fact that (until very recently, anyway) nobody other than myself has done any significant amount of hammering on the P3-based stack. It does seem to me that engineering tasks are being completed faster than QA is able to sign off on them, creating a backlog of items that are close-to-ready but about which the developers may not feel entirely confident.
The other issue is a bit more pedestrian, which is that some of this has hit right in the middle of the summer vacation season. While in theory there are a number of folks who can troubleshoot and fix issues in any part of our codebase, in truth it’s likely that I’ll be able to resolve certain P3-related issues much more quickly than anyone else, and thus it’s unnecessarily risky to try to deploy that code to the live site when I’m not around to deal w/ issues as they come up.
Neither of these negates your ideas or suggestions, of course; I just wanted to add them to the list of things to consider in our future planning processes.
September 4, 2008 at 9:43 pm
Yeah, thanks for the great write-up.
Re. “engineering tasks are being completed faster than QA is able to sign off on them”… I still feel that this is largely an organizational problem at TOPP:
1) What QA means in our process is entirely ad hoc and optional. Tim has helpfully pushed us through some very productive testing episodes, and Doug and I are currently pushing hard on a very time-consuming code review (the first I know of here since I joined TOPP); but we as an organization haven’t prioritized these things or figured out how much time we need to spend on them. Lately, as far as I know, we’ve been running without any schedule at all, which as Ethan pointed out is problematic. So +1 to being more explicit about scheduling.
2) We have no dedicated QA staff. QA at TOPP means yanking developers and designers away from developing and designing. Which would be OK only if we could learn to accept working at a radically slower pace. Personally I think we should put some dedicated testers in the staff budget ASAP. I think Spolsky’s pretty on-target about this: http://www.joelonsoftware.com/articles/fog0000000067.html
Tangent: I’m curious why this blog post ended up here rather than on, say, http://www.openplans.org/projects/operations/blog/