Posts Tagged ‘Project management’
This is part 5 in a series on organizational design and growth.
What it means
Each of us has a job to do, probably more than one, and our teammates should know what they are.
Why it’s important
Roles are a kind of standing commitment we make to each other. They’re a way of dividing up work which is easy to understand and simple to apply. Utilizing this tool will make it easier for us to coordinate our day to day work, and manage the changes and growth we’re going through.
Old Status Quo
Roles are vague or nonexistent. Management roles in particular are probably not well understood. Many people juggle multiple roles, all of which are implicit. Moving to a new team means learning from scratch what other people do. People take responsibility for tasks and decisions largely on a case-by-case basis, or based on implicit knowledge of what someone does (or doesn’t do). In many cases, there is only one person in the company who knows how to perform a certain function. When someone leaves or goes on vacation, gaps are left behind.
New Status Quo
Each individual has a clear understanding of the scope of their job. We have a handful of well defined roles, which are used by multiple teams and have written definitions. People who are new or transfer between teams have a relatively easy time understanding what the people around them are doing. Many day to day responsibilities are defined by roles, and more than one person can fill that role. When someone leaves a team, another person can cover their critical roles.
Behaviors that help
Define project roles: when starting something new, make it explicit who will be working on it. Often this will be more than one person, often from different teams. For example, customer-facing product changes should have at least a product owner and an engineering owner. This makes it easy to tell if too many concurrent projects are dependent on a single person, which is a recipe for blockage.
Define team roles: Most recurring tasks should fall within a defined role. An owner of a technical service is an example of a role. An on-call engineer is an example of a time-limited role. There are many others which will depend on the team and its scope.
Define job roles: Have a conversation with your teammates and manager about what the scope of your job is, which responsibilities are shared with other members of the team and which are yours alone.
Obstacles that stand in our way
Getting hung up on titles as ego gratification. Roles are tools, not masters.
Fear that a role limits your options, locks you into doing one thing forever. Roles can be as flexible as we want them to be.
Since I resumed active status in Debian, I’ve been thinking about how to bridge the gap between Debian and its derivatives*. I’ve spoken at length with Zack, the attendees of the Derivatives BoF at DebConf 10, and the fine folks at the Derivatives Front Desk about the technical and social issues affecting derivative projects, and could probably write a very thorough series of blog posts on the subject.
Instead, Zack and I decided to try doing something about it: we have begun a project to test out a new approach to the problem.
DEX is all about action: merging patches, fixing bugs, crunching data, whatever is necessary to get changes from derivatives into Debian proper. DEX doesn’t try to change the way any existing project works, but adds a “fast path” for getting code from one place to another.
DEX is a joint task force where developers from Debian and its derivatives work together on this common goal. As a pilot project, we’ve established an Ubuntu DEX Team focused on merging code from Ubuntu into Debian. With members from both projects, we hope to be able to resolve blockage anywhere in the pipeline. Whatever needs to get done in order to merge an Ubuntu patch, someone in the Ubuntu DEX team will know what to do. If we get good results with Ubuntu, we hope that other derivatives will follow. With thanks to David Paleino, we’re excited that the Utnubu project is merging into DEX as it aligns well with their goals. I’m very grateful to have Colin Watson and James Westby signed up to contribute as well.
Our first project is simple: turn this list green. This is an archive of quite old patches from Ubuntu, most of which have probably been merged already or made obsolete, but they pre-date any kind of tracking system so they need to be verified. Once that’s done, we’ll move on to a new project with a new todo list.
If you want to see Debian benefit from technical work done in derivatives, DEX is a chance for you to act together to make it happen. If you work on a derivative and want to carry a smaller delta, come and join us. I’m sure we’ll learn a lot from this experience.
* There are many instances of great cooperation between Debian and derivative distributions, including joint package maintenance teams, and some derivatives are even part of the Debian project. Nonetheless, there are areas were most people I’ve spoken to agree that we need to do better. This is what I’ve referred to as the “gap”.
In our first few years, Ubuntu experienced explosive growth, from zero to millions of users. Because Ubuntu is an open project, these people don’t just use Ubuntu, but can see what’s happening next and influence it through suggestions and contributions. The volume of suggestions quickly became unmanageable through ad hoc discussion, because the volume of feedback overwhelmed the relatively few people who were actively developing Ubuntu.
In order to better manage user feedback at this scale, Ubuntu Brainstorm was created in 2008. It’s a collaborative filtering engine which allows anyone to contribute an idea, and have it voted on by others. Since then, it’s been available to Ubuntu developers and leaders as an information source, which has been used in various ways. The top ideas are printed in the Ubuntu Weekly Newsletter each week. We experimented with producing a report each release cycle and sharing it with the developer community. People have been encouraged to take these suggestions to the Ubuntu Developer Summits. We continue to look for new and better ways to process the feedback provided by the user community.
Most recently, I asked my colleagues on the Ubuntu Technical Board in a meeting whether we should take responsibility for responding to the feedback available in Ubuntu Brainstorm. They agreed that this was worth exploring, and I put forward a proposal for how it might work. The proposal was unanimously accepted at a later meeting, and I’m working on the first feedback cycle now.
In short, the Technical Board will ensure that, every three months, the highest voted topics on Ubuntu Brainstorm receive an official response from the Ubuntu project. The Technical Board won’t respond to all of them personally, but will identify subject matter experts within the project, ask them to write a short response, and compile these responses for publication.
My hope is that this approach will bring more visibility to common user concerns, help users understand what we’re doing with their feedback, and generally improve transparency in Ubuntu. We’ve already selected the topics for the first iteration based on the most popular items of the past six months, and are organizing responses now. Please visit brainstorm.ubuntu.com and cast your votes for next time!
I first heard about DevOps from Lindsay Holmwood at linux.conf.au 2010. Since then, I’ve been following the movement with interest. It seems to be about cross-functional involvement in software teams, specifically between software development and system administration (or operations). In many organizations, especially SaaS shops, these two groups are placed in opposition to each other: developers are driven to deliver new features to users, while system administrators are held accountable for the operation of the service. In the best case, they maintain a healthy balance by pushing in opposite directions, but more typically, they resent each other for getting in the way, as a result of this dichotomy:
|is responsible for…||creating products||offering services|
|is measured on…||delivery of new features||high reliability|
|optimizes by…||increasing velocity||controlling change|
|and so is perceived as…||reckless and irresponsible||obstructing progress|
Of course, both functions are essential to a viable service, and so DevOps aims to replace this opposition with cooperation. By removing this friction from the organization, we hope to improve efficiency, lower costs, and generally get more work done.
So, DevOps promotes the formation of cross-functional teams, where individuals still take on specialist “development” or “operations” roles, but work together toward the common goal of delivering a great experience to users. By working as teammates, rather than passing work “over the wall”, they can both contribute to development, deployment and maintenance according to their skills and expertise. The team becomes a “devops” team, and is responsible for the entire product life cycle. Particular tasks may be handled by specialists, but when there’s a problem, it’s the team’s problem.
Some take it a step further, and feel that what’s needed is to combine the two disciplines, so that individuals contribute in both ways. Rather than thinking of themselves as “developers” or “sysadmins”, these folks consider themselves “devops”. They work to become proficient in both roles, and to synthesize new ways of working by drawing on both types of skills and experience. A common crossover activity is the development of sophisticated tools for automating deployment, monitoring, capacity management and failure resolution.
DevOps meets Cloud
Like DevOps, cloud is not a specific technology or method, but a reorganization of the model (as I’ve written previously). It’s about breaking down the problem in a different way, splitting and merging its parts, and creating a new representation which doesn’t correspond piece-for-piece to the old one.
DevOps drives cloud because it offers a richer toolkit for the way they work: fast, flexible, efficient. Tools like Amazon EC2 and Google App Engine solve the right sorts of problems. Cloud also drives DevOps because it calls into question the traditional way of organizing software teams. A development/operations division just doesn’t “fit” cloud as well as a DevOps model.
Deployment is a classic duty of system administrators. In many organizations, only the IT department can implement changes in the production environment. Reaping the benefits of an IaaS environment requires deploying through an API, and therefore deployment requires development. While it is already common practice for system administrators to develop tools for automating deployment, and tools like Puppet and Chef are gaining momentum, IaaS makes this a necessity, and raises the bar in terms of sophistication. Doing this well requires skills and knowledge from both sides of the “fence” between development and operations, and can accelerate development as well as promote stability in production.
This is exemplified by infrastructure service providers like Amazon Web Services, where customers pay by the hour for “black box” access to computing resources. How those resources are provisioned and maintained is entirely Amazon’s problem, while its customers must decide how to deploy and manage their applications within Amazon’s IaaS framework. In this scenario, some operations work has been explicitly outsourced to Amazon, but IaaS is not a substitute for system administration. Deployment, monitoring, failure recovery, performance management, OS maintenance, system configuration, and more are still needed. A development team which is lacking the experience or capacity for this type of work cannot simply “switch” to an IaaS model and expect these needs to be taken care of by their service provider.
With platform service providers, the boundaries are different. Developers, if they build their application on the appropriate platform, can effectively outsource (mostly) the management of the entire production environment to their service provider. The operating system is abstracted away, and its maintenance can be someone else’s problem. For applications which can be built with the available facilities, this will be a very attractive option for many organizations. The customers of these services may be traditional developers, who have no need for operations expertise. PaaS providers, though, will require deep expertise in both disciplines in order to build and improve their platform and services, and will likely benefit from a DevOps approach.
Technical architecture draws on both development and operations expertise, because design goals like performance and robustness are affected by all layers of the stack, from hardware, power and cooling all the way up to application code. DevOps itself promotes greater collaboration on architecture, by involving experts in both disciplines, but cloud is a great catalyst because cloud architecture can be described in code. Rather than talking to each other about their respective parts of the system, they can work together on the whole system at once. Developers, sysadmins and hybrids can all contribute to a unified source tree, containing both application code and a description of the production environment: how many virtual servers to deploy, their specifications, which components run on which servers, how they are configured, and so on. In this way, system and network architecture can evolve in lockstep with application architecture.
Cloudy promises such as dynamic scaling and fault tolerance call for a DevOps approach in order to be realized in a real-world scenario. These systems involve dynamically manipulating production infrastructure in response to changing conditions, and the application must adapt to these changes. Whether this takes the form of an active, intelligent response or a passive crash-only approach, development and operational considerations need to be aligned.
DevOps and cloud will continue to reinforce each other and gain momentum. Both individuals and organizations will need to adapt in order to take advantage of the opportunities provided by these new models. Because they’re complementary, it makes sense to adopt them together, so those with expertise in both will be at an advantage.
This is a repost from the ubuntu-devel mailing list, where there is probably some discussion happening by now.
After each UDS, the organizers evaluate the event and consider how it could be further improved in the future. As a result of this process, the format of UDS has evolved considerably, as it has grown from a smallish informal gathering to a highly structured matrix of hundreds of 45-to-60-minute sessions with sophisticated audiovisual facilities.
A survey can’t tell the whole story, though, so I would also like to start a more free-form discussion here among Ubuntu developers as well. I have some thoughts I’d like to share, and I’m interested in your perspectives as well.
The core purpose of UDS has always been to help Ubuntu developers to explore, refine and share their plans for the subsequent release. It has expanded over the years to include all kinds of contributors, not only developers, but the principle remains the same.
We arrive at UDS with goals, desires and ideas, and leave with a plan of action which guides our work for the rest of the cycle.
The status quo
UDS looks like this:
This screenshot is only 1600×1200, so there are another 5 columns off the right edge of the screen for a total of 18 rooms. With 7 time slots per day over 5 days, there are over 500 blocks in the schedule grid. 9 tracks are scattered over the grid. We produce hundreds of blueprints representing projects we would like to work on.
It is an impressive achievement to pull this event together every six months, and the organizers work very hard at it. We accomplish a great deal at every UDS, and should feel good about that. We must also constantly evaluate how well it is working, and make adjustments to accommodate growth and change in the project.
How did we get here?
(this is all from memory, but it should be sufficiently accurate to have this discussion)
In the beginning, before it was even called UDS, we worked from a rough agenda, adding items as they came up, and ticking them off as we finished talking about them. Ad hoc methods worked pretty well at this scale.
As the event grew, and we held more and more discussions in parallel, it was hard to keep track of who was where, and we started to run into contention. Ubuntu and Launchpad were planning their upcoming work together at the same time. One group would be discussing topic A, and find that they needed the participation of person X, who was already involved in another discussion on topic B. The A group would either block, or go ahead without the benefit of person X, neither of which was seen to be very effective. By the end of the week, everyone was mentally and physically exhausted, and many were ill.
As a result, we decided to adopt a schedule grid, and ensure that nobody was expected to be in two places at once. Our productivity depended on getting precisely the right people face to face to tackle the technical challenges we faced. This meant deciding in advance who should be present in each session, and laying out the schedule to satisfy these constraints. New sessions were being added all the time, so the UDS organizers would stay up late at night during the event, creating the schedule grid for the next day. In the morning, over breakfast, everyone would tell them about errors, and request revisions to the schedule. Revisions to the schedule were painful, because we had to re-check all of the constraints by hand.
So, in the geek spirit, we developed a program which would read in the constraints and generate an error-free schedule. The UDS organizers ran this at the end of each day during the event, checked it over, and posted it. In the morning, over breakfast, everyone would tell them about constraints they hadn’t been aware of, and request revisions to the schedule. Revisions to the schedule were painful, because a single changed constraint would completely rearrange the schedule. People found themselves running all over the place to different rooms throughout the day, as they were scheduled into many different meetings back-to-back.
At around this point, UDS had become too big, and had too many constraints, to plan on the fly (unconference style). We resolved to plan more in advance, and agree on the scheduling constraints ahead of time. We divided the event into tracks, and placed each track in its own room. Most participants could stay in one place throughout the day, taking part in a series of related meetings except where they were specifically needed in an adjacent track. We created the schedule through a combination of manual and automatic methods, so that scheduling constraints could be checked quickly, but a human could decide how to resolve conflicts. There was time to review the schedule before the start of the event, to identify and fix problems. Revisions to the schedule during the event were fewer and less painful. We added keynote presentations, to provide opportunities to communicate important information to everyone, and ease back into meetings after lunch. Everyone was still exhausted and/or ill, and tiredness took its toll on the quality of discussion, particularly toward the end of the week.
Concerns were raised that people weren’t participating enough, and might stay on in the same room passively when they might be better able to contribute to a different session happening elsewhere. As a result, the schedule was randomly rearranged so that related sessions would not be held in the same room, and everyone would get up and move at the end of each hour.
This brings us roughly to where things stand today.
Problems with the status quo
- UDS is big and complex. Creating and maintaining the schedule is a lot of work in itself, and this large format requires a large venue, which in turn requires more planning and logistical work (not to mention cost). This is only worthwhile if we get proportionally more benefit out of the event itself.
- UDS produces many more blueprints than we need for a cycle. While some of these represent an explicit decision not to pursue a project, most of them are set aside simply because we can’t fit them in. We have the capacity to implement over 100 blueprints per cycle, but we have *thousands* of blueprints registered today. We finished less than half of the blueprints we registered for 10.04. This means that we’re spending a lot of time at UDS talking about things which can’t get done that cycle (and may never get done).
- UDS is (still) exhausting. While we should work hard, and a level of intensity helps to energize us, I think it’s a bit too much. Sessions later in the week are substantially more sluggish than early on, and don’t get the full benefit of the minds we’ve brought together. I believe that such an intense format does not suit the type of work being done at the event, which should be more creative and energetic.
- The format of UDS is optimized for short discussions (as many as we can fit into the grid). This is good for many technical decisions, but does not lend itself as well to generating new ideas, deeply exploring a topic, building broad consensus or tackling “big picture” issues facing the project. These deeper problems sometimes require more time. They also benefit tremendously from face-to-face interaction, so UDS is our best opportunity to work on them, and we should take advantage of it.
- UDS sessions aim for the minimum level of participation necessary, so that we can carry on many sessions in parallel: we ask, “who do we need in order to discuss this topic?” This is appropriate for many meetings. However, some would benefit greatly from broader participation, especially from multiple teams. We don’t always know in advance where a transformative idea will come from, and having more points of view represented would be valuable for many UDS topics.
- UDS only happens once per cycle, but design and planning need to continue throughout the cycle. We can’t design everything up front, and there is a lot of information we don’t have at the beginning. We should aim to use our time at UDS to greatest effect, but also make time to continue this work during the development cycle. “design a little, build a little, test a little, fly a little”
- Concentrate on the projects we can complete in the upcoming cycle. If we aren’t going to have time to implement something until the next cycle, the blueprint can usually be deferred to the next cycle as well. By producing only moderately more blueprints than we need, we can reduce the complexity of the event, avoid waste, prepare better, and put most of our energy into the blueprints we intend to use in the near future.
- Group related sessions into clusters, and work on them together, with a similar group of people. By switching context less often, we can more easily stay engaged, get less fatigued, and make meaningful connections between related topics.
- Organize for cross-team participation, rather than dividing teams into tracks. A given session may relate to a Desktop Edition feature, but depends on contributions from more than just the Desktop Team. There is a lot of design going on at UDS outside of the “Design” track. By working together to achieve common goals, we can more easily anticipate problems, benefit from diverse points of view, and help each other more throughout the cycle.
- Build in opportunities to work on deeper problems, during longer blocks of time. As a platform, Ubuntu exists within a complex ecosystem, and we need to spend time together understanding where we are and where we are going. As a project, we have grown rapidly, and need to regularly evaluate how we are working and how we can improve. This means considering more than just blueprints, and sometimes taking more than an hour to cover a topic.
A few months ago, I wrote about changes in our development process for Ubuntu 10.04 LTS in order to meet our goals for this long-term release. So, how has it turned out?
Well, the development teams are still very busy preparing for the upcoming release, so there hasn’t been too much time for retrospection yet. Here are some of my initial thoughts, though.
- Merge from Debian testing – Martin Pitt has started a discussion on ubuntu-devel about how this went. For my part, I found that Lucid included fewer surprises than Karmic.
- Add fewer features – This is difficult to evaluate objectively, but my gut feeling is that we kept this largely under control. As usual, a few surprise desktop features were implemented that not everyone is happy about, myself included.
- Avoid major infrastructure changes – I think we did reasonably well here, though Plymouth is a notable exception. It resulted (unsurprisingly) in some nasty bugs which we’ve had to spend time dealing with.
- Extend beta testing – This will be difficult to assess, though if 10.04 beta was at least as good as 9.10 or 9.04 beta, then it will have arguably been a success.
- Freeze with Debian – Although early indications were good, this didn’t work out so well, as Debian’s freeze was delayed
- Visualize progress – The feature status page provided a lot of visual progress information, and the system behind it allowed us to keep track of work slippage throughout the cycle, both of which seemed like a firm step in the right direction. I’m looking forward to hearing from development teams how this information helped them (or not).
A more complete set of retrospectives on Lucid should give us some good ideas for how to improve further in Maverick and beyond.
Update: Fixed broken link.
List-keeping is generally regarded as boring administrative work, something only important to compulsive organizers. When a writer wants to portray a character as meticulous and dull, they need only brand them as a list-maker, with eyes bespectacled from years of squinting over their lists.
The reality of list-keeping is much more exciting: reminder lists are a mechanism for cognitive time travel. They allow us to transport information from the time when it occurs to us, to a time in the future when it will actually be useful. Like a wormhole, they connect distant points in spacetime (though unfortunately only in one direction, as in the Stargate universe).
Throughout my day, I will remember things I need to do, though not right away: an article which looks interesting, or someone I need to remember to call. Putting these items on a list frees my mind to keep going with whatever I’m doing, knowing that the idea is not lost. A common scenario for me is that I’m riding the tube, reading RSS feeds offline on my Android phone using NewsRob, and come across something I want to explore further. There is as yet no wireless service on the tube, so I can’t do anything but read, but I can send myself an email using K-9 which will be delivered later. At the other end of the wormhole, when I’m back online, I receive the email (usually at my computer) and pick up where I left off.
Traveling through time in your head may not be as exciting as flitting about in a TARDIS, but it is much more accessible, and genuinely rewarding.
For the first time in several years, I had the opportunity to attend a software conference in the city where I lived at the time. I’ve benefited from many InfoQ articles in the past couple of years, and watched recordings of some excellent talks from previous QCon events, so I jumped at the opportunity to attend QCon London 2010. It is being held in the Queen Elizabeth II Conference Center, conveniently located a short walk away from Canonical’s London office.
Whenever I attend conferences, I can’t help taking note of which operating systems are in use, and this tells me something about the audience. I was surprised to notice that in addition to the expected Mac and Windows presence, there was a substantial Ubuntu contingent and some Fedora as well.
A Scalable, Peer-led Model For Building Good Habits In Large & Diverse Development Teams
Jason explained the method he uses to coach software developers.
I got a bad seat on the left side of the auditorium, where it was hard to see the slides because they were blocked by the lectern, so I may have missed a few points.
He began by outlining some of the primary factors which make software more difficult to change over time:
- Readability: developers spend a lot of their time trying to understand code that they (or someone else) have written
- Complexity: as well as making code more difficult to understand, complexity increases the chance of errors. More complex code can fail in more ways.
- Duplication: when code is duplicated, it’s more difficult to change because we need to keep track of the copies and often change them all
- Dependencies and the “ripple effect”: highly interdependent code is more difficult to change, because a change in one place requires corresponding changes elsewhere
- Regression Test Assurance: I didn’t quite follow how this fit into the list, to be honest. Regression tests are supposed to make it easier to change the code, because errors can be caught more easily.
He then outlined the fundamental principles of his method:
- Focus on Learning over Teaching – a motivated learner will find their own way, so focus on enabling them to pull the lesson rather than pushing it to them (“there is a big difference between knowing how to do something and being able to do it”)
- Focus on Ability over Knowledge – learn by doing, and evaluate progress through practice as well (“how do you know when a juggler can juggle?”)
…and went on to outline the process from start to finish:
- Orientation, where peers agree on good habits related to the subject being learned. The goal seemed to be to draw out knowledge from the group, allowing them to define their own school of thought with regard to how the work should be done. In other words, learn to do what they know, rather than trying to inject knowledge.
- Practice programming, trying to exercise these habits and learn “the right way to do it”
- Evaluation through peer review, where team members pair up and observe each other. Over the course of 40-60 hours, they watch each other program and check off where they are observed practicing the habits.
- Assessment, where learners practice a time-boxed programming exercise, which is recorded. The focus is on methodical correctness, not speed of progress. Observers watch the recording (which only displays the code), and note instances where the habit was not practiced. The assessment is passed only if less than three errors are noticed.
- Recognition, which comes through a certificate issued by the coach, but also through admission to a networking group on LinkedIn, promoting peer recognition
Jason noted that this method of assessing was good practice in itself, helping learners to practice pairing and observation in a rigorous way.
After the principal coach coaches a pilot group, the pilot group then goes on to coach others while they study the next stage of material.
To conclude, Jason gave us a live demo of the assessment technique, by launching Eclipse and writing a simple class using TDD live on the projector. The audience were provided with worksheets containing a list of the habits to observe, and instructed to note instances where he did not practice them.
Production deployments using all your team
After a brief introduction to the problems targeted by the devops approach, Julian offered some advice on how to do it right.
He began with the people issues, reminding us of Weinberg’s second law, which is “no matter what they tell you, it’s always a people problem”.
His people tips:
- In keeping with a recent trend, he criticized email as a severely flawed communication medium, best avoided.
- respect everyone
- have lunch with people on the other side of the wall
- discuss your problems with other groups (don’t just ask for a specific solution)
- invite everyone to stand-ups and retrospectives
- co-locate the sysadmins and developers (thomas allen)
Next, a few process suggestions:
- Avoid code ownership generally (or rather, promote joint/collective ownership)
- Pair developers with sysadmins
- It’s done when the code is in production (I would rephrase as: it’s not done until the code is in production)
and then tools:
- Teach your sysadmins to use version control
- Help your developers write performant code
- Help developers with managing their dev environment
- Run your deploy scripts via continuous integration (leading toward continuous deployment)
- Use Puppet or Chef (useful as a form of documentation as well as deployment tools, and on developer workstations as well as servers)
- Integrate monitoring and continuous integration (test monitoring in the development environment)
- Deliver code as OS packages (e.g. RPM, DEB)
- Separate binaries and configuration
- Harden systems immediately and enable logging for tuning security configuration (i.e. configure developer workstations with real security, making the development environment closer to production)
- Give developers access to production logs and data
- Re-create the developer environment often (to clear out accumulated cruft)
I agreed with a lot of what was said, objected to some, and lacked clarity on a few points. I think this kind of material is well suited to a multi-way BOF style discussion rather than a presentation format, and would have liked more opportunity for discussion.
Getting distributed webservices done with Nosql
Lars and Fabrizio described the general “social network problem”, and how they went about solving it. This problem space involves the processing, aggregation and dissemination of notifications for a very high volume of events, as commonly manifest in social networking websites such as Facebook and Twitter which connect people to each other to share updates. Apparently simple functionality, such as displaying the most recent updates from one’s “friends”, quickly become complex at scale.
As an example of the magnitude of the problem, he explained that they process 18 million events per day, and how in the course of storing and sharing these across the social graph, some operations peak as high as 150,000 per second. Such large and rapidly changing data sets represent a serious scaling challenge.
They originally built a monolithic, synchronous system called Phoenix, built on:
- LAMP frontends: Apache+PHP+APC (500 of them)
- Sharded MySQL multi-master databases (150 of them)
- memcache nodes with 1TB+ (60 of them)
They then added on asynchronous services alongside this, to handle things like Twitter and mobile devices, using Java (Tomcat) and RabbitMQ. The web frontend would send out AMQP messages, which would then be picked up by the asynchronous services, which would (where applicable) communicate back to Phoenix through an HTTP API call.
When the time came to re-architect their activity , they identified the following requirements:
- endless scalability
- storage- and cloud-independent
- flexible and extensible data model
This led them to an architecture based on:
- Nginx + Janitor
- Embedded Jetty + RESTeasy
- NoSQL storage backends (no fewer than three: Redis, Voldemort and Hazelcast)
They described this architecture in depth. The things which stood out for me were:
- They used different update strategies (push vs. pull) depending on the level of fan-out for the node (i.e. number of “friends”)
- They implemented a time-based activity filter which recorded a global timeline, from minutes out to days. Rather than traversing all of the user’s “friends” looking for events, they just scan the most recent events to see if their friends appear there.
- They created a distributed, scalable concurrent ID generator based on Hazelcast, which uses distributed locking to assign ranges to nodes, so that nodes can then quickly (locally) assign individual IDs
- It’s interesting how many of the off-the-shelf components had native scaling, replication, and sharding features. This sort of thing is effectively standard equipment now.
Their list of lessons learned:
- Start benchmarking and profiling your app early
- A fast and easy deployment keeps motivation high
- Configure Voldemort carefully (especially on large heap machines)
- Read the mailing lists of the NoSQL system you use
- No solution in docs? – read the sources
- At some point stop discussing and just do it
Learnings from almost five years as a Skype Architect
Andres began with an overview of Skype, which serves 800,000 registered users per employee (650 vs. 521 million). Their core team is based in Estonia. Their main functionality is peer-to-peer, but they do need substantial server infrastructure (PHP, C, C++, PostgreSQL) for things like peer-to-peer supporting glue, e-commerce and SIP integration. Skype uses PostgreSQL heavily in some interesting ways, in a complex multi-tiered architecture of databases and proxies.
His first lesson was that technical rules of thumb can lead us astray. It is always tempting to use patterns that have worked for us previously, in a different project, team or company, but they may not be right for another context. They can and should be used as a starting point for discussion, but not presumed to be the solution.
Second, he emphasized the importance of paying attention to functional architecture, not only technical architecture. As an example, he showed how the Skype web store, which sells only 4 products (skype in, skype out, voicemail, and subscription bundles of the previous three) became incredibly complex, because no one was responsible for this. Complex functional architecture leads to complex technical architecture, which is undesirable as he noted in his next point.
Keep it simple: minimize functionality, and minimize complexity. He gave an example of how their queuing system’s performance and scalability were greatly enhanced by removing functionality (the guarantee to deliver messages exactly once), which enabled the simplification of the system.
He also shared some organizational learnings, which I appreciated. Maybe my filters are playing tricks on me, but it seems as if more and more discussion of software engineering is focusing on organizing people. I interpret this as a sign of growing maturity in the industry, which (as Andres noted) has its roots in a somewhat asocial culture.
He noted that architecture needs to fit your organization. Design needs to be measured primarily by how well they solve business problems, rather than beauty or elegance.
He stressed the importance of communication, a term which I think is becoming so overused and diluted in organizations that it is not very useful. It’s used to refer to everything from roles and responsibilities, to personal relationships, to cultural norming, and more. In the case of Skype, what Andres learned was the importance of organizing and empowering people to facilitate alignment, information flow and understanding between different parts of the business. Skype evolved an architecture team which interfaces between (multiple) business units and (multiple) engineering teams, helping each to understand the other and taking responsibility for the overall system design.
Overall, I thought the day’s talks gave me new insight into how Internet applications are being developed and deployed in the real world today. They affirmed some of what I’ve been wondering about, and gave me some new things to think about as well. I’m looking forward to tomorrow.
The development of Ubuntu 10.04 has been underway for nearly two months now, and will produce our third long-term (LTS) release in April. Rick Spencer, desktop engineering manager, summarized what’s ahead for the desktop team, and a similar update will be coming soon from Jos Boumans, our new engineering manager for the server team.
What I want to talk about, though, is not the individual projects we’re working on. I want to explain how the whole thing comes together, and what’s happening behind the scenes to make 10.04 LTS different from other Ubuntu releases.
Changing the focus
Robbie Williamson, engineering manager for the foundations team, has captured the big picture in the LTS release plan, the key elements of which are:
Merge from Debian testing
By merging from Debian testing, rather than the usual unstable, we aim to avoid regressions early in the release cycle which tend to block development work. So far, Lucid has been surprisingly usable in its first weeks, compared to previous Ubuntu releases.
Add fewer features
By starting fewer development projects, and opting for more testing projects over feature projects, we will free more time and energy for stabilization. This approach will help us to discover regressions earlier, and to fix them earlier as well. This doesn’t mean that Ubuntu 10.04 won’t have bugs (with hundreds of millions of lines of source code, there is no such thing as a bug-free system), but we believe it will help us to produce a system which is suitable for longer-term use by more risk-averse users.
Avoid major infrastructure changes
We will bring in less bleeding-edge code from upstream than usual, preferring to stay with more mature components. Where a major transition is brewing upstream, we will probably opt to defer it to the next Ubuntu cycle. While this might delay some new functionality slightly, we believe the additional stability is well worth it for an LTS release.
Extend beta testing
With less breakage early in the cycle, we plan to enter beta early. Traditionally, the beta period is when we receive the most user feedback, so we want to make the most of it. We’ll deliver a usable, beta-quality system substantially earlier than in 9.10, and our more adventurous users will be able to upgrade at that point with a reasonable expectation of stability.
Freeze with Debian
With Debian “squeeze” expected to freeze in March, Ubuntu and Debian will be stabilizing on similar timelines. This means that Debian and Ubuntu developers will be attacking the same bugs at the same time, creating more opportunities to join forces.
Staying on course
In addition, we’re rolling out some new tools and techniques to track our development work, which were pioneered by the desktop team in Ubuntu 9.10. We believe this will help us to stay on course, and make adjustments earlier when needed. Taking some pages from the Agile software development playbook, we’ll be planning in smaller increments and tracking our progress using burn-down charts. As always, we aim to make Ubuntu development as transparent as possible, so all of this information is posted publicly so that everyone can see how we’re doing.
Delivering for users
By making these changes, we aim to deliver for our users the right balance of stability and features that they expect from an Ubuntu LTS release. In particular, we want folks to feel confident deploying Ubuntu 10.04 in settings where it will be actively maintained for a period of years.
Morning Session: Project Patterns
I chose to attend a session entitled: Is this the Way We Really Want to do Things? Seeing Project Patterns and Changing What You Don’t Like (Johanna Rothman). My goal was to explore the causes of the troublesome patterns I see in projects at work. In particular, I see:
- Too many projects starting up at once
- Projects being instantiated without enough consideration for the probability of success (“is this a good idea?” rather than “can we realistically achieve this?”)
- Key people finding out too late about projects which affect them
All of these patterns lead to increased project risk, communication bottlenecks, low motivation, and high stress.
In the session, we conducted a simulation of a team, with engineers, a project manager, a senior manager and a customer. I took on the role of the senior manager.
In the course of the simulation, we received requirements from the customer, implemented them, and delivered products. While the team was working on implementation, I talked with the customer about what was coming next: what would happen when we delivered, what the next project would be, and so on. Part of the simulation was that I had to be separated from the group while they were working.
When we delivered the first batch of products, and the customer was happy with them, it was time to decide what to do next. We gave the customer a choice of two projects we had discussed, one of which was similar to the previous one (but larger scale and more involved), while the other was different. Despite repeated attempts, we could not persuade the customer to prioritize one over the other.
So, I decided that we should change gears and start work on the “different” project. It seemed to be of greater economic value to the customer, and simpler to execute. One of the engineers disagreed with this decision, but didn’t explain why. The project manager seemed to agree, and I left the team to work. They produced a prototype, which the customer liked, and with a few small changes it was accepted as a finished product.
To my surprise, though, I found out later that the team was in fact working on both projects at once, delivering two different types of products. The decision hadn’t actually been made. These unexpected products were delivered to the customer, but didn’t meet the expanded requirements, and that work was wasted.
The debrief which followed was unfortunately too short, and I didn’t feel that we were able to fully explore what the simulation had revealed. The project manager indicated that he hadn’t understood the decision to have been made, pointing to a communication failure.
This reminded me that while we often think of a decision as an event which happens at a point in time, it is more commonly a process of change, which takes time and must be revisited in order to check progress and evaluate. A decision is really just an idea for a change, and there is more work to be done in order to implement that idea. This can be true even when there is a very explicit “decision point”: it still takes time for that message to be received, interpreted and accepted.
One of the tangents we followed during the debrief had to do with how humans think about numbers. Jerry asked each member of the group to write down a random number, and then we wrote them all on a flipchart. They were: 8, 75, 47, 72, 45, 44, 32, 6, 13 and 47. This reminded me of the analyses of election results which indicate fraud.
Afternoon Session: Saying No
After lunch, I decided to attend Jerry Weinberg’s session, Saying No That Really Means No. This was much larger than the morning session, with over 40 people sitting in a large circle.
The subject of discussion was the variety of difficulties that people face in saying “no” to things which don’t seem right for them. For example, saying “yes” to a project which is doomed to failure. This seemed like a good follow-on to the morning’s exercise.
Jerry began by asking the audience to name some of their difficulties, and tell stories of times when they had trouble saying “no”. One of these stories was role-played and analyzed as an example. Most of the time, though, was filled with storytelling and discussion.
This is a deeply complex topic, because this problem is rooted in self-image, social norms, egocentrism, misperception, and other cognitive phenomena. There was no key insight for me, just a reinforcement of the necessity of self-awareness. The only way to avoid patterns like this is to notice when they are happening, and that can be challenging, especially in a stressful situation.
Once you realize what’s happening, there are all sorts of tools which can be applied to the problem: negotiation, requests for help, problem-solving, even simple inaction.