Stemming the tide of Ubuntu kernel bugs
The Ubuntu kernel team receives an extraordinary number of bug reports, about 1000 in the past week. Yesterday, Leann Ogasawara, our Ubuntu kernel QA lead, addressed a roomful of Ubuntu developers. She shared how the kernel team is handling this situation, and asked for ideas and suggestions from the crowd.
To help out, I reviewed the 75 most recent kernel bug reports (one screenful) to see if there were any patterns we could take advantage of. I discussed possible improvements to our approach with the kernel team, and implemented some of them.
Altogether, this was only a few hours of work, but it should eliminate a large number of invalid reports and significantly improve the quality of many more.
A quick back-of-the-envelope count revealed the following categories:
Suspend or hibernate failures (36%)
A majority of these are automated reports from apport. This is good, because we have the opportunity to collect relevant information from the system when the problem happened, but it also means that there are a lot of reports.
Although some new logging was added in 9.04, these reports still often do not contain enough information to diagnose the problem.
One bit of data which the kernel team has said would be useful is the frequency of the failure: does it fail every time, or only sometimes? We can improve the logging to keep track of successful resumes as well as failures, and then include this data in the report.
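As a rough illustration of the kind of summary that could go into a report, here is a minimal sketch in Python. It assumes the improved logging produces a simple sequence of "success" and "failure" outcomes, one per resume attempt. That record format and the function names are hypothetical, not what the actual logging does.

```python
from collections import Counter


def resume_failure_rate(events):
    """Return (failures, attempts, rate) for a sequence of outcome strings.

    `events` is assumed to hold "success" or "failure" entries, one per
    resume attempt; any other entry is ignored.
    """
    counts = Counter(events)
    attempts = counts["success"] + counts["failure"]
    if attempts == 0:
        return 0, 0, None
    return counts["failure"], attempts, counts["failure"] / attempts


def describe(events):
    """Format the failure frequency as one line suitable for a bug report."""
    failures, attempts, rate = resume_failure_rate(events)
    if rate is None:
        return "no resume attempts recorded"
    return f"{failures} of {attempts} resumes failed ({rate:.0%})"
```

A line like this would let a developer tell at a glance whether a machine fails on every resume or only occasionally.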
Networking problems, both wired and wireless (13%)
The kernel team has a partial specification for some improvements to make here.
Package installation and upgrade failures (10%)
The kernel tends to be a trigger point for a variety of problems in this area which are not its fault. For example, if the system is very low on disk space, upgrading the kernel can fail because it is a large package, so we automatically suppress those reports. In my sample, none of the failures being reported against the kernel actually belonged there.
To help address this, we can suppress bogus reports and redirect valid reports to the appropriate package. I committed fixes to apport which will file the problem reports against grub or initramfs-tools if they were caused by failures in update-grub or update-initramfs respectively. I also added an apport bug pattern to suppress bug reports against the kernel which contain certain dpkg unpacking errors, and added a patch to apt to try to detect this case as well.
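The redirection idea can be sketched roughly as follows. This is not the code I committed to apport, only an illustration of the logic. The report is modelled as a plain dict, and the "DpkgTerminalLog" key, the "fail"/"error" matching and the function name are assumptions made for the example.

```python
# Map the helper tool that failed to the package that should get the report.
REDIRECTS = {
    "update-grub": "grub",
    "update-initramfs": "initramfs-tools",
}


def choose_package(report):
    """Pick the package a failed kernel install should be reported against.

    `report` is a plain dict standing in for a problem report. The key
    names and the log format are assumptions for illustration only.
    """
    for line in report.get("DpkgTerminalLog", "").splitlines():
        lowered = line.lower()
        # Only consider lines that look like a failure message.
        if "fail" not in lowered and "error" not in lowered:
            continue
        for tool, package in REDIRECTS.items():
            if tool in lowered:
                return package
    # No known helper failed, so leave the report where it was filed.
    return report.get("Package", "linux")
```

Matching on log text is crude, but it is enough to move the most common misfiled reports to the package that can actually fix them.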
Audio-related problems (9%)
Currently, the first step for most of these bug reports is to ask the user to complete the report by running apport-collect -p alsa-base to collect audio-related debugging data.
Because audio problems account for a significant proportion of all kernel bugs, I committed an apport patch to simply attach this information by default for all kernel bugs.
Kernel panics, oopses, lockups etc. (8%)
These bugs are notoriously tricky to file properly, because the system is often non-functional or severely impaired.
In Karmic, we now have a kernel crash dump facility which is very easy to use. Rather than reporting a bug saying “my computer locks up”, you can throw a switch which will enable the problem to be automatically detected, recorded and analyzed. By the time the bug report reaches the kernel developers, it should have detailed information about where the problem occurred, rather than requiring the reporter to use things like digital cameras to capture panic messages.
We’ve also wired up kerneloops to apport, so that oopses are reported through an automatic facility which can produce a more complete bug report.