We'll see | Matt Zimmerman

a potpourri of mirth and madness

Stemming the tide of Ubuntu kernel bugs

The Ubuntu kernel team receives an extraordinary number of bug reports, about 1000 in the past week. Yesterday, Leann Ogasawara, our Ubuntu kernel QA lead, addressed a roomful of Ubuntu developers. She shared how the kernel team is handling this situation, and asked for ideas and suggestions from the crowd.

To try to help out, I reviewed the most recent screenful of kernel bug reports (75) to see if there were any patterns we could take advantage of. I discussed with the kernel team some ways in which we could improve our approach, and implemented some of the changes.

Altogether, this was only a few hours of work, but should eliminate a large number of invalid reports, and significantly increase the quality of many more.

A quick back-of-the-envelope count revealed the following categories:

Suspend or hibernate failures (36%)

A majority of these are automated reports from apport. This is good, because we have the opportunity to collect relevant information from the system when the problem happened, but it also means that there are a lot of reports.

Although some new logging was added in 9.04, these reports still often do not contain enough information to diagnose the problem.

One bit of data which the kernel team has said would be useful is the frequency of the failure: does it fail every time, or only sometimes? We can improve the logging to keep track of successful resumes as well as failures, and then include this data in the report.
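As a rough illustration of that idea, here is a minimal Python sketch of a log that records successful resumes as well as failures, and then summarizes how often resume fails. The file format, the function names (`record_resume`, `failure_summary`, `describe`) and the wording of the summary are all hypothetical; this is not how apport or the Ubuntu suspend scripts actually store this data.

```python
import json
import os


def record_resume(log_path, succeeded):
    """Append one suspend/resume outcome to a JSON-lines log (hypothetical format)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"ok": bool(succeeded)}) + "\n")


def failure_summary(log_path):
    """Return (failures, attempts) counted from the log; (0, 0) if there is no log."""
    if not os.path.exists(log_path):
        return 0, 0
    failures = attempts = 0
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            line = line.strip()
            if not line:
                continue
            attempts += 1
            if not json.loads(line)["ok"]:
                failures += 1
    return failures, attempts


def describe(failures, attempts):
    """Turn the counts into the answer the kernel team asked for."""
    if attempts == 0:
        return "no resume attempts recorded"
    if failures == attempts:
        return "fails every time (%d of %d)" % (failures, attempts)
    if failures == 0:
        return "never failed (0 of %d)" % attempts
    return "fails sometimes (%d of %d)" % (failures, attempts)
```

A bug report could then carry the output of `describe(*failure_summary(path))` alongside the existing suspend logs.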

Networking problems, both wired and wireless (13%)

The kernel team has a partial specification for some improvements to make here.

Package installation and upgrade failures (10%)

The kernel tends to be a trigger point for a variety of problems in this area which are not its fault. For example, if the system is very low on disk space, upgrading the kernel can fail because it is a large package, so we automatically suppress those reports. In my sample, none of the failures being reported against the kernel actually belonged there.

To help address this, we can suppress bogus reports, and redirect valid reports to the appropriate package. I committed fixes to apport which will file the problem reports against grub or initramfs-tools if they were caused by failures in update-grub or update-initramfs respectively. I also added an apport bug pattern to suppress bug reports against the kernel which contained certain dpkg unpacking errors, and added a patch to apt to try to detect this case as well.
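The routing and suppression logic can be sketched roughly as follows, in Python. This is an illustration only, not the code committed to apport or apt: the function names (`failed_helper`, `triage_install_failure`), the way a failing helper is recognized in the log, and the example dpkg error strings are all assumptions chosen for the example.

```python
import re

# Hypothetical mapping: which helper's failure redirects the report to which package.
REDIRECTS = [
    ("update-grub", "grub"),
    ("update-initramfs", "initramfs-tools"),
]

# Hypothetical examples of dpkg unpacking errors that are not kernel bugs
# (for instance, running out of disk space while unpacking a large package).
SUPPRESS = re.compile(r"failed in buffer_write|No space left on device")


def failed_helper(terminal_log):
    """Return the package owning a helper that appears on an error line, else None."""
    for line in terminal_log.splitlines():
        lowered = line.lower()
        if "error" in lowered or "fail" in lowered:
            for marker, package in REDIRECTS:
                if marker in line:
                    return package
    return None


def triage_install_failure(error_message, terminal_log):
    """Decide what to do with a failed kernel package installation.

    Returns ("suppress", None), ("redirect", package) or ("file", "linux").
    """
    if SUPPRESS.search(error_message):
        return ("suppress", None)
    package = failed_helper(terminal_log)
    if package is not None:
        return ("redirect", package)
    return ("file", "linux")
```

Checking for suppression first means a full disk is never reported at all, while a failure in one of the helper scripts still produces a report, just against the package that owns the script.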

Audio-related problems (9%)

Currently, the first step for most of these bug reports is to ask the user to complete the report by running apport-collect -p alsa-base to collect audio-related debugging data.

Because they account for a significant proportion of all kernel bugs, I committed an apport patch to simply attach this information by default for all kernel bugs.

Kernel panics, oopses, lockups etc. (8%)

These bugs are notoriously tricky to file properly, because the system is often non-functional or severely impaired.

In Karmic, we now have a kernel crash dump facility which is very easy to use. Rather than reporting a bug saying “my computer locks up”, you can throw a switch which will enable the problem to be automatically detected, recorded and analyzed. By the time the bug report reaches the kernel developers, it should have detailed information about where the problem occurred, rather than requiring the reporter to use things like digital cameras to capture panic messages.

We’ve also wired up kerneloops to apport, so that oopses are reported through an automatic facility which can produce a more complete bug report.

Written by Matt Zimmerman

August 7, 2009 at 09:57

12 Responses

  1. Great work.

    I think aggregating bug reports and prioritizing/sanitizing them is one of Ubuntu’s major contributions to the FOSS ecosystem.
    Just yesterday I wondered if you guys have something like Google Analytics for Launchpad. I know there is the not very useful “this bug affects me too” link (I would prefer a big “Digg” sign), but maybe a good analytics setup would provide even more information on where most people’s problems lie.

    Hope that might help.

    Cheers

    Tom

    August 7, 2009 at 10:40

  2. hello Matt,

    Good post. As you say, a lot of bugs are filed; to me that means the community is growing and most users are able to report bugs. The “apport” script is the most useful way to let devs know about problems; if there are others, let us know of course.

    How to do better:
    – promote “apport” when a crash or log warning appears, with the correct syntax and the most useful options (I recently discovered “apport-collect”, for example).
    – review “apport” to make it more powerful (dependencies / suggested packages: apport-retrace, …) and always up to date.
    – apport is a good tool, but don’t ask for information about it: have you tried “man apport” or “help apport”?
    – a lot of bugs are reported against the kernel but do not belong there: video drivers, scripts, core libs (libc, python, …)
    – Ubuntu uses and introduces a lot of experimental packages, which produce a wide range of problems and bugs. While most of them are quickly resolved by updates, others depend on external projects (mozilla, gnome, …) and sometimes take ages to reach rc or final release. Compared to Sidux, Ubuntu stays buggy longer. So packages need to be merged more quickly.
    – new Linux users are often lost because they are not techies, don’t know how and where to report bugs, or are disappointed when their bug report takes ages to be resolved: how many bugs are simply declared “invalid” because of missing information, without any indication of what is missing to help the reporter supply it? Ubuntu users are not only techies, and if devs want to cut the number of bugs, they have to pay attention to the bug calendar (an old buggy package producing bugs in another package); the “5 bugs a day” was a good idea.
    – I think a choice has to be made between apt-get and aptitude: warnings, freezes and crashes often appear because of bad updates/upgrades, and most of them can be resolved by removing/purging and reinstalling packages. So “apt-get” and/or “aptitude” do not take enough care to remove/purge the settings/libs/links/… left behind by the updated/upgraded packages: you can find the same package installed in different releases at the same time, even if the older one is no longer needed by any other package. We need one strong solution: is apt-get or aptitude the future?
    – neither “find orphans” (gtkorphan) nor janitor finds multiple releases of the same package.
    – new choices have been made about hal/udev, but a lot of configs are not up to date.

    To sum up, I would say that it’s time to clean the sweet home!!!

    dino99

    August 7, 2009 at 12:02

  3. We accomplished a bit more today, adding wifi-related debug information to the kernel’s apport hook:

    http://bazaar.launchpad.net/~ubuntu-core-dev/ubuntu/karmic/apport/ubuntu/revision/1480

    to help with the network-related bugs.

    mdz

    August 7, 2009 at 17:34

    • Quick response, thanks Matt
      Sorry for my previous post; I was a little aggressive, and that was not my intention. I’ve been looking at your code: very clear, good work!!!
      As I understand it, you are focusing on the main current sources of bugs. Could the different warnings/errors be standardised: some kind of signature that could be stored in a database (with a uuid, for example) and processed by a generic script that forwards it to the specific dev? What about a package with an error and all its direct/indirect dependencies? A user’s warning/error is known to his admin, but how do dependencies deal with that?
      On my own installation, I try to glance at errors, but logs are in different places, as each package’s .conf generates its own log; even logrotate has difficulty finding its kids!!!
      For example, logrotate.conf needs to be customized, and /etc/logrotate.d/ needs the user/admin to list the logs with the required settings: not easy for a new user, even if Ubuntu wants to be friendly (by default these files are almost empty, and installing a package does not write entries there; some do, I guess). /var/log/ has some logs at the top level and others in subfolders, like kern.log / auth / messages … By default logrotate doesn’t deal with these. So reporting bugs with huge logs doesn’t help; they are sometimes several hundred MB when cron has filled them up.

      When googling an error/warning, we are sometimes scared by a useless warning (recently I searched for “can’t find device.map” when booting: the devs’ discussions about it say the error is generated by an old mechanism no longer used with today’s kernels, but the warning is still there and the first thread is not recent: time to decide who has to make the changes). That kind of error needs to be identified (and the others too) and classified as not serious/medium/dangerous, or hidden when useless.

      (Some brainstorming for internal design)

      dino99

      August 9, 2009 at 15:16

      • apport has a facility for pattern matching duplicate bugs, so that if the reports are sufficiently standardized, we can suppress duplicates and direct the user to more information about their problem. This is one of the reasons why we’re asking that all Ubuntu bug reports be filed using ubuntu-bug. For example, if the stack trace closely matches one in another bug, the retracer will automatically mark it as a duplicate.

        As for syslog files, it is usually best to include only an excerpt, not the entire log. There is a convenience function in the apport.hookutils library which makes this easy.

        We should definitely eliminate “harmless” error messages which lead users to think there is a problem. These should be filed as bugs, and fixed. Bug 58386 is an example of where we’ve fixed such a problem.

        Matt Zimmerman

        August 16, 2009 at 11:41

  4. This is good work and very interesting article.

    Adding facilities to shorten bug resolution time greatly improves the quality of Ubuntu Linux.

    Keep it up.

    Xavier Verne

    August 9, 2009 at 15:58

  5. […] that the data do not support. Yes, we’re constantly trying to improve, as Canonical CTO Matt Zimmerman calls out. But I look at this as a very good problem to […]

  6. […] we’re constantly trying to improve, as Canonical CTO Matt Zimmerman calls out. But I look at this as a very good problem to […]

  7. To state the obvious: more care in initial programming will reduce bug reports – no programming error = no bug.

    Kathleen Murphy

    September 9, 2010 at 20:17

    • True enough, but Ubuntu (like other Linux distributions) originates very little of the code included in the product. Even if Ubuntu developers introduced zero bugs (an ambitious goal for any programmer!) there would still be more than enough to generate a tide of bug reports.

      Matt Zimmerman

      September 11, 2010 at 12:57

