Stats Gathering

THIS PAGE IS A WORK IN PROGRESS.

This page is meant to collect and track information about goals and achievements of the Stats client and server project for Google Summer of Code 2009.

= Why =

Statistics about the relative importance/popularity of packages
 * show what packages need extra care
 * have motivation potential
 * ease up binary package related decision-making
 * (provide package file lists without installation)

= What =

Collect statistics on Gentoo machines (clients) and send them in for automated evaluation (server). The idea is not new: popcon on Debian (and Ubuntu) does just that. Such a system needs to be implemented and integrated for Gentoo GNU/Linux.

Collected information (at most)

 * Hardware already collected by Smolt
 * Stuff from emerge --info --verbose
 * Global use flags
 * Compile flags
 * Local use flags and masks
 * (Package file list)
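
As a rough illustration of how the flag-related items above could be gathered, the sketch below parses text in the `KEY="value"` form that `emerge --info` prints for its variables. The function names `parse_emerge_info` and `collect_flags`, the category keys, and the `SAMPLE` text are hypothetical; the sample is hand-written for this example and is not real output.

```python
import re

# Matches variable lines such as: CFLAGS="-O2 -pipe"
_VAR_LINE = re.compile(r'^([A-Z_][A-Z0-9_]*)="(.*)"$')


def parse_emerge_info(text):
    """Return a dict of KEY -> value for every KEY="value" line in text."""
    result = {}
    for line in text.splitlines():
        match = _VAR_LINE.match(line.strip())
        if match:
            result[match.group(1)] = match.group(2)
    return result


def collect_flags(info):
    """Pick global use flags and compile flags out of the parsed variables."""
    compile_vars = ("CFLAGS", "CXXFLAGS", "LDFLAGS")
    return {
        "global_use_flags": sorted(info.get("USE", "").split()),
        "compile_flags": {k: info[k].split() for k in compile_vars if k in info},
    }


# Hand-written sample input (an assumption, not captured from a real machine).
SAMPLE = '''\
Portage (sample header line, ignored by the parser)
CFLAGS="-O2 -march=i686 -pipe"
CXXFLAGS="-O2 -march=i686 -pipe"
USE="X alsa kde"
'''

if __name__ == "__main__":
    print(collect_flags(parse_emerge_info(SAMPLE)))
```

On a real client the text would presumably come from running `emerge --info --verbose`; hardware data would stay with Smolt's existing collection.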

Privacy and configuration
Users only use/enable our tool if they either
 * feel their privacy is respected or
 * don't care about privacy at that level.

At installation time, submission should
 * be disabled
 * come with sensible defaults

The user controls what information the client submits:
 * Turn on/off information categories, e.g. "yes, submit global use flags"
 * (Specify additional filters like regex on certain stuff?)
 * Add user's real name and contact info if wanted
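
A minimal sketch of these controls, assuming a simple in-memory configuration object. The class `SubmissionConfig`, the category names, and the particular defaults are hypothetical; the page only requires that submission is disabled after installation and that the defaults are sensible.

```python
from dataclasses import dataclass, field

# Hypothetical defaults: which categories are sent once the user opts in.
DEFAULT_CATEGORIES = {
    "hardware": True,
    "emerge_info": True,
    "global_use_flags": True,
    "compile_flags": True,
    "local_use_flags": True,
    "package_file_list": False,
}


@dataclass
class SubmissionConfig:
    enabled: bool = False  # submission is disabled right after installation
    categories: dict = field(default_factory=lambda: dict(DEFAULT_CATEGORIES))
    real_name: str = ""    # optional, only sent if the user fills it in
    contact: str = ""      # optional, only sent if the user fills it in


def compose_submission(config, gathered):
    """Build the payload from gathered data, honouring the user's choices."""
    if not config.enabled:
        return None
    payload = {
        name: data
        for name, data in gathered.items()
        if config.categories.get(name, False)
    }
    if config.real_name:
        payload["real_name"] = config.real_name
    if config.contact:
        payload["contact"] = config.contact
    return payload
```

With a fresh `SubmissionConfig()`, `compose_submission` returns `None` until the user sets `enabled = True`; unknown categories are treated as switched off.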

Submission is
 * logged and transparent
 * composed from the data the user wants submitted
 * auto privacy-filtered where possible, e.g.
   * Only registered overlays are taken into account
   * Submitted use flags must all be known Gentoo- or overlay-wide, so private/confidential flags stay on the user's machine

This auto-filtering is worth mentioning as Debian/Ubuntu also tracks and submits files not coming from a known package. While collecting such information can be of value it might be seen as a threat to privacy. We will explicitly not collect such information with this approach for Gentoo.
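
The auto-filtering described above could be sketched as plain whitelist checks. The function names are hypothetical, and how the sets of registered overlays and known use flags are obtained is left open here; the sketch simply takes them as inputs.

```python
def filter_overlays(installed_overlays, registered_overlays):
    """Keep only overlays that appear in the registered set."""
    registered = set(registered_overlays)
    return [name for name in installed_overlays if name in registered]


def filter_use_flags(use_flags, known_flags):
    """Split use flags into (kept, dropped) based on the known set.

    Only `kept` would be submitted; `dropped` could be written to the
    local log so the user can see what was held back.
    """
    known = set(known_flags)
    kept = [flag for flag in use_flags if flag in known]
    dropped = [flag for flag in use_flags if flag not in known]
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = filter_use_flags(
        ["X", "alsa", "my-secret-project"],  # made-up client flags
        {"X", "alsa", "kde"},                # made-up known flags
    )
    print(kept, dropped)
```

Anything that fails the check never leaves the machine, which is the property the paragraph above contrasts with popcon.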

= How =

Plan A
Extend Smolt (Python) with said functionality and qualities. Treat hardware and package information equally.

Plan B
If Plan A turns out to be impossible or the wrong thing to do, switch to Plan B: develop a standalone app that provides said functionality and qualities.

= Open Questions =
 * When is gathered information submitted?
   * On manual trigger?
     * Would anyone remember or do that?
   * After a successful emerge action?
     * How would we integrate this?
     * Too much bandwidth?
     * Privacy issue?
   * Cron-based?
 * How do we do analysis on server side, especially with plan B?
 * (What's the best way to avoid huge file list uploads?)
   * (Upload "diffs", a bit like rsync?)
 * How do we handle client IPs on the server side?
 * How do we protect against "stats pollution" and attacks?
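
For the open question about huge file list uploads, one possible reading of the "diffs" idea is sketched below. It uses a plain set difference between the previously submitted list and the current one, which is much simpler than rsync's rolling-checksum algorithm and is not meant to reproduce it. The function names and the diff layout are hypothetical.

```python
def file_list_diff(previous, current):
    """Return the paths added and removed since the last submission."""
    prev, cur = set(previous), set(current)
    return {"added": sorted(cur - prev), "removed": sorted(prev - cur)}


def apply_diff(previous, diff):
    """Server side: rebuild the current list from the stored one plus a diff."""
    return sorted((set(previous) - set(diff["removed"])) | set(diff["added"]))


if __name__ == "__main__":
    old = ["/usr/bin/foo", "/usr/lib/libfoo.so"]      # made-up paths
    new = ["/usr/bin/foo", "/usr/lib/libfoo.so.2"]
    diff = file_list_diff(old, new)
    print(diff)
    print(apply_diff(old, diff) == sorted(new))
```

This only saves bandwidth if the server keeps the last list per client, which ties back into the open questions about client identity and privacy.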

= Progress =

Not started yet.