====== Debian Metadata Analysis ======

The (proposed) Debian Metadata agent is an agent that is passed pointers to Debian package format binary control files and source dsc files, and organizes and inserts that data into the database for later use by the repository front-ends. It is currently a work in progress. This page, and eventually the agent, are being written by Matt Taggart.

===== Metadata sources =====

There are several ways we will encounter the metadata:

  - [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-binarycontrolfiles|Binary package control files]] - These are contained within the binary debs and can be obtained by extracting the package with ar(1) and gunzip/untar'ing the control.tar.gz. Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/binary.control|binary.control]]
  - [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-sourcecontrolfiles|Source package debian/control files]] - These are contained in the source package (as part of the diff.gz) and can be obtained by unpacking the source package. Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/source.control|source.control]]
  - [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-debiansourcecontrolfiles|Source package dsc file]] - The .dsc file is one of the 2 or 3 files that make up a Debian source package and is the metadata that describes the source package. It is most often encountered in apt repositories, and its location in the repo might indicate more about the package (like what release of Debian it is a part of). It is usually signed by the developer that created that version of the source package. Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/dsc|dsc]]
  - [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-debianchangesfiles|Debian changes files]] - These are generated as part of package building and used by archive maintenance software to generate archive metadata. They are not often encountered in apt repositories or Debian releases, because the data they contain is digested and made available via other methods. They are usually signed by the developer who did the package build that generated them. Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/source+binary.changes|full source + binary]] [[http://fossology.org/~taggart/dokuwiki/debian_metadata/diff.changes|source diff]] [[http://fossology.org/~taggart/dokuwiki/debian_metadata/diff+binary.changes|source diff + binary]]
  - apt repository files - apt repositories have additional files we can gather data from. They are defined for each release/section, like "unstable/main" or "sarge/contrib". For Debian releases, they are contained within Available as either plain text, gzipped, bzip2'd, or several side by side (the data should be identical though). The examples given are usually excerpts (note the '...'); the files themselves are huge. Here is a [[http://fossology.org/~taggart/dokuwiki/debian_metadata/dist-tree|description]] of what the hierarchy of a full Debian repository looks like. FIXME: explain the diff/ directory scheme
    - Packages files - contain the control data for all binary packages for that archive section. Generated by the dpkg-scanpackages(8) tool, which scans all the .deb files to extract the control info. Currently for Debian unstable/i386 this file is 5.7MB compressed, 20MB uncompressed. These are the files that apt and similar tools use to know what binary packages are available. Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/Packages.i386-main|Packages file for 'sid' for i386]]
    - Sources files - contain the control data for all source packages. Generated by the dpkg-scansources(8) tool, which scans all the .dsc files to extract the control info. Currently for Debian unstable this file is 1.7MB compressed, 6.2MB uncompressed. These are the files that apt and similar tools use to know what source packages are available. Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/Sources|Sources file for 'sid']]
    - Contents files - contain a list of all files and their corresponding section/package. Currently for Debian unstable/i386 this file is 11MB compressed, 145MB uncompressed. These are the files the apt-file tool uses to provide file listings and search ability. Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/Contents-i386|Contents file for 'sid' for i386]]
    - Release files - contain release metadata and can contain checksums of Packages files that belong to that release. Release files are often GPG signed. This metadata can be used by apt to differentiate apt sources and also to establish a trust path from an archive signing key to debs (the signed Release file contains checksums of the Packages files, and the Packages files contain checksums of the debs). Example: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/Release.sid.txt|Top level Release file for 'sid']] [[http://fossology.org/~taggart/dokuwiki/debian_metadata/Release.i386-main|Release file for i386 main]]

===== Data analysis =====

Before we can design the database tables to hold the metadata, we need to analyse the data and determine types and sizes. [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html|Debian Policy]] has good descriptions of the fields. However, it does not describe any size limitations for the fields. Looking at the source code for dpkg reveals that it mostly uses C "char *" types, so there is no hard-coded size.

I wrote a quick script to analyze all the current binary metadata in Debian's unstable release by parsing the /var/lib/dpkg/available file on a current unstable system. This is an easy way to get a ballpark answer; once the agent is complete we'll be able to do queries that give us a more precise answer. For each field, the script outputs the length of the longest value found and, as an example, the name of a package with a value of that length along with the field data.
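The actual findsizes script is linked in the list that follows. As a rough illustration of the approach only (this is not that script), here is a minimal Python sketch that reads stanzas in the blank-line-separated "Field: value" format and records the longest value seen for each field. The function names and the sample data are made up for the example.

```python
import io


def iter_stanzas(lines):
    """Yield each blank-line-separated stanza as a dict of field -> value."""
    fields, current = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current stanza.
            if fields:
                yield fields
            fields, current = {}, None
        elif line[0] in " \t":
            # Continuation line: append to the field being read.
            if current is not None:
                fields[current] += "\n" + line
        else:
            name, sep, value = line.partition(":")
            if not sep:
                continue  # not a "Field: value" line, skip it
            current = name
            fields[current] = value.strip()
    if fields:
        yield fields


def longest_values(lines):
    """Map field name -> (length, package, value) for the longest value seen."""
    longest = {}
    for stanza in iter_stanzas(lines):
        package = stanza.get("Package", "?")
        for name, value in stanza.items():
            if name not in longest or len(value) > longest[name][0]:
                longest[name] = (len(value), package, value)
    return longest


# Made-up sample in the same format as /var/lib/dpkg/available.
SAMPLE = """\
Package: foo
Version: 1.0-1
Maintainer: A Person <a@example.org>

Package: libfoo-dev
Version: 1.0-1
Depends: libfoo1 (= 1.0-1), libc6-dev
"""

if __name__ == "__main__":
    results = longest_values(io.StringIO(SAMPLE))
    for name, (length, package, _value) in sorted(results.items()):
        print(f"{name}: {length} ({package})")
```

To run it against a real system, the sample could be replaced with an open file handle for /var/lib/dpkg/available.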
  - the script: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/findsizes|findsizes]]
  - the input: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/available|available]]
  - the output: [[http://fossology.org/~taggart/dokuwiki/debian_metadata/sizes.txt|sizes.txt]]

PostgreSQL data types are defined in the [[http://www.postgresql.org/docs/8.1/interactive/datatype.html|PostgreSQL manual]]. According to the [[http://www.postgresql.org/docs/8.1/interactive/datatype-character.html|character types]] section of the manual:

"Tip: There are no performance differences between these three types (char, varchar, text), apart from the increased storage size when using the blank-padded type. While character(n) has performance advantages in some other database systems, it has no such advantages in PostgreSQL. In most situations text or character varying should be used instead."

So for most text fields we can use 'text' rather than 'char' or 'varchar', and we need not worry about choosing a large enough size or about subpar performance.

Based on the above descriptions and the script output, here is a summary:

==== Binary control metadata ====

These fields are found in binary control files (#1) and source control files (#2) (one set per binary package the source package delivers).
Here are the fields defined by Debian Policy:

^ Field ^ Type ^ Length Found ^ Assumed Max ^ Suggested Datatype ^
|Package | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Package|limited text]] | 55 | 70 | text |
|Source | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Source|limited text]] | 51 | 70 | text |
|Version | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Version|mixed numbers and text, upstream dash debian version]] | 35 | 40 | array of two text or maybe two different fields |
|Section | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Section|limited text]] | 20 | 25 | text |
|Priority | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Priority|one of five alpha-only strings]] | 9 | 9 | text |
|Architecture | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Architecture|one of 25 mixed strings]] | 18 | 25 | text |
|Essential | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Essential|boolean]] | 3(yes) | 3 | boolean |
|Depends et al | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-binarydeps|complex]] comma separated package names with versioned relationships | 3300+ chars, ~150 versioned depends | 200 versioned depends | array of composite type: package, relationship, version (all text) for each of Depends, Recommends, Suggests, Enhances, Pre-Depends |
|Installed-Size | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Installed-Size|kilobytes (positive int)]] | 6 digits | 8 digits? | integer |
|Maintainer | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Maintainer|name plus RFC822 email address in angle brackets]] | 96 | 120 | text (maybe separate name and email?) |
|Description | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Description|single line short description, plus multiline long]] | short(89) | short(100) | short text, long text |

Architecture in this context is the architecture for which this particular binary deb is built, so something like i386 or all (for architecture-independent packages).

In addition to the above, there is further information we can infer from the binary package we are processing:

  - Full name of the package file, probably of the form "NAME_VERSION-DEBIANVERSION_ARCH.deb". This should be the same as Package, Version, Architecture, but could be checked.
  - Location in the directory hierarchy can tell us a lot. If the package is contained in a Debian pool-like structure, we may be able to determine what release it is part of by looking at files just above the pool directory. We could also look for a dist directory next to the pool and use that to determine which release, branch, section, and architecture reference this deb (there could be several, due to the way the pool is designed). Some of this data is already in the table (Section and Architecture).
  - We could poke deeper into the binary package and extract information from the changelog, copyright file, etc. That sort of data may be better off in another table, but once that is done we could reference it. We will discuss that elsewhere.

==== Source control metadata ====

These fields are found in source package control files (#2). Here are the fields defined by Debian Policy:

^ Field ^ Type ^ Length Found ^ Assumed Max ^ Suggested Datatype ^
|Source | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Source|limited text]] | | | text |
|Maintainer | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Maintainer|name plus RFC822 email address in angle brackets]] | | | text (maybe separate name and email?) |
|Uploaders | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Uploaders|multiple instances of name plus RFC822 email address in angle brackets]] | | | array of text (maybe separate name and email?) |
|Section | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Section|limited text]] | 20 | 25 | text |
|Priority | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Priority|one of five alpha-only strings]] | 9 | 9 | text |
|Build-Depends et al | [[http://www.us.debian.org/doc/debian-policy/ch-relationships.html#s-sourcebinarydeps|complex]] comma separated package names with versioned relationships | | | array of composite type: package, relationship, version (all text) for each of Build-Depends, Build-Depends-Indep, Build-Conflicts, Build-Conflicts-Indep |
|Standards-Version | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Standards-Version|3 or 4 dotted version number string]] | | | text |

As with binary packages, we can infer the same additional information for source packages (minus Architecture).

==== Source dsc metadata ====

This is type #3 from the list above. Here is what Debian Policy lists:

^ Field ^ Type ^ Length Found ^ Assumed Max ^ Suggested Datatype ^
|Format | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Format|limited version string]] | 3 | 5 | ? |
|Source | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Source|limited text]] | | | text |
|Version | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Version|mixed numbers and text, upstream dash debian version]] | 35 | 40 | array of two text or maybe two different fields |
|Maintainer | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Maintainer|name plus RFC822 email address in angle brackets]] | | | text (maybe separate name and email?) |
|Uploaders | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Uploaders|multiple instances of name plus RFC822 email address in angle brackets]] | | | array of text (maybe separate name and email?) |
|Binary | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Binary|complex]] comma separated list of binary packages | | | array of text |
|Architecture | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Architecture|one of 25 mixed strings]] | 18 | 25 | text |
|Build-Depends et al | [[http://www.us.debian.org/doc/debian-policy/ch-relationships.html#s-sourcebinarydeps|complex]] comma separated package names with versioned relationships | | | array of composite type: package, relationship, version (all text) for each of Build-Depends, Build-Depends-Indep, Build-Conflicts, Build-Conflicts-Indep |
|Standards-Version | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Standards-Version|3 or 4 dotted version number string]] | | | text |
|Files | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Files|complex]] list of md5sum, size, filename for source package files probably dsc/orig.tar.gz/diff.gz or just dsc/tar.gz | | | text |

There is additional data in the dsc that we could not determine from the source control file itself. These things are determined at source package build time and added to the dsc. We could dig in the source and determine them ourselves, but the dsc makes it easier. These additional fields are:

  - Format - added by the tool that built the source package
  - Version - from debian/changelog
  - Binary - digest of binary packages provided, from control
  - Architecture - from control, union of the list of archs supported by the binary packages provided (any if not specific)
  - Files - determined by the tool that built the source package

Section and Priority are the only two source control fields that the dsc does not also provide.
It's easier for us to use the dsc since it does not require unpacking the source and provides more info.

==== changes files metadata ====

This is type #4 from our list. Debian Policy lists the following fields:

^ Field ^ Type ^ Length Found ^ Assumed Max ^ Suggested Datatype ^
|Format | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Format|limited version string]] | 3 | 5 | ? |
|Date | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Date|date in RFC822 format]] | 31 | 32 | text (maybe more complicated?) |
|Source | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Source|limited text]] | | | text |
|Binary | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Binary|complex]] comma separated list of binary packages | | | array of text |
|Architecture | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Architecture|one of 25 mixed strings]] | 18 | 25 | text |
|Version | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Version|mixed numbers and text, upstream dash debian version]] | 35 | 40 | array of two text or maybe two different fields |
|Distribution | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Distribution|alpha-only text]] | 12 | 12 | text |
|Urgency | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Urgency|low/medium/high plus optional comment in parens]] | | | text |
|Maintainer | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Maintainer|name plus RFC822 email address in angle brackets]] | | | text (maybe separate name and email?) |
|Changed-By | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Changed-By|name plus RFC822 email address in angle brackets]] | | | text (maybe separate name and email?) |
|Description | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Description|single line short description, plus multiline long]] | short(89) | short(100) | short text, long text |
|Closes | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Closes|space-separated list of bug numbers]] | | | array of ? |
|Changes | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Changes|complex]] changelog data, structured but complicated | | | text, but could parse and put in a more complicated structure if valuable |
|Files | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Files|complex]] list of md5sum, size, filename for source package files probably dsc/orig.tar.gz/diff.gz or just dsc/tar.gz | | | text |

As stated above, changes files are not usually encountered in a normal Debian repository. If we do come upon them, they will probably have been accidentally included as part of a stand-alone product or a build area. Assuming that the changes file is found in proximity to the Debian package to which it refers, it can provide some additional useful information. (If it is found by itself we probably can't do much with it. We might be able to search the repo for its missing parent, but I don't know how useful that is.) In addition to data that can be determined from the binary or source packages to which the changes file refers, we get the following:

  - Date - the actual time the package was built (not the same as the changelog timestamp, and potentially very different in the case of binary rebuilds, aka binNMUs)
  - Urgency - from debian/changelog
  - Changed-By - the person/buildd that actually built the package (often not the maintainer)
  - Closes - bugs that this upload closes, passed to the bug tracking system

==== APT repository metadata ====

=== Packages files ===

Packages files are binary package records separated by single blank lines.
Each record contains the same fields as the binary control data with the addition of:

^ Field ^ Type ^ Length Found ^ Assumed Max ^ Suggested Datatype ^
|Size | size in bytes of the deb | 9 | 12 | ? |
|Filename | full path to deb in pool, limited text | | | ? |
|MD5sum | md5sum | 32 | 32 | ? |
|SHA1 | sha1sum | 40 | 40 | ? |
|SHA256 | sha256sum | 64 | 64 | ? |

As mentioned before, further details can be inferred from the Packages file's location in the hierarchy.

=== Sources files ===

Sources files are source package records separated by single blank lines. Each record contains the same fields as the source control data with the addition of:

^ Field ^ Type ^ Length Found ^ Assumed Max ^ Suggested Datatype ^
|Directory | full path to package directory in pool, limited text | | | ? |

=== Release files ===

^ Field ^ Type ^ Length Found ^ Assumed Max ^ Suggested Datatype ^
|Archive | limited text (unstable,testing,etc) | | | ? |
|Component | limited text (main,contrib,etc) | | | ? |
|Origin | limited text (Debian,Ubuntu,etc) | | | ? |
|Label | limited text (Debian,etc) | | | ? |
|Architecture | [[http://www.us.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Architecture|one of 25 mixed strings]] | 18 | 25 | text |

TODO add top-level fields

=== Contents files ===

A Contents file starts with a 32 line header explaining the file, followed by a list of all files in the release. Each line consists of the full path to the file as installed on the system (minus the initial slash), whitespace, and then the section, a slash, and the package name. For example:

  etc/nosendfile                                          net/sendfile
  usr/X11R6/bin/noseguy                                   x11/xscreensaver
  usr/X11R6/man/man1/noseguy.1x.gz                        x11/xscreensaver
  usr/doc/examples/ucbmpeg/mpeg_encode/nosearch.param     graphics/ucbmpeg
  usr/lib/cfengine/bin/noseyparker                        admin/cfengine

=== Additional notes ===

It's worth noting that maintainers are allowed to invent their own RFC822-compliant fields and add them to the source control file.
Depending on where they are added, that data might be passed on to the binary control data and then the Packages files, etc. One example of this is the "Url" field, which is in wide use but not yet included in Debian Policy. Our design should allow for such cases and alert us when it encounters them.

===== Planning =====

With so many different types and locations of metadata, we need to break the problem down and initially work on pieces individually in order to make progress. However, it is important to keep all the metadata types in mind when designing so that we can take advantage of any synergies. I expect that the database design will be an iterative process as we try things and learn what works.

We will eventually have various methods for extracting all the metadata types listed above, but we need to start somewhere. There are a couple of initial strategies I have been thinking of:

  - Parse the binary and source packages as they are found: This is most useful for individual "products" uploaded to the repository. But in the case of a larger product, like a Debian release, the data found may be less useful because it is not associated with the larger release.
  - Parse the apt repository metadata: This is most useful for uploads of full apt repositories, like a Debian or Ubuntu release. It has the potential to be very fast, but it isn't looking at the packages themselves and can't handle packages outside a repository.

===== Proposed SQL tables =====

To be determined...

===== Code =====

Parsing code in progress... Database manipulating code waiting on table definitions. Agent interface code waiting on database and discussion with team.
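Until the real parsing code exists, here is a minimal Python sketch of two pieces the field tables call for: breaking a Depends-style value into (package, relationship, version) entries, and splitting a Version into its upstream and Debian parts. It is an illustration under simplifying assumptions, not the agent's code. The names ''Relation'', ''parse_relations'' and ''split_version'' are invented for this example. Alternatives ("a | b") are kept together as a group, an architecture list in square brackets is accepted but ignored, and anything else unexpected raises an error.

```python
import re
from typing import NamedTuple, Optional


class Relation(NamedTuple):
    """One (package, relationship, version) entry, as in the tables above."""
    package: str
    relationship: Optional[str]
    version: Optional[str]


_RELATION_RE = re.compile(
    r"""^\s*(?P<package>[a-z0-9][a-z0-9+.-]+)      # package name
        \s*(?:\(\s*(?P<rel><<|<=|=|>=|>>|<|>)\s*   # optional relationship operator
        (?P<version>[^)\s]+)\s*\))?                # and the version it applies to
        \s*(?:\[[^\]]*\])?\s*$                     # optional arch list, ignored here
    """,
    re.VERBOSE,
)


def parse_relations(value):
    """Parse a Depends-style field into a list of groups of alternatives."""
    groups = []
    for entry in value.split(","):
        if not entry.strip():
            continue  # tolerate a trailing comma
        alternatives = []
        for alt in entry.split("|"):
            match = _RELATION_RE.match(alt)
            if match is None:
                raise ValueError(f"unparsable relation: {alt.strip()!r}")
            alternatives.append(
                Relation(match["package"], match["rel"], match["version"])
            )
        groups.append(alternatives)
    return groups


def split_version(version):
    """Split at the last hyphen into (upstream, debian revision).

    A version without a hyphen has no Debian revision. Any epoch prefix
    is left attached to the upstream part in this sketch.
    """
    upstream, sep, revision = version.rpartition("-")
    if not sep:
        return version, None
    return upstream, revision


if __name__ == "__main__":
    for group in parse_relations("libc6 (>= 2.3.6-6), debconf (>= 0.5) | debconf-2.0"):
        print(group)
    print(split_version("2.3.6-6"))
```

Keeping alternatives grouped means a flat "array of composite type" would need one extra column (for example a group number) to stay lossless; that is a design question for the table definitions.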
Note, there is a [[http://dep.debian.net/deps/dep5/ | proposal to make debian/copyright machine parsable.]] [[http://wiki.debian.org/DFSGLicenses | DFSG Licenses wiki]]

===== Potential Queries =====

In addition to the ability to search all metadata fields, here are some ideas for other queries that might be interesting:

  - Package size related
    - Largest/smallest packages ("What are the 10 largest packages?")
    - Constraints on package size ("Only show me packages smaller than 5mb")
    - Size of diff.gz ("What package has the largest/most patches to upstream?")
    - Size of diff.gz as a percentage of orig.tar.gz ("What package is the most patched?")
  - Maintainer related
    - Most/least per release/section/whatever. ("Which maintainer has the most packages in main/etch?")
    - Package count stats: mean/median/mode/standard deviation/graphs ("What is the average number of packages per maintainer?")
    - Regexp searches "Which packages are team maintained (".*alioth\.debian\.org.*" or ".* team .*"), sorted by number of packages maintained?"
  - Package date related
    - Oldest/Newest source/binary packages per release ("What packages haven't been rebuilt recently?")
  - dependency related
    - "I need to change package foo, show me all the packages that could be affected (both runtime dependencies and packages that build against foo)"
    - we could generate dependency graphs using dotty
  - Everything that [[http://www.debian.org/distrib/packages|http://packages.debian.org]] can do
  - Complex queries/questions
    - "What are the packages in unstable that are maintained by the Debian QA team but haven't had a new version in more than 1 year?"
    - "Show me packages in section admin with the same maintainer, sorted by number of packages per maintainer"
    - "Show me all the priority optional packages maintained by Anibal"
    - "How up to date with Debian policy are packages maintained by Matt Taggart?"
  - From Debian Policy we know what most of these fields should look like, so it would be an interesting QA exercise to flag weird stuff.
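
As a rough idea of what such a QA check could look like, here is a minimal Python sketch that takes one binary control stanza as a dict and reports anything that looks odd. The function name ''check_stanza'' and the exact rule set are assumptions for illustration. The rules only cover a few fields from the binary control table: the five Priority values, the Policy package name syntax, yes/no for Essential, a positive integer for Installed-Size, and a 3 or 4 part Standards-Version. The list of known fields is taken from that table plus Conflicts, Provides and Replaces, so maintainer-invented fields such as "Url" get flagged rather than rejected.

```python
import re

# Assumed rule set for illustration; extend it as the schema settles.
PRIORITIES = {"required", "important", "standard", "optional", "extra"}
PACKAGE_RE = re.compile(r"[a-z0-9][a-z0-9+.-]+")
STANDARDS_RE = re.compile(r"\d+\.\d+\.\d+(\.\d+)?")
KNOWN_FIELDS = {
    "Package", "Source", "Version", "Section", "Priority", "Architecture",
    "Essential", "Depends", "Recommends", "Suggests", "Enhances",
    "Pre-Depends", "Conflicts", "Provides", "Replaces", "Installed-Size",
    "Maintainer", "Description", "Standards-Version",
}


def check_stanza(stanza):
    """Return a list of human-readable problems found in one stanza."""
    problems = []

    package = stanza.get("Package")
    if package is None:
        problems.append("missing Package field")
    elif not PACKAGE_RE.fullmatch(package):
        problems.append(f"odd package name: {package!r}")

    priority = stanza.get("Priority")
    if priority is not None and priority not in PRIORITIES:
        problems.append(f"unknown priority: {priority!r}")

    essential = stanza.get("Essential")
    if essential is not None and essential not in ("yes", "no"):
        problems.append(f"Essential is not yes/no: {essential!r}")

    size = stanza.get("Installed-Size")
    if size is not None and not (size.isdigit() and int(size) > 0):
        problems.append(f"Installed-Size is not a positive integer: {size!r}")

    standards = stanza.get("Standards-Version")
    if standards is not None and not STANDARDS_RE.fullmatch(standards):
        problems.append(f"odd Standards-Version: {standards!r}")

    # Fields outside the known set are reported so they can be reviewed.
    for name in sorted(set(stanza) - KNOWN_FIELDS):
        problems.append(f"field not in known set: {name}")

    return problems


if __name__ == "__main__":
    sample = {"Package": "Foo", "Priority": "urgent", "Url": "http://example.org"}
    for problem in check_stanza(sample):
        print(problem)
```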