FOSSology: Advancing open source analysis and development
The (proposed) Debian Metadata agent is an agent that is passed pointers to Debian package format binary control files and source dsc files, and organizes and inserts that data into the database for later use by the repository front-ends. It is currently a work in progress.
This page, and eventually the agent, are being written by Matt Taggart.
There are several ways we will encounter the metadata
TODO: explain the diff/ directory scheme
Before we can design the database tables to hold the metadata, we need to analyse the data and determine types and sizes.
Debian Policy has good descriptions of the fields. However, it does not describe any size limitations for the fields. Looking at the source code for dpkg reveals that it mostly uses C “char *” types, so there is no hard-coded size.
I wrote a quick script to analyze all the current binary metadata in Debian's unstable release by parsing the /var/lib/dpkg/available file on a current unstable system. This is an easy way to get a ballpark answer; when the agent is complete we'll be able to run queries that give us a more precise answer. The script outputs each field, the length of the longest string found for that field, and, as an example, the name of a package with that longest length along with its field data.
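The original script is not reproduced on this page. As a rough illustration of the approach only, here is a minimal Python sketch that does the same kind of survey. The names `parse_stanzas`, `longest_fields`, and `report` are hypothetical, and the sketch assumes the file consists of blank-line-separated stanzas of `Field: value` lines with indented continuation lines. Multi-line values such as Description are measured as a whole here, so the short description would need separate handling.

```python
from collections import namedtuple

# Hypothetical record type: longest length seen, an example package, its value.
Longest = namedtuple("Longest", "length package value")


def parse_stanzas(lines):
    """Yield one dict per blank-line-separated stanza of 'Field: value' lines."""
    stanza, field = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current stanza.
            if stanza:
                yield stanza
            stanza, field = {}, None
        elif line[0] in " \t":
            # Continuation line: append to the most recent field.
            if field is not None:
                stanza[field] += "\n" + line.strip()
        elif ":" in line:
            field, _, value = line.partition(":")
            stanza[field] = value.strip()
    if stanza:
        yield stanza


def longest_fields(stanzas):
    """Map each field name to the longest value seen and an example package."""
    longest = {}
    for stanza in stanzas:
        for field, value in stanza.items():
            if field not in longest or len(value) > longest[field].length:
                longest[field] = Longest(len(value), stanza.get("Package", "?"), value)
    return longest


def report(path="/var/lib/dpkg/available"):
    """Print field, longest length, and an example package for each field."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for field, rec in sorted(longest_fields(parse_stanzas(handle)).items()):
            print(f"{field}: {rec.length} (e.g. {rec.package})")
```

Calling `report()` on a Debian system would print one line per field; the numbers will differ from the tables on this page depending on the release surveyed.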
PostgreSQL data types are defined in the PostgreSQL manual. According to the character types section of the manual:
Tip: There are no performance differences between these three types (char, varchar, text), apart from the increased storage size when using the blank-padded type. While character(n) has performance advantages in some other database systems, it has no such advantages in PostgreSQL. In most situations text or character varying should be used instead.
So for most text fields we can use 'text' rather than 'char' or 'varchar', without having to worry about choosing a large enough size or about subpar performance.
Based on the above descriptions and the script output, here is a summary:
These fields are found in binary control files (#1) and source control files (#2) (one set per binary package the source package delivers). Here are the fields defined by Debian Policy:
| Field | Type | Length Found | Assumed Max | Suggested Datatype |
|---|---|---|---|---|
| Package | limited text | 55 | 70 | text |
| Source | limited text | 51 | 70 | text |
| Version | mixed numbers and text, upstream dash debian version | 35 | 40 | array of two text or maybe two different fields |
| Section | limited text | 20 | 25 | text |
| Priority | one of five alpha-only strings | 9 | 9 | text |
| Architecture | one of 25 mixed strings | 18 | 25 | text |
| Essential | boolean | 3(yes) | 3 | boolean |
| Depends et al | complex comma separated package names with versioned relationships | 3300+ chars, ~150 versioned depends | 200 versioned depends | array of composite type: package, relationship, version (all text) for each of Depends, Recommends, Suggests, Enhances, Pre-Depends |
| Installed-Size | kilobytes (positive int) | 6 digits | 8 digits? | integer |
| Maintainer | name plus RFC822 email address in angle brackets | 96 | 120 | text (maybe separate name and email?) |
| Description | single line short description, plus multiline long | short(89) | short(100) | short text, long text |
Architecture in this context is the architecture for which this particular binary deb is built, so something like i386, or all (for architecture-independent packages).
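To make the "array of composite type: package, relationship, version" suggestion for Depends and friends more concrete, here is a minimal Python sketch of how such a field might be split. The names `Relation` and `parse_relations` are hypothetical. The sketch assumes the binary control syntax only: comma-separated entries, `|` for alternatives, and an optional parenthesised relationship and version. It does not handle the architecture restrictions that can appear in build dependencies.

```python
import re
from typing import List, NamedTuple, Optional


class Relation(NamedTuple):
    package: str
    relationship: Optional[str]  # e.g. ">=", "<<", "="; None when unversioned
    version: Optional[str]


_RELATION = re.compile(
    r"""^\s*(?P<package>[a-z0-9][a-z0-9+.-]*)       # package name
        (?:\s*\(\s*(?P<rel><<|<=|=|>=|>>|<|>)\s*    # optional relationship
        (?P<version>[^)\s]+)\s*\))?                  # optional version
        \s*$""",
    re.VERBOSE,
)


def parse_relations(value: str) -> List[List[Relation]]:
    """Split a Depends-style field into groups of alternatives.

    The outer list has one element per comma-separated entry; each inner
    list holds the alternatives that were separated by '|'.
    """
    groups = []
    for entry in value.split(","):
        if not entry.strip():
            continue
        alternatives = []
        for alternative in entry.split("|"):
            match = _RELATION.match(alternative)
            if match is None:
                raise ValueError(f"unparsable relation: {alternative!r}")
            alternatives.append(
                Relation(match["package"], match["rel"], match["version"])
            )
        groups.append(alternatives)
    return groups
```

For example, `parse_relations("libc6 (>= 2.3.6-6), debconf (>= 0.5) | debconf-2.0")` yields two groups, the second containing two alternatives. Whether alternatives deserve their own nesting level in the database is an open design question that this sketch does not settle.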
In addition to the above there is additional info we can infer based on the binary package we are processing:
These fields are found in source package control files (#2). Here are the fields defined by Debian Policy:
| Field | Type | Length Found | Assumed Max | Suggested Datatype |
|---|---|---|---|---|
| Source | limited text | | | text |
| Maintainer | name plus RFC822 email address in angle brackets | | | text (maybe separate name and email?) |
| Uploaders | multiple instances of name plus RFC822 email address in angle brackets | | | array of text (maybe separate name and email?) |
| Section | limited text | 20 | 25 | text |
| Priority | one of five alpha-only strings | 9 | 9 | text |
| Build-Depends et al | complex comma separated package names with versioned relationships | | | array of composite type: package, relationship, version (all text) for each of Build-Depends, Build-Depends-Indep, Build-Conflicts, Build-Conflicts-Indep |
| Standards-Version | 3 or 4 dotted version number string | | | text |
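The "maybe separate name and email?" question for Maintainer and Uploaders can be explored with a small Python sketch like the one below. The function names `split_maintainer` and `split_uploaders` are hypothetical, and the sketch assumes every entry has the form `Name <email>`. Uploaders entries are located by their angle-bracketed address rather than by splitting on commas, since a name may itself contain a comma.

```python
import re
from typing import List, Tuple

# One "Name <email>" entry; the name may be empty or contain commas.
_ADDRESS = re.compile(r"\s*(?P<name>[^<>]*?)\s*<(?P<email>[^<>\s]+)>")


def split_maintainer(value: str) -> Tuple[str, str]:
    """Return (name, email) for a single Maintainer-style value."""
    match = _ADDRESS.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unparsable maintainer: {value!r}")
    return match["name"], match["email"]


def split_uploaders(value: str) -> List[Tuple[str, str]]:
    """Return a list of (name, email) pairs for an Uploaders-style value."""
    people = []
    for match in _ADDRESS.finditer(value):
        # Drop the comma that separated this entry from the previous one.
        name = match["name"].strip().lstrip(",").strip()
        people.append((name, match["email"]))
    return people
```

With this approach, `split_uploaders("Jane Doe <jane@example.org>, Doe, John <john@example.org>")` keeps "Doe, John" together as one name. The addresses are placeholders, not real maintainers.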
Just as we can infer additional information about binary packages, we can do the same for source packages (minus Architecture).
For type #3 from the list above, here is what Debian Policy lists:
| Field | Type | Length Found | Assumed Max | Suggested Datatype |
|---|---|---|---|---|
| Format | limited version string | 3 | 5 | ? |
| Source | limited text | | | text |
| Version | mixed numbers and text, upstream dash debian version | 35 | 40 | array of two text or maybe two different fields |
| Maintainer | name plus RFC822 email address in angle brackets | | | text (maybe separate name and email?) |
| Uploaders | multiple instances of name plus RFC822 email address in angle brackets | | | array of text (maybe separate name and email?) |
| Binary | complex comma separated list of binary packages | | | array of text |
| Architecture | one of 25 mixed strings | 18 | 25 | text |
| Build-Depends et al | complex comma separated package names with versioned relationships | | | array of composite type: package, relationship, version (all text) for each of Build-Depends, Build-Depends-Indep, Build-Conflicts, Build-Conflicts-Indep |
| Standards-Version | 3 or 4 dotted version number string | | | text |
| Files | complex list of md5sum, size, filename for source package files probably dsc/orig.tar.gz/diff.gz or just dsc/tar.gz | | | text |
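Although the table suggests storing Files as plain text, the dsc form of the field is regular enough to split into rows. The following Python sketch shows one way this might look; `SourceFile` and `parse_files` are hypothetical names, and the sketch assumes each non-empty line holds exactly three whitespace-separated columns (md5sum, size, filename), as in a dsc.

```python
from typing import List, NamedTuple


class SourceFile(NamedTuple):
    md5sum: str
    size: int       # size as given in the Files field
    filename: str


def parse_files(value: str) -> List[SourceFile]:
    """Parse the multi-line Files field of a dsc into structured rows."""
    entries = []
    for line in value.splitlines():
        parts = line.split()
        if not parts:
            # The field usually starts with an empty first line.
            continue
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {line!r}")
        md5sum, size, filename = parts
        if len(md5sum) != 32:
            raise ValueError(f"md5sum is not 32 characters: {md5sum!r}")
        entries.append(SourceFile(md5sum.lower(), int(size), filename))
    return entries
```

A structured form like this would let us query by filename or checksum later, at the cost of one more table or composite type.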
There is additional data in the dsc that we could not determine from the Source control file itself. These things are determined at source package build time and added to the dsc. We could dig in the source and determine them ourselves, but the dsc makes it easier. These additional fields are:
Section and Priority are the only two not also provided by the dsc. It's easier for us to use the dsc since it does not require unpacking the source and provides more info.
Type #4 from our list. Debian Policy lists the following fields:
| Field | Type | Length Found | Assumed Max | Suggested Datatype |
|---|---|---|---|---|
| Format | limited version string | 3 | 5 | ? |
| Date | date in RFC822 format | 31 | 32 | text (maybe more complicated?) |
| Source | limited text | | | text |
| Binary | complex comma separated list of binary packages | | | array of text |
| Architecture | one of 25 mixed strings | 18 | 25 | text |
| Version | mixed numbers and text, upstream dash debian version | 35 | 40 | array of two text or maybe two different fields |
| Distribution | alpha-only text | 12 | 12 | text |
| Urgency | low/medium/high plus optional comment in parens | | | text |
| Maintainer | name plus RFC822 email address in angle brackets | | | text (maybe separate name and email?) |
| Changed-By | name plus RFC822 email address in angle brackets | | | text (maybe separate name and email?) |
| Description | single line short description, plus multiline long | short(89) | short(100) | short text, long text |
| Closes | space-separated list of bug numbers | | | array of ? |
| Changes | complex changelog data, structured but complicated | | | text, but could parse and put in a more complicated structure if valuable |
| Files | complex list of md5sum, size, section, priority, filename for each file in the upload | | | text |
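The Version rows in the tables above suggest storing the upstream and Debian parts separately. Here is a minimal Python sketch of that split, under the assumption that versions follow the usual `[epoch:]upstream[-revision]` layout, where the revision is whatever follows the last dash. The names `DebianVersion` and `split_version` are hypothetical, and the epoch is an extra component not listed in the tables.

```python
from typing import NamedTuple, Optional


class DebianVersion(NamedTuple):
    epoch: int                # 0 when no epoch is given
    upstream: str
    revision: Optional[str]   # None for versions without a dash


def split_version(version: str) -> DebianVersion:
    """Split a version string into epoch, upstream version, and revision."""
    version = version.strip()
    epoch = 0
    head, colon, rest = version.partition(":")
    if colon and head.isdigit():
        epoch, version = int(head), rest
    # The revision is everything after the LAST dash; upstream may contain dashes.
    upstream, dash, revision = version.rpartition("-")
    if not dash:
        return DebianVersion(epoch, version, None)
    return DebianVersion(epoch, upstream, revision)
```

For instance, `split_version("1:2.3.6-6")` gives epoch 1, upstream "2.3.6", and revision "6". Note that splitting alone does not give correct ordering; comparing versions needs dpkg's own comparison rules.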
As stated above, changes files are not usually encountered in a normal Debian repository. If we happen to come upon one, it will probably have been accidentally included as part of a stand-alone product or as part of a build area. Assuming that the changes file is found in proximity to the Debian package to which it refers, it can provide some additional useful information (if it is found by itself we probably can't do much with it; we might be able to search the repo for its missing parent, but I don't know how useful that is). In addition to data that can be determined from the binary or source packages to which the changes file refers, we get the following:
Packages files are binary package records separated by single blank lines. Each record contains the same fields as the binary control data with the addition of:
| Field | Type | Length Found | Assumed Max | Suggested Datatype |
|---|---|---|---|---|
| Size | size in bytes of the deb | 9 | 12 | ? |
| Filename | full path to deb in pool, limited text | | | ? |
| MD5sum | md5sum | 32 | 32 | ? |
| SHA1 | sha1sum | 40 | 40 | ? |
| SHA256 | sha256sum | 64 | 64 | ? |
As mentioned before, details can also be inferred from the Packages file's location in the hierarchy.
Sources files are source package records separated by single blank lines. Each record contains the same fields as the source control data with the addition of:
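As a rough sketch of how the extra Packages fields might be pulled out and sanity-checked before insertion, consider the Python below. The names `iter_records`, `archive_fields`, and `EXPECTED_HEX_LENGTHS` are hypothetical. The sketch assumes records are separated by exactly one blank line, as described above, and that Size is a plain integer.

```python
import string

# Digest lengths taken from the table above.
EXPECTED_HEX_LENGTHS = {"MD5sum": 32, "SHA1": 40, "SHA256": 64}


def iter_records(text):
    """Yield one dict per record of a Packages-style file."""
    for block in text.split("\n\n"):
        record, field = {}, None
        for line in block.splitlines():
            if line[:1] in (" ", "\t"):
                # Continuation of the previous field (e.g. long description).
                if field is not None:
                    record[field] += "\n" + line.strip()
            elif ":" in line:
                field, _, value = line.partition(":")
                record[field] = value.strip()
        if record:
            yield record


def archive_fields(record):
    """Extract Size, Filename, and checksums, validating the checksum shapes."""
    out = {
        "Filename": record.get("Filename"),
        "Size": int(record["Size"]) if "Size" in record else None,
    }
    for name, length in EXPECTED_HEX_LENGTHS.items():
        digest = record.get(name)
        if digest is not None:
            is_hex = all(char in string.hexdigits for char in digest)
            if len(digest) != length or not is_hex:
                raise ValueError(f"{name} does not look like a digest: {digest!r}")
        out[name] = digest
    return out
```

Since the checksum lengths are fixed, these could arguably be stored as fixed-width columns rather than 'text', but that choice is left open here.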
| Field | Type | Length Found | Assumed Max | Suggested Datatype |
|---|---|---|---|---|
| Directory | full path to package directory in pool, limited text | | | ? |
| Field | Type | Length Found | Assumed Max | Suggested Datatype |
|---|---|---|---|---|
| Archive | limited text (unstable,testing,etc) | | | ? |
| Component | limited text (main,contrib,etc) | | | ? |
| Origin | limited text (Debian,Ubuntu,etc) | | | ? |
| Label | limited text (Debian,etc) | | | ? |
| Architecture | one of 25 mixed strings | 18 | 25 | text |
TODO add top-level fields
Contains a 32-line header explaining the file, followed by a list of all files in the release. Each line consists of the full path to the file as installed on the system (minus the initial slash), whitespace, the section, a slash, and the package name. For example:

    etc/nosendfile net/sendfile
    usr/X11R6/bin/noseguy x11/xscreensaver
    usr/X11R6/man/man1/noseguy.1x.gz x11/xscreensaver
    usr/doc/examples/ucbmpeg/mpeg_encode/nosearch.param graphics/ucbmpeg
    usr/lib/cfengine/bin/noseyparker admin/cfengine
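A line format like this is straightforward to parse. The Python sketch below is one possible approach, with the hypothetical name `parse_contents`. It assumes the header is a fixed number of lines (32 by default, per the description above), that the path is everything before the last run of whitespace, and, as an additional assumption not stated above, that several `section/package` entries may be joined by commas when more than one package ships the same file.

```python
def parse_contents(lines, header_lines=32):
    """Yield (path, section, package) tuples from a Contents-style file.

    `lines` is any iterable of text lines; the first `header_lines`
    lines are skipped as the explanatory header.
    """
    for number, raw in enumerate(lines):
        if number < header_lines:
            continue
        line = raw.rstrip("\n")
        # Split on the LAST run of whitespace so paths containing spaces survive.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        path, location = parts
        # Assumption: multiple owners are comma-separated.
        for qualified in location.split(","):
            section, _, package = qualified.rpartition("/")
            yield path, section, package
```

Applied to the first example line above (with `header_lines=0`), this yields `("etc/nosendfile", "net", "sendfile")`.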
It's worth noting that maintainers are allowed to invent their own RFC822-compliant fields and add them to the source control file. Depending on where they are added that data might be passed on to the binary control data and then the Packages files, etc. One such example of this is the “Url” field which is in wide use, but not yet included in Debian Policy. Our design should allow for such cases and alert us when it encounters them.
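One way to meet this requirement is to keep unrecognised fields rather than drop them, and to log a warning for each. The Python sketch below illustrates the idea; `KNOWN_BINARY_FIELDS` and `partition_fields` are hypothetical names, the known-field set is taken from the binary control table on this page, and field names are compared case-insensitively.

```python
import logging

logger = logging.getLogger("debian_metadata")  # hypothetical logger name

# Fields from the binary control table on this page.
KNOWN_BINARY_FIELDS = frozenset({
    "Package", "Source", "Version", "Section", "Priority", "Architecture",
    "Essential", "Depends", "Recommends", "Suggests", "Enhances",
    "Pre-Depends", "Installed-Size", "Maintainer", "Description",
})


def partition_fields(record, known=KNOWN_BINARY_FIELDS):
    """Split a parsed record into (recognised, extra) dicts.

    Extra fields are kept, not discarded, and each one triggers a warning
    so that we notice new conventions such as "Url".
    """
    known_lower = {name.lower() for name in known}
    recognised, extra = {}, {}
    for field, value in record.items():
        target = recognised if field.lower() in known_lower else extra
        target[field] = value
    for field in sorted(extra):
        logger.warning("unknown field %r in package %r",
                       field, record.get("Package", "?"))
    return recognised, extra
```

The extra dict could then be stored in a generic key/value table, which avoids a schema change each time a new field appears.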
With so many different types and locations of metadata, we need to break the problem down and initially work on the pieces individually in order to make progress. However, it is important to keep all the metadata types in mind when designing so that we can take advantage of any synergies. I expect that the database design will be an iterative process as we try things and learn what works.
We will eventually have various methods for extracting all the metadata types listed above, but we need to start somewhere. There are a couple of initial strategies I have been considering:
To be determined…
Parsing code in progress…
Database manipulating code waiting on table definitions.
Agent interface code waiting on database and discussion with team.
Note: there is a proposal to make debian/copyright machine-parsable.
In addition to the ability to search all metadata fields, here are some ideas for other queries that might be interesting