Distfile Patching Support

1.  Project Description

This is a Gentoo GSoC 2011 project.

The goal of this project is to develop the software and implement the infrastructure needed to generate binary patches for all distfiles available on the Gentoo Linux mirrors, using diffball, a generic tool for creating binary patches. Package managers would fetch and apply these patches whenever possible. The package manager will be able to detect whether a patch is valid and discard it if necessary.

The patches would be created by a server owned by Gentoo Linux Infrastructure and mirrored as usual, or as the infrastructure team prefers. Ideally, each mirror would hold all the distfiles needed by the packages currently in the tree, plus sequences of binary patches between those distfiles. When a distfile is removed from the mirror, all patches against that file should be removed as well.

This project also aims to address a major issue found in GLEP 25: distfiles reconstructed with different versions of compressors such as bzip2 can end up with different md5sums. Distfiles with mismatching md5sums will be saved in a separate directory, such as ${DISTDIR}/delta-reconstructed/, and the package manager will verify the checksum of each piece used to reconstruct the distfile, because if we can trust each piece, we can trust the distfile itself. With this approach, package managers that do not have patching support will not complain about wrong distfiles; they will simply download the full sources as usual. Package managers with patching support will handle the files in ${DISTDIR}/delta-reconstructed/ and use them properly.
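The trust argument above can be illustrated with a minimal Python sketch. It is not the project's code: the function names, the use of SHA-256 instead of md5sum, and the three result labels are assumptions made for illustration. The idea is that a rebuilt bzip2 file whose compressed bytes differ from the published ones can still be accepted if its uncompressed payload matches a known checksum.

```python
import bz2
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest used for every comparison in this sketch (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


def classify_reconstructed(compressed: bytes,
                           expected_compressed: str,
                           expected_uncompressed: str) -> str:
    """Decide where a rebuilt bzip2 distfile may be stored.

    Returns "distdir" when the compressed bytes match the published checksum,
    "delta-reconstructed" when only the uncompressed payload matches, and
    "reject" when neither matches.
    """
    if sha256_hex(compressed) == expected_compressed:
        return "distdir"
    try:
        payload = bz2.decompress(compressed)
    except (OSError, ValueError):
        return "reject"  # not a valid bzip2 stream at all
    if sha256_hex(payload) == expected_uncompressed:
        return "delta-reconstructed"
    return "reject"
```

Compressing the same payload with two different bzip2 block sizes is enough to produce two different compressed files, which is the situation the second branch is meant to handle.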

The software produced by this project would allow people to save a lot of bandwidth and disk space when upgrading their Gentoo Linux installations. As a result, the Gentoo Linux mirrors will also save a lot of bandwidth and will be able to serve more users with the same resources. Usually only a few parts of a distfile change in an upgrade, so it is not worth downloading the full sources again. In effect, people fetch the same content over and over again when upgrading their systems.

The server-side part will be integrated by Gentoo Linux infra and will produce the binary patches transparently. The client-side part will be fully integrated with the package managers (Portage for now) and will work transparently for users. Distfile patching support will not break the compatibility of the mirrors or the gentoo-x86 tree with package managers that do not support it yet. Any package manager that does not know about binary patches for distfiles will keep working as it does now, so this project can be integrated and tested without breaking anything that currently works.

Other Linux distributions and OSS projects have produced many binary patching implementations, such as Fedora's Presto project, but they are usually tied to a specific package manager and designed for binary packages, not for source packages like those Gentoo uses. There is also a Gentoo-related implementation called Deltup, but it is poorly implemented, lacks some features we need, and is hard to integrate cleanly with Gentoo Infrastructure. One of the main issues with Deltup is that the remote server generates the binary patches on demand: the user has to wait in a queue while the server downloads the sources and creates the patches, which can take long enough that the package manager times out and downloads the full sources instead.

2.  Developers

Developer Nickname Role
rafaelmartins Member

All developers can be reached by e-mail using nickname@gentoo.org.

3.  Project tasks

The tasks of the distpatch project are:

XZ support for diffball - Implement XZ support for diffball (finished)

Write patches for diffball to add XZ support. Currently only bzip2 and gzip are supported.

Starting date: 23 May 2011
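To illustrate what supporting an extra compression format involves, the sketch below recognizes gzip, bzip2 and XZ streams by their leading magic bytes and dispatches to the matching decompressor. It uses Python's standard library only and is not diffball code; the function names are hypothetical.

```python
import bz2
import gzip
import lzma
from typing import Optional

# Leading magic bytes of each container format.
MAGIC = (
    (b"\xfd7zXZ\x00", "xz"),
    (b"BZh", "bzip2"),
    (b"\x1f\x8b", "gzip"),
)

DECOMPRESSORS = {
    "xz": lzma.decompress,
    "bzip2": bz2.decompress,
    "gzip": gzip.decompress,
}


def detect_compression(header: bytes) -> Optional[str]:
    """Return the format name for a known magic prefix, or None."""
    for magic, name in MAGIC:
        if header.startswith(magic):
            return name
    return None


def decompress_any(data: bytes) -> bytes:
    """Decompress data in any recognized format; pass other data through."""
    kind = detect_compression(data)
    if kind is None:
        return data  # treated as already uncompressed
    return DECOMPRESSORS[kind](data)
```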

A program to generate binary patches for several versions of one package - distdiffer script (finished)

Write a program that creates sequential patches between several versions of the sources of a package. It will later be called by the batch job that produces sequences of binary patches for all available distfiles. It will also create a file with checksum information for all binary patches of the package and for the uncompressed sources, to be used by the reconstructor, and will be responsible for verifying whether the distfile can be split into patches.

Starting date: 30 May 2011
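A rough sketch of the planning side of this task is given below. It only pairs consecutive versions and records checksums; generating the delta itself is left to diffball and is not shown. The patch naming scheme, the ":uncompressed" key suffix, and the choice of SHA-256 are assumptions made for illustration, not the format used by distdiffer.

```python
import hashlib
from typing import Dict, List, Tuple


def plan_patch_chain(versions: List[str]) -> List[Tuple[str, str]]:
    """Pair each distfile with its successor; input must be ordered oldest first."""
    return list(zip(versions, versions[1:]))


def patch_name(source: str, target: str) -> str:
    """Hypothetical file name for the delta between two distfiles."""
    return f"{source}-{target}.dpatch"


def checksum_manifest(patches: Dict[str, bytes],
                      uncompressed: Dict[str, bytes]) -> Dict[str, str]:
    """Record a digest for every patch and every uncompressed source."""
    manifest = {}
    for name, blob in patches.items():
        manifest[name] = hashlib.sha256(blob).hexdigest()
    for name, blob in uncompressed.items():
        manifest[name + ":uncompressed"] = hashlib.sha256(blob).hexdigest()
    return manifest
```

With three versions of a package, `plan_patch_chain` yields two pairs, one per upgrade step, so a user on the oldest version applies both patches in order.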

A program to reconstruct a distfile from a sequence of binary patches - distpatcher script

Write a program that takes a sequence of binary patches, verifies the checksum of each one against the checksum file created by the previous deliverable, and reconstructs the distfile. If the reconstructed distfile is correct, it will be saved in ${DISTDIR}; if not, it will be saved in ${DISTDIR}/delta-reconstructed/. This program will also maintain a local database of checksums for the files inside ${DISTDIR}/delta-reconstructed/.

Starting date: 13 Jun 2011
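The verify-then-apply loop of the reconstructor could look roughly like the sketch below. The `apply_patch` callable is a stand-in for whatever actually applies a delta (diffball in this project) and is passed in by the caller; the exception name, the manifest layout, and SHA-256 are assumptions.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


class ChecksumMismatch(Exception):
    """Raised when a patch does not match its recorded checksum."""


def reconstruct(base: bytes,
                patches: List[Tuple[str, bytes]],
                manifest: Dict[str, str],
                apply_patch: Callable[[bytes, bytes], bytes]) -> bytes:
    """Apply patches in order, refusing any patch that fails verification."""
    current = base
    for name, blob in patches:
        if hashlib.sha256(blob).hexdigest() != manifest.get(name):
            raise ChecksumMismatch(name)
        current = apply_patch(current, blob)
    return current
```

Checking each patch before it is applied means a corrupted or tampered patch stops the chain early, and the caller can fall back to fetching the full distfile.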

A program to generate binary patches for all available distfiles, as needed - distdiffall

Write a program that creates sequential patches for several distfiles at once as a batch job. It will check which packages need patches and create them as needed. It will also be responsible for removing old binary patches that are no longer needed, and will use the program from the second deliverable to create the binary patches for each package.

Starting date: 27 Jun 2011
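The bookkeeping of such a batch job can be sketched as a set difference between the patches that should exist and the patches that already do. The data layout below (a mapping from package name to its ordered distfiles, and patches identified by a source/target pair) is an assumption for illustration.

```python
from typing import Dict, List, Set, Tuple

Patch = Tuple[str, str]  # (source distfile, target distfile)


def sync_plan(distfiles_by_package: Dict[str, List[str]],
              existing_patches: Set[Patch]) -> Tuple[Set[Patch], Set[Patch]]:
    """Return (patches to create, stale patches to remove).

    Each package's distfiles must be ordered oldest first.
    """
    wanted: Set[Patch] = set()
    for files in distfiles_by_package.values():
        wanted.update(zip(files, files[1:]))
    to_create = wanted - existing_patches
    to_remove = existing_patches - wanted
    return to_create, to_remove
```

A patch whose source or target distfile has left the mirror no longer appears in the wanted set, so it ends up in the removal set automatically.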

Patch Portage to use binary patch sequences - Portage support for binary deltas

Adapt Portage to use reconstructed tarballs by incorporating the program from the third deliverable. Portage should look in ${DISTDIR}/delta-reconstructed/ and use the reconstructed files without being tripped up by their mismatching checksums, relying on the local checksum database created by the reconstructor script to verify the checksums of the patches and of the uncompressed sources. More time is reserved for this deliverable because I do not yet know how complex it will be.

Starting date: 18 Jul 2011
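The lookup order described for this task might be sketched as follows. This is not Portage code and does not use Portage's API: the function name, the in-memory dictionary standing in for the local checksum database, and SHA-256 are all assumptions. A file in ${DISTDIR} is returned as-is on the assumption that the package manager verifies it through its normal checks.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def locate_distfile(distdir: Path,
                    name: str,
                    local_db: Dict[str, str]) -> Optional[Path]:
    """Find a usable copy of a distfile, preferring the regular location.

    A reconstructed copy is accepted only if its digest matches the entry
    recorded in the local checksum database.
    """
    regular = distdir / name
    if regular.is_file():
        return regular  # verified later by the usual manifest checks
    rebuilt = distdir / "delta-reconstructed" / name
    if rebuilt.is_file():
        digest = hashlib.sha256(rebuilt.read_bytes()).hexdigest()
        if local_db.get(name) == digest:
            return rebuilt
    return None  # caller falls back to downloading the full distfile
```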

Software integration and tests - Infrastructure test

Make the whole system work in a simulated environment, to see how it behaves and to fix any bugs found.

Starting date: 1 Aug 2011

Gather and polish documentation written during each of the previous phases - Release documentation

Starting date: 8 Aug 2011

4.  Resources

Resources offered by the distpatch project are:



Page updated 6 Jun 2011

Summary: Improve the performance of the Gentoo Linux mirrors by reducing the overall bandwidth load, allowing people to fetch binary patches from the mirrors instead of the full source tarballs when updating a package. This project is partially based on GLEP 25, by Brian Harring.

Rafael G. Martins
Author

Copyright 2001-2014 Gentoo Foundation, Inc.