
Rebuilding the entire Debian archive twice on arm64 hardware for fun and profit




[ Please note the cross-post and respect the Reply-To... ]

Hi folks,

This has taken a while in coming, for which I apologise. There's a lot
of work involved in rebuilding the whole Debian archive, and many many
hours spent analysing the results. You learn quite a lot, too! :-)

I promised way back before DC18 that I'd publish the results of the
rebuilds that I'd just started. Here they are, after a few false
starts. I've been rebuilding the archive *specifically* to check if we
would have any problems building our 32-bit Arm ports (armel and
armhf) using 64-bit arm64 hardware. I might have found other issues
too, but that was my goal.

The logs for all my builds are online at

  https://www.einval.com/debian/arm/rebuild-logs/

for reference. See in particular

  https://www.einval.com/debian/arm/rebuild-logs/armel/FAIL/FAIL.html
  https://www.einval.com/debian/arm/rebuild-logs/armhf/FAIL/FAIL.html

for automated analysis of the build logs that I've used as the basis
for the stats below.

Executive summary
=================

As far as I can see we're basically fine to use arm64 hosts for
building armel and armhf, *so long as* those hosts include hardware
support for the 32-bit A32 instruction set. As I've mentioned before
[1] that's not a given on *all* arm64 machines, but there are
sufficient machine types available that I think we should be
fine. There are a couple of things we need to do in terms of setup -
see "Machine configuration" below.

[1] https://lists.debian.org/debian-arm/2018/06/msg00062.html

Methodology
===========

I (naïvely) just attempted to rebuild all the source packages in
unstable main, at first using pbuilder to control the build process
and then later using sbuild instead. I didn't think to check on the
stated architectures listed for the source packages, which was a
mistake - I would do it differently if redoing this test. That will
have contributed quite a large number of failures in the stats below,
but I believe I have accounted for them in my analysis.
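
As a rough illustration of the check I skipped, here is a minimal
Python sketch that decides whether a source package's Architecture
field plausibly covers armel or armhf. The function name and the
short wildcard list are assumptions made for this example and are
deliberately simplified; the authoritative matching is what
dpkg-architecture implements.

```python
def may_build_on(architecture_field, arch):
    """Simplified check of a source package's Architecture field.

    Only meant for arch in ("armel", "armhf"). The wildcard set below
    is an assumption for illustration, not dpkg's full matching rules.
    """
    accepted = {"any", "all", arch, "linux-any", "any-arm"}
    return any(token in accepted for token in architecture_field.split())


if __name__ == "__main__":
    for field in ("any all", "amd64 i386", "linux-any", "armhf arm64"):
        print(f"{field!r:15} armel: {may_build_on(field, 'armel')}")
```

Filtering the build queue with something like this up front would keep
packages that were never meant for these ports out of the failure
stats.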

I built lots of packages, using a range of machines in a small build
farm at home:

 * Macchiatobin
 * Seattle
 * Synquacer
 * Multiple Mustangs

using my local mirror for improved performance when fetching
build-deps etc. I started off with a fixed list of packages that were
in unstable when I started each rebuild, for the sake of
simplicity. That's one reason why I have two different numbers of
source packages attempted for each arch below. If packages failed due
to no longer being available, I simply requeued using the latest
version in unstable at that point.

I then developed a script to scan the logs of failed builds to pick up
on patterns that matched with obvious causes. Once that was done, I
worked through all the failures to (a) verify those patterns, and (b)
identify any other failures. I've classified many of the failures to
make sense of the results. I've also scanned the BTS for existing bugs
matching my failed builds (and linked to them), or filed new bugs
where I could not find matches.
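
The general shape of that kind of log scan can be sketched in a few
lines of Python. This is not the script from the git repository
mentioned below: the regular expressions are hypothetical examples
chosen for illustration, and only the category labels are taken from
the results in this mail.

```python
import re
from collections import Counter

# Hypothetical patterns, for illustration only. Real build logs need
# more careful matching, and every hit still needs manual checking.
PATTERNS = [
    ("Alignment problem", re.compile(r"Bus error|SIGBUS", re.I)),
    ("Segmentation fault", re.compile(r"Segmentation fault|SIGSEGV", re.I)),
    ("Illegal instruction", re.compile(r"Illegal instruction|SIGILL", re.I)),
]


def classify(log_text):
    """Return the label of every pattern found in one build log."""
    labels = [label for label, rx in PATTERNS if rx.search(log_text)]
    return labels or ["Unclassified"]


def summarise(logs):
    """Count how many logs fall into each category.

    `logs` maps a package name to the text of its failed build log.
    """
    counts = Counter()
    for text in logs.values():
        counts.update(classify(text))
    return counts


if __name__ == "__main__":
    sample = {
        "pkg-a": "running tests\nBus error (core dumped)\n",
        "pkg-b": "make: *** [check] Segmentation fault\n",
        "pkg-c": "dh_auto_test: error: returned exit code 2\n",
    }
    for label, count in summarise(sample).items():
        print(f"{count} log(s) showing {label}")
```

Anything left as "Unclassified" is what then has to be read by hand.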

I did *not* investigate fully every build failure. For example, where
a package has never been built before on armel or armhf and failed
here I simply noted that fact. Many of those are probably real bugs,
but beyond the scope of my testing.

For reference, all my scripts and config are in git at

  https://git.einval.com/cgi-bin/gitweb.cgi?p=buildd-scripts.git

armel results
=============

Total source packages attempted: 28457
Successfully built:              25827
Failed:                           2630

Almost half of the failed builds were simply due to the lack of a
single desired build dependency (nodejs:armel, 1289 failures). There
was a smattering of other notable causes:

 * 100 log(s) showing build failures (java/javadoc)

   Java build failures seem particularly opaque (to me!), and in many
   cases I couldn't ascertain if it was a real build problem or just
   maven being flaky. :-(

 * 15 log(s) showing Go 32-bit integer overflow

   Quite a number of Go packages are blindly assuming sizes for 64-bit
   hosts. That's probably fair, but seems unfortunate.

 * 8 log(s) showing Sbuild build timeout

   I was using quite a generous timeout (12h) with sbuild, but still a
   very small number of packages failed. I'd earlier abandoned
   pbuilder for sbuild as I could not get pbuilder to behave sensibly
   with timeouts.

The stats that matter are the arch-specific failures for armel:

  * 13 log(s) showing Alignment problem
  * 5 log(s) showing Segmentation fault
  * 1 log showing Illegal instruction

and the new bugs I filed:

  * 3 bugs for arch misdetection
  * 8 bugs for alignment problems
  * 4 bugs for arch-specific test failures
  * 3 bugs for arch-specific misc failures

Considering the number of package builds here, I think these numbers
are basically noise. The vast majority of the failures I found were
either already known in the BTS (260), unrelated to what I was looking
for, or both.

See below for more details about host configuration for armel builds.

armhf results
=============

Total source packages attempted: 28056
Successfully built:              26772
Failed:                           1284

FTAOD: I attempted fewer package builds for armhf as we just had a
smaller number of packages when I started that rebuild. A few weeks
later, it seems we had a few hundred more source packages for the
armel rebuild.

The armhf rebuild showed broadly the same percentage of failures, if
you take into account the nodejs difference - it exists in the armhf
archive, so many hundreds more packages could build using it.
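
To make that comparison concrete, here is the arithmetic as a small
Python sketch using the totals reported in this mail. The function
name is just for illustration, and subtracting all 1289 nodejs
failures from the armel total is a simplifying assumption (some of
those packages might well have failed for other reasons too).

```python
def failure_rate(attempted, failed, excluded=0):
    """Percentage of attempted builds that failed, ignoring `excluded`."""
    return 100.0 * (failed - excluded) / attempted


armel_raw = failure_rate(28457, 2630)
armel_without_nodejs = failure_rate(28457, 2630, excluded=1289)
armhf_raw = failure_rate(28056, 1284)

print(f"armel, all failures:      {armel_raw:.2f}%")             # 9.24%
print(f"armel, excluding nodejs:  {armel_without_nodejs:.2f}%")  # 4.71%
print(f"armhf, all failures:      {armhf_raw:.2f}%")             # 4.58%
```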

In a similar vein for notable failures:

 * 89 log(s) showing build failures (java/javadoc)

   Similar problems, I guess...

 * 15 log(s) showing Go 32-bit integer overflow

   That's the same count as for armel. I'm assuming they're the same
   packages, but I have not checked!

 * 4 log(s) showing Sbuild build timeout

   Only 4 timeouts compared to the 8 for armel. *Maybe* a sign that
   armhf will be slightly quicker in build time, so less likely to hit
   a timeout? Total guesswork on small-number stats! :-)

Arch-specific failures found for armhf:

  * 11 log(s) showing Alignment problem
  * 4 log(s) showing Segmentation fault
  * 1 log(s) showing Illegal instruction

and the new bugs I filed:

  * 1 bug for arch misdetection
  * 8 bugs for alignment problems
  * 10 bugs for arch-specific test failures
  * 3 bugs for arch-specific misc failures

Again, these small numbers tell me that we're fine. I linked to 139
existing bugs in the BTS here.

Machine configuration
=====================

To be able to support 32-bit builds on arm64 hardware, there are a few
specific hardware support issues to consider:

 * Our 32-bit Arm kernels are configured to fix up userspace alignment
   faults, which hides lazy programming at the cost of a (sometimes
   massive) slowdown in performance when this triggers. The arm64
   kernel *cannot* be configured to do this - if a userspace program
   triggers an alignment exception, it will simply be handed a SIGBUS
   from the kernel. This was one of the main things I was looking for
   here, common to both armel and armhf. In the end, I only found a
   very small number of problems.

   Given that, I think we should *immediately* turn off the alignment
   fixups on our existing 32-bit Arm buildd machines. Let's flush out
   any more problems early, and I don't expect to see many.

   To give credit here: Ubuntu have been using arm64 machines for
   building 32-bit Arm packages for a while now, and have already been
   filing bugs with patches which will have helped reduce this
   problem. Thanks!

 * In theory(!), that's all we should need to worry about for armhf,
   but our armel software baseline needs two additional pieces of
   configuration to make things work, enabling emulation for:

   + SWP (low-level locking primitive, deprecated since ARMv6 AFAIK)
   + CP15 barriers (low-level barrier primitives, deprecated since ARMv7)

   Again, there is quite a performance cost to enabling these but they
   are at least possible!

   In my initial testing for rebuilding armhf only, I did not enable
   either of these. I was then finding *lots* of "Illegal Instruction"
   crashes due to CP15 barrier usage in armhf Haskell and Mono
   programs. This suggests that the baseline architecture in these
   toolchains is incorrectly set to target ARMv6 rather than
   ARMv7. That should be fixed and all those packages rebuilt at some
   point.
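
For anybody setting up a machine like this, a quick way to see how
the two emulation settings are configured is to read the
corresponding sysctl files. The sketch below assumes an arm64 kernel
built with the deprecated-instruction support, which exposes
/proc/sys/abi/swp and /proc/sys/abi/cp15_barrier with values 0
(undef), 1 (emulate) or 2 (hw). The function names are made up for
this example, and it is not part of my buildd scripts.

```python
from pathlib import Path

# Value meanings as exposed by the arm64 kernel's legacy instruction
# support; a missing file is reported as "unavailable".
LEGACY_MODES = {0: "undef", 1: "emulate", 2: "hw"}


def read_int(path):
    """Read an integer from a sysctl-style file, or None on failure."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def legacy_status(root="/proc/sys/abi"):
    """Report the handling mode for SWP and CP15 barrier instructions."""
    status = {}
    for name in ("swp", "cp15_barrier"):
        value = read_int(Path(root) / name)
        status[name] = LEGACY_MODES.get(value, "unavailable")
    return status


def armel_ready(status):
    """True if both instruction classes are emulated or run in hardware."""
    return all(mode in ("emulate", "hw") for mode in status.values())


if __name__ == "__main__":
    status = legacy_status()
    for name, mode in status.items():
        print(f"{name}: {mode}")
    print("OK for armel builds" if armel_ready(status) else
          "armel builds are likely to hit SIGILL")
```

On a machine where either value is reported as "undef" or
"unavailable", armel binaries using those instructions will crash as
described above.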

Bug highlights
==============

 * In the glibc build, we found an arm64 kernel bug (#904385) which
   has since been fixed upstream thanks to Will Deacon at Arm. I've
   backported the fix into 4.9-stable so the fix will be in our
   Stretch kernels soon.

 * There's something really weird happening with Vim (#917859). It
   FTBFS for me with an odd test failure for both armel-on-arm64 and
   armhf-on-arm64 *using sbuild*, but in a porter box chroot or
   directly on my hardware using debuild it works just
   fine. Confusing...

 * I've filed quite a number of bugs over the last few weeks. Many are
   generic new FTBFS reports for old packages that haven't been
   rebuilt in a while, and some of them look unmaintained. However,
   quite a few of my bugs are arch-specific ones in better-maintained
   packages and several have already been fixed.

 * Yesterday, I filed a slew of identical-looking reports for packages
   using MPI and all failing tests. It seems that we have a real
   problem hitting openmpi-based packages across the archive at the
   moment (#918157 in libpmix2). I'm going to verify that on my
   systems shortly.

Thanks
======

I've spent a lot of time looking at existing FTBFS bugs over the last
few weeks, to compare results against what I've been seeing. Much kudos to
people who have been finding and filing those bugs ahead of me, in
particular Adrian Bunk and Matthias Klose who have filed *many* such
bugs. Also thanks to Helmut Grohne for his script to pull down a
summary of FTBFS bugs from UDD - that saved many hours of effort!

Please let me know if you think you've found a problem in what I've
done, or how I've analysed the results here. I still have my machines
set up for easy rebuilds, so reproducing things and testing fixes is
quite easy - just ask!

-- 
Steve McIntyre, Cambridge, UK.                                steve@xxxxxxxxxx
  Armed with "Valor": "Centurion" represents quality of Discipline,
  Honor, Integrity and Loyalty. Now you don't have to be a Caesar to
  concord the digital world while feeling safe and proud.
