My recent adventures with buildbot, ccache, nginx, and a Debian upgrade

I recently acquired a new computer (AMD 5950X 16-core CPU with 128GB RAM and a 1TB SSD). Nice! Building Octave is much faster now than it was on the older AMD FX-8350 systems I had. So I started thinking that it would be nice to move some buildbot jobs to the new system. Then I started thinking that it would be even better if all the buildbot jobs could be executed faster. So I now have two of the 5950X systems and have moved all the buildbot jobs to the new hardware.

While moving the buildbot jobs to the new systems, I also decided to try to update the configuration on the master. The --enable-octave=stable and --enable-octave=release configurations of MXE Octave for Windows should be done with the “release” branch of MXE Octave and the --enable-octave=default build should be using the “default” branch. The daily MXE Octave builds should now only happen if there are actually changes in either the MXE Octave or Octave sources on the branches relevant to the build.

As seems to happen whenever I touch the buildbot configuration, there were some glitches and moments of confusion. What started out as an attempt to upgrade the buildbot master ended up as an OS upgrade on the Digital Ocean server that is running the buildbot master. That broke the hgweb service because of the way Python packaging works on Debian now. The initial symptom was an nginx error that did not immediately point to an hgweb configuration error or Python issue. That took some time to figure out…

Then there was some confusion due to the way ccache was configured on my buildbot workers. It turns out (and in retrospect, perhaps should have been obvious) that sharing a cache directory among multiple buildbot builds and using the configuration option hash_dir = false is a BAD IDEA. Packages that were built successfully many times before were failing to build because the wrong files were being picked up from the cache. Oops!
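To illustrate why that combination can go wrong, here is a toy model in Python. It is not ccache's actual hashing (the real tool hashes much more, such as the preprocessed source and compiler identity), and the directory names are made up. The only point it shows is that once the working directory is left out of the key, two builds in different trees can map to the same cache entry:

```python
import hashlib


def cache_key(source: str, flags: str, cwd: str, hash_dir: bool) -> str:
    """Toy cache key: hash the source text and flags, plus cwd if hash_dir is set."""
    h = hashlib.sha256()
    h.update(source.encode())
    h.update(b"\0")
    h.update(flags.encode())
    if hash_dir:
        # With hash_dir enabled, the build directory becomes part of the key.
        h.update(b"\0")
        h.update(cwd.encode())
    return h.hexdigest()


src = "int main(void) { return 0; }\n"
flags = "-g -O2"
dir_a = "/build/worker/stable"   # hypothetical build directories
dir_b = "/build/worker/default"

# Same source and flags, different directories.
same_without = cache_key(src, flags, dir_a, False) == cache_key(src, flags, dir_b, False)
same_with = cache_key(src, flags, dir_a, True) == cache_key(src, flags, dir_b, True)

print(same_without)  # True: both builds share one cache entry
print(same_with)     # False: each build directory gets its own entry
```

As far as I understand the ccache documentation, hash_dir defaults to true and mainly matters when debug info is generated, since the object files then embed the build directory.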

After fixing those issues, things seem to be working more or less correctly now and builds are significantly faster, so I hope this will speed up development. It would be nice to get even faster turnaround on the MXE Octave builds and maybe there are some additional things we can do but I’ll open a separate topic for that discussion.

Now I have the “problem” of what to do with the old systems. They may still be useful for a while. I could install other Linux distributions (my systems are all currently running Debian) and use them to run buildbot workers to test secondary targets where quick turnaround is not as important. Suggestions?


IIRC, we spoke about having the test suite run for the Windows builds in the past.
I hope I don’t open Pandora’s box with this question: Is switching one of those machines to Windows an option?
Even if it is, I’m not sure if it would be possible to have it run the tests as a step of the builders.
It might (or might not) be possible to transfer the installer with scp to the Windows machine and log into it via ssh and run the respective commands remotely from the actual builder.

If that works, we could also think about running a couple of pkg test ... commands on that machine to notice runtime regressions or incompatibilities between changes in Octave core and code that is out “in the wild”.
At the moment, we only notice compile time issues with the MXE Octave builders…


All good suggestions. It would definitely be good to be testing the installer and running the test suite for the Windows builds and also to be running tests for the packages.


One thing I forgot to mention is that after the buildbot master upgrade, the web interface was really slow to respond when I used firefox on my Debian desktop. It seemed to respond much better with firefox on an Android device and also using chrome on the Debian desktop. So I’m not sure whether that is something that could or should be fixed in the buildbot web interface plugin thing or firefox.

I noticed the first login was slow, but subsequent logins are “normal”.

The web interface does not seem any slower to me. Maybe right after the upgrade the server was a little busy with internal tasks?

Yeah, I’m not sure what was happening. It is reasonably responsive for me now with firefox on the same system where I originally experienced terribly slow performance.

Also, this morning I noticed some build failures similar to the ones I saw Friday night, which I thought were due to bad ccache results. So I cleared the cache files and started new builds. They made it past the previous problem spots. But I don’t understand why they failed this time: I thought I had already cleared all the cache files, and I had removed the “hash_dir” ccache setting. Hmm. I guess I’ll just have to wait and see whether this problem reappears later. The ccache settings for my buildbot workers are

umask = 002
max_size = 100G
cache_dir_levels = 3

so I don’t know what could cause it to return incorrect files from the cache. If this problem happens again, I’ll try to remember to run a new build with ccache disabled. If that works but building again with ccache enabled then fails, I think that would confirm my guess. I’m not sure how to debug why or how ccache is returning the wrong result (assuming that is the problem).
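The check described above could be scripted roughly like this. It is only a sketch: the helper names are made up, and it assumes the build is started from a clean tree each time. The one ccache-specific piece is the CCACHE_DISABLE environment variable, which makes ccache pass straight through to the real compiler:

```python
import os
import subprocess


def build_env(use_ccache, base_env=None):
    """Return a copy of the environment with ccache enabled or bypassed."""
    env = dict(os.environ if base_env is None else base_env)
    if use_ccache:
        env.pop("CCACHE_DISABLE", None)
    else:
        # Any value works; ccache then just calls the real compiler.
        env["CCACHE_DISABLE"] = "1"
    return env


def run_build(command, use_ccache):
    """Run the build command (a list of arguments); True if it exits with 0."""
    result = subprocess.run(command, env=build_env(use_ccache))
    return result.returncode == 0


def interpret(ok_without_cache, ok_with_cache):
    """Turn the two build outcomes into a rough diagnosis."""
    if ok_without_cache and not ok_with_cache:
        return "consistent with bad cache results"
    if not ok_without_cache:
        return "fails without ccache too; the cache is probably not the cause"
    return "both builds pass; problem not reproduced"
```

Usage would be something like calling run_build twice with whatever command the worker normally runs, once with use_ccache=False and once with use_ccache=True, and passing both results to interpret. A pass without the cache followed by a failure with it would support the bad-cache guess, though it would not explain how the bad entry got there.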

The waterfall view doesn’t show for me with Edge (Chromium) on Windows 10. The navigation menu on the left hand side appears. But the area that would usually show the red and green bars stays blank with a “loading” animation.
It shows fine with Chrome on Android.