Date:      Fri, 30 Aug 2019 09:25:22 -0700
From:      Enji Cooper <yaneurabeya@gmail.com>
To:        Li-Wen Hsu <lwhsu@freebsd.org>
Cc:        fcp@freebsd.org, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: FCP 20190401-ci_policy: CI policy
Message-ID:  <339B7A20-F88D-4F60-B133-612189663272@gmail.com>
In-Reply-To: <CAKBkRUwKKPKwRvUs00ja0+G9vCBB1pKhv6zBS-F-hb=pqMzSxQ@mail.gmail.com>
References:  <CAKBkRUwKKPKwRvUs00ja0+G9vCBB1pKhv6zBS-F-hb=pqMzSxQ@mail.gmail.com>


> On Aug 27, 2019, at 21:29, Li-Wen Hsu <lwhsu@freebsd.org> wrote:
>
> It seems I was doing it wrong: I changed the content of this FCP
> to "feedback", but did not send it to the right mailing lists.
>
> So I would like to announce that the FCP
> 20190401-ci_policy "CI policy":
>
> https://github.com/freebsd/fcp/blob/master/fcp-20190401-ci_policy.md
>
> is officially in the "feedback" state, to hopefully receive more comments
> and suggestions, so we can move on to the next FCP state.

First off, thank you Li-Wen and Kristof for spearheading this proposal; it’s a very contentious topic with a lot of strong emotions associated with it.

As the person who has integrated a number of tests and helped manage them for a few years (along with some of the care and feeding associated with them), I can say this task is non-trivial, particularly when issues I filed in Bugzilla are not fixed quickly and linger in the tree for some time, impacting a lot of folks who rely on build and test suite stability.

The issue, as I see it from a CI/release perspective, is that the new policy attempts to define a notion of “stable”, in terms of both tests and other code; right now, stability is defined on an honor-system basis, with the FreeBSD test suite serving as a litmus test of sorts to convey a sense of stability.

======

One thing that I don’t see in the proposal is the health of the “make tinderbox” target in a CI world (this is a gap in our current CI process).

Another thing that I don’t see in the proposal is the health of head vs stable and how it relates to MFCs. I see a lot more issues on stable branches going unfixed for some time, in part because some fixes or enhancements haven’t been MFCed. Part of the problem I see these days is a human/resource problem: if developers can’t test their changes easily, they don’t MFC them.

This issue has caused me to do a fair amount of triage in the past when backporting changes, hunting down potentially missing puzzle pieces in order to make my tests and code work.

======

The big issues, as I see them based on the discussion that has taken place in the thread, are revert timing and etiquette, and dealing with unreliable tests.

First off, revert timing and etiquette: while I see the FCP as an initial framework, I am a bit concerned with the heavy-handedness of “what constitutes needing reversion”: should reversion happen only after N consecutive failures (be they build or test) in a certain period? Furthermore, why is a human involved in making this decision (apart from, perhaps, a technical solution via automation not being available yet)?

Second off, unreliable tests:

* Unreliable tests need to be qualified not based on a single run, but on a pattern of runs.

The way this worked at Facebook: if a test failed, the system would rerun it multiple times (10 in total, IIRC). If the test consistently failed on a build, it would be automatically disabled, and all committers in the revision range would be nagged as part of disabling it. This generally works because of the siloization of Facebook’s components, but it is a much harder problem to solve in FreeBSD, because FreeBSD is a complete OS distribution and sometimes small, seemingly disconnected changes can cause a lot of grief.

So what to do?

I suggest expanding the executors and running individual suites instead of the whole batch of tests. While it wouldn’t fix everything and would be expensive with our current test infrastructure, it would allow folks to better pinpoint issues and still get some level of coverage, as opposed to throwing all of test execution out like a baby with the bath water.

How do we get there?
- Expand the CI executor pool.
- Provide a tool or process with which we can define test suites.
- Make spinning up executors faster: with virtual machines this is typically done using Big Iron infrastructure clusters (e.g., ESXi clusters) and something like thin provisioning, where one starts from a common image/snapshot instead of taking the hit of copying images around. Linux can do this with btrfs; we can do this with ZFS using per-VM datasets, snapshotting, etc.

While this only gets part of the way to a potential solution, it is a good way to begin solving the isolation/execution problem.

* A number of tests in the tree have varying quality/reliability; I agree that system-level tests (of which the pf tests are one of many) are less reliable than unit/API functional tests. This is the nature of the beast with testing.

The core issue I see with the test suite as it stands is that it mixes integration/system-level tests (less deterministic) with functional/unit tests (generally more deterministic).

Using mock frameworks would be a good technical solution for turning system tests into functional/unit tests (googlemock and unittest.mock are two of many good tools I know of in this area), but we need a way to run both kinds.

I can now see why labeling test types was a concern when I first started this work (des@/phk@ aired this concern).

Part of the technical/procedural solution to allowing the commingling of tests is to go back and label the tests appropriately. I’ll send out an FCP for this sometime in the next week or two.

======

Taking a step back, as others have brought up, we’re currently hindered by tooling: we are applying a DVCS-based (git, hg) technique (CI) to Subversion, and testing changes after they’ve hit head instead of before they hit head.

While Phabricator can partially solve this by testing up front (we don’t enforce this; I’ve made my concerns about this not being a requirement well known in the past), the solution is limited by testing bandwidth: testing is an all-or-nothing exercise right now, and building multiple toolchains/architectures takes a considerable amount of time. We could leverage cloud/distributed solutions for this (Cirrus CI, or Travis if the integration existed), but that would require using GitHub, or teaching a tool how to make the appropriate REST API calls to run the tests and query the status (in progress, pass, fail, etc.).

Applying labels and filtering on test suites will get us partway to a final solution from a test perspective, but a lot of work needs to be done with Phabricator, etc.

We also need to make GENERIC build failures on tier 1 architectures a commit-blocking event. Full stop.

======

While some of the thoughts I put down aren’t complete solutions, I have subproposals for things that should be done, or could be worked on, before implementing the proposed CI policy. Some of the things I brought up above


While I can’t work on this now, the December break is coming up, and with it I’ll have more time to work on projects like this. I’ll put down some TODO items so I can look at tackling them during the break.

Thank you,
-Enji


