I've been forming an opinion for a while that is potentially eyebrow-raising, but that I'm now basically convinced of: cross-system integration testing is not worth it, and in most projects is actually an anti-pattern.
This is potentially surprising, because even in projects that use a good balance of "1000s of unit tests with ~10s/100s of integration tests", it's assumed that, of course, we need cross-system integration tests to show that things "actually work".
However, my experience, based on the last ~4-5 systems I've either built or worked on, is pretty much the opposite: the projects with "showing things actually work" integration tests have not had that investment pay off. So here I'll try to reason about that and articulate why.
Integration Test Terminology
First, a mea culpa that "cross-system" as an adjective for integration testing seems, for traditional usages of the term, very redundant.
E.g. "integration testing" is de facto cross-system. That's the "integration". Obvious.
But I think there is some slight nuance, mostly around sphere of influence.
Let's take a typical web application with four standard layers:
- Frontend HTML/CSS/JS
- JSON REST APIs for the frontend JS/AJAX to call
- Database layer for REST APIs that persists data to MySQL
- Gateway layer for REST APIs that calls 3rd-party systems (e.g. vendors like a credit card API)
Given these four layers, I assert there are two flavors of cross-layer integration testing:
- Intra-system integration testing. This is integrating (as in starting real versions of) the various sub-systems within the overall system I control: real-frontend for layer 1 + real-REST servers for layer 2 + real-MySQL database for layer 3. So, just like the term "integration test" insinuates, I could stand all three of these layers/sub-systems up and do some happy-path testing, e.g. with Selenium tests, and make sure things work well across the stack. These tests make sense to me.
- Inter-/cross-system integration testing. This is integrating the above sub-systems (real-frontend + real-REST + real-MySQL) plus the real-vendor system that the gateway layer, layer 4, talks to. These are the tests I want to avoid.
So, just to highlight, the differentiation between these two flavors is control:
- Intra-system testing == only layers/sub-systems I control
- Cross-system testing == all layers/sub-systems, even those outside my control
What Do You Mean By Control?
So, continuing with defining terminology, by "control" I mean two things:
- Are my interactions with the system completely isolated (isolation), and
- Can I nuke/mutate the systemâs data whenever I want (control)?
Basically, I want to be able to set up exactly the test data I want for a given test (control) and then execute it with assurances that nothing else will muck with my data (isolation).
So, let's evaluate this "do we have control?" criterion against each of our sub-systems:
- For the frontend HTML/CSS/JS, is my test data isolated from other systems and other tests? Yes: each Selenium test has its own browser instance, so its own cookies, local storage, etc., so it's effectively isolated from the others.
- For the REST API, is my test data isolated from other systems and other tests? Yes: REST APIs are typically stateless, so no problem.
- For the MySQL database, is my test data isolated from other systems and other tests? Depends: if I create my very own MySQL database, only run a single test at a time, and nuke the data between every test, then yes, I have data isolation for the tests. (Alternatively, if my app already supports sharding, maybe I don't have to nuke the entire database, and instead each test gets its own shard, but the effect is the same.) Basically, because I own this sub-system, I can do whatever gyrations (dedicated, frequently-nuked database) I need to get isolated data (see the sketch after this list).
- For the vendor API, is my test data isolated from other systems and other tests? Typically not at all. I've worked with great vendors and terrible vendors over the years, but even the best vendors have never had a test/integration system that gave me 100% data isolation. There is basically never a way to reset the data in their integration systems. Here is where it becomes obvious I don't own this sub-system: I can't hack on it to set up the data isolation I need.
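For what it's worth, the "dedicated, frequently-nuked database" gyration for the MySQL layer does not need to be fancy. Here is a minimal JUnit-style sketch of what it might look like, where the app_test schema name, the credentials, and the truncate-everything approach are all assumptions for illustration, not a prescription:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.junit.Before;

public abstract class AbstractDatabaseTest {

  // each developer/CI worker points this at their own dedicated, throw-away MySQL schema
  private static final String TEST_DB_URL = "jdbc:mysql://localhost:3306/app_test";

  @Before
  public void nukeDatabase() throws Exception {
    try (Connection c = DriverManager.getConnection(TEST_DB_URL, "app_test", "app_test");
        Statement s = c.createStatement()) {
      // find every table in our dedicated test schema...
      List<String> tables = new ArrayList<>();
      try (ResultSet rs = s.executeQuery(
          "SELECT table_name FROM information_schema.tables WHERE table_schema = 'app_test'")) {
        while (rs.next()) {
          tables.add(rs.getString(1));
        }
      }
      // ...and truncate them all, so every test starts from a known-empty world
      s.execute("SET FOREIGN_KEY_CHECKS = 0");
      for (String table : tables) {
        s.execute("TRUNCATE TABLE " + table);
      }
      s.execute("SET FOREIGN_KEY_CHECKS = 1");
    }
  }
}

Each test then layers its own "world looks like X" inserts on top of a guaranteed-empty schema, which is exactly the kind of gyration you simply cannot perform against a vendor's shared sandbox.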
Why Does Control Matter?
So, we've established that control is about which sub-systems' persistent data is at your whim. But why does that matter?
Because any automated test that touches a system/sub-system whose data is outside of your control is going to suck.
A baseline, non-negotiable, requirement for effective tests is the ability to control their data.
If the test cannot control and isolate its data, you are risking several bad things:
- Test Flakiness
- Test Complexity
- Test Incompleteness
- System Availability
I'll drill into each.
Test Flakiness
Lack of isolation means it will be very easy for "other things" to break your tests. Where "other things" could be other systems, or internal users just harmlessly poking around, or even your tests themselves.
For example, you set up a test account in your vendor's system. You have ~100s of tests using it, and then the vendor decides to nuke their sandbox database. Your tests now all break.
Or you have ~100s of tests using the vendor's system, and one of your own tests causes it to tip over an inherent system limit (e.g. maybe every test run only creates data, but cannot delete it), and now the test account is locked/broken/egregiously slow.
I have worked with systems that have automated tests that use shared data, and it invariably happens that "something else" pokes the data, and tests break.
When tests break, especially repeatedly for the same core reasons (lack of a solid foundation), developers quickly get frustrated, become disenfranchised, and the tests themselves and your culture around TDD will suffer.
You must have 100% data control and isolation from day one, or you are building on quicksand.
Test Complexity
Besides flakiness, even if you think you can build/hack around the above issues, invariably tests that cannot control their data are just inherently complex.
For example, my gold standard of a test flow is a cadence that looks like:
- Given the world looks like X
- When action Y happens
- Then the world looks like Z
Just to show a concrete example, here is a real-and-slightly-edited test for an offline job (to tangentially highlight/show off that even offline Hadoop/Spark/Pig jobs can have effective, TDD-style test suites):
@Test
public void shouldDropExtraPstData() throws Exception {
  // given one event that happens at 11/3 11pm UTC (so 11/3 PST)
  final FooInputEvent i1 = newFooInputEvent(e -> setEventTime(e, NOV3.atTime(23, 0)));
  // and another event that happens at 11/4 1am UTC (so also 11/3 PST)
  final FooInputEvent i2 = newFooInputEvent(e -> setEventTime(e, NOV4.atTime(1, 0)));
  // when the job runs
  runScript(i1, i2);
  // then we got only 1 combined UTC event out
  assertEquals(outputEvents.size(), 1);
  assertEqualsWithoutHeader(outputEvents.get(0), newFooOutputEvent(0));
}
In cross-system integration tests where "the world" is not in your control, it's impossible to have those "given X" lines at the beginning of your Given/When/Then cadence, so you have to start inventing ways around it, which dramatically increases the complexity of the tests.
Or, instead of inventing ways around it, sometimes cross-system tests just skip the "world looks like X" altogether and assume "the data is already there". Which, for me, is one of the worst things that could happen, as now the input condition is not part of your test method's source code, but is instead implicitly set up out-of-band, e.g. by an engineer clicking around and manually setting up the test case.
There are several repercussions from this:
- Test readability is terrible, because code reviewers and maintainers now have no idea what your "the world looks like X" input data is. They have to guess at what you set up in the shared database.
- The codebase maintainability is terrible, because if a maintainer finds a bug and says "okay, I need a new test with testFooBar's input data, but with flagX set to Y instead of Z", they need to manually copy/paste testFooBar's shared data. This is tedious, and so they basically will not do it.
- The test maintainability itself is terrible, because if the shared data is somehow broken/reset, it's now basically guaranteed that no engineer, even the engineer that wrote the test, knows how to re-create testFooBar's "the world looks like X" input data.
Basically, if you don't have "the world looks like X" directly in your test methods, your tests are terrible.
Note that sometimes cross-system tests will attempt to have "world looks like X" setup routines that still live on top of a shared, not-controlled, not-isolated environment, e.g. by creating a brand new user/account in your vendor's integration system for every test run.
Or you might try to have a clean-up routine that uses some of the vendor's delete APIs to remove what you just created.
And these gyrations are more admirable than the implicit "use data that we set up by hand" approach, but, to me, they're already too far down the slippery slope away from "tests that are actually simple to read and maintain": you've introduced a slew of cognitive overhead without actually solving the fundamental control and isolation issues, which compromises the long-term simplicity, readability, and reliability of your test suite.
Test Incompleteness
Somewhat related to complexity, but the exact opposite, is that without data control and isolation, you risk your tests becoming too simple, and therefore trending towards useless.
For example, a common response to the "we cannot reset the shared data" constraint is to make only read-only or append-only calls/test cases in your cross-system integration tests.
E.g. you might write a test that only reads the same data from the shared database each time. It does no writes/mutations, so no problem, it should never break.
Except now you have entire codepaths (any write/mutation codepath, which is typically where most of the fun/complex logic lives) that are simply untested.
System Availability
I added this point later, as it's not about test data itself, which for me is the key to great tests; but if you don't have data control and isolation over a vendor's sub-system, it's also very likely you can't ensure the overall availability of that sub-system.
E.g. if your cross-system automated tests are part of your CI/deployment pipeline, and your vendor's system goes down for maintenance, what do you do? You can't release? What if your vendor also uses their integration system for beta testing, and regularly pushes out bugs? Their bug means you can't release?
(Granted, there are assertions that this is precisely the point of integration testing, but I'll muse about that later.)
If you limit your integration testing to intra-system sub-systems, you have de facto 100% control over their availability, and so will have the most resilient automated tests.
In summary, my assertion is that if you don't have data control and isolation, your automated tests will be a net negative to the long-term health of your codebase, and you just shouldn't write them.
What Do We Do Instead?
If we give up on cross-system integration testing, i.e. on touching the vendor's real sub-system during our CI/deployment pipeline, we obviously still want some amount of test coverage of our interactions with them. What should that look like?
To me, the best answer is two things:
- A well-defined contract between the two systems, and
- Stub implementations of that contract.
A well-defined contract is basically documenting the vendor sub-system's API in some sort of schema, e.g. protobuf, Thrift, Swagger, etc. It doesn't really matter which one.
The goal should be to give our codebase a very good idea about "is this request well-formed?", since we're no longer going to get this validation from the vendor sub-system itself during our automated tests.
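As a hypothetical illustration of that "well-formed" check (the ChargeRequest class and its fields are invented here; in practice a class like this would come from protobuf/Thrift/Swagger codegen rather than be hand-written):

public class ChargeRequest {

  private final String cardNumber; // required by the contract
  private final int amountCents;   // required by the contract, must be positive

  public ChargeRequest(String cardNumber, int amountCents) {
    // the contract's "is this request well-formed?" rules, enforced before any wire call
    if (cardNumber == null || cardNumber.isEmpty()) {
      throw new IllegalArgumentException("cardNumber is required");
    }
    if (amountCents <= 0) {
      throw new IllegalArgumentException("amountCents must be positive");
    }
    this.cardNumber = cardNumber;
    this.amountCents = amountCents;
  }

  public String getCardNumber() {
    return cardNumber;
  }

  public int getAmountCents() {
    return amountCents;
  }
}

Misspelled fields, missing required values, or wrong types now fail at compile time or at object-construction time, in a plain unit test, with no vendor wire call involved.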
And then once we have that contract, which provides a strong sanity check that "we've asked the vendor system to do X" is a fundamental operation that they support, we can use stubs to flesh out naive/just-enough-to-work behavior to verify "does our system handle the response for request X correctly?"
As a short intro, stubs are basically in-memory versions of vendor sub-systems that you yourself write and maintain. So if your gateway layer has a CreditCardApi interface, you would write your own stub that looks something like:
interface CreditCardApi {
  void charge(String number, int amount);
}

// Real implementation that makes API calls
class VendorCreditCardImpl implements CreditCardApi {
  public void charge(String number, int amount) {
    // make wire call to vendor
  }
}

// Stub implementation that does in-memory logic
class StubCreditCardImpl implements CreditCardApi {
  private List<Account> accounts = ...;
  public void charge(String number, int amount) {
    Account a = findAccount(number);
    a.charges.add(new Charge(amount));
  }
}
I have an older, longer post on mocks vs. stubs, that goes into more details, but focusing on the cross-system testing aspect, there are several pros/cons:
- Pro: We have achieved data isolation, as only our test will use the StubCreditCardImpl's internal state.
- Pro: Our tests should become easier to read, because we now control the "world looks like X" step, as we can fabricate StubCreditCardImpl's internal state to be whatever we need. Do we need a cancelled account? An account with a zero balance? These scenarios are now easy to produce on the fly, in code. For example:

  void shouldHandleEmptyAccount() {
    // Given our user is out of cash
    stubCards.put("a1", new Account().setRemaining(0.00));
    // When our system tries to charge that amount
    system.doThings();
    // Then we handle the rejection
    assertThat(...);
  }
- Pro: Again, see the article on mocks vs. stubs for more details, but particularly for integration tests, stubs are much more usable than mocks because they are stateful. This means they can more naturally handle a longer-running, integration-style test like "start out with no accounts", "load the first page and make an account", "now see the account get returned", which is typically very tricky/verbose to set up with mocks (see the sketch after this list). (Mocks are more suited for in-the-small unit testing, although pedantically stubs can work well there as well.)
- Con: We have to write the StubCreditCardImpl ourselves. I have another post about vendor services providing stubs out of the box, but unfortunately that rarely happens. So, you have to weigh the ROI of "investment in making stubs" vs. the "return of a super-reliable test pipeline".
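To show what "stateful" buys you in the longer-running flow above, here is a minimal, self-contained sketch; the AccountApi interface and its two methods are invented for illustration and are not meant to be any real vendor contract:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;
import org.junit.Test;

public class StatefulStubTest {

  // hypothetical vendor contract
  interface AccountApi {
    void createAccount(String name);
    List<String> listAccounts();
  }

  // stateful in-memory stub; each test news up its own instance, so isolation comes for free
  static class StubAccountApi implements AccountApi {
    private final List<String> accounts = new ArrayList<>();

    @Override
    public void createAccount(String name) {
      accounts.add(name);
    }

    @Override
    public List<String> listAccounts() {
      return new ArrayList<>(accounts);
    }
  }

  @Test
  public void accountShowsUpAfterBeingCreated() {
    AccountApi api = new StubAccountApi();
    // given we start out with no accounts
    assertTrue(api.listAccounts().isEmpty());
    // when the "first page" flow makes an account
    api.createAccount("bob");
    // then the account gets returned on the next read
    assertEquals(1, api.listAccounts().size());
    assertEquals("bob", api.listAccounts().get(0));
  }
}

The second and third steps only work because the stub remembered the first one; with mocks you would instead be scripting a separate canned return value for each call.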
So, if you put these two things together, a strong contract + good-enough stubs, I assert you'll get 80% of the benefit of cross-system integration tests, with 20% (or less) of the pain and suffering that comes from not having data control and isolation.
But That Doesn't Test If It Really Works?
The biggest objection to this approach is that we don't really know if our cross-system interaction works, e.g. what if we try to charge $1000, and the vendor's API blows up any time the amount is greater than $900?
Our testing approach of a documented schema/contract + stubs will never catch this!
That is true, but I have two rationalizations: the first is syntax vs. semantics, and the second is compensating with quality and agility.
Vendor API Syntax vs. API Semantics
To me, you can think of our vendor requests as having two aspects:
- Stateless "syntax", e.g. is this a well-formed request?
- Stateful "semantics", e.g. does this specific request work given this specific state?
Per the previous section, the first aspect, the syntax, we have ideally covered at compile/build time by using a strongly-typed schema/contract. So we're good there.
For the second, my articulation of this aspect, semantics, is "do the contents of our requests, given various 'world looks like X' input conditions, pass the business rules of the vendor system?"
When worded like this, what we're basically asking is, "will all of our requests pass all of the semantic/business validation rules of the vendor's system?"
I have an interesting rationalization, which admittedly borders on a dodge: it's very unlikely that we're going to really cover all of the business rules.
For example, they might have 100 business rules around saving a credit card. Name checks, address checks, credit checks, etc. Are we really going to test every single one? Probably not. Even if we wanted to, our cross-system integration suite would grow into the 1000s of tests (which I have seen before) and become a nightmare in terms of runtime.
Also, even if we thought we could cover all of them, given they are a vendor API, they can make their own releases, and add more business rules for saving a credit card, anytime they want, without telling us. (Granted, vendors often communicate breaking changes, but those are syntax changes, e.g. "we removed field X", and it's very easy and common to slip in new semantic rules, e.g. "existing field X now must pass business rule Y", without communication or a grace period.)
So, given we won't cover all business rules, and they can add new business rules at any point in time, what is the best we can do?
We can:
- Make sure we handle success, and
- Make sure we handle failure, which is any business rule failing.
As long as we handle these two scenarios, the happy path and a general failure case, our system should basically work. And we don't need to test, as in a cross-system, makes-a-wire-call integration test, each individual semantic variation of "it could return card not found" or "it could return bad address" on every single test run.
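Here is a hedged sketch of what "handle the happy path and a general failure" can look like with stubs, reusing the CreditCardApi shape from earlier; the ChargeDeclinedException, the insufficient-balance rule, and the tiny CheckoutService are all invented for illustration:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ChargeFailureHandlingTest {

  // hypothetical "the vendor said no" failure
  static class ChargeDeclinedException extends RuntimeException {
  }

  interface CreditCardApi {
    void charge(String number, int amount);
  }

  // stub with just enough vendor semantics to produce a decline
  static class StubCreditCardApi implements CreditCardApi {
    private int balance = 5;

    @Override
    public void charge(String number, int amount) {
      if (amount > balance) {
        throw new ChargeDeclinedException();
      }
      balance -= amount;
    }
  }

  // a tiny stand-in for "our system": it must turn any vendor failure into a sane outcome
  static class CheckoutService {
    private final CreditCardApi cards;

    CheckoutService(CreditCardApi cards) {
      this.cards = cards;
    }

    String checkout(String cardNumber, int amount) {
      try {
        cards.charge(cardNumber, amount);
        return "PAID";
      } catch (ChargeDeclinedException e) {
        return "DECLINED";
      }
    }
  }

  @Test
  public void handlesTheGeneralFailureCaseAndTheHappyPath() {
    CheckoutService checkout = new CheckoutService(new StubCreditCardApi());
    // a $10 charge when the balance is $5 fails, and our system copes
    assertEquals("DECLINED", checkout.checkout("4111", 10));
    // the happy path still works
    assertEquals("PAID", checkout.checkout("4111", 5));
  }
}

Which of the vendor's 100 business rules actually tripped the decline is irrelevant to this test; what matters is that our side has a well-exercised path for "the vendor said no".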
Note that I don't mean we should just throw random data at the vendor API. Obviously our system has to have its own miniature mental model of the vendor's expected semantics, e.g. "a $10 charge when the balance is $5 will fail".
We need to build that mental model of the vendor's expected behavior into our system, and we do, but we can assert that we play within those semantic boundaries with our unit tests and our stub-based integration tests. We don't need physical wire-call verification of them every single time, because they will rarely change.
And if they do change, we will find out quickly, adapt our unit and/or stub-based tests to encode that new aspect of their semantic mental model, and continue. Which is basically the next point.
Compensating with Quality and Agility
So, maybe you don't buy my argument that "you won't test all the semantics anyway, so don't bother". You want to try and still have some "no really, see, it works" cross-system integration tests.
The closing argument, for me, which is based on the systems that I've personally worked on, is that the systems without cross-system integration tests end up being higher quality in the long run.
This is because:
- Their test suites use the given/when/then "world looks like X" convention, so copy/pasting a "world looks like X" test into a "world looks like Y" test is very quick and simple, which makes reproducing production bugs very quick. Your testing infrastructure is already there and easy to use. In a cross-system environment, per previous sections, reproducing a production bug with a slight variation of an existing test's data is tedious and time-consuming.
- Their build pipelines are typically super-fast and super-reliable, so after the production issue is reproduced and fixed, it's very quick to get a release out. In a cross-system environment, cross-system integration tests typically make more/slower wire calls, which, coupled with their inherent flakiness, either slows down (as in wall-clock time) or destabilizes (as in false negatives and ensuing manual investigation/reruns) the release process.
- Their tests, with predominantly unit tests and some intra-system integration tests, are fast and stable enough that developers feel empowered to ruthlessly refactor, maintaining a high quality bar, which over time means production bugs stay quick to diagnose and fix. (Admittedly, intra-system integration tests make wire calls as well, so this point, for me, is based more on observation and correlation than direct causation.)
The combination of these two factors (API syntax vs. semantics, and compensating with quality and agility) is why I feel confident, for the projects I'm leading and personally responsible for, having no cross-system integration tests.
Who is a Vendor?
So, hopefully you're following my logic so far and basically in agreement that, sure, if our in-house system integrates with credit card Vendor X (e.g. Stripe or First Data or whoever), we should not couple our test suite and release process to their shared sandbox/integration environment. That's just asking for trouble.
But my next musing, which is admittedly more/only applicable to larger companies, is: who do you treat as a vendor?
In the example used so far, the external credit card processing vendor is a pretty obvious place to draw the line around "a vendor".
But what about an internal team that you're working with for a new project? They're providing a FancyFooService API to you. Are they a vendor?
I think the traditional temptation is to say no, they're not a vendor, they're an internal team, and we can easily access their systems in our dev/test environment, so let's make really sure our flows work and write integration tests that make real wire calls to their FancyFooService implementation.
Unfortunately, even though the team/their sub-system is internal, this rarely means you actually have control of their sub-system, using our definition of control from earlier.
Specifically: is your test data isolated from other systems and other tests?
If this is an internal shared environment (e.g. every system in the company runs in kind of a shadow-production environment), the answer is no, your test data is shared with all of the other systems and all of the other tests.
You don't have control or isolation.
So, all of my assertions so far about test complexity and flakiness and system availability, which hopefully led to an obvious conclusion of "of course that applies to the external vendor", now also apply to the internal, in-house "vendor".
This system has effectively become a vendor to you.
Which means you should not write any automated tests against it.
Making Life Easier
I have made a lot of points, but besides just pontificating about "throw out all your cross-system tests", I wanted to offer some suggestions to alleviate the pain.
Specifically, if you adopt a stubs-based approach to intra-system testing, I think there are three things that can make life easier:
- Use a unified RPC framework across the entire company
- Prefer noun-based RPC frameworks
- Services should ship their own stubs
Unified RPC Framework
When you're writing infrastructure to stub (or mock if you must) vendor systems out, you'll get a much higher ROI if you can do that just once.
E.g. instead of writing infrastructure for faking Thrift for some vendor sub-systems, and infrastructure for faking GRPC for some vendor sub-systems, and infrastructure for faking SOAP for others, and infrastructure for faking bespoke JSON/HTTP for others, you can write a single infrastructure and invest heavily in making it powerful and easy to use.
Obviously Google has done this by using "protobuf everywhere" internally. At LinkedIn, we use "Rest.li everywhere" internally.
If you're in a larger company that is more than ~10 years old, it's very unlikely that you have this, and admittedly obtaining it is going to be a nightmare. Unfortunately, I have no good ideas for you.
But if you're starting a new company, then choose one and only one up front.
Prefer noun-based RPC frameworks
If you have the luxury of choosing an RPC framework, I've found noun-based systems to work better than verb-based systems.
Specifically, something like GRPC is a verb-based system, where you make up new verbs like CreateAccount, MakeDeposit, DeleteAccount. All of these are just method calls, and GRPC handles the serialization/etc. for you. Which is fine.
But noun-based systems, e.g. REST (props to Rest.li) and GraphQL, have a fixed set of verbs, e.g. PUT, GET, UPDATE, and then N nouns, e.g. Account, Deposit, etc.
I can't believe it's from 2006, but see Yegge's post Execution in the Kingdom of Nouns for more information.
But, specifically for building infrastructure to fake cross-system integration tests, if you have a noun-based architecture, it becomes really nice to set up cross-sub-system data in a generic way.
E.g. all a test has to do is create the nouns it wants, "make a new Account that looks like this", "make a new Deposit that looks like that", and then defer to your generalized infrastructure to put each noun into its corresponding stub.
A real-but-edited example is:
public void testWithOneUser() {
  // given a single user
  user = new UserBuilder();
  account = new AccountBuilder().withAdmin(user);
  system.setup(user, account);
  // when
  ...
}
Where the User noun and Account noun are from two completely separate "vendor" sub-systems, e.g. the user API which is owned by internal team X, and the account API which is owned by internal team Y.
The fact that our inter-system RPC is all noun-based has two great outcomes:
- The "noun looks like X" is basically verbatim what my best-practice "given" section looks like for "world looks like X", and
- Once the test infrastructure supports the fixed set of verbs, e.g. GET, PUT, etc., it can scale really well to N nouns (see the sketch below).
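A minimal sketch of what that generic "put each noun into its corresponding stub" infrastructure might look like; the StubWorld name and the keyed-by-noun-type registry are assumptions for illustration, not the real system.setup(...) from the example above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StubWorld {

  // one in-memory "table" of nouns per noun type, standing in for each sub-system's stub
  private final Map<Class<?>, List<Object>> stubsByNounType = new HashMap<>();

  // the generic "PUT" verb: file each noun into whichever stub owns its type
  public void setup(Object... nouns) {
    for (Object noun : nouns) {
      stubsByNounType
          .computeIfAbsent(noun.getClass(), type -> new ArrayList<>())
          .add(noun);
    }
  }

  // the generic "GET" verb: read all nouns of a given type back out
  @SuppressWarnings("unchecked")
  public <T> List<T> getAll(Class<T> nounType) {
    return (List<T>) stubsByNounType.getOrDefault(nounType, new ArrayList<>());
  }
}

Because the verbs are fixed, a brand-new internal "vendor" with a brand-new noun type needs no new test infrastructure, just a new builder for its noun.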
Services should ship their own stubs
One final thing that would make life easier for adopting a stub-based approach to fake cross-system integration tests would be if you, as a consumer of the vendor API, didn't have to write your own stubs in the first place.
I wrote about this back in 2013 (although, sheesh, 2006, Yegge's still got me), but to me it makes a lot of sense for owners of the vendor API to write their own stubs, and provide them to their users as a convenience.
This has two benefits:
- It amortizes the cost of writing the stubs by having it done once, and then shared across N downstream users
- The vendor API owners are the ones who best know their API, so can make their stubs match (within reason) the production run-time behavior.
For a concrete example, I think it makes a ton of sense for AWS to ship a stubbed version of its aws-java-sdk. They already have interfaces for S3Client, DynamoDBClient, etc. How many of their users could write super-effective test suites by using an in-memory StubS3Client or StubDynamoDBClient that did a best-effort mimicking of production?
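To make the idea concrete, here is a hedged sketch of what such a stub could look like; the SimpleS3 interface below is an invented, trimmed-down slice (the real S3 client interfaces are of course far larger), so this is only a shape, not the actual aws-java-sdk API:

import java.util.HashMap;
import java.util.Map;

// invented, trimmed-down slice of an S3-style client, purely for illustration
interface SimpleS3 {
  void putObject(String bucket, String key, byte[] contents);
  byte[] getObject(String bucket, String key);
}

// in-memory, best-effort mimicking of production behavior
class StubS3 implements SimpleS3 {
  private final Map<String, byte[]> objects = new HashMap<>();

  @Override
  public void putObject(String bucket, String key, byte[] contents) {
    objects.put(bucket + "/" + key, contents.clone());
  }

  @Override
  public byte[] getObject(String bucket, String key) {
    byte[] found = objects.get(bucket + "/" + key);
    if (found == null) {
      // mimic the real service's "no such key" failure mode, at least in spirit
      throw new IllegalStateException("NoSuchKey: " + bucket + "/" + key);
    }
    return found.clone();
  }
}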
I think it'd be incredibly useful. So much so that at Bizo, we started doing this by writing/open-sourcing our own aws-java-sdk-stubs library, which we used liberally internally, but which unfortunately has not seen a ton of adoption otherwise.
One admitted wrinkle is that typically stubs are written as in-process/in-memory implementations, e.g. "just use an in-memory map" in whatever your primary language is. And for a vendor like AWS that supports N languages, it would suck to provide N stub API implementations. One solution is to still write a local/fake version, but use wire calls to cross process/language boundaries, which is exactly what AWS does with DynamoDB Local and I'd done for the old-school SimpleDB.
Conclusion
Wrapping up, this is probably my longest post to date, but I've tried to comprehensively address the topic, as I've been thinking about it a lot lately.
Well, shoot, I don't really want to add another section for this, but I am fully supportive of developers being able to touch a vendor's shared integration system, or even touch the vendor's production system, to do initial "how does this really work?" exploration, troubleshooting, and debugging.
And having this sort of "debug-on-demand" cross-system/cross-vendor infrastructure to make exploration and debugging easy is a good, worthwhile investment.
But my main point, which hopefully I've articulated well and convinced you of, is that cross-system, cross-vendor infrastructure should not bleed over into automated tests; without the fundamental first principles of data control and data isolation, I assert you're building on a shaky foundation, and your effort is better spent elsewhere.
(Update July 2018: The Google Testing blog has a post, Just Say No to More End-to-End Tests, which covers some of the common failure scenarios for end-to-end tests, and points out how slow their feedback loop is (such that they become cascading failures of "rarely is everything green"), albeit they end up recommending the same "testing pyramid" approach, vs. my more draconian stance.)