Microsoft’s Protocol Documentation Program

From left to right, Wolfgang Grieskamp, Nico Kicillof, and Bob Binder.

In 2002, Microsoft began the difficult process of verifying much of the technical documentation for its Windows communication protocols. The undertaking came about as a consequence of a consent decree Microsoft entered into with the U.S. Department of Justice and several state attorneys general that called for the company to make available certain client-server communication protocols for third-party licensees. A series of RFC-like technical documents were then written for the relevant Windows client-server and server-server communication protocols, but to ensure interoperability Microsoft needed to verify the accuracy and completeness of those documents. From the start, it was clear this wouldn’t be a typical quality assurance (QA) project. First and foremost, a team would be required to test documentation, not software, which is an inversion of the normal QA process; and the documentation in question was extensive, consisting of more than 250 documents, 30,000 pages in all. In addition, the compliance deadlines were tight. To succeed, the Microsoft team would have to find an efficient testing methodology, identify the appropriate technology, and train an army of testers, all within a very short period of time.

This case study considers how the team arrived at an approach to that enormous testing challenge. More specifically, it focuses on one of the testing methodologies used—model-based testing—and the primary challenges that have emerged in adopting that approach for a very large-scale project. Two lead engineers from the Microsoft team and an engineer who played a role in reviewing the Microsoft effort tell the story.

Wolfgang Grieskamp, now with Google, was at the time of this project part of Microsoft’s Windows Server and Cloud Interoperability Group (Winterop), the group charged with testing Microsoft’s protocol documentation and, more generally, with ensuring that Microsoft’s platforms are interoperable with software from the world beyond Microsoft. Previously, Grieskamp was a researcher at Microsoft Research, where he was involved in efforts to develop model-based testing capabilities.

Nico Kicillof, who worked with Grieskamp at Microsoft Research to develop a model-based testing tool called Spec Explorer, continues to guide testing efforts as part of the Winterop group.

Bob Binder is an expert on matters related to the testing of communication protocols. He too has been involved with the Microsoft testing project, having served as a test methodology consultant who also reviewed work performed by teams of testers in China and India.

For this case study, Binder spoke with Kicillof and Grieskamp regarding some of the key challenges they’ve faced over the course of their large-scale testing effort.

BOB BINDER: When you first got involved with the Winterop Team [the group responsible for driving the creation, publication, and QA of the Windows communication protocols], what were some of the key challenges?

NICO KICILLOF: The single greatest challenge was that we were faced with testing protocol documentation rather than protocol software. We had prior expertise in testing software, but this project called for us to define some new processes we could use to test more than 30,000 pages of documentation against existing software implementations already released to the world at large, even in some cases where the original developers were no longer with Microsoft. And that meant the software itself would be the gold standard we would be measuring the documentation against, rather than the other way around. That represented a huge change of perspective.

WOLFGANG GRIESKAMP: What was needed was a new methodology for doing that testing. What’s more, it was a new methodology we needed to apply to a very large set of documents in relatively short order. When you put all that together, it added up to a really big challenge. I mean, coming up with something new is one thing. But then to be faced with immediately applying it to a mission-critical problem and getting a lot of people up to speed just as fast as possible—that was really something.

BINDER: What did these documents contain, and what were they intended to convey?

GRIESKAMP: They’re actually similar to the RFCs (requests for comments) used to describe Internet protocol standards, and they include descriptions of the data messages sent by the protocol over the wire. They also contain descriptions of the protocol behaviors that should surface whenever data is sent; that is, how some internal data states ought to be updated and the sequence in which that is expected to occur. Toward that end, these documents follow a pretty strict template, which is to say they have a very regular structure.

BINDER: How did your testing approach compare with the techniques typically used to verify specifications?

GRIESKAMP: When it comes to testing one of these documents, you end up testing each normative statement contained in the document. That means making sure each testable normative statement conforms to whatever it is the existing Microsoft implementation for that protocol actually does. So if the document says the server should do X, but you find the actual server implementation does Y, there’s obviously a problem.

In our case, for the most part, that would mean we’ve got a problem in the document, since the implementation—right or wrong—has already been out in the field for some time. That’s completely different from the approach typically taken, where you would test the software against the spec before deploying it.
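As an illustration of that comparison, here is a minimal Python sketch. The requirement IDs, the statement texts, and the observed responses are all hypothetical; the only idea taken from the discussion above is that a mismatch is reported against the document, because the shipped implementation is treated as the reference.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NormativeStatement:
    req_id: str      # hypothetical identifier for the statement
    text: str        # the sentence as it appears in the document
    documented: str  # the behavior the document claims

def find_document_issues(statements, observed):
    """Return IDs of statements whose documented behavior differs from what the implementation did."""
    return [s.req_id for s in statements if observed.get(s.req_id) != s.documented]

statements = [
    NormativeStatement("R1", "The server MUST reply with ACK to a valid OPEN.", "ACK"),
    NormativeStatement("R2", "The server MUST reply with ERROR to an unknown command.", "ERROR"),
]
# Responses recorded from the existing implementation (made up for this sketch).
observed = {"R1": "ACK", "R2": "CLOSE"}

issues = find_document_issues(statements, observed)
print(issues)
```

Here the second statement would be flagged for correction in the document rather than in the server.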

BINDER: Generally speaking, a protocol refers to a data-formatting standard and some rules regarding how the messages following those formats ought to be sequenced, but I think the protocols we’re talking about here go a little beyond that. In that context, can you explain more about the protocols involved here?

GRIESKAMP: We’re talking about network communication protocols that apply to traffic sent over network connections. Beyond the data packets themselves, those protocols include many rules governing the interactions between client and server—for example, how the server should respond whenever the client sends the wrong message.

One of the challenges for our project was to make sure the functions performed by Windows servers could also be performed by other servers. Suppose you have a Windows-based server that’s sharing files and a Windows-based client accessing them. That’s all Microsoft infrastructure, so they should be able to talk to each other without any problems. Tests were performed some time ago to make sure of that. But now suppose the server providing the share is running Unix, and a Windows client is running in that same constellation. You still should be able to access the share on the Unix file server in the same way, with the same reliability and quality as if it were a Windows-based file server. In order to accomplish that, however, the Unix-based server would need to follow the same protocol as the Windows-based server. That’s where the challenge tends to get a little more interesting.

KICILLOF: That sets the context for saying something about the conditions under which we had to test. In particular, if you’re accounting for the fact that the Windows server might eventually be replaced by a Unix server, you have to think in terms of black-box testing. We can’t just assume we know how the server is implemented or what its code looks like. Indeed, many of these same tests have been run against non-Microsoft implementations as part of our effort to check for interoperability.

GRIESKAMP: Besides running these tests internally to make sure the Windows server actually behaves the way our documents say it ought to, we also make those same tests available for PlugFests, where licensees who have implemented comparable servers are invited to run the tests against their servers. The goal there is to achieve interoperability, and the most fundamental way to accomplish that is to initiate tests on a client that can basically be run against any arbitrary server in the network, be it a Windows server, a Unix server, or something else.

BINDER: Many of the protocols you’ve tested use the Microsoft remote procedure call stack—in addition to standard protocols such as SOAP and TCP/IP. What types of challenges have you encountered in the course of dealing with these different underlying stacks?

GRIESKAMP: First off, we put the data more or less directly on the wire so we can just bypass some of those layers. For example, there are some layers in the Windows stack that allow you to send data over TCP without establishing a direct TCP connection, but we chose not to use that. Instead, we talk directly to the TCP socket to send and receive messages.

That allows us to navigate around one part of the stack problem. Another issue is that some protocols travel over other protocols—just as TCP, for example, usually travels over IP, which in turn travels over Ethernet. So what we did to account for that was to assume a certain componentization in our testing approach. That allows us to test the protocol just at the level of abstraction we’re concerned with—working on the assumption the underlying transport layers in the stack are behaving just as they ought to be. If we weren’t able to make that assumption, our task would be nearly impossible.
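A rough sketch of what talking directly to a TCP socket can look like, using Python's standard `socket` module. The echo server, the loopback address, and the payload bytes are stand-ins invented for this example; they are not part of any Microsoft protocol or of the team's test harness.

```python
import socket
import threading

def serve_once(listener):
    """Stand-in server: accept one connection and echo one message back."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(conn.recv(1024))

def send_raw(host, port, payload):
    """Open a TCP socket directly, send raw bytes, and return the raw reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
        # A single recv is enough for this tiny message; real code would loop.
        return sock.recv(1024)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

thread = threading.Thread(target=serve_once, args=(listener,))
thread.start()

reply = send_raw("127.0.0.1", port, b"\x01HELLO")
thread.join()
listener.close()
print(reply)
```

The test side controls every byte it sends and sees every byte that comes back, while still assuming that TCP itself behaves correctly.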

Because of the project’s unique constraints, the protocol documentation team needed to find a testing methodology that was an ideal fit for their problem. Early efforts focused on collecting data from real interactions between systems and then filtering that information to compare the behaviors of systems under test with those described in the protocol documentation. The problem with this approach was that it was a bit like boiling the ocean. Astronomical amounts of data had to be collected and sifted through to obtain sufficient information to cover thoroughly all the possible protocol states and behaviors described in the documentation—bearing in mind that this arduous process would then have to be repeated for more than 250 protocols altogether.

Eventually the team, in consultation with the U.S. Technical Committee responsible for overseeing their efforts, began to consider model-based testing. In contrast to traditional forms of testing, model-based testing involves generating automated tests from an accurate model of the system under test. In this case, the system under test would not be an entire software system but rather just the protocols described in the documentation, meaning the team could focus on modeling the protocols’ state and behavior and then target the tests that followed on just those levels of the stack of interest for testing purposes.

A team at Microsoft Research had been experimenting with model-based testing since 2002 and had applied it successfully, albeit on a much smaller scale, to a variety of testing situations, including the testing of protocols for Microsoft’s Web Services implementation. In the course of those initial efforts, the Microsoft Research team had already managed to tackle some of the thorniest concerns, such as the handling of nondeterminism. They also had managed to create a testing tool, Spec Explorer, which would prove to be invaluable to the Winterop team.

BINDER: Please say a little about how you came to settle on model-based testing as an appropriate testing methodology.

GRIESKAMP: In looking at the problem from the outset, it was clear it was going to be something huge that required lots of time and resources. Our challenge was to find a smart technology that would help us achieve quality results while also letting us optimize our use of resources. A number of people, including some of the folks on the Technical Committee, suggested model-based testing as a promising technology we should consider. All of that took place before either Nico or I joined the team.

The team then looked around to find some experts in model-based testing, and it turned out we already had a few in Microsoft Research. That led to some discussions about a few test cases in which model-based testing had been employed and the potential the technology might hold for this particular project. One of those test cases had to do with the SMB (Server Message Block) file-sharing protocol. The results were impressive enough to make people think that perhaps we really should move forward with model-based testing. That’s when some of us with model-based testing experience ended up being brought over from Microsoft Research to help with the validation effort.

KICILLOF: The specific approach to model-based testing we had taken in Microsoft Research was one that proved to be well suited to this particular problem. Using the tool we had created, Spec Explorer, you could produce models of software that specified a set of rules spelling out how the software was expected to behave and how the state was expected to change as a consequence of each potential interaction between the software and its environment. On the basis of that, test cases could then be generated that included not only pre-scripted test sequences but also the oracle, which is a catalog of all the outcomes that might be expected to follow from each step taken.

In this way it was possible to create tests that would allow you to check along the entire sequence to make sure the system was responding in just the ways you expected it to. And that perfectly matches the way communication protocol documents are written, because they’re intended to be interpreted as the rules that govern which messages you should expect to receive, as well as the messages that should then be sent in response.
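The idea of generating test sequences together with their oracle can be sketched in a few lines of Python. This is not Spec Explorer and does not use its modeling language; the four-entry model, the `FakeServer` class, and the depth limit are assumptions made purely for illustration.

```python
# Hypothetical model of a tiny session protocol:
# (state, action) -> (expected response, next state)
MODEL = {
    ("closed", "OPEN"): ("OK", "open"),
    ("closed", "READ"): ("ERROR", "closed"),
    ("open", "READ"): ("DATA", "open"),
    ("open", "CLOSE"): ("OK", "closed"),
}

def generate_tests(state, depth):
    """Enumerate every action sequence of length `depth`, pairing each step with its expected response."""
    if depth == 0:
        return [[]]
    tests = []
    for (src, action), (expected, nxt) in MODEL.items():
        if src == state:
            for tail in generate_tests(nxt, depth - 1):
                tests.append([(action, expected)] + tail)
    return tests

class FakeServer:
    """Stand-in for the system under test; a real run would talk to an actual server."""
    def __init__(self):
        self.is_open = False

    def handle(self, action):
        if action == "OPEN" and not self.is_open:
            self.is_open = True
            return "OK"
        if action == "READ":
            return "DATA" if self.is_open else "ERROR"
        if action == "CLOSE" and self.is_open:
            self.is_open = False
            return "OK"
        return "ERROR"

def run_test(test):
    """Replay one generated sequence and check every step against the oracle."""
    server = FakeServer()
    return all(server.handle(action) == expected for action, expected in test)

tests = generate_tests("closed", 3)
results = [run_test(t) for t in tests]
print(len(tests), all(results))
```

Each generated test carries its expected responses with it, so a failure can be pinned to the exact step where the implementation and the model disagree.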

BINDER: That implies a lot of interesting things. It’s easy enough to say, “We have a model and some support for automating exploration of the model.” But how did you manage to obtain that model in the first place? What was the process involved in going through the fairly dense prose in each one of those protocol documents and then translating all that into a model?

GRIESKAMP: The first step with model-based testing involved extracting normative statements from all those documents. That had to be done manually since it’s not something we’re yet able to automate—and we won’t be able to automate it until computers are able to read and understand natural human language.

The next step involved converting all those normative statements into a “requirement specification,” which is a big table where each of the normative statements has been numbered and all its properties have been described. After that followed another manual step in which a model was created that attempted to exercise and then capture all those requirements. This demanded some higher-level means for measuring so you could make sure you had actually managed to account for all the requirements. For your average protocol, we’re talking here about something on the order of many hundreds of different requirements. In some cases, you might even have many thousands of requirements, so this is a pretty large-scale undertaking.

But the general idea is to go from the document to the requirements, and from there to either a model or a traditional test design—whichever one is consistent with your overall approach.
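A requirement specification of this kind, and the check that a model accounts for every testable entry, might be sketched as follows. The IDs, the statement texts, the `testable` property, and the rule names are all hypothetical; the real tables described above held hundreds or thousands of entries.

```python
# Hypothetical requirement table: each normative statement gets an ID and properties.
requirements = {
    "R1": {"text": "Server MUST acknowledge OPEN.", "testable": True},
    "R2": {"text": "Server MUST reject READ before OPEN.", "testable": True},
    "R3": {"text": "Implementations SHOULD log failures.", "testable": False},
}

# Requirements that each model rule claims to exercise (also hypothetical).
model_rules = {
    "open_rule": {"R1"},
    "read_rule": set(),
}

def uncovered(requirements, model_rules):
    """Return testable requirement IDs that no model rule exercises."""
    covered = set().union(*model_rules.values())
    return sorted(
        req_id
        for req_id, props in requirements.items()
        if props["testable"] and req_id not in covered
    )

missing = uncovered(requirements, model_rules)
print(missing)
```

A non-empty result means the model still has to be extended before test generation is worth running.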

Microsoft encountered challenges because of its choice to adopt model-based testing for the project. On the one hand, the technology and methodology Microsoft Research had developed seemed to fit perfectly with the problem of testing protocol documents. On the other hand, it was an immature technology that presented a steep learning curve. Nonetheless, with the support of the Technical Committee, the team decided to move forward with a plan to quickly develop the technology from Microsoft Research into something suitable for a production-testing environment.

Not surprisingly, this did not prove easy. In addition to the ordinary setbacks that might be expected to crop up with any software engineering project on an extremely tight deadline, the Microsoft protocol documentation team faced the challenge of training hundreds of test developers in China and India on the basics of a new, unfamiliar testing methodology.

Even after they had a cadre of well-trained testers in place, many hurdles still remained. While the tool-engineering team faced the pressure of stabilizing and essentially productizing the Spec Explorer software at breakneck speed, the testing team had to start slogging through hundreds of documents, extracting normative statements, building requirements specifications, and constructing models to generate automated test suites. Although Spec Explorer provides a way to automate tests, there still were several important steps in the process that required human judgment. These areas ended up presenting the team with some of its greatest challenges.

BINDER: How did you manage to convince yourselves you could take several hundred test developers who had virtually no experience in this area and teach them a fairly esoteric technique for translating words into rule systems?

GRIESKAMP: That really was the core risk in terms of taking the model-based testing approach. Until recently, model-based testing technology had been thought of as something that could be applied only by experts, even though it has been applied inside Microsoft for years in many different ways.

Many of the concerns about model-based testing have to do with the learning curve involved, which is admittedly a pretty steep one, but it’s not a particularly high one. That is, it’s a different paradigm that requires a real mental shift, but it’s not really all that complex. So it’s not as though it’s accessible only to engineers with advanced degrees—everybody can do it. But the first time you’re confronted with it, things do look a little unusual.

BINDER: Why is that? What are some of those key differences people have to get accustomed to?

KICILLOF: The basic difference is that a model actually consists of a rule system. So the models we build are made up of rules indicating that under some certain enabling condition, some corresponding update should be performed on state.

From a developer’s perspective, however, a program is never just a set of rules. There’s a control flow they create and have complete control over. A programmer will know exactly what’s to be executed first and what’s then supposed to follow according to the inputs received.

What’s fortuitous in our case is that we’re working from protocol specifications that are themselves sets of rules that let you know, for example, that if you’ve received message A, then you should update your abstract data model and your internal state in a certain way, after which you should issue message B. It doesn’t explain how a protocol flows from that point on. The combination of all those rules is what determines the actual behavior of the protocol. So there was often a direct correspondence between certain statements in each of these technical documents and the kinds of models we’ve had to build. That’s made it really easy to build the models, as well as to check to make sure they’ve been built correctly according to the statements found in the documents.
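To show what a rule system looks like next to ordinary control flow, here is a minimal Python sketch. The message names A, B, C, and D, the state fields, and the `ERROR` reply are assumptions for illustration; each rule has only an enabling condition and a state update, and no rule says what happens next.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    guard: Callable[[dict, str], bool]  # enabling condition on state and incoming message
    update: Callable[[dict], str]       # state update; returns the message to send

def on_a(state):
    state["seen_a"] = True
    return "B"

def on_c(state):
    state["done"] = True
    return "D"

RULES = [
    Rule("receive_A", lambda s, m: m == "A" and not s["seen_a"], on_a),
    Rule("receive_C", lambda s, m: m == "C" and s["seen_a"], on_c),
]

def step(state, message):
    """Fire the first rule whose guard holds for this state and message."""
    for rule in RULES:
        if rule.guard(state, message):
            return rule.update(state)
    return "ERROR"  # assumed reply when no rule is enabled

state = {"seen_a": False, "done": False}
replies = [step(state, m) for m in ["C", "A", "C"]]
print(replies)
```

The same message C is rejected before A has been received and accepted afterward, a sequencing that appears nowhere as explicit control flow and follows only from the combination of the two rules.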

GRIESKAMP: Because this isn’t really all that complex, our greatest concern had to do with just getting people used to a new way of thinking. So to get testers past that initial challenge, we counted a lot on getting a good training program in place. That at first involved hiring people to provide the training for each and every new person our vendors in China and India hired to perform the testing for us. That training covered not only our model-based testing approach, but also some other aspects of the overall methodology.

BINDER: How long did it take for moderately competent developers who had never encountered model-based testing before to get to the point where they could actually be pretty productive?

KICILLOF: On average, I’d say that took close to a month.

BINDER: Once your testers were trained, how did your testing approach evolve? Did you run into any significant problems along the way?

GRIESKAMP: It proved to be a fairly smooth transition since we were just working with concepts that were part of the prototype we had already developed back at Microsoft Research. That said, it actually was just a prototype when this team took it over, so our main challenge was to stabilize the technology. You know how prototypes are—they crash and you end up having to do workarounds and so forth. We’ve had a development team working to improve the tool over the past three years, and thousands of fixes have come out of that.

Another potential issue had to do with something that often crops up in model-based testing: a state-explosion problem. Whenever you model—if you naively define some rules to update your state whenever certain conditions are met and then you just let the thing run—there’s a good chance you’re going to end up getting overrun by all those state updates. For example, when using this tool, if you call for an exploration, that should result in a visualization of the exploration graph that you can then inspect. If you’re not careful, however, you could end up with thousands and thousands of states the system will try to explore for you. There’s just no way you’re going to be able to visualize all of that.

Also, in order to see what’s actually going on, you need to have some way of pruning down the potential state space such that you can slice out those areas you know you’re going to need to test. That’s where one of our biggest challenges was: finding the right way to slice the model.

The idea here was to find the right slicing approach for any given problem, and the tool provides a lot of assistance for accomplishing that. It didn’t come as a surprise to us that this issue of finding the right way to slice the space would end up being a problem—we had expected that. We actually had already added some things to the tool to deal with that, which is probably one of the reasons the project has proved to be a success.
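The effect of slicing on the explored state space can be illustrated with a small Python sketch. The model (a count of open handles and of bytes written), the limits of 8 and 64, and the chosen slice are invented for this example and are unrelated to how Spec Explorer expresses slices.

```python
from collections import deque

def explore(initial, actions, step, keep=lambda state: True, max_states=10_000):
    """Breadth-first exploration of reachable states, restricted to states accepted by `keep`."""
    seen = {initial}
    queue = deque([initial])
    while queue and len(seen) < max_states:
        state = queue.popleft()
        for action in actions:
            nxt = step(state, action)
            if nxt is not None and keep(nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def step(state, action):
    """Hypothetical model: a state is (open handles, bytes written)."""
    handles, written = state
    kind, size = action
    if kind == "open" and handles < 8:
        return (handles + 1, written)
    if kind == "write" and handles > 0 and written + size <= 64:
        return (handles, written + size)
    return None  # action not enabled in this state

# Naive exploration: every write size from 1 to 64 and up to 8 handles.
full_actions = [("open", 0)] + [("write", n) for n in range(1, 65)]
full = explore((0, 0), full_actions, step)

# Slice for one testing goal: a single handle writing the maximum payload.
sliced_actions = [("open", 0), ("write", 64)]
sliced = explore((0, 0), sliced_actions, step, keep=lambda s: s[0] <= 1)

print(len(full), len(sliced))
```

Even this toy model reaches several hundred states when explored naively, while the slice keeps a handful that can be inspected by eye. The price is that everything outside the slice goes untested unless another slice covers it.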

KICILLOF: The secret is to use test purposes as the criterion for slicing.

BINDER: With that being only a subset of all the behaviors you would be looking at in some particular use case?

GRIESKAMP: Right. So that’s why it has to be clear that whenever you’re doing some slicing, you’re cutting away some of the system potential, which means you may lose some test coverage. That’s why this ends up being so challenging. As Nico was saying, however, since the slicing is also closely coupled with your test purposes, you still ought to end up being able to cover all the requirements in your documentation.

KICILLOF: Yes, coupling to test purposes is key because if the slicing were done just according to your use cases, only the most common usage patterns of the system might end up being tested. But that’s not the case here.

Also, throughout the tool chain, we provide complete traceability between the statements taken from the specification and the steps noted in a test log. We have tools that can tell you whether the way you’ve decided to slice the model leaves out any requirements you were intending to test. Then at the end you get a report that tells you whether your slicing proved to be excessive or adequate.
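A simplified picture of that traceability, as a Python sketch: the log format, the test and requirement IDs, and the shape of the report are assumptions and do not reproduce the output of the actual tool chain.

```python
from collections import defaultdict

# Hypothetical test log: each executed step records the requirement IDs it exercised.
test_log = [
    {"test": "T1", "step": 1, "action": "OPEN", "covers": ["R1"]},
    {"test": "T1", "step": 2, "action": "READ", "covers": ["R4"]},
    {"test": "T2", "step": 1, "action": "READ", "covers": ["R2"]},
]
intended = {"R1", "R2", "R3", "R4"}  # requirements this slice was meant to test

def trace(test_log):
    """Map each requirement ID to the (test, step) pairs that exercised it."""
    index = defaultdict(list)
    for entry in test_log:
        for req_id in entry["covers"]:
            index[req_id].append((entry["test"], entry["step"]))
    return dict(index)

def slicing_report(test_log, intended):
    """Report whether the executed tests reached every intended requirement."""
    missed = sorted(intended - set(trace(test_log)))
    return {"adequate": not missed, "missed": missed}

report = slicing_report(test_log, intended)
print(report)
```

In this sketch the slice cut away the behavior behind R3, so the report marks it as inadequate and names the requirement that still needs a test.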

By all accounts, the testing project has been extremely successful in helping ensure that Microsoft’s protocol documents are of sufficiently high quality to satisfy the company’s regulatory obligations related to Windows Client and Windows Server communications. But the effort hasn’t stopped there, as much the same approach has been used to test the protocol documentation for Office, SharePoint Server, SQL Server, and Exchange Server.

This work, done with the goal of providing for interoperability with Microsoft’s high-volume products, was well suited to the model-based testing technology that was productized to support the court-ordered protocol documentation program. Because projects can be scaled by dividing the work into well-defined units with no cross dependencies, the size of a testing project is limited only by the number of available testers. Because of this scalability, projects can also be completed efficiently, which bodes well for the technology’s continued use within Microsoft and beyond. What’s more, Microsoft’s protocol documentation testing effort appears to have had a profound effect on the company’s overall worldview and engineering culture.

BINDER: Within Microsoft, do you see a broader role for the sort of work you’re doing? Or does it pretty much just begin and end with compliance to the court decree?

KICILLOF: It goes beyond the decree. Increasing the interoperability of our products is a worthy goal in and of itself. We’re obviously in a world of heterogeneous technology where customers expect products to interoperate.

That’s also changing the way products are developed. In fact, one of our goals is to improve the way protocols are created inside Microsoft. That involves the way we design protocols, the way we document protocols such that third parties can use them to talk to our products, and the way we check to make sure our documentation is correct.

GRIESKAMP: One aspect of that has to do with the recognition that a more systematic approach to protocol development is needed. For one thing, we currently spend a lot of money on quality assurance, and the fact that we used to create documentation for products after they had already been shipped has much to do with that. So, right there we had an opportunity to save a lot of money.

Specification or model-driven development is one possible approach for optimizing all of this, and we’re already looking into that. The idea is that from each artifact of the development process you can derive documentation, code stubs, and testable specifications that are correct by definition. That way, we won’t end up with all these different independently created artifacts that then have to be pieced together after the fact for testing purposes.

For model-based testing in particular, I think this project serves as a powerful proof point of the efficiencies and economies that can be realized using this technology. That’s because this is by far the largest undertaking in an industrial setting where, within the same project, both traditional testing methodologies and model-based testing have been used. This has created a rare opportunity to draw some side-by-side comparisons of the two.

We have been carefully measuring various metrics throughout, so we can now show empirically how we managed essentially to double our efficiency by using model-based testing. The ability to actually document that is a really big deal.

BINDER: Yes, that’s huge.

GRIESKAMP: There are people in the model-based testing community who have been predicting tenfold gains in efficiency. That might, in fact, be possible if all your users have Ph.D.s or are super adept at model-based testing. But what I think we’ve been able to show is a significant, albeit less dramatic, improvement with a user population made up of normal people who have no background in model-based testing whatsoever. Also, our numbers include all the ramp-up and education time we had to invest to bring our testers up to speed.

Anyway, after accounting for all that plus the time taken to do a document study and accomplish all kinds of other things, we were able to show a 42% reduction in effort when using the model-based testing approach. I think that ought to prove pretty compelling not just for Microsoft’s management but also for a lot of people outside Microsoft.
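As a rough check on what a 42% reduction in effort means in terms of efficiency, assume that efficiency is output per unit of effort and that the same amount of testing is delivered either way. Neither assumption is stated in the interview, and the symbols below are introduced only for this calculation.

```latex
E_{\text{model}} = (1 - 0.42)\,E_{\text{traditional}} = 0.58\,E_{\text{traditional}}
\qquad\Longrightarrow\qquad
\frac{\text{efficiency}_{\text{model}}}{\text{efficiency}_{\text{traditional}}}
  = \frac{E_{\text{traditional}}}{E_{\text{model}}}
  = \frac{1}{0.58} \approx 1.72
```

Under those assumptions, the same work gets done with a little over half the effort, or about 1.7 times the output for a given effort.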