Rocket Powered: A Month in Selenium

I realise that this blog has been pretty quiet. Part of the reason for that is that I'm terrible at sitting down and just writing. What I really need is an incentive. That incentive arrived this month as the Selenium Fellowship, a stipend to fund work hacking on Selenium. Part of the agreement is a monthly blog post. So, you all have the Software Freedom Conservancy to thank :)

So, what contributions have I been making to the Selenium project this month?

There are two major highlights. The first of these is Selenium Conf, which was in Berlin. I gave the State of the Union keynote (so called because the first one was an update of how the merger of the Selenium and WebDriver projects was going). Over the past few Selenium Conferences, the theme has slowly been building that Open Source Software depends on people to move it forward. This time, the message was far starker, as I counted the number of people who contribute to key parts of the project: for some pieces, we depend on one person alone. I also covered the various moving pieces in the project, using Kent Beck's "3X" model as a framework to hold the talk together.

As well as being part of the show at SeConf, I also had the pleasure of helping out Jim Evans in the "Fix a Bug, Become a Committer" workshop. He did a great job explaining how the pieces fit together, and by the end of the workshop, we had everyone building Selenium and running tests in their IDEs of choice (provided that choice wasn't "Eclipse"), which is a testament to the hard work he'd put into preparing the session. It did highlight that the "getting started" docs probably need a bit of a polish to become usable. I was also invited to do a Q&A with the folks in the "Selenium Grid" workshop, where I broke from theme to talk about the role of QA in a team. Thanks for being patient, everyone!

In terms of code, as I write this, I've landed 57 commits since September 17th. Part of this was to help shape the 3.6 release. For Java, the theme of this release was the slow deprecation of the amorphous blob of data that is "DesiredCapabilities" in favour of the more strongly-typed "*Options" classes (e.g. FirefoxOptions and ChromeOptions). The idea behind the original WebDriver APIs was to lead people in the right direction: if they could hit the "autocomplete" keyboard combination in their IDE of choice, then they'd be able to figure out what to do next. The strong typing is a continuation of this concept, and is something that all the main contributors are fans of.

One implementation decision we made in the Java tree is that each of the Options classes is also a Capabilities instance. I made this choice for two reasons. The first is philosophical. We don't know ahead of time what new features will land in browsers (headless running for Chrome and Firefox is an example), so we'll always need an "escape hatch" to allow people to set additional settings and capabilities we're not aware of. The second is pragmatic. The internals of Selenium's Java code are set up to deal with Capabilities, and people extending the framework have been dealing with them as an implicit contract of the code.

In the wild, there are two major, and one very minor, "dialects" of the JSON-based protocol spoken by the various implementations. The first is the original "JSON Wire Protocol", and the second is the version of that protocol that has been standardised as part of the W3C "WebDriver" specification. We took pains when standardising to make sure that a JSON Wire Protocol response is almost always a valid W3C response (technical note: because all values are returned as a JSON Object with a "value" entry, which contains the return value), but there are two areas where the dialects diverge wildly.
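To make the technical note concrete, here is a minimal Python sketch of a client reading a command result from either dialect. The response bodies are illustrative examples rather than captured traffic, and `extract_value` is a hypothetical helper name, not part of any Selenium binding.

```python
import json


def extract_value(raw_response: str):
    """Return the command result from a response body in either dialect."""
    body = json.loads(raw_response)
    # Both dialects wrap the result in a JSON object with a "value" entry,
    # so a reader that only looks at "value" copes with both.
    return body["value"]


# Illustrative JSON Wire Protocol style body: extra "sessionId" and "status" keys.
legacy_body = '{"sessionId": "abc123", "status": 0, "value": "Example Domain"}'
# Illustrative W3C style body: just the "value" entry.
w3c_body = '{"value": "Example Domain"}'

legacy_result = extract_value(legacy_body)
w3c_result = extract_value(w3c_body)
```

A client written this way simply ignores the extra keys that the older dialect sends.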

One area is around the "Advanced User Interactions" APIs. The end point offered by the W3C spec is significantly more flexible and nifty than the original version in the Selenium project, but it is also a lot more complex to implement.

The other area is around "New Session", which is the command used to create a new Selenium session. The JSON Wire Protocol demands that the user place the set of features that they're interested in using into a "desiredCapabilities" JSON blob. This was originally designed as part of a "resource acquisition is initialisation" pattern: you'd load up the blob with everything you might want (a Chrome profile, an equivalent Firefox profile, the proxy you'd like to use), mashing together items that theoretically only belonged to one browser into a single unit. The remote end was then to do a "best effort" attempt to meet those requirements, and then report back what it had provided. The local end (the driver code) was then to test whether or not the returned driver was suitable for whatever it was that users wanted to do. This is why they were called desired capabilities: you made a wish, and then could look to see if it came true. If nothing matched, it was legit for a Selenium implementation to just start up any driver and give you that.
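As a rough illustration of the "make a wish, then check" flow, here is a Python sketch. The payload values are made up, the returned capabilities are a pretend reply from a remote end, and `unmet_wishes` is a hypothetical helper rather than anything in the Selenium code base.

```python
# Illustrative JSON Wire Protocol "New Session" payload: one blob that mixes
# settings for several browsers. All values here are invented for the example.
new_session_payload = {
    "desiredCapabilities": {
        "browserName": "firefox",
        "proxy": {"proxyType": "manual", "httpProxy": "proxy.example.com:8080"},
        "chromeOptions": {"args": ["--start-maximized"]},
        "firefox_profile": "<base64 encoded profile>",
    }
}


def unmet_wishes(desired: dict, returned: dict, keys: list) -> list:
    """Hypothetical local-end check: which of the listed wishes did not come true?"""
    return [key for key in keys if desired.get(key) != returned.get(key)]


desired = new_session_payload["desiredCapabilities"]

# Pretend the remote end's "best effort" was to start Chrome with the proxy set.
returned = {"browserName": "chrome", "proxy": desired["proxy"]}

missing = unmet_wishes(desired, returned, ["browserName", "proxy"])
```

Here `missing` names the browser wish that was not granted, and it is up to the local end to decide whether that matters.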

The W3C protocol is a lot more structured. It provides for an ordered series of matches that can be made, with capabilities that must be present in all cases. For our example above, the proxy would be used for any driver, and then there'd be an ordered set of possible matches for chrome and then firefox (or vice versa). Each driver provider gets a chance to fulfill that request, and if it can, then we use that driver. If nothing matches, then we fail to initialise the session and return an exception to the users.
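The example above (a proxy for every driver, then Chrome, then Firefox) might look like this as a W3C payload. The sketch below follows my reading of the spec's "alwaysMatch" and "firstMatch" keys; the function names and the idea of matching on "browserName" alone are simplifying assumptions, not how the Selenium Server does it.

```python
# Illustrative W3C "New Session" payload. The proxy values are invented.
payload = {
    "capabilities": {
        "alwaysMatch": {
            "proxy": {"proxyType": "manual", "httpProxy": "proxy.example.com:8080"}
        },
        "firstMatch": [
            {"browserName": "chrome"},
            {"browserName": "firefox"},
        ],
    }
}


def candidate_capabilities(new_session: dict) -> list:
    """Merge "alwaysMatch" into each "firstMatch" entry, keeping the order."""
    caps = new_session["capabilities"]
    always = caps.get("alwaysMatch", {})
    first = caps.get("firstMatch", [{}])  # a missing list acts like one empty entry
    merged = []
    for option in first:
        overlap = set(always) & set(option)
        if overlap:
            # The same key may not appear in both places.
            raise ValueError(f"capability set twice: {sorted(overlap)}")
        merged.append({**always, **option})
    return merged


def first_supported(candidates: list, supported_browsers: set) -> dict:
    """Hypothetical provider check: the first candidate we can fulfil wins."""
    for candidate in candidates:
        if candidate.get("browserName") in supported_browsers:
            return candidate
    raise RuntimeError("session not created: no capabilities matched")


candidates = candidate_capabilities(payload)
chosen = first_supported(candidates, {"firefox"})
```

With only Firefox available, the second entry is chosen and still carries the proxy; with nothing suitable available, the sketch raises instead of quietly starting some other browser.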

The more structured data used by the W3C New Session command is sent in a different key in the JSON blob, and this is by design. In theory, it's possible to map a JSON Wire Protocol "New Session" payload to the W3C one, and to map the W3C structure to something close to the JSON Wire Protocol payload. Sadly, this process is complex and error prone, and there are language bindings that have been released that get this wrong to one degree or another (and, indeed, some that don't even make the effort). All this means that the Selenium Server has to try and discern the user's intent from the blob of data sent across the wire. Getting this right, and flexible, has been the focus of the forthcoming 3.7 release. It's fiddly work, but it'll be worth it in the end.

Another common problem we see is that some servers out there speak the W3C protocol natively (eg. IEDriverServer, geckodriver, the Selenium Server) and others don't yet (eg. safaridriver, chromedriver, and services such as Sauce Labs). A big part of the 3.5 release was the "pass through" mode, which means that if the Selenium Server detects that both ends speak the same "dialect" of the wire protocol, it'll just shuttle data backwards and forwards. However, if it detects that the two ends don't speak the same protocol, it'll do "protocol conversion", mapping JSON Wire Protocol calls to and from W3C ones. This has been made easier by the fact that the W3C spec is congruent with the JSON Wire Protocol -- the two have identical end points for many commands.
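The decision between the two modes can be sketched roughly as follows. This is a simplified Python illustration, not the Selenium Server's actual logic: it assumes the dialect can be guessed from the shape of a "New Session" response, and the function names are hypothetical.

```python
def detect_dialect(new_session_response: dict) -> str:
    """Guess the dialect from the shape of a "New Session" response."""
    value = new_session_response.get("value")
    # W3C style: session id and capabilities are nested inside "value".
    if isinstance(value, dict) and "sessionId" in value and "capabilities" in value:
        return "w3c"
    # JSON Wire Protocol style: "sessionId" and "status" sit at the top level.
    if "sessionId" in new_session_response and "status" in new_session_response:
        return "jsonwp"
    raise ValueError("unrecognised New Session response")


def choose_mode(local_dialect: str, remote_dialect: str) -> str:
    """Shuttle bytes when both ends agree, otherwise convert between dialects."""
    if local_dialect == remote_dialect:
        return "pass-through"
    return "protocol-conversion"


# Illustrative responses, not captured from real drivers.
w3c_response = {"value": {"sessionId": "abc", "capabilities": {"browserName": "firefox"}}}
jsonwp_response = {"sessionId": "def", "status": 0, "value": {"browserName": "chrome"}}

mode = choose_mode(detect_dialect(w3c_response), detect_dialect(jsonwp_response))
```

In this example the two ends disagree, so `mode` comes out as the conversion path.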

But not all commands. The main ones that have been causing grief have been the advanced user interaction commands, particularly when a local end speaks the JSON Wire Protocol, and the remote end speaks the W3C one. Just such a situation arises for users of some cloud-based Selenium servers, and it's been a constant source of questions from users. To help address this, I've landed some code that does emulation of the common JSON Wire Protocol advanced user interaction commands (things like "moveTo"). Hopefully this will address the majority of headaches that people are experiencing using this new functionality.
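To give a flavour of what such emulation involves, here is a Python sketch that turns the parameters of a JSON Wire Protocol "moveTo" call into a W3C actions payload. It is not the code that landed in Selenium: the function name and the "mouse" input source id are assumptions, and it passes offsets straight through even though the two protocols may measure them from different points on the element, which a real conversion would need to handle.

```python
# Key the W3C spec uses to identify a web element in JSON.
W3C_ELEMENT_KEY = "element-6066-11e4-a52e-4f735466cecf"


def move_to_as_w3c_actions(params: dict) -> dict:
    """Build a W3C "actions" payload from JSON Wire Protocol moveTo parameters."""
    element_id = params.get("element")
    x = params.get("xoffset") or 0
    y = params.get("yoffset") or 0

    if element_id is not None:
        # Move relative to the given element.
        origin = {W3C_ELEMENT_KEY: element_id}
    else:
        # No element: move relative to the current pointer position.
        origin = "pointer"

    return {
        "actions": [
            {
                "type": "pointer",
                "id": "mouse",  # hypothetical input source id
                "parameters": {"pointerType": "mouse"},
                "actions": [
                    {
                        "type": "pointerMove",
                        "duration": 0,
                        "origin": origin,
                        "x": x,
                        "y": y,
                    }
                ],
            }
        ]
    }


converted = move_to_as_w3c_actions({"element": "e1", "xoffset": 5, "yoffset": 10})
```

One small legacy call becomes a nested sequence of pointer actions, which hints at why the W3C end point is both more flexible and more work to implement.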

Let's see what the next month brings. Hopefully, we'll ship 3.7 :)

Wednesday, 18 October 2017

A Month in Selenium - September
