Friday, March 22, 2019

How to Make Accessible HTML5 Games that Work with Screen Readers, Part Two

This is the second of a three-part series.  Links to the other two articles will be added to this page as the other parts become available.

In this article, I will talk about the technical aspects of how I exposed the gameplay of Battle Weary to work with screen readers.  Battle Weary is an HTML5 roguelike game written in JavaScript, and it uses the WAI-ARIA specification to expose its behavior to screen readers.

Before we begin, be aware that much of what follows builds upon (and in some cases, goes against) the WAI-ARIA specification for exposing web applications to screen readers.  It would probably be good for anyone looking to make HTML5 games accessible to at least skim the important parts of the specification to understand the discussion below.  (Or at least bookmark it so that you can come back to it later once you start diving into it yourself.)

Overview

This section will summarize the components we need to have in place at a high level, and then we'll go into detail for each below.

The model for how we structure the interactive game experience is that of a conversation.  Rather than trying to expose all of the UI elements as navigable items that must be individually discovered, understood, and activated, we instead manage the gameplay experience in a more linear format that expects a sort of call-and-response interaction.

To achieve this, we use an ARIA-live region that will serve as our mouthpiece to speak things to the user.  We'll call this the "Emcee", because it will essentially serve as our "master of ceremonies", giving announcements and framing the context of the presentation.

We then, essentially, shut down the normal navigation and exploration features of the screen reader so that they don't "clutter up" the experience of playing the game.  We do this so that the user doesn't have to navigate around, doesn't get off-focus to interrupt gameplay, and doesn't accidentally break out of the game when they don't expect it.

And finally, we structure our activity to ensure that the conversation is intelligible, and add affordances that allow recovery if something else interrupts our conversation (such as an alert from an email message coming in).

In the following sections, we will look at each of these elements and how to achieve them.

The Emcee

The Emcee is going to serve as our mouthpiece for communicating textual content back to the user.  It's an ARIA-live region placed outside of the main game display area which is both atomic and assertive.  Here's the HTML for it:

 <div id="EMCEE"
  class="screen-reader-only"
  aria-atomic="true"
  aria-live="assertive"
  tabindex="-1"
  ></div>

This will cause any text that you inject into the #EMCEE div to be spoken aloud (because it is an ARIA-live region), in its entirety (because it is atomic), at the earliest possible opportunity (because it is assertive).

For instance, if you have jQuery available on the page, you can have it speak "Hello World" by typing into the console:

 $('#EMCEE').html( "Hello world!" );

In Battle Weary, I went ahead and showed this div for debugging purposes and to provide a sort of automatic subtitling, but it could be hidden from view for non-screen-reader users.  However, you cannot simply use "display: none" to do so, because screen readers are smart enough to know that this means it's not in the flow of the document and will therefore ignore it, which would prevent your Emcee from doing its job.  So you have to hide the Emcee div without actually hiding it.  There are many techniques for doing so; here is one example:

 .screen-reader-only {
  position: absolute;
  height: 1px;
  width: 1px;
  clip: rect(1px 1px 1px 1px); /* IE 6 and 7 */
  clip: rect(1px, 1px, 1px, 1px);
  clip-path: polygon(0px 0px, 0px 0px, 0px 0px);
  -webkit-clip-path: polygon(0px 0px, 0px 0px, 0px 0px);
  overflow: hidden !important;
 }

This approach keeps the item in the active page flow without actually showing it to sighted users.
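Incidentally, the jQuery call shown earlier isn't essential; a plain DOM assignment works just as well.  Here is a small sketch (the helper name and the element-passing style are my own, chosen to keep it easy to test):

```javascript
// A jQuery-free way to make the Emcee speak.  Passing the element in,
// rather than looking it up inside, is just a testability convenience.
function speak( emceeEl, text ) {
  // Replacing the live region's content triggers the announcement.
  emceeEl.textContent = text;
}

// Usage: speak( document.getElementById('EMCEE'), "Hello world!" );
```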

Narrating the Gameplay

The Emcee is the key for exposing the game to the player in a way that removes a lot of cumbersome UI navigation, but for it to do its job, we need to take over one side of the conversation and make it intelligible to the player by emitting relevant information to it whenever the user takes an action.

To accomplish this, I made an announce() function that could be called from JavaScript to gather text to be spoken aloud, and optionally flush the content to the live region.  Why?

Consider a case where you are making a card game (like Battle Weary's combat system).  You want to announce the activity that the player does, and you want to announce the new state that this card play has led you to.  The code for the card playing is almost certainly separate from the code for the new state that you're going to.

But if you send these two utterances to the Emcee separately, the second one would override the first one, cutting it off.  You would never hear the first one.  So what you want is a model where we can send utterances to the Emcee, but have the Emcee only utter them when things are "settled down" and it is ready to speak them all in their entirety.

To accomplish this, we add the concept of accruing the to-be-spoken content and then "flushing" it all to the Emcee at once when we're done talking and need the player to make a choice, much in the same way that CB-radio users say, "over" when they're done talking and the next person can respond.

We keep a variable called currentMessage and accrue spoken content in it until we are ready to flush it to be spoken.

 announce( text, flush ) {
  // Accrue text until we are asked to flush it.
  this.currentMessage += "  " + text;
  if (!flush) return;
  // Strip markup so the screen reader doesn't read the tags aloud.
  var safeMessage = this.currentMessage.stripTags();
  $('#EMCEE').html( safeMessage );
  this.currentMessage = "";
 }

This uses an extension to the String function to add the stripTags() function:

 String.prototype.stripTags = function() {
  var tmp = document.createElement("DIV");
  tmp.innerHTML = this;
  return tmp.textContent || tmp.innerText || "";
 };

We strip the tags so that we can give the announce() function an HTML string without the screen reader saying things like, "less than bold greater than important thing less than slash bold greater than".  Since we will often want to display and speak the same text, this comes in exceedingly handy, especially if we want to have some text that is spoken but not shown – in that case, we just wrap that content in a span that is given a class with a display: none; CSS rule on it.

Once you have the above in place, you can do this when the user plays a card:

 GAME.announce( 'You played the "Attack" card.' );

...and then in the code that shows the result, you can do this:

 GAME.announce( 'Enemy died. Press space to continue.', true );

This will eventually speak, "You played the Attack card.  Enemy died. Press space to continue."  You have the ability to cleanly annotate the action that leads out of the previous state and prompt the user for the next state.

Interaction Organization

Which brings us to the mechanics for organizing the interaction with the game.  For this system to work, the game has to be, essentially, a well-defined state machine that always has a codified set of interactions that can lead to other states.  This is because, at any time, the user needs to be able to have the game re-speak the current game status.

If you imagine a game of Uno, if something interrupts the game when it is telling you what card is on top of the discard pile, you're out of luck if you can't get it to speak that again.  At any time, the user needs to be able to have the game elucidate the current status of the game and what it expects from the user.

In Battle Weary, the game switches between several different contexts: traveling around the map, battling monsters with cards, responding to choices or notices, etc.  Because games are complex, the keyboard commands that are meaningful in one part of the game are not meaningful in other parts, or they may have different meanings.  For instance, when traveling around a map, the arrow keys move the player to adjacent rooms – pressing the up arrow key moves the player north.  But when making a choice in a choice dialog, the arrow keys select different options.

So, there should always be a key that allows the player to get a spoken declaration of the current status and context of the game.  In Battle Weary, I used the "S" key, for "Status", but it could be anything as long as it's consistent throughout the game and announced up-front.

This means that you will probably want to organize your game structure in such a way that your game engine can query the current game state for a string that describes the state.  I used the concept of a "current controller" object that represents each of the possible game states, and which could be queried for this string as needed.

Similarly, you should also have a key for getting help.  In Battle Weary, I used the "H" key, for "Help".  This would do a similar task, only instead of speaking the current game status, it speaks the current list of key commands and interaction context.

The difference between status and help

"Wait, what's the difference there?" you might be asking.  There is definitely some nuance here that merits discussion.

When we interact with a game, there are two levels we do it at.  There's the conceptual model of the game - the interesting part of the game - that we deal with using our brain.  It's the "mind's eye" that looks into the game and understands it as an experience.  If we were playing a game of Tic-Tac-Toe, this is the part of the game that considers things like where the X's and O's are, whether we're winning or losing, noting where our opponent just played, and figuring out where we'll play next.

But there's also the interaction model for the game - the part of the game that allows us to interact with that higher-level conceptual model.  Do we click a square to add our 'X'?  Do we use the arrow keys to highlight a square and press space bar?  Do we drag an "X" symbol into a grid?  Or what?  There's a rote, mechanical element that we need to learn in order to express ourselves and participate in the conceptual model.

Typically, once we internalize that second category of game information - the part that tells us how to interact with the game - we don't need to reference it again.  Or we may only need to refer to it if we want to do something "weird" that we don't normally do while playing, like splitting or taking insurance when playing Blackjack, or announcing "Uno!" when we are playing our second-to-last card in a game of Uno.  We will seldom trigger the help text ourselves, and the game won't automatically speak it - it will only speak it on our request.

The conceptual information, though, is something we may need to refer to over and over again, and it will likely be emitted automatically, since it is crucial information that helps orient the player on a turn-by-turn basis.

Luckily, both are simple to implement, since it's really just storing two HTML strings whenever the game state changes: one for the status and one for the help.  When the user issues a command that changes the game state, you just update them again.  And when the user hits the given key command, you simply emit the associated string.  Easy peasy!

(Well, it's "Easy Peasy" from a technical perspective.  The hard part is authoring those strings so they are useful, intelligible, and terse.  We'll talk more about that in Part Three.)
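As a concrete illustration, here is a minimal sketch of that idea in JavaScript.  The state names and strings are invented for this example; they are not from the Battle Weary source:

```javascript
// Each state change stores fresh status and help text; the S and H key
// handlers simply announce whichever strings are current.
var stateText = { status: "", help: "" };

function enterMapState() {
  stateText.status = "You are in the entry hall. Exits lead north and east.";
  stateText.help = "Arrow keys move between rooms. S speaks the status. H speaks this help.";
}

function enterChoiceState() {
  stateText.status = "Choose a reward: a potion or a sword. Potion is selected.";
  stateText.help = "Up and down arrows change the selection. Space confirms your choice.";
}
```

When the user presses S or H, the key handler just passes stateText.status or stateText.help to the Emcee.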

Also, one little optional improvement you could make to the above: Since the status is not something sighted users need, but the help could benefit them, you might consider actually showing a dialog box with the help when the user hits "H", while only announcing the text when the user hits "S".  This is what I did with Battle Weary, and it worked quite well.  Making a game accessible often helps all users, not just the ones who must rely on those affordances.

Switch Support

This section is a draft recommendation, and should be considered experimental and untested.

In addition to screen readers, you can make it so that switch users can also play your game by adopting a model where all options for a given game state can be tabbed through and shown to the user.  To accomplish this, you'd listen for the TAB key and show a new menu of choices.  As the user continues pressing the TAB key, it cycles through this menu, and pressing SPACE would select that choice as if the user had pressed the corresponding key command.  (To support non-switch users, you could also support shift-TAB to move backwards through choices.)

The way I handled this is that the "normal" controller would push a new "linear" controller onto the stack, which would present the choices to the user, and then pass the selected choice back to its calling controller.  That way, a single controller handles all of the activity; you just have a new controller that can present the choices in a different way.
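To make that concrete, here is a rough sketch of such a "linear" controller.  Like the rest of this section, it is untested against real switch hardware, and the names are illustrative rather than taken from the Battle Weary source:

```javascript
// A minimal "linear" controller: TAB cycles through the choices,
// shift-TAB cycles backwards, and SPACE hands the highlighted choice
// back to the calling controller.
function LinearController( choices, onPick ) {
  this.choices = choices;   // e.g. [ { label: "Attack", key: "a" }, ... ]
  this.index = -1;          // nothing highlighted until the first TAB
  this.onPick = onPick;     // callback into the calling controller
}

LinearController.prototype.keyDown = function( e ) {
  if (e.key == 'Tab') {
    var step = e.shiftKey ? -1 : 1;
    this.index = (this.index + step + this.choices.length) % this.choices.length;
    return this.choices[ this.index ].label;   // text for the Emcee to announce
  }
  if (e.key == ' ' && this.index >= 0) {
    this.onPick( this.choices[ this.index ] ); // select, as if the user had
    return null;                               // pressed its key command
  }
  return null;
};
```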

Managing the Focus

The other sticky issue when testing with VoiceOver was that it was easy to focus an element that wasn't the game.

The way the WAI-ARIA specification works, keyboard commands are understood to behave in different ways depending on what has the keyboard focus.  Pressing space while focusing on a button clicks the button, while pressing space while focusing on a text area adds a space character to the text area, for instance.

For an HTML5 game, then, the game grinds to a halt if it loses focus.

Typically, this is fine and intended behavior if the game loses focus because the user decided to navigate to a different web page.

But it's bad if the game loses focus because some sub-element has gained focus, like the letter "f" in one of your dialog boxes.

So, we take steps to aggressively manage the focus.  The goal is to ensure that the only thing on the page that the screen reader sees is the game itself.  If you're on the page, you're playing the game.

Unfortunately, in my testing, that appears impossible.  But we can approach it by implementing a few tricks.

First, we set up the page's HTML content so that it makes the game one big "widget".  We set the game's div to have the ARIA-role of "application" and make it the only object that can naturally receive focus:

 <div id="GAME"
  role="application"
  aria-roledescription="game"
  aria-activedescendant=""
  aria-label=""
  tabindex="0"
  >

Then, we mark all of the sub-elements, and any other web page elements on the page (except the game itself and the live region mentioned above), as being ARIA-hidden with a tabindex of -1:

 <div id="GAME-UI"
  aria-hidden="true"
  tabindex="-1"
 >

This (usually) causes the screen reader to only see one big "application" object on the page and prevents it from indexing those other parts of the page and exposing them as navigable options to the user.  This is exactly what we want, because we don't want two places for the user to go to do things.  Since the game is going to be considered one big widget, basically, we want the application focus to go there when we load the page, and stay there.
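If the page is large, marking all of those elements by hand gets tedious; a small helper can do it at load time.  This is a sketch of my own (the function name and the idea of walking the body's children are not from Battle Weary):

```javascript
// Hide every top-level element from the accessibility tree except the
// game and the Emcee live region.
function hideFromScreenReaders( root, keepIds ) {
  Array.prototype.forEach.call( root.children, function( el ) {
    if (keepIds.indexOf( el.id ) != -1) return;
    el.setAttribute( 'aria-hidden', 'true' );
    el.setAttribute( 'tabindex', '-1' );
  });
}

// Usage: hideFromScreenReaders( document.body, [ 'GAME', 'EMCEE' ] );
```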

(Note that the specification for the ARIA-hidden attribute states that, in general, we should only use it for elements that are truly hidden from all users, but it carves out an exception for cases where we hide elements from screen readers for expediency and improving the screen reader user's experience.  The canonical example is a headline with an image; the image is ancillary and may be hidden in order to prevent the user from having to listen to (and navigate between) two different linked elements.  In our case, we are hiding the elements because we want the screen reader to think of the game itself as atomic.  I think this satisfies the requirement, since we truly are replacing the otherwise-cumbersome UI with something much more streamlined, but a purist may object to the level at which this intervenes.  To that argument, I can only respond that the WAI-ARIA spec is otherwise unsuited to the task, so no matter what we do, there will be concessions, so we might as well go the route that makes the most compelling and immersive game experience.)

Even with the above, it is possible to "sneak" into the descendants of the game, say by clicking on them directly, such as might happen when initially trying to give the application focus.  So we also set up a sentinel to watch for cases where the current focus is placed on something in the game's hierarchy, and then bump the focus back up to the game element itself.

Unfortunately, as far as I am aware, there is no event you can add a listener for that tells you when the focus changes, so until a better way comes to light, we brute-force it and just check several times a second that the focus is in a valid place, and if not, push it into one.  I do this with a FocusManager object:

class FocusManager {

 constructor() {
  this.gameHasFocus = false;
  setInterval( function() {
   this.manageFocus();
  }.bind(this), 100 );
  var game = document.getElementById('GAME');
  game.focus();
 }
 
 manageFocus() {
  var current = document.activeElement;
  if (current.id == 'GAME') {
   this.gainFocus();
   return;
  }
  while( true ) {
   /*   The EMCEE is part of the game, so if it has focus,
        go ahead and set it back to the main game. */
   if (current.id == 'EMCEE') {
     var game = document.getElementById('GAME');
     game.focus();
     this.gainFocus();
     return;
   }
   if (current.id == 'GAME') {
    current.focus();
    this.gainFocus();
    return;
   }
   if (!current.parentNode) break;
   current = current.parentNode;
  }
  this.loseFocus();
 }
 
 loseFocus() {
  if (this.gameHasFocus == false) return;
  this.gameHasFocus = false;
  if (GAME.controller.unfocus) GAME.controller.unfocus();
 }
 
 gainFocus() {
  if (this.gameHasFocus == true) return;
  this.gameHasFocus = true;
  if (GAME.controller.refocus) {
   GAME.controller.refocus();
  } else {
   if (GAME.controller.status) {
    GAME.announce( GAME.controller.status() );
   }
  }
 }
 
}

This code just watches the focus, and if it is ever a descendant of the game div, or the "Emcee", it refocuses the primary game div.  (In other words, if the current focus is any game-related DOM element, it refocuses the main game DOM element.)

It also triggers some hooks into the game engine, allowing the current game state to respond to when the game has lost focus and regained focus.  By default, it simply announces the current controller's status when the game refocuses (i.e., if you come back to the Tic-Tac-Toe game, it will tell you the state of the board when you left off).

Now, people familiar with the ARIA spec may be crying foul right now.  You'll note that the ARIA-label for the GAME div is empty.  That is against the specification; there should always be an ARIA-label or an ARIA-labelledby so that the screen reader knows what to say when it is highlighted if the content itself is not emittable as a description.

But here's the thing – that label gets spoken automatically and often while the game has focus, but not in a predictable and reliable fashion.  Whatever we put in there is going to be randomly echoed into the game stream at unpredictable times, and may not be uttered in its entirety at all.  We cannot stop it and we cannot rely on it.  So we leave it blank.  Otherwise, it will make the gameplay cumbersome and confusing.   (This is the same reasoning that leads to the valid exception for the usage of ARIA-hidden properties when the content isn't actually hidden from sighted users.)

Now, taking this route means we have a responsibility to do our due diligence and make sure that the lack of information in the ARIA-label never leaves the player hanging.  That's why we implemented the gainFocus() function above; when the game receives focus, we make sure to always emit something useful.

(One possible further improvement: If the user is idle long enough, announce() a notice that the player can always press 'H' for help.  Choosing a good idle length might be difficult and require testing with actual players, though.)
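A sketch of how that reminder might look (the helper name and the 30-second figure are my guesses; a good delay would need testing with actual players):

```javascript
// Creates a resettable idle timer.  Call the returned function from the
// key handler so that any input postpones the reminder.
function makeIdleReminder( announceFn, delayMs ) {
  var timer = null;
  return function resetIdleTimer() {
    if (timer) clearTimeout( timer );
    timer = setTimeout( function() {
      announceFn( "Press H at any time for help." );
    }, delayMs );
  };
}

// Usage: var resetIdle = makeIdleReminder( GAME.announce.bind(GAME), 30000 );
```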

Edit: Since the above code was written, I identified an issue with it.  Depending on how you structure your game, you may wish to keep a flag and only issue that initial GAME.announce() if it's not the first time the game has gained focus.  Otherwise, this announcement could override a fuller first utterance that your game may emit upon startup.
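One way to implement that flag, sketched as a wrapper around the gainFocus() logic above (the wrapper itself is mine; adapt it to however your game is structured):

```javascript
// Skip the automatic status announcement the very first time the game
// gains focus, so the game's startup utterance is not cut off.
function makeGainFocus( game ) {
  var hasFocusedBefore = false;
  return function gainFocus() {
    if (game.gameHasFocus) return;
    game.gameHasFocus = true;
    if (!hasFocusedBefore) {
      hasFocusedBefore = true;   // let the startup announcement play
      return;
    }
    if (game.controller.status) {
      game.announce( game.controller.status() );
    }
  };
}
```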

Handling Key Input

The last little piece is handling keyboard input.  When the game has focus, we need to be able to respond to keyboard events.  But when the game does not have focus, we should not interfere with the keyboard commands to prevent stepping on the toes of screen reader users trying to navigate away.

Here is the code I am currently using for this:
didKeyDown( e ) {

 // First, we check to make sure the game
 // is what has focus.
 // If not, we ignore key presses, to allow
 // screen readers to do their thing unimpeded.

 // You can add other valid focus targets here
 // if you need to.
 var validFocus = [ 'GAME' ];

 if (validFocus.indexOf( document.activeElement.id ) == -1) {
  return false;
 }

 // Otherwise, we're going to handle the keystroke.
 e.preventDefault();
 
 // Give the controller "first shot" at the key.
 if (this.controller.keyDown) {
  if (this.controller.keyDown( e ) == true) return true;
 }
 // Any other default key actions for your app may go here.
 // We lowercase the key so "S" still works with shift or caps lock.
 switch( e.key.toLowerCase() ) {
  // Player asks for status.
  case 's':
   var status = this.controller.status();
   GAME.announce( status );
   return true;
  // Player asks for help.
  case 'h':
   this.showHelp();
   return true;
 }
 return true;
}

As you can see from this code, the key commands from the player are completely ignored if the game does not have focus.  This way, if the user is navigating away from the game, it doesn't interfere or speak while the player is doing other things.  And if it does have focus, it kills keystrokes to prevent things like accidental selection of characters when pressing arrow keys, even if the current game context doesn't explicitly look for those keys.

But how does the user navigate away from the game if we kill all key events?  Well, certain "escape focus" key commands take precedence over what is given to the web page.  So those crucial navigation elements that allow a user to leave a web page and go to another tab or another application are not affected.  Only when it comes to navigation within the web page does this level of control kick in.  So we're safe!  (At least this is true with VoiceOver; it has not been tested in other screen readers such as JAWS, so I may have to amend this approach in the future if those screen readers behave differently.  I can't imagine that they would, though, because otherwise, a web page could "capture" a screen reader user and never let them go.)

Onboarding and Sentinel Pages

One thing I was not satisfied with in Battle Weary was the onboarding.  Once you're playing, Battle Weary is streamlined, fun, and immersive when used with a screen reader.  But getting to the point where you have the game in focus, the key commands internalized, and the game ready to play is still a bit rocky.  Browsers do not have any affordances that streamline this process, so we need a way to assist with it.

A screen reader user following a link to the game will get dumped into an itch.io page with the game framed in an iframe.  If they do nothing, the game will start with the correct focus and they can start playing, but if they start navigating around the page, they'll break out of the game's focus and it can get pretty difficult to get back to the game.

Honestly, I don't see a lot of ways around this other than to prep players with a "sentinel page" – a page that comes before the game itself, using standard, plain-jane HTML markup.  This would be a good place to give an overview of the game's context, rules, and the ubiquitous "H" and "S" keys, too.

This is quite close to what QuentinC's Playroom does.  The benefits for it are clear, and it seems to be battle-tested, so I think it's safe to call sentinel pages a best practice for screen reader accessible games.

One problem, though, is that this approach could be easily and accidentally subverted by someone sharing the link to the game page itself rather than to this "sentinel" page.  That would circumvent the gentle onboarding.

What I am experimenting with now is to include the game in the sentinel page itself, but hidden, and add a button on the page that kills the sentinel page content, reveals the game, and forces focus on it.  If the player is navigating the sentinel page, it already has focus, and pressing a button on the page to kill the sentinel content and start the game should work flawlessly.
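The click handler for that button can be quite small.  A sketch (the IDs are illustrative, and the initial hiding could equally be done with a CSS class):

```javascript
// Kill the sentinel content, reveal the game, and force focus onto it.
function startGame( sentinelEl, gameEl ) {
  sentinelEl.hidden = true;
  gameEl.hidden = false;
  gameEl.focus();
}

// Wired up as, for example:
// document.getElementById('START-BUTTON').addEventListener('click',
//   function() {
//     startGame( document.getElementById('SENTINEL'),
//                document.getElementById('GAME') );
//   });
```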

I've experimented with this approach, and it does indeed work, dumping you straight into the game with focus.  It works well enough that I'm going to consider it a best practice for now, but it will need more testing.

Conclusion

Once you have the above elements in place, you have the building blocks of an accessible HTML5 game.  You can:
  • Capture the focus and keep it
  • Announce text to the user to reflect game state
  • Accept keyboard commands to navigate the game space
...all in a manner that screen readers can work well with.

In Part Three of this series, I'll talk about the design elements that can help you avoid pitfalls and improve the gameplay quality for the screen reader user.



