Saturday, March 23, 2019

How to Make Accessible HTML5 Games that Work with Screen Readers, Part Three

This is the third and final part of a three-part series; the first two parts cover the groundwork this article builds on.

In the first two parts of this series, we discussed an overview of screen reader capabilities in HTML5 games and some of the technical implementation details of setting up a system that allows playing games with a screen reader.  In this part, we will discuss some of the more esoteric, design-level considerations that the previous two parts bring to the fore.

In many ways, these are the most important elements to get right: all the technical details mentioned earlier, even if executed flawlessly, will fall flat if the design does not leverage them properly.

Writing Considerations

First, let's talk about writing considerations.  When players interact with a game solely through the script we write for it, the manner in which we write the utterances that will be spoken matters.

Brevity

One thing I struggled with in Battle Weary was how succinct to make the text.

On one hand, I wanted the language used in the game to be as short as possible, to reduce the amount of time that players had to sit there waiting through exposition to get to the meat of what's being described so they can make decisions.

On the other hand, stripping all the "flavor text" from the game would rob the game world of its immersiveness.  Sighted users get nicely decorated maps that (try to) evoke the game world visually.  I felt that screen reader users might want the flavor text to give them an equivalent feel for the world their character was exploring.

I have to admit, I am at a bit of a loss as to how to reconcile those two concerns.  The best balance between brevity and flavor is a subjective choice, but it still should probably be driven by some guiding philosophy so it doesn't go all over the place over the course of a game.

I'm pretty sure the best approach is not to default to only speaking the bare minimum and having the player hit a key for the flavor text.  And I'm pretty sure it's a bad idea to have paragraphs of florid text for every interaction.

What I settled on for Battle Weary was trying to be very succinct, but still having at least a little flavor to the succinct statements.  Instead of "The goblin hits you for 3 damage," I say, "The goblin stabs you with a crude spear for 3 damage."  The latter is longer, but not so long that it is tedious to listen to, yet still helps evoke a mental image of what your hero is facing.

Punctuation and phrasing

The reality is that text-to-speech technology, while very good, still isn't anywhere near as good as "normal" human speech, so you may have to help the text-to-speech parser along.

One thing I encountered during development was the problem of speaking about cards.  For instance, consider the following text:
Play the attack card.
You, as a human, might speak this text in one of two ways, and the intonation you'd use would make a difference in what you mean.  If there are multiple types of cards, and "attack cards" are one type of card, the above text might mean, "choose a card of class attack and play it".  In this case, you'd speak "attack card" as a contiguous text phrase without pauses, connecting the two words.

But if the title of the card is "attack", you'd use a different intonation, adding slight pauses around the word "attack" to convey where the title begins and ends.

Unfortunately, the text-to-speech parser has no idea about these nuances, so you have to help it out a bit.  The default speech would be the first approach, but if you needed the second approach – the title of the card is "Attack" – then you'd ask the text-to-speech parser to speak:
Play the "Attack" card.
Placing quote marks around "Attack" causes the text-to-speech parser to intone that part of the phrase differently, with the desired pauses.
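To make this concrete, here's a minimal sketch of a helper that applies this quoting trick (the function name is my own invention, not code from Battle Weary):

```javascript
// Hypothetical helper: wrap a card's title in quote marks so the
// TTS parser adds the slight pauses that mark where the title
// begins and ends.  Titles that double as common nouns ("Attack")
// especially need this.
function speakableCardPhrase(cardTitle) {
  return 'Play the "' + cardTitle + '" card.';
}
```

Any utterance that embeds a proper name into a sentence of ordinary words can benefit from the same treatment.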

Luckily, this is very intuitive and straightforward syntax, and in most cases, it can even be displayed as screen text, since it matches up nicely with what you'd normally display visually.  Unfortunately, this is not always the case.  Sometimes you need to add punctuation to the "spoken" text that you wouldn't want to display visually.

A good example is headings.  If you handed the text-to-speech parser the text for this section of this article, it would speak:
Punctuation and phrasing the reality...
 ...without a significant pause or concluding intonation between the end of the heading and the first paragraph, because headings don't typically include punctuation.  The solution is to place a period at the end of the heading to help the text-to-speech parser understand that the content is disparate:
Punctuation and phrasing.  The reality...
...but of course you probably don't want to display that period.

There are also some phrasings that TTS simply doesn't handle well.  For instance, the following phrase doesn't render well:
Where you sample the water matters.
...because the "where" gets lost in the shuffle at the beginning of the sentence.  The word "where" doesn't start many sentences, so it comes off sounding awkward.  But if we were to rewrite this sentence to:
The location where you sample the water matters.
...then the TTS system handles it elegantly and it is much more understandable.

The end result of all of this is that you often need to keep "what gets spoken" and "what gets displayed" separate.  It's wise to implement that in your systems from the start when possible; don't just store text.  Store text that will be presented to the user as two components: the spoken aspect and the display aspect.
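As a minimal sketch of that separation (the field names "display" and "spoken" are illustrative, not from any framework), each piece of user-facing text can carry both components:

```javascript
// Store user-facing text as two components rather than one string.
// If no spoken variant is given, fall back to the display text.
function makeText(display, spoken) {
  return { display: display, spoken: spoken !== undefined ? spoken : display };
}

// A heading keeps its clean visual form, while the spoken variant
// carries the trailing period that makes the TTS parser pause.
const heading = makeText(
  "Punctuation and phrasing",
  "Punctuation and phrasing."
);
```

The rendering layer then reads `heading.display`, while the speech layer hands `heading.spoken` to the TTS parser.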

Add Whitespace Generously

Another best practice is to always add a space at the end of your text content, or to be sure that you add a space when you concatenate strings.  I encountered a few places where I had my game speak two sentences concatenated from two sources and failed to put a space between them, so what was handed to the TTS parser was something like:
You walk west.You are in the dark cave.
This causes the TTS parser to speak something that sounds like this:
You walk west dot you are in the dark cave.
...all as one contiguous sentence, which is of course not what I intended.  Adding a space between the two sentences fixes it.  Since TTS parsers seem to behave much like HTML, where whitespace is "collapsed", you can add as many spaces between items as you need, so getting in the habit of having whitespace around your stored strings and inserting it when you concatenate them for speech is fine.
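A small, hypothetical joining helper makes this habit automatic, since extra whitespace is collapsed anyway:

```javascript
// Join utterances with a space so "west.You" can never happen.
// TTS parsers collapse runs of whitespace much like HTML does,
// so over-spacing is harmless.
function joinUtterances(parts) {
  return parts
    .map(function (p) { return p.trim(); })
    .filter(function (p) { return p.length > 0; })
    .join(" ");
}
```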

Help versus Status

In the previous part, we talked a bit about the idea of always providing a way for the player to call up the game's Help and Status.  I'd like to talk about that in a bit more detail now.

These two things are different in character, but closely related.  Help describes how to interact with the game, but conveys little about its state beyond describing the current space of commands available to you.  Status describes the game state you are making decisions to act upon, but doesn't describe how to actually act on it.  Taken together, they let you understand the game's status and how to respond to it with your own decisions.

So why separate them?  It's a good question, and it took me a while to arrive at the conclusion that they need to be explicitly and consistently separated.  It boils down to the fact that this separation is what is required to streamline the play experience and maximize immersion.

Ideally, when you play the game, you've already memorized the key commands, you have them down to "muscle memory", and you aren't even aware that you are physically pressing that up arrow key in order to move your character north.  So to achieve that "flow" state for a screen reader user, we need to omit, to the extent possible, the exposition about how to move around and just let them move around.  The status, then, is what we emit to give the player what they need to know to make decisions, but not anything that explains what controls to use.

This is especially true since the player will likely need to hear the game's status over and over.  The more succinct we can make it, the better, and controls which they should internalize after only a few moves are prime candidates for culling verbiage.

But of course the reality is that not everyone who plays your game comes in with that familiarity with the game controls, and those players will need help.  That's why you also need to have the Help-like content.

It was that inherent push-and-pull that led me to the realization that we need two affordances here, not just one.  Originally, I was envisioning a single "status" key that the user could hit at any time to get all that information, but it was cumbersome and often made the game feel more didactic than it needed to.  Once I separated the two, the feel of the game improved significantly.

Additionally, since users not reliant on screen readers can also benefit from a "Press H at any time to get help" affordance for the key commands they use, it made sense to separate the two on that front as well.  You don't need to tell a sighted user where the X's and O's are in Tic-Tac-Toe, but you do need to tell them the key commands.  You can kill two birds with one stone by exposing the "accessible" Help to all users.

Now, all the above said, you may find some places where you can "inline" help, making it less likely that people will need to call up the Help at all.  For instance, if you're going to tell players that they are currently deciding where to move, you can very succinctly mention that the arrow keys do so.  Since we automatically emit the status, that spares people from having to refer to the Help to learn the game.  Just use this sparingly; every help item that appears in the status will be heard over and over.
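One way to keep Help and Status persistently available is to check for their keys before any mode-specific input handling.  A sketch (the handler names are assumptions, not Battle Weary's actual code):

```javascript
// Check the two persistent keys before mode-specific input.
// speakHelp and speakStatus are assumed to exist elsewhere in the
// game; only the dispatch shape is shown here.
function handleGlobalKeys(key, game) {
  if (key === "h" || key === "H") {
    game.speakHelp();   // how to interact: key commands only
    return true;
  }
  if (key === "s" || key === "S") {
    game.speakStatus(); // current state: no control exposition
    return true;
  }
  return false; // let the current interaction mode handle the key
}
```

Because this runs in every mode, the player can always orient themselves, no matter where they are in the game.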

Consider Order

In many games, it's not important what order you present content in.  Indeed, in many games, much of the content is presented concurrently.  In Tic-Tac-Toe, for instance, most players just look at the board directly, seeing all the cells as they relate to each other at once.  While their eyes may focus on one cell at a time, the presentation is all concurrent.

This poses a difficulty when it comes to translating a game state like that to the spoken word.  How do you take concurrently-presented information and turn it into something sequential?  Obviously, you will have to just list everything out one by one, but the order in which we do that matters.

This is because the user can strike a key at any time to take an action.  We cannot measure when the user is done listening to the screen reader's spoken text, so we have to assume that content can be interrupted at any time with a player decision.

So, we need to put the most pertinent information first in any given utterance.  The goal is to let the user who has already "mind mapped" the game to wait the least amount of time to get the information they need to move on to the next game state.

But this is counterbalanced by the fact that the easiest way to get lost is to assume you're one place when you're not, and continue taking actions.  Orienting information must come first, especially if your game can interrupt the "normal" flow of things with new questions.

In other words, the most important thing to know about a cell in Tic-Tac-Toe is whether it is an X or O.  But to understand its significance, we need to be firmly aware of where that X or O is.  So the state for a Tic-Tac-Toe game would have to lead with the cell you are in currently, and then say whether it's an X or O.  A person very familiar with the game and able to envision where they are on the grid might prefer to hear X or O first...until they get lost, screw up their mental map of the game, and make a bad decision as a result.

So instead of:

"X.  You are in the left center cell."

...we should write:

"Left center cell.  X."

Things get even more complicated when you announce changes to the game state.  If I hit the up arrow key while the cursor is on the left center cell, I need to know that it accepted the change to the game status first.  So now we're looking at:

"Moved up. Top left cell. O."

Now, the most important thing about the cell is in the third position, but it's always clear and reliable, to veterans and newbies alike, how they are navigating, where they are, what they're doing, and what the status is.

Note that a veteran can still navigate this quickly.  If they are in the top left cell, they can hit the down arrow twice in quick succession to go check the bottom left cell.  They'll miss the status for the intervening cell, but will end up in the right place, and, importantly, they'll know they landed where they intended to "zoom" to before hearing the cell status:

"Mov...Moved down.  Bottom left cell.  Empty."

This "Action > Position > Status" hierarchy seems to work pretty well, especially when the full game status can get rather long.  Most people playing your game will try to build a mental model of the game, supported by the recurring status utterances, but they won't need to listen to the status every time.  It's crucial information, but if they know what the state was, and they are familiar with the ways their actions can permute the game state (or leave it alone), they don't often need to hear the status unless something unexpected happens in the game...at which point the "Position" part of the utterance earns its keep.
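A hypothetical utterance builder can enforce that ordering in one place, so every move announcement comes out consistent:

```javascript
// Build an utterance in "Action > Position > Status" order:
// confirm the action first, then orient, then report the state.
// Any missing piece is simply skipped.
function moveUtterance(action, position, status) {
  return [action, position, status]
    .filter(function (part) { return Boolean(part); })
    .join(" ");
}
```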

Explicit Distinctions

You may have noticed in the above discussion that I had to come up with names for cells in the hypothetical Tic-Tac-Toe game.  "Bottom left cell".  "Left center cell".  And so on.

This illustrates an important point.  A lot of the distinctiveness we get for free in a visual medium is not available in a text-based format.  In Battle Weary, I wanted the ability to have multiple enemies in one location, but choosing between "goblin" and "goblin" to attack renders them indistinguishable.  In a typical roguelike, you could have dozens of similar enemies with the same name, but they are distinguishable simply by their position.  One is north of you and one is west of you.

In a text format, this is often no longer the case.  I solved this by giving each goblin a specific adjective.  "Slimy goblin".  "Smelly goblin".  "Dirty goblin".  Now, not only are they distinguishable, but it also adds a little bit of character and world-building to the game as a result.
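A simple way to do this is to draw from a pool of adjectives as duplicates appear.  A sketch (the adjective list is my own invention for illustration):

```javascript
// Give each duplicate enemy a distinguishing adjective so "goblin"
// and "goblin" become "slimy goblin" and "smelly goblin".
const ADJECTIVES = ["slimy", "smelly", "dirty", "sneaky", "grumpy"];

function nameEnemies(baseName, count) {
  const names = [];
  for (let i = 0; i < count; i++) {
    names.push(ADJECTIVES[i % ADJECTIVES.length] + " " + baseName);
  }
  return names;
}
```

A real game would likely want to shuffle the pool or persist assignments so an enemy keeps its adjective for its whole lifetime.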

I did a similar trick with room names.  In the maze-like forest level that Battle Weary generates, a sighted user has a distinct advantage over the screen reader user, because they can see the relative placement of their character within the larger environment, whereas describing all the rooms adjacent to the current one in spoken audio would quickly get tiresome.  So, to give the screen reader user a hook to hang their mental model of the game world on, I tried giving each map cell a unique, evocative name, like "Dark Hollow", "Shadowy Clearing", etc.  Now, at least, the player has a way of distinguishing the rooms, and has a chance of recognizing that they're back in known territory if they get lost.

Of course, sometimes, distinction is worthless.  If you have three "Attack" cards in your deck, it doesn't really matter which one is in your hand if they're all the same, so there's no reason to differentiate them.  Obviously, you need to use your judgement on which things warrant extra distinction and which don't.

Additional Details for Spoken Content

Early on in development, I had a function that could display a pop-up choice for the player.  It quickly became clear during testing that sometimes, it is easier to connect different pieces of content when presented in a visual medium.  If you see a dialog box that says, "How many gold pieces do you want to spend?" and the options are "One", "Two", "Three", it is easy to connect the question with the answers when seeing it visually, but more difficult if it is spoken – especially if the player "skips over" the spoken part by quickly pressing a keyboard key.

What I needed was a way to add extra exposition for choices selected by the screen reader.  For the sighted user, it would show "One", but for the screen reader, it would speak "spend one gold piece" to make clear what was happening.

I also found that, when showing help, I could omit some parts for the sighted user; for instance, there is no need to tell a sighted user that they can press "S" to have the screen reader announce a summary of what they can already see.  So some affordance for defining different text for announcing and displaying is generally needed, and as you write, you need to keep in mind which parts are relevant to which medium – spoken versus displayed – to ensure that the best possible experience is delivered by the text.
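One way to model that is to flag each help entry with the media it applies to (the flag names here are illustrative):

```javascript
// Each help entry is flagged for the media it applies to, so
// "Press S for a summary" is spoken but never displayed.
const helpEntries = [
  { text: "Arrow keys: move.", display: true, spoken: true },
  { text: "Press S to hear a summary of the screen.", display: false, spoken: true }
];

// Collect the help text relevant to one medium ("display" or "spoken").
function helpFor(medium) {
  return helpEntries
    .filter(function (e) { return e[medium]; })
    .map(function (e) { return e.text; })
    .join(" ");
}
```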

Structural Considerations

The design considerations implied by screen readers do not end at how we phrase things or the order in which we utter them.  There are also other design constraints that we need to put on the game itself in order to make it enjoyable – or even playable – by screen reader users.

Rethink Gameplay Features for Flow

As I mentioned earlier, the impetus for this project was a desire to make the roguelike game genre readily playable by screen reader users.  But the reality is that if I were to attempt to bring over the full roguelike experience, it would be a slog.

It could be done.  All traditional roguelikes are turn-based games on a grid, and that could be exposed to be fully discoverable.  (Indeed, any roguelike that runs in a Unix shell is playable in this sort of context because it's all just text.)  That doesn't mean it would be pleasant, enjoyable, and appropriately designed for the task, though.

Roguelikes often have a heavy emphasis on the map, which is a large, tile-based affair, usually with hundreds of tiles per level.  Rooms can often have dozens of elements of interest in them, like masses of white worms or "monster zoos".  They can have odd, irregular shapes with multiple doors in walls at odd intervals and locations.  And they can have bottlenecks and open spaces that play into how many enemies can attack you at once, so not understanding the geography can, in edge cases, get you killed, and yet the vast majority of the time, positioning is irrelevant.  Most of the time, you can just constantly hit the arrow keys to wander around a large dungeon filled with mostly-empty tiles.  All this is technically describable in text, but I wouldn't want to sit there listening to something like this every time I moved one tile:
There is a white worm mass two tiles west and one tile north.  There is a white worm mass two tiles west.  There is a white worm mass two tiles west and one tile south.  There is a sword three tiles west and four tiles north.  There is a white worm mass four tiles west and one tile north.  It is wounded.  This room is rectangular, twelve tiles wide and four tiles tall.  You are seven tiles from the west wall and three tiles from the north wall.  There is a door five tiles west and four tiles north.  It is locked.
Nor would I want to have to drop out of my movement actions every turn to move a cursor around to check everything I can see to make sure nothing surprising or dangerous is going on.

So how did I handle it?  I decided to replace the largely irrelevant nature of the tile-based movement with a room-by-room movement.  This preserves a lot of the roguelike activity (roaming around, killing monsters, and taking their stuff) while streamlining away the parts that would be exceedingly dull as a narrative piece.  Enemies that would have normally been masses of multiple tiles now just require a single mention.  So instead of having to hear that above description a dozen times or more to get through a room, you just hear something like this once:
There is a white worm mass here, slightly wounded.  There is a sword here.  There is a locked door in the north wall.
A huge difference!  It also makes things more streamlined from a UI perspective, since you can just choose from three things rather than having to choose something at a position.

We do lose a small amount of nuance – we lose that element of using the geography to take advantage of bottlenecks, for instance.  But it's a good trade-off, and in any case, the important thing is whether the game is fun to play, not whether it "ticks the boxes" of the Berlin Interpretation.

Parallel Mechanics and Interfaces where Possible

One thing to keep in mind is that because there are so few screen reader accessible games on the web currently, you'll probably be asking the user to learn some entirely new UI conventions for your game, especially if you go the route I'm describing in these articles.

That means users will likely be unfamiliar with your game's controls from the outset.  If you then fill your game with many interaction sub-modes, that's just going to make it that much harder for your players to learn the game well enough to play it seamlessly.

To a certain extent, that's unavoidable.  Your "Main Menu" is going to have different interactions than your core gameplay, probably.  But you can still try to minimize it by looking for places where you could unify the gameplay interaction.

You can do a lot with arrow keys and space bar, for instance.  Battle Weary only has three real interaction modes: moving around the map (arrow keys move you and space bar interacts with something there), playing cards (arrow keys select cards and space bar plays it), and dialog boxes (arrow keys select options and space bar confirms the choice).  In all cases, it's arrow keys and space, so learning the game is quick, and even when the player gets into a new place they're not familiar with, trying what they already know will lead them in the right direction.
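A sketch of that unified input model, with each mode interpreting the same two inputs its own way (mode names and actions here are illustrative, not Battle Weary's actual code):

```javascript
// One input model (arrow keys plus space) shared across all
// interaction modes; only the meaning of the inputs changes.
const modes = {
  map:    { arrow: function (dir) { return "walk " + dir; },
            space: function () { return "interact"; } },
  cards:  { arrow: function (dir) { return "select card " + dir; },
            space: function () { return "play card"; } },
  dialog: { arrow: function (dir) { return "select option " + dir; },
            space: function () { return "confirm"; } }
};

// Dispatch an input ("space", "up", "down", "left", "right")
// to the currently active mode.
function handleInput(modeName, input) {
  const mode = modes[modeName];
  if (input === "space") return mode.space();
  return mode.arrow(input);
}
```

Because every mode answers the same two inputs, a player who learns one mode has effectively learned them all.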

There will be some games out there which cannot be boiled down to arrow keys and space bar (plus the "S" and "H" keys), but even then, I suspect the interaction paradigm can be made consistent across all game modes.

But even in cases where there is a new interaction paradigm needed (say, a fishing mini-game or something that is totally different than the core gameplay), ensuring that some of the interaction elements are consistent – like the persistent availability of the "S" and "H" keys – will help orient users and smooth transitions between interaction schemes.

The more similar and unified your UI control scheme is, the faster a screen reader user can internalize the controls.

Games that Lend Themselves to Screen Reader Play

Obviously, some games are going to lend themselves to screen reader play more than others.  A "choose your own adventure" style game is going to be easier to play with a screen reader than a "Bullet Hell" shooter.  If you're looking to make games for a screen reader, keeping in mind the limitations of funneling all gameplay through textual descriptions is paramount.

One thing that I struggled with in Battle Weary was the fact that sighted players have an advantage over non-sighted users because the layout of the map is so visual.  A sighted user can see at a glance not only their current room, but also the rooms around them.  If they've gotten lost in the forest, and they see the Forest Guide in an adjacent room, it's suddenly clear where they need to go.  Screen reader users do not have that information, so they're going to have to keep wandering around.

I could address this by having a way for screen reader users to "look around".  I actually thought about adding a third persistent key – "L" for "look" – that would let the player get a very detailed breakdown of everything a sighted user would see.

But I wasn't sure that would actually assist gameplay.  It would encourage the same problem mentioned above, where, for optimal play, the player would have to stop after every move and use an interface for "looking around", and I didn't want that.  Ultimately, the game jam's seven-day time limit made the decision for me, but even without the time pressure, I'm not sure I would have implemented it.

Given time, I think I would instead have opted to address that disparity in challenge a different way, such as making the procedural generation of the forest mazes less "twisty" and long, adding a "backtrack" command to head back toward the exit, or letting the player purchase items that teleport them out of the forest, removing that challenge for those stymied by it.  Addressing the underlying problem that the screen reader makes difficult is a better plan than "kludging" a solution that approximates the advantage of a user who can see the screen.

Conclusion

All in all, it was an interesting experiment, and I believe I've been able to deliver a solid, playable game that a screen reader can be used to play.  With some crucial design choices and a little effort, the game experience is quite comparable whether you can see it or not, and it opens a genre that has very little in the way of screen reader support to players who might be interested in it.

The principles and best practices outlined in this article series can be applied to just about any HTML5 game, and could also be used for interactive educational applications and other "managed" interactive experiences.  There are still a few weak spots and places for possible improvement, but the interaction experience is so vastly improved over the tedious and clumsy method that would otherwise be used (by exposing the entire DOM to screen readers in a confusing way), that it appears to be the best approach for heavily interactive experiences like a game.

However, all of the above has only been tested with VoiceOver.  We need to do more testing with other screen readers to see if any other problems crop up or any other best practices emerge.  Until then, I'm going to use this structure for future projects in an attempt to make them playable by a wider audience (at least, for projects that lend themselves to text-based play, or which are legally required to be accessible due to receiving federal funding).
