I was writing internal documentation on how I set up automated language checking at Transloadit. Halfway through, I thought this could be useful to the rest of the world :earth_americas: as well, so I rewrote it in a more generic fashion. I'll attempt to first give a high-level overview of the problem, then dive all the way down to the low-level nuts & bolts of solving it. I hope you'll enjoy it - here goes!

At Transloadit we've been extracting all significantly sized chunks of text (documentation, blog posts, static pages) into a separate content repository.

Up until this migration, our text was scattered across MySQL tables, templates, and HTML files. A big soup of content, layout, code, and locations. Developers could access it - but without much joy. Non-developers didn't stand a chance.

We thought it would be interesting to see if we could attract technical writers and give them full access to our content. We pictured that they could use the GitHub web interface in a wiki-like fashion, improving our language without being distracted by code, accidentally changing it, or needing much skill in that area.

This hasn't reached its full potential yet, but:

  • as developers we're already enjoying working on the (purely Markdown) content
  • having all content in a separate repository opens doors to other cool possibilities, like automated quality control, or: Continuous Integration

Continuous Integration

Continuous Integration is a concept normally associated with code. Here's how ThoughtWorks explained it:

Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early.

At Transloadit we're already using this for all our code. But could we also use this for our English?

A question that was extra relevant for us, because while the majority of our customers live in the United States, Transloadit is Berlin-based, and nobody on our current team is a native English speaker.

Language errors can be especially damaging when people are still in the early stages of evaluating a product, which makes it all the more important for us to have some extra checks in place.

I'm attacking poor content quality in three areas:

  1. Inconsiderate Writing
  2. Messy Formatting
  3. Spelling Errors

Inconsiderate Writing

To quote npm weekly,

Odds are none of us intends to exclude or hurt fellow members of the community, but polarizing and gender-favoring language has a way of slipping into what we write. Sometimes it’s a big help to have a second set of eyes that can look things over, notice what we’ve overlooked, and nudge us towards being more considerate and inclusive. Alex helps "catch insensitive, inconsiderate writing" by identifying possibly offensive language and suggesting helpful alternatives.

To install alex we run a simple npm install --save alex. According to its author,

Alex isn’t as smart as a human, but it tries its best and is sometimes overly happy to let you know something may be insensitive.

This means there may occasionally be false positives, and we don't want alex's warnings to be fatal, so we're using

node_modules/.bin/alex || true

We'll get to see language that alex thinks could be improved, but we won't make those suggestions critical.

For example

  74:74-74:76    warning  `he` may be insensitive, use `they`, `it` instead

We're currently rewriting our docs to be more inclusive thanks to this project.
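
If alex keeps flagging a term that is actually fine in your context, you can whitelist it. As a minimal sketch - assuming your version of alex reads an .alexrc configuration file with an allow key - it could look like this, where the entries are rule identifiers rather than plain words

{
  "allow": [
    "he-she"
  ]
}

Check alex's documentation for the exact identifiers your version reports.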

Messy Formatting

Our text files are in the Markdown format. We chose this format because it

  • is fairly easy for both humans and computers to digest,
  • has a great ecosystem of tools around it, and
  • offers a good separation between a document's structure and its layout. We can specify that something is emphasized, but not that it should be Comic Sans. Those decisions are left to the designers.

Often there are multiple ways to achieve the same goal in Markdown. As with code, it helps to settle on a convention and have every contributor follow it. By taking away some of this (needless) artistic freedom, we end up with documents that look well maintained and invite further contribution.

For this we're using mdast with a lint plugin: npm install --save mdast mdast-lint

Since we don't want to check external projects (like mdast itself) or re-check built artifacts, we're excluding a few locations

echo '_site/' >> .mdastignore
echo 'node_modules/' >> .mdastignore

We then saved the following convention in .mdastrc, though this of course depends on your own taste and lint preferences

{
  "plugins": {
    "lint": {
        "blockquote-indentation": 2,
        "emphasis-marker": "*",
        "first-heading-level": false,
        "link-title-style": "\"",
        "list-item-indent": false,
        "list-item-spacing": false,
        "no-shell-dollars": false,
        "maximum-heading-length": false,
        "maximum-line-length": false,
        "no-duplicate-headings": false,
        "no-blockquote-without-caret": false,
        "no-file-name-irregular-characters": true,
        "no-file-name-outer-dashes": false,
        "no-heading-punctuation": false,
        "no-html": false,
        "no-multiple-toplevel-headings": false,
        "ordered-list-marker-style": ".",
        "ordered-list-marker-value": "one",
        "strong-marker": "*"
    }
  },
  "settings": {
    "gfm": true,
    "yaml": true,
    "rule": "-",
    "ruleSpaces": false,
    "ruleRepetition": 70,
    "emphasis": "*",
    "listItemIndent": "1",
    "incrementListMarker": false,
    "spacedTable": false
  }
}

Then we lint for the first time

node_modules/.bin/mdast --frail .

This may return

_posts/2015-09-15-spelling.md
  246:1      warning  Use spaces instead of hard-tabs         no-tabs

As a bonus, mdast can even attempt to repair this automatically

node_modules/.bin/mdast --output .

We were impressed by how much mdast was able to fix. Make sure your files are committed to Git before running this command, though. You'll want to review the changes made, and revert them if needed. You'll need a few iterations to get this to a good place.
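
Since mdast --output rewrites files in place, a small Git routine keeps its changes easy to review and undo. A sketch (the single file path at the end is just a hypothetical example)

# snapshot first, so mdast's rewrites show up as a clean diff
git add -A && git commit -m "Snapshot before mdast --output"
node_modules/.bin/mdast --output .
git diff                       # review what mdast changed
git checkout -- docs/intro.md  # hypothetical path: revert one file you disagree with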

Spelling Errors

William Dutton, director of the Oxford Internet Institute at Oxford University, says in Spelling mistakes 'cost millions' in lost online sales that in some informal parts of the internet, such as Facebook, there is greater tolerance towards spelling and grammar mistakes.

However, there are other aspects, such as a home page or commercial offering that are not among friends and which raise concerns over trust and credibility. In these instances, a misspelt word could be a killer issue.

You had me at 'concerns'. Let's get to work. For checking spelling in Markdown documents, we're using markdown-spellcheck: npm install --save markdown-spellcheck.

It may not catch grammar problems and many other subtleties ("it's" vs "its"), but at least many unfortunate, stubborn mistakes like these get caught before reaching production:

  • my own fantasy English ("symbiose" vs "symbiosis")
  • stubborn misfires ("editted" vs "edited"), and
  • mixing British with US English ("summarise" vs "summarize")

(in my defense: I'm not a native English speaker :smile:)

What's cool is that markdown-spellcheck will automatically skip code blocks and other Markdowny things - but obviously we still had to ignore things like Transloadit & FFmpeg

echo 'Transloadit' >> .spelling
echo 'FFmpeg' >> .spelling

We're now ready to check our Markdown files for spelling mistakes

node_modules/.bin/mdspell \
  --report \
  --en-us \
  --ignore-numbers \
  --ignore-acronyms \
  **/*.md \
  _layouts/*.html \
  _includes/*.html \
  *.html

This might return that "editted" is not a word.

First Run

There's a good chance the first run uncovers many issues, both with your documents and with the dictionary. It's a good idea to run mdspell without the --report flag so it will enter the default interactive mode.

This allows you to exclude certain files and build a personalized dictionary inside .spelling. It will take a while, and it's a good job for when you want to be productive on an otherwise uninspired afternoon.

As you add new content you'll sometimes have to add words to the whitelist as well. But at least you'll know that every case where a word strays from the dictionary is deliberate. And that's a good feeling.
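
For reference, the whitelist in .spelling is just a plain list of allowed words, one per line. An illustrative excerpt (not our full list, and yours will look different)

Transloadit
FFmpeg
Jekyll
mdast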

Combine

Now let's put this all together.

Since these are all tools we installed from npm, it might make sense to use npm run scripts. In our case, however, I chose a Makefile, simply because we like TABbing through shell autocompletion, and because it gives us the same developer entry point in all of our projects, whether they are written in Node.js, Bash, or Go.

SHELL       := /usr/bin/env bash
tstArgs     :=
tstPattern  := **/*.md _layouts/*.html _includes/*.html *.html

.PHONY: fix-markdown
fix-markdown:
  @echo "--> Fixing Messy Formatting.."
  @node_modules/.bin/mdast --output .

.PHONY: test-inconsiderate
test-inconsiderate:
  @echo "--> Searching for Inconsiderate Writing (non-fatal).."
  @node_modules/.bin/alex $(tstArgs) || true

.PHONY: test-spelling
test-spelling:
  @$(MAKE) test-spelling-interactive tstArgs=--report

.PHONY: test-spelling-interactive
test-spelling-interactive:
  @echo "--> Searching for Spelling Errors.."
  @node_modules/.bin/mdspell \
    $(tstArgs) \
    --en-us \
    --ignore-numbers \
    --ignore-acronyms \
    $(tstPattern)

.PHONY: test-markdown-lint
test-markdown-lint:
  @echo "--> Searching for Messy Formatting.."
  @node_modules/.bin/mdast --frail $(tstArgs) .

.PHONY: test
test: test-inconsiderate test-spelling test-markdown-lint
  @echo "All okay : )"

Now we can run make test to see if all our checks pass.

We can run make test-spelling to only zoom in on spelling mistakes, or make test-spelling-interactive if we want to enter interactive mode after writing content with a lot of new words unlikely to be in the dictionary already.

If you have Bash Completion, just type make, press TAB, and see all the available shortcuts.

Automate

To automate testing, we'll require a Continuous Integration server.

Travis CI, Strider, and Drone.io all fit the bill, as long as there is a central place that executes code in a reliable and repeatable fashion whenever a change is pushed to the repository.
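
If you were to use Travis CI, for example, a minimal configuration could look like the sketch below. It is not our actual setup (we run Jenkins), and it assumes the Makefile from the previous section sits in the repository root

# .travis.yml - run the content checks on every push
language: node_js
node_js:
  - "4"
install:
  - npm install
script:
  - make test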

We're using Jenkins for private projects, and I created 3 new chained jobs for our content repository:

  • content-build turns our Markdown into static HTML content via Jekyll, then triggers:
  • content-test runs all the commands in this post, then triggers:
  • content-inject stores the HTML into our website, then triggers: website-build, website-test, website-deploy - a chain we had already set up to deploy our website.

So now new content can only be injected and deployed if all checks pass. It's a pretty long chain but luckily a machine takes care of that :smile:

And when that machine detects typos in new content, we have a Slack integration set up so we get notified immediately.
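
However your CI server talks to Slack, the notification step itself can be as simple as posting to a Slack incoming webhook. An illustrative sketch (the webhook URL is a placeholder - use your own)

curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"text": "content-test failed: new typos detected"}' \
  https://hooks.slack.com/services/T0000/B0000/XXXXXXXX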

So Is This Perfect Now?

No. Humans are fallible and so are their machines and dictionaries.

I'll need to keep tweaking .spelling, and "it" needs to keep correcting me. But via this automated quality control for language, we keep each other in check, and have fewer errors than before.

In the case of Transloadit, we were able to fix 151 mistakes in our first run

Yes.. it turns out we are - well, were - really bad spellers!