Scrape All Text From a Domain

Here is a script that downloads the most important pages of your site as plain text (how deep the crawl goes is determined by MAX_DEPTH) and saves them into one big DOMAIN.txt file.

This could come in handy when you want to have everything checked for grammar & spelling errors.

After the spellcheck you'd still have to search through your codebase / database to find & fix the culprits, but this should already save you some time in discovery.
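
For that last step, a recursive grep is usually enough. Here is a minimal sketch of the idea; the temporary directory, the two HTML files and the misspelling "recieve" are all made up for illustration, so substitute your own source tree and the words your spellchecker flagged.

```shell
#!/usr/bin/env bash
# Sketch: locate a misspelling reported by the spellchecker in source files.
# The directory, the files and the word "recieve" are hypothetical examples.
workdir="$(mktemp -d)"
printf '<p>You will recieve an email.</p>\n' > "${workdir}/signup.html"
printf '<p>Nothing wrong here.</p>\n'        > "${workdir}/about.html"

# -r: recurse, -n: show line numbers, -i: ignore case, -F: literal string
matches="$(grep -rniF 'recieve' "${workdir}")"
echo "${matches}"

rm -rf "${workdir}"
```

Each match is printed as file:line:text, which tells you where to make the fix.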

#!/usr/bin/env bash
set -e # on Linux, env receives "bash -e" as a single argument, so enable errexit here
#
# Downloads a site's text to 1 text file, so you can easily
# have it grammar/spellchecked
#
# Requires: wget, html2text
# Recommended: pandoc vs html2text
# Improve at: https://kvz.io/blog/2013/04/19/obtain-all-text-from-your-website/
#

[ -z "${DOMAIN}" ]    && echo "Cannot continue without DOMAIN. " && exit 1
[ -z "${EXCLUDE}" ]   && EXCLUDE="*.css,*.js,*.rss,*.xml,*.png,*.jpg,*.jpeg,*.gif,*.flv,*.swf,*.mp4,*.mov,*.mp3,*.wav"
[ -z "${MAX_DEPTH}" ] && MAX_DEPTH="1"
[ -z "${OUTPUT}" ]    && OUTPUT="./${DOMAIN}.txt"
[ -z "${TMPDIR}" ]    && TMPDIR="/tmp"
[ -z "${TXTENGINE}" ] && [ -x "$(which pandoc)" ]    && TXTENGINE="pandoc +RTS -K16m -RTS -t markdown -o- -f html -i"
[ -z "${TXTENGINE}" ] && [ -x "$(which html2text)" ] && TXTENGINE="html2text -nobs"
[ -z "${TXTENGINE}" ] && echo "Cannot continue without pandoc or html2text. " && exit 1

wget \
  --adjust-extension \
  --convert-links \
  --directory-prefix="${TMPDIR}" \
  --level="${MAX_DEPTH}" \
  --no-parent \
  --recursive \
  --reject="${EXCLUDE}" \
  --restrict-file-names=windows,lowercase \
"https://${DOMAIN}"

[ -f "${OUTPUT}" ] && rm -f "${OUTPUT}"
find "${TMPDIR}/${DOMAIN}" -type f -name '*.html' -print0 \
  |while IFS= read -r -d '' file; do
  echo "imported by ${0} from: $(echo "${file}" |sed "s@^${TMPDIR}/${DOMAIN}@@")" >> "${OUTPUT}"
  echo "==================================" >> "${OUTPUT}"
  ${TXTENGINE} "${file}" >> "${OUTPUT}"
  echo -e "\n\n\n\n" >> "${OUTPUT}"
done

echo ""
echo " Combined text file ready in ${OUTPUT}"
echo " To cleanup after this script, type: rm -rf \"${TMPDIR}/${DOMAIN}\""
echo ""
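
The part of the script that walks the downloaded files relies on find emitting NUL-separated names and a read loop consuming them. Here is a small, self-contained sketch of that pattern on a throwaway directory; the file names are invented, and the process substitution (instead of a pipe) is my own choice so the counter is still visible after the loop.

```shell
#!/usr/bin/env bash
# Sketch: iterate over .html files safely, even with spaces in the names.
# The directory and file names below are hypothetical.
workdir="$(mktemp -d)"
touch "${workdir}/index.html" "${workdir}/about us.html" "${workdir}/style.css"

count=0
# -name has to come before -print0, otherwise find prints every file.
# Process substitution keeps the loop in the current shell, so count survives.
while IFS= read -r -d '' file; do
  echo "found: ${file#"${workdir}/"}"
  count=$((count + 1))
done < <(find "${workdir}" -type f -name '*.html' -print0)

echo "html files: ${count}"
rm -rf "${workdir}"
```

Only the two .html files are visited; style.css is skipped by the -name test.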

Required:

$ brew install wget html2text # or apt-get install wget html2text

Run it:

$ DOMAIN=kvz.io ./obtain_site_text.sh
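
All settings come from environment variables with fallbacks, so you can override any of them on the same command line. The sketch below isolates just that fallback logic using the shell's ${VAR:-default} expansion, which is a shorter equivalent of the [ -z ... ] && lines and not what the script itself uses. The describe_run function and the example.com domain are placeholders for illustration.

```shell
#!/usr/bin/env bash
# Sketch of "use the environment value, else a default".
# ${VAR:-default} expands to the default when VAR is unset or empty.
# describe_run and example.com are hypothetical, not part of the script.
describe_run() {
  local domain="${DOMAIN:-example.com}"
  local depth="${MAX_DEPTH:-1}"
  local output="${OUTPUT:-./${domain}.txt}"
  echo "crawl https://${domain} depth=${depth} output=${output}"
}

# Override two settings for a single call, leave OUTPUT at its default.
DOMAIN=kvz.io MAX_DEPTH=2 OUTPUT= describe_run
```

With the real script, the equivalent would be something like: $ DOMAIN=kvz.io MAX_DEPTH=2 ./obtain_site_text.sh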

Recommended:

If you can install Pandoc, the resulting text output will be in Markdown and of much higher quality.
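
If you are not sure which converter the script will end up using on your machine, you can check before running it. This is a rough sketch that mirrors the script's order of preference (pandoc first, then html2text); pick_engine is a hypothetical helper name, and it uses command -v where the script uses which.

```shell
#!/usr/bin/env bash
# Sketch: report which converter would be picked, pandoc before html2text.
# pick_engine is a made-up helper name, not part of the script.
pick_engine() {
  if command -v pandoc >/dev/null 2>&1; then
    echo "pandoc"
  elif command -v html2text >/dev/null 2>&1; then
    echo "html2text"
  else
    echo "none"
  fi
}

echo "Text engine: $(pick_engine)"
```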

Improvements are more than welcome!

Very useful! Thanks!

Debian version:

diff --git a/obtain_site_text.sh b/obtain_site_text-debian_ver...
index 4c44de5..abb34ad 100644
--- a/obtain_site_text.sh
+++ b/obtain_site_text-debian_ver...
@@ -1,4 +1,4 @@
-#!/usr/bin/env bash -e
+#!/bin/bash
 #
 # Downloads a site's text to 1 text file, so you can easily
 # have it grammer/spellchecked
@@ -13,7 +13,7 @@
[ -z "${MAX_DEPTH}" ] && MAX_DEPTH="1"
[ -z "${OUTPUT}" ] && OUTPUT="./${DOMAIN}.txt"
[ -z "${TMPDIR}" ] && TMPDIR="/tmp"
[ -z "${TXTENGINE}" ] && [ -x "$(which pandoc)" ] && TXTENGINE="pandoc+RTS -K16m -RTS -t markdown -o- -f html -i"
[ -z "${TXTENGINE}" ] && [ -x "$(which html2text)" ] && TXTENGINE="html2text -nobs"
[ -z "${TXTENGINE}" ] && echo "Cannot continue without pandoc or html2text. " && exit 1
