Scrape All Text From a Domain
- Author: Kevin van Zonneveld (@kvz)
Here is a script that downloads the most important pages of your
site as plain text (how deep it crawls is set by MAX_DEPTH) and saves
them into one big DOMAIN.txt file.
This could come in handy when you want to have everything checked for grammar & spelling errors.
After the spellcheck you'd still have to search through your codebase / database to find & fix the culprits, but this should already save you some time in discovery.
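#!/usr/bin/env bash
#
# Downloads a site's text to 1 text file, so you can easily
# have it grammar/spellchecked
#
# Requires: wget, html2text
# Recommended: pandoc instead of html2text
# Improve at: https://kvz.io/blog/2013/04/19/obtain-all-text-from-your-website/
#
# Exit on the first failing command. This replaces "-e" in the shebang,
# which Linux passes to env as the single argument "bash -e".
set -e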
[ -z "${DOMAIN}" ] && echo "Cannot continue without DOMAIN. " && exit 1
[ -z "${EXCLUDE}" ] && EXCLUDE="*.css,*.js,*.rss,*.xml,*.png,*.jpg,*.jpeg,*.gif,*.flv,*.swf,*.mp4,*.mov,*.mp3,*.wav"
[ -z "${MAX_DEPTH}" ] && MAX_DEPTH="1"
[ -z "${OUTPUT}" ] && OUTPUT="./${DOMAIN}.txt"
[ -z "${TMPDIR}" ] && TMPDIR="/tmp"
[ -z "${TXTENGINE}" ] && [ -x "$(which pandoc)" ] && TXTENGINE="pandoc +RTS -K16m -RTS -t markdown -o- -f html -i"
[ -z "${TXTENGINE}" ] && [ -x "$(which html2text)" ] && TXTENGINE="html2text -nobs"
[ -z "${TXTENGINE}" ] && echo "Cannot continue without pandoc or html2text. " && exit 1
wget \
--adjust-extension \
--convert-links \
--directory-prefix="${TMPDIR}" \
--level="${MAX_DEPTH}" \
--no-parent \
--recursive \
--reject="${EXCLUDE}" \
--restrict-file-names=windows,lowercase \
"https://${DOMAIN}"
[ -f "${OUTPUT}" ] && rm -f "${OUTPUT}"
# -name must come before -print0, otherwise find prints every file it visits
find "${TMPDIR}/${DOMAIN}" -type f -name '*.html' -print0 \
|while IFS= read -r -d '' file; do
echo "imported by ${0} from: $(echo "${file}" |sed "s@^${TMPDIR}/${DOMAIN}@@")" >> "${OUTPUT}"
echo "==================================" >> "${OUTPUT}"
${TXTENGINE} "${file}" >> "${OUTPUT}"
echo -e "\n\n\n\n" >> "${OUTPUT}"
done
echo ""
echo " Combined text file ready in ${OUTPUT}"
echo " To cleanup after this script, type: rm -rf \"${TMPDIR}/${DOMAIN}\""
echo ""
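The combining step at the end of the script relies on a null-delimited find piped into a while read loop, which keeps file names containing spaces intact. Below is a minimal, self-contained sketch of just that pattern. The helper name list_html and the demo file names are made up for illustration and are not part of the script.

```shell
#!/usr/bin/env bash
# Sketch: print the relative path of every *.html file under a directory,
# safely handling spaces in file names.
list_html() {
  local root="$1" file
  # Tests (-type, -name) go before the -print0 action; an action placed
  # earlier would fire for every file, whatever its extension.
  find "${root}" -type f -name '*.html' -print0 \
    | while IFS= read -r -d '' file; do
        echo "${file#"${root}"/}"
      done
}

# Demo on a throwaway directory (file names are made up for illustration).
demo="$(mktemp -d)"
touch "${demo}/index.html" "${demo}/about us.html" "${demo}/style.css"
list_html "${demo}" | sort
rm -rf "${demo}"
```

In this demo only the two .html files should be listed; style.css is filtered out by the -name test.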
Required:
$ brew install wget html2text # or apt-get install wget html2text
Run it:
$ DOMAIN=kvz.io ./obtain_site_text.sh
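Every setting is read from the environment, so any default can be overridden for a single run by prefixing the command, as with DOMAIN above. The sketch below shows how that resolution behaves. It uses bash's ${VAR:-default} expansion, which is an alternative spelling of the script's [ -z ... ] && checks rather than what the script itself does, and show_settings is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: resolve settings from the environment, falling back to defaults.
show_settings() {
  local domain="${DOMAIN:-}"
  if [ -z "${domain}" ]; then
    echo "Cannot continue without DOMAIN." >&2
    return 1
  fi
  # Empty or unset variables fall back to the same defaults the script uses.
  local max_depth="${MAX_DEPTH:-1}"
  local output="${OUTPUT:-./${domain}.txt}"
  echo "domain=${domain} depth=${max_depth} output=${output}"
}

# Only DOMAIN given: the defaults apply.
DOMAIN=kvz.io show_settings
# Crawl deeper and write to a different file for this run only.
DOMAIN=kvz.io MAX_DEPTH=3 OUTPUT=./site.txt show_settings
```

The same prefixes work on the real script, for example: DOMAIN=kvz.io MAX_DEPTH=3 ./obtain_site_text.sh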
Recommended:
If you can install Pandoc, the resulting text output will be in Markdown and of much higher quality.
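To check which converter a run would pick on your machine, the preference order (pandoc first, html2text as fallback) can be sketched on its own. This version uses command -v instead of which, an assumption made for portability and not what the script does; pick_engine is a hypothetical helper, and the converter flags are copied from the script.

```shell
#!/usr/bin/env bash
# Sketch: report which text engine would be used, preferring pandoc.
pick_engine() {
  if command -v pandoc >/dev/null 2>&1; then
    echo "pandoc +RTS -K16m -RTS -t markdown -o- -f html -i"
  elif command -v html2text >/dev/null 2>&1; then
    echo "html2text -nobs"
  else
    echo "Cannot continue without pandoc or html2text." >&2
    return 1
  fi
}

if engine="$(pick_engine)"; then
  echo "Would convert with: ${engine}"
fi
```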
Improvements are more than welcome!
Legacy Comments (2)
These comments were imported from the previous blog system (Disqus).
Very useful ! Thx !
Debian version:
diff --git a/obtain_site_text.sh b/obtain_site_text-debian_ver...
index 4c44de5..abb34ad 100644
--- a/obtain_site_text.sh
+++ b/obtain_site_text-debian_ver...
@@ -1,4 +1,4 @@
-#!/usr/bin/env bash -e
+#!/bin/bash
#
# Downloads a site's text to 1 text file, so you can easily
# have it grammer/spellchecked
@@ -13,7 +13,7 @@
[ -z "${MAX_DEPTH}" ] && MAX_DEPTH="1"
[ -z "${OUTPUT}" ] && OUTPUT="./${DOMAIN}.txt"
[ -z "${TMPDIR}" ] && TMPDIR="/tmp"
[ -z "${TXTENGINE}" ] && [ -x "$(which pandoc)" ] && TXTENGINE="pandoc+RTS -K16m -RTS -t markdown -o- -f html -i"
[ -z "${TXTENGINE}" ] && [ -x "$(which html2text)" ] && TXTENGINE="html2text -nobs"
[ -z "${TXTENGINE}" ] && echo "Cannot continue without pandoc or html2text. " && exit 1
Hey, it seems that even if I set this line:
[ -z "${MAX_DEPTH}" ] && MAX_DEPTH="10"
I can't get more than the home page.
Do you know why?
I also tried:
DOMAIN=ndd.net MAX_DEPTH=20 sh ../obtain_site_text.sh