RegEx for SEO

Update http to https links in many HTML files at once

With this RegEx Recipe, it's easy to update many HTML files to make your internal links point to https rather than http, after enabling TLS (SSL).

Gary W. Longsine

11 Jun 2021 • 10 min read

When equipping a web site for TLS (Transport Layer Security, formerly known as SSL) this important SEO step is often skipped. Make sure that all of the links in your HTML begin with the secure TLS protocol declaration https:// rather than the older unencrypted http:// protocol.

Your website might have hundreds or thousands of pages with many thousands of lines of HTML which include the older http:// protocol in the link tags. Fixing them can be time consuming, error prone, and not very high on your priority list — but leaving them can cause extra redirects in the conversation between the web server and the web browser clients, which can have a negative impact on your SEO (search engine optimization).

For best search results ranking you'll want to fix this issue by making sure that all of your internal links point to https, not http. This might seem like a minor problem, but under certain circumstances it can have an outsized negative effect on the user experience, and on your SEO.

Does it really matter that HTML links point to http rather than https?

We recently fixed a web site which had developed an odd problem when the site — which had been performing well — was configured to use (the excellent caching and firewall service) CloudFlare as a layer between the host web server and the wild and wooly outside Internet.

Problem Statement

Occasionally, and it seemed almost at random, the site would experience odd errors. Sometimes the site would be rendered without the CSS styling in place. Other times a page wouldn't load at all, producing an error from the web server in the 500 range (notably not a 404 error). The page would usually reload fine upon refreshing the page, but occasionally two or three refreshes were necessary before the page loaded and displayed correctly.

Troubleshooting Process

Since nothing else had really changed (same server configuration, same traffic levels) we decided to investigate the one thing that had changed — the addition of CloudFlare. We made some improvements to the TLS configuration between CloudFlare and the web server, but those had no effect on the odd, sporadic errors.

Next, we exercised the site until we saw the error that had been reported. (It didn't take long: about 1 out of 4 or 5 page loads would produce one of these weird errors as we clicked through the site more or less at random.) As reported, the page would load correctly on a refresh, but we noticed it loaded a little slower than we expected.

Once we had identified a possibly problematic page, we ran the page through the W3C Markup Validation Service to see if there were HTML markup errors. The page validated without errors. That was a bit of a surprise. Sometimes this exact type of odd problem is caused by HTML which is riddled with markup errors, which cause the browsers to guess at the page structure and layout. But not in this case!

So we went back a page, and looked at the link that we had clicked to get here, and noticed that, although the site had been outfitted for TLS years ago, the link pointing to this page was http:// rather than https:// — and as odd as this might seem, this turned out to be the source of the trouble.

Use Ack! (a better grep) to see how many links refer to http:

Using ack, we discovered that the site had thousands of internal links which were pointing to http.

You can count the number of http: references in your HTML using ack, and the Unix utility wc (word count) like this:

% ack --html "http:" . |wc -l

9216

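In that folder (represented by ".") there are 9,216 lines containing the string "http:" in the HTML files. Note that wc -l counts matching lines, so a line which holds several links is counted only once.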
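If ack isn't installed, or if you'd like a count of individual links rather than lines, plain grep can probably get you there. Here's a minimal sketch; the scratch folder and the sample file names are made up for illustration, and the -r, --include, and -o options assume the GNU or BSD (macOS) version of grep.

```shell
# Build a throwaway folder with sample HTML (hypothetical file names).
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>\n' > "$demo/index.html"
printf '<a href="https://example.com/c">C</a>\n<a href="http://example.com/d">D</a>\n' > "$demo/sub/page.html"

# Lines that mention http: (comparable to the ack | wc -l count).
line_count=$(grep -r --include='*.html' 'http:' "$demo" | wc -l | tr -d ' ')

# Individual href="http: occurrences; -o prints each match on its own line.
link_count=$(grep -r --include='*.html' -o 'href="http:' "$demo" | wc -l | tr -d ' ')

echo "lines: $line_count, links: $link_count"
```

In this sample the first file has two links on one line, so the line count (2) is smaller than the link count (3).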

Too Many Redirects (Probably) Caused the Problem

Using the regex recipe below we updated the HTML so that the internal links all pointed to https. We were pleasantly surprised that the problem went away. We didn't investigate the issue further, but the problem hasn't returned.

It appears that the addition of CloudFlare as a firewall for the site increased the number of redirects which happen when a page is requested over http. Updating the links to directly use https eliminated these redirects. The site became more responsive and the odd random page load errors went away.
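If you'd like to see how many redirects a given link triggers on your own site, the response headers tell the story. The following is only a sketch: count_redirects is a hypothetical helper name, and the headers fed to it here are canned sample text so that the example runs without a network connection.

```shell
# Count 3xx responses in a stream of HTTP response headers read from stdin.
count_redirects() {
  grep -Ec '^HTTP/[0-9.]+ 3[0-9][0-9]'
}

# Canned headers standing in for real output: one redirect, then the page.
printf 'HTTP/1.1 301 Moved Permanently\r\nLocation: https://example.com/\r\n\r\nHTTP/1.1 200 OK\r\n\r\n' | count_redirects
# prints 1
```

Against a live site, assuming curl is installed, you could feed it real headers with something like: curl -sIL http://example.com/ | count_redirects (here -s silences the progress meter, -I requests headers only, and -L follows redirects).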

NOTE: If you're just getting started editing your website, this might help you. The term "link" in the context of the web can refer to a few different ways of looking at the same thing. The URI or URL is a structured string of characters which refers to a unique location on the Internet, and tells what protocol to use to fetch it. Here we are referring to the HTML code for the link, which is formally known as the HTML element "a" and which originally stood for "anchor". In the HTML code, it looks like this: <a href="http://example.com">. You'll recognize the URL inside it, that's this part: http://example.com. The URL specifies the protocol (which could be http, https, ftp, or many other protocols), and wrapping the URL in the HTML element "a" is what turns it into a "clickable" link.

If you understand this distinction, it can be easier to google your way to the answers you seek when you run into problems.

Oh, and you might be thinking that a better way to fix this site would be to use relative URLs (of the style: ../my/path/to/myfile.html) and generally that would be better. This site was already using relative URLs in most of the HTML. However, there were references to subdomains in the navigation and elsewhere, which were better expressed as fully qualified URLs with the subdomain, such as: https://blog.example.com/my/path/to/myfile.html where "blog" is the subdomain referenced from HTML files on the example.com website.

So yes, as strange as it might seem, at least sometimes it matters that our internal links point to the https protocol when it's available.

SEO Benefits of updating links from http to https

This type of change is sometimes referred to in the SEO industry as "technical SEO". The algorithms of big search engines including Google and Bing understand and care about problems with web sites which degrade or enhance the user experience, and technical issues, even relatively minor ones, can affect the ranking of a page or an entire website in search results.

Three issues are addressed by updating your http:// link tags to https://.

Reduces the number of redirects (which can improve site performance).
Modernizes one modest aspect of your HTML (the Google algorithm knows so much more about your web site than you might think).
The resulting HTML is more technically correct. Even though validators don't (yet) flag this as an error or warning, HTML links should request the secure page from sites that have TLS configured. (There's a browser plugin, HTTPS Everywhere, which automatically prefers https whenever available from a site.)

I'm not aware that anyone has documented that old links which point to the insecure http protocol on a site which has TLS (SSL) enabled directly affect page rank. However, those links can cause extra redirects and potentially 502 (or other 50x) errors, which hurt the user experience. Updating them to https can remove issues which the Google algorithm is known to care about and which could negatively affect your SEO performance.

Important Safety Tip: Always… Or was it Never?

Here are a few things to think about before you try using regex to edit your web site (or, if you're a web designer, someone else's web site) for the first time.

Don't run these commands as the "root" user. This regex recipe allows you to make enormous numbers of changes to a large number of files in a fraction of a second, and typos happen. If your web server files don't allow you to edit them using your normal user account, you can and should fix that problem first, before trying these regex commands to edit your HTML files. Generally speaking, you want your web server files to be editable by a group, like "staff", and you want your user account to belong to that group. The UNIX commands chown and chmod can help you set this up.
Always, always, always double-check your "present working directory" before you execute any command like this which executes by recursively descending through a folder hierarchy and acting upon the files it finds. A quick "pwd" is all it takes.
Don't edit the "live" web site. Well, if you must edit the live web site, make sure that you have a backup of your web documents folder that can be very quickly restored. On macOS you can use the "ditto" command to easily duplicate a folder hierarchy, like an Apache web server htdocs folder. On macOS or Linux you can use something like "cp -rp".
Use a source code repository like Git or Mercurial to manage the process of making changes to your HTML, CSS, .htaccess, Vhost config, and other text-based files that comprise your website.
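A couple of these safety tips can be sketched in a few lines. This example works in a scratch folder; the folder layout and the backup naming scheme are assumptions made for illustration, not a convention you have to follow. It uses cp -Rp, the POSIX spelling of the recursive copy mentioned above.

```shell
# Scratch stand-in for a web documents folder (hypothetical names).
site=$(mktemp -d)
mkdir -p "$site/htdocs/blog"
printf '<a href="http://example.com/">home</a>\n' > "$site/htdocs/blog/index.html"

cd "$site"
pwd    # confirm where you are before running anything recursive

# Timestamped copy; -R copies the hierarchy, -p preserves modes and times.
backup="htdocs.backup.$(date +%Y%m%d-%H%M%S)"
cp -Rp htdocs "$backup"

# diff -r prints nothing and exits 0 when the two trees match.
diff -r htdocs "$backup" && echo "backup verified: $backup"
```

If a batch edit goes wrong, restoring is a matter of moving the damaged folder aside and copying the backup back into place.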

A RegEx Recipe using find and perl to update http links to https

If you're a web designer or systems administrator with a basic familiarity with a terminal, an editor like vim or emacs, and a shell like bash or zsh, you can use regular expressions to easily and safely perform changes to large numbers of HTML files at once. If you've never used regular expressions before, they can be a bit intimidating.

If you follow these workflow tips, you can rapidly gain confidence and begin using regular expressions to fix issues in your websites.

Workflow Tips for using RegEx to Batch Edit HTML Files

If you're new to batch editing large numbers of files at once, the process can seem a little mysterious, a bit of a black box. Here's an outline of a process you can use to recursively find and edit large numbers of HTML files, safely.

Make sure you have a current (up to the moment before you attempt these edits) backup of your entire website document folder (htdocs) and that you can restore it quickly.
Make your edits in a working copy.
Check the status in the working copy (make sure all HTML files are checked in): % git status
Perform your edit with the regex below.
Check the status again, to verify that files you expected to be modified were modified, and that other files weren't modified: % git status
Look at the edits which were made: % git diff
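If you're satisfied with the edits, add them: % git add "*.html" (the quotes let git itself match HTML files in subfolders; an unquoted *.html is expanded by the shell and only covers the current folder)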
Commit the changes to the repository: % git commit -m "updates http to https for all html files in (folder-name)".
If you're using GitHub or BitBucket, push your changes to the upstream repository: % git push
Then, log into your web site and pull the changes: % git pull

The RegEx Recipe to edit many HTML files using perl -e from http: to https:

Here's a simple recipe which you can use to fix something that often gets overlooked after a web server has been equipped with a TLS (Transport Layer Security, formerly known as SSL) certificate.

This recipe will sweep through all of the HTML files in a folder hierarchy and convert the unencrypted http:// links to secure TLS https://.

If you're new to the world of terminal shells, bash and perl scripts, and stacking unix commands like find to batch-edit HTML files, this can look a little intimidating. If you have a backup, and especially if you're using git, you can easily roll back to your previous position if you make a mistake. If you play with some of these recipes you'll eventually get the hang of it. Each line of UNIX command-line magic can be understood by you! This regex recipe is basically a mini one-line perl script, wrapped in a tiny shell script on the same line. Three Perl switches do the work: -e supplies the script right on the command line, -p tells the Perl interpreter to run that script on each line of the file and print the result, and -i turns on in-place editing so that the result is written back to the file. The script itself is a pattern match and a substitution.

This regex recipe has been tested with the bash shell on macOS 11.4 (Big Sur) and a few prior versions of Mac OS X. Bash, Perl, and the regular expression libraries used by Perl have been pretty stable over the years, so this should work on most modern versions of Linux, including inside containers like Docker and on cloud-hosted servers. It will probably work in a Cygwin environment on Windows, too.

Edit a single file:

$ perl -e "s/<a href=\"http:/<a href=\"https:/g" -pi myfile.html

Edit many files in a folder hierarchy:

$ for each in `find myfolder/ -name "*.html"` ; do perl -e "s/<a href=\"http:/<a href=\"https:/g" -pi $each; done

Note that the above regex recipe matches and edits all of the http: links. This will include any links that you have which point to other web sites. If you want to match only your internal links, limit the match, like this:

$ for each in `find myfolder/ -name "*.html"` ; do perl -e "s/<a href=\"http:\/\/example.com/<a href=\"https:\/\/example.com/g" -pi $each; done

The forward slash mark / and the backslash mark \ both have special roles in the substitution commands above.

The forward slash mark / in the expression above is a divider. In some regular expression handlers, such as in the text editor vi or vim, you may see a substitution command like this, where ":s" means "substitute" and the three / marks define the boundaries of the pattern to match and the string to replace it with.

:s/match_this/change_it_to_this/

The backslash mark \ means "interpret the next character literally, rather than as a control character". We use it in the regular expression to be able to match strings like "http://example.com" which include the forward slash character, so that the match doesn't stop before the end of the string we want to match. This allows us to match the entire string: http:\/\/example.com in this regex recipe.

I should update this page and improve it, later.

SEOtipster #ToDo:

Prism (for nice code styling)
✅ Add an Ack! example.
Test the recipe for both bash $ and zsh % (I'm pretty sure that the shell looping syntax will need to be modified for zsh, so as written above it probably only works in the bash shell).
As originally published (circa June 2021) this article is intended to be useful to both new and experienced web designers. Consequently, parts of it are probably a little opaque for novices, and much of it may seem unnecessary context and explanation for webmasters who are just looking for a simple recipe. Down the road we should probably do a "just the facts, please!" collection of the recipes, with maybe the context available separately. It probably hits a sweet spot in the middle, though, for people who have some experience and haven't tried batch editing, yet.