Links with a trailing slash may be a good thing, even the default, for your webpages, but for certain web applications they can lead to a problem with duplicate content.
You will have trouble with duplicate content you may not even be aware of, but search engines are.
Duplicate Content and SEO
So, why should you care about duplicate content in search engines?
Unfortunately, duplicate content does not mean your page gets listed twice and pushes the competition further down the ranks of the search engine results page, aka SERP. Because that is exactly what would happen naturally, search engines have to take care of this problem.
It is not clear if this also has an impact on your overall rank in search results, but one thing is sure: Google recognizes the duplicate content.
There are hints and rumors that Google may not only penalize your webpages but your whole website for presenting the same content more than once, unless you specify a canonical link in your document's head.
You are probably wondering, "What are you talking about?" To understand, we have to dig deeper into how the web evolved.
Well, it all began with the desire to serve dynamic content, driven by parameters in the URL. These are the links with a question mark and trailing key/value pairs, which look something like this: /index.php?page=2&sort=name
From a search engine's point of view this is not necessarily a bad thing or a problem to work with. Nevertheless, search engines are not very fond of these links for one simple reason: the parameters can be used for so many things and can produce so many different results. In the beginning they were mostly used to reorder or filter the content of one single page.
Many search engines decided to simply strip the parameters. If the parameters are used to simply reorder or filter the content—like in a table—then the different results are not worth indexing. In the example above they would only request the /index.php file no matter how many different links they discovered with different parameters.
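The stripping behaviour described above can be sketched in a few lines of Python; the URLs are made-up examples, and real crawlers are of course far more nuanced:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_parameters(url: str) -> str:
    """Drop the query string and fragment, keeping only scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Every parameterized variant collapses to the same URL:
print(strip_parameters("http://example.com/index.php?page=2"))     # http://example.com/index.php
print(strip_parameters("http://example.com/index.php?sort=name"))  # http://example.com/index.php
```

Since every variant collapses to the same URL, only one document per path ends up in the index, no matter how many parameter combinations exist.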
Unfortunately, along came the developers who funneled all requests through one single index.php. Take one single PHP file and give each page an id and you are basically able to present an infinite number of unique and different pages. This works fine for users and browsers, but many search engines simply stripped the parameters and never indexed the individual pages.
SEO friendly links
Once people realized that the search engines were not indexing the different pages, they came up with what is known as SEO friendly links.
Now just so we are clear: the human readable text is not the real trick; spider bots and indexing engines don't care about human readability. Unique and properly formatted links, i.e. links that avoid the parameters, are what define an SEO friendly link.
Whether the human readable text benefits the ranking may or may not be true; the real benefit is to get your webpages indexed in the first place.
Back to the future
While SEO friendly links are great, because our pages get indexed and are probably ranked higher due to the readable text, we are confronted with a new problem. We thought we went into the future but actually:
SEO friendly links are a step back in time, to how web servers with static HTML pages behave. Let's go back to the beginnings of the web and look at how a typical static website and server setup may look:
- server file setup: /index.html
- → possible other link: /
- server file setup: /about/index.html
- → possible other link: /about/
- server file setup: /contact/index.html
- → possible other link: /contact/
With this simple structure we can set links just like in the file setup, but also note the second possible link and particularly the trailing forward slash. Why does it work and result in a valid response from the web server?
Because the web server recognizes the slash as a directory and is usually configured to look for a file named index.html in such a directory if no filename was provided; as simple as that.
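That lookup can be sketched in Python, assuming the file layout from the list above; this mimics a typical directory-index configuration and is of course not a real server:

```python
# Files that exist under the document root, as in the static setup above.
FILES = {"index.html", "about/index.html", "contact/index.html"}

def resolve(request_path: str, files=FILES):
    """Map a request path to a file, falling back to index.html for directories."""
    path = request_path.lstrip("/")
    if path == "" or path.endswith("/"):
        path += "index.html"  # the directory-index fallback
    return path if path in files else None

print(resolve("/about/"))  # about/index.html
print(resolve("/"))        # index.html
```

Both /about/ and /about/index.html resolve to the same file, which is why both forms of the link work on a static site.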
Why again is this a problem?
This trailing slash will be a problem when we have a web application that creates SEO friendly links and does not care about requests with the trailing slash.
In a web application we usually don't have any HTML files. A web application that creates SEO friendly links does not even need or care about file extensions like .html or .htm. If you have a link like /about, it will understand the request and respond in the very same way as our static example above did for /about/index.html or /about/.
So, what about search engines? What if a search engine is curious about the missing html extension? Is this /about link a file and webpage (like about.html) or a directory (hence more like /about/)? It probably will not send a request for an about.html file, but what about an /about/ request, just to see what happens? How will your web application respond?
There are a couple of things that could go wrong. First, the web application may complain that information is missing and return a 4xx error, or worse, crash. Second, it could simply ignore the trailing slash and respond as if it were not there. This second case is exactly where we get a problem with duplicate content and possibly SEO.
A search engine will get a valid 200 OK response from both requests and it will just so happen that the content is identical.
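To illustrate that second case, here is a hypothetical route handler that silently drops the trailing slash; both requests get a 200 with identical content, which is exactly the duplicate a crawler would see:

```python
def naive_app(path: str):
    """Hypothetical handler that treats /about and /about/ as the same page."""
    pages = {"/about": "About us", "/contact": "Contact"}
    normalized = path.rstrip("/") or "/"  # silently drop trailing slashes
    if normalized in pages:
        return 200, pages[normalized]
    return 404, "Not Found"

print(naive_app("/about"))   # (200, 'About us')
print(naive_app("/about/"))  # (200, 'About us')  <- same content, two URLs
```

From the application's point of view everything works; from the crawler's point of view there are now two distinct URLs serving identical pages.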
If you have Google's Webmaster Tools, it should be easy to spot these problems in the Diagnostics section under HTML suggestions, listed as duplicate meta descriptions or title tags.
Send a redirect
To avoid this duplicate content, my framework sends a 301 redirect to the same link but without the trailing slash. Feel free to give it a try here.
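The article does not show the framework's actual code, but the idea can be sketched as a small WSGI middleware (all names here are my own) that answers any trailing-slash request with a 301 pointing at the slash-less URL:

```python
def redirect_trailing_slash(app):
    """Wrap a WSGI app: 301-redirect /about/ to /about, pass everything else through."""
    def wrapper(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if len(path) > 1 and path.endswith("/"):
            # Permanent redirect tells crawlers to index only the slash-less URL.
            start_response("301 Moved Permanently",
                           [("Location", path.rstrip("/"))])
            return [b""]
        return app(environ, start_response)
    return wrapper
```

The root path / is deliberately left alone, since it has no slash-less equivalent; every other trailing-slash URL now has exactly one canonical, indexable form.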