SEO friendly URL encoding for search results
This series was originally authored as part of an introduction to Apache SOLR - an open source faceted search engine. However, the ideas presented in this article should be applicable to most site search appliances.
Before delving deeper into the types of encodings and the impact that they have on search engine optimisation, it is probably a good idea to get the three letter acronyms (TLAs) out of the way.
- SEP - Search Engine Provider - a provider of world wide web based search such as Google, Yahoo! and MSN. Naturally, there are many other search engine providers, far too many to be listed here. The major ones that come up in conversations about SEO are listed here in order of precedence. This ordering is valid for the Australian market; other markets may not give the same weighting to the list. However, the principles behind SEO should be transferable from one SEP to another.
- SEO - Search Engine Optimisation - constructing URLs and page content to provide maximum spidering ease for SEPs and rank improvement for a particular web site. This article does not delve deeply into the concepts behind SEO friendly page structure, as advice and implementation differ greatly depending on whom you converse with. Suffice it to say, a well structured page separating content and presentation (as is the case with CSS and HTML) is good practice and goes far beyond SEO into accessibility - see the Web Accessibility Initiative for further details (http://www.w3.org/WAI/). The term 'SEO friendly' is used in the context of 'more SEO friendly' or 'less SEO friendly' to indicate possibly better or worse practices respectively for URL encoding.
- SSA - Site Search Appliance - A specific search engine for a site, in this case SOLR would be considered the SSA for the site.
URL and its encoding
Any URL is made up of the following parts (this is not an entirely complete representation, as there is also an optional port number and fragment part of the URL):
<scheme name> : <hierarchical part> <hostname> <path> [ ? <query> ]
The scheme name <scheme name> (or protocol, as it is often known) generally consists of a combination of letters terminated by a colon (':'). In our investigations the two protocols that we will see almost exclusively are http: and https:. The term 'generally' is used because letters are not the whole story: after the first letter, the scheme name can also contain numbers, full stops ('.') (or periods in some countries), pluses ('+') and hyphens ('-').
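The rule for scheme names can be written down as a small pattern. The following is a minimal sketch in Python, assuming the grammar from RFC 3986 (a letter followed by any mix of letters, digits, '+', '-' and '.'); the function name is_valid_scheme is made up for this illustration.

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(scheme: str) -> bool:
    """Return True if `scheme` (given without the trailing colon) is well formed."""
    return SCHEME_RE.fullmatch(scheme) is not None


for candidate in ("http", "https", "svn+ssh", "1http"):
    print(candidate, is_valid_scheme(candidate))
```

Only the last candidate is rejected, because it starts with a digit rather than a letter.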
The <hierarchical part> starts with a double forward slash ('//')
The <hostname> determines the address of the host, either in IP address format (four dot separated groups of numbers, e.g. 127.0.0.1) or as a domain name (a more human readable dot separated address, e.g. www.example.com).
The <path> is a sequence of segments separated by forward slashes ('/'), conceptually similar to a directory structure on a computer.
The [ ? <query> ] part starts with a question mark ('?') followed by a query key (normally a well chosen name) followed by an equals sign ('=') followed by a query value.
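As a small illustration of the key and value structure, the sketch below uses Python's standard urllib.parse module to split a query string into pairs. The query string is the one from the example URL in this article, where several key/value pairs are joined with an ampersand ('&').

```python
from urllib.parse import parse_qsl

# The query part of a URL, without the leading question mark.
query = "key1=value1&key2=value2"

# parse_qsl returns the pairs in the order in which they appear.
pairs = parse_qsl(query)
for key, value in pairs:
    print(f"{key} -> {value}")
```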
This document will focus mainly on the <path> and [ ? <query> ] parts of the URL and on various methods of changing these around to make the URL more SEO friendly.
The following URL has been marked up to show the break-up of the various parts:
http://www.example.com/some/path/segments/?key1=value1&key2=value2
\__/   \_____________/\__________________/\______________________/
 |        hostname       path segments             query
scheme
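The same break-up can be reproduced programmatically. This is a minimal sketch using urlsplit from Python's standard library on the example URL above; it is offered only as a way to check the parts by hand, not as part of any search appliance.

```python
from urllib.parse import urlsplit

url = "http://www.example.com/some/path/segments/?key1=value1&key2=value2"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.path)      # /some/path/segments/
print(parts.query)     # key1=value1&key2=value2

# The path can be broken further into its individual segments.
segments = [segment for segment in parts.path.split("/") if segment]
print(segments)        # ['some', 'path', 'segments']
```

The optional port and fragment are also available as parts.port and parts.fragment; for this URL they are None and an empty string respectively.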
Now that we have this covered, we will start looking at how the URL is parsed from an SEO perspective.
Search Engine Optimisation (SEO) is having a greater impact on how content is authored and the URL at which it can be referenced. In order for search engines such as Google, Yahoo and MSN to index the maximum number of relevant pages, it has become prudent to encode search page URLs in a certain way. This should increase the content that is accessible to search engines when spidering a site.
Whether or not you agree with the practice of SEO, it is a commercial reality that is here to stay. Furthermore, there are other gains to be had from providing an SEO friendly site, especially in the realms of accessibility. On the other end of the spectrum, there are some sites that attempt to manipulate the system by providing both URL rich and keyword rich results with the sole purpose of providing SEO without any regard for the users of the system. A balance should be maintained, with the focus firmly on the users of the system rather than purely on greater search engine rankings. Keeping the users in mind whilst creating a search appliance will not only provide for a happier customer experience, but should also provide more SEO friendly URLs.
Although SEO can be considered a dark art - all search engine providers (SEPs) keep their algorithms a closely guarded secret - some information is available and generally agreed upon. Even so called 'experts' of SEO cannot always agree as to an approach that will gain the best rankings for a particular URL. Furthermore, the SEPs continually refine their algorithms in response to those that attempt to exploit the algorithms with no extra benefit to the sites' users. Still, SEO can be seen as a helper for both the users of the site and the search engines in surfacing new and more relevant content easily.
There is more to SEO than simply the URL encoding method. In fact, entire industries have been spawned to deal with this challenge. Some other optimisations which need to be taken into account for a more SEO friendly site include (in order of perceived relevance):
- Content layout
- Semantic layout
- Keyword and description relevancy.
The actual mechanics of the optimisation process are far too complex, change too rapidly and are not always publicly available, so they cannot be covered here; the continual shifting of algorithmic interpretations would also date the advice rather quickly. In fact, over the years many discussions have been had with various SEO providers whose advice contradicts advice from other SEO providers. It is as though SEO practice is dependent on the time of day and the phases of the moon. Moreover, the 'experts' continually change and update their 'best practice' standards in the pursuit of greater rankings for specific sites. Apart from touching on the URL encoding practices, other facets of SEO are left as an investigation exercise for the reader.
However, distilling the combined knowledge, SEO 'experts' mainly agree that URL encoding is preferable to request parameters, as not all SEPs will parse and recognise key/value request parameters. (There is speculation that some search engines will begin to post data through forms in an attempt to dig deeper within the site to discover more information from the site. It can be seen as a natural progression that SEPs will continue to attempt to refine their spiders so that more of the site is surfaced to users.)
Whilst I have heard arguments that an SEO friendly URL is more memorable to users, I find that this is at best a spurious argument, and if anything is done more for readability of URLs than for memorability. As an example, the following URL from a typical blog:
From a user's perspective, reading the above URL provides a lot more information than a URL with a query parameter:
The first URL can be read and information gleaned from it - it is no great stretch of the imagination that the post was made in the year 2008, in the month of July (07) and the title would be similar to "Dos and Don'ts for SEO Friendly URLs". However it is a far stretch that this could be considered memorable, as the user would need to remember that the post was made in July 2008, apostrophes removed, spaces replaced with hyphens, and all characters lower-cased apart from the characters of the SEO and URL part of the SEO friendly URL.
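The title transformation described above (apostrophes removed, spaces replaced with hyphens, everything lower-cased apart from the acronyms) can be sketched in a few lines of Python. This is an illustration only: the function name slugify and the keep_case set are assumptions made for this example, and it does not reproduce how any particular blog engine builds its URLs.

```python
def slugify(title: str, keep_case: frozenset = frozenset({"SEO", "URLs"})) -> str:
    """Turn a post title into a hyphen separated URL segment.

    Words listed in `keep_case` (a hypothetical allow-list) keep their
    capitalisation; every other word is lower-cased.
    """
    words = title.replace("'", "").split()
    return "-".join(word if word in keep_case else word.lower() for word in words)


print(slugify("Dos and Don'ts for SEO Friendly URLs"))
# dos-and-donts-for-SEO-friendly-URLs
```

Going from the title to the URL segment is mechanical, but a reader trying to recall the URL would have to apply every one of these rules from memory, which is the point being made about memorability.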
This then leads to search engines. When told the title of a post on a particular blog, it is easier to find the post by searching for the title in a search engine (restricting the search to the blog's URL, if it is known) than by hitting the front page of the blog and browsing through the many posts, which becomes more difficult if the post is old and buried deep within the site. Of course, the site's own search functionality (if it exists) would also allow a search on the title. Finding the post becomes more difficult if the original blog web site cannot be remembered, or is spelled incorrectly.
If anything, when emailing a link to others, a quick scan of the more SEO friendly URL will provide hints as to whether it would be worth reading, especially in our time limited lives. Whether or not the article will contain useful information is another question entirely.
In order to investigate various URL encoding strategies, I will be using an example of a DVD site search appliance (SSA).
The examples will all be based on a search for DVDs with the following criteria:
- The title and/or body text contains the term 'super' - i.e. the query string is 'super'
- The DVD must be in the category of 'action' and 'drama'
- We want to sort on price
- The DVD edition must be a special
- We want 25 results per page
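To make the criteria concrete, the sketch below expresses them once as a query-parameter URL and once with the same information moved into the path. Everything specific here is an assumption made for illustration: the base URL, the parameter names (q, category, sort, edition, rows) and the alternating key/value path layout are hypothetical and are not the encoding that SOLR or the rest of this series prescribes.

```python
from urllib.parse import quote, urlencode

# Hypothetical parameter names for the five search criteria.
criteria = {
    "q": "super",
    "category": ["action", "drama"],
    "sort": "price",
    "edition": "special",
    "rows": 25,
}


def as_query_url(base: str, criteria: dict) -> str:
    """Encode the criteria as key/value request parameters."""
    return f"{base}?{urlencode(criteria, doseq=True)}"


def as_path_url(base: str, criteria: dict) -> str:
    """Encode the same criteria as alternating key/value path segments."""
    segments = []
    for key, value in criteria.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            segments += [quote(key, safe=""), quote(str(item), safe="")]
    return f"{base}/{'/'.join(segments)}/"


base = "http://www.example.com/search"
print(as_query_url(base, criteria))
print(as_path_url(base, criteria))
```

Both URLs carry the same information; the difference lies only in whether a spider has to understand request parameters in order to reach the results page.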
We will look at the site search appliance and how it works, and how it all starts to fit together. Using the most basic of URLs we will investigate the thought processes that are needed to implement an SSA and extensions to make it SEO friendly.
Here we go... The Site Search Appliance »