High availability with SOLR and Sitecore 8
When thinking about your Sitecore infrastructure, you have many options to scale out. This example is based on Sitecore 8.1 Update 2. As scaling out can be quite cost-intensive in terms of license and server costs, you need to think about a few factors that influence the scale of your system.
1. What is the number of visitors you expect to visit your website?
2. What is the Service Level Agreement (SLA) towards your client?
3. Do you have peaks that you need to cover?
4. Do the peaks differ much from the general traffic?
5. How and when do the peaks appear (rarely, often, predictable, unpredictable)?
6. From which locations is your website accessed?
7. Where do your authors work from?
8. How many authors will work with Sitecore at the same time?
9. What is the SLA offered by your infrastructure provider?
The case I will describe has the following answers to the above questions:
1. 140,000,000 visitors per year
2. 99.9% availability for the websites
3. Yes, we have peaks that double the number of visits around Christmas time.
4. Peaks double the number of visits.
5. A predictable peak around Christmas, a predictable peak around Easter, and some unpredictable peaks due to successful marketing campaigns.
6. Our website is accessed from 25 countries, from Portugal to Japan. We also cover the Chinese market.
7. Our authors work locally in the 25 countries. Most of them are connected via VDIs that run in Germany.
9. We did not get an SLA from our private cloud host.
So let's walk through the considerations we made in our project to scale out the system.
The most important thing to start with is to check the SLAs of the infrastructure providers you are dealing with. Once you know these, you know how big the gap is that you have to fill.
Assume your cloud provider gives you an SLA of 80% uptime; the gap is then quite big, and you need to fill it with a lot of redundancy across locations.
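To get a feeling for how big such a gap really is, a few lines of code are enough. The sketch below only assumes that server failures are independent; the 80% and 99.9% figures are the numbers from this article:

```python
# Rough availability math for redundant servers.
# Assumption: failures are independent, so N redundant servers
# with individual availability a yield 1 - (1 - a)**N.

def combined_availability(a: float, n: int) -> float:
    """Availability of n redundant servers, each with availability a."""
    return 1 - (1 - a) ** n

def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1 - availability) * 365 * 24

# A single server on an 80% SLA is down roughly 1752 hours per year...
print(downtime_hours_per_year(0.80))
# ...but three redundant 80% servers already reach about 99.2%.
print(combined_availability(0.80, 3))
# Our 99.9% SLA target allows only about 8.76 hours of downtime per year.
print(downtime_hours_per_year(0.999))
```

The independence assumption is the catch: if all servers sit in the same rack or datacenter, a shared failure takes them down together, which is exactly why the later sections spread redundancy across hardware and locations.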
Starting from a normal Sitecore infrastructure with already separated CD (Content Delivery) and CM (Content Management, containing publishing and processing), we use SOLR and run a session server on MongoDB and an analytics collection DB on MongoDB.
First problem: if your infrastructure provider does not guarantee you an uptime of 100% and your website needs to be always up, then the moment your single CD fails, your website is gone.
Second problem: your CD server also has its limits. So, checking the expected visitors and the peaks, you might need at least a second CD anyway.
It is better to have at least two CDs. The number of CDs depends on the number of requests the web servers need to handle and on how your system behaves performance-wise. And of course it depends on your budget, as every CD requires infrastructure, an operating system license and a Sitecore license.
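A back-of-the-envelope calculation can turn the visitor numbers from above into a CD count. Everything except the 140 million visitors and the factor-2 peaks is a made-up assumption here (pages per visit, capacity of a single CD), so replace those with your own measurements:

```python
import math

visitors_per_year = 140_000_000   # from the answers above
peak_factor = 2                   # peaks double the traffic
pages_per_visit = 5               # assumption: average pages per visit
capacity_per_cd = 50              # assumption: page requests/s one CD can serve

avg_visits_per_second = visitors_per_year / (365 * 24 * 3600)
peak_pages_per_second = avg_visits_per_second * pages_per_visit * peak_factor

# Size for the peak, then add one CD for redundancy (N+1).
cds_needed = math.ceil(peak_pages_per_second / capacity_per_cd) + 1
print(f"{peak_pages_per_second:.1f} pages/s at peak -> {cds_needed} CDs")
```

Note that averaging over a whole year hides daily and hourly peaks; in practice you would base the capacity figure on measured peak hours, not on the annual average.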
Let's take a look at the CM machine. Depending on the SLAs you offer to your authors and the number of authors working on the CM at the same time, this unit will also reach its limits.
The next step is to move publishing to a separate server, move processing to another separate server, and also separate the reporting DB from the Master, Core and Web databases.
For sure, publishing and processing require separate Sitecore licenses, so this step is also always a matter of costs.
Starting with Sitecore 8.2.x, Sitecore offers a publishing service that is way faster and does not require a Sitecore license. It can be deployed on a separate server.
We figured out in our project that the session DB is mandatory; if it is down, the CDs will also fail. So no matter how much you scale on the CD side, you also have to scale out your session DB. Mongo usually scales out in clusters of three. You can use two data nodes and one arbiter that only takes part in elections to decide which node becomes the primary and which the secondary. This saves some server resources, as the arbiter does not need much power; you can basically take the smallest server or flavour you can get.
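A minimal replica set with two data nodes and an arbiter looks roughly like this in the mongo shell (the replica set and host names are placeholders for illustration):

```javascript
// Run once against one of the data nodes (mongo shell).
// session-db-1/2 hold the data; session-arb only votes in elections.
rs.initiate({
  _id: "sessionRS",
  members: [
    { _id: 0, host: "session-db-1:27017" },
    { _id: 1, host: "session-db-2:27017" },
    { _id: 2, host: "session-arb:27017", arbiterOnly: true }
  ]
})
```

Each mongod must be started with the matching `--replSet sessionRS` option before the initiation; the arbiter stores no data, which is why the smallest flavour suffices.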
Same goes for the collection DB running on Mongo.
Just a hint: Mongo offers a free version and an enterprise version that requires costly licensing. Check the differences carefully; we are fine with the free one.
Depending on the number of authors and the SLAs towards them, you might need more than one CM server, so if one machine is down or under high load, the other can take over. In normal operation, the load is balanced equally between both.
Another scale-out to be seen in this picture is SOLR. SOLR can run as a cluster of several machines. ZooKeeper, which coordinates the cluster, can also be scaled out so you have redundant machines there as well.
Last but not least, if you don't want to depend on one location, you can scale out to a separate datacenter (DC) in a different location. Reasons can be to protect against denial-of-service attacks or other physical threats. As one datacenter usually shares one backbone, this can also be a single point of failure that you might want to avoid.
Usually you need that level of availability only for your websites. So you again need two CD servers, a SQL server with the Core and Web DBs, and session DBs connected to your CDs.
To feed those several Web DBs you can use either Sitecore publishing targets or SQL merge replication. Merge replication comes into play at the latest when you run datacenters so far away from the Master DB that latency becomes a problem.
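For the publishing-target variant, each additional Web DB is registered like any other Sitecore database: a connection string plus a publishing-target item pointing at it. A sketch of the connection-string part (the name "webdc2" and the server address are invented for illustration) could look like this in ConnectionStrings.config:

```xml
<!-- ConnectionStrings.config (fragment) - "webdc2" is an invented name
     for the Web DB in the second datacenter -->
<add name="webdc2"
     connectionString="user id=sitecore;password=***;Data Source=sql-dc2;Database=Sitecore_WebDC2" />
```

In addition, a publishing-target item under /sitecore/system/Publishing targets with its "Target database" field set to webdc2 makes the new target show up in the publish dialog.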
In addition to that, you can also scale SOLR and the Mongo collections. In this picture, the Mongos are scaled as a replicated sharded cluster with a Mongo router (mongos) in each DC and two shards per datacenter, so replication is done across DCs.
Please note: when you scale out using redundancy, you need to make sure to put replicas or duplicate servers on different hardware, to prevent hardware outages from becoming single points of failure.
The only thing left now is the Master DB, which is still a single point of failure in your infrastructure architecture. Duplicating the Master DB leads to some problems, as it is a Microsoft SQL Server database that the CMs write to.
What you can do is have a replicated passive SQL Server, so that in case your Master DB catches fire, you can bring in the passive one as quickly as possible. More or less, you have created a kind of backup mechanism.
From version 8.2 on, Sitecore supports the SQL Server Always On feature.
If you have just one loadbalancer that routes the incoming requests to your web servers, this is also a single point of failure, so loadbalancers can be scaled out as well.
In our project we created two loadbalancers per site (DNS), so configurations can be separated per site and we have redundancy in case one loadbalancer fails.
When setting up SOLR, you might want to consider a few things.
1. If you run several sites in a single Sitecore instance and require rebuilds of your SOLR indexes, it makes sense to separate the SOLR indexes per site and per DB (so you have three indexes per site). A rebuild, which usually drops all data and rebuilds from scratch, then does not influence the other sites. This way, the indexes also won't become too large.
2. If you build important features based on Sitecore buckets, or retrieve the data from SOLR anyway (e.g. for performance reasons), you might want to separate those functionalities into separate indexes as well, to get separation of concerns and of risk.
3. SOLR provides a feature called SwitchOnRebuild. You have to double the cores for which you want to use this feature and then configure them accordingly. In this case both cores contain the same data. If a rebuild is triggered, the active core keeps serving search and the other features while the duplicate core gets reindexed; once the rebuild is done, the freshly built core takes over serving and the roles are switched on the next rebuild. Please note: this requires double the amount of disk space for indexes.
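On the Sitecore side, enabling SwitchOnRebuild means swapping the index type in the provider configuration and naming the second core. The fragment below is a sketch based on the stock Sitecore.ContentSearch.Solr.Index.Web.config; verify the exact type and parameter names against your Sitecore version, and the "_rebuild" core name is just a convention:

```xml
<!-- Fragment: switch the web index to the SwitchOnRebuild implementation.
     "$(id)_rebuild" is the second (duplicate) SOLR core. -->
<index id="sitecore_web_index"
       type="Sitecore.ContentSearch.SolrProvider.SwitchOnRebuildSolrSearchIndex, Sitecore.ContentSearch.SolrProvider">
  <param desc="name">$(id)</param>
  <param desc="core">$(id)</param>
  <param desc="rebuildcore">$(id)_rebuild</param>
  <param desc="propertyStore" ref="contentSearch/indexConfigurations/databasePropertyStore" param1="$(id)" />
  <!-- strategies, locations etc. stay as in the stock config -->
</index>
```

Both cores must of course exist in SOLR before the first rebuild is triggered.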