Achieving 99.99% Uptime

Right from the beginning, Nexus was built for high redundancy and high security without compromising on user experience. This boils down to two main areas:

  1. System Architecture
    The system architecture for Nexus is designed to achieve 99.99% uptime or better, and to scale with minimal effort.

    Below is the system diagram for the infrastructure in place for Nexus.

    1. NUS has an impressive network in place, with multiple redundancies and internet backbone connectivity. With this infrastructure in place, Nexus does not need to worry about network uptime, since that is handled by the NUS Network Team.
    2. Our department owns two Application Front Ends (AFEs), running in active/passive mode. The AFEs reduce the attack surface of the web servers and protect them from malicious attacks, e.g. DDoS. They also perform HTTP compression, TCP offloading and SSL offloading. This takes those tasks off the web servers so that they can fully concentrate on serving the application. These babies from Crescendo Networks suffer no degradation with all the features turned on, so we have no worries about the AFE boxes becoming the bottleneck.
    3. The web servers are running Windows Server 2008 + IIS 7.0. IIS 7.0 has a cool feature called shared configuration, which allows IIS configuration to be shared across multiple IIS servers automatically! There is no longer a need to go to each server to update IIS. Also, the Windows Server 2008 firewall is turned on automatically, and the required rules are added automatically when you add or remove server roles and features. DFS is also set up on the web servers to automatically mirror the website content and IIS settings of each server.
    4. Both web servers are actually virtual servers running on VMware ESX. We have also bought VMware VirtualCenter, which allows us to manage all our ESX hosts from a single console. It also allows us to clone an existing server to another ESX host even while the server is running. This makes it easy to set up additional web servers to handle increased web load.
    5. The databases are running SQL Server 2008 on Windows Server 2008. DB Mirroring is set up on both the main database and the log database. There are several reasons why we did not go for clustering. The primary one is cost: clustering requires similar hardware and a common SAN storage, and it is hard to scale. DB Mirroring, on the other hand, allows us to host the databases on a powerful machine (the Principal), with a less powerful machine acting as the backup (the Mirror, in SQL Server terms). If a day comes when the Principal is not able to handle the load, all we need to do is get a more powerful machine, configure it as the backup and do a failover. Voila, a more powerful DB server with minimal hassle.
  2. Code Architecture
    Of course, all the best hardware in the world will not help if the code is bad. So what has Nexus done to ensure code quality?

    Nexus source code can be obtained at http://www.codeplex.com/nexus/Release/ProjectReleases.aspx

    1. Code Performance Tracking: Built right into the heart of Nexus is a performance tracker. It tracks how long each user action takes and, if an action takes too long, sends an email to the administrator. (Refer to 3rd Party Libraries -> EntLib -> TimedLog.cs -> Dispose method)
    2. LINQ to SQL: Instead of writing my own data access layer, I used LINQ to SQL as my data layer. Performance could be improved further by converting the queries to compiled LINQ queries, but that's for another day.
    3. Bitwise comparison: In order to have a flexible permission system, I've decided to go for bitwise comparison instead of using Roles. Roles are still used to indicate the various "roles" played by the existing user, but for individual tab permissions, I'm using bitwise comparison. There are a few reasons for this. Firstly, I need to get all the permissions in one shot; I will rarely ever fetch just one particular permission. Secondly, it does not make sense to keep creating bit columns just to add a permission. Finally, using an int gives me 32 bits to play around with, which is more than enough. (Refer to Backend Tiers -> TabAccessLevel.cs)
    4. Security: No one likes a system which is not secure. To ensure sufficient security during the transmission of any sensitive data (e.g. passwords), the web client requests a one-time RSA key. The server keeps the private key required to decrypt the data and sends the client the public key. The client then uses this public key to encrypt the sensitive information and passes it along to the server, which decrypts it to get the actual value. Also, to prevent tampering, each user action is logged and checked against the user's current rights to ensure that they have the appropriate permission before they are allowed to proceed.
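
To make the bitwise permission idea from point 3 concrete, here is a minimal sketch. It is written in Python for brevity rather than the C# that Nexus uses, and the permission names (VIEW, EDIT, DELETE, ADMIN) and the has_permission helper are assumptions made up for illustration; they are not the contents of TabAccessLevel.cs.

```python
from enum import IntFlag


class TabAccess(IntFlag):
    # Hypothetical permission names, one bit each.
    NONE = 0
    VIEW = 1 << 0
    EDIT = 1 << 1
    DELETE = 1 << 2
    ADMIN = 1 << 3


def has_permission(granted: int, required: TabAccess) -> bool:
    """Return True when every bit in `required` is also set in `granted`."""
    return (granted & int(required)) == int(required)


# A single integer column holds every permission a user has on a tab,
# so all of them are fetched in one shot.
stored = int(TabAccess.VIEW | TabAccess.EDIT)

print(has_permission(stored, TabAccess.VIEW))                   # True
print(has_permission(stored, TabAccess.DELETE))                 # False
print(has_permission(stored, TabAccess.VIEW | TabAccess.EDIT))  # True

# Granting a new permission is a bitwise OR, not a schema change.
stored |= int(TabAccess.DELETE)
print(has_permission(stored, TabAccess.DELETE))                 # True
```

Adding a permission later only means defining another bit, which is why a 32-bit int leaves plenty of room.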

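The performance tracker from point 1 can be sketched in the same spirit. This is only an illustration of the pattern (time a block, alert when it exceeds a threshold), not the actual TimedLog.cs code: the TimedAction class, its parameters and the alert callback are hypothetical, and a Python context manager stands in for the C# Dispose method. The alert here just appends to a list where the real system would send an email.

```python
import time
from typing import Callable, List


class TimedAction:
    """Times a block of code and calls `alert` if it runs past the threshold."""

    def __init__(self, name: str, threshold_seconds: float,
                 alert: Callable[[str], None]) -> None:
        self.name = name
        self.threshold_seconds = threshold_seconds
        self.alert = alert
        self.elapsed = 0.0

    def __enter__(self) -> "TimedAction":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs when the block ends, much like Dispose at the end of a using block.
        self.elapsed = time.perf_counter() - self._start
        if self.elapsed > self.threshold_seconds:
            self.alert(f"{self.name} took {self.elapsed:.3f}s "
                       f"(limit {self.threshold_seconds:.3f}s)")
        return False  # never swallow exceptions from the timed block


alerts: List[str] = []  # stand-in for emailing the administrator

with TimedAction("load_dashboard", threshold_seconds=0.01, alert=alerts.append):
    time.sleep(0.05)  # simulated slow user action

with TimedAction("fast_action", threshold_seconds=1.0, alert=alerts.append):
    pass  # finishes well within the limit

print(len(alerts))  # 1
```
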
That, in conclusion, is how we are able to achieve 99.99% uptime for Nexus. Of course, this solution does not come cheap, but the important thing is having enough redundancy and scalability to ensure that user experience is never compromised.
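
For a sense of what a 99.99% target actually allows, the downtime budget is simple arithmetic: the length of the period multiplied by the permitted unavailable fraction. The helper below is a small illustrative calculation (the function name is made up for this sketch), not a figure measured on Nexus.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of downtime allowed over `period_days` at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for label, days in (("year", 365), ("30-day month", 30), ("week", 7)):
    minutes = downtime_budget_minutes(99.99, days)
    print(f"99.99% over a {label}: {minutes:.2f} minutes")
```

Over a 365-day year this works out to roughly 52.6 minutes, which is why every tier described above has a redundant counterpart.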

Categories: Performance