Scale Hacking: Cloud Computing, Software and System Performance

May 21, 2014

Introduction to MongoDB: The Complete Presentation

Last week I gave two great lectures on MongoDB and how to best utilize it, so I decided to share the presentation with you.

Key Presentation Topics

MongoDB Background: Company, Customers and the roots of NoSQL
Why are more people choosing MongoDB?
Data Design for NoSQL
MongoDB Installation
Basic DDL and DML syntax
MEAN (MongoDB, Express, Angular, node.js)
Best Practices for MongoDB migration

P.S. Don't miss my Hebrew Big Data Webinar on July 7th, 2014.

Keep Performing,
Moshe Kaplan

May 13, 2014

6 Easy Steps to Configure MongoDB Replication Set

In this tutorial we'll create a 3-node cluster, where the first node serves as the primary, the second as a failover node, and the third as an arbiter.

1. Setup Mongo and Set a Configuration File
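On all 3 servers, adjust the configuration file /etc/mongod.conf.

If your mongod.conf uses the older key=value format:

#Select your replication set name
replSet=[replication_set_name]
#Select the replication log size (in MB)
oplogSize=1024

If your mongod.conf uses the newer YAML format, set the same values under the replication section (replace the bracketed placeholder with your set name):

replication:
   oplogSizeMB: 1024
   replSetName: [replication_set_name]

Comment out the bind_ip parameter so that mongod does not bind only to the 127.0.0.1 interface:

#bind_ip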

2. Restart All 3 mongod Daemons

> sudo service mongod restart

3. Create an Initial Configuration on the Primary
Log in to the primary mongo shell and create an initial configuration. Make sure to use the private IP and not the loopback address (127.0.0.1):
> mongo

Primary> cfg = {"_id" : "[replication_set_name]", "members" : [{"_id" : 0,"host" : "[Primary_Host_IP]:27017"}]}

Primary> rs.initiate(cfg);

4. Add the Failover Instance to the Replication Set

Primary> rs.add("[Failover_Host_IP]:27017")

5. Add the Arbiter Instance to the Replication Set

Primary> rs.addArb("[Arbiter_Host_IP]:27017")

6. Verify the Replication Set Status
Primary> rs.status()
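
After steps 3 to 5 the replication set configuration holds three members. As a rough sketch of what that document looks like, here is a small Python helper that builds it. The function name and the sample IPs are hypothetical; the arbiterOnly flag is how MongoDB marks a member that votes but holds no data.

```python
def build_replica_set_config(set_name, primary, failover, arbiter, port=27017):
    """Return a replication set configuration document for a 3-node set."""
    def host(ip):
        return "%s:%d" % (ip, port)

    return {
        "_id": set_name,
        "members": [
            {"_id": 0, "host": host(primary)},
            {"_id": 1, "host": host(failover)},
            # The arbiter only votes in elections and stores no data.
            {"_id": 2, "host": host(arbiter), "arbiterOnly": True},
        ],
    }


if __name__ == "__main__":
    # Hypothetical private IPs, for illustration only.
    print(build_replica_set_config("rs0", "10.0.0.1", "10.0.0.2", "10.0.0.3"))
```

Comparing this document with the members section of rs.status() is a quick way to verify that all three nodes were registered with the roles you intended.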

Bottom Line
I wish every data cluster setup was as easy as a setup of a MongoDB replication set.

Keep Performing,
Moshe Kaplan

Apr 25, 2014

Considering SSL? Don't Forget to Choose the Right CDN for That!

Fact #1: More and More Sites are using SSL to Secure Their Users' Transactions
Everybody requires security and privacy these days.
If you don't believe it, take a look at Google, Facebook and Twitter. All of them use HTTPS and SSL to secure all their webpages and API calls, including simple feed and search page presentation.
And yes, this fact is still valid even in the post-Heartbleed era.

Fact #2: Sites and Widgets are Required to Load Faster than Light
In the online business, time is money, and faster webpage load times are worth a lot of money.

Fact #3: Webpages Secured by SSL have Poor Performance
When you surf to a website for the first time (or take a look at a widget), your browser goes through several phases before you can view the website:

Resolve the site DNS.
Call for the first web page:
   Perform a TCP handshake.
   Perform an SSL handshake.
   Retrieve the page itself (+ encryption overhead).
Call the embedded resources: images, CSS and JavaScript files (+ encryption overhead).

As you can see, the initial loading of a regular webpage is not short at all. Adding the SSL handshake, the encryption and decryption work, and the encryption overhead on the content results in even longer times.

What Can be Done?

A common solution is choosing an SSL offloading device such as Radware's Alteon. This device will shorten the encryption and decryption times at the server side. However, it will not reduce the SSL handshake time or shorten the time needed to transfer the encryption overhead of the page.

The only way to shorten this time is shortening the round trip time between the users and your servers. If this sounds like a CDN case study, you are right.

CDN is a Key Solution to Managing HTTPS Traffic

Modern CDN solutions support SSL termination at the edge. Therefore, the SSL handshake time can be reduced from up to 1 second to several dozen ms (see the figures).

This comes on top of the benefit of shortening the static file serving time by serving a cached copy from the CDN edge.

The good news is that this benefit is valid for both static files and dynamic calls.

Figure 1: Lightweight HTTP REST API call: local (left waterfall) vs remote (right waterfall). The local call was made from a server located at the same data center as the web server, and the remote call was made from a remote location with a 200ms round trip to the web server. As we can see, most of the time is due to the network round trips rather than server processing. Please ignore the initial DNS resolve time.

Figure 2: Same call, this time using HTTPS. We can see that the original waiting time is split in two, while the SSL connection time doubles the total. In this case as well, the SSL processing time at the server is negligible (22ms) while the round trips cost us about 420ms. Please ignore the DNS resolve time in this case as well.
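
The figures can be approximated with simple round trip arithmetic. The sketch below is a rough model, not a measurement: it assumes one round trip for the TCP handshake, two round trips for a full SSL handshake (an assumption that roughly matches the ~420ms seen with a 200ms round trip), one round trip for the request itself, and it ignores DNS. The function name and the 20ms edge round trip are hypothetical.

```python
def first_request_latency_ms(rtt_ms, server_ms=0.0, use_tls=False, tls_round_trips=2):
    """Estimate the time to the first response on a fresh connection."""
    tcp_handshake = rtt_ms                                   # SYN / SYN-ACK
    tls_handshake = tls_round_trips * rtt_ms if use_tls else 0.0
    request = rtt_ms + server_ms                             # request + response
    return tcp_handshake + tls_handshake + request


if __name__ == "__main__":
    # Remote origin, 200ms round trip (as in the figures).
    print("HTTP  to origin:", first_request_latency_ms(200))
    print("HTTPS to origin:", first_request_latency_ms(200, use_tls=True))
    # Hypothetical CDN edge, 20ms round trip from the user.
    print("HTTPS to edge:  ", first_request_latency_ms(20, use_tls=True))
```

Under these assumptions every handshake round trip is multiplied by the distance to the server, which is why terminating SSL close to the user pays off even for dynamic calls.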

CDN Selection for HTTPS Traffic Cases

While many CDN services support SSL offloading on their own domain (e.g. https://your_domain.cdn_provider.com), you would probably like to use your own domain name (e.g. https://your_domain.com). Therefore, you should verify that the CDN provider supports custom SSL certificates. At the time of writing, common cloud CDN providers such as AWS and MaxCDN are known to support it, while providers like Microsoft Azure don't.

Bottom Line

CDNs are a cornerstone of every web scale deployment these days, and many times you will find they solve issues you did not expect them to solve.

Keep Performing,

Moshe Kaplan

Apr 10, 2014

Looking for PostgreSQL Performance Issues

As traffic goes up, even your PostgreSQL may become a bottleneck.
In these cases it is important to analyze the traffic and understand the usage pattern. That way you will be able to tune the system to meet the challenge.

Understand Usage Pattern at Peak Time
Use the PostgreSQL top project (pg_top) to get key usage patterns in real time:

Current active SQL statements running
Query plans
Locks
User tables and indexes statistics

Understand Overall Usage Pattern

To get a broad insight into PostgreSQL behavior, use pgFouine. This tool analyzes the PostgreSQL logs and provides detailed usage pattern reports such as leading queries, duration, queries by type and query patterns.

You can get some of these metrics by querying the statistics views in the pg_catalog schema (such as pg_stat_user_tables and pg_stat_user_indexes), and use log_statement to log all queries for analysis.

Enable Slow Queries

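Probably the #1 tool to eliminate performance issues:

Add pg_stat_statements to the shared_preload_libraries setting in postgresql.conf.
Restart the PostgreSQL daemon.
Run CREATE EXTENSION pg_stat_statements; in the database you want to inspect, so that the view is available.
Use the pg_stat_statements view to pinpoint the bottlenecks: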

SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
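
The hit_percent column in this query is the share of block reads served from the shared buffers. A minimal Python sketch of the same arithmetic, with a hypothetical function name, shows why nullif is there: a statement that touched no blocks would otherwise divide by zero.

```python
def hit_percent(shared_blks_hit, shared_blks_read):
    """Mirror the SQL expression: 100.0 * hit / nullif(hit + read, 0)."""
    total = shared_blks_hit + shared_blks_read
    if total == 0:
        # nullif(..., 0) turns the divisor into NULL, so the result is NULL.
        return None
    return 100.0 * shared_blks_hit / total


if __name__ == "__main__":
    print(hit_percent(900, 100))  # mostly served from cache
    print(hit_percent(0, 0))      # statement touched no blocks
```

A low value on a statement with a high total_time is a hint that the query reads a lot from disk and may need an index or more memory.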

Explain the Execution Plans
Use the EXPLAIN statement to analyze the execution plans of slow queries and eliminate the bottlenecks in them:

EXPLAIN (FORMAT JSON) SELECT * FROM table_name;

Bottom Line
Using these great tools, you can boost your PostgreSQL and meet the business challenges.

Keep Performing,
Moshe Kaplan

Mar 30, 2014

How to Migrate from MySQL to MongoDB

Last week I worked on a key project to migrate a BI platform from MySQL to MongoDB. The product, whose development is headed by Yuval Leshem, is gaining major adoption and the company was facing a scale challenge.
We chose MongoDB as the platform's data infrastructure to support a high data insert rate and scalable data analysis.
Unlike many projects of this type, we accomplished the migration from plan to production in a week, mostly due to a smart and simple plan.

I wanted to share with you some of the lessons we learnt during the process:

Data Migration: Mongify
This tool provides a two-step solution to migrate your RDBMS (MySQL) to NoSQL (Mongo):

Map the database structure
Export the data and import it according to the defined structure

Since it is open source, you can easily dive into the code and adjust it to your own business case. Moreover, the code is maintained by Andrew Kalek, who is very cooperative.

Filter by Date Part (Day, Month, Year..)
If you are used to the DB date part functions such as YEAR() and MONTH(), there are options to do the same in MongoDB (see $where and the aggregation framework). However, both require intensive IO. The best solution in this case is to save 3 (or more) additional fields for each original date field. These fields hold the relevant date parts and can be indexed for efficient queries:

[original field]
[original field]_[year part]
[original field]_[month part]
[original field]_[day part]
[original field]_[hour part]
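
A minimal sketch of how the data layer could populate these fields before each insert. The function name and the _year, _month, _day and _hour suffixes are my own naming for this example; pick whatever convention fits your schema.

```python
from datetime import datetime


def add_date_parts(doc, field):
    """Return a copy of doc with the date parts of doc[field] as extra fields."""
    value = doc[field]
    enriched = dict(doc)  # keep the caller's document untouched
    enriched[field + "_year"] = value.year
    enriched[field + "_month"] = value.month
    enriched[field + "_day"] = value.day
    enriched[field + "_hour"] = value.hour
    return enriched


if __name__ == "__main__":
    event = {"name": "login", "created": datetime(2014, 3, 30, 14, 5)}
    print(add_date_parts(event, "created"))
```

With these fields in place, a filter such as "all events from March" becomes a plain equality match on created_month that an index can serve.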

Default Values
MongoDB has no defined schema, so there are no default values either. Therefore it's up to your data layer (or ORM) to take care of them.
This is relevant to default timestamps as well.
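
A minimal sketch of filling defaults in the data layer, assuming documents are plain dictionaries. The field names and default values are hypothetical; callables are used for values that must be computed per document, such as a default timestamp.

```python
from datetime import datetime, timezone

# Hypothetical defaults for an "orders" collection.
DEFAULTS = {
    "status": "new",
    "retries": 0,
    "tags": list,  # callable: a fresh list for every document
    "created_at": lambda: datetime.now(timezone.utc),
}


def with_defaults(doc, defaults):
    """Return a copy of doc where missing fields get their default value."""
    result = dict(doc)
    for key, default in defaults.items():
        if key not in result:
            result[key] = default() if callable(default) else default
    return result


if __name__ == "__main__":
    print(with_defaults({"status": "paid"}, DEFAULTS))
```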

Data Casting
Same case as with default values: your app should take care of it.
Please note that there is a defined mapping between values and types that you can find in the Mongify code.
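
A minimal sketch of casting in the app, assuming values arrive as strings (for example from an export file). The field names and the type map are hypothetical and are not taken from the Mongify mapping.

```python
def cast_fields(doc, field_types):
    """Return a copy of doc with the listed fields cast to the given types."""
    result = dict(doc)
    for key, caster in field_types.items():
        # Missing fields and explicit nulls are left as they are.
        if key in result and result[key] is not None:
            result[key] = caster(result[key])
    return result


if __name__ == "__main__":
    row = {"user_id": "42", "amount": "9.90", "note": "first order"}
    print(cast_fields(row, {"user_id": int, "amount": float}))
```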

Auto Numbers (1..N)
Same case here, and you will have to choose one of the following ways:

Shift your way of thinking about auto increment ids and start using MongoDB's automatic "_id"s as a solution.
Generate the auto increment ids using a counters database and findAndModify (in this case I recommend a special purpose database with a 1:1 collection mapping, to benefit from more granular locking in future releases). For details see the link on top.
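
To illustrate the counters approach: each sequence is a single counter that is incremented and read in one atomic step. In MongoDB that step would be a findAndModify with $inc and upsert; in the sketch below an in-memory class with a lock stands in for the counters collection so that the example runs without a server. The class and method names are hypothetical.

```python
import threading


class InMemoryCounters:
    """Stand-in for a counters collection: one counter per sequence name."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def next_id(self, name):
        # The lock plays the role of findAndModify's atomicity:
        # increment and read happen as one step.
        with self._lock:
            self._values[name] = self._values.get(name, 0) + 1
            return self._values[name]


if __name__ == "__main__":
    counters = InMemoryCounters()
    order = {"_id": counters.next_id("orders"), "item": "book"}
    print(order)
```

The important property is that the increment and the read cannot be separated; doing a read followed by a separate update would hand the same id to two concurrent writers.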

Mongoose as an ORM
If you use node.js, consider using Mongoose as your ORM. It will solve many of your issues by adding structure to your schema. However, please note that you may lose some flexibility.

Data Analysts
MongoDB is not SQL compliant, and your data analysts will have a hard time with it. However, you can ease the change by using the following two methods:

Introduce them to Query Mongo.
Make sure your documents have no sub documents unless you have to. Otherwise, transforming the data to a tabular view will require a major effort from them.
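
When sub documents cannot be avoided, a small flattening step can produce the tabular view for the analysts. This is a sketch that joins nested keys with a dot; the function name and the separator are my own choices, and arrays are left untouched.

```python
def flatten(doc, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in doc.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


if __name__ == "__main__":
    doc = {"user": {"name": "dana", "geo": {"lat": 32.1}}, "total": 5}
    print(flatten(doc))
```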

Avoid Normalizing Your Data
If you designed your data infrastructure as a non-normalized structure, it will be much easier to move the data to NoSQL. If your data is normalized, it is better to let the app take care of the data reconstruction.

Queries Results Limitation

Some MongoDB results are returned as a single document and are therefore limited to the maximum document size (16 MB). If you need to query 200K+ records, you may need to page the data using skip and limit (or better, by adding a filter based on the key value of the last row in the previous page).
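
A sketch of the second, key based paging option. To keep it runnable without a server, a Python list stands in for the collection and fetch_page stands in for a query that filters on _id greater than the last seen value, sorts by _id and limits the result; both function names are hypothetical.

```python
def fetch_page(rows, last_id, page_size):
    """Stand-in for: filter _id > last_id, sort by _id, limit page_size."""
    matching = [r for r in rows if last_id is None or r["_id"] > last_id]
    return sorted(matching, key=lambda r: r["_id"])[:page_size]


def iterate_all(rows, page_size):
    """Walk the whole result set page by page, resuming from the last key."""
    last_id = None
    while True:
        page = fetch_page(rows, last_id, page_size)
        if not page:
            return
        for row in page:
            yield row
        last_id = page[-1]["_id"]  # the next page starts after this key


if __name__ == "__main__":
    rows = [{"_id": i} for i in range(1, 8)]
    for row in iterate_all(rows, page_size=3):
        print(row)
```

Unlike skip, which has to walk over all the skipped rows again on every page, the key based filter can use the index on the key, so late pages cost about the same as early ones.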

Bottom Line
Migration from MySQL to MongoDB requires some effort and a shift in your state of mind, but it can be done relatively fast with careful planning according to the steps described above.

Keep Performing,
Moshe Kaplan

Mar 5, 2014

MySQL Indexing: Don't Forget to Have Enough Free Space

When you modify your indexes in MySQL (and especially in MyISAM), make sure that the free space on the disk that holds your tmpdir folder is larger than your largest index file.

Why Do We Need Such a Large Free Space?
MySQL uses the tmpdir to copy the original index file to and "repair it" by sorting the data.

What Happens if We Don't Have Enough Space?
In this case MySQL will do its best to modify the index file within the available space. The result is a very slow (or never-ending) process and poor results. If you check SHOW PROCESSLIST, you will find the state "Repair with keycache" instead of "Repair by sorting".

What to Do?
Make sure you have enough free space (> largest index file) and that the tmpdir option points to a folder on this disk.
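
A small sketch of this check, to run before a large index change. It assumes MyISAM tables (index files end with .MYI) and that you know your data directory and tmpdir; the paths in the usage example are common defaults and may differ on your server, and the function names are hypothetical.

```python
import os
import shutil


def largest_index_file(data_dir, suffix=".MYI"):
    """Return the size in bytes of the largest index file under data_dir."""
    largest = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(suffix):
                size = os.path.getsize(os.path.join(root, name))
                largest = max(largest, size)
    return largest


def has_enough_tmp_space(data_dir, tmp_dir):
    """True when the free space on tmp_dir's disk exceeds the largest index file."""
    free = shutil.disk_usage(tmp_dir).free
    return free > largest_index_file(data_dir)


if __name__ == "__main__":
    # Assumed default locations; adjust to your datadir and tmpdir settings.
    print(has_enough_tmp_space("/var/lib/mysql", "/tmp"))
```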

Bottom Line
Make sure you have enough free space to get the best performance.

Keep Performing,
Moshe Kaplan

Feb 21, 2014

When Should I Use MongoDB rather than MySQL (or other RDBMS): The Billing Example

NoSQL has been a hot buzz in the air for a pretty long time (well, it's not only a buzz anymore).

However, when should we really use it?

Best Practices for MongoDB

NoSQL products (and among them MongoDB) should be used to meet challenges. If you have one of the following challenges, you should consider MongoDB:

You Expect a High Write Load

MongoDB by default prefers a high insert rate over transaction safety. If you need to load tons of data lines with a low business value for each one, MongoDB should fit. Don't do that when recording $1M transactions, or at least add extra safety measures in these cases.

You need High Availability in an Unreliable Environment (Cloud and Real Life)

Setting up a replicaSet (a set of servers that act as Master-Slaves) is easy and fast. Moreover, recovery from a node (or a data center) failure is instant, safe and automatic.

You need to Grow Big (and Shard Your Data)

Database scaling is hard (the performance of a single MySQL table will degrade when it crosses 5-10GB). If you need to partition and shard your database, MongoDB has an easy built-in solution for that.

Your Data is Location Based

MongoDB has built-in spatial functions, so finding relevant data from specific locations is fast and accurate.

Your Data Set is Going to be Big (starting from 1GB) and Schema is Not Stable

Adding new columns to an RDBMS can lock the entire database in some databases, or create a major load and performance degradation in others. Usually this happens when the table size is larger than 1GB (and it can be a major pain for a system like BillRun, which is described below and has several TB in a single table). As MongoDB is schema-less, adding a new field does not affect old rows (or documents) and is instant. Another plus is that you do not need a DBA to modify your schema when the application changes.

You Don't have a DBA

If you don't have a DBA, and you don't want to normalize your data and do joins, you should consider MongoDB. MongoDB is great for class persistence, as classes can be serialized to JSON and stored AS IS in MongoDB. Note: If you are expecting to go big, please note that you will need to follow some best practices to avoid pitfalls.

Real World Case Study: Billing

In the last ILMUG, Ofer Cohen presented BillRun, a next generation Open Source billing solution that utilizes MongoDB as its data store. This billing system runs in production at the fastest-growing cellular operator in Israel, where it processes over 500M CDRs (call data records) each month. In his presentation Ofer showed how this system utilizes MongoDB's advantages:

Schema-less design enables rapid introduction of new CDR types to the system. It lets BillRun keep the data store generic.
Scale: the BillRun production site already manages several TB in a single table, w/o being limited by adding new fields or by growth.
Rapid replicaSet setup enables meeting regulation with an easy to set up multi data center DRP and HA solution.
Sharding enables linear, scale out growth w/o running out of budget.
With over 2,000/s CDR inserts, the MongoDB architecture is great for a system that must support a high insert load. Yet you can guarantee transactions with findAndModify (which is slower) and two-phase commit (application wise).
Developer oriented queries enable developers to write elegant queries.
Location based queries are utilized to analyze user usage and determine where to invest in cellular infrastructure.