Scale Hacking: Cloud Computing, Software and System Performance: 2015

Nov 25, 2015

Apache htaccess Debugging Ugly. This Will Save Your A$$...

If you have ever created redirection rules in an Apache .htaccess or configuration file, you probably know that things can easily turn ugly. Without debugging tools, and with a long testing cycle, debugging can be painful.

The htaccess tester tool can solve your issues: just enter the requested URL and the actual .htaccess that is being used, and you will get the actual result.

Keep Performing,
Moshe Kaplan

Sep 20, 2015

5 Immediate Steps to Take Care of Your MongoDB Performance

Are you facing performance issues in your MongoDB setup?
If so, use the following steps to give your system some first aid and buy time for a long term architecture change (such as sharding).

Step 1: Enable slow queries
Get intelligence about your system's behavior and performance bottlenecks. There is usually a high correlation between slow queries and your performance bottleneck, so use the following method to enable profiling for operations slower than 100 ms (they are recorded in the system.profile collection):
db.setProfilingLevel(1, 100);

Step 2: Use explain
Explore the problematic queries using explain. You can also use mtools to analyze the logged queries and find the most frequent ones.

Step 3: Create indexes
Your analysis should result in new indexes that improve the queries.
Don't forget to build the indexes in the background to avoid collection locking and system downtime.

Step 4: Use sparse indexes to reduce the size of the indexes
If your documents are sparse and you use the $exists keyword heavily in your queries, sparse indexes (which include only the documents that contain your field) can minimize your index size and boost your query performance.
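
As a rough sketch of steps 1 to 4, the snippet below builds the database command documents that sit behind the shell helpers: the profile command (what db.setProfilingLevel(1, 100) sends) and a createIndexes command with the background and sparse options. The collection and field names (orders, coupon_code) are made up for illustration, and the dictionaries would still have to be sent through your driver of choice.

```python
def profiling_command(level=1, slow_ms=100):
    """Command document equivalent to db.setProfilingLevel(level, slow_ms)."""
    return {"profile": level, "slowms": slow_ms}


def index_command(collection, field, sparse=False):
    """createIndexes command for one ascending index, built in the background."""
    spec = {
        "key": {field: 1},
        "name": f"{field}_1",   # MongoDB's default naming: <field>_<direction>
        "background": True,     # avoid locking the collection during the build
    }
    if sparse:
        spec["sparse"] = True   # index only documents that contain the field
    return {"createIndexes": collection, "indexes": [spec]}


# Hypothetical collection and field names, for illustration only.
print(profiling_command())
print(index_command("orders", "coupon_code", sparse=True))
```

Keeping the options in one helper makes it harder to forget the background flag when a new index is added under pressure.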

Step 5: Use secondary preferred to offload queries to slaves
You probably have a replica set, and it's a waste of resources not to use your slaves for read queries (especially for reporting and search operations).
By changing the read preference in your connection string to secondaryPreferred, your application will try to run read queries on the slaves before falling back to your master.
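
A minimal sketch of such a connection string, assuming a three-node replica set. The host names, database name and replica set name are placeholders; replicaSet and readPreference are standard MongoDB URI options.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, database, replica_set):
    """Build a MongoDB URI that prefers secondaries for reads."""
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": "secondaryPreferred",
    })
    return f"mongodb://{','.join(hosts)}/{database}?{options}"


# Hypothetical hosts, database and replica set name.
print(replica_set_uri(["db1:27017", "db2:27017", "db3:27017"], "reports", "rs0"))
```

Keep in mind that reads from secondaries can be slightly stale, which is usually acceptable for reporting and search.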

Bottom Line
Using these simple methods, you can gain time and space before hitting the wall.

Keep Performing,
Moshe Kaplan

Sep 2, 2015

Prepare for Failure in Your AWS Environment

In the cloud everything can happen.
Actually everything will happen.

Therefore, in your design, you should be ready for failures: even if you expect your disk mounts to be there for you, they might not be. And if you are auto scaling, it is likely that once in a while they won't be there for you.

Therefore, to avoid servers that hang because a disk failed to mount, with error messages such as "The disk drive for /tmp is not ready yet or not present", make sure your servers do not depend on your disks at boot (otherwise you will not be able to contact your servers, or OpsWorks will notify you that the server is booting forever).

Avoiding Waiting for Your Mount
The secret is a small option, nobootwait, which makes sure your server does not wait for the mount to be ready. You can configure it in your /etc/fstab, or even better in your Chef recipe:
mount "/tmp" do
  device "172.32.17.48:/tmp"
  fstype "nfs"
  options "rw,nobootwait"
  action [:mount, :enable]
end
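
For the /etc/fstab route, the equivalent entry can be sketched as follows. The helper only formats the six fstab fields; the trailing "0 0" (dump and fsck pass) is an assumption, not something stated above. Note that nobootwait is understood by Ubuntu's mountall, so other init systems may need a different option (nofail is the usual counterpart).

```python
def fstab_entry(device, mount_point, fstype, options, dump=0, fsck_pass=0):
    """Format one /etc/fstab line: device, mount point, type, options, dump, pass."""
    return f"{device} {mount_point} {fstype} {','.join(options)} {dump} {fsck_pass}"


# Same device, mount point and options as the Chef recipe above.
print(fstab_entry("172.32.17.48:/tmp", "/tmp", "nfs", ["rw", "nobootwait"]))
```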

Bottom Line
The right design will help you keep your system running in a cloud based environment

Keep Performing,
Moshe Kaplan

Jul 31, 2015

MongoDB 3.0 WiredTiger is Big News for Multi Tenant Deployments

A database per tenant is a common practice in multi tenant systems.
Why? Well, first, it's easier to implement, as you don't need to record the tenant id in every document, and second, it is easier to avoid security issues caused by a compromised tenant id.

MongoDB 3.0 Storage Engines
MongoDB introduced a new concept in version 3.0 (one that is familiar to MySQL users): you can select your preferred storage engine. You can continue using the old MMAPv1 storage engine (which is still the default in version 3.0) or select WiredTiger, which offers compression, document level locking and better performance in some cases.

What is Wrong with MMAPv1
For every new database, MMAPv1 creates a data file with a minimal size of ~70MB (and when this one fills up, a new file is created with double the size).
This may not be an issue for large databases, but if your system design is built on a large number of tenants, many of whom will never have more than a few records (let's say trial tenants that decided not to go on w/ your system), you are going to have a 70GB disk allocation for every 1,000 tenants.
This behavior results in another side effect: very high IO usage (mostly reads), due to the need to read a large file every time in order to update or insert very few rows.

WiredTiger Comes to the Rescue
WiredTiger includes a compression method that can cut 85% of your storage needs in a large database (trust me, I saw this number in a 10TB billing system). That is great, but more importantly, it does not allocate this 70MB file for each database (it makes do with two small files for indexes and data).
The result: for a 3,500 tenant system, the database shrank from 330GB to under 1GB... not to mention the IOPS, which dropped from 12,000 to 600...
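
The numbers above can be checked with some back-of-the-envelope arithmetic. The sketch below uses only figures quoted in this post (~70MB per database, 330GB to 1GB, 12,000 to 600 IOPS) and treats "under 1GB" as exactly 1GB, so the storage saving is a lower bound.

```python
MIN_DATA_FILE_MB = 70  # approximate minimal MMAPv1 allocation per database


def mmapv1_floor_gb(tenants, per_db_mb=MIN_DATA_FILE_MB):
    """Minimal disk allocation (GB) when every tenant has its own database."""
    return tenants * per_db_mb / 1000


def reduction_pct(before, after):
    """Percentage saved when going from `before` to `after`."""
    return (before - after) / before * 100


print(mmapv1_floor_gb(1000))                   # 70.0 GB for 1,000 tenants
print(mmapv1_floor_gb(3500))                   # 245.0 GB floor for 3,500 tenants
print(round(reduction_pct(330, 1), 1))         # storage: 330GB -> ~1GB
print(round(reduction_pct(12000, 600), 1))     # IOPS: 12,000 -> 600
```

The 245GB floor is lower than the 330GB that was measured, which fits the file doubling behavior described above: some tenants had already outgrown their first data file.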

Bottom Line
If you are using MongoDB for a multi tenant system, WiredTiger can cut your storage needs by 99% and your IOPS by 95%. This can save a fortune.

Keep Performing,
Moshe Kaplan

May 30, 2015

1 Click from Code to Prod: Spark, Scala, sbt, Intellij and Hadoop

You will probably find dozens of Q&A articles on how to create a new Scala project using Intellij and submit it to a remote Hadoop based Spark cluster. However, none of them is actually complete or shows the full picture.
This post is going to save you a lot of time, so stay tuned...

Expected Outcome: an environment that will let you in one click move from coding your Scala to submitting it to a remote YARN based Spark cluster.

Note: some issues like "no spaces" may be overcome using double quotes or other methods, but we recommend that you follow the process and keep it simple to avoid unexpected outcomes.

Prerequisites

Install JDK. Make sure JDK is placed in a folder w/o spaces (for example C:\Java)
Configure the JAVA_HOME environment variable to your installation location
Download and install the latest Intellij IDE
Install the Intellij Scala plugin
Download and install scala. Again make sure scala is placed in a folder w/o spaces (for example C:\scala)
Set the environment variable for Scala
Download and install sbt 0.13.8. Again make sure sbt is placed in a folder w/o spaces (for example C:\sbt)
Add the sbt folder to the PATH environment variable

On Windows, you will need to 1) download winutils.exe; 2) create a Hadoop folder; and 3) place winutils.exe in a bin folder inside it, in order to avoid the following error:

Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

If your environment is connected to the internet through a proxy, add the following proxy parameters for both HTTP and HTTPS to the JAVA_OPTIONS environment variable:
-Dhttp.proxyHost=yourcache.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=yourcache.com -Dhttps.proxyPort=8080

Note: when working w/ Intellij, consider using the early access program for quick fixes especially in a dynamic environment like Scala and Spark

Create the Initial Project

Create a new Scala (and not an sbt one) project in Intellij
Select the right JDK version (based on the installation you made before)
Select the right Scala SDK to match the cluster (see below): click on create and select the right one (or click on download and get it).

Create the Basic Project Files

This should be done in the file system and not inside the Intellij IDE to avoid surprises.

Create a build.sbt file in the project root. Please notice:

Matching the Spark client and cluster versions.
Matching the Hadoop client and cluster versions.
Matching the Scala and Spark versions by looking for the spark-core package in Maven Central. In our case you should look for the Spark cluster version (1.2.0) and then get the matching Scala version (2.10) from the ArtifactId. The minor version can be found on the Scala site.

Figure: The various spark-core versions and the matching Spark and Scala versions
Adding the "provided" keyword to the library dependencies in order to avoid jar clashes when building the project:
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] \.ivy2\cache\javax.activation\activation\jars\activation-1.1.jar:javax/activation/ActivationDataFlavor.class
[error] \.ivy2\cache\org.eclipse.jetty.orbit\javax.activation\orbits\javax.activation-1.1.0.v201105071233.jar:javax/activation/ActivationDataFlavor.class
Re-including the provided dependencies at run time: to avoid cases where sbt assembly runs correctly but the make (or sbt run) does not, you should include the following line in your build.sbt (and not in your assembly.sbt): run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
Excluding javax.servlet to avoid the following errors:
[error] (run-main-0) java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:895)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:665)
at java.lang.ClassLoader.defineClass(ClassLoader.java:758)
Keeping a blank line between every two lines.
From the command line in the project root run:

sbt
sbt update
sbt assembly
sbt run

If you get the following exception while running, don't worry, it is just a cleanup issue and you can disregard it:
ERROR Utils: Uncaught exception in thread SparkListenerBus
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)

Create a /project/assembly.sbt file and run sbt assembly to verify that the project jar is being created. This will let you avoid:
[error] Not a valid command: assembly
[error] Not a valid project ID: assembly
[error] Not a valid configuration: assembly
[error] Not a valid key: assembly
[error] assembly
Create a /main/scala/SimpleApp.scala file (or any other name you choose for your project's main file). Make sure to include:

The spark conf should include your spark master that will serve your jar using setMaster, in order to avoid the following error:
"A master URL must be set in your configuration"
If you have limited resources (and you will have), configure the number of used cores and the allocated memory per core.
The path where your compiled jar is located, using setJars. After building your project for the first time, you will find it inside the target folder in your project. If you don't configure setJars, you will get messages that Spark cannot find your jar.

The /build.sbt file:
name := "SimpleApp"

scalaVersion := "2.10.4"

run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.0" % "provided" excludeAll ExclusionRule(organization = "javax.servlet")

libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.5.0" % "provided" excludeAll ExclusionRule(organization = "javax.servlet")

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
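
To make the version matching concrete: the %% operator in the spark-core line asks sbt to append the Scala binary version to the artifact name, which is why scalaVersion has to match an artifact that actually exists in Maven Central. The small sketch below mimics that naming rule for Scala 2.x versions; the function names are made up for illustration.

```python
def scala_binary_version(scala_version):
    """'2.10.4' -> '2.10': the major.minor prefix used in artifact names."""
    return ".".join(scala_version.split(".")[:2])


def cross_versioned_artifact(name, scala_version):
    """Artifact ID that sbt's %% operator resolves for a given scalaVersion."""
    return f"{name}_{scala_binary_version(scala_version)}"


print(cross_versioned_artifact("spark-core", "2.10.4"))  # spark-core_2.10
```

With scalaVersion set to 2.10.4 and Spark 1.2.0, sbt therefore looks for spark-core_2.10 version 1.2.0.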

The /project/assembly.sbt file:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

The /main/scala/SimpleApp.scala file:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "hdfs://hadoop.name.node:8020/tmp/POC/*"
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("spark://spark.master.node:7077")
      .set("spark.executor.memory", "64m")
      .set("spark.cores.max", "4")
      .setJars(List("/path/to/target/scala-2.10/SimpleApp-assembly-0.1-SNAPSHOT.jar"))
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

Arrange the IDE and Submit your First Job

Create a new configuration.
Add sbt assembly to the "Before Launch" configuration in order to generate a jar file:

Add a new task to the "Before Launch"
Add a new external tool
Set the sbt location in the program and "assembly" in parameters

Bottom Line
It may take a little time to launch a proper Scala and Spark configuration in Intellij, but the result is worth it!

Keep Performing,
Moshe Kaplan

Apr 19, 2015

5 Steps to Migrate from MySQL Community Edition to MariaDB

Why Should You Consider the Migration?
There are many reasons for it, but the bottom line is that in open source you should follow the community.

MySQL CE, MariaDB or Percona DB?
Well, everyone should make their own decision. However, you should decide whether you are looking for commercial support or community support. If you are looking for commercial support, choose the company that you trust most (and that gives you the best deal).
If you are looking for community support, take a look at where the community is, check whether the forums are active and whether bugs reported by the community are being taken care of. Finally, search on Linkedin. It's a great way to sense which way the wind blows.

What Should You Expect?
Faster releases, better response to the community, some performance boost, and the bottom line: no change is needed on your client side.

How to Migrate?
Migration currently is very simple, just like upgrading to a new major MySQL release:

Backup:
sudo service mysql stop
sudo cp -R /var/lib/mysql /var/lib/mysql.old
sudo cp -R /etc/mysql/my.cnf /etc/mysql/my.cnf.old
Uninstall MySQL
dpkg --get-selections | grep -v deinstall | grep -i mysql
sudo apt-get purge -y percona-toolkit
sudo apt-get purge -y php5-mysql
sudo apt-get purge -y libmysqlclient18 mysql-client mysql-client-5.5 mysql-client-core-5.5 mysql-common mysql-server mysql-server-5.5 mysql-server-core-5.5
Install MariaDB
sudo apt-get install software-properties-common
sudo apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xcbcb082a1bb943db
sudo add-apt-repository 'deb http://sfo1.mirrors.digitalocean.com/mariadb/repo/10.0/ubuntu trusty main'
sudo apt-get -y update
sudo apt-get -y dist-upgrade
sudo apt-get install mariadb-server mariadb-client percona-toolkit
Select your root password
Upgrade
sudo mysql_upgrade -uroot -p

That's all. You will not need to change or recompile any client.

Bottom Line
Migration is easier than you may expect. Now you can test and verify whether it fits your needs.

Keep Performing,
Moshe Kaplan

Apr 4, 2015

Modifying Your MySQL Structure w/o Downtime

Did you try to add a new index to a huge MySQL table and suffer downtime?
If the answer is yes, you should get acquainted with pt-online-schema-change.

How the Magic is Done?
Actually Percona imitates MySQL behavior with a little tweak.
When modifying a table structure MySQL copies the original table structure, modifies it, copies the data and finally renames the table.
The only problem w/ this behavior is that it locks the original table...

Percona does the same, but instead of locking the original table, it tracks the changes made during the copy (using triggers) and applies them to the new table. That way the original table keeps serving the users, and the swap to the new table is done in a single atomic operation.

Percona Toolkit Installation
Download the Percona toolkit and install it (the following is relevant for Ubuntu):
> wget http://www.percona.com/downloads/percona-toolkit/2.2.13/deb/percona-toolkit_2.2.13_all.deb
> sudo dpkg -i percona-toolkit_2.2.13_all.deb

Making a Change
Just call the tool with credentials, the database name (D flag), the table name (t flag), the command to execute (--alter flag), and finally the --execute flag to apply the changes.
pt-online-schema-change --alter "ADD COLUMN c1 INT" D=sakila,t=actor -uuser -p"password" --execute

Things to Notice

You must have a primary key on the table
If you only want to verify the process before replacing the tables themselves, use --dry-run instead of --execute (or just drop this parameter).
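
One way to make the dry run the default is to wrap the call in a small helper. The sketch below only builds the argument list (it does not run anything). The helper name is hypothetical, and it swaps the inline password for --ask-pass, which prompts for the password instead, so that the password stays out of the shell history.

```python
import shlex


def pt_osc_command(database, table, alter, user, execute=False):
    """Build a pt-online-schema-change command line; defaults to a dry run."""
    return [
        "pt-online-schema-change",
        "--alter", alter,
        f"D={database},t={table}",
        "--user", user,
        "--ask-pass",  # prompt for the password rather than passing it inline
        "--execute" if execute else "--dry-run",
    ]


# Same database, table and change as the example above.
print(shlex.join(pt_osc_command("sakila", "actor", "ADD COLUMN c1 INT", "user")))
```

Pass execute=True only after the dry run completes cleanly.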

Bottom Line
Modifying your database will cause some performance degradation, but it should not result in downtime.

Keep Performing,
Moshe Kaplan

Jan 27, 2015

12 Ways to Boost Your Elasticsearch Performance

The ELK (Elasticsearch, Logstash, Kibana) stack is amazing.
In no time you can create a fully functional analytics service from data collection to dashboard presentation.

But what happens at scale? How can you make sure this blazing fast solution keeps serving your business team even when your data includes hundreds of millions of data points and more?

What to Focus on?
Elasticsearch performs two major tasks:

Data load and indexing, which are CPU intensive.
Search and queries, which are memory intensive.

You should design your system to match your business case pattern.

Step 1: Keep your version up to date
Elasticsearch is a relatively young tool, and the team delivers new features and fixes rapidly, so make sure you keep up with the latest versions.

Step 2: Tune Your Memory
The Elasticsearch heap should be about 50% of your machine's memory. Set the ES_HEAP_SIZE environment variable to this number (2G for example): export ES_HEAP_SIZE=2G
Note: this method will probably not work, since the init.d script overrides it. In that case, edit your /etc/init.d/elasticsearch and set the ES_HEAP_SIZE=2g parameter there.

Step 3: Select Your Storage
Disks are crucial when your data is larger than your memory. Choose local SSD disks. They will cost less and perform better.

Step 4: Stripe Your Data
Use path.data and path.logs to stripe your data and logs on multiple disks to gain more IOPS.

Step 5: Prepare for Index Merging
Index merging is probably the most frustrating process in Elasticsearch. It's required to keep your system performing in the long run, but it can cause relatively short bursts of high resource utilization. Elasticsearch protects itself by throttling merges to 20MB/s. If it serves as your back office system, you can disable throttling by setting index.store.throttle.type to none.

Step 6: Plan for Bulk Loading
Like any other data solution, you should load data in bulk when possible to speed up your load and minimize resource utilization. This is the reason you should check out the Bulk API.
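
As an illustration of what a bulk request looks like, the sketch below serializes a batch of documents into the newline-delimited body the Bulk API expects: one action line followed by one source line per document. The index name, type and documents are invented, and the _type field reflects Elasticsearch versions of that period (newer releases drop it). Sending the body to the _bulk endpoint is left to your HTTP client.

```python
import json


def bulk_body(index, doc_type, docs):
    """Serialize (id, source) pairs into a newline-delimited Bulk API body."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))   # what to do and where
        lines.append(json.dumps(source))   # the document itself
    return "\n".join(lines) + "\n"         # the body must end with a newline


# Hypothetical index, type and events.
events = [(1, {"user": "a", "clicks": 3}), (2, {"user": "b", "clicks": 5})]
print(bulk_body("events-2015.01.27", "event", events), end="")
```

Batches of a few thousand documents per request are a common starting point; measure before settling on a size.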

Step 7: Optimize Your Index
Run optimize on your index when it is stable (for example after a daily load) to ensure the best performance.

Step 8: Enlarge the File Handle Limit
Like other data solutions, Elasticsearch uses a high number of file handles. Make sure to add the following settings to /etc/security/limits.conf:
* soft nofile 64000
* hard nofile 64000

Step 9: Make RAM Space for Your Indexes
Elasticsearch is optimized for machines w/ over 10GB RAM, as its default room for the index buffer is 10% of its memory. Since the best practice is having at least 512MB for the index buffer size, if your system is not that large, make sure you add the following configuration to /etc/elasticsearch/elasticsearch.yml:

indices.memory.index_buffer_size: 512mb
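
The 10GB figure follows from the two rules of thumb in this post: about 50% of the machine goes to the heap (Step 2) and 10% of the heap goes to the index buffer. A quick sketch of the arithmetic, with illustrative machine sizes:

```python
HEAP_PCT_OF_RAM = 50       # Step 2: about half of the machine for the heap
BUFFER_PCT_OF_HEAP = 10    # default index buffer share
TARGET_BUFFER_MB = 512     # recommended minimum from this step


def default_index_buffer_mb(ram_gb):
    """Index buffer (MB) a node gets when both defaults are left alone."""
    return ram_gb * 1024 * HEAP_PCT_OF_RAM * BUFFER_PCT_OF_HEAP / 10000


def needs_explicit_buffer(ram_gb):
    """True when the default buffer falls short of the 512MB target."""
    return default_index_buffer_mb(ram_gb) < TARGET_BUFFER_MB


for ram_gb in (4, 8, 10, 16):
    print(ram_gb, default_index_buffer_mb(ram_gb), needs_explicit_buffer(ram_gb))
```

A 10GB machine lands exactly on 512MB, so anything smaller benefits from the explicit setting.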

Step 10: Change Mappings
Elasticsearch by default has some data mapping that may be avoided in your case to save disk space, memory and boost performance:

The _source field that stores the original data
The _all field, which combines all fields into a single one so that you can search across any of them

Step 11: Add Monitoring
You can either choose Marvel, the ELK management tool with the Kibana look that is part of the Enterprise package or make your own using open source solutions or hosted solutions like New Relic.

Step 12: Sharding
If nothing else works, start sharding and adding nodes to your system.

Bottom Line
Elasticsearch is an amazing tool and with the right configuration it can keep serving your analytics needs even in the scale of billions of events.

Keep Performing,
Moshe Kaplan

Jan 16, 2015

Offloading SSL using AWS ELB

If you are using AWS elastic load balancer to scale your system, you may find that it is a good solution to offload SSL termination from your servers.

Why Should You Offload SSL Termination?
HTTPS is an encrypted protocol, and encryption requires high CPU utilization to perform the needed mathematical computations.
Since most web applications are CPU bound, you should avoid processing SSL on your servers.

Why AWS Elastic Load Balancer (or Any other LB) Is a Great Candidate?
In order to perform load balancing, the load balancer must decrypt the traffic and read its content. This is done by placing your certificate on the load balancer.
If you consider the network between your LB and your servers to be secure, you should prefer to avoid re-encryption of the traffic, and keep it clear.

How Can I Make Sure Traffic is Actually Secured?
In some cases, you want all your users to use HTTPS as an encrypted channel in order to keep your users privacy and avoid eavesdropping and injections.
In these cases you want to catch traffic that did not use HTTPS before it was terminated at the LB and redirect it to HTTPS. This can be done by evaluating the X-Forwarded-Proto request header in your .htaccess or Apache configuration:
RewriteEngine On
RewriteCond %{HTTP:X-Forwarded-Proto} !https [NC]
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
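
To spell out what the rule decides, here is a small sketch of the same logic in plain code. It is only a model for reasoning about the rule, not a replacement for it: the function name, host and path are invented, and the header lookup assumes the exact key X-Forwarded-Proto.

```python
import re


def https_redirect(headers, host, request_uri):
    """Return the redirect target, or None when the request came over HTTPS.

    Mirrors: RewriteCond %{HTTP:X-Forwarded-Proto} !https [NC]
    """
    proto = headers.get("X-Forwarded-Proto", "")
    if re.search("https", proto, re.IGNORECASE):   # [NC] = case-insensitive match
        return None                                # original request was HTTPS
    return f"https://{host}{request_uri}"          # would be sent as a 301


print(https_redirect({"X-Forwarded-Proto": "http"}, "example.com", "/cart"))
print(https_redirect({"X-Forwarded-Proto": "https"}, "example.com", "/cart"))
```

Note that a request with no X-Forwarded-Proto header at all is redirected as well, which matters for clients that reach the server without going through the LB.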

Bottom Line
A careful design can help you get more out of your web servers

Keep Performing,
Moshe Kaplan
