Nov 29, 2014

MySQL Replication Over Slow Links/High Latency

MySQL replication is considered to be efficient and usually changes in the master server are performed in the slaves within a single second. 
However, if you suffer from a replication that fails to close the gap, there are two main reasons for it: 

  1. Slave Disk Issue: as replication is single threaded per database, usually the slave lags behind due to disk latency when implementing the changes. In this case you should consider using SSD to accelerate the process.
  2. Low Bandwidth/High Latency Networking: in case where the two servers are located on remote locations (high latency case) or there is a low bandwidth between the servers, we should focus on minimizing the traffic between the servers using one (or both) of the following methods:
    1. Using statement based replication: Row based replication creates a SQL statement per each changed row in the database. Statement based replication is records the actual SQL statement sent by the application. Usually statement based replication is much more efficient from log size aspect. However, you should be aware that it may not work correctly if you use UPDATE ... LIMIT 1 for example.
    2. Compressing the traffic: MySQL supports log replication compression using the slave_compressed_protocol parameter. This method will reduce the traffic between the servers by up to 80%. However, compression is compute intensive, so you should be aware of some extra CPU utilization (that is usually not an issue in databases). This parameter should be enabled on both servers:
      1. Dynamically from the MySQL command line:SET GLOBAL slave_compressed_protocol = 1;
      2. In the MySQL configuration file:#compress master-slave communication
        slave_compressed_protocol = 1
Bottom Line
Understand why your replication lags behind and use the right method to solve it. Yes, it is that easy.

Keep Performing,

Nov 22, 2014

Analyzing Twitter Streams in Real Time using Spark

One of the most interesting (and some people say not working) features in Apache Spark is the ability to analyze the Twitter stream in real time.

DZone just released earlier this week the new Spark RefCard. Since I made a peer review of it, it is a good time to discuss this topic.

Tuning an Out of a Box Solution
Spark provides an out of the box example that with some tuning, you will get the top ten trending Twitter tags every 10 and 60 seconds.

  1. Create a new Twitter App or use your existing app credentials at Twitter Apps.
  2. Download and install Java, Scala and Spark
  3. Adjust the environment variables :
    1. Scala home
      export SCALA_HOME=/usr/lib/scala
    2. PATH to run scala
      export PATH=$PATH:$SCALA_HOME/bin
    3. Add the location of spark-streaming-twitter_2.10-1.0.0.jar,  twitter4j-core-3.0.3.jar and twitter4j-stream-3.0.3.jar to CLASSPATH
      export CLASSPATH=$CLASSPATH:/root/spark/lib/twitter4j/
  4. Run the code after tuning some parameters:
    1. Get into the spark foldercd /var/lib/spark/spark-1.0.1/
    2. If you are running on a single core machine (or you want to just make sure you will get results and not just "WARN BlockManager: Block input-0-XXXXXXXX already exists on this machine; not re-adding it") change the ./bin/run-example code:
      sudo sed -i 's/local\[\*\]/local\[2\]/g' *.txt
    3. Run the example (please remember that you should write the class name, including the streaming., and avoid placing the path, the scala extension or any other fancy stuff:sudo ./bin/run-example streaming.TwitterPopularTags 
  5. The result will be shown after several seconds:Popular topics in last 60 seconds (194 total):
    #MTVStars (42 tweets)
    #NashsNewVideo (9 tweets)
    #IShipKarma (6 tweets)
    #SledgehammerSaturday (6 tweets)
    #NoKiam (5 tweets)
    #mufc's (3 tweets)
    #gameinsight (3 tweets)

Bottom Line
Spark is an amazing platform, with some little adjustments you will be able to enjoy it in a few minutes 

Keep Performing,
Moshe Kaplan


Intense Debate Comments

Ratings and Recommendations