Jun 10, 2018

Getting Enterprise Features to your MongoDB Community Edition

Many of us need MongoDB Enterprise Edition features, but might be short of resources or would like to compare its value against the alternatives.

I have summarized several key features of MongoDB Enterprise Edition and their alternatives.

Monitoring Options:
  • MongoDB Cloud Manager: performance monitoring ($500/yr per machine) => $1000-$1500/yr for 2-3 machines
  • Datadog/New Relic => $120-$180/yr per machine; Datadog is the better fit for this case
  • DIY using tools such as mongotop, mongostat, and mtools, integrated with Grafana and other tools

Replication is highly recommended and is part of the Community Edition:
Replica set => minimum 3 nodes, at least 2 of them data nodes, spread across 3 data centers (2 major DCs and one small one).
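
The reason for three nodes in three data centers comes down to majority arithmetic: a replica set can elect a primary only while a strict majority of its voting members is reachable. A minimal sketch of that reasoning in plain Python follows; the data center names are hypothetical and the functions are illustrations, not part of any MongoDB driver.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """Voting members that can be lost while a primary can still be elected."""
    return voting_members - majority(voting_members)


def survives_dc_loss(members_per_dc: dict) -> dict:
    """For each data center, can the set still elect a primary if it goes down?"""
    total = sum(members_per_dc.values())
    return {
        dc: (total - count) >= majority(total)
        for dc, count in members_per_dc.items()
    }


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, "members -> majority", majority(n), "tolerates", fault_tolerance(n))
    # Hypothetical layouts: one member per DC vs. two DCs only.
    print(survives_dc_loss({"dc-major-1": 1, "dc-major-2": 1, "dc-small": 1}))
    print(survives_dc_loss({"dc-major-1": 2, "dc-major-2": 1}))
```

With one member in each of three data centers, losing any single data center still leaves two of three votes. With only two data centers, losing the one that holds two members leaves the set without a primary.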

Backup and Restore:
There are 3 major options (which can be combined, of course):
  • fsync (flush and lock) on MongoDB, followed by a physical backup of the data files:
    • Fast backup/restore
    • Might be inconsistent/unreliable
  • Logical backup: based on mongodump
    • Can be done for $2.5/GB using Cloud Manager, with point-in-time recovery
    • Can be done with Percona hot backup
    • Incremental backup is supported
  • Have a delayed node
The first two can run against a 3rd data node configured as hidden and dedicated to backup, which enables high-frequency backups.
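
As an illustration of the hidden backup node, here is a sketch that builds a replica set configuration document in Python and checks two rules MongoDB enforces: a hidden member must have priority 0, and so must a delayed member. The set name `rs0` and the host names are hypothetical. The `slaveDelay` field is the name used by the MongoDB versions current when this post was written; MongoDB 5.0 renamed it to `secondaryDelaySecs`.

```python
import json


def backup_member(member_id: int, host: str, delay_seconds: int = 0) -> dict:
    """A hidden, non-electable member; optionally delayed behind the primary."""
    member = {"_id": member_id, "host": host, "priority": 0, "hidden": True}
    if delay_seconds > 0:
        # Field name for MongoDB versions before 5.0 (assumption about target version).
        member["slaveDelay"] = delay_seconds
    return member


def validate(config: dict) -> list:
    """Return a list of rule violations; an empty list means the config looks sane."""
    problems = []
    for m in config["members"]:
        priority = m.get("priority", 1)
        if m.get("hidden") and priority != 0:
            problems.append(f"member {m['_id']}: hidden requires priority 0")
        if m.get("slaveDelay", 0) > 0 and priority != 0:
            problems.append(f"member {m['_id']}: delayed requires priority 0")
    return problems


# Hypothetical hosts: two regular data nodes plus one hidden backup node.
config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "db1.example.internal:27017"},
        {"_id": 1, "host": "db2.example.internal:27017"},
        backup_member(2, "backup1.example.internal:27017"),
    ],
}

if __name__ == "__main__":
    print(validate(config))
    print(json.dumps(config, indent=2))
```

The printed document has the shape that `rs.initiate()` or `rs.reconfig()` accept in the mongo shell. Passing `delay_seconds=3600` to `backup_member` would turn the same node into a one-hour delayed member.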

Encryption Alternatives:
  • Disk-based encryption => data at rest (can be done in AWS and with several storage providers)
  • eCryptfs => Percona => data at rest
  • Application-level encryption, implemented by the programmers at the class level before the data is saved to disk.

Using Percona Server for MongoDB is a good alternative that may cover many of your enterprise needs.

BI:
Well supported by the MongoDB BI Connector in the Enterprise Edition, but can also be done with:
  • BI tools that support MongoDB natively
  • 3rd-party JDBC connector providers, such as Simba and https://www.progress.com/jdbc/mongodb

Bottom Line
Getting your MongoDB Community Edition to meet enterprise requirements is not simple, but with the right effort it can be done.

Keep Performing,
Moshe Kaplan

Mar 5, 2018

Some Lessons of Spark and Memory Issues on EMR

In the last few days we went through several performance issues with Spark as our data grew dramatically. The easiest way around them might be to increase the instance sizes. However, as scaling up is not a scalable strategy, we looked for alternative ways to get back on track after one of our Spark/Scala-based pipelines started to crash.

Some Details About Our Process
We run a Scala (2.11) based job on a Spark 2.2.0/EMR 5.9.0 cluster with 64 r3.xlarge nodes.
The job analyzes several data sources, each a few hundred GB (and growing), using the DataFrame API, and outputs the data to S3 in ORC format.

How Did We Recover?
Analyzing the logs of the crashed cluster revealed the following error:

WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Setting spark.yarn.executor.memoryOverhead to 2500 MB (the maximum for the r3.xlarge instance type we used) did not make a major difference.

spark-submit --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead=2500 ...
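
To see what this setting changes, it helps to recall how Spark 2.x sizes its YARN containers: the request is the executor memory plus the overhead, and when the overhead is not set explicitly it defaults to 10% of the executor memory with a floor of 384 MB. The sketch below reproduces that arithmetic in plain Python. The 8192 MB executor memory is a hypothetical value for illustration, not our actual setting, and YARN may round the request up further to its allocation increment.

```python
OVERHEAD_FACTOR = 0.10   # default fraction of executor memory used as overhead
OVERHEAD_MIN_MB = 384    # default floor for the overhead


def default_overhead_mb(executor_memory_mb: int) -> int:
    """Overhead Spark 2.x uses when spark.yarn.executor.memoryOverhead is unset."""
    return max(int(OVERHEAD_FACTOR * executor_memory_mb), OVERHEAD_MIN_MB)


def container_request_mb(executor_memory_mb: int, overhead_mb: int = None) -> int:
    """Memory requested from YARN per executor, before YARN's own rounding."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


if __name__ == "__main__":
    executor_mb = 8192  # hypothetical executor memory, for illustration only
    print("default overhead:", default_overhead_mb(executor_mb), "MB")
    print("container, default overhead:", container_request_mb(executor_mb), "MB")
    print("container, overhead=2500:", container_request_mb(executor_mb, 2500), "MB")
```

Raising the overhead only helps while the node still has room for the larger container, which is consistent with the setting alone not being enough in our case.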

We raised the bar by disabling the virtual and physical memory checks and increasing the virtual-to-physical memory ratio to 4 (this is done in Step 1: Software and Steps of the EMR cluster creation, by entering the following value under Edit software settings):
[
  {"classification": "spark", "properties": {"maximizeResourceAllocation": "true"}},
  {"classification": "yarn-site", "properties": {
    "yarn.nodemanager.vmem-pmem-ratio": "4",
    "yarn.nodemanager.pmem-check-enabled": "false",
    "yarn.nodemanager.vmem-check-enabled": "false"
  }}
]
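
If you create clusters from scripts rather than from the console, the same classifications can be generated as a JSON list. The sketch below builds them in Python and checks for stray whitespace in property names, which is easy to introduce when copying and pasting and would make the key an unrecognized property name. The function names are hypothetical helpers, not part of any AWS SDK.

```python
import json


def emr_configurations(vmem_pmem_ratio: int = 4) -> list:
    """The Spark and YARN classifications described above, as a JSON-ready list."""
    return [
        {
            "classification": "spark",
            "properties": {"maximizeResourceAllocation": "true"},
        },
        {
            "classification": "yarn-site",
            "properties": {
                "yarn.nodemanager.vmem-pmem-ratio": str(vmem_pmem_ratio),
                "yarn.nodemanager.pmem-check-enabled": "false",
                "yarn.nodemanager.vmem-check-enabled": "false",
            },
        },
    ]


def stray_whitespace_keys(configurations: list) -> list:
    """Property names with leading or trailing whitespace."""
    return [
        key
        for classification in configurations
        for key in classification["properties"]
        if key != key.strip()
    ]


if __name__ == "__main__":
    configs = emr_configurations()
    print(stray_whitespace_keys(configs))
    print(json.dumps(configs, indent=2))
```

The printed JSON can be saved to a file (for example a hypothetical configurations.json) and passed to `aws emr create-cluster` through its `--configurations` option.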

However, this did the trick only until we hit the next limit (probably Spark tasks were killed when they tried to abuse the physical memory), with the following error:

ExecutorLostFailure (executor  exited caused by one of the running tasks) Reason: Container marked as failed: container_ on host:. Exit status: -100. Diagnostics: Container released on a *lost* node 

This one was solved by increasing the number of DataFrame partitions (in this case from 1024 to 2048), which reduced the memory needed per partition.

Note: if you want to change the default number of DataFrame partitions (200), use the following:
setConf("spark.sql.shuffle.partitions", partitions.toString)
setConf("spark.default.parallelism", partitions.toString)
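
The effect of doubling the partition count can be sketched with simple arithmetic. The example below is plain Python, not Spark API, and it assumes the data is spread evenly across partitions (no skew). The 300 GB input size and the 128 MB target are hypothetical figures chosen for illustration.

```python
import math


def mb_per_partition(total_gb: float, partitions: int) -> float:
    """Average data per partition in MB, assuming an even spread (no skew)."""
    return total_gb * 1024 / partitions


def partitions_for_target(total_gb: float, target_mb: float) -> int:
    """Smallest partition count that keeps the average partition at or below target_mb."""
    return math.ceil(total_gb * 1024 / target_mb)


if __name__ == "__main__":
    total_gb = 300  # hypothetical input size
    for partitions in (200, 1024, 2048):
        print(partitions, "partitions ->", mb_per_partition(total_gb, partitions), "MB each")
    print("partitions for 128 MB target:", partitions_for_target(total_gb, 128))
```

Going from 1024 to 2048 partitions halves the average data each task has to hold at once, which is why the change relieved the memory pressure. A calculation like this can also be rerun as the data grows, to decide when the partition count should be raised again.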

If you want to take another look at the default partitioning and how to automate the numbers, take a look at Romi Kuntsman's lecture.

Right now we are running at full power. When we hit the next limit, it may be worth an update.

Bottom Line
As Spark heavily utilizes cluster RAM as an effective way to maximize speed, it is highly important to monitor it and verify your cluster settings and partitioning strategy meet your growing data needs.

Keep Performing,
Moshe Kaplan

