Dec 31, 2010

Prepare for Database2011

Hi,


Raphael Fogel invited me to give a lecture at Database2011, the central databases conference in Israel. The event will take place on Jan 13 at the Avenue Conference Center.
So reserve the date and come prepared to hear how sharding turned MySQL into the Internet's de-facto database standard: What is sharding? Why did the biggest internet players choose MySQL? And what are the latest solutions in this field?


Keep Performing and Happy Holidays,
Moshe Kaplan

Nov 16, 2010

Cloud Computing Design and Best Practices

Today I presented "Cloud Computing Design and Best Practices" at CloudCon, one of the largest cloud events in Israel.

It was a great lecture and I would like to share with you some of the lecture's insights:
  1. Assumptions: Don't treat cloud computing as a brave new world. It is one, but when you design your cloud system, don't forget to take care of the basic rules you followed before.
  2. Create A Road Map: You probably won't be able to turn your system into a 1-billion-user system on day one. Therefore, design your road map and understand how you will reach your target. To get there, you should define the various system parts and understand how each of them will be scaled out/removed/replaced in the future to meet your road map.
  3. Start Fast: People love success. Your investors love it, your marketing guys love it, your customers love it and even your development guys do. Therefore, start safe and fast. If your development team is great in C#, start with it. If you have an existing software product, start with it as well. It is always better to start with a model based on technologies and products that you are good at than to get into an adventure whose risks you cannot control.
  4. Minimize Costs: After you have decided to take the fast track, you should control your costs, and most importantly, your growth costs. Go over your business plan, turn it into a technical requirements plan and find your bottlenecks. Based on these bottlenecks, define a solution for each of them: if you are in the online ads market, take care of your impressions module; if you are in the video business, take care of your video processing module. Why? In the viral online business, growth is exponential. Therefore, cost growth is exponential too, and you have to take care of it before your budget runs away.
  5. Best Strategies:
    1. Scale out: think shared-nothing; understand how to take each server and split it into any number of servers. If procuring a larger server is your best solution, you are probably headed in the wrong direction.
    2. Sharding: Data is usually the largest obstacle to scaling out, as conservative designs concentrate the data in a single place. If you have a similar case, take the path the giants have already taken: shard your database, whether it is MySQL or SQL Server (read my best-selling sharding post, and see the sharding sketch after this list).
    3. In Memory Database: In-memory access is 5x-10x faster than disk access. Therefore, analyze your system and understand what you can do without going to disk. If you can afford to lose a negligible number of transactions, use this technique to cut your costs.
  6. Refactor on the Run: as a player in a growing business, you don't have the option to rest. A 100-user system is different from a 100M-user system, and as the system grows, smaller modules that were neglected in the first phase will become more important in terms of bottlenecks, cost or business sensitivity. Your way to handle it should be refactoring the system, step by step, to meet the business goals.
  7. Define the Exit Strategy: You should always remember that your cloud operator is still a vendor, and your best partner in the early days may become an obstacle when you become a giant. Therefore, choose your cloud provider's tools carefully. I would recommend you think twice before you choose proprietary data stores like SimpleDB, and if you do: create your own interface and have an exit strategy ready when needed.
  8. Everybody is using Open Source. What if your organization is not an expert in this field? As written before, you should start fast. There are plenty of cloud providers that support Windows and .Net, and you can get a head start if you use the technology you are familiar with. When you grow, you may refactor your product and add technologies to remove your bottlenecks, such as Erlang for push or LAMP to handle the most common processes that account for 90% of your costs.
  9. Looking for more strategies?
    1. CDN: extract your static and streaming content to a CDN provider. This move will cut your server and network utilization and will improve your end-user experience.
    2. Smart Clients: make your end-user client resilient to network failures. If you use Gmail and have seen the "Loading..." label instead of a 404 page, you probably understand what I'm talking about (otherwise, Google for jQuery).
    3. Elastic Growth: if your system's usage pattern is not uniform, consider turning some of your instances on and off to keep costs down and still meet the spikes.
    4. Replication: don't forget to keep your data safe and survive failures. Make sure you do it using commodity hardware and software.
    5. Prepare for downtime & upgrades: make sure that you can always go on. Downtimes will come, since at large scale everything happens, even what shouldn't. Make sure you never really shut down your whole system, even when you upgrade it.
    6. NoSQL and SQL: choose SQL as a start if you are great at it. But don't neglect NoSQL when you get larger.
  10. Risk Management: You are going to make a bold move, and you should prepare yourself as I mentioned before:
    1. Choose your vendors carefully and know your exit strategy. Your Cloud Operator is a provider as well.
    2. Hedge costs and take care of your bottlenecks.
    3. Stress your system all the way to guarantee you can get to the next level.
    4. Think one move ahead and keep aligned to your strategy.
    5. Listen to your users' feedback. Your business depends on your clients; take care of them.
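To make the sharding strategy above (5.2) concrete, here is a minimal hash-based shard-routing sketch in Python. The shard list, the user_id key and the connection strings are all hypothetical; a real deployment would also handle replication and re-sharding:

  # Route each user to one of N database shards by hashing its key.
  import hashlib

  SHARDS = [  # hypothetical connection strings, one per shard
      "mysql://db-shard-0/app",
      "mysql://db-shard-1/app",
      "mysql://db-shard-2/app",
      "mysql://db-shard-3/app",
  ]

  def shard_for(user_id):
      # A stable hash keeps a given user on the same shard on every request.
      digest = hashlib.md5(str(user_id).encode()).hexdigest()
      return SHARDS[int(digest, 16) % len(SHARDS)]

  print(shard_for(1001))  # always resolves to the same shard for this user

Note that a plain modulo scheme forces data movement when you add shards; consistent hashing or a directory layer (see the Twitter Gizzard case study below) reduces that pain.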
Bottom Line
Working according to these rules can help you reach the 1-billion-user system you were looking for.


Keep Performing,
Moshe Kaplan

Oct 11, 2010

The Path to the Cloud

You have made the big decision: migrate to the cloud.

What Should You Do Now?
Well, migrating an active system between two hosting providers (or even from the small desktop under your desk) can be risky and may affect the business due to downtime or user frustration. Migration to the cloud, which enables a new paradigm, can be even more complex, since it requires changing concepts and dealing with a brave new world.

So How Should We Start?
The first rule is to not think of the cloud as a cloud. You should prepare for a simple migration (think of the cloud provider as just another hosting provider). This way you can minimize the major risk of incompatibility. Follow these steps to accomplish this first stage:
  1. Prepare your integration environment. If you don't have one, it's time you had one. If you do, it's time to make sure that it's 100% compatible with your production environment. We'll use this environment to simulate the migration.
  2. Arrange your DNS records. Migrating to another location will probably require modifying your DNS records to reflect the new location. Some hosting providers do not allow naming for external locations (GoGrid, for example). So you should verify that you know how to change your DNS records, or migrate the records to another registrar.
  3. Implement a DRP environment. You should have one first in your current provider environment in order to eliminate issues such as access lists and network latency. Why is it so important?
    1. Hosting migration is risky. You would like an easy rollback in case of an unsuccessful migration.
    2. Data migration takes a lot of time. If your databases (or NoSQL stores) are large, it will take a long time to migrate them between locations (even over a 100Mb channel, it will take about 3 hours of downtime to migrate a 100GB database; see the back-of-the-envelope sketch after this list). The most efficient way to shorten this time is log shipping between the databases, or implementing replication between the two NoSQL sites. When you choose to perform the migration to the new site, all you'll need to do is stop the primary instance and turn the passive instance into the primary one. Since the data is replicated between the sites all the time, the data migration window shrinks to at most a few minutes.
  4. Migrate your DRP environment to the new cloud provider. Start with the integration environment in order to verify there are no network issues in implementing it.
  5. Verify the new provider's stability. Now that you have servers at the new hosting provider, it's time to verify its stability, its network performance and any other issues that you did not expect.
  6. Implement a Reverse Proxy Solution. Some of your clients will not be ready for the migration: they may have stale DNS records, or they might use your IP address for some reason. A reverse proxy can help you route traffic from your old location to the new one, minimizing lost traffic and downtime.
  7. Perform a migration test in the integration environment. You should turn the passive site into the active one, and check that users can continue working after the downtime. You should document the process, and if the timings are not good enough, you should rehearse the process. Once you are satisfied with the results, it's time for prime time.
  8. Migrate your production DRP site to the new hosting provider. Perform the process in a similar manner to the integration DRP migration.
  9. Prepare for The big day. Transform your DRP (or the new hosting environment) into a production one.
  10. Verify. Wait a few days and verify the system has stabilized. If so, take a few steps before closing your old hosting facilities and cutting costs:
    1. Implement a new DRP site both for the integration and the production environments.
    2. Verify that the traffic routed from the old site is negligible.
  11. The big day. It's time to shut down your old servers, close the site and say bye-bye to your old provider.
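A quick back-of-the-envelope calculation shows why step 3.2 above pushes you toward log shipping or replication rather than a bulk copy. A minimal Python sketch, using only the numbers quoted above (100GB database, 100Mb channel):

  # Bulk-copy downtime estimate for the migration described above.
  db_size_gb = 100     # database size quoted above
  link_mbps = 100      # 100Mb/s channel between the sites

  megabits = db_size_gb * 1024 * 8
  hours = megabits / link_mbps / 3600.0
  print("%.1f hours" % hours)  # ~2.3 hours of raw transfer

That is raw transfer only; dumps, restores and protocol overhead bring it closer to the ~3 hours quoted above. With replication in place, only the delta moves at cutover time, so downtime shrinks to minutes.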
Just Scale It
After doing so much work, it's time to decide which cloud services you want to use. It may depend on the cloud provider's offering, and on how (and whether) you want to avoid vendor lock-in:

  1. CDN: if you have heavy static file traffic, or if you can turn some of your traffic into static content (think of LinkedIn profiles, for example), you could store it in CDN facilities that can reduce your web servers' traffic, better prepare you for peaks and improve your end-user experience.
  2. Queues: Avoid taking care of queue availability yourself and let the internet giants take care of it.
  3. NoSQL stores: you could implement open source solutions such as Cassandra, or choose the provider's key-value store.
  4. Relational Databases: Some cloud providers these days offer a Database as a Service that lets you get rid of log shipping, clustering and daily backup chores. If you are interested, drop me a note and I'll refer you to some interesting companies.
  5. Load Balancing: You could use software-based load balancers such as HAProxy, or choose the cloud provider's solution.
  6. Monitoring: Some cloud providers offer system monitoring, saving you the major effort of monitoring your servers yourself.
Take It One Step Further
Now that you have chosen the tools, you just have to make sure your system can scale out:
  1. Separate back office services from web interfaces, creating pools of small machines that each can take care of their tasks.
  2. Parallelize your back end services, making sure that no single service turns into a future bottleneck.
  3. In case of push systems and communication between parties, create a directory system that can route events between different servers.
Bottom Line
If you felt that you were bound to your current hosting provider, it's time to think again. Using this simple, yet not so short, method, you can reduce risks and accomplish the change.

Keep Performing,
Moshe Kaplan

Sep 16, 2010

Sharding Again



It has been a long time since we last discussed sharding.
Yesterday the Twitter sharding case study was presented at the Metacafe knowledge-sharing meetup by Gidi Meir Morris. Therefore, I think it is a good time to refresh our minds.


Hey! Where Are All These Whales?
Twitter turned its common web architecture (which caused it many problems), based on a Ruby on Rails web layer and MySQL, into a 3-layer system that includes:

  1. A web application server called Flapps.
  2. Sharding middleware called Gizzard. This layer takes care of database requests and sends them to the target shard. It takes care of replication as well.
  3. Good old (not so scalable) MySQL in a horizontal sharding mode.

But hey, it is better to take a look at this great Prezi presentation (use More > Full Screen to watch it properly):





Bottom Line
Highly scalable databases are becoming a commodity. In the era of Cassandra and Gizzard, Google is starting to lose its competitive edge, and the high-priced Oracle, IBM and Microsoft databases are no longer a must for a startup. Will it affect these companies' bottom lines? Only time will tell.


Keep Performing,
Moshe Kaplan

Aug 28, 2010

How to Monitor Your Cloud Service Provider

The emergence of cloud computing and Everything as a Service in the last few years has raised many issues in enterprises that have started to outsource part of their services and infrastructure to the cloud.
If before the migration IT managers knew the availability of every piece of their infrastructure, now things are a little bit more complex.


But are Things Really Different?
If you had a COTS ERP or CRM system that you bought from SAP, Oracle or Microsoft, did you really know what was happening in the system's internals? Or did you focus on measuring the end-user experience, along with system metrics like CPU and disk utilization?


What Should We Measure?
In order to decide what we should measure, we should ask first several questions:
  1. What will cause me trouble if I don't measure it?
  2. Where is the money going?
  3. What are you paying for?
  4. What is the service interface?
All these questions have specific answers in a common agreement between the parties: the SLA (Service Level Agreement). If something is important enough, you should define metrics for it, and if you have metrics for it, you can probably monitor it using your monitoring system. The importance of SLA monitoring in the cloud world is well illustrated by CA's acquisition of Oblicore in January 2010.

What Should We Not Measure?
Usually I try to avoid monitoring specific devices at the cloud provider, since it will not provide any useful information and will break the interface of the service (sounds like object-oriented design).
However, if you pay for things that are virtual like redundancy or extra capacity, the service provider should provide measurements for that.

Can You Give Us a Small Example?
Usually, I tend to monitor the user experience. For example, if I use an SMS gateway to send SMS messages through my system, I'll monitor:
  1. The submission time of an SMS request
  2. The time until the SMS arrives at a handset that is simulated using a cellular modem.
I will perform this test once in a while; e.g., if I send dozens of SMS messages every second, I will perform a measurement every few seconds (see the probe sketch below).
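As a hedged illustration of such an end-to-end probe, here is a minimal Python sketch. The gateway call, the modem polling function and the SLA threshold are hypothetical stand-ins for your real SMS gateway and handset rig:

  import time

  def send_sms(message):
      ...  # hypothetical call to your SMS gateway's submit API

  def wait_for_sms_on_modem(timeout_sec):
      ...  # hypothetical poll of the cellular-modem-driven handset
      return True

  def probe():
      t0 = time.time()
      send_sms("sla-probe")
      submit_latency = time.time() - t0            # metric 1: submission time
      delivered = wait_for_sms_on_modem(timeout_sec=60)
      end_to_end = time.time() - t0                # metric 2: time to handset
      return submit_latency, end_to_end, delivered

  SLA_SECONDS = 10  # hypothetical ceiling taken from your SLA document
  submit, e2e, ok = probe()
  if not ok or e2e > SLA_SECONDS:
      print("SLA breach: report it to your monitoring system")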

The Bottom Line
Choose your metrics carefully according to the defined SLA, and monitor them using your monitoring system. Do not be tempted to over-measure, in order to avoid breaking the service interface.

Keep Performing,

Jul 25, 2010

Open is the New Black (hmm.... Standard)


Only a few years ago, when a technology company was lagging behind in a new niche that had turned mainstream, it had a magic move: creating a new standard.

Standard: The Magic Touch
Many times the standard technology was less efficient, less robust and had fewer users than the technology advanced by the market leader. Yet it had a magic charm: it was open, and most of the players in the market could use it without royalties or other fees. That included the #2 and #3 players in the market, as well as many other giants that were not focused on that market but had to integrate with and extend their product support to that technology.
It worked for DEC with Ethernet (vs. IBM's Token Ring); it was great for NetApp, which chose iSCSI with a 3% market share vs. EMC's leading Fibre Channel protocol; and it worked like a charm for a small Silicon Valley firm named Cisco that chose IP.

How do you Standardize Software?
Let's say you are a software company that is lagging behind. Anyone said Yahoo! vs. Google in the internet market? Yahoo! had a great idea to scale out using a BigTable-like product, but had no deep pockets for it.
Another example? Facebook. What if you had the largest publicly available image store, yet only a few hundred developers, and you needed a new scalable database. What would you do?

Open: The New Magic Touch
Well, both Yahoo! and Facebook chose to open source their infrastructure products, creating two of the leading infrastructure products in today's world: Apache Hadoop and Apache Cassandra.

And Cloud, What about the Cloud?
Well, it seems that last week's Rackspace move (establishing OpenStack.org and contributing major parts of its proprietary cloud infrastructure code to the project) is a clear statement that being #2 in the IaaS market (see the latest Jack of all Clouds report), a market being invaded by major players such as VMware, Google, Salesforce.com and Microsoft, is not an easy task, even for such a major player in the old hosting business. (If you want to know more about this move, see Geva Perry's analysis.)

Bottom Line
It seems that the cloud market, and mostly the IaaS segment, is reaching its mainstream phase, where scale is a top factor and community is a major factor in survival.

Would you like to share your opinion? Can you add more information? Feel free to comment!

Keep Performing,
Moshe Kaplan

Jul 10, 2010

New Blog Announcement: The VP R&D Handbook Blog


Last month I opened a new blog.

Are You Leaving Us?
No. The focuses of the two blogs are totally different; I'll continue to post new and good stuff here more than once in a while.

So, What is the Difference?
Well it is gonna be a major one:
  1. Language: a Hebrew blog rather than an English one.
  2. Location: Microsoft Israel rather than Google US.
  3. Content: Soft stuff like people management, R&D management issues and other issues from the VP R&D desk rather than cloud, system architecture and performance and other hard core stuff.
Sounds Cool?
If you are interested in this blog, feel free to enter my VP R&D handbook blog that may turn someday into a book.

Last Words
If you feel that you have something to contribute, feel free to send me a draft. I will be glad to host guest posts.

Keep Performing,
Moshe Kaplan

Jul 3, 2010

Microsoft is Catching Up with the Internet Industry

For a long time I have blamed Microsoft for not really understanding the Internet and cloud industry's basic need: very large systems that can handle billions of daily transactions at a very low revenue per transaction. Now, after admitting its failure in the mobile industry (see the Kin's death earlier this week), Microsoft finally reveals AppFabric after a very long time (I would like to thank my colleague Alon Biran for notifying me about it).



What is AppFabric?
The roots of AppFabric are in the "Velocity" project that was developed in the SQL Server group (I wrote a Knol about it a year and a half ago). This project aimed to provide a key-value store (a super hash) to be used for two main objectives:
  1. An instant in-memory data store (database replacement).
  2. A shared web session repository that enables instant failover between web servers (until now, the only in-house Microsoft solution was using SQL Server as a shared repository).
After being in various CTP and beta versions for almost two years, AppFabric is finally out as part of the Windows platform, rather than as part of the SQL Server product. You may take a look at my colleague David Chappell's AppFabric white paper for a complete set of features and scenarios. A minimal sketch of the first objective follows.
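Here is a hedged, minimal cache-aside sketch in Python that illustrates what such a key-value store buys you. The plain dict stands in for any of the stores discussed here (AppFabric, Memcached, SharedCache...), and load_user_from_db is a hypothetical query function:

  cache = {}  # stand-in for a distributed key-value store

  def load_user_from_db(user_id):
      # Hypothetical (slow, licensed) database query.
      return {"id": user_id, "name": "user%d" % user_id}

  def get_user(user_id):
      key = "user:%d" % user_id
      value = cache.get(key)
      if value is None:          # cache miss: hit the database...
          value = load_user_from_db(user_id)
          cache[key] = value     # ...and populate the cache for next time
      return value

Every hit served from memory is one less query against the high-cost database, which is exactly the cannibalization concern discussed below.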

Who are the Current Players?
The major players in the market these days are open source (each of the products has its own unique features; feel free to comment on this post about them):
  1. Memcached: the #1 product in the market. Its roots go back to 2003. It is an open source, Linux-based product; yet Memcached has a .Net client API as well as a Windows port of the product itself.
  2. SharedCache: a native C# open source product that is Memcached-like.
  3. StateServer: a commercial native Windows product by ScaleOut Software.
  4. NCache Express: another commercial native Windows product by Alachisoft.
Since AppFabric is provided for "free" as part of the Windows platform, the commercial native .Net products may be hurt by the Microsoft move. However, they may also gain users thanks to better publicity for this kind of solution in the Microsoft development community.

Why Did It Take So Long?
It seems that Microsoft had a major cannibalization issue: if you provide a good key-value store, you may reduce database size and usage intensity. Thus, the number of high-cost database licenses will be significantly lower and the bottom line will be smaller. Oracle has the same problem with Coherence, which is now part of the Fusion platform.
Taking this product to RTM shows a true understanding by Microsoft that if it really wants to take a significant role in the Internet and cloud industry, it will have to sacrifice more than just a few cents from its financial reports' bottom line.

Bottom Line
This is very good news: Microsoft finally provides a solution for a basic need of the Internet and cloud industry. However, while Microsoft was catching up with the industry's 2003 news, new requirements arose, including MapReduce, BigTable and other advanced NoSQL features. How much time will it take this time? Probably only Steve Ballmer has the answer.

Keep Performing,
Moshe Kaplan

Jun 19, 2010

At My Command Unleash Version!

 
As you may know, I'm a big fan of Continuous Integration, a.k.a. CI (running every trunk commit through some unit tests and integration tests; finding out that you ruined the build and earning the blame T-shirt is such a good way to start the morning :-).

Cool things you can do with CI
However, today's CI is much more than just running several tests. These days you can do almost everything with it:
  1. Get your latest version from trunk.
  2. Build the version.
  3. Launch a new virtual machine based on a predefined image or by installing it almost from scratch.
  4. Install the version on the new machine.
  5. Run smoke tests to provide an instant validation of the code stability.
  6. Run nightly builds to carefully check that the version is clean.
  7. Run stress tests to validate capacity, and long multi-day tests to detect memory leaks.
  8. Notify key players with the results.
Cloud Related Cool Scenarios
Cloud computing is basically about obtaining on-demand resources through an API (the GUI is fancy stuff that can be useful, but is not really a must). Therefore, you can use CI to automate every step of your testing, upload to production and rollback when needed. Some of the scenarios are:
  1. Performing Software Upgrade: Launching a new instance, installing the latest software version and creating an Image (AMI in AWS).
  2. Launching an entire new integration environment that is identical to the production environment and check its functionality as well as its correspondence with the non functional requirements.
  3. Launching a new environment for marketing needs.
  4. Managing your new version upload, by creating a script that will prepare new servers, upgrade your database servers (or RDS), add new application and web servers to the load balancer farms, and finally remove the old servers and shut them down.
  5. Implement Failover scenarios (DRP) 

Industry Best Practices
According to Amazon (see the document by Jinesh Varia, Technology Evangelist @ Amazon Web Services), CI and automation are the best practice for managing your cloud environment:
  1. Create a management server (like CruiseControl or regular CRON jobs).
  2. Create a build-test-deploy cycle script, using the chosen management server, that will:
    1. Generate a new AMI
    2. Deploy new servers based on this AMI
    3. Add these servers to the existing load balancer
    4. Attach elastic IPs
    5. Push static files to S3 and CloudFront
    6. Create new queues and notifications using SQS and SNS
    7. Remove old servers and unneeded AMIs
Using this method can save you a lot of configuration and optimization time, as well as prevent many production issues caused by human error (a minimal sketch follows).
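As a hedged sketch of such a cycle, here is a minimal Python driver of the kind a CruiseControl job or cron entry might invoke. Every script name below (build.sh, run_tests.sh, deploy_ami.py, rollback.py) is a hypothetical placeholder for your own steps:

  import subprocess
  import sys

  STEPS = [
      ["./build.sh"],               # build the version from trunk
      ["./run_tests.sh"],           # smoke and unit tests
      ["python", "deploy_ami.py"],  # bake the AMI, launch servers, update the LB
  ]

  for step in STEPS:
      if subprocess.run(step).returncode != 0:
          print("step failed: %s - rolling back" % " ".join(step))
          subprocess.run(["python", "rollback.py"])  # hypothetical rollback script
          sys.exit(1)
  print("build-test-deploy cycle completed")

The important property is that the whole cycle, including the rollback path, is a script rather than a runbook, so it behaves the same at 3 AM as it does in a demo.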

Which CI tool to choose?
There are many CI products in the market; some of them are open source and others are commercial. Here are some of the leading tools, thanks to Alon Nativ's advice:
  1. CruiseControl: a leading open source product with dozens of plugins and 3rd parties.
  2. Apache Continuum: an open source product with many features.
  3. FinalBuilder: a major commercial tool in the market with a great visual interface and many built-in scripts and APIs.
  4. JetBrains TeamCity: a commercial tool with a built in integration to Amazon AWS.
You can also take a look at RightScale's product and services. RightScale provides semi-CI in a SaaS manner, with many of the features we discussed before and more. I have included a RightScale video here, which is one of the best industry resources on this issue:


Bottom Line
Now, after going over the details, I recommend that you:
  1. Design your cloud environment.
  2. Design your baseline AMIs/Images.
  3. Design your upgrade, rollback and test-build-deploy scripts based on these basic AMIs.
  4. Implement them using your chosen Continuous Integration tool.
  5. Roll it out to production.
Keep Performing,
Moshe Kaplan

    Jun 2, 2010

    How Much Money did Amazon Make from that Post?

    My mate, Dima Shestak, just notified me about a new phenomenon in Amazon AWS spot prices over the last few days: the prices are going up...
    Is there any connection to my last post, "Solving the Spot Instances Pricing Paradox"? Only time will tell...

    Update: a recent graph (June 19, 2010) shows that the trend continues and prices are even touching the on demand prices:


    Keep Performing,
    Moshe Kaplan

    May 26, 2010

    Solving the Spot Instances Pricing Paradox

    Geva Perry, from Thinking Out Cloud, had a great post last December regarding AWS spot instance prices (bid-based prices, where the charge is based on current market pricing) and why spot instance pricing should not be higher than on-demand instances. The post had a comment thread by James Watters, Guy Rosen, Shlomo Swidler and me, discussing why spot instance pricing should not surpass on-demand instance pricing, and why this assumption was broken (for relatively short spikes) in reality.

    NOTE: Please find my previous post regarding spot instances for more information.

    May 23, 2010

    Why NoSQL is (not) Really Bad for You?

    A few days ago Zeev Rubinstein, a friend of mine, sent me a cool link to a Mu Dynamics Research Labs post that describes why they chose CouchDB, a leading Erlang-based NoSQL product, and what their major benefits from it are:
    1. Low CPU and memory utilization.
    2. Map/Reduce support and a non-normalized data model that shorten their time to market.
    3. Avoiding SQL injections. Since NoSQL systems are not yet widespread, there is a relatively small number of known attacks against these products. However, please note that as NoSQL products turn mainstream, there will probably be new types of attacks against them. Yet, if you are an ethical hacker, this can be a new and exciting topic that may make your name the hottest one in the industry.
    Keep Performing,
    Moshe Kaplan

    May 21, 2010

    When Agile Met Cloud Computing

    My mate Alon Nativ just referred me to JetBrains' newly introduced TeamCity feature: integration of their Continuous Integration (CI) platform with Amazon EC2. This new feature enables on-demand instances during the CI process.
    What is it good for?
    CI tends to have non-linear usage behavior: minor use at the beginning of a project, and massive use at the end of the day and just before releasing a new version.
    During these peak hours, TeamCity can launch on-demand instances, run the build and the tests, and shut the instances down when it ends.
     
    Bottom Line
    It is a fine example of how development teams can keep their infrastructure costs as low as possible, while maintaining maximum flexibility.

    Keep Performing,
    Moshe Kaplan

    May 19, 2010

    Boosting Your Blog Traffic and Performance

    Caution: This post does not include technical stuff.
    This time I would like to share some inside stuff from the other side of boosting traffic: several SMO, SEO and user satisfaction boosting methods:
    1. Make sure your blog is really in the right language. For example, blogger.com's default language is your own. If you blog in another language, please make sure you change the blogger.com settings to the appropriate language.
    2. Turn your site into a top news site using a related-links-with-images widget. Take a look at the cool Outbrain widget.

      Outbrain Thumbnail Widget from Outbrain on Vimeo.
    3. Place a popular posts widget at your blog's side. If other people thought these posts were good, there is a high chance that your next visitor will be interested in them as well. See the example and PostRank. Do not forget to register with PostRank and subscribe to your blog in order to make sure the widget is correctly initialized.
    4. Place a recent comments widget. User content is a free traffic generator, and if your visitor is interested in a comment, he may leave another one. See the example.
    5. Avoid connecting your blog to Google Buzz, Facebook notes, Plaxo and LinkedIn. These sites crawl your posts from the blog and prevent users from getting to your original post. In my experience it can cost you up to 50% of your daily traffic.
    6. Do post links to LinkedIn, Twitter (the LinkedIn connection is really useful), Facebook, Google Buzz and Plaxo. It will generate high quality traffic to your site.
    7. Add a social sharing toolbar. Surfers may like your post and retweet/like/etc., generating more traffic and more credibility. See the sociofluid.com solution, WidgetBox and ShareThis.
    8. Facebook badges are great to generate returning traffic and credibility. See profile and like.  
    9. Add retweet buttons to make it Twitter-oriented. Add StumbleUpon as well, and don't forget Digg.
    10. Place your blog feed on other sites in order to generate more traffic. See the Google example.
    11. Use Google Analytics and ClickTale to analyze your users' behavior.
    12. Take a look at other additional cool hacks.
    13. Write good stuff :-)
    Few more tips
    1. June 2010: Update and refresh your Twitter profile. Use twitbacks.com for that. See my Twitter profile as an example.
    2. July 2010: Add tools to grow your Twitter followers group, which may help you get more traffic later. TwitterCounter is a great example.
      Now it's time to start working :-)

      Note: There is a great comment thread to this post with interesting insights.

      Keep Performing,
      Moshe Kaplan

        May 11, 2010

        Memories from David Chappell's Lecture, or Why is MS Azure Still in Alpha?

        Clarification: This post is based on my understanding of the cloud computing market. I may not have fully understood the complete Microsoft offer, and probably in some aspects my colleague David Chappell and I have different perspectives and opinions.
        UPDATE 1 (May 16, 2010): Based on a conversation with David Chappell.
        I attended David Chappell's great lecture yesterday regarding the Microsoft Azure platform. Apart from letting you know that you cannot find decent WiFi access in the Microsoft R&D offices, I would like to share several technical and product details regarding Azure, and my opinion regarding Microsoft's cloud computing GTM:
        1. Both auto-scaling and getting back to the original state are supported through the API and not automatically; you should do it manually. (UPDATE 1)
        2. There is no admin mode: you can install (almost) nothing. No SQL Server, no PHP, no MySQL in service mode and no RDP... or in other words, total vendor lock-in. Since an HA system requires service mode, you must use the Azure stack for HA systems. (UPDATE 1)
        3. No imaging or "stopped/suspended staged server" mode where you pay only for the storage. You must pay for suspended staged machines (like in GoGrid, unlike Amazon EBS-based machines).
        4. Bandwidth: few people complain about it, because few people use it ("there is a lot of room right now..."). Please notice that having a lot of room is good. However, only time will tell how Microsoft will handle bandwidth at scale.
        5. Storage - S3, SimpleDB and SQS like storage mechanism that is accessed using REST (no SOAP in the system):
          1. Blobs: 
            1. EBS like binary data storage
            2. Can be mapped to "Azure Drives" that can be used as EBS: NTFS files
            3. CDN mechanism Support
          2. Tables: 
            1. Key value storage (NoSQL)
            2. Table > Entity > Property {@Name; @Type; @Value}.
            3. Don't expect SQL, but it is really scalable. You are right, it sounds just like SimpleDB.
          3. Queues: SQS like
            1. Used for task distribution between workers
            2. Please notice that you should delete messages from the queue to make sure a message does not appear again (there is a configurable 30-second timeout until messages reappear).
          4. Access: grouping data into storage accounts
        6. Fabric:
          1. Fabric Controller: controls the VMs, networking, etc. A fabric agent is installed on every machine. It is interesting that Microsoft exposes this element and tells us about it, while Amazon keeps it hidden... (UPDATE 1): Please notice that this is a plus for MS; it turns out that Amazon, for example, does not tell us everything about its infrastructure, and maybe it is time to expose this information.
          2. Azure Development Fabric: a downloadable framework that can be installed for development purposes on your premises. Why should you have this mess of incompatibility between development and production environments for the sake of saving a few dozen USD a month? Maybe MS knows... (UPDATE 1): Please notice that saving a few bucks is a great idea. However, based on my experience, WAN effects and a real hosting environment can present different behavior than the one you saw on your development premises. Therefore, I highly recommend you install both your integration and production instances in the same environment.
        7. SQL Azure Database:
          1. Up to a 10GB database. Sharding is a must if you really have data. But hey, if you really had data, you would not be here... (UPDATE 1): If you have critical systems with a lot of data, I don't recommend using Azure yet. However, you can find a previous post of mine regarding SQL Server sharding implementation.
        8. Windows Azure Platform AppFabric
          1. Service Bus: a SOA-like solution that aims to connect intranet web services and the cloud. This is a major component that does not exist in other players' offerings, due to the MS go-to-market (GTM) strategy (see below). Even if you use another cloud computing environment, you may use this service to integrate with your internal IT environment.
          2. Access Control
        Pricing
        Pricing is the #1 reason for using the cloud, and it seems that Microsoft is lagging behind in this area (see the cost sketch after the list):
        1. Computing: $0.12-$0.96 per hour; that is 3x AWS spot prices. (UPDATE 1): Please notice that spot prices are not the official Amazon price list; however, using them is a common method, and it is well explained in a previous post of mine.
        2.    Storage: $0.15/GB per month + $0.01/10K operations
        3. SQL Azure: $10/GB per month. Yes, it is limited to 10GB. (UPDATE 1): Please notice that the Amazon offering is extremely different, both in options and in pricing model, so decisions should be made based on a detailed calculation.
        4.    Traffic:
          1. America/Europe: $0.1/GB in, $0.15/GB out
          2. APAC: $0.3/GB in, $0.45/GB out
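        A quick sanity check of these numbers, as a minimal Python sketch; the prices are the ones quoted above, while the workload figures are made up:

          # Monthly cost of one Azure compute instance plus storage,
          # using the prices listed above. Workload numbers are hypothetical.
          compute_hourly = 0.12        # $/hour, low end of the quoted range
          storage_gb = 50              # hypothetical stored GB
          storage_rate = 0.15          # $/GB per month, quoted above
          ops = 5000000                # hypothetical storage operations per month
          ops_rate = 0.01 / 10000      # $ per operation, from $0.01/10K operations

          monthly = (compute_hourly * 24 * 30
                     + storage_gb * storage_rate
                     + ops * ops_rate)
          print("$%.2f per month" % monthly)  # 86.40 + 7.50 + 5.00 = $98.90

        At AWS spot prices (roughly a third of the quoted compute rate, per the note above), the same compute load would cost about $28.80 instead of $86.40.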
        Target Applications
        The Azure platform best fits the following scenarios according to David Chappell (who does not work for Microsoft, nor speaks for them), based on Amazon's experience:
        1. Massive scale web applications
        2. HA apps
        3. Apps with variable loads
        4. Short/Unpredictable life time applications: marketing campaigns, pre sale...
        5. Parallel processing applications: finance
        6. Startups
        7. Applications that do not fit well to the current organizations:
          1. No data center cases
          2. Joint venture with other parties
          3. When biz guys don't want to see the IT
        Microsoft Go To Market Strategy
        (UPDATE 1): Please notice that this is my analysis of the Microsoft go-to-market strategy; it is based on my market view and on conversations with market leaders, and may not reflect David Chappell's opinion:
        1. Microsoft is building on its partnerships and marketing force rather than providing the best technical/pricing solution. It relies on its ISVs to expand their existing platforms to Azure as part of its online/cloud/SaaS strategy and portfolio. (UPDATE 1): I believe that currently the Azure solution is less flexible than the IaaS offers in the market, and comes at a higher cost.
        2. MS currently focuses on establishing itself as a valid player in the cloud computing market, by comparing itself to Amazon, Google, Force.com and VMware. (UPDATE 1): Microsoft, of course, does not say this itself. However, it sponsors respected industry analysts to pass this message.
        3. It seems that Microsoft is trying to focus on solving enterprise requirements, rather than approaching the cloud-native players: web 2.0 and internet players that are natural early adopters... Is it a correct marketing move? Are enterprises really headed to this platform, when there are better and more stable platforms? Will they head there at all?
        4. My 2 Cents: we are probably going to see another Java/.Net round here. However, this time it is not just the operating system and development environment, but the whole stack from hardware to applications: Microsoft-oriented players will try the Azure platform, while others will probably use other solutions. However, this time Microsoft is not in a good position, since (1) a major part of the new players in the market prefer Java/Python/PHP/Erlang stacks over the .Net stack; and (2) other players already offer the Microsoft stack as part of their offering.
        (UPDATE 1): Personally, I think that Microsoft has a great development force, and it has shown in the past that it can enter existing markets with strong leaders (Netscape in browsers, Oracle in databases, Oracle and SAP in ERP and CRM) and turn into the market leader. I believe that this is their target in the cloud computing arena, and I'm sure that in the future Azure will be a competent player. However, this is still a first version, and I would not recommend running your critical system on it, just as you probably did not run your critical system on the first version of MS SQL Server.

        Keep Performing,
        Moshe Kaplan

        May 4, 2010

        VMForce: a Joint Strategic Move by VMware and Salesforce.com

        Salesforce.com and VMware have just announced VMforce.com, a new PaaS that enables Java developers to deploy their systems on a VMware/SpringSource/Force.com-based cloud platform. This solution has a lower vendor lock-in barrier than the current Force.com, but it includes all the existing benefits of Force.com's pre-built platform elements, including reporting and analytics, search, web services API, and application security services.


        This announcement is a result of several strategic moves in the market:
        • Salesforce.com is moving from SaaS and a very limited PaaS platform (Force.com) to a position closer to the IaaS market. With this move it turns itself into a valid competitor in the PaaS market (the market that Google App Engine and Microsoft Azure are aiming at).
        • The SpringSource acquisition by VMware, less than a year ago, provided VMware with its first PaaS platform component (SpringSource was behind the Spring framework, which enables instant delivery and management of Java applications).
        A few prospects for the future that only time will tell:
        • Salesforce.com's move into the PaaS world may signal that its next target is the IaaS market, which is currently controlled by Amazon Web Services. This move could be achieved by acquiring smaller players such as OpSource Cloud.
        • VMware is currently focused on the hypervisor market (a vendor position). Will this move take it closer to the cloud computing provider market (a service provider position)? Or will it keep to its parent company's strategy: keep playing with all market providers?
        • Will EMC get into the enterprise software market in response to Oracle's move into the storage and hardware market (the Sun acquisition)? Will EMC acquire Salesforce.com, gaining both a cloud computing provider and the leading SaaS package in the market?
        Keep Performing,
        Moshe Kaplan

        Apr 30, 2010

        Issuu: Erlang, Hadoop and AWS Become Mainstream

        There is a buzz in the industry about how the Erlang, Hadoop and AWS stack enables site scalability and lets sites handle the demand when they hit the buzz (TechCrunch, Digg...).
        An example is Issuu.com, a content publishing platform that uses this stack to enable millions of publications on the web and on mobile (Android and iPhone). Tania Anderson describes this case study in a Danish article (use Google Translate for it).

        In that article Issuu's CEO Michael Hansen reveals several key issues in their strategy:
        1. How they managed to recruit Erlang programmers (they hired talented programmers and taught them from scratch).
        2. Why they favored Amazon Web Services (Microsoft Azure hardly fits the internet mainstream stacks, and Google App Engine is focused on the front end, while Issuu needs back-end processing as well).
        3. How they could avoid AWS downtimes (they did not; however, most of the industry was affected, so it was a non-issue).

        Keep Performing,
        Moshe Kaplan 

        Apr 7, 2010

        Amazon Pushed the Bar. Again!

        Everybody is talking these days about the iPad, Windows Phone 7, iPhone OS4 and the Nexus One, and how they are going to change the world. Well, if everybody is going to change the world, they need reliable servers and, more importantly, traffic-efficient communications that will enable mobile applications to stay updated while consuming as little battery and network resources as possible. If you are looking for the right buzzwords, Google for COMET, Push and Long Polling.
        Current State
        Well, to be honest, the COMET server market state is not so good. Most of the products in the market are either young or lack important features such as high availability. However, I'll keep the full details of the market state for a survey that will be published in the next few days.
        The Amazon Move
        Amazon detected this new hot market and today exposed SNS: Simple Notification Service. This service provides you the ability to push data to your clients without establishing the server infrastructure or taking care of HA yourself.
        As a first impression, I can say the pricing is very attractive (the first 100K messages are free), and even the beta limit of 100 topics (channels) is fine for many applications (yes, I know, there are many of us who need XXXK :-).
        UPDATE: on second impression, this product (SNS) still does not support the need for push.
        The bottom line
        If you are a mobile developer, take a look at this new service. It can save you a lot of TTM.

        UPDATE and CLARIFICATION
        The Amazon SNS API includes the following calls:
        1. Topics: Create, Delete and List channels and Set and Get their attributes.
        2. Topics Permissions: Add and Remove permissions on the channel to AWS accounts
        3. Subscription: Subscribe and Unsubscribe to a channel, and subscription confirmation. Subscription is done via URL (http and https), email account or SQS. Subscription cannot be done using the COMET method or other methods that are feasible for a mobile subscriber (unless you are going to poll your email account).
        4. Publish: Publish a new info item into the channel.
        What is missing? The push mechanism that can support mobile and web clients and notify them about new info items (see the sketch below).
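        For reference, here is what the API surface above looks like in code. A minimal sketch using today's boto3 Python client rather than the 2010-era REST API; the topic name and endpoint are hypothetical:

          import boto3

          sns = boto3.client("sns")

          # 1. Topics: create a channel
          topic_arn = sns.create_topic(Name="info-items")["TopicArn"]

          # 3. Subscription: only http/https, email and SQS-style endpoints;
          # there is no COMET-style push a mobile client could hold open.
          sns.subscribe(TopicArn=topic_arn, Protocol="https",
                        Endpoint="https://example.com/sns-callback")

          # 4. Publish: push a new info item into the channel
          sns.publish(TopicArn=topic_arn, Message="new item available")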

        The Real Bottom Line
        Read the full Spec and the API before posting to your blog :-)

        Keep Performing,
        Moshe Kaplan

        Mar 29, 2010

        Web: Round #5

        Dong...
        The web is changing... HTML 5, Smartphones (iPhone, Android et al.), RIA, Video and AJAX are changing the industry.
        This time it is more than a technological question. It is more of a business question: Who will control the platform (Apple, MS, Google)? What is the business model (app stores, licensed platforms, ads, freemium or paid content)? How are the big guys going to make money out of it? Or at least, how will they avoid damaging their existing business models?
        The results of this battle will highly affect the common technological methods of the next few years, and choosing the right platform may save you a large amount of effort, frustration and $$$ in the upcoming years.

        Dong...
        In the next few weeks I'll publish several posts regarding our analysis and decision making these days: What is the right COMET (push mechanism) platform? How to provide a streaming service? What is the effect of HTML 5, and is it relevant to you?
        Dong...
        In the meantime, I recommend going over Jeremy Allaire's TechCrunch article, where Jeremy presents the business aspects and several technological issues.

        Keep Performing,
        Moshe Kaplan

        Feb 21, 2010

        Lecture: Extract The Traffic from the DB

        A few days ago I gave a presentation at the AlphaGeeks meetup in Tel Aviv, presenting NoSQL, Memcached, CouchDB, sharding and other buzzwords that help you extract the traffic from the database and boost your system performance. If you were not there, you can take a look at the presentation (English) or the recorded video (Hebrew).

        The Presentation (English)
        The Video (Hebrew)

        Keep Performing,
        Moshe Kaplan

        Feb 17, 2010

        Agile Tools for Agile Performance

        These days we are investing in our team, turning it Agile. This way we expect to bring better products to market sooner.
        We selected Agilo by Agile42 as our task, bug and wiki product. This is a Trac-based product with the following pros:
        1. It is based on Trac, so it includes all the common Trac features: road map, bugs, tasks and Wiki in a single product
        2. It is customized to Agile methodology including:
          1. White board (pro version)
          2. Sprints, Milestones
          3. Sprint Dashboard with Sprint Burndown, closure tickets rate and commitment charts
        3. It has a better UI than Trac.
        4. It has great packaging for instant installation (a Trac instant installation can be found in BitNami).
        5. Its community version is free (there is a pro version with several extra features, such as the whiteboard).
         Some useful info if you turn to Agilo: 
        1. Agilo installation
        2. Installing Agilo as a Windows service
          1.  Download the Windows Server 2003 Resource Kit from Microsoft
          2.  Install the service according to MS
          3. Update: In run.bat, change "set VIRTUAL_ENV=%cd%" to "set VIRTUAL_ENV=%Agilo%", and create an Agilo system variable whose value matches the path where the run.bat file is located.
        3.  Avoid errors
          1. "The password file could not be updated. Trac requires read and write access to both the password file and its parent directory": Change TrustedInstaller and Users permissions on the tracenv directory
          2. Got "acct_mgr.web_ui.MessageWrapper", well, open the trac.db and run DELETE FROM session_attribute to solve this issue.
        4. Control your source code
          1. If your SVN is not on the Trac/Agilo machine, you should use svnsync to make a local read-only SVN copy:
            svnsync synchronize http://localhost/svn/project --sync-username slaveuser --sync-password tjohej --source-password password
            c:\Python25\Scripts\trac-admin.exe c:\projects\trac\project\ resync
        5. Modifications
          1. Changing attached user files size for tickets and wiki (Update May 22, 2010) using the max_size parameter in trac.ini.
        6. Control your sprint
          1. You should add a priority field to the task in order to support prioritization.
          2. You should add a bug association with the sprint in order to see both in the same view.
          3. Agilo separates bugs from tasks (however, you probably manage both in the same sprint), so we created a report that tracks all issues:
            query:?status=accepted
            &status=assigned
            &status=closed
            &status=new
            &status=reopened
            &status=review
            &order=priority
            &col=id
            &col=summary
            &col=status
            &col=owner
            &col=type
            &col=priority
            &col=component
            &col=severity
            &col=remaining_time
            &col=drp_resources
            &sprint=SPRINT_NAME
        More goodies to follow,

        Keep Performing,
        Moshe Kaplan

        Feb 16, 2010

        Lecture: Memcached? SimpleDB? NoSQL?: how the big boys handle massive query loads with non-SQL solutions


        What: 5th AlphaGeeks Meetup
        Where: Tushia 10, Tel Aviv, Israel
        When: Wednesday, February 17, 2010, 18:30-21:30

        Abstract
        In the 1st AlphaGeeks meetup I presented the sharding concept and how it can help you meet the requirements of 1-billion-events/day systems. This time we'll talk about the sexy new and emerging NoSQL technologies and how they can help you meet these requirements.

        Other lectures
        - Amitay Dobo: On C# (4) and Mono.
        Why C# is a kick ass language that can bridge traditional, functional and dynamic typing languages, and how it all works with the Mono project.
        - Yuval Goldstein will conclude an international survey of 300 developers about their jobs, their salaries, professionalism and overall happiness.

        Unleash Your Cloud Load Stress Monster

        A common question when you prepare your system for the Slashdot effect is "How do I check that my system is capable of holding these numbers?"

        Location. Location. Location.
        You may have the following options:
        1. Buy a lot of hardware and set up a one-time (or more) lab to check your system's capabilities. Pros: they are your own servers, and you will always find something to do with the extra servers. Cons: it will burn your budget, keep your staff busy nights and days setting it up, and require several days to months to get everything installed.
        2. Rent a lab and do your thing there. Pros: you really don't need to put up that amount of money. Cons: scheduling the lab, making sure the hardware and networking meet your needs, making sure software licenses are available... Most important, if you find a major issue on the first day, you will have to close the lab and reschedule another test (and pay again).
        3. Set up a cloud-based lab. Pros: no setup fees, no need to schedule, no need to commit, and you can save your environment, shut it down, and turn it on when you need it again. Cons: you don't really own the servers, but hey, who really wants to own servers?
        The Tools
        OK, so we chose the cloud again. What about the tools? Should we choose HP Software LoadRunner or RadView WebLOAD? If so, get ready to write a check with a 6-digit number.
        However, the smart choice is the open source tool, Apache JMeter, which can generate HTTP stress (it's a World Wide Web world, after all) at the price of $0. This tool requires you to build the stress scripts manually using drag and drop, parameter configuration and BASH scripts, and it supports visualization using graphs and reports (it also supports SOAP, HTTPS, LDAP, JMS, IMAP, JDBC...).
        One last thing: JMeter supports a "bot network" mode, where several JMeter instances can load a single system and provide unified reporting.

        Decisions. Decisions. Decisions.
        So we have chosen the cloud environment (Amazon AWS currently provides the best offer) and JMeter... Now, just before launching instances, let's make several decisions that will help us keep costs as low as possible.

        Getting Best Prices
        1. Windows or Linux: since JMeter is Java based, it's platform independent and Linux will be the smart choice.
        2. Spot prices: by using a spot request, you can save about 60% of your CPU cost, and 30% of the total cost.
        3. Install both stress loaders and the system in Amazon to avoid paying for traffic.

        Stop talking. Start Working.

        Lets start with several basic steps:
          1. Download the client tools that will be used to connect over SSH from a Windows host:
          1. Download WinSCP
          2. Download Putty 
          3. Download PuttyGen
        2. Sign up to Amazon AWS and prepare your user:
          1. Sign up
            2. Generate a KeyPair from the AWS Management Console (this can be done using the CLI if you prefer).
          3. Download the KeyPair (PEM file) and create a private key (PPK file) that can be used by Putty and WinSCP:
            1. Open PuttyGen
            2. Conversions > Import Key to import your .PEM file
            3. Click on "Save private key" to create your private key file (PPK)
        3. Launch your instance and connect to it
            1. Launch an instance from an EBS-based image (if you prefer to keep your work for next time). Use a spot request to save some money. Please notice that the default Linux flavor is Fedora.
            2. Connect using WinSCP and the PPK file. CLI access can be done by starting Putty from within WinSCP.
        4. Install JMeter and its dependencies:
          1. Download and install JMeter
            1. Download the JMeter tar file from the site using wget
            2. Unzip the file using tar -zxvf file.tar.gz
          2. Download and install Java
            1. Download java using wget from http://java.com/en/download/manual.jsp and install it
            2. Set execute permissions on the Java installer: chmod a+x jre-6u18-linux-i586.bin
            3. Run jre-6u18-linux-i586.bin to install Java
          3. Set environment variables:
            1. Set Path: PATH=$PATH:/etc/java/jre1.6.0_18/bin (update it according to path where you installed Java).
            2. Set Path: PATH=$PATH:./
        5. Launch your JMeter
          1. jmeter -n -t my_test.jmx -l log.jtl
          2. You may find full details of this syntax in the Apache Jakarta JMeter page:
            1. -n: nongui mode
            2. -t: the script file you built before
            3. -l: the results file
        Last Words:
        Finally, we have a stress lab in the cloud. All that is left is writing your stress script, installing your system and starting to stress it...

        Keep Performing,
        Moshe Kaplan

        Feb 9, 2010

        Blocked Sessions In the Cloud

        When you test your new software (or a feature) in a new environment (e.g., installing your system in a cloud environment), you may face errors when you try to connect to your newly deployed service. What happened?
        There are two options:
        1. You did not install your system correctly. You can verify this by connecting to the service from within the server (using Terminal Services in the Windows case). If it fails, you should start exploring the event log and your application log.
        2. Someone is blocking your sessions (probably a firewall). You can verify this by running netstat -na from your client command line and checking for SYN_SENT lines in the output. If such a line is attached to the server IP and port that your service uses, you definitely have a firewall in your way. There are several options for who the blocker is and how to solve it (see the probe sketch after this list):
          1. Your computer's personal firewall. Probability: Low; Verification: try connecting from another computer.
          2. Your company firewall. Probability: Medium-Low; Verification: try connecting from another computer which is outside your company network (your favorite neighborhood cafe can be great).
          3. Your cloud provider firewall. Probability: High; Verification: if it's Amazon AWS, login to the AWS Management Console, verify the Security Group that your instance is linked to, and verify the Security Group rules.
          4. Your server firewall. Probability: High (if you are using Windows); Verification: check that the server firewall configuration allows incoming connections on the relevant ports.
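        To run the same check without reading netstat output, you can try to open a raw TCP session yourself; a silently dropped SYN (the SYN_SENT state above) shows up as a timeout. A minimal Python sketch; the server IP and port are placeholders:

          import socket

          def port_reachable(host, port, timeout_sec=5):
              s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
              s.settimeout(timeout_sec)  # a dropped SYN ends here as a timeout
              try:
                  s.connect((host, port))
                  return True    # the service answered: no firewall in the way
              except (socket.timeout, OSError):
                  return False   # blocked or not listening: work down the list above
              finally:
                  s.close()

          print(port_reachable("203.0.113.10", 8080))  # placeholder IP and port

        Run it from your desktop, from a machine outside the company network, and from within the cloud provider to pinpoint which firewall in the list above is the blocker.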
        Keep Performing,
        Moshe Kaplan

        Feb 7, 2010

        PHP Developer? Dance Like You Never Dance Before

        Facebook exposed its latest technology last week: HipHop for PHP.

        Why Do You Need It?
        PHP is slow (relative to Java, .Net/C# and of course to compiled code like C/C++) since it is interpreter-based. Faster means more actions using fewer CPU cycles. Fewer CPU cycles mean fewer servers, less CO2 emission and, some say most importantly, more money in the bank.
         
        What Could You Do So Far?
        There are several PHP accelerators in the market, like Alternative PHP Cache (APC), eAccelerator and Zend Optimizer+. These accelerators optimize PHP intermediate code and cache data and compiled code from the PHP bytecode compiler (very similar to turning C# into MSIL or Java into JVM bytecode).
         
        So What's New?
        There was still a major performance gap between native (unmanaged) code and bytecode. This gap is closed by this new Facebook technology: HipHop transforms PHP code into native code and gains a major performance boost (see the attached image from the Facebook blog).
        Last words
        You may think this can be useful only to a large site like Facebook, with its 350M users. However, every site with dozens of servers will gain major benefits from this technology: performance bottleneck reduction and slashing the number of servers (yes, again: money in the bank, CO2 emissions and operator time...).


        Keep Performing,
        Moshe Kaplan
