This is probably not something you will be glad to see.
A quick analysis showed that the alerts and the high IO originated from servers installed in a new data center.
While the CPU time spent in IO wait at the old data center was around 25%, at the new data center it was about 75%.
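You can reproduce an iowait figure like this without a monitoring system by sampling /proc/stat twice. The sketch below assumes the standard Linux layout of the aggregate "cpu" line, where the fifth counter after the label is iowait jiffies:

```shell
#!/bin/sh
# Sketch: estimate the share of CPU time spent in iowait by sampling
# /proc/stat twice, one second apart. The aggregate "cpu" line lists
# jiffie counters: user nice system idle iowait irq softirq ...
s1=$(head -n 1 /proc/stat)
sleep 1
s2=$(head -n 1 /proc/stat)

printf '%s\n%s\n' "$s1" "$s2" | awk '
  { iow[NR] = $6; for (i = 2; i <= NF; i++) tot[NR] += $i }
  END { printf "iowait: %.1f%%\n",
        100 * (iow[2] - iow[1]) / (tot[2] - tot[1]) }'
```

This is the same quantity that top shows as "wa" and iostat as "%iowait".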
Who is to Blame?
In the new data center a NetApp 2240c was chosen as the storage appliance, while in the old one an IBM V7000 Unified was used. Both systems had SAS disks, so we didn't expect a major difference between the two. Yet, it was something worth exploring.
To verify the source, we ran a read/write performance benchmark against both systems using the following commands:
- Write: dd if=/dev/zero of=/tmp/outfile count=512 bs=1024k
- Read: dd if=/tmp/outfile of=/dev/null bs=4096k
UPDATE II: when using dd, you should prefer dd if=/dev/urandom of=/tmp/outfile.txt bs=2048000 count=100, which uses actual random input; /dev/zero just allocates space filled with nulls, which the storage may compress or optimize away.
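Putting the update into practice, here is a sketch that combines the commands above with two common dd caveats: random input (so the appliance cannot optimize away zeros) and conv=fdatasync, which makes dd flush to disk before reporting the write rate, so the page cache does not inflate the number. The file path is a placeholder; point it at the mount you want to benchmark.

```shell
#!/bin/sh
# TESTFILE is a placeholder; set it to a file on the storage under test.
TESTFILE=/tmp/dd_bench_outfile

# Write test: 100 MB of random data, flushed to disk before dd
# stops the clock (conv=fdatasync).
dd if=/dev/urandom of="$TESTFILE" bs=1M count=100 conv=fdatasync

# Read test. Caveat: unless caches are dropped first
# (sync; echo 3 > /proc/sys/vm/drop_caches, as root), this mostly
# measures RAM rather than the array.
dd if="$TESTFILE" of=/dev/null bs=4M

rm -f "$TESTFILE"
```

That read-cache effect may partly explain multi-GB/s read rates on SAS-backed arrays.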
On the NetApp 2240 we got 0.62GB/s write rate and 2.0GB/s read rate (in site #2)
On the IBM V7000 unified we got 0.57GB/s write rate and 2.1GB/s read rate (in site #2)
On the IBM V7000 unified we got 1.1GB/s write rate and 3.4GB/s read rate (in site #1)
When selecting and migrating between storage appliances, pay attention to their performance; otherwise, you may face these differences in production. However, differences should be inspected in the same environment. In our case, something that seemed like a storage issue turned out to be a VM/OS configuration or network issue (the exact cause is still under investigation).