HPC Application Performance on ESX 4.1: Stream
Note that scaling starts to fall off after two threads and that the memory links
are essentially saturated at 8 threads. This is one reason why HPC apps often see
little benefit from enabling Hyper-Threading. To achieve the maximum aggregate
memory bandwidth in a virtualized environment, two virtual machines (VMs) with 8
vCPUs each were used. This approach is appropriate only for modeling apps that
can be split across multiple machines. One instance of Stream with N = 5×10^7 was
run in each VM simultaneously, so the total amount of memory accessed was the
same as in the native test. The advanced configuration option preferHT=1 was
used (see below). The bandwidths reported by the two VMs were summed to get the
total. The results are shown in Table 2: bandwidth is just slightly greater than
in the corresponding native case.
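The sizing argument above can be checked with a little arithmetic. The sketch below assumes the standard Stream layout of three arrays of 8-byte doubles and reads the per-VM problem size as N = 5×10^7; the function and variable names are illustrative only and are not taken from the benchmark source.

```python
# Rough memory-footprint arithmetic for the two-VM Stream runs.
# Assumption: Stream allocates three arrays (a, b, c) of 8-byte doubles.
BYTES_PER_DOUBLE = 8
STREAM_ARRAYS = 3


def stream_footprint_bytes(n):
    """Bytes occupied by the three Stream arrays for problem size n."""
    return STREAM_ARRAYS * BYTES_PER_DOUBLE * n


n_per_vm = 5 * 10**7   # problem size run inside each VM
num_vms = 2

total_elements = n_per_vm * num_vms
per_vm_gb = stream_footprint_bytes(n_per_vm) / 1e9
total_gb = stream_footprint_bytes(total_elements) / 1e9

print(f"elements per array, all VMs: {total_elements:.0e}")
print(f"footprint per VM: {per_vm_gb:.1f} GB, total: {total_gb:.1f} GB")
```

Under these assumptions each VM touches about 1.2 GB, and the two VMs together cover the same number of array elements as a single run with twice the problem size.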
Table 2. Virtualized total memory bandwidth, MB/s, 2 VMs, preferHT=1
Total threads        2        4        8       16
Copy             12535    22526    27606    27104
Scalar           10294    18824    26781    26537
Add              13578    24182    30676    30537
Triad            13070    23476    30449    30010
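To make the saturation visible, the Table 2 figures can be normalized to the 2-thread result. This is only a convenience sketch: the numbers are copied from Table 2, while the `scaling` helper and the dictionary layout are illustrative choices rather than part of the original measurements.

```python
# Bandwidth from Table 2 (MB/s), indexed by total thread count.
threads = [2, 4, 8, 16]
bandwidth_mb_s = {
    "Copy":   [12535, 22526, 27606, 27104],
    "Scalar": [10294, 18824, 26781, 26537],
    "Add":    [13578, 24182, 30676, 30537],
    "Triad":  [13070, 23476, 30449, 30010],
}


def scaling(values):
    """Bandwidth relative to the first (2-thread) measurement."""
    base = values[0]
    return [round(v / base, 2) for v in values]


for kernel, values in bandwidth_mb_s.items():
    ratios = scaling(values)
    row = ", ".join(f"{t} threads: {r}x" for t, r in zip(threads, ratios))
    print(f"{kernel:7s} {row}")
```

Perfect scaling from 2 to 8 threads would give 4x. Every kernel stays well below that, and going from 8 to 16 threads gives no further gain.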
It is apparent that the Linux “first-touch” memory allocation policy, together
with the simplicity of the Stream algorithm, is enough to ensure that nearly all
memory accesses in the native tests are “local” (that is, the processor each
thread runs on and the memory it accesses both belong to the same NUMA node). In
ESX 4.1, NUMA information is not passed to the guest OS and (by default) 8-vCPU
VMs are scheduled across NUMA nodes in order to take advantage of more physical
cores. This means that about half of the memory accesses will be “remote” and
that, in the default configuration, one or two VMs will produce significantly
less bandwidth than the native tests.

Setting preferHT=1 tells the ESX scheduler to count logical processors (hardware
threads) instead of cores when determining whether a given VM can fit on a NUMA
node. In this case, that forces both the memory and the CPUs of an 8-vCPU VM to
be scheduled on a single NUMA node. This guarantees that all memory accesses are
local, so the aggregate bandwidth of two VMs can equal or exceed native
bandwidth. Note that a single VM cannot match this bandwidth. It will get either
half of it (with preferHT=1, because it is using the resources of only one NUMA
node) or about 70% (in the default configuration, because half of the memory
accesses are remote). In both native and virtual environments, the maximum
bandwidth of purely remote memory accesses is about half that of purely local
accesses. On machines with more NUMA nodes, remote memory bandwidth may be lower
and memory locality even more important.
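The two effects described above can be illustrated with a toy model. It is not the ESX scheduler's actual logic. It assumes 4 cores and 8 hardware threads per NUMA node (inferred from the 8-vCPU discussion, not stated explicitly), and it treats remote accesses as running at half the local rate, processed one after another with local accesses. All function names are hypothetical.

```python
def fits_on_one_node(vcpus, cores_per_node, threads_per_core, prefer_ht):
    """Toy version of the fit check: count hardware threads when
    prefer_ht is set, physical cores otherwise."""
    per_core = threads_per_core if prefer_ht else 1
    return vcpus <= cores_per_node * per_core


def effective_bandwidth(local_bw, remote_fraction, remote_ratio=0.5):
    """Crude estimate of sustained bandwidth when a fraction of accesses
    is remote and remote bandwidth is remote_ratio * local_bw.
    Computed as total data divided by total transfer time."""
    local_time = (1.0 - remote_fraction) / local_bw
    remote_time = remote_fraction / (remote_ratio * local_bw)
    return 1.0 / (local_time + remote_time)


# Assumed topology: 4 cores per node, 2 hardware threads per core.
print(fits_on_one_node(8, 4, 2, prefer_ht=False))  # default: spans nodes
print(fits_on_one_node(8, 4, 2, prefer_ht=True))   # preferHT=1: one node

# Relative bandwidth (local = 1.0) for all-local, half-remote, all-remote.
for frac in (0.0, 0.5, 1.0):
    print(frac, round(effective_bandwidth(1.0, frac), 3))
```

With half of the accesses remote, this simple model gives roughly two thirds of the all-local rate. That is in the same range as the "about 70%" figure quoted above, though the real behavior depends on details the model ignores.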
Summary
In both native and virtualized environments, equivalent maximum memory bandwidth
can be achieved as long as the application is written or configured to use only
local memory. For native environments, this means relying on the Linux
“first-touch” memory allocation policy (for simple apps) or implementing
explicit mechanisms in the code (usually difficult if the code wasn’t designed
for NUMA). For virtual environments, a different mindset is needed: the
application must be able to run across multiple machines, with each VM sized to
fit on a NUMA node. On machines with Hyper-Threading enabled, preferHT=1 needs
to be set for the larger VMs. If these requirements can be met, then a valuable
feature of virtualization is that the app needs no NUMA awareness at all; NUMA
scheduling is taken care of by the hypervisor (for all apps, not just for those
where Linux is able to align threads and memory on the same NUMA node). For
apps where these requirements can’t be met (ones that need a large
single-instance OS), current development is focused on relaxing these
requirements so that such VMs behave more like native, while retaining the above
advantage for small VMs.





