US20130290957A1 - Efficient execution of jobs in a shared pool of resources - Google Patents
- Publication number
- US20130290957A1 (US13/590,881)
- Authority
- US
- United States
- Prior art keywords
- job
- virtual
- data
- virtual machine
- physical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
Definitions
- This invention relates to an efficient approach for utilization of job processing in a shared pool of resources. More specifically, the invention relates to assessing the virtual and physical topology of the shared resources and processing jobs responsive to the combined topology.
- MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computer nodes.
- the framework is commonly referred to as a cluster.
- Computational processing can occur on data stored either in a filesystem or a database.
- a master node receives a job input and partitions the job into smaller sub-jobs, which are distributed to the other nodes in the cluster or grid.
- the nodes in the cluster or grid are arranged in a hierarchy, and the sub-jobs may be further partitioned and distributed.
- the nodes responsible for processing the sub-jobs return processed data to the master node. More specifically, the processed data is collected and combined by the master node to form an output.
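The partition, distribute, and combine flow described above can be sketched as a small word count. This is a minimal single-process illustration, not the patented method: the function names (`partition`, `map_sub_job`, `combine`, `run_job`) are hypothetical, and the workers are simulated by a plain loop rather than separate nodes.

```python
from collections import Counter
from functools import reduce


def partition(job_input, num_workers):
    """Master step: split the input records into roughly equal sub-jobs."""
    return [job_input[i::num_workers] for i in range(num_workers)]


def map_sub_job(lines):
    """Worker step: count the words in one sub-job."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def combine(partials):
    """Master step: merge the partial results into a single output."""
    return reduce(lambda a, b: a + b, partials, Counter())


def run_job(job_input, num_workers=3):
    sub_jobs = partition(job_input, num_workers)
    # In a real cluster each sub-job would be processed on a different node.
    partials = [map_sub_job(sub_job) for sub_job in sub_jobs]
    return combine(partials)
```

The round-robin split in `partition` is only one possible choice; a real framework would typically split by block location.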
- MapReduce is an algorithmic technique for the distributed processing of large amounts of data associated with a job.
- MapReduce enables distribution of data processing across a network of nodes. Although there is a convenience factor associated with the use of MapReduce, there are performance issues associated with current uses of MapReduce for processing jobs.
- This invention comprises a method for efficient processing of jobs in a shared pool of resources.
- a computer implemented method for implementation in a shared pool of resources. More specifically, the shared pool includes a physical host in communication with at least one physical machine, with the physical machine supporting one or more virtual machines. Status information associated with the operation of the virtual machines is collected. In addition, local topology information associated with the shared pool of resources is gathered. The aspect of gathering this information includes periodically communicating with an embedded monitor of the physical machine. The gathered topology information is organized, with the topology information including storage topology underlying a virtual topology of the resources and associated resource utilization information. Once the storage topology information is organized, it may be leveraged to facilitate processing of one or more jobs. More specifically, a job may be responsively assigned to a select virtual machine in the shared pool that supports efficient performance of I/O associated with the job.
- a computer implemented method for implementation in a shared pool of resources. More specifically, the shared pool of resources includes a physical host in communication with at least one physical machine supporting one or more virtual machines. Status information is collected from one or more of the virtual machines. Local topology information of a hierarchical organization of a shared pool of resources represented by the physical and virtual machines is periodically gathered and organized. More specifically, the organized topology information is stored in the shared pool of resources. Utilization of storage resources and virtual machines represented in the topology are assessed. A job is assigned to one or more select virtual machines in response to the topology and the assessment of the storage resources. Accordingly, the job assignment supports efficient performance responsive to both the topology and the storage resource utilization assessment.
- FIG. 1 depicts a cloud computing node according to an embodiment of the present invention.
- FIG. 2 depicts a cloud computing environment according to an embodiment of the present invention.
- FIG. 3 depicts abstraction model layers according to an embodiment of the present invention.
- FIG. 4 depicts a block diagram illustrating the architecture for using cloud aware MapReduce.
- FIG. 5 depicts a flow chart illustrating allocation of resources and management of data and virtual machine placement.
- FIG. 6 depicts a flow graph illustrating sample network topology for data placement.
- FIG. 7 depicts a flow graph for sample data placement.
- FIG. 8 depicts a flow graph for virtual machine placement.
- FIG. 9 depicts a table with values assigned to the flow graph for virtual machine placement and node categorization.
- FIG. 10 depicts a flow chart illustrating a process for assessing and leveraging the physical and virtual machine topology in a shared pool of resources.
- FIG. 11 depicts a flow chart illustrating steps to support the aspect of leveraging the storage topology of the shared pool of resources.
- FIG. 12 depicts a block diagram illustrating tools embedded in a computer system to support a technique employed for assessment of resource utilization for use in assignment of a job within a shared pool of resources.
- FIG. 13 depicts a block diagram showing a system for implementing an embodiment of the present invention.
- a manager or director may be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, or the like.
- the manager(s) or director(s) may also be implemented in software for processing by various types of processors.
- An identified manager or director of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, function, or other construct. Nevertheless, the executable of an identified manager or director need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the managers and directors and achieve the stated purpose of the managers and directors.
- a manager or director of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices.
- operational data may be identified and illustrated herein within the manager or director, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, as electronic signals on a system or network.
- a cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability.
- An infrastructure comprising a network of interconnected nodes.
- Referring to FIG. 1 , a schematic of an example of a cloud computing node is shown.
- Cloud computing node ( 10 ) is only one example of a suitable cloud computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein. Regardless, cloud computing node ( 10 ) is capable of being implemented and/or performing any of the functionality set forth hereinabove.
- a computer system/server 12
- Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server ( 12 ) include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.
- Computer system/server ( 12 ) may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system.
- program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular jobs or implement particular abstract data types.
- Computer system/server ( 12 ) may be practiced in distributed cloud computing environments where jobs are performed by remote processing devices that are linked through a communications network.
- program modules may be located in both local and remote computer system storage media including memory storage devices.
- computer system/server ( 12 ) in cloud computing node ( 10 ) is shown in the form of a general-purpose computing device.
- the components of computer system/server ( 12 ) may include, but are not limited to, one or more processors or processing units ( 16 ), a system memory ( 28 ), and a bus ( 18 ) that couples various system components including system memory ( 28 ) to processor ( 16 ).
- Bus ( 18 ) represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
- Computer system/server ( 12 ) typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server ( 12 ), and it includes both volatile and non-volatile media, removable and non-removable media.
- System memory ( 28 ) can include computer system readable media in the form of volatile memory, such as random access memory (RAM) ( 30 ) and/or cache memory ( 32 ).
- Computer system/server ( 12 ) may further include other removable/non-removable, volatile/non-volatile computer system storage media.
- storage system ( 34 ) can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”).
- a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”)
- an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media
- each can be connected to bus ( 18 ) by one or more data media interfaces.
- memory ( 28 ) may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
- Program/utility ( 40 ), having a set (at least one) of program modules ( 42 ), may be stored in memory ( 28 ) by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating systems, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment.
- Program modules ( 42 ) generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
- Computer system/server ( 12 ) may also communicate with one or more external devices ( 14 ), such as a keyboard, a pointing device, a display ( 24 ), etc.; one or more devices that enable a user to interact with computer system/server ( 12 ); and/or any devices (e.g., network card, modem, etc.) that enable computer system/server ( 12 ) to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces ( 22 ). Still yet, computer system/server ( 12 ) can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter ( 20 ).
- network adapter ( 20 ) communicates with the other components of computer system/server ( 12 ) via bus ( 18 ).
- It should be understood that, although not shown, other hardware and/or software components could be used in conjunction with computer system/server ( 12 ). Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems.
- cloud computing environment ( 50 ) comprises one or more cloud computing nodes ( 10 ) with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone ( 54 A), desktop computer ( 54 B), laptop computer ( 54 C), and/or automobile computer system ( 54 N) may communicate.
- Nodes ( 10 ) may communicate with one another. They may be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof.
- This allows cloud computing environment ( 50 ) to offer infrastructure, platforms and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device. It is understood that the types of computing devices ( 54 A)-( 54 N) shown in FIG. 2 are intended to be illustrative only and that computing nodes ( 10 ) and cloud computing environment ( 50 ) can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser).
- Referring to FIG. 3 , a set of functional abstraction layers provided by cloud computing environment ( 50 ) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 3 are intended to be illustrative only and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided: hardware and software layer ( 60 ), virtualization layer ( 62 ), management layer ( 64 ), and workload layer ( 66 ).
- the hardware and software layer ( 60 ) includes hardware and software components.
- Examples of hardware components include mainframes, in one example IBM® zSeries® systems; RISC (Reduced Instruction Set Computer) architecture based servers, in one example IBM pSeries® systems; IBM xSeries® systems; IBM BladeCenter® systems; storage devices; networks and networking components.
- Examples of software components include network application server software, in one example IBM WebSphere® application server software; and database software, in one example IBM DB2® database software.
- Virtualization layer ( 62 ) provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers; virtual storage; virtual networks, including virtual private networks; virtual applications and operating systems; and virtual clients.
- management layer ( 64 ) may provide the following functions: resource provisioning, metering and pricing, user portal, service level management, and SLA planning and fulfillment.
- resource provisioning provides dynamic procurement of computing resources and other resources that are utilized to perform jobs within the cloud computing environment.
- Metering and pricing provides cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may comprise application software licenses.
- Security provides identity verification for cloud consumers and jobs, as well as protection for data and other resources.
- User portal provides access to the cloud computing environment for consumers and system administrators.
- Service level management provides cloud computing resource allocation and management such that required service levels are met.
- Service Level Agreement (SLA) planning and fulfillment provides pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
- Workloads layer ( 66 ) provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include, but are not limited to: mapping and navigation, software development and lifecycle management, virtual classroom education delivery, data analytics processing, job processing, and processing one or more jobs responsive to the hierarchy of virtual resources within the cloud computing environment.
- a virtual machine is a software and/or hardware implementation of a computer that executes programs similar to a physical machine.
- the virtual machine supports an instance of an operating system along with one or more applications to run in an isolated partition within the computer.
- the virtual machine enables different operating systems to run in the same computer simultaneously.
- One physical machine may support multiple virtual machines. Multiple operating systems can run in the same physical machine, and each of the virtual machines may process jobs with different operating systems. Accordingly, the use of one or more virtual machines associated with a single physical machine supports efficient use of hardware while processing multiple jobs.
- Efficient use of a virtual machine configuration in a cloud computing system is challenging due to the distributed nature of the physical topology of the physical machines, i.e. nodes. More specifically, the concerns pertain to assessing and exposing the physical topology underlying the virtual topology of the cloud computing system and leveraging job processing responsive to the physical and virtual topology.
- a cloud platform, referred to herein as CAM, is provided to combine a cluster file system with a resource scheduler.
- CAM adopts a three level approach to avoid placement anomalies. The three levels include data placement, virtual machine-job placement, and job placement.
- In data placement, data is placed within the cluster based on offline profiling of the jobs that most commonly run on the data.
- Job placement is effected by CAM selecting the best possible physical node(s) on which to place the sets of virtual machines assigned to a job.
- CAM exposes otherwise hidden compute, storage, and network topologies to the job scheduler such that it can make optimal job assignments.
- CAM reconciles resource allocation with a variety of other competing constraints, such as storage utilization, changing CPU load and network link capacities using a flow-network based algorithm that is able to simultaneously reduce the specified constraints for both initial placement and for readjusting under virtual machine and data migration.
- FIG. 4 is a block diagram ( 400 ) illustrating the overall architecture for using CAM.
- the physical resources supporting the cloud consist of a cluster of physical nodes with local storage directly attached to the individual nodes.
- CAM uses a filesystem ( 410 ), which in one embodiment may be a General Parallel File System-Shared Nothing Cluster (GPFS-SNC), to provide its storage layer.
- GPFS-SNC is designed to be a cloud storage platform, which supports timely and resource-efficient deployment of virtual machines in the cloud.
- GPFS-SNC manages the local disks directly attached to a cluster of commodity physical machines.
- the physical layer ( 420 ) is illustrated with three commodity physical machines ( 430 ), ( 440 ), and ( 450 ).
- the quantity of physical machines shown herein is for illustrative purposes. In one embodiment, the quantity may include a smaller or larger quantity of commodity physical machines. Each of these machines has a local disk ( 432 ), ( 442 ), and ( 452 ), respectively.
- Filesystem ( 410 ) supports co-locating all blocks of a file at one location, rather than striping the file across the network. This enables a virtual machine I/O request to be serviced locally from the stored location instead of remotely from physical hosts across the network. CAM leverages this feature to ensure that co-located virtual machine images are stored at one location and can be accessed efficiently.
- Filesystem ( 410 ) also supports an efficient block-level pipelined replication scheme, which guarantees fast distributed recovery and high I/O throughput through fast parallel reads. This feature is useful for CAM for achieving efficient failure recovery.
- filesystem ( 410 ) specifies a user level application program interface (API) that can be used to query the physical location of files. CAM uses this API to get actual block location information for determining storage closeness for data and virtual machine placement.
- a server ( 460 ) is provided in communication with the physical layer ( 420 ).
- Information exposed by the server ( 460 ) consists of network and storage topologies, and other dynamic node-level information, such as CPU load.
- each of the physical machines ( 430 ), ( 440 ), and ( 450 ) has an agent ( 434 ), ( 444 ), and ( 454 ), respectively.
- Each agent runs on its respective physical machine and periodically collects and conveys to the server ( 460 ) a variety of pieces of data about the respective machine, such as utilization of outbound and inbound network bandwidth, I/O utilization, and CPU, memory, and storage load.
- the server ( 460 ) consolidates dynamic information it receives from the agents ( 434 ), ( 444 ), and ( 454 ), and serves it along with topology information about each job running in the cluster.
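The agent and server interaction described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class and field names (`NodeReport`, `TopologyServer`, `cpu_load`, `io_util`, and so on) are hypothetical and do not come from the source, and the periodic network transport between agent and server is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    """One periodic sample sent by an agent (field names are illustrative)."""
    node: str
    cpu_load: float
    io_util: float
    net_in: float
    net_out: float


@dataclass
class TopologyServer:
    latest: dict = field(default_factory=dict)        # node -> newest report
    vm_placement: dict = field(default_factory=dict)  # vm -> physical node

    def receive(self, report: NodeReport) -> None:
        # Keep only the most recent sample for each physical machine.
        self.latest[report.node] = report

    def snapshot(self) -> dict:
        """Consolidated view: per-node metrics plus the VMs placed on the node."""
        view = {}
        for node, rep in self.latest.items():
            vms = sorted(vm for vm, n in self.vm_placement.items() if n == node)
            view[node] = {
                "cpu_load": rep.cpu_load,
                "io_util": rep.io_util,
                "vms": vms,
            }
        return view
```

A scheduler could poll `snapshot()` to obtain dynamic load together with the placement-derived topology in one call.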
- the topology information is derived from existing virtual machine placement configuration.
- a job scheduler ( 470 ) interfaces with the server ( 460 ) to obtain accurate and current topology information.
- the scheduler ( 470 ) readjusts job placement accordingly whenever a change in the configuration is observed. Accordingly, the job scheduler leverages the storage and physical host resource utilization in CAM.
- Network topology information is represented by a distance matrix that encodes the distance between each pair of virtual machines as cross-rack, cross-node, or cross-virtual machine.
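One way to picture such a distance matrix is sketched below. The three categories come from the text; the numeric ordering, the extra `SAME_VM` value for the diagonal, and the input mappings `vm_to_node` and `node_to_rack` are assumptions made for illustration.

```python
from enum import IntEnum


class Distance(IntEnum):
    SAME_VM = 0      # diagonal entry (assumed, not named in the text)
    CROSS_VM = 1     # different virtual machines on the same physical node
    CROSS_NODE = 2   # different physical nodes in the same rack
    CROSS_RACK = 3   # physical nodes in different racks


def build_distance_matrix(vm_to_node, node_to_rack):
    """Return {(vm_a, vm_b): Distance} for every ordered pair of VMs."""
    vms = sorted(vm_to_node)
    matrix = {}
    for a in vms:
        for b in vms:
            node_a, node_b = vm_to_node[a], vm_to_node[b]
            if a == b:
                d = Distance.SAME_VM
            elif node_a == node_b:
                d = Distance.CROSS_VM
            elif node_to_rack[node_a] == node_to_rack[node_b]:
                d = Distance.CROSS_NODE
            else:
                d = Distance.CROSS_RACK
            matrix[(a, b)] = d
    return matrix
```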
- Storage topology information is provided as a mapping between each virtual device containing the dataset and the virtual machine to which it is local.
- a disk attached to a node can be directly accessed through a PCI bus.
- the physical blocks belonging to a virtual machine image attached to a virtual machine could be located on a different node. Even though a virtual device might appear to be directly connected to the virtual machine, the image file backing the device could be across the network, and potentially closer to another virtual machine in the cluster than the one it is directly attached to.
- the server ( 460 ) queries the physical image location through an API and presents the information to the scheduler ( 470 ).
- the distance is estimated based on observed data transfer rates between the virtual machines, and in one embodiment is expressed in units of bandwidth.
- the instruction get_block_location supports getting the actual block location instead of the location of a virtual machine, thereby guaranteeing locality.
- the instructions get_vm_networkinfo, get_vm_diskinfo, and get_vm_cpuinfo facilitate the job scheduler ( 470 ) with respect to querying the I/O and CPU contention information related to network and disk utilization.
- the scheduler ( 470 ) can leverage this additional information to make smarter decisions, including in one embodiment placing I/O intensive jobs on physical hosts that have idle I/O resources.
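A scheduler decision of this kind might look like the sketch below. The names `get_vm_diskinfo` and `get_vm_networkinfo` appear in the text, but their signatures are not given; here they are assumed to be callables that take a host name and return a utilization fraction between 0 and 1. The function `pick_host_for_io_job` and the `max_net_util` threshold are hypothetical.

```python
def pick_host_for_io_job(hosts, get_vm_diskinfo, get_vm_networkinfo,
                         max_net_util=0.8):
    """Return the host with the most idle disk I/O whose network is not saturated.

    Both query functions are assumed to return a utilization fraction in [0, 1].
    Returns None when every host exceeds the network threshold.
    """
    candidates = [h for h in hosts if get_vm_networkinfo(h) <= max_net_util]
    if not candidates:
        return None
    # Lowest disk utilization means the most idle I/O capacity.
    return min(candidates, key=get_vm_diskinfo)
```

For example, with disk utilization `{"p1": 0.9, "p2": 0.2, "p3": 0.1}` and network utilization `{"p1": 0.1, "p2": 0.3, "p3": 0.95}`, the function skips `p3` because its network is saturated and selects `p2`.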
- FIG. 5 is a flow chart ( 500 ) illustrating allocation of resources and management of data and virtual machine placement.
- CAM is a cloud platform with specific interfaces and support for running MapReduce jobs.
- a dataset to be processed is initially placed on a General Parallel File System (GPFS). More specifically, storage and compute resources are not segregated.
- a MapReduce job is submitted by providing the application, e.g. relevant java class files, indicating a previously uploaded dataset corresponding to the job, and the number and type of virtual machines to be used for the job ( 502 ).
- each virtual machine supports several MapReduce job slots depending on the number of virtual CPUs and virtual RAM allocated to the virtual machine. The greater the number of virtual machines assigned to a job, the faster the job is completed.
- CAM determines an optimal placement for the set of new virtual machines requested ( 504 ) by considering factors including, but not limited to, current workload distribution among cluster nodes, distribution of the input datasets required by the job, and the physical locations of the required master virtual machine images.
- the virtual machine images required to boot the virtual machines on the selected nodes are created from the respective master images ( 506 ).
- a copy-on-write mechanism provided by GPFS is utilized for creating the virtual machine, as this allows a virtual machine image instance to be provisioned quickly without requiring a data copy of a master image.
- job class files are copied into a cloned virtual machine image ( 508 ).
- the copying at step ( 508 ) takes place by mounting the image as a loop-back file system.
- the data images are attached to the virtual machines and the respective files are mounted within the virtual machine ( 510 ), as this enables jobs to access data contained therein.
- each physical machine ( 430 ), ( 440 ), and ( 450 ) is equipped with local disks ( 432 ), ( 442 ), and ( 452 ), respectively.
- a distributed file system ( 410 ) also referred to herein as GPFS-SNC, is installed on top of the physical machines ( 430 ), ( 440 ), and ( 450 ).
- the virtual machine images ( 436 ), ( 446 ), and ( 456 ) are stored in the distributed file system ( 410 ).
- a cloud manager allocates resources to jobs and manages data placement and virtual machine placement.
- Placement of both data and virtual machines are aspects to be considered with respect to cost. Specifically, placement includes guaranteeing virtual machine closeness, avoiding hotspots, and balancing physical storage utilization according to different job types.
- Virtual machine closeness expresses cost of accessing data and captures how data should be placed with respect to its associated virtual machines to minimize access times.
- the hotspot factor expresses the expected load on a machine and identifies machines that do not have enough computational resource(s) to support the assigned virtual machines. To avoid a hotspot, data needs to be placed on the least loaded machine. This can be determined by measuring currently allocated computational resources to the machine and adding to it the expected allocation requirement of the virtual machines that would work with the data to be placed on the machine.
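The hotspot check described above (current allocation plus the expected requirement of the virtual machines that would work with the data) can be sketched as follows. The function name `least_loaded_machine`, the single scalar load unit, and the per-machine `capacity` map are assumptions for illustration.

```python
def least_loaded_machine(current_load, expected_vm_load, capacity):
    """Pick the machine whose current plus expected load is lowest.

    Machines whose projected load would exceed their capacity are treated
    as hotspots and skipped. Returns None if every machine is a hotspot.
    """
    projected = {m: load + expected_vm_load for m, load in current_load.items()}
    eligible = {m: load for m, load in projected.items() if load <= capacity[m]}
    if not eligible:
        return None
    return min(eligible, key=eligible.get)
```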
- storage utilization expresses a percentage of total physical machine storage space that is in use.
- Table 2 shows the significance of the three factors outlined above that affect the performance of different workloads.
- FIG. 6 is an illustration of a flow graph ( 600 ) showing a sample network topology for data placement. More specifically, the sample network topology consists of six physical nodes (p 1 , p 2 , p 3 , p 4 , p 5 , p 6 ) identified as ( 610 ), ( 612 ), ( 614 ), ( 616 ), ( 618 ), and ( 620 ). The physical nodes are organized into three racks (r 1 , r 2 , r 3 ) identified as ( 630 ), ( 632 ), and ( 634 ). A master rack r 4 , e.g. a switch, identified as ( 650 ), connects the racks. In one embodiment, the model shown here can support any topology in which network traffic can be estimated. The unit of data placement is a virtual machine image to ensure that an entire image is available at one location.
- Based on the flow graph of FIG. 6 , a second flow graph ( 700 ) is illustrated in FIG. 7 for sample data placement.
- Two data items, d 1 ( 702 ) and d 2 ( 704 ), with requests for 5 and 2 virtual machine images, respectively, are submitted to the cloud, e.g. the shared pool of resources.
- the number of virtual machine images requested by a data item is denoted as the data item's supply for the flow graph.
- a sink node, S ( 790 ) is added to the graph to support the virtual machines.
- the number of virtual machines that a sink node can handle is assigned as a demand value.
- sink node ( 790 ) has a demand of seven and is the only place that can receive all of the flows.
- Each flow graph edge has two parameters attached, including the capacity of the edge and the cost for a flow to go through the edge.
- the data nodes, d 1 ( 702 ) and d 2 ( 704 ), have outgoing links to each rack with virtual machine closeness as costs.
- data node d 1 ( 702 ) has link ( 712 ) going to rack r 1 ( 722 ), data node d 1 ( 702 ) has link ( 714 ) going to rack r 2 ( 724 ), data node d 1 ( 702 ) has link ( 716 ) going to rack r 3 ( 726 ), data node d 1 ( 702 ) has link ( 718 ) going to rack r 4 ( 728 ), data node d 2 ( 704 ) has link ( 732 ) going to rack r 1 ( 722 ), data node d 2 ( 704 ) has link ( 734 ) going to rack r 2 ( 724 ), data node d 2 ( 704 ) has link ( 736 ) going to rack r 3 ( 726 ), and data node d 2 ( 704 ) has link ( 738 ) going to rack r 4 ( 728 ).
- the hotspot factor is encoded in the links ( 750 ), ( 752 ), ( 754 ), ( 756 ), ( 758 ), and ( 760 ) from the racks to each physical node p within its range.
- Because r 4 ( 728 ) serves as a switch between the racks, it is shown as directly connected to all the physical nodes. This ensures that the least loaded machine can be chosen for Map intensive jobs without being constrained by the network topology.
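A flow graph of this shape can be solved with any minimum-cost flow routine. The sketch below uses successive shortest paths with Bellman-Ford, which is one standard technique and not necessarily the solver used by CAM. The graph is a reduced version of FIG. 7 (two racks, three physical nodes, plus a super source that is not in the figure), and every capacity and cost value is invented for illustration.

```python
def min_cost_flow(num_nodes, edges, source, sink, demand):
    """Successive shortest paths. edges: list of (u, v, capacity, cost).

    Returns (total_cost, flow_per_edge) with flows in the order of `edges`.
    """
    graph = [[] for _ in range(num_nodes)]  # node -> indices of outgoing edges
    to, cap, cost = [], [], []
    for u, v, c, w in edges:
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    inf = float("inf")
    sent, total = 0, 0
    while sent < demand:
        dist = [inf] * num_nodes
        prev_edge = [-1] * num_nodes
        dist[source] = 0
        for _ in range(num_nodes - 1):  # Bellman-Ford relaxation rounds
            changed = False
            for u in range(num_nodes):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_edge[to[e]] = e
                        changed = True
            if not changed:
                break
        if dist[sink] == inf:
            raise ValueError("demand cannot be satisfied")
        # Find the bottleneck on the cheapest path, then push flow along it.
        push, v = demand - sent, sink
        while v != source:
            e = prev_edge[v]
            push = min(push, cap[e])
            v = to[e ^ 1]  # paired reverse edge points back to the tail
        v = sink
        while v != source:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        sent += push
        total += push * dist[sink]
    flows = [cap[2 * i + 1] for i in range(len(edges))]
    return total, flows


# Nodes: 0 super source, 1 d1, 2 d2, 3 r1, 4 r2, 5 p1, 6 p2, 7 p3, 8 sink S.
EDGES = [
    (0, 1, 5, 0), (0, 2, 2, 0),   # supplies: d1 requests 5 images, d2 requests 2
    (1, 3, 5, 1), (1, 4, 5, 3),   # d1 -> r1, d1 -> r2 (closeness costs, made up)
    (2, 3, 2, 4), (2, 4, 2, 1),   # d2 -> r1, d2 -> r2
    (3, 5, 3, 2), (3, 6, 3, 1),   # r1 -> p1, r1 -> p2 (hotspot costs, made up)
    (4, 7, 4, 1),                 # r2 -> p3
    (5, 8, 3, 0), (6, 8, 3, 0), (7, 8, 4, 0),  # physical nodes -> sink
]
total_cost, flows = min_cost_flow(9, EDGES, source=0, sink=8, demand=7)
```

With these invented values, all five images of d1 go through r1 and both images of d2 go through r2, for a total cost of 16.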
- $N_{d_j}$ is the number of virtual machine images requested by dataset $d_j$
- $\alpha_{jk}$ captures virtual machine closeness.
- the cost $\alpha_{jk}$ of the outgoing link from the dataset $d_j$ to the physical host $p_i$ on rack $r_k$ on which the data is placed is estimated conservatively by the traffic in the shuffle phase as follows:
- $\alpha_{jk} = \dfrac{size_{intermediate\text{-}data} \times (num_{Reducer} - 1) \times distance_{max}}{num_{Reducer}}$
- $distance_{max}$ is the network distance between any two nodes in the rack $r_k$.
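As a worked example of the shuffle-phase estimate, the sketch below reads the cost as the intermediate data size multiplied by (num Reducer - 1) / num Reducer and by distance max. That reading of the formula, the function name `closeness_cost`, and the units are assumptions.

```python
def closeness_cost(size_intermediate_data, num_reducer, distance_max):
    """Estimate shuffle traffic cost for placing a dataset in a rack.

    With num_reducer reducers, a share (num_reducer - 1) / num_reducer of the
    intermediate data is assumed to cross the network at distance_max.
    """
    if num_reducer < 1:
        raise ValueError("num_reducer must be at least 1")
    return size_intermediate_data * (num_reducer - 1) * distance_max / num_reducer
```

For instance, 100 units of intermediate data, 4 reducers, and a maximum distance of 2 give 100 * 3 * 2 / 4 = 150, while a single reducer gives a cost of 0 because no data is shuffled between reducers.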
- the hotspot factor is captured using $\beta_i$ for physical node $p_i$ and is estimated by the current and expected load as follows:
- $load_{curr}$ and $load_{min}$ represent the current load and minimum current load, respectively
- $a$ is a parameter that acts as a knob to tune the weight of the hotspot factor with respect to other costs.
- the expected load, $load_{exp}$, is as follows:
- $\rho_j = \lambda_j / \mu_j$
- $\lambda_j$ is the number of $d_j$'s associated jobs that arrive within a given time interval
- $\mu_j$ is the mean time for each virtual machine to process a block.
- Storage utilization of a physical node $p_i$ is captured by $\gamma_i$, which is determined by the current storage utilization compared with the minimum storage utilization of all $p_i$, as follows:
- a split factor parameter is provided to specify whether flows from a node are allowed to be split across different links.
- the value of this parameter is defined as true or false. For example, if the split factor for all the links from d 1 and d 2 is true, all flows from the data nodes will go in whole through one of r 1 , r 2 , r 3 , r 4 , and will not be split between the racks.
- the goal of virtual machine placement is to maximize global data locality and job throughput.
- Our model considers both virtual machine migration and delayed scheduling of a job as part of the optimal solution. Delaying a job is used to explore better data locality opportunities that can arise in the near future, while minimizing time wasted during the waiting. Migrating a virtual machine belonging to a job enables the scheduler to make room for other suitable jobs or to explore better location opportunity.
- FIG. 8 is an illustration of a flow graph ( 800 ) for virtual machine placement.
- Each job v j ( 802 ) and ( 804 ) is submitted to the system at the source node with the number of requested virtual machines, N vj , as the supply value.
- the goal of the virtual machine allocator is either to keep the job unscheduled, i.e. allocate 0 virtual machines, or to allocate all N vj virtual machines for each request.
- the request from each job acts as a flow that goes either through the rack nodes ( 810 ), ( 812 ), ( 814 ), and ( 816 ), or through the unscheduled nodes ( 820 ), ( 822 ), and ( 824 ), and finally to the sink node ( 890 ). If a job is unscheduled, none of its virtual machines are allocated. Otherwise, the flow goes through the physical nodes ( 830 ), ( 832 ), ( 834 ), ( 836 ), ( 838 ), and ( 840 ). Based on the min-cost solution, an allocation scheme with min-cost can be derived. If the virtual machines are allocated to the highest level rack, it implies that the virtual machines can be allocated arbitrarily to any set of nodes under the rack.
- the job type information is modeled as the cost of the edge from each job to the rack nodes in the flow based graph.
- the higher level rack has higher cost than the lower level rack in terms of reduce traffic.
- the cost of the highest level rack is estimated by a worst case virtual machine arrangement with regard to the map and reduce traffic.
- the cost of the edges to the unscheduled nodes is set to be increased over time so that delayed jobs get allocated sooner than recently submitted jobs. That cost also controls when a job stops waiting for better locality and therefore offers a knob to tune the trade-off between data locality and latency.
- the aggregated unscheduled nodes control how many virtual machines can remain unscheduled to control system resource utilization and data locality trade-off.
- the cost of the edges to the running node set increases over time and is job-progress aware.
- FIG. 9 illustrates a table ( 900 ) with values assigned to the flow graph for virtual machine placement and node categorization.
- Various nodes in the graph are categorized into different types as shown in the table.
- a preferred node set (pr j ) ( 910 ) is a set of graph nodes that point to the set of physical nodes p i on which the dataset associated with a job, v j , is stored.
- An edge from a preferred node to physical node p i has the cost of 0 and the capacity of the number of virtual machine disk images stored on physical nodes p i .
- a running node set (ru j ) ( 920 ) is a set of dynamically added nodes that point to the physical nodes (p i s) that are currently hosting the virtual machines of job v j .
- An edge from ru j to p i has a cost of 0 and the capacity of the number of virtual machines running on physical nodes p i .
- An unscheduled node set, u j , ( 930 ) is a set of nodes that provide information about currently unscheduled jobs.
- the unscheduled node set, u j , has an outgoing edge with capacity N vj and cost of 0 to an unscheduled aggregator.
- An unscheduled aggregator node, u, ( 940 ) has an outgoing edge with cost 0 to the sink with capacity defined as:
- \( N_{unsched} = \sum_{j} N_{vj} - M + M_{idle} \)
- M is the total number of virtual machines that the cluster can support and M idle denotes the number of idle virtual machine slots allowed in the cluster.
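- The capacity of the unscheduled aggregator edge can be computed directly from these quantities. The sketch below assumes the flattened expression reads as the sum of all N vj, minus M, plus M idle. The clamp at zero and all names are assumptions added for illustration.

```python
def unscheduled_capacity(requested_vms, cluster_capacity, idle_slots_allowed):
    """Capacity of the edge from the unscheduled aggregator to the sink.

    Assumed reading: N_unsched = sum(N_vj) - M + M_idle.
    Clamping at zero is an assumption, since an edge capacity cannot be negative.
    """
    total_requested = sum(requested_vms)
    return max(0, total_requested - cluster_capacity + idle_slots_allowed)


# Hypothetical numbers: three jobs request 4, 6, and 5 virtual machines on a
# cluster that supports 10, with 2 idle slots allowed.
n_unsched = unscheduled_capacity([4, 6, 5], cluster_capacity=10, idle_slots_allowed=2)
```

In this example 7 virtual machine requests may remain unscheduled in the current round.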
- the rack node set, r k , ( 950 ) represents a rack in the topology of the cluster. It has outgoing links with cost 0 to its sub-racks, or if it is at the lowest level, to physical nodes.
- the links have capacity N rk that is the total number of virtual machine slots that can be serviced by its underlying nodes.
- the physical host node set, p i , ( 960 ) has an outgoing link to the sink with capacity being the number of virtual machines that can be accommodated on the physical host N vm and cost 0.
- the graph has a sink node ( 970 ) with demand represented as Σ (N vj ).
- the job node set, v j , ( 980 ) represents each job node v j with supply N vj . It has multiple outgoing edges corresponding to the potential virtual machine allocation decisions for the job set.
- the edges include, a rack node set, a preferred node set, a running node set, and an unscheduled node set.
- the rack node set, r k has an edge to r k that indicates that r k can accommodate v j .
- the cost of the edge is p j that is calculated by the map and reduce traffic cost. If the capacity of the edge is greater than N vj , it implies that the virtual machines of v j will be allocated on some preferred nodes on the rack.
- for the preferred node set, pr j , the edge from job v j to the job-wide preferred node set pr j has capacity N vj and cost ⁇
- the cost is estimated by only the reduce phase traffic because in this case map traffic is assumed to be zero.
- the unscheduled node set, u j has an edge to the job-wide unscheduled node u j with capacity N vj and cost ⁇ j , which corresponds to the penalty of leaving job v j unscheduled.
- ⁇ j = d*T, where T is the time that job v j is left unscheduled and d is a constant used to adjust the cost relative to other costs.
- the split factor for this link is marked as true, which means the allocation of all the virtual machines is either satisfied or delayed until the next round.
- the virtual machine allocation assignment can be obtained from the graph by locating where the associated flow leads to for each virtual machine request v j .
- Flow to an unscheduled node indicates that the virtual machine request is skipped for the current round. If the flow leads to a preferred node set, the virtual machine request is scheduled on that set of nodes. If the flow goes to a rack node, it implies that the virtual machines from the job are assigned to arbitrary hosts in that rack.
- the number of flows sent to a physical host through rack nodes or the preferred node set is lower than the number of available virtual machines of each physical host. This is guaranteed by the specified link capacity from physical host to sink. Accordingly, all virtual machine requests that are allocated will be matched to a corresponding physical host.
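- The placement graph can be exercised on a toy instance. The following is a minimal sketch of a generic successive-shortest-path min-cost flow solver, not the solver of the specification. It does not enforce the split factor, and all node names, costs, and capacities are hypothetical. The instance has two jobs requesting 2 virtual machines each, one rack with two single-slot hosts, and an unscheduled aggregator of capacity 4 - 2 + 0 = 2.

```python
from collections import deque


class MinCostFlow:
    """Successive-shortest-path min-cost flow on a small directed graph."""

    def __init__(self, num_nodes):
        self.n = num_nodes
        # Each edge is [target, residual_capacity, cost, index_of_reverse_edge].
        self.adj = [[] for _ in range(num_nodes)]

    def add_edge(self, u, v, capacity, cost):
        self.adj[u].append([v, capacity, cost, len(self.adj[v])])
        self.adj[v].append([u, 0, -cost, len(self.adj[u]) - 1])
        return (u, len(self.adj[u]) - 1)  # handle used later to read the flow

    def flow_on(self, handle):
        u, i = handle
        v, _, _, rev = self.adj[u][i]
        return self.adj[v][rev][1]  # flow equals the reverse edge's residual capacity

    def solve(self, source, sink, amount):
        flow = cost = 0
        while flow < amount:
            dist = [float("inf")] * self.n
            prev = [None] * self.n
            queued = [False] * self.n
            dist[source] = 0
            queue = deque([source])
            while queue:  # queue-based Bellman-Ford; tolerates negative residual costs
                u = queue.popleft()
                queued[u] = False
                for i, (v, cap, c, _) in enumerate(self.adj[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v], prev[v] = dist[u] + c, (u, i)
                        if not queued[v]:
                            queue.append(v)
                            queued[v] = True
            if prev[sink] is None:
                break  # no augmenting path left
            push, v = amount - flow, sink
            while v != source:  # find the bottleneck along the path
                u, i = prev[v]
                push = min(push, self.adj[u][i][1])
                v = u
            v = sink
            while v != source:  # apply the augmentation
                u, i = prev[v]
                edge = self.adj[u][i]
                edge[1] -= push
                self.adj[v][edge[3]][1] += push
                cost += push * edge[2]
                v = u
            flow += push
        return flow, cost


# Hypothetical node ids for the toy graph.
SRC, V1, V2, R1, U1, U2, AGG, P1, P2, SINK = range(10)
g = MinCostFlow(10)
g.add_edge(SRC, V1, 2, 0)  # job v1 supplies N_v1 = 2
g.add_edge(SRC, V2, 2, 0)  # job v2 supplies N_v2 = 2
rack_edges = {"v1": g.add_edge(V1, R1, 2, 1), "v2": g.add_edge(V2, R1, 2, 3)}
g.add_edge(V1, U1, 2, 10)  # v1 has waited longer, so its penalty d*T is larger
g.add_edge(V2, U2, 2, 5)
g.add_edge(U1, AGG, 2, 0)
g.add_edge(U2, AGG, 2, 0)
g.add_edge(AGG, SINK, 2, 0)  # capacity of the unscheduled aggregator
g.add_edge(R1, P1, 1, 0)
g.add_edge(R1, P2, 1, 0)
g.add_edge(P1, SINK, 1, 0)  # one virtual machine slot per host
g.add_edge(P2, SINK, 1, 0)

total_flow, total_cost = g.solve(SRC, SINK, 4)
scheduled = {job: g.flow_on(handle) for job, handle in rack_edges.items()}
```

Here the cheapest solution places both virtual machines of v1 on the rack and leaves v2 unscheduled for the round. The costs were chosen so that the optimum is all-or-nothing per job; a plain min-cost flow does not guarantee this in general, which is the role the split factor plays in the description.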
- FIG. 10 is a flow chart ( 1000 ) illustrating a process for assessing and leveraging the physical and virtual machine topology in a shared pool of resources.
- the status of the virtual machines in the shared pool is collected ( 1002 ). It is recognized that one or more virtual machines may be assigned to a single physical host, thereby supporting the virtualization of an underlying physical machine. Based upon the collected data, associated topology information is gathered and communicated to a server at a root node of the hierarchical organization of physical and virtual machines ( 1004 ).
- Each virtual machine is provided with an embedded agent, and each physical machine to which the virtual machines are assigned is provided with an embedded monitor.
- a server machine is provided in communication with each of the physical machines and functions to periodically collect information from the embedded monitors of the underlying physical machines.
- the embedded monitors function to sense local topology, disk, and network information.
- the embedded monitors are in the form of software that runs on the physical machine to collect status data of the related virtual machines.
- the embedded agent of each virtual machine communicates actual topology and system utilization information to the server machine. Accordingly, each virtual and physical machine includes embedded tools to gather topology associated utilization information and to convey the gathered information to the server machine.
- the gathered data is organized in a single location ( 1006 ).
- the single location may be a root node representing a physical server that is in communication with each of the virtual machines and their associated physical machine.
- data communicated from the embedded agents of the virtual machines includes actual topology and system utilization information.
- the data gathered at step ( 1004 ) includes a storage topology underlying a virtual topology of the shared pool of resources together with the associated resource utilization information.
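- The collection path from embedded agents through embedded monitors to the server can be sketched as follows. This is an illustrative sketch only: the class names, fields, and sample values are hypothetical, and the periodic timer is omitted so that a single polling round is shown.

```python
class Agent:
    """Stands in for the agent embedded in one virtual machine."""

    def __init__(self, vm_name, cpu_utilization, network_utilization):
        self.vm_name = vm_name
        self.cpu_utilization = cpu_utilization
        self.network_utilization = network_utilization

    def report(self):
        return {"vm": self.vm_name,
                "cpu": self.cpu_utilization,
                "net": self.network_utilization}


class Monitor:
    """Stands in for the monitor embedded in one physical machine."""

    def __init__(self, host_name, agents):
        self.host_name = host_name
        self.agents = agents

    def collect(self):
        return {"host": self.host_name,
                "vms": [agent.report() for agent in self.agents]}


class Server:
    """Root node that organizes the gathered data in a single location."""

    def __init__(self, monitors):
        self.monitors = monitors
        self.topology = {}

    def poll_once(self):
        for monitor in self.monitors:
            status = monitor.collect()
            self.topology[status["host"]] = status["vms"]
        return self.topology


# Hypothetical pool: two physical hosts with three virtual machines in total.
server = Server([
    Monitor("host-a", [Agent("vm-1", 0.30, 0.10), Agent("vm-2", 0.70, 0.40)]),
    Monitor("host-b", [Agent("vm-3", 0.20, 0.05)]),
])
topology = server.poll_once()
```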
- a job is assigned to a select virtual machine in the shared pool with the assignment leveraging the organized storage topology information ( 1008 ).
- the job assignment at step ( 1008 ) is designed to support efficient performance of job associated I/O. Accordingly, the process of gathering and organizing local topology information enables jobs to be intelligently assigned to a select virtual machine.
- the job assignment at step ( 1008 ) may be to one virtual machine or to multiple virtual machines. Similarly, the job may be a read job or a write job. With respect to both scenarios, a virtual topological distance is returned in response to the job assignment ( 1010 ).
- the virtual topological distance may be a distance between two or more virtual machines when the job is assigned to multiple machines, or the virtual topological distance may be between a virtual machine and a block of data when the job is supporting a single virtual machine for a read or write job.
- a positive response to the determination at step ( 1012 ) is followed by creating a shared memory channel for inter-virtual machine data communication between two virtual machines local to the same physical machine ( 1014 ).
- the creation of the shared memory channel supports efficient data transfer between the two virtual machines. More specifically, memory copy may be employed for communication between the virtual machines, thereby avoiding communication across a virtual network stack ( 1016 ).
- a negative response to the determination at step ( 1012 ) is followed by utilization of the virtual network stack for communication between the virtual machines supporting the assigned job ( 1018 ). Accordingly, the physical proximity of virtual machines may lend itself to efficient transfer of inter-virtual machine communication.
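- The choice between the two communication paths can be sketched as a simple lookup. The sketch assumes that the determination at step ( 1012 ) tests whether the two virtual machines are local to the same physical machine; the function name, the mapping, and the returned labels are hypothetical.

```python
def choose_channel(vm_a, vm_b, host_of):
    """Pick the transport for data exchanged between two virtual machines.

    host_of maps a virtual machine name to the physical machine hosting it.
    """
    if host_of[vm_a] == host_of[vm_b]:
        # Same physical machine: memory copy over a shared memory channel.
        return "shared_memory_channel"
    # Different physical machines: fall back to the virtual network stack.
    return "virtual_network_stack"


# Hypothetical placement of three virtual machines on two hosts.
host_of = {"vm-1": "host-a", "vm-2": "host-a", "vm-3": "host-b"}
local_choice = choose_channel("vm-1", "vm-2", host_of)
remote_choice = choose_channel("vm-1", "vm-3", host_of)
```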
- a physical machine supports the virtual machine and data blocks support the job.
- the location of the data blocks within the shared pool affects the assignment of the job to the physical machine and associated virtual machine(s). More specifically, an efficient use of resources in the shared pool ensures a physical proximity of the physical machine to the data blocks.
- the job is assigned to a physical machine in the same physical data center as the subject data blocks. Accordingly, part of the process of assignment of the job at step ( 1008 ) includes ascertaining a physical location of the data blocks in the shared pool supporting the job.
- FIG. 11 is a flow chart ( 1100 ) illustrating the additional steps to support the aspect of leveraging the storage topology of the shared pool of resources.
- a virtual machine is designated to receive the job for processing ( 1102 ).
- utilization information of the physical machine local to the virtual machine is ascertained ( 1104 ).
- the utilization information includes, but is not limited to, processing unit and network utilization information. It is then determined whether the underlying physical machine has the bandwidth and capability to support the job ( 1106 ).
- a negative response to the determination at step ( 1106 ) is followed by a return to selection and assignment of a different virtual machine in the shared pool.
- a positive response to the determination at step ( 1106 ) is an indication that the selected virtual machine has both sufficient bandwidth to support the job and a close topological distance to the physical location of the data block(s) to support the job ( 1108 ).
- the close topological distance includes, but is not limited to, data residing in the same data center as the virtual machine. Accordingly, the aspect of leveraging the storage topology includes an assessment of the operation of the machine together with location of data to support the job.
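- The selection loop of FIG. 11 can be sketched as a filter over candidate virtual machines. The thresholds, the integer distance scale, and every name below are assumptions made for illustration; the specification does not give numeric limits.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    host_cpu_utilization: float      # utilization of the underlying physical machine
    host_network_utilization: float
    distance_to_data: int            # hypothetical topological distance to the data blocks


def select_vm(candidates, cpu_limit=0.8, network_limit=0.8, max_distance=1):
    """Return the closest candidate whose physical machine can support the job."""
    for vm in sorted(candidates, key=lambda c: c.distance_to_data):
        has_capacity = (vm.host_cpu_utilization < cpu_limit
                        and vm.host_network_utilization < network_limit)
        if has_capacity and vm.distance_to_data <= max_distance:
            return vm.name
        # Otherwise fall through and try a different virtual machine.
    return None


candidates = [
    Candidate("vm-a", 0.95, 0.20, 0),  # closest to the data, but its host is overloaded
    Candidate("vm-b", 0.40, 0.30, 1),
    Candidate("vm-c", 0.10, 0.10, 3),  # lightly loaded, but far from the data
]
chosen = select_vm(candidates)
```

In this example vm-b is chosen because it is the nearest candidate that passes both checks.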
- FIG. 12 is a block diagram ( 1200 ) illustrating tools embedded in a computer system to support a technique employed for assessment of resource utilization for use in assignment of a job within a shared pool of resources.
- a shared pool of configurable computer resources is shown with a first data center ( 1210 ), a second data center ( 1230 ), and a third data center ( 1250 ).
- Each of the data centers represents a computing resource. Accordingly, one or more data centers may be employed to support efficient and intelligent assignment of jobs with respect to resource utilization and proximity to data blocks in support of the job(s).
- Each of the data centers in the system is provided with at least one server in communication with data storage. More specifically, the first data center ( 1210 ) is provided with a first server ( 1220 ) having a processing unit ( 1222 ), in communication with memory ( 1224 ) across a bus ( 1226 ), and in communication with data storage ( 1228 ); the second data center ( 1230 ) is provided with a second server ( 1240 ) having a processing unit ( 1242 ), in communication with memory ( 1244 ) across a bus ( 1246 ), and in communication with second local storage ( 1248 ); and the third data center ( 1250 ) is provided with a third server ( 1260 ) having a processing unit ( 1262 ), in communication with memory ( 1264 ) across a bus ( 1266 ), and in communication with third local storage ( 1268 ).
- the first server ( 1220 ) is also referred to herein as a physical host. Communication among the data centers is supported across one or more network connections ( 1205 ).
- the second server ( 1240 ) includes two virtual machines ( 1232 ) and ( 1236 ).
- the first virtual machine ( 1232 ) has an embedded agent ( 1232 a ) and the second virtual machine ( 1236 ) has an embedded agent ( 1236 a ).
- the second server ( 1240 ) includes a monitor ( 1234 ) to facilitate communication with the first and second virtual machines ( 1232 ) and ( 1236 ), respectively.
- the third server ( 1260 ) includes two virtual machines ( 1252 ) and ( 1256 ).
- the first virtual machine ( 1252 ) has an embedded agent ( 1252 a ) and the second virtual machine ( 1256 ) has an embedded agent ( 1256 a ).
- the third server ( 1260 ) includes a monitor ( 1254 ) to facilitate communication with the first and second virtual machines ( 1252 ) and ( 1256 ), respectively.
- the invention should not be limited to these quantities, as these quantities are merely for illustrative purposes.
- the quantity of the virtual machines in communication with the second and third servers ( 1240 ) and ( 1260 ), respectively, may be increased or decreased.
- the monitor ( 1234 ) of the second server ( 1240 ) collects status data from each of the virtual machines ( 1232 ) and ( 1236 ).
- Monitor ( 1234 ) communicates with embedded agents ( 1232 a ) and ( 1236 a ) to collect virtual machine status from virtual machines ( 1232 ) and ( 1236 ), respectively.
- monitor ( 1254 ) collects status data from each of the virtual machines ( 1252 ) and ( 1256 ), and specifically embedded agents ( 1252 a ) and ( 1256 a ), respectively.
- the first server ( 1220 ) is provided with a functional unit ( 1270 ) having one or more tools to support intelligent assignment of one or more jobs in the shared pool of resources.
- the functional unit ( 1270 ) is shown local to the first data center ( 1210 ), and specifically in communication with memory ( 1224 ). In one embodiment, the functional unit ( 1270 ) may be local to any of the data centers in the shared pool of resources.
- the tools embedded in the functional unit ( 1270 ) include, but are not limited to, a director ( 1272 ), a topology manager ( 1274 ), a hook manager ( 1276 ), a storage topology manager ( 1278 ), a resource utilization manager ( 1280 ), and an application manager ( 1282 ).
- the director ( 1272 ) is provided in the shared pool to periodically communicate with monitors ( 1234 ) and ( 1254 ) to organize and retain in a single location a storage topology underlying a virtual topology of the shared pool of resources, together with associated resource utilization information. More specifically, the communication of the director ( 1272 ) with the monitors ( 1234 ) and ( 1254 ) supports gathering and organization of the topology of the shared pool of resources. By organizing and understanding the topology data, the director ( 1272 ) may leverage the resource utilization information to intelligently assign a job to one or more of the shared resources in the pool, and in a manner that supports efficient performance of job associated I/O. Accordingly, the director ( 1272 ) both gathers and leverages the topology to support efficient processing of read and write jobs in the shared pool of resources.
- Virtual topological distance data includes, but is not limited to, a distance between two virtual machines or a distance between a virtual machine and a block of data. For example, two virtual machines in communication with the same server are considered in relatively close proximity. However, a second virtual machine in communication with a second server and a third virtual machine in communication with a third server are considered relatively distant in comparison to the two virtual machines in communication with the same server.
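- The notion of relative distance can be made concrete with a small lookup over the hierarchy. The integer scale below (0 for the same virtual machine, 1 for the same server, 2 for the same data center, 3 otherwise) is a hypothetical convention, and the location table merely mirrors the arrangement of FIG. 12.

```python
# Virtual machine -> (data center, server), following the reference numerals of FIG. 12.
LOCATION = {
    "vm1232": ("dc1230", "server1240"),
    "vm1236": ("dc1230", "server1240"),
    "vm1252": ("dc1250", "server1260"),
    "vm1256": ("dc1250", "server1260"),
}


def virtual_distance(vm_a, vm_b):
    """Return a coarse topological distance between two virtual machines."""
    if vm_a == vm_b:
        return 0
    dc_a, server_a = LOCATION[vm_a]
    dc_b, server_b = LOCATION[vm_b]
    if (dc_a, server_a) == (dc_b, server_b):
        return 1  # both virtual machines are on the same server
    if dc_a == dc_b:
        return 2  # different servers in the same data center
    return 3      # different data centers


same_server = virtual_distance("vm1232", "vm1236")
different_servers = virtual_distance("vm1236", "vm1252")
```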
- the storage topology manager ( 1278 ) returns the physical location of the data blocks to the director ( 1272 ), thereby enabling the director to intelligently assign a job to a virtual machine in response to the location of the subject data block(s).
- the topology manager ( 1274 ) functions to address distances within the hierarchy with respect to efficient job processing
- the storage topology manager ( 1278 ) functions to address the location of the data block supporting the job.
- the resource utilization manager ( 1280 ) functions to address utilization of one or more physical or virtual resources. Each resource has innate limitations.
- the resource utilization manager ( 1280 ) returns utilization information of a processing unit and network utilization information associated with the underlying physical and virtual machines to the director ( 1272 ).
- the application manager ( 1282 ) which is in communication with the resource utilization manager ( 1280 ), assigns the job to a virtual machine responsive to the resource utilization information.
- the application manager ( 1282 ) ensures that the machine to which the job is assigned has sufficient bandwidth to support the job, as well as a sufficiently close topological distance to the data blocks that support the job. Accordingly, both utilization and bandwidth are accounted for by the resource utilization manager ( 1280 ) and the application manager ( 1282 ), respectively.
- the hook manager ( 1276 ) functions to facilitate communication among virtual machines. More specifically, the hook manager ( 1276 ), which is in communication with the director ( 1272 ), is provided to create a shared memory channel for inter-virtual machine communication.
- the shared memory channel facilitates communication between two virtual machines sitting on the same physical machine by enabling data transfer between two such virtual machines to take place on the same memory stack, e.g. across the memory channel. Accordingly, the shared memory channel created by the hook manager ( 1276 ) supports efficient communication of data within the hierarchical structure of the shared pool of resources.
- the director ( 1272 ), topology manager ( 1274 ), hook manager ( 1276 ), storage topology manager ( 1278 ), resource utilization manager ( 1280 ), and application manager ( 1282 ) are shown residing in the functional unit ( 1270 ) of the server ( 1220 ) local to the first data center ( 1210 ).
- the functional unit ( 1270 ) and associated director and managers, respectively may reside as hardware tools external to the memory ( 1224 ) of server ( 1220 ) of the first data center ( 1210 ), they may be implemented as a combination of hardware and software, or may reside local to the second data center ( 1230 ) or the third data center ( 1250 ) in the shared pool of resources.
- the director and managers may be combined into a single functional item that incorporates the functionality of the separate items.
- each of the director and manager(s) are shown local to one data center. However, in one embodiment they may be collectively or individually distributed across the shared pool of configurable computer resources and function as a unit to assess the topology of processing units and data storage in the shared pool, and to process one or more jobs responsive to the hierarchy.
- the managers may be implemented as software tools, hardware tools, or a combination of software and hardware tools.
- aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
- a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
- a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
- a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Internet Service Provider for example, AT&T, MCI, Sprint, EarthLink, MSN, GTE, etc.
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- FIG. 13 is a block diagram ( 1300 ) showing a system for implementing an embodiment of the present invention.
- the computer system includes one or more processors, such as a processor ( 1302 ).
- the processor ( 1302 ) is connected to a communication infrastructure ( 1304 ) (e.g., a communications bus, cross-over bar, or network).
- the computer system can include a display interface ( 1306 ) that forwards graphics, text, and other data from the communication infrastructure ( 1304 ) (or from a frame buffer not shown) for display on a display unit ( 1308 ).
- the computer system also includes a main memory ( 1310 ), preferably random access memory (RAM), and may also include a secondary memory ( 1312 ).
- the secondary memory ( 1312 ) may include, for example, a hard disk drive ( 1314 ) and/or a removable storage drive ( 1316 ), representing, for example, a floppy disk drive, a magnetic tape drive, or an optical disk drive.
- the removable storage drive ( 1316 ) reads from and/or writes to a removable storage unit ( 1318 ) in a manner well known to those having ordinary skill in the art.
- Removable storage unit ( 1318 ) represents, for example, a floppy disk, a compact disc, a magnetic tape, or an optical disk, etc., which is read by and written to by removable storage drive ( 1316 ).
- the removable storage unit ( 1318 ) includes a computer readable medium having stored therein computer software and/or data.
- the secondary memory ( 1312 ) may include other similar means for allowing computer programs or other instructions to be loaded into the computer system.
- Such means may include, for example, a removable storage unit ( 1320 ) and an interface ( 1322 ).
- Examples of such means may include a program package and package interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units ( 1320 ) and interfaces ( 1322 ) which allow software and data to be transferred from the removable storage unit ( 1320 ) to the computer system.
- the computer system may also include a communications interface ( 1324 ).
- Communications interface ( 1324 ) allows software and data to be transferred between the computer system and external devices. Examples of communications interface ( 1324 ) may include a modem, a network interface (such as an Ethernet card), a communications port, or a PCMCIA slot and card, etc.
- Software and data transferred via communications interface ( 1324 ) are in the form of signals which may be, for example, electronic, electromagnetic, optical, or other signals capable of being received by communications interface ( 1324 ). These signals are provided to communications interface ( 1324 ) via a communications path (i.e., channel) ( 1326 ).
- This communications path ( 1326 ) carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, and/or other communication channels.
- the terms “computer program medium,” “computer usable medium,” and “computer readable medium” are used to generally refer to media such as main memory ( 1310 ) and secondary memory ( 1312 ), removable storage drive ( 1316 ), and a hard disk installed in hard disk drive ( 1314 ).
- Computer programs are stored in main memory ( 1310 ) and/or secondary memory ( 1312 ). Computer programs may also be received via a communication interface ( 1324 ). Such computer programs, when run, enable the computer system to perform the features of the present invention as discussed herein. In particular, the computer programs, when run, enable the processor ( 1302 ) to perform the features of the computer system. Accordingly, such computer programs represent controllers of the computer system.
- each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- the enhanced cloud computing model supports flexibility with respect to application processing and disaster recovery, including, but not limited to, supporting separation of the location of the data from the application location and selection of an appropriate recovery site.
- a platform is provided with a resource scheduler to address performance degradation of MapReduce jobs when running in the cloud environment.
- Cloud aware MapReduce (CAM) adopts a three level approach to avoid placement anomalies due to inefficient resource allocation, including: placing data within the cluster that runs jobs that most commonly operate on the data, selecting the most appropriate physical nodes to place the set of virtual machines assigned to a job, and exposing compute, storage, and network topologies to the scheduler.
- CAM uses a flow network based algorithm that is able to reconcile resource allocation with a variety of other competing constraints, such as storage utilization, changing processor load and network link capacities.
Abstract
Embodiments of the invention relate to a shared group of resources and efficient processing of one or more jobs in the shared group of resources. Tools are provided in the shared group of resources to assess and organize a topology of the shared resources, including physical and virtual machines, as well as storage devices. The topology is stored in a known location and utilized for efficient assignment of one or more jobs responsive to the hierarchy.
Description
- This application is a continuation patent application claiming the benefit of the filing date of U.S. patent application Ser. No. 13/457,090 filed on Apr. 26, 2012, and titled “Efficient Execution of Jobs in a Shared Pool of Resources” now pending, which is hereby incorporated by reference.
- This invention relates to an efficient approach for utilization of job processing in a shared pool of resources. More specifically, the invention relates to assessing the virtual and physical topology of the shared resources and processing jobs responsive to the combined topology.
- MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computer nodes. The collection of nodes is commonly referred to as a cluster in instances where all of the nodes use the same hardware, or as a grid if the nodes use different hardware. Computational processing can occur on data stored either in a filesystem or a database. Specifically, a master node receives a job input and partitions the job into smaller sub-jobs, which are distributed to the other nodes in the cluster or grid. In one embodiment, the nodes in the cluster or grid are arranged in a hierarchy, and the sub-jobs may be further partitioned and distributed. The nodes responsible for processing the sub-jobs return processed data to the master node. More specifically, the processed data is collected and combined by the master node to form an output. Accordingly, MapReduce is an algorithmic technique for the distributed processing of large amounts of data associated with a job.
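The partition, process, and combine steps described above can be sketched in a few lines of Python. This is a minimal in-process illustration only: the word-count workload and the function names (`partition`, `map_phase`, `reduce_phase`, `run_job`) are assumptions made for the example and are not part of the described embodiments, and in a real cluster each sub-job would be sent to a separate node.

```python
from collections import defaultdict

def partition(records, num_sub_jobs):
    """Master node step: split the job input into smaller sub-jobs."""
    return [records[i::num_sub_jobs] for i in range(num_sub_jobs)]

def map_phase(sub_job):
    """Worker step: emit a (word, 1) pair for every word in the sub-job."""
    return [(word, 1) for line in sub_job for word in line.split()]

def reduce_phase(pairs):
    """Master node step: combine the processed data into a single output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(records, num_sub_jobs=2):
    """Run all steps in one process; a cluster would run map_phase remotely."""
    processed = []
    for sub_job in partition(records, num_sub_jobs):
        processed.extend(map_phase(sub_job))
    return reduce_phase(processed)
```

For example, `run_job(["a b a", "b c"])` splits the two lines into two sub-jobs, maps each one, and combines the pairs into per-word totals.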
- As described above, MapReduce enables distribution of data processing across a network of nodes. Although there is a convenience factor associated with the use of MapReduce, there are performance issues associated with current uses of MapReduce for processing jobs.
- This invention comprises a method for efficient processing of jobs in a shared pool of resources.
- In one aspect, a computer implemented method is provided for implementation in a shared pool of resources. More specifically, the shared pool includes a physical host in communication with at least one physical machine, with the physical machine supporting one or more virtual machines. Status information associated with the operation of the virtual machines is collected. In addition, local topology information associated with the shared pool of resources is gathered. The aspect of gathering this information includes periodically communicating with an embedded monitor of the physical machine. The gathered topology information is organized, with the topology information including storage topology underlying a virtual topology of the resources and associated resource utilization information. Once the storage topology information is organized, it may be leveraged to facilitate processing of one or more jobs. More specifically, a job may be responsively assigned to a select virtual machine in the shared pool that supports efficient performance of I/O associated with the job.
- In another aspect, a computer implemented method is provided for implementation in a shared pool of resources. More specifically, the shared pool of resources includes a physical host in communication with at least one physical machine supporting one or more virtual machines. Status information is collected from one or more of the virtual machines. Local topology information of a hierarchical organization of a shared pool of resources represented by the physical and virtual machines is periodically gathered and organized. More specifically, the organized topology information is stored in the shared pool of resources. Utilization of storage resources and virtual machines represented in the topology are assessed. A job is assigned to one or more select virtual machines in response to the topology and the assessment of the storage resources. Accordingly, the job assignment supports efficient performance responsive to both the topology and the storage resource utilization assessment.
- Other features and advantages of this invention will become apparent from the following detailed description of the presently preferred embodiment of the invention, taken in conjunction with the accompanying drawings.
- The drawings referenced herein form a part of the specification. Features shown in the drawings are meant as illustrative of only some embodiments of the invention, and not of all embodiments of the invention unless otherwise explicitly indicated.
- FIG. 1 depicts a cloud computing node according to an embodiment of the present invention.
- FIG. 2 depicts a cloud computing environment according to an embodiment of the present invention.
- FIG. 3 depicts abstraction model layers according to an embodiment of the present invention.
- FIG. 4 depicts a block diagram illustrating the architecture for using cloud aware MapReduce.
- FIG. 5 depicts a flow chart illustrating allocation of resources and management of data and virtual machine placement.
- FIG. 6 depicts a flow graph illustrating sample network topology for data placement.
- FIG. 7 depicts a flow graph for sample data placement.
- FIG. 8 depicts a flow graph for virtual machine placement.
- FIG. 9 depicts a table with values assigned to the flow graph for virtual machine placement and node categorization.
- FIG. 10 depicts a flow chart illustrating a process for assessing and leveraging the physical and virtual machine topology in a shared pool of resources.
- FIG. 11 depicts a flow chart illustrating steps to support the aspect of leveraging the storage topology of the shared pool of resources.
- FIG. 12 depicts a block diagram illustrating tools embedded in a computer system to support a technique employed for assessment of resource utilization for use in assignment of a job within a shared pool of resources.
- FIG. 13 depicts a block diagram showing a system for implementing an embodiment of the present invention.
- It will be readily understood that the components of the present invention, as generally described and illustrated in the Figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the apparatus, system, and method of the present invention, as presented in the Figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
- The functional unit(s) described in this specification have been labeled with tools in the form of manager(s) and director(s). A manager or director may be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, or the like. The manager(s) or director(s) may also be implemented in software for processing by various types of processors. An identified manager or director of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, function, or other construct. Nevertheless, the executables of an identified manager or director need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the managers and directors and achieve the stated purpose of the managers and directors.
- Indeed, a manager or director of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices. Similarly, operational data may be identified and illustrated herein within the manager or director, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, as electronic signals on a system or network.
- Reference throughout this specification to “a select embodiment,” “one embodiment,” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “a select embodiment,” “in one embodiment,” or “in an embodiment” in various places throughout this specification are not necessarily referring to the same embodiment.
- Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of a topology manager, a hook manager, a storage topology manager, a resource utilization manager, an application manager, a director, etc., to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
- The illustrated embodiments of the invention will be best understood by reference to the drawings, wherein like parts are designated by like numerals throughout. The following description is intended only by way of example, and simply illustrates certain selected embodiments of devices, systems, and processes that are consistent with the invention as claimed herein.
- A cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure comprising a network of interconnected nodes. Referring now to FIG. 1, a schematic of an example of a cloud computing node is shown. Cloud computing node (10) is only one example of a suitable cloud computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein. Regardless, cloud computing node (10) is capable of being implemented and/or performing any of the functionality set forth hereinabove. In cloud computing node (10) there is a computer system/server (12), which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server (12) include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.
- Computer system/server (12) may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular jobs or implement particular abstract data types. Computer system/server (12) may be practiced in distributed cloud computing environments where jobs are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
- As shown in FIG. 1, computer system/server (12) in cloud computing node (10) is shown in the form of a general-purpose computing device. The components of computer system/server (12) may include, but are not limited to, one or more processors or processing units (16), a system memory (28), and a bus (18) that couples various system components including system memory (28) to processor (16). Bus (18) represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus. Computer system/server (12) typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server (12), and it includes both volatile and non-volatile media, removable and non-removable media.
- System memory (28) can include computer system readable media in the form of volatile memory, such as random access memory (RAM) (30) and/or cache memory (32). Computer system/server (12) may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system (34) can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus (18) by one or more data media interfaces. As will be further depicted and described below, memory (28) may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
- Program/utility (40), having a set (at least one) of program modules (42), may be stored in memory (28) by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating systems, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules (42) generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
- Computer system/server (12) may also communicate with one or more external devices (14), such as a keyboard, a pointing device, a display (24), etc.; one or more devices that enable a user to interact with computer system/server (12); and/or any devices (e.g., network card, modem, etc.) that enable computer system/server (12) to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces (22). Still yet, computer system/server (12) can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter (20). As depicted, network adapter (20) communicates with the other components of computer system/server (12) via bus (18). It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server (12). Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems.
- Referring now to FIG. 2, illustrative cloud computing environment (50) is depicted. As shown, cloud computing environment (50) comprises one or more cloud computing nodes (10) with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone (54A), desktop computer (54B), laptop computer (54C), and/or automobile computer system (54N) may communicate. Nodes (10) may communicate with one another. They may be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof. This allows cloud computing environment (50) to offer infrastructure, platforms and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device. It is understood that the types of computing devices (54A)-(54N) shown in FIG. 2 are intended to be illustrative only and that computing nodes (10) and cloud computing environment (50) can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser).
- Referring now to FIG. 3, a set of functional abstraction layers provided by cloud computing environment (50) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 3 are intended to be illustrative only and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided: hardware and software layer (60), virtualization layer (62), management layer (64), and workload layer (66). The hardware and software layer (60) includes hardware and software components. Examples of hardware components include mainframes, in one example IBM® zSeries® systems; RISC (Reduced Instruction Set Computer) architecture based servers, in one example IBM pSeries® systems; IBM xSeries® systems; IBM BladeCenter® systems; storage devices; networks and networking components. Examples of software components include network application server software, in one example IBM WebSphere® application server software; and database software, in one example IBM DB2® database software. (IBM, zSeries, pSeries, xSeries, BladeCenter, WebSphere, and DB2 are trademarks of International Business Machines Corporation registered in many jurisdictions worldwide).
- Virtualization layer (62) provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers; virtual storage; virtual networks, including virtual private networks; virtual applications and operating systems; and virtual clients.
- In one example, management layer (64) may provide the following functions: resource provisioning, metering and pricing, security, user portal, service level management, and SLA planning and fulfillment. The functions are described below. Resource provisioning provides dynamic procurement of computing resources and other resources that are utilized to perform jobs within the cloud computing environment. Metering and pricing provides cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may comprise application software licenses. Security provides identity verification for cloud consumers and jobs, as well as protection for data and other resources. User portal provides access to the cloud computing environment for consumers and system administrators. Service level management provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment provides pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
- Workloads layer (66) provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include, but are not limited to: mapping and navigation, software development and lifecycle management, virtual classroom education delivery, data analytics processing, job processing, and processing one or more jobs responsive to the hierarchy of virtual resources within the cloud computing environment.
- A virtual machine is a software and/or hardware implementation of a computer that executes programs similar to a physical machine. The virtual machine supports an instance of an operating system along with one or more applications to run in an isolated partition within the computer. In one embodiment, the virtual machine enables different operating systems to run in the same computer simultaneously. One physical machine may support multiple virtual machines. Multiple operating systems can run in the same physical machine, and each of the virtual machines may process jobs with different operating systems. Accordingly, the use of one or more virtual machines associated with a single physical machine supports efficient use of hardware while processing multiple jobs.
- Efficient use of a virtual machine configuration in a cloud computing system is challenging due to the distributed nature of the physical topology of the physical machines, i.e. nodes. More specifically, the concerns pertain to assessing and exposing the physical topology underlying the virtual topology of the cloud computing system and leveraging job processing responsive to the physical and virtual topology.
- A cloud platform, referred to herein as CAM, is provided to combine a cluster file system with a resource scheduler. CAM adopts a three-level approach to avoid placement anomalies. The three levels are data placement, virtual machine-job placement, and job placement. With respect to data placement, data is placed within the cluster based on offline profiling of the jobs that most commonly run on the data. Job placement is effected by CAM selecting the best possible physical node(s) to place the sets of virtual machines assigned to a job. In order to further minimize the possibility of a placement anomaly, CAM exposes otherwise hidden compute, storage, and network topologies to the job scheduler so that it can make optimal job assignments. CAM reconciles resource allocation with a variety of other competing constraints, such as storage utilization, changing CPU load, and network link capacities, using a flow-network based algorithm that is able to simultaneously reduce the specified constraints for both initial placement and for readjusting under virtual machine and data migration.
- FIG. 4 is a block diagram (400) illustrating the overall architecture for using CAM. The physical resources supporting the cloud consist of a cluster of physical nodes with local storage directly attached to the individual nodes. As shown, CAM uses a filesystem (410), which in one embodiment may be a General Parallel File System-Shared Nothing Cluster (GPFS-SNC), to provide its storage layer. GPFS-SNC is designed to be a cloud storage platform, which supports timely and resource-efficient deployment of virtual machines in the cloud. GPFS-SNC manages the local disks directly attached to a cluster of commodity physical machines. As shown, the physical layer (420) is illustrated with three commodity physical machines (430), (440), and (450). The quantity of physical machines shown herein is for illustrative purposes. In one embodiment, the quantity may include a smaller or larger quantity of commodity physical machines. Each of these machines has a local disk (432), (442), and (452), respectively. Filesystem (410) supports co-locating all blocks of a file at one location, rather than striping the file across the network. This enables a virtual machine I/O request to be serviced locally from the stored location instead of remotely from physical hosts across the network. CAM leverages this feature to ensure that co-located virtual machine images are stored at one location and can be accessed efficiently.
- Filesystem (410) also supports an efficient block-level pipelined replication scheme, which guarantees fast distributed recovery and high I/O throughput through fast parallel reads. This feature is useful for CAM for achieving efficient failure recovery. In addition, filesystem (410) specifies a user level application program interface (API) that can be used to query the physical location of files. CAM uses this API to get actual block location information for determining storage closeness for data and virtual machine placement.
- CAM employs three components to provide topology awareness. As shown, a server (460) is provided in communication with the physical layer (420). The server (460), also referred to herein as the CAM topology server, provides topology information required to enable a scheduler to make optimal job assignments. Information exposed by the server (460) consists of network and storage topologies, and other dynamic node-level information, such as CPU load. Further as shown, each of the physical machines (430), (440), and (450) has an agent (434), (444), and (454), respectively. Each agent runs on its respective physical machine and periodically collects and conveys to the server (460) a variety of pieces of data about the respective machine, such as utilization of outbound and inbound network bandwidth, I/O utilization, and CPU, memory, and storage load. The server (460) consolidates dynamic information it receives from the agents (434), (444), and (454), and serves it along with topology information about each job running in the cluster. The topology information is derived from existing virtual machine placement configuration. A job scheduler (470) interfaces with the server (460) to obtain accurate and current topology information. The scheduler (470) readjusts job placement accordingly whenever a change in the configuration is observed. Accordingly, the job scheduler leverages the storage and physical host resource utilization in CAM.
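The agent and server interaction described above can be pictured with a small Python sketch. It is illustrative only: the class names, the report fields, and their units are assumptions, and the sketch keeps only the most recent report per physical machine rather than modeling transport or scheduling.

```python
from dataclasses import dataclass, field

@dataclass
class AgentReport:
    """One periodic sample sent by a per-machine agent (fields are assumed)."""
    host: str
    cpu_load: float        # fraction of CPU in use, 0.0 to 1.0
    io_utilization: float  # fraction of disk bandwidth in use
    net_in: float          # inbound bandwidth in use, MB/s
    net_out: float         # outbound bandwidth in use, MB/s

@dataclass
class TopologyServer:
    """Consolidates agent data by keeping the newest report per machine."""
    latest: dict = field(default_factory=dict)

    def receive(self, report: AgentReport) -> None:
        # A newer report from the same host replaces the older one.
        self.latest[report.host] = report

    def least_cpu_loaded(self) -> str:
        # Example query a scheduler might issue against consolidated data.
        return min(self.latest.values(), key=lambda r: r.cpu_load).host
```

A scheduler polling such a server would see its answer change as soon as an agent reports a new load value, which mirrors the readjustment behavior described for scheduler (470).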
- Network topology information is represented by a distance matrix that encodes the distance between each pair of virtual machines as cross-rack, cross-node, or cross-virtual machine. When two virtual machines are placed on the same node, they are connected through a virtual network connection. By virtue of the fact that the virtual machines share the same node hardware, the virtual network provides a high-speed medium that is significantly faster than the inter-node or inter-rack links. Network traffic between the virtual machines on the same node does not have to pass through the hardware link. The network virtual device forwards the traffic in-memory through highly optimized ring buffers.
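As a rough sketch of such a distance matrix, the Python below classifies every pair of virtual machines from a placement map. The numeric values assigned to the three categories, the `placement` structure, and the function name are assumptions for illustration; the text above only names the three categories.

```python
from enum import IntEnum

class Distance(IntEnum):
    """Distance categories; the numeric ordering is an assumption."""
    CROSS_VM = 1    # same physical node, traffic stays on the virtual network
    CROSS_NODE = 2  # same rack, different physical nodes
    CROSS_RACK = 3  # different racks

def build_distance_matrix(placement):
    """placement maps a VM name to its (rack, node) location."""
    matrix = {}
    for vm_a, (rack_a, node_a) in placement.items():
        for vm_b, (rack_b, node_b) in placement.items():
            if vm_a == vm_b:
                continue  # no entry for a VM paired with itself
            if rack_a != rack_b:
                matrix[(vm_a, vm_b)] = Distance.CROSS_RACK
            elif node_a != node_b:
                matrix[(vm_a, vm_b)] = Distance.CROSS_NODE
            else:
                matrix[(vm_a, vm_b)] = Distance.CROSS_VM
    return matrix
```

Using ordered categories lets a scheduler compare two candidate placements with a plain `<`, preferring pairs that communicate over the in-memory virtual network.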
- Storage topology information is provided as a mapping between each virtual device containing the dataset and the virtual machine to which it is local. In a native hardware context, a disk attached to a node can be directly accessed through a PCI bus. In the cloud, however, the physical blocks belonging to a virtual machine image attached to a virtual machine could be located on a different node. Even though a virtual device might appear to be directly connected to the virtual machine, the image file backing the device could be across the network, and potentially closer to another virtual machine in the cluster than the one it is directly attached to. The server (460) queries the physical image location through an API and presents the information to the scheduler (470).
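The mismatch between where a virtual device is attached and where its image actually resides can be illustrated with a short Python sketch. The function name and the two input dictionaries are hypothetical; in the described system the image location would come from the server (460) rather than from a hand-built dictionary.

```python
def vms_local_to_devices(image_host, vm_host):
    """Return device -> sorted list of VMs running on the node that stores its image.

    image_host: device name -> physical node holding the backing image blocks
    vm_host:    virtual machine name -> physical node it runs on
    """
    local = {}
    for device, node in image_host.items():
        local[device] = sorted(vm for vm, host in vm_host.items() if host == node)
    return local
```

For instance, a device whose image sits on node `p2` is reported as local to the virtual machines running on `p2`, even if it is attached to a virtual machine on `p1`.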
- Examples of specific APIs provided by the server (460) are described in Table 1 below:

TABLE 1

| API | Description |
|---|---|
| int get_VM_distance (string vm1, string vm2) | Returns the distance between two virtual machines |
| struct block_location get_block_location (string src, long offset, long length) | Returns the actual location of blocks |
| int get_vm_networking (string vm, struct networkinfo) | Returns the network utilization information of the physical host on which the virtual machine is running |
| int get_vm_diskinfo (string vm, string device, struct diskinfo) | Returns disk utilization information |
| int get_vm_cpuinfo (string vm, struct cpuinfo) | Returns CPU utilization information of the physical host on which the virtual machine is running |

The instruction get_vm_distance provides the job scheduler (470) with hints of the network distance between two virtual machines. The distance is estimated based on observed data transfer rates between the virtual machines, and in one embodiment is expressed in units of bandwidth. The instruction get_block_location supports getting the actual block location instead of the location of a virtual machine, thereby guaranteeing locality. The instructions get_vm_networkinfo, get_vm_diskinfo, and get_vm_cpuinfo allow the job scheduler (470) to query I/O and CPU contention information related to network and disk utilization. In one embodiment, the scheduler (470) can leverage this additional information to make smarter decisions, including in one embodiment placing I/O intensive jobs on physical hosts that have idle I/O resources.
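To show how a scheduler might use such information to place an I/O intensive job, the following Python sketch queries disk utilization for each candidate virtual machine and picks the least utilized one. This is a hypothetical stand-in, not the API of Table 1: the method here returns a single utilization fraction instead of filling a `struct diskinfo`, and `FakeTopologyClient`, `pick_vm_for_io_job`, and all values are invented for illustration.

```python
class FakeTopologyClient:
    """In-memory stand-in for the topology server; all values are made up."""

    def __init__(self, disk_utilization):
        # Maps (vm, device) -> fraction of disk bandwidth in use, 0.0 to 1.0.
        self._disk_utilization = disk_utilization

    def get_vm_diskinfo(self, vm, device):
        # Assumed simplification: return one number rather than a struct.
        return self._disk_utilization[(vm, device)]

def pick_vm_for_io_job(client, candidates, device):
    """Choose the candidate VM whose disk currently has the most idle I/O."""
    return min(candidates, key=lambda vm: client.get_vm_diskinfo(vm, device))
```

Keeping the client behind a small interface means the selection logic can be exercised with canned numbers before it is pointed at a live topology service.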
- FIG. 5 is a flow chart (500) illustrating allocation of resources and management of data and virtual machine placement. As described, CAM is a cloud platform with specific interfaces and support for running MapReduce jobs. A dataset to be processed is initially placed on a general purpose filesystem (GPFS). More specifically, storage and compute resources are not segregated. A MapReduce job is submitted by providing the application, e.g. relevant Java class files, indicating a previously uploaded dataset corresponding to the job, and the number and type of virtual machines to be used for the job (502). In one embodiment, each virtual machine supports several MapReduce job slots depending on the number of virtual CPUs and virtual RAM allocated to the virtual machine. The greater the number of virtual machines assigned to a job, the faster the job is completed.
- CAM determines an optimal placement for the set of new virtual machines requested (504) by considering factors including, but not limited to, current workload distribution among cluster nodes, distribution of the input datasets required by the job, and the physical locations of the required master virtual machine images. The virtual machine images required to boot the virtual machines on the selected nodes are created from the respective master images (506). In one embodiment, a copy-on-write mechanism provided by the general purpose filesystem is utilized for creating the virtual machine, as this allows quickly provisioning a virtual machine image instance without requiring a data copy of a master image. Following step (506), job class files are copied into a cloned virtual machine image (508). In one embodiment, the copying at step (508) takes place by mounting the image as a loop-back file system. Finally, the data images are attached to the virtual machines and the respective files are mounted within the virtual machine (510), as this enables jobs to access data contained therein.
- In relationship to FIG. 4, each physical machine (430), (440), and (450) is equipped with local disks (432), (442), and (452), respectively. A distributed file system (410), also referred to herein as GPFS-SNC, is installed on top of the physical machines (430), (440), and (450). The virtual machine images (436), (446), and (456) are stored in the distributed file system (410). In one embodiment, a cloud manager allocates resources to jobs and manages data placement and virtual machine placement.
- Placement of both data and virtual machines are aspects to be considered with respect to cost. Specifically, placement includes guaranteeing virtual machine closeness, avoiding hotspots, and balancing physical storage utilization according to different job types. Virtual machine closeness expresses cost of accessing data and captures how data should be placed with respect to its associated virtual machines to minimize access times. The hotspot factor expresses the expected load on a machine and identifies machines that do not have enough computational resource(s) to support the assigned virtual machines. To avoid a hotspot, data needs to be placed on the least loaded machine. This can be determined by measuring currently allocated computational resources to the machine and adding to it the expected allocation requirement of the virtual machines that would work with the data to be placed on the machine. Finally, storage utilization expresses a percentage of total physical machine storage space that is in use.
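The hotspot and storage utilization factors described above reduce to simple arithmetic, sketched below in Python. Dividing the expected load by machine capacity is an assumption made so that machines of different sizes can be compared; the text above specifies only the sum of current and expected allocation. All function and parameter names are illustrative.

```python
def hotspot_factor(allocated, expected_new, capacity):
    """Expected load: current allocation plus what the new VMs would add.

    Normalizing by capacity is an assumption, not stated in the description.
    """
    return (allocated + expected_new) / capacity

def storage_utilization(used, total):
    """Percentage of the machine's physical storage space that is in use."""
    return 100.0 * used / total

def least_loaded_machine(machines, expected_new):
    """machines maps a name to (allocated, capacity) in the same resource unit."""
    return min(
        machines,
        key=lambda name: hotspot_factor(machines[name][0], expected_new, machines[name][1]),
    )
```

With `machines = {"p1": (6, 8), "p2": (2, 8)}` and two units of expected new allocation, `p2` is selected because its expected load is lower.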
- The following table, Table 2, shows the significance of the three factors outlined above, which affect the performance of different workloads.
-
TABLE 2
| Job Type | Virtual Machine Closeness | Hotspot Factor | Storage Utilization |
|---|---|---|---|
| Map and Reduce Intensive | Yes | Yes | Yes |
| Map Intensive | No | Yes | Yes |
| Reduce Intensive | No | No | Yes |
For workloads that are both Map and Reduce intensive, related data should be placed close together and on the least loaded machine. For Map intensive workloads, the data should be placed on the least loaded machine, but does not necessarily need to be placed close together due to the light shuffle traffic in such workload. For Reduce intensive workloads, the only concern is the storage utilization of the machine on which the virtual machine is to be placed. For all types of workloads, it is desirable to place data evenly across racks to minimize the need to rearrange data over time for supporting migrating virtual machines. - The factors organized in Table 2 are used in constructing a min-cost flow graph that encodes the factors.
FIG. 6 is an illustration of a flow graph (600) showing a sample network topology for data placement. More specifically, the sample network topology consists of six physical nodes (p1, p2, p3, p4, p5, p6) identified as (610), (612), (614), (616), (618), and (620). The physical nodes are organized into three racks (r1, r2, r3) identified as (630), (632), and (634). A master rack r4, e.g. switch, identified as (650) connects the racks. In one embodiment, the approach shown here can support any topology in which network traffic can be estimated. The unit of data placement is a virtual machine image to ensure that an entire image is available at one location.
- Based on the flow graph of FIG. 6, a second flow graph (700) is illustrated in FIG. 7 for sample data placement. Two data items, d1 (702) and d2 (704), with requests for 5 and 2 virtual machine images, respectively, are submitted to the cloud, e.g. shared pool of resources. The number of virtual machine images requested by a data item is denoted as the data item's supply for the flow graph. A sink node, S (790), is added to the graph to support the virtual machines. The number of virtual machines that a sink node can handle is assigned as a demand value. In the example shown herein, sink node (790) has a demand of seven and is the only place that can receive all of the flows. Each flow graph edge has two parameters attached: the capacity of the edge and the cost for a flow to go through the edge. The data nodes, d1 (702) and d2 (704), have outgoing links to each rack with virtual machine closeness as costs. As shown, data node d1 (702) has link (712) going to rack r1 (722), link (714) going to rack r2 (724), link (716) going to rack r3 (726), and link (718) going to rack r4 (728). Similarly, data node d2 (704) has link (732) going to rack r1 (722), link (734) going to rack r2 (724), link (736) going to rack r3 (726), and link (738) going to rack r4 (728). Six physical nodes are shown herein, specifically p1 (740), p2 (742), p3 (744), p4 (746), p5 (748), and p6 (750). The hotspot factor is encoded in the links (750), (752), (754), (756), (758), and (760) from the racks to each physical node p within its range. Even though r4 (728) serves as a switch between the racks, it is shown as directly connected to all the physical nodes to ensure that the least loaded machine can be chosen for Map intensive jobs without being constrained by the network topology.
- All the physical nodes p1 (740), p2 (742), p3 (744), p4 (746), p5 (748), and p6 (750) are linked to the sink node (790) with storage utilization as link costs. There is no direct link from data item node dj to the associated physical host pi. This is to support scaling up the system. The following table, Table 3, shows the values assigned to the flow graph for data placement.
-
TABLE 3
| | Data set dj | Rack rk | Physical host | Sink |
|---|---|---|---|---|
| Supply | Σ(Ndj) | 0 | 0 | −Σ(Ndj) |
| Incoming link from | N/A | dj | Rack rk | Physical host |
| Outgoing link (capacity, cost) | To rack (Ndj, αjk) | To physical host (Capi, βi) | To sink (Capi, γi) | N/A |
- As shown, Ndj is the number of virtual machine images requested by dataset dj, and αjk captures virtual machine closeness. The cost, αjk, of the outgoing link from the dataset dj to the physical host pi on which the data is placed on rack rk is estimated conservatively by the traffic in the shuffle phase as follows:
- [Equation for αjk not reproduced in this text.]
- where distancemax is the network distance between any two nodes in the rack rk.
The hotspot factor is captured using βi for physical node pi and is estimated by the current and expected load as follows:
$$\beta_i = a \cdot (load_{exp} + load_{curr} - load_{min})$$
where loadcurr and loadmin represent the current load and the minimum current load, respectively, and a is a parameter that acts as a knob to tune the weight of the hotspot factor with respect to other costs. The expected load, loadexp, is as follows:
$$load_{exp} = \sum_j \frac{p_j}{1 - p_j} \cdot CRes(d_j)$$
where pj=λj/μj, λj represents the number of dj's associated jobs that arrive within a given time interval, and μj represents the mean time for each virtual machine to process a block.
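The two formulas above can be evaluated directly. The following is a minimal Python sketch, not the patented implementation: the function names and all numeric inputs are made up for illustration, CRes(dj) is simply passed in as a number because the text does not define how it is obtained, and the check that pj is below 1 is an added assumption.

```python
def expected_load(datasets):
    """Return load_exp = sum_j p_j / (1 - p_j) * CRes(d_j).

    `datasets` is an iterable of (lam_j, mu_j, c_res_j) tuples, where
    p_j = lam_j / mu_j as in the text and c_res_j stands in for CRes(d_j).
    """
    total = 0.0
    for lam, mu, c_res in datasets:
        p = lam / mu
        if not 0.0 <= p < 1.0:
            # Assumption: the ratio p_j / (1 - p_j) is only used while p_j < 1.
            raise ValueError("p_j must satisfy 0 <= p_j < 1")
        total += p / (1.0 - p) * c_res
    return total


def hotspot_cost(a, load_exp, load_curr, load_min):
    """Return beta_i = a * (load_exp + load_curr - load_min)."""
    return a * (load_exp + load_curr - load_min)


# Made-up numbers: two data items on a host whose current load is 3.0,
# while the least loaded host in the cluster has a load of 1.0.
load_exp = expected_load([(1.0, 2.0, 4.0), (1.0, 4.0, 3.0)])
beta = hotspot_cost(a=2.0, load_exp=load_exp, load_curr=3.0, load_min=1.0)
```

Raising the knob a makes a loaded host proportionally more expensive relative to the closeness and storage costs.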
- Storage utilization of a physical node pi is captured by γi, which is determined by the current storage utilization compared with the minimum storage utilization of all pi, as follows:
$$\gamma_i = b \cdot (storageUtilization_{p_i} - storageUtilization_{min})$$
where b is a parameter used to fine tune the weight of storage utilization with respect to other factors. Finally, with respect to capacity, the following formula is employed as an estimate of capacity for each physical host:
$$Cap_i = \frac{freespace_{p_i}}{sizeVMImg}$$
- To enable the graph to capture the correlation between virtual machine image placements for one data request, a split factor parameter is provided to specify whether flows from a node are allowed to be split across different links. In one embodiment, the value of this parameter is defined as true or false. For example, if the split factor for all the links from d1 and d2 is true, all flows from a data node will go in whole through one of r1, r2, r3, and r4, and will not be split between the racks. Once a new data upload request is received, the cloud server updates the graph and computes a global optimal solution based on the computed solution for the newly updated data. Accordingly, the scheduler is periodically updated based on the new solution and can accommodate varying loads.
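To make the data placement graph concrete, the following Python sketch builds the edges of Table 3 and solves them with a textbook successive-shortest-path min-cost flow. It is an illustration under stated assumptions rather than the solver used by the described system: the class and function names are hypothetical, the topology and all costs and capacities are made-up integers, and the split factor is ignored, so flows may be divided freely between racks.

```python
class MinCostFlow:
    """Successive-shortest-path min-cost flow using Bellman-Ford."""

    def __init__(self):
        self.adj = {}    # node -> indices of outgoing residual edges
        self.edges = []  # [head, residual capacity, cost]; edge i pairs with i ^ 1

    def add_edge(self, u, v, cap, cost):
        """Add edge u -> v plus its reverse edge; return the forward index."""
        index = len(self.edges)
        self.adj.setdefault(u, []).append(index)
        self.edges.append([v, cap, cost])
        self.adj.setdefault(v, []).append(index + 1)
        self.edges.append([u, 0, -cost])
        return index

    def flow_on(self, index):
        """Flow on a forward edge equals the residual capacity of its reverse."""
        return self.edges[index ^ 1][1]

    def solve(self, source, sink):
        total_flow = total_cost = 0
        while True:
            dist, via = {source: 0}, {}
            for _ in range(len(self.adj)):  # Bellman-Ford relaxation rounds
                changed = False
                for u in list(dist):
                    for index in self.adj[u]:
                        v, cap, cost = self.edges[index]
                        if cap > 0 and dist[u] + cost < dist.get(v, float("inf")):
                            dist[v], via[v], changed = dist[u] + cost, index, True
                if not changed:
                    break
            if sink not in dist:
                return total_flow, total_cost
            push, v = float("inf"), sink    # bottleneck along the cheapest path
            while v != source:
                push = min(push, self.edges[via[v]][1])
                v = self.edges[via[v] ^ 1][0]
            v = sink
            while v != source:              # update residual capacities
                self.edges[via[v]][1] -= push
                self.edges[via[v] ^ 1][1] += push
                v = self.edges[via[v] ^ 1][0]
            total_flow += push
            total_cost += push * dist[sink]


def build_placement_graph(supplies, alpha, hosts):
    """supplies: {data item: N_dj}; alpha: {(data item, rack): alpha_jk};
    hosts: {host: (rack, Cap_i, beta_i, gamma_i)}."""
    graph = MinCostFlow()
    for item, n in supplies.items():
        graph.add_edge("source", item, n, 0)
    for (item, rack), cost in alpha.items():
        graph.add_edge(item, rack, supplies[item], cost)             # (N_dj, alpha_jk)
    to_sink = {}
    for host, (rack, cap, beta, gamma) in hosts.items():
        graph.add_edge(rack, host, cap, beta)                        # (Cap_i, beta_i)
        to_sink[host] = graph.add_edge(host, "sink", cap, gamma)     # (Cap_i, gamma_i)
    return graph, to_sink


# Made-up example: one data item requesting three images, two racks, three hosts.
supplies = {"d1": 3}
alpha = {("d1", "r1"): 1, ("d1", "r2"): 2}
hosts = {"p1": ("r1", 1, 0, 0), "p2": ("r1", 1, 1, 0), "p3": ("r2", 5, 0, 1)}
graph, to_sink = build_placement_graph(supplies, alpha, hosts)
flow, cost = graph.solve("source", "sink")
placement = {host: graph.flow_on(index) for host, index in to_sink.items()}
```

In this example the two hosts on the closer rack r1 fill up first, and the remaining image goes to p3 on r2. Enforcing the split factor would require an additional constraint that a plain min-cost flow solver does not provide.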
- The goal of virtual machine placement is to maximize global data locality and job throughput. Our model considers both virtual machine migration and delayed scheduling of a job as part of the optimal solution. Delaying a job is used to explore better data locality opportunities that can arise in the near future, while minimizing time wasted during the waiting. Migrating a virtual machine belonging to a job enables the scheduler to make room for other suitable jobs or to explore better location opportunity.
-
FIG. 8 is an illustration of a flow graph (800) for virtual machine placement. Each job vj, (802) and (804), is submitted to the system at the source node with the number of requested virtual machines, Nvj, as the supply value. The goal of the virtual machine allocator is either to maintain the job as unscheduled, e.g. allocate 0 virtual machines, or to allocate Nvj virtual machines for each request. There is a single sink node S (890) with demand equal to minus the sum of the supply. The request from each job acts as a flow that goes either through the rack nodes (810), (812), (814), and (816), or through the unscheduled nodes (820), (822), and (824), and finally to the sink node (890). If a job is unscheduled, none of its virtual machines are allocated. Otherwise, the flow goes through the physical nodes (830), (832), (834), (836), (838), and (840). Based on the min-cost solution, an allocation scheme with min-cost can be derived. If the virtual machines are allocated to the highest level rack, it implies that the virtual machines can be allocated arbitrarily to any set of nodes in the virtual machines under the rack.
- The job type information is modeled as the cost of the edge from each job to the rack nodes in the flow based graph. The higher level rack has a higher cost than the lower level rack in terms of reduce traffic. In one embodiment, the cost of the highest level rack is estimated by a worst case virtual machine arrangement with regard to the map and reduce traffic. The cost of the edges to the unscheduled nodes is set to increase over time so that delayed jobs get allocated sooner than recently submitted jobs. That cost also controls when a job stops waiting for better locality and therefore offers a knob to tune the trade-off between data locality and latency. The aggregated unscheduled nodes control how many virtual machines can remain unscheduled, which controls the trade-off between system resource utilization and data locality.
The cost of the edges to the running node set increases over time and is job-progress aware.
-
FIG. 9 illustrates a table (900) with values assigned to the flow graph for virtual machine placement and node categorization. Various nodes in the graph are categorized into different types as shown in the table. A preferred node set (prj) (910) is a set of graph nodes that point to a set of physical nodes pi that have the dataset associated with a job, vj, stored on them. An edge from a preferred node to physical node pi has a cost of 0 and a capacity equal to the number of virtual machine disk images stored on physical node pi. A running node set (ruj) (920) is a set of dynamically added nodes that point to the physical nodes pi that are currently hosting the virtual machines of job vj. An edge from ruj to pi has a cost of 0 and a capacity equal to the number of virtual machines running on physical node pi. An unscheduled node set, uj, (930) is a set of nodes that provide information about currently unscheduled jobs. The unscheduled node set, uj, has an outgoing edge with capacity Nvj and a cost of 0 to an unscheduled aggregator. An unscheduled aggregator node, u, (940) has an outgoing edge with cost 0 to the sink with capacity defined as:
$$N_{unsched} = \sum_j N_{vj} - M + M_{idle}$$
where M is the total number of virtual machines that the cluster can support and Midle denotes the number of idle virtual machine slots allowed in the cluster.
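As a worked example of the aggregator capacity, the short Python sketch below evaluates the formula for made-up values. The function name is hypothetical, and the floor at zero is an assumption added here so the result can serve as an edge capacity; the text does not state how a negative value would be handled.

```python
def unscheduled_capacity(requested_vms, total_slots, idle_slots):
    """Return N_unsched = sum(N_vj) - M + M_idle, floored at zero.

    The floor at zero is an assumption of this sketch, not part of the text.
    """
    return max(0, sum(requested_vms) - total_slots + idle_slots)


# Three jobs request 5, 2, and 4 virtual machines; the cluster supports
# M = 8 virtual machines and may leave M_idle = 1 slot idle.
n_unsched = unscheduled_capacity([5, 2, 4], total_slots=8, idle_slots=1)
```

Here 11 virtual machines are requested against 8 supported, so with one idle slot allowed up to 4 requested virtual machines may remain unscheduled in the round.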
- The rack node set, rk, (950) represents a rack in the topology of the cluster. It has outgoing links with cost 0 to its sub-racks or, if it is at the lowest level, to physical nodes. The links have capacity Nrk, which is the total number of virtual machine slots that can be serviced by its underlying nodes. The physical host node set, pi, (960) has an outgoing link to the sink with capacity Nvm, the number of virtual machines that can be accommodated on the physical host, and cost 0. The graph has a sink node (970) with demand represented as Σ(Nvj). The job node set, vj, (980) represents each job node vj with supply Nvj. It has multiple outgoing edges corresponding to the potential virtual machine allocation decisions for the job set.
- The edges lead to a rack node set, a preferred node set, a running node set, and an unscheduled node set. For the rack node set, rk, an edge from vj to rk indicates that rk can accommodate vj. The cost of the edge is pj, which is calculated from the map and reduce traffic cost. If the capacity of the edge is greater than Nvj, it implies that the virtual machines of vj will be allocated on some preferred nodes on the rack. For the preferred node set, prj, an edge from job vj to the job-wide preferred node set, prj, has capacity Nvj and cost θ. The cost is estimated by only the reduce phase traffic because in this case map traffic is assumed to be zero. The running node set, ruj, has a link from job vj with capacity Nvj and cost φ=c*T, where T is the time the job has been executing on the set of machines and c is a constant used to adjust the cost relative to other costs. For the unscheduled node set, uj, an edge from vj to the job-wide unscheduled node uj has capacity Nvj and cost εj, which corresponds to the penalty of leaving job vj unscheduled. εj=d*T, where T is the time that job vj is left unscheduled and d is a constant used to adjust the cost relative to other costs. The split factor for this link is marked as true, which means the allocation of all the virtual machines is either satisfied or delayed until the next round.
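The time-dependent edge costs can be illustrated with a few lines of Python. This is a per-job simplification with hypothetical function names and made-up constants: the actual decision comes from the global min-cost flow solution, which also accounts for capacities and competing jobs.

```python
def running_cost(c, elapsed):
    """phi = c * T for a job that has been executing for time T."""
    return c * elapsed


def unscheduled_cost(d, waited):
    """epsilon_j = d * T for a job left unscheduled for time T."""
    return d * waited


def keeps_waiting(d, waited, best_placement_cost):
    """Per-job simplification: waiting wins only while its penalty is lower."""
    return unscheduled_cost(d, waited) < best_placement_cost


# With d = 3 and a cheapest available placement costing 12, track the
# penalty and the waiting decision over five scheduling rounds.
history = [(t, unscheduled_cost(3, t), keeps_waiting(3, t, 12)) for t in range(1, 6)]
```

The penalty grows linearly with the waiting time, so in this example the job stops waiting for better locality from the fourth round onward. A larger d shortens the wait, which is the locality versus latency knob described above.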
- Based on the output of a min-cost flow solution, the virtual machine allocation assignment can be obtained from the graph by locating where the associated flow leads for each virtual machine request vj. Flow to an unscheduled node indicates that the virtual machine request is skipped for the current round. If the flow leads to a preferred node set, the virtual machine request is scheduled on that set of nodes. If the flow goes to a rack node, it implies that the virtual machines from the job are assigned to arbitrary hosts in that rack.
- The number of flows sent to a physical host through rack nodes or the preferred node set does not exceed the number of available virtual machines of that physical host. This is guaranteed by the specified link capacity from physical host to sink. Accordingly, all virtual machine requests that are allocated will be matched to a corresponding physical host.
-
FIG. 10 is a flow chart (1000) illustrating a process for assessing and leveraging the physical and virtual machine topology in a shared pool of resources. The status of the virtual machines in the shared pool is collected (1002). It is recognized that one or more virtual machines may be assigned to a single physical host, thereby supporting the virtualization of an underlying physical machine. Based upon the collected data, associated topology information is gathered and communicated to a server at a root node of the hierarchical organization of physical and virtual machines (1004). - Each virtual machine is provided with an embedded agent, and each physical machine to which the virtual machines are assigned is provided with an embedded monitor. A server machine is provided in communication with each of the physical machines and functions to periodically collect information from the embedded monitors of the underlying physical machines. The embedded monitors function to sense local topology, disk, and network information. In one embodiment, the embedded monitors are in the form of software that runs on the physical machine to collect status data of the related virtual machines. Similarly, the embedded agent of each virtual machine communicates actual topology and system utilization information to the server machine. Accordingly, each virtual and physical machine includes embedded tools to gather topology associated utilization information and to convey the gathered information to the server machine.
- Following the step of gathering the topological data at step (1004), the gathered data is organized in a single location (1006). In one embodiment, the single location may be a root node representing a physical server that is in communication with each of the virtual machines and their associated physical machine. As shown at step (1004), data communicated from the embedded agents of the virtual machines includes actual topology and system utilization information. The data gathered at step (1004) includes a storage topology underlying a virtual topology of the shared pool of resources together with the associated resource utilization information. With the knowledge of the gathered data, a job is assigned to a select virtual machine in the shared pool with the assignment leveraging the organized storage topology information (1008). In one embodiment, the job assignment at step (1008) is designed to support efficient performance of job associated I/O. Accordingly, the process of gathering and organizing local topology information enables jobs to be intelligently assigned to a select virtual machine.
- The job assignment at step (1008) may be to one virtual machine or to multiple virtual machines. Similarly, the job may be a read job or a write job. With respect to both scenarios, a virtual topological distance is returned in response to the job assignment (1010). The virtual topological distance may be a distance between two or more virtual machines when the job is assigned to multiple machines, or the virtual topological distance may be between a virtual machine and a block of data when the job is supporting a single virtual machine for a read or write job. Following step (1010), it is determined if the returned topological distance is between at least two virtual machines (1012). A positive response to the determination at step (1012) is followed by creating a shared memory channel for inter-virtual machine data communication between two virtual machines local to the same physical machine (1014). The creation of the shared memory channel supports efficient data transfer between the two virtual machines. More specifically, memory copy may be employed for communication between the virtual machines, thereby avoiding communication across a virtual network stack (1016). Conversely, a negative response to the determination at step (1012) is followed by utilization of the virtual network stack for communication between the virtual machines supporting the assigned job (1018). Accordingly, the physical proximity of virtual machines may lend itself to efficient transfer of inter-virtual machine communication.
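The channel selection in steps (1012) through (1018) can be sketched as a simple co-location test. The Python below is a hypothetical illustration: the class, field, and return-value names are assumptions, and it only decides which channel to use without creating an actual shared memory channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualMachine:
    name: str
    physical_host: str  # identifier of the underlying physical machine


def choose_channel(vm_a, vm_b):
    """Pick shared memory for co-located virtual machines, else the network stack."""
    if vm_a.physical_host == vm_b.physical_host:
        return "shared_memory"
    return "virtual_network_stack"


# Made-up virtual machines: the first two share a physical host.
vm1 = VirtualMachine("vm1", "hostA")
vm2 = VirtualMachine("vm2", "hostA")
vm3 = VirtualMachine("vm3", "hostB")
local_channel = choose_channel(vm1, vm2)
remote_channel = choose_channel(vm1, vm3)
```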
- A physical machine supports the virtual machine and data blocks support the job. The location of the data blocks within the shared pool affects the assignment of the job to the physical machine and associated virtual machine(s). More specifically, an efficient use of resources in the shared pool ensures a physical proximity of the physical machine to the data blocks. In one embodiment, the job is assigned to a physical machine in the same physical data center as the subject data blocks. Accordingly, part of the process of assignment of the job at step (1008) includes ascertaining a physical location of the data blocks in the shared pool supporting the job.
- In addition to the location of the blocks, the bandwidth of the underlying physical machine to support the job is critical. The step of leveraging the storage topology information at step (1008) may require one or more additional steps.
FIG. 11 is a flow chart (1100) illustrating the additional steps to support the aspect of leveraging the storage topology of the shared pool of resources. As described above, in response to the storage topology, a virtual machine is designated to receive the job for processing (1102). Prior to actual job processing, utilization information of the physical machine local to the virtual machine is ascertained (1104). The utilization information includes, but is not limited to, the processing unit and network utilization information. It is determined if the underlying physical machine has the bandwidth and capability to support the job (1106). A negative response to the determination at step (1106) is followed by a return to the selection and assignment of a different virtual machine in the shared pool. Conversely, a positive response to the determination at step (1106) is an indication that the selected virtual machine has both sufficient bandwidth to support the job and a close topological distance to the physical location of the data block(s) to support the job (1108). In one embodiment, the close topological distance includes, but is not limited to, data residing in the same data center as the virtual machine. Accordingly, the aspect of leveraging the storage topology includes an assessment of the operation of the machine together with the location of data to support the job.
- As shown in
FIGS. 10-11, a method is provided to employ the topological organization of the machines, together with the location of data blocks to support the job, for intelligent assignment of a job. The job is assigned to a machine that has been assessed to support efficient processing. FIG. 12 is a block diagram (1200) illustrating tools embedded in a computer system to support a technique employed for assessment of resource utilization for use in assignment of a job within a shared pool of resources. Specifically, a shared pool of configurable computer resources is shown with a first data center (1210), a second data center (1230), and a third data center (1250). Although three data centers are shown in the example herein, the invention should not be limited to this quantity of data centers in the computer system. Each of the data centers represents a computing resource. Accordingly, one or more data centers may be employed to support efficient and intelligent assignment of jobs with respect to resource utilization and proximity to data blocks in support of the job(s).
- Each of the data centers in the system is provided with at least one server in communication with data storage. More specifically, the first data center (1210) is provided with a first server (1220) having a processing unit (1222), in communication with memory (1224) across a bus (1226), and in communication with data storage (1228); the second data center (1230) is provided with a second server (1240) having a processing unit (1242), in communication with memory (1244) across a bus (1246), and in communication with second local storage (1248); and the third data center (1250) is provided with a third server (1260) having a processing unit (1262), in communication with memory (1264) across a bus (1266), and in communication with third local storage (1268). The first server (1220) is also referred to herein as a physical host.
Communication among the data centers is supported across one or more network connections (1205).
- The second server (1240) includes two virtual machines (1232) and (1236). The first virtual machine (1232) has an embedded agent (1232 a) and the second virtual machine (1236) has an embedded agent (1236 a). In addition, the second server (1240) includes a monitor (1234) to facilitate communication with the first and second virtual machines (1232) and (1236), respectively. The third server (1260) includes two virtual machines (1252) and (1256). The first virtual machine (1252) has an embedded agent (1252 a) and the second virtual machine (1256) has an embedded agent (1256 a). In addition, the third server (1260) includes a monitor (1254) to facilitate communication with the first and second virtual machines (1252) and (1256), respectively. Although only two virtual machines (1232) and (1236) are shown in communication with the second server (1240) and only two virtual machines (1252) and (1256) are shown in communication with the third server (1260), the invention should not be limited to these quantities, as these quantities are merely for illustrative purposes. The quantity of the virtual machines in communication with the second and third servers (1240) and (1260), respectively, may be increased or decreased.
- As shown herein, each of the second and third servers (1240) and (1260), respectively, supports two virtual machines (1232), (1236) and (1252), (1256), respectively. The monitor (1234) of the server (1240) collects status data from each of the virtual machines (1232) and (1236). Monitor (1234) communicates with embedded agents (1232 a) and (1236 a) to collect virtual machine status from virtual machines (1232) and (1236), respectively. Similarly, monitor (1254) collects status data from each of the virtual machines (1252) and (1256), and specifically from embedded agents (1252 a) and (1256 a), respectively.
- The first server (1220) is provided with a functional unit (1270) having one or more tools to support intelligent assignment of one or more jobs in the shared pool of resources. The functional unit (1270) is shown local to the first data center (1210), and specifically in communication with memory (1224). In one embodiment, the functional unit (1270) may be local to any of the data centers in the shared pool of resources. The tools embedded in the functional unit (1270) include, but are not limited to, a director (1272), a topology manager (1274), a hook manager (1276), a storage topology manager (1278), a resource utilization manager (1280), and an application manager (1282).
- The director (1272) is provided in the shared pool to periodically communicate with monitors (1234) and (1254) to organize and retain in a single location a storage topology underlying a virtual topology of the shared pool of resources, together with associated resource utilization information. More specifically, the communication of the director (1272) with the monitors (1234) and (1254) supports gathering and organization of the topology of the shared pool of resources. By organizing and understanding the topology data, the director (1272) may leverage the resource utilization information to intelligently assign a job to one or more of the shared resources in the pool, and in a manner that supports efficient performance of job associated I/O. Accordingly, the director (1272) both gathers and leverages the topology to support efficient processing of read and write jobs in the shared pool of resources.
- As described above, several managers are provided to support the functionality of the director (1272). The topology manager (1274), which is in communication with the director (1272), functions to return virtual topological distance data to the director (1272). Virtual topological distance data includes, but is not limited to, a distance between two virtual machines or a distance between a virtual machine and a block of data. For example, two virtual machines in communication with the same server are considered in relatively close proximity. However, a second virtual machine in communication with a second server and a third virtual machine in communication with a third server are considered relatively distant in comparison to the two virtual machines in communication with the same server. The storage topology manager (1278), which is in communication with the director (1272), functions to return a physical location of one or more data blocks in support of a job in the shared pool of resources. In one embodiment, the storage topology manager (1278) returns the physical location of the data blocks to the director (1272), thereby enabling the director to intelligently assign a job to a virtual machine in response to the location of the subject data block(s). Accordingly, the topology manager (1274) functions to address distances within the hierarchy with respect to efficient job processing, and the storage topology manager (1278) functions to address the location of the data block supporting the job.
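One way to picture the distance data returned by the topology manager is a small hierarchy-based metric. The Python sketch below is an assumption for illustration only: the text does not define a numeric scale, so the values 0, 1, and 2 and the names used here are hypothetical. The identifiers in the example reuse the reference numerals of FIG. 12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    data_center: str
    server: str


def topological_distance(a, b):
    """Assumed scale: 0 = same server, 1 = same data center, 2 = otherwise."""
    if a.data_center != b.data_center:
        return 2
    return 0 if a.server == b.server else 1


# Virtual machines (1232) and (1236) run on server (1240) in data center (1230);
# virtual machine (1252) runs on server (1260) in data center (1250).
vm_1232 = Location("1230", "1240")
vm_1236 = Location("1230", "1240")
vm_1252 = Location("1250", "1260")
near = topological_distance(vm_1232, vm_1236)
far = topological_distance(vm_1232, vm_1252)
```

The same function can compare a virtual machine with the location of a data block, since both are described by a data center and a server in this sketch.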
- Three other managers are also provided, including a resource utilization manager (1280), an application manager (1282), and a hook manager (1276). The resource utilization manager (1280) functions to address utilization of one or more physical or virtual resources. Each resource has innate limitations. The resource utilization manager (1280) returns utilization information of a processing unit and network utilization information associated with the underlying physical and virtual machines to the director (1272). The application manager (1282), which is in communication with the resource utilization manager (1280), assigns the job to a virtual machine responsive to the resource utilization information. More specifically, the application manager (1282) ensures that the machine in the topology to which the job is assigned has sufficient bandwidth to support the job, as well as a sufficiently close topological distance to the data blocks that support the job. Accordingly, both utilization and bandwidth are accounted for by the resource utilization manager (1280) and the application manager (1282), respectively.
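The assignment check performed by the application manager can be sketched as a filter followed by a selection. The Python below is a hypothetical illustration: the thresholds, field names, and tie-breaking order are assumptions, since the text only requires sufficient bandwidth and a sufficiently close topological distance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    vm: str
    cpu_utilization: float      # fraction of the underlying physical machine in use
    network_utilization: float  # fraction of network bandwidth in use
    distance_to_data: int       # topological distance to the job's data blocks


def select_vm(candidates, max_cpu=0.8, max_network=0.8, max_distance=1):
    """Return the name of the closest, least loaded eligible VM, or None."""
    eligible = [
        c for c in candidates
        if c.cpu_utilization <= max_cpu
        and c.network_utilization <= max_network
        and c.distance_to_data <= max_distance
    ]
    if not eligible:
        return None  # caller returns to selecting a different virtual machine
    best = min(eligible, key=lambda c: (c.distance_to_data, c.cpu_utilization))
    return best.vm


# Made-up candidates: A is overloaded, C is too far from the data.
chosen = select_vm([
    Candidate("A", 0.9, 0.2, 0),
    Candidate("B", 0.5, 0.3, 1),
    Candidate("C", 0.1, 0.1, 2),
])
```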
- In addition to the managers described in detail above, the hook manager (1276) functions to facilitate communication among virtual machines. More specifically, the hook manager (1276), which is in communication with the director (1272), is provided to create a shared memory channel for inter-virtual machine communication. The shared memory channel facilitates communication between two virtual machines sitting on the same physical machine by enabling data transfer between two such virtual machines to take place on the same memory stack, e.g. across the memory channel. Accordingly, the shared memory channel created by the hook manager (1276) supports efficient communication of data within the hierarchical structure of the shared pool of resources.
- As identified above, the director (1272), topology manager (1274), hook manager (1276), storage topology manager (1278), resource utilization manager (1280), and application manager (1282) are shown residing in the functional unit (1270) of the server (1220) local to the first data center (1210). Although in one embodiment, the functional unit (1270) and associated director and managers, respectively, may reside as hardware tools external to the memory (1224) of server (1220) of the first data center (1210), they may be implemented as a combination of hardware and software, or may reside local to the second data center (1230) or the third data center (1250) in the shared pool of resources. Similarly, in one embodiment, the director and managers may be combined into a single functional item that incorporates the functionality of the separate items. As shown herein, each of the director and manager(s) are shown local to one data center. However, in one embodiment they may be collectively or individually distributed across the shared pool of configurable computer resources and function as a unit to assess the topology of processing units and data storage in the shared pool, and to process one or more jobs responsive to the hierarchy. Accordingly, the managers may be implemented as software tools, hardware tools, or a combination of software and hardware tools.
- As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
- A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Aspects of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- Referring now to FIG. 13, a block diagram (1300) shows a system for implementing an embodiment of the present invention. The computer system includes one or more processors, such as a processor (1302). The processor (1302) is connected to a communication infrastructure (1304) (e.g., a communications bus, cross-over bar, or network). The computer system can include a display interface (1306) that forwards graphics, text, and other data from the communication infrastructure (1304) (or from a frame buffer not shown) for display on a display unit (1308). The computer system also includes a main memory (1310), preferably random access memory (RAM), and may also include a secondary memory (1312). The secondary memory (1312) may include, for example, a hard disk drive (1314) and/or a removable storage drive (1316), representing, for example, a floppy disk drive, a magnetic tape drive, or an optical disk drive. The removable storage drive (1316) reads from and/or writes to a removable storage unit (1318) in a manner well known to those having ordinary skill in the art. Removable storage unit (1318) represents, for example, a floppy disk, a compact disc, a magnetic tape, or an optical disk, etc., which is read by and written to by removable storage drive (1316). As will be appreciated, the removable storage unit (1318) includes a computer readable medium having stored therein computer software and/or data.
- In alternative embodiments, the secondary memory (1312) may include other similar means for allowing computer programs or other instructions to be loaded into the computer system. Such means may include, for example, a removable storage unit (1320) and an interface (1322). Examples of such means may include a program package and package interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units (1320) and interfaces (1322) which allow software and data to be transferred from the removable storage unit (1320) to the computer system.
- The computer system may also include a communications interface (1324). Communications interface (1324) allows software and data to be transferred between the computer system and external devices. Examples of communications interface (1324) may include a modem, a network interface (such as an Ethernet card), a communications port, or a PCMCIA slot and card, etc. Software and data transferred via communications interface (1324) are in the form of signals which may be, for example, electronic, electromagnetic, optical, or other signals capable of being received by communications interface (1324). These signals are provided to communications interface (1324) via a communications path (i.e., channel) (1326). This communications path (1326) carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a radio frequency (RF) link, and/or other communication channels.
- In this document, the terms “computer program medium,” “computer usable medium,” and “computer readable medium” are used to generally refer to media such as main memory (1310) and secondary memory (1312), removable storage drive (1316), and a hard disk installed in hard disk drive (1314).
- Computer programs (also called computer control logic) are stored in main memory (1310) and/or secondary memory (1312). Computer programs may also be received via a communication interface (1324). Such computer programs, when run, enable the computer system to perform the features of the present invention as discussed herein. In particular, the computer programs, when run, enable the processor (1302) to perform the features of the computer system. Accordingly, such computer programs represent controllers of the computer system.
- The flowcharts and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
- The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
- The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated. Accordingly, the enhanced cloud computing model supports flexibility with respect to application processing and disaster recovery, including, but not limited to, supporting separation of the location of the data from the application location and selection of an appropriate recovery site.
- As described herein, a platform is provided with a resource scheduler to address performance degradation of MapReduce jobs when running in the cloud environment. Cloud aware MapReduce (CAM) adopts a three level approach to avoid placement anomalies due to inefficient resource allocation, including: placing data within the cluster that runs the jobs that most commonly operate on the data, selecting the most appropriate physical nodes on which to place the set of virtual machines assigned to a job, and exposing compute, storage, and network topologies to the scheduler. CAM uses a flow network based algorithm that is able to reconcile resource allocation with a variety of other competing constraints, such as storage utilization, changing processor load, and network link capacities.
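- For illustration only, a flow network based placement of the kind mentioned above can be pictured as a minimum-cost flow problem. The following Python sketch is not the algorithm of the described embodiments; it is a minimal example under stated assumptions. The names `MinCostFlow` and `place_tasks`, the distance values, and the slot counts are all hypothetical. Each task sends one unit of flow to a virtual machine, the edge cost stands in for the topological distance between that virtual machine and the task's data, and the capacity of the edge from each virtual machine to the sink stands in for its free processing slots.

```python
from collections import defaultdict


class MinCostFlow:
    """Successive-shortest-path min-cost flow (illustrative, small graphs only)."""

    def __init__(self):
        self.graph = defaultdict(list)  # node -> indices into self.edges
        self.edges = []                 # each entry: [to_node, residual_capacity, cost]

    def add_edge(self, u, v, cap, cost):
        # Forward edge gets an even index; its reverse edge is at index ^ 1.
        self.graph[u].append(len(self.edges))
        self.edges.append([v, cap, cost])
        self.graph[v].append(len(self.edges))
        self.edges.append([u, 0, -cost])

    def solve(self, source, sink):
        flow = cost = 0
        while True:
            # Bellman-Ford shortest path on the residual graph.
            dist = defaultdict(lambda: float("inf"))
            dist[source] = 0
            prev = {}
            changed = True
            while changed:
                changed = False
                for u in list(self.graph):
                    if dist[u] == float("inf"):
                        continue
                    for idx in self.graph[u]:
                        v, cap, c = self.edges[idx]
                        if cap > 0 and dist[u] + c < dist[v]:
                            dist[v] = dist[u] + c
                            prev[v] = idx
                            changed = True
            if sink not in prev:
                return flow, cost  # no augmenting path remains
            # Find the bottleneck capacity along the path, then augment.
            push = float("inf")
            v = sink
            while v != source:
                idx = prev[v]
                push = min(push, self.edges[idx][1])
                v = self.edges[idx ^ 1][0]
            v = sink
            while v != source:
                idx = prev[v]
                self.edges[idx][1] -= push
                self.edges[idx ^ 1][1] += push
                v = self.edges[idx ^ 1][0]
            flow += push
            cost += push * dist[sink]


def place_tasks(distance, free_slots):
    """distance[task][vm]: assumed topological distance from vm to the task's data.
    free_slots[vm]: assumed number of tasks the vm can still accept."""
    mcf = MinCostFlow()
    task_edge = {}
    for task, row in distance.items():
        mcf.add_edge("source", ("task", task), 1, 0)
        for vm, d in row.items():
            task_edge[(task, vm)] = len(mcf.edges)  # index of the forward edge
            mcf.add_edge(("task", task), ("vm", vm), 1, d)
    for vm, slots in free_slots.items():
        mcf.add_edge(("vm", vm), "sink", slots, 0)
    _, total_cost = mcf.solve("source", "sink")
    # A saturated task->vm edge (residual capacity 0) means the task was placed there.
    placement = {t: vm for (t, vm), i in task_edge.items() if mcf.edges[i][1] == 0}
    return placement, total_cost


# Hypothetical inputs: three tasks, two virtual machines.
distance = {
    "t1": {"vm1": 0, "vm2": 2},
    "t2": {"vm1": 0, "vm2": 4},
    "t3": {"vm1": 2, "vm2": 0},
}
free_slots = {"vm1": 1, "vm2": 2}
placement, total_cost = place_tasks(distance, free_slots)
```

- In this hypothetical input, two tasks have their data local to `vm1`, but `vm1` has only one free slot. The flow formulation gives that slot to the task that would be most expensive to run elsewhere, which shows how a capacity constraint and a locality preference can be reconciled in a single optimization. Further constraints, such as network link capacities, could be modeled as additional capacitated edges.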
- It will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without departing from the spirit and scope of the invention. Accordingly, the scope of protection of this invention is limited only by the following claims and their equivalents.
Claims (10)
1. A method comprising:
in a shared pool of resources, including a physical host in communication with at least one physical machine supporting one or more virtual machines, collecting virtual machine status from one or more of the virtual machines;
gathering local topology information of the shared pool of resources, including periodically communicating with an embedded monitor of the physical machine;
organizing the gathered local topology information, including a storage topology underlying a virtual topology and associated resource utilization information;
leveraging the organized topology information, including responsively assigning a job to a select virtual machine in the shared pool, including the job assignment supporting efficient performance of job associated I/O.
2. The method of claim 1, further comprising returning a virtual topological distance associated with the job, the virtual topological distance selected from the group consisting of: a distance between two virtual machines, and a distance between a virtual machine and a block of data.
3. The method of claim 2, further comprising creating a shared memory channel to support the job, the channel supporting inter-virtual machine data communication between a first virtual machine and a second virtual machine sitting on a same physical machine.
4. The method of claim 3, wherein the shared memory channel supports efficient data transfer for both the first and second virtual machines.
5. The method of claim 1, wherein the step of leveraging the organized storage topology information includes returning a physical location of one or more data blocks to support the job in the shared pool.
6. The method of claim 5, further comprising returning utilization information of a processing unit and network utilization information of the physical machine local to the virtual machine.
7. The method of claim 6, wherein the step of assigning the job to a virtual machine includes selecting a virtual machine having sufficient bandwidth and a close topological distance to the physical location of the one or more data blocks to support the job.
8. A computer implemented method comprising:
in a shared pool of resources, including a physical host in communication with at least one physical machine supporting one or more virtual machines, collecting virtual machine status from one or more of the virtual machines;
periodically gathering local topology information of a hierarchical organization of resources represented by the physical and virtual machines;
organizing the topology information;
assessing utilization of storage resources and virtual machines represented in the topology; and
assigning a job to one or more select virtual machines in the shared pool, with the job assignment supporting efficient performance responsive to the topology and resource utilization assessment.
9. The method of claim 9 placeholder
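For illustration only, the selection step recited in the claims above (a virtual machine with sufficient bandwidth and a close topological distance to the data) can be sketched in Python. This is a minimal example under stated assumptions and not the claimed method itself. The three-level distance metric (same physical machine, same rack, different rack), the `Location` and `VirtualMachine` types, the `select_vm` function, and all field names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    rack: str  # assumed rack identifier
    host: str  # assumed physical machine identifier


@dataclass
class VirtualMachine:
    name: str
    location: Location          # physical machine hosting this virtual machine
    free_bandwidth_mbps: float  # assumed measure of available network bandwidth


def topological_distance(a: Location, b: Location) -> int:
    """Assumed metric: 0 = same physical machine, 1 = same rack, 2 = otherwise."""
    if a == b:
        return 0
    if a.rack == b.rack:
        return 1
    return 2


def select_vm(vms: List[VirtualMachine],
              block_location: Location,
              required_bandwidth_mbps: float) -> Optional[VirtualMachine]:
    """Pick the closest virtual machine among those with enough free bandwidth."""
    candidates = [vm for vm in vms
                  if vm.free_bandwidth_mbps >= required_bandwidth_mbps]
    if not candidates:
        return None
    # Closest first; ties are broken in favor of more free bandwidth.
    return min(candidates,
               key=lambda vm: (topological_distance(vm.location, block_location),
                               -vm.free_bandwidth_mbps))


# Hypothetical pool: the data block lives on rack1/host1.
vms = [
    VirtualMachine("vm-a", Location("rack1", "host1"), 50.0),
    VirtualMachine("vm-b", Location("rack1", "host2"), 400.0),
    VirtualMachine("vm-c", Location("rack2", "host3"), 900.0),
]
block = Location("rack1", "host1")
chosen = select_vm(vms, block, required_bandwidth_mbps=100.0)
none_fit = select_vm(vms, block, required_bandwidth_mbps=1000.0)
```

In this hypothetical pool, `vm-a` shares a physical machine with the data block but lacks the required bandwidth, so the sketch falls back to the nearest virtual machine that meets the bandwidth requirement.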
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/590,881 US20130290957A1 (en) | 2012-04-26 | 2012-08-21 | Efficient execution of jobs in a shared pool of resources |
DE102013207603.7A DE102013207603B4 (en) | 2012-04-26 | 2013-04-25 | Run jobs efficiently in a shared pool of resources |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/457,090 US8972983B2 (en) | 2012-04-26 | 2012-04-26 | Efficient execution of jobs in a shared pool of resources |
US13/590,881 US20130290957A1 (en) | 2012-04-26 | 2012-08-21 | Efficient execution of jobs in a shared pool of resources |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/457,090 Continuation US8972983B2 (en) | 2012-04-26 | 2012-04-26 | Efficient execution of jobs in a shared pool of resources |
Publications (1)
Publication Number | Publication Date |
---|---|
US20130290957A1 (en) | 2013-10-31 |
Family
ID=49462248
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/457,090 Active 2033-01-11 US8972983B2 (en) | 2012-04-26 | 2012-04-26 | Efficient execution of jobs in a shared pool of resources |
US13/590,881 Abandoned US20130290957A1 (en) | 2012-04-26 | 2012-08-21 | Efficient execution of jobs in a shared pool of resources |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/457,090 Active 2033-01-11 US8972983B2 (en) | 2012-04-26 | 2012-04-26 | Efficient execution of jobs in a shared pool of resources |
Country Status (2)
Country | Link |
---|---|
US (2) | US8972983B2 (en) |
CN (1) | CN103377091B (en) |
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130091094A1 (en) * | 2011-10-06 | 2013-04-11 | International Business Machines Corporation | Accelerating data profiling process |
US20140068621A1 (en) * | 2012-08-30 | 2014-03-06 | Sriram Sitaraman | Dynamic storage-aware job scheduling |
US20140064066A1 (en) * | 2012-08-29 | 2014-03-06 | Nec Laboratories America, Inc. | Data Processing |
US20140282578A1 (en) * | 2013-03-14 | 2014-09-18 | Justin S. Teller | Locality aware work stealing runtime scheduler |
US20150020070A1 (en) * | 2013-07-12 | 2015-01-15 | Bluedata Software, Inc. | Accelerated data operations in virtual environments |
US20150100817A1 (en) * | 2013-10-08 | 2015-04-09 | International Business Machines Corporation | Anticipatory Protection Of Critical Jobs In A Computing System |
US20150121376A1 (en) * | 2013-10-25 | 2015-04-30 | Samsung Electronics Co., Ltd. | Managing data transfer |
CN105184452A (en) * | 2015-08-14 | 2015-12-23 | 山东大学 | MapReduce operation dependence control method suitable for power information big-data calculation |
CN105450754A (en) * | 2015-11-30 | 2016-03-30 | 国云科技股份有限公司 | Method of physical machine and virtual machine for mutually sharing cloud disk |
US9372626B2 (en) | 2014-06-12 | 2016-06-21 | Lenovo Enterprise Solutions (Singapore) Pte. Ltg. | Parallel storage system testing wherein I/O test pattern identifies one or more set of jobs to be executed concurrently |
US20160182591A1 (en) * | 2013-06-24 | 2016-06-23 | Alcatel Lucent | Automated compression of data |
US20160203024A1 (en) * | 2015-01-14 | 2016-07-14 | Electronics And Telecommunications Research Institute | Apparatus and method for allocating resources of distributed data processing system in consideration of virtualization platform |
US9462058B1 (en) * | 2015-11-19 | 2016-10-04 | International Business Machines Corporation | Data locality in data integration applications |
US9495211B1 (en) * | 2014-03-04 | 2016-11-15 | Google Inc. | Allocating computing resources based on user intent |
US9537745B1 (en) * | 2014-03-07 | 2017-01-03 | Google Inc. | Distributed virtual machine disk image deployment |
US20170031716A1 (en) * | 2014-09-30 | 2017-02-02 | International Business Machines Corporation | Merging connection pools to form a logical pool of connections during a preset period of time thereby more efficiently utilizing connections in connection pools |
US9575749B1 (en) * | 2015-12-17 | 2017-02-21 | Kersplody Corporation | Method and apparatus for execution of distributed workflow processes |
US20170222904A1 (en) * | 2016-01-29 | 2017-08-03 | AppDynamics, Inc. | Distributed Business Transaction Specific Network Data Capture |
US9762672B2 (en) | 2015-06-15 | 2017-09-12 | International Business Machines Corporation | Dynamic node group allocation |
US9817786B1 (en) * | 2015-06-26 | 2017-11-14 | Amazon Technologies, Inc. | Ingress data placement |
US9817721B1 (en) * | 2014-03-14 | 2017-11-14 | Sanmina Corporation | High availability management techniques for cluster resources |
US9846589B2 (en) | 2015-06-04 | 2017-12-19 | Cisco Technology, Inc. | Virtual machine placement optimization with generalized organizational scenarios |
US20180024863A1 (en) * | 2016-03-31 | 2018-01-25 | Huawei Technologies Co., Ltd. | Task Scheduling and Resource Provisioning System and Method |
CN107862064A (en) * | 2017-11-16 | 2018-03-30 | 北京航空航天大学 | One high-performance based on NVM, expansible lightweight file system |
US10225148B2 (en) | 2015-09-23 | 2019-03-05 | International Business Machines Corporation | Social network of virtual machines |
CN109656675A (en) * | 2017-10-11 | 2019-04-19 | 阿里巴巴集团控股有限公司 | Bus apparatus, computer equipment and the method for realizing physical host cloud storage |
US10318333B2 (en) * | 2017-06-28 | 2019-06-11 | Sap Se | Optimizing allocation of virtual machines in cloud computing environment |
US10375161B1 (en) * | 2014-04-09 | 2019-08-06 | VCE IP Holding Company LLC | Distributed computing task management system and method |
CN110196750A (en) * | 2018-02-26 | 2019-09-03 | 华为技术有限公司 | A kind of distribution method and its relevant device of equipment |
US20190310893A1 (en) * | 2018-04-05 | 2019-10-10 | International Business Machines Corporation | Workload management with data access awareness in a computing cluster |
US10523533B2 (en) | 2016-06-21 | 2019-12-31 | International Business Machines Corporation | Cloud network assessment based on scoring virtual network performance relative to underlying network performance |
US10599436B2 (en) | 2015-12-31 | 2020-03-24 | Huawei Technologies Co., Ltd. | Data processing method and apparatus, and system |
US10761891B2 (en) | 2018-04-05 | 2020-09-01 | International Business Machines Corporation | Workload management with data access awareness by aggregating file locality information in a computing cluster |
US10776356B2 (en) | 2017-04-07 | 2020-09-15 | Micro Focus Llc | Assigning nodes to shards based on a flow graph model |
US10860352B2 (en) * | 2013-07-25 | 2020-12-08 | Hewlett Packard Enterprise Development Lp | Host system and method for managing data consumption rate in a virtual data processing environment |
US10915365B2 (en) | 2015-12-31 | 2021-02-09 | Huawei Technologies Co., Ltd. | Determining a quantity of remote shared partitions based on mapper and reducer nodes |
US10977091B2 (en) | 2018-04-05 | 2021-04-13 | International Business Machines Corporation | Workload management with data access awareness using an ordered list of hosts in a computing cluster |
US20230221996A1 (en) * | 2022-01-13 | 2023-07-13 | Dell Products L.P. | Consensus-based distributed scheduler |
US20230229519A1 (en) * | 2022-01-14 | 2023-07-20 | Goldman Sachs & Co. LLC | Task allocation across processing units of a distributed system |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8843935B2 (en) | 2012-05-03 | 2014-09-23 | Vmware, Inc. | Automatically changing a pre-selected datastore associated with a requested host for a virtual machine deployment based on resource availability during deployment of the virtual machine |
US8892779B2 (en) | 2012-05-10 | 2014-11-18 | International Business Machines Corporation | Virtual machine allocation at physical resources |
US9052963B2 (en) * | 2012-05-21 | 2015-06-09 | International Business Machines Corporation | Cloud computing data center machine monitor and control |
US8972986B2 (en) * | 2012-05-25 | 2015-03-03 | International Business Machines Corporation | Locality-aware resource allocation for cloud computing |
JP2014059862A (en) * | 2012-08-22 | 2014-04-03 | Canon Inc | Data flow resource allocation device and method |
US10169348B2 (en) * | 2012-08-23 | 2019-01-01 | Red Hat, Inc. | Using a file path to determine file locality for applications |
GB2507338A (en) * | 2012-10-26 | 2014-04-30 | Ibm | Determining system topology graph changes in a distributed computing system |
US9323584B2 (en) * | 2013-09-06 | 2016-04-26 | Seagate Technology Llc | Load adaptive data recovery pipeline |
US9280422B2 (en) | 2013-09-06 | 2016-03-08 | Seagate Technology Llc | Dynamic distribution of code words among multiple decoders |
JP6191361B2 (en) * | 2013-09-25 | 2017-09-06 | 富士通株式会社 | Information processing system, information processing system control method, and control program |
US10193963B2 (en) * | 2013-10-24 | 2019-01-29 | Vmware, Inc. | Container virtual machines for hadoop |
US9755987B2 (en) * | 2014-02-05 | 2017-09-05 | Futurewei Technologies, Inc. | Virtual resource mapping mechanisms |
US9612878B2 (en) | 2014-03-31 | 2017-04-04 | International Business Machines Corporation | Resource allocation in job scheduling environment |
US9832168B2 (en) * | 2014-07-01 | 2017-11-28 | Cable Television Laboratories, Inc. | Service discovery within multi-link networks |
US9367344B2 (en) * | 2014-10-08 | 2016-06-14 | Cisco Technology, Inc. | Optimized assignments and/or generation virtual machine for reducer tasks |
US9411628B2 (en) * | 2014-11-13 | 2016-08-09 | Microsoft Technology Licensing, Llc | Virtual machine cluster backup in a multi-node environment |
TWI574158B (en) * | 2014-12-01 | 2017-03-11 | 旺宏電子股份有限公司 | Data processing method and system with application-level information awareness |
US9430290B1 (en) | 2015-03-31 | 2016-08-30 | International Business Machines Corporation | Determining storage tiers for placement of data sets during execution of tasks in a workflow |
CN106294440B (en) * | 2015-05-27 | 2019-06-07 | 阿里巴巴集团控股有限公司 | The method and apparatus of data real-time migration |
US10848574B2 (en) | 2015-06-11 | 2020-11-24 | Microsoft Technology Licensing, Llc | Computing resource management system |
CN105138276B (en) * | 2015-07-14 | 2018-05-18 | 苏州科达科技股份有限公司 | Data storage method and data storage system |
US10171300B2 (en) | 2015-11-02 | 2019-01-01 | International Business Machines Corporation | Automatic redistribution of virtual machines as a growing neural gas |
US10129169B2 (en) * | 2016-04-07 | 2018-11-13 | International Business Machines Corporation | Specifying a highly-resilient system in a disaggregated compute environment |
US10505832B2 (en) * | 2017-05-10 | 2019-12-10 | Sap Se | Resource coordinate system for data centers |
CN107479979B (en) * | 2017-08-31 | 2020-07-28 | 安徽江淮汽车集团股份有限公司 | CPU load rate optimization method and system of gearbox control unit |
CN111182239B (en) * | 2020-01-12 | 2021-07-06 | 苏州浪潮智能科技有限公司 | AI video processing method and device |
JP2021157339A (en) * | 2020-03-25 | 2021-10-07 | 富士通株式会社 | Information processing method, and information processing program |
CN113918644B (en) * | 2020-07-07 | 2024-09-06 | 华为技术有限公司 | Method and related device for managing data of application program |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050120160A1 (en) * | 2003-08-20 | 2005-06-02 | Jerry Plouffe | System and method for managing virtual servers |
US20080082977A1 (en) * | 2006-09-29 | 2008-04-03 | Microsoft Corporation | Automatic load and balancing for virtual machines to meet resource requirements |
US20090300605A1 (en) * | 2004-10-29 | 2009-12-03 | Hewlett-Packard Development Company, L.P. | Virtual computing infrastructure |
US7784053B2 (en) * | 2003-04-29 | 2010-08-24 | International Business Machines Corporation | Management of virtual machines to utilize shared resources |
US7865582B2 (en) * | 2004-03-24 | 2011-01-04 | Hewlett-Packard Development Company, L.P. | System and method for assigning an application component to a computing resource |
US20110029675A1 (en) * | 2009-07-31 | 2011-02-03 | Wai-Leong Yeow | Resource allocation protocol for a virtualized infrastructure with reliability guarantees |
US8145760B2 (en) * | 2006-07-24 | 2012-03-27 | Northwestern University | Methods and systems for automatic inference and adaptation of virtualized computing environments |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7558217B2 (en) | 2003-08-15 | 2009-07-07 | Hewlett-Packard Development Company, L.P. | Method and system for initializing host location information across smart bridge topology changes |
US8429630B2 (en) | 2005-09-15 | 2013-04-23 | Ca, Inc. | Globally distributed utility computing cloud |
US7921416B2 (en) | 2006-10-20 | 2011-04-05 | Yahoo! Inc. | Formal language and translator for parallel processing of data |
WO2009059377A1 (en) | 2007-11-09 | 2009-05-14 | Manjrosoft Pty Ltd | Software platform and system for grid computing |
US8370833B2 (en) * | 2008-02-20 | 2013-02-05 | Hewlett-Packard Development Company, L.P. | Method and system for implementing a virtual storage pool in a virtual environment |
US8874694B2 (en) | 2009-08-18 | 2014-10-28 | Facebook, Inc. | Adaptive packaging of network resources |
US8332862B2 (en) | 2009-09-16 | 2012-12-11 | Microsoft Corporation | Scheduling ready tasks by generating network flow graph using information receive from root task having affinities between ready task and computers for execution |
KR101285078B1 (en) | 2009-12-17 | 2013-07-17 | 한국전자통신연구원 | Distributed parallel processing system and method based on incremental MapReduce on data stream |
US8539192B2 (en) | 2010-01-08 | 2013-09-17 | International Business Machines Corporation | Execution of dataflow jobs |
- 2012-04-26: US application US13/457,090, patent US8972983B2 (active)
- 2012-08-21: US application US13/590,881, publication US20130290957A1 (abandoned)
- 2013-04-25: CN application CN201310147025.XA, patent CN103377091B (active)
Cited By (61)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8719271B2 (en) * | 2011-10-06 | 2014-05-06 | International Business Machines Corporation | Accelerating data profiling process |
US20130091094A1 (en) * | 2011-10-06 | 2013-04-11 | International Business Machines Corporation | Accelerating data profiling process |
US9143452B2 (en) * | 2012-08-29 | 2015-09-22 | Nec Laboratories America, Inc. | Data processing |
US20140064066A1 (en) * | 2012-08-29 | 2014-03-06 | Nec Laboratories America, Inc. | Data Processing |
US20140068621A1 (en) * | 2012-08-30 | 2014-03-06 | Sriram Sitaraman | Dynamic storage-aware job scheduling |
US20140282578A1 (en) * | 2013-03-14 | 2014-09-18 | Justin S. Teller | Locality aware work stealing runtime scheduler |
US10536501B2 (en) * | 2013-06-24 | 2020-01-14 | Alcatel Lucent | Automated compression of data |
US20160182591A1 (en) * | 2013-06-24 | 2016-06-23 | Alcatel Lucent | Automated compression of data |
US20150020070A1 (en) * | 2013-07-12 | 2015-01-15 | Bluedata Software, Inc. | Accelerated data operations in virtual environments |
US10055254B2 (en) * | 2013-07-12 | 2018-08-21 | Bluedata Software, Inc. | Accelerated data operations in virtual environments |
US10740148B2 (en) | 2013-07-12 | 2020-08-11 | Hewlett Packard Enterprise Development Lp | Accelerated data operations in virtual environments |
US10860352B2 (en) * | 2013-07-25 | 2020-12-08 | Hewlett Packard Enterprise Development Lp | Host system and method for managing data consumption rate in a virtual data processing environment |
US20150100817A1 (en) * | 2013-10-08 | 2015-04-09 | International Business Machines Corporation | Anticipatory Protection Of Critical Jobs In A Computing System |
US20150100816A1 (en) * | 2013-10-08 | 2015-04-09 | International Business Machines Corporation | Anticipatory protection of critical jobs in a computing system |
US9411666B2 (en) * | 2013-10-08 | 2016-08-09 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Anticipatory protection of critical jobs in a computing system |
US9430306B2 (en) * | 2013-10-08 | 2016-08-30 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Anticipatory protection of critical jobs in a computing system |
US20150121376A1 (en) * | 2013-10-25 | 2015-04-30 | Samsung Electronics Co., Ltd. | Managing data transfer |
US9495211B1 (en) * | 2014-03-04 | 2016-11-15 | Google Inc. | Allocating computing resources based on user intent |
US11086676B1 (en) | 2014-03-04 | 2021-08-10 | Google Llc | Allocating computing resources based on user intent |
US10310898B1 (en) | 2014-03-04 | 2019-06-04 | Google Llc | Allocating computing resources based on user intent |
US11847494B2 (en) | 2014-03-04 | 2023-12-19 | Google Llc | Allocating computing resources based on user intent |
US9537745B1 (en) * | 2014-03-07 | 2017-01-03 | Google Inc. | Distributed virtual machine disk image deployment |
US10509664B1 (en) | 2014-03-07 | 2019-12-17 | Google Llc | Distributed virtual machine disk image deployment |
US9817721B1 (en) * | 2014-03-14 | 2017-11-14 | Sanmina Corporation | High availability management techniques for cluster resources |
US10375161B1 (en) * | 2014-04-09 | 2019-08-06 | VCE IP Holding Company LLC | Distributed computing task management system and method |
US9372626B2 (en) | 2014-06-12 | 2016-06-21 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Parallel storage system testing wherein I/O test pattern identifies one or more set of jobs to be executed concurrently |
US10740147B2 (en) | 2014-09-30 | 2020-08-11 | International Business Machines Corporation | Merging connection pools to form a logical pool of connections during a preset period of time thereby more efficiently utilizing connections in connection pools |
US11429443B2 (en) | 2014-09-30 | 2022-08-30 | International Business Machines Corporation | Merging connection pools to form a logical pool of connections during a preset period of time thereby more efficiently utilizing connections in connection pools |
US20170031716A1 (en) * | 2014-09-30 | 2017-02-02 | International Business Machines Corporation | Merging connection pools to form a logical pool of connections during a preset period of time thereby more efficiently utilizing connections in connection pools |
US10268516B2 (en) * | 2014-09-30 | 2019-04-23 | International Business Machines Corporation | Merging connection pools to form a logical pool of connections during a preset period of time thereby more efficiently utilizing connections in connection pools |
US20160203024A1 (en) * | 2015-01-14 | 2016-07-14 | Electronics And Telecommunications Research Institute | Apparatus and method for allocating resources of distributed data processing system in consideration of virtualization platform |
US9846589B2 (en) | 2015-06-04 | 2017-12-19 | Cisco Technology, Inc. | Virtual machine placement optimization with generalized organizational scenarios |
US9762672B2 (en) | 2015-06-15 | 2017-09-12 | International Business Machines Corporation | Dynamic node group allocation |
US10915486B1 (en) | 2015-06-26 | 2021-02-09 | Amazon Technologies, Inc. | Ingress data placement |
US9817786B1 (en) * | 2015-06-26 | 2017-11-14 | Amazon Technologies, Inc. | Ingress data placement |
CN105184452A (en) * | 2015-08-14 | 2015-12-23 | 山东大学 | MapReduce operation dependence control method suitable for power information big-data calculation |
US10225148B2 (en) | 2015-09-23 | 2019-03-05 | International Business Machines Corporation | Social network of virtual machines |
US9607062B1 (en) * | 2015-11-19 | 2017-03-28 | International Business Machines Corporation | Data locality in data integration applications |
US9462058B1 (en) * | 2015-11-19 | 2016-10-04 | International Business Machines Corporation | Data locality in data integration applications |
CN105450754A (en) * | 2015-11-30 | 2016-03-30 | 国云科技股份有限公司 | Method of physical machine and virtual machine for mutually sharing cloud disk |
US9575749B1 (en) * | 2015-12-17 | 2017-02-21 | Kersplody Corporation | Method and apparatus for execution of distributed workflow processes |
WO2017106718A1 (en) * | 2015-12-17 | 2017-06-22 | Kersplody Corporation | Method and apparatus for execution of distributed workflow processes |
US10360024B2 (en) | 2015-12-17 | 2019-07-23 | Kersplody Corporation | Method and apparatus for execution of distributed workflow processes |
US10599436B2 (en) | 2015-12-31 | 2020-03-24 | Huawei Technologies Co., Ltd. | Data processing method and apparatus, and system |
US10915365B2 (en) | 2015-12-31 | 2021-02-09 | Huawei Technologies Co., Ltd. | Determining a quantity of remote shared partitions based on mapper and reducer nodes |
US20170222904A1 (en) * | 2016-01-29 | 2017-08-03 | AppDynamics, Inc. | Distributed Business Transaction Specific Network Data Capture |
US20180024863A1 (en) * | 2016-03-31 | 2018-01-25 | Huawei Technologies Co., Ltd. | Task Scheduling and Resource Provisioning System and Method |
US10523533B2 (en) | 2016-06-21 | 2019-12-31 | International Business Machines Corporation | Cloud network assessment based on scoring virtual network performance relative to underlying network performance |
US10776356B2 (en) | 2017-04-07 | 2020-09-15 | Micro Focus Llc | Assigning nodes to shards based on a flow graph model |
US10318333B2 (en) * | 2017-06-28 | 2019-06-11 | Sap Se | Optimizing allocation of virtual machines in cloud computing environment |
CN109656675A (en) * | 2017-10-11 | 2019-04-19 | 阿里巴巴集团控股有限公司 | Bus apparatus, computer equipment and the method for realizing physical host cloud storage |
CN107862064A (en) * | 2017-11-16 | 2018-03-30 | 北京航空航天大学 | One high-performance based on NVM, expansible lightweight file system |
CN110196750A (en) * | 2018-02-26 | 2019-09-03 | 华为技术有限公司 | A kind of distribution method and its relevant device of equipment |
CN112005219A (en) * | 2018-04-05 | 2020-11-27 | 国际商业机器公司 | Workload management with data access awareness in a compute cluster |
US20190310893A1 (en) * | 2018-04-05 | 2019-10-10 | International Business Machines Corporation | Workload management with data access awareness in a computing cluster |
US10768998B2 (en) * | 2018-04-05 | 2020-09-08 | International Business Machines Corporation | Workload management with data access awareness in a computing cluster |
US10761891B2 (en) | 2018-04-05 | 2020-09-01 | International Business Machines Corporation | Workload management with data access awareness by aggregating file locality information in a computing cluster |
US10977091B2 (en) | 2018-04-05 | 2021-04-13 | International Business Machines Corporation | Workload management with data access awareness using an ordered list of hosts in a computing cluster |
DE112019000421B4 (en) | 2018-04-05 | 2023-03-09 | International Business Machines Corporation | WORKLOAD MANAGEMENT WITH DATA ACCESS DISCOVERY IN A COMPUTING CLUSTER |
US20230221996A1 (en) * | 2022-01-13 | 2023-07-13 | Dell Products L.P. | Consensus-based distributed scheduler |
US20230229519A1 (en) * | 2022-01-14 | 2023-07-20 | Goldman Sachs & Co. LLC | Task allocation across processing units of a distributed system |
Also Published As
Publication number | Publication date |
---|---|
CN103377091B (en) | 2016-10-05 |
CN103377091A (en) | 2013-10-30 |
US8972983B2 (en) | 2015-03-03 |
US20130290953A1 (en) | 2013-10-31 |
Similar Documents
Publication | Title |
---|---|
US8972983B2 (en) | Efficient execution of jobs in a shared pool of resources |
US10969967B2 (en) | Allocation and balancing of storage resources based on anticipated workload levels | |
CA2978889C (en) | Opportunistic resource migration to optimize resource placement | |
US9229764B2 (en) | Estimating migration costs for migrating logical partitions within a virtualized computing environment based on a migration cost history | |
US8694996B2 (en) | Application initiated negotiations for resources meeting a performance parameter in a virtualized computing environment | |
US9218196B2 (en) | Performing pre-stage replication of data associated with virtual machines prior to migration of virtual machines based on resource usage | |
US10616134B1 (en) | Prioritizing resource hosts for resource placement | |
US9432300B2 (en) | Allocation of storage resources in a networked computing environment based on energy utilization | |
US9229777B2 (en) | Dynamically relocating workloads in a networked computing environment | |
CN109684074A (en) | Physical machine resource allocation methods and terminal device | |
US20170371707A1 (en) | Data analysis in storage system | |
Li et al. | DAG scheduling in mobile edge computing | |
US11336519B1 (en) | Evaluating placement configurations for distributed resource placement | |
US11080092B1 (en) | Correlated volume placement in a distributed block storage service | |
US10721181B1 (en) | Network locality-based throttling for automated resource migration | |
US11381468B1 (en) | Identifying correlated resource behaviors for resource allocation | |
WO2023056780A1 (en) | Storage system workload scheduling for deduplication | |
US11048554B1 (en) | Correlated volume placement in a distributed block storage service |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |