Jassas et al., 2018 - Google Patents

Failure analysis and characterization of scheduling jobs in google cluster trace

Jassas et al., 2018

Document ID: 14326791430182861820
Author: Jassas M; Mahmoud Q
Publication year: 2018
Publication venue: IECON 2018-44th Annual Conference of the IEEE Industrial Electronics Society

External Links

Cited by

Snippet

Most public and private cloud providers have experienced failure in one of their services that may affect numerous applications and websites. Thus, in order to understand the causes of different types of failures and remediate the issue, failure analysis is one of the most critical …

Continue reading at www.researchgate.net (PDF) (other versions)

238000004458 analytical method 0 title abstract description 38

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3495—Performance evaluation by tracing or monitoring for systems
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3442—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for planning or managing the needed capacity
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Programme initiating; Programme switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0775—Content or structure details of the error report, e.g. specific table structure, specific error fields
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/86—Event-based monitoring
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06Q—DATA PROCESSING SYSTEMS OR METHODS, SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management, e.g. organising, planning, scheduling or allocating time, human or machine resources; Enterprise planning; Organisational models
- G06Q10/063—Operations research or analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/008—Reliability or availability analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/875—Monitoring of systems including the internet

Similar Documents

Publication	Publication Date	Title
Jassas et al.	2018	Failure analysis and characterization of scheduling jobs in google cluster trace
Nie et al.	2018	Machine learning models for GPU error prediction in a large scale HPC system
Chen et al.	2014	Failure analysis of jobs in compute clouds: A google cluster case study
Birke et al.	2014	Failure analysis of virtual and physical machines: patterns, causes and characteristics
Kavulya et al.	2010	An analysis of traces from a production mapreduce cluster
US10083094B1 (en)	2018-09-25	Objective based backup job scheduling
Garraghan et al.	2014	An analysis of failure-related energy waste in a large-scale cloud environment
Oppenheimer et al.	2006	Service Placement in a Shared Wide-Area Platform.
US20120053925A1 (en)	2012-03-01	Method and System for Computer Power and Resource Consumption Modeling
CN111580934B (en)	2024-08-09	Resource allocation method with consistent performance of multi-tenant virtual machine in cloud computing environment
Vazquez	2015	Time series forecasting of cloud data center workloads for dynamic resource provisioning
US9870269B1 (en)	2018-01-16	Job allocation in a clustered environment
Patel et al.	2015	Workload estimation for improving resource management decisions in the cloud
Domingues et al.	2005	Resource usage of windows computer laboratories
Kim et al.	2020	Towards hpc i/o performance prediction through large-scale log analysis
EP2742662A2 (en)	2014-06-18	Application performance analysis that is adaptive to business activity patterns
Wolski et al.	2013	Using parametric models to represent private cloud workloads
US20220272151A1 (en)	2022-08-25	Server-side resource monitoring in a distributed data storage environment
Ai et al.	2016	On elasticity measurement in cloud computing
US20200097347A1 (en)	2020-03-26	Preemptive deep diagnostics and health checking of resources in disaggregated data centers
US8954563B2 (en)	2015-02-10	Event enrichment using data correlation
Rood et al.	2009	Grid resource availability prediction-based scheduling and task replication
Roy et al.	2013	Reducing service failures by failure and workload aware load balancing in saas clouds
Jacquet et al.	2024	ScroogeVM: Boosting Cloud Resource Utilization with Dynamic Oversubscription
Cirstoiu et al.	2007	Monitoring, accounting and automated decision support for the alice experiment based on the MonALISA framework