
CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs

Published: 26 June 2017
DOI: 10.1145/3078597.3078602

Abstract

Stochastic gradient descent (SGD) is widely used by many machine learning algorithms. It is efficient for big data applications due to its low algorithmic complexity. However, SGD is inherently serial, and parallelizing it efficiently on many-core architectures such as GPUs is a significant challenge. In this paper, we present cuMF_SGD, a parallelized SGD solution for matrix factorization on GPUs. We first design high-performance GPU computation kernels that accelerate individual SGD updates by exploiting model parallelism. We then design efficient schemes that parallelize SGD updates by exploiting data parallelism. Finally, we scale cuMF_SGD to large data sets that cannot fit into one GPU's memory. Evaluations on three public data sets show that cuMF_SGD outperforms existing solutions, including a 64-node CPU system, by a large margin using only one GPU card.
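
For concreteness, the sketch below illustrates the two forms of parallelism the abstract names, but it is a minimal illustration, not cuMF_SGD's actual kernel. It assumes a factor dimension K = 32 so that each of a warp's 32 lanes owns one component of the user factor p_u and item factor q_v; the warp cooperatively computes the dot product via register shuffles (model parallelism), while many warps apply lock-free, Hogwild!-style updates concurrently (data parallelism). All identifiers here (sgd_update_k32, Rating) are illustrative assumptions; the real kernels differ in details such as factor size, precision, and update scheduling.

```cuda
#include <cuda_runtime.h>

// One observed entry (u, v, r_uv) of the sparse rating matrix.
struct Rating { int u; int v; float r; };

// Warp-per-rating SGD update, assuming factor dimension K = 32 so that
// each lane owns one component of p_u and q_v. Illustrative sketch only.
__global__ void sgd_update_k32(const Rating* __restrict__ ratings, int n,
                               float* P,          // |users| x 32, row-major
                               float* Q,          // |items| x 32, row-major
                               float lr, float lambda)
{
    const int lane = threadIdx.x & 31;                        // lane id within warp
    const int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    if (warp >= n) return;

    const Rating s = ratings[warp];
    float p = P[s.u * 32 + lane];                             // this lane's component
    float q = Q[s.v * 32 + lane];

    // Model parallelism: the 32 lanes reduce p_u . q_v cooperatively with
    // warp shuffles, avoiding shared memory entirely.
    float dot = p * q;
    #pragma unroll
    for (int off = 16; off > 0; off >>= 1)
        dot += __shfl_down_sync(0xffffffffu, dot, off);
    dot = __shfl_sync(0xffffffffu, dot, 0);                   // broadcast from lane 0

    const float err = s.r - dot;

    // Plain SGD step with L2 regularization. Writes are lock-free
    // (Hogwild!-style): concurrent warps may race on hot rows, which
    // SGD tolerates in practice (data parallelism across updates).
    P[s.u * 32 + lane] = p + lr * (err * q - lambda * p);
    Q[s.v * 32 + lane] = q + lr * (err * p - lambda * q);
}
```

A host would launch one warp per rating, e.g. sgd_update_k32<<<(n * 32 + 255) / 256, 256>>>(ratings, n, P, Q, lr, lambda), typically over shuffled ratings for several epochs with a decaying learning rate.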

Published In

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
June 2017
254 pages
ISBN: 9781450346993
DOI: 10.1145/3078597

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpgpu
  2. matrix factorization
  3. parallel computing

Qualifiers

  • Research-article

Conference

HPDC '17

Acceptance Rates

HPDC '17 Paper Acceptance Rate: 19 of 100 submissions, 19%
Overall Acceptance Rate: 166 of 966 submissions, 17%
