Big Sequence Management: A glimpse of the Past, the Present, and the Future

Author:

Themis Palpanas

Proceedings of the 42nd International Conference on SOFSEM 2016: Theory and Practice of Computer Science - Volume 9587

Pages 63 - 80

https://doi.org/10.1007/978-3-662-49192-8_6

Published: 23 January 2016

Abstract

Applications in diverse domains increasingly need techniques that can index and mine very large collections of sequences, or data series. Examples of such applications come from biology, astronomy, entomology, the web, and other domains. It is not unusual for these applications to involve data series numbering in the hundreds of millions to billions, which are often not analyzed in full detail because of their sheer size. In this work, we describe recent efforts in designing techniques for indexing and mining truly massive collections of data series that will enable scientists to easily analyze their data. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce solutions to this problem. Furthermore, we discuss novel techniques that adaptively create data series indexes, allowing users to correctly answer queries before the indexing task is finished. We also show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments using one billion data series. Finally, we present our vision for the future in big sequence management research.

Index Terms
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Search engine architectures and scalability
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Published In

Proceedings of the 42nd International Conference on SOFSEM 2016: Theory and Practice of Computer Science - Volume 9587

January 2016

613 pages

ISBN:9783662491911

Editors:
Rūsiņš Mārtiņš Freivalds, University of Latvia, Riga, Latvia
Gregor Engels, University of Paderborn, Paderborn, Germany
Barbara Catania, University of Genoa, Genoa, Italy

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

Wang Z, Wang Q, Wang P, Palpanas T, Wang W (2023) Dumpy: A Compact and Adaptive Index for Large Data Series Collections. Proceedings of the ACM on Management of Data 1:1 (1-27). DOI: 10.1145/3588965. Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588965

Liakos P, Papakonstantinopoulou K, Kotidis Y (2022) Chimp. Proceedings of the VLDB Endowment 15:11 (3058-3070). DOI: 10.14778/3551793.3551852. Online publication date: 1-Jul-2022
https://dl.acm.org/doi/10.14778/3551793.3551852

Echihabi K, Fatourou P, Zoumpatianos K, Palpanas T, Benbrahim H (2022) Hercules against data series similarity search. Proceedings of the VLDB Endowment 15:10 (2005-2018). DOI: 10.14778/3547305.3547308. Online publication date: 7-Sep-2022
https://dl.acm.org/doi/10.14778/3547305.3547308

Paparrizos J, Liu C, Elmore A, Franklin M, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020) Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1887-1905). DOI: 10.1145/3318464.3389760. Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3389760

Linardi M, Palpanas T (2019) Scalable, variable-length similarity search in data series. Proceedings of the VLDB Endowment 11:13 (2236-2248). DOI: 10.14778/3275366.3284968. Online publication date: 17-Jan-2019
https://dl.acm.org/doi/10.14778/3275366.3284968

Palpanas T, Beckmann V (2019) Report on the First and Second Interdisciplinary Time Series Analysis Workshop (ITISA). ACM SIGMOD Record 48:3 (36-40). DOI: 10.1145/3377391.3377400. Online publication date: 20-Dec-2019
https://dl.acm.org/doi/10.1145/3377391.3377400

Chatzigeorgakidis G, Skoutas D, Patroumpas K, Palpanas T, Athanasiou S, Skiadopoulos S (2019) Local Pair and Bundle Discovery over Co-Evolving Time Series. Proceedings of the 16th International Symposium on Spatial and Temporal Databases (160-169). DOI: 10.1145/3340964.3340982. Online publication date: 19-Aug-2019
https://dl.acm.org/doi/10.1145/3340964.3340982

Kondylakis H, Dayan N, Zoumpatianos K, Palpanas T, Boncz P, Manegold S, Ailamaki A, Deshpande A, Kraska T (2019) Coconut Palm. Proceedings of the 2019 International Conference on Management of Data (1941-1944). DOI: 10.1145/3299869.3320233. Online publication date: 25-Jun-2019
https://dl.acm.org/doi/10.1145/3299869.3320233

Kondylakis H, Dayan N, Zoumpatianos K, Palpanas T (2018) Coconut. Proceedings of the VLDB Endowment 11:6 (677-690). DOI: 10.5555/3199517.3199519. Online publication date: 1-Feb-2018
https://dl.acm.org/doi/10.5555/3199517.3199519

Echihabi K, Zoumpatianos K, Palpanas T, Benbrahim H (2018) The lernaean hydra of data series similarity search. Proceedings of the VLDB Endowment 12:2 (112-127). DOI: 10.14778/3282495.3282498. Online publication date: 1-Oct-2018
https://dl.acm.org/doi/10.14778/3282495.3282498
