research-article

SpatialSSJP: QoS-Aware Adaptive Approximate Stream-Static Spatial Join Processor

Published: 06 November 2023

Abstract

The widespread adoption of the Internet of Things (IoT) has motivated the emergence of mixed workloads in smart cities, where fast-arriving geo-referenced big data streams are joined with archive tables to enrich the streams with descriptive attributes that enable insightful analytics. Applications now rely on finding, in real time, the geographical regions to which streaming tuples belong. This problem requires a computationally intensive stream-static join that joins a dynamic stream with a disk-resident static table. In addition, the arrival rate of online geospatial data fluctuates over time, which calls for an approximate solution that can trade off QoS constraints against one another while ensuring that the system survives sudden spikes in data load. In this paper, we present SpatialSSJP, an adaptive spatial-aware approximate query processing system that focuses specifically on stream-static joins. It guarantees that an agreed set of Quality-of-Service goals is met and maintains geo-statistics of stateful online aggregations over stream-static join results. SpatialSSJP employs a state-of-the-art stratified-like sampling design to select well-balanced, representative samples of the geospatial data stream and serve them to a downstream stream-static geospatial join operator. We implemented a prototype atop Spark Structured Streaming. Our extensive evaluations on large real datasets show that our system can survive and mitigate harsh join workloads and outperform state-of-the-art baselines by significant margins, without violating rigorous error bounds on the accuracy of the output results. Relative to plain Spark joins, SpatialSSJP achieves an accuracy gain of approximately 10% in the worst case and up to 50% in the best case.
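The pipeline the abstract describes (stratify the stream spatially, sample each stratum, then join the sample with a static region table and aggregate) can be illustrated with a minimal, framework-free sketch. This is not the SpatialSSJP implementation: the grid-cell function `cell_of`, the dictionary `cell_to_region` standing in for the disk-resident static table, the per-stratum sampling fraction, and all function names are assumptions made for illustration, and a plain Python list stands in for a Spark micro-batch.

```python
import random
from collections import defaultdict


def cell_of(lon, lat, cell_deg=0.01):
    """Map a point to a coarse grid cell (hypothetical stand-in for a spatial index key)."""
    return (int(lon // cell_deg), int(lat // cell_deg))


def stratified_sample(batch, fraction, rng):
    """Draw about `fraction` of the tuples from every grid cell (stratum).

    Each sampled tuple carries a weight N_h / n_h, where N_h is the stratum
    size and n_h the number of tuples drawn from it.
    """
    strata = defaultdict(list)
    for tup in batch:
        strata[cell_of(tup["lon"], tup["lat"])].append(tup)

    sample = []
    for tuples in strata.values():
        k = max(1, round(fraction * len(tuples)))  # at least one tuple per stratum
        weight = len(tuples) / k
        for tup in rng.sample(tuples, k):
            sample.append({**tup, "weight": weight})
    return sample


def join_and_count(sample, cell_to_region):
    """Join sampled tuples with the static lookup and estimate per-region counts."""
    counts = defaultdict(float)
    for tup in sample:
        region = cell_to_region.get(cell_of(tup["lon"], tup["lat"]))
        if region is not None:  # inner join: drop tuples outside known regions
            counts[region] += tup["weight"]
    return dict(counts)


# Hypothetical micro-batch: 100 points in one cell, 40 in a neighbouring cell.
batch = [{"lon": 0.005, "lat": 0.005} for _ in range(100)]
batch += [{"lon": 0.015, "lat": 0.005} for _ in range(40)]
cell_to_region = {(0, 0): "A", (1, 0): "B"}  # assumed static table

sample = stratified_sample(batch, fraction=0.1, rng=random.Random(7))
estimates = join_and_count(sample, cell_to_region)
```

In this sketch only the sampled tuples reach the join, so the join cost shrinks with the sampling fraction. Because every stratum is represented and weighted by N_h / n_h, per-region count estimates stay exact when regions are unions of whole cells; other aggregates, such as averages of a measured attribute, would be approximate and need an error bound.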


Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 35, Issue 1
Jan. 2024
202 pages

Publisher

IEEE Press
