Scalable Data Validation Framework in Big Data Pipelines: A Python-Driven Approach for Data Integrity and Performance Optimization
Keywords:
scalable data validation, big data pipelines, Python-driven approach, schema validation, Apache Spark
Abstract
Ensuring data integrity in large-scale distributed processing systems is one of the most critical challenges in modern big data pipelines. Existing validation methodologies usually lack scalability, which leads to performance bottlenecks and inconsistencies across heterogeneous data sources. This study introduces a Python-driven, scalable data validation framework that integrates seamlessly with Apache Spark pipelines to support real-time, high-throughput validation while preserving computational efficiency.