Watershed-ng: an extensible distributed stream processing framework.
Date
2016
Abstract
Most high-performance data processing (a.k.a. big data) systems allow users to express their computation
using abstractions (like MapReduce), which simplify the extraction of parallelism from applications. Most
frameworks, however, do not allow users to specify how communication must take place: that element is
deeply embedded into the run-time system abstractions, making changes hard to implement. In this work, we
describe Watershed-ng, our re-engineering of the Watershed system, a framework based on the filter–stream
paradigm and originally focused on continuous stream processing. Like other big-data environments,
Watershed provided object-oriented abstractions to express computation (filters), but the implementation
of streams was a run-time system element. Isolating stream functionality into appropriate classes makes it
possible to combine communication patterns and to reuse common message handling functions (like compression
and blocking). The new architecture even allows the design of new communication patterns;
for example, users can choose MPI, TCP, or shared memory implementations of communication
channels as their problem demands. Applications designed for the new interface showed reductions in
code size of 50% or more in some cases. The performance results also showed significant
improvements, because some implementation bottlenecks were removed in the re-engineering process.
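
The idea of isolating stream functionality into classes can be illustrated with a minimal sketch. This is not the Watershed-ng API: every class and method name below (Sender, InMemoryChannel, Compressing, Blocking, UppercaseFilter) is hypothetical, and an in-memory list stands in for a real transport such as TCP, MPI, or shared memory. It only shows how message handling steps like compression and blocking could be written once and stacked on any channel, independently of the filter's computation.

```python
import zlib
from typing import List


class Sender:
    """Hypothetical base class: delivers one message downstream."""

    def send(self, message: bytes) -> None:
        raise NotImplementedError

    def flush(self) -> None:
        """Push out anything buffered; a no-op by default."""


class InMemoryChannel(Sender):
    """Stand-in for a transport (TCP, MPI, shared memory): records messages."""

    def __init__(self) -> None:
        self.delivered: List[bytes] = []

    def send(self, message: bytes) -> None:
        self.delivered.append(message)


class Compressing(Sender):
    """Reusable handler: compresses each message before forwarding it."""

    def __init__(self, inner: Sender) -> None:
        self.inner = inner

    def send(self, message: bytes) -> None:
        self.inner.send(zlib.compress(message))

    def flush(self) -> None:
        self.inner.flush()


class Blocking(Sender):
    """Reusable handler: groups messages into newline-joined blocks."""

    def __init__(self, inner: Sender, block_size: int) -> None:
        self.inner = inner
        self.block_size = block_size
        self.buffer: List[bytes] = []

    def send(self, message: bytes) -> None:
        self.buffer.append(message)
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.inner.send(b"\n".join(self.buffer))
            self.buffer = []
        self.inner.flush()


class UppercaseFilter:
    """Toy filter: the computation knows nothing about how output travels."""

    def __init__(self, output: Sender) -> None:
        self.output = output

    def process(self, message: bytes) -> None:
        self.output.send(message.upper())


# Compose the stream: block two messages at a time, then compress each block.
channel = InMemoryChannel()
stream = Blocking(Compressing(channel), block_size=2)
uppercase = UppercaseFilter(stream)

for word in [b"a", b"b", b"c"]:
    uppercase.process(word)
stream.flush()  # deliver the final, partially filled block

# Receiver side: undo the compression to recover the blocks.
blocks = [zlib.decompress(payload) for payload in channel.delivered]
```

Swapping InMemoryChannel for another Sender subclass, or reordering the wrappers, changes how messages travel without touching UppercaseFilter, which is the kind of separation the abstract describes.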
Keywords
Distributed systems, Watershed, Big data, Frameworks
Citation
ROCHA, R. et al. Watershed-ng: an extensible distributed stream processing framework. Concurrency and Computation, v. 28, p. 2487-2502, Jan. 2016. Available at: <https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3779>. Accessed: 3 May 2023.