Parallel processing in Neuraxle
This example demonstrates how to stream data in parallel through a Neuraxle pipeline. The steps are deliberately simple so that the effect of parallelism on the run time is easy to see.
The pipeline has two steps:
1. Preprocessing: the step that processes the data simply sleeps.
2. Model: the model simply multiplies the data by two.
The same approach can be used to transform data in parallel with scikit-learn or with any other library, such as TensorFlow.
Pipelines benchmarked:
1. We first use a classical pipeline and evaluate the time.
2. Then we use a minibatched pipeline and evaluate the time.
3. Then we use a parallel pipeline and evaluate the time.
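To make the difference between the first two pipelines concrete, here is a minimal standard-library sketch of minibatching. It is not Neuraxle code: `minibatches`, `preprocess`, `model`, and `run_minibatched` are hypothetical names used only for illustration. It shows why splitting the data into batches does not, by itself, reduce the run time: every item is still slept on, one after the other.

```python
import time


def minibatches(data, batch_size):
    """Yield consecutive slices of `data`, each holding at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def preprocess(batch, sleep_time):
    # Stand-in for the preprocessing step: wait once per item, return the batch unchanged.
    for _ in batch:
        time.sleep(sleep_time)
    return batch


def model(batch):
    # Stand-in for the model step: multiply every item by two.
    return [2 * x for x in batch]


def run_minibatched(data, batch_size, sleep_time):
    # Each batch goes through both steps before the next batch starts,
    # so the total time spent sleeping is the same as with one big batch.
    output = []
    for batch in minibatches(data, batch_size):
        output.extend(model(preprocess(batch, sleep_time)))
    return output


result = run_minibatched(list(range(10)), batch_size=5, sleep_time=0.001)
```

The last batch may be shorter than `batch_size` when the data length is not a multiple of it.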
We expect the parallel pipeline to be faster for two reasons: it runs several workers in parallel, and its queues let the model start transforming finished batches while other batches are still being preprocessed.
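The queueing idea can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not Neuraxle's implementation: it uses threads, a bounded `queue.Queue` between the two steps, and a hypothetical function `run_queued` whose parameter names merely echo the ones used in this example. Threads are enough here because `time.sleep` does not keep the CPU busy.

```python
import queue
import threading
import time


def run_queued(data, batch_size=5, n_workers=4, max_queued=10, sleep_time=0.02):
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    to_preprocess = queue.Queue()
    to_model = queue.Queue(maxsize=max_queued)  # bounded queue between the two steps
    results = [None] * len(batches)  # one slot per batch keeps the output ordered

    def preprocess_worker():
        while True:
            item = to_preprocess.get()
            if item is None:  # sentinel: no more batches to preprocess
                return
            index, batch = item
            for _ in batch:
                time.sleep(sleep_time)
            to_model.put((index, batch))  # blocks while the queue is full

    def model_worker():
        while True:
            item = to_model.get()
            if item is None:  # sentinel: no more batches to transform
                return
            index, batch = item
            results[index] = [2 * x for x in batch]

    pre_threads = [threading.Thread(target=preprocess_worker) for _ in range(n_workers)]
    model_threads = [threading.Thread(target=model_worker) for _ in range(n_workers)]
    for thread in pre_threads + model_threads:
        thread.start()

    for item in enumerate(batches):  # (index, batch) pairs
        to_preprocess.put(item)
    for _ in pre_threads:  # one sentinel per preprocessing worker
        to_preprocess.put(None)
    for thread in pre_threads:
        thread.join()
    for _ in model_threads:  # one sentinel per model worker
        to_model.put(None)
    for thread in model_threads:
        thread.join()

    return [x for batch in results for x in batch]


output = run_queued(list(range(100)), sleep_time=0.001)
```

Storing each result by batch index is one simple way to return the output in input order even though the workers finish batches in an unpredictable order.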
Out:
Classical 'Pipeline' execution time: 2.022265672683716 seconds.
Minibatched 'MiniBatchSequentialPipeline' execution time: 2.0235798358917236 seconds.
Parallel 'SequentialQueuedPipeline' execution time: 0.5125672817230225 seconds.
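These timings are close to what a back-of-the-envelope estimate gives. The sketch below is an idealized calculation, assuming that the sleeping step dominates, that the multiplication and the queueing overhead are negligible, and that the 4 workers each process one batch at a time. The variable names are chosen for illustration only.

```python
import math

n_items = 100      # items passed to transform() in this example
sleep_time = 0.02  # seconds slept per item
batch_size = 5
n_workers = 4

# Classical and minibatched pipelines: every item is slept on in sequence.
sequential_seconds = n_items * sleep_time

# Parallel pipeline: batches are spread over the workers.
n_batches = math.ceil(n_items / batch_size)
seconds_per_batch = batch_size * sleep_time
rounds = math.ceil(n_batches / n_workers)
parallel_seconds = rounds * seconds_per_batch

print(f"Estimated sequential time: {sequential_seconds:.2f} seconds.")
print(f"Estimated parallel time: {parallel_seconds:.2f} seconds.")
```

The estimates are about 2 seconds and about 0.5 seconds, against measured times of roughly 2.02 and 0.51 seconds above.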
import time
import numpy as np
from neuraxle.base import ExecutionContext as CX
from neuraxle.distributed.streaming import SequentialQueuedPipeline
from neuraxle.pipeline import BasePipeline, Pipeline, MiniBatchSequentialPipeline
from neuraxle.steps.loop import ForEach
from neuraxle.steps.misc import Sleep
from neuraxle.steps.numpy import MultiplyByN
def eval_run_time(pipeline: BasePipeline):
    pipeline.setup(CX())
    a = time.time()
    output = pipeline.transform(list(range(100)))
    b = time.time()
    seconds = b - a
    return seconds, output
def main():
    """
    The task is to sleep 0.02 seconds for each data input and then multiply by 2.
    """
    sleep_time = 0.02
    preprocessing_and_model_steps = [ForEach(Sleep(sleep_time=sleep_time)), MultiplyByN(2)]

    # Classical pipeline - all at once with one big batch:
    p = Pipeline(preprocessing_and_model_steps)
    time_vanilla_pipeline, output_classical = eval_run_time(p)
    print(f"Classical 'Pipeline' execution time: {time_vanilla_pipeline} seconds.")

    # Classical minibatch pipeline - minibatch size 5:
    p = MiniBatchSequentialPipeline(preprocessing_and_model_steps,
                                    batch_size=5)
    time_minibatch_pipeline, output_minibatch = eval_run_time(p)
    print(f"Minibatched 'MiniBatchSequentialPipeline' execution time: {time_minibatch_pipeline} seconds.")

    # Parallel pipeline - minibatch size 5 with 4 parallel workers per step that
    # have a max queue size of 10 batches between preprocessing and the model:
    p = SequentialQueuedPipeline(preprocessing_and_model_steps,
                                 n_workers_per_step=4, max_queued_minibatches=10, batch_size=5)
    time_parallel_pipeline, output_parallel = eval_run_time(p)
    print(f"Parallel 'SequentialQueuedPipeline' execution time: {time_parallel_pipeline} seconds.")

    # All three pipelines must produce the same output.
    assert np.array_equal(output_classical, output_minibatch)
    assert np.array_equal(output_classical, output_parallel)
    # The failure message reports the two times that are being compared.
    assert time_parallel_pipeline < time_minibatch_pipeline, str((time_parallel_pipeline, time_minibatch_pipeline))


if __name__ == '__main__':
    main()
Total running time of the script: (0 minutes 4.562 seconds)