neuraxle.data_container

Module-level documentation for neuraxle.data_container. Here is an inheritance diagram, including dependencies to other base modules of Neuraxle:


Neuraxle’s DataContainer classes

Classes for containing the data that flows through the pipeline steps.

Classes

DACT

alias of neuraxle.data_container.DataContainer

DataContainer(data_inputs, ids, …)

DataContainer (dact) class to store IDs (ids), data inputs (di), and expected outputs (eo) together.

ExpandedDataContainer(data_inputs, ids, …)

Subclass of DataContainer to expand the data container's dimension.

ListDataContainer(data_inputs[, ids, …])

Subclass of DataContainer to perform list operations.

StripAbsentValues

This object, when passed to the default_value_data_inputs argument of the DataContainer.minibatches method, makes the last minibatch keep only its actual values: trailing absent (None) values are stripped instead of padded, so the last batch may contain fewer than batch_size elements.

ZipDataContainer(data_inputs, ids, …)

Subclass of DataContainer to zip two data sources together.


class neuraxle.data_container.StripAbsentValues[source]

Bases: object

This object, when passed to the default_value_data_inputs argument of the DataContainer.minibatches method, makes the last minibatch keep only its actual values: trailing absent (None) values are stripped instead of padded, so the last batch may contain fewer than batch_size elements.
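To make the padding-versus-stripping distinction concrete, here is a simplified plain-Python sketch of the last-batch behavior — this is an illustration of the semantics described above, not the actual Neuraxle internals:

```python
def last_batch(data, batch_size, strip_absent=True):
    # Take the final, possibly incomplete batch.
    tail = data[-(len(data) % batch_size or batch_size):]
    if strip_absent:
        # StripAbsentValues semantics: keep the short batch as-is, no padding.
        return tail
    # Default semantics: pad with None up to batch_size.
    return tail + [None] * (batch_size - len(tail))

print(last_batch(list(range(10)), 3, strip_absent=True))   # [9]
print(last_batch(list(range(10)), 3, strip_absent=False))  # [9, None, None]
```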

class neuraxle.data_container.DataContainer(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]

Bases: typing.Generic

DataContainer (dact) class to store IDs (ids), data inputs (di), and expected outputs (eo) together. In some dacts, you could have only ids and data inputs, and in other dacts you could have only expected outputs, or you could have all if you want, such as when your Pipeline is used to train a model in a certain ExecutionMode within a certain ExecutionContext.

You can use typing for your dact, and create a dact, such as:

from typing import List
from neuraxle.data_container import DataContainer as DACT

dact: DACT[List[str], List[int], List[float]] = DACT(
    ids=['a', 'b', 'c'],
    data_inputs=[1, 2, 3],
    expected_outputs=[1.0, 2.0, 3.0]
)

This works because DataContainer inherits from the Generic type, as in class DataContainer(Generic[IDT, DIT, EOT]): ....

The DataContainer object is passed to all of the BaseStep's handler methods:
  • handle_transform()

  • handle_fit_transform()

  • handle_fit()

Most of the time, the steps will manage it in the handler methods.

__init__(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]

Create a DataContainer[IDT, DIT, EOT] object from specified ids, di, and eo.

Parameters
  • ids – ids that are iterable. If None, a range of integers of data_inputs' length is used. Often a list of integers.

  • di – same as data_inputs, but shorter.

  • eo – same as expected_outputs, but shorter.

  • data_inputs – data inputs that are iterable. Can use di instead.

  • expected_outputs – expected outputs that are iterable. If None, a list of None values of data_inputs' length is used.

  • sub_data_containers – sub data containers.

ids

Get ids.

If the ids are None, they are resolved as follows:

  • If data_inputs is a DataFrame, the DataFrame's index is returned.

  • Else, if data_inputs is iterable, a range of integers of data_inputs' length is returned.

  • Else, if data_inputs isn't iterable, a range of integers of expected_outputs' length is returned.

Returns

ids

Return type

Iterable
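The fallback chain above can be sketched in plain Python. This is a simplified re-implementation for illustration only, assuming (as pandas does) that a DataFrame exposes a non-callable `.index` attribute — it is not the actual Neuraxle code:

```python
def resolve_ids(ids, data_inputs, expected_outputs):
    if ids is not None:
        return ids
    # A pandas DataFrame exposes a non-callable `.index` property
    # (a list's `.index` is a method, so it is skipped here).
    index = getattr(data_inputs, "index", None)
    if index is not None and not callable(index):
        return list(index)
    if hasattr(data_inputs, "__len__"):
        # data_inputs is sized: a range of integers of its length.
        return list(range(len(data_inputs)))
    # data_inputs isn't iterable: fall back to expected_outputs' length.
    return list(range(len(expected_outputs)))

print(resolve_ids(None, [10, 20, 30], None))  # [0, 1, 2]
```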

di

Get data inputs.

Returns

data inputs

Return type

Iterable

eo

Get expected outputs.

If the expected outputs are None, will return a list of None of data_inputs length.

Returns

expected outputs

Return type

Iterable

sdact

Get sub data containers.

Returns

sub data containers

static from_di(data_inputs: DIT) → neuraxle.data_container.DataContainer[IDT, DIT, List[None]][source]

Create a DataContainer (dact) from data inputs (di).

static from_eo(expected_outputs: EOT) → neuraxle.data_container.DataContainer[List[None], List[None], EOT][source]

Create a DataContainer (dact) from expected outputs (eo).

without_di() → neuraxle.data_container.DataContainer[IDT, List[None], EOT][source]
without_eo() → neuraxle.data_container.DataContainer[IDT, DIT, List[None]][source]
with_di(di: DIT) → neuraxle.data_container.DataContainer[IDT, DIT, EOT][source]
with_eo(eo: EOT) → neuraxle.data_container.DataContainer[IDT, DIT, EOT][source]
set_ids(ids: IDT) → neuraxle.data_container.DataContainer[source]

Set ids.

Parameters

ids – data inputs’ ids. Often a range of integers.

Returns

self

set_data_inputs(data_inputs: DIT) → neuraxle.data_container.DataContainer[source]

Set data inputs.

Parameters

data_inputs (Iterable) – data inputs

Returns

self

set_expected_outputs(expected_outputs: EOT) → neuraxle.data_container.DataContainer[source]

Set expected outputs.

Parameters

expected_outputs (Iterable) – expected outputs

Returns

self

get_ids_summary() → str[source]
add_sub_data_container(name: str, data_container: neuraxle.data_container.DataContainer) → neuraxle.data_container.DataContainer[source]

Add a sub data container, registering it under the given name in this data container's sub data containers.

Returns

self

get_sub_data_container_names() → List[str][source]

Get sub data container names.

Returns

list of names

set_sub_data_containers(sub_data_containers: List[DACT]) → neuraxle.data_container.DataContainer[source]

Set sub data containers.

Returns

self

minibatches(batch_size: int, keep_incomplete_batch: bool = True, default_value_data_inputs=None, default_value_expected_outputs=None) → Iterable[neuraxle.data_container.DataContainer[IDT, DIT, EOT]][source]

Yields minibatches extracted by looping over the DataContainer's content with the given batch_size, with configurable behavior for the last batch when the total size is not a multiple of batch_size.

Note that the default value for IDs is None.

import numpy as np

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(batch_size=2):
    print(data_container_batch.data_inputs)
    print(data_container_batch.expected_outputs)
# [array([0, 1]), array([2, 3]), ..., array([8, 9])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(batch_size=3, keep_incomplete_batch=False):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(
    batch_size=3,
    keep_incomplete_batch=True,
    default_value_data_inputs=None,
    default_value_expected_outputs=None
):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9, None, None])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(
    batch_size=3,
    keep_incomplete_batch=True,
    default_value_data_inputs=StripAbsentValues()
):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9])]
Parameters
  • batch_size (int) – number of elements to combine into a single batch

  • keep_incomplete_batch (bool) – (Optional.) whether to keep the last batch when it has fewer than batch_size elements; the default behavior is to keep the smaller batch

  • default_value_data_inputs – default fill value for padding the data inputs of the last batch, or StripAbsentValues to trim absent values from the batch

  • default_value_expected_outputs – default fill value for padding the expected outputs of the last batch, or StripAbsentValues to trim absent values from the batch

Returns

an iterator of DataContainer
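The batch-count arithmetic implied by the keep_incomplete_batch flag (and exposed by get_n_batches) can be sketched as follows — a hedged illustration of the assumed semantics, not the actual implementation:

```python
import math

def n_batches(n_items, batch_size, keep_incomplete_batch=True):
    if keep_incomplete_batch:
        # The trailing partial batch counts as a batch: round up.
        return math.ceil(n_items / batch_size)
    # The trailing partial batch is dropped: round down.
    return n_items // batch_size

print(n_batches(10, 3, keep_incomplete_batch=True))   # 4
print(n_batches(10, 3, keep_incomplete_batch=False))  # 3
```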

get_n_batches(batch_size: int, keep_incomplete_batch: bool = True) → int[source]
copy() → neuraxle.data_container.DataContainer[IDT, DIT, EOT][source]
tolist() → neuraxle.data_container.DataContainer[List, List, List][source]
tolistshallow() → neuraxle.data_container.DataContainer[List, List, List][source]
to_numpy() → neuraxle.data_container.DataContainer[numpy.ndarray, numpy.ndarray, numpy.ndarray][source]
apply_conversion_func(conversion_function: Callable[[Any], Any]) → neuraxle.data_container.DataContainer[source]

Apply conversion function to data inputs, expected outputs, and ids, and set the new values in self. Returns self. Conversion function must be able to handle None values.
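Since ids, di, and eo may each be None, any conversion function passed here has to tolerate None. A hypothetical example of such a function (illustrative only, not part of the Neuraxle API):

```python
import numpy as np

# None-tolerant conversion function of the kind apply_conversion_func
# expects: it passes None through untouched and converts everything else.
def to_numpy_or_none(values):
    return None if values is None else np.asarray(values)

print(to_numpy_or_none([1, 2, 3]))  # [1 2 3]
print(to_numpy_or_none(None))       # None
```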

unpack() → Tuple[IDT, DIT, EOT][source]

Unpack to a tuple of (ids, data inputs, expected outputs).

Returns

tuple of ids, data inputs, expected outputs

_str_data(_idata: Union[IDT, DIT, EOT]) → str[source]
neuraxle.data_container.DACT[source]

alias of neuraxle.data_container.DataContainer

class neuraxle.data_container.ExpandedDataContainer(data_inputs, ids, expected_outputs, old_ids)[source]

Bases: neuraxle.data_container.DataContainer

Subclass of DataContainer to expand the data container's dimension.

See also

DataContainer

__init__(data_inputs, ids, expected_outputs, old_ids)[source]

Create an ExpandedDataContainer object from the specified ids, data_inputs, and expected_outputs, keeping the old_ids so that reduce_dim() can later restore the original shape.

Parameters
  • data_inputs – data inputs that are iterable.

  • ids – ids that are iterable.

  • expected_outputs – expected outputs that are iterable.

  • old_ids – the original ids to restore when calling reduce_dim().

reduce_dim() → neuraxle.data_container.DataContainer[source]

Reduce the DataContainer back to its original shape, with its lists of multiple ids, data inputs, and expected outputs.

Returns

reduced data container

Return type

DataContainer

static create_from(data_container: neuraxle.data_container.DataContainer) → neuraxle.data_container.ExpandedDataContainer[source]

Create ExpandedDataContainer with a summary id for the new single id.

Parameters

data_container (DataContainer) – data container to transform

Returns

expanded data container

Return type

ExpandedDataContainer
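The expand/reduce round-trip can be sketched in plain Python. This is a simplified illustration of the assumed semantics (the whole batch is wrapped as a single element under one summary id, and reduce_dim undoes it); the summary-id scheme shown is hypothetical, not Neuraxle's actual one:

```python
def expand(ids, di, eo):
    # Wrap the whole batch into a single element with one summary id.
    summary_id = "-".join(str(i) for i in ids)  # hypothetical summary id
    return [summary_id], [di], [eo], ids  # new ids, di, eo, old_ids

def reduce_dim(expanded):
    # Undo the expansion: unwrap the single element and restore old_ids.
    _, di, eo, old_ids = expanded
    return old_ids, di[0], eo[0]

exp = expand([0, 1, 2], [10, 20, 30], [1.0, 2.0, 3.0])
print(exp[1])              # [[10, 20, 30]]
print(reduce_dim(exp)[1])  # [10, 20, 30]
```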

class neuraxle.data_container.ZipDataContainer(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]

Bases: neuraxle.data_container.DataContainer

Subclass of DataContainer to zip two data sources together.

static create_from(data_container: neuraxle.data_container.DataContainer, *other_data_containers, zip_expected_outputs: bool = False) → neuraxle.data_container.ZipDataContainer[source]

Merges data sources together by zipping their data inputs, keeping the expected outputs of the first DataContainer as-is unless zip_expected_outputs is True. NOTE: expects that all the given DataContainers are at least as long as data_container.

Parameters
  • data_container (DataContainer) – the main data container, the attribute of this data container will be kept by the returned ZipDataContainer.

  • other_data_containers (List[DataContainer]) – other data containers to zip with data container

  • zip_expected_outputs (bool) – whether to keep the expected_outputs of data_container as-is (the default), or to zip the expected_outputs of all the provided DataContainers together

Returns

zipped data container

Return type

ZipDataContainer

concatenate_inner_features()[source]

Concatenate inner features from zipped data inputs. Assumes each data_input entry is an iterable of numpy arrays.
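The assumed per-entry behavior can be illustrated with numpy directly — after zipping two data sources, each data_input entry is a tuple of feature arrays, and the inner features are concatenated entry by entry. A hedged sketch, not the actual method body:

```python
import numpy as np

# Each entry of the zipped data inputs is a tuple of feature arrays.
zipped_di = [
    (np.array([1.0, 2.0]), np.array([9.0])),
    (np.array([3.0, 4.0]), np.array([8.0])),
]

# Concatenate the inner features of each entry on the last axis.
concatenated = [np.concatenate(features, axis=-1) for features in zipped_di]
print(concatenated)  # [array([1., 2., 9.]), array([3., 4., 8.])]
```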

class neuraxle.data_container.ListDataContainer(data_inputs: Any, ids=None, expected_outputs: Any = None, sub_data_containers=None)[source]

Bases: neuraxle.data_container.DataContainer, typing.Generic

Subclass of DataContainer to perform list operations. It allows performing append and concat operations on a DataContainer.

See also

DataContainer

__init__(data_inputs: Any, ids=None, expected_outputs: Any = None, sub_data_containers=None)[source]

Create a ListDataContainer object from the specified ids, data_inputs, and expected_outputs.

Parameters
  • ids – ids that are iterable. If None, a range of integers of data_inputs' length is used.

  • data_inputs – data inputs that are iterable.

  • expected_outputs – expected outputs that are iterable. If None, a list of None values of data_inputs' length is used.

  • sub_data_containers – sub data containers.

static empty(original_data_container: neuraxle.data_container.DataContainer = None) → neuraxle.data_container.ListDataContainer[source]
append(_id: str, data_input: Any, expected_output: Any)[source]

Append a new data input to the DataContainer.

Parameters
  • _id (str) – id for the data input

  • data_input – data input

  • expected_output – expected output

Returns

append_data_container_in_data_inputs(other: neuraxle.data_container.DataContainer) → neuraxle.data_container.ListDataContainer[source]

Append a data container to the data inputs of this data container.

Parameters

other (DataContainer) – data container

Returns

append_data_container(other: neuraxle.data_container.DataContainer) → neuraxle.data_container.ListDataContainer[source]

Append a data container to the DataContainer.

Parameters

other (DataContainer) – data container

Returns

extend(other: neuraxle.data_container.DataContainer)[source]

Concat the given data container at the end of self so as to extend its ids, data inputs, and expected outputs.

Parameters

other (DataContainer) – data container to concatenate at the end of self

Returns

neuraxle.data_container._pad_or_keep_incomplete_batch(data_container, batch_size, default_value_data_inputs, default_value_expected_outputs) → neuraxle.data_container.DataContainer[source]
neuraxle.data_container._pad_incomplete_batch(data_container: neuraxle.data_container.DataContainer, batch_size: int, default_value_data_inputs: Any, default_value_expected_outputs: Any) → neuraxle.data_container.DataContainer[source]
neuraxle.data_container._pad_data(data: Iterable[T_co], default_value: Any, batch_size: int)[source]
neuraxle.data_container._inner_concatenate_np_array(np_arrays_to_concatenate: List[numpy.ndarray])[source]

Concatenate numpy arrays on the last axis, expanding and broadcasting if necessary.

Parameters

np_arrays_to_concatenate (List[np.ndarray]) – numpy arrays to concatenate

Returns

concatenated np array

Return type

np.ndarray
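The expand-and-broadcast step can be shown with numpy directly — when one array is a dimension short, it is given a trailing axis so the last-axis concatenation lines up. A sketch of the assumed behavior, not the private helper itself:

```python
import numpy as np

a = np.zeros((3, 2))
b = np.ones((3,))          # one dimension short of `a`

# Give `b` a trailing axis so the shapes agree on all but the last axis.
b = np.expand_dims(b, -1)  # (3,) -> (3, 1)

out = np.concatenate([a, b], axis=-1)
print(out.shape)  # (3, 3)
```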