neuraxle.data_container¶

Module-level documentation for neuraxle.data_container. Here is an inheritance diagram, including dependencies to other base modules of Neuraxle:

Neuraxle’s DataContainer classes¶

Classes for containing the data that flows through the pipeline steps.

Classes

`DACT`	alias of `neuraxle.data_container.DataContainer`
`DataContainer`(data_inputs, ids, …)	DataContainer (dact) class to store IDs (ids), data inputs (di), and expected outputs (eo) together.
`ExpandedDataContainer`(data_inputs, ids, …)	Sub class of DataContainer to expand data container dimension.
`ListDataContainer`(data_inputs[, ids, …])	Sub class of DataContainer to perform list operations.
`StripAbsentValues`	When passed as the default_value_data_inputs argument of the DataContainer.minibatches method, this object makes the method return minibatched data containers whose last batch is shorter than batch_size when the data does not fill it, instead of being padded with trailing None values.
`ZipDataContainer`(data_inputs, ids, …)	Sub class of DataContainer to zip two data sources together.

class neuraxle.data_container.StripAbsentValues[source]¶

Bases: object

When passed as the default_value_data_inputs argument of the DataContainer.minibatches method, this object makes the method return minibatched data containers whose last batch is shorter than batch_size when the data does not fill it, instead of being padded with trailing None values.

class neuraxle.data_container.DataContainer(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]¶

Bases: typing.Generic

DataContainer (dact) class to store IDs (ids), data inputs (di), and expected outputs (eo) together. In some dacts, you could have only ids and data inputs, and in other dacts you could have only expected outputs, or you could have all if you want, such as when your Pipeline is used to train a model in a certain ExecutionMode within a certain ExecutionContext.

You can use typing for your dact, and create a dact, such as:

from typing import List
from neuraxle.data_container import DataContainer as DACT

dact: DACT[List[str], List[int], List[float]] = DACT(
    ids=['a', 'b', 'c'],
    data_inputs=[1, 2, 3],
    expected_outputs=[1.0, 2.0, 3.0]
)

This is because the DataContainer inherits from the Generic type as class DataContainer(Generic[IDT, DIT, EOT]): ....
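To illustrate how three type parameters attach to one class, here is a minimal sketch of a generic container. `MiniDact` is a hypothetical stand-in written for this example and is not Neuraxle's DataContainer; only the `Generic[IDT, DIT, EOT]` pattern is taken from the text above.

```python
from typing import Generic, List, Optional, TypeVar

IDT = TypeVar("IDT")  # type of the ids
DIT = TypeVar("DIT")  # type of the data inputs
EOT = TypeVar("EOT")  # type of the expected outputs


class MiniDact(Generic[IDT, DIT, EOT]):
    """Hypothetical stand-in that only stores the three fields."""

    def __init__(
        self,
        data_inputs: Optional[DIT] = None,
        ids: Optional[IDT] = None,
        expected_outputs: Optional[EOT] = None,
    ) -> None:
        self.ids = ids
        self.data_inputs = data_inputs
        self.expected_outputs = expected_outputs


# The subscript order follows Generic[IDT, DIT, EOT]:
# ids first, then data inputs, then expected outputs.
mini: MiniDact[List[str], List[int], List[float]] = MiniDact(
    ids=["a", "b", "c"],
    data_inputs=[1, 2, 3],
    expected_outputs=[1.0, 2.0, 3.0],
)
```

The annotation is only a hint for type checkers and editors; nothing is validated at runtime in this sketch.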

The DataContainer object is passed to all of the BaseStep's handle methods:

handle_transform()
handle_fit_transform()
handle_fit()

Most of the time, the steps will manage it in the handler methods.

See also

BaseStep, StripAbsentValues

__init__(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]¶

Create a DataContainer[IDT, DIT, EOT] object from specified ids, di, and eo.

Parameters

ids – ids that are iterable. If None, will put a range of integers of data_inputs length. Often a list of integers.
di – same as data_inputs, but shorter.
eo – same as expected_outputs, but shorter.
data_inputs – data inputs that are iterable. Can use di instead.
expected_outputs – expected outputs that are iterable. If None, will put a list of None of data_inputs length.
sub_data_containers – sub data containers.

ids¶

Get ids.

If the ids are None, default ids are returned instead:

If the data_inputs is a DataFrame, returns the index of the DataFrame.
Else if the data_inputs are iterable, returns a range of integers of data_inputs length.
Else, when the data_inputs aren’t iterable, returns a range of integers of expected_outputs length.

Returns: ids
Return type: Iterable
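The fallback order for missing ids can be sketched in plain Python. `resolve_ids` is a hypothetical helper, not part of Neuraxle; it uses the presence of `__len__` as a stand-in for "iterable", and it leaves out the DataFrame branch so that it needs nothing beyond the standard library.

```python
from typing import Any, Iterable, Optional


def resolve_ids(ids: Optional[Any], data_inputs: Any, expected_outputs: Any) -> Iterable:
    """Hypothetical helper mirroring the fallback order described above."""
    if ids is not None:
        return ids  # explicit ids always win
    # The DataFrame branch (returning the index) is omitted in this sketch.
    if hasattr(data_inputs, "__len__"):
        return range(len(data_inputs))
    # Data inputs have no length: fall back to the expected outputs.
    return range(len(expected_outputs))


print(list(resolve_ids(None, ["x", "y", "z"], None)))  # [0, 1, 2]
print(list(resolve_ids(None, None, [0.1, 0.2])))       # [0, 1]
print(resolve_ids(["a", "b"], [1, 2], None))           # ['a', 'b']
```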

di¶

Get data inputs.

Returns: data inputs
Return type: Iterable

eo¶

Get expected outputs.

If the expected outputs are None, will return a list of None of data_inputs length.

Returns: expected outputs
Return type: Iterable

sdact¶

Get sub data containers.

Returns: sub data containers

static from_di(data_inputs: DIT) → neuraxle.data_container.DataContainer[IDT, DIT, List[None]][source]¶

Create a DataContainer (dact) from data inputs (di).

static from_eo(expected_outputs: EOT) → neuraxle.data_container.DataContainer[List[None], List[None], EOT][source]¶

Create a DataContainer (dact) from expected outputs (eo).

without_di() → neuraxle.data_container.DataContainer[IDT, List[None], EOT][source]¶

without_eo() → neuraxle.data_container.DataContainer[IDT, DIT, List[None]][source]¶

with_di(di: DIT) → neuraxle.data_container.DataContainer[IDT, DIT, EOT][source]¶

with_eo(eo: EOT) → neuraxle.data_container.DataContainer[IDT, DIT, EOT][source]¶

set_ids(ids: IDT) → neuraxle.data_container.DataContainer[source]¶

Set ids.

Parameters: ids – data inputs’ ids. Often a range of integers.
Returns: self

set_data_inputs(data_inputs: DIT) → neuraxle.data_container.DataContainer[source]¶

Set data inputs.

Parameters: data_inputs (Iterable) – data inputs
Returns: self

set_expected_outputs(expected_outputs: EOT) → neuraxle.data_container.DataContainer[source]¶

Set expected outputs.

Parameters: expected_outputs (Iterable) – expected outputs
Returns: self

get_ids_summary() → str[source]¶

add_sub_data_container(name: str, data_container: neuraxle.data_container.DataContainer) → neuraxle.data_container.DataContainer[source]¶

Add a sub data container under the given name.

Returns: self

get_sub_data_container_names() → List[str][source]¶

Get sub data container names.

Returns: list of names

set_sub_data_containers(sub_data_containers: List[DACT]) → neuraxle.data_container.DataContainer[source]¶

Set sub data containers.

Returns: self

minibatches(batch_size: int, keep_incomplete_batch: bool = True, default_value_data_inputs=None, default_value_expected_outputs=None) → Iterable[neuraxle.data_container.DataContainer[IDT, DIT, EOT]][source]¶

Yields minibatches extracted from looping on the DataContainer’s content with a batch_size and a certain behavior for the last batch when the batch_size is uneven with the total size.

Note that the default value for IDs is None.

import numpy as np
from neuraxle.data_container import DataContainer, StripAbsentValues

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(batch_size=2):
    print(data_container_batch.data_inputs)
    print(data_container_batch.expected_outputs)
# [array([0, 1]), array([2, 3]), ..., array([8, 9])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(batch_size=3, keep_incomplete_batch=False):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(
    batch_size=3,
    keep_incomplete_batch=True,
    default_value_data_inputs=None,
    default_value_expected_outputs=None
):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9, None, None])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(
    batch_size=3,
    keep_incomplete_batch=True,
    default_value_data_inputs=StripAbsentValues()
):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9])]

Parameters

batch_size (int) – number of elements to combine into a single batch
keep_incomplete_batch (bool) – (Optional.) Whether to keep the last batch when it has fewer than batch_size elements. The default behavior is to keep the smaller batch.
default_value_data_inputs – data_inputs default fill value for padding and values outside iteration range, or StripAbsentValues to trim absent values from the batch
default_value_expected_outputs – expected_outputs default fill value for padding and values outside iteration range, or StripAbsentValues to trim absent values from the batch

Returns: an iterator of DataContainer

See also

StripAbsentValues
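The three behaviors for the last batch (drop it, pad it, or strip the absent values) can be sketched on plain lists. `minibatch_lists` and the `Strip` sentinel are hypothetical names for this illustration; they mimic the documented parameters but are not Neuraxle's implementation.

```python
from typing import Any, Iterator, List, Sequence


class Strip:
    """Hypothetical sentinel playing the role of StripAbsentValues."""


def minibatch_lists(
    data: Sequence[Any],
    batch_size: int,
    keep_incomplete_batch: bool = True,
    default_value: Any = None,
) -> Iterator[List[Any]]:
    """Yield lists of at most batch_size items taken from data in order."""
    for start in range(0, len(data), batch_size):
        batch = list(data[start:start + batch_size])
        missing = batch_size - len(batch)
        if missing == 0:
            yield batch
        elif not keep_incomplete_batch:
            return  # drop the short trailing batch
        elif isinstance(default_value, Strip):
            yield batch  # keep the short batch without padding
        else:
            yield batch + [default_value] * missing  # pad up to batch_size


data = list(range(10))
print(list(minibatch_lists(data, 3, keep_incomplete_batch=False)))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(list(minibatch_lists(data, 3)))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, None, None]]
print(list(minibatch_lists(data, 3, default_value=Strip())))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Using a dedicated sentinel object rather than None lets None itself remain a valid padding value.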

get_n_batches(batch_size: int, keep_incomplete_batch: bool = True) → int[source]¶
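The source gives no description for get_n_batches. Assuming it counts the batches that minibatches would yield for the same arguments, the arithmetic would look like the sketch below; `n_batches` is a hypothetical helper taking the number of items explicitly.

```python
import math


def n_batches(n_items: int, batch_size: int, keep_incomplete_batch: bool = True) -> int:
    """Hypothetical helper: number of batches for n_items elements."""
    if keep_incomplete_batch:
        return math.ceil(n_items / batch_size)  # a short last batch still counts
    return n_items // batch_size  # only full batches count


print(n_batches(10, 3))         # 4
print(n_batches(10, 3, False))  # 3
print(n_batches(10, 2))         # 5
```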

copy() → neuraxle.data_container.DataContainer[IDT, DIT, EOT][source]¶

tolist() → neuraxle.data_container.DataContainer[List[T], List[T], List[T]][source]¶

tolistshallow() → neuraxle.data_container.DataContainer[List[T], List[T], List[T]][source]¶

to_numpy() → neuraxle.data_container.DataContainer[numpy.ndarray, numpy.ndarray, numpy.ndarray][source]¶

apply_conversion_func(conversion_function: Callable[[Any], Any]) → neuraxle.data_container.DataContainer[source]¶

Apply conversion function to data inputs, expected outputs, and ids, and set the new values in self. Conversion function must be able to handle None values.

Returns: self
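Because the same function is applied to ids, data inputs, and expected outputs, and any of these may be None, the function has to pass None through safely. The sketch below shows one such function; `to_list_or_none` and `convert_all` are hypothetical helpers working on a plain tuple, not Neuraxle calls.

```python
from typing import Any, Callable, Optional, Tuple


def to_list_or_none(values: Optional[Any]) -> Optional[list]:
    """None-safe conversion: pass None through, turn anything else into a list."""
    if values is None:
        return None
    return list(values)


def convert_all(
    fields: Tuple[Any, Any, Any], conversion_function: Callable[[Any], Any]
) -> Tuple[Any, Any, Any]:
    """Hypothetical helper: apply one function to (ids, data inputs, expected outputs)."""
    ids, data_inputs, expected_outputs = fields
    return (
        conversion_function(ids),
        conversion_function(data_inputs),
        conversion_function(expected_outputs),
    )


# expected_outputs is None here, so the function must tolerate it.
print(convert_all((range(3), (10, 20, 30), None), to_list_or_none))
# ([0, 1, 2], [10, 20, 30], None)
```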

unpack() → Tuple[IDT, DIT, EOT][source]¶

Unpack to a tuple of (ids, data inputs, expected outputs).

Returns: tuple of ids, data inputs, expected outputs

_str_data(_idata: Union[IDT, DIT, EOT]) → str[source]¶

neuraxle.data_container.DACT[source]¶

alias of neuraxle.data_container.DataContainer

class neuraxle.data_container.ExpandedDataContainer(data_inputs, ids, expected_outputs, old_ids)[source]¶

Bases: neuraxle.data_container.DataContainer

Sub class of DataContainer to expand data container dimension.

See also

DataContainer

__init__(data_inputs, ids, expected_outputs, old_ids)[source]¶

Create a DataContainer[IDT, DIT, EOT] object from specified ids, di, and eo.

Parameters

ids – ids that are iterable. If None, will put a range of integers of data_inputs length. Often a list of integers.
di – same as data_inputs, but shorter.
eo – same as expected_outputs, but shorter.
data_inputs – data inputs that are iterable. Can use di instead.
expected_outputs – expected outputs that are iterable. If None, will put a list of None of data_inputs length.
sub_data_containers – sub data containers.

reduce_dim() → neuraxle.data_container.DataContainer[source]¶

Reduce DataContainer to its original shape with a list of multiple ids, data_inputs, and expected outputs.

Returns: reduced data container
Return type: DataContainer

static create_from(data_container: neuraxle.data_container.DataContainer) → neuraxle.data_container.ExpandedDataContainer[source]¶

Create ExpandedDataContainer with a summary id for the new single id.

Parameters: data_container (DataContainer) – data container to transform
Returns: expanded data container
Return type: ExpandedDataContainer
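One way to picture expanding and reducing a dimension is to wrap each field in a one-item list and later unwrap it. The sketch below works on plain tuples; `expand`, `reduce_dim`, and the comma-joined summary id are assumptions made for illustration and may differ from how Neuraxle builds its summary id.

```python
from typing import Any, List, Tuple

Fields = Tuple[List[Any], List[Any], List[Any]]  # (ids, data inputs, expected outputs)


def expand(fields: Fields) -> Tuple[Fields, List[Any]]:
    """Wrap every field in a one-item list and remember the old ids."""
    ids, data_inputs, expected_outputs = fields
    summary_id = ",".join(str(i) for i in ids)  # assumption: any single summary value
    return ([summary_id], [data_inputs], [expected_outputs]), ids


def reduce_dim(expanded: Fields, old_ids: List[Any]) -> Fields:
    """Undo expand(): unwrap the single item and restore the old ids."""
    _, data_inputs, expected_outputs = expanded
    return (old_ids, data_inputs[0], expected_outputs[0])


original = (["a", "b"], [1, 2], [0.1, 0.2])
expanded, old_ids = expand(original)
print(expanded)                       # (['a,b'], [[1, 2]], [[0.1, 0.2]])
print(reduce_dim(expanded, old_ids))  # (['a', 'b'], [1, 2], [0.1, 0.2])
```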

class neuraxle.data_container.ZipDataContainer(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]¶

Bases: neuraxle.data_container.DataContainer

Sub class of DataContainer to zip two data sources together.

static create_from(data_container: neuraxle.data_container.DataContainer, *other_data_containers, zip_expected_outputs: bool = False) → neuraxle.data_container.ZipDataContainer[source]¶

Merges two data sources together. Zips only the data input part and keeps the expected output of the first DataContainer as is. NOTE: Expects that all DataContainer are at least as long as data_container.

Parameters

data_container (DataContainer) – the main data container, the attribute of this data container will be kept by the returned ZipDataContainer.
other_data_containers (List[DataContainer]) – other data containers to zip with data container
zip_expected_outputs (bool) – Determines whether to keep the expected_outputs of data_container as is or to zip the expected_outputs of all the DataContainers provided

Returns

zipped data container

Return type

ZipDataContainer
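The zipping rule (pair the data inputs item by item, keep the expected outputs of the main container) can be sketched on plain lists. `zip_data_inputs` is a hypothetical helper; raising an error for a shorter source is an assumption of this sketch, since the text only states the expectation.

```python
from typing import Any, List, Sequence, Tuple


def zip_data_inputs(main: Sequence[Any], *others: Sequence[Any]) -> List[Tuple[Any, ...]]:
    """Hypothetical helper: pair item i of main with item i of every other source."""
    for other in others:
        if len(other) < len(main):
            raise ValueError("every other source must be at least as long as main")
    # Iterate over main's length, so extra items in longer sources are ignored.
    return [tuple(source[i] for source in (main, *others)) for i in range(len(main))]


main_di, main_eo = [1, 2, 3], ["x", "y", "z"]
other_di = [10, 20, 30, 40]

zipped_di = zip_data_inputs(main_di, other_di)
print(zipped_di)  # [(1, 10), (2, 20), (3, 30)]
print(main_eo)    # ['x', 'y', 'z'] (expected outputs of the main source are kept as is)
```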

concatenate_inner_features()[source]¶

Concatenate inner features from zipped data inputs. Assumes each data_input entry is an iterable of numpy arrays.

class neuraxle.data_container.ListDataContainer(data_inputs: Any, ids=None, expected_outputs: Any = None, sub_data_containers=None)[source]¶

Bases: neuraxle.data_container.DataContainer, typing.Generic

Sub class of DataContainer to perform list operations. It supports append and concat operations on a DataContainer.
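The difference between appending one item and extending with a whole container can be sketched with three parallel lists. `MiniListContainer` is a hypothetical stand-in for this illustration and does not reproduce ListDataContainer's actual implementation.

```python
from typing import Any, List


class MiniListContainer:
    """Hypothetical stand-in holding three parallel lists."""

    def __init__(self) -> None:
        self.ids: List[Any] = []
        self.data_inputs: List[Any] = []
        self.expected_outputs: List[Any] = []

    def append(self, _id: Any, data_input: Any, expected_output: Any) -> None:
        """Add one item to the end of each list."""
        self.ids.append(_id)
        self.data_inputs.append(data_input)
        self.expected_outputs.append(expected_output)

    def extend(self, other: "MiniListContainer") -> None:
        """Concatenate another container's items at the end of each list."""
        self.ids.extend(other.ids)
        self.data_inputs.extend(other.data_inputs)
        self.expected_outputs.extend(other.expected_outputs)


first = MiniListContainer()
first.append("a", 1, 1.0)

second = MiniListContainer()
second.append("b", 2, 2.0)
second.append("c", 3, 3.0)

first.extend(second)
print(first.ids, first.data_inputs, first.expected_outputs)
# ['a', 'b', 'c'] [1, 2, 3] [1.0, 2.0, 3.0]
```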

See also

DataContainer

__init__(data_inputs: Any, ids=None, expected_outputs: Any = None, sub_data_containers=None)[source]¶

Create a DataContainer[IDT, DIT, EOT] object from specified ids, di, and eo.

Parameters

ids – ids that are iterable. If None, will put a range of integers of data_inputs length. Often a list of integers.
di – same as data_inputs, but shorter.
eo – same as expected_outputs, but shorter.
data_inputs – data inputs that are iterable. Can use di instead.
expected_outputs – expected outputs that are iterable. If None, will put a list of None of data_inputs length.
sub_data_containers – sub data containers.

static empty(original_data_container: neuraxle.data_container.DataContainer = None) → neuraxle.data_container.ListDataContainer[source]¶

append(_id: str, data_input: Any, expected_output: Any)[source]¶

Append a new data input to the DataContainer.

Parameters

_id (str) – id for the data input
data_input – data input
expected_output – expected output

Returns

append_data_container_in_data_inputs(other: neuraxle.data_container.DataContainer) → neuraxle.data_container.ListDataContainer[source]¶

Append a data container to the data inputs of this data container.

Parameters: other (DataContainer) – data container
Returns

append_data_container(other: neuraxle.data_container.DataContainer) → neuraxle.data_container.ListDataContainer[source]¶

Append a data container to the DataContainer.

Parameters: other (DataContainer) – data container
Returns

extend(other: neuraxle.data_container.DataContainer)[source]¶

Concatenate the given data container at the end of self so as to extend the IDs, DIs, and EOs.

Parameters: other (DataContainer) – data container
Returns

neuraxle.data_container._pad_or_keep_incomplete_batch(data_container, batch_size, default_value_data_inputs, default_value_expected_outputs) → neuraxle.data_container.DataContainer[source]¶

neuraxle.data_container._pad_incomplete_batch(data_container: neuraxle.data_container.DataContainer, batch_size: int, default_value_data_inputs: Any, default_value_expected_outputs: Any) → neuraxle.data_container.DataContainer[source]¶

neuraxle.data_container._pad_data(data: Iterable[T_co], default_value: Any, batch_size: int)[source]¶

neuraxle.data_container._inner_concatenate_np_array(np_arrays_to_concatenate: List[numpy.ndarray])[source]¶

Concatenate numpy arrays on the last axis, expanding and broadcasting if necessary.

Parameters: np_arrays_to_concatenate (Iterable[np.ndarray]) – numpy arrays to zip with the other
Returns: concatenated np array
Return type: np.ndarray