neuraxle.data_container¶
Module-level documentation for neuraxle.data_container. Here is an inheritance diagram, including dependencies to other base modules of Neuraxle:
Neuraxle’s DataContainer classes¶
Classes for containing the data that flows through the pipeline steps.
Classes

- DataContainer – (dact) class to store IDs (ids), data inputs (di), and expected outputs (eo) together.
- ExpandedDataContainer – Subclass of DataContainer to expand the data container dimension.
- ListDataContainer – Subclass of DataContainer to perform list operations.
- StripAbsentValues – Object that, when passed to the default_value_data_inputs argument of the DataContainer.minibatches method, makes the last minibatch shorter than batch_size when incomplete, instead of padding it with trailing None values.
- ZipDataContainer – Subclass of DataContainer to zip two data sources together.
class neuraxle.data_container.StripAbsentValues[source]¶
Bases: object

This object, when passed to the default_value_data_inputs argument of the DataContainer.minibatches method, will return the minibatched data containers such that the last batch is kept shorter than batch_size when incomplete, rather than padded with trailing None values.
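For illustration, the padding-versus-trimming behavior of the last batch can be sketched in plain Python (a simplified sketch, not Neuraxle's implementation; make_batches is a hypothetical helper):

```python
def make_batches(items, batch_size, pad_value=None, strip_absent=False):
    # Split `items` into consecutive batches of size `batch_size`.
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if batches and len(batches[-1]) < batch_size and not strip_absent:
        # Default behavior: pad the incomplete last batch with `pad_value`.
        batches[-1] = batches[-1] + [pad_value] * (batch_size - len(batches[-1]))
    # With strip_absent=True (the StripAbsentValues behavior),
    # the incomplete last batch is simply kept shorter.
    return batches

make_batches(list(range(10)), 3)                     # last batch: [9, None, None]
make_batches(list(range(10)), 3, strip_absent=True)  # last batch: [9]
```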
class neuraxle.data_container.DataContainer(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]¶
Bases: typing.Generic

DataContainer (dact) class to store IDs (ids), data inputs (di), and expected outputs (eo) together. In some dacts, you could have only ids and data inputs; in others, only expected outputs; or you could have all three, such as when your Pipeline is used to train a model in a certain ExecutionMode within a certain ExecutionContext.

You can use typing for your dact, and create a dact, such as:

from typing import List
from neuraxle.data_container import DataContainer as DACT

dact: DACT[List[str], List[int], List[float]] = DACT(
    ids=['a', 'b', 'c'],
    data_inputs=[1, 2, 3],
    expected_outputs=[1.0, 2.0, 3.0]
)

This is possible because DataContainer inherits from the Generic type, as class DataContainer(Generic[IDT, DIT, EOT]): ....
The DataContainer object is passed to all of the BaseStep's handler methods:

- handle_transform()
- handle_fit_transform()
- handle_fit()

Most of the time, the steps will manage it in the handler methods.
- __init__(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]¶
Create a DataContainer[IDT, DIT, EOT] object from the specified ids, di, and eo.

- Parameters
ids – iterable ids. If None, a range of integers of data_inputs length will be used. Often a list of integers.
data_inputs – iterable data inputs. Can use di instead.
di – same as data_inputs, but shorter.
expected_outputs – iterable expected outputs. If None, a list of None of data_inputs length will be used.
eo – same as expected_outputs, but shorter.
sub_data_containers – sub data containers.
- ids¶
Get ids.

If the ids are None, the following ids will be returned:

- If the data_inputs is a DataFrame, the index of the DataFrame will be returned.
- Else if the ids are None, a range of integers of data_inputs length will be returned.
- Else if the data_inputs aren't iterable, a range of integers of expected_outputs length will be returned.

- Returns
ids
- Return type
Iterable
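The fallback rules above can be sketched as follows (a simplified sketch; resolve_ids is a hypothetical name, and the DataFrame check is approximated here by duck-typing on the pandas iloc attribute):

```python
def resolve_ids(ids, data_inputs, expected_outputs):
    # Mimics the documented fallbacks for when ids are missing.
    if ids is not None:
        return ids
    if hasattr(data_inputs, "iloc"):
        # Looks like a pandas DataFrame: use its index.
        return list(data_inputs.index)
    try:
        # Iterable data_inputs: a range of integers of its length.
        return list(range(len(data_inputs)))
    except TypeError:
        # Not sized: fall back to the expected_outputs length.
        return list(range(len(expected_outputs)))
```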
- di¶
Get data inputs.

- Returns
data inputs
- Return type
Iterable
- eo¶
Get expected outputs.

If the expected outputs are None, a list of None of data_inputs length will be returned.

- Returns
expected outputs
- Return type
Iterable
- sdact¶
Get sub data containers.

- Returns
sub data containers
- static from_di(data_inputs: DIT) → neuraxle.data_container.DataContainer[IDT, DIT, List[None]][source]¶
Create a DataContainer (dact) from data inputs (di).
- static from_eo(expected_outputs: EOT) → neuraxle.data_container.DataContainer[List[None], List[None], EOT][source]¶
Create a DataContainer (dact) from expected outputs (eo).
- without_di() → neuraxle.data_container.DataContainer[IDT, List[None], EOT][source]¶
- without_eo() → neuraxle.data_container.DataContainer[IDT, DIT, List[None]][source]¶
- set_ids(ids: IDT) → neuraxle.data_container.DataContainer[source]¶
Set ids.

- Parameters
ids – data inputs' ids. Often a range of integers.
- Returns
self
- set_data_inputs(data_inputs: DIT) → neuraxle.data_container.DataContainer[source]¶
Set data inputs.

- Parameters
data_inputs (Iterable) – data inputs
- Returns
self
- set_expected_outputs(expected_outputs: EOT) → neuraxle.data_container.DataContainer[source]¶
Set expected outputs.

- Parameters
expected_outputs (Iterable) – expected outputs
- Returns
self
- add_sub_data_container(name: str, data_container: neuraxle.data_container.DataContainer) → neuraxle.data_container.DataContainer[source]¶
Add a named sub data container to this data container.

- Returns
self
- get_sub_data_container_names() → List[str][source]¶
Get sub data container names.

- Returns
list of names
- set_sub_data_containers(sub_data_containers: List[DACT]) → neuraxle.data_container.DataContainer[source]¶
Set sub data containers.

- Returns
self
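Sub data containers are stored as (name, data_container) pairs (as the NamedDACTTuple type suggests), so the name lookup can be sketched as follows (a hypothetical simplification of the real method):

```python
def get_sub_data_container_names(sub_data_containers):
    # Each entry is a (name, data_container) tuple; collect the names.
    return [name for name, _dact in sub_data_containers]

get_sub_data_container_names([('train', None), ('validation', None)])
# → ['train', 'validation']
```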
- minibatches(batch_size: int, keep_incomplete_batch: bool = True, default_value_data_inputs=None, default_value_expected_outputs=None) → Iterable[neuraxle.data_container.DataContainer[IDT, DIT, EOT]][source]¶
Yields minibatches extracted by looping over the DataContainer's content with a given batch_size, with a configurable behavior for the last batch when batch_size doesn't divide the total size evenly.

Note that the default value for IDs is None.

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(batch_size=2):
    print(data_container_batch.data_inputs)
    print(data_container_batch.expected_outputs)
# [array([0, 1]), array([2, 3]), ..., array([8, 9])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(batch_size=3, keep_incomplete_batch=False):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(
    batch_size=3,
    keep_incomplete_batch=True,
    default_value_data_inputs=None,
    default_value_expected_outputs=None
):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9, None, None])]

data_container = DataContainer(data_inputs=np.array(list(range(10))))
for data_container_batch in data_container.minibatches(
    batch_size=3,
    keep_incomplete_batch=True,
    default_value_data_inputs=StripAbsentValues()
):
    print(data_container_batch.data_inputs)
# [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9])]

- Parameters
batch_size (int) – number of elements to combine into a single batch
keep_incomplete_batch (bool) – (Optional.) Whether to keep the last batch if it has fewer than batch_size elements; the default behavior is to keep the smaller batch.
default_value_data_inputs – default fill value of data_inputs for padding and values outside the iteration range, or StripAbsentValues to trim absent values from the batch
default_value_expected_outputs – default fill value of expected_outputs for padding and values outside the iteration range, or StripAbsentValues to trim absent values from the batch
- Returns
an iterator of DataContainer
- tolist() → neuraxle.data_container.DataContainer[List, List, List][source]¶
- tolistshallow() → neuraxle.data_container.DataContainer[List, List, List][source]¶
- to_numpy() → neuraxle.data_container.DataContainer[numpy.ndarray, numpy.ndarray, numpy.ndarray][source]¶
- apply_conversion_func(conversion_function: Callable[[Any], Any]) → neuraxle.data_container.DataContainer[source]¶
Apply a conversion function to the data inputs, expected outputs, and ids, and set the new values on self. Returns self. The conversion function must be able to handle None values.
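The conversion can be sketched as applying one function to all three fields (a simplified sketch; convert_all is a hypothetical stand-in for the method, returning a tuple instead of mutating a container):

```python
def convert_all(ids, data_inputs, expected_outputs, conversion_function):
    # Apply the same conversion to ids, data inputs, and expected outputs.
    # As documented, the function must be able to handle None values.
    return (conversion_function(ids),
            conversion_function(data_inputs),
            conversion_function(expected_outputs))

def safe_tuple(x):
    # Example conversion that tolerates None.
    return tuple(x) if x is not None else None

convert_all([0, 1], [10, 20], None, safe_tuple)
# → ((0, 1), (10, 20), None)
```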
class neuraxle.data_container.ExpandedDataContainer(data_inputs, ids, expected_outputs, old_ids)[source]¶
Bases: neuraxle.data_container.DataContainer

Subclass of DataContainer to expand the data container dimension.
- __init__(data_inputs, ids, expected_outputs, old_ids)[source]¶
Create a DataContainer[IDT, DIT, EOT] object from the specified ids, di, and eo.

- Parameters
ids – iterable ids. If None, a range of integers of data_inputs length will be used. Often a list of integers.
data_inputs – iterable data inputs. Can use di instead.
di – same as data_inputs, but shorter.
expected_outputs – iterable expected outputs. If None, a list of None of data_inputs length will be used.
eo – same as expected_outputs, but shorter.
sub_data_containers – sub data containers.
- reduce_dim() → neuraxle.data_container.DataContainer[source]¶
Reduce the DataContainer to its original shape, with a list of multiple ids, data_inputs, and expected outputs.

- Returns
reduced data container
- Return type
DataContainer
- static create_from(data_container: neuraxle.data_container.DataContainer) → neuraxle.data_container.ExpandedDataContainer[source]¶
Create an ExpandedDataContainer with a summary id for the new single id.

- Parameters
data_container (DataContainer) – data container to transform
- Returns
expanded data container
- Return type
ExpandedDataContainer
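The expand/reduce round trip can be sketched with plain Python structures (a hypothetical simplification: each field is wrapped in an outer list of length one, and the old ids are remembered so that reduce_dim can restore them):

```python
def expand_dim(ids, data_inputs, expected_outputs):
    # Wrap each field in an outer list of length 1, remembering the old ids.
    return {"ids": [None], "di": [data_inputs], "eo": [expected_outputs], "old_ids": ids}

def reduce_dim(expanded):
    # Undo the expansion: unwrap the single outer element, restore the old ids.
    return expanded["old_ids"], expanded["di"][0], expanded["eo"][0]

expanded = expand_dim([0, 1], [10, 20], [1.0, 2.0])
reduce_dim(expanded)  # → ([0, 1], [10, 20], [1.0, 2.0])
```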
class neuraxle.data_container.ZipDataContainer(data_inputs: Optional[DIT] = None, ids: Optional[IDT] = None, expected_outputs: Optional[EOT] = None, sub_data_containers: List[NamedDACTTuple] = None, *, di: Optional[DIT] = None, eo: Optional[EOT] = None)[source]¶
Bases: neuraxle.data_container.DataContainer

Subclass of DataContainer to zip two data sources together.
- static create_from(data_container: neuraxle.data_container.DataContainer, *other_data_containers, zip_expected_outputs: bool = False) → neuraxle.data_container.ZipDataContainer[source]¶
Merge two data sources together. Zips only the data input part and keeps the expected outputs of the first DataContainer as is. NOTE: expects that all DataContainers are at least as long as data_container.

- Parameters
data_container (DataContainer) – the main data container; the attributes of this data container will be kept by the returned ZipDataContainer.
other_data_containers (List[DataContainer]) – other data containers to zip with data_container
zip_expected_outputs (bool) – determines whether to keep the expected_outputs of data_container or to zip the expected_outputs of all provided DataContainers
- Returns
zipped data container
- Return type
ZipDataContainer
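The zipping of data inputs can be sketched in plain Python (hypothetical helper; the real method operates on DataContainer attributes and can also zip the expected outputs when zip_expected_outputs=True):

```python
def zip_data_inputs(main_data_inputs, *other_data_inputs):
    # Pair each main data input with the corresponding element of every
    # other source. zip() truncates to the shortest source, hence the
    # requirement that the other sources be at least as long as the main one.
    return [tuple(grouped) for grouped in zip(main_data_inputs, *other_data_inputs)]

zip_data_inputs([1, 2, 3], ['a', 'b', 'c'])
# → [(1, 'a'), (2, 'b'), (3, 'c')]
```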
class neuraxle.data_container.ListDataContainer(data_inputs: Any, ids=None, expected_outputs: Any = None, sub_data_containers=None)[source]¶
Bases: neuraxle.data_container.DataContainer, typing.Generic

Subclass of DataContainer to perform list operations. It allows performing append and concat operations on a DataContainer.
- __init__(data_inputs: Any, ids=None, expected_outputs: Any = None, sub_data_containers=None)[source]¶
Create a DataContainer[IDT, DIT, EOT] object from the specified ids, di, and eo.

- Parameters
ids – iterable ids. If None, a range of integers of data_inputs length will be used. Often a list of integers.
data_inputs – iterable data inputs. Can use di instead.
di – same as data_inputs, but shorter.
expected_outputs – iterable expected outputs. If None, a list of None of data_inputs length will be used.
eo – same as expected_outputs, but shorter.
sub_data_containers – sub data containers.
- static empty(original_data_container: neuraxle.data_container.DataContainer = None) → neuraxle.data_container.ListDataContainer[source]¶
- append(_id: str, data_input: Any, expected_output: Any)[source]¶
Append a new data input to the DataContainer.

- Parameters
_id (str) – id for the data input
data_input – data input
expected_output – expected output
- append_data_container_in_data_inputs(other: neuraxle.data_container.DataContainer) → neuraxle.data_container.ListDataContainer[source]¶
Append a data container to the data inputs of this data container.

- Parameters
other (DataContainer) – data container
- append_data_container(other: neuraxle.data_container.DataContainer) → neuraxle.data_container.ListDataContainer[source]¶
Append a data container to the DataContainer.

- Parameters
other (DataContainer) – data container
- extend(other: neuraxle.data_container.DataContainer)[source]¶
Concatenate the given data container at the end of self, so as to extend the ids, data inputs, and expected outputs.

- Parameters
other (DataContainer) – data container to concatenate
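The append/extend semantics can be sketched with three parallel lists (a hypothetical minimal class, not the real implementation):

```python
class MiniListDact:
    # Three parallel lists, as in ListDataContainer.
    def __init__(self):
        self.ids, self.di, self.eo = [], [], []

    def append(self, _id, data_input, expected_output):
        # Append one (id, data input, expected output) triplet.
        self.ids.append(_id)
        self.di.append(data_input)
        self.eo.append(expected_output)

    def extend(self, other):
        # Concatenate another container's fields at the end of self.
        self.ids += other.ids
        self.di += other.di
        self.eo += other.eo


a = MiniListDact()
a.append('0', 10, 1.0)
b = MiniListDact()
b.append('1', 20, 2.0)
a.extend(b)
# a.di is now [10, 20]
```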
neuraxle.data_container._pad_or_keep_incomplete_batch(data_container, batch_size, default_value_data_inputs, default_value_expected_outputs) → neuraxle.data_container.DataContainer[source]¶
neuraxle.data_container._pad_incomplete_batch(data_container: neuraxle.data_container.DataContainer, batch_size: int, default_value_data_inputs: Any, default_value_expected_outputs: Any) → neuraxle.data_container.DataContainer[source]¶
neuraxle.data_container._pad_data(data: Iterable[T_co], default_value: Any, batch_size: int)[source]¶
neuraxle.data_container._inner_concatenate_np_array(np_arrays_to_concatenate: List[numpy.ndarray])[source]¶
Concatenate numpy arrays on the last axis, expanding and broadcasting if necessary.

- Parameters
np_arrays_to_concatenate (List[np.ndarray]) – numpy arrays to concatenate
- Returns
concatenated np array
- Return type
np.ndarray