ds_provider_azure_py_lib.dataset¶

File: __init__.py Region: ds_provider_azure_py_lib/dataset

Azure Datasets: Table and Blob

This module implements datasets for Azure.

Example (AzureTable):

>>> azure_table = AzureTable(
...     settings=AzureTableDatasetSettings(
...         table_name="users",
...         partition_key="partition_key",
...         row_key="row_key",
...         query_filter="additional query filter",
...         delete_table=False,
...     ),
...     linked_service=AzureLinkedService(
...         settings=AzureLinkedServiceSettings(
...             account_name="account name",
...             access_key="access key"
...         ),
...     ),
... )
>>> azure_table.read()
>>> table_data = azure_table.output

Example (AzureBlob):

>>> azure_blob = AzureBlob(
...     deserializer=AzureBlobDeserializer(format=DatasetStorageFormatType.CSV),
...     serializer=AzureBlobSerializer(format=DatasetStorageFormatType.CSV),
...     settings=AzureBlobDatasetSettings(
...         container_name="my-container",
...         blob_name="path/to/example_file.csv",
...         prefix=None,  # for multiple blobs, provide a prefix instead of blob_name
...         create=CreateSettings(
...             overwrite_blob_if_exists=True,  # overwrite existing blob or raise an error.
...             new_container=True,  # create container if missing or raise an error.
...         ),
...         purge=DeleteSettings(
...             delete_container=True,  # delete the container or only delete the blob
...         ),
...     ),
...     linked_service=AzureLinkedService(
...         settings=AzureLinkedServiceSettings(
...             account_name="account name",
...             access_key="access key"
...         ),
...         id=uuid.uuid4(),
...         name="testazurepackage",
...         version="0.0.1",
...         description="testazurepackage",
...     ),
...     id=uuid.uuid4(),
...     name="testazurepackage",
...     version="0.0.1",
...     description="testazurepackage",
... )
>>> azure_blob.read()
>>> blob_data = azure_blob.output

Classes¶

`AzureBlob`	Tabular dataset object which identifies data within a data store, such as table/csv/json/parquet/parquetdataset/ and other documents.
`AzureBlobDatasetSettings`	Settings for Azure Blob Storage dataset operations.
`AzureTable`	Tabular dataset object which identifies data within a data store, such as table/csv/json/parquet/parquetdataset/ and other documents.
`AzureTableDatasetSettings`	Settings for Azure Table Storage dataset operations.

Package Contents¶

class ds_provider_azure_py_lib.dataset.AzureBlob[source]¶

Bases: ds_resource_plugin_py_lib.common.resource.dataset.base.TabularDataset[AzureLinkedServiceType, AzureBlobDatasetSettingsType, ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer, ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer], Generic[AzureLinkedServiceType, AzureBlobDatasetSettingsType]

Tabular dataset object which identifies data within a data store, such as table/csv/json/parquet/parquetdataset/ and other documents.

The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.

linked_service: AzureLinkedServiceType¶

settings: AzureBlobDatasetSettingsType¶

serializer: ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer | None¶

deserializer: ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer | None¶

property type: ds_provider_azure_py_lib.enums.ResourceType¶

Get the type of the dataset.

Returns:: ResourceType

_list_blobs(prefix: str) → azure.core.paging.ItemPaged[azure.storage.blob.BlobProperties][source]¶

List all blobs in the container with a specific prefix.

Parameters:: prefix – a string prefix to match one or multiple blobs.
Returns:: An iterable of BlobProperties matching the prefix.
Return type:: ItemPaged[BlobProperties]
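
To illustrate what matching by prefix means here, the following minimal sketch filters an in-memory list of blob names. It is not the library's implementation: `filter_by_prefix` and the sample names are hypothetical, and the real method presumably leaves the matching to the Azure SDK rather than filtering on the client side.

```python
def filter_by_prefix(blob_names: list[str], prefix: str) -> list[str]:
    """Return the blob names that start with ``prefix`` (hypothetical helper)."""
    return [name for name in blob_names if name.startswith(prefix)]


# Hypothetical container contents, used only for illustration.
names = ["raw/2024/a.csv", "raw/2024/b.csv", "raw/2025/a.csv", "curated/a.csv"]

# A prefix can match several blobs, exactly one, or none at all.
matched = filter_by_prefix(names, "raw/2024/")
```

A prefix that matches nothing yields an empty result rather than an error in this sketch; how the library treats that case is not stated above.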

_read_blob(blob: str) → pandas.DataFrame[source]¶

Read a specific blob in the container.

Parameters:: blob – name of the blob to read.
Returns:: content of the blob as a DataFrame.
Return type:: pd.DataFrame

_read_blobs(prefix: str) → pandas.DataFrame[source]¶

Read all blobs in the container with a specific prefix.

Parameters:: prefix – a string prefix to match one or multiple blobs.
Returns:: Content of all blobs concatenated as a DataFrame.
Return type:: pd.DataFrame

_create_container() → None[source]¶

Create a container in the Azure Blob Storage.

Raises:: CreateError – If the container creation fails.
Returns:: None

_create_blob(stream: bytes, blob: str) → None[source]¶

Create a specific blob in the container.

Parameters:

stream – data stream to upload to the blob.
blob – name of the blob to create.

Raises:

CreateError – If the blob creation fails.

Returns:

None

_delete_blob(blob: str) → pandas.DataFrame[source]¶

Delete a specific blob in the container.

Parameters:: blob – name of the blob to delete.
Returns:: Empty DataFrame upon successful deletion.
Return type:: pd.DataFrame
Raises:: DeleteError – If the blob deletion fails.

_delete_blobs(prefix: str) → pandas.DataFrame[source]¶

Delete all blobs in the container with a specific prefix.

Parameters:: prefix – a string prefix to match one or multiple blobs.
Returns:: Empty DataFrame upon successful deletion of all blobs.
Return type:: pd.DataFrame
Raises:: DeleteError – If one or more blob deletions fail.

read(**_kwargs: Any) → None[source]¶

Read Azure Blob Storage dataset.

Parameters:: _kwargs – Additional keyword arguments to pass to the request.
Returns:: None
Raises:: ReadError – If reading the blob(s) fails.

create(**_kwargs: Any) → None[source]¶

Create a blob in the container.

Parameters:: _kwargs – Additional keyword arguments to pass to the request. (not used)
Returns:: None
Raises:: CreateError – If the blob creation fails.

update() → NoReturn[source]¶

Update existing rows in the target matched by identity columns defined in self.settings. Atomic. Must not insert new rows.

Raises:

UpdateError – If the operation fails.
NotSupportedError – If the provider does not support update.

See also

Full contract: docs/DATASET_CONTRACT.md – update()
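
The `NoReturn` annotation says that the method never returns normally, which suggests that `update()` always raises for this dataset. A minimal sketch of that pattern follows, assuming a stand-in `NotSupportedError` and a hypothetical `BlobLikeDataset` class; neither is the library's own definition.

```python
from typing import NoReturn


class NotSupportedError(Exception):
    """Stand-in for the library's NotSupportedError (assumed shape)."""


class BlobLikeDataset:
    """Hypothetical dataset whose backend has no row-level update."""

    def update(self) -> NoReturn:
        # The method never returns normally, which is what NoReturn expresses.
        raise NotSupportedError("update is not supported by this provider")


try:
    BlobLikeDataset().update()
except NotSupportedError as exc:
    message = str(exc)
```

Callers that need to support several providers can catch the error and fall back to another operation.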

list() → NoReturn[source]¶

Discover available resources and populate self.output with a DataFrame of resources and their metadata. Idempotent.

Raises:

ListError – If the operation fails.
NotSupportedError – If the provider does not support listing.

See also

Full contract: docs/DATASET_CONTRACT.md – list()

purge(**_kwargs: Any) → None[source]¶

Purge (remove all content from) the container.

For Azure Blob Storage, this deletes all blobs from the container, leaving the container empty. The container itself is not deleted.

Parameters:: _kwargs – Additional keyword arguments to pass to the request. (not used)
Returns:: None
Raises:: DeleteError – If the purge operation fails.

upsert() → NoReturn[source]¶

Insert rows that do not exist, update rows that do, matched by identity columns defined in self.settings. Atomic.

Raises:

UpsertError – If the operation fails.
NotSupportedError – If the provider does not support upsert.

See also

Full contract: docs/DATASET_CONTRACT.md – upsert()

delete(**_kwargs: Any) → None[source]¶

Delete specific blob(s) or the entire container from Azure Blob Storage.

For Azure Blob Storage, a “row” is a blob. This method deletes:

- Specific blob by blob_name
- Multiple blobs by prefix
- Entire container if delete_container=True and no blob_name/prefix provided

Parameters:: _kwargs – Additional keyword arguments to pass to the request. (not used)
Returns:: None
Raises:: DeleteError – If the deletion fails or the requirements are not met.
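
The three cases described above can be sketched as a simple dispatch. This is an illustration under assumptions, not the library's code: `DeleteTarget` and `plan_delete` are hypothetical names, the precedence of blob_name over prefix follows the settings description, and a plain `ValueError` stands in for `DeleteError`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeleteTarget:
    """Hypothetical, simplified view of the settings that delete() consults."""

    blob_name: Optional[str] = None
    prefix: Optional[str] = None
    delete_container: bool = False


def plan_delete(target: DeleteTarget) -> str:
    """Describe which deletion would run; the order is an assumption."""
    if target.blob_name:
        return f"delete blob {target.blob_name}"
    if target.prefix:
        return f"delete blobs under {target.prefix}"
    if target.delete_container:
        return "delete container"
    # Nothing to act on; ValueError stands in for the library's DeleteError.
    raise ValueError("no blob_name, prefix, or delete_container given")


plan = plan_delete(DeleteTarget(blob_name="a.csv", prefix="raw/"))
```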

rename() → NoReturn[source]¶

Rename the resource in the backend. Atomic. Not idempotent.

Raises:

RenameError – If the operation fails.
NotSupportedError – If the provider does not support renaming.

See also

Full contract: docs/DATASET_CONTRACT.md – rename()

close() → None[source]¶

There is no need to close the linked service; this method exists only to comply with the interface.

Returns:: None

static concat(dfs: list[pandas.DataFrame]) → pandas.DataFrame[source]¶

Concatenate a list of DataFrames into a single DataFrame.

Parameters:: dfs – DataFrames to concatenate.
Returns:: Concatenated DataFrame or empty DataFrame if input list is empty.
Return type:: DataFrame

get_details() → dict[str, Any][source]¶

Get details of the dataset.

Returns:: Details of the dataset.
Return type:: Dict[str, Any]

class ds_provider_azure_py_lib.dataset.AzureBlobDatasetSettings[source]¶

Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings

Settings for Azure Blob Storage dataset operations.

Provide either blob_name or prefix for read() and delete(); if both are specified, only blob_name is considered. create() does not use prefix and can be called only with blob_name. By default (when the create settings are not passed), create() attempts to create the container if it does not exist. delete() removes specific blob(s) by name or prefix.

container_name: str¶

blob_name: str | None = None¶

prefix: str | None = None¶

create: CreateSettings¶

purge: PurgeSettings¶

class ds_provider_azure_py_lib.dataset.AzureTable[source]¶

Bases: ds_resource_plugin_py_lib.common.resource.dataset.TabularDataset[AzureLinkedServiceType, AzureTableDatasetSettingsType, ds_provider_azure_py_lib.serde.AzureTableSerializer, ds_provider_azure_py_lib.serde.AzureTableDeserializer], Generic[AzureLinkedServiceType, AzureTableDatasetSettingsType]

Tabular dataset object which identifies data within a data store, such as table/csv/json/parquet/parquetdataset/ and other documents.

The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.

linked_service: AzureLinkedServiceType¶

settings: AzureTableDatasetSettingsType¶

__post_init__() → None[source]¶

property type: ds_provider_azure_py_lib.enums.ResourceType¶

Get the type of the dataset.

Returns:: ResourceType

_prepare_content(content: pandas.DataFrame) → dict[str, Any][source]¶

Ensure that the content is provided and is in the correct format.

Parameters:: content (pd.DataFrame) – The content to prepare.
Returns:: The prepared content.
Return type:: dict
Raises:: DatasetException – If the content is not a DataFrame, is empty, or does not contain required columns.

_get_table_client() → azure.data.tables.TableClient[source]¶

Return a TableClient for the currently configured table.

Returns:: TableClient

_build_transaction_from_input(operation: str, params: collections.abc.Mapping[str, Any] | None = None) → list[TransactionEntry][source]¶

Build a list of transaction entries from self.input.

Parameters:

operation (str) – The operation name as expected by TableClient.submit_transaction, e.g. “create”, “upsert”, “delete”.
params – Optional dict passed as the third item of each entry tuple when required, e.g. {“mode”: UpdateMode.REPLACE}.

Returns:

list[TransactionEntry]

Raises:

CreateError – If there is an error preparing content for creation.
UpdateError – If there is an error preparing content for update.
DeleteError – If there is an error preparing content for deletion.
DatasetException – If there is a general error preparing content.
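
As a rough illustration of the entry shape, the sketch below builds `(operation, entity)` or `(operation, entity, params)` tuples from a list of dicts. The names are hypothetical: `build_transaction` is not the library's method, `rows` stands in for the records of self.input, the plain string `"replace"` stands in for `UpdateMode.REPLACE`, and `ValueError` stands in for the dataset errors listed above.

```python
from collections.abc import Mapping
from typing import Any, Optional


def build_transaction(
    rows: list[dict[str, Any]],
    operation: str,
    params: Optional[Mapping[str, Any]] = None,
) -> list[tuple]:
    """Build one entry per row: (operation, entity[, params])."""
    entries: list[tuple] = []
    for row in rows:
        # Every Azure Table entity is identified by these two keys.
        if "PartitionKey" not in row or "RowKey" not in row:
            raise ValueError("each entity needs PartitionKey and RowKey")
        if params is None:
            entries.append((operation, row))
        else:
            entries.append((operation, row, dict(params)))
    return entries


rows = [
    {"PartitionKey": "users", "RowKey": "1", "name": "Ada"},
    {"PartitionKey": "users", "RowKey": "2", "name": "Linus"},
]
creates = build_transaction(rows, "create")
upserts = build_transaction(rows, "upsert", {"mode": "replace"})
```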

_submit_transaction(transaction: collections.abc.Iterable[TransactionEntry], error_cls: type[ds_resource_plugin_py_lib.common.resource.dataset.errors.DatasetException]) → None[source]¶

Submit the transaction and map TableTransactionError to the provided error_cls.

Parameters:

transaction (Iterable[TransactionEntry]) – The transaction to submit.
error_cls (builtins.type[DatasetException]) – The exception class to raise on error.

Raises:

error_cls – If submitting the transaction fails.
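
The error mapping can be pictured as a small wrapper that catches the backend failure and re-raises it as the class chosen by the caller. The sketch below uses stand-in exception classes and a hypothetical `submit_with_mapping` function; it does not call the Azure SDK and may differ from the actual implementation.

```python
from collections.abc import Callable, Iterable


class BackendTransactionError(Exception):
    """Stand-in for the SDK's TableTransactionError."""


class DatasetException(Exception):
    """Stand-in for the library's base dataset error."""


class CreateError(DatasetException):
    """Stand-in for the library's CreateError."""


def submit_with_mapping(
    submit: Callable[[Iterable[tuple]], None],
    transaction: Iterable[tuple],
    error_cls: type[DatasetException],
) -> None:
    """Run ``submit`` and re-raise backend failures as ``error_cls``."""
    try:
        submit(transaction)
    except BackendTransactionError as exc:
        # Chain the original error so the backend details are not lost.
        raise error_cls(f"transaction failed: {exc}") from exc


def failing_submit(_transaction: Iterable[tuple]) -> None:
    """Hypothetical backend call that always fails."""
    raise BackendTransactionError("entity already exists")


try:
    submit_with_mapping(failing_submit, [("create", {})], CreateError)
except CreateError as exc:
    mapped = exc
```

Passing the error class as an argument lets create(), update(), and delete() share one submission path while still raising their own error types.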

_delete_table() → None[source]¶

Deletes the entire table from Azure Table Storage.

Returns:: None
Raises:: DeleteError – If the table could not be deleted.

_create_table() → None[source]¶

Creates a table in Azure Table Storage if it does not exist.

Returns:: None
Raises:: CreateError – If the table could not be created due to an error other than it already existing.
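
The "create if it does not exist" behaviour amounts to treating an already-exists failure as success and wrapping every other failure. A minimal sketch, assuming an in-memory `FakeTableService` and stand-in exception classes (all hypothetical, none taken from the library or the Azure SDK):

```python
class TableExistsError(Exception):
    """Stand-in for the backend's 'resource already exists' error."""


class CreateError(Exception):
    """Stand-in for the library's CreateError."""


class FakeTableService:
    """Hypothetical in-memory backend used only for this sketch."""

    def __init__(self) -> None:
        self.tables: set[str] = set()

    def create_table(self, name: str) -> None:
        if not name:
            raise RuntimeError("invalid table name")
        if name in self.tables:
            raise TableExistsError(name)
        self.tables.add(name)


def ensure_table(service: FakeTableService, name: str) -> bool:
    """Create ``name`` if missing; return True only when it was created."""
    try:
        service.create_table(name)
    except TableExistsError:
        return False  # already there: not an error
    except Exception as exc:
        raise CreateError(f"could not create table {name!r}") from exc
    return True


service = FakeTableService()
first = ensure_table(service, "users")
second = ensure_table(service, "users")
```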

read(**_kwargs: Any) → None[source]¶

Read Azure Table Storage dataset.

Parameters:: _kwargs – Additional keyword arguments
Returns:: None
Raises:: ReadError – If there is an error reading from Azure Table Storage.

create(**_kwargs: Any) → None[source]¶

Create an entity in Azure Table Storage.

Returns:: None
Raises:: CreateError – If the entity could not be created.

update(**_kwargs: Any) → None[source]¶

Update an entity in Azure Table Storage.

Returns:: None

delete(**_kwargs: Any) → None[source]¶

Delete specific entities from Azure Table Storage.

Only entities specified in self.input are deleted, matched by PartitionKey and RowKey.

Parameters:: _kwargs – Additional keyword arguments
Returns:: None
Raises:: DeleteError – If there is an error deleting from Azure Table Storage.

rename() → NoReturn[source]¶

Rename the resource in the backend. Atomic. Not idempotent.

Raises:

RenameError – If the operation fails.
NotSupportedError – If the provider does not support renaming.

See also

Full contract: docs/DATASET_CONTRACT.md – rename()

close() → None[source]¶

There is no need to close the linked service; this method exists only to comply with the interface.

Returns:: None

list() → NoReturn[source]¶

Discover available resources and populate self.output with a DataFrame of resources and their metadata. Idempotent.

Raises:

ListError – If the operation fails.
NotSupportedError – If the provider does not support listing.

See also

Full contract: docs/DATASET_CONTRACT.md – list()

purge(**_kwargs: Any) → None[source]¶

Purge all entities from the table or drop the entire table.

If delete_table=True in settings, deletes the entire table. Otherwise, deletes all entities from the table, leaving it empty.

Returns:: None
Raises:: DeleteError – If there is an error purging from Azure Table Storage.
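
The difference between the two purge modes can be shown with an in-memory model. This is only a sketch: `FakeTableStore` and the free function `purge` are hypothetical and do not reflect how the library talks to Azure Table Storage.

```python
Key = tuple[str, str]  # (PartitionKey, RowKey)


class FakeTableStore:
    """Hypothetical in-memory store: table name -> entities keyed by Key."""

    def __init__(self) -> None:
        self.tables: dict[str, dict[Key, dict]] = {}


def purge(store: FakeTableStore, table_name: str, delete_table: bool) -> None:
    """Drop the table, or empty it while keeping the table itself."""
    if delete_table:
        store.tables.pop(table_name, None)
    else:
        store.tables.get(table_name, {}).clear()


store = FakeTableStore()
store.tables["users"] = {("p", "1"): {"name": "Ada"}}

purge(store, "users", delete_table=False)
emptied = dict(store.tables)  # the table remains, without entities

purge(store, "users", delete_table=True)
dropped = dict(store.tables)  # the table itself is gone
```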

upsert(**_kwargs: Any) → None[source]¶

Insert rows that do not exist, update rows that do, matched by identity columns defined in self.settings. Atomic.

Raises:

UpsertError – If the operation fails.
NotSupportedError – If the provider does not support upsert.

See also

Full contract: docs/DATASET_CONTRACT.md – upsert()

get_details() → dict[str, Any][source]¶

Get details about the dataset.

Returns:: dict[str, Any]

class ds_provider_azure_py_lib.dataset.AzureTableDatasetSettings[source]¶

Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings

Settings for Azure Table Storage dataset operations.

The read settings contain read-specific configuration that only applies to the read() operation, not to create(), delete(), update(), etc.

table_name: str¶

purge: PurgeSettings¶

Purge-specific settings. Only applies to the purge() operation.

read: ReadSettings¶

Read-specific settings. Only applies to the read() operation.

By default, read() reads the table without a filter.