ds_provider_azure_py_lib.dataset

File: __init__.py Region: ds_provider_azure_py_lib/dataset

Azure Datasets: Table and Blob

This module implements datasets for Azure Table Storage and Azure Blob Storage.

Example (AzureTable):
>>> azure_table = AzureTable(
...     settings=AzureTableDatasetSettings(
...         table_name="users",
...         partition_key="partition_key",
...         row_key="row_key",
...         query_filter="additional query filter",
...         delete_table=False,
...     ),
...     linked_service=AzureLinkedService(
...         settings=AzureLinkedServiceSettings(
...             account_name="account name",
...             access_key="access key"
...         ),
...     ),
... )
>>> azure_table.read()
>>> table_data = azure_table.output

Example (AzureBlob):
>>> azure_blob = AzureBlob(
...     deserializer=AzureBlobDeserializer(format=DatasetStorageFormatType.CSV),
...     serializer=AzureBlobSerializer(format=DatasetStorageFormatType.CSV),
...     settings=AzureBlobDatasetSettings(
...         container_name="my-container",
...         blob_name="path/to/example_file.csv",
...         prefix=None,  # for multiple blobs, provide a prefix instead of blob_name
...         create=CreateSettings(
...             overwrite_blob_if_exists=True,  # overwrite an existing blob instead of raising an error
...             new_container=True,  # create the container if it is missing instead of raising an error
...         ),
...         purge=PurgeSettings(
...             delete_container=True,  # delete the whole container rather than only the blob(s)
...         ),
...     ),
...     linked_service=AzureLinkedService(
...         settings=AzureLinkedServiceSettings(
...             account_name="account name",
...             access_key="access key"
...         ),
...         id=uuid.uuid4(),
...         name="testazurepackage",
...         version="0.0.1",
...         description="testazurepackage",
...     ),
...     id=uuid.uuid4(),
...     name="testazurepackage",
...     version="0.0.1",
...     description="testazurepackage",
... )
>>> azure_blob.read()
>>> blob_data = azure_blob.output

Classes

AzureBlob

Tabular dataset object which identifies data within a data store, such as table, CSV, JSON, Parquet, Parquet dataset, and other documents.

AzureBlobDatasetSettings

Settings for Azure Blob Storage dataset operations.

AzureTable

Tabular dataset object which identifies data within a data store, such as table, CSV, JSON, Parquet, Parquet dataset, and other documents.

AzureTableDatasetSettings

Settings for Azure Table Storage dataset operations.

Package Contents

class ds_provider_azure_py_lib.dataset.AzureBlob

Bases: ds_resource_plugin_py_lib.common.resource.dataset.base.TabularDataset[AzureLinkedServiceType, AzureBlobDatasetSettingsType, ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer, ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer], Generic[AzureLinkedServiceType, AzureBlobDatasetSettingsType]

Tabular dataset object which identifies data within a data store, such as table, CSV, JSON, Parquet, Parquet dataset, and other documents.

The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.

linked_service: AzureLinkedServiceType
settings: AzureBlobDatasetSettingsType
serializer: ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer | None
deserializer: ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer | None
property type: ds_provider_azure_py_lib.enums.ResourceType

Get the type of the dataset.

Returns:

ResourceType

_list_blobs(prefix: str) -> azure.core.paging.ItemPaged[azure.storage.blob.BlobProperties]

List all blobs in the container with a specific prefix.

Parameters:

prefix – a string prefix to match one or multiple blobs.

Returns:

An iterable of BlobProperties matching the prefix.

Return type:

ItemPaged[BlobProperties]

_read_blob(blob: str) -> pandas.DataFrame

Read a specific blob in the container.

Parameters:

blob – name of the blob to read.

Returns:

content of the blob as a DataFrame.

Return type:

pd.DataFrame

_read_blobs(prefix: str) -> pandas.DataFrame

Read all blobs in the container with a specific prefix.

Parameters:

prefix – a string prefix to match one or multiple blobs.

Returns:

Content of all blobs concatenated as a DataFrame.

Return type:

pd.DataFrame
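
The prefix-based read described above can be pictured with a small in-memory sketch. It is not the library's implementation: the `fake_container` dict stands in for an Azure container, `read_blobs_sketch` is a hypothetical helper, and every matching blob is assumed to hold CSV data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a container: blob name -> raw CSV bytes.
fake_container = {
    "sales/2024-01.csv": b"id,amount\n1,10\n2,20\n",
    "sales/2024-02.csv": b"id,amount\n3,30\n",
    "other/readme.csv": b"id,amount\n9,99\n",
}


def read_blobs_sketch(container: dict[str, bytes], prefix: str) -> pd.DataFrame:
    """Read every blob whose name starts with ``prefix`` and concatenate the results."""
    frames = [
        pd.read_csv(io.BytesIO(data))
        for name, data in sorted(container.items())
        if name.startswith(prefix)
    ]
    if not frames:
        # No blob matched the prefix: return an empty DataFrame.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


combined = read_blobs_sketch(fake_container, "sales/")
```

Only the two blobs under `sales/` contribute rows; a prefix that matches nothing yields an empty DataFrame in this sketch.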

_create_container() -> None

Create a container in the Azure Blob Storage.

Raises:

CreateError – If the container creation fails.

Returns:

None

_create_blob(stream: bytes, blob: str) -> None

Create a specific blob in the container.

Parameters:
  • stream – data stream to upload to the blob.

  • blob – name of the blob to create.

Raises:

CreateError – If the blob creation fails.

Returns:

None

_delete_blob(blob: str) -> pandas.DataFrame

Delete a specific blob in the container.

Parameters:

blob – name of the blob to delete.

Returns:

Empty DataFrame upon successful deletion.

Return type:

pd.DataFrame

Raises:

DeleteError – If the blob deletion fails.

_delete_blobs(prefix: str) -> pandas.DataFrame

Delete all blobs in the container with a specific prefix.

Parameters:

prefix – a string prefix to match one or multiple blobs.

Returns:

Empty DataFrame upon successful deletion of all blobs.

Return type:

pd.DataFrame

Raises:

DeleteError – If one or more blob deletions fail.

read(**_kwargs: Any) -> None

Read Azure Blob Storage dataset.

Parameters:

_kwargs – Additional keyword arguments to pass to the request.

Returns:

None

Raises:

ReadError – If reading the blob(s) fails.

create(**_kwargs: Any) -> None

Create a blob in the container.

Parameters:

_kwargs – Additional keyword arguments to pass to the request. (not used)

Returns:

None

Raises:

CreateError – If the blob creation fails.

update() -> NoReturn

Update existing rows in the target matched by identity columns defined in self.settings. Atomic. Must not insert new rows.

Raises:
  • UpdateError – If the operation fails.

  • NotSupportedError – If the provider does not support update.

See also

Full contract: docs/DATASET_CONTRACT.md, update()

list() -> NoReturn

Discover available resources and populate self.output with a DataFrame of resources and their metadata. Idempotent.

Raises:
  • ListError – If the operation fails.

  • NotSupportedError – If the provider does not support listing.

See also

Full contract: docs/DATASET_CONTRACT.md, list()

purge(**_kwargs: Any) -> None

Purge (remove all content from) the container.

For Azure Blob Storage, this deletes all blobs from the container, leaving the container empty. The container itself is not deleted.

Parameters:

_kwargs – Additional keyword arguments to pass to the request. (not used)

Returns:

None

Raises:

DeleteError – If the purge operation fails.

upsert() -> NoReturn

Insert rows that do not exist, update rows that do, matched by identity columns defined in self.settings. Atomic.

Raises:
  • UpsertError – If the operation fails.

  • NotSupportedError – If the provider does not support upsert.

See also

Full contract: docs/DATASET_CONTRACT.md, upsert()

delete(**_kwargs: Any) -> None

Delete specific blob(s) or the entire container from Azure Blob Storage.

For Azure Blob Storage, a “row” is a blob. This method deletes:
  • A specific blob, by blob_name.

  • Multiple blobs, by prefix.

  • The entire container, if delete_container=True and neither blob_name nor prefix is provided.

Parameters:

_kwargs – Additional keyword arguments to pass to the request. (not used)

Returns:

None

Raises:

DeleteError – If the deletion fails or the requirements are not met.
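
The precedence described for delete() can be sketched as a small decision function. This is an illustration only, not the library's code: `resolve_delete_target` is a hypothetical helper, the returned strings are made-up labels, and `ValueError` stands in for the `DeleteError` the real method is documented to raise.

```python
from typing import Optional


def resolve_delete_target(
    blob_name: Optional[str],
    prefix: Optional[str],
    delete_container: bool,
) -> str:
    """Decide what a delete call would act on, following the documented order."""
    if blob_name:
        # A specific blob takes priority over everything else.
        return f"blob:{blob_name}"
    if prefix:
        # Otherwise, every blob whose name starts with the prefix.
        return f"prefix:{prefix}"
    if delete_container:
        # Only with no blob_name and no prefix is the container itself removed.
        return "container"
    # Stand-in for DeleteError: nothing was selected for deletion.
    raise ValueError("Provide blob_name, prefix, or delete_container=True")
```

For example, `resolve_delete_target("a.csv", "logs/", True)` selects the single blob, because blob_name is checked first.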

rename() -> NoReturn

Rename the resource in the backend. Atomic. Not idempotent.

Raises:
  • RenameError – If the operation fails.

  • NotSupportedError – If the provider does not support renaming.

See also

Full contract: docs/DATASET_CONTRACT.md, rename()

close() -> None

The linked service does not need to be closed. This method exists only to comply with the interface.

Returns:

None

static concat(dfs: list[pandas.DataFrame]) -> pandas.DataFrame

Concatenate a list of DataFrames into a single DataFrame.

Parameters:

dfs – DataFrames to concatenate.

Returns:

Concatenated DataFrame or empty DataFrame if input list is empty.

Return type:

DataFrame
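
A minimal sketch of the documented behaviour, assuming a plain pandas implementation. `concat_sketch` is a hypothetical name, and the use of `ignore_index=True` is an assumption rather than something the source states. The explicit guard matters because `pandas.concat` raises `ValueError` when given an empty list.

```python
import pandas as pd


def concat_sketch(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate DataFrames, returning an empty DataFrame for an empty list."""
    if not dfs:
        # pd.concat([]) would raise ValueError, so handle this case explicitly.
        return pd.DataFrame()
    # Assumption: the row index is rebuilt rather than preserved.
    return pd.concat(dfs, ignore_index=True)


first = pd.DataFrame({"id": [1, 2]})
second = pd.DataFrame({"id": [3]})
merged = concat_sketch([first, second])
nothing = concat_sketch([])
```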

get_details() -> dict[str, Any]

Get details of the dataset.

Returns:

Details of the dataset.

Return type:

Dict[str, Any]

class ds_provider_azure_py_lib.dataset.AzureBlobDatasetSettings

Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings

Settings for Azure Blob Storage dataset operations.

Provide either blob_name or prefix for read() and delete(); if both are specified, only blob_name is considered. create() does not use prefix and can be called only with blob_name. If the create settings are not passed, create() by default attempts to create the container when it does not exist. delete() removes specific blob(s) by name or prefix.

container_name: str
blob_name: str | None = None
prefix: str | None = None
create: CreateSettings
purge: PurgeSettings

class ds_provider_azure_py_lib.dataset.AzureTable

Bases: ds_resource_plugin_py_lib.common.resource.dataset.TabularDataset[AzureLinkedServiceType, AzureTableDatasetSettingsType, ds_provider_azure_py_lib.serde.AzureTableSerializer, ds_provider_azure_py_lib.serde.AzureTableDeserializer], Generic[AzureLinkedServiceType, AzureTableDatasetSettingsType]

Tabular dataset object which identifies data within a data store, such as table, CSV, JSON, Parquet, Parquet dataset, and other documents.

The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.

linked_service: AzureLinkedServiceType
settings: AzureTableDatasetSettingsType
__post_init__() -> None
property type: ds_provider_azure_py_lib.enums.ResourceType

Get the type of the Dataset.

Returns:

ResourceType

_prepare_content(content: pandas.DataFrame) -> dict[str, Any]

Ensure that the content is provided and is in the correct format.

Parameters:

content (pd.DataFrame) – The content to prepare.

Returns:

The prepared content.

Return type:

dict

Raises:

DatasetException – If the content is not a DataFrame, is empty, or does not contain required columns.

_get_table_client() -> azure.data.tables.TableClient

Return a TableClient for the currently configured table.

Returns:

TableClient

_build_transaction_from_input(operation: str, params: collections.abc.Mapping[str, Any] | None = None) -> list[TransactionEntry]

Build a list of transaction entries from self.input.

Parameters:
  • operation (str) – The operation name as expected by TableClient.submit_transaction, e.g. "create", "upsert", "delete".

  • params – Optional mapping passed as the third item of each entry tuple when required, e.g. {"mode": UpdateMode.REPLACE}.

Returns:

list[TransactionEntry]

Raises:
  • CreateError – If there is an error preparing content for creation.

  • UpdateError – If there is an error preparing content for update.

  • DeleteError – If there is an error preparing content for deletion.

  • DatasetException – If there is a general error preparing content.
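
The shape of the entries can be sketched without touching Azure at all. The following is an assumption-laden illustration, not the library's code: `build_transaction_sketch` is a hypothetical helper, the input DataFrame is assumed to have one row per entity, and the string `"replace"` is a placeholder for `UpdateMode.REPLACE`. Each entry is a 2-tuple of operation and entity, or a 3-tuple when extra parameters are supplied.

```python
from collections.abc import Mapping
from typing import Any, Optional

import pandas as pd


def build_transaction_sketch(
    frame: pd.DataFrame,
    operation: str,
    params: Optional[Mapping[str, Any]] = None,
) -> list[tuple[Any, ...]]:
    """Turn each row into (operation, entity) or (operation, entity, params)."""
    entries: list[tuple[Any, ...]] = []
    for entity in frame.to_dict(orient="records"):
        if params is None:
            entries.append((operation, entity))
        else:
            # Copy params so every entry owns its own dict.
            entries.append((operation, entity, dict(params)))
    return entries


rows = pd.DataFrame(
    [
        {"PartitionKey": "users", "RowKey": "1", "name": "Ada"},
        {"PartitionKey": "users", "RowKey": "2", "name": "Grace"},
    ]
)
plain = build_transaction_sketch(rows, "create")
with_params = build_transaction_sketch(rows, "upsert", {"mode": "replace"})
```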

_submit_transaction(transaction: collections.abc.Iterable[TransactionEntry], error_cls: type[ds_resource_plugin_py_lib.common.resource.dataset.errors.DatasetException]) -> None

Submit the transaction and map TableTransactionError to the provided error_cls.

Parameters:
  • transaction (Iterable[TransactionEntry]) – The transaction to submit.

  • error_cls (builtins.type[DatasetException]) – The exception class to raise on error.

Raises:

error_cls – If submitting the transaction fails.
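
The error-mapping pattern can be illustrated with stand-in classes. Everything named here is hypothetical: `BackendTransactionError` plays the role of `TableTransactionError`, `UpdateErrorSketch` plays the role of a dataset error class, and `send` is an injected callable used in place of a real table client.

```python
from collections.abc import Callable, Iterable
from typing import Any


class BackendTransactionError(Exception):
    """Stand-in for the backend's transaction error."""


class UpdateErrorSketch(Exception):
    """Stand-in for a dataset-level error class passed as error_cls."""


def submit_sketch(
    send: Callable[[list[Any]], None],
    transaction: Iterable[Any],
    error_cls: type[Exception],
) -> None:
    """Submit the transaction and re-raise backend failures as error_cls."""
    try:
        send(list(transaction))
    except BackendTransactionError as exc:
        # Chain the original exception so the root cause stays visible.
        raise error_cls(f"Transaction failed: {exc}") from exc


def failing_send(_operations: list[Any]) -> None:
    raise BackendTransactionError("entity already exists")
```

Passing the error class as an argument lets one submit helper serve create, update, and delete callers while each still raises its own error type.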

_delete_table() -> None

Deletes the entire table from Azure Table Storage.

Returns:

None

Raises:

DeleteError – If the table could not be deleted.

_create_table() -> None

Creates a table in Azure Table Storage if it does not exist.

Returns:

None

Raises:

CreateError – If the table could not be created due to an error other than it already existing.

read(**_kwargs: Any) -> None

Read Azure Table Storage dataset.

Parameters:

_kwargs – Additional keyword arguments.

Returns:

None

Raises:

ReadError – If there is an error reading from Azure Table Storage.

create(**_kwargs: Any) -> None

Create an entity in Azure Table Storage.

Returns:

None

Raises:

CreateError – If the entity could not be created.

update(**_kwargs: Any) -> None

Update an entity in Azure Table Storage.

Returns:

None

delete(**_kwargs: Any) -> None

Delete specific entities from Azure Table Storage.

Only entities specified in self.input are deleted, matched by PartitionKey and RowKey.

Parameters:

_kwargs – Additional keyword arguments.

Returns:

None

Raises:

DeleteError – If there is an error deleting from Azure Table Storage.

rename() -> NoReturn

Rename the resource in the backend. Atomic. Not idempotent.

Raises:
  • RenameError – If the operation fails.

  • NotSupportedError – If the provider does not support renaming.

See also

Full contract: docs/DATASET_CONTRACT.md, rename()

close() -> None

The linked service does not need to be closed. This method exists only to comply with the interface.

Returns:

None

list() -> NoReturn

Discover available resources and populate self.output with a DataFrame of resources and their metadata. Idempotent.

Raises:
  • ListError – If the operation fails.

  • NotSupportedError – If the provider does not support listing.

See also

Full contract: docs/DATASET_CONTRACT.md, list()

purge(**_kwargs: Any) -> None

Purge all entities from the table or drop the entire table.

If delete_table=True in settings, deletes the entire table. Otherwise, deletes all entities from the table, leaving it empty.

Returns:

None

Raises:

DeleteError – If there is an error purging from Azure Table Storage.

upsert(**_kwargs: Any) -> None

Insert rows that do not exist, update rows that do, matched by identity columns defined in self.settings. Atomic.

Raises:
  • UpsertError – If the operation fails.

  • NotSupportedError – If the provider does not support upsert.

See also

Full contract: docs/DATASET_CONTRACT.md, upsert()

get_details() -> dict[str, Any]

Get details about the dataset.

Returns:

dict[str, Any]

class ds_provider_azure_py_lib.dataset.AzureTableDatasetSettings

Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings

Settings for Azure Table Storage dataset operations.

The read settings contain read-specific configuration that applies only to the read() operation, not to create(), delete(), update(), etc.

table_name: str
purge: PurgeSettings

Purge-specific settings. Only applies to the purge() operation.

read: ReadSettings

Read-specific settings. Only applies to the read() operation.

By default, read() reads the table without a filter.