ds_provider_azure_py_lib.dataset.blob

File: blob.py Region: ds_provider_azure_py_lib/dataset/blob

Azure Blob Dataset

This module implements a blob dataset for Azure Blob Storage.

Example

>>> azure_blob = AzureBlob(
...     deserializer=AzureBlobDeserializer(format=DatasetStorageFormatType.CSV),
...     serializer=AzureBlobSerializer(format=DatasetStorageFormatType.CSV),
...     settings=AzureBlobDatasetSettings(
...         container_name="my-container",
...         blob_name="path/to/example_file.csv",
...         prefix=None, # for multiple blobs, provide a prefix instead of blob_name
...         create=CreateSettings(
...             overwrite_blob_if_exists=True,  # overwrite an existing blob instead of raising an error
...             new_container=True,  # create the container if it does not exist instead of raising an error
...         ),
...         purge=PurgeSettings(
...             delete_container=True,  # confirm deletion of the container via delete() method
...         ),
...     ),
...     linked_service=AzureLinkedService(
...         settings=AzureLinkedServiceSettings(
...             account_name="account name",
...             access_key="access key"
...         ),
...         id=uuid.uuid4(),
...         name="testazurepackage",
...         version="0.0.1",
...         description="testazurepackage",
...     ),
...     id=uuid.uuid4(),
...     name="testazurepackage",
...     version="0.0.1",
...     description="testazurepackage",
... )
>>> azure_blob.read()
>>> blob_data = azure_blob.output

Attributes

logger

AzureBlobDatasetSettingsType

AzureLinkedServiceType

Classes

CreateSettings

Settings for create operations.

PurgeSettings

Settings for purge operations.

AzureBlobDatasetSettings

Settings for Azure Blob Storage dataset operations.

AzureBlob

Tabular dataset object which identifies data within a data store, such as table/csv/json/parquet/parquetdataset/ and other documents.

Module Contents

ds_provider_azure_py_lib.dataset.blob.logger
class ds_provider_azure_py_lib.dataset.blob.CreateSettings

Settings for create operations.

overwrite_blob_if_exists: bool = True

Controls whether an existing blob is overwritten when a blob with the same name already exists. If True, the create operation overwrites the existing blob with the new content. If False, the create operation raises an error.

new_container: bool = True

Confirms creation of a new container if it does not already exist. If True, the create operation attempts to create the missing container. If False, the create operation raises an error if the container does not exist.
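The two flags above can be read as the following decision logic. This is a minimal sketch against an in-memory dictionary, not the library's implementation: `CreateSettingsSketch` and `create_blob` are hypothetical names, and built-in exceptions stand in for the library's CreateError.

```python
from dataclasses import dataclass


@dataclass
class CreateSettingsSketch:
    """Stand-in mirroring the two documented flags (not the library class)."""

    overwrite_blob_if_exists: bool = True
    new_container: bool = True


def create_blob(store: dict, container: str, blob: str, data: bytes,
                settings: CreateSettingsSketch) -> None:
    """Apply the documented rules to a {container: {blob: bytes}} store."""
    if container not in store:
        if not settings.new_container:
            # Assumption: the real library raises CreateError here.
            raise LookupError(f"container {container!r} does not exist")
        store[container] = {}
    if blob in store[container] and not settings.overwrite_blob_if_exists:
        # Assumption: the real library raises CreateError here.
        raise FileExistsError(f"blob {blob!r} already exists")
    store[container][blob] = data
```

With the default settings, a second call for the same blob name replaces the stored bytes; with `overwrite_blob_if_exists=False` the same call raises instead.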

class ds_provider_azure_py_lib.dataset.blob.PurgeSettings

Settings for purge operations.

delete_container: bool = False

Confirm deletion of the entire container when purge() is called. If True, delete() will delete the container. If False, delete() will remove all blobs from the container but keep the container itself.

class ds_provider_azure_py_lib.dataset.blob.AzureBlobDatasetSettings

Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings

Settings for Azure Blob Storage dataset operations.

Either blob_name or prefix must be provided for read()/delete(); if both are specified, only blob_name is considered. prefix is not used by create(), which can be called only with blob_name. If no create settings are passed, create() by default attempts to create the container if it does not exist. delete() removes specific blob(s) by name or prefix.
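The precedence between blob_name and prefix described above can be sketched as a small helper. This is an illustration only: `resolve_target` and its return shape are hypothetical, and a ValueError stands in for whatever error the library raises when neither value is set.

```python
from typing import Optional, Tuple


def resolve_target(blob_name: Optional[str], prefix: Optional[str]) -> Tuple[str, str]:
    """Return ("blob", name) or ("prefix", value) per the documented precedence."""
    if blob_name is not None:
        # blob_name wins, even when prefix is also set.
        return ("blob", blob_name)
    if prefix is not None:
        return ("prefix", prefix)
    # Assumption: the real library reports this through its own error types.
    raise ValueError("either blob_name or prefix must be provided")
```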

container_name: str
blob_name: str | None = None
prefix: str | None = None
create: CreateSettings
purge: PurgeSettings
ds_provider_azure_py_lib.dataset.blob.AzureBlobDatasetSettingsType
ds_provider_azure_py_lib.dataset.blob.AzureLinkedServiceType
class ds_provider_azure_py_lib.dataset.blob.AzureBlob

Bases: ds_resource_plugin_py_lib.common.resource.dataset.base.TabularDataset[AzureLinkedServiceType, AzureBlobDatasetSettingsType, ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer, ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer], Generic[AzureLinkedServiceType, AzureBlobDatasetSettingsType]

Tabular dataset object which identifies data within a data store, such as table/csv/json/parquet/parquetdataset/ and other documents.

The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.

linked_service: AzureLinkedServiceType
settings: AzureBlobDatasetSettingsType
serializer: ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer | None
deserializer: ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer | None
property type: ds_provider_azure_py_lib.enums.ResourceType

Get the type of the dataset.

Returns:

ResourceType

_list_blobs(prefix: str) -> azure.core.paging.ItemPaged[azure.storage.blob.BlobProperties]

List all blobs in the container with a specific prefix.

Parameters:

prefix – a string prefix to match one or multiple blobs.

Returns:

An iterable of BlobProperties matching the prefix.

Return type:

ItemPaged[BlobProperties]

_read_blob(blob: str) -> pandas.DataFrame

Read a specific blob in the container.

Parameters:

blob – name of the blob to read.

Returns:

Content of the blob as a DataFrame.

Return type:

pd.DataFrame

_read_blobs(prefix: str) -> pandas.DataFrame

Read all blobs in the container with a specific prefix.

Parameters:

prefix – a string prefix to match one or multiple blobs.

Returns:

Content of all blobs concatenated as a DataFrame.

Return type:

pd.DataFrame

_create_container() -> None

Create a container in Azure Blob Storage.

Raises:

CreateError – If the container creation fails.

Returns:

None

_create_blob(stream: bytes, blob: str) -> None

Create a specific blob in the container.

Parameters:
  • stream – data stream to upload to the blob.

  • blob – name of the blob to create.

Raises:

CreateError – If the blob creation fails.

Returns:

None

_delete_blob(blob: str) -> pandas.DataFrame

Delete a specific blob in the container.

Parameters:

blob – name of the blob to delete.

Returns:

Empty DataFrame upon successful deletion.

Return type:

pd.DataFrame

Raises:

DeleteError – If the blob deletion fails.

_delete_blobs(prefix: str) -> pandas.DataFrame

Delete all blobs in the container with a specific prefix.

Parameters:

prefix – a string prefix to match one or multiple blobs.

Returns:

Empty DataFrame upon successful deletion of all blobs.

Return type:

pd.DataFrame

Raises:

DeleteError – If one or more blob deletions fail.

read(**_kwargs: Any) -> None

Read Azure Blob Storage dataset.

Parameters:

_kwargs – Additional keyword arguments to pass to the request.

Returns:

None

Raises:

ReadError – If reading the blob(s) fails.

create(**_kwargs: Any) -> None

Create a blob in the container.

Parameters:

_kwargs – Additional keyword arguments to pass to the request. (not used)

Returns:

None

Raises:

CreateError – If the blob creation fails.

update() -> NoReturn

Update existing rows in the target matched by identity columns defined in self.settings. Atomic. Must not insert new rows.

Raises:
  • UpdateError – If the operation fails.

  • NotSupportedError – If the provider does not support update.

See also

Full contract: docs/DATASET_CONTRACT.md, update()
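The NoReturn annotation on update() suggests that this provider never completes the call normally and always raises. A minimal sketch of that pattern, assuming the method raises NotSupportedError (the class below is a local stand-in, and `BlobDatasetSketch` is a hypothetical name, not the library's class):

```python
from typing import NoReturn


class NotSupportedError(Exception):
    """Local stand-in for the error named in the contract above."""


class BlobDatasetSketch:
    """Hypothetical dataset whose update() is declared but unsupported."""

    def update(self) -> NoReturn:
        # NoReturn tells type checkers this call never returns normally.
        raise NotSupportedError("update is not supported by this provider")
```

Callers that want to fall back to another operation can catch the error around the call rather than checking a return value.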

list() -> NoReturn

Discover available resources and populate self.output with a DataFrame of resources and their metadata. Idempotent.

Raises:
  • ListError – If the operation fails.

  • NotSupportedError – If the provider does not support listing.

See also

Full contract: docs/DATASET_CONTRACT.md, list()

purge(**_kwargs: Any) -> None

Purge (remove all content from) the container.

For Azure Blob Storage, this deletes all blobs from the container, leaving the container empty. The container itself is not deleted.

Parameters:

_kwargs – Additional keyword arguments to pass to the request. (not used)

Returns:

None

Raises:

DeleteError – If the purge operation fails.

upsert() -> NoReturn

Insert rows that do not exist, update rows that do, matched by identity columns defined in self.settings. Atomic.

Raises:
  • UpsertError – If the operation fails.

  • NotSupportedError – If the provider does not support upsert.

See also

Full contract: docs/DATASET_CONTRACT.md, upsert()

delete(**_kwargs: Any) -> None

Delete specific blob(s) or the entire container from Azure Blob Storage.

For Azure Blob Storage, a “row” is a blob. This method deletes:
  • A specific blob, selected by blob_name.

  • Multiple blobs, selected by prefix.

  • The entire container, if delete_container=True and neither blob_name nor prefix is provided.

Parameters:

_kwargs – Additional keyword arguments to pass to the request. (not used)

Returns:

None

Raises:

DeleteError – If the deletion fails or the requirements are not met.
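The three delete cases described for this method can be sketched against an in-memory store. This is a rough illustration, not the library's code: `delete_sketch` is a hypothetical function, the store is a plain {container: {blob: bytes}} dictionary, and a ValueError stands in for DeleteError.

```python
from typing import Optional


def delete_sketch(containers: dict, container: str,
                  blob_name: Optional[str], prefix: Optional[str],
                  delete_container: bool = False) -> None:
    """Dispatch between single-blob, prefix, and whole-container deletion."""
    blobs = containers[container]
    if blob_name is not None:
        # A specific blob; blob_name takes precedence over prefix.
        del blobs[blob_name]
    elif prefix is not None:
        # Collect matching names first so the dict is not mutated while iterating.
        for name in [n for n in blobs if n.startswith(prefix)]:
            del blobs[name]
    elif delete_container:
        del containers[container]
    else:
        # Assumption: the real method raises DeleteError in this case.
        raise ValueError("nothing to delete: set blob_name, prefix, or delete_container")
```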

rename() -> NoReturn

Rename the resource in the backend. Atomic. Not idempotent.

Raises:
  • RenameError – If the operation fails.

  • NotSupportedError – If the provider does not support renaming.

See also

Full contract: docs/DATASET_CONTRACT.md, rename()

close() -> None

No-op. The linked service does not need to be closed; this method exists only to comply with the interface.

Returns:

None

static concat(dfs: list[pandas.DataFrame]) -> pandas.DataFrame

Concatenate a list of DataFrames into a single DataFrame.

Parameters:

dfs – DataFrames to concatenate.

Returns:

Concatenated DataFrame or empty DataFrame if input list is empty.

Return type:

DataFrame
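The documented behaviour of concat, including the empty-list case, could look roughly like the sketch below. `concat_sketch` is a hypothetical name, and resetting the index with `ignore_index=True` is an assumption; the library may preserve the original indices.

```python
import pandas as pd


def concat_sketch(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate frames row-wise; return an empty DataFrame for an empty list."""
    if not dfs:
        # pd.concat raises ValueError on an empty list, so guard explicitly.
        return pd.DataFrame()
    # Assumption: a fresh 0..n-1 index is wanted for the combined frame.
    return pd.concat(dfs, ignore_index=True)
```

For example, passing two single-row frames with the same column yields one two-row frame, while passing `[]` yields a frame whose `.empty` attribute is True.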

get_details() -> dict[str, Any]

Get details of the dataset.

Returns:

Details of the dataset.

Return type:

Dict[str, Any]