ds_provider_azure_py_lib.dataset
File: __init__.py
Region: ds_provider_azure_py_lib/dataset
Azure Datasets: Table and Blob
This module implements datasets for Azure Table and Blob storage.
- Example (AzureTable):
    >>> azure_table = AzureTable(
    ...     settings=AzureTableDatasetSettings(
    ...         table_name="users",
    ...         partition_key="partition_key",
    ...         row_key="row_key",
    ...         query_filter="additional query filter",
    ...         delete_table=False,
    ...     ),
    ...     linked_service=AzureLinkedService(
    ...         settings=AzureLinkedServiceSettings(
    ...             account_name="account name",
    ...             access_key="access key"
    ...         ),
    ...     ),
    ... )
    >>> azure_table.read()
    >>> table_data = azure_table.output
- Example (AzureBlob):
    >>> import uuid
    >>> azure_blob = AzureBlob(
    ...     deserializer=AzureBlobDeserializer(format=DatasetStorageFormatType.CSV),
    ...     serializer=AzureBlobSerializer(format=DatasetStorageFormatType.CSV),
    ...     settings=AzureBlobDatasetSettings(
    ...         container_name="my-container",
    ...         blob_name="path/to/example_file.csv",
    ...         prefix=None,  # for multiple blobs, provide a prefix instead of blob_name
    ...         create=CreateSettings(
    ...             overwrite_blob_if_exists=True,  # overwrite existing blob or raise an error.
    ...             new_container=True  # create container if missing or raise an error.
    ...         ),
    ...         purge=DeleteSettings(
    ...             delete_container=True  # delete the container or only delete the blob
    ...         ),
    ...     ),
    ...     linked_service=AzureLinkedService(
    ...         settings=AzureLinkedServiceSettings(
    ...             account_name="account name",
    ...             access_key="access key"
    ...         ),
    ...         id=uuid.uuid4(),
    ...         name="testazurepackage",
    ...         version="0.0.1",
    ...         description="testazurepackage",
    ...     ),
    ...     id=uuid.uuid4(),
    ...     name="testazurepackage",
    ...     version="0.0.1",
    ...     description="testazurepackage"
    ... )
    >>> azure_blob.read()
    >>> blob_data = azure_blob.output
Submodules¶
Classes
AzureBlob | Tabular dataset object which identifies data within a data store.
AzureBlobDatasetSettings | Settings for Azure Blob Storage dataset operations.
AzureTable | Tabular dataset object which identifies data within a data store.
AzureTableDatasetSettings | Settings for Azure Table Storage dataset operations.
Package Contents
- class ds_provider_azure_py_lib.dataset.AzureBlob
Bases: ds_resource_plugin_py_lib.common.resource.dataset.base.TabularDataset[AzureLinkedServiceType, AzureBlobDatasetSettingsType, ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer, ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer], Generic[AzureLinkedServiceType, AzureBlobDatasetSettingsType]
Tabular dataset object which identifies data within a data store, such as a table, CSV, JSON, Parquet, Parquet dataset, and other documents.
The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.
- linked_service: AzureLinkedServiceType
- settings: AzureBlobDatasetSettingsType
- serializer: ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer | None
- deserializer: ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer | None
- property type: ds_provider_azure_py_lib.enums.ResourceType
Get the type of the dataset.
- Returns:
ResourceType
- _list_blobs(prefix: str) -> azure.core.paging.ItemPaged[azure.storage.blob.BlobProperties]
List all blobs in the container with a specific prefix.
- Parameters:
prefix – a string prefix to match one or multiple blobs.
- Returns:
An iterable of BlobProperties matching the prefix.
- Return type:
ItemPaged[BlobProperties]
- _read_blob(blob: str) -> pandas.DataFrame
Read a specific blob in the container.
- Parameters:
blob – name of the blob to read.
- Returns:
content of the blob as a DataFrame.
- Return type:
pd.DataFrame
- _read_blobs(prefix: str) -> pandas.DataFrame
Read all blobs in the container with a specific prefix.
- Parameters:
prefix – a string prefix to match one or multiple blobs.
- Returns:
Content of all blobs concatenated as a DataFrame.
- Return type:
pd.DataFrame
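The prefix-based read amounts to three steps: filter blob names by prefix, parse each match, and concatenate the results. The following is a minimal sketch of that flow, not the library's implementation. It assumes CSV content, uses an in-memory dict in place of the container, and uses plain lists of row dicts in place of a pandas DataFrame. `FAKE_CONTAINER` and `read_blobs` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

import csv
import io

# Hypothetical stand-in for a container: blob name -> raw CSV bytes.
FAKE_CONTAINER = {
    "exports/2024/a.csv": b"id,name\n1,ada\n2,bob\n",
    "exports/2024/b.csv": b"id,name\n3,cy\n",
    "other/c.csv": b"id,name\n9,zed\n",
}


def read_blobs(container: dict[str, bytes], prefix: str) -> list[dict[str, str]]:
    """Parse every blob whose name starts with ``prefix`` and concatenate the rows."""
    rows: list[dict[str, str]] = []
    for name in sorted(container):  # deterministic order for the concatenation
        if not name.startswith(prefix):
            continue
        text = container[name].decode("utf-8")
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


rows = read_blobs(FAKE_CONTAINER, "exports/2024/")
print([row["id"] for row in rows])
```

With real DataFrames, the accumulation step would typically be a single concatenation over the per-blob frames rather than a list extend.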
- _create_container() -> None
Create a container in the Azure Blob Storage.
- Raises:
CreateError – If the container creation fails.
- Returns:
None
- _create_blob(stream: bytes, blob: str) -> None
Create a specific blob in the container.
- Parameters:
stream – data stream to upload to the blob.
blob – name of the blob to create.
- Raises:
CreateError – If the blob creation fails.
- Returns:
None
- _delete_blob(blob: str) -> pandas.DataFrame
Delete a specific blob in the container.
- Parameters:
blob – name of the blob to delete.
- Returns:
Empty DataFrame upon successful deletion.
- Return type:
pd.DataFrame
- Raises:
DeleteError – If the blob deletion fails.
- _delete_blobs(prefix: str) -> pandas.DataFrame
Delete all blobs in the container with a specific prefix.
- Parameters:
prefix – a string prefix to match one or multiple blobs.
- Returns:
Empty DataFrame upon successful deletion of all blobs.
- Return type:
pd.DataFrame
- Raises:
DeleteError – If one or more blob deletions fail.
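One plausible reading of "if one or more blob deletions fail" is that every matching blob is attempted and the failures are reported together at the end. The sketch below illustrates that pattern under stated assumptions: `DeleteError` here is a local stand-in for the library's exception, `delete_blobs` and `remove` are hypothetical helpers, and an in-memory dict plays the role of the container. Whether the real method continues after the first failure is not stated in this page.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable


class DeleteError(Exception):
    """Local stand-in for the library's DeleteError (assumption)."""


def delete_blobs(
    names: Iterable[str], prefix: str, delete_one: Callable[[str], None]
) -> list[str]:
    """Try to delete every name matching ``prefix``; raise once if any deletion failed."""
    deleted: list[str] = []
    failed: dict[str, str] = {}
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            delete_one(name)
            deleted.append(name)
        except Exception as exc:  # record the failure and keep going
            failed[name] = str(exc)
    if failed:
        raise DeleteError(f"failed to delete {len(failed)} blob(s): {sorted(failed)}")
    return deleted


# Hypothetical in-memory container used only for this example.
store = {"logs/a.csv": b"", "logs/b.csv": b"", "data/c.csv": b""}


def remove(name: str) -> None:
    del store[name]


deleted = delete_blobs(list(store), "logs/", remove)
print(deleted)
```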
- read(**_kwargs: Any) -> None
Read Azure Blob Storage dataset.
- Parameters:
_kwargs – Additional keyword arguments to pass to the request.
- Returns:
None
- Raises:
ReadError – If reading the blob(s) fails.
- create(**_kwargs: Any) -> None
Create a blob in the container.
- Parameters:
_kwargs – Additional keyword arguments to pass to the request. (not used)
- Returns:
None
- Raises:
CreateError – If the blob creation fails.
- update() -> NoReturn
Update existing rows in the target matched by identity columns defined in self.settings. Atomic. Must not insert new rows.
- Raises:
UpdateError – If the operation fails.
NotSupportedError – If the provider does not support update.
See also
Full contract: docs/DATASET_CONTRACT.md – update()
- list() -> NoReturn
Discover available resources and populate self.output with a DataFrame of resources and their metadata. Idempotent.
- Raises:
ListError – If the operation fails.
NotSupportedError – If the provider does not support listing.
See also
Full contract: docs/DATASET_CONTRACT.md – list()
- purge(**_kwargs: Any) -> None
Purge (remove all content from) the container.
For Azure Blob Storage, this deletes all blobs from the container, leaving the container empty. The container itself is not deleted.
- Parameters:
_kwargs – Additional keyword arguments to pass to the request. (not used)
- Returns:
None
- Raises:
DeleteError – If the purge operation fails.
- upsert() -> NoReturn
Insert rows that do not exist, update rows that do, matched by identity columns defined in self.settings. Atomic.
- Raises:
UpsertError – If the operation fails.
NotSupportedError – If the provider does not support upsert.
See also
Full contract: docs/DATASET_CONTRACT.md – upsert()
- delete(**_kwargs: Any) -> None
Delete specific blob(s) or the entire container from Azure Blob Storage.
For Azure Blob Storage, a “row” is a blob. This method deletes:
  - A specific blob, by blob_name
  - Multiple blobs, by prefix
  - The entire container, if delete_container=True and no blob_name/prefix is provided
- Parameters:
_kwargs – Additional keyword arguments to pass to the request. (not used)
- Returns:
None
- Raises:
DeleteError – If the deletion fails or the requirements are not met.
- rename() -> NoReturn
Rename the resource in the backend. Atomic. Not idempotent.
- Raises:
RenameError – If the operation fails.
NotSupportedError – If the provider does not support renaming.
See also
Full contract: docs/DATASET_CONTRACT.md – rename()
- close() -> None
No-op. The linked service does not need to be closed; this method exists only to comply with the interface.
- Returns:
None
- class ds_provider_azure_py_lib.dataset.AzureBlobDatasetSettings
Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings
Settings for Azure Blob Storage dataset operations.
Provide either blob_name or prefix for read() and delete(); if both are specified, only blob_name is considered. create() does not use prefix and can be called only with blob_name. By default (when create settings are not passed), create() attempts to create the container if it does not exist. delete() removes specific blob(s) by name or prefix.
- container_name: str
- blob_name: str | None = None
- prefix: str | None = None
- create: CreateSettings
- purge: PurgeSettings
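The blob_name/prefix rules described for these settings can be expressed as a small resolution step. The sketch below is an illustration of those rules only, not the library's class: `BlobTarget`, `for_read`, and `for_create` are hypothetical names, and `ValueError` stands in for whatever exception the library actually raises.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BlobTarget:
    """Hypothetical mirror of the container_name/blob_name/prefix fields."""

    container_name: str
    blob_name: str | None = None
    prefix: str | None = None

    def for_read(self) -> tuple[str, str]:
        """Resolve what read()/delete() would address; blob_name wins over prefix."""
        if self.blob_name:
            return ("blob", self.blob_name)
        if self.prefix:
            return ("prefix", self.prefix)
        raise ValueError("either blob_name or prefix must be provided")

    def for_create(self) -> str:
        """create() works on a single blob, so prefix is ignored."""
        if not self.blob_name:
            raise ValueError("create() requires blob_name")
        return self.blob_name


both = BlobTarget("my-container", blob_name="path/to/file.csv", prefix="path/")
only_prefix = BlobTarget("my-container", prefix="path/")
print(both.for_read(), only_prefix.for_read())
```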
- class ds_provider_azure_py_lib.dataset.AzureTable
Bases: ds_resource_plugin_py_lib.common.resource.dataset.TabularDataset[AzureLinkedServiceType, AzureTableDatasetSettingsType, ds_provider_azure_py_lib.serde.AzureTableSerializer, ds_provider_azure_py_lib.serde.AzureTableDeserializer], Generic[AzureLinkedServiceType, AzureTableDatasetSettingsType]
Tabular dataset object which identifies data within a data store, such as a table, CSV, JSON, Parquet, Parquet dataset, and other documents.
The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.
- linked_service: AzureLinkedServiceType
- settings: AzureTableDatasetSettingsType
- property type: ds_provider_azure_py_lib.enums.ResourceType
Get the type of the Dataset.
- Returns:
ResourceType
- _prepare_content(content: pandas.DataFrame) -> dict[str, Any]
Ensure that the content is provided and is in the correct format.
- Parameters:
content (pd.DataFrame) – The content to prepare.
- Returns:
The prepared content.
- Return type:
dict
- Raises:
DatasetException – If the content is not a DataFrame, is empty, or does not contain required columns.
- _get_table_client() -> azure.data.tables.TableClient
Return a TableClient for the currently configured table.
- Returns:
TableClient
- _build_transaction_from_input(operation: str, params: collections.abc.Mapping[str, Any] | None = None) -> list[TransactionEntry]
Build a list of transaction entries from self.input.
- Parameters:
operation (str) – The operation name as expected by TableClient.submit_transaction, e.g. “create”, “upsert”, “delete”.
params – Optional params dict passed as the third item of each entry tuple (when required), e.g. {“mode”: UpdateMode.REPLACE}.
- Returns:
list[TransactionEntry]
- Raises:
CreateError – If there is an error preparing content for creation.
UpdateError – If there is an error preparing content for update.
DeleteError – If there is an error preparing content for deletion.
DatasetException – If there is a general error preparing content.
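The entries described above are tuples of `(operation, entity)` or `(operation, entity, params)`. A minimal sketch of building them follows, assuming the input rows are available as plain dicts rather than a DataFrame. `build_transaction` is a hypothetical helper, `ValueError` stands in for the library's error classes, and the string `"replace"` is a placeholder for the `UpdateMode.REPLACE` value mentioned in the parameter description.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any

REQUIRED_KEYS = {"PartitionKey", "RowKey"}  # entities are matched by these two keys


def build_transaction(
    rows: list[dict[str, Any]],
    operation: str,
    params: Mapping[str, Any] | None = None,
) -> list[tuple]:
    """Turn each row into an (operation, entity[, params]) transaction entry."""
    entries: list[tuple] = []
    for row in rows:
        missing = REQUIRED_KEYS - set(row)
        if missing:
            raise ValueError(f"row is missing required keys: {sorted(missing)}")
        entity = dict(row)  # copy so the caller's row is not shared
        if params:
            entries.append((operation, entity, dict(params)))
        else:
            entries.append((operation, entity))
    return entries


rows = [{"PartitionKey": "users", "RowKey": "1", "name": "ada"}]
plain = build_transaction(rows, "create")
with_params = build_transaction(rows, "upsert", {"mode": "replace"})
print(plain, with_params)
```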
- _submit_transaction(transaction: collections.abc.Iterable[TransactionEntry], error_cls: type[ds_resource_plugin_py_lib.common.resource.dataset.errors.DatasetException]) -> None
Submit the transaction and map TableTransactionError to the provided error_cls.
- Parameters:
transaction (Iterable[TransactionEntry]) – The transaction to submit.
error_cls (builtins.type[DatasetException]) – The exception class to raise on error.
- Raises:
error_cls – If submitting the transaction fails.
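The error mapping described here is a plain catch-and-re-raise with the caller's exception class. The sketch below shows the pattern in isolation. All exception classes are local stand-ins defined so the example runs without the Azure SDK, and `submit_transaction` with its `submit` callable is a hypothetical shape, not the method's real signature.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable


class TableTransactionError(Exception):
    """Local stand-in for azure.data.tables.TableTransactionError."""


class DatasetException(Exception):
    """Local stand-in for the library's base dataset exception."""


class CreateError(DatasetException):
    """Local stand-in for the library's CreateError."""


def submit_transaction(
    submit: Callable[[list[tuple]], None],
    transaction: Iterable[tuple],
    error_cls: type[DatasetException],
) -> None:
    """Submit the entries and re-raise a transaction failure as ``error_cls``."""
    try:
        submit(list(transaction))
    except TableTransactionError as exc:
        # Chain the original error so the SDK-level cause stays visible.
        raise error_cls(f"transaction failed: {exc}") from exc


def failing_submit(entries: list[tuple]) -> None:
    raise TableTransactionError("entity already exists")


caught = None
try:
    submit_transaction(failing_submit, [("create", {"RowKey": "1"})], CreateError)
except CreateError as exc:
    caught = exc
print(type(caught).__name__, "caused by", type(caught.__cause__).__name__)
```

Passing the exception class as an argument lets create(), delete(), and upsert() share one submit path while still raising their own error types.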
- _delete_table() -> None
Deletes the entire table from Azure Table Storage.
- Returns:
None
- Raises:
DeleteError – If the table could not be deleted.
- _create_table() -> None
Creates a table in Azure Table Storage if it does not exist.
- Returns:
None
- Raises:
CreateError – If the table could not be created due to an error other than it already existing.
- read(**_kwargs: Any) -> None
Read Azure Table Storage dataset.
- Parameters:
_kwargs – Additional keyword arguments
- Returns:
None
- Raises:
ReadError – If there is an error reading from Azure Table Storage.
- create(**_kwargs: Any) -> None
Create an entity in Azure Table Storage.
- Returns:
None
- Raises:
CreateError – If the entity could not be created.
- delete(**_kwargs: Any) -> None
Delete specific entities from Azure Table Storage.
Only entities specified in self.input are deleted, matched by PartitionKey and RowKey.
- Parameters:
_kwargs – Additional keyword arguments
- Returns:
None
- Raises:
DeleteError – If there is an error deleting from Azure Table Storage.
- rename() -> NoReturn
Rename the resource in the backend. Atomic. Not idempotent.
- Raises:
RenameError – If the operation fails.
NotSupportedError – If the provider does not support renaming.
See also
Full contract: docs/DATASET_CONTRACT.md – rename()
- close() -> None
No-op. The linked service does not need to be closed; this method exists only to comply with the interface.
- Returns:
None
- list() -> NoReturn
Discover available resources and populate self.output with a DataFrame of resources and their metadata. Idempotent.
- Raises:
ListError – If the operation fails.
NotSupportedError – If the provider does not support listing.
See also
Full contract: docs/DATASET_CONTRACT.md – list()
- purge(**_kwargs: Any) -> None
Purge all entities from the table or drop the entire table.
If delete_table=True in settings, deletes the entire table. Otherwise, deletes all entities from the table, leaving it empty.
- Returns:
None
- Raises:
DeleteError – If there is an error purging from Azure Table Storage.
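The two purge outcomes differ in what remains afterwards: with delete_table=True the table itself is gone, otherwise the table stays and is empty. The sketch below illustrates that distinction with an in-memory dict of tables. `purge_table` and `tables` are hypothetical names, and no Azure calls are made.

```python
from __future__ import annotations


def purge_table(tables: dict[str, list[dict]], table_name: str, delete_table: bool) -> None:
    """Drop the table when ``delete_table`` is set; otherwise remove every entity."""
    if delete_table:
        tables.pop(table_name, None)  # the table itself no longer exists
    else:
        tables[table_name].clear()  # the table stays, but holds no entities


tables = {
    "users": [{"PartitionKey": "p", "RowKey": "1"}],
    "events": [{"PartitionKey": "p", "RowKey": "7"}],
}
purge_table(tables, "users", delete_table=False)
purge_table(tables, "events", delete_table=True)
print(tables)
```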
- upsert(**_kwargs: Any) -> None
Insert rows that do not exist, update rows that do, matched by identity columns defined in self.settings. Atomic.
- Raises:
UpsertError – If the operation fails.
NotSupportedError – If the provider does not support upsert.
See also
Full contract: docs/DATASET_CONTRACT.md – upsert()
- class ds_provider_azure_py_lib.dataset.AzureTableDatasetSettings
Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings
Settings for Azure Table Storage dataset operations.
The read settings contain read-specific configuration that applies only to the read() operation, not to create(), delete(), update(), etc.
- table_name: str
- purge: PurgeSettings
Purge-specific settings. Only applies to the purge() operation.
- read: ReadSettings
Read-specific settings. Only applies to the read() operation.
By default, read() reads without a filter.