ds_protocol_sftp_py_lib.dataset

File: __init__.py Region: ds_protocol_sftp_py_lib/dataset

SFTP Dataset

This module implements the SFTP Dataset, a dataset that reads data from and writes data to an SFTP server.

Example

>>> dataset = SftpDataset(
...     id=uuid.uuid4(),
...     name="My SFTP Dataset",
...     version="1.0",
...     deserializer=PandasDeserializer(),
...     serializer=PandasSerializer(),
...     settings=SftpDatasetSettings(
...         folder_path="/path/to",
...         file_name="dataset.csv",
...     ),
...     linked_service=SftpLinkedService(
...         id=uuid.uuid4(),
...         name="My SFTP Linked Service",
...         version="1.0.0",
...         settings=SftpLinkedServiceSettings(
...             host="sftp.example.com",
...             port=22,
...             username="username",
...             password="password",
...         ),
...     )
... )
>>> dataset.read()
>>> data = dataset.output

Classes

SftpDataset

Tabular dataset object which identifies data within a data store.

SftpDatasetSettings

Settings for the SFTP dataset.

Package Contents

class ds_protocol_sftp_py_lib.dataset.SftpDataset

Bases: ds_resource_plugin_py_lib.common.resource.dataset.TabularDataset[SftpLinkedServiceType, SftpDatasetSettingsType, ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer, ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer], Generic[SftpLinkedServiceType, SftpDatasetSettingsType]

Tabular dataset object which identifies data within a data store, such as tables, CSV, JSON, Parquet, Parquet datasets, and other documents.

The input of the dataset is a pandas DataFrame. The output of the dataset is a pandas DataFrame.

linked_service: SftpLinkedServiceType
settings: SftpDatasetSettingsType
serializer: ds_resource_plugin_py_lib.common.serde.serialize.PandasSerializer
deserializer: ds_resource_plugin_py_lib.common.serde.deserialize.PandasDeserializer
property type: ds_protocol_sftp_py_lib.enums.ResourceType

Get the type of the dataset.

read() → None

Read files from the SFTP server.

Returns:

The output is stored in the output attribute as a DataFrame containing the contents of the matched files.

Return type:

None

Raises:

ReadError – If there is an error reading from the SFTP dataset.

create() → None

Create data on the SFTP server.

Note

This method is not idempotent. If called multiple times with the same parameters, it will raise a CreateError if the file already exists. If a network or server error occurs after the file is created but before the method returns, retrying may result in a CreateError due to the file’s existence. Orchestration and retry policies should account for this non-idempotent behavior.

Returns:

None

Raises:

CreateError – If there is an error creating the dataset on the SFTP server, or if the file already exists.
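The retry concern described in the note for create() can be sketched as follows. This is a minimal illustration, not the library's implementation: `CreateError` is a local stand-in for the library's exception, and `FakeDataset` and `create_once` are hypothetical names. Unlike the real `create()`, which takes no arguments, the stand-in takes a path to keep the example self-contained.

```python
class CreateError(Exception):
    """Local stand-in for the library's CreateError (assumption)."""


class FakeDataset:
    """Hypothetical in-memory stand-in mimicking create()/upsert() semantics."""

    def __init__(self):
        self.files = set()

    def create(self, path):
        # Mirrors the documented behavior: fail if the file already exists.
        if path in self.files:
            raise CreateError(f"{path} already exists")
        self.files.add(path)

    def upsert(self, path):
        # Mirrors the documented behavior: overwrite if the file is present.
        self.files.add(path)


def create_once(dataset, path):
    """Return True if this call created the file, False if it already existed.

    The stand-in raises CreateError only when the file exists. The real
    exception may also signal other failures, so production code should
    inspect the error before treating it as "already created".
    """
    try:
        dataset.create(path)
        return True
    except CreateError:
        return False
```

When overwriting is acceptable, calling upsert() instead avoids the problem, since the documentation describes it as overwriting an existing file.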

update() → None

The update operation is not supported by this provider.

Returns:

None

Raises:

NotSupportedError – Always raised since update is not supported for SftpDataset.

upsert() → None

Upsert a file on the SFTP server. If the file already exists, it will be overwritten.

Returns:

None

Raises:

UpsertError – If there is an error upserting the dataset on the SFTP server.

delete() → None

The delete operation is not supported by this provider.

Returns:

None

Raises:

NotSupportedError – Always raised since delete is not supported for SftpDataset.

purge() → None

Purge the dataset, deleting all files matching the pattern from the SFTP server.

Returns:

None

Raises:

PurgeError – If there is an error purging files from the SFTP server.

list() → None

List the files in the directory on the SFTP server based on the specified pattern and settings.

Returns:

The output is stored in the output attribute as a DataFrame containing the file information.

Return type:

None

Raises:

ListError – If there is an error listing the files in the SFTP dataset.

rename() → None

The rename operation is not supported by this provider.

Returns:

None

Raises:

NotSupportedError – Always raised since rename is not supported for SftpDataset.

close() → None

Close any open connections or resources.

Returns:

None

_get_folder_and_file_path() → str

Get combined path of folder_path and file_name, using forward slashes. This ensures consistent path formatting across Windows, Linux, and macOS. It also replaces any Windows-style backslashes with forward slashes.

Returns:

The full file path as a POSIX-style string.

Return type:

str
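The path handling described for _get_folder_and_file_path() can be sketched with the standard library. This is a minimal sketch under assumptions, not the library's code: `join_folder_and_file` is a hypothetical helper that takes the two settings values as arguments.

```python
import posixpath


def join_folder_and_file(folder_path: str, file_name: str) -> str:
    """Join a folder and a file name into a POSIX-style path (hypothetical helper)."""
    # Normalize Windows-style separators so the result is the same on every OS.
    folder = folder_path.replace("\\", "/")
    name = file_name.replace("\\", "/")
    # Strip a leading slash from the name, otherwise posixpath.join would
    # treat it as absolute and discard the folder.
    return posixpath.join(folder, name.lstrip("/"))
```

Using posixpath rather than os.path keeps forward slashes even when the client runs on Windows, which matters because SFTP paths are POSIX-style.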

_ensure_file_does_not_exist(remote_path: str) → None

Ensure the target file does not already exist on the SFTP server.

Parameters:

remote_path (str) – Full target file path on the SFTP server.

Raises:

FileExistsError – If the target file already exists.
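One common way to implement such an existence check is to call stat() on the remote path and treat "not found" as success. The sketch below assumes that approach; the library may do it differently. `FakeSftpClient` is a hypothetical stand-in exposing only a paramiko-like stat() method, and `ensure_file_does_not_exist` is an illustrative name.

```python
class FakeSftpClient:
    """Hypothetical stand-in exposing a paramiko-like stat() method."""

    def __init__(self, existing):
        self._existing = set(existing)

    def stat(self, path):
        # Assumption: a missing path surfaces as FileNotFoundError.
        if path not in self._existing:
            raise FileNotFoundError(path)
        return object()


def ensure_file_does_not_exist(client, remote_path: str) -> None:
    """Raise FileExistsError if remote_path is present on the server."""
    try:
        client.stat(remote_path)
    except FileNotFoundError:
        return  # Nothing at that path, so it is safe to create.
    raise FileExistsError(f"Remote file already exists: {remote_path}")
```

A check like this is not atomic: another client could create the file between the check and the write, so it reduces accidental overwrites rather than preventing them.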

_list_directory(path: str) → list[paramiko.SFTPAttributes]

List the files in the specified directory on the SFTP server.

Parameters:

path (str) – The directory path to list files from.

Returns:

A list of SFTPAttributes for the files in the directory.

Return type:

list[SFTPAttributes]

_get_files_by_pattern(path: str, fnmatch_pattern: str) → list[paramiko.SFTPAttributes]

Get files from the SFTP server that match the specified pattern.

Parameters:
  • path (str) – The directory path to search for files.

  • fnmatch_pattern (str) – The pattern to match file names against.

Returns:

A list of SFTPAttributes for the matching files.

Return type:

list[SFTPAttributes]
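The parameter name fnmatch_pattern suggests shell-style matching as provided by Python's fnmatch module. The following sketch shows that kind of filtering under that assumption. `FileEntry` is a hypothetical stand-in for paramiko.SFTPAttributes (only its filename attribute is used), and `filter_by_pattern` is an illustrative name.

```python
import fnmatch
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical stand-in for paramiko.SFTPAttributes (filename only)."""

    filename: str


def filter_by_pattern(entries, fnmatch_pattern):
    """Keep the entries whose file name matches a shell-style pattern."""
    # fnmatchcase compares case-sensitively on every platform, unlike
    # fnmatch.fnmatch, which follows the local operating system's rules.
    return [e for e in entries if fnmatch.fnmatchcase(e.filename, fnmatch_pattern)]
```

With this sketch, the pattern "*.csv" selects "a.csv" but not "c.CSV", while "*" selects every entry.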

_ensure_sftp_directory(remote_directory: str, max_depth: int = 20) → None

Ensure that the specified directory exists on the SFTP server. If it does not exist, create it.

Parameters:
  • remote_directory (str) – The directory path to ensure on the SFTP server.

  • max_depth (int) – The maximum directory depth to traverse when ensuring the directory exists. Default is 20.

Returns:

None

Raises:

CreateError – If the maximum directory depth is exceeded while ensuring the SFTP directory.
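A depth-limited directory check of this kind can be sketched by walking the path one component at a time and creating whatever is missing. This is an illustration under assumptions, not the library's implementation: `CreateError` is a local stand-in, `FakeSftpClient` is a hypothetical client with paramiko-like stat() and mkdir() methods, and how the real method counts depth is not documented here.

```python
import posixpath


class CreateError(Exception):
    """Local stand-in for the library's CreateError (assumption)."""


class FakeSftpClient:
    """Hypothetical stand-in with paramiko-like stat() and mkdir()."""

    def __init__(self):
        self.dirs = {"/"}

    def stat(self, path):
        if path not in self.dirs:
            raise FileNotFoundError(path)
        return object()

    def mkdir(self, path):
        self.dirs.add(path)


def ensure_directory(client, remote_directory: str, max_depth: int = 20) -> None:
    """Create each missing component of remote_directory, top down."""
    normalized = remote_directory.replace("\\", "/")
    parts = [p for p in normalized.split("/") if p]
    # Assumption: depth is the number of path components.
    if len(parts) > max_depth:
        raise CreateError(f"Directory depth {len(parts)} exceeds {max_depth}")
    current = "/" if normalized.startswith("/") else ""
    for part in parts:
        current = posixpath.join(current, part)
        try:
            client.stat(current)
        except FileNotFoundError:
            client.mkdir(current)
```

Walking top down is needed because a plain SFTP mkdir does not create missing parent directories.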

_read_files_as_dataframe(files: list[paramiko.SFTPAttributes]) → pandas.DataFrame

Read the dataset from the SFTP server as a dataframe.

Parameters:

files (list[SFTPAttributes]) – List of SFTPAttributes for the files to read.

Returns:

The combined data from the files as a single DataFrame.

Return type:

pd.DataFrame

_list_directory_files(files: list[paramiko.SFTPAttributes]) → pandas.DataFrame

List the files in the directory as a dataframe.

Parameters:

files (list[SFTPAttributes]) – List of SFTPAttributes for the files to list.

Returns:

A dataframe containing the file information.

Return type:

pd.DataFrame

class ds_protocol_sftp_py_lib.dataset.SftpDatasetSettings

Bases: ds_resource_plugin_py_lib.common.resource.dataset.DatasetSettings

Settings for the SFTP dataset.

folder_path: str

Path to the folder containing the file(s) to read/write on the SFTP server.

file_name: str

Name of the file to read/write on the SFTP server.

list: ListSettings

Settings for listing the SFTP dataset.