Documentation for AwsS3JSONLoader

Functionality

This module extends AwsS3TextLoader to provide specialized handling for JSON files stored in AWS S3. It parses JSON content and supports filtering fields based on user-defined parameters.

Parameters

  • fields_to_keep: Optional list or set of field names to retain in the JSON.
  • retry_config: Optional configuration for retry strategies during loading.
  • features: Optional specification of expected dataset features.
  • encoding: Character encoding used to decode the JSON file.
  • kwargs: Additional keyword arguments for AWS S3 credentials, such as aws_access_key_id and aws_secret_access_key.

Usage

Purpose: Load and parse JSON files from AWS S3, optionally keeping only the fields listed in fields_to_keep.

Example

loader = AwsS3JSONLoader(
    fields_to_keep=["id", "text"],
    aws_access_key_id="YOUR_KEY",
    aws_secret_access_key="YOUR_SECRET"
)
data = loader.load_items([
    S3FileMeta(bucket="my-bucket", file="data.json")
])

Documentation for AwsS3JSONLoader._filter_fields

Functionality

Filters a dictionary based on the fields_to_keep attribute. Only keys that exist in fields_to_keep are retained in the returned dictionary.

Parameters

  • item (dict): The dictionary to filter.

Return Value

  • dict: A new dictionary containing only the allowed fields from the original item.

Usage

Use this method to remove unwanted data from JSON records loaded from S3. Ensure that the instance's fields_to_keep is set before invoking this method.

Example

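loader = AwsS3JSONLoader(fields_to_keep=["id", "text"])
# Sample record; "extra" is not in fields_to_keep, so it is dropped
item = {"id": 1, "text": "hello", "extra": "drop me"}
filtered_item = loader._filter_fields(item)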
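
The filtering rule described above can be sketched without any S3 access. The standalone function below is a minimal illustration, not the library's implementation: the name filter_fields and its signature are hypothetical, and it assumes fields_to_keep has already been set to a list or set of key names.

```python
from typing import Any, Dict, Iterable


def filter_fields(item: Dict[str, Any], fields_to_keep: Iterable[str]) -> Dict[str, Any]:
    """Return a new dict holding only the keys listed in fields_to_keep.

    Hypothetical stand-in for AwsS3JSONLoader._filter_fields.
    """
    allowed = set(fields_to_keep)  # accepts either a list or a set
    # Build a new dict so the original item is left untouched
    return {key: value for key, value in item.items() if key in allowed}


record = {"id": 7, "text": "hello", "debug": True}
filtered = filter_fields(record, ["id", "text"])
print(filtered)  # {'id': 7, 'text': 'hello'}
```

Names in fields_to_keep that are absent from the item are simply ignored in this sketch, so the result never contains keys the original record lacked.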

Documentation for AwsS3JSONLoader._get_item

Functionality

Processes a BytesIO object containing JSON data. Overrides the parent method to decode the file contents, parse them as JSON, and filter the resulting fields when fields_to_keep is set.

Parameters

  • file: A BytesIO object that holds the downloaded JSON data.

Usage

  • Purpose: Convert file contents to a JSON object with optional field filtering based on the provided configuration.

Example

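# Instantiate a loader that keeps only 'id' and 'text'.
# _get_item is not called directly here; as an override of the parent method,
# it is presumably invoked by load_items for each downloaded file.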
loader = AwsS3JSONLoader(
    fields_to_keep=["id", "text"],
    aws_access_key_id="YOUR_KEY",
    aws_secret_access_key="YOUR_SECRET"
)
item = loader.load_items([
    S3FileMeta(bucket="my-bucket", file="data.json")
])
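
The decode, parse, and filter steps described for _get_item can be tried locally on an in-memory buffer. The sketch below is an assumption-laden illustration rather than the library's code: get_item is a hypothetical free function, the utf-8 default is assumed, and leaving non-dict JSON (such as a top-level array) unfiltered is a choice made for this example only.

```python
import json
from io import BytesIO
from typing import Any, Iterable, Optional


def get_item(
    file: BytesIO,
    encoding: str = "utf-8",  # assumed default, not confirmed by the source
    fields_to_keep: Optional[Iterable[str]] = None,
) -> Any:
    """Decode the buffer, parse it as JSON, and optionally filter its keys.

    Hypothetical stand-in for AwsS3JSONLoader._get_item.
    """
    # getvalue() returns the whole buffer regardless of the current read position
    text = file.getvalue().decode(encoding)
    parsed = json.loads(text)
    # Filter only when fields are configured; the dict check is an assumption
    if fields_to_keep is not None and isinstance(parsed, dict):
        allowed = set(fields_to_keep)
        parsed = {key: value for key, value in parsed.items() if key in allowed}
    return parsed


raw = BytesIO(b'{"id": 1, "text": "hello", "extra": "drop me"}')
item = get_item(raw, fields_to_keep=["id", "text"])
print(item)  # {'id': 1, 'text': 'hello'}
```

When fields_to_keep is left as None, the parsed JSON is returned unchanged, which mirrors the "when fields_to_keep is set" condition in the description above.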