How to configure a RuntimeDataConnector
This guide demonstrates how to configure a RuntimeDataConnector and applies only to the V3 (Batch Request) API. A RuntimeDataConnector allows you to specify a Batch using a Runtime Batch Request, which is used to create a Validator. A Validator is the key object used to create Expectations and validate datasets.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- An understanding of the basics of Datasources in the V3 (Batch Request) API
- Learned how to configure a Data Context using test_yaml_config
A RuntimeDataConnector is a special kind of Data Connector that enables you to use a RuntimeBatchRequest to provide a Batch's data directly at runtime. The RuntimeBatchRequest can wrap an in-memory dataframe, a filepath, or a SQL query, and must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run). The batch identifiers that must be passed in at runtime are specified in the RuntimeDataConnector's configuration.
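For example, if each Batch corresponds to an Airflow DAG run, the connector could require an identifier such as airflow_run_id. The sketch below is only illustrative (the identifier key and value are hypothetical, not the configuration used later in this guide); the point is that the batch_identifiers supplied at runtime must match the keys declared in the connector's configuration.

```python
# Hypothetical connector configuration: every Batch must be tagged with an airflow_run_id.
data_connector_config = {
    "class_name": "RuntimeDataConnector",
    "batch_identifiers": ["airflow_run_id"],
}

# At runtime, the RuntimeBatchRequest must then supply a value for that identifier, e.g.:
batch_identifiers = {"airflow_run_id": "scheduled__2021-01-01T00:00:00"}
```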
Steps

1. Instantiate your project's DataContext

Import these necessary packages and modules:
YAML:

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from ruamel import yaml
```

Python:

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
```
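With the imports in place, instantiate the Data Context that the rest of the snippets in this guide use as context. A minimal sketch, assuming your project was initialized with great_expectations init so the context can be discovered from the current directory:

```python
# Load the DataContext from your project's great_expectations/ directory.
context = ge.get_context()
```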
2. Set up a Datasource

All of the examples below assume you’re testing configuration using something like:
YAML:

```python
datasource_yaml = """
name: taxi_datasource
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  <DATACONNECTOR NAME GOES HERE>:
    <DATACONNECTOR CONFIGURATION GOES HERE>
"""
context.test_yaml_config(yaml_config=datasource_yaml)
```
Python:

```python
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "<DATACONNECTOR NAME GOES HERE>": {
            "<DATACONNECTOR CONFIGURATION GOES HERE>"
        },
    },
}
context.test_yaml_config(yaml.dump(datasource_config))
```
If you’re not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config.
3. Add a RuntimeDataConnector to a Datasource configuration

This basic configuration can be used in multiple ways depending on how the RuntimeBatchRequest is configured:
YAML:

```python
datasource_yaml = """
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""
```
Python:

```python
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
```
Once the RuntimeDataConnector is configured, you can add your Datasource using:

```python
context.add_datasource(**datasource_config)
```
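If you defined the Datasource with the YAML string instead, one approach (a sketch, assuming the ruamel yaml module imported earlier) is to parse the YAML into a dictionary and unpack it:

```python
# Parse the YAML configuration into a dict and register the Datasource with the Data Context.
context.add_datasource(**yaml.load(datasource_yaml))
```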
Example 1: RuntimeDataConnector for access to file-system data

At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the path to your data defined in runtime_parameters:
```python
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<PATH TO YOUR DATA HERE>"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```
Next, you would pass that request into context.get_validator:
```python
validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="<MY EXPECTATION SUITE NAME>",
)
```
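With the Validator in hand, you can start creating Expectations against that Batch. A brief sketch (the column name passenger_count is only an example and assumes such a column exists in your data):

```python
# Run an Expectation interactively against the Batch wrapped by the Validator.
validator.expect_column_values_to_not_be_null(column="passenger_count")

# Persist the Expectations accumulated on the Validator to its Expectation Suite.
validator.save_expectation_suite(discard_failed_expectations=False)
```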
Example 2: RuntimeDataConnector that uses an in-memory DataFrame

At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the DataFrame passed into batch_data in runtime_parameters:
```python
import pandas as pd

path = "<PATH TO YOUR DATA HERE>"
df = pd.read_csv(path)

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # Pass your DataFrame here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```
Next, you would pass that request into context.get_validator:
```python
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="<MY EXPECTATION SUITE NAME>",
)
print(validator.head())
```
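The same pattern applies to the SQL-query case mentioned in the introduction. A minimal sketch, assuming a separate Datasource backed by a SqlAlchemyExecutionEngine (not configured in this guide); the Datasource name and table name below are hypothetical:

```python
batch_request = RuntimeBatchRequest(
    datasource_name="my_sql_datasource",  # Hypothetical Datasource using a SqlAlchemyExecutionEngine.
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",
    runtime_parameters={"query": "SELECT * FROM taxi_trips LIMIT 1000"},  # Pass your SQL query here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```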
Additional Notes

To view the full script used in this page, see it on GitHub.