How to configure a RuntimeDataConnector
This guide demonstrates how to configure a RuntimeDataConnector and applies only to the V3 (Batch Request) API. A RuntimeDataConnector allows you to specify a Batch using a Runtime Batch Request, which is used to create a Validator. A Validator is the key object used to create Expectations and validate datasets.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- An understanding of the basics of Datasources in the V3 (Batch Request) API
- Learned how to configure a Data Context using test_yaml_config
A RuntimeDataConnector is a special kind of Data Connector that enables you to use a RuntimeBatchRequest to provide a Batch's data directly at runtime. The RuntimeBatchRequest can wrap an in-memory dataframe, a filepath, or a SQL query, and must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run). The batch identifiers that must be passed in at runtime are specified in the RuntimeDataConnector's configuration.
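For example, if each Batch corresponds to an Airflow DAG run, the connector could require an identifier such as airflow_run_id. The sketch below is only illustrative (the identifier key and value are hypothetical, not the configuration used later in this guide); the point is that the batch_identifiers supplied at runtime must match the keys declared in the connector's configuration.

```python
# Hypothetical connector configuration: every Batch must be tagged with an airflow_run_id.
data_connector_config = {
    "class_name": "RuntimeDataConnector",
    "batch_identifiers": ["airflow_run_id"],
}

# At runtime, the RuntimeBatchRequest must then supply a value for that identifier, e.g.:
batch_identifiers = {"airflow_run_id": "scheduled__2021-01-01T00:00:00"}
```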
Steps

1. Instantiate your project's DataContext

Import these necessary packages and modules:
YAML:

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from ruamel import yaml
```

Python:

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
```
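With the imports in place, instantiate the Data Context that the rest of the snippets in this guide use as context. A minimal sketch, assuming your project was initialized with great_expectations init so the context can be discovered from the current directory:

```python
# Load the DataContext from your project's great_expectations/ directory.
context = ge.get_context()
```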
2. Set up a Datasource

All of the examples below assume you’re testing configuration using something like:
YAML:

```python
datasource_yaml = """
name: taxi_datasource
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  <DATACONNECTOR NAME GOES HERE>:
    <DATACONNECTOR CONFIGURATION GOES HERE>
"""
context.test_yaml_config(yaml_config=datasource_yaml)
```
Python:

```python
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "<DATACONNECTOR NAME GOES HERE>": {
            "<DATACONNECTOR CONFIGURATION GOES HERE>"
        },
    },
}
context.test_yaml_config(yaml.dump(datasource_config))
```
If you’re not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config.
3. Add a RuntimeDataConnector to a Datasource configuration

This basic configuration can be used in multiple ways depending on how the RuntimeBatchRequest is configured:
YAML:

```python
datasource_yaml = """
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""
```
Python:

```python
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
```
Once the RuntimeDataConnector is configured, you can add your Datasource using:

```python
context.add_datasource(**datasource_config)
```
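If you defined the Datasource with the YAML string instead, one approach (a sketch, assuming the ruamel yaml module imported earlier) is to parse the YAML into a dictionary and unpack it:

```python
# Parse the YAML configuration into a dict and register the Datasource with the Data Context.
context.add_datasource(**yaml.load(datasource_yaml))
```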
Example 1: RuntimeDataConnector for access to file-system data

At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the path to your data defined in runtime_parameters:
```python
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<PATH TO YOUR DATA HERE>"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```
Next, you would pass that request into context.get_validator:
```python
validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="<MY EXPECTATION SUITE NAME>",
)
```
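With the Validator in hand, you can start creating Expectations against that Batch. A brief sketch (the column name passenger_count is only an example and assumes such a column exists in your data):

```python
# Run an Expectation interactively against the Batch wrapped by the Validator.
validator.expect_column_values_to_not_be_null(column="passenger_count")

# Persist the Expectations accumulated on the Validator to its Expectation Suite.
validator.save_expectation_suite(discard_failed_expectations=False)
```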
Example 2: RuntimeDataConnector that uses an in-memory DataFrame

At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the DataFrame passed into batch_data in runtime_parameters:
```python
import pandas as pd

path = "<PATH TO YOUR DATA HERE>"
df = pd.read_csv(path)

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # Pass your DataFrame here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```
Next, you would pass that request into context.get_validator:
```python
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="<MY EXPECTATION SUITE NAME>",
)
print(validator.head())
```
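The same pattern applies to the SQL-query case mentioned in the introduction. A minimal sketch, assuming a separate Datasource backed by a SqlAlchemyExecutionEngine (not configured in this guide); the Datasource name and table name below are hypothetical:

```python
batch_request = RuntimeBatchRequest(
    datasource_name="my_sql_datasource",  # Hypothetical Datasource using a SqlAlchemyExecutionEngine.
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",
    runtime_parameters={"query": "SELECT * FROM taxi_trips LIMIT 1000"},  # Pass your SQL query here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```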
Additional Notes

To view the full script used in this page, see it on GitHub.