How to connect to in-memory data in a Spark dataframe
This guide will help you connect to your data in an in-memory Spark dataframe. This will allow you to validate and explore your data.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Access to an in-memory Spark dataframe
Steps
1. Choose how to run the code in this guide

Get an environment to run the code in this guide. Please choose an option below.
- CLI + filesystem
- No CLI + filesystem
- No CLI + no filesystem
If you use the Great Expectations CLI, run this command to automatically generate a pre-configured Jupyter Notebook. Then you can follow along in the YAML-based workflow below:
```bash
great_expectations datasource new
```
If you use Great Expectations in an environment that has filesystem access, and prefer not to use the CLI, run the code in this guide in a notebook or other Python script.
If you use Great Expectations in an environment that has no filesystem (such as Databricks or AWS EMR), run the code in this guide in that system's preferred way.
2. Instantiate your project's DataContext

Import these necessary packages and modules.
```python
from ruamel import yaml

import great_expectations as ge
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    InMemoryStoreBackendDefaults,
)
```
Load your DataContext into memory. Use one of the guides below, based on your deployment:
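For example, if you are running in an environment without filesystem access, a minimal sketch (an assumption for illustration, not part of the original scripts) could instantiate an in-memory DataContext directly:

```python
# Sketch: an in-memory DataContext for environments without filesystem access
# (e.g., Databricks or EMR). With a filesystem-backed project, `ge.get_context()`
# is the usual entry point instead.
context = BaseDataContext(
    project_config=DataContextConfig(
        store_backend_defaults=InMemoryStoreBackendDefaults()
    )
)
```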
3. Configure your Datasource

Use this example configuration to add a Datasource that accepts in-memory Spark dataframes at runtime:
- YAML
- Python
datasource_yaml = f"""name: my_spark_dataframeclass_name: Datasourceexecution_engine: class_name: SparkDFExecutionEnginedata_connectors: default_runtime_data_connector_name: class_name: RuntimeDataConnector batch_identifiers: - batch_id"""
Run this code to test your configuration.
```python
context.test_yaml_config(datasource_yaml)
```
```python
datasource_config = {
    "name": "my_spark_dataframe",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "SparkDFExecutionEngine"},
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["batch_id"],
        }
    },
}
```
Run this code to test your configuration.
```python
context.test_yaml_config(yaml.dump(datasource_config))
```
Note: Since the Datasource does not have data passed in until later, the output will show that no `data_asset_names` are currently available. This is to be expected.
4. Save the Datasource configuration to your DataContext

Save the configuration into your DataContext by using the `add_datasource()` function.
- YAML
- Python
```python
context.add_datasource(**yaml.load(datasource_yaml))
```
```python
context.add_datasource(**datasource_config)
```
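If you want to confirm that the Datasource was registered, one option (an illustrative check, not part of the original scripts) is to list the Datasources the context knows about:

```python
# Sketch: the new Datasource should appear among the context's configured Datasources.
print([datasource["name"] for datasource in context.list_datasources()])
```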
5. Test your new Datasource

Verify your new Datasource by loading data from it into a Validator using a BatchRequest.

Add the variable containing your dataframe (`df` in this example) to the `batch_data` key under `runtime_parameters` in your BatchRequest.
```python
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_dataframe",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR_MEANINGFUL_NAME>",  # This can be anything that identifies this data_asset for you
    batch_identifiers={"batch_id": "default_identifier"},
    runtime_parameters={"batch_data": df},  # Your dataframe goes here
)
```
Note: this guide uses a toy dataframe that looks like this:
```python
data = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5, "c": 6},
    {"a": 7, "b": 8, "c": 9},
]
```
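If you do not already have a Spark dataframe on hand, a minimal sketch for turning the toy data above into the `df` used in the BatchRequest might look like this (the SparkSession setup is an assumption, not part of the original scripts):

```python
# Sketch only: create a SparkSession and build the toy dataframe referenced as `df` above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
```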
Then load data into the Validator.
```python
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
```
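From here you can run Expectations directly against the dataframe. As a rough sketch (the specific Expectation and column are illustrative, not part of this guide):

```python
# Sketch: validate a column of the toy dataframe and persist the suite.
validator.expect_column_values_to_not_be_null(column="a")
validator.save_expectation_suite(discard_failed_expectations=False)
```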
🎉🎉 Congratulations! 🎉🎉 You successfully connected Great Expectations with your data.
Additional Notes

To view the full scripts used in this page, see them on GitHub.
Next Steps

Now that you've connected to your data, you'll want to work on these core skills: