
Cassandra

Testing

Important Capabilities

Capability | Status / Notes
Asset Containers | Enabled by default
Detect Deleted Entities | Optionally enabled via stateful_ingestion.remove_stale_metadata
Platform Instance | Enabled by default
Schema Metadata | Enabled by default
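
As a rough illustration of the "Detect Deleted Entities" row, the fragment below sketches a recipe with stale metadata removal switched on. The pipeline name and the sink server address are placeholders assumed for this example, not values taken from this page; the source fields come from the configuration reference for this plugin.

```yaml
# Sketch only: pipeline_name and the sink server are assumed placeholder values.
pipeline_name: "cassandra_example_pipeline"
source:
  type: "cassandra"
  config:
    contact_point: "localhost"
    port: 9042
    stateful_ingestion:
      enabled: true
      # Soft-delete entities that disappear between successful runs.
      remove_stale_metadata: true
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # assumed address of a local DataHub instance
```

A stable pipeline_name matters here because stateful ingestion compares the current run against the previous successful run of the same pipeline.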

This plugin extracts the following:

  • Metadata for tables
  • Column types associated with each table column
  • The keyspace each table belongs to

Setup

This integration pulls metadata directly from Cassandra databases, including both DataStax Astra DB and Cassandra Enterprise Edition (EE).

You’ll need to have a Cassandra instance or an Astra DB setup with appropriate access permissions.

Steps to Get the Required Information

  1. Set Up User Credentials:

    • For Astra DB:
      • Log in to your Astra DB Console.
      • Navigate to Organization Settings > Token Management.
      • Generate an Application Token with the required permissions for read access.
      • Download the Secure Connect Bundle from the Astra DB Console.
    • For Cassandra EE:
      • Ensure you have a username and password with read access to the necessary keyspaces.
  2. Permissions:

    • The user or token must have SELECT permissions that allow it to:
      • Access metadata in system keyspaces (e.g., system_schema) to retrieve information about keyspaces, tables, columns, and views.
      • Perform SELECT operations on the data tables if data profiling is enabled.
  3. Verify Database Access:

    • For Astra DB: Ensure the Secure Connect Bundle is used and configured correctly.
    • For Cassandra EE: Verify SSL/TLS settings if required, and ensure the contact point and port are accessible.

CLI-based Ingestion

Install the Plugin

The cassandra source works out of the box with acryl-datahub.

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "cassandra"
  config:
    # Credentials for on-prem Cassandra
    contact_point: "localhost"
    port: 9042
    username: "admin"
    password: "password"

    # Or
    # Credentials for Astra Cloud
    # cloud_config:
    #   secure_connect_bundle: "Path to Secure Connect Bundle (.zip)"
    #   token: "Application Token"

    # Optional: allow / deny extraction of particular keyspaces.
    keyspace_pattern:
      allow: [".*"]

    # Optional: allow / deny extraction of particular tables.
    table_pattern:
      allow: [".*"]

    # Optional
    profiling:
      enabled: true
      profile_table_level_only: true

sink:
  # config sinks
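
For Astra DB, the same recipe might look like the sketch below, with the cloud_config block taking the place of the contact point and username/password. The bundle path is a placeholder, and ASTRA_DB_TOKEN is a hypothetical environment variable name chosen for this example; it assumes the recipe is run in an environment where that variable is set and expanded.

```yaml
# Sketch only: the bundle path and the ASTRA_DB_TOKEN variable name are assumptions.
source:
  type: "cassandra"
  config:
    cloud_config:
      secure_connect_bundle: "/path/to/secure-connect-bundle.zip"
      token: "${ASTRA_DB_TOKEN}"

    keyspace_pattern:
      allow: [".*"]

sink:
  # config sinks
```

Reading the token from the environment keeps the secret out of the recipe file, which is useful when recipes are checked into version control.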

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
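
As a small illustration of the dotted notation, the fields listed as cloud_config.token and cloud_config.request_timeout would be written as nested keys under the source config. The values shown are placeholders.

```yaml
# "cloud_config.token" and "cloud_config.request_timeout" expressed as nested YAML.
# This fragment belongs under source.config in a recipe.
cloud_config:
  token: "Application Token"
  request_timeout: 600
```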

Each entry below lists the field name, its type, a description, and the default value where one is defined.
contact_point
string
Domain or IP address of the Cassandra instance (excluding port).
Default: localhost
password
string
Password credential associated with the specified username.
platform_instance
string
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details.
port
integer
Port number to connect to the Cassandra instance.
Default: 9042
username
string
Username credential with read access to the system_schema keyspace.
env
string
The environment that all assets produced by this connector belong to.
Default: PROD
cloud_config
CassandraCloudConfig
Configuration for cloud-based Cassandra, such as DataStax Astra DB.
cloud_config.secure_connect_bundle 
string
File path to the Secure Connect Bundle (.zip) used for a secure connection to DataStax Astra DB.
cloud_config.token 
string
The Astra DB application token used for authentication.
cloud_config.connect_timeout
integer
Timeout in seconds for establishing new connections to Cassandra.
Default: 600
cloud_config.request_timeout
integer
Timeout in seconds for individual Cassandra requests.
Default: 600
keyspace_pattern
AllowDenyPattern
Regex patterns to filter keyspaces for ingestion.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
keyspace_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
keyspace_pattern.allow
array
List of regex patterns to include in ingestion.
Default: ['.*']
keyspace_pattern.allow.string
string
keyspace_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
keyspace_pattern.deny.string
string
profile_pattern
AllowDenyPattern
Regex patterns for tables to profile.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
profile_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
profile_pattern.allow
array
List of regex patterns to include in ingestion.
Default: ['.*']
profile_pattern.allow.string
string
profile_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
profile_pattern.deny.string
string
table_pattern
AllowDenyPattern
Regex patterns to filter keyspaces.tables for ingestion.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
table_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
table_pattern.allow
array
List of regex patterns to include in ingestion.
Default: ['.*']
table_pattern.allow.string
string
table_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
table_pattern.deny.string
string
profiling
ProfileConfig
Configuration for profiling.
Default: {'enabled': False, 'operation_config': {'lower_fre...
profiling.catch_exceptions
boolean
Default: True
profiling.column_count
boolean
Default: True
profiling.enabled
boolean
Whether profiling should be done.
Default: False
profiling.field_sample_values_limit
integer
Upper limit for number of sample values to collect for all columns.
Default: 20
profiling.include_field_distinct_count
boolean
Whether to profile for the number of distinct values for each column.
Default: True
profiling.include_field_distinct_value_frequencies
boolean
Whether to profile for distinct value frequencies.
Default: False
profiling.include_field_histogram
boolean
Whether to profile for the histogram for numeric fields.
Default: False
profiling.include_field_max_value
boolean
Whether to profile for the max value of numeric columns.
Default: True
profiling.include_field_mean_value
boolean
Whether to profile for the mean value of numeric columns.
Default: True
profiling.include_field_median_value
boolean
Whether to profile for the median value of numeric columns.
Default: True
profiling.include_field_min_value
boolean
Whether to profile for the min value of numeric columns.
Default: True
profiling.include_field_null_count
boolean
Whether to profile for the number of nulls for each column.
Default: True
profiling.include_field_quantiles
boolean
Whether to profile for the quantiles of numeric columns.
Default: False
profiling.include_field_sample_values
boolean
Whether to profile for the sample values for all columns.
Default: True
profiling.include_field_stddev_value
boolean
Whether to profile for the standard deviation of numeric columns.
Default: True
profiling.limit
integer
Maximum number of rows to profile. By default, all rows are profiled.
profiling.max_workers
integer
Number of worker threads to use for profiling. Set to 1 to disable parallel profiling.
Default: 20
profiling.offset
integer
Offset, in rows, at which profiling starts. By default, no offset is used.
profiling.profile_table_level_only
boolean
Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
profiling.report_dropped_profiles
boolean
Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes.
Default: False
profiling.row_count
boolean
Default: True
profiling.turn_off_expensive_profiling_metrics
boolean
Whether to turn off expensive profiling metrics. This disables profiling for quantiles, distinct_value_frequencies, histogram, and sample_values, and limits the number of fields profiled to 10.
Default: False
profiling.operation_config
OperationConfig
Experimental feature. Specifies operation configs.
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to profile at a lower frequency. This does not schedule anything; it only adds checks that determine when profiling should be skipped.
Default: False
profiling.operation_config.profile_date_of_month
integer
Day of the month, from 1 to 31 (inclusive). If not specified, the field is unset and has no effect.
profiling.operation_config.profile_day_of_week
integer
Day of the week, from 0 to 6 (inclusive), where 0 is Monday and 6 is Sunday. If not specified, the field is unset and has no effect.
stateful_ingestion
StatefulStaleMetadataRemovalConfig
Configuration for stateful ingestion and stale metadata removal.
stateful_ingestion.enabled
boolean
Whether to enable stateful ingestion. Defaults to True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified; otherwise False.
Default: False
stateful_ingestion.remove_stale_metadata
boolean
When stateful_ingestion is enabled, soft-deletes entities that were present in the last successful run but are missing from the current run.
Default: True
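
To show how several of the options above combine, here is a minimal sketch of a source block that filters keyspaces and tables and enables column-level profiling. The keyspace names in the patterns (staging_ and sales) are hypothetical, and the profiling values are illustrative choices rather than recommendations.

```yaml
# Sketch only: keyspace names and profiling values are illustrative assumptions.
source:
  type: "cassandra"
  config:
    contact_point: "localhost"
    port: 9042
    username: "admin"
    password: "password"

    keyspace_pattern:
      deny:
        - '^staging_.*'        # skip hypothetical staging keyspaces

    # table_pattern is matched against keyspace.table names.
    table_pattern:
      allow:
        - '^sales\..*'         # only tables in a hypothetical "sales" keyspace

    profiling:
      enabled: true
      profile_table_level_only: false   # include column-level profiling
      include_field_quantiles: false
      field_sample_values_limit: 5
      max_workers: 4
```

Single-quoted YAML strings are used for the patterns so that the backslash in the regex is passed through unchanged.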

Code Coordinates

  • Class Name: datahub.ingestion.source.cassandra.cassandra.CassandraSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Cassandra, feel free to ping us on our Slack.