Querying Databricks from QuantumLayers: How to Turn Your Lakehouse into a Live Analytical Dataset

How the Databricks SQL Statement Execution API returns query results as JSON over HTTP, why a Personal Access Token is all you need to authenticate, and how QuantumLayers ingests the response into a dataset with automated statistical analysis and AI-generated insights

The Lakehouse Has Your Data. Now What?

Databricks has become the gravitational center of the modern data stack. If your organization runs a lakehouse architecture, there is a good chance that most of your structured analytical data lives inside Databricks: sales transactions, customer records, product catalogs, event logs, marketing attribution tables, financial reporting views. The data is governed by Unity Catalog, stored in Delta Lake format on cloud object storage, and queryable through Databricks SQL warehouses that can handle everything from ad hoc exploration to production-grade reporting workloads.

The problem most teams encounter is not getting data into Databricks. It is getting data out of it in a way that is useful for the people who actually need to make decisions. A data engineer can write a Spark notebook that produces a summary table. A SQL analyst can run a query in the Databricks SQL editor and export the results. But the business user who needs to understand what the numbers mean, spot anomalies, identify trends, and act on the patterns still ends up waiting for someone else to extract the data, transform it into a readable format, and deliver it as a slide deck or a spreadsheet attachment.

Databricks recognized this problem when it built the SQL Statement Execution API. The API exposes any SQL query against a Databricks SQL warehouse as a standard HTTP POST request that returns structured JSON. No JDBC driver. No ODBC connection. No Spark cluster. Just an HTTPS endpoint, a Bearer token, a SQL statement, and a JSON response containing the results. That design makes Databricks queryable by any tool that can make a REST API call, which includes QuantumLayers.

This post walks through the full connection: how the Databricks API works, how to configure it inside QuantumLayers’ REST connector, what the response looks like, and what happens to the data once it lands in the platform’s statistical analysis and AI insights pipeline. By the end, your lakehouse data will be a live dataset inside QuantumLayers, refreshable on demand and queryable by QL-Agent in plain English.

Why the REST API Path Is Better Than Exporting Files

Before walking through the connection, it is worth understanding why querying Databricks through its REST API is fundamentally different from the alternative approach of exporting data from Databricks and uploading it as a file.

The export path looks like this: someone runs a query in Databricks, downloads the result as a CSV, and uploads it to QuantumLayers. The data is now a frozen snapshot. It reflects reality at the moment the query ran and starts aging immediately. If the underlying tables in Databricks are updated hourly or daily (which they usually are), the uploaded file is already stale by the next refresh cycle. Repeating the export requires a human to log into Databricks, re-run the query, download the file, and re-upload it. That loop is the same manual treadmill that data lakes and lakehouses were supposed to eliminate.

The REST API path eliminates every step in that loop. QuantumLayers stores the connection as a set of instructions: the Databricks workspace URL, the warehouse ID, the SQL query, and the authentication token. When the dataset needs to be refreshed, the platform re-executes those instructions, pulls the latest results from Databricks, and reprocesses the dataset through the analysis pipeline. No human intervention. No file transfer. No stale snapshots. The dataset in QuantumLayers is always the result of the most recent successful query against the lakehouse.

There is an additional advantage that is easy to overlook. When you query Databricks through the SQL API, the Databricks SQL warehouse handles all the heavy lifting of reading Delta Lake tables, resolving Parquet files, applying schema enforcement, and executing joins and aggregations at the source. You do not need to worry about file formats, partitioning strategies, or storage layout. You write a SELECT statement. Databricks figures out how to execute it. QuantumLayers receives clean, structured JSON. The complexity of the lakehouse’s internal storage is entirely abstracted away.

How the Databricks SQL Statement Execution API Works

The Databricks SQL Statement Execution API is a REST endpoint that accepts a SQL query and returns the results as JSON. It was designed specifically for building integrations between Databricks and external applications without requiring driver installations, connection pooling, or Spark runtime dependencies. The API is available on Databricks Premium and Enterprise tiers across AWS, Azure, and GCP.

The endpoint is:

POST https://<your-workspace>.cloud.databricks.com/api/2.0/sql/statements

Replace <your-workspace> with your Databricks workspace instance. For Azure Databricks, the host looks like adb-1234567890123456.7.azuredatabricks.net. For AWS, it looks like dbc-a1b2c3d4-e5f6.cloud.databricks.com. You can find the exact hostname at the top of your Databricks workspace URL when you are logged in.

The request body is a JSON object with two required fields: the SQL statement you want to execute and the ID of the SQL warehouse that should execute it. A request for the last 90 days of daily order revenue, grouped by product category, looks like this:

{
  "warehouse_id": "a1b2c3d4e5f67890",
  "statement": "SELECT order_date, product_category, SUM(revenue) AS total_revenue, COUNT(*) AS order_count FROM analytics.sales.daily_orders WHERE order_date >= DATEADD(DAY, -90, CURRENT_DATE()) GROUP BY order_date, product_category ORDER BY order_date DESC",
  "wait_timeout": "30s",
  "on_wait_timeout": "CANCEL",
  "format": "JSON_ARRAY",
  "disposition": "INLINE"
}

The wait_timeout tells Databricks how long to wait for the query to complete before returning. Setting it to 30 seconds with on_wait_timeout set to CANCEL means the query will be canceled if it takes longer than 30 seconds, rather than continuing asynchronously in the background. The format and disposition fields tell Databricks to return the results as an inline JSON array rather than as external links to files in cloud storage. For datasets that fit within the 25 MiB inline limit (which covers the vast majority of analytical summaries and aggregated reporting queries), this is the simplest configuration.

Authentication uses a Personal Access Token (PAT) passed as a Bearer token in the Authorization header. You generate a PAT inside the Databricks workspace under User Settings, and it remains valid until you revoke it or until it reaches an optional expiry date that you configure at creation time. The token grants the same permissions as the user who created it, scoped to the SQL warehouse and the data objects referenced in the query. Databricks recommends creating a dedicated service principal for production integrations, but for connecting QuantumLayers to a specific set of analytical queries, a user-scoped PAT with read-only SQL warehouse access is sufficient and straightforward.

What the Response Looks Like

Understanding the response structure matters because it determines how QuantumLayers parses the data. The Databricks Statement Execution API returns a JSON response with three top-level objects: status, manifest, and result.

A successful response from our example query looks like this (abbreviated for clarity):

{
  "statement_id": "01ed92c5-3583-1f38-b21b-c6773e7c56b3",
  "status": {
    "state": "SUCCEEDED"
  },
  "manifest": {
    "format": "JSON_ARRAY",
    "schema": {
      "column_count": 4,
      "columns": [
        { "name": "order_date", "type_name": "DATE", "position": 0 },
        { "name": "product_category", "type_name": "STRING", "position": 1 },
        { "name": "total_revenue", "type_name": "DECIMAL", "position": 2 },
        { "name": "order_count", "type_name": "LONG", "position": 3 }
      ]
    },
    "total_row_count": 540,
    "total_chunk_count": 1,
    "truncated": false
  },
  "result": {
    "chunk_index": 0,
    "row_count": 540,
    "data_array": [
      ["2026-05-10", "Electronics", "48750.00", "312"],
      ["2026-05-10", "Apparel", "22340.50", "187"],
      ["2026-05-09", "Electronics", "51200.00", "298"]
    ]
  }
}

The manifest.schema.columns array provides the column names and types. The result.data_array contains the actual rows, each represented as an array of string values in the same order as the columns defined in the manifest. This structure, where column names and row data are provided in separate arrays, is a column-oriented JSON format that QuantumLayers’ automatic JSON detection recognizes and converts into a flat tabular dataset.

The manifest and result objects are nested inside the top-level response envelope. To reach the row data, QuantumLayers needs to know the path: result.data_array. And to extract column names, it reads the schema at manifest.schema.columns. These paths are configured once in the QuantumLayers connection, and the platform handles the extraction and alignment from that point on.

Configuring the Connection in QuantumLayers

The connection follows the same five-field template that QuantumLayers uses for every REST API source. If you have already connected Google Analytics, Stripe, or any other API, this process will feel familiar. If this is your first REST connection, the REST API connector walkthrough covers the fundamentals in detail.

Go to the QuantumLayers Connect page and select REST API as the connection type. Fill in the following fields:

Endpoint URL

The full URL of the Statement Execution API endpoint on your Databricks workspace:

https://<your-workspace-host>/api/2.0/sql/statements

The workspace host is the domain you see in your browser when you are logged into Databricks. For Azure Databricks, it will be something like adb-1234567890123456.7.azuredatabricks.net. For AWS, it will be something like dbc-a1b2c3d4-e5f6.cloud.databricks.com. For GCP, it follows a similar pattern. Use the exact hostname without adding a trailing slash.

HTTP Method

Set to POST. The Statement Execution API accepts SQL queries as a JSON payload in the request body, which requires a POST request.

Authentication

Select Bearer Token from the authentication dropdown and paste your Databricks Personal Access Token. To generate one, open your Databricks workspace, click your username in the top-right corner, go to User Settings, then Developer, and click Generate New Token. Give it a descriptive name like “QuantumLayers read-only” and set an expiry that aligns with your organization’s security policy. Copy the token immediately after creation, because Databricks will not show it again.

The token is stored encrypted in QuantumLayers and attached as an Authorization: Bearer <token> header on every API call. The user who created the token must have CAN USE permission on the SQL warehouse being queried, plus read access to the catalogs, schemas, and tables referenced in the SQL statement. These permissions are enforced by Unity Catalog, so the connection inherits whatever governance rules your Databricks administrators have already configured.

Request Body

The JSON payload containing your SQL query and the warehouse configuration. At minimum, it needs the warehouse_id and the statement. Including wait_timeout, format, and disposition makes the behavior explicit and ensures the response comes back in a format QuantumLayers can parse.

Here is a practical example that pulls a customer cohort analysis from a lakehouse:

{
  "warehouse_id": "a1b2c3d4e5f67890",
  "statement": "SELECT signup_month, acquisition_channel, COUNT(DISTINCT customer_id) AS customers, SUM(lifetime_revenue) AS total_ltv, AVG(lifetime_revenue) AS avg_ltv, SUM(CASE WHEN is_churned THEN 1 ELSE 0 END) AS churned FROM analytics.customers.cohort_summary WHERE signup_month >= '2025-01-01' GROUP BY signup_month, acquisition_channel ORDER BY signup_month",
  "wait_timeout": "30s",
  "on_wait_timeout": "CANCEL",
  "format": "JSON_ARRAY",
  "disposition": "INLINE"
}

To find your warehouse ID, open the Databricks SQL interface, click on the SQL warehouse you want to use, and look at the Connection Details tab. The warehouse ID is the string of characters that follows /sql/1.0/warehouses/ in the HTTP Path field. It is also visible in the URL when you have the warehouse selected in the Databricks UI.

The SQL statement can be any valid SQL that your warehouse supports. That includes SELECT queries against Delta Lake tables, views, materialized views, and cross-catalog joins. It supports Databricks SQL syntax including window functions, CTEs, QUALIFY clauses, and the full set of built-in functions documented in the Databricks SQL language reference. The query executes against the live state of your lakehouse tables, so the results always reflect the most recent data that has been written and committed.

Response Path

Set the response path to result.data_array. This tells QuantumLayers where to find the actual row data inside the Databricks response envelope. The platform extracts the array at that path, reads the column names from manifest.schema.columns, aligns each value in the row arrays with its corresponding column name, and produces a flat tabular dataset with properly named columns and inferred data types.

Click Connect. QuantumLayers sends the POST request to your Databricks workspace, receives the JSON response, parses the column structure from the manifest, extracts the row data from result.data_array, infers types for each column, computes summary statistics, and produces a dataset that is immediately ready for charting, statistical analysis, and AI insights.

What Happens After the Data Lands

Once the Databricks query results are inside QuantumLayers, they are processed through the same pipeline that handles every other data source on the platform. The processing is identical whether the data came from a Databricks query, a Google Analytics API call, a PostgreSQL connection, or a CSV upload.

The first thing the platform does is type inference. Databricks returns all values as strings inside the data_array, regardless of their actual type. A revenue value comes back as "48750.00", not as a numeric 48750.00. A date comes back as "2026-05-10", not as a date object. QuantumLayers examines the values across every column, detects the underlying types using pattern matching and statistical sampling, and converts them accordingly. The order_date column becomes a date. The total_revenue column becomes a decimal. The product_category column stays as text. This inference is what allows the charting engine to format axes correctly and the statistical engine to select the appropriate tests for each column pair.

After type inference, the platform computes column-level statistics: minimum and maximum values, mean and median, standard deviation, distinct count, null count, and a sample of representative values. These statistics feed directly into the AI insights engine and are visible on the dataset details page alongside data quality diagnostics.

From there, you can go in several directions. The chart builder lets you visualize any combination of columns across 14 chart types, from time series lines to scatter plots to violin distributions. AI insights run the full 9-step statistical pipeline, covering Pearson correlations, ANOVA with non-parametric fallbacks, distribution analysis, temporal trend detection with stationarity testing and structural break detection, regression, categorical entropy, chi-square tests, cross-correlation for leading and lagging relationships, and multicollinearity diagnostics. Every finding that survives the Benjamini-Hochberg false discovery rate correction is translated into a plain-language business insight with an importance score and a recommended action. The full methodology behind the statistical pipeline is documented in Understanding Your Data: A Comprehensive Guide to Statistical Analysis.

A Practical Example: Customer Cohort Analysis

To make this concrete, consider the customer cohort query from the configuration section above. A SaaS company uses Databricks to process subscription data from their production database nightly. The data engineering team maintains a cohort_summary table in the analytics catalog that contains one row per signup month per acquisition channel, with metrics for customer count, lifetime revenue, average lifetime value, and churn count.

The head of growth connects this table to QuantumLayers using the Databricks REST API connector. The connection takes about three minutes to configure. The resulting dataset has six columns and roughly 200 rows (18 months of data across a dozen acquisition channels). It is small enough to fit well within the 25 MiB inline response limit.

She generates AI insights. The statistical pipeline identifies several findings: a strong negative correlation between cohort age and churn rate (older cohorts are more stable, which is expected), a statistically significant difference in average LTV across acquisition channels (organic search customers have 40% higher LTV than paid social customers, which was suspected but never quantified), a structural break in the organic search cohort starting in November 2025 (coinciding with a site redesign that the marketing team had not connected to acquisition quality), and a leading relationship where changes in paid social spend predict changes in total new customer count with a one-month lag.

She saves the most relevant charts, creates a weekly scheduled report that re-runs the Databricks query and regenerates insights every Monday morning, and sends it to the growth team and the CFO. The report arrives with updated data from the lakehouse, a refreshed AI summary, and inline charts showing week-over-week changes. Nobody downloaded a CSV. Nobody opened a notebook. The data flowed from Databricks to the inbox through a connection that was configured once.

Merging Databricks Data with Other Sources

One of the most powerful patterns that emerges when Databricks data is inside QuantumLayers is merging it with data from other sources that do not live in the lakehouse.

Consider the SaaS company from the previous example. Their Databricks lakehouse contains subscription and revenue data. But their marketing team tracks campaign performance in Google Analytics. Their finance team maintains budget forecasts in Google Sheets. Their support team uses a helpdesk tool that exposes ticket data through a REST API.

Inside QuantumLayers, each of these sources is a dataset. The Databricks cohort data comes through the SQL Statement Execution API. The Google Analytics traffic data comes through the GA4 REST API connector. The budget forecasts come through the Google Sheets connector. The support ticket data comes through another REST API connection. Using the platform’s merge interface, these datasets can be joined on shared keys like acquisition channel, month, or customer ID to produce a unified analytical view that spans the entire business.

That merged dataset is where the most valuable insights live. Correlations between marketing spend (from GA) and customer LTV (from Databricks). Comparisons between budgeted revenue (from Sheets) and actual revenue (from Databricks). Relationships between support ticket volume (from the helpdesk API) and churn rates (from the lakehouse). None of these cross-source analyses are possible when each dataset lives in a separate tool. The merge is what creates the analytical value that no single source can provide on its own. The full mechanics of multi-source merging are covered in The Data Ingestion Challenge.

Keeping the Data Fresh

Once the Databricks connection is saved, there are three ways to keep the dataset current.

The first is manual refresh. From the dataset view, click Refresh. QuantumLayers re-executes the stored SQL statement against the Databricks warehouse, pulls the latest results, and reprocesses the dataset. This is the right pattern for interactive analysis sessions where you want the most current data at the moment you are working.

The second is scheduled reports. Configure a report that includes the Databricks dataset with a daily or weekly cadence. At the scheduled time, QuantumLayers re-syncs the connection, regenerates insights against the latest data, and delivers the report by email. The lakehouse tables are queried automatically. The AI analysis runs automatically. The report arrives automatically. Configuration details are in the QuantumLayers help documentation.

The third is QL-Agent. Open the conversational agent and type something like “refresh the Databricks cohort data and tell me what changed since last week.” The agent re-executes the SQL query, compares the updated dataset to its previous state, and responds with a plain-language summary of the material changes, complete with inline charts. The mechanics of how QL-Agent orchestrates multi-step workflows are covered in Agentic Data Analytics and QL-Agent.

Sizing Considerations and the 25 MiB Limit

The one constraint worth understanding is the inline response size limit. When disposition is set to INLINE (which is the configuration QuantumLayers uses), Databricks caps the response payload at 25 MiB. If the query result exceeds that size, the API returns an error rather than the data.

In practice, 25 MiB is a substantial amount of JSON data. A dataset with 50 columns and 100,000 rows of typical business data (dates, categories, numeric values) fits comfortably within the limit. Most analytical use cases, including daily aggregates, cohort summaries, product performance reports, and regional breakdowns, produce datasets well below this threshold.

If your query produces a result set that exceeds 25 MiB, the solution is to adjust the query. Add tighter date filters to reduce the time range. Pre-aggregate at the source using GROUP BY instead of pulling individual transaction rows. Select only the columns you need rather than using SELECT *. Limit the row count with a LIMIT clause. The right mental model is that the SQL query in the QuantumLayers connection is a reporting query, not an extraction query. It should return the analytical dataset you need for a specific analysis, not a raw dump of every row in a table.

Security and Access Control

The security model for this connection operates at multiple layers, each reinforcing the others.

At the transport layer, every request uses TLS 1.2 or above. The API endpoint is HTTPS-only, and Databricks rejects unencrypted connections.

At the authentication layer, the Personal Access Token maps to a specific Databricks user or service principal. If the token is revoked, the connection stops working immediately. If the user’s permissions are reduced, the next query that references a restricted table will fail with an access denied error. The token does not grant blanket access to the workspace; it inherits the exact permissions of the identity that created it.

At the authorization layer, Unity Catalog enforces fine-grained access control on every catalog, schema, and table referenced in the SQL statement. If a column is masked, the query returns the masked values. If a table is restricted to specific groups, users outside those groups cannot query it, even through the API. The governance policies that your Databricks administrators have already configured apply identically to API-initiated queries.

At the storage layer, QuantumLayers encrypts the Bearer token at rest and in transit. The token is never exposed in frontend code after the initial configuration. The full credential security model is documented in the QuantumLayers help documentation.

If your Databricks workspace is behind a private network or IP allowlist, you will need to ensure that QuantumLayers’ outbound IP addresses are permitted. This is an infrastructure consideration that your Databricks administrator can configure in the workspace networking settings.

Databricks on AWS, Azure, and GCP

The Statement Execution API is available on all three cloud platforms where Databricks runs. The connection configuration in QuantumLayers is identical across all three. The only difference is the workspace hostname in the endpoint URL. For Azure Databricks, the hostname follows the adb-*.azuredatabricks.net pattern. For AWS, it follows the *.cloud.databricks.com pattern. For GCP, it follows a similar structure. The request body, authentication method, and response format are identical regardless of the underlying cloud provider.

This also means that if your organization runs Databricks on Azure with data stored in Azure Data Lake Storage Gen2, the connection described in this post is the cleanest way to get that data into QuantumLayers. Rather than trying to access ADLS Gen2 files directly (which requires dealing with SAS tokens, file formats, and storage-layer authentication), you query the data through Databricks, which handles the complexity of reading Delta Lake tables from blob storage and returns clean, typed, structured JSON. Databricks becomes the access layer. QuantumLayers becomes the analysis layer. The storage layer stays invisible.

Conclusion

Databricks already has the data. QuantumLayers turns it into analysis. The connection between them is a POST request: a workspace URL, a warehouse ID, a SQL statement, and a Bearer token. No driver to install. No pipeline to build. No ETL to maintain. The SQL Statement Execution API returns query results as structured JSON, and QuantumLayers ingests that JSON into a live dataset with automated statistical profiling, AI-generated insights, interactive charting, and conversational querying through QL-Agent.

The SQL query in the connection defines the analytical scope. It can be a simple aggregation or a complex multi-table join across catalogs. It executes against the live state of your lakehouse. When the dataset in QuantumLayers is refreshed, the query runs again, the latest data comes back, and the analysis updates. When the insights need to reach stakeholders on a schedule, the reporting engine delivers them automatically. When a question comes up between scheduled reports, QL-Agent answers it in seconds.

The lakehouse solved the data storage and governance problem. QuantumLayers solves the analytical access problem. The bridge between them is a five-field REST connection that takes three minutes to configure.

This post is part of the QuantumLayers blog series on connecting live data sources for AI-powered analysis. For the general walkthrough of how the REST API connector works with any API, see Live Data Without the Export Cycle: Connecting QuantumLayers to Any REST API. For what happens to your data once it lands in the platform, see Understanding Your Data: A Comprehensive Guide to Statistical Analysis. For the broader picture of all ingestion methods QuantumLayers supports, see The Data Ingestion Challenge. Connect your first dataset at www.quantumlayers.com.