1.1. Extraction from Google Sheets

This article explains how to extract tables from Google Sheets to AWS S3 with Apache Airflow. We start with a simple approach for small tables, then scale up as your data grows. Along the way, we cover Google Sheets' 10-million-cell limit, secrets and configuration, code examples, and a design that stays robust as requirements evolve.

Overview

Many teams keep small-to-medium datasets in Google Sheets and need a dependable way to land that data in S3 for downstream analytics (dbt, Spark, Trino, Athena). This article focuses on a low-friction path to operationalize that flow with Airflow—starting simple for quick wins and growing into a robust pattern as volume and complexity increase. You’ll see how to schedule secure extractions, choose the right file format, and evolve the design without rework as your needs scale.

Design overview

Two complementary approaches:

  1. Small tables: use GoogleApiToS3Operator (Sheets API values.get) → nested JSON (fastest to set up).

  2. Growing tables: use a custom operator to write JSONL (one JSON row per line) → easier querying and scaling.

Contract (inputs/outputs)

  • Inputs:
    • Google Spreadsheet ID, worksheet name, and range (first row must contain headers).

    • Airflow connections for Google and AWS.

    • S3 bucket and destination path.

  • Outputs:
    • Files in S3 (JSON or JSONL), versioned by date if desired.

  • Success criteria:
    • All expected rows are extracted.

    • Files are written to the configured S3 path and can be queried.

  • Error modes:
    • Missing/invalid credentials (GCP/AWS), incorrect range or sheet name, API quota errors.

Limitations and implications

  • Google Sheets enforces a hard limit of 10 million cells per spreadsheet.
    • As tables grow (columns × rows), you may hit this cap.

    • Mitigations: split by time (multiple worksheets/spreadsheets) or serialize columns into a single cell per row (see below).

  • The Sheets API values.get returns a nested values array that is less convenient to query directly.
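
For reference, the values.get response for a small sheet has roughly this shape; the spreadsheet name and data below are made up for illustration:

# Roughly the shape returned by spreadsheets.values.get (row-major):
sample_response = {
    "range": "orders!A1:C3",                 # illustrative sheet name and range
    "majorDimension": "ROWS",
    "values": [
        ["order_id", "customer", "amount"],  # first row: headers
        ["1001", "acme", "49.90"],
        ["1002", "globex", "12.50"],
    ],
}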

Setup and secrets

  • Install Airflow providers:
    • apache-airflow

    • apache-airflow-providers-google

    • apache-airflow-providers-amazon

  • Create Airflow connections:
    • google_cloud_default: service account with Sheets API enabled (keyfile JSON or secret backend).

    • aws_default: IAM with write access to the S3 bucket.

  • Prepare configuration:
    • Spreadsheet ID, worksheet name(s), range (e.g., A:Z or A1:Z10000 including headers).

    • S3 bucket and prefix (e.g., s3://your-bucket/raw/google_sheets/<sheet>/).
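
With the connections and configuration above in place, it is worth a quick sanity check before writing any DAG. The sketch below assumes the two connection IDs shown earlier; check_access is a hypothetical helper you would run from a scratch script inside the Airflow environment:

# Minimal connectivity check (a sketch; connection IDs assumed from the setup above).
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.google.suite.hooks.sheets import GSheetsHook


def check_access(spreadsheet_id: str, bucket: str) -> None:
    # Read a single cell to confirm the service account can see the sheet.
    sheets = GSheetsHook(gcp_conn_id="google_cloud_default")
    print(sheets.get_values(spreadsheet_id=spreadsheet_id, range_="A1:A1"))

    # Confirm the AWS connection can reach the destination bucket.
    s3 = S3Hook(aws_conn_id="aws_default")
    print("bucket reachable:", s3.check_for_bucket(bucket_name=bucket))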

Approach A — Simple JSON (best for small tables)

This uses GoogleApiToS3Operator to call spreadsheets.values.get and store the response JSON in S3.

 1"""
 2Example Airflow DAG: extract a Google Sheets worksheet range to S3 using
 3GoogleApiToS3Operator (row-major JSON from Sheets API).
 4
 5Generalized (no project-specific names). Requires Airflow + providers installed:
 6  - apache-airflow
 7  - apache-airflow-providers-google
 8  - apache-airflow-providers-amazon
 9"""
10from __future__ import annotations
11
12from airflow import DAG
13from airflow.operators.empty import EmptyOperator
14from airflow.providers.amazon.aws.transfers.google_api_to_s3 import (
15    GoogleApiToS3Operator,
16)
17import pendulum
18
19
20with DAG(
21    dag_id="example_gsheet_to_s3_basic",
22    description="Copy a Google Sheets range to S3 as JSON using GoogleApiToS3Operator",
23    schedule="0 8-15 * * 1,2",  # Hourly Mon-Tue 08:00-15:00 UTC
24    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
25    catchup=False,
26    tags=["google-sheets", "s3", "extraction"],
27    default_args={"retries": 2},
28) as dag:
29    start = EmptyOperator(task_id="start")
30    end = EmptyOperator(task_id="end")
31
32    copy_gsheet_to_s3 = GoogleApiToS3Operator(
33        task_id="copy_worksheet_to_s3",
34        gcp_conn_id="google_cloud_default",  # Configure in Airflow
35        google_api_service_name="sheets",
36        google_api_service_version="v4",
37        google_api_endpoint_path="sheets.spreadsheets.values.get",
38        google_api_endpoint_params={
39            "spreadsheetId": "<YOUR_SPREADSHEET_ID>",
40            "range": "<WORKSHEET_NAME>!A:Z",  # Includes headers in first row
41        },
42        aws_conn_id="aws_default",
43        s3_destination_key="s3://<YOUR_BUCKET>/raw/google_sheets/<WORKSHEET_NAME>/data.json",
44        s3_overwrite=True,
45    )
46
47    start >> copy_gsheet_to_s3 >> end

Code walkthrough

  • google_api_endpoint_params supplies spreadsheetId and range.

  • Output is a JSON object containing range, majorDimension, and values (list-of-lists).

  • Suitable for small tables and quick wins; you can normalize later in SQL/ETL.
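
If you prefer to normalize in Python rather than SQL, the nested payload flattens into row dictionaries in a few lines. Note that values.get omits trailing empty cells, so rows can be shorter than the header row; the sketch below pads them (flatten_values_payload is an illustrative helper, not part of any operator here):

import json
from typing import Any, Dict, List


def flatten_values_payload(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a values.get response into a list of header-keyed dicts."""
    values = payload.get("values") or []
    if len(values) < 2:  # need a header row plus at least one data row
        return []
    headers, rows = values[0], values[1:]
    records = []
    for row in rows:
        padded = list(row) + [None] * (len(headers) - len(row))  # pad ragged rows
        records.append(dict(zip(headers, padded)))
    return records


# Example: records = flatten_values_payload(json.load(open("data.json")))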

Loading the nested JSON

CREATE EXTERNAL TABLE IF NOT EXISTS raw.google_sheet_values (
        `range` STRING,
        majorDimension STRING,
        `values` ARRAY<ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/raw/google_sheets/worksheet_name/';

Trade-offs

  • Pros: minimal setup, leverages built-in operator.

  • Cons: nested format is harder to query; not ideal as data grows.

Approach B — JSONL per row (better for growth)

We switch to row-wise JSON Lines so each row is a separate JSON object (easier for dbt/Spark/Trino).

Operator (generic)

  1"""
  2Generic Airflow operator to extract a Google Sheets range and write JSONL to S3.
  3
  4Each data row becomes a JSON object (one per line). Optionally, serialize all
  5columns into a single payload string to reduce Google Sheets cell usage.
  6"""
  7from __future__ import annotations
  8
  9import json
 10from datetime import datetime
 11from typing import Any, List, Optional
 12
 13from airflow.models.baseoperator import BaseOperator
 14from airflow.providers.amazon.aws.hooks.s3 import S3Hook
 15from airflow.providers.google.suite.hooks.sheets import GSheetsHook
 16
 17
 18class GSheetToS3JsonlOperator(BaseOperator):
 19    """
 20    Fetch data from Google Sheets, convert each row to JSONL, and upload to S3.
 21
 22    Parameters:
 23        gcp_conn_id: Airflow connection ID for Google (e.g., "google_cloud_default").
 24        aws_conn_id: Airflow connection ID for AWS (e.g., "aws_default").
 25        spreadsheet_id: Google Spreadsheet ID.
 26        worksheet_name: Sheet/tab name.
 27        worksheet_range: Cell range, e.g., "A:Z" or "A1:Z9999". Must include headers in first row.
 28        s3_bucket: Destination S3 bucket.
 29        s3_key: Destination S3 key (object path). Can include templates like {{ ds }}.
 30        serialize_columns: If True, serialize all columns into a single string field.
 31        serialization_separator: Separator for column serialization when serialize_columns=True.
 32        payload_field: Field name to store serialized payload.
 33        extras: Optional dict to include extra fields in each JSON line.
 34    """
 35
 36    template_fields = ("s3_key", "worksheet_name", "worksheet_range")
 37
 38    def __init__(
 39        self,
 40        *,
 41        gcp_conn_id: str,
 42        aws_conn_id: str,
 43        spreadsheet_id: str,
 44        worksheet_name: str,
 45        worksheet_range: str,
 46        s3_bucket: str,
 47        s3_key: str,
 48        serialize_columns: bool = False,
 49        serialization_separator: str = "|",
 50        payload_field: str = "payload",
 51        extras: Optional[dict] = None,
 52        **kwargs,
 53    ) -> None:
 54        super().__init__(**kwargs)
 55        self.gcp_conn_id = gcp_conn_id
 56        self.aws_conn_id = aws_conn_id
 57        self.spreadsheet_id = spreadsheet_id
 58        self.worksheet_name = worksheet_name
 59        self.worksheet_range = worksheet_range
 60        self.s3_bucket = s3_bucket
 61        self.s3_key = s3_key
 62        self.serialize_columns = serialize_columns
 63        self.serialization_separator = serialization_separator
 64        self.payload_field = payload_field
 65        self.extras = extras or {}
 66
 67    def _serialize_row(self, headers: List[str], row: List[Optional[str]]) -> dict:
 68        # pad row to headers length
 69        padded = list(row) + [None] * (len(headers) - len(row))
 70        if self.serialize_columns:
 71            # Join columns into a single string. Replace separators in values to avoid collisions.
 72            safe_values = [
 73                "" if v is None else str(v).replace(self.serialization_separator, f"\\{self.serialization_separator}")
 74                for v in padded
 75            ]
 76            return {self.payload_field: self.serialization_separator.join(safe_values)}
 77        else:
 78            return dict(zip(headers, padded))
 79
 80    def execute(self, context: Any):
 81        logical_date = context["ds"]
 82        self.log.info(
 83            "Reading spreadsheet_id=%s sheet=%s range=%s",
 84            self.spreadsheet_id,
 85            self.worksheet_name,
 86            self.worksheet_range,
 87        )
 88
 89        hook = GSheetsHook(gcp_conn_id=self.gcp_conn_id)
 90        values = hook.get_values(
 91            spreadsheet_id=self.spreadsheet_id, range_=f"{self.worksheet_name}!{self.worksheet_range}"
 92        )
 93
 94        if not values or len(values) < 2:
 95            self.log.warning("No data found in worksheet '%s'. Skipping.", self.worksheet_name)
 96            return None
 97
 98        raw_headers = values[0]
 99        headers = [h.strip().replace(" ", "_").lower() for h in raw_headers]
100        rows = values[1:]
101
102        date_of_transfer = datetime.fromisoformat(logical_date).strftime("%Y-%m-%d")
103        file_name = f"{self.worksheet_name}.jsonl"
104
105        json_lines: List[str] = []
106        for row in rows:
107            obj = self._serialize_row(headers, row)
108            obj.update(
109                {
110                    "worksheet": self.worksheet_name,
111                    "range": self.worksheet_range,
112                    "file_name": file_name,
113                    "date_of_file_transfer": date_of_transfer,
114                }
115            )
116            if self.extras:
117                obj.update(self.extras)
118            json_lines.append(json.dumps(obj, ensure_ascii=False))
119
120        output_data = "\n".join(json_lines)
121
122        s3_hook = S3Hook(aws_conn_id=self.aws_conn_id)
123        s3_hook.load_string(
124            string_data=output_data, key=self.s3_key, bucket_name=self.s3_bucket, replace=True
125        )
126
127        self.log.info(
128            "Uploaded %d rows to s3://%s/%s", len(rows), self.s3_bucket, self.s3_key
129        )
130        return f"s3://{self.s3_bucket}/{self.s3_key}"

Example DAG

 1"""
 2Example Airflow DAG: extract a Google Sheets worksheet range to S3 as JSONL
 3using a custom operator that normalizes row-wise and adds metadata.
 4
 5Generalized (no project-specific names). Requires Airflow + providers installed.
 6"""
 7from __future__ import annotations
 8
 9from airflow import DAG
10from airflow.operators.empty import EmptyOperator
11import pendulum
12
13# Adjust import to your Airflow project structure (e.g., plugins/operators)
14try:
15    from operators.gsheet_to_s3_jsonl import GSheetToS3JsonlOperator  # type: ignore
16except Exception:  # pragma: no cover - example import fallback
17    GSheetToS3JsonlOperator = None  # type: ignore
18
19
20with DAG(
21    dag_id="example_gsheet_to_s3_jsonl",
22    description="Copy a Google Sheets range to S3 as JSONL (one object per row)",
23    schedule="0 3 * * *",  # Daily 03:00 UTC
24    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
25    catchup=False,
26    tags=["google-sheets", "s3", "jsonl"],
27    default_args={"retries": 2},
28) as dag:
29    start = EmptyOperator(task_id="start")
30    end = EmptyOperator(task_id="end")
31
32    if GSheetToS3JsonlOperator is None:
33        raise ImportError(
34            "GSheetToS3JsonlOperator not importable. Place operators/gsheet_to_s3_jsonl.py in your Airflow project and adjust the import."
35        )
36
37    copy_gsheet_to_s3 = GSheetToS3JsonlOperator(
38        task_id="copy_worksheet_to_s3",
39        gcp_conn_id="google_cloud_default",
40        aws_conn_id="aws_default",
41        spreadsheet_id="<YOUR_SPREADSHEET_ID>",
42        worksheet_name="<WORKSHEET_NAME>",
43        worksheet_range="A:Z",
44        s3_bucket="<YOUR_BUCKET>",
45        s3_key="raw/google_sheets/<WORKSHEET_NAME>/data_{{ ds_nodash }}.jsonl",
46        # set serialize_columns=True to collapse all columns into one payload string
47        serialize_columns=False,
48        serialization_separator="|",
49        payload_field="payload",
50        extras={"source": "google_sheets"},
51    )
52
53    start >> copy_gsheet_to_s3 >> end

Code walkthrough

  • Reads headers and data via GSheetsHook; creates one JSON per row.

  • Adds metadata: worksheet, range, file_name, date_of_file_transfer.

  • Writes to S3 as .jsonl so downstream tools can process line-by-line.
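
To spot-check the output downstream, the JSONL file can be read line by line with any JSON-aware tool; for example, with pandas and s3fs installed (the path shown is the S3 key template above rendered for ds=2025-01-01):

import pandas as pd

df = pd.read_json(
    "s3://<YOUR_BUCKET>/raw/google_sheets/<WORKSHEET_NAME>/data_20250101.jsonl",
    lines=True,  # one JSON object per line
)
print(df.head())
print(df["date_of_file_transfer"].unique())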

Schema for JSONL

CREATE EXTERNAL TABLE IF NOT EXISTS raw.google_sheet_jsonl (
        -- your columns inferred from headers
        col1 STRING,
        col2 STRING,
        -- metadata
        worksheet STRING,
        `range` STRING,
        file_name STRING,
        date_of_file_transfer DATE
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/raw/google_sheets/worksheet_name/';

Dealing with the 10-million-cell limit: column serialization

If you are constrained by the Sheets cell limit, store all columns as a single serialized field per row in the sheet, and reconstruct downstream. The operator supports this via serialize_columns=True.

Example (conceptual):

copy_gsheet_to_s3 = GSheetToS3JsonlOperator(
        task_id="copy_worksheet_to_s3_serialized",
        gcp_conn_id="google_cloud_default",
        aws_conn_id="aws_default",
        spreadsheet_id="<YOUR_SPREADSHEET_ID>",
        worksheet_name="<WORKSHEET_NAME>",
        worksheet_range="A:Z",
        s3_bucket="<YOUR_BUCKET>",
        s3_key="raw/google_sheets/<WORKSHEET_NAME>/data_{{ ds_nodash }}.jsonl",
        serialize_columns=True,
        serialization_separator="|",   # escape handled in operator
        payload_field="payload",
)

Downstream, split payload by the chosen separator to reconstruct columns. This reduces sheet cell usage (one cell per row instead of many), at the cost of manual parsing later.
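
A sketch of that parsing step in Python, mirroring the escaping done in the operator's _serialize_row (split_payload is an illustrative helper, not part of the operator):

import re
from typing import Dict, List

SEPARATOR = "|"


def split_payload(payload: str, headers: List[str]) -> Dict[str, str]:
    """Split on separators that are not escaped, then unescape the values."""
    parts = re.split(rf"(?<!\\){re.escape(SEPARATOR)}", payload)
    parts = [p.replace(f"\\{SEPARATOR}", SEPARATOR) for p in parts]  # unescape
    return dict(zip(headers, parts))


print(split_payload("1|a\\|b|", ["id", "name", "note"]))
# -> {'id': '1', 'name': 'a|b', 'note': ''}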

Scheduling, monitoring, and reliability

  • Align the schedule with the window in which the sheet is updated manually; configure retries (2–4) and alerting on failure (see the sketch after this list).

  • Version outputs (e.g., data_{{ ds_nodash }}.jsonl) for reproducibility.

  • Consider sensors if you depend on other upstream tasks.
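
One common way to wire up retries and alerting is through default_args; the sketch below uses a placeholder callback (notify_failure), which you would replace with Slack, PagerDuty, or email notification:

from datetime import timedelta


def notify_failure(context):
    # Placeholder: push to Slack, PagerDuty, email, etc.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed for {context['ds']}")


default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "on_failure_callback": notify_failure,
}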

Security and configuration

  • Keep credentials in Airflow Connections or a secret backend (e.g., AWS Secrets Manager) — not in code.

  • Scope IAM minimally: S3 write-only where possible; Sheets read-only.

  • Limit spreadsheet access to service accounts used by Airflow.

Summary of trade-offs

  • Simple JSON (values.get): fastest start; harder to query later.

  • JSONL per row: slightly more setup; far easier for downstream tools and scale.

  • Column serialization: mitigates Sheets limits; requires parsing in the lake.
