NULL and Empty String Handling¶

This section documents the behavior of NULL and empty string values across different cursor types in PyAthena. Understanding this behavior is important for applications that need to distinguish between NULL (missing) values and empty strings.

CSV Format Limitation¶

When Athena executes a query, the results are stored as CSV files in S3. This CSV format has an inherent limitation in how NULL values and empty strings are represented:

NULL values are represented as unquoted empty fields: ,,
Empty strings are represented as quoted empty fields: ,"",

# Example CSV output from Athena
id,name,description
1,Alice,Hello
2,Bob,""
3,Charlie,

# Row 2: description is an empty string (quoted: "")
# Row 3: description is NULL (unquoted)

Most CSV parsers handle these differently:

Pandas (used by PandasCursor): By default, treats both unquoted empty and quoted empty as missing values
PyArrow (used by ArrowCursor): With quoted_strings_can_be_null=False, treats both as empty strings
Polars (used by PolarsCursor): Correctly distinguishes quoted empty strings from unquoted NULL values
AthenaCSVReader (used by S3FSCursor): Correctly distinguishes NULL (returns None) from empty strings

This means the ability to distinguish NULL from empty strings varies by cursor when reading CSV files.

Cursor Behavior Comparison¶

The following table summarizes how different cursors handle NULL and empty string values, based on actual testing with Athena:

Cursor Type	Data Source	Empty String `''`	NULL Value	Distinguishes?
`Cursor` (default)	Athena API	`''`	`None`	✅ Yes
`DictCursor`	Athena API	`''`	`None`	✅ Yes
`PandasCursor`	CSV file	`NaN`	`NaN`	❌ No
`PandasCursor` + unload	Parquet file	`''`	`None`	✅ Yes
`ArrowCursor`	CSV file	`''`	`''`	❌ No
`ArrowCursor` + unload	Parquet file	`''`	`null`	✅ Yes
`PolarsCursor`	CSV file	`''`	`null`	✅ Yes
`PolarsCursor` + unload	Parquet file	`''`	`null`	✅ Yes
`S3FSCursor`	CSV file	`''`	`None`	✅ Yes

Note

PolarsCursor and S3FSCursor are unique among the file-based cursors in that they can properly distinguish NULL from empty strings even when reading CSV files. PolarsCursor uses Polars’ CSV parser which correctly interprets unquoted empty values as NULL, while S3FSCursor uses a custom AthenaCSVReader that respects CSV quoting rules.

Default Cursor (API-based)¶

The default Cursor and DictCursor fetch results directly from the Athena API, which properly distinguishes between NULL and empty string values.

from pyathena import connect

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2").cursor()

cursor.execute("""
    SELECT * FROM (
        VALUES
            (1, ''),
            (2, CAST(NULL AS VARCHAR)),
            (3, 'hello')
    ) AS t(id, value)
""")

for row in cursor:
    print(f"id={row[0]}, value={repr(row[1])}, is_none={row[1] is None}")

# Output:
# id=1, value='', is_none=False     <- Empty string
# id=2, value=None, is_none=True    <- NULL
# id=3, value='hello', is_none=False

This is the most reliable cursor when distinguishing NULL from empty string is critical.

PandasCursor Behavior¶

Without Unload (CSV)¶

When using PandasCursor without the unload option, the cursor reads the CSV file using pandas’ read_csv(). Both empty strings and NULL values are treated as NaN.

from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=PandasCursor).cursor()

df = cursor.execute("""
    SELECT * FROM (
        VALUES
            (1, '', 'empty_string'),
            (2, CAST(NULL AS VARCHAR), 'null_value'),
            (3, 'hello', 'normal_string')
    ) AS t(id, value, description)
""").as_pandas()

print(df)
#    id  value    description
# 0   1    NaN   empty_string   <- Empty string becomes NaN
# 1   2    NaN     null_value   <- NULL becomes NaN
# 2   3  hello  normal_string

print(df['value'].isna().tolist())
# [True, True, False]  <- Both empty string and NULL are NaN

Note

By default, PyAthena sets keep_default_na=False and na_values=("",) which means only empty values are treated as NaN, while strings like “N/A”, “NULL”, “NA” are preserved as-is. However, this still cannot distinguish between quoted empty strings and unquoted NULL values in the CSV output.

With Unload (Parquet) - Recommended¶

Using the unload option outputs results in Parquet format, which properly preserves NULL semantics:

from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=PandasCursor).cursor(unload=True)

df = cursor.execute("""
    SELECT * FROM (
        VALUES
            (1, '', 'empty_string'),
            (2, CAST(NULL AS VARCHAR), 'null_value'),
            (3, 'hello', 'normal_string')
    ) AS t(id, value, description)
""").as_pandas()

print(df)
#    id  value    description
# 0   1         empty_string   <- Empty string preserved
# 1   2   None    null_value   <- NULL is None
# 2   3  hello normal_string

print(df['value'].isna().tolist())
# [False, True, False]  <- Only NULL is NaN, empty string is not

ArrowCursor Behavior¶

Without Unload (CSV)¶

ArrowCursor uses PyArrow’s CSV reader with quoted_strings_can_be_null=False, which means both NULL and empty strings become empty strings:

from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=ArrowCursor).cursor()

table = cursor.execute("""
    SELECT * FROM (
        VALUES
            (1, '', 'empty_string'),
            (2, CAST(NULL AS VARCHAR), 'null_value'),
            (3, 'hello', 'normal_string')
    ) AS t(id, value, description)
""").as_arrow()

value_col = table.column('value')
print(value_col.to_pylist())
# ['', '', 'hello']  <- Both empty string and NULL become ''

print(value_col.null_count)
# 0  <- No null values detected

With Unload (Parquet) - Recommended¶

from pyathena import connect
from pyathena.arrow.cursor import ArrowCursor

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=ArrowCursor).cursor(unload=True)

table = cursor.execute("""
    SELECT * FROM (
        VALUES
            (1, '', 'empty_string'),
            (2, CAST(NULL AS VARCHAR), 'null_value'),
            (3, 'hello', 'normal_string')
    ) AS t(id, value, description)
""").as_arrow()

value_col = table.column('value')
print(value_col.to_pylist())
# ['', None, 'hello']  <- NULL is properly None

print(value_col.null_count)
# 1  <- One null value correctly detected

PolarsCursor Behavior¶

PolarsCursor is unique in that it can distinguish NULL from empty strings even when reading CSV files. This is because Polars’ CSV parser correctly interprets unquoted empty values as NULL.

Without Unload (CSV)¶

from pyathena import connect
from pyathena.polars.cursor import PolarsCursor

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=PolarsCursor).cursor()

df = cursor.execute("""
    SELECT * FROM (
        VALUES
            (1, '', 'empty_string'),
            (2, CAST(NULL AS VARCHAR), 'null_value'),
            (3, 'hello', 'normal_string')
    ) AS t(id, value, description)
""").as_polars()

print(df)
# shape: (3, 3)
# ┌─────┬───────┬───────────────┐
# │ id  ┆ value ┆ description   │
# │ --- ┆ ---   ┆ ---           │
# │ i32 ┆ str   ┆ str           │
# ╞═════╪═══════╪═══════════════╡
# │ 1   ┆       ┆ empty_string  │  <- Empty string
# │ 2   ┆ null  ┆ null_value    │  <- NULL
# │ 3   ┆ hello ┆ normal_string │
# └─────┴───────┴───────────────┘

print(df['value'].is_null().to_list())
# [False, True, False]  <- Correctly distinguishes NULL

print(df['value'].to_list())
# ['', None, 'hello']  <- Empty string and NULL are different

With Unload (Parquet)¶

The behavior is the same with unload, as Parquet also properly preserves NULL semantics:

from pyathena import connect
from pyathena.polars.cursor import PolarsCursor

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=PolarsCursor).cursor(unload=True)

df = cursor.execute("SELECT * FROM your_table").as_polars()
# Same behavior as CSV - NULL and empty strings are properly distinguished

S3FSCursor Behavior¶

S3FSCursor supports two CSV readers with different NULL handling behaviors:

AthenaCSVReader (default): Properly distinguishes NULL from empty strings by respecting CSV quoting rules
DefaultCSVReader: For backward compatibility; both NULL and empty strings become None

AthenaCSVReader (Default)¶

By default, S3FSCursor uses AthenaCSVReader which correctly interprets unquoted empty values as NULL and quoted empty values as empty strings.

from pyathena import connect
from pyathena.s3fs.cursor import S3FSCursor

# AthenaCSVReader is used by default
cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=S3FSCursor).cursor()

cursor.execute("""
    SELECT * FROM (
        VALUES
            (1, '', 'empty_string'),
            (2, CAST(NULL AS VARCHAR), 'null_value'),
            (3, 'hello', 'normal_string')
    ) AS t(id, value, description)
""")

for row in cursor:
    print(f"id={row[0]}, value={repr(row[1])}, is_none={row[1] is None}")

# Output:
# id=1, value='', is_none=False     <- Empty string preserved
# id=2, value=None, is_none=True    <- NULL is None
# id=3, value='hello', is_none=False

DefaultCSVReader (Backward Compatibility)¶

If you need backward compatibility where both NULL and empty strings are treated as None, you can explicitly specify DefaultCSVReader:

from pyathena import connect
from pyathena.s3fs.cursor import S3FSCursor
from pyathena.s3fs.reader import DefaultCSVReader

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="us-west-2",
                 cursor_class=S3FSCursor,
                 cursor_kwargs={"csv_reader": DefaultCSVReader}).cursor()

cursor.execute("SELECT '' AS empty_col, NULL AS null_col")
row = cursor.fetchone()
print(row)  # (None, None) - both become None

Note

S3FSCursor does not support the unload option. However, since AthenaCSVReader (the default) can already distinguish NULL from empty strings when reading CSV files, this is not a limitation for NULL handling.

Recommendations¶

Based on the cursor behaviors documented above:

If you need to distinguish NULL from empty strings:
- Use the default Cursor or DictCursor (API-based, most reliable)
- Or use PolarsCursor or S3FSCursor (works correctly even with CSV)
- Or use any cursor with unload=True (Parquet format)
If you need pandas DataFrame and NULL/empty string distinction:
- Use PandasCursor with unload=True
If you need Arrow Table and NULL/empty string distinction:
- Use ArrowCursor with unload=True
For data pipelines and ETL with DataFrame cursors:
- Prefer PolarsCursor (works correctly without unload)
- Or use unload=True with any DataFrame cursor
If performance is critical and NULL/empty distinction is not important:
- Any cursor works; CSV-based reading is generally faster for smaller datasets

Summary Table¶

Use Case	Recommended Cursor
Need NULL/empty distinction with tuples	`Cursor`, `DictCursor`, or `S3FSCursor`
Need pandas DataFrame with NULL/empty distinction	`PandasCursor` with `unload=True`
Need Arrow Table with NULL/empty distinction	`ArrowCursor` with `unload=True`
Need Polars DataFrame (any case)	`PolarsCursor` (works with or without unload)
Don’t care about NULL/empty distinction	Any cursor (choose based on performance needs)

Unload Limitations¶

While the unload option solves the NULL/empty string issue for pandas and Arrow cursors, it has some limitations:

The UNLOAD statement has certain type restrictions (e.g., TIME type is not supported)
Results are written to multiple files in parallel without guaranteed global sort order
Column aliases are required for all SELECT expressions

See the ArrowCursor documentation for more details on unload limitations.

NULL and Empty String Handling¶

CSV Format Limitation¶

Cursor Behavior Comparison¶

Default Cursor (API-based)¶

PandasCursor Behavior¶

Without Unload (CSV)¶

With Unload (Parquet) - Recommended¶

ArrowCursor Behavior¶

Without Unload (CSV)¶

With Unload (Parquet) - Recommended¶

PolarsCursor Behavior¶

Without Unload (CSV)¶

With Unload (Parquet)¶

S3FSCursor Behavior¶

AthenaCSVReader (Default)¶

DefaultCSVReader (Backward Compatibility)¶

Recommendations¶

Summary Table¶

Unload Limitations¶

Related Issues¶