Pandas Integration¶

This section covers pandas-specific cursors, result sets, and data converters.

Pandas Cursors¶

class pyathena.pandas.cursor.PandasCursor(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, block_size: int | None = None, cache_type: str | None = None, max_workers: int = 20, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, auto_optimize_chunksize: bool = False, **kwargs)[source]¶

Cursor for handling pandas DataFrame results from Athena queries.

This cursor returns query results as pandas DataFrames with memory-efficient processing through chunking support and automatic chunksize optimization for large result sets. It’s ideal for data analysis and data science workflows.

The cursor supports both regular CSV-based results and high-performance UNLOAD operations that return results in Parquet format, which is significantly faster for large datasets and preserves data types more accurately.

description¶: Sequence of column descriptions for the last query.

rowcount¶: Number of rows affected by the last query (-1 for SELECT queries).

arraysize¶: Default number of rows to fetch with fetchmany().

chunksize¶: Number of rows per chunk when iterating through results.

Example

>>> from pyathena.pandas.cursor import PandasCursor
>>> cursor = connection.cursor(PandasCursor)
>>> cursor.execute("SELECT * FROM sales_data WHERE year = 2023")
>>> df = cursor.fetchall()  # Returns pandas DataFrame
>>> print(df.describe())

# Memory-efficient iteration for large datasets >>> cursor.execute(“SELECT * FROM huge_table”) >>> for chunk_df in cursor: … process_chunk(chunk_df) # Process data in chunks

# High-performance UNLOAD for large datasets >>> cursor = connection.cursor(PandasCursor, unload=True) >>> cursor.execute(“SELECT * FROM big_table”) >>> df = cursor.fetchall() # Faster Parquet-based result

__init__(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, block_size: int | None = None, cache_type: str | None = None, max_workers: int = 20, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, auto_optimize_chunksize: bool = False, **kwargs) → None[source]¶

Initialize PandasCursor with configuration options.

Parameters:

s3_staging_dir – S3 directory for query result staging.
schema_name – Default schema name for queries.
catalog_name – Default catalog name for queries.
work_group – Athena workgroup name.
poll_interval – Query polling interval in seconds.
encryption_option – S3 encryption option.
kms_key – KMS key for encryption.
kill_on_interrupt – Cancel query on interrupt signal.
unload – Use UNLOAD statement for faster result retrieval.
engine – CSV parsing engine (‘auto’, ‘c’, ‘python’, ‘pyarrow’).
chunksize – Number of rows per chunk for memory-efficient processing. If specified, takes precedence over auto_optimize_chunksize.
block_size – S3 read block size.
cache_type – S3 caching strategy.
max_workers – Maximum worker threads for parallel processing.
result_reuse_enable – Enable query result reuse.
result_reuse_minutes – Result reuse duration in minutes.
auto_optimize_chunksize – Enable automatic chunksize determination for large files. Only effective when chunksize is None. Default: False (no automatic chunking).
**kwargs – Additional arguments passed to pandas.read_csv.

static get_default_converter(unload: bool = False) → DefaultPandasTypeConverter | Any[source]¶

Get the default type converter for this cursor class.

Parameters:: unload – Whether the converter is for UNLOAD operations. Some cursor types may return different converters for UNLOAD operations.
Returns:: The default type converter instance for this cursor type.

Execute a SQL query and return results as pandas DataFrames.

Executes the SQL query on Amazon Athena and configures the result set for pandas DataFrame output. Supports both regular CSV-based results and high-performance UNLOAD operations with Parquet format.

Parameters:

operation – SQL query string to execute.
parameters – Query parameters for parameterized queries.
work_group – Athena workgroup to use for this query.
s3_staging_dir – S3 location for query results.
cache_size – Number of queries to check for result caching.
cache_expiration_time – Cache expiration time in seconds.
result_reuse_enable – Enable Athena result reuse for this query.
result_reuse_minutes – Minutes to reuse cached results.
paramstyle – Parameter style (‘qmark’ or ‘pyformat’).
keep_default_na – Whether to keep default pandas NA values.
na_values – Additional values to treat as NA.
quoting – CSV quoting behavior (pandas csv.QUOTE_* constants).
on_start_query_execution – Callback called when query starts.
result_set_type_hints – Optional dictionary mapping column names to Athena DDL type signatures for precise type conversion within complex types.
**kwargs – Additional pandas read_csv/read_parquet parameters.

Returns:

Self reference for method chaining.

Example

>>> cursor.execute("SELECT * FROM sales WHERE year = %(year)s",
...                {"year": 2023})
>>> df = cursor.fetchall()  # Returns pandas DataFrame

as_pandas() → DataFrame | PandasDataFrameIterator[source]¶

Return DataFrame or PandasDataFrameIterator based on chunksize setting.

Returns:: DataFrame when chunksize is None, PandasDataFrameIterator when chunksize is set.

iter_chunks() → Generator[DataFrame, None, None][source]¶

Iterate over DataFrame chunks for memory-efficient processing.

This method provides an iterator interface for processing large result sets in chunks, preventing memory exhaustion when working with datasets that are too large to fit in memory as a single DataFrame.

Chunking behavior: - If chunksize is explicitly set, uses that value - If auto_optimize_chunksize=True and chunksize=None, automatically determines

optimal chunksize based on file size

If auto_optimize_chunksize=False and chunksize=None, yields entire DataFrame

Yields:

DataFrame –

Individual chunks of the result set when chunking is enabled,: or the entire DataFrame as a single chunk when chunking is disabled.

Examples

# Explicit chunksize cursor = connection.cursor(PandasCursor, chunksize=50000) cursor.execute(“SELECT * FROM large_table”) for chunk in cursor.iter_chunks():

process_chunk(chunk)

# Auto-optimization enabled cursor = connection.cursor(PandasCursor, auto_optimize_chunksize=True) cursor.execute(“SELECT * FROM large_table”) for chunk in cursor.iter_chunks():

process_chunk(chunk) # Chunks determined automatically for large files

# No chunking (default behavior) cursor = connection.cursor(PandasCursor) cursor.execute(“SELECT * FROM large_table”) for chunk in cursor.iter_chunks():

process_chunk(chunk) # Single DataFrame regardless of size

DEFAULT_FETCH_SIZE: int = 1000¶

DEFAULT_RESULT_REUSE_MINUTES = 60¶

LIST_DATABASES_MAX_RESULTS = 50¶

LIST_QUERY_EXECUTIONS_MAX_RESULTS = 50¶

LIST_TABLE_METADATA_MAX_RESULTS = 50¶

property arraysize: int¶

cancel() → None¶

Cancel the currently executing query.

Raises:: ProgrammingError – If no query is currently executing.

property catalog: str | None¶

close() → None¶: Close the cursor and release associated resources.

property completion_date_time: datetime | None¶

property connection: Connection[Any]¶

property data_manifest_location: str | None¶

property data_scanned_in_bytes: int | None¶

property database: str | None¶

property description: list[tuple[str, str, None, None, int, int, str]] | None¶

property effective_engine_version: str | None¶

property encryption_option: str | None¶

property engine_execution_time_in_millis: int | None¶

property error_category: int | None¶

property error_message: str | None¶

property error_type: int | None¶

executemany(operation: str, seq_of_parameters: list[dict[str, Any] | list[str] | None], **kwargs) → None¶

Execute a SQL query multiple times with different parameters.

Parameters:

operation – SQL query string to execute.
seq_of_parameters – Sequence of parameter sets, one per execution.
**kwargs – Additional keyword arguments passed to each execute().

property execution_parameters: list[str]¶

property expected_bucket_owner: str | None¶

fetchall() → list[tuple[Any | None, ...] | dict[Any, Any | None]]¶

Fetch all remaining rows from the result set.

Returns:: List of tuples representing all remaining rows.
Raises:: ProgrammingError – If no result set is available.

fetchmany(size: int | None = None) → list[tuple[Any | None, ...] | dict[Any, Any | None]]¶

Fetch multiple rows from the result set.

Parameters:: size – Maximum number of rows to fetch. Defaults to arraysize.
Returns:: List of tuples representing the fetched rows.
Raises:: ProgrammingError – If no result set is available.

fetchone() → tuple[Any | None, ...] | dict[Any, Any | None] | None¶

Fetch the next row of the result set.

Returns:: A tuple representing the next row, or None if no more rows.
Raises:: ProgrammingError – If no result set is available.

get_table_metadata(table_name: str, catalog_name: str | None = None, schema_name: str | None = None, logging_: bool = True) → AthenaTableMetadata¶

property has_result_set: bool¶

property kms_key: str | None¶

list_databases(catalog_name: str | None, max_results: int | None = None) → list[AthenaDatabase]¶

list_table_metadata(catalog_name: str | None = None, schema_name: str | None = None, expression: str | None = None, max_results: int | None = None) → list[AthenaTableMetadata]¶

property output_location: str | None¶

property query: str | None¶

property query_id: str | None¶

property query_planning_time_in_millis: int | None¶

property query_queue_time_in_millis: int | None¶

property result_reuse_enabled: bool | None¶

property result_reuse_minutes: int | None¶

property result_set: AthenaResultSet | None¶

property retryable: bool | None¶

property reused_previous_result: bool | None¶

property rowcount: int¶

Get the number of rows affected by the last operation.

For SELECT statements, this returns -1 as per DB API 2.0 specification. For DML operations (INSERT, UPDATE, DELETE) and CTAS, this returns the number of affected rows.

Returns:: The number of rows, or -1 if not applicable or unknown.

property rownumber: int | None¶

property s3_acl_option: str | None¶

property selected_engine_version: str | None¶

property service_processing_time_in_millis: int | None¶

setinputsizes(sizes)¶: Does nothing by default

setoutputsize(size, column=None)¶: Does nothing by default

property state: str | None¶

property state_change_reason: str | None¶

property statement_type: str | None¶

property submission_date_time: datetime | None¶

property substatement_type: str | None¶

property total_execution_time_in_millis: int | None¶

property work_group: str | None¶

class pyathena.pandas.async_cursor.AsyncPandasCursor(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, max_workers: int = 20, arraysize: int = 1000, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, **kwargs)[source]¶

Asynchronous cursor that returns results as pandas DataFrames.

This cursor extends AsyncCursor to provide asynchronous query execution with results returned as pandas DataFrames. It’s designed for data analysis workflows where pandas integration is required and non-blocking query execution is beneficial.

Features:

Asynchronous query execution with concurrent futures
Direct pandas DataFrame results for data analysis
Configurable CSV and Parquet engines for optimal performance
Support for chunked processing of large datasets
UNLOAD operations for improved performance with large results
Memory optimization through configurable chunking

arraysize¶: Number of rows to fetch per batch.

engine¶: Parsing engine (‘auto’, ‘c’, ‘python’, ‘pyarrow’).

chunksize¶: Number of rows per chunk for large datasets.

Example

>>> from pyathena.pandas.async_cursor import AsyncPandasCursor
>>>
>>> cursor = connection.cursor(AsyncPandasCursor, chunksize=10000)
>>> query_id, future = cursor.execute("SELECT * FROM large_table")
>>>
>>> # Get result when ready
>>> result_set = future.result()
>>> df = result_set.as_pandas()
>>>
>>> # Or iterate through chunks for large datasets
>>> for chunk_df in result_set:
...     process_chunk(chunk_df)

Note

Requires pandas to be installed. For large datasets, consider using chunksize or UNLOAD operations for better memory efficiency.

__init__(s3_staging_dir: str | None = None, schema_name: str | None = None, catalog_name: str | None = None, work_group: str | None = None, poll_interval: float = 1, encryption_option: str | None = None, kms_key: str | None = None, kill_on_interrupt: bool = True, max_workers: int = 20, arraysize: int = 1000, unload: bool = False, engine: str = 'auto', chunksize: int | None = None, result_reuse_enable: bool = False, result_reuse_minutes: int = 60, **kwargs) → None[source]¶

static get_default_converter(unload: bool = False) → DefaultPandasTypeConverter | Any[source]¶

Get the default type converter for this cursor class.

Parameters:: unload – Whether the converter is for UNLOAD operations. Some cursor types may return different converters for UNLOAD operations.
Returns:: The default type converter instance for this cursor type.

property arraysize: int¶

LIST_DATABASES_MAX_RESULTS = 50¶

LIST_QUERY_EXECUTIONS_MAX_RESULTS = 50¶

LIST_TABLE_METADATA_MAX_RESULTS = 50¶

cancel(query_id: str) → Future[None]¶

Cancel a running query asynchronously.

Submits a cancellation request for the specified query. The cancellation itself runs asynchronously in the background.

Parameters:: query_id – The Athena query execution ID to cancel.
Returns:: Future object that completes when the cancellation request finishes.

Example

>>> query_id, future = cursor.execute("SELECT * FROM huge_table")
>>> # Later, cancel the query
>>> cancel_future = cursor.cancel(query_id)
>>> cancel_future.result()  # Wait for cancellation to complete

close(wait: bool = False) → None¶

property connection: Connection[Any]¶

description(query_id: str) → Future[list[tuple[str, str, None, None, int, int, str]] | None]¶

Execute a SQL query asynchronously.

Starts query execution on Amazon Athena and returns immediately without waiting for completion. The query runs in the background while your application can continue with other work.

Parameters:

operation – SQL query string to execute.
parameters – Query parameters (optional).
work_group – Athena workgroup to use (optional).
s3_staging_dir – S3 location for query results (optional).
cache_size – Query result cache size in MB (optional).
cache_expiration_time – Cache expiration time in seconds (optional).
result_reuse_enable – Enable result reuse for identical queries (optional).
result_reuse_minutes – Result reuse duration in minutes (optional).
paramstyle – Parameter style to use (optional).
result_set_type_hints – Optional dictionary mapping column names to Athena DDL type signatures for precise type conversion within complex types.
**kwargs – Additional execution parameters.

Returns:

query_id: Athena query execution ID for tracking
future: Future object for result retrieval

Return type:

Tuple of (query_id, future) where

Example

>>> query_id, future = cursor.execute("SELECT * FROM large_table")
>>> print(f"Query started: {query_id}")
>>> # Do other work while query runs...
>>> result_set = future.result()  # Wait for completion

executemany(operation: str, seq_of_parameters: list[dict[str, Any] | list[str] | None], **kwargs) → None¶

Execute multiple queries asynchronously (not supported).

This method is not supported for asynchronous cursors because managing multiple concurrent queries would be complex and resource-intensive.

Parameters:

operation – SQL query string.
seq_of_parameters – Sequence of parameter sets.
**kwargs – Additional arguments.

Raises:

NotSupportedError – Always raised as this operation is not supported.

Note

For bulk operations, consider using execute() with parameterized queries or batch processing patterns instead.

get_table_metadata(table_name: str, catalog_name: str | None = None, schema_name: str | None = None, logging_: bool = True) → AthenaTableMetadata¶

list_databases(catalog_name: str | None, max_results: int | None = None) → list[AthenaDatabase]¶

list_table_metadata(catalog_name: str | None = None, schema_name: str | None = None, expression: str | None = None, max_results: int | None = None) → list[AthenaTableMetadata]¶

poll(query_id: str) → Future[AthenaQueryExecution]¶

Poll for query completion asynchronously.

Waits for the query to complete (succeed, fail, or be cancelled) and returns the final execution status. This method blocks until completion but runs the polling in a background thread.

Parameters:: query_id – The Athena query execution ID to poll.
Returns:: Future object containing the final AthenaQueryExecution status.

Note

This method performs polling internally, so it will take time proportional to your query execution duration.

query_execution(query_id: str) → Future[AthenaQueryExecution]¶

Get query execution details asynchronously.

Retrieves the current execution status and metadata for a query. This is useful for monitoring query progress without blocking.

Parameters:: query_id – The Athena query execution ID.
Returns:: Future object containing AthenaQueryExecution with query details.

setinputsizes(sizes)¶: Does nothing by default

setoutputsize(size, column=None)¶: Does nothing by default

Pandas Result Set¶

class pyathena.pandas.result_set.AthenaPandasResultSet(connection: Connection[Any], converter: Converter, query_execution: AthenaQueryExecution, arraysize: int, retry_config: RetryConfig, keep_default_na: bool = False, na_values: Iterable[str] | None = ('',), quoting: int = 1, unload: bool = False, unload_location: str | None = None, engine: str = 'auto', chunksize: int | None = None, block_size: int | None = None, cache_type: str | None = None, max_workers: int = 20, auto_optimize_chunksize: bool = False, result_set_type_hints: dict[str | int, str] | None = None, **kwargs)[source]¶

Result set that provides pandas DataFrame results with memory optimization.

This result set handles CSV and Parquet result files from S3, converting them to pandas DataFrames with configurable chunking for memory-efficient processing. It automatically optimizes chunk sizes based on file size and provides iterative processing capabilities for large datasets.

Features:

Automatic chunk size optimization based on file size
Support for both CSV and Parquet result formats
Memory-efficient iterative processing
Automatic date/time parsing for pandas compatibility
PyArrow integration for Parquet files

LARGE_FILE_THRESHOLD_BYTES¶

File size threshold for chunking (50MB).

Type:: int

AUTO_CHUNK_SIZE_LARGE¶

Default chunk size for large files (100,000 rows).

Type:: int

AUTO_CHUNK_SIZE_MEDIUM¶

Default chunk size for medium files (50,000 rows).

Type:: int

Example

>>> # Used automatically by PandasCursor
>>> cursor = connection.cursor(PandasCursor)
>>> cursor.execute("SELECT * FROM large_table")
>>>
>>> # Get full DataFrame
>>> df = cursor.fetchall()
>>>
>>> # Or iterate through chunks for memory efficiency
>>> for chunk_df in cursor:
...     process_chunk(chunk_df)

Note

This class is used internally by PandasCursor and typically not instantiated directly by users.

PYARROW_MIN_FILE_SIZE_BYTES: int = 100¶

LARGE_FILE_THRESHOLD_BYTES: int = 52428800¶

ESTIMATED_BYTES_PER_ROW: int = 100¶

AUTO_CHUNK_THRESHOLD_LARGE: int = 2000000¶

AUTO_CHUNK_THRESHOLD_MEDIUM: int = 1000000¶

AUTO_CHUNK_SIZE_LARGE: int = 100000¶

AUTO_CHUNK_SIZE_MEDIUM: int = 50000¶

__init__(connection: Connection[Any], converter: Converter, query_execution: AthenaQueryExecution, arraysize: int, retry_config: RetryConfig, keep_default_na: bool = False, na_values: Iterable[str] | None = ('',), quoting: int = 1, unload: bool = False, unload_location: str | None = None, engine: str = 'auto', chunksize: int | None = None, block_size: int | None = None, cache_type: str | None = None, max_workers: int = 20, auto_optimize_chunksize: bool = False, result_set_type_hints: dict[str | int, str] | None = None, **kwargs) → None[source]¶

Initialize AthenaPandasResultSet with pandas-specific configurations.

Parameters:

connection – Database connection instance.
converter – Data type converter for Athena types to pandas types.
query_execution – Query execution metadata from Athena.
arraysize – Number of rows to fetch in each batch (not used for pandas processing).
retry_config – Retry configuration for S3 operations.
keep_default_na – pandas option for handling NA values.
na_values – Additional values to recognize as NA.
quoting – CSV quoting behavior.
unload – Whether result uses UNLOAD statement (Parquet format).
unload_location – S3 location for UNLOAD results.
engine – Parsing engine (‘auto’, ‘c’, ‘python’, ‘pyarrow’).
chunksize – Number of rows per chunk. If specified, takes precedence over auto_optimize_chunksize.
block_size – S3 read block size.
cache_type – S3 caching strategy.
max_workers – Maximum worker threads for parallel operations.
auto_optimize_chunksize – Enable automatic chunksize determination for large files when chunksize is None.
result_set_type_hints – Optional dictionary mapping column names to Athena DDL type signatures for precise type conversion.
**kwargs – Additional arguments passed to pandas.read_csv/read_parquet.

property dtypes: dict[str, type[Any]]¶

Get pandas-compatible data types for result columns.

Returns:: Dictionary mapping column names to their corresponding Python types based on the converter’s type mapping.

property converters: dict[Any | None, Callable[[str | None], Any | None]]¶

property parse_dates: list[Any | None]¶

fetchone() → tuple[Any | None, ...] | dict[Any, Any | None] | None[source]¶

fetchmany(size: int | None = None) → list[tuple[Any | None, ...] | dict[Any, Any | None]][source]¶

fetchall() → list[tuple[Any | None, ...] | dict[Any, Any | None]][source]¶

DEFAULT_FETCH_SIZE: int = 1000¶

DEFAULT_RESULT_REUSE_MINUTES = 60¶

property arraysize: int¶

as_pandas() → PandasDataFrameIterator | DataFrame[source]¶

property catalog: str | None¶

property completion_date_time: datetime | None¶

property connection: Connection[Any]¶

property data_manifest_location: str | None¶

property data_scanned_in_bytes: int | None¶

property database: str | None¶

property description: list[tuple[str, str, None, None, int, int, str]] | None¶

property effective_engine_version: str | None¶

property encryption_option: str | None¶

property engine_execution_time_in_millis: int | None¶

property error_category: int | None¶

property error_message: str | None¶

property error_type: int | None¶

property execution_parameters: list[str]¶

property expected_bucket_owner: str | None¶

property is_closed: bool¶

property is_unload: bool¶

Check if the query is an UNLOAD statement.

Returns:: True if the query is an UNLOAD statement, False otherwise.

property kms_key: str | None¶

property output_location: str | None¶

property query: str | None¶

property query_id: str | None¶

property query_planning_time_in_millis: int | None¶

property query_queue_time_in_millis: int | None¶

property result_reuse_enabled: bool | None¶

property result_reuse_minutes: int | None¶

property retryable: bool | None¶

property reused_previous_result: bool | None¶

property rowcount: int¶

property rownumber: int | None¶

property s3_acl_option: str | None¶

property selected_engine_version: str | None¶

property service_processing_time_in_millis: int | None¶

property state: str | None¶

property state_change_reason: str | None¶

property statement_type: str | None¶

property submission_date_time: datetime | None¶

property substatement_type: str | None¶

property total_execution_time_in_millis: int | None¶

property work_group: str | None¶

iter_chunks() → PandasDataFrameIterator[source]¶

Iterate over result chunks as pandas DataFrames.

This method provides an iterator interface for processing large result sets. When chunksize is specified, it yields DataFrames in chunks for memory-efficient processing. When chunksize is not specified, it yields the entire result as a single DataFrame.

Returns:: PandasDataFrameIterator that yields pandas DataFrames for each chunk of rows, or the entire DataFrame if chunksize was not specified.

Example

>>> # With chunking for large datasets
>>> cursor = connection.cursor(PandasCursor, chunksize=50000)
>>> cursor.execute("SELECT * FROM large_table")
>>> for chunk in cursor.iter_chunks():
...     process_chunk(chunk)  # Each chunk is a pandas DataFrame
>>>
>>> # Without chunking - yields entire result as single chunk
>>> cursor = connection.cursor(PandasCursor)
>>> cursor.execute("SELECT * FROM small_table")
>>> for df in cursor.iter_chunks():
...     process(df)  # Single DataFrame with all data

close() → None[source]¶

class pyathena.pandas.result_set.PandasDataFrameIterator(reader: TextFileReader | DataFrame, trunc_date: Callable[[DataFrame], DataFrame])[source]¶

Iterator for chunked DataFrame results from Athena queries.

This class wraps either a pandas TextFileReader (for chunked reading) or a single DataFrame, providing a unified iterator interface. It applies optional date truncation to each DataFrame chunk as it’s yielded.

The iterator is used by AthenaPandasResultSet to provide chunked access to large query results, enabling memory-efficient processing of datasets that would be too large to load entirely into memory.

Example

>>> # Iterate over DataFrame chunks
>>> for df_chunk in iterator:
...     process(df_chunk)
>>>
>>> # Iterate over individual rows
>>> for idx, row in iterator.iterrows():
...     print(row)

Note

This class is primarily for internal use by AthenaPandasResultSet. Most users should access results through PandasCursor methods.

__init__(reader: TextFileReader | DataFrame, trunc_date: Callable[[DataFrame], DataFrame]) → None[source]¶

Initialize the iterator.

Parameters:

reader – Either a TextFileReader (for chunked) or a single DataFrame.
trunc_date – Function to apply date truncation to each chunk.

__next__() → DataFrame[source]¶

Get the next DataFrame chunk.

Returns:: The next pandas DataFrame chunk with date truncation applied.
Raises:: StopIteration – When no more chunks are available.

__iter__() → PandasDataFrameIterator[source]¶: Return self as iterator.

__enter__() → PandasDataFrameIterator[source]¶: Context manager entry.

__exit__(exc_type, exc_value, traceback) → None[source]¶: Context manager exit.

close() → None[source]¶: Close the iterator and release resources.

iterrows() → Iterator[tuple[int, dict[str, Any]]][source]¶

Iterate over rows as (index, row_dict) tuples.

Row indices are continuous across all chunks, starting from 0.

Yields:: Tuple of (row_index, row_dict) for each row across all chunks.

get_chunk(size: int | None = None) → DataFrame[source]¶

Get a chunk of specified size.

Parameters:: size – Number of rows to retrieve. If None, returns entire chunk.
Returns:: DataFrame chunk.

as_pandas() → DataFrame[source]¶

Collect all chunks into a single DataFrame.

Returns:: Single pandas DataFrame containing all data.

Pandas Data Converters¶

class pyathena.pandas.converter.DefaultPandasTypeConverter[source]¶

Optimized type converter for pandas DataFrame results.

This converter is specifically designed for the PandasCursor and provides optimized type conversion that works well with pandas data types. It minimizes conversions for types that pandas handles efficiently and only converts complex types that need special handling.

The converter focuses on:

Preserving numeric types for pandas optimization
Converting only complex types (json, binary, etc.)
Maintaining compatibility with pandas data type inference

Example

>>> from pyathena.pandas.converter import DefaultPandasTypeConverter
>>> converter = DefaultPandasTypeConverter()
>>>
>>> # Used automatically by PandasCursor
>>> cursor = connection.cursor(PandasCursor)
>>> # converter is applied automatically to results

Note

This converter is used by default in PandasCursor. Most users don’t need to instantiate it directly.

__init__() → None[source]¶

convert(type_: str, value: str | None, type_hint: str | None = None) → Any | None[source]¶

class pyathena.pandas.converter.DefaultPandasUnloadTypeConverter[source]¶

Type converter for pandas UNLOAD operations.

This converter is designed for use with UNLOAD queries that write results directly to Parquet files in S3. Since UNLOAD operations bypass the normal conversion process and write data in native Parquet format, this converter has minimal functionality.

Note

Used automatically when PandasCursor is configured with unload=True. UNLOAD results are read directly as DataFrames from Parquet files.

__init__() → None[source]¶

convert(type_: str, value: str | None, type_hint: str | None = None) → Any | None[source]¶

Pandas Utilities¶

pyathena.pandas.util.get_chunks(df: DataFrame, chunksize: int | None = None) → Iterator[DataFrame][source]¶

Split a DataFrame into chunks of specified size.

Parameters:

df – The DataFrame to split into chunks.
chunksize – Number of rows per chunk. If None, yields the entire DataFrame.

Yields:

DataFrame chunks of the specified size.

Raises:

ValueError – If chunksize is less than or equal to zero.

pyathena.pandas.util.reset_index(df: DataFrame, index_label: str | None = None) → None[source]¶

Reset the DataFrame index and add it as a column.

Parameters:

df – The DataFrame to reset the index on (modified in-place).
index_label – Name for the index column. Defaults to “index”.

Raises:

ValueError – If the index name conflicts with existing column names.

pyathena.pandas.util.as_pandas(cursor: Cursor, coerce_float: bool = False) → DataFrame[source]¶

Convert cursor results to a pandas DataFrame.

Fetches all remaining rows from the cursor and converts them to a DataFrame with column names from the cursor description.

Parameters:

cursor – A PyAthena cursor with executed query results.
coerce_float – If True, attempt to convert non-string columns to float.

Returns:

A DataFrame containing the query results, or an empty DataFrame if no results are available.

pyathena.pandas.util.to_sql_type_mappings(col: Series) → str[source]¶

Map a pandas Series data type to an Athena SQL type.

Infers the appropriate Athena SQL type based on the pandas Series dtype. Used when creating tables from DataFrames.

Parameters:: col – A pandas Series to determine the SQL type for.
Returns:: The Athena SQL type name (e.g., “STRING”, “BIGINT”, “DOUBLE”).
Raises:: ValueError – If the data type is not supported (complex, time).

pyathena.pandas.util.to_parquet(df: DataFrame, bucket_name: str, prefix: str, retry_config: RetryConfig, session_kwargs: dict[str, Any], client_kwargs: dict[str, Any], compression: str | None = None, flavor: str = 'spark') → str[source]¶

Write a DataFrame to S3 as a Parquet file.

Converts the DataFrame to Apache Arrow format and writes it to S3 as a Parquet file with a UUID-based filename.

Parameters:

df – The DataFrame to write.
bucket_name – S3 bucket name.
prefix – S3 key prefix (path within the bucket).
retry_config – Configuration for API call retries.
session_kwargs – Arguments for creating a boto3 Session.
client_kwargs – Arguments for creating the S3 client.
compression – Parquet compression codec (e.g., “snappy”, “gzip”).
flavor – Parquet flavor for compatibility (“spark” or “hive”).

Returns:

The S3 URI of the written Parquet file.

pyathena.pandas.util.to_sql(df: DataFrame, name: str, conn: Connection[Any], location: str, schema: str = 'default', index: bool = False, index_label: str | None = None, partitions: list[str] | None = None, chunksize: int | None = None, if_exists: str = 'fail', compression: str | None = None, flavor: str = 'spark', type_mappings: Callable[[Series], str] = <function to_sql_type_mappings>, executor_class: type[ThreadPoolExecutor | ProcessPoolExecutor] = <class 'concurrent.futures.thread.ThreadPoolExecutor'>, max_workers: int = 20, repair_table=True) → None[source]¶

Write a DataFrame to an Athena table backed by Parquet files in S3.

Creates an external Athena table from a DataFrame by writing the data as Parquet files to S3 and executing the appropriate DDL statements. Supports partitioning, compression, and parallel uploads.

Parameters:

df – The DataFrame to write to Athena.
name – Name of the table to create.
conn – PyAthena connection object.
location – S3 location for the table data (e.g., “s3://bucket/path/”).
schema – Database schema name. Defaults to “default”.
index – If True, include the DataFrame index as a column.
index_label – Name for the index column if index=True.
partitions – List of column names to use as partition keys.
chunksize – Number of rows per Parquet file. None for single file.
if_exists – Action if table exists: “fail”, “replace”, or “append”.
compression – Parquet compression codec (e.g., “snappy”, “gzip”).
flavor – Parquet flavor for compatibility (“spark” or “hive”).
type_mappings – Function to map pandas types to SQL types.
executor_class – Executor class for parallel uploads.
max_workers – Maximum number of parallel upload workers.
repair_table – If True, run ALTER TABLE ADD PARTITION for partitioned tables.

Raises:

ValueError – If if_exists is invalid, compression is unsupported, or partition keys contain None values.
OperationalError – If if_exists=”fail” and table already exists.

pyathena.pandas.util.get_column_names_and_types(df: DataFrame, type_mappings) → OrderedDict[str, str][source]¶

Extract column names and their SQL types from a DataFrame.

Parameters:

df – The DataFrame to extract column information from.
type_mappings – Function to map pandas types to SQL types.

Returns:

An OrderedDict mapping column names to their SQL type strings.

pyathena.pandas.util.generate_ddl(df: DataFrame, name: str, location: str, schema: str = 'default', partitions: list[str] | None = None, compression: str | None = None, type_mappings: Callable[[Series], str] = <function to_sql_type_mappings>) → str[source]¶

Generate CREATE EXTERNAL TABLE DDL for a DataFrame.

Creates DDL for an external Athena table with Parquet storage format based on the DataFrame’s schema.

Parameters:

df – The DataFrame to generate DDL for.
name – Name of the table to create.
location – S3 location for the table data.
schema – Database schema name. Defaults to “default”.
partitions – List of column names to use as partition keys.
compression – Parquet compression codec for TBLPROPERTIES.
type_mappings – Function to map pandas types to SQL types.

Returns:

The CREATE EXTERNAL TABLE DDL statement as a string.