Exploring Apache Iceberg using PyIceberg

Exploring Apache Iceberg using PyIceberg – Part 2

Posted on March 24, 2026 Updated on March 27, 2026

Apache Iceberg is an open-source table format that has become the industry standard for data sharing in modern data architectures. In my previous posts on Apache Iceberg I explored the core features of Iceberg Tables and gave examples of using Python code to create, store, add data, read a table and apply filters to an Iceberg Table. In this post I’ll explore some of the more advanced features of interacting with an Iceberg Table, how to add partitioning and how to move data to a DuckDB database.

Check out the link at the bottom of this post to download the Notebook containing all the PyIceberg code in this post. I had a similar notebook for all the code examples in my previous post. You should check that out first, as the examples in this post and notebook are an extension of those.

This post will cover:

Partitioning an Iceberg Table
Schema Evolution
Row Level Operations
Advanced Scanning & Query Patterns
DuckDB and Iceberg Tables

Setup & Configuration

Before we can start on the core aspects of this post, we need to do some basic setup like importing the necessary Python packages, defining the location of the warehouse and catalog, and checking the namespace exists. These were created in the previous post.

			
import os, pandas as pd, pyarrow as pa
from datetime import date
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    NestedField, LongType, StringType, DoubleType, DateType
)
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import (
    MonthTransform, IdentityTransform, BucketTransform
)
WAREHOUSE = "/Users/brendan.tierney/Dropbox/Iceberg-Demo"
os.makedirs(WAREHOUSE, exist_ok=True)
catalog = SqlCatalog("local", **{
    "uri":       f"sqlite:///{WAREHOUSE}/catalog.db",
    "warehouse": f"file://{WAREHOUSE}",
})
for ns in ["sales_db"]:
    if ns not in [n[0] for n in catalog.list_namespaces()]:
        catalog.create_namespace(ns)

		

Partitioning an Iceberg Table

Partitioning is how Iceberg physically organises data files on disk to enable partition pruning. Partition pruning will automatically skip directories and files that don’t contain the data you are searching for. This can significantly improve query response times.
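To make the idea concrete, here is a minimal pure-Python sketch of partition pruning. It does not use PyIceberg. The helper names (month_transform, month_label, prune) and the sample rows are assumptions made for illustration; only the month calculation (months since January 1970) follows the Iceberg specification.

```python
from collections import defaultdict
from datetime import date

def month_transform(d: date) -> int:
    """Months since 1970-01, as the Iceberg spec defines the month transform."""
    return (d.year - 1970) * 12 + (d.month - 1)

def month_label(months: int) -> str:
    """Human-readable form of a month ordinal, e.g. 648 -> '2024-01'."""
    year, month = divmod(months, 12)
    return f"{1970 + year:04d}-{month + 1:02d}"

# Hypothetical rows: (order_id, region, order_date)
rows = [
    (1001, "EU",   date(2024, 1, 15)),
    (1002, "US",   date(2024, 1, 20)),
    (1003, "EU",   date(2024, 2, 3)),
    (1004, "APAC", date(2024, 2, 20)),
]

# Group rows by (month, region), mimicking one directory per partition
partitions = defaultdict(list)
for order_id, region, order_date in rows:
    key = (month_label(month_transform(order_date)), region)
    partitions[key].append(order_id)

def prune(parts, region):
    """Keep only the partitions whose region value can match the filter."""
    return {key: ids for key, ids in parts.items() if key[1] == region}

eu_only = prune(partitions, "EU")
print(sorted(eu_only))
```

A filter on region = "EU" only ever touches the EU partitions; the US and APAC groups are never opened.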

The following defines the schema and partition spec for a partitioned table, based on the combination of the fields order_date (by month) and region.

			
# ── Explicit Iceberg schema (gives us full control over field IDs) ─────
schema = Schema(
    NestedField(field_id=1, name="order_id",   field_type=LongType(),   required=False),
    NestedField(field_id=2, name="customer",   field_type=StringType(), required=False),
    NestedField(field_id=3, name="product",    field_type=StringType(), required=False),
    NestedField(field_id=4, name="region",     field_type=StringType(), required=False),
    NestedField(field_id=5, name="order_date", field_type=DateType(),   required=False),
    NestedField(field_id=6, name="revenue",    field_type=DoubleType(), required=False),
)
# ── Partition spec: partition by month(order_date) AND identity(region) ─
partition_spec = PartitionSpec(
    PartitionField(
        source_id=5,           # order_date field_id
        field_id=1000,
        transform=MonthTransform(),
        name="order_date_month",
    ),
    PartitionField(
        source_id=4,           # region field_id
        field_id=1001,
        transform=IdentityTransform(),
        name="region",
    ),
)
tname = ("sales_db", "orders_partitioned")
if catalog.table_exists(tname): 
    catalog.drop_table(tname)

		

Now we can create the table and inspect its partition spec.

			
table = catalog.create_table(
    tname,
    schema=schema,
    partition_spec=partition_spec,)
print("Partition spec:", table.spec())
Partition spec: [
  1000: order_date_month: month(5)
  1001: region: identity(4)
]

		

We can now add data to the partitioned table.

			
# Write data — Iceberg routes each row to the correct partition directory
df = pd.DataFrame({
    "order_id":   [1001, 1002, 1003, 1004, 1005, 1006],
    "customer":   ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"],
    "product":    ["Laptop", "Phone", "Tablet", "Monitor", "Keyboard", "Webcam"],
    "region":     ["EU", "US", "EU", "APAC", "US", "EU"],
    "order_date": [date(2024,1,15), date(2024,1,20),
                   date(2024,2,3),  date(2024,2,20),
                   date(2024,3,5),  date(2024,3,12)],
    "revenue":    [1299.99, 1798.00, 549.50, 1197.00, 399.95, 258.00],
})
table.append(pa.Table.from_pandas(df))

		

We can inspect the directories and files created. I’ve only included a partial listing below, but it should be enough for you to get an idea of what Iceberg has done.

			
# Verify partition directories were created
!find {WAREHOUSE}/sales_db/orders_partitioned/data -type f
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/region=APAC/order_date_day=2024-04-05/00000-4-0542db6c-f67f-4a26-9012-59d8267b5005.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/region=APAC/order_date_day=2024-02-20/00000-2-0542db6c-f67f-4a26-9012-59d8267b5005.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=EU/00000-0-e9ad65a0-c088-46fc-a537-12a6b60b38c5.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=EU/00000-0-1f976101-f836-4db3-bf4a-c0e0cf7dd4c6.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=EU/00000-0-4233dad6-ef48-4ad5-95c9-5842e641fc0f.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=EU/00000-0-b0a10298-d2a6-45b4-a541-9a459e478496.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=US/00000-1-b0a10298-d2a6-45b4-a541-9a459e478496.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=US/00000-1-4233dad6-ef48-4ad5-95c9-5842e641fc0f.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=US/00000-1-1f976101-f836-4db3-bf4a-c0e0cf7dd4c6.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-01/region=US/00000-1-e9ad65a0-c088-46fc-a537-12a6b60b38c5.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/region=EU/order_date_day=2024-02-03/00000-1-0542db6c-f67f-4a26-9012-59d8267b5005.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/region=EU/order_date_day=2024-01-15/00000-0-0542db6c-f67f-4a26-9012-59d8267b5005.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/region=EU/order_date_day=2024-04-01/00000-3-0542db6c-f67f-4a26-9012-59d8267b5005.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-02/region=APAC/00000-3-b0a10298-d2a6-45b4-a541-9a459e478496.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-02/region=APAC/00000-3-e9ad65a0-c088-46fc-a537-12a6b60b38c5.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-02/region=APAC/00000-3-4233dad6-ef48-4ad5-95c9-5842e641fc0f.parquet
/Users/brendan.tierney/Dropbox/Iceberg-Demo/sales_db/orders_partitioned/data/order_date_month=2024-02/region=APAC/00000-3-1f976101-f836-4db3-bf4a-c0e0cf7dd4c6.parquet

		

We can change the partitioning specification without rearranging or reorganising the existing data.

			
from pyiceberg.transforms import DayTransform
# Iceberg can change the partition spec without rewriting old data.
# Old files keep their original partitioning; new files use the new spec.
with table.update_spec() as update:
    # Upgrade month → day granularity for more recent data
    update.remove_field("order_date_month")
    update.add_field(
        source_column_name="order_date",
        transform=DayTransform(),
        partition_field_name="order_date_day",
    )
print("Updated spec:", table.spec())

		

I’ll leave you to explore the additional directories, data files and metadata files.

			
#find all files starting from this directory
!find {WAREHOUSE}/sales_db/orders_partitioned/data -type f
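
The effect of changing the partition spec on query planning can be sketched in plain Python. This is a simplified illustration, not PyIceberg code: the DataFile class, the file paths and the may_contain helper are all hypothetical. The idea it shows is that each data file remembers the spec it was written with, and a date filter is checked against that spec.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class DataFile:
    path: str
    spec_id: int      # ID of the partition spec the file was written with
    partition: dict   # partition values recorded for the file

# Hypothetical files: two written under the month spec, one under the day spec
files = [
    DataFile("order_date_month=2024-01/region=EU/a.parquet", 0,
             {"order_date_month": "2024-01", "region": "EU"}),
    DataFile("order_date_month=2024-02/region=EU/b.parquet", 0,
             {"order_date_month": "2024-02", "region": "EU"}),
    DataFile("region=EU/order_date_day=2024-04-01/c.parquet", 1,
             {"region": "EU", "order_date_day": "2024-04-01"}),
]

def may_contain(f: DataFile, day: date) -> bool:
    """Check an order_date filter against the spec the file was written with."""
    if f.spec_id == 0:  # month granularity
        return f.partition["order_date_month"] == day.strftime("%Y-%m")
    return f.partition["order_date_day"] == day.isoformat()  # day granularity

def plan(day: date) -> list:
    """Return the paths of the files that could hold rows for the given day."""
    return [f.path for f in files if may_contain(f, day)]

print(plan(date(2024, 1, 15)))
print(plan(date(2024, 4, 1)))
```

Old files are pruned at month granularity and new files at day granularity, which is why no rewrite of the existing data is needed.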

Schema Evolution

Iceberg tracks every schema version with a numeric ID and never silently breaks existing readers. You can add, rename, and drop columns, change types (safely), and reorder fields, all with zero data rewriting.

			
#Add new columns
from pyiceberg.types import FloatType, BooleanType, TimestampType
print("Before:", table.schema())
with table.update_schema() as upd:
    # Add optional columns — old files return NULL for these
    upd.add_column("discount_pct", FloatType(),   "Discount percentage applied")
    upd.add_column("is_returned",  BooleanType(), "True if the order was returned")
    upd.add_column("updated_at",   TimestampType())
print("After:", table.schema())
Before: table {
  1: order_id: optional long
  2: customer: optional string
  3: product: optional string
  4: region: optional string
  5: order_date: optional date
  6: revenue: optional double
}
After: table {
  1: order_id: optional long
  2: customer: optional string
  3: product: optional string
  4: region: optional string
  5: order_date: optional date
  6: revenue: optional double
  7: discount_pct: optional float (Discount percentage applied)
  8: is_returned: optional boolean (True if the order was returned)
  9: updated_at: optional timestamp
}

		

We can also rename columns. A column rename is a metadata-only change and the Parquet files are untouched. Readers using an older schema version will still see the previous column name, while readers using the current schema will see the new one.

			
#rename a column
with table.update_schema() as upd:
    upd.rename_column("discount_pct", "discount_percent")
print("Updated:", table.schema())
Updated: table {
  1: order_id: optional long
  2: customer: optional string
  3: product: optional string
  4: region: optional string
  5: order_date: optional date
  6: revenue: optional double
  7: discount_percent: optional float (Discount percentage applied)
  8: is_returned: optional boolean (True if the order was returned)
  9: updated_at: optional timestamp
}

		

Similarly, dropping a column is a metadata-only change.

			
#drop a column
with table.update_schema() as upd:
    upd.delete_column("updated_at")
print("Updated:", table.schema())
Updated: table {
  1: order_id: optional long
  2: customer: optional string
  3: product: optional string
  4: region: optional string
  5: order_date: optional date
  6: revenue: optional double
  7: discount_percent: optional float (Discount percentage applied)
  8: is_returned: optional boolean (True if the order was returned)
}

		

We can see all the different changes or versions of the Iceberg Table schema.

			
import json, glob
meta_files = sorted(glob.glob(
    f"{WAREHOUSE}/sales_db/orders_partitioned/metadata/*.metadata.json"
))
with open(meta_files[-1]) as f:
    meta = json.load(f)
print(f"Total schema versions: {len(meta['schemas'])}")
for s in meta["schemas"]:
    print(f"  schema-id={s['schema-id']}  fields={[f['name'] for f in s['fields']]}")
Total schema versions: 4
  schema-id=0  fields=['order_id', 'customer', 'product', 'region', 'order_date', 'revenue']
  schema-id=1  fields=['order_id', 'customer', 'product', 'region', 'order_date', 'revenue', 'discount_pct', 'is_returned', 'updated_at']
  schema-id=2  fields=['order_id', 'customer', 'product', 'region', 'order_date', 'revenue', 'discount_percent', 'is_returned', 'updated_at']
  schema-id=3  fields=['order_id', 'customer', 'product', 'region', 'order_date', 'revenue', 'discount_percent', 'is_returned']    
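
The same metadata can be used to work out what changed between two schema versions. The sketch below uses a trimmed, made-up metadata document rather than a real file, and the helper names (schema_by_id, renamed_fields) are hypothetical. It relies on the `current-schema-id` key and on the fact that a field keeps its ID when it is renamed.

```python
import json

# Trimmed, hypothetical stand-in for the contents of a *.metadata.json file
metadata_text = """
{
  "current-schema-id": 2,
  "schemas": [
    {"schema-id": 1, "fields": [{"id": 1, "name": "order_id"},
                                {"id": 7, "name": "discount_pct"}]},
    {"schema-id": 2, "fields": [{"id": 1, "name": "order_id"},
                                {"id": 7, "name": "discount_percent"}]}
  ]
}
"""
meta = json.loads(metadata_text)

def schema_by_id(meta: dict, schema_id: int) -> dict:
    """Look up one schema version by its schema-id."""
    return next(s for s in meta["schemas"] if s["schema-id"] == schema_id)

def renamed_fields(old: dict, new: dict) -> dict:
    """Map old name -> new name for fields whose ID is unchanged but name differs."""
    old_names = {f["id"]: f["name"] for f in old["fields"]}
    return {
        old_names[f["id"]]: f["name"]
        for f in new["fields"]
        if f["id"] in old_names and old_names[f["id"]] != f["name"]
    }

current = schema_by_id(meta, meta["current-schema-id"])
previous = schema_by_id(meta, 1)
renames = renamed_fields(previous, current)
print(renames)
```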

		

Again, if you inspect the directories and files in the warehouse, you’ll see the impact of these changes at the file system level.

			
#find all files starting from this directory
!find {WAREHOUSE}/sales_db/orders_partitioned/data -type f
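
To see why these schema changes need no data rewrite, it helps to remember that Iceberg resolves columns by field ID rather than by name. The following is a toy illustration in plain Python, not how PyIceberg is implemented: the row dictionaries, the schema mapping and the project helper are assumptions made for the example.

```python
# Values in a data file are stored against field IDs, not column names.
# These two rows stand in for a file written before any schema change.
old_file_rows = [
    {1: 1001, 2: "Alice", 6: 1299.99},
    {1: 1002, 2: "Bob",   6: 1798.00},
]

# Hypothetical current schema: field ID -> column name.
# Field 7 was added (and later renamed) after the file was written.
current_schema = {1: "order_id", 2: "customer", 6: "revenue", 7: "discount_percent"}

def project(row: dict, schema: dict) -> dict:
    """Read a stored row through a schema; IDs missing from the file become None."""
    return {name: row.get(field_id) for field_id, name in schema.items()}

projected = [project(r, current_schema) for r in old_file_rows]
print(projected[0])
```

The old file is read unchanged; the newer column simply comes back as a null, which matches the NaN and None values shown in the scans later in this post.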

Row Level Operations

Iceberg v2 introduces two delete file formats that enable row-level mutations without rewriting entire data files immediately — writes stay fast, and reads merge deletes on the fly.

Operation | Iceberg mechanism | Write cost | Read cost
--- | --- | --- | ---
Append | New data files only | Low | Low
Delete rows | Position or equality delete files | Low | Medium
Update rows | Delete + new data file (copy-on-write or merge-on-read) | Medium | Medium
Overwrite | Atomic swap of data files (replace partition) | Medium | Low
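
The trade-off between the two delete strategies can be sketched in a few lines of plain Python. This is a conceptual model only: the function names and the sample rows are hypothetical, and which strategy a real writer uses depends on the engine and the table settings.

```python
# Hypothetical data file: list position -> (order_id, region)
data_file = [
    (1001, "EU"),
    (1002, "US"),
    (1004, "APAC"),
    (1005, "US"),
]

def copy_on_write(rows, should_delete):
    """Rewrite the data file without the deleted rows (more work at write time)."""
    return [row for row in rows if not should_delete(row)]

def position_deletes(rows, should_delete):
    """Record only the positions of deleted rows in a small side file."""
    return {pos for pos, row in enumerate(rows) if should_delete(row)}

def merge_on_read(rows, deleted_positions):
    """Apply the delete file while reading (more work on every read)."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]

def is_apac(row):
    return row[1] == "APAC"

rewritten = copy_on_write(data_file, is_apac)
deletes = position_deletes(data_file, is_apac)
merged = merge_on_read(data_file, deletes)

print(deletes)
print(merged == rewritten)
```

Both paths give the reader the same logical rows; they differ only in whether the cost is paid once at write time or on each read.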

			
from pyiceberg.expressions import EqualTo, In
# Delete all orders from the APAC region
table.delete(EqualTo("region", "APAC"))
print(table.scan().to_pandas())
   order_id customer   product region  order_date  revenue  discount_percent  \
0      1001    Alice    Laptop     EU  2024-01-15  1299.99               NaN   
1      1002      Bob     Phone     US  2024-01-20  1798.00               NaN   
2      1003    Carol    Tablet     EU  2024-02-03   549.50               NaN   
3      1005      Eve  Keyboard     US  2024-03-05   399.95               NaN   
4      1006    Frank    Webcam     EU  2024-03-12   258.00               NaN   
  is_returned  
0        None  
1        None  
2        None  
3        None  
4        None  

		

We can also delete specific rows by matching a list of values.

			
# Delete specific order IDs
table.delete(In("order_id", [1001, 1003]))
# Verify — deleted rows are gone from the logical view
df_after = table.scan().to_pandas()
print(f"Rows after delete: {len(df_after)}")
print(df_after[["order_id", "customer", "region"]])
Rows after delete: 3
   order_id customer region
0      1002      Bob     US
1      1005      Eve     US
2      1006    Frank     EU

		

We can see partition pruning in action here: a scan with EqualTo("region", "EU") will skip all data files in the region=US/ and region=APAC/ directories entirely, so zero bytes are read from those files.

Advanced Scanning & Query Patterns

PyIceberg’s scan API supports rich predicate pushdown through the full expression API (And, Or, Not, In, NotIn, StartsWith, IsNull), time travel by snapshot ID, incremental reads between two snapshots for CDC pipelines, and streaming via an Arrow RecordBatchReader for processing data that does not fit in memory.
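
Predicate pushdown works partly because Iceberg keeps per-file column statistics (lower and upper bounds) in its metadata, so whole files can be ruled out before they are opened. Here is a minimal sketch of that idea in plain Python. The FileStats class, the file names and the bounds are invented for illustration and are not PyIceberg objects.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileStats:
    path: str
    revenue_min: float   # lower bound of the revenue column in this file
    revenue_max: float   # upper bound of the revenue column in this file

# Hypothetical statistics for three data files
files = [
    FileStats("a.parquet", 258.00, 399.95),
    FileStats("b.parquet", 549.50, 1299.99),
    FileStats("c.parquet", 129.00, 1798.00),
]

def might_match_greater_than(stats: FileStats, threshold: float) -> bool:
    """A file can only hold rows with revenue > threshold if its max exceeds it."""
    return stats.revenue_max > threshold

# Equivalent in spirit to a GreaterThan("revenue", 500.0) filter
to_read = [f.path for f in files if might_match_greater_than(f, 500.0)]
print(to_read)
```

Files that survive this check may still contain non-matching rows, so the row-level filter is applied again when the file is read.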

Let’s start by adding some data back into the table.

			
df3 = pd.DataFrame({
    "order_id":    [1001, 1003, 1004, 1006, 1007],
    "customer":    ["Alice", "Carol", "Dave", "Frank", "Grace"],
    "product":     ["Laptop", "Tablet", "Monitor", "Headphones", "Webcam"],
    "order_date":  [
        date(2024, 1, 15), date(2024, 2, 3),  date(2024, 2, 20), date(2024, 4, 1), date(2024, 4, 5)],
    "region":      ["EU", "EU", "APAC", "EU", "APAC"],
    "revenue":    [1299.99, 549.50, 1197, 498.00, 129.00]
})
#Add the data
table.append(pa.Table.from_pandas(df3))

		

Let’s try a query with several predicates.

			
from pyiceberg.expressions import (
    And, Or, Not,
    EqualTo, NotEqualTo,
    GreaterThan, GreaterThanOrEqual,
    LessThan, LessThanOrEqual,
    In, NotIn,
    IsNull, IsNaN,
    StartsWith,
)
# EU or US orders, revenue > 500, product is not "Keyboard"
df_complex = table.scan(
    row_filter=And(
        Or(
            EqualTo("region", "EU"),
            EqualTo("region", "US"),
        ),
        GreaterThan("revenue", 500.0),
        NotEqualTo("product", "Keyboard"),
    ),
    selected_fields=("order_id", "customer", "product", "region", "revenue"),
).to_pandas()
print(df_complex)
   order_id customer product region  revenue
0      1001    Alice  Laptop     EU  1299.99
1      1003    Carol  Tablet     EU   549.50
2      1002      Bob   Phone     US  1798.00

		

Now let’s try a NOT predicate

			
df_not_in = table.scan(
    row_filter=Not(In("region", ["US", "APAC"]))
).to_pandas()
print(df_not_in)
   order_id customer     product region  order_date  revenue  \
0      1001    Alice      Laptop     EU  2024-01-15  1299.99   
1      1003    Carol      Tablet     EU  2024-02-03   549.50   
2      1006    Frank  Headphones     EU  2024-04-01   498.00   
3      1006    Frank      Webcam     EU  2024-03-12   258.00   
   discount_percent is_returned  
0               NaN        None  
1               NaN        None  
2               NaN        None  
3               NaN        None 

		

Now let’s filter for rows where a column value starts with a given prefix.

			
df_starts = table.scan(
    row_filter=StartsWith("product", "Lap")  # matches "Laptop", "Laptop Pro"
).to_pandas()
print(df_starts)
   order_id customer product region  order_date  revenue  discount_percent  \
0      1001    Alice  Laptop     EU  2024-01-15  1299.99               NaN   
  is_returned  
0        None

		

We can also cap the number of rows returned using the limit parameter.

			
df_sample = table.scan(limit=3).to_pandas()
print(df_sample)
   order_id customer  product region  order_date  revenue  discount_percent  \
0      1001    Alice   Laptop     EU  2024-01-15  1299.99               NaN   
1      1003    Carol   Tablet     EU  2024-02-03   549.50               NaN   
2      1004     Dave  Monitor   APAC  2024-02-20  1197.00               NaN   
  is_returned  
0        None  
1        None  
2        None

		

We can also perform data streaming.

			
# Process very large tables without loading everything into memory at once
scan = table.scan(selected_fields=("order_id", "revenue"))
total_revenue = 0.0
total_rows    = 0
# to_arrow_batch_reader() returns an Arrow RecordBatchReader
for batch in scan.to_arrow_batch_reader():
    df_chunk       = batch.to_pandas()
    total_revenue += df_chunk["revenue"].sum()
    total_rows    += len(df_chunk)
print(f"Total rows:    {total_rows}")
print(f"Total revenue: ${total_revenue:,.2f}")
Total rows:    8
Total revenue: $6,129.44

		

DuckDB and Iceberg Tables

We can register the result of an Iceberg scan as a DuckDB virtual table. PyIceberg handles the metadata and reads the selected data into Arrow; DuckDB then runs SQL over that Arrow data.

			
import duckdb
conn = duckdb.connect()
# Materialise the scan as an in-memory Arrow table that DuckDB can query
scan = table.scan()
arrow_table = scan.to_arrow()  # or to_arrow_batch_reader() to stream batches
conn.register("orders", arrow_table)
# Full SQL on the table
result = conn.execute("""
    SELECT
        region,
        COUNT(*)                             AS order_count,
        ROUND(SUM(revenue), 2)               AS total_revenue,
        ROUND(AVG(revenue), 2)               AS avg_revenue,
        ROUND(MAX(revenue) - MIN(revenue), 2) AS revenue_range
    FROM orders
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()
print(result)
  region  order_count  total_revenue  avg_revenue  revenue_range
0     EU            4        2605.49       651.37        1041.99
1     US            2        2197.95      1098.97        1398.05
2   APAC            2        1326.00       663.00        1068.00

		

DuckDB also has a native Iceberg extension, which reads the table’s metadata and Parquet files directly without going through PyIceberg.

			
import duckdb, glob
conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg;")
# Enable version guessing for Iceberg tables
conn.execute("SET unsafe_enable_version_guessing = true;")
# Point DuckDB at the Iceberg table root directory
table_path = f"{WAREHOUSE}/sales_db/orders_partitioned"
df_duck = conn.execute(f"""
    SELECT *
    FROM iceberg_scan('{table_path}', allow_moved_paths = true)
    WHERE revenue > 500
    ORDER BY revenue DESC
""").df()
print(df_duck)
   order_id customer  product region order_date  revenue  discount_percent  \
0      1002      Bob    Phone     US 2024-01-20  1798.00               NaN   
1      1001    Alice   Laptop     EU 2024-01-15  1299.99               NaN   
2      1004     Dave  Monitor   APAC 2024-02-20  1197.00               NaN   
3      1003    Carol   Tablet     EU 2024-02-03   549.50               NaN   
   is_returned  
0         <NA>  
1         <NA>  
2         <NA>  
3         <NA>

		

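We can also use Iceberg’s time travel feature to query the table as it was at an earlier snapshot. Here we use the first snapshot in the table history, which holds the original six rows.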

			
# Time travel via DuckDB
snap_id = table.history()[0].snapshot_id
df_tt = conn.execute(f"""
    SELECT * FROM iceberg_scan(
        '{table_path}',
        snapshot_from_id = {snap_id},
        allow_moved_paths = true
    )
""").df()
print(f"Time travel rows: {len(df_tt)}")
Time travel rows: 6
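
Conceptually, time travel works because every commit produces a new snapshot that lists the data files visible at that point, and older snapshots are kept. The toy model below shows this in plain Python. The Snapshot class, the snapshot IDs, the file names and the row counts are all hypothetical. In PyIceberg itself, table.scan() accepts a snapshot_id argument for the same purpose.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset   # data files visible in this snapshot

# Hypothetical file contents: path -> number of rows
row_counts = {"a.parquet": 6, "b.parquet": 5}

# Hypothetical history: a first append, then a second append
history = [
    Snapshot(101, frozenset({"a.parquet"})),
    Snapshot(102, frozenset({"a.parquet", "b.parquet"})),
]

def rows_at(snapshot_id: int) -> int:
    """Count the rows visible when reading the table as of a given snapshot."""
    snap = next(s for s in history if s.snapshot_id == snapshot_id)
    return sum(row_counts[path] for path in snap.files)

print(rows_at(101), rows_at(102))
```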

		

This entry was posted in Analytics, Big Data and tagged AI, Analytics, Iceberg, Python.

Ora-lytics

By Brendan Tierney
