
Daft v0.7.15: Safe Type Conversions, Flight Shuffle Optimizations, and PostgreSQL Support
Daft v0.7.15 ships with try_cast for safe type conversion, Flight shuffle LZ4 compression, UUIDv7 timestamp extraction, and PostgreSQL support.
by Everett Kleven
TLDR:
- try_cast() brings safe type conversion with null fallbacks instead of runtime errors
- Flight shuffle now defaults to LZ4 compression — frames stay compressed across disk and the wire, ~2.3x faster shuffle on EBS gp3
- UUIDv7 timestamp extraction enables time-based partitioning on UUID columns
Data transformations fail when type conversions encounter unexpected values. Shuffle performance becomes a bottleneck when workers exchange data across the network. Time-based queries struggle with UUID primary keys that embed temporal information.
Daft v0.7.15 addresses each of these pain points with 55+ improvements from 21 contributors, spanning safe type handling, shuffle optimizations, and expanded data source support.
Safe Type Conversion with try_cast()
Type conversion errors halt pipelines when source data doesn't match schema expectations. The new try_cast() function converts types safely, returning null for invalid values instead of throwing exceptions.
# /// script
# description = "Safe type conversion with try_cast function"
# requires-python = ">=3.12"
# dependencies = ["daft==0.7.15"]
# ///
import daft
df = daft.from_pydict({
    "values": ["123", "invalid", "456", "", "789"],
    "timestamps": ["2024-01-01", "not-a-date", "2024-12-31", None, "2024-06-15"]
})
df = df.with_columns({
    "safe_ints": daft.col("values").try_cast(daft.DataType.int64()),
    "safe_dates": daft.col("timestamps").try_cast(daft.DataType.date())
})
df.show()
try_cast() succeeds where cast() would fail, converting "123", "456", and "789" to integers while gracefully handling "invalid" and empty strings as null values. This is critical for ETL pipelines processing messy real-world data.
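To make the null-fallback behavior concrete, here is a minimal pure-Python sketch of the idea. It is not Daft's implementation: the helper names try_cast_int and try_cast_date are hypothetical, and Python's own parsing rules (for example, int() tolerating surrounding whitespace) may differ from Daft's.

```python
from datetime import date


def try_cast_int(value):
    """Return int(value), or None when the value cannot be parsed (hypothetical helper)."""
    if value is None:
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def try_cast_date(value):
    """Return an ISO-8601 date, or None when the value cannot be parsed (hypothetical helper)."""
    if value is None:
        return None
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


# Same sample data as the Daft example above.
values = ["123", "invalid", "456", "", "789"]
timestamps = ["2024-01-01", "not-a-date", "2024-12-31", None, "2024-06-15"]

safe_ints = [try_cast_int(v) for v in values]
safe_dates = [try_cast_date(t) for t in timestamps]

# Nulls stay aligned with their input rows, so bad records are easy to count.
bad_int_rows = sum(1 for v in safe_ints if v is None)
```

Because failed conversions stay in place as nulls rather than aborting the run, a follow-up step can count or filter them, as bad_int_rows does here.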
Contributed by @XuQianJin-Stars in #6960.
Flight Shuffle LZ4 Compression
Distributed joins and aggregations shuffle data between workers, and disk and network overhead can dominate execution time for data-intensive operations. Daft v0.7.15 defaults Flight shuffle to LZ4 compression: frames are compressed once on the map side and stay compressed across disk and the wire. A 1 TB TPC-H repartition sweep shows LZ4 winning at every partition count, with runs ~10% faster on local NVMe and ~2.3x faster on EBS gp3.
# /// script
# description = "Use Flight Shuffle with LZ4 compression for a distributed shuffle"
# requires-python = ">=3.12"
# dependencies = ["daft==0.7.15"]
# ///
import daft
daft.set_execution_config(
    shuffle_algorithm="flight_shuffle",
    flight_shuffle_compression="lz4",
)
cfg = daft.context.get_context().daft_execution_config
print("shuffle_algorithm =", cfg.shuffle_algorithm)
print("flight_shuffle_compression =", cfg.flight_shuffle_compression)
df = daft.from_pydict({
    "user_id": [i % 1000 for i in range(100_000)],
    "amount": list(range(100_000)),
})
df.groupby("user_id").agg(daft.col("amount").sum().alias("total")).sort("user_id").show()
LZ4 is cheap enough to encode that the compression pays for itself even on fast local disks, and the win grows as storage gets slower. Set flight_shuffle_compression="none" to restore the previous uncompressed behavior.
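The "compress once, stay compressed" flow can be sketched in plain Python. This is an illustration only: it uses zlib from the standard library as a stand-in for LZ4, JSON as a stand-in for the real frame format, and a dict as a stand-in for disk. All function names are hypothetical and none of this reflects Daft's internals.

```python
import json
import zlib


def map_side_encode(rows):
    """Serialize and compress a partition exactly once (zlib stands in for LZ4)."""
    raw = json.dumps(rows).encode("utf-8")
    return raw, zlib.compress(raw, 1)  # level 1: favor speed over ratio


def spill_to_disk(store, key, frame):
    """Write the already-compressed frame as opaque bytes; no re-encoding."""
    store[key] = frame


def send_over_wire(store, key):
    """Ship the stored bytes unchanged; the frame is never decompressed in transit."""
    return store[key]


def reduce_side_decode(frame):
    """Decompress once, on the receiving side."""
    return json.loads(zlib.decompress(frame).decode("utf-8"))


rows = [{"user_id": i % 1000, "amount": i} for i in range(10_000)]

raw, frame = map_side_encode(rows)
disk = {}
spill_to_disk(disk, "partition-0", frame)
received = send_over_wire(disk, "partition-0")
decoded = reduce_side_decode(received)

print(f"raw bytes: {len(raw)}, compressed bytes: {len(frame)}")
```

The point of the sketch is that both the disk write and the network transfer move the smaller compressed frame, which is why the benefit grows as storage or the network gets slower.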
Contributed by @colin-ho in #7071 and #6979.
UUIDv7 Timestamp Extraction
UUIDv7 embeds timestamps in the first 48 bits, enabling time-based partitioning and filtering on UUID columns. The new timestamp extraction functions unlock temporal analytics on UUID primary keys.
# /// script
# description = "Extract the timestamp embedded in UUIDv7 values for partitioning"
# requires-python = ">=3.12"
# dependencies = ["daft==0.7.15"]
# ///
import daft
from daft import col
from daft.functions import extract_day_uuid7, extract_hour_uuid7, uuid
df = daft.from_pydict({
    "event": ["login", "purchase", "refund"],
})
df = df.with_column("id", uuid(version="v7"))
df = df.with_columns({
    "hour_bucket": extract_hour_uuid7(col("id")),
    "day_bucket": extract_day_uuid7(col("id")),
})
df.select("event", "id", "hour_bucket", "day_bucket").show()
UUIDv7 timestamp extraction enables efficient time-based partitioning schemes without additional timestamp columns. It is particularly valuable for event sourcing and audit log systems that use UUID primary keys.
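For readers curious about what "the first 48 bits" means in practice, the following standard-library sketch builds a UUIDv7 from a known instant and reads the timestamp back out. The helpers make_uuid7 and uuid7_timestamp are hypothetical, and the hour and day buckets below are plain truncations chosen for illustration; the exact return types of Daft's extraction functions are not assumed here.

```python
import os
import uuid
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def make_uuid7(unix_ms: int) -> uuid.UUID:
    """Assemble a UUIDv7: 48-bit Unix ms timestamp, version, random bits, variant."""
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF            # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)   # 62 random bits
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80   # timestamp in the top 48 bits
    value |= 0x7 << 76                         # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                        # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


def uuid7_timestamp(u: uuid.UUID) -> datetime:
    """Recover the embedded timestamp by shifting away the low 80 bits."""
    return EPOCH + timedelta(milliseconds=u.int >> 80)


known = datetime(2024, 6, 15, 13, 45, 30, 123000, tzinfo=timezone.utc)
unix_ms = (known - EPOCH) // timedelta(milliseconds=1)

event_id = make_uuid7(unix_ms)
ts = uuid7_timestamp(event_id)

# Illustrative partition keys derived from the ID alone.
hour_bucket = ts.replace(minute=0, second=0, microsecond=0)
day_bucket: date = ts.date()
```

Since the timestamp occupies the most significant bits, IDs generated later also sort later, which is what makes the ID usable as a partitioning key on its own.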
Contributed by @jaychia in #7032.
PostgreSQL Data Source Support
Daft now reads PostgreSQL tables directly via Gravitino catalog integration, expanding structured data source support beyond cloud object stores.
from daft.catalog import Catalog
catalog = Catalog.from_gravitino(
    endpoint="http://localhost:8090",
    metalake_name="your_metalake",
)
df = catalog.read_table("postgres_catalog.public.users")
df.show()
Query a PostgreSQL table and a Parquet dataset in S3 in the same job, without an export step in between.
Contributed by @qingfeng-occ in #6989.
ASOF Join Enhancements
ASOF (As-Of) joins match records based on temporal proximity rather than exact equality. Daft v0.7.15 adds nearest ASOF joins and aligned partition assumptions for financial time series and sensor data analytics.
# /// script
# description = "Match each trade to the nearest quote with an as-of join"
# requires-python = ">=3.12"
# dependencies = ["daft==0.7.15"]
# ///
import daft
trades = daft.from_pydict({
    "ts": [100, 250, 400],
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "price": [150.0, 155.0, 200.0],
}).sort("ts")
quotes = daft.from_pydict({
    "ts": [90, 260, 395, 500],
    "bid": [149.5, 154.8, 199.7, 153.0],
}).sort("ts")
trades.join_asof(quotes, on="ts", strategy="nearest").sort("ts").show()
Nearest ASOF joins find the temporally closest match regardless of whether it occurs before or after the target timestamp. Assuming sorted and aligned partitions boosts performance for pre-sorted time series data.
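The difference between a nearest match and a conventional backward-looking match is easiest to see on the sample data itself. The sketch below is plain Python using bisect, not Daft's join implementation; the function names are hypothetical, and breaking ties toward the earlier quote is an assumption made for this illustration.

```python
from bisect import bisect_left, bisect_right


def backward_match(left_ts, right_ts):
    """Index of the latest right timestamp <= each left timestamp (right_ts sorted)."""
    matches = []
    for ts in left_ts:
        pos = bisect_right(right_ts, ts) - 1
        matches.append(pos if pos >= 0 else None)
    return matches


def nearest_match(left_ts, right_ts):
    """Index of the closest right timestamp on either side (right_ts sorted).

    Ties go to the earlier timestamp; that tie-break is an assumption here.
    """
    matches = []
    for ts in left_ts:
        pos = bisect_left(right_ts, ts)
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(right_ts)]
        matches.append(min(candidates, key=lambda i: abs(right_ts[i] - ts), default=None))
    return matches


trade_ts = [100, 250, 400]
quote_ts = [90, 260, 395, 500]
quote_bid = [149.5, 154.8, 199.7, 153.0]

backward_bids = [None if i is None else quote_bid[i] for i in backward_match(trade_ts, quote_ts)]
nearest_bids = [None if i is None else quote_bid[i] for i in nearest_match(trade_ts, quote_ts)]

print("backward:", backward_bids)
print("nearest: ", nearest_bids)
```

The trade at ts=250 is where the two strategies diverge: a backward match reaches back to the quote at ts=90, while the nearest match picks the quote at ts=260, which is only 10 ticks away.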
Contributed by @euanlimzx in #6953 and #7067.
Window Function Performance
Window operations now use specialized Series return paths, reducing memory copies during finalization. first_value() and last_value() aggregations join the window function library for ranking and analytical queries.
# /// script
# description = "first_value and last_value over an ordered window"
# requires-python = ">=3.12"
# dependencies = ["daft==0.7.15"]
# ///
import daft
from daft import Window, col
sales = daft.from_pydict({
    "region": ["North", "North", "North", "South", "South"],
    "month": [1, 2, 3, 1, 2],
    "revenue": [100, 120, 90, 80, 95],
})
w = (
    Window().partition_by("region").order_by("month")
    .rows_between(Window.unbounded_preceding, Window.unbounded_following)
)
sales = sales.with_columns({
    "first_month_rev": col("revenue").first_value().over(w),
    "last_month_rev": col("revenue").last_value().over(w),
})
sales.sort(["region", "month"]).show()
Returning Series from window ops instead of full RecordBatches cuts peak memory on partition-heavy window queries by up to ~39% (benchmarked across partition and frame configurations). first_value() and last_value() return the first and last values in each window frame, so every row can be compared against its partition's opening or closing value without manual offset calculations.
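As a reference for what the window above should produce, here is a small pure-Python sketch of first_value and last_value over an unbounded frame, using the same sample data. It only mirrors the semantics (partition, order, then take the first and last value of the frame) and says nothing about how Daft executes the query.

```python
from collections import defaultdict

rows = [
    {"region": "North", "month": 1, "revenue": 100},
    {"region": "North", "month": 2, "revenue": 120},
    {"region": "North", "month": 3, "revenue": 90},
    {"region": "South", "month": 1, "revenue": 80},
    {"region": "South", "month": 2, "revenue": 95},
]

# Partition by region.
partitions = defaultdict(list)
for row in rows:
    partitions[row["region"]].append(row)

result = []
for region in sorted(partitions):
    # Order each partition by month; the frame spans the whole partition.
    ordered = sorted(partitions[region], key=lambda r: r["month"])
    first_rev = ordered[0]["revenue"]
    last_rev = ordered[-1]["revenue"]
    for r in ordered:
        result.append({**r, "first_month_rev": first_rev, "last_month_rev": last_rev})

for r in result:
    print(r)
```

With an unbounded frame, every row in a partition sees the same first and last value: 100 and 90 for North, 80 and 95 for South.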
Contributed by @euanlimzx in #6974, #7006, and #7011.
Everything Else
Video Processing: video_frames() supports configurable sampling intervals via the sample_interval_seconds parameter (@TheR1sing3un #6832)
Iceberg Integration: Branch and tag reads for Iceberg tables, plus auto-configuration for Alibaba Cloud OSS (@jackylee-ch #7042, @plusplusjiajia #6993)
Join Optimization: A specialized code path for hash joins on integer keys — roughly 2x faster dedupe at small scale (@srilman #6644)
Observability: Distributed checkpoint counters and cross-sink helpers for production monitoring (@rohitkulshreshtha #7026, #6932)
Cloud Storage: GCS object store delete operations and per-column compression configuration (@daiping8 #6958, @rchowell #6884)
Community Contributions
12 external contributors expanded Daft's capabilities across type safety, data source integration, and cloud storage:
- @XuQianJin-Stars implemented try_cast safe type conversion.
- @qingfeng-occ integrated PostgreSQL via Gravitino.
- @jackylee-ch enhanced Iceberg with branch/tag reads.
- @TheR1sing3un added video frame sampling intervals.
- @daiping8 implemented GCS delete operations.
- @BABTUNA optimized multi-column aggregations.
- @plusplusjiajia added Alibaba Cloud OSS support.
- @aaron-ang refactored file byte-range fields.
- @kyuds fixed Ray integration compatibility.
- @ARDA7787 improved retry jitter calculations.
- @mikedep333 cleaned up feature gates.
- @0xdeadd bumped the minimum PyArrow version to 16.
Upgrade
uv add "daft>=0.7.15"
Or try the latest nightly:
uv pip install daft --pre --extra-index-url https://nightly.daft.ai
Check the full changelog for the complete list of merged PRs.
Join the Community
Questions about try_cast, Flight shuffle tuning, or UUIDv7 extraction? Connect with the Daft community:
- Slack — Daily discussions and real-time support
- GitHub Discussions — Technical deep dives and feature requests
- Documentation — Complete API reference and tutorials

