About:
Daniel is a data enthusiast with a background in agriculture, passionate about data engineering and related technologies.
The author expresses strong criticism of the concept of 'Medallion Architecture' in data modeling, particularly as promoted by Databricks. They argue that this approach is overly complicated and misleading, confusing many data eng...
The author discusses the challenges of inserting large datasets into a Postgres database using Spark's JDBC method, which is notably slow. After experimenting, the author finds that combining Python with Spark's multi-processing c...
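A minimal sketch of the pattern this post points toward, under my own assumptions: instead of Spark's row-oriented JDBC writer, stream pre-split CSV chunks into Postgres with `COPY`, fanning the chunks out over a process pool. The DSN, table name, and file names below are illustrative, not taken from the post.

```python
# Hedged sketch: parallel bulk loads into Postgres via COPY, as an
# alternative to slow row-by-row JDBC inserts. Assumes the dataset has
# already been split into CSV chunks on disk.
from multiprocessing import Pool


def copy_chunk(csv_path: str) -> str:
    """Bulk-load one CSV chunk into Postgres via COPY."""
    import psycopg2  # assumed driver; any client exposing COPY works

    conn = psycopg2.connect("dbname=analytics user=etl")  # hypothetical DSN
    try:
        # The connection context manager commits on success,
        # rolls back on error; close() still has to be explicit.
        with conn, conn.cursor() as cur, open(csv_path) as f:
            cur.copy_expert(
                "COPY events FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )
    finally:
        conn.close()
    return csv_path


def load_in_parallel(chunks: list[str], workers: int = 4) -> list[str]:
    """Fan the COPY calls out over a process pool, one chunk per worker."""
    with Pool(processes=workers) as pool:
        return list(pool.imap_unordered(copy_chunk, chunks))
```

`COPY` bypasses per-row INSERT overhead entirely, and the pool keeps several Postgres sessions loading concurrently, which is where most of the speedup over a single JDBC writer comes from.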
The author shares their experience migrating hundreds of Delta Lake tables from partitioning to liquid clustering, highlighting the complexities and potential pitfalls involved in the process. They emphasize that while the change ...
DuckDB outperforms Polars in handling large datasets, showcasing better developer support and reliability in production environments.
The author discusses transitioning from Spark to Polars for handling distributed compute jobs, emphasizing the cost-effectiveness and simplicity of using single-node tools. They highlight the challenges of managing large datasets ...
The author expresses a strong opinion that data modeling is dead, criticizing the current generation of data engineers for their reliance on modern technologies like Data Lakes and Lake Houses, which have overshadowed traditional ...
The author reflects on the evolution of software engineering, contrasting the nostalgic past with the current landscape dominated by AI tools like Cursor. They discuss the importance of adapting to new technologies while emphasizi...
The author expresses surprise at the lack of awareness surrounding 'Lazy Execution' in data processing, particularly with tools like Polars, Daft, and DuckDB. They emphasize the importance of processing data in batches rather than...
The blog post discusses the challenges faced by data engineers, particularly in Spark performance tuning and optimizations. It emphasizes the importance of understanding DataFrame partitions and their impact on performance. The au...
A disciplined migration to Databricks emphasizes strong fundamentals, clear governance, and intentional design to ensure stability and control over the data architecture.
Declarative Pipelines represent the future of Spark by simplifying data engineering through structured frameworks that enhance reliability and maintainability.
The post discusses the shift from traditional database drivers to Apache Arrow Database Connectivity, emphasizing its efficiency and performance benefits in data handling.
The author expresses frustration with the overuse of Terraform and YAML in infrastructure management, arguing that it complicates development and debugging processes. They reminisce about simpler times when API calls and SDKs were...
AI will transform software engineering but is unlikely to replace engineers entirely, as human oversight and expertise remain essential.
In the Age of AI, maintaining clean and organized code remains crucial for software developers, especially when using tools like Polars for data pipelines.
Embrace Agentic AI as a tool for innovation, encouraging software developers to learn and adapt rather than fear technological change.
The blog post discusses the evolving landscape of AI and LLMs, emphasizing the importance of hands-on experience in coding with these technologies. It highlights the growing reliance on AI tools among junior developers, leading to...
The post discusses the comparison between DuckDB and Polars, emphasizing that there is no definitive answer to which is better as it depends on the context of use. DuckDB is described as an embedded analytical database suitable fo...
The author explores the challenges of loading CSV files with mismatched schemas using two data engineering tools: DuckDB and Polars. The post discusses how DuckDB handles schema mismatches by allowing for merging options, while Po...
The author discusses common errors encountered while using the Rust GOAT dataframe tool Polars, particularly focusing on schema mismatches when handling large CSV files. The text highlights the frustration of dealing with errors l...
Databricks' temporary tables simplify data pipeline management for SQL teams, offering a familiar structure that reduces clutter and eases migration from traditional data warehouses.
The author discusses the integration of Apache Iceberg with DuckDB, expressing frustration over the lack of write support in DuckDB's Iceberg extension. They share their experience testing this integration on the Databricks platfo...
The post advocates for using Lance as a simple and efficient vector database for storing embeddings in AI applications, emphasizing the need for software professionals to stay updated with new technologies.
The author discusses the increasing utility of the PyArrow Python package for data engineering tasks, particularly in data ingestion and handling large datasets in cloud storage. The text highlights PyArrow's capabilities in readi...