Deep dives into Apache Spark, scalable architectures, and the code that powers big data. Written for engineers, by engineers.
# Optimizing your Spark Job
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MeedsterDataPipe") \
    .getOrCreate()

# Is it really efficient?
df = spark.read.parquet("s3://data/...")
df.groupBy("id").count().show()
```
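One way to answer the comment's question is to look at the physical plan before changing anything. Below is a minimal sketch of that workflow, assuming the same pipeline as above: the shuffle-partition value is illustrative, not a recommendation, and the S3 path is left elided as in the original snippet. For this query, Spark's optimizer already prunes the Parquet scan down to the `id` column, which `explain()` will confirm.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MeedsterDataPipe") \
    .config("spark.sql.shuffle.partitions", "64") \
    .getOrCreate()
# The default of 200 shuffle partitions is often too many for a
# modest groupBy; 64 here is an illustrative value to tune.

# Path elided as in the snippet above.
df = spark.read.parquet("s3://data/...")

counts = df.groupBy("id").count()

# Inspect the physical plan: the Parquet scan should read only the
# `id` column (column pruning), followed by a partial aggregate, an
# exchange (the shuffle), and a final aggregate.
counts.explain()
counts.show()
```

If the plan shows the scan reading every column or the exchange fanning out to hundreds of tiny partitions, those are the first two knobs worth turning.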
Spark has become the de facto standard for big data processing, but is it overkill for your project? We break down the use cases, performance costs, and alternatives.
Clean up your ETL code with these powerful decorator patterns used in production environments.
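The full article isn't excerpted here, but as a taste of the pattern, here is a minimal sketch of one common ETL decorator: timing and logging a pipeline step. The `etl_step` name and the `load_orders` step are hypothetical, not taken from the article.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

def etl_step(func):
    """Log entry, exit, and wall-clock time of an ETL step."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("starting step %s", func.__name__)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logger.info("finished step %s in %.2fs", func.__name__, elapsed)
    return wrapper

@etl_step
def load_orders():
    # Hypothetical step body: read, transform, write.
    ...
```

Wrapping each step this way keeps cross-cutting concerns such as logging, retries, and metrics out of the transformation logic itself.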
Choosing the right container orchestration tool for your microservices architecture.
We stress-tested Mongo, Cassandra, and DynamoDB. The results might surprise you.