japila-books

The Internals Online Books

Welcome to “The Internals Of” Online Books project! 🤙

I’m Jacek Laskowski, a Freelance Data(bricks) Engineer specializing in Apache Spark (incl. Spark SQL and Spark Structured Streaming), Delta Lake, Databricks, and Apache Kafka (incl. Kafka Streams) with brief forays into a wider data engineering space (e.g., Trino, Dask and dbt, mostly during Warsaw Data Engineering meetups).

I’m very excited to have you here and hope you will enjoy exploring the internals of the following open-source projects together (in no particular order):

  1. Apache Spark
  2. Spark SQL
  3. Unity Catalog
  4. Spark Connect
  5. Spark Structured Streaming
  6. Delta Lake
  7. Spark on Kubernetes
  8. PySpark
  9. Apache Kafka (previously at gitbooks.io)
  10. Kafka Streams (previously at gitbooks.io)
  11. ksqlDB (no longer maintained)
  12. Apache Beam (no longer maintained)
  13. Spark Standalone (no longer maintained)

Please note that some books have less current content than others, but that’s expected with a one-person project where so many things are truly interesting and thus time-consuming. Life’s too short to taste everything :/

The aim of this project is to host all the current and future internals books under a single organization on GitHub and publish them to a single domain via GitHub Pages (until I find a better way to publish the books).

Custom Docker Image

The book projects use a custom Docker image.

The official Docker image does not include all the plugins the books need and is no longer available.

See the build-image.sh shell script to learn more.

Build Books Docker Image

Execute the build-image.sh shell script to build the Docker image.

Build Book

Use the docker run command with the build argument to build a book.

docker run \
  --rm \
  -it \
  -p 8000:8000 \
  -v ${PWD}:/docs \
  jaceklaskowski/mkdocs-material-insiders \
  build --clean
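To build several books one after another, the command above can be wrapped in a small helper. This is a minimal sketch, not part of the project: `build_book` and the `DOCKER` override are hypothetical names, and `-it` and `-p` are dropped on the assumption that a one-off build needs neither a terminal nor a published port.

```shell
# Hypothetical helper (not part of the books project).
# DOCKER can be overridden, e.g. DOCKER=echo prints the arguments instead of running Docker.
DOCKER="${DOCKER:-docker}"
IMAGE="jaceklaskowski/mkdocs-material-insiders"

# Build the book whose mkdocs.yml lives in the given directory (default: current directory).
build_book() {
  local book_dir
  # Resolve to an absolute path so the bind mount is unambiguous.
  book_dir="$(cd "${1:-.}" && pwd)" || return 1
  "$DOCKER" run \
    --rm \
    -v "${book_dir}:/docs" \
    "$IMAGE" \
    build --clean
}
```

With the function loaded, something like `for dir in */; do build_book "$dir" || break; done` would build every book checkout below the current directory and stop at the first failure.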

TIP: Consult the Material for MkDocs documentation to get started.

Live Editing

Use the docker run command with the serve argument (and --dirtyreload for faster reloads) in the project root (the folder with mkdocs.yml).

docker run \
  --rm \
  -it \
  -p 8000:8000 \
  -v ${PWD}:/docs \
  jaceklaskowski/mkdocs-material-insiders \
  serve --dirtyreload --verbose --dev-addr 0.0.0.0:8000
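If port 8000 is already taken on the host, only the host side of the `-p` mapping has to change, since the server inside the container keeps listening on 0.0.0.0:8000. A minimal sketch of that idea follows; `serve_book` and the `DOCKER` override are hypothetical names, not part of the project.

```shell
# Hypothetical helper (not part of the books project).
# DOCKER can be overridden, e.g. DOCKER=echo prints the arguments instead of running Docker.
DOCKER="${DOCKER:-docker}"
IMAGE="jaceklaskowski/mkdocs-material-insiders"

# Serve the book in the current directory on the given host port (default: 8000).
serve_book() {
  local host_port="${1:-8000}"
  "$DOCKER" run \
    --rm \
    -it \
    -p "${host_port}:8000" \
    -v "${PWD}:/docs" \
    "$IMAGE" \
    serve --dirtyreload --verbose --dev-addr 0.0.0.0:8000
}
```

Running `serve_book 8080` from the project root should then make the site reachable at http://localhost:8080.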

List Outdated Packages

Run an interactive shell in a container.

docker run \
  --rm \
  -it \
  -p 8000:8000 \
  -v ${PWD}:/docs \
  --entrypoint sh \
  jaceklaskowski/mkdocs-material-insiders

While inside, execute the following command to list outdated packages together with the latest version available for each (as described here).

python -m pip list --outdated
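The same check can probably be run without opening a shell first, by making python the entrypoint and passing the pip arguments after the image name. This is a minimal sketch: `list_outdated` and the `DOCKER` override are hypothetical names, and it assumes `python` is on the image's PATH, as the interactive command above suggests.

```shell
# Hypothetical helper (not part of the books project).
# DOCKER can be overridden, e.g. DOCKER=echo prints the arguments instead of running Docker.
DOCKER="${DOCKER:-docker}"
IMAGE="jaceklaskowski/mkdocs-material-insiders"

# List outdated Python packages in the image in a single, non-interactive run.
list_outdated() {
  "$DOCKER" run \
    --rm \
    --entrypoint python \
    "$IMAGE" \
    -m pip list --outdated
}
```

No volume or port mapping is needed here because the command only inspects the packages installed in the image.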

Follow @jaceklaskowski on Mastodon!