Experiment, Fail, Learn, Repeat

Life is too exciting to just keep still!

Lessons from Kubecon/CloudNativeCon 2018 Europe

The following set of summaries are from the Kubecon and Cloud Native Con Europe in Denmark from 2-4 May 2018.

These summaries are from conference talks that I thought provided more interesting thinking points.

The videos for the conference can be found here:
https://www.youtube.com/watch?v=OUYTNywPk-s&list=PLj6h78yzYM2N8GdbjmhVU65KYm_68qBmo

Below are some of the talks that I found quite interesting (just my own preference)
I took some of my personal notes so that I don’t need to rewatch the videos once more just to get the main point the video seem to talk about.

Anatomy of a Production Kubernetes Outage

Cloud Native Landscape Intro

Accelerating Kubernetes Native Applications

Kubernetes Project Update

  • Security
    • Network Policy
    • Encrypted Secrets
    • RBAC
    • TLS Cert Rotation
    • Pod Security Policy
    • Threat Detection (Not really part of Kubernetes - GKE Cloud Security Command Centre)
    • Sandbox Applications (Providing a tiny kernel for the container - gVisor)
  • Applications
    • Batch Applications
    • Workload Controllers, Local Storage
    • GPU access
    • Container Storage Interface
    • (Mention about a Spark operator - a software which manages the running of a Spark cluster)
    • Stackdriver. Integrates deeply with Prometheus
  • Developer Experience
    • Skaffold (Allows debug tool to be attached allowing interactive debugging with custom deployments)

The Challenges of Migrating 150+ microservices

  • Tools out there kind of follow the same cycle: Genesis -> Custom Built solutions -> Product Offering -> Commodity.
  • Chart from here: https://medium.com/wardleymaps/anticipation-89692e9b0ced
  • Link to whole blog post: https://medium.com/wardleymaps
  • When companies are big, moving and innovating becomes expensive (its not a technology problem but more of a human, community, company problem). So essentially, one can consider this as innovation tokens; tokens that should only be spent wisely, else failure would be result.
  • Choose boring technology. http://mcfunley.com/choose-boring-technology
  • One way to reduce risk is to run the applications on 2 parallel stacks but it is very expensive in terms of complexity and human effort. When doing this, one needs note of the costs of doing this kind of test
  • Such tests have an impact on cost - might be good to rope in the people with this on the test being run, the hypothesis of what that should be happening and the benefits that the company will have

Container-Native dev and ops experience

Container Native observability & security from Google Cloud

Continuously Deliver your Kubernetes Infrastructure

  • Philosophy for setting kubernetes clusters
    • No pet clusters (No special custom configuration for 80 clusters)
    • Always provide the latest stable Kubernetes version
    • Continuous and non-disruptive cluster updates
    • “Fully” automated operations (Able to redeploy by just doing PRs)
  • Cluster setup
    • Provision in AWS via cloud formation
    • Etcd stack outside Kubernetes
    • Container Linux
    • Multi-AZ worker nodes
    • HA control plane setup behind ELB
    • Cluster configuration in git
    • e2e test on Jenkins
  • Cluster registry
    • List of clusters available of access
  • https://github.com/zalando-incubator/kubernetes-on-aws
  • https://github.com/zalando-incubator/cluster-lifecycle-manager
  • Multiple “channels” of Kubernetes
    • Cluster upgrade moves from dev, alpha, beta clusters
    • dev (Cluster to play around with)
    • alpha (Main infrastructure cluster that is used by infrastructure team for testing)
    • beta (Main cluster rest of org uses)
    • Has e2e tests
    • Conformance tests (https://github.com/cncf/k8s-conformance)
    • Statefulset tests (Test attachment volumes - testing to use redis cluster?)
    • Has monitoring on each cluster to ensure behaviour
    • https://github.com/mikkeloscar/kubernetes-e2e
  • Hints for running e2e tests
    • Run with flake attempts=2. Some tests can fail due to autoscaling
    • Update e2e images with each release of Kubernetes
    • Disable broken e2e tests with -skip parameter
    • Remove completed pods from kube-system to make room for other pods of testing to enter (To save money)