Lessons from Kubecon/CloudNativeCon 2018 Europe

The following set of summaries are from the Kubecon and Cloud Native Con Europe in Denmark from 2-4 May 2018.

These summaries are from conference talks that I thought provided more interesting thinking points.

The videos for the conference can be found here:
https://www.youtube.com/watch?v=OUYTNywPk-s&list=PLj6h78yzYM2N8GdbjmhVU65KYm_68qBmo

Below are some of the talks that I found quite interesting (just my own preference)
I took some of my personal notes so that I don’t need to rewatch the videos once more just to get the main point the video seem to talk about.

Anatomy of a Production Kubernetes Outage
Cloud Native Landscape Intro
Accelerating Kubernetes Native Applications
Kubernetes Project Update
The Challenges of Migrating 150+ microservices
Container-Native dev and ops experience
Container Native observability & security from Google Cloud
Continuously Deliver your Kubernetes Infrastructure

Anatomy of a Production Kubernetes Outage

Production Outage occured
Blog Post: https://community.monzo.com/t/resolved-current-account-payments-may-fail-major-outage-27-10-2017/26296/95?u=alexs
Another blog post: https://community.monzo.com/t/anatomy-of-a-production-kubernetes-outage-presentation/37331
In summary: Checking for compatability between platform, tools are vital - such checks are vital especially on the platform level when they can cause cascading failures across the applications.
Fallbacks when systems fail is helpful; in the case above, applications failed but transactions continue running.

Cloud Native Landscape Intro

Introduction to the cloud native landscape tools and github page
Github Link: https://github.com/cncf/landscape
Website Link: https://landscape.cncf.io/
Get the pdf versions of the landscape from Github

Accelerating Kubernetes Native Applications

Operators is a concept that was build on Kubernetes providing the Custom Resource Definitions
Allows for specific application management; e.g. Managing the running of a database - if a database need to be resized, operators could be programmed to trigger snapshot before switching to a bigger pod which the data can be replicated in. (example only)
Reasons on why operators are kind of game changing: https://dzone.com/articles/why-kubernetes-operators-are-a-game-changer
Additional links: https://medium.com/@mtreacher/writing-a-kubernetes-operator-a9b86f19bfb9
Operator Framework by core os: https://coreos.com/operators/
Github link to operators: https://github.com/operator-framework/operator-sdk

Kubernetes Project Update

Security
- Network Policy
- Encrypted Secrets
- RBAC
- TLS Cert Rotation
- Pod Security Policy
- Threat Detection (Not really part of Kubernetes - GKE Cloud Security Command Centre)
- Sandbox Applications (Providing a tiny kernel for the container - gVisor)
Applications
- Batch Applications
- Workload Controllers, Local Storage
- GPU access
- Container Storage Interface
- (Mention about a Spark operator - a software which manages the running of a Spark cluster)
- Stackdriver. Integrates deeply with Prometheus
Developer Experience
- Skaffold (Allows debug tool to be attached allowing interactive debugging with custom deployments)

The Challenges of Migrating 150+ microservices

Tools out there kind of follow the same cycle: Genesis -> Custom Built solutions -> Product Offering -> Commodity.
Chart from here: https://medium.com/wardleymaps/anticipation-89692e9b0ced
Link to whole blog post: https://medium.com/wardleymaps
When companies are big, moving and innovating becomes expensive (its not a technology problem but more of a human, community, company problem). So essentially, one can consider this as innovation tokens; tokens that should only be spent wisely, else failure would be result.
Choose boring technology. http://mcfunley.com/choose-boring-technology
One way to reduce risk is to run the applications on 2 parallel stacks but it is very expensive in terms of complexity and human effort. When doing this, one needs note of the costs of doing this kind of test
Such tests have an impact on cost - might be good to rope in the people with this on the test being run, the hypothesis of what that should be happening and the benefits that the company will have

Container-Native dev and ops experience

Talk about the following tool: https://github.com/Azure/draft

Container Native observability & security from Google Cloud

Talk about the following tool: gVisor - this tool is a fix for the Dirty Cow vulnerability
Stackdriver support - Deep prometheus integration - It can import metrics stats over from it to stackdriver to provide the one glass pane to be able to view all applications being monitored in one tool
Podcast: https://kubernetespodcast.com/
Blog post talking about podcast: https://cloudplatform.googleblog.com/2018/05/introducing-kubernetes-podcast-from-google.html

Continuously Deliver your Kubernetes Infrastructure

Philosophy for setting kubernetes clusters
- No pet clusters (No special custom configuration for 80 clusters)
- Always provide the latest stable Kubernetes version
- Continuous and non-disruptive cluster updates
- “Fully” automated operations (Able to redeploy by just doing PRs)
Cluster setup
- Provision in AWS via cloud formation
- Etcd stack outside Kubernetes
- Container Linux
- Multi-AZ worker nodes
- HA control plane setup behind ELB
- Cluster configuration in git
- e2e test on Jenkins
Cluster registry
- List of clusters available of access
https://github.com/zalando-incubator/kubernetes-on-aws
https://github.com/zalando-incubator/cluster-lifecycle-manager
Multiple “channels” of Kubernetes
- Cluster upgrade moves from dev, alpha, beta clusters
- dev (Cluster to play around with)
- alpha (Main infrastructure cluster that is used by infrastructure team for testing)
- beta (Main cluster rest of org uses)
- Has e2e tests
- Conformance tests (https://github.com/cncf/k8s-conformance)
- Statefulset tests (Test attachment volumes - testing to use redis cluster?)
- Has monitoring on each cluster to ensure behaviour
- https://github.com/mikkeloscar/kubernetes-e2e
Hints for running e2e tests
- Run with flake attempts=2. Some tests can fail due to autoscaling
- Update e2e images with each release of Kubernetes
- Disable broken e2e tests with -skip parameter
- Remove completed pods from kube-system to make room for other pods of testing to enter (To save money)