Introduction
There are many ways to build a secure internal developer platform. One is to follow the aspect-oriented programming model: security is treated as a cross-cutting concern that the platform handles, so developers get guardrails and security features for free.
Utilizing multiple layers of security is a crucial aspect of any robust security strategy. This approach, also known as defense in depth, ensures that should one security layer fail, others are in place to thwart potential threats. Each layer addresses different types of threats and covers any gaps that other layers may leave unprotected.
This blog post will describe how to use multiple layers in a typical internal developer platform setup using aspect-oriented programming techniques to protect K8s clusters and their workloads. Some of the tools are Azure-specific, but there are similar alternatives in all the other major cloud providers.
Network Security Groups
Azure Network Security Groups (NSGs) are key in controlling network traffic in and out of resources in an Azure Virtual Network. They act as a firewall, enabling you to define security rules that allow or deny inbound and outbound traffic to various resources.
K8s clusters based on Azure Kubernetes Service (AKS) and configured with Azure CNI (advanced) networking use the Azure Virtual Network directly, without overlays, which makes it easier to reason about the effect of NSGs on traffic. NSGs can control traffic between subnets and to and from the internet, and they are accepted as firewalls under the PCI-DSS standard.
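To make the rule model concrete, here is a minimal Python sketch of how NSG security rules are evaluated: rules are processed in priority order (lowest number first) and the first matching rule decides. The rule names, subnets, and ports are hypothetical, and the sketch deliberately ignores direction, protocols, and service tags.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int          # lower number is evaluated first
    source: str            # source address prefix in CIDR notation
    port: Optional[int]    # destination port, None matches any port
    access: str            # "Allow" or "Deny"


# Hypothetical inbound rules for an AKS node subnet.
RULES = [
    Rule("deny-everything-else", 4096, "0.0.0.0/0", None, "Deny"),
    Rule("allow-https-from-ingress-subnet", 100, "10.0.1.0/24", 443, "Allow"),
]


def evaluate(rules, source_ip, port):
    """Return the access decision of the first matching rule by priority."""
    for rule in sorted(rules, key=lambda r: r.priority):
        in_prefix = ip_address(source_ip) in ip_network(rule.source)
        port_matches = rule.port is None or rule.port == port
        if in_prefix and port_matches:
            return rule.access
    return "Deny"  # assumption of this sketch: no match means deny


print(evaluate(RULES, "10.0.1.7", 443))  # traffic from the ingress subnet
print(evaluate(RULES, "10.0.2.7", 443))  # traffic from another subnet
```

Because the subnets map directly to the VNet, a table like `RULES` is usually enough to reason about which pods and nodes can be reached from where.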
Distributed Denial of Service (DDoS) protection
The big cloud providers often offer DDoS protection as a service, e.g., the Azure DDoS Protection service. It is enabled on the Azure VNet and then automatically protects all public IPs in that network.
Traefik Ingress Controller
Next up are ingress controllers, e.g., the Traefik Ingress Controller. Traefik supports middleware that can add security features to your ingress, e.g., rate limiting and IP whitelisting. Together with NSGs, middleware gives us fine-grained control over the ingress traffic.
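As an illustration, the following Python sketch builds two Traefik `Middleware` manifests, one for rate limiting and one for an IP allow list, and prints them as JSON, which `kubectl apply -f -` accepts. It assumes a recent Traefik release where the CRDs live under `traefik.io/v1alpha1` and the allow-list middleware is called `ipAllowList` (older releases use `ipWhiteList`). The names, namespace, limits, and CIDR range are hypothetical.

```python
import json


def rate_limit_middleware(name, namespace, average, burst):
    """Build a Middleware that limits the request rate per source."""
    return {
        "apiVersion": "traefik.io/v1alpha1",  # assumption: recent Traefik CRDs
        "kind": "Middleware",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"rateLimit": {"average": average, "burst": burst}},
    }


def ip_allow_list_middleware(name, namespace, source_ranges):
    """Build a Middleware that only admits the given source CIDR ranges."""
    return {
        "apiVersion": "traefik.io/v1alpha1",
        "kind": "Middleware",
        "metadata": {"name": name, "namespace": namespace},
        # assumption: older Traefik releases name this field ipWhiteList
        "spec": {"ipAllowList": {"sourceRange": list(source_ranges)}},
    }


manifests = [
    rate_limit_middleware("team-a-rate-limit", "team-a", average=100, burst=50),
    ip_allow_list_middleware("team-a-office-only", "team-a", ["203.0.113.0/24"]),
]
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

A platform team can attach such middleware to every team's ingress routes so that developers get the protection without configuring it themselves.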
Network Policies
Inside the cluster, you can use Network Policies to control traffic to and from pods. The policies are enforced at the network layer by a network plugin such as Calico or Cilium. This means that traffic is filtered before it reaches the pod, so the pod never sees traffic that the policy does not allow. Network Policies, together with NSGs and ingress controllers, in effect let us create multiple software-defined firewalls that control traffic into and out of the cluster as well as between pods.
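A common pattern is to deny all ingress traffic in a namespace by default and then open specific paths. The Python sketch below builds both `NetworkPolicy` manifests as JSON. The namespace, labels, and port are hypothetical, and a network plugin that enforces policies is assumed to be installed.

```python
import json


def default_deny_ingress(namespace):
    """Select every pod in the namespace and allow no ingress traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace, name, target_labels, source_labels, port):
    """Allow TCP traffic on one port from pods with source_labels."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": source_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policies = [
    default_deny_ingress("team-a"),
    allow_ingress("team-a", "frontend-to-api", {"app": "api"}, {"app": "frontend"}, 8080),
]
for policy in policies:
    print(json.dumps(policy, indent=2))
```

With these two policies applied, only pods labeled `app: frontend` in the same namespace can reach the `api` pods, and only on port 8080.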
Infrastructure as Code (IaC)
IaC is not a security feature in the same direct sense as, for example, NSGs. However, provisioning infrastructure using IaC provides quality assurance, repeatability, and auditability for all changes, which reduces the chance of misconfiguration.
Linkerd for automatic mTLS
Linkerd can be used to implement automatic mTLS in our cluster. This gives us a secure communication channel between all pods in the cluster and makes it easy to enforce that all traffic between pods is encrypted. Linkerd also gives you more observability into your cluster: you can see which pods are talking to each other and how much traffic they send. This helps identify malicious traffic in clusters by detecting deviations from the normal traffic load.
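One way to opt workloads into the mesh is the `linkerd.io/inject: enabled` annotation, which tells Linkerd's proxy injector to add its sidecar proxy to newly created pods. The Python sketch below builds a namespace manifest carrying that annotation; the namespace name is hypothetical, and Linkerd is assumed to be installed in the cluster already.

```python
import json


def meshed_namespace(name):
    """Build a Namespace whose new pods get the Linkerd proxy injected."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "annotations": {"linkerd.io/inject": "enabled"},
        },
    }


print(json.dumps(meshed_namespace("team-a"), indent=2))
```

Pods that were already running before the annotation was set need to be restarted to pick up the proxy.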
GitOps
GitOps enhances security by managing system configuration as code (CaC) and providing a full audit trail of all system changes. Teams can have a Git repository containing the YAML files that describe their system and all its applications, e.g., Kubernetes (K8s) Deployments. You can use GitOps to apply the teams' YAML files from their repo into their specific K8s namespace. Changes made directly through the K8s API server instead of through the repo can be disabled for the teams, which ensures proper review and approval of all changes and reduces the risk of misconfigurations or potential security breaches.
Role-Based Access Control
K8s has a built-in role-based access control (RBAC) capability, giving developers and components running inside K8s fine-grained access control. Teams typically get access solely through group memberships, e.g., Azure AD groups.
The GitOps component applying the team-specific YAML runs as a K8s service account. That service account only has access to certain operations in our clusters. This is not to limit the team's workflow or what they can do but to ensure that if a team member adds a malicious YAML file to their repo, it will not be able to do any harm to the cluster.
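A minimal sketch of such a restricted service account, written in Python and emitted as JSON manifests, could look like the following. The role name, the service account name, the namespaces, and the exact list of allowed resources are hypothetical; the point is that the `Role` is namespaced and grants nothing cluster-wide.

```python
import json

WRITE_VERBS = ["get", "list", "watch", "create", "update", "patch", "delete"]


def gitops_role(namespace):
    """Role that only covers typical workload resources in one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "gitops-applier", "namespace": namespace},
        "rules": [
            {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": WRITE_VERBS},
            {"apiGroups": [""], "resources": ["services", "configmaps"], "verbs": WRITE_VERBS},
        ],
    }


def gitops_role_binding(namespace, service_account, sa_namespace):
    """Bind the Role to the service account the GitOps component runs as."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "gitops-applier", "namespace": namespace},
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": sa_namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": "gitops-applier",
        },
    }


print(json.dumps(gitops_role("team-a"), indent=2))
print(json.dumps(gitops_role_binding("team-a", "team-a-applier", "gitops-system"), indent=2))
```

A YAML file in the team's repo that tries to create, say, a `ClusterRoleBinding` would be rejected by the API server, because the service account has no permission for that resource.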
Ideally, the teams should get all necessary information through the Grafana dashboards and Prometheus alarms.
Azure Privileged Identity Management (PIM) is a service that enables management, control, and monitoring of access to important resources in your organization. This includes access to Azure AD roles and Azure resources. PIM provides features like just-in-time privileged access, assignment of time-bound access, approval workflows, and access reviews. These features help organizations reduce risks associated with users who have privileged access and provide better visibility into this access. Combining PIM with K8s RBAC allows us to give developers limited access to production systems for a limited amount of time.
Policy as code for workload configuration
There are mainly two options for implementing policy as code with K8s: Kyverno and Open Policy Agent (OPA).
Kyverno is Kubernetes-native and uses the same declarative approach as Kubernetes, making it a bit easier to learn and use for those already familiar with Kubernetes manifests. It allows for policy enforcement at the admission control stage and can automatically fix configurations to comply with policies.
On the other hand, OPA is a more generic policy engine that can be used across a wide range of tech stacks, not just Kubernetes. It uses a high-level declarative language called Rego for policy creation, which can be more flexible but also has a steeper learning curve. OPA, typically deployed in Kubernetes through Gatekeeper, also supports admission control. Automatic fixes for non-compliant configurations are not part of its Rego policies; in Gatekeeper, mutation is configured through separate resources.
I have been using Kyverno.
At a high level, Kyverno works by inspecting all YAML coming into, and currently inside, a cluster and deciding whether or not it complies with our policies. With Kyverno policies, we can specify what should happen with YAML that is not compliant. Typically, we start by auditing it and making sure the developer teams know how to fix it. As we move closer to production, the rules are enforced, effectively stopping all non-compliant YAML from coming into and running in our cluster.
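The audit-then-enforce workflow can be sketched as a single policy whose failure action is switched over time. The Python snippet below builds a Kyverno `ClusterPolicy` as JSON, assuming the `kyverno.io/v1` schema with a policy-level `validationFailureAction` (newer Kyverno releases may expect the failure action on the rule instead, so check the version you run). The policy name and the required `team` label are hypothetical.

```python
import json


def require_team_label_policy(action):
    """Build a ClusterPolicy that requires a non-empty 'team' label on pods."""
    if action not in ("Audit", "Enforce"):
        raise ValueError("action must be 'Audit' or 'Enforce'")
    return {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        "metadata": {"name": "require-team-label"},
        "spec": {
            # Audit only reports violations, Enforce blocks the request.
            "validationFailureAction": action,
            "rules": [
                {
                    "name": "check-team-label",
                    "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                    "validate": {
                        "message": "The label 'team' is required.",
                        # "?*" matches any value with at least one character
                        "pattern": {"metadata": {"labels": {"team": "?*"}}},
                    },
                }
            ],
        },
    }


# Start in audit mode; switch to "Enforce" once teams have fixed their YAML.
print(json.dumps(require_team_label_policy("Audit"), indent=2))
```

Keeping the action as a single parameter makes the move from auditing to enforcement a one-line, reviewable change in the platform repo.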
Even though K8s will enforce policies at the admission level, sometimes you want to check them during CI/CD or as part of the inner developer loop. Kyverno comes with a CLI that you can use to write tests against your YAML and run them locally and in GitHub Actions. For a demo and examples, please see the kyverno-demo repo.
Pod Security Standard
In Kubernetes, the Pod Security Standard is a comprehensive set of standards designed to mitigate security risks in pod configuration by enforcing a baseline set of restrictions. These standards focus on reducing potential attack points by imposing restrictions on pod behavior.
The standards are divided into three distinct levels, each representing a different degree of security and compatibility with Kubernetes configurations.
- The Privileged level essentially offers unrestricted access, making it compatible with existing workloads in default Kubernetes settings. Pods operating at this level have broad permissions, enabling them to operate without any predefined restrictions.
- The Baseline level introduces a degree of restriction over the pod configurations. This level prevents known privilege escalations that could be exploited, yet still permits some privileged features. This level is most suitable for workloads that require a decent level of security without limiting necessary functionalities.
- The Restricted level imposes heavy restrictions on pod behavior. This level is recommended for highly secure environments where maximum security is required. Most of the security-sensitive features and configurations are prohibited at this level, limiting the potential for security breaches.
I typically use the Pod Security Standard - restricted. Kyverno has multiple tools helping you view the policy reports, e.g., https://kyverno.io/docs/policy-reports/ and https://kyverno.github.io/policy-reporter/.
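To give a feel for what the restricted level demands, here is a small Python sketch that checks a pod spec against four of its requirements: no privilege escalation, running as non-root, dropping all capabilities, and a seccomp profile. It is a deliberately partial illustration, not a replacement for Kyverno's Pod Security policies, and the function name and sample pod are hypothetical.

```python
def restricted_violations(pod_spec):
    """Return violations of a subset of the restricted-level requirements."""
    violations = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})

        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")

        # runAsNonRoot may be set on the container or inherited from the pod.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot must be true")

        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            violations.append(f"{name}: capabilities.drop must include ALL")

        seccomp = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return violations


compliant_pod = {
    "securityContext": {"runAsNonRoot": True, "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [
        {
            "name": "app",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "capabilities": {"drop": ["ALL"]},
            },
        }
    ],
}

print(restricted_violations(compliant_pod))
print(restricted_violations({"containers": [{"name": "legacy"}]}))
```

A pod without any `securityContext`, like the `legacy` example, fails every check, which is why moving existing workloads to the restricted level usually requires changes from the teams.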
Trivy for workload scanning
Trivy is a comprehensive vulnerability scanner for containers and other digital artifacts. It enhances workload security both in your CI/CD pipelines and within your clusters by scanning your images for potential vulnerabilities. Please see the demo repo I created for more information. Trivy can also be used in securing Infrastructure as Code CI/CD pipelines, which I outlined in this blog post: https://fredrkl.com/blog/testing-strategies-in-terraform/.
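In a pipeline, a scan result is typically turned into a pass or fail decision. The Python sketch below counts findings by severity in a Trivy JSON report and decides whether the build should fail. It assumes the report shape produced by `trivy image --format json`, with a top-level `Results` list whose entries may carry a `Vulnerabilities` list; the sample report, including its vulnerability IDs and package names, is made up for illustration.

```python
from collections import Counter


def count_by_severity(report):
    """Count vulnerabilities per severity across all scanned targets."""
    counts = Counter()
    for result in report.get("Results") or []:
        # Targets without findings may omit the key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_fail(report, blocking=("HIGH", "CRITICAL")):
    """Fail the build if any finding has a blocking severity."""
    counts = count_by_severity(report)
    return any(counts[severity] > 0 for severity in blocking)


# Made-up report fragment, not real scan output.
sample_report = {
    "Results": [
        {
            "Target": "example-image (hypothetical)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "PkgName": "libexample", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "PkgName": "libother", "Severity": "LOW"},
            ],
        },
        {"Target": "clean-layer (hypothetical)", "Vulnerabilities": None},
    ]
}

print(dict(count_by_severity(sample_report)))
print(should_fail(sample_report))
```

Trivy can also make this decision itself through its `--severity` and `--exit-code` flags; parsing the report is mainly useful when you want custom thresholds or reporting.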
Automatic certificate management
Cert-manager is a Kubernetes add-on that automates the issuance and management of TLS certificates from various issuing sources, for example through the ACME protocol. It periodically checks that certificates are valid and up to date and attempts to renew them at an appropriate time before expiry. Because renewal is automated, the certificates can be short-lived, which increases security.
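As a sketch of what requesting a short-lived certificate can look like, the Python snippet below builds a cert-manager `Certificate` manifest as JSON. It assumes the `cert-manager.io/v1` API and an existing `ClusterIssuer`; the names, DNS name, and lifetimes are hypothetical, and some issuers, such as public ACME CAs, may ignore the requested `duration`.

```python
import json


def short_lived_certificate(name, namespace, dns_names, issuer):
    """Build a Certificate that is valid for a day and renewed well before expiry."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "secretName": f"{name}-tls",   # Secret that receives the key pair
            "dnsNames": list(dns_names),
            "duration": "24h",             # requested certificate lifetime
            "renewBefore": "8h",           # renew this long before expiry
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }


cert = short_lived_certificate(
    "api", "team-a", ["api.example.com"], issuer="internal-ca"
)
print(json.dumps(cert, indent=2))
```

The shorter the lifetime, the smaller the window in which a leaked private key is useful, which is only practical because renewal needs no manual work.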
Falco for runtime security
Falco is an open-source intrusion and anomaly detection project for cloud native platforms like Kubernetes. It detects abnormal application behavior and alerts on threats at runtime. Falco captures system calls from the Linux kernel, through a kernel module or an eBPF probe, which gives it deep insight into what workloads are actually doing. This enables it to detect unexpected application behavior and flag potential security threats, effectively enhancing the security of Kubernetes environments.
Azure Defender for Cloud
Azure Defender for Cloud is less about mitigating active security threats and more about two other important capabilities:
- Threat detection across hybrid workloads
- Security posture management
Azure Defender for Cloud uses machine learning and behavioral analytics to identify and respond to abnormal activities, detect threats, and reduce false positives.
Conclusion
In conclusion, securing an internal developer platform involves strategically implementing multiple security measures. From network security groups, ingress controllers, and network policies to infrastructure as code, mTLS, and role-based access control, each layer of security plays a crucial role in mitigating potential threats. Utilizing policy as code for workload configuration, adhering to Pod Security Standards, vulnerability scanning, automatic certificate management, runtime security, and DDoS protection all contribute to a robust, secure system. It's important to remember that security isn't a one-off task but an ongoing process that requires regular review and updates to adapt to evolving threats. With the right tools and strategies in place, you can create a secure, efficient, and reliable developer platform.