
Lessons about DevOps from 3D Printing


It's no surprise that I'm passionate about DevOps. I think that has to do with my personality - my top five StrengthsFinder strengths are Strategic, Ideation, Learner, Activator, Achiever. I love the combination of people and tools that DevOps brings together. Being deeply technical and also fascinated by how people interact means I'm built for DevOps consulting. Add to that my love of learning, and I'm in my perfect job, since if there's one thing I've learned about DevOps - it's that I'll never learn everything there is to learn about it!


I recently got a 3D printer (an Ender3) for my birthday - I was actually looking for some Raspberry Pi projects to do with my kids when I saw a post about a guy who created some programmable LED Christmas lights. He'd printed himself a case for his Raspberry Pi and I wondered, "How much does a 3D printer go for nowadays?" I was pleasantly surprised to learn that you can get a pretty decent printer for around $200. So I got one for my birthday - and it's been a ton of fun learning how to model, then slice models, then tweak the printer to make it bend to my will. But watching an abstract model turn into physical reality before your eyes is fantastic!

While I was learning how to 3D print, I realized there are a lot of parallels between 3D printing and DevOps. Some may accuse me of "when you have a hammer, everything looks like a nail" and I suspect they're partly right. But that doesn't mean you can't learn about one discipline from studying another! Remember, Lean and Agile have roots in auto manufacturing!


So here are some thoughts about DevOps that I got from 3D Printing.

You Have To Just Jump In At Some Stage

I ordered my printer just before Christmas (my birthday is in January), so I had about 3 weeks between ordering and receiving it. I watched a ton of videos, read as much as I could, and learned a couple of modeling and slicing programs - but no matter how much I read, I had to just start printing! There's something fun (and sometimes frightening) about learning as you go, in both 3D printing and DevOps. Don't get into "analysis paralysis" - start somewhere and you'll be surprised how quickly you can move. Having a partner who can help you identify some "low-hanging fruit" will help you start faster - but don't be afraid to just start somewhere. And no matter how much theoretical knowledge you have, you're going to have to start implementing and then adjust along the way.

Understanding Fundamentals Improves Your Success

While it was frustrating to wait so long for my printer to arrive, I am ultimately grateful because I learned a lot of fundamentals. Before ordering the printer I had no idea what PLA filament was or what slicers were. But learning the fundamentals of how printers actually work has helped me troubleshoot and improved my success.

Learning about DevOps can help you on your journey - some teams implement automation and call it DevOps. This shows that they don't fully understand what DevOps is about - and understanding the higher-level goals and history of DevOps can help you on your journey. Don't just jump onto the buzz words - understand why they make a difference. Understanding fundamentals will help you improve your success.

Small Changes Can Have Radical Impact

Because Fused Deposition Modeling (FDM) printing is additive, the 1st layer is critical. This layer needs to bond correctly to the build bed, otherwise the print is doomed. Adjusting the bed to make it level and the correct height from the nozzle is a fiddly task - and small changes can make a huge difference.

In DevOps, sometimes small changes make a big difference. Be mindful of how you make changes in your teams, processes or tools. Making too many changes at once will prevent you from determining which changes are working and which are not. Also small changes let your team get some wins and build momentum.

Just When You Think Everything Is Perfect, Something Fails

One of the challenges with printing is extrusion - the amount of plastic that is fed into the hot-end as the printer works. Too little and you get holes and missing layers, too much and you get blobs and stringing. The printer firmware has a multiplier for the extruder - if you program it to extrude 100mm of filament, it should extrude 100mm of filament! However, the stepper motor isn't perfectly calibrated, so you have to tweak the multiplier to get the correct extrusion. I had gone through the process of setting the extruder multiplier and was happy with the prints I was getting. I wanted to install some upgrades and wanted to print a baseline print for comparison - but the baseline print was terrible! There was clear under-extrusion - which I wasn't expecting since I hadn't touched the extruder settings. Eventually I had to recalibrate the extruder multiplier again.

In DevOps, you never "arrive". DevOps is a journey, and things can sometimes just blow up when you least expect. Remember that DevOps is more than just tooling and automation - people are a critical component of DevOps. And people change, new people come in or leave - and these changes can affect your culture - and therefore your DevOps. Keeping your eyes open, ensuring that you're following the vision and making sure everyone is still with you is key to success.

Fast Feedback Is Critical


Some prints can take a while - the Yoda print I made for my son took just over 7 hours. I watched closely (especially in the beginning) to make sure the first couple layers worked correctly - fortunately I got a good couple layers early on and the print turned out great. I have also done some other prints where the first couple layers didn't bond to the build surface correctly, and I aborted before wasting time (and filament). Getting feedback quickly was critical - fortunately I got to see the layers as they ran, so I got immediate feedback.

Getting feedback quickly is one of the primary goals of DevOps - reducing cycle times ensures that you can iterate and adjust rapidly. You may have heard the expression "fail fast": it's better to get feedback after two weeks and adjust than to go off for three months building the wrong thing. Whatever you do, and however far along the DevOps journey you are, make sure that you get rapid feedback - both for the software you're building and for your DevOps processes (how you're building it) - so that you can adjust quickly and often.

You Can Use a Printer To Improve a Printer

It's almost a rite of passage - when you get your printer, your first dozen prints are upgrades for your printer! Every print you make for your printer improves your printer so that you keep getting better prints.

In DevOps, you can apply principles for good software delivery to the process itself. How about "evidence gathered in production"? There's no place like production, so that's where you want to get usage and performance metrics from. Similarly, you want to measure your team in their native habitat to see how they're doing. Reducing cycle times and getting feedback fast improves your software, but are you applying the same principles to your processes? Try something, evaluate, and abort quickly if it's not working.

At some stage, however, you need to print something other than parts for your printer. If all I ever printed were printer parts, the printer itself would be pointless - I bought it to print prototypes and art and whatever else, not just printer parts. Don't navel-gaze too much at your DevOps processes, and remember that the ultimate goal is to deliver value to your customers!

Sometimes You Have To Cut Your Losses And Try Again

I love board games - and we have a good collection of them. I wanted to print some inserts for sorting the cards and components to Dead of Winter - and the prints kept failing. I got a bit frustrated and stopped printing (for now) until I've got some confidence back.

In DevOps, you can try things and if they fail, you may have to cut your losses and start again. Perhaps you need to try again - perhaps you need to try something different. Don't give up because something didn't work like you expected it to. Regroup, recoup and try again!

Your Printer Is Unique

I'd seen a few experts recommend TL smoothers to improve prints. The stepper motors on a printer are driven by voltage differentials, and the TL smoother is a set of diodes that prevents a voltage "flutter" when the voltage dips towards 0 - which smooths the movement of the stepper motor. So I decided to get some. I got a baseline print, installed the smoothers and repeated the print. Absolutely no difference. To be fair, the smoothers only cost $12, so it's not a big deal. It turns out that my Ender3 probably has better components than some earlier models, so the smoothers weren't necessary, even though experts had recommended them.

With DevOps, you're far better off being pragmatic about how and when to apply changes in your processes, tools and people. Don't just implement blindly - understand how DevOps practices will affect your team and how best to implement them for your team. Just because another organization or team was successful with some practice doesn't mean you have to do it in the same way (or at all). Each team, environment and organization is different, so your DevOps could look different from that in other orgs. As Dory says, "Just keep swimming!"

Conclusion

There's a ton to learn about DevOps from 3D printing. As with anything, we're all on a journey. But sometimes we have to step back and remember how far we've come - and why we're on the journey in the first place! Hopefully this reflection is positive for you.

Happy printing!


Container DevOps: Beyond Build (Part 1)


I've written before that I think that containers - and Kubernetes (k8s) - are the way of the future. I was fortunate enough to attend my first KubeCon last year in Seattle, and I was happy to see that the uptake of k8s is strong and that the supporting cloud native technologies around it are robust and healthy. But navigating the myriad of services, libraries and techniques is a challenge! This is going to be the first in a series of posts about Container DevOps - and I don't just mean building images and deploying them. What about monitoring? And A/B testing? And all the other stuff that successful DevOps teams are supposed to be doing? We'll look at how you can implement some of these tools and techniques in this series.

PartsUnlimited 1.0

For the last couple of years I've had the opportunity to demo DevOps using Azure DevOps probably a few hundred times. I built a demo in Azure DevOps using a fork of Microsoft's PartsUnlimited repo. When I originally built the demo, the .NET Core tooling was a bit of a mess, so I just stuck with the full framework version. The demo targets Azure App Services and shows how you can make a change in code and submit a Pull Request, which triggers a build that compiles, runs static code analysis and unit tests. That in turn triggers a release, which deploys the new version of the app to a staging slot in the Azure web app and routes a small percentage of traffic to the slot for canary testing. All the while, metrics are being collected in Application Insights. After the canary deployment, you can analyze the metrics and decide if the canary is successful or not - and then either promote it to the rest of the site or, in the case of a failure, shift all traffic back to the existing prod version.

But how do you do the same sort of thing with k8s? That's what I set out to discover. But before we get there, let's take a step back and consider what we should be investigating in the world of Container DevOps in the first place!

Components of Container DevOps

Here are some of the components of Container DevOps that I think need to be considered:

  1. Building Quality - including multi-stage image builds, reportable unit testing and publishing to secure image repositories
  2. Environment isolation - isolating Dev, Test and Prod environments
  3. Canary Testing - testing changes on a small set or percentage of users before promoting to everyone (also called Testing in Production or Blue/Green testing)
  4. Monitoring - producing, consuming and analyzing metrics about your services
  5. Security - securing your services using TLS
  6. Resiliency - making services resilient through throttling or circuit-breakers
  7. Infrastructure and Configuration as Code

There are some more that I think should be on the list that I haven't yet gotten to exploring in detail - such as vulnerability scanning. Hopefully I get around to adding to the above list, but we'll start here for now.

Building Quality

I've previously blogged about how to run unit tests - and publish the test and code coverage results - in Azure DevOps pipelines. It was a good exercise, but as I look at it now I realize why I prefer to build code outside the container and copy the binaries in: it's hard to do ancillary work (like unit tests, code scans etc.) in a Dockerfile. One advantage of the multi-stage Dockerfile that you do lose is dependency management - you have to manage dependencies on the build machine (or container) if you're building the code outside the Dockerfile. But I think managing dependencies ends up being simpler than trying to run (and publish) tests, test coverage and static analysis inside the Dockerfile. My post covered how to do unit testing/code coverage, but when I thought about adding SonarQube analysis or vulnerability scanning with WhiteSource, I realized the Dockerfile starts becoming clumsy. I think it's easier to just drop the SonarQube and WhiteSource tasks into a pipeline and build on the build machine - and then have a Dockerfile copy the compiled binaries in to create the final lightweight container image.
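
To make that concrete, here's a rough sketch (not the actual pipeline from my demo) of an Azure Pipelines YAML build that does the testing and scanning on the build agent and only uses the Dockerfile for packaging - the registry name, image name and paths are illustrative:

trigger:
- master

pool:
  vmImage: ubuntu-latest

steps:
# build and test on the agent, where it's easy to publish results and add scanners
- script: dotnet build --configuration Release
  displayName: Build
- script: dotnet test --configuration Release --logger trx
  displayName: Unit tests
# SonarQube / WhiteSource tasks would slot in here as ordinary pipeline tasks
- script: dotnet publish --configuration Release --output $(Build.ArtifactStagingDirectory)/app
  displayName: Publish binaries
# the Dockerfile now only has to COPY the published output into a runtime base image
- script: docker build -t myregistry.azurecr.io/partsunlimited-website:$(Build.BuildNumber) .
  displayName: Build container image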

Environment Isolation

There are a couple of ways to do this: the most isolated (and expensive) is to spin up a k8s cluster per environment - but you can achieve good isolation using k8s namespaces. You could also use a combination: have a Prod cluster and a Dev/Test cluster that uses namespaces to separate environments within that cluster.
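
As a minimal sketch, namespace isolation is just a matter of creating a namespace per environment and then deploying each release into its own namespace (the names here are illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: dev
---
apiVersion: v1
kind: Namespace
metadata:
  name: test

Every resource you deploy then carries a namespace (or you target it with kubectl -n dev or helm --namespace dev), so the environments can't trample on each other while still sharing the same cluster.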

Canary Testing

This one took a while for me to wrap my head around. Initially I struggled with this concept because I was too married to the Azure Web App version, which works as follows:

  1. Traffic Manager routes 0% traffic to the "blue" slot
  2. Deploy the new build to the blue slot - it doesn't matter if this slot is down, since no traffic is incoming anyway
  3. Update Traffic Manager to divert some percentage of traffic to the blue slot
  4. Monitor metrics
  5. If successful, swap the "blue" slot with the production slot (an instantaneous action) and update Traffic Manager to route 100% of traffic to the new production slot

I started trying to achieve the same thing in k8s, but k8s doesn't have the notion of slots. An easy enough solution is to have a separate Deployment (and Service) for "blue" and "green" versions. But then there's no Traffic Manager - so I started investigating various Ingresses and Ingress Controllers to see if I could get the same sort of functionality. I initially got a POC running in Istio - but was still "swapping slots" in my mental model. Unfortunately, Istio, while very capable, is large and complicated - it felt like a sledge-hammer when all I needed was a screwdriver. I then tried linkerd - which was fine until I hit some limitations. Finally, I tried Traefik - and I found I could do everything I wanted to (and more) using Traefik. There's definitely a follow-on post here detailing this part of my Container DevOps journey - so stay tuned!

The other mental breakthrough came when I realized that Deployments (unlike slots in Azure Web Apps) are inherently highly available: that is, I can deploy a new version of a Deployment and k8s automatically does rolling updates to ensure the Deployment is never "down". So I didn't have to worry about diverting traffic away from the "blue" Deployment while I was busy updating it - and I can even keep the traffic split permanent. What that means is that I have two versions of the Deployment/Service: a "blue" one and a "green" one. I set up traffic splitting using Traefik and route say 20% of traffic to the "blue" service. When rolling out a new version, I simply update the tag of the image in the Deployment and k8s automatically does a rolling update for me - the blue slot never goes down and it's "suddenly" on the new version. Then I can monitor metrics, and if successful, I can update the "green" Deployment. Now both Blue and Green Deployments are on the new version, so while the Blue Deployment is getting 20% of the traffic, the versions are the same, so 100% of the traffic is now on the new version - and I had zero downtime! Of course I can simply revert the Blue Deployment back to the old version to get the same effect if the experiment is NOT a success.
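
To make the blue/green idea a little more concrete, here's a sketch of what the "blue" Deployment might look like - the names and image are illustrative (they match the ones I use later in this series), and rolling the canary forward is just a matter of bumping the image tag:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: partsunlimited-website-blue
  labels:
    app: partsunlimited
    canary: blue
spec:
  replicas: 1
  selector:
    matchLabels:
      app: partsunlimited
      canary: blue
  template:
    metadata:
      labels:
        app: partsunlimited
        canary: blue
    spec:
      containers:
      - name: website
        # bumping this tag triggers a rolling update of the blue Deployment only
        image: myregistry.azurecr.io/partsunlimited-website:1.0.0.1
        ports:
        - containerPort: 80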

One more snippet to whet your appetite for future posts on this topic: Traefik also handles TLS certs via LetsEncrypt, circuit-breaking and more.

Monitoring

I am a huge fan of Application Insights - and there's no reason not to continue logging to AppInsights from within your containers - assuming your containers can reach out to Azure. However, I wanted to see how I could do monitoring completely "in-cluster" and so I turned to Prometheus and Grafana. This is also a subject for another blog post, but I managed to (without too much hassle) get the following to work:

  1. Export Prometheus metrics from .NET Core containers
  2. Create a Prometheus Operator in my k8s cluster to easily (and declaratively) add Prometheus scraping to new deployments (see the ServiceMonitor sketch after this list)
  3. Create custom dashboards in Grafana
  4. Export the dashboards to code so that I can deploy them from scratch for new clusters
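
For item 2, the declarative piece is a ServiceMonitor resource. This is just a sketch - the label that the Operator watches for and the metrics path are assumptions based on my setup:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: partsunlimited-website
  labels:
    release: prometheus       # must match the Operator's serviceMonitorSelector (assumed here)
spec:
  selector:
    matchLabels:
      app: partsunlimited-website
  endpoints:
  - port: http                # the named port on the Service
    path: /site/metrics       # the endpoint exposed by UseMetricServer (assumed path)
    interval: 30s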

I didn't explore AlertManager - but this would be essential for a complete monitoring solution. However, the building blocks are in place. I also found that "business telemetry" is difficult in Prometheus. By business telemetry I mean telemetry that has nothing to do with performance - things like how many products from category A were sold in a particular date range? AppInsights made "business telemetry" a breeze - the closest I could get in Prometheus was some proxy telemetry that gave me some idea of what was happening from a "business" perspective. Admittedly, Prometheus is a performance metric framework, so I wasn't too surprised.

Security

There's a whole lot to security that I didn't fully explore - especially in-cluster Role Based Access Control (RBAC). What I did explore was how to secure your services using certificates - and how to do that declaratively and easily as you roll out a publicly accessible service. Again Traefik made this simple - I'll detail how I did it in my Traefik blogpost. As a corollary, I did also isolate "internal" from "external" services - the internal services are not accessible from outside the cluster at all - the simplest way to secure a service!
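
As a minimal illustration of that last point (not necessarily the exact setup I use later in this series), a plain ClusterIP Service with no Ingress pointing at it is simply unreachable from outside the cluster - the service name here is hypothetical:

apiVersion: v1
kind: Service
metadata:
  name: partsunlimited-orders-internal
spec:
  type: ClusterIP      # only resolvable and routable from inside the cluster
  selector:
    app: partsunlimited-orders
  ports:
  - port: 80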

Resiliency

I've already mentioned how using Deployments with rolling updates gives me zero-downtime deployments "for free". But what about throttling services that are being overwhelmed with connections? Or circuit-breakers? Again Traefik came to the rescue - and again, details are coming up in a post dedicated to how I configured Traefik.

Infrastructure as Code

It (almost) goes without saying that all of these capabilities should be doable from scripts and templates - and that there shouldn't be any manual steps. I did that from the first - using Azure CLI scripts to spin up my Azure Kubernetes Service (AKS) clusters, and configure Public IPs and Load Balancers. I used some bash scripts for doing kubectl commands and finally used Helm for packaging my applications so that deployment is a breeze - including creating Ingresses, ServiceMonitors, Deployments, Secrets and all the pieces you need to run your services.

Conclusion

I know I haven't shown very much yet - but I've gotten all of the pieces in place and will be blogging about how I did it - and sharing the code too! The point is that Container DevOps is more than just building images - there is far more involved in doing mature DevOps, and it's possible to achieve all of these mature practices using k8s. Traefik is definitely the star of the show! For now, I've hopefully prodded you into thinking about how to do some of these practices yourself.

Happy deploying!

Container DevOps Beyond Build: Part 2 - Traefik



In Part 1 of this series, I outlined some of my goals and some of the thinking around what I think Container DevOps is - it's far more than just being able to build and run a container or two. Beyond just automating builds, you have to think about how you're going to release. Zero downtime and canary testing, resiliency and monitoring are all table stakes - but while I understand how to do that using Azure Web Apps, I hadn't done a lot of these for containerized applications. After working for a couple of months on a Kubernetes (k8s) version of .NET Core PartsUnlimited, I have some thoughts to share on how I managed to put these practices into place.

When thinking about running containers in production, you have to think about the end to end journey, starting at building images right through deployment and into monitoring and tracing. I'm a firm believer in building quality into the pipeline early, so automated builds should unit test (with code coverage), do static analysis and finally package applications. In "traditional web" builds, the packaging usually means a zip or WebDeploy package or NPM package or even just a drop of files. When building container images, you're inevitably using a Dockerfile - which makes compiling and packaging simple, but leaves a lot to be desired when you want to test code or do static analysis, package scanning and other quality controls. I've already blogged about how I was able to add unit testing and code coverage to a multi-stage Dockerfile - I just got SonarQube working too, so that's another post in the works.

Works In My Orchestrator ™

However, assume that we have an image in our container registry that we want to deploy. You've probably run that image on your local machine to make sure it at least starts up and exposes the right ports, so it works on your machine. Bonus points if you ran it in a development Kubernetes cluster! But now how do you deploy this new container to a production cluster? If you just use the k8s Deployment rolling update strategy, you'll get zero downtime for free, since k8s brings up the new containers and replaces the existing ones only when the new ones are ready (assuming you have good liveness and readiness probes defined). But how do you test the new version for only a small percentage of users? Or secure traffic to that service? Or if you're deploying multiple services (microservices anyone?) how do you monitor traffic flow in the service mesh? Or cut out "bad" services so that they don't crash your entire system?

With these questions in mind, I started to investigate how one does these sorts of things with deployments to k8s. The rest of this post is about my experiences.

Ops Criteria

Here's the list of criteria I had in mind to cover - and I'll evaluate three tools using these criteria:

  1. Internal and External Routing - I want to be able to define how "external" traffic (traffic originating outside the cluster) and "internal" traffic (traffic originating and terminating within the cluster) are routed between services.
  2. Secure Communication - I want communication to endpoints to be secure - especially external traffic.
  3. Traffic Shifting - I want to be able to shift traffic between services - especially for canary testing.
  4. Resiliency - I want to be able to throttle connections or implement circuit breaking to keep my app as a whole resilient.
  5. Tracing - I want to be able to see what's going on across my entire application.

I explored three tools: Istio, Linkerd and Traefik. I'll evaluate each tool against the five criteria above.

Spoiler: Traefik won the race!

Disclaimer: some of these tools do more than these five things, so this isn't a holistic showdown between these tools - it's a showdown over these five criteria only. Also, Traefik is essentially a reverse proxy on steroids, while Istio and Linkerd are service meshes - so you may need some functionality of a service mesh that Traefik can't provide.

Internal and External Routing

All three tools are capable of routing traffic. Istio and Linkerd both inject sidecar proxies to your containers. I like this approach since you can abstract away the communication/traffic/monitoring from your application code. This seemed to be promising, and while I was able to get some of what I wanted using both Istio and Linkerd, both had some challenges. Firstly, Istio is huge, rich and complicated. It has a lot of Custom Resource Definitions (CRDs) - more than k8s itself in fact! So while it worked for routing like I wanted, it seemed very heavy. Linkerd worked for external routing, but due to limitations in the current implementation, I couldn't get it working to route internal traffic.

Let's say you have a website and make a code change - you want to test that in production - but only to a small percentage of users. With Azure App Services, you can use Traffic Manager and deployment slots for this kind of canary testing. Let's say you get the "external" routing working - most clients connecting to your website get the original version, while a small percentage get the new version. This is what I mean by "external" traffic. But what if you have a microservice architecture and your website code is calling internal services which call other internal services? Surely you want to be able to do the same sort of traffic shifting - that's "internal" routing - routing traffic internally within the cluster. Linkerd couldn't do that for me - mostly due to incompatibility between the control plane and the sidecars, I think.

Traefik did this easily via Ingress Controllers (abstractions native to k8s). I set up two controllers - one to handle "external" traffic and one to handle "internal" traffic - and it worked beautifully. More on this later.

Secure Communication

I didn't explore this topic too deeply with either Istio or Linkerd, but Traefik made securing external endpoints with certificates via LetsEncrypt really easy. I tried to get secure communication for my internal services, but I was trying with a self-signed cert and I think that's what prevented it from working. I'm sure that you could just as easily add this capability into internal traffic using Traefik if you really needed to. We'll see this later too - but using a static IP and DNS on an Azure Load Balancer, I was able to get secure external endpoints with very little fuss!

Traffic Shifting

If you've got routing, then it follows that you should be able to shift traffic to different services (or more likely, different versions of the same service). I got this working in Istio (see my Github repo and markdown on how I did this here), while Linkerd only worked for external traffic. With Istio you shift traffic by defining a VirtualService - an Istio CRD that's a love-child between a Service and an Ingress. With Linkerd, traffic rules are specified using dtabs - it's a cool idea (abstracting routes) but the implementation was horrible to work with - trying to learn the obscure format and debug it was not great.

By far the biggest problem with both Istio and Linkerd is that their network routing doesn't understand readiness or liveness probes, since they work via their sidecar containers. This becomes a problem when you're deploying a new version of a service using a rolling upgrade - as soon as the service is created, Istio or Linkerd start sending traffic to the endpoint, irrespective of the readiness of that deployment. You can probably work around this, but I found that I didn't have to if I used Traefik.

Traefik lets you declaratively specify weight rules to shift traffic using simple annotations on a standard Ingress resource. It's clean and intuitive when you see it. The traffic shifting also obeys readiness/liveness probes, so you don't start getting traffic routed to services/deployments that are not yet ready. Very cool!

Resiliency

First, there are a lot of things to discuss in terms of resiliency - for this post I'm just looking at features like zero-downtime deployment, circuit breaking and request limiting. Istio and Linkerd both have control planes for defining circuit breakers and request limiting - Traefik lets you define these as annotations. Again, this comparison is a little apples-to-oranges since Traefik is "just" a reverse proxy, while Istio and Linkerd are service meshes. However, declaring these features in Traefik is so simple that it's compelling. Also, since Traefik builds "over" rolling updates in Deployments, you get zero-downtime deployment for free. If you're using Istio, you have to be careful about your deployments since you can get traffic to services that are not yet ready.
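
The zero-downtime piece is really just standard Deployment behavior, provided the rollout strategy and probes are sensible. As a sketch (paths and timings are illustrative), this fragment of a Deployment spec shows what I mean by "good liveness and readiness probes" plus a rolling update that never takes a ready pod away before its replacement is up:

# fragment of a Deployment spec
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # never drop below the desired replica count during a rollout
      maxSurge: 1            # bring up at most one extra pod at a time
  template:
    spec:
      containers:
      - name: website
        image: myregistry.azurecr.io/partsunlimited-website:1.0.0.1
        readinessProbe:      # traffic (including Traefik's) only flows once this passes
          httpGet:
            path: /site/ready
            port: 80
          periodSeconds: 5
        livenessProbe:       # k8s restarts the container if this starts failing
          httpGet:
            path: /site/healthz
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 15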

Tracing

Traefik offloads monitoring to Prometheus and the helm chart has hooks into DataDog, Zipkin or Jaeger for tracing. For my experiments, I deployed Prometheus and Grafana for tracing and monitoring. Both Istio and Linkerd have control planes that include tracing - including mesh visualization - which can be really useful for tracing microservices since you can see traffic flow within the mesh. With Traefik, you need additional tools.

Configuring Traefik Controllers

So now you know some of the reasons that I like Traefik - but how do you actually deploy it? There are a couple components to Traefik: the Ingress Controller (think of this as a proxy) and then ingresses themselves - these can be defined at the application level and specify how the controller should direct traffic to the services within the cluster. There's another component (conceptually) and that is the ingress class: you can have multiple Traefik ingress controllers, and if you do, you need to specify a class for each controller. When you create an ingress, you also annotate that ingress to specify which controller should handle its traffic - you're essentially carving the ingress space into multiple partitions, each handled by a different controller.

For the controller, there are some other "under the hood" components such as secrets, config maps, deployments and services - but all of that can be easily deployed and managed via the Traefik Helm chart. You can quite easily deploy Traefik with a lot of default settings using --set from the command line, but I found it started getting unwieldy. I therefore downloaded the default values.yml file and customized some of the values. When deploying Traefik, I simply pass in my customized values.yml file to specify my settings.

For my experiments I wanted two controllers: an "external" controller that was accessible from the outside world and handled SSL, and an "internal" controller that was not accessible outside of the cluster, which I could use for internal routing. I use Azure Kubernetes Service (AKS), so the code for this series assumes that.

Let's take a look at the values file of the "internal" controller:

image: traefik
imageTag: 1.7.7
serviceType: NodePort

kubernetes:
  ingressClass: "dev-traefik-internal"
ssl:
  enabled: false

acme:
  enabled: false

dashboard:
  enabled: true
  domain: traefik-internal-ui.aks
rbac:
  enabled: true
metrics:
  prometheus:
    enabled: true
    restrictAccess: false
  datadog:
    enabled: false
  statsd:
    enabled: false

Notes:

  • Lines 1-3: The image is "traefik" and we want the 1.7.7 version. Since this is just internal, we only need a NodePort service (I tried ClusterIP, but that didn't work).
  • Line 6: we want this ingress controller to watch and manage traffic for Ingresses that have this class as their annotation. This is how we have multiple Traefik controllers within a cluster. I prepend the class with the namespace (dev) too!
  • Lines 7,8: Since this is an internal ingress, we don't need SSL. I tried to get this working, but suspect I had issues with the certs. If you need internal SSL, this is where you'd set it.
  • Lines 10,11: This is for generating a cert via LetsEncrypt. Not needed for our internal traffic.
  • Lines 14,15: Enable the dashboard. I accessed via port-forwarding, so the domain isn't critical.
  • Lines 16,17: RBAC is enabled.
  • Lines 18-25: tracing options - I just enabled Prometheus.

Let's now compare the values file for the "external" controller:

image: traefik
imageTag: 1.7.7
serviceType: LoadBalancer
loadBalancerIP: "101.22.98.189"

kubernetes:
  ingressClass: "dev-traefik-external"
ssl:
  enabled: true
  enforced: true
  permanentRedirect: true
  upstream: false
  insecureSkipVerify: false
  generateTLS: false

acme:
  enabled: true
  email: myemail@somewhere.com
  onHostRule: true
  staging: false
  logging: false
  domains:
    enabled: true
    domainsList:
      - main: "mycoolaks.westus.cloudapp.azure.com"
  challengeType: tls-alpn-01
  persistence:
    enabled: true
    
dashboard:
  enabled: true
  domain: traefik-external-ui.aks
  
rbac:
  enabled: true
metrics:
  prometheus:
    enabled: true
    restrictAccess: false
  datadog:
    enabled: false
  statsd:
    enabled: false

tracing:
  enabled: false

Most of the file is the same, but here are the differences:

  • Line 4: We specify the static IP we want the LoadBalancer to use - I have code that pre-creates this static IP (with DNS name) in Azure before I execute this script.
  • Line 7: We specify a different class to divide the "ingress space".
  • Lines 8-14: These are the SSL settings - enable and enforce SSL (with a permanent redirect from HTTP) for this controller.
  • Lines 16-28: These are the LetsEncrypt (acme) settings, including the domain name, challenge type and persistence to store the cert settings.

Now that we have the controllers (internal and external) deployed, we can deploy "native" k8s services and ingresses (with the correct annotations) and everything Will Just Work ™.

Configuring Ingresses

Assuming you have the following service:

apiVersion: v1
kind: Service
metadata:
  name: dev-partsunlimitedwebsite
  namespace: dev
spec:
  type: NodePort
  selector:
    app: partsunlimited-website
    function: web
  ports:
  - name: http
    port: 80

Then you can define the following ingress:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: dev-traefik-external
  labels:
    app: partsunlimited-website
    function: web
  name: dev-website-ingress
  namespace: dev
spec:
  rules:
  - host: mycoolaks.westus.cloudapp.azure.com
    http:
      paths:
      - backend:
          serviceName: dev-partsunlimitedwebsite
          servicePort: http
        path: /site

Notes:

  • Line 2: This resource is of type "Ingress"
  • Lines 4,5: We define the class - this ties this Ingress to the controller with this class - for our case, this is the "external" Traefik controller
  • Lines 12-18: We're specifying how the Ingress Controller (Traefik in this case) should route traffic. This is the simplest configuration - take requests coming to the host "mycoolaks.westus.cloudapp.azure.com" and route them to the "dev-partsunlimitedwebsite" service onto the "http" port (port 80 if you look at the service definition above).
  • Line 19: We can use the Traefik controller to front multiple services - using the path helps to route effectively.

When you access the service, you'll see the secure padlock in the browser window and be able to see details for the valid cert.


The best thing is I didn't have to generate the cert myself - Traefik did it all for me.

There's more that we can configure on the Ingress - including the traffic shifting for canary or A/B testing. We can also annotate the service to include circuit-breaking - but I'll save that for another post now that I've laid out the foundation for traffic management using Traefik.

Conclusion

Container DevOps requires thinking about how traffic is going to flow in your cluster - and while there are many tools for doing this, I like the combination of simplicity and power you get with Traefik. There's still a lot more to explore in Container DevOps - hopefully this post gives you some insight into my thoughts.

Happy container-ing!

Container DevOps: Beyond Build (Part 3) - Canary Testing



In my previous post I compared Istio, Linkerd and Traefik and motivated why I preferred Traefik for Container DevOps. I showed how I was able to spin up Traefik controllers - one for internal cluster traffic routing, one for external cluster in-bound traffic routing. With that foundation in place, I can easily implement canary testing - both for external endpoints as well as internal services.

Canary Testing

What is canary testing (sometimes referred to as A/B testing)? This is a technique of "testing in production" where you shift a small portion of traffic to a new version of a service to ensure it is stable, or that it is meeting some sort of business requirement, in the case of hypothesis-driven development. This is an important technique because no matter how good your test and staging environments are, there's no place like production. Sure, you can test performance in a test/stage environment, but you can only ever test user behavior in production! Being able to trickle a small amount of traffic to a new service limits exposure.

However, a lot of teams that do use canary testing tend to use it just for proving that a service is stable. I think that they're missing a trick - namely, telemetry and "proving hypotheses". Without good telemetry, you're never going to unlock the true potential of canary testing. Think of your canary as an experiment - and make sure you have a means to measure the success (or failure) of that experiment - otherwise you're just pointlessly mixing chemicals. I'll cover monitoring and telemetry in another post.

Traffic Shifting Using Label Selectors

You can do canary testing "natively" in Kubernetes (k8s) by using good label selectors. Imagine you have service Foo and it has label selectors "app=foo". Any pods that you deploy (typically via Deployments, DaemonSets or StatefulSets) that have the label "app=foo" get traffic routed to them when the service endpoint is targeted. Imagine you had a Deployment that spins up two replicas of a pod with labels "app=foo,version=1.0". Hitting the Service endpoint will cause k8s to route traffic between the two pods. Now you have a new version of the container image and you create a Deployment that spins up one pod with labels "app=foo,version=1.1". Now because all three pods match the Service label selector "app=foo" traffic is distributed between all three pods - you've effectively routed 33% of traffic to the new pod.
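
Here's what that looks like as manifests - a sketch with hypothetical names, where the Service selector matches pods from both Deployments and the split simply falls out of the replica counts:

apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  selector:
    app: foo                    # matches pods from BOTH Deployments below
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-v1
spec:
  replicas: 2
  selector:
    matchLabels: { app: foo, version: "1.0" }
  template:
    metadata:
      labels: { app: foo, version: "1.0" }
    spec:
      containers:
      - name: foo
        image: myregistry.azurecr.io/foo:1.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-v2
spec:
  replicas: 1
  selector:
    matchLabels: { app: foo, version: "1.1" }
  template:
    metadata:
      labels: { app: foo, version: "1.1" }
    spec:
      containers:
      - name: foo
        # 1 of the 3 matching pods, so roughly 33% of the traffic lands here
        image: myregistry.azurecr.io/foo:1.1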

So far so good. But here's where things get tricky: say you're monitoring the pods and decide that version 1.1 is good to go - how do you "promote" it to production fully? You could update the labels on the original pods and remove "app=foo" - they'll no longer match and so now all traffic is going to the third version 1.1 pod. But now you only have one pod, where originally you had two. So you'd have to also scale the Deployment of version 1.1 to ensure it gets as many replicas as the original service. And now you have a Deployment that's missing some labels - so you'd have to dig to find out what those pods are.

Alternatively, you could just add "version=1.1" to the Service label selectors. Again you'd have to scale the version 1.1 Deployment, but at least you don't get "dangling pods". But what about deploying version 1.2? Now you have to remove the "version=1.1" label from the Service selector, since a selector of "app=foo,version=1.1" will never match pods labeled "app=foo,version=1.2".

And how would you go about testing traffic shifting of just 2%? You'd need to deploy 49 replicas of version 1.1 and a single version 1.2 pod just to get that percentage (1 matching pod out of 50).

What it boils down to is that using label selectors imposes too much cognitive load - you spend too much time juggling labels - and the dial is too coarse: you can't easily test traffic percentages lower than, say, 20%. In contrast, if you use Traefik to do the traffic shifting, you get the added bonus of circuit breakers, SSL and other features too.

Traffic Shifting Using Traefik

Let's see how we'd do traffic shifting using Traefik. Let's suppose that I've already deployed a Traefik controller with ingressClass "traefik-external". To route traffic between two identical services (where the only difference between the services is the image version) I can create this ingress:

kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik-external
    traefik.ingress.kubernetes.io/service-weights: |
      partsunlimited-website-blue: 5%
      partsunlimited-website-green: 95%
  labels:
    app: partsunlimited-website
  name: partsunlimited-website
spec:
  rules:
  - host: cdk8spu-dev.westus.cloudapp.azure.com
    http:
      paths:
      - backend:
          serviceName: partsunlimited-website-blue
          servicePort: http
        path: /site
      - backend:
          serviceName: partsunlimited-website-green
          servicePort: http
        path: /site

Notes:

  • Line 1: the kind of resource is "Ingress" - nothing special about this, it's a native k8s Ingress resource
  • Line 4: this is where we specify which IngressController should do the heavy lifting for this particular Ingress
  • Lines 5-7: simple, intuitive and declarative - we want 5% of traffic to be routed to the "blue" Service
  • Line 13: when inbound traffic has host "cdk8spu-dev.westus.cloudapp.azure.com" (the DNS for the LoadBalancer), then we want the ingress to use the following rules to direct the traffic
  • Lines 16-23: we specify the backend Services and Ports that the Ingress should route to and can even specify custom paths to map different backends to different URL paths

The Services

This assumes that we have two services: partsunlimited-website-blue and partsunlimited-website-green. In my case these are exactly the same service - they will sometimes just have pods on different versions of the images I'm building. Let's look at the services:

apiVersion: v1
kind: Service
metadata:
  name: partsunlimited-website-blue
  labels:
    app: partsunlimited
    canary: blue
  annotations:
    traefik.backend.circuitbreaker: "NetworkErrorRatio() > 0.2"
spec:
  ...
---
apiVersion: v1
kind: Service
metadata:
  name: partsunlimited-website-green
  labels:
    app: partsunlimited
    canary: green
spec:
  ...

Notes:

  • Lines 5-7, 17-19: these are ordinary labels on the Services. There's the common "app" label and then a label for each canary "slot" that I have
  • Lines 8-9: since I am using Traefik, I can easily create a circuit-breaker using the annotation. In this case, we instruct the controller to cease to send traffic to the blue service if its network failure rate rises above 20%
  • The other lines are exactly what you would use for defining any k8s service

Helm

Now that I've shown you how to define the ingress and the services, I can discuss how I actually deployed my services. If you use "native" k8s yml manifests, it can become difficult to manage all your resources. Imagine you have several services, configmaps, secrets, ingresses, ingress controllers, persistent volumes - you'd need to manage each type of resource. Helm simplifies that task by "bundling" the related resources. That way "helm upgrade" gives you a single command to install or upgrade all the resources - and similarly, "helm status" and "helm delete" let you inspect or destroy the app and all its resources quickly. So I built a helm package for my application that included the Traefik plumbing.

Challenges with Helm

It's not all roses and unicorns though - helm has some disadvantages. Firstly, there's Tiller - the "server side" component of helm. To use helm, you need to install Tiller on your k8s cluster, and give it some pretty beefy permissions. Helm 3 is abandoning Tiller, so this should improve in the near future.

The other (more pertinent) challenge is the way helm performs upgrades. Let's have a look at a snippet of the values file that I have for my service - this file is used to override (or supply) values to an actual deployment:

canaries:
  - name: blue
    replicaCount: 1
    weight: 20
    tag: 1.0.0.0
    annotations:
      traefik.backend.circuitbreaker: "NetworkErrorRatio() > 0.2"
  - name: green
    replicaCount: 2
    weight: 80
    tag: 1.0.0.0
    annotations: {}

Notes:

  • Lines 4,10 - I define the weights for each canary. Helm injects these values into the Ingress resource (see the template sketch after these notes).
  • Lines 6,7 - I define annotations to apply to the service - in this case the Traefik circuit-breaker, but I could add others too
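
To show how those values end up in the resources, here's a sketch of the Ingress template in my chart (the template is simplified and the file name is illustrative) - the range loop renders one weight line per canary:

# templates/ingress.yaml (simplified)
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: partsunlimited-website
  annotations:
    kubernetes.io/ingress.class: traefik-external
    traefik.ingress.kubernetes.io/service-weights: |
      {{- range .Values.canaries }}
      partsunlimited-website-{{ .name }}: {{ .weight }}%
      {{- end }}
spec:
  ...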

Initially, I wanted to do a deployment with "version=1.0.0.0" for both canaries, and then just run "helm upgrade --set canaries[0].tag=1.0.0.1" to update the version of the blue canary. However, helm doesn't work this way, so I have to supply all the values for the chart rather than just the ones I want to update. In a pipeline, the version to deploy to the blue canary is the latest build number - but I have to calculate the green canary version number or it will be overwritten with "1.0.0.0" every time. It's not a big deal since I can work it out, but it would be nice if helm had a way to update a single value and leave all the other current values as-is.

In the end, the ease of managing the entire application (encompassing all the resources) using helm outweighed the minor paper-cuts. I still highly recommend helm to manage app deployment - even if they're simple apps!

Conclusion

Traffic shifting using Traefik is pretty easy - it's also intuitive since it's based on annotations and is specified over "native" k8s resources instead of having to rely on custom constructs or sidecars or other rule-language formats. This makes it an ideal tool for performing canary testing in k8s deployments.

Happy canary testing!

Container DevOps: Beyond Build (Part 4) - Telemetry with Prometheus



In my previous post in this series I wrote about how I used Traefik to do traffic shifting and canary testing. I asserted that without proper telemetry, canary testing is (almost) pointless. Without some way to determine the efficacy of a canary deployment, you may as well just deploy straight out and not pretend.

I've also written about how I love and use Application Insights to monitor .NET applications. Application Insights (or AppInsights for short) is still my go-to telemetry tool. And it's not only a .NET tool - there are SDKs for Java, Javascript and Python among others. But since we're delving into container-land, I wanted to at least explore one of the popular k8s tools: Prometheus. There are other monitoring tools (like Datadog) and I think it'll be worth doing a compare/contrast of various monitoring tools at some stage. But for this post, I'll stick to Prometheus.

Business vs Performance Telemetry

Most developers that are using any kind of telemetry understand "performance" telemetry - requests per second, reads/writes per second, errors per second, memory and CPU usage - the usual bread-and-butter telemetry. However, I often encourage teams not to stop at performance telemetry and to also start looking at how to instrument their applications with "business" telemetry. Business telemetry is telemetry that has nothing to do with the mechanics of the running application - and everything to do with how the site or application is doing in business terms. For example, how many products of a certain category were sold today? What products are popular in which geos? And so on.

AppInsights is one of my go-to tools because you get performance telemetry "for free" - just add it to your project and you get all of the perf telemetry you need to have a good view of your application performance - and that's without changing a single line of code! However, if you do want business telemetry, it only takes a few lines of code. Add to that the ability to connect PowerBI to your telemetry (something I've written about before) and you're able to produce the telemetry and have business users consume it in PowerBI - that's a recipe for success!

On the down-side, making sense of AppInsights telemetry definitely isn't simple, and the learning curve for analyzing your data is steep. The AppInsights query language is a delight though, and even has some built-in machine learning capabilities.

Prometheus

Prometheus has long been a popular telemetry solution - however, as I was exploring it I came across some challenges. Firstly, integrating into .NET isn't simple - and you don't get anything "for free" - you have to code in the telemetry. Secondly, there are only four types of metrics you can utilize: Counter, Gauge, Histogram and Summary. These are great for performance telemetry, but are very difficult to use for business telemetry. However, creating graphs from Prometheus data is really simple (at least using Grafana, as I'll discuss in a later post) and there's a whole query language called PromQL for querying Prometheus metrics.

In the remainder of this post I'll show how I used Prometheus in a .NET Core application.

Performance Telemetry

To add performance telemetry to a .NET Core application, you have to add some middleware. You also need to expose the Prometheus endpoint. Here's a snippet from my Startup.cs file:

using Prometheus;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        var basePath = Configuration["PathBase"] ?? "/";
        ...

        // prometheus
        var version = Assembly.GetEntryAssembly()
                .GetCustomAttribute<AssemblyFileVersionAttribute>().Version;
        app.UseMethodTracking(version, Configuration["ASPNETCORE_ENVIRONMENT"], Configuration["CANARY"]);
        app.UseMetricServer($"{basePath}/metrics");

Notes:

  • Line 1: Import the Prometheus namespace - this is from the Prometheus NuGet package
  • Line 7: We need to set a base path - this is for sharing Traefik frontends for multiple backend services
  • Lines 11,12: Get the version of the application
  • Line 13: Call the UseMethodTracking method (shown below) to configure middleware, passing in the version, environment and canary name
  • Line 14: Tell Prometheus to expose an endpoint for the Prometheus server to scrape

I want my metrics to be dimensioned by version, environment and canary. This is critical for successful canary testing! We also need a pathbase other than "/" since when we deploy services behind the Traefik router, we want to use path-based rules to route traffic to different backend services, even though there's only a single front-end base URL.

Here's the code for the UseMethodTracking method:

public static class PrometheusAppExtensions
{
    static readonly string[] labelNames = new[] { "version", "environment", "canary", "method", "statuscode", "controller", "action" };

    static readonly Counter counter = Metrics.CreateCounter("http_requests_received_total", "Counts requests to endpoints", new CounterConfiguration
    {
        LabelNames = labelNames
    });

    static readonly Gauge inProgressGauge = Metrics.CreateGauge("http_requests_in_progress", "Counts requests currently in progress", new GaugeConfiguration
    {
        LabelNames = labelNames
    });

    static readonly Histogram requestHisto = Metrics.CreateHistogram("http_request_duration_seconds", "Duration of requests to endpoints", new HistogramConfiguration
    {
        LabelNames = labelNames
    });

    public static void UseMethodTracking(this IApplicationBuilder app, string version, string environment, string canary)
    {
        app.Use(async (context, next) =>
        {
            // extract values for this event
            var routeData = context.GetRouteData();
            var action = routeData?.Values["Action"] as string ?? "";
            var controller = routeData?.Values["Controller"] as string ?? "";
            var labels = new string[] { version, environment, canary,
                context.Request.Method, context.Response.StatusCode.ToString(), controller, action };

            // start a timer for the histogram
            var stopWatch = Stopwatch.StartNew();
            using (inProgressGauge.WithLabels(labels).TrackInProgress()) // increments the inProgress, decrementing when disposed
            {
                try
                {
                    await next.Invoke();
                }
                finally
                {
                    // record the duration
                    stopWatch.Stop();
                    requestHisto.WithLabels(labels).Observe(stopWatch.Elapsed.TotalSeconds);

                    // increment the counter
                    counter.WithLabels(labels).Inc();
                }
            }
        });
    }
}

Notes:

  • Line 3: set up the names of the dimensions I want to configure - note version, environment and canary
  • Lines 5-8: set up a counter to count method hits
  • Lines 10-13: set up a gauge to report how many requests are in progress
  • Lines 15-18: set up a histogram to record duration of each method call
  • Line 20: create a static extension method to inject Prometheus tracking into the middleware
  • Line 22: add a new handler into the pipeline
  • Lines 25-28: extract action, controller, method and response code from the current request if available
  • Line 32: start a stopwatch
  • Line 33: tell Prometheus that a method call is in progress - the "end of operation" is automatic at the end of the using (line 48)
  • Line 35-38: invoke the actual request
  • Line 39-43: stop the stopwatch and log the time recorded
  • Line 46: increment the counter for this controller/action/method/version/environment/canary combination

This code gives us performance metrics - we inject a step into the pipeline that starts a stopwatch, tells Prometheus an operation is in progress, invokes the operation, and then, when the operation completes, records the time taken and increments the call counter. Each "log" includes the WithLabels() call that creates the context (dimensions) for the event.

Total Sales by Product Category

Let's examine what telemetry looks like if we want to track a business metric: say, sales of products by category. For this to work, I'd need to track the product category and price of each item sold. I could add other dimensions too (such as user) so that I can extend my analytics. If I know which users are purchasing products, I can start slicing and dicing by geo or language or other user attributes. If I know when sales occur, I can slice and dice by day of week or hour or any other time-based dimensions. The more dimensions I have, the more insights I can drive.

Let's see how we would track this business metric using Prometheus. Firstly, which metric type do I need? If we use a Counter, we can count how many items are sold, but not track the price - because counters can only increment by 1, not anything else. I could try Gauge since Gauge lets me set an arbitrary number - but unfortunately that doesn't give me a running total - it's just a number at a point in time. Both Histogram and Summary are snapshots of observations in a time period (my wording) so they don't work either. In the end I decided to settle for number of products sold as a proxy for revenue - each time an item is sold, I want to log a counter for the product category and other dimensions so I get some idea of business telemetry.

Generally I like a logging framework with an interface that abstracts the logging mechanism or storage away from the application. I found that this was relatively easy to do using AppInsights - however, Prometheus doesn't really work that way because the metric types are very specific to a particular event or method.

Here's how I ended up logging some telemetry in Prometheus in my .NET Core application:

public class ShoppingCartController : Controller
{
    static readonly string[] labelNames = new[] { "category", "product", "version", "environment", "canary"  };

    readonly Counter productCounter = Metrics.CreateCounter(
        "pu_product_add", "Increments when product is added to basket", 
        new CounterConfiguration
        {
            LabelNames = labelNames
        });
    
    public async Task AddToCart(int id)
    {
        // Retrieve the product from the database
        var addedProduct = _db.Products
            .Include(product => product.Category)
            .Single(product => product.ProductId == id);

        var labels = new string[] { addedProduct.Category.Name, addedProduct.Title, version, environment, canary }; // version, environment and canary come from app configuration
        productCounter.WithLabels(labels).Inc();
        ...
    }
    ...
}

Notes:

  • Line 3: Set up a list of label names - again, these are dimensions for the telemetry
  • Lines 5-10: Set up a Counter for counting when a product is added to a basket, using labelNames for the dimensions
  • Line 19: Create an array of values that correspond to the labelNames array
  • Line 20: Increment the counter, again using WithLabels()

Viewing Telemetry

Now that we have telemetry integrated, we can view it by browsing to the metrics endpoint we configured in the application. This returns a live snapshot of the metrics:

# HELP process_open_handles Number of open handles
# TYPE process_open_handles gauge
process_open_handles 346
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1556149570.76
# HELP dotnet_total_memory_bytes Total known allocated memory
# TYPE dotnet_total_memory_bytes gauge
dotnet_total_memory_bytes 8133304
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 12298391552
# HELP http_request_duration_seconds Duration of requests to endpoints
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_sum{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action=""} 6.6218794
http_request_duration_seconds_count{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action=""} 926
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.005"} 872
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.01"} 909
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.025"} 916
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.05"} 919
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.075"} 919
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.1"} 922
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.25"} 923
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.5"} 925
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="0.75"} 925
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="1"} 925
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="2.5"} 925
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="5"} 926
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="7.5"} 926
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="10"} 926
http_request_duration_seconds_bucket{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action="",le="+Inf"} 926
# HELP process_num_threads Total number of threads
# TYPE process_num_threads gauge
process_num_threads 24
# HELP dotnet_collection_count_total GC collection count
# TYPE dotnet_collection_count_total counter
dotnet_collection_count_total{generation="0"} 3
dotnet_collection_count_total{generation="2"} 0
dotnet_collection_count_total{generation="1"} 1
# HELP process_working_set_bytes Process working set
# TYPE process_working_set_bytes gauge
process_working_set_bytes 159961088
# HELP http_requests_received_total Counts requests to endpoints
# TYPE http_requests_received_total counter
http_requests_received_total{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action=""} 926
# HELP process_private_memory_bytes Process private memory size
# TYPE process_private_memory_bytes gauge
process_private_memory_bytes 0
# HELP pu_product_add Increments when product is added to basket
# TYPE pu_product_add counter
pu_product_add{category="Wheels & Tires",product="Disk and Pad Combo",version="1.0.0.45",environment="Production",canary="blue"} 1
pu_product_add{category="Oil",product="Oil and Filter Combo",version="1.0.0.45",environment="Production",canary="blue"} 1
# HELP http_requests_in_progress Counts requests currently in progress
# TYPE http_requests_in_progress gauge
http_requests_in_progress{version="1.0.0.45",environment="Production",canary="blue",method="GET",statuscode="200",controller="",action=""} 1
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 15.26

Notes:

  • Lines 14-32: shows http request duration in buckets - notice how each bucket also has the version of the app, the environment, canary, statuscode, controller and action
  • Lines 50-53: shows the number of products added to the basket by category, product, version, environment and canary
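
If you'd rather check the endpoint from a terminal than a browser, a quick curl works just as well. This is a minimal sketch - it assumes the app is listening locally on port 5000 and exposes its metrics at /site/metrics (the path we'll point the ServiceMonitor at in the next post):

curl -s http://localhost:5000/site/metrics | grep pu_product_add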

Conclusion

I didn't get round to showing how Prometheus scrapes the metrics from various services so that you can start to dashboard and analyze - that's the subject for the next post. While I find Prometheus is fairly painful to implement on the tracking side (certainly compared to AppInsights), the graphing and querying can be worth the pain. I'll show you how to do that in a k8s cluster in the next post.

Unfortunately, though I experimented with using Prometheus for "business" telemetry, I can't say I recommend it. It's really meant for performance telemetry. So use AppInsights - which you can totally do even from within containers - if you need to do any business telemetry. And you do!

Happy monitoring!

Container DevOps: Beyond Build (Part 5) - Prometheus Operator

In part 4 of this series I showed how I created a metrics endpoint using Prometheus in my .NET Core application. While not perfect for business telemetry, Prometheus is a standard for performance metrics. But that only exposes an endpoint for metrics - it doesn't do any visualization. In this post I'll go over how I used Prometheus Operator in a k8s cluster to easily scrape metrics from services and then in the next post I'll cover how I configured Grafana  to visualize those metrics - first by hand and then using infrastructure-as-code so that I can audit and/or recreate my entire monitoring environment from source code.

Container DevOps Recap: The Importance of Monitoring

Monitoring is often the black sheep of DevOps - it's not stressed very much. I think that's partly because monitoring is hard - and often, contextual. Boards for work management and pipelines for build and release are generally more generic in concept and most teams starting with DevOps seem to start with these tools. However, Continuous Integration and Continuous Deployment should be complemented by Continuous Monitoring.

One of my DevOps heroes (and by luck of life, friend) Donovan Brown coined the quintessential definition of DevOps a few years ago: DevOps is the union of people, products and process to enable continuous delivery of value to end users. I've heard some folks criticize this definition for its lack of mention of monitoring among other things - but I think that a lot of Donovan's definition is implied (at least should be implied) in the phrase "value".

Most teams think of value in terms of features. I'd like to propose that monitoring - as a mechanism for keeping systems stable, resilient and responsive - is just as important as delivering features. So in a very real sense, his definition implies monitoring. I've also heard Donovan state that it doesn't matter how good your code is: if it's not in the hands of your users, it doesn't deliver value. In the same vein, it doesn't matter how good your features are: if you can't monitor for errors, scale or usage, then you're missing an opportunity to deliver value to your users.

In a world of microservices and Kubernetes, the need for solid monitoring is paramount, and more difficult. Monoliths may be hard to change, but they are by and large easy to monitor. Microservices increase the complexity of monitoring, but there are some techniques that teams can use to manage the complexity.

Prometheus Operator

In the last post I showed how I exposed Prometheus metrics from my services. Imagine you have 10 or 15 services - how do you monitor each one? Exposing metrics via Prometheus is all well and good, but how do you aggregate and visualize the metrics that are being produced? The first step is Prometheus itself - or rather, the Prometheus instance (not to be confused with the Prometheus metrics endpoint that containers or services expose).

If you were manually setting up an instance of Prometheus, you would have to install the pods and services in a k8s cluster as well as configure Prometheus to tell it where to scrape metrics from. This manual process is complex, error prone and time-consuming: enter the Prometheus Operator.

Installing the Prometheus operator (and instance) itself is simple thanks to the official helm chart.  This also (optionally) includes endpoints for monitoring the health of your k8s cluster components via kube-prometheus. It also installs AlertManager for automating alerts - I haven't played with this though.

K8s Operators are a mechanism for deploying applications to a k8s cluster - but these applications tend to be "smarter" than regular k8s applications in that they can hook into the k8s lifecycle. Case in point: scraping telemetry for a newly deployed service. The Prometheus Operator will automagically update the Prometheus configuration via the k8s API when you declare that a new service has Prometheus endpoints. This is done via a custom resource definition (CRD) that is created by the Prometheus helm chart: ServiceMonitors. When you create a service that exposes a Prometheus metrics endpoint, you simply declare a ServiceMonitor alongside your service to dynamically update Prometheus and let it know that you have a new service that can be scraped - including which port and how frequently to scrape.

Configuring Prometheus Operator

The helm chart for Prometheus Operator is a beautiful thing: it means you can install and configure a Prometheus instance, the Prometheus Operator and kube-prometheus (for monitoring cluster components) with a couple lines of script. Here's how I do this in my release pipeline:

helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm upgrade --install prometheus-operator coreos/prometheus-operator --namespace monitoring
helm upgrade --install kube-prometheus coreos/kube-prometheus --namespace monitoring

Notes:

  • Line 1: Add the CoreOS repo for the stable Prometheus operator charts
  • Line 2: Install (or upgrade) the Prometheus operator into a namespace called "monitoring"
  • Line 3: Install the kube-prometheus components - this gives me cluster monitoring

These commands are also idempotent so I can run them every time without worrying about current state - I always end up with the correct config. Querying the services in the monitoring namespace we see the following:

alertmanager-operated                 ClusterIP   None                   9093/TCP,6783/TCP   60d
kube-prometheus                       ClusterIP   10.0.240.33            9090/TCP            60d
kube-prometheus-alertmanager          ClusterIP   10.0.168.1             9093/TCP            60d
kube-prometheus-exporter-kube-state   ClusterIP   10.0.176.16            80/TCP              60d
kube-prometheus-exporter-node         ClusterIP   10.0.251.145           9100/TCP            60d
prometheus-operated                   ClusterIP   None                   9090/TCP            60d
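
For reference, the listing above is just the result of querying the services in that namespace - something like the following, assuming kubectl is already pointed at the cluster:

kubectl get svc -n monitoring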

Configuring ServiceMonitors

Now that we have a Prometheus instance (and the Operator) configured we can examine how to tell the Operator that there's a new service to monitor. Fortunately, now that we have the ServiceMonitor CRD, it's pretty straight-forward: we just declare a ServiceMonitor resource alongside our service! Let's take a look at one:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: website-monitor
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    tier: website
spec:
  jobLabel: app
  selector:
    matchLabels:
      release: pu-dev-website
      system: PartsUnlimited
      app: partsunlimited-website
  namespaceSelector:
    any: true
  endpoints:
  - port: http
    path: /site/metrics
    interval: 15s

Notes:

  • Lines 1-2: We're using the custom resource ServiceMonitor
  • Line 7: We're using a label to ringfence this Service - we've configured the Prometheus service to look for ServiceMonitors with this label
  • Lines 11-15: This ServiceMonitor applies to all services with these matching labels
  • Lines 18-21: We configure the port (a named port in this case) and path for the Prometheus endpoint, as well as what frequency to scrape the metrics

When we create this resource, the Operator picks up the creation of the ServiceMonitor resource via the k8s API and configures the Prometheus server to now scrape metrics from our service(s).
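
Deploying the ServiceMonitor is a normal kubectl apply (or part of your helm release), and because ServiceMonitors are ordinary custom resources you can list them afterwards to check that they were registered. The file name below is purely illustrative:

kubectl apply -f website-monitor.yaml
kubectl get servicemonitors -n monitoring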

PromQL

Now that we have some metrics going into Prometheus, we have the ability to query the metrics. We start by port-forwarding the Prometheus instance so that we can securely access it (you can also expose the instance publicly if you want to):

kubectl port-forward svc/kube-prometheus -n monitoring 9090:9090

Then we can browse to http://localhost:9090/graph and see the Prometheus instance. We can then query metrics using PromQL - a language to query, aggregate and visualize metrics. For example, to see the rate of increase in sales by category for the last 2 minutes, we can write

sum(increase(pu_product_add[2m])) by (category)

And then we can visualize it by clicking on the graph button:

image

This is a proxy for how well sales are doing on our site by category. And while this is certainly great, it's far better to visualize Prometheus metrics in Grafana dashboards - which we can do using the PromQL queries. But that's a topic for a future post in this series!
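
Since the performance metrics from the previous post carry version and canary labels, PromQL also makes it easy to compare traffic across canary deployments. A query along these lines (assuming the metric names from part 4) shows the request rate per canary and version:

sum(rate(http_requests_received_total[5m])) by (canary, version)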

Conclusion

Prometheus and the Prometheus Operator make configuring metric aggregation really easy. It's declarative, dynamic and intuitive. This makes it a really good framework for monitoring services within a k8s cluster. In the next post I'll show how you can hook Grafana up to the Prometheus server in order to make visualizations of the telemetry. Until then:

Happy monitoring!

Enterprise GitHub

Since Microsoft acquired GitHub (and the anti-Microsoft folks calmed down), there have been a number of interesting developments in the GitHub ecosystem. If you've ever read one of my blogs or attended any events that I've spoken at, you'll know that I am a raving Azure DevOps fan. I do, however, also have several repos on GitHub. As a DevOpsologist (someone who is in a constant state of learning about DevOps) I haven't ever recommended GitHub for Enterprises - but the lines of functionality are starting to blur between GitHub and Azure DevOps. So which should you use, and when?

One note before we launch in: Azure DevOps is the name of the suite of functionality for the Microsoft DevOps platform. There are a number of “verticals” within the suite, which you can mix and match according to your needs. Azure Repos is the source control feature, Azure Boards the project management feature, Azure Test Plans the manual test management feature, Azure Pipelines the Continuous Integration/Continuous Deployment (CI/CD) feature and Azure Artifacts, the package management feature.

Blurring the Line

Up until a couple of years ago, the line between GitHub and Azure DevOps was fairly simple to me: if you're doing open-source development, use GitHub. If you're doing Enterprise development, use Azure DevOps. The primary reasons were that GitHub was mainly just source control with some basic issue tracking, and you got unlimited public repos (while private repos were a lot more expensive). Azure DevOps (or Visual Studio Team Services, VSTS, as it used to be called), on the other hand, offered Project Management, CI/CD, Test Management and Package Management in addition to unlimited private repos. However, you can now create private repos inexpensively in GitHub and you can create public repos in Azure DevOps. And last week, GitHub announced an update to GitHub Actions that enables first-class CI, which conceivably will expand to CD fairly soon. While you can't do manual Test Management using GitHub, you get a far better "InnerSource" experience in GitHub than you do in Azure DevOps.

This Shouldn’t Be A Surprise

This isn't entirely apples-to-apples, because (to oversimplify a bit) Azure DevOps is a DevOps platform, while GitHub is primarily a source control platform. At least, that's been my experience up until recently. GitHub has some good, basic project management capabilities that work fantastically for open source development, but they are a little simplistic for Enterprise development. As the industry shifts more and more to automated testing over manual testing, fewer teams have a need to manage manual testing. While you can publish releases on GitHub repos, Azure Artifacts arguably offers a more feature-rich service for package management. Also, while Azure DevOps, when it was still Team Foundation Server (TFS), used to be a "better together" suite, where you were probably better off doing everything in TFS, Azure DevOps is embracing the fact that developers have diverse toolsets, languages, platforms and target platforms. You can now very easily have source code in GitHub, Project Management in Jira, CI/CD in Azure Pipelines and have good visibility and traceability end-to-end.

We shouldn’t be surprised by the blur between GitHub and Azure DevOps. After all, Microsoft owns both now. And I think acquiring GitHub was an astute move by the tech giant – because, if you win the developer, you’re likely to win the organization. The perception of a “big bad Microsoft” is rapidly changing. Even before the GitHub acquisition, Microsoft employees were the top contributors to open source projects in GitHub. So not only is Microsoft embracing open source more and more, but they purchased the premier open source platform in the world!

The question then becomes: Where is Microsoft focusing? Are you better off in GitHub or in Azure DevOps, or some Frankenstein mix of the two? Will Azure Repos continue to evolve, or will Microsoft “Silverlight” Azure Repos? If I could gaze into a crystal ball, I’d predict that in 3 – 5 years, most organizations will have source code in GitHub and utilize Azure DevOps for Project Management, Pipelines and Package Management. Disclaimer: this is pure conjecture on my part!

Azure DevOps and GitHub Head to Head

The blurred lines between GitHub and Azure DevOps should be cause for celebration, not consternation. It just means that we have more options! And options are good, if you consider them carefully and don’t make hype-based decisions. So let’s compare Azure DevOps and GitHub head to head in the realms of Source Control, Project Management, CI/CD and Package Management.

Source Control

There’s not much difference as far as source control management goes between Azure Repos and GitHub, if you’re talking about Git. Fortunately, when the Azure DevOps team decided to add distributed version control to Azure DevOps, they just added Git. Not some funky Microsoft version of Git (though they contribute actively to Git and specifically to libgit2, the core Git libraries). So if you have a Git repo, you can add a GitHub remote, or an Azure DevOps remote, or both. Both are just Git remote repositories. However, if you want centralized source control (don’t do this any more) then you have to go with Azure DevOps. I would argue that the Pull Request experience is slightly better in Azure DevOps, but not by much. Both platforms allow you to protect branches using policies, so not much difference there either. Both platforms have WebHooks that you can use to trigger custom actions off events. Both have APIs for interaction. GitHub Enterprise has pre-receive hooks that can validate code before it is actually imported into the repo. Azure DevOps has a similar mechanism for centralized version control with Check-In policies, but these do not work for Git repos. We’ll call this one a tie.
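
In fact, because both are just Git remotes, nothing stops you from pushing the same repository to both. The organization, project and repo names below are placeholders:

git remote add github https://github.com/myorg/myrepo.git
git remote add azdo https://dev.azure.com/myorg/MyProject/_git/myrepo
git push github master
git push azdo master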

Project Management

Azure Boards has a better Enterprise Project Management story. With GitHub you get Issues and Pull Requests (PRs) as the base "work item" types. You can add Task Lists to Issues, but the overall forms and flows of Issues and PRs are basic. Azure Boards work items can be a lot more complex and offer far more customization opportunities. You can also do portfolio management more effectively in Azure Boards, since you can create work item hierarchies. GitHub does have the notion of Milestones and Projects, but again the functionality is fairly basic and probably too simplistic for Enterprises. While you can create basic filters for work items in GitHub, Azure DevOps has an advanced Work Item Query Language and elastic search. Both platforms allow you to tag (or label) work items. Azure Boards also lets you create widgets and dashboards and even has an OData feed and an Analytics Service so that you can create reports (say from PowerBI) over your work items. Of course, you could use neither system for Project Management and use Jira instead - Jira tickets integrate easily with both Azure Repos (and Pipelines) and GitHub.

In terms of Enterprise project management, I’d have to give this one to Azure Boards.

CI/CD

GitHub introduced GitHub Actions about a year ago. Hundreds of Actions were created by the community, validating the demand for actions triggered off events on a repo. But it seemed that doing any sort of Enterprise-scale CI with Actions was a challenge. Last week, a new and improved version of GitHub Actions was announced, and now CI is baked into GitHub through GitHub Actions. I expect that we'll see a huge surge in adoption of this CI tool, and platforms like CircleCI and other cloud-CI systems may battle to compete. The feature isn't GA yet, so we'll see. It's also suspiciously close to the YML format used by Azure Pipelines and supports Windows, Mac and Linux, just like Azure Pipelines…

The story doesn’t quite end there – if you want CI for GitHub repos, you now have a choice of GitHub Actions or Azure Pipelines, since Azure Pipelines has native support for GitHub repos. If you have repos outside of GitHub, you can’t use Actions.

I'd have to give this one to Azure Pipelines, at least for now. Azure Pipelines includes Release Management (for CD) as well as multi-stage YML pipelines. I'm sure we'll see similar support in Actions soon, but for the moment, while you can do CI pretty easily in GitHub Actions, CD is going to be a challenge.

Package Management

GitHub releases allow you to tag a repo and publish that version of the repo as a release. You can also upload binaries that are the versioned packages for that release. Azure Artifacts allows you to create feeds that can be consumed – you can access a feed using NuGet, npm, Maven, Python packages or Universal Packages (which is a feed of arbitrary files – think NuGet for anything). Feeds are usually better than releases since tools like NuGet or npm know how to connect to feeds. Again, in terms of Enterprise package management, this one goes to Azure Artifacts.

Just Tell Me Which One To Use Already!

So the final score is 3.5 to 0.5 for Azure DevOps. But that’s not a full reflection of the situation, so don’t start porting to Azure DevOps just yet. Remember, options are great if you consider them carefully. And they’re not mutually exclusive either. So here is what I think are the key considerations:

Unit Of Management: Code or Work Items?

Do you track your work by looking at work item tracking, reporting and rolling up across your portfolio? Then you probably need to look at Azure Boards, since GitHub Issues are not going to handle complex Enterprise-level portfolio management. However, if your teams operate a bit more independently and you track work by looking at changes to repos, GitHub may be a better fit. Don’t forget that you can still keep source code in GitHub and use Azure Boards for project management!

Single Pane of Glass or Bring Your Own Tools?

I’ve seen Enterprises that are trying to standardize tooling and processes across teams. In this case, since Azure DevOps is a complete end to end platform, you’re probably better off using Azure DevOps. If you prefer a smorgasbord of tools, you could go either way. Even if you are managing work with Jira, building with TeamCity and deploying with Octopus Deploy, Azure DevOps could still tie these tools together to serve as a “backbone” giving you a single pane of glass.

Manual Test Management or Centralized Source Control?

If you need a tool for Manual Test Management, then Azure DevOps is for you. However, you could easily keep source code in GitHub and still use Azure Test Plans to manage manual tests. And if you need centralized source control for some reason, then your only option is Azure Repos using Team Foundation Version Control (TFVC).

Conclusion

As you can see, there’s a lot of overlap between GitHub and Azure DevOps. Here’s my final prediction – we’ll see more innovation in the source control space in GitHub than we will in Azure Repos. Once again, the disclaimer is that this is my observation, and is in no way based on any official communication from either GitHub or Azure DevOps. I do think that a very viable option for Enterprises in the next few years will be to manage source code in GitHub and use Azure DevOps for CI/CD, Project Management and Package Management. This gives you the best of both worlds – you’re on (arguably) the best source control system – GitHub – and you get Enterprise-grade features for project, build and package management. As GitHub Actions evolves, perhaps CI/CD can be moved over to GitHub too. Yes, it’s still fuzzy even after thinking through all these options!

In short, think clearly about which platform (or combination of platforms) is going to best suit your Enterprise's culture. Remember, DevOps is as much about people (culture) as it is about tooling.

Happy DevOps!

Azure DevOps Build and Test Reports using OData and REST in PowerBI

I have been playing with the Azure DevOps OData service recently to start creating some reports. Most of my fiddling has been with the Work Item and Work Item Board Snapshot entities, but I recently read a great post focused more on Build metrics by my friend and fellow ALM MVP, Wouter de Kort. I just happened to be working with a customer that is migrating from Azure DevOps Server to Azure DevOps Services, and they had some SSRS reports that I knew could fairly easily be recreated using OData and PowerBI. In this post I'll go over some of my experiences with the OData service and share a PowerBI template so that you can start creating some simple build reports yourself.

TL;DR

If you just want the PowerBI template, then head over to this Github repo and have at it!

Exploring Metadata

If you want to see what data you can grab from the OData endpoint for your Azure DevOps account, then navigate to this URL: https://analytics.dev.azure.com/{organization}/_odata/v3.0-preview/$metadata (you’ll need to replace {organization} with your organization name). This gives an XML document that details the entities and relationships. Here’s a screenshot of what it looks like in Chrome:

image

To get the data for an entity, you need to use OData queries. Some of these are pretty obscure, but powerful. First tip: pluralize the entity to get the entries. For example, the entity “Build” is queried by navigating to https://analytics.dev.azure.com/{organization}/_odata/v3.0-preview/Builds. You definitely want to learn how to apply $filter (for filtering data), $select (for specifying which columns you want to select), $apply (for grouping and aggregating) and $expand (for expanding fields from related entities). Once you have some of these basics down, you’ll be able to get some pretty good data out of your Azure DevOps account.

Here’s an example. Let’s imagine you want a list of all builds (build runs) from Sep 1st to today. The Build entity has the ProjectSK (an identifier to the project), but you’ll probably want to expand to get the Project name. Similarly, the Build entity includes a reference to the Build Definition ID, but you’ll have to expand to get the Build Definition Name. Here’s what the request would look like:

https://analytics.dev.azure.com/{organization}/_odata/v3.0-preview/Builds?
   $apply=filter(CompletedDate ge 2019-09-01Z)
   &$select=* 
   &$expand=Project($select=ProjectName),BuildPipeline($select=BuildPipelineName),Branch($select=RepositoryId,BranchName)

If you look at the metadata for the Build entity, you’ll see that there are navigation properties for Project, BuildPipeline, Branch and a couple others. These are the names I use in the $expand directive, using an internal $select to specify which fields of the related entities I want to select.

Connecting with PowerBI

To connect with PowerBI, you just connect to an OData feed. You then have to expand some of the columns and do some other cleanup. Here's what the M query looks like (view it by navigating to the "Advanced editor" for a query) for getting all the Builds completed over the last two weeks:

let
    Source = OData.Feed ("https://analytics.dev.azure.com/" & #"AzureDevOpsOrg" & "/_odata/v3.0-preview/Builds?"
        & "$apply=filter(CompletedDate ge " & Date.ToText(Date.From(Date.AddDays(DateTime.LocalNow(), -14)), "yyyy-MM-dd") & "Z )"
        & "&$select=* "
        & "&$expand=Project($select=ProjectName),BuildPipeline($select=BuildPipelineName),Branch($select=RepositoryId,BranchName)"
     ,null, [Implementation="2.0",OmitValues = ODataOmitValues.Nulls,ODataVersion = 4]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"BuildSK", type text}, {"BuildId", type text}, {"BuildDefinitionId", type text}, {"BuildPipelineId", type text}, {"BuildPipelineSK", type text}, {"BranchSK", type text}, {"BuildNumberRevision", type text}}),
    #"Expanded BuildPipeline" = Table.ExpandRecordColumn(#"Changed Type", "BuildPipeline", {"BuildPipelineName"}, {"BuildPipelineName"}),
    #"Expanded Branch" = Table.ExpandRecordColumn(#"Expanded BuildPipeline", "Branch", {"RepositoryId", "BranchName"}, {"RepositoryId", "BranchName"}),
    #"Renamed Columns" = Table.RenameColumns(#"Expanded Branch",{{"PartiallySucceededCount", "PartiallySucceeded"}, {"SucceededCount", "Succeeded"}, {"FailedCount", "Failed"}, {"CanceledCount", "Canceled"}}),
    #"Expanded Project" = Table.ExpandRecordColumn(#"Renamed Columns", "Project", {"ProjectName"}, {"ProjectName"})
in
    #"Expanded Project"

Notes:

  • Line 2: Connect to the OData feed Build entities (the #”AzureDevOpsOrg” is a parameter so that the account can be changed in a single place)
  • Line 3: Use Date.ToText and other M functions to get dates going back 2 weeks
  • Line 6: Standard OData feed arguments
  • Lines 7-11: Update some column types, rename some columns and expand some record columns to make the data easier to work with

You can see how we can use PowerBI functions (like DateTime.LocalNow) and so on. This allows us to create dynamic reports.

Performance – Be Mindful

Be careful with your queries – try to aggregate where you can. For detail reports, make sure you limit the result sets using filters like date, team or team project and so on. You don’t want to be bringing millions of records back each time you refresh a report! For my particular report, I limit the date range to the builds completed in the last 2 weeks. In my case, that’s not a lot of data – but if you run hundreds of builds every day, even that date range might be too broad.
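
For summary-style reports you can often push the aggregation to the Analytics service itself rather than pulling every row into PowerBI. As a sketch (using the OData aggregation extensions - adjust the property names to your needs), something like this returns one row per pipeline instead of one row per build:

https://analytics.dev.azure.com/{organization}/_odata/v3.0-preview/Builds?
   $apply=filter(CompletedDate ge 2019-09-01Z)/groupby((BuildPipelineId), aggregate($count as BuildCount))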

Limitations

There are still some gaps in the OData feeds. You can get TestRun and TestResult entities - both for automated and manual tests. This data is sufficient for doing some reporting on automated tests, but it's impossible to tie the TestResults back to test plans and suites. The TestResult entity does have a TestCaseReferenceId, so you can get back to the Test Case, but there's no way to aggregate these into Suites and Plans since those entities are entirely absent from the OData model. Similarly, the Build entity has a relationship to the Branch entity, which contains a RepositoryId but no repository name - and there isn't an entity for Repos in the OData model either.

API Calls From PowerBI

Two other limitations I found were that there's no queue information in the OData feed (so you can't see which queue a build was routed to) and there's no code coverage information either. So doing any analysis on code coverage statistics or queues isn't possible using pure OData. Wouter makes the same discovery in his blog post, where he calls out using PowerShell to call the Azure DevOps REST APIs to get some additional queue data.

However, you can call REST APIs from PowerBI. I wanted a report where users could filter by repo, so I wanted a list of repositories in my organization. I also wanted to include queue and code coverage information on the Build entities.

Before we look at how to do this in PowerBI, there is a caveat to doing API calls, especially if you’re looping over records: don’t do this for large datasets! When I was trying to aggregate test runs to test suites and plans, I was actually able to get a list of test plans and test suites in an organization using REST APIs. But then I wanted a list of test IDs in each Test Suite – and that’s when my dream died. The organization I was doing this for had over 20,000 Test Suites – that means that PowerBI would have to make over 20,000 REST API calls to get all the Tests in Test Suites in an organization. I was forced to abandon that plan. In short, be mindful of where you use your REST API calls, and try to limit the number of rows you’re making the calls for!

Another caveat is that while you can authenticate to the OData feed using org credentials, you need a PAT for the REST API call! So there are now two authentication mechanisms for the report – org account and PAT.

Enough caveats – let’s get to it!

Create a REST API Function

The first step is to create a function that can call the Azure DevOps API. Here's the function to get a list of repositories for a given Team Project:

(project as text) =>
let
    Source = Json.Document(Web.Contents("https://dev.azure.com/" & #"AzureDevOpsOrg" & "/" & project & "/_apis/git/repositories?api-version=5.1"))
in
    Source

This function takes a single arg called “project” of type text.
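
Save the query with the name GetGitRepos - that's the name we'll select when invoking it as a custom function in a moment. You can sanity-check it in the query editor by invoking it directly with a (hypothetical) project name:

GetGitRepos("PartsUnlimited")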

Now that we have the function defined, we can use it to expand a table with a list of Team Projects to end up with a list of all the repos in an org. Add a new Data Source, open the advanced editor and paste in this query:

let
    Source = OData.Feed ("https://analytics.dev.azure.com/" & #"AzureDevOpsOrg" & "/_odata/v3.0-preview/Projects?"
        &"&$select=ProjectSK, ProjectName "
    ,null, [Implementation="2.0",OmitValues = ODataOmitValues.Nulls,ODataVersion = 4])
in
    Source

If it runs, you’ll get a table of projects in your Azure DevOps organization:

image

Now comes the magic:

  1. Click on Add Column in the ribbon
  2. Click on “Invoke Custom Function”
  3. Enter “Repos” as the new column name
  4. Select “GetGitRepos” (the function we created earlier) from the list of functions
  5. Make sure the type is set to column so that PowerBI will loop through each row in the table, calling the function
  6. Change the column to ProjectName – this is the value for the project arg for the function

image

Once you click OK, PowerBI will call the function for each row in the table – this is why you don’t want to do this on a table with more than a few hundred rows! Here’s what the result will look like:

image

Now we want to expand the Record in each row, so click on the expand glyph to the right of the column name. We don’t really care about count, we just want value expanded and we don’t need the prefix:

image

This expands, but we’ll need to expand “value” once more, since it too is a complex object. Click the expand glyph again and select “Expand to Rows”. You can now filter out nulls – I could only do this by adding a line in the Advanced Editor:

#"Filter nulls" = Table.SelectRows(#"Expanded value", each [value] <> null)

Don’t forget to change the “in” to #“Filter nulls”. You will then need to expand value again:

image

Now we can finally see the fields for the repo itself – I just selected name, size, defaultBranch and webUrl. Now you can update any types and rename columns as you need. We now have a list of repos! Here’s the final M query:

let
   Source = OData.Feed ("https://analytics.dev.azure.com/" & #"AzureDevOpsOrg" & "/_odata/v3.0-preview/Projects?"
        &"&$select=ProjectSK, ProjectName "
    ,null, [Implementation="2.0",OmitValues = ODataOmitValues.Nulls,ODataVersion = 4]),
    #"Invoked Custom Function" = Table.AddColumn(Source, "Repos", each GetGitRepos([ProjectName])),
    #"Expanded Repos" = Table.ExpandRecordColumn(#"Invoked Custom Function", "Repos", {"value"}, {"Repos.value"}),
    #"Expanded Repos.value" = Table.ExpandListColumn(#"Expanded Repos", "Repos.value"),
    #"Filter nulls" = Table.SelectRows(#"Expanded Repos.value", each [Repos.value] <> null),
    #"Expanded Repos.value2" = Table.ExpandRecordColumn(#"Filter nulls", "Repos.value", {"id", "name", "defaultBranch", "size", "webUrl"}, {"Repos.value.id", "Repos.value.name", "Repos.value.defaultBranch", "Repos.value.size", "Repos.value.webUrl"}),
    #"Renamed Columns1" = Table.RenameColumns(#"Expanded Repos.value2",{{"Repos.value.id", "RepositoryId"}, {"Repos.value.name", "Name"}, {"Repos.value.defaultBranch", "DefaultBranch"}, {"Repos.value.size", "Size"}, {"Repos.value.webUrl", "WebURL"}})
in
    #"Renamed Columns1"

For adding queue information to builds, I created a function to get build detail for a build number (so that I could extract the queue). For code coverage, I created a function to call the coverage API for a build – again expanding the records that came back. You can see the final queries in the template.
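
I won't reproduce those queries here (they're in the template), but they follow the same shape as GetGitRepos - a small function wrapping Web.Contents. A sketch of the build-detail variant might look like the following; the signature is my own and the api-version is simply the one current at the time of writing:

(project as text, buildId as text) =>
let
    Source = Json.Document(Web.Contents("https://dev.azure.com/" & #"AzureDevOpsOrg" & "/" & project & "/_apis/build/builds/" & buildId & "?api-version=5.1"))
in
    Source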

Relating Entities

Now that I have a few entities, PowerBI detects most of the relationships. I added a CalendarDate table so that I could filter all builds/tests on a particular date (the CompletedDate column is a DateTime field, so this is necessary to group on a day). The final ERD looks like this:

image

I had some trouble relating branch to repo, so I eventually just added a lookup function to look up the repo name for the branch via the RepositoryId. That's why Repo isn't related to other entities in the ERD. Similarly, I originally had a Project entity, but found that creating slicers on the Project column in the build worked just fine and kept the ERD simple.

Charts

I created two simple reports in the template – one showing a Build Summary and another showing a test summary. Feel free to start from these and go make some pretty reports!

image

To open the template, you can get it from this GitHub repo. There are also instructions on how to update the auth.

Conclusion

The OData feed for Azure DevOps is getting better – mixing in some REST API calls allows you to fill in some gaps. If you’re careful about your filtering and what data you’re querying, you’ll be able to make some compelling reports. Go forth and measure…

Happy reporting!


Azure Pipeline Variables

I am a big fan of Azure Pipelines. Yes it’s YAML, but once you get over that it’s a fantastic way to represent pipelines as code. It would be tough to achieve any sort of sophistication in your pipelines without variables. There are several types of variables, though this classification is partly mine and pipelines don’t distinguish between these types. However, I’ve found it useful to categorize pipeline variables to help teams understand some of the nuances that occur when dealing with them.

Every variable is really a key:value pair. The key is the name of the variable, and it has a string value. To dereference a variable, simply wrap the key in `$()`. Let’s consider this simple example:

variables:
  name: colin

steps:
- script: echo "Hello, $(name)!"

This will write “Hello, colin!” to the log.

Inline Variables

Inline variables are variables that are hard coded into the pipeline YML file itself. Use these for specifying values that are not sensitive and that are unlikely to change. A good example is an image name: let’s imagine you have a pipeline that is building a Docker container and pushing that container to a registry. You are probably going to end up referencing the image name in several steps (such as tagging the image and then pushing the image). Instead of using a value in-line in each step, you can create a variable and use it multiple times. This keeps to the DRY (Do not Repeat Yourself) principle and ensures that you don’t inadvertently misspell the image name in one of the steps. In the following example, we create a variable called “imageName” so that we only have to maintain the value once rather than in multiple places:

trigger:
- master

pool:
  vmImage: ubuntu-latest

variables:
  imageName: myregistry/api-image

steps:
- task: Docker@2
  displayName: Build an image
  inputs:
    repository: $(imageName)
    command: build
    Dockerfile: api/Dockerfile

- task: Docker@2
  displayName: Push image
  inputs:
    containerRegistry: $(ACRRegistry)
    repository: $(imageName)
    command: push
    tags: $(Build.BuildNumber)

Note that you obviously cannot create "secret" inline variables. If you need a variable to be secret, you'll have to use pipeline variables, variable groups or dynamic variables.

Predefined Variables

There are several predefined variables that you can reference in your pipeline. Examples are:

  • Source branch: “Build.SourceBranch”
  • Build reason: “Build.Reason”
  • Artifact staging directory: “Build.ArtifactStagingDirectory”

You can find a full list of predefined variables here.
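
Referencing them works exactly like any other variable. A trivial sketch:

steps:
- script: |
    echo "Branch: $(Build.SourceBranch)"
    echo "Reason: $(Build.Reason)"
    echo "Staging dir: $(Build.ArtifactStagingDirectory)"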

Pipeline Variables

Pipeline variables are specified in Azure DevOps in the pipeline UI when you create a pipeline from the YML file. These allow you to abstract the variables out of the file. You can specify defaults and/or mark the variables as "secrets" (we'll cover secrets a bit later).

One thing to note: if you specify a variable in the YML variables section, you cannot create a pipeline variable with the same name. If you plan on using pipeline variables, you must not specify them in the "variables" section!

When should you use pipeline variables? These are useful if you plan on triggering the pipeline manually and want to set the value of a variable at queue time. Imagine you sometimes want to build in “DEBUG” and other times in “RELEASE”: you could specify “buildConfiguration” as a pipeline variable when you create the pipeline, giving it a default value of “debug”:

image

If you specify “Let users override this value when running this pipeline” then users can change the value of the pipeline when they manually queue it. Specifying “Keep this value secret” will make this value a secret (Azure DevOps will mask the value).

Let's look at a simple pipeline that consumes the pipeline variable:

name: 1.0$(Rev:.r)

trigger:
- master

pool:
  vmImage: ubuntu-latest
  
jobs:
- job: echo
  steps:
  - script: echo "BuildConfiguration is $(buildConfiguration)"

Running the pipeline without editing the variable produces the following log:

image

If the pipeline is not manually queued but triggered, any pipeline variables default to the values you specified when you created them.

If we update the value to "release" when we queue the pipeline, the log of course reflects the new value:

image

Referencing a pipeline variable is exactly the same as referencing an inline variable – once again, the distinction is purely for discussion.

Secrets

At some point you’re going to want a variable that isn’t visible in the build log: a password, an API Key etc. As I mentioned earlier, inline variables are never secret. You must mark a pipeline variable as secret in order to make it a secret, or you can create a dynamic variable that is secret.

"Secret" in this case just means that the value is masked in the logs. It is still possible to expose the value of a secret if you really want to. A malicious pipeline author could “echo” a secret to a file and then open the file to get the value of the secret.

All is not lost though: you can put controls in place to ensure that nefarious developers cannot simply run updated pipelines – you should be using Pull Requests and Branch Policies to review changes to the pipeline itself (an advantage to having pipelines as code). The point is, you still need to be careful with your secrets!

Dynamic Variables and Logging Commands

Dynamic variables are variables that are created and/or calculated at run time. A good example is using the “az cli” to retrieve the connection string to a storage account so that you can inject the value into a web.config. Another example is dynamically calculating a build number in a script.

To create or set a variable dynamically, you can use logging commands. Imagine you need to get the username of the current user for use in subsequent steps. Here’s how you can create a variable called “currentUser” with the value:

- script: |
    curUser=$(whoami)
    echo "##vso[task.setvariable variable=currentUser;]$curUser"

When writing bash or PowerShell commands, don’t confuse “$(var)” with “$var”. “$(var)” is interpolated by Azure DevOps when the step is executed, while “$var” is a bash or PowerShell variable. I often use “env” to create environment variables rather than dereferencing variables inline. For example, I could write:

- script: echo $(Build.BuildNumber)

but I can also use environment variables:

- script: echo $buildNum
  env:
    buildNum: $(Build.BuildNumber)

This may come down to personal preference, but I've avoided confusion by consistently using env for my scripts!

To make the variable a secret, simply add "issecret=true" into the logging command:

echo "##vso[task.setvariable variable=currentUser;issecret=true]$curUser"

You could do the same thing using PowerShell:

- powershell: |
    Write-Host "##vso[task.setvariable variable=currentUser;]$env:UserName"

Note that there are two flavors of PowerShell: “powershell” is for Windows and “pwsh” is for PowerShell Core which is cross-platform (so it can run on Linux and Mac!).

One special case of a dynamic variable is a calculated build number. For that, calculate the build number however you need to and then use the “build.updatebuildnumber” logging command:

- script: |
    buildNum=$(...)  # calculate the build number somehow
    echo "##vso[build.updatebuildnumber]$buildNum"

Other logging commands are documented here.

Variable Groups

Creating inline variables is fine for values that are not sensitive and that are not likely to change very often. Pipeline variables are useful for pipelines that you want to trigger manually. But there is another option that is particularly useful for multi-stage pipelines (we'll cover these in more detail later).

Imagine you have a web application that connects to a database that you want to build and then push to DEV, QA and Prod environments. Let's consider just one config setting - the database connection string. Where should you store the value for the connection string? Perhaps you could store the DEV connection string in source control, but what about QA and Prod? You probably don't want those passwords stored in source control.

You could create them as pipeline variables - but then you'd have to prefix the value with an environment or something to distinguish the QA value from the Prod value. What happens if you add in a STAGING environment? What if you have other settings like API Keys? This can quickly become a mess.

This is what Variable Groups are designed for. You can find variable groups in the “Library" hub in Azure DevOps:

image

The image above shows two variable groups: one for DEV and one for QA. Let's create a new one for Prod, specifying the same variable name (“ConStr”) but this time entering in the value for Prod:

image

Security is beyond the scope of this post - but you can specify who has permission to view/edit variable groups, as well as which pipelines are allowed to consume them. You can of course mark any value in the variable group as secret by clicking the padlock icon next to the value.

The trick to making variable groups work for environment values is to keep the names the same in each variable group. That way the only setting you need to update between environments is the variable group name. I suggest getting the pipeline to work completely for one environment, and then “Clone” the variable group - that way you're assured you're using the same variable names.

KeyVault Integration

You can also integrate variable groups to Azure KeyVaults. When you create the variable group, instead of specifying values in the variable group itself, you connect to a KeyVault and specify which keys from the vault should be synchronized when the variable group is instantiated in a pipeline run:

image

Consuming Variable Groups

Now that we have some variable groups, we can consume them in a pipeline. Let's consider this pipeline:

trigger:
- master

pool:
  vmImage: ubuntu-latest
  
jobs:
- job: DEV
  variables:
  - group: WebApp-DEV
  - name: environment
    value: DEV
  steps:
  - script: echo "ConStr is $(ConStr) in enviroment $(environment)"

- job: QA
  variables:
  - group: WebApp-QA
  - name: environment
    value: QA
  steps:
  - script: echo "ConStr is $(ConStr) in enviroment $(environment)"

- job: Prod
  variables:
  - group: WebApp-Prod
  - name: environment
    value: Prod
  steps:
  - script: echo "ConStr is $(ConStr) in enviroment $(environment)"

When this pipeline runs, we’ll see the DEV, QA and Prod values from the variable groups in the corresponding jobs.

Notice that the format for inline variables alters slightly when you have variable groups: you have to use the “- name/value” format.

Variable Templates

There is one more option that can be useful: if you have a set of variables that you want to share across multiple pipelines, you can create a variable template. The template can then be referenced in multiple pipelines:

# templates/variables.yml
variables:
- name: buildConfiguration
  value: debug
- name: buildArchitecture
  value: x64

# pipelineA.yml
variables:
- template: templates/variables.yml
steps:
- script: build x ${{ variables.buildArchitecture }} ${{ variables.buildConfiguration }}

# pipelineB.yml
variables:
- template: templates/variables.yml
steps:
- script: echo 'Arch: ${{ variables.buildArchitecture }}, config ${{ variables.buildConfiguration }}'

Precedence and Expansion

Variables can be defined at various scopes in a pipeline. When you define a variable with the same name at more than one scope, you need to be aware of the precedence. You can read the documentation on precedence here.

You should also be aware of when variables are expanded. They are expanded at the beginning of the run, as well as before each step. This example shows how this works:

jobs:
- job: A
  variables:
    a: 10
  steps:
    - bash: |
        echo $(a)    # This will be 10
        echo '##vso[task.setvariable variable=a]20'
        echo $(a)    # This will also be 10, since the expansion of $(a) happens before the step
    - bash: echo $(a)        # This will be 20, since the variables are expanded just before the step

Conclusion

Azure Pipelines variables are powerful – and with great power comes great responsibility! Hopefully you understand variables and some of their gotchas a little better now. There’s another topic that needs to be covered to complete the discussion on variables – parameters. I’ll cover parameters in a follow up post.

For now – happy building!

Executing JMeter Tests in an Azure Pipeline

Microsoft have deprecated Load Testing in Visual Studio. Along with this, they have also deprecated the cloud load testing capability in Azure/Azure DevOps. On the official alternatives document, several alternative load testing tools and platforms are mentioned, including JMeter. What is not clear from this page is how exactly you’re supposed to integrate JMeter into your pipelines.

I have a demo that shows how you can use Application Insights to provide business telemetry. In the demo, I update a website (PartsUnlimited) and then use traffic routing to route 20% of traffic to a canary slot. To simulate traffic, I run a cloud load test. Unfortunately, I won't be able to use that for much longer, since the cloud load test functionality will reach end of life soon! So I set about figuring out how to run this test using JMeter.

JMeter tests can be run on a platform called BlazeMeter. BlazeMeter has integration with Azure DevOps. However, I wanted to see if I could get a solution that didn’t require a subscription service.

The Solution

JMeter is a Java-based application. I’m not a big fan of Java – even though I authored a lot of the Java Hands on Labs for Azure DevOps! I had to install the JRE on my machine in order to open the JMeter GUI so that I could author my test. However, I didn’t want to have to install Java or JMeter on the build agent – so of course I looked to Docker. And it turns out that you can run JMeter tests in a Docker container pretty easily!

To summarize the solution:

  1. Create your JMeter test plans (and supporting files like CSV files) and put them into a folder in your repo
  2. Create a “run.sh” file that launches the Docker image and runs the tests
  3. Create a “test.sh” file for each JMeter test plan – this just calls “run.sh” passing in the test plan and any parameters
  4. Publish the “reports” directory for post-run analysis

I’ve created a GitHub repo that has the source code for this example.
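
To give you an idea of how this hangs together in a pipeline, here's a minimal sketch of the build steps. The jmeter folder, artifact name and host value are assumptions - adjust the paths to match your repo:

pool:
  vmImage: ubuntu-latest

steps:
- script: |
    chmod +x run.sh test.sh
    ./test.sh "$(Build.SourcesDirectory)/jmeter" CartTest.jmx cdpartsun2-dev.azurewebsites.net
  workingDirectory: jmeter
  displayName: Run JMeter test plan in Docker

- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: $(Build.SourcesDirectory)/jmeter/report
    ArtifactName: jmeter-report
  displayName: Publish JMeter HTML report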

Test Plan Considerations

I won’t go over recording and creating a JMeter test in this post – I assume that you have a JMeter test ready to go. I do, however, want to discuss parameters and data files.

It’s common to have some parameters for your test plans. For example, you may want to run the same test plan against multiple sites – DEV or STAGING for example. In this case you can specify a parameter called “host” that you can specify when you run the test. To access parameters in JMeter, you have to use the parameter function, “__P”. JMeter distinguishes between parameters and variables, so you can have both a variable and a parameter of the same name.

In the figure below, I have a test plan called CartTest.jmx where I specify a User Defined Variable (UDV) called “host”. I use the parameter function to read the parameter value if it exists, or default to “cdpartsun2-prod.azurewebsites.net” if the parameter does not exist:

[Image: the CartTest.jmx “host” User Defined Variable using the __P function]

The value of the host UDV is “${__P(host,cdpartsun2-prod.azurewebsites.net)}”. Of course you can use the __P function wherever you need it – not just for UDVs.

In my test plan, I also have a CSV file for test data. I set the path of this file relative to the JMX file:

[Image: CSV Data Set Config with a relative path to the test data file]

Now that I have the test plan and supporting data files, I am ready to script test execution. Before we get to running the test in a container, let’s see how I can run the test from the command line. I simply execute this command from within the folder containing the JMX file:

jmeter -n -t CartTest.jmx -l results.jtl -Jhost=cdpartsun2-dev.azurewebsites.net -j jmeter.log -e -o reports

Notes:

  • -n tells JMeter to run in non-GUI mode
  • -t specifies the path to the test plan
  • -l specifies the path to output results to
  • -J<name>=<value> is how I pass in parameters; there may be multiple of these
  • -j specifies the path to the log file
  • -e specifies that JMeter should produce a report
  • -o specifies the report folder location

We now have all the pieces to script this into a pipeline! Let’s encapsulate some of this logic into two scripts: “run.sh” which will launch a Docker container and execute a test plan, and “test.sh” that is a wrapper for executing the CartTest.jmx file.

run.sh

I based this script off this GitHub repo by Just van den Broecke.

#!/bin/bash
#
# Run JMeter Docker image with options

NAME="jmetertest"
IMAGE="justb4/jmeter:latest"
ROOTPATH=$1

echo "$ROOTPATH"
# Stop and remove any previous instance of the container, then run a new one
docker stop "$NAME" > /dev/null 2>&1
docker rm "$NAME" > /dev/null 2>&1
# Mount ROOTPATH as /test in the container and pass the remaining script
# arguments straight through to the image's JMeter entrypoint
docker run --name "$NAME" -i -v "$ROOTPATH":/test -w /test "$IMAGE" "${@:2}"

Notes:

  • The NAME variable is the name of the container instance
  • The IMAGE is the container image to launch – in this case “justb4/jmeter:latest” – this container includes Java and JMeter, as well as an entrypoint that launches a JMeter test
  • ROOTPATH is the first arg to the script and is the path that contains the JMeter test plan and data files
  • The script stops any running instance of the container, and then deletes it
  • The final line of the script runs a new instance of the container, mapping a volume from “ROOTPATH” on the host machine to a folder in the container called “/test”, and then passes the remaining script arguments (skipping ROOTPATH) straight through to the container’s entrypoint – these are the JMeter test arguments

test.sh

Now we have a generic way to launch the container, map the files and run the tests. Let’s wrap this call into a script for executing the CartTest.jmx test plan:

#!/bin/bash
#
rootPath=$1
testFile=$2
host=$3

echo "Root path: $rootPath"
echo "Test file: $testFile"
echo "Host: $host"

T_DIR=.

# Reporting dir: start fresh
R_DIR=$T_DIR/report
rm -rf $R_DIR > /dev/null 2>&1
mkdir -p $R_DIR

rm -f $T_DIR/test-plan.jtl $T_DIR/jmeter.log  > /dev/null 2>&1

./run.sh $rootPath -Dlog_level.jmeter=DEBUG \
	-Jhost=$host \
	-n -t /test/$testFile -l $T_DIR/test-plan.jtl -j $T_DIR/jmeter.log \
	-e -o $R_DIR

echo "==== jmeter.log ===="
cat $T_DIR/jmeter.log

echo "==== Raw Test Report ===="
cat $T_DIR/test-plan.jtl

echo "==== HTML Test Report ===="
echo "See HTML test report in $R_DIR/index.html"

Notes:

  • The script takes three arguments: the rootPath on the host containing the test plan, the name of the test plan (the jmx file) and a host value, which is a parameter specific to this test plan
  • T_DIR is set to the current directory
  • The report directory (R_DIR) is recreated so that every run starts with a clean report folder
  • Any previous result and log files are removed
  • run.sh is called, passing in the rootPath and all the other JMeter args and parameters we need to invoke the test
  • Finally, the script echoes the JMeter log, the raw test results and the location of the HTML report

As long as we have Docker, we can run the script and we don’t need to install Java or JMeter!

We can execute the test from bash like so:

./test.sh $PWD CartTest.jmx cdpartsun2-dev.azurewebsites.net

WSL Gotcha

One caveat for Windows Subsystem for Linux (WSL): $PWD will not work for the volume mapping. This is because Docker for Windows is running on Windows, while the WSL paths are mounted in the Linux subsystem. In my case, the folder in WSL is “/mnt/c/repos/10m/partsunlimited/jmeter”, while the folder in Windows is “c:\repos\10m\partsunlimited\jmeter”. It took me a while to figure this out – the volume mapping works, but the volume is always empty. To work around this, just pass in the Windows path instead:

./test.sh 'C:\repos\10m\partsunlimited\jmeter' CartTest.jmx cdpartsun2-dev.azurewebsites.net
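
If you’d rather not hard-code the Windows path, the wslpath utility that ships with WSL should also do the trick – it converts the current Linux path to its Windows equivalent:

./test.sh "$(wslpath -w "$PWD")" CartTest.jmx cdpartsun2-dev.azurewebsites.net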

Executing from a Pipeline

We’ve done most of the hard work – now we can put the script into a pipeline. We just need to execute the test script with the correct arguments, upload the test results, and we’re done! Here’s the pipeline:

variables:
  host: cdpartsun2-dev.azurewebsites.net

jobs:
- job: jmeter
  pool:
    vmImage: ubuntu-latest
  displayName: Run JMeter tests
  steps:
  - task: Bash@3
    displayName: Execute JMeter tests
    inputs:
      targetType: filePath
      filePath: 'jmeter/test.sh'
      arguments: '$PWD CartTest.jmx $(host)'
      workingDirectory: jmeter
      failOnStderr: true
  - task: PublishPipelineArtifact@1
    displayName: Publish JMeter Report
    inputs:
      targetPath: jmeter/report
      artifact: jmeter

This is very simple – and we don’t even have to worry about installing Java or JMeter – the only prerequisite is that the agent is able to run Docker containers! The first step executes the test.sh script, passing in the arguments just like we did from the console. The second task publishes the report folder so that we can analyze the run.
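
As an aside, if you’d rather not maintain the wrapper scripts, the same docker invocation could be inlined into the pipeline as a script step. This is just a rough sketch that mirrors the arguments used by run.sh and test.sh above – the paths and the --rm flag are my own choices, not part of the original scripts:

- script: |
    docker run --rm -i \
      -v "$(Build.SourcesDirectory)/jmeter:/test" -w /test \
      justb4/jmeter:latest \
      -Jhost=$(host) -n -t /test/CartTest.jmx \
      -l test-plan.jtl -j jmeter.log -e -o report
  displayName: Run JMeter test in Docker (inline alternative)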

Here’s a snippet of the log while the test is executing: we can see the download of the Docker image and the boot up – now we just wait for the test to complete.

[Image: build log showing the Docker image being pulled and the JMeter test starting]

Executable Attributes

One quick note: when I first committed the scripts to the repo, they didn’t have the executable attribute set, which caused the build to fail since the agent couldn’t execute them. To set the executable attribute, I ran the following commands in the folder containing the sh files:

git update-index --chmod=+x test.sh

git update-index --chmod=+x run.sh
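
Alternatively, if you’d prefer not to change the git index, a hedged alternative is to add a small step at the start of the job that marks the scripts as executable on the agent (this assumes the scripts live in the “jmeter” folder, as in this repo):

- script: chmod +x jmeter/*.sh
  displayName: Make JMeter scripts executable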

Once the build completes, we can download the report file and analyze the test run:

[Image: JMeter HTML test report]

Conclusion

Once you have a JMeter test, it’s fairly simple to run it in a Docker container as part of your build (or release) process. Of course this doesn’t test load from multiple locations and is limited by the number of threads the agent can spin up, but for quick performance metrics it’s a clean and easy way to execute load tests. Add to that the powerful GUI authoring capabilities of JMeter and you have a good performance testing platform.

Happy load testing!

Azure Pipeline Parameters

In a previous post, I did a deep dive into Azure Pipeline variables. That post turned out to be longer than I anticipated, so I left off the topic of parameters until this post.

Type: Any

If we look at the YML schema for variables and parameters, we’ll see this definition:

variables: { string: string }

parameters: { string: any }

Parameters are essentially the same as variables, with the following important differences:

  • Parameters are dereferenced using “${{}}” notation
  • Parameters can be complex objects
  • Parameters are expanded at queue time (when the YAML is compiled), not at run time
  • Parameters can only be used in templates (you cannot pass parameters to a pipeline, only variables)

Parameters allow us to do interesting things that we cannot do with variables, like if statements and loops. Before we dive in to some examples, let’s consider variable dereferencing.

Variable Dereferencing

The official documentation specifies three methods of dereferencing variables: macros, template expressions and runtime expressions:

  • Macros: the “$(var)” style of dereferencing
  • Template expressions: the “${{ parameters.name }}” (or “${{ variables.name }}”) syntax
  • Runtime expressions: the “$[variables.var]” format

In practice, the main thing to bear in mind is when the value is injected. “$()” variables are expanded at runtime, while “${{}}” parameters are expanded at compile time. Knowing this rule can save you some headaches.
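
As a minimal sketch of that rule (the “message” variable here is made up), both of the following steps print “hello” – but the first is rewritten when the YAML is compiled, while the second is only substituted just before the step executes:

variables:
  message: 'hello'

steps:
# substituted at compile time – the step literally becomes "echo hello"
- script: echo ${{ variables.message }}
# left as $(message) and only expanded just before the step runs
- script: echo $(message)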

The other notable difference is left vs right side: variables can only expand on the right side, while parameters can expand on left or right side. For example:

# valid syntax
key: $(value)
key: $[variables.value]
${{ parameters.key }} : ${{ parameters.value }}

# invalid syntax
$(key): value
$[variables.key]: value
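
To make the left-side case concrete, here’s a small sketch of a template that uses a parameter to name a variable (the “varName” parameter is made up for illustration):

# template snippet
parameters:
  varName: 'myVar'

variables:
  ${{ parameters.varName }}: 'some value'   # compiles to myVar: 'some value'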

Here's a real-life example from a TailWind Traders demo I created. In this case, the repo contains several microservices that are deployed as Kubernetes services using Helm charts. Even though the code for each microservice is different, the deployment for each is identical, except for the path to the Helm chart and the image repository.

Thinking about this scenario, I wanted a template for the deployment steps that I could parameterize. Rather than invoking the template once per service, I used a “for” expression to iterate over a list of service objects. For each service deployment, I needed:

  • serviceName: The path to the service Helm chart
  • serviceShortName: Required because the deployment requires two steps: “bake” the manifest, and then “deploy” the baked manifest. The “deploy” task references the output of the “bake” step, so I needed a name that wouldn't collide as I expanded it multiple times in the “for” loop

Here's a snippet of the template steps:

# templates/step-deploy-container-service.yml
parameters:
  serviceName: ''  # product-api
  serviceShortName: '' # productapi
  environment: dev
  imageRepo: ''  # product.api
  ...
  services: []

steps:
- ${{ each s in parameters.services }}:
  - ${{ if eq(s.skip, 'false') }}:
    - task: KubernetesManifest@0
      displayName: Bake ${{ s.serviceName }} manifest
      name: bake_${{ s.serviceShortName }}
      inputs:
        action: bake
        renderType: helm2
        releaseName: ${{ s.serviceName }}-${{ parameters.environment }}
        ...
    - task: KubernetesManifest@0
      displayName: Deploy ${{ s.serviceName }} to k8s
      inputs:
        manifests: $(bake_${{ s.serviceShortName }}.manifestsBundle)
        imagePullSecrets: $(imagePullSecret)

Here's a snippet of the pipeline that references the template:

...
  - template: templates/step-deploy-container-service.yml
    parameters:
      acrName: $(acrName)
      environment: dev
      ingressHost: $(IngressHost)
      tag: $(tag)
      autoscale: $(autoscale)
      services:
      - serviceName: 'products-api'
        serviceShortName: productsapi
        imageRepo: 'product.api'
        skip: false
      - serviceName: 'coupons-api'
        serviceShortName: couponsapi
        imageRepo: 'coupon.api'
        skip: false
      ...
      - serviceName: 'rewards-registration-api'
        serviceShortName: rewardsregistrationapi
        imageRepo: 'rewards.registration.api'
        skip: true

In this case, “services” could not have been a variable since variables can only have “string” values. Hence I had to make it a parameter.
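
As an aside, the newer template syntax lets you make this explicit by giving the parameter a type. Here’s a quick sketch of what the “services” declaration could look like in the template (only this one parameter shown):

parameters:
- name: services
  type: object
  default: []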

Parameters and Expressions

There are a number of expressions that allow us to create more complex scenarios, especially in conjunction with parameters. The example above uses both the “each” and the “if” expressions, along with the boolean function “eq”. Expressions can be used to loop over steps or ignore steps (as an equivalent of setting the “condition” property to “false”). Let's look at an example in a bit more detail. Imagine you have this template:

# templates/steps.yml
parameters:
  services: []

steps:
- ${{ each s in parameters.services }}:
  - ${{ if eq(s.skip, 'false') }}:
    - script: echo 'Deploying ${{ s.name }}'

Then if you specify the following pipeline:

jobs:
- job: deploy
  steps:
  - template: templates/steps.yml
    parameters:
      services:
      - name: foo
        skip: false
      - name: bar
        skip: true
      - name: baz
        skip: false


you should get the following output from the steps:

Deploying foo
Deploying baz
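
Because the “each” and “if” expressions are evaluated when the YAML is compiled, the job effectively ends up with just these steps (a sketch of the expanded result):

steps:
- script: echo 'Deploying foo'
- script: echo 'Deploying baz'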

Parameters can also be used to inject steps. Imagine you have a set of steps that you want to repeat with different parameters - except that in some cases, a slightly different middle step needs to be executed. You can create a template that has a parameter called “middleSteps” where you can pass in the middle step(s) as a parameter!

# templates/steps.yml
parameters:
  environment: ''
  middleSteps: []

steps:
- script: echo 'Prestep'
- ${{ parameters.middleSteps }}
- script: echo 'Post-step'

# pipelineA
jobs:
- job: A
  steps:
  - template: templates/steps.yml
    parameters:
      middleSteps:
      - script: echo 'middle A step 1'
      - script: echo 'middle A step 2'

# pipelineB
jobs:
- job: B
  steps:
  - template: templates/steps.yml
    parameters:
      middleSteps:
      - script: echo 'This is job B middle step 1'
      - task: ...  # some other task
      - task: ...  # some other task

For a real world example of this, see this template file. This is a demo where I have two scenarios for machine learning: a manual training process and an AutoML training process. The pre-training and post-training steps are the same, but the training steps are different: the template reflects this scenario by allowing me to pass in different “TrainingSteps” for each scenario.

Extends Templates

Passing steps as parameters allows us to create what Azure DevOps calls “extends templates”. These put guard rails around which portions of a pipeline can be customized: the template author controls the overall structure and decides how the steps supplied by the extending pipeline are injected (or stripped out). The following example from the documentation demonstrates this:

# template.yml
parameters:
- name: usersteps
  type: stepList
  default: []
steps:
- ${{ each step in parameters.usersteps }}:
  - ${{ each pair in step }}:
    ${{ if ne(pair.key, 'script') }}:
      ${{ pair.key }}: ${{ pair.value }}

# azure-pipelines.yml
extends:
  template: template.yml
  parameters:
    usersteps:
    - task: MyTask@1
    - script: echo This step will be stripped out and not run!
    - task: MyOtherTask@2

Conclusion

Parameters allow us to pass and manipulate complex objects, which we are unable to do using variables. They can be combined with expressions to create complex control flow. Finally, parameters allow us to control how a template is customized using extends templates.

Happy parameterizing!

ChatOps with GitHub Actions and Azure Web Apps

Over this weekend, I ported a task from Azure Pipelines to GitHub Actions. It was a fun project, and while I was at it I realized that I could now do some ChatOps. I decided to create a quick video, which is below.

The code is on GitHub here.

Happy chat-opsing!
