Thinking process behind choosing a tool
The thinking process behind choosing a tool does not get the attention it deserves. While there are many discussions of the form “tool X vs tool Y”, there is very little discussion of how one should choose between tools, or, in the presumed absence of alternatives, whether one should use the only candidate, tool X. This post covers a few things to keep in mind when selecting a tool, highlighting a few common problems and fallacies. Puppet will be used as an example tool under consideration.
Focusing on positive parts only
When considering a product or a tool, the positive aspects are too often overestimated while the negative aspects that influence TCO (Total Cost of Ownership) are underestimated or neglected. There are several cognitive biases and logical fallacies involved, and they can be avoided to some extent just by being aware of them. I will refer to these throughout the post to help you, the reader, become more aware of your thought process, which will hopefully improve it and, consequently, your decision making.
Marketing pushes to see the positive
We all know that marketing focuses on the positive aspects of a product and neglects to mention the downsides. This is specifically mentioned in the “False advertising” article under “Omitting information”.
For example, the fact that it is not convenient to manage Puppet modules (evidence: the existence of a tool to do just that) will not appear in marketing materials. You might think that, on the contrary, the existence of Librarian-puppet makes management of these modules easier. It does, but it also brings more complexity into the system: new problems and bugs instead of inhuman manual management of modules.
This post will focus on the negative
While there is more than enough focus on the positive aspects of products, this post will highlight the negative aspects in order to strike some balance. There are plenty of marketing materials, but it is much harder to find a list of the problems you only discover once you are neck-deep in the tool or product. Such problems will be listed here. Note that this cannot be an exhaustive list, because different situations reveal different problems; this post is based only on my experience and that of several of my friends.
Listing the problems of a tool touches on the Availability heuristic, a cognitive bias: the easier you can recall something, the more “important” it seems. You are bombarded by marketing materials, which are all positive. When considering a tool, your natural flow of thought is “How easily can I remember the positive sides of the tool?”, and that is easy, because you were probably brainwashed already by how good the tool is. Then “How easily can I remember the negative sides of the tool?” is much harder. This is not the kind of information the people behind the tool will push to you; they have no interest in doing so. Their money goes into advertising how good the tool is, not how bad it is. You can balance your rosy impressions of any tool or product by looking at GitHub issues, digging through StackOverflow for the downsides, or reading posts like this one.
Please assume that X is the wrong tool for your needs.
As opposed to “yeah, looks good, let’s use it”, this approach leads to a more thoughtful tool selection process. Please read “Prove your tool is the right choice”.
“Everybody uses X”
The “Everybody uses X” thought might have been planted in your brain by marketing efforts. Please analyze the source of that thought carefully. Maybe you heard about the product from some of your friends or colleagues and made a generalization? Maybe people are just stuck with it? Maybe it is simply what they know? Did you search for alternatives? Did you try to disprove “Everybody uses X”?
“Everybody uses X, therefore it’s good”
Whether this thought was planted by marketing or not, no, there is no logical connection between the first and the second clauses.
If a lot of people use something, it tends to improve, since there are more feedback and more contributors; it is often implied that X is therefore good. But improvement over time or with a growing user base does not mean X is good enough for any particular use right now.
Did you communicate with the people that use X? Did they tell you it was a good decision? Beware of Choice-supportive bias when you talk to them. Which alternatives did they consider? Are they able to articulate the downsides? Every solution has downsides; being able to recognize them increases the credibility of an opinion about X.
“Everybody uses X, we should use X”
Yes, if you consider the value of “then we can blog about it and be part of the hype, possibly getting some traction and traffic”. This might have some estimated value, which should be weighed against the cost incurred by choosing an otherwise unneeded or inferior tool or technology. You can point your bosses to this paragraph, along with your estimate of the costs of using tool X vs better alternatives (which might simply be not using it and coding the needed functionality yourself; the comparison is valid both for X vs Y and for X vs no X).
No, “We should use X” does not logically follow from “Everybody uses X”. Beware of conformity bias.
“Company C uses X”
This piece of information, when served by the vendor of X, implies that company C knows better and that you should use X too.
Company C is a big, respectable company with smart engineers, and the vendor of X will gladly list the big, reputable companies that use X. That is the use of an “Argument from authority”.
Again, there is no straight logical path between “C uses X” and “we should use X too”.
Chances are that company C is vastly different from your company and their circumstances and situation are different from yours.
Company C can also make mistakes. You are unlikely to see a blog post from vendor of X that is titled “Company C realized their mistake and migrated from X”.
Claims of success with tool X
Treat claims of successful usage of tool X with caution. A quick search for “measuring project success” reveals the following dimensions to look at when estimating the success of a project:
- Cost
- Scope
- Quality
- Time
- Team satisfaction
- Customer satisfaction
Claims of successful usage of tool X carry almost no information about what really happens. “We are using Puppet successfully” might mean (taken to an extreme) that for 100 servers and one deploy per day the following applies:
- Cost: There is a dedicated team of five costly operations people working just on Puppet, because it is complex.
- Scope: Puppet covers 80% of the needs; this might be the only dimension examined when claiming success.
- Quality, Team satisfaction: The team is constantly cursing because of bugs, module issues, or Puppet upgrade issues such as “Upgrade to puppet-mysql 3.6.0 Broke My Manifest” (fixed in just two months!) or the “puppet 4.5.0 has introduced a internal version of dig, that is not compatible to stdlib’s version” oopsie.
Enjoy the list of regression bugs. It is hard to blame the Puppet developers for these bugs, because such issues are natural for projects of this size and complexity. My suggestion is that creating your own domain-specific language, one which is not a full programming language, for a configuration management tool is a bad idea. I will elaborate on this point in a bit, in the “Puppet DSL” section.
- Time: It took the above team 6 months to implement Puppet. The time to implement any feature is unpredictable because of complexity and unexpected bugs along the way.
- Customer satisfaction: Given all of the above, it is hard to believe in any kind of satisfaction with what is going on.
It is also worth keeping in mind that any demonstrated success, even a real one, does not mean the same solution will be equally applicable to your situation, because your situation is almost certainly different on one or more dimensions: time, budget, scope (the problem you are solving), skills, requirements.
“But X also provides feature F”
I am sure that the advertisements will mention all the important features as well as “cool” features. Do you really need F?
When choosing a tool, the thought “But X also provides feature F” can be dangerous if F is not something you need immediately. One might think that F will be needed later. That might be the case, but what are the odds? What is the value of F to you? How much would it cost to implement with another tool, or to write yourself? Also consider the “horizon”: if you might need feature F in 3 years, in many situations this should be plainly ignored. In 3 years there might be another tool for F, or you might have switched from X to something else for other reasons by then.
Suppose there is another tool, X2, which is an alternative to X. X2 does not provide F, but its estimated TCO over a year is 50% less than that of X. You should consider the costs, because using X2 for the first year and then switching to X, including the switching costs, might turn out cheaper.
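To make the comparison concrete, here is a back-of-the-envelope sketch in Python. All figures are hypothetical, purely to illustrate how the one-time switching cost enters the calculation.

```python
# All figures are hypothetical, for illustration only.
tco_x_per_year = 100_000    # yearly TCO of tool X
tco_x2_per_year = 50_000    # X2 costs 50% less but lacks feature F
switching_cost = 20_000     # one-time cost of migrating X2 -> X later

# Two-year horizon: X from day one vs X2 first, then switch to X.
plan_x = 2 * tco_x_per_year
plan_x2_then_x = tco_x2_per_year + switching_cost + tco_x_per_year

print(plan_x, plan_x2_then_x)  # 200000 170000
```

Under these made-up numbers, starting with X2 and switching later wins; with a high enough switching cost the answer flips, which is exactly why the costs have to be estimated rather than assumed.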
Putting tools before needs
“So, there is a new trendy hyped tool X. How can we use it?” is typically a bad start. At the very least it should be “So, there is a new trendy hyped tool X. Do we have any problems where X would be a better alternative?”
Ideally the approach would be “We have problem P, which alternative solutions do we have?”. P might be some inefficiency or desired functionality. Solutions, once again, do not have to mean existing tools.
Puppet – the good parts
I will quickly go over a few good parts because I want this post at least to try to be objective.
Convergence is an approach that says one should define the desired state, not the steps to be taken to get there. The steps are abstracted away and on each run the system will try to achieve the desired state as closely as possible.
I do agree that declaring a resource such as a file, user, package or service and its desired state is a good approach. It is concise and usually simpler than specifying the operations that would lead to the desired state, as regular scripts do. This idea manifests in many other tools too: Chef, Ansible, CloudFormation, Terraform.
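As a minimal sketch of the convergence idea (not of how Puppet is implemented), a converging “file” resource in Python might look like this; the path and content are made up:

```python
import os
import tempfile
from pathlib import Path

def ensure_file(path: str, content: str) -> str:
    """Converge a file resource to its desired state; idempotent."""
    p = Path(path)
    if p.exists() and p.read_text() == content:
        return "unchanged"      # already in the desired state: do nothing
    p.write_text(content)       # otherwise take the step needed to get there
    return "converged"

# You declare *what* the file should contain, not *how* to get there.
path = os.path.join(tempfile.mkdtemp(), "motd")
print(ensure_file(path, "welcome\n"))  # converged
print(ensure_file(path, "welcome\n"))  # unchanged (second run is a no-op)
```

The second run changes nothing, which is the whole point: runs are repeatable and the steps are abstracted away.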
Appropriate in some situations
- Think about a startup where someone does the systems engineering job part time, not a professional. As Guy Egozy pointed out, there are situations, such as startups with limited resources and basic needs, where using a configuration management tool might make more sense than elsewhere.
- An urgent demo with all defaults: if you have good control of the tool and you know you need some very specific functionality, say a wordpress+mysql demo tomorrow, it is probably worth preparing the demo with Puppet or Chef. There is still the danger, of course, that the module you were using a month ago has since changed and you need to invest additional time to make things work. Or maybe the module is simply broken now.
Multiple platform support
In my experience, the chances that you will be running the same applications on, say, Windows and Linux are pretty slim. The overlap of installed software across platforms is likely to be infrastructure tooling only (monitoring, graphing, logging). Is it really worth the price?
Puppet DSL
The Puppet DSL has a concept of “class” which has nothing to do with classes in programming languages. At least in retrospect, this was not such a good idea, especially when considering operations people trying to explain Puppet classes to developers.
Limited DSL limitations 🙂
Acknowledged as a problem by facts
The limitations of the DSL were, in my opinion, acknowledged through the actions of Puppet’s developers and contributors:
- Tremendous growth of stdlib: wc -l load_vars.rb, the only file in the stdlib repo in 2011, gives 56 lines of code. In 2017, wc -l $(find lib/puppet/parser/functions -name '*.rb') gives 5325 lines of code.
- each and map: these were added in version 4.0.0, on 2014-04-08 according to GitHub. The first Puppet release was on 2006-01-03 (version 0.9.3). That means it took more than 8 years to implement a loop: “In earlier versions of Puppet, when there were no iteration functions and lambdas weren’t supported, you could achieve a clunkier form of iteration” (same Puppet documentation page).
- Lambdas: lambdas were added in Puppet 3 as an experimental feature. What a novel idea from 1936! They had already been implemented in many languages for years.
Limited DSL is not a great idea!
I do understand why a limited DSL can be aesthetically and mathematically appealing. The problem is that life is more complex than a limited DSL. What could be 10 lines of real code turns into 50 lines of ugly copy+paste and/or hacks around the DSL’s limitations.
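As an illustration (hypothetical resource names, not real Puppet code): in a general-purpose language, generating several similar resources is a short loop, while a DSL without iteration forces you to paste each one by hand.

```python
# Generating three similar "vhost" resources with a plain loop.
# In a DSL without each/map, each of these would be copy+pasted by hand.
sites = ["alpha", "beta", "gamma"]

def vhost(name: str) -> str:
    return f"server {{ server_name {name}.example.com; root /srv/{name}; }}"

config = "\n".join(vhost(s) for s in sites)
print(config)
```

Adding a fourth site is a one-word change here; in the copy+paste version it is another pasted block waiting to drift out of sync.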
It sounds reasonable that at the time CFEngine and Puppet were created there were not enough examples of the shortcomings of limited DSLs and their clashes with real life. Today we have more:
- Puppet – DSL failure admitted by actions, as discussed above.
- Ansible – just looks bad. Some features look like they were torn out of a programming language and forced into YAML.
- Terraform – often generated because well … life. This one is more of a configuration language by design. This approach has pros and cons when applied to infrastructure.
- CloudFormation – 99% configuration format and 1% language, which is why it is generated for all but trivial cases. You do have the alternative of not generating the CloudFormation input file but providing custom resources, which use AWS Lambda functions to do some of the work. While this fits the CloudFormation model perfectly and makes CloudFormation much more powerful, I would really prefer a script over inversion of control plus an additional AWS service (Lambda) that I have to use: one more thing that can go wrong or simply be unavailable when needed most.
I do not agree that Terraform should be limited the way it is, but in my opinion Terraform and CloudFormation are more legitimately limited, while Puppet and Ansible are simply bad design. This limitation by design causes complex workarounds, which are costly and sometimes fragile, not to mention the mental well-being of the systems engineers working with Puppet.
We can all stop creating domain-specific languages for configuration management that are not built on top of real programming languages. Except for a few cases, that is a bad idea. We can admit it instead of perpetuating the wishful thinking that reality is simple and a limited DSL can somehow deal with it.
Dependencies between Puppet modules
Plainly, a headache. Modules have dependencies on other modules, and so on. Finding compatible versions of modules is a hard problem. That is why we have Librarian-puppet. As I mentioned above, it has its own issues.
There are also issues that Librarian-puppet cannot solve, because they are inherent to a system of this scale, complexity and number of contributors. Say you have module APP1, which depends on module LIB, and module APP2, which also depends on LIB. Pinning the version of module LIB because APP1 has a bug can prevent you from upgrading module APP2, whose newer versions depend on a newer LIB. This is not an imaginary scenario but real-life experience.
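The scenario can be sketched as a tiny version-intersection exercise in Python (module names and version numbers are hypothetical):

```python
# Each module declares which versions of LIB it can work with.
requirements = {
    "APP1": {"1.0"},    # pinned: LIB 2.0 triggers a bug in APP1
    "APP2": {"2.0"},    # newer APP2 requires the newer LIB
}

def acceptable_lib_versions(reqs: dict) -> set:
    """Intersect every module's acceptable LIB versions."""
    versions = None
    for module, accepted in reqs.items():
        versions = accepted if versions is None else versions & accepted
    return versions

print(acceptable_lib_versions(requirements))  # set() - no LIB version fits both
```

The intersection is empty: no single version of LIB satisfies both modules, and no tooling can conjure one. The only ways out are patching a module yourself or staying behind on upgrades.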
Breakage of Puppet modules
Another aspect is that in such a complex environment it is somewhere between hard and impossible for any module maintainer to make sure their changes do not break anything. Therefore, things do break:
- Custom nginx.conf template is no longer working
- Changes from Validate_ to datatypes is not backward compatible with Puppet 3.8.7
- Nginx location as exported resource not working anymore
Popular community modules deal with so many cases and operating systems that breakage of some functionality is inevitable.
There is this idea that is kind of in the air: “you have community modules for everything, if you are not using them you are incompetent and wasting your time and money”.
This could come from sources such as:
- People who use community modules for simple cases, where they work fine.
- People who underestimate the amount of maintenance work required to make community modules work for a particular case.
The feedback I have received several times, from different sources, is that if you are doing anything serious with your configuration management tool, you should write your own modules; fitting community modules to your needs is too costly.
Graph dependencies model problems
Do you know people who think in dependency graphs? Most people I know are much more comfortable thinking about a sequence of items or operations to perform. Reasoning about dependency graphs, such as package version compatibility, usually takes recognizable, significant mental effort, often accompanied by curses.
The Puppet team admitted (again, through its actions) that this is a problem: it introduced ordering configuration and at some point made “manifest” ordering the default. Note that this ordering applies only to resources without explicit dependencies, and only within one manifest.
The graphs are somewhat implicit, which causes surprises and the consequent WTFs. Messages about dependency errors are not easily understood.
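To see why the ordering feels implicit, here is a sketch using Python’s standard graphlib on a made-up set of Puppet-like resources; the tool runs them in a topological order of the graph, not in the order you wrote them:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# resource -> resources it depends on (hypothetical resource names)
deps = {
    "service[nginx]": {"package[nginx]", "file[nginx.conf]"},
    "file[nginx.conf]": {"package[nginx]"},
    "package[nginx]": set(),
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # package first, service last, regardless of manifest order
```

With three resources the result is easy to predict; with hundreds of resources across dozens of modules, the effective order is something you discover at run time, which is where the surprises come from.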
Marketing
- Puppet usage is compared to performing the same tasks manually – “Getting rid of the manual deployments”. This is clearly a marketing trick: comparing your tool to the worst possible alternative, not to other tools similar to yours.
- Puppet is compared to bash scripts. Why not Python or Ruby?
- “Automate!” is all over Puppet site. Implies that Puppet is a good automation tool.
- Top 5 success stories / case studies use Puppet Enterprise? Coincidence? I think not 🙂
Many thanks for guidance to Konstantin Nazarov (@racktear). We met at DevOpsDays Moscow 2017, where he offered free guidance lessons for improving speech and writing skills. In reality, the lessons also include productivity tips, which have helped me a lot. Feel free to contact Konstantin; he might have a free weekly slot for you.
Have a productive career!