Why Next Generation Shell?

Background

I’m a systems engineer. The job that I’m doing is also called sysadmin, SRE, DevOps, production engineer, etc. I define my job as everything between a developer’s “It works on my machine” and real life. Common tasks are setting up and maintaining cloud-based infrastructure: networking, compute, databases and other services. Other common tasks are setting up, configuring and maintaining everything inside a VM: disks+mounts, packages, configuration files, users, services. Additional aspects include monitoring, logging and graphing.

The problem

If we take specifically systems engineering tasks such as running a VM instance in a cloud, installing and running programs on a server and modifying configuration files, typical scripting (when no special tools are used) is done in either bash or Python/Ruby/Perl/Go.

Bash

The advantage of bash is that it is domain specific. It is convenient for running external programs and manipulating files.

# Count lines in all *.c files in this directory and below
wc -l $(find . -name '*.c')

# Make sure my_file has the line my_content
echo my_content >my_file

# Run a process and capture the output
out=$(my_process)

The disadvantages of bash are horrible syntax, pitfalls everywhere, awkward error handling, and the absence of many features one would expect from a programming language (such as data structures, named function parameters, etc).

# Inconsistent, awkward syntax,
# result of keeping backwards compatibility
if something;    then ... fi
while something; do ... done

# Can remove / if MY_DIR is not defined
# unless in "set -u" mode
rm -rf "$MY_DIR/"

# Removes files "a" and "b" instead of "a b"
myfile="a b"
rm $myfile

# Silently ignores the error unless in "set -e" mode
my_script

# Function parameters can't be named, they are
# in $1, $2, ... or in $@ and $*
myfunc() {
  FILE="$1"
  OPTION_TO_ENABLE="$2"
  ...
}

Leave bash alone! It was not intended for programming; don’t do anything in bash, just use external programs for everything.

What do you observe in real life? Is bash used as a programming language or is it not?

General-Purpose programming languages

Python/Ruby/Perl/Go are general-purpose programming languages.

The advantage of general-purpose programming languages is their power: better syntax and the ability to handle arbitrary data structures.

orig = [1,2,3]
doubled = [x*2 for x in orig]

The disadvantage of general-purpose programming languages is that they are not, and can not be, as convenient for systems engineering tasks, because they do not focus on this particular aspect of programming (in contrast to bash and other shells, for example).

# Write whole file - too verbose
f = open('myfile', 'w+')
f.write('mycontent')
f.close()

# Run a process and capture the output
# https://docs.python.org/3.5/library/subprocess.html
proc = subprocess.Popen(...)
try:
    outs, errs = proc.communicate(timeout=15)
except TimeoutExpired:
    proc.kill()
    outs, errs = proc.communicate()

Summary

My conclusion is that there is no handy language for systems engineering tasks. On one hand there is bash, which is domain specific but is not a good programming language and does not cover today’s needs; on the other hand there are general-purpose programming languages, which do not specialize in this kind of task.

You can use Puppet, Chef, Ansible, Terraform, CloudFormation, Capistrano and many other tools for common systems engineering tasks. What if your task is not covered by existing tools? Maybe it’s a one-off? Maybe it’s a case where using one of the existing tools is not an optimal solution? You would like to write a script, right? In that case, your life sucks, because scripting sucks. That’s because there is no convenient language with libraries to get systems engineering tasks done with minimal friction and effort.

Solution

I suggest creating a new programming language (with a shell) which is domain specific, like bash, and which incorporates important features of general-purpose programming languages: data structures, exceptions, types, multiple dispatch.

My way of looking at it: imagine that bash was created today, taking into account today’s reality and things that became clear with time. Some of them are:

  • The shell is used as a programming language.
  • A system is usually a set of VMs and APIs, not a single machine.
  • Most APIs return JSON so data structures are needed as multiple jq calls are not convenient.
  • Silently ignoring errors proved to be bad strategy (hence set -e switch which tries to solve the problem).
  • Silently substituting undefined variables with empty strings proved to be bad strategy (hence set -u switch).
  • Expanding $x into multiple arguments proved to be error prone.
  • Syntax matters.
  • History entries without context have limited usefulness (cd $DIR for example: what was the current directory before cd and what was in $DIR ?)
  • UX
    • Spitting lots of text to a terminal is useless as it can not be processed by a human.
    • Feedback is important.
      • Exit code should be displayed by default.
      • An effort should be made to display status and progress of a process.
      • Ideally, something like pv should be integrated into the shell.

I’m not only suggesting the solution I’ve just described. I’m working on it. Please give it a try and/or join in and help develop it: NGS – Next Generation Shell.

NGS LOGO

# Make sure my_file has the line my_content
echo my_content >my_file

# Run a process and capture the output
out=`my_process`

# Get process handle (used to access output, exit code, killing)
p=$(my_process)

# Get process output and parse it, getting structured data
amis=``aws ec2 describe-images --owner self``
echo(amis.len()) # number of amis, not lines in output

# Functional programming support
orig = [1,2,3]
doubled = orig.map(X*2)

# Function parameters can be named, have default values, etc
F myfunc(a,b=1,*args,**kwargs) {
  ...
}

# Create AWS VPC and Gateway (idempotent)
NGS_BUILD_CIDR = '192.168.120.0/24'
NGS_BUILD_TAGS = {'Name': 'ngs-build'}
vpc = AWS::Vpc(NGS_BUILD_TAGS).converge(CidrBlock=NGS_BUILD_CIDR, Tags=NGS_BUILD_TAGS)
gw  = AWS::Igw(Attachments=[{'VpcId': vpc}]).converge(Tags=NGS_BUILD_TAGS)

I don’t think scripting is the right approach.

It really depends on the task, constraints, your approach and available alternative solutions. I expect that situations needing scripting will be with us for a while.

Another programming language? Really? Why the world needs yet another programming language?

I agree that creating a new language needs justification because the effort that goes into creating a language and learning a language is considerable. Productivity gains of using the new language must outweigh the effort of learning and switching.

NGS creation is justified in exactly the same way as many other languages were justified: dissatisfaction with all existing programming languages when trying to solve a specific problem or a set of similar problems. In the case of NGS, the dissatisfaction is specifically with how existing programming languages address the systems engineering niche. NGS addresses this particular niche with a unique combination of features and trade-offs. The productivity of using NGS comes from the best match between the tool and the problems being solved.

Yet another shell? We have plenty already but they all have serious adoption problems.

NGS will be implementing ideas which are not present in other shells. Hopefully, the advantages will be compelling enough to justify switching.

I’ll be just fine with bash/Python/Ruby/Perl/Go

You will. The decision to learn and use a new language depends on your circumstances: how many systems engineering tasks you are doing, how much you suffer, how much easier the tasks will become with NGS, how easily this can be done in your company / on your project and whether you are willing to take the risk.

You could just write a shell based on Ruby or Python or whatever, leveraging all the time and effort invested in an existing language.

I could and I didn’t. Someone else did it for Python and for Scala (take a look, these are interesting projects).

  • I don’t think it’s the right solution to stretch existing language to become something else.
  • NGS has features that can not be implemented in a straightforward way as a library: special syntaxes for common tasks, multiple dispatch.

One could just write a library for Python or Ruby or whatever happens to be his/her favorite programming language, leveraging all the time and effort already invested in an existing language.

In order to be similar to NGS, one would not only have to build a library but also change language syntax. I personally know only two languages that can do that: Lisp (using reader macros) and Perl6 (using grammar facility). These are general-purpose programming languages. Turning them into something NGS-like will be a significant effort, which I don’t think is justified.

PowerShell appears to be similar to what you describe here.

Note that I have very limited experience with PowerShell. The only aspect I definitely like is consistent usage of the $ sigil.

  • It’s probably a matter of taste and what you are accustomed to but I like NGS’ syntax more. PowerShell is pretty verbose.
  • DSC appears to be focused on resources inside a server/VM. NGS plans similar functionality. Meanwhile, NGS uses this approach in the AWS library: vpc = AWS::Vpc(NGS_BUILD_TAGS).converge(CidrBlock=NGS_BUILD_CIDR, Tags=NGS_BUILD_TAGS)

There are libraries for Python that make systems engineering tasks easier.

Right, sh for example. Such a solution can’t be used as a shell; it just improves the experience of calling external programs from Python.


Was this post convincing? Is anything missing to convince you personally? Let me know!

Have a nice day!

Please don’t use Puppet

Thinking process behind choosing a tool

The thinking process behind choosing a tool does not get the attention it deserves. While there are many discussions of the form tool X vs tool Y, there is very little discussion of how one should choose between tools or, in the presumable absence of alternatives, whether one should use the only candidate, tool X, at all. This post will cover a few things to keep in mind when selecting a tool by highlighting a few common problems and fallacies. Puppet will be used as an example tool for consideration.

Focusing on positive parts only

When considering a product or a tool, too often the positive aspects are overestimated and the negative aspects that influence TCO (Total Cost of Ownership) are underestimated or neglected. There are several cognitive biases and logical fallacies involved. Cognitive biases and logical fallacies can be avoided to some extent just by being aware of them. I will be referring to these throughout the post to help you, the reader, become more aware of your thought process, which will hopefully improve it and, consequently, the process of decision making on your part.

Marketing pushes to see the positive

We all know that marketing focuses on positive aspects of a product and neglects to mention downsides. This is specifically mentioned in “False advertising” article under “Omitting information”.

For example, the fact that it’s not convenient to manage Puppet modules (proof: the existence of a tool to do just that) will not appear in marketing materials. You might think that, on the contrary, the existence of Librarian-puppet makes management of these modules easier. It does, but it also brings more complexity to the system: new problems and bugs instead of inhuman manual management of modules.

This post will focus on the negative

While there is more than enough focus on the positive aspects of products, this post will be highlighting the negative aspects in order to strike some balance. There are plenty of marketing materials, but it’s harder to find a list of the problems that you only discover when you are neck-deep in the tool/product. These problems will be listed here. Note that this can not be an exhaustive list, because different situations reveal different problems and this post is based only on my experience and that of several of my friends.

Listing the problems of a tool touches on the Availability heuristic cognitive bias: the easier you recall something, the more “important” it seems. You are bombarded by marketing materials, which are all positive. When considering a tool, your natural flow of thought is “How easily can I remember positive sides of the tool?”, and it’s easy, because you were probably brainwashed already by how good the tool is. Then “How easily can I remember negative sides of the tool?” is much harder. This is not the kind of information that will be pushed to you by the people behind the tool; they have no interest in doing so. Their money goes to advertising how good the tool is, not how bad it is. You can balance your rosy impressions of any tool or product by looking at GitHub issues, digging through StackOverflow for the downsides, or reading posts like this one.

Please, assume that X is the wrong tool for your needs.

As opposed to “yeah, looks good, let’s use it”, this approach leads to a more thoughtful tool selection process. Please read “Prove your tool is the right choice”.

“Everybody uses X”

The thought “Everybody uses X” might have been planted in your brain by marketing efforts. Please analyze the source of that thought carefully. Maybe you have heard about the product from some of your friends and/or colleagues and made a generalization? Maybe people are just stuck with it? Maybe that’s what they know? Did you search for alternatives? Did you try to disprove “Everybody uses X”?

“Everybody uses X, therefore it’s good”

Whether this thought was planted by marketing or not, no, there is no logical connection between the first and the second clauses.

If a lot of people use something, it does tend to get better, as there is more feedback and there are more contributors. It is often implied that therefore X is good. But improvement over time or with a growing user base does not mean X is good enough for any particular use right now.

Did you communicate with the people that use X? Did they tell you it was a good decision? Beware of Choice-supportive bias when you talk to them. Which alternatives did they consider? Are they able to articulate the downsides? Every solution has downsides; being able to recognize them increases the credibility of an opinion about X.

“Everybody uses X, we should use X”

Yes, if you consider the value of “then we can blog about it and be part of the hype, possibly getting some traction and traffic”. This might have some estimated value, which should be weighed against the cost incurred by choosing an otherwise unneeded or inferior tool or technology. You can point your bosses to this paragraph, along with your estimate of the costs of using tool X vs better alternatives (which might simply be not using it and coding the needed functionality yourself, for example; the comparison is valid both for X vs Y and for X vs no X).

No, “We should use X” does not logically follow from “Everybody uses X”. Beware of conformity bias.

“Company C uses X”

This piece of information, when served by the vendor of X, implies that company C knows better and you should use X too.

Company C is a big and respectable company with smart engineers. The vendor of X will gladly list big and reputable companies that use X. That’s a use of “Argument from authority”.

Again, there is no straight logical path between “C uses X” and “we should use X too”.

Chances are that company C is vastly different from your company and their circumstances and situation are different from yours.

Company C can also make mistakes. You are unlikely to see a blog post from vendor of X that is titled “Company C realized their mistake and migrated from X”.

Claims of success with tool X

Treat claims of successful usage of tool X with caution. Searching quickly for “measuring project success” reveals the following dimensions to be looked at when estimating a success of a project:

  • Cost
  • Scope
  • Quality
  • Time
  • Team satisfaction
  • Customer satisfaction

The claims of successful usage of tool X carry almost no information regarding what really happens. “We are using Puppet successfully” might mean (when taken to the extreme) that for 100 servers and one deploy per day the following applies:

  • Cost: There is a dedicated team of five costly operations people that work just on Puppet because it’s complex.
  • Scope: Puppet covers 80% of the needs; this might be the only dimension looked at when claiming success.
  • Quality, Team satisfaction: This team is constantly cursing because of bugs and module or Puppet upgrade issues, such as “Upgrade to puppet-mysql 3.6.0 Broke My Manifest” (fixed in just two months!) or the “puppet 4.5.0 has introduced a internal version of dig, that is not compatible to stdlib’s version” oopsie.

    Enjoy the list of regression bugs. It’s hard to blame Puppet developers for these bugs because these kinds of issues are natural for projects of this size and complexity. What I suggest is that creating your own domain-specific language, one which is not a real programming language, for a configuration management tool is a bad idea. I’ll elaborate on this point in a bit, in the “Puppet DSL” section.

  • Time: It took the above team 6 months to implement Puppet. The time to implement any feature is unpredictable because of complexity and unexpected bugs along the way.
  • Customer satisfaction: Given all of the above it’s hard to believe in any kind of satisfaction with what’s going on.

It’s also worth keeping in mind that any demonstrated success, even a real one, does not mean that the same solution will be equally applicable to your situation, because your situation is almost certainly different on one or more dimensions: time, budget, scope (the problem you are solving), skills, requirements.

“But X also provides feature F”

I am sure that the advertisements will mention all the important features as well as “cool” features. Do you really need F?

When choosing a tool, the thought “But X also provides feature F” might be dangerous if F is not something you immediately need. One might think that F might be needed later. This might be the case, but what are the odds? What’s the value of F to you, and how much would it cost to implement it using another tool or to write it yourself? Also, consider the “horizon”. If you might need feature F in 3 years, in many situations this should be plainly ignored. In 3 years there might be another tool for F, or you might have switched from X to something else for other reasons by then.

Suppose there is another tool, X2, which is an alternative to X. X2 does not provide F, but its estimated TCO over a year is 50% less than that of X. You should consider the costs, because it might be that using X2 for the first year and then switching to X, including the switching costs, is cheaper.

Putting tools before needs

“So, there is a new trendy, hypey tool X. How can we use it?” is typically a bad start. At the very least it should be “So, there is a new trendy, hypey tool X. Do we have any problems where X would be a better alternative?”

Ideally the approach would be “We have problem P, which alternative solutions do we have?”. P might be some inefficiency or desired functionality. Solutions, once again, do not have to mean existing tools.

Puppet – the good parts

I will quickly go over a few good parts because I want this post at least to try to be objective.

Convergence

Convergence is an approach that says one should define the desired state, not the steps to be taken to get there. The steps are abstracted away and on each run the system will try to achieve the desired state as closely as possible.

I do agree that declaring a resource such as a file, user, package or service and its desired state is a good approach. It’s concise and it’s usually simpler than specifying the operations that would lead to the desired state, like regular scripts do. This idea manifests in many other tools too: Chef, Ansible, CloudFormation, Terraform.
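
To make the idea concrete, here is a minimal sketch in Python; the converge_file helper and the file path are made up for illustration and are not part of any of the tools mentioned:

from pathlib import Path

def converge_file(path, content):
    p = Path(path)
    if p.exists() and p.read_text() == content:
        return 'unchanged'   # already in the desired state, nothing to do
    p.write_text(content)    # otherwise bring the file to the desired state
    return 'changed'

converge_file('/tmp/motd', 'welcome\n')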

Appropriate in some situations

  • Think about a startup where someone does the systems engineering job part-time and is not a professional. As Guy Egozy pointed out, there are situations, such as startups with limited resources and basic needs, where using a configuration management tool might make more sense than in other situations.
  • Urgent demo with all defaults: if you have good control of the tool and you know that you need some very specific functionality, say a wordpress+mysql demo tomorrow, it is probably worth preparing the demo with Puppet or Chef. There is still a danger, of course, that the module you were using a month ago has now changed and you need to invest additional time to make things work. Or maybe the module is just broken now.

Multiple platforms support

In my experience, the chances that you will be running the same applications on, say, Windows and Linux are pretty slim. The overlap of installed software on different platforms is likely to be infrastructure tooling only (monitoring, graphing, logging). Is it really worth the price?

Puppet DSL

Puppet class

Puppet DSL has a concept of “class” which has nothing to do with classes in programming languages. At least in retrospect it was not such a good idea, especially considering operations people trying to explain Puppet classes to developers.

Limited DSL limitations 🙂

Acknowledged as a problem by facts

Limitations of the DSL were, in my opinion, acknowledged by the actions of Puppet’s developers and contributors: over time the DSL kept gaining features of a real programming language.

Limited DSL is not a great idea!

I do understand why a limited DSL can be aesthetically and mathematically appealing. The problem here is that life is more complex than a limited DSL allows for. What could be 10 lines of real code turns into 50 lines of ugly copy+paste and/or hacks around the DSL’s limitations.

It sounds reasonable that at the time when CFengine and Puppet were created there were not enough examples of shortcomings of limited DSLs and their clashes with real life. Today we have more:

  • Puppet – DSL failure admitted by actions, as discussed above.
  • Ansible – just looks bad. Some features look like they were torn from a programming language and forced into YAML.
  • Terraform – often generated because well … life. This one is more of a configuration language by design. This approach has pros and cons when applied to infrastructure.
  • CloudFormation – 99% configuration format and 1% language; that’s why it’s generated in all except trivial cases. You do have the alternative of not generating the CloudFormation input file but providing custom resources, which use AWS Lambda functions instead; they will do some of the work. While this fits the CloudFormation model perfectly and makes CloudFormation much more powerful, I would really prefer a script over inversion of control plus an additional AWS service (Lambda) which I have to use – one more thing that can go wrong or just be unavailable when needed the most.

I do not agree that Terraform should be limited the way it is, but in my opinion Terraform and CloudFormation are more legitimately limited, while Puppet and Ansible are just bad design. This limitation by design causes complex workarounds which are costly and sometimes fragile, not to mention the mental well-being of the systems engineers who are working with Puppet.

We can all stop creating domain-specific languages for configuration management that are not built on top of real programming languages. Except for a few cases, that’s a bad idea. We can admit it instead of perpetuating the wishful thinking that reality is simple and a limited DSL can deal with it somehow.

Puppet modules

Dependencies between Puppet modules

Plainly a headache. Modules have dependencies on other modules and so on. Finding compatible module versions is a hard problem. That’s why we have Librarian-puppet. As I mentioned above, it has its own issues.

There are also issues that Librarian-puppet can not solve, which are inherent to a system of this scale, complexity and number of contributors. Let’s say you have module APP1 that depends on module LIB and module APP2 that also depends on LIB. Pinning the version of module LIB because APP1 has a bug can prevent you from upgrading module APP2, which in newer versions depends on a newer LIB. This is not an imaginary scenario but real-life experience.

Breakage of Puppet modules

Another aspect is that in this complex environment it’s somewhere between hard and impossible for any module maintainer to make sure his/her changes do not break anything. Therefore, they do break.

Popular community modules deal with so many cases and operating systems that breakage of some functionality is inevitable.

Community modules

There is this idea that is kind of in the air: “you have community modules for everything, if you are not using them you are incompetent and wasting your time and money”.

This could come from 3 sources:

  • Marketing
  • People that use community modules for simple cases, where they work fine
  • People that underestimate the amount of maintenance work required to make community modules work for your particular case.

The feedback that I’ve got several times from different sources is that if you are doing anything serious with your configuration management tool, you should write your own modules; fitting community modules to your needs is too costly.

Graph dependencies model problems

Do you know people who think in dependency graphs? Most people that I know are much more comfortable thinking about a sequence of items or operations to perform. Thinking about dependency graphs, such as reasoning about package version compatibility, usually requires noticeable, significant mental effort, often accompanied by curses.

The Puppet team admitted (again, by actions) that this is a problem: they introduced the ordering configuration setting and at some point made “manifest” ordering the default. Note that this ordering applies only to resources without explicit dependencies and within one manifest.

The graphs are somewhat implicit. This causes surprises and consequent WTFs. Messages about dependency errors are not easily understood.

Marketing

  • Puppet usage is compared to manual performance of the same tasks – “Getting rid of the manual deployments”. This is clearly a marketing trick: comparing your tool to the worst possible alternative, not to other tools which are similar to yours.
  • Puppet is compared to bash scripts. Why not to Python or Ruby?
  • “Automate!” is all over the Puppet site, implying that Puppet is a good automation tool.
  • The top 5 success stories / case studies use Puppet Enterprise? Coincidence? I think not 🙂

Thanks

Many thanks for guidance to Konstantin Nazarov (@racktear). We met at DevOpsDays Moscow 2017 where he offered free guidance lessons for improving speech and writing skills. In reality, lessons also include productivity tips which help me a lot. Feel free to contact Konstantin, he might have a free weekly slot for you.


Have a productive career!

About declarative frameworks and tools

This post is a reply to a “just use Terraform” recommendation I’ve just seen. I hope more people will benefit from my perspective if it’s posted here. There is plenty of marketing behind most of the tools I mention here. It’s all rosy; see the “Life before Puppet” video. Let’s balance this marketing bullshit a bit.

Think twice before using declarative framework/tool

Terraform, CloudFormation, Puppet and Chef, like any other declarative frameworks/tools, take control away from you. They work fine for “hello world” examples. Then there is real life, where you need something these frameworks did not anticipate, and you are sorry you did not code everything yourself from the start. Now you are stuck with these tools and you will be paying for it with your time and money. Working around the limitations of such tools is a pain.

I am using CloudFormation and have used Puppet and Chef in the past. These tools do have their place. In my opinion it’s a very limited set of scenarios. Terraform, CloudFormation, Puppet and Chef are used much more widely than they should be.

These tools have some value, but too often people neglect the cost, which in many cases outweighs the value. Most of the cost comes from inflexibility. Terraform and CloudFormation are so limited that people frequently use yet another tool to generate their input. That adds another bit to the cost.

I hear frequently from a friend (sorry, can’t name him) how much his team suffers from Terraform’s inflexibility. The inflexibility can not be fixed because it’s a declarative framework. Unfortunately, they are so invested in Terraform that they will continue to spend hundreds of hours fighting it. Chef is causing trouble there too; community Cookbooks proved to be a mismatch for the needs and sanity of the engineers there.

… and there is this gem

A key component of every successful Puppet implementation is access to a knowledgeable support team

That’s from https://puppet.com/support-services/customer-support/support-plans

Are you sure you want to use Puppet? Apparently you can’t do it well without their support… Just saying…

Is one of these tools right for you?

Regular considerations for choosing a tool apply. See my older post “Prove your tool is the right choice”.

Expected replies and my replies to those

You don’t get it.

OK

You don’t understand these tools.

OK

You are not using these tools right / as intended.

OK

Are you crazy? You want to code everything yourself?

Let’s take it to the extreme: no new code should be written. No libraries, no frameworks. Because everything already exists. Sounds about right.

People smarter than you have figured it all out, use their tools

Smarter people don’t always produce better solutions or solutions that fit your use case. Most of the time smart people will produce smart solutions… and then there are people that don’t usually think in graphs and are really puzzled when debugging Puppet cyclic dependency errors for example.

Most of the code you need is already written, don’t waste time and money, use it! Community Cookbooks and modules are great!

This is marketing bullshit. Don’t buy it! It’s often more expensive to adopt code that does not meet your exact needs and is much more complex than you need (because it has to support multiple platforms and use cases) than to write your own. I have seen usage of community Cookbooks/modules followed by suffering, followed by an in-house rewrite or fork.

Don’t you care about the next guy? Work with standard tools!

Let’s do some math. A team of two works for a year. They are (a very modest estimate) 10% more productive because they have coded whatever they needed and were not fighting the tools; two people times roughly 1,800 work hours a year times 10% is about 360 hours saved. Even if we wrongfully assume that a custom solution is harder to understand for the third person who joins the team after a year, how much harder is it? Is it more than 300 hours harder?

Update following responses on Reddit

2017-04-28

2 totally different toolsets – infrastructure orchestration (Terraform, Cloudformation), and Configuration Management (Puppet, Chef)… — (/u/absdevops)

Yes. What is common to all these tools is the declarative style and the way they are used: these tools are typically run using a CLI.

All these tools have three axes that I consider:

  1. “Input” axis: What’s the input of these tools?
    1. Configuration format
    2. Half-baked programming language that was probably never intended to be a programming language
    3. Real programming language
  2. “Calling” axis: framework vs library (typical usage)
  3. “TCO” axis: TCO vs other solutions, especially vs the other solution that is always available – coding the subset of the functionality that you need yourself

I’d like to make sure that it’s clear that the tools mentioned in this article have different positions on the 3 axes and are not equal in the value they provide you in your specific situation.

The main point of the article is that while these tools differ on axes 1 and 3, they are all limiting because, conceptually, they are all frameworks. You pass your execution into the tool and it does a lot. Here is where you lose your flexibility, as opposed to using a library. You have relatively little control over what happens inside the tool.

I must strongly disagree with Terraform being put in the list – its a great base tool with limitations that can be worked around. — (/u/absdevops)

I don’t want to work around limitations. It seems to be the norm for these tools. I’d rather have a library that misses parts that I’d code myself. Working around limitations in my opinion is generally much worse than missing functionality (depends on specific circumstances of course).

Regarding inflexibility – it’s probably the most flexible tool of the bunch — (/u/absdevops)

Please note that we are still comparing tools that all use a limiting paradigm: frameworks.

I will also duel anyone to the death for preference of Cloudformation syntax to Terraform — (/u/absdevops)

We are talking about the “Input” axis I mentioned above. Yes, Terraform syntax, apart from being more aesthetically pleasing, is somewhat closer to “half-baked programming language that was probably never intended to be a programming language”, while CloudFormation is somewhat closer to “configuration format”.

I totally disagree with points made about having to generate Terraform manifests. … generate what you need specifically, and hand it off to Terraform, much like making an API call to a library. — /u/SlinkyAvenger

There is a huge difference between the amount of work done by a typical API call and what these tools do once you call them. With more granular API calls, you decide if and when you make specific calls and what you do in between them – it’s much more flexible.

I’m also a big proponent of Puppet — /u/SlinkyAvenger

One of the low-value tools from my perspective. I’ll explain. On the “Input” axis, it’s a half-baked programming language: better than a configuration file, but it still loses to Chef, for example. On the “TCO” axis, I really think that Puppet and Chef are not good alternatives to custom scripts in most cases. Scripts, by the way, also win on the “Calling” axis, which means flexibility.

I’d really like to hear what you’re honestly going back to puppet support for. — /u/neoghostz

We don’t. When we suffered while working with Puppet, we knew that support would not solve our problems. Crappy community modules can not be fixed by support. Breakage on module version updates – same. Librarian, more complexity on top of complexity – same. The above quote about support (“A key component of every successful Puppet implementation is access to a knowledgeable support team”) was just to highlight that the folks at Puppet think people can’t use it without support. This is just a humorous point and not really important.

What is the point of this article? It basically dumps on Terraform, CF, Puppet, Chef, etc., but offers no actual criticism (other than a vague ‘it takes away control’ statement) or, perhaps more importantly, alternatives. — /u/cryonine

The point is that all these tools would have been better if they were implemented as libraries on top of real programming languages, where you call the parts that you need instead of making one “do everything” call.

With the exception of Chef, these tools use either configuration files as input or configuration-file-almost-a-programming-language format. It’s always the same path:

  1. We need a small, limited DSL; it’s so academically beautiful, we can prove theorems about it.
  2. Oh wait, there are real-world scenarios where it’s not enough; damn these complaining engineers.
  3. Let’s add a stdlib.
  4. Let’s add proper loops.
  5. Now we have a half-baked programming language.

Elaborating on taking control away from you: you get convoluted things as a result.

Alternatives

For Puppet and Chef, I have not seen a single system where my estimated TCO of these tools would be better than that of the bunch of idempotent, modular bash scripts which I use. It did not take much time to write them. Some Python is used for configuration generation (JSON / Jinja templates + environment data), as sketched below.
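
A minimal sketch of that kind of configuration generation; the file names and the use of Jinja2 here are assumptions for illustration, not the actual scripts:

import json
from jinja2 import Template

env_data = json.load(open('env/production.json'))             # environment data
template = Template(open('templates/nginx.conf.j2').read())    # configuration template
open('out/nginx.conf', 'w').write(template.render(env_data))  # rendered configuration file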

With CloudFormation and Terraform it’s not that simple. I’m mostly amazed that nobody makes libraries which would just provide declarative primitives, as opposed to frameworks where you feed in everything you need via one call. I am working on one, but it is really strange to me that I haven’t already heard of such a library for Python or Ruby.

Terraform … vastly superior to any other alternate out there — /u/cryonine

Not sure I agree 100%, because it depends on the situation, but I can imagine many situations where it’s correct. The important thing here is that I think all the current alternatives are not that good.

How is it wrongful to assume that a custom solution is harder to understand? That’s completely accurate. — /u/cryonine

A custom solution is simpler. Do you really need documentation for 19 lines of bash code that install Nginx and another 29 that do a restart that handles leaking file descriptors? You will definitely need documentation for a 2000+ line Chef cookbook or Puppet module that installs Nginx and … oh wait… how do I reload Nginx and then conditionally (if enough file descriptors have leaked) restart it? Time to dive in 🙂

I do imagine how a custom solution can be complicated (read: harder to maintain and higher TCO) if done by unprofessional people. In some cases it might be better for them to use a framework. On the other hand, they might get stuck when trying to do something advanced with the framework. It really depends on the situation.

While “use standard tools” generally sounds right, I have seen too many convoluted solutions using “standard tools” because of their inflexibility. People were trying to work around the limitations. Comparing the top-down execution of a simple script to the workarounds for these tools, it’s much simpler to wrap your head around the scripts. I recently handed one of my clients over to the next guy. I asked him how he was doing and he told me that he was happy to have a simple custom solution instead of complex frameworks. TCO has many components. Choosing “standard tools” does not always outweigh the other aspects.

 


Have a nice day and a productive life!

NGS unique features – exit code handling


How do other languages treat exit codes?

Most languages that I know do not care about exit codes of processes they run. Some languages do care … but not enough.

Update / Clarification / TL;DR

  1. Only NGS can, out of the box, throw exceptions based on fine-grained inspection of the exit codes of the processes it runs. For example, exit code 1 of test will not throw an exception while exit code 1 of cat will, by default. This allows writing correct scripts which do not have explicit exit code checking and are therefore smaller (meaning better maintainability).
  2. This behaviour is highly customizable.
  3. In NGS, it is OK to write if $(test -f myfile) ... else ... which will throw an exception if the exit code of test is 2 (test expression syntax error or alike), while in bash and other languages, for example, you should explicitly check and handle exit code 2, because a simple if can not cover the three possible exit codes of test (zero for yes, one for no, two for error). Yes, if /usr/bin/test ...; then ...; fi in bash is incorrect! By the way, have you seen scripts that actually check for the three possible exit codes of test? I haven’t.
  4. When the -e switch is used, bash can exit (somewhat similarly to an uncaught exception) when the exit code of a process that it runs is not zero. This is neither fine-grained nor customizable.
  5. I do know that exit codes are accessible in other languages when they run a process. Other languages do not act on exit codes, with the exception of bash with the -e switch. In NGS, exit codes are translated to exceptions in a fine-grained way.
  6. I am aware that $? in the examples below shows the exit code of the language’s own process, not the process that the language runs. I’m contrasting this with bash (-e) and NGS behaviour (an exception causes NGS to exit with a non-zero exit code).

Let’s run the “test” binary with incorrect arguments.

Perl

> perl -e '`test a b c`; print "OK\n"'; echo $?
test: ‘b’: binary operator expected
OK
0

Ruby

> ruby -e '`test a b c`; puts "OK"'; echo $?
test: ‘b’: binary operator expected
OK
0

Python

> python
>>> import subprocess
>>> subprocess.check_output(['test', 'a', 'b', 'c'])
... subprocess.CalledProcessError ... returned non-zero exit status 2
>>> subprocess.check_output(['test', '-f', 'no-such-file'])
... subprocess.CalledProcessError: ... returned non-zero exit status 1

bash

> bash -c '`/usr/bin/test a b c`; echo OK'; echo $?
/usr/bin/test: ‘b’: binary operator expected
OK
0

> bash -e -c '`/usr/bin/test a b c`; echo OK'; echo $?
/usr/bin/test: ‘b’: binary operator expected
2

I used /usr/bin/test in the bash examples to make them comparable to the others, avoiding bash’s built-in test.

Perl and Ruby, for example, do not see any problem with a failing process.

Bash does not care by default but has -e switch to make non-zero exit code fatal, returning the bad exit code when exiting from bash.

Python can differentiate zero and non-zero exit codes.

So, the best we can do is distinguish zero and non-zero exit codes? That’s just not good enough. test, for example, can return 0 for a “true” result, 1 for a “false” result and 2 for an exceptional situation. Let’s look at this bash code with an intentional syntax error in the “test” arguments:

if /usr/bin/test --f myfile;then
  echo OK
else
  echo File does not exist
fi

The output is

/usr/bin/test: missing argument after ‘myfile’
File does not exist

Note that the -e switch wouldn’t help here. Whatever follows if is allowed to fail (it would be impossible to do anything if -e affected if and while conditions).
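
For comparison, here is a minimal Python sketch of what explicitly handling all three exit codes of test would look like; this is the kind of boilerplate that is almost never actually written:

import subprocess

rc = subprocess.call(['test', '-f', 'myfile'])
if rc == 0:
    print('OK')
elif rc == 1:
    print('File does not exist')
else:
    # 2 (or anything else): test itself failed, e.g. a syntax error in its arguments
    raise RuntimeError('test failed with exit code %d' % rc)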

How does NGS treat exit codes?

> ngs -e '$(test a b c); echo("OK")'; echo $?
test: ‘b’: binary operator expected
... Exception of type ProcessFail ...
200

> ngs -e '$(nofail test a b c); echo("OK")'; echo $?
test: ‘b’: binary operator expected
OK
0

> ngs -e '$(test -f no-such-file); echo("OK")'; echo $?
OK
0

> ngs -e '$(test -d .); echo("OK")'; echo $?
OK
0

NGS has easily configurable behaviour regarding how to treat exit codes of processes. Built-in behaviour knows about false, test, fuser and ping commands. For unknown processes, non-zero exit code is an exception.

If you use a command that returns a non-zero exit code as part of its normal operation, you can use the nofail prefix as in the example above, customize NGS behaviour regarding the exit code of your process, or, even better, make a pull request adding it to the stdlib.

How easy is it to customize exit code checking for your own command? Here is the code from the stdlib that defines the current behaviour; decide for yourself (skipping nofail, as it’s not something an average user is typically expected to do).

F finished_ok(p:Process) p.exit_code == 0

F finished_ok(p:Process) {
    guard p.executable.path == '/bin/false'
    p.exit_code == 1
}

F finished_ok(p:Process) {
    guard p.executable.path in ['/usr/bin/test', '/bin/fuser', '/bin/ping']
    p.exit_code in [0, 1]
}

Let’s get back to the bash if test ... example and rewrite it in NGS:

if $(test --f myfile)
    echo("OK")
else
    echo("File does not exist")

… and run it …

... Exception of type ProcessFail ...

For if purposes, a zero exit code is true and any non-zero exit code is false. Again, this is customizable. Such exit code treatment allows the if ... test ... NGS example above to function properly, somewhat similarly to bash but with exceptions when needed.

NGS’ behaviour makes much more sense for me. I hope it makes sense for you.

Update: Reddit discussion.


Have a nice weekend!

NGS unique features – execute and parse

I am developing a shell and a language called NGS. I keep repeating it’s domain specific. What are the unique features that make NGS most suitable for today’s system administration tasks (a.k.a “Operations” or hype-compatible word “DevOps”)?

This post is the first in a series that shows what makes NGS unique.


Execute and parse operator

The execute-and-parse operator … executes a command and parses its output. This one proved to be central when working with the AWS API. Citing the ec2din.ngs demo script:

``aws ec2 describe-instances $*filters``

The expression above returns a data structure. The command is run, the output is captured and then fed to the parse() method. Whatever the parse() method returns is the result of the ``exec-and-parse`` syntax expression above.

Built-in parsing

By default, NGS parses any JSON output when running a command using ``exec-and-parse`` syntax. (TODO: parse YAML too)

In the case of AWS CLI commands, additional processing takes place to make the data structure coming out of the exec-and-parse operator more useful:

  1. The top level of AWS responses is usually a hash that has one key whose value is an array: {"LoadBalancerDescriptions": [NGS, returns, this] } . While I can guess a few reasons for such a format, I find it much more useful to have an array as the result of running an AWS CLI command, and that’s what NGS returns when you run ``aws ...`` commands.
  2. Specifically for aws ec2 describe-instances, I’ve removed the annoyance of having a Reservations list with instances as sub-lists. NGS returns a flat list of instances (a rough Python equivalent of this flattening is sketched below). Sorry, Amazon, this is much more productive.
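
For reference, here is roughly what that unwrapping and flattening looks like when done by hand in Python (assuming the AWS CLI is installed and configured):

import json
import subprocess

raw = json.loads(subprocess.check_output(['aws', 'ec2', 'describe-instances']))
# Each reservation has its own Instances sub-list; flatten them into one list.
instances = [i for r in raw['Reservations'] for i in r['Instances']]
print(len(instances))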

Customizable parsing

What if you have your own special command with its own special output format?

Parsing is customizable by defining your own parse(s:Str, hints:Hash) method implementation. That means you can define how your command’s output is parsed.

No parsing

Don’t want parsed data? No problem, stick with the `command` syntax instead of ``command``. If you do need the data structure at some point, you can use `command`.decode_json(), for example.

Why exec-and-parse is an operator?

Why would adding an exec_parse() function not be sufficient?

  1. Execute-and-parse is a common operation in system tasks, so it should be short. NGS takes the pragmatic approach: the more common the operation, the shorter the syntax.
  2. Execute-and-parse should look similar to the `execute-and-capture-output` syntax, which already existed when I was adding execute-and-parse.
  3. Making it an operator allows the command to be written in “commands syntax” (a bit bash-like), which is a better fit.

“I can add this as a function to any language!”

Sure but:

  1. Your chances of getting the same brevity are not very good.
  2. Making exec-and-parse as flexible as it is in NGS would be an additional effort in other languages.
  3. ``some-command arg1 arg2`` – would it be exec_parse(['some-command', 'arg1', 'arg2']) ? How do you solve the syntax of the passed command? The array syntax does not look good here, and not many languages will allow you to have special syntax for commands to be passed to exec_parse(). (A rough Python sketch of such a function follows this list.)
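
Here is a rough Python sketch of such an exec_parse() function, JSON-only, just to illustrate the call-site syntax issue; it is not meant as a full equivalent of the NGS operator:

import json
import subprocess

def exec_parse(argv):
    out = subprocess.check_output(argv)
    return json.loads(out)    # NGS additionally dispatches on the command to pick a parser

amis = exec_parse(['aws', 'ec2', 'describe-images', '--owner', 'self'])
print(len(amis['Images']))    # raw JSON is still wrapped in the top-level "Images" key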

If your language is not domain-specific for system tasks, adding exec-and-parse to it will be a task with dubious benefit.

What the extreme opposite looks like

I just came across a build configuration file of Firefox: settings.gradle (sorry, I could not find a link to this file on the web in a sane amount of time). Here is an excerpt, with lines wrapped for convenience.

def commandLine = ["${topsrcdir}/mach", "environment", "--format",
    "json", "--verbose"]
def proc = commandLine.execute(null, new File(topsrcdir))
def standardOutput = new ByteArrayOutputStream()
proc.consumeProcessOutput(standardOutput, standardOutput)
proc.waitFor()

...

import groovy.json.JsonSlurper
def slurper = new JsonSlurper()
def json = slurper.parseText(standardOutput.toString())

...

if (json.substs.MOZ_BUILD_APP != 'mobile/android') {
...
}

Here is how roughly equivalent code looks in NGS (except for the new File(topsrcdir) which I don’t understand):

json = ``"${topsrcdir}/mach" environment --format json --verbose``
...
if json.substs.MOZ_BUILD_APP != 'mobile/android' {
...
}

Yes, there are many languages where exec-and-parse functionality looks like something in between Gradle and NGS. I don’t think there is one that can do what NGS does in this regard out of the box. I’m not saying NGS is better than other languages for all tasks. NGS is aiming to be better at some tasks. Dealing with I/O and data structures is definitely a target area.


Have a nice day!

Declarative primitives or mkdir -p for the cloud

After some positive feedback regarding the concept of declarative primitives, I would like to elaborate on it.

Defining declarative primitives

“Declarative primitives” is just a description of existing techniques. I gave it a name because I’m not aware of any other term describing these techniques. The idea behind the declarative approach is to describe the desired state or result, not the particular commands or operations needed to achieve it.

Example: mkdir -p dir1/dir2/dir3

The outcome of the command does not depend on the current state (whether the directories exist or not). You describe the desired state: directories dir1, dir2 and dir3 should exist after the command is run. Note that plain mkdir dir1/dir2/dir3 does not have the same effect: it fails if dir1 or dir2 does not exist, or if dir3 already exists.
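
The same distinction exists inside general-purpose languages; in Python, for example:

import os

# Declarative primitive: describe the desired state; safe to run any number of times.
os.makedirs('dir1/dir2/dir3', exist_ok=True)

# Imperative counterpart: os.mkdir('dir1/dir2/dir3') would fail if dir1/dir2 is
# missing or if dir3 already exists; the outcome depends on the current state.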

The phrase “declarative primitives” emphasizes granularity. Existing declarative tools for the cloud operate on many described resources, build dependency graphs and run in an order that they decide. Declarative primitives provide a very flexible way to control a single resource or a small group of resources of the same type. The flexibility comes from the granularity. You decide how you combine the resources. You can easily integrate existing resources. You can modify just the properties you are interested in on the resources you choose. This approach is ideal for scripting, in my opinion.

Where are declarative primitives for the cloud?


I believe that when writing a script, using mkdir -p should be similar to using AwsElb(...).converge() for example. I’m working on implementing it (as a library for the Next Generation Shell) and I’m not aware of any other project that does it.

There are many projects for managing the cloud, how are they different?

Here are the solutions that I’m aware of and how familiar I am with each one:

  1. CloudFormation – using it frequently (I prefer the YAML syntax for it)
  2. Terraform – I’ve read the documentation and bits of source code
  3. Cloudify – familiar with the product, made modules for it
  4. Puppet – was using it intensively on a few different projects
  5. Chef – was using it intensively in many projects
  6. Ansible – unfamiliar with this one (only took a look at documentation) so not reviewing it below
  • All take the declarative approach. You describe many resources or the entire system and feed the description to the tool which in turn does all the work. None of these solutions was designed to provide you with the primitives that could be easily used in your scripts. These tools just don’t match my view regarding scripting.
  • These tools can do a rollback on error for example. They can do that precisely because they have the description of the entire system or big parts of it. It will take some additional work to implement rolling back using declarative primitives. The question is whether you need the rollback functionality …
  • Some of these tools can be made to work with different clouds relatively easily. Working with different clouds easily may also be possible with declarative primitives, but the library I’m currently working on does not have such a goal.
  • Except for Chef, the tools in the list above use formats or DSLs not based on real programming languages. This means that, except for trivial cases, you will be using some additional tool to generate the descriptions of the desired states. Limited DSLs do not work. See Puppet and Ansible, which started with simple description languages and are now almost real programming languages … which were never designed as programming languages, which has consequences.
  • I’m not aware of any option in the tools above that lets you view the definitions of existing resources, which prevents you from starting to manage existing resources with these tools and from cloning existing resources. I have started implementing functionality that lets you generate the script that would build an existing resource: SomeResource(...).code() . This will allow easy modification or cloning.
  • A feature missing both from these tools and from my library is generating code to start with for a given resource type (say, a security group or a load balancer). Writing a CloudFormation definition for a type with many properties is a nightmare. Nobody should start from scratch. Apache or Nginx configuration files are a good example of starting points. Something similar should be done for cloud resources.
  • Note that Chef and Puppet were originally designed to manage servers. I don’t have any experience using them for managing the cloud but I can guess it would be less optimal than dedicated tools (the first three tools).

Scripting the cloud – time to do it right!

Why CloudFormation is better than Chef and Puppet

Strange comparison, I know.


Scripting vs declarative approaches

The aspect I’m looking at is scripting (a.k.a. imperative programming) vs the declarative approach. In many situations I choose the scripting approach over the declarative one because the downsides of the declarative approach outweigh its benefits in the situations that I have.

Declarative approach downsides

What are the downsides of Chef, Puppet and other declarative systems? The main downsides are complexity and more external dependencies. These lead to:

  1. Fragility
  2. More maintenance
  3. More setup for anything except for the trivial cases

I can’t stress enough the price of complexity.

Declarative approach advantages

When the imperative approach would mean too much work, the declarative approach has the advantage. Think of SQL statements: it would be an enormous amount of work to code their effect by hand each time. Let’s summarize:

  1. Concise and meaningful code
  2. Much work done by small amount of code

Value of tools

I judge the value of tools by their TCO.

Example 1: making sure a file has specific content. It could be as simple as echo my_content > my_file in a script or it could be as complex as installing Chef/Puppet/Your-cool-tool-du-jour server and so on…

Example 2: making sure that a specific load balancer (AWS ELB) is set up. It could be done by writing a script that uses the AWS CLI or by using declarative tools such as CloudFormation or Terraform (I haven’t used Terraform myself yet). Writing a script to idempotently configure the security groups and the load balancer and its properties is much more work than the echo ... from the previous example.

While the TCO greatly depends on your specific situation, I argue that the tools that reduce larger amounts of work, such as in example 2, are more likely to have better TCO in general than tools from example 1.

“… but Chef can manage AWS too, you know?”

Yes, I know… and I don’t like this solution. I would like to manage AWS from my laptop or from a dedicated management machine, not from wherever the Chef client runs. Also (oh no!), I don’t currently use Chef, and bringing it in just for managing AWS does not seem like a good idea.

Same for managing AWS with Puppet.

Summary

Declarative tools will always bring complexity and it’s a huge minus. The more complex the tool the more work it requires to operate. Make sure the amount of work saved is greater than the amount of work your declarative tool requires to operate.

Opinion: we can do better

I like scripting solutions for their relative simplicity (when the scripts are written professionally). I suggest a combined approach. Let’s call it “declarative primitives”.

Imagine a scripting library that provides primitives such as AwsElb, AwsInstance and AwsSecGroup. Using these primitives does not force you to give up flow control. No dependency graphs. You are still writing a script. Minimal complexity increase over regular scripting.

Such a library is under development. An additional advantage of this library is that the whole state is kept in the tags of the resources. Other solutions have additional state files, and I don’t like that.

Sample (NGS language) censored code that uses the library follows:

my_vpc_ancor = {'aws:cloudformation:stack-name': 'my-vpc'}

elb = AwsElb(
    "${ENV.ENV}-myservice",
    {
        'tags': %{
            env ${ENV.ENV}
            role myservice-elb
        },
        'listeners': [
            %{
                Protocol TCP
                LoadBalancerPort 443
                InstanceProtocol TCP
                InstancePort 443
            }.n()
        ]
        'subnets': AwsSubnet(my_vpc_ancor).expect(2)
        'health-check': %{
            UnhealthyThreshold 5
            Timeout 5
            HealthyThreshold 3
            Interval 10
            Target 'SSL:443'
        }.n()
        'instances': AwsInstance({'env': ENV.ENV, 'role': 'myservice'}).expect()
    }
)

elb.converge()

It creates a load balancer in an already existing VPC (which was created by CloudFormation) and connects existing instances to it. The example is not complete, as the library is a work in progress, but it does work.


Have fun and watch your TCO!