How fucked is AWS CLI/API

In the 21st century, we can do significantly better than this crap of an AWS CLI. We just need to think from the users’ perspective for a change, and not just dump on users whatever we were using internally or saw in a nightmare and then implemented.

[Image: Think from the users’ perspective]

I’m working on a solution, which is described towards the end of the post. It’s not ready yet, but the (working, by the way) examples will convince the reader that “significantly better” is very possible… I hope.

Background

I’m building an AWS library and a command line tool. Since I’m doing it for a shell-like language (NGS), the AWS library that I’m building uses the AWS CLI. Frustration and anger are prominent feelings when working with the AWS CLI. I will just list here some of the reasons behind those feelings.

[Image: AWS CLI/API – WTF were they thinking?]

Overall impression

Separate teams worked on different services and were not talking to each other. Then something like this happened: “OK, we have these APIs here, let’s expose them. User experience? No, we don’t have time for that / we don’t care”.

Tags

Representing a map (key-value pairs) as an array ( "Tags": [ { "Key": "some tag name", "Value": "some tag value" }, ... ] ), as the AWS CLI does, is insane… but only if you think about the user (developer, in this case) experience.
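
For illustration, here is a minimal Python sketch of the conversion you end up writing again and again; the helper names are mine, not part of any AWS tooling:

# Convert AWS-style tags ([{"Key": ..., "Value": ...}, ...]) into a plain dict,
# which is what you actually want to work with...
def tags_to_dict(tags):
    return {t["Key"]: t["Value"] for t in tags or []}

# ... and back, because the API insists on the array form.
def dict_to_tags(d):
    return [{"Key": k, "Value": v} for k, v in d.items()]

print(tags_to_dict([{"Key": "role", "Value": "ocsp-proxy"}]))  # {'role': 'ocsp-proxy'}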

[Image: AWS Tags – no, not in this form!]

Reservations

When listing EC2 instances using aws ec2 describe-instances, the data is organized as a list of Reservations, and each Reservation has a list of instances. I have been using AWS for quite a few years and I never needed to do anything with Reservations, but I did spend time unwrapping, again and again, the interesting data (the list of instances) from this unwanted layer of madness. It really feels like “OK, internally we have Reservations and that’s how our API works, let’s just expose it as it is”.
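
A rough Python sketch of the unwrapping dance (the function name is mine; it shells out to the AWS CLI and assumes it is configured):

import json, subprocess

def list_instances():
    # The extra "Reservations" layer has to be peeled off every single time.
    out = subprocess.run(["aws", "ec2", "describe-instances"],
                         capture_output=True, check=True, text=True).stdout
    data = json.loads(out)
    return [i for r in data["Reservations"] for i in r["Instances"]]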

[Image: Who da fuck ever used Reservations?]

Results data structure

The AWS CLI usually (of course not always!) returns a map/hash at the top level of the result. The most interesting data is called, for example, LoadBalancerDescriptions, or Vpcs, or Subnets, or … which makes building generic tooling around the AWS CLI difficult. Thanks! Do you even use your own AWS CLI, Amazon?
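
To make the point concrete, here is the kind of guessing helper you are forced to write because the interesting key is named differently per service. A hedged sketch; it assumes the response has exactly one list-valued top-level key:

def extract_items(response):
    # "LoadBalancerDescriptions", "Vpcs", "Subnets", ... guess which one holds the data.
    lists = {k: v for k, v in response.items() if isinstance(v, list)}
    if len(lists) != 1:
        raise ValueError(f"Expected exactly one top-level list, got keys: {list(response)}")
    return next(iter(lists.values()))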

Inconsistencies

[Image: Kind of the same but not really]

Security Groups

These are of course special. Think of aws ec2 create-security-group ...

  1. --group-name and --description are mandatory command line arguments. So far, I haven’t seen any other resource creation command that requires both a name and a description.
  2. The command returns something like { "GroupId": "sg-903004f8" }, unlike many other commands, which return a hash with the properties of the newly created resource, not just the ID (see the sketch below).
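
A rough Python/boto3 sketch of the extra round trip you need if you want the full properties right after creation (the parameter values are made up):

import boto3

ec2 = boto3.client("ec2")

# create_security_group returns only {"GroupId": ...}, so a second call is
# needed to get the properties that other create-* commands return inline.
resp = ec2.create_security_group(GroupName="test-group-name",
                                 Description="why is this mandatory?",
                                 VpcId="vpc-12345678")
group = ec2.describe_security_groups(GroupIds=[resp["GroupId"]])["SecurityGroups"][0]
print(group["GroupId"], group["GroupName"])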

Elastic Load Balancer

Oh, I like this one. It’s unlike any other.

  1. The unique key by which you find a load balancer is its name, unlike other resources, which use IDs.
  2. Tags on load balancers work differently. When you list load balancers, you don’t get the tags with the list like you do when listing instances, subnets, VPCs, etc. You need to issue an additional command: aws elb describe-tags --load-balancer-names ...
  3. aws elb describe-load-balancers – items in the returned list have the field VPCId, while other places usually name it VpcId .

Target groups

aws elbv2 deregister-targets --target-group-arn ...

Yes, this particular command and its “target group” friends use an ARN, not a name (like ELB) and not an ID (like most EC2 resources).

DHCP options (sets)

(Haven’t used these, but considered them, so I looked at the documentation and here is what I found.)

Example: aws ec2 create-dhcp-options --dhcp-configuration "Key=domain-name-servers,Values=10.2.5.1,10.2.5.2"

Yes, the create syntax is unlike other commands and looks like filter syntax. Instead of, say, a --name-servers ns1 ns2 ... switch, you have "Key=domain-name-servers,Values=..." . WTF?

Route 53

aws route53 list-hosted-zones does not return the tags. Here is the output of the command:

{
    "HostedZones": [
        {
            "Id": "/hostedzone/Z2V8OM9UJRMOVJ",
            "Name": "test1.com.",
            "CallerReference": "test1.com",
            "Config": {
                "PrivateZone": false
            },
            "ResourceRecordSetCount": 2
        },
        {
            "Id": "/hostedzone/Z3BM21F7GYXS7Y",
            "Name": "test2.com.",
            "CallerReference": "test2.com",
            "Config": {
                "PrivateZone": false
            },
            "ResourceRecordSetCount": 2
        }
    ]
}

Wanna get the tags? F*ck you! Here is the command: aws route53 list-tags-for-resources --resource-type hostedzone --resource-ids Z2V8OM9UJRMOVJ Z3BM21F7GYXS7Y . Get it? You are supposed to take the Id field from the list generated by list-hosted-zones, split it by slash, and then use the last part as the resource id. Tagging zones also uses the rightmost part of the id: aws route53 change-tags-for-resource --resource-type hostedzone --resource-id Z2V8OM9UJRMOVJ --add-tags Key=t1,Value=v11

… but that apparently was not enough differentiation 🙂 Check this out: aws elb add-tags --load-balancer-names NAME --tags TAGS vs aws route53 change-tags-for-resource --resource-type hostedzone --resource-id ID --add-tags TAGS , and aws elb remove-tags --load-balancer-names NAME --tags TAGS vs aws route53 change-tags-for-resource --resource-type hostedzone --resource-id ID --remove-tag-keys TAGS . Trick question: in how many dimensions is this different? So wrong on so many levels 🙂

Here is another one: when you fetch records from a zone, you use the full id, /hostedzone/Z2V8OM9UJRMOVJ , not Z2V8OM9UJRMOVJ : aws route53 list-resource-record-sets --hosted-zone-id /hostedzone/Z2V8OM9UJRMOVJ
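
If you script around this, you end up with something like the following Python sketch (the function name is mine; the commands are the ones quoted above, and the ResourceTagSets output key is from memory, so treat it as an assumption):

import json, subprocess

def hosted_zone_tags():
    zones = json.loads(subprocess.run(
        ["aws", "route53", "list-hosted-zones"],
        capture_output=True, check=True, text=True).stdout)["HostedZones"]
    # "/hostedzone/Z2V8OM9UJRMOVJ" -> "Z2V8OM9UJRMOVJ", because of course.
    ids = [z["Id"].split("/")[-1] for z in zones]
    out = subprocess.run(
        ["aws", "route53", "list-tags-for-resources",
         "--resource-type", "hostedzone", "--resource-ids", *ids],
        capture_output=True, check=True, text=True).stdout
    return json.loads(out)["ResourceTagSets"]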

 

(many more to come)

Expected comments and answers

[Image: Internet discussions are the best 😉]

Just use Terraform

I might, some day. For the cases I have in mind, for now I prefer a tool that:

  1. Is conceptually simpler (improved scripting, not something completely different)
  2. Doesn’t require a state file (I don’t want to explain to a client with two servers where the state file is and what it does)
  3. Can be easily used for ad-hoc listing and manipulation of the resources, even when these resources were not created/managed by the tool.
  4. Can stop and start instances (impossible with Terraform the last time I checked, a few months ago)

Just use CloudFormation

I don’t find it convenient. Also, see Terraform points above.

By the way, the phrases of the form “Just use X” are not appreciated.

You just shit on what other people do to push your tools

Yes, I reserve the right to shit on things. Here I mean specifically the AWS CLI. Freedom of speech, you know… especially when:

  1. The product (AWS API) is technically subpar
  2. The creator of the product does not lack resources
  3. The product is used widely, wasting huge amounts of its users’ time

If I’m investing my time in writing my own tool purely out of frustration, I will definitely say why I think the product is crap.

Is this just a rant, or do you have alternatives?

I’m working on it. The command line tool is called na (Ngs Aws). It’s a thin wrapper around a declarative primitives library for AWS.

[Image: There might be a solution to this madness!!!]

The command line tool that I’m working on is nowhere near ready, but here is the gist of what you can do with it:

# list vpcs
na vpc

# Find the default VPC
na vpc IsDefault:true

# List security groups of all vpcs which have tag "Name" "v1".
na vpc Name=v1 sg

# Idempotent operations, in this case deletion.
# No error when this group does not exist.
# Delete "test-group-name" in the given vpc(s).
na vpc Name=v1 sg GroupName:test-group-name DEL

# List all volumes which are attached to stopped instances
na i State:stopped vol

# Delete all volumes that are not attached to instances
na vol State:available DEL

# Stop all instances which are tagged as
# "role" "ocsp-proxy" and "env" "dev".
na i role=ocsp-proxy env=dev SET State:stopped
  1. Na never dumps huge amounts of data to your terminal. As a human, you will probably not be able to process it, so when the number of items (rows in a table, actually) is above a certain configurable threshold, you will see something like this:
    $ na vol 
    === Digest of 78 rows ===
    ...

    It will show how many unique values there are in each column, and the min and max value of each column. I am thinking about also displaying the top and bottom 5 (or so) values for each column.

  2. Na has a concept of “related” resources, so when you write something like na i State:stopped vol , it knows that the volumes you want to show are related to the instances you mentioned. In this particular case, it means volumes attached to those instances.
  3. Note the consistency between what you see in the output and the arguments to the CLI. If something is called “State” in the data structure returned by the AWS CLI, it will be called “State” in the query, not “state” (“--state”).

I will be updating this post as things come along.

Just use X

A typical conversation on the Internet:

I’m having this situation, I’m trying to do blah using blah and it doesn’t work for me because blah. How do I proceed?

Invariably, one of the answers is

Just use X

And to that I would like to answer right now:

Go fuck yourself!

Your answer shows a lack of thought, and arrogance. In addition, chances are that X is a blogosphere-hyped tool or product. Here are some suggestions for next time, instead of “Just use X”:

  1. Is there a reason you are not using X? I was in a similar situation, tried to achieve what you are trying to achieve, and had a positive experience.
  2. I haven’t tried it myself, but I have heard about X, which I think should solve your problem; you should probably take a look if you haven’t yet.

Hope this helps both sides of the discussion.


You are welcome to link to this post when you get “Just use X” response.

Google deleted our G Suite (now recovered)

Introduction

We, at Beame.io LTD, have been using G Suite for a few years now. At the time of this story, we had 2FA enabled for all accounts except one, which was stuck (because of our process) at the setup phase. That account had no administrative privileges.

Pending meeting

On 2018-03-22 there is a scheduled remote meeting with a big and important client, for which we need to prepare. We estimate that the outcome of this meeting will affect our partnership with the client. Most of the documents we need in order to prepare for the meeting are in Google Docs.

Involved people

In the events below we will mention employees A, B, C, and D. Employees A, C, and D have (had actually) admin accounts.

TL;DR

Google deleted all our G Suite data. All important data is recovered by now.

  1. 2018-03-18 – We got an email notification about the pending removal of our G Suite account. The notification went unnoticed because the domain in the subject and the body of the email is a domain we do not use. The old domain (mentioned in the email) was the primary G Suite domain.
  2. 2018-03-22 – Our G Suite is deleted because we did not respond to the email.
  3. 2018-03-27 – Support finally says that the data is unrecoverable. More than an hour later, an email from another person in support gives us some hope.
  4. 2018-03-28 – We are requested to prove domain ownership. The data and accounts are being restored, at least partially.
  5. 2018-03-28 – The restore is in progress. Most accounts are available.
  6. 2018-04-01 – The last (and most important) account was recovered.

It took 5 days for support to tell us that we were completely f*cked and that all our documents and emails were gone. The next day, our data started being recovered.

Failure to respond to an email gets your G Suite deleted within days.

After recovering the data, Google failed to tell us that they had finished. We were contacted only later.

Last update of this blog

2018-04-12 12:52 UTC+3

Timeline of events

2018-03-22 09:56 UTC+2

Message from A to the team – his email account doesn’t work. B reports that he observes the same problem. It quickly becomes clear that none of us can access our accounts.

10:21 UTC+2

We have disproved our first theory – that we accidentally did not pay for the service. Our money guy says we paid.

Somewhen

C & D fail to find any obvious way to contact G Suite support except for Twitter.

Around 11:15 (twitter does not show exact times)

Conversation using PM with @gsuite .

D:

looks like our gsuite domain was deleted and we can’t log in but we can’t phone google support because it requires PIN … which you get when logged in

@gsuite:

Hi there, what is the domain name on the G Suite account? Please send a full screenshot of the error message you receive when you access http://admin.google.com from an incognito window https://support.google.com/chrome/answer/95464?co=GENIE.Platform%3DDesktop&hl=en …. -MO

D: pastes screenshots

The conversation continues in parallel to other things below.

11:42 UTC+2

Failing to get through to phone support, D opens a new G Suite domain in order to get a PIN and thereby some support.

Somewhen

D contacts G Suite support on the phone using the PIN from the new G Suite domain.

11:49 UTC+2

Following the phone conversation with support, D gets an email. The email does not contain the promised link to the form D should fill in.

11:53 UTC+2

D gets an email with a link to the account recovery form. D misses the email because he is talking to C, answering the previous email, and trying to reach G Suite support via Twitter. D sees this email after a while, when a case opened via the form already exists.

11:56 UTC+2

D answers in email (replying to email at 11:49):

I’ve seen that page before. It’s not clear which form you were referring to. I don’t understand how to proceed. Please advise.

12:39 UTC+2

After C and D both contact G Suite support via the Twitter account, get a link to the account recovery form, and fill in the form (at https://support.google.com/a/contact/recovery_form ), a case is opened.

Case text is:

Subject: [urgent] (CENSORED-DOMAIN) – Whole G Suite domain does not function

We started getting “account deleted”
errors (started less than 24 hours ago) when trying to log in. Looks like
the whole G Suite domain was deleted. We suspect malicious activity. We did
not even consider deleting this domain. I am one of the administrators of
the domain – (CENSORED-EMAIL). Recovering the access and the data is critical
for our organization.”

The email says

Thanks for opening this G Suite support ticket. Your ticket number is (CENSORED-TICKET-NUMBER), and we’ll be in touch with you soon.

13:17 UTC+2

D checks what our clients see when they send us emails. Here is the reply they would get if they sent something to one of our mail accounts:

The response was:
The email account that you tried to reach does not exist. Please try double-checking the recipient’s email address for typos or unnecessary spaces. Learn more at https://support.google.com/mail/?p=NoSuchUser (CENSORED-SOME-ID) – gsmtp

13:30 UTC+2

Email from G Suite support regarding the case:

Hello (CENSORED-NAME),

Thank you for contacting G Suite Account Recovery team. I understand that you are having issue with your G Suite account associated to the domain CENSORED-DOMAIN and getting error that the account has been deleted. I would like to know if your G Suite account is associated to another G Suite account before? Please let me know by replying to this email. Thank you.

Sincerely,

(CENSORED-NAME)
G Suite Account Recovery team

13:34 UTC+2

Our reply to the email above:

We are not sure. If yes, it was (CENSORED-OLD-DOMAIN) .

Since this is very urgent issue that is critical to our business, can we have a phone conversation with you?
I think it will be more productive.

Regards,
(CENSORED-NAME)

15:46 UTC+2

D sends the following PM to @gsuite:

Hi. Sorry to bother you but it has been 2 hours since my last email to the support. The issue is very urgent to us. Can you check maybe what’s going on with case (CENSORED-TICKET-NUMBER)

16:08 UTC+2

@gsuite to D:

Hi there, the service level agreement for G Suite technical support is 24 hours. The team will reply within 24 hours from your response. I hope this helps. -MO

16:20 UTC+2

D to @gsuite:

Can we buy premium support to accelerate this? The impact of this issue on our business is huge.

This is unanswered at the time of publication.

22:50 UTC+2

Additional email to support:

It is disappointing that I was still not contacted, despite defining this issue as urgent. It’s evening, and I’m going to sleep soon. Since the issue is urgent, please call me ( CENSORED-PHONE-NUMBER ) because I will not see your emails till the morning; I’m at UTC+2.

22:59 UTC+2

Email from support (despite our request for phone communication):

Hello CENSORED-NAME,

Good day! I hope this message finds you well. My name is CENSORED-NAME and I am one of the supervisors for G Suite Account Recovery. I understand that you are getting an error that the domain has been deleted when you try to sign in with your email account for CENSORED-DOMAIN. I’ll be taking over now of this case.

Google sends notifications to the administrators, especially to the primary administrator, whenever there’s an update or changes that might affect their G Suite account. You mentioned that you are one of the administrators of the account which said was deleted. Have you coordinated with the other administrators and check if they receive any notifications from Google?

Also, when the previous agent asked you if there was any other domain affiliated with the account, you said that it might be CENSORED-OLD-DOMAIN. Do you still own that domain?

I would also like to ask if you remember signing up for G Suite, what was the domain you used?

If you have questions, feel free to reply to this email.

Sincerely,

CENSORED-NAME
G Suite Account Recovery Team

23:06 UTC+2

Answered the email:

Thanks. It’s good when my emails are responded to.

They did not delete the account and they did not get any notifications. There were no plans to delete the G Suite domain.

We do not own CENSORED-OLD-DOMAIN anymore. Please note that we are not sure whether these were linked.

I’m not sure I understand the question. When I joined the company, G Suite was already in use. Our emails etc were on “CENSORED-DOMAIN” domain, which we still own.

2018-03-23 01:09 UTC+2

Email to support:

New information. C ( CENSORED-EMAILS ) did receive notification about pending removal of CENSORED-OLD-DOMAIN G Suite. He did not think it was related. Apparently it is.

Please recover our “CENSORED-DOMAIN” domain as soon as you can.

Going to sleep now, please use phone to contact me if you have any questions.

Regards,
D

08:15 UTC+3

Email to support:

Hello. Any updates? I’m asking because several hours passed, I haven’t heard from you and our business is continuing to suffer greatly from this issue.

Regards,
D

14:56 UTC+3

No reply yet. Publishing in the hope of applying some pressure.

19:06 UTC+3

Email from support:

Hello D,

Good day! Thank you for your response.  I apologize if I let you felt that your case is being not attended. Don’t worry, I have your case under my watch. To set proper expectation regarding my availability, I am in the office every Mondays to Fridays,  5:00AM to 2:00PM PST. I do understand how critical this situation is and how it is affecting your business. I’ll do my best to address the issue you are experiencing.

You have mentioned that your co-admin has received a message from Google regarding pending deletion of the G Suite account associated with CENSORED-OLD-DOMAIN. It was addressed to CENSORED-EMAIL-OF-C, which I think is an admin since it has received the notification.

Based on your answers, if the email notification from Google was received and resulted to the deletion of the G Suite account where CENSORED-DOMAIN  is associated with, most probably the primary domain configured in your G Suite account was CENSORED-OLD-DOMAIN, which its domain ownership had been contested by the new owner for them to be able to sign up for G Suite.

In short, CENSORED-DOMAIN was also deleted because it was configured as secondary domain of the G Suite account registered for CENSORED-OLD-DOMAIN.

For your reference, you may read this Help Center article about the importance of disassociating expired domain name with your G Suite account:
What if my domain expires? – https://support.google.com/a/answer/6359803
Before you change your primary domain (When do I need to change my primary domain) – https://support.google.com/a/answer/6301932

I regret to inform you that this can no longer be recovered as it underwent through proper process. Google already sent a notification to the primary administrator of the account. What I can advise is to sign up for a new G Suite account registered with the domain that you really own, CENSORED-DOMAIN.

If you have questions, feel free to reply to this email.

Sincerely,

CENSORED-NAME
G Suite Account Recovery Team

19:31 UTC+3

D replies:

The damage to our company due to losing the documents is huge. We estimate it at millions of dollars and CENSORED. Please check how to recover our documents. Escalate if needed. It’s completely unreasonable situation.

Around 20:00 UTC+3

C finds a copy of the notification email. It is dated 2018-03-18. C had not even opened it until now because the subject mentions CENSORED-OLD-DOMAIN. Well, nothing in the email says anything about CENSORED-DOMAIN.

23:23 UTC+3

Publishing the update, no more emails till now.

2018-03-24 12:10 UTC +3

PM to @gsuite:

Here is how this support nightmare looks from our side: https://www.reddit.com/r/devops/comments/86kq2n/g_suite_support_24_hours_without_our_account_and/

2018-03-24 14:22 UTC+3

Still no reply to email and no reply from @gsuite

18:01 UTC+3

Got this reply. I was thinking that we might not hear from Google till Monday, because this guy told us (in one of the previous emails) that he works Monday to Friday, and today is Saturday.

Hello CENSORED-NAME,

Thank you for your response. I understand that you’re getting confused why the CENSORED-DOMAIN got deleted when in fact, you only received a notification regarding the pending deletion of the G Suite account for CENSORED-OLD-DOMAIN. I’ll be addressing the confusion to clear things up.

G Suite knows the nature of business when it comes to domain name registrations. When a domain name registration expires, it may be sold again to a new user who wanted to sign up the domain with G Suite. Proof of domain ownership is necessary when we use the service.

A G Suite account is recognized for its primary domain. If the primary domain used for G Suite got expired, the administrator needs to switch it to a different domain that they still own. It is because if your primary domain is not updated, there is a chance that someone may buy the domain name and sign it up for G Suite. When that happens, error will occur because the domain they try to sign up (or add as secondary domain or domain alias) is existing in the system.

How I understand the situation, this is the same thing happened to CENSORED-OLD-DOMAIN, as you have mentioned that you no longer own the domain name. The new owner must have tried to sign it up for G Suite, but they are getting the error that it is already in use. So they have to contact G Suite to contest the domain ownership for them to be able to use it with G Suite.

G Suite sends an email notification to the administrator of the existing G Suite account in the system, that a user has proven their domain ownership and wanted to use the domain name for G Suite. In your case, CENSORED-OLD-DOMAIN was the primary domain of the G Suite account, and if it is the primary domain is being contested, the whole G Suite account will be deleted, including the secondary domain in it, which was in your case CENSORED-DOMAIN.

There are some questions as well that I want to ask:
1) if CENSORED-OLD-DOMAIN has long gone expired, why was the primary domain of the G Suite account was not updated and changed it to an active one?
2) There were 3 days for the admin to reply on the said email regarding the pending deletion, was there an attempt of replying to it or was just simply disregarded because it was treated irrelevant?
3) Is there really a different G Suite account for CENSORED-MISSPELED-DOMAIN? If so, then why it got deleted too when CENSORED-OLD-DOMAIN ownership got contested? It would not be deleted if it has a separate G Suite account where it is set as primary domain, right?

Please be advised that the information I have provided are based on my deductive reasoning of the answers you have provided to me and my technical expertise regarding this matter. I would also like to let you know that this case has already been escalated and I am the best person to speak with regarding this technical matter. Please also note that in accordance to our privacy policy, I am not allowed to disclose any relevant information from our end since you are contacting us through an unauthenticated channel.

If you have questions, feel free to respond to this email.

Sincerely,

CENSORED-NAME
G Suite Account Recovery Team

My thoughts: If you understand why we are “getting confused”, maybe the system should be fixed?

21:09 UTC+3

Email to support:

[…]

1) if CENSORED-OLD-DOMAIN has long gone expired, why was the primary domain of the G Suite account was not updated and changed it to an active one?

CENSORED-OLD-DOMAIN-WITHOUT-DOT-COM is previous name of our business. We are slowly phasing out usage of CENSORED-OLD-DOMAIN in all places. We are busy startup so theses things take time.

2) There were 3 days for the admin to reply on the said email regarding the pending deletion, was there an attempt of replying to it or was just simply disregarded because it was treated irrelevant?

The significance of the said email was not understood. The subject had CENSORED-OLD-DOMAIN in it so it was treated like low priority issue to take a look in the future.

3) Is there really a different G Suite account for CENSORED-MISSPELED-DOMAIN? If so, then why it got deleted too when CENSORED-OLD-DOMAIN ownership got contested? It would not be deleted if it has a separate G Suite account where it is set as primary domain, right?

We think that there was not a separate account, CENSORED-OLD-DOMAIN-WITHOUT-DOT-COM is previous name of our business.

Please be advised that the information I have provided are based on my deductive reasoning of the answers you have provided to me and my technical expertise regarding this matter. I would also like to let you know that this case has already been escalated and I am the best person to speak with regarding this technical matter. Please also note that in accordance to our privacy policy, I am not allowed to disclose any relevant information from our end since you are contacting us through an unauthenticated channel.

We own the CENSORED-DOMAIN domain, which we want to recover. How you want us to authenticate? We can prove ownership of CENSORED-DOMAIN by setting a DNS record for example.

21:33 UTC+3

Email to support:

Most importantly, we would like to make sure that a technical team is working on recovering our data: emails and Drive. Every minute without our data is critical to our clients and workers, any help with that would be great.

We consider that situation where 3 days without replying to an email causes removal of all our accounts and data is totally unreasonable. Especially so when CENSORED-DOMAIN domain was in active use which could be seen from Google side and it was still deleted.

23:44 UTC+3

Got an email to D’s address on the affected CENSORED-DOMAIN domain. D was able to receive it because, for now, we use a different mail provider for the affected domain.

From: PlatformNotifications-noreply@google.com
Subject: Your Google Developers Console project is scheduled for deletion

Project Shutdown Announcement


Your Google Cloud Platform project CENSORED was shut down on Saturday, March 24, 2018 8:44:47 PM UTC.

If you take no action, after Monday, April 23, 2018 8:44:47 PM UTC, you will be unable to recover this project. If this was unintentional, visit this URL before Monday, April 23, 2018 8:44:47 PM UTC to cancel the project shutdown:

https://console.developers.google.com/project?pendingDeletion=true&organizationId=CENSORED

If you have any questions, please visit the Developers Console Help at this URL, or contact support:

https://developers.google.com/console/help/new/

Thanks,
The Google Developers Console Team

© 2016 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043

You have received this mandatory email service announcement to update you about important changes to your Google Developers Console product or account.

D has no idea what this is about and suspects it might be a byproduct of Google working on the recovery. C suspects Google is deleting something additional related to our domain.

The link works only for logged-in users. Since none of us can log in, we cannot use the link to take a look.

 

2018-03-27 13:34 UTC+3

Email to support:

Any news? It’s almost 3 days since we’ve heard from you … regarding urgent case.

2018-03-27 19:16 UTC+3

Hello CENSORED-NAME,

Good day! Thank you for your response. Sorry if my reply came too late as I just came back to the office.

I have read your answers from my previous questions and also your request for the G Suite account to be recovered. I’d really want to help out regarding this matter as I know how this is really impacting to your business. However, the best thing that we can do regarding this situation is to tell you the truth that it is no longer recoverable and provide education so that it will not happen again in the future.

Here are my responses to your answers:

1) “CENSORED-OLD-DOMAIN-WITHOUT-DOT-COM is previous name of our business. We are slowly phasing out usage of
CENSORED-OLD-DOMAIN in all places. We are busy startup so theses things take time.” – Thank you for letting me know about how CENSORED-OLD-DOMAIN-WITHOUT-DOT-COM is transitioning to a new business name. The opportunity I find here is that since the business is undergoing transition, the G Suite account should have adapted the changes in the organization too by changing the primary domain. To know more how to change your primary domain, you may check this Help Center article: https://support.google.com/a/answer/7009324

2) “The significance of the said email was not understood. The subject had
CENSORED-OLD-DOMAIN in it so it was treated like low priority issue to take a look in
the future.” – I really feel sorry if that email was not prioritized because of its subject containing a domain which you no longer own. However, it is always a best practice as an admin to keep the communication lines open and check notifications from G Suite as it may contain important, sensitive and private message that concerns your G Suite account.

3) “We think that there was not a separate account, CENSORED-OLD-DOMAIN-WITHOUT-DOT-COM is previous name of our
business.” – Thank you for letting us know that you are aware that both domains are configured in one G Suite account.

Based from your responses, I think the opportunity here is to understand the importance of these terminologies in G Suite:
1) Primary Domain – The domain name used in which the G Suite account is registered. The very foundation of a G Suite account.
2) Secondary Domain – A type of additional domain that shares from the storage of the primary domain’s G Suite account. Users may be provisioned under this domain and use it to log in.
3) Domain Alias – A type of additional domain that automatically adds email aliases to all users under the primary domain. Email addresses under domain alias can not be configured as login username.

It is really important to update the primary domain if the domain name enrolled in it has expired its registration. For your reference, you may visit this Help Center article regarding the importance of updating the primary domain if in case it expires: https://support.google.com/a/answer/6359803

The reason it is no longer recoverable is because the domain CENSORED-OLD-DOMAIN was set as primary domain of your G Suite account that was legitimately contested by the new owner for them to be able to add their new domain with G Suite. In which, it was also admitted by your end that CENSORED-OLD-DOMAIN is no longer owned by your organization. Also, I have also checked that the new owner has already enrolled the domain name to G Suite.

Thank you for your understanding.

Sincerely,

CENSORED-NAME
G Suite Account Recovery Team

2018-03-27 20:54 UTC+3

Email from another person in support. Gives some hope.

Hello CENSORED-NAME,

Thank you for contacting G Suite Support. I understand the account CENSORED-DOMAIN. has been deleted. My name is CENSORED-NAME and I am in charge of your case.

I am aware that retrieving your domain is very important to you. I have contacted a specialist team regarding this issue. As soon as I hear from them, I will update this case and provide you with further notifications regarding the issue at that point. I would like to be honest with you and provide the right expectations, I can not promise to recover your domain but I will exhaust all the resources in my hands to do it.

The case will remain open. If you have any question please respond to this message and I will be very happy to follow up with you. I look forward to contacting you at the earliest convenience. I’ll be available Monday through Friday from 4:00 pm to 1:00 am IDT. Have a nice day!

Sincerely,

CENSORED-NAME
G Suite Support

2018-03-28 02:33 UTC+3

Email from third support person:

Hello CENSORED-NAME,

This is a friendly follow up regarding the case you currently have on Google G Suite Support.

I hope this message finds you well. We thank you for patiently waiting for a resolution to your case. I just wanted to let you know that due to the nature of the issue we can’t guarantee that the data will be restored since it got all permanently deleted but, our engineering team is currently working to verify if we have an option to restore it. In the meantime, We would greatly appreciate if you don’t create a brand new account until we get a final resolution to your case.

Please don’t hesitate to reply back to this email if you have any additional questions. HAve yourself a wonderful day.

Best regards,

CENSORED-NAME-OF-THIRD-SUPPORT-PERSON
Google G Suite Support.

2018-03-28 06:14 UTC+3

Email from support:

Hello CENSORED-NAME,

This is a friendly follow up regarding the case you currently have on Google G Suite Support.

I hope this message finds you well. We thank you for patiently waiting for a resolution to your case. I just wanted to let you know that due to the nature of the issue we can’t guarantee that the data will be restored since it got all permanently deleted but, our engineering team is currently investigating. In the meantime, We would greatly appreciate if you don’t create a brand new account but verify domain ownership by adding the following CNAME:

Label/Host: CENSORED-NUMBER

Destination/Target: Google.com

Time to live (TTL): 3600

Please don’t hesitate to reply back to this email if you have any additional questions. Have yourself a wonderful day.

 

Best regards,

CENSORED-NAME-OF-THIRD-SUPPORT-PERSON
Google G Suite Support.

2018-03-28 11:12 UTC+3

Email to support:

Hello.

CNAME added as requested.

$ dig +short CENSORED-NUMBER.CENSORED-DOMAIN
google.com.
CENSORED-IP

Regards,
CENSORED-NAME

2018-03-28 16:18 UTC+3

Email from support:

Hello CENSORED-NAME,

Thank you for your reply.

I apologize for contacting you until now but I am starting my shift. Your case is a priority for us and if I can not contact you, someone else will. I see that CENSORED-NAME-OF-SUPPORT-PERSON-3 asked you to create a CNAME record. We are still working on your case, I’ll keep you updated.

The case will remain open. If you have any question or additional comments in regards to your case, please feel free to reply to this email.

Sincerely,

CENSORED-NAME-OF-SUPPORT-PERSON-2
G Suite Support

2018-03-28 23:04 UTC+3

Automated email from Google:

30 days remaining to set up billing for G Suite Basic
Hello,

Our records indicate that you have no payment information on file for G Suite Basic for CENSORED-DOMAIN. You have until April 27, 2018 to set up your billing information, after which you will be suspended. Please set up billing in order to continue uninterrupted service.

Here’s what you need to do to continue your subscription to G Suite:

Click the Set up billing button below.
Select payment plan.
Review your purchase and accept the terms of service.
Enter your payment information.
If you would like to pay by check or manual bank transfer every month, please contact G Suite support.

D logs in with his email at CENSORED-DOMAIN. The users appear to be there. D calls C. C fails to log in. In the Admin console, C is a “deleted user”. The documents, or at least most of them, appear to be there; we are not sure, it’s late, we’ll look into this tomorrow. We also hope to get some official notification when the recovery is finished.

2018-03-29 06:55 UTC+3

Email from support:

Hello CENSORED-NAME,

I want to thank you for your ongoing patience while we have working to resolve your issue. My name is CENSORED-SUPPORT-PERSON-4, and I am the Technical Solutions Engineer (TSE) who will be taking ownership of your escalated case from CENSORED-SUPPORT-PERSON-2.

Over the past ~30 hours, I have been engaging with our Product Engineers to identify the safest (and quickest) way to recover the “@CENSORED-DOMAIN” users that previously existed on the G Suite account associated with “CENSORED-OLD-DOMAIN”. In most circumstances, it is easy for us to restore a deleted G Suite account. However, in your case—because you did not maintain ownership of the primary domain—another customer has set-up a new G Suite account with it.

Unfortunately, our systems cannot accommodate two G Suite accounts with the same primary domain, and we cannot evict the one that was recently created by the current owner of “CENSORED-OLD-DOMAIN”. If you were that individual, I am sure that you would not appreciate us deleting (or asking you to delete) the G Suite account that you just created.

Since “CENSORED-DOMAIN” cannot be restored independently of “CENSORED-OLD-DOMAIN”, we have created a new G Suite account (on your behalf), with “CENSORED-DOMAIN” set as the primary domain. In addition, we have begun the process of moving your deleted users to this new G Suite account—this is outside our normal processes, which is why it has taken longer than desired.

At this time, we have been able to restore the following users, and move them to the new G Suite account:

CENSORED-USERS-LIST

At this this time, the aforementioned users should be able to sign (via https://accounts.google.com/) and access their data. Additionally, as a Super Admin, you should also be able to access the Admin console (via https://admin.google.com/), where you can set-up your billing information, and manage these user accounts.

Please understand that were are doing all of this on a best-effort basis, and we cannot guarantee that user data has been (or can be) recovered. Currently, due to technical constraints, we are having difficulty moving “CENSORED@CENSORED-DOMAIN” and “CENSORED@CENSORED-DOMAIN” to the new G Suite account—however, we are continuing to work on this, and will let you know once we have determined whether or not these users can be recovered.

I also want to note that we cannot restore any Google Groups—however, I believe that you only had “CENSORED@CENSORED-DOMAIN” and “CENSORED@CENSORED-DOMAIN” on the former G Suite account. If these are necessary, you should be able to re-create them via the “Groups” page within your Admin console.

I will reach out to you around this time tomorrow with the latest information on our progress. From your cell number (+972 CENSORED), I understand that you are located in Israel. Since I am located in Mountain View, California, there is little overlap in our working hours—however, if you would like to speak with someone on phone, please let me know, and I will try to co-ordinate something with my European colleagues.

Sincerely,

CENSORED-SUPPORT-PERSON-4

2018-03-29 14:53 UTC+3

Email to support:

We appreciate the technical efforts being put into solving the problem. Thanks!

Me and some other users were able to log in and see our emails and documents. I was able to log into admin. That’s great. C was not able to log in but I understand that you just haven’t finished with the account. C is listed under “deleted users” in admin.

I will gladly speak to you tomorrow, if my 7AM are OK for you. I do prefer to speak to the person who did the recovery and not more-time-zone-connvenient colleagues. CENSORED.

Regards,
CENSORED-NAME

2018-03-29 18:00 UTC+3

Call from the technical person working on our case.

  1. They are still working on recovering the last account. The work was slowed down by the fact that C had opened a consumer Google account with the same email address (he needed it to authenticate to an external service).
  2. The technical team got the case only about two days ago.
  3. Google will be improving the process:
    1. Better notifications that mention all relevant domains.
    2. Better escalation. Apparently the first-line support (“agents”) were not sure what to do with our case because it was unusual, so it took a while to escalate.
    3. Support will be instructed to use the requested communication channel.
  4. There should be news in a few hours, or maybe tomorrow morning (UTC+3).

2018-04-01 10:16 UTC+3

No further communication from Google. The Admin console shows that the last account was recovered. C checked the documents and they seem to be OK.

2018-04-06 18:42 UTC+3

We started to check possible backup solutions. We are also considering which other systems are critical to us and hence need to be backed up.

2018-04-09 21:55 UTC+3

Email from support:

Hello CENSORED-NAME,

My name is CENSORED-SUPPORT-PERSON-5, and I am one of the support managers in Google Cloud Support.
NAME-OF-SUPPORT-PERSON-4 will be out of office this week, so I wanted to follow up on the status of the issue.

Based on the case and issue history tracked by Engineering, we have addressed last issues associated with CENSORED-EMAIL and CENSORED-EMAIL on April 4th and thus all the issues should have been addresses. Please do let us know if that is not the case, and I will work with another engineer to look into it.

Thank you for your patience with this case.

Regards,
CENSORED-SUPPORT-PERSON-5
Manager
Google Cloud Support

2018-04-10 15:59 UTC+3

Email to support:

Hello NAME-OF-SUPPORT-PERSON-5,

We have noticed that the data was recovered. It was strange that we had to notice that by ourselves because we were not contacted after the last accounts were recovered.

Other than this additional item for your post-mortem (we are really curious what are the resulting action items), everything seems to be OK, thanks.

Regards,
CENSORED-NAME

2018-04-10 18:04 UTC+3

Email from support:

Hello CENSORED-NAME,

Thank you for your message. I’m really happy to hear that you have been able to access the data for CENSORED-EMAIL and CENSORED-EMAIL.

During our last phone conversation, I had mentioned that our engineering team was encountering significant difficulty recovering these users. However, on April 4th (a little before midnight IDT), they confirmed that the necessary changes had been made to undelete them. I did not contact you at the time, as it can take 24-48 hours for all user data to be fully restored—especially for a user like CENSORED-EMAIL, who has used over 9 GB of data storage.

A notice about this “propagation delay” used to be mentioned in the Help Center at: https://support.google.com/a/answer/1397578, however it was recently removed for unknown reasons. I have confirmed with our technical writers that this notice has been reincluded in the current article draft, which is currently undergoing review.

I am glad to see that NAME-OF-SUPPORT-PERSON-5 contacted to you yesterday, since I was out of the office. As CENSORED mentioned, I will be out for most of this week—however, I did want to take some time to inform you of the progress being made on the “action items” (from the post-mortem) that we discussed over the phone. With regards to your concerns about how your case was handled by our tier-one team, I have relayed your feedback to each of their managers, and have asked that all of them receive additional training to ensure that they do not avoid making outbound calls and/or escalating cases when specifically asked by customers.

In addition, we are continuing to work on updating our email templates to ensure that all domains are listed—instead of just the one with contested ownership. We are currently working with our Product Counsel team on several proposed changes to our procedures/workflows, and this has been included amongst them.

Finally, I am happy to report that our engineering team did address the root cause of the technical issues that were preventing your two users from being restored. Should any of our customers ever end up in a similar situation, we should be able to initiate the restoration process a lot faster!

Because of all the trouble that you experienced, I have gone ahead an extended your “G Suite Business” trial to May 31, 2018 (the farthest back possible)—I hope that this will give you enough time to backup/export all of your user data, and to re-evaluate the product. If you encounter any further issues during this trial period, please do not hesitate to respond back to this email and let NAME-OF-SUPPORT-PERSON-5 or me know!

Sincerely,

NAME-OF-SUPPORT-PERSON-4
Google Cloud Support

2018-04-12 12:52 UTC+3

Email to support:

Thank you!

TODO: Schedule post-mortem

 

The missing link of Ops tools

It’s like we went from horse to spaceship, skipping everything in between.

Background

Let’s say you are managing your system in AWS. Amazon provides you with an API to do that. What are your options for consuming that API?

Option 1: CLI or library for API access

The AWS CLI lets us access the API from the command line and bash scripts. Python/Ruby/Node.js and other languages can access the API using the appropriate libraries.

Option 2: Declarative tools

You declare how the system should look; the tool figures out the dependencies and performs whatever API calls are needed to achieve the declared state.

Problem with using CLI or API libraries

Accessing the API using the CLI or libraries is fine for one-off tasks. In many cases automation is needed and we would like to prepare scripts. Ideally, these scripts would be idempotent (they can be run multiple times, converging to the desired state and not ruining it). We then quickly discover how clunky such scripts are:

# Script "original"
if resource_a exists then
  if resource_a_property_p != desired_resource_a_property_p then
    set resource_a_property_p to desired_resource_a_property_p
  end
  if resource_a_property_q != desired_resource_a_property_q then
    ...
  end
else
  # resource_a does not exist
  create resource_a
  set resource_a_property_p to desired_resource_a_property_p
  ...
end
# more chunks like the above

It’s easy to see why you wouldn’t want to write and maintain a script such as above.

How the problem was solved

What happened next: a jump to “Option 2” – declarative tools such as CloudFormation, Terraform, etc.


Other possible solution that never happened

If you have developed any code, you probably know what refactoring is: making the code more readable, deduplicating shared code, factoring out common patterns, etc. – without changing the meaning of the code. The script above is an obvious candidate for refactoring, which would mean improving “Option 1” (CLI or a library for API access) above, but that never happened.

All the ifs should have been moved into a library, and the script could then be transformed into something like this:

# Script "refactored"
create_or_update(resource_a, {
  property_p = desired_resource_a_property_p
  property_q = desired_resource_a_property_q
})
# more chunks like the above

One might say that the “refactored” script looks pretty much like an input file of the declarative tools mentioned above. Yes, it does look similar; there is a huge difference, though.

Declarative tools vs declarative primitives library

By “declarative primitives library” I mean a programming language library that provides idempotent functions to create/update/delete resources. In our case these resources are VPCs, load balancers, security groups, instances, etc…
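
To make “declarative primitive” concrete, here is a minimal Python sketch of the shape such a function could have; everything in it (the name, the find/create/update callables) is illustrative, not an existing library:

def converge(find, create, update, desired):
    """Idempotent primitive: ensure the resource exists with the desired properties."""
    existing = find(desired)
    if existing is None:
        return create(desired)
    diff = {k: v for k, v in desired.items() if existing.get(k) != v}
    if diff:
        update(existing, diff)
    return existing

A script built on top of such primitives stays a script: you keep plain variables, conditionals and loops around the calls, instead of moving your logic into a DSL.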

Differences of declarative tools vs declarative primitives library

  1. Declarative tools (at least some of them) do provide dependency resolution, so they can sort out the order in which resources should be created/destroyed.
  2. Complexity. The complexity of the mentioned tools cannot be ignored; it is much higher than that of a declarative primitives library. Complexity means bugs and higher maintenance costs. Complexity should be considered a negative factor when picking a tool.
  3. Some declarative tools track created resources so they can easily be destroyed, which is convenient. Note that, on the other hand, this adds complexity to the tool, as there must be yet another chunk of code to manage the state.
  4. Interacting with existing resources. Ranges from awkward to impossible with declarative tools; easy with a correctly built declarative primitives library. Example: delete all unused load balancers (unused meaning no attached instances): AWS::Elb().reject(X.Instances).delete() (see the sketch after this list).
  5. Control. Customizing the behaviour of a script that uses a declarative primitives library is straightforward. It’s possible but harder with declarative tools, where a trivial if can look like count = "${length(var.public_subnets) > 0 ? 1 : 0}" (from the approved Terraform VPC module).
  6. Ease of onboarding has declarative tools as a clear winner – you don’t have to program and don’t even need to know a programming language – but you can get stuck without knowing it:
  7. Getting stuck. If your declarative tool does not support a property or a resource that you need, you might need to learn a new programming language, because the DSL used by your tool is not the programming language the tool itself is written in (Terraform, Puppet, Ansible). With a declarative primitives library, on the other hand, you can always either extend it when/if you wish (preferable) or make your own easy workaround.
  8. Having one central place where potentially all resources are described as text (please don’t call that code; a format is not code!). This should be easier with declarative tools. In practice, I think it depends more on your processes and how you work.
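
For comparison with the example in item 4 above, a hedged boto3 sketch of “delete all unused classic load balancers”; still a trivial script, while with a declarative tool it is awkward at best:

import boto3

elb = boto3.client("elb")
for lb in elb.describe_load_balancers()["LoadBalancerDescriptions"]:
    if not lb["Instances"]:  # "unused" == no attached instances
        elb.delete_load_balancer(LoadBalancerName=lb["LoadBalancerName"])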

As you can see, it’s not black and white, so I would expect both solutions to be available so that we, Ops, could choose according to our use case and our skills.

My suggestion

I don’t just suggest having something between a horse and a spaceship; I am working on the car. As part of the Next Generation Shell (a shell and a programming language for ops tasks), I am working on a declarative primitives library. Right now it covers some parts of AWS. Please have a look. Ideally, join the project.

Next Generation Shell – https://github.com/ilyash/ngs

Feedback

Do you agree that the jump between the API and declarative tools was too big? Do you think the middle ground, the declarative primitives approach, would be useful in some cases? Comment here or on Reddit.


Have a nice day!

Why I have no favorite programming language

TL;DR – because for me there is no good programming language.

I mostly do systems engineering tasks. I manage resources in the cloud and on Linux machines. I can almost hear your neurons firing half a dozen names of programming languages. I do realize that these are used by many people for systems engineering tasks:

  • Go
  • Python
  • Ruby
  • Perl
  • bash

The purpose of this post is not to diminish the value of these languages; the purpose is to share why I don’t want to use any of them when I write one of my systems-engineering-task scripts. My hope is that if my points resonate with you, the reader, you might want to help spread the word about, or even help with, my suggested solution, described towards the end.


So let’s go over the languages and see why I don’t pick any of them:

Why not language X?

All languages

  • Missing smart handling of exit codes of external processes. Example in bash: if test -f my_file (file is not there, exit code 1) vs if test --f my_file (syntax error, exit code 2). If you don’t spot the syntax error with your eyes, everything behaves as if the file does not exist (see the sketch after this list).
  • Missing declarative primitives libraries (for Cloud resources and local resources such as files and users). Correction: maybe I found one, in Perl – (R)?ex ; unfortunately it’s not clear from the documentation how close it is to my ideas.
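
To illustrate the exit codes point from item 1: this is roughly what you have to write by hand in Python to distinguish “false” from “broken” (the helper is mine, not a stdlib function):

import subprocess

def file_exists_via_test(path):
    # test(1): exit 0 = true, 1 = false, anything above 1 = the test itself failed.
    rc = subprocess.run(["test", "-f", path]).returncode
    if rc == 0:
        return True
    if rc == 1:
        return False
    raise RuntimeError(f"'test' failed with exit code {rc}")

print(file_exists_via_test("/etc/passwd"))  # True on a typical Linux box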

All languages except bash

  • Inconvenient/verbose work with files and processes. Yes, there are libraries for that, but there is no syntax for it, which would be much more convenient. I have never seen anything that compares to my_process > my_file or echo my_flag > my_file (see the sketch below).
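
For the sake of comparison, the Python equivalent of those two bash one-liners might look roughly like this (my_process and my_file are the placeholders from the examples above):

import subprocess

# my_process > my_file
with open("my_file", "w") as f:
    subprocess.run(["my_process"], stdout=f, check=True)

# echo my_flag > my_file
with open("my_file", "w") as f:
    f.write("my_flag\n")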

Go

  • Compiled
  • Error handling is a must. When I write a small script, it’s more important for me for it to be concise than to handle all possible failures; in many cases I prefer an exception over twice-the-size script. I do understand how mandatory and explicit error handling can be a good thing for larger programs or programs with greater stability requirements.
  • The dependencies problem seems to be an unresolved issue.

Python

  • Functional programming is a second-class citizen. In particular, list/dictionary comprehensions are the Pythonic way, while I prefer map and filter. Yes, that’s probably one of the features that make Python easier to learn and a suggested first language. Not everything that’s optimized for beginners must be good for more experienced users, and that’s OK.
  • Mixed feelings about array[slice:syntax] . It’s helpful, but the slice:syntax is only applicable inside [ ] ; in other places you must use the slice(...) function to create the same slice object.

Ruby and Perl

  • The sigil syntax does not resonate with me.

Ruby

I can’t put my finger on something specific but Ruby does not feel right for me.

Perl

  • Contexts and automatic flattening of lists in some cases make the language more complicated than it should be.
  • Object orientation is an afterthought.
  • Functions that return a success status. I prefer exceptions, which are not the default behaviour in Perl but an afterthought: autodie.
  • Overall syntax feeling (strictly a matter of personal taste).

bash

Note that bash was created in a world that was vastly different from the world today: different needs, tasks, languages to take inspiration from.

  • Missing data structures (flat arrays and hashes are not nearly enough). jq is a workaround, not a solution, in my eyes.
  • Awkward error handling, with a default of ignoring errors completely (proven to be a bad idea).
  • Expansion of undefined variables to the empty string (proven to be a bad idea).
  • -e , -u and other action-at-a-distance options.
  • Unchecked facts, just my feelings:
    • When bash was created, it was not assumed that bash would be used for any complex scripting.
    • bash was never “designed” as a language; it started with simple command execution, and other features were just bolted on as time went by, while a complete redesign and rewrite were off the table, presumably for compatibility reasons.
  • Syntax.
  • No widely used libraries (except a few for init scripts) and no central code repository to search for modules. (Correct me if I’m wrong here; I haven’t heard of these.)

My suggested solution

I would like to fill the gap. We have a systems-engineering-tasks-oriented language: bash. We have quite a few modern programming languages. What we don’t have is a language that is both modern and systems-engineering-tasks oriented. That’s exactly what I’m working on: Next Generation Shell. NGS is a fully fledged programming language with domain-specific syntax and features. NGS tries to avoid the issues listed above.

Expected questions and my answers

People work with existing languages and tools. Why do you need something else?

  • I assume I have a lower bullshit tolerance than many others. Some people might consider it normal to build more and more workarounds (especially around anemic DSLs) where I say “fuck this tool, I would have already finished the task without it (preferably using an appropriate scripting language)”. I don’t blame other people for the understandable desire to work with “standard” tools. I just think it’s not worth it when the solutions become too convoluted.
  • I am technically able to write a new programming language that solves my problems better than other languages.

Another programming language? Really? We have plenty already.

  • I would like to remind you that most of the programming languages were born out of dissatisfaction with existing ones.
  • Do you assume that we are at a global maximum with the languages we have, and that no better language can be made?

Feedback

Would you use NGS? Which features must it have? What’s the best way to ease adoption? Please comment here, on Reddit (/r/bash, /r/ProgrammingLanguages) or on Hacker News.


Update: following feedback roughly of the form “Yes, I get that, but many Ops tasks are done using configuration management tools and tools like CloudFormation and Terraform. How does NGS compare to these?” – there will be a blog post comparing NGS to those tools. Stay tuned!


Have a nice day!

Technology Prevention

Part of my job is to prevent usage of technologies. This sounds so uncool, I know. Do you want to increase the chance of success of your organization? You must prevent technologies. There are lots of technologies out there. Most of the technologies are not relevant to your situation. It is cool to build a spaceship, but do you need one?

Problem

The growth of a startup S that makes technology/product X matters more to S than whether there is a match between X and your use case. S generally doesn’t care whether your startup will succeed or fail because you used X. There is no immediate economic incentive for S to be honest. In short, S f*cks you over for money.

my-profit-fuck-you

 

As a consequence, a distorted picture of reality is presented to you:

  • The world is full of marketing bullshit. It is similar to psychological warfare, as noted in a post about 10gen marketing strategies.
  • Chunks of this bullshit are masked as engineering blogs.
  • Half-truths are presented, such as “company Blah uses X”. They might be using it, but for what? Is it at the core of their business or in some side project?
  • Vocal advocates of X are all over the place (gaining directly or indirectly from you using X).

Solution

  • Remember that your aim is to succeed as a company and the aim of S is also to succeed. The correlation does not have to exist, and when it exists it does not have to be positive.
  • Start with problems that you have and find the tools for solving them, not the other way around.
  • Consider people’s motives when they write about tool X. Will they benefit from widespread adoption of X (consultants, employees of the maker of X, people affected by investors of the company behind X)? Will they look bad if they review X negatively, even for a specific use case?
  • Looking at a tool, assume it’s the wrong one for your use case and then prove this statement wrong.
  • Everyone should at least be aware of cost-benefit analysis. In many cases it’s actually very simple: zero to minuscule benefit versus high adoption/migration cost.
  • Take top 10 latest-shiny-cool technologies. If you are a small startup, chances are that you need zero to two of them. (Not counting the Cloud as new).
  • Using the latest-shiny-cool technology to attract employees is not the right thing to do. You will probably attract employees who always want to switch to the latest technology. They may well leave for another company when it starts using the next latest-shiny-cool thing and you don’t.

See also: Prove your tool is the right choice


Have a nice weekend!

NGS unique features – Argv command line arguments builder

Background: what is NGS?

NGS LOGO

NGS, the Next Generation Shell, is a (work in progress) shell and a programming language built from the ground up for systems engineering tasks. You can think of it as bash designed today: sane syntax, data structures, functional programming, extensibility, cloud in mind, declarative primitives.

What’s the problem with constructing command line arguments?

The problem affects only the more “advanced” cases of constructing command line arguments, when some arguments may or may not be present. Let’s consider this example:

# Made-up syntax, resembling NGS
args = []
if 'Subnets' in props {
  args += '--subnets'
  args += props['Subnets']
}
if ... {
  args += ...
}
if ... {
  args += ...
}
...
aws elb create-load-balancer ... $args

Wouldn’t it be cleaner to get rid of all the ifs? … and what happens if props['Subnets'] is an empty array?

How does the Argv facility in NGS solve the problem?

Argv is the result of factoring out the common code involved in constructing command line arguments. The ifs above were factored out too; they now live inside Argv.

Let’s look at a usage example (real NGS code, from the AWS library):

argv = Argv({
  '--load-balancer-name': rd.anchor.name
  '--listeners': props.ListenerDescriptions.encode_json()
  '--subnets': rd.opt_prop('Subnets', props).map(only(ResDef, ids))
})
rd.run('create ELB', %(aws elb create-load-balancer $*argv))

The important points here are:

  1. Argv is a function with a single parameter which must be of type Hash (also called “dictionary” in some languages)
  2. The keys of the Hash are switches’ names (--load-balancer-name, --listeners, --subnets)
  3. The values of the Hash are values for the switches

The “if” that decides whether a switch is present in the resulting argv is inside the Argv implementation, and your code is clean of it. The values of the Hash are what Argv considers when deciding whether a switch should be present: null, an empty array and instances of type EmptyBox are treated as missing values, and the corresponding switch is discarded. For convenience, instances of type FullBox are unboxed when constructing the result of Argv.
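
To make the idea concrete outside of NGS, here is a minimal Python sketch of the same approach. This is not the actual Argv implementation; build_argv is a made-up helper that only illustrates moving the “ifs” into one place (EmptyBox/FullBox handling is omitted):

def build_argv(switches):
    """Turn {'--switch': value} into a flat argv list,
    dropping switches whose value is None or an empty list."""
    argv = []
    for switch, value in switches.items():
        if value is None or value == []:
            continue                          # the "if" lives here, once
        argv.append(switch)
        if isinstance(value, list):
            argv.extend(str(v) for v in value)
        else:
            argv.append(str(value))
    return argv

# Usage sketch (values are made up)
argv = build_argv({
    '--load-balancer-name': 'my-elb',
    '--subnets': ['subnet-111', 'subnet-222'],
    '--listeners': None,                      # missing value: switch is dropped
})
# argv == ['--load-balancer-name', 'my-elb',
#          '--subnets', 'subnet-111', 'subnet-222']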

The Argv facility is one more example of why NGS is a good fit for systems engineering tasks.


Have a nice weekend!

 

Why Next Generation Shell?

Background

I’m a systems engineer. The job that I’m doing is also called systems, SRE, DevOps or production engineer. I define my job as everything between “It works on my machine” of a developer and real life. Common tasks are setting up and maintaining cloud-based infrastructure: networking, compute, databases and other services. Other common tasks are setting up, configuring and maintaining everything inside a VM: disks+mounts, packages, configuration files, users, services. Additional aspects include monitoring, logging and graphing.

The problem

If we take specifically systems engineering tasks such as running a VM instance in a cloud, installing and running programs on a server and modifying configuration files, typical scripting (when no special tools are used) is done in either bash or Python/Ruby/Perl/Go.

Bash

The advantage of bash is that bash is domain specific. It is convenient for running external programs and for file manipulation.

# Count lines in all *.c files in this directory and below
wc -l $(find . -name '*.c')

# Make sure my_file has the line my_content
echo my_content >my_file

# Run a process and capture the output
out=$(my_process)

The disadvantage of bash is the horrible syntax, pitfalls everywhere, awkward error handling, and the absence of many features one would expect from a programming language (such as data structures, named function parameters, etc.).

# Inconsistent, awkward syntax,
# result of keeping backwards compatibility
if something;    then ... fi
while something; do ... done

# Can remove / if MY_DIR is not defined
# unless in "set -u" mode
rm -rf "$MY_DIR/"

# Removes files "a" and "b" instead of "a b"
myfile="a b"
rm $myfile

# Silently ignores the error unless in "set -e" mode
my_script

# Function parameters can't be named, they are
# in $1, $2, ... or in $@ and $*
myfunc() {
  FILE="$1"
  OPTION_TO_ENABLE="$2"
  ...
}

Leave bash alone, it was not intended for programming, don’t do anything in bash, just use external programs for everything.

What do you observe? Is it or is it not used as a programming language in real life?

General-Purpose programming languages

Python/Ruby/Perl/Go are general-purpose programming languages.

The advantage of general-purpose programming languages is their power, better syntax and the ability to handle arbitrary data structures.

orig = [1,2,3]
doubled = [x*2 for x in orig]

The disadvantage of general-purpose programming languages is that they are not, and cannot be, as convenient for systems engineering tasks, because they do not focus on this particular aspect of programming (in contrast to bash and other shells, for example).

# Write a whole file - too verbose
f = open('myfile', 'w+')
f.write('mycontent')
f.close()

# Run a process and capture the output
# https://docs.python.org/3.5/library/subprocess.html
import subprocess
proc = subprocess.Popen(...)
try:
    outs, errs = proc.communicate(timeout=15)
except subprocess.TimeoutExpired:
    proc.kill()
    outs, errs = proc.communicate()

Summary

My conclusion is that there is no handy language for systems engineering tasks. On one hand there is bash, which is domain specific but is not a good programming language and does not cover today’s needs; on the other hand there are general-purpose programming languages which do not specialize in these kinds of tasks.

You can use Puppet, Chef, Ansible, Terraform, CloudFormation, Capistrano and many other tools for common systems engineering tasks. What if your task is not covered by existing tools? Maybe it’s a one-off? Maybe it’s a case where using one of the existing tools is not the optimal solution? You would like to write a script, right? In that case, your life sucks because scripting sucks. That’s because there is no convenient language and libraries for getting systems engineering tasks done with minimal friction and effort.

Solution

I suggest creating a new programming language (with a shell) which is domain specific, like bash, and which incorporates important features of general-purpose programming languages: data structures, exceptions, types, multiple dispatch.

My way of looking at it: imagine that bash was created today, taking into account today’s reality and things that became clear with time. Some of them are:

  • The shell is used as a programming language.
  • A system is usually a set of VMs and APIs, not a single machine.
  • Most APIs return JSON, so data structures are needed; multiple jq calls are not convenient (see the sketch right after this list).
  • Silently ignoring errors proved to be a bad strategy (hence the set -e switch, which tries to solve the problem).
  • Silently substituting undefined variables with empty strings proved to be a bad strategy (hence the set -u switch).
  • Expanding $x into multiple arguments proved to be error-prone.
  • Syntax matters.
  • History entries without context have limited usefulness (cd $DIR for example: what was the current directory before cd and what was in $DIR ?)
  • UX
    • Spitting lots of text to a terminal is useless as it can not be processed by a human.
    • Feedback is important.
      • Exit code should be displayed by default.
      • An effort should be made to display status and progress of a process.
      • Ideally, something like pv should be integrated into the shell.
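
To illustrate the JSON point from the list above: with real data structures you parse the output once and then work with it directly, instead of chaining jq calls in bash. A rough Python sketch (assumes the AWS CLI is installed and configured; field names follow the EC2 describe-images output):

import json
import subprocess

# Run the CLI once and parse the JSON once
result = subprocess.run(
    ['aws', 'ec2', 'describe-images', '--owners', 'self'],
    capture_output=True, text=True, check=True,
)
amis = json.loads(result.stdout)['Images']

print(len(amis))                              # number of AMIs, not lines of text
image_ids = [ami['ImageId'] for ami in amis]  # ordinary list/dict operations from here on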

I’m not only suggesting the solution I’ve just described, I’m working on it. Please give it a try and/or join in to help develop it: NGS – Next Generation Shell.

NGS LOGO

# Make sure my_file has the line my_content
echo my_content >my_file

# Run a process and capture the output
out=`my_process`

# Get process handle (used to access output, exit code, killing)
p=$(my_process)

# Get process output and parse it, getting structured data
amis=``aws ec2 describe-images --owner self``
echo(amis.len()) # number of amis, not lines in output

# Functional programming support
orig = [1,2,3]
doubled = orig.map(X*2)

# Function parameters can be named, have default values, etc
F myfunc(a,b=1,*args,**kwargs) {
  ...
}

# Create AWS VPC and Gateway (idempotent)
NGS_BUILD_CIDR = '192.168.120.0/24'
NGS_BUILD_TAGS = {'Name': 'ngs-build'}
vpc = AWS::Vpc(NGS_BUILD_TAGS).converge(CidrBlock=NGS_BUILD_CIDR, Tags=NGS_BUILD_TAGS)
gw  = AWS::Igw(Attachments=[{'VpcId': vpc}]).converge(Tags=NGS_BUILD_TAGS)

I don’t think scripting is the right approach.

It really depends on the task, constraints, your approach and available alternative solutions. I expect that situations needing scripting will be with us for a while.

Another programming language? Really? Why the world needs yet another programming language?

I agree that creating a new language needs justification, because the effort that goes into creating and learning a language is considerable. The productivity gains from using the new language must outweigh the effort of learning it and switching.

The creation of NGS is justified in exactly the same way as many other languages were justified: dissatisfaction with all existing programming languages when trying to solve a specific problem or set of similar problems. In the case of NGS, the dissatisfaction is specifically with how existing programming languages address the systems engineering niche. NGS addresses this particular niche with a unique combination of features and trade-offs. The productivity of using NGS comes from the close match between the tool and the problems being solved.

Yet another shell? We have plenty already but they all have serious adoption problems.

NGS will be implementing ideas which are not present in other shells. Hopefully, the advantages will be significant enough to justify switching.

I’ll be just fine with bash/Python/Ruby/Perl/Go

You will. The decision to learn and use a new language depends on your circumstances: how many systems engineering tasks you are doing, how much you suffer, how much easier the tasks will become with NGS, how easily this can be done in your company / on your project and whether you are willing to take the risk.

You could just write a shell based on Ruby or Python or whatever, leveraging all the time and effort invested in existing language.

I could and I didn’t. Someone else did it for Python and for Scala (take a look, these are interesting projects).

  • I don’t think stretching an existing language into something else is the right solution.
  • NGS has features that cannot be implemented in a straightforward way as a library: special syntax for common tasks, multiple dispatch.

One could just write a library for Python or Ruby or whatever happens to be his/her favorite programming language, leveraging all the time and effort already invested in existing language.

In order to be similar to NGS, one would not only have to build a library but also change the language’s syntax. I personally know only two languages that can do that: Lisp (using reader macros) and Perl 6 (using the grammar facility). These are general-purpose programming languages; turning them into something NGS-like would be a significant effort, which I don’t think is justified.

PowerShell appears to be similar to what you describe here.

Note that I have very limited experience with PowerShell. The only aspect I definitely like is consistent usage of the $ sigil.

  • It’s probably a matter of taste and what you are accustomed to but I like NGS’ syntax more. PowerShell is pretty verbose.
  • DSC appears to be focused on resources inside a server/VM. NGS plans similar functionality. Meanwhile, NGS uses this approach in the AWS library: vpc = AWS::Vpc(NGS_BUILD_TAGS).converge(CidrBlock=NGS_BUILD_CIDR, Tags=NGS_BUILD_TAGS)

There are libraries for Python that make systems engineering tasks easier.

Right, sh for example. Such a solution can’t be used as a shell; it just improves the experience of calling external programs from Python.


Was this post convincing? Is anything missing to convince you personally? Let me know!

Have a nice day!

Please don’t use Puppet

Thinking process behind choosing a tool

The thinking process behind choosing a tool does not get the attention it deserves. While there are many discussions of the form “tool X vs tool Y”, there is very little discussion of how one should choose between tools or, in the presumable absence of alternatives, whether one should use the only candidate, tool X. This post covers a few things to keep in mind when selecting a tool, by highlighting a few common problems and fallacies. Puppet will be used as an example tool for consideration.

Focusing on positive parts only

When considering a product or a tool, positive aspects are too often overestimated and negative aspects that influence TCO (Total Cost of Ownership) are underestimated or neglected. There are several cognitive biases and logical fallacies involved. Cognitive biases and logical fallacies can be avoided to some extent just by being aware of them. I will be referring to these throughout the post to help you, the reader, become more aware of your thought process, which will hopefully improve it and consequently the decisions you make.

Marketing pushes to see the positive

We all know that marketing focuses on positive aspects of a product and neglects to mention downsides. This is specifically mentioned in “False advertising” article under “Omitting information”.

For example, the fact that it’s not convenient to manage Puppet modules (proof: the existence of a tool to do just that) will not appear in marketing materials. You might think that, on the contrary, the existence of Librarian-puppet makes management of these modules easier. It does, but it also brings more complexity to the system: new problems and bugs instead of inhuman manual management of modules.

This post will focus on the negative

While there is more than enough focus on the positive aspects of products, this post will be highlighting the negative aspects, in order to strike some balance. There are plenty of marketing materials, but it’s harder to find a list of the problems you only discover when you are neck-deep in the tool/product. These problems will be listed here. Note that this cannot be an exhaustive list, because different situations reveal different problems; this post is based only on my experience and that of several friends.

Listing the problems of a tool touches the Availability heuristic cognitive bias: the easier you recall something, the more “important” it is. You are bombarded by marketing materials, which are all positive. When considering a tool, your natural flow of thought is “How easily can I remember the positive sides of the tool?” – and it’s easy, because you have probably already been brainwashed by how good the tool is. Then “How easily can I remember the negative sides of the tool?” is much harder. This is not the kind of information that will be pushed to you by the people behind the tool; they have no interest in doing so. Their money goes into advertising how good the tool is, not how bad it is. You can balance your rosy impressions of any tool or product by looking at GitHub issues, digging through StackOverflow for the downsides, or reading posts like this one.

Please, assume that X is the wrong tool for your needs.

As opposed to “yeah, looks good, let’s use it”, this approach leads to a more thoughtful tool selection process. Please read Prove your tool is the right choice.

“Everybody uses X”

The “Everybody uses X” thought might have been planted in your brain by marketing efforts. Please analyze the source of that thought carefully. Maybe you heard about the product from some of your friends and/or colleagues and made a generalization? Maybe people are just stuck with it? Maybe that’s simply what they know? Did you search for alternatives? Did you try to disprove “Everybody uses X”?

“Everybody uses X, therefore it’s good”

Whether this thought was planted by marketing or not, no, there is no logical connection between the first and the second clauses.

If a lot of people use something, it becomes better, as there is more feedback and there are more contributors. It is often implied that therefore X is good. But improvement over time or with a growing user base does not mean X is good enough for any particular use right now.

Did you communicate with the people that use X? Did they tell you it was a good decision? Beware of Choice-supportive bias when you talk to them. Which alternatives did they consider? Are they able to articulate downsides? Every solution has downsides, being able to recognize these increases credibility of the opinion about X.

“Everybody uses X, we should use X”

Yes, if you count the value of “then we can blog about it and be part of the hype, possibly getting some traction and traffic”. This may have some estimated value, which should be weighed against the cost incurred by choosing an otherwise unneeded or inferior tool or technology. You can point your bosses to this paragraph, along with your estimate of the costs of using tool X vs better alternatives (which might be simply not using it and coding the needed functionality yourself; the comparison is valid both for X vs Y and for X vs no X).

No, “We should use X” does not logically follow from “Everybody uses X”. Beware of conformity bias.

“Company C uses X”

This piece of information, when served by the vendor of X, implies that company C knows better and you should use X too.

Company C is a big and respectable company with smart engineers. The vendor of X will gladly list big and reputable companies that use X. That’s the “Argument from authority”.

Again, there is no straight logical path between “C uses X” and “we should use X too”.

Chances are that company C is vastly different from your company and their circumstances and situation are different from yours.

Company C can also make mistakes. You are unlikely to see a blog post from the vendor of X titled “Company C realized their mistake and migrated from X”.

Claims of success with tool X

Treat claims of successful usage of tool X with caution. A quick search for “measuring project success” reveals the following dimensions to look at when estimating the success of a project:

  • Cost
  • Scope
  • Quality
  • Time
  • Team satisfaction
  • Customer satisfaction

The claims of successful usage of tool X carry almost no information about what really happens. “We are using Puppet successfully” might mean (when taken to the extreme) that, for 100 servers and one deploy per day, the following applies:

  • Cost: There is a dedicated team of five costly operations people who work just on Puppet, because it’s complex.
  • Scope: Puppet covers 80% of the needs; this might be the only dimension looked at when claiming success.
  • Quality, Team satisfaction: This team is constantly cursing because of bugs and module or Puppet upgrade issues such as “Upgrade to puppet-mysql 3.6.0 Broke My Manifest” (fixed in just two months!) or the “puppet 4.5.0 has introduced a internal version of dig, that is not compatible to stdlib’s version” oopsie.

    Enjoy the list of regression bugs. It’s hard to blame the Puppet developers for these bugs, because such issues are natural for projects of this size and complexity. What I do suggest is that creating your own domain-specific language, one that is not a real programming language, for a configuration management tool is a bad idea. I’ll elaborate on this point in a bit, in the “Puppet DSL” section.

  • Time: It took 6 months for the above team to implement Puppet. Time to implement any feature is unpredictable because of the complexity and unexpected bugs along the way.
  • Customer satisfaction: Given all of the above, it’s hard to believe in any kind of satisfaction with what’s going on.

It’s also worth keeping in mind that any demonstrated success, even a real one, does not mean the same solution will be equally applicable to your situation, because your situation is almost certainly different in one or more dimensions: time, budget, scope (the problem you are solving), skills, requirements.

“But X also provides feature F”

I am sure that the advertisements will mention all the important features as well as “cool” features. Do you really need F?

When choosing a tool, the thought “But X also provides feature F” might be dangerous if F is not something you immediately need. One might think that F will be needed later. This might be the case, but what are the odds, what’s the value of F to you, and how much would it cost to implement with another tool or to write yourself? Also, consider the “horizon”: if you might need feature F in 3 years, in many situations this should be plainly ignored. In 3 years there might be another tool for F, or you might have switched from X to something else for other reasons by then.

Suppose there is another tool X2 which is an alternative to X. X2 does not provide F, but its estimated TCO over a year is 50% lower than that of X. Do the math: it might be that using X2 for the first year and then switching to X, including the switching costs, is cheaper overall (e.g. if X costs 100 per year and X2 costs 50, a later switch costing up to 50 still leaves you ahead).

Putting tools before needs

“So, there is this new trendy hypy tool X. How can we use it?” is typically a bad start. At the very least it should be “So, there is this new trendy hypy tool X. Do we have any problems where X would be a better alternative?”

Ideally the approach would be “We have problem P, which alternative solutions do we have?”. P might be some inefficiency or desired functionality. Solutions, once again, do not have to mean existing tools.

Puppet – the good parts

I will quickly go over a few good parts, because I want this post to at least try to be objective.

Convergence

Convergence is an approach where one defines the desired state, not the steps to be taken to get there. The steps are abstracted away, and on each run the system tries to get as close to the desired state as possible.

I do agree that declaring a resource such as a file, user, package or service and its desired state is a good approach. It’s concise and usually simpler than specifying the operations that would lead to the desired state, as regular scripts do. This idea manifests in many other tools too: Chef, Ansible, CloudFormation, Terraform.
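
As a toy illustration of the convergence idea (plain Python, not any of the tools above; ensure_file is a made-up helper): you declare the desired state, and the steps, including the “already fine, do nothing” case, are hidden inside.

import os

def ensure_file(path, content):
    """Converge a file towards the desired content (idempotent)."""
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return False        # already in the desired state, nothing to do
    with open(path, 'w') as f:
        f.write(content)
    return True                     # a change was made

# Declare the desired state; a second run changes nothing
ensure_file('/tmp/motd', 'Welcome!\n')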

Appropriate in some situations

  • Think about a startup where someone does the systems engineering job part time and is not a professional. As Guy Egozy pointed out, there are situations, such as startups with limited resources and basic needs, where using a configuration management tool might make more sense than in other situations.
  • Urgent demo with all defaults: if you have good control of the tool and you know that you need some very specific functionality, say a wordpress+mysql demo tomorrow, it is probably worth preparing the demo with Puppet or Chef. There is still the danger, of course, that the module you were using a month ago has since changed and you need to invest additional time to make things work. Or maybe the module is just broken now.

Multiple platforms support

In my experience, the chances that you will be running the same applications on, say, Windows and Linux are pretty slim. The overlap of installed software between different platforms is likely to be infrastructure tooling only (monitoring, graphing, logging). Is it really worth the price?

Puppet DSL

Puppet class

Puppet DSL has a concept of a “class” which has nothing to do with classes in programming languages. At least in retrospect it was not such a good idea, especially when you watch operations people trying to explain Puppet classes to developers.

Limited DSL limitations 🙂

Acknowledged as a problem by facts

The limitations of the DSL were, in my opinion, acknowledged by the actions taken by Puppet’s developers and contributors:

Limited DSL is not a great idea!

I do understand why a limited DSL can be aesthetically and mathematically appealing. The problem is that life is more complex than the limited DSL. What could be 10 lines of real code turns into 50 lines of ugly copy+paste and/or hacks around the DSL’s limitations.

It sounds reasonable that at the time CFEngine and Puppet were created there were not enough examples of the shortcomings of limited DSLs and their clashes with real life. Today we have more:

  • Puppet – DSL failure admitted by actions, as discussed above.
  • Ansible – just looks bad. Some features look like they were torn from a programming language and forced into YAML.
  • Terraform – often generated because, well… life. This one is more of a configuration language by design. This approach has pros and cons when applied to infrastructure.
  • CloudFormation – 99% configuration format and 1% language, which is why it’s generated for everything except trivial cases. You do have the alternative of not generating the CloudFormation input file and providing custom resources backed by AWS Lambda functions instead; they will do some of the work. While this fits the CloudFormation model perfectly and makes CloudFormation much more powerful, I would really prefer a script over inversion of control plus an additional AWS service (Lambda) that I have to use – one more thing that can go wrong or just be unavailable when needed the most.

I do not agree that Terraform should be limited the way it is, but in my opinion Terraform and CloudFormation are more legitimately limited, while Puppet and Ansible are just bad design. This limitation by design leads to complex workarounds which are costly and sometimes fragile, not to mention the mental well-being of the systems engineers who work with Puppet.

We can all stop creating domain-specific languages for configuration management that are not built on top of real programming languages. Except for a few cases, that’s a bad idea. We can admit it instead of perpetuating the wishful thinking that reality is simple and a limited DSL can somehow deal with it.

Puppet modules

Dependencies between Puppet modules

Plainly a headache. Modules have dependencies on other modules, and so on. Finding compatible versions of modules is a hard problem. That’s why we have Librarian-puppet. As I mentioned above, it has its own issues.

There are also issues that Librarian-puppet cannot solve, which are inherent to a system of this scale, complexity and number of contributors. Let’s say you have module APP1 that depends on module LIB, and module APP2 that also depends on LIB. Pinning the version of module LIB because APP1 has a bug can prevent you from upgrading module APP2, which in newer versions depends on a newer LIB. This is not an imaginary scenario but real-life experience.

Breakage of Puppet modules

Another aspect is that in this complex environment it’s somewhere between hard and impossible for any module maintainer to make sure his/her changes do not break anything. Therefore, they do break:

Popular community modules deal with so many cases and operating systems that breakage of some functionality is inevitable.

Community modules

There is this idea that is kind of in the air: “you have community modules for everything, if you are not using them you are incompetent and wasting your time and money”.

This could come from 3 sources:

  • Marketing
  • People who use community modules for simple cases, where they work fine
  • People that underestimate the amount of maintenance work required to make community modules work for your particular case.

The feedback that I’ve got several times from different sources is that if you are doing anything serious with your configuration management tool, you should write your own modules; fitting community modules to your needs is too costly.

Graph dependencies model problems

Do you know people who think in dependency graphs? Most people I know are much more comfortable thinking about a sequence of items or operations to perform. Thinking in dependency graphs, for example about package version compatibility, usually takes a recognizable mental effort, often accompanied by curses.

The Puppet team admitted (again, by their actions) that this is a problem: they introduced ordering configuration and at some point made “manifest” ordering the default. Note that this ordering applies only to resources without explicit dependencies and within one manifest.

The graphs are somewhat implicit, which causes surprises and the consequent WTFs. Messages about dependency errors are not easily understood.

Marketing

  • Puppet usage is compared to performing the same tasks manually – “Getting rid of the manual deployments”. This is clearly a marketing trick: comparing your tool to the worst possible alternative, not to other tools that are similar to yours.
  • Puppet is compared to bash scripts. Why not Python or Ruby?
  • “Automate!” is all over the Puppet site. It implies that Puppet is a good automation tool.
  • The top 5 success stories / case studies use Puppet Enterprise. Coincidence? I think not 🙂

Thanks

Many thanks to Konstantin Nazarov (@racktear) for his guidance. We met at DevOpsDays Moscow 2017, where he offered free guidance lessons for improving speaking and writing skills. In reality, the lessons also include productivity tips which help me a lot. Feel free to contact Konstantin; he might have a free weekly slot for you.


Have a productive career!

About declarative frameworks and tools

This post is a reply to a “just use Terraform” recommendation I’ve just seen. I hope more people will benefit from my perspective if it’s posted here. There is plenty of marketing behind most of the tools I mention here. It’s all rosy; see the “Life before Puppet” video. Let’s balance this marketing bullshit a bit.

Think twice before using declarative framework/tool

Edit 2017-12-06: note that this post is my opinion based on my experience with Puppet, Chef and CloudFormation. My opinion is based on my use cases and circumstances.

Terraform, CloudFormation, Puppet and Chef, like any other declarative frameworks/tools, take control away from you. They work fine for “hello world” examples. Then there is real life, where you need something these frameworks did not anticipate, and you are sorry you have not coded everything yourself from the start. Now you are stuck with these tools and you will be paying for it with your time and money. Working around the limitations of such tools is a pain.

I am using CloudFormation and have used Puppet and Chef in the past. These tools do have their place. In my opinion it’s a very limited set of scenarios. Terraform, CloudFormation, Puppet and Chef are used much more widely than they should be.

These tools have some value, but too often people neglect the cost, which in many cases outweighs the value. Most of the cost comes from inflexibility. Terraform and CloudFormation are so limited that people frequently use yet another tool to generate their input. That adds another bit to the cost.

I hear frequently from a friend (sorry, can’t name him) how much they suffer from Terraform’s inflexibility. The inflexibility cannot be fixed, because it’s a declarative framework. Unfortunately, they are so invested in Terraform that they will continue to spend hundreds of hours fighting it. Chef is causing trouble there too: community cookbooks proved to be a mismatch for the needs and sanity of the engineers there.

Edit 2017-12-06, following discussion on Reddit:

  • I spoke to my friend regarding Terraform. Many of the issues they had are solved by a more up-to-date version of TF.
  • The “inflexibility” I’m talking about is my opinion of the DSL. On the other hand, the Terraform DSL contains features that apparently help cover most use cases.
  • Just to make it clear: CloudFormation gets a lot of flexibility through Lambda. That approach doesn’t make sense to me, but it is there.

… and there is this gem

A key component of every successful Puppet implementation is access to a knowledgeable support team

That’s from https://puppet.com/support-services/customer-support/support-plans

Are you sure you want to use Puppet? Apparently you can’t do it well without their support… Just saying…

Is one of these tools right for you?

Regular considerations for choosing a tool apply. See my older post “Prove your tool is the right choice“.

Expected replies and my replies to those

You don’t get it.

OK

You don’t understand these tools.

OK

You are not using these tools right / as intended.

OK

Are you crazy? You want to code everything yourself?

Let’s take it to the extreme: no new code should be written. No libraries, no frameworks. Because everything already exists. Sounds about right.

People smarter than you have figured it all out, use their tools

Smarter people don’t always produce better solutions or solutions that fit your use case. Most of the time smart people will produce smart solutions… and then there are people that don’t usually think in graphs and are really puzzled when debugging Puppet cyclic dependency errors for example.

Most of the code you need is already written, don’t waste time and money, use it! Community Cookbooks and modules are great!

This is marketing bullshit. Don’t buy it! It’s often more expensive to adopt code that does not meet your exact needs and is much more complex than you need (because it has to support multiple platforms and use cases) than to write your own. I have seen suffering caused by the use of community Cookbooks/modules, followed by an in-house rewrite or fork.

Don’t you care about the next guy? Work with standard tools!

Let’s do some math. A team of two works for a year. They are (very modest estimation) 10% more productive because they have coded whatever they needed and were not fighting with the tools. That’s roughly 2 people × ~1800 work hours × 10% ≈ 360 hours saved. Even under the questionable assumption that a custom solution is harder to understand for the third person who joins the team after a year, how much harder is it? Is it more than 300 hours harder?

Update following responses on Reddit

2017-04-28

2 totally different toolsets – infrastructure orchestration (Terraform, Cloudformation), and Configuration Management (Puppet, Chef)… — (/u/absdevops)

Yes. What is common to all these tools is declarative style and their usage: these tools are typically run using CLI.

All these tools have three axes that I consider:

  1. “Input” axis: What’s the input of these tools?
    1. Configuration format
    2. Half-baked programming language that was probably never intended to be a programming language
    3. Real programming language
  2. “Calling” axis: framework vs library (typical usage)
  3. “TCO” axis: TCO vs other solutions, especially vs the other solution that is always available – code the subset of the functionality that you need yourself

I’d like to make sure that it’s clear that the tools mentioned in this article have different positions on the 3 axes and are not equal in the value they provide you in your specific situation.

The main point of the article is that while these tools differ on axes 1 and 3, they are all limiting because, conceptually, they are all frameworks. You hand execution over to the tool and it does a lot. This is where you lose your flexibility, as opposed to using a library. You have relatively little control over what happens inside the tool.

I must strongly disagree with Terraform being put in the list – its a great base tool with limitations that can be worked around. — (/u/absdevops)

I don’t want to work around limitations; that seems to be the norm with these tools. I’d rather have a library that misses parts which I’d code myself. Working around limitations is, in my opinion, generally much worse than missing functionality (it depends on the specific circumstances, of course).

Regarding inflexibility – it’s probably the most flexible tool of the bunch — (/u/absdevops)

Please note that we are still comparing tools that all use a limiting paradigm: frameworks.

I will also duel anyone to the death for preference of Cloudformation syntax to Terraform — (/u/absdevops)

We are talking about the “Input” axis I mentioned above. Yes, Terraform’s syntax, apart from being more aesthetically pleasing, is somewhat closer to “half-baked programming language that was probably never intended to be a programming language”, while CloudFormation is somewhat closer to “configuration format”.

I totally disagree with points made about having to generate Terraform manifests. … generate what you need specifically, and hand it off to Terraform, much like making an API call to a library. — /u/SlinkyAvenger

There is a huge difference between the amount of work done by a typical API call and what these tools do once you call them. With more granular API calls, you decide if and when you make specific calls and what you do in between – it’s much more flexible.

I’m also a big proponent of Puppet — /u/SlinkyAvenger

One of the low-value tools from my perspective. I’ll explain. On the “Input” axis, it’s a half-baked programming language: better than a configuration file but still loses to Chef, for example. On the “TCO” axis, I really think that Puppet and Chef are not good alternatives to custom scripts in most cases. Scripts, by the way, also win on the “Calling” axis, which means flexibility.

I’d really like to hear what you’re honestly going back to puppet support for. — /u/neoghostz

We don’t. When we suffered while working with Puppet, we knew that support would not solve our problems. Problems with crappy community modules cannot be solved by support. Breakage on module version updates – same. Librarian, more complexity on top of complexity – same. The quote above about support (“A key component of every successful Puppet implementation is access to a knowledgeable support team”) was just to highlight that the folks at Puppet think people can’t use it without support. It’s a humorous point and not really important.

What is the point of this article? It basically dumps on Terraform, CF, Puppet, Chef, etc., but offers no actual criticism (other than a vague ‘it takes away control’ statement) or, perhaps more importantly, alternatives. — /u/cryonine

The point is that all these tools would have been better if they were implemented as libraries on top of real programming languages, where you call the parts that you need instead of making one “do everything” call.

With the exception of Chef, these tools use as input either configuration files or a configuration-file-almost-a-programming-language format. It’s always the same path:

  1. We need a small, limited DSL. It’s so academically beautiful, we can prove theorems about it.
  2. Oh wait, there are real-world scenarios where it’s not enough. Damn these complaining engineers.
  3. Let’s add a stdlib.
  4. Let’s add proper loops.
  5. Now we have a half-baked programming language.

Elaborating on taking control away from you. You get convoluted things like this:

Alternatives

For Puppet and Chef, I have not seen a single system where my estimated TCO for these tools would be lower than that of the bunch of idempotent, modular bash scripts which I use instead. They did not take much time to write. Some Python is used for configuration generation (json / jinja templates + environment data).
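
For the configuration generation part, a minimal sketch of what I mean (assuming the jinja2 package; the file names and the structure of env.json are made up for illustration):

import json
from jinja2 import Template

# Environment data, e.g. {"server_name": "example.com"}
with open('env.json') as f:
    env = json.load(f)

# A jinja template of the configuration file
with open('nginx.conf.j2') as f:
    template = Template(f.read())

# Rendered configuration, ready to be deployed by a script
with open('nginx.conf', 'w') as f:
    f.write(template.render(**env))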

With CloudFormation and Terraform it’s not that simple. I’m mostly amazed that nobody builds libraries which would just provide declarative primitives, rather than frameworks where you feed in everything you need via one call. I am working on one, but it is really strange to me that I haven’t already heard about such a library for Python or Ruby.

Terraform … vastly superior to any other alternate out there — /u/cryonine

Not sure I agree 100% because it depends on situation but I can imagine many situations where it’s correct. The important thing here is that I think that all current alternatives are not so good.

How is it wrongful to assume that a custom solution is harder to understand? That’s completely accurate. — /u/cryonine

A custom solution is simpler. Do you really need documentation for 19 lines of bash code that install Nginx and another 29 that do a restart which handles leaking file descriptors? You will definitely need documentation for a 2000+ line Chef cookbook or Puppet module that installs Nginx and … oh wait… how do I reload Nginx and then conditionally (if enough file descriptors have leaked) restart it? Time to dive in 🙂

I can imagine how a custom solution could become complicated (read: harder to maintain, higher TCO) if done by unprofessional people. In some cases it might be better for them to use a framework. On the other hand, they might get stuck when trying to do something advanced with the framework. It really depends on the situation.

While “use standard tools” generally sounds right, I have seen too many convoluted solutions built with “standard tools” because of their inflexibility: people were trying to work around the limitations. Comparing the top-down execution of a simple script to workarounds for these tools, it’s much simpler to wrap your head around the scripts. I recently handed one of my clients over to the next guy. I asked him how he was doing and he told me he was happy to have a simple custom solution instead of complex frameworks. TCO has many components; choosing “standard tools” does not always outweigh the other aspects.

 


Have a nice day and a productive life!