Common,,Mistakes

July 6, 2026 Ilya Sher

As a programmer or code reviewer, you should treat any string manipulation as a red flag that puts you on alert. It’s error-prone and mistakes are common. That goes for constructing, parsing, and searching within strings. Avoid string manipulation whenever possible.

Rationales, pitfalls, and alternative solutions follow.

Concatenation

Concatenation usually means having two (or more) separate data items mashed into one. The problems with that are:

Syntactically – depending on how this is done, it might be easy, hard, or impossible to parse the string back into the original items with guaranteed correctness. One can think of concatenation as (likely) incorrectly implemented serialization and the corresponding parsing as incorrectly implemented deserialization.
Semantically – in many cases string concatenation is semantically incorrect and shouldn’t have been done at all; the items should have been kept separate.

Examples follow.

Emojis in Log Messages

Assume that we have added several methods to our logger:

logSecurityIssue(message)
logPerformanceIssue(message)
logPossibleBug(message)
logInvalidInput(message)
logInvalidConfiguration(message)

Each one of them results in a new log message of the form: “${category_emoji} ${message}”. The category of the message should have been in its own field instead. The consequences of broken semantics (two-in-one message field) are:

Broken user interface: Now imagine a user interface for this log where the user tries to filter by category. If you mixed the category into the message, there is no “category” from the UI perspective. You would need to search by emoji which is harder to type and can potentially occur inside unrelated messages. Leaving aside what your indexer might decide to do with these emojis.

Harder automated parsing for analytics of these logs: Parsing (for analytics, for example) might or might not be an issue. We know that the emoji is at the beginning of the message and is separated by a space. That works. Unless… you have a generic log(message) method which does not prepend any emoji. Leaving aside the special code in the parser for the no-emoji case, how about messages passed to log(message) that happen to start with an emoji? Note that even when parsing is “not an issue”, it is already more of an issue than not having to parse at all, which would be the case if the category went into its own field.
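To make the “own field” option concrete, here is a minimal sketch in Python. The names Category, LogRecord and by_category are made up for illustration and do not come from any particular logging library; the point is only that the category travels next to the message instead of inside it.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Hypothetical categories mirroring the logger methods listed above.
    SECURITY = "security"
    PERFORMANCE = "performance"
    POSSIBLE_BUG = "possible_bug"
    INVALID_INPUT = "invalid_input"
    INVALID_CONFIGURATION = "invalid_configuration"
    GENERAL = "general"  # what a plain log(message) call would use


@dataclass(frozen=True)
class LogRecord:
    category: Category
    message: str

    def to_json(self) -> str:
        # The category is serialized as its own field, not as a message prefix.
        return json.dumps({"category": self.category.value, "message": self.message})


def by_category(records: list[LogRecord], category: Category) -> list[LogRecord]:
    # Filtering compares the field; the message text is never inspected.
    return [r for r in records if r.category is category]
```

With this shape, a general message that happens to start with an emoji can never be mistaken for a security issue, and the UI filter is a plain field comparison.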

Constructing Code

When concatenation is used to construct code with embedded data, the data needs to be properly sanitized and/or escaped/quoted. Failing to do so, a common mistake, enables:

Failed execution due to resulting syntax error
Incorrect execution due to unintended parsing of syntactically correct result of concatenation
Execution of attacker-supplied code: an attacker can pass specially crafted data that contains code that will be executed. This is known as Code Injection. Common examples are SQL injection and XSS.

For constructing FHIRPath expressions, for example, use the provided way of passing variables (the envVars argument); do not concatenate code and data.

For SQL, almost any framework (ORMs for example) provides a way to pass values without concatenation of SQL “code” and data together.
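As an illustration, here is a small sketch using Python’s built-in sqlite3 module. The users table and the find_user_ids helper are assumptions made up for this example; the relevant part is the “?” placeholder, which keeps the SQL text and the value apart.

```python
import sqlite3


def find_user_ids(conn: sqlite3.Connection, name: str) -> list[int]:
    # The value is passed separately from the SQL text, so it is
    # always treated as data and never parsed as SQL.
    rows = conn.execute("SELECT id FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]


# Hypothetical in-memory table, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("o'brien",)])
```

A name with a quote in it works without any manual escaping, and a classic injection attempt such as x' OR '1'='1 is just a name that matches nobody.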

As for code in general, I’m against generating/constructing code, but if you have to, you can build an AST and then convert it into code in one place, where you would need to deal with escaping/quoting only once.
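A rough sketch of the AST approach, using Python’s standard ast module (ast.unparse needs Python 3.9 or later). The build_print_call helper is a made-up name for this example. The data goes into the tree as a constant node, and quoting happens in exactly one place, inside the unparser.

```python
import ast


def build_print_call(text: str) -> str:
    # Build the tree for: print(<text>)
    call = ast.Call(
        func=ast.Name(id="print", ctx=ast.Load()),
        args=[ast.Constant(value=text)],  # data stays data until unparsing
        keywords=[],
    )
    # The unparser is the single place that decides how to quote the string.
    return ast.unparse(call)
```

Whatever quotes or newlines the text contains, the generated source parses back into a call with the same string as its only argument.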

As a tangent / side note, the more general issue of mixing code and data, or instructions and data, is a long-known security issue with known solutions, except in the case of LLMs. This issue is unsolved in AI tools and there is no solution in sight.

Constructing URLs

Another common occurrence is using concatenation for constructing URLs. Stop for a minute and think what’s wrong with that…

As with other cases, escaping (or encoding) is often forgotten. I recommend using new URL() in combination with modifying searchParams instead of being on your toes trying to remember escaping/quoting/encoding all the time.
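The recommendation above is about JavaScript’s URL and searchParams. The same idea can be sketched in Python with the standard urllib.parse module, which is what this example uses instead. The with_query helper and the example.com address are assumptions for illustration only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(base_url: str, params: dict[str, str]) -> str:
    parts = urlsplit(base_url)
    # Keep whatever query parameters the base URL already has.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(params.items())
    # urlencode does the percent-encoding in one place.
    return urlunsplit(parts._replace(query=urlencode(query)))


# Concatenation: the "&" and "=" inside the value change the meaning of the URL.
naive = "https://example.com/search?q=" + "a&b=c d"

# Structured construction: the value survives a round trip unchanged.
safe = with_query("https://example.com/search", {"q": "a&b=c d"})
```

Parsing naive back yields two parameters, q and b, while safe yields the single parameter that was intended.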

Constructing HTML

Almost any framework here provides escaping/quoting. Use it. Just don’t concatenate HTML and dynamic data. If you do, again, you have to remember to escape/quote all the time.
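To show what that escaping buys you, here is a tiny sketch with Python’s standard html module. In a real project the template engine of your framework would normally do this for you; render_comment is a hypothetical helper that only stands in for it.

```python
import html


def render_comment(author: str, text: str) -> str:
    # Every dynamic value passes through html.escape before it
    # gets anywhere near the markup.
    return f"<p><b>{html.escape(author)}</b>: {html.escape(text)}</p>"
```

A comment containing a script tag comes out as visible text instead of markup. The weak point of this hand-written version is exactly the one described above: forget one html.escape call and the hole is back, which is why automatic escaping in the framework is preferable.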

Constructing Output for Humans only

Here we get into my long-standing issue (aka pet peeve) with Unix. Programs are primarily written to display results to users as text.

This “design” is a nightmare of forgotten escaping/quoting and incorrect parsing, and everyone pays for it whenever the output actually has to be consumed by the next program… which does happen quite often.

I recommend preparing a data structure and only concatenating it into strings at the latest possible moment, for human consumption. This way you can have a switch to (properly) serialize this data instead. Use this switch if the output needs to be used by another program (typically indicated by a non-tty stdout). Serialization avoids concatenation (in the bad sense), which is very hard to undo in the next program.
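A minimal sketch of such a switch, assuming JSON as the serialization format. The render function and the shape of the entries are made up for this example; the only real API used for detection is sys.stdout.isatty().

```python
import json
import sys


def render(entries: list[dict], human: bool) -> str:
    if human:
        # Concatenation happens only here, at the very last moment.
        return "\n".join(f"{e['name']}: {e['size']} bytes" for e in entries)
    # For programs: proper serialization, nothing to guess when reading it back.
    return json.dumps(entries)


def main() -> None:
    # Hypothetical data; note the space and the newline in the names.
    entries = [
        {"name": "report final.txt", "size": 1024},
        {"name": "odd\nname.txt", "size": 7},
    ]
    print(render(entries, human=sys.stdout.isatty()))


if __name__ == "__main__":
    main()
```

The next program in the pipe gets the names back exactly, including the one with a newline in it, which the human-oriented format cannot guarantee.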

Alternatively, use an existing solution, libxo, to output in configurable formats. Configuration is through command line arguments.

Side note about parsing outputs of programs. A very nice project that does exactly that is jc, which in my opinion would not be needed if we had the correct architecture to begin with.

Parsing

Text parsing is common… and error-prone. In general, every parser is more complicated than it initially seems.

Splitting CSV file lines by comma? This is wrong. Commas can be in quoted fields.
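A quick demonstration with Python’s standard csv module. The sample line and the parse_csv_line helper are made up for this example.

```python
import csv


def parse_csv_line(line: str) -> list[str]:
    # csv.reader accepts any iterable of lines and understands quoting.
    return next(csv.reader([line]))


line = 'widget,"red, large",4'

# Splitting by comma cuts the quoted field in two and keeps the quotes.
wrong = line.split(",")

# A CSV parser returns the three fields that were actually written.
right = parse_csv_line(line)
```

Here wrong has four pieces with stray quote characters, while right has the three intended fields.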

Parsing your own format? Unescaped delimiters – a bug; escaped delimiters – extra code to parse. You can’t win.

If you can’t avoid using your own format (which must be a very rare occurrence, when none of the existing formats match the use case), use a parser library. I like PEG parsers. Raku has its own parsing feature. For binary formats there are also parser libraries; use them. Whatever works for you, just not a handmade parser. This advice is applicable to close to 100% of cases.

Searching

You have items “foo”, “bar”, and “baz”. They should be an array/list/collection/set (however it is called in your language). If they are mashed into a string, for example “foo:bar:baz”, looking for a substring “bar” to check for element presence is likely to be incorrect. In “real life” the items would be an input which we don’t control. Possible problems are:

Substring match

If you have “foo bar” as an input, your checks for both “foo” and “bar” would succeed, which is not intended of course.

The next obvious step is to look for “:bar:” instead. For that to work in the general case you need to handle the edge cases for the start and the end of the string. That means either having the string “:foo:bar:baz:” or doing three separate lookups – “bar:” at the beginning, “:bar” at the end, and “:bar:” in the middle.

If you are going with the “:foo:bar:baz:”, another edge case surfaces. Will you ever be checking if the empty string is an element? With this format, the empty string is always (well, almost) an element here.
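The false positives described above are easy to reproduce. In this small sketch naive_contains is a made-up name for the substring check, and the last lines show one way the empty string turns up once the delimiter-wrapped form is split back.

```python
def naive_contains(joined: str, item: str) -> bool:
    # Substring search: knows nothing about element boundaries.
    return item in joined


items = ["foo bar", "baz"]
joined = ":".join(items)  # "foo bar:baz"

# Neither "foo" nor "bar" is an element, yet both are "found".
false_positives = [x for x in ("foo", "bar") if naive_contains(joined, x)]

# Membership in a real collection compares whole elements.
real_hits = [x for x in ("foo", "bar") if x in items]

# Splitting the wrapped form yields empty strings at both ends.
wrapped_parts = ":foo:bar:baz:".split(":")
```

The list membership check needs no delimiters, no edge cases for start and end, and no decision about what the empty string means.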

Unescaped Separator

Let’s assume you went with “:foo:bar:baz:” and are now checking for the presence of “:bar:”. A possible element “foo bar” won’t match. But how about a “foo:bar” input element? If your input element has the separator character in it, the search is broken. You need to encode the separator character before adding an element to the string… and decode when you need the value back… and you need to remember to encode the substring that you are looking for too. What could this look like, for example?

Let’s say our escape character is “@”. Input “:” would be encoded as “@c”. But then an “@c” left as is in the input would later be incorrectly decoded to “:”. You have to escape “@” in the input too. So “@” would become, let’s say, “@@”.

“foo:bar” encodes as “foo@cbar”, “foo@bar” as “foo@@bar”. The same encoding has to be applied both when adding new elements and when preparing the substring to search for.

Such a string, “:foo:bar:baz:foo@cbar:”, represents a serialized array/list/etc. and you have just invented your own serialization format. See my earlier post – On Accidental Serialization Formats.
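For a feeling of how much code this “simple” scheme already takes, here is a sketch of the encoding described above. The function names are made up for this example, and it is an illustration of the pitfall, not a recommendation to build this.

```python
def encode(item: str) -> str:
    # Escape the escape character first, then the separator.
    return item.replace("@", "@@").replace(":", "@c")


def decode(encoded: str) -> str:
    # Two plain replace() calls would not work here ("@@c" must decode
    # to "@c", not to ":"), so the string is scanned left to right.
    out = []
    chars = iter(encoded)
    for ch in chars:
        if ch != "@":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt == "@":
            out.append("@")
        elif nxt == "c":
            out.append(":")
        else:
            raise ValueError(f"invalid escape sequence in {encoded!r}")
    return "".join(out)


def serialize(items: list[str]) -> str:
    # Leading separator, then every encoded element followed by a separator.
    return ":" + "".join(encode(item) + ":" for item in items)


def contains(serialized: str, item: str) -> bool:
    # The searched value must go through the same encoding as the elements.
    return f":{encode(item)}:" in serialized
```

Compare all of this with a set or list membership check, which is a single expression and has none of these failure modes.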

Conclusion

I hope that after reading this post, whenever possible, when presented with a choice, you will go for the simpler and the more semantically correct option which does not involve string manipulation. Left without a choice, you will be prepared for the uphill battle and aware of the pitfalls.

Hope this helps. Good luck!

The Original Sin in IT

April 10, 2022 Ilya Sher

You have a program with human readable output. Now a second program needs that information. There are two choices:

1. Do the technically right and challenging thing – create a protocol and rewrite the first program and then write the second program (utilizing something like libxo in the first program maybe).
2. Fuck everybody for the decades to come by deciding that parsing text is the way to go.

I would like to thank everybody involved in choosing number 2!

jc is an effort to fix (or rather, work around) that decision. Thanks to the author and I hope it will make your DevOps life at least a bit more tolerable.

The Pseudo Narrow Waist in Unix

March 6, 2022 (updated May 20, 2022) Ilya Sher

Background

This is a pain-driven response to the post about the Narrow Waist of Unix Architecture. If you have the time, please read that post.

The (very simplified and rough) TL;DR of the above link:

The Internet has a “Narrow Waist”, the IP protocol. Anything that is above that layer (TCP, HTTP, etc.) does not need to be concerned with lower level protocols. Each piece of software therefore does not need to concern itself with any specifics of how the data is transferred.
Unix has a “Narrow Waist”, which is text-based formats. You have a plethora of tools that work with text. On one side of the Narrow Waist we have different formats, on the other side text-manipulating tools.

I agree with both points. I disagree with the implied greatness of the Unix “design” in this regard. I got the impression that my thoughts in this post are likely to be addressed by upcoming oilshell blog posts but nevertheless…

Formats

Like a hierarchy of types, we have a hierarchy of formats. Bytes is the lowest level.

Bytes

Everything in Unix is Bytes. Like in programming languages, if you know the base type, you have a certain set of operations available to you. In case of Bytes in Unix, that would be cp, zip, rsync, dd, xxd and quite a few others.

Text

A sub-type (a more specific type) of Bytes would be Text. Again, like in a programming language, if you know that you are dealing with data of a more specific type, you have more operations available to you. In the case of Text in Unix it would be: wc, tr, sed, grep, diff, patch, text editors, etc.

X

For the purposes of this discussion X is a sub-type of Text. CSV or JSON or a program text, etc.

Is JSON a sub-type of Text? Yes, in the same sense that a cell phone is a communication device, a cow is an animal, and a car is a transportation device. Exercise for the reader: are these useful abstractions?

The Text Hell

The typical Unix shell approach for working with X consists of the following steps:

Use Text tools (because they are there and you are a proficient wielder)
One of:
1. Add a bunch of fragile code to bring the Text tools to a level where they understand enough of X (in some cases despite existing command line tools that deal specifically with X)
2. Write your own set of tools to deal with the relevant subset of X that you have.
Optional but likely: suffer fixing and extending number 2 for each new “corner case”.

The exceptions here are tools like jq and jc, which continue gaining in popularity (for a good reason in my opinion). Yes, I am happy to see a declining number of “use sed” recommendations when dealing with JSON or XML.

Interestingly enough, if a programmer performed the above mentioned atrocities in almost any programming language today, that person would be told that this is not the way, that libraries should be used, and to “stop using a square peg for a round hole”. After a few unjustified repetitions of the same offense, that person should be fired.

Somehow this archaic “Unix is great, we love POSIX, we love Text” approach is still acceptable…

Pipes Text Hell

Create a pipe between different programs (text output becomes text input of the next program)
Use a bunch of fragile code to transform between what the first program produces and what the second one consumes.

Where Text Abstraction is not Useful

Almost everywhere. In order to do some of the most meaningful/high-level operations on the data, you can’t ignore that it’s X and just work with it as if it were Text.

Editing

The original post says that since the format is Text, you can use vim to edit it. Yes you can… but did you notice that any self-respecting text editor comes with plugins for various X’s? Why is that? Because even the amount of useful “text editing” is limited when all you know is that you are dealing with Text. You need plugins for semantic understanding of X in order to be more productive.

Wanna edit CSV in a text editor without CSV plugin? OK. I prefer spreadsheet software though.

Have you noticed that most developers use IDEs that “understand” the code and not Notepad?

Lines Count

Simple, right? wc -l my.csv. Do you know that the quoted fields do not contain newlines? Oops. Does the file have a header line? Oops.
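Both “oops” cases fit into a few lines of Python. The sample data and the count_records helper are made up for this example; counting newline characters is what wc -l does.

```python
import csv
import io


def count_records(text: str, has_header: bool = True) -> int:
    # Let a CSV parser decide where a record ends.
    rows = sum(1 for _ in csv.reader(io.StringIO(text, newline="")))
    return rows - 1 if has_header and rows else rows


# One header line and two records; one quoted field contains a newline.
text = 'name,comment\nalice,"line one\nline two"\nbob,ok\n'

line_count = text.count("\n")        # what wc -l would report
record_count = count_records(text)   # what was actually asked for
```

The line count is twice the number of records here, and nothing in the Text-level view can tell you that.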

Text Replacement

Want to try to rename a method in a Java program? sed -i 's/my_method/our_method/g' *.java, right? Well, depends on your luck. I would highly recommend doing this kind of refactoring in an IDE that actually understands Java, so that you rename only the specific method in the specific class, as opposed to unfortunately named methods and variables, not to mention arbitrary strings.

Search / Indexing

Yep… except that understanding of the semantics helps here quite a bit. That’s why you have utilities which understand specific programming languages that do the indexing.

Conclusion

I do not understand the fascination with text. Still waiting for any convincing arguments why it is so “great” and why the interoperability that it provides is not largely a myth. Having a set of tools enabling one to do a subpar job each time is better than not having them, but is it the best we can do?

My previous dream of eradicating text where it does not make sense (my blog post from 2009) came true with HTTP/2. Apparently I’m not alone in this regard.

Sorry if anything here was harsh. It’s years of pain.

Clarification – Layering

Added: 2022-02-07 (answering, I hope, https://www.reddit.com/r/ProgrammingLanguages/comments/t2bmf2/comment/hzm7n44/)

Layering in the case of the IP protocol works just fine. The implementer of an HTTP server really does not care about low level transport details such as Ethernet. Likewise, the low level drivers don’t care exactly which data they deliver. Both sides of the Waist don’t care about each other. This works great!

My claim is that in the case of the Text Narrow Waist, where X is on one side and the Text tools are on the other, there are two options:

Tools ignore X, and you get very limited functionality out of them.
Tools know about X, but then it’s a “leaky abstraction” and not exactly a Narrow Waist.

That’s why I think that in case of Text, the Narrow Waist is more of an illusion.

Have a nice week!
