Unicode characters as operators in a programming language¿

Wouldn’t it be cool to use Unicode characters as operators and maybe even function names?

if a ≠ b { ... }

range = 0…10

all_above_ten = myarr ∀ F(x) x > 10

It looks good: concise and expressive. Perl 6, Julia, and Scala appear to support Unicode operators.

Why don’t I add Unicode operators and function names to NGS then?

If NGS allowed Unicode, it would be optional, as I don’t want an additional entry barrier or possible problems typing Unicode over a remote connection. If I do add optional Unicode to NGS, here is what I think will happen next:

Some people start using Unicode in NGS while others don’t. A mixed code style emerges. It’s easy to imagine such a mixed style even within the same file, when someone without a Unicode setup on their keyboard does a quick fix. ViralBShah sums it up pretty well.

What do you think? Your comments are welcome.


3 thoughts on “Unicode characters as operators in a programming language¿”

  1. I have a “toy” language (BC BASIC) with some of these features. Some of the special characters:

    Source code is “Word-proofed” by accepting smart quotes. Word also changes some minus signs into en-dashes; my little language accepts those in place of minus signs.

    Some pretty math operators include square root and square, along with the “normal” math signs. Imagine a language where you can use the ÷ division sign. So far I’ve found these to be nice but not awesome.

    The surprise hit is that my language allows the use of 🚩 flag characters as white space. This is surprisingly useful; it’s nice to quickly prototype some code, dropping flags on lines that need to be rewritten. A flag at the start of a line (🚩 retval=√(a²+b²)) is a great quick hint.

    Lastly, because typing Unicode is a pain, the mini-IDE includes a pop-up with the special characters that are accepted.

    More at: https://shipwrecksoftware.wordpress.com/2016/12/04/awesome-updates-to-best-calculator-bc-basic/


  2. For simple cases, one could imagine a sed script that would replace the expressive Unicode sequences with the appropriate ASCII (and a similar sed script that would perform the opposite translation). This would solve the problem of expressibility while also allowing people who have the scripts, but not the Unicode text entry, to maintain the source code. In addition, running the scripts would ensure that the output was not a hodge-podge of inconsistent typography.

    The use of sed scripts fails, however, for multiple reasons. First, sed cannot distinguish between Unicode characters used as operators and the same characters inside string literals (ditto for the ASCII equivalents). Second, requiring all code maintainers to have these scripts in addition to the language itself is fragile, as one might be working on projects or in environments where only the language exists, but not the sed scripts (or sed itself, such as when coding in a small environment on a phone).

    As the implementor of the language, however, you can solve everything you want! After all, your language takes the high-level source code as input, parses it, and already does tokenization/translation as part of syntax checking. If you are contemplating adding synonymous Unicode characters for some ASCII sequences, then you are already committed to the work of enhancing the parser. It would be straightforward for you to add a command-line argument to the parser that outputs the input code translated to either pure ASCII or pure Unicode. If you build this feature directly into the language, then all of its programmers would be able to use it. As your language knows when a Unicode or ASCII sequence is part of a literal string, there would be no confusion.

    bcbasic --unicode-output=beautiful.bas ASCII.BAS : REM outputs Unicode operators, etc.
    bcbasic --ascii-output=ASCII.BAS beautiful.bas : REM outputs ASCII operators, etc.
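
The literal-aware translation described above could be sketched roughly as follows, in Python rather than inside a real parser. The operator table, the double-quoted string syntax, and the names to_unicode and to_ascii are all assumptions made for illustration; they are not taken from NGS or BC BASIC.

```python
import re

# Hypothetical operator table; these ASCII spellings are assumptions.
ASCII_TO_UNICODE = {"!=": "≠", "<=": "≤", ">=": "≥", "...": "…"}
UNICODE_TO_ASCII = {uni: asc for asc, uni in ASCII_TO_UNICODE.items()}

# Assumed literal syntax: double-quoted strings with backslash escapes.
STRING_LITERAL = r'"(?:\\.|[^"\\])*"'


def _translate(source, table):
    """Replace operators found in `table`, leaving string literals untouched."""
    # Longest operators first, so a multi-character operator wins over a prefix.
    ops = sorted(table, key=len, reverse=True)
    op_pattern = "|".join(re.escape(op) for op in ops)
    # String literals are tried first at each position, so operators inside
    # them are consumed as part of the literal and never rewritten.
    pattern = re.compile("(" + STRING_LITERAL + ")|(" + op_pattern + ")")

    def replace(match):
        if match.group(1) is not None:
            return match.group(1)        # string literal: copy verbatim
        return table[match.group(2)]     # operator: swap spelling

    return pattern.sub(replace, source)


def to_unicode(source):
    """Rewrite ASCII operators as their Unicode synonyms."""
    return _translate(source, ASCII_TO_UNICODE)


def to_ascii(source):
    """Rewrite Unicode operators as their ASCII synonyms."""
    return _translate(source, UNICODE_TO_ASCII)


if __name__ == "__main__":
    print(to_unicode('if a != b { print("a != b") }'))
    print(to_ascii('range = 0…10'))
```

Unlike a plain sed substitution, this leaves the "a != b" inside the string alone while rewriting the operator outside it. A real implementation would reuse the language's own tokenizer, which would also cover comments and any other literal forms.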


    • It looks like a clever solution and it really makes the situation much better. Still, it puts some unnecessary burden (much less than I initially imagined) on developers. If, for example, the code in version control uses Unicode and you edit locally in ASCII, you would be running the recoding scripts after every checkout and before every commit.

