Unicode characters as operators in a programming language¿

December 8, 2016 | Ilya Sher | 4 Comments

Wouldn’t it be cool to use Unicode characters as operators and maybe even function names?

if a ≠ b { ... }

range = 0…10

all_above_ten = myarr ∀ F(x) x > 10

Looks good, concise, expressive. Perl 6, Julia, and Scala appear to support Unicode operators.

Why don’t I add Unicode operators and function names to NGS then?

If NGS allowed Unicode, it would be optional, as I don’t want an additional entry barrier or possible problems typing Unicode over a remote connection. If I do add optional Unicode to NGS, here is what I think will happen next:

Some people start using Unicode in NGS while others don’t. Mixed code style emerges. It’s easy to imagine such mixed code style even in the same file as someone without the Unicode setup on his/her keyboard is doing a quick fix. ViralBShah sums it up pretty well.

What do you think? Your comments are welcome.

4 thoughts on “Unicode characters as operators in a programming language¿”

shipwrecksoftware says:

December 8, 2016 at 6:57 pm

I have a “toy” language (BC BASIC) with some of these features. Some of the special characters:

Source code is “Word-proofed” by accepting smart quotes. Word also changes some minus signs into en-dashes; my little language accepts those in place of minus signs.

Some pretty math operators include square root and square and the “normal” math signs. Imagine a language where you can use the ÷ division sign. So far I’ve found these to be nice but not awesome.

The surprise hit is that my language allows 🚩 the use of flag characters as white space. This is surprisingly useful; it’s nice to quickly prototype some code, dropping flags on lines that need to be rewritten. A flag at the start of a line (🚩 retval=√(a²+b²)) is a great quick hint.

Lastly, because typing Unicode is a pain, the mini-IDE includes a pop-up with the special characters that are accepted.
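
The "Word-proofing" and flag-as-white-space ideas above can be sketched as a normalization pass that runs before the lexer. This is only an illustration in Python, not BC BASIC's actual implementation; the table `NORMALIZE` and the function `normalize` are hypothetical names, and the choice of which characters to map is an assumption based on the features listed above.

```python
# Hypothetical normalization table; the real BC BASIC rules are not shown in the comment.
NORMALIZE = str.maketrans({
    "\u201c": '"',       # left smart double quote -> straight quote
    "\u201d": '"',       # right smart double quote -> straight quote
    "\u2013": "-",       # en dash -> minus sign
    "\u00f7": "/",       # division sign -> slash
    "\U0001F6A9": " ",   # flag character -> white space
})


def normalize(line: str) -> str:
    """Map the 'pretty' characters to the plain ones a simple lexer expects."""
    return line.translate(NORMALIZE)


print(normalize("\U0001F6A9 x = a \u2013 b \u00f7 2"))
print(normalize("PRINT \u201chi\u201d"))
```

A pass like this is blind to context, so it would also rewrite these characters inside string literals; a real implementation would more likely handle them in the tokenizer.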

More at: https://shipwrecksoftware.wordpress.com/2016/12/04/awesome-updates-to-best-calculator-bc-basic/

Comet says:

December 9, 2016 at 9:14 pm

For simple cases, one could imagine a sed script that would substitute for the expressive Unicode sequences the appropriate ASCII (and a similar sed script that would perform the opposite translation). This would solve the problems of expressibility while also allowing people who have the scripts, but not the Unicode text entry, to be able to maintain the source code. In addition, running the scripts would ensure that the output was not a hodge-podge of inconsistent typography.

The use of sed scripts fails, however, for multiple reasons. First, it cannot distinguish between when Unicode characters are used as operators and when they are string literals (ditto for the ASCII equivalents.) Secondly, requiring all code maintainers to have these scripts in addition to the language itself is fragile, as one might be working on projects/environments where only the language exists, but not the sed scripts (or sed itself, such as when one may be coding in a small environment on a phone, for instance).

As the implementor of the language, however, you can solve everything you want! After all, your language takes as input the high-level input code, parses it, and does tokenization/translation as part of syntax checking already. If you are contemplating adding synonymous Unicode characters for some ASCII sequences, then you already are committed to the work of enhancing the parser. It would be a straightforward thing for you to allow a command-line argument at the parser that would output the input code doing the translation either to pure ASCII or pure Unicode. If you build in this feature directly into the language, then all programmers of it would be able to use the feature. As your language knows when a Unicode sequence or ASCII sequence is treated as a literal string, there would be no confusion.

bcbasic --unicode-output=beautiful.bas ASCII.BAS : REM outputs Unicode operators, etc.
bcbasic --ascii-output=ASCII.BAS beautiful.bas : REM outputs ASCII operators, etc.
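
To make the difference from a plain sed substitution concrete, here is a minimal sketch in Python of a translation that skips string literals. It is not how NGS or BC BASIC work; the operator table, the double-quoted string syntax with backslash escapes, and the names `translate`, `to_unicode` and `to_ascii` are all assumptions made for illustration. A real implementation would reuse the language's own tokenizer and would also need to handle comments.

```python
import re

# Hypothetical operator pairs; a real language would define its own table.
ASCII_TO_UNICODE = {"!=": "≠", "<=": "≤", ">=": "≥", "...": "…"}
UNICODE_TO_ASCII = {uni: asc for asc, uni in ASCII_TO_UNICODE.items()}

# Assumed string syntax: double quotes with backslash escapes.
STRING_LITERAL = r'"(?:\\.|[^"\\])*"'


def translate(source: str, table: dict) -> str:
    """Replace operators listed in `table`, copying string literals verbatim."""
    # Try longer operators first so "..." is matched as a whole.
    ops = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        "(" + STRING_LITERAL + ")|(" + "|".join(re.escape(op) for op in ops) + ")"
    )

    def substitute(match):
        if match.group(1) is not None:
            return match.group(1)      # inside a string literal: leave as is
        return table[match.group(2)]   # an operator: translate it

    return pattern.sub(substitute, source)


def to_unicode(source: str) -> str:
    return translate(source, ASCII_TO_UNICODE)


def to_ascii(source: str) -> str:
    return translate(source, UNICODE_TO_ASCII)


print(to_unicode('if a != b { print("a != b") }'))
```

The operator outside the quotes is rewritten while the text inside the quotes is left alone, which is the case a context-free sed script gets wrong.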

- Ilya Sher says:
  
  December 9, 2016 at 11:44 pm
  
  It looks like a clever solution, and it really makes the situation much better. Still, it puts some unnecessary burden (much less than I initially imagined) on developers. If, for example, the code in version control uses Unicode and you edit locally in ASCII, you would be running recoding scripts after every checkout and before every commit.
  
Paul K. McKneely says:

March 25, 2020 at 9:22 pm

I began development of a software tool chain for just such a language, called ϕPPL (ϕ Parallel Programming Language), in 1988. It was conceived, though, around 1986. Unicode support was not just added as an afterthought, as in most other languages. There are lots of operators that are not ASCII characters, and some of them are in the User Defined Area because there are no suitable characters in Unicode for their intended purpose. I could show you a lot of examples of code, but this website does not show them correctly. It is really a powerful language and, the more I program in it, the more I feel that I am wearing a straitjacket when using other languages. Function names can use any of 254 Unicode characters that are in the allowable character subset. These include a full set of Roman, Greek and Cyrillic characters as well as a number of modified Roman characters. They require only one byte of storage each when re-encoded to NSC values. The value 0 is reserved for name termination and the value 1 is reserved as a separator for multi-part names. These are the allowable characters in identifiers:

0123456789_aAãÃäÄâÂàÀáÁăĂāĀåÅæ
ÆbBcCçÇdDeEėĖëËêÊèÈéÉĕĔēĒfFgGhHi
IïÏîÎìÌíÍĭĬīĪjJkKlLmMnNñÑoOœŒöÖô
ÔòÒóÓŏŎōŌpPqQrRsSßtTuUůŮüÜùÙúÚŭŬ
ūŪvVwWxXyYŷŶýÝzZαΑβΒγΓδΔεΕζΖηΗθΘ
ιΙκΚλΛμΜνΝξΞοΟπΠρΡσΣτΤυΥφΦχΧψΨωΩ
аАбБвВгГдДеЕжЖзЗиИйЙкКлЛмМнНоОпП
рРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ

A handy side effect of using this system is that these characters are in alphabetical order, so you can sort names on their raw NSC values. ϕEdit has many handy keypads to make it quick and easy to type these characters into your source code. ϕText is a superset of Unicode, and this is the preferred encoding for storing your source code in files. This is more efficient than UTF-8 and it supports text properties such as 8 colors, 8 styles, 4 sizes and 3 independent attributes (Bold, Underline and Italic). It also supports paragraph justification such as Left, Right, Centered and Fully Justified.
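
The sorting property can be illustrated with a small Python sketch. The actual NSC values are not given here, so the code below simply assumes that codes are assigned in the order of the list above, starting at 2 because 0 and 1 are reserved; `ALPHABET` is only a short subset of that list, and `encode_name` is a hypothetical helper.

```python
# Illustrative subset of the identifier alphabet, in the listed order.
# \u00e4 is "ä" and \u00c4 is "Ä". The real NSC values are an assumption here.
ALPHABET = "0123456789_aA\u00e4\u00c4bBcC"

# Codes 0 (terminator) and 1 (separator) are reserved, so numbering starts at 2.
CODE = {ch: index + 2 for index, ch in enumerate(ALPHABET)}


def encode_name(name: str) -> bytes:
    """Encode a name as one byte per character, followed by the 0 terminator."""
    return bytes(CODE[ch] for ch in name) + b"\x00"


names = ["b", "\u00e4", "a"]
by_code = sorted(names, key=encode_name)  # order of the one-byte codes
by_codepoint = sorted(names)              # order of Unicode code points

print(by_code)
print(by_codepoint)
```

Sorting on the one-byte codes keeps "ä" next to "a", whereas sorting on Unicode code points places it after "b".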

ϕEngine is a 64-bit advanced processor architecture that is co-designed with ϕPPL, which is used for writing ϕOS, an object-oriented operating system. The assembler for this processor is also based on Unicode (ϕText). Together, these components make up the trinity which is called “The ϕ System” or just ϕSystem for short.

If you are a C tool-chain developer and would like to participate in its development, please email me at PhiExpert@technoventure.com
