ACNews
#63
December 26, 2016
Review - NedoLang programming language
NedoLang programming language
by Alone Coder

The language in this version doesn't provide counterparts for the following C features:
- multidimensional arrays (can be replaced with index calculations or arrays of pointers);
- switch...case (todo, because the assembler is based on it);
- floating point numbers (todo, they are already processed syntactically);
- const constants;
- const constant arrays;
- sizeof(<type>);
- #include (todo, projects will be based on it);
- #define constants;
- #define macros (not recommended);
- conditional compilation ( #if, #ifdef, #ifndef );
- for (use while or repeat instead);
- structures;
- unions (not recommended);
- return from the middle of a function (not recommended);
- exit from the middle of a program (not recommended);
- calling functions by pointers (not recommended).

Commands are collected in blocks enclosed in curly braces:
{<command><command>...<command>}
Such a block counts as a single command and can be used anywhere a single command can. For C compatibility, it's recommended (not required) to write ; after every command.

A line can contain any number of commands. Line breaks between words don't matter.
There are no line numbers in the source. Error messages refer to text line numbers (the first line is 00001).

Standard comments are /**comment*/. A single-asterisk /*comment*/ isn't a comment (it's needed for procedure/function calls, see below). Standard comments can't be nested.
There are also standard one-line comments: // to end of line.
There are also non-standard (incompatible with C) nested comments: {comment{comment inside}resume comment}
You can insert comments anywhere between words, except between a variable and an opening square bracket, i.e. you can't do this:
a/**comment*/[10]
All comments are copied into the assembly text as comments.

The full name of a variable (the one used by the assembler) is made of the current namespace followed by the variable name followed by a postfix.
- Full name "petya.pet."
means: namespace "petya", variable "pet", defined in the source (marked with a trailing dot ".").
- Full name "_boolarr." means: global variable "_boolarr", defined in the source (marked with a trailing dot ".").
- Full name "vasya.nemain.l00003" means: namespace "vasya.nemain", variable "l00003", created automatically (no dot at the end).
Thus automatically created variables can't clash with variables defined in the source. Function headers and jump labels are composed the same way.

Variables and other identifiers are accessible from the language this way:
- _boolarr to access a global variable _boolarr from anywhere (the same for a global procedure/function).
- i to access a variable i, defined in the current namespace.
- .func2 to access a function func2, defined in the parent namespace (this is used to call procedures/functions from other procedures/functions).
- ..anothermodule.funcS to access a function funcS, defined in the neighbouring module anothermodule (i.e. in a namespace that is a sibling of the namespace where the current function is defined).
- submodule.v to access a variable v, defined in the child namespace submodule (in practice this is only useful for nested functions that use goto).
Automatically created variables are not accessible.

Identifiers (names of variables, procedures, functions, labels) must begin with a letter (or an underscore for globals) and consist of letters, digits, and underscores. A full name must not exceed 254 characters. Cyrillic letters are allowed; they stay encoded the same way as in the source (i.e. cp866, cp1251, or utf8).

All the commands begin with a word:
* var - define a variable:
var<type><variable>
* var with square brackets - define an array of fixed size:
var<type><variable><[><expression><]>
indexed from 0, so array a[10] doesn't contain a value with index 10.
* recursivevar - define a variable inside a recursive procedure/function. Syntax is the same as in var.
If you define a regular variable in a recursive procedure/function, it will be corrupted during recursion. You can't define arrays this way.
* let - evaluate and assign to a variable:
let<var>=<expression>
the expression type must match the variable type, except for pointers: you can assign an array address to a pointer variable, i.e. let poi=arr (in fact, you can assign anything to a pointer, but the outcome is undefined with byte or char, and with long there will be a runtime error).
* let with square brackets - evaluate and assign to a location in an array:
let<var><[><expression><]>=<expression>
* let* - evaluate and write to memory with the given type:
let*(<type>*)(<pointerexpression>)=<expression>
you can't omit (<type>*) or the parentheses around (<pointerexpression>).
* module - define a module (namespace):
module<label><command>
the <command> is created inside a new namespace that is inside the current namespace (for example, if the current namespace was mainmodule, then "module submodule{...}" will create namespace mainmodule.submodule for the command). You can re-define the same namespace to add something to it.
* proc - define a procedure:
proc<procname>([<type><par>,...])<command>
also creates a namespace inside the current namespace. Thus <procname> must be unique in the current namespace.
* func - define a function:
func<type><funcname>([<type><par>,...])<command>
also creates a namespace inside the current namespace. Thus <funcname> must be unique in the current namespace.
* if - alternative branching:
if<boolexpression>then<command>[else<command>]endif
contains 'then' to avoid the "if(expr);cmd" mistake, and 'endif' to avoid mistakes with nested incomplete alternatives. For C compatibility, it's recommended to parenthesize the <boolexpression>.
* call - call a procedure:
call<label>[recursive]([<type><label>.<par>=<expression>,...])
for C compatibility, the following form is recommended:
call/*.*/<procname>[recursive] ([/*<type>.<procname>.<par>=*/<expression>,...])
The names of the called procedure and its parameters are written relative to the current namespace, so a leading dot is used (not needed for globals). The 'recursive' word is mandatory when calling recursive procedures, or else the parameters will be corrupted during recursion.
* while - pre-conditional loop:
while<boolexpression>loop<command>
contains 'loop' to avoid the "while(expr);cmd" mistake. For C compatibility, it's recommended to parenthesize the <boolexpression>.
* repeat - post-conditional loop:
repeat<command>until<boolexpression>
for C compatibility, it's recommended to parenthesize the <boolexpression>.
* break - exit from a loop (while or repeat):
no parameters, just break
* return - return a value from a function:
return<expression>
must be the last command in a function. The type of the returned expression must match the function type.
* _label - define a label for a jump:
_label<labelname><:>
the label must be unique in the current namespace. The underscore is added to make it easier to spot in the source.
* goto - jump to a label:
goto<labelname>
only inside the current procedure or function.
* asm - assembly language commands:
asm("cmd1""cmd2"...)
every string produces a single command on a separate line of the assembly text.

There are the following data types:
* byte (1 byte, unsigned)
* bool (the only allowed values are /*?*/_TRUE and /*?*/_FALSE , unknown size)
* char (1 byte, unknown signedness)
* int (signed integer, size of one CPU word)
* uint (unsigned integer, size of one CPU word)
* long (long integer, size of two CPU words)
[* float ]
* pointer (pointer - size of one CPU word)
In the current version, bool is 1 byte, /*?*/_TRUE==0xff, /*?*/_FALSE==0x00. They might be defined differently in C, so don't mix boolean data with integer data.
Any value except /*?*/_FALSE means /*?*/_TRUE, but relying on this behaviour is not recommended, for the sake of future compatibility.
Values of type long can't be used in comparisons, multiplications, and divisions, so their signedness is not defined.

In <expressions>, there is the following order of operations:
1. Parentheses and prefixes ( + , - , ~ (inversion), ! (logical negation), & (take variable address), * (read memory at pointer), also nonstandard < (shift left 1 bit), > (shift right 1 bit), [<constexpr>] (constant expression)).
2. Multiplication, division, & , && .
3. Addition, subtraction, | , || , ^ , ^^ .
4. Comparisons and shifts ( << , >> ).
Comparisons and shifts work only for types byte, char, int, uint. For bool, you can only compare for (in)equality.

The following values are allowed:
* variable name - typed as the variable.
* unsigned integer constant - typed as byte (if written as 0xff with <=2 digits or 0b111 with <=8 digits), long (if it ends with L), or uint (otherwise).
* signed integer constant - typed as int.
* floating point constant - typed as float (not yet supported).
* boolean constant /*?*/_TRUE or /*?*/_FALSE - typed as bool.
* character constant 'c' or '\c' (only '\n', '\r', '\t', '\0', '\'', '\"', '\\' allowed), where c is one symbol - typed as char.
* string constant "string" - typed as pointer. String concatenation like "str1""str2" is allowed (possibly across a line break). String constants are automatically terminated with '\0'.
Integer constants may be decimal ( 100 ), hexadecimal ( 0x10 ), binary ( 0b11 ), or octal ( 0o177 , or the variant 0177 , which gives a warning because of ambiguity).

Expressions are strongly typed. Every operation with two operands needs their types to match, except:
* adding a pointer to an int or uint (no multiplication, because pointers are untyped).
* + / - / & / | / ^ of char with byte (typed as the left operand).
* & for typecast.
* 0L + <uintexpression> .
For moving a pointer one byte left use <pointerexpression>+-1 .
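The typing rule for unsigned integer constants stated above can be sketched in C as follows. This is only an illustration of the rule as worded in the text, not the compiler's code: the names lit_type and literal_type are hypothetical, and checking the L suffix before the digit count is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type tags; the names are illustrative, not from the compiler. */
typedef enum { LIT_BYTE, LIT_UINT, LIT_LONG } lit_type;

/* Classify an unsigned integer constant by its spelling alone,
   following the rule in the text:
   - trailing L                    -> long
   - 0x with at most 2 digits      -> byte
   - 0b with at most 8 digits      -> byte
   - anything else                 -> uint
   Assumption: the L suffix is checked first. */
lit_type literal_type(const char *s)
{
    size_t n = strlen(s);

    if (n > 0 && s[n - 1] == 'L')
        return LIT_LONG;
    if (n >= 3 && s[0] == '0' && s[1] == 'x' && n - 2 <= 2)
        return LIT_BYTE;
    if (n >= 3 && s[0] == '0' && s[1] == 'b' && n - 2 <= 8)
        return LIT_BYTE;
    return LIT_UINT;
}
```

Note that under this rule the spelling decides the type, not the value: 255 is a uint while 0xff is a byte, so the two can't be mixed freely in a strongly typed expression.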
To cast uint, byte, char to int use +<expression> (without sign change) or -<expression> (with sign change).
To cast int or long to uint use <expression>&0xffff .
To cast char, int, uint, long to byte use <expression>&0xff .
To cast uint to long use 0L + <expression> . Adding long+uint is forbidden in other cases.

A function call is written this way:
/*@<type>.*/<procname>[recursive] ([/*<type>.<procname>.<par>=*/<expression>,...])
(see above about /*...*/ for C compatibility and about 'recursive' for calling recursive procedures and functions). The dot before <procname> is not needed for global functions (see above about namespaces).
Procedures and functions technically have a variable number of arguments, but relying on this is not recommended.

Reading memory at a pointer must contain a typecast:
*(<type>*)(<pointerexpression>)

It is recommended to write all the keywords in all capitals. To compile a NedoLang file with a C compiler, define the following:
#define POINTER byte*
#define BYTE byte
#define CHAR unsigned char
#define INT short int
#define UINT unsigned short int
#define LONG long
#define FLOAT double
#define BOOL unsigned char
#define _TRUE 0xff
#define _FALSE 0x00
#define FUNC /**/
#define PROC /**/
#define RETURN return
#define VAR /**/
#define LET /**/
#define CALL /**/
#define LABEL /**/
#define LOOP /**/
#define REPEAT do
#define IF if
#define THEN /**/
#define ELSE else
#define ENDIF /**/
#define WHILE while
#define UNTIL(x) while(!(x))

This syntax is based on the following requirements:
- analyze syntax with left context only;
- every command begins with a word, to avoid long label search;
- namespaces don't cross (label access in C can mean the current namespace or any parent namespace);
- C compatibility.
The syntax can be changed. For example, with a prefix + for typecast and function call:
+(type)var
+(type)func(...)

Today the system targets the Z80 only, but the CPU-specific part is under 40% of all code, and this part is very monotonous.
So the language might become multi-platform. The estimated size of the compiler code (found by compiling old versions with IAR and SDCC) is around 20-30 KB, so there is some room to evolve. It is unknown what compilation speed it will have on the ZX Spectrum.
I try to strike a compromise between inefficient code and premature optimization, i.e. I select code structures that allow easy subsequent changes.

Here are the statistics for IAR and SDCC (no self-compilation yet):

SDCC 3.6.0 --opt-code-speed --all-callee-saves:
#6630 bytes code with constants + #0036 bytes constant arrays + #19c7 bytes data = #802d

IAR Z80/64180 C-Compiler V4.06A/W32
Command line = -v0 -ml -uua -q -e -K -gA -s9 -t4 -T -Llist -Alist -I../z80/inc/ main.c
#38b4 bytes code + #1076 bytes constants + #19c7 bytes data (writes shifted address before defs) = #62f1

Compiler features and problems for self-compilation:
# SDCC 2.9.0 initializes constant arrays with commands. SDCC 3.4.0 doesn't - it uses ".area _INITIALIZER" (no LDIR and no initialization code)
# SDCC 2.9.0 generates stack frame code for empty functions. SDCC 3.4.0 doesn't. My code generator doesn't either. IAR generates obscure push-pops in half-empty procedures. SDCC 3.6.0 with opt-size writes call ___sdcc_enter_ix instead of push ix:ld ix,0:add ix,sp (so opt-speed must be used). opt-size also adds an excessive inc sp:pop af at the end of emitinvreg, emitnegreg. opt-size (and SDCC 3.4.0) also computes REGNAME[rnew] once for two uses (it can even push-pop sometimes, more often with --all-callee-saves). I can't do this. No other differences were found in SDCC 3.6.0 between opt-size and opt-speed (there are still even inc sp..dec sp for 8-bit arguments).
# In SDCC, expressions with && || are evaluated arithmetically (IAR uses jumps). I can't make jumps. Either I add 'and if', 'or else', or a lot of branches in the condition, or make islalpha...() with an array (initialize it by hand?)
If I break the condition into parts like a&&b (no loss in IAR), SDCC translates && into jumps (even in 'return a&&b' in 3.6.0) - how can I do the same? If I break || into parts like if (a) else if (b) else if (c) with identical commands, SDCC doesn't merge them (IAR does), but it works faster anyway (IAR code doesn't change). If I write if (!a) if (!b) if (!c) instead, then SDCC code gets better, and IAR code is almost ideal (with strange registers though). Expressions like (c<='z') are inefficient in IAR (unlike c<('z'+1)). Must I rewrite the source with +1? My current code generator won't benefit from this because it can't do CP.
# Most procedures have arguments. IAR passes one or two arguments in registers (SDCC doesn't). This is visible in the number printing procedure (it's just for debug). Pass one argument in a register? (I must keep a variable in a register!) Sometimes I can remove one argument if I merge a flag with data, like isfunc with functype. SDCC 3.4.0 accesses the first argument with pop:pop:push:push, which pays off without an ix stack frame. But register passing is better.
# String procedures use 8-bit and 16-bit data at the same time. Rewrite the whole syscalls module in assembly language?
# String procedures (via arrays) are compiled as a mess. Use pointers? And keep pointers in registers? How to determine where to keep everything? Add a 'register' keyword? Rewrite the whole syscalls module in assembly language? But strings are also used in the compiler (title cutting, label read/write).
# SDCC and IAR optimize CALL:RET to JP. How can I do the same? Lazy CALL? (won't work for CALL in a local block) SDCC 3.4.0 generates jr:ret (in getnothingword), where it was jr before. SDCC 3.6.0 fixes this.
# SDCC 3.4.0 (3.6.0 with opt-size) computes REGNAME[rnew] once for two uses (it can even push-pop sometimes, more often with --all-callee-saves). I can't do this.
todo: rewrite the source with a variable (I need two because there are two letters)
# stringappend (used for 'command' and 'joinwordsresult' strings) needs to be inline. I don't have #define for inlines.
# read, write need to be inline (inline syscalls? or inline emits? or both? in the end there will be system macros). I don't have #define for inlines.
# SDCC 3.6.0 generates ld iy:ld a,(iy)...ld (iy),a instead of ld a,()...ld (),a (in readchar_skipdieresis_concat). --reserve-regs-iy makes it go via hl, but with ix stack frames where they are unneeded, and it becomes unable to do this:
ld hl,(_fin)
push hl
call _eof
it emits this instead:
ld hl,#_fin
ld c, (hl)
inc hl
ld b, (hl)
push bc
call _eof
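As an illustration of the C compatibility described earlier, here is a minimal sketch of a fragment written with capitalized NedoLang keywords that a C compiler accepts once the defines from the list above are in place. The function names sumto, countdigits and maxof are hypothetical examples, only the defines actually used are repeated, and the fragment has not been checked against the NedoLang compiler itself, so its acceptance there is an assumption. Comments inside the functions use the /**...*/ form, since a single-asterisk comment is not a comment in NedoLang.

```c
/* Defines for the C build only, copied from the list in the text. */
#define UINT unsigned short int
#define FUNC /**/
#define RETURN return
#define VAR /**/
#define LET /**/
#define LOOP /**/
#define REPEAT do
#define IF if
#define THEN /**/
#define ELSE else
#define ENDIF /**/
#define WHILE while
#define UNTIL(x) while(!(x))

/**sum of 1..n with a pre-conditional loop*/
FUNC UINT sumto(UINT n)
{
    VAR UINT s;
    VAR UINT i;
    LET s = 0;
    LET i = 1;
    WHILE (i <= n) LOOP {
        LET s = s + i;
        LET i = i + 1;
    };
    RETURN s;
}

/**number of decimal digits with a post-conditional loop*/
FUNC UINT countdigits(UINT n)
{
    VAR UINT d;
    LET d = 0;
    REPEAT {
        LET d = d + 1;
        LET n = n / 10;
    } UNTIL (n == 0);
    RETURN d;
}

/**larger of two values; return stays the last command*/
FUNC UINT maxof(UINT a, UINT b)
{
    VAR UINT m;
    IF (a > b) THEN { LET m = a; } ELSE { LET m = b; } ENDIF;
    RETURN m;
}
```

After preprocessing, VAR, LET, LOOP, THEN and ENDIF vanish, REPEAT ... UNTIL (x) becomes do ... while(!(x)), and the ; after each command keeps the result valid C. Each function ends with a single RETURN, since return from the middle of a function has no counterpart in the language.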