Writing a simple Compiler on my own - Type Declaration and Checking [C][Flex][Bison]

in #utopian-io5 years ago

[Custom Thumbnail]
All the Code of the series can be found at the Github repository: https://github.com/drifter1/compiler

Introduction

    Hello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we will write the needed code for type checking and type declarations, without actually integrating it into the compiler (bison parser) completely!

The topics that we will cover today are:

  1. Type declaration code
  2. Type checking code
  3. "Simple" code integration with Symbol Table, Lexer and Parser

Requirements:

    Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
  • What Compiler Design is (mainly the steps)
  • For which exact Language the Compiler is build for (Tokens and Grammar)
  • How to use Flex and Bison
  • How to implement a lexer and parser for the language using those tools
  • What the Symbol Table is and how we implement it
  • How we combine Flex and Bison together
  • How we can pass information from the Lexer to the Parser
  • How we define operator priorities, precedencies and associativity
  • What Semantic Analysis is (Attributes, SDT etc.)
  • How we do the so called "Scope Resolution"

Difficulty:

Talking about the series in general this series can be rated:
  • Intermediate to Advanced
Today's topic(s) can be rated:
  • Medium
So, without further ado, let's now finally start with the actual Tutorial...

Actual Tutorial Content

Type Declaration Code

    First up, now that we start implementing semantics, it might be a good idea to put this code into a new header and code file. Let's call them "semantics.h" and "semantics.c".
    The new header file should contain the data (or token) types that our language has, meaning that we should remove them from the symbol table header "symtab.h" and include this new header there and whenever we want to access types and semantics in general. Doing that I also saw that I defined more types than I should and that the character type was called string type, which might be a little confusing. So, in the end, the header file will now contain the following type definitions:
#define UNDEF 0
#define INT_TYPE 1
#define REAL_TYPE 2
#define CHAR_TYPE 3
#define ARRAY_TYPE 4
#define POINTER_TYPE 5
#define FUNCTION_TYPE 6
That's because, right now, our simple C language only has:
  • Integers
  • Floats and Doubles (Real type)
  • Characters (including Strings)
  • Arrays and pointers of the previous types
  • Functions returning the previous types
    In the same notion I also found out that I defined string arrays or strings instead of characters in some places of the symbol table entry struct "list_t". So, now the "new" struct is:
typedef struct list_t{
	// name, size of name, scope and occurrences (lines)
	char st_name[MAXTOKENLEN];
    int st_size;
    int scope;
    RefList *lines;
// to store value and sometimes more information int st_ival; double st_fval; char st_sval;
// "main" type int st_type;
// for arrays (info type), for pointers (pointing type) // and for functions (return type) int inf_type;
// array stuff int *i_vals; double *f_vals; char *s_vals; int array_size;
// function parameters Param *parameters; int num_of_pars;
// pointer to next item in the list struct list_t *next; }list_t;
    When declaring a variable we have to be able to set the type of the identifier afterwards, using the type-information stored in the "type"-rule for example (something that I will not do today, but later on). So, we need some kind of "set type" function that let's us actually declare a variable or an identifier, by now setting the "undefined" type of the identifier to the correct type that was used in the declaration. Such a function will call lookup() to find the corresponding entry of the needed identifier to then access the entries st_type and inf_type, so that we can store the "main" type and info/data, pointing or return type when talking about pointers, arrays or functions.
The code for such a function looks as following:
void set_type(char *name, int st_type, int inf_type){
	/* lookup entry */
	list_t *l = lookup(name);
/* set "main" type */ l->st_type = st_type;
/* if array, pointer or function */ if(inf_type != UNDEF){ l->inf_type = inf_type; } }
    Of course we could also do this by hand every time, but I think a function that does it for us is much better in the circumstance! In the same way, we should also be able to get the type of the identifier using some "get type" function that gives us the "actual" type of the entry. For "simple" variables we just have to return the value from "st_type", but for the others (array, pointer or function) we have to return the info, pointing or return type, which is stored in "inf_type".
In code this looks as following:
int get_type(char *name){
	/* lookup entry */
	list_t *l = lookup(name);
/* if "simple" type */ if(l->st_type == INT_TYPE || l->st_type == REAL_TYPE || l->st_type == CHAR_TYPE){ return l->st_type; } /* if array, pointer or function */ else{ return l->inf_type; } }
    And the "set_type" function is actually all that we have to call in the actual parser when setting or better said declaring the type of an identifier. The other function will be used, so that we are able to access the type of an identifier when "type checking".

Type Checking Code

    Depending on the type of operator, different types are inter-compatible or not. This means that we will have to define the operator types, in the same way that we defined the data types. The following operators can occur when type checking:
  • None -> When assigning or when checking parameter inter-compatibility
  • Arithmetic -> When adding/subtracting, multiplying and dividing
  • Increment -> When incrementing/decrementing numeric values
  • Boolean -> In Boolean OR or AND operations
  • Not -> The Boolean NOT operator
  • Relational -> The operators: '>', '<', '>=' and '<='
  • Equality -> The operators: "==" and "!="
Note that the increment operator is still an arithmetic operator, but just a special case. The same way the Not operator is a special case of Boolean operator. Both of these operators have only one parameter.
In actual "definition" code, this looks as following:
#define NONE 0	   // to check types only - assignment, parameter
#define ARITHM_OP 1 // ADDOP, MULOP, DIVOP (+, -, *, /)
#define INCR_OP 2   // INCR (++, --)
#define BOOL_OP 3   // OROP, ANDOP (||, &&)
#define NOT_OP 4    // NOTOP (!)
#define REL_OP 5    // RELOP (>, <, >=, <=)
#define EQU_OP 6    // EQUOP (==, !=)
Of course this code is inside of the "semantics.h" file.
    The actual type checking will be implemented using a function that checks the types depending on the operator and returns if they are compatible by maybe even returning the result type of the operation. Knowing that the actual types are: Integer, Real or Char, we can easily write the code needed. So, let's get into each case after covering the main function declaration!
The function declaration looks as following:
int get_result_type(int type_1, int type_2, int op_type);

No operator

    When having no operator only "equal" types can be assigned to each other (or used in parameters). This means that an integer or character can only be assigned to an integer or character, cause these types are actually the same. The ASCII representation of a char is actually an int! The value of a float or double, and so real in general, can be any of the others (including the same type). This might change in the future if we define a separate float and double type! From all that we understand that the code needs to be:
switch(op_type){
case NONE: /* type compatibility only, '1': compatible */
    // first type INT
    if(type_1 == INT_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return 1;
        }
        else{
           type_error(type_1, type_2, op_type);
        }
    }
    // first type REAL
    else if(type_1 == REAL_TYPE){
        // second type INT, REAL or CHAR
        if(type_2 == INT_TYPE || type_2 == REAL_TYPE || type_2 == CHAR_TYPE){
            return 1;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    // first type CHAR
    else if(type_1 == CHAR_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return 1;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    break;
...code continues...
    You can see that we have a switch case with the operator type "op_type" and than the current case of "NONE" operator (no operator). That way depending on the first type we see if the second type is compatible with it or not. When they are compatible we return '1'. When not, well, then we call a new function called "type_error". This function just prints out that the two types are incompatible when using this operator type.
More specifically, the code of this function is:
void type_error(int type_1, int type_2, int op_type){ 
	fprintf(stderr, "Type conflict between %d and %d using op type %d\n",
         type_1, type_2, op_type);
	exit(1);
}
    You can see that we just print out that a type conflict occurred and then end the program with return code '1', meaning error! Pretty easy I guess!
In the same notion we also have cases for the rest of the operators...

Arithmetic operators

    Again real is compatible with everything, but, the same also happens with the rest of the types. In general, arithmetic operators are compatible with everything. But now we should also see what kind of result we get in each of these calculations! In general, whenever real is one of the types, the result will be real. In the same way, having int and char the result will be int. Only when we have char as the first type we will get a type of char in the result!
So, the code is:
...
case ARITHM_OP: /* arithmetic operator */
    // first type INT
    if(type_1 == INT_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return INT_TYPE;
        }
        // second type REAL
        else if(type_2 == REAL_TYPE){
            return REAL_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    // first type REAL
    else if(type_1 == REAL_TYPE){
        // second type INT, REAL or CHAR
        if(type_2 == INT_TYPE || type_2 == REAL_TYPE || type_2 == CHAR_TYPE){
            return REAL_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    // first type CHAR
    else if(type_1 == CHAR_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return CHAR_TYPE;
        }
        // second type REAL
        else if(type_2 == REAL_TYPE){
            return REAL_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    else{
        type_error(type_1, type_2, op_type);
    }
    break;
...
    The special case of INCR operator is very simple. We just return the type of the first parameter. The second type is non-existent and mostly UNDEF. That way we just have to write:
...
case INCR_OP: /* special case of INCR */
    // type INT
    if(type_1 == INT_TYPE){
        return INT_TYPE;
    }
    // type REAL
    else if(type_1 == REAL_TYPE){
        return REAL_TYPE;
    }
    // type CHAR
    else if(type_1 == CHAR_TYPE){
        return CHAR_TYPE;
    }
    else{
        type_error(type_1, type_2, op_type);
    }
    break;
...

Boolean operators

    As a Boolean value we can only use integers and characters. This means that only these two types are inter-compatible and usable with Boolean operators. To make our life even simpler we will return the type of the first parameter! In code this looks as following:
...
case BOOL_OP: /* Boolean operator */
    // first type INT
    if(type_1 == INT_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return INT_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    // first type CHAR
    else if(type_1 == CHAR_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return CHAR_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    else{
        type_error(type_1, type_2, op_type);
    }
    break;
...
    The special case of NOT is similar to INCR. We just return the type of the first parameter. This means that the code is:
...
case NOT_OP: /* special case of NOTOP */
    // type INT
    if(type_1 == INT_TYPE){
        return INT_TYPE;
    }
    // type CHAR
    else if(type_1 == CHAR_TYPE){
        return INT_TYPE;
    }
    else{
        type_error(type_1, type_2, op_type);
    }
    break;
...

Relational operators

    In relational operators the types need to be comparable. So when talking about integers or characters, those can be compared with each other, but also with the type real. For example we can have: "5 >= 5.5", which means that an integer is greater than a real. So, in the same way as with arithmetic operators all of the types are inter-compatible or better said inter-comparable in this case! But, there is one BIG difference now! The result type is a Boolean value, which in C is represented with an integer value.
So, knowing all that the code is:
...
case REL_OP: /* Relational operator */
    // first type INT
    if(type_1 == INT_TYPE){
        // second type INT, REAL or CHAR
        if(type_2 == INT_TYPE || type_2 == REAL_TYPE || type_2 == CHAR_TYPE){
            return INT_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    else if(type_1 == REAL_TYPE){
        // second type INT, REAL or CHAR
        if(type_2 == INT_TYPE || type_2 == REAL_TYPE || type_2 == CHAR_TYPE){
            return INT_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    // first type CHAR
    else if(type_1 == CHAR_TYPE){
        // second type INT, REAL or CHAR
        if(type_2 == INT_TYPE || type_2 == REAL_TYPE || type_2 == CHAR_TYPE){
                return INT_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    else{
        type_error(type_1, type_2, op_type);
    }
    break;
...

Equality operator

    Only Integer and characters can be inter-compatible in the equality and in-equality checks. Reals can only be compared with themselves. The result of such a comparison is again a Boolean value that we will represent using an integer. So, in the end, the code, including the default case and end of the switch-case statement, is:
...
case EQU_OP: /* Equality operator */
    // first type INT
    if(type_1 == INT_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return INT_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    else if(type_1 == REAL_TYPE){
        // second type REAL
        if(type_2 == REAL_TYPE){
            return INT_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    // first type CHAR
    else if(type_1 == CHAR_TYPE){
        // second type INT or CHAR
        if(type_2 == INT_TYPE || type_2 == CHAR_TYPE){
            return INT_TYPE;
        }
        else{
            type_error(type_1, type_2, op_type);
        }
    }
    else{
        type_error(type_1, type_2, op_type);
    }
    break;
/* ---------------------------------------------------------- */
default: /* wrong choice case */
    fprintf(stderr, "Error in operator selection!\n");
    exit(1);
}
    I would like to note that this type checking is of course not final at all! We just got into the general structure of "Type checking" and created the rules for most of the types! Of course Pointers need a special handling, that we will get into some articles later on. In arrays and functions we just use the info/data or return value, meaning that we use the types that we already covered!

"Simple" Integration

    Of course the "semantics.h" and "semantics.c" files have to be included in our previous code. More specifically, the "semantics.c" file has to include it's corresponding header file "semantics.h". Both the "symtab.c" and the "symtab.h" files, which contain the symbol table, have to include the "semantics.h" library. That's because we now defined the (token) types in there!
    To compile and use them in the actual final compiler we of course also have to include these files in the lexer and parser! In the lexer we have to include the header file "semantics.h", to be able to use the token type "UNDEF" when inserting an ID into the symbol table. That's because the insert() function of course also take a type as a parameter. In the same way, to be able to use the new functions in the parser we will have to include the "semantics.c" file there. This will also make sure that this new code will be compiled and included in the final compiler executable!
    And this is actually all that I wanted to do today! So, go take a look at the code on GitHub to understand how everything gets bind together! Of course the compiler right now, has the same features that it had in the previous article. But, we made a first step in the declaration and type checking procedures, which just need an "actual" integration now. To do this we will have to store information/attributes in the actual rules and so non-terminals, which is something that we will do in an article that follows. So, what remains now is writing some more code that will be used for semantics checks, by also creating the actual Attribute Grammar...

RESOURCES

References:

    The following helped me refresh my knowledge about C-datatypes, their inter-compatibility and semantics:
  1. http://www.cs.sfu.ca/~msiahban/personal/teaching/CMPT-379-Spring-2016/typecheck.pdf
  2. http://www.cse.chalmers.se/edu/year/2015/course/DAT150/lectures/proglang-07.html
  3. https://pebble.gitbooks.io/learning-c-with-pebble/content/chapter03.html
  4. https://pebble.gitbooks.io/learning-c-with-pebble/content/appendixa.html
  5. https://engineering.purdue.edu/~milind/ece468/2014fall/lecture-04.pdf

Images:

All of the images are custom-made!

Previous parts of the series


Final words | Next up on the project

     And this is actually it for today's post! I hope that I explained everything as much as I needed to, meaning that you learned something out of it.
Next up on this series are:
  • Semantic analysis (creating and using even more semantic rules in bison, attribute grammar)
  • Intermediate Code generation (Abstract Syntax Tree)
  • Machine Code generation (MIPS Assembly)
     Which are all topics that will need more than one article to complete. Also, note that we might also get into Optimizations later on, or could even extend the Language by adding complex datatypes (structs and unions), more rules etc.
So, see ya next time!

GitHub Account:

https://github.com/drifter1

Keep on drifting! ;)
Sort:  

Thank you for your contribution @drifter1.
After reviewing your tutorial we suggest only one point below:

  • The comments are very important in the code, however only place comments on pieces of code that are more complicated for the user to understand.

Your tutorial is excellent, thank you very much for your good work in developing your contribution.
We are waiting for more tutorials.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.


Need help? Write a ticket on https://support.utopian.io/.
Chat with us on Discord.
[utopian-moderator]

Thank you for your review, @portugalcoin! Keep up the good work!

Good post!

Hi @drifter1!

Your post was upvoted by @steem-ua, new Steem dApp, using UserAuthority for algorithmic post curation!
Your UA account score is currently 3.506 which ranks you at #6288 across all Steem accounts.
Your rank has dropped 11 places in the last three days (old rank 6277).

In our last Algorithmic Curation Round, consisting of 257 contributions, your post is ranked at #215.

Evaluation of your UA score:
  • You're on the right track, try to gather more followers.
  • The readers appreciate your great work!
  • Try to work on user engagement: the more people that interact with you via the comments, the higher your UA score!

Feel free to join our @steem-ua Discord server

Hello! Your post has been resteemed and upvoted by @ilovecoding because we love coding! Keep up good work! Consider upvoting this comment to support the @ilovecoding and increase your future rewards! ^_^ Steem On!

Reply !stop to disable the comment. Thanks!

Hey, @drifter1!

Thanks for contributing on Utopian.
We’re already looking forward to your next contribution!

Get higher incentives and support Utopian.io!
Simply set @utopian.pay as a 5% (or higher) payout beneficiary on your contribution post (via SteemPlus or Steeditor).

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

Coin Marketplace

STEEM 0.36
TRX 0.12
JST 0.039
BTC 70112.96
ETH 3549.99
USDT 1.00
SBD 4.71