What is SMerge?

From smerge/doc/design.txt:
Just a word about SMerge...
                                  Cyril ADRIAN (mailto cadrian@ifrance.com)

Abstract______________________________________________________________________

  SMerge  is  an  attempt to  at  last!  offer  programmers a  true  syntactic
merger. What do I  call a "syntactic merger"? That is a  merger which is aware
of the syntax of  a text (e.g., Eiffel source, Java source  and so on) instead
of just lines. Therefore, it is able to better cope (*) with:

  - INDENTATION. Indentation is not relevant in source code.

  - RE-ORDERED  FUNCTIONS. The order of things  may not be as  important as it
    seems. For  example, in Eiffel the  order of features is  not important as
    long as the belong to the same export clause.


    (*) than e.g., diff


The diagrams are made with Dia-0.86 (http://www.lysator.liu.se/~alla/dia/)

Note:  it  has  been  alleged  that  indentation may  be  important  (like  in
       Python). I am aware of that, but it's not a limitation: make it part of
       the grammar ;o)


References____________________________________________________________________

  [1]   Design Patterns
        Elements of Reusable Object-Oriented Software
        E. Gamma, R. Helm, R. Johnson, J. Vlissides
        Addison-Wesley, 1999, ISBN 0-201-63361-2

  [2]   Compilers
        Principles, techniques and tools
        A. Aho, R. Sethi, J. Ullman
        Addison-Wesley, 1986, ISBN 0-201-10194-7

  [3]   GOBO
        http://www.gobosoft.com

  [4]   ELJ
        http://www.elj.com


Design notes__________________________________________________________________

  The basis of SMerge is a generic parser (yet another compiler compiler ;o) )
in charge of building parsers... which will, in turn, be used by the syntactic
merger module---not yet designed, but it  will surely be a kind of visitor.  I
love this pattern; in the future,  this will maybe allow the fundation classes
of SMerge to be used for some other projects...

  This parser  will be a LL  but should easily  be extended to cope  with LALR
grammars in some future.


   The "parser/core" package__________________________________________________

   Directory: src/parser/core

        The core classes of the parser are in the "parser/core" package. Those
      classes  are   almost  all  deferred.   They  just   implement  a  basic
      skeleton. A diagram may be found in [parser.core.dia].

        Note that there are two groups of classes:

      - The  classes derivating  of  NODULE.  Those  classes  are just  parser
        particles...   They are  able  to  create some  NODE  under the  right
        circumstances (e.g., when submitted a source file)

      - The  classes derivating  of  NODE.  Those classes  are  indeed the  data
        representing the parsed file.

        I cannot  stress enough  the difference between  NODE and  NODULE. The
      names are close, but their roles are really distinct.

        You  may also  find in  this package  the NODE_MANAGER  which  in some
      future will  be made separate (when SmallEiffel  supports this keyword).
      This NODE_MANAGER  is in charge of  creating the whole AST  (ie the tree
      structure of NODEs).

        At last, there  is the RULE class; as it should  be obvious, it allows
      to replace ("reduce", see [2]) a  whole string of keywords with a single
      non-terminal derivation (RULE_HEAD, child of NODULE holds the derivating
      RULEs).


   The "parser/algorithms" package____________________________________________

   Directory: src/parser/algorithms

        This  package  provide  some  behaviour classes  implementing  several
      different parsing algorithms.

        This  is a  _strategy_ pattern:  the deferred  class is  GRAMMAR which
      provides only one method

          shift: -> BOOLEAN

      which  "advances"   in  the   code  source,  gathering   the  structural
      elements. When returning "false", the source has been totally parsed, or
      there was an error.


        At the time of writing, there are two concrete classes:

      - RECURSIVE_GRAMMAR, just a facade to the core which directly implements
        this  method.  It is  a  rough LL  parser  (rough:  recursive with  no
        predicive algorithms)

      - PREDICTIVE_GRAMMAR   is  an   enhanced  LL   parser   with  predictive
        algorithms. This one should be really fast.


        If someone would implement an LALR, and even an LR parser, please feel
      free to  do so. They  are more powerful  parsers, but _much_  slower and
      less error-tolerant.


   The "parser/stage1" package________________________________________________

   Directory: src/parser/stage1/eiffel

        Here come fancy  things. This package reads a  grammar description and
      creates a parser.  In that respect, this parser creates NODULEs that are
      able to parse source code.

        The classes of this package  being the very foundation of SMerge, they
      are a hard-coded  grammar. In other words, changing  the grammar grammar
      (sic) may require some work.

        The  basis of  the parser  is GOBO  XML (see  [3]). Hence  the grammar
      grammar is indeed a DTD.

        Here is ths DTD (found in the directory src/parser/stage1/xml)

      --8<---------------------------------------------------------------
      <?xml version='1.0' encoding='utf-8'?>
      
      <!ENTITY % regexp "#PCDATA">                  <!-- will be parsed by SMerge -->
      
         <!-- The regexps are parsed in the following way;
           --
           --   - the data must be ensclosed betwneen single quotes or
           --     double quotes; inside those pairs the other one may be freely
           --     used as in bash
           --
           --   - many sequences of single- and double-quotes parts are concatenated
           --     as in C
           --
           --   Example:
           --
           --   <regexp>
           --     "'"
           --     '[^"]*'
           --     "'"
           --   </regexp>
           --
           --   gives the regexp : '[^"]*'
           --
           --   (a string between single quotes which does not contain double quotes.
           --   The example is ugly, I know--feel free to provide a better one)
           --
           --   (see below the definition of the <regexp> element)
           -->
      
      <!-- ====================================================================== -->
      
      <!ELEMENT grammar (options?, tokens, rules)>
      
      <!-- ====================================================================== -->
      
      <!ELEMENT options (option+)>
      
      <!ELEMENT option (case | lookahead)>
      
      <!ELEMENT case EMPTY>
      <!ATTLIST case value (sensitive|insensitive) #REQUIRED>
      
      <!ELEMENT lookahead EMPTY>
      <!ATTLIST lookahead tokens CDATA #REQUIRED>
      
      <!-- ====================================================================== -->
      
      <!ELEMENT tokens ((macro | blank | comment)+)>
      
      <!ENTITY % term "is-regexp | is-identifier | is-number | regexp">
      
      <!ELEMENT macro (idname, (%term;))>
      
      <!ELEMENT idname (#PCDATA)>
      
      <!ELEMENT blank (%regexp;)>
      
      <!ELEMENT comment (%regexp;)>
      
      <!ELEMENT is-regexp EMPTY>           <!-- stage2 parser will parse a regexp -->
      <!ELEMENT is-identifier EMPTY>  <!-- stage2 parser will parse an identifier -->
      <!ELEMENT is-number EMPTY>           <!-- stage2 parser will parse a number -->
      
      <!ELEMENT regexp (%regexp;)>
      
      <!-- ====================================================================== -->
      
      <!ELEMENT rules (axiom, rule+)>
      
      <!ELEMENT axiom EMPTY>
      <!ATTLIST axiom name IDREF #REQUIRED>
      
      <!ELEMENT rule (empty?, listerm+)>
      <!ATTLIST rule name ID #REQUIRED>
      
      <!ELEMENT empty EMPTY>                                  <!-- the empty rule -->
      
      <!ELEMENT listerm ((nonterm | %term;)+)>
      <!ATTLIST listerm ordering (indicative|imperative) "imperative">
      <!ATTLIST listerm count (once|many) "once">
      
      <!ELEMENT nonterm EMPTY>
      <!ATTLIST nonterm name IDREF #REQUIRED>
      
      <!-- ====================================================================== -->
      --------------------------------------------------------------->8--

        The classes implementing this DTD follow the following scheme which, I
      hope, will simplify any modification of the grammar:

      - S1_* (where  "*" stands for any  element of the  DTD above) implements
        the element structure.   For instance, <listerm>..</listerm> is parsed
        by S1_LISTERM. All those classes extend STAGE1 which provide a feature
        (unmarshall) which must be used as creation feature.

      - In this DTD, see the link between nonterm and rule, made thanks to the
        ID (named "name"). Indeed, S1_NONTERM acts as a proxy of S1_RULE.

      - STAGE1 implements  XM_NODE_VISITOR, an interface provided  by GOBO XML
        (well,  I hope...   for  now  please patch  GOBO  XML). S1_*  redefine
        visit_element,  visit_attribute and  so  on to  create their  children
        (simply by calling unmarshall on them).

      - S1_GRAMMAR  is  the root  of  the  AST  representing the  grammar.  It
        provides a shift feature similar to the one of the core package.

      - S1_NODE implements NODE  (in the core package) and is  the type of the
        objects creating when calling  S1_GRAMMAR.shift. The root node will be
        stored in S1_GRAMMAR.root.

      - S1_FILE is the source file, shared via S1_SHARED.

      - S1_REGEXP_MATCHER uses ELJ's regular expressions (see [4]).

Extra notes___________________________________________________________________

  I hope to  use one of the GTK  Eiffel wrappers for the UI  part.  Nothing is
planned for that yet, though.

(to be continued...)