Add documentation around the BriDoc type/api

2017-03-06 20:49:08 +01:00 · 2017-03-06 20:49:08 +01:00 · ed10137174
parent cea81d5369
commit ed10137174
3 changed files with 440 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -101,3 +101,8 @@ stack build
    - -XTemplateHaskell
    - -XBangPatterns
  ~~~~
 # Implementation
 I have started adding documentation about the main data type `BriDoc`; see the
 "docs/implementation" folder. Start with "bridoc-design.md".
--- a/docs/implementation/bridoc-api.md
+++ b/docs/implementation/bridoc-api.md
@ -0,0 +1,203 @@
 # BriDoc nodes/Smart constructors and their semantics
 At this point, you should have a rough idea of what the involved
 types mean. This leaves us to explain the different `BriDoc`
 (smart) constructors and their exact semantics.
 ### Special nodes
 - docDebug/BDDebug
  Like the `trace` statement of the `BriDoc` type. It does not affect the
  normal output, but prints stuff to stderr when the transformation traverses
  this node.
 - BDExternal is used for original-source reproduction.
 ### Basic nodes
 - docEmpty/BDEmpty
  ""
  The empty document. Has empty output. Should never affect layouting.
 - docLit/BDLit Text
  "a" "Maybe" "("
  The most basic building block - a simple string. Has nothing to do with
  literals in the parsing sense. Will always be produced as-is in the output.
  It must be free of newline characters and should normally be free of any
  spaces (because those would never be considered for line-breaking - but there
  are cases where this makes sense still).
 - docSeq/BDSeq [BriDoc]
  "func foo = 13"
  An in-line/horizontal sequence of sub-docs. The sub-documents should not
  contain any newlines, but there is an exception: The last element of the
  sequence may be multi-line. In combination with `docSetBaseY` this allows
  for example:
  ~~~~.hs
  foo | bar = 1
      | baz = 2
  ~~~~
  which is represented roughly like
  ~~~~
  docSeq
    "foo"
    space
    docSetBaseY
      docLines
        stuff that results in "| bar = 1"
        stuff that results in "| baz = 2"
  ~~~~
  But in general it should be preferred to use `docPar` to handle multi-line
  sub-nodes, where possible.
 - docAlt/BDAlt [BriDoc]
  Specify multiple alternative layouts. Take care to appropriately maintain
  sharing for the documents representing the children of the current node.
 - docAltFilter
  simple utility wrapper around `docAlt`: Each alternative is accompanied by
  a boolean; if False the alternative is discarded.
 - docPar/BDPar
  TODO
 - docLines/BDLines
  TODO
 - docSeparator/BDSeparator
  Adds a space, unless it is the last element in a line. Also merges with
  other separators and has no effect if inserted right after inserting space
  (e.g. in the start of a line when indented) or if already indented due to
  horizontal alignment.
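To make the `docAltFilter` entry above a bit more concrete, here is a minimal sketch of the "drop alternatives whose flag is False" idea. The name `docAltFilterWith` and the explicit `alt` parameter are assumptions made so that the snippet is self-contained; in brittany the underlying function would be `docAlt`, and the real definition may differ.

```haskell
-- Hypothetical helper: keep only the alternatives tagged with True,
-- then hand the survivors to the given "alternatives" combinator.
docAltFilterWith :: ([a] -> b) -> [(Bool, a)] -> b
docAltFilterWith alt = alt . map snd . filter fst

-- Usage with `id` standing in for `docAlt`, so the result is just the
-- list of alternatives that survived the filter.
example :: [String]
example = docAltFilterWith id [(True, "single-line"), (False, "hanging"), (True, "multi-line")]
```

Here `example` is `["single-line", "multi-line"]`: the middle alternative is discarded before the combinator ever sees it.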
 ### Creating horizontal alignment
 - docCols/BDCols ColSig [BriDoc]
  This works like docSeq, but adds horizontal alignment if possible. The
  implementation involves a lot of special-case trickery and I assume that
  it is impossible to specify the exact semantics. But the rough idea is:
  If
  1. horizontal alignment is not turned off via global config
  2. there are consecutive lines (created e.g. by docLines or docPar) and
  3. both lines consist of docCols (where "consist" can ignore certain shallow
     wrappers like `docAddBaseY`) and
  4. the two ColSigs are equal and
  5. the two docCols contain an equal number of children and
  6. there is enough horizontal space to insert the additional spaces
  then the contained docs will be aligned horizontally.
  And further, if there are multiple lines so that consecutive pairs fulfill
  these requirements, the whole block will be aligned to the same horizontal
  tabs.
  And further, if a docCols contains another docCols, and the docCols in the
  next line also does, and the child docCols also match in ColSigs and have
  the same number of arguments and so on, then the children's children are
  also aligned horizontally.
  And of course this nesting also works over blocks built of matching
  consecutive pairs.
  Wait, was this not supposed to be broadly simplifying? Well.. it is. uhm.
  Let us just.. example.. an example seems fine.
  Consider the following declaration/formatting:
  ~~~~.hs
  func (MyLongFoo abc def) = 1
  func (Bar       a   d  ) = 2
  func _                   = 3
  ~~~~
  Note how the "=" are aligned over all three lines, and the patterns in the
  first two lines are aligned as well, while the pattern in the third line is
  just a structureless underscore.
  The representation behind that source is something in the direction of this
  (heavily simplified and not exact at all; e.g. spaces are not represented at
  all):
  ~~~~
  docLines
    docCols equation
      "func"
      docCols
        "("
        "MyLongFoo"
        "abc"
        "def"
        ")"
      docSeq
        "="
        "1"
    docCols equation
      "func"
      docCols
        "("
        "Bar"
        "a"
        "d"
        ")"
      docSeq
        "="
        "2"
    docCols equation
      "func"
      "_"
      docSeq
        "="
        "3"
  ~~~~
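The core idea of "align to the same horizontal tabs" can be sketched on plain lists of strings. This is a strongly simplified illustration and not brittany's algorithm: it ignores ColSigs, nesting and the available horizontal space, and only models requirement 5 (equal number of children). All names (`alignRows`, `pad`, `sameArity`, `stripEnd`) are made up for this sketch.

```haskell
import Data.List (transpose)

-- Pad a cell with spaces on the right up to the given width.
pad :: Int -> String -> String
pad width cell = cell ++ replicate (width - length cell) ' '

-- True if all rows have the same number of cells.
sameArity :: [[a]] -> Bool
sameArity rows = case map length rows of
  []       -> True
  (n : ns) -> all (== n) ns

-- Remove trailing spaces (padding of the last column is pointless).
stripEnd :: String -> String
stripEnd = reverse . dropWhile (== ' ') . reverse

-- Align the columns of consecutive rows if they have the same arity;
-- otherwise fall back to a plain space-separated sequence per row.
alignRows :: [[String]] -> [String]
alignRows rows
  | sameArity rows = map render rows
  | otherwise      = map unwords rows
  where
    -- Each column is as wide as its widest cell.
    widths = map (maximum . map length) (transpose rows)
    render cells = stripEnd (unwords (zipWith pad widths cells))
```

For instance, `alignRows [["foo", "=", "1"], ["longer", "=", "2"]]` yields `["foo    = 1", "longer = 2"]`, while rows of different length such as `[["a", "b"], ["c"]]` are left unaligned.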
 ### Controlling indentation level
 TODO
 - docAddBaseY/BDAddBaseY
 - docSetBaseY
 - docSetIndentLevel
 - docSetBaseAndIndent
 - docEnsureIndent
 ### Controlling layouting
 TODO
 - docNonBottomSpacing
 - docSetParSpacing
 - docForceParSpacing
 - docForceSingleline
 - docForceMultiline
 ### Inserting comments / Controlling comment placement
 TODO
 - docAnnotationPrior
 - docAnnotationKW
 - docAnnotationRest
 ### Deprecated
 - BDForwardLineMode is unused and apparently should be deprecated.
 - BDProhibitMTEL is deprecated
--- a/docs/implementation/bridoc-design.md
+++ b/docs/implementation/bridoc-design.md
@ -0,0 +1,232 @@
 # The BriDoc type and the to-BriDoc transformation
 The `BriDoc` type is the brittany equivalent of the `Doc` type from
 general-purpose formatting libraries such as the `pretty` package.
 It is specialized for this use case: Representing a formatted
 haskell source code document. As a consequence, it is a good amount
 more complex than the `Doc` type (which has 8, not directly exposed,
 constructors): The `BriDoc` type has ~25 constructors.
 (26, but one for debugging, two deprecated and so on.)
 Examples are `BDEmpty`, `BDSeq [BriDoc]` (inline sequence),
 and `BDAddBaseY BrIndent BriDoc` (add a certain type of indentation
 to the inner doc).
 The main bulk of code that makes brittany work is the translation
 of different syntactical constructs into a raw `BriDoc` value.
 (technically a `BriDocF` value, we'll explain soon.)
 The input of this translation is the syntax tree produced by
 GHC/ExactPrint. The ghc api exposes the syntax tree nodes, and
 ExactPrint adds certain annotations (e.g. information about
 in-source comments). The main thing that you will be looking
 at here is the ghc api documentation, for example
 https://downloads.haskell.org/~ghc/8.0.2/docs/html/libraries/ghc-8.0.2/HsDecls.html
 ## Two examples of the process producing raw BriDoc
 1. For example, `Brittany.hs` contains the following code (shortened a bit):
  ~~~~.hs
  ppDecl d@(L loc decl) = case decl of
    SigD sig  -> [..] $ do
      briDoc <- briDocMToPPM $ layoutSig (L loc sig)
      layoutBriDoc d briDoc
    ValD bind -> [..] $ do
      briDoc <- [..] layoutBind (L loc bind)
      layoutBriDoc d briDoc
    _         -> briDocMToPPM (briDocByExactNoComment d) >>= layoutBriDoc d
  ~~~~
  which matches on the type of module top-level syntax node and
  dispatches to `layoutSig`/`layoutBind` to layout type signatures
  and equations. For all other constructs, it currently falls back to using
  ExactPrint to reproduce the exact original.
 2. Let's look at a "lower" level fragment that actually produces BriDoc (from Type.hs):
  ~~~~.hs
    -- if our type is an application; think "HsAppTy Maybe Int"
    HsAppTy typ1 typ2 -> do
      typeDoc1 <- docSharedWrapper layoutType typ1 -- layout `Maybe`
      typeDoc2 <- docSharedWrapper layoutType typ2 -- layout `Int`
      docAlt                                       -- produce two possible layouts
        [ docSeq                                       -- a singular-line sequence, with a space in between
          [ docForceSingleline typeDoc1                -- "Maybe Int"
          , docLit $ Text.pack " "
          , docForceSingleline typeDoc2
          ]
         , docPar                                       -- a multi-line result, with the "child" indented.
            typeDoc1                                   -- "Maybe\
            (docEnsureIndent BrIndentRegular typeDoc2) --    Int"
        ]
  ~~~~
  Here, all functions prefixed with "doc" produce new BriDoc(F) nodes.
  I think this example can be understood already, even when many details
  (what is `docSharedWrapper`?
  What are the exact semantics of the different `doc..` functions?
  Why do we need to wrap the `BriDoc` constructors behind those smart-constructor thingies?)
  are not explained yet.
 ## Size of BriDoc trees, Sharing and Complexity
 In order to explain the `BriDocF` type and the reasoning behind smart
 constructors, we need to consider the size of the `BriDoc` tree produced by
 this whole process.
 As seen above, we can have multiple alternative layouts (`docAlt`) for
 the same node.
 This means the number of nodes in the `BriDoc` value we produce is, in
 general, exponential in the number of syntax nodes of the input.
 But we are aiming for linear run-time, right? So what can save us here?
 You might think: We have sharing! For `let x = 3+3; (x, x)` we only have one
 `x` in memory ever. And indeed, we do the same above: `typeDoc1` and `2` are
 used in exactly that manner: Both are referenced once in each of the two
 alternatives.
 Unfortunately this does not mean that we can forget this issue entirely.
 The problem is that the BriDoc tree value will get transformed by multiple
 transformations. And this "breaks" sharing: If we take an exponential-sized
 tree that is linear-via-sharing and `fmap` some function `f` on it (think of
 some general-purpose tree that is Functor) then `f` will be evaluated an
 exponential number of times. And worse, the output will have lost any sharing.
 Sharing is not automatic memoization.
 And this holds for BriDoc, even when the transformations are not exactly
 `fmap`s.
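A tiny experiment shows the effect on a toy tree (this is not the `BriDoc` type; `Tree`, `sharedTree` and `visits` are names invented for this sketch). The tree below occupies only `n + 1` nodes in memory thanks to sharing, but a naive traversal cannot see that and walks every path.

```haskell
-- A toy binary tree standing in for a document tree.
data Tree = Leaf Int | Node Tree Tree

-- Build a tree of depth n in which both children of every node are
-- the very same heap object (linear-via-sharing).
sharedTree :: Int -> Tree
sharedTree 0 = Leaf 1
sharedTree n = let t = sharedTree (n - 1) in Node t t

-- Count how many nodes a naive traversal visits. Any fmap-like
-- transformation does (at least) this much work.
visits :: Tree -> Integer
visits (Leaf _)   = 1
visits (Node l r) = 1 + visits l + visits r
```

`visits (sharedTree 10)` is `2047`, i.e. `2^11 - 1`, even though only 11 distinct nodes exist in memory.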
 So.. we already mentioned "memoization" there, right?
 1. The bad news:
   Any existing memoization utilities/approaches didn't work for one reason
   or another. (I suspect that there is a bug in the GHC StableName
   implementation, or I messed up..) After trying several memoization
   approaches and wasting tons of time, I went with a manual approach,
   and it worked more or less instantly. So that is where we are at.
   Manual memoization means that we manually tag every node of the `BriDoc`
   with a unique `Int`. This is rather annoying at places, but then again
   we can abstract over that pretty well.
 2. The good news:
   With manual memoization, creating an exponentially-sized tree is no
   problem, presuming that it is linear-via-sharing. Not messing up this
   property can take a bit of consideration - but otherwise we are set.
   If the `BriDocF` tree is exponential, the transformations will still
   do only linear-amount of "selection work" in order to convert into a
   linear-sized `BriDoc` tree.
   This property is the defining one that motivates the BriDoc
   intermediate representation.
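The manual approach can be illustrated with a minimal sketch on a toy tree, assuming a cache keyed by the node labels (here an `IntMap` from the `containers` package). This is not brittany's code: `Labeled`, `sharedLabeled` and `sizeMemo` are hypothetical names, and the real transformations compute something more interesting than a size.

```haskell
import           Data.IntMap.Strict (IntMap)
import qualified Data.IntMap.Strict as IntMap

-- Every node carries a unique Int label; shared nodes share a label.
type Labeled = (Int, T)
data T = Leaf Int | Node Labeled Labeled

-- A linear-via-sharing tree of depth n; the label is the node's height.
sharedLabeled :: Int -> Labeled
sharedLabeled 0 = (0, Leaf 1)
sharedLabeled n = let t = sharedLabeled (n - 1) in (n, Node t t)

-- Compute the (exponential) unfolded size, but do the work only once
-- per label by consulting and extending the cache.
sizeMemo :: Labeled -> IntMap Integer -> (Integer, IntMap Integer)
sizeMemo (lbl, node) cache = case IntMap.lookup lbl cache of
  Just s  -> (s, cache)
  Nothing ->
    let (s, cache') = case node of
          Leaf _   -> (1, cache)
          Node l r ->
            let (sl, c1) = sizeMemo l cache
                (sr, c2) = sizeMemo r c1
            in  (1 + sl + sr, c2)
    in  (s, IntMap.insert lbl s cache')
```

Running `sizeMemo (sharedLabeled 10) IntMap.empty` returns the size `2047` together with a cache of only 11 entries: one unit of work per labeled node rather than one per path.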
 ## BriDocF
 The `BriDocF f` type encapsulates the idea that each subnode is wrapped
 in the `f` container. This notion gives us the following nice properties:
 `BriDocF Identity ~ BriDoc` and `BriDocF ((,) Int)` is the
 manual-memoization tree with labeled nodes. Abstractions, abstractions..
 Let's have a glance at related code/types we have so far:
 ~~~~.hs
 -- The pure BriDoc: What we really want, but cannot use everywhere due
 -- to sharing issues.
 -- Isomorphic to `BriDocF Identity`. We still use this type, because
 -- then we have to unwrap the `Identity`s in only one place.
 data BriDoc
  = BDEmpty
  | BDLit !Text
  | BDSeq [BriDoc]
  | BDAddBaseY BrIndent BriDoc
  | BDAlt [BriDoc]
  .. [a good amount more]
 data BriDocF f
  = BDFEmpty
  | BDFLit !Text
  | BDFSeq [f (BriDocF f)]
  | BDFAddBaseY BrIndent (f (BriDocF f))
  | BDFAlt [f (BriDocF f)]
  .. [a good amount more]
 type BriDocFInt = BriDocF ((,) Int)
 type BriDocNumbered = (Int, BriDocFInt)
 -- drop the labels
 unwrapBriDocNumbered :: BriDocNumbered -> BriDoc
 unwrapBriDocNumbered = ..
 ~~~~
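To get a feeling for what "drop the labels" means, here is a sketch on a cut-down version of the two types. It is an assumption-laden miniature, not the actual definition: only four constructors are modelled, `String` replaces `Text` to keep the snippet dependency-free, and the real `unwrapBriDocNumbered` has to handle every constructor.

```haskell
-- Cut-down versions of the types shown above (four constructors only).
data BriDoc
  = BDEmpty
  | BDLit String
  | BDSeq [BriDoc]
  | BDAlt [BriDoc]
  deriving (Eq, Show)

data BriDocF f
  = BDFEmpty
  | BDFLit String
  | BDFSeq [f (BriDocF f)]
  | BDFAlt [f (BriDocF f)]

type BriDocFInt = BriDocF ((,) Int)
type BriDocNumbered = (Int, BriDocFInt)

-- Forget the label of the current node and recurse into the children.
unwrapBriDocNumbered :: BriDocNumbered -> BriDoc
unwrapBriDocNumbered (_, node) = case node of
  BDFEmpty  -> BDEmpty
  BDFLit t  -> BDLit t
  BDFSeq ds -> BDSeq (map unwrapBriDocNumbered ds)
  BDFAlt ds -> BDAlt (map unwrapBriDocNumbered ds)
```

For example, `unwrapBriDocNumbered (2, BDFSeq [(0, BDFLit "a"), (1, BDFEmpty)])` gives `BDSeq [BDLit "a", BDEmpty]`.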
 And, because we will need it below: The monadic context that the creation
 of the BriDocF tree uses:
 ~~~~.hs
 -- If you are not familiar with the `multistate`
 -- package and RWS, this is somewhat similar to:
 -- ReaderT Config (ReaderT Anns (WriterT [LayoutError] (WriterT (Seq String) (State NodeAllocIndex))))
 -- i.e. it is basically an environment allowing:
 --   a) read access to global program config `Config` and the exactprint
 --      annotations `Anns` of given input;
 --   b) write access of errors and "good" output;
 --   c) a local/"State" "variable" `NodeAllocIndex`
 --      (yep, for the manual memoization node labels).
 type ToBriDocM = MultiRWSS.MultiRWS '[Config, Anns] '[[LayoutError], Seq String] '[NodeAllocIndex]
 ~~~~
 We don't use this type directly here, but the code below does,
 and if the type `ToBriDocM` scared you, see how mundanely it
 is used (`m` will mostly be `ToBriDocM`):
 ~~~~
 allocNodeIndex :: MonadMultiState NodeAllocIndex m => m Int
 allocNodeIndex = do
  NodeAllocIndex i <- mGet
  mSet $ NodeAllocIndex (i + 1)
  return i
 ~~~~
 ## The `doc..` smart constructors
 In most cases the smart constructors are fairly dumb: Their main purpose
 is to allocate the unique label for the current node, and return it
 together with the node itself. Let's look at two examples to get a
 feeling for the types involved:
 ~~~~.hs
 docEmpty :: ToBriDocM BriDocNumbered
 docEmpty = allocateNode BDFEmpty -- what a "smart" constructor, right?
 docSeq :: [ToBriDocM BriDocNumbered] -> ToBriDocM BriDocNumbered
 docSeq l = allocateNode . BDFSeq =<< sequence l
 -- this is a bit more elaborate: In order to allow proper
 -- composition of these smart constructors, we accept a list of
 -- actions instead of just `BriDocNumbered`s, and use `sequence`
 -- to make it work. Nothing unusual otherwise.
 ~~~~
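The snippet above relies on `allocateNode`, which is not shown. A plausible shape for it can be sketched in a self-contained way by replacing `ToBriDocM` with a hand-rolled state monad that carries only the label counter. Everything here is an assumption for illustration: the `Alloc` monad, the three-constructor `Doc` type, and this particular `allocateNode` are not taken from brittany.

```haskell
-- A minimal state monad over the next free label (stand-in for ToBriDocM).
newtype Alloc a = Alloc { runAlloc :: Int -> (a, Int) }

instance Functor Alloc where
  fmap f (Alloc g) = Alloc $ \s -> let (a, s') = g s in (f a, s')

instance Applicative Alloc where
  pure a = Alloc $ \s -> (a, s)
  Alloc f <*> Alloc g = Alloc $ \s ->
    let (h, s1) = f s
        (a, s2) = g s1
    in  (h a, s2)

instance Monad Alloc where
  Alloc g >>= k = Alloc $ \s -> let (a, s1) = g s in runAlloc (k a) s1

-- Return the current label and bump the counter.
allocNodeIndex :: Alloc Int
allocNodeIndex = Alloc $ \i -> (i, i + 1)

-- A tiny labeled document type (stand-in for BriDocFInt).
data Doc = DEmpty | DLit String | DSeq [Numbered] deriving (Eq, Show)
type Numbered = (Int, Doc)

-- Hypothetical allocateNode: pair the node with a fresh label.
allocateNode :: Doc -> Alloc Numbered
allocateNode d = do
  i <- allocNodeIndex
  return (i, d)

docEmpty :: Alloc Numbered
docEmpty = allocateNode DEmpty

docLit :: String -> Alloc Numbered
docLit = allocateNode . DLit

docSeq :: [Alloc Numbered] -> Alloc Numbered
docSeq l = allocateNode . DSeq =<< sequence l
```

`fst (runAlloc (docSeq [docLit "a", docEmpty]) 0)` evaluates to `(2, DSeq [(0, DLit "a"), (1, DEmpty)])`: the children are labeled first (by `sequence`), and the parent receives the next free label.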
 There is one rather special `doc..` function: `docSharedWrapper`.
 Let's consider the code first:
 ~~~~.hs
 docSharedWrapper :: Monad m => (x -> m y) -> x -> m (m y)
 docSharedWrapper f x = return <$> f x
 ~~~~
 How is this useful? Consider this: All the smart constructors
 expect as input actions returning (freshly labeled) nodes.
 But what if we want sharing? In those cases we do _not_ want
 fresh labels on multiple uses. Here `docSharedWrapper` comes
 into play: It executes the contained label-allocation once
 and returns a pure action via `return`; this pure action
 can then be passed e.g. to docSeq but does not do any new
 allocation. This gives us sharing in the cases where we
 want it.
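The difference between a fresh action and a shared one can be observed directly. The sketch below reuses the `docSharedWrapper` definition from above verbatim, but runs it in `IO` with an `IORef` counter instead of `ToBriDocM`; the `labelled` helper and the `demo` action are made up for this illustration.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

docSharedWrapper :: Monad m => (x -> m y) -> x -> m (m y)
docSharedWrapper f x = return <$> f x

-- Hypothetical stand-in for a layouter: attach a fresh label to a string.
labelled :: IORef Int -> String -> IO (Int, String)
labelled ref s = do
  i <- readIORef ref
  writeIORef ref (i + 1)
  return (i, s)

demo :: IO ((Int, String), (Int, String), (Int, String), (Int, String))
demo = do
  ref <- newIORef 0
  -- Without the wrapper, every use of the action allocates a new label.
  let fresh = labelled ref "Maybe"
  a <- fresh
  b <- fresh
  -- With the wrapper, the label is allocated once, up front; the
  -- resulting action is pure and can be used any number of times.
  shared <- docSharedWrapper (labelled ref) "Int"
  c <- shared
  d <- shared
  return (a, b, c, d)
```

`demo` returns `((0,"Maybe"),(1,"Maybe"),(2,"Int"),(2,"Int"))`: the two uses of `fresh` get different labels, while both uses of `shared` refer to the same labeled node.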
 But wait, one more thing: Not all `BriDoc` constructors have
 an exactly matching smart constructor, and there are smart
 constructors that involve multiple BriDoc constructors behind
 the scenes. For this reason, we will focus on the smart
 constructors in the following, because they define the
 real interface to be used.
 You might now have a glance at "bridoc-api.md".