From ed1013717427ba0109f0d3eee29b24a1623ba15f Mon Sep 17 00:00:00 2001
From: Lennart Spitzner <lsp@informatik.uni-kiel.de>
Date: Mon, 6 Mar 2017 20:49:08 +0100
Subject: [PATCH] Add documentation around the BriDoc type/api

---
 README.md                            |   5 +
 docs/implementation/bridoc-api.md    | 203 +++++++++++++++++++++++
 docs/implementation/bridoc-design.md | 232 +++++++++++++++++++++++++++
 3 files changed, 440 insertions(+)
 create mode 100644 docs/implementation/bridoc-api.md
 create mode 100644 docs/implementation/bridoc-design.md

diff --git a/README.md b/README.md
index 0992220..79a7b74 100644
--- a/README.md
+++ b/README.md
@@ -101,3 +101,8 @@ stack build
     - -XTemplateHaskell
     - -XBangPatterns
   ~~~~
+
+# Implementation
+
+I have started adding documentation about the main data type `BriDoc`; see the
+"docs/implementation" folder. Start with "bridoc-design.md".
diff --git a/docs/implementation/bridoc-api.md b/docs/implementation/bridoc-api.md
new file mode 100644
index 0000000..c4c53e7
--- /dev/null
+++ b/docs/implementation/bridoc-api.md
@@ -0,0 +1,203 @@
+# BriDoc nodes/Smart constructors and their semantics
+
+At this point, you should have a rough idea of what the involved
+types mean. This leaves us to explain the different `BriDoc`
+(smart) constructors and their exact semantics.
+
+### Special nodes
+
+- docDebug/BDDebug
+
+  Like the `trace` statement of the `BriDoc` type. It does not affect the
+  normal output, but prints debugging output to stderr when the transformation
+  traverses this node.
+
+- BDExternal is used for original-source reproduction.
+
+### Basic nodes
+
+- docEmpty/BDEmpty
+
+  ""
+
+  The empty document. Has empty output. Should never affect layouting.
+
+- docLit/BDLit Text
+
+  "a" "Maybe" "("
+
+  The most basic building block - a simple string. Has nothing to do with
+  literals in the parsing sense. Will always be produced as-is in the output.
+  It must be free of newline characters and should normally be free of any
+  spaces (because those would never be considered for line-breaking - but there
+  are cases where this makes sense still).
+
+- docSeq/BDSeq [BriDoc]
+  
+  "func foo = 13"
+
+  An in-line/horizontal sequence of sub-docs. The sub-documents should not
+  contain any newlines, but there is an exception: The last element of the
+  sequence may be multi-line. In combination with `docSetBaseY` this allows
+  for example:
+
+  ~~~~.hs
+  foo | bar = 1
+      | baz = 2
+  ~~~~
+
+  which is represented roughly like
+
+  ~~~~
+  docSeq
+    "foo"
+    space
+    docSetBaseY
+      docLines
+        stuff that results in "| bar = 1"
+        stuff that results in "| baz = 2"
+  ~~~~
+
+  But in general it should be preferred to use `docPar` to handle multi-line
+  sub-nodes, where possible.
+
+- docAlt/BDAlt [BriDoc]
+
+  Specify multiple alternative layouts. Take care to appropriately maintain
+  sharing for the documents representing the children of the current node.
+
+- docAltFilter
+
+  A simple utility wrapper around `docAlt`: Each alternative is accompanied by
+  a boolean; if it is `False`, the alternative is discarded.
+
+- docPar/BDPar
+
+  TODO
+
+- docLines/BDLines
+
+  TODO
+
+- docSeparator/BDSeparator
+
+  Adds a space, unless it is the last element in a line. Also merges with
+  other separators and has no effect if inserted right after inserting space
+  (e.g. in the start of a line when indented) or if already indented due to
+  horizontal alignment.
+
+### Creating horizontal alignment
+
+- docCols/BDCols ColSig [BriDoc]
+  
+  This works like docSeq, but adds horizontal alignment if possible. The
+  implementation involves a lot of special-case trickeries and I assume that
+  it is impossible to specify the exact semantics. But the rough idea is:
+  If
+
+  1. horizontal alignment is not turned off via global config
+  2. there are consecutive lines (created e.g. by docLines or docPar) and
+  3. both lines consist of docCols (where "consist" can ignore certain shallow
+     wrappers like `docAddBaseY`) and
+  4. the two ColSigs are equal and
+  5. the two docCols contain an equal number of children and
+  6. there is enough horizontal space to insert the additional spaces
+
+  then the contained docs will be aligned horizontally.
+
+  And further, if there are multiple lines so that consecutive pairs fulfill
+  these requirements, the whole block will be aligned to the same horizontal
+  tabs.
+
+  And further, if a docCols contains another docCols, and the docCols in the
+  next line also does, and the child docCols also match in ColSigs and have
+  the same number of arguments and so on, then the children's children are
+  also aligned horizontally.
+
+  And of course this nesting also works over blocks built of matching
+  consecutive pairs.
+
+  Wait, was this not supposed to be broadly simplifying? Well.. it is. uhm.
+  Let us just.. example.. an example seems fine.
+
+  Consider the following declaration/formatting:
+
+  ~~~~.hs
+  func (MyLongFoo abc def) = 1
+  func (Bar       a   d  ) = 2
+  func _                   = 3
+  ~~~~
+
+  Note how the "=" are aligned over all three lines, and the patterns in the
+  first two lines are as well, but the pattern in the third line is just a
+  structureless underscore.
+
+  The representation behind that source is something in the direction of this
+  (heavily simplified and not exact at all; e.g. spaces are not represented at
+  all):
+
+  ~~~~
+  docLines
+    docCols equation
+      "func"
+      docCols
+        "("
+        "MyLongFoo"
+        "abc"
+        "def"
+        ")"
+      docSeq
+        "="
+        "1"
+    docCols equation
+      "func"
+      docCols
+        "("
+        "Bar"
+        "a"
+        "d"
+        ")"
+      docSeq
+        "="
+        "2"
+    docCols equation
+      "func"
+      "_"
+      docSeq
+        "="
+        "3"
+  ~~~~
+
+### Controlling indentation level
+
+TODO
+
+- docAddBaseY/BDAddBaseY
+- docSetBaseY
+- docSetIndentLevel
+- docSetBaseAndIndent
+- docEnsureIndent
+
+### Controlling layouting
+
+TODO
+
+- docNonBottomSpacing
+- docSetParSpacing
+- docForceParSpacing
+- docForceSingleline
+- docForceMultiline
+
+### Inserting comments / Controlling comment placement
+
+TODO
+
+- docAnnotationPrior
+- docAnnotationKW
+- docAnnotationRest
+
+### Deprecated
+
+- BDForwardLineMode is unused and apparently should be deprecated.
+- BDProhibitMTEL is deprecated
+
diff --git a/docs/implementation/bridoc-design.md b/docs/implementation/bridoc-design.md
new file mode 100644
index 0000000..31103a3
--- /dev/null
+++ b/docs/implementation/bridoc-design.md
@@ -0,0 +1,232 @@
+# The BriDoc type and the to-BriDoc transformation
+
+The `BriDoc` type is the brittany equivalent of the `Doc` type from
+general-purpose formatting libraries such as the `pretty` package.
+It is specialized for this use case: representing a formatted
+Haskell source code document. As a consequence, it is a good amount
+more complex than the `Doc` type (which has 8, not directly exposed,
+constructors): The `BriDoc` type has ~25 constructors.
+(26, but one for debugging, two deprecated and so on.)
+Examples are `BDEmpty`, `BDSeq [BriDoc]` (inline sequence),
+and `BDAddBaseY BrIndent BriDoc` (add a certain type of indentation
+to the inner doc).
+
+The main bulk of code that makes brittany work is the translation
+of different syntactical constructs into a raw `BriDoc` value.
+(technically a `BriDocF` value, we'll explain soon.)
+
+The input of this translation is the syntax tree produced by
+GHC/ExactPrint. The ghc api exposes the syntax tree nodes, and
+ExactPrint adds certain annotations (e.g. information about
+in-source comments). The main thing that you will be looking
+at here is the ghc api documentation, for example
+https://downloads.haskell.org/~ghc/8.0.2/docs/html/libraries/ghc-8.0.2/HsDecls.html
+
+## Two examples of the process producing raw BriDoc
+
+1. For example, `Brittany.hs` contains the following code (shortened a bit):
+
+  ~~~~.hs
+  ppDecl d@(L loc decl) = case decl of
+    SigD sig  -> [..] $ do
+      briDoc <- briDocMToPPM $ layoutSig (L loc sig)
+      layoutBriDoc d briDoc
+    ValD bind -> [..] $ do
+      briDoc <- [..] layoutBind (L loc bind)
+      layoutBriDoc d briDoc
+    _         -> briDocMToPPM (briDocByExactNoComment d) >>= layoutBriDoc d
+  ~~~~
+
+  which matches on the kind of top-level syntax node in the module and
+  dispatches to `layoutSig`/`layoutBind` to lay out type signatures
+  and equations. For all other constructs, it currently falls back to using
+  ExactPrint to reproduce the exact original.
+
+2. Let's look at a "lower" level fragment that actually produces BriDoc (from Type.hs):
+
+  ~~~~.hs
+    -- if our type is an application; think "HsAppTy Maybe Int"
+    HsAppTy typ1 typ2 -> do
+      typeDoc1 <- docSharedWrapper layoutType typ1 -- layout `Maybe`
+      typeDoc2 <- docSharedWrapper layoutType typ2 -- layout `Int`
+      docAlt                                       -- produce two possible layouts
+        [ docSeq                                       -- a single-line sequence, with a space in between
+          [ docForceSingleline typeDoc1                -- "Maybe Int"
+          , docLit $ Text.pack " "
+          , docForceSingleline typeDoc2
+          ]
+        , docPar                                       -- a multi-line result, with the "child" indented.
+            typeDoc1                                   -- "Maybe\
+            (docEnsureIndent BrIndentRegular typeDoc2) --    Int"
+        ]
+  ~~~~
+
+  Here, all functions prefixed with "doc" produce new BriDoc(F) nodes.
+  I think this example can be understood already, even when many details
+  (what is `docSharedWrapper`?
+  What are the exact semantics of the different `doc..` functions?
+  Why do we need to wrap the `BriDoc` constructors behind those smart-constructor thingies?)
+  are not explained yet.
+  
+## Size of BriDoc trees, Sharing and Complexity
+
+In order to explain the `BriDocF` type and the reasoning behind smart
+constructors, we need to consider the size of the `BriDoc` tree produced by
+this whole process.
+As seen above, we can have multiple alternative layouts (`docAlt`) for
+the same node.
+This means the number of nodes in the `BriDoc` value we produce is in general
+exponential in the number of syntax nodes of the input.
+
+But we are aiming for linear run-time, right? So what can save us here?
+You might think: We have sharing! For `let x = 3+3 in (x, x)` we only have one
+`x` in memory ever. And indeed, we do the same above: `typeDoc1` and `typeDoc2`
+are used in exactly that manner: Both are referenced once in each of the two
+alternatives.
+
+Unfortunately this does not mean that we can forget this issue entirely.
+The problem is that the BriDoc tree value will get transformed by multiple
+transformations. And this "breaks" sharing: If we take an exponential-sized
+tree that is linear-via-sharing and `fmap` some function `f` on it (think of
+some general-purpose tree that is Functor) then `f` will be evaluated an
+exponential number of times. And worse, the output will have lost any sharing.
+Sharing is not automatic memoization.
+And this holds for BriDoc, even when the transformations are not exactly
+`fmap`s.
+
+So.. we already mentioned "memoization" there, right?
+
+1. The bad news:
+   None of the existing memoization utilities/approaches worked, for one reason
+   or another. (I suspect that there is a bug in the GHC StableName
+   implementation, or I messed up..) After trying several memoization
+   approaches and wasting tons of time, I went with a manual approach,
+   and it worked more or less instantly. So that is where we are at.
+
+   Manual memoization means that we manually tag every node of the `BriDoc`
+   with a unique `Int`. This is rather annoying at places, but then again
+   we can abstract over that pretty well.
+   
+2. The good news:
+   With manual memoization, creating an exponentially-sized tree is no
+   problem, presuming that it is linear-via-sharing. Not messing up this
+   property can take a bit of consideration - but otherwise we are set.
+   If the `BriDocF` tree is exponential, the transformations will still
+   do only a linear amount of "selection work" in order to convert into a
+   linear-sized `BriDoc` tree.
+   
+   This property is the defining one that motivates the BriDoc
+   intermediate representation.
+
+## BriDocF
+
+The `BriDocF f` type encapsulates the idea that each subnode is wrapped
+in the `f` container. This notion gives us the following nice properties:
+
+`BriDocF Identity ~ BriDoc` and `BriDocF ((,) Int)` is the
+manual-memoization tree with labeled nodes. Abstractions, abstractions..
+
+Let's have a glance at related code/types we have so far:
+
+~~~~.hs
+-- The pure BriDoc: What we really want, but cannot use everywhere due
+-- to sharing issues.
+-- Isomorphic to `BriDocF Identity`. We still use this type, because
+-- then we have to unwrap the `Identities` only in one place.
+data BriDoc
+  = BDEmpty
+  | BDLit !Text
+  | BDSeq [BriDoc]
+  | BDAddBaseY BrIndent BriDoc
+  | BDAlt [BriDoc]
+  .. [a good amount more]
+
+data BriDocF f
+  = BDFEmpty
+  | BDFLit !Text
+  | BDFSeq [f (BriDocF f)]
+  | BDFAddBaseY BrIndent (f (BriDocF f))
+  | BDFAlt [f (BriDocF f)]
+  .. [a good amount more]
+
+type BriDocFInt = BriDocF ((,) Int)
+type BriDocNumbered = (Int, BriDocFInt)
+
+-- drop the labels
+unwrapBriDocNumbered :: BriDocNumbered -> BriDoc
+unwrapBriDocNumbered = ..
+~~~~
+
+And, because we will need it below: The monadic context that the creation
+of the BriDocF tree uses:
+
+~~~~.hs
+-- If you are not familiar with the `multistate`
+-- package and RWS, this is somewhat similar to:
+-- ReaderT Config (ReaderT Anns (WriterT [LayoutError] (WriterT (Seq String) (State NodeAllocIndex))))
+-- i.e. it is basically an environment allowing:
+--   a) read access to global program config `Config` and the exactprint
+--      annotations `Anns` of given input;
+--   b) write access for errors and "good" output;
+--   c) a local/"State" "variable" `NodeAllocIndex`
+--      (yep, for the manual memoization node labels).
+type ToBriDocM = MultiRWSS.MultiRWS '[Config, Anns] '[[LayoutError], Seq String] '[NodeAllocIndex]
+~~~~
+
+We don't use this directly, but the code below does,
+and if the type `ToBriDocM` scared you, see how mundanely it
+is used here (`m` will be `ToBriDocM` mostly):
+
+~~~~.hs
+allocNodeIndex :: MonadMultiState NodeAllocIndex m => m Int
+allocNodeIndex = do
+  NodeAllocIndex i <- mGet
+  mSet $ NodeAllocIndex (i + 1)
+  return i
+~~~~
+ 
+## The `doc..` smart constructors
+
+In most cases the smart constructors are fairly dumb: Their main purpose
+is to allocate the unique label for the current node, and return it
+together with the node itself. Let's look at two examples to get a
+feeling for the types involved:
+
+~~~~.hs
+docEmpty :: ToBriDocM BriDocNumbered
+docEmpty = allocateNode BDFEmpty -- what a "smart" constructor, right?
+
+docSeq :: [ToBriDocM BriDocNumbered] -> ToBriDocM BriDocNumbered
+docSeq l = allocateNode . BDFSeq =<< sequence l
+-- this is a bit more elaborate: In order to allow proper
+-- composition of these smart constructors, we accept a list of
+-- actions instead of just `BriDocNumbered`s, and use `sequence`
+-- to make it work. Nothing unusual otherwise.
+~~~~
+
+There is one rather special `doc..` function: `docSharedWrapper`.
+Let's consider the code first:
+
+~~~~.hs
+docSharedWrapper :: Monad m => (x -> m y) -> x -> m (m y)
+docSharedWrapper f x = return <$> f x
+~~~~
+
+How is this useful? Consider this: All the smart constructors
+expect as input actions returning (freshly labeled) nodes.
+But what if we want sharing? In those cases we do _not_ want
+fresh labels on multiple uses. Here `docSharedWrapper` comes
+into play: It executes the contained label-allocation once
+and returns a pure action via `return`; this pure action
+can then be passed e.g. to docSeq but does not do any new
+allocation. This gives us sharing in the cases where we
+want it.
+
+But wait, one more thing: Not all `BriDoc` constructors have
+an exactly matching smart constructor, and there are smart
+constructors that involve multiple BriDoc constructors behind
+the scenes. For this reason, we will focus on the smart
+constructors in the following, because they define the
+real interface to be used.
+
+You might now want to have a glance at "bridoc-api.md".