Add doc chapter on exactprinting, plus minor doc fixups

2019-11-18 11:26:18 +01:00 · 2019-11-18 11:26:18 +01:00 · 41750dc8a8
parent 974826f98f
commit 41750dc8a8
3 changed files with 435 additions and 7 deletions
--- a/doc/implementation/exactprinting.md
+++ b/doc/implementation/exactprinting.md
@@ -0,0 +1,412 @@
+# Exactprinting
+
+Brittany uses the `ghc-exactprint` library/wrapper around the GHC API to
+parse Haskell source code into a syntax tree and into "annotations". The
+unannotated syntax tree would lose information, such as the exact (relative)
+position in the source text, or any comments; this is what annotations provide.
+
+Following that name, we'll call "exactprinting" the aspect of reproducing
+comments and relative positions - most importantly additional newlines - while
+round-tripping through brittany. The focus is not on the API of the
+`ghc-exactprint` library, but on the corresponding data-flow through brittany.
+
+**Take note that the `--dump-bridoc-*` output filters out the constructors
+responsible for comments and for applying DeltaPositions.**
+This is done to keep the output more readable, but could confuse you if you
+try to understand how comments work.
+
+## TLDR - Practical Suggestions for Implementing Layouters
+
+This advice does not explain how comments work, but if you are implementing
+a layouter it might cover most cases without requiring you to understand the
+details.
+
+- Ideally, we should wrap the `BriDoc` of any construct that has a location
+  (i.e. has the form `(L _ something)`) (and consequently has an `AnnKey`)
+  using `docWrapNode`. As an example, look at the `layoutExpr` function and
+  how it applies `docWrapNode lexpr $ ..` right at the top.
+
+- If we have not done the above, it is somewhat likely that comments
+  "get eaten". For such cases:
+
+  1. Take a small reproduction case
+
+  1. Throw it at `brittany --dump-ast-full` and see where the comment is
+       in the syntax tree. See where the corresponding syntax node is
+       consumed/transformed by brittany and wrap it with `docWrapNode`.
+
+  1. If it is unclear what alternative (of a `docAlt` or
+       `runFilteredAlternative`) applies, try inserting `docDebug "myLabel"`
+       nodes to track down which alternative applies.
+
+- For comments that _do_ appear in the output but at the wrong location, there
+  are two classes of problems. The first is comments that move "down" past
+  other stuff (even switching the order of comments is possible). Use the steps
+  from the last item to figure out which syntax tree constructor is relevant,
+  and try inserting `docMoveToKWDP` or replace `docWrapNode` with a manually
+  refined combination of `docWrapNodePrior` and `docWrapNodeRest`. The second
+  class, wrong indentation, is covered by the next item.
+
+- For comments that _do_ appear in the output in roughly the right position,
+  only with the wrong indentation, the cause most likely is a
+  mis-interpretation of DPs that can be fixed by inserting a
+  `docSetIndentLevel` at the right position - right before printing the
+  thing that provides the "layouting rule" indentation, i.e. the body of a
+  `do`/`let`/`where` block.
+
+- There is one other cause for off-by-one errors in comment position:
+  Whitespace. In general, layouters should prefer to use `docSeparator` to
+  insert space between syntax elements rather than including spaces in
+  literal strings. As an example, use `docSeq [docLit "then", docSeparator]`
+  or the equivalent `appSep (docLit "then")` rather than `docLit "then "`.
+  The reason is that comment positions are relative to the last non-whitespace,
+  and `docSeparator` is interpreted in just the right fashion: It inserts
+  a whitespace, but keeps track of the correct comment offset. (Also,
+  subsequent `docSeparators` are merged into one.)
+
+- If all of this fails, read below, bother the maintainers and/or make use of
+  the more advanced debugging features (there is a `#define` in
+  `BackendUtils.hs` that you can turn on to insert all kinds of verbose
+  output in-line with the actual output).
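+
+The separator behaviour described above can be illustrated with a toy model.
+This is a minimal sketch, not brittany's implementation: the `Doc` type and
+the `render` function are hypothetical stand-ins for `BriDoc` and the backend,
+and the only rule modelled is that consecutive separators collapse into a
+single space.
+
+~~~~.hs
+-- Toy document type; NOT brittany's BriDoc.
+data Doc
+  = Lit String  -- literal text, should not contain trailing spaces
+  | Sep         -- a separator: at most one space in the output
+  | Seq [Doc]   -- horizontal sequence
+
+-- Flatten nested sequences into a list of leaves.
+flatten :: Doc -> [Doc]
+flatten (Seq ds) = concatMap flatten ds
+flatten d        = [d]
+
+-- Render the leaves left to right. The Bool tracks whether the previous
+-- leaf was a separator; a separator directly after another separator
+-- produces no additional space.
+render :: Doc -> String
+render = go False . flatten
+ where
+  go _       []              = ""
+  go _       (Lit s : rest)  = s ++ go False rest
+  go True    (Sep : rest)    = go True rest
+  go False   (Sep : rest)    = ' ' : go True rest
+  go pending (Seq ds : rest) = go pending (ds ++ rest)
+~~~~
+
+With this model, `render (Seq [Lit "then", Sep, Sep, Lit "x"])` yields
+`"then x"`: the two separators produce one space. A literal `"then "` followed
+by a separator would produce two spaces, which is the kind of off-by-one the
+advice above is meant to avoid.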
+
+## A Small Example
+
+~~~~.hs
+main = do
+  putStr "hello" -- very suspense
+  putStrLn " world" --nice
+~~~~
+
+If you pass this to `brittany --dump-ast-full` you'll see .. a 100 line syntax
+tree. Yeah, raw syntax trees are a bit unwieldy.
+
+(btw I'd use `clipread | brittany --dump-ast-full` for that purpose, where
+`clipread` boils down to `xclip -o -selection clipboard`. If you have not set
+up that script on your system, you really should.)
+
+To simplify this slightly, we will focus down on just the syntax tree of
+the `do` block, which is the `HsDo` constructor.
+
+~~~~
+---- ast ----
+A Just (Ann (DP (0,0)) [] [] [((AnnComment (Comment "--nice" stdin:3:21-26 Nothing)),DP (0,1)),((G AnnEofPos),DP (1,0))] Nothing Nothing)
+  HsModule
+    ..
+    [ A Just (Ann (DP (0,0)) [] [] [] Nothing Nothing)
+        ValD
+          FunBind
+            A Just (Ann (DP (0,0)) [] [] [((G AnnVal),DP (0,0))] Nothing Nothing)
+              Unqual {OccName: main}
+            MG
+              A Nothing
+                [ A Just (Ann (DP (0,0)) [] [] [((G AnnEqual),DP (0,1))] Nothing Nothing)
+                    Match
+                      FunRhs
+                        ..main..
+                        Prefix
+                        NoSrcStrict
+                      []
+                      GRHSs
+                        [ A Just (Ann (DP (0,-1)) [] [] [] Nothing Nothing)
+                            GRHS
+                              []
+                              A Just (Ann (DP (0,1)) [] [] [((G AnnDo),DP (0,0))] Nothing Nothing)
+                                HsDo
+                                  DoExpr
+                                  A Nothing
+                                    [ A Just (Ann (DP (1,2)) [] [] [] Nothing Nothing)
+                                        BodyStmt
+                                          A Just (Ann (DP (0,0)) [] [] [] Nothing Nothing)
+                                            HsApp
+                                              ..putStr..
+                                              .."hello"..
+                                          ..
+                                          ..
+                                    , A Just (Ann (DP (1,0)) [((Comment "-- very suspense" stdin:2:18-33 Nothing),DP (0,1))] [] [] Nothing Nothing)
+                                        BodyStmt
+                                          A Just (Ann (DP (0,0)) [] [] [] Nothing Nothing)
+                                            HsApp
+                                              ..putStrLn..
+                                              .." world"..
+                                          ..
+                                          ..
+                                    ]
+                        ]
+                        A (Nothing) (EmptyLocalBinds)
+                ]
+              FromSource
+            WpHole
+            []
+    ]
+    ..
+~~~~
+
+So this is a Haskell module `HsModule`, containing a function bind `FunBind`,
+containing a match group, containing a `Match`, containing some right-hand-side
+expression which in this case is just a do block `HsDo`, which contains two
+applications `HsApp` of a function `putStr(Ln)` to some string literal.
+
+There is no need to understand this, as long as you can roughly see how this
+representation corresponds to the input source code.
+
+For the purpose of exactprinting, what we need to look at are the annotations.
+The `ghc-exactprint` library returns the syntax tree and annotations as two
+different entities:
+
+- The syntax tree. [You can start looking at the module level](https://downloads.haskell.org/ghc/latest/docs/html/libraries/ghc-8.8.1/HsSyn.html#v:HsModule)
+  and work your way down to any syntactical construct from there;
+- The annotations. See the [Annotation type and its `Ann` constructor](https://hackage.haskell.org/package/ghc-exactprint-0.6.2/docs/Language-Haskell-GHC-ExactPrint-Types.html#t:Annotation).
+
+In the above `--dump-ast-full` output these two are mixed together using the
+fake `A` constructor that is just a pair of a `Maybe Annotation` and of one
+node in the syntax tree. It was produced by recursively printing the syntax
+tree, and for each node `n` we print `A (getAnnotation n) n`. So let's focus
+on the `Annotation` type.
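+
+The way such a dump pairs each node with its annotation can be sketched as
+follows. This is an illustration under simplifying assumptions, not
+brittany's dumping code: `Key`, `Annotation`, `Node` and `dump` are
+hypothetical stand-ins, and the real output format differs in detail.
+
+~~~~.hs
+import qualified Data.Map.Strict as Map
+
+-- Hypothetical stand-ins for ghc-exactprint's AnnKey and Annotation.
+type Key = String
+newtype Annotation = Annotation String deriving (Eq, Show)
+
+-- Toy syntax tree: every node has a key, a label and children.
+data Node = Node Key String [Node]
+
+-- For each node print "A <Maybe Annotation>" followed by the node itself,
+-- then recurse into the children with deeper indentation.
+dump :: Map.Map Key Annotation -> Node -> [String]
+dump anns = go 0
+ where
+  go :: Int -> Node -> [String]
+  go depth (Node key label children) =
+    (replicate depth ' ' ++ "A " ++ show (Map.lookup key anns))
+      : (replicate (depth + 2) ' ' ++ label)
+      : concatMap (go (depth + 4)) children
+~~~~
+
+Nodes without an entry in the map show up as `A Nothing`, just like the
+`MG` and statement-list nodes in the output above.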
+
+## The `ghc-exactprint` Annotation Type
+
+~~~~.hs
+data Annotation = Ann
+  { annEntryDelta        :: !DeltaPos
+  , annPriorComments     :: ![(Comment, DeltaPos)]
+  , annFollowingComments :: ![(Comment, DeltaPos)]
+  , annsDP               :: ![(KeywordId, DeltaPos)]
+  , annSortKey           :: !(Maybe [SrcSpan])
+  , annCapturedSpan      :: !(Maybe AnnKey)
+  }
+~~~~
+
+But please refer to [the ghc-exactprint docs](https://hackage.haskell.org/package/ghc-exactprint-0.6.2/docs/Language-Haskell-GHC-ExactPrint-Types.html#t:Annotation) for the fully commented version.
+
+A few things to point out:
+
+- There are _three_ fields of that constructor that can contain the `Comment`
+  type. `annPriorComments` and `annFollowingComments` are obvious, but
+  a third hides behind the `KeywordId` type in `annsDP`. Source code comments
+  may appear in one of these three locations.
+- The `DeltaPos` type and its `DP` constructor can be seen in the above output
+  everywhere. It contains information about relative positioning of both
+  comments and syntax nodes. Please test what changes if you insert a newline
+  before `putStrLn`, or add spaces before one of the comments, and see how the
+  `--dump-ast-full` output changes.
+- The exact semantics of the `DP` value, especially when it comes to
+  indentation, are a source of constant joy. If the values don't make sense,
+  you are on the right track. Just figure out what DP is connected to what
+  change in the syntax tree for now.
+- We have two comments in the source code, which appear in opposite order
+  in the `--dump-ast-full` output. The reason is that comments mostly appear
+  in the middle of two AST nodes, and it is somewhat arbitrary whether we
+  connect them as an "after" comment of the first or as a "before" comment
+  of the second node. And keep in mind that we have a third field that
+  can contain comments that are somewhere in the "middle" of a node, too.
+- We have `DP`s with negative offsets. Did I mention how much fun `DP`s are?
+  I have no idea where the above `-1` comes from.
+- The `annsDP` field may also contain the `DP`s of syntax that is somewhere
+  "in the middle" of a syntax node, e.g. the position of the `else` keyword.
+
+  We will discuss the semantics of `DP` further down below.
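+
+To make the three comment locations concrete, here is a small sketch that
+collects every comment attached to one annotation. The types are simplified,
+hypothetical versions of the `ghc-exactprint` ones (for example, `G` carries a
+plain `String` here and the last two record fields are omitted), so treat it
+as an illustration only.
+
+~~~~.hs
+-- Simplified stand-ins for the ghc-exactprint types.
+type DeltaPos = (Int, Int)
+
+newtype Comment = Comment String deriving (Eq, Show)
+
+data KeywordId
+  = G String            -- an ordinary keyword, e.g. G "AnnDo"
+  | AnnComment Comment  -- a comment somewhere "in the middle" of a node
+  deriving (Eq, Show)
+
+data Ann = Ann
+  { annEntryDelta        :: DeltaPos
+  , annPriorComments     :: [(Comment, DeltaPos)]
+  , annFollowingComments :: [(Comment, DeltaPos)]
+  , annsDP               :: [(KeywordId, DeltaPos)]
+  }
+
+-- All comments of one annotation: prior, then in-the-middle, then following.
+allComments :: Ann -> [(Comment, DeltaPos)]
+allComments ann =
+  annPriorComments ann
+    ++ [ (c, dp) | (AnnComment c, dp) <- annsDP ann ]
+    ++ annFollowingComments ann
+~~~~
+
+Applied to a model of the `HsModule` annotation from the dump above, this
+returns only the `--nice` comment, which lives in `annsDP` next to the
+`AnnEofPos` keyword.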
+
+## Data-Flow of a Comment When Round-Tripping
+
+Parsing with `ghc-exactprint` returns both a syntax tree and a map of
+annotations (`Map AnnKey Annotation`). Let's consider just the comment
+"-- very suspense" in the above example: The annotations map would contain
+the following mapping:
+
+~~~~
+AnnKey {stdin:3:3-19} (CN "BodyStmt")
+  -> Ann { annEntryDelta = DP (1,0)
+         , annPriorComments =
+             [((Comment "-- very suspense" stdin:2:18-33 Nothing),DP (0,1))]
+         , annFollowingComments = []
+         , annsDP = []
+         , annSortKey = Nothing
+         , annCapturedSpan = Nothing
+         }
+~~~~
+
+where the `AnnKey` is connected to the syntax node `BodyStmt` with the given
+source location.
+
+Brittany keeps the annotations map around, and the `BriDoc` structure contains
+nodes that have `AnnKey` values, i.e. the nested `BriDoc` document structure
+only contains references into the annotations map. The corresponding
+constructors of the `BriDoc(F)` type are:
+
+~~~~.hs
+data BriDoc
+  = ..
+  | BDAnnotationPrior AnnKey BriDoc
+  | BDAnnotationKW AnnKey (Maybe AnnKeywordId) BriDoc
+  | BDAnnotationRest  AnnKey BriDoc
+  | BDMoveToKWDP AnnKey AnnKeywordId Bool BriDoc -- True if should respect x offset
+  | ..
+~~~~
+
+When rendering a `BriDoc` to text, the semantics of the above nodes can be
+described roughly like this:
+- `render (BDAnnotationPrior annkey bd)` extracts the "before" type comments
+  under the given `annkey` from the annotations map (this is a stateful
+  process - they are really removed from the map). It renders these comments.
+  If we are in a new line, we respect the `annEntryDelta :: DeltaPos` value
+  to insert newlines. The "if in a new line" check prevents us from inserting
+  newlines in the case that brittany chose to transform a multi-line layout
+  into a single-line layout.
+
+  Then we recursively process `render bd`.
+- `render (BDAnnotationKW annkey mkwId bd)` similarly first renders the comments
+  extracted from the annotations map under the given `annkey` before calling
+  `render bd`. For example, this would allow us to print the comments _before_
+  the closing bracket `]` of an empty list literal e.g.
+  `[{-put numbers here to do X-}]`.
+- `render (BDMoveToKWDP annkey kwId xToo bd)` moves to the relative position of
+  the given keyword before continuing with `render bd`.
+  It is used for example to insert newlines before a `where` keyword to
+  match those of the original source code.
+- `render (BDAnnotationRest annkey bd)` first calls `render bd` and _then_
+  takes _any remaining comments_ it can find in the annotations map under the
+  given `annkey` and prints them.
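+
+The stateful part of this process can be sketched with a toy renderer. This is
+a rough model under several assumptions, not brittany's backend: all names are
+hypothetical, the annotations map is threaded explicitly instead of using a
+state monad, `DeltaPos` handling is left out entirely, and `restComments`
+lumps together every comment that is not a "prior" comment.
+
+~~~~.hs
+import qualified Data.Map.Strict as Map
+
+type AnnKey = String  -- hypothetical stand-in for the real AnnKey
+
+data Ann = Ann
+  { priorComments :: [String]
+  , restComments  :: [String]  -- everything that is not a "prior" comment
+  }
+
+type Anns = Map.Map AnnKey Ann
+
+data Doc
+  = Lit String
+  | Seq [Doc]
+  | AnnotationPrior AnnKey Doc
+  | AnnotationRest AnnKey Doc
+
+-- Remove and return the "prior" comments stored under a key.
+takePrior :: AnnKey -> Anns -> ([String], Anns)
+takePrior key anns = case Map.lookup key anns of
+  Nothing  -> ([], anns)
+  Just ann ->
+    (priorComments ann, Map.insert key (ann { priorComments = [] }) anns)
+
+-- Remove and return whatever comments are still stored under a key.
+takeRest :: AnnKey -> Anns -> ([String], Anns)
+takeRest key anns = case Map.lookup key anns of
+  Nothing  -> ([], anns)
+  Just ann -> (priorComments ann ++ restComments ann, Map.delete key anns)
+
+-- Render to a list of output pieces, returning the updated annotations.
+render :: Doc -> Anns -> ([String], Anns)
+render (Lit s)  anns = ([s], anns)
+render (Seq ds) anns = foldl step ([], anns) ds
+ where
+  step (out, m) d = let (out', m') = render d m in (out ++ out', m')
+render (AnnotationPrior key d) anns =
+  let (cs, anns')   = takePrior key anns
+      (out, anns'') = render d anns'
+  in  (cs ++ out, anns'')
+render (AnnotationRest key d) anns =
+  let (out, anns') = render d anns
+      (cs, anns'') = takeRest key anns'
+  in  (out ++ cs, anns'')
+
+-- Last resort: comments that no node referenced are appended at the end.
+renderAll :: Doc -> Anns -> [String]
+renderAll d anns =
+  let (out, leftover) = render d anns
+      unused a        = priorComments a ++ restComments a
+  in  out ++ concatMap unused (Map.elems leftover)
+~~~~
+
+Because comments are removed from the map when they are printed, each comment
+is emitted at most once. A document that never mentions a key still gets that
+key's comments appended by `renderAll`, so in this model they move to the end
+instead of disappearing.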
+
+### Some Notes to This Design
+
+- We heavily rely on the `ghc-exactprint` library and its types and
+  their semantics. We could define our own data structures to capture comments
+  and whitespace offsets. While this could allow us to make the later steps
+  of the process easier by more closely matching the information we need when
+  rendering a `BriDoc` document, it would involve a mildly complex
+  extra transformation step from `ghc-exactprint` annotations to hypothetical
+  `brittany` annotations.
+
+- For those cases where we rely on `ghc-exactprint` to output syntax that
+  `brittany` does not know yet, it is mandatory that we keep the annotations
+  around.
+
+- We make the rendering stateful in the annotations. The main advantage to
+  this is that we can keep track of any comments that have not yet been
+  reproduced in the output, and as a last resort append them at the end. The
+  effect of that is that comments "move" down in the document when brittany is
+  not exact, but at least it does not "eat" comments. The latter can still
+  happen though if we forget to include a given `AnnKey` at all in the `BriDoc`
+  document.
+
+  Of course this is a bit yucky, but it seems to be a sensible measure for
+  the long transitioning period where `brittany` is not perfect.
+
+- It may be surprising to nest things like we do in the `BriDoc` type.
+  The intuitive document representation for something like
+
+    ~~~~.hs
+    -- before
+    foo
+    -- after
+    ~~~~
+
+  might be
+
+    ~~~~
+    sequence [comment "-- before", text "foo", comment "-- after"]
+    ~~~~
+
+  but instead we use
+
+    ~~~~
+    BDAnnotationPrior annkey1 -- yields "-- before"
+      BDAnnotationRest annkey1 -- yields "-- after"
+        BDLit "foo"
+    ~~~~
+
+  which may seem unnecessarily nested. But this representation has certain
+  advantages, most importantly rewriting/restructuring the tree is
+  straightforward: consider how `BDAnnotationPrior annkey (BDSeq [a, b])` can
+  be transformed into `BDSeq [BDAnnotationPrior annkey a, b]`. You can do the
+  same transformation using the "flat" representation, but there are way more
+  cases to consider.
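+
+The rewrite mentioned in the last item can be written down in a few lines on a
+toy document type. This is a sketch with hypothetical names (`Doc`,
+`pushDown`), not one of brittany's actual transformations, and it only handles
+the "prior" case described above.
+
+~~~~.hs
+type AnnKey = String  -- hypothetical stand-in for the real AnnKey
+
+data Doc
+  = Lit String
+  | Seq [Doc]
+  | AnnotationPrior AnnKey Doc
+  | AnnotationRest AnnKey Doc
+  deriving (Eq, Show)
+
+-- Push "prior" annotations from a sequence onto its first element.
+pushDown :: Doc -> Doc
+pushDown (Seq ds)               = Seq (map pushDown ds)
+pushDown (AnnotationRest key d) = AnnotationRest key (pushDown d)
+pushDown (AnnotationPrior key d) = case pushDown d of
+  -- The annotation moves into the first element of a non-empty sequence.
+  Seq (x : xs) -> Seq (pushDown (AnnotationPrior key x) : xs)
+  -- Anything else keeps the wrapper where it was.
+  d'           -> AnnotationPrior key d'
+pushDown d = d
+~~~~
+
+Every case is a local pattern match on one wrapper node, which is the
+advantage of the nested representation claimed above.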
+
+## DeltaPosition semantics
+
+DeltaPositions (we'll just say `DP` which matches the constructor name for this
+type) are used to specify where to place comments and regular syntax (including
+keywords). This covers both newlines and indentation, and for indentation
+includes the case where indentation is mandatory ("layouting rule").
+
+Let us look at this example, which was constructed so that each comment
+contains its own DP:
+
+~~~~.hs
+do -- DP (0, 1)
+
+  -- DP (2, 2)   two newlines, two spaces indentation
+  abc
+  -- DP (1, 0)   one newline, zero indentation relative to the do-indentation
+  def
+~~~~
+
+The first comment is of the easy sort, because it occurs at the end of a
+non-empty line: There is no row offset, and the column offset matches the
+number of spaces (before the "--") after the last thing in the line.
+
+The second comment _does_ have a row offset: After the last comment, we have
+to insert two line-breaks, then apply the indentation (two spaces) and then
+insert the comment starting with "--". This is straight-forward so far.
+
+The third comment however highlights how DPs are affected by the layouting
+rule.
+
+### Caveat One: Indentation relative to layouting rule indentation level
+
+Following the first two cases, one would assume that the DP would be
+`(1, 2)`. However, for cases where the layouting rule applies
+(`do`, `let`, `where`) the indentation of the comments is expressed relative
+to the _current indentation_ according to the layouting rule. Unfortunately,
+this _current indentation_ is not known until the first construct after
+the keyword, so in the above example, the comment between the `do` and the
+first construct (`abc`) has an indentation relative to the enclosing
+indentation level (which is 0 for this example). This applies _even_ if the
+comment is connected to the first construct (i.e. if it is a "prior" comment
+of the `abc` syntax node).
+
+This applies not only to comments, but also to the DPs of all syntax nodes
+(including keywords).
+
+This also means that it is possible to have negative indentation. Consider
+this comment:
+
+~~~~.hs
+do
+  abc
+ -- fancy comment
+  def
+~~~~
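+
+The rule described so far can be condensed into a small function. This is a
+simplified model under the assumptions stated in this section, not the code
+brittany or `ghc-exactprint` use: `whitespaceFor` is a hypothetical name, and
+the caller is assumed to know which indentation level the `DP` is relative to.
+
+~~~~.hs
+newtype DeltaPos = DP (Int, Int) deriving (Eq, Show)
+
+-- Whitespace to emit before an item.
+--
+-- * Row offset 0: stay on the line; the column offset is a number of spaces.
+-- * Row offset > 0: emit that many newlines, then indent by the base
+--   indentation plus the (possibly negative) column offset.
+whitespaceFor :: Int -> DeltaPos -> String
+whitespaceFor baseIndent (DP (rows, cols))
+  | rows <= 0 = replicate cols ' '
+  | otherwise = replicate rows '\n' ++ replicate (baseIndent + cols) ' '
+~~~~
+
+In this model the third comment of the earlier example, `DP (1, 0)` relative
+to a `do` indentation of 2, produces one newline and two spaces. Assuming the
+"fancy comment" has a `DP` of `(1, -1)`, it would get one newline and a single
+space.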
+
+### Caveat Two: Caveat one applies to more than the layouting rule
+
+There are syntactic constructs, for example data declarations, where the
+layouting rule does not apply, but for the purpose of `DP` indentations
+`ghc-exactprint` pretends that it does. For example:
+
+~~~~.hs
+data Foo = Foo
+  { myInt :: Int
+    -- DP (1, -7)    relative to the `Foo` constructor (!)
+  }
+~~~~
+
+The layouting rule does _not apply in any way_ here. Still, we get a rather
+unexpected DP.
+
+### DeltaPositions of keywords and syntax nodes
+
+We have mostly talked about comments, but DPs exist and work for keywords
+and syntax nodes just like they do for comments.
+
+~~~~.hs
+func = x
+ 
+ where
+
+  x = 42
+~~~~
+
+Here, the `where` keyword has a DP of `(2, 1)` and the `x = 42` equation
+has a DP of `(2, 2)`. We make use of these DPs using the `BDMoveToKWDP` or the
+`BDAnnotationPrior` constructors of the `BriDoc` document. The former would be
+used for the `where` keyword, the latter would be applied to the equation
+document.
--- a/doc/implementation/index.md
+++ b/doc/implementation/index.md
@@ -18,6 +18,11 @@
  Specifying the semantics of the different (smart) constructors of the
  `BriDoc` type.

+- [exactprinting](exactprinting.md)
+
+  A closer look at how we achieve exactprinting, i.e. keeping comments and
+  certain whitespace (empty lines) as they appear in the input source code.
+
 - Brittany uses the following (randomly deemed noteworthy) libraries:

  - [`ghc-exactprint`](https://hackage.haskell.org/package/ghc-exactprint)
--- a/doc/implementation/theory.md
+++ b/doc/implementation/theory.md
@@ -18,7 +18,7 @@ will first consider

 Every haskell module can be written in a single line - of course, in most
 cases, an unwieldly long one. We humans prefer our lines limitted to some
-laughingly small limit like 80 or 160 or whatever. Further, we generally
+laughingly small limit like 80 or 160. Further, we generally
 prefer the indentation of our expressions(, statements etc.) line up with
 its syntactic structure. This preferences (and the layouting rule which
 already enforces it partially) still leaves a good amount of choice for
@@ -39,8 +39,9 @@ myList =
 ~~~~

 While consistency has the first priority, we also prefer short code: If it
-fits, we prefer the version/layout with less lines of code. So we wish to trade
-more lines for less columns, but only until things fit.
+fits, we prefer the version/layout with fewer lines of code. So coming from the
+everything-in-one-line version, we wish to trade more lines to achieve fewer
+columns, but stop immediately when everything fits into 80 columns.

 For simple cases we can give a trivial rule: If there is space
 for the one-line layout, use it; otherwise use the indented-multiline
@@ -108,15 +109,19 @@ not help at all in pruning the alternatives on a given layer. In the above
 `nestedCaseExpr` example, we might obtain a better solution by looking not
 at just the first, but the first n possible layouts, but against an exponential
 search-space, this does not scale: Just consider the possibility that there
-are exponentially many sub-solutions for layout 2) (replace "good" with some
-slightly more complex expression). You basically always end up with either
+are exponentially many sub-solutions for layout 2) (replace the literal "good"
+in the above example with some slightly more complex expression).
+You basically always end up with either
 "the current line is not yet full, try to fill it up" or
 "more than n columns used, abort".

 But a (pure) bottom-up approach does not work either: If we have no clue about
 the "current" indentation while layouting some node of our syntax tree,
 information about the (potential) size of (the layout of) child-nodes does
-not allow us to make good decisions.
+not allow us to make good decisions - if we have a choice between a layout
+that takes 40 and a layout that takes 60 columns, we _need_ to know whether
+the current indentation is bigger or smaller than 20, otherwise our result
+will be non-optimal in general.

 So we need information to flow bottom-to-top to allow for pruning whole trees
 of possible layouts, and top-to-bottom for making the actual decisions.. well,
@@ -140,6 +145,11 @@ if func x
 -- => 3 lines, 13 columns used
 ~~~~

+So internally, to the syntax node of this if-then-else expression we connect
+a label containing these two choices, and including the spacing information:
+`[(1, 32, someDoc1), (3, 13, someDoc2)]`, where the `someDoc`s are document
+representations that can reproduce the above two source code layouts.
+
 This is heavily simplified; in Brittany spacing information is (as usual) a
 bit more complex.

@@ -147,7 +157,8 @@ We restrict the size of these sets. Given the sets of spacings for the
 child-nodes in the syntax-tree, we generate a limited number of possible
 spacings in the current node. We then prune nodes that already violate desired
 properties, e.g. any spacing that already uses more columns locally than
-globally available.
+globally available - we would not have something like `(_, 90, _)` in the
+above list when our limit is 80 columns.
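+
+As a rough illustration of this pruning, and of how the remaining spacings
+could be used once the indentation is known, consider the following sketch.
+It relies on the heavily simplified `(lines, columns, doc)` triples from
+above; `prune` and `choose` are hypothetical names, and the selection rule
+(fewest lines among the alternatives that fit) is an assumption based on the
+stated preference for short code, not brittany's actual algorithm.
+
+~~~~.hs
+import Data.List (sortOn)
+
+-- (lines used, columns used, document); a hypothetical simplification.
+type Spacing doc = (Int, Int, doc)
+
+-- Bottom-up: drop alternatives that already exceed the column limit locally.
+prune :: Int -> [Spacing doc] -> [Spacing doc]
+prune limit = filter (\(_, cols, _) -> cols <= limit)
+
+-- Top-down: with the current indentation known, pick the alternative with
+-- the fewest lines among those that still fit.
+choose :: Int -> Int -> [Spacing doc] -> Maybe doc
+choose limit indent spacings =
+  case sortOn lineCount fitting of
+    (_, _, d) : _ -> Just d
+    []            -> Nothing
+ where
+  fitting = [ s | s@(_, cols, _) <- spacings, indent + cols <= limit ]
+  lineCount (ls, _, _) = ls
+~~~~
+
+With a limit of 80 and the list `[(1, 32, someDoc1), (3, 13, someDoc2)]`, an
+indentation of 0 selects `someDoc1`, while an indentation of 60 leaves only
+`someDoc2` (60 + 32 exceeds 80, 60 + 13 does not).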

 The second pass is top-down and uses the spacing-information to decide on one
 of the possible layouts for the current node. It passes the current