Add more high-level documentation: theory.md

2017-05-11 17:36:28 +02:00 · 2017-05-11 17:36:28 +02:00 · eac73c9a7b
parent 7775812cfd
commit eac73c9a7b
1 changed files with 313 additions and 0 deletions
--- a/docs/implementation/theory.md
+++ b/docs/implementation/theory.md
@ -0,0 +1,313 @@
+# Introduction
+
+[The readme](../../master/README.md) mentions a couple of goals for this
+project, including the following two:
+
+- Be clever about using the available horizontal space while not overflowing
+  it if it cannot be avoided;
+- Have linear complexity in the size of the input.
+
+These two goals stand in conflict, and this project chooses a solution
+to this that distinguishes it from (all) other existing formatters. This
+idea was the motivation to write Brittany in the first place, and it might
+even be applicable to more general-purposes structured-document formatters
+(although we will not expand on that here). Before explaining the idea, we
+will first consider
+
+## The problem
+
+Every haskell module can be written in a single line - of course, in most
+cases, an unwieldly long one. We humans prefer our lines limitted to some
+laughingly small limit like 80 or 160 or whatever. Further, we generally
+prefer the indentation of our expressions(, statements etc.) line up with
+its syntactic structure. This preferences (and the layouting rule which
+already enforces it partially) still leaves a good amount of choice for
+where exactly to insert newlines. For example, these are all consistent
+in their newline/indentation usage:
+
+~~~~.hs
+myRecord = MyRecord { abc = "abc", def = "def"}
+myRecord = MyRecord
+  { abc = "abc"
+  , def = "def"
+  }
+myList = [ "abc", "def"]
+myList =
+  [ "abc"
+  , "def"
+  ]
+~~~~
+
+While consistency has the first priority, we also prefer short code: If it
+fits, we prefer the version/layout with less lines of code. So we wish to trade
+more lines for less columns, but only until things fit.
+
+For simple cases we can give a trivial rule: If there is space
+for the one-line layout, use it; otherwise use the indented-multiline
+layout (this applies to `myRecord` example). The
+straight-forward approach works well: Calculate the length of the
+one-line layout for each node, then recurse top-down and make the choices
+according to that rule. Linear runtime with good results.
+
+Things get more interesting with nesting, and with structures with more than
+two alternatives. Consider the following four alternative layouts of the
+same expression:
+
+~~~~.hs
+--        10        20        30        40
+-- 1)                                   -- | lets assume the user wants
+nestedCaseExpr = case e1 of             -- | 40 columns max.
+  Left x -> if func x then "good" else "bad" -- too long
+-- 2)                                   -- 
+nestedCaseExpr = case e1 of             -- 
+  Left x -> if func x                   -- 
+    then "good"                         -- 
+    else "bad"                          -- 
+-- 3)                                   -- 
+nestedCaseExpr = case e1 of             -- 
+  Left x ->                             -- 
+    if func x then "good" else "bad"    -- 
+-- 4)                                   -- 
+nestedCaseExpr = case e1 of             -- 
+  Left x ->                             -- 
+    if func x                           -- 
+      then "good"                       -- 
+      else "bad"                        -- 
+~~~~
+
+- We nest a case-expression and an if-expression; for both we consider two
+  alternative layouts, meaning that
+- There are a total of four, and in general an exponential amount of
+  possible layouts;
+- With a top-down approach, one would choose 2), as "if func x" has space in
+  the current line. But..
+- Layout 3) is the optimal solution considering lines-of-code and the 40-column
+  limit.
+
+So our question: Is there an algorithm which, given some input syntax tree,
+returns the/an optimal (least lines-of-code while respecting max-column limit)
+layout (out of a potentially exponential number of valid layouts). Further,
+this algorithm's time/space usage should be linear in the size of the input
+(we might weaken this to some polynomial upper bound - but that should be the
+limit).
+
+If we pessimistically assume that such algorithm does not exist, we may ask
+alternatively: What linear algorithm returns solutions which,
+on average, are closest to the optimal solution?
+
+In the following we will describe _one_ such algorithm that seems to return
+the optimal solution in several non-trivial cases, and near-optimal solutions
+in many other. We won't try to prove that it is the best algorithm, but we will
+consider the circumstances for which a non-optimal solution is returned.
+
+## The Reasoning
+
+A top-down approach is so bad, because when there are exponentially many
+layouts to consider, there information passed down from the parents does
+not help at all in pruning the alternatives on a given layer. In the above
+`nestedCaseExpr` example, we might obtain a better solution by looking not
+at just the first, but the first n possible layouts, but against an exponential
+search-space, this does not scale: Just consider the possibility that there
+are exponentially many sub-solutions for layout 2) (replace "good" with some
+slightly more complex expression). You basically always end up with either
+"the current line is not yet full, try to fill it up" or
+"more than n columns used, abort".
+
+But a (pure) bottom-up approach does not work either: If we have no clue about
+the "current" indentation while layouting some node of our syntax tree,
+information about the (potential) size of (the layout of) child-nodes does
+not allow us to make good decisions.
+
+So we need information to flow bottom-to-top to allow for pruning whole trees
+of possible layouts, and top-to-bottom for making the actual decisions.. well,
+this can be arranged.
+
+# The Algorithm
+
+This algorithm works in two passes over the input syntax tree. The first pass
+is bottom-up and adds a label to each node. This label contains (a set of)
+"spacings" - certain information regarding the number of columns/lines needed
+for the textual representation of potential layouts of that node.
+For example we might return two possible spacings for the following (same)
+expression:
+
+~~~~.hs
+if func x then "good" else "bad"
+-- => 1 line, 32 columns used
+if func x
+  then "good"
+  else "bad"
+-- => 3 lines, 13 columns used
+~~~~
+
+This is heavily simplified; in Brittany spacing information is (as usual) a
+bit more complex.
+
+We restrict the size of these sets. Given the sets of spacings for the
+child-nodes in the syntax-tree, we generate a limited number of possible
+spacings in the current node. We then prune nodes that already violate desired
+properties, e.g. any spacing that already uses more columns locally than
+globally available.
+
+The second pass is top-down and uses the spacing-information to decide on one
+of the possible layouts for the current node. It passes the current
+"cursor position" (indentation, etc.) downwards, allowing to check that the
+layout fits (e.g. ensure "current indentation + columns used < max columns").
+
+This algorithm is trivially linear - two traversals and only linear space
+required per node of the input.
+
+### Consequences
+
+- when calculating spacings during bottom-up, several spacings are combined
+  into a single one via min/max/some other associative operation. Perhaps
+  consider how in a multi-line list layout the "columns used" will be derived
+  from the element spacings. For additional information embedded into the
+  spacings, we will need at least one such Semigroup instance.
+
+- We require an order on the alternatives for each syntactical construct.
+  The first alternative where we find some combination of spacings for the
+  children which is acceptable will be used.
+
+  In unlucky cases, all spacings might get pruned. In that case, we will
+  default to the last alternative, which therefor should be the most
+  "conservative" choice, i.e. the one that gives the child-nodes the most
+  space.
+
+  The first alternative should always be the "one-liner" layout, if it exists;
+  this alternative is preferable in general and will be filtered first should
+  it not fit.
+
+  In between those two extremes can be other choices that gradually trade
+  "columns used" for "lines used".
+
+- Sometimes there exist several (near) optimal solutions. E.g. when there exist
+  multiple nodes where we can trade lines for columns in such a way that the
+  whole result fits.
+
+  In such cases, Brittany makes the following choice: We prefer "trading"
+  in the node higher up in the syntax tree. The reasoning is that we rather
+  spend one line early on to gain additional horizontal space for _all_ the
+  children than the other way around. (In those cases where there are
+  alternatives to choose from, there are often several children.) We can apply
+  this to the `nestedCaseExpr` example above: The options are to either
+  put the right-hand-side of the case-alternative into a new line, or
+  split up the if-then-else. The "case" is the outer one, so Brittany will
+  prefer 3) over 2), which proves to be the right choice (here).
+
+  As a consequence, we are most interested in the maximum spacings; yet, we do
+  not have a total order because we cannot generally prefer one of two spacings
+  where the first uses more columns, the second more lines.
+
+- The number of syntactical constructs in haskell is large. If we were to
+  work directly on the syntax tree to do our traversals, we'd have to write two
+  (or even three - one to generate spacings, one to make the choices, one to
+  do the output and insert comments) different functions each respecting every
+  syntactical construct in haskell. It would be an incredible amount of code
+  (and work) times three.
+
+  Instead we can use some recursive data-type describing a structured document,
+  which abstracts of different syntactical constructs and only considers the
+  things relevant for layouting. This data-type is called `BriDoc`.
+
+- The `BriDoc` tree has an exponential number of nodes, but it is linear when
+  sharing is considered - the child-nodes can (and must) be re-used across
+  different alternatives. In the `nestedCaseExpr` example above, note how
+  there are four layouts, but essentially only two ways in which the "if" is
+  layouted. Either as a single line or with then/else on new lines. We can
+  handle spacings in such a way that we can share them for 1/3 and 2/4. This
+  already hints at how "columns used" will need to be redesigned slightly so
+  that 2/4 really have the same spacing label at the "if".
+
+### Concessions/non-Optimality
+
+- Prefering to trade in the higher node can give non-optimal results, e.g.
+  when there is only one child.
+
+- Spacings and the pruning of spacings happens in bottom-up fashion, meaning
+  that we might not prune a "70 columns used" spacing even though we later
+  notice that the current indentation already needs to be 20 and we only have
+  60 columns left. If all spacings found in one label have this property,
+  the top-down traversal might be forced to make the "conservative" choice,
+  trading more lines for less indentation to prevent potential (but unknown)
+  overflow during child-node layouting.
+
+- We can increase the limits arbitrarily (i.e. increase the constant factor
+  while remaining linear) to get better (optimal) results in more cases. We can
+  choose constant factors in such a way that we get optimal results for all
+  inputs up to a certain size. Of course calling them "constant factors" is
+  delusional if they grow exponentially - but it is still nice that this
+  heuristical algorithm can trivially be transformed into the perfect
+  (but exponential) algorithm.
+
+### Examples of non-Optimality and Border-Cases in General
+
+TODO
+
+# Practical details
+
+## Comments
+
+Comments don't affect semantics and thus are free-form; as the user we almost
+never want them reformatted in any way. The tempting approach is to keep them
+entirely separate: Don't layout comments, and don't let comments affect
+layout. However, this not possible always:
+
+~~~~.hs
+val = f -- useful comment here
+      x
+-- reformatted to
+val = f -- useful comment here x
+~~~~
+
+Does not work too well, even when the one-liner "f x" would certainly be the
+layout of choice.
+
+Brittany handles this in two ways:
+
+1) Some alternatives are pruned based on the existence of comments;
+2) Always insert newlines after end-of-line comments, and restore the
+   current indetation to prevent violation of the layouting rule:
+
+    ~~~~.hs
+    val = do
+      myAction -- useful comment here
+               x
+      g
+    ~~~~
+
+## CPP
+
+Conditional text-based source-code insertion is horrible.
+"#if"-guarded code in one line can easily affect in which way all other code
+is parsed. This makes it more or less impossible to layout code involving
+the preprocessor. One might be able to re-implement the preprocessor to then
+determine when things are safe to layout, but that is no fun, and Brittany
+does not bother.
+
+Brittany allows one thing: Working on the already pre-processed code, treating
+any disabled sections like comments. This is not safe in general, but is nice
+when the user makes only responsible usage of CPP (e.g.: only put "#if"-guards
+around module top-level constructs).
+
+## Horizontal alignment
+
+Sometimes horizontal alignment can make things more readable. However there
+are good reasons against using such whitespace: It can cause larger diffs
+on simple changes and it is rather subjective in which cases things are
+"more readable" rather than "annoyingly spaced".
+
+Nonetheless Brittany has a fully-featured implementation for horizontal
+alignment:
+
+~~~~.hs
+func (MyLongFoo abc def) = 1
+func (Bar       a   d  ) = 2
+func _                   = 3
+~~~~
+
+Again we have to ask the question: Does alignment affect the layouting choices,
+and does layouting affect alignment? The answer is clear in this case: No.
+First we make layouting-choices, then, independently, we add alignment but only
+in those cases where it does not cause overflows.
+