Show the data!

Show the data!

Javascript implementation of a spreadsheet data presentation formatting.

Featured on Hashnode

Introduction

Probably these days is not necessary to teach people about importance that data has our world. It is often said that we live in the information age, and information is resolution of uncertainty, uncertainty usually makes us to feel uncomfortable.

So, the more data we have more comfortable we are, right?

Well, not always. So, why?

Because in a world where information is stored and processed by machines, we don't get to see it, and we can't unless we sit days, or years. Then we may try to communicate with a machine. Machines communicate to each other by means of binary signals, but we tend to communicate by our senses. Among those senses, hearing and vision are the most important for communication of objective information.

Writing was the first method we know to have been used to store information. It took a long time to standardize alphabets, spellings. The numbers as we know today were introduced to the occidental civilization in the 10th century. The "equals" sign was invented in 1557, and when it comes to keeping numeric data or dates or times there is wide range of preferences. Different situations may compel the use of different representation.

The digital age

The digital gave the ability to store information in an objective way. You have a device that can store bits, any information be represented using bits, now encodings must be standardized; which bits represents which letters? bits that can represent colors, and eventually an image. Images are large, let's compress them! Machines are so fast now that we could actually play audio! But that's so large, let's compress them! What if we put images and audio together? We invented the multimedia word (or world?).

So we have the computer as this interesting machine that lives in a world of arrays, and can process them very quickly. But how to make it more friendly to humans? Well both sides will have to adapt. Software tools had to evolve providing appealing features and humans had to learn how to use the available tools.

Nowadays any person minimally acquainted with computers will know how a spreadsheet works. And we can look back and appreciate how this impacted the generation before us

It is undeniable that microsoft excel was very influential and we are admittedly influenced by its formatting, shown in 2:42. Excel introduced a loose language to describe formats. It is not a formal language, and it seems most of it was developed in an unstructured way, and it's documented based on examples not the definition of the actual behavior. But we still have reasons to use it.

  • Interoperability: Excel data format is widely used if someone wants to load or export data data that stores formatting information in the excel notation when presenting the data we should be aware of how to present.
  • Usability: Many users are already acquainted with the excel format notation and it is nice if they can use what they already know in new software.
  • Meets universal goals: We have alternatives for data formatting most of them differ only in the way strings are escaped or how different things are denoted, e.g. is month represented by %B, or mmmm

The particular case of date

As happened with letters that had to be associated with an arbitrary sequence of bits thus dates had to be. But since numbers are some of the most primitive types on the computer history and dates is a form to specify a point in time number can help, in other words serial date representation. But if you want to tell dates, you have to stablish a reference point and a time unit, that's what calendars do. Excel choice was to count days since 1st of January of 1900, javascript counts seconds since first of January of 1970.

The reality about date and time is not as smooth as described above, how different time zones relate, some adjustments to the calendars.

Leap years introducing 29th of February is an example that is quite regular and predictable, it happens in every year that is multiple of four and not multiple of 100, or multiple of 400. And even for those it is easier to get it wrong, excel itself incorrectly assumes the year 1900 was a leap year*.

There are situations where the clock time changes arbitrarily by political decisions, the most common is the daylight-saving time, and excel does not handle that. Actually excel does not handle time zones either. This could be a limitation for some applications, but it makes it less confusing, giving what common sense would expect.

The universal time coordinate was is a standard time, JavaScript Date.UTC the Unix time given a UTC time, and as such it does not handle leap seconds.

Leap seconds are introduced to avoid limit the difference between UTC time and the solar time measurement to at most 0.9 second, and every semester International Earth Rotation and Reference Systems Service (IERS) announces if it will happen in the period. Since it is based on empirical measurements it is impossible to program a function to correctly predict future dates and time intervals. So, even though we have the ability to calculate differences between timestamps, they will deviate a few seconds from the actual elapsed time.

Existing candidates on NPM

formatter, actually exports a functor for string interpolation, the only format option is the fixed width.

d3-time-format is a great solution if you can use the standard C format description format (I like that term)

moment support many operations with date time, and has a nice formatting support, with their own format description format, e.g. moment().format('h AM/PM') would give 11 AM10/P10 instead of 11 AM, because "A" will be replaced by "AM" or "PM", "M" will be replaced by the month (10). And they provide some other nice features such as formatting days as ordinal e.g. moment().format("Do [of] MMMM") gives 2nd of October. This package is extremely popular, currently with 20 million downloads per month, and almost 60k other modules depending on it. And the package itself is relatively large.

numeral Shows natural number formats in multiple languages, however it does not mixing text with number parts, numeral(12345678).format('0,,\\M') gives 12,345,678, while ub excel TEXT(12345678, "0,,\M") would give 12M, one could get 12m with the format 0a, but if the value is changed to 12345678432, it will produce 12b while excel gives 12345M. Again, the module is good but only if you don't care about getting a format that is compatible with Excel.

d3-format supports the python format, again you cannot specify masks.

excel-style-dataformatter Is very good, and small. Cover many corner cases, have multilingual support, but misses some corner cases, and does not fully support conditional formatting

const DataFormatter = require('excel-style-dataformatter');
const dataFormatter = new DataFormatter();
function TEXT(value, format){
  const type = typeof value === 'number'? 'Number': 'DateTime';
  return dataFormatter.format(value, type, format).value;
}
// It works if some explicit condition is met
console.log(TEXT(1, '[<=1]"small";"large"'))
// But fails otherwise
console.log(TEXT(10, '[<=1]"small";"large"'))
// gives 2497974068071313/2500000000000000
console.log(TEXT(1233/1234, "#/###"))
// Gives "text", excel would give "-text"
console.log(TEXT(-1, '"text"'))
// Gives "1E+123" instead of "1E1-2/3 ignoring the mask for the exponent
console.log(TEXT(1e123, "0E-0-0\\/0"))
// Gives "3E+" istead of "3E+0"
console.log(TEXT(3, "0E+##"))
// Gives "123.00E+2", but the 3 leading zero indicates that it should use 
// An exponent multiple of 3, excel gives 012.30E+3
console.log(TEXT(12300, "000.00E+0"))
// Interpret m as Month where it should be interpreted as minute.
console.log(TEXT(new Date(), "s m"))

Some of those packages could be arguably better than excel in some respects, but not good if compatibility is what you want.

Our approach

As the examples show, every single feature may have many different behaviors, and these these features can combined in different ways as well. Getting a fully compatible formatting code is maybe beyond what is reached with reasonable effort, and even not necessary, mature applications such as OpenOffice Calc or Google Sheets have their own differences. I wouldn't be surprised one find differences between different versions of excel as well. But the goal is to reduce the differences the maximum we can and let the package open to evolve.

Self-criticism

A joke about standardization that I always tell:

When we started our research we noticed that there are n standards, so we created a standard that covers everything (now there are n+1 standards).

The aim of this package an implementation that can reproduce as closely as possible the excel behavior.

  • We know we are not yet there, but we have a test suite (good enough to reveal bugs in existing libraries).
  • We are using a parser generation tool to write code emphasizing readability and maintainability.
  • Well defined format parsing output enables developers to more easily write custom and efficient renderers.

Parsing (understanding) format

Parsing is not an easy task. An extensive literature was dedicated to it over the years, different algorithms to parse different classes of parsers. The grammars are ranked accordingly to some properties that implies difficulty to parse in the Chomsky hyerarchy.

Parsing with regular expressions

It is very common to find implementations of number formatters using string substitution, for instance in excel-style-dataformatter. This approach requires multiple passes and have to use some state variables to determine what to use. Javascript natively supports regular expressions, a few features in the formatting are not possible to implement using only string substitution, e.g. and AM/PM string changes the meaning of preceding h symbols, but multiple h symbols could appear mixed with other symbols.

Parsing with a context-free language

The context free language has the advantage that at any time a token is seen it is possible to determine what to do with it. They can be parsed in time linear with the input size. If you are designing a language as a context free language you do the language design and the parser implementation at the same time and you can take decisions about the language features for the convenience of your parser. However, if you want to parse a language defined by examples this may be very difficult. This is the case for natural language, natural language understanding has taken off only recently thanks to machine learning with models that uses billion of parameters.

As the scale of a machine learning model increases the performance across tasks improves, while also unlocking new capabilities As the scale of a machine learning model increases the performance across tasks improves, while also unlocking new capabilities. Credit: Google AI, April 2022

Using Parsing Expression Grammar (PEG)

PEG parser generators were popularized by David Majda with pegjs, who transferred the maintenance and nowadays seems to be abandoned by the new maintainer*. But the package is still available on NPM and there it was forked as a peggy.

In addition to being theoretically more powerful than context-free, the PEG formalism is much more intuitive and does not introduce ambiguity. The only downside is that you have to make sure to add rules before any rules that matches a prefix of it.

In my experience peggy source is much easier to maintain, and also easier to read and explain, feel free to have a look at the parser code

Separating parsing from rendering

As mentioned above, some approaches analyze the format string and produces the output in a single pass. As far as performance is concerned this approach is limiting, because the format string has to be processed every time a new output is desired. For screen rendering this may not be an issue because only a few hundreds of cells should be visible, even a very slow parser would do that. But if you are copying 100 million rows as text it can take a long time to format all of them. This gives a few reasons to decouple format specification parsing from rendering.

  • Optimizing code usually takes a lot of energy and can make the code a lot less readable and difficult to maintain. If rendering and parsing are implemented together optimizations to the renderization will obfuscate the logic of the parser.

  • Having an intermediate format makes it easier to support different format specification formats, or maybe more importantly for us to implement different renderers, from a basic text renderer, or a html renderer, or a graphic renderer, without having to touch the parser.

  • The parsing code can run once per format and the rendering code once per cell, typically, no matter how large is a table it uses only a few distinct formats, so the overhead of parsing is roughly constant, so we can relax the optimizations and focus on intelligibility of the parser.

For the more rigorous developers, I have created a type declarations for the parser output

Performance

Running a quick benchmark we see that the reference TextRenderer implementation provided with @datadocs/rose-formatter@1.0.0 is slightly faster than the default excel-style-dataformatter formatting.

The parser will be run when new rose.TextRenderer(format) is called, the result is saved in formatCache and for subsequent calls only the renderization .formatNumber(value) is executed.

const BenchTable = require('benchtable')
const numbersSuite = new BenchTable('Number formatting', {isTransposed: true});
const xl1 = require('excel-style-dataformatter');
const rose = require('@datadocs/rose-formatter');
const xlf = new xl1();
const formatCache = {}
numbersSuite
  .addFunction("excel-style-dataformatter", (value, format) => {
    xlf.format(value, 'Number', format);
  })
  .addFunction("rose-format render", (value, format) => {
    return (formatCache[format] || 
             (formatCache[format] = new rose.TextRenderer(format))
        ).formatNumber(value);
  })
  .addInput('Integer mask', [2398472398, '00-00-00\\.00.00\\/000\\/000'])
  .addInput('Decimal digits', [Math.PI, '0000.0000000000'])
  .addInput('Fractions', [Math.PI, '# #/###'])
  .addInput('Thousand separators', [12345678903, '0,'])
  .addInput('Scientific notation', [12345678903, '000E+0'])
  .addInput('Percentage', [12345678903, '0%%%'])
  .on('cycle', event => {
    console.log(event.target.toString())
  })
  .on('complete', () => {
    console.log(numbersSuite.table.toString());
  })
  .run({async: false})
excel-style-dataformatter rose-format render
Integer mask 261,803 ops/sec 689,686 ops/sec
Decimal digits 505,814 ops/sec 837,683 ops/sec
Fractions 407,169 ops/sec 793,907 ops/sec
Thousand separators 1,308,233 ops/sec 1,631,703 ops/sec
Scientific notation 757,928 ops/sec 1,164,294 ops/sec
Percentage 1,160,640 ops/sec 1,313,979 ops/sec
yyyy-mm-dd 120,363 ops/sec 1,257,242 ops/sec
hh:mm:ss 106,129 ops/sec 1,254,540 ops/sec
[hh]:mm:ss 178,210 ops/sec 1,384,785 ops/sec

Examples in action

Future work

@datadocs/rose-formatter@1.0.0](npmjs.com/package/@datadocs/rose-formatter/..) focus in English. Since the beginning the goal was to use it as a part of a web application that could load spreadsheets with hundreds of millions of rows. For such application WebAssembly code as a database function is available.

Here we demonstrated text formatting. Improvements on style formatting such as determining colors, alignment, and padding the _ rule (that depends on the cell width). Are relatively easy changes.

We foresee in the future Internationalization and Localization, for that contributors are needed.

Feel free to open a pull request or report an issue if there is something you would like to see in the package.