# Sources

In Li, a "source" is a JavaScript class that provides lists of URLs that should be crawled, and methods to scrape the data from those pages.

This guide provides information on the criteria we use to determine whether a source should be added to the project, and offers details on how a source can be implemented.

## Examples

An annotated sample source is in `docs/sample-sources/sample.js`. It will give you a rough idea of the shape of a source, and how it is used.

Live sources in `src/shared/sources/` serve as examples as well.

## Criteria for sources

See [Source criteria](./source-criteria.md) to determine if a source should be included in this project.

## Writing a source

As shown in the samples, a `source` has `crawlers` which pull down data files (json, csv, html, pdf, tsv, raw), and `scrapers` which scrape those files and return useful data. Sources can pull in data for anything -- cities, counties, states, countries, or collections thereof. See the existing scrapers for ideas on how to deal with different ways of data being presented.

Copy the template in `docs/sample-sources/template.md` to a new file in the correct country and region directory (e.g., `src/shared/sources/us/ca/mycounty-name.js`). That file contains some fields that you should fill in or delete, depending on the details of the source. Also see the comments in `docs/sample-sources/sample.js`, and below.

### Crawling

At the moment, we support the following crawl types: `page`, `headless`, `csv`, `tsv`, `pdf`, `json`, and `raw`. A central controller executes the source to crawl the provided URLs and cache the data. You only need to supply the `url` and `type`, plus a `name` if there are multiple URLs to crawl.
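As a rough illustration, crawl entries might look like the list below. This is a sketch, not the project's actual schema: the `crawlEntries` array, the `checkCrawlEntries` helper, and the URLs are all hypothetical, and only the `url`, `type`, and `name` attributes come from this guide. See `docs/sample-sources/sample.js` for the real shape.

```javascript
// Crawl types listed in this guide.
const SUPPORTED_TYPES = ['page', 'headless', 'csv', 'tsv', 'pdf', 'json', 'raw']

// Hypothetical crawl entries; the URLs and names are placeholders.
const crawlEntries = [
  { type: 'page', name: 'cases', url: 'https://example.com/cases.html' },
  { type: 'csv', name: 'tests', url: 'https://example.com/tests.csv' },
]

// Hypothetical helper: returns a list of problems (empty when all is well).
function checkCrawlEntries(entries) {
  const problems = []
  const seenNames = new Set()
  entries.forEach((entry, index) => {
    if (!entry.url) problems.push(`entry ${index}: missing url`)
    if (!SUPPORTED_TYPES.includes(entry.type)) {
      problems.push(`entry ${index}: unsupported type "${entry.type}"`)
    }
    // A name is only needed to tell several URLs apart.
    if (entries.length > 1) {
      if (!entry.name) {
        problems.push(`entry ${index}: missing name`)
      } else if (seenNames.has(entry.name)) {
        problems.push(`entry ${index}: duplicate name "${entry.name}"`)
      }
      seenNames.add(entry.name)
    }
  })
  return problems
}

console.log(checkCrawlEntries(crawlEntries))
```

A check like this is optional; it only mirrors the rule that each entry needs a `url` and `type`, and a distinct `name` when there is more than one URL.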

### Scraping

Scrapers are functions associated with the `scrape` attribute on `scrapers` in the `source`. You may implement more than one scraper if the source changes its formatting (see [What to do if a scraper breaks?](#what-to-do-if-a-scraper-breaks)).

Your scraper should return an object, an array of objects, or `null`.

#### The returned scrape results object

The object may have the following attributes:

```javascript
result = {
  // [ISO 3166-1 alpha-3 country code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) [required]
  country: 'iso1:xxx',

  // The state, province, or region (not required if defined on scraper object)
  state: 'iso2:xxx',

  // The county or parish (not required if defined on scraper object)
  county: 'xxx',

  // The city name (not required if defined on scraper object)
  city: 'xxx',

  // Total number of cases
  cases: 42,

  // Total number of deaths
  deaths: 42,

  // Total number of hospitalized
  hospitalized: 42,

  // Total number of discharged
  discharged: 42,

  // Total number recovered
  recovered: 42,

  // Total number tested
  tested: 42,

  // GeoJSON feature associated with the location (See [Features and population data](#features-and-population-data))
  feature: 'xxx',

  // Additional identifiers to aid with feature matching (See [Features and population data](#features-and-population-data))
  featureId: 'xxx',

  // The estimated population of the location (See [Features and population data](#features-and-population-data))
  population: 42,

  // Array of coordinates as `[longitude, latitude]` (See [Features and population data](#features-and-population-data))
  coordinates: 'xxx',
}
```

Note: the data fields actually exported during report generation are listed in `src/shared/constants/case-data-fields.js`.

#### Returning an array

Returning an array of objects is useful for aggregate sources, that is, sources that provide information for more than one geographical area. For example, [Canada](https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection.html?topic=tilelink) provides information for all provinces of the country. If the scraper returns an array, each object in the array will have the attributes specified in the source object added to it, meaning you only need to specify the fields that change per location (`county`, `cases`, `deaths` for example).
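A minimal sketch of an array-returning scraper follows. The `rows` input, its column names, and the `scrapeProvinces` function are assumptions made for illustration, and the numbers are placeholders rather than real data. How cached data actually reaches your scraper is shown in `docs/sample-sources/sample.js`.

```javascript
// Hypothetical parsed rows, standing in for whatever the crawler cached.
const rows = [
  { Province: 'Province A', Cases: '120', Deaths: '3' },
  { Province: 'Province B', Cases: '95', Deaths: '2' },
]

// Returns one result object per row. Only the fields that vary by location
// are set here; attributes defined on the source are not repeated.
function scrapeProvinces(rows) {
  return rows.map(row => ({
    state: row.Province,
    cases: parseInt(row.Cases, 10),
    deaths: parseInt(row.Deaths, 10),
  }))
}

const results = scrapeProvinces(rows)
console.log(results)
```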

#### Returning `null`

Return `null` when no data is available. This could be the case if the source has not provided an update for today, or if we are fetching historical information for which we have no cached data.
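The guard can be as simple as the sketch below. The `scrapeTotals` function and the shape of its `rows` argument are hypothetical; the only point is the early `return null` when there is nothing to report.

```javascript
// Hypothetical input: parsed rows from a cached file, or undefined when
// nothing was cached for the requested date.
function scrapeTotals(rows) {
  if (!Array.isArray(rows) || rows.length === 0) {
    return null // nothing to report for this date
  }
  const cases = rows.reduce((sum, row) => sum + row.cases, 0)
  return { cases }
}

console.log(scrapeTotals(undefined))
console.log(scrapeTotals([{ cases: 2 }, { cases: 3 }]))
```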

## Making sure your scraper doesn't break

It's a tough challenge to write scrapers that will work when websites are inevitably updated. Here are some tips:

- If your source is an HTML table, validate its structure: check that table headers contain the expected text, that columns exist, and so on. For example, if you say `result.deaths` is the value stored in column 2, but the source has changed column 2 from "Deaths" to "Cases", your scrape will complete successfully, but the data won't be correct.

- If data for a field is not present (e.g., no recovered information), **do not put 0 for that field**. Leave the field undefined so the scraper knows there is no information for that particular field.

- Write your scraper so it handles aggregate data with a single scraper entry (e.g., find a table, process the table).

- Try not to hardcode county or city names; instead, let the data on the page populate them.

- Make your scraper less brittle by avoiding generated class names (e.g., those produced by CSS modules).

- When targeting elements, don't assume the order will stay the same (e.g., if there are multiple `.count` elements, don't assume the second one is deaths; verify it by parsing the label).
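The first two tips can be sketched as follows. All names here (`assertHeader`, `parseCount`, `scrapeTable`) are hypothetical helpers written for this example, and the table is assumed to have already been extracted into plain arrays of header and cell text.

```javascript
// Throws if the header at `index` does not contain the expected label.
function assertHeader(headers, index, expected) {
  const actual = String(headers[index] || '').trim().toLowerCase()
  if (!actual.includes(expected.toLowerCase())) {
    throw new Error(`Column ${index}: expected "${expected}", found "${actual}"`)
  }
}

// Returns an integer, or undefined when the cell holds no usable number.
function parseCount(text) {
  const cleaned = String(text === undefined || text === null ? '' : text)
    .replace(/,/g, '')
    .trim()
  return /^\d+$/.test(cleaned) ? parseInt(cleaned, 10) : undefined
}

function scrapeTable(headers, rows) {
  // Fail loudly if the site has reordered or renamed its columns.
  assertHeader(headers, 0, 'county')
  assertHeader(headers, 1, 'cases')
  assertHeader(headers, 2, 'deaths')

  return rows.map(cells => {
    const result = { county: cells[0].trim() }
    const cases = parseCount(cells[1])
    const deaths = parseCount(cells[2])
    // Only set fields that have data; never substitute 0 for a missing value.
    if (cases !== undefined) result.cases = cases
    if (deaths !== undefined) result.deaths = deaths
    return result
  })
}

console.log(scrapeTable(['County', 'Cases', 'Deaths'], [['xxx', '1,234', '']]))
```

Throwing on an unexpected header makes a silent column swap show up as a failed scrape rather than as wrong data.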


### What to do if a scraper breaks?

Source scrapers need to be able to operate correctly on old data, so updates to scrapers must be backwards compatible. If you know the date the site broke, you can have two implementations (or more) of a scraper in the same function, based on date. Most sources in `src/shared/sources` deal with such cases.
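One way to keep a scraper backwards compatible is to branch on the date being scraped. The sketch below assumes the scraper receives the date as an ISO `YYYY-MM-DD` string; the cutover date, the function names, and the column positions are hypothetical. Check the live sources for how dates are actually passed in.

```javascript
// Hypothetical date on which the site changed its layout.
const FORMAT_CHANGE_DATE = '2020-04-01'

// Before the change, cases were in the second cell.
function scrapeOldLayout(cells) {
  return { cases: parseInt(cells[1], 10) }
}

// After the change, cases moved to the third cell.
function scrapeNewLayout(cells) {
  return { cases: parseInt(cells[2], 10) }
}

function scrape(cells, date) {
  // ISO dates (YYYY-MM-DD) sort correctly when compared as strings.
  return date < FORMAT_CHANGE_DATE ? scrapeOldLayout(cells) : scrapeNewLayout(cells)
}

console.log(scrape(['xxx', '5', '9'], '2020-03-15'))
console.log(scrape(['xxx', '5', '9'], '2020-04-01'))
```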

## Features and population data

We strive to provide a GeoJSON feature and population number for every location in our dataset. When adding a source for a country, we may already have this information and can populate it automatically. For smaller regional entities, this information may not be available and has to be added manually.

### Features

Features can be specified in four ways: through the `country`, `state` and `county` fields, by matching the `longitude` and `latitude` to a particular feature, through the `featureId` field, or through the `feature` field.

While the first two methods work most of the time, sometimes you will have to rely on `featureId` to help the crawler make the correct guess. `featureId` is an object that specifies one or more of the attributes below:

- `name`
- `adm1_code`
- `iso_a2`
- `iso_3166_2`
- `code_hasc`
- `postal`
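For illustration, a result carrying a `featureId` might look like the sketch below. The values are placeholders in the `'xxx'` style used elsewhere in this guide, and `unknownFeatureIdKeys` is a hypothetical helper, not part of the project.

```javascript
// Attribute names listed above.
const FEATURE_ID_KEYS = ['name', 'adm1_code', 'iso_a2', 'iso_3166_2', 'code_hasc', 'postal']

// Placeholder result with a featureId that uses two of those attributes.
const result = {
  county: 'xxx',
  cases: 42,
  featureId: { name: 'xxx', postal: 'xxx' },
}

// Hypothetical helper: lists any featureId keys that are not in the list above.
function unknownFeatureIdKeys(featureId) {
  return Object.keys(featureId).filter(key => !FEATURE_ID_KEYS.includes(key))
}

console.log(unknownFeatureIdKeys(result.featureId))
```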

If we do not have any geographical information for the location you are trying to scrape, you can return a GeoJSON feature directly in the `feature` attribute of the scraper's result.

If we have a feature for the location, we will calculate a `longitude` and `latitude`. You may also specify a custom longitude and latitude by specifying a value in the `coordinates` attribute.
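A result that supplies its own feature and coordinates could look like the sketch below. The geometry is a placeholder square, not a real boundary, and the `properties` content is an assumption; only the `feature` and `coordinates` attribute names come from this guide.

```javascript
// Placeholder GeoJSON feature: a unit square, not a real boundary.
const feature = {
  type: 'Feature',
  properties: { name: 'xxx' },
  geometry: {
    type: 'Polygon',
    // One linear ring of [longitude, latitude] positions; first and last match.
    coordinates: [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
  },
}

const result = {
  county: 'xxx',
  cases: 42,
  feature,
  // Custom point given as [longitude, latitude].
  coordinates: [0.5, 0.5],
}

console.log(JSON.stringify(result, null, 2))
```

Note the order: longitude comes first, then latitude, both in the `coordinates` attribute and inside the GeoJSON geometry.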

### Population

Population can usually be guessed automatically. If it cannot, you can provide a population number in the `population` field of the object your scraper returns.

## Testing sources

You should test your source first by running `npm run test`. This will perform some basic tests to make sure nothing crashes and the source object is in the correct form.

### Test coverage

We run scrapes periodically for every cached file, for every source.

If you change a source, it will be exercised when you run `npm run test:integration`. See [Testing](./testing.md) for more information.