Design issues
Some decisions need to be made before we declare the API stable. We can collect here all the questions for discussion (we should answer them as soon as possible, since they impact the current implementation and delaying them would cause rework).
(A) About `rows.Table`
- A.1) What about laziness? Should `rows.Table` always be lazy? Never lazy? Support both? What are the implications? If it's lazy, how do we deal with deletion and addition of rows?
- A.2) How should we handle row filtering? What would be the best API? For example: we have a `rows.Table` with many rows but want to filter some of them. Should we provide a special method for this or use Python's built-in `filter`? Using the built-in `filter` would be the more Pythonic way, but a special method lets us optimize some operations on certain plugins (example: filtering on a MySQL-based `Table`). See the first sketch after this list.
- A.3) What if we want to filter everything during import? It's not a filter on a pre-existing `rows.Table` like in question A.2: it's a filter executed during the import process, so we only import some of the rows.
- A.4) We should provide an API to modify rows during iteration over the `Table`. The user could specify a custom function that receives a `Table.Row` object and returns a new one (which is then yielded when iterating over the `Table`). This way we can handle the addition of new fields and other custom operations on the fly. How should we expose this API? It may also solve the problem in question A.3 (see the first sketch after this list).
- A.5) The default row class is a `collections.namedtuple`. What is the best API to change it? Should the default be something else? If we want an object with read-write access and value access via attributes, an AttrDict-style class would be a good option (see the second sketch after this list). Should we add metadata to the row instance, like its index on that `Table`? See `sqlite3.Row` and other Python DBAPI implementations.
- A.6) The current `rows` architecture is good for importing and exporting data but is not well suited for working with that data. One key fact: we cannot create a `Table` from a CSV, change some rows' values and save it back to the same CSV without doing a batch operation. Should we implement read-write access? It can add a lot of complication to the implementation (not only to `Table` itself but to the plugins), since we'll need to deal with problems like seeking through the rows and saving/flushing partial data (not the entire set), among other issues.
- A.7) As many users will use `rows` to import-and-export data, it would be handy to have a shortcut (and maybe some optimizations) for that. If the entire `Table` is lazy we may not need this shortcut, because we can iterate over one `Table` (lazily) at the same time we're saving into another.
- A.8) Should we implement `__add__` (so that, for example, `sum([table1, table2, ..., tableN])` returns another `Table` with all the rows, but only if all tables' field types are the same)? What metadata should remain? See the third sketch after this list.
- A.9) Which other operations should be implemented? Join, intersect, ...?
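
For A.2-A.4, a minimal sketch of the alternatives being discussed. The `Row` namedtuple stands in for `Table.Row`, and the `table.filter`, `row_filter` and `row_transform` names (as well as the extra `import_from_csv` parameters) are hypothetical, not part of the current implementation:

```python
# Sketch only: Row stands in for Table.Row; the filter/transform hooks
# shown in the comments do not exist yet.
from collections import namedtuple

Row = namedtuple("Row", ["name", "population"])
table = [Row("a", 50), Row("b", 200), Row("c", 500)]  # stands in for a rows.Table

# A.2 with the built-in: works on any iterable of rows, but gives the
# plugin no chance to optimize (e.g. pushing the condition down to a
# MySQL WHERE clause).
big = [row for row in table if row.population > 100]

# A.2 with a dedicated, overridable method (hypothetical):
# big = table.filter(lambda row: row.population > 100)

# A.3/A.4 with hypothetical import-time hooks, so only the wanted rows
# are materialized and each one can be transformed while iterating:
# table = rows.import_from_csv(
#     "cities.csv",
#     row_filter=lambda row: row.population > 100,
#     row_transform=lambda row: row._replace(name=row.name.title()),
# )
```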
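
For A.5, a toy AttrDict-style row class; this exact class does not exist in `rows`, it only illustrates the read-write, attribute-access behaviour a non-namedtuple default could offer:

```python
# Toy illustration for A.5, not an existing rows class.
class AttrDict(dict):
    """dict with attribute-style read/write access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value


row = AttrDict(name="a", population=50)
row.population = 51   # read-write, unlike a namedtuple
row.index = 0         # room for per-row metadata, like its index on the Table
```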
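
For A.8, a sketch of the `__add__`/`__radd__` mechanics needed for `sum()` to work. The toy `Table` below is not the real implementation; the field checking and metadata merging are exactly the open questions:

```python
# Toy Table for A.8: only shows the sum() mechanics (sum() starts from 0,
# hence __radd__); field checking and metadata merging remain undecided.
class Table:
    def __init__(self, fields, rows=None):
        self.fields = fields            # e.g. {"name": TextField, ...}
        self._rows = list(rows or [])

    def __add__(self, other):
        if not isinstance(other, Table) or other.fields != self.fields:
            raise ValueError("Tables must have the same fields/types")
        return Table(self.fields, self._rows + other._rows)

    def __radd__(self, other):
        if other == 0:                  # sum([t1, t2, ...]) starts with 0
            return self
        return self.__add__(other)


t1 = Table({"name": str}, [("a",), ("b",)])
t2 = Table({"name": str}, [("c",)])
merged = sum([t1, t2])                  # Table with rows a, b, c
```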
(B) About `rows.fields`
- B.1) Should field instances (the values, actually) be native Python objects or custom objects (based on custom classes)? I'm inclined to use native Python objects (as implemented today). See the sketch below.
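
To make B.1 concrete, a toy field class in the spirit of the current approach (convert on import, keep values as native Python objects). The class below is illustrative only, not the actual `rows.fields` implementation:

```python
# Toy illustration for B.1, not the real rows.fields implementation.
import datetime


class DateField:
    @classmethod
    def deserialize(cls, value):
        # field classes only convert; the stored value stays native
        return datetime.datetime.strptime(value, "%Y-%m-%d").date()


value = DateField.deserialize("2015-06-01")
# value is a plain datetime.date -- the alternative would be returning a
# custom wrapper object carrying its own formatting/serialization logic.
```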
(C) About Plugins
- C.1) Should plugins implement classes instead of functions? These classes would inherit from `rows.Table` and implement only the methods needed to access the data (everything else would be handled by `rows.Table`). This way we can optimize operations like `__len__`, `__reversed__` and others. These magic methods would be implemented only on `rows.Table` and never overridden (the plugin class would provide a custom method that `rows.Table` calls for each operation) -- we need to specify these methods' API. See the first sketch at the end of this section.
- C.2) What should be the list of default plugins? Maybe: `text`, `json`, `csv`, `sqlite`.
- C.3) What should be the list of official plugins (available on PyPI, maintained by the rows team but not pre-installed by default)? Maybe: `xls`, `html`, `ods`. See GraphLab's connectors and tablib's supported extensions.
- C.4) How should we represent the table rows internally? `Table.__rows`? What can plugins do and not do with it? What is the expected behaviour?
- C.5) Should we add a `Table.meta` with metadata about that `Table`? For example: plugin data, if the `Table` was generated by a plugin (example: if the plugin is `csv`, it could have the actual CSV filename, encoding and so on). See the second sketch at the end of this section.
- C.6) If we are dealing with a huge amount of data, it would be nice to have callbacks and batch options (like the old MySQL plugin). How should the API be exposed?
- These links may help:
- https://pythonhosted.org/setuptools/setuptools.html#dynamic-discovery-of-services-and-plugins
- https://github.com/nose-devs/nose/blob/master/nose/plugins/manager.py#L368
- https://nose.readthedocs.org/en/latest/plugins/writing.html
- https://github.com/flavioamieiro/nose-ipdb/blob/master/ipdbplugin.py
- https://pytest.org/latest/plugins.html#setuptools-entry-points
- http://docs.openstack.org/developer/stevedore/
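
For C.1, a sketch of the class-based plugin idea: the base class keeps the magic methods and calls a small set of data-access hooks that each plugin implements. Every name below (`_row_count`, `_get_row`, `SqliteTable`) is hypothetical:

```python
# Hypothetical plugin-as-class sketch for C.1.
import sqlite3


class Table:
    # the base class owns the magic methods...
    def __len__(self):
        return self._row_count()

    def __getitem__(self, index):
        return self._get_row(index)

    # ...and plugins implement only the data-access hooks.
    def _row_count(self):
        raise NotImplementedError

    def _get_row(self, index):
        raise NotImplementedError


class SqliteTable(Table):
    """Delegates to the database instead of keeping rows in memory."""

    def __init__(self, connection, table_name):
        self.connection = connection
        self.table_name = table_name

    def _row_count(self):
        query = "SELECT COUNT(*) FROM {}".format(self.table_name)
        return self.connection.execute(query).fetchone()[0]

    def _get_row(self, index):
        query = "SELECT * FROM {} LIMIT 1 OFFSET ?".format(self.table_name)
        return self.connection.execute(query, (index,)).fetchone()


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
connection.executemany("INSERT INTO cities VALUES (?, ?)",
                       [("a", 50), ("b", 200)])
table = SqliteTable(connection, "cities")
print(len(table))    # 2 -- __len__ comes from the base class
print(table[0])      # ('a', 50) -- __getitem__ too
```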
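
For C.5 and C.6, a sketch of the shape the metadata and batch/callback options could take; none of these parameters or importer names exist today:

```python
# Hypothetical shapes for C.5 (Table.meta) and C.6 (batch/callback):
#
# table = rows.import_from_csv("cities.csv")
# table.meta
# # {"plugin": "csv", "filename": "cities.csv", "encoding": "utf-8"}
#
# def report(done, total):
#     print("{}/{} rows imported".format(done, total))
#
# table = rows.import_from_mysql(connection, "cities",
#                                batch_size=10000,  # fetch/insert in chunks
#                                callback=report)   # progress callback
```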
(D) About CLI
- D.1) Should we implement `--query` (to query using SQL -- same as import-and-filter)? See the sketch below.
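
For D.1, a rough sketch of what `--query` could do under the hood: load the data into an in-memory SQLite database, run the SQL there, and build the result from it. The `query` helper below is hypothetical, not part of the CLI:

```python
# Hypothetical mechanics behind a --query option (D.1).
import sqlite3


def query(create_sql, insert_sql, data, sql):
    connection = sqlite3.connect(":memory:")
    connection.execute(create_sql)
    connection.executemany(insert_sql, data)
    return connection.execute(sql).fetchall()


result = query(
    "CREATE TABLE cities (name TEXT, population INTEGER)",
    "INSERT INTO cities VALUES (?, ?)",
    [("a", 50), ("b", 200)],
    "SELECT name FROM cities WHERE population > 100",
)
# result == [("b",)] -- import and filter in a single step
```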
(E) Other
- E.1) How should we deal with `Table` collections? Examples: an XLS file has more than one sheet (each one a `rows.Table` itself); an HTML file could contain more than one `<table>`. See how tablib deals with this, and the sketch after this list.
- E.2) See `sqlite3`'s `detect_types`.
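
For E.1, one possible shape for table collections: an importer that returns a mapping of names to tables when the source has more than one sheet/`<table>`. All names and parameters below are hypothetical:

```python
# Hypothetical API shapes for E.1:
#
# tables = rows.import_tables_from_xls("report.xls")
# list(tables)           # ["Sheet1", "Sheet2"]
# tables["Sheet2"]       # a rows.Table
#
# versus selecting a single sheet/table at import time:
# table = rows.import_from_xls("report.xls", sheet_name="Sheet2")
# table = rows.import_from_html(html_source, table_index=1)
```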