Design issues
Some decisions need to be made before we declare the API stable. We can collect here all the questions for discussion (we should answer them as soon as possible, since they impact the current implementation and delaying them would cause rework).
(A) About `rows.Table`
- A.1) What about laziness? Should `rows.Table` always be lazy? Never lazy? Support both? What are the implications? If it's lazy, how do we deal with deletion and addition of rows?
- A.2) How should we handle row filtering? What would be the best API? For example: we have a `rows.Table` with many rows but want to filter some of them. Should we provide a special method for this or use Python's built-in `filter`? Using the built-in `filter` would be the more Pythonic way, but a special method lets us optimize some operations on certain plugins (example: filtering on a MySQL-based `Table`). See the first sketch after this list.
- A.3) What if we want to filter everything during import? It's not a filter on a pre-existing `rows.Table` like in question A.2: it's a filter executed during the import process, so we only import some of the rows.
- A.4) We should provide an API to modify rows during iteration over the `Table`. The user could specify a custom function that receives a `Table.Row` object and returns a new one (which is then yielded when iterating over the `Table`). This way we can handle the addition of new fields and other custom operations on the fly. How should we expose this API? It may also solve the problem in question A.3 (see the first sketch after this list).
- A.5) The default row class is a `collections.namedtuple`. What is the best API to change it? Should the default be something else? If we want an object with read-write access and value access via attributes, an AttrDict-style class would be a good option (see the second sketch after this list). Should we add metadata to the row instance, like its index on that `Table`? See `sqlite3.Row` and other Python DBAPI implementations.
- A.6) The current `rows` architecture is good for importing and exporting data but is not well suited for working with that data. One key fact: we cannot create a `Table` from a CSV, change some rows' values and save it back to the same CSV without doing a batch operation. Should we implement read-write access? It can add a lot of complication to the implementation (not only to `Table` itself but to the plugins), since we'll need to deal with problems like seeking through the rows and saving/flushing partial data (not the entire set), among other issues.
- A.7) As many users will use `rows` to import-and-export data, it would be handy to have a shortcut (and maybe some optimizations) for that. If the entire `Table` is lazy we may not need this shortcut, because we can iterate over one `Table` (lazily) at the same time we're saving into another.
- A.8) Should we implement `__add__` (so that, for example, `sum([table1, table2, ..., tableN])` returns another `Table` with all the rows, but only if all tables' field types are the same)? What metadata should remain? See the third sketch after this list.
- A.9) Which other operations should be implemented? Join, intersect, ...?
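
For A.2-A.4, a minimal sketch of the alternatives being discussed. The `Row` namedtuple stands in for `Table.Row`, and the `table.filter`, `row_filter` and `row_transform` names (as well as the extra `import_from_csv` parameters) are hypothetical, not part of the current implementation:

```python
# Sketch only: Row stands in for Table.Row; the filter/transform hooks
# shown in the comments do not exist yet.
from collections import namedtuple

Row = namedtuple("Row", ["name", "population"])
table = [Row("a", 50), Row("b", 200), Row("c", 500)]  # stands in for a rows.Table

# A.2 with the built-in: works on any iterable of rows, but gives the
# plugin no chance to optimize (e.g. pushing the condition down to a
# MySQL WHERE clause).
big = [row for row in table if row.population > 100]

# A.2 with a dedicated, overridable method (hypothetical):
# big = table.filter(lambda row: row.population > 100)

# A.3/A.4 with hypothetical import-time hooks, so only the wanted rows
# are materialized and each one can be transformed while iterating:
# table = rows.import_from_csv(
#     "cities.csv",
#     row_filter=lambda row: row.population > 100,
#     row_transform=lambda row: row._replace(name=row.name.title()),
# )
```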
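
For A.5, a toy AttrDict-style row class; this exact class does not exist in `rows`, it only illustrates the read-write, attribute-access behaviour a non-namedtuple default could offer:

```python
# Toy illustration for A.5, not an existing rows class.
class AttrDict(dict):
    """dict with attribute-style read/write access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value


row = AttrDict(name="a", population=50)
row.population = 51   # read-write, unlike a namedtuple
row.index = 0         # room for per-row metadata, like its index on the Table
```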
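
For A.8, a sketch of the `__add__`/`__radd__` mechanics needed for `sum()` to work. The toy `Table` below is not the real implementation; the field checking and metadata merging are exactly the open questions:

```python
# Toy Table for A.8: only shows the sum() mechanics (sum() starts from 0,
# hence __radd__); field checking and metadata merging remain undecided.
class Table:
    def __init__(self, fields, rows=None):
        self.fields = fields            # e.g. {"name": TextField, ...}
        self._rows = list(rows or [])

    def __add__(self, other):
        if not isinstance(other, Table) or other.fields != self.fields:
            raise ValueError("Tables must have the same fields/types")
        return Table(self.fields, self._rows + other._rows)

    def __radd__(self, other):
        if other == 0:                  # sum([t1, t2, ...]) starts with 0
            return self
        return self.__add__(other)


t1 = Table({"name": str}, [("a",), ("b",)])
t2 = Table({"name": str}, [("c",)])
merged = sum([t1, t2])                  # Table with rows a, b, c
```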
(B) About `rows.fields`
- B.1) Should field instances (the values, actually) be native Python objects or custom objects (based on custom classes)? I'm inclined to use native Python objects (as implemented today). See the sketch below.
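
To make B.1 concrete, a toy field class in the spirit of the current approach (convert on import, keep values as native Python objects). The class below is illustrative only, not the actual `rows.fields` implementation:

```python
# Toy illustration for B.1, not the real rows.fields implementation.
import datetime


class DateField:
    @classmethod
    def deserialize(cls, value):
        # field classes only convert; the stored value stays native
        return datetime.datetime.strptime(value, "%Y-%m-%d").date()


value = DateField.deserialize("2015-06-01")
# value is a plain datetime.date -- the alternative would be returning a
# custom wrapper object carrying its own formatting/serialization logic.
```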
(C) About Plugins
- C.1) Should plugins implement classes instead of functions? These classes would inherit from `rows.Table` and implement only the methods needed to access the data (everything else would be handled by `rows.Table`). This way we can optimize operations like `__len__`, `__reversed__` and others. These magic methods would be implemented only on `rows.Table` and never overridden (the plugin class would provide a custom method that `rows.Table` calls for each operation) -- we need to specify these methods' API. See the first sketch at the end of this section.
- C.2) What should be the list of default plugins? Maybe: `text`, `json`, `csv`, `sqlite`.
- C.3) What should be the list of official plugins (available on PyPI, maintained by the rows team but not pre-installed by default)? Maybe: `xls`, `html`, `ods`. See GraphLab's connectors and tablib's supported extensions.
- C.4) How should we represent the table rows internally? `Table.__rows`? What can plugins do and not do with it? What is the expected behaviour?
- C.5) Should we add a `Table.meta` with metadata about that `Table`? For example: plugin data, if the `Table` was generated by a plugin (example: if the plugin is `csv`, it could have the actual CSV filename, encoding and so on). See the second sketch at the end of this section.
- C.6) If we are dealing with a huge amount of data, it would be nice to have callbacks and batch options (like the old MySQL plugin). How should the API be exposed?
- These links may help:
- https://pythonhosted.org/setuptools/setuptools.html#dynamic-discovery-of-services-and-plugins
- https://github.com/nose-devs/nose/blob/master/nose/plugins/manager.py#L368
- https://nose.readthedocs.org/en/latest/plugins/writing.html
- https://github.com/flavioamieiro/nose-ipdb/blob/master/ipdbplugin.py
- https://pytest.org/latest/plugins.html#setuptools-entry-points
- http://docs.openstack.org/developer/stevedore/
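
For C.1, a sketch of the class-based plugin idea: the base class keeps the magic methods and calls a small set of data-access hooks that each plugin implements. Every name below (`_row_count`, `_get_row`, `SqliteTable`) is hypothetical:

```python
# Hypothetical plugin-as-class sketch for C.1.
import sqlite3


class Table:
    # the base class owns the magic methods...
    def __len__(self):
        return self._row_count()

    def __getitem__(self, index):
        return self._get_row(index)

    # ...and plugins implement only the data-access hooks.
    def _row_count(self):
        raise NotImplementedError

    def _get_row(self, index):
        raise NotImplementedError


class SqliteTable(Table):
    """Delegates to the database instead of keeping rows in memory."""

    def __init__(self, connection, table_name):
        self.connection = connection
        self.table_name = table_name

    def _row_count(self):
        query = "SELECT COUNT(*) FROM {}".format(self.table_name)
        return self.connection.execute(query).fetchone()[0]

    def _get_row(self, index):
        query = "SELECT * FROM {} LIMIT 1 OFFSET ?".format(self.table_name)
        return self.connection.execute(query, (index,)).fetchone()


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
connection.executemany("INSERT INTO cities VALUES (?, ?)",
                       [("a", 50), ("b", 200)])
table = SqliteTable(connection, "cities")
print(len(table))    # 2 -- __len__ comes from the base class
print(table[0])      # ('a', 50) -- __getitem__ too
```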
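
For C.5 and C.6, a sketch of the shape the metadata and batch/callback options could take; none of these parameters or importer names exist today:

```python
# Hypothetical shapes for C.5 (Table.meta) and C.6 (batch/callback):
#
# table = rows.import_from_csv("cities.csv")
# table.meta
# # {"plugin": "csv", "filename": "cities.csv", "encoding": "utf-8"}
#
# def report(done, total):
#     print("{}/{} rows imported".format(done, total))
#
# table = rows.import_from_mysql(connection, "cities",
#                                batch_size=10000,  # fetch/insert in chunks
#                                callback=report)   # progress callback
```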
(D) About CLI
- D.1) Should we implement `--query` (to query using SQL -- same as import-and-filter)? See the sketch below.
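
For D.1, a rough sketch of what `--query` could do under the hood: load the data into an in-memory SQLite database, run the SQL there, and build the result from it. The `query` helper below is hypothetical, not part of the CLI:

```python
# Hypothetical mechanics behind a --query option (D.1).
import sqlite3


def query(create_sql, insert_sql, data, sql):
    connection = sqlite3.connect(":memory:")
    connection.execute(create_sql)
    connection.executemany(insert_sql, data)
    return connection.execute(sql).fetchall()


result = query(
    "CREATE TABLE cities (name TEXT, population INTEGER)",
    "INSERT INTO cities VALUES (?, ?)",
    [("a", 50), ("b", 200)],
    "SELECT name FROM cities WHERE population > 100",
)
# result == [("b",)] -- import and filter in a single step
```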
(E) Other
- E.1) How should we deal with `Table` collections? Examples: an XLS file has more than one sheet (each one a `rows.Table` itself); an HTML file could contain more than one `<table>`. See how tablib deals with this, and the sketch after this list.
- E.2) See `sqlite3`'s `detect_types`.
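
For E.1, one possible shape for table collections: an importer that returns a mapping of names to tables when the source has more than one sheet/`<table>`. All names and parameters below are hypothetical:

```python
# Hypothetical API shapes for E.1:
#
# tables = rows.import_tables_from_xls("report.xls")
# list(tables)           # ["Sheet1", "Sheet2"]
# tables["Sheet2"]       # a rows.Table
#
# versus selecting a single sheet/table at import time:
# table = rows.import_from_xls("report.xls", sheet_name="Sheet2")
# table = rows.import_from_html(html_source, table_index=1)
```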