roll commented May 27, 2020
- fixes [dataflows] revisit loading multiple sheets with one load step frictionlessdata/datapackage-pipelines#188
|
It's my second take on this issue. The first attempt was #110. The new one uses I think
|
I will take a look!
|
Hey @roll - Wouldn't it be better if there was a standard way to open a 'container' kind of file? For example, an Excel file or Google Spreadsheet with multiple sheets, or a ZIP file with multiple files.

This implementation basically re-opens the Excel file for each of the sheets, reads a sample, infers a schema, and then checks to see if the sheet name adheres to the

I'm thinking we could have a generic class similar to

    >>> container = tabulator.Container('path/to/excel/file.xlsx')
    # OR
    >>> container = tabulator.Container('path/to/archive/file.zip')
    >>> for item in container.iter():
    ...     print(item.name)  # could be sheet name or filename in zipfile
    ...     print(item.options)  # dict of options to be used in Stream, e.g. {"sheet": 1} or {"compression": "zip"} etc.
    ...     stream = item.stream(**other_options)  # Returns a Stream object

then you could do:

    FILENAME = 'path/to/excel/file.xlsx'
    Flow(*[
        load(FILENAME, headers=1, name='res-%d' % idx, **item.options)
        for idx, item in enumerate(Container(FILENAME).iter())
    ]).process()
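
To make the proposal concrete, here is a minimal sketch of the container idea for the ZIP case only, using the standard library. `ZipContainer`, `ContainerItem`, and the option keys are hypothetical names for illustration; nothing like this is confirmed to exist in tabulator, and the Excel case would need a spreadsheet library on top.

```python
import zipfile
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class ContainerItem:
    """One part of a container, e.g. a file inside a ZIP archive (hypothetical)."""
    name: str
    options: Dict[str, Any] = field(default_factory=dict)


class ZipContainer:
    """Hypothetical container that opens the archive once and lists its parts."""

    def __init__(self, path: str) -> None:
        self.path = path

    def iter(self) -> Iterator[ContainerItem]:
        # Open the archive a single time instead of once per member.
        with zipfile.ZipFile(self.path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # skip directory entries
                # Option keys are illustrative, not a real tabulator contract.
                yield ContainerItem(
                    name=info.filename,
                    options={"compression": "zip", "filename": info.filename},
                )
```

Each yielded item carries just enough information to build one load step, which is the point of the proposal: the container is opened once and the per-part options are handed to the caller.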
|
@akariv Aside from implementation, what do you think the best API will be for DPP? For
|
To answer the DPP question - off the top of my head.
Internally it will iterate on the different parts (using the Example
The resource name can be a slug of the filenames/sheet names of the different parts combined (we can complicate it later).
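
The naming idea could look roughly like the sketch below. `slugify` and `combined_resource_name` are hypothetical helpers written for this illustration, not functions from dataflows or DPP, and the exact slug rules are an assumption.

```python
import re
from typing import Iterable


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def combined_resource_name(part_names: Iterable[str]) -> str:
    """Join the slugs of all part names into one resource name."""
    return "-".join(slug for slug in map(slugify, part_names) if slug)
```

With many parts the combined name grows quickly, which is presumably where the "complicate it later" step would come in.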
|
@akariv
|
Our custom load processor already has something relatively akin to "load_batch": it can take a comma-separated list of URLs or a regular-expression URL (for local or S3 paths), and then it just generates a load step for each resulting URL. But if your implementation will improve the load times for multiple sheets within an xlsx file, I am happy to switch over to your implementation. 👍
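
For readers unfamiliar with that pattern, the expansion step might be sketched as follows. This is not the custom processor described above; `expand_sources`, its `is_regex` flag, and the `available` listing (standing in for a local directory or S3 listing) are all assumptions made for illustration.

```python
import re
from typing import List, Sequence


def expand_sources(
    spec: str, available: Sequence[str] = (), is_regex: bool = False
) -> List[str]:
    """Turn one source spec into the list of URLs to load."""
    if is_regex:
        # Keep only the known URLs that match the whole pattern.
        pattern = re.compile(spec)
        return [url for url in available if pattern.fullmatch(url)]
    # Otherwise treat the spec as a comma-separated list of URLs.
    return [url.strip() for url in spec.split(",") if url.strip()]
```

Each URL in the result would then become its own `load` step, in the same way the `Flow(*[...])` comprehension earlier in the thread builds one step per sheet.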