I have a dozen 300 million-row hyper files which I'm trying to union in Prep.
They only have about 8 columns and all are formatted with the same name.
I'm trying to union them but it's timing out.
If there a more efficient way to union large files such as these?
Is it best to keep them as Hyper files, or should I publish them as .CVS files on the Server?
Any thoughts?
Michael Hesser (Tableau Forum Ambassador)
If this response has answered your question, kindly click "Best Answer"
@Michael Hesser
I have never tested, but you may try with Hyper API. I think a good source could be this article:
https://engineering.tableau.com/using-the-hyper-api-to-union-hyper-files-3fd0904fd69b
(yeah it seems the page has its certificate expired):
from tableauhyperapi import HyperProcess, Connection, Telemetry, TableDefinition, TableName, SchemaName, Inserter, CreateMode
from glob import glob
from time import time
import os
input_files = glob("WorldIndicators_*.hyper")
table_name = TableName('Extract','Extract')
output_file = "WorldIndicatorsMerged.hyper"
# Let's delete the output file to make sure we can rerun this script
# even if the output already exists.
if os.path.exists(output_file):
os.remove(output_file)
# Start a new Hyper instance
start_time = time()
with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU, 'unionfiles_efficient') as hyper:
# Create a connection without any connected databases
with Connection(hyper.endpoint) as connection:
# Connect to all our input databases
for i, file in enumerate(input_files):
connection.catalog.attach_database(file, alias=f"input{i}")
print(f"{time() - start_time}: Attached all input databases...")
# Prepare the output database
connection.catalog.create_database(output_file)
connection.catalog.attach_database(output_file, alias="output")
print(table_name)
connection.catalog.create_schema(SchemaName("output", table_name.schema_name))
print(f"{time() - start_time}: Prepared output database")
# Build the CREATE TABLE AS command which unions all our inputs
union_query = ' UNION ALL\n'.join(
f'SELECT * FROM "input{i}".{table_name}' for i in range(len(input_files)))
create_table_sql = f'CREATE TABLE "output".{table_name} AS \n{union_query}'
# And execute it
connection.execute_command(create_table_sql)
print(f"{time() - start_time}: Done :)")
If this post resolves the question, would you be so kind to "Select as Best"?. This will help other users find the same answer/resolution and help community keep track of answered questions. Thank you.
Regards,
Diego Martinez
Tableau Visionary and Forums Ambassador