Skip to main content

I have a dozen 300 million-row hyper files which I'm trying to union in Prep.

They only have about 8 columns and all are formatted with the same name.

I'm trying to union them but it's timing out.

 

If there a more efficient way to union large files such as these?

 

Is it best to keep them as Hyper files, or should I publish them as .CVS files on the Server?

 

Any thoughts?

Michael Hesser (Tableau Forum Ambassador)

If this response has answered your question, kindly click "Best Answer"

3 réponses
  1. 1 avr. 2025, 13:27

    @Michael Hesser​ 

    I have never tested, but you may try with Hyper API. I think a good source could be this article:

    https://engineering.tableau.com/using-the-hyper-api-to-union-hyper-files-3fd0904fd69b

    (yeah it seems the page has its certificate expired):

    from tableauhyperapi import HyperProcess, Connection, Telemetry, TableDefinition, TableName, SchemaName, Inserter, CreateMode

    from glob import glob

    from time import time

    import os

     

    input_files = glob("WorldIndicators_*.hyper")

    table_name = TableName('Extract','Extract')

    output_file = "WorldIndicatorsMerged.hyper"

     

    # Let's delete the output file to make sure we can rerun this script

    # even if the output already exists.

    if os.path.exists(output_file):

    os.remove(output_file)

     

    # Start a new Hyper instance

    start_time = time()

    with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU, 'unionfiles_efficient') as hyper:

    # Create a connection without any connected databases

    with Connection(hyper.endpoint) as connection:

    # Connect to all our input databases

    for i, file in enumerate(input_files):

    connection.catalog.attach_database(file, alias=f"input{i}")

    print(f"{time() - start_time}: Attached all input databases...")

    # Prepare the output database

    connection.catalog.create_database(output_file)

    connection.catalog.attach_database(output_file, alias="output")

    print(table_name)

    connection.catalog.create_schema(SchemaName("output", table_name.schema_name))

    print(f"{time() - start_time}: Prepared output database")

    # Build the CREATE TABLE AS command which unions all our inputs

    union_query = ' UNION ALL\n'.join(

    f'SELECT * FROM "input{i}".{table_name}' for i in range(len(input_files)))

    create_table_sql = f'CREATE TABLE "output".{table_name} AS \n{union_query}'

    # And execute it

    connection.execute_command(create_table_sql)

    print(f"{time() - start_time}: Done :)")

    If this post resolves the question, would you be so kind to "Select as Best"?. This will help other users find the same answer/resolution and help community keep track of answered questions. Thank you.

     

    Regards,

     

    Diego Martinez

    Tableau Visionary and Forums Ambassador

0/9000