Parquet

| Input | Output | Alias |
|---|---|---|
| ✔ | ✔ | |

Description

Apache Parquet is a columnar storage format widespread in the Hadoop ecosystem. ClickHouse supports read and write operations for this format.

Data types matching

The table below shows how Parquet data types match ClickHouse data types.

| Parquet type (logical, converted, or physical) | ClickHouse data type |
|---|---|
| BOOLEAN | Bool |
| UINT_8 | UInt8 |
| INT_8 | Int8 |
| UINT_16 | UInt16 |
| INT_16 | Int16/Enum16 |
| UINT_32 | UInt32 |
| INT_32 | Int32 |
| UINT_64 | UInt64 |
| INT_64 | Int64 |
| DATE | Date32 |
| TIMESTAMP, TIME | DateTime64 |
| FLOAT | Float32 |
| DOUBLE | Float64 |
| INT96 | DateTime64(9, 'UTC') |
| BYTE_ARRAY, UTF8, ENUM, BSON | String |
| JSON | JSON |
| FIXED_LEN_BYTE_ARRAY | FixedString |
| DECIMAL | Decimal |
| LIST | Array |
| MAP | Map |
| struct | Tuple |
| FLOAT16 | Float32 |
| UUID | FixedString(16) |
| INTERVAL | FixedString(12) |

When writing a Parquet file, data types that have no matching Parquet type are converted to the nearest available type:

| ClickHouse data type | Parquet type |
|---|---|
| IPv4 | UINT_32 |
| IPv6 | FIXED_LEN_BYTE_ARRAY (16 bytes) |
| Date (16 bits) | DATE (32 bits) |
| DateTime (32 bits, seconds) | TIMESTAMP (64 bits, milliseconds) |
| Int128/UInt128/Int256/UInt256 | FIXED_LEN_BYTE_ARRAY (16/32 bytes, little-endian) |
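As a rough illustration of the conversions in the table above, the following sketch writes one value of each listed type to a Parquet file and then asks ClickHouse to describe that file. The file name types.parquet and the column aliases are made up for this example. It also assumes the statements run in clickhouse-local, so that INTO OUTFILE and file() resolve the same relative path.

```sql
-- Hypothetical file and column names; assumes clickhouse-local.
SELECT
    toIPv4('192.168.0.1')                    AS ip4,  -- written as UINT_32
    toIPv6('2001:db8::1')                    AS ip6,  -- written as FIXED_LEN_BYTE_ARRAY (16 bytes)
    toDate('2022-04-30')                     AS d,    -- written as DATE (32 bits)
    toDateTime('2022-04-30 12:00:00', 'UTC') AS dt,   -- written as TIMESTAMP (64 bits, milliseconds)
    toInt128(42)                             AS big   -- written as FIXED_LEN_BYTE_ARRAY (16 bytes, little-endian)
INTO OUTFILE 'types.parquet'
FORMAT Parquet;

-- Inspect the types that ClickHouse infers when reading the file back.
DESCRIBE file('types.parquet', Parquet);
```

Reading the file back without an explicit structure should follow the Parquet-to-ClickHouse table, so the inferred types may differ from the types that were written (for example, Date32 rather than Date).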

Arrays can be nested and can contain Nullable elements. Tuple and Map types can also be nested.

The data types of ClickHouse table columns can differ from those of the corresponding fields in the inserted Parquet data. When inserting data, ClickHouse interprets the Parquet types according to the type-matching table above and then casts the data to the type set for the ClickHouse table column. For example, a UINT_32 Parquet column can be read into an IPv4 ClickHouse column.
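The cast on insert can be sketched as follows. The file hosts.parquet, its single UINT_32 column ip, and the table hosts are hypothetical names chosen for illustration. Where file() looks for the path depends on whether the query runs in clickhouse-client (server-side path) or clickhouse-local (local path).

```sql
-- Hypothetical target table whose column type differs from the Parquet type.
CREATE TABLE hosts
(
    ip IPv4
)
ENGINE = MergeTree
ORDER BY ip;

-- The UINT_32 values are interpreted as UInt32 and then cast to IPv4.
INSERT INTO hosts FROM INFILE 'hosts.parquet' FORMAT Parquet;

-- The same cast can be requested without a table
-- by passing an explicit structure to file().
SELECT ip
FROM file('hosts.parquet', Parquet, 'ip IPv4')
LIMIT 5;
```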

For some Parquet types there's no closely matching ClickHouse type. We read them as follows:

  • TIME (time of day) is read as a timestamp. E.g. 10:23:13.000 becomes 1970-01-01 10:23:13.000.
  • TIMESTAMP/TIME with isAdjustedToUTC=false is a local wall-clock time (year, month, day, hour, minute, second and subsecond fields in a local time zone, regardless of which specific time zone is considered local), the same as SQL TIMESTAMP WITHOUT TIME ZONE. ClickHouse reads it as if it were a UTC timestamp instead. E.g. 2025-09-29 18:42:13.000 (representing a reading of a local wall clock) becomes 2025-09-29 18:42:13.000 (DateTime64(3, 'UTC') representing a point in time). If converted to String, it shows the correct year, month, day, hour, minute, second and subsecond, which can then be interpreted as being in some local time zone instead of UTC. Counterintuitively, changing the type from DateTime64(3, 'UTC') to DateTime64(3) would not help: both types represent a point in time rather than a clock reading, and DateTime64(3) would additionally be formatted, incorrectly, using the local time zone.
  • INTERVAL is currently read as FixedString(12) containing the raw binary representation of the time interval, as encoded in the Parquet file.
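The two queries below sketch one way to work with these values after reading. The files events.parquet and intervals.parquet, the columns ts and iv, and the time zone 'Europe/Berlin' are all hypothetical. The INTERVAL layout used here (three little-endian unsigned 32-bit integers holding months, days and milliseconds) comes from the Parquet specification, not from ClickHouse itself.

```sql
-- Local wall-clock timestamps: ts is read as DateTime64(3, 'UTC').
-- Its string form keeps the original clock fields, which can then be
-- parsed again as a time in a chosen zone.
SELECT
    ts,
    toString(ts)                                   AS wall_clock,
    toDateTime64(toString(ts), 3, 'Europe/Berlin') AS in_berlin
FROM file('events.parquet', Parquet);

-- INTERVAL: split the 12 raw bytes into three little-endian UInt32 fields.
SELECT
    reinterpretAsUInt32(substring(iv, 1, 4)) AS months,
    reinterpretAsUInt32(substring(iv, 5, 4)) AS days,
    reinterpretAsUInt32(substring(iv, 9, 4)) AS milliseconds
FROM file('intervals.parquet', Parquet, 'iv FixedString(12)');
```

The first query assumes millisecond precision. For other precisions, adjust the scale argument of toDateTime64 accordingly.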

Example usage

Inserting data

Using a Parquet file named football.parquet with the following data:

    ┌───────date─┬─season─┬─home_team─────────────┬─away_team───────────┬─home_team_goals─┬─away_team_goals─┐
 1. │ 2022-04-30 │   2021 │ Sutton United         │ Bradford City       │               1 │               4 │
 2. │ 2022-04-30 │   2021 │ Swindon Town          │ Barrow              │               2 │               1 │
 3. │ 2022-04-30 │   2021 │ Tranmere Rovers       │ Oldham Athletic     │               2 │               0 │
 4. │ 2022-05-02 │   2021 │ Port Vale             │ Newport County      │               1 │               2 │
 5. │ 2022-05-02 │   2021 │ Salford City          │ Mansfield Town      │               2 │               2 │
 6. │ 2022-05-07 │   2021 │ Barrow                │ Northampton Town    │               1 │               3 │
 7. │ 2022-05-07 │   2021 │ Bradford City         │ Carlisle United     │               2 │               0 │
 8. │ 2022-05-07 │   2021 │ Bristol Rovers        │ Scunthorpe United   │               7 │               0 │
 9. │ 2022-05-07 │   2021 │ Exeter City           │ Port Vale           │               0 │               1 │
10. │ 2022-05-07 │   2021 │ Harrogate Town A.F.C. │ Sutton United       │               0 │               2 │
11. │ 2022-05-07 │   2021 │ Hartlepool United     │ Colchester United   │               0 │               2 │
12. │ 2022-05-07 │   2021 │ Leyton Orient         │ Tranmere Rovers     │               0 │               1 │
13. │ 2022-05-07 │   2021 │ Mansfield Town        │ Forest Green Rovers │               2 │               2 │
14. │ 2022-05-07 │   2021 │ Newport County        │ Rochdale            │               0 │               2 │
15. │ 2022-05-07 │   2021 │ Oldham Athletic       │ Crawley Town        │               3 │               3 │
16. │ 2022-05-07 │   2021 │ Stevenage Borough     │ Salford City        │               4 │               2 │
17. │ 2022-05-07 │   2021 │ Walsall               │ Swindon Town        │               0 │               3 │
    └────────────┴────────┴───────────────────────┴─────────────────────┴─────────────────┴─────────────────┘

Insert the data:

INSERT INTO football FROM INFILE 'football.parquet' FORMAT Parquet;
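The INSERT statement above needs an existing football table. The source does not show its definition, so the schema below is only an assumption that fits the sample rows. The column types and the sorting key are illustrative and can be changed freely.

```sql
-- Assumed schema for the sample data; create it before running the INSERT.
CREATE TABLE football
(
    date            Date,
    season          Int16,
    home_team       String,
    away_team       String,
    home_team_goals Int8,
    away_team_goals Int8
)
ENGINE = MergeTree
ORDER BY (date, home_team);
```

Because the Parquet DATE column is interpreted as Date32 and then cast to the column type, declaring the column as Date is acceptable for dates within the Date range.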

Reading data

Read data using the Parquet format:

SELECT *
FROM football
INTO OUTFILE 'football.parquet'
FORMAT Parquet;

Tip

Parquet is a binary format and does not display in human-readable form in the terminal. Use the INTO OUTFILE clause to write Parquet files.
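Since the binary output cannot be inspected directly, one way to check a written file is to query it through the file() table function. This is a minimal sketch: it assumes football.parquet is reachable from where the query runs (the server's user files directory for clickhouse-client, or the working directory for clickhouse-local).

```sql
-- Read the Parquet file back and display a few rows as text.
SELECT home_team, away_team, home_team_goals, away_team_goals
FROM file('football.parquet', Parquet)
WHERE date = '2022-05-02'
ORDER BY home_team;
```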

To exchange data with Hadoop, you can use the HDFS table engine.

Format settings

| Setting | Description | Default |
|---|---|---|
| input_format_parquet_case_insensitive_column_matching | Ignore case when matching Parquet columns with ClickHouse columns. | 0 |
| input_format_parquet_preserve_order | Avoid reordering rows when reading from Parquet files. Usually makes reading much slower. | 0 |
| input_format_parquet_filter_push_down | When reading Parquet files, skip whole row groups based on the WHERE/PREWHERE expressions and min/max statistics in the Parquet metadata. | 1 |
| input_format_parquet_bloom_filter_push_down | When reading Parquet files, skip whole row groups based on the WHERE expressions and the bloom filter in the Parquet metadata. | 0 |
| input_format_parquet_use_native_reader | When reading Parquet files, use the native reader instead of the Arrow reader. | 0 |
| input_format_parquet_allow_missing_columns | Allow missing columns while reading the Parquet input format. | 1 |
| input_format_parquet_local_file_min_bytes_for_seek | Minimum number of bytes required for a local (file) read to seek instead of reading with ignore in the Parquet input format. | 8192 |
| input_format_parquet_enable_row_group_prefetch | Enable row group prefetching during Parquet parsing. Currently, only single-threaded parsing can prefetch. | 1 |
| input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference | Skip columns with unsupported types during schema inference for the Parquet format. | 0 |
| input_format_parquet_max_block_size | Maximum block size for the Parquet reader. | 65409 |
| input_format_parquet_prefer_block_bytes | Average block size in bytes output by the Parquet reader. | 16744704 |
| input_format_parquet_enable_json_parsing | When reading Parquet files, parse JSON columns as ClickHouse JSON columns. | 1 |
| output_format_parquet_row_group_size | Target row group size in rows. | 1000000 |
| output_format_parquet_row_group_size_bytes | Target row group size in bytes, before compression. | 536870912 |
| output_format_parquet_string_as_string | Use the Parquet String type instead of Binary for String columns. | 1 |
| output_format_parquet_fixed_string_as_fixed_byte_array | Use the Parquet FIXED_LEN_BYTE_ARRAY type instead of Binary for FixedString columns. | 1 |
| output_format_parquet_version | Parquet format version for the output format. Supported versions: 1.0, 2.4, 2.6 and 2.latest (default). | 2.latest |
| output_format_parquet_compression_method | Compression method for the Parquet output format. Supported codecs: snappy, lz4, brotli, zstd, gzip, none (uncompressed). | zstd |
| output_format_parquet_compliant_nested_types | In the Parquet file schema, use the name 'element' instead of 'item' for list elements. This is a historical artifact of the Arrow library implementation. Generally increases compatibility, except perhaps with some old versions of Arrow. | 1 |
| output_format_parquet_use_custom_encoder | Use a faster Parquet encoder implementation. | 1 |
| output_format_parquet_parallel_encoding | Do Parquet encoding in multiple threads. Requires output_format_parquet_use_custom_encoder. | 1 |
| output_format_parquet_data_page_size | Target page size in bytes, before compression. | 1048576 |
| output_format_parquet_batch_size | Check the page size every this many rows. Consider decreasing if you have columns with an average value size above a few KBs. | 1024 |
| output_format_parquet_write_page_index | Allow writing the page index into Parquet files. | 1 |
| input_format_parquet_import_nested | Obsolete setting, does nothing. | 0 |
| input_format_parquet_local_time_as_utc | Determines the data type used by schema inference for Parquet timestamps with isAdjustedToUTC=false. If true: DateTime64(..., 'UTC'), if false: DateTime64(...). Neither behavior is fully correct, as ClickHouse doesn't have a data type for local wall-clock time. Counterintuitively, 'true' is probably the less incorrect option, because formatting the 'UTC' timestamp as String will produce a representation of the correct local time. | true |
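The settings above can be applied per session with SET or per query with a SETTINGS clause. The sketch below reuses the football table and football.parquet from the examples. The output file name football_snappy.parquet and the chosen values are arbitrary illustrations, not recommendations.

```sql
-- Session level: later Parquet output in this session uses these values.
SET output_format_parquet_compression_method = 'snappy';
SET output_format_parquet_row_group_size = 100000;

SELECT *
FROM football
INTO OUTFILE 'football_snappy.parquet'
FORMAT Parquet;

-- Query level: the settings apply to this statement only.
SELECT count()
FROM file('football.parquet', Parquet)
WHERE season = 2021
SETTINGS
    input_format_parquet_filter_push_down = 1,
    input_format_parquet_case_insensitive_column_matching = 1;
```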