Read data from a text file and record it as a parquet file on HDFS using Spark (Part 1)

Tram Ho

The text format is hugely popular, on HDFS and everywhere else. Data in a text file is laid out in lines; each line can be treated as a record and is terminated by "\n" (a newline character). Text files are lightweight, but they are slow to read and write, and compressed text files generally cannot be split.

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. Instead of storing data in contiguous rows, Parquet stores data in adjacent columns, so the data is partitioned both horizontally and vertically. Apache Parquet overcomes the disadvantages of text files: it reduces I/O time and compresses better thanks to the column-organized data.


In this article I will cover reading data from a text file and writing the data that was read into a Parquet file on HDFS using Spark. We write to HDFS, not the local disk, so before following along you must start HDFS first.
Generally speaking, the surrounding setup is long, but everything related to Spark is extremely simple and fast.

Read data from text file using Spark

First of all, we have a dataset of text files, sample_text, which is the set of .dat files in this directory. Data is laid out in lines, and each line lists the properties of one object, separated by the tab character "\t". To know which attributes correspond to which positions, note that the log.txt model file describes them as follows:

From this file we can see that the first attribute on each line is timeCreate, followed by cookieCreate, and so on; the data types of the attributes are defined below them (for example, timeCreate and cookieCreate are of type Date).
From this information we can immediately create a ModelLog class with the above properties to hold the data from these text files, as follows:
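The original code listing did not survive here, so below is a minimal sketch of what such a class might look like. Only timeCreate and cookieCreate (both Date) are named in the text; the domain field is a hypothetical placeholder for the remaining attributes defined in log.txt. The class implements Serializable because Spark ships objects between workers.

```java
import java.io.Serializable;
import java.util.Date;

// Plain Java model holding one parsed log line.
// timeCreate and cookieCreate come from the article's description;
// "domain" is a hypothetical stand-in for the other attributes.
public class ModelLog implements Serializable {
    private Date timeCreate;
    private Date cookieCreate;
    private String domain; // hypothetical field

    public Date getTimeCreate() { return timeCreate; }
    public void setTimeCreate(Date timeCreate) { this.timeCreate = timeCreate; }

    public Date getCookieCreate() { return cookieCreate; }
    public void setCookieCreate(Date cookieCreate) { this.cookieCreate = cookieCreate; }

    public String getDomain() { return domain; }
    public void setDomain(String domain) { this.domain = domain; }
}
```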

To read the data from the text file above, we declare a JavaSparkContext, traverse each line to parse it into an object, and store the objects in a listModelLog, as follows:
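The original snippet is missing, so here is a sketch of that read step under stated assumptions: the HDFS URL (hdfs://localhost:9000/sample_text/*.dat) and the local[*] master are placeholders you would replace with your own cluster settings, and the parsing here just splits on tabs rather than building the full model object.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.List;

public class ReadTextFile {
    public static void main(String[] args) {
        // Hypothetical app name and master; adjust for your cluster.
        SparkConf conf = new SparkConf().setAppName("ReadTextFile").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read every line of every .dat file under the HDFS directory.
        // The path is an assumption; point it at your own sample_text location.
        JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/sample_text/*.dat");

        // Split each tab-separated line into its raw fields and collect them.
        List<String[]> records = lines.map(line -> line.split("\t", -1)).collect();

        System.out.println("Parsed " + records.size() + " records");
        sc.close();
    }
}
```

In the article, the map step calls a splitLine helper that turns each line into a ModelLog instead of a raw field array.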

The splitLine function extracts the properties of an object from a line, separated by the tab character "\t":
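The helper itself was lost in extraction; a minimal self-contained sketch is below. It returns the raw tab-separated fields, which in the article are then copied into a ModelLog; the method and class names match the text, but the exact signature is an assumption.

```java
public class LineParser {
    // Splits a tab-separated log line into its raw fields.
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    public static String[] splitLine(String line) {
        return line.split("\t", -1);
    }
}
```

Each element of the returned array corresponds, in order, to one attribute declared in the log.txt model file (timeCreate first, then cookieCreate, and so on).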


So after running, listModelLog holds all the data from the text file. The article is already getting long, so I will stop here; writing the data to a Parquet file using Spark will be covered in the next part. See the next section HERE.



Source: Viblo