Experience working with Big Data

Have you ever felt overwhelmed when working with Big Data? Have you ever had to sit for hours just transferring data from one server to another in order to test your algorithm? Or grown tired of watching a script run for hours, only to discover in the end that you were wrong? I think everyone who has just started working with Big Data has had these feelings. In this article, I would like to share a bit of my experience on how to work with Big Data more effectively, whether individually or as a team.

Plan well

Big Data decision

When starting a Big Data project, the project manager faces many decisions, such as choosing a database system that suits the problem and a programming language that is easy to develop in and maintain later. Without much experience, making mistakes is inevitable. But the bottom line remains: a good plan will save the project, and its members, later on.

For example, in companies that specialize in processing log and time-series data, it is normal for gigabytes of data to be generated every hour. The problem has two parts. First, how to quickly store data coming from multiple sources in one place without losing information. Second, how to quickly aggregate data from minutes up to hours in order to perform data analysis.

At this point, the project manager faces two options for storing and quickly querying this information:

  1. Using MongoDB and Java: the team already has experience with these two technologies, so development can start immediately.
  2. Using Kafka and Spark: newer technologies that meet the needs of the problem, but they require ramp-up time, since the team has little experience developing with them.

Finally, prioritizing shipping the product quickly and relying confidently on the team's existing experience, the project manager decided to run the project on MongoDB. This is a decision that only meets current needs; in the long run it will cause many problems and troubles that can lead to the failure of the entire project.

Beyond the advantages of a NoSQL system in ensuring availability, MongoDB has several shortcomings that make it unsuitable for real-time analysis on Big Data:

  • The MapReduce (MR) mechanism is slow and difficult to distribute across multiple servers. Although the Aggregation framework replaced MapReduce to improve aggregation speed, it still has many unresolved memory-management errors (see the pipeline sketch after this list).
  • Upgraded versions are not backward compatible with previous ones, which makes upgrading the source code difficult: many functions are removed or replaced, so code is hard to reuse.
  • The database has no schema or integrity constraints, so the data fills up with duplicate and inaccurate information. Running data analysis on invalid data like this leads to unacceptable errors.
  • It is hard to switch to another technology. Once the system has been running and evolving for a long time, it cannot be discarded and rebuilt from scratch. This makes long-term maintenance insecure and the system difficult to inherit.
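
To make the aggregation point concrete, here is a minimal sketch (not from the original project) of a pymongo pipeline that rolls minute-level records up to hourly averages. The "metrics" collection and its "ts"/"value" fields are assumptions for illustration; note the allowDiskUse flag, which exists precisely because of the memory limits mentioned above.

```python
# A minimal sketch, assuming a local MongoDB instance and a hypothetical
# "metrics" collection with "ts" (datetime) and "value" (float) fields.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["logs"]["metrics"]

# Roll minute-level records up to hourly averages.
pipeline = [
    {"$group": {
        "_id": {
            "year": {"$year": "$ts"},
            "month": {"$month": "$ts"},
            "day": {"$dayOfMonth": "$ts"},
            "hour": {"$hour": "$ts"},
        },
        "avg_value": {"$avg": "$value"},
        "count": {"$sum": 1},
    }},
    {"$sort": {"_id": 1}},
]

# allowDiskUse lets stages spill to disk when they hit the memory limit.
for doc in coll.aggregate(pipeline, allowDiskUse=True):
    print(doc)
```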

Therefore, with better planning, Kafka in conjunction with Spark would have been the development platform of choice; upgrades and maintenance are also easier. Do not let ease of use and fast development time ruin the plan.
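
For comparison, here is a minimal sketch of the Kafka plus Spark approach, assuming a Kafka topic named "logs" carrying JSON events with "ts" and "value" fields. The topic name, broker address, and schema are illustrative, and the Kafka source additionally requires the spark-sql-kafka package.

```python
# A minimal Structured Streaming sketch: read events from Kafka and roll
# them up into hourly windows. Names and addresses are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType

spark = SparkSession.builder.appName("log-rollup").getOrCreate()

schema = StructType([
    StructField("ts", TimestampType()),
    StructField("value", DoubleType()),
])

# Ingest the raw stream from Kafka (part one of the problem: fast,
# lossless collection from many sources into one place).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "logs")
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Aggregate minute-level events into hourly windows (part two: fast
# synthesis from minutes to hours for analysis).
hourly = (events
          .withWatermark("ts", "1 hour")
          .groupBy(F.window("ts", "1 hour"))
          .agg(F.avg("value").alias("avg_value"), F.count("*").alias("n")))

query = (hourly.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```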

Extract a data sample for experiments

Tasting soup

When working with Big Data, the defining difference is time: copy time, algorithm runtime, verification time, and so on all last from one to several hours, or even a few days. The best approach is to take the "Big" out of it: immediately extract a small sample to experiment on, which is faster and easier to control.

Small here means from a few dozen to several hundred rows of data, which can be run directly and checked in software like Excel. This way, we can verify that our algorithm runs correctly and be confident when running it on the real Big Data.

Therefore, extract data samples you can observe and experiment on to ensure correctness. We can reason inductively: from the premise and hypothesis on the sample, to proving correctness on the full data.
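
As a minimal sketch of this idea (file names are illustrative), here is how to pull a few hundred rows from a large CSV into a sample you can open in Excel:

```python
# Read only the first 200 rows instead of the whole multi-gigabyte file.
import pandas as pd

sample = pd.read_csv("big_log.csv", nrows=200)
sample.to_csv("sample.csv", index=False)

# With Spark, the same idea: take a small slice of the full dataset.
# df.limit(200).toPandas().to_csv("sample.csv", index=False)
```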



Get used to debugging

IntelliJ debugger

During the probationary period, the company requires you to understand what people have been doing, specifically their code. The best way is to learn to debug; this is a genuinely necessary skill.

When I first joined the company, I did not like using the debugger much, because from working with Python I was used to writing code, running it, and debugging on the command line. But that only slowed me down. When you continue someone else's project, you can hardly keep up with the logic flow without debugging. Thanks to debugging, you understand the idea behind a function, why a condition exists, why a certain variable appears. From there, you can discover the logic errors in the project as well as its incomplete points, and make improvements to the current system.

Choose a debugging tool convenient for your work, such as Eclipse, NetBeans, IntelliJ, IPython, Robomongo, PyCharm, …
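
As a minimal illustration with Python's built-in debugger (the function and data are hypothetical), you can pause execution right where the logic gets unclear instead of sprinkling print statements:

```python
# breakpoint() (Python 3.7+) drops into pdb at this exact spot.
def aggregate(records):
    total = 0
    for r in records:
        breakpoint()  # inspect r and total, step with 'n', continue with 'c'
        total += r["value"]
    return total

aggregate([{"value": 1}, {"value": 2}])
```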

Working with Big Data requires many servers

MapReduce programming with Python

This may sound obvious, but a company doing Big Data with only one server is absurd.

The problem is that a single server, in addition to handling the I/O load, must also bear the load of processing and analyzing data. The consequence is that the system becomes overloaded and stagnant. Many people think that running Spark on one server is enough. It is not. Spark is a technology that relies heavily on RAM: the more RAM and the more servers, the more powerful the algorithms become, enough to meet real-time requirements.

Distribute across multiple servers: MapReduce running on a single server only wastes system resources.
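
As the caption above suggests, here is a classic MapReduce-style word count in PySpark, as a minimal sketch: the input path is illustrative, and on a real cluster the master would be a cluster URL instead of local mode.

```python
# Word count in the MapReduce style with PySpark.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")  # single machine; use a cluster master in production

counts = (sc.textFile("big_log.txt")              # illustrative input path
          .flatMap(lambda line: line.split())     # map: emit each word
          .map(lambda w: (w, 1))                  # map: (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))       # reduce: sum counts per word

print(counts.take(10))
sc.stop()
```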

Write documentation carefully

Atlassian Confluence

If you are a new recruit and the company has a training program, you are very lucky. But most start-up companies do not do this, which causes many difficulties for newcomers.

Documentation here means not only well-commented code, but full documentation of the system design, the functions it defines, and how to install it. Good documentation removes many obstacles when old members leave and new people join.

Write the system design documentation right from the start. Describe the system's tasks, the coherent logic flow of the program, the versions and current updates, …
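
At the function level, even a simple docstring goes a long way. A minimal sketch, with a hypothetical function and fields:

```python
from collections import defaultdict

def rollup_hourly(records):
    """Aggregate minute-level records into hourly averages.

    Args:
        records: iterable of dicts with "ts" (datetime) and "value" (float).

    Returns:
        Dict mapping each hour (datetime truncated to the hour) to the
        average of the values that fall inside it.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        hour = r["ts"].replace(minute=0, second=0, microsecond=0)
        sums[hour] += r["value"]
        counts[hour] += 1
    return {h: sums[h] / counts[h] for h in sums}
```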

Data and technology must be synchronized

Local, Staging, and Production

If you are a freelancer, your work only involves local and remote servers. All development, testing, and debugging happen locally, then all changes are pushed to the remote server as the production version. For projects as big as Big Data, however, there is an intermediate step: staging (a testing step before publishing to production). Here you will face the problem of data synchronization.

The staging step has security advantages and avoids data loss when incidents occur, while ensuring the availability of the production version that customers are using. However, it becomes a burden when working with Big Data.

As mentioned above, the difficulty of Big Data is time and verifying the correctness of the data. Even if your algorithm works properly on local and staging, it may well not on production. Loss of synchronization of data and software versions between systems is always a latent danger that easily leads to errors and maintenance difficulties.

So Data Engineers, Data Scientists, and System Admins should plan together to synchronize data every day, and set up the development environment so that all versions stay in sync. The stages from development to release to production will then run more smoothly.
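
One small, hedged example of keeping software versions in line: a script that compares installed package versions against a pinned list shared by local, staging, and production. The package names and version numbers are purely illustrative.

```python
# Compare installed package versions against the team's pinned list.
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

PINNED = {"pyspark": "3.5.1", "pymongo": "4.7.2", "pandas": "2.2.2"}  # illustrative pins

for pkg, expected in PINNED.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        installed = "not installed"
    status = "OK" if installed == expected else f"MISMATCH (expected {expected})"
    print(f"{pkg}=={installed}: {status}")
```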

Learn new technology quickly

Do more, think less


The field you are pursuing changes very fast. The technology you learned at university has already become obsolete. So how can we keep up with all the current technologies? The answer is foundational knowledge and a habit of (a passion for) exploring new technology.

Foundational knowledge includes mathematics (linear algebra, discrete math, calculus, graph theory, probability and statistics), programming (C/C++, object orientation, Java, C#) and the core curriculum subjects (operating systems, computer architecture and assembly language, computer networks, databases, data structures and algorithms). This knowledge gives you a foothold when facing the big waves of technology, because all of today's inventions are built on this basic science. Math helps you work effectively and logically, organize your work, and solve problems coherently. Programming helps you convey every idea to the computer. A computer is, by nature, a workhorse: it can perform many simple tasks exactly as the programmer instructs. If you can perform the task manually yourself, you can show the computer what you want. The core courses give you an overview of software systems and point you to the relevant tools when dealing with each domain's problems.

Dabbling in new technology seems to be a habit and passion of anyone who has chosen the path of information technology. In the past, when there were no specialized online magazines, I often read technology magazines such as "Making Friends with Computers", "Echip", "Computer World", "Game World", … Whenever I found a good trick or piece of software, I installed and tried it. Thanks to that, within two weeks I caught up with many technologies at once: git, pyspark, ipython, pandas, glassfish, node.js, bower, jira, stash, scala, mongodb, spark, sbt, maven, … If you have ever dug through these technologies even once, you have an advantage over many people.

Therefore, rely on foundational knowledge as your anchor rather than chasing every new technology and tool like a firefly, flickering on and off without stability. Always cultivate new technology, not because the work requires it but out of the excitement of learning. There will be times when you need it, because knowledge is never redundant.
