HADOOP INSTALLATION ON UBUNTU 22.04
I wanted to write something on both Linux and Hadoop, which I have recently been interested in and started working on. I will continue to share new things about Hadoop from time to time. First of all, what is Hadoop? We should start there.
To briefly answer what “Hadoop” is and what it does: it is an open-source, Java-based framework for storing and processing big data across clusters of servers. By combining large amounts of storage with distributed processing, it lets you handle datasets that would never fit on a single machine, and this is exactly where it gets its power. It has four modules: HDFS, YARN, MapReduce and Hadoop Common.
With that short definition out of the way: in this guide I will use Ubuntu 22.04 and, on the Hadoop side, version 3.3.6, the latest release at the time of writing. Enjoy your reading.
First we will refresh Ubuntu’s package lists. To do this, open the Terminal and run the following command.
sudo apt-get update
Then we need to install OpenJDK, which Hadoop requires. Run the following command in the terminal; it installs the default OpenJDK for Ubuntu 22.04 (OpenJDK 11). You can install a different version if you want.
sudo apt install default-jdk
After running this command, Ubuntu will ask for confirmation. Type “Y”, press Enter, and wait for the installation to complete.
Java Installation
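By the way, if you prefer to skip the confirmation prompt, apt accepts the -y flag; the command below is simply the non-interactive version of the same installation.
sudo apt install -y default-jdk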
After the installation is complete, we will check the version to see if the installation was successful with the relevant command.
java -version
Version Control
User Settings and SSH Authorization
At this stage we need to create an SSH key for our machine and a separate user for our work on the Hadoop side. To touch briefly on SSH: it is a protocol that lets users control and manage remote servers over the network. All traffic between the host computer and the remote server is encrypted, so communication stays secure. Now we will create the SSH key and the user required for the Hadoop server. For this we first need to install OpenSSH, which we do with the following command.
sudo apt install openssh-server openssh-client pdsh
Just like the Java installation, the system will ask for confirmation again; type “Y” and press Enter to continue. You may need to press Enter a few more times until the screen below appears.
SSH Setup
After this is completed, we move on to creating the new user for Hadoop. Run the following command to create the user.
sudo useradd -m -s /bin/bash hadoop
A user named hadoop has been created. You can choose a different name if you like. Now we need to set a password for this user.
sudo passwd hadoop
We have created our user and set its password. Next we will authorize the hadoop user to run “sudo” commands, in other words make it an administrative user. Readers unfamiliar with Linux may wonder what sudo is for: in its crudest form, it is the equivalent of administrator authorization on the Windows side.
sudo usermod -aG sudo hadoop
After granting sudo rights to the newly created hadoop user, we switch from our own account to the hadoop user, where we will carry out the rest of the work.
su hadoop
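To make sure the sudo authorization really took effect, a quick optional check is to ask who you become when elevating; it should print “root”.
sudo whoami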
Now we return to the SSH side and generate a key pair with the OpenSSH tools we just installed. Run the following command to create the key. One important point: you can accept the prompts that appear after running the command by simply pressing Enter.
ssh-keygen -t rsa
We have created the key; now let’s verify that the key files exist with the following command.
ls ~/.ssh/
Next, run the following command to copy the SSH public key ‘id_rsa.pub’ to the “authorized_keys” file and change the default permission to 600.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
In SSH, the “authorized_keys” file is where public keys are stored, and it can hold more than one. Anyone who holds the private key matching a public key listed in authorized_keys will be able to connect to the server.
chmod 600 ~/.ssh/authorized_keys
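If you want to double-check the permissions, listing the file should now show -rw------- (600) with hadoop as the owner.
ls -l ~/.ssh/authorized_keys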
We are now at the final SSH step: connecting to the local machine with the following command to verify that key-based login works. A confirmation prompt will appear again; type “yes” on the line that comes up to confirm.
ssh localhost
SSH Localhost
Downloading and Installing Hadoop
After completing the SSH and user steps, we can start downloading Hadoop. We will work with Hadoop 3.3.6; you can use another version if you wish. To download, go to the Hadoop website and, for the version you want, choose the binary download option (not the source archive, which contains only source code that would still have to be compiled). Copy the link ending in .tar.gz from the page that opens, then run wget, the standard Linux download command, followed by the copied link. Hadoop will start downloading.
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
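Optionally, you can verify the integrity of the archive before extracting it. Apache publishes a SHA-512 checksum next to each release; the file name below assumes the usual layout for the 3.3.6 binary archive. Download it, compute the hash of the archive and compare the two values by eye.
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum hadoop-3.3.6.tar.gz
cat hadoop-3.3.6.tar.gz.sha512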
Once the file is downloaded, run the following commands in order to extract the archive and move it into place.
tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
Finally, we give ownership of the installation directory to the hadoop user and group.
sudo chown -R hadoop:hadoop /usr/local/hadoop
Setting Hadoop Environment Variables
For this step we first need to open the shell configuration file and add a few lines. Run the following command to open it.
nano ~/.bashrc
Scroll to the bottom of the file shown in the screenshot and add the following lines there. Then press Ctrl+O and Enter; the changes will be written. After that, exit the file with Ctrl+X.
# Hadoop environment variables
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Nano Editor Settings
After adding these lines and saving our configuration file, we enter the following command to apply the changes.
source ~/.bashrc
To check the environment variables we have just defined, we can run the following three commands in sequence and confirm that they print the expected paths.
echo $JAVA_HOME
echo $HADOOP_HOME
echo $HADOOP_OPTS
Hadoop Location Control
Next, you need to configure the JAVA_HOME environment variable in the ‘hadoop-env.sh’ script. Open the file using the following nano command; hadoop-env.sh lives under ‘$HADOOP_HOME/etc/hadoop’, where the Hadoop installation directory is /usr/local/hadoop.
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Find the “export JAVA_HOME” line that is left as a comment, remove the # sign at the beginning and change it as follows. After completing this, save and exit by pressing Ctrl+O, Enter and Ctrl+X respectively.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Setting up a Hadoop Environment
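If you are not sure of the exact JDK path on your machine, one way to find it (assuming OpenJDK was installed under /usr/lib/jvm as above) is to resolve the java binary and strip the trailing /bin/java.
readlink -f /usr/bin/java | sed 's:/bin/java::'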
Let’s now check the Hadoop version to confirm the installation.
hadoop version
Hadoop Version Control
Hadoop Cluster Setup (Pseudo-Distributed Mode)
It is possible to set up clusters in Hadoop in three different ways.
Local Mode (Standalone): The default Hadoop setup, which runs everything as a single Java process with no daemons deployed. It makes debugging Hadoop jobs easy.
Pseudo-Distributed Mode: This mode lets you run a Hadoop cluster in distributed fashion even on a single node/server. Each Hadoop daemon runs in its own Java process.
Fully-Distributed Mode: This mode is for large deployments with several or even thousands of nodes/servers. If you want to run Hadoop in production, you will need fully-distributed mode.
Since we will be working in pseudo-distributed mode, we need to adjust the configuration files accordingly.
Namenode and Datanode Adjustments
Run the following command to open core-site.xml in the nano editor, then enter the configuration settings below inside the file. After doing this, press Ctrl+O, Enter and Ctrl+X respectively to save and exit the editor.
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:9000</value>
</property>
</configuration>
HDFS Nano
Next, run the following command to create the new directories that the NameNode and DataNode will use in the Hadoop cluster.
sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
Then change the ownership of these directories to the “hadoop” user.
sudo chown -R hadoop:hadoop /home/hadoop/hdfs
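A quick way to confirm that the ownership change took effect is to list the two directories; both should now show hadoop as owner and group.
ls -ld /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode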
After these operations, we need to make configuration changes in the nano editor again. We open the nano editor with the following command.
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file. Here we set “dfs.replication” to “1” because we are installing the Hadoop cluster on a single node, and we also specify the directories to be used for the NameNode and DataNode. After the changes, press Ctrl+O, Enter and Ctrl+X to save and exit the file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
</configuration>
HDFS NameNode
We are done with the configuration; now we will format the Hadoop file system.
hdfs namenode -format
Then start NameNode and DataNode with the following command. NameNode will run on the server IP address you configured in the “core-site.xml” file.
start-dfs.sh
There is something I need to add here. When you run this command, you may get a “permission denied” error. If you hit it, you can find many solutions online; I solved the problem by removing pdsh with the following command.
Permission Denied Error
sudo apt-get remove pdsh
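As an alternative to removing pdsh, the same error is often fixed by telling pdsh to use ssh as its remote command. This is a commonly suggested workaround rather than part of my original setup; if you want to try it, add the line below to ~/.bashrc and run source ~/.bashrc again.
export PDSH_RCMD_TYPE=ssh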
The NameNode and DataNode are now running. To see both in the web interface, enter http://0.0.0.0:9870/ in your web browser and you will see the NameNode overview page. For the DataNode, simply change the port number to 9864.
DataNode and NameNode
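You can also confirm from the terminal that the daemons are up with jps, the JVM process listing tool that ships with the JDK; after start-dfs.sh it should list NameNode, DataNode and SecondaryNameNode.
jps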
Yarn Manager Configuration
We will open the nano editor again with the following command.
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
After the editor screen appears, add the following configuration settings and save and exit by pressing Ctrl+O, Enter and Ctrl+X respectively.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
Mapred-Site Nano
A similar configuration is required for yarn-site.xml. Open it with the command below, enter the configuration settings, then save and exit by pressing Ctrl+O, Enter and Ctrl+X respectively.
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
Yarn-Site Nano
Then we start YARN.
start-yarn.sh
ResourceManager should be running on its default port, 8088. Return to your web browser and open the address (http://0.0.0.0:8088/).
Now all the processes are completed and the installation is done. You can start using Hadoop.
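If you want a final sanity check, one option (not part of the steps above, and assuming the 3.3.6 binary layout under /usr/local/hadoop) is to run one of the example MapReduce jobs that ship with the distribution. Create a home directory in HDFS for the hadoop user and launch the pi estimator with 2 maps and 5 samples; if it finishes and prints an estimate of pi, HDFS and YARN are both doing their jobs.
hdfs dfs -mkdir -p /user/hadoop
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 5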
Ender Kaderli
Contact : enderkaderli@datapaper.ai