Winutils Exe Hadoop
November 21, 2021. Download here: http://gg.gg/wz216
*May 28, 2020: Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin. Step 7: Configure Environment Variables. Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH.
*This detailed step-by-step guide shows you how to install the latest Hadoop v3.3.0 on Windows 10. It leverages the Hadoop 3.3.0 winutils tool. WSL (Windows Subsystem for Linux) is not required.
This tutorial shows you how to use the Azure Toolkit for IntelliJ plug-in to develop Apache Spark applications, which are written in Scala, and then submit them to a serverless Apache Spark pool directly from the IntelliJ integrated development environment (IDE). You can use the plug-in in a few ways:
*Develop and submit a Scala Spark application on a Spark pool.
*Access your Spark pools' resources.
*Develop and run a Scala Spark application locally.
In this tutorial, you learn how to:
*Use the Azure Toolkit for IntelliJ plug-in
*Develop Apache Spark applications
*Submit applications to Spark pools
Hadoop requires native libraries on Windows to work properly. That includes accessing the file:// filesystem, where Hadoop uses some Windows APIs to implement POSIX-like file access permissions. This is implemented in hadoop.dll and winutils.exe. In particular, %HADOOP_HOME%\bin\winutils.exe must be locatable.
Prerequisites
*
IntelliJ IDEA Community Version.
*
Azure toolkit plugin 3.27.0-2019.2 – Install from IntelliJ Plugin repository
*
JDK (Version 1.8).
*
Scala Plugin – Install from IntelliJ Plugin repository.
*
The following prerequisite is only for Windows users:
While you're running the local Spark Scala application on a Windows computer, you might get an exception, as explained in SPARK-2356. The exception occurs because winutils.exe is missing on Windows. To resolve this error, download the WinUtils executable to a location such as C:\WinUtils\bin. Then, add the environment variable HADOOP_HOME and set its value to C:\WinUtils. A sketch of these steps is shown below.
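The following is a minimal PowerShell sketch of that prerequisite, assuming you place winutils.exe under C:\WinUtils\bin; the download URL is only an illustration, so use a winutils build that matches your Spark/Hadoop version:
    # Create the folder and download winutils.exe (example URL; pick the build matching your Hadoop version)
    New-Item -ItemType Directory -Force -Path "C:\WinUtils\bin" | Out-Null
    Invoke-WebRequest -Uri "https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe" -OutFile "C:\WinUtils\bin\winutils.exe"
    # Point HADOOP_HOME at the parent of the bin folder (takes effect in new sessions)
    SETX HADOOP_HOME "C:\WinUtils"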
Create a Spark Scala application for a Spark pool
*
Start IntelliJ IDEA, and select Create New Project to open the New Project window.
*
Select Apache Spark/HDInsight from the left pane.
*
Select Spark Project with Samples(Scala) from the main window.
*
From the Build tool drop-down list, select one of the following types:
*Maven for Scala project-creation wizard support.
*SBT for managing the dependencies and building for the Scala project.
*
Select Next.
*
In the New Project window, provide the following information:
*Project name: Enter a name. This tutorial uses myApp.
*Project location: Enter the location where you want to save your project.
*Project SDK: It might be blank on your first use of IDEA. Select New... and navigate to your JDK.
*Spark Version: The creation wizard integrates the proper version for the Spark SDK and Scala SDK. Synapse only supports Spark 2.4.0.
*
Select Finish. It may take a few minutes before the project becomes available.
*
The Spark project automatically creates an artifact for you. To view the artifact, do the following:
a. From the menu bar, navigate to File > Project Structure....
b. From the Project Structure window, select Artifacts.
c. Select Cancel after viewing the artifact.
*
Find LogQuery from myApp > src > main > scala > sample > LogQuery. This tutorial uses LogQuery to run.
Connect to your Spark pools
Sign in to your Azure subscription to connect to your Spark pools.
Sign in to your Azure subscription
*
From the menu bar, navigate to View > Tool Windows > Azure Explorer.
*
From Azure Explorer, right-click the Azure node, and then select Sign In.
*
In the Azure Sign In dialog box, choose Device Login, and then select Sign in.
*
In the Azure Device Login dialog box, select Copy&Open.
*
In the browser interface, paste the code, and then select Next.
*
Enter your Azure credentials, and then close the browser.
*
After you’re signed in, the Select Subscriptions dialog box lists all the Azure subscriptions that are associated with the credentials. Select your subscription and then select Select.
*
From Azure Explorer, expand Apache Spark on Synapse to view the Workspaces that are in your subscriptions.
*
To view the Spark pools, you can further expand a workspace.
Remote Run a Spark Scala application on a Spark pool
After creating a Scala application, you can remotely run it.
*
Open the Run/Debug Configurations window by selecting the icon.
*
In the Run/Debug Configurations dialog window, select +, then select Apache Spark on Synapse.
*
In the Run/Debug Configurations window, provide the following values, and then select OK:
*Spark pools: Select the Spark pools on which you want to run your application.
*Select an Artifact to submit: Leave the default setting.
*Main class name: The default value is the main class from the selected file. You can change the class by selecting the ellipsis (...) and choosing another class.
*Job configurations: You can change the default keys and values. For more information, see the Apache Livy REST API.
*Command-line arguments: You can enter arguments separated by spaces for the main class if needed.
*Referenced Jars and Referenced Files: You can enter the paths for the referenced Jars and files, if any. You can also browse files in the Azure virtual file system, which currently only supports ADLS Gen2 clusters. For more information, see Apache Spark Configuration (https://spark.apache.org/docs/2.4.5/configuration.html#runtime-environment) and How to upload resources to cluster.
*Job Upload Storage: Expand to reveal additional options.
*Storage Type: Select Use Azure Blob to upload or Use cluster default storage account to upload from the drop-down list.
*Storage Account: Enter your storage account.
*Storage Key: Enter your storage key.
*Storage Container: Select your storage container from the drop-down list once Storage Account and Storage Key have been entered.
*
Select the SparkJobRun icon to submit your project to the selected Spark pool. The Remote Spark Job in Cluster tab displays the job execution progress at the bottom. You can stop the application by selecting the red button.
Local Run/Debug Apache Spark applications
You can follow the instructions below to set up local run and local debug for your Apache Spark job.
Scenario 1: Do local run
*
Open the Run/Debug Configurations dialog and select the plus sign (+). Then select the Apache Spark on Synapse option. Enter information for Name and Main class name, then save.
*Environment variables and WinUtils.exe Location are only for Windows users.
*Environment variables: The system environment variable can be detected automatically if you have set it before; there is no need to add it manually.
*WinUtils.exe Location: You can specify the WinUtils location by selecting the folder icon on the right.
*
Then select the local play button.
*
Once the local run completes, if the script includes output, you can check the output file from data > default.
Scenario 2: Do local debugging
*
Open the LogQuery script and set breakpoints.
*
Select the Local debug icon to start local debugging.
Access and manage Synapse Workspace
You can perform different operations in Azure Explorer within Azure Toolkit for IntelliJ. From the menu bar, navigate to View > Tool Windows > Azure Explorer.
Launch workspace
*
From Azure Explorer, navigate to Apache Spark on Synapse, then expand it.
*
Right-click a workspace, then select Launch workspace. The workspace website will open.
Spark console
You can run the Spark Local Console(Scala) or the Spark Livy Interactive Session Console(Scala).
Spark local console (Scala)
Ensure you’ve satisfied the WINUTILS.EXE prerequisite.
*
From the menu bar, navigate to Run > Edit Configurations....
*
From the Run/Debug Configurations window, in the left pane, navigate to Apache Spark on Synapse > [Spark on Synapse] myApp.
*
From the main window, select the Locally Run tab.
*
Provide the following values, and then select OK:
*Environment variables: Ensure the value for HADOOP_HOME is correct.
*WINUTILS.exe location: Ensure the path is correct.
*
From Project, navigate to myApp > src > main > scala > myApp.
*
From the menu bar, navigate to Tools > Spark console > Run Spark Local Console(Scala).
*
Two dialog boxes may then be displayed, asking whether you want to auto-fix dependencies. If so, select Auto Fix.
*
The console should look similar to the picture below. In the console window, type sc.appName, and then press Ctrl+Enter. The result will be shown. You can stop the local console by selecting the red button.
Spark Livy interactive session console (Scala)
It’s only supported on IntelliJ 2018.2 and 2018.3.
*
From the menu bar, navigate to Run > Edit Configurations....
*
From the Run/Debug Configurations window, in the left pane, navigate to Apache Spark on Synapse > [Spark on Synapse] myApp.
*
From the main window, select the Remotely Run in Cluster tab.
*
Provide the following values, and then select OK:
*Main class name: Select the main class name.
*Spark pools: Select the Spark pools on which you want to run your application.
*
From Project, navigate to myApp > src > main > scala > myApp.
*
From the menu bar, navigate to Tools > Spark console > Run Spark Livy Interactive Session Console(Scala).
*
The console should look similar to the picture below. In the console window, type sc.appName, and then press Ctrl+Enter. The result will be shown. You can stop the local console by selecting the red button.
Send selection to Spark console
You may want to see the script result by sending some code to the local console or the Livy Interactive Session Console(Scala). To do so, highlight some code in the Scala file, then right-click Send Selection To Spark console. The selected code will be sent to the console and executed. The result will be displayed after the code in the console. The console will also check for existing errors.
Next steps
This detailed step-by-step guide shows you how to install the latest Hadoop v3.3.0 on Windows 10. It leverages the Hadoop 3.3.0 winutils tool. WSL (Windows Subsystem for Linux) is not required. This version was released on July 14, 2020, and is the first release of the Apache Hadoop 3.3 line. There are significant changes compared with Hadoop 3.2.0, such as Java 11 runtime support, a protobuf upgrade to 3.7.1, scheduling of opportunistic containers, non-volatile SCM support in HDFS cache directives, and so on.
Please follow all the instructions carefully. Once you complete the steps, you will have a shiny pseudo-distributed single-node Hadoop to work with.
warning: Without consent from the author, please don't redistribute any part of the content on this page. The yellow elephant logo is a registered trademark of Apache Hadoop; the blue window logo is a registered trademark of Microsoft.
References
Refer to the following articles if you prefer to install other versions of Hadoop, to configure a multi-node cluster, or to use WSL.
*Install Hadoop 3.3.0 on Windows 10 using WSL (Windows Subsystem for Linux is required)
Required tools
Before you start, make sure you have the following tools available in Windows 10:
*PowerShell: We will use this tool to download the package. In my system, the PowerShell version is listed below.
*Git Bash or 7-Zip: We will use Git Bash or 7-Zip to unzip the Hadoop binary package. You can install either tool, or any other tool that can unzip *.tar.gz files on Windows.
*Command Prompt: We will use it to start Hadoop daemons and run some commands as part of the installation process.
*Java JDK: A JDK is required to run Hadoop, as the framework is built using Java. In my system, my JDK version is jdk1.8.0_161. Check the supported JDK versions on the following page. From Hadoop 3.3.0, the Java 11 runtime is also supported.
Now we will start the installation process.
Step 1 - Download Hadoop binary package
Select download mirror link
Go to the download page of the official website:
And then choose one of the mirror links. The page lists the mirrors closest to you based on your location. For me, I am choosing the following mirror link:
info: In the following sections, this URL will be used to download the package. Your URL might be different from mine, and you can replace the link accordingly.
Download the package
info: In this guide, I am installing Hadoop in the folder big-data of my F drive (F:\big-data). If you prefer to install on another drive, please remember to change the path accordingly in the following command lines. This directory is also called the destination directory in the following sections.
Open PowerShell and then run the following command lines one by one:
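A hedged sketch of the download step; the mirror URL below is only an example, so substitute the one you selected:
    $dest_dir = "F:\big-data"
    $url = "https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz"
    $client = New-Object System.Net.WebClient
    $client.DownloadFile($url, "$dest_dir\hadoop-3.3.0.tar.gz")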
It may take a few minutes to download.
Once the download completes, you can verify it:
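For example, list the archive and, optionally, compute its SHA-512 checksum to compare against the .sha512 file published by Apache:
    Get-ChildItem "$dest_dir\hadoop-3.3.0.tar.gz"
    Get-FileHash "$dest_dir\hadoop-3.3.0.tar.gz" -Algorithm SHA512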
You can also directly download the package through your web browser and save it to the destination directory.
warning: Please keep this PowerShell window open, as we will use some variables from this session in the following steps. If you have already closed it, that is okay; just remember to reinitialise the variables above: $client, $dest_dir.
Step 2 - Unpack the package
Now we need to unpack the downloaded package using a GUI tool (like 7-Zip) or the command line. For me, I will use Git Bash to unpack it.
Open Git Bash, change the directory to the destination folder, and then run the following commands to unzip the package:
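A minimal sketch of both steps in Git Bash, assuming the destination directory is F:\big-data:
    cd /f/big-data
    tar -xvzf hadoop-3.3.0.tar.gz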
The command will take quite a few minutes as there are numerous files included and the latest version introduced many new features.
After the unzip command is completed, a new folder hadoop-3.3.0 is created under the destination folder.
info: When running the command, you may experience errors like the following. Please ignore them for now.
Step 3 - Install Hadoop native IO binary
Hadoop on Linux includes optional native IO support. However, native IO is mandatory on Windows, and without it you will not be able to get your installation working. The Windows native IO libraries are not included as part of the Apache Hadoop release, so we need to build or download them separately.
info: The following repository provides pre-built Hadoop Windows native libraries: https://github.com/kontext-tech/winutils
warning: These libraries are not signed and there is no guarantee that they are 100% safe. We use them purely for test and learning purposes.
Download all the files from that location and save them to the bin folder under the Hadoop folder. For my environment, the full path is F:\big-data\hadoop-3.3.0\bin. Remember to change it to your own path accordingly.
Alternatively, you can run the following commands in the previous PowerShell window to download:
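A hedged sketch that downloads the two core libraries; it assumes the repository stores the Hadoop 3.3.0 binaries under hadoop-3.3.0/bin, and the full set in that folder contains more files than these two:
    $client = New-Object System.Net.WebClient
    $bin_dir = "$dest_dir\hadoop-3.3.0\bin"
    # winutils.exe and hadoop.dll are the two files Hadoop must be able to locate on Windows
    $client.DownloadFile("https://raw.githubusercontent.com/kontext-tech/winutils/master/hadoop-3.3.0/bin/winutils.exe", "$bin_dir\winutils.exe")
    $client.DownloadFile("https://raw.githubusercontent.com/kontext-tech/winutils/master/hadoop-3.3.0/bin/hadoop.dll", "$bin_dir\hadoop.dll")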
After this, the bin folder looks like the following:
Step 4 - (Optional) Java JDK installation
Java JDK is required to run Hadoop. If you have not installed Java JDK, please install it.
You can install JDK 8 from the following page:
Once you complete the installation, please run the following command in PowerShell or Git Bash to verify:
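For example:
    java -version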
If you get an error like 'cannot find java command or executable', don't worry; we will resolve this in the following step.
Step 5 - Configure environment variables
Now that we've downloaded and unpacked all the artefacts, we need to configure two important environment variables.
Configure JAVA_HOME environment variable
As mentioned earlier, Hadoop requires Java, and we need to configure the JAVA_HOME environment variable (it is not strictly mandatory, but I recommend it).
First, we need to find out the location of the Java JDK. In my system, the path is D:\Java\jdk1.8.0_161.
Your location may be different depending on where you installed your JDK.
And then run the following command in the previous PowerShell window:
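A sketch using the JDK path from my system; replace it with your own location:
    SETX JAVA_HOME "D:\Java\jdk1.8.0_161"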
Remember to quote the path, especially if you have spaces in your JDK path.
info: You can set the environment variable at the system level by adding the /M option; however, in case you don't have access to change system variables, you can set it up at the user level instead.
The output looks like the following:
Configure HADOOP_HOME environment variable
Similarly, we need to create a new environment variable for HADOOP_HOME using the following command. The path should be your extracted Hadoop folder. For my environment it is F:\big-data\hadoop-3.3.0.
If you used PowerShell to download and if the window is still open, you can simply run the following command:
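A sketch that reuses the $dest_dir variable from the download step:
    SETX HADOOP_HOME "$dest_dir\hadoop-3.3.0"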
The output looks like the following screenshot:
Alternatively, you can specify the full path:
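For example, with the path used in this guide:
    SETX HADOOP_HOME "F:\big-data\hadoop-3.3.0"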
Now you can also verify the two environment variables in the system:
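SETX only affects new sessions, so one way to verify is to read the stored user-level values back:
    [Environment]::GetEnvironmentVariable("JAVA_HOME", "User")
    [Environment]::GetEnvironmentVariable("HADOOP_HOME", "User")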
Configure PATH environment variable
Once we finish setting up the above two environment variables, we need to add the bin folders to the PATH environment variable.
If the PATH environment variable already exists in your system, you can also manually add the following two paths to it:
*%JAVA_HOME%/bin
*%HADOOP_HOME%/bin
Alternatively, you can run the following command to add them:
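A hedged sketch that appends both bin folders to the user-level PATH; it reads the values stored by SETX rather than the current session (which may not have them yet):
    $javaHome   = [Environment]::GetEnvironmentVariable("JAVA_HOME", "User")
    $hadoopHome = [Environment]::GetEnvironmentVariable("HADOOP_HOME", "User")
    $userPath   = [Environment]::GetEnvironmentVariable("Path", "User")
    # Append the resolved JAVA_HOME\bin and HADOOP_HOME\bin folders to the user PATH
    [Environment]::SetEnvironmentVariable("Path", "$userPath;$javaHome\bin;$hadoopHome\bin", "User")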
If you don't have other user variables set up in the system, you can also directly add a Path variable that references the others to keep it short:
Close the PowerShell window, open a new one, and type winutils.exe directly to verify that the above steps completed successfully:
You should also be able to run the following command:
Step 6 - Configure Hadoop
Now we are ready to configure the most important part: the Hadoop configuration itself, which involves the core, YARN, MapReduce, and HDFS configurations.
Configure core site
Edit the file core-site.xml in the %HADOOP_HOME%\etc\hadoop folder. For my environment, the actual path is F:\big-data\hadoop-3.3.0\etc\hadoop.
Replace the configuration element with the following:
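A minimal single-node sketch for core-site.xml; the port 9000 is a commonly used default and only an example, not necessarily the author's exact value:
    <configuration>
       <!-- Default filesystem URI for the single-node cluster -->
       <property>
         <name>fs.defaultFS</name>
         <value>hdfs://localhost:9000</value>
       </property>
    </configuration>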
Configure HDFS
Edit the file hdfs-site.xml in the %HADOOP_HOME%\etc\hadoop folder.
Before editing, please create two folders in your system: one for the namenode directory and another for the data directory. For my system, I created the following two sub-folders:
*F:\big-data\data\dfs\namespace_logs_330
*F:\big-data\data\dfs\data_330
Replace the configuration element with the following (remember to replace the paths with your own accordingly):
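A sketch of hdfs-site.xml using the Hadoop 3 property names and the two folders created above; adjust the paths to your own directories:
    <configuration>
       <!-- Single-node cluster, so keep one replica per block -->
       <property>
         <name>dfs.replication</name>
         <value>1</value>
       </property>
       <property>
         <name>dfs.namenode.name.dir</name>
         <value>file:///F:/big-data/data/dfs/namespace_logs_330</value>
       </property>
       <property>
         <name>dfs.datanode.data.dir</name>
         <value>file:///F:/big-data/data/dfs/data_330</value>
       </property>
    </configuration>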
In Hadoop 3, the property names are slightly different from previous versions. Refer to the official documentation to learn more about the configuration properties.
info: For dfs.replication we configure the value as 1 because we are configuring just a single node. By default the value is 3.
info: The directory configurations are not mandatory; by default Hadoop will use its temporary folder. For our tutorial purposes, I recommend customising the values.
Configure MapReduce and YARN site
Edit the file mapred-site.xml in the %HADOOP_HOME%\etc\hadoop folder.
Replace the configuration element with the following:
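A minimal sketch for mapred-site.xml; the single property below is the standard setting that tells MapReduce to run on YARN, and is an assumption rather than the author's exact file:
    <configuration>
       <!-- Run MapReduce jobs on YARN -->
       <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
       </property>
    </configuration>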
Edit the file yarn-site.xml in the %HADOOP_HOME%\etc\hadoop folder and replace its configuration element with the following:
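A minimal single-node sketch for yarn-site.xml; these are the standard shuffle-service properties rather than the author's exact values:
    <configuration>
       <!-- Enable the MapReduce shuffle auxiliary service on the NodeManager -->
       <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
       </property>
       <property>
         <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
         <value>org.apache.hadoop.mapred.ShuffleHandler</value>
       </property>
    </configuration>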
Step 7 - Initialise HDFS & bug fix
Run the following command in Command Prompt:
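A minimal sketch of the format command, assuming %HADOOP_HOME%\bin is now on your PATH:
    hdfs namenode -format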
The following is an example when it is formatted successfully:
Step 8 - Start HDFS daemons
Run the following command in Command Prompt to start the HDFS daemons (see the sketch below). Two Command Prompt windows will open: one for the datanode and another for the namenode, as the following screenshots show.
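A minimal sketch of that step, assuming the package was extracted as described above:
    %HADOOP_HOME%\sbin\start-dfs.cmd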