# Hadoop Compatible File System
HDFS is optimized for large files. The scalability of the single HDFS namenode is limited by the number of files. It is hard for HDFS to store lots of small files.
SeaweedFS excels at small files and has no issue storing large files. It is now possible for Hadoop jobs to read from and write to SeaweedFS.
Build the SeaweedFS Java client and the Hadoop 3 client jar:
```
$ cd $GOPATH/src/github.com/seaweedfs/seaweedfs/other/java/client
$ mvn install

# build for hadoop3
$ cd $GOPATH/src/github.com/seaweedfs/seaweedfs/other/java/hdfs3
$ mvn package
$ ls -al target/seaweedfs-hadoop3-client-4.00.jar
```
## Maven
```xml
<dependency>
  <groupId>com.seaweedfs</groupId>
  <artifactId>seaweedfs-hadoop3-client</artifactId>
  <version>4.00</version>
</dependency>
```
Or you can download the latest version from Maven Central.
Suppose you are setting up a new Hadoop installation. Here are the minimum steps to get SeaweedFS running.
You need to start a `weed filer` first, build the seaweedfs-hadoop3-client-4.00.jar as described above, and then do the following:
```
# optionally adjust hadoop memory allocation
$ export HADOOP_CLIENT_OPTS="-Xmx4g"

$ cd ${HADOOP_HOME}

# create etc/hadoop/mapred-site.xml, just to satisfy hdfs dfs. skip this if the file already exists.
$ echo "<configuration></configuration>" > etc/hadoop/mapred-site.xml

# on hadoop3
$ bin/hdfs dfs -Dfs.defaultFS=seaweedfs://localhost:8888 \
    -Dfs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
    -libjars ./seaweedfs-hadoop3-client-4.00.jar \
    -ls /
```
Both reads and writes are working fine.
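For a programmatic check of the same thing, the `-D` settings above can be set on a Hadoop `Configuration` in Java. This is only an illustrative sketch, not code from the SeaweedFS project: it assumes the filer runs at `localhost:8888`, the client jar is on the classpath, and `/weedfs/hello.txt` is just an example path.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeaweedFsSmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // same settings as the -D options above
    conf.set("fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem");
    conf.set("fs.defaultFS", "seaweedfs://localhost:8888");

    FileSystem fs = FileSystem.get(URI.create("seaweedfs://localhost:8888"), conf);
    Path path = new Path("/weedfs/hello.txt"); // example path

    // write a small file
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello from seaweedfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // read it back and print to stdout
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```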
Configure Hadoop to use SeaweedFS in `etc/hadoop/conf/core-site.xml`. `core-site.xml` resides on each node in the Hadoop cluster, and you must add the same properties to each instance of it. There are several properties to modify:

- `fs.seaweedfs.impl`: This property defines the Seaweed HCFS implementation class contained in the SeaweedFS HDFS client JAR. It is required.
- `fs.defaultFS`: This property defines the default file system URI to use. It is optional if every path already has the prefix `seaweedfs://localhost:8888`.
- `fs.AbstractFileSystem.seaweedfs.impl`: The SeaweedFS implementation of the Hadoop `AbstractFileSystem`, which delegates to the SeaweedFS `FileSystem`. It is only necessary for Hadoop 3.x.
- `fs.seaweed.buffer.size`: Optionally change the default buffer size 4194304 to a larger number. It will be used as the default chunk size.
- `fs.seaweed.volume.server.access`: `[direct|publicUrl|filerProxy]` Optionally access volume servers via their publicUrl settings, or use the filer as a proxy. This is useful when volume servers are inside a cluster and not directly accessible.

An example `core-site.xml`:
```xml
<configuration>
  <property>
    <name>fs.seaweedfs.impl</name>
    <value>seaweed.hdfs.SeaweedFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>seaweedfs://localhost:8888</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.seaweedfs.impl</name>
    <value>seaweed.hdfs.SeaweedAbstractFileSystem</value>
  </property>
  <property>
    <name>fs.seaweed.buffer.size</name>
    <value>4194304</value>
  </property>
  <property>
    <name>fs.seaweed.volume.server.access</name>
    <!-- [direct|publicUrl|filerProxy] -->
    <value>direct</value>
  </property>
  <property>
    <name>fs.seaweed.replication</name>
    <value>002</value>
  </property>
</configuration>
```
# Replication Configuration
SeaweedFS supports configurable replication for data durability. You can set the replication level using the `fs.seaweed.replication` property in `core-site.xml`.
SeaweedFS uses a 3-digit replication string, where the digits are the number of extra copies to place in different data centers, on different racks, and on other volume servers, respectively.
**Examples:**
- `000`: no extra copies (1 copy total); an empty value falls back to the filer's default replication
- `001`: 1 additional copy on another volume server in the same rack (2 total copies)
- `002`: 2 additional copies on other volume servers in the same rack (3 total copies)
- `010`: 1 additional copy on a different rack in the same data center (2 total copies)
- `020`: 2 additional copies on different racks in the same data center (3 total copies)
If `fs.seaweed.replication` is not set, SeaweedFS will use the HDFS `dfs.replication` parameter and convert it to SeaweedFS format. If `fs.seaweed.replication` is set to an empty string, SeaweedFS will use the default replication configured on the SeaweedFS filer.
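How exactly `dfs.replication` gets translated is up to the client. As a rough sketch of the idea only (not the actual conversion code in seaweedfs-hadoop3-client), a replication factor of N could be read as N-1 extra copies on other volume servers in the same rack:

```java
// Hypothetical illustration, not the real client logic: assume dfs.replication N
// maps to N-1 extra copies on other volume servers in the same rack (e.g. 3 -> "002").
public class ReplicationMappingSketch {
  static String toSeaweedReplication(int dfsReplication) {
    int extraCopies = Math.max(0, Math.min(9, dfsReplication - 1));
    return "00" + extraCopies;
  }

  public static void main(String[] args) {
    for (int n = 1; n <= 3; n++) {
      System.out.println("dfs.replication=" + n + " -> " + toSeaweedReplication(n));
    }
  }
}
```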
**Configuration:**
```xml
<property>
  <name>fs.seaweed.replication</name>
  <value>002</value>
</property>
```
You can also set replication programmatically when creating files, though this is less common for HDFS-compatible usage.
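As a hedged sketch of what that might look like with the standard Hadoop API (the path, buffer size, block size, and replication factor below are placeholders, not values from the SeaweedFS docs), one of the `FileSystem.create` overloads takes a replication factor:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
  public static void main(String[] args) throws Exception {
    // picks up fs.defaultFS and the fs.seaweed.* settings from core-site.xml
    FileSystem fs = FileSystem.get(new Configuration());

    // Standard Hadoop overload: create(path, overwrite, bufferSize, replication, blockSize).
    // A replication factor of 3 would presumably be handled as described above.
    Path path = new Path("/weedfs/replicated.txt"); // placeholder path
    try (FSDataOutputStream out =
        fs.create(path, true, 4 * 1024 * 1024, (short) 3, 64L * 1024 * 1024)) {
      out.writeBytes("data written with replication factor 3\n");
    }
  }
}
```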
# Deploy the SeaweedFS HDFS client jar
```
# find the hadoop classpath
$ bin/hadoop classpath

# copy the client jar into one of the classpath directories
$ cd ${HADOOP_HOME}
$ cp ./seaweedfs-hadoop3-client-4.00.jar share/hadoop/common/lib/
```
Now you can do this:
```
$ cd ${HADOOP_HOME}
$ bin/hdfs dfs -ls seaweedfs://localhost:8888/
```
# Supported HDFS Operations
```
bin/hdfs dfs -appendToFile README.txt /weedfs/weedfs.txt
bin/hdfs dfs -cat /weedfs/weedfs.txt
bin/hdfs dfs -rm -r /uber
bin/hdfs dfs -chown -R chris:chris /weedfs
bin/hdfs dfs -chmod -R 755 /weedfs
bin/hdfs dfs -copyFromLocal README.txt /weedfs/README.txt.2
bin/hdfs dfs -copyToLocal /weedfs/README.txt.2 .
bin/hdfs dfs -count /weedfs/README.txt.2
bin/hdfs dfs -cp /weedfs/README.txt.2 /weedfs/README.txt.3
bin/hdfs dfs -du -h /weedfs
bin/hdfs dfs -find /weedfs -name "*.txt" -print
bin/hdfs dfs -get /weedfs/weedfs.txt
bin/hdfs dfs -getfacl /weedfs
bin/hdfs dfs -getmerge -nl /weedfs w.txt
bin/hdfs dfs -ls /
bin/hdfs dfs -mkdir /tmp
bin/hdfs dfs -mkdir -p /tmp/x/y
bin/hdfs dfs -moveFromLocal README.txt.2 /tmp/x/
bin/hdfs dfs -mv /tmp/x/y/README.txt.2 /tmp/x/y/README.txt.3
bin/hdfs dfs -mv /tmp/x /tmp/z
bin/hdfs dfs -put README.txt /tmp/z/y/
bin/hdfs dfs -rm /tmp/z/y/*
bin/hdfs dfs -rmdir /tmp/z/y
bin/hdfs dfs -stat /weedfs
bin/hdfs dfs -tail /weedfs/weedfs.txt
bin/hdfs dfs -test -f /weedfs/weedfs.txt
bin/hdfs dfs -text /weedfs/weedfs.txt
bin/hdfs dfs -touchz /weedfs/weedfs.txtx
```
## Operations Plan to Support
`getfattr`, `setfacl`, `setfattr`, `truncate`, `createSnapshot`, `deleteSnapshot`, `renameSnapshot`, `setrep`
# Notes
## Atomicity
SeaweedFS satisfies the HCFS [requirements](https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html) that the following operations be atomic, when using MySQL/Postgres/SQLite database transactions:
1. Creating a file. If the overwrite parameter is false, the check and creation MUST be atomic.
1. Deleting a file.
1. Renaming a file.
1. Renaming a directory.
1. Creating a single directory with mkdir().
Of these, all operations except file or directory renaming are atomic for any filer store (see the create sketch after this list):
1. Creating a file
1. Deleting a file
1. Creating a single directory with mkdir().
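The atomic check-and-create is what applications rely on when calling `create` with `overwrite=false`, for example as a simple lock file. A minimal, illustrative sketch (the path is a placeholder and the exact exception type may vary):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicCreateExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path lock = new Path("/weedfs/locks/job.lock"); // placeholder path

    try {
      // overwrite=false: the existence check and the creation must be atomic,
      // so only one concurrent caller can win.
      fs.create(lock, false).close();
      System.out.println("acquired");
    } catch (IOException e) {
      // typically a FileAlreadyExistsException when another process created it first
      System.out.println("not acquired: " + e.getMessage());
    }
  }
}
```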
## No native shared libraries
The SeaweedFS Hadoop client is a pure Java library. There are no native libraries to install if you already have Hadoop running.
This is different from many other HCFS options: if native shared libraries are needed, they have to be installed on every Hadoop node, which is a significant operational task.
## Shaded Fat Jar
One headache with complicated Java systems is runtime jar dependency conflicts (a problem Go sidesteps with build-time dependency resolution). For the SeaweedFS Hadoop client, the required jars are mostly shaded and packaged into one fat jar, so no extra jar files are needed.
## Note
### use `-Djava.net.preferIPv4Stack=true` if possible, see https://github.com/netty/netty/issues/6454
error message:
```
Failed construction of Master: class org.apache.hadoop.hbase.master.HMasterCommandLine$LocalHMaster
Connection refused: localhost/0:0:0:0:0:0:0:1:18888
```
### See [[Security-Configuration#for-java-grpc]] if you enabled gRPC security.
error message:
```
Failed construction of Master: class org.apache.hadoop.hbase.master.HMasterCommandLine$LocalHMaster
not an SSL/TLS record: 000006040000000000000500004000
```