[Single Point] A Daily Breakthrough: Custom InputFormat in MapReduce

Custom InputFormat in MapReduce

Q: How do you write a custom InputFormat?

A:

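Create a custom class that extends the InputFormat you need, such as FileInputFormat, and override its createRecordReader method to return your custom RecordReader.
Create the custom RecordReader by extending the RecordReader class and overriding its methods.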


// Initialize resources; typically used to open the I/O stream.
// The stream is usually an FSDataInputStream, conventionally kept in a member field (here: inputStream).
public void initialize(InputSplit split,
                       TaskAttemptContext context
                       ) throws IOException, InterruptedException
{
    FileSplit fs = (FileSplit) split;
    Path path = fs.getPath();
    FileSystem fileSystem = path.getFileSystem(context.getConfiguration());
    inputStream = fileSystem.open(path);
}

// Release resources; typically used to close the I/O stream.
public void close() throws IOException {
    IOUtils.closeStream(inputStream);
}

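// Works like a cursor: advance to the next record and return true if one was read,
// or false when there is no more data.
public boolean nextKeyValue() throws IOException, InterruptedException {
    // Read the next record here and store it as the current key and value.
    return false; // placeholder so the skeleton compiles: reports "no more records"
}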

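// The remaining methods are shown as signatures only; implement each one in your subclass.

// Returns the key of the current record.
public KEYIN getCurrentKey() throws IOException, InterruptedException;

// Returns the value of the current record.
public VALUEIN getCurrentValue() throws IOException, InterruptedException;

// Returns the read progress as a number between 0 and 1.
public float getProgress() throws IOException, InterruptedException;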

Set the InputFormat on the Job.


job.setInputFormatClass(MyInputFormat.class); // MyInputFormat is a placeholder name for your custom InputFormat class
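To make the calling order of the reader methods above concrete, here is a minimal sketch in plain Java. It is a stand-in, not Hadoop code: `LineCursor` and `readAll` are hypothetical names, the "split" is just an in-memory string, and the key is assumed to be a 0-based line number. The loop in `readAll` mirrors the pattern the framework follows: initialize once, call nextKeyValue until it returns false, read the current key and value after each successful call, then close.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the RecordReader contract (hypothetical class, not Hadoop's).
// Key = 0-based line number, value = the line's text.
class LineCursor {
    private final String data;   // stands in for the input split
    private BufferedReader in;   // stands in for the FSDataInputStream
    private long key = -1;       // no current record before the first nextKeyValue()
    private String value;
    private long charsRead = 0;

    LineCursor(String data) {
        this.data = data;
    }

    // Open resources, as initialize(split, context) would.
    void initialize() {
        in = new BufferedReader(new StringReader(data));
    }

    // Advance the cursor; true if a record was read, false at end of input.
    boolean nextKeyValue() throws IOException {
        String line = in.readLine();
        if (line == null) {
            return false;
        }
        key++;
        value = line;
        // Approximate position: line length plus one line terminator, capped at the total.
        charsRead = Math.min(data.length(), charsRead + line.length() + 1);
        return true;
    }

    long getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }

    // Fraction of the input consumed so far, between 0 and 1.
    float getProgress() {
        return data.isEmpty() ? 1.0f : (float) charsRead / data.length();
    }

    void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }

    // Driver loop: initialize, iterate while nextKeyValue() is true, then close.
    static List<String> readAll(String data) {
        LineCursor reader = new LineCursor(data);
        List<String> records = new ArrayList<>();
        try {
            reader.initialize();
            try {
                while (reader.nextKeyValue()) {
                    records.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
                }
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Calling `LineCursor.readAll("a\nbb\nccc")` produces three records keyed 0, 1, and 2. In a real RecordReader the same loop shape applies, with the key and value types being Writable classes of your choosing.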

Q: When writing a custom InputFormat, how do you keep a file from being split when it is read?

A:

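Override the isSplitable method and return false. Each file is then read as a single split, so the whole file goes to one map task.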
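The effect of returning false can be illustrated with a simplified split planner in plain Java. This is a sketch under stated assumptions, not Hadoop's implementation: `SplitPlanner` and `planSplits` are hypothetical names, the file is assumed to be non-empty, and the real FileInputFormat additionally considers block locations and lets the final split run slightly over the split size.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of how split planning depends on the splitable flag
// (hypothetical helper, not part of Hadoop).
class SplitPlanner {
    // Returns {offset, length} pairs covering a file of fileLength bytes.
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splitable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength and splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        if (!splitable) {
            // isSplitable returned false: one split spans the entire file.
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        // Splitable: cut the file into consecutive chunks of at most splitSize bytes.
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return splits;
    }
}
```

For a 300-byte file and a 128-byte split size, the splitable case yields three splits and the non-splitable case yields one split of 300 bytes. Since each split is handled by one map task, the flag directly controls how many mappers read the file.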

Q: If you do not define a custom Mapper or Reducer, what runs by default?

A:

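The default Mapper reads each record and emits its (key, value) pair unchanged. The default Reducer receives the data, iterates over the values for each key, and emits (key, value) for each one. In effect, the data is read into (key, value) pairs according to the InputFormat and then written out as-is.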
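The default behavior described above can be modeled in a few lines of plain Java. This is only an illustration, not Hadoop code: `IdentityJob` and `run` are hypothetical names, keys are assumed to be Long and values String, and a TreeMap stands in for the shuffle, which groups values by key and orders the keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a job that uses the default (identity) Mapper and Reducer.
class IdentityJob {
    static List<Map.Entry<Long, String>> run(List<Map.Entry<Long, String>> input) {
        // Map phase: the default mapper emits every (key, value) unchanged.
        List<Map.Entry<Long, String>> mapped = new ArrayList<>();
        for (Map.Entry<Long, String> rec : input) {
            mapped.add(new SimpleEntry<>(rec.getKey(), rec.getValue()));
        }

        // Shuffle (modeled): group values by key, with keys in sorted order.
        TreeMap<Long, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<Long, String> rec : mapped) {
            grouped.computeIfAbsent(rec.getKey(), k -> new ArrayList<>()).add(rec.getValue());
        }

        // Reduce phase: the default reducer walks the values and emits (key, value) for each.
        // Values keep insertion order here; a real job does not promise a value order.
        List<Map.Entry<Long, String>> output = new ArrayList<>();
        for (Map.Entry<Long, List<String>> group : grouped.entrySet()) {
            for (String value : group.getValue()) {
                output.add(new SimpleEntry<>(group.getKey(), value));
            }
        }
        return output;
    }
}
```

Every input record comes out exactly once with its key and value intact. The only visible change in this model is the ordering by key introduced by the grouping step.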

Did today's single point click for you? One point a day: five minutes, one takeaway. Have you checked in today?

