Read Excel files within of DataStage – using Java Stages – Part 1

With DataStage older than version 9.1, it was not possible to read Excel doucments (maybe, when using a Windows Server and creating an ODBC interface for teh Excel). But business folks, like the usage of Excel to collect their data (Excel is a good and powerful tool, but if it the right choice to collect data?), and with that, it is very common in a data migration project that you get hundreds of Excel to process in your ETL tools.

When not using DataStage 9.1, maybe the usage of a Java stage could help you

Because with the Java API Apache POI, it is easy to read Excel files (97 and 2007 format).

This code part, takes care about a simple implementation for a wrapper, so that POI can be used very simple by a Java stage.

The requirements that should be met:

  • Filename
  • Sheet
  • Columns
  • Startrow

The first three points, will be handled by the wrapper. This can be called with any filename, any sheet and any defined columns list. The start row, will be considered by the Java stage implementation.

package de.kissoft.datastage.excel;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.SpreadsheetVersion;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.FormulaError;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import de.kissoft.datastage.exceptions.RejectException;

/**
 * A simple Apache POI wrapper.
 * <br/><br/>
 * Usage:<br/>
 * ExcelReader reader = new ExcelReader("Daten.xls", "Sheet1", "A,B");
 * <br/>List<String> zeile1 = reader.getRow(1);
 * <br/><br/>
 * Apache POI starts the rows with "0", this is hidden by this program.
 * For the user it makes more sense to start with "1".
 * 
 * @author Michael Schmut
 */
public class ExcelReader {
  private int[] columns;
  private Sheet sheet;
  private SpreadsheetVersion version;

  
  /**
   * Call the Excel wrapper with a string a file reference.
   * 
   * @param file File name to open
   * @param sheetName target Excel sheet
   * @param columnList comma separared list of column letters 
   * @throws Exception If the file doesn't exist, or the column 
   *    letters are out of range.
   */
  public ExcelReader(String file, String sheetName, String columnList) 
      throws Exception {
    this(new File(file), sheetName, columnList);
  }
  
  /**
   * Call Excel wrapper with a file pointer.
   * 
   * @param file Excel to open
   * @param sheetName target Excel sheet
   * @param columnList comma separared list of column letters 
   * @throws Exception If the file doesn't exist, or the column 
   *    letters are out of range.
   */
  public ExcelReader(File excelFile, String sheetName, String columnList) 
      throws Exception {
    Workbook workbook = openWorkbook(excelFile);

    sheet = workbook.getSheet(sheetName);
    if (sheet == null) {
      throw new Exception("The given sheet " +
          "doesn't exist in this Excel file.");
    }

    setColumnLetters(columnList);
  }

  /**
   * Read the workbook
   * 
   * @param excelFile Excel file reference
   * @return When the file was opened succesful, the workbook
   * @throws Exception When the file doesn't exist, or it is no
   *     Excel file.
   */
  private Workbook openWorkbook(File excelFile) throws Exception {
    if (!excelFile.exists()) {
      throw new Exception("The given file '"+excelFile.getName()+
          "' does not exist.");
    }  if (excelFile.isDirectory()) {
      throw new Exception("The given name '"+excelFile.getName()+
          "' is a folder, not a file.");
    }
    
    Workbook workbook = null;
    try {
      InputStream inStream = new FileInputStream(excelFile);
      
      // Here a simple check, if it is a Office 2007 format,
      // this check is done by the file name "XLSX"
      if (excelFile.getName().toUpperCase().endsWith(".XLSX")) {
        workbook = new XSSFWorkbook(inStream);
        setExcelVersion(SpreadsheetVersion.EXCEL2007);
      } else {
        workbook = new HSSFWorkbook(inStream);
        setExcelVersion(SpreadsheetVersion.EXCEL97);
      }
    } catch (Exception e) {
      throw new Exception("An error occured, while opening " +
          "the Excel document.", e);
    }
    return workbook;
  }
  
  /**
   * Set the version of Excel
   * 
   * @param The used Excel version (2007 or 97)
   */
  private void setExcelVersion(SpreadsheetVersion version) {
  	this.version = version;
  }
  
  /**
   * Return the used Excel version
   * 
   * @return The Excel version
   */
  private SpreadsheetVersion getExcelVersion() {
  	return version;
  }
  
  /**
   * Split the list of column letters
   * 
   * @param columnList comma separated list of column letters
   * @throws Exception When the column letters are out of range
   */
  private void setColumnLetters(String columnList) throws Exception {
    String[] columnLetters = columnList.toUpperCase().split(",");
    
    columns = new int[columnLetters.length];
    
    for (int i = 0; i < columnLetters.length; i++) {
      if (!CellReference.isColumnWithnRange(columnLetters[i], getExcelVersion())) {
        throw new Exception("The given column '"+columnLetters[i]+
            "' is out of range - the last possible column is '"+
            getExcelVersion().getLastColumnName()+"'.");
      }
      columns[i] = CellReference.convertColStringToIndex(columnLetters[i]);
    }
  }

  /**
   * Detect the number of rows in the Excel.
   * 
   * 
   * @return Number of rows in the sheet
   */
  public int getRowCount() {
    int rowCount = sheet.getLastRowNum();
    if (rowCount >= 0)
      rowCount++;
    return rowCount;
  }
  
  /**
   * Return the row with the values, defined by the column list
   * 
   * 
   * @param rowNum Row number to extract
   * @return Values of the result columns
   * @throws RejectException When the row contains error values
   */
  public List<String> getRow(int rowNum) throws RejectException {
    rowNum--;
    List<String> result = new ArrayList<String>();
    
    Row row = sheet.getRow(rowNum);
    
    for (int column : columns) {
      Cell cell = row.getCell(column);
      
      if (cell == null) {
        result.add("");
        continue;
      }
      
      int cellType = cell.getCellType();
      
      if (cellType == Cell.CELL_TYPE_ERROR) {
        throw new RejectException(CellReference.convertNumToColString(column), 
            ""+rowNum, 
            FormulaError.forInt(cell.getErrorCellValue()).getString(), 
            "The row '"+(rowNum+1)+"' contains an error".");
      }
      
      cell.setCellType(Cell.CELL_TYPE_STRING);
      result.add(cell.getStringCellValue());
      
    }
    
    return result;
  }
}

When calling the constructor, the three points, defined as requirements, a directly handled. The wrapper checks, if the does exist, and if everything is ok, with the sheetname and the column list.
For the Java Stage the method getRowCount() is needed, to get the full set of rows, that the Java stage can inform DataStage how many records to extract (or when the last record was extracted).
In the method getRow() finally the values are extracted. All values are mapped to a string value (see this code snippet as an entry point, you need to think about, what are your pain points, which you need to check), if a cell contains an error, the complete row is rejected (also here, think about what rules your business need to follow).

For the completeness here the RejectException

package de.kissoft.datastage.exceptions;

/**
 * Exception for rejected rows.
 * 
 * @author Michael Schmut
 */
public class RejectException extends Exception {
  private static final long serialVersionUID = -6407737496257508581L;

  /**
   * Create a rejected row
   * @param column The column letter for the rejected column
   * @param row The row number
   * @param value The error value
   * @param message A message for the user
   */
  public RejectException(String column, String row, String value, 
      String message) {
    super(message);
  }
}

At the end, a Gradle build script, which can directly be used to build the stuff.

apply plugin: 'java'
apply plugin: 'eclipse'

repositories {
	mavenCentral()
}

configurations {
    poi
    poiooxml
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.8.+'
    poi 'org.apache.poi:poi:3.9'
    poiooxml 'org.apache.poi:poi-ooxml:3.9'
    compile configurations.poi
    compile configurations.poiooxml
}

eclipse {
  project {
    name = 'ExcelStage'
  }

  classpath {
    plusConfigurations += configurations.poi
    plusConfigurations += configurations.poiooxml

    defaultOutputDir = compileJava.destinationDir
  }
}

In this case the testCompile is newly added, which would automatically execute test cases, if there are some defined in folder src/test/java.

Feel free to use this code example – but on your own risk.

In the next step a Java Stage class will be created

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert.