Agile Data Science - Extracting features with PySpark

This section shows how PySpark is used to extract features in Agile Data Science.

Overview of Spark

Apache Spark is a fast, general-purpose cluster computing framework. It keeps data in memory while computing, which makes it suitable for near-real-time analysis. Spark handles both stream processing and batch processing, and it supports interactive queries and iterative algorithms.

Spark is written in the Scala programming language.

PySpark is the Python API for Spark. It provides the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. Most data scientists use PySpark for extracting features, as discussed in the previous section.

In this example, we focus on the transformations that build a dataset called counts and then save it to a file.

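# sc is the SparkContext that the PySpark shell creates automatically
text_file = sc.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split(" "))  # split each line into words
   .map(lambda word: (word, 1))                            # pair every word with a count of 1
   .reduceByKey(lambda a, b: a + b))                       # sum the counts for each word
counts.saveAsTextFile("hdfs://...")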
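
To make the three transformations easier to follow, the sketch below mimics them in plain Python on a small in-memory list, so it runs without a Spark cluster. The helper names flat_map and reduce_by_key and the sample lines are hypothetical and only for illustration; they are not part of the PySpark API, and real RDDs evaluate lazily and in parallel, which this sketch does not.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the results, similar to RDD.flatMap."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values of each key with func, similar to RDD.reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical sample input standing in for the lines of the HDFS file.
lines = ["spark makes big data simple", "big data needs spark"]

words = flat_map(lambda line: line.split(" "), lines)  # one flat list of words
pairs = [(word, 1) for word in words]                  # same role as .map(...)
counts = reduce_by_key(lambda a, b: a + b, pairs)      # word -> total count

print(counts)
```

Each word ends up as a key whose value is the number of times it occurs in the input, which is the shape of the counts dataset built by the Spark version.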

Using PySpark, a user can work with RDDs (Resilient Distributed Datasets) in the Python programming language. The Py4J library, which ships with PySpark, makes this possible by letting Python code communicate with the JVM-based Spark core.
