Bops is a numpy-based analysis module for manipulating, grouping and filtering data from various sources. It also provides map-reduce functionality. Some datasets do need distributed map-reduce jobs, but in the author’s view most do not. Bops offers flexible data grouping without sacrificing speed or simplicity.
Bops is tightly integrated with numpy to produce a very fast analysis package. The module has one main class for data manipulation, called a ‘bop’. Bops was initially named for ‘boolean operations’. The module has been greatly expanded to include map-reduce and data grouping on top of the initial filtering capability.
Bops expects a two-dimensional data structure for initialization, along with the attributes of the data (i.e., the column names). Once the data is contained in a ‘bop’, it can be filtered (‘select’ function), grouped on multiple columns (‘groupby’ function) and sorted (‘orderby’ function).
The ‘select’ function allows you to filter on multiple aspects of the data by manipulating numpy boolean arrays. The ‘groupby’ function can group data on similar attributes. However, unlike the ‘GROUP BY’ functions found in SQL, bops’ ‘groupby’ function returns the data found in the group along with the group’s unique identifiers. On top of these functions, bops also contains a sort function, called ‘orderby’, which allows the programmer to order the data on multiple columns.
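The ideas behind these three operations can be sketched in plain numpy. This is only an illustration of the underlying techniques (boolean masks, grouping with the group key kept alongside its rows, and multi-column ordering). It does not use the bops API, and the sample data and variable names are made up for the example.

```python
import numpy as np

# Hypothetical sample data: columns are (age, gender code, score).
data = np.array([
    [34, 0, 88.0],
    [45, 1, 92.5],
    [31, 1, 70.0],
    [47, 0, 65.5],
    [38, 1, 81.0],
])
age, gender, score = data[:, 0], data[:, 1], data[:, 2]

# Filtering: combine boolean arrays, then index the rows with the mask.
mask = (age < 40) & (score > 75)
selected = data[mask]

# Grouping: find the unique key values, then collect each group's rows
# together with the key that identifies the group.
keys, inverse = np.unique(gender, return_inverse=True)
groups = [(key, data[inverse == i]) for i, key in enumerate(keys)]

# Ordering on multiple columns: np.lexsort treats its LAST key as primary,
# so this sorts by gender first and by age within each gender.
order = np.lexsort((age, gender))
ordered = data[order]
```

Unlike a SQL ‘GROUP BY’, each entry in `groups` here keeps the full rows of the group next to its key, which is the behaviour the paragraph above describes for ‘groupby’.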
These functions are most useful in data analysis when grouping data on multiple attributes and then returning the results for further manipulation. This strength is magnified by the added map-reduce functionality. The map function allows a programmer to aggregate data based on custom logic. A simple example would be grouping by gender and decade of age (30s, 40s, …) for every row in a dataset. A reduce function would then be run on each group found by the map function. Using the same example, one could use the built-in len or sum functions in Python as the reduce function to generate histograms of gender and age groups.
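The gender and decade example can be sketched in pure Python as follows. This is a minimal illustration of the map-then-reduce pattern, not the bops ‘mapreduce’ function itself. The helper names `map_key` and `map_reduce` and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows of (gender, age); not taken from any real dataset.
rows = [("F", 34), ("M", 45), ("F", 31), ("M", 47), ("F", 38), ("M", 52)]


def map_key(row):
    """Map step: derive a group key of (gender, decade of age)."""
    gender, age = row
    return (gender, (age // 10) * 10)


def map_reduce(rows, mapper, reducer):
    """Group rows by mapper(row), then apply reducer to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[mapper(row)].append(row)
    return {key: reducer(group) for key, group in groups.items()}


# Using the built-in len as the reduce step gives a count per group,
# i.e. a histogram over (gender, decade).
histogram = map_reduce(rows, map_key, len)
```

Swapping `len` for another reducer, such as a function that sums the ages in each group, changes the aggregate without touching the grouping logic.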
The goal of Bops is to aid data analysis by giving the programmer capability and removing limitations.
Changes v0.4.1 - 0.5
- Added aliases:
  - float: np.float_
  - int: np.int_
  - bool: np.bool_
  - str: np.str_
  - unicode: np.unicode_
  - complex: np.complex_
- Changed the default expand option to True for the ‘mapreduce’ and ‘mapreducebatch’ functions. This matches the ‘groupby’ function, so that the expand option behaves consistently across the module.
A mailing list has been created to support the use of this module. You can join and follow the discussion on Google Groups. Any errors, issues and enhancements can be discussed there.
Bops aims to be a top-notch data analysis module, but it can only become great with your help. Please join the discussion. Your input is welcome, as are any suggested features, patches or fixes.
The module is now on Google Code: http://code.google.com/p/bops/
Issues, bugs and suggested enhancements can be submitted there.