apache_beam.dataframe.doctests module

A module that allows running existing pandas doctests with Beam dataframes.

This module hooks into the doctesting framework by providing a custom runner and, in particular, an OutputChecker, as well as providing a fake object for mocking out the pandas module.

The (novel) sequence of events when running a doctest is as follows.

  1. The test invokes pd.DataFrame(…) (or similar) and an actual dataframe is computed and stashed, but a Beam deferred dataframe is returned in its place.

  2. Computations are done on these “dataframes,” resulting in new objects, but as these are actually deferred, only expression trees are built. In the background, a mapping of id -> deferred dataframe is stored for each newly created dataframe.

  3. When any dataframe is printed out, the repr has been overwritten to print Dataframe[id]. The aforementioned mapping is used to map this back to the actual dataframe object, which is then computed via Beam, and the (stringified) result is plugged into the actual output for comparison.

  4. The comparison is then done on the sorted lines of the expected and actual values.
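
For example, this whole flow can be exercised end to end with teststring (documented below). A minimal sketch; the sample doctest text and the use_beam=False setting are illustrative, not prescribed:

  from apache_beam.dataframe import doctests

  SAMPLE = '''
  >>> s = pd.Series([1, 2, 3])
  >>> s.max()
  3
  '''

  # pd in the doctest globals is the fake pandas module, so the expected output
  # above is checked against the result computed by the Beam dataframe API.
  results = doctests.teststring(SAMPLE, use_beam=False)
  print(results.failed, results.attempted)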

class apache_beam.dataframe.doctests.FakePandasObject(pandas_obj, test_env)[source]

Bases: object

A stand-in for the wrapped pandas objects.

class apache_beam.dataframe.doctests.TestEnvironment[source]

Bases: object

A class managing the patching (of methods, inputs, and outputs) needed to run and validate tests.

These classes are patched so that inputs and results can be recognized and retrieved; they are stored in self._inputs and self._all_frames respectively.

fake_pandas_module()[source]
context()[source]

Creates a context within which DeferredBase types are monkey patched to record ids.
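
A rough sketch of how TestEnvironment ties into steps 1–3 of the module docstring; the exact repr text and attribute behaviour are as described there rather than guaranteed here:

  from apache_beam.dataframe.doctests import TestEnvironment

  env = TestEnvironment()
  pd = env.fake_pandas_module()  # FakePandasObject wrapping the real pandas

  with env.context():
      # Step 1: the real frame is computed and stashed; a deferred frame is returned.
      df = pd.DataFrame({'a': [1, 2, 3]})
      # Step 2: operations on the deferred frame only build an expression tree.
      doubled = df * 2
      # Step 3: inside the context, repr() prints an id-based placeholder that the
      # output checker later maps back to the computed result.
      print(repr(doubled))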

class apache_beam.dataframe.doctests.BeamDataframeDoctestRunner(env, use_beam=True, wont_implement_ok=None, not_implemented_ok=None, skip=None, **kwargs)[source]

Bases: DocTestRunner

A doctest runner suitable for replacing the pd module with one backed by Beam.

run(test, **kwargs)[source]
report_success(out, test, example, got)[source]
fake_pandas_module()[source]
summarize()[source]
summary()[source]
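
The runner can also be driven directly with the standard doctest machinery. A minimal sketch, assuming the test globals are seeded with the fake pandas module and the example is run inside the environment's context:

  import doctest

  from apache_beam.dataframe.doctests import (
      BeamDataframeDoctestRunner, TestEnvironment)

  SAMPLE = '>>> pd.Series([1, 2, 3]).sum()\n6\n'

  env = TestEnvironment()
  runner = BeamDataframeDoctestRunner(env, use_beam=False)
  test = doctest.DocTestParser().get_doctest(
      SAMPLE, {'pd': env.fake_pandas_module()}, 'sample', '<sample>', 0)

  with env.context():
      runner.run(test)
  print(runner.summarize())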
class apache_beam.dataframe.doctests.AugmentedTestResults(failed, attempted)[source]

Bases: TestResults

Create new instance of TestResults(failed, attempted)

class apache_beam.dataframe.doctests.Summary(failures=0, tries=0, skipped=0, error_reasons=None)[source]

Bases: object

result()[source]
summarize()[source]
apache_beam.dataframe.doctests.parse_rst_ipython_tests(rst, name, extraglobs=None, optionflags=None)[source]

Extracts examples from an rst file and produces a test suite by running them through pandas to get the expected outputs.

apache_beam.dataframe.doctests.test_rst_ipython(rst, name, report=False, wont_implement_ok=(), not_implemented_ok=(), skip=(), **kwargs)[source]

Extracts examples from an rst file, runs them through pandas to get the expected output, and then compares that against our dataframe implementation.

apache_beam.dataframe.doctests.teststring(text, wont_implement_ok=None, not_implemented_ok=None, **kwargs)[source]
apache_beam.dataframe.doctests.teststrings(texts, report=False, **runner_kwargs)[source]
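
A hedged sketch of teststrings; the assumption here (not stated above) is that texts maps a test name to its doctest text:

  from apache_beam.dataframe import doctests

  # Assumed input shape: {name: doctest_text}.
  doctests.teststrings(
      {
          'series_max': '>>> pd.Series([1, 2, 3]).max()\n3\n',
          'series_min': '>>> pd.Series([1, 2, 3]).min()\n1\n',
      },
      report=True,
      use_beam=False)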
apache_beam.dataframe.doctests.set_pandas_options()[source]
apache_beam.dataframe.doctests.with_run_patched_docstring(target=None)[source]
apache_beam.dataframe.doctests.testfile(*args, **kwargs)[source]

Run all pandas doctests in the specified file.

Arguments skip, wont_implement_ok, not_implemented_ok are all in the format:

{
   "module.Class.method": ['*'],
   "module.Class.other_method": [
     'instance.other_method(bad_input)',
     'observe_result_of_bad_input()',
   ],
}

‘*’ indicates that all examples should be matched; otherwise the value is a list of the specific input strings that should be matched.

All arguments are kwargs.

Parameters:
  • optionflags (int) – Passed through to doctests.

  • extraglobs (Dict[str,Any]) – Passed through to doctests.

  • use_beam (bool) – If true, run a Beam pipeline with partitioned input to verify the examples, else use PartitioningSession to simulate distributed execution.

  • skip (Dict[str,str]) – A set of examples to skip entirely. If a key is ‘*’, an example will be skipped in all test scenarios.

  • wont_implement_ok (Dict[str,str]) – A set of examples that are allowed to raise WontImplementError.

  • not_implemented_ok (Dict[str,str]) – A set of examples that are allowed to raise NotImplementedError.

Returns:

A doctest result describing the passed/failed tests.

Return type:

TestResults
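
For instance, a sketch of checking a local pandas source file; the path and the skip/wont_implement_ok entries are illustrative, and the positional argument is assumed to be forwarded to the underlying doctest file runner:

  from apache_beam.dataframe import doctests

  results = doctests.testfile(
      'pandas/core/series.py',  # illustrative path to a pandas source file
      use_beam=False,
      wont_implement_ok={'pandas.core.series.Series.memory_usage': ['*']},
      skip={'pandas.core.series.Series.to_markdown': ['*']})
  print(results.failed, results.attempted)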

apache_beam.dataframe.doctests.testmod(*args, **kwargs)[source]

Run all pandas doctests in the specified module.

Arguments skip, wont_implement_ok, not_implemented_ok are all in the format:

{
   "module.Class.method": ['*'],
   "module.Class.other_method": [
     'instance.other_method(bad_input)',
     'observe_result_of_bad_input()',
   ],
}

‘*’ indicates that all examples should be matched; otherwise the value is a list of the specific input strings that should be matched.

All arguments are kwargs.

Parameters:
  • optionflags (int) – Passed through to doctests.

  • extraglobs (Dict[str,Any]) – Passed through to doctests.

  • use_beam (bool) – If true, run a Beam pipeline with partitioned input to verify the examples, else use PartitioningSession to simulate distributed execution.

  • skip (Dict[str,str]) – A set of examples to skip entirely. If a key is ‘*’, an example will be skipped in all test scenarios.

  • wont_implement_ok (Dict[str,str]) – A set of examples that are allowed to raise WontImplementError.

  • not_implemented_ok (Dict[str,str]) – A set of examples that are allowed to raise NotImplementedError.

Returns:

A doctest result describing the passed/failed tests.

Return type:

TestResults
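
A minimal sketch of checking the doctests of a pandas module; the module choice and the skip/wont_implement_ok entries are illustrative:

  import pandas as pd

  from apache_beam.dataframe import doctests

  results = doctests.testmod(
      pd.core.series,
      use_beam=False,
      wont_implement_ok={'pandas.core.series.Series.memory_usage': ['*']},
      skip={'pandas.core.series.Series.to_markdown': ['*']})
  print(results.failed, results.attempted)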